# PySpark

## Configuration

The PySpark adapter accepts the following configurations:

* `alias`: the alias used to refer to the SparkSession within your modules. The default is `spark`.
* `loglevel`: the log level to use for the console. The default is `WARN`. Acceptable values can be found [here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.setLogLevel.html).
* Any `key, value` pair that can be used in the `config` method of the [`SparkSession.builder` class](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.builder.config.html). Some examples are shown below!

```yaml
# profile.yml

<profile name here>: # change this!
  adapters:
    <pyspark adapter name here>:  # change this!
      alias: spark
      loglevel: WARN
      config:
        spark.driver.cores:
        spark.driver.memory:
        spark.driver.memoryOverhead:
        spark.executor.cores:
        spark.executor.memory:
        spark.executor.memoryOverhead:
        spark.executor.instances:
        spark.task.cpus:
        spark.sql.broadcastTimeout:
        # Add additional config variables here!

```

Under the hood, prism takes care of parsing the configuration variables, constructing a SparkSession instance, and storing that instance on the `hooks` object under an attribute named after the `alias`.
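
Conceptually, the adapter behaves roughly like the sketch below (this is not prism's actual source, and the config values are illustrative): each `config` key/value pair is passed to `SparkSession.builder.config`, the session is created with `getOrCreate`, and the `loglevel` setting is applied to the underlying SparkContext.

```python
from pyspark.sql import SparkSession

# Key/value pairs taken from the `config` block in profile.yml (illustrative values)
config = {
    'spark.driver.memory': '4g',
    'spark.executor.memory': '4g',
    'spark.executor.instances': '2',
}

builder = SparkSession.builder
for key, value in config.items():
    builder = builder.config(key, value)

spark = builder.getOrCreate()            # this instance is exposed as hooks.<alias>
spark.sparkContext.setLogLevel('WARN')   # the `loglevel` configuration
```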

Users can access the SparkSession directly using `hooks.{alias}`.

```python
def run(self, tasks, hooks):
    # Assuming the alias is 'spark' in profile.yml
    df = hooks.spark.read.parquet('data_path/')
    return df
```
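
The object stored under the alias is an ordinary `SparkSession`, so the full PySpark API is available inside a task. Below is a small sketch, assuming the alias is `spark` and a hypothetical parquet dataset at `data_path/` that contains a `date` column:

```python
def run(self, tasks, hooks):
    # hooks.spark is a regular SparkSession, so any PySpark API works here
    df = hooks.spark.read.parquet('data_path/')   # hypothetical input path

    # Register the DataFrame as a temporary view and query it with Spark SQL
    df.createOrReplaceTempView('events')
    daily_counts = hooks.spark.sql(
        'SELECT date, COUNT(*) AS n FROM events GROUP BY date'
    )

    return daily_counts
```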
