PySpark
Configuration
The PySpark adapter accepts the following configurations:
- `alias`: the alias used to refer to the SparkSession within the modules. The default is `spark`.
- `loglevel`: the log level to use for the console. The default value is `WARN`. Acceptable values are listed in the Spark documentation.
- Any `key, value` pair that can be used in the `config` method of the `SparkSession.builder` class. Some examples are shown below.
```yaml
# profile.yml
<profile name here>: # change this!
  adapters:
    <pyspark adapter name here>:  # change this!
      alias: spark
      loglevel: WARN
      config:
        spark.driver.cores:
        spark.driver.memory:
        spark.driver.memoryOverhead:
        spark.executor.cores:
        spark.executor.memory:
        spark.executor.memoryOverhead:
        spark.executor.instances:
        spark.task.cpus:
        spark.sql.broadcastTimeout:
        # Add additional config variables here!
```
Under the hood, prism takes care of parsing the configuration variables, constructing a SparkSession instance, and storing the instance in an attribute named after the alias within the hooks object.
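For intuition, the snippet below sketches what this amounts to in plain PySpark. It is not prism's actual implementation, and the config values shown are placeholders rather than defaults:

```python
from pyspark.sql import SparkSession

# Rough equivalent of the adapter's work: apply each key/value from the
# `config` block, build the session, and set the console log level.
# The values below are placeholders, not recommended settings.
spark = (
    SparkSession.builder
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.instances", "2")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("WARN")  # corresponds to the `loglevel` setting
```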
Users can access the SparkSession directly using `hooks.{alias}`.
```python
def run(self, tasks, hooks):
    # Assuming the alias is 'spark' in profile.yml
    df = hooks.spark.read.parquet('data_path/')
```
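The stored session behaves like any other SparkSession, so the full PySpark API is available through the hook. The sketch below extends the example above; the path, view name, and query are placeholders, and returning the DataFrame from `run` is shown for illustration only:

```python
def run(self, tasks, hooks):
    # Assuming the alias is 'spark' in profile.yml.
    # 'data_path/' and 'events' are placeholder names.
    df = hooks.spark.read.parquet('data_path/')
    df.createOrReplaceTempView('events')
    counts = hooks.spark.sql('SELECT COUNT(*) AS n FROM events')
    return counts  # illustrative only
```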
