PySpark

Configuration

The PySpark adapter accepts the following configurations:

  • alias: the alias used to refer to the SparkSession within the modules. The default is spark.

  • loglevel: the log level to use for the console. The default value is WARN. Acceptable values are the standard Spark log levels (ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN).

  • Any key-value pair accepted by the config method of the SparkSession.builder class. Some examples are shown below.

# profile.yml

<profile name here>: # change this!
  adapters:
    <pyspark adapter name here>:  # change this!
      alias: spark
      loglevel: WARN
      config:
        spark.driver.cores:
        spark.driver.memory:
        spark.driver.memoryOverhead:
        spark.executor.cores:
        spark.executor.memory:
        spark.executor.memoryOverhead:
        spark.executor.instances:
        spark.task.cpus:
        spark.sql.broadcastTimeout:
        # Add additional config variables here!
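Each key-value pair under config is handed to the config method of SparkSession.builder. As a point of reference, the block above corresponds roughly to the following PySpark setup; this is a sketch with illustrative values, not prism's actual internals:

# Rough PySpark equivalent of the profile.yml block above (illustrative values)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.cores", "2")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.instances", "2")
    .config("spark.sql.broadcastTimeout", "600")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("WARN")  # mirrors the loglevel setting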

Under the hood, prism takes care of parsing the configuration variables, constructing a SparkSession instance, and storing the instance in an attribute named after the alias within the hooks object.

Users can access the SparkSession directly using hooks.{alias}.

def run(self, tasks, hooks):
    # Assuming the alias is 'spark' in profile.yml
    df = hooks.spark.read.parquet('data_path/')
    return df
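
Because the stored object is a normal SparkSession, the full PySpark API is available through hooks.{alias}. For example, the data could also be registered as a temporary view and queried with Spark SQL; the view name and query below are purely illustrative:

def run(self, tasks, hooks):
    # Read the parquet data and expose it to Spark SQL
    df = hooks.spark.read.parquet('data_path/')
    df.createOrReplaceTempView('events')  # 'events' is an illustrative view name
    # Run an aggregate query against the view
    summary = hooks.spark.sql('SELECT COUNT(*) AS n_rows FROM events')
    return summary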