PySpark

Configuration

The PySpark configurations are:

  • alias: the alias used to refer to the SparkSession within the modules. The default is spark.

  • loglevel: the log level to use for the console. The default value is WARN. Acceptable values are the standard Spark/Log4j log levels (ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN).

  • Any key, value pair that can be used in the config method of the SparkSession.builder class. Some examples are shown below!

# profile.yml

<profile name here>: # change this!
  adapters:
    <pyspark adapter name here>:  # change this!
      alias: spark
      loglevel: WARN
      config:
        spark.driver.cores:
        spark.driver.memory:
        spark.driver.memoryOverhead:
        spark.executor.cores:
        spark.executor.memory:
        spark.executor.memoryOverhead:
        spark.executor.instances:
        spark.task.cpus:
        spark.sql.broadcastTimeout:
        # Add additional config variables here!

Under the hood, prism takes care of parsing the configuration variables, constructing a SparkSession instance, and storing the instance in an attribute named after the alias within the hooks object.
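
For illustration, here is a rough sketch of how such a configuration maps onto the SparkSession.builder API. The app name and config values below are placeholders for illustration, not prism's actual internals.

# Illustrative sketch only; not prism's actual internals
from pyspark.sql import SparkSession

config = {
    "spark.driver.memory": "4g",      # example value
    "spark.executor.memory": "4g",    # example value
}

builder = SparkSession.builder.appName("prism")  # app name is a placeholder
for key, value in config.items():
    builder = builder.config(key, value)

spark = builder.getOrCreate()            # stored as hooks.<alias>
spark.sparkContext.setLogLevel("WARN")   # the `loglevel` configuration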

Users can access the SparkSession directly using hooks.{alias}.

def run(self, tasks, hooks):
    # Assuming the adapter alias is 'spark' in profile.yml
    df = hooks.spark.read.parquet('data_path/')
    return df
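
Putting it together, the following is a hedged sketch of a complete task that reads data through the PySpark adapter. The class name, column, and paths are placeholders; only prism.task.PrismTask and the run(self, tasks, hooks) signature come from the API reference.

# tasks/filter_active_customers.py (hypothetical file)
import prism.task


class FilterActiveCustomers(prism.task.PrismTask):

    def run(self, tasks, hooks):
        # Assuming the adapter alias is 'spark' in profile.yml
        df = hooks.spark.read.parquet('data_path/')    # hypothetical input path
        return df.filter(df['is_active'] == True)      # return the task's output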