PySpark
Configuration
The PySpark configurations are:
- `alias`: the alias used to refer to the SparkSession within the modules. The default is `spark`.
- `loglevel`: the log level to use for the console. The default value is `WARN`. Acceptable values can be found here.
- Any key, value pair that can be used in the `config` method of the `SparkSession.builder` class. Some examples are shown below!
# profile.yml
<profile name here>: # change this!
  adapters:
    <pyspark adapter name here>: # change this!
      alias: spark
      loglevel: WARN
      config:
        spark.driver.cores:
        spark.driver.memory:
        spark.driver.memoryOverhead:
        spark.executor.cores:
        spark.executor.memory:
        spark.executor.memoryOverhead:
        spark.executor.instances:
        spark.task.cpus:
        spark.sql.broadcastTimeout:
        # Add additional config variables here!
Under the hood, prism takes care of parsing the configuration variables, constructing a SparkSession instance, and storing the instance in an attribute of the `hooks` object named after the configured `alias`.
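For illustration, here is a minimal sketch of what that construction might look like, assuming the parsed adapter configuration is available as a plain dict and that each key/value pair is passed straight through to `SparkSession.builder.config`. The variable names and config values are hypothetical; this is not prism's actual source.

from pyspark.sql import SparkSession

# Hypothetical parsed adapter configuration (values are placeholders)
adapter_config = {
    "spark.driver.memory": "4g",
    "spark.executor.memory": "8g",
}

builder = SparkSession.builder
for key, value in adapter_config.items():
    # Every key/value pair under `config` maps to a builder.config call
    builder = builder.config(key, value)

spark = builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")  # the `loglevel` setting
# prism then exposes the session as hooks.<alias>, e.g. hooks.spark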
Users can access the SparkSession directly using `hooks.{alias}`:
def run(self, tasks, hooks):
    # Assuming the alias is 'spark' in profile.yml
    df = hooks.spark.read.parquet('data_path/')
    return df
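Because the object exposed on the hook is an ordinary SparkSession, the rest of the PySpark API is available on it as well. The following is a hedged example; the in-memory data, column names, and temp view name are purely illustrative and not part of prism.

def run(self, tasks, hooks):
    # Assuming the alias is 'spark' in profile.yml; the DataFrame below
    # is an illustrative in-memory example.
    df = hooks.spark.createDataFrame(
        [('a', 1), ('b', 2)],
        ['key', 'value'],
    )
    # Standard SparkSession features, such as SQL over temp views, work as usual
    df.createOrReplaceTempView('example')
    return hooks.spark.sql('SELECT key, SUM(value) AS total FROM example GROUP BY key')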