Introduction
Overview
Spark.jl provides an interface to the Apache Spark™ platform, including SQL / DataFrame and Structured Streaming. It closely follows the PySpark API, making it easy to translate existing Python code to Julia.
Spark.jl supports multiple cluster types (in client mode) and can be considered an analogue of PySpark or RSpark within the Julia ecosystem. It runs against on-premise installations as well as hosted instances such as Amazon EMR and Azure HDInsight.
Installation
Spark.jl requires JDK 8 or 11 and Maven to be installed and available in PATH. Install the package from the Julia package manager REPL:
] add Spark
To link against a specific version of Spark, also run:
ENV["BUILD_SPARK_VERSION"] = "3.2.1" # version you need
] build Spark
Quick Example
Note that most types in Spark.jl support dot notation for calling functions, e.g. x.foo(y) is expanded into foo(x, y).
using Spark
spark = SparkSession.builder.appName("Main").master("local").getOrCreate()
df = spark.createDataFrame([["Alice", 19], ["Bob", 23]], "name string, age long")
rows = df.select(Column("age") + 1).collect()
for row in rows
    println(row[1])
end
20
24
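Since dot notation is only syntactic sugar, the last step can equally be written in plain function style. A minimal sketch relying solely on the x.foo(y) == foo(x, y) expansion described above:

# dot notation, as used in the example
rows = df.select(Column("age") + 1).collect()

# equivalent function-call form after expansion
rows = collect(select(df, Column("age") + 1))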
Cluster Types
This package supports multiple cluster types (in client mode): local, standalone, mesos and yarn. The location of the cluster (in the case of mesos or standalone) or the cluster type (in the case of local or yarn) must be passed in the master parameter when creating a Spark context. For YARN-based clusters, the cluster parameters are picked up from spark-defaults.conf, which must be accessible via the SPARK_HOME environment variable.
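For example, following the builder API from the quick example above, the master parameter selects the cluster. The host names and path below are hypothetical, and a real program would create just one session:

using Spark

# local: the cluster type is passed directly
spark = SparkSession.builder.appName("Main").master("local").getOrCreate()

# standalone: pass the location of the standalone master
spark = SparkSession.builder.appName("Main").master("spark://master-host:7077").getOrCreate()

# mesos: similarly, pass the location of the Mesos master
spark = SparkSession.builder.appName("Main").master("mesos://mesos-host:5050").getOrCreate()

# yarn: cluster parameters are read from spark-defaults.conf, which must be
# reachable via the SPARK_HOME environment variable, e.g.
# ENV["SPARK_HOME"] = "/usr/lib/spark"  # hypothetical path, usually exported before starting Julia
spark = SparkSession.builder.appName("Main").master("yarn").getOrCreate()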
Current Limitations
- Jobs can be submitted from a Julia process attached to the cluster in client deploy mode. cluster deploy mode is not fully supported, and it is uncertain whether it would be useful in the Julia context.
- Since records are serialised between Java and Julia at the edges, the maximum size of a single row in an RDD is 2GB, due to Java array indices being limited to 32 bits.
Trademarks
Apache®, Apache Spark and Spark are registered trademarks, or trademarks of the Apache Software Foundation in the United States and/or other countries.