RFC: PySpark-like, DataFrame-only API #104

Merged · 46 commits merged into main · Jun 4, 2022

Conversation

@dfdx (Owner) commented May 16, 2022

I'm working on a nearly complete rewrite of Spark.jl and want to get some feedback before releasing it. A summary of the changes:

  • PySpark-compatible API, including dot-chaining syntax, types and function names
  • SQL and Streaming API
  • No RDD API and thus no support for running custom Julia code on a cluster
  • Experimental support for UDFs via compilation to Java
  • One big breaking change

Why

In its current state, Spark.jl is terribly outdated and pretty unstable. A lot of effort is needed to keep the old RDD interface working,
but almost nobody uses that API nowadays. At the same time, the DataFrame API in main is limited to just a few core methods and is hardly suitable for any real project.

PySpark-compatible API

There are thousands of tutorials, answers and discussions for PySpark out there, as well as hundreds of Pythonists looking for a familiar API in Julia land. So let's just reuse these materials. Examples of the new API:

using Spark.SQL

spark = SparkSession.builder.
        appName("Hello").
        master("local").
        config("some.key", "some-value").
        getOrCreate()

df = spark.read.text("README.md")
df.value.contains("Julia")
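
For illustration, a hedged sketch of how further chaining could look, assuming the usual PySpark method names (filter, count, select, alias, show) carry over as this RFC intends; none of these specific calls are confirmed above:

df.filter(df.value.contains("Julia")).count()
df.select(df.value.alias("line")).show()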

SQL and Streaming API

In this initial effort I'm going to implement >= 50% of the SQL and Structured Streaming API, including all the core data types and enough functions to copy-paste most Python tutorials. The long tail of remaining functions is time-consuming to add, but otherwise straightforward.
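
As a rough, hedged sketch (not from the RFC text itself) of what the Structured Streaming surface might look like if PySpark's readStream/writeStream names carry over:

stream = spark.readStream.
         format("socket").
         option("host", "localhost").
         option("port", "9999").
         load()

query = stream.writeStream.format("console").start()
query.awaitTermination()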

No RDD

I'm going to discontinue support for the RDD API, including the JuliaRDD that we used to run custom Julia code on a cluster. The communication protocol between the JVM and Julia in that implementation is terribly outdated, but most importantly, we haven't found a way to reliably manage the state of Julia workers across the variety of possible runtimes. See this and this for more details.

UDFs via compilation to Java

As described in these issues, the best alternative for running complex Julia scripts on a cluster is Kubernetes. However, to support simple data transformations, the new API also features a Julia-to-Java compiler. Example:

f = s -> lowercase(s)
f_udf = udf(f, "Hi!")
r = jcall2(f_udf.judf, "call", JString, (JString,), "Big Buddha Boom!")
@test convert(String, r) == f("Big Buddha Boom!")

f_udf.judf is a fully functional Java object implementing the UDF1 interface; it can be passed to any matching Java method without a Julia binary installed on the cluster.

At a lower level, one can also compile arbitrary Java classes from a string, including a class that starts a Julia server (e.g. via juliacaller). But managing the lifecycle of such a server is up to the user and out of scope for this initial effort.

One big breaking change

Along with RDD support, we lose DataFrames.jl compatibility and other contributions. I apologize for this. PRs to bring them back are highly welcome.


I'm currently close to finishing the SQL interface and am looking forward to tackling the Structured Streaming API. I plan to finish it in 2 weeks or less and then tag the new version. Comments are highly welcome.

# and propagate getproperty to the returned object
return getproperty(dc(), prop)
end
end
Contributor:

wow, good magic, I had no idea this was possible in Julia :)
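
For readers curious about the trick, here is a minimal, self-contained sketch (not the actual Spark.jl implementation) of how overloading Base.getproperty enables PySpark-style dot-chaining in Julia:

struct Builder
    opts::Dict{String,String}
end

# Any unknown property becomes a setter that records the value and
# returns the builder again, so calls can be chained PySpark-style.
function Base.getproperty(b::Builder, prop::Symbol)
    prop === :opts && return getfield(b, :opts)
    return value -> (getfield(b, :opts)[String(prop)] = string(value); b)
end

b = Builder(Dict{String,String}())
b.appName("Hello").master("local").opts   # Dict("appName" => "Hello", "master" => "local")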

@exyi (Contributor) left a comment:

Great work, I really like this full-on switch to DataFrames! I left a few comments regarding the code, nothing too important... A few more general comments:

RDD Drop

I don't think anyone will miss RDDs :D However, it's a bit sad to see the "just run this Julia function in parallel" functionality go. Well, it never worked for my use cases because handling libraries was a pain, so never mind.

I still think it would be nice to have a way to run Julia code on a Spark cluster. I understand that you are more of a fan of Kubernetes, but Spark clusters are often already available in institutions, and I find them less of a PITA to run than Kubernetes 😅. General distributed computing might not be a great fit for Spark, but more often than not the use case is simple: just run this simulation (or whatever) on 1000 different inputs (or 1000 different sets of parameters), for which the Spark API is very nice to use.

I was wondering whether PackageCompiler.jl could help us produce a self-contained executable which we could then just send to any executor from the master (possibly even without Julia installed on the executors?). I'm not that Julia-savvy, so I have no idea if it's easily possible, but it seems like an option to me. Never mind, this is off-topic again 🙃

Java UDFs

I think that directly exposing Java or Scala to users would solve many problems. I'd be quite happy if I could just write

scalaUdf("some_function", "(x: String, i: Integer) => x(i) ")

Or something similar for people who prefer Java. I'd honestly prefer that to the automagic that compiles Julia to Java, but I can see that some people don't want to write even a single line of Java 😂

Arrow.jl / DataFrames.jl

I'd really like to have this one back; I'll happily contribute the Arrow interop again if you'd like. I think it would be much preferable to get a Julia-native DataFrame from .collect instead of a Vector{Row}. Similarly, calling createDataFrame(...) with a Julia Table would be nicer than having to create the data Row by Row. I guess we could even quite easily infer the schema from the provided Julia DataFrame. It'll just be a bit confusing which kind of DataFrame I'm working with; maybe naming this one SparkDataFrame would avoid some confusion?

jcall(jmap, "put", JObject, (JObject, JObject), jk, jv)
end
return jmap
end
Contributor:

What would you think about getting rid of these conversions and always using Arrow.jl instead? It should handle most cases out of the box and is significantly faster for larger collects.

Contributor:

I can bring it back from the PR I made last year, if you think it would be worth it. It would be nice to use it for everything, for consistency.

@dfdx (Owner, author):

These conversions are mostly needed to invoke methods from the Java/Scala API, not to pass large amounts of data between Julia and the JVM. For example, we need an instance of JSeq to construct a JRow/Row, which you'd often use for testing purposes, and JMap is only used to pass an aggregation config, i.e. something like Dict("clicks" => "sum", "impressions" => "count"), to GroupedData.agg().
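
For context, a hedged usage sketch of that aggregation config, assuming the PySpark-style agg(Dict) signature is what this conversion backs:

df.groupBy("country").
   agg(Dict("clicks" => "sum", "impressions" => "count"))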

src/dataframe.jl Outdated
###############################################################################

@chainable GroupedData
Base.show(io::IO, gdf::GroupedData) = print(io, "GroupedData()")
Contributor:

I think it would be a bit nicer to print the SQL schema, like GroupedData($(gdf.schema))

@dfdx (Owner, author):

Unfortunately, RelationalGroupedDataset (the underlying Java class) doesn't have a schema. Here are all the methods available on that object type:

$anonfun$agg$1
$anonfun$agg$2
$anonfun$agg$3
$anonfun$aggregateNumericColumns$1
$anonfun$aggregateNumericColumns$2
$anonfun$alias$1
$anonfun$flatMapGroupsInPandas$1
$anonfun$flatMapGroupsInPandas$2
$anonfun$flatMapGroupsInPandas$3
$anonfun$flatMapGroupsInPandas$4
$anonfun$flatMapGroupsInR$1
$anonfun$flatMapGroupsInR$2
$anonfun$flatMapGroupsInR$3
$anonfun$pivot$1
$anonfun$pivot$2
$anonfun$strToExpr$1
$anonfun$strToExpr$2
$anonfun$toDF$1
$anonfun$toDF$2
agg
apply
avg
count
equals
flatMapGroupsInPandas
flatMapGroupsInR
getClass
hashCode
max
mean
min
notify
notifyAll
pivot
sum
toString
wait

PySpark also doesn't provide a meaningful representation for GroupedData:

<pyspark.sql.group.GroupedData object at 0x7f66f80aec70>

@dfdx (Owner, author):

On the other hand, toString() gives a pretty intuitive representation:

RelationalGroupedDataset: [grouping expressions: [name: string], value: [name: string, age: bigint], type: GroupBy]

Contributor:

Cool, the Scala toString is nice :) I didn't know there was no schema field on RelationalGroupedDataset :|
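
A possible follow-up (just a sketch; the field name jgdf holding the wrapped Java object is an assumption, not confirmed in this thread) would be to delegate show to that toString:

Base.show(io::IO, gdf::GroupedData) =
    print(io, convert(String, jcall(gdf.jgdf, "toString", JString, ())))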


function Base.write(df::DataFrame)
jwriter = jcall(df.jdf, "write", JDataFrameWriter, ())
return DataFrameWriter(jwriter)
Contributor:

I don't think this is right; Base.write normally does something quite different: https://docs.julialang.org/en/v1/base/io-network/#Base.write. I don't know what the conventions are in Julia, but I think it would be better not to overload the built-in function with a method that has a very different signature.
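
One alternative along the lines of this comment (a sketch only, not the PR's decision) would be to keep the same body but define the accessor in the package's own namespace rather than extending Base.write:

# Hypothetical: Spark.SQL.write, deliberately not a method of Base.write
function write(df::DataFrame)
    jwriter = jcall(df.jdf, "write", JDataFrameWriter, ())
    return DataFrameWriter(jwriter)
end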

- if !startswith(version, "1.8")
-     @warn "Java 1.8 is recommended for Spark.jl, but Java $version was used."
+ if !startswith(version, "1.8") && !startswith(version, "11.")
+     @warn "Java 1.8 or 1.11 is recommended for Spark.jl, but Java $version was used."
Contributor:

👍

@dfdx (Owner, author) commented May 26, 2022

I think that directly exposing Java or Scala to users would solve many problems.

I just realized I've never mentioned it, and the code in compiler.jl is too obscure for a casual reader, but you can actually embed Java classes. Here's an example:

import Spark: create_instance, jcall2

src = """
    package julia.compiled;

    public class Hello {

        public String hello(String name) {
            return "Hello, " + name;
        }

        public void goodbye() {
            System.out.println("Goodbye!");
        }
    }
    """

obj = create_instance(src)     # compile the class and instantiate an object in one hop; see also create_class
jresult = jcall2(obj, "hello", JString, (JString,), "Bob")
result = convert(String, jresult)    # "Hello, Bob"

There are certain pitfalls with objects created this way, but from the JVM's perspective they are totally valid. As a particular example, the udf() function creates an object that implements one of Spark's UDFn interfaces and can be passed to any applicable Java method.

Compiling Scala should not be much harder, but Scala adds a lot of magic on top of JVM primitives and thus is generally harder to work with via JNI.


however, it's a bit sad to see the "just run this Julia function in parallel" functionality go

The problem with this definition is the uncertainty of "this Julia function". Is the function pure Julia, or does it have package dependencies? What version of Julia is required to run it, and is it already installed on the workers? How long does the function take to run compared to the overhead of launching Julia? And compared to transferring a sysimage to all workers? How much data does it read from the source and write to the sink?

Depending on the answers, the optimal solution may be to create a Julia server on a worker or launch Julia per batch, to trace and serialize the function or transfer the whole package, to use Arrow to pass data between processes or manipulate objects right in JVM memory, etc. The RDD implementation has always been faulty and simplistic; if we want to do better, we need to come up with very specific requirements for running Julia on Spark workers.

So until we have these requirements, we can:

  1. Create UDFs for simple cases.
  2. Run arbitrary code in Java, including code that launches Julia workers with properties tailored to a specific task.
  3. Run Julia in Docker on Kubernetes :)

I'd really like to have this one back, I'll happily contribute the Arrow interop again, if you'd like.

That would be really great! I'm sorry to drop all the cool contributions we've had so far, but hopefully we will get them added back for the new API too.

I think it would be much preferable to get a Julia-native DataFrame from .collect instead of the Vector{Row}.

One of the main design choices in the new API is to copy the Python API so that people can simply copy-paste the thousands of PySpark examples out there. If we change the signature of some methods, users will have to figure out the correct usage, which will drive them away. I'd like to avoid that.

On the other hand, nobody stops us from having e.g. .collect(Table) or .collect_df(). We only need one-way API compatibility after all.

Similarly calling createDataFrame(...) with a Julia Table would be nicer than having to create the data Row by Row.

This is another example where we can use Julia's multiple dispatch to create as many convenience methods as we need.
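
A hypothetical sketch of such a dispatch-based convenience (not part of this PR; the varargs Row constructor and the Row-based createDataFrame method it delegates to are assumptions):

import Tables

function createDataFrame(spark::SparkSession, tbl)
    # materialize any Tables.jl-compatible source as NamedTuples, then build Rows
    rows = [Row(values(nt)...) for nt in Tables.rowtable(tbl)]
    return spark.createDataFrame(rows)   # assumed: existing Row-by-Row method
end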

It'll just be a bit confusing which kind of DataFrame I'm working with; maybe naming this one SparkDataFrame would avoid some confusion?

How about Spark.DataFrame? :)

@dfdx (Owner, author) commented May 28, 2022

I just learnt two things:

  • Java serialization doesn't work for dynamically compiled classes, so in many cases we won't be able to just embed Java code as we intended. Compiling the class at runtime on each worker is an option, but it would be quite time-consuming.
  • There are cases in the DataFrame API where the lifetime of a Julia worker is well known, notably ForeachWriter.

These two things make me prioritize a custom Julia code runner again. I'm thinking of juliacaller for control flow and Arrow for data flow. But there are still many open questions, so I'm going to release the new DataFrame API without custom Julia runners first anyway.

@exyi (Contributor) commented May 29, 2022

One of the main design choices in the new API is to copy the Python API so that people can simply copy-paste the thousands of PySpark examples out there. If we change the signature of some methods, users will have to figure out the correct usage, which will drive them away. I'd like to avoid that.

On the other hand, nobody stops us from having e.g. .collect(Table) or .collect_df(). We only need one-way API compatibility after all.

Cool, I'll look into adding collect_df then :)

I looked into how PySpark handles that, and they use a tiny Python class for Row which is not a wrapper around the Scala Row: https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1781. I think that's slightly preferable, since Julia-to-Java function calls are not exactly efficient. We could then use Arrow and convert the Table into a Vector of these Rows. Alternatively, the PySpark Row behaves almost like a Julia NamedTuple, so maybe we could return a Vector of NamedTuples (Arrow.jl has a function for that too, AFAIK)?
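
For reference, a small sketch of that NamedTuple route using Arrow.jl and Tables.jl (the file path is just a placeholder; how the bytes get from the JVM to Julia is out of scope here):

using Arrow, Tables

tbl = Arrow.Table("collected_batch.arrow")
rows = Tables.rowtable(tbl)    # Vector of NamedTuples; access fields like rows[1].name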

These two things make me prioritize a custom Julia code runner again. I'm thinking of juliacaller for control flow and Arrow for data flow. But there are still many open questions, so I'm going to release the new DataFrame API without custom Julia runners first anyway.

Sure, do that. I think most users will just want fast access to some HDFS files and/or do some data wrangling in Spark SQL; for these use cases, this approach is pretty much perfect.

@dfdx (Owner, author) commented May 31, 2022

I looked into how PySpark handles that, and they use a tiny Python class for Row which is not a wrapper around the Scala Row

That's interesting. I agree we can optimize it this way, but it will take time to re-implement all the methods involving rows, so perhaps not in this initial release. But the more interesting question is why people even want Row to be performant. The only two ways I've used Row in practice are in examples like spark.createDataFrame(...) and to inspect a couple of rows. Honestly, copying data from the workers to the driver sounds suspicious to me in itself.

Anyway, I believe implementing specialized and more data-efficient functions like collect(df, DataFrames.DataFrame) / collect_df(df) in addition to the existing ones is a good start.

@dfdx (Owner, author) commented Jun 1, 2022

To add to the collect(DataFrames.DataFrame) / collect_df() discussion, PySpark also provides a method with a very clear name: .toPandas(). Likewise, we could have e.g. .to(DataFrames.DataFrame). That said, I don't have a clear preference here, so the final decision is up to whoever implements it.
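
A heavily hedged sketch of what such a convenience could look like; arrow_collect is a made-up helper standing in for whatever mechanism ships Arrow bytes from the driver JVM, not an existing Spark.jl function:

using Arrow, DataFrames

function collect_df(df)
    bytes = arrow_collect(df)    # hypothetical: Arrow IPC payload collected on the JVM side
    return DataFrames.DataFrame(Arrow.Table(bytes))
end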

@dfdx dfdx merged commit ac6ad77 into main Jun 4, 2022
@dfdx dfdx deleted the new-api branch June 4, 2022 22:49