RFC: PySpark-like, DataFrame-only API #104

Merged
merged 46 commits into from
Jun 4, 2022
Changes from 25 commits
e752e4e
Deprecate RDD API
dfdx May 9, 2022
ed60744
Start building a completely new API mimicking the Python API
dfdx May 10, 2022
339ef63
Add Column, a few basic operations on Column and DataFrame, some tests
dfdx May 11, 2022
83953c0
Add Column API
dfdx May 12, 2022
7c1a3c2
Start implementation of julia2java compiler
dfdx May 13, 2022
999f0f3
Add complete UDF implementation (PoC). Introduce modules
dfdx May 15, 2022
39dbe44
Fix issues in CI/CD
dfdx May 17, 2022
22d809b
Add methods for StructField and StructType
dfdx May 17, 2022
08307c4
Add spark.createDataFrame()
dfdx May 18, 2022
5d4723a
Add select()
dfdx May 19, 2022
5624463
Add RuntimeConfig
dfdx May 20, 2022
8a284d6
Add withColumn, filter and others
dfdx May 20, 2022
1d8ef0c
Fix empty tuple argument in jcall (for older version of JavaCall)
dfdx May 20, 2022
e7572f3
Fix more jcall's, rename GroupedData vars
dfdx May 20, 2022
eda3861
Add more ways to create a DataFrame
dfdx May 21, 2022
6bc47e5
Finish GroupedData
dfdx May 21, 2022
b07ba1a
Add sql() and friends
dfdx May 21, 2022
1239298
Add join(), remove unused RDD stuff
dfdx May 21, 2022
19096ff
Divide code into multiple files
dfdx May 21, 2022
ed9137d
Add reader/writer + tests
dfdx May 21, 2022
576df2d
Update versions in CI/CD
dfdx May 22, 2022
139cdbb
Unset JULIA_COPY_STACKS on Appveyor
dfdx May 22, 2022
099e042
Add operations on Window
dfdx May 22, 2022
77b8f04
Another attempt to fix tests on Appveyor
dfdx May 22, 2022
dc2a760
Add Windows to the test config in Github Actions
dfdx May 22, 2022
db3b1dc
Rework pom.xml, update versions of everything
dfdx May 28, 2022
75f8f17
Turn off foreach(::DataStreamWriter, ...), start reworking compilatio…
dfdx May 28, 2022
1ff7aad
Fix compatibility issue
dfdx May 28, 2022
96ec648
Remove Windows from the test matrix for now
dfdx May 28, 2022
a5e8227
Switch to Oracle JDK in Github Actions
dfdx May 28, 2022
383af82
Try another distribution of OpenJDK
dfdx May 28, 2022
3abb2cc
Try Oracle JDK 11.0.14
dfdx May 28, 2022
afb383d
Add a few stubs for docs
dfdx May 28, 2022
3f22e99
Try JDK 8 to fix UncaughtExceptionHandler in thread 'process reaper'
dfdx May 29, 2022
795dbe4
In preparation to the release
dfdx May 30, 2022
5826cdf
Fix dependencies, rename doc files
dfdx May 30, 2022
4ff15f9
Add support for Date & DateTime
dfdx May 31, 2022
37f7dfb
Conversion with primitive types
dfdx Jun 1, 2022
7b0f1c9
Add SQL docs
dfdx Jun 1, 2022
c0a5866
Add a few docstrings
dfdx Jun 2, 2022
49501b6
Add a few more docstrings
dfdx Jun 2, 2022
10e7f3f
Add API reference
dfdx Jun 2, 2022
d8d4071
Make a better representation for a streaming DataFrame
dfdx Jun 4, 2022
8f43024
Add Structured Streaming docs
dfdx Jun 4, 2022
d2da3b3
Add docs workflow to GitHub Actions
dfdx Jun 4, 2022
0964d10
Update branch name in GitHub Actions workflows
dfdx Jun 4, 2022
6 changes: 4 additions & 2 deletions .github/workflows/test.yml
@@ -17,16 +17,18 @@ jobs:
matrix:
version:
- '1.6'
- '1.7'
os:
- ubuntu-latest
- windows-latest
arch:
- x64
steps:
- uses: actions/checkout@v2
- name: Set up JDK 8
- name: Set up JDK 11
uses: actions/setup-java@v2
with:
java-version: '8'
java-version: '11'
distribution: 'adopt'
- uses: julia-actions/setup-julia@latest
with:
1 change: 1 addition & 0 deletions .gitignore
@@ -2,6 +2,7 @@
*.jl.mem
*~
.idea/
.vscode/
target/
project/
*.class
9 changes: 6 additions & 3 deletions Project.toml
@@ -5,15 +5,18 @@ version = "0.5.2"
[deps]
IteratorInterfaceExtensions = "82899510-4779-5014-852e-03e436cf321d"
JavaCall = "494afd89-becb-516b-aafa-70d2670c0337"
Reexport = "189a3867-3050-52da-a836-e630ba90ab69"
Serialization = "9e88b42a-f829-5b0c-bbe9-9e923198166b"
Sockets = "6462fe0b-24de-5631-8697-dd941f90decc"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
TableTraits = "3783bdb8-4a98-5b6b-af9a-565f29a5fe9c"
Umlaut = "92992a2b-8ce5-4a9c-bb9d-58be9a7dc841"

[compat]
JavaCall = "0.7"
julia = "1"
TableTraits = "1"
IteratorInterfaceExtensions = "1"
JavaCall = "0.7, 0.8"
TableTraits = "1"
julia = "1"

[extras]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
4 changes: 2 additions & 2 deletions appveyor.yml
@@ -1,7 +1,7 @@
environment:
matrix:
- julia_version: 0.7
- julia_version: 1
- julia_version: 1.6
- julia_version: 1.7
- julia_version: nightly

platform:
5 changes: 5 additions & 0 deletions jvm/sparkjl/pom.xml
@@ -153,6 +153,11 @@
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- <dependency>
<groupId>org.mdkt.compiler</groupId>
<artifactId>InMemoryJavaCompiler</artifactId>
<version>1.3.0</version>
</dependency> -->

<!-- additional data formats -->

47 changes: 1 addition & 46 deletions src/Spark.jl
@@ -1,50 +1,5 @@
module Spark

export
SparkConf,
SparkContext,
RDD,
JuliaRDD,
JavaRDD,
text_file,
parallelize,
map,
map_pair,
map_partitions,
map_partitions_pair,
map_partitions_with_index,
reduce,
filter,
collect,
count,
id,
num_partitions,
close,
@attach,
share_variable,
@share,
flat_map,
flat_map_pair,
cartesian,
group_by_key,
reduce_by_key,
cache,
repartition,
coalesce,
pipe,
# SQL
SparkSession,
Dataset,
sql,
read_json,
write_json,
read_parquet,
write_parquet,
read_df,
write_df



include("core.jl")

end
end
76 changes: 0 additions & 76 deletions src/attach.jl

This file was deleted.

63 changes: 63 additions & 0 deletions src/chainable.jl
@@ -0,0 +1,63 @@
"""
DotChainer{O, Fn}

See `@chainable` for details.
"""
struct DotChainer{O, Fn}
obj::O
fn::Fn
end

# DotChainer(obj, fn) = DotChainer{typeof(obj), typeof(fn)}(obj, fn)

(c::DotChainer)(args...) = c.fn(c.obj, args...)


"""
@chainable T

Adds dot chaining syntax to the type, i.e. automatically translates:

foo.bar(a)

into

bar(foo, a)

Single-argument functions also support implicit calls, e.g.:

foo.bar.baz(a, b)

is treated the same as:

foo.bar().baz(a, b)

Note that `@chainable` works by overloading `Base.getproperty()`,
making it impossible to customize it for `T`. To have more control,
one may use the underlying wrapper type - `DotChainer`.
"""
macro chainable(T)
return quote
function Base.getproperty(obj::$(esc(T)), prop::Symbol)
if hasfield(typeof(obj), prop)
return getfield(obj, prop)
elseif isdefined(@__MODULE__, prop)
fn = getfield(@__MODULE__, prop)
return DotChainer(obj, fn)
else
error("type $(typeof(obj)) has no field $prop")
end
end
end
end


function Base.getproperty(dc::DotChainer, prop::Symbol)
if hasfield(typeof(dc), prop)
return getfield(dc, prop)
else
# implicitly call function without arguments
# and propagate getproperty to the returned object
return getproperty(dc(), prop)
end
end
Contributor

wow, good magic, I had no idea this was possible in Julia :)
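
To make the mechanism in `src/chainable.jl` concrete, here is a minimal, self-contained sketch. The `DotChainer` type, the `@chainable` macro, and the `getproperty` method for `DotChainer` are copied from the diff in this PR. The `Account` type and the `deposit` and `double` functions are hypothetical and exist only for illustration; they are not part of Spark.jl.

```julia
# --- copied from src/chainable.jl in this PR ---
struct DotChainer{O, Fn}
    obj::O
    fn::Fn
end

# Calling the wrapper passes the wrapped object as the first argument.
(c::DotChainer)(args...) = c.fn(c.obj, args...)

macro chainable(T)
    return quote
        function Base.getproperty(obj::$(esc(T)), prop::Symbol)
            if hasfield(typeof(obj), prop)
                return getfield(obj, prop)           # real fields win
            elseif isdefined(@__MODULE__, prop)
                fn = getfield(@__MODULE__, prop)     # look up a function by name
                return DotChainer(obj, fn)
            else
                error("type $(typeof(obj)) has no field $prop")
            end
        end
    end
end

function Base.getproperty(dc::DotChainer, prop::Symbol)
    if hasfield(typeof(dc), prop)
        return getfield(dc, prop)
    else
        # implicit zero-argument call, then continue the chain
        return getproperty(dc(), prop)
    end
end

# --- hypothetical example type and functions (assumptions, not Spark.jl API) ---
struct Account
    balance::Int
end
@chainable Account

deposit(a::Account, amount) = Account(a.balance + amount)
double(a::Account) = Account(2 * a.balance)

acc = Account(10)

# Explicit calls: acc.deposit(5) is deposit(acc, 5), and so on.
explicit = acc.deposit(5).double()
println(explicit.balance)   # 30

# Implicit call: acc.double.deposit(1) is treated as acc.double().deposit(1).
implicit = acc.double.deposit(1)
println(implicit.balance)   # 21
```

Because `hasfield` is checked first, a field always takes precedence over a function with the same name, which is why `a.balance` still returns the stored integer inside `deposit` and `double`. In this sketch everything lives in a single module, so the `@__MODULE__` lookup finds `deposit` and `double` directly.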
