rename #9

Merged: 2 commits, Jan 16, 2025
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -171,3 +171,6 @@ cython_debug/
.pypirc

local/

docs/site/
site/
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1 +1 @@
# pandas_lazy
# lazy_pandas
Binary file added docs/docs/assets/logo.png
22 changes: 11 additions & 11 deletions docs/docs/index.md
Original file line number Diff line number Diff line change
@@ -1,35 +1,35 @@
---
title: Pandas Lazy
title: Lazy Pandas
hide:
- navigation
- toc
---

# Pandas Lazy
# Lazy Pandas

Welcome to the **Pandas Lazy** official documentation!
Welcome to the **Lazy Pandas** official documentation!
A library inspired by [pandas](https://pandas.pydata.org/) that focuses on *lazy* processing, enabling high performance and lower memory usage for large datasets.

## What is Pandas Lazy?
## What is Lazy Pandas?

Pandas Lazy is built on the concept of delaying DataFrame operations until they are strictly necessary (lazy evaluation). This allows:
Lazy Pandas is built on the concept of delaying DataFrame operations until they are strictly necessary (lazy evaluation). This allows:
- Operations to be optimized in batches.
- Memory usage to be minimized during processing.
- Total runtime to be reduced for complex pipelines.

## Code Comparison

Below is a side-by-side comparison showing how the same operation would look in **pandas** versus **Pandas Lazy**:
Below is a side-by-side comparison showing how the same operation would look in **Pandas** versus **Lazy Pandas**:


=== "Pandas Lazy"
=== "Lazy Pandas"

```python linenums="1" hl_lines="2 5 13"
import pandas as pd
import pandas_lazy as pdl
import lazy_pandas as lpd

def read_taxi_dataset(location: str) -> pd.DataFrame:
df = pdl.read_csv(location, parse_dates=["pickup_datetime"])
df = lpd.read_csv(location, parse_dates=["pickup_datetime"])
df = df[["pickup_datetime", "passenger_count"]]
df["passenger_count"] = df["passenger_count"]
df["pickup_date"] = df["pickup_datetime"].dt.date
Expand Down Expand Up @@ -61,11 +61,11 @@ Below is a side-by-side comparison showing how the same operation would look in
return df
```

Notice that in traditional **pandas**, operations are executed immediately, while in **Pandas Lazy**, computation only occurs when you call `.collect()`.
Notice that in traditional **pandas**, operations are executed immediately, while in **Lazy Pandas**, computation only occurs when you call `.collect()`.

## Memory Usage

Below is a fictitious performance comparison between **pandas** and **Pandas Lazy**, showing a scenario where a large dataset is processed in three stages (reading, aggregation, and complex filtering).
Below is a fictitious performance comparison between **pandas** and **Lazy Pandas**, showing a scenario where a large dataset is processed in three stages (reading, aggregation, and complex filtering).


<div class="grid cards" markdown>
Expand Down
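
Review note on the `docs/docs/index.md` text above: the claim that nothing is computed until `.collect()` is called can be illustrated with a small, standard-library-only sketch. `ToyLazyFrame`, `has_passengers` and the sample rows are hypothetical names made up for this note. The imports elsewhere in this diff suggest the real library builds on DuckDB relations rather than Python loops, so this shows only the deferred-execution idea, not the library's implementation.

```python
from typing import Callable, Iterable

Row = dict


class ToyLazyFrame:
    """Hypothetical stand-in for a lazy frame: records steps, runs them on collect()."""

    def __init__(self, rows: Iterable[Row], steps: tuple = ()):
        self._rows = rows
        self._steps = steps  # pending operations, not yet executed

    def filter(self, predicate: Callable[[Row], bool]) -> "ToyLazyFrame":
        # Returns a new frame with one more pending step; no row is touched here.
        return ToyLazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, columns: list[str]) -> "ToyLazyFrame":
        # Also deferred: only the column names are recorded.
        return ToyLazyFrame(self._rows, self._steps + (("select", columns),))

    def collect(self) -> list[Row]:
        # Only now is the recorded plan applied, row by row.
        out = []
        for row in self._rows:
            keep = True
            for kind, arg in self._steps:
                if kind == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:  # "select"
                    row = {name: row[name] for name in arg}
            if keep:
                out.append(row)
        return out


calls = []


def has_passengers(row: Row) -> bool:
    calls.append(row)  # lets us observe when the predicate actually runs
    return row["passenger_count"] > 0


rows = [
    {"pickup_date": "2025-01-01", "passenger_count": 2, "fare": 10.0},
    {"pickup_date": "2025-01-02", "passenger_count": 0, "fare": 7.5},
]

lazy = ToyLazyFrame(rows).filter(has_passengers).select(["pickup_date", "passenger_count"])
calls_before_collect = len(calls)  # the predicate has not run yet
result = lazy.collect()            # the whole plan executes here
```

Because each method returns a new frame holding the accumulated steps, an engine is free to inspect or reorder the plan before running it, which is where the batching benefit listed in the docs would come from.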
7 changes: 4 additions & 3 deletions docs/mkdocs.yml
Original file line number Diff line number Diff line change
@@ -1,8 +1,9 @@
site_name: PandasLazy
repo_url: https://github.com/mariotaddeucci/pandas-lazy
repo_name: mariotaddeucci/pandas-lazy
site_name: LazyPandas
repo_url: https://github.com/mariotaddeucci/lazy-pandas
repo_name: mariotaddeucci/lazy-pandas
theme:
name: material
logo: assets/logo.png
palette:
- media: "(prefers-color-scheme: light)"
scheme: default
Expand Down
31 changes: 14 additions & 17 deletions docs/scripts/generate_references.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,36 +4,33 @@
import mkdocs_gen_files

package_dir = Path(__file__).parent.parent.parent / "src"

print(str(package_dir))
sys.path.insert(0, str(package_dir))


import pandas_lazy as pdl # noqa: E402
import lazy_pandas as lpd # noqa: E402

vls = []

vls += [
(1000 + idx, "pandas_lazy.LazyFrame", f"LazyFrame.{attr}", attr)
for idx, attr in enumerate(sorted(dir(pdl.LazyFrame)))
(1000 + idx, "lazy_pandas.LazyFrame", f"LazyFrame.{attr}", attr)
for idx, attr in enumerate(sorted(dir(lpd.LazyFrame)))
if not attr.startswith("_")
]

vls += [
(1000 + idx, "pandas_lazy.LazyColumn", f"LazyColumn.{attr}", attr)
for idx, attr in enumerate(sorted(dir(pdl.LazyColumn)))
(1000 + idx, "lazy_pandas.LazyColumn", f"LazyColumn.{attr}", attr)
for idx, attr in enumerate(sorted(dir(lpd.LazyColumn)))
if not attr.startswith("_") and attr not in ["str", "dt", "create_from_function"]
]

vls += [
(2000 + idx, "pandas_lazy.LazyStringColumn", f"LazyColumn.str.{attr}", attr)
for idx, attr in enumerate(sorted(dir(pdl.LazyStringColumn)))
(2000 + idx, "lazy_pandas.LazyStringColumn", f"LazyColumn.str.{attr}", attr)
for idx, attr in enumerate(sorted(dir(lpd.LazyStringColumn)))
if not attr.startswith("_")
]

vls += [
(3000 + idx, "pandas_lazy.LazyDateTimeColumn", f"LazyColumn.dt.{attr}", attr)
for idx, attr in enumerate(sorted(dir(pdl.LazyDateTimeColumn)))
(3000 + idx, "lazy_pandas.LazyDateTimeColumn", f"LazyColumn.dt.{attr}", attr)
for idx, attr in enumerate(sorted(dir(lpd.LazyDateTimeColumn)))
if not attr.startswith("_")
]

Expand All @@ -53,21 +50,21 @@

fn_names = [
attr
for idx, attr in enumerate(sorted(dir(pdl)))
for idx, attr in enumerate(sorted(dir(lpd)))
if not attr.startswith("_")
and callable(getattr(pdl, attr))
and callable(getattr(lpd, attr))
and attr not in ["LazyFrame", "LazyColumn", "LazyStringColumn", "LazyDateTimeColumn"]
]


template = """
# pdl.{function_name}
::: pandas_lazy.{function_name}
# lpd.{function_name}
::: lazy_pandas.{function_name}
options:
members:
- {function_name}
"""

for function_name in fn_names:
with mkdocs_gen_files.open(f"references/General_Functions/pandas_lazy{function_name}.md", "w") as f:
with mkdocs_gen_files.open(f"references/General_Functions/lazy_pandas{function_name}.md", "w") as f:
f.write(template.format(function_name=function_name))
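
Review note on `docs/scripts/generate_references.py`: the script discovers what to document by listing a module's public, callable attributes with `dir()` and then formatting one page path per name. The sketch below reproduces that filter against the standard-library `json` module so it runs without `lazy_pandas` installed. `EXCLUDED` and `reference_path` are hypothetical helpers introduced for this note. It also makes visible that the f-string in the script joins the package name and the function name with no separator, which may or may not be intended.

```python
import json  # stand-in module; the script inspects lazy_pandas instead

# Analogous to the class names the script skips when collecting functions.
EXCLUDED = ["JSONDecoder", "JSONEncoder"]

fn_names = [
    attr
    for attr in sorted(dir(json))
    if not attr.startswith("_")          # drop private and dunder names
    and callable(getattr(json, attr))    # drop submodules and constants
    and attr not in EXCLUDED
]


def reference_path(package: str, function_name: str) -> str:
    # Mirrors the f-string in the script: the two names are concatenated directly.
    return f"references/General_Functions/{package}{function_name}.md"


example_path = reference_path("lazy_pandas", "read_csv")
```

With these inputs `example_path` is `references/General_Functions/lazy_pandasread_csv.md`; inserting a `.` or `/` between the two placeholders would be a small follow-up if a separated file name is preferred.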
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ description = "Add your description here"
dynamic = [
"version",
]
name = "pandas_lazy"
name = "lazy_pandas"
readme = "README.md"
requires-python = ">=3.10"

Expand Down
File renamed without changes.
51 changes: 0 additions & 51 deletions src/benckmark.py

This file was deleted.

17 changes: 17 additions & 0 deletions src/lazy_pandas/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
from lazy_pandas.column.lazy_column import LazyColumn
from lazy_pandas.column.lazy_datetime_column import LazyDateTimeColumn
from lazy_pandas.column.lazy_string_column import LazyStringColumn
from lazy_pandas.frame.lazy_frame import LazyFrame
from lazy_pandas.general import from_pandas, read_csv, read_delta, read_iceberg, read_parquet

__all__ = [
"LazyFrame",
"LazyColumn",
"read_csv",
"read_parquet",
"from_pandas",
"read_delta",
"read_iceberg",
"LazyDateTimeColumn",
"LazyStringColumn",
]
File renamed without changes.
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,8 @@
from duckdb import CoalesceOperator, ConstantExpression, Expression, FunctionExpression
from duckdb.typing import DuckDBPyType

from pandas_lazy.column.lazy_datetime_column import LazyDateTimeColumn
from pandas_lazy.column.lazy_string_column import LazyStringColumn
from lazy_pandas.column.lazy_datetime_column import LazyDateTimeColumn
from lazy_pandas.column.lazy_string_column import LazyStringColumn

__all__ = ["LazyColumn"]

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
from duckdb import ConstantExpression

if TYPE_CHECKING:
from pandas_lazy.column.lazy_column import LazyColumn
from lazy_pandas.column.lazy_column import LazyColumn


class LazyDateTimeColumn:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
from duckdb import ConstantExpression

if TYPE_CHECKING:
from pandas_lazy.column.lazy_column import LazyColumn
from lazy_pandas.column.lazy_column import LazyColumn


class LazyStringColumn:
Expand Down
1 change: 1 addition & 0 deletions src/lazy_pandas/exceptions.py
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
class LazyPandasUnsupporttedOperation(Exception): ...
Empty file.
Original file line number Diff line number Diff line change
Expand Up @@ -12,9 +12,9 @@
)
from duckdb.typing import DuckDBPyType

from pandas_lazy.column.lazy_column import LazyColumn
from pandas_lazy.exceptions import PandasLazyUnsupporttedOperation
from pandas_lazy.frame.lazy_groupped_frame import LazyGrouppedFrame
from lazy_pandas.column.lazy_column import LazyColumn
from lazy_pandas.exceptions import LazyPandasUnsupporttedOperation
from lazy_pandas.frame.lazy_groupped_frame import LazyGrouppedFrame

if TYPE_CHECKING:
import pandas as pd
Expand Down Expand Up @@ -281,7 +281,7 @@ def __getitem__(self, key: str | list[str] | LazyColumn) -> Union[LazyColumn, "L
LazyColumn or LazyFrame: The selected column or a LazyFrame with the selected columns.

Raises:
PandasLazyUnsupporttedOperation: If an unsupported operation is attempted.
LazyPandasUnsupporttedOperation: If an unsupported operation is attempted.
"""
if isinstance(key, list):
return LazyFrame(self._relation.select(*key))
Expand All @@ -292,8 +292,8 @@ def __getitem__(self, key: str | list[str] | LazyColumn) -> Union[LazyColumn, "L
if isinstance(key, LazyColumn):
return LazyFrame(self._relation.filter(key.expr))

raise PandasLazyUnsupporttedOperation(
f"PandasLazy does not support all pandas operations, use collect() to get a pandas DataFrame and then perform the operation {key}"
raise LazyPandasUnsupporttedOperation(
f"LazyPandas does not support all pandas operations, use collect() to get a pandas DataFrame and then perform the operation {key}"
)

def __setitem__(self, key, value) -> None:
Expand Down
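
Review note on `LazyFrame.__getitem__`: the visible branches dispatch on the key's type (a list selects columns, a `LazyColumn` filters rows, anything else raises the renamed exception). The sketch below shows the same dispatch pattern with plain Python lists. `ToyFrame`, `ToyMask` and `ToyUnsupportedOperation` are hypothetical stand-ins, and the `str` branch is an assumption based on the type hint and docstring, since that part of the hunk is collapsed in this view.

```python
class ToyUnsupportedOperation(Exception):
    """Hypothetical stand-in for LazyPandasUnsupporttedOperation."""


class ToyMask:
    """Hypothetical stand-in for a boolean LazyColumn expression."""

    def __init__(self, predicate):
        self.predicate = predicate


class ToyFrame:
    def __init__(self, rows):
        self._rows = rows

    def __getitem__(self, key):
        if isinstance(key, list):
            # List of names: projection, returns a frame.
            return ToyFrame([{name: row[name] for name in key} for row in self._rows])
        if isinstance(key, str):
            # Single name: one column (assumed; this branch is not shown in the diff).
            return [row[key] for row in self._rows]
        if isinstance(key, ToyMask):
            # Mask: row filter, returns a frame.
            return ToyFrame([row for row in self._rows if key.predicate(row)])
        # Anything else is rejected explicitly instead of failing obscurely later.
        raise ToyUnsupportedOperation(f"unsupported key: {key!r}")


frame = ToyFrame([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])
projected = frame[["a"]]
column = frame["b"]
filtered = frame[ToyMask(lambda row: row["a"] > 1)]
```

Raising a dedicated exception for unknown key types keeps the error message actionable, which matches the intent of the message in the diff that points users to `collect()`.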
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
from duckdb import DuckDBPyRelation

if TYPE_CHECKING:
from pandas_lazy import LazyFrame
from lazy_pandas import LazyFrame


class LazyGrouppedFrame:
Expand Down
27 changes: 13 additions & 14 deletions src/pandas_lazy/general.py → src/lazy_pandas/general.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,7 @@

import duckdb
import duckdb.typing

from pandas_lazy.frame.lazy_frame import LazyFrame
from lazy_pandas.frame.lazy_frame import LazyFrame


def from_pandas(df) -> LazyFrame:
Expand All @@ -19,9 +18,9 @@ def from_pandas(df) -> LazyFrame:
Example:
```python
import pandas as pd
import pandas_lazy as pdl
import lazy_pandas as lpd
df = pd.DataFrame({'column1': [1, 2, 3], 'column2': ['a', 'b', 'c']})
lazy_df = pdl.from_pandas(df)
lazy_df = lpd.from_pandas(df)
```
"""
return LazyFrame(duckdb.from_df(df))
Expand Down Expand Up @@ -116,8 +115,8 @@ def read_csv(

Example:
```python
import pandas_lazy as pdl
df = pdl.read_csv('data.csv', header=True, sep=',', dtype={'column1': 'INTEGER', 'column2': 'VARCHAR'})
import lazy_pandas as lpd
df = lpd.read_csv('data.csv', header=True, sep=',', dtype={'column1': 'INTEGER', 'column2': 'VARCHAR'})
df.head()
```
"""
Expand Down Expand Up @@ -219,8 +218,8 @@ def read_json(

Example:
```python
import pandas_lazy as pdl
df = pdl.read_json('data.json', columns={'userId': 'INTEGER', 'completed': 'BOOLEAN'}, format='array')
import lazy_pandas as lpd
df = lpd.read_json('data.json', columns={'userId': 'INTEGER', 'completed': 'BOOLEAN'}, format='array')
df.head()
```
"""
Expand Down Expand Up @@ -276,8 +275,8 @@ def read_parquet(

Example:
```python
import pandas_lazy as pdl
df = pdl.read_parquet('data.parquet', columns=['column1', 'column2'])
import lazy_pandas as lpd
df = lpd.read_parquet('data.parquet', columns=['column1', 'column2'])
df.head()
```
"""
Expand Down Expand Up @@ -307,9 +306,9 @@ def read_delta(path: str, *, conn: duckdb.DuckDBPyConnection | None = None) -> L

Example:
```python
import pandas_lazy as pdl
import lazy_pandas as lpd
from datetime import date
df = pdl.read_delta('s3://bucket/path_to_delta_table')
df = lpd.read_delta('s3://bucket/path_to_delta_table')
df.head()
```
"""
Expand All @@ -332,8 +331,8 @@ def read_iceberg(path: str, *, conn: duckdb.DuckDBPyConnection | None = None) ->

Example:
```python
import pandas_lazy as pdl
df = pdl.read_iceberg('s3://bucket/path_to_iceberg_table')
import lazy_pandas as lpd
df = lpd.read_iceberg('s3://bucket/path_to_iceberg_table')
df.head()
```
"""
Expand Down
File renamed without changes.