[NodeJS]: readIPC from buffer fails with 'Arrow file does not contain correct header', while it works in ArrowJS #109

Open
0xgeert opened this issue Jan 19, 2022 · 7 comments
Labels
enhancement New feature or request


0xgeert commented Jan 19, 2022

Using Node.JS

What version of polars are you using?

"nodejs-polars": "^0.2.0"

What operating system are you using polars on?

MacOS Big Sur 11.1

Describe your bug.

Reading a buffer from an .ipc (ArrowStream) file with readIPC fails with Error: Arrow file does not contain correct header. The file itself is not corrupt, since apache-arrow's Table.from method loads it without problems.

What are the steps to reproduce the behavior?

See the code example below. I'll post both the .arrow file (works) and the .ipc file (doesn't work) as attachments.

const pl = require('nodejs-polars'); 
const { Table } = require('apache-arrow')
const { readFileSync } = require('fs');

const fromArrow = readFileSync('hits.arrow'); 
const fromIPC = readFileSync('hits.ipc'); 

// Read Arrow file by Arrow.js -> works
const df = Table.from([fromArrow])
console.log("df", df.count()) // 10

// Read Arrow file by polars -> works
const dfPolars = pl.readIPC(fromArrow)
console.log("dfPolars", dfPolars) // prints nice table with 10 entries

// Read IPC (ArrowStream) file by Arrow.js -> works
const dfIpc = Table.from([fromIPC])
console.log("dfIpc", dfIpc.count()) // 10

// Read IPC (ArrowStream) by polars -> Fails
const dfIpcPolars = pl.readIPC(fromIPC)
console.log("dfIpcPolars", dfIpcPolars) // Error: Arrow file does not contain correct header


@0xgeert
Author

0xgeert commented Jan 19, 2022

@ritchie46
Member

The IPC readers are implemented upstream. Could you open this issue there? https://github.com/jorgecarleitao/arrow2

@jorgecarleitao

I am a bit surprised about pl.readIPC(fromArrow) and pl.readIPC(fromIPC): shouldn't these be two different signatures? Reading a stream (.ipc) is one thing; reading a file (.arrow) is another. I think we are just missing a readIPCStream in Polars' API that can read Arrow streams (as opposed to Arrow files).

@ritchie46
Member

Ah, no, Polars doesn't make that distinction. So the .ipc is the stream, and the .arrow is the Feather file, i.e. the IPC data plus additional headers?

Then we must add this.
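
For readers following along, the file/stream distinction discussed above can be checked directly on a buffer. The Arrow IPC file format begins with the magic bytes ARROW1, while a stream does not. The following is a minimal sketch, not part of Polars or apache-arrow: the helper name detectArrowIpcFormat is hypothetical, and it assumes that any buffer lacking the file magic is a stream, without validating it further.

```javascript
// Magic bytes that open an Arrow IPC *file* (Feather v2).
const ARROW_FILE_MAGIC = Buffer.from("ARROW1", "ascii");

// Hypothetical helper: returns "file" when the buffer starts with the
// file magic, otherwise "stream".
// Assumption: anything that is not a file is treated as a stream;
// the contents are not validated beyond the first six bytes.
function detectArrowIpcFormat(buf) {
  if (buf.length < ARROW_FILE_MAGIC.length) {
    return "stream";
  }
  const head = buf.subarray(0, ARROW_FILE_MAGIC.length);
  return ARROW_FILE_MAGIC.equals(head) ? "file" : "stream";
}
```

Applied to the buffers from the original report, a check like this should report "file" for hits.arrow and "stream" for hits.ipc, which matches the readIPC behaviour described in the issue.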

@joshuataylor

Hi!

I'm keen to get this into polars: Snowflake uses this as its response format, and it would be awesome to read data straight from SF into Polars.

Here is a quick primer about the streaming files from Arrow: https://arrow.apache.org/docs/python/ipc.html
And the guide here from arrow2 about reading the stream: https://jorgecarleitao.github.io/arrow2/io/ipc_stream_read.html

IMHO, supporting files initially is fine; other streaming support can come later.

I've started looking into this, and the major blocker I can see is projections.

In arrow2, projections are not supported here: https://github.com/jorgecarleitao/arrow2/blob/main/src/io/ipc/read/stream.rs#L185

So we will need to build the projection from the chunks.

Thoughts?
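
To illustrate what "building the projection from the chunks" could look like, here is a rough sketch in plain JavaScript. It does not use the arrow2 or Polars APIs: the chunk shape ({ names, columns }) and the function names projectChunk and projectChunks are assumptions made purely for illustration. The idea is that every chunk is read in full and the unwanted columns are dropped afterwards.

```javascript
// Hypothetical chunk shape (an assumption, not an arrow2 type):
//   { names: string[], columns: any[][] }
// where columns[i] holds the values of the column called names[i].

// Keep only the requested columns of one chunk, in the requested order.
function projectChunk(chunk, wanted) {
  const indices = wanted.map((name) => {
    const i = chunk.names.indexOf(name);
    if (i === -1) {
      throw new Error(`unknown column: ${name}`);
    }
    return i;
  });
  return {
    names: indices.map((i) => chunk.names[i]),
    columns: indices.map((i) => chunk.columns[i]),
  };
}

// Apply the same projection to every chunk read from the stream.
function projectChunks(chunks, wanted) {
  return chunks.map((chunk) => projectChunk(chunk, wanted));
}
```

The trade-off of this approach is that every column is still decoded before being discarded, so it saves memory downstream but not read time.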

@stinodego stinodego added the enhancement New feature or request label Jul 14, 2023
@stinodego stinodego transferred this issue from pola-rs/polars Sep 8, 2023
@stinodego
Member

stinodego commented Sep 8, 2023

Transferring this to the NodeJS repo as I have no way to reproduce this using Python/Rust. Not sure if this is still relevant.

@Bidek56
Collaborator

Bidek56 commented Jul 25, 2024

@0xgeert Please try: pl.read_ipc_stream using py-polars as described here. It works fine for me. Thx

universalmind303 pushed a commit that referenced this issue Aug 16, 2024
TLDR: Solves #109

More or less the IPC Stream methods are straight copies of the IPC File
(Feather) ones, swapping out the IpcReader, IpcWriter for their
streaming equivalents; the API should be identical to py-polars (with
the exception of file-like objects as input for read_ipc,
read_ipc_stream - not much point adding that until streaming IO is
exposed upstream).

I've left the docstrings basically untouched, let me know if you want
those tweaked (the `@param`s appear to have drifted over time).