Questions (from email) #227

Open
1 of 5 tasks
scsmithr opened this issue Sep 11, 2024 · 0 comments


scsmithr commented Sep 11, 2024

What's left to do for hybrid execution?

Authentication of the hybrid interface?

There is currently no authentication. Anyone can connect to "https://server.rayexec.glaredb.com".

There will at some point be a proxy that intercepts and authenticates the request. There will also be some amount of catalog lookup here to ensure the query we're sending to the node has all info it needs for planning/executing.

Right now we don't need to do that since the client is able to send that info directly, and it doesn't need to be injected or read from anywhere.

For example (assuming wasm):

-- Stores attach info on client.
ATTACH postgres DATABASE AS mypg ...

-- Plans remotely by sending attach info to server.
SELECT * FROM mypg.schema.table
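
To make the flow above concrete, here is a rough sketch of a client that remembers attach info locally and bundles it with the query it sends for remote planning. Everything named here (`AttachInfo`, `HybridRequest`, `ClientSession`, the option keys) is hypothetical and is not rayexec's actual API. The alias detection is a naive substring check; a real client would resolve references from the parsed query.

```rust
use std::collections::HashMap;

/// Hypothetical: options captured by `ATTACH ... AS <alias>` on the client.
#[derive(Debug, Clone, PartialEq)]
pub struct AttachInfo {
    pub datasource: String,
    pub options: HashMap<String, String>,
}

/// Hypothetical: what the client ships to the remote node for planning.
#[derive(Debug, Clone, PartialEq)]
pub struct HybridRequest {
    pub sql: String,
    pub attachments: HashMap<String, AttachInfo>,
}

/// Hypothetical client-side session that stores attach info locally.
#[derive(Debug, Default)]
pub struct ClientSession {
    attached: HashMap<String, AttachInfo>,
}

impl ClientSession {
    /// Record attach info under an alias; nothing is sent to the server yet.
    pub fn attach(&mut self, alias: &str, info: AttachInfo) {
        self.attached.insert(alias.to_string(), info);
    }

    /// Bundle the query with attach info for every alias it appears to use.
    /// Assumption: a plain substring check for `<alias>.` stands in for
    /// proper name resolution.
    pub fn build_request(&self, sql: &str) -> HybridRequest {
        let attachments = self
            .attached
            .iter()
            .filter(|(alias, _)| sql.contains(&format!("{alias}.")))
            .map(|(alias, info)| (alias.clone(), info.clone()))
            .collect();
        HybridRequest {
            sql: sql.to_string(),
            attachments,
        }
    }
}
```

With this shape the server needs no catalog of its own for the query: whatever it needs to reach `mypg` arrives in the request. Once a proxy and catalog lookup exist, the attach info could be injected there instead.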

What's left before we can begin implementing more data sources (db vs files)?

Files should be ready enough to go.

E.g. parquet: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_parquet/src/lib.rs#L18-L65

There might be some slight changes but nothing major.

I'm slightly less sure about databases. They'll need to implement connect: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_execution/src/datasource.rs#L64-L73

The semantics of DataSourceConnection still need to be hammered out around actually making changes to catalogs in remote databases, but I imagine it should be reasonably ready for reads. E.g. postgres: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_postgres/src/lib.rs#L55-L68

What's the path to globbing file paths?

https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_io/src/location.rs#L66-L85

Either extend FileLocation, or have a wrapper around FileLocation that handles globbing and produces FileLocations. Personally I like how simple FileLocation is right now, so I'm leaning more towards a FileList struct which can parse/handle hive & glob.
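
A minimal sketch of what the wrapper option could look like, under a few assumptions: `FileLocation` below is a simplified stand-in rather than the real type, `FileList` and its methods are hypothetical, and the candidate names are assumed to come from some listing call (local directory or object store). Only `*` is handled, and it matches across `/`; hive partition parsing is left out entirely.

```rust
use std::path::PathBuf;

/// Simplified stand-in for rayexec's `FileLocation`; the real type may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum FileLocation {
    Url(String),
    Path(PathBuf),
}

impl FileLocation {
    /// Treat anything containing "://" as a URL, everything else as a path.
    pub fn parse(s: &str) -> Self {
        if s.contains("://") {
            FileLocation::Url(s.to_string())
        } else {
            FileLocation::Path(PathBuf::from(s))
        }
    }
}

/// Hypothetical wrapper that owns the glob and produces plain `FileLocation`s.
#[derive(Debug, Clone)]
pub struct FileList {
    pattern: String,
}

impl FileList {
    pub fn new(pattern: &str) -> Self {
        FileList { pattern: pattern.to_string() }
    }

    pub fn is_glob(&self) -> bool {
        self.pattern.contains('*')
    }

    /// Expand the pattern against candidate names from a listing.
    pub fn expand<'a>(&self, candidates: impl IntoIterator<Item = &'a str>) -> Vec<FileLocation> {
        candidates
            .into_iter()
            .filter(|c| glob_match(&self.pattern, c))
            .map(FileLocation::parse)
            .collect()
    }
}

/// `*` matches any run of characters (including `/`).
/// No `?`, `**`, or character classes in this sketch.
fn glob_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Position of the last `*` seen and the text index it has absorbed up to.
    let mut star: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Backtrack: let the last `*` absorb one more character.
            pi = sp + 1;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Trailing `*`s can match the empty string.
    while pi < p.len() && p[pi] == '*' {
        pi += 1;
    }
    pi == p.len()
}
```

The point of the split is that scans keep taking a plain `FileLocation`, and everything glob-related stays inside `FileList`.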

How much work is it going to be to support different cloud provider object storage?

S3 is already in. GCS will require the service account flow (#190). I don't think GCS would take too long, maybe a day or two.

I have no clue about Azure.

Where do we stand with the "native" storage? Updates? Inserts?

Temp tables exist with inserts.

"Native storage" will just be another data source. The idea is that data sources will return both a TableStorage and a CatalogStorage implementation from connect (https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_execution/src/datasource.rs#L37-L41). That's what I plan to use for the native storage (just plop in delta for table storage + whatever catalog stuff).

TableStorage implements this trait: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_execution/src/storage/table_storage.rs#L8-L18

This will be how tables are physically created/dropped/scanned.

DataTables implement this trait: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_execution/src/storage/table_storage.rs#L20-L40

CatalogStorage will implement this trait: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_execution/src/storage/catalog_storage.rs#L7-L13

CatalogStorage is the larger unknown right now.
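
To illustrate the shape described above, here is a heavily simplified sketch of a data source whose connect step hands back both a table storage and a catalog storage. These are not the real traits: the method names, the synchronous signatures, the `Row` type, and `connect_memory` are all assumptions made for illustration, and the linked code is the source of truth.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical row representation for the sketch.
pub type Row = Vec<i64>;

/// Simplified idea of a data table: something that can be scanned and inserted into.
pub trait DataTable {
    fn scan(&self) -> Vec<Row>;
    fn insert(&self, rows: Vec<Row>);
}

/// Simplified idea of table storage: physically create/drop/look up tables.
pub trait TableStorage {
    fn create_table(&self, name: &str) -> Result<Arc<dyn DataTable>, String>;
    fn drop_table(&self, name: &str) -> Result<(), String>;
    fn data_table(&self, name: &str) -> Result<Arc<dyn DataTable>, String>;
}

/// Simplified idea of catalog storage: here only a lookup, no modifications.
pub trait CatalogStorage {
    fn table_exists(&self, name: &str) -> bool;
}

/// What a data source's connect step hands back in this sketch.
pub struct DataSourceConnection {
    pub table_storage: Arc<dyn TableStorage>,
    pub catalog_storage: Arc<dyn CatalogStorage>,
}

#[derive(Default)]
struct MemoryTable {
    rows: Mutex<Vec<Row>>,
}

impl DataTable for MemoryTable {
    fn scan(&self) -> Vec<Row> {
        self.rows.lock().unwrap().clone()
    }
    fn insert(&self, rows: Vec<Row>) {
        self.rows.lock().unwrap().extend(rows);
    }
}

/// In-memory storage that backs both traits.
#[derive(Default)]
pub struct MemoryStorage {
    tables: Mutex<HashMap<String, Arc<MemoryTable>>>,
}

impl TableStorage for MemoryStorage {
    fn create_table(&self, name: &str) -> Result<Arc<dyn DataTable>, String> {
        let mut tables = self.tables.lock().unwrap();
        if tables.contains_key(name) {
            return Err(format!("table '{name}' already exists"));
        }
        let table = Arc::new(MemoryTable::default());
        tables.insert(name.to_string(), table.clone());
        Ok(table as Arc<dyn DataTable>)
    }
    fn drop_table(&self, name: &str) -> Result<(), String> {
        let removed = self.tables.lock().unwrap().remove(name);
        removed.map(|_| ()).ok_or_else(|| format!("table '{name}' not found"))
    }
    fn data_table(&self, name: &str) -> Result<Arc<dyn DataTable>, String> {
        let found = self.tables.lock().unwrap().get(name).cloned();
        found
            .map(|t| t as Arc<dyn DataTable>)
            .ok_or_else(|| format!("table '{name}' not found"))
    }
}

impl CatalogStorage for MemoryStorage {
    fn table_exists(&self, name: &str) -> bool {
        self.tables.lock().unwrap().contains_key(name)
    }
}

/// Hypothetical connect: one backing store exposed through both traits.
pub fn connect_memory() -> DataSourceConnection {
    let storage = Arc::new(MemoryStorage::default());
    let table_storage: Arc<dyn TableStorage> = storage.clone();
    let catalog_storage: Arc<dyn CatalogStorage> = storage;
    DataSourceConnection { table_storage, catalog_storage }
}
```

In this picture, native storage would plug in the same way: a delta-backed `TableStorage` plus whatever ends up behind `CatalogStorage`.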

Where does the delta-lake implementation stand? Vacuuming? Write operations?

Very simple reads right now. No writes yet.

Where is catalog persistence (Databases? Tables?) and what more remains?

See above for CatalogStorage & TableStorage. These are implemented for memory tables, and postgres has an implementation of CatalogStorage that only does table lookup, no actual catalog modifications.

@scsmithr scsmithr pinned this issue Sep 11, 2024
@scsmithr scsmithr unpinned this issue Nov 20, 2024