Questions (from email) #227

scsmithr · 2024-09-11T17:23:38Z

What's left to do for hybrid execution?

(req) Source/sink operators, client/server queue
Intelligent backoff
- Currently just spins: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_execution/src/hybrid/stream.rs#L125-L139
(req) Remaining proto (de)serialization for operators
- Currently missing a few.
- Additional tests
Location clustering optimization in plan
Remote attach verification
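
A minimal sketch of what the backoff could look like in place of the spin loop linked above, assuming a capped exponential delay. `Backoff`, `next_delay`, and `reset` are hypothetical names, not existing rayexec APIs, and the actual wait (e.g. an async sleep) is left to the caller.

```rust
use std::time::Duration;

/// Hypothetical helper (not part of rayexec): capped exponential backoff.
#[derive(Debug, Clone)]
pub struct Backoff {
    base: Duration,
    max: Duration,
    attempt: u32,
}

impl Backoff {
    pub fn new(base: Duration, max: Duration) -> Self {
        Backoff { base, max, attempt: 0 }
    }

    /// Returns the delay to wait before the next poll, then advances the state.
    /// The delay is `base * 2^attempt`, never exceeding `max`.
    pub fn next_delay(&mut self) -> Duration {
        // Saturate instead of overflowing once the shift gets too large.
        let factor = 1u32.checked_shl(self.attempt).unwrap_or(u32::MAX);
        let delay = self
            .base
            .checked_mul(factor)
            .unwrap_or(self.max)
            .min(self.max);
        self.attempt = self.attempt.saturating_add(1);
        delay
    }

    /// Call after a poll returns data so the next wait starts short again.
    pub fn reset(&mut self) {
        self.attempt = 0;
    }
}
```

The polling loop would call `next_delay()` after each empty poll and `reset()` once data arrives. Adding jitter would be a reasonable follow-up if many clients poll the same server.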

Authentication of the hybrid interface?

Currently no authentication. Anyone is able to connect to "https://server.rayexec.glaredb.com".

There will at some point be a proxy that intercepts and authenticates the request. There will also be some amount of catalog lookup here to ensure the query we're sending to the node has all info it needs for planning/executing.

Right now we don't need to do that since the client is able to send that info directly, and it doesn't need to be injected or read from anywhere.

For example (assuming wasm):

-- Stores attach info on client.
ATTACH postgres DATABASE AS mypg ...

-- Plans remotely by sending attach info to server.
SELECT * FROM mypg.schema.table

What's left before we can begin implementing more data sources (db vs files)?

Files should be ready enough to go.

E.g. parquet: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_parquet/src/lib.rs#L18-L65

There might be some slight changes but nothing major.

Databases I'm slightly less sure of. They'll need to implement connect: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_execution/src/datasource.rs#L64-L73

The semantics of DataSourceConnection still need to be hammered out around actually making changes to catalogs in remote databases, but I imagine it's ready enough for reads. E.g. postgres: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_postgres/src/lib.rs#L55-L68

What's the path to globbing file paths?

https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_io/src/location.rs#L66-L85

Either extend FileLocation or add a wrapper around it that handles globbing and produces FileLocations. Personally I like how simple FileLocation is right now, so I'm leaning more towards a FileList struct that can parse/handle hive & glob.
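
A rough sketch of the FileList idea, under some assumptions: paths are plain strings rather than the real FileLocation type, only `*` and `?` are supported (neither crosses a `/`), and listing the candidate paths is someone else's job. `glob_match`, `hive_partitions`, and the `FileList` methods are hypothetical names for illustration.

```rust
/// Hypothetical matcher: `*` matches any run of non-'/' chars, `?` one non-'/' char.
pub fn glob_match(pattern: &str, path: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let s: Vec<char> = path.chars().collect();
    match_from(&p, &s)
}

fn match_from(p: &[char], s: &[char]) -> bool {
    match p.split_first() {
        // Pattern exhausted: match only if the path is exhausted too.
        None => s.is_empty(),
        Some((&'*', rest)) => {
            // Try letting `*` consume 0, 1, 2, ... characters within one segment.
            let mut i = 0;
            loop {
                if match_from(rest, &s[i..]) {
                    return true;
                }
                if i < s.len() && s[i] != '/' {
                    i += 1;
                } else {
                    return false;
                }
            }
        }
        Some((&'?', rest)) => match s.split_first() {
            Some((&c, s_rest)) if c != '/' => match_from(rest, s_rest),
            _ => false,
        },
        Some((&c, rest)) => match s.split_first() {
            Some((&d, s_rest)) if c == d => match_from(rest, s_rest),
            _ => false,
        },
    }
}

/// Extracts hive-style `key=value` directory segments (the file name is ignored).
pub fn hive_partitions(path: &str) -> Vec<(String, String)> {
    let mut segments: Vec<&str> = path.split('/').collect();
    segments.pop(); // drop the trailing file name
    segments
        .iter()
        .filter_map(|seg| seg.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// Hypothetical wrapper: holds a glob pattern and yields the concrete paths it matches.
pub struct FileList {
    pattern: String,
}

impl FileList {
    pub fn new(pattern: impl Into<String>) -> Self {
        FileList { pattern: pattern.into() }
    }

    /// Filters an already-listed set of paths down to those matching the pattern.
    pub fn expand<'a>(&self, candidates: &[&'a str]) -> Vec<&'a str> {
        candidates
            .iter()
            .copied()
            .filter(|c| glob_match(&self.pattern, c))
            .collect()
    }
}
```

This keeps FileLocation untouched: FileList would own the pattern logic and hand back one plain location per matching file.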

How much work is it going to be to support different cloud provider object storage?

S3 already in. GCS will require the service account flow (#190). I don't think GCS would take too long, maybe a day or two.

I have no clue for Azure.

Where do we stand with the "native" storage? Updates? Inserts?

Temp tables exist with inserts.

"Native storage" will just be another data source. The idea is that datasources will return both a TableStorage and CatalogStorage implementation from connect (https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_execution/src/datasource.rs#L37-L41) and it's what I plan to use for the native storage (just plop in delta for table storage + whatever catalog stuff).

TableStorage implements this trait: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_execution/src/storage/table_storage.rs#L8-L18

This will be how tables are physically created/dropped/scanned.

DataTables implement this trait: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_execution/src/storage/table_storage.rs#L20-L40

CatalogStorage will implement this trait: https://github.com/glaredb/rayexec/blob/cfd482eba4020acf9d211138c520c62f5e081737/crates/rayexec_execution/src/storage/catalog_storage.rs#L7-L13

CatalogStorage is the larger unknown right now.
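
To make the shape concrete, here is a heavily simplified sketch of a data source handing back both storage pieces from connect. The trait methods, the `Connection` struct, `MemoryStorage`, and `connect_memory` are all stand-ins made up for illustration; the real traits are the ones linked above and will differ (e.g. in error types and signatures).

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Simplified stand-ins, NOT the actual rayexec trait definitions.
pub trait TableStorage: Send + Sync {
    fn create_table(&self, schema: &str, name: &str) -> Result<(), String>;
    fn drop_table(&self, schema: &str, name: &str) -> Result<(), String>;
    fn scan(&self, schema: &str, name: &str) -> Result<Vec<Vec<i64>>, String>;
}

pub trait CatalogStorage: Send + Sync {
    fn table_exists(&self, schema: &str, name: &str) -> bool;
}

/// Hypothetical result of `connect`: one handle per storage concern.
pub struct Connection {
    pub table_storage: Arc<dyn TableStorage>,
    pub catalog_storage: Arc<dyn CatalogStorage>,
}

/// In-memory backing store keyed by (schema, table); rows are just Vec<i64> here.
#[derive(Default)]
pub struct MemoryStorage {
    tables: Mutex<HashMap<(String, String), Vec<Vec<i64>>>>,
}

impl TableStorage for MemoryStorage {
    fn create_table(&self, schema: &str, name: &str) -> Result<(), String> {
        let mut tables = self.tables.lock().unwrap();
        let key = (schema.to_string(), name.to_string());
        if tables.contains_key(&key) {
            return Err(format!("table {schema}.{name} already exists"));
        }
        tables.insert(key, Vec::new());
        Ok(())
    }

    fn drop_table(&self, schema: &str, name: &str) -> Result<(), String> {
        let mut tables = self.tables.lock().unwrap();
        tables
            .remove(&(schema.to_string(), name.to_string()))
            .map(|_| ())
            .ok_or_else(|| format!("table {schema}.{name} not found"))
    }

    fn scan(&self, schema: &str, name: &str) -> Result<Vec<Vec<i64>>, String> {
        let tables = self.tables.lock().unwrap();
        tables
            .get(&(schema.to_string(), name.to_string()))
            .cloned()
            .ok_or_else(|| format!("table {schema}.{name} not found"))
    }
}

impl CatalogStorage for MemoryStorage {
    fn table_exists(&self, schema: &str, name: &str) -> bool {
        let tables = self.tables.lock().unwrap();
        tables.contains_key(&(schema.to_string(), name.to_string()))
    }
}

/// Both handles share one backing store, so catalog lookups see physical changes.
pub fn connect_memory() -> Connection {
    let mem = Arc::new(MemoryStorage::default());
    let table_storage: Arc<dyn TableStorage> = mem.clone();
    let catalog_storage: Arc<dyn CatalogStorage> = mem;
    Connection { table_storage, catalog_storage }
}
```

A native storage data source would presumably follow the same pattern, with delta behind the table storage handle and a persistent catalog behind the other.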

Where does the delta-lake implementation stand? Vacuuming? Write operations?

Very simple reads right now. No writes yet.

Where is catalog persistence (Databases? Tables?) and what more remains?

See above for CatalogStorage & TableStorage. These are implemented for memory tables, and postgres has an implementation of CatalogStorage that only does table lookup, no actual catalog modifications.

scsmithr pinned this issue Sep 11, 2024

scsmithr unpinned this issue Nov 20, 2024
