"datasette insert" command and plugin hook #1160

Open
simonw opened this issue Dec 29, 2020 · 23 comments

simonw commented Dec 29, 2020

Tools for loading data into Datasette currently mostly exist as separate utilities - yaml-to-sqlite and csvs-to-sqlite and suchlike.

Bringing these into Datasette could have some interesting properties:

  • A datasette insert command could be extended with plugins to handle more formats
  • Any format that can be inserted on the command-line could also be inserted using a web UI or web API - which would benefit from new format plugin hooks
  • If Datasette ever grows beyond SQLite (see Prototype for Datasette on PostgreSQL #670) a built-in import mechanism could work for those other databases as well - without me needing to write yaml-to-postgresql and suchlike

simonw commented Dec 29, 2020

Basic command design:

datasette insert data.db blah.csv

The options can include:

  • --format to specify the exact format - without this it will be guessed based on the filename
  • --table to specify the table (otherwise the filename is used)
  • --pk to specify one or more primary key columns
  • --replace to specify that existing rows with a matching primary key should be replaced
  • --upsert to specify that existing matching rows should be upserted
  • --ignore to ignore matching rows
  • --alter to alter the table to add missing columns
  • --type to specify the type of a column (takes a column name and a type) - useful when working with CSV or TSV files
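
A minimal sketch of how that signature might translate to Click (hypothetical - the option wiring shown here is illustrative, not the actual implementation):

    import click

    @click.command()
    @click.argument("database", type=click.Path())
    @click.argument("files", type=click.Path(), nargs=-1, required=True)
    @click.option("--format", "format_", help="Exact format - guessed from the filename if omitted")
    @click.option("--table", help="Table to insert into - defaults to the filename")
    @click.option("--pk", multiple=True, help="One or more primary key columns")
    @click.option("--replace", is_flag=True, help="Replace rows with a matching primary key")
    @click.option("--upsert", is_flag=True, help="Upsert rows with a matching primary key")
    @click.option("--ignore", is_flag=True, help="Ignore rows with a matching primary key")
    @click.option("--alter", is_flag=True, help="Alter the table to add missing columns")
    @click.option("--type", "types", nargs=2, multiple=True, help="Column name and type, e.g. --type age integer")
    def insert(database, files, format_, table, pk, replace, upsert, ignore, alter, types):
        "Insert data from FILES into the given DATABASE."
        ...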


simonw commented Dec 29, 2020

The UI can live at /-/insert and be available by default to the root user only. It can offer the following:

  • Upload a file and have the import type detected (equivalent to datasette insert data.db thatfile.csv)
  • Copy and paste the data to be inserted into a textarea
  • API equivalents of these


simonw commented Dec 29, 2020

Other names I considered:

  • datasette load
  • datasette import - I decided to keep this name available for any future work that might involve plugins that help import data from APIs as opposed to inserting it from files


simonw commented Dec 29, 2020

Default formats to support:

  • CSV
  • TSV
  • JSON and newline-delimited JSON
  • YAML

Each of these will be implemented as a default plugin.
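
A rough sketch of the shape one of those default plugins could take (the register_insert_formats hook name and the class interface are hypothetical - the real hook design is still open):

    import csv
    from datasette import hookimpl

    class CsvFormat:
        name = "csv"

        def can_handle(self, filename):
            # Claim files by extension; an explicit --format would override this guess
            return filename.endswith(".csv")

        def rows(self, stream):
            # Parse as a stream, yielding one dictionary per record
            reader = csv.reader(stream)
            headers = next(reader)
            for row in reader:
                yield dict(zip(headers, row))

    @hookimpl
    def register_insert_formats():
        return [CsvFormat()]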


simonw commented Dec 29, 2020

The insert command should also accept URLs - anything starting with http:// or https://.

It should accept more than one file name at a time for bulk inserts.

If a URL is used, that URL will be passed to the method that decides whether a plugin implementation can handle the import or not. This will allow plugins to register themselves for specific websites.
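
Building on the hypothetical can_handle() method sketched earlier, a plugin registering itself for a specific website might look like this (illustrative only):

    class GitHubApiFormat:
        name = "github-api"

        def can_handle(self, filename_or_url):
            # Claim URLs for a specific site rather than a file extension
            return filename_or_url.startswith("https://api.github.com/")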


simonw commented Dec 29, 2020

It would be pretty cool if you could launch Datasette directly against an insert-compatible file or URL without first having to load it into a SQLite database file.

Or imagine being able to tail a log file and pipe it directly into a new Datasette process, which then runs a web server with the UI while simultaneously continuing to load new entries from that log into the in-memory SQLite database that it is serving...

Not quite sure what that CLI interface would look like. Maybe treat that as a future stretch goal for the moment.


simonw commented Dec 29, 2020

Potential design for this: a datasette memory command which takes most of the same arguments as datasette serve but starts an in-memory database and treats the command arguments as things that should be inserted into that in-memory database.

tail -f access.log | datasette memory - \
  --format clf -p 8002 -o

simonw closed this as completed Dec 29, 2020

simonw commented Dec 29, 2020

Given the URL option, could it be possible for plugins to "subscribe" to URLs that keep on streaming?

datasette insert db.db https://example.com/streaming-api \
  --format api-stream

simonw reopened this Dec 29, 2020

simonw commented Dec 29, 2020

More thoughts on this: the key mechanism that populates the tables needs to be an async def method of some sort, so that it can run as part of the async loop in core Datasette - for importing from web uploads.


simonw commented Dec 29, 2020

If I'm going to execute 1000s of writes in an async def operation it may make sense to break that up into smaller chunks, so as not to block the event loop for too long.

https://stackoverflow.com/a/36648102 and python/asyncio#284 confirm that await asyncio.sleep(0) is the recommended way of doing this.
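
A minimal sketch of that pattern, assuming the rows arrive as an iterable of dictionaries and some callable performs the actual write (both names are illustrative):

    import asyncio
    import itertools

    async def insert_in_chunks(rows, insert_batch, batch_size=100):
        it = iter(rows)
        while batch := list(itertools.islice(it, batch_size)):
            insert_batch(batch)
            # Hand control back to the event loop before the next batch
            await asyncio.sleep(0)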


simonw commented Dec 29, 2020

It would be neat if datasette insert could accept a --plugins-dir option which allowed one-off format plugins to be registered. Bit tricky to implement since the --format Click option will already be populated by that plugin hook call.


simonw commented Dec 29, 2020

Figuring out the API design

I want to be able to support different formats, and be able to parse them into tables either as a stream or in one go, depending on whether the format supports streaming.

Ideally I want to be able to pull the first 1,024 bytes for the purpose of detecting the format, then replay those bytes again later. I'm considering this a stretch goal though.
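
One way that replay could work - a sketch of a hypothetical wrapper class, not an existing API:

    class ReplayableStream:
        "Buffers the first peek_size bytes of a stream so they can be read twice."

        def __init__(self, stream, peek_size=1024):
            self._stream = stream
            self._head = stream.read(peek_size)
            self._pos = 0

        def peek(self):
            # The first bytes, for format detection - does not consume them
            return self._head

        def read(self, size=-1):
            if self._pos < len(self._head):
                if size < 0:
                    chunk = self._head[self._pos:] + self._stream.read()
                    self._pos = len(self._head)
                    return chunk
                chunk = self._head[self._pos:self._pos + size]
                self._pos += len(chunk)
                if len(chunk) < size:
                    chunk += self._stream.read(size - len(chunk))
                return chunk
            return self._stream.read(size)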

CSV is easy to parse as a stream - here’s how sqlite-utils does it:

    # Stream the CSV (or TSV) as dictionaries, one per row
    dialect = "excel-tab" if tsv else "excel"
    with file_progress(json_file, silent=silent) as json_file:
        reader = csv_std.reader(json_file, dialect=dialect)
        headers = next(reader)  # the first row supplies the column names
        docs = (dict(zip(headers, row)) for row in reader)

Problem: using db.insert_all() could block for a long time on a big set of rows. Probably easiest to batch the records before calling insert_all() and then run a batch at a time using a db.execute_write_fn() call.
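
That could look something like this against Datasette's Database class - db.execute_write_fn() is the real API, the batching wrapper around it is a sketch:

    import itertools
    import sqlite_utils

    async def insert_batches(db, table, docs, batch_size=100):
        it = iter(docs)
        while batch := list(itertools.islice(it, batch_size)):
            def write(conn, batch=batch):
                # Runs on Datasette's dedicated write thread
                sqlite_utils.Database(conn)[table].insert_all(batch)
            await db.execute_write_fn(write, block=True)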


simonw commented Dec 29, 2020

Important detail from https://docs.python.org/3/library/csv.html#csv.reader

If csvfile is a file object, it should be opened with newline=''. [1]

[...]

If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
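
In practice that means opening uploaded files like this (a minimal example of the documented requirement):

    import csv

    with open("data.csv", newline="") as csvfile:
        for row in csv.reader(csvfile):
            print(row)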


simonw commented Dec 29, 2020

Does it definitely make sense to break this operation up into the code that turns the incoming format into an iterator of dictionaries, then the code that inserts those into the database using sqlite-utils?

That seems right for simple imports, where the incoming file represents a sequence of records in a single table. But what about more complex formats? What if a format needs to be represented as multiple tables?
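
One purely illustrative possibility: the parsing side could yield (table, row) pairs instead of assuming a single destination table:

    def parse_multi_table(stream):
        # Hypothetical format that splits one file across two related tables
        yield ("issues", {"id": 1, "title": "Example issue"})
        yield ("comments", {"id": 10, "issue_id": 1, "body": "Example comment"})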


simonw commented Dec 29, 2020

Aside: maybe datasette insert works against simple files, but a later mechanism called datasette import allows plugins to register sub-commands, like datasette import github ... or datasette import jira ... or whatever.

This would be useful for import mechanisms that are likely to need their own custom set of command-line options unique to that source.


simonw commented Dec 29, 2020

What's the simplest thing that could possibly work? I think it's datasette insert blah.db data.csv - no URL handling, no other formats.


simonw commented Dec 29, 2020

It would be nice if this abstraction could support progress bars as well. These won't necessarily work for every format - or they might work for things loaded from files but not things loaded over URLs (if the content-length HTTP header is missing) - but if they ARE possible it would be good to provide them - both for the CLI interface and the web insert UI.
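
A sketch of the file case - click.progressbar() is real Click API, but the wiring into the insert abstraction is illustrative:

    import os
    import click

    def rows_with_progress(path, parse):
        # parse: a callable that turns an open binary file into an iterator of rows
        total = os.path.getsize(path)
        with open(path, "rb") as f, click.progressbar(length=total, label=path) as bar:
            for row in parse(f):
                # Advance by however many bytes the parser has consumed so far
                bar.update(f.tell() - bar.pos)
                yield row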


simonw commented Dec 29, 2020

I'm going to break out some separate tickets.


simonw commented Dec 29, 2020

How much of this should I get done in a branch before merging into main?

The challenge here is the plugin hook design: ideally I don't want an incomplete plugin hook design in main since that could be a blocker for a release.


simonw commented Dec 29, 2020

If I design this right I can ship a full version of the command-line datasette insert command in a release without doing any work at all on the Web UI version of it - that UI can then come later, without needing any changes to be made to the plugin hook.


simonw commented Dec 29, 2020

The documentation for this plugin hook is going to be pretty detailed, since it involves writing custom classes.

I'll stick it all on the existing hooks page for the moment, but I should think about breaking up the plugin hook documentation into a page-per-hook in the future.


simonw commented Dec 29, 2020

If I can get this working for CSV, TSV, JSON and JSON-NL that should be enough to exercise the API design pretty well across both streaming and non-streaming formats.


simonw commented Jan 3, 2021

Should this command include a --fts option for configuring full-text search on one or more columns?

I thought about doing that for sqlite-utils insert in simonw/sqlite-utils#202 and decided not to because of the need to include extra options covering the FTS version, porter stemming options and whether or not to create triggers.

But maybe I can set sensible defaults for that with datasette insert ... -f title -f body? Worth thinking about a bit more.
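
For reference, the sqlite-utils method those defaults would wrap - enable_fts() is the real API, the specific defaults shown are speculative:

    import sqlite_utils

    db = sqlite_utils.Database("data.db")
    # Possible defaults: FTS5, with triggers to keep the index up to date
    db["documents"].enable_fts(["title", "body"], fts_version="FTS5", create_triggers=True)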
