From f0230346a496da483738b1dca76ac76853b03971 Mon Sep 17 00:00:00 2001
From: Matt Jadud
Date: Fri, 27 Dec 2024 14:54:59 -0500
Subject: [PATCH] Cleaning up docs

---
 docs/README.md                                |   8 +-
 docs/architecture/README.md                   |   5 -
 docs/architecture/databases.md                | 192 +++++++++++
 docs/architecture/design_postgres.md          | 319 ------------------
 docs/architecture/design_sqlite.md            | 136 --------
 docs/architecture/domain64.md                 |  39 ++-
 docs/architecture/{docs => }/entree.md        |   0
 docs/architecture/{docs => }/extract.md       |   0
 docs/architecture/{docs => }/fetch.md         |   0
 docs/architecture/goals_and_principles.md     |  53 ---
 docs/architecture/{docs => }/index.md         |   1 -
 docs/architecture/{docs => }/migrate.md       |   0
 docs/architecture/mkdocs.yml                  |  13 -
 docs/architecture/{docs => }/pack.md          |   0
 docs/architecture/{docs => }/pipeline.md      |   0
 docs/architecture/{docs => }/processing.md    |   0
 docs/architecture/requirements.txt            |   3 -
 docs/architecture/{docs => }/tooling.md       |   0
 docs/{architecture/docs => images}/whacks.png | Bin
 docs/{architecture/docs => }/principles.md    |   6 +-
 docs/process/good_faith_mou.md                |  47 ---
 21 files changed, 236 insertions(+), 586 deletions(-)
 delete mode 100644 docs/architecture/README.md
 create mode 100644 docs/architecture/databases.md
 delete mode 100644 docs/architecture/design_postgres.md
 delete mode 100644 docs/architecture/design_sqlite.md
 rename docs/architecture/{docs => }/entree.md (100%)
 rename docs/architecture/{docs => }/extract.md (100%)
 rename docs/architecture/{docs => }/fetch.md (100%)
 delete mode 100644 docs/architecture/goals_and_principles.md
 rename docs/architecture/{docs => }/index.md (99%)
 rename docs/architecture/{docs => }/migrate.md (100%)
 delete mode 100644 docs/architecture/mkdocs.yml
 rename docs/architecture/{docs => }/pack.md (100%)
 rename docs/architecture/{docs => }/pipeline.md (100%)
 rename docs/architecture/{docs => }/processing.md (100%)
 delete mode 100644 docs/architecture/requirements.txt
 rename docs/architecture/{docs => }/tooling.md (100%)
 rename docs/{architecture/docs => images}/whacks.png (100%)
 rename docs/{architecture/docs => }/principles.md (97%)
 delete mode 100644 docs/process/good_faith_mou.md

diff --git a/docs/README.md b/docs/README.md
index cda81823..658a3da1 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -3,7 +3,7 @@ title: Jemison documentation
 
 Welcome to `jemison`, a small-but-mighty search engine. Below is a map to the many kinds of documentation (process, decisions, practices, etc.) for this project.
 
-* [Good Faith MOU](process/good_faith_mou.md) (a project charter of sorts)
-* [Architectural decision records](adr/)
-* [System diagrams](diagrams/)
-* [Agile process](process/agile.md)
\ No newline at end of file
+* [principles](principles.md) (a project charter of sorts)
+* [architectural decision records](adr/) are where decisions about the product are made and recorded
+* [agile process](process/agile.md) describes how we engage with our repositories and code
+* [architecture](architecture/index.md) describes the architecture of the search engine
\ No newline at end of file
diff --git a/docs/architecture/README.md b/docs/architecture/README.md
deleted file mode 100644
index 81870ce6..00000000
--- a/docs/architecture/README.md
+++ /dev/null
@@ -1,5 +0,0 @@
-This is a `mkdocs` site.
-
-`mkdocs serve`
-
-should allow for the local browsing of this Markdown documentation in a browser.
\ No newline at end of file
diff --git a/docs/architecture/databases.md b/docs/architecture/databases.md
new file mode 100644
index 00000000..0eee4538
--- /dev/null
+++ b/docs/architecture/databases.md
@@ -0,0 +1,192 @@
+# databases
+
+The engine is supported by multiple databases. We do this because different databases may need to scale differently in the future, and we are taking a preemptive step in that direction.
+
+## unique data values
+
+There are a small number of unique data representations in our application that merit note.
+
+* the [domain64](domain64.md) representation for domain names. This is a 64-bit integer encoding of domain names that we use for search optimization and table partitioning.
+
+## queues
+
+The queues database serves only one purpose: to handle the queues used by Jemison.
+
+Our queueing system gets hit hard, and therefore we do all of that work on one database. Further, the table migrations are automanaged by the [river](https://riverqueue.com/) library. Keeping it separate protects both the queue tables and any operational tables we create in the application.
+
+## work
+
+The "work" database is where application tables specific to the processing of data live.
+
+### guestbook
+
+The guestbook is where we keep track of URLs that have been fetched, or that we want to fetch. These tables live in the `cmd/migrate` app, which handles our migrations on every deploy. [These are dbmate migrations](https://github.com/GSA-TTS/jemison/tree/main/cmd/migrate/work_db/db/migrations).
+
+```sql
+create table guestbook (
+    id bigint generated always as identity primary key,
+    domain64 bigint not null,
+    last_modified timestamp,
+    last_fetched timestamp,
+    next_fetch timestamp not null,
+    scheme integer not null default 1,
+    content_type integer not null default 1,
+    content_length integer not null default 0,
+    path text not null,
+    unique (domain64, path)
+);
+```
+
+The dates drive a significant part of the entree/fetch algorithms.
+
+* `last_modified` is EITHER the timestamp provided by the remote webserver for any given page, OR, if not present, a value we assign in `fetch`, setting it to the last-fetched timestamp.
+* `last_fetched` is the time that the page was fetched. This is updated every time we fetch the page.
+* `next_fetch` is a computed value; if a page is intended to be fetched weekly, then `fetch` will set this as the current time plus one week at the time the page is fetched.
+
+### hosts
+
+```sql
+create table hosts (
+    id bigint generated always as identity primary key,
+    domain64 bigint,
+    next_fetch timestamp not null,
+    unique(id),
+    unique(domain64),
+    constraint domain64_domain
+        check (domain64 > 0 and domain64 <= max_bigint())
+);
+```
+
+Like the `guestbook`, this table plays a role in determining whether a given domain should be crawled. If we want to crawl a domain *right now*, we set the `next_fetch` value in this table to yesterday, allowing all crawls of URLs under this domain to be valid.
+
+## search
+
+The `search` database holds our data pipelines and the tables that get actively searched.
+
+This database is not (yet) well designed. Currently, there is a notion of a `raw_content` table, which is where `pack` deposits text.
+
+```sql
+CREATE TABLE raw_content (
+    id BIGSERIAL PRIMARY KEY,
+    host_path BIGINT references guestbook(id),
+    tag TEXT,
+    content TEXT
+);
+```
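+
+As a sketch of what lands here (the guestbook id and text below are invented for illustration), a `pack` insert might look like:
+
+```sql
+-- hypothetical rows: guestbook id 42 stands in for some (domain64, path) pair
+insert into raw_content (host_path, tag, content)
+values
+    (42, 'title', 'About GSA'),
+    (42, 'p', 'GSA provides workplaces and acquisition services.');
+```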
+
+From there, it is unclear how best to structure and optimize the content.
+
+There are two early-stage ideas. Both have tradeoffs in terms of performance and implementation complexity, and it is not clear yet which to pursue.
+
+### one idea: inheritance
+
+https://www.postgresql.org/docs/current/tutorial-inheritance.html
+
+We could define a searchable table as `gov`.
+
+```sql
+create table gov (
+    id ...,
+    host_path ...,
+    tag ...,
+    content ...
+);
+```
+
+From there, we could have *empty* inheritance tables.
+
+```sql
+create table gsa () inherits (gov);
+create table hhs () inherits (gov);
+create table nih () inherits (gov);
+```
+
+and, from there, the next level down:
+
+```sql
+create table cc () inherits (nih);
+create table nccih () inherits (nih);
+create table nia () inherits (nih);
+```
+
+Then, insertions happen at the **leaves**. That is, we only insert at the lowest level of the hierarchy. However, we can then query tables higher up, and get results from the entire tree.
+
+This does three things:
+
+1. It lets queries against a given domain happen naturally. If we want to query `nia.nih.gov`, we target that table with our query.
+2. If we want to query all of `nih`, then we query the `nih` table.
+3. If we want to query everything, we target `gov` (or another tld).
+
+Given that we are going to treat these tables as build artifacts, we can always regenerate them. And, it is easy to add new tables through a migration; we just add a new create table statement.
+
+(See [this article](https://medium.com/miro-engineering/sql-migrations-in-postgresql-part-1-bc38ec1cbe75) about partitioning/inheritance, indexing, and migrations. It's gold.)
+
+### declarative partitioning
+
+Another approach is to use `PARTITION`s.
+
+This would suggest our root table has columns we can use to drive the derivative partitions.
+
+```sql
+create table gov (
+    id ...,
+    domain64 BIGINT,
+    host_path ...,
+    tag ...,
+    content ...
+) partition by range (domain64);
+```
+
+To encode all of the TLDs, domains, and subdomains we will encounter, we'll use a `domain64` encoding. Why? It maps the entire URL space into a single, 64-bit number (or, `BIGINT`).
+
+```
+FF:FFFFFF:FFFFFF:FF
+```
+
+or
+
+```
+tld:domain:subdomain:subsub
+```
+
+This is described in more detail in [domain64.md](domain64.md).
+
+As an example:
+
+| tld | domain | sub | hex                | dec               |
+|-----|--------|-----|--------------------|-------------------|
+| gov | gsa    | _   | #x0100000100000000 | 72057598332895232 |
+| gov | gsa    | tts | #x0100000100000100 | 72057598332895488 |
+| gov | gsa    | api | #x0100000100000200 | 72057598332895744 |
+
+GSA is the range #x0100000100000000 -> #x01000001FFFFFFFF, or 72057598332895232 -> 72057602627862527 (a diff of 4294967295). Nothing else can be in that range, because we're using the bitstring to partition off ranges of numbers.
+
+Now, everything becomes bitwise operations on 64-bit integers, which will be fast everywhere... and, our semantics map well to our domain.
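+
+A minimal sketch of those bitwise operations (assuming the `FF:FFFFFF:FFFFFF:FF` layout above; the masks are written in decimal for readability):
+
+```sql
+-- pull the tld, domain, and subdomain fields out of gov/gsa/tts
+select (d64 >> 56) & 255      as tld,       -- 1 (gov)
+       (d64 >> 32) & 16777215 as domain,    -- 1 (gsa)
+       (d64 >> 8)  & 16777215 as subdomain  -- 1 (tts)
+from (values (72057598332895488::bigint)) as t(d64);
+```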
+
+Partitioning to get a table with only GSA entries is
+
+```sql
+CREATE TABLE govgsa PARTITION OF gov
+    FOR VALUES FROM (72057598332895232) TO (72057602627862528);
+```
+
+(Range partition bounds are inclusive below and *exclusive* above, so the `TO` value is one past the last domain64 in the range.) Or, just one subdomain in the space:
+
+```sql
+CREATE TABLE govgsatts PARTITION OF gov
+    FOR VALUES FROM (72057598332895488) TO (72057598332895744);
+```
+
+or we can keep the hex representation, casting a bit-string literal to `BIGINT` (Postgres 12+ accepts variable-free expressions, but not subqueries, as partition bounds):
+
+```sql
+CREATE TABLE govgsatts PARTITION OF gov
+    FOR VALUES FROM (x'0100000100000100'::bigint) TO (x'0100000100000200'::bigint);
+```
+
+All table operations are on the top-level table (insert, etc.), the indexes and whatnot are inherited automatically, and I can search the TLD, domain, or subdomain without difficulty---because it all becomes a question of what range the `domain64` value is in.
+
+
diff --git a/docs/architecture/design_postgres.md b/docs/architecture/design_postgres.md
deleted file mode 100644
index a3b0b56a..00000000
--- a/docs/architecture/design_postgres.md
+++ /dev/null
@@ -1,319 +0,0 @@
-# postgres design
-
-## the issue with the current design
-
-The SQLite-based approach will have scaling issues in cloud.gov.
-
-Specifically, cloud.gov puts a hard 7GB limit on a single compute instance. This means multiple (bad) things for an approach based on SQLite databases.
-
-1. As we pack a DB, it must be smaller than 7GB.
-2. If we want to `VACUUM` a DB, it must be smaller than ~2GB. (`VACUUM` may require 2x the size of the DB for cleanup and compression. This suggests a 2GB DB might require 4GB additional while undergoing a `VACUUM` before deployment.)
-3. Large sites are clearly larger than 7GB (or 2GB)
-
-So, what does this look like if we use Postgres? A few design ideas...
-
-## overall: a data pipeline (again)
-
-The FAC database design recently turned into a data pipeline. It's a smaller set of data (which makes copying fast/inexpensive), but we basically engage in a series of safe transformations to the data that let us do interesting things.
-
-The rest of Jemison is designed this way, so why not the DB?
-
-The first thing is to get the content that `fetch`/`walk`/`extract` develop into a database. From there, we can run jobs that do interesting things. The important thing is that content lands somewhere.
-
-## three databases (possibly four)
-
-The `queues` database currently is where `river` does all of its work. This is a high-traffic zone, and probably a bad place to put any other tables. It can be a small instance, but it probably should be all by itself. Let the queues thrash.
-
-Then, we need to do a few things:
-
-1. Track what URLs have been visited, and when
-2. Get the content in, and ready for processing/indexing
-
-Arguably... we could skip a table, and go straight to "what we need." But, that would imply writing custom code to go from S3->Pg. Better to have a generic "slurp the content into Pg," and then use SQL for the rest of the data pipeline.
-
-### constants / lookup tables
-
-There are a few lookup tables that we want, to keep things small in the otherwise large tables.
-
-We could use `ENUM`. There are tradeoffs:
-
-https://neon.tech/postgresql/postgresql-tutorial/postgresql-enum
-
-Long-and-short, I think I'll stick to lookup tables. Extensible, simple, easy to maintain via migration.
-
-> [!IMPORTANT]
-> The idea of using [domain64](domain64.md) notation for the content tables was developed after some of this content was written. I've come back around and updated the section regarding lookup tables to reflect this idea.
-
-#### `schemes`
-
-Huh.
[Don't do this](https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_serial). - -```sql -create table schemes ( - id integer generated by default as identity primary key, - scheme text, - unique(scheme) -); - -insert into schemes - (id, scheme) - values - (1, 'https'), (2, 'http'), (3, 'ftp') - on conflict do nothing -; -``` - -#### `tlds` - -A TLD should map to its `domain64` representation. - -```sql -create table tlds ( - id, - tld text, - unique(id), - unique(tld), - constraint domain64_tld check (id > 0 and id < 256) -); - -insert into tlds - (id, tld) - values - (1, 'gov'), (2, 'com'), (3, 'org'), (4, 'net') - on conflict do nothing -; -``` - -These can be maintained as [jsonnet](https://jsonnet.org) files in the repository, and loaded on every deploy. - -#### `content_type` - -```sql -create table content_types ( - id integer generated by default as identity primary key, - content_type text - unique(content_type) -); - -insert into content_types - (id, content_type) - values - (1, 'application/octet-stream'), (2, 'text/html'), (3, 'application/pdf') - on conflict do nothing -; -``` - -We will use [simplified/clean MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/MIME_types/Common_types). Some get more complex when the webserver talks to us. - -#### `hosts` - -I want to know the hosts I should know about. This is important for crawling: we never want to wander outside the space of what we are supposed to know. However, this should be information that is *derivative* of the domain64 information we're tracking. - -From a domain64 perspective, these are the domains. - -```sql -create table domains ( - id bigint, - domain text, - unique(id), - unique(domain), - constraint domain64_domain check (domain > 0 and domain <= select(x'FFFFFF')) -); - -insert into hosts - (id, host) - values - (...) - on conflict do nothing -; -``` - -Again, this will be a Jsonnet file. The uniqueness constraint will help prevent re-ordering. New domains should only ever get new indexes, and we should never renumber. That is, if we remove a domain, we should *comment it out* in the Jsonnet, and "retire" the number. *Monotonically increasing* is a term that comes to mind. - -#### `tags` - -```sql -create table tags ( - id integer generated by default as identity primary key, - tag text not null, - unique(tag) -; - -insert into tags - (id, tag) - values - (1, 'title'), - (2, 'p'), - (3, 'div'), - (...) -; -``` - -### the guestbook - -With those tables in place... - -```sql -create table if not exists guestbook ( - id bigserial primary key, - last_modified timestamp not null, -- should this be nullable? - last_fetched timestamp not null, - next_fetch timestamp not null, - scheme integer not null references schemes(id) default 1, - host integer references hosts(id) not null, - content_type integer references content_types(id) default 1, -- nullable? - content_length integer not null default 0, - path text not null, - unique (host, path) -); -``` - -This used to have a `sha1` of the content, and the content length. The SHA1 is probably less useful than the ETag... and that might be debateable. Defaulting the `content_length` to 0 feels right... I'd rather always find a number than have `null`. - -#### a note about table ordering... - -*re mi do do so* - -https://docs.gitlab.com/ee/development/database/ordering_table_columns.html - -The `guestbook` table (and tables that follow) will get big enough that padding will matter. 
Therefore, column ordering (8 word values, then 4, then variable) will matter a great deal for page alignment. - -#### `guestbook` is a lookup table - -In many ways. But, the `host/path` combination now has a unique id. This means that we can use that `id` in other tables, and know that we're referring to a particular path on a particular host. For the content tables, we'll refer to the guestbook `id`. - -*What happens if we loose the guestbook table?* - -Then we need to start a fresh crawl. That, or we back it up every now and then. We'll see what we do. - -## the content - -`extract` puts extracted content into S3, and we'll then pack that content into a table. The `raw_content` table is the first step of our SQL-based data pipeline. It will be in the same database as the `guestbook`, unless we ultimately decide that the cost of having it there is prohibitive (in performance or space). - -```sql -CREATE TABLE raw_content ( - id BIGSERIAL PRIMARY KEY, - host_path BIGINT references guestbook(id), - tag TEXT default , - content TEXT -) -``` - -Hm. Now, the `title` becomes a `tag`. A header becomes a `tag`. And, all the body content is by-tag. We can still prioritize/weight things by `tag`. (A `path` tag might even be a way to put all content, including paths, into one table.) - -## the pipeline - -From here, the question is "what kinds of search do we want to support?" - -The `raw_content` table will end up being... at least 25M rows, possibly pushing closer to 100M rows. However, the idea here is not to work with this table directly. It is where we have an up-to-date version of the content of websites and PDFs (and other documents) in a location that is ready and amenable to batch, pipeline processing in SQL. - -### one idea: inheritence. - -https://www.postgresql.org/docs/current/tutorial-inheritance.html - -We could define a searchable table as `gov`. - -```sql -create table gov ( - id ..., - host_path ..., - tag ..., - content ... -); -``` - -From there, we could have *empty* inheritence tables. - -```sql -create table gsa () inherits (gov); -create table hhs () inherits (gov); -create table nih () inherits (gov); -``` - -and, from there, the next level down: - -```sql -create table cc () inherits (nih); -create table nccih () inherits (nih); -create table nia () inherits (nih); -``` - -Then, insertions happen at the **leaves**. That is, we only insert at the lowest level of the hierarchy. However, we can then query tables higher up, and get results from the entire tree. - -This does two things: - -1. It lets queries against a given domain happen naturally. If we want to query `nia.nih.gov`, we target that table with our query. -2. If we want to query all of `nih`, then we query the `nih` table. -3. If we want to query everything, we target `gov` (or another tld). - -Given that we are going to treat these tables as build artifacts, we can always regenerate them. And, it is possible to add new tables through a migration easily; we just add a new create table statement. - -(See [this article](https://medium.com/miro-engineering/sql-migrations-in-postgresql-part-1-bc38ec1cbe75) about partioning/inheritence, indexing, and migrations. It's gold.) - -### declarative partitioning - -Another approach is to use `PARTITION`s. - -This would suggest our root table has columns we can use to drive the derivative partitions. - -```sql -create table gov ( - id ..., - domain64 BIGINT, - host_path ..., - tag ..., - content ... 
- partition by range(domain64) -); -``` - -To encode all of the TLDs, domains, and subdomains we will encounter, we'll use a `domain64` encoding. Why? It maps the entire URL space into a single, 64-bit number (or, `BIGINT`). - -``` -FF:FFFFFF:FFFFFF:FF -``` - -or - -``` -tld:domain:subdomain:subsub -``` - -This is described more in detail in [domain64.md](domain64.md). - -As an example: - -| tld | domain | sub | hex | dec | -|-----|--------|-----|----------------------|-------------------| -| gov | gsa | _ | #x0100000100000000 | 72057598332895232 | -| gov | gsa | tts | #x0100000100000100 | 72057598332895488 | -| gov | gsa | api | #x0100000100000200 | 72057598332895744 | - -GSA is from the range #x0100000001000000 -> #x0100000001FFFFFF, or 72057594054705152 -> 72057594071482367 (a diff of 16777215). Nothing else can be in that range, because we're using the bitstring to partition off ranges of numbers. - -Now, everything becomes bitwise operations on 64-bit integers, which will be fast everywhere... and, our semantics map well to our domain. - -Partitioning to get a table with only GSA entries is - -```sql -CREATE TABLE govgsa PARTITION OF gov - FOR VALUES FROM (72057598332895232) TO (72057602627862527); -``` - -Or, just one subdomain in the space: - -```sql -CREATE TABLE govgsatts PARTITION OF gov - FOR VALUES FROM (72057598332895488) TO (72057598332895743); -``` - -or we can keep the hex representation: - -```sql -CREATE TABLE govgsatts PARTITION OF gov - FOR VALUES FROM (select x'0100000100000100') TO (select x'01000001000001FF'); -``` - -All table operations are on the top-level table (insert, etc.), the indexes and whatnot are inherited automatically, and I can search the TLD, domain, or subdomain without difficulty---because it all becomes a question of what range the `domain64` value is in. - - diff --git a/docs/architecture/design_sqlite.md b/docs/architecture/design_sqlite.md deleted file mode 100644 index f7b4ad9c..00000000 --- a/docs/architecture/design_sqlite.md +++ /dev/null @@ -1,136 +0,0 @@ -# sqlite / `serve` - -The output of `pack` (and `superpack`) is an SQLite file. - -We copy this file from S3 to a host, and serve queries out of it directly. This is because it is *very* fast, and it saves us from having to garden/care for a live database server under production loads. (Or: our search can be scaled horizontally across small EC2 instances, as opposed to trying to figure out how to get read replicas/etc. running well under the brokered environment of cloud.gov). - -## connection string / driver - -https://github.com/mattn/go-sqlite3?tab=readme-ov-file#:~:text=see%20PRAGMA%20defer_foreign_keys-,Foreign%20Keys,-_foreign_keys%20%7C%20_fk - -We need foreign keys. - -`_fk=true` - - -## metadata - -Every SQLite file needs a `metadata` table. This helps us know what we are serving from. For example, we may someday go from "version 1" of our tables to "version 2." When we do this: - -* `pack` will start producing version 2 tables, and -* `serve` will need to know how to serve both v1 and v2 tables. - -This will only be true until we have re-crawled or otherwise regenerated all of our SQLite databases. At that point, `serve` can "forget" how to handle v1 tables. - -### `metadata` - -Metadata is a k/v table. Therefore, the design is two columns. Remember, SQLite *barely* has types. - -| column | type | -| --- | --- | -| key | TEXT | -| value | TEXT | - -The following keys MUST exist: -* `version`, which is an integer. Only update for breaking changes. -* `last_updated`, a `DATE`. 
This is the date we wrote the file.
-
-### `paths`
-
-These are not for searching, but to keep tables small.
-
-```sql
-CREATE TABLE paths (
-    id PRIMARY KEY,
-    path TEXT,
-    UNIQUE(path)
-);
-```
-
-### `titles`
-
-Titles will go in their own table.
-
-```sql
-CREATE TABLE titles (
-    id PRIMARY KEY,
-    path INTEGER,
-    title TEXT,
-    FOREIGN KEY(path) REFERENCES paths(id)
-);
-```
-
-### `headers`
-
-The `level` is the integer portion of `H1`, `H2`, etc.
-
-```sql
-CREATE TABLE headers (
-    id PRIMARY KEY,
-    path INTEGER,
-    level INTEGER,
-    header TEXT,
-    FOREIGN KEY(path) REFERENCES paths(id)
-);
-```
-
-### `content`
-
-This is the body content of the page. So, everything but the other tables.
-
-```sql
-CREATE TABLE contents (
-    id PRIMARY KEY,
-    path INTEGER,
-    tag TEXT,
-    content TEXT,
-    FOREIGN KEY(path) REFERENCES paths(id)
-);
-```
-
-## FTS5 tables
-
-```sql
-CREATE VIRTUAL TABLE titles_fts USING fts5(
-    title,
-    content='titles',
-    content_rowid='id'
-);
-
-CREATE VIRTUAL TABLE headers_fts USING fts5(
-    header,
-    content='header',
-    content_rowid='id'
-);
-
-CREATE VIRTUAL TABLE contents_fts USING fts5(
-    content,
-    content='contents',
-    content_rowid='id'
-);
-```
-
-
-### FTS5 triggers
-
-We need insert triggers for this model.
-
-```sql
-CREATE TRIGGER titles_ai AFTER INSERT ON titles
-    BEGIN
-        INSERT INTO titles_fts (rowid, title)
-        VALUES (new.id, new.title);
-    END;
-
-CREATE TRIGGER headers_ai AFTER INSERT ON headers
-    BEGIN
-        INSERT INTO headers_fts (rowid, header)
-        VALUES (new.id, new.header);
-    END;
-
-CREATE TRIGGER contents_ai AFTER INSERT ON contents
-    BEGIN
-        INSERT INTO contents_fts (rowid, content)
-        VALUES (new.id, new.content);
-    END;
-```
\ No newline at end of file
diff --git a/docs/architecture/domain64.md b/docs/architecture/domain64.md
index 7620b64e..478aebc0 100644
--- a/docs/architecture/domain64.md
+++ b/docs/architecture/domain64.md
@@ -93,4 +93,41 @@ Jsonnet will naturally sort by the hex key values.
 }
 }
 }
-```
\ No newline at end of file
+```
+
+## considerations
+
+It would be nice to be able to uniquely identify:
+
+1. A domain
+2. A subdomain
+3. A top-level path
+4. A path
+
+in a single value. That is, to have a single integer value that is structured, sortable/filterable, and provides information all the way down to the path level.
+
+* We do not need 255 TLDs; we might need fewer than 16 (2^4, or one nibble)
+* We do not need 16M domains. We need fewer than 100K. 2^16 is 65K (four nibbles), 2^18 is 262K.
+* We do not need 16M subdomains under every domain. It is probably fewer than 4K (three nibbles).
+
+This suggests
+
+F:FFFF:FFF:...
+
+meaning we have 8x4=32 bits, and therefore half of the number remains for representing paths.
+
+Across 215K paths, we have 4300 unique path roots. For a given domain space, it might be in the hundreds.
+
+Using those 32 bits for paths, we could:
+
+* Use the first two bits to indicate how we are using the path.
+  * 00 means no structure; treat the next 30 bits (1B) as unique paths
+  * 01 means we used two nibbles for the root (64 roots) and 6 nibbles for paths (16M)
+  * 10 means we used three nibbles for the root (1024 roots) and 5 nibbles for paths (1M)
+  * 11 is undefined
+
+This would make subpath searching optimal. We can filter, based on the domain64, down to the path level.
+
+Knowing if we can do this a priori is the trick; that is, what path structure is appropriate for a given site?
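+
+As a worked illustration of the `01` mode (all values here are invented, and this reads "two nibbles for the root" as two mode bits plus six root bits in the top byte):
+
+```sql
+-- mode 01 in bits 31-30, root 3 in bits 29-24, path 17 in the low 24 bits
+select (1::bigint << 30) | (3::bigint << 24) | 17 as path32;  -- 1124073489
+```
+
+The full domain64 would then be the 32 domain bits shifted left by 32, OR'd with this path word.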
+It might be that we have to assume `00`, and then, under post-crawl analysis, potentially re-assign; that allows for optimization after a second crawl.
+
+Or, we ask our partners.
\ No newline at end of file
diff --git a/docs/architecture/docs/entree.md b/docs/architecture/entree.md
similarity index 100%
rename from docs/architecture/docs/entree.md
rename to docs/architecture/entree.md
diff --git a/docs/architecture/docs/extract.md b/docs/architecture/extract.md
similarity index 100%
rename from docs/architecture/docs/extract.md
rename to docs/architecture/extract.md
diff --git a/docs/architecture/docs/fetch.md b/docs/architecture/fetch.md
similarity index 100%
rename from docs/architecture/docs/fetch.md
rename to docs/architecture/fetch.md
diff --git a/docs/architecture/goals_and_principles.md b/docs/architecture/goals_and_principles.md
deleted file mode 100644
index cced2eee..00000000
--- a/docs/architecture/goals_and_principles.md
+++ /dev/null
@@ -1,53 +0,0 @@
-
-# goals / principles
-
-There are a bunch of possible goals. Some might be incompatible.
-
-## compliance-first
-
-The goal is a service that can achieve ATO. That means a whole host of things about deployment, access control, logging, etc.
-
-### FISMA?
-
-Is this a search engine that operates at FISMA Low, Medium, or High? This question will impact some issues of overall service design. For example, we may want to keep separate things separate if we're operating at High. This could be an argument for a "search engine in a box" (next). Systems that want to be clean/separate can just be their own instances.
-
-## single-server, end-to-end
-
-Do I want a "search engine in a box?" That is, do I want to be able to deploy a single server instance, with nothing more than a connection to S3, and be able to run a complete search service? Maybe. That would be an interesting design constraint.
-
-Working with this, though, it suggests that library/module design should be carried out in a way that _either_ many small services can be built, _or_ a single service that combines them all.
-
-## live vs. static
-
-There is one world where the search component is live; that is, a page might be querying an app on our infra, and receiving results.
-
-There is another world where the result of the indexing and content cleanup is a static asset that can be embedded in a static site build.
-
-There might be a third world where all of this runs on someone else's infrastructure to produce the assets in question---dynamic or otherwise.
-
-For now, being able to handle an end-to-end process that yields a living search engine, and knowing that it can also generate a static site search as well.
-
-Possible static site tools:
-
-* [tinysearch](https://github.com/tinysearch/tinysearch)
-* [pagefind](https://pagefind.app/)
-
-and the Hugo project maintains a [list of more](https://gohugo.io/tools/search/).
-
-## search is a data pipeline
-
-First you need to crawl a site, and grab the content.
-
-Then you need to process that content. Perhaps you index it. Perhaps you apply AI to images to determine if there are cats present. Or cats eating hotdogs. Or dogs and cats, living together.
-
-Then you need to bundle it up into a search interface...
-
-Then you need to track and store usage and performance...
-
-As much as possible, each service/step will consume some content and produce some content. Ideally, all of this scales embarrasingly: meaning, we have jobs on queues, and can throw more workers at the queues if we need things to go faster.
The content consumed, once fetched from the web, is shuffled in and out of S3 buckets. - -## extensible - -Everyone wants that. But, if we hold hard-and-fast to a worker/queue model, treat everything as a pipeline, and develop services in a manner that they are pluggable, it becomes possible to imagine having a base service, and then have more advanced services that come at a cost (because, perhaps, they require more resource to devleop, maintain, serve, etc.). - -AWS did this by saying "everything is an API." Much the same here; common APIs and queueing models (ideally with models that can be accessed from multiple languages, so components can be built in whatever tooling makes the most sense) will be the path to extensibility. \ No newline at end of file diff --git a/docs/architecture/docs/index.md b/docs/architecture/index.md similarity index 99% rename from docs/architecture/docs/index.md rename to docs/architecture/index.md index 33024036..935eec94 100644 --- a/docs/architecture/docs/index.md +++ b/docs/architecture/index.md @@ -31,4 +31,3 @@ Once we have the content in Postgres, we have another powerful language at our d At this point, further services clean, process, and prepare the text for search. Read more about the [data processing pipeline](processing.md). - diff --git a/docs/architecture/docs/migrate.md b/docs/architecture/migrate.md similarity index 100% rename from docs/architecture/docs/migrate.md rename to docs/architecture/migrate.md diff --git a/docs/architecture/mkdocs.yml b/docs/architecture/mkdocs.yml deleted file mode 100644 index 956170aa..00000000 --- a/docs/architecture/mkdocs.yml +++ /dev/null @@ -1,13 +0,0 @@ -# https://github.com/mkdocs/catalog?tab=readme-ov-file#-navigation--page-building -site_name: Jemison Documentation -nav: - - Home: index.md - - Principles: principles.md - - Tooling: tooling.md -theme: material -markdown_extensions: - - pymdownx.superfences: - custom_fences: - - name: mermaid - class: mermaid - format: !!python/name:pymdownx.superfences.fence_code_format diff --git a/docs/architecture/docs/pack.md b/docs/architecture/pack.md similarity index 100% rename from docs/architecture/docs/pack.md rename to docs/architecture/pack.md diff --git a/docs/architecture/docs/pipeline.md b/docs/architecture/pipeline.md similarity index 100% rename from docs/architecture/docs/pipeline.md rename to docs/architecture/pipeline.md diff --git a/docs/architecture/docs/processing.md b/docs/architecture/processing.md similarity index 100% rename from docs/architecture/docs/processing.md rename to docs/architecture/processing.md diff --git a/docs/architecture/requirements.txt b/docs/architecture/requirements.txt deleted file mode 100644 index 3d597136..00000000 --- a/docs/architecture/requirements.txt +++ /dev/null @@ -1,3 +0,0 @@ -mkdocs -mkdocs-material -mkdocs-toc-md \ No newline at end of file diff --git a/docs/architecture/docs/tooling.md b/docs/architecture/tooling.md similarity index 100% rename from docs/architecture/docs/tooling.md rename to docs/architecture/tooling.md diff --git a/docs/architecture/docs/whacks.png b/docs/images/whacks.png similarity index 100% rename from docs/architecture/docs/whacks.png rename to docs/images/whacks.png diff --git a/docs/architecture/docs/principles.md b/docs/principles.md similarity index 97% rename from docs/architecture/docs/principles.md rename to docs/principles.md index eb448f4e..8984233b 100644 --- a/docs/architecture/docs/principles.md +++ b/docs/principles.md @@ -17,7 +17,8 @@ toc_md_description: toc-md-value 
### mental locks -Roger von Oech wrote the book A Whack on the Side of the Head. It’s a fun book. A light read. +Roger von Oech wrote the book *A Whack on the Side of the Head*. It’s a fun book. A light read. + At the core of this (small) book are ten mental locks that we tend to constrain ourselves with. "Is that the right answer?" "Are we following the rules?" "Can we be practical for a moment?" These are von Oech's mental locks. Be conscious of when you are using them to keep yourself from exploring ideas. They are pervasive in government. @@ -33,9 +34,6 @@ These are von Oech's mental locks. Be conscious of when you are using them to ke 9. To err is wrong 10. I'm not creative - - - ## product principles * **We work in English and Spanish**. We will expand this list, but these are our primary target languages to start. diff --git a/docs/process/good_faith_mou.md b/docs/process/good_faith_mou.md deleted file mode 100644 index 0f626137..00000000 --- a/docs/process/good_faith_mou.md +++ /dev/null @@ -1,47 +0,0 @@ -# **A Good Faith MOU** - -A memorandum of understanding (MOU) is a contract. A soured relationship is not helped by the existence of a contract. By the time it is “needed,” an MOU is a hammer applied too late in the context of growing and nurturing a team. - -The *writing* of a MOU can be extremely valuable. Writing the MOU lets the team spell out what they think is important in relationships. It lets individuals say what they need to be successful and whole in their work. It is an exercise in trust building and empathetic understanding of others. As a non-contractual document, we could call this a *good faith MOU*. It is a *social* contract, to be upheld by everyone on a team in all their work. - -This is a living and aspirational document. It does not need to be perfect. We will revisit it as part of our retrospective process, and can edit, grow, or shrink it as needed. Footnotes are used to capture *actions* and *behaviors* we can engage in to work towards our aspirations. - -* **Warmth**. We are, first and foremost, human beings worthy of respect and compassion, continuing to live at work under pandemic conditions. We all are experiencing this differently. While our cup may feel more or less full on any given day, we aspire to bring our most compassionate selves to the work we do and our interactions with our colleagues[^1]. -* **Trust**. Trust is earned through words and actions. It comes from doing what we say we will do, when we say we will do it. It comes from saying when we misjudged, and our goals might not be achieved the way we hoped. Open, honest communication is foundational to trust[^2]. - * Across the team, we consistently expressed that we value accountability, transparency, and regular open/frank communication. These are all things that lay foundations for *trust*. -* **Working in the open**. As federal employees, our work grows the public domain. As much as possible, we also hope our words and actions might model how this work can also take place openly. In doing so, it encourages others to contribute to our efforts of improving the government and it’s processes *for the benefit of the people[^3]*. - * Clear commitments and transparent milestones are more easily achieved when the work is open. -* **Collaboration**. We value working together. It is easier to brainstorm, to explore ideas, to get unstuck, and to push each other when we are working together[^4]. -* **Documentation**. Our work should be documented. 
More specifically, we should 1\) try and document the rationale for important design decisions (without which, we might “roll back” a decision that was made for good reason), 2\) document work products (so that others can pick them up and continue the work), and 3\) document reflections (so that we can communicate what and why about our work led to success or failure)[^5]. -* **Be bold**. In open, collaborative work, being bold means many things. It might mean *not asking permission*, especially when using tools where changes can easily be rolled back. With regards to our thinking, it means avoiding mental locks, like believing we have to do things that are *logical*, or that we have to come up with *the right answer.* While it is true that we have milestones to achieve, part of good agile practice is also knowing how and when to pivot, and that occasionally requires us to be bold. The appendix provides a bit more context for what it might mean to “be bold.” - -## **Appendix: A Whack on the Side of the Head** - -Roger von Oech wrote the book *A Whack on the Side of the Head*. It’s a fun book. A light read. - -At the core of this (small) book are ten mental locks that we tend to constrain ourselves with. He gives them names, and they are mostly self-explanatory: they're the kinds of things we say to ourselves to keep us from moving forward, or trying new things: - -* The right answer -* That's not logical -* Follow the rules -* Be practical -* Play is frivolous -* That's not my area -* Don't be foolish -* Avoid ambiguity -* To err is wrong -* I'm not creative - -These phrases have their time and place. But, we also want to question them as part of our 10x-funded entrepreneurial work, and make sure that we’re challenging ourselves to deliver the best value to our partners and *the people*. We do this by balancing well-established practices with a willingness to make mistakes and try things that are new (to us, or to everyone). - - - -[^1]: We can remember to start meetings by checking in, and listening to our colleagues. Sometimes, making space for out-of-band conversations (“virtual coffee”) to unpack things can help, too. - -[^2]: Regular retrospective processes that look at how we are doing our work as individuals and as a team will help foster a space for open dialogue about what we are doing, why, and how we might engage best individually and as a team. - -[^3]: “Working in the open” means many things. Making repositories public, licensing things with open/libre licenses, and (perhaps more importantly), building *open community* around the work itself. This takes effort, but without community and engagement, we achieve *free* or *openly available*, but not open engagement. - -[^4]: There are some tasks that benefit from parallel workstreams (“divide and conquer,” to use the language of colonizers). However, there are many tasks that benefit from cowork sessions. In addition, coworking builds knowledge of each-other, and therefore trust. - -[^5]: Where possible, an important part of our development practice will be *automation*. This is sometimes called *DevOps* or *DevSecOps.* “Development Operations” means that we write code that builds, tests, deploys, tests, and runs our code. (There’s at least two *different* kinds of testing we can do.) Good devops practices make it more likely that others can pick up and continue our work, because the difficult work of building, deploying, and testing our code is, in a word, “self-documenting.” \ No newline at end of file