Query migration to all dataset (#5)
* start

* getting started update

* more query examples

* fix path

* outdated tables references

* guided tour part 1

* part 3 link

* restructure migration doc

* pages migration queries verified

* request migration queries verified

* request schema update

* query size increase notes

* fix

* npm update

* rollback test

* removed plugin

* notebooks in repo

* links update

* guide3

* moved notebooks

* page count fix

* include 0-100kb resp

* ceiling bins

* queries descriptions update

* query diffs reformatted

* Update src/content/docs/guides/getting-started.md

Co-authored-by: Rick Viscomi <[email protected]>

* updated description

---------

Co-authored-by: Rick Viscomi <[email protected]>
max-ostapenko and rviscomi authored Jul 14, 2024
1 parent a80af55 commit 08bc0aa
Showing 24 changed files with 5,680 additions and 108 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -21,4 +21,6 @@ pnpm-debug.log*
.DS_Store

# local tools
*.code-workspace
*.code-workspace
.idx/dev.nix
.vscode/settings.json
4 changes: 3 additions & 1 deletion astro.config.mjs
@@ -21,7 +21,7 @@ export default defineConfig({
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-SK2FZXB50K');`
}
],
@@ -38,7 +38,9 @@ export default defineConfig({
items: [
{ label: 'Getting started', link: '/guides/getting-started/' },
{ label: 'Minimizing query costs', link: '/guides/minimizing-costs/' },
{ label: 'Guided tour', link: '/guides/guided-tour/' },
{ label: 'Release cycle', link: '/guides/release-cycle/' },
{ label: 'Migrate queries to `all` dataset', link: '/guides/migrating-to-all-dataset/' },
],
},
{
299 changes: 199 additions & 100 deletions src/content/docs/guides/getting-started.md

Large diffs are not rendered by default.

26 changes: 26 additions & 0 deletions src/content/docs/guides/guided-tour.mdx
@@ -0,0 +1,26 @@
---
title: Guided Tour
description: HTTP Archive data analysis in BigQuery
---


The HTTP Archive contains a tremendous amount of information that can be used to understand the evolution of the web. And since the raw data is available in Google BigQuery, you can start digging into it with a minimal amount of setup!

If you are new to BigQuery, then the [Getting Started guide](/guides/getting-started/) will walk you through the basic setup. That guide ends with a sample query that explores MIME types from the `pages` tables. In this guide, we'll explore more of the tables and build additional queries that you can learn from. The easiest way to get started is by following along and testing some of the queries yourself. If you need any help, there is plenty of support available from the community at [https://discuss.httparchive.org](https://discuss.httparchive.org).

**Prerequisites:**

- This guide assumes that you've completed the setup from the [Getting Started guide](/guides/getting-started/).
- Some of the tables in this dataset are extremely large; follow the [minimizing query costs guide](/guides/minimizing-costs/) to query them safely (see the sketch below).
- The guide also assumes some familiarity with SQL. All of the examples provided use [Standard SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/).
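
For instance, here is a minimal sketch of a cost-conscious exploratory query. The `TABLESAMPLE` clause and the 1% sampling rate are illustrative choices for keeping scans small, not requirements:

```sql
-- A rough sketch of a cost-conscious exploratory query:
-- constrain the partitioned `date` column, pick a single client,
-- and sample the table instead of scanning it in full.
SELECT
  page,
  rank
FROM `httparchive.all.pages` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE date = '2024-06-01'
  AND client = 'desktop'
LIMIT 100
```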

Migration Guides:

- If you are looking to adapt older HTTP Archive queries, written in [Legacy SQL](https://cloud.google.com/bigquery/docs/reference/legacy-sql), then you may find this [migration guide](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql) helpful.
- If you've been working with the deprecated dataset `pages` or `requests`, there is a guide on [migrating your queries to the `all` dataset](/guides/migrating-to-all-dataset/).

This guide is split into multiple sections, each one focusing on different tables in the HTTP Archive. Each section builds on top of the previous one:

1. [Exploring the `httparchive.all.pages` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-pages_tables.ipynb)
2. [Exploring the `httparchive.all.requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-requests_tables.ipynb)
3. [JOINing `pages` and `requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_pages_and_requests_tables_joined.ipynb)
327 changes: 327 additions & 0 deletions src/content/docs/guides/migrating-to-all-dataset.mdx
@@ -0,0 +1,327 @@
---
title: Migrate queries to `all` dataset
description: Assisting with query migration to the new dataset
---

import { Tabs, TabItem } from '@astrojs/starlight/components';

New tables have been introduced in the HTTP Archive dataset that are more efficient and easier to use. The `all` dataset consolidates the data previously spread across the `pages`, `requests`, and other datasets. This guide will help you migrate your queries to it.
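
Most of the migrated queries below share the same skeleton: they filter on the `date` partition column, on `client`, and usually on `is_root_page`. A minimal sketch of that recurring pattern:

```sql
-- The recurring filter pattern for the `all` dataset: constrain the
-- `date` partition and the `client`, and decide explicitly whether
-- secondary pages should be included (`is_root_page`).
SELECT page
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
  AND client = 'desktop'
  AND is_root_page
LIMIT 10
```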

## Migrating to `all.pages`

### Page data schemas comparison

Previously | `all.pages`
---|---
date in a table name | [`date`](/reference/tables/pages/#date)
client as `_TABLE_SUFFIX` | [`client`](/reference/tables/pages/#client)
`url` in `pages.YYYY_MM_DD_client` | [`page`](/reference/tables/pages/#page)
not available | [`is_root_page`](/reference/tables/pages/#is_root_page)
not available | [`root_page`](/reference/tables/pages/#root_page)
not available | [`rank`](/reference/tables/pages/#rank)
`$.testID` within `payload` column in `pages.YYYY_MM_DD_client`, `wptid` column in `summary_pages.YYYY_MM_DD_client` | [`wptid`](/reference/tables/pages/#wptid)
`payload` in `pages.YYYY_MM_DD_client` | [`payload`](/reference/tables/pages/#payload)
`req*`, `resp*`, and other columns in `summary_pages.YYYY_MM_DD_client` | [`summary`](/reference/tables/pages/#summary)
`$.CUSTOM_METRIC_NAME` within `payload` column in `pages.YYYY_MM_DD_client` | [`custom_metrics`](/reference/tables/pages/#custom_metrics)
`report` in `lighthouse.YYYY_MM_DD_client` | [`lighthouse`](/reference/tables/pages/#lighthouse)
`feature`, `type`, `id` in `blink_features.features` | `feature`, `type`, `id` in [`features`](/reference/tables/pages/#features)
`category`, `app`, `info` in `technologies.YYYY_MM_DD_client` | `categories`, `technology`, `info` in [`technologies`](/reference/tables/pages/#technologies)
not available | [`metadata`](/reference/tables/pages/#metadata)
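
Note that the former `summary_pages` columns now live as keys inside the `summary` JSON column, so they are extracted with `JSON_VALUE` and cast to the expected type. A small sketch using the `reqTotal` key, which also appears in the examples below:

```sql
-- Former summary_pages columns such as reqTotal are now JSON keys:
-- extract them with JSON_VALUE and CAST to the expected type.
SELECT
  page,
  CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64) AS reqTotal
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
  AND client = 'desktop'
  AND is_root_page
LIMIT 10
```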

### Page query updates

- Migrate from `blink_features.features`

<Tabs>
<TabItem label="Before">
```sql
SELECT
url,
feature,
type,
id
FROM `httparchive.blink_features.features`
WHERE yyyymmdd = DATE('2024-06-01')
AND client = 'desktop'
```
</TabItem>
<TabItem label="After">
```sql
SELECT
page,
features.feature,
features.type,
features.id
FROM `httparchive.all.pages`,
UNNEST (features) AS features
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `lighthouse.YYYY_MM_DD_client`

<Tabs>
<TabItem label="Before">
```sql
SELECT
url,
JSON_QUERY(report, '$.audits."largest-contentful-paint".numericValue') AS LCP,
FROM `httparchive.lighthouse.2024_06_01_desktop`
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 17 TB when run. */
SELECT
page,
JSON_QUERY(lighthouse, '$.audits."largest-contentful-paint".numericValue') AS LCP,
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `pages.YYYY_MM_DD_client`

<Tabs>
<TabItem label="Before">
```sql
SELECT
url,
_TABLE_SUFFIX AS client,
JSON_QUERY(payload, '$.testID') AS testID,
-- JSON with the results of the custom metrics,
JSON_QUERY(payload, '$._privacy') AS custom_metrics,
FROM `httparchive.pages.2022_06_01_*`
```
</TabItem>
<TabItem label="After">
```sql
SELECT
page,
client,
wptid,
-- JSON with the results of the custom metrics,
JSON_QUERY(custom_metrics, '$.privacy') AS custom_metrics,
FROM `httparchive.all.pages`
WHERE date = '2022-06-01'
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `summary_pages.YYYY_MM_DD_client`

<Tabs>
<TabItem label="Before">
```sql
SELECT
numDomains,
COUNT(0) pages,
ROUND(AVG(reqTotal),2) avg_requests,
FROM `httparchive.summary_pages.2024_06_01_desktop`
GROUP BY
numDomains
HAVING
pages > 1000
ORDER BY
numDomains ASC
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 110 GB when run. */
SELECT
CAST(JSON_VALUE(summary, '$.numDomains') AS INT64) AS numDomains,
COUNT(0) pages,
ROUND(AVG(CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64)), 2) AS avg_requests,
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
GROUP BY
numDomains
HAVING
pages > 1000
ORDER BY
numDomains ASC
```
</TabItem>
</Tabs>

- Migrate from `technologies.YYYY_MM_DD_client`

<Tabs>
<TabItem label="Before">
```sql
SELECT
url,
category,
app,
info
FROM `httparchive.technologies.2024_06_01_desktop`
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 27 GB when run. */
SELECT
page,
technologies.categories,
technologies.technology,
technologies.info
FROM `httparchive.all.pages`,
UNNEST (technologies) AS technologies
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
```

</TabItem>
</Tabs>

## Migrating to `all.requests`

### Request data schemas comparison

Previously | `all.requests`
---|---
date in a table name | [`date`](/reference/tables/requests/#date)
client as `_TABLE_SUFFIX` | [`client`](/reference/tables/requests/#client)
`page` in `requests.YYYY_MM_DD_client` | [`page`](/reference/tables/requests/#page)
not available | [`is_root_page`](/reference/tables/requests/#is_root_page)
not available | [`root_page`](/reference/tables/requests/#root_page)
`url` in `requests.YYYY_MM_DD_client` | [`url`](/reference/tables/requests/#url)
`firstHtml` in `summary_requests.YYYY_MM_DD_client` | [`is_main_document`](/reference/tables/requests/#is_main_document)
`type` in `summary_requests.YYYY_MM_DD_client` | [`type`](/reference/tables/requests/#type)
`$._index` within `payload` in `requests.YYYY_MM_DD_client` | [`index`](/reference/tables/requests/#index)
`payload` column in `requests.YYYY_MM_DD_client` | [`payload`](/reference/tables/requests/#payload)
`req*`, `resp*`, and other columns in `summary_requests.YYYY_MM_DD_client` | [`summary`](/reference/tables/requests/#summary)
`req_*` and `reqOtherHeaders` in `almanac.requests` | [`request_headers`](/reference/tables/requests/#request_headers)
`resp_*` and `respOtherHeaders` in `almanac.requests` | [`response_headers`](/reference/tables/requests/#response_headers)
`body` in `response_bodies.YYYY_MM_DD_client` | [`response_body`](/reference/tables/requests/#response_body)
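
Because `request_headers` and `response_headers` are now repeated fields rather than JSON strings, a single header can also be pulled with a correlated subquery instead of unnesting the whole array. A sketch, assuming `response_headers` shares the `name`/`value` structure shown for `request_headers` in the example below:

```sql
-- Headers are repeated STRUCT fields with `name` and `value`, so one
-- header can be selected with a correlated subquery instead of a join.
SELECT
  url,
  (SELECT value
   FROM UNNEST(response_headers)
   WHERE LOWER(name) = 'content-type'
   LIMIT 1) AS content_type
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
  AND client = 'desktop'
  AND is_main_document
  AND is_root_page
LIMIT 10
```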

### Request query updates

- Migrate from `almanac.requests`

<Tabs>
<TabItem label="Before">
```sql
SELECT
LOWER(JSON_VALUE(request_headers, '$.name')) AS header_name,
JSON_VALUE(request_headers, '$.value') AS header_value,
FROM `httparchive.almanac.requests`,
UNNEST(JSON_QUERY_ARRAY(request_headers)) AS request_headers
WHERE date = '2024-06-01'
AND client = 'desktop'
AND firstHtml
```
</TabItem>
<TabItem label="After">
```sql
SELECT
LOWER(request_headers.name) AS header_name,
request_headers.value AS header_value,
FROM `httparchive.all.requests`,
UNNEST(request_headers) AS request_headers
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_main_document
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `requests.YYYY_MM_DD_client`

<Tabs>
<TabItem label="Before">
```sql
SELECT
page,
url,
JSON_VALUE(payload, '$.response.content.mimeType') AS mimeType,
CAST(JSON_VALUE(payload, '$.response.bodySize') AS INT64) AS respBodySize,
FROM `httparchive.requests.2024_06_01_desktop`
```
</TabItem>
<TabItem label="After">
```sql
SELECT
page,
url,
JSON_VALUE(summary, '$.mimeType') AS mimeType,
CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64) AS respBodySize,
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `response_bodies.YYYY_MM_DD_client`


<Tabs>
<TabItem label="Before">
```sql
SELECT
page,
url,
BYTE_LENGTH(response_body) AS bodySize
FROM `httparchive.response_bodies.2024_06_01_desktop`
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 174 TB when run. */
SELECT
page,
url,
BYTE_LENGTH(response_body) AS bodySize
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `summary_requests.YYYY_MM_DD_client`


<Tabs>
<TabItem label="Before">
```sql
SELECT
ROUND(respBodySize/1024/100)*100 AS responseSize100KB,
COUNT(0) requests,
FROM `httparchive.summary_requests.2024_06_01_desktop`
GROUP BY responseSize100KB
HAVING responseSize100KB > 0
ORDER BY responseSize100KB ASC
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 10 TB when run. */
SELECT
ROUND(CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64)/1024/100)*100 AS responseSize100KB,
COUNT(0) requests,
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
GROUP BY responseSize100KB
HAVING responseSize100KB > 0
ORDER BY responseSize100KB ASC
```
</TabItem>
</Tabs>
Binary file added src/content/docs/guides/simple_join_example.png
Binary file added src/content/docs/guides/sync_async_graph.png