Query migration to all dataset (#5)
* start

* getting started update

* more query examples

* fix path

* outdated tables references

* guided tour part 1

* part 3 link

* restructure migration doc

* pages migration queries verified

* request migration queries verified

* request schema update

* query size increase notes

* fix

* npm update

* rollback test

* removed plugin

* notebooks in repo

* links update

* guide3

* moved notebooks

* page count fix

* include 0-100kb resp

* ceiling bins

* queries descriptions update

* query diffs reformatted

* Update src/content/docs/guides/getting-started.md

Co-authored-by: Rick Viscomi <[email protected]>

* updated description

---------

Co-authored-by: Rick Viscomi <[email protected]>
max-ostapenko and rviscomi authored Jul 14, 2024
1 parent a80af55 commit 08bc0aa
Showing 24 changed files with 5,680 additions and 108 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -21,4 +21,6 @@ pnpm-debug.log*
.DS_Store

# local tools
*.code-workspace
*.code-workspace
.idx/dev.nix
.vscode/settings.json
4 changes: 3 additions & 1 deletion astro.config.mjs
@@ -21,7 +21,7 @@ export default defineConfig({
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-SK2FZXB50K');`
}
],
@@ -38,7 +38,9 @@ export default defineConfig({
items: [
{ label: 'Getting started', link: '/guides/getting-started/' },
{ label: 'Minimizing query costs', link: '/guides/minimizing-costs/' },
{ label: 'Guided tour', link: '/guides/guided-tour/' },
{ label: 'Release cycle', link: '/guides/release-cycle/' },
{ label: 'Migrate queries to `all` dataset', link: '/guides/migrating-to-all-dataset/' },
],
},
{
299 changes: 199 additions & 100 deletions src/content/docs/guides/getting-started.md

Large diffs are not rendered by default.

26 changes: 26 additions & 0 deletions src/content/docs/guides/guided-tour.mdx
@@ -0,0 +1,26 @@
---
title: Guided Tour
description: HTTP Archive data analysis in BigQuery
---


The HTTP Archive contains a tremendous amount of information that can be used to understand the evolution of the web. And since the raw data is available in Google BigQuery, you can start digging into it with a minimal amount of setup!

If you are new to BigQuery, then the [Getting Started guide](/guides/getting-started/) will walk you through the basic setup. That guide ends with a sample query that explores MIME types from the `pages` tables. In this guide, we'll explore more of the tables and build additional queries that you can learn from. The easiest way to get started is by following along and testing some of the queries yourself. If you need any help, there is plenty of support available from the community at [https://discuss.httparchive.org](https://discuss.httparchive.org).

**Prerequisites:**

- This guide assumes that you've completed the setup from the [Getting Started guide](/guides/getting-started/).
- Some of the tables in this dataset are extremely large; follow the [minimizing query costs guide](/guides/minimizing-costs/) to query them safely (see the sketch below).
- The guide also assumes some familiarity with SQL. All of the examples provided use [Standard SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/).
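
For instance, here is a minimal sketch of a cost-conscious exploratory query. The `TABLESAMPLE` clause and the 1% sampling rate are illustrative choices for keeping scans small, not requirements:

```sql
-- A rough sketch of a cost-conscious exploratory query:
-- constrain the partitioned `date` column, pick a single client,
-- and sample the table instead of scanning it in full.
SELECT
  page,
  rank
FROM `httparchive.all.pages` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE date = '2024-06-01'
  AND client = 'desktop'
LIMIT 100
```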

Migration Guides:

- If you are looking to adapt older HTTP Archive queries, written in [Legacy SQL](https://cloud.google.com/bigquery/docs/reference/legacy-sql), then you may find this [migration guide](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql) helpful.
- If you've been working with the deprecated dataset `pages` or `requests`, there is a guide on [migrating your queries to the `all` dataset](/guides/migrating-to-all-dataset/).

This guide is split into multiple sections, each one focusing on different tables in the HTTP Archive. Each section builds on top of the previous one:

1. [Exploring the `httparchive.all.pages` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-pages_tables.ipynb)
2. [Exploring the `httparchive.all.requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-requests_tables.ipynb)
3. [JOINing `pages` and `requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_pages_and_requests_tables_joined.ipynb)
327 changes: 327 additions & 0 deletions src/content/docs/guides/migrating-to-all-dataset.mdx
@@ -0,0 +1,327 @@
---
title: Migrate queries to `all` dataset
description: Assisting with query migration to the new dataset
---

import { Tabs, TabItem } from '@astrojs/starlight/components';

New tables have been introduced in the HTTP Archive dataset that are more efficient and easier to use. The `all` dataset consolidates the data previously spread across the `pages`, `requests`, and other datasets. This guide will help you migrate your queries to it.
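
Most of the migrated queries below share the same skeleton: they filter on the `date` partition column, on `client`, and usually on `is_root_page`. A minimal sketch of that recurring pattern:

```sql
-- The recurring filter pattern for the `all` dataset: constrain the
-- `date` partition and the `client`, and decide explicitly whether
-- secondary pages should be included (`is_root_page`).
SELECT page
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
  AND client = 'desktop'
  AND is_root_page
LIMIT 10
```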

## Migrating to `all.pages`

### Page data schemas comparison

Previously | `all.pages`
---|---
date in a table name | [`date`](/reference/tables/pages/#date)
client as `_TABLE_SUFFIX` | [`client`](/reference/tables/pages/#client)
`url` in `pages.YYYY_MM_DD_client` | [`page`](/reference/tables/pages/#page)
not available | [`is_root_page`](/reference/tables/pages/#is_root_page)
not available | [`root_page`](/reference/tables/pages/#root_page)
not available | [`rank`](/reference/tables/pages/#rank)
`$.testID` within `payload` column in `pages.YYYY_MM_DD_client`, `wptid` column in `summary_pages.YYYY_MM_DD_client` | [`wptid`](/reference/tables/pages/#wptid)
`payload` in `pages.YYYY_MM_DD_client` | [`payload`](/reference/tables/pages/#payload)
`req*`, `resp*`, and other columns in `summary_pages.YYYY_MM_DD_client` | [`summary`](/reference/tables/pages/#summary)
`$.CUSTOM_METRIC_NAME` within `payload` column in `pages.YYYY_MM_DD_client` | [`custom_metrics`](/reference/tables/pages/#custom_metrics)
`report` in `lighthouse.YYYY_MM_DD_client` | [`lighthouse`](/reference/tables/pages/#lighthouse)
`feature`, `type`, `id` in `blink_features.features` | `feature`, `type`, `id` in [`features`](/reference/tables/pages/#features)
`category`, `app`, `info` in `technologies.YYYY_MM_DD_client` | `categories`, `technology`, `info` in [`technologies`](/reference/tables/pages/#technologies)
not available | [`metadata`](/reference/tables/pages/#metadata)
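
Note that the former `summary_pages` columns now live as keys inside the `summary` JSON column, so they are extracted with `JSON_VALUE` and cast to the expected type. A small sketch using the `reqTotal` key, which also appears in the examples below:

```sql
-- Former summary_pages columns such as reqTotal are now JSON keys:
-- extract them with JSON_VALUE and CAST to the expected type.
SELECT
  page,
  CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64) AS reqTotal
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
  AND client = 'desktop'
  AND is_root_page
LIMIT 10
```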

### Page query updates

- Migrate from `blink_features.features`

<Tabs>
<TabItem label="Before">
```sql
SELECT
url,
feature,
type,
id
FROM `httparchive.blink_features.features`
WHERE yyyymmdd = DATE('2024-06-01')
AND client = 'desktop'
```
</TabItem>
<TabItem label="After">
```sql
SELECT
page,
features.feature,
features.type,
features.id
FROM `httparchive.all.pages`,
UNNEST (features) AS features
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `lighthouse.YYYY_MM_DD_client`

<Tabs>
<TabItem label="Before">
```sql
SELECT
url,
JSON_QUERY(report, '$.audits."largest-contentful-paint".numericValue') AS LCP,
FROM `httparchive.lighthouse.2024_06_01_desktop`
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 17 TB when run. */
SELECT
page,
JSON_QUERY(lighthouse, '$.audits."largest-contentful-paint".numericValue') AS LCP,
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `pages.YYYY_MM_DD_client`

<Tabs>
<TabItem label="Before">
```sql
SELECT
url,
_TABLE_SUFFIX AS client,
JSON_QUERY(payload, '$.testID') AS testID,
-- JSON with the results of the custom metrics,
JSON_QUERY(payload, '$._privacy') AS custom_metrics,
FROM `httparchive.pages.2022_06_01_*`
```
</TabItem>
<TabItem label="After">
```sql
SELECT
page,
client,
wptid,
-- JSON with the results of the custom metrics,
JSON_QUERY(custom_metrics, '$.privacy') AS custom_metrics,
FROM `httparchive.all.pages`
WHERE date = '2022-06-01'
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `summary_pages.YYYY_MM_DD_client`

<Tabs>
<TabItem label="Before">
```sql
SELECT
numDomains,
COUNT(0) pages,
ROUND(AVG(reqTotal),2) avg_requests,
FROM `httparchive.summary_pages.2024_06_01_desktop`
GROUP BY
numDomains
HAVING
pages > 1000
ORDER BY
numDomains ASC
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 110 GB when run. */
SELECT
CAST(JSON_VALUE(summary, '$.numDomains') AS INT64) AS numDomains,
COUNT(0) pages,
ROUND(AVG(CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64)), 2) AS avg_requests,
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
GROUP BY
numDomains
HAVING
pages > 1000
ORDER BY
numDomains ASC
```
</TabItem>
</Tabs>

- Migrate from `technologies.YYYY_MM_DD_client`

<Tabs>
<TabItem label="Before">
```sql
SELECT
url,
category,
app,
info
FROM `httparchive.technologies.2024_06_01_desktop`
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 27 GB when run. */
SELECT
page,
technologies.categories,
technologies.technology,
technologies.info
FROM `httparchive.all.pages`,
UNNEST (technologies) AS technologies
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
```

</TabItem>
</Tabs>

## Migrating to `all.requests`

### Request data schemas comparison

Previously | `all.requests`
---|---
date in a table name | [`date`](/reference/tables/requests/#date)
client as `_TABLE_SUFFIX` | [`client`](/reference/tables/requests/#client)
`page` in `requests.YYYY_MM_DD_client` | [`page`](/reference/tables/requests/#page)
not available | [`is_root_page`](/reference/tables/requests/#is_root_page)
not available | [`root_page`](/reference/tables/requests/#root_page)
`url` in `requests.YYYY_MM_DD_client` | [`url`](/reference/tables/requests/#url)
`firstHtml` in `summary_requests.YYYY_MM_DD_client` | [`is_main_document`](/reference/tables/requests/#is_main_document)
`type` in `summary_requests.YYYY_MM_DD_client` | [`type`](/reference/tables/requests/#type)
`$._index` within `payload` in `requests.YYYY_MM_DD_client` | [`index`](/reference/tables/requests/#index)
`payload` column in `requests.YYYY_MM_DD_client` | [`payload`](/reference/tables/requests/#payload)
`req*`, `resp*`, and other columns in `summary_requests.YYYY_MM_DD_client` | [`summary`](/reference/tables/requests/#summary)
`req_*` and `reqOtherHeaders` in `almanac.requests` | [`request_headers`](/reference/tables/requests/#request_headers)
`resp_*` and `respOtherHeaders` in `almanac.requests` | [`response_headers`](/reference/tables/requests/#response_headers)
`body` in `response_bodies.YYYY_MM_DD_client` | [`response_body`](/reference/tables/requests/#response_body)
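
Because `request_headers` and `response_headers` are now repeated fields rather than JSON strings, a single header can also be pulled with a correlated subquery instead of unnesting the whole array. A sketch, assuming `response_headers` shares the `name`/`value` structure shown for `request_headers` in the example below:

```sql
-- Headers are repeated STRUCT fields with `name` and `value`, so one
-- header can be selected with a correlated subquery instead of a join.
SELECT
  url,
  (SELECT value
   FROM UNNEST(response_headers)
   WHERE LOWER(name) = 'content-type'
   LIMIT 1) AS content_type
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
  AND client = 'desktop'
  AND is_main_document
  AND is_root_page
LIMIT 10
```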

### Request query updates

- Migrate from `almanac.requests`

<Tabs>
<TabItem label="Before">
```sql
SELECT
LOWER(JSON_VALUE(request_headers, '$.name')) AS header_name,
JSON_VALUE(request_headers, '$.value') AS header_value,
FROM `httparchive.almanac.requests`,
UNNEST(JSON_QUERY_ARRAY(request_headers)) AS request_headers
WHERE date = '2024-06-01'
AND client = 'desktop'
AND firstHtml
```
</TabItem>
<TabItem label="After">
```sql
SELECT
LOWER(request_headers.name) AS header_name,
request_headers.value AS header_value,
FROM `httparchive.all.requests`,
UNNEST(request_headers) AS request_headers
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_main_document
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `requests.YYYY_MM_DD_client`

<Tabs>
<TabItem label="Before">
```sql
SELECT
page,
url,
JSON_VALUE(payload, '$.response.content.mimeType') AS mimeType,
CAST(JSON_VALUE(payload, '$.response.bodySize') AS INT64) AS respBodySize,
FROM `httparchive.requests.2024_06_01_desktop`
```
</TabItem>
<TabItem label="After">
```sql
SELECT
page,
url,
JSON_VALUE(summary, '$.mimeType') AS mimeType,
CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64) AS respBodySize,
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `response_bodies.YYYY_MM_DD_client`


<Tabs>
<TabItem label="Before">
```sql
SELECT
page,
url,
BYTE_LENGTH(response_body) AS bodySize
FROM `httparchive.response_bodies.2024_06_01_desktop`
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 174 TB when run. */
SELECT
page,
url,
BYTE_LENGTH(response_body) AS bodySize
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
```
</TabItem>
</Tabs>

- Migrate from `summary_requests.YYYY_MM_DD_client`


<Tabs>
<TabItem label="Before">
```sql
SELECT
ROUND(respBodySize/1024/100)*100 AS responseSize100KB,
COUNT(0) requests,
FROM `httparchive.summary_requests.2024_06_01_desktop`
GROUP BY responseSize100KB
HAVING responseSize100KB > 0
ORDER BY responseSize100KB ASC
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 10 TB when run. */
SELECT
ROUND(CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64)/1024/100)*100 AS responseSize100KB,
COUNT(0) requests,
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
GROUP BY responseSize100KB
HAVING responseSize100KB > 0
ORDER BY responseSize100KB ASC
```
</TabItem>
</Tabs>
Binary file added src/content/docs/guides/simple_join_example.png
Binary file added src/content/docs/guides/sync_async_graph.png