Commit
* start
* getting started update
* more query examples
* fix path
* outdated tables references
* guided tour part 1
* part 3 link
* restructure migration doc
* pages migration queries verified
* request migration queries verified
* request schema update
* query size increase notes
* fix
* npm update
* rollback test
* removed plugin
* notebooks in repo
* links update
* guide3
* moved notebooks
* page count fix
* include 0-100kb resp
* ceiling bins
* queries descriptions update
* query diffs reformatted
* Update src/content/docs/guides/getting-started.md

  Co-authored-by: Rick Viscomi <[email protected]>
* updated description

---------

Co-authored-by: Rick Viscomi <[email protected]>
1 parent a80af55, commit 08bc0aa
Showing 24 changed files with 5,680 additions and 108 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,4 +21,6 @@ pnpm-debug.log* | |
.DS_Store | ||
|
||
# local tools | ||
*.code-workspace | ||
*.code-workspace | ||
.idx/dev.nix | ||
.vscode/settings.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
--- | ||
title: Guided Tour | ||
description: HTTP Archive data analysis in BigQuery | ||
--- | ||
|
||
|
||
The HTTP Archive contains a tremendous amount of information that can be used to understand the evolution of the web. And since the raw data is available in Google BigQuery, you can start digging into it with a minimal amount of setup! | ||
|
||
If you are new to BigQuery, the [Getting Started guide](./getting-started.md) will walk you through the basic setup. That guide ends with a sample query that explores MIME types from the `pages` tables. In this guide, we'll explore more of the tables and build additional queries that you can learn from. The easiest way to get started is to follow along, run some of the queries, and adapt them. If you need help, plenty of support is available from the community at [https://discuss.httparchive.org](https://discuss.httparchive.org). | ||
|
||
**Prerequisites:** | ||
|
||
- This guide assumes that you've completed the setup from the [Getting Started guide](./getting-started.md). | ||
- The tables in this dataset are extremely large. Follow the [minimizing query costs guide](/guides/minimizing-costs/) to process them safely. | ||
- It also assumes some familiarity with SQL. All of the examples provided will be using [Standard SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/). | ||
|
||
**Migration guides:** | ||
|
||
- If you are looking to adapt older HTTP Archive queries written in [Legacy SQL](https://cloud.google.com/bigquery/docs/reference/legacy-sql), you may find this [migration guide](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql) helpful. | ||
- If you've been working with the deprecated `pages` or `requests` datasets, there is a guide on [migrating your queries to the `all` dataset](/guides/migrating-to-all-dataset/). | ||
|
||
This guide is split into multiple sections, each one focusing on different tables in the HTTP Archive. Each section builds on top of the previous one: | ||
|
||
1. [Exploring the `httparchive.all.pages` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-pages_tables.ipynb) | ||
2. [Exploring the `httparchive.all.requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-requests_tables.ipynb) | ||
3. [JOINing `pages` and `requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_pages_and_requests_tables_joined.ipynb) |
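
As a rough preview of the kind of query the third section builds toward, here is a minimal sketch that joins the two tables. It assumes that `date`, `client`, `page`, and `is_root_page` exist in both `httparchive.all.pages` and `httparchive.all.requests`, and that `rank` is a numeric column on `all.pages`; the alias `total_requests` and the `LIMIT` are illustrative choices only. Check the estimated bytes processed in the BigQuery console before running it.

```sql
-- Sketch only: number of requests per root page, alongside the page's rank.
-- Both tables are filtered on date and client so that each side of the join
-- is restricted to a single crawl.
SELECT
  pages.page,
  pages.rank,
  COUNT(0) AS total_requests
FROM `httparchive.all.pages` AS pages
JOIN `httparchive.all.requests` AS requests
ON pages.date = requests.date
  AND pages.client = requests.client
  AND pages.page = requests.page
WHERE pages.date = '2024-06-01'
  AND pages.client = 'desktop'
  AND pages.is_root_page
  AND requests.date = '2024-06-01'
  AND requests.client = 'desktop'
  AND requests.is_root_page
GROUP BY
  pages.page,
  pages.rank
ORDER BY
  total_requests DESC
LIMIT 10
```

Only small columns are selected here; adding `payload` or `response_body` would likely increase the amount of data scanned considerably.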
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,327 @@ | ||
--- | ||
title: Migrate queries to `all` dataset | ||
description: Assisting with query migration to the new dataset | ||
--- | ||
|
||
import { Tabs, TabItem } from '@astrojs/starlight/components'; | ||
|
||
The HTTP Archive dataset now includes new tables that are more efficient and easier to use. The `all` dataset contains all the data from the previous `pages`, `requests`, and other datasets. This guide will help you migrate your queries to the new dataset. | ||
|
||
## Migrating to `all.pages` | ||
|
||
### Page data schemas comparison | ||
|
||
Previously | `all.pages` | ||
---|--- | ||
date in a table name | [`date`](/reference/tables/pages/#date) | ||
client as `_TABLE_SUFFIX` | [`client`](/reference/tables/pages/#client) | ||
`url` in `pages.YYYY_MM_DD_client` | [`page`](/reference/tables/pages/#page) | ||
not available | [`is_root_page`](/reference/tables/pages/#is_root_page) | ||
not available | [`root_page`](/reference/tables/pages/#root_page) | ||
not available | [`rank`](/reference/tables/pages/#rank) | ||
`$.testID` within `payload` column in `pages.YYYY_MM_DD_client`, `wptid` column in `summary_pages.YYYY_MM_DD_client` | [`wptid`](/reference/tables/pages/#wptid) | ||
`payload` in `pages.YYYY_MM_DD_client` | [`payload`](/reference/tables/pages/#payload) | ||
`req*`, `resp*`, and other columns in `summary_pages.YYYY_MM_DD_client` | [`summary`](/reference/tables/pages/#summary) | ||
`$.CUSTOM_METRIC_NAME` within `payload` column in `pages.YYYY_MM_DD_client` | [`custom_metrics`](/reference/tables/pages/#custom_metrics) | ||
`report` in `lighthouse.YYYY_MM_DD_client` | [`lighthouse`](/reference/tables/pages/#lighthouse) | ||
`feature`, `type`, `id` in `blink_features.features` | `feature`, `type`, `id` in [`features`](/reference/tables/pages/#features) | ||
`category`, `app`, `info` in `technologies.YYYY_MM_DD_client` | `categories`, `technology`, `info` in [`technologies`](/reference/tables/pages/#technologies) | ||
not available | [`metadata`](/reference/tables/pages/#metadata) | ||
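
To illustrate the first two rows of the table, here is a minimal sketch. Where a legacy query would read a wildcard table such as `httparchive.pages.2024_06_01_*` and take the client from `_TABLE_SUFFIX`, the equivalent against `all.pages` filters on `date` and groups by `client`. The date is only an example value.

```sql
-- Sketch only: count root pages per client for a single crawl date.
-- The date and client are now ordinary columns rather than parts of the table name.
SELECT
  client,
  COUNT(0) AS pages
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
  AND is_root_page
GROUP BY
  client
ORDER BY
  client
```

Dropping the `is_root_page` condition would also count any non-root pages recorded for that date.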
|
||
### Page query updates | ||
|
||
- Migrate from `blink_features.features` | ||
|
||
<Tabs> | ||
<TabItem label="Before"> | ||
```sql | ||
SELECT | ||
url, | ||
feature, | ||
type, | ||
id | ||
FROM `httparchive.blink_features.features` | ||
WHERE yyyymmdd = DATE('2024-05-01') | ||
AND client = 'desktop' | ||
``` | ||
</TabItem> | ||
<TabItem label="After"> | ||
```sql | ||
SELECT | ||
page, | ||
features.feature, | ||
features.type, | ||
features.id | ||
FROM `httparchive.all.pages`, | ||
UNNEST (features) AS features | ||
WHERE date = '2024-06-01' | ||
AND client = 'desktop' | ||
AND is_root_page | ||
``` | ||
</TabItem> | ||
</Tabs> | ||
|
||
- Migrate from `lighthouse.YYYY_MM_DD_client` | ||
|
||
<Tabs> | ||
<TabItem label="Before"> | ||
```sql | ||
SELECT | ||
url, | ||
JSON_QUERY(report, '$.audits.largest-contentful-paint.numericValue') AS LCP, | ||
FROM `httparchive.lighthouse.2024_06_01_desktop` | ||
``` | ||
</TabItem> | ||
<TabItem label="After"> | ||
```sql | ||
/* This query will process 17 TB when run. */ | ||
SELECT | ||
page, | ||
JSON_QUERY(lighthouse, '$.audits.largest-contentful-paint.numericValue') AS LCP, | ||
FROM `httparchive.all.pages` | ||
WHERE date = '2024-06-01' | ||
AND client = 'desktop' | ||
AND is_root_page | ||
``` | ||
</TabItem> | ||
</Tabs> | ||
|
||
- Migrate from `pages.YYYY_MM_DD_client` | ||
|
||
<Tabs> | ||
<TabItem label="Before"> | ||
```sql | ||
SELECT | ||
url, | ||
_TABLE_SUFFIX AS client, | ||
JSON_QUERY(payload, '$.testID') AS testID, | ||
-- JSON with the results of the custom metrics, | ||
JSON_QUERY(payload, '$._privacy') AS custom_metrics, | ||
FROM `httparchive.pages.2022_06_01_*` | ||
``` | ||
</TabItem> | ||
<TabItem label="After"> | ||
```sql | ||
SELECT | ||
page, | ||
client, | ||
wptid, | ||
-- JSON with the results of the custom metrics, | ||
JSON_QUERY(custom_metrics, '$.privacy') AS custom_metrics, | ||
FROM `httparchive.all.pages` | ||
WHERE date = '2022-06-01' | ||
AND is_root_page | ||
``` | ||
</TabItem> | ||
</Tabs> | ||
|
||
- Migrate from `summary_pages.YYYY_MM_DD_client` | ||
|
||
<Tabs> | ||
<TabItem label="Before"> | ||
```sql | ||
SELECT | ||
numDomains, | ||
COUNT(0) pages, | ||
ROUND(AVG(reqTotal),2) avg_requests, | ||
FROM `httparchive.summary_pages.2024_06_01_desktop` | ||
GROUP BY | ||
numDomains | ||
HAVING | ||
pages > 1000 | ||
ORDER BY | ||
numDomains ASC | ||
``` | ||
</TabItem> | ||
<TabItem label="After"> | ||
```sql | ||
/* This query will process 110 GB when run. */ | ||
SELECT | ||
CAST(JSON_VALUE(summary, '$.numDomains') AS INT64) AS numDomains, | ||
COUNT(0) pages, | ||
    ROUND(AVG(CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64)),2) AS avg_requests, | ||
FROM `httparchive.all.pages` | ||
WHERE date = '2024-06-01' | ||
AND client = 'desktop' | ||
AND is_root_page | ||
GROUP BY | ||
numDomains | ||
HAVING | ||
pages > 1000 | ||
ORDER BY | ||
numDomains ASC | ||
``` | ||
</TabItem> | ||
</Tabs> | ||
|
||
- Migrate from `technologies.YYYY_MM_DD_client` | ||
|
||
<Tabs> | ||
<TabItem label="Before"> | ||
```sql | ||
SELECT | ||
url, | ||
category, | ||
app, | ||
info | ||
FROM `httparchive.technologies.2024_06_01_desktop` | ||
``` | ||
</TabItem> | ||
<TabItem label="After"> | ||
```sql | ||
/* This query will process 27 GB when run. */ | ||
SELECT | ||
page, | ||
technologies.categories, | ||
technologies.technology, | ||
technologies.info | ||
FROM `httparchive.all.pages`, | ||
UNNEST (technologies) AS technologies | ||
WHERE date = '2024-06-01' | ||
AND client = 'desktop' | ||
AND is_root_page | ||
``` | ||
|
||
</TabItem> | ||
</Tabs> | ||
|
||
## Migrating to `all.requests` | ||
|
||
### Request data schemas comparison | ||
|
||
Previously | `all.requests` | ||
---|--- | ||
date in a table name | [`date`](/reference/tables/requests/#date) | ||
client as `_TABLE_SUFFIX` | [`client`](/reference/tables/requests/#client) | ||
`page` in `requests.YYYY_MM_DD_client` | [`page`](/reference/tables/requests/#page) | ||
not available | [`is_root_page`](/reference/tables/requests/#is_root_page) | ||
not available | [`root_page`](/reference/tables/requests/#root_page) | ||
`url` in `requests.YYYY_MM_DD_client` | [`url`](/reference/tables/requests/#url) | ||
`firstHtml` in `summary_requests.YYYY_MM_DD_client` | [`is_main_document`](/reference/tables/requests/#is_main_document) | ||
`type` in `summary_requests.YYYY_MM_DD_client` | [`type`](/reference/tables/requests/#type) | ||
`$._index` within `payload` in `requests.YYYY_MM_DD_client` | [`index`](/reference/tables/requests/#index) | ||
`payload` column in `requests.YYYY_MM_DD_client` | [`payload`](/reference/tables/requests/#payload) | ||
`req*`, `resp*`, and other columns in `summary_requests.YYYY_MM_DD_client` | [`summary`](/reference/tables/requests/#summary) | ||
`req_*` and `reqOtherHeaders` in `almanac.requests` | [`request_headers`](/reference/tables/requests/#request_headers) | ||
`resp_*` and `respOtherHeaders` in `almanac.requests` | [`response_headers`](/reference/tables/requests/#response_headers) | ||
`body` in `response_bodies.YYYY_MM_DD_client` | [`response_body`](/reference/tables/requests/#response_body) | ||
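
As a small illustration of the columns that moved to the top level of the table, the sketch below groups requests by `type`, which previously had to be read from `summary_requests.YYYY_MM_DD_client`. It assumes `type` is a plain string column, as the table above suggests; the date and client are example values.

```sql
-- Sketch only: number of requests per resource type on root pages.
-- `type` is read directly from the table, with no JSON extraction needed.
SELECT
  type,
  COUNT(0) AS requests
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
  AND client = 'desktop'
  AND is_root_page
GROUP BY
  type
ORDER BY
  requests DESC
```

Adding `AND is_main_document` to the `WHERE` clause would restrict the count to main documents, the counterpart of the legacy `firstHtml` flag.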
|
||
### Request query updates | ||
|
||
- Migrate from `almanac.requests` | ||
|
||
<Tabs> | ||
<TabItem label="Before"> | ||
```sql | ||
SELECT | ||
LOWER(JSON_VALUE(request_headers, '$.name')) AS header_name, | ||
JSON_VALUE(request_headers, '$.value') AS header_value, | ||
FROM `httparchive.almanac.requests`, | ||
UNNEST(JSON_QUERY_ARRAY(request_headers)) AS request_headers | ||
WHERE date = '2024-06-01' | ||
AND client = 'desktop' | ||
AND firstHtml | ||
``` | ||
</TabItem> | ||
<TabItem label="After"> | ||
```sql | ||
SELECT | ||
LOWER(request_headers.name) AS header_name, | ||
request_headers.value AS header_value, | ||
FROM `httparchive.all.requests`, | ||
UNNEST(request_headers) AS request_headers | ||
WHERE date = '2024-06-01' | ||
AND client = 'desktop' | ||
AND is_main_document | ||
AND is_root_page | ||
``` | ||
</TabItem> | ||
</Tabs> | ||
|
||
- Migrate from `requests.YYYY_MM_DD_client` | ||
|
||
<Tabs> | ||
<TabItem label="Before"> | ||
```sql | ||
SELECT | ||
page, | ||
url, | ||
JSON_VALUE(payload, '$.response.content.mimeType') AS mimeType, | ||
CAST(JSON_VALUE(payload, '$.response.bodySize') AS INT64) AS respBodySize, | ||
FROM `httparchive.requests.2024_06_01_desktop` | ||
``` | ||
</TabItem> | ||
<TabItem label="After"> | ||
```sql | ||
SELECT | ||
page, | ||
url, | ||
JSON_VALUE(summary, '$.mimeType') AS mimeType, | ||
CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64) AS respBodySize, | ||
FROM `httparchive.all.requests` | ||
WHERE date = '2024-06-01' | ||
AND client = 'desktop' | ||
AND is_root_page | ||
``` | ||
</TabItem> | ||
</Tabs> | ||
|
||
- Migrate from `response_bodies.YYYY_MM_DD_client` | ||
|
||
|
||
<Tabs> | ||
<TabItem label="Before"> | ||
```sql | ||
SELECT | ||
page, | ||
url, | ||
    BYTE_LENGTH(body) AS bodySize | ||
FROM `httparchive.response_bodies.2024_06_01_desktop` | ||
``` | ||
</TabItem> | ||
<TabItem label="After"> | ||
```sql | ||
/* This query will process 174 TB when run. */ | ||
SELECT | ||
page, | ||
url, | ||
BYTE_LENGTH(response_body) AS bodySize | ||
FROM `httparchive.all.requests` | ||
WHERE date = '2024-06-01' | ||
AND client = 'desktop' | ||
AND is_root_page | ||
``` | ||
</TabItem> | ||
</Tabs> | ||
|
||
- Migrate from `summary_requests.YYYY_MM_DD_client` | ||
|
||
|
||
<Tabs> | ||
<TabItem label="Before"> | ||
```sql | ||
SELECT | ||
ROUND(respBodySize/1024/100)*100 AS responseSize100KB, | ||
COUNT(0) requests, | ||
FROM `httparchive.summary_requests.2024_06_01_desktop` | ||
GROUP BY responseSize100KB | ||
HAVING responseSize100KB > 0 | ||
ORDER BY responseSize100KB ASC | ||
``` | ||
</TabItem> | ||
<TabItem label="After"> | ||
```sql | ||
/* This query will process 10 TB when run. */ | ||
SELECT | ||
ROUND(CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64)/1024/100)*100 AS responseSize100KB, | ||
COUNT(0) requests, | ||
FROM `httparchive.all.requests` | ||
WHERE date = '2024-06-01' | ||
AND client = 'desktop' | ||
AND is_root_page | ||
GROUP BY responseSize100KB | ||
HAVING responseSize100KB > 0 | ||
ORDER BY responseSize100KB ASC | ||
``` | ||
</TabItem> | ||
</Tabs> |