feat: Support HTML extractors #547

piotr-roslaniec · 2025-03-03T15:32:34Z

Related Issues

Closes feat: Support HTML data format in Manifest matching #536

Summary

Added HTML extraction support to the manifest response matching system, allowing users to extract and validate data from HTML responses using CSS selectors.

Changes

Added HTML as a supported format in the DataFormat enum
Extended the Extractor struct with an optional attribute field for HTML-specific extraction
Implemented HTML parsing and extraction using the tl crate
Added comprehensive tests for HTML extraction functionality
Created test fixtures for HTML extraction validation

Reviewer Notes

CSS selector implementation follows a hierarchical approach for consistency with JSON path extraction
The attribute field is optional and defaults to extracting text content when not specified
Type conversion follows the same rules as JSON extraction (string, number, boolean, array)
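
To illustrate the second note, here is a minimal, std-only sketch of the "attribute or text content" rule. The `Element` struct, `extract_value`, and `sample_link` are hypothetical stand-ins, not the PR's code (which parses HTML with the tl crate), and returning `None` for a missing attribute is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed HTML element; the PR itself uses the `tl` crate.
struct Element {
    text: String,
    attributes: HashMap<String, String>,
}

/// Returns the named attribute when `attribute` is set, otherwise the text content.
/// Assumption: a missing attribute yields `None` rather than an error.
fn extract_value(element: &Element, attribute: Option<&str>) -> Option<String> {
    match attribute {
        Some(name) => element.attributes.get(name).cloned(),
        None => Some(element.text.clone()),
    }
}

/// Builds an element roughly equivalent to `<a href="/wiki/Claude_Shannon">Claude Shannon</a>`.
fn sample_link() -> Element {
    let mut attributes = HashMap::new();
    attributes.insert("href".to_string(), "/wiki/Claude_Shannon".to_string());
    Element { text: "Claude Shannon".to_string(), attributes }
}

fn main() {
    let link = sample_link();
    // No attribute configured: fall back to the text content.
    println!("{:?}", extract_value(&link, None));
    // Attribute configured: read that attribute instead.
    println!("{:?}", extract_value(&link, Some("href")));
}
```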

Required Reviews

The large number of added lines is due to the test fixture. Be not afraid.
Minimum number of reviews before merge: 2

test: add predicate tests
refactor: create predicate error type with concrete cases
feat: extract values from notary response
test: update http tests
chore: add more concrete extractor error types
refactor: use extractor error in return types
feat: add string and array predicate types
refactor: initial integration with notary
test: test regex predicates

Autoparallel

bravo!

left comments, but nothing blocking :)

Autoparallel · 2025-03-08T12:54:31Z

fixture/client.html.tee_tcp_local.json

+  "proving": {
+    "manifest": {
+      "manifestVersion": "1",
+      "id": "wikipedia-claude-shannon",


dude, i love Shannon.

The original Claude

Autoparallel · 2025-03-08T13:00:50Z

web-prover-core/fixtures/website.html

I assume you wrote this by hand.

From memory, yes

Autoparallel · 2025-03-08T13:02:59Z

web-prover-core/src/http.rs

-  /// Raw JSON value returned by a notary.
-  pub json: Option<serde_json::Value>,
+  /// Raw response body from the notary
+  pub body: Option<Vec<u8>>,


Autoparallel · 2025-03-08T13:09:21Z

web-prover-core/src/parser/config.rs

+    let extractor: Box<dyn DocumentExtractor> = match self.format {
+      DataFormat::Json => Box::new(JsonDocumentExtractor),
+      DataFormat::Html => Box::new(HtmlDocumentExtractor),
+    };


Suggested change

-    let extractor: Box<dyn DocumentExtractor> = match self.format {
-      DataFormat::Json => Box::new(JsonDocumentExtractor),
-      DataFormat::Html => Box::new(HtmlDocumentExtractor),
-    };
+    let extractor: impl DocumentExtractor = match self.format {
+      DataFormat::Json => JsonDocumentExtractor,
+      DataFormat::Html => HtmlDocumentExtractor,
+    };

does this work? Ever since return position impl trait was stabilized, I think we don't need to Box<dyn Trait> as much.

You might need an as impl DocumentExtractor or something in there, but i'm 99% sure this is doable.
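
A note on the question above: on stable Rust, `impl Trait` is not accepted as the type of a `let` binding, and the arms of a `match` must all evaluate to one concrete type, so the suggestion is unlikely to compile as written. If avoiding `Box<dyn Trait>` is the goal, one common alternative is enum dispatch. The sketch below is a simplified illustration, not the PR's code: the `format_name` method and the `AnyExtractor` and `extractor_for` names are hypothetical.

```rust
/// Hypothetical, simplified version of the trait named in the diff.
trait DocumentExtractor {
    fn format_name(&self) -> &'static str;
}

struct JsonDocumentExtractor;
struct HtmlDocumentExtractor;

impl DocumentExtractor for JsonDocumentExtractor {
    fn format_name(&self) -> &'static str { "json" }
}

impl DocumentExtractor for HtmlDocumentExtractor {
    fn format_name(&self) -> &'static str { "html" }
}

#[derive(Clone, Copy)]
enum DataFormat { Json, Html }

/// Enum dispatch: both extractors are wrapped in a single concrete type,
/// so every match arm has the same type and no heap allocation is needed.
enum AnyExtractor {
    Json(JsonDocumentExtractor),
    Html(HtmlDocumentExtractor),
}

impl DocumentExtractor for AnyExtractor {
    fn format_name(&self) -> &'static str {
        // Forward the call to whichever extractor is wrapped.
        match self {
            AnyExtractor::Json(e) => e.format_name(),
            AnyExtractor::Html(e) => e.format_name(),
        }
    }
}

fn extractor_for(format: DataFormat) -> AnyExtractor {
    match format {
        DataFormat::Json => AnyExtractor::Json(JsonDocumentExtractor),
        DataFormat::Html => AnyExtractor::Html(HtmlDocumentExtractor),
    }
}

fn main() {
    println!("{}", extractor_for(DataFormat::Html).format_name());
}
```

The trade-off is that each new format needs a new enum variant and forwarding arm, whereas `Box<dyn DocumentExtractor>` stays open to new implementations at the cost of an allocation and dynamic dispatch.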

Autoparallel · 2025-03-08T13:12:55Z

web-prover-core/src/parser/config.rs

-          required:       false, // Optional
-          predicates:     vec![],
-        },
+        extractor!(


why do we want to use a macro here?

Macros feel like a last resort to me, but i bet you have a good reason

No particular reason. I use macros as a constructor in tests. IIRC this was for consistency.

Autoparallel · 2025-03-08T13:15:40Z

web-prover-core/src/parser/config.rs

+    assert!(matches!(result, Err(ExtractorError::InvalidHtml(_))));
+  }
+
+  #[test]


I really appreciate the test-driven development that you do. It makes it really easy to see how what you've added works and what it handles. Thanks!

Autoparallel · 2025-03-08T13:17:28Z

web-prover-core/src/parser/extractors/types.rs

+impl TryFrom<&str> for ExtractorType {
+  type Error = ExtractorError;
+
+  fn try_from(value: &str) -> Result<Self, Self::Error> {


Is it worth doing value.to_lowercase() here to make this a bit more robust?
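
A minimal sketch of what the case-insensitive version could look like. The variant names are assumed from the types listed in the PR description (string, number, boolean, array), and `ExtractorError::UnsupportedType` is a hypothetical variant used only for illustration.

```rust
use std::convert::TryFrom;

/// Assumed variants, based on the types listed in the PR description.
#[derive(Debug, PartialEq)]
enum ExtractorType {
    String,
    Number,
    Boolean,
    Array,
}

/// Hypothetical error variant; the real `ExtractorError` has its own cases.
#[derive(Debug, PartialEq)]
enum ExtractorError {
    UnsupportedType(String),
}

impl TryFrom<&str> for ExtractorType {
    type Error = ExtractorError;

    fn try_from(value: &str) -> Result<Self, Self::Error> {
        // Lowercase once so "String", "STRING", and "string" are all accepted.
        match value.to_lowercase().as_str() {
            "string" => Ok(ExtractorType::String),
            "number" => Ok(ExtractorType::Number),
            "boolean" => Ok(ExtractorType::Boolean),
            "array" => Ok(ExtractorType::Array),
            // Report the original input, not the lowercased copy.
            _ => Err(ExtractorError::UnsupportedType(value.to_string())),
        }
    }
}

fn main() {
    println!("{:?}", ExtractorType::try_from("Boolean"));
    println!("{:?}", ExtractorType::try_from("date"));
}
```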

Autoparallel · 2025-03-08T13:19:23Z

web-prover-core/src/parser/extractors/types.rs

+
+/// A data extractor configuration
+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
+pub struct Extractor {


It almost seems like having Extractor be generic over a Parser or something would be handy in the future? Especially if we have nested extraction to do, e.g., JSON inside of HTML.

Don't need to do it now, I'm just pontificating.

Autoparallel · 2025-03-08T13:19:45Z

web-prover-core/src/parser/extractors/types.rs

+  pub attribute:      Option<String>,
+}
+
+impl Extractor {}


Autoparallel · 2025-03-08T13:20:32Z

web-prover-core/src/parser/test_fixtures.rs

@@ -111,3 +112,199 @@ fn test_coinbase_extraction() {
    "8.967465955899945"
  );
 }
+
+#[test]
+fn test_website_extraction() {


this is so sick

lonerapier

awesome work on this, no changes from me.

these tests are just 🤌

lonerapier · 2025-03-09T05:34:15Z

web-prover-core/src/parser/extractors/types.rs

+
+/// The result of an extraction operation
+#[derive(Debug, Clone, Default, Serialize, PartialEq)]
+pub struct ExtractionResult {


does it make sense to make this:

pub struct ExtractionOutput { values: HashMap<String, Result<Value, ExtractionError>> }

so that it's automatically indexed by extraction id, and can be lazily evaluated client side for actual value or error
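
A rough sketch of how that shape might be used, assuming a simplified `Value` in place of `serde_json::Value`. The `ExtractionError::NotFound` variant and the `record`, `get`, and `sample_output` helpers are hypothetical and only illustrate the idea of keying results by extractor id.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `serde_json::Value`.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Text(String),
    Number(f64),
}

/// Hypothetical error case for illustration.
#[derive(Debug, Clone, PartialEq)]
enum ExtractionError {
    NotFound(String),
}

/// Results keyed by extractor id; each entry is either a value or its error.
#[derive(Debug, Default)]
struct ExtractionOutput {
    values: HashMap<String, Result<Value, ExtractionError>>,
}

impl ExtractionOutput {
    /// Stores the outcome of one extractor under its id.
    fn record(&mut self, id: &str, outcome: Result<Value, ExtractionError>) {
        self.values.insert(id.to_string(), outcome);
    }

    /// Looks up an outcome by id; the caller decides how to handle `Err`.
    fn get(&self, id: &str) -> Option<&Result<Value, ExtractionError>> {
        self.values.get(id)
    }
}

fn sample_output() -> ExtractionOutput {
    let mut output = ExtractionOutput::default();
    output.record("title", Ok(Value::Text("Claude Shannon".to_string())));
    output.record("born", Ok(Value::Number(1916.0)));
    output.record("missing", Err(ExtractionError::NotFound("missing".to_string())));
    output
}

fn main() {
    let output = sample_output();
    println!("{:?}", output.get("title"));
    println!("{:?}", output.get("missing"));
}
```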

lonerapier · 2025-03-09T05:39:27Z

web-prover-core/src/parser/extractors/types.rs

+  #[serde(skip_serializing_if = "Option::is_none")]
+  pub attribute:      Option<String>,


I think we can make Extractor an enum and keep a CommonExtractorConfig in both, to prevent configs interfering with each other

pub enum Extractor { Json(JsonExtractorConfig), Html(HtmlExtractorConfig), }
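
One way this could be fleshed out, as a sketch only. The `required` and `attribute` fields come from the diff; the `id` and `selector` fields and the `common` and `attribute` accessors are assumptions added for illustration, and serde derives are omitted to keep the example std-only.

```rust
/// Settings shared by every format. `id` and `selector` are assumed fields.
#[derive(Debug, Clone, PartialEq)]
struct CommonExtractorConfig {
    id: String,
    selector: Vec<String>,
    required: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct JsonExtractorConfig {
    common: CommonExtractorConfig,
}

/// Only the HTML variant carries `attribute`, so JSON configs cannot set it.
#[derive(Debug, Clone, PartialEq)]
struct HtmlExtractorConfig {
    common: CommonExtractorConfig,
    attribute: Option<String>,
}

#[derive(Debug, Clone, PartialEq)]
enum Extractor {
    Json(JsonExtractorConfig),
    Html(HtmlExtractorConfig),
}

impl Extractor {
    /// Shared settings, regardless of format.
    fn common(&self) -> &CommonExtractorConfig {
        match self {
            Extractor::Json(config) => &config.common,
            Extractor::Html(config) => &config.common,
        }
    }

    /// HTML-only setting; always `None` for JSON extractors.
    fn attribute(&self) -> Option<&str> {
        match self {
            Extractor::Json(_) => None,
            Extractor::Html(config) => config.attribute.as_deref(),
        }
    }
}

fn sample_html_extractor() -> Extractor {
    Extractor::Html(HtmlExtractorConfig {
        common: CommonExtractorConfig {
            id: "link".to_string(),
            selector: vec!["a".to_string()],
            required: true,
        },
        attribute: Some("href".to_string()),
    })
}

fn main() {
    let extractor = sample_html_extractor();
    println!("{} -> {:?}", extractor.common().id, extractor.attribute());
}
```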

piotr-roslaniec linked an issue Mar 3, 2025 that may be closed by this pull request

feat: Support HTML data format in Manifest matching #536

Open

This was referenced Mar 4, 2025

feat!: Manifest body extractors #535

Merged

feat!: Manifest v2 #541

Draft

piotr-roslaniec force-pushed the feat/html-extractors#536 branch from efa4bd4 to 53070ce on March 5, 2025 10:08

piotr-roslaniec added 19 commits March 6, 2025 13:16

(wip): sketch html extractor

06ec3a8

(wip): support multiple selectors in html format

5bd7483

(wip): add more expansive html parser tests

c036c15

(wip): break down complex functions

182915a

(wip): refactor: move extractors to a separate module

17914d7

(wip): refactor: move tests to their respective files

96f2370

(wip): refactor: split extractor file into smaller files

2ad2f04

(wip): refactor: refactor utils

ccd2c1c

(wip): refactor a dense function

ad85482

(wip): refactor test into a website fixture

72197b5

(wip): refactor json helpers

23c8d1e

(wip): refactor parser selector into a format file

6279ba7

(wip): refactor tests

4203120

(wip): update errors

c2e2fbe

(wip): refactor extraction logic

3cfce6f

(wip): self review

000d1ba

(wip): improve errors

f913261

(wip): integrate html format with the notary

99fc11a

piotr-roslaniec force-pushed the feat/html-extractors#536 branch from 744df08 to 99fc11a on March 6, 2025 12:27

chore: delete an empty file

487e10e

piotr-roslaniec marked this pull request as ready for review March 6, 2025 12:57

piotr-roslaniec requested review from mattes, devloper, drewjenkins and Autoparallel March 6, 2025 12:57

piotr-roslaniec requested review from 0xFloyd, lonerapier and 0xJepsen March 6, 2025 12:57

piotr-roslaniec force-pushed the feat/html-extractors#536 branch from d4e64b7 to 41f140c on March 6, 2025 12:58

test: fix a broken test

4cc978c

piotr-roslaniec force-pushed the feat/html-extractors#536 branch from 41f140c to 4cc978c on March 6, 2025 13:13

piotr-roslaniec mentioned this pull request Mar 7, 2025

feat!: Return actionable errors from Manifest validation #552

Open

Autoparallel approved these changes Mar 8, 2025


lonerapier approved these changes Mar 9, 2025

