Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support HTML extractors #547

Open
wants to merge 21 commits into
base: feat/manifest-v2
Choose a base branch
from

Conversation

piotr-roslaniec
Copy link
Contributor

@piotr-roslaniec piotr-roslaniec commented Mar 3, 2025

Related Issues

Summary

Added HTML extraction support to the manifest response matching system, allowing users to extract and validate data from HTML responses using CSS selectors.

Changes

  • Added HTML as a supported format in the DataFormat enum
  • Extended the Extractor struct with an optional attribute field for HTML-specific extraction
  • Implemented HTML parsing and extraction using the tl crate
  • Added comprehensive tests for HTML extraction functionality
  • Created test fixtures for HTML extraction validation

Reviewer Notes

  • CSS selector implementation follows a hierarchical approach for consistency with JSON path extraction
  • The attribute field is optional and defaults to extracting text content when not specified
  • Type conversion follows the same rules as JSON extraction (string, number, boolean, array)

Required Reviews

  • The high amount of lines added is due to the text fixture. Be not afraid.
  • Minimum number of reviews before merge: 2

@piotr-roslaniec piotr-roslaniec linked an issue Mar 3, 2025 that may be closed by this pull request
This was referenced Mar 4, 2025
@piotr-roslaniec piotr-roslaniec force-pushed the feat/html-extractors#536 branch from efa4bd4 to 53070ce Compare March 5, 2025 10:08
@piotr-roslaniec piotr-roslaniec force-pushed the feat/html-extractors#536 branch from 744df08 to 99fc11a Compare March 6, 2025 12:27
@piotr-roslaniec piotr-roslaniec marked this pull request as ready for review March 6, 2025 12:57
Copy link
Contributor

@Autoparallel Autoparallel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bravo!

left comments, but nothing blocking :)

"proving": {
"manifest": {
"manifestVersion": "1",
"id": "wikipedia-claude-shannon",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dude, i love Shannon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original Claude

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you wrote this by hand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the memory, yes

/// Raw JSON value returned by a notary.
pub json: Option<serde_json::Value>,
/// Raw response body from the notary
pub body: Option<Vec<u8>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +43 to +46
let extractor: Box<dyn DocumentExtractor> = match self.format {
DataFormat::Json => Box::new(JsonDocumentExtractor),
DataFormat::Html => Box::new(HtmlDocumentExtractor),
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let extractor: Box<dyn DocumentExtractor> = match self.format {
DataFormat::Json => Box::new(JsonDocumentExtractor),
DataFormat::Html => Box::new(HtmlDocumentExtractor),
};
let extractor: impl DocumentExtractor = match self.format {
DataFormat::Json => JsonDocumentExtractor,
DataFormat::Html => HtmlDocumentExtractor,
};

does this work? Ever since return position impl trait was stabilized, I think we don't need to Box<dyn Trait> as much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might need an as impl DocumentExtractor or something in there, but i'm 99% sure this is doable.

required: false, // Optional
predicates: vec![],
},
extractor!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we want to use a macro here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Macros feel like a last resort to me, but i bet you have a good reason

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason. I use macros as a constructor in tests. IIRC this was for consistency.

assert!(matches!(result, Err(ExtractorError::InvalidHtml(_))));
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really appreciate this test driven development that you do. It makes it really easy to see how what you've added works and what it handles. Thanks!

impl TryFrom<&str> for ExtractorType {
type Error = ExtractorError;

fn try_from(value: &str) -> Result<Self, Self::Error> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth doing value.to_lowercase() here to make this a bit more robust?


/// A data extractor configuration
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct Extractor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It almost seems like having Extractor be generic over a Parser or something would be handy in the future? Especially if we have nested extraction to do. i.e., JSON inside of HTML.

Don't need to do it now, I'm just pontificating.

pub attribute: Option<String>,
}

impl Extractor {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty impl

@@ -111,3 +112,199 @@ fn test_coinbase_extraction() {
"8.967465955899945"
);
}

#[test]
fn test_website_extraction() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is so sick

Copy link
Collaborator

@lonerapier lonerapier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome work on this, no changes from me.

these tests are just 🤌


/// The result of an extraction operation
#[derive(Debug, Clone, Default, Serialize, PartialEq)]
pub struct ExtractionResult {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to make this:

pub struct ExtractionOutput {
   values: HashMap<String, Result<Value, ExtractionError>>
}

so that it's automatically indexed by extraction id, and can be lazily evaluated client side for actual value or error

Comment on lines +109 to +110
#[serde(skip_serializing_if = "Option::is_none")]
pub attribute: Option<String>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make Extractor an enum and keep a CommonExtractorConfig in both, to prevent configs interfering with each other

pub struct Extractor {
  Json(JsonExtractorConfig),
  Html(HtmlExtractorConfig),
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Support HTML data format in Manifest matching
3 participants