feat: Support HTML extractors #547
base: feat/manifest-v2
efa4bd4 to 53070ce
test: add predicate tests
refactor: create predicate error type with concrete cases
feat: extract values from notary response
test: update http tests
chore: add more concrete extractor error types
refactor: use extractor error in return types
feat: add string and array predicate types
refactor: initial integration with notary
test: test regex predicates
744df08 to 99fc11a
d4e64b7 to 41f140c
41f140c to 4cc978c
bravo!
left comments, but nothing blocking :)

"proving": {
  "manifest": {
    "manifestVersion": "1",
    "id": "wikipedia-claude-shannon",
dude, I love Shannon.
The original Claude
I assume you wrote this by hand.
From memory, yes
/// Raw JSON value returned by a notary.
pub json: Option<serde_json::Value>,
/// Raw response body from the notary
pub body: Option<Vec<u8>>,
👍
let extractor: Box<dyn DocumentExtractor> = match self.format {
    DataFormat::Json => Box::new(JsonDocumentExtractor),
    DataFormat::Html => Box::new(HtmlDocumentExtractor),
};
- let extractor: Box<dyn DocumentExtractor> = match self.format {
-     DataFormat::Json => Box::new(JsonDocumentExtractor),
-     DataFormat::Html => Box::new(HtmlDocumentExtractor),
- };
+ let extractor: impl DocumentExtractor = match self.format {
+     DataFormat::Json => JsonDocumentExtractor,
+     DataFormat::Html => HtmlDocumentExtractor,
+ };
Does this work? Ever since return-position impl Trait was stabilized, I think we don't need to Box<dyn Trait> as much.
You might need an as impl DocumentExtractor or something in there, but I'm 99% sure this is doable.
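
On the suggestion above: as far as I know, stable Rust does not accept `impl Trait` as the type of a `let` binding, and the arms of a `match` must share one concrete type, so the `Box<dyn DocumentExtractor>` version is probably still needed here. Below is a minimal sketch of two shapes that do compile: dynamic dispatch via `Box<dyn Trait>`, and static dispatch via a wrapper enum. The `format_name` method and the `AnyExtractor` enum are hypothetical, made up for illustration; only the other type names come from the diff.

```rust
// Hypothetical stand-in for the PR's trait; the real one has other methods.
trait DocumentExtractor {
    fn format_name(&self) -> &'static str;
}

struct JsonDocumentExtractor;
struct HtmlDocumentExtractor;

impl DocumentExtractor for JsonDocumentExtractor {
    fn format_name(&self) -> &'static str {
        "json"
    }
}

impl DocumentExtractor for HtmlDocumentExtractor {
    fn format_name(&self) -> &'static str {
        "html"
    }
}

#[derive(Clone, Copy)]
enum DataFormat {
    Json,
    Html,
}

// Option 1: dynamic dispatch, the shape used in the PR.
// Both arms coerce to the same type, Box<dyn DocumentExtractor>.
fn boxed_extractor(format: DataFormat) -> Box<dyn DocumentExtractor> {
    match format {
        DataFormat::Json => Box::new(JsonDocumentExtractor),
        DataFormat::Html => Box::new(HtmlDocumentExtractor),
    }
}

// Option 2: static dispatch through a wrapper enum (hypothetical name).
// Both arms return AnyExtractor, so the match type-checks without a Box.
enum AnyExtractor {
    Json(JsonDocumentExtractor),
    Html(HtmlDocumentExtractor),
}

impl DocumentExtractor for AnyExtractor {
    fn format_name(&self) -> &'static str {
        match self {
            AnyExtractor::Json(inner) => inner.format_name(),
            AnyExtractor::Html(inner) => inner.format_name(),
        }
    }
}

fn enum_extractor(format: DataFormat) -> AnyExtractor {
    match format {
        DataFormat::Json => AnyExtractor::Json(JsonDocumentExtractor),
        DataFormat::Html => AnyExtractor::Html(HtmlDocumentExtractor),
    }
}
```

The enum route costs a forwarding `impl` per trait method, so with only two formats the boxed version may well be the simpler choice.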
        required: false, // Optional
        predicates: vec![],
    },
    extractor!(
why do we want to use a macro here?
Macros feel like a last resort to me, but I bet you have a good reason
No particular reason. I use macros as a constructor in tests. IIRC this was for consistency.
    assert!(matches!(result, Err(ExtractorError::InvalidHtml(_))));
}

#[test]
I really appreciate the test-driven development that you do. It makes it really easy to see how what you've added works and what it handles. Thanks!
impl TryFrom<&str> for ExtractorType {
    type Error = ExtractorError;

    fn try_from(value: &str) -> Result<Self, Self::Error> {
Is it worth doing value.to_lowercase() here to make this a bit more robust?
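
For reference, a rough sketch of what the lowercase normalization could look like. The variants of `ExtractorType` and `ExtractorError` are not visible in this thread, so the ones below (`String`, `Number`, `Boolean`, `UnsupportedType`) are placeholders and not the PR's actual definitions.

```rust
use std::convert::TryFrom;

// Placeholder variants; the PR's real ExtractorType may differ.
#[derive(Debug, PartialEq)]
enum ExtractorType {
    String,
    Number,
    Boolean,
}

// Placeholder error variant, assumed for this sketch.
#[derive(Debug, PartialEq)]
enum ExtractorError {
    UnsupportedType(String),
}

impl TryFrom<&str> for ExtractorType {
    type Error = ExtractorError;

    fn try_from(value: &str) -> Result<Self, Self::Error> {
        // Normalize case once so "String", "STRING" and "string" all match.
        match value.to_lowercase().as_str() {
            "string" => Ok(ExtractorType::String),
            "number" => Ok(ExtractorType::Number),
            "boolean" => Ok(ExtractorType::Boolean),
            // Report the original spelling, not the lowercased one.
            _ => Err(ExtractorError::UnsupportedType(value.to_string())),
        }
    }
}
```

Keeping the caller's original spelling in the error should make a bad manifest easier to debug than echoing the normalized value.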
/// A data extractor configuration
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct Extractor {
It almost seems like having Extractor be generic over a Parser or something would be handy in the future? Especially if we have nested extraction to do, i.e., JSON inside of HTML.
Don't need to do it now, I'm just pontificating.
    pub attribute: Option<String>,
}

impl Extractor {}
empty impl
@@ -111,3 +112,199 @@ fn test_coinbase_extraction() {
        "8.967465955899945"
    );
}

#[test]
fn test_website_extraction() {
this is so sick
awesome work on this, no changes from me.
these tests are just 🤌
/// The result of an extraction operation
#[derive(Debug, Clone, Default, Serialize, PartialEq)]
pub struct ExtractionResult {
Does it make sense to make this:
pub struct ExtractionOutput {
    values: HashMap<String, Result<Value, ExtractionError>>
}
so that it's automatically indexed by extraction id, and can be lazily evaluated client-side for the actual value or error?
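
A small sketch of how that shape might be used, to make the trade-off concrete. To stay dependency-free it swaps `serde_json::Value` for a tiny stand-in `Value` enum, and the `ExtractionError::MissingField` variant plus the `record`, `get` and `failed_ids` helpers are all hypothetical.

```rust
use std::collections::HashMap;

// Stand-in for serde_json::Value so the sketch needs no external crates.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Text(String),
    Number(f64),
}

// Hypothetical error variant, assumed for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum ExtractionError {
    MissingField(String),
}

// One entry per extractor id: either the extracted value or why it failed.
#[derive(Debug, Default)]
pub struct ExtractionOutput {
    values: HashMap<String, Result<Value, ExtractionError>>,
}

impl ExtractionOutput {
    /// Store the outcome of a single extractor under its id.
    pub fn record(&mut self, id: &str, outcome: Result<Value, ExtractionError>) {
        self.values.insert(id.to_string(), outcome);
    }

    /// Look up one extractor's outcome; the caller decides how to treat an Err.
    pub fn get(&self, id: &str) -> Option<&Result<Value, ExtractionError>> {
        self.values.get(id)
    }

    /// Ids of all extractors that failed, sorted for stable output.
    pub fn failed_ids(&self) -> Vec<&str> {
        let mut ids: Vec<&str> = self
            .values
            .iter()
            .filter(|(_, outcome)| outcome.is_err())
            .map(|(id, _)| id.as_str())
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

With this layout a client can ignore failures for optional extractors and only fail hard on the ids it actually needs.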
#[serde(skip_serializing_if = "Option::is_none")]
pub attribute: Option<String>,
I think we can make Extractor an enum and keep a CommonExtractorConfig in both, to prevent configs interfering with each other:
pub enum Extractor {
    Json(JsonExtractorConfig),
    Html(HtmlExtractorConfig),
}
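
To flesh out the enum idea a little, here is one possible layout. The field names `id`, `path` and `selector` are guesses made for illustration (only `required` and `attribute` appear in the diff), and the `common()` accessor is hypothetical.

```rust
// Settings shared by every extractor kind; field names partly assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct CommonExtractorConfig {
    pub id: String,
    pub required: bool,
}

// JSON-only settings: `path` is a hypothetical key path into the document.
#[derive(Debug, Clone, PartialEq)]
pub struct JsonExtractorConfig {
    pub common: CommonExtractorConfig,
    pub path: Vec<String>,
}

// HTML-only settings: `attribute` can no longer be set on a JSON extractor.
#[derive(Debug, Clone, PartialEq)]
pub struct HtmlExtractorConfig {
    pub common: CommonExtractorConfig,
    pub selector: String,
    pub attribute: Option<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Extractor {
    Json(JsonExtractorConfig),
    Html(HtmlExtractorConfig),
}

impl Extractor {
    /// Access the shared settings without caring about the format.
    pub fn common(&self) -> &CommonExtractorConfig {
        match self {
            Extractor::Json(config) => &config.common,
            Extractor::Html(config) => &config.common,
        }
    }
}
```

The upside is that format-specific fields become unrepresentable on the wrong variant; the cost is one extra level of nesting wherever shared fields are read.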
Related Issues
Summary
Added HTML extraction support to the manifest response matching system, allowing users to extract and validate data from HTML responses using CSS selectors.
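
As a rough illustration of the optional `attribute` behaviour (text content when it is not set, the named attribute otherwise), here is a dependency-free sketch. The `Element` struct and `extract_value` function are hypothetical stand-ins for an already-selected node; the PR itself parses HTML with the `tl` crate, which is not used here.

```rust
use std::collections::HashMap;

// Hypothetical, already-selected HTML element; not the PR's or tl's type.
pub struct Element {
    pub text: String,
    pub attributes: HashMap<String, String>,
}

/// Return the named attribute if one is requested, else the text content.
/// Yields None when the requested attribute is absent on the element.
pub fn extract_value(element: &Element, attribute: Option<&str>) -> Option<String> {
    match attribute {
        Some(name) => element.attributes.get(name).cloned(),
        None => Some(element.text.clone()),
    }
}
```

For a link element with text "Claude Shannon" and an `href` attribute, passing `None` would yield the text and passing `Some("href")` would yield the URL.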
Changes
- DataFormat enum
- Extractor struct with an optional attribute field for HTML-specific extraction
- tl crate
Reviewer Notes
- attribute field is optional and defaults to extracting text content when not specified
Required Reviews