Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix '$\/]' in HTML::Entities::encode_entities #45

Merged
merged 3 commits into from
Jul 24, 2024

Conversation

mauke
Copy link
Contributor

@mauke mauke commented Jul 9, 2024

encode_entities() generates and evals custom code at runtime (as a performance optimization). However, its code generation was too naive and certain characters could be used to break out of the character class regex:

  • $ (dollar sign) would trigger perl's variable interpolation
  • ] and / (the character class and regex delimiters, respectively)

The latter two were usually escaped, but not if they were preceded by \ in the input, even if that \ was itself escaped by another \ (NB: this is why it is generally a mistake to handle escaping logic with look-behinds: you need arbitrary-width look-behind to figure out whether the current chain of backslashes is even or odd in length in order to know what is being escaped or not).

The latter issue was fixed by doing a single pass over the input string with no look-behind; the former by switching the delimiter to ' (which inhibits interpolation).

Fixes #44.

@haarg
Copy link
Member

haarg commented Jul 9, 2024

I think it would be worth keeping parity with URI::Escape, which allows character classes like [:alpha:] to be used. That does not currently work, but it did in an earlier form of the code.

I fixed some similar issues in URI::Escape in libwww-perl/URI#112. It also eliminated the unneeded subref and string eval. I think I'd prefer to also fix that here as long as we're making changes. But the tests in this PR demonstrate a mistake in that PR. It doesn't properly account for all double \\ sequences.

encode_entities() generates and evals custom code at runtime (as a
performance optimization). However, its code generation was too naive
and certain characters could be used to break out of the character class
regex:

 - `$` (dollar sign) would trigger perl's variable interpolation
 - `]` and `/` (the character class and regex delimiters, respectively)

The latter two were usually escaped, but not if they were preceded by
`\` in the input, even if that `\` was itself escaped by another `\`
(NB: this is why it is generally a mistake to handle escaping logic with
look-behinds: you need arbitrary-width look-behind to figure out whether
the current chain of backslashes is even or odd in length in order to
know what is being escaped or not).

The latter issue was fixed by doing a single pass over the input string
with no look-behind; the former by switching the delimiter to `'` (which
inhibits interpolation).

Fixes libwww-perl#44.
@mauke mauke force-pushed the fix-encode-entities branch from 396f051 to 9e2ac53 Compare July 9, 2024 22:29
@mauke mauke force-pushed the fix-encode-entities branch from 9e2ac53 to 0b54570 Compare July 9, 2024 22:46
@mauke
Copy link
Contributor Author

mauke commented Jul 9, 2024

I've implemented support for POSIX-style character classes now, along with additional tests.

(The URI::Escape code doesn't seem to handle negated classes like [[:^alpha:]].)

This simplifies things a bit. Instead of coderefs, our cache now
contains only compiled regex objects. The actual s/// op (there is only
one now) lives outside, in the regular code of encode_entities.
@haarg
Copy link
Member

haarg commented Jul 10, 2024

It would be good to also test escaping character classes like \w. Otherwise, lgtm.

@oalders
Copy link
Member

oalders commented Jul 23, 2024

@haarg this just needs additional tests?

@haarg haarg merged commit 6ed3402 into libwww-perl:master Jul 24, 2024
9 checks passed
@haarg
Copy link
Member

haarg commented Jul 24, 2024

I added some tests after merging.

@oalders
Copy link
Member

oalders commented Jul 24, 2024

Thanks, @mauke and @haarg!

@oalders
Copy link
Member

oalders commented Jul 30, 2024

Version 3.83 just released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

encode_entities doesn't handle $ (dollar sign) or \/ (backslash, slash) as advertised
3 participants