Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for SPARQL syntax codepoint escapes and invalid codepoint use #67

Merged
merged 4 commits into from
Dec 8, 2020

Conversation

kasei
Copy link
Contributor

@kasei kasei commented Oct 19, 2020

Adds SPARQL syntax tests for codepoint escape sequences:

  • Tests using 4-byte escape variant (\UXXXXXXXX) for codepoints beyond U+FFFF
  • Tests to ensure that unescaping is done in a single pass (e.g. the results of unescaping \u sequences cannot produce characters that then participate in unescaping a \U sequence)

Fixes #64.

@afs
Copy link
Contributor

afs commented Oct 19, 2020

The way \u and \U are handled in SPARQL is different to Turtle, and IMO Turtle is right, SPARQL is wrong.

w3c/sparql-dev#77

In Turtle \u and \U can appear in certain places (Strings, URIs). URIs will be further checked because they get resolved. (Jena RIOT also checks for it being a valid codepoint.)

We can separately argue whether Turtle/SPARQL should allow them in prefixed names. It is not simply a matter of adding them, because there are syntax rules for prefix names that current the grammar enforces.

Maybe it would be better to have tests covering for what is best, not the mistakes of SPARQL 1.0.

@kasei
Copy link
Contributor Author

kasei commented Oct 19, 2020

@afs Agreed there's an unfortunate disconnect between SPARQL and Turtle. I'm just trying to get any tests into the test suite that exercise the 4-byte variation of the escaping (since it currently isn't used in any tests).

I could strip this down to only testing the escapes in places that Turtle might be accepting of if that's desirable, but having something testing these escapes seems important as there are existing parsers that don't seem to be able to handle them.

@afs
Copy link
Contributor

afs commented Oct 19, 2020

I agree having some tests is a good idea.

Speculation: the mistaken way it was done in SPARQL 1.0 means parser writers don't do the proper cases. For example, javacc has \u but not \U so \U used to spell SELECT is missed or extra work for little benefit.

I only discovered (or had forgotten and rediscovered!) the Turtle prefix issue because of your other tests! Thank you for all your work here!

@ericprud
Copy link
Member

Any idea if anyone uses the SPARQL parsing rules for anything other than perverse fishing attacks? Worth a change in the next version?

@kasei
Copy link
Contributor Author

kasei commented Oct 25, 2020

@ericprud I definitely use both the \u and \U forms, but not outside of string literals.

@afs
Copy link
Contributor

afs commented Oct 26, 2020

Any idea if anyone uses the SPARQL parsing rules for anything other than perverse fishing attacks?

I've don't recall it ever coming up.

I don't know what the right thing is for the prefix case. If adding \u \U then the prefix needs reparsing after the grammar to check legality. This does not happen for Strings; URIs get resolved/checked anyway.

@kasei -- what about in URIs? (not prefixes)

@kasei
Copy link
Contributor Author

kasei commented Oct 26, 2020

what about in URIs? (not prefixes)

Are you asking about my experience with them? Or indicating a desire for more tests using escapes in URIs?

I've never seen them used in URIs. Would be happy to add some more tests to cover other cases.

@gkellogg
Copy link
Member

My own parser only does u-sequence unescapes on PNAME_LN, IRIREF, and the four STRING_LITERAL tokens, which I believe is consistent with Turtle. Consequently, I pass neither syn-codepoint-escape-02 nor syn-codepoint-escape-03, but pass the others.

I'd favor sticking to the Turtle interpretation and not require unescaping of all tokens.

@lisp
Copy link

lisp commented Oct 27, 2020 via email

@gkellogg
Copy link
Member

My own parser only does u-sequence unescapes on PNAME_LN, IRIREF, and the four STRING_LITERAL tokens, which I believe is consistent with Turtle. Consequently, I pass neither syn-codepoint-escape-02 nor syn-codepoint-escape-03, but pass the others.

I'd favor sticking to the Turtle interpretation and not require unescaping of all tokens.

how does on reconcile that with 19.2 ?

Obviously, it doesn't adhere to this, and I think I'm not alone in this. If SPARQL 1.2 were to move more towards Turtle handling of escape sequences, then I don't think we should burden implementations with tests for strict 19.2 interpretation that was not tested adequately before, and may not be widely implemented at all.

@lisp
Copy link

lisp commented Oct 27, 2020

If SPARQL 1.2 were to move more towards Turtle handling of escape sequences

the discussion in #77 does not provide adequate reason to justify circumscribing the interpretation of unicode escapes - neither the purported injection risks, nor the nonconformance of implementations.

@afs
Copy link
Contributor

afs commented Oct 28, 2020

@gkellogg ,

My own parser only does u-sequence unescapes on PNAME_LN, IRIREF, and the four STRING_LITERAL tokens, which I believe is consistent with Turtle.

Useful data point.

Turtle does not have \u\U in prefixed names. Look for UCHAR in the grammar. It's in IRIREF and the 4 string tokens. (Somethign I only found out because of this PR!)

Consequently, I pass neither syn-codepoint-escape-02 nor syn-codepoint-escape-03, but pass the others.

ARQ has not had \U at all for a long time and (that I recall) no one has ever noticed. Given the obfustrication effect, fixing that outside strings and IRIs (including not prefix names) seems less helpful.

I'd favor sticking to the Turtle interpretation and not require unescaping of all tokens.

Same here.

@kasei
Copy link
Contributor Author

kasei commented Nov 22, 2020

Added some more unicode syntax tests based on discussion in #64.

@kasei kasei changed the title Add tests for SPARQL syntax codepoint escapes Add tests for SPARQL syntax codepoint escapes and invalid codepoint use Nov 22, 2020
@afs
Copy link
Contributor

afs commented Nov 25, 2020

(Rushed comments:)

I thought we were going to focus on how things should be. "test_codepoint_escape_02" is a test of \U outside string or URI. The is the "Turtle interpretation" is \u and \U inside the delimiters of string and URIs (not prefix names)

Given the obfustication effect, I don't think we should encourage implementations but instead focus on \U\u in these good places.

Same with test_codepoint_escape_03/"syn-codepoint-escape-03" - and the description is a bit off - the $ is encoded and it is not part of the variable name.

Surrogate pairs are code units of UTF-16, not UTF-8. They are reserved for UTF-16 (my understanding of the link from @lisp on #64) so "test_invalid_codepoint_escaped_bad_01" fails not because it is a broken surrogate but because a reserved coepoint is used. I only mention this because the description talks about partial surrogates. There are other reserved codepoints - all unallocated ones - which would fall under the same banner except that gets into Unicode version (avoid!).

@kasei
Copy link
Contributor Author

kasei commented Nov 25, 2020

@afs

That "test_codepoint_escape_02" test predates the discussion here about doing it the turtle way. If there's agreement, I'll remove those.

Regarding the surrogate pairs, this runs into areas I've never felt totally informed about. I posted a separate UTF-8 FAQ link in #64 in responding to @lisp that seems to discuss this case also, but frustratingly is a bit vague about the universal principle operating here:

A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error.

The discussion of converting UTF-16 makes this seem unrelated, but then it seems to indicate that representing "an unpaired surrogate" in UTF-8 would be invalid.

🤷‍♂️

Do we agree that the test is a correct negative syntax test? If so, do you have suggestions for how to change the description to properly describe this situation?

@afs
Copy link
Contributor

afs commented Nov 25, 2020

I'm no expert and i'm reading the same links.
If surrogates are reserved for UTF-16 use only, then they (any high or low) are not legal in UTF-8 at all whether correctly formed or not.

SPARQL is defined for UTF-8, so I am not following why converting from UTF-16 to UTF-8 is a factor.

@kasei
Copy link
Contributor Author

kasei commented Nov 25, 2020

The mention of "converting from UTF-16 to UTF-8" was only because that part of the linked FAQ text made it seem unrelated, though I think the following text might be important.

While SPARQL is defined in terms of UTF-8, I think the issue here is what to do with surrogates defined using codepoint escapes. Lone surrogates seem clearly illegal (though there might be some ambiguity about the reason for that). I'm not sure if a surrogate pair expressed using escapes should also be illegal, though. The query string could still be valid UTF-8, but I'm not sure what happens after utf-8 decoding and handling of escapes.

Copy link
Member

@gkellogg gkellogg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd favor removing sys-codepoint-escape-02.rq and -escape-03.rq, as being technically correct but outside the general use and future direction. Otherwise, LGTM.

@kasei
Copy link
Contributor Author

kasei commented Nov 26, 2020

@gkellogg Removed those two tests and renumbered the remaining tests.

@afs
Copy link
Contributor

afs commented Nov 27, 2020

@kasei - "the following text" - Which text are you referring that is relevant? it seems to be the "UTF-16 FAQ".

Q: What are surrogates?

A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800 to DBFF, and trailing, or low, surrogates are from DC00 to DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.

which seems to say that surrogates are UTF-16 only.

For UTF-8 representing the original codepoints (that are encoded as UTF-16 surrogate code units), the conversion would be direct to UTF-8. Then no surrogates encoded into UTF-8.

@afs
Copy link
Contributor

afs commented Nov 27, 2020

How does 1val1STRING_LITERAL1_with_UTF8_boundaries work? I don't understand the checking required for legal/illegal cases.

@kasei
Copy link
Contributor Author

kasei commented Nov 28, 2020

@afs this was pulled over from ShEx based on discussion with @ericprud in #64. My understanding is that it's all legal, and just verifying that codepoints on important unicode boundaries are all able to be used.

@afs
Copy link
Contributor

afs commented Dec 8, 2020

I am not sure whether there is a further assumption about a parser system whose output is UTF-16 as happens in Java since internally it is UTF-16 (and a bit).

However, the tests here can passed by checking that a high surrogate is followed by a low-surrogate regardless of the cause - character I/O or escape processing.

They will break for UCS-2 ("obsolete" - the original Uncode).

@gkellogg
Copy link
Member

gkellogg commented Dec 8, 2020

Okay, I think these are good to go; merging.

@afs
Copy link
Contributor

afs commented Jan 22, 2023

To make sure we have a clear decision, I've started issue #88.

Do we maintain RDF 1.2 compatible SPARQL tests? (We sort of have been, on a case-by-case basis - we should make a clear principle).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SPARQL test suite missing coverage of codepoint escapes
5 participants