-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tests for SPARQL syntax codepoint escapes and invalid codepoint use #67
Conversation
The way \u and \U are handled in SPARQL is different to Turtle, and IMO Turtle is right, SPARQL is wrong. In Turtle \u and \U can appear in certain places (Strings, URIs). URIs will be further checked because they get resolved. (Jena RIOT also checks for it being a valid codepoint.) We can separately argue whether Turtle/SPARQL should allow them in prefixed names. It is not simply a matter of adding them, because there are syntax rules for prefix names that current the grammar enforces. Maybe it would be better to have tests covering for what is best, not the mistakes of SPARQL 1.0. |
@afs Agreed there's an unfortunate disconnect between SPARQL and Turtle. I'm just trying to get any tests into the test suite that exercise the 4-byte variation of the escaping (since it currently isn't used in any tests). I could strip this down to only testing the escapes in places that Turtle might be accepting of if that's desirable, but having something testing these escapes seems important as there are existing parsers that don't seem to be able to handle them. |
I agree having some tests is a good idea. Speculation: the mistaken way it was done in SPARQL 1.0 means parser writers don't do the proper cases. For example, javacc has \u but not \U so \U used to spell SELECT is missed or extra work for little benefit. I only discovered (or had forgotten and rediscovered!) the Turtle prefix issue because of your other tests! Thank you for all your work here! |
Any idea if anyone uses the SPARQL parsing rules for anything other than perverse fishing attacks? Worth a change in the next version? |
@ericprud I definitely use both the \u and \U forms, but not outside of string literals. |
I've don't recall it ever coming up. I don't know what the right thing is for the prefix case. If adding \u \U then the prefix needs reparsing after the grammar to check legality. This does not happen for Strings; URIs get resolved/checked anyway. @kasei -- what about in URIs? (not prefixes) |
Are you asking about my experience with them? Or indicating a desire for more tests using escapes in URIs? I've never seen them used in URIs. Would be happy to add some more tests to cover other cases. |
My own parser only does u-sequence unescapes on PNAME_LN, IRIREF, and the four STRING_LITERAL tokens, which I believe is consistent with Turtle. Consequently, I pass neither syn-codepoint-escape-02 nor syn-codepoint-escape-03, but pass the others. I'd favor sticking to the Turtle interpretation and not require unescaping of all tokens. |
how does on reconcile that with 19.2 ?
…On Mon, Oct 26, 2020, 22:40 Gregg Kellogg ***@***.***> wrote:
My own parser only does u-sequence unescapes on PNAME_LN, IRIREF, and the
four STRING_LITERAL tokens, which I believe is consistent with Turtle.
Consequently, I pass neither syn-codepoint-escape-02 nor
syn-codepoint-escape-03, but pass the others.
I'd favor sticking to the Turtle interpretation and not require unescaping
of all tokens.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#67 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABOZ3SDW22QQRRVCC3TXOTSMXUGLANCNFSM4SWQPM6A>
.
|
Obviously, it doesn't adhere to this, and I think I'm not alone in this. If SPARQL 1.2 were to move more towards Turtle handling of escape sequences, then I don't think we should burden implementations with tests for strict 19.2 interpretation that was not tested adequately before, and may not be widely implemented at all. |
the discussion in #77 does not provide adequate reason to justify circumscribing the interpretation of unicode escapes - neither the purported injection risks, nor the nonconformance of implementations. |
Useful data point. Turtle does not have \u\U in prefixed names. Look for UCHAR in the grammar. It's in IRIREF and the 4 string tokens. (Somethign I only found out because of this PR!)
ARQ has not had \U at all for a long time and (that I recall) no one has ever noticed. Given the obfustrication effect, fixing that outside strings and IRIs (including not prefix names) seems less helpful.
Same here. |
…ogate pair codepoints.
Added some more unicode syntax tests based on discussion in #64. |
(Rushed comments:) I thought we were going to focus on how things should be. "test_codepoint_escape_02" is a test of \U outside string or URI. The is the "Turtle interpretation" is \u and \U inside the delimiters of string and URIs (not prefix names) Given the obfustication effect, I don't think we should encourage implementations but instead focus on \U\u in these good places. Same with test_codepoint_escape_03/"syn-codepoint-escape-03" - and the description is a bit off - the Surrogate pairs are code units of UTF-16, not UTF-8. They are reserved for UTF-16 (my understanding of the link from @lisp on #64) so "test_invalid_codepoint_escaped_bad_01" fails not because it is a broken surrogate but because a reserved coepoint is used. I only mention this because the description talks about partial surrogates. There are other reserved codepoints - all unallocated ones - which would fall under the same banner except that gets into Unicode version (avoid!). |
That "test_codepoint_escape_02" test predates the discussion here about doing it the turtle way. If there's agreement, I'll remove those. Regarding the surrogate pairs, this runs into areas I've never felt totally informed about. I posted a separate UTF-8 FAQ link in #64 in responding to @lisp that seems to discuss this case also, but frustratingly is a bit vague about the universal principle operating here:
The discussion of converting UTF-16 makes this seem unrelated, but then it seems to indicate that representing "an unpaired surrogate" in UTF-8 would be invalid. 🤷♂️ Do we agree that the test is a correct negative syntax test? If so, do you have suggestions for how to change the description to properly describe this situation? |
I'm no expert and i'm reading the same links. SPARQL is defined for UTF-8, so I am not following why converting from UTF-16 to UTF-8 is a factor. |
The mention of "converting from UTF-16 to UTF-8" was only because that part of the linked FAQ text made it seem unrelated, though I think the following text might be important. While SPARQL is defined in terms of UTF-8, I think the issue here is what to do with surrogates defined using codepoint escapes. Lone surrogates seem clearly illegal (though there might be some ambiguity about the reason for that). I'm not sure if a surrogate pair expressed using escapes should also be illegal, though. The query string could still be valid UTF-8, but I'm not sure what happens after utf-8 decoding and handling of escapes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd favor removing sys-codepoint-escape-02.rq and -escape-03.rq, as being technically correct but outside the general use and future direction. Otherwise, LGTM.
@gkellogg Removed those two tests and renumbered the remaining tests. |
@kasei - "the following text" - Which text are you referring that is relevant? it seems to be the "UTF-16 FAQ".
which seems to say that surrogates are UTF-16 only. For UTF-8 representing the original codepoints (that are encoded as UTF-16 surrogate code units), the conversion would be direct to UTF-8. Then no surrogates encoded into UTF-8. |
How does |
I am not sure whether there is a further assumption about a parser system whose output is UTF-16 as happens in Java since internally it is UTF-16 (and a bit). However, the tests here can passed by checking that a high surrogate is followed by a low-surrogate regardless of the cause - character I/O or escape processing. They will break for UCS-2 ("obsolete" - the original Uncode). |
Okay, I think these are good to go; merging. |
To make sure we have a clear decision, I've started issue #88. Do we maintain RDF 1.2 compatible SPARQL tests? (We sort of have been, on a case-by-case basis - we should make a clear principle). |
Adds SPARQL syntax tests for codepoint escape sequences:
Fixes #64.