Add tests for SPARQL syntax codepoint escapes and invalid codepoint use #67

kasei · 2020-10-19T18:37:46Z

Adds SPARQL syntax tests for codepoint escape sequences:

Tests using 4-byte escape variant (\UXXXXXXXX) for codepoints beyond U+FFFF
Tests to ensure that unescaping is done in a single pass (e.g. the results of unescaping \u sequences cannot produce characters that then participate in unescaping a \U sequence)
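
For illustration only, a minimal Python sketch of what "single pass" means here. The function name `unescape_codepoints` is hypothetical and is not taken from the test suite or any parser in this thread; the sketch also does not reject surrogate values.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_codepoints(text: str) -> str:
    """Replace codepoint escapes in one left-to-right pass (hypothetical helper)."""

    def replace(match):
        value = int(match.group(1) or match.group(2), 16)
        if value > 0x10FFFF:
            raise ValueError("escape beyond U+10FFFF: " + match.group(0))
        return chr(value)

    # re.sub only scans the original input, so text produced by one
    # replacement is never rescanned as part of another escape.
    return _ESCAPE.sub(replace, text)
```

With this approach `\u005Cu0041` becomes the six characters `\u0041` rather than `A`, because the backslash produced by the first escape is not looked at again.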

Fixes #64.

afs · 2020-10-19T19:38:15Z

The way \u and \U are handled in SPARQL is different to Turtle, and IMO Turtle is right, SPARQL is wrong.

In Turtle \u and \U can appear in certain places (Strings, URIs). URIs will be further checked because they get resolved. (Jena RIOT also checks for it being a valid codepoint.)

We can separately argue whether Turtle/SPARQL should allow them in prefixed names. It is not simply a matter of adding them, because there are syntax rules for prefixed names that the grammar currently enforces.

Maybe it would be better to have tests covering what is best, not the mistakes of SPARQL 1.0.

kasei · 2020-10-19T19:43:27Z

@afs Agreed there's an unfortunate disconnect between SPARQL and Turtle. I'm just trying to get any tests into the test suite that exercise the 4-byte variation of the escaping (since it currently isn't used in any tests).

I could strip this down to only testing the escapes in places that Turtle might be accepting of if that's desirable, but having something testing these escapes seems important as there are existing parsers that don't seem to be able to handle them.

afs · 2020-10-19T20:08:58Z

I agree having some tests is a good idea.

Speculation: the mistaken way it was done in SPARQL 1.0 means parser writers don't do the proper cases. For example, javacc has \u but not \U, so \U used to spell SELECT is either missed or takes extra work for little benefit.

I only discovered (or had forgotten and rediscovered!) the Turtle prefix issue because of your other tests! Thank you for all your work here!

ericprud · 2020-10-25T12:45:01Z

Any idea if anyone uses the SPARQL parsing rules for anything other than perverse phishing attacks? Worth a change in the next version?

kasei · 2020-10-25T19:00:56Z

@ericprud I definitely use both the \u and \U forms, but not outside of string literals.

afs · 2020-10-26T12:30:13Z

Any idea if anyone uses the SPARQL parsing rules for anything other than perverse phishing attacks?

I don't recall it ever coming up.

I don't know what the right thing is for the prefix case. If adding \u \U then the prefix needs reparsing after the grammar to check legality. This does not happen for Strings; URIs get resolved/checked anyway.
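
A rough Python sketch of the "reparse after unescaping" step for a prefixed name's local part. Everything here is an assumption for illustration: `SIMPLE_LOCAL` is an ASCII-only stand-in for the real PN_LOCAL production, and the helper names are hypothetical.

```python
import re

ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# Hypothetical, ASCII-only approximation of a local name; the real
# grammar production is considerably more permissive.
SIMPLE_LOCAL = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def unescape(text: str) -> str:
    """Single-pass replacement of \\u and \\U escapes."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def local_name_still_legal(raw_local: str) -> bool:
    """Unescape a local name, then re-check the result against the name rules."""
    return SIMPLE_LOCAL.fullmatch(unescape(raw_local)) is not None
```

Under these assumptions `a\u0062c` unescapes to `abc` and passes, while `a\u0020b` unescapes to a name containing a space and fails the second check.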

@kasei -- what about in URIs? (not prefixes)

kasei · 2020-10-26T15:17:39Z

what about in URIs? (not prefixes)

Are you asking about my experience with them? Or indicating a desire for more tests using escapes in URIs?

I've never seen them used in URIs. Would be happy to add some more tests to cover other cases.

gkellogg · 2020-10-26T21:40:39Z

My own parser only does u-sequence unescapes on PNAME_LN, IRIREF, and the four STRING_LITERAL tokens, which I believe is consistent with Turtle. Consequently, I pass neither syn-codepoint-escape-02 nor syn-codepoint-escape-03, but pass the others.

I'd favor sticking to the Turtle interpretation and not require unescaping of all tokens.

lisp · 2020-10-27T10:22:38Z

How does one reconcile that with 19.2?

gkellogg · 2020-10-27T18:43:19Z

My own parser only does u-sequence unescapes on PNAME_LN, IRIREF, and the four STRING_LITERAL tokens, which I believe is consistent with Turtle. Consequently, I pass neither syn-codepoint-escape-02 nor syn-codepoint-escape-03, but pass the others.

I'd favor sticking to the Turtle interpretation and not require unescaping of all tokens.

How does one reconcile that with 19.2?

Obviously, it doesn't adhere to this, and I think I'm not alone in this. If SPARQL 1.2 were to move more towards Turtle handling of escape sequences, then I don't think we should burden implementations with tests for strict 19.2 interpretation that was not tested adequately before, and may not be widely implemented at all.

lisp · 2020-10-27T22:28:37Z

If SPARQL 1.2 were to move more towards Turtle handling of escape sequences

the discussion in #77 does not provide adequate reason to justify circumscribing the interpretation of unicode escapes - neither the purported injection risks, nor the nonconformance of implementations.

afs · 2020-10-28T11:47:42Z

@gkellogg ,

My own parser only does u-sequence unescapes on PNAME_LN, IRIREF, and the four STRING_LITERAL tokens, which I believe is consistent with Turtle.

Useful data point.

Turtle does not have \u\U in prefixed names. Look for UCHAR in the grammar. It's in IRIREF and the 4 string tokens. (Something I only found out because of this PR!)
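
As a sketch of that "Turtle interpretation", the Python below unescapes only inside `<...>` and `"..."` spans and leaves everything else untouched. The token patterns are deliberately simplified assumptions (only double-quoted short strings, no handling of `<` as an operator, no handling of ECHAR such as `\\`), and the function names are hypothetical.

```python
import re

ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# Simplified token shapes (assumption): <...> IRIs and "..." strings only.
DELIMITED = re.compile(r'<[^<>"\s]*>|"(?:[^"\\\n]|\\.)*"')


def unescape(text: str) -> str:
    """Single-pass replacement of \\u and \\U escapes."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def unescape_turtle_style(query: str) -> str:
    """Apply codepoint unescaping only inside IRI and string delimiters."""
    return DELIMITED.sub(lambda m: unescape(m.group(0)), query)
```

With this sketch, an escape that spells part of a keyword (for example `\u0053ELECT`) is left as written, while the same escape inside a string or IRI is decoded.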

Consequently, I pass neither syn-codepoint-escape-02 nor syn-codepoint-escape-03, but pass the others.

ARQ has not had \U at all for a long time and (that I recall) no one has ever noticed. Given the obfuscation effect, fixing that outside strings and IRIs (including not prefix names) seems less helpful.

I'd favor sticking to the Turtle interpretation and not require unescaping of all tokens.

Same here.

kasei · 2020-11-22T00:31:51Z

Added some more unicode syntax tests based on discussion in #64.

afs · 2020-11-25T17:17:30Z

(Rushed comments:)

I thought we were going to focus on how things should be. "test_codepoint_escape_02" is a test of \U outside a string or URI. The "Turtle interpretation" is \u and \U inside the delimiters of strings and URIs (not prefixed names).

Given the obfuscation effect, I don't think we should encourage implementations but instead focus on \U\u in these good places.

Same with test_codepoint_escape_03/"syn-codepoint-escape-03" - and the description is a bit off - the $ is encoded and it is not part of the variable name.

Surrogate pairs are code units of UTF-16, not UTF-8. They are reserved for UTF-16 (my understanding of the link from @lisp on #64), so "test_invalid_codepoint_escaped_bad_01" fails not because it is a broken surrogate but because a reserved codepoint is used. I only mention this because the description talks about partial surrogates. There are other reserved codepoints - all unallocated ones - which would fall under the same banner, except that gets into Unicode versions (avoid!).

kasei · 2020-11-25T17:37:55Z

@afs

That "test_codepoint_escape_02" test predates the discussion here about doing it the turtle way. If there's agreement, I'll remove those.

Regarding the surrogate pairs, this runs into areas I've never felt totally informed about. I posted a separate UTF-8 FAQ link in #64 in responding to @lisp that seems to discuss this case also, but frustratingly is a bit vague about the universal principle operating here:

A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error.

The discussion of converting UTF-16 makes this seem unrelated, but then it seems to indicate that representing "an unpaired surrogate" in UTF-8 would be invalid.

🤷‍♂️

Do we agree that the test is a correct negative syntax test? If so, do you have suggestions for how to change the description to properly describe this situation?

afs · 2020-11-25T18:54:43Z

I'm no expert and I'm reading the same links.
If surrogates are reserved for UTF-16 use only, then they (any high or low) are not legal in UTF-8 at all whether correctly formed or not.

SPARQL is defined for UTF-8, so I am not following why converting from UTF-16 to UTF-8 is a factor.
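
A small Python check that matches this reading, offered as a sketch rather than as anything a SPARQL parser is required to do. The helper names are hypothetical. Python's strict UTF-8 encoder refuses any codepoint in the surrogate range, whether high or low.

```python
def is_surrogate(codepoint: int) -> bool:
    """True for any codepoint in the surrogate range D800..DFFF."""
    return 0xD800 <= codepoint <= 0xDFFF


def encodes_to_utf8(codepoint: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts the codepoint."""
    try:
        chr(codepoint).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised for surrogates: they have no well-formed UTF-8 encoding.
        return False
```

Under this check a lone `\uD800` and a lone `\uDC00` are both rejected for the same reason: the codepoint itself is reserved, regardless of what follows it.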

kasei · 2020-11-25T18:59:59Z

The mention of "converting from UTF-16 to UTF-8" was only because that part of the linked FAQ text made it seem unrelated, though I think the following text might be important.

While SPARQL is defined in terms of UTF-8, I think the issue here is what to do with surrogates defined using codepoint escapes. Lone surrogates seem clearly illegal (though there might be some ambiguity about the reason for that). I'm not sure if a surrogate pair expressed using escapes should also be illegal, though. The query string could still be valid UTF-8, but I'm not sure what happens after utf-8 decoding and handling of escapes.

gkellogg

I'd favor removing syn-codepoint-escape-02.rq and syn-codepoint-escape-03.rq, as being technically correct but outside the general use and future direction. Otherwise, LGTM.

kasei · 2020-11-26T00:02:16Z

@gkellogg Removed those two tests and renumbered the remaining tests.

afs · 2020-11-27T12:40:30Z

@kasei - "the following text" - Which text are you referring to as relevant? It seems to be the "UTF-16 FAQ".

Q: What are surrogates?

A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800 to DBFF, and trailing, or low, surrogates are from DC00 to DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.

which seems to say that surrogates are UTF-16 only.

For UTF-8 representing the original codepoints (that are encoded as UTF-16 surrogate code units), the conversion would be direct to UTF-8. Then no surrogates are encoded into UTF-8.
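
To make that concrete, a short Python sketch using U+1F600 as an arbitrary example codepoint above U+FFFF (the choice of codepoint and the helper name `utf16_units` are assumptions for illustration):

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


ch = "\U0001F600"

# In UTF-16 the codepoint is split into a high and a low surrogate code unit.
units = utf16_units(ch)        # [0xD83D, 0xDE00]

# In UTF-8 the same codepoint is encoded directly as four bytes;
# no surrogate values appear anywhere in the byte sequence.
utf8_bytes = ch.encode("utf-8")  # b"\xf0\x9f\x98\x80"
```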

afs · 2020-11-27T12:40:47Z

How does 1val1STRING_LITERAL1_with_UTF8_boundaries work? I don't understand the checking required for legal/illegal cases.

kasei · 2020-11-28T18:49:57Z

@afs this was pulled over from ShEx based on discussion with @ericprud in #64. My understanding is that it's all legal, and just verifying that codepoints on important unicode boundaries are all able to be used.
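
As a sketch of what "important unicode boundaries" could mean, the Python below lists the codepoints where the UTF-8 encoded length changes, plus the last valid codepoint. The exact set of codepoints in the ShEx-derived test is not reproduced here; this list is an assumption for illustration.

```python
# Codepoints on either side of each UTF-8 length change, plus U+10FFFF.
BOUNDARIES = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]


def utf8_length(codepoint: int) -> int:
    """Number of bytes needed to encode the codepoint in UTF-8."""
    return len(chr(codepoint).encode("utf-8"))


# Encoded lengths for the boundary codepoints, in the same order.
lengths = [utf8_length(cp) for cp in BOUNDARIES]
```

All of these are legal codepoints for the encoder, so a test built from them exercises 1-, 2-, 3-, and 4-byte sequences without involving any error case.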

afs · 2020-12-08T18:22:11Z

I am not sure whether there is a further assumption about a parser system whose output is UTF-16 as happens in Java since internally it is UTF-16 (and a bit).

However, the tests here can be passed by checking that a high surrogate is followed by a low surrogate regardless of the cause - character I/O or escape processing.

They will break for UCS-2 ("obsolete" - the original Unicode).
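
A minimal Python sketch of the "high surrogate followed by low surrogate" check over a sequence of UTF-16 code units, as a parser with a UTF-16 internal representation might do it. The function name is hypothetical and this is not taken from any implementation mentioned in the thread.

```python
def well_formed_utf16(units) -> bool:
    """True if every high surrogate is immediately followed by a low one
    and no low surrogate appears on its own."""
    expecting_low = False
    for unit in units:
        is_high = 0xD800 <= unit <= 0xDBFF
        is_low = 0xDC00 <= unit <= 0xDFFF
        if expecting_low:
            if not is_low:
                return False  # high surrogate not followed by a low one
            expecting_low = False
        elif is_low:
            return False      # low surrogate with no preceding high one
        elif is_high:
            expecting_low = True
    # A trailing high surrogate is also ill-formed.
    return not expecting_low
```

The check does not care whether a code unit came from the input bytes or from a `\u` escape, which is the point made above.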

gkellogg · 2020-12-08T19:52:51Z

Okay, I think these are good to go; merging.

afs · 2023-01-22T10:30:28Z

To make sure we have a clear decision, I've started issue #88.

Do we maintain RDF 1.2 compatible SPARQL tests? (We sort of have been, on a case-by-case basis - we should make a clear principle).

kasei added 2 commits October 19, 2020 11:30

Add syntax tests for unicode codepoint escapes.

55b9652

Fix comment in syn-codepoint-escape-bad-04.rq.

155e7ce

afs mentioned this pull request Oct 26, 2020

JENA-1982: SPARQL unicode escapes apache/jena#820

Merged

Add unicode syntax tests for codepoint bondaries and for invalid surrogate pair codepoints.

4a12627

kasei mentioned this pull request Nov 22, 2020

SPARQL test suite missing coverage of codepoint escapes #64

Closed

kasei changed the title ~~Add tests for SPARQL syntax codepoint escapes~~ Add tests for SPARQL syntax codepoint escapes and invalid codepoint use Nov 22, 2020

ericprud approved these changes Nov 25, 2020

gkellogg requested changes Nov 25, 2020

Remove syntax tests for codepoint escapes used outside of literals.

f3f0bdb

afs approved these changes Dec 8, 2020

gkellogg approved these changes Dec 8, 2020

gkellogg merged commit 5979a87 into w3c:gh-pages Dec 8, 2020

kasei deleted the sparql-syntax-codepoint-escapes branch December 9, 2020 04:23

ghost mentioned this pull request May 5, 2022

SPARQL parser bug in evaluating unicode escapes RDFLib/rdflib#1884

Open

rubensworks mentioned this pull request Oct 12, 2022

Escape IRIs and literals rubensworks/rdf-string-ttl.js#3

Merged

afs mentioned this pull request Nov 20, 2022

Update data-r2 to be SPARQL 1.1 compatible #83

Merged

afs mentioned this pull request Jan 22, 2023

Decide that SPARQL tests in rdf-tests will be for RDF 1.1. #88

Open
