SPARQL test suite missing coverage of codepoint escapes #64

kasei · 2020-10-16T22:57:43Z

AFAICT, the SPARQL test suite does not have any coverage of the 8-hex-digit (32-bit) variant of codepoint escape sequences (\UXXXXXXXX). This could be a problem for implementations that defer escape handling to tools that only support the 4-hex-digit (16-bit) variant, \uXXXX (such as, I believe, the javacc parser generator).
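To make the two escape forms concrete, here is a minimal Python sketch of a decoder that handles both \uXXXX and \UXXXXXXXX. The function name `decode_codepoint_escapes` is hypothetical and not taken from any SPARQL implementation. It deliberately does not decide whether surrogate code points should be rejected.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_codepoint_escapes(text):
    """Replace codepoint escapes in `text` with the characters they name.

    Hypothetical helper for illustration only.
    """
    def _substitute(match):
        # Exactly one of the two groups matched; take whichever is set.
        hex_digits = match.group(1) or match.group(2)
        value = int(hex_digits, 16)
        if value > 0x10FFFF:
            raise ValueError("escape is beyond U+10FFFF: " + match.group(0))
        return chr(value)

    return _ESCAPE.sub(_substitute, text)
```

A decoder that only recognises the first alternative of the pattern would leave an escape such as \U0001F600 untouched, which is the gap the missing tests would expose.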

ericprud · 2020-10-19T06:01:42Z

While you're at it, how about bare codepoints themselves? Got any like https://github.com/shexSpec/shexTest/blob/master/schemas/1val1STRING_LITERAL1_with_UTF8_boundaries.shex ?

kasei · 2020-10-19T14:41:51Z

There are i18n tests in the original DAWG suite, but they similarly stay below U+FFFF.

ericprud · 2020-11-20T23:16:39Z

The strategy I took in ShEx (and some in Turtle before) was to test the boundaries of permissible characters and, iirc, the boundaries of UTF-8 representations. For instance, the above ShEx has the characters 0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xCFFF, 0xD000, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x3FFFD, 0x40000, 0xFFFFD, 0x100000, 0x10FFFD. The 0xD800-0xDFFF range is prohibited because those code points are reserved as UTF-16 surrogates, which are used in pairs to encode characters above 0xFFFF.
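As a rough illustration of why those values are boundaries, the Python sketch below reports how many bytes each listed code point takes in UTF-8 and checks that the surrogate range cannot be encoded at all. The names `BOUNDARIES`, `utf8_length` and `can_encode_utf8` are made up for this example and are not part of the ShEx or SPARQL test suites.

```python
# Boundary code points listed in the comment above.
BOUNDARIES = [
    0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xCFFF, 0xD000, 0xD7FF,
    0xE000, 0xFFFD, 0x10000, 0x3FFFD, 0x40000, 0xFFFFD, 0x100000, 0x10FFFD,
]


def utf8_length(code_point):
    """Number of bytes the code point occupies when encoded as UTF-8."""
    return len(chr(code_point).encode("utf-8"))


def can_encode_utf8(code_point):
    """True if Python's strict UTF-8 codec accepts the code point."""
    try:
        chr(code_point).encode("utf-8")
    except UnicodeEncodeError:
        # Raised for surrogates (0xD800-0xDFFF) under the strict codec.
        return False
    return True


if __name__ == "__main__":
    for cp in BOUNDARIES:
        print("U+%04X: %d bytes" % (cp, utf8_length(cp)))
```

The encoded length changes between 0x7FF and 0x800 (two to three bytes) and between 0xFFFD and 0x10000 (three to four bytes), while 0xD7FF and 0xE000 sit on either side of the excluded surrogate range.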

It also might be nice to reuse the test names from the ShEx test suite because they are systematic and the two languages have identical terminals.

kasei · 2020-11-21T04:54:36Z

@ericprud I've never been clear on whether a partial surrogate pair codepoint was actually prohibited, or just nonsensical. The SPARQL grammar and spec text don't seem to give any guidance on this.

Testing boundaries would definitely be a good idea, though I'm not sure if all languages/environments are happy working with unassigned code points. Would be interested in hearing any experience people have with that sort of data.

lisp · 2020-11-21T06:18:42Z

> @ericprud I've never been clear on whether a partial surrogate pair codepoint was actually prohibited, or just nonsensical.

see: http://unicode.org/faq/utf_bom.html#utf16-7

kasei · 2020-11-21T15:17:30Z

> see: http://unicode.org/faq/utf_bom.html#utf16-7

Thanks, @lisp. I think I've skimmed right past that before because it sits, strangely, under the "UTF-16 FAQ" section and is written in a way that seems (in my reading) to conflate 16-bit code units with a broader applicability to all "UTFs". Re-reading the UTF-8 section, this one seems relevant:

https://unicode.org/faq/utf_bom.html#utf8-5
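For readers unfamiliar with how the surrogate range is used, here is a small Python sketch of the standard UTF-16 arithmetic that splits a code point above 0xFFFF into a high and a low surrogate. The function name `to_surrogate_pair` is hypothetical. Each half is only meaningful together with the other, which is the sense in which a lone surrogate does not name a character.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 surrogate pair.

    Hypothetical helper for illustration; returns (high, low) as integers.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits -> 0xD800-0xDBFF
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> 0xDC00-0xDFFF
    return high, low
```

For example, `to_surrogate_pair(0x10000)` gives `(0xD800, 0xDC00)`, the first pair in the range.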

kasei · 2020-11-22T00:31:00Z

@ericprud 4a12627 is a commit as part of #67 that adds syntax tests for Unicode boundaries and the invalid use of a lone surrogate codepoint. What do you think?

ericprud · 2020-11-25T16:18:57Z

@kasei, I approved them, which might come off as arrogant but was intended as a procedural +1.

kasei mentioned this issue Oct 19, 2020

Add tests for SPARQL syntax codepoint escapes and invalid codepoint use #67

Merged

gkellogg closed this as completed in #67 Dec 8, 2020
