SPARQL test suite missing coverage of codepoint escapes #64
Comments
While you're at it, how about bare codepoints themselves? Got any like https://github.com/shexSpec/shexTest/blob/master/schemas/1val1STRING_LITERAL1_with_UTF8_boundaries.shex ?
There are i18n tests in the original DAWG suite, but they similarly stay below U+FFFF.
The strategy I took in ShEx (and some in Turtle before) was to test the boundaries of permissible characters and, iirc, the boundaries of UTF-8 representations. For instance, the above ShEx has the characters 0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xCFFF, 0xD000, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x3FFFD, 0x40000, 0xFFFFD, 0x100000, 0x10FFFD. The 0xD800-0xDFFF range is prohibited because it is reserved for UTF-16 surrogates, the code units used in pairs to encode characters above U+FFFF. It also might be nice to reuse the names from the ShEx test suite, because they are systematic and the two languages have identical terminals.
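As a rough illustration of why several of those values are UTF-8 boundaries, here is a minimal Python sketch that prints the encoded length of each listed code point. The names `BOUNDARIES` and `utf8_length` are made up for this example and are not taken from either test suite.

```python
# Code points listed in the comment above (hypothetical constant name).
BOUNDARIES = [
    0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xCFFF, 0xD000, 0xD7FF,
    0xE000, 0xFFFD, 0x10000, 0x3FFFD, 0x40000, 0xFFFFD,
    0x100000, 0x10FFFD,
]


def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


def is_surrogate(code_point: int) -> bool:
    """True for the 0xD800-0xDFFF range that the list deliberately skips."""
    return 0xD800 <= code_point <= 0xDFFF


if __name__ == "__main__":
    for cp in BOUNDARIES:
        print(f"U+{cp:04X}: {utf8_length(cp)} byte(s) in UTF-8")
```

The length changes between 0x7FF and 0x800 (2 to 3 bytes) and between 0xFFFD and 0x10000 (3 to 4 bytes). Python's UTF-8 codec refuses to encode a lone surrogate such as `chr(0xD800)`, which is one example of an environment treating that range as invalid.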
@ericprud I've never been clear on whether a partial surrogate pair codepoint was actually prohibited, or just nonsensical. The SPARQL grammar and spec text don't seem to give any guidance on this. Testing boundaries would definitely be a good idea, though I'm not sure if all languages/environments are happy working with unassigned code points. Would be interested in hearing any experience people have with that sort of data.
Thanks, @lisp. I think I've read right past that before, because it is oddly placed under the "UTF-16 FAQ" section and written in a way that (in my reading) conflates 16-bit code units with something broadly applicable to all "UTFs". Re-reading the UTF-8 section, this one seems relevant:
@kasei, I approved them, which might come off as arrogant but was intended as a procedural +1.
AFAICT, the SPARQL test suite does not have any coverage of the 4-byte variant of codepoint escape sequences (\UXXXXXXXX). This could be a problem for implementations that defer escape handling to tools that only support the 2-byte variant (such as, I believe, the javacc parser generator).
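For anyone wanting to check their own handling, here is a minimal Python sketch of decoding both escape forms, `\uXXXX` and `\UXXXXXXXX`. It is only an illustration under simplifying assumptions: the function name `decode_codepoint_escapes` is hypothetical, and the sketch ignores any interaction with other backslash sequences in a real query.

```python
import re

# Matches \u + 4 hex digits, or \U + 8 hex digits (case-sensitive on u/U).
_CODEPOINT_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_codepoint_escapes(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they name.

    Hypothetical helper for illustration; chr() raises ValueError for
    values above 0x10FFFF, which this sketch does not catch.
    """
    def _replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _CODEPOINT_ESCAPE.sub(_replace, text)


if __name__ == "__main__":
    # The long form is needed for anything above U+FFFF.
    print(decode_codepoint_escapes(r'"\U0001F600"'))
```

A decoder that only recognises the `\uXXXX` branch would leave `\U0001F600` untouched in the input, which is the kind of failure a test for the long form would be expected to catch.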