SPARQL test suite missing coverage of codepoint escapes #64

Closed
kasei opened this issue Oct 16, 2020 · 8 comments · Fixed by #67

Comments

@kasei
Contributor

kasei commented Oct 16, 2020

AFAICT, the SPARQL test suite does not have any coverage of the 4-byte variant of codepoint escape sequences (\UXXXXXXXX). This could be a problem for implementations that defer escape handling to tools that only support the 2-byte variant (such as, I believe, the javacc parser generator).
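To make the two escape forms concrete, here is a minimal Python sketch of a decoder that handles both `\uXXXX` and `\UXXXXXXXX`. The function name `decode_codepoint_escapes` is hypothetical and is not taken from any SPARQL implementation; the sketch only illustrates the kind of input that a test for the 8-digit form would exercise.

```python
import re

# Matches either \u followed by exactly 4 hex digits
# or \U followed by exactly 8 hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_codepoint_escapes(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they name.

    Hypothetical helper for illustration only.
    """

    def _substitute(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        codepoint = int(hex_digits, 16)
        if codepoint > 0x10FFFF:
            # Eight hex digits can express values beyond the Unicode range.
            raise ValueError(f"escape out of Unicode range: {match.group(0)}")
        return chr(codepoint)

    return _ESCAPE.sub(_substitute, text)


# The 4-digit form stays within U+0000..U+FFFF; the 8-digit form can go higher.
short_form = decode_codepoint_escapes(r"caf\u00E9")
long_form = decode_codepoint_escapes(r"\U0001F600")
```

A tool that only recognizes the 4-digit form would leave the second input undecoded, which is the gap the missing tests would catch.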

@ericprud
Member

While you're at it, how about bare codepoints themselves? Got any like https://github.com/shexSpec/shexTest/blob/master/schemas/1val1STRING_LITERAL1_with_UTF8_boundaries.shex ?

@kasei
Contributor Author

kasei commented Oct 19, 2020

There are i18n tests in the original DAWG suite, but they similarly stay below U+FFFF.

@ericprud
Member

The strategy I took in ShEx (and some in Turtle before) was to test the boundaries of permissible characters and, iirc, the boundaries of UTF-8 representations. For instance, the above ShEx has the characters 0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xCFFF, 0xD000, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x3FFFD, 0x40000, 0xFFFFD, 0x100000, 0x10FFFD. The 0xD800-0xDFFF range is prohibited because it's reserved for the surrogate code units that UTF-16 uses to encode characters above 0xFFFF.

It also might be nice to reuse the test names from the ShEx test suite because they are systematic and the two languages have identical terminals.
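As a rough illustration of why those code points are interesting, the Python sketch below computes how many bytes each one takes in UTF-8. The list is copied from the comment above; the names `BOUNDARIES` and `utf8_length` are made up for this example and do not come from the ShEx or SPARQL test suites.

```python
# Code points listed in the comment above (taken from the ShEx boundary test).
BOUNDARIES = [
    0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xCFFF, 0xD000, 0xD7FF,
    0xE000, 0xFFFD, 0x10000, 0x3FFFD, 0x40000, 0xFFFFD, 0x100000, 0x10FFFD,
]


def utf8_length(codepoint: int) -> int:
    """Return the number of bytes in the UTF-8 encoding of a code point."""
    return len(chr(codepoint).encode("utf-8"))


# Map each boundary code point to its encoded length.
lengths = {hex(cp): utf8_length(cp) for cp in BOUNDARIES}

for name, size in lengths.items():
    print(f"{name}: {size} byte(s)")
```

The encoded length changes between 0x7FF and 0x800 (two to three bytes) and between 0xFFFD and 0x10000 (three to four bytes), and the list jumps from 0xD7FF straight to 0xE000, skipping the surrogate range.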

@kasei
Contributor Author

kasei commented Nov 21, 2020

@ericprud I've never been clear on whether a partial surrogate pair codepoint was actually prohibited, or just nonsensical. The SPARQL grammar and spec text don't seem to give any guidance on this.

Testing boundaries would definitely be a good idea, though I'm not sure if all languages/environments are happy working with unassigned code points. Would be interested in hearing any experience people have with that sort of data.

@lisp

lisp commented Nov 21, 2020

> @ericprud I've never been clear on whether a partial surrogate pair codepoint was actually prohibited, or just nonsensical.

see: http://unicode.org/faq/utf_bom.html#utf16-7

@kasei
Contributor Author

kasei commented Nov 21, 2020

> see: http://unicode.org/faq/utf_bom.html#utf16-7

Thanks, @lisp. I think I've read right past that before because it sits, strangely, under the "UTF-16 FAQ" section and is written in a way that (in my reading) conflates 16-bit code units with something broadly applicable to all "UTFs". Re-reading the UTF-8 section, this one seems relevant:

https://unicode.org/faq/utf_bom.html#utf8-5
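For what it's worth, one environment that enforces this is Python 3, whose strict UTF-8 codec refuses lone surrogates in both directions. The sketch below is only an illustration of that behavior, not a statement about what SPARQL parsers must do; the helper name `encodes_as_utf8` is hypothetical.

```python
def encodes_as_utf8(codepoint: int) -> bool:
    """Return True if Python's strict UTF-8 codec can encode the code point."""
    try:
        chr(codepoint).encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 codec accepts the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+D7FF and U+E000 sit just outside the surrogate range; U+D800 is inside it.
below = encodes_as_utf8(0xD7FF)
lone_surrogate = encodes_as_utf8(0xD800)
above = encodes_as_utf8(0xE000)

# ED A0 80 is what U+D800 would look like if it were encoded like any other
# three-byte code point; the strict decoder rejects it.
surrogate_bytes_ok = decodes_as_utf8(b"\xed\xa0\x80")
```

Other languages and runtimes may be more permissive, so a negative syntax test for a lone surrogate is still worth having.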

@kasei
Contributor Author

kasei commented Nov 22, 2020

@ericprud 4a12627, a commit in #67, adds syntax tests for Unicode boundaries and for the invalid use of a lone surrogate code point. What do you think?

@ericprud
Member

@kasei, I approved them, which might come off as arrogant but was intended as a procedural +1.
