SPARQL parser bug in evaluating unicode escapes #1884
Comments
Um, could you provide more details on platform and Python version? In my profound ineptitude, I don't seem to be able to reproduce the reported error using Python 3.8 on a Linux Mint 20.3 Una distro. Here's my test code:
$ cat test_unicode_escape.py
import rdflib
tarek = rdflib.URIRef("urn:example:tarek")
likes = rdflib.URIRef("urn:example:likes")
g = rdflib.Graph()
g.add((tarek, likes, rdflib.Literal("\u00a71234")))
assert list(g) == [
(
rdflib.term.URIRef('urn:example:tarek'),
rdflib.term.URIRef('urn:example:likes'),
rdflib.term.Literal('§1234')
)
]
q = """SELECT * WHERE { ?s ?p "\u00a71234" }"""
res = g.query(q)
lres = list(res)
assert rdflib.term.URIRef('urn:example:likes') in lres[0]
assert rdflib.term.URIRef('urn:example:tarek') in lres[0]
# q = 'SELECT * WHERE { ?s ?p "\U0001HHHH" }' (I can't even get
$ python test_unicode_escape.py
File "test_unicode_escape.py", line 19
q = 'SELECT * WHERE { ?s ?p "\U0001HHHH" }'
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 24-29: truncated \UXXXXXXXX escape
Sure. Apologies. I'm not super familiar with rdflib, so specifics of calling code may be important here. I'm trying this on Python 3.8.9 on MacOS. I didn't try to evaluate the query, just parse it. In the below code, both
I also suspect that you may be hitting python encoding issues (and not sparql parser encoding issues) when you use just a single backslash:
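To illustrate the distinction between Python-level and SPARQL-level escaping, here is a minimal sketch (not the snippet originally posted in this comment; the variable names are only for illustration). It shows that an ordinary Python string literal decodes `\u00a7` before any SPARQL parser sees it, while a raw string or a doubled backslash preserves the escape as text:

```python
# Python decodes \u escapes in ordinary string literals, so the SPARQL
# parser would receive the character "§" and never see a backslash.
plain = 'SELECT * WHERE { ?s ?p "\u00a71234" }'

# A raw string keeps the backslash, leaving the escape for the SPARQL
# parser to expand.
raw = r'SELECT * WHERE { ?s ?p "\u00a71234" }'

# Doubling the backslash in an ordinary literal is equivalent.
doubled = 'SELECT * WHERE { ?s ?p "\\u00a71234" }'

print("backslash in plain:", "\\" in plain)   # False
print("backslash in raw:  ", "\\" in raw)     # True
print("raw == doubled:    ", raw == doubled)  # True
```

So a test that passes the single-backslash form in a normal literal exercises Python's own unescaping, not the SPARQL parser's.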
I guess my limited model of the escaping principles is preventing me from achieving a full understanding of the issue you're reporting, gonna have to leave it to more knowledgeable folks 😮‍💨
It might be more understandable if you put |
Ah right, I see. Thank you for your patience. For those following along at home, there's a discussion of numeric character escapes in w3c/sparql-dev#77 which I found informative. Also it's worth noting that currently RDFLib isn't testing the SPARQL parser against the latest W3C test suite but uses the earlier version, which doesn't have the codepoint tests added in w3c/rdf-tests#67. @aucampia and I are gradually hewing the RDFLib test suite into better shape and are close to migrating it to use the latest W3C test suite, where the problematic example originally posted (above) is an existing test datum (edit: incorrect, that test uses
I noticed one place in the current ( |
There seem to be a couple of bugs in rdflib.plugins.sparql.parser in the handling of unicode escapes.
Trying to parse this query:
reveals two bugs. The first is in error handling in expandUnicodeEscapes, where constructing the error message fails due to an attempt to concatenate a string and a Match object:
Fixing this reveals the more serious issue:
The regular expression used to unescape the unicode data uses the IGNORECASE flag:
This is fine for matching varied-case hex characters, but it conflates the \u and \U handling, allowing either form to match either 4 or 8 hex digits. In the example query above, the lowercase \u should only use the first 4 hex digits (00a7) to produce the character §, resulting in the object literal "§1234". Instead, it finds 8 valid-looking hex digits (00a71234) and then fails because this is a number outside the range of valid unicode codepoints. A fix for this issue should differentiate the two escaping cases \u and \U and match only 4 and 8 digits, respectively.
This bug also allows parsing of what should be invalid input in cases where a \U escape is caused to match only 4 hex digits. For example, SELECT * WHERE { ?s ?p "\U0001HHHH" } is parsed without error, despite the invalid escape sequence.
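The behaviour described above can be reproduced with the standard `re` module alone. This is a minimal sketch under stated assumptions: `CONFLATED` is an illustrative pattern reconstructed from the description in this issue (not copied from rdflib), and `STRICT` and `expand_strict` are hypothetical names for one possible fix, not the change that was actually merged.

```python
import re

# Illustrative pattern reconstructed from the issue description (an
# assumption, not rdflib's source): IGNORECASE makes "\u" and "\U" share
# one rule that accepts either 4 or 8 hex digits.
CONFLATED = re.compile(r"\\u([0-9a-f]{4}(?:[0-9a-f]{4})?)", flags=re.I)

# Hypothetical fix: stay case-sensitive on the escape letter and spell out
# both hex cases explicitly, so \u takes exactly 4 digits and \U exactly 8.
STRICT = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_strict(query):
    """Expand \\uXXXX and \\UXXXXXXXX escapes (hypothetical helper)."""

    def expand(match):
        digits = match.group(1) or match.group(2)
        try:
            return chr(int(digits, 16))
        except (ValueError, OverflowError) as exc:
            # Format the matched text rather than concatenating the Match
            # object itself, which is the first bug reported above.
            raise ValueError(
                f"Invalid unicode code point: {match.group(0)}"
            ) from exc

    return STRICT.sub(expand, query)


query = r'SELECT * WHERE { ?s ?p "\u00a71234" }'

# The conflated pattern captures all 8 digits after the lowercase \u.
conflated_digits = CONFLATED.search(query).group(1)  # "00a71234"

# The strict pattern stops after 4 digits and leaves "1234" as plain text.
expanded = expand_strict(query)  # 'SELECT * WHERE { ?s ?p "§1234" }'

# With the conflated pattern, an invalid \U escape matches via the
# 4-digit branch; with the strict pattern it does not match at all.
invalid = r'SELECT * WHERE { ?s ?p "\U0001HHHH" }'
conflated_match = CONFLATED.search(invalid)  # matches "\U0001"
strict_match = STRICT.search(invalid)        # None
```

With the strict pattern, the invalid `\U0001HHHH` sequence is no longer silently expanded. Whether it is then rejected depends on the downstream grammar, which this sketch does not cover.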