SPARQL parser bug in evaluating unicode escapes #1884

Open
kasei opened this issue May 4, 2022 · 7 comments
Labels
bug Something isn't working SPARQL

Comments

@kasei
kasei commented May 4, 2022

There seem to be a couple of bugs in rdflib.plugins.sparql.parser in the handling of unicode escapes.

Trying to parse this query:

SELECT * WHERE { ?s ?p "\u00a71234" }

reveals two bugs. The first is in error handling in expandUnicodeEscapes where constructing the error message fails due to an attempt to concatenate a string and a Match object:

TypeError: can only concatenate str (not "re.Match") to str

Fixing this reveals the more serious issue:

ValueError: chr() arg not in range(0x110000)

The regular expression used to unescape the unicode data uses the IGNORECASE flag:

expandUnicodeEscapes_re = re.compile(r"\\u([0-9a-f]{4}(?:[0-9a-f]{4})?)", flags=re.I)

This is fine for matching varied-case hex characters, but it conflates the \u and \U handling, allowing either form to match either 4 or 8 hex digits. In the example query above, the lowercase \u should only use the first 4 hex digits (00a7) to produce the character §, resulting in the object literal "§1234". Instead, it finds 8 valid-looking hex digits (00a71234) and then fails because this is a number outside the range of valid unicode codepoints. A fix for this issue should differentiate the two escaping cases \u and \U and match only 4 and 8 digits, respectively.
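To make the over-match concrete, here is a minimal reproduction that uses only the `re` module and the pattern quoted above. The variable names are illustrative and are not taken from rdflib.

```python
import re

# The pattern quoted above: IGNORECASE makes "\\u" match "\\U" as well,
# and the optional group lets either form swallow 4 or 8 hex digits.
expand_re = re.compile(r"\\u([0-9a-f]{4}(?:[0-9a-f]{4})?)", flags=re.I)

# Raw string, so the backslash reaches the regex untouched.
query = r'SELECT * WHERE { ?s ?p "\u00a71234" }'

match = expand_re.search(query)
captured = match.group(1)        # greedy: takes all 8 hex-looking characters
codepoint = int(captured, 16)

print(captured)                  # 00a71234
print(codepoint > 0x10FFFF)      # True

try:
    chr(codepoint)
except ValueError as exc:
    print(exc)                   # chr() arg not in range(0x110000)
```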

This bug also allows parsing of what should be invalid input in cases where a \U escape is caused to match only 4 hex digits. For example, SELECT * WHERE { ?s ?p "\U0001HHHH" } is parsed without error, despite the invalid escape sequence.
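One possible shape for the fix, as a sketch only: the names `UNICODE_ESCAPE_RE` and `expand_unicode_escapes` are hypothetical and this is not rdflib's actual code. It assumes the two escape forms are matched by separate, case-sensitive alternatives, and that the error message is built from `match.group(0)` (a `str`) rather than from the Match object.

```python
import re

# Hypothetical replacement pattern: \u takes exactly 4 hex digits,
# \U takes exactly 8. Only the hex digits are case-insensitive.
UNICODE_ESCAPE_RE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_unicode_escapes(text: str) -> str:
    """Expand \\uXXXX and \\UXXXXXXXX escapes in a query string."""

    def expand(match: "re.Match[str]") -> str:
        hex_digits = match.group(1) or match.group(2)
        try:
            return chr(int(hex_digits, 16))
        except ValueError as exc:
            # match.group(0) is a str, so the concatenation cannot raise TypeError.
            raise ValueError(
                "Invalid unicode code point: " + match.group(0)
            ) from exc

    return UNICODE_ESCAPE_RE.sub(expand, text)


print(expand_unicode_escapes(r'"\u00a71234"'))   # "§1234"
```

With this pattern, `\U0001HHHH` no longer matches at all, so the sketch leaves it in place unchanged. Rejecting it as an invalid escape would then fall to whatever parses the string afterwards, which is an assumption about the rest of the parser.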

@ghost
ghost commented May 5, 2022

Um, could you provide more details on platform and Python version? In my profound ineptitude, I don't seem to be able to reproduce the reported error using Python 3.8 on a Linux Mint 20.3 Una distro.

Here's my test code:

$ cat test_unicode_escape.py 
import rdflib

tarek = rdflib.URIRef("urn:example:tarek")
likes = rdflib.URIRef("urn:example:likes")

g = rdflib.Graph()

g.add((tarek, likes, rdflib.Literal("\u00a71234")))

assert list(g) == [
    (
        rdflib.term.URIRef('urn:example:tarek'),
        rdflib.term.URIRef('urn:example:likes'),
        rdflib.term.Literal('§1234')
    )
]

q = """SELECT * WHERE { ?s ?p "\u00a71234" }"""

res = g.query(q)
lres = list(res)
assert rdflib.term.URIRef('urn:example:likes') in lres[0]
assert rdflib.term.URIRef('urn:example:tarek') in lres[0]

# q = 'SELECT * WHERE { ?s ?p "\U0001HHHH" }'

(I can't even get SELECT * WHERE { ?s ?p "\U0001HHHH" } past the interpreter):

$ python test_unicode_escape.py 
  File "test_unicode_escape.py", line 19
    q = 'SELECT * WHERE { ?s ?p "\U0001HHHH" }'
        ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 24-29: truncated \UXXXXXXXX escape

@kasei
Author

kasei commented May 5, 2022

Sure. Apologies. I'm not super familiar with rdflib, so specifics of calling code may be important here. I'm trying this on Python 3.8.9 on macOS. I didn't try to evaluate the query, just parse it. In the code below, both parseQuery and prepareQuery lead to problems.

#!/usr/bin/env python3

from rdflib.plugins.sparql import prepareQuery
from rdflib.plugins.sparql.parser import parseQuery

if __name__ == '__main__':
	sparql = 'SELECT * WHERE { ?s ?p "\\u00a71234" }'
	print(parseQuery(sparql))
	print(prepareQuery(sparql))

@kasei
Author

kasei commented May 5, 2022

I also suspect that you may be hitting python encoding issues (and not sparql parser encoding issues) when you use just a single backslash:

q = """SELECT * WHERE { ?s ?p "\u00a71234" }"""

@ghost
ghost commented May 5, 2022

> I also suspect that you may be hitting python encoding issues (and not sparql parser encoding issues) when you use just a single backslash:
>
> q = """SELECT * WHERE { ?s ?p "\u00a71234" }"""

I guess my limited model of the escaping principles is preventing me from achieving a full understanding of the issue you're reporting, gonna have to leave it to more knowledgeable folks 😮‍💨

@kasei
Author

kasei commented May 5, 2022

> I guess my limited model of the escaping principles is preventing me from achieving a full understanding of the issue you're reporting, gonna have to leave it to more knowledgeable folks 😮‍💨

It might be more understandable if you put SELECT * WHERE { ?s ?p "\u00a71234" } into a test.rq file, and then read the contents of the file into the variable q. Then you skip python trying to unescape stuff in the query as if it were just a string in python code.
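The difference can also be seen without a file. As a small sketch (variable names are illustrative): a normal Python string literal has its `\u` escape decoded by Python itself, whereas a raw literal, like text read from a file, keeps the backslash so that the SPARQL parser is the one that sees the escape.

```python
# Normal literal: Python decodes \u00a7 to "§" before rdflib ever sees it.
decoded_by_python = 'SELECT * WHERE { ?s ?p "\u00a71234" }'

# Raw literal: the backslash survives, as it would in text read from test.rq.
raw_for_sparql = r'SELECT * WHERE { ?s ?p "\u00a71234" }'

print("\\" in decoded_by_python)   # False
print("\\" in raw_for_sparql)      # True

# The raw literal equals the doubled-backslash form used in the script above.
print(raw_for_sparql == 'SELECT * WHERE { ?s ?p "\\u00a71234" }')   # True
```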

@ghost
ghost commented May 5, 2022

> It might be more understandable if you put SELECT * WHERE { ?s ?p "\u00a71234" } into a test.rq file, and then read the contents of the file into the variable q.

Ah right, I see. Thank you for your patience. For those following along at home, there's a discussion of numeric character escapes in w3c/sparql-dev#77 which I found informative. Also it's worth noting that currently RDFLib isn't testing the SPARQL parser against the latest W3C test suite but uses the earlier version, which doesn't have the codepoint tests added in w3c/rdf-tests#67. @aucampia and I are gradually hewing the RDFLib test suite into better shape and are close to migrating it to use the latest W3C test suite, where the problematic example originally posted (above) is an existing test datum (edit: incorrect, that test uses "\U0001f46a").

@ajnelson-nist
Contributor

I noticed one place in the current (aa6cde39) code uses the unicodedata built-in module. Is this thread stumbling on another use case for that module?

@aucampia aucampia added the bug Something isn't working label Jul 29, 2022