-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Escape IRIs and literals #3
Escape IRIs and literals #3
Conversation
Remove old deprecated tslint and move to eslint.
IRI's and string literals are now escaped, meaning that special characters like ', ", line feeds, cariage returns, and more are escaped using their backslash equivalent.
Thanks so much for this @benjay10!
We shouldn't copy code directly from other projects. This will cause licensing issues indeed. In any case, let's merge #3 first. |
I took another look at the specification for Turtle (https://www.w3.org/TR/turtle/#h3_literals) and found that only newline, backslashes and quotes really needed to be escaped. What N3 does is probably not incorrect, but just seems to complicated for now. So, this is now my own implementation of an escape mechanism. No copies this time, just inspiration from the spec and what also appears to be done in https://github.com/linkeddata/rdflib.js. If this still looks like copying, this is just basically how anyone could write it based on the spec. Oh, yes, I also removed escaping from the IRI of a NamedNode. It feels like this library is not responsible for unsupported characters in IRIs. The same could be said about what is inside a string literal, but for that case, newlines and such *are allowed* if the string writer formats it with the correct delimiters ("""). Our approach here is to escape characters instead as a simplification.
I came up with a different approach, that actually looks very similar to what rdflib.js does. I only found out while already implementing my idea, so this is a confirmation it should be correct(?). (This can be found here.) This time only backslashes, double quotes and newlines are escaped, and only from string literals. This seems in line with the Turtle spec here. There is of course another possibility: using |
It's simpler indeed, but I think it's not going to be correct for all cases. For example, So I think we'll need an approach similar to the previous one to handle these cases. |
I'm easily confused with all these escape character, so please bear with me. If you want the literal to contain |
Yes, you're right. If this already works, then great :-) What I'm still concerned about is the performance impact. So I do think we need the complexity we had before. |
Scanning the string and replacing characters will always have a performance impact, so yes, the performance will be worse with any of the suggestions so far, but we somehow have to because producing incorrect TTL strings is just not OK. I think your analysis of performance is not correct. Yes, it looks like the string is iterated 4 times, but if you only where to do it once, you would have to combine all the work in a singular, more complex function that might take 4 times as long as one of these simpler functions (or even more due to the increased amount of Also, according to the Turtle specification, there is no mention of surrogate pairs or these 4-bit unicode escape sequences not being allowed. Just mentions about the delimiter characters for strings and newlines, because they would interfere with the Turtle syntax itself. If these special characters are allowed in a Turtle file (given its encoding), are they still problematic? |
Yes, we agree on the incorrectness issue :-)
Sure, sounds good!
This is implied by its dependency on unicode, just like SPARQL. |
Reverted back to the more complex escaping situation, but slightly rewritten. Is this enough to not infringe on the licenses? Is this somewhat of a final result you had in mind? |
Thanks @benjay10, that approach looks good to me! Could you rebase your PR onto master? It looks like the linter changes also slipped in here from your previous PR. |
And also performed Linting fixes on the escape changes.
I hope this is okay. The merging got complicated because the linting changed the regular expressions and some tests started failing because of it (hence the eslint-disable directives). |
lib/TermUtil.ts
Outdated
return stringValue; | ||
} | ||
|
||
/* eslint-disable require-unicode-regexp */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can just follow this rule, and add the u
flag to the regexes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, adding the flag changes the outcome of the tests. The test case with the smiley breaks. This regex is no longer matching with the surrogate pairs. I don't know how to fix this (yet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it clicked in my head all of a sudden. Adding the u
flag means you have to be more careful matching on unicode code points, so it is not possible to match on only the first part of a surrogate pair. This means the regex has to inlcude both halves and thus become exactly like the second one. I'll commit it below.
lib/TermUtil.ts
Outdated
} | ||
|
||
/* eslint-disable require-unicode-regexp */ | ||
/* eslint-disable unicorn/better-regex */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What change did this rule suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The linting automatically changes the first regex from
/["\\\t\n\r\b\f\u0000-\u0019\uD800-\uDBFF]/
to
/[\b\u0000-\u0019"\\\uD800-\uDBFF]/
.
They act the same. Is that because line feeds, carriage returns, and such are included in those unicode ranges? Nevertheless, the second pattern is even more cryptic than the first: I wouldn't know if a tab is included in that pattern or not, while that is clearer from the first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, the rearranged patterns do not affect the test outcomes. If the tests are complete and correct, this change doesn't matter, besides for clarity reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the above comments. After rewriting the regex, the linter keeps quiet about it. 🤷
Sorry for not noticing this silly mistake. These file are hidden in the file explorer of my editor so I didn't notice the duplicate-ish one.
Adding the u flag to the regex led to unavoidably rewriting them so they became identical, because with the u flag, you can't match on only the first half of a surrogate pair. With them identical, I of course removed one of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks perfect, thanks so much for this! Preparing a release now...
Released as 1.3.0. |
Big fan of your work in Linked Data and we use your projects throughout our services. The workflow requires to generate SPARQL queries to be executed on a QPARQL endpoint and I use would like to use a library to produces strings from RDF.JS objects. When using this library, I got stung by incorrectly escaped literal strings, which caused the SPARQL queries to be syntactically incorrect. This problem would also affect writing to TTL and N-triples files. This issues is also mentioned by #1.
With this pull request, I would like to propose a solution for escaping special characters such as newlines,
"
and such as implemented by the amazing N3.js project. In fact, the escaping function and the escape characters are a copy of the ones they use. I have also included some tests for this new functionality, also based on their tests.NOTE: