Escape IRIs and literals #3

benjay10 · 2022-10-04T14:34:44Z

Big fan of your work in Linked Data and we use your projects throughout our services. The workflow requires to generate SPARQL queries to be executed on a QPARQL endpoint and I use would like to use a library to produces strings from RDF.JS objects. When using this library, I got stung by incorrectly escaped literal strings, which caused the SPARQL queries to be syntactically incorrect. This problem would also affect writing to TTL and N-triples files. This issues is also mentioned by #1.

With this pull request, I would like to propose a solution for escaping special characters such as newlines, " and such as implemented by the amazing N3.js project. In fact, the escaping function and the escape characters are a copy of the ones they use. I have also included some tests for this new functionality, also based on their tests.

NOTE:

This PR is based on the previous PR about moving to ESLint from deprecated TSLint.
I don't know the implications of copying a function from another project, licensing wise.
This my first attempt at writing TypeScript, my "type-foo" is not up to speed yet.

Remove old deprecated tslint and move to eslint.

IRI's and string literals are now escaped, meaning that special characters like ', ", line feeds, cariage returns, and more are escaped using their backslash equivalent.

rubensworks · 2022-10-05T06:23:44Z

Thanks so much for this @benjay10!

I don't know the implications of copying a function from another project, licensing wise.

We shouldn't copy code directly from other projects. This will cause licensing issues indeed.
What we can do however, is implementing things similarly by looking at the code of other projects.

In any case, let's merge #3 first.

I took another look at the specification for Turtle (https://www.w3.org/TR/turtle/#h3_literals) and found that only newline, backslashes and quotes really needed to be escaped. What N3 does is probably not incorrect, but just seems to complicated for now. So, this is now my own implementation of an escape mechanism. No copies this time, just inspiration from the spec and what also appears to be done in https://github.com/linkeddata/rdflib.js. If this still looks like copying, this is just basically how anyone could write it based on the spec. Oh, yes, I also removed escaping from the IRI of a NamedNode. It feels like this library is not responsible for unsupported characters in IRIs. The same could be said about what is inside a string literal, but for that case, newlines and such *are allowed* if the string writer formats it with the correct delimiters ("""). Our approach here is to escape characters instead as a simplification.

benjay10 · 2022-10-07T23:51:30Z

I came up with a different approach, that actually looks very similar to what rdflib.js does. I only found out while already implementing my idea, so this is a confirmation it should be correct(?). (This can be found here.) This time only backslashes, double quotes and newlines are escaped, and only from string literals. This seems in line with the Turtle spec here.

There is of course another possibility: using """ for encoding strings with newlines, but we would still have to escape backslashes and also check that the string does not contain any """ itself, or we would have to fallback to full escaping everything.

rubensworks · 2022-10-10T09:16:48Z

I came up with a different approach, that actually looks very similar to what rdflib.js does. I only found out while already implementing my idea, so this is a confirmation it should be correct(?). (This can be found here.) This time only backslashes, double quotes and newlines are escaped, and only from string literals. This seems in line with the Turtle spec here.

It's simpler indeed, but I think it's not going to be correct for all cases.

For example, \\n would not be additionally encoded in this approach, even though it should be. Because upon decoding, \\n will be converted to \n, which is not what the original value was, so we lose correct round-tripping.

So I think we'll need an approach similar to the previous one to handle these cases.

benjay10 · 2022-10-10T16:27:19Z

I'm easily confused with all these escape character, so please bear with me.

If you want the literal to contain \\n, shouldn't that be encoded as \\\\n? I wrote a test for that locally and it seems to work this way, it also seems to be implemented like this in rdflib.

rubensworks · 2022-10-11T04:45:24Z

If you want the literal to contain \n, shouldn't that be encoded as \\n? I wrote a test for that locally and it seems to work this way, it also seems to be implemented like this in rdflib.

Yes, you're right. If this already works, then great :-)

What I'm still concerned about is the performance impact.
The previous version was implemented in O(n), while the new one is O(4*n) (n is string length) (since each replaceAll call iterates over the full string).
Also surrogate pairs and 4-bit unicode escape sequences are not handled anymore.

So I do think we need the complexity we had before.

benjay10 · 2022-10-11T18:22:07Z

Scanning the string and replacing characters will always have a performance impact, so yes, the performance will be worse with any of the suggestions so far, but we somehow have to because producing incorrect TTL strings is just not OK.

I think your analysis of performance is not correct. Yes, it looks like the string is iterated 4 times, but if you only where to do it once, you would have to combine all the work in a singular, more complex function that might take 4 times as long as one of these simpler functions (or even more due to the increased amount of if or switch cases). I didn't do any research, but maybe simple regex are executed with native code in some JavaScript interpreters and would outperform the complexer replace with a handler function. If performance is so important, then we should really empirically compare several methods. (Also, O(4n) is equivalent to O(n), but I understand what you want to indicate.)

Also, according to the Turtle specification, there is no mention of surrogate pairs or these 4-bit unicode escape sequences not being allowed. Just mentions about the delimiter characters for strings and newlines, because they would interfere with the Turtle syntax itself. If these special characters are allowed in a Turtle file (given its encoding), are they still problematic?

rubensworks · 2022-10-12T05:14:49Z

so yes, the performance will be worse with any of the suggestions so far, but we somehow have to because producing incorrect TTL strings is just not OK.

Yes, we agree on the incorrectness issue :-)
But your first suggestion looked fine performance-wise.

If performance is so important, then we should really empirically compare several methods.

Sure, sounds good!

Also, according to the Turtle specification, there is no mention of surrogate pairs or these 4-bit unicode escape sequences not being allowed.

This is implied by its dependency on unicode, just like SPARQL.
See for example these related SPARQL spec tests: w3c/rdf-tests#65 w3c/rdf-tests#67

This reverts commit 7c267ee.

This reverts commit 01970d1.

benjay10 · 2022-10-19T03:23:24Z

Reverted back to the more complex escaping situation, but slightly rewritten. Is this enough to not infringe on the licenses? Is this somewhat of a final result you had in mind?

rubensworks · 2022-10-31T08:06:38Z

Thanks @benjay10, that approach looks good to me!

Could you rebase your PR onto master? It looks like the linter changes also slipped in here from your previous PR.

And also performed Linting fixes on the escape changes.

benjay10 · 2022-11-02T16:43:02Z

I hope this is okay. The merging got complicated because the linting changed the regular expressions and some tests started failing because of it (hence the eslint-disable directives).

.eslintrc.cjs

rubensworks · 2022-11-03T09:06:26Z

lib/TermUtil.ts

+  return stringValue;
+}
+
+/* eslint-disable require-unicode-regexp */


I guess we can just follow this rule, and add the u flag to the regexes?

No, adding the flag changes the outcome of the tests. The test case with the smiley breaks. This regex is no longer matching with the surrogate pairs. I don't know how to fix this (yet).

I think it clicked in my head all of a sudden. Adding the u flag means you have to be more careful matching on unicode code points, so it is not possible to match on only the first part of a surrogate pair. This means the regex has to inlcude both halves and thus become exactly like the second one. I'll commit it below.

rubensworks · 2022-11-03T09:07:00Z

lib/TermUtil.ts

+}
+
+/* eslint-disable require-unicode-regexp */
+/* eslint-disable unicorn/better-regex */


What change did this rule suggest?

The linting automatically changes the first regex from
/["\\\t\n\r\b\f\u0000-\u0019\uD800-\uDBFF]/ to
/[\b\u0000-\u0019"\\\uD800-\uDBFF]/.
They act the same. Is that because line feeds, carriage returns, and such are included in those unicode ranges? Nevertheless, the second pattern is even more cryptic than the first: I wouldn't know if a tab is included in that pattern or not, while that is clearer from the first.

To be clear, the rearranged patterns do not affect the test outcomes. If the tests are complete and correct, this change doesn't matter, besides for clarity reasons.

See the above comments. After rewriting the regex, the linter keeps quiet about it. 🤷

Sorry for not noticing this silly mistake. These file are hidden in the file explorer of my editor so I didn't notice the duplicate-ish one.

Adding the u flag to the regex led to unavoidably rewriting them so they became identical, because with the u flag, you can't match on only the first half of a surrogate pair. With them identical, I of course removed one of them.

rubensworks

Looks perfect, thanks so much for this! Preparing a release now...

rubensworks · 2022-11-04T07:40:42Z

Released as 1.3.0.

benjay10 added 2 commits October 4, 2022 02:49

Upgrade packages and move to eslint

8dacb13

Remove old deprecated tslint and move to eslint.

Added support for escaping literals and IRI's

1c97dcb

IRI's and string literals are now escaped, meaning that special characters like ', ", line feeds, cariage returns, and more are escaped using their backslash equivalent.

benjay10 added 2 commits October 8, 2022 01:28

Remove unnecessary tests

7c267ee

benjay10 added 4 commits October 19, 2022 03:29

Revert "Remove unnecessary tests"

5314289

This reverts commit 7c267ee.

Revert "Drastically simplified character escaping"

039a80e

This reverts commit 01970d1.

Reverted back to more complex escaping, rewritten

8c775fa

Support 4-byte escape codes

31dcdcd

Merge branch 'master' into escape-IRI-and-literal

4730982

And also performed Linting fixes on the escape changes.

rubensworks requested changes Nov 3, 2022

View reviewed changes

benjay10 added 2 commits November 4, 2022 02:34

Remove redundant eslintrc file

a84f2e2

Sorry for not noticing this silly mistake. These file are hidden in the file explorer of my editor so I didn't notice the duplicate-ish one.

Removed linting directives, rewritten regex

aaf6a77

Adding the u flag to the regex led to unavoidably rewriting them so they became identical, because with the u flag, you can't match on only the first half of a surrogate pair. With them identical, I of course removed one of them.

benjay10 requested a review from rubensworks November 4, 2022 01:39

rubensworks approved these changes Nov 4, 2022

View reviewed changes

rubensworks merged commit 917eb73 into rubensworks:master Nov 4, 2022

benjay10 mentioned this pull request Nov 9, 2022

Incorrect escaping of IRIs #5

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Escape IRIs and literals #3

Escape IRIs and literals #3

benjay10 commented Oct 4, 2022

rubensworks commented Oct 5, 2022

benjay10 commented Oct 7, 2022

rubensworks commented Oct 10, 2022

benjay10 commented Oct 10, 2022

rubensworks commented Oct 11, 2022 •

edited

Loading

benjay10 commented Oct 11, 2022 •

edited

Loading

rubensworks commented Oct 12, 2022

benjay10 commented Oct 19, 2022

rubensworks commented Oct 31, 2022

benjay10 commented Nov 2, 2022

rubensworks Nov 3, 2022

benjay10 Nov 4, 2022

benjay10 Nov 4, 2022

rubensworks Nov 3, 2022

benjay10 Nov 4, 2022

benjay10 Nov 4, 2022

benjay10 Nov 4, 2022

rubensworks left a comment

rubensworks commented Nov 4, 2022

Escape IRIs and literals #3

Escape IRIs and literals #3

Conversation

benjay10 commented Oct 4, 2022

rubensworks commented Oct 5, 2022

benjay10 commented Oct 7, 2022

rubensworks commented Oct 10, 2022

benjay10 commented Oct 10, 2022

rubensworks commented Oct 11, 2022 • edited Loading

benjay10 commented Oct 11, 2022 • edited Loading

rubensworks commented Oct 12, 2022

benjay10 commented Oct 19, 2022

rubensworks commented Oct 31, 2022

benjay10 commented Nov 2, 2022

rubensworks Nov 3, 2022

Choose a reason for hiding this comment

benjay10 Nov 4, 2022

Choose a reason for hiding this comment

benjay10 Nov 4, 2022

Choose a reason for hiding this comment

rubensworks Nov 3, 2022

Choose a reason for hiding this comment

benjay10 Nov 4, 2022

Choose a reason for hiding this comment

benjay10 Nov 4, 2022

Choose a reason for hiding this comment

benjay10 Nov 4, 2022

Choose a reason for hiding this comment

rubensworks left a comment

Choose a reason for hiding this comment

rubensworks commented Nov 4, 2022

rubensworks commented Oct 11, 2022 •

edited

Loading

benjay10 commented Oct 11, 2022 •

edited

Loading