Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape IRIs and literals #3

Merged
merged 11 commits into from
Nov 4, 2022
Merged

Escape IRIs and literals #3

merged 11 commits into from
Nov 4, 2022

Conversation

benjay10
Copy link
Contributor

@benjay10 benjay10 commented Oct 4, 2022

Big fan of your work in Linked Data and we use your projects throughout our services. The workflow requires to generate SPARQL queries to be executed on a QPARQL endpoint and I use would like to use a library to produces strings from RDF.JS objects. When using this library, I got stung by incorrectly escaped literal strings, which caused the SPARQL queries to be syntactically incorrect. This problem would also affect writing to TTL and N-triples files. This issues is also mentioned by #1.

With this pull request, I would like to propose a solution for escaping special characters such as newlines, " and such as implemented by the amazing N3.js project. In fact, the escaping function and the escape characters are a copy of the ones they use. I have also included some tests for this new functionality, also based on their tests.

NOTE:

  • This PR is based on the previous PR about moving to ESLint from deprecated TSLint.
  • I don't know the implications of copying a function from another project, licensing wise.
  • This my first attempt at writing TypeScript, my "type-foo" is not up to speed yet.

Remove old deprecated tslint and move to eslint.
IRI's and string literals are now escaped, meaning that special
characters like ', ", line feeds, cariage returns, and more are escaped
using their backslash equivalent.
@rubensworks
Copy link
Owner

Thanks so much for this @benjay10!

I don't know the implications of copying a function from another project, licensing wise.

We shouldn't copy code directly from other projects. This will cause licensing issues indeed.
What we can do however, is implementing things similarly by looking at the code of other projects.

In any case, let's merge #3 first.

I took another look at the specification for Turtle
(https://www.w3.org/TR/turtle/#h3_literals) and found that only newline,
backslashes and quotes really needed to be escaped. What N3 does is
probably not incorrect, but just seems to complicated for now. So, this
is now my own implementation of an escape mechanism. No copies this
time, just inspiration from the spec and what also appears to be done in
https://github.com/linkeddata/rdflib.js. If this still looks like
copying, this is just basically how anyone could write it based on the
spec.

Oh, yes, I also removed escaping from the IRI of a NamedNode. It feels
like this library is not responsible for unsupported characters in IRIs.
The same could be said about what is inside a string literal, but for
that case, newlines and such *are allowed* if the string writer formats
it with the correct delimiters ("""). Our approach here is to escape
characters instead as a simplification.
@benjay10
Copy link
Contributor Author

benjay10 commented Oct 7, 2022

I came up with a different approach, that actually looks very similar to what rdflib.js does. I only found out while already implementing my idea, so this is a confirmation it should be correct(?). (This can be found here.) This time only backslashes, double quotes and newlines are escaped, and only from string literals. This seems in line with the Turtle spec here.

There is of course another possibility: using """ for encoding strings with newlines, but we would still have to escape backslashes and also check that the string does not contain any """ itself, or we would have to fallback to full escaping everything.

@rubensworks
Copy link
Owner

I came up with a different approach, that actually looks very similar to what rdflib.js does. I only found out while already implementing my idea, so this is a confirmation it should be correct(?). (This can be found here.) This time only backslashes, double quotes and newlines are escaped, and only from string literals. This seems in line with the Turtle spec here.

It's simpler indeed, but I think it's not going to be correct for all cases.

For example, \\n would not be additionally encoded in this approach, even though it should be. Because upon decoding, \\n will be converted to \n, which is not what the original value was, so we lose correct round-tripping.

So I think we'll need an approach similar to the previous one to handle these cases.

@benjay10
Copy link
Contributor Author

I'm easily confused with all these escape character, so please bear with me.

If you want the literal to contain \\n, shouldn't that be encoded as \\\\n? I wrote a test for that locally and it seems to work this way, it also seems to be implemented like this in rdflib.

@rubensworks
Copy link
Owner

rubensworks commented Oct 11, 2022

If you want the literal to contain \n, shouldn't that be encoded as \\n? I wrote a test for that locally and it seems to work this way, it also seems to be implemented like this in rdflib.

Yes, you're right. If this already works, then great :-)

What I'm still concerned about is the performance impact.
The previous version was implemented in O(n), while the new one is O(4*n) (n is string length) (since each replaceAll call iterates over the full string).
Also surrogate pairs and 4-bit unicode escape sequences are not handled anymore.

So I do think we need the complexity we had before.

@benjay10
Copy link
Contributor Author

benjay10 commented Oct 11, 2022

Scanning the string and replacing characters will always have a performance impact, so yes, the performance will be worse with any of the suggestions so far, but we somehow have to because producing incorrect TTL strings is just not OK.

I think your analysis of performance is not correct. Yes, it looks like the string is iterated 4 times, but if you only where to do it once, you would have to combine all the work in a singular, more complex function that might take 4 times as long as one of these simpler functions (or even more due to the increased amount of if or switch cases). I didn't do any research, but maybe simple regex are executed with native code in some JavaScript interpreters and would outperform the complexer replace with a handler function. If performance is so important, then we should really empirically compare several methods. (Also, O(4n) is equivalent to O(n), but I understand what you want to indicate.)

Also, according to the Turtle specification, there is no mention of surrogate pairs or these 4-bit unicode escape sequences not being allowed. Just mentions about the delimiter characters for strings and newlines, because they would interfere with the Turtle syntax itself. If these special characters are allowed in a Turtle file (given its encoding), are they still problematic?

@rubensworks
Copy link
Owner

so yes, the performance will be worse with any of the suggestions so far, but we somehow have to because producing incorrect TTL strings is just not OK.

Yes, we agree on the incorrectness issue :-)
But your first suggestion looked fine performance-wise.

If performance is so important, then we should really empirically compare several methods.

Sure, sounds good!

Also, according to the Turtle specification, there is no mention of surrogate pairs or these 4-bit unicode escape sequences not being allowed.

This is implied by its dependency on unicode, just like SPARQL.
See for example these related SPARQL spec tests: w3c/rdf-tests#65 w3c/rdf-tests#67

@benjay10
Copy link
Contributor Author

Reverted back to the more complex escaping situation, but slightly rewritten. Is this enough to not infringe on the licenses? Is this somewhat of a final result you had in mind?

@rubensworks
Copy link
Owner

Thanks @benjay10, that approach looks good to me!

Could you rebase your PR onto master? It looks like the linter changes also slipped in here from your previous PR.

And also performed Linting fixes on the escape changes.
@benjay10
Copy link
Contributor Author

benjay10 commented Nov 2, 2022

I hope this is okay. The merging got complicated because the linting changed the regular expressions and some tests started failing because of it (hence the eslint-disable directives).

.eslintrc.cjs Outdated Show resolved Hide resolved
lib/TermUtil.ts Outdated
return stringValue;
}

/* eslint-disable require-unicode-regexp */
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can just follow this rule, and add the u flag to the regexes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, adding the flag changes the outcome of the tests. The test case with the smiley breaks. This regex is no longer matching with the surrogate pairs. I don't know how to fix this (yet).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it clicked in my head all of a sudden. Adding the u flag means you have to be more careful matching on unicode code points, so it is not possible to match on only the first part of a surrogate pair. This means the regex has to inlcude both halves and thus become exactly like the second one. I'll commit it below.

lib/TermUtil.ts Outdated
}

/* eslint-disable require-unicode-regexp */
/* eslint-disable unicorn/better-regex */
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What change did this rule suggest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linting automatically changes the first regex from
/["\\\t\n\r\b\f\u0000-\u0019\uD800-\uDBFF]/ to
/[\b\u0000-\u0019"\\\uD800-\uDBFF]/.
They act the same. Is that because line feeds, carriage returns, and such are included in those unicode ranges? Nevertheless, the second pattern is even more cryptic than the first: I wouldn't know if a tab is included in that pattern or not, while that is clearer from the first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the rearranged patterns do not affect the test outcomes. If the tests are complete and correct, this change doesn't matter, besides for clarity reasons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the above comments. After rewriting the regex, the linter keeps quiet about it. 🤷

Sorry for not noticing this silly mistake. These file are hidden in the
file explorer of my editor so I didn't notice the duplicate-ish one.
Adding the u flag to the regex led to unavoidably rewriting them so they
became identical, because with the u flag, you can't match on only the
first half of a surrogate pair. With them identical, I of course removed
one of them.
@benjay10 benjay10 requested a review from rubensworks November 4, 2022 01:39
Copy link
Owner

@rubensworks rubensworks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks perfect, thanks so much for this! Preparing a release now...

@rubensworks rubensworks merged commit 917eb73 into rubensworks:master Nov 4, 2022
@rubensworks
Copy link
Owner

Released as 1.3.0.

@benjay10 benjay10 mentioned this pull request Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants