Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single carriage return (\r) not treated as line break #595

Open
jackwhelpton opened this issue Dec 16, 2024 · 2 comments
Open

Single carriage return (\r) not treated as line break #595

jackwhelpton opened this issue Dec 16, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@jackwhelpton
Copy link

jackwhelpton commented Dec 16, 2024

Describe the bug
Single carriage return line endings (\r) are treated as non-breaking characters. This is contrary to the YAML spec, which explicitly calls out CR, LF and CRLF as the (only) possible line breaks.

A similar problem afflicts other parsers, for example jbeder/yaml-cpp#986

To Reproduce

const lfTokens = Array.from((new Lexer()).lex("text: \"a\n\n\n\n b\""));
console.log(lfTokens.length);

const crTokens = Array.from((new Lexer()).lex("text: \"a\n\r\r\r b\""));
console.log(crTokens.length);

Note the output:

6
9

Expected behaviour

6
6

The extra tokens in the second case are a result of \r being treated like any other (non-breaking) character, meaning the tokens are split only at the first \n and not reassembled:

(..., a, \n, \x1F, \r\r\r b)

Versions (please complete the following information):

  • Environment: [e.g. Node.js 14.7.0 or Chrome 87.0]
    Node.js 20.17.0

  • yaml: [e.g. 1.10.0 or 2.0.0-2]
    2.6.1

Additional context
There are several places this seems to manifest, for example:

https://github.com/eemeli/yaml/blob/main/src/parse/lexer.ts#L223
https://github.com/eemeli/yaml/blob/main/src/parse/lexer.ts#L191

@jackwhelpton jackwhelpton added the bug Something isn't working label Dec 16, 2024
@eemeli
Copy link
Owner

eemeli commented Dec 16, 2024

While recognising that this is a divergence from the spec, does it affect any real-world use for you? As in, do you have some content that is actually using a lone \r as a line break?

@jackwhelpton
Copy link
Author

No real-world impact at the moment, beyond the time spent debugging it.

For full context: I'm working on a component that extracts YAML fields from a larger document for passing to a third-party tool, and needs to be able to correlate line and column number from that extracted content with the original document.

I'd hoped I could reuse much of the existing logic, but given the need to know about the effects on column position of expanding, for instance, escape codes, for now I'm effectively replacing the whole "compose" layer. In the rewritten version, I keep track of positions, so I can emit a list of coordinates with the composed text that lets me align it to the original source doc.

My set of unit tests/edge cases for this include a mixture of line endings, and those are failing, which is why I had to track the problem back through the layers (right to the lexer).

If there's no intent to fix this, for now I just need to ensure all my test cases use only \n or \r\n, and that I document that as a restriction. I think mixing those two in a file would also be supported, as most/all of the problems I've encountered so far seem to revolve around \r.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants