Achieving Zero-Copy with Escape Transformations in Chumsky - Is It Possible? #717

Oyelowo · 2025-01-14T07:19:15Z

Hi, thanks for the amazing library! I’m building a SQL-like lexer and aiming to remain zero-copy while also handling escape sequences like \' → ' or \\n → \n.

I’ve noticed that in many Chumsky examples (such as JSON parsing), the parser uses .to(...) or .then_ignore(...) to handle escape sequences, but the resulting tokens often store raw slices of the input using .to_slice(). This means the final structure still holds the original \\n, not an actual newline character.

My challenge:
I want the resulting token to contain the transformed character (e.g., a single \n instead of \\n) without breaking zero-copy. However, since these transformations involve changing the size of the substring, it seems inevitable that I’d need to allocate a new String.

Questions:

Is there a way to perform such escape transformations during lexing while preserving zero-copy for the rest of the token?
Is .to(...) with .to_slice() primarily used for validation rather than modifying the final token?
Is it generally accepted that “interpretation requires allocation,” or is there a common trick (like partial slices or on-the-fly transformations) that achieves a balance between zero-copy and transformed tokens?

I suspect that if we truly need to modify bytes (e.g., collapse \\n to \n), zero-copy might not be possible without allocating. Still, I'd love to hear any recommended patterns for unifying raw slices and transformed escapes in Chumsky without multiple passes or large allocations.
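
One way to read the "on-the-fly transformations" idea from the third question is to keep the token as a raw slice and decode escapes only when the characters are actually read. The following is a minimal sketch in plain Rust, independent of chumsky. The names `RawStr`, `Unescaped`, and `unescaped_chars` are hypothetical, only `\\`, `\'`, and `\n` are handled, and it assumes the lexer has already rejected malformed escapes.

```rust
use std::str::Chars;

/// Hypothetical token type: borrows the raw text between the quotes, escapes included.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RawStr<'a>(pub &'a str);

/// Iterator that decodes escape sequences lazily, one character at a time.
pub struct Unescaped<'a> {
    chars: Chars<'a>,
}

impl<'a> RawStr<'a> {
    /// Nothing is allocated here; the iterator walks the borrowed slice.
    pub fn unescaped_chars(&self) -> Unescaped<'a> {
        Unescaped { chars: self.0.chars() }
    }
}

impl<'a> Iterator for Unescaped<'a> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        match self.chars.next()? {
            // Assumption: the lexer already validated the escapes, so any
            // character other than `n` after a backslash stands for itself.
            '\\' => match self.chars.next() {
                Some('n') => Some('\n'),
                Some(other) => Some(other),
                // A trailing lone backslash is passed through unchanged.
                None => Some('\\'),
            },
            c => Some(c),
        }
    }
}
```

With this shape the token itself stays zero-copy, and an allocation happens only if a caller collects the iterator, for example `RawStr(r"it\'s").unescaped_chars().collect::<String>()`. The trade-off is that the decoded text is never available as a contiguous `&str` without that collect step.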

I currently have something like this (I also have a to_slice variant):

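use chumsky::prelude::*;

#[derive(Debug, Clone)]
struct Stringg(String);

// Not shown in the original post; inferred from how `Escape` is used below.
struct Escape(String);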

pub trait YeParser<'a, T>: Parser<'a, &'a str, T, extra::Err<Rich<'a, char>>> {}

impl<'a, P, T> YeParser<'a, T> for P where P: Parser<'a, &'a str, T, extra::Err<Rich<'a, char>>> {}


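pub fn stringg<'a>() -> impl YeParser<'a, Stringg> {
    // A backslash followed by an escape character, mapped to the character it stands for.
    let escape = just('\\')
        .ignore_then(choice((
            ...
            just('\\'),
            just('n').to('\n'),
            ...
        )))
        .map(|c| Escape(c.to_string()));

    // A run of ordinary characters: anything except a backslash or a single quote.
    let inner = none_of("\\\'")
        .repeated()
        .at_least(1)
        .collect::<String>();

    let content = inner.or(escape.map(|e| e.0));

    // Every fragment is an owned String, and the fragments are then joined into one more String.
    content
        .repeated()
        .at_least(1)
        .collect::<Vec<String>>()
        .map(|s| s.join(""))
        .map(Stringg)
        .delimited_by(just('\''), just('\''))
}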

Appreciate any guidance!

Answered by zesterer

zesterer (Maintainer) · 2025-01-19T23:30:01Z

Chumsky is not, in general, set up to modify an input's memory during parsing and, in fact, the library relies on being able to backtrack to observe input multiple times.

Turning \ + n into an ASCII newline character would leave the string with an unoccupied byte. By convention one might replace the unoccupied byte with a DEL, but this seems like a silly solution and unlikely to be meaningful for most text systems, and only works with ASCII.

Going back to chumsky, one pattern I've seen is to have two parsers: one for strings with no escape characters - which does not allocate - and another for strings with escape characters, which does allocate, to perform character replacement. They can be combined together using Cow, like so:

let unescaped = ... ; // outputs `&str`
let escaped = ... ; // outputs `String`

// First, try parsing with the unescaped version to skip allocating if we can
let string = unescaped.map(Cow::Borrowed)
    // if this doesn't work, fall back to allocating
    .or(escaped.map(Cow::Owned));
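
The same borrow-or-allocate split can be illustrated outside chumsky. The following is a hand-written sketch using only the standard library, not chumsky's API. The function name `unescape` is hypothetical, and it handles only `\\`, `\'`, and `\n`, treating any other escaped character as standing for itself.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decode the escapes in `raw`, allocating only when one is present.
pub fn unescape(raw: &str) -> Cow<'_, str> {
    // Fast path: no backslash means the input is already its own decoded form.
    let Some(first) = raw.find('\\') else {
        return Cow::Borrowed(raw);
    };

    // Slow path: copy the escape-free prefix once, then decode the rest.
    let mut out = String::with_capacity(raw.len());
    out.push_str(&raw[..first]);
    let mut chars = raw[first..].chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            // Assumption: other escapes (such as \\ and \') stand for the escaped character.
            Some(other) => out.push(other),
            // A trailing lone backslash is kept as-is in this sketch.
            None => out.push('\\'),
        }
    }
    Cow::Owned(out)
}
```

Callers can treat the result as a `&str` through `Deref` in both cases, so only code that cares about ownership needs to look at which variant it received.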

All this aside, I'd recommend that you benchmark. People often fret about the 'cost of allocation', but allocators tend to get a bad reputation because their cost is conflated with other features of high-level languages, such as dynamic dispatch, a lack of inlining, and poor cache locality. Most of these issues don't apply in Rust, and small allocations in particular are often very quick when using a decent modern allocator.

Oyelowo (Author) · 2025-01-23T16:22:37Z

Appreciate you taking the time!

I’ve settled on having one non-allocated variant and one allocated variant (for interpolated/non-interpolated cases). Instead of using Cow, I opted for a custom enum to explicitly control when to parse escaped vs. unescaped strings, depending on the usage context.
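
The post does not show the enum itself. One possible shape, with hypothetical names throughout (`StrLit`, `Raw`, `Interpreted`), might look like this:

```rust
/// Hypothetical token type; the original post does not show its definition.
#[derive(Debug, Clone, PartialEq)]
pub enum StrLit<'a> {
    /// No escapes were interpreted: a slice borrowed directly from the input.
    Raw(&'a str),
    /// Escapes were interpreted, so the text lives in its own allocation.
    Interpreted(String),
}

impl<'a> StrLit<'a> {
    /// View the contents as `&str` regardless of which variant is held.
    pub fn as_str(&self) -> &str {
        match self {
            StrLit::Raw(s) => *s,
            StrLit::Interpreted(s) => s.as_str(),
        }
    }

    /// True when the token still points into the original input.
    pub fn is_borrowed(&self) -> bool {
        matches!(self, StrLit::Raw(_))
    }
}
```

Compared with `Cow`, a dedicated enum like this lets the variant names say why the data is owned or borrowed, at the cost of writing the accessor methods by hand.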

Chumsky has been a pleasure to work with, great work!
