Skip to content

Commit

Permalink
Improve error handling in the DFA builder et al. (#350)
Browse files Browse the repository at this point in the history
* Skip building the DFA if parsing a string regex fails.

* Simplify tracking state when visiting regexes.

We don't need to separately track is/has void; we can only track the latter. And failed string regexes are specially tracked and do not emit `FARKLE0003`.

* Avoid stack overflows when processing too deeply nested regexes.

* Improve documentation.
Add example for `FARKLE0001` and consistently use "regex" instead of "regular expression".
  • Loading branch information
teo-tsirpanis authored Jan 3, 2025
1 parent 29c378e commit 0653016
Show file tree
Hide file tree
Showing 10 changed files with 134 additions and 34 deletions.
16 changes: 13 additions & 3 deletions docs/diagnostics/FARKLE0001.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,10 +6,20 @@ description: FARKLE0001: DFA exceeds maximum number of states
---
# FARKLE0001: DFA exceeds maximum number of states

Certain regular expressions can result in a DFA that has a disproportionately large number of states. For example the regex `[ab]*a[ab]{32}` will result in a DFA with billions of states. In previous versions of Farkle, such regexes would cause the builder to consume large amounts of memory. Starting with Farkle 7 there is a limit on the number of states that a DFA can have. This error is emitted when that limit is reached, in which case no DFA gets built.
Certain regexes can result in a DFA that has a disproportionately large number of states. For example the regex `[ab]*a[ab]{32}` will result in a DFA with billions of states. In previous versions of Farkle, such regexes would cause the builder to consume large amounts of memory. Starting with Farkle 7 there is a limit on the number of states that a DFA can have. This error is emitted when that limit is reached, in which case no DFA gets built.

This limit can be customized by setting the `BuilderOptions.MaxTokenizerStates` property. Setting it to `int.MaxValue` will disable the limit.

TODO: Add an example when the relevant APIs get implemented.

The default value is an implementation detail and linearly scales with the complexity of the regexes being built, thus avoiding exponential state blowups.

## Example code

```csharp
var terminal = Terminal.Create("My Terminal", Regex.FromRegexString("[ab]*a[ab]{20}"));

// Non-compliant code
var parser = terminal.BuildSyntaxCheck();

// Compliant code
var parser = terminal.BuildSyntaxCheck(new BuilderOptions() { MaxTokenizerStates = int.MaxValue });
```
10 changes: 4 additions & 6 deletions docs/diagnostics/FARKLE0009.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,15 +2,13 @@
category: Diagnostic codes
categoryindex: 3
title: FARKLE0009
description: FARKLE0009: Failed to parse regular expresssion
description: FARKLE0009: Failed to parse regex
---
# FARKLE0009: Failed to parse regular expresssion
# FARKLE0009: Failed to parse regex

This error is emitted when Farkle fails to parse a string regular expression that was passed in the `Regex.FromRegexString` method, or the `Regex.regexString` F# function.
This error is emitted when Farkle fails to parse a string regex that was passed in the `Regex.FromRegexString` method, or the `Regex.regexString` F# function. In this case no DFA gets built and the grammar cannot be used for tokenizing.

When this error is emitted, the string regex will be substituted with [a regex that matches nothing](FARKLE0003.md), and the parser built by Farkle will always fail to parse any text. The grammar will still be generated, but directly using it for tokenizing will very likely result in unexpected behavior.

To fix this error, ensure that the regular expression string is valid. The error message will help you identify and fix the problem.
To fix this error, ensure that the regex string is valid. The error message will help you identify and fix the problem.

## Example code

Expand Down
11 changes: 11 additions & 0 deletions docs/diagnostics/FARKLE0010.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
---
category: Diagnostic codes
categoryindex: 3
title: FARKLE0010
description: FARKLE0010: Regex is too complex
---
# FARKLE0010: Regex is too complex

This error is emitted when Farkle fails to process a regex because it reached a limitation of the library or the system. In this case no DFA gets built and the grammar cannot be used for tokenizing.

The precise circumstances that trigger this error are an implementation detail, but encountering it is not expected to be common when working with real-world regexes.
67 changes: 43 additions & 24 deletions src/Farkle/Builder/Dfa/DfaBuild.cs
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
using Farkle.Grammars.Writers;
using System.Collections.Immutable;
using System.Diagnostics;
using System.Runtime.CompilerServices;

namespace Farkle.Builder.Dfa;

Expand Down Expand Up @@ -103,7 +104,11 @@ private DfaBuild(IGrammarSymbolsProvider symbols, BuilderLogger log, Cancellatio
// successfully produces a DFA that will always fail either way.

var @this = new DfaBuild<TChar>(symbols, log, cancellationToken);
var (leaves, followPos, rootFirstPos) = @this.BuildRegexTree(caseSensitive);
var (leaves, followPos, rootFirstPos, hasError) = @this.BuildRegexTree(caseSensitive);
if (hasError)
{
return null;
}
maxTokenizerStates = BuilderOptions.GetMaxTokenizerStates(maxTokenizerStates, leaves.Count);
var dfaStates = @this.BuildDfaStates(leaves, followPos, rootFirstPos, maxTokenizerStates);
if (dfaStates is null)
Expand Down Expand Up @@ -495,29 +500,26 @@ private static Regex LowerRegex(Regex regex, bool caseSensitive, Dictionary<(Reg
return null!;
}

private (List<RegexLeaf> Leaves, List<BitSet> FollowPos, BitSet RootFirstPos) BuildRegexTree(bool caseSensitive)
private (List<RegexLeaf> Leaves, List<BitSet> FollowPos, BitSet RootFirstPos, bool HasError) BuildRegexTree(bool caseSensitive)
{
Dictionary<(Regex, bool CaseSensitive), Regex> loweredRegexCache = [];
List<RegexLeaf> leaves = [];
List<BitSet> followPos = [];
BitSet rootFirstPos = BitSet.Empty;
bool hasError = false;

int count = Symbols.SymbolCount;
for (int i = 0; i < count; i++)
{
Regex regex = Symbols.GetRegex(i);
RegexCharacteristics characteristics = RegexCharacteristics.None;
// If the symbol's regex's root is an Alt, we assign each of its children a different priority. This
// emulates the behavior of GOLD Parser and resolves some nasty indistinguishable symbols errors.
// Earlier versions of Farkle were flattening nested Alts. Because we are not doing that anymore,
// this will slightly change behavior, but the impact is so small that it's not worth proactively
// caring about.
ReadOnlySpan<Regex> regexes = regex.IsAlt(out var altRegexes) ? altRegexes.AsSpan() : [regex];
// The regex contains Regex.Void somewhere.
bool hasVoid = regexes.IsEmpty;
// The entire regex is equivalent to Regex.Void and impossible to match.
// We detect that by checking if it's not nullable or its LastPos is empty,
// which would make it unable to flow from the root to the end leaf.
bool isVoid = true;
int? endLeafIndexTerminal = null, endLeafIndexLiteral = null;
if (Symbols.GetName(i).Kind == TokenSymbolKind.Noise)
{
Expand All @@ -535,28 +537,32 @@ private static Regex LowerRegex(Regex regex, bool caseSensitive, Dictionary<(Reg
rootFirstPos = rootFirstPos.Set(leafIndex, true);
}
LinkFollowPos(in info.LastPos, BitSet.Singleton(leafIndex));
hasVoid |= info.HasVoid;
isVoid &= info.IsVoid;
characteristics |= info.Characteristics;
}
// If your regex is void but somehow skips the initial unwrapping of the Alt (such as by
// being a failed string regex, which gets expanded to void later), Visit will _be_ void,
// but not _have_ void, because the latter gets propagated on concatenations, which means
// that the assert does not hold.
// Debug.Assert(!isVoid || hasVoid, "Internal error: isVoid => hasVoid does not hold.");
// Let's emit the same diagnostic for a regex that both is entirely void
// or part of it is. This situation is very niche.
if (Log.IsEnabled(DiagnosticSeverity.Warning) && isVoid || hasVoid)
bool regexHasError = (characteristics & RegexCharacteristics.HasError) != 0;
bool hasVoid = regexes.IsEmpty || (characteristics & RegexCharacteristics.HasVoid) != 0;
hasError |= regexHasError;
if (Log.IsEnabled(DiagnosticSeverity.Warning) && !regexHasError && hasVoid)
{
Log.RegexContainsVoid(Symbols.GetName(i));
}
if ((characteristics & RegexCharacteristics.IsTooComplex) != 0)
{
Log.RegexTooComplexError(Symbols.GetName(i));
}
}

return (leaves, followPos, rootFirstPos);
return (leaves, followPos, rootFirstPos, hasError);

RegexInfo Visit(in DfaBuild<TChar> @this, int symbolIndex, Regex regex, bool caseSensitive, bool isLowered = false)
{
@this.CancellationToken.ThrowIfCancellationRequested();

if (!RuntimeHelpersCompat.TryEnsureSufficientExecutionStack())
{
return RegexInfo.Error(RegexCharacteristics.IsTooComplex);
}

bool isCaseOverriden = false;
caseSensitive = regex.AdjustCaseSensitivityFlag(caseSensitive, ref isCaseOverriden);

Expand All @@ -574,8 +580,7 @@ RegexInfo Visit(in DfaBuild<TChar> @this, int symbolIndex, Regex regex, bool cas
// for little benefit; the most common usage pattern of string regexes is directly on a terminal,
// and not composed in another regex.
@this.Log.RegexStringParseError(@this.Symbols.GetName(symbolIndex), error);
regex = Regex.Void;
break;
return RegexInfo.Error();
}
caseSensitive = regex.AdjustCaseSensitivityFlag(caseSensitive, ref isCaseOverriden);
}
Expand Down Expand Up @@ -756,12 +761,10 @@ private readonly struct RegexInfo(BitSet FirstPos, BitSet LastPos, bool IsNullab
/// </remarks>
public bool IsVoid => !IsNullable && LastPos.IsEmpty;

private RegexCharacteristics Characteristics { get; } = Characteristics;
public RegexCharacteristics Characteristics { get; } = Characteristics;

public bool HasStar => (Characteristics & RegexCharacteristics.HasStar) != 0;

public bool HasVoid => (Characteristics & RegexCharacteristics.HasVoid) != 0;

public RegexInfo AsOptional() =>
new(FirstPos, LastPos, true, Characteristics);

Expand All @@ -772,6 +775,9 @@ public RegexInfo AsStar() =>

public static RegexInfo Void => new(BitSet.Empty, BitSet.Empty, false);

public static RegexInfo Error(RegexCharacteristics extraCharacteristics = 0) =>
new(BitSet.Empty, BitSet.Empty, false, RegexCharacteristics.HasError | extraCharacteristics);

public static RegexInfo Singleton(int index)
{
BitSet pos = BitSet.Singleton(index);
Expand Down Expand Up @@ -836,7 +842,20 @@ private enum RegexCharacteristics
/// <see cref="RegexInfo.IsVoid"/> property.
/// </para>
/// </remarks>
HasVoid = 2
HasVoid = 2,
/// <summary>
/// Processing of the regex failed for some reason. The builder will continue
/// processing the regexes to uncover more errors, but will not emit a DFA.
/// </summary>
HasError = 4,
/// <summary>
/// Processing of the regex failed because it is too complex.
/// Must be specified alongside <see cref="HasError"/>.
/// </summary>
/// <remarks>
/// This flag exists to report an error only once per symbol.
/// </remarks>
IsTooComplex = 8
}

private enum IntervalType : byte
Expand Down
26 changes: 26 additions & 0 deletions src/Farkle/Compatibility/RuntimeHelpersCompat.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
// Copyright © Theodore Tsirpanis and Contributors.
// SPDX-License-Identifier: MIT

#if NETCOREAPP || NETSTANDARD2_1_OR_GREATER
global using RuntimeHelpersCompat = System.Runtime.CompilerServices.RuntimeHelpers;
#else
using System.Runtime.CompilerServices;

namespace Farkle.Compatibility;

internal static class RuntimeHelpersCompat
{
public static bool TryEnsureSufficientExecutionStack()
{
try
{
RuntimeHelpers.EnsureSufficientExecutionStack();
}
catch (InsufficientExecutionStackException)
{
return false;
}
return true;
}
}
#endif
3 changes: 3 additions & 0 deletions src/Farkle/Diagnostics/Builder/BuilderLoggerExtensions.cs
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,9 @@ public static void SymbolMultipleRenames(in this BuilderLogger logger, string or
public static void RegexStringParseError(in this BuilderLogger logger, in BuilderSymbolName symbolName, object error) =>
logger.Error("FARKLE0009", LocalizedDiagnostic.Create(nameof(Resources.Builder_RegexStringParseError), symbolName, error));

public static void RegexTooComplexError(in this BuilderLogger logger, in BuilderSymbolName symbolName) =>
logger.Error("FARKLE0010", LocalizedDiagnostic.Create(nameof(Resources.Builder_RegexTooComplex), symbolName));

public static void InformationLocalized(in this BuilderLogger logger, string resourceKey)
{
if (logger.IsEnabled(DiagnosticSeverity.Information))
Expand Down
2 changes: 2 additions & 0 deletions src/Farkle/Properties/Resources.cs
Original file line number Diff line number Diff line change
Expand Up @@ -291,6 +291,8 @@ public static string GetEofString(IFormatProvider? formatProvider)

public static string Builder_RegexStringParseError => GetResourceString(nameof(Builder_RegexStringParseError));

public static string Builder_RegexTooComplex => GetResourceString(nameof(Builder_RegexTooComplex));

public static string Warning => GetResourceString(nameof(Warning));

public static string Error => GetResourceString(nameof(Error));
Expand Down
3 changes: 3 additions & 0 deletions src/Farkle/Properties/Resources.el.resx
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,9 @@
<data name="Builder_RegexStringParseError" xml:space="preserve">
<value>Αποτυχία συντακτικής ανάλυσης της κανονικής έκφρασης του {0}: {1}</value>
</data>
<data name="Builder_RegexTooComplex" xml:space="preserve">
<value>Η κανονική έκφραση του {0} είναι υπερβολικά σύνθετη</value>
</data>
<data name="Builder_BuildingStarted" xml:space="preserve">
<value>Χτίζεται η γραμματική {0}</value>
</data>
Expand Down
3 changes: 3 additions & 0 deletions src/Farkle/Properties/Resources.resx
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,9 @@
<data name="Builder_RegexStringParseError" xml:space="preserve">
<value>Failed to parse the regex of {0}: {1}</value>
</data>
<data name="Builder_RegexTooComplex" xml:space="preserve">
<value>The regex of {0} is too complex</value>
</data>
<data name="Builder_BuildingStarted" xml:space="preserve">
<value>Building grammar {0}</value>
</data>
Expand Down
27 changes: 26 additions & 1 deletion tests/Farkle.Tests/GrammarBuilderTests.fs
Original file line number Diff line number Diff line change
Expand Up @@ -246,11 +246,36 @@ let tests = testList "Grammar builder tests" [
nont.AutoWhitespace(false)
|> buildWithWarnings
Expect.hasLength warnings 1 "Building emitted the wrong number of warnings"
Expect.equal warnings.[0].Code "FARKLE0004" "The warning was not of the correct type"
Expect.equal warnings[0].Code "FARKLE0004" "The warning was not of the correct type"
Expect.equal grammar.GrammarInfo.Attributes Grammars.GrammarAttributes.Unparsable "The grammar was not marked as unparsable"
Expect.isFalse (grammar.GetSymbolFromSpecialName("__MySpecialName").HasValue) "The special name should not be present in the grammar file"
}

test "An invalid string regex causes no DFA to be built" {
let grammar, errors =
Regex.regexString "("
|> terminalU "T"
|> buildWithWarnings
Expect.isNull grammar.DfaOnChar "The DFA should not have been built"
Expect.hasLength errors 1 "Building emitted the wrong number of errors"
Expect.equal errors[0].Code "FARKLE0009" "The error was not of the correct type"
}

test "A deeply nested regex does not cause a stack overflow" {
let depth = 10_000
let regex =
Seq.replicate depth (Regex.string "a")
// acc + x would flatten the tree, but concat does not.
|> Seq.fold (fun acc x -> Regex.concat [acc; x]) (Regex.string "")
let grammar, errors =
regex
|> terminalU "T"
|> buildWithWarnings
Expect.isNull grammar.DfaOnChar "The DFA should not have been built"
Expect.hasLength errors 1 "Building emitted the wrong number of errors"
Expect.equal errors[0].Code "FARKLE0010" "The error was not of the correct type"
}

test "Many block groups can be ended by the same symbol" {
// It doesn't cause a DFA conflict because the
// end symbols of the different groups are considered equal.
Expand Down

0 comments on commit 0653016

Please sign in to comment.