[for reference] all work done which is not in original repo #338
Open: GerHobbelt wants to merge 2,108 commits into zaach:master from GerHobbelt:master
This was referenced Jan 28, 2017
gitbook-legacy bot pushed a commit to GerHobbelt/jison that referenced this pull request Feb 26, 2017
… in the console.log() statements in there: console.log() adds a newline automatically, while the original C code `printf()` does not.
…ter code stripping. Adjusted stripper regexes to fix this.
… the preceding commits: `action === 0` is the error parse state and that one, when it is discovered during error **recovery** in the inner slow parse loop, is handed back to the outer loop to prevent undue code duplication. Handing back means the outer loop will have to process that state, not exit on it immediately!
…reset/cleanup the `recoveringErrorInfo` object as one may invoke `yyerrok` while still inside the error recovery phase of the parser, thus *potentially* causing trouble down the line for subsequent parse states. (This is another edge case that's hard to produce: better-safe-than-sorry coding style applies.)
…amples/Makefile. Tweak `make superclean` to ensure that we can bootstrap once you've run `make prep` by reverting the jison/dist/ directory after 'supercleaning'.
…about a piece of action code which "does not compile": lexer and parser line tracking yylloc info starts counting at line ONE(1) instead of ZERO(0) hence we do NOT need to compensate when bumping down the action code before parsing/validating it in here.
…mpare the full set of examples` output vs. a given reference. This is basically a 'system test' / 'acceptance test' **test level** that co-exists with the unit tests and integration tests in the tests/ directory: those tests are already partly leaning towards a 'system test' level and that is "polluting" the implied simplicity of unit tests...
…ch is included with every generated parser: this makes those reports easier to understand at a glance.
…ippets and other code blocks. We don't want to do them all, so there's #26
…liver a cleaner info set when custom lexers are involved AND not exhibit side effects such as modifying the provided lexer spec when it comes in native format, i.e. doesn't have to be parsed or JSON.parse()d anymore: we should strive for an overall cleaner interface behaviour, even if that makes some internals a tad more hairy.
… it should always have produced an 'expected set of tokens' in the info hash, whether you're running in an error recovery enabled grammar or a simple (non-error-recovering) grammar.
- DO NOT cleanup the old one before we start the new error info track: the old one will *linger* on the error stack and stay alive until we invoke the parser's cleanup API!
- `recoveringErrorInfo` is also part of the `__error_recovery_infos` array, hence has been destroyed already: no need to do that *twice*.
…llback set a la jison parser run-time:
- `fastLex()`: return next match that has a token. Identical to the `lex()` API but does not invoke any of the `pre_lex()` nor any of the `post_lex()` callbacks.
- `canIUse()`: return info about the lexer state that can help a parser or other lexer API user to use the most efficient means available. This API is provided to aid run-time performance for larger systems which employ this lexer.
- now executes all `pre_lex()` and `post_lex()` callbacks provided as
  + member function, i.e. `lexer.pre_lex()` and `lexer.post_lex()`
  + member of the 'shared state' `yy` as passed to the lexer via the `setInput()` API, i.e. `lexer.yy.pre_lex()` and `lexer.yy.post_lex()`
  + member of the lexer options, i.e. `lexer.options.pre_lex()` and `lexer.options.post_lex()`
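The difference between `lex()` and `fastLex()` described above can be sketched with a stand-in lexer. This is a minimal illustration, not the jison-generated lexer: `makeLexer`, `next()` and the order in which the three hook holders are visited are assumptions made for the example.

```javascript
// Hypothetical stand-in for a lexer, NOT the jison-generated one: it only
// models which hooks lex() and fastLex() invoke.
function makeLexer(tokens) {
  return {
    yy: {},
    options: {},
    queue: tokens.slice(),
    // Core matcher: returns the next token, or "EOF" when exhausted.
    next() {
      return this.queue.length ? this.queue.shift() : "EOF";
    },
    // fastLex(): same result as lex(), but no hooks are invoked.
    fastLex() {
      return this.next();
    },
    // lex(): runs every pre_lex hook found, matches, then every post_lex hook.
    // The visiting order (lexer, yy, options) is an assumption.
    lex() {
      const holders = [this, this.yy, this.options];
      for (const h of holders) {
        if (typeof h.pre_lex === "function") h.pre_lex.call(this);
      }
      const token = this.next();
      for (const h of holders) {
        if (typeof h.post_lex === "function") h.post_lex.call(this, token);
      }
      return token;
    },
  };
}

const log = [];
const lx = makeLexer(["A", "B"]);
lx.pre_lex = () => log.push("lexer.pre_lex");
lx.yy.pre_lex = () => log.push("yy.pre_lex");
lx.options.post_lex = (tok) => log.push("options.post_lex:" + tok);

const first = lx.lex();      // runs all three registered hooks
const second = lx.fastLex(); // runs none of them
console.log(first, second, log);
```

Skipping the hook lookups is what makes a `fastLex()`-style entry point cheaper when no callbacks are registered.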
…lon rule (which has no location info); add / introduce the `lexer::deriveLocationInfo()` API to help you & us to construct a more-or-less useful/sane location info object from the context surrounding it when the requested location info itself is not available.
…comparison` as it will compare more than just the generated codegen parsers' sources...
…e used to reconstruct missing/epsilon location infos. This helps fix crashes observed when reporting some errors that are triggered while parsing epsilon rules, but will also serve other purposes. The important bit here is that it helps prevent crashes inside the lexer's `prettyPrintRange()` API when no or faulty location info object(s) have been passed as parameters: more robust lexer APIs.
…ed according to the internal action+ parse kernel analysis. NOTE: the fact that the error reporting/recovery logic checks the **lexer.yylineno** lexer attribute does not count as that code won't need / touch the internal `yylineno` variable in any way.
# Conflicts: # lib/jison-parser-kernel.js
… is obsoleted anyway.
…dn't work as the `parseError` would not propagate into the parser kernel due to the way `shallow_copy_noclobber` worked. This is quite hairy as we depend on its behaviour of NOT overwriting members so that we can use it for yylloc propagation code inside the kernel. With this fix, that functionality should remain unchanged while now anything set in `parser.yy` should make it into the parser kernel *properly* once again.
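To make the "NOT overwriting members" behaviour concrete, here is a minimal sketch of a no-clobber shallow copy. It is a hypothetical re-implementation for illustration only: the real `shallow_copy_noclobber` lives in the parser kernel and may use a different test for "already set"; the function name `shallowCopyNoClobber` and the `=== undefined` rule are assumptions.

```javascript
// Hypothetical re-implementation for illustration; the real helper in the
// jison parser kernel may differ in detail.
function shallowCopyNoClobber(dst, src) {
  for (const key of Object.keys(src)) {
    // Only fill in members the destination does not define yet
    // (assumption: "not defined" means strictly undefined).
    if (dst[key] === undefined) {
      dst[key] = src[key];
    }
  }
  return dst;
}

const defaults = { first_line: 1, last_line: 1, parseError: () => "default" };
const target = { last_line: 7 };
shallowCopyNoClobber(target, defaults);

// target.last_line keeps its own value; the other members are filled in.
console.log(target.first_line, target.last_line, typeof target.parseError);
```

The same property that protects existing `yylloc` members would also keep a pre-existing member from being replaced, which is consistent with the propagation problem the commit message describes.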
…ernel into the main source file (see previous commit)
…de a more robust lexer interface:

// 1) make sure any outside interference is detected ASAP:
//    these attributes are to be treated as 'const' values
//    once the lexer has produced them with the token (return value \`r\`).
// 2) make sure any subsequent \`lex()\` API invocation CANNOT
//    edit the \`yytext\`, etc. token attributes for the *current*
//    token, i.e. provide a degree of 'closure safety' so that
//    code like this:
//
//        t1 = lexer.lex();
//        v = lexer.yytext;
//        l = lexer.yylloc;
//        t2 = lexer.lex();
//        assert(lexer.yytext !== v);
//        assert(lexer.yylloc !== l);
//
//    succeeds. Older (pre-v0.6.5) jison versions did not *guarantee*
//    these conditions.
this.yytext = Object.freeze(this.yytext);
this.matches = Object.freeze(this.matches);
this.yylloc.range = Object.freeze(this.yylloc.range);
this.yylloc = Object.freeze(this.yylloc);
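The 'closure safety' guarantee can be tried out with a toy lexer. The sketch below is not jison's generated lexer: `TinyLexer` and its word-splitting rule are assumptions, and only the idea of allocating fresh token attributes and freezing them with `Object.freeze` is taken from the commit message.

```javascript
// Hypothetical minimal lexer: NOT jison's generated lexer, only the
// freeze-per-token idea from the commit message.
class TinyLexer {
  constructor(input) {
    this.input = input;
    this.pos = 0;
    this.yytext = null;
    this.yylloc = null;
  }

  // Returns "WORD" for the next whitespace-separated word, or null at EOF.
  lex() {
    const re = /\S+/g;
    re.lastIndex = this.pos;
    const m = re.exec(this.input);
    if (!m) return null;
    this.pos = re.lastIndex;
    // Allocate FRESH objects for every token, then freeze them, so that
    // references held by the caller cannot be edited by a later lex().
    const range = Object.freeze([m.index, re.lastIndex]);
    this.yytext = m[0];
    this.yylloc = Object.freeze({ range });
    return "WORD";
  }
}

const lexer = new TinyLexer("foo bar");
const t1 = lexer.lex();
const v = lexer.yytext;
const l = lexer.yylloc;
const t2 = lexer.lex();

// The values captured for the first token are untouched by the second lex().
console.log(v, l.range, lexer.yytext, lexer.yylloc.range);
console.log(lexer.yytext !== v, lexer.yylloc !== l, Object.isFrozen(l));
```

Freezing also surfaces outside interference early: in strict mode, an attempt to assign to a frozen `yylloc` throws a `TypeError` instead of silently corrupting the token.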
# Conflicts: # lib/jison.js # package-lock.json # package.json # packages/jison-lex/regexp-lexer.js # packages/jison2json/tests/tests.js
# Conflicts: # README.md # lib/cli.js # package.json
…'re going to take a different route towards parsing jison action code as the current approach is a maintenance nightmare. recast is again playing up and I'm getting sick of it all and that never was the goal of this.
…is pair to cooperate.
added js-sequence-diagrams to demo projects list
…code to (temporarily) turn the jison generated source code into 'regular javascript' so we can pull it through standard babel or similar tools. (The previous attempt was to enhance the babel tokenizer and have the jison identifiers processed that way, but given the structure of babel, it meant tracking a slew of large packages, which turned out way too costly. So we revert to this 'Unicode hack' which employs the JavaScript specification about which Unicode characters are *legal in a JavaScript identifier*.)

TODO: Should write a blog/article about this. Here are the comments from the horse's mouth:

---

Determine which Unicode NonAsciiIdentifierStart characters are unused in the given sourcecode and provide a mapping array from given (JISON) start/end identifier character-sequences to these.

The purpose of this routine is to deliver a reversible transform from JISON to plain JavaScript for any action code chunks.

This is the basic building block which helps us convert jison variables such as `$id`, `$3`, `$-1` ('negative index' reference), `@id`, `#id`, `#TOK#` to variable names which can be parsed by a regular JavaScript parser such as esprima or babylon.

```
function generateMapper4JisonGrammarIdentifiers(input) { ... }
```

IMPORTANT: we only want the single char Unicodes in here so we can do this transformation at 'Char'-word rather than 'Code'-codepoint level.

```
const IdentifierStart = unicode4IdStart.filter((e) => e.codePointAt(0) < 0xFFFF);
```

As we will be 'encoding' the Jison Special characters @ and # into the IDStart Unicode range to make JavaScript parsers *not* barf a hairball on Jison action code chunks, we must consider a few things while doing that:

We CAN use an escape system where we replace a single character with multiple characters, as JavaScript DOES NOT discern between single characters and multi-character strings: anything between quotes is a string and there's no such thing as C/C++/C#'s `'c'` vs `"c"` which is *character* 'c' vs *string* 'c'.

As we can safely escape characters, all we need to do is find a character (or set of characters) which are in the ID_Start range and are expected to be used rarely while clearly identifiable by humans for ease of debugging of the escaped intermediate values.

The escape scheme is simple and borrowed from ancient serial communication protocols and the JavaScript string spec alike:

- assume the escape character is A
- then if the original input stream includes an A, we output AA
- if the original input includes a character #, which must be escaped, it is encoded/output as A

This is the same as the way the backslash escape in JavaScript strings works and has a minor issue: sequences of AAA with an odd number of A's CAN occur in the output, which might be a little hard to read. Those are, however, easily machine-decodable and that's what's most important here.

To help with that AAA... issue AND because we need to escape multiple Jison markers, we choose a slightly tweaked approach: we are going to use a set of 2-char wide escape codes, where the first character is fixed and the second character is chosen such that the escape code DOES NOT occur in the original input -- unless someone would have intentionally fed nasty input to the encoder as we will pick the 2 characters in the escape from 2 utterly different *human languages*:

- the first character is ဩ which is highly visible and allows us to quickly search through a source to see if and where there are *any* Jison escapes.
- the second character is taken from the Unicode CANADIAN SYLLABICS range (0x1400-0x1670) as far as those are part of ID_Start (0x1401-0x166C or there-abouts) and, unless an attack is attempted at jison, we can be pretty sure that this 2-character sequence won't ever occur in real life: even when one writes such an escape in the comments to document this system, e.g. 'ဩᐅ', then there's still plenty alternatives for the second character left.
- the second character represents the escape type: $-n, $#, #n, @n, #ID#, etc. and each type will pick a different base shape from that CANADIAN SYLLABICS charset.
- note that the trailing '#' in Jison's '#TOKEN#' escape will be escaped as a different code to signal '#' as a token terminator there.
- meanwhile, only the initial character in the escape needs to be escaped if encountered in the original text: ဩ -> ဩဩ as the 2nd and 3rd character are only there to *augment* the escape. Any CANADIAN SYLLABICS in the original input don't need escaping, as these only have special meaning when prefixed with ဩ
- if the ဩ character is used often in the text, the alternative ℹ இ ண ஐ Ϟ ല ઊ characters MAY be considered for the initial escape code, hence we start with analyzing the entire source input to see which escapes we'll come up with this time.

The basic shapes are:

- 1401-141B: ᐁ 1
- 142F-1448: ᐯ 2
- 144C-1465: ᑌ 3
- 146B-1482: ᑫ 4
- 1489-14A0: ᒉ 5
- 14A3-14BA: ᒣ 6
- 14C0-14CF: ᓀ
- 14D3-14E9: ᓓ 7
- 14ED-1504: ᓭ 8
- 1510-1524: ᔐ 9
- 1526-153D: ᔦ
- 1542-154F: ᕂ
- 1553-155C: ᕓ
- 155E-1569: ᕞ
- 15B8-15C3: ᖸ
- 15DC-15ED: ᗜ 10
- 15F5-1600: ᗵ
- 1614-1621: ᘔ
- 1622-162D: ᘢ

## JISON identifier formats ##

- direct symbol references, e.g. `#NUMBER#` when there's a `%token NUMBER` for your grammar. These represent the token ID number.
  -> (1+2) start-# + end-#
- alias/token value references, e.g. `$token`, `$2`
  -> $ is an accepted starter, so no encoding required
- alias/token location reference, e.g. `@token`, `@2`
  -> (6) single-@
- alias/token id numbers, e.g. `#token`, `#2`
  -> (3) single-#
- alias/token stack indexes, e.g. `##token`, `##2`
  -> (4) double-#
- result value reference `$$`
  -> $ is an accepted starter, so no encoding required
- result location reference `@$`
  -> (6) single-@
- rule id number `#$`
  -> (3) single-#
- result stack index `##$`
  -> (4) double-#
- 'negative index' value references, e.g. `$-2`
  -> (8) single-negative-$
- 'negative index' location reference, e.g. `@-2`
  -> (7) single-negative-@
- 'negative index' stack indexes, e.g. `##-2`
  -> (5) double-negative-#
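A reduced version of this reversible transform can be sketched as follows. It is an illustration under stated assumptions, not the code behind `generateMapper4JisonGrammarIdentifiers`: it handles only the single-@ (6) and single-# (3) forms, always uses the base shape character of each range instead of scanning the input for unused characters, and the names `encode`, `decode`, `ESC`, `AT` and `HASH` are hypothetical.

```javascript
// Simplified sketch: only `@x` and `#x` references are handled; `#TOK#`,
// `##x` and the negative-index forms are left out on purpose.
const ESC = "ဩ";  // fixed first character of every escape
const AT = "ᒣ";   // shape (6): single-@   (assumption: base char of the range)
const HASH = "ᑌ"; // shape (3): single-#   (assumption: base char of the range)

function encode(src) {
  return src
    .split(ESC).join(ESC + ESC)               // a literal ဩ becomes ဩဩ
    .replace(/@(?=[A-Za-z_$\d])/g, ESC + AT)   // @id, @2, @$
    .replace(/#(?=[A-Za-z_$\d])/g, ESC + HASH); // #id, #2, #$
}

function decode(encoded) {
  const back = { [ESC]: ESC, [AT]: "@", [HASH]: "#" };
  // Every escape is exactly two characters long and starts with ESC, so one
  // left-to-right pass undoes the encoding.
  const re = new RegExp(ESC + "([" + ESC + AT + HASH + "])", "g");
  return encoded.replace(re, (_, kind) => back[kind]);
}

const sample = "$$ = @1.first_line + #NUM;";
const encoded = encode(sample);
const roundTrip = decode(encoded);

// The encoded text no longer contains @ or # and decodes back unchanged.
console.log(encoded);
console.log(roundTrip === sample);
```

Because both ဩ and the Canadian Syllabics characters are legal identifier characters, a reference such as `@1` turns into an ordinary identifier that a standard JavaScript parser accepts, and `decode` restores the jison notation afterwards.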
# Conflicts: # ports/csharp/Jison/Jison/csharp.js # ports/php/php.js # ports/php/template.php
…now have an augmented API.
…a second argument (`options`): cleaning up calling code which assumed as much.
Since a lot has been done and several of these features are tough to 'extract cleanly' to produce 'simple' patches (they won't be simple anyway), here is the list of differences (features and fixes in the derived repo):

[to be completed]

Main features

- full Unicode support (okay, astral codepoints are hairy and only partly supported) in lexer and parser
  - lexer can handle XRegExp `\pXXX` unicode regex atoms, e.g. `\p{Alphabetic}`
    - jison auto-expands and re-combines these when used inside regex set expressions in macros, reducing each set to its equivalent without redundant entries; hence you don't need to worry that your regexes will include duplicate characters in regex `[...]` set expressions.
  - parser rule names can be Unicode identifiers (you're not limited to US ASCII there).
- lexer macros can be used inside regex set expressions (in other macros and/or lexer rules); the lexer will barf a hairball (i.e. throw an informative error) when the macro cannot be expanded to represent a character set without causing counter-intuitive results.
- the parser generator produces optimized parse kernels: any feature you do not use in your grammar (e.g. `error` rule driven error recovery or `@elem` location info tracking) is rigorously stripped from the generated parser kernel, producing the fastest possible parser engine.
- you can define a custom written lexer in the grammar definition file's `%lex ... /lex` section in case you find the standard lexer is too slow to your liking or otherwise insufficient. (This is done by specifying a no-rules lexer with the custom lexer placed in the lexer trailing action code block.)
- you can `%include` action code chunks from external files, in case you find that the action code blurbs obscure the grammar's / lexer's definition. Use this when you have complicated/extensive action code for rules or a large amount of 'trailing code' ~ code following the `%%` end-of-ruleset marker.
- CLI: `-c 2` -- you now have the choice between two different table compression algorithms.

Minor 'Selling Points'

- you can produce parsers which do not include a `try ... catch` wrapper for that last bit of speed and/or when you want to handle errors in surrounding userland code.
- all errors are thrown using a parser and lexer-specific `Error`-derived class which allows userland code to discern which type of error (and thus: available extra error information!) is being processed via a simple/fast `instanceof` check for either of them.
- the jison CLI tool will print additional error information when a grammar parse error occurred (derived off / closely related to Add detail grammar error output #321 and Stop silencing useful information. #258)
- the jison CLI tool will print parse table statistics when requested (`-I` commandline switch) so you can quickly see how much table space your grammar is consuming. Handy when you are optimizing your grammar to reduce the number of states per parse for performance reasons.
- includes [a derivative or close relative of] Added support for ES2015 module generation #326, Pass moduleMain option to generator #316, Support `%options ranges` in the grammar. #302, Fix moduleName (and other options?) being lost #290, #282, #283 - support specifying moduleName on CLI, support generate names... #284
- fixes
  - … `this.yy.parser` missing errors)
  - … `this.yy.lexer` missing errors)
  - … `performAction` invocation trouble)
  - … `Error`-derived instances with a text message and extra info attached),
  - … `yyvstack`, `yysstack`, etc. -- documented in the documented grammar file's top API documenting comment chunk),
  - … `parseError` can now produce a return value for the parser to return to the calling userland code),
  - … `%include filepath` statements instead of any code chunk),
  - … `instanceof` of parser and lexer error class),

Where is this thing heading?

- … `recast` et al to help analyze rule action code to help code-strip both parser and lexer to produce fast parse/lex runs. Currently only the parser gets analyzed (a tad roughly) to strip costly operations from the parser run-time to make it fast / efficient.
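The `instanceof` check on parser- and lexer-specific `Error`-derived classes, listed under the minor selling points, could be used from userland code roughly like this. The class names `ParserError` and `LexerError`, the `hash` property and its fields are placeholders for illustration; the generated parser and lexer expose their own classes and info object.

```javascript
// Hypothetical class names and fields, for illustration only.
class ParserError extends Error {
  constructor(message, hash) {
    super(message);
    this.name = "ParserError";
    this.hash = hash; // extra error information attached by the thrower
  }
}

class LexerError extends Error {
  constructor(message, hash) {
    super(message);
    this.name = "LexerError";
    this.hash = hash;
  }
}

// Userland handler: pick the error type with a cheap instanceof test and
// use only the extra info that this type is known to carry.
function describe(err) {
  if (err instanceof ParserError) {
    return "parse error, expected: " + err.hash.expected.join(", ");
  }
  if (err instanceof LexerError) {
    return "lex error at line " + err.hash.line;
  }
  return "other error: " + err.message;
}

const reports = [
  new ParserError("unexpected token", { expected: ["NUMBER", "("] }),
  new LexerError("unrecognized text", { line: 3 }),
  new Error("disk full"),
].map(describe);

console.log(reports);
```

Both classes still derive from `Error`, so existing `catch` blocks that only look at `err.message` keep working.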