Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for Issue 2988: Change individual to grouped parsing #2989

Closed
wants to merge 66 commits into from
Closed

Fix for Issue 2988: Change individual to grouped parsing #2989

wants to merge 66 commits into from

Conversation

kaby76
Copy link
Contributor

@kaby76 kaby76 commented Jan 4, 2023

Statement of Problem

This PR is to address #2988.

Tests in the V4 repo parse one input file per program invocation, which I call individual parsing. But, we can save a good deal of time by parsing multiple input files per program invocation, called grouped parsing. Let's see what this looks like in the output from a build.

  • For Individual parsing, one test is processed by one program call:
./bin/Debug/net6.0/Test.exe ../examples/AllInOne8.java
Time: 00:00:05.9106526
Parse succeeded.
./bin/Debug/net6.0/Test.exe ../examples/helloworld.java
Time: 00:00:00.5270543
Parse succeeded.
./bin/Debug/net6.0/Test.exe ../examples/IdentifierTest.java
Time: 00:00:00.6478677
Parse succeeded.
...
  • For Grouped parsing, all tests are processed by one program call:
./bin/Debug/net6.0/Test.exe ../examples/AllInOne8.java ../examples/helloworld.java ../examples/IdentifierTest.java ...
CSharp 0 ../examples/AllInOne8.java success 6.1742676
CSharp 1 ../examples/helloworld.java success 0.0255556
CSharp 2 ../examples/IdentifierTest.java success 0.1810686
...

With grouped parsing, the time for parsing each successive input should be shorter relative to the time the input took with individual parsing. Internally, Antlr keeps a cache of prediction context between each parse for faster parsing. Grouped parsing should help for Java, CSharp, and probably a few other targets. But, it may not help with all targets, especially PHP, which shows no speed-up with warm-up, likely a bug. See antlr/antlr-php-runtime#36.

Requirements, Design, and Implementation

To implement grouped parsing, the newest version of Trash trgen will need to be employed, as well as the templates in this repo updated. This PR changes the CI build tests, and some of the .errors and .tree files (more on this explained below).

After a detailed reading and testing of the current _scripts/test.ps1 and templates, I came to the conclusion that there was a significant disconnect in the requirements, design, and implementation between the Bash and Powershell test scripts. Therefore, we need to state some requirements for these scripts and the generated scripts for trgen.

Requirements

  • The basic requirement for the Bash or Powershell test scripts is to provide a simple, self-contained method to build and test the code in a Bash or Powershell environment.
  • The scripts should take minimal parameters so that people can focus on updating changes to a grammar rather than wasting time figuring out how to test the grammar across eight targets. (Note, to be implemented for Build is very slow, tests too much, outputs too much #2883.)
  • trgen should generate scripts test.sh and test.ps1. The Bash and Powershell scripts should mirror each other step-by-step. The scripts should also mirror each other across targets (e.g., Java and Dart.
  • There should be scripts for building, testing, and cleaning up, both Bash and Powershell.
  • The top-level scripts regtest.sh and test.ps1 (replacing tester.ps1) should be incremental (taking into account git status), and runable from any directory (including a grammar, or a collection of grammars, like asm/, antlr, or sql/). (NB: This requirement will be addressed in Build is very slow, tests too much, outputs too much #2883.)
  • With grouped parsing, the parser driver program must generate .errors and .tree files; it cannot be redirected because output for all inputs would then be grouped together. NB: Careful attention is placed on capturing all parser output. Any output to stderr that is not for trees and standard parsing errors is considered an error in the execution of the parser. For example, this can happen especially with dynamically typed and parsed languages like JavaScript, Python3, and PHP in which the base class is not correct, and the interpreter throws an error.
  • git will be used to implement diffing of the .errors and .tree files.

Discussion of requirements

It's hard to remember how to run the compiler and program of the parser program for a particular target. It's even worse trying to remember the syntax and runtimes for the targets. That's the whole point of trgen.

Unfortunately, tester.psm1 is not a stand-alone Powershell script because the extension is ".psm1", not ".ps1".

Further, the script assumed the test input was always in a directory called "examples/". trgen reads the pom.xml to get the location of the input tests.

Since grouped parsing can't use Bash/Powershell capture of the parse tree--because there would be multiple trees captured for multiple input files--the driver program must place the output in a specific output file. But, then, how does the script check diffs of these parse tree files??

In order to implement parse tree file diffs, I use git diff to find those files that are changed. It's laborious to write code to do diffs in Bash and Powershell. In fact, the Powershell "diff" strips out newlines of a multiline output file contained in a string. Let's make the code simpler by just using "git". If there's no git, or the grammar has been removed from the rest of the cloned repo, then the tests should just default to returning any errors found, even if they are expexted.

The .errors files seem to have extra newlines at the end. Extremely annoying. I don't know if they are hand-edited or the drivers that created them include extra printf's for the hell of it. In any case, I remastered a number of these files with this PR in order for the tests to pass.

From this point forward, test.sh or test.ps1 should be used to remaster all .tree and .errors files, as pondered in this comment.

Results

Although not scientific, I compared one testing of the V4 grammars with individual parsing vs grouped parsing. On an unloaded Windows 11 machine (Ryzen 7 2700, 8-cores, DDR4 16GB, SanDisk SDSSDH3 1TB, Antlr4.11.1, NET SDK 7.0.101), individual parsing of the CSharp target completed in 1h 44m, grouped parsing completed in 1h 11m. That is significantly faster.

Conclusions

This is an important PR. Grouped parsing helps a lot in the parse time for the build. But, overall, the main bottleneck in the build is the compilation of the generated code. It really goes back to #2883. The "incremental" testing isn't really working because, somehow, the information on what specifically to test is ignored. In addition, we shouldn't be testing combined grammars because they don't have target-specific action code.

@kaby76 kaby76 changed the title Fix for Issue 2988 Fix for Issue 2988: Change individual to grouped parsing Jan 5, 2023
kaby76 added 22 commits January 5, 2023 13:20
…plicate functionality, when it's not clear how best to implement this. Removing newlines indescriminately is NOT the way to do a comparison!
…ctly what is in makefile and test.sh, as they should have been to begin with!! The WHOLE POINT of providing a makefile and test.sh is so the details of building and running the generated parser driver code--which changes for every damn target--are completely hidden, and so I don't need to know how to compile and run Go vs Java vs C# vs Dart, etc. Still, all I want is to use a Powershell environment and type "builder.ps1" or "tester.ps1" and just get the thing working. Prior to this, all I could do was "cd to the root directory", type "pwsh _scripts/test.ps1" and it would test everything.
…l if you use --template-source-directory, but it's an embedded resource in trgen.
@kaby76 kaby76 marked this pull request as ready for review January 16, 2023 02:12
@kaby76
Copy link
Contributor Author

kaby76 commented Jan 16, 2023

@teverett @KvanTTT After working on this PR for two weeks, I think I finally have this PR ready for review. The new Bash and Powershell scripts should resemble each other more closely and across all Antlr targets. That should make it easier to understand and maintain.

After this, I plan to fix #2883

@kaby76 kaby76 marked this pull request as draft January 18, 2023 00:30
@kaby76 kaby76 marked this pull request as ready for review January 18, 2023 10:53
@kaby76 kaby76 marked this pull request as draft January 18, 2023 12:33
@kaby76 kaby76 marked this pull request as ready for review January 21, 2023 01:18
@kaby76 kaby76 marked this pull request as draft January 21, 2023 01:20
@kaby76 kaby76 marked this pull request as ready for review January 22, 2023 22:29
@kaby76
Copy link
Contributor Author

kaby76 commented Jan 23, 2023

Needing to refork this repo, and squash commits. Will open a new PR.

@kaby76 kaby76 closed this Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant