semantics-parser
ARM64 Semantics Parser

This project parses the instruction pseudocode in the XML specification of the ARMv8 instructions. At a high level, a Python script reads the XML file for each instruction, extracts the operation pseudocode, and stores it in a file named after the instruction. The binary built from this project then reads that file and outputs equivalent, ROSE-compliant C++ code, which can be placed into the Dispatcher class's code.

There are two types of pseudocode in the XML specification:

* Decode: Pseudocode for extracting relevant fields from the raw instruction that will be used in the actual execution of the instruction.
* Operation: Pseudocode for the actual operation of the instruction.

We currently do not parse the decode pseudocode. This means that fields defined, declared, or initialized in the decode pseudocode and used in the operation pseudocode must be generated some other way when this project parses the operation pseudocode. Appropriate constructs are included to achieve this, and they are commented in the code. It should also be possible to extend the grammar to parse the decode pseudocode, thereby building an instruction decoder automatically from the XML specification.

The pseudocode is converted to C++ by grammar-based parsing. A Bison (Yacc) grammar parses the tokens returned by a lexer defined with Flex (Lex); both were written from scratch. The grammar closely resembles that of a conventional programming language like C, but has many rules and special cases for situations that occur in the ARM pseudocode.

Currently, only non-SIMD instructions are parsed. Even among these, some are not yet parsed, although coverage of the non-SIMD instructions is close to 93%, and most of the remaining instructions are system instructions that are not really relevant for user-mode programs. To parse SIMD instructions, this grammar would have to be extended.
Perhaps the most important thing for SIMD pseudocode is supporting loops and accessing a register as an array, but of course there could be more.

The project is a regular CMake project.

Currently, we generate C++ code within the action for each BNF rule. This is a "stateless" approach: no information about previously matched strings or generated C++ code is stored. This has problems:

* The code to be generated for a matched string may depend on something that comes later.
* Similarly, when a certain string is seen, code that was already generated might need to be modified.
* It is difficult to generate declaration statements for identifiers that haven't been seen before, since a lot of string processing is needed to insert the declaration somewhere in the middle of the output.
* The generated C++ for the same matched string can differ depending on where in the pseudocode it appears.

A first idea I had to solve some of these problems was to maintain state. This is not implemented and is left for future work. We can have a struct/class that represents a statement in the final code and contains all of the required information about the string matched by each rule. When a rule is matched, we populate this struct and store it in a global array instead of directly generating C++ code for it. After all rules are matched, we iterate over the global array and generate code, using the information about all other statements in the program to modify the current statement if necessary.
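
The stateful idea can be illustrated with a minimal sketch. It is not part of this project: it is written in Python for brevity, whereas a real version would presumably live in the C++ actions of the Bison grammar. The names `Statement`, `on_assignment`, and `generate` are hypothetical, and the emitted strings are plain C++-like placeholders rather than ROSE-compliant code. Each "rule action" only records what it matched; code is emitted in a final pass that can look at every recorded statement, for example to declare an identifier on its first assignment and to mark it `const` when nothing assigns to it later.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Statement:
    """Hypothetical record of one matched assignment."""
    target: str  # identifier being assigned
    expr: str    # already-translated right-hand side, as placeholder text


# The "global array" of recorded statements.
statements = []


def on_assignment(target, expr):
    """Stand-in for a grammar action: record the match, emit nothing yet."""
    statements.append(Statement(target, expr))


def generate(stmts, predeclared=()):
    """Final pass: emit code using knowledge of all recorded statements."""
    writes = Counter(s.target for s in stmts)  # how often each name is assigned
    declared = set(predeclared)                # e.g. fields from the decode step
    lines = []
    for s in stmts:
        if s.target in declared:
            lines.append(f"{s.target} = {s.expr};")
        else:
            # The qualifier depends on statements that come *later*.
            qualifier = "const auto" if writes[s.target] == 1 else "auto"
            lines.append(f"{qualifier} {s.target} = {s.expr};")
            declared.add(s.target)
    return lines


# Three assignments, recorded in the order a parser might match them.
on_assignment("operand1", "X[n]")
on_assignment("result", "operand1 + imm")
on_assignment("result", "result + 1")

for line in generate(statements):
    print(line)
```

Because emission is deferred, inserting a declaration or adjusting an earlier statement becomes a lookup over the recorded list rather than string surgery on already-generated text. The cost is that every rule action needs a record type rich enough to describe what it matched.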