Intermediate representation #358

damip · 2024-09-10T20:05:01Z


The goal of this proposal is to create an Intermediate Representation (IR) in order to move some optimizations away from the AST and towards the IR itself.

AirScript code:

fn madd3(a: felt[3], b: felt) -> felt {
    let d = [c * b for c in a]
    return sum(d)
}
    
[...]

    enf s^2 = s
    enf match {
        case s: c' = a + b
        case !s: c' = a * c
    }
    
    enf x^2 = x for x in d
    
    let y = sum([k * c for (k, c) in (k, c[1..4])])
    
    let z = madd3(c[1..4], 5)
    
[...]

IR:


block b0(enable) {
    // enf s^2 = s
    x0 = read(col_c, offset=0)  // s
    x1 = x0 * x0  // s^2
    x2 = x1 - x0  // s^2 - s
    x3 = x2 * enable // (s^2 - s) * enable
    enf x3
    
    // case s
    x4 = enable * x0
    b1(enable = x4)
    
    // case !s
    x5 = const(1)
    x6 = x5 - x0
    x7 = enable * x6
    b2(enable = x7)
    
    // enf x^2 = x for x in d
    x8 = const(1)
    b3(enable = x8, a0 = d0)
    b3(enable = x8, a0 = d1)
    b3(enable = x8, a0 = d2)
    b3(enable = x8, a0 = d3)
    
    // let y = sum([k * c for (k, c) in (k, c[1..4])])
    x9 = b4(x8, k0, c1)
    x10 = b4(x8, k1, c2)
    x11 = b4(x8, k2, c3)
    x12 = x9 + x10
    x13 = x12 + x11  // x13 is y
    
    // let z = madd3(c[1..4], 5)
    x14 = const(5)
    x15 = b5(x8, c1, c2, c3, x14)  // x15 is z
}



block b1(enable) {
    x0 = read(col_c, offset=1)
    x1 = read(col_a, offset=0)
    x2 = read(col_b, offset=0)
    x3 = x1 + x2
    x4 = x0 - x3
    x5 = x4 * enable
    enf x5
}


block b2(enable) {
    x0 = read(col_c, offset=1)
    x1 = read(col_a, offset=0)
    x2 = read(col_c, offset=0)
    x3 = x1 * x2
    x4 = x0 - x3
    x5 = x4 * enable
    enf x5
}

block b3(enable, a0) {
    x0 = a0 * a0
    x1 = x0 - a0
    x2 = x1 * enable
    enf x2
}

block b4(enable, a0, a1) -> x0 {
    x0 = a0 * a1
}

block b5(enable, a0, a1, a2, a3) -> x4 {
    // a0..a2 are the three elements of a, a3 is b
    x0 = b6(enable, a0, a3)
    x1 = b6(enable, a1, a3)
    x2 = b6(enable, a2, a3)
    x3 = x0 + x1
    x4 = x3 + x2
}

block b6(enable, a0, a1) -> x0 {
    x0 = a0 * a1
}

Note that all expressions in the IR are in SSA form: each value name is assigned exactly once within a block.
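As a minimal sketch of what the SSA property means for the textual IR above, the following Python checks that every value name in a block is assigned once and only used after it is defined. The `(dest, op, args)` tuple encoding and the `check_ssa` helper are assumptions made for illustration; they are not part of the proposal.

```python
# Ops whose arguments are immediates (column names, offsets, literals)
# rather than SSA values. This split is an assumption of the sketch.
IMMEDIATE_OPS = {"read", "const"}


def check_ssa(params, instructions):
    """Return a list of SSA violations for one block.

    `params` are the block argument names. Each instruction is a
    hypothetical (dest, op, args) tuple, with dest=None for `enf`.
    """
    defined = set(params)
    errors = []
    for dest, op, args in instructions:
        if op not in IMMEDIATE_OPS:
            for arg in args:
                if arg not in defined:
                    errors.append(f"{arg} used before definition in {op}")
        if dest is not None:
            if dest in defined:
                errors.append(f"{dest} assigned more than once")
            defined.add(dest)
    return errors


# enf s^2 = s, guarded by the block's enable argument
block = [
    ("x0", "read", ["col_c", 0]),
    ("x1", "mul", ["x0", "x0"]),
    ("x2", "sub", ["x1", "x0"]),
    ("x3", "mul", ["x2", "enable"]),
    (None, "enf", ["x3"]),
]
ok = check_ssa(["enable"], block)
# Re-assigning x3 breaks single assignment and is reported.
bad = check_ssa(["enable"], block + [("x3", "const", [1])])
```

A check like this could run as part of IR validation, so that a pass which accidentally reuses a name is caught early.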

Optimizations (IR -> IR):

inlining:
    blocks => replace block calls by their contents
    note: inlined blocks can be hints for subexpression detection (since blocks are meant to be repeated subexpressions)
    note: after inlining, thanks to SSA, the format of the IR matches the format of the AIR

constant propagation (moved away from the AST to the IR):
    const(a) + const(b) => drop instruction, replace by const(a+b)
    const(a) - const(b) => drop instruction, replace by const(a-b)
    const(a) * const(b) => drop instruction, replace by const(a*b)
    XXX + const(0) or const(0) + XXX => drop instruction, replace by XXX
    XXX * const(0) or const(0) * XXX => drop instruction, replace by const(0)
    XXX * const(1) or const(1) * XXX  => drop instruction, replace by XXX
    enf const(0) => eliminate the constraint
    enf const(x) where x != 0 => error: constraints are always unsatisfiable
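The rewrite rules above could be sketched as a single forward pass over SSA instructions. This is only an illustration under assumptions: instructions are hypothetical `(dest, op, args)` tuples, only `const`, `add`, `sub`, `mul` and `enf` appear, plain Python integers stand in for field elements (no modular reduction), and dead `const` instructions are left for a later cleanup pass.

```python
def fold_constants(instructions):
    """Apply the constant-propagation rules to (dest, op, args) tuples."""
    consts = {}  # value name -> known constant
    alias = {}   # dropped value name -> the value that replaces it
    out = []
    for dest, op, args in instructions:
        if op == "const":
            consts[dest] = args[0]
            out.append((dest, op, args))
            continue
        # Rewrite operands that were replaced by an earlier rule.
        args = [alias.get(a, a) for a in args]
        if op == "enf":
            value = consts.get(args[0])
            if value == 0:
                continue  # enf const(0): always satisfied, eliminate
            if value is not None:
                raise ValueError(f"enf const({value}) is unsatisfiable")
            out.append((dest, op, args))
            continue
        a, b = args
        ca, cb = consts.get(a), consts.get(b)
        if ca is not None and cb is not None:
            value = {"add": ca + cb, "sub": ca - cb, "mul": ca * cb}[op]
            consts[dest] = value
            out.append((dest, "const", [value]))
        elif op == "mul" and (ca == 0 or cb == 0):
            consts[dest] = 0
            out.append((dest, "const", [0]))
        elif (op == "add" and ca == 0) or (op == "mul" and ca == 1):
            alias[dest] = b  # const(0) + XXX or const(1) * XXX
        elif (op == "add" and cb == 0) or (op == "mul" and cb == 1):
            alias[dest] = a  # XXX + const(0) or XXX * const(1)
        else:
            out.append((dest, op, args))
    return out


program = [
    ("x0", "const", [1]),
    ("x1", "const", [0]),
    ("x2", "mul", ["a", "x0"]),   # a * 1 -> a
    ("x3", "add", ["x2", "x1"]),  # a + 0 -> a
    ("x4", "sub", ["x3", "b"]),   # becomes a - b
    (None, "enf", ["x4"]),
]
folded = fold_constants(program)
```

Because the input is in SSA form, a replaced name can simply be aliased to its replacement without worrying about later reassignment.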

bitwalker · 2024-09-16T01:55:37Z

Collaborator

Thanks @damip!

Overall, I think this conveys the general idea of the IR, but we should try to pin down a specification of sorts, or at least describe the following elements in detail:

The set of instructions and their semantics, i.e. what their arguments and results are, and what their canonicalization and validation rules should be.
How types are represented and propagated. We'll certainly need a textual representation for debugging with the IR, and what you've already sketched out is mostly good enough, but it's missing some information. In the examples you've provided, types aren't shown, though I'm assuming the intent is for the IR to be typed, since resolving types is something we need to do during semantic analysis, prior to lowering into the IR. We could do the actual type checking on the IR, but I don't think I'd make types optional in the IR, as it will make working with type information in all IR-related code more painful. Doing all of the type inference/resolution in the AST simplifies things in the IR considerably, since we can safely assume that we know the types of all expressions/values in the IR, and simply need to validate that we don't violate any type constraints.

Aside from elaborating on those items, I have a few questions/notes on the structure of the IR:

Where does enable come from in your example? How it's used makes sense, but it isn't clear to me how b0 ends up with it as a block argument. Presumably it is derived from some known information.
It probably makes sense to have two "tiers" or dialects of instructions - same instruction set, but not necessarily all used simultaneously. The first is a bit higher level, so that it is easier to do certain early phase analyses and optimizations, then later, towards the end of the pipeline, you rewrite all of the high-level ops using low-level equivalents. For example, lowering enf s^2 = s directly to v0 = s; v1 = mul s, s; v2 = sub v1, s; v3 = mul v2, enable; enf v3 loses a lot of information about where those expressions originated, and makes it more difficult to analyze/optimize them. If instead you lowered to v0 = s; v1 = mul s, s; enf.eq v1, v0, you preserve more information for analysis and optimization, and only at the end would you then lower enf.eq v1, v0 to the more primitive enf.assert instruction (I've made up the enf.eq and enf.assert instructions here, but hopefully the semantics are clear).
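To illustrate the two-tier idea, here is a rough sketch of the late lowering step that rewrites the high-level `enf.eq` into primitive ops ending in `enf.assert`. Both instruction names are the made-up ones from the note above; the `(dest, op, args)` tuples, the `t<N>` temporaries and the `lower_enf_eq` helper are assumptions of this sketch.

```python
import itertools


def lower_enf_eq(instructions, enable):
    """Rewrite each `enf.eq lhs, rhs` into sub / mul / enf.assert."""
    fresh = (f"t{i}" for i in itertools.count())
    out = []
    for dest, op, args in instructions:
        if op != "enf.eq":
            out.append((dest, op, args))
            continue
        lhs, rhs = args
        diff, gated = next(fresh), next(fresh)
        out.append((diff, "sub", [lhs, rhs]))       # lhs - rhs
        out.append((gated, "mul", [diff, enable]))  # (lhs - rhs) * enable
        out.append((None, "enf.assert", [gated]))   # assert the result is zero
    return out


# enf s^2 = s, kept high-level until the end of the pipeline
high_level = [
    ("v0", "read", ["s"]),
    ("v1", "mul", ["v0", "v0"]),
    (None, "enf.eq", ["v1", "v0"]),
]
low_level = lower_enf_eq(high_level, enable="enable")
```

Passes that run before this step still see the equality as a single op, which is the information the note above is concerned about losing.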

Blocks and Control Flow

The notion of a block here is a bit unusual. Typically, a block in a compiler IR refers to the concept of a basic block, i.e. a sequence of primitive instructions free from control flow. The only time you end up with multiple blocks in the IR is in the presence of conditional branches, i.e. code which should only be executed when a given condition holds, and the branch is taken.

This is in contrast to something like z = select cond, x, y, where the expressions that produce x and y are always executed, but the value bound to z is conditional on the result of some other boolean expression (i.e. the result of one of the two expressions might never be used, but always computing it is cheaper than the cost of a conditional branch). The general idea behind keeping blocks "basic" like this, is that it simplifies some aspects of visiting, analyzing, and optimizing both within and across blocks.

While AirScript doesn't really support control flow in the normal sense, it does have some forms of it, both implicit and explicit - notably comprehensions and conditional constraints both express some limited forms of control. What you've described as "block calls" is really an unstructured control flow primitive called an "unconditional branch". Unstructured control flow is very general, and very powerful, but difficult to analyze and optimize as a result of its lack of constraints - this tradeoff is usually worth it for general-purpose programming languages though.

Proposal: A Dataflow-Centric Alternative

In our case though, I think we might benefit from a different style of IR that is more suited to representing limited forms of control, i.e. structured control flow. As part of that, I think it would also be good to make the dataflow graph the focus, rather than the control flow graph (typically, compiler IRs, especially SSA IRs, tend to be CFG-oriented, and use analysis to reify a DFG). There is in fact a formalism for the type of IR I'm describing, called the Regionalized Value-State Dependence Graph (RVSDG), in case you are interested in digging deeper into that. You can also look at MLIR, which has a design that takes a lot of inspiration from RVSDG. I'm not necessarily saying we implement it 1:1, but rather that we use some of its concepts/primitives for the AirScript IR.

The RVSDG can be described as a hierarchical dataflow graph, where regions and various kinds of operations form the hierarchy. For our purposes, we're primarily interested in two kinds of operations: "simple", i.e. primitive operations like arithmetic, loads and stores, etc.; and "structural", operations which introduce new nested regions, e.g. if/then/else expressions, for or while loops. There are additional kinds defined in RVSDG, notably functions and global variables/state, but you can get a grasp of how things work with just the first two.

Dataflow in RVSDG is expressed via operations, in the form of:

Inputs, i.e. IR entities which are inputs to the operation
Outputs, i.e. IR entities which are output by the operation
Operands, i.e. the values passed as arguments to the operation
Results, i.e. the values produced by the operation

Inputs to an operation are always the output of some other operation, thus RVSDG is, by construction, always in SSA form.

In many cases, there is a 1:1 correlation between inputs/operands and outputs/results. An example of when this isn't the case can be observed with the call op (or whatever we want to name the instruction that represents function calls). The call operation would take all of the function arguments as operands and inputs, but would also take the callee function itself as an input. The intuition behind this is pretty obvious - you have to have a function in order to call it. For the most part though, inputs are operands, and outputs are results.

"Structured" operations also have one or more regions, the semantics of which depend on the specific operation. Regions consist of blocks, which are your typical basic block. A block contains one or more operations. A region always has at least one block, called the entry block, whose parameter list corresponds to the operands for its containing region. For example, a Function could be described as an operation consisting of a single region (the function body), whose entry block parameter list corresponds to the function arguments, thus materializing those arguments as values for use as instruction operands. The ordering of operations within a block is more for convenience than to impose strict execution order - the scheduling of operations is dictated primarily based on dataflow.

So to make this more concrete, I'm imagining the following IR entities and general structure:

A Module contains one or more Function
A Function contains a single Region
A Region contains at least one Block (the entry block), but can contain many blocks
A Block contains at least one Operation (the terminator), but can contain many non-terminator ops preceding the terminator
An Operation may have zero or more Regions, depending on whether it is primitive or structured. It defines zero or more Values (results), and expects zero or more Operands.
A Value should be typed, and maintain a list of uses
An Operand should consist of a reference to the Value it uses, an intrusive link for the use list of that Value, a reference to the Operation it is an operand of, and the index of the operand in the operand list of that Operation

The way in which value definitions and their uses are tracked can vary, but one of the advantages of the structure I described above, is that it makes traversing and manipulating the use-def graph of a function very simple. Implementing it is a bit tricky however.
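A minimal sketch of the Value/Operand/Operation part of that structure, to show why use lists make rewrites cheap. It is written in Python for brevity and makes several assumptions: plain lists replace the intrusive links, types are bare strings, and the method names (`add_operand`, `replace_all_uses_with`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False, repr=False)
class Value:
    name: str
    ty: str
    uses: list = field(default_factory=list)  # Operands that read this value

    def replace_all_uses_with(self, other):
        # Copy the list: Operand.set mutates self.uses while we iterate.
        for operand in list(self.uses):
            operand.set(other)


@dataclass(eq=False, repr=False)
class Operand:
    value: Value
    owner: "Operation"
    index: int  # position in the owner's operand list

    def set(self, new_value):
        self.value.uses.remove(self)
        self.value = new_value
        new_value.uses.append(self)


@dataclass(eq=False, repr=False)
class Operation:
    name: str
    operands: list = field(default_factory=list)
    results: list = field(default_factory=list)
    regions: list = field(default_factory=list)  # non-empty for structured ops

    def add_operand(self, value):
        operand = Operand(value, self, len(self.operands))
        self.operands.append(operand)
        value.uses.append(operand)


@dataclass(eq=False, repr=False)
class Block:
    params: list = field(default_factory=list)
    ops: list = field(default_factory=list)  # the last op is the terminator


@dataclass(eq=False, repr=False)
class Region:
    blocks: list = field(default_factory=list)  # blocks[0] is the entry block


# v1 = mul s, s; then replace every use of s with another value
s = Value("s", "felt")
one = Value("one", "felt")
mul = Operation("mul")
mul.add_operand(s)
mul.add_operand(s)
mul.results.append(Value("v1", "felt"))
s.replace_all_uses_with(one)
```

With this layout, replacing a value touches only its own use list instead of scanning every operation in the function. `eq=False` keeps comparisons identity-based, which matters because the objects reference each other cyclically.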

So assuming we have such a basic structure, and assuming all of the primitive ops are pretty self-explanatory, I imagine AirScript IR would have the following "structured" ops:

for, which takes an iterable as an operand, either produces no results, or produces a collection, and has a single-region/single-block body, which is executed for every element of the input iterable. This would be used for comprehensions and comprehension constraints. Expects the for body to be terminated with either yield (for regular comprehensions) or enf (for comprehension constraints).
reduce, which takes an iterable and an accumulator value as operands, and produces a new value with lower dimensionality than the input (i.e. matrix -> vector, vector -> scalar, etc.). Like for, it has a single-region/single-block body, which is responsible for computing the value for the current iteration, as well as the new accumulator value. The reduce body must be terminated with a yield op, returning the new accumulator value.
switch, which takes a scalar integral value as input, a set of cases each composed of a constant value and a block to be executed if the input matches that constant, and an optional default case for when none of the cases match and the cases were non-exhaustive. The first case that matches is the one that gets executed (although we could modify this to allow executing all matching cases). This primitive would be used when lowering enf match.
if, which takes a boolean operand as input, and executes one of two regions, depending on the value of the input. Both regions must be single-block regions.

This would provide a straightforward lowering for the current AirScript syntactic constructs like list comprehensions, comprehension constraints, conditional constraints, etc.
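To make the intended semantics of these ops a bit more tangible, here is a tiny evaluator-style sketch. It models op bodies as Python callables and uses plain integers in place of field elements; the `run_for`, `run_reduce` and `run_switch` names are hypothetical and nothing here is meant as the actual IR encoding.

```python
def run_for(iterable, body):
    """`for` with a yielding body: produces one result per element."""
    return [body(item) for item in iterable]


def run_reduce(iterable, init, body):
    """`reduce`: the body yields the new accumulator for each element."""
    acc = init
    for item in iterable:
        acc = body(acc, item)
    return acc


def run_switch(selector, cases, default=None):
    """`switch`: run the first case whose constant matches the selector."""
    for constant, body in cases:
        if selector == constant:
            return body()
    return default() if default is not None else None


# let y = sum([k * c for (k, c) in (k, c[1..4])])
k = [2, 3, 4]
c = [10, 20, 30, 40, 50]
products = run_for(zip(k, c[1:4]), lambda pair: pair[0] * pair[1])
y = run_reduce(products, 0, lambda acc, item: acc + item)
```

The comprehension maps onto `for` and the `sum` onto `reduce`, so the source-level shape survives the lowering.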

A couple of other key benefits:

This structure puts the emphasis on dataflow, rather than control flow. The structured ops, despite feeling like control flow, are less about control flow than they are about making conditional dataflow explicit. This makes analyses and optimizations smarter, and provides more information to the backend.
Regions provide a boundary for code motion, e.g. lifting an expression out of a comprehension body into the containing region of the comprehension would cross a region boundary, which should be done explicitly, when it is known that the expression will always be executed; whereas de-duplicating expressions which are equivalent within a region is also a form of code motion, but does not cross any such boundary, so can be done freely. This ensures maximum reuse of computations, without going too far in that direction and computing things that end up never being used.
Easier to reason about. Since some high-level structure is retained, reading the IR is a bit easier to follow.
More modular - we can add new operations, rewrites, etc., without having to make changes in many places across the IR. This can be done by abstracting over concrete operation types using operation traits, but even just basic functionality such as folding, canonicalization, and verification, can be defined on concrete operation types, rather than in one big instruction enumeration.

The above suggestion/proposal doesn't have to be where we go with this by the way, but hopefully it will be useful as a way to spur some discussion on the design we do ultimately end up with. I'd also point to it as an example of the kind of information I'm looking for in the design document, i.e. the specific elements of the IR you've defined, their semantics/relationships, etc.

I'm less focused on the specific details of the IR - more so that it is cohesive, has a comprehensive spec, and delivers on the benefits we're hoping to gain from the effort, i.e. that it makes analysis and optimization easier, is more maintainable, and can be extended without significant refactoring (within reason). So long as all of that is accounted for, I'm not too particular about the specific design details - so even though I just brain-dumped a whole bunch of information about a possible design, feel free to discard that in favor of your own approach if you have a clear vision for how to implement it.


Leo-Besancon · 2024-09-19T08:10:24Z


Thanks for the valuable feedback @bitwalker!
It is indeed a first attempt at an IR and we wanted to validate/amend it with you before moving on with canonicalization, validation rules and other details.

A better explanation of the choices

Guidelines

The fundamental design guideline we followed is:
"we will start from the low-level AIR and add the minimal amount of extra features needed to make a set of desirable optimizations easy",
with extra points for:

restricted grammar IR: if the IR has too much vocabulary, keywords and so on, each optimization pass will need to be aware of all the mutual interactions between those concepts in order to optimize them
avoiding any "leaks" of the AST into the IR to make it agnostic with respect to AirScript itself. This opens the possibility of other front-ends in the future, and more importantly it completely decouples the IR from changes that we might bring to AirScript in the future, which should simplify future work.
readability, explainability, linkability to the AST and the IR for debugging

As for the "set of target optimizations" we targeted:

constant propagation (which is already straightforward with the SSA-style instructions of the AIR)
value numbering and common subexpression elimination (this is also usually done under SSA in other compilers, and we have added "blocks" to hint the eliminator on semantically plausible repeated subexpressions)
inlining (straightforward to implement with the concept of "block" described below)

Basically the goal is to move as many optimization steps as possible away from AST/AIR and have them applied on the IR itself.
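As a rough illustration of the value numbering / common subexpression elimination item, the sketch below deduplicates identical pure instructions in a flat SSA list. The `(dest, op, args)` tuple encoding is an assumption, and treating `add` and `mul` as commutative is the only algebraic knowledge it uses.

```python
COMMUTATIVE = {"add", "mul"}


def eliminate_common_subexpressions(instructions):
    """Local value numbering over (dest, op, args) tuples in SSA form."""
    seen = {}   # (op, normalized args) -> first value that computed it
    alias = {}  # eliminated value name -> surviving value name
    out = []
    for dest, op, args in instructions:
        args = [alias.get(a, a) for a in args]
        if dest is None:
            # Ops without a result (such as enf) are always kept.
            out.append((dest, op, args))
            continue
        if op in COMMUTATIVE:
            key = (op, tuple(sorted(args, key=str)))
        else:
            key = (op, tuple(args))
        if key in seen:
            alias[dest] = seen[key]
        else:
            seen[key] = dest
            out.append((dest, op, args))
    return out


program = [
    ("x0", "read", ["col_a", 0]),
    ("x1", "read", ["col_b", 0]),
    ("x2", "mul", ["x0", "x1"]),
    ("x3", "read", ["col_a", 0]),  # same as x0
    ("x4", "mul", ["x1", "x3"]),   # same as x2 once x3 -> x0
    ("x5", "add", ["x2", "x4"]),
    (None, "enf", ["x5"]),
]
deduped = eliminate_common_subexpressions(program)
```

This relies on SSA: since a name is never reassigned, two instructions with the same op and operands always compute the same value.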

AST to IR

To achieve the goal of having a low-vocabulary IR, we thought of eliminating the concept of type and unrolling loops/vectors/lists when passing from the AST to the IR.
That way, if a new type is added to the language frontend (e.g. u24), it won't have consequences on the IR.

Instructions and "blocks"

The instructions are grouped under "blocks" that take arguments and can return values. The instructions are:

addition
multiplication
subtraction
enf (enforces that a value is equal to zero)
calling a block
read
const

More on blocks

The reason why we didn't go for the usual basic blocks is that here there is no real branching nor jumps, everything is always executed as a whole graph.
In that sense we felt like AirScript is closer to combinational VHDL than to x86 ASM.
Basically the concepts of “conditional / unconditional jumps”, “loops” and other control flow concepts felt out of place in our context, in which all conditional branches need to be taken into account at the same time.

We went for function-like blocks of conditions or values that are semantically likely to be reused by being called multiple times from different places.
For example, enf x^2 = x for x in d would be transformed into a block that takes x as argument and executes enf x^2=x, and that block would be explicitly called for each element of the unrolled array d.
This is interesting because it can give a strong hint to the common subexpression elimination pass since blocks are semantically meant to be reused.

As for why the arguments of each block start with an enable argument (the first argument), that's because we want to improve the readability of nested match compared to how they currently appear in the AIR, and simplify translation for nested conditional cases.

The translation rules are essentially:

if there is an enf x instruction in a block, instead of translating it as enf x we translate it as x2 = x * enable; enf x2. That way, conditions within a block only apply if the block is enabled
if a block B1 contains an explicit call to a block B2, the enable value of the called block B2 is set to be equal to the enable value of B1 so that all nested calls inherit the enabling state of the block that calls them
if there is a match x in a block B1: extract the two branches as blocks B2 for case x and B3 for case !x, and call them both, setting their enable argument to enable * x for B2 and enable * (1-x) for B3. That way the two case calls are disabled when block B1 is disabled, and the right call from the match is enabled when B1 is itself enabled.

This has the advantage of explicitly including the multiplications generated by nested match cases within the syntax.

Values and constraints within list comprehensions (lists are unrolled on translation to IR), evaluators and functions are all replaced with the unique concept of block.

Note that constant propagation automatically eliminates unused enable arguments that are hard-set to const(1), and eliminates whole blocks for which enable is const(0).

Here is an example translation process for a nested match:

Input (as AST):


ev foo([x, a, b, c, d, e, f]) {
    enf match x {
        case x: bar(a, b, c, d)
        case !x: bar(c, d, e, f)
    }
}

ev bar([v0, v1, v2, v3]) {
    enf match v0 {
        case v0: v1 = v2 + v3
        case !v0: v1 = v2 * v3
    }
}

We first expand the match instances.

First, blocks are created for the case v0 and case !v0 branches of bar respectively:

// case v0
// a0 is the "enable" argument
block b1(a0, a1, a2, a3) {
	x0 = a2 + a3
	x1 = x0 - a1
	x2 = a0 * x1
	enf x2
}

// case !v0
// a0 is the "enable" argument
block b2(a0, a1, a2, a3) {
	x0 = a2 * a3
	x1 = x0 - a1
	x2 = a0 * x1
	enf x2
}

Then the match is replaced within the bar block:

// bar function
// args:
//   a0: enable
//   a1: v0
//   a2: v1
//   a3: v2
//   a4: v3
block b3(a0, a1, a2, a3, a4) {
	// case v0
	x0 = a0 * a1
	b1(x0, a2, a3, a4)
    
	// case !v0
	x1 = const(1)
	x2 = x1 - a1
	x3 = a0 * x2
	b2(x3, a2, a3, a4)
}

We proceed the same way for foo:

// foo function
// args:
//   a0: enable
//   a1: x
//   a2: a
//   a3: b
//   a4: c
//   a5: d
//   a6: e
//   a7: f
block b4(a0, a1, a2, a3, a4, a5, a6, a7) {
	// case x
	x0 = a0 * a1
	b3(x0, a2, a3, a4, a5)
    
	// case !x
	x1 = const(1)
	x2 = x1 - a1
	x3 = a0 * x2
	b3(x3, a4, a5, a6, a7)
}

The nice thing is that the IR remains readable and keeps a structure similar to the initial code thanks to blocks, despite being very close to the AIR.
The deeply nested "enf" instructions are correctly enabled or disabled depending on the nested cases they went through when called, without resorting to passing the whole history of upstream match conditions as explicit arguments. All "enabling" multiplications are explicitly part of the graph and are optimized together with the other operators.
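The enabling behaviour can be checked with a small sketch that applies the three translation rules to the foo/bar example and then evaluates the result on concrete values. Assumptions: the nested-tuple AST, the `t<N>` temporaries, the helper names `lower` and `evaluate`, and the use of plain integers in place of field elements are all illustrative; blocks are expanded in place here rather than called.

```python
import itertools


def lower(body, enable, out, fresh):
    """Lower nested matches into flat instructions with explicit enables."""
    for node in body:
        if node[0] == "enf":
            # ("enf", lhs, op, r1, r2) stands for: enf lhs = r1 <op> r2
            _, lhs, op, r1, r2 = node
            rhs, diff, gated = next(fresh), next(fresh), next(fresh)
            out += [(rhs, op, [r1, r2]),
                    (diff, "sub", [rhs, lhs]),
                    (gated, "mul", [diff, enable]),  # rule 1: gate by enable
                    (None, "enf", [gated])]
        else:
            # ("match", selector, body_if_selector, body_if_not_selector)
            _, sel, on_true, on_false = node
            en_true = next(fresh)
            out.append((en_true, "mul", [enable, sel]))    # enable * x
            lower(on_true, en_true, out, fresh)
            one, not_sel, en_false = next(fresh), next(fresh), next(fresh)
            out += [(one, "const", [1]),
                    (not_sel, "sub", [one, sel]),          # 1 - x
                    (en_false, "mul", [enable, not_sel])]  # enable * (1 - x)
            lower(on_false, en_false, out, fresh)
    return out


def evaluate(instructions, env):
    """Return the value handed to each enf, in program order."""
    env = dict(env)
    constraints = []
    for dest, op, args in instructions:
        if op == "const":
            env[dest] = args[0]
        elif op == "enf":
            constraints.append(env[args[0]])
        else:
            a, b = env[args[0]], env[args[1]]
            env[dest] = {"add": a + b, "sub": a - b, "mul": a * b}[op]
    return constraints


def bar(v0, v1, v2, v3):
    return [("match", v0,
             [("enf", v1, "add", v2, v3)],
             [("enf", v1, "mul", v2, v3)])]


foo = [("match", "x", bar("a", "b", "c", "d"), bar("c", "d", "e", "f"))]
ir = lower(foo, "enable", [], (f"t{i}" for i in itertools.count()))

row = dict(enable=1, x=1, a=1, b=4, c=1, d=3, e=4, f=6)
satisfied = evaluate(ir, row)                # b = c + d holds on the active leaf
violated = evaluate(ir, {**row, "b": 9})     # active leaf no longer holds
other_branch = evaluate(ir, {**row, "x": 0})  # now bar(c, d, e, f) is active
```

In each run only the leaf selected by the chain of match conditions can produce a non-zero value; every other leaf is multiplied by an enable of zero.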

For IR -> IR optimizations:

treat blocks as candidate subexpressions for sub-expression elimination
perform constant propagation directly on the IR
inline blocks that are called only once
if a block contains at most 1 instruction, replace it with that instruction
and so on...

Then, to translate to AIR we simply need to expand blocks into the AIR tree and link them according to calls.
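A possible shape for that expansion step, sketched in Python under assumptions: blocks are stored as hypothetical `(params, result, body)` triples, calls are `(dest, "call", [callee, *args])` tuples, the `inline_calls` helper and the `v<N>` names are illustrative, and blocks never call themselves recursively.

```python
import itertools


def inline_calls(blocks, entry):
    """Expand every call reachable from `entry` into one flat SSA list."""
    fresh = (f"v{i}" for i in itertools.count())
    out = []

    def expand(name, actuals):
        params, result, body = blocks[name]
        rename = dict(zip(params, actuals))  # formal -> actual value name
        for dest, op, args in body:
            if op == "call":
                callee, *call_args = args
                returned = expand(callee, [rename.get(a, a) for a in call_args])
                if dest is not None:
                    rename[dest] = returned
                continue
            if op in ("const", "read"):
                new_args = list(args)  # immediates are copied untouched
            else:
                new_args = [rename.get(a, a) for a in args]
            new_dest = None
            if dest is not None:
                new_dest = next(fresh)  # fresh name keeps the output in SSA
                rename[dest] = new_dest
            out.append((new_dest, op, new_args))
        return rename.get(result)

    expand(entry, blocks[entry][0])
    return out


# A three-element multiply-and-sum in the style of madd3 (b5 calling b6)
blocks = {
    "b6": (["enable", "a0", "a1"], "x0", [("x0", "mul", ["a0", "a1"])]),
    "b5": (["enable", "a0", "a1", "a2", "a3"], "x4", [
        ("x0", "call", ["b6", "enable", "a0", "a3"]),
        ("x1", "call", ["b6", "enable", "a1", "a3"]),
        ("x2", "call", ["b6", "enable", "a2", "a3"]),
        ("x3", "add", ["x0", "x1"]),
        ("x4", "add", ["x3", "x2"]),
    ]),
}
flat = inline_calls(blocks, "b5")
```

Each expansion gets fresh value names, so the flat result stays in SSA form even when the same block is expanded several times.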

Questions relative to your proposal and RVSDG

We read the article presenting RVSDG (https://arxiv.org/pdf/1912.05036), and it makes a lot of sense. In particular, the structure you described, and the optimization pass description, would certainly be useful to have during implementation.

Given the guidelines outlined above that we used to guide our design proposal, our questions are mainly:

Do we need type information for a specific optimization step that would not work by handling scalars? We feel that types would be mainly necessary to represent aggregated types (vectors and matrices), that could be unrolled early in the AST > IR lowering
Would you require a human readable display representation for the IR in the RVSDG form (for debugging purposes)? It seems that directly printing the RVSDG graph would not be readable, and displaying it would require additional work (maybe going AST > human readable IR > RVSDG, adding a step), but maybe you have insights on this?
We aimed to have a very simple vocabulary (to avoid leaking the language specifics in the IR and avoid writing optimization passes on a big instruction set). In the structured instructions you mentioned, we feel that for / reduce could be handled by unrolling, and switch / if can be united in a single instruction. Would that be possible?
When taking the case of two nested matches:

match x {
  case x : {
    enf match y {
      case y :  a = b+c,
      case !y: e = f+g
    },
  }
  case !x: {
    enf match y {
      case y :  h = i+j,
      case !y: k = l+m
    },
  }
}

// we expect that e = f+g is translated to the equivalent of (e-f-g)*(1-y)*x = 0 so all upstream branching needs to be taken into account for all such leaves

This would be translated into a block, containing an operation with two regions, within which we have blocks containing an operation with two regions (nested structure).
When would this be translated into pure math?

If it’s during lowering of the IR to the AIR, this means we would “magically” introduce new arithmetical nodes (multiplications by x and (1-x), as well as y and (1-y)) late in the process, and they might not benefit from previous optimization passes
If it’s during a pass of the IR, what does this pass look like? We would probably lose the nested structure and information?
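For concreteness, applying the same pattern as the comment in the snippet above to every leaf would give four gated constraints. This is only a sketch of the expected "pure math" form, and it assumes x and y are binary selectors:

```latex
\begin{align*}
(a - b - c)\, y\, x &= 0 \\
(e - f - g)\,(1 - y)\, x &= 0 \\
(h - i - j)\, y\,(1 - x) &= 0 \\
(k - l - m)\,(1 - y)\,(1 - x) &= 0
\end{align*}
```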

Last considerations

One more idea. In the end (AIR) we are talking about symbolic math using only additions, constants, subtractions and multiplications. There are several libraries that would allow us to optimize math much better than traditional sequential compilers do, for example by resorting to factoring, algebraic identities and so on. Making sure that after all its optimization passes the IR only resorts to basic arithmetic and is compatible with symbolic algebra frameworks would open a whole new world of optimizations using existing libs directly on the IR. It would also make the IR -> AIR transform straightforward.

To wrap up everything, we feel that we could take the structure from RVSDG and apply it with a limited vocabulary (especially if we can unroll the loops during lowering from AST to the IR and work on scalar elements), to take advantage of all this structure and the underlying information to optimize better, while still staying as close to the AIR as possible.

We really appreciate your feedback on this, and aside from our planned meetings we can also discuss the constraints that guide this design in more depth if needed.

1 reply

bobbinth Sep 19, 2024
Maintainer

Without having read the rest of the comment, wanted to leave a quick note about this:

One more idea. In the end (AIR) we are talking about symbolic math using only additions, constants, subtractions and multiplications. There are several libraries that would allow us to optimize math much better than traditional sequential compilers do, for example by resorting to factoring, algebraic identities and so on. Making sure that after all its optimization passes the IR only resorts to basic arithmetic and is compatible with symbolic algebra frameworks would open a whole new world of optimizations using existing libs directly on the IR. It would also make the IR -> AIR transform straightforward.

In my mind, most of the "algebraic" optimizations would be done on the algebraic graph. So, I think of the IR as being about enforcing correct semantics (e.g., type checking) and AIR as the place for optimizations.
