High level optimizer server API #380
Replies: 6 comments 11 replies
-
@kmod
-
We'd generally prefer more control, not less. Just curious, is this being driven by a specific use-case?
-
Yes, you're right, CinderJIT wouldn't use this API; we would want fuller control by fully replacing the eval loop. A few thoughts/questions anyway:
CinderJIT sees a lot of value from optimizing Python-to-Python calls, in some cases entirely eliminating Python argument marshaling and just placing arguments in the right registers and doing an x64 call. It seems that type of optimization would be impossible under this API? (It remains to be seen how much of the value we get from that remains in 3.10+ with the upstream improvements to Python function call overhead.)
-
I get the sense that there is a certain amount of work that most/all JIT implementations have in common, and that CPython's eval loop is already in a position to do that work (letting the JIT focus on what it's good at). If that's the case, then how would you describe/enumerate that common work, and how would it make sense for a JIT to plug into it, API-wise? What I'm hearing is "It isn't worth the trouble. Swapping out the whole eval loop is good enough." What would make it worth the trouble? Ultimately, we want to maximize the value CPython provides, to empower you (and everyone).
-
One of the reasons why Pyston has a modified copy of ceval.c is that we need to collect additional profile info before we JIT-compile code (not only to fill in the attribute caches; we also use it, e.g., to store some type info and reference counts). Maybe we could use something like #162, if it lets us register a callback which does the profiling and afterwards lets CPython's interpreter handle the instruction (but I'm not sure if all opcodes we use support specialization).
-
We launched "pyston-lite" last week, which is a PEP 523 extension module that uses a .pth file to register itself. We think we were able to solve the safety issues, though it involved avoiding
-
We need an API for optimizers that is driven by the VM, instead of having the VM driven by the optimizer, as Pyston and Cinder do.
Note:
Design
Once an optimizer is registered, it is up to the VM when it calls the optimizer, respecting the published capabilities of the optimizer.
Whenever the VM wants to optimize code at a hotspot, it calls the optimizer to get an execution object capable of executing from that hotspot. It is up to the optimizer to determine how that execution occurs; it is opaque to the VM.
The only constraint on the execution object is that it leaves the VM in a valid state, exactly as if the interpreter had executed 0 or more VM instructions.
The API
optimization_cost should be the approximate cost of producing the executor object, relative to evaluating the equivalent code in the interpreter.
run_cost should be the approximate time to execute the executor object, relative to the interpreter.
These numbers cannot be exact, but should be a reasonable estimate of average costs, to help the VM determine when to optimize code.
An optimizer is registered by calling _Py_Optimizer_Register(optimizer). Once registered, the VM will call optimizer->compile(optimizer, code, offset) whenever it feels that code needs to be optimized. The _Py_Executor_Replace function is provided to allow PyExecutionObjects to replace themselves whenever necessary.
Example
This is a valid, but useless implementation of an "optimizer":
Runtime costs if not used
The above optimizer API design assumes that the VM will perform hotspot detection. Hotspot detection has a runtime cost, which is wasted if no optimizer is enabled. Fortunately we have an adaptive interpreter, which can turn on runtime hotspot checking only when needed.
Implementation outline
Finding hotspots
This can be done with two counters: one on backedges and function entry points, and one global counter. We hit a hotspot when the code is executed frequently. The frequency we need to reach depends on the optimization_cost and run_cost of the optimizer, as well as how long the program has been running. Optimize when frequency > threshold, where:
frequency = delta(local counter)/delta(global counter)
threshold = N*(1/((global counter)+1)) + M*optimization_cost/run_cost
where M and N are tunable parameters.
are tunable parameters.We check frequency whenever the local counter hits 0. Since we started the local counter at some constant
K
, thendelta(local counter) == K
. If we store the value of the global counter when the local counter was set to K, then we havefrequency = K/((global counter) - (local value of global counter))
With C = M*optimization_cost/run_cost, frequency > threshold is equivalent to:
K/((global counter) - (local value of global counter)) > N*(1/((global counter)+1)) + C
It is then just algebra to re-organize the above into an efficient test of frequency > threshold.
Entering executor objects
Once we have hit a hotspot and got a PyExecutorObject, we need to transfer execution to it. Once again, we use our ability to replace instructions at runtime. Given that any potential hotspot needs space for counters, we can reuse that space to point to a PyExecutorObject, replacing the JUMP_BACKWARD with an ENTER_EXECUTOR opcode.
The equivalent version for function entry would replace RESUME, and probably store the executor on the code object, not inline.