High level optimizer server API #380
Replies: 6 comments 11 replies
-
@kmod
-
We'd generally prefer more control, not less. Just curious, is this being driven by a specific use-case?
-
Yes, you're right, CinderJIT wouldn't use this API; we would want fuller control by fully replacing the eval loop. A few thoughts/questions anyway:
CinderJIT sees a lot of value from optimizing Python-to-Python calls, in some cases entirely eliminating Python argument marshaling and just placing arguments in the right registers and doing an x64 call. It seems that type of optimization would be impossible under this API? (It remains to be seen how much of the value we get from that remains in 3.10+ with the upstream improvements to Python function call overhead.)
-
I get the sense that there is a certain amount of work that most/all JIT implementations have in common, and that CPython's eval loop is already in a position to do that work (letting the JIT focus on what it's good at). If that's the case, then how would you describe/enumerate that common work, and how would it make sense for a JIT to plug into it, API-wise? What I'm hearing is "It isn't worth the trouble. Swapping out the whole eval loop is good enough." What would make it worth the trouble? Ultimately, we want to maximize the value CPython provides, to empower you (and everyone).
-
One of the reasons why Pyston has a modified copy of ceval.c is that we need to collect additional profile info before we JIT-compile code (not only to fill in the attribute caches; we also use it, e.g., to store some type info and reference counts). Maybe we could use something like #162, if it lets us register a callback which does the profiling and afterwards lets CPython's interpreter handle the instruction (but I'm not sure if all opcodes we use support specialization).
-
We launched "pyston-lite" last week, which is a PEP 523 extension module that uses a .pth file to register itself. We think we were able to solve the safety issues, though it involved avoiding
-
We need an API for optimizers that is driven by the VM, instead of having the VM driven by the optimizer, as Pyston and Cinder do.
Note:
Design
Once an optimizer is registered, it is up to the VM when it calls the optimizer, respecting the published capabilities of the optimizer.
Whenever the VM wants to optimize code at a hotspot, it calls the optimizer to get an execution object capable of executing from that hotspot. It is up to the optimizer to determine how that execution occurs; it is opaque to the VM.
The only constraint on the execution object is that it leaves the VM in a valid state, exactly as if the interpreter had executed 0 or more VM instructions.
The API
optimization_cost should be the approximate cost of producing the executor object, relative to evaluating the equivalent code in the interpreter.
run_cost should be the approximate time to execute the executor object, relative to the interpreter.
These numbers cannot be exact, but should be a reasonable estimate of average costs, to help the VM determine when to optimize code.
An optimizer is registered by calling _Py_Optimizer_Register(optimizer). Once registered, the VM will call optimizer->compile(optimizer, code, offset) whenever it feels that code needs to be optimized. The _Py_Executor_Replace function is provided to allow PyExecutionObjects to replace themselves whenever necessary.
Example
This is a valid, but useless implementation of an "optimizer":
Runtime costs if not used
The above optimizer API design assumes that the VM will perform hotspot detection. Hotspot detection has a runtime cost, which is wasted if no optimizer is enabled. Fortunately we have an adaptive interpreter, which can turn on runtime hotspot checking only when needed.
Implementation outline
Finding hotspots
This can be done with two counters: one on backedges and function entry points, and one global counter. We hit a hotspot when the code is executed frequently. The frequency we need to reach depends on the optimization_cost and run_cost of the optimizer, as well as how long the program has been running. Optimize when frequency > threshold, where:
frequency = delta(local counter)/delta(global counter)
threshold = N*(1/((global counter)+1)) + M*optimization_cost/run_cost
where M and N are tunable parameters.
are tunable parameters.We check frequency whenever the local counter hits 0. Since we started the local counter at some constant
K
, thendelta(local counter) == K
. If we store the value of the global counter when the local counter was set to K, then we havefrequency = K/((global counter) - (local value of global counter))
With C = M*optimization_cost/run_cost, frequency > threshold is equivalent to:
K/((global counter) - (local value of global counter)) > N*(1/((global counter)+1)) + C
It is then just algebra to re-organize the above into an efficient test of frequency > threshold.
Entering executor objects
Once we have hit a hotspot and got a PyExecutorObject, we need to transfer execution to it. Once again, we use our ability to replace instructions at runtime. Given that any potential hotspot needs space for counters, we can reuse that space to point to a PyExecutorObject, replacing the JUMP_BACKWARD with an ENTER_EXECUTOR opcode.
The equivalent version for function entry would replace RESUME, and probably store the executor on the code object, not inline.