Simulating a Pauli tensor faster than as a generalised permutation matrix #154

TysonRayJones · 2024-08-21T10:26:18Z


Hi there,

Imagine that we seek to apply a tensor of Pauli operators (rather than just calculate their expectation values). That is, they appear as same-column gates in our circuit, as might arise from twirling, Monte Carlo decomposition of Pauli channels, etc.

We could inefficiently effect these in turn as one-qubit gates. It turns out that's a factor of #paulis slower than necessary, and is especially a shame when the operators have common controls.

It is ideal to simulate them as a single operator.

Tensors of Pauli operators have a special form owing to each being diagonal or anti-diagonal, and can be simulated as quickly as a single Pauli operator (with no matrix allocations necessary) as outlined in Sec IV E here. It is my understanding that it is not currently possible to leverage this in cuStateVec.

Their Z-basis matrices are simple, specific instances of generalised permutation matrices.

They can ergo be simulated using cuStateVec's custatevecApplyGeneralizedPermutationMatrix; however, a few issues remain:

- We must ourselves build the diagonal elements, which is a chore (involving counting the parity of Y operators, tracking where the non-unity coefficients of Y and Z are applied, etc). A similar chore is likely already being automatically performed in custatevecApplyPauliRotation, if it is using an algorithm similar to Sec IV G.
- We must ourselves build the permutation matrices (by tracking X and Y operators, etc). This is a totally superfluous allocation of memory which is exponential in the number of Paulis (I think as bad as 4^#paulis)! Essentially, the Pauli tensor simply swaps pairs of amplitudes and never moves them "in a loop" like a general permutation matrix might, so the final destination index of each amplitude is simply and cheaply encoded using a fixed number of bitmasks, rather than an exponentially big matrix.
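To make the bitmask idea concrete, here is a minimal pure-Python sketch (not the linked implementation, and not a cuStateVec call). The function name `apply_pauli_tensor`, the dict-based argument, and the convention that qubit 0 is the least significant bit of the amplitude index are all assumptions made for illustration. It relies only on X|b> = |1-b>, Y|b> = i(-1)^b |1-b> and Z|b> = (-1)^b |b>.

```python
def apply_pauli_tensor(state, paulis):
    """Apply a tensor of Paulis to a state vector (a list of amplitudes).

    paulis maps qubit index -> 'X', 'Y' or 'Z'. Qubit 0 is taken to be the
    least significant bit of the amplitude index (an assumed convention).
    """
    flip_mask = 0  # qubits whose bit gets flipped (X and Y)
    sign_mask = 0  # qubits contributing a factor (-1)^bit (Y and Z)
    num_y = 0
    for qubit, pauli in paulis.items():
        if pauli in ("X", "Y"):
            flip_mask |= 1 << qubit
        if pauli in ("Y", "Z"):
            sign_mask |= 1 << qubit
        if pauli == "Y":
            num_y += 1

    # Each Y contributes a factor of i, so the overall factor is i^num_y.
    global_factor = (1, 1j, -1, -1j)[num_y % 4]

    out = [0] * len(state)
    for i, amp in enumerate(state):
        # Parity of the set bits of i that sit under a Y or Z.
        parity = bin(i & sign_mask).count("1") & 1
        sign = -1 if parity else 1
        # The destination index is a single XOR; no permutation table needed.
        out[i ^ flip_mask] = global_factor * sign * amp
    return out


psi = [1, 2, 3, 4]
print(apply_pauli_tensor(psi, {0: "X", 1: "X"}))  # [4, 3, 2, 1]
```

The sketch writes to a fresh list for clarity. Since the index map is an involution, an in-place variant would instead swap each pair (i, i ^ flip_mask) once, and control qubits would simply skip the indices whose control bits do not match.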

Ergo a new, dedicated function for simulating Pauli tensors in cuStateVec would be very valuable and much faster than the existing method (linearly faster than in-turn simulation, and exponentially faster than via permutation matrices).
And notably, it will be easy; the logic for simulating the Pauli tensor is identical to optimal simulation of the Pauli gadget (custatevecApplyPauliRotation); they differ only by constant coefficients multiplied onto each amplitude (courtesy of Euler's formula).
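As a rough illustration of how little separates the two cases, the sketch below applies a Pauli gadget using the identity exp(-i θ/2 P) = cos(θ/2) I - i sin(θ/2) P. This is plain Python rather than cuStateVec code; the name `apply_pauli_gadget`, the angle convention, and the little-endian qubit ordering are assumptions for illustration and need not match the conventions of custatevecApplyPauliRotation.

```python
import math


def apply_pauli_gadget(state, paulis, theta):
    """Apply exp(-i * theta/2 * P), with P the tensor of Paulis in `paulis`.

    paulis maps qubit index -> 'X', 'Y' or 'Z'; qubit 0 is assumed to be
    the least significant bit of the amplitude index.
    """
    flip_mask = 0  # X and Y flip the bit
    sign_mask = 0  # Y and Z contribute (-1)^bit
    num_y = 0
    for qubit, pauli in paulis.items():
        if pauli in ("X", "Y"):
            flip_mask |= 1 << qubit
        if pauli in ("Y", "Z"):
            sign_mask |= 1 << qubit
        if pauli == "Y":
            num_y += 1
    global_factor = (1, 1j, -1, -1j)[num_y % 4]  # i^num_y

    c = math.cos(theta / 2)
    s = math.sin(theta / 2)
    out = [0j] * len(state)
    for i, amp in enumerate(state):
        j = i ^ flip_mask  # the partner amplitude that P maps onto index i
        sign = -1 if bin(j & sign_mask).count("1") & 1 else 1
        p_amp = global_factor * sign * state[j]  # (P psi)[i]
        # Same index and sign logic as the bare Pauli tensor; only the
        # constant coefficients c and -i*s are new.
        out[i] = c * amp - 1j * s * p_amp
    return out


# exp(-i*pi/4 * X)|0> = (|0> - i|1>) / sqrt(2)
print(apply_pauli_gadget([1, 0], {0: "X"}, math.pi / 2))
```

Dropping the `c * amp` term and the `-1j * s` factor recovers the bare Pauli tensor, which is the sense in which the two routines share their logic.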

You can view a non-distributed implementation of the algorithm here with arguments prepared here, which is derived in Sec IV E here. It is trivial to introduce control qubits in arbitrary states, which merely eliminate iterations of the main loop (each of which modifies one amplitude).

In my application (which I'll soon have the public code to link), I will have to fall back to using a custom kernel implementing the above algorithm - it would be fantastic to update it to use a cuStateVec function in the future!

ymagchi · 2024-08-21T23:03:43Z

Maintainer

Hi @TysonRayJones,

Thank you so much for your feedback.
We may be able to lift this requirement in the future, although the cuStateVec library has so far prioritized functionality that accepts an arbitrary matrix.

I'd like to add a note on the computational cost of custatevecApplyGeneralizedPermutationMatrix: both the permutation and diagonal matrices are passed as arrays of size 2^#Paulis, and I expect the kernel to be almost memory-bound because each state vector element requires only one matrix element during the multiplication. Some overhead is needed to create a permutation table, but the performance of custatevecApplyGeneralizedPermutationMatrix will not degrade exponentially with the number of Paulis.
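For reference, the two arrays of size 2^#Paulis could be modelled as below. This is only a Python model of the data layout and does not call cuStateVec; the helper names `pauli_tensor_tables` and `apply_tables`, the little-endian ordering of the Pauli string, and the reading of the matrix as A = D P (so that each output amplitude is `diag[i] * state[perm[i]]`) are assumptions that should be checked against the cuStateVec documentation before use.

```python
def pauli_tensor_tables(paulis):
    """Build permutation and diagonal tables for a Pauli string.

    paulis is a string such as "XYZ"; character k acts on target bit k
    (least significant first, an assumed convention). Both returned
    lists have length 2 ** len(paulis).
    """
    flip_mask = 0  # X and Y flip the bit
    sign_mask = 0  # Y and Z contribute (-1)^bit
    num_y = 0
    for bit, pauli in enumerate(paulis):
        if pauli in ("X", "Y"):
            flip_mask |= 1 << bit
        if pauli in ("Y", "Z"):
            sign_mask |= 1 << bit
        if pauli == "Y":
            num_y += 1
    global_factor = (1, 1j, -1, -1j)[num_y % 4]  # i^num_y

    dim = 1 << len(paulis)
    # Row i of the matrix has its single non-zero entry in column perm[i].
    perm = [i ^ flip_mask for i in range(dim)]
    diag = []
    for i in range(dim):
        odd = bin(perm[i] & sign_mask).count("1") & 1
        diag.append(global_factor * (-1 if odd else 1))
    return perm, diag


def apply_tables(state, perm, diag):
    """One table lookup per output amplitude: out[i] = diag[i] * state[perm[i]]."""
    return [diag[i] * state[perm[i]] for i in range(len(state))]


perm, diag = pauli_tensor_tables("Y")
print(perm, diag)
print(apply_tables([1, 0], perm, diag))
```

Each output amplitude reads exactly one entry from each table, which is why the work per amplitude stays constant while the table size grows as 2^#Paulis.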


TysonRayJones Aug 22, 2024
Author

Ahh right you are! Each updated amplitude uses a single permutation element so the number of operations does not scale exponentially with #Paulis - I unconsciously borrowed the logic of the many-target general matrix, my bad!

It is still true, however, that in addition to the overhead of creation, querying that exponentially large matrix will have to bite somewhere due to caching. For example, 25 Paulis will mean querying a permutation array of 256 MiB in a hot loop/dispatch. If CUDA's caching slowdown were logarithmic with memory size, use of the generalised permutation matrices would still induce a linear slowdown over a bespoke method. Of course we might never reach such a regime before running out of memory ¯\(ツ)/¯
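The 256 MiB figure follows from simple arithmetic, sketched here under the assumption that each permutation entry is an 8-byte (64-bit) index; the helper name `permutation_table_bytes` is made up for illustration.

```python
def permutation_table_bytes(num_paulis, bytes_per_index=8):
    """Size in bytes of a permutation table with 2**num_paulis entries.

    bytes_per_index=8 assumes 64-bit indices.
    """
    return (1 << num_paulis) * bytes_per_index


print(permutation_table_bytes(25) / 2**20)  # 256.0 (MiB)
```

A diagonal array of the same length would add to this, with its size depending on the complex precision used.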
