CPU identity subtraction < 0 #26147

Open
andrewlkd opened this issue Jan 28, 2025 · 8 comments
Labels
bug Something isn't working

Comments

@andrewlkd

Description

Hello,

I'm not sure if this is a JAX bug or a device precision issue, but the following code produces a negative value when jitted (0 is expected).

import jax

def f(x1, s):
  x2 = x1 * (1.0 + s)
  d = x2**2 - x1**2
  return d

f_jit = jax.jit(f)
x = 0.1
s = 0.0
print(f(x, s))      # 0
print(f_jit(x, s))  # < 0

Note that any of the following fixes the issue:

  • replacing the subtraction with d = (x2 + x1) * (x2 - x1) (sketch below),
  • not passing s as an argument and instead hardcoding it to 0, or
  • running on TPU.
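
For concreteness, here is a sketch of the first workaround (the same f as above, just with the subtraction factored):

import jax

def f_factored(x1, s):
  x2 = x1 * (1.0 + s)
  # algebraically identical to x2**2 - x1**2, but x2 - x1 is computed
  # directly instead of cancelling two separately rounded squares
  return (x2 + x1) * (x2 - x1)

print(jax.jit(f_factored)(0.1, 0.0))  # 0.0 (the factored form avoids the negative value)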

Thanks!

System info (python version, jaxlib version, accelerator, etc.)

jax:    0.5.1
jaxlib: 0.5.1
numpy:  2.2.1
python: 3.11.8 (stable, redacted, redacted) [Clang 9999.0.0 (faa3f752896903c2d09d389970d3d0ebf50a1073)]
device info: cpu-1, 1 local devices
process_count: 1
platform: uname_result(system='Linux', node=[redacted], release='5.10.0-smp-1105.32.0.0', version='#1 [v5.10.0-1105.32.0.0] SMP @1729903589', machine='x86_64')
@jakevdp
Collaborator

jakevdp commented Jan 28, 2025

Hi - in general JIT compilation does not guarantee bitwise-exact outputs, but the output should generally be within the expected precision of the floating-point representation being used. In this case you're working with float32, so differences on the order of np.finfo('float32').eps, or about 1E-7, are not unexpected.
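
As a rough sanity check (a sketch, reusing x, s, and f_jit from your snippet), you can compare the magnitude of the jitted result against eps scaled by the size of the squares being subtracted:

import numpy as np

eps = np.finfo('float32').eps   # ~1.1920929e-07
tol = eps * x**2                # tolerance at the scale of the values being subtracted
print(abs(f_jit(x, s)) <= tol)  # expected True if the difference is just rounding error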

You can dig a bit into what's going on by using the ahead-of-time compilation tools; for example:

print("uncompiled:")
print(f_jit.lower(x, s).as_text())
print("\ncompiled:")
print(f_jit.lower(x, s).compile().as_text())
uncompiled:
module @jit_f attributes {mhlo.num_partitions = 1 : i32, mhlo.num_replicas = 1 : i32} {
  func.func public @main(%arg0: tensor<f32> {mhlo.layout_mode = "default"}, %arg1: tensor<f32> {mhlo.layout_mode = "default"}) -> (tensor<f32> {jax.result_info = "", mhlo.layout_mode = "default"}) {
    %cst = stablehlo.constant dense<1.000000e+00> : tensor<f32>
    %0 = stablehlo.add %cst, %arg1 : tensor<f32>
    %1 = stablehlo.multiply %arg0, %0 : tensor<f32>
    %2 = stablehlo.multiply %1, %1 : tensor<f32>
    %3 = stablehlo.multiply %arg0, %arg0 : tensor<f32>
    %4 = stablehlo.subtract %2, %3 : tensor<f32>
    return %4 : tensor<f32>
  }
}


compiled:
HloModule jit_f, is_scheduled=true, entry_computation_layout={(f32[], f32[])->f32[]}, allow_spmd_sharding_propagation_to_parameters={true,true}, allow_spmd_sharding_propagation_to_output={true}

%fused_computation (param_0.1: f32[], param_1.4: f32[]) -> f32[] {
  %param_0.1 = f32[] parameter(0)
  %param_1.4 = f32[] parameter(1)
  %constant.0 = f32[] constant(1)
  %add.0 = f32[] add(f32[] %param_1.4, f32[] %constant.0), metadata={op_name="jit(f)/jit(main)/add" source_file="<ipython-input-1-afa1b93b6de2>" source_line=4}
  %multiply.2 = f32[] multiply(f32[] %param_0.1, f32[] %add.0), metadata={op_name="jit(f)/jit(main)/mul" source_file="<ipython-input-1-afa1b93b6de2>" source_line=4}
  %multiply.1 = f32[] multiply(f32[] %multiply.2, f32[] %multiply.2), metadata={op_name="jit(f)/jit(main)/integer_pow" source_file="<ipython-input-1-afa1b93b6de2>" source_line=5}
  %multiply.0 = f32[] multiply(f32[] %param_0.1, f32[] %param_0.1), metadata={op_name="jit(f)/jit(main)/integer_pow" source_file="<ipython-input-1-afa1b93b6de2>" source_line=5}
  ROOT %subtract.0 = f32[] subtract(f32[] %multiply.1, f32[] %multiply.0), metadata={op_name="jit(f)/jit(main)/sub" source_file="<ipython-input-1-afa1b93b6de2>" source_line=5}
}

ENTRY %main.9 (Arg_0.1: f32[], Arg_1.2: f32[]) -> f32[] {
  %Arg_0.1 = f32[] parameter(0), metadata={op_name="x1"}
  %Arg_1.2 = f32[] parameter(1), metadata={op_name="s"}
  ROOT %fusion = f32[] fusion(f32[] %Arg_0.1, f32[] %Arg_1.2), kind=kLoop, calls=%fused_computation, metadata={op_name="jit(f)/jit(main)/sub" source_file="<ipython-input-1-afa1b93b6de2>" source_line=5}
}

That said, I don't see anything obviously problematic here. The compiled version fuses the full computation into a single kernel, and perhaps some of the fusion logic re-orders the computation in such a way that errors accumulate differently.

@andrewlkd
Author

Thanks, Jake! I had indeed looked at some of the lowered output and was confused because I couldn't find an obvious difference between the cases that did and didn't work.

This is a simplified reproducer of a method in our repo: https://github.com/google-deepmind/graphcast/blob/main/graphcast/samplers_utils.py#L418

The method returns NaNs when running on CPU with stochastic_churn_rate equal to 0. Otherwise it is fine (on TPU, or when stochastic_churn_rate is non-zero).

We have identified that new_noise_level**2 - noise_level**2 can evaluate to less than 0 when new_noise_level == noise_level (this occurs when the stochastic_churn_rate is 0).

I guess we'll have to add a jnp.maximum call to clamp at 0, or rewrite the subtraction in the factored difference-of-squares form; something like the sketch below.
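
(A sketch only; noise_increment is a hypothetical helper name, and I'm assuming the subtraction feeds a jnp.sqrt, which would explain the NaNs.)

import jax.numpy as jnp

def noise_increment(new_noise_level, noise_level):
  # Option 1: clamp the subtraction at 0 before the square root
  clamped = jnp.maximum(new_noise_level**2 - noise_level**2, 0.0)
  # Option 2: factor the subtraction; (a + a) * (a - a) is exactly 0.0
  # when new_noise_level == noise_level
  factored = (new_noise_level + noise_level) * (new_noise_level - noise_level)
  return jnp.sqrt(clamped)  # or jnp.sqrt(factored)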

@pearu
Collaborator

pearu commented Jan 29, 2025

I can reproduce the issue on CPU but not on CUDA.

While it is true that the un-jitted f(x, s) is evaluated in float64 (the inputs are Python floats, so this is plain Python arithmetic), I would expect that forcing float32 inputs, that is, calling f(jnp.float32(x), jnp.float32(s)), would produce the same result as f_jit(x, s).

If "the fusion logic re-orders the computation in such a way that errors accumulate differently" is indeed true, I'd consider this as a bug because floating-point arithmetic is non-associative and algorithms that are designed to take non-associativity into account to improve the accuracy of results, become broken.

However, with the given inputs (x1=0.1, s=0.0), I cannot pinpoint what could cause the errors to accumulate differently:

  • fp addition is exact when one operand is zero,
  • fp multiplication is exact when one operand is one.

Hence x1 and x2 ought to be equal, their squares ought to be equal, and subtracting equal values ought to give exactly zero.
Even if CPU and CUDA operations use different FTZ (flush-to-zero) modes, that cannot explain the non-zero result from the jitted function.
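
For reference, a quick NumPy-level check of the two exactness facts above (a sketch, mirroring the original f with explicit float32 scalars):

import numpy as np

x1 = np.float32(0.1)
s = np.float32(0.0)
x2 = x1 * (np.float32(1.0) + s)  # add-with-zero and multiply-by-one are both exact
assert x2 == x1                  # so x2 and x1 are identical
assert x2 * x2 - x1 * x1 == 0.0  # and subtracting the equal squares gives exactly zero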

@pearu
Collaborator

pearu commented Jan 29, 2025

@andrewlkd, using d = (x2 + x1) * (x2 - x1) makes more sense because it reduces the cancellation errors that occur when using x2**2 - x1**2.
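
A quick illustration with nearby but unequal inputs (a sketch; the specific values are just for demonstration):

import numpy as np

x1 = np.float32(1.0)
x2 = np.float32(1.0) + np.float32(1e-4)
reference = np.float64(x2)**2 - np.float64(x1)**2  # float64 reference

# squaring first: the subtraction cancels most digits, amplifying the
# rounding error of each squared term
print(x2**2 - x1**2)
# factored form: x2 - x1 is exact here (x1 and x2 are within a factor of 2),
# so the result stays much closer to the reference
print((x2 + x1) * (x2 - x1))
print(reference)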

@pearu
Collaborator

pearu commented Jan 29, 2025

Here is a simpler reproducer of the issue:

>>> def f(x1, x2):
...   return x2 ** 2 - x1 ** 2
... 
>>> jax.jit(f)(jnp.array(0.1), jnp.array(0.1))
Array(-4.0978193e-10, dtype=float32, weak_type=True)
>>> def f(x1, x2):
...   return 0 + x1 ** 2 - x1 ** 2
... 
>>> jax.jit(f)(jnp.array(0.1), jnp.array(0.1))
Array(0., dtype=float32, weak_type=True)

@pearu
Collaborator

pearu commented Jan 29, 2025

Notice the following (coincidence?):

>>> numpy.float32(0.1) ** 2 - numpy.float32(numpy.float64(numpy.float32(0.1)) ** 2)
-4.097819306103645e-10

which reproduces the result above; that is, the value originates from a float->double->float cast.
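
Spelling this out (a sketch; the exact printed digits may differ):

import numpy as np

x = np.float32(0.1)
sq_f32 = x * x                          # square rounded to float32
sq_f64 = np.float64(x) * np.float64(x)  # square carried in float64
# same magnitude as the jitted result (~4.1e-10); the sign just depends on
# which term keeps the extra precision
print(np.float64(sq_f32) - sq_f64)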

@jakevdp
Collaborator

jakevdp commented Jan 29, 2025

This would be worth reporting upstream at https://github.com/openxla/xla. @pearu would you like to do that?

@pearu
Collaborator

pearu commented Jan 30, 2025

Here's the report upstream: openxla/xla#22116. It includes a couple of other diagnostic results.
