It seems like Aes256Gcm's performance is extremely low in debug builds; specifically, whenever anything other than opt-level = 3 is used. The benchmarks below show ~14 s to encrypt a 64 MB plaintext block when the bench profile is overridden to opt-level = 0, versus ~62 ms at opt-level = 3 (roughly a 240x gap).
With opt-level = 0 set:
$ cargo bench -p crypto --bench encrypt_decrypt
Benchmarking encrypt_64mb_direct/encrypt_64MB: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 147.6s.
encrypt_64mb_direct/encrypt_64MB
time: [14.733 s 14.773 s 14.817 s]
With opt-level = 3 set:
encrypt_64mb_direct/encrypt_64MB
time: [61.542 ms 61.875 ms 62.530 ms]
change: [-99.584% -99.581% -99.579%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
2 (20.00%) high severe
Analysis
VTune profiles suggested that AES-NI was being used in the slow case, so it's probably not a software-implementation fallback issue [1]. The stacks are mostly unintelligible, though, so I can't say for sure (the crates use a lot of metaprogramming). I haven't attempted a Godbolt analysis yet, because it seems like I must be doing something trivially wrong for performance to be this bad.
Is this just expected in debug / non-opt builds? Is there a known reason for this behavior?
[1] Older versions of the cipher crate used to require special RUSTFLAGS to enable AES-NI, but newer versions (I'm using 0.4.4) auto-detect its presence at runtime.
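As a quick sanity check that the hardware path is available at all, std can be queried directly. This is a minimal standalone sketch (not part of the cipher crate's API; the cpufeatures crate that RustCrypto uses performs a similar CPUID-based check internally):

fn main() {
    // Runtime CPU-feature probe; prints whether the AES-NI instructions
    // are available on this machine.
    #[cfg(target_arch = "x86_64")]
    println!("aes-ni: {}", std::arch::is_x86_feature_detected!("aes"));
}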
Benchmark code
use aead::stream::EncryptorLE31;
use aead::stream::Nonce as StreamNonce;
use aead::stream::StreamLE31;
use aead::AeadCore;
use aead::KeyInit;
use aead::OsRng;
use aes_gcm::Aes256Gcm;
use criterion::criterion_group;
use criterion::criterion_main;
use criterion::Criterion;

// Nonce type for the LE31 STREAM construction: the AEAD nonce size minus the
// 4-byte counter/flag word, i.e. 12 - 4 = 8 bytes for AES-GCM.
type Aes256StreamNonce = StreamNonce<Aes256Gcm, StreamLE31<Aes256Gcm>>;

pub fn encrypt_64mb_direct(c: &mut Criterion) {
    // Derive the 8-byte streaming nonce from a freshly generated AEAD nonce.
    let aes256_nonce = Aes256Gcm::generate_nonce(OsRng);
    let streaming_nonce = Aes256StreamNonce::from_slice(&aes256_nonce.as_slice()[0..8]);
    let key = Aes256Gcm::generate_key(OsRng);
    let aes = Aes256Gcm::new(&key);
    let mut encryptor = EncryptorLE31::from_aead(aes, streaming_nonce);
    let mut plaintext = vec![0u8; 1024 * 1024 * 64];
    // Set a low sample size because 64MB chunks take multiple *seconds* if optimizations are off.
    let mut bench_group = c.benchmark_group("encrypt_64mb_direct");
    bench_group.sample_size(10);
    bench_group.bench_function("encrypt_64MB", |b| {
        b.iter(|| {
            // Encrypt the buffer in place with empty associated data; each
            // call advances the stream counter and appends a 16-byte GCM tag.
            encryptor
                .encrypt_next_in_place(&[], &mut plaintext)
                .unwrap();
        });
    });
}

criterion_group!(benches, encrypt_64mb_direct);
criterion_main!(benches);
Cargo.toml adjustments for opt-level
[profile.bench]
opt-level = 0
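For day-to-day debug builds, a common workaround is Cargo's per-package profile overrides, which keep just the crypto crates optimized while the rest of the workspace stays at opt-level 0. A sketch (the crate names below are the usual RustCrypto dependencies of aes-gcm; adjust to match your dependency tree):

[profile.dev.package.aes]
opt-level = 3

[profile.dev.package.aes-gcm]
opt-level = 3

[profile.dev.package.ghash]
opt-level = 3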
Yes, this is expected behavior. Our code relies heavily on cross-crate inlining for optimal performance; without inlining, every intrinsic becomes a separate function call instead of compiling down to a single instruction.
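To make that concrete, here is an illustrative sketch (the wrapper below is hypothetical, not actual code from the aes or cipher crates) of why a non-inlined intrinsic is so costly:

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128i, _mm_aesenc_si128};

// Hypothetical stand-in for one of the many generic layers in cipher/aes.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "aes")]
unsafe fn aes_round(state: __m128i, round_key: __m128i) -> __m128i {
    // When inlined (opt-level >= 1), this collapses to a single AESENC
    // instruction. At opt-level = 0 it remains a real function call, with
    // argument and return-value traffic through the stack, repeated for
    // every round of every 16-byte block.
    _mm_aesenc_si128(state, round_key)
}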