Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

corndog: use systemd-sysctl for sysctl options #333

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

aetimmes
Copy link

Issue number:

Closes bottlerocket-os/bottlerocket#4314

Description of changes:

Change corndog's sysctl function to utilize systemd-sysctl instead of writing values ephemerally to tmpfs, so that other system functions (eg Ciliums sysctl-fixup) don't completely clobber the sysctls set by corndog.

Testing done:

Unit tests added inline

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@aetimmes
Copy link
Author

I don't speak Rust natively so if there's anything in here that's not idiomatic, feel free to let me know.

Comment on lines -117 to -120
// We don't fail because sysctl keys can vary between kernel versions and depend on
// loaded modules. It wouldn't be possible to deploy settings to a mixed-kernel fleet
// if newer sysctl values failed on your older kernels, for example, and we believe
// it's too cumbersome to have to specify in settings which keys are allowed to fail.
Copy link
Author

@aetimmes aetimmes Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation of this PR doesn't maintain the functionality mentioned in this comment, but we could pretty easily achieve this by prepending - to all the sysctls passed in by default (which would then remove the need for the backup-modify-restore shuffle, which might make things simpler overall), or by documenting the ability of users to do so. Open to feedback/direction here.

config.push_str(&format!("{}={}\n", key.as_ref(), value));
}

fs::write(SYSCTL_CONFIG, config).context(error::WriteSysctlConfigSnafu)?;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to do some double-checking around:

  • Does this actually append to the file as written, or does it clobber it?
  • If we append and we write duplicate sysctls to the same config file, how does systemd-sysctl handle that? (Does the most recent sysctl win?)

I haven't yet been able to get a local build working (turns out my hard drive is smaller than I thought!) but I'll investigate these shortly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answers:

  • This clobbers the file (must fix)
  • systemd-sysctl will use the last value in a file for the same sysctl. For example, a config file with the contents:
dev.raid.speed_limit_max = 199999 
dev.raid.speed_limit_max = 199998 
garbage                           
dev.raid.speed_limit_max = 199997

and running systemctl restart systemd-sysctl will set dev.raid.speed_limit_max to 199997 and report an error when attempting to parse the "garbage" line - the return code of the systemd unit will be 1 but all of the correctly specified sysctl k/v pairs will have been applied.

systemd-sysctl also will deduplicate multiple lines and only write when there are changes to be made, so in theory endlessly appending to a file should be a reasonable way to do this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output from running systemd-sysctl with the above file in support of the above conclusions:

bash-5.1# SYSTEMD_LOG_LEVEL=debug /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-sysctl
Parsing /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/sysctl.d/50-default.conf
Parsing /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/sysctl.d/50-pid-max.conf
Parsing /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/sysctl.d/80-release.conf
Parsing /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/sysctl.d/90-kubelet.conf
Parsing /etc/sysctl.d/90-primary_interface.conf
Parsing /etc/sysctl.d/99-zzz-override_cilium.conf
Overwriting earlier assignment of net/ipv4/conf/all/rp_filter at '/etc/sysctl.d/99-zzz-override_cilium.conf:7'.
Parsing /etc/sysctl.d/99-zzz-zzz-testing.conf
Overwriting earlier assignment of dev/raid/speed_limit_max at '/etc/sysctl.d/99-zzz-zzz-testing.conf:2'.
/etc/sysctl.d/99-zzz-zzz-testing.conf:3: Line is not an assignment, ignoring: garbage
Overwriting earlier assignment of dev/raid/speed_limit_max at '/etc/sysctl.d/99-zzz-zzz-testing.conf:4'.
...
Setting '/proc/sys/dev/raid/speed_limit_max' to '199997'
No change in value '199997', suppressing write

@ytsssun
Copy link
Contributor

ytsssun commented Jan 6, 2025

Sorry for the delay, @aetimmes. And thanks so much for opening the PR for this. I will take a look at this PR and get back to you soon. I will also help verify the behavior.

Comment on lines 113 to 149
fn validate_sysctl_entry<K>(key: K, value: &str) -> Result<()>
where
K: AsRef<str>,
{
let key = key.as_ref();

// Basic validation of key format
if !key.chars().all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '_' || c == '-' || c == '*') {
return error::InvalidSysctlKeySnafu { key: key.to_string() }.fail();
}

// Ensure value doesn't contain newlines which could break the config file format
if value.contains('\n') {
return error::InvalidSysctlValueSnafu {
key: key.to_string(),
value: value.to_string()
}.fail();
}

Ok(())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We perform validation when parsing the key from userdata already. See https://github.com/bottlerocket-os/bottlerocket-settings-sdk/blob/0cf857357bc799bf036b5f5bf865d3c4af4e9256/bottlerocket-settings-models/modeled-types/src/shared.rs#L703-L733.

We may need to account for a more strict validation there though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to adjust in a couple directions - systemd-sysctl also allows for wildcards in the keys, which it looks like the current format doesn't allow for. I can open a separate PR for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of the additional validation here? Is something being allowed in as a setting that would be rejected by systemd-sysctl?

The settings API is intended to be the first and best line of defense - the only time checks should be needed at apply-time is if there's invalid input that the API isn't currently capable of detecting, like two mutually exclusive or dependent settings.

If we do need some extra validation for compatibility, then it'd be better to warn and remove the setting from the output, so that it's not a hard error. But even that is a potential compatibility break on upgrade that's best avoided, if there were sysctls that would have been applied before but that can't be applied now because of more restrictive rules.

Copy link
Author

@aetimmes aetimmes Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of the additional validation here

No purpose - I was unaware of the bottlerocket-settings-sdk repo when I took the first pass at this PR; I agree that the validation in this MR should be removed.

I would still like to update the validation in the bottlerocket-settings-sdk repo to allow for wildcards where possible since they're supported by systemd-sysctl in general, but given that that PR would be allowing for extra functionality, I don't consider it a blocker for this change.

where
K: AsRef<str>,
{
// TODO: make the priority configurable?
const SYSCTL_CONFIG: &str = "/etc/sysctl.d/95-corndog.conf";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this constant up (somewhere in L26-L28) so that we don't recreate the constant every time we invoke this function.

Also maybe add _PATH as the suffix suggesting it is the path of the config not the actual content of the config?

Comment on lines 158 to 160
for (key, value) in &sysctls {
validate_sysctl_entry(key, value)?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again validation already happened when parsing the keys.

}

// Create config string
let mut config = String::new();
for (key, value) in sysctls {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had conversation that iterating through the map entry is not ideal since it does not guarantee an ordering here.

One way to address that is to sort the sysctl entries first by the keys and then iterate based on the sorted keys. You can do something like

    let mut sysctl_vec: Vec<_> = sysctls
        .iter()
        .map(|(k,v)| (k.as_ref(), v))
        .collect();
    sysctl_vec.sort();
    for (key, value) in sysctl_vec {
         ...
    }

On another note, we may also want to preserve the order the entries are specified in the userdata. But that would require a different struct for sysctls that can preserve the order. For example BTreeMap, but it has performance trade-offs since the insertions, lookups, and deletions would not be as fast.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - we definitely want to maintain ordering becuase of the interaction between wildcards and overrides. I'm not overly worried about performance here since the upper bound of the size of this structure is probably on the order of thousands.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it looks like using a BTreeMap would complicate the serde bits a fair amount since BTreeMap doesn't implement several of the interfaces needed by serde - how do we feel about a Vec here instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a step back. I think this request is not necessarily blocking the PR. It is ok if you want to leave this part untouched. It can be tracked separately.

Comment on lines 169 to 176
let mut file = fs::OpenOptions::new()
.create(true)
.append(true)
.open(SYSCTL_CONFIG)
.context(error::OpenSysctlConfigSnafu)?;

file.write_all(config.as_bytes())
.context(error::WriteSysctlConfigSnafu)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fs::write("path/to/file.txt", config).context(error::WriteSysctlConfigSnafu)?;

This would try to create the file if it does not exist.

Comment on lines 179 to 182
let status = Command::new("systemctl")
.args(["restart", "systemd-sysctl"])
.status()
.context(error::SystemctlSnafu)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't really need to restart the whole sysctld service.

/usr/lib/systemd/systemd-sysctl /etc/sysctl.d/99-corndog.conf

Would apply this specific file to sysctld.

And could you wrap the binary path and args into constants? Something like below and put it along with the definition of the SYSCTL_CONFIG_PATH.

const SYSTEMD_SYSCTL_BIN: &str = "/usr/lib/systemd/systemd-sysctl";

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to confirm this, but I believe only applying the specific file may not necessarily respect the ordering of the rest of the config files, so sysctl restart systemd-sysctl is correct IMO.

For example, the Cilium sysctl.conf file has a prefix of 99-zzz, so applying the 95 priority corndog file by itself may clobber any entries that are also set in the Cilium conf file.

If we kept sysctl restart systemd-ctl, would we still want these wrapped into constants?

Comment on lines 184 to 186
if !status.success() {
return error::SystemctlFailedSnafu.fail();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When hitting this case, it would be also good to print the errors from the output. So I would get the output of the command and check the status and the stderr.

    let output = Command::new(SYSTEMD_SYSCTL_BIN)
        .arg(SYSCTL_CONFIG_PATH)
        .output()
        .context(ExecutionSnafu{command: format!("{} {}", SYSTEMD_SYSCTL_BIN, SYSCTL_CONFIG_PATH)})?;
    if !output.status.success() {
        warn!(
            "Failed to apply sysctl config: {}",
            String::from_utf8_lossy(&output.stderr) # account for invalid utf8 encoding
        );
    }

@ytsssun
Copy link
Contributor

ytsssun commented Jan 10, 2025

Left some comments, the change looks good overall. I will run some test with the change to verify the behavior. I will get back to you on that.

Comment on lines 113 to 149
fn validate_sysctl_entry<K>(key: K, value: &str) -> Result<()>
where
K: AsRef<str>,
{
let key = key.as_ref();

// Basic validation of key format
if !key.chars().all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '_' || c == '-' || c == '*') {
return error::InvalidSysctlKeySnafu { key: key.to_string() }.fail();
}

// Ensure value doesn't contain newlines which could break the config file format
if value.contains('\n') {
return error::InvalidSysctlValueSnafu {
key: key.to_string(),
value: value.to_string()
}.fail();
}

Ok(())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of the additional validation here? Is something being allowed in as a setting that would be rejected by systemd-sysctl?

The settings API is intended to be the first and best line of defense - the only time checks should be needed at apply-time is if there's invalid input that the API isn't currently capable of detecting, like two mutually exclusive or dependent settings.

If we do need some extra validation for compatibility, then it'd be better to warn and remove the setting from the output, so that it's not a hard error. But even that is a potential compatibility break on upgrade that's best avoided, if there were sysctls that would have been applied before but that can't be applied now because of more restrictive rules.

Comment on lines 135 to 163
/// Restarts systemd-sysctl service
fn restart_sysctl() -> Result<()> {
let status = Command::new("systemctl")
.args(["restart", "systemd-sysctl"])
.status()
.context(error::SystemctlSnafu)?;

if !status.success() {
return error::SystemctlFailedSnafu.fail();
}

Ok(())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restarting a service can lead to unexpected behavior if units depend on that service, or if for whatever reason it hasn't started yet or failed to start successfully. This is also using the blocking form which can lead to trouble in early boot.

It'd be better to avoid all of those issues and just invoke systemd-sysctl directly. I read your other comment and understand that we can't pass just the one config file, which makes sense, but the unit just runs /usr/lib/systemd/systemd-sysctl with no arguments which we can do here also.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me; I'll adjust accordingly.

@@ -18,6 +18,11 @@ use std::path::{Path, PathBuf};
use std::str::FromStr;
use std::string::String;
use std::{env, process};
use std::process::Command;
use tempfile::NamedTempFile;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't appear to be used, but it would be good to use it and follow the usual pattern:

  1. create a tempfile in the same directory as the target file
  2. write all the content to it
  3. persist the tempfile to the final target once it's ready

(We want to avoid changing the original file if some other process might have it open, and we also want to avoid repeatedly appending to the same file on subsequent runs, which could lead to unexpected behavior or resource exhaustion.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

want to avoid repeatedly appending to the same file on subsequent runs

To be explicitly clear - if I make a new api client request to set sysctl userdata on a host that already has an existing /etc/sysctl.d/95-corndog.conf file, which of these is the desired behavior?

  • append the new sysctls to the existing list of sysctls (allowing for duplicates; gets expensive if many calls are made but systemd-sysctl handles deduping for us)
  • upsert the new sysctls into the existing list of sysctls (if a previously detected entry exists, remove the existing key and append the new k/v - more expensive but no duplicates)
  • replace the old sysctls with the new ones (seems like a breaking change)

In the case of the first two options, is the suggestion above to make a copy of the existing sysctl conf file as the tempfile (as opposed to operating on the original)?

(Apologies for being pedantic; I just want to make sure I understand the requirements here)

@aetimmes
Copy link
Author

Thanks for the reviews! The feedback all makes sense (modulo the couple of open questions I have) - I'm weirdly struggling to get my local Rust install to play nice with the pre-existing tests so my follow-up will be slightly delayed while I debug my machine.

@ytsssun
Copy link
Contributor

ytsssun commented Jan 17, 2025

@aetimmes Let me know if you need help setting up the environment. Or I can even help with testing the change by building a test AMI. I can either do that with your updated PR, or you can point me to your fork, whichever way works for you.

@aetimmes aetimmes force-pushed the aet-persist-sysctls branch from 6b0507a to c1cd332 Compare January 20, 2025 15:40
@aetimmes aetimmes force-pushed the aet-persist-sysctls branch from c1cd332 to 266b460 Compare January 20, 2025 16:09
@aetimmes
Copy link
Author

Okay - turns out my "environment issues" were just me being bad at Rust and not understanding mocking.

I just force-pushed a refactor that:

  • uses temp files instead of appending
  • decomposes the top-level function into config gen/config persistence/command execution
  • does some wrapping/dependency injection to allow for running tests without #[cfg(not(test))] declarations

This is ready for re-review; with the following notes:

  • In lieu of an answer to this question above, I just assumed "we want to clobber the old config file"
  • I punted on the ordered deserialization for now but that probably does block the inclusion of wildcards in validation for now (not the end of the world for me personally since we're not currently using them in userdata 😎)
  • Do we want to use systemd-sysctl for hugepages too? (Perhaps using a separate conf file?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Persist userdata sysctl changes to /etc/sysctl.d
3 participants