
Introduce privileged-mode #9017

Merged: 5 commits into concourse:master on Feb 21, 2025
Conversation

@A1kmm (Contributor) commented Oct 24, 2024

The privileged-mode setting lets admins decide what level of privilege tasks running as privileged should have. This gives the ability to lock down privileged access to a level that isn't equivalent to full root on the host.

There are three proposed levels:

  • full: the status quo. This has multiple vectors to take over the host, including loading modules into the kernel.
  • fuse-only: enough to work with containers using tools like buildah and podman if they are configured appropriately. As long as the Concourse worker is run in a user namespace on an up-to-date Linux kernel, this shouldn't be enough access to escape the container.
  • ignore: privileged tasks have the same access as normal tasks.

To get podman and buildah working, a few more syscalls need to be allowed through seccomp. A few harmless ones have been added to the general allow list, while others related to mounting and unsharing are only added for fuse-only mode.
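
For readers skimming the diff, here is a minimal sketch (not the PR's actual code) of the general shape of that seccomp change, using the OCI runtime-spec types; the PrivilegedMode type and the specific syscall names below are illustrative assumptions, not the PR's exact list:

package seccompsketch

import (
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// PrivilegedMode stands in for however the PR models the setting.
type PrivilegedMode string

const (
	PrivilegedModeFull     PrivilegedMode = "full"
	PrivilegedModeFuseOnly PrivilegedMode = "fuse-only"
	PrivilegedModeIgnore   PrivilegedMode = "ignore"
)

// allowSyscalls appends an allow rule for the given syscall names to an
// OCI seccomp profile.
func allowSyscalls(profile *specs.LinuxSeccomp, names ...string) {
	profile.Syscalls = append(profile.Syscalls, specs.LinuxSyscall{
		Names:  names,
		Action: specs.ActAllow,
	})
}

// applyPrivilegedModeSeccomp mirrors the idea described above: a few harmless
// syscalls go on the general allow list, while mount/unshare-style syscalls
// are only allowed in fuse-only mode. The syscall names here are placeholders.
func applyPrivilegedModeSeccomp(profile *specs.LinuxSeccomp, mode PrivilegedMode) {
	allowSyscalls(profile, "pidfd_open", "pidfd_getfd")

	if mode == PrivilegedModeFuseOnly {
		allowSyscalls(profile, "mount", "umount2", "unshare", "setns", "pivot_root")
	}
}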

Changes proposed by this PR

  • Implement privileged-mode
  • Manual local testing with CONCOURSE_CONTAINERD_PRIVILEGED_MODE: full: can create a container with buildah and run it with podman.
  • Manual local testing with CONCOURSE_CONTAINERD_PRIVILEGED_MODE: fuse-only: can create a container with buildah and run it with podman, with fewer capabilities granted.
  • Manual local testing with CONCOURSE_CONTAINERD_PRIVILEGED_MODE: fuse-only: cannot escape the container using the cgroup release_agent (note: escape is still possible if the worker is not run in a new userns; setting release_agent fails when a new userns is used).
  • Manual local testing with CONCOURSE_CONTAINERD_PRIVILEGED_MODE: ignore: cannot create a container with buildah and run it with podman, as expected.
  • Write automated tests for functionality.
  • Convert from draft PR to normal PR.

Notes to reviewer

This pipeline is helpful for manual testing:

jobs:
  - name: build-container
    public: false
    plan:
    - task: build
      privileged: true
      config:
        platform: linux
        image_resource:
          type: registry-image
          source:
            repository: quay.io/buildah/stable
        run:
          path: /bin/bash
          args:
            - "-c"
            - |
              capsh --print &&\
              yum -y install podman &&\
              mkdir container-storage &&\
              ls -l /dev/fuse /usr/bin/fuse-overlayfs $(pwd) $(pwd)/container-storage &&\
              PODMAN_ROOT=$(pwd)/container-storage &&\
              echo FROM mirror.gcr.io/alpine:latest >Dockerfile &&\
              echo CMD echo Hello World >>Dockerfile &&\
              buildah bud --root=$PODMAN_ROOT -t helloworld &&\
              echo "[containers]" >/etc/containers/containers.conf &&\
              echo "keyring = false" >>/etc/containers/containers.conf &&\
              podman run --rm --uts=host --network=host --userns=host --root=$PODMAN_ROOT --cgroups=disabled -it helloworld

Release Note

  • Added a new --privileged-mode option to the worker, which accepts full (the default, original behaviour), fuse-only (privileged: true tasks can use tools like buildah and podman, but can't escape the container if user namespaces are used to run the worker), and ignore (privileged: true tasks get no extra access compared to privileged: false tasks).

@A1kmm marked this pull request as ready for review on October 25, 2024 09:56
@A1kmm requested a review from a team as a code owner on October 25, 2024 09:56
@taylorsilva added this to the v7.13.0 milestone on Dec 4, 2024
@taylorsilva (Member)

Thanks for the PR! Will review it soon (hopefully)

@taylorsilva (Member) left a comment

There's some confusing operator UX with this PR. You mention this adds a --privileged-mode flag, but it actually adds two flags: --containerd-privileged-mode and --baggageclaim-privileged-mode.

I can see that the value passed into --containerd-privileged-mode gets passed into baggageclaim. Could we consolidate to exposing only the --containerd-privileged-mode flag? That would also mean fewer flags to add to the chart and BOSH deployment. This feature only makes sense for the containerd runtime, correct?

@A1kmm (Contributor, Author) commented Jan 14, 2025

This feature only makes sense for the containerd runtime, correct?

It is only usable with containerd at the moment anyway; it would take further investigation (and possibly different options) for different backends, so limiting it to containerd makes sense.

Could we consolidate to only exposing the --containerd-privileged-mode flag instead?

The argument is only there to let the baggageclaim runner know the privileged mode. Perhaps the best solution is to pass it as a Go argument to BaggageClaimCommand.Runner instead of making it part of the command structure; that would eliminate the need for a separate flag.

If other backends start supporting a similar mode in the future, potentially the logic could then be extended to make baggageclaim do the right thing for all of them.
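
As a rough sketch of that wiring (stub types only, not the real worker command structures), the containerd flag group would own the single setting and baggageclaim would simply receive the value as a function argument:

package wiringsketch

// PrivilegedMode stands in for however the real code models the setting.
type PrivilegedMode string

// ContainerdRuntime is a stub for the containerd runtime's flag group; the
// real struct carries many more settings, and the tag below is illustrative.
type ContainerdRuntime struct {
	PrivilegedMode PrivilegedMode `long:"privileged-mode" default:"full" description:"full|fuse-only|ignore"`
}

// BaggageclaimCommand is a stub for the baggageclaim command; note that it
// has no privileged-mode flag of its own.
type BaggageclaimCommand struct{}

// Runner receives the mode as a plain Go argument instead of binding a flag.
func (b *BaggageclaimCommand) Runner(args []string, mode PrivilegedMode) error {
	_ = mode // used to decide how privileged volumes are prepared
	return nil
}

// WorkerCommand is a stub for the worker command that owns both groups.
type WorkerCommand struct {
	Containerd   ContainerdRuntime
	Baggageclaim BaggageclaimCommand
}

// baggageclaimRunner threads the single flag's value through as an argument.
func (cmd *WorkerCommand) baggageclaimRunner() error {
	return cmd.Baggageclaim.Runner(nil, cmd.Containerd.PrivilegedMode)
}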

@A1kmm (Contributor, Author) commented Jan 15, 2025

I have now updated the PR so that it no longer uses the extra argument in the baggageclaim command structure, and instead passes it as a Go argument, @taylorsilva.

@A1kmm force-pushed the privileged-mode branch 2 times, most recently from c91138c to 012b4c4 on January 16, 2025 07:40
@A1kmm requested a review from taylorsilva on January 16, 2025 09:25
@taylorsilva (Member)

Sorry for the long delay. I'm reviewing this today as I don't want to let this go super stale.

@taylorsilva (Member) commented Feb 13, 2025

Had a heck of a time getting my Linux env set up correctly to try this out. I'm on a v6.6 kernel. We should probably specify what an "up-to-date kernel" means exactly. What features does this depend on? Will a v5 kernel allow me to use this feature, or is this a v6+ kernel-only feature that we're adding here?

I cleaned up the test pipeline a bit:

jobs:
  - name: build-container
    public: false
    plan:
    - task: build
      privileged: true
      config:
        platform: linux
        image_resource:
          type: registry-image
          source:
            repository: quay.io/buildah/stable
        run:
          path: /bin/bash
          args:
            - "-c"
            - |
              set -euo pipefail
              capsh --print
              yum -y install podman &> /dev/null
              mkdir container-storage
              ls -l /dev/fuse /usr/bin/fuse-overlayfs $(pwd) $(pwd)/container-storage
              PODMAN_ROOT=$(pwd)/container-storage
              echo FROM mirror.gcr.io/alpine:latest >Dockerfile
              echo CMD echo Hello World >>Dockerfile
              buildah bud --root=$PODMAN_ROOT -t helloworld
              echo "[containers]" >/etc/containers/containers.conf
              echo "keyring = false" >>/etc/containers/containers.conf
              podman run --rm --uts=host --network=host --userns=host --root=$PODMAN_ROOT --cgroups=disabled -it helloworld

The full mode worked as usual:

Current: =eip cap_perfmon,cap_bpf,cap_checkpoint_restore-eip

Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read

Ambient set =

Current IAB: cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore

Securebits: 00/0x0/1'b0 (no-new-privs=0)

 secure-noroot: no (unlocked)

 secure-no-suid-fixup: no (unlocked)

 secure-keep-caps: no (unlocked)

 secure-no-ambient-raise: no (unlocked)

uid=0(root) euid=0(root)

gid=0(root)

groups=

Guessed mode: HYBRID (4)

crw-rw-rw- 1 root root 10, 229 Feb 13 19:06 /dev/fuse

-rwxr-xr-x 1 root root  112576 Jul 17  2024 /usr/bin/fuse-overlayfs


/tmp/build/80754af9:

total 4

drwxr-xr-x 2 root root 4096 Feb 13 19:07 container-storage


/tmp/build/80754af9/container-storage:

total 0

STEP 1/2: FROM mirror.gcr.io/alpine:latest

Trying to pull mirror.gcr.io/alpine:latest...

Getting image source signatures

Copying blob 1f3e46996e29 done   | ======>-----------------------] 1.4MiB / 3.5MiB | 21.0 MiB/s

Copying config b0c9d60fc5 done   | ------------------------------] 0.0b / 3.5MiB | 0.0 b/s

Writing manifest to image destination

STEP 2/2: CMD echo Hello World

COMMIT helloworld

Getting image source signatures

Copying blob a0904247e36a skipped: already exists  

Copying blob 5f70bf18a086 done   | 

Copying config d2778cac4c done   | 

Writing manifest to image destination

--> d2778cac4c93

Successfully tagged localhost/helloworld:latest

d2778cac4c93497895adc76e358a19bc14aaadf673debb98e42354a4904ac17a

WARN[0000] Using cgroups-v1 which is deprecated in favor of cgroups-v2 with Podman v5 and will be removed in a future version. Set environment variable `PODMAN_IGNORE_CGROUPSV1_WARNING` to hide this warning. 

Hello World

When I changed to fuse-only, podman could run the container, but failed to clean it up. The error comes from crun and makes sense: we can see earlier that the container does not have CAP_SYS_PTRACE, so opening the pidfd fails. You mention that podman and buildah need to be "configured appropriately". What needs to change in this pipeline to get this simple test to work? Or is this expected? There are obviously enough capabilities for the fuse stuff to work, so yay! But podman not being able to fully manage its containers seems not great 🙁

Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap=eip

Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap

Ambient set =

Current IAB: cap_chown,cap_dac_override,!cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,!cap_linux_immutable,cap_net_bind_service,!cap_net_broadcast,!cap_net_admin,cap_net_raw,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,cap_sys_chroot,!cap_sys_ptrace,!cap_sys_pacct,cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,cap_mknod,!cap_lease,cap_audit_write,!cap_audit_control,cap_setfcap,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore

Securebits: 00/0x0/1'b0 (no-new-privs=0)

 secure-noroot: no (unlocked)

 secure-no-suid-fixup: no (unlocked)

 secure-keep-caps: no (unlocked)

 secure-no-ambient-raise: no (unlocked)

uid=0(root) euid=0(root)

gid=0(root)

groups=

Guessed mode: HYBRID (4)

crw-rw-rw- 1 nobody nobody 10, 229 Feb 13 19:07 /dev/fuse

-rwxr-xr-x 1 root   root    112576 Jul 17  2024 /usr/bin/fuse-overlayfs


/tmp/build/80754af9:

total 4

drwxr-xr-x 2 root root 4096 Feb 13 19:08 container-storage


/tmp/build/80754af9/container-storage:

total 0

STEP 1/2: FROM mirror.gcr.io/alpine:latest

Trying to pull mirror.gcr.io/alpine:latest...

Getting image source signatures

Copying blob 1f3e46996e29 done   | =======>----------------------] 1.4MiB / 3.5MiB | 5.5 MiB/s/s

Copying config b0c9d60fc5 done   | 

Writing manifest to image destination

STEP 2/2: CMD echo Hello World

COMMIT helloworld

Getting image source signatures

Copying blob a0904247e36a skipped: already exists  

Copying blob 5f70bf18a086 done   | 

Copying config c034a79a62 done   | 

Writing manifest to image destination

--> c034a79a62c1

Successfully tagged localhost/helloworld:latest

c034a79a62c1158a0b6e9be58af4931a45d4a371d070e39dfa3d14654be45d61

WARN[0000] Using cgroups-v1 which is deprecated in favor of cgroups-v2 with Podman v5 and will be removed in a future version. Set environment variable `PODMAN_IGNORE_CGROUPSV1_WARNING` to hide this warning. 

Hello World

2025-02-13T19:08:52.259553Z: open pidfd: Operation not permitted

ERRO[0000] Removing container 4e90e98e870311500c0ad2cc7697b1bd3119a303726b56aa63c06e1161ba8c8c: cleaning up container 4e90e98e870311500c0ad2cc7697b1bd3119a303726b56aa63c06e1161ba8c8c: removing container 4e90e98e870311500c0ad2cc7697b1bd3119a303726b56aa63c06e1161ba8c8c from runtime: `/usr/bin/crun delete --force 4e90e98e870311500c0ad2cc7697b1bd3119a303726b56aa63c06e1161ba8c8c` failed: exit status 1 

@taylorsilva (Member)

Are there any existing implementations of a fuse-only privileged mode by any other container orchestrators?

That warning about cgroups-v1 bothered me, so I updated my kernel params to disable cgroups v1. I still get the same result, with podman failing to clean up the container:

open pidfd: Operation not permitted
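
One cheap way to narrow this down from inside a privileged: true task is to call pidfd_open directly and see whether the syscall itself is being denied (for example by the seccomp profile) rather than crun lacking privilege over another process. This is a diagnostic sketch, not part of the PR:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Opening a pidfd for our own PID needs no capabilities over another
	// process, so a failure here points at the syscall being blocked
	// (e.g. by seccomp) rather than at missing CAP_SYS_PTRACE.
	fd, err := unix.PidfdOpen(unix.Getpid(), 0)
	if err != nil {
		fmt.Println("pidfd_open failed:", err)
		return
	}
	defer unix.Close(fd)
	fmt.Println("pidfd_open succeeded, fd =", fd)
}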

@A1kmm (Contributor, Author) commented Feb 13, 2025

We should probably specify what "up to date kernel" exactly means. What features does this exactly depend on? Will a v5 kernel allow me to use this feature? or is this a v6+ kernel only feature that we're adding here?

There shouldn't be any v6+-only feature in use. I'm more hoping to provide guidance to admins so they can ensure security with fuse-only mode; if their kernel has a local privilege escalation bug, they shouldn't have a false sense of security that this will protect them.

In particular, the security of this relies on:

  • A kernel patched for CVE-2022-0492 - without that, CAP_SYS_ADMIN even in a non-default user namespace is enough to set up a cgroups v1 release agent that can escape the container (i.e. allows escalation to full root on the host).
  • Running the Concourse Worker in a non-default user namespace, so the protection of fuse-only mode is effective. When running with podman-compose, I tested with the following in docker-compose.yml for the worker:
    cgroupns_mode: private
    userns_mode: auto:size=65536

(alongside setting the environment CONCOURSE_CONTAINERD_PRIVILEGED_MODE: "fuse-only").

With Docker, it is a central daemon setting: https://docs.docker.com/engine/security/userns-remap/

If admins don't remap user namespaces, they might get a false sense of confidence from fuse-only privileged mode (it will still work, it just won't be secure). I suggest conveying this at least in the release notes (and I'm happy to write up some more docs for this if you feel that is warranted).

ERRO[0000] Removing container 4e90e98e870311500c0ad2cc7697b1bd3119a303726b56aa63c06e1161ba8c8c: cleaning up container 4e90e98e870311500c0ad2cc7697b1bd3119a303726b56aa63c06e1161ba8c8c: removing container 4e90e98e870311500c0ad2cc7697b1bd3119a303726b56aa63c06e1161ba8c8c from runtime: /usr/bin/crun delete --force 4e90e98e870311500c0ad2cc7697b1bd3119a303726b56aa63c06e1161ba8c8c failed: exit status 1

I wonder if it is non-deterministic whether this shows up in the logs. I remember seeing it once, but it disappeared again after adding another privilege; that might have just been non-determinism, though. The error might happen every time but not always make it into the logs.

I'll investigate a bit more to see if I can find a way to reproduce it more consistently, so I can tell if @analytically's proposed new allowed syscalls are sufficient.

This avoids an error that occurs cleaning up containers.

Signed-off-by: Andrew Miller <[email protected]>
@A1kmm (Contributor, Author) commented Feb 14, 2025

Are there any existing implementations of a fuse-only privileged mode by any other container orchestrators?

I think many other orchestrators let you do this, but they are more tightly coupled to Linux containers, so they do it by providing ways to customise seccomp policies and capabilities.

I think that would be different for Concourse, because Concourse has a design principle of not coupling tightly to this type of implementation detail. That is one of the things that makes Concourse great to use, but it also means a feature like this is a bit more painful to implement than in other orchestrators: we have to come up with an upfront, opinionated set of capabilities and allowed system calls that is both secure and works for the use case of building and testing containers within Concourse.

@A1kmm (Contributor, Author) commented Feb 14, 2025

I'll investigate a bit more to see if I can find a way to reproduce it more consistently, so I can tell if @analytically's proposed new allowed syscalls are sufficient.

BTW, just to update on this: your tidied-up script seems more likely to reproduce the issue than my original. But with the additional allowed syscalls (in the latest version of this PR), I haven't been able to reproduce it after multiple runs.

@taylorsilva (Member) left a comment

Thanks for the PR, Andrew! Thanks for your patience as well; this took longer than it should have for me to review and merge.

@taylorsilva (Member)

@A1kmm Writing up some docs would be very helpful for operators. I was thinking maybe adding a subsection to the containerd section here: https://concourse-ci.org/concourse-worker.html#containerd-runtime

You can find the page backing that section here: https://github.com/concourse/docs/blob/1c16e34c6d3a013289d11cbc15fdd8f60ef3218f/lit/docs/install/worker.lit#L496

@taylorsilva merged commit 7cb34d0 into concourse:master on Feb 21, 2025
11 checks passed
@A1kmm (Contributor, Author) commented Feb 21, 2025

Thanks for reviewing, Taylor. I really appreciate that you are maintaining Concourse, and I completely understand the time it took, given that it is a complex new feature.

For cross-reference: I just created a documentation PR corresponding to this PR over at: concourse/docs#549

@A1kmm deleted the privileged-mode branch on February 21, 2025 11:11
@aliculPix4D (Contributor) commented Feb 21, 2025

@taylorsilva just curious: did you maybe try to rebuild the worker binaries after you merged this PR?

I just synced our fork with upstream master and I get this error when trying to build the worker binaries:

cmd.Containerd.PrivilegedMode undefined (type ContainerdRuntime has no field or method PrivilegedMode)

Still not sure if this is related to our fork or introduced by the PR... so sorry for the noise if this is coming from our side.

UPDATE: it failed to build the darwin (arm64) and windows workers; linux is apparently fine.
$ CGO=0 GOOS=$goos GOARCH=$goarch go build -o concourse/bin/$CONCOURSE -ldflags "$ldflags" ./cmd/concourse

@taylorsilva (Member)

UPDATE: it failed to build darwin and windows workers, linux is apparently fine.

I did not try those! Thanks for noting that, I'll look into it and get a fix in.

@@ -265,5 +265,5 @@ func (cmd *WorkerCommand) baggageclaimRunner(logger lager.Logger) (ifrit.Runner,

 	cmd.Baggageclaim.OverlaysDir = filepath.Join(cmd.WorkDir.Path(), "overlays")

-	return cmd.Baggageclaim.Runner(nil)
+	return cmd.Baggageclaim.Runner(nil, cmd.Containerd.PrivilegedMode)
@aliculPix4D (Contributor) commented Feb 21, 2025

worker/workercmd/worker.go:268:53: cmd.Containerd.PrivilegedMode undefined (type ContainerdRuntime has no field or method PrivilegedMode)

cmd.Containerd.PrivilegedMode is undefined when building for a different OS (darwin/windows)
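
For context on why only the darwin and windows builds break: the error suggests ContainerdRuntime has separate definitions for linux and non-linux builds, and the new field was only added to the linux one, while worker/workercmd/worker.go (which references cmd.Containerd.PrivilegedMode) compiles on every platform. One possible shape of a fix, purely an assumption and not the actual follow-up commit, would be to declare the field in the non-linux stub as well:

// Hypothetical non-linux stub, assuming the real project splits
// ContainerdRuntime across build-tagged files; names and tags here are
// illustrative, not the fix that actually landed.

//go:build !linux

package workercmd

// PrivilegedMode stands in for however the real code models the setting.
type PrivilegedMode string

// ContainerdRuntime is the non-linux stub of the containerd configuration.
// Declaring PrivilegedMode here keeps platform-independent code such as
// baggageclaimRunner compiling, even though the setting only matters on linux.
type ContainerdRuntime struct {
	PrivilegedMode PrivilegedMode `long:"privileged-mode" default:"full" hidden:"true"`
}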

@aliculPix4D (Contributor) commented Feb 21, 2025

Sorry, I didn't go through the whole PR; I got interested in this mostly because our darwin and windows workers failed to build. But now I am curious: what happens on linux when someone uses the Garden backend?

@taylorsilva (Member)

Nothing different happens with the Garden backend. This is a containerd-specific setting. Everything defaults to the current status quo.
