Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for finding oxide processes on a sled #7194

Conversation

papertigers
Copy link
Contributor

@papertigers papertigers commented Dec 2, 2024

This PR adds support for finding all oxide processes via libcontract and relying on the fact that we deploy oxide services with the fmri prefix svc:/oxide/. This allows us to find every pid started via smf even if the smf script uses ctrun.

You can see this for yourself by running pfexec ctstat -av on a helios/illumos based system:

205     1       process owned   2492    0       -       -       
        cookie:                0
        informative event set: core
        critical event set:    hwerr empty
        fatal event set:       hwerr
        parameter set:         noorphan regent
        member processes:      2497
        inherited contracts:   none
        service fmri:          svc:/oxide/wicketd:default
        service fmri ctid:     204
        creator:               ctrun
        aux:                   
249     1       process owned   1334    0       -       -       
        cookie:                0x20
        informative event set: none
        critical event set:    hwerr empty
        fatal event set:       none
        parameter set:         inherit regent
        member processes:      2794
        inherited contracts:   none
        service fmri:          svc:/oxide/dendrite:default
        service fmri ctid:     249
        creator:               svc.startd
        aux:                   start
254     1       process owned   1334    0       -       -       
        cookie:                0x20
        informative event set: none
        critical event set:    hwerr empty
        fatal event set:       none
        parameter set:         inherit regent
        member processes:      2802
        inherited contracts:   none
        service fmri:          svc:/oxide/lldpd:default
        service fmri ctid:     254
        creator:               svc.startd
        aux:                   start

This PR is on top of:

Created using spr 1.3.6-beta.1
@papertigers papertigers marked this pull request as draft December 2, 2024 22:21
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
@papertigers papertigers marked this pull request as ready for review December 5, 2024 15:45
@papertigers
Copy link
Contributor Author

@citrus-it I added you as a reviewer - if you have time do you mind looking over the libcontract bindings this is using?

// The lifetime of this string is tied to the lifetime of the status
// handle itself and will be cleaned up when the handle is freed.
let mut ptr: *mut c_char = std::ptr::null_mut();
libcall_io!(ct_pr_status_get_svc_fmri(self.handle, &mut ptr)).map_err(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From https://illumos.org/man/3CONTRACT/ct_pr_status_get_svc_creator#:~:text=The%20ct_pr_status_get_svc_fmri(),not%20be%20modified

  The ct_pr_status_get_svc_fmri(), ct_pr_status_get_svc_creator(), and
  ct_pr_status_get_svc_aux() functions read, respectively, the service
  FMRI, the contract's creator execname and  the creator's auxiliary field.
  The buffer pointed to by fmri, aux or creator, is freed by a call to
  ct_status_free() and should not be modified.

I'm under the impression that the ptr value here must be freed via a call to ct_status_free - is that right? or is this just saying that the lifetime of the returned result must be shorter than ContractStatus?

If it's the former (the ptr value must be freed explicitly): Are we leaking?
If it's the latter (the ptr values lives as long as ContractStatus): Is this code unsound? Would it be possible to keep using the returned CString after self has been dropped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ptr returned is tied to the lifetime of the contract status handle ct_stathdl_t. We read that pointer as a cstr but the function returns a CString which copies the data to own it. If I understand correctly the ptr returned from the call to ct_pr_status_get_svc_fmri will not leave anything around because it will be cleaned up when ct_status_free is called via the drop method on ContractStatus.

I will verify this, but this was at least my intention upon writing this code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha, so:

        let cstr = unsafe { CStr::from_ptr(ptr) }

Would create a cstr that needs to have a shorter lifetime than the handle, but:

        cstr.to_owned()

Would clone it, and free us from that lifetime constraint when the function returns

@hawkw
Copy link
Member

hawkw commented Dec 10, 2024

I wonder if it would be valuable to add an omdb command that does this and dumps the output, too? Of course, you can just run the appropriate ctstat commands for that, but it could maybe be nice to have a way to do this baked in?

@papertigers
Copy link
Contributor Author

I wonder if it would be valuable to add an omdb command that does this and dumps the output, too? Of course, you can just run the appropriate ctstat commands for that, but it could maybe be nice to have a way to do this baked in?

Yes! The plan is to introduce a diagnostics command that lets you call the same endpoints that the support bundle sagas are using to collect all of the info.

Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
@papertigers papertigers merged commit 1d19061 into spr/papertigers/main.add-support-for-finding-oxide-processes-on-a-sled Jan 7, 2025
27 checks passed
@papertigers papertigers deleted the spr/papertigers/add-support-for-finding-oxide-processes-on-a-sled branch January 7, 2025 05:10
papertigers added a commit that referenced this pull request Jan 7, 2025
papertigers added a commit that referenced this pull request Jan 8, 2025
…7228)

This introduces the `SledDiagnosticsQueryOutput` type so support bundles
can now have structured output in collected files. This makes it
potentially easier for an operator to write scripts that search for
particular commands based on various fields.

This is on top of:
- #7194
- #7193
papertigers added a commit that referenced this pull request Jan 8, 2025
This PR adds support for gathering `pfiles` output for all Oxide
processes on a sled.

This is on top of:
- #7228 
- #7194
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants