Truncated core file when COMP_COMPRESSION is set to "true" #165

Open
amikugup opened this issue Oct 16, 2024 · 11 comments

@amikugup

We are observing a strange issue with the IBM core dump handler: we get a truncated core file when the COMP_COMPRESSION flag is set to "true". gdb complains about the truncated file; the core file size is close to 900 MB, while gdb expects a core file size of 3 GB.

We didn't see any such issue when we turned off compression: we got a full core file and gdb was happy.
Is this a known issue with the compression flag?

@pereyra-m

Hi.

I'm having problems with big dumps too; they can't be read with gdb.
I'll try without compression.

@No9
Collaborator

No9 commented Nov 6, 2024

Let me know how you get on.
We use the zip crate and just use 'COMP_COMPRESSION' as a flag, so I'd say it's likely a bug in that crate.

zip::CompressionMethod::Deflated

Looks like a lot has been added to zip, as it's now on version 2.2.0, so a PR with a bump would be appreciated.
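For context, the compression path boils down to streaming the core through a zip::ZipWriter with the Deflated method. Here is a minimal sketch of that pattern (not the handler's actual code; it assumes the pre-2.x FileOptions API and uses hypothetical file names):

    use std::fs::File;
    use std::io::copy;
    use zip::write::FileOptions;
    use zip::{CompressionMethod, ZipWriter};

    fn compress_core(core_path: &str, zip_path: &str) -> zip::result::ZipResult<()> {
        let mut input = File::open(core_path)?;
        let mut writer = ZipWriter::new(File::create(zip_path)?);
        // large_file(true) enables ZIP64 so entries over 4 GiB are accepted.
        let options = FileOptions::default()
            .compression_method(CompressionMethod::Deflated)
            .large_file(true);
        writer.start_file("core", options)?;
        // Stream the core data through the deflate encoder.
        copy(&mut input, &mut writer)?;
        writer.finish()?;
        Ok(())
    }
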

Thanks

@pereyra-m

Hi again.

We were using version 8.6.0, and even when the flag was set to "false", the dumps were uploaded compressed and the big ones were corrupted.
The release notes show that this was solved in recent versions, so we upgraded to 8.10.0 and now it's working.

@amikugup
Author

amikugup commented Nov 7, 2024

We are already using v8.10.0, but that doesn't solve the problem; we still need to disable compression to make this work. @pereyra-m, it would be helpful if you could share more details about the core file sizes you have tried and the configuration values you are using.

@pereyra-m

We don't have any special configuration, and we noticed the corruption when the dumps were larger than roughly 1 GB.
The error was something like

Failed to read a valid object file image from memory.

Maybe in your case it's something else.

@amikugup
Author

amikugup commented Dec 2, 2024

We are seeing this issue without compression as well in certain scenarios.
Has this issue been fixed in the latest release? Any thoughts on this, IBM-CDH team?

@No9
Collaborator

No9 commented Dec 2, 2024

Can you set the composer log level to Debug?
See: https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.yaml#L28

logLevel: "Debug"

Once the issue arises, provide the output of cat /var/mnt/core-dump-handler/composer.log from an agent on a node that has collected a core dump.
N.B. The location of composer.log will depend on your mountpoint settings if you have overridden them.

Thanks

@connectrajeev

Hello IBM-CDH team,

This response is on behalf of @amikugup; here are the requested debug logs. We have deleted certain entries from the file, as we felt they contained our setup and proprietary details. Do let us know if we removed any relevant information from composer.log.

composer.log

During further debugging of this issue, we concluded that it might not be related to IBM-CDH. Instead, the problem seems to be related to the Linux pipe: the Linux kernel writes core-dump data very fast and the CDC is unable to consume it at the same speed, causing the pipe to overflow. As a result, the CDC misses a portion of the data.

We were unable to find a way to increase the Linux pipe size, since it appears to be a read-only parameter according to ulimit. If you know of any method to increase the Linux pipe size, please share it with us; we would like to try it and see whether it prevents the core file from being truncated.
Thanks.

@No9
Collaborator

No9 commented Dec 6, 2024

Hi @connectrajeev

I agree it's likely the issue is upstream, as I am not seeing an Error writing core file message in the composer log.

Can you confirm the following, please:
Host operating system with version number
Whether the file is consistently truncated at a certain size (e.g. 900 MB, as @amikugup originally stated)

I don't think it's the pipe size, as I would expect the OS to block until it's read, but I would need to read the kernel core dump code to confirm.
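That said, if you want to experiment with the pipe capacity: ulimit isn't the tunable here; the per-pipe ceiling for unprivileged processes is /proc/sys/fs/pipe-max-size (root can raise it with sysctl fs.pipe-max-size), and a reader can request a larger buffer on an individual pipe with fcntl(F_SETPIPE_SZ). A minimal sketch using the libc crate, assuming the core data arrives on stdin (fd 0); this is only an idea to try, not something the handler does today:

    use std::{fs, io};

    fn grow_stdin_pipe(requested: libc::c_int) -> io::Result<libc::c_int> {
        // Report the unprivileged ceiling; raising it needs root (sysctl fs.pipe-max-size).
        let max = fs::read_to_string("/proc/sys/fs/pipe-max-size")?;
        eprintln!("fs.pipe-max-size = {}", max.trim());

        // Ask the kernel to enlarge the pipe buffer behind stdin.
        let new_size = unsafe { libc::fcntl(0, libc::F_SETPIPE_SZ, requested) };
        if new_size < 0 {
            return Err(io::Error::last_os_error());
        }
        // The kernel may round the requested size up to a power of two.
        Ok(new_size)
    }
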

@connectrajeev

Hello @No9,

Thanks for your response!
The core file size of our application is somewhere between 5 GB and 10 GB, and we have noticed that the core file is truncated only under some specific test cases of our application; if we manually send a core-generating signal to the application, the resulting core file is not truncated.

Host OS: "Oracle Linux Server 9.3"
What are the possibilities for a core file to be truncated if there are pending signals for the application to process and, at the same time, the application receives a SIGSEGV?

@No9
Collaborator

No9 commented Dec 7, 2024

OK, this is progress. I think the core dump will take precedence over all signals.
Looking at the core dump code in the kernel
https://github.com/torvalds/linux/blob/18bf34080c4c3beb6699181986cc97dd712498fe/fs/coredump.c#L567
I would suggest using dmesg and looking for kernel warning messages to make sure the above assumption is true.

I've not tested this on Oracle Linux Server at all and don't have access to one with a k8s config, so I can only suggest ideas at this stage.
Does the host have systemd-coredump installed and running? systemctl list-units | grep core
