
Add retry logic to image layer fetching and decompression #291

Merged
2 commits merged into jiria/solar on Jan 16, 2025

Conversation

@miz060 (Member) commented on Jan 14, 2025

Merge Checklist
  • Followed patch format from upstream recommendation: https://github.com/kata-containers/community/blob/main/CONTRIBUTING.md#patch-format
    • Included a single commit in a given PR, unless there are related commits and each makes sense as a change on its own.
  • Aware that the PR will be merged using "create a merge commit" rather than "squash and merge" (or similar)
  • The upstream/missing label (or upstream/not-needed) has been set on the PR.
Summary
Test Methodology

@miz060 requested review from a team as code owners on January 14, 2025 at 00:26
src/tardev-snapshotter/src/snapshotter.rs
}

warn!("Retrying layer image download...");
continue; // Retry fetching the layer image
Member:

Not sure, but would it make sense to sleep here for a bit? Presumably we will run against the deadline that containerd has, so cannot sleep for too long.

Member Author:

Yes, I agree. It would make sense to sleep a bit if it's truly a network issue. I think the specific timeout is managed by the client (k8s) rather than by containerd itself. Given that, I set the sleep time to 500ms for now.
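
As a minimal sketch of that retry-with-sleep shape (the helper name and retry count are illustrative, not the exact code in snapshotter.rs):

const MAX_RETRIES: u32 = 3;
let mut attempt: u32 = 0;
loop {
    // fetch_and_unpack_layer is a hypothetical per-attempt helper that
    // downloads and unpacks one layer (sketched further below).
    match fetch_and_unpack_layer(&reference, &staging_path).await {
        Ok(()) => break,
        Err(e) if attempt < MAX_RETRIES => {
            attempt += 1;
            warn!("Retrying layer image download (attempt {}): {:?}", attempt, e);
            // Keep the pause short so we stay inside the client's image-pull deadline.
            tokio::time::sleep(std::time::Duration::from_millis(500)).await;
        }
        Err(e) => return Err(e),
    }
}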

src/tardev-snapshotter/src/snapshotter.rs
file.rewind().context("failed to rewind the file handle")?;
tarindex::append_index(&mut file).context("failed to append tar index")?;
// Process the layer
let process_result = tokio::task::spawn_blocking({
Member:

Should the download itself be part of the spawn_blocking block?

Member Author:

Moved the download itself into the new function that runs inside the while loop.
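
A rough sketch of that split, with the download done up front and the blocking gunzip/indexing kept in spawn_blocking; fetch_layer_bytes and open_staging_file are stand-in names, not the PR's actual functions:

async fn fetch_and_unpack_layer(
    reference: &str,
    staging_path: &std::path::Path,
) -> anyhow::Result<()> {
    use anyhow::Context;
    use std::io::Seek;

    // Network-bound part: download the compressed layer bytes (async).
    let compressed: Vec<u8> = fetch_layer_bytes(reference).await?;

    // CPU-bound part: gunzip and index on a blocking thread.
    let mut file = open_staging_file(staging_path)?;
    tokio::task::spawn_blocking(move || -> anyhow::Result<()> {
        let mut gz_decoder = flate2::read::GzDecoder::new(&compressed[..]);
        std::io::copy(&mut gz_decoder, &mut file)
            .context("failed to copy payload from gz decoder")?;
        file.rewind().context("failed to rewind the file handle")?;
        tarindex::append_index(&mut file).context("failed to append tar index")?;
        Ok(())
    })
    .await??;

    Ok(())
}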

if let Err(e) = std::io::copy(&mut gz_decoder, &mut file) {
let copy_error = format!("failed to copy payload from gz decoder {:?}", e);
error!("{}", copy_error);
return Err(anyhow::anyhow!(copy_error));
Member:

This is the error we hit; should we trigger retry with a new download here as well? Or what are we doing to resolve it?

failed to extract image layer: failed to copy payload from gz decoder Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }: unknown

Member Author:

Yes, failing here will trigger a new download.
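
One way the fresh download can start from a clean slate, assuming (purely for illustration; the PR may handle the partially written file differently) the scratch file is reopened with truncation on each attempt:

// Hypothetical open_staging_file used by the helper sketched earlier:
// truncate(true) discards whatever the failed gz copy already wrote,
// so the retried download decompresses into an empty file.
fn open_staging_file(path: &std::path::Path) -> anyhow::Result<std::fs::File> {
    use anyhow::Context;
    std::fs::OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)
        .context("failed to (re)create the layer staging file")
}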

@miz060 merged commit 66d2248 into jiria/solar on Jan 16, 2025
41 of 52 checks passed
@miz060 deleted the mitchzhu/add_retry branch on January 17, 2025 at 01:28