[#2083] improvement: Quickly delete local or HDFS data at the shuffleId level. #2084

Open
wants to merge 2 commits into
base: master

Conversation

yl09099
Contributor

@yl09099 yl09099 commented Aug 23, 2024

What changes were proposed in this pull request?

Deleting shuffle data at the shuffleId level, whether on local disk or on HDFS, is currently done synchronously. In some scenarios the deletion time needs to be shortened. This PR renames the folders first and then deletes the renamed folders asynchronously.
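
To illustrate the rename-then-delete idea described above, here is a minimal Java sketch for a local directory. It is not the code in this PR: the class name `RenameThenDelete`, the `_to_delete_` suffix and the single-thread executor are assumptions made for the example, and the rename is only cheap when source and target are on the same filesystem.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

// Hypothetical helper, not a class from the PR.
class RenameThenDelete {
  // Hypothetical suffix used to mark directories that are waiting for removal.
  static final String SUFFIX = "_to_delete_";
  private final ExecutorService cleaner = Executors.newSingleThreadExecutor();

  // Phase 1 (caller thread): rename the directory so the original path is free again.
  // Phase 2 (background thread): recursively delete the renamed directory.
  Future<?> delete(Path shuffleDir) throws IOException {
    Path renamed =
        shuffleDir.resolveSibling(shuffleDir.getFileName() + SUFFIX + System.nanoTime());
    Files.move(shuffleDir, renamed, StandardCopyOption.ATOMIC_MOVE);
    return cleaner.submit(() -> deleteRecursively(renamed));
  }

  // Walks the tree and deletes children before their parent directories.
  static void deleteRecursively(Path dir) {
    try (Stream<Path> walk = Files.walk(dir)) {
      walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  void shutdown() {
    cleaner.shutdown();
  }
}
```

The caller only waits for the rename; the returned `Future` can be ignored or used to observe when the physical deletion has finished.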

Why are the changes needed?

Fix: #2083

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT.

@jerqi jerqi requested a review from zuston August 23, 2024 06:45

github-actions bot commented Aug 23, 2024

Test Results

 2 966 files  ±0   2 966 suites  ±0   6h 27m 53s ⏱️ - 1m 6s
 1 096 tests ±0   1 094 ✅ ±0   2 💤 ±0  0 ❌ ±0 
13 735 runs  ±0  13 705 ✅ ±0  30 💤 ±0  0 ❌ ±0 

Results for commit 75aaa4f. ± Comparison against base commit bd7c2cc.

♻️ This comment has been updated with latest results.

@yl09099 yl09099 force-pushed the uniffle-2083 branch 2 times, most recently from ecf9e44 to 6052399 Compare August 23, 2024 08:00
@codecov-commenter

codecov-commenter commented Aug 23, 2024

Codecov Report

Attention: Patch coverage is 11.57895% with 168 lines in your changes missing coverage. Please review.

Project coverage is 52.60%. Comparing base (34bf686) to head (9947af7).
Report is 11 commits behind head on master.

Files with missing lines Patch % Lines
.../handler/impl/HadoopShuffleAsyncDeleteHandler.java 0.00% 47 Missing ⚠️
...storage/handler/impl/AsynDeletionEventManager.java 0.00% 45 Missing ⚠️
...rage/handler/impl/LocalFileAsyncDeleteHandler.java 0.00% 30 Missing ⚠️
...che/uniffle/storage/handler/AsynDeletionEvent.java 0.00% 18 Missing ⚠️
...uniffle/storage/factory/ShuffleHandlerFactory.java 0.00% 8 Missing ⚠️
...age/request/CreateShuffleDeleteHandlerRequest.java 0.00% 7 Missing ⚠️
...pache/uniffle/server/ShuffleServerGrpcService.java 0.00% 3 Missing ⚠️
.../org/apache/uniffle/server/ShuffleTaskManager.java 60.00% 2 Missing ⚠️
...e/uniffle/server/storage/HadoopStorageManager.java 60.00% 1 Missing and 1 partial ⚠️
...he/uniffle/server/storage/LocalStorageManager.java 66.66% 1 Missing and 1 partial ⚠️
... and 2 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2084      +/-   ##
============================================
+ Coverage     51.84%   52.60%   +0.75%     
- Complexity     2864     3532     +668     
============================================
  Files           469      534      +65     
  Lines         23879    29317    +5438     
  Branches       1966     2731     +765     
============================================
+ Hits          12380    15421    +3041     
- Misses        10726    12905    +2179     
- Partials        773      991     +218     


@yl09099 yl09099 force-pushed the uniffle-2083 branch 5 times, most recently from 8a37d06 to b264a11 Compare August 25, 2024 15:13
Member

@zuston zuston left a comment

Could you share which cases need a shorter deletion time?

@yl09099
Contributor Author

yl09099 commented Aug 26, 2024

Could you share which cases need a shorter deletion time?

During a stage retry, the shuffle data blocks have to be deleted from local disk or HDFS.

@yl09099 yl09099 force-pushed the uniffle-2083 branch 2 times, most recently from b7a438c to ccbe953 Compare August 27, 2024 01:58
@yl09099
Contributor Author

yl09099 commented Aug 27, 2024

@zuston Could you help re-trigger the failing module? I cannot reproduce the error locally.

@yl09099 yl09099 force-pushed the uniffle-2083 branch 2 times, most recently from 96e5fbe to 8126263 Compare September 1, 2024 13:56
@yl09099 yl09099 requested a review from zuston September 1, 2024 14:24
Member

@zuston zuston left a comment

Thanks for your effort on this feature for stage retry. One comment: how is this enabled by config? I hope this feature can be scoped behind an explicit config option and be disabled by default.

}

@Override
public void removeResources(PurgeEvent event, boolean isQuick) {
Member

Could we make softDeletion/isQuick an internal variable of PurgeEvent?

deleteHandler.quickDelete(asynchronousDeleteEvent);
boolean isSucess = quickNeedDeletePaths.offer(asynchronousDeleteEvent);
if (!isSucess) {
LOG.warn(
Member

This is an abnormal case that will leak data. Metrics should be added for this case for better observability.
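
A minimal sketch of what counting such failures could look like is shown below. It is an assumption-laden illustration, not the PR's code: `DeletionQueue` and `enqueueFailures` are hypothetical names, and the `AtomicLong` stands in for whatever counter type the server metrics actually use.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper around the bounded queue of renamed paths awaiting deletion.
class DeletionQueue {
  private final BlockingQueue<String> pending;
  // Stand-in for a real metrics counter; the name is an assumption.
  private final AtomicLong enqueueFailures = new AtomicLong();

  DeletionQueue(int capacity) {
    pending = new ArrayBlockingQueue<>(capacity);
  }

  // offer() returns false instead of blocking when the queue is full.
  // A rejected path is never deleted by the background task, so it is counted.
  boolean enqueue(String renamedPath) {
    boolean accepted = pending.offer(renamedPath);
    if (!accepted) {
      enqueueFailures.incrementAndGet();
    }
    return accepted;
  }

  long failureCount() {
    return enqueueFailures.get();
  }
}
```

Exposing the counter makes a full queue visible on a dashboard instead of only in a warning log line.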


HadoopStorageManager(ShuffleServerConf conf) {
super(conf);
hadoopConf = conf.getHadoopConf();
shuffleServerId = conf.getString(ShuffleServerConf.SHUFFLE_SERVER_ID, "shuffleServerId");
isStorageAuditLogEnabled = conf.getBoolean(ShuffleServerConf.SERVER_STORAGE_AUDIT_LOG_ENABLED);
Runnable clearNeedDeletePathTask =
Member

Why not integrate this async deletion logic into an underlying class such as SingleStorageManager, so that the localfile and hadoop storage types can share it?

Contributor Author

All comments have been addressed.

Member

Got it. Let me take a look again

@yl09099 yl09099 force-pushed the uniffle-2083 branch 10 times, most recently from 0af2052 to a6e8d64 Compare September 9, 2024 06:26
@@ -227,7 +227,7 @@ public void registerShuffle(
       taskInfo.refreshLatestStageAttemptNumber(shuffleId, stageAttemptNumber);
       try {
         long start = System.currentTimeMillis();
-        shuffleServer.getShuffleTaskManager().removeShuffleDataSync(appId, shuffleId);
+        shuffleServer.getShuffleTaskManager().softRemoveShuffleDataSync(appId, shuffleId);
Member

I hope this can be enabled by an extra config option.

Member

Sorry for my previous comment. I think this deletion could be named TwoPhaseDeletion, which consists of 2 phases:

  1. Soft deletion
  2. Hard deletion

The original deletion is the hard deletion, so we could extract an abstract class to get a good abstraction.
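
One way to read this suggestion is as a template-method base class with one hook per phase. The sketch below is only an interpretation under stated assumptions: `TwoPhaseDeleteHandler`, `InMemoryHandler`, the hook names `rename`/`remove` and the `.deleting` suffix are hypothetical, and the in-memory set stands in for a real filesystem.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical base class: subclasses supply the storage-specific steps.
abstract class TwoPhaseDeleteHandler {
  private final Queue<String> pendingRemoval = new ConcurrentLinkedQueue<>();

  // Phase 1 hook: move the data out of the way and return its new location.
  protected abstract String rename(String path);

  // Phase 2 hook: physically remove the renamed data.
  protected abstract void remove(String renamedPath);

  // Called on the request path; only pays for the rename.
  public final void delete(String path) {
    pendingRemoval.add(rename(path));
  }

  // Called from a background task; removes everything renamed so far
  // and returns how many entries were removed.
  public final int purge() {
    int removed = 0;
    String path;
    while ((path = pendingRemoval.poll()) != null) {
      remove(path);
      removed++;
    }
    return removed;
  }
}

// Toy implementation that tracks paths in a set instead of touching a disk.
class InMemoryHandler extends TwoPhaseDeleteHandler {
  final Set<String> paths = ConcurrentHashMap.newKeySet();

  @Override
  protected String rename(String path) {
    String renamed = path + ".deleting";
    paths.remove(path);
    paths.add(renamed);
    return renamed;
  }

  @Override
  protected void remove(String renamedPath) {
    paths.remove(renamedPath);
  }
}
```

With this shape, the original one-step deletion is simply a handler that calls the removal directly, while local and Hadoop variants differ only in the two hooks.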

Contributor

Could you use another concept? Because hard deletion makes me think of rm --force. Maybe you can use rename directly.

@@ -157,6 +157,9 @@ public class ShuffleServerMetrics {
public static final String TOPN_OF_ON_HADOOP_DATA_SIZE_FOR_APP =
"topN_of_on_hadoop_data_size_for_app";

private static final String TOTAL_HADOOP_SOFT_DELETE_FAILED = "total_hadoop_soft_delete_failed";
Member

total_hadoop_two_phases_deletion_failed

@@ -157,6 +157,9 @@ public class ShuffleServerMetrics {
public static final String TOPN_OF_ON_HADOOP_DATA_SIZE_FOR_APP =
"topN_of_on_hadoop_data_size_for_app";

private static final String TOTAL_HADOOP_SOFT_DELETE_FAILED = "total_hadoop_soft_delete_failed";
private static final String TOTAL_LOCAL_SOFT_DELETE_FAILED = "total_local_soft_delete_failed";
Member

ditto

}
} else {
deleteHandler.delete(deletePaths.toArray(new String[deletePaths.size()]), appId, user);
}
removeAppStorageInfo(event);
Member

Will this affect the metrics analysis when using the two-phase deletion?

@yl09099 yl09099 force-pushed the uniffle-2083 branch 2 times, most recently from bb11961 to 9e9721f Compare September 12, 2024 08:47
@yl09099 yl09099 closed this Oct 28, 2024
@yl09099
Contributor Author

yl09099 commented Oct 28, 2024

Rewrote the deletion logic.

@yl09099 yl09099 reopened this Oct 28, 2024
@jerqi jerqi changed the title from "[Improvement] Quickly delete local or HDFS data at the shuffleId level." to "[#2083] improvement: Quickly delete local or HDFS data at the shuffleId level." Oct 28, 2024
@jerqi jerqi requested a review from zuston October 28, 2024 12:59
@yl09099 yl09099 force-pushed the uniffle-2083 branch 9 times, most recently from b8194df to 93be639 Compare November 4, 2024 05:11
@jerqi
Contributor

jerqi commented Nov 21, 2024

@zuston Could you help me review this pull request?

@zuston
Member

zuston commented Nov 25, 2024

@zuston Could you help me review this pull request?

Yes, I will review this within the next 3 days.

@@ -25,11 +25,19 @@ public abstract class PurgeEvent {
private String appId;
private String user;
private List<Integer> shuffleIds;
// Quick Delete or not.
private boolean isTwoPhasesDeletion;
Contributor

The name TwoPhaseDeletion confuses me. Could you give it a better name? Could you also add some comments about renaming and deleting the shuffle files?

import org.apache.uniffle.storage.request.CreateShuffleDeleteHandlerRequest;
import org.apache.uniffle.storage.util.StorageType;

public class AsynDeletionEventManager {
Contributor

Could you add some comments?

Contributor Author

Could you add some comments?

Your comments above have been addressed.

import org.apache.uniffle.storage.handler.AsynDeletionEvent;
import org.apache.uniffle.storage.handler.api.ShuffleDeleteHandler;
import org.apache.uniffle.storage.util.StorageType;

Contributor

HadoopFilesystem already uses renaming in my view. This seems unnecessary for HDFS.

Contributor Author

HadoopFilesystem already uses renaming in my view. This seems unnecessary for HDFS.

OK, let me delete that.

Contributor

@jerqi jerqi left a comment

LGTM.

Development

Successfully merging this pull request may close these issues.

[Improvement] Quickly delete local or HDFS data at the shuffleId level.
4 participants