-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#2083] improvement: Quickly delete local or HDFS data at the shuffleId level. #2084
base: master
Are you sure you want to change the base?
Conversation
ecf9e44
to
6052399
Compare
8a37d06
to
b264a11
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you help share which case will need to shorten the deletion time.
During the Stage retry, delete the shuffle data block from the disk or hdfs. |
b7a438c
to
ccbe953
Compare
@zuston Help trigger the error module, I have no local error. |
96e5fbe
to
8126263
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your effort on this feature for the stage retry. One comment is how to enable this by config. I hope this feature could be scoped in the explicility config option and be disable by default.
} | ||
|
||
@Override | ||
public void removeResources(PurgeEvent event, boolean isQuick) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make the softDeletion/isQuick
as the internal variable in the PurgeEvent?
deleteHandler.quickDelete(asynchronousDeleteEvent); | ||
boolean isSucess = quickNeedDeletePaths.offer(asynchronousDeleteEvent); | ||
if (!isSucess) { | ||
LOG.warn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is abnormal that will make the data leaked. For this case, the metrics should be added for better observability
|
||
HadoopStorageManager(ShuffleServerConf conf) { | ||
super(conf); | ||
hadoopConf = conf.getHadoopConf(); | ||
shuffleServerId = conf.getString(ShuffleServerConf.SHUFFLE_SERVER_ID, "shuffleServerId"); | ||
isStorageAuditLogEnabled = conf.getBoolean(ShuffleServerConf.SERVER_STORAGE_AUDIT_LOG_ENABLED); | ||
Runnable clearNeedDeletePathTask = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not intergrating this part async deletion into the underlying class like SingleStorageManager for localfile and hadoop storage type to share
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All comments are complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Let me take a look again
0af2052
to
a6e8d64
Compare
a6e8d64
to
9145b24
Compare
@@ -227,7 +227,7 @@ public void registerShuffle( | |||
taskInfo.refreshLatestStageAttemptNumber(shuffleId, stageAttemptNumber); | |||
try { | |||
long start = System.currentTimeMillis(); | |||
shuffleServer.getShuffleTaskManager().removeShuffleDataSync(appId, shuffleId); | |||
shuffleServer.getShuffleTaskManager().softRemoveShuffleDataSync(appId, shuffleId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope this could be enabled by the extra config option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for my previous comment, I think this deletion could be named as TwoPhaseDeletion, which will include 2 phases
- Soft deletion
- Hard deletion
And for the original deletion way is the hard deletion, we could extra the abstract class to have a good abstraction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use another concept? Because hard deletion makes me think of rm --force
. Maybe you can use rename
directly.
@@ -157,6 +157,9 @@ public class ShuffleServerMetrics { | |||
public static final String TOPN_OF_ON_HADOOP_DATA_SIZE_FOR_APP = | |||
"topN_of_on_hadoop_data_size_for_app"; | |||
|
|||
private static final String TOTAL_HADOOP_SOFT_DELETE_FAILED = "total_hadoop_soft_delete_failed"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
total_hadoop_two_phases_deletion_failed
@@ -157,6 +157,9 @@ public class ShuffleServerMetrics { | |||
public static final String TOPN_OF_ON_HADOOP_DATA_SIZE_FOR_APP = | |||
"topN_of_on_hadoop_data_size_for_app"; | |||
|
|||
private static final String TOTAL_HADOOP_SOFT_DELETE_FAILED = "total_hadoop_soft_delete_failed"; | |||
private static final String TOTAL_LOCAL_SOFT_DELETE_FAILED = "total_local_soft_delete_failed"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
} | ||
} else { | ||
deleteHandler.delete(deletePaths.toArray(new String[deletePaths.size()]), appId, user); | ||
} | ||
removeAppStorageInfo(event); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this will effect the metrics analysis when using the 2 phase deletion?
bb11961
to
9e9721f
Compare
Overwrite delete logic. |
b8194df
to
93be639
Compare
@zuston Could you help me review this pull request? |
Yes, I will review this in the later 3 days. |
@@ -25,11 +25,19 @@ public abstract class PurgeEvent { | |||
private String appId; | |||
private String user; | |||
private List<Integer> shuffleIds; | |||
// Quick Delete or not. | |||
private boolean isTwoPhasesDeletion; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TwoPhaseDeletion
make me confusing. Could you give a better name? Could you add some comments about renaming and deleting the shuffle files?
import org.apache.uniffle.storage.request.CreateShuffleDeleteHandlerRequest; | ||
import org.apache.uniffle.storage.util.StorageType; | ||
|
||
public class AsynDeletionEventManager { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you give some comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you give some comments?
Your above message has been modified.
import org.apache.uniffle.storage.handler.AsynDeletionEvent; | ||
import org.apache.uniffle.storage.handler.api.ShuffleDeleteHandler; | ||
import org.apache.uniffle.storage.util.StorageType; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HadoopFilesystem already uses renaming
in my view. This seems unnecessary for HDFS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HadoopFilesystem already uses
renaming
in my view. This seems unnecessary for HDFS.
OK,Let me delete that.
7f17f5e
to
a49aa84
Compare
a49aa84
to
75aaa4f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
What changes were proposed in this pull request?
At the shuffleId level, data on the local or HDFS needs to be deleted synchronously. In some scenarios, the deletion time needs to be shortened. You can rename folders and delete them asynchronously.
Why are the changes needed?
Fix: #2083
Does this PR introduce any user-facing change?
No.
How was this patch tested?
UT.