From 51a4041628b58724ec3b208da53d7509ef809941 Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Fri, 13 Dec 2024 17:58:29 -0800
Subject: [PATCH 1/2] HDDS-11932. [Website v2] [Docs] [User Guide] DistCP integration

---
 .../03-integrations/08-distcp.md | 40 +++++++++++++++++++
 1 file changed, 40 insertions(+)
 create mode 100644 docs/04-user-guide/03-integrations/08-distcp.md

diff --git a/docs/04-user-guide/03-integrations/08-distcp.md b/docs/04-user-guide/03-integrations/08-distcp.md
new file mode 100644
index 000000000..4e93eaeff
--- /dev/null
+++ b/docs/04-user-guide/03-integrations/08-distcp.md
@@ -0,0 +1,40 @@
+# Hadoop DistCp
+
+Hadoop DistCp is a command-line, MapReduce-based tool for bulk data copying.
+
+The `hadoop distcp` command can be used to copy data to/from Ozone and any Hadoop-compatible file system, such as HDFS or S3A.
+
+## Basic usage
+
+To copy files from a source Ozone cluster directory to a destination Ozone cluster directory:
+```bash
+ hadoop distcp ofs://ozone1/vol1/bucket/dir1 ofs://ozone2/vol2/bucket2/dir2
+```
+
+> You must have both ozone1 and ozone2 cluster service ID defined in ozone-site.xml configuration file.
+
+## Copy between Ozone and HDFS
+
+DistCp performs a file checksum check to ensure file integrity. Because the default checksum types of HDFS (CRC32C) and Ozone (CRC32) are different, the checksum check will fail the DistCp job. To prevent job failures, specify checksum options in the `distcp` command to force Ozone to use the same checksum type as HDFS. For example:
+
+```bash
+hadoop distcp \
+  -Ddfs.checksum.combine.mode=COMPOSITE_CRC \
+  -Dozone.client.checksum.type=CRC32C \
+  hdfs://ns1/tmp ofs://ozone1/vol1/bucket1/dst
+```
+The parameter `-Ddfs.checksum.combine.mode=COMPOSITE_CRC` is not required if the HDFS cluster is on Hadoop 3.1.1 or above.
+
+Alternatively, skip the file checksum check:
+
+```bash
+hadoop distcp \
+  -skipcrccheck \
+  hdfs://ns1/tmp ofs://ozone1/vol1/bucket1/dst
+```
+
+## Encrypted data
+
+When data is in an HDFS encryption zone or an encrypted Ozone bucket, file checksums will not match: the destination encrypts the data with a new EDEK, so the underlying block data differs even when the file contents are identical. In this case, specify the `-skipcrccheck` parameter to avoid job failures.
+
+For more information on using Hadoop DistCp, consult the [DistCp Guide](https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html).

From ce924a0ac93ae852aacc616db919b3ecb62c4c84 Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Fri, 13 Dec 2024 18:04:45 -0800
Subject: [PATCH 2/2] Spelling.

---
 cspell.yaml                                     | 3 +++
 docs/04-user-guide/03-integrations/08-distcp.md | 6 ++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/cspell.yaml b/cspell.yaml
index 68515b9fa..01a2e799d 100644
--- a/cspell.yaml
+++ b/cspell.yaml
@@ -91,6 +91,7 @@ words:
 - mds
 - javadoc
 - JVM
+- distcp
 # Misc words
 - acking
 - dashboarding
@@ -99,3 +100,5 @@ words:
 - UX
 - devs
 - CLI
+- EDEK
+- skipcrccheck

diff --git a/docs/04-user-guide/03-integrations/08-distcp.md b/docs/04-user-guide/03-integrations/08-distcp.md
index 4e93eaeff..7c6fd70ba 100644
--- a/docs/04-user-guide/03-integrations/08-distcp.md
+++ b/docs/04-user-guide/03-integrations/08-distcp.md
@@ -7,11 +7,12 @@ The `hadoop distcp` command can be used to copy data to/from Ozone and any Hadoo
 ## Basic usage
 
 To copy files from a source Ozone cluster directory to a destination Ozone cluster directory:
+
 ```bash
  hadoop distcp ofs://ozone1/vol1/bucket/dir1 ofs://ozone2/vol2/bucket2/dir2
 ```
 
-> You must have both ozone1 and ozone2 cluster service ID defined in ozone-site.xml configuration file.
+> You must have both `ozone1` and `ozone2` cluster service IDs defined in the `ozone-site.xml` configuration file.
 
 ## Copy between Ozone and HDFS
 
@@ -23,7 +24,8 @@ hadoop distcp \
   -Dozone.client.checksum.type=CRC32C \
   hdfs://ns1/tmp ofs://ozone1/vol1/bucket1/dst
 ```
-The parameter `-Ddfs.checksum.combine.mode=COMPOSITE_CRC` is not required if the HDFS cluster is on Hadoop 3.1.1 or above.
+
+> The parameter `-Ddfs.checksum.combine.mode=COMPOSITE_CRC` is not required if the HDFS cluster is on Hadoop 3.1.1 or above.
 
 Alternatively, skip the file checksum check:
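
The docs page added above notes that both cluster service IDs must be defined in `ozone-site.xml` before `ofs://ozone1/...` and `ofs://ozone2/...` can both resolve from one client. A minimal sketch of such a client configuration follows; the node IDs, hostnames, and single-OM layout are illustrative assumptions, not part of the patch — consult the Ozone OM HA configuration docs for the authoritative key names and multi-node setups:

```xml
<!-- Illustrative ozone-site.xml fragment: defines the two OM service IDs
     (ozone1 and ozone2) used in the distcp example. Hostnames and node IDs
     below are placeholders. -->
<configuration>
  <property>
    <name>ozone.om.service.ids</name>
    <value>ozone1,ozone2</value>
  </property>
  <property>
    <name>ozone.om.nodes.ozone1</name>
    <value>om1</value>
  </property>
  <property>
    <name>ozone.om.address.ozone1.om1</name>
    <value>om1.cluster1.example.com:9862</value>
  </property>
  <property>
    <name>ozone.om.nodes.ozone2</name>
    <value>om1</value>
  </property>
  <property>
    <name>ozone.om.address.ozone2.om1</name>
    <value>om1.cluster2.example.com:9862</value>
  </property>
</configuration>
```

With service IDs defined this way, the `ofs://<service-id>/...` paths in the examples above are resolved through the corresponding OM addresses.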