HDF5: Empiric for Optimal Chunk Size (#916)
* HDF5: Empiric for Optimal Chunk Size

This ports a prior empirical algorithm from libSplash to determine
an optimal (large) chunk size for an HDF5 dataset based on its
datatype and global extent.

Original implementation by Felix Schmitt @f-schmitt (ZIH, TU Dresden)
in
[libSplash](https://github.com/ComputationalRadiationPhysics/libSplash).

Original source:
- https://github.com/ComputationalRadiationPhysics/libSplash/blob/v1.7.0/src/DCDataSet.cpp
- https://github.com/ComputationalRadiationPhysics/libSplash/blob/v1.7.0/src/include/splash/core/DCHelper.hpp

Co-authored-by: Felix Schmitt <[email protected]>

* Add scaffolding for JSON options in HDF5

* HDF5: Finish Chunking JSON/Env control

* HiPACE (legacy) pipeline: no chunking

The parallel, independent I/O pattern used here is a corner case of what
HDF5 can support, due to its non-collective declaration of data sets.
Testing shows that it does not work with chunking.

* CI: no HDF5 Chunking with Sanitizer

Runs into a timeout for unclear reasons with this patch:
```
15/32 Test #15: MPI.8_benchmark_parallel ...............***Timeout 1500.17 sec
```

* Apply suggestions from code review

Co-authored-by: Franz Pöschel <[email protected]>

Co-authored-by: Felix Schmitt <[email protected]>
Co-authored-by: Franz Pöschel <[email protected]>
3 people authored Jun 24, 2021
1 parent baaf349 commit 8c2d9ce
Showing 14 changed files with 268 additions and 36 deletions.
13 changes: 10 additions & 3 deletions .github/workflows/unix.yml
@@ -43,16 +43,23 @@ jobs:
python3 -m pip install -U numpy
sudo .github/workflows/dependencies/install_spack
- name: Build
env: {CC: mpicc, CXX: mpic++, OMPI_CC: clang-10, OMPI_CXX: clang++-10, CXXFLAGS: -Werror -Wno-deprecated-declarations}
env: {CC: mpicc, CXX: mpic++, OMPI_CC: clang-10, OMPI_CXX: clang++-10, CXXFLAGS: -Werror -Wno-deprecated-declarations, OPENPMD_HDF5_CHUNKS: none}
run: |
eval $(spack env activate --sh .github/ci/spack-envs/clangtidy_nopy_ompi_h5_ad1_ad2/)
spack install
SOURCEPATH="$(pwd)"
mkdir build && cd build
../share/openPMD/download_samples.sh && chmod u-w samples/git-sample/*.h5
export LDFLAGS="${LDFLAGS} -fsanitize=address,undefined -shared-libsan"
CXXFLAGS="${CXXFLAGS} -fsanitize=address,undefined -shared-libsan"
CXXFLAGS="${CXXFLAGS}" cmake -S .. -B . -DopenPMD_USE_MPI=ON -DopenPMD_USE_PYTHON=ON -DopenPMD_USE_HDF5=ON -DopenPMD_USE_ADIOS2=ON -DopenPMD_USE_ADIOS1=ON -DopenPMD_USE_INVASIVE_TESTS=ON -DCMAKE_VERBOSE_MAKEFILE=ON
export CXXFLAGS="${CXXFLAGS} -fsanitize=address,undefined -shared-libsan"
cmake -S .. -B . \
-DopenPMD_USE_MPI=ON \
-DopenPMD_USE_PYTHON=ON \
-DopenPMD_USE_HDF5=ON \
-DopenPMD_USE_ADIOS2=ON \
-DopenPMD_USE_ADIOS1=ON \
-DopenPMD_USE_INVASIVE_TESTS=ON \
-DCMAKE_VERBOSE_MAKEFILE=ON
cmake --build . --parallel 2
export ASAN_OPTIONS=detect_stack_use_after_return=1:detect_leaks=1:check_initialization_order=true:strict_init_order=true:detect_stack_use_after_scope=1:fast_unwind_on_malloc=0
export LSAN_OPTIONS=suppressions="$SOURCEPATH/.github/ci/sanitizer/clang/Leak.supp"
4 changes: 4 additions & 0 deletions docs/source/backends/hdf5.rst
@@ -26,6 +26,7 @@ environment variable default description
===================================== ========= ====================================================================================
``OPENPMD_HDF5_INDEPENDENT`` ``ON`` Sets the MPI-parallel transfer mode to collective (``OFF``) or independent (``ON``).
``OPENPMD_HDF5_ALIGNMENT`` ``1`` Tuning parameter for parallel I/O, choose an alignment which is a multiple of the disk block size.
``OPENPMD_HDF5_CHUNKS`` ``auto`` Defaults for ``H5Pset_chunk``: ``"auto"`` (heuristic) or ``"none"`` (no chunking).
``H5_COLL_API_SANITY_CHECK`` unset Set to ``1`` to perform an ``MPI_Barrier`` inside each meta-data operation.
===================================== ========= ====================================================================================

@@ -40,6 +41,9 @@ According to the `HDF5 documentation <https://support.hdfgroup.org/HDF5/doc/RM/H
*For MPI IO and other parallel systems, choose an alignment which is a multiple of the disk block size.*
On Lustre filesystems, according to the `NERSC documentation <https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?start=5>`_, it is advised to set this to the Lustre stripe size. In addition, ORNL Summit GPFS users are recommended to set the alignment value to 16777216 (16 MiB).

``OPENPMD_HDF5_CHUNKS``: this sets the default for data chunking via `H5Pset_chunk <https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_chunk.htm>`__.
Chunking generally improves performance and only needs to be disabled in corner cases, e.g. when heavily relying on independent, parallel I/O that declares data records non-collectively.
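
For illustration, a minimal sketch (not part of this patch; the file name is hypothetical) of selecting the chunking default from application code. The variable must be set before the ``Series`` (and thus the HDF5 backend) is created; ``setenv`` is POSIX.

```cpp
// Sketch: disable HDF5 chunking for this process via the environment
// variable documented above; set it before the Series is opened.
#include <openPMD/openPMD.hpp>

#include <cstdlib> // setenv (POSIX)

int main()
{
    // "auto" (default) selects heuristic chunk sizes, "none" disables chunking
    setenv("OPENPMD_HDF5_CHUNKS", "none", /* overwrite = */ 1);

    openPMD::Series series("data_%T.h5", openPMD::Access::CREATE);
    // ... declare records and store data as usual ...
    return 0;
}
```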

``H5_COLL_API_SANITY_CHECK``: this is an HDF5 control option for debugging parallel I/O logic (API calls).
Debugging a parallel program with this option enabled can help to spot bugs such as collective MPI calls that are not issued by all participating MPI ranks.
Do not use it in production; it will slow down parallel I/O operations.
17 changes: 17 additions & 0 deletions docs/source/details/backendconfig.rst
@@ -74,6 +74,23 @@ Explanation of the single keys:
Any setting specified under ``adios2.dataset`` is applicable globally as well as on a per-dataset level.
Any setting under ``adios2.engine`` is applicable globally only.

HDF5
^^^^

A full configuration of the HDF5 backend:

.. literalinclude:: hdf5.json
:language: json

All keys found under ``hdf5.dataset`` are applicable globally (future: as well as per dataset).
Explanation of the single keys:

* ``hdf5.dataset.chunks``: This key contains options for data chunking via `H5Pset_chunk <https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_chunk.htm>`__.
The default is ``"auto"`` for a heuristic.
``"none"`` can be used to disable chunking.
Chunking generally improves performance and only needs to be disabled in corner cases, e.g. when heavily relying on independent, parallel I/O that declares data records non-collectively.
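
For illustration, this key can also be passed per ``Series`` through the JSON options parameter of its constructor; a minimal sketch (not part of this patch; the file name is hypothetical):

```cpp
// Sketch: select the HDF5 chunking default via the JSON options string
// accepted as the third constructor argument of Series.
#include <openPMD/openPMD.hpp>

int main()
{
    auto const cfg = R"({
        "hdf5": {
            "dataset": {
                "chunks": "none"
            }
        }
    })";
    openPMD::Series series("data_%T.h5", openPMD::Access::CREATE, cfg);
    return 0;
}
```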


Other backends
^^^^^^^^^^^^^^

7 changes: 7 additions & 0 deletions docs/source/details/hdf5.json
@@ -0,0 +1,7 @@
{
"hdf5": {
"dataset": {
"chunks": "auto"
}
}
}
20 changes: 17 additions & 3 deletions include/openPMD/IO/HDF5/HDF5Auxiliary.hpp
@@ -1,4 +1,4 @@
/* Copyright 2017-2021 Fabian Koller
/* Copyright 2017-2021 Fabian Koller, Felix Schmitt, Axel Huebl
*
* This file is part of openPMD-api.
*
@@ -34,7 +34,6 @@

namespace openPMD
{
#if openPMD_HAVE_HDF5
struct GetH5DataType
{
std::unordered_map< std::string, hid_t > m_userTypes;
@@ -54,5 +53,20 @@ namespace openPMD
std::string
concrete_h5_file_position(Writable* w);

#endif
/** Computes the chunk dimensions for a dataset.
*
* Chunk dimensions are selected to create chunk sizes between
* 64 KiB and 4 MiB. Smaller chunk sizes are inefficient due to overhead,
* while larger chunks do not map well to file system blocks and striding.
*
* Chunk dimensions are less than or equal to the dataset dimensions and
* do not need to be a factor of the respective dataset dimension.
*
* @param[in] dims dimensions of dataset to get chunk dims for
* @param[in] typeSize size of each element in bytes
* @return array for resulting chunk dimensions
*/
std::vector< hsize_t >
getOptimalChunkDims( std::vector< hsize_t > const dims,
size_t const typeSize );
} // namespace openPMD
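
For intuition, a small usage sketch of the declaration above (not part of the patch): a 2048 x 2048 dataset of ``double`` holds 32 MiB in total, so the heuristic can reach its upper chunk-size target of 4 MiB.

```cpp
// Sketch: expected behavior of the chunking heuristic for a 2D
// dataset of doubles (8 bytes per element).
#include "openPMD/IO/HDF5/HDF5Auxiliary.hpp"

#include <iostream>
#include <vector>

int main()
{
    std::vector<hsize_t> const dims{2048u, 2048u};
    auto const chunks = openPMD::getOptimalChunkDims(dims, sizeof(double));
    for (auto const extent : chunks)
        std::cout << extent << ' '; // 1024 512 -> 1024 * 512 * 8 B = 4 MiB
    std::cout << '\n';
    return 0;
}
```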
4 changes: 3 additions & 1 deletion include/openPMD/IO/HDF5/HDF5IOHandler.hpp
@@ -22,6 +22,8 @@

#include "openPMD/IO/AbstractIOHandler.hpp"

#include <nlohmann/json.hpp>

#include <future>
#include <memory>
#include <string>
@@ -34,7 +36,7 @@ class HDF5IOHandlerImpl;
class HDF5IOHandler : public AbstractIOHandler
{
public:
HDF5IOHandler(std::string path, Access);
HDF5IOHandler(std::string path, Access, nlohmann::json config);
~HDF5IOHandler() override;

std::string backendName() const override { return "HDF5"; }
5 changes: 4 additions & 1 deletion include/openPMD/IO/HDF5/HDF5IOHandlerImpl.hpp
@@ -24,6 +24,7 @@
#if openPMD_HAVE_HDF5
# include "openPMD/IO/AbstractIOHandlerImpl.hpp"

# include "openPMD/auxiliary/JSON.hpp"
# include "openPMD/auxiliary/Option.hpp"

# include <hdf5.h>
@@ -38,7 +39,7 @@ namespace openPMD
class HDF5IOHandlerImpl : public AbstractIOHandlerImpl
{
public:
HDF5IOHandlerImpl(AbstractIOHandler*);
HDF5IOHandlerImpl(AbstractIOHandler*, nlohmann::json config);
~HDF5IOHandlerImpl() override;

void createFile(Writable*, Parameter< Operation::CREATE_FILE > const&) override;
@@ -77,6 +78,8 @@
hid_t m_H5T_CLONG_DOUBLE;

private:
auxiliary::TracingJSON m_config;
std::string m_chunks = "auto";
struct File
{
std::string name;
7 changes: 5 additions & 2 deletions include/openPMD/IO/HDF5/ParallelHDF5IOHandler.hpp
@@ -23,6 +23,8 @@
#include "openPMD/config.hpp"
#include "openPMD/IO/AbstractIOHandler.hpp"

#include <nlohmann/json.hpp>

#include <future>
#include <memory>
#include <string>
@@ -36,9 +38,10 @@ namespace openPMD
{
public:
#if openPMD_HAVE_MPI
ParallelHDF5IOHandler(std::string path, Access, MPI_Comm);
ParallelHDF5IOHandler(
std::string path, Access, MPI_Comm, nlohmann::json config);
#else
ParallelHDF5IOHandler(std::string path, Access);
ParallelHDF5IOHandler(std::string path, Access, nlohmann::json config);
#endif
~ParallelHDF5IOHandler() override;

4 changes: 3 additions & 1 deletion include/openPMD/IO/HDF5/ParallelHDF5IOHandlerImpl.hpp
@@ -27,6 +27,7 @@
# include <mpi.h>
# if openPMD_HAVE_HDF5
# include "openPMD/IO/HDF5/HDF5IOHandlerImpl.hpp"
# include <nlohmann/json.hpp>
# endif
#endif

@@ -37,7 +38,8 @@ namespace openPMD
class ParallelHDF5IOHandlerImpl : public HDF5IOHandlerImpl
{
public:
ParallelHDF5IOHandlerImpl(AbstractIOHandler*, MPI_Comm);
ParallelHDF5IOHandlerImpl(
AbstractIOHandler*, MPI_Comm, nlohmann::json config);
~ParallelHDF5IOHandlerImpl() override;

MPI_Comm m_mpiComm;
6 changes: 4 additions & 2 deletions src/IO/AbstractIOHandlerHelper.cpp
@@ -45,7 +45,8 @@ namespace openPMD
switch( format )
{
case Format::HDF5:
return std::make_shared< ParallelHDF5IOHandler >( path, access, comm );
return std::make_shared< ParallelHDF5IOHandler >(
path, access, comm, std::move( options ) );
case Format::ADIOS1:
# if openPMD_HAVE_ADIOS1
return std::make_shared< ParallelADIOS1IOHandler >( path, access, comm );
@@ -80,7 +81,8 @@
switch( format )
{
case Format::HDF5:
return std::make_shared< HDF5IOHandler >( path, access );
return std::make_shared< HDF5IOHandler >(
path, access, std::move( options ) );
case Format::ADIOS1:
#if openPMD_HAVE_ADIOS1
return std::make_shared< ADIOS1IOHandler >( path, access );
92 changes: 91 additions & 1 deletion src/IO/HDF5/HDF5Auxiliary.cpp
@@ -1,4 +1,4 @@
/* Copyright 2017-2021 Fabian Koller, Axel Huebl
/* Copyright 2017-2021 Fabian Koller, Felix Schmitt, Axel Huebl
*
* This file is part of openPMD-api.
*
@@ -30,10 +30,12 @@

# include <array>
# include <complex>
# include <map>
# include <stack>
# include <stdexcept>
# include <string>
# include <typeinfo>
# include <vector>

# if openPMD_USE_VERIFY
# define VERIFY(CONDITION, TEXT) { if(!(CONDITION)) throw std::runtime_error((TEXT)); }
@@ -306,4 +308,92 @@ openPMD::concrete_h5_file_position(Writable* w)
return auxiliary::replace_all(pos, "//", "/");
}


std::vector< hsize_t >
openPMD::getOptimalChunkDims( std::vector< hsize_t > const dims,
size_t const typeSize )
{
auto const ndims = dims.size();
std::vector< hsize_t > chunk_dims( dims.size() );

// chunk sizes in KiByte
constexpr std::array< size_t, 7u > CHUNK_SIZES_KiB
{{4096u, 2048u, 1024u, 512u, 256u, 128u, 64u}};

size_t total_data_size = typeSize;
size_t max_chunk_size = typeSize;
size_t target_chunk_size = 0u;

// compute the order of dimensions (descending)
// large dataset dimensions should have larger chunk sizes
std::multimap<hsize_t, uint32_t> dims_order;
for (uint32_t i = 0; i < ndims; ++i)
dims_order.insert(std::make_pair(dims[i], i));

for (uint32_t i = 0; i < ndims; ++i)
{
// start with a chunk extent of 1 in each dimension
chunk_dims[i] = 1;

// try to make at least two chunks for each dimension
size_t half_dim = dims[i] / 2;

// compute sizes
max_chunk_size *= (half_dim > 0) ? half_dim : 1;
total_data_size *= dims[i];
}

// compute the target chunk size
for( auto const & chunk_size : CHUNK_SIZES_KiB )
{
target_chunk_size = chunk_size * 1024;
if (target_chunk_size <= max_chunk_size)
break;
}
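// example: for a 2048 x 2048 dataset of doubles (8 bytes per element),
// max_chunk_size is 1024 * 1024 * 8 B = 8 MiB, so the largest candidate
// that fits, 4096 KiB = 4 MiB, becomes the target chunk size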

size_t current_chunk_size = typeSize;
size_t last_chunk_diff = target_chunk_size;
std::multimap<hsize_t, uint32_t>::const_iterator current_index =
dims_order.begin();

while (current_chunk_size < target_chunk_size)
{
// test if increasing chunk size optimizes towards target chunk size
size_t chunk_diff = target_chunk_size - (current_chunk_size * 2u);
if (chunk_diff >= last_chunk_diff)
break;

// find next dimension to increase chunk size for
int can_increase_dim = 0;
for (uint32_t d = 0; d < ndims; ++d)
{
int current_dim = current_index->second;

// increasing chunk size possible
if (chunk_dims[current_dim] * 2 <= dims[current_dim])
{
chunk_dims[current_dim] *= 2;
current_chunk_size *= 2;
can_increase_dim = 1;
}

current_index++;
if (current_index == dims_order.end())
current_index = dims_order.begin();

if (can_increase_dim)
break;
}

// cannot increase the chunk size in any dimension:
// we must use the current chunk sizes
if (!can_increase_dim)
break;

last_chunk_diff = chunk_diff;
}

return chunk_dims;
}

#endif
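
To connect the helper to the HDF5 C API, a hedged sketch (the backend's actual call site lives in the handler sources not shown on this page): the returned extents feed directly into ``H5Pset_chunk`` on a dataset-creation property list.

```cpp
// Sketch: apply the computed chunk extents to a dataset creation
// property list; assumes a non-scalar (rank >= 1) dataset.
#include "openPMD/IO/HDF5/HDF5Auxiliary.hpp"

#include <hdf5.h>

#include <cstddef>
#include <vector>

hid_t makeChunkedDCPL(std::vector<hsize_t> const &dims, std::size_t typeSize)
{
    auto const chunks = openPMD::getOptimalChunkDims(dims, typeSize);
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    // pass the rank and one chunk extent per dimension
    H5Pset_chunk(dcpl, static_cast<int>(chunks.size()), chunks.data());
    return dcpl; // use with H5Dcreate2; caller releases it via H5Pclose
}
```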