Add pre-commit and corrections
kevinstratford committed May 2, 2023
1 parent 6d7ca21 commit 37fda66
Showing 21 changed files with 161 additions and 67 deletions.
18 changes: 18 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,18 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
exclude: '.svg$'
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.2.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
  # codespell: note that --ignore-words may be read as "allow words"
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.2
    hooks:
      - id: codespell
        args:
          - --ignore-words=spelling.txt
7 changes: 4 additions & 3 deletions README.md
@@ -94,11 +94,12 @@ the advertised start and finish times, and the break times.
| 14:00 | Some performance considerations | |
| | Exercise on matrix operation | [section-2.03](section-2.03) |
| 15:00 | Break | |
-| 15:20 | More on memory: managed memory | |
+| 15:20 | Managed memory | |
| | Exercise on managed memory | [section-2.04](section-2.04) |
-| 15:50 | More on memory: shared memory | |
+| 15:50 | Shared memory | |
| 16:10 | Exercise on vector product | [section-2.05](section-2.05) |
-| 16:30 | All together: matrix-vector product | [][] |
+| 16:30 | Constant memory | |
+| 16:40 | All together: matrix-vector product | [section-2.06](section-2.06) |
| 17:00 | Close | |


18 changes: 9 additions & 9 deletions section-1.01/README.md
@@ -13,7 +13,7 @@ Correspondingly, there might be a number of factors which could be
taken into consideration in a performance model:

1. Clock speed: the rate of issue of instructions by the processor
-2. Memory latency: time taken to retreive a data item from memory
+2. Memory latency: time taken to retrieve a data item from memory
3. Memory bandwidth: amount of data transferred in unit time
4. Parallelism: can I replicate the basic unit above?
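
A rough worked estimate may help to fix ideas: the time to process an array that must be streamed from memory is, to first approximation, the latency plus the data volume divided by the bandwidth. The figures in the sketch below are illustrative assumptions, not measurements.
```
/* Illustrative estimate only: latency + bytes/bandwidth for streaming
 * an array of n doubles. The numbers are assumed, not measured. */

#include <stdio.h>

int main(void) {

  double n   = 1.0e8;      /* number of doubles           */
  double lat = 1.0e-7;     /* assumed latency: 100 ns     */
  double bw  = 100.0e9;    /* assumed bandwidth: 100 GB/s */

  double t = lat + 8.0*n/bw;   /* 8 bytes per double */

  printf("Estimated time: %e seconds\n", t);

  return 0;
}
```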

@@ -27,8 +27,8 @@ Historically, increases in CPU performance have been related to increases
in clock speed. However, owing largely to power constraints, most modern
processors have a clock speed of around 2-3 GHz.

-Absent some unforseen fundamental breakthrough, it is not expected that
-this fundamental speed will increase signficantly in the future.
+Absent some unforeseen fundamental breakthrough, it is not expected that
+this fundamental speed will increase significantly in the future.

### Memory latency

@@ -40,7 +40,7 @@ operated on in a register).

CPUs mitigate this problem by having caches: memory that is
"closer" to the processor, and so reduces the time for access.
-Many caches are heirarchical in nature: the nearer the processor
+Many caches are hierarchical in nature: the nearer the processor
the smaller the cache size in bytes, but the faster the access.
These are typically referred to as Level 1, Level 2, Level 3,
(L1, L2, L3) and so on
@@ -51,7 +51,7 @@ Try the command
```
$ lscpu
```
-on Cirrus to see what the cache heirarchy looks like.
+on Cirrus to see what the cache hierarchy looks like.

Other latency hiding measures exist, e.g., out-of-order execution
where instructions are executed based on the availability of data,
@@ -71,7 +71,7 @@ Memory bandwidth can then be a key consideration.

### Parallelism

-While it is not possible to increase the clock speed of an indivdual
+While it is not possible to increase the clock speed of an individual
processor, one can add more processing units (for which we will
read: "cores").

@@ -89,12 +89,12 @@ available on Cirrus.

Driven by commercial interest (games), a many-core processor *par excellence*
has been developed. These are graphics processors. Subject to the same
-considerations as thoese discussed above, the hardware design choices taken
-to resolve them have been specificially related to the parallel pixel
+considerations as those discussed above, the hardware design choices taken
+to resolve them have been specifically related to the parallel pixel
rendering problem (a trivially parallel problem).

Clock speeds have, historically, lagged behind CPUs, but are now
-broadly similar. However, increases in GPU performance are releted to
+broadly similar. However, increases in GPU performance are related to
parallelism.

Memory latency has not gone away, but the mechanism used to mitigate it
6 changes: 3 additions & 3 deletions section-1.02/README.md
@@ -5,7 +5,7 @@ The application programmer does not want to have to worry about
the exact disposition of cores/SMs, or whatever, in the hardware.
An abstraction is wanted.

-In CUDA and HIP, this abstraction is based on an heirarchical
+In CUDA and HIP, this abstraction is based on an hierarchical
organisation of threads.


@@ -106,15 +106,15 @@ $ nvcc -arch=sm_70 code.cu
```
will run on Volta. Minor versions such as `sm_72` also exist.

-This should not be confued with the CUDA version. The SM is a hardware
+This should not be confused with the CUDA version. The SM is a hardware
feature, whereas the CUDA version is a software issue.
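
As an aside, the compute capability of the device in hand can be queried at run time. A minimal sketch (not part of the course material; device 0 is assumed) is:
```
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  /* prop.major and prop.minor give the compute capability, e.g. 7.0 on Volta */
  printf("Device 0 (%s): compute capability %d.%d\n",
         prop.name, prop.major, prop.minor);

  return 0;
}
```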


## Portability: CUDA and HIP

CUDA has been under development by NVIDIA since around 2005. AMD, rather
later to the party, develops HIP, which shadows CUDA. For
example, a C/C++ call to
```
cudaMalloc(...);
```
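As a sketch of the correspondence (array size arbitrary; this is not one of the course examples), the HIP calls carry the `hip` prefix in place of `cuda`:
```
/* Requires the HIP runtime header; the calls shadow the CUDA ones. */
#include <hip/hip_runtime.h>

int main(void) {

  double * d_x = NULL;

  hipMalloc((void **) &d_x, 256*sizeof(double));   /* cf. cudaMalloc() */
  hipFree(d_x);                                    /* cf. cudaFree()   */

  return 0;
}
```
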
4 changes: 2 additions & 2 deletions section-2.01/README.md
@@ -1,6 +1,6 @@
# CUDA Programming

-The first topic we must address is the existance of separate address
+The first topic we must address is the existence of separate address
spaces for CPU and GPU memory, and moving data between them.

![Schematic of host/device memories](../images/ks-schematic-memory-transfer.svg)
@@ -178,7 +178,7 @@ Check the CUDA documentation to see what other information is available
from the structure `cudaDeviceProp`. This will be in the section on
device management in the CUDA runtime API reference.

-What other possiblities exist for `cudaMemcpyKind`?
+What other possibilities exist for `cudaMemcpyKind`?

https://docs.nvidia.com/cuda/cuda-runtime-api/index.html
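
As a pointer towards the answer, the remaining members of `cudaMemcpyKind` are `cudaMemcpyHostToHost`, `cudaMemcpyDeviceToDevice` and `cudaMemcpyDefault`. A minimal sketch of an explicit transfer in each direction (array size arbitrary; not one of the course templates) is:
```
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {

  size_t sz = 256*sizeof(double);
  double * h_x = (double *) malloc(sz);
  double * d_x = NULL;

  for (int i = 0; i < 256; i++) h_x[i] = 1.0*i;

  cudaMalloc((void **) &d_x, sz);

  cudaMemcpy(d_x, h_x, sz, cudaMemcpyHostToDevice);   /* host -> device */
  /* ... a kernel would operate on d_x here ... */
  cudaMemcpy(h_x, d_x, sz, cudaMemcpyDeviceToHost);   /* device -> host */

  cudaFree(d_x);
  free(h_x);

  return 0;
}
```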

4 changes: 2 additions & 2 deletions section-2.01/exercise_dscal.cu
@@ -86,15 +86,15 @@ int main(int argc, char *argv[]) {

/* ... kernel will be here ... */

-/* copy the result array back to the host output arrray */
+/* copy the result array back to the host output array */


/* We can now check the results ... */
printf("Results:\n");
{
int ncorrect = 0;
for (int i = 0; i < ARRAY_LENGTH; i++) {
-/* The print statement can be uncommented for debuging... */
+/* The print statement can be uncommented for debugging... */
/* printf("%9d %5.2f\n", i, h_out[i]); */
if (fabs(h_out[i] - a*h_x[i]) < DBL_EPSILON) ncorrect += 1;
}
14 changes: 7 additions & 7 deletions section-2.02/README.md
@@ -60,7 +60,7 @@ We have introduced the structure
unsigned int z;
} dim3;
```
-which may be intialised in C as above, or using C++ style
+which may be initialised in C as above, or using C++ style
constructors.
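
For example (a sketch with arbitrary sizes; the kernel name is illustrative, not the course code), both forms might appear as:
```
/* Compile with nvcc as a .cu file. An 8 x 8 grid of blocks, each block
 * being 16 x 16 threads; sizes are arbitrary. */

__global__ void emptyKernel(void) {
  /* no operation */
}

int main(void) {

  dim3 blocks = {8, 8, 1};            /* initialised as a structure */
  dim3 threadsPerBlock(16, 16, 1);    /* C++-style constructor      */

  emptyKernel<<<blocks, threadsPerBlock>>>();
  cudaDeviceSynchronize();

  return 0;
}
```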


@@ -99,7 +99,7 @@ thread's position in the abstract grid picture:
```
dim3 gridDim; /* The number of blocks */
dim3 blockDim; /* The number of threads per block */
/* Unique to each block: */
dim3 blockIdx; /* 0 <= blockIdx.x < gridDim.x etc. for y,z */
@@ -128,7 +128,7 @@ completed, we need synchronisation.

### Error handling

-Errors occuring in the kernel execution are also asynchronous, which
+Errors occurring in the kernel execution are also asynchronous, which
can cause some confusion. As a result, one will sometimes see this
usage:
```
@@ -157,7 +157,7 @@ adjust the value of the constant `a` to be e.g., `a = 2.0`.
There is also a new template with a canned solution to the previous
part in this directory.

-### Sugggested procedure
+### Suggested procedure

1. Write a kernel of the prototype
```
@@ -185,13 +185,13 @@ the correct behaviour. Check for larger multiples of

### Problem size not a whole number of blocks

-As we are effectively contrained in the choice of `THREADS_PER_BLOCK`,
+As we are effectively constrained in the choice of `THREADS_PER_BLOCK`,
it is likely that the problem space is not an integral number of
blocks for general problems. How can we deal with this situation?

1. For the launch parameters, you will need to compute a number of blocks
that is sufficient and necessary to cover the entire problem space. (There
-needs to be at least one block, but no more than necesary.)
+needs to be at least one block, but no more than necessary.)
2. You will also need to make an adjustment in the kernel. To avoid what
type of error?
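
A sketch of one common pattern follows (this is not the canned solution; the kernel name, the extra `nlen` argument, and the value of `THREADS_PER_BLOCK` are assumptions rather than the template's exact interface). The number of blocks is rounded up, and the kernel guards against threads that fall beyond the end of the array:
```
#define THREADS_PER_BLOCK 256   /* assumed, as in the exercise template */

__global__ void myKernel(double a, double * x, int nlen) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;

  /* The final block may run off the end of the array ... */
  if (i < nlen) x[i] = a*x[i];
}

/* ... at the launch site ... */
void launchKernel(double a, double * d_x, int nlen) {

  /* Sufficient blocks to cover the problem, but no more than necessary */
  int nblocks = (nlen + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK;

  myKernel<<<nblocks, THREADS_PER_BLOCK>>>(a, d_x, nlen);
}
```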

@@ -228,4 +228,4 @@ kernel parameters. Hint: for our first kernel this final argument will be
void *args[] = {&a, &d_x};
```
As `cudaLaunchKernel()` is an API function returning an error, the return code can be
inspected with the macro to check for errors in the launch (instead of `cudaPeekAtLastError()`).
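
For illustration, a sketch of the whole launch via the API call is below (the kernel body and the helper name are assumptions; only the calling pattern is the point here). The last two arguments are the dynamic shared memory size and the stream:
```
#include <cuda_runtime.h>

__global__ void myKernel(double a, double * x) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  x[i] = a*x[i];
}

void launchViaAPI(double a, double * d_x, dim3 blocks, dim3 threadsPerBlock) {

  void * args[] = {&a, &d_x};

  cudaError_t err = cudaLaunchKernel((void *) myKernel, blocks,
                                     threadsPerBlock, args, 0, 0);

  if (err != cudaSuccess) {
    /* e.g., report cudaGetErrorString(err), or use the CUDA_ASSERT() macro */
  }
}
```
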
4 changes: 2 additions & 2 deletions section-2.02/exercise_dscal.cu
@@ -96,7 +96,7 @@ int main(int argc, char *argv[]) {

/* ... kernel will be here ... */

-/* copy the result array back to the host output arrray */
+/* copy the result array back to the host output array */

CUDA_ASSERT( cudaMemcpy(h_out, d_x, sz, cudaMemcpyDeviceToHost) );

@@ -105,7 +105,7 @@ int main(int argc, char *argv[]) {
{
int ncorrect = 0;
for (int i = 0; i < ARRAY_LENGTH; i++) {
-/* The print statement can be uncommented for debuging... */
+/* The print statement can be uncommented for debugging... */
/* printf("%9d %5.2f\n", i, h_out[i]); */
if (fabs(h_out[i] - a*h_x[i]) < DBL_EPSILON) ncorrect += 1;
}
12 changes: 5 additions & 7 deletions section-2.03/README.md
@@ -24,7 +24,7 @@ or eliminate host operations in favour of device operations.

Potentially, the GPU has a lot of SMs/cores that can be used. Having very
many blocks of work available at any one time is said to favour
high *occupancy*.

This may be thought of simply as having a very high degree of thread
parallelism. However, the degree is much higher than would be expected
@@ -64,7 +64,7 @@ SMs.

## Memory usage

-### CPU: caching bahaviour
+### CPU: caching behaviour

A given thread in a CPU code favours consecutive memory accesses.
E.g., in C, recall that it is the right-most index that runs
@@ -83,7 +83,7 @@ contiguous memory accesses.
### GPU: coalescing behaviour

For GPU global memory, the opposite is true. The hardware wants
-to have warps of consectutive threads load consectutive memory
+to have warps of consecutive threads load consecutive memory
locations in a contiguous block.
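
As a minimal sketch of what this means (the kernel and names here are illustrative, not the exercise code), a coalesced pattern has thread `i` touch element `i`, so a warp's accesses fall in one contiguous range:
```
__global__ void scaleCoalesced(double a, double * x, int nlen) {

  /* Threads with consecutive threadIdx.x read x[i], x[i+1], ...,
   * so the warp's loads can be combined into few memory transactions. */
  int i = blockIdx.x*blockDim.x + threadIdx.x;

  if (i < nlen) x[i] = a*x[i];
}
```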

Consider a one-dimensional example:
@@ -107,7 +107,7 @@ Consider first:
}
```
Here, a given thread makes `NY` consecutive accesses to the arrays. This
-does not favour coalesed access.
+does not favour coalesced access.

We want consecutive threads to have consecutive accesses, e.g.,
```
@@ -188,7 +188,7 @@ A suggested procedure is:
Hint: keep the same total number of threads per block; but the block
must become two-dimensional.

-5. Is your resultant code getting the coalescing right? Consectutive
+5. Is your resultant code getting the coalescing right? Consecutive
threads, that is, threads with consecutive $x$-index, should
access consecutive memory locations.

Expand All @@ -206,5 +206,3 @@ kernel launch (`cudaLaunchKernel` in the profile) compared with the
time taken for the kernel itself?

What's the overhead for the host-device transfers?


2 changes: 1 addition & 1 deletion section-2.03/exercise_dger.cu
@@ -121,7 +121,7 @@ int main(int argc, char *argv[]) {
CUDA_ASSERT( cudaPeekAtLastError() );
CUDA_ASSERT( cudaDeviceSynchronize() );

-/* Retreive the results to h_a and check the results */
+/* Retrieve the results to h_a and check the results */

kind = cudaMemcpyDeviceToHost;
CUDA_ASSERT( cudaMemcpy(h_a, d_a, mrow*ncol*sizeof(double), kind) );
5 changes: 2 additions & 3 deletions section-2.04/README.md
@@ -137,9 +137,9 @@ The `cudaMemoryAdvise` value may include:

1. `cudaMemAdviseSetReadMostly` indicates data which is mostly read and only infrequently written;
2. `cudaMemAdviseSetPreferredLocation` sets the preferred location to
the specified device (`cudaCpuDeviceId` for the host);
3. `cudaMemAdviseSetAccessedBy` suggests that the data will be accessed
-by the specfied device.
+by the specified device.

Each option has a corresponding `Unset` value which can be used to
nullify the effect of a preceding `Set` specification.
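
A sketch of how such requests might look is below (the array, its length, and the helper name are placeholders; `deviceNum` would be obtained from `cudaGetDevice()` as in the exercise):
```
#include <cuda_runtime.h>

void adviseAndPrefetch(double * a, int nlen, int deviceNum) {

  /* Hint that the data will mostly be read ... */
  cudaMemAdvise(a, nlen*sizeof(double), cudaMemAdviseSetReadMostly, deviceNum);

  /* ... and ask for it to be resident on the device ahead of use
   * (the final argument is the stream). */
  cudaMemPrefetchAsync(a, nlen*sizeof(double), deviceNum, 0);
}
```
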
@@ -183,4 +183,3 @@ id is already present in the code as `deviceNum`.

What happens if you should accidentally use `cudaMalloc()` where you intended
to use `cudaMallocManaged()`?

4 changes: 2 additions & 2 deletions section-2.04/exercise_dger.cu
@@ -14,7 +14,7 @@
* managed memory.
* Part 2. Add prefetch requests for x and y before the kernel,
* and the matrix a after the kernel.
*
* Copyright EPCC, The University of Edinburgh, 2023
*/

@@ -126,7 +126,7 @@ int main(int argc, char *argv[]) {
CUDA_ASSERT( cudaPeekAtLastError() );
CUDA_ASSERT( cudaDeviceSynchronize() );

-/* Retreive the results to h_a and check the results */
+/* Retrieve the results to h_a and check the results */

kind = cudaMemcpyDeviceToHost;
CUDA_ASSERT( cudaMemcpy(h_a, d_a, mrow*ncol*sizeof(double), kind) );
20 changes: 15 additions & 5 deletions section-2.05/README.md
@@ -77,7 +77,7 @@ conditions.
### The solution

In practice, potentially unsafe updates to any form of shared memory
-must be protected by appropriate synchronisation: guarentees that
+must be protected by appropriate synchronisation: guarantees that
operations happen in the correct order.

For global memory, we require a so-called *atomic* update. For our
@@ -94,6 +94,15 @@ So the atomic update is a single unified operation on a single thread:
3. store the result back to the global memory location;
4. release the lock on that location.

### Note

`atomicAdd()` is an overloaded device function:
```
__device__ int atomicAdd(int * address, int value);
__device__ double atomicAdd(double * address, double value);
```
and so on. The old value of the target variable is returned.
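
For instance (an illustrative sketch, not the exercise solution; note the `double` overload needs a sufficiently recent architecture, which the `sm_70` target used earlier satisfies), each thread can add its contribution safely to a single accumulator in global memory:
```
__global__ void sumArray(double * x, int nlen, double * total) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;

  /* Without the atomic, concurrent read-modify-write updates of *total
   * from different threads could be lost. */
  if (i < nlen) atomicAdd(total, x[i]);
}
```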


## Shared memory in blocks

Expand All @@ -106,13 +115,13 @@ These values are shared only between threads in the same block.

Potential uses:
1. marshalling data within a block;
-2. temprary values (particularly if there is signficant reuse);
+2. temporary values (particularly if there is significant reuse);
3. contributions to reduction operations.

Note: in the above example we have fixed the size of the `tmp`
object at compile time ("static" shared memory).

-### Synchonisation
+### Synchronisation

There are quite a large number of synchronisation options for
threads within a block in CUDA. The essential one is probably
@@ -130,6 +139,7 @@ Here is a (slightly contrived) example:
/* Reverse elements so that the order 0,1,2,3,...
* becomes ...,3,2,1,0
* Assume we have one block. */
__global__ void reverseElements(int * myArray) {
__shared__ int tmp[THREADS_PER_BLOCK];
@@ -159,7 +169,7 @@ time, and so harm occupancy.
## Exercise (20 minutes)

In the following exercise we implement a vector scalar product
-in the style of the BLAS levle 1 routine `ddot()`.
+in the style of the BLAS level 1 routine `ddot()`.

The template provided sets up two vectors `x` and `y` with some
initial values. The exercise is to complete the `ddot()` kernel
@@ -186,6 +196,6 @@ answer by chance. Be sure to check with a larger problem size.
### Finished?

It is possible to use solely `atomicAdd()` to form the result (and not
-do anything using `__shared__` within a block). Investigate the performance
+do anything using `__shared__` within a block)? Investigate the performance
implications of this (particularly, if the problem size becomes larger).
You will need two versions of the kernel.
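
For reference, the `atomicAdd()`-only variant might look like the sketch below (the kernel name and argument order are assumptions; the shared-memory version is left to the exercise):
```
__global__ void ddot_atomic_only(int nlen, double * x, double * y,
                                 double * result) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;

  /* Every thread updates the single global accumulator directly;
   * this contention is what the shared-memory version aims to reduce. */
  if (i < nlen) atomicAdd(result, x[i]*y[i]);
}
```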