Add pre-commit and corrections
kevinstratford committed May 2, 2023
1 parent 6d7ca21 commit 37fda66
Showing 21 changed files with 161 additions and 67 deletions.
18 changes: 18 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,18 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
exclude: '.svg$'
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.2.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
  # codespell: note that --ignore-words may be read as "allow words"
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.2
    hooks:
      - id: codespell
        args:
          - --ignore-words=spelling.txt
7 changes: 4 additions & 3 deletions README.md
@@ -94,11 +94,12 @@ the advertised start and finish times, and the break times.
| 14:00 | Some performance considerations | |
| | Exercise on matrix operation | [section-2.03](section-2.03) |
| 15:00 | Break | |
-| 15:20 | More on memory: managed memory | |
+| 15:20 | Managed memory | |
| | Exercise on managed memory | [section-2.04](section-2.04) |
-| 15:50 | More on memory: shared memory | |
+| 15:50 | Shared memory | |
| 16:10 | Exercise on vector product | [section-2.05](section-2.05) |
-| 16:30 | All together: matrix-vector product | [][] |
+| 16:30 | Constant memory | |
+| 16:40 | All together: matrix-vector product | [section-2.06](section-2.06) |
| 17:00 | Close | |


18 changes: 9 additions & 9 deletions section-1.01/README.md
@@ -13,7 +13,7 @@ Correspondingly, there might be a number of factors which could be
taken into consideration in a performance model:

1. Clock speed: the rate of issue of instructions by the processor
-2. Memory latency: time taken to retreive a data item from memory
+2. Memory latency: time taken to retrieve a data item from memory
3. Memory bandwidth: amount of data transferred in unit time
4. Parallelism: can I replicate the basic unit above?
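
A rough worked estimate may help to fix ideas: the time to process an array that must be streamed from memory is, to first approximation, the latency plus the data volume divided by the bandwidth. The figures in the sketch below are illustrative assumptions, not measurements.
```
/* Illustrative estimate only: latency + bytes/bandwidth for streaming
 * an array of n doubles. The numbers are assumed, not measured. */

#include <stdio.h>

int main(void) {

  double n   = 1.0e8;      /* number of doubles           */
  double lat = 1.0e-7;     /* assumed latency: 100 ns     */
  double bw  = 100.0e9;    /* assumed bandwidth: 100 GB/s */

  double t = lat + 8.0*n/bw;   /* 8 bytes per double */

  printf("Estimated time: %e seconds\n", t);

  return 0;
}
```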

@@ -27,8 +27,8 @@ Historically, increases in CPU performance have been related to increases
in clock speed. However, owing largely to power constraints, most modern
processors have a clock speed of around 2-3 GHz.

-Absent some unforseen fundamental breakthrough, it is not expected that
-this fundamental speed will increase signficantly in the future.
+Absent some unforeseen fundamental breakthrough, it is not expected that
+this fundamental speed will increase significantly in the future.

### Memory latency

@@ -40,7 +40,7 @@ operated on in a register).

CPUs mitigate this problem by having caches: memory that is
"closer" to the processor, and so reduces the time for access.
-Many caches are heirarchical in nature: the nearer the processor
+Many caches are hierarchical in nature: the nearer the processor
the smaller the cache size in bytes, but the faster the access.
These are typically referred to as Level 1, Level 2, Level 3,
(L1, L2, L3) and so on
@@ -51,7 +51,7 @@ Try the command
```
$ lscpu
```
-on Cirrus to see what the cache heirarchy looks like.
+on Cirrus to see what the cache hierarchy looks like.

Other latency hiding measures exist, e.g., out-of-order execution
where instructions are executed based on the availability of data,
@@ -71,7 +71,7 @@ Memory bandwidth can then be a key consideration.

### Parallelism

-While it is not possible to increase the clock speed of an indivdual
+While it is not possible to increase the clock speed of an individual
processor, one can add more processing units (for which we will
read: "cores").

@@ -89,12 +89,12 @@ available on Cirrus.

Driven by commercial interest (games), a many-core processor *par excellence*
has been developed. These are graphics processors. Subject to the same
-considerations as thoese discussed above, the hardware design choices taken
-to resolve them have been specificially related to the parallel pixel
+considerations as those discussed above, the hardware design choices taken
+to resolve them have been specifically related to the parallel pixel
rendering problem (a trivially parallel problem).

Clock speeds have, historically, lagged behind CPUs, but are now
-broadly similar. However, increases in GPU performance are releted to
+broadly similar. However, increases in GPU performance are related to
parallelism.

Memory latency has not gone away, but the mechanism used to mitigate it
6 changes: 3 additions & 3 deletions section-1.02/README.md
@@ -5,7 +5,7 @@ The application programmer does not want to have to worry about
the exact disposition of cores/SMs, or whatever, in the hardware.
An abstraction is wanted.

-In CUDA and HIP, this abstraction is based on an heirarchical
+In CUDA and HIP, this abstraction is based on an hierarchical
organisation of threads.


@@ -106,15 +106,15 @@ $ nvcc -arch=sm_70 code.cu
```
will run on Volta. Minor versions such as `sm_72` also exist.

-This should not be confued with the CUDA version. The SM is a hardware
+This should not be confused with the CUDA version. The SM is a hardware
feature, whereas the CUDA version is a software issue.
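
As an aside, the compute capability of the device in hand can be queried at run time. A minimal sketch (not part of the course material; device 0 is assumed) is:
```
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  /* prop.major and prop.minor give the compute capability, e.g. 7.0 on Volta */
  printf("Device 0 (%s): compute capability %d.%d\n",
         prop.name, prop.major, prop.minor);

  return 0;
}
```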


## Portability: CUDA and HIP

CUDA has been under development by NVIDIA since around 2005. AMD, rather
later to the party, develops HIP, which shadows CUDA. For
example, a C/C++ call to
```
cudaMalloc(...);
```
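As a sketch of the correspondence (array size arbitrary; this is not one of the course examples), the HIP calls carry the `hip` prefix in place of `cuda`:
```
/* Requires the HIP runtime header; the calls shadow the CUDA ones. */
#include <hip/hip_runtime.h>

int main(void) {

  double * d_x = NULL;

  hipMalloc((void **) &d_x, 256*sizeof(double));   /* cf. cudaMalloc() */
  hipFree(d_x);                                    /* cf. cudaFree()   */

  return 0;
}
```
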
4 changes: 2 additions & 2 deletions section-2.01/README.md
@@ -1,6 +1,6 @@
# CUDA Programming

-The first topic we must address is the existance of separate address
+The first topic we must address is the existence of separate address
spaces for CPU and GPU memory, and moving data between them.

![Schematic of host/device memories](../images/ks-schematic-memory-transfer.svg)
@@ -178,7 +178,7 @@ Check the CUDA documentation to see what other information is available
from the structure `cudaDeviceProp`. This will be in the section on
device management in the CUDA runtime API reference.

-What other possiblities exist for `cudaMemcpyKind`?
+What other possibilities exist for `cudaMemcpyKind`?

https://docs.nvidia.com/cuda/cuda-runtime-api/index.html
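
As a pointer towards the answer, the remaining members of `cudaMemcpyKind` are `cudaMemcpyHostToHost`, `cudaMemcpyDeviceToDevice` and `cudaMemcpyDefault`. A minimal sketch of an explicit transfer in each direction (array size arbitrary; not one of the course templates) is:
```
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {

  size_t sz = 256*sizeof(double);
  double * h_x = (double *) malloc(sz);
  double * d_x = NULL;

  for (int i = 0; i < 256; i++) h_x[i] = 1.0*i;

  cudaMalloc((void **) &d_x, sz);

  cudaMemcpy(d_x, h_x, sz, cudaMemcpyHostToDevice);   /* host -> device */
  /* ... a kernel would operate on d_x here ... */
  cudaMemcpy(h_x, d_x, sz, cudaMemcpyDeviceToHost);   /* device -> host */

  cudaFree(d_x);
  free(h_x);

  return 0;
}
```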

4 changes: 2 additions & 2 deletions section-2.01/exercise_dscal.cu
@@ -86,15 +86,15 @@ int main(int argc, char *argv[]) {

/* ... kernel will be here ... */

-/* copy the result array back to the host output arrray */
+/* copy the result array back to the host output array */


/* We can now check the results ... */
printf("Results:\n");
{
int ncorrect = 0;
for (int i = 0; i < ARRAY_LENGTH; i++) {
-/* The print statement can be uncommented for debuging... */
+/* The print statement can be uncommented for debugging... */
/* printf("%9d %5.2f\n", i, h_out[i]); */
if (fabs(h_out[i] - a*h_x[i]) < DBL_EPSILON) ncorrect += 1;
}
14 changes: 7 additions & 7 deletions section-2.02/README.md
@@ -60,7 +60,7 @@ We have introduced the structure
unsigned int z;
} dim3;
```
-which may be intialised in C as above, or using C++ style
+which may be initialised in C as above, or using C++ style
constructors.
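
For example (a sketch with arbitrary sizes; the kernel name is illustrative, not the course code), both forms might appear as:
```
/* Compile with nvcc as a .cu file. An 8 x 8 grid of blocks, each block
 * being 16 x 16 threads; sizes are arbitrary. */

__global__ void emptyKernel(void) {
  /* no operation */
}

int main(void) {

  dim3 blocks = {8, 8, 1};            /* initialised as a structure */
  dim3 threadsPerBlock(16, 16, 1);    /* C++-style constructor      */

  emptyKernel<<<blocks, threadsPerBlock>>>();
  cudaDeviceSynchronize();

  return 0;
}
```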


@@ -99,7 +99,7 @@ thread's position in the abstract grid picture:
```
dim3 gridDim; /* The number of blocks */
dim3 blockDim; /* The number of threads per block */
/* Unique to each block: */
dim3 blockIdx; /* 0 <= blockIdx.x < gridDim.x etc. for y,z */
@@ -128,7 +128,7 @@ completed, we need synchronisation.

### Error handling

-Errors occuring in the kernel execution are also asynchronous, which
+Errors occurring in the kernel execution are also asynchronous, which
can cause some confusion. As a result, one will sometimes see this
usage:
```
@@ -157,7 +157,7 @@ adjust the value of the constant `a` to be e.g., `a = 2.0`.
There is also a new template with a canned solution to the previous
part in this directory.

-### Sugggested procedure
+### Suggested procedure

1. Write a kernel of the prototype
```
@@ -185,13 +185,13 @@ the correct behaviour. Check for larger multiples of

### Problem size not a whole number of blocks

-As we are effectively contrained in the choice of `THREADS_PER_BLOCK`,
+As we are effectively constrained in the choice of `THREADS_PER_BLOCK`,
it is likely that the problem space is not an integral number of
blocks for general problems. How can we deal with this situation?

1. For the launch parameters, you will need to compute a number of blocks
that is sufficient and necessary to cover the entire problem space. (There
-needs to be at least one block, but no more than necesary.)
+needs to be at least one block, but no more than necessary.)
2. You will also need to make an adjustment in the kernel. To avoid what
type of error?
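
A sketch of one common pattern follows (this is not the canned solution; the kernel name, the extra `nlen` argument, and the value of `THREADS_PER_BLOCK` are assumptions rather than the template's exact interface). The number of blocks is rounded up, and the kernel guards against threads that fall beyond the end of the array:
```
#define THREADS_PER_BLOCK 256   /* assumed, as in the exercise template */

__global__ void myKernel(double a, double * x, int nlen) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;

  /* The final block may run off the end of the array ... */
  if (i < nlen) x[i] = a*x[i];
}

/* ... at the launch site ... */
void launchKernel(double a, double * d_x, int nlen) {

  /* Sufficient blocks to cover the problem, but no more than necessary */
  int nblocks = (nlen + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK;

  myKernel<<<nblocks, THREADS_PER_BLOCK>>>(a, d_x, nlen);
}
```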

@@ -228,4 +228,4 @@ kernel parameters. Hint: for our first kernel this final argument will be
void *args[] = {&a, &d_x};
```
As `cudaLaunchKernel()` is an API function returning an error, the return code can be
inspected with the macro to check for errors in the launch (instead of `cudaPeekAtLastError()`).
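
For illustration, a sketch of the whole launch via the API call is below (the kernel body and the helper name are assumptions; only the calling pattern is the point here). The last two arguments are the dynamic shared memory size and the stream:
```
#include <cuda_runtime.h>

__global__ void myKernel(double a, double * x) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  x[i] = a*x[i];
}

void launchViaAPI(double a, double * d_x, dim3 blocks, dim3 threadsPerBlock) {

  void * args[] = {&a, &d_x};

  cudaError_t err = cudaLaunchKernel((void *) myKernel, blocks,
                                     threadsPerBlock, args, 0, 0);

  if (err != cudaSuccess) {
    /* e.g., report cudaGetErrorString(err), or use the CUDA_ASSERT() macro */
  }
}
```
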
4 changes: 2 additions & 2 deletions section-2.02/exercise_dscal.cu
@@ -96,7 +96,7 @@ int main(int argc, char *argv[]) {

/* ... kernel will be here ... */

-/* copy the result array back to the host output arrray */
+/* copy the result array back to the host output array */

CUDA_ASSERT( cudaMemcpy(h_out, d_x, sz, cudaMemcpyDeviceToHost) );

@@ -105,7 +105,7 @@ int main(int argc, char *argv[]) {
{
int ncorrect = 0;
for (int i = 0; i < ARRAY_LENGTH; i++) {
-/* The print statement can be uncommented for debuging... */
+/* The print statement can be uncommented for debugging... */
/* printf("%9d %5.2f\n", i, h_out[i]); */
if (fabs(h_out[i] - a*h_x[i]) < DBL_EPSILON) ncorrect += 1;
}
12 changes: 5 additions & 7 deletions section-2.03/README.md
@@ -24,7 +24,7 @@ or eliminate host operations in favour of device operations.

Potentially, the GPU has a lot of SMs/cores that can be used. Having very
many blocks of work available at any one time is said to favour
high *occupancy*.

This may be thought of simply as having a very high degree of thread
parallelism. However, the degree is much higher than would be expected
@@ -64,7 +64,7 @@ SMs.

## Memory usage

-### CPU: caching bahaviour
+### CPU: caching behaviour

A given thread in a CPU code favours consecutive memory accesses.
E.g., in C, recall that it is the right-most index that runs
@@ -83,7 +83,7 @@ contiguous memory accesses.
### GPU: coalescing behaviour

For GPU global memory, the opposite is true. The hardware wants
-to have warps of consectutive threads load consectutive memory
+to have warps of consecutive threads load consecutive memory
locations in a contiguous block.
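
As a minimal sketch of what this means (the kernel and names here are illustrative, not the exercise code), a coalesced pattern has thread `i` touch element `i`, so a warp's accesses fall in one contiguous range:
```
__global__ void scaleCoalesced(double a, double * x, int nlen) {

  /* Threads with consecutive threadIdx.x read x[i], x[i+1], ...,
   * so the warp's loads can be combined into few memory transactions. */
  int i = blockIdx.x*blockDim.x + threadIdx.x;

  if (i < nlen) x[i] = a*x[i];
}
```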

Consider a one-dimensional example:
@@ -107,7 +107,7 @@ Consider first:
}
```
Here, a given thread makes `NY` consecutive accesses to the arrays. This
-does not favour coalesed access.
+does not favour coalesced access.

We want consecutive threads to have consecutive accesses, e.g.,
```
@@ -188,7 +188,7 @@ A suggested procedure is:
Hint: keep the same total number of threads per block; but the block
must become two-dimensional.

-5. Is your resultant code getting the coalescing right? Consectutive
+5. Is your resultant code getting the coalescing right? Consecutive
threads, that is, threads with consecutive $x$-index, should
access consecutive memory locations.

Expand All @@ -206,5 +206,3 @@ kernel launch (`cudaLaunchKernel` in the profile) compared with the
time taken for the kernel itself?

What's the overhead for the host-device transfers?


2 changes: 1 addition & 1 deletion section-2.03/exercise_dger.cu
@@ -121,7 +121,7 @@ int main(int argc, char *argv[]) {
CUDA_ASSERT( cudaPeekAtLastError() );
CUDA_ASSERT( cudaDeviceSynchronize() );

-/* Retreive the results to h_a and check the results */
+/* Retrieve the results to h_a and check the results */

kind = cudaMemcpyDeviceToHost;
CUDA_ASSERT( cudaMemcpy(h_a, d_a, mrow*ncol*sizeof(double), kind) );
5 changes: 2 additions & 3 deletions section-2.04/README.md
@@ -137,9 +137,9 @@ The `cudaMemoryAdvise` value may include:

1. `cudaMemAdviseSetReadMostly` indicates data which is mostly read and only infrequently written;
2. `cudaMemAdviseSetPreferredLocation` sets the preferred location to
the specified device (`cudaCpuDeviceId` for the host);
3. `cudaMemAdviseSetAccessedBy` suggests that the data will be accessed
-by the specfied device.
+by the specified device.

Each option has a corresponding `Unset` value which can be used to
nullify the effect of a preceding `Set` specification.
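
A sketch of how such requests might look is below (the array, its length, and the helper name are placeholders; `deviceNum` would be obtained from `cudaGetDevice()` as in the exercise):
```
#include <cuda_runtime.h>

void adviseAndPrefetch(double * a, int nlen, int deviceNum) {

  /* Hint that the data will mostly be read ... */
  cudaMemAdvise(a, nlen*sizeof(double), cudaMemAdviseSetReadMostly, deviceNum);

  /* ... and ask for it to be resident on the device ahead of use
   * (the final argument is the stream). */
  cudaMemPrefetchAsync(a, nlen*sizeof(double), deviceNum, 0);
}
```
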
@@ -183,4 +183,3 @@ id is already present in the code as `deviceNum`.

What happens if you should accidentally use `cudaMalloc()` where you intended
to use `cudaMallocManaged()`?

4 changes: 2 additions & 2 deletions section-2.04/exercise_dger.cu
@@ -14,7 +14,7 @@
* managed memory.
* Part 2. Add prefetch requests for x and y before the kernel,
* and the matrix a after the kernel.
*
* Copyright EPCC, The University of Edinburgh, 2023
*/

@@ -126,7 +126,7 @@ int main(int argc, char *argv[]) {
CUDA_ASSERT( cudaPeekAtLastError() );
CUDA_ASSERT( cudaDeviceSynchronize() );

-/* Retreive the results to h_a and check the results */
+/* Retrieve the results to h_a and check the results */

kind = cudaMemcpyDeviceToHost;
CUDA_ASSERT( cudaMemcpy(h_a, d_a, mrow*ncol*sizeof(double), kind) );
20 changes: 15 additions & 5 deletions section-2.05/README.md
@@ -77,7 +77,7 @@ conditions.
### The solution

In practice, potentially unsafe updates to any form of shared memory
-must be protected by appropriate synchronisation: guarentees that
+must be protected by appropriate synchronisation: guarantees that
operations happen in the correct order.

For global memory, we require a so-called *atomic* update. For our
@@ -94,6 +94,15 @@ So the atomic update is a single unified operation on a single thread:
3. store the result back to the global memory location;
4. release the lock on that location.

### Note

`atomicAdd()` is an overloaded device function:
```
__device__ int atomicAdd(int * address, int value);
__device__ double atomicAdd(double * address, double value);
```
and so on. The old value of the target variable is returned.
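
For instance (an illustrative sketch, not the exercise solution; note the `double` overload needs a sufficiently recent architecture, which the `sm_70` target used earlier satisfies), each thread can add its contribution safely to a single accumulator in global memory:
```
__global__ void sumArray(double * x, int nlen, double * total) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;

  /* Without the atomic, concurrent read-modify-write updates of *total
   * from different threads could be lost. */
  if (i < nlen) atomicAdd(total, x[i]);
}
```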


## Shared memory in blocks

Expand All @@ -106,13 +115,13 @@ These values are shared only between threads in the same block.

Potential uses:
1. marshalling data within a block;
-2. temprary values (particularly if there is signficant reuse);
+2. temporary values (particularly if there is significant reuse);
3. contributions to reduction operations.

Note: in the above example we have fixed the size of the `tmp`
object at compile time ("static" shared memory).

-### Synchonisation
+### Synchronisation

There are quite a large number of synchronisation options for
threads within a block in CUDA. The essential one is probably
@@ -130,6 +139,7 @@ Here is a (slightly contrived) example:
/* Reverse elements so that the order 0,1,2,3,...
* becomes ...,3,2,1,0
* Assume we have one block. */
__global__ void reverseElements(int * myArray) {
__shared__ int tmp[THREADS_PER_BLOCK];
@@ -159,7 +169,7 @@ time, and so harm occupancy.
## Exercise (20 minutes)

In the following exercise we implement a vector scalar product
-in the style of the BLAS levle 1 routine `ddot()`.
+in the style of the BLAS level 1 routine `ddot()`.

The template provided sets up two vectors `x` and `y` with some
initial values. The exercise is to complete the `ddot()` kernel
@@ -186,6 +196,6 @@ answer by chance. Be sure to check with a larger problem size.
### Finished?

It is possible to use solely `atomicAdd()` to form the result (and not
-do anything using `__shared__` within a block). Investigate the performance
+do anything using `__shared__` within a block)? Investigate the performance
implications of this (particularly, if the problem size becomes larger).
You will need two versions of the kernel.
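
For reference, the `atomicAdd()`-only variant might look like the sketch below (the kernel name and argument order are assumptions; the shared-memory version is left to the exercise):
```
__global__ void ddot_atomic_only(int nlen, double * x, double * y,
                                 double * result) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;

  /* Every thread updates the single global accumulator directly;
   * this contention is what the shared-memory version aims to reduce. */
  if (i < nlen) atomicAdd(result, x[i]*y[i]);
}
```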