From 086f56cc0131bc85781111c6b09218539d1852f1 Mon Sep 17 00:00:00 2001
From: youngdae <itsme2030@gmail.com>
Date: Wed, 29 May 2024 22:14:41 -0500
Subject: [PATCH] Adds description on how to generate PTX.

---
 README.md | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 7d106d0..043b114 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # ExaTron.jl
 
-ExaTron.jl implements a trust-region Newton algorithm for bound constrained batch nonlinear 
+ExaTron.jl implements a trust-region Newton algorithm for bound constrained batch nonlinear
 programming on GPUs.
 Its algorithm is based on [Lin and More](https://epubs.siam.org/doi/10.1137/S1052623498345075)
 and [TRON](https://www.mcs.anl.gov/~more/tron).
@@ -92,9 +92,9 @@ Note that the following table shows correspondence between the casename and the
 ### Figure 10
 
 To reproduce Figure 5, submit a job with each case file and its parameter values.
-For each case with name `casename`, it will generate `output_gpu1_casename.txt`. 
+For each case with name `casename`, it will generate `output_gpu1_casename.txt`.
 Near the end of the file, you will see the timing results: `Branch/iter = %.2f (millisecs)` is the relevant result.
-For example, in order to obtain timing results for `case19402_goc`, we read the following line around the end of 
+For example, in order to obtain timing results for `case19402_goc`, we read the following line around the end of
 the file
 ```bash
 Branch/iter = 3.94 (millisecs)
@@ -105,7 +105,7 @@ Here `3.94` miiliseconds will be the input for the `34K` batch size in Figure 5.
 
 To reproduce Figure 6, submit a job with each case file, its parameter values, and different GPU number `N`.
 It will generate `output_gpu${N}_casename.txt` file for each `casename` where `N` represents the number of GPUs
-used. 
+used.
 Near the end of the file, you will see the timing results: `[0] (Br+MPI)/iter = %.2f (millisecs)` is the relevant result,
 where `[0]` represents the rank (the root in this case) of a process.
 For example, in order to obtain timing results for `case19402_goc` with 6 GPUs, we read the following line around the end of the file
@@ -150,7 +150,7 @@ It will generate `br_time_gpu6_case13659pegase.pdf`. The file should look simila
 
 ### Figure 13
 
-To reproduce Figure 8, we need to execute ExaTron with 40 CPU cores. 
+To reproduce Figure 8, we need to execute ExaTron with 40 CPU cores.
 For this, we replace the line starting with `jsrun` with the following:
 ```bash
 jsrun -n 1 -r 1 -a 40 -c 40 -g 0 -d packed julia --project ./src/launch_mpi.jl ./data/casename pq_val va_val iterlim false
@@ -181,6 +181,15 @@ If you want to run ExaTron on a non-cluster, copy `julia --project ...` part in
 For multiple GPUs, run with `mpirun -np N julia --project ..`
 Note that all of the MPI processes should be able to see the `N` number of GPUs. Otherwise, it will generate an error.
 
+### Generating PTX code for a kernel
+
+To print the PTX code generated for a kernel, prefix its launch with CUDA.jl's `@device_code_ptx` macro in a Julia session:
+```julia
+@device_code_ptx CUDA.@sync @cuda threads=32 blocks=10240 kernel_func(a,b)
+```
+Here `kernel_func`, `a`, and `b` are placeholders: set `threads`, `blocks`, and the arguments to match the kernel you want to inspect.
+If the kernel uses dynamic shared memory, also pass its size in bytes with the `shmem` keyword of `@cuda`.
+
 ## Citing this package
 
 ```