---
title: "Assignment 1"
author: "Niccolò Tosato"
output:
  pdf_document: default
  html_document:
    df_print: paged
---
```{r setup, include=FALSE}
library(ggplot2)
library(gtable)
library(gridExtra)
library("kableExtra")
knitr::opts_chunk$set(echo = TRUE)
```
**Repository** with code and scripts: https://github.com/NiccoloTosato/HPC_assignment1
# Section 1: MPI programming
## Section 1.1: 1D ring
The ring has been implemented using non-blocking communication routines and a 1D virtual topology with periodic boundaries.
The non-blocking implementation leads to a linear growth of the execution time of the program; indeed, the execution time of a single iteration can be modeled roughly as a *double PingPing*.
The real performance will of course be worse than the ideal one, since PingPing involves only 2 processes and does not consider crowded configurations with more than 2 processes. However, it provides an ideal lower-bound model to estimate the expected performance.
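The communication pattern can be sketched as follows. This is a minimal sketch in C, assuming a periodic 1D Cartesian communicator, with illustrative variable names; the actual implementation is in the linked repository.
\scriptsize
```{c, eval=FALSE}
/* Minimal ring sketch: every rank exchanges one message with both
   neighbours per iteration, using non-blocking routines. */
#include <mpi.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int nprocs, rank, periodic = 1;
  MPI_Comm ring;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Cart_create(MPI_COMM_WORLD, 1, &nprocs, &periodic, 0, &ring);
  MPI_Comm_rank(ring, &rank);

  int left, right;
  MPI_Cart_shift(ring, 0, 1, &left, &right);   /* periodic neighbours */

  int send_l = rank, send_r = rank, recv_l, recv_r;
  for (int it = 0; it < nprocs; it++) {        /* one full trip around */
    MPI_Request req[4];
    MPI_Isend(&send_l, 1, MPI_INT, left,  0, ring, &req[0]);
    MPI_Isend(&send_r, 1, MPI_INT, right, 1, ring, &req[1]);
    MPI_Irecv(&recv_r, 1, MPI_INT, right, 0, ring, &req[2]);
    MPI_Irecv(&recv_l, 1, MPI_INT, left,  1, ring, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    send_l = recv_r;                           /* forward in both directions */
    send_r = recv_l;
  }
  MPI_Finalize();
  return 0;
}
```
\normalsize
Each iteration posts two sends and two receives per rank, which is why a *double PingPing* is a natural lower-bound model.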
### Software stack and measurement setup
The performance of the program is measured over 10000 iterations, using **UCX** and **InfiniBand**, across *core*, *socket* and *node*. Across two nodes the *mlx5_0* hardware interface is used. The times for the theoretical models have been taken from the Intel® MPI Benchmarks PingPing (Figure 1). The program can be compiled via the Makefile with different options. Times have been measured on the rank-zero process, which is expected to be slower than the other ranks for topology reasons. Running *ldd* on the executable reports the specific libraries used.
\scriptsize
```{bash, eval=FALSE}
[s271550@login ring]$ ldd ring.x
linux-vdso.so.1 => (0x00007ffce79ef000)
libm.so.6 => /lib64/libm.so.6 (0x00007f9e9dc88000)
libmpi.so.40 => /opt/area/shared/programs/x86_64/openmpi/4.0.3/gnu/9.3.0/lib/libmpi.so.40 (0x00007f9e9d964000)
```
\normalsize
![PingPing structure from Intel® MPI Benchmarks User Guide.](/home/nt/Scrivania/en/mpi/imb_user_guide/IMB_Users_Guide_files/image02.png){#id .class width=35% height=35%}
### Map by node model
The network model across two nodes takes into account the latency of the network (dominated by the switch) and the number of processes involved. Each iteration will be lower-bounded by the network:
$$Time=N_{procs} \cdot \lambda_{network}$$
The latency between two nodes estimated with the Intel MPI benchmark is a bit lower than the one declared by the switch manufacturer. This lower latency holds only when few communications are going on; with multiple processes involved, the more realistic declared latency of $1.35$ microseconds leads to a more accurate model. The model will thus be experimentally outperformed when $N_{procs}=2$. When the number of processes grows, a slight slowdown of the actual implementation can be seen.
Across two nodes the latency can be estimated optimistically as $\lambda_{network}=1.01 \space \mu Sec$.
\scriptsize
```{bash eval=FALSE}
#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.00 0.00
1 1000 1.00 1.00
2 1000 1.01 1.98
4 1000 1.01 3.94
```
\normalsize
We expect the time to be $2 \cdot \lambda_{network}$ for each iteration, but experimentally this model doesn't hold: the evidence suggests an iteration time of $\lambda_{network}$. My guess is that the 2 consecutive messages are merged into one unique request routed to the switch. This behaviour can be confirmed using the Intel MPI Benchmark **Exchange**, which implements a 1D non-periodic ring (Figure 2).
\scriptsize
```{bash, eval=FALSE}
#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 36
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 10000 1.46 1.46 1.46 0.00
1 10000 1.45 1.45 1.45 2.75
2 10000 1.45 1.45 1.45 5.50
4 10000 1.46 1.46 1.46 10.97
```
\normalsize
![Exchange benchmark structure from Intel® MPI Benchmarks User Guide.](./img/image04.png){#id99 .class width=40% height=40%}
### Map by socket model
With the socket round-robin binding selected, we can model the ring behaviour as follows, with the usual Intel MPI benchmark estimation:
$$Time=N_{procs}\cdot \lambda_{socket}\cdot 2$$
Again, when the sockets become crowded the real performance is worse than the model predicts.
Across two sockets the estimated time is $t=0.49 \space \mu Sec$.
\scriptsize
```{bash eval=FALSE}
#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.49 0.00
1 1000 0.48 2.06
2 1000 0.49 4.12
4 1000 0.49 8.18
```
\normalsize
### Map by core model
When mapping the processes by core, two factors have to be taken into account: how many processes are spawned, and where. When $N_{procs}$ is less than or equal to the number of cores in a single socket, the expected execution time is bounded by the core-to-core communication. Once the first socket is filled, the first process placed on the second socket becomes the slowest among all processes, since its neighbours are placed on the other socket. One iteration lasts as long as the slowest process's communication. Experimentally, a spike at $N_{procs}=13$ is clearly visible. Counterintuitively, the performance improves when the number of processes grows beyond $13$ and further processes are spawned on the second socket.
Indeed, the slowest process has one neighbor on the same socket and one on the other socket, thus the overall iteration time for the slowest process is $\lambda_{core}+\lambda_{socket}$.
$$
Time= \left\{
\begin{array}{ll}
N_{procs}\cdot\lambda_{core}\cdot 2 & N_{procs}\leq N_{cpu\space core} \\
N_{procs}\cdot\lambda_{socket}\cdot 2 & N_{procs} = N_{cpu\space core}+1 \\
N_{procs}\cdot(\lambda_{core}+\lambda_{socket}) & N_{procs} \geq N_{cpu\space core}+2 \\
\end{array}
\right.
$$
Between two cores on the same socket the estimated time is $t=0.23 \space \mu Sec$.
\scriptsize
```{bash eval=FALSE}
#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.23 0.00
1 1000 0.23 4.27
2 1000 0.24 8.44
4 1000 0.23 17.18
```
\normalsize
### Experimental results compared with theoretical model
```{r ring, include=FALSE, echo=FALSE}
ring_core=read.csv("./ring/timing/core.csv", header = TRUE)
ring_socket=read.csv("./ring/timing/socket.csv", header = TRUE)
ring_node=read.csv("./ring/timing/node.csv", header = TRUE)
ring_core_big=read.csv("./ring/timing/all_nodes/core.txt", header = TRUE)
ring_socket_big=read.csv("./ring/timing/all_nodes/socket.txt", header = TRUE)
ring_node_big=read.csv("./ring/timing/all_nodes/node.txt", header = TRUE)
ring_model_core=ring_core
ring_model_socket=ring_socket
ring_model_node=ring_node
# PingPing-based latency estimates (usec) for the three mappings
core_latency=0.23
socket_latency=0.48
node_latency=1.35
# Theoretical ring models: total time over one iteration of the whole ring
ring_model_core$MEAN[ring_model_core$X.SIZE<=12]=ring_model_core$X.SIZE[ring_model_core$X.SIZE<=12]*core_latency*2
ring_model_core$MEAN[ring_model_core$X.SIZE>12]=ring_model_core$X.SIZE[ring_model_core$X.SIZE>12]*socket_latency+ring_model_core$X.SIZE[ring_model_core$X.SIZE>12]*core_latency
ring_model_core$MEAN[ring_model_core$X.SIZE==13]=ring_model_core$X.SIZE[ring_model_core$X.SIZE==13]*socket_latency*2
ring_model_socket$MEAN=ring_model_socket$X.SIZE*socket_latency*2
ring_model_node$MEAN=ring_model_node$X.SIZE*node_latency
sp1 = ggplot(ring_core,aes(x=X.SIZE,y=MEAN,color="Map by core",linetype="Real time")) +
scale_y_continuous(name="Time uSec",breaks = seq(0,45,by=5))+
scale_x_continuous(name="Processor number",breaks = ring_core$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5))+
geom_point(data=ring_model_core,aes(x=X.SIZE,y=MEAN,color="Map by core"))+
geom_line(data=ring_model_core,aes(x=X.SIZE,y=MEAN,color="Map by core",linetype="Theoretical time")) +
geom_point(data=ring_socket,aes(x=X.SIZE,y=MEAN,color="Map by socket")) +
geom_line(data=ring_socket,aes(x=X.SIZE,y=MEAN,color="Map by socket",linetype="Real time"))+
geom_point(data=ring_model_socket,aes(x=X.SIZE,y=MEAN,color="Map by socket")) +
geom_line(data=ring_model_socket,aes(x=X.SIZE,y=MEAN,color="Map by socket",linetype="Theoretical time"))+
geom_point(data=ring_model_node,aes(x=X.SIZE,y=MEAN,color="Map by node")) +
geom_line(data=ring_model_node,aes(x=X.SIZE,y=MEAN,color="Map by node",linetype="Theoretical time"))+
geom_point(data=ring_node,aes(x=X.SIZE,y=MEAN,color="Map by node")) +
geom_line(data=ring_node,aes(x=X.SIZE,y=MEAN,color="Map by node",linetype="Real time"))+
scale_color_manual(name="Mapping", values = c(10,12,13))+
scale_linetype_manual(name="",values=c(1,2))+
theme(legend.position = c(0.2, 0.8))+ geom_vline(xintercept=12,
color = "red",linetype="dashed", size=0.5)
core_latency=0.23
socket_latency=0.48
node_latency=1.35
ring_model_core$MEAN_MESSAGE[ring_model_core$X.SIZE<=12]=core_latency*2
ring_model_core$MEAN_MESSAGE[ring_model_core$X.SIZE>12]=socket_latency+core_latency
ring_model_core$MEAN_MESSAGE[ring_model_core$X.SIZE==13]=socket_latency*2
ring_model_socket$MEAN_MESSAGE=socket_latency*2
ring_model_node$MEAN_MESSAGE=node_latency
sp = ggplot(ring_core,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by core",linetype="Real time")) +
scale_y_continuous(name="Mean iteration time uSec",breaks = seq(0.4,1.6,by=0.1))+
scale_x_continuous(name="Processor number",breaks = ring_core$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5))+
geom_point(data=ring_model_core,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by core"))+
geom_line(data=ring_model_core,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by core",linetype="Theoretical time")) +
geom_point(data=ring_socket,aes(x=X.SIZE,y=MEAN/X.SIZE,color="Map by socket")) +
geom_line(data=ring_socket,aes(x=X.SIZE,y=MEAN/X.SIZE,color="Map by socket",linetype="Real time"))+
geom_point(data=ring_model_socket,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by socket")) +
geom_line(data=ring_model_socket,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by socket",linetype="Theoretical time"))+
geom_point(data=ring_model_node,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by node")) +
geom_line(data=ring_model_node,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by node",linetype="Theoretical time"))+
geom_point(data=ring_node,aes(x=X.SIZE,y=MEAN/X.SIZE,color="Map by node")) +
geom_line(data=ring_node,aes(x=X.SIZE,y=MEAN/X.SIZE,color="Map by node",linetype="Real time"))+
scale_color_manual(name="Mapping", values = c(10,12,13))+
scale_linetype_manual(name="",values=c(1,2))+
guides(color = FALSE, size = FALSE,linetype= FALSE)+
geom_vline(xintercept=12,
color = "red",linetype="dashed", size=0.5)
```
```{r griglia, dpi=600, fig.width=13, fig.height=6,echo=FALSE,fig.cap="Ring on THIN node up to 24 cores"}
grid.arrange(sp1,sp,nrow=1)
```
```{r,include=FALSE, echo=FALSE}
library(ggplot2)
sp = ggplot(ring_core_big,aes(x=X.SIZE,y=MEAN,color="Map by core" )) +
scale_y_continuous(name="Time uSec",breaks = seq(0,180,by=10))+
scale_x_continuous(name="Processor number",breaks = ring_core_big$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust=0.5))+
geom_point(data=ring_socket_big,aes(x=X.SIZE,y=MEAN,color="Map by socket")) +
geom_line(data=ring_socket_big,aes(x=X.SIZE,y=MEAN,color="Map by socket" ))+
geom_point(data=ring_node_big,aes(x=X.SIZE,y=MEAN,color="Map by node")) +
geom_line(data=ring_node_big,aes(x=X.SIZE,y=MEAN,color="Map by node" ))+
scale_color_manual(name="Mapping", values = c(10,12,13,14))+
scale_linetype_manual(name="",values=c(1,2))+
theme(legend.position = c(0.1, 0.6))+
geom_vline(xintercept=12,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=24,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=36,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=48,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=60,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=72,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=84,
color = "red",linetype="dashed", size=0.5)
library(ggplot2)
sp2 = ggplot(ring_core_big,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by core" )) +
scale_y_continuous(name="Mean iteration time uSec",breaks = seq(0.4,1.7,0.1))+
scale_x_continuous(name="Processor number",breaks = ring_core_big$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust=0.5))+
geom_point(data=ring_socket_big,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by socket")) +
geom_line(data=ring_socket_big,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by socket" ))+
geom_point(data=ring_node_big,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by node")) +
geom_line(data=ring_node_big,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by node" ))+
scale_color_manual(name="Mapping", values = c(10,12,13,14))+
scale_linetype_manual(name="",values=c(1,2))+
guides(color = FALSE, size = FALSE,linetype= FALSE)+ geom_vline(xintercept=12,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=24,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=36,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=48,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=60,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=72,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=84,
color = "red",linetype="dashed", size=0.5)
```
\newpage
```{r , dpi=600, fig.width=10, fig.height=8,echo=FALSE,fig.cap="Ring on THIN node up to 96 cores and 4 nodes"}
grid.arrange(sp,sp2,nrow=2)
```
By running the ring across 4 nodes, the pattern above can be visualized. After 24 processes, when a socket or a node is saturated, the changes in the average iteration time become smaller. This is due to the fact that the communication time between nodes is the largest one, so it dominates the intra-socket and intra-core ones. Moreover, with more than 24 processes there is always at least one inter-node communication going on; the step in the plot confirms that.
\newpage
## Section 1.2: Matrix sum
In memory, a 3D array can be represented as a single linear array. The most effective way to sum it in parallel is to use collective operations.
The shape of the array makes no difference in how the array is represented in memory. Likewise, the virtual topology and its associated domain decomposition have no impact. Indeed, we can assume that the problem is not sensitive to any topology, and thus there is no need for communication between neighbours. The most efficient way to communicate and sum the matrices is to use the MPI_Scatterv and MPI_Gatherv collective routines.
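As an illustration of this scheme, here is a minimal self-contained sketch in C; the sizes, names and the contiguous 1D split are illustrative, not the actual program.
\scriptsize
```{c, eval=FALSE}
/* Minimal sketch: sum two flattened 3D arrays with MPI_Scatterv /
   MPI_Gatherv; sizes and variable names are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int n = 240 * 100 * 70;              /* flattened 3D extent */
  int *counts = malloc(nprocs * sizeof *counts);
  int *displs = malloc(nprocs * sizeof *displs);
  for (int i = 0; i < nprocs; i++) {         /* near-even contiguous split */
    counts[i] = n / nprocs + (i < n % nprocs);
    displs[i] = i ? displs[i - 1] + counts[i - 1] : 0;
  }

  double *A = NULL, *B = NULL, *C = NULL;
  if (rank == 0) {                           /* master owns the full arrays */
    A = malloc(n * sizeof *A);
    B = malloc(n * sizeof *B);
    C = malloc(n * sizeof *C);
    for (int i = 0; i < n; i++) { A[i] = 1.0; B[i] = 2.0; }
  }
  double *a = malloc(counts[rank] * sizeof *a);
  double *b = malloc(counts[rank] * sizeof *b);
  double *c = malloc(counts[rank] * sizeof *c);

  MPI_Scatterv(A, counts, displs, MPI_DOUBLE, a, counts[rank],
               MPI_DOUBLE, 0, MPI_COMM_WORLD);
  MPI_Scatterv(B, counts, displs, MPI_DOUBLE, b, counts[rank],
               MPI_DOUBLE, 0, MPI_COMM_WORLD);
  for (int i = 0; i < counts[rank]; i++)     /* the only parallel work */
    c[i] = a[i] + b[i];
  MPI_Gatherv(c, counts[rank], MPI_DOUBLE, C, counts, displs,
              MPI_DOUBLE, 0, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}
```
\normalsize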
The expectations for this code are not so high in terms of scalability: the sum of two matrices containing $n$ elements requires $2n$ memory reads, $n$ memory writes, a lot of communication and **only** $n$ floating point operations (in other words, like a Level 1 BLAS operation, we can expect the matrix sum to be memory bound). The parallel portion of the code is small with respect to the serial one. Using Amdahl's law, a prediction of the speedup bound can be made.
By testing the parallel code with two $2400\times1000\times700$ matrices of double values and by using only one processor, it can be seen that the parallel part takes only $2.63$ seconds of the overall runtime of $49$ seconds, representing about $5\%$ of the execution time (the elapsed time is measured using /usr/bin/time, the detailed Scatter and Gather times with MPI_Wtime()). The total execution time takes into account the matrix initialization and the error checking. The following graph shows the theoretical maximum speedup predicted by Amdahl's law assuming roughly $5\%$ of parallel code, while communication and serial code in the remaining part of the program are fixed with respect to the number of processors. The theoretical maximum speedup modelled is a good approximation and it catches the trend between $[1,24]$ processes.
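For reference, with parallel fraction $p$ and serial fraction $s=1-p$, Amdahl's law bounds the speedup on $N$ processors as
$$S(N)=\frac{1}{s+\frac{p}{N}} \xrightarrow{N\to\infty} \frac{1}{s},$$
so with a parallel fraction of only a few percent the asymptotic speedup stays barely above $1$, whatever the number of processors.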
```{r ,echo=FALSE,message=FALSE}
options(warn=-1)
core_matrix=read.csv("./matrix/matrix_no_topo/big/1_core.csv")
socket_matrix=read.csv("./matrix/matrix_no_topo/big/1_socket.csv")
core_matrix$X.np=core_matrix$X.np+1
socket_matrix$X.np=socket_matrix$X.np+1
core_matrix=core_matrix[c(1,4,8,12,16,20,24),]
socket_matrix=socket_matrix[c(1,4,8,12,16,20,24),]
rownames(core_matrix)=c()
rownames(socket_matrix)=c()
speed_teo=core_matrix
p=0.13
s=1-p
speed_teo$total=1/(s+p/speed_teo$X.np)
t1=kbl(core_matrix, booktabs = T, align = "c",escape=F, col.names = linebreak(c("N° procs", "Scatter\n(s)","Gather\n(s)","Parallel\n(s)","Total\n(s)"), align = "c")) %>%
column_spec(1, bold=T) %>%
column_spec(4, color = "white",
background = spec_color(core_matrix$parallel, end = 0.5,alpha = 0.2,option = "plasma"),
popover = paste("am:",core_matrix$parallel))
t2=kbl(socket_matrix, booktabs = T, align = "c",escape=F, col.names = linebreak(c("N° procs", "Scatter\n(s)","Gather\n(s)","Parallel\n(s)","Total\n(s)"), align = "c")) %>%
column_spec(1, bold=T) %>%
column_spec(4, color = "white",
background = spec_color(socket_matrix$parallel, end = 0.5,alpha = 0.2,option = "plasma"),
popover = paste("am:",socket_matrix$parallel))
knitr::kables(list(t1,t2),caption = "Matrix sum timings")
```
```{r,echo=FALSE, dpi=600, fig.width=10, fig.height=2}
sp = ggplot(socket_matrix,aes(x=X.np, y=socket_matrix$total[1]/socket_matrix$total, color="Socket")) +
scale_x_continuous(name="N procs",breaks=socket_matrix$X.np)+
scale_y_continuous(name="Speedup",breaks = seq(0.9,1.4,by=0.1),limit = c(0.9, 1.2)) + geom_point() + geom_line() +
theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1)) +
geom_point(data=speed_teo,aes(x=X.np,y=total,color="Teo")) + geom_line(data=speed_teo,aes(x=X.np,y=total,color="Teo"))+
geom_point(data=core_matrix,aes(x=X.np,y=core_matrix$total[1]/core_matrix$total,color="Core")) + geom_line(data=core_matrix,aes(x=X.np,y=core_matrix$total[1]/core_matrix$total,color="Core"))+ labs(title="Matrix sum speedup", subtitle="") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))+
scale_color_manual(name="Mapping", values = c(10,12,13))
```
```{r, dpi=600, fig.width=10, fig.height=4,echo=FALSE,message=FALSE}
sp
```
\newpage
Moreover, I have implemented a 3D matrix sum program that includes a 1D/2D/3D domain decomposition matching the virtual topology and that uses collective operations, as initially requested. This program incurs a very high overhead, since a collective operation (and communication routines in general) needs a contiguous memory region to communicate each subdomain to its worker. To achieve this, an unrolling is performed for each subdomain. Once the worker has received its subdomain, it processes the array and sends the summed array back to the master processor. The master processor then needs to reorder the subdomains and assemble the final matrix. As in the first program, scalability is poor, and this code shows even lower performance. In this latter case there are some changes in performance with respect to the matrix shape and the domain decomposition, but they are due to the matrix preparation overhead (buffering of the subdomains before scattering them) and not to the communication routines.
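The unrolling (packing) step described above can be sketched as follows; the row-major indexing and all names are illustrative.
\scriptsize
```{c, eval=FALSE}
/* Sketch of the "unrolling" step: copy one subdomain of a row-major
   3D array A of extent nx*ny*nz into a contiguous send buffer.
   (ox,oy,oz) is the subdomain origin, (sx,sy,sz) its extent;
   all names are illustrative. */
void pack_subdomain(const double *A, double *buf,
                    int ny, int nz,
                    int ox, int oy, int oz,
                    int sx, int sy, int sz) {
  long k = 0;
  for (int i = ox; i < ox + sx; i++)
    for (int j = oy; j < oy + sy; j++)
      for (int l = oz; l < oz + sz; l++)
        buf[k++] = A[((long)i * ny + j) * nz + l];
}
```
\normalsize
An alternative that would avoid the explicit copies is an MPI derived datatype such as MPI_Type_create_subarray, which lets the library address the strided subdomain directly.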
```{r, dpi=600, fig.width=14, fig.height=20,echo=FALSE,message=FALSE}
core_matrix_topo_1=read.csv("./matrix/matrix_collective/big/1_core.csv")
socket_matrix_topo_1=read.csv("./matrix/matrix_collective/big/1_socket.csv")
core_matrix_topo_2=read.csv("./matrix/matrix_collective/big/2_core.csv")
socket_matrix_topo_2=read.csv("./matrix/matrix_collective/big/2_socket.csv")
sp = ggplot(core_matrix_topo_1,aes(x=X.np, y=core_matrix_topo_1$total[1]/core_matrix_topo_1$total, color="Core",linetype="2400 1000 700")) +
scale_x_continuous(name="N procs",breaks=core_matrix_topo_1$X.np)+
scale_y_continuous(name="Speedup",breaks = seq(0.9,1.4,by=0.1),limit = c(0.95, 1.1)) + geom_point() + geom_line() +
theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1)) +
geom_point(data=socket_matrix_topo_1,aes(x=X.np,y=socket_matrix_topo_1$total[1]/socket_matrix_topo_1$total,color="Socket")) +
geom_line(data=socket_matrix_topo_1,aes(x=X.np,y=socket_matrix_topo_1$total[1]/socket_matrix_topo_1$total,color="Socket",linetype="2400 1000 700"))+ labs(title="Matrix sum speedup", subtitle="") +
geom_point(data=socket_matrix_topo_2,aes(x=X.np,y=socket_matrix_topo_2$total[1]/socket_matrix_topo_2$total,color="Socket")) + geom_line(data=socket_matrix_topo_2,aes(x=X.np,y=socket_matrix_topo_2$total[1]/socket_matrix_topo_2$total,color="Socket",linetype="1200 2000 700"))+
geom_point(data=core_matrix_topo_2,aes(x=X.np,y=core_matrix_topo_2$total[1]/core_matrix_topo_2$total,color="Core")) + geom_line(data=core_matrix_topo_2,aes(x=X.np,y=core_matrix_topo_2$total[1]/core_matrix_topo_2$total,color="Core",linetype="1200 2000 700"))+
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))+
scale_color_manual(name="Mapping and size", values = c(10,12,13,15))+
scale_linetype_manual(name="",values=c(1,2))
```
```{r, dpi=600, fig.width=10, fig.height=4,echo=FALSE,message=FALSE}
sp
```
\newpage
# Section 2: MPI point to point performance
In order to measure MPI point-to-point performance, the Intel MPI Benchmarks suite is used. According to Intel's documentation, the benchmark works as follows:
![PingPong structure from Intel® MPI Benchmarks User Guide.](./img/image01.png){#id2 .class width=50% height=35%}
The bandwidth and latency estimates are computed across *core*, *socket* and different *nodes*, combined with different protocols and hardware devices. Each setup of the program has been run 10 times, and **openMPI 4.0.3** has been used. One entire node was reserved in order to reduce possible sources of noise in the measurements.
The **pml** components involved in the benchmarks are **OB1** and **UCX**; the **btl** components used are **tcp** and **vader** (these are selected with the usual Open MPI `--mca pml` and `--mca btl` options). For the measurements across nodes, different networks with different protocols have also been selected: $25$ Gbit Ethernet and $100$ Gbit InfiniBand through a Mellanox network switch.
```{r , include=FALSE,echo=FALSE}
node_ib=read.csv("./pingpong/ompi/node_ib.out.csv", header = TRUE)
node_ob1_tcp=read.csv("./pingpong/ompi/node_ob1_selftcp.out.csv", header = TRUE)
node_ucx_br0=read.csv("./pingpong/ompi/node_ucx_br0.out.csv", header = TRUE)
node_ucx_ib0=read.csv("./pingpong/ompi/node_ucx_ib0.out.csv", header = TRUE)
node_ucx__mlx5=read.csv("./pingpong/ompi/node_ucx_mlx5.out.csv", header = TRUE)
socket_ib=read.csv("./pingpong/ompi/socket_ib.out.csv", header = TRUE)
socket_ob1_tcp=read.csv("./pingpong/ompi/socket_ob1_selftcp.out.csv", header = TRUE)
socket_ob1_vader=read.csv("./pingpong/ompi/socket_ob1_selfvader.out.csv", header = TRUE)
core_ib=read.csv("./pingpong/ompi/core_ib.out.csv", header = TRUE)
core_ob1_tcp=read.csv("./pingpong/ompi/core_ob1_selftcp.out.csv", header = TRUE)
core_ob1_vader=read.csv("./pingpong/ompi/core_ob1_selfvader.out.csv", header = TRUE)
node_ib_intel=read.csv("./pingpong/intel/node_ib_intel.out.csv", header = TRUE)
socket_ib_intel=read.csv("./pingpong/intel/socket_ib_intel.out.csv", header = TRUE)
core_ib_intel=read.csv("./pingpong/intel/core_ib_intel.out.csv", header = TRUE)
```
```{r , dpi=600, fig.width=10, fig.height=6,echo=FALSE,include=FALSE}
library(ggplot2)
sp_thin = ggplot(core_ib,aes(x=X.bytes,y=mbs,color="UCX IB",linetype="Core")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=core_ib$X.bytes) +
geom_point() + geom_line()+scale_y_continuous(breaks = seq(0,25000,1000),name="MB/s") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=core_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp")) +
geom_line(data=core_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp",linetype="Core")) +
geom_point(data=core_ob1_vader,aes(x=X.bytes,y=mbs,color="OB1 vader")) +
geom_line(data=core_ob1_vader,aes(x=X.bytes,y=mbs,color="OB1 vader",linetype="Core")) +
geom_point(data=socket_ib,aes(x=X.bytes,y=mbs,color="UCX IB")) +
geom_line(data=socket_ib,aes(x=X.bytes,y=mbs,color="UCX IB",linetype="Socket")) +
geom_point(data=socket_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp")) +
geom_line(data=socket_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp",linetype="Socket")) +
geom_point(data=socket_ob1_vader,aes(x=X.bytes,y=mbs,color="OB1 vader")) +
geom_line(data=socket_ob1_vader,aes(x=X.bytes,y=mbs,color="OB1 vader",linetype="Socket")) +
geom_point(data=socket_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB")) +
geom_line(data=socket_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB",linetype="Socket")) +
geom_point(data=core_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB")) +
geom_line(data=core_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB",linetype="Core"))+
scale_color_manual(name="Protocol", values = c("#F8766D", "#7CAE00", "#00BFC4" ,"#C77CFF"))+
scale_linetype_manual(name="Mapping",values=c(1,2,3,4))+
labs(title="PingPong bandwidth", subtitle="Thin node,core and socket mapping") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))+
geom_vline(xintercept=32768,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=1048576,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=19922944,
color = "black",linetype="dashed", size=0.35)+
annotate("text", x=58000, y=24000, label= "L1",color="Black")+
annotate("text", x=1800000, y=24000, label= "L2",color="Black")+
annotate("text", x=29900000, y=24000, label= "L3",color="Black")
```
```{r aa, dpi=600, fig.width=10, fig.height=6,echo=FALSE}
options(warn=-1)
#sp
```
```{r,echo=FALSE}
socket_ib_gpu=read.csv("./pingpong/ompi_gpu/socket_ib.out.csv", header = TRUE)
socket_ob1_tcp_gpu=read.csv("./pingpong/ompi_gpu/socket_ob1_selftcp.out.csv", header = TRUE)
socket_ob1_vader_gpu=read.csv("./pingpong/ompi_gpu/socket_ob1_selfvader.out.csv", header = TRUE)
socket_ib_intel_gpu=read.csv("./pingpong/intel_gpu/socket_ib_intel.out.csv", header = TRUE)
core_ib_intel_gpu=read.csv("./pingpong/intel_gpu/core_ib_intel.out.csv", header = TRUE)
core_ib_gpu=read.csv("./pingpong/ompi_gpu/core_ib.out.csv", header = TRUE)
core_ob1_tcp_gpu=read.csv("./pingpong/ompi_gpu/core_ob1_selftcp.out.csv", header = TRUE)
core_ob1_vader_gpu=read.csv("./pingpong/ompi_gpu/core_ob1_selfvader.out.csv", header = TRUE)
sp_gpu = ggplot(core_ib_gpu,aes(x=X.bytes,y=mbs,color="UCX IB",linetype="Core")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=core_ib$X.bytes) +
geom_point() + geom_line()+scale_y_continuous(breaks = seq(0,22000,1000),name="MB/s") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=core_ob1_tcp_gpu,aes(x=X.bytes,y=mbs,color="OB1 tcp")) +
geom_line(data=core_ob1_tcp_gpu,aes(x=X.bytes,y=mbs,color="OB1 tcp",linetype="Core")) +
geom_point(data=core_ob1_vader_gpu,aes(x=X.bytes,y=mbs,color="OB1 vader")) +
geom_line(data=core_ob1_vader_gpu,aes(x=X.bytes,y=mbs,color="OB1 vader",linetype="Core")) +
geom_point(data=socket_ib_gpu,aes(x=X.bytes,y=mbs,color="UCX IB")) +
geom_line(data=socket_ib_gpu,aes(x=X.bytes,y=mbs,color="UCX IB",linetype="Socket")) +
geom_point(data=socket_ob1_tcp_gpu,aes(x=X.bytes,y=mbs,color="OB1 tcp")) +
geom_line(data=socket_ob1_tcp_gpu,aes(x=X.bytes,y=mbs,color="OB1 tcp",linetype="Socket")) +
geom_point(data=socket_ob1_vader_gpu,aes(x=X.bytes,y=mbs,color="OB1 vader")) +
geom_line(data=socket_ob1_vader_gpu,aes(x=X.bytes,y=mbs,color="OB1 vader",linetype="Socket")) +
geom_point(data=socket_ib_intel_gpu,aes(x=X.bytes,y=mbs,color="Intel IB")) +
geom_line(data=socket_ib_intel_gpu,aes(x=X.bytes,y=mbs,color="Intel IB",linetype="Socket")) +
geom_point(data=core_ib_intel_gpu,aes(x=X.bytes,y=mbs,color="Intel IB")) +
geom_line(data=core_ib_intel_gpu,aes(x=X.bytes,y=mbs,color="Intel IB",linetype="Core"))+
scale_color_manual(name="Protocol", values = c("#F8766D", "#7CAE00", "#00BFC4" ,"#C77CFF"))+
scale_linetype_manual(name="Mapping",values=c(1,2,3,4))+
labs(title="PingPong bandwidth", subtitle="Gpu node,core and socket mapping") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))+
geom_vline(xintercept=32768,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=1048576,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=19922944,
color = "black",linetype="dashed", size=0.35)+
annotate("text", x=58000, y=20000, label= "L1",color="Black")+
annotate("text", x=1800000, y=20000, label= "L2",color="Black")+
annotate("text", x=29900000, y=20000, label= "L3",color="Black")
```
```{r , dpi=600, fig.width=10, fig.height=6,echo=FALSE}
options(warn=-1)
#sp
```
The following graphs show the behaviour inside a single node, i.e. mapping the two processes across two sockets or within the same socket. Mapping the processes in the same socket often shows better performance, as expected. The behaviour before the asymptotic plateau is irregular and shows very different performance among the implementations; the main cause of this is the cache.
Before analyzing the performance we need to know more about the node topology, shown in Figure 6.
![lstopo output on THIN node.](./img/lstopo.png){#idasd .class width=50% height=50% align=center}
Cache sizes are marked in the following graphs with vertical black dashed lines.
As can be noticed in the following two graphs, the bandwidth behaviour appears irregular up to and including $16$ MB. After that point the behaviour is stable, because the message size is larger than all the caches; the largest cache is **L3**, with $19$ MB.
The effect of **L2** becomes clear after $1$ MB: beyond that size, all implementations start losing bandwidth.
The effect of **L1** is not clearly visible from the bandwidth due to the latency. To reason more accurately about the cache effects we can perform some profiling tests using hardware counters. I have modified the Intel MPI Benchmarks PingPong, injecting some code in order to inspect L1 data cache misses and L2 cache misses through PAPI. I tracked all cache misses in the MPI_Send and MPI_Recv routines of rank 0, after some warm-up cycles.
The code is available here https://github.com/NiccoloTosato/HPC_assignment1/tree/main/mpi-benchmarks_mod
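The injected instrumentation follows this pattern; this is a simplified sketch with error checking omitted, and the placement around the timed loop is illustrative.
\scriptsize
```{c, eval=FALSE}
/* Simplified sketch of the PAPI instrumentation injected into the
   PingPong benchmark; error checking omitted. */
#include <papi.h>

int evset = PAPI_NULL;
long long misses[2];

PAPI_library_init(PAPI_VER_CURRENT);
PAPI_create_eventset(&evset);
PAPI_add_event(evset, PAPI_L1_DCM);   /* L1 data cache misses  */
PAPI_add_event(evset, PAPI_L2_TCM);   /* L2 total cache misses */

PAPI_start(evset);
/* ... the (already warmed-up) MPI_Send / MPI_Recv loop of rank 0 ... */
PAPI_stop(evset, misses);             /* misses[0] = L1, misses[1] = L2 */
```
\normalsize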
The cache misses are reported both in absolute value and normalized with respect to the message size.
Before $128$ KB, UCX shows poor performance with respect to the other protocols, while after that size a huge speed-up is present; this fact will be explained later. Intel InfiniBand also shows poor performance with respect to the UCX implementation, and this behaviour too is explained by the cache.
```{r, dpi=600, fig.width=10, fig.height=6,echo=FALSE,include=FALSE}
ob1_core_tcp=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/core_ob1_selftcp_cache_misses.out.csv", header = TRUE)
ob1_core_tcp$L1=ob1_core_tcp$L1/(ob1_core_tcp$ITERATION)
ob1_core_tcp$L2=ob1_core_tcp$L2/(ob1_core_tcp$ITERATION)
ob1_core_vader=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/core_ob1_selfvader_cache_misses.out.csv", header = TRUE)
ob1_core_vader$L1=ob1_core_vader$L1/(ob1_core_vader$ITERATION)
ob1_core_vader$L2=ob1_core_vader$L2/(ob1_core_vader$ITERATION)
ucx_core=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/core_ib_cache_misses.out.csv", header = TRUE)
ucx_core$L1=ucx_core$L1/(ucx_core$ITERATION)
ucx_core$L2=ucx_core$L2/(ucx_core$ITERATION)
ob1_socket_tcp=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/socket_ob1_selftcp_cache_misses.out.csv", header = TRUE)
ob1_socket_tcp$L1=ob1_socket_tcp$L1/(ob1_socket_tcp$ITERATION)
ob1_socket_tcp$L2=ob1_socket_tcp$L2/(ob1_socket_tcp$ITERATION)
ob1_socket_vader=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/socket_ob1_selfvader_cache_misses.out.csv", header = TRUE)
ob1_socket_vader$L1=ob1_socket_vader$L1/(ob1_socket_vader$ITERATION)
ob1_socket_vader$L2=ob1_socket_vader$L2/(ob1_socket_vader$ITERATION)
ucx_socket=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/socket_ib_cache_misses.out.csv", header = TRUE)
ucx_socket$L1=ucx_socket$L1/(ucx_socket$ITERATION)
ucx_socket$L2=ucx_socket$L2/(ucx_socket$ITERATION)
intel_core=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/intel_core_cache_misses.out.csv", header = TRUE)
intel_core$L1=intel_core$L1/intel_core$ITERATION
intel_core$L2=intel_core$L2/intel_core$ITERATION
intel_socket=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/intel_socket_cache_misses.out.csv", header = TRUE)
intel_socket$L1=intel_socket$L1/intel_socket$ITERATION
intel_socket$L2=intel_socket$L2/intel_socket$ITERATION
intel_node=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/intel_node_cache_misses.out.csv", header = TRUE)
intel_node$L1=intel_node$L1/intel_node$ITERATION
ucx_node_ib=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/node_ib_cache_misses.out.csv", header = TRUE)
ucx_node_ib$L1=ucx_node_ib$L1/ucx_node_ib$ITERATION
ucx_node_br0=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/node_ucx_br0_cache_misses.out.csv", header = TRUE)
ucx_node_br0$L1=ucx_node_br0$L1/ucx_node_br0$ITERATION
ucx_node_ib0=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/node_ucx_ib0_cache_misses.out.csv", header = TRUE)
ucx_node_ib0$L1=ucx_node_ib0$L1/ucx_node_ib0$ITERATION
ob1_node_tcp=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/node_ob1_selftcp_cache_misses.out.csv", header = TRUE)
ob1_node_tcp$L1=ob1_node_tcp$L1/ob1_node_tcp$ITERATION
```
```{r, dpi=600, fig.width=10, fig.height=6,echo=TRUE,include=FALSE}
gg_color_hue <- function(n) {
hues = seq(15, 375, length = n + 1)
hcl(h = hues, l = 65, c = 100)[1:n]
}
sp = ggplot(ob1_core_vader,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss OB1",linetype="Core")) +
scale_x_continuous(trans='log2',name="Message Size Bytes", breaks=ob1_core_vader$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_y_continuous(trans='log2',name="Log(Cache miss)") +
geom_point(data=ob1_core_vader,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss OB1")) +
geom_line(data=ob1_core_vader,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss OB1",linetype="Core")) +
geom_point(data=ob1_socket_vader,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss OB1")) +
geom_line(data=ob1_socket_vader,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss OB1",linetype="Socket")) +
geom_point(data=ob1_socket_vader,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss OB1")) +
geom_line(data=ob1_socket_vader,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss OB1",linetype="Socket")) +
geom_point(data=ucx_socket,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss UCX"))+
geom_line(data=ucx_socket,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss UCX",linetype="Socket")) +
geom_point(data=ucx_socket,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss UCX")) +
geom_line(data=ucx_socket,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss UCX",linetype="Socket")) +
geom_point(data=ucx_core,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss UCX")) +
geom_line(data=ucx_core,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss UCX",linetype="Core"))+
geom_point(data=ucx_core,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss UCX")) +
geom_line(data=ucx_core,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss UCX",linetype="Core")) +
geom_point(data=intel_core,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss Intel")) +
geom_line(data=intel_core,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss Intel",linetype="Core")) +
geom_point(data=intel_core,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss Intel")) +
geom_line(data=intel_core,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss Intel",linetype="Core")) +
geom_point(data=intel_socket,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss Intel")) +
geom_line(data=intel_socket,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss Intel",linetype="Socket")) +
geom_point(data=intel_socket,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss Intel")) +
geom_line(data=intel_socket,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss Intel",linetype="Socket")) +
scale_color_manual(name="Protocol", values = c( "#F8766D", "#B79F00" ,"#00BA38", "#00BFC4", "#619CFF" ,"#F564E3"))+
scale_linetype_manual(name="Mapping",values=c(1,2))+
geom_vline(xintercept=32768,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=1048576,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=19922944,
color = "black",linetype="dashed", size=0.35)+
theme(axis.text.y =element_blank())+
annotate("text", x=58000, y=24000, label= "L1",color="Black")+
annotate("text", x=1800000, y=24000, label= "L2",color="Black")+
annotate("text", x=29900000, y=24000, label= "L3",color="Black")+
labs(title="PingPong cache misses", subtitle="Thin node,relative cache misses") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))
sp1 = ggplot(ob1_core_vader,aes(x=X.SIZE,y=L1,color="L1 miss OB1",linetype="Core")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=ob1_core_vader$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_y_continuous(trans='log2',name="Log(Cache misses)") +
geom_point(data=ob1_core_vader,aes(x=X.SIZE,y=L2,color="L2 miss OB1")) +
geom_line(data=ob1_core_vader,aes(x=X.SIZE,y=L2,color="L2 miss OB1",linetype="Core")) +
geom_point(data=ob1_socket_vader,aes(x=X.SIZE,y=L1,color="L1 miss OB1")) +
geom_line(data=ob1_socket_vader,aes(x=X.SIZE,y=L1,color="L1 miss OB1",linetype="Socket")) +
geom_point(data=ob1_socket_vader,aes(x=X.SIZE,y=L2,color="L2 miss OB1")) +
geom_line(data=ob1_socket_vader,aes(x=X.SIZE,y=L2,color="L2 miss OB1",linetype="Socket")) +
geom_point(data=ucx_socket,aes(x=X.SIZE,y=L1,color="L1 miss UCX"))+
geom_line(data=ucx_socket,aes(x=X.SIZE,y=L1,color="L1 miss UCX",linetype="Socket")) +
geom_point(data=ucx_socket,aes(x=X.SIZE,y=L2,color="L2 miss UCX")) +
geom_line(data=ucx_socket,aes(x=X.SIZE,y=L2,color="L2 miss UCX",linetype="Socket")) +
geom_point(data=ucx_core,aes(x=X.SIZE,y=L2,color="L2 miss UCX")) +
geom_line(data=ucx_core,aes(x=X.SIZE,y=L2,color="L2 miss UCX",linetype="Core"))+
geom_point(data=ucx_core,aes(x=X.SIZE,y=L1,color="L1 miss UCX")) +
geom_line(data=ucx_core,aes(x=X.SIZE,y=L1,color="L1 miss UCX",linetype="Core")) +
geom_point(data=intel_core,aes(x=X.SIZE,y=L1,color="L1 miss Intel")) +
geom_line(data=intel_core,aes(x=X.SIZE,y=L1,color="L1 miss Intel",linetype="Core")) +
geom_point(data=intel_core,aes(x=X.SIZE,y=L2,color="L2 miss Intel")) +
geom_line(data=intel_core,aes(x=X.SIZE,y=L2,color="L2 miss Intel",linetype="Core")) +
geom_point(data=intel_socket,aes(x=X.SIZE,y=L1,color="L1 miss Intel")) +
geom_line(data=intel_socket,aes(x=X.SIZE,y=L1,color="L1 miss Intel",linetype="Socket")) +
geom_point(data=intel_socket,aes(x=X.SIZE,y=L2,color="L2 miss Intel")) +
geom_line(data=intel_socket,aes(x=X.SIZE,y=L2,color="L2 miss Intel",linetype="Socket")) +
scale_color_manual(name="Protocol", values = c( "#F8766D", "#B79F00" ,"#00BA38", "#00BFC4", "#619CFF" ,"#F564E3"))+
scale_linetype_manual(name="Mapping",values=c(1,2)) +
geom_vline(xintercept=32768,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=1048576,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=19922944,
color = "black",linetype="dashed", size=0.35)+
theme(axis.text.y =element_blank())+
annotate("text", x=58000, y=24000, label= "L1",color="Black")+
annotate("text", x=1800000, y=24000, label= "L2",color="Black")+
annotate("text", x=29900000, y=24000, label= "L3",color="Black")+
labs(title="PingPong cache misses", subtitle="Thin node,absolute cache misses") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))
sp3 = ggplot(intel_node,aes(x=X.SIZE,y=L1,color="L1 miss Intel",linetype="Node")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=ob1_core_vader$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_y_continuous(trans='log2',name="Log(Normalized Cache misses)") +
geom_point(data=intel_node,aes(x=X.SIZE,y=L2,color="L2 miss Intel")) +
geom_line(data=intel_node,aes(x=X.SIZE,y=L2,color="L2 miss Intel",linetype="Node")) +
geom_point(data=ucx_node_ib,aes(x=X.SIZE,y=L1,color="L1 miss UCX")) +
geom_line(data=ucx_node_ib,aes(x=X.SIZE,y=L1,color="L1 miss UCX",linetype="Node")) +
geom_point(data=ucx_node_ib,aes(x=X.SIZE,y=L2,color="L2 miss UCX")) +
geom_line(data=ucx_node_ib,aes(x=X.SIZE,y=L2,color="L2 miss UCX",linetype="Node")) +
geom_point(data=ucx_node_br0,aes(x=X.SIZE,y=L1,color="L1 miss UCX br0"))+
geom_line(data=ucx_node_br0,aes(x=X.SIZE,y=L1,color="L1 miss UCX br0",linetype="Node")) +
geom_point(data=ucx_node_br0,aes(x=X.SIZE,y=L2,color="L2 miss UCX br0")) +
geom_line(data=ucx_node_br0,aes(x=X.SIZE,y=L2,color="L2 miss UCX br0",linetype="Node")) +
geom_point(data=ucx_node_ib0,aes(x=X.SIZE,y=L2,color="L2 miss UCX ib0")) +
geom_line(data=ucx_node_ib0,aes(x=X.SIZE,y=L2,color="L2 miss UCX ib0",linetype="Node"))+
geom_point(data=ucx_node_ib0,aes(x=X.SIZE,y=L1,color="L1 miss UCX ib0")) +
geom_line(data=ucx_node_ib0,aes(x=X.SIZE,y=L1,color="L1 miss UCX ib0",linetype="Node")) +
geom_point(data=ob1_node_tcp,aes(x=X.SIZE,y=L1,color="L1 miss OB1")) +
geom_line(data=ob1_node_tcp,aes(x=X.SIZE,y=L1,color="L1 miss OB1",linetype="Node")) +
geom_point(data=ob1_node_tcp,aes(x=X.SIZE,y=L2,color="L2 miss OB1")) +
geom_line(data=ob1_node_tcp,aes(x=X.SIZE,y=L2,color="L2 miss OB1",linetype="Node")) +
scale_color_manual(name="Protocol", values = c( "#F8766D", "#D89000", "#A3A500", "#39B600", "#00BF7D", "#00BFC4", "#00B0F6", "#9590FF", "#E76BF3", "#FF62BC"))+
scale_linetype_manual(name="Mapping",values=c(1,2)) +
geom_vline(xintercept=32768,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=1048576,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=19922944,
color = "black",linetype="dashed", size=0.35)+
theme(axis.text.y =element_blank())+
annotate("text", x=58000, y=24000, label= "L1",color="Black")+
annotate("text", x=1800000, y=24000, label= "L2",color="Black")+
annotate("text", x=29900000, y=24000, label= "L3",color="Black")+
labs(title="PingPong cache misses", subtitle="Thin node,node mapping") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))
```
```{r, dpi=600, fig.width=10, fig.height=5.7,echo=FALSE}
sp_thin
sp_gpu
```
```{r, dpi=600, fig.width=10, fig.height=6,echo=FALSE}
sp1
sp
```
The behaviour of the Intel implementation can now be seen more clearly: it has the largest number of cache misses among all configurations. The bandwidth spike of **UCX** after $128$ KB can be explained by the drop in **L1** and **L2** cache misses. The **OB1** implementation seems to be the most cache-friendly *PML*. Disclaimer: these graphs should be read qualitatively, not quantitatively.
```{r,echo=FALSE,include=FALSE}
node_ib=read.csv("./pingpong/ompi/node_ib.out.csv", header = TRUE)
node_ob1_tcp=read.csv("./pingpong/ompi/node_ob1_selftcp.out.csv", header = TRUE)
node_ucx_br0=read.csv("./pingpong/ompi/node_ucx_br0.out.csv", header = TRUE)
node_ucx_ib0=read.csv("./pingpong/ompi/node_ucx_ib0.out.csv", header = TRUE)
node_ucx__mlx5=read.csv("./pingpong/ompi/node_ucx_mlx5.out.csv", header = TRUE)
node_ib_intel=read.csv("./pingpong/intel/node_ib_intel.out.csv", header = TRUE)
```
```{r, dpi=600, fig.width=9, fig.height=5,echo=TRUE,include=FALSE}
library(ggplot2)
sp = ggplot(node_ib,aes(x=X.bytes,y=mbs,color="UCX IB")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=core_ib$X.bytes)+
scale_y_continuous(breaks = seq(0,13000,500),name="MB/s") + geom_point() + geom_line() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=node_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp")) + geom_line(data=node_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp"))+ geom_point(data=node_ucx_br0,aes(x=X.bytes,y=mbs,color="UCX br0")) + geom_line(data=node_ucx_br0,aes(x=X.bytes,y=mbs,color="UCX br0")) + geom_point(data=node_ucx_ib0,aes(x=X.bytes,y=mbs,color="UCX ib0")) + geom_line(data=node_ucx_ib0,aes(x=X.bytes,y=mbs,color="UCX ib0")) + geom_point(data=node_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB")) + geom_line(data=node_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB")) +
labs(title="PingPong bandwidth", subtitle="Thin node,node mapping") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))
```
```{r , dpi=600, fig.width=10, fig.height=6,echo=FALSE}
sp
```
The inter-node benchmarks point out the network performance and topology. Two physical networks are available: $25$ Gbit Ethernet and $100$ Gbit InfiniBand.
The PCI-E devices are visible from lstopo. We can see that em1 and em2 are bonded together, forming the interface bond0. To infer this configuration we can use the ifconfig and ip link commands.
\scriptsize
```{bash, eval=FALSE}
[s271550@ct1pt-tnode007 etc]$ ifconfig
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
em1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
em2: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 00:00:09:07:FE:80:00: ..... txqueuelen 256 (InfiniBand)
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
```
```{bash,eval=FALSE}
[s271550@ct1pt-tnode007 etc]$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
2: em1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
3: em2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 256
link/infiniband ....
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP mode DEFAULT group default qlen 1000
6: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
```
\normalsize
\newpage
With the openMPI implementation and UCX, we can directly select the devices, and hence the protocol (e.g. through the `UCX_NET_DEVICES` environment variable). The devices tested are ib0, br0 and mlx5_0:1: ib0 uses the IPoIB protocol, br0 leads to TCP communication, and mlx5_0:1 is the pure native InfiniBand device.
The theoretical maximum performances are $12.5$ GB/s, or $12800$ MB/s, for InfiniBand and $3.125$ GB/s, or $3200$ MB/s, for the Ethernet network.
The experimentally measured asymptotic bandwidths are reported below.
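They are obtained by fitting the usual linear communication model to the measured times,
$$T(m)=\lambda+\frac{m}{B},$$
where the small-message points give the latency $\lambda$ (the intercept of the fit) and the inverse slope on large messages gives the asymptotic bandwidth $B$; this is what the `calculate_fit` helper in the analysis code computes.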
```{r,include=FALSE,echo=FALSE}
calculate_fit=function(d){
  # Fit t = lambda + bytes/B: the small-message fit gives the latency
  # (intercept), the large-message fit gives the asymptotic bandwidth
  # (inverse slope).
  fit1=lm(data=d[1:15,],formula = t~X.bytes)
  fit2=lm(data=d[15:30,],formula = t~X.bytes)
  lambda=fit1$coefficients[1]
  b=(fit2$coefficients[2]^-1)
  d$t_est=lambda+d$X.bytes/b
  d$b_est=d$X.bytes/d$t_est
  return(d)
}
plot_fit=function( d,mapping,mapping_2 ){
mapping_2=paste0(mapping_2, " \nlatency=" ,round(d$t_est[1],digits = 2)," uSec\n bandwidth=",round(d$b_est[30],digits = 2)," MB/s")
sp = ggplot(d,aes(x=d$X.bytes,y=mbs,color="Real")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=d$X.bytes) +
geom_point() + geom_line()+scale_y_continuous(name="MB/s") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated")) +
geom_line(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated"))+
labs(title=mapping, subtitle=mapping_2) +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold")) +
theme(
# panel.background = element_rect(fill = "#ebeef0", colour = "#d5dce0",
# size = 2, linetype = "solid"),
panel.grid.major = element_line(size = 0.6, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.35, linetype = 'solid',
colour = "white")
)+scale_color_manual(name="",values=c("#F8766D","#00BFC4"))
sp
}
get_plot_fit=function( d,mapping,mapping_2 ){
mapping_2=paste0(mapping_2, " \nlatency=" ,round(d$t_est[1],digits = 2)," uSec\n bandwidth=",round(d$b_est[30],digits = 2)," MB/s")
sp = ggplot(d,aes(x=d$X.bytes,y=mbs,color="Real")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=d$X.bytes) +
geom_point() + geom_line()+scale_y_continuous(name="MB/s") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated")) +
geom_line(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated"))+
labs(title=mapping, subtitle=mapping_2) +
theme(plot.title = element_text(size = 10, face = "bold",hjust = 0,margin = margin(t = 5)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 4, b = 5), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 4, b = 4, r=12)), axis.title.x = element_text(margin = margin(t = 5, b=5)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold")) +
theme(
panel.grid.major = element_line(size = 0.6, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.35, linetype = 'solid',
colour = "white")
)+scale_color_manual(name="",values=c("#F8766D","#00BFC4"))+
guides(color = "none", size = "none", linetype = "none")
return(sp)
}
# compact variant with the legend drawn inside the panel (used for the first tile)
get_plot_fit_legend=function( d,mapping,mapping_2 ){
mapping_2=paste0(mapping_2, " \nlatency=" ,round(d$t_est[1],digits = 2)," uSec\n bandwidth=",round(d$b_est[30],digits = 2)," MB/s")
sp = ggplot(d,aes(x=d$X.bytes,y=mbs,color="Real")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=d$X.bytes) +
geom_point() + geom_line()+scale_y_continuous(name="MB/s") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated")) +
geom_line(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated"))+
labs(title=mapping, subtitle=mapping_2) +
theme(plot.title = element_text(size = 10, face = "bold",hjust = 0,margin = margin(t = 5)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 4, b = 5), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 4, b = 4, r=12)), axis.title.x = element_text(margin = margin(t = 5, b=5)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold")) +
theme(
panel.grid.major = element_line(size = 0.6, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.35, linetype = 'solid',
colour = "white")
)+scale_color_manual(name="",values=c("#F8766D","#00BFC4"))+
theme(legend.position = c(0.2, 0.8))
return(sp)
}
```
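The estimated curves shown below come from a simple linear ping-pong model: the smaller half of the message sizes determines the latency $\lambda$ (the intercept of the fit) and the larger half the asymptotic bandwidth $B$ (the inverse of the slope),

$$
t(s) \simeq \lambda + \frac{s}{B}, \qquad b_{\mathrm{est}}(s) = \frac{s}{t(s)}.
$$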
```{r, dpi=600, fig.width=15, fig.height=12.5,echo=FALSE}
# map-by-node fits on the THIN node, one per transport/device
node_ib_fit=calculate_fit(node_ib)
sp1=get_plot_fit_legend(node_ib_fit,mapping="Map by node, THIN node",mapping_2="UCX ib")
node_ob1_tcp_fit=calculate_fit(node_ob1_tcp)
sp2=get_plot_fit(node_ob1_tcp_fit,mapping="Map by node, THIN node",mapping_2="OB1 tcp")
node_ucx_br0_fit=calculate_fit(node_ucx_br0)
sp3=get_plot_fit(node_ucx_br0_fit,mapping="Map by node, THIN node",mapping_2="UCX interface br0")
node_ucx_ib0_fit=calculate_fit(node_ucx_ib0)
sp4=get_plot_fit(node_ucx_ib0_fit,mapping="Map by node, THIN node",mapping_2="UCX interface ib0")
node_ib_intel_fit=calculate_fit(node_ib_intel)
sp5=get_plot_fit(node_ib_intel_fit,mapping="Map by node, THIN node",mapping_2="Intel ib")
grid.arrange(sp1,sp2,sp3,sp4,sp5,nrow=3)
# collect the fitted latency/bandwidth values for the summary of results
df_node=data.frame(name=c("UCX IB","OB1 tcp","UCX br0","UCX ib0","Intel ib"),
latency=c(node_ib_fit$t_est[1],node_ob1_tcp_fit$t_est[1],node_ucx_br0_fit$t_est[1],node_ucx_ib0_fit$t_est[1],node_ib_intel_fit$t_est[1]),
bandwith=c(node_ib_fit$b_est[30],node_ob1_tcp_fit$b_est[30],node_ucx_br0_fit$b_est[30],node_ucx_ib0_fit$b_est[30],node_ib_intel_fit$b_est[30]))
```
UCX and Intel MPI over InfiniBand show the best performance among all the configurations, both in latency and in bandwidth, with UCX slightly ahead of Intel on latency. No big cache effects are visible, since RDMA bypasses the caches, the CPU and the kernel; this also explains why the bandwidth is higher here than with core and socket mapping. The measured InfiniBand bandwidth is about $95\%$ of the theoretical one, a very good result given the $64b/66b$ encoding.
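
Indeed, the $64b/66b$ encoding alone caps the effective data rate of the $100$ Gb/s link at

$$
12.5\ \mathrm{GB/s} \times \frac{64}{66} \approx 12.1\ \mathrm{GB/s},
$$

i.e. about $97\%$ of the nominal bandwidth, so a measured $\sim 95\%$ is close to the encoding-limited maximum.
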
UCX on br0 and OB1 with TCP show comparable latency, but OB1 achieves a better bandwidth. The measured maximum is about $85\%$ of the theoretical bandwidth, a good result considering that TCP is a heavier protocol than InfiniBand (ACKs, handshakes, encoding, ...), and that transport overhead plus some inefficiency introduced by the cache and the CPU (no RDMA available) are also present.
IPoIB shows good latency (no Ethernet switch is involved), but its bandwidth is lower than one would expect.
GPU nodes behave like THIN nodes, so no qualitative difference is visible: the GPU node is only slightly slower, which can be attributed to its different CPU frequency and node configuration. The cache sizes are the same as on THIN nodes, so the cache effects are similar. A summary of the fitting results on the THIN and GPU nodes is presented on the next page.
```{r, dpi=600, fig.width=14, fig.height=20,echo=FALSE}
# calculate_fit() and the plotting helpers defined above are reused here
# map-by-core fits on THIN and GPU nodes; the individual plots are omitted,
# only the fitted latency/bandwidth values enter the summary of results
core_ib_fit=calculate_fit(core_ib)
core_ob1_tcp_fit=calculate_fit(core_ob1_tcp)
core_ob1_vader_fit=calculate_fit(core_ob1_vader)
core_ib_intel_fit=calculate_fit(core_ib_intel)
core_ib_gpu_fit=calculate_fit(core_ib_gpu)
core_ib_intel_gpu_fit=calculate_fit(core_ib_intel_gpu)
core_ob1_tcp_gpu_fit=calculate_fit(core_ob1_tcp_gpu)
core_ob1_vader_gpu_fit=calculate_fit(core_ob1_vader_gpu)