---
title: "Assignment 1"
author: "Niccolò Tosato"
output:
  pdf_document: default
  html_document:
    df_print: paged
---
```{r setup, include=FALSE}
library(ggplot2)
library(gtable)
library(gridExtra)
library("kableExtra")
knitr::opts_chunk$set(echo = TRUE)
```
**Repository** with code and scripts: https://github.com/NiccoloTosato/HPC_assignment1
# Section 1: MPI programming
## Section 1.1: 1D ring
The ring has been implemented using non-blocking communication routines and a 1D virtual topology with periodic boundaries.
The non-blocking implementation leads to a linear growth of the execution time of the program; indeed, the execution time of a single iteration can be modeled roughly as a *double PingPing*.
The real performance will of course be worse than the ideal one, since PingPing involves only 2 processes and does not consider crowded configurations with more than 2 processes. However, it provides an ideal lower-bound model to estimate the expected performance.
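The communication pattern can be sketched as follows. This is a minimal sketch in C, assuming a periodic 1D Cartesian communicator, with illustrative variable names; the actual implementation is in the linked repository.
\scriptsize
```{c, eval=FALSE}
/* Minimal ring sketch: every rank exchanges one message with both
   neighbours per iteration, using non-blocking routines. */
#include <mpi.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int nprocs, rank, periodic = 1;
  MPI_Comm ring;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Cart_create(MPI_COMM_WORLD, 1, &nprocs, &periodic, 0, &ring);
  MPI_Comm_rank(ring, &rank);

  int left, right;
  MPI_Cart_shift(ring, 0, 1, &left, &right);   /* periodic neighbours */

  int send_l = rank, send_r = rank, recv_l, recv_r;
  for (int it = 0; it < nprocs; it++) {        /* one full trip around */
    MPI_Request req[4];
    MPI_Isend(&send_l, 1, MPI_INT, left,  0, ring, &req[0]);
    MPI_Isend(&send_r, 1, MPI_INT, right, 1, ring, &req[1]);
    MPI_Irecv(&recv_r, 1, MPI_INT, right, 0, ring, &req[2]);
    MPI_Irecv(&recv_l, 1, MPI_INT, left,  1, ring, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    send_l = recv_r;                           /* forward in both directions */
    send_r = recv_l;
  }
  MPI_Finalize();
  return 0;
}
```
\normalsize
Each iteration posts two sends and two receives per rank, which is why a *double PingPing* is a natural lower-bound model.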
### Software stack and measurement setup
The performance of the program is measured over 10000 iterations, using **UCX** and **InfiniBand**, across *core*, *socket* and *node*. Across two nodes the *mlx5_0* hardware interface is used. The times for the theoretical models have been taken from the Intel® MPI Benchmarks PingPing (Figure 1). The program can be compiled via the Makefile with different options. Times have been measured on the rank-zero process, which is expected to be slower than the other ranks for topology reasons. Running *ldd* on the executable reports the specific libraries used.
\scriptsize
```{bash, eval=FALSE}
[s271550@login ring]$ ldd ring.x
linux-vdso.so.1 => (0x00007ffce79ef000)
libm.so.6 => /lib64/libm.so.6 (0x00007f9e9dc88000)
libmpi.so.40 => /opt/area/shared/programs/x86_64/openmpi/4.0.3/gnu/9.3.0/lib/libmpi.so.40 (0x00007f9e9d964000)
```
\normalsize
![PingPing structure from Intel® MPI Benchmarks User Guide.](/home/nt/Scrivania/en/mpi/imb_user_guide/IMB_Users_Guide_files/image02.png){#id .class width=35% height=35%}
### Map by node model
The network model across two nodes takes into account the latency of the network (dominated by the switch) and the number of processes involved. Each iteration will be lower-bounded by the network:
$$Time=N_{procs} \cdot \lambda_{network}$$
The latency between two nodes estimated with the Intel MPI benchmark is a bit lower than the one declared by the switch manufacturer. This lower latency holds only when few communications are going on; with multiple processes involved, the more realistic declared latency of $1.35$ microseconds leads to a more accurate model. The model will thus be experimentally outperformed when $N_{procs}=2$. When the number of processes grows, a slight slowdown of the actual implementation can be seen.
Across two nodes the latency can be estimated optimistically as $\lambda_{network}=1.01 \space \mu Sec$.
\scriptsize
```{bash eval=FALSE}
#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.00 0.00
1 1000 1.00 1.00
2 1000 1.01 1.98
4 1000 1.01 3.94
```
\normalsize
We expect the time to be $2 \cdot \lambda_{network}$ for each iteration, but experimentally this model doesn't hold: the evidence suggests an iteration time of $\lambda_{network}$. My guess is that the 2 consecutive messages are merged into one unique request routed to the switch. This behaviour can be confirmed using the Intel MPI Benchmark **Exchange**, which implements a 1D non-periodic ring (Figure 2).
\scriptsize
```{bash, eval=FALSE}
#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 36
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 10000 1.46 1.46 1.46 0.00
1 10000 1.45 1.45 1.45 2.75
2 10000 1.45 1.45 1.45 5.50
4 10000 1.46 1.46 1.46 10.97
```
\normalsize
![Exchange benchmark structure from Intel® MPI Benchmarks User Guide.](./img/image04.png){#id99 .class width=40% height=40%}
### Map by socket model
With the socket round-robin binding selected, we can model the ring behaviour as follows, with the usual Intel MPI benchmark estimation:
$$Time=N_{procs}\cdot \lambda_{socket}\cdot 2$$
Again, when the sockets become crowded the real performance is worse than the model predicts.
Across two sockets the estimated time is $t=0.49 \space \mu Sec$.
\scriptsize
```{bash eval=FALSE}
#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.49 0.00
1 1000 0.48 2.06
2 1000 0.49 4.12
4 1000 0.49 8.18
```
\normalsize
### Map by core model
When mapping the processes by core, two factors have to be taken into account: how many processes are spawned, and where. When $N_{procs}$ is less than or equal to the number of cores in a single socket, the expected execution time is bounded by the core-to-core communication. Once the first socket is filled, the first process placed on the second socket becomes the slowest among all processes, since its neighbours are placed on the other socket. One iteration lasts as long as the slowest process's communication. Experimentally, a spike at $N_{procs}=13$ is clearly visible. Counterintuitively, the performance improves when the number of processes grows beyond $13$ and further processes are spawned on the second socket.
Indeed, the slowest process has one neighbor on the same socket and one on the other socket, thus the overall iteration time for the slowest process is $\lambda_{core}+\lambda_{socket}$.
$$
Time= \left\{
\begin{array}{ll}
N_{procs}\cdot\lambda_{core}\cdot 2 & N_{procs}\leq N_{cpu\space core} \\
N_{procs}\cdot\lambda_{socket}\cdot 2 & N_{procs} = N_{cpu\space core}+1 \\
N_{procs}\cdot(\lambda_{core}+\lambda_{socket}) & N_{procs} \geq N_{cpu\space core}+2 \\
\end{array}
\right.
$$
Between two cores on the same socket the estimated time is $t=0.23 \space \mu Sec$.
\scriptsize
```{bash eval=FALSE}
#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.23 0.00
1 1000 0.23 4.27
2 1000 0.24 8.44
4 1000 0.23 17.18
```
\normalsize
### Experimental results compared with theoretical model
```{r ring, include=FALSE, echo=FALSE}
ring_core=read.csv("./ring/timing/core.csv", header = TRUE)
ring_socket=read.csv("./ring/timing/socket.csv", header = TRUE)
ring_node=read.csv("./ring/timing/node.csv", header = TRUE)
ring_core_big=read.csv("./ring/timing/all_nodes/core.txt", header = TRUE)
ring_socket_big=read.csv("./ring/timing/all_nodes/socket.txt", header = TRUE)
ring_node_big=read.csv("./ring/timing/all_nodes/node.txt", header = TRUE)
ring_model_core=ring_core
ring_model_socket=ring_socket
ring_model_node=ring_node
# PingPing-based latency estimates (usec) for the three mappings
core_latency=0.23
socket_latency=0.48
node_latency=1.35
# Theoretical ring models: total time over one iteration of the whole ring
ring_model_core$MEAN[ring_model_core$X.SIZE<=12]=ring_model_core$X.SIZE[ring_model_core$X.SIZE<=12]*core_latency*2
ring_model_core$MEAN[ring_model_core$X.SIZE>12]=ring_model_core$X.SIZE[ring_model_core$X.SIZE>12]*socket_latency+ring_model_core$X.SIZE[ring_model_core$X.SIZE>12]*core_latency
ring_model_core$MEAN[ring_model_core$X.SIZE==13]=ring_model_core$X.SIZE[ring_model_core$X.SIZE==13]*socket_latency*2
ring_model_socket$MEAN=ring_model_socket$X.SIZE*socket_latency*2
ring_model_node$MEAN=ring_model_node$X.SIZE*node_latency
sp1 = ggplot(ring_core,aes(x=X.SIZE,y=MEAN,color="Map by core",linetype="Real time")) +
scale_y_continuous(name="Time uSec",breaks = seq(0,45,by=5))+
scale_x_continuous(name="Processor number",breaks = ring_core$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5))+
geom_point(data=ring_model_core,aes(x=X.SIZE,y=MEAN,color="Map by core"))+
geom_line(data=ring_model_core,aes(x=X.SIZE,y=MEAN,color="Map by core",linetype="Theoretical time")) +
geom_point(data=ring_socket,aes(x=X.SIZE,y=MEAN,color="Map by socket")) +
geom_line(data=ring_socket,aes(x=X.SIZE,y=MEAN,color="Map by socket",linetype="Real time"))+
geom_point(data=ring_model_socket,aes(x=X.SIZE,y=MEAN,color="Map by socket")) +
geom_line(data=ring_model_socket,aes(x=X.SIZE,y=MEAN,color="Map by socket",linetype="Theoretical time"))+
geom_point(data=ring_model_node,aes(x=X.SIZE,y=MEAN,color="Map by node")) +
geom_line(data=ring_model_node,aes(x=X.SIZE,y=MEAN,color="Map by node",linetype="Theoretical time"))+
geom_point(data=ring_node,aes(x=X.SIZE,y=MEAN,color="Map by node")) +
geom_line(data=ring_node,aes(x=X.SIZE,y=MEAN,color="Map by node",linetype="Real time"))+
scale_color_manual(name="Mapping", values = c(10,12,13))+
scale_linetype_manual(name="",values=c(1,2))+
theme(legend.position = c(0.2, 0.8))+ geom_vline(xintercept=12,
color = "red",linetype="dashed", size=0.5)
core_latency=0.23
socket_latency=0.48
node_latency=1.35
ring_model_core$MEAN_MESSAGE[ring_model_core$X.SIZE<=12]=core_latency*2
ring_model_core$MEAN_MESSAGE[ring_model_core$X.SIZE>12]=socket_latency+core_latency
ring_model_core$MEAN_MESSAGE[ring_model_core$X.SIZE==13]=socket_latency*2
ring_model_socket$MEAN_MESSAGE=socket_latency*2
ring_model_node$MEAN_MESSAGE=node_latency
sp = ggplot(ring_core,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by core",linetype="Real time")) +
scale_y_continuous(name="Mean iteration time uSec",breaks = seq(0.4,1.6,by=0.1))+
scale_x_continuous(name="Processor number",breaks = ring_core$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5))+
geom_point(data=ring_model_core,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by core"))+
geom_line(data=ring_model_core,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by core",linetype="Theoretical time")) +
geom_point(data=ring_socket,aes(x=X.SIZE,y=MEAN/X.SIZE,color="Map by socket")) +
geom_line(data=ring_socket,aes(x=X.SIZE,y=MEAN/X.SIZE,color="Map by socket",linetype="Real time"))+
geom_point(data=ring_model_socket,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by socket")) +
geom_line(data=ring_model_socket,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by socket",linetype="Theoretical time"))+
geom_point(data=ring_model_node,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by node")) +
geom_line(data=ring_model_node,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by node",linetype="Theoretical time"))+
geom_point(data=ring_node,aes(x=X.SIZE,y=MEAN/X.SIZE,color="Map by node")) +
geom_line(data=ring_node,aes(x=X.SIZE,y=MEAN/X.SIZE,color="Map by node",linetype="Real time"))+
scale_color_manual(name="Mapping", values = c(10,12,13))+
scale_linetype_manual(name="",values=c(1,2))+
guides(color = FALSE, size = FALSE,linetype= FALSE)+
geom_vline(xintercept=12,
color = "red",linetype="dashed", size=0.5)
```
```{r griglia, dpi=600, fig.width=13, fig.height=6,echo=FALSE,fig.cap="Ring on THIN node up to 24 cores"}
grid.arrange(sp1,sp,nrow=1)
```
```{r,include=FALSE, echo=FALSE}
library(ggplot2)
sp = ggplot(ring_core_big,aes(x=X.SIZE,y=MEAN,color="Map by core" )) +
scale_y_continuous(name="Time uSec",breaks = seq(0,180,by=10))+
scale_x_continuous(name="Processor number",breaks = ring_core_big$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust=0.5))+
geom_point(data=ring_socket_big,aes(x=X.SIZE,y=MEAN,color="Map by socket")) +
geom_line(data=ring_socket_big,aes(x=X.SIZE,y=MEAN,color="Map by socket" ))+
geom_point(data=ring_node_big,aes(x=X.SIZE,y=MEAN,color="Map by node")) +
geom_line(data=ring_node_big,aes(x=X.SIZE,y=MEAN,color="Map by node" ))+
scale_color_manual(name="Mapping", values = c(10,12,13,14))+
scale_linetype_manual(name="",values=c(1,2))+
theme(legend.position = c(0.1, 0.6))+
geom_vline(xintercept=12,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=24,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=36,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=48,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=60,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=72,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=84,
color = "red",linetype="dashed", size=0.5)
library(ggplot2)
sp2 = ggplot(ring_core_big,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by core" )) +
scale_y_continuous(name="Mean iteration time uSec",breaks = seq(0.4,1.7,0.1))+
scale_x_continuous(name="Processor number",breaks = ring_core_big$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust=0.5))+
geom_point(data=ring_socket_big,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by socket")) +
geom_line(data=ring_socket_big,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by socket" ))+
geom_point(data=ring_node_big,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by node")) +
geom_line(data=ring_node_big,aes(x=X.SIZE,y=MEAN_MESSAGE,color="Map by node" ))+
scale_color_manual(name="Mapping", values = c(10,12,13,14))+
scale_linetype_manual(name="",values=c(1,2))+
guides(color = FALSE, size = FALSE,linetype= FALSE)+ geom_vline(xintercept=12,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=24,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=36,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=48,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=60,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=72,
color = "red",linetype="dashed", size=0.5)+
geom_vline(xintercept=84,
color = "red",linetype="dashed", size=0.5)
```
\newpage
```{r , dpi=600, fig.width=10, fig.height=8,echo=FALSE,fig.cap="Ring on THIN node up to 96 cores and 4 nodes"}
grid.arrange(sp,sp2,nrow=2)
```
By running the ring across 4 nodes, the pattern above can be visualized. After 24 processes, when a socket or a node is saturated, the changes in the average iteration time become smaller. This is due to the fact that the communication time between nodes is the largest one, so it dominates the intra-socket and intra-core ones. Moreover, with more than 24 processes there is always at least one inter-node communication going on; the step in the plot confirms that.
\newpage
## Section 1.2: Matrix sum
In memory, a 3D array can be represented as a single linear array. The most effective way to sum it in parallel is to use collective operations.
The shape of the array makes no difference in how the array is represented in memory. Likewise, the virtual topology and its associated domain decomposition have no impact. Indeed, we can assume that the problem is not sensitive to any topology, and thus there is no need for communication between neighbours. The most efficient way to communicate and sum the matrices is to use the MPI_Scatterv and MPI_Gatherv collective routines.
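As an illustration of this scheme, here is a minimal self-contained sketch in C; the sizes, names and the contiguous 1D split are illustrative, not the actual program.
\scriptsize
```{c, eval=FALSE}
/* Minimal sketch: sum two flattened 3D arrays with MPI_Scatterv /
   MPI_Gatherv; sizes and variable names are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int n = 240 * 100 * 70;              /* flattened 3D extent */
  int *counts = malloc(nprocs * sizeof *counts);
  int *displs = malloc(nprocs * sizeof *displs);
  for (int i = 0; i < nprocs; i++) {         /* near-even contiguous split */
    counts[i] = n / nprocs + (i < n % nprocs);
    displs[i] = i ? displs[i - 1] + counts[i - 1] : 0;
  }

  double *A = NULL, *B = NULL, *C = NULL;
  if (rank == 0) {                           /* master owns the full arrays */
    A = malloc(n * sizeof *A);
    B = malloc(n * sizeof *B);
    C = malloc(n * sizeof *C);
    for (int i = 0; i < n; i++) { A[i] = 1.0; B[i] = 2.0; }
  }
  double *a = malloc(counts[rank] * sizeof *a);
  double *b = malloc(counts[rank] * sizeof *b);
  double *c = malloc(counts[rank] * sizeof *c);

  MPI_Scatterv(A, counts, displs, MPI_DOUBLE, a, counts[rank],
               MPI_DOUBLE, 0, MPI_COMM_WORLD);
  MPI_Scatterv(B, counts, displs, MPI_DOUBLE, b, counts[rank],
               MPI_DOUBLE, 0, MPI_COMM_WORLD);
  for (int i = 0; i < counts[rank]; i++)     /* the only parallel work */
    c[i] = a[i] + b[i];
  MPI_Gatherv(c, counts[rank], MPI_DOUBLE, C, counts, displs,
              MPI_DOUBLE, 0, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}
```
\normalsize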
The expectations for this code are not so high in terms of scalability: the sum of two matrices containing $n$ elements requires $2n$ memory reads, $n$ memory writes, a lot of communication and **only** $n$ floating point operations (in other words, like a Level 1 BLAS operation, we can expect the matrix sum to be memory bound). The parallel portion of the code is small with respect to the serial one. Using Amdahl's law, a prediction of the speedup bound can be made.
By testing the parallel code with two $2400\times1000\times700$ matrices of double values and by using only one processor, it can be seen that the parallel part takes only $2.63$ seconds of the overall runtime of $49$ seconds, representing about $5\%$ of the execution time (the elapsed time is measured using /usr/bin/time, the detailed Scatter and Gather times with MPI_Wtime()). The total execution time takes into account the matrix initialization and the error checking. The following graph shows the theoretical maximum speedup predicted by Amdahl's law assuming roughly $5\%$ of parallel code, while communication and serial code in the remaining part of the program are fixed with respect to the number of processors. The theoretical maximum speedup modelled is a good approximation and it catches the trend between $[1,24]$ processes.
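For reference, with parallel fraction $p$ and serial fraction $s=1-p$, Amdahl's law bounds the speedup on $N$ processors as
$$S(N)=\frac{1}{s+\frac{p}{N}} \xrightarrow{N\to\infty} \frac{1}{s},$$
so with a parallel fraction of only a few percent the asymptotic speedup stays barely above $1$, whatever the number of processors.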
```{r ,echo=FALSE,message=FALSE}
options(warn=-1)
core_matrix=read.csv("./matrix/matrix_no_topo/big/1_core.csv")
socket_matrix=read.csv("./matrix/matrix_no_topo/big/1_socket.csv")
core_matrix$X.np=core_matrix$X.np+1
socket_matrix$X.np=socket_matrix$X.np+1
core_matrix=core_matrix[c(1,4,8,12,16,20,24),]
socket_matrix=socket_matrix[c(1,4,8,12,16,20,24),]
rownames(core_matrix)=c()
rownames(socket_matrix)=c()
speed_teo=core_matrix
p=0.13
s=1-p
speed_teo$total=1/(s+p/speed_teo$X.np)
t1=kbl(core_matrix, booktabs = T, align = "c",escape=F, col.names = linebreak(c("N° procs", "Scatter\n(s)","Gather\n(s)","Parallel\n(s)","Total\n(s)"), align = "c")) %>%
column_spec(1, bold=T) %>%
column_spec(4, color = "white",
background = spec_color(core_matrix$parallel, end = 0.5,alpha = 0.2,option = "plasma"),
popover = paste("am:",core_matrix$parallel))
t2=kbl(socket_matrix, booktabs = T, align = "c",escape=F, col.names = linebreak(c("N° procs", "Scatter\n(s)","Gather\n(s)","Parallel\n(s)","Total\n(s)"), align = "c")) %>%
column_spec(1, bold=T) %>%
column_spec(4, color = "white",
background = spec_color(socket_matrix$parallel, end = 0.5,alpha = 0.2,option = "plasma"),
popover = paste("am:",socket_matrix$parallel))
knitr::kables(list(t1,t2),caption = "Matrix sum timings")
```
```{r,echo=FALSE, dpi=600, fig.width=10, fig.height=2}
sp = ggplot(socket_matrix,aes(x=X.np, y=socket_matrix$total[1]/socket_matrix$total, color="Socket")) +
scale_x_continuous(name="N procs",breaks=socket_matrix$X.np)+
scale_y_continuous(name="Speedup",breaks = seq(0.9,1.4,by=0.1),limit = c(0.9, 1.2)) + geom_point() + geom_line() +
theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1)) +
geom_point(data=speed_teo,aes(x=X.np,y=total,color="Teo")) + geom_line(data=speed_teo,aes(x=X.np,y=total,color="Teo"))+
geom_point(data=core_matrix,aes(x=X.np,y=core_matrix$total[1]/core_matrix$total,color="Core")) + geom_line(data=core_matrix,aes(x=X.np,y=core_matrix$total[1]/core_matrix$total,color="Core"))+ labs(title="Matrix sum speedup", subtitle="") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))+
scale_color_manual(name="Mapping", values = c(10,12,13))
```
```{r, dpi=600, fig.width=10, fig.height=4,echo=FALSE,message=FALSE}
sp
```
\newpage
Moreover, I have implemented a 3D matrix sum program that includes a 1D/2D/3D domain decomposition matching the virtual topology and that uses collective operations, as initially requested. This program incurs a very high overhead, since a collective operation (and communication routines in general) needs a contiguous memory region to communicate each subdomain to its worker. To achieve this, an unrolling is performed for each subdomain. Once the worker has received its subdomain, it processes the array and sends the summed array back to the master processor. The master processor then needs to reorder the subdomains and assemble the final matrix. As in the first program, scalability is poor, and this code shows even lower performance. In this latter case there are some changes in performance with respect to the matrix shape and the domain decomposition, but they are due to the matrix preparation overhead (buffering of the subdomains before scattering them) and not to the communication routines.
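The unrolling (packing) step described above can be sketched as follows; the row-major indexing and all names are illustrative.
\scriptsize
```{c, eval=FALSE}
/* Sketch of the "unrolling" step: copy one subdomain of a row-major
   3D array A of extent nx*ny*nz into a contiguous send buffer.
   (ox,oy,oz) is the subdomain origin, (sx,sy,sz) its extent;
   all names are illustrative. */
void pack_subdomain(const double *A, double *buf,
                    int ny, int nz,
                    int ox, int oy, int oz,
                    int sx, int sy, int sz) {
  long k = 0;
  for (int i = ox; i < ox + sx; i++)
    for (int j = oy; j < oy + sy; j++)
      for (int l = oz; l < oz + sz; l++)
        buf[k++] = A[((long)i * ny + j) * nz + l];
}
```
\normalsize
An alternative that would avoid the explicit copies is an MPI derived datatype such as MPI_Type_create_subarray, which lets the library address the strided subdomain directly.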
```{r, dpi=600, fig.width=14, fig.height=20,echo=FALSE,message=FALSE}
core_matrix_topo_1=read.csv("./matrix/matrix_collective/big/1_core.csv")
socket_matrix_topo_1=read.csv("./matrix/matrix_collective/big/1_socket.csv")
core_matrix_topo_2=read.csv("./matrix/matrix_collective/big/2_core.csv")
socket_matrix_topo_2=read.csv("./matrix/matrix_collective/big/2_socket.csv")
sp = ggplot(core_matrix_topo_1,aes(x=X.np, y=core_matrix_topo_1$total[1]/core_matrix_topo_1$total, color="Core",linetype="2400 1000 700")) +
scale_x_continuous(name="N procs",breaks=core_matrix_topo_1$X.np)+
scale_y_continuous(name="Speedup",breaks = seq(0.9,1.4,by=0.1),limit = c(0.95, 1.1)) + geom_point() + geom_line() +
theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1)) +
geom_point(data=socket_matrix_topo_1,aes(x=X.np,y=socket_matrix_topo_1$total[1]/socket_matrix_topo_1$total,color="Socket")) +
geom_line(data=socket_matrix_topo_1,aes(x=X.np,y=socket_matrix_topo_1$total[1]/socket_matrix_topo_1$total,color="Socket",linetype="2400 1000 700"))+ labs(title="Matrix sum speedup", subtitle="") +
geom_point(data=socket_matrix_topo_2,aes(x=X.np,y=socket_matrix_topo_2$total[1]/socket_matrix_topo_2$total,color="Socket")) + geom_line(data=socket_matrix_topo_2,aes(x=X.np,y=socket_matrix_topo_2$total[1]/socket_matrix_topo_2$total,color="Socket",linetype="1200 2000 700"))+
geom_point(data=core_matrix_topo_2,aes(x=X.np,y=core_matrix_topo_2$total[1]/core_matrix_topo_2$total,color="Core")) + geom_line(data=core_matrix_topo_2,aes(x=X.np,y=core_matrix_topo_2$total[1]/core_matrix_topo_2$total,color="Core",linetype="1200 2000 700"))+
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))+
scale_color_manual(name="Mapping and size", values = c(10,12,13,15))+
scale_linetype_manual(name="",values=c(1,2))
```
```{r, dpi=600, fig.width=10, fig.height=4,echo=FALSE,message=FALSE}
sp
```
\newpage
# Section 2: MPI point to point performance
In order to measure MPI point-to-point performance, the Intel MPI Benchmarks suite is used. According to Intel's documentation, the benchmark works as follows:
![PingPong structure from Intel® MPI Benchmarks User Guide.](./img/image01.png){#id2 .class width=50% height=35%}
The bandwidth and latency estimates are computed across *core*, *socket* and different *nodes*, combined with different protocols and hardware devices. Each setup of the program has been run 10 times, and **openMPI 4.0.3** has been used. One entire node was reserved in order to reduce possible sources of noise in the measurements.
The **pml** components involved in the benchmarks are **OB1** and **UCX**; the **btl** components used are **tcp** and **vader** (these are selected with the usual Open MPI `--mca pml` and `--mca btl` options). For the measurements across nodes, different networks with different protocols have also been selected: $25$ Gbit Ethernet and $100$ Gbit InfiniBand through a Mellanox network switch.
```{r , include=FALSE,echo=FALSE}
node_ib=read.csv("./pingpong/ompi/node_ib.out.csv", header = TRUE)
node_ob1_tcp=read.csv("./pingpong/ompi/node_ob1_selftcp.out.csv", header = TRUE)
node_ucx_br0=read.csv("./pingpong/ompi/node_ucx_br0.out.csv", header = TRUE)
node_ucx_ib0=read.csv("./pingpong/ompi/node_ucx_ib0.out.csv", header = TRUE)
node_ucx__mlx5=read.csv("./pingpong/ompi/node_ucx_mlx5.out.csv", header = TRUE)
socket_ib=read.csv("./pingpong/ompi/socket_ib.out.csv", header = TRUE)
socket_ob1_tcp=read.csv("./pingpong/ompi/socket_ob1_selftcp.out.csv", header = TRUE)
socket_ob1_vader=read.csv("./pingpong/ompi/socket_ob1_selfvader.out.csv", header = TRUE)
core_ib=read.csv("./pingpong/ompi/core_ib.out.csv", header = TRUE)
core_ob1_tcp=read.csv("./pingpong/ompi/core_ob1_selftcp.out.csv", header = TRUE)
core_ob1_vader=read.csv("./pingpong/ompi/core_ob1_selfvader.out.csv", header = TRUE)
node_ib_intel=read.csv("./pingpong/intel/node_ib_intel.out.csv", header = TRUE)
socket_ib_intel=read.csv("./pingpong/intel/socket_ib_intel.out.csv", header = TRUE)
core_ib_intel=read.csv("./pingpong/intel/core_ib_intel.out.csv", header = TRUE)
```
```{r , dpi=600, fig.width=10, fig.height=6,echo=FALSE,include=FALSE}
library(ggplot2)
sp_thin = ggplot(core_ib,aes(x=X.bytes,y=mbs,color="UCX IB",linetype="Core")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=core_ib$X.bytes) +
geom_point() + geom_line()+scale_y_continuous(breaks = seq(0,25000,1000),name="MB/s") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=core_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp")) +
geom_line(data=core_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp",linetype="Core")) +
geom_point(data=core_ob1_vader,aes(x=X.bytes,y=mbs,color="OB1 vader")) +
geom_line(data=core_ob1_vader,aes(x=X.bytes,y=mbs,color="OB1 vader",linetype="Core")) +
geom_point(data=socket_ib,aes(x=X.bytes,y=mbs,color="UCX IB")) +
geom_line(data=socket_ib,aes(x=X.bytes,y=mbs,color="UCX IB",linetype="Socket")) +
geom_point(data=socket_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp")) +
geom_line(data=socket_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp",linetype="Socket")) +
geom_point(data=socket_ob1_vader,aes(x=X.bytes,y=mbs,color="OB1 vader")) +
geom_line(data=socket_ob1_vader,aes(x=X.bytes,y=mbs,color="OB1 vader",linetype="Socket")) +
geom_point(data=socket_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB")) +
geom_line(data=socket_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB",linetype="Socket")) +
geom_point(data=core_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB")) +
geom_line(data=core_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB",linetype="Core"))+
scale_color_manual(name="Protocol", values = c("#F8766D", "#7CAE00", "#00BFC4" ,"#C77CFF"))+
scale_linetype_manual(name="Mapping",values=c(1,2,3,4))+
labs(title="PingPong bandwidth", subtitle="Thin node,core and socket mapping") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))+
geom_vline(xintercept=32768,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=1048576,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=19922944,
color = "black",linetype="dashed", size=0.35)+
annotate("text", x=58000, y=24000, label= "L1",color="Black")+
annotate("text", x=1800000, y=24000, label= "L2",color="Black")+
annotate("text", x=29900000, y=24000, label= "L3",color="Black")
```
```{r aa, dpi=600, fig.width=10, fig.height=6,echo=FALSE}
options(warn=-1)
#sp
```
```{r,echo=FALSE}
socket_ib_gpu=read.csv("./pingpong/ompi_gpu/socket_ib.out.csv", header = TRUE)
socket_ob1_tcp_gpu=read.csv("./pingpong/ompi_gpu/socket_ob1_selftcp.out.csv", header = TRUE)
socket_ob1_vader_gpu=read.csv("./pingpong/ompi_gpu/socket_ob1_selfvader.out.csv", header = TRUE)
socket_ib_intel_gpu=read.csv("./pingpong/intel_gpu/socket_ib_intel.out.csv", header = TRUE)
core_ib_intel_gpu=read.csv("./pingpong/intel_gpu/core_ib_intel.out.csv", header = TRUE)
core_ib_gpu=read.csv("./pingpong/ompi_gpu/core_ib.out.csv", header = TRUE)
core_ob1_tcp_gpu=read.csv("./pingpong/ompi_gpu/core_ob1_selftcp.out.csv", header = TRUE)
core_ob1_vader_gpu=read.csv("./pingpong/ompi_gpu/core_ob1_selfvader.out.csv", header = TRUE)
sp_gpu = ggplot(core_ib_gpu,aes(x=X.bytes,y=mbs,color="UCX IB",linetype="Core")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=core_ib$X.bytes) +
geom_point() + geom_line()+scale_y_continuous(breaks = seq(0,22000,1000),name="MB/s") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=core_ob1_tcp_gpu,aes(x=X.bytes,y=mbs,color="OB1 tcp")) +
geom_line(data=core_ob1_tcp_gpu,aes(x=X.bytes,y=mbs,color="OB1 tcp",linetype="Core")) +
geom_point(data=core_ob1_vader_gpu,aes(x=X.bytes,y=mbs,color="OB1 vader")) +
geom_line(data=core_ob1_vader_gpu,aes(x=X.bytes,y=mbs,color="OB1 vader",linetype="Core")) +
geom_point(data=socket_ib_gpu,aes(x=X.bytes,y=mbs,color="UCX IB")) +
geom_line(data=socket_ib_gpu,aes(x=X.bytes,y=mbs,color="UCX IB",linetype="Socket")) +
geom_point(data=socket_ob1_tcp_gpu,aes(x=X.bytes,y=mbs,color="OB1 tcp")) +
geom_line(data=socket_ob1_tcp_gpu,aes(x=X.bytes,y=mbs,color="OB1 tcp",linetype="Socket")) +
geom_point(data=socket_ob1_vader_gpu,aes(x=X.bytes,y=mbs,color="OB1 vader")) +
geom_line(data=socket_ob1_vader_gpu,aes(x=X.bytes,y=mbs,color="OB1 vader",linetype="Socket")) +
geom_point(data=socket_ib_intel_gpu,aes(x=X.bytes,y=mbs,color="Intel IB")) +
geom_line(data=socket_ib_intel_gpu,aes(x=X.bytes,y=mbs,color="Intel IB",linetype="Socket")) +
geom_point(data=core_ib_intel_gpu,aes(x=X.bytes,y=mbs,color="Intel IB")) +
geom_line(data=core_ib_intel_gpu,aes(x=X.bytes,y=mbs,color="Intel IB",linetype="Core"))+
scale_color_manual(name="Protocol", values = c("#F8766D", "#7CAE00", "#00BFC4" ,"#C77CFF"))+
scale_linetype_manual(name="Mapping",values=c(1,2,3,4))+
labs(title="PingPong bandwidth", subtitle="Gpu node,core and socket mapping") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))+
geom_vline(xintercept=32768,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=1048576,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=19922944,
color = "black",linetype="dashed", size=0.35)+
annotate("text", x=58000, y=20000, label= "L1",color="Black")+
annotate("text", x=1800000, y=20000, label= "L2",color="Black")+
annotate("text", x=29900000, y=20000, label= "L3",color="Black")
```
```{r , dpi=600, fig.width=10, fig.height=6,echo=FALSE}
options(warn=-1)
#sp
```
The following graphs show the behaviour inside a single node, i.e. mapping the two processes across two sockets or within the same socket. Mapping the processes in the same socket often shows better performance, as expected. The behaviour before the asymptotic plateau is irregular and shows very different performance among the implementations; the main cause of this is the cache.
Before analyzing the performance we need to know more about the node topology, shown in Figure 6.
![lstopo output on THIN node.](./img/lstopo.png){#idasd .class width=50% height=50% align=center}
Cache sizes are marked in the following graphs with vertical black dashed lines.
As can be noticed in the following two graphs, the bandwidth behaviour appears irregular up to and including $16$ MB. After that point the behaviour is stable, because the message size is larger than all the caches; the largest cache is **L3**, with $19$ MB.
The effect of **L2** becomes clear after $1$ MB: beyond that size, all implementations start losing bandwidth.
The effect of **L1** is not clearly visible from the bandwidth due to the latency. To reason more accurately about the cache effects we can perform some profiling tests using hardware counters. I have modified the Intel MPI Benchmarks PingPong, injecting some code in order to inspect L1 data cache misses and L2 cache misses through PAPI. I tracked all cache misses in the MPI_Send and MPI_Recv routines of rank 0, after some warm-up cycles.
The code is available here https://github.com/NiccoloTosato/HPC_assignment1/tree/main/mpi-benchmarks_mod
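The injected instrumentation follows this pattern; this is a simplified sketch with error checking omitted, and the placement around the timed loop is illustrative.
\scriptsize
```{c, eval=FALSE}
/* Simplified sketch of the PAPI instrumentation injected into the
   PingPong benchmark; error checking omitted. */
#include <papi.h>

int evset = PAPI_NULL;
long long misses[2];

PAPI_library_init(PAPI_VER_CURRENT);
PAPI_create_eventset(&evset);
PAPI_add_event(evset, PAPI_L1_DCM);   /* L1 data cache misses  */
PAPI_add_event(evset, PAPI_L2_TCM);   /* L2 total cache misses */

PAPI_start(evset);
/* ... the (already warmed-up) MPI_Send / MPI_Recv loop of rank 0 ... */
PAPI_stop(evset, misses);             /* misses[0] = L1, misses[1] = L2 */
```
\normalsize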
The cache misses are reported both in absolute value and normalized with respect to the message size.
Before $128$ KB, UCX shows poor performance with respect to the other protocols, while after that size a huge speed-up is present; this fact will be explained later. Intel InfiniBand also shows poor performance with respect to the UCX implementation, and this behaviour too is explained by the cache.
```{r, dpi=600, fig.width=10, fig.height=6,echo=FALSE,include=FALSE}
ob1_core_tcp=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/core_ob1_selftcp_cache_misses.out.csv", header = TRUE)
ob1_core_tcp$L1=ob1_core_tcp$L1/(ob1_core_tcp$ITERATION)
ob1_core_tcp$L2=ob1_core_tcp$L2/(ob1_core_tcp$ITERATION)
ob1_core_vader=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/core_ob1_selfvader_cache_misses.out.csv", header = TRUE)
ob1_core_vader$L1=ob1_core_vader$L1/(ob1_core_vader$ITERATION)
ob1_core_vader$L2=ob1_core_vader$L2/(ob1_core_vader$ITERATION)
ucx_core=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/core_ib_cache_misses.out.csv", header = TRUE)
ucx_core$L1=ucx_core$L1/(ucx_core$ITERATION)
ucx_core$L2=ucx_core$L2/(ucx_core$ITERATION)
ob1_socket_tcp=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/socket_ob1_selftcp_cache_misses.out.csv", header = TRUE)
ob1_socket_tcp$L1=ob1_socket_tcp$L1/(ob1_socket_tcp$ITERATION)
ob1_socket_tcp$L2=ob1_socket_tcp$L2/(ob1_socket_tcp$ITERATION)
ob1_socket_vader=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/socket_ob1_selfvader_cache_misses.out.csv", header = TRUE)
ob1_socket_vader$L1=ob1_socket_vader$L1/(ob1_socket_vader$ITERATION)
ob1_socket_vader$L2=ob1_socket_vader$L2/(ob1_socket_vader$ITERATION)
ucx_socket=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/socket_ib_cache_misses.out.csv", header = TRUE)
ucx_socket$L1=ucx_socket$L1/(ucx_socket$ITERATION)
ucx_socket$L2=ucx_socket$L2/(ucx_socket$ITERATION)
intel_core=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/intel_core_cache_misses.out.csv", header = TRUE)
intel_core$L1=intel_core$L1/intel_core$ITERATION
intel_core$L2=intel_core$L2/intel_core$ITERATION
intel_socket=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/intel_socket_cache_misses.out.csv", header = TRUE)
intel_socket$L1=intel_socket$L1/intel_socket$ITERATION
intel_socket$L2=intel_socket$L2/intel_socket$ITERATION
intel_node=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/intel_node_cache_misses.out.csv", header = TRUE)
intel_node$L1=intel_node$L1/intel_node$ITERATION
ucx_node_ib=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/node_ib_cache_misses.out.csv", header = TRUE)
ucx_node_ib$L1=ucx_node_ib$L1/ucx_node_ib$ITERATION
ucx_node_br0=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/node_ucx_br0_cache_misses.out.csv", header = TRUE)
ucx_node_br0$L1=ucx_node_br0$L1/ucx_node_br0$ITERATION
ucx_node_ib0=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/node_ucx_ib0_cache_misses.out.csv", header = TRUE)
ucx_node_ib0$L1=ucx_node_ib0$L1/ucx_node_ib0$ITERATION
ob1_node_tcp=read.csv("./mpi-benchmarks_mod/src_c/cache_misses/node_ob1_selftcp_cache_misses.out.csv", header = TRUE)
ob1_node_tcp$L1=ob1_node_tcp$L1/ob1_node_tcp$ITERATION
```
```{r, dpi=600, fig.width=10, fig.height=6,echo=TRUE,include=FALSE}
gg_color_hue <- function(n) {
hues = seq(15, 375, length = n + 1)
hcl(h = hues, l = 65, c = 100)[1:n]
}
sp = ggplot(ob1_core_vader,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss OB1",linetype="Core")) +
scale_x_continuous(trans='log2',name="Message Size Bytes", breaks=ob1_core_vader$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_y_continuous(trans='log2',name="Log(Cache miss)") +
geom_point(data=ob1_core_vader,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss OB1")) +
geom_line(data=ob1_core_vader,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss OB1",linetype="Core")) +
geom_point(data=ob1_socket_vader,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss OB1")) +
geom_line(data=ob1_socket_vader,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss OB1",linetype="Socket")) +
geom_point(data=ob1_socket_vader,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss OB1")) +
geom_line(data=ob1_socket_vader,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss OB1",linetype="Socket")) +
geom_point(data=ucx_socket,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss UCX"))+
geom_line(data=ucx_socket,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss UCX",linetype="Socket")) +
geom_point(data=ucx_socket,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss UCX")) +
geom_line(data=ucx_socket,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss UCX",linetype="Socket")) +
geom_point(data=ucx_core,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss UCX")) +
geom_line(data=ucx_core,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss UCX",linetype="Core"))+
geom_point(data=ucx_core,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss UCX")) +
geom_line(data=ucx_core,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss UCX",linetype="Core")) +
geom_point(data=intel_core,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss Intel")) +
geom_line(data=intel_core,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss Intel",linetype="Core")) +
geom_point(data=intel_core,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss Intel")) +
geom_line(data=intel_core,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss Intel",linetype="Core")) +
geom_point(data=intel_socket,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss Intel")) +
geom_line(data=intel_socket,aes(x=X.SIZE,y=L1/X.SIZE,color="L1 miss Intel",linetype="Socket")) +
geom_point(data=intel_socket,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss Intel")) +
geom_line(data=intel_socket,aes(x=X.SIZE,y=L2/X.SIZE,color="L2 miss Intel",linetype="Socket")) +
scale_color_manual(name="Protocol", values = c( "#F8766D", "#B79F00" ,"#00BA38", "#00BFC4", "#619CFF" ,"#F564E3"))+
scale_linetype_manual(name="Mapping",values=c(1,2))+
geom_vline(xintercept=32768,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=1048576,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=19922944,
color = "black",linetype="dashed", size=0.35)+
theme(axis.text.y =element_blank())+
annotate("text", x=58000, y=24000, label= "L1",color="Black")+
annotate("text", x=1800000, y=24000, label= "L2",color="Black")+
annotate("text", x=29900000, y=24000, label= "L3",color="Black")+
labs(title="PingPong cache misses", subtitle="Thin node,relative cache misses") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))
sp1 = ggplot(ob1_core_vader,aes(x=X.SIZE,y=L1,color="L1 miss OB1",linetype="Core")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=ob1_core_vader$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_y_continuous(trans='log2',name="Log(Cache misses)") +
geom_point(data=ob1_core_vader,aes(x=X.SIZE,y=L2,color="L2 miss OB1")) +
geom_line(data=ob1_core_vader,aes(x=X.SIZE,y=L2,color="L2 miss OB1",linetype="Core")) +
geom_point(data=ob1_socket_vader,aes(x=X.SIZE,y=L1,color="L1 miss OB1")) +
geom_line(data=ob1_socket_vader,aes(x=X.SIZE,y=L1,color="L1 miss OB1",linetype="Socket")) +
geom_point(data=ob1_socket_vader,aes(x=X.SIZE,y=L2,color="L2 miss OB1")) +
geom_line(data=ob1_socket_vader,aes(x=X.SIZE,y=L2,color="L2 miss OB1",linetype="Socket")) +
geom_point(data=ucx_socket,aes(x=X.SIZE,y=L1,color="L1 miss UCX"))+
geom_line(data=ucx_socket,aes(x=X.SIZE,y=L1,color="L1 miss UCX",linetype="Socket")) +
geom_point(data=ucx_socket,aes(x=X.SIZE,y=L2,color="L2 miss UCX")) +
geom_line(data=ucx_socket,aes(x=X.SIZE,y=L2,color="L2 miss UCX",linetype="Socket")) +
geom_point(data=ucx_core,aes(x=X.SIZE,y=L2,color="L2 miss UCX")) +
geom_line(data=ucx_core,aes(x=X.SIZE,y=L2,color="L2 miss UCX",linetype="Core"))+
geom_point(data=ucx_core,aes(x=X.SIZE,y=L1,color="L1 miss UCX")) +
geom_line(data=ucx_core,aes(x=X.SIZE,y=L1,color="L1 miss UCX",linetype="Core")) +
geom_point(data=intel_core,aes(x=X.SIZE,y=L1,color="L1 miss Intel")) +
geom_line(data=intel_core,aes(x=X.SIZE,y=L1,color="L1 miss Intel",linetype="Core")) +
geom_point(data=intel_core,aes(x=X.SIZE,y=L2,color="L2 miss Intel")) +
geom_line(data=intel_core,aes(x=X.SIZE,y=L2,color="L2 miss Intel",linetype="Core")) +
geom_point(data=intel_socket,aes(x=X.SIZE,y=L1,color="L1 miss Intel")) +
geom_line(data=intel_socket,aes(x=X.SIZE,y=L1,color="L1 miss Intel",linetype="Socket")) +
geom_point(data=intel_socket,aes(x=X.SIZE,y=L2,color="L2 miss Intel")) +
geom_line(data=intel_socket,aes(x=X.SIZE,y=L2,color="L2 miss Intel",linetype="Socket")) +
scale_color_manual(name="Protocol", values = c( "#F8766D", "#B79F00" ,"#00BA38", "#00BFC4", "#619CFF" ,"#F564E3"))+
scale_linetype_manual(name="Mapping",values=c(1,2)) +
geom_vline(xintercept=32768,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=1048576,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=19922944,
color = "black",linetype="dashed", size=0.35)+
theme(axis.text.y =element_blank())+
annotate("text", x=58000, y=24000, label= "L1",color="Black")+
annotate("text", x=1800000, y=24000, label= "L2",color="Black")+
annotate("text", x=29900000, y=24000, label= "L3",color="Black")+
labs(title="PingPong cache misses", subtitle="Thin node,absolute cache misses") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))
sp3 = ggplot(intel_node,aes(x=X.SIZE,y=L1,color="L1 miss Intel",linetype="Node")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=ob1_core_vader$X.SIZE) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_y_continuous(trans='log2',name="Log(Normalized Cache misses)") +
geom_point(data=intel_node,aes(x=X.SIZE,y=L2,color="L2 miss Intel")) +
geom_line(data=intel_node,aes(x=X.SIZE,y=L2,color="L2 miss Intel",linetype="Node")) +
geom_point(data=ucx_node_ib,aes(x=X.SIZE,y=L1,color="L1 miss UCX")) +
geom_line(data=ucx_node_ib,aes(x=X.SIZE,y=L1,color="L1 miss UCX",linetype="Node")) +
geom_point(data=ucx_node_ib,aes(x=X.SIZE,y=L2,color="L2 miss UCX")) +
geom_line(data=ucx_node_ib,aes(x=X.SIZE,y=L2,color="L2 miss UCX",linetype="Node")) +
geom_point(data=ucx_node_br0,aes(x=X.SIZE,y=L1,color="L1 miss UCX br0"))+
geom_line(data=ucx_node_br0,aes(x=X.SIZE,y=L1,color="L1 miss UCX br0",linetype="Node")) +
geom_point(data=ucx_node_br0,aes(x=X.SIZE,y=L2,color="L2 miss UCX br0")) +
geom_line(data=ucx_node_br0,aes(x=X.SIZE,y=L2,color="L2 miss UCX br0",linetype="Node")) +
geom_point(data=ucx_node_ib0,aes(x=X.SIZE,y=L2,color="L2 miss UCX ib0")) +
geom_line(data=ucx_node_ib0,aes(x=X.SIZE,y=L2,color="L2 miss UCX ib0",linetype="Node"))+
geom_point(data=ucx_node_ib0,aes(x=X.SIZE,y=L1,color="L1 miss UCX ib0")) +
geom_line(data=ucx_node_ib0,aes(x=X.SIZE,y=L1,color="L1 miss UCX ib0",linetype="Node")) +
geom_point(data=ob1_node_tcp,aes(x=X.SIZE,y=L1,color="L1 miss OB1")) +
geom_line(data=ob1_node_tcp,aes(x=X.SIZE,y=L1,color="L1 miss OB1",linetype="Node")) +
geom_point(data=ob1_node_tcp,aes(x=X.SIZE,y=L2,color="L2 miss OB1")) +
geom_line(data=ob1_node_tcp,aes(x=X.SIZE,y=L2,color="L2 miss OB1",linetype="Node")) +
scale_color_manual(name="Protocol", values = c( "#F8766D", "#D89000", "#A3A500", "#39B600", "#00BF7D", "#00BFC4", "#00B0F6", "#9590FF", "#E76BF3", "#FF62BC"))+
scale_linetype_manual(name="Mapping",values=c(1,2)) +
geom_vline(xintercept=32768,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=1048576,
color = "black",linetype="dashed", size=0.35)+
geom_vline(xintercept=19922944,
color = "black",linetype="dashed", size=0.35)+
theme(axis.text.y =element_blank())+
annotate("text", x=58000, y=24000, label= "L1",color="Black")+
annotate("text", x=1800000, y=24000, label= "L2",color="Black")+
annotate("text", x=29900000, y=24000, label= "L3",color="Black")+
labs(title="PingPong cache misses", subtitle="Thin node,node mapping") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))
```
```{r, dpi=600, fig.width=10, fig.height=5.7,echo=FALSE}
sp_thin
sp_gpu
```
```{r, dpi=600, fig.width=10, fig.height=6,echo=FALSE}
sp1
sp
```
The behaviour of the Intel implementation can now be seen more clearly: it has the largest number of cache misses among all configurations. The bandwidth spike of **UCX** after $128$ KB can be explained by the drop in **L1** and **L2** cache misses. The **OB1** implementation seems to be the most cache-friendly *PML*. Disclaimer: these graphs should be read qualitatively, not quantitatively.
```{r,echo=FALSE,include=FALSE}
node_ib=read.csv("./pingpong/ompi/node_ib.out.csv", header = TRUE)
node_ob1_tcp=read.csv("./pingpong/ompi/node_ob1_selftcp.out.csv", header = TRUE)
node_ucx_br0=read.csv("./pingpong/ompi/node_ucx_br0.out.csv", header = TRUE)
node_ucx_ib0=read.csv("./pingpong/ompi/node_ucx_ib0.out.csv", header = TRUE)
node_ucx__mlx5=read.csv("./pingpong/ompi/node_ucx_mlx5.out.csv", header = TRUE)
node_ib_intel=read.csv("./pingpong/intel/node_ib_intel.out.csv", header = TRUE)
```
```{r, dpi=600, fig.width=9, fig.height=5,echo=TRUE,include=FALSE}
library(ggplot2)
sp = ggplot(node_ib,aes(x=X.bytes,y=mbs,color="UCX IB")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=core_ib$X.bytes)+
scale_y_continuous(breaks = seq(0,13000,500),name="MB/s") + geom_point() + geom_line() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=node_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp")) + geom_line(data=node_ob1_tcp,aes(x=X.bytes,y=mbs,color="OB1 tcp"))+ geom_point(data=node_ucx_br0,aes(x=X.bytes,y=mbs,color="UCX br0")) + geom_line(data=node_ucx_br0,aes(x=X.bytes,y=mbs,color="UCX br0")) + geom_point(data=node_ucx_ib0,aes(x=X.bytes,y=mbs,color="UCX ib0")) + geom_line(data=node_ucx_ib0,aes(x=X.bytes,y=mbs,color="UCX ib0")) + geom_point(data=node_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB")) + geom_line(data=node_ib_intel,aes(x=X.bytes,y=mbs,color="Intel IB")) +
labs(title="PingPong bandwidth", subtitle="Thin node,node mapping") +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold"))
```
```{r , dpi=600, fig.width=10, fig.height=6,echo=FALSE}
sp
```
The inter-node benchmarks point out the network performance and topology. Two physical networks are available: $25$ Gbit Ethernet and $100$ Gbit InfiniBand.
The PCI-E devices are visible from lstopo. We can see that em1 and em2 are bonded together, forming the interface bond0. To infer this configuration we can use the ifconfig and ip link commands.
\scriptsize
```{bash, eval=FALSE}
[s271550@ct1pt-tnode007 etc]$ ifconfig
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
em1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
em2: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether 34:80:0d:4e:55:68 txqueuelen 1000 (Ethernet)
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 00:00:09:07:FE:80:00: ..... txqueuelen 256 (InfiniBand)
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
```
```{bash,eval=FALSE}
[s271550@ct1pt-tnode007 etc]$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
2: em1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
3: em2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 256
link/infiniband ....
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP mode DEFAULT group default qlen 1000
6: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
```
\normalsize
\newpage
With the openMPI implementation and UCX, we can directly select the devices, and hence the protocol (e.g. through the `UCX_NET_DEVICES` environment variable). The devices tested are ib0, br0 and mlx5_0:1: ib0 uses the IPoIB protocol, br0 leads to TCP communication, and mlx5_0:1 is the pure native InfiniBand device.
The theoretical maximum performances are $12.5$ GB/s, or $12800$ MB/s, for InfiniBand and $3.125$ GB/s, or $3200$ MB/s, for the Ethernet network.
The experimentally measured asymptotic bandwidths are reported below.
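They are obtained by fitting the usual linear communication model to the measured times,
$$T(m)=\lambda+\frac{m}{B},$$
where the small-message points give the latency $\lambda$ (the intercept of the fit) and the inverse slope on large messages gives the asymptotic bandwidth $B$; this is what the `calculate_fit` helper in the analysis code computes.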
```{r,include=FALSE,echo=FALSE}
calculate_fit=function(d){
  # Fit t = lambda + bytes/B: the small-message fit gives the latency
  # (intercept), the large-message fit gives the asymptotic bandwidth
  # (inverse slope).
  fit1=lm(data=d[1:15,],formula = t~X.bytes)
  fit2=lm(data=d[15:30,],formula = t~X.bytes)
  lambda=fit1$coefficients[1]
  b=(fit2$coefficients[2]^-1)
  d$t_est=lambda+d$X.bytes/b
  d$b_est=d$X.bytes/d$t_est
  return(d)
}
plot_fit=function( d,mapping,mapping_2 ){
mapping_2=paste0(mapping_2, " \nlatency=" ,round(d$t_est[1],digits = 2)," uSec\n bandwidth=",round(d$b_est[30],digits = 2)," MB/s")
sp = ggplot(d,aes(x=d$X.bytes,y=mbs,color="Real")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=d$X.bytes) +
geom_point() + geom_line()+scale_y_continuous(name="MB/s") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated")) +
geom_line(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated"))+
labs(title=mapping, subtitle=mapping_2) +
theme(plot.title = element_text(size = 13, face = "bold",hjust = 0.5,margin = margin(t = 12)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 7, b = 10), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 7, b = 7, r=12)), axis.title.x = element_text(margin = margin(t = 10, b=10)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold")) +
theme(
# panel.background = element_rect(fill = "#ebeef0", colour = "#d5dce0",
# size = 2, linetype = "solid"),
panel.grid.major = element_line(size = 0.6, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.35, linetype = 'solid',
colour = "white")
)+scale_color_manual(name="",values=c("#F8766D","#00BFC4"))
sp
}
get_plot_fit=function( d,mapping,mapping_2 ){
mapping_2=paste0(mapping_2, " \nlatency=" ,round(d$t_est[1],digits = 2)," uSec\n bandwidth=",round(d$b_est[30],digits = 2)," MB/s")
sp = ggplot(d,aes(x=d$X.bytes,y=mbs,color="Real")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=d$X.bytes) +
geom_point() + geom_line()+scale_y_continuous(name="MB/s") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated")) +
geom_line(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated"))+
labs(title=mapping, subtitle=mapping_2) +
theme(plot.title = element_text(size = 10, face = "bold",hjust = 0,margin = margin(t = 5)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 4, b = 5), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 4, b = 4, r=12)), axis.title.x = element_text(margin = margin(t = 5, b=5)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold")) +
theme(
panel.grid.major = element_line(size = 0.6, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.35, linetype = 'solid',
colour = "white")
)+scale_color_manual(name="",values=c("#F8766D","#00BFC4"))+
guides(color = "none", size = "none", linetype = "none")
return(sp)
}
# compact variant with the legend drawn inside the panel (used for the first tile)
get_plot_fit_legend=function( d,mapping,mapping_2 ){
mapping_2=paste0(mapping_2, " \nlatency=" ,round(d$t_est[1],digits = 2)," uSec\n bandwidth=",round(d$b_est[30],digits = 2)," MB/s")
sp = ggplot(d,aes(x=d$X.bytes,y=mbs,color="Real")) +
scale_x_continuous(trans='log2',name="Message Size Bytes",breaks=d$X.bytes) +
geom_point() + geom_line()+scale_y_continuous(name="MB/s") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_point(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated")) +
geom_line(data=d,aes(x=d$X.bytes,y=b_est,color="Estimated"))+
labs(title=mapping, subtitle=mapping_2) +
theme(plot.title = element_text(size = 10, face = "bold",hjust = 0,margin = margin(t = 5)), plot.subtitle = element_text(size = 10, hjust = 0.5, margin = margin(t = 4, b = 5), face = "italic"), axis.title = element_text(face = "bold"), legend.text = element_text(margin = margin(t = 4, b = 4, r=12)), axis.title.x = element_text(margin = margin(t = 5, b=5)), axis.title.y = element_text(margin = margin(r = 10,l=10)), axis.text = element_text(color= "#2f3030", face="bold")) +
theme(
panel.grid.major = element_line(size = 0.6, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.35, linetype = 'solid',
colour = "white")
)+scale_color_manual(name="",values=c("#F8766D","#00BFC4"))+
theme(legend.position = c(0.2, 0.8))
return(sp)
}
```
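The estimated curves shown below come from a simple linear ping-pong model: the smaller half of the message sizes determines the latency $\lambda$ (the intercept of the fit) and the larger half the asymptotic bandwidth $B$ (the inverse of the slope),

$$
t(s) \simeq \lambda + \frac{s}{B}, \qquad b_{\mathrm{est}}(s) = \frac{s}{t(s)}.
$$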
```{r, dpi=600, fig.width=15, fig.height=12.5,echo=FALSE}
# map-by-node fits on the THIN node, one per transport/device
node_ib_fit=calculate_fit(node_ib)
sp1=get_plot_fit_legend(node_ib_fit,mapping="Map by node, THIN node",mapping_2="UCX ib")
node_ob1_tcp_fit=calculate_fit(node_ob1_tcp)
sp2=get_plot_fit(node_ob1_tcp_fit,mapping="Map by node, THIN node",mapping_2="OB1 tcp")
node_ucx_br0_fit=calculate_fit(node_ucx_br0)
sp3=get_plot_fit(node_ucx_br0_fit,mapping="Map by node, THIN node",mapping_2="UCX interface br0")
node_ucx_ib0_fit=calculate_fit(node_ucx_ib0)
sp4=get_plot_fit(node_ucx_ib0_fit,mapping="Map by node, THIN node",mapping_2="UCX interface ib0")
node_ib_intel_fit=calculate_fit(node_ib_intel)
sp5=get_plot_fit(node_ib_intel_fit,mapping="Map by node, THIN node",mapping_2="Intel ib")
grid.arrange(sp1,sp2,sp3,sp4,sp5,nrow=3)
# collect the fitted latency/bandwidth values for the summary of results
df_node=data.frame(name=c("UCX IB","OB1 tcp","UCX br0","UCX ib0","Intel ib"),
latency=c(node_ib_fit$t_est[1],node_ob1_tcp_fit$t_est[1],node_ucx_br0_fit$t_est[1],node_ucx_ib0_fit$t_est[1],node_ib_intel_fit$t_est[1]),
bandwith=c(node_ib_fit$b_est[30],node_ob1_tcp_fit$b_est[30],node_ucx_br0_fit$b_est[30],node_ucx_ib0_fit$b_est[30],node_ib_intel_fit$b_est[30]))
```
UCX and Intel MPI over InfiniBand show the best performance among all the configurations, both in latency and in bandwidth, with UCX slightly ahead of Intel on latency. No big cache effects are visible, since RDMA bypasses the caches, the CPU and the kernel; this also explains why the bandwidth is higher here than with core and socket mapping. The measured InfiniBand bandwidth is about $95\%$ of the theoretical one, a very good result given the $64b/66b$ encoding.
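
Indeed, the $64b/66b$ encoding alone caps the effective data rate of the $100$ Gb/s link at

$$
12.5\ \mathrm{GB/s} \times \frac{64}{66} \approx 12.1\ \mathrm{GB/s},
$$

i.e. about $97\%$ of the nominal bandwidth, so a measured $\sim 95\%$ is close to the encoding-limited maximum.
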
UCX on br0 and OB1 with TCP show comparable latency, but OB1 achieves a better bandwidth. The measured maximum is about $85\%$ of the theoretical bandwidth, a good result considering that TCP is a heavier protocol than InfiniBand (ACKs, handshakes, encoding, ...), and that transport overhead plus some inefficiency introduced by the cache and the CPU (no RDMA available) are also present.
IPoIB shows good latency (no Ethernet switch is involved), but its bandwidth is lower than one would expect.
GPU nodes behave like THIN nodes, so no qualitative difference is visible: the GPU node is only slightly slower, which can be attributed to its different CPU frequency and node configuration. The cache sizes are the same as on THIN nodes, so the cache effects are similar. A summary of the fitting results on the THIN and GPU nodes is presented on the next page.
```{r, dpi=600, fig.width=14, fig.height=20,echo=FALSE}
# calculate_fit() and the plotting helpers defined above are reused here
# map-by-core fits on THIN and GPU nodes; the individual plots are omitted,
# only the fitted latency/bandwidth values enter the summary of results
core_ib_fit=calculate_fit(core_ib)
core_ob1_tcp_fit=calculate_fit(core_ob1_tcp)
core_ob1_vader_fit=calculate_fit(core_ob1_vader)
core_ib_intel_fit=calculate_fit(core_ib_intel)
core_ib_gpu_fit=calculate_fit(core_ib_gpu)
core_ib_intel_gpu_fit=calculate_fit(core_ib_intel_gpu)
core_ob1_tcp_gpu_fit=calculate_fit(core_ob1_tcp_gpu)
core_ob1_vader_gpu_fit=calculate_fit(core_ob1_vader_gpu)