-
Also remember that SXM versions are 700W parts and are SIGNIFICANTLY harder to cool. Some people have shown that on extended training runs, performance degrades over time because the rig thermal throttles.
-
Apparently, the H100 PCIe is significantly worse than the H100 SXM. On the H100 SXM, we can easily achieve 620-730 TFLOPs (BF16/FP16). In contrast, on the H100 PCIe, we can only reach 350-450 TFLOPs (BF16/FP16).

The cheapest H100 SXM instance I found is on Lambda Cloud, where 8x H100 SXM GPUs cost $27.92 per hour (before tax), roughly $3.49 per GPU hour. The H100 PCIe offers only 56.4%-61.6% of the H100 SXM's performance, so a fair charge for the H100 PCIe would be around $1.96-$2.15 per GPU hour. However, the price of the H100 PCIe on Lambda Cloud is $2.49 per GPU hour, which is 15%-27% more expensive than it should be.

Therefore, my advice is: if you have serious training workloads, use the H100 SXM instead of the H100 PCIe. The name H100 PCIe is misleading. It should be called H50 or H60, as its performance is only 56.4%-61.6% of the H100 SXM's :(
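The fair-price reasoning above is just a throughput ratio applied to the SXM per-GPU price. A minimal sketch of the arithmetic, using only the numbers quoted in the comment (results differ from the comment's 15%-27% only by rounding):

```python
# Fair-price estimate for the H100 PCIe from the BF16/FP16 throughput ratio.
# All figures come from the comment above (Lambda Cloud pricing, observed TFLOPs).
sxm_tflops = (620, 730)    # H100 SXM observed range
pcie_tflops = (350, 450)   # H100 PCIe observed range
sxm_price = 27.92 / 8      # $27.92/hr for 8x H100 SXM -> $3.49 per GPU hour

# Performance ratio at the low and high ends of each range.
ratios = (pcie_tflops[0] / sxm_tflops[0], pcie_tflops[1] / sxm_tflops[1])
fair_price = tuple(round(sxm_price * r, 2) for r in ratios)

pcie_price = 2.49          # Lambda Cloud's actual H100 PCIe price per GPU hour
markup = tuple(round((pcie_price / p - 1) * 100) for p in fair_price)

print(f"perf ratio:      {ratios[0]:.1%}-{ratios[1]:.1%}")
print(f"fair PCIe price: ${fair_price[0]:.2f}-${fair_price[1]:.2f} per GPU hour")
print(f"actual ${pcie_price:.2f} is {markup[1]}%-{markup[0]}% over fair")
```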