Script to compute step duration #70

tengyifei · 2025-01-31T19:28:09Z

Precursor to solving #67.

Imports https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/utils/profile_convert.py and improves it.

Specifically, I noticed that there is sometimes an empty gap between two step markers in the profile. Averaging event durations would leave those gaps out, underestimating the step time and therefore overestimating the MFU. Instead, this now averages the delta between the starting time offsets of neighboring events.
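
To illustrate the difference between the two averages, here is a minimal sketch. It is not the code in this PR: the `(start_offset, duration)` tuple layout, the function names, and the sample numbers are all assumptions made up for the example.

```python
from statistics import mean


def mean_event_duration(events):
    """Average the duration of each step event (ignores gaps between events)."""
    return mean(duration for _, duration in events)


def mean_start_delta(events):
    """Average the delta between start offsets of neighboring step events.

    Any idle gap between two step markers is included in the delta,
    so this reflects the wall-clock time per step.
    """
    starts = sorted(start for start, _ in events)
    if len(starts) < 2:
        raise ValueError("need at least two step events to compute a delta")
    return mean(later - earlier for earlier, later in zip(starts, starts[1:]))


# Hypothetical step events as (start_offset_us, duration_us).
# Each step runs for 100 us, followed by a 20 us gap before the next marker.
events = [(0, 100), (120, 100), (240, 100), (360, 100)]

print(mean_event_duration(events))  # 100: the gaps are missed
print(mean_start_delta(events))     # 120: the gaps are included
```

With N step events this averages N-1 deltas, so the duration of the last event does not contribute to the result.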

Now that we can print step time from the profile, I removed the step time from the training loop. The in-loop timing added a bunch of delays and was actually pretty inaccurate (1.7s vs 1.85s in local testing).

Tested:

XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 python3 torchprime/torch_xla_models/train.py mesh.fsdp=8 profile_step=4 model=llama-3-8b
XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 python3 torchprime/torch_xla_models/train.py mesh.fsdp=8 profile_step=4 model=mixtral-8x7b

tengyifei marked this pull request as ready for review January 31, 2025 20:02

tengyifei requested a review from bhavya01 January 31, 2025 20:03
