Update megatron-lm to core_r0.11.0 #392

Draft
wants to merge 6 commits into main

Conversation

ISEEKYAN
Copy link

Support Megatron mcore 0.11

Description

This PR introduces official support for Megatron mcore 0.11 with the following updates:

  • Upgraded Megatron to version core_r0.11.0
  • Applied compatibility patch patches/mcore_r0.11.patch
  • Removed legacy version support for cleaner implementation

Special thanks to @chendong-1998 for:

  • Original Megatron upgrade from 0.4 to 0.6 (#93f6a7e)

Compatibility Notes

Current implementation requires careful handling due to dependency conflicts:

  • megatron-core==0.11.0 requires torch>=2.6
  • vllm==0.6.3 requires torch==2.4

Installation constraints:

  1. Must use vLLM's torch dependency (2.4) as the baseline
  2. Do NOT run `pip install -e .` in the mcore directory (it will upgrade torch to 2.6)
  3. Apply the compatibility patch manually after installation
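As a rough illustration of constraints 1 and 2, the installed environment can be sanity-checked after setup. This is a hypothetical snippet for illustration only, not part of this PR:

# Hypothetical post-install sanity check; the version pins follow the
# constraints listed above, and this script is not part of the verl codebase.
import importlib.metadata as md

torch_version = md.version("torch")
vllm_version = md.version("vllm")
print(f"torch=={torch_version}, vllm=={vllm_version}")

# vllm==0.6.3 pins torch==2.4, so megatron-core must not be allowed to
# upgrade torch to 2.6 (i.e. never run `pip install -e .` inside mcore).
assert torch_version.startswith("2.4"), (
    "torch was upgraded past 2.4; reinstall using vllm's pinned torch"
)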

Testing

Tested with `verl/examples/ppo_trainer/run_deepseek_megatron.sh`.

Chendong98 and others added 4 commits February 25, 2025 08:55
Signed-off-by: chendong-1998 <[email protected]>
Signed-off-by: chendong-1998 <[email protected]>
patch megatron-lm with `patches/mcore_r0.11.patch`
Can't run `pip install -e .` in the megatron directory, because mcore 0.11 depends on torch 2.6, while vLLM 0.6.3 requires torch 2.4.
@@ -12,6 +12,8 @@
# See the License for the specific language governing permissions and
Collaborator

Could you keep the v0.4 patch file for now, in case others want to run v0.4 for comparison? Thanks!

Collaborator

we can remove the v0.4 patch after the next stable release of verl

Author

OK, I will add that back.

CLAassistant commented Feb 26, 2025

CLA assistant check
All committers have signed the CLA.

@PeterSH6 (Collaborator) left a comment

Brilliant work!



def get_model_config(model):
    return get_attr_wrapped_model(model, 'megatron_config', allow_none=False)
Collaborator

I think get_model_config() is no longer necessary if we change the attributes of the customized model class.
Currently in ParallelLlamaForCausalLMRmPadPP:

  • config -> Hugging Face config
  • megatron_config -> megatron.core.ModelParallelConfig

We could rename megatron_config to config and rename config to hf_config.
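A minimal sketch of the proposed attribute layout (hypothetical names only, not the current verl code):

import torch.nn as nn

class ParallelLlamaForCausalLMRmPadPP(nn.Module):
    # Hypothetical sketch of the proposed renaming, for illustration only.
    def __init__(self, hf_config, megatron_config):
        super().__init__()
        self.hf_config = hf_config     # was `config`: the Hugging Face PretrainedConfig
        self.config = megatron_config  # was `megatron_config`: megatron.core.ModelParallelConfig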

Author

I agree that the multiple config types in ParallelLlamaForCausalLMRmPadPP can be confusing. However, directly renaming megatron_config to config and config to hf_config might introduce breaking changes to existing code that references these properties.
I suggest merging the current PR first, then handling the renaming separately with proper migration planning.

@@ -216,7 +225,7 @@ class FakeTimers:
     """Disable All Megatron Timing with FakeTimers"""

     def __init__(self):
-        from megatron.timers import DummyTimer
+        from megatron.core.timers import DummyTimer
Collaborator

Do we still need Timers in MCore v0.11?
This FakeTimers class is mainly for optimizer.step() in MCore v0.4.
As there is no need to use a timer in optimizer.step() in MCore v0.11, I suggest simply deleting this class.
(Also delete its usage in L212.)

Author

I tried it; it works without the timer. I will delete the whole timer class.

self.memory_buffers[pp_rank] = build_memory_buffer(weight_buffer_meta)
if pp_rank == self._pp_rank:
from verl.utils.memory_buffer import MemoryBuffer
# The code here is very hard-coded, based on the following assumptions:
Collaborator

That's true. Just wondering whether the MemoryBuffer will complicate the weight synchronization process when we enable EP?
If so, we can abandon the MemoryBuffer and switch to per-parameter synchronization.

Collaborator

I agree with abandoning the MemoryBuffer and switching to per-parameter synchronization.

Collaborator

Essentially, weight binding can be generalized to a per-parameter all-gather and redistribute.
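A minimal sketch of the per-parameter approach (hypothetical helper; assumes an initialized torch.distributed process group for the tensor-parallel ranks):

import torch
import torch.distributed as dist

def allgather_param(param: torch.Tensor, tp_group) -> torch.Tensor:
    # Hypothetical sketch: all-gather one tensor-parallel-sharded parameter so it
    # can be redistributed to the rollout engine, instead of packing every weight
    # into a single MemoryBuffer.
    world_size = dist.get_world_size(group=tp_group)
    shards = [torch.empty_like(param) for _ in range(world_size)]
    dist.all_gather(shards, param.detach(), group=tp_group)
    # The concat dim depends on how the layer is sharded (e.g. dim 0 for
    # column-parallel, dim 1 for row-parallel); dim 0 is assumed here.
    return torch.cat(shards, dim=0)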

@@ -139,7 +155,7 @@ def unwrap_model(model, module_instances=ALL_MODULE_WRAPPER_CLASSNAMES):

 def convert_config(hf_config: PretrainedConfig, megatron_config) -> TransformerConfig:
     print(f'megatron config {megatron_config}')
-    dt = PrecisionType.to_dtype(megatron_config['param_dtype'])
+    dt = torch.bfloat16
Collaborator

How is the parameter dtype of Megatron passed in the current implementation?

Author

If we modify L22-L24 in verl/utils/torch_dtypes.py like this:

HALF_LIST = [16, "16", "fp16", "float16", torch.float16]
FLOAT_LIST = [32, "32", "fp32", "float32", torch.float32]
BFLOAT_LIST = ["bf16", "bfloat16", torch.bfloat16]

then the dtype could be passed via dt = PrecisionType.to_dtype(megatron_config.params_dtype).
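For illustration, a sketch of the lookup this would enable (hypothetical standalone function; the real PrecisionType in verl/utils/torch_dtypes.py may differ):

import torch

HALF_LIST = [16, "16", "fp16", "float16", torch.float16]
FLOAT_LIST = [32, "32", "fp32", "float32", torch.float32]
BFLOAT_LIST = ["bf16", "bfloat16", torch.bfloat16]

def to_dtype(value) -> torch.dtype:
    # Map a string/int/torch.dtype spec to a concrete torch.dtype.
    if value in HALF_LIST:
        return torch.float16
    if value in FLOAT_LIST:
        return torch.float32
    if value in BFLOAT_LIST:
        return torch.bfloat16
    raise ValueError(f"unsupported dtype spec: {value}")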

from megatron.model import Float16Module
from megatron.model import DistributedDataParallel as LocalDDP

from megatron.training.utils import print_rank_0, unwrap_model
Collaborator

We copied several module and utility functions from the Megatron-LM package into megatron_utils.py.
It would be better if we could avoid importing from outside megatron.core.

Author

Sure, fixing this.
