NVIDIA / Megatron-LM Public

Issues: NVIDIA/Megatron-LM

165 Open 652 Closed

Issues list

[BUG] can't load saved fp8 checkpoint when resume training

#1350 opened Jan 8, 2025 by switiz

[BUG] Using fp16 uses more memory than using fp32

#1349 opened Jan 8, 2025 by eliird

[BUG] When trying to convert llama2-7b model from HF format to megatron format

#1348 opened Jan 6, 2025 by Sun2018421

[QUESTION] Typo in MoE README

#1346 opened Jan 4, 2025 by rgtjf

[QUESTION] Resume training about dataset

#1343 opened Jan 2, 2025 by JiwenJ

[QUESTION] Expert Parallelism with Non-Identical Experts

#1342 opened Jan 1, 2025 by kevin3567

[QUESTION] "a2a+p2p" for context parallel (cp)

#1341 opened Dec 27, 2024 by heavyrain-lzy

[QUESTION] How to convert the weight file format of the MAMBA model from pt to safetensors format?

#1339 opened Dec 26, 2024 by fxnie

[QUESTION] Why does Mixtral use Llama2Tokenizer?

#1338 opened Dec 25, 2024 by DemingCheng

[QUESTION] Why BF16 has FP32 for param gradient at the first time unlike FP16 has FP16 for param gradient at the first time and switches them to FP32 when update param?

#1335 opened Dec 24, 2024 by renyinCheng001

[QUESTION] How can I load a checkpoint trained by Megatron-LM 0.5 into Megatron-LM 0.7 to resume pretraining?

#1333 opened Dec 22, 2024 by IgorZan

[BUG] MoE load balancing loss is accumulated twice when using activation checkpointing

#1330 opened Dec 20, 2024 by thuwzt

[BUG] Megatron-LM with torch.compile: The provided qkv memory layout is not supported!

#1329 opened Dec 20, 2024 by qingshanxwx

[QUESTION] Why doesn't GPTDataset build a global shuffle index?

#1328 opened Dec 20, 2024 by dynamicheart

[BUG] Precision issue caused by different token dispatchers in MoE training

#1327 opened Dec 17, 2024 by qi7kuo

[QUESTION] About using StreamingLLM

#1326 opened Dec 17, 2024 by zhangyilalala

[BUG] Using different distributed strategies of Megatron-LM to train the llama3.1-8B model results in inconsistent training loss

#1324 opened Dec 16, 2024 by cailun01

[BUG] FSDP requires torch optimizer, not transformer_engine or apex

#1322 opened Dec 15, 2024 by prrathi

[QUESTION] I encountered the following issue when executing your command. What could be the cause? args.exit_on_missing_checkpoint is: True >> '--exit-on-missing-checkpoint' set ... exiting. <<

#1317 opened Dec 10, 2024 by Alinanini

[QUESTION] Does Megatron support tracing computation graphs with torch.fx?

#1315 opened Dec 7, 2024 by fy-j

[BUG] When using LLaVA with freeze-LM, training text only sample occurs error.

#1314 opened Dec 6, 2024 by liveseongho

[QUESTION] How to specify the implementation of Attention?

#1313 opened Dec 6, 2024 by renyinCheng001

[QUESTION] Gradient Propagation in backward pass

#1312 opened Dec 5, 2024 by arul-lm

[QUESTION] UnboundLocalError: local variable ‘output tensor’ referenced before assignment

#1311 opened Dec 5, 2024 by zmtttt

[ENHANCEMENT] When load_ckpt is called and the obtained iteration count equals args.train_iters, the train_step process is skipped entirely, and the save_checkpoint function may then encounter an error.

#1310 opened Dec 5, 2024 by bphwk
