We read every piece of feedback, and take your input very seriously.
To see all available qualifiers, see our documentation.
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
I'm running validation where each trainer can have different data size, however, validation gets stuck.
In the example above, rank 5 ran out of data after batch 50, while other ranks still have data. But the program got stuck.
I'm using FSDP strategy to train an LLM. Not sure why validation batches are synchronized.
v2.3
def validation_step(self, input_dict: Dict, batch_idx: int) -> Dict[str, torch.Tensor]: pred_dict : Dict[str, torch.Tensor] = self.model(input_dict) loss_dict: Dict[str, torch.Tensor] = ( self.model.get_loss_dict(gt_dict=input_dict, pred_dict=pred_dict) ) self.log_dict( {f"val_loss/{key}": loss for key, loss in loss_dict.items()}, # In validation step, all logs are aggregated and logged at the epoch end. on_step=False, on_epoch=True, sync_dist=True, ) return pred_dict
No error message. NCCL timeout after 10 minutes.
No response
The text was updated successfully, but these errors were encountered:
No branches or pull requests
Bug description
I'm running validation where each trainer can have different data size, however, validation gets stuck.
In the example above, rank 5 ran out of data after batch 50, while other ranks still have data. But the program got stuck.
I'm using FSDP strategy to train an LLM. Not sure why validation batches are synchronized.
What version are you seeing the problem on?
v2.3
How to reproduce the bug
Error messages and logs
No error message. NCCL timeout after 10 minutes.
Environment
Current environment
- GPU:
- NVIDIA L20
- available: True
- version: 12.4
- efficientnet-pytorch: 0.7.1
- lightning: 2.3.0
- lightning-thunder: 0.1.0
- lightning-utilities: 0.11.2
- pytorch-lightning: 2.5.0.post0
- pytorch-quantization: 2.1.2
- pytorch-triton: 3.0.0+a9bc1a364
- torch: 2.5.1+cu124
- torch-scatter: 2.1.2+pt25cu124
- torchcodec: 0.1.1+cu124
- torchdata: 0.7.1a0
- torchmetrics: 1.6.1
- torchvision: 0.20.1+cu124
- xpilot-lightning-mini: 2.3.0
- absl-py: 2.1.0
- accelerate: 1.2.1
- adjusttext: 1.1.1
- aiofiles: 22.1.0
- aiohttp: 3.9.3
- aiosignal: 1.3.1
- aiosqlite: 0.20.0
- albumentations: 1.1.0
- alembic: 1.14.0
- aliyun-python-sdk-core: 2.16.0
- aliyun-python-sdk-kms: 2.16.5
- annotated-types: 0.6.0
- anyio: 3.7.1
- argon2-cffi: 23.1.0
- argon2-cffi-bindings: 21.2.0
- arrow: 1.3.0
- astral: 3.2
- astroid: 2.15.8
- asttokens: 2.4.1
- astunparse: 1.6.3
- async-timeout: 4.0.3
- attrs: 23.2.0
- audioread: 3.0.1
- av: 12.3.0
- babel: 2.16.0
- bcrypt: 4.2.1
- beautifulsoup4: 4.12.3
- bleach: 6.1.0
- blessed: 1.20.0
- blinker: 1.9.0
- blis: 0.7.11
- blosc2: 2.7.1
- cachetools: 5.3.3
- casadi: 3.6.5
- catalogue: 2.0.10
- cattrs: 24.1.2
- ccimport: 0.4.4
- certifi: 2024.2.2
- cffi: 1.16.0
- charset-normalizer: 3.3.2
- click: 8.1.7
- cloudpathlib: 0.16.0
- cloudpickle: 2.2.1
- cmake: 3.29.0.1
- coloredlogs: 15.0.1
- colorlog: 6.9.0
- comm: 0.2.2
- confection: 0.1.4
- contextlib2: 21.6.0
- contourpy: 1.2.1
- crcmod: 1.7
- cryptography: 44.0.0
- cuda-python: 12.4.0rc7+3.ge75c8a9.dirty
- cudf: 24.2.0
- cudnn: 1.1.2
- cugraph: 24.2.0
- cugraph-dgl: 24.2.0
- cugraph-service-client: 24.2.0
- cugraph-service-server: 24.2.0
- cuml: 24.2.0
- cumm-cu120: 0.4.11
- cupy-cuda12x: 13.0.0
- cycler: 0.12.1
- cymem: 2.0.8
- cython: 3.0.10
- dask: 2024.1.1
- dask-cuda: 24.2.0
- dask-cudf: 24.2.0
- databricks-cli: 0.18.0
- dbus-python: 1.2.18
- debugpy: 1.8.1
- decorator: 5.1.1
- decord: 0.6.0
- deepspeed: 0.15.0
- defusedxml: 0.7.1
- deprecated: 1.2.15
- descartes: 1.1.0
- dill: 0.3.9
- distributed: 2024.1.1
- distro: 1.7.0
- dm-tree: 0.1.8
- dnspython: 2.7.0
- docker: 6.1.3
- docstring-parser: 0.16
- dotmap: 1.3.30
- duckdb: 0.8.1
- durationpy: 0.9
- easydict: 1.11
- editor: 1.6.6
- efficientnet-pytorch: 0.7.1
- einops: 0.7.0
- elasticsearch: 7.17.12
- elasticsearch-dsl: 7.4.1
- engineering-notation: 0.10.0
- entrypoints: 0.4
- exceptiongroup: 1.2.0
- execnet: 2.0.2
- executing: 2.0.1
- expecttest: 0.1.3
- fastjsonschema: 2.19.1
- fastrlock: 0.8.2
- ffmpeg-python: 0.2.0
- filelock: 3.13.3
- filterpy: 1.4.5
- fire: 0.7.0
- flash-attn: 2.7.2.post1
- flask: 2.3.3
- flatbuffers: 24.12.23
- fonttools: 4.51.0
- fqdn: 1.5.1
- frozenlist: 1.4.1
- fsspec: 2022.11.0
- ftfy: 6.2.3
- func-timeout: 4.3.5
- future: 1.0.0
- fuyao: 4.5.1.post2
- fuyao-all: 3.8.17
- fuyao-skinny: 0.1.10
- gast: 0.5.4
- gitdb: 4.0.12
- gitpython: 3.1.44
- google-auth: 2.29.0
- google-auth-oauthlib: 0.4.6
- graphsurgeon: 0.4.6
- greenlet: 3.1.1
- grpcio: 1.62.1
- gunicorn: 20.1.0
- guppy3: 3.1.3
- h5py: 3.12.1
- hjson: 3.1.0
- huggingface-hub: 0.27.1
- humanfriendly: 10.0
- hypothesis: 5.35.1
- idna: 3.6
- igraph: 0.11.4
- imageio: 2.36.1
- importlib-metadata: 4.13.0
- iniconfig: 2.0.0
- inquirer: 3.4.0
- intel-openmp: 2021.4.0
- ipcqueue: 0.9.7
- ipykernel: 6.29.4
- ipython: 8.21.0
- ipython-genutils: 0.2.0
- ipywidgets: 8.1.5
- isoduration: 20.11.0
- isort: 5.13.2
- itsdangerous: 2.2.0
- jedi: 0.19.1
- jinja2: 3.1.3
- jmespath: 0.10.0
- joblib: 1.3.2
- json5: 0.9.24
- jsonpointer: 3.0.0
- jsonschema: 4.21.1
- jsonschema-specifications: 2023.12.1
- jupyter: 1.1.1
- jupyter-client: 6.2.0
- jupyter-console: 6.6.3
- jupyter-core: 5.7.2
- jupyter-enterprise-gateway: 3.2.2
- jupyter-events: 0.11.0
- jupyter-server: 1.24.0
- jupyter-server-fileid: 0.9.3
- jupyter-server-ydoc: 0.8.0
- jupyter-tensorboard: 0.2.0
- jupyter-ydoc: 0.2.5
- jupyterlab: 3.6.6
- jupyterlab-pygments: 0.3.0
- jupyterlab-server: 2.27.3
- jupyterlab-widgets: 3.0.13
- jupytext: 1.16.1
- kafka-python: 2.0.2
- kiwisolver: 1.4.5
- kubernetes: 31.0.0
- langcodes: 3.3.0
- lark: 1.1.9
- lazy-loader: 0.4
- lazy-object-proxy: 1.10.0
- librosa: 0.10.1
- lightning: 2.3.0
- lightning-thunder: 0.1.0
- lightning-utilities: 0.11.2
- llvmlite: 0.42.0
- locket: 1.0.0
- looseversion: 1.3.0
- lz4: 4.3.2
- mako: 1.3.8
- markdown: 3.6
- markdown-it-py: 3.0.0
- markupsafe: 2.1.5
- masksdk: 1.2.1
- matplotlib: 3.8.4
- matplotlib-inline: 0.1.6
- mccabe: 0.7.0
- mdit-py-plugins: 0.4.0
- mdurl: 0.1.2
- mistune: 3.0.2
- mkl: 2021.1.1
- mkl-devel: 2021.1.1
- mkl-include: 2021.1.1
- mlflow: 1.29.0
- mock: 5.1.0
- mpmath: 1.3.0
- msgpack: 1.0.8
- multidict: 6.0.5
- multiscaledeformableattention: 1.0
- murmurhash: 1.0.10
- mypy-extensions: 1.0.0
- namedatomiclock: 1.1.3
- nbclassic: 1.1.0
- nbclient: 0.10.0
- nbconvert: 7.16.3
- nbformat: 5.10.4
- ndindex: 1.9.2
- nest-asyncio: 1.6.0
- networkx: 3.4.2
- ninja: 1.11.1.1
- notebook: 6.4.10
- notebook-shim: 0.2.4
- numba: 0.59.0+1.g20ae2b56c
- numexpr: 2.10.2
- numpy: 1.24.4
- nuscenes-devkit: 1.1.9
- nvfuser: 0.1.6a0+a684e2a
- nvidia-cublas-cu12: 12.4.5.8
- nvidia-cuda-cupti-cu12: 12.4.127
- nvidia-cuda-nvrtc-cu12: 12.4.127
- nvidia-cuda-runtime-cu12: 12.4.127
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.2.1.3
- nvidia-curand-cu12: 10.3.5.147
- nvidia-cusolver-cu12: 11.6.1.9
- nvidia-cusparse-cu12: 12.3.1.170
- nvidia-dali-cuda120: 1.36.0
- nvidia-ml-py: 12.560.30
- nvidia-nccl-cu12: 2.21.5
- nvidia-nvimgcodec-cu12: 0.2.0.7
- nvidia-nvjitlink-cu12: 12.4.127
- nvidia-nvtx-cu12: 12.4.127
- nvidia-pyindex: 1.0.9
- nvsmi: 0.4.2
- nvtx: 0.2.5
- oauthlib: 3.2.2
- onnx: 1.16.0
- onnxruntime: 1.16.0
- opencv: 4.7.0
- opencv-python: 4.5.5.62
- opencv-python-headless: 4.5.5.62
- opt-einsum: 3.3.0
- optree: 0.11.0
- orjson: 3.8.7
- oss-middle-layer: 1.9.9
- oss2: 2.16.0
- ossfs: 2023.1.0
- packaging: 21.3
- pandas: 1.5.3
- pandocfilters: 1.5.1
- paramiko: 3.5.0
- parso: 0.8.4
- partd: 1.4.1
- pccm: 0.4.16
- peft: 0.13.2
- pexpect: 4.9.0
- pika: 1.3.2
- pillow: 10.3.0
- pip: 24.0
- pipdeptree: 2.13.0
- platformdirs: 4.2.0
- pluggy: 1.4.0
- ply: 3.11
- polars: 0.18.6
- polygraphy: 0.49.8
- pooch: 1.8.1
- portalocker: 3.1.1
- preshed: 3.0.9
- presto-python-client: 0.8.4
- prettytable: 3.10.0
- prometheus-client: 0.20.0
- prometheus-flask-exporter: 0.23.1
- prompt-toolkit: 3.0.43
- protobuf: 3.20.3
- psutil: 5.9.4
- psycopg2-binary: 2.9.10
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- py-cpuinfo: 9.0.0
- py-spy: 0.3.14
- pyarrow: 14.0.1
- pyasn1: 0.6.0
- pyasn1-modules: 0.4.0
- pybind11: 2.12.0
- pybind11-global: 2.12.0
- pycls: 0.1.1
- pycocotools: 2.0.8
- pycparser: 2.22
- pycryptodome: 3.21.0
- pycryptodomex: 3.21.0
- pydantic: 2.6.4
- pydantic-core: 2.16.3
- pygments: 2.17.2
- pygobject: 3.42.1
- pyjwt: 2.10.1
- pylibcugraph: 24.2.0
- pylibcugraphops: 24.2.0
- pylibraft: 24.2.0
- pylint: 2.17.7
- pylint-exit: 1.2.0
- pymongo: 4.10.1
- pymysql: 1.0.2
- pynacl: 1.5.0
- pynvjitlink: 0.1.13
- pynvml: 11.4.1
- pyparsing: 3.1.2
- pypcd: 0.1.1
- pyquaternion: 0.9.9
- pyrasite: 2.0.1
- pytest: 8.1.1
- pytest-flakefinder: 1.1.0
- pytest-rerunfailures: 14.0
- pytest-shard: 0.1.2
- pytest-xdist: 3.5.0
- python-dateutil: 2.9.0.post0
- python-hostlist: 1.23.0
- python-json-logger: 3.2.1
- python-lzf: 0.2.6
- python-rapidjson: 1.8
- pytorch-lightning: 2.5.0.post0
- pytorch-quantization: 2.1.2
- pytorch-triton: 3.0.0+a9bc1a364
- pytz: 2022.7.1
- pyyaml: 6.0.1
- pyzmq: 24.0.1
- qudida: 0.0.4
- querystring-parser: 1.2.4
- raft-dask: 24.2.0
- rapids-dask-dependency: 24.2.0a0
- ratelimiter: 1.2.0.post0
- readchar: 4.2.1
- redis: 5.2.1
- referencing: 0.34.0
- regex: 2023.12.25
- requests: 2.31.0
- requests-oauthlib: 2.0.0
- rfc3339-validator: 0.1.4
- rfc3986-validator: 0.1.1
- rich: 13.7.1
- rmm: 24.2.0
- rpds-py: 0.18.0
- rsa: 4.9
- ruamel.yaml: 0.18.10
- ruamel.yaml.clib: 0.2.12
- runs: 1.2.2
- safetensors: 0.5.2
- schema: 0.7.5
- scikit-image: 0.25.0
- scikit-learn: 1.2.0
- scipy: 1.12.0
- seaborn: 0.12.2
- send2trash: 1.8.2
- sentencepiece: 0.2.0
- setuptools: 68.2.2
- sh: 2.0.7
- shapely: 1.8.1
- simplejson: 3.19.3
- six: 1.16.0
- smart-open: 6.4.0
- smmap: 5.0.2
- sniffio: 1.3.1
- sortedcontainers: 2.4.0
- soundfile: 0.12.1
- soupsieve: 2.5
- soxr: 0.3.7
- spacy: 3.7.4
- spacy-legacy: 3.0.12
- spacy-loggers: 1.0.5
- spconv-cu120: 2.3.6
- sphinx-glpi-theme: 0.6
- sqlalchemy: 1.4.54
- sqlparse: 0.5.3
- srsly: 2.4.8
- ssh-import-id: 5.11
- stack-data: 0.6.3
- sympy: 1.13.1
- tables: 3.10.1
- tabulate: 0.9.0
- tbb: 2021.12.0
- tblib: 3.0.0
- tenacity: 9.0.0
- tensorboard: 2.17.0
- tensorboard-data-server: 0.7.2
- tensorboard-plugin-wit: 1.8.1
- tensorboardx: 2.6.2.2
- tensorrt: 8.6.3
- termcolor: 2.5.0
- terminado: 0.18.1
- texttable: 1.7.0
- thinc: 8.2.3
- threadpoolctl: 3.3.0
- thriftpy2: 0.4.17
- tifffile: 2024.12.12
- timm: 0.5.4
- tinycss2: 1.2.1
- tk-tools: 0.16.0
- tokenizers: 0.20.4
- toml: 0.10.2
- tomli: 2.0.1
- tomlkit: 0.13.2
- toolz: 0.12.1
- torch: 2.5.1+cu124
- torch-scatter: 2.1.2+pt25cu124
- torchcodec: 0.1.1+cu124
- torchdata: 0.7.1a0
- torchmetrics: 1.6.1
- torchvision: 0.20.1+cu124
- tornado: 6.4
- tqdm: 4.66.2
- traitlets: 5.9.0
- transformers: 4.46.2
- treelite: 4.0.0
- triton: 3.1.0
- typed-argument-parser: 1.10.1
- typer: 0.9.4
- types-dataclasses: 0.6.6
- types-python-dateutil: 2.9.0.20241206
- typing-extensions: 4.10.0
- typing-inspect: 0.9.0
- ucx-py: 0.36.0
- uff: 0.6.9
- uri-template: 1.3.0
- urllib3: 1.26.18
- wasabi: 1.1.2
- watchdog: 6.0.0
- wcwidth: 0.2.13
- weasel: 0.3.4
- webcolors: 24.11.1
- webencodings: 0.5.1
- websocket-client: 1.8.0
- werkzeug: 3.0.2
- wheel: 0.43.0
- widgetsnbextension: 4.0.13
- wrapt: 1.17.0
- xcompress: 5.0.1
- xdata-dataloader: 3.1.63
- xdoctest: 1.0.2
- xfoundation: 0.0.6.dev2
- xgboost: 2.0.3
- xmod: 1.8.1
- xpilot-lightning-mini: 2.3.0
- y-py: 0.6.2
- yacs: 0.1.8
- yarl: 1.9.4
- yarn-api-client: 1.0.3
- ypy-websocket: 0.8.4
- zict: 3.0.0
- zipp: 3.17.0
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.12
- release: 5.10.134-17.3.al8.x86_64
- version: Proposal for help #1 SMP Thu Oct 31 14:29:57 CST 2024
More info
No response
The text was updated successfully, but these errors were encountered: