NVIDIA Cosmos is a developer-first world foundation model platform designed to help Physical AI developers build their Physical AI systems better and faster. Cosmos contains
- pre-trained models, available via Hugging Face under the NVIDIA Open Model License, which allows commercial use of the models for free
- training/fine-tuning scripts under the Apache 2.0 License, offered through the NVIDIA NeMo Framework, for fine-tuning the models for various downstream Physical AI applications
Details of the platform are described in the Cosmos paper. Preview access is available at build.nvidia.com. The platform provides:
- Pre-trained Diffusion-based world foundation models for Text2World and Video2World generation where a user can generate visual simulation based on text prompts and video prompts.
- Pre-trained Autoregressive-based world foundation models for Video2World generation where a user can generate visual simulation based on video prompts and optional text prompts.
- Video tokenizers for tokenizing videos into continuous tokens (latent vectors) and discrete tokens (integers) efficiently and effectively.
- Post-training scripts to post-train the pre-trained world foundation models for various Physical AI setups.
- Video curation pipeline for building your own video dataset. [Coming soon]
- Training scripts for building your own world foundation model. [Diffusion] [Autoregressive].
Model name | Description | Try it out |
---|---|---|
Cosmos-1.0-Diffusion-7B-Text2World | Text to visual world generation | Inference |
Cosmos-1.0-Diffusion-14B-Text2World | Text to visual world generation | Inference |
Cosmos-1.0-Diffusion-7B-Video2World | Video + Text based future visual world generation | Inference |
Cosmos-1.0-Diffusion-14B-Video2World | Video + Text based future visual world generation | Inference |
Cosmos-1.0-Autoregressive-4B | Future visual world generation | Inference |
Cosmos-1.0-Autoregressive-12B | Future visual world generation | Inference |
Cosmos-1.0-Autoregressive-5B-Video2World | Video + Text based future visual world generation | Inference |
Cosmos-1.0-Autoregressive-13B-Video2World | Video + Text based future visual world generation | Inference |
Cosmos-1.0-Guardrail | Guardrail contains pre-Guard and post-Guard for safe use | Embedded in model inference scripts |
Follow the Cosmos Installation Guide to set up the Docker environment. For inference with the pre-trained models, please refer to Cosmos Diffusion Inference and Cosmos Autoregressive Inference.
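The pre-trained checkpoints are hosted on Hugging Face. As a minimal sketch (assuming the Hugging Face repo ID nvidia/Cosmos-1.0-Diffusion-7B-Text2World and the checkpoints/<model-name> layout implied by the example below; the repository's own download instructions are authoritative), a single model can be fetched with the Hugging Face CLI:

```bash
# Sketch only: the repo ID and local directory layout are assumptions, matching the
# checkpoints/<model-name> convention used by the inference example below.
huggingface-cli login   # authenticate once if the model repo requires accepting terms
huggingface-cli download nvidia/Cosmos-1.0-Diffusion-7B-Text2World \
    --local-dir checkpoints/Cosmos-1.0-Diffusion-7B-Text2World
```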
The code snippet below provides a gist of the inference usage.
PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. \
The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. \
A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, \
suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. \
The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of \
field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."
# Example using 7B model
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/text2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Text2World \
--prompt "$PROMPT" \
--offload_prompt_upsampler \
--video_save_name Cosmos-1.0-Diffusion-7B-Text2World
Sample output: text2world_example.mp4
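Video2World and autoregressive inference follow the same pattern. The commands below are illustrative sketches, not the authoritative interfaces: the entry-point paths mirror the repository layout used above, but flags such as --input_image_or_video_path, --num_input_frames, --input_type, and --ar_model_dir, as well as the placeholder input path, are assumptions; see Cosmos Diffusion Inference and Cosmos Autoregressive Inference for the exact usage.

```bash
# Hypothetical Video2World example with the 7B diffusion model (flags and input path
# are assumptions; consult the Cosmos Diffusion Inference doc for the exact command).
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Video2World \
    --input_image_or_video_path path/to/conditioning_input.mp4 \
    --num_input_frames 1 \
    --prompt "$PROMPT" \
    --video_save_name Cosmos-1.0-Diffusion-7B-Video2World

# Hypothetical autoregressive example with the 4B base model (flags and input path
# are assumptions; consult the Cosmos Autoregressive Inference doc for the exact command).
PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --checkpoint_dir checkpoints \
    --ar_model_dir Cosmos-1.0-Autoregressive-4B \
    --input_type video \
    --input_image_or_video_path path/to/conditioning_input.mp4 \
    --video_save_name Cosmos-1.0-Autoregressive-4B
```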
Check out Cosmos Post-training for more details.
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
NVIDIA Cosmos source code is released under the Apache 2.0 License.
NVIDIA Cosmos models are released under the NVIDIA Open Model License. For a custom license, please contact [email protected].