Skip to content

Latest commit

 

History

History
408 lines (308 loc) · 20.9 KB

README.md

File metadata and controls

408 lines (308 loc) · 20.9 KB

Cosmos Diffusion-based World Foundation Models

Table of Contents

This page details the steps for using the Cosmos diffusion-based world foundation models.

Getting Started

Set Up Docker Environment

Follow our Installation Guide to set up the Docker environment. All commands on this page should be run inside Docker.

Download Checkpoints

  1. Generate a Hugging Face access token. Set the access token to 'Read' permission (default is 'Fine-grained').

  2. Log in to Hugging Face with the access token:

huggingface-cli login
  1. Request access to Mistral AI's Pixtral-12B model by clicking on Agree and access repository on Pixtral's Hugging Face model page. This step is required to use Pixtral 12B for the Video2World prompt upsampling task.

  2. Download the Cosmos model weights from Hugging Face:

PYTHONPATH=$(pwd) python cosmos1/scripts/download_diffusion.py --model_sizes 7B 14B --model_types Text2World Video2World
  1. The downloaded files should be in the following structure:
checkpoints/
├── Cosmos-1.0-Diffusion-7B-Text2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Diffusion-14B-Text2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Diffusion-7B-Video2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Diffusion-14B-Video2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Tokenizer-CV8x8x8
│   ├── decoder.jit
│   ├── encoder.jit
│   └── mean_std.pt
├── Cosmos-1.0-Prompt-Upsampler-12B-Text2World
│   ├── model.pt
│   └── config.json
├── Pixtral-12B
│   ├── model.pt
│   ├── config.json
└── Cosmos-1.0-Guardrail
    ├── aegis/
    ├── blocklist/
    ├── face_blur_filter/
    └── video_content_safety_filter/

Usage

Model Types

There are two model types available for diffusion world generation:

  1. Text2World: Supports world generation from text input
  • Models: Cosmos-1.0-Diffusion-7B-Text2World and Cosmos-1.0-Diffusion-14B-Text2World
  • Inference script: text2world.py
  1. Video2World: Supports world generation from text and image/video input
  • Models: Cosmos-1.0-Diffusion-7B-Video2World and Cosmos-1.0-Diffusion-14B-Video2World
  • Inference script: video2world.py

Single and Batch Generation

We support both single and batch video generation.

For generating a single video, Text2World mode requires the input argument --prompt (text input). Video2World mode requires --input_image_or_video_path (image/video input). Additionally for Video2World, if the prompt upsampler is disabled, a text prompt must also be provided using the --prompt argument.

For generating a batch of videos, both Text2World and Video2World require --batch_input_path (path to a JSONL file). For Text2World, the JSONL file should contain one prompt per line in the following format, where each line must contain a "prompt" field:

{"prompt": "prompt1"}
{"prompt": "prompt2"}

For Video2World, each line in the JSONL file must contain a "visual_input" field:

{"visual_input": "path/to/video1.mp4"}
{"visual_input": "path/to/video2.mp4"}

If you disable the prompt upsampler by setting the --disable_prompt_upsampler flag, each line in the JSONL file will need to include both "prompt" and "visual_input" fields.

{"prompt": "prompt1", "visual_input": "path/to/video1.mp4"}
{"prompt": "prompt2", "visual_input": "path/to/video2.mp4"}

Sample Commands

There are two main demo scripts for diffusion world generation: text2world.py and video2world.py. Below you will find sample commands for single and batch generation, as well as commands for running with low-memory GPUs using model offloading. We also provide a memory usage table comparing different offloading strategies to help with configuration.

Text2World (text2world.py): 7B and 14B

Generates world from text input.

Single Generation
PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. \
The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. \
A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, \
suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. \
The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of \
field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

# Example using 7B model
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/text2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Text2World \
    --prompt "$PROMPT" \
    --offload_prompt_upsampler \
    --video_save_name Cosmos-1.0-Diffusion-7B-Text2World

# Example using the 7B model on low-memory GPUs with model offloading. The speed is slower if using batch generation.
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/text2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Text2World \
    --prompt "$PROMPT" \
    --video_save_name Cosmos-1.0-Diffusion-7B-Text2World_memory_efficient \
    --offload_tokenizer \
    --offload_diffusion_transformer \
    --offload_text_encoder_model \
    --offload_prompt_upsampler \
    --offload_guardrail_models

# Example using 14B model with prompt upsampler offloading (required on H100)
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/text2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-14B-Text2World \
    --prompt "$PROMPT" \
    --video_save_name Cosmos-1.0-Diffusion-14B-Text2World \
    --offload_prompt_upsampler \
    --offload_guardrail_models
Batch Generation
# Example using 7B model
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/text2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Text2World \
    --batch_input_path cosmos1/models/diffusion/assets/v1p0/batch_inputs/text2world.jsonl \
    --video_save_folder outputs/Cosmos-1.0-Diffusion-7B-Text2World \
    --offload_prompt_upsampler
Example Output

Here is an example output video generated using text2world.py:

text2world_example.mp4

The upsampled prompt used to generate the video is:

In a sprawling, meticulously organized warehouse, a sleek humanoid robot stands sentinel amidst towering shelves brimming with neatly stacked cardboard boxes. The robot's metallic body, adorned with intricate joints and a glowing blue chest light, radiates an aura of advanced technology, its design a harmonious blend of functionality and futuristic elegance. The camera captures this striking figure in a static, wide shot, emphasizing its poised stance against the backdrop of industrial wooden pallets. The lighting is bright and even, casting a warm glow that accentuates the robot's form, while the shallow depth of field subtly blurs the rows of boxes, creating a cinematic depth that draws the viewer into this high-tech realm. The absence of human presence amplifies the robot's solitary vigil, inviting contemplation of its purpose within this vast, organized expanse.

If you disable the prompt upsampler by using the --disable_prompt_upsampler flag, the output video will be generated using the original prompt:

text2world_example2.mp4

The original prompt is:

A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect.

Note that the robot face could be blurred sometimes by the guardrail in this example.

Inference Time and GPU Memory Usage

The numbers provided below may vary depending on system specs and are for reference only.

We report the maximum observed GPU memory usage during end-to-end inference. Additionally, we offer a series of model offloading strategies to help users manage GPU memory usage effectively.

For GPUs with limited memory (e.g., RTX 3090/4090 with 24 GB memory), we recommend fully offloading all models. For higher-end GPUs, users can select the most suitable offloading strategy considering the numbers provided below.

Offloading Strategy 7B Text2World 14B Text2World
Offload prompt upsampler 74.0 GB > 80.0 GB
Offload prompt upsampler & guardrails 57.1 GB 70.5 GB
Offload prompt upsampler & guardrails & T5 encoder 38.5 GB 51.9 GB
Offload prompt upsampler & guardrails & T5 encoder & tokenizer 38.3 GB 51.7 GB
Offload prompt upsampler & guardrails & T5 encoder & tokenizer & diffusion model 24.4 GB 39.0 GB

The table below presents the end-to-end inference runtime on a single H100 GPU, excluding model initialization time.

7B Text2World (offload prompt upsampler) 14B Text2World (offload prompt upsampler, guardrails)
~380 seconds ~590 seconds

Video2World (video2world.py): 7B and 14B

Generates world from text and image/video input.

Single Generation

Note that our prompt upsampler is enabled by default for Video2World, and it will generate the prompt from the input image/video. If the prompt upsampler is disabled, you can provide a prompt manually using the --prompt flag.

# Example using the 7B model
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Video2World \
    --input_image_or_video_path cosmos1/models/diffusion/assets/v1p0/video2world_input0.jpg \
    --num_input_frames 1 \
    --video_save_name Cosmos-1.0-Diffusion-7B-Video2World \
    --offload_prompt_upsampler

# Example using the 7B model on low-memory GPUs with model offloading. The speed is slower if using batch generation.
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Video2World \
    --input_image_or_video_path cosmos1/models/diffusion/assets/v1p0/video2world_input0.jpg \
    --num_input_frames 1 \
    --video_save_name Cosmos-1.0-Diffusion-7B-Video2World_memory_efficient \
    --offload_tokenizer \
    --offload_diffusion_transformer \
    --offload_text_encoder_model \
    --offload_prompt_upsampler \
    --offload_guardrail_models

# Example using 14B model with prompt upsampler offloading (required on H100)
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-14B-Video2World \
    --input_image_or_video_path cosmos1/models/diffusion/assets/v1p0/video2world_input0.jpg \
    --num_input_frames 1 \
    --video_save_name Cosmos-1.0-Diffusion-14B-Video2World \
    --offload_prompt_upsampler \
    --offload_guardrail_models
Batch Generation
# Example using 7B model with 9 input frames
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Video2World \
    --batch_input_path cosmos1/models/diffusion/assets/v1p0/batch_inputs/video2world_ps.jsonl \
    --video_save_folder outputs/Cosmos-1.0-Diffusion-7B-Video2World \
    --offload_prompt_upsampler \
    --num_input_frames 9

# Example using 7B model with 9 input frames without prompt upsampler, using 'prompt' field in the JSONL file
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Video2World \
    --batch_input_path cosmos1/models/diffusion/assets/v1p0/batch_inputs/video2world_wo_ps.jsonl \
    --video_save_folder outputs/Cosmos-1.0-Diffusion-7B-Video2World_wo_ps \
    --disable_prompt_upsampler \
    --num_input_frames 9
Example Output

Here is an example output video generated using video2world.py, using Cosmos-1.0-Diffusion-14B-Video2World:

video2world_example.mp4

The upsampled prompt (generated by the prompt upsampler) used to generate the video is:

The video depicts a long, straight highway stretching into the distance, flanked by metal guardrails. The road is divided into multiple lanes, with a few vehicles visible in the far distance. The surrounding landscape features dry, grassy fields on one side and rolling hills on the other. The sky is mostly clear with a few scattered clouds, suggesting a bright, sunny day.
Inference Time and GPU Memory Usage

The numbers provided below may vary depending on system specs and are for reference only.

Offloading Strategy 7B Video2World 14B Video2World
Offload prompt upsampler 76.5 GB > 80.0 GB
Offload prompt upsampler & guardrails 59.9 GB 73.3 GB
Offload prompt upsampler & guardrails & T5 encoder 41.3 GB 54.8 GB
Offload prompt upsampler & guardrails & T5 encoder & tokenizer 41.1 GB 54.5 GB
Offload prompt upsampler & guardrails & T5 encoder & tokenizer & diffusion model 27.3 GB 39.0 GB

The following table shows the end-to-end inference runtime on a single H100 GPU, excluding model initialization time:

7B Video2World (offload prompt upsampler) 14B Video2World (offload prompt upsampler, guardrails)
~383 seconds ~593 seconds

Arguments

Common Parameters

Parameter Description Default
--checkpoint_dir Directory containing model weights "checkpoints"
--tokenizer_dir Directory containing tokenizer weights "Cosmos-1.0-Tokenizer-CV8x8x8"
--video_save_name Output video filename for single video generation "output"
--video_save_folder Output directory for batch video generation "outputs/"
--prompt Text prompt for single video generation. Required for single video generation. None
--batch_input_path Path to JSONL file for batch video generation. Required for batch video generation. None
--negative_prompt Negative prompt for improved quality "The video captures a series of frames showing ugly scenes..."
--num_steps Number of diffusion sampling steps 35
--guidance CFG guidance scale 7.0
--num_video_frames Number of frames to generate 121
--height Output video height 704
--width Output video width 1280
--fps Frames per second 24
--seed Random seed 1
--disable_prompt_upsampler Disable automatic prompt enhancement False
--offload_diffusion_transformer Offload DiT model after inference, used for low-memory GPUs False
--offload_tokenizer Offload VAE model after inference, used for low-memory GPUs False
--offload_text_encoder_model Offload text encoder after inference, used for low-memory GPUs False
--offload_prompt_upsampler Offload prompt upsampler after inference, used for low-memory GPUs False
--offload_guardrail_models Offload guardrail models after inference, used for low-memory GPUs False

Note: we support various aspect ratios, including 1:1 (960x960 for height and width), 4:3 (960x704), 3:4 (704x960), 16:9 (1280x704), and 9:16 (704x1280). The frame rate is also adjustable within a range of 12 to 40 fps.

Text2World Specific Parameters

Parameter Description Default
--diffusion_transformer_dir Directory containing DiT weights "Cosmos-1.0-Diffusion-7B-Text2World"
--prompt_upsampler_dir Directory containing prompt upsampler weights "Cosmos-1.0-Prompt-Upsampler-12B-Text2World"
--word_limit_to_skip_upsampler Skip prompt upsampler for better robustness if the number of words in the prompt is greater than this value 250

Video2World Specific Parameters

Parameter Description Default
--diffusion_transformer_dir Directory containing DiT weights "Cosmos-1.0-Diffusion-7B-Video2World"
--prompt_upsampler_dir Directory containing prompt upsampler weights "Pixtral-12B"
--input_image_or_video_path Input video/image path for single video generation. Required for single video generation. None
--num_input_frames Number of video frames (1 or 9) 1

Safety Features

The model uses a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed and will be blurred by the guardrail.

For more information, check out the Cosmos Guardrail Documentation.

Prompting Instructions

The input prompt is the most important parameter under the user's control when interacting with the model. Providing rich and descriptive prompts can positively impact the output quality of the model, whereas short and poorly detailed prompts can lead to subpar video generation. Here are some recommendations to keep in mind when crafting text prompts for the model:

  1. Describe a single, captivating scene: Focus on a single scene to prevent the model from generating videos with unnecessary shot changes.
  2. Limit camera control instructions: The model doesn't handle prompts involving camera control well, as this feature is still under development.
  3. Prompt upsampler limitations: The current version of the prompt upsampler may sometimes deviate from the original intent of your prompt, adding unwanted details. If this happens, you can disable the upsampler with the --disable_prompt_upsampler flag and edit your prompt manually. We recommend using prompts of around 120 words for optimal quality.

Cosmos-1.0-Prompt-Upsampler

The prompt upsampler automatically expands brief prompts into more detailed descriptions (Text2World) or generates detailed prompts based on input images (Video2World).

Text2World

When enabled (default), the upsampler will:

  1. Take your input prompt
  2. Process it through a finetuned Mistral model to generate a more detailed description
  3. Use the expanded description for video generation

This can help generate better quality videos by providing more detailed context to the video generation model. To disable this feature, use the --disable_prompt_upsampler flag.

Video2World

When enabled (default), the upsampler will:

  1. Take your input image or video
  2. Process it through a Pixtral model to generate a detailed description
  3. Use the generated description for video generation

Please note that the Video2World prompt upsampler does not consider any user-provided text prompt. To disable this feature, use the --disable_prompt_upsampler flag.