Query/Issue with Custom YOLOv5 Model and ONNX Export #13473

Open
1 of 2 tasks
AbhirupSinha1811 opened this issue Dec 27, 2024 · 16 comments
Labels
bug Something isn't working detect Object Detection issues, PR's exports Model exports (ONNX, TensorRT, TFLite, etc.)

Comments

@AbhirupSinha1811

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Detection, Export

Bug

I am working with a custom-trained YOLOv5 model that was trained on a dataset with 4 classes. After exporting the model to ONNX format, I am facing discrepancies in the output tensor shape and class configurations, which are creating confusion and potential issues in downstream tasks. Below, I outline the details of my observations, potential root causes, and attempts to resolve the issue.

Environment

yolov5s.pt, Ubuntu 22.04, on my own system.

Minimal Reproducible Example

Standard detection code from "https://github.com/arindal1/yolov5-onnx-object-recognition/blob/main/yolov5.py"

Additional

Observations:

Custom Model Details:

The .pt model was trained on a dataset with 4 classes (bird, drone, helicopter, jetplane).

When inspecting the .pt model, the number of classes is confirmed as 4 both in the names field and in the nc parameter from the data.yaml.

The .pt model performs as expected, detecting all 4 classes correctly during inference.

ONNX Export Details:

After exporting the model to ONNX, the output tensor shape is reported as [1, 8, 8400].

The 8 indicates the number of output channels in the detection head, which suggests it is configured for only 3 classes (5 + 3 = 8 instead of 5 + 4 = 9).

This is inconsistent with the .pt model, which was trained on 4 classes.

When checking the ONNX model metadata, the class names (bird, drone, helicopter, jetplane) are correctly stored, indicating 4 classes in the metadata.

Comparison with Default COCO Model:

For reference, the output tensor shape of a YOLOv5 model trained on the COCO dataset (80 classes) is [1, 25200, 85].

Here, 85 = 5 + 80 (5 for bounding box attributes + 80 for classes).

This format aligns with the expected configuration for YOLO models.
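
As a quick check of the reference shape above, here is a minimal sketch of the arithmetic for YOLOv5's anchor-based head. It assumes a square 640-pixel input, 3 anchors per detection layer, and strides 8, 16 and 32 (the usual defaults, not values stated in this thread). The function name `yolov5_output_shape` is made up for illustration.

```python
def yolov5_output_shape(nc, imgsz=640, na=3, strides=(8, 16, 32), batch=1):
    """Expected shape of the concatenated YOLOv5 (anchor-based) inference output."""
    # Grid cells summed over all detection layers: 80*80 + 40*40 + 20*20 at 640 px
    cells = sum((imgsz // s) ** 2 for s in strides)
    # Each cell predicts `na` boxes; each box has x, y, w, h, objectness + class scores
    return (batch, na * cells, 5 + nc)


print(yolov5_output_shape(80))  # (1, 25200, 85)
print(yolov5_output_shape(4))   # (1, 25200, 9)
```

Under these assumptions a 4-class YOLOv5 export would have 25200 rows of 9 values, which is a different layout from [1, 8, 8400], not just a different class count.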

Key Issues:

Mismatch in Output Tensor Shape:

The ONNX model’s output tensor shape suggests it is configured for only 3 classes ([1, 8, 8400]), despite the .pt model being trained on 4 classes.

This raises concerns about whether the ONNX model will correctly detect all 4 classes.

Potential Causes of the Issue:

The detection head in the .pt model might have been misconfigured during training or export.

For 4 classes, the detection head’s out_channels should be 5 + 4 = 9, but it appears to be set to 8.

The ONNX export process might not be correctly handling the model’s class configuration.

Implications for Object Detection:

If the ONNX model is truly configured for only 3 classes, it may fail to detect one of the classes or produce incorrect predictions.

Steps Taken to Debug:

Inspected Detection Head of .pt Model:

Verified the out_channels of the detection head (last layer).

The .pt model’s detection head is confirmed to have out_channels = 8, indicating a configuration for 3 classes.

This discrepancy persists despite the model being trained on 4 classes.

Verified ONNX Model Metadata:

Extracted metadata from the ONNX model, which correctly lists 4 class names (bird, drone, helicopter, jetplane).

Tried Re-exporting the Model:

Re-exported the .pt model to ONNX using the official YOLOv5 export script.

The issue with the output tensor shape ([1, 8, 8400]) remains.

Request for Assistance:

Clarification on Detection Head Configuration:

Could this issue arise from a misconfiguration of the detection head during training? If so, how can I fix it without retraining the model?

Is there a way to manually adjust the detection head’s out_channels in the .pt model and re-export it to ONNX?

ONNX Export Process:

Are there known issues with the YOLOv5 ONNX export script that could cause this mismatch?

How can I ensure the ONNX model’s detection head is correctly configured for 4 classes?

General Guidance:

What steps can I take to verify that the ONNX model will correctly detect all 4 classes?

Are there tools or scripts you recommend for validating the ONNX model’s outputs?

Additional Context:

ultralytics - 2.4.1
PyTorch Version: 2.4.1

ONNX Runtime Version: 1.16.3

Thank you for your assistance in resolving this issue!

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@AbhirupSinha1811 AbhirupSinha1811 added the bug Something isn't working label Dec 27, 2024
@UltralyticsAssistant UltralyticsAssistant added detect Object Detection issues, PR's exports Model exports (ONNX, TensorRT, TFLite, etc.) labels Dec 27, 2024
@UltralyticsAssistant
Member

👋 Hello @AbhirupSinha1811, thank you for your detailed report and for using YOLOv5 🚀! Your observations and debugging steps are very thorough, which is highly appreciated.

If this is indeed a 🐛 Bug Report, we kindly request a minimum reproducible example (MRE) to better assist in debugging this issue. An MRE would ideally contain simplified, complete code snippets and/or instructions to reproduce the ONNX export and the tensor shape discrepancy.

From the context provided, here are a few steps you can double-check:

  1. Detection Head Configuration: Ensure the YOLOv5 detection head reflects the correct out_channels value (which should match 5 + number_of_classes) of the dataset both before and after training.
  2. ONNX Metadata: Validate that the ONNX model metadata and the number of classes defined match the expected configurations.
  3. Re-export Process: Try re-exporting the model using the official export script with verbose logging enabled to identify any discrepancies during the export process.

Requirements

Ensure you are using Python>=3.8 with all dependencies installed correctly. Install requirements using:

pip install -r requirements.txt

Verified Environments

The ONNX export process is generally supported on environments such as notebooks, cloud platforms, or Docker. Make sure your training and export environments meet the dependencies, including PyTorch, CUDA, and ONNX runtime versions.

Additionally, it's worth confirming if the issue persists when running the export script on different setups or versions.

This is an automated response, but don't worry! An Ultralytics engineer will review your issue promptly to provide further assistance. In the meantime, feel free to share any additional findings or code snippets that could help us debug further 🚀.

@pderrenger
Member

@AbhirupSinha1811 thank you for providing a detailed explanation of the issue. Based on your observations, it seems the problem stems from a misconfigured detection head in the .pt model. Here are some points to address your concerns:

  1. Detection Head Configuration:

    • The mismatch in out_channels (8 instead of 9) indicates the model was trained with an incorrect detection head configuration for 4 classes. Unfortunately, this cannot be fixed without retraining the model, as the detection head's architecture is defined during training.
  2. ONNX Export Process:

    • The YOLOv5 export script correctly uses the configuration of the .pt model for ONNX conversion. Since the .pt model itself is misconfigured, the ONNX model inherits the same issue. There are no known bugs in the export script that would alter the class configuration during conversion.
  3. Manual Adjustment (Without Retraining):

    • While directly modifying the detection head's out_channels in the .pt model is theoretically possible, it is not recommended. Adjusting this manually would require significant changes to the model's architecture and weights, which is error-prone and may lead to unreliable results.
  4. Validation of ONNX Outputs:

    • To verify the ONNX model's behavior, you can test it using the detect.py script in ONNX mode:
      python detect.py --weights model.onnx --img-size 640 --dnn
    • If issues persist, visualizing the model using Netron can help confirm the final layer's configuration.
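
Alongside detect.py and Netron, the output layout can also be checked from Python. The sketch below is one possible approach, not an official tool. `infer_layout` and `onnx_output_shapes` are hypothetical helper names, and `model.onnx` in the usage note is a placeholder path. It assumes `onnxruntime` is installed; the import is deferred so the shape helper works without it.

```python
def infer_layout(shape, nc):
    """Guess the head layout from an output shape and the known class count."""
    dims = set(shape[1:])
    if 5 + nc in dims:
        return "anchor-based (x, y, w, h, obj + classes)"
    if 4 + nc in dims:
        return "anchor-free (x, y, w, h + classes, no objectness)"
    return "unknown"


def onnx_output_shapes(path):
    """Read output shapes and custom metadata from an ONNX file."""
    import onnxruntime as ort  # lazy import: only needed when a file is inspected

    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    shapes = {o.name: o.shape for o in sess.get_outputs()}
    meta = sess.get_modelmeta().custom_metadata_map  # e.g. class names, if stored
    return shapes, meta


print(infer_layout([1, 8, 8400], nc=4))
print(infer_layout([1, 25200, 85], nc=80))
```

For a real file, `shapes, meta = onnx_output_shapes("model.onnx")` returns the values to pass to `infer_layout`. The guess is only a heuristic: it compares channel counts and does not inspect the graph.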

To resolve this issue definitively, it is recommended to retrain the model with the correct class configuration (4 classes). If you suspect a training script issue, ensure you are using the latest YOLOv5 version and verify the data.yaml and training parameters before starting.

Feel free to share further observations or questions. The YOLO community and Ultralytics team are here to help!

@AbhirupSinha1811
Author

AbhirupSinha1811 commented Dec 29, 2024

Hello, after checking the detection head of the YOLO .pt model, this is what I get:
"Dectection Head Output Channels": 68
"Number of Classes": 4
"class Name":[
'bird', 'drone', 'helicopter', 'jetplane']

  1. Detection Head and Output Channels
    Why does the detection head of my custom YOLOv5s model have 68 output channels when it was trained on 4 classes? Shouldn’t it be 27 (3 × (5 + 4) for 4 classes and 3 anchors)?
    Could this mismatch have happened during training? How can I check and fix it?

  2. Inference Behavior
    Why, then, does the .pt model detect all 4 classes correctly during inference?
    How does it handle the extra channels?
    Is this behavior consistent across all formats, such as ONNX or TensorRT?

  3. ONNX Export and Output Shape
    When exporting to ONNX, the output tensor shape is [1, 8, 8400] instead of [1, 27, grid_cells] for 4 classes. Why is this happening, and how can I fix it?
    Could the extra detection head channels (68) be causing this issue?

  4. Debugging and Fixing
    How can I verify the number of classes (nc) and output channels used during training?
    Is there a way to fix the detection head’s output channels post-training without retraining?

  5. Recommendations
    What’s the best way to ensure the detection head matches the number of classes during training and export?
    Are there tools or scripts to avoid issues like this during ONNX export?

Key Observations to Share
Detection Head Output Channels: 68.
Number of Classes (nc) in YAML: 4.
Class Names: ['bird', 'drone', 'helicopter', 'jetplane'].
ONNX Output Tensor Shape: [1, 8, 8400].

Behavior: Model detects all 4 classes correctly during inference with .pt but shows unexpected behavior during ONNX export.
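
The observations above can be compared against the head arithmetic of two model families. This is a sketch under an assumption the thread does not confirm: that the model was trained with the `ultralytics` pip package, whose anchor-free Detect head uses `reg_max = 16` distribution bins per box side and has no objectness channel. The function names are illustrative only.

```python
def anchor_free_head(nc, reg_max=16, imgsz=640, strides=(8, 16, 32)):
    """Channel counts for an anchor-free head (assumed: ultralytics Detect)."""
    no = nc + 4 * reg_max          # raw channels per location inside the head
    export_channels = 4 + nc       # decoded box (4 values) + class scores
    n = sum((imgsz // s) ** 2 for s in strides)  # one prediction per grid cell
    return no, (1, export_channels, n)


def anchor_based_channels(nc, na=3):
    """Output-conv channels per detection layer for YOLOv5's anchor-based head."""
    return na * (5 + nc)


print(anchor_free_head(4))       # (68, (1, 8, 8400))
print(anchor_based_channels(4))  # 27
```

If that assumption holds, no=68 and [1, 8, 8400] are both what a correctly configured 4-class model would produce, and the expected value of 27 applies only to the anchor-based head.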

@pderrenger
Member

@AbhirupSinha1811 thank you for the detailed observations. Here's a concise response addressing your queries:

  1. Detection Head Output Channels (68 instead of 27):
    The detection head's out_channels is determined by the architecture during training. A value of 68 suggests the model may have been configured with additional outputs, such as extra layers or custom modifications. To verify this, inspect the model's architecture and training script for any changes to the detection head.

  2. Correct Inference with .pt:
    Despite the mismatch in out_channels, the .pt model likely filters outputs internally to match the 4 classes during inference. This behavior depends on how the post-processing step (e.g., NMS) is configured. It does not guarantee consistent behavior across formats like ONNX or TensorRT.

  3. ONNX Export Issue ([1, 8, 8400] output):
    The ONNX export inherits the detection head configuration from the .pt model. The discrepancy in output shape likely results from the detection head misconfiguration during training. The [1, 8, 8400] output suggests the model is treating it as 3 classes (5 + 3 = 8). Fixing this requires retraining with the correct configuration.

  4. Debugging and Fixing:

    • To verify the number of classes (nc) used during training, check the data.yaml file and the model.yaml or architecture definition.
    • Post-training fixes are not recommended as modifying detection head outputs requires retraining to ensure weight alignment.
  5. Recommendations:

    • Ensure the data.yaml and model.yaml files are correctly configured for the intended number of classes before training.
    • Use the latest YOLOv5 export script (export.py) to minimize export-related issues.
    • Debug exported models using tools like Netron to visualize outputs and metadata.
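
An alternative reading of the [1, 8, 8400] output, offered as an assumption rather than a confirmed diagnosis: if the export came from an anchor-free head, the 8 channels would be 4 box values followed by 4 class scores, with no objectness channel. The sketch below shows how such an output could be decoded for one image. It uses plain Python lists and toy numbers, omits NMS, and `decode_anchor_free` is a hypothetical helper.

```python
def decode_anchor_free(pred, names, conf_thres=0.25):
    """Decode one image's output laid out as [4 + nc][N] (batch dim removed)."""
    nc = len(names)
    assert len(pred) == 4 + nc, "channel count does not match 4 + number of classes"
    detections = []
    for i in range(len(pred[0])):  # loop over the N candidate predictions
        scores = [pred[4 + c][i] for c in range(nc)]
        best = max(range(nc), key=scores.__getitem__)  # index of the top class
        if scores[best] >= conf_thres:
            cx, cy, w, h = (pred[k][i] for k in range(4))
            detections.append((names[best], scores[best], (cx, cy, w, h)))
    return detections


names = ["bird", "drone", "helicopter", "jetplane"]
# Toy output with N = 2 candidates: rows are cx, cy, w, h, then one row per class
pred = [
    [320.0, 100.0], [240.0, 50.0], [60.0, 20.0], [40.0, 10.0],
    [0.05, 0.01], [0.10, 0.02], [0.02, 0.03], [0.90, 0.04],
]
print(decode_anchor_free(pred, names))
```

The assertion on the channel count is the useful part for this thread: with 4 class names it passes for 8 channels and fails for 9.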

For further details on ONNX export, refer to the YOLOv5 Export Tutorial. Feel free to follow up with additional questions!

@AbhirupSinha1811
Author

Hello, @pderrenger

I have reviewed the training script and data.yaml file thoroughly, and there have been no modifications. The script is standard and directly references data.yaml with nc=4 and class names: ["bird", "drone", "helicopter", "jetplane"]. No customizations or deviations have been made.

Training sample code:-
#================================================================
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# Training starts here

model = YOLO(data="data.yaml", epochs=100) # Initiates training with 100 epochs
model = YOLO('runs/detect/train/weights/last.pt').to(device) # Loads the last checkpoint
result = model.train(resume=True) # Resumes training
#=======================================================================

YOLOv5 Version:-
The YOLOv5 version used for training was downloaded from the official Ultralytics site and initiated on March 24, 2024.

Observed Issue:-
Despite adhering to these configurations, the detection head output count (no=68) does not match the expected configuration for nc=4. This has resulted in discrepancies during ONNX export ([1, 8, 8400] output) and inference.

Training Script Behavior:

The training script appears standard and passes data="data.yaml" with nc=4. Is there any additional step required to ensure that the detection head is correctly initialized with the number of classes (4) during training?

When resuming training with resume=True, does the detection head automatically align with the nc value in data.yaml, or could it retain the configuration from the checkpoint (last.pt)?

Detection Head Configuration:
What could cause the detection head to produce no=68 outputs when nc=4 is defined in data.yaml? Is this likely due to an issue during checkpoint initialization or training?

Does the model automatically reconfigure the detection head when nc changes, or does it require manual intervention (e.g., reinitializing layers)?

Data.yaml Verification:

The data.yaml file has nc=4 and lists four classes. Are there any other factors (e.g., anchor settings or dataset labels) that could lead to a mismatch in detection head outputs?

Does the order or format of the class names in data.yaml impact the detection head configuration during training?

Impact of Resume Training:

When resuming training with last.pt, could the detection head's configuration (e.g., no and anchors) differ from the new dataset's nc? If so, what steps are needed to realign the detection head?

Model Export and Compatibility:
Could a mismatch between nc and no cause downstream issues, such as incorrect ONNX outputs or inference errors in TensorRT? If yes, how can these issues be resolved during export or training?

What is the best way to inspect the detection head during training or inference to verify its nc and no configuration? Are there specific checkpoints or logging steps recommended to avoid such mismatches?

@pderrenger
Member

Hello, @AbhirupSinha1811, and thank you for the detailed explanation and observations. Based on your description, here are some points to address your concerns:

  1. Detection Head Mismatch (no=68 with nc=4):
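    The detection head's no (number of outputs per anchor) is determined when the model is built. For YOLOv5's anchor-based head:
    no = nc + 5, and each detection layer's output convolution has no * number_of_anchors_per_layer channels. For nc=4 that gives no=9 and 27 channels, so a value of no=68 implies some inconsistency in the configuration, possibly due to: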

    • A mismatch in the data.yaml file or its interpretation during training.
    • A prior checkpoint (last.pt) being loaded with a different architecture or parameters. YOLOv5 does not automatically reinitialize the detection head when resuming training (resume=True); it retains the configuration from the checkpoint.
  2. Resume Training Behavior:
    Resuming training with resume=True will not realign the detection head to the data.yaml file’s nc value if the checkpoint was trained with a different configuration. To avoid this, ensure that the initial checkpoint (last.pt) matches the current dataset's nc and other parameters.

  3. ONNX Export Mismatch:
    The ONNX export inherits the trained model’s architecture. If the .pt checkpoint has incorrect no values, the ONNX export ([1, 8, 8400]) will also reflect this. This can cause downstream issues with inference in TensorRT or other formats.

  4. Steps to Address the Issue:

    • Verify Training Parameters: Ensure that the nc=4 in data.yaml aligns with the dataset and that no conflicting parameters are introduced.
    • Inspect the Checkpoint: Use the following code to inspect the detection head’s configuration in the .pt model before resuming training:
      import torch
      ckpt = torch.load('runs/detect/train/weights/last.pt', map_location='cpu')  # Checkpoint dict
      print(ckpt['model'].names)  # Class names
      print(ckpt['model'].yaml)  # Verify nc and other parameters
    • Reinitialize the Detection Head: If the no mismatch persists, reinitialize the model with the correct nc and retrain:
      from ultralytics import YOLO
      model = YOLO('yolov5s.yaml')  # Build an untrained model from a model YAML (file name assumed)
      model.train(data='data.yaml', epochs=100)  # nc is taken from data.yaml
    • Inspect ONNX Outputs: Use Netron to visualize the exported ONNX model and confirm its architecture.
  5. Key Consideration for Resume Training:
    If the checkpoint (last.pt) was trained on a different dataset or configuration, it will retain the previous nc and detection head configuration. Always verify that the checkpoint aligns with the current training setup before resuming.
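
The check described in the steps above (confirming that a checkpoint agrees with the current dataset before resuming) can be sketched as a small comparison. `resume_mismatches` is a hypothetical helper, not part of YOLOv5 or the `ultralytics` package. It takes the class names read from the checkpoint and from data.yaml, and accepts either a list or an index-to-name dict.

```python
def _as_list(names):
    """Accept class names as a list or as an {index: name} dict."""
    if isinstance(names, dict):
        return [names[k] for k in sorted(names)]
    return list(names)


def resume_mismatches(ckpt_names, data_names, data_nc):
    """Return readable problems; an empty list means the class setup agrees."""
    ckpt, data = _as_list(ckpt_names), _as_list(data_names)
    problems = []
    if data_nc != len(data):
        problems.append(f"data.yaml: nc={data_nc} but {len(data)} names listed")
    if len(ckpt) != len(data):
        problems.append(f"checkpoint has {len(ckpt)} classes, dataset has {len(data)}")
    elif ckpt != data:
        problems.append("class names or their order differ between checkpoint and dataset")
    return problems


data_names = ["bird", "drone", "helicopter", "jetplane"]
ckpt_names = {0: "bird", 1: "drone", 2: "helicopter", 3: "jetplane"}
print(resume_mismatches(ckpt_names, data_names, data_nc=4))  # []
```

An empty result only says the class lists agree. It says nothing about the head layout itself.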

Let us know if you need further clarification! For more export-related guidance, refer to the YOLOv5 Export Tutorial.

@AbhirupSinha1811
Author

Hello, I've checked our custom .pt model in Netron and found the following:

Detection Head and Outputs

Why is the number of outputs (no) from the detection head 68? What factors are responsible for this value, and if we retrain, what should we keep in mind beforehand?

The anchors tensor has a shape of float16[2,7497]. Is this correct for my custom-trained model, or does it indicate an issue?

How does the detection head configuration relate to the number of classes (nc=4)?

Anchors
The anchors format is different from the standard YOLOv5 anchors (float16[3,3,2]). Could this cause issues during inference?

How can I confirm if the anchors used during training were correct for my dataset?

Training and Configuration
Could the issues be caused by not explicitly using a model.yaml during training?
Does YOLOv5 automatically adjust anchors for custom datasets, and how can I check this?

@pderrenger
Member

Hello, thank you for your observations. Here's a concise breakdown addressing your concerns:

  1. The no=68 from the detection head suggests a mismatch between the expected number of outputs and your dataset configuration (nc=4). This could result from loading a checkpoint (last.pt) trained on a different setup without reinitializing the model. Retraining with a properly configured model.yaml (matching nc=4) is necessary to resolve this.

  2. The anchors tensor (float16[2,7497]) is incorrect for YOLOv5, where anchors are typically shaped like float16[3,3,2]. This discrepancy could indicate issues in model initialization or training. Ensure that the correct model.yaml is used, and let YOLOv5 automatically calculate anchors during training.

  3. YOLOv5 adjusts anchors automatically for custom datasets unless explicitly overridden. To verify, inspect the anchors in your model.yaml or training logs (AutoAnchor should report if anchors are updated).

To avoid such issues, verify the model.yaml and data.yaml configurations before training and ensure logs report the expected setup. If needed, refer to the YOLOv5 architecture documentation for further details. Let us know if you need additional clarification!
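
One further possibility for the [2, 7497] tensor, stated as an assumption: in the `ultralytics` package's anchor-free Detect head, `anchors` holds grid-cell centre points with shape [2, N] rather than anchor box sizes, so N depends only on the input size that was last run through the model. The sketch below counts those points. The 544×672 size is inferred from the arithmetic and is not stated anywhere in this thread.

```python
def anchor_point_count(height, width, strides=(8, 16, 32)):
    """Number of grid-cell centre points over all detection layers."""
    return sum((height // s) * (width // s) for s in strides)


print(anchor_point_count(640, 640))  # 8400
print(anchor_point_count(544, 672))  # 7497
```

A rectangular validation batch of about that size would account for 7497, while a 640×640 export would give the 8400 seen in the ONNX output.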

@AbhirupSinha1811
Author

Hello @pderrenger ,
could you please help me out by explaining how I can find or check the calculation that produces the value 68 and the 7497 in float16[2,7497]? Is there a way, in code or through Netron, to see exactly which layer feeds the detection layer and what values are passed to it, so I can understand where these wrong values come from?

@pderrenger
Member

Hello @AbhirupSinha1811,

To trace and understand the calculations resulting in no=68 and float16[2,7497], you can inspect the layers preceding the detection head using the following approaches:

  1. Using Netron:
    Open the .pt model in Netron and navigate to the layers directly before the detection head. Look for discrepancies in the output tensor shapes or parameters that might propagate incorrect values.

  2. Using Code:
    Load the model and print the details of the layers before the detection head:

    import torch
    
    model = torch.load('runs/detect/train/weights/last.pt')['model']
    for i, layer in enumerate(model.model[-1].m):  # Iterate through detection layers
        print(f"Layer {i}: {layer}")

    You can also inspect the anchors and shapes:

    print(f"Anchors: {model.yaml['anchors']}")
    print(f"Detection head outputs: {model.model[-1].no}")

This will help identify where the configuration might deviate from expectations. Let me know if you need further clarification!
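
The snippets above assume a YOLOv5-style head with an `m` list and YAML anchors. A more defensive sketch, which reads only the attributes that exist and reports which formula the head matches, might look like this. `summarize_head` is a hypothetical helper. The attribute names `nc`, `no`, `na` and `reg_max` are the ones used by the YOLOv5 and `ultralytics` Detect modules as far as I know, and the stand-in object is there so the example runs without a checkpoint.

```python
from types import SimpleNamespace


def summarize_head(head):
    """Report nc, no and which head formula the numbers are consistent with."""
    nc = getattr(head, "nc", None)
    no = getattr(head, "no", None)
    na = getattr(head, "na", None)            # anchors per layer (anchor-based heads)
    reg_max = getattr(head, "reg_max", None)  # box distribution bins (anchor-free heads)
    if nc is not None and reg_max is not None and no == nc + 4 * reg_max:
        kind = "anchor-free"
    elif nc is not None and na is not None and no == nc + 5:
        kind = "anchor-based"
    else:
        kind = "unrecognised"
    return {"nc": nc, "no": no, "kind": kind}


# Stand-in object carrying the values reported in this thread
fake_head = SimpleNamespace(nc=4, no=68, reg_max=16)
print(summarize_head(fake_head))
```

With a loaded model, the call would be `summarize_head(model.model[-1])`.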

@AbhirupSinha1811
Author

Hello @pderrenger ,
Got it. When I retrain the model, could you please tell me which parameters I have to keep in mind so that the anchors tensor shape (float16[2,7497]), which is incorrect for YOLOv5, does not come up again?

@pderrenger
Member

Hello @AbhirupSinha1811, to avoid issues like incorrect anchor tensor shapes (float16[2,7497]) during retraining, ensure the following:

  1. Use the correct data.yaml file with nc matching the number of classes in your dataset.
  2. Allow YOLOv5 to calculate anchors automatically (AutoAnchor runs by default in train.py; do not pass --noautoanchor), which adapts them to your custom dataset.
  3. Verify the model.yaml architecture matches your dataset needs, particularly the number of detection layers and anchors.
  4. Avoid resuming training (resume=True) if the initial checkpoint was trained on mismatched settings. Start fresh with pretrained=False or a compatible checkpoint.

For more on anchor generation, review the YOLOv5 Architecture Documentation. Let me know if further details are needed!

@AbhirupSinha1811
Author

Hello,

I am currently working on retraining a YOLOv5 model using the last.pt checkpoint and I would like to continue training with additional epochs. I am considering using the --resume argument in the train.py script for this purpose.

Could you please confirm whether using the --resume argument is the correct approach for continuing training from the last.pt checkpoint with additional epochs?

@pderrenger
Member

Hello, yes, using the --resume argument is the correct approach to continue training from the last.pt checkpoint. This will load the weights, optimizer state, and training parameters, seamlessly continuing from where the previous training left off. Ensure your last.pt checkpoint aligns with your current dataset and configuration. For more details, refer to the Ultralytics YOLOv5 Training Documentation.

@AbhirupSinha1811
Author

AbhirupSinha1811 commented Jan 21, 2025

Hello,
In my case the training was not interrupted; I just want to retrain the model from last.pt with additional epochs. Is --resume required in that case?

@pderrenger
Member

Hello, the --resume argument is not required in this case. You can load the last.pt checkpoint and start a new training session with additional epochs by specifying the same data and epochs parameters. Using --resume is intended for continuing interrupted training runs. For your scenario, simply start training with the last.pt as your initial weights.
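
The two cases can be contrasted as YOLOv5 train.py command lines. This is a sketch: `train_command` is a hypothetical helper, and the checkpoint path and epoch count are placeholders. It only builds the argument list and does not start training.

```python
def train_command(mode, weights="runs/train/exp/weights/last.pt",
                  data="data.yaml", epochs=50):
    """Build a YOLOv5 train.py argument list for resuming or for a new run."""
    if mode == "resume":
        # Continue an interrupted run; settings are restored from that run
        return ["python", "train.py", "--resume", weights]
    if mode == "finetune":
        # New run that starts from last.pt weights with its own epoch budget
        return ["python", "train.py", "--weights", weights,
                "--data", data, "--epochs", str(epochs)]
    raise ValueError(f"unknown mode: {mode!r}")


print(" ".join(train_command("resume")))
print(" ".join(train_command("finetune", epochs=100)))
```

The practical difference is that a resumed run keeps the original run's options, while a new run started from the same weights takes its data and epoch settings from the command line.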


3 participants