RGB depth estimation

How to run

Train

Configure training.py as desired. You need to export the environment variable DATASET_PATH so that it points to the DIODE train folder.

python -m training

Run in real time

python -m run_webcam
usage: Run with a webcam feed or a video [-h] saved_model_path input_video

There's a saved_model in the repository that saved_model_path can point to.

If you have trained a model yourself, use:

python -m export_model
usage: Export model [-h] model_weights export_path

This exports the checkpoint to a SavedModel for better inference time.
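The export boils down to loading the checkpoint into the model and serializing it. A minimal sketch (not the actual export_model.py, and assuming a build_model() helper like the one sketched in the Models section below):

```python
import tensorflow as tf

def export(model_weights: str, export_path: str) -> None:
    model = build_model()                    # hypothetical helper; see the Models section below
    model.load_weights(model_weights)        # restore the training checkpoint
    tf.saved_model.save(model, export_path)  # serialize as a SavedModel for inference
```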

Visualize dataset

Modify visualizer.py as needed to visualize the train or val datasets.

python -m data.visualizer

Datasets

For the task of depth estimation there are multiple sources of data.
As this case requires, a viable dataset is one with RGB images as inputs and depth maps as outputs. This type of dataset is not easy to produce reliably, as there is currently no high-resolution depth sensor. Nowadays this is done through several systems: disparity maps from stereo images, lidar, or a pattern-matching system. Each of them has different advantages and disadvantages, which I'm not going to study here, but they should be taken into account.

For this task we can look for datasets on some of the most popular resource pages:

  • Papers with code: Has multiple datasets, which we can split into the following categories:

    • Stereo:
      This data could be useful for pretraining: a disparity map can be generated from each stereo pair and the model pretrained to predict it, before being fine-tuned on depth data.

      • HRWSI: Can be downloaded
      • ETH3D: Can be downloaded
      • UASOL: Can be downloaded
      • Holopix50k: In-the-wild collection; it can be downloaded.
      • WSVD: More than 500 stereoscopic videos from YouTube; can be downloaded like the others.
    • RGB-D:

      • Real data:
        • DIODE: Promising dataset as it contains both indoor and outdoor data (8k indoor images, 17k outdoor). It can also be downloaded.
        • NYUv2: 1.4k images dense, 400k sparse.
        • TUM RGB-D: Data extracted from a Microsoft Kinect. Can be downloaded
        • SUN3D: Data extracted from a Microsoft Kinect. Can be downloaded
        • ScanNet: Would be useful as it also has segmentation masks and bounding boxes, but it can't be downloaded directly without asking for permission.
      • Synthetic data:
        • Hypersim: Useful dataset for pretraining as it contains synthetic data. It can also be downloaded.
        • EDEN: Synthetic dataset, but it's specific to gardens, so it might not be great as the model could overfit to that domain. It's nice that it has segmentation masks too.
        • Virtual KITTI: Synthetic dataset of driving scenes.
        • SUNCG: Interesting dataset for pretraining as it contains dense synthetic data. It can't be downloaded due to a legal problem.
        • SuperCaustics: Synthetic data from Unreal Engine. It could be useful for generating data on the fly, but it doesn't provide a ready-made dataset with depth.
    • Lidar/Pointcloud:

  • Huggingface: There's only one dataset available for this task on Huggingface, which is nyu_depth_v2.

Models

This task is considered an image-to-image problem: the input is an image and the output is another image. Here the output is a depth map, so the values are continuous and it is therefore a per-pixel regression.

The model used for this task can be seen in model.py. It is based on DeepLabv3: although DeepLabv3 is mostly used for image segmentation, the last layer can be changed to a single channel without activation so that it can solve per-pixel regression problems.
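Keras does not ship DeepLabv3, so, as an illustration only (the real architecture lives in model.py), a minimal sketch of the idea looks like this:

```python
import tensorflow as tf

def build_model(input_shape=(512, 512, 3)) -> tf.keras.Model:
    # ImageNet-pretrained backbone (the repository uses a ResNet50-based DeepLabv3).
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(backbone.output)
    # Upsample the 1/32 resolution features back to the input resolution.
    x = tf.keras.layers.UpSampling2D(32, interpolation="bilinear")(x)
    # Single channel, no activation: one continuous depth value per pixel.
    depth = tf.keras.layers.Conv2D(1, 1, activation=None, name="depth")(x)
    return tf.keras.Model(backbone.input, depth)
```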

Data preparation

The data-loading pipeline can be seen in loader.py. A generator feeds image paths so that TensorFlow can automatically prefetch and preprocess the images, keeping them ready for the GPU when it needs them.

The scaling used is the ImageNet one, as the model's backbone is pretrained on ImageNet.
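As a rough illustration only (not the actual loader.py; DIODE ships depth maps as .npy files, while this sketch assumes PNG depth to stay self-contained):

```python
import tensorflow as tf

def make_dataset(rgb_paths, depth_paths, batch_size=8):
    def load_pair(rgb_path, depth_path):
        rgb = tf.io.decode_png(tf.io.read_file(rgb_path), channels=3)
        rgb = tf.image.resize(tf.cast(rgb, tf.float32), (512, 512))
        # ImageNet scaling to match the pretrained backbone.
        rgb = tf.keras.applications.resnet50.preprocess_input(rgb)
        depth = tf.io.decode_png(tf.io.read_file(depth_path), channels=1, dtype=tf.uint16)
        depth = tf.image.resize(tf.cast(depth, tf.float32), (512, 512))
        return rgb, depth

    ds = tf.data.Dataset.from_tensor_slices((list(rgb_paths), list(depth_paths)))
    ds = ds.map(load_pair, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)  # keep batches ready for the GPU
```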

Method

For this project I will use DIODE, as it's much bigger than NYUv2 and has more variety of scenes.

For training, the whole train set is used, split with 90% of the scenes for training and the other 10% for validation. A set of data augmentation operations is applied during training; they can be seen in data_augmentation.py (a sketch is shown below).
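Illustrative only (the real operations are in data_augmentation.py); the important detail is that geometric transforms have to be applied to the RGB image and the depth map together:

```python
import tensorflow as tf

def augment(rgb, depth):
    # Flip the RGB image and the depth map together so they stay aligned.
    flip = tf.random.uniform(()) > 0.5
    rgb = tf.cond(flip, lambda: tf.image.flip_left_right(rgb), lambda: rgb)
    depth = tf.cond(flip, lambda: tf.image.flip_left_right(depth), lambda: depth)
    # Photometric jitter only touches the RGB input.
    rgb = tf.image.random_brightness(rgb, max_delta=0.1)
    return rgb, depth
```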

The training pipeline can be found in training.py. As in the literature close to this project, Adam with the default learning rate is used. Learning rate decay is also applied during training, reducing the rate when the loss reaches a plateau. The loss used is detailed in the next section.
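Putting the pieces together, the training loop looks roughly like this (a sketch reusing the build_model, make_dataset and augment helpers sketched above and the depth_loss from the next section; the actual hyperparameters live in training.py):

```python
import tensorflow as tf

# Placeholder path lists, coming from the 90/10 split of DIODE scenes.
train_ds = make_dataset(train_rgb_paths, train_depth_paths).map(augment)
val_ds = make_dataset(val_rgb_paths, val_depth_paths)

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(),  # default learning rate, as in the related literature
    loss=depth_loss,                       # combined L1 + edge + SSIM loss, see below
    metrics=[tf.keras.metrics.MeanAbsoluteError(),
             tf.keras.metrics.RootMeanSquaredError()])

callbacks = [
    # Reduce the learning rate when the validation loss plateaus.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    tf.keras.callbacks.ModelCheckpoint("checkpoints/model-{epoch:02d}", save_best_only=True),
]
model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)
```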

Loss function

The loss function used is the one described here, and it's implemented in training.py.

It consists of three parts:

  • L1 loss over the whole image
  • L1 loss over the edges of the image
  • Structural similarity (SSIM) of the image; this loss is known to help the model converge faster in this task.

MAE.png

In the image above you can see three trainings with the same data: with one component of the loss, with two, and with all three.
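A minimal sketch of this three-part loss (the per-term weights are illustrative and the depth is assumed to be normalized to [0, 1]; the real values live in training.py):

```python
import tensorflow as tf

def depth_loss(y_true, y_pred, w_l1=0.1, w_edges=1.0, w_ssim=1.0):
    # 1) L1 loss over the whole depth map.
    l1 = tf.reduce_mean(tf.abs(y_true - y_pred))
    # 2) L1 loss over the depth-map gradients (edges).
    dy_t, dx_t = tf.image.image_gradients(y_true)
    dy_p, dx_p = tf.image.image_gradients(y_pred)
    edges = tf.reduce_mean(tf.abs(dy_t - dy_p) + tf.abs(dx_t - dx_p))
    # 3) Structural similarity, rescaled so a perfect prediction gives 0 loss.
    #    max_val assumes depth normalized to [0, 1].
    ssim = tf.reduce_mean(1.0 - tf.image.ssim(y_true, y_pred, max_val=1.0)) / 2.0
    return w_l1 * l1 + w_edges * edges + w_ssim * ssim
```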

Metrics

The metrics used are the standard ones for the task:

  • Mean absolute error
  • Root mean squared error
  • Threshold accuracy (delta1, delta2, delta3)
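The threshold metric counts the fraction of pixels whose predicted and ground-truth depths agree within a ratio of 1.25 (delta1), 1.25² (delta2) or 1.25³ (delta3). A quick sketch:

```python
import tensorflow as tf

def delta_accuracy(y_true, y_pred, exponent=1):
    # Fraction of pixels where max(y/ŷ, ŷ/y) < 1.25 ** exponent.
    ratio = tf.maximum(y_true / y_pred, y_pred / y_true)
    return tf.reduce_mean(tf.cast(ratio < 1.25 ** exponent, tf.float32))
```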

Results

Over the DIODE validation set (including indoor and outdoor), the model model-04-182311 achieves the following results:

  • mean_absolute_error: 0.0215
  • root_mean_squared_error: 0.0403
  • delta1: 0.2337
  • delta2: 0.4377
  • delta3: 0.5808

The current model achieves real-time processing speed at 512x512 on an RTX 3060 without quantization or layer fusion. On a modern 12-thread CPU it runs at about 2 FPS.

Future work

  • The model needs to be trained for longer, as it never reached a plateau.
  • The depth maps in the DIODE dataset have a lot of null values where there's no depth data, especially in the outdoor set. This needs to be addressed, as the model develops a bias towards predicting mostly black output, which is good enough for the loss in many cases.
  • In works such as NYUv2 they apply image inpainting to fill the gaps in the depth map, but there are definitely more options than this.
  • The model used is based on a backbone that is not the most efficient, ResNet50. A model based on something more modern, for example EfficientNet, would definitely bring better results in less time.
  • Apply quantization to the model to decrease latency.
  • Better pretraining, making use of the extensive synthetic data available in this domain.
