Abstract: Object detection, a vital task in computer vision, identifies and locates multiple objects within an image. This tutorial explores object detection with deep neural networks (DNNs), focusing on Convolutional Neural Networks (CNNs) and two main approaches: two-stage and single-stage detectors. It also covers major deep learning frameworks like TensorFlow, PyTorch, and MMDetection, enabling practitioners to choose the right tools. Additionally, it serves as a practical guide for training object detection models using MMDetection within a Docker container, addressing model deployment. The guide caters to users with basic knowledge of deep learning, computer vision, and Docker, outlining two approaches: using a pre-built MMDetection Docker image for quick setup or building a custom Docker image for flexibility. By following this guide, users can effectively use MMDetection within Docker to train custom object detection models.

Keywords: Object detection, CNNs, Two-stage detectors, Single-stage detectors, Deep learning frameworks, MMDetection, Docker, Training, GPU acceleration, Real-world applications

1. Introduction

Object detection is a fundamental computer vision task that extends beyond simple image classification (see our previous article titled “Image Classification with Deep Neural Networks“). It involves identifying objects within an image and precisely localizing them along with their corresponding categories. This capability underpins a myriad of applications, from autonomous driving to facial recognition systems. Deep neural networks (DNNs), particularly Convolutional Neural Networks (CNNs), have significantly advanced object detection by offering unparalleled accuracy and efficiency. CNNs, a type of DNN tailored for image analysis, excel at feature extraction from images through convolutional layers, enabling them to discern intricate patterns crucial for object detection [1].

While image classification categorizes entire images into single classes (e.g., cat or dog), object detection identifies and locates multiple objects within an image, providing bounding boxes and class labels for each object.

Object detection encompasses two primary tasks:

  • Localization: Identifying the precise location of objects within an image, typically delineated by bounding boxes.
  • Classification: Assigning appropriate labels to the detected objects.
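Concretely, a detector's output for one image is a list of (bounding box, class label, confidence score) triples. A minimal sketch in Python (the objects and coordinates here are purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple      # (x_min, y_min, x_max, y_max) in pixel coordinates
    label: str      # class assigned by the classification task
    score: float    # detector confidence in [0, 1]

# A hypothetical result for one image: two objects, each localized and classified
detections = [
    Detection(box=(34, 50, 180, 220), label="cat", score=0.92),
    Detection(box=(200, 80, 310, 190), label="dog", score=0.87),
]

for d in detections:
    w = d.box[2] - d.box[0]
    h = d.box[3] - d.box[1]
    print(f"{d.label}: {w}x{h} px, score={d.score:.2f}")
```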

2. Different Architectures for Object Detection with CNNs

While CNNs are the workhorse for many computer vision tasks, there are different general architectures used for object detection with CNNs. Here’s a breakdown of two main approaches:

2.1. Object Detection CNN Architectures

2.1.1. Two-Stage Detectors

These detectors work in two steps:

  • Region Proposal: The first stage proposes candidate regions in the image where objects might be present. This is often done using a separate CNN that generates a large number of bounding boxes.
  • Classification and Refinement: The second stage takes these proposed regions and classifies them, identifying the specific object in the box (e.g., car, cat) and refines the bounding box for better accuracy.

Some popular two-stage detectors include:

  • R-CNN (Regions with CNN features): This is the foundational model for many later detectors. It’s slow to train and run but achieves good accuracy [2].
  • Fast R-CNN: Improves on R-CNN by sharing convolutional features across all regions, making it faster to train [2].
  • Faster R-CNN: Introduces a region proposal network (RPN) that integrates with the main network, further improving speed and accuracy [3].
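The two-stage flow above can be sketched as a pair of functions. This is a structural sketch only: the proposal and classification logic are stand-ins for the CNNs (e.g., an RPN and a detection head) that would implement them in a real detector:

```python
def propose_regions(image):
    """Stage 1: return candidate boxes (x_min, y_min, x_max, y_max).
    A real region proposal network outputs thousands of scored anchors;
    here we just return fixed placeholders."""
    return [(10, 10, 100, 100), (50, 60, 200, 220)]

def classify_and_refine(image, box):
    """Stage 2: classify the proposed region and refine its box.
    A real head predicts class scores and box offsets; these values
    are illustrative."""
    label, score = "cat", 0.9
    refined = tuple(v + 2 for v in box)  # stand-in for offset regression
    return label, score, refined

def two_stage_detect(image):
    results = []
    for box in propose_regions(image):
        label, score, refined = classify_and_refine(image, box)
        results.append((refined, label, score))
    return results

print(two_stage_detect(image=None))
```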

2.1.2. Single-Stage Detectors

These detectors perform both object classification and bounding box regression in a single step. They are generally faster than two-stage detectors but might be less accurate for complex tasks. Here’s a deeper dive into single-stage detectors:

  • YOLO (You Only Look Once): This is a popular single-stage detector that divides the image into a grid and predicts bounding boxes and class probabilities for each grid cell. It’s known for its speed while maintaining reasonable accuracy [4]. There are many variations of YOLO, with newer versions achieving higher accuracy while still prioritizing speed.
  • SSD (Single Shot MultiBox Detector): Another fast detector, SSD uses multiple convolutional layers at different scales of the image to predict bounding boxes and class probabilities. This allows for detection of objects at various sizes within the image [5].
  • EfficientDet: This family of detectors focuses on achieving high accuracy for object detection while still being efficient. They use a compound scaling method to improve performance without significantly increasing computational cost [6].
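To make the grid idea concrete: in the original YOLO formulation, an S×S grid where each cell predicts B boxes (4 coordinates plus 1 confidence each) along with C class probabilities yields an output tensor of shape S×S×(B·5+C):

```python
def yolo_output_shape(S, B, C):
    # Each cell predicts: B boxes * (x, y, w, h, confidence) + C class probabilities
    return (S, S, B * 5 + C)

# Numbers from the original YOLO paper: 7x7 grid, 2 boxes per cell, 20 Pascal VOC classes
print(yolo_output_shape(S=7, B=2, C=20))  # (7, 7, 30)
```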

2.2. Choosing the Right Architecture

The best architecture for your task depends on your priorities. Here’s a general guideline:

  • Accuracy: If accuracy is paramount, a two-stage detector like Faster R-CNN might be a good choice.
  • Speed: If speed is a major concern, a single-stage detector like YOLO or SSD might be better suited. There’s a trade-off here, with newer versions of some single-stage detectors achieving higher accuracy but potentially sacrificing some speed compared to simpler models.

3. Main Deep Learning Frameworks for Training Object Detection Models

When it comes to training object detection models, several powerful deep learning frameworks are available. Here’s a breakdown of some of the most prominent options:

  • TensorFlow (by Google): A versatile framework offering a flexible computational graph for customization and experimentation. It provides pre-trained models and tools like TensorFlow Object Detection API (TFOD API) to streamline object detection development.
  • PyTorch (by Facebook AI Research): Known for its ease of use and dynamic computational graphs, PyTorch offers a Pythonic approach. It has libraries like Detectron2 specifically designed for object detection tasks.
  • Keras (high-level API on top of TensorFlow): While not a full framework itself, Keras simplifies model building with a user-friendly interface and pre-built modules that can support object detection tasks.
  • MXNet (by Amazon Web Services): This scalable framework is known for its efficiency and speed across various hardware platforms. It provides tools for distributed training on multiple GPUs.
  • MMDetection (built on PyTorch): This modular open-source toolbox focuses on object detection. It offers a rich set of pre-trained models, detectors, and tools for diverse needs.
  • PaddlePaddle (by Baidu): An industry-scale deep learning platform with a focus on efficiency and ease of use. PaddleDetection, its end-to-end development toolkit, specifically caters to object detection. It provides pre-trained models and various model algorithms.

Choosing the right framework depends on your specific needs and preferences. Here’s a quick guideline:

  • Ease of Use: PyTorch or Keras might be easier for beginners.
  • Flexibility and Customization: TensorFlow offers more control for complex models.
  • Scalability and Speed: MXNet can be efficient for resource-intensive tasks.
  • Object Detection Focus: MMDetection and PaddleDetection provide comprehensive toolsets specifically for object detection.

Remember, these are just some of the leading frameworks. The best choice will depend on your project requirements, expertise, and desired balance between features, ease of use, and efficiency.

4. Object Detection Pipeline: From Data Preparation to Inference

4.1. Data Preparation

The effectiveness of any deep learning model relies heavily on the quality and quantity of available data. In the case of object detection, datasets need to consist of annotated images where each object of interest is delineated by a bounding box and associated class label. The major object detection dataset formats include:

  • Pascal VOC (Visual Object Classes): Pascal VOC is a widely used dataset format for object detection tasks. It consists of annotated images with bounding boxes around objects of interest, along with class labels. Pascal VOC datasets typically include a predefined set of object classes.

  • COCO (Common Objects in Context): COCO is another popular dataset format for object detection, segmentation, and captioning tasks. COCO datasets contain images with object annotations, segmentation masks, and captions. The annotations in COCO datasets are more detailed and extensive compared to Pascal VOC, covering a broader range of object categories and providing segmentation masks for each object instance.

  • YOLO (You Only Look Once): YOLO is a dataset format specifically designed for use with the YOLO object detection algorithm. YOLO datasets consist of text files containing bounding box coordinates and class labels for objects in each image.

  • KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute): KITTI is a dataset format commonly used for object detection and tracking tasks in the context of autonomous driving. KITTI datasets contain images captured from onboard cameras in vehicles, along with annotations for various objects such as cars, pedestrians, and cyclists.
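These formats also encode bounding boxes differently: Pascal VOC uses absolute corner coordinates (x_min, y_min, x_max, y_max), COCO uses (x, y, width, height), and YOLO uses center coordinates normalized by the image size. Converting between them is a routine preprocessing step; a minimal sketch:

```python
def voc_to_coco(box):
    """(x_min, y_min, x_max, y_max) -> COCO-style (x, y, w, h)."""
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)

def voc_to_yolo(box, img_w, img_h):
    """(x_min, y_min, x_max, y_max) -> YOLO-style normalized (cx, cy, w, h)."""
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    cx, cy = x_min + w / 2, y_min + h / 2
    return (cx / img_w, cy / img_h, w / img_w, h / img_h)

box = (50, 100, 150, 300)            # Pascal VOC style corners
print(voc_to_coco(box))              # (50, 100, 100, 200)
print(voc_to_yolo(box, 640, 480))    # center and size as fractions of the image
```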

4.2. Model Training

Upon selecting a dataset and determining the model architecture, the training process begins, which involves:

  • Data Preprocessing: Tasks such as resizing images, standardizing pixel values, and formatting annotations are carried out.
  • Model Initialization: The DNN is initialized either with pre-trained weights (utilizing transfer learning) or trained from scratch if no pre-trained weights are available.
  • Loss Function: A loss function is defined to penalize errors in both object localization and classification.
  • Optimization: An optimization algorithm (e.g., SGD, Adam) is employed to minimize the loss function and update the model weights.
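The loss in the steps above typically sums a classification term and a localization term. A schematic version follows; real detectors use cross-entropy or focal loss for classification and smooth-L1 or IoU-based losses for boxes, with architecture-specific weighting, so treat this only as an illustration of the structure:

```python
import math

def classification_loss(predicted_probs, true_class):
    # Negative log-likelihood of the correct class
    return -math.log(predicted_probs[true_class])

def localization_loss(pred_box, true_box):
    # Simple L1 distance between corner coordinates
    return sum(abs(p - t) for p, t in zip(pred_box, true_box))

def detection_loss(predicted_probs, true_class, pred_box, true_box, box_weight=1.0):
    return (classification_loss(predicted_probs, true_class)
            + box_weight * localization_loss(pred_box, true_box))

loss = detection_loss(
    predicted_probs=[0.1, 0.8, 0.1], true_class=1,
    pred_box=(10, 10, 50, 50), true_box=(12, 11, 49, 52),
)
print(round(loss, 3))  # -log(0.8) + 6 ≈ 6.223
```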

4.3. Evaluation

Following training, evaluating the model’s performance is crucial. The standard metric is mean Average Precision (mAP), which is computed at one or more Intersection over Union (IoU) thresholds between predicted and ground-truth boxes.
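IoU itself is the area of overlap between a predicted box and a ground-truth box divided by the area of their union. A straightforward implementation:

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes in (x_min, y_min, x_max, y_max) format."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```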

4.4. Inference

After training and evaluation, the model is ready for inference on new, unseen images or videos. This process entails feeding input data through the network and analyzing the output to identify and localize objects within the scene.
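Inference usually ends with non-maximum suppression (NMS), which discards lower-scoring boxes that overlap an already-kept box too strongly. A minimal greedy NMS sketch (with a compact inline IoU helper):

```python
def iou(a, b):
    # Boxes are (x_min, y_min, x_max, y_max)
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(detections, iou_threshold=0.5):
    """detections: list of (box, score); returns the kept subset."""
    kept = []
    # Visit boxes from highest to lowest score; keep a box only if it
    # does not overlap any already-kept box above the threshold
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k_box) < iou_threshold for k_box, _ in kept):
            kept.append((box, score))
    return kept

dets = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)]
print(nms(dets))  # the 0.8 box overlaps the 0.9 box and is suppressed
```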

5. A Practical Guide: Object Detection using MMDetection with Docker

In this part, we will delve into performing object detection using MMDetection, a popular open-source object detection toolbox built on top of PyTorch. We will demonstrate how to set up a development environment using Docker, configure a model, handle data loading, set up evaluation, and optimize the training process using MMDetection.

5.1. MMDetection

MMDetection stands as a cutting-edge open-source object detection toolbox, rooted in the PyTorch framework and nurtured within the OpenMMLab ecosystem. It offers a comprehensive suite of tools, meticulously crafted to tackle the complexities of object detection tasks with unparalleled efficiency and accuracy. Drawing from a rich pool of pre-trained models and leveraging state-of-the-art methodologies, MMDetection empowers researchers, developers, and practitioners to push the boundaries of object detection in diverse domains.

With its robust architecture and modular design, MMDetection provides a flexible and extensible platform for experimenting with various detection architectures, training strategies, and optimization techniques. From two-stage detectors to single-stage models, users have access to a plethora of cutting-edge algorithms, ensuring versatility in addressing a wide range of object detection challenges. Additionally, MMDetection’s intuitive interface and extensive documentation streamline the process of model deployment and experimentation, making it accessible to both beginners and experts alike.

The key features of MMDetection include:

  • Versatile Architecture: MMDetection offers a versatile architecture that supports a wide range of object detection algorithms, including two-stage detectors, single-stage detectors, and anchor-free detectors.

  • Pre-trained Models: MMDetection provides access to a variety of pre-trained models, allowing users to leverage state-of-the-art architectures for their specific tasks. These models are trained on large-scale datasets and serve as strong starting points for fine-tuning or transfer learning.

  • Extensive Dataset Support: MMDetection supports various standard object detection datasets, such as COCO, Pascal VOC, and Cityscapes, making it suitable for a wide range of applications and research scenarios.

  • Modular Design: MMDetection is built with a modular design, allowing users to easily customize and extend different components of the detection pipeline, such as backbones, necks, and heads.

  • Flexible Training Options: MMDetection provides flexible training options, including support for multi-GPU training, distributed training, and mixed precision training, enabling users to efficiently train large-scale models on diverse hardware setups.

  • Comprehensive Evaluation Tools: MMDetection offers comprehensive evaluation tools for assessing the performance of trained models on validation and test datasets. Users can easily evaluate metrics such as mAP and IoU to measure detection accuracy.

  • Active Community Support: MMDetection benefits from an active community of developers, researchers, and users who contribute to its ongoing development, provide support, and share insights and best practices. This vibrant community fosters collaboration and knowledge exchange within the object detection community.

To utilize MMDetection for object detection tasks, follow these steps:

  1. Set up an environment and install the necessary packages to ensure compatibility with MMDetection.

  2. Download the dataset containing the images for object detection. Ensure that the dataset is properly formatted and organized according to the requirements of MMDetection.

  3. Import the predefined configuration provided by MMDetection. These configurations serve as templates for setting up the object detection model architecture, training parameters, and other essential components.

  4. Customize the imported configuration to specify the classes of objects to detect and adjust other parameters as needed to tailor the model to your specific requirements.

  5. Configure the model architecture, data loaders, validation evaluator, and optimization wrapper according to your customized configuration. This step ensures that the model is properly configured for training and evaluation.

  6. Train or fine-tune the model on the dataset using the configured settings. During training, the model learns to detect objects of interest in the provided images.

  7. Evaluate the trained model’s performance on a separate validation set to assess its accuracy and generalization capabilities. This evaluation step helps ensure that the model provides reliable object detection results on unseen data.

5.2. Setting Up the Docker Environment:

Prerequisites:

  • Docker installed on your system (see the official Docker installation guide).
  • Basic understanding of deep learning, computer vision concepts, and Docker.
  • Familiarity with Python programming.

Source code: 

  • The original code can be downloaded from the project’s GitHub repository.

There are two main approaches to using MMDetection in Docker:

  • Option 1: Use a pre-built MMDetection Docker image.
  • Option 2: Build your own Docker image with MMDetection and your project code.

5.2.1. Option 1: Using a Pre-built Image
  • Pull the official MMDetection Docker image:
Command
$ docker pull homai/openmmlab:pytorch2.1.1-cuda12.1-cudnn8-mmdetection3.3.0
  • Re-tag the image with a shorter name:
Command
$ docker image tag homai/openmmlab:pytorch2.1.1-cuda12.1-cudnn8-mmdetection3.3.0 mmdetection

This tags the pulled image as mmdetection for convenience. You can now start a container from it (for example, with docker run -it mmdetection bash) and proceed with training your model inside the container.

5.2.2. Option 2: Building Your Own Image
  • Create a Dockerfile specifying the environment setup:
Dockerfile

ARG PYTORCH="2.1.1"
ARG CUDA="12.1"
ARG CUDNN="8"

FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel

ARG DEBIAN_FRONTEND=noninteractive
ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6+PTX" \
    TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    FORCE_CUDA="1"

# Install the required packages
RUN apt-get update \
    && apt-get install -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libxrender-dev curl \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install MMEngine, MMCV and MMDetection
RUN pip install openmim && \
    mim install mmengine mmcv mmdet

# Install JupyterLab (to run experiments inside the container) and seaborn (to plot logs)
RUN pip install jupyterlab ipykernel seaborn
RUN git clone https://github.com/open-mmlab/mmdetection.git

  • Build the Docker image:
Command
$ DOCKER_BUILDKIT=1 docker build -f docker/mmdetection.Dockerfile -t mmdetection .

5.3. Dataset Preparation:

5.3.1. Downloading Dataset

The Kvasir-Instrument dataset (about 170 MB) contains 590 endoscopic tool images with their ground-truth bounding boxes and masks. It can be downloaded from Simula. Download the archive and extract it.

5.3.2. Convert to COCO format

Because the dataset is not in a format MMDetection can consume directly, it must be converted to COCO format (see the Data Preparation Notebook inside the project’s GitHub repo).
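A COCO annotation file is a single JSON document with images, annotations, and categories arrays. The helper below shows the minimal fields a COCO-style loader expects; the file name and box values are illustrative, and the single "instrument" category mirrors this tutorial's tool-detection task:

```python
import json

def build_coco(image_entries, box_entries, categories):
    """image_entries: [(id, file_name, width, height)]
    box_entries: [(image_id, category_id, (x, y, w, h))]
    categories: [(id, name)]"""
    return {
        "images": [
            {"id": i, "file_name": f, "width": w, "height": h}
            for i, f, w, h in image_entries
        ],
        "annotations": [
            {
                "id": ann_id,
                "image_id": img_id,
                "category_id": cat_id,
                "bbox": list(box),       # COCO boxes are [x, y, width, height]
                "area": box[2] * box[3],
                "iscrowd": 0,
            }
            for ann_id, (img_id, cat_id, box) in enumerate(box_entries, start=1)
        ],
        "categories": [{"id": i, "name": n} for i, n in categories],
    }

coco = build_coco(
    image_entries=[(1, "frame_0001.jpg", 720, 576)],   # illustrative file name
    box_entries=[(1, 1, (103, 88, 310, 240))],
    categories=[(1, "instrument")],
)
print(json.dumps(coco, indent=2)[:120], "...")
```

Writing the resulting dictionary with json.dump produces a train_coco.json / test_coco.json like the ones referenced in the configuration below.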

5.4. Configuring Model

MMDetection provides a range of pre-defined configurations for various object detection models. For this tutorial, we’ll use the RTMDet model. You can find its configuration file here. We’ll import it as our base configuration and make modifications to this configuration to suit our specific task and dataset.

Here are main parts of modification, but you can see the complete config file on the project GitHub repo here.

Import Base Configuration:

  • Import a base configuration file that serves as the foundation for your training setup. This base configuration typically includes default settings for model architecture, optimizer, and other parameters.

_base_ = [
    'mmdet::rtmdet/rtmdet_s_8xb32-300e_coco.py'
]

Modify Path to Dataset:

  • Adjust the paths within the configuration to point to your dataset directory. This ensures that the training pipeline accesses the correct data during the training process.

# dataset settings
dataset_type = 'CocoDataset'
data_root = "/data/" 
train_annot = "train_coco.json"
val_annot = "test_coco.json"
test_annot = "test_coco.json"
train_image_folder = "images/"
val_image_folder = "images/"
test_image_folder = "images/"

Define Training Hyperparameters:

  • Specify the hyperparameters for training, such as batch size, number of epochs, learning rate, weight decay, etc. These parameters significantly impact the training process and model performance.

# Training Parameter Settings
base_lr = 0.004
max_epochs = 100
warmup_iters = 200
check_point_interval = 10
val_interval = 1
stage2_num_epochs = 40

Update Model for Dataset Compatibility:

  • Modify the model configuration to accommodate the specific requirements of your dataset. This might involve adjusting the input/output dimensions, changing the number of output classes, or fine-tuning certain layers to better suit your data (here we only need to update the number of objects).

model = dict(
    bbox_head=dict(
        num_classes=num_classes  # num_classes is defined earlier in the full config
    )
)

Update Data Loaders:

  • Customize the data loaders to preprocess and load your dataset efficiently. This step involves setting up data augmentation techniques, data normalization, and any other preprocessing steps necessary for training.

train_dataloader = dict(
    batch_size=8,
    num_workers=3,
    dataset=dict(
        type=dataset_type,
        metainfo=metainfo,
        data_root=data_root,
        ann_file=train_annot,
        data_prefix=dict(img=train_image_folder)
        )
    )

Update Optimizer and Learning Rate Scheduler:

  • Configure the optimizer (e.g., SGD, Adam) and learning rate scheduler (e.g., step-based, cosine annealing) based on your training objectives and model architecture. Tuning these components can significantly impact convergence speed and final model performance.

# optimizer
optim_wrapper = dict(
    _delete_=True,
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.05),
    paramwise_cfg=dict(
        norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))

# learning rate
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1.0e-5,
        by_epoch=False,
        begin=0,
        end=warmup_iters),
    dict(
        # use cosine lr from 150 to 300 epoch
        type='CosineAnnealingLR',
        eta_min=base_lr * 0.05,
        begin=max_epochs // 2,
        end=max_epochs,
        T_max=max_epochs // 2,
        by_epoch=True,
        convert_to_iter_based=True),
]
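The two schedulers combine into a warmup-then-cosine curve. The resulting learning rate can be sanity-checked with a few lines of plain Python; this mirrors the config (linear warmup over warmup_iters iterations, then cosine annealing over the second half of training) but is simplified to epoch granularity for the cosine part:

```python
import math

base_lr = 0.004
max_epochs = 100
warmup_iters = 200
eta_min = base_lr * 0.05

def lr_at(epoch, iteration):
    # Linear warmup over the first `warmup_iters` iterations
    if iteration < warmup_iters:
        start = 1.0e-5
        frac = iteration / warmup_iters
        return base_lr * (start + (1 - start) * frac)
    # Constant base LR until the cosine phase begins at max_epochs // 2
    if epoch < max_epochs // 2:
        return base_lr
    # Cosine annealing from base_lr down to eta_min over the second half
    t = (epoch - max_epochs // 2) / (max_epochs // 2)
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t)) / 2

print(lr_at(epoch=0, iteration=0))        # ~0, start of warmup
print(lr_at(epoch=10, iteration=1000))    # 0.004, full base LR
print(lr_at(epoch=100, iteration=10**6))  # annealed down to eta_min at the end
```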

5.5. Configure Main Python Script

Adjust the main Python script that orchestrates the training process to incorporate the updated configurations. This script typically loads the configuration file, initializes the runner, and then starts the training loop.


from mmengine.config import Config
from mmengine.runner import Runner
import argparse

def main(args):
    config = Config.fromfile(args.config_path)
    config.launcher = "pytorch"
    runner = Runner.from_cfg(config)
    runner.train()
    
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Get Config Path.')
    parser.add_argument('config_path', type=str, help='path to the config file')
    args = parser.parse_args()
    main(args)

5.6. Start Training on Docker from BASH Terminal

Execute the training process within a Docker container from the BASH terminal. This involves running the main training script with the updated configurations inside the Docker environment, ensuring reproducibility and isolation of the training process.

To train on Docker from the BASH terminal, it’s essential to define local paths to the configuration folder, data folder, and output folder. Assuming the BASH command is executed from the project’s root folder and the dataset resides at ‘/mnt/SSD2/kvasir-instrument/’ on the local PC, the following variables are defined within the BASH script to facilitate this setup.

Bash Script

#!/bin/bash

DATA_DIR="/mnt/SSD2/kvasir-instrument/"
OUT_DIR="$PWD/out"
CONFIG_DIR="$PWD/codes/"
  

As I’m about to initiate training inside Docker with GPU acceleration, and given that my PC is equipped with 3 GPUs, I’ll start the training process by adding the following command to the bash script:

Docker Command

GPUS=3

docker run -it --rm \
    --gpus all \
    --mount type=bind,source=$CONFIG_DIR,target=/configs \
    --mount type=bind,source=$DATA_DIR,target=/data \
    --mount type=bind,source=$OUT_DIR,target=/out \
    --shm-size 8g \
    homai/openmmlab:pytorch2.1.1-cuda12.1-cudnn8-mmdetection3.3.0 \
    torchrun --nnodes 1 --nproc_per_node=$GPUS  /configs/main_train_mmengine.py /configs/rtmdet_s_8xb32-300e_coco.py
  
5.6.1. Additional Considerations:
  • Mounting Data Volumes: Use the --mount flag with docker run to mount your local dataset directory into the container.
  • Mounting Config Volumes: Similarly, use the --mount flag to mount your local configuration directory and any other necessary volumes into the container.
  • Saving Training Outputs: Mount a volume to persist the trained model weights and logs outside the container for easy access.
  • GPU Support: If you have a GPU available, make sure the NVIDIA Container Toolkit is installed so that the --gpus flag can enable GPU acceleration within the container. Refer to the official NVIDIA documentation for details.

5.6.2. Tips
  • --gpus all: allows the container to utilize all GPUs installed on the PC.
  • --mount type=bind,source=/FULL/PATH/TO/LOCAL/FOLDER,target=/MOUNTED/PATH/ON/CONTAINER: gives the container access to the folder and its files. For example, “--mount type=bind,source=/mnt/SSD2/kvasir-instrument/,target=/data” mounts the local path “/mnt/SSD2/kvasir-instrument/” to the “/data” folder in the running container, making all content of the local folder accessible inside the container at “/data”.

5.7. Log Analysis

You can analyze the training performance using the ‘analyze_logs.py’ script located in the ‘tools/analysis_tools’ directory. This script generates plots for loss, mAP, learning rate, etc., based on a provided training log file. Remember to specify the path to the training log file as an argument when using the script.

Ensure that the paths are accurate according to your container configuration and log folder path.

For example, if the log folder on your PC is within the ‘20240309_054830’ directory and you’re executing commands within the mmdetection container, you can utilize the following commands:

Commands
$ python /workspace/mmdetection/tools/analysis_tools/analyze_logs.py plot_curve /out/20240309_054830/vis_data/scalars.json --keys bbox_mAP bbox_mAP_50 bbox_mAP_75 --out ./bbox_mAP.jpg --legend mAP mAP_50 mAP_75
Commands
$ python /workspace/mmdetection/tools/analysis_tools/analyze_logs.py plot_curve /out/20240309_054830/vis_data/scalars.json --keys lr --out ./lr.jpg --legend lr

6. Model Deployment

Model deployment is a crucial phase where the trained model transitions from development to practical application. Our aim is to make the object detection model accessible and usable for developers and applications. To achieve this, we streamlined the deployment process by converting the PyTorch model to ONNX format, ensuring interoperability and compatibility. We further optimized the model for deployment by converting it to OpenVINO format, enabling efficient inference performance across diverse hardware platforms.

Next, we uploaded both the original PyTorch model and its OpenVINO version to the Hugging Face Model Hub. This step not only provides easy access to our models but also allows developers to test and fine-tune them on custom datasets. The Hugging Face Model Hub serves as a centralized repository for machine learning models, facilitating seamless integration into applications while supporting testing and fine-tuning processes.

Deploying the model on Hugging Face enables fine-tuning on custom datasets, enhancing its applicability to specific tasks and domains. Leveraging the Hugging Face ecosystem provides access to a vibrant community of developers and researchers, fostering collaboration and continuous improvement of the deployed models.

Considering factors such as scalability, latency, and resource utilization is essential when deploying the model in production environments. Employing optimization techniques like model quantization and pruning can enhance performance and resource efficiency, while monitoring and logging mechanisms ensure model reliability and performance tracking over time.

7. Conclusion

Object detection with deep neural networks represents a powerful field with diverse applications. This tutorial offers a foundational understanding to kickstart your exploration. Given the rapid advancements in this field, staying updated on the latest developments and leveraging the plethora of online resources is essential for continued growth.

In this tutorial, we’ve also covered the basics of performing object detection using MMDetection within a Docker environment. We started by setting up a Docker image with all the necessary dependencies, configuring a model, handling data loading, setting up evaluation, and optimizing the training process. By following these steps and adapting them to your specific task and dataset, you can leverage MMDetection to build powerful object detection systems within a Dockerized environment.

8. References

Disclaimer

This tutorial is intended for educational purposes only. It does not constitute professional advice, including medical or legal advice. Any application of the techniques discussed in real-world scenarios should be done cautiously and with consultation from relevant experts.
