Abstract: Object detection is a crucial task in computer vision, allowing for the identification and localization of objects within images and videos. The YOLO (You Only Look Once) series has become a benchmark for combining speed and accuracy in this domain. YOLOv8, a cutting-edge object detection model, advances these capabilities further, making it highly suitable for real-time applications through improvements in model architecture and optimization techniques.
This tutorial introduces the YOLOv8 model using the MMYOLO framework, part of the OpenMMLab project. MMYOLO offers a flexible, modular, and user-friendly environment for training, fine-tuning, and deploying YOLO models. Additionally, the guide shows how to leverage Docker to create a consistent, reproducible development environment, simplifying setup and deployment across different systems. By integrating YOLOv8 with MMYOLO and Docker, users can efficiently build and deploy state-of-the-art object detection systems with minimal setup complexity.
Keywords: YOLOv8, Object Detection, MMYOLO, Docker, Computer Vision, Deep Learning
1. Introduction
Object detection is a critical task in computer vision, enabling machines to identify and locate objects within images and videos (see our previous article titled “Object Detection with Deep Neural Networks”). Among the many models developed for this purpose, the YOLO (You Only Look Once) series has consistently stood out for its remarkable balance between speed and accuracy. YOLOv8, a cutting-edge object detection model, represents a significant advancement over its predecessors in several key areas, meeting the demanding requirements of modern applications, from real-time surveillance to autonomous driving.
2. MMYOLO
The MMYOLO package is a specialized extension within the broader MMDetection ecosystem, crafted specifically to support the YOLO series of models [1]. While MMDetection serves as a versatile toolbox for a wide array of object detection algorithms [2], MMYOLO narrows its focus to YOLO models, offering a more streamlined and efficient environment for their development, training, and deployment. YOLO models are known for their speed and efficiency [3], making them popular in real-time object detection applications, and MMYOLO is designed to maximize these strengths.
The MMYOLO package bundles all the necessary tools and resources for working with YOLO models: pre-configured model architectures, data processing pipelines, custom training scripts, and performance evaluation tools. Additionally, MMYOLO incorporates unique optimizations tailored to the YOLO architecture, such as specialized training routines and deployment strategies that enhance the speed and accuracy of the models. The package is regularly updated to include the latest developments in YOLO research, ensuring that users have access to state-of-the-art tools and techniques.
The MMYOLO package supports a range of YOLO models, including YOLOv3 [4], YOLOv4 [5], YOLOv5, YOLOv6 [6], YOLOv7 [7], and YOLOv8 [8]. Each of these models offers distinct advantages for various object detection tasks. YOLOv3 and YOLOv4 are well-regarded for their balance of speed and accuracy, making them suitable for real-time applications. YOLOv5 introduces improvements in performance and flexibility, including better support for diverse deployment scenarios. YOLOv6 and YOLOv7 further advance the state of the art with enhanced accuracy and efficiency, addressing modern object detection challenges. YOLOv8 continues to push the boundaries with even greater performance enhancements and expanded capabilities for complex tasks.
In addition to the YOLO family, the MMYOLO package also integrates support for other leading models such as PP-YOLOE [9], YOLOX [10], and RTMDet [11]. PP-YOLOE, developed by PaddlePaddle, emphasizes high efficiency and accuracy in industrial applications, offering an alternative to traditional YOLO models [9]. YOLOX introduces an anchor-free approach that simplifies the model while maintaining high detection accuracy, making it a strong contender in the object detection landscape. RTMDet, designed for real-time applications, optimizes speed and precision, making it particularly suitable for environments where performance is critical.
By supporting these various YOLO versions and additional models like PP-YOLOE, YOLOX, and RTMDet, MMYOLO provides users with a robust toolkit for leveraging the strengths of each model according to their specific needs.
The decision to separate MMYOLO from the main MMDetection framework was made to provide a more focused and optimized experience for users working with YOLO models. YOLO has distinct requirements and optimizations that differ from other object detection architectures [12], and by creating a standalone package, the developers have ensured that all tools, updates, and documentation are specifically aligned with YOLO’s needs. This separation simplifies the development process for those focused on YOLO, removing the need to navigate through the broader toolsets of MMDetection and allowing for a more concentrated effort in advancing YOLO technologies.
A notable difference between MMYOLO and MMDetection lies in their licensing. MMDetection is released under the Apache 2.0 license, a permissive open-source license that allows users considerable freedom in using, modifying, and distributing the software [13]. In contrast, MMYOLO is released under the GNU General Public License (GPL) v3.0. The GPL v3.0 is a copyleft license, meaning that any derivative works or redistributions of the software must also be licensed under the GPL [14]. This ensures that any modifications or improvements to MMYOLO remain open and accessible to the community, fostering a collaborative environment where innovations are shared and continuously built upon.
3. YOLOv8
YOLOv8 represents a significant leap forward in the YOLO series, offering several key improvements over its predecessors:
Enhanced Architecture
- New convolutional layers: YOLOv8 introduces innovative convolutional layers that contribute to improved accuracy and efficiency.
- Anchor-free detection: This approach simplifies the detection process and often leads to better performance.
Improved Performance
- Higher accuracy: YOLOv8 consistently achieves higher accuracy metrics compared to previous versions on various benchmarks.
- Faster inference: Despite the increased complexity, YOLOv8 maintains a good balance between accuracy and speed.
Broader Capabilities
- Versatility: YOLOv8 excels not only in object detection but also in image segmentation and keypoint detection tasks.
- Flexibility: The model can be adapted to different problem domains and datasets more easily.
Refined Training Process
- Optimized training: YOLOv8 benefits from refined training techniques and hyperparameters, leading to faster convergence and better results.
Streamlined User Experience
- User-friendly interface: The framework provides a more intuitive interface for training and deployment.
- Expanded functionalities: Additional features and tools enhance the overall user experience.
In essence, YOLOv8 builds upon the strengths of its predecessors while addressing their limitations, resulting in a more robust, versatile, and efficient object detection model.
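The anchor-free detection mentioned above can be sketched in a few lines: instead of regressing offsets against pre-defined anchor boxes, a YOLOv8-style head predicts, for each grid point, the distances to the four edges of the box. The snippet below is a simplified, hypothetical illustration of this decoding step, not MMYOLO's actual implementation:

```python
import numpy as np

def decode_anchor_free(points, distances):
    """Decode (left, top, right, bottom) distances predicted at grid
    points into (x1, y1, x2, y2) boxes -- the anchor-free scheme used
    by YOLOv8-style heads (simplified illustration).

    points:    (N, 2) array of grid-point centers (cx, cy)
    distances: (N, 4) array of predicted distances (l, t, r, b)
    """
    cx, cy = points[:, 0], points[:, 1]
    l, t, r, b = distances.T
    return np.stack([cx - l, cy - t, cx + r, cy + b], axis=1)

# A point at (10, 10) with distances (3, 2, 5, 4) decodes to the box (7, 8, 15, 14).
points = np.array([[10.0, 10.0]])
dists = np.array([[3.0, 2.0, 5.0, 4.0]])
print(decode_anchor_free(points, dists))
```

Because there are no anchor shapes to tune, this formulation removes a whole family of hyperparameters and simplifies post-processing.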
4. Why Train YOLOv8 with MMYOLO on Docker?
Docker provides a robust and isolated environment for training machine learning models, offering several key advantages:
Reproducibility
- Consistent environment: Docker ensures that the training environment is identical across different machines and platforms, making it easier to reproduce results.
- Dependency management: Docker handles dependencies effectively, preventing conflicts and ensuring that the correct versions of libraries and frameworks are used.
Isolation
- Environment separation: Docker containers isolate the training environment from the host system, protecting it from potential conflicts and ensuring stability.
- Security: By running the training process in a container, you can mitigate security risks as it limits the potential attack surface.
Efficiency
- Resource utilization: Docker allows for efficient resource allocation, enabling you to optimize GPU usage and other hardware resources.
- Rapid deployment: Docker containers can be quickly deployed on different platforms, facilitating experimentation and scaling.
Collaboration
- Shared environments: Docker containers can be easily shared among team members, fostering collaboration and knowledge sharing.
- Version control: You can use Docker to version control your training environments, making it easier to track changes and revert to previous states.
Portability
- Cross-platform compatibility: Docker containers can run on various operating systems, making it easy to transfer your training environment to different platforms.
- Cloud integration: Docker containers integrate seamlessly with cloud platforms, enabling you to leverage cloud resources for training.
By leveraging Docker, you can streamline your YOLOv8 training process with MMYOLO, enhance reproducibility, improve efficiency, and facilitate collaboration.
5. MMYOLO and Ultralytics
MMYOLO and Ultralytics share several key similarities. Both are open-source frameworks primarily focused on YOLO-based object detection, utilizing PyTorch as their underlying deep learning library. They offer pre-trained models as starting points, support various YOLO architectures, and provide tools for training, evaluation, and inference. Additionally, both frameworks emphasize efficiency and accuracy in their implementations.
While Ultralytics offers a streamlined approach to object detection, MMYOLO provides several advantages for those seeking a more flexible and customizable solution:
Flexibility and Customization
- Modular design: MMYOLO’s modular architecture allows for easy customization of backbones, necks, and heads, enabling tailored solutions for specific problem domains.
- Extensive configuration options: MMYOLO provides a wide range of hyperparameters and configuration options, allowing for fine-tuning to optimize performance.
License Compatibility
- GPL-3.0 License: MMYOLO is released under the GPL-3.0 license (here), which provides more flexibility for commercial use compared to the AGPL-3.0 license (here) used by Ultralytics. While both licenses require sharing modifications, the GPL-3.0 license is generally considered less restrictive, particularly for software linking to the licensed code.
Research-Oriented Features
- Advanced techniques: MMYOLO incorporates state-of-the-art techniques and research findings, making it suitable for exploring new ideas and pushing the boundaries of object detection.
- Integration with other tools: As part of the OpenMMLab ecosystem, MMYOLO seamlessly integrates with other computer vision tools and frameworks, facilitating complex projects.
Community and Ecosystem
- Active development: MMYOLO benefits from the support of a large and active open-source community, ensuring continuous improvement and updates.
The choice between MMYOLO and Ultralytics often depends on the specific requirements of your project. While Ultralytics offers a user-friendly experience for rapid prototyping, MMYOLO excels in providing the flexibility, customization, research-oriented features, and a more permissive license required for many advanced object detection projects.
By carefully considering these factors, you can make an informed decision about which framework best suits your needs.
6. Hands-on Step-by-Step Guide to Train a YOLOv8 Model with MMYOLO Using Docker
6.1. Before You Start
Prerequisites:
- Docker installed on your system (see here).
- Basic understanding of deep learning, computer vision concepts, and Docker.
- Familiarity with Python programming.
Source code:
- The original code can be downloaded from GitHub.
Trained Model:
- The trained model is uploaded on Hugging Face and is available to test and/or download (here).
6.2. Setting Up the Docker Environment and Building Your Own Image
- Create a Dockerfile specifying the environment setup:
ARG PYTORCH="2.0.1"
ARG CUDA="11.7"
ARG CUDNN="8"
FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel
ARG DEBIAN_FRONTEND=noninteractive
ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6+PTX" \
TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
FORCE_CUDA="1"
# Install the required packages
RUN apt-get update \
&& apt-get install -y ffmpeg git ninja-build curl libglib2.0-0 libsm6 libxext6 libxrender-dev \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Install MMEngine, MMCV, MMDetection, and MMYOLO
RUN pip install openmim && \
mim install "mmengine>=0.6.0" "mmcv>=2.0.0rc4,<2.1.0" "mmdet>=3.0.0,<4.0.0"
# Optionally install JupyterLab (to run experiments inside the container) and seaborn (to plot logs)
RUN mim install jupyterlab ipykernel seaborn
RUN git clone https://github.com/open-mmlab/mmyolo.git &&\
cd mmyolo && pip install albumentations==1.3.1 &&\
mim install -v -e .
RUN pip install sahi
RUN git clone https://github.com/open-mmlab/mmdetection.git
- Build the Docker image:
$ DOCKER_BUILDKIT=1 docker build -f docker/mmyolo.Dockerfile -t mmyolo .
6.3. Dataset Preparation
- Download the “COCO-Stuff 10K dataset v1.1” from the cocostuff10k GitHub repo.
- Unzip the dataset and place it in a suitable directory.
- Convert the dataset to COCO format by following the steps described in the Data_Preparation.ipynb Jupyter notebook.
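At the core of such a conversion is deriving COCO-style bounding boxes from the dataset's per-pixel label maps. The helper below is a minimal sketch of that idea, assuming a NumPy label mask per image; the actual notebook handles the full COCO-Stuff label format and assembles the final COCO JSON:

```python
import numpy as np

def mask_to_coco_bbox(mask, class_id):
    """Compute a COCO-style [x, y, width, height] bounding box covering
    all pixels of `class_id` in a per-pixel label mask (illustrative sketch)."""
    ys, xs = np.where(mask == class_id)
    if ys.size == 0:
        return None  # class not present in this image
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()
    # COCO uses [top-left x, top-left y, width, height]
    return [int(x_min), int(y_min), int(x_max - x_min + 1), int(y_max - y_min + 1)]

# Tiny example: a 5x5 mask with class 7 occupying rows 1-2, columns 2-4
mask = np.zeros((5, 5), dtype=int)
mask[1:3, 2:5] = 7
print(mask_to_coco_bbox(mask, 7))  # [2, 1, 3, 2]
```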
6.4. Configuring the YOLOv8 Model
MMYOLO provides pre-trained YOLOv8 models. You can use these as a starting point or create your own configuration.
Choose a pre-trained model:
- MMYOLO offers various YOLOv8 models (e.g., YOLOv8s, YOLOv8m, YOLOv8l, YOLOv8x).
- Select the appropriate model based on your computational resources and desired accuracy.
# Import a base configuration file that serves as the foundation for your training setup.
# This base configuration typically includes default settings for model architecture, optimizer, and other parameters.
_base_ = [
    'mmyolo::yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py'
]
6.5. Customize the configuration file
- Modify the configuration file (.py file) to match your dataset and training parameters.
- Adjust hyperparameters such as learning rate, batch size, epochs, etc.
- Customize the data loading pipeline, model architecture, and loss function if needed.
6.5.1. Dataset Settings
# Adjust the paths within the configuration to point to your dataset directory.
# This ensures that the training pipeline accesses the correct data during the training process.
data_root = "/data/"
train_annot = "train_coco.json"
val_annot = "test_coco.json"
test_annot = "test_coco.json"
train_image_folder = "images/"
val_image_folder = "images/"
test_image_folder = "images/"
6.5.2. Training Parameter Settings
# Specify the hyperparameters for training, such as batch size, number of epochs, learning rate, weight decay, etc.
# These parameters significantly impact the training process and model performance.
base_lr = 0.0001
lr_factor = 0.1
max_epochs = 100
warmup_epochs = 5
check_point_interval = 10
val_interval = 1
close_mosaic_epochs = 10
val_interval_stage2 = 1
batch_size = 4
work_dir = "/out"
6.5.3. Model Settings
# Modify the model configuration to accommodate the specific requirements of your dataset.
# This might involve adjusting the input/output dimensions, changing the number of output classes, or fine-tuning certain layers to better suit your data (here we only need to update the number of objects).
train_data_annot_path = data_root + train_annot
def get_object_classes(path_to_annotation):
    import json
    with open(path_to_annotation, "r") as f:
        data = json.load(f)
    cats = [cat['name'] for cat in data["categories"]]
    return tuple(cats)
classes = get_object_classes(train_data_annot_path)
metainfo = {
    'classes': classes
}
num_classes = len(classes)
model = dict(
    bbox_head=dict(
        head_module=dict(
            num_classes=num_classes
        )
    ),
    train_cfg=dict(
        assigner=dict(num_classes=num_classes)
    )
)
6.5.4. Customize Data Loaders
# Customize the data loaders to preprocess and load your dataset efficiently.
# This step involves setting up data augmentation techniques, data normalization, and any other preprocessing steps necessary for training.
train_dataloader = dict(
    batch_size=batch_size,
    num_workers=3,
    sampler=dict(_delete_=True, type='DefaultSampler', shuffle=True),
    collate_fn=dict(_delete_=True, type='yolov5_collate'),
    dataset=dict(
        metainfo=metainfo,
        data_root=data_root,
        ann_file=train_annot,
        data_prefix=dict(img=train_image_folder)
    )
)
val_dataloader = dict(
    batch_size=batch_size,
    num_workers=3,
    # no need to shuffle during evaluation
    sampler=dict(_delete_=True, type='DefaultSampler', shuffle=False),
    dataset=dict(
        metainfo=metainfo,
        data_root=data_root,
        ann_file=val_annot,
        data_prefix=dict(img=val_image_folder)
    )
)
test_dataloader = dict(
    batch_size=batch_size,
    num_workers=3,
    sampler=dict(_delete_=True, type='DefaultSampler', shuffle=False),
    dataset=dict(
        metainfo=metainfo,
        data_root=data_root,
        ann_file=test_annot,
        data_prefix=dict(img=test_image_folder),
        test_mode=True
    )
)
train_cfg = dict(
    val_interval=val_interval,
    max_epochs=max_epochs,
    dynamic_intervals=[((max_epochs - close_mosaic_epochs), val_interval_stage2)]
)
val_evaluator = dict(
    ann_file=data_root + val_annot,
)
test_evaluator = dict(
    ann_file=data_root + test_annot,
)
6.5.5. Other Customization (optimizer, parameter scheduler, hooks etc.)
# optimizer
# Configure the optimizer (e.g., SGD, Adam) and learning rate scheduler (e.g., step-based, cosine annealing) based on your training objectives and model architecture.
# Tuning these components can significantly impact convergence speed and final model performance.
optim_wrapper = dict(
    _delete_=True,
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.05),
    paramwise_cfg=dict(
        norm_decay_mult=0, bias_decay_mult=0, bypass_duplicate=True))
# learning rate
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=lr_factor,
        by_epoch=True,
        begin=0,
        end=warmup_epochs),
    dict(
        # cosine LR from 70% of max_epochs to the end of training
        type='CosineAnnealingLR',
        eta_min=base_lr * 0.01,
        begin=int(0.7 * max_epochs),
        end=max_epochs,
        by_epoch=True,
        convert_to_iter_based=True),
]
# hooks
default_hooks = dict(
    param_scheduler=dict(
        type='YOLOv5ParamSchedulerHook',
        scheduler_type='linear',
        lr_factor=lr_factor,
        max_epochs=max_epochs),
    checkpoint=dict(
        _delete_=True,
        type='CheckpointHook',
        interval=check_point_interval,  # save a checkpoint every "interval" epochs
        max_keep_ckpts=1  # only keep the latest checkpoint
    ))
custom_hooks = [
    dict(
        type='EMAHook',
        ema_type='ExpMomentumEMA',
        momentum=0.0001,
        update_buffers=True,
        strict_load=False,
        priority=49),
    dict(
        type='mmdet.PipelineSwitchHook',
        switch_epoch=max_epochs - close_mosaic_epochs,
        switch_pipeline=_base_.train_pipeline_stage2
    )
]
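To sanity-check these scheduler settings, it can help to compute the intended learning-rate curve outside the framework. The sketch below approximates the param_scheduler list above (linear warmup over the first epochs, a flat phase, then cosine decay starting at 70% of training); it is a simplified per-epoch illustration, not mmengine's exact per-iteration behavior:

```python
import math

def approx_lr(epoch, base_lr=0.0001, lr_factor=0.1, warmup_epochs=5, max_epochs=100):
    """Approximate per-epoch LR: linear warmup from base_lr*lr_factor to
    base_lr, flat until 0.7*max_epochs, then cosine decay to base_lr*0.01."""
    cosine_begin = int(0.7 * max_epochs)
    eta_min = base_lr * 0.01
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return base_lr * (lr_factor + (1 - lr_factor) * frac)
    if epoch < cosine_begin:
        return base_lr
    # cosine anneal from base_lr down to eta_min over [cosine_begin, max_epochs]
    progress = (epoch - cosine_begin) / (max_epochs - cosine_begin)
    return eta_min + (base_lr - eta_min) * 0.5 * (1 + math.cos(math.pi * progress))

for e in (0, 5, 50, 100):
    print(e, approx_lr(e))
```

Printing a few epochs this way makes it easy to confirm that warmup, the flat phase, and the decay start where you expect before launching a long training run.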
6.6. Configure Main Python Script
Adjust the main Python script that orchestrates the training process to incorporate the updated configurations. This script loads the configuration, initializes the runner, and then starts the training loop.
from mmengine.config import Config
from mmengine.runner import Runner
import argparse
def main(args):
    config = Config.fromfile(args.config_path)
    config.launcher = "pytorch"
    config.load_from = args.model_path
    runner = Runner.from_cfg(config)
    runner.train()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Get Config Path.')
    parser.add_argument('config_path', type=str, help='path to the config file')
    parser.add_argument('model_path', type=str, help='path to the model file')
    args = parser.parse_args()
    main(args)
6.7. Start Training on Docker from BASH Terminal
Execute the training process within a Docker container from the BASH terminal. This involves running the main training script with the updated configurations inside the Docker environment, ensuring reproducibility and isolation of the training process.
To train on Docker from the BASH terminal, it’s essential to define local paths to the configuration folder, data folder, and output folder. Assuming the BASH command is executed from the project’s root folder and the dataset resides at ‘/mnt/SSD2/coco_stuff10k/’ on the local PC, the following variables are defined within the BASH script to facilitate this setup.
DATA_DIR="/mnt/SSD2/coco_stuff10k/"
OUT_DIR="$PWD/out"
CONFIG_DIR="$PWD/codes/"
MODEL_PATH="https://download.openmmlab.com/mmyolo/v0/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco_20230217_120100-5881dec4.pth"
Since training runs inside Docker with GPU acceleration, and my PC is equipped with 3 GPUs, I add the following to the bash script:
GPUS=3
# You can use the variables DATA_DIR, OUT_DIR, CONFIG_DIR, GPUS in your script from this point onward.
docker run -it --rm \
--gpus all \
--mount type=bind,source=$CONFIG_DIR,target=/configs \
--mount type=bind,source=$DATA_DIR,target=/data \
--mount type=bind,source=$OUT_DIR,target=/out \
--shm-size 8g \
mmyolo:latest \
torchrun --nnodes 1 --nproc_per_node=$GPUS /configs/main_train_mmengine.py /configs/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py $MODEL_PATH
6.8. Start Training
$ bash train.sh
6.9. Log Analysis
You can assess the training performance by using the analyze_logs.py script from MMDetection, which is located in the tools/analysis_tools directory (inside MMDetection repo here). This script creates plots for metrics such as loss, mAP, learning rate, and more, based on the training log file you provide. Be sure to specify the correct path to the training log file when running the script.
Make sure the paths are correctly aligned with your container configuration and the location of your log files.
For instance, if your log files are stored in the 20240401_163114 directory on your PC and you’re running commands within the container, you can use the following commands (first start JupyterLab inside the Docker container, then run the commands):
$ bash jupyter.sh
$ python /workspace/mmdetection/tools/analysis_tools/analyze_logs.py plot_curve /out/20240401_163114/vis_data/scalars.json --keys bbox_mAP bbox_mAP_50 bbox_mAP_75 --out ./bbox_mAP.jpg --legend mAP mAP_50 mAP_75
$ python /workspace/mmdetection/tools/analysis_tools/analyze_logs.py plot_curve /out/20240401_163114/vis_data/scalars.json --keys lr --out ./lr.jpg --legend lr
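If you only need a quick look at one metric, you can also parse the log file directly. The sketch below assumes the JSON-lines layout that mmengine uses for scalars.json (one JSON object per line); the key names, such as bbox_mAP, are illustrative:

```python
import json

def read_scalars(path, key):
    """Collect (step, value) pairs for `key` from an mmengine-style
    scalars.json file, which stores one JSON dict per line."""
    points = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if key in record:
                points.append((record.get("step"), record[key]))
    return points

# Tiny synthetic example in the same JSON-lines format:
with open("scalars_demo.json", "w") as f:
    f.write('{"step": 1, "loss": 2.5}\n'
            '{"step": 2, "loss": 1.8}\n'
            '{"step": 2, "bbox_mAP": 0.31}\n')
print(read_scalars("scalars_demo.json", "loss"))  # [(1, 2.5), (2, 1.8)]
```

The resulting (step, value) pairs can then be fed to any plotting library, without depending on the analyze_logs.py script.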
6.10. Additional Tips
- Experiment with different hyperparameters: Fine-tune learning rate, batch size, optimizer, and data augmentation for optimal performance.
- Leverage data augmentation: Increase data diversity and improve model robustness.
- Use transfer learning: Start with a pre-trained model and fine-tune it on your dataset.
- Optimize model size and speed: Consider quantization or pruning techniques for deployment on resource-constrained devices.
7. Model Deployment
Deploying a model is a crucial step in transitioning from development to real-world application. Our objective is to make the object detection model easily accessible and usable for developers and applications. To streamline this process, we first converted the PyTorch model into ONNX format, ensuring broad compatibility and smooth integration into various systems.
To further optimize the model for deployment, we took two key steps: first, we converted the model into the OpenVINO format for efficient inference on Intel CPUs. Second, we utilized TensorRT to create a version optimized specifically for NVIDIA GPUs. This multi-faceted approach ensures that our model delivers high-performance inference across a wide range of hardware platforms, including both CPUs and GPUs.
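Whatever the export format, the raw network outputs still need score filtering and non-maximum suppression (NMS) at inference time, either baked into the exported graph or applied as a post-processing step. Below is a minimal NumPy sketch of greedy NMS for illustration; production pipelines typically use the runtime's built-in NMS operators instead:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression. boxes: (N, 4) as (x1, y1, x2, y2);
    scores: (N,). Returns indices of the kept boxes, highest score first."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the top-scoring box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Drop boxes that overlap the kept box too strongly
        order = rest[iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the two heavily overlapping boxes collapse to one: [0, 2]
```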
To maximize accessibility, we uploaded three versions of the model—the original PyTorch model, the OpenVINO-optimized model, and the TensorRT-optimized model—to the Hugging Face Model Hub. This platform not only simplifies access to our models but also enables developers to test and fine-tune them on their own datasets. Deploying these models on Hugging Face also supports fine-tuning, allowing them to be adapted for specific tasks and domains. By leveraging the Hugging Face ecosystem, we benefit from a vibrant community of developers and researchers, fostering collaboration and the continuous improvement of our models.
In production environments, it’s essential to consider factors such as scalability, latency, and resource efficiency. Techniques like model quantization and pruning can significantly boost performance while reducing resource demands. Additionally, implementing robust monitoring and logging mechanisms is key to ensuring long-term reliability and tracking model performance over time.
8. Summary
In the realm of computer vision, object detection is a critical task that enables the identification and localization of objects within images and videos. YOLO (You Only Look Once) models have set a high standard for this task due to their impressive balance of speed and accuracy. YOLOv8 builds upon these strengths with enhanced model architecture and optimization, making it highly effective for real-time applications.
This tutorial focuses on utilizing YOLOv8 with the MMYOLO framework, a modular and user-friendly tool from the OpenMMLab project. MMYOLO simplifies the process of training, fine-tuning, and deploying YOLO models through its flexible design and comprehensive support for various datasets and configurations. Additionally, the tutorial incorporates Docker, which facilitates the creation of consistent and reproducible development environments. This integration ensures that users can seamlessly set up and manage their object detection workflows across different systems.
By following this guide, users will gain the skills to efficiently build, train, and deploy advanced object detection systems, leveraging the combined power of YOLOv8, MMYOLO, and Docker.
9. References
[1] OpenMMLab, “MMYOLO: A YOLO-Centric Extension of MMDetection,” GitHub Repository.
[2] K. Chen et al., “MMDetection: Open MMLab Detection Toolbox and Benchmark,” arXiv:1906.07155, 2019.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” arXiv:1506.02640, 2016.
[4] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv:1804.02767, 2018.
[5] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection,” arXiv:2004.10934, 2020.
[6] C. Li et al., “YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications,” arXiv:2209.02976, 2022.
[7] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors,” arXiv:2207.02696, 2022.
[8] Ultralytics, “YOLOv8,” GitHub Repository, 2023.
[9] S. Xu et al., “PP-YOLOE: An Evolved Version of YOLO,” arXiv:2203.16250, 2022.
[10] Z. Ge et al., “YOLOX: Exceeding YOLO Series in 2021,” arXiv:2107.08430, 2021.
[11] C. Lyu et al., “RTMDet: An Empirical Study of Designing Real-Time Object Detectors,” arXiv:2212.07784, 2022.
[12] OpenMMLab, “Why mmyolo? Focusing on YOLO’s Unique Needs,” mmyolo Documentation.
[13] Apache Software Foundation, “Apache License, Version 2.0.”
[14] Free Software Foundation, “GNU General Public License, Version 3.”
Disclaimer
This tutorial is intended for educational purposes only. It does not constitute professional advice, including medical or legal advice. Any application of the techniques discussed in real-world scenarios should be done cautiously and with consultation from relevant experts.