Object Detection and YOLO v8 on Oracle Cloud

Luigi Saetta
The Modern Scientist
8 min read · Jan 12, 2023


Introduction to the problem.

Although the title may suggest a question of “pure technology”, the right approach is always to start from a customer’s real case, with their specific needs.

In my case, since the end of December I have been working with a colleague to define a solution that helps automate the reading of water meters.

That is, to extract the consumption reading from an image of this type:

Fig 1: water meter example

(In this specific case the reading is: 00293).

The need is concrete: smart meters, capable of directly providing a remote digital reading, are still not very widespread today for measuring water consumption.

The operation must therefore be done by people. It is a task that requires time and a lot of attention and is, let’s face it, boring and repetitive.

In short, it is one of many necessary tasks for which human beings are poorly suited.

Obviously, in an era when we hear every day about the surprising results that innovation in AI can produce, one might think that such a solution is entirely feasible and not even too complicated. But that risks trivializing the problem.

As we will see, a possible approach is to treat the task as an “Object Detection” problem, one of the problems substantially solved in the “Computer Vision” field. The technology exists and is within the reach of (almost) everyone.

But, in the real world, “the devil is in the details”.

To give you an idea, I’ll try to list the difficulties encountered:

  • Images may be blurry
  • Images may be upside down and, often, they are (with the risk of reading the digits backwards)
  • There are many types of water meters, so the digits appear in many different styles
  • Finally, it is not easy to obtain many annotated examples to train a model.

To sum up: we need to develop a solution that works on very different images, with fairly high accuracy, but starting from relatively few training examples (in jargon, “few-shot learning”).

Finally, water costs less than electricity and gas; therefore the solution, while very useful for freeing people from boring, repetitive tasks, cannot cost too much.

Object Detection and YOLO.

In the field of Computer Vision, the Object Detection task consists of identifying certain objects within an image (e.g. people, bicycles, cars, etc.) and locating the portion of the image that contains each of them, delimiting it with a rectangle (a bounding box).

Fig2: example of Object Detection.

In my concrete case the objects are a bit “special”:

  • The digits that make up the reading (0,…9)
  • The entire reading (that is, the box containing the entire reading)
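
These two kinds of objects translate into a YOLO dataset configuration file with eleven classes: the ten digits plus the box containing the whole reading. The sketch below is illustrative; the paths and the exact directory layout are my assumptions, not taken from the project:

```yaml
# data.yaml — YOLO dataset configuration (illustrative paths)
train: /home/datascience/dataset/images/train
val: /home/datascience/dataset/images/val

# 11 classes: the ten digits plus the box containing the entire reading
nc: 11
names: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'reading']
```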

Now, if you have an annotated set of images (a dataset), there are several technologies that can be used for Object Detection, most of them based on Convolutional Neural Networks.

YOLO, an acronym that stands for “You Only Look Once”, is a very fast and efficient algorithm that has been available for some years now.

An open-source Python implementation, developed by a company called Ultralytics, is available and often used in custom projects.

The important news at the end of 2022 is that Ultralytics has announced and made available a new version, version 8, which promises to be more accurate, faster and easier to use.

Can YOLO help me?

In developing a PoC for my customer I had started, with good results, with the most widely used older version (v5).

At this point, I asked myself the following three questions:

  1. Can version 8 be used on the Oracle Data Science Service, today?
  2. How easy is it to migrate from version 5 to version 8?
  3. In my specific case, is there really an improvement in accuracy (for the same dataset)?

OCI Data Science (in a few lines).

The OCI Data Science Service allows you to quickly activate computational environments (think of them as VMs, to simplify) with adequate resources (CPU, GPU, memory) and with flexibility: you can quickly switch, for example, from CPU to GPU and vice versa, or increase the number of GPUs. Within these environments you can use all the open-source Python libraries normally required in such projects.

YOLO v8 in OCI Data Science.

Let’s go to the point and try to answer the three questions.

To train an Object Detection model, I already had an annotated dataset with about 200 images, 20 of which used for validation.

Furthermore, the time available to me for all the tests was around 4 hours, starting more or less only from the experience gained with YOLO v5. This was the most challenging part: 4 hours is not a long time, and there can always be surprises, both positive and negative.

To train a model of this type in a reasonable time, it is essential to use GPUs. One of the keys to the success of modern ML/AI is the use of these processors, with their thousands of specialised cores.

Therefore, for the tests I decided to activate a Notebook Session with the VM.GPU3.2 shape, that is, an environment with 2 NVIDIA V100 GPUs.

These GPUs can today be considered mid-range, with costs of about 2.5 euros per GPU per hour.

For activation of the Notebook Session: let’s say 5 minutes.

For the software environment, more than fifty environments are available in OCI Data Science, with many pre-integrated and tested libraries. We call them “conda environments”, as they are based on the conda package manager.

I chose a specialised environment for Computer Vision and GPU. The command to activate is:

odsc conda install -s computervision_p37_gpu_v1

Activation of the software environment: 5 minutes.

Only one command is required to install YOLO v8 in such an environment. For those who love details, here it is:

pip install ultralytics

Again: a few minutes.

The dataset? As I said, I already had an annotated dataset in a format suitable for YOLO v5.

I verified that the format is, as expected, compatible with YOLO v8. So I just had to bring directories and files into the Notebook Session.
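
The annotation format shared by v5 and v8 is simple: one txt file per image, one line per object, with a class index followed by the box center, width and height, all normalized to [0, 1]. A minimal parser, just to illustrate the format (the function name is mine, not part of any YOLO API):

```python
def parse_yolo_label(line):
    """Parse one line of a YOLO label file.

    Format: "<class> <x_center> <y_center> <width> <height>",
    with all four coordinates normalized to the [0, 1] range.
    """
    parts = line.split()
    cls = int(parts[0])
    x_c, y_c, w, h = (float(v) for v in parts[1:])
    return cls, (x_c, y_c, w, h)

# Example: class 3 (the digit "3"), box centered near the middle of the image
cls, box = parse_yolo_label("3 0.512 0.480 0.060 0.110")
```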

Time? There are two hundred images plus the txt files with the corresponding annotations. It took me no more than 10 minutes (zip & unzip, etc.).

At this point I just had to figure out how to launch model training.

The YOLO v8 documentation, although still being completed, is already quite clear and comes with examples; therefore, in about 10 minutes, I defined the command line for the launch:

yolo task=detect mode=train model=yolov8x.pt data=/home/datascience/path_to_yaml/data.yaml device='0,1' epochs=150

(The device parameter is essential to use both GPUs.)
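
If you want to script several runs (for example to vary the number of epochs or the model size), the same launch can be built programmatically. This is only a sketch around the CLI command shown above; the helper function is mine:

```python
import shlex

def build_yolo_train_cmd(data_yaml, model="yolov8x.pt", epochs=150, devices=(0, 1)):
    """Build the YOLO v8 CLI training command as an argument list.

    devices: GPU indices, joined as "0,1" so that both GPUs are used.
    """
    device = ",".join(str(d) for d in devices)
    return [
        "yolo", "task=detect", "mode=train",
        f"model={model}",
        f"data={data_yaml}",
        f"device={device}",
        f"epochs={epochs}",
    ]

cmd = build_yolo_train_cmd("/home/datascience/path_to_yaml/data.yaml")
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```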

In summary, in just under 1 hour I was ready for a first test, with just a few epochs. The test showed me that YOLO v8 was running smoothly.

So, largely positive answers for questions 1 and 2, namely:

  1. YOLO v8 runs smoothly in OCI Data Science
  2. It’s super easy to upgrade from v5 to v8.

At this point I just had to complete the training, with an adequate number of epochs, to obtain the metrics on the validation set and compare the performance obtained with those I had produced using YOLO v5.

With YOLO v5, to obtain a satisfactory result I had to use about 300 epochs, for a total time of just over 1h of training.

Carrying out a second test with v8 and about 50 epochs, I immediately noticed another interesting aspect: the model seemed to converge faster, towards performance comparable to version 5, with fewer epochs.

By now it was lunchtime; I was ready for the most important run and decided to launch it with 150 epochs.

Well, after returning from lunch (never wait in front of the screen; your time can always be used for more pleasant or useful things) I had a good surprise.

Here are the screenshots of the two executions to compare (v5 vs v8):

v5 Results (300 epochs):

Fig. 3: yolo v5 results.

v8 results (150 epochs):

Fig. 4: yolo v8 results.

Therefore, there is actually an improvement, of about 10%, with fewer epochs (see mAP50).

So, to answer question #3: slightly more accurate and faster (with fewer epochs).

In the end, I was satisfied with having made good use of the (less than) 4 hours dedicated to the task.

My conclusion.

I don’t want to go into too much detail (and I’ll probably write another article). But I should try to draw, concisely, some conclusions.

Suitable and usable technology is available, at a cost of a few tens of euros. YOLO v8 can be used.

For the training part, it is very simple to adopt, and it works fine.

The inference API (how to apply the trained model to new images) is, in my opinion, still partial, and I had to write a few dozen lines of code for “cropping” the “Region of Interest” (ROI), i.e. the rectangle that contains the entire reading.
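
The cropping itself is mostly coordinate arithmetic: the model returns a normalized box, which has to be converted to pixel coordinates before slicing the image. A minimal sketch (the function name is mine, not part of the YOLO API):

```python
def yolo_box_to_pixels(box, img_w, img_h):
    """Convert a normalized YOLO box (x_center, y_center, w, h)
    to integer pixel corners (x1, y1, x2, y2), clipped to the image."""
    x_c, y_c, w, h = box
    x1 = max(0, round((x_c - w / 2) * img_w))
    y1 = max(0, round((y_c - h / 2) * img_h))
    x2 = min(img_w, round((x_c + w / 2) * img_w))
    y2 = min(img_h, round((y_c + h / 2) * img_h))
    return x1, y1, x2, y2

# On a 1000x600 image, a centered box 40% wide and 20% tall:
x1, y1, x2, y2 = yolo_box_to_pixels((0.5, 0.5, 0.4, 0.2), 1000, 600)
print(x1, y1, x2, y2)  # → 300 240 700 360
# The ROI crop is then a simple array slice, e.g. with OpenCV/NumPy:
#   roi = image[y1:y2, x1:x2]
```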

YOLO v8 allows me to improve accuracy. But the task, with relatively few images and the variety I mentioned, remains challenging.

From my latest tests (on other images, outside the training and validation datasets) I verified that the identification of the ROI is very accurate (let’s say accuracy > 0.95).

This is really useful, in my opinion, because it allows you to develop, with a few lines of code, a user interface for human review of the automatic readings, displaying only the ROI and not the entire image.

On the other hand, the accuracy in identifying the individual digits is not as high (ranging from 0.6 to 0.8).

The problem particularly concerns some digits for which objectively few examples are available.

But the solution is simple: increase the size of the training set with quality images and repeat the training. Even with a doubled training set, it would take about 1 hour of training.

What takes time, human time, is the selection of images and the annotation.

Finally, there is the issue of “upside-down images” to address. There I have a solution, still based on AI, which I will just mention here: develop a binary classification model (0: not flipped, 1: flipped). But maybe I’ll say more about this, and the technological solutions to adopt, in another article.
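
One convenient property of such a classifier is that the “flipped” training examples can be generated synthetically, by rotating correctly-oriented images 180° (label 1), at zero annotation cost. A pure-Python illustration on a tiny “image” represented as a nested list of pixel values (my sketch, not the actual solution):

```python
def rotate_180(image):
    """Rotate a 2D pixel grid by 180 degrees: reverse the row order,
    then reverse each row. Equivalent to flipping vertically and
    horizontally, which is exactly what an upside-down photo looks like."""
    return [row[::-1] for row in image[::-1]]

tiny = [[1, 2],
        [3, 4]]
print(rotate_180(tiny))  # → [[4, 3], [2, 1]]
```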

Some small updates: well, I’m happy that some people have already taken a close look at the article and sent me feedback. It means that it is useful in some way.

Obviously, I have not explained all the details. The main reason is that I wanted to keep the article as short as possible. For that reason, I plan to write a more technical, in-depth article about the YOLO details and the changes in version 8.

One small piece of information: one improvement is that I was able to get better performance with less training time. And I mentioned “epochs”. For some people this is clear enough but, for those of you who don’t normally do NN training, an epoch is a complete pass over the training dataset. During training the dataset is scanned several times (for several epochs) to reach the highest possible accuracy. Obviously, fewer epochs mean less training time and… lower costs.
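
In toy terms, an epoch is just the outer loop of training:

```python
def train(dataset, epochs):
    """Toy illustration of what "epochs" means: each epoch is one
    complete pass over the training dataset (no real learning here)."""
    passes = 0
    for epoch in range(epochs):
        for sample in dataset:  # one full scan of the data...
            pass                # ...where the real weight updates would happen
        passes += 1
    return passes

print(train(dataset=range(200), epochs=150))  # → 150 complete passes over 200 images
```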
