Convolutional Neural Networks for Object Detection

by Ivan Ozhiganov

25 Feb 2016

These days there are so many photo and video surveillance systems in public areas that you would be hard pressed to find someone who hasn’t heard about them. “Big Brother” is watching us — on the roads, in airports, at train stations, when we shop in supermarkets or walk down in the underground. Remote monitoring technologies, photo, and video capture programs are widespread in everyday life but they are also intended for military and other purposes.

As a result, this requires the constant support and development of new methods for automatically processing the visual data captured. In particular, digital image processing often focusses on object detection in the picture, including localization, and recognition of a particular class of objects.

Azoft R&D team has extensive experience in dealing with similar challenges. Specifically, we have implemented a face recognition system and worked on a project for road sign recognition. Today we are going to share our experience on how to do object detection in images using convolutional neural networks. This type of neural networks has successfully proven itself in our past projects.

Project Overview

In the past, the Haar cascade classifier and the LBP-based classifier were the best tools for detecting objects in images. However with the introduction of convolutional neural networks and their proven successful application in computer vision, these cascade classifiers have become a second-best alternative.

We decided to test in practice the effectiveness of convolutional neural networks for object detection in images. As the object of our research, we chose license plate and road sign pictures.

The goal of our project was to develop a convolutional neural network model that allows recognition of objects in images with a higher quality and performance than cascade classifiers.
The task is broken into four stages:

  1. License plate keypoints detection using a convolutional neural network
  2. Road signs keypoints detection using a convolutional neural network
  3. Road signs detection using a fully convolutional neural network
  4. Comparing cascade classifiers and a convolutional neural network for the purpose of license plate detection

We implemented the first three stages simultaneously. Each of them was the responsibility of a single engineer. At the end, we compared the final model of the convolutional neural network of the first stage with cascade classifier methods according to specific parameters.

The performance of various recognition algorithms is particularly important for contemporary mobile devices. For this reason, we decided to test neural networks and cascade classifiers using smartphones.


Stage 1. License plate keypoints detection using a convolutional neural network

Over the years, we have used machine learning for several research projects, and for image recognition we’ve often used a dataset of license plate numbers as the learning base. Seeing as our new experiment required the detection of specific identical objects in images, our license plate database was perfectly suited to this task.

First, we decided to train the convolutional neural network to find keypoints of license plates. This was the purpose of the regression analysis. In our case, pixels of the image were independent input parameters while the coordinates of object keypoints were the dependent output parameters. The example of keypoints detection is available in the Image 1.

Detecting keypoints of license plates

Training a convolutional neural network to find keypoints requires a dataset with a large number of images of the needed object (no less than 1000). Coordinates of keypoints have to be designated and located in the same order.

Our dataset included several hundred images, however this wasn’t enough for training the network. Therefore, we decided to increase the dataset via augmentation of available data. Before starting augmentation we designated keypoints in the images and divided our dataset into training and control parts.

We applied the following transformations to the initial image during its augmentation:

  • Shifts
  • Resize
  • Rotations relative to the center
  • Mirror
  • Affine transformations (they allowed us to rotate and stretch a picture).

Besides this we changed all the images to 320*240 pixels. Take a look at an example of augmentation with transformations in Image 2.

Data augmentation

We chose the Caffe framework for the first stage because it is one of the most flexible and fastest frameworks for experiments with convolutional neural networks.

One way to solve a regression task in Caffe is using the special file format HDF5. After normalization of pixel values from 0 to 1 and coordinate values from -1 to 1, we packed the images and coordinates of keypoints in HDF5. You can find more details in our tutorial (see below).

At the beginning, we applied large network architectures (from 4 to 6 convolutional layers and a large amount of convolution kernels). The models with big architectures demonstrated good results but very low performance. For this reason, we decided to set up a simple neural network architecture to keep the quality on the same level.

The final architecture of the convolutional neural network for detecting keypoints of license plates was the following:

The architecture of convolutional neural network for detecting the keypoints of license plates

While training the neural network we used the optimization method called Adam. Compared to the Nesterov’s gradient descent, which demonstrated a high value of loss even after the thorough selection of momentum and learning rate parameters, Adam showed the best results. Using the Adam method, the convolutional neural network was trained with higher quality and speed.

We got a neural network that finds the key points of license plates quite well if the plates are not very close to the borders.

Detecting the key points of license plates with the obtained convolutional neural network model

For deeper understanding of the convolutional neural network principle, we studied the obtained convolution kernels and feature maps on different layers of the neural network. Considering the final model, convolution kernels demonstrated that the network learned to respond to the sudden swings of brightness, which appear in the borders and symbols of a license plate. Feature maps on the images below show the received model.

The trained kernels on the first and second layers are in pictures 5 and 6.

Obtained kernel of the first convolutional layer 7х7

Obtained kernel of the second convolutional layer 5х5

Regarding the feature maps in the final model, we took the car picture (Image 7) and transformed it into the picture with gray color gradation to look at the obtained maps after the first (Image 8) and the second (Image 9) convolutional layers.

The feature map after the first convolutional layer

The feature maps after the second convolutional layer

Finally, we designated the received coordinates of the key points and got the desired image (Image 10).

Designated picture after the CNN forward pass

The convolutional neural network was very effective in detecting the keypoints of license plates. In the majority of cases, the key points of license plate borders were recognized correctly. Therefore, we can highly praise the productivity of the convolutional neural network. The example of license plate detection using an iPhone is available in the video.

Stage 2. Road sign keypoint detection using a convolutional neural network

While training a convolutional neural network to find license plates we simultaneously worked on training a similar network to find road signs with speed limits. We also implemented experiments on the base of the Caffe framework. For training this model we used a dataset with nearly 1700 pictures of signs, which were augmented to 35000.

In doing this we successfully trained a neural network to find road signs on images with a size of 200х200 pixels.

When we tried to detect road signs with a different size on the image the problem appeared. For this reason, we implemented experiments based on simple conditions. A network had to find a white circle against a dark backdrop. We also kept the same image size of 200x200 pixels and made the fixed parameters for the circle size in variation up to 3 times. In other words, the minimum and maximum radius of circles in our dataset differed from each other by 3 times.

Finally, we achieved an appropriate outcome only when the radius of the circle changed no more than 10%. Experiments with gray circles (the spectral range of gray from 0.1 to 1) also demonstrated the same result. Thus, it is an open question as to how to implement object detection when the objects have a different size.

Examples of the last successfully tested model are shown on Image 11 as a group of pictures. As we can see in the pictures, the network learned to distinguish between similar signs. If the network pointed at the image center, then it didn’t find an object.

Regarding the detection of road signs, the convolutional neural network demonstrated good results. However, different sizes of objects became an unexpected obstacle. We plan to come back to the search of the solution to this problem in future projects, as it requires detailed research.

Stage 3. Road sign detection using a fully convolutional neural network

Another model that we decided to train to find road signs was a fully convolutional neural network without fully-connected layers. There is an image of a specific size at the input of the fully convolutional neural network, which transforms to a smaller size image at the output. In fact, the network is a non-linear filter with resizing. In other words, the neural network helps to increase the sharpness by removing noise from particular image areas without edge smearing but at the cost of reducing the size of the input image.

The brightness value of pixels is equal to 1 in the output image, where an object is. And any pixels outside of the image have a brightness value equal to 0. Therefore, the brightness value of pixels in the output image is the probability of that pixel belonging to the object.

It is important to consider that the output image is smaller than the input image. That’s why the object coordinates have to be scaled in accordance with the output image size.

We chose the following architecture to train the network: 4 convolutional layers, max-pooling (reducing size via choice of the biggest one) for the first layer. We trained the network using a dataset which is based on images of road signs with speed limits. When we finished training and applied the network to our dataset, we made binarization and found the connected components. Every component is a hypothesis about the sign location. Here are the results:

We divided images into 3 groups of 3 pictures (see Image 12). The left side of the image is the input. The right side of the image is the output. We labeled the central image, which shows the result we would like to get ideally.

We applied the binarization to the input image with a threshold of 0.5 and found the connected components. As a result, we found the desired rectangle that designates the location of a road sign. The rectangle is well seen in the left image and the coordinates are scaled back to the input image size.

Nevertheless, this method demonstrated both good results and false positives:

Overall, we evaluate this model positively. The fully convolutional neural network showed about 80% of correctly found signs from the independent testing sample. The special advantage of this model is that you can find the same two signs and label them with a rectangle. If there are no signs in the picture, the network won’t mark anything.

Stage 4. Comparing cascade classifiers and a convolutional neural network for the purpose of license plate detection

Through our earlier experiments we came to the conclusion that convolutional neural networks are fully comparable with cascade classifiers and even outperform them for some parameters.

To evaluate the quality and performance of different methods for detecting objects on images we use the following characteristics:

Characteristics of Methods for Object Detection

Level of precision and recall

Both the convolutional neural networks and Haar classifier demonstrate a high level of precision and recall for detecting objects in images. At the same time, the LBP classifier shows a high level of recall (finding the object quite regularly) but also has a high rate of false positives and low precision.

Scale invariance

Whereas the Haar cascade classifier and the LBP cascade classifier demonstrate strong invariance to changing the scale of the object in the images, the convolutional neural network cannot manage it in some cases and this shows the low scale invariance.

Number of attempts before getting a working model

Using cascade classifiers, you need a few attempts to get a working model for object detection. The convolutional neural networks do not give a result so quickly. To achieve the goal you need to perform dozens of experiments.

Time for processing

A convolutional neural network does not require much time for processing. And the LBP classifier also doesn’t need a lot of processing time. As for the Haar classifier, it takes a significantly longer time for processing.

The average time spent on processing one picture in seconds (not counting the time for capturing and displaying video):

Average Time Spent on Processing

Consistency with tilting objects

Another great advantage of the convolutional neural network is the consistency with tilting objects. Neither cascade classifiers are consistent with objects which are tilting on the image.

To summarize, we can say that convolutional neural networks are equal or even better than cascade classifiers for some parameters. However the conditions are that there will be a significant number of experiments required to teach the neural network and that the object scale can’t change a lot.


During the process of solving the problem of detecting specific objects in images, we applied two models based on convolutional neural networks. The first one is finding the object’s keypoints in images using a convolutional neural network. The second one is finding the objects in images via a fully convolutional neural network.

Each of the experiments were very labor-intensive and time-consuming. Once more we were convinced that the process of training convolutional neural networks is complicated and requires more investment to obtain reliable results of high quality and performance.

The comparative analysis of cascade classifier methods and the trained convolutional neural network model confirmed our main hypothesis. The convolutional neural network allows localizing objects faster and with higher quality than cascade classifiers if the object won’t change in scale very much. To solve the problem of the low scale invariance, we will try to increase the number of convolutional layers in future projects and use the most representative dataset.



Filter by