Breaking down Convolutional Neural Networks: Understanding the Magic behind Image Recognition (2024)

An in-depth look at the architecture and inner workings of CNNs for successful image classification and object recognition.

Breaking down Convolutional Neural Networks: Understanding the Magic behind Image Recognition (1)



Published in

Towards Data Science


9 min read


Dec 4, 2021


Breaking down Convolutional Neural Networks: Understanding the Magic behind Image Recognition (3)

The Convolutional Neural Network (CNN or ConvNet) is a subtype of Neural Networks that is mainly used for applications in image and speech recognition. Its built-in convolutional layer reduces the high dimensionality of images without losing its information. That is why CNNs are especially suited for this use case.

If we want to use a fully-connected neural network for image processing, we quickly discover that it does not scale very well.

For the computer, an image in RGB notation is the summary of three different matrices. For each pixel of the image, it describes what color that pixel displays. We do this by defining the red component in the first matrix, the green component in the second, and then the blue component in the last. So for an image with the size 3 on 3 pixels, we get three different 3x3 matrices.

Breaking down Convolutional Neural Networks: Understanding the Magic behind Image Recognition (4)

To process an image, we enter each pixel as input into the network. So for an image of size 200x200x3 (i.e. 200 pixels on 200 pixels with 3 color channels, e.g. red, green and blue) we have to provide 200 * 200 * 3= 120,000 input neurons. Then each matrix has a size of 200 by 200 pixels, so 200 * 200 entries in total. This matrix then finally exists three times, each for red, blue, and green. The problem then arises in the first hidden layer, because each of the neurons there would have 120,000 weights from the input layer. This means the number of parameters would increase very quickly as we increase the number of neurons in the Hidden Layer.

This challenge is exacerbated when we want to process larger images with more pixels and more color channels. Such a network with a huge number of parameters will most likely run into overfitting. This means that the model will give good predictions for the training set, but will not generalize well to new cases that it does not yet know. Additionally, due to a large number of parameters, the network would very likely stop attending to individual image details as they would be lost in sheer mass. However, if we want to classify an image, e.g. whether there is a dog in it or not, these details, such as the nose or the ears, can be the decisive factor for the correct result.

For these reasons, the Convolutional Neural Network takes a different approach, mimicking the way we perceive our environment with our eyes. When we see an image, we automatically divide it into many small sub-images and analyze them one by one. By assembling these sub-images, we process and interpret the image. How can this principle be implemented in a Convolutional Neural Network?

The work happens in the so-called convolution layer. To do this, we define a filter that determines how large the partial images we are looking at should be, and a step length that decides how many pixels we continue between calculations, i.e. how close the partial images are to each other. By taking this step, we have greatly reduced the dimensionality of the image.

The next step is the pooling layer. From a purely computational point of view, the same thing happens here as in the convolution layer, with the difference that we only take either the average or maximum value from the result, depending on the application. This preserves small features in a few pixels that are crucial for the task solution.

Finally, there is a fully-connected layer, as we already know it from regular neural networks. Now that we have greatly reduced the dimensions of the image, we can use the tightly meshed layers. Here, the individual sub-images are linked again in order to recognize the connections and carry out the classification.

Now that we have a basic understanding of what the individual layers roughly do, we can look in detail at how an image becomes a classification. For this purpose, we try to recognize from a 4x4x3 image whether there is a dog in it.

In the first step, we want to reduce the dimensions of the 4x4x3 image. For this purpose, we define a filter with the dimension 2x2 for each color. In addition, we want a step length of 1, i.e. after each calculation step, the filter should be moved forward by exactly one pixel. This will not reduce the dimension as much, but the details of the image will be preserved. If we migrate a 4x4 matrix with a 2x2 and advance one column or one row in each step, our Convolutional Layer will have a 3x3 matrix as output. The individual values of the matrix are calculated by taking the scalar product of the 2x2 matrices, as shown in the graphic.

Breaking down Convolutional Neural Networks: Understanding the Magic behind Image Recognition (5)

The (Max) Pooling Layer takes the 3x3 matrix of the convolution layer as input and tries to reduce the dimensionality further and additionally take the important features in the image. We want to generate a 2x2 matrix as the output of this layer, so we divide the input into all possible 2x2 partial matrices and search for the highest value in these fields. This will be the value in the field of the output matrix. If we were to use the average pooling layer instead of a max-pooling layer, we would calculate the average of the four fields instead.

Breaking down Convolutional Neural Networks: Understanding the Magic behind Image Recognition (6)

The pooling layer also filters out noise from the image, i.e. elements of the image that do not contribute to the classification. For example, whether the dog is standing in front of a house or in front of a forest is not important at first.

The fully-connected layer now does exactly what we intended to do with the whole image at the beginning. We create a neuron for each entry in the smaller 2x2 matrix and connect them to all neurons in the next layer. This gives us significantly fewer dimensions and requires fewer resources in training.

This layer then finally learns which parts of the image are needed to make the classification dog or non-dog. If we have images that are much larger than our 5x5x3 example, it is of course also possible to set the convolution layer and pooling layer several times in a row before going into the fully-connected layer. This way you can reduce the dimensionality far enough to reduce the training effort.

Tensorflow has a wide variety of datasets that we can download and use with just a few lines of code. This is especially helpful when you want to test new models and their implementation and therefore do not want to search for appropriate data for a long time. In addition, Google also offers a dataset search, with which one can find a suitable dataset within a few clicks.

An Introduction to TensorFlowGet to know the Machine Learning Framework, its Architecture and the Comparison to

For our exemplar Convolutional Neural Network, we use the CIFAR10 dataset, which is available through Tensorflow. The dataset contains a total of 60,000 images in color, divided into ten different image classes, e.g. horse, duck, or truck. We note that this is a perfect training dataset as each class contains exactly 6,000 images. In classification models, we must always make sure that every class is included in the dataset an equal number of times, if possible. For the test dataset, we take a total of 10,000 images and thus 50,000 images for the training dataset.

Each of these images is 32×32 pixels in size. The pixels in turn have a value between 0 and 255, where each number represents a color code. Therefore, we divide each pixel value by 255 so that we normalize the pixel values to the range between 0 and 1.

To check that all images are displayed correctly, we print the first ten images including the class they belong to. Since these are only 32×32 images, they are relatively blurry, but you can still tell which class they are part of.

In Tensorflow we can now build the Convolutional Neural Network by defining the sequence of each layer. Since we are dealing with relatively small images we will use the stack of Convolutional Layer and Max Pooling Layer twice. The images have, as we already know, 32 height dimensions, 32 width dimensions, and 3 color channels (red, green, blue).

The Convolutional Layer uses first 32 and then 64 filters with a 3×3 kernel as a filter and the Max Pooling Layer searches for the maximum value within a 2×2 matrix.

After these two stacks, we have already reduced the dimensions of the images significantly, to 6 height pixels, 6 width pixels, and a total of 64 filters. With a third and final convolutional layer, we reduce these dimensions further to 4x4x64. Before we now build a fully meshed network from this, we replace the 3×3 matrix per image, with a vector of 1024 elements (4*4*64), without losing any information.

Now we have sufficiently reduced the dimensions of the images and can add one more hidden layer with a total of 64 neurons before the model ends in the output layer with the ten neurons for the ten different classes.

The model with a total of 122,570 parameters is now ready to be built and trained.

Before we can start training the Convolutional Neural Network, we have to compile the model. In it we define which loss function the model should be trained according to, the optimizer, i.e. according to which algorithm the parameters change, and which metric we want to be shown in order to be able to monitor the training process.

After training the Convolutional Neural Network for a total of 10 epochs, we can look at the progression of the model’s accuracy to determine if we are satisfied with the training.

Our prediction of the image class is correct in about 80% of the cases. This is not a bad value, but not a particularly good one either. If we want to increase this even further, we could have the Convolutional Neural Network trained for more epochs or possibly configure the dense layers even differently.

  • Convolutional neural networks are used in image and speech processing and are based on the structure of the human visual cortex.
  • They consist of a convolution layer, a pooling layer, and a fully connected layer.
  • Convolutional neural networks divide the image into smaller areas in order to view them separately for the first time.
  • Convolutional neural networks can be programmed in just a few steps using Tensorflow.
  • It is important to adjust the arrangement of the convolutional and max-pooling layers to each different use case.

If you like my work, please subscribe here or check out my website Data Basecamp! Also, medium permits you to read 3 articles per month for free. If you wish to have unlimited access to my articles and thousands of great articles, don’t hesitate to get a membership for $5 per month by clicking my referral link:

Breaking down Convolutional Neural Networks: Understanding the Magic behind Image Recognition (2024)


What is Convolutional Neural Network for image recognition? ›

The Convolutional Neural Network (CNN or ConvNet) is a subtype of Neural Networks that is mainly used for applications in image and speech recognition. Its built-in convolutional layer reduces the high dimensionality of images without losing its information. That is why CNNs are especially suited for this use case.

Which neural network will be used to solve an image recognition problem? ›

The leading architecture used for image recognition and detection tasks is Convolutional Neural Networks (CNNs). Convolutional neural networks consist of several layers with small neuron collections, each of them perceiving small parts of an image.

What is Convolutional Neural Network fully explained? ›

A convolutional neural network (CNN or ConvNet) is a network architecture for deep learning that learns directly from data. CNNs are particularly useful for finding patterns in images to recognize objects, classes, and categories. They can also be quite effective for classifying audio, time-series, and signal data.

What are the stages of CNN? ›

A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected layer.

What is convolution and why it is used in image recognition? ›

Convolution is a simple mathematical operation which is fundamental to many common image processing operators. Convolution provides a way of `multiplying together' two arrays of numbers, generally of different sizes, but of the same dimensionality, to produce a third array of numbers of the same dimensionality.

How convolutional neural network works in an image or video? ›

It works by placing a filter over an array of image pixels – this then creates what's called a convolved feature map. “It's a bit like looking at an image through a window which allows you to identify specific features you might not otherwise be able to see.

Which algorithm is best for image recognition? ›

The leading architecture used for image recognition and detection tasks is that of convolutional neural networks (CNNs). Convolutional neural networks consist of several layers, each of them perceiving small parts of an image.

Which algorithm is used for image recognition? ›

Some of the algorithms used in image recognition (Object Recognition, Face Recognition) are SIFT (Scale-invariant Feature Transform), SURF (Speeded Up Robust Features), PCA (Principal Component Analysis), and LDA (Linear Discriminant Analysis).

How does image recognition neural network work? ›

How does Image recognition work? Typically the task of image recognition involves the creation of a neural network that processes the individual pixels of an image. These networks are fed with as many pre-labelled images as we can, in order to “teach” them how to recognize similar images.

What is a simple example of Convolutional Neural Network? ›

A convolutional neural network is used to detect and classify objects in an image. Below is a neural network that identifies two types of flowers: Orchid and Rose. In CNN, every image is represented in the form of an array of pixel values. The convolution operation forms the basis of any convolutional neural network.

What is the main advantage of Convolutional Neural Network? ›

The main advantage of CNN compared to its predecessors is that it automatically detects the important features without any human supervision. For example, given many pictures of cats and dogs it learns distinctive features for each class by itself. CNN is also computationally efficient.

Why do we need Convolutional Neural Network? ›

Why do we need CNN over ANN? CNN is needed as it is an important and more accurate way for image classification problems. With Artificial Neural Networks, a 2D image would first be converted into a 1-dimensional vector before training the model.

What are the 4 different layers on CNN? ›

The different layers of a CNN. There are four types of layers for a convolutional neural network: the convolutional layer, the pooling layer, the ReLU correction layer and the fully-connected layer.

What are the 4 hidden layers of CNN? ›

Convolutional Neural Networks (CNN)

The hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully connected layers, and normalization layers.

What is the basics of convolution? ›

What is a convolution? Convolution is the process of multiplying each pixel with the corresponding pixel value of the filter and then adding all of the products to get the result. These combinations of result give the output image representation.

How does image convolution work? ›

Convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel. This is related to a form of mathematical convolution. The matrix operation being performed—convolution—is not traditional matrix multiplication, despite being similarly denoted by *.

How do convolutions improve image recognition? ›

The added computational load makes the network less accurate in this case. By killing a lot of these less significant connections, convolution solves this problem. In technical terms, convolutional neural networks make the image processing computationally manageable through filtering the connections by proximity.

What are the disadvantages of CNN? ›

A Convolutional neural network is significantly slower due to an operation such as maxpool. If the CNN has several layers then the training process takes a lot of time if the computer doesn't consist of a good GPU. A ConvNet requires a large Dataset to process and train the neural network.

What are the disadvantages of CNN algorithm? ›

Some of the disadvantages of CNNs: include the fact that a lot of training data is needed for the CNN to be effective and that they fail to encode the position and orientation of objects. They fail to encode the position and orientation of objects. They have a hard time classifying images with different positions.

How does CNN extract features from images? ›

CNN is a neural network that extracts input image features and another neural network classifies the image features. The input image is used by the. The extracted feature signals are utilized by the neural network for classification.

What is the most advanced image recognition? ›

Deep learning image recognition systems are now considered to be the most advanced and capable systems in terms of performance and flexibility. Recent breakthroughs in image recognition have been made possible by innovative combinations of deep learning and artificial intelligence (AI) hardware.

What are the steps of image recognition? ›

How image recognition works in four steps.
  • Step 1: Extraction of pixel features of an image.
  • Step 2: Preparation of labeled images to train the model.
  • Step 3: Training the model to recognize images.
  • Step 4: Recognition of new images.
Sep 21, 2022

What is an example of image recognition? ›

The most common example of image recognition can be seen in the facial recognition system of your mobile. Facial recognition in mobiles is not only used to identify your face for unlocking your device; today, it is also being used for marketing.

Which CNN algorithm is best for image classification? ›

Several datasets can be used to apply CNN effectively. The three most popular ones vital in image classification using CNN are MNIST, CIFAR-10, and ImageNet. Let's look at MNIST first.

Which algorithm is better than CNN? ›

R-FCN consists of shared, fully convolutional architectures as is the case of FCN that is known to yield a better result than the Faster R-CNN. In this algorithm, all learnable weight layers are convolutional and are designed to classify the ROIs into object categories and backgrounds.

What are the challenges of image recognition? ›

The main challenges in image classification are the large number of images, the high dimensionality of the data, and the lack of labeled data. Images can be very large, containing a large number of pixels. The data in each image may be high-dimensional, with many different features.

What is a real life example of convolution? ›

One of the real life applications of convolution is seismic signals for oil exploration. Here a perturbation is produced in the surface of the area to be analized. The signal travel underground producing reflexions at each layer. This reflexions are measured in the surface through a sensors network.

What is the most common convolutional neural network? ›

5 Most Well-Known CNN Architectures Visualized
  • Convolution Layer.
  • Pooling Layer.
  • Normalization Layer.
  • Fully Connected Layer.
  • Activation Function.
Aug 22, 2022

What are convolutional neural networks usually used with? ›

A Convolutional neural network (CNN) is a neural network that has one or more convolutional layers and are used mainly for image processing, classification, segmentation and also for other auto correlated data.

What is the most popular use of a convolutional net? ›

The major use of convolutional neural networks is image recognition and classification. It is also the only use case involving the most advanced frameworks (especially, in the case of medical imaging).

What are the disadvantages of CNN for image classification? ›

What are the advantages and disadvantages of Convolutional Neural Network (CNN)
Efficient image processingHigh computational requirements
High accuracy ratesDifficulty with small datasets
3 more rows

What are the key features of convolutional neural networks? ›

Key Points

Convolutional neural network is composed of multiple building blocks, such as convolution layers, pooling layers, and fully connected layers, and is designed to automatically and adaptively learn spatial hierarchies of features through a backpropagation algorithm.

What is the impact of convolutional neural networks? ›

Convolutional Neural Network has achieved significant results in pattern recognition, image analysis, and text classification. This study investigates the application of the CNN model on text classification problems by experimentation and analysis.

How is convolutional neural network different from regular? ›

The main difference between a CNN and an RNN is the ability to process temporal information — data that comes in sequences, such as a sentence. Recurrent neural networks are designed for this very purpose, while convolutional neural networks are incapable of effectively interpreting temporal information.

What is the mathematical formula for convolutional neural network? ›

It allows to determine the output size from a convolutional layer. In this case the output is of length 5 . In general the length of the output follows, Output size=nx+2P−nhS+1, Output size = n x + 2 P − n h S + 1 , where nx is the length of the input signal and nh is the length of the filter.

Is CNN an algorithm or architecture? ›

A convolutional neural network is made up of numerous layers, such as convolution layers, pooling layers, and fully connected layers, and it uses a backpropagation algorithm to learn spatial hierarchies of data automatically and adaptively.

What is the activation function in convolutional neural network? ›

The activation function is typically applied to the output of each neuron in the network. It takes in the weighted sum of the inputs and produces an output that is then passed on to the next layer. The most commonly used activation functions in CNNs are: Rectified linear unit (ReLU)

What do convolutional layers find first? ›

The first layer of a Convolutional Neural Network is always a Convolutional Layer. Convolutional layers apply a convolution operation to the input, passing the result to the next layer. A convolution converts all the pixels in its receptive field into a single value.

How many CNN algorithms are there? ›

The CNN architecture comprises three main layers: convolutional layers, pooling layers, and a fully connected (FC) layer. There can be multiple convolutional and pooling layers.

What is Softmax layer in CNN? ›

Softmax extends this idea into a multi-class world. That is, Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.

What are CNN bottleneck layers? ›

In a CNN (such as Google's Inception network), bottleneck layers are added to reduce the number of feature maps (aka channels) in the network, which, otherwise, tend to increase in each layer. This is achieved by using 1x1 convolutions with fewer output channels than input channels.

What is neural network in image recognition? ›

A neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain. It is a type of machine learning process, called deep learning, that uses interconnected nodes or neurons in a layered structure that resembles the human brain.

What is convolutional neural network for facial recognition? ›

In deep learning, a convolutional neural network (CNN) is a special type of neural network that is designed to process data through multiple layers of arrays. A CNN is well-suited for applications like image recognition and is often used in face recognition software.

Which CNN model is best for image classification? ›

VGG16 is a pre-trained CNN model which is used for image classification. It is trained on a large and varied dataset and fine-tuned to fit image classification datasets with ease.

Is CNN or ANN better for image classification? ›

ANN is ideal for solving problems regarding data. Forward-facing algorithms can easily be used to process image data, text data, and tabular data. CNN requires many more data inputs to achieve its novel high accuracy rate.

What is the best algorithm for image recognition? ›

Rectified Linear Units (ReLu) are seen as the best fit for image recognition tasks. The matrix size is decreased to help the machine learning model better extract features by using pooling layers.

What are the methods of image recognition? ›

Types of image recognition: Image recognition systems can be trained in one of three ways — supervised learning, unsupervised learning or self-supervised learning. Usually, the labeling of the training data is the main distinction between the three training approaches.

Why is convolutional neural network important? ›

Why do we need CNN over ANN? CNN is needed as it is an important and more accurate way for image classification problems. With Artificial Neural Networks, a 2D image would first be converted into a 1-dimensional vector before training the model.

What are convolutional neural networks good for? ›

Sometimes called ConvNets or CNNs, convolutional neural networks are a class of deep neural networks used in deep learning and machine learning. Convolutional neural networks are usually used for visual imagery, helping the computer identify and learn from images.

What is the most popular convolutional neural network? ›

LeNet is the most popular CNN architecture it is also the first CNN model which came in the year 1998. LeNet was originally developed to categorise handwritten digits from 0–9 of the MNIST Dataset. It is made up of seven layers, each with its own set of trainable parameters.

How accurate is CNN for image classification? ›

Cactus image classification using convolutional neural network (CNN) that reaches over 98% accuracy.

What is the best neural network for image processing? ›

1. Convolutional Neural Networks (CNNs) CNN's, also known as ConvNets, consist of multiple layers and are mainly used for image processing and object detection.

What are the disadvantages of using CNN for image classification? ›

However, CNNs also have some drawbacks that limit their performance and applicability. One of the main disadvantages of CNNs is that they require a large amount of labeled data to train effectively, which can be costly and time-consuming to obtain and annotate.

Why CNN is better than fully connected for image classification? ›

With CNN the differences you can notice in summary are Output shape and number of parameters. As compared to the fully connected neural network model the total number of parameters is too less i.e. 0.1 million.


Top Articles
Latest Posts
Article information

Author: Msgr. Refugio Daniel

Last Updated:

Views: 5944

Rating: 4.3 / 5 (54 voted)

Reviews: 93% of readers found this page helpful

Author information

Name: Msgr. Refugio Daniel

Birthday: 1999-09-15

Address: 8416 Beatty Center, Derekfort, VA 72092-0500

Phone: +6838967160603

Job: Mining Executive

Hobby: Woodworking, Knitting, Fishing, Coffee roasting, Kayaking, Horseback riding, Kite flying

Introduction: My name is Msgr. Refugio Daniel, I am a fine, precious, encouraging, calm, glamorous, vivacious, friendly person who loves writing and wants to share my knowledge and understanding with you.