
The VGGNets are a series of convolutional neural networks (CNNs) developed by the Visual Geometry Group (VGG) at the University of Oxford.
The VGG family includes various configurations with different depths, denoted by the letter "VGG" followed by the number of weight layers. The most common ones are VGG-16 (13 convolutional layers + 3 fully connected layers, 138M parameters) and VGG-19 (16 + 3, 144M parameters).
The VGG family was widely applied in various computer vision areas. An ensemble model of VGGNets achieved state-of-the-art results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014.
It was used as a baseline comparison in the ResNet paper for image classification, as the network in the Fast Region-based CNN for object detection, and as a base network in neural style transfer.
The series was historically important as an early influential model designed by composing generic modules, whereas AlexNet (2012) was designed "from scratch". It was also instrumental in changing the standard convolutional kernels in CNNs from large (up to 11-by-11 in AlexNet) to just 3-by-3, a decision that was only revised in ConvNeXt (2022).
VGGNets were mostly obsoleted by Inception, ResNet, and DenseNet. RepVGG (2021) is an updated version of the architecture.
Architecture

The key architectural principle of VGG models is the consistent use of small 3×3 convolutional filters throughout the network. This contrasts with earlier CNN architectures that employed larger filters, such as the 11×11 filters in AlexNet.
For example, two 3×3 convolutions stacked together have the same receptive field as a single 5×5 convolution, but the latter uses 25C² parameters, while the former uses 18C² parameters (where C is the number of channels). The original publication showed that deep and narrow CNNs significantly outperform their shallow and wide counterparts.
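This parameter arithmetic is easy to verify. The sketch below (function names are my own, biases omitted) counts the weights of two stacked 3×3 convolutions versus a single 5×5 convolution at the same channel width, and checks the receptive-field equivalence:

```python
def conv_params(kernel, in_ch, out_ch):
    # weights of one convolutional layer: kernel * kernel * in_ch * out_ch
    return kernel * kernel * in_ch * out_ch

C = 64  # example channel count
two_3x3 = 2 * conv_params(3, C, C)  # two stacked 3x3 convolutions: 18 * C^2
one_5x5 = conv_params(5, C, C)      # one 5x5 convolution: 25 * C^2
print(two_3x3, one_5x5)             # the stacked pair is cheaper

def receptive_field(n_layers, kernel=3):
    # each stride-1 conv layer grows the receptive field by (kernel - 1)
    rf = 1
    for _ in range(n_layers):
        rf += kernel - 1
    return rf

print(receptive_field(2))  # 5: matches a single 5x5 convolution
```

The stacked 3×3 layers are not only cheaper but also interleave an extra ReLU nonlinearity, which the original paper cites as a further advantage.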
The VGG series of models are deep neural networks composed of generic modules:
# Convolutional modules: 3×3 convolutional layers with stride 1 and padding 1, followed by ReLU activations.
# Max-pooling layers: After some convolutional modules, max-pooling layers with a 2×2 filter and a stride of 2 downsample the feature maps. This halves both width and height but keeps the number of channels.
# Fully connected layers: Three fully connected layers at the end of the network, with sizes 4096-4096-1000. The last one has 1000 channels corresponding to the 1000 classes in ImageNet.
# Softmax layer: A softmax layer outputs the probability distribution over the classes.
The VGG family includes various configurations with different depths, denoted by the letter "VGG" followed by the number of weight layers. The most common ones are VGG-16 (13 convolutional layers + 3 fully connected layers) and VGG-19 (16 + 3), denoted as configurations ''D'' and ''E'' in the original paper.
As an example, the 16 convolutional layers of VGG-19 are structured as follows:
3 → 64 → 64 → pool → 128 → 128 → pool → 256 → 256 → 256 → 256 → pool → 512 → 512 → 512 → 512 → pool → 512 → 512 → 512 → 512 → pool
where the arrow c → c′ means a 3×3 convolution with c input channels and c′ output channels and stride 1, followed by ReLU activation, and "pool" means a down-sampling layer by 2×2 maxpooling with stride 2.
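The downsampling pattern can be traced numerically. Below is a minimal sketch (names are illustrative) that follows the shape of a 224×224 RGB input through the five VGG-19 stages, assuming 3×3 convolutions with padding 1 (which preserve height and width) and 2×2 max-pooling with stride 2:

```python
# (number of 3x3 convolutions, output channels) per VGG-19 stage
VGG19_STAGES = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]

def trace_shapes(c=3, h=224, w=224):
    """Return the (channels, height, width) shape after each stage."""
    shapes = [(c, h, w)]
    for n_convs, out_ch in VGG19_STAGES:
        c = out_ch              # 3x3 convs (stride 1, padding 1) keep H and W
        h, w = h // 2, w // 2   # 2x2 max-pool with stride 2 halves H and W
        shapes.append((c, h, w))
    return shapes

print(trace_shapes())  # ends at (512, 7, 7), the input to the first FC layer
```

The final 512×7×7 feature map is flattened into a 25088-dimensional vector and fed to the 4096-4096-1000 fully connected layers.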
{| class="wikitable"
|+Table of VGG models
!Name
!Number of convolutional layers
!Number of fully connected layers
!Parameter count
|-
|VGG-16
|13
|3
|138M
|-
|VGG-19
|16
|3
|144M
|}
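The parameter counts in the table can be reproduced from the layer structure. The following sketch (a hypothetical helper, not code from the paper) counts weights and biases for the VGG-16 and VGG-19 configurations:

```python
def vgg_param_count(stages, num_classes=1000, fc_sizes=(4096, 4096)):
    """Count parameters (weights + biases) of a VGG-style network.

    stages: list of (number of 3x3 convolutions, output channels) per stage,
    each stage followed by a 2x2 max-pool; input is a 224x224 RGB image.
    """
    params, in_ch = 0, 3
    h = w = 224
    for n_convs, out_ch in stages:
        for _ in range(n_convs):
            params += 3 * 3 * in_ch * out_ch + out_ch  # conv weights + biases
            in_ch = out_ch
        h, w = h // 2, w // 2                          # max-pool halves H, W
    feat = in_ch * h * w                               # flattened: 512*7*7 = 25088
    for width in fc_sizes:                             # fully connected layers
        params += feat * width + width
        feat = width
    params += feat * num_classes + num_classes         # final classifier layer
    return params

vgg16 = vgg_param_count([(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)])
vgg19 = vgg_param_count([(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)])
print(vgg16, vgg19)  # 138357544 and 143667240, i.e. about 138M and 144M
```

Note that the three fully connected layers account for roughly 124M of the 138M parameters of VGG-16, which is one reason later architectures replaced them with global average pooling.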
Training
The original VGG models were implemented in a version of C++ Caffe, modified for multi-GPU training and evaluation with data parallelism. On a system equipped with 4 NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.
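Data parallelism of this kind splits each mini-batch across GPUs, computes a gradient on each shard, and averages the gradients before updating the shared weights. A minimal single-process sketch of the scheme on a toy one-parameter model (all names illustrative; plain Python loops stand in for concurrent GPU execution):

```python
def grad_mse_linear(w, xs, ys):
    # gradient of mean squared error for the model y ≈ w * x on one shard
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_step(w, batch_x, batch_y, num_devices=4, lr=0.01):
    shard = len(batch_x) // num_devices
    grads = []
    for d in range(num_devices):  # in real training these run concurrently
        xs = batch_x[d * shard:(d + 1) * shard]
        ys = batch_y[d * shard:(d + 1) * shard]
        grads.append(grad_mse_linear(w, xs, ys))
    g = sum(grads) / num_devices  # "all-reduce": average gradients across devices
    return w - lr * g             # every device applies the same update

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2 * x for x in xs]          # data generated with true weight 2
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, xs, ys)
print(w)  # converges toward 2.0
```

With equal-sized shards, the averaged gradient is identical to the full-batch gradient, so synchronous data parallelism changes the wall-clock time but not the optimization trajectory.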