
    Classification of Neural Network Hyperparameters

    By network structure, learning and optimization, and regularization effect

    Photo by John-Mark Smith on Unsplash

    A major challenge when working with DL algorithms is setting and controlling hyperparameter values. This is technically called hyperparameter tuning or hyperparameter optimization.

    Hyperparameters control many aspects of DL algorithms.

    • They can decide the time and computational cost of running the algorithm.
• They can define the structure of the neural network model.
    • They affect the model’s prediction accuracy and generalization capability.

    In other words, hyperparameters control the behavior and structure of the neural network models. So, it is really important to learn more about what each hyperparameter does in a neural network with a proper classification (see the chart).

Before introducing and classifying neural network hyperparameters, I want to list the following important facts about hyperparameters. To learn the differences between parameters and hyperparameters in detail with examples, read my Parameters Vs Hyperparameters: What is the difference? article.

• You should not confuse hyperparameters with parameters. Both are variables that exist in ML and DL algorithms, but there are clear differences between them.
• Parameters are variables whose values are learned from the data and updated during training.
• In neural networks, weights and biases are parameters. They are optimized (updated) during backpropagation to minimize the cost function.
• Once the optimal values for the parameters are found, we stop the training process.
• Hyperparameters are variables whose values are set by the ML engineer (or another practitioner) before training the model. These values are not learned automatically from the data, so we need to adjust them manually to build better models.
• By changing the values of hyperparameters, we can build different types of models.
• Finding the optimal values for hyperparameters is a challenging task in ML and DL.
• The optimal values of hyperparameters also depend on the size and nature of the dataset and the problem we want to solve.

Hyperparameters in a neural network can be classified by considering the following criteria.

• Network structure
• Learning and optimization
• Regularization effect

Based on these criteria, neural network hyperparameters can be classified as follows.

    (Chart by author, made with draw.io)

Hyperparameters related to network structure

The hyperparameters classified under this criterion directly define the structure of the neural network.

    Number of hidden layers

    This is also called the depth of the network. The term “deep” in deep learning refers to the number of hidden layers (depth) of a neural network.

When designing a neural network such as an MLP, CNN or autoencoder (AE), the number of hidden layers determines the learning capacity of the network. To learn all the important non-linear patterns in the data, the network needs a sufficient number of hidden layers.

As the size and complexity of the dataset increase, the neural network needs more learning capacity. Therefore, large and complex datasets generally call for more hidden layers.

A very small number of hidden layers generates a smaller network that may underfit the training data. That type of network does not learn the complex patterns in the training data and therefore also performs poorly when predicting on unseen data.

Too many hidden layers generate a larger network that may overfit the training data. Such a network tends to memorize the training data instead of learning its underlying patterns, so it does not generalize well to new, unseen data.

    Overfitting is not as harmful as underfitting because overfitting can be reduced or eliminated with a proper regularization method.
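As a rough sketch (the layer widths and input dimension here are hypothetical), the depth of a Keras MLP is simply the number of hidden layers stacked into the model:

from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense

# A shallow MLP with a single hidden layer (may underfit complex data)
shallow_model = Sequential([
    Input(shape=(20,)),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid'),
])

# A deeper MLP with three hidden layers (more capacity, may overfit small datasets)
deep_model = Sequential([
    Input(shape=(20,)),
    Dense(64, activation='relu'),
    Dense(64, activation='relu'),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),
])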

    Number of nodes (neurons/units) in each layer

    This is also called the width of the network.

    The nodes in a hidden layer are often called hidden units.

    The number of hidden units is another factor that affects the learning capacity of the network.

Too many hidden units create a very large network that may overfit the training data, while too few hidden units create a smaller network that may underfit it.

    The number of nodes in an MLP input layer depends on the dimensionality (the number of features) in the input data.

    The number of nodes in an MLP output layer depends on the type of problem that we want to solve.

• Binary classification: One node in the output layer is used.
• Multilabel classification: If there are n mutually inclusive (non-exclusive) classes, n nodes are used in the output layer.
• Multiclass classification: If there are n mutually exclusive classes, n nodes are used in the output layer.
• Regression: One node in the output layer is used.

Beginners often ask how many hidden layers, or how many nodes per layer, a neural network should include.

    To answer this question, you can use the above facts and also the following two important points.

    • When the number of hidden layers and hidden units increases, the network becomes very large and the number of parameters significantly increases. To train such large networks, a lot of computational resources are needed. So, large neural networks are expensive in terms of computational resources.
• We can experiment with different network structures by adding or removing hidden layers and hidden units, and then compare the models by plotting the training error and test (validation) error against the number of epochs during training (as sketched below).
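A minimal sketch of that kind of diagnostic plot is shown below; it assumes a compiled Keras model and training arrays X_train and y_train already exist (hypothetical names).

import matplotlib.pyplot as plt

# Hold out 20% of the training data as a validation set and record the loss history
history = model.fit(X_train, y_train, validation_split=0.2, epochs=50, verbose=0)

# Plot training error and validation error against the number of epochs
plt.plot(history.history['loss'], label='training error')
plt.plot(history.history['val_loss'], label='validation error')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()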

    Type of activation function

    This is the last hyperparameter that defines the network structure.

    We use activation functions in the layers of neural networks. The input layer does not require any activation function. We must use an activation function in the hidden layers to introduce non-linearity to the network. The type of activation to be used in the output layer is decided by the type of problem that we want to solve.

    • Regression: Identity activation function with one node
    • Binary classification: Sigmoid activation function with one node
    • Multiclass classification: Softmax activation function with one node per class
    • Multilabel classification: Sigmoid activation function with one node per class
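As a sketch in Keras (n_classes is a hypothetical number of classes), the output layer for each problem type could be defined as follows.

from tensorflow.keras.layers import Dense

n_classes = 10  # hypothetical number of classes

regression_output = Dense(1)                                # identity (linear) activation
binary_output     = Dense(1, activation='sigmoid')          # binary classification
multiclass_output = Dense(n_classes, activation='softmax')  # mutually exclusive classes
multilabel_output = Dense(n_classes, activation='sigmoid')  # mutually inclusive classes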

    To learn different types of activation functions in detail with graphical representations, read my How to Choose the Right Activation Function for Neural Networks article.

    To read the usage guidelines of activation functions, click here.

    To learn the benefits of activation functions, read my 3 Amazing Benefits of Activation Functions in Neural Networks article.

To see what happens if you do not use any activation function in a neural network’s hidden layer(s), read this article.

Hyperparameters related to learning and optimization

The hyperparameters classified under this criterion directly control the training process of the network.

    Type of optimizer

    The optimizer is also called the optimization algorithm. The task of the optimizer is to minimize the loss function by updating the network parameters.

    Gradient descent is one of the most popular optimization algorithms. It has three variants.

    • Batch gradient descent
    • Stochastic gradient descent
    • Mini-batch gradient descent

    All these variants differ in the batch size (more about this shortly) that we use to compute the gradient of the loss function.

    Other types of optimizers that have been developed to deal with the shortcomings of the gradient descent algorithm are:

    • Gradient descent with momentum
    • Adam
    • Adagrad
    • Adadelta
    • Adamax
    • Nadam
    • Ftrl
    • RMSProp (Keras default)
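In Keras, the optimizer is set in model.compile(), either as a string identifier (default settings) or as an optimizer object. A small sketch, assuming a model has already been built:

from tensorflow import keras

# As a string identifier (uses the optimizer's default settings)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# As an object, which allows its hyperparameters to be adjusted
sgd_with_momentum = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=sgd_with_momentum, loss='categorical_crossentropy')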

    Learning rate-α

    This hyperparameter can be found in any optimization algorithm.

    During the optimization, the optimizer takes tiny steps to descend the error curve. The learning rate refers to the size of the step. It determines how fast or slow the optimizer descends the error curve. The direction of the step is determined by the gradient (derivative).

    This is one of the most important hyperparameters in neural network training.

A larger learning rate trains the network faster, but a value that is too large causes the loss to oscillate around the minimum and never converge. In that case, the model will never train properly!

A learning rate that is too small makes training extremely slow: convergence happens very slowly and the network needs many epochs (more about this shortly) to converge.

We should avoid both extremes. It is better to begin with a small learning rate such as 0.001 (the default value in most optimizers) and then systematically increase it if the network takes too long to converge.
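A minimal sketch of such a comparison, using hypothetical data dimensions and a simple regression MLP, is to train the same architecture with a few candidate learning rates and compare their convergence:

from tensorflow import keras
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense

for lr in [0.0001, 0.001, 0.01]:  # candidate learning rates to compare
    model = Sequential([Input(shape=(20,)), Dense(32, activation='relu'), Dense(1)])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss='mse')
    # model.fit(...) would be called here; pick the learning rate that converges
    # quickly without the loss oscillating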

    Type of loss function

    There should be a way to measure the performance of a neural network during the training. The loss function is used to compute the loss score (error) between the predicted values and ground truth (actual) values. Our goal is to minimize the loss function by using an optimizer. That’s what we achieve during the training.

    The type of loss function to be used during training depends on the type of problem that we have.

    • Mean Squared Error (MSE) — This is used to measure the performance of regression problems.
    • Mean Absolute Error (MAE) — This is used to measure the performance of regression problems.
    • Mean Absolute Percentage Error — This is used to measure the performance of regression problems.
    • Huber Loss — This is used to measure the performance of regression problems.
    • Binary Cross-entropy (Log Loss) — This is used to measure the performance of binary (two-class) classification problems.
    • Multi-class Cross-entropy/Categorical Cross-entropy — This is used to measure the performance of multi-class (more than two classes) classification problems.
• Sparse Categorical Cross-entropy — Used in multi-class classification problems when the labels are provided as integers, so they do not need to be one-hot encoded manually. Learn more about this here.
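In Keras, the loss is passed to model.compile() as a string identifier or a loss object. A short sketch (assuming a model already exists) matching each problem type to a loss:

# Regression
model.compile(optimizer='adam', loss='mse')    # Mean Squared Error (or 'mae' for MAE)

# Binary classification
model.compile(optimizer='adam', loss='binary_crossentropy')

# Multi-class classification with one-hot encoded labels
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Multi-class classification with integer labels (no manual one-hot encoding needed)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')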

    Type of model evaluation metric

Just as we use a loss function to measure the performance of a neural network during training, we use an evaluation metric to measure the performance of the model during testing.

For classification tasks, metrics such as accuracy, precision, recall and AUC are commonly used. For regression tasks, mean squared error and mean absolute error are commonly used.
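Evaluation metrics are passed to model.compile() through the metrics argument; a brief sketch (model is assumed to exist):

from tensorflow import keras

# Classification: track accuracy, precision, recall and AUC
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy', keras.metrics.Precision(),
                       keras.metrics.Recall(), keras.metrics.AUC()])

# Regression: track mean absolute error alongside the MSE loss
model.compile(optimizer='adam', loss='mse', metrics=['mae'])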

    Batch size

The batch size is another important hyperparameter that is found in the model.fit() method.

    Batch size refers to the number of training instances in the batch — Source: All You Need to Know about Batch Size, Epochs and Training Steps in a Neural Network

    In other words, it is the number of instances used per gradient update (iteration).

Typical values for batch size are 16, 32 (Keras default), 64, 128, 256, 512 and 1024.

    A larger batch size typically requires a lot of computational resources per epoch but requires fewer epochs to converge.

    A smaller batch size does not require a lot of computational resources per epoch but requires many epochs to converge.
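In Keras, the batch size is set through the batch_size argument of model.fit(); a minimal sketch with hypothetical training arrays X_train and y_train:

# 32 is the Keras default; larger datasets often use 64, 128 or more
model.fit(X_train, y_train, batch_size=32, epochs=10)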

    Epochs

Epochs is another important hyperparameter that is found in the model.fit() method.

    Epochs refer to the number of times the model sees the entire dataset — Source: All You Need to Know about Batch Size, Epochs and Training Steps in a Neural Network

    The number of epochs should be increased when,

    • The network is trained with a very small learning rate.
    • The batch size is too small.

Sometimes, the network tends to overfit the training data when trained for a large number of epochs. That is, at some point the validation error begins to increase while the training error keeps decreasing. When that happens, the model performs well on the training data but generalizes poorly to new, unseen data. At that point, we should stop the training process. This is called early stopping.

    Early stopping process (Image by author, made with matplotlib and draw.io)
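In Keras, early stopping is implemented with the EarlyStopping callback. A sketch, assuming a compiled model and hypothetical training arrays X_train and y_train:

from tensorflow import keras

# Stop training once the validation loss has not improved for 5 consecutive epochs
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                           restore_best_weights=True)

# A large epoch budget is safe here because training halts automatically
model.fit(X_train, y_train, validation_split=0.2, epochs=500, callbacks=[early_stop])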

    Training steps (iterations) per epoch

    A training step (iteration) is one gradient update — Source: All You Need to Know about Batch Size, Epochs and Training Steps in a Neural Network

    We do not need to set a value for this hyperparameter as the algorithm automatically calculates it as follows.

Steps per epoch = ceil(size of the entire dataset / batch size)

Any fractional part is rounded up to the next whole number. For example, if the division gives 18.75, the number of steps per epoch is 19.

In the Keras model.fit() method, this hyperparameter is specified by the steps_per_epoch argument. Its default is None, which means Keras automatically uses the value calculated with the above equation.

If we specify a value for this argument, it overrides the automatically calculated value.
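A quick sketch of the calculation with hypothetical numbers:

import math

dataset_size = 1500   # hypothetical number of training instances
batch_size = 80

steps_per_epoch = math.ceil(dataset_size / batch_size)   # 1500 / 80 = 18.75 -> 19 steps

# Passing it explicitly to model.fit() overrides the automatically computed value:
# model.fit(X_train, y_train, batch_size=batch_size, steps_per_epoch=steps_per_epoch)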

    Note: To learn more about the connection between batch size, epochs and training steps with examples, read my All You Need to Know about Batch Size, Epochs and Training Steps in a Neural Network article.

Hyperparameters related to regularization effect

The hyperparameters classified under this criterion directly control overfitting in neural networks.

    I’m not going to discuss each hyperparameter in detail here as I’ve previously done in my other articles. The links to previously published articles will be included.

    Lambda-λ in L1 and L2 regularization

    The λ is the regularization parameter (factor) that controls the level of L1 and L2 regularization. The special values of λ are:

    • lambda=0: No regularization is applied
    • lambda=1: Full regularization is applied
    • lambda=0.01: Keras default

    The λ can take any value between 0 and 1 (both inclusive).
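In Keras, λ is the factor passed to the regularizer attached to a layer; a minimal sketch:

from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense

# L2 regularization with lambda = 0.01 (the Keras default factor)
dense_l2 = Dense(100, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01))

# L1 regularization works the same way
dense_l1 = Dense(100, activation='relu',
                 kernel_regularizer=regularizers.l1(0.01))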

To learn this hyperparameter along with the L1 and L2 regularization techniques in detail, read my How to Apply L1 and L2 Regularization Techniques to Keras Models article.

    Dropout rate in dropout regularization

    This hyperparameter defines the dropout probability (the fraction of nodes to be removed from the network) in dropout regularization. Two special values are:

    • rate=0: No dropout regularization is applied
    • rate=1: Removes all nodes from the network (not practical)

    The dropout rate can take any value between 0 and 1.
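A minimal sketch of dropout in a Keras model (hypothetical layer sizes and input dimension):

from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Input(shape=(20,)),
    Dense(128, activation='relu'),
    Dropout(0.2),   # randomly drops 20% of the hidden units at each update
    Dense(1, activation='sigmoid'),
])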

    To learn more about this hyperparameter along with dropout regularization, read my How Dropout Regularization Mitigates Overfitting in Neural Networks article.

The term hyperparameter is a very important concept in ML and DL. For a given task in DL, the type of neural network architecture is also a hyperparameter. For example, we can use an MLP or a CNN architecture to classify the MNIST handwritten digits. Here, choosing between an MLP and a CNN is itself a hyperparameter choice!

    For a given neural network architecture, the above hyperparameters exist. Note that the regularization hyperparameters are optional.

In Keras, some hyperparameters can be specified either as separate layers or as string identifiers passed to a relevant argument of a function.

    For example, the ReLU activation function can be added to the layer in one of the following ways.

from tensorflow.keras.layers import Dense

# As a string identifier
model.add(Dense(100, activation='relu'))
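It can also be added as a separate layer object; a short sketch, assuming model is an existing Sequential model:

# As a layer object
from tensorflow.keras.layers import Dense, ReLU

model.add(Dense(100))
model.add(ReLU())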
