Neural Network Learning
Neural network learning is a well-defined process, but it involves a wide variety of hyper-parameters that must be set correctly to learn from big data. More general information about the architectures and types of artificial neural networks can be found in our article on a Neural Network. This article offers more insight into the learning process, the complexity of its hyper-parameters, and some tricks used during learning.
Among the activation functions that provide the network's nonlinearity, the most common are the sigmoid and the tanh. In general, nonlinearities that are symmetric around the origin work well, because they produce zero-mean inputs to the next neural network layer. It has also been shown empirically that the tanh has better convergence properties. Weight initialization before learning starts is important to consider as well. One rule of thumb is to keep the initial weights small and centered around the origin. This ensures that the activation function operates in its near-linear regime, which is where its gradients are largest.
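A minimal sketch of these two ideas, assuming a single fully connected layer with hypothetical sizes and a common small-weight heuristic (uniform samples scaled by 1/sqrt(fan_in); the exact scheme is an assumption, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes for illustration.
n_in, n_hidden = 100, 50

# Small weights around the origin: one common heuristic scales
# uniform samples by 1/sqrt(fan_in) so pre-activations stay small.
W = rng.uniform(-1.0, 1.0, size=(n_in, n_hidden)) / np.sqrt(n_in)
b = np.zeros(n_hidden)

x = rng.standard_normal(n_in)
pre = x @ W + b

# tanh is symmetric around the origin, so its outputs are roughly
# zero-mean -- a good input distribution for the next layer.
h = np.tanh(pre)

# Small pre-activations keep tanh in its near-linear regime, where
# its derivative 1 - tanh(z)^2 is close to its maximum of 1.
grad = 1.0 - h ** 2
```

Because the weights are small, the pre-activations stay near zero, the hidden outputs are roughly zero-mean, and the tanh derivative remains large, which is exactly the regime the rule of thumb aims for.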
Another important hyper-parameter in learning neural networks is the number of hidden units. Experience shows that this parameter is highly dataset-specific: the more complicated the input distribution, the more capacity the network needs to model it, and this capacity is largely determined by the number of hidden units. The rule of thumb is therefore: the more complicated the dataset, the larger the number of hidden units. The number of hidden units also has a significant impact on the number of weights in a layer, which some literature treats as the direct measure of a network's capacity. Note that increasing the number of hidden units too far leads to over-fitting of the dataset, because the overwhelming number of weights lets the network memorize the training data. Such over-fitting harms generalization, a key concept in machine learning.
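To make the link between hidden units and weight count concrete, here is a small sketch for a hypothetical one-hidden-layer fully connected network (the layer sizes are illustrative assumptions):

```python
# Hypothetical fully connected network: input -> hidden -> output.
def num_weights(n_in, n_hidden, n_out):
    """Count the parameters (weights plus biases) of a one-hidden-layer net."""
    hidden_layer = n_in * n_hidden + n_hidden   # weights + biases into hidden layer
    output_layer = n_hidden * n_out + n_out     # weights + biases into output layer
    return hidden_layer + output_layer

# Doubling the hidden-layer width roughly doubles the parameter count,
# which is why the number of hidden units is a direct knob on capacity.
small = num_weights(784, 64, 10)   # 50890 parameters
large = num_weights(784, 128, 10)  # 101770 parameters
```

This is why adding hidden units quickly inflates the weight count, and with it the risk of over-fitting.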
Another key hyper-parameter is the learning rate, for which many recommend the simple solution of a constant rate. One rule of thumb here is to try out several log-spaced values such as 10^-1, 10^-2, and so on. During the broader grid search over the validation set, it is then important to narrow the search to the region where the lowest validation error was observed. In some application domains it is a good approach to decrease the learning rate over time. For the L1 and L2 regularization parameter lambda, often-used values are 10^-2, 10^-3, and so on. More details about rules of thumb related to hyper-parameter tuning can be found here.
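The coarse-then-fine search over log-spaced learning rates can be sketched as follows. The `validation_error` function here is a hypothetical stand-in (with an assumed minimum near 10^-2.2); in practice it would train the network with the given rate and return the error on a held-out validation set:

```python
import numpy as np

# Hypothetical stand-in for the real validation procedure; in practice
# this trains the model with learning rate lr and returns validation error.
def validation_error(lr):
    return (np.log10(lr) + 2.2) ** 2  # assumed minimum near lr = 10^-2.2

# Coarse sweep over log-spaced values: 10^-1, 10^-2, ..., 10^-5.
coarse = [10.0 ** -k for k in range(1, 6)]
best_coarse = min(coarse, key=validation_error)

# Narrow the search to the region around the best coarse value.
fine = np.logspace(np.log10(best_coarse) - 0.5,
                   np.log10(best_coarse) + 0.5, 11)
best_fine = min(fine, key=validation_error)
```

The same pattern applies to the regularization parameter lambda: sweep log-spaced values first, then refine around the best one.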
Neural Network Learning Details
Please have a look at the following video: