THE AMSGrad ALGORITHM AND OPTIMIZATION CHAOS IN MULTILAYER NEURAL NETWORKS WITH STOCHASTIC GRADIENT DESCENT

Serhiy Sveleba, I. Katerynchuk, I. Kuno, O. Semotyuk, Ya. Shmygelsky, S. Velgosh, A. Kopach, V. Kuno

Abstract


In this paper, the AMSGrad stochastic optimization method was tested using the logistic map, which describes the period-doubling process, and the Fourier spectra of the error function.
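Below is a minimal sketch (not the authors' code) of this kind of test: iterating the logistic map x_{n+1} = r·x_n·(1 − x_n), whose period-doubling route to chaos serves as the reference behaviour, and computing the Fourier spectrum of the resulting sequence. The values of r, the transient length, and the sequence length are assumptions chosen for illustration; the same diagnostics can be applied to the recorded error-function values.

    import numpy as np

    def logistic_trajectory(r, x0=0.5, n_transient=500, n_keep=1024):
        """Iterate x_{n+1} = r*x*(1 - x) and return the settled part of the orbit."""
        x = x0
        for _ in range(n_transient):      # discard the transient
            x = r * x * (1.0 - x)
        traj = np.empty(n_keep)
        for i in range(n_keep):
            x = r * x * (1.0 - x)
            traj[i] = x
        return traj

    def fourier_spectrum(signal):
        """Amplitude spectrum of a mean-removed sequence."""
        centered = signal - signal.mean()
        return np.abs(np.fft.rfft(centered))

    # r = 3.2 gives a period-2 orbit (a single dominant spectral line),
    # r = 3.9 gives chaos (a broadband spectrum)
    for r in (3.2, 3.9):
        spectrum = fourier_spectrum(logistic_trajectory(r))
        print(r, spectrum.argmax(), round(float(spectrum.max()), 3))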

The gradient descent optimization algorithm with AMSGrad was implemented for a multilayer neural network with hidden layers. A program for recognizing printed digits was written in the Python software environment. Each digit was encoded as a 4x7 array of "0"s and "1"s. The sample for each digit contained a set of 5 possible distortions, plus a set of 3 arrays that did not correspond to any digit. The influence of the hyperparameters beta1 and beta2 and of the learning rate on the optimization process was analyzed for a multilayer neural network with 3 hidden layers of 28 neurons each, and branching diagrams were constructed as functions of these parameters.
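The AMSGrad update itself can be sketched as follows (a minimal illustration following the standard formulation of Reddi et al., ICLR 2018, not the authors' implementation); the names beta1, beta2, alpha, and eps correspond to the hyperparameters discussed above, while the 28x28 shapes in the usage example are only placeholders.

    import numpy as np

    def amsgrad_step(w, grad, m, v, v_hat, alpha=0.001, beta1=0.9,
                     beta2=0.999, eps=1e-8):
        """One AMSGrad update of the weights w given the error-function gradient."""
        m = beta1 * m + (1.0 - beta1) * grad        # first moment: linear gradient term (beta1)
        v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment: squared gradient term (beta2)
        v_hat = np.maximum(v_hat, v)                # non-decreasing second-moment estimate (AMSGrad)
        w = w - alpha * m / (np.sqrt(v_hat) + eps)  # parameter update with learning rate alpha
        return w, m, v, v_hat

    # usage: keep m, v, v_hat per weight matrix, initialised to zeros,
    # and call amsgrad_step after back-propagating the error for each example
    w = np.zeros((28, 28))
    m, v, v_hat = np.zeros_like(w), np.zeros_like(w), np.zeros_like(w)
    grad = np.random.randn(28, 28)                  # stand-in for a real gradient
    w, m, v, v_hat = amsgrad_step(w, grad, m, v, v_hat)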

We found that the hyperparameter beta1, which sets the contribution of the linear gradient of the error function (the first moment), is associated with a doubling of the number of local and global minima of the error function in the process of retraining the neural network. The hyperparameter beta2, which sets the contribution of the squared gradient of the error function (the second moment), is associated with the formation of a block structure that suppresses the doubling of the number of local minima.

When the learning rate alpha exceeds the retraining rate, a transition to a chaotic state occurs; it is accompanied both by repeated passages through the global minimum and, apparently, by the appearance of local minima.

At such a learning rate the optimizer is practically ineffective, but when the hyperparameter beta1 (i.e., the linear gradient term) is present, the overall picture of the transition to chaos is still described by the process of doubling the number of local minima.
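A branching diagram of the kind mentioned above can, in principle, be assembled as in the following sketch: for each learning rate, train for a fixed number of epochs and keep the late error-function values, so that a single settled value indicates convergence, a doubled set of values indicates the period-doubling regime, and a scattered set indicates chaos. The routine toy_training used here is a hypothetical stand-in driven by the logistic map; the real input would be the error history of the network described above.

    import numpy as np

    def branching_diagram(alphas, train_and_record, n_keep=64):
        """Return {alpha: last n_keep recorded error values} for each learning rate."""
        diagram = {}
        for alpha in alphas:
            errors = train_and_record(alpha)          # one error value per epoch
            diagram[alpha] = np.asarray(errors[-n_keep:])
        return diagram

    def toy_training(alpha, epochs=600):
        """Hypothetical stand-in for a training run; only shows the data flow."""
        x, history = 0.5, []
        for _ in range(epochs):
            x = alpha * x * (1.0 - x)
            history.append(x)
        return history

    diagram = branching_diagram(np.linspace(2.8, 4.0, 25), toy_training)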

Applying the AMSGrad stochastic optimization method to the training of a multilayer neural network is shown to yield better learning than a conventional multilayer neural network, even at the optimal learning rate (the learning rate at which the number of existing local and global minima doubles).

Keywords: optimization methods, error function, AMSGrad, learning rate, branching diagrams.


References


  1. Reddi S. J., Kale S., Kumar S. On the Convergence of Adam and Beyond // Published as a conference paper at ICLR 2018. – 2019. – P. 1–23. https://doi.org/10.48550/arXiv.1904.09237
  2. Brownlee J. Optimization for Machine Learning: Finding Function Optima with Python. – The MIT Press, 2021. – 403 p.
  3. Kingma D. P., Ba J. L. Adam: A Method for Stochastic Optimization // Published as a conference paper at ICLR 2015. – 2015. – P. 1–15. https://doi.org/10.48550/arXiv.1412.6980
  4. Taranenko Yu. Information entropy of chaos. URL: https://habr.com/ru/post/447874/
  5. Sveleba S., Katerynchuk I., Kuno I., Karpa I., Semotyuk O., Shmyhelskyy Ya., Sveleba N., Kuno V. Chaotic states of a multilayer neural network // Electronics and Information Technologies. – 2021. – Issue 13. – P. 96–107.
  6. Sveleba S., Katerynchuk I., Kuno I., Semotiuk O., Shmyhelskyy Ya., Sveleba N. Specifics of the learning error dependence of multilayered neural networks from the activation function during the process of printed digits identification // Electronics and Information Technologies. – 2022. – Issue 17. – P. 36–53.




DOI: http://dx.doi.org/10.30970/eli.21.7
