Keras: Introduction to the Adam Optimization Algorithm

Source: https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/


The choice of optimization algorithm for your deep learning model can mean the difference between good results in minutes, hours, or days.

The Adam optimization algorithm is an extension of stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing.

In this post, you will get a gentle introduction to the Adam optimization algorithm for use in deep learning.

After reading this post, you will know:

  • What the Adam algorithm is and some benefits of using the method to optimize your models.
  • How the Adam algorithm works and how it differs from the related methods AdaGrad and RMSProp.
  • How the Adam algorithm can be configured, and the commonly used configuration parameters.

==What is the Adam Optimization Algorithm?==

Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on the training data.

Adam was first presented by Diederik Kingma from OpenAI and Jimmy Ba from the University of Toronto in their 2015 ICLR paper titled “Adam: A Method for Stochastic Optimization”.

The algorithm is called Adam. It is not an acronym and is not written as “ADAM”.

   … the name Adam is derived from adaptive moment estimation.

When introducing the algorithm, the authors list the attractive benefits of using Adam on non-convex optimization problems, as follows:

  • Straightforward to implement.
  • Computationally efficient.
  • Little memory requirements.
  • Invariant to diagonal rescale of the gradients.
  • Well suited for problems that are large in terms of data and/or parameters.
  • Appropriate for non-stationary objectives.
  • Appropriate for problems with very noisy and/or sparse gradients.
  • Hyper-parameters have intuitive interpretation and typically require little tuning.

==How Does Adam Work?==

Adam is different to classical stochastic gradient descent.

Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates and the learning rate does not change during training.

Adam, in contrast, maintains a learning rate for each network weight (parameter) and separately adapts it as learning unfolds.

   The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

The authors describe Adam as combining the advantages of two other extensions of stochastic gradient descent. Specifically:

  • Adaptive Gradient Algorithm (AdaGrad) that maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).
  • Root Mean Square Propagation (RMSProp) that also maintains per-parameter learning rates that are adapted based on the average of recent magnitudes of the gradients for the weight (e.g. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy problems).

Adam realizes the benefits of both AdaGrad and RMSProp.

Whereas RMSProp adapts the per-parameter learning rates based on an average of recent squared gradients (the uncentered second moment), Adam additionally makes use of an exponential moving average of the gradients themselves (the first moment, i.e. the mean).

Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.

Because the moving averages are initialized at zero, and because beta1 and beta2 are set close to 1.0 (as recommended), the moment estimates are biased towards zero early in training. This bias is overcome by first calculating the biased estimates and then computing bias-corrected estimates.
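
For readers who prefer code, below is a minimal NumPy sketch of the update described above. It is an illustration of the technique, not the authors' reference implementation; the names m, v, m_hat, v_hat, alpha, beta1, beta2, and epsilon mirror the paper's notation.

<pre>
import numpy as np

def adam_update(params, grads, m, v, t, alpha=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Perform one Adam step for a parameter vector.

    m, v : running first and second moment estimates (start at zeros)
    t    : timestep, counted from 1
    """
    m = beta1 * m + (1.0 - beta1) * grads        # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * grads ** 2   # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)               # bias-corrected second moment
    params = params - alpha * m_hat / (np.sqrt(v_hat) + epsilon)
    return params, m, v
</pre>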

The paper is quite readable and I would encourage you to read it if you are interested in the specific implementation details.

==Adam is Effective==

Adam is a popular algorithm in the field of deep learning because it achieves good results fast.

   Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods.

In the original paper, Adam was demonstrated empirically to show that convergence meets the expectations of the theoretical analysis. Adam was applied to the logistic regression algorithm on the MNIST digit recognition and IMDB sentiment analysis datasets, a Multilayer Perceptron algorithm on the MNIST dataset and Convolutional Neural Networks on the CIFAR-10 image recognition dataset. They conclude:

   Using large models and datasets, we demonstrate Adam can efficiently solve practical deep learning problems.

Figure: Comparison of Adam to other optimization algorithms training a Multilayer Perceptron. Taken from Adam: A Method for Stochastic Optimization, 2015.

Sebastian Ruder developed a comprehensive review of modern gradient descent optimization algorithms titled “An overview of gradient descent optimization algorithms” published first as a blog post, then a technical report in 2016.

The paper is basically a tour of modern methods. In his section titled “Which optimizer to use?“, he recommends using Adam.

   Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. […] its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.

In the Stanford course on deep learning for computer vision titled “CS231n: Convolutional Neural Networks for Visual Recognition” developed by Andrej Karpathy, et al., the Adam algorithm is again suggested as the default optimization method for deep learning applications.

   In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative.

And later stated more plainly:

   The two recommended updates to use are either SGD+Nesterov Momentum or Adam.

Adam is being adopted for benchmarks in deep learning papers.

For example, it was used in the paper “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” on attention in image captioning and “DRAW: A Recurrent Neural Network For Image Generation” on image generation.

Do you know of any other examples of Adam? Let me know in the comments.

==Adam Configuration Parameters==

  • alpha. Also referred to as the learning rate or step size: the proportion by which weights are updated (e.g. 0.001). Larger values (e.g. 0.3) result in faster initial learning before the rate is updated. Smaller values (e.g. 1.0E-5) slow learning right down during training.
  • beta1. The exponential decay rate for the first moment estimates (e.g. 0.9).
  • beta2. The exponential decay rate for the second-moment estimates (e.g. 0.999). This value should be set close to 1.0 on problems with a sparse gradient (e.g. NLP and computer vision problems).
  • epsilon. A very small number to prevent any division by zero in the implementation (e.g. 10E-8).
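
As a concrete illustration, here is a minimal sketch of setting these parameters in Keras. The argument names follow the older Keras signature listed further below (lr, beta_1, beta_2, epsilon, decay); newer Keras releases rename lr to learning_rate. The two-layer model is only a placeholder for your own network.

<pre>
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Placeholder model: 20 inputs, binary output.
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=20))
model.add(Dense(1, activation='sigmoid'))

# Adam configured with the parameters described above (the Keras defaults).
opt = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
</pre>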

Further, learning rate decay can also be used with Adam. The paper uses a decay rate of alpha = alpha/sqrt(t), updated each epoch (t), for the logistic regression demonstration.
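
As a small illustrative sketch (not the paper's code), this 1/sqrt(t) schedule could be applied per epoch as follows; alpha0 is an assumed base learning rate, and the commented-out lines show how such a schedule is typically plugged into Keras via the LearningRateScheduler callback.

<pre>
import math

alpha0 = 0.001  # assumed base learning rate

def decayed_alpha(epoch):
    """Return alpha0 / sqrt(t), with epochs counted from 1."""
    t = epoch + 1  # Keras passes zero-based epoch indices to schedulers
    return alpha0 / math.sqrt(t)

# from keras.callbacks import LearningRateScheduler
# model.fit(X, y, epochs=10, callbacks=[LearningRateScheduler(decayed_alpha)])
</pre>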

The Adam paper suggests:

   Good default settings for the tested machine learning problems are alpha=0.001, beta1=0.9, beta2=0.999 and epsilon=10^-8

The TensorFlow documentation suggests some tuning of epsilon:

   The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.

We can see that the popular deep learning libraries generally use the default parameters recommended by the paper.

  • TensorFlow: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
  • Keras: lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0
  • Blocks: learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-08, decay_factor=1
  • Lasagne: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
  • Caffe: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
  • MxNet: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8
  • Torch: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8

Do you know of any other standard configurations for Adam? Let me know in the comments.

==Further Reading==

This section lists resources to learn more about the Adam optimization algorithm.

  • Adam: A Method for Stochastic Optimization, 2015
  • Stochastic gradient descent on Wikipedia
  • An overview of gradient descent optimization algorithms, 2016
  • ADAM: A Method for Stochastic Optimization (a review)
  • Optimization for Deep Networks (slides)
  • Adam: A Method for Stochastic Optimization (slides)

Do you know of any other good resources on Adam? Let me know in the comments.

==Summary==

In this post, you discovered the Adam optimization algorithm for deep learning.

Specifically, you learned:

  • Adam is a replacement optimization algorithm for stochastic gradient descent for training deep learning models.
  • Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems.
  • Adam is relatively easy to configure, and the default configuration parameters do well on most problems.






==References==


==Interesting Links==