Difference between revisions of "Keras: Gradient Descent For Machine Learning"

From OnnoWiki

Revision as of 11:35, 7 September 2019

Source: https://machinelearningmastery.com/gradient-descent-for-machine-learning/


Optimization is a major part of machine learning. Almost every machine learning algorithm has an optimization algorithm at its core.

In this post you will discover a simple optimization algorithm that you can use with any machine learning algorithm. It is easy to understand and easy to implement. After reading this post you will know:

  • What is gradient descent?
  • How can gradient descent be used in algorithms like linear regression?
  • How can gradient descent be used on very large datasets?
  • What are some tips for getting the most out of gradient descent?

Gradient Descent

Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function.

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.

Intuition for Gradient Descent

Think of a large bowl, like the one you would eat noodles out of. This bowl is a plot of the cost function (f).

A random position on the surface of the bowl is the cost of the current values of the coefficients (cost).

The bottom of the bowl is the cost of the best set of coefficients, the minimum of the function.

The goal is to continue to try different values for the coefficients, evaluate their cost and select new coefficients that have a slightly better (lower) cost.

Repeating this process enough times will lead to the bottom of the bowl and you will know the values of the coefficients that result in the minimum cost.

Gradient Descent Procedure

The procedure starts off with initial values for the coefficient or coefficients for the function. These could be 0.0 or a small random value.

coefficient = 0.0

The cost of the coefficients is evaluated by plugging them into the function and calculating the cost.

cost = f(coefficient)

or

cost = evaluate(f(coefficient))

The derivative of the cost is calculated. The derivative is a concept from calculus and refers to the slope of the function at a given point. We need to know the slope so that we know the direction (sign) to move the coefficient values in order to get a lower cost on the next iteration.

delta = derivative(cost)

Now that we know from the derivative which direction is downhill, we can update the coefficient values. A learning rate parameter (alpha) must be specified that controls how much the coefficients can change on each update.

coefficient = coefficient - (alpha * delta)

This process is repeated until the cost of the coefficients (cost) is 0.0 or close enough to zero to be good enough.

You can see how simple gradient descent is. It does require you to know the gradient of your cost function or the function you are optimizing, but besides that, it’s very straightforward. Next we will see how we can use this in machine learning algorithms.
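The steps of the procedure can be sketched in a few lines of Python. This is a minimal illustration rather than code from the original post: the cost function f(c) = (c - 3)^2, its derivative, the learning rate of 0.1, the stopping tolerance and the iteration cap are all assumptions chosen for the example.

```python
def cost(coefficient):
    # Hypothetical cost function with its minimum (cost 0.0) at coefficient = 3.0.
    return (coefficient - 3.0) ** 2


def derivative(coefficient):
    # Slope of the hypothetical cost function at the given coefficient.
    return 2.0 * (coefficient - 3.0)


def gradient_descent(alpha=0.1, tolerance=1e-8, max_iterations=1000):
    coefficient = 0.0  # initial value, as in the procedure
    for _ in range(max_iterations):
        # Stop once the cost is close enough to zero to be good enough.
        if cost(coefficient) < tolerance:
            break
        delta = derivative(coefficient)
        # Move against the slope, scaled by the learning rate.
        coefficient = coefficient - (alpha * delta)
    return coefficient


best = gradient_descent()
print(best, cost(best))
```

Stopping at a cost near zero only works here because this particular cost function reaches 0.0 at its minimum. For a cost whose minimum is above zero, you would more likely stop when the cost stops improving.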

Batch Gradient Descent for Machine Learning

The goal of all supervised machine learning algorithms is to best estimate a target function (f) that maps input data (X) onto output variables (Y). This describes all classification and regression problems.

Some machine learning algorithms have coefficients that characterize the algorithm's estimate for the target function (f). Different algorithms have different representations and different coefficients, but many of them require a process of optimization to find the set of coefficients that result in the best estimate of the target function.

Common examples of algorithms with coefficients that can be optimized using gradient descent are Linear Regression and Logistic Regression.

How closely a machine learning model's estimate fits the target function can be calculated in a number of different ways, often specific to the machine learning algorithm. The cost function evaluates the coefficients in the model by calculating a prediction for each training instance in the dataset, comparing the predictions to the actual output values, and calculating a sum or average error (such as the Sum of Squared Residuals, or SSR, in the case of linear regression).

From the cost function a derivative can be calculated for each coefficient so that it can be updated using exactly the update equation described above.
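As a worked example of this step, consider simple linear regression. The notation here is an assumption made for illustration and does not appear in the original post: the prediction for instance i is b_0 + b_1 x_i, there are n training instances, and the cost J is the sum of squared residuals.

```latex
\begin{aligned}
J(b_0, b_1) &= \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2 \\
\frac{\partial J}{\partial b_0} &= -2 \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr) \\
\frac{\partial J}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i \bigl(y_i - (b_0 + b_1 x_i)\bigr) \\
b_j &\leftarrow b_j - \alpha \, \frac{\partial J}{\partial b_j}, \qquad j \in \{0, 1\}
\end{aligned}
```

Each coefficient gets its own derivative, but the update rule has the same form as the single-coefficient update: subtract the learning rate times the derivative.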

The cost is calculated for a machine learning algorithm over the entire training dataset for each iteration of the gradient descent algorithm. One iteration of the algorithm is called one batch and this form of gradient descent is referred to as batch gradient descent.

Batch gradient descent is the most common form of gradient descent described in machine learning.
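A batch update for simple linear regression might look like the sketch below. It is an illustration under stated assumptions, not the original author's code: the coefficient names `b0` and `b1`, the use of the mean squared error as the cost, the learning rate, the iteration count and the toy dataset are all invented for the example.

```python
def batch_gradient_descent(xs, ys, alpha=0.01, iterations=5000):
    b0, b1 = 0.0, 0.0  # initial coefficients
    n = len(xs)
    for _ in range(iterations):
        # Prediction error for every training instance (one batch).
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Derivatives of the mean squared error for each coefficient.
        delta_b0 = 2.0 * sum(errors) / n
        delta_b1 = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        # A single update after the whole dataset has been seen.
        b0 = b0 - alpha * delta_b0
        b1 = b1 - alpha * delta_b1
    return b0, b1


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = batch_gradient_descent(xs, ys)
print(round(b0, 3), round(b1, 3))
```

Averaging the error instead of summing it keeps the size of each step independent of the dataset size, which makes the learning rate easier to choose.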

Stochastic Gradient Descent for Machine Learning

Gradient descent can be slow to run on very large datasets.

Because one iteration of the gradient descent algorithm requires a prediction for each instance in the training dataset, it can take a long time when you have many millions of instances.

In situations when you have large amounts of data, you can use a variation of gradient descent called stochastic gradient descent.

In this variation, the gradient descent procedure described above is run but the update to the coefficients is performed for each training instance, rather than at the end of the batch of instances.

The first step of the procedure requires that the order of the training dataset is randomized. This mixes up the order in which updates are made to the coefficients. Because the coefficients are updated after every training instance, the updates will be noisy, jumping all over the place, and so will the corresponding cost function. Mixing up the order of the updates harnesses this random walk and helps keep it from getting distracted or stuck.

The update procedure for the coefficients is the same as that above, except the cost is not summed over all training patterns, but instead calculated for one training pattern.

The learning can be much faster with stochastic gradient descent for very large training datasets and often you only need a small number of passes through the dataset to reach a good or good enough set of coefficients, e.g. 1-to-10 passes through the dataset.
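The stochastic variation can be sketched as follows for simple linear regression. As before, this is an illustration under assumptions rather than the original author's code: the names `b0` and `b1`, the squared-error cost for a single instance, the learning rate, the fixed random seed and the synthetic dataset are all hypothetical. Reshuffling at the start of every pass is also a choice made here; the text above only requires that the order is randomized.

```python
import random


def stochastic_gradient_descent(xs, ys, alpha=0.1, passes=10, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    b0, b1 = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(passes):
        rng.shuffle(data)  # randomize the order of the training instances
        for x, y in data:
            # Error for this single training instance only.
            error = (b0 + b1 * x) - y
            # Update the coefficients immediately, before the next instance.
            b0 = b0 - alpha * 2.0 * error
            b1 = b1 - alpha * 2.0 * error * x
    return b0, b1


# Hypothetical noise-free data from y = 1 + 2x, with inputs already in [0, 1).
xs = [i / 1000.0 for i in range(1000)]
ys = [1.0 + 2.0 * x for x in xs]
b0, b1 = stochastic_gradient_descent(xs, ys)
print(round(b0, 3), round(b1, 3))
```

With 1000 instances and 10 passes the coefficients receive 10,000 updates, compared with 10 updates for the same number of passes in batch gradient descent.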


Tips for Gradient Descent

This section lists some tips and tricks for getting the most out of the gradient descent algorithm for machine learning.

  • Plot Cost versus Time: Collect and plot the cost values calculated by the algorithm each iteration. The expectation for a well-performing gradient descent run is a decrease in cost each iteration. If it does not decrease, try reducing your learning rate.
  • Learning Rate: The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Try different values for your problem and see which works best.
  • Rescale Inputs: The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted. You can achieve this by rescaling all of the input variables (X) to the same range, such as [0, 1] or [-1, 1].
  • Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients.
  • Plot Mean Cost: The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Taking the average over 10, 100, or 1000 updates can give you a better idea of the learning trend for the algorithm.
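Two of the tips above, rescaling inputs and averaging the cost, are simple to sketch in Python. The helper names `rescale` and `mean_cost` are hypothetical, and min-max scaling to [0, 1] is just one way to bring inputs into the same range.

```python
def rescale(values):
    # Min-max rescaling of one input variable to the range [0, 1].
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def mean_cost(costs, window=10):
    # Average the cost over consecutive groups of `window` updates.
    groups = [costs[i:i + window] for i in range(0, len(costs), window)]
    return [sum(group) / len(group) for group in groups]


print(rescale([10.0, 20.0, 30.0]))                 # [0.0, 0.5, 1.0]
print(mean_cost([4.0, 2.0, 3.0, 1.0], window=2))   # [3.0, 2.0]
```

Plotting the output of `mean_cost` instead of the raw per-update cost smooths out the noise from stochastic gradient descent, so the overall trend is easier to read.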

Summary

In this post you discovered gradient descent for machine learning. You learned that:

  • Optimization is a big part of machine learning.
  • Gradient descent is a simple optimization procedure that you can use with many machine learning algorithms.
  • Batch gradient descent refers to calculating the derivative from all training data before calculating an update.
  • Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating the update immediately.

