Difference between revisions of "Keras: Loss and Loss Function"
| Onnowpurbo (talk | contribs) | Onnowpurbo (talk | contribs)  | ||
Revision as of 15:39, 30 August 2019
Neural networks are trained using stochastic gradient descent, which requires that you choose a loss function when designing and configuring your model.
There are many loss functions to choose from, and it can be challenging to know which one to pick, or even what a loss function is and the role it plays when training a neural network.
In this post, you will discover the role of loss and loss functions in training deep learning neural networks and how to choose the right loss function for your predictive modeling problems.
After reading this post, you will know:
- Neural networks are trained using an optimization process that requires a loss function to calculate the model error.
- Maximum Likelihood provides a framework for choosing a loss function when training neural networks and machine learning models in general.
- Cross-entropy and mean squared error are the two main types of loss functions to use when training neural network models.
Overview
This tutorial is divided into seven parts:
- Neural Network Learning as Optimization
- What Is a Loss Function and Loss?
- Maximum Likelihood
- Maximum Likelihood and Cross-Entropy
- What Loss Function to Use?
- How to Implement Loss Functions
- Loss Functions and Reported Model Performance
Here we will focus on the theory behind loss functions.
Neural Network Learning as Optimization
A deep learning neural network learns to map a set of inputs to a set of outputs from training data.
We cannot calculate the perfect weights for a neural network; there are too many unknowns. Instead, the problem of learning is cast as a search or optimization problem, and an algorithm is used to navigate the space of possible sets of weights the model may use in order to make good or good enough predictions.
Typically, a neural network model is trained using the stochastic gradient descent optimization algorithm, and the weights are updated using the backpropagation of error algorithm.
The “gradient” in gradient descent refers to an error gradient. The model with a given set of weights is used to make predictions, and the error for those predictions is calculated.
The gradient descent algorithm seeks to change the weights so that the next evaluation reduces the error, meaning the optimization algorithm is navigating down the gradient (or slope) of the error.
Now that we know that training a neural network is essentially solving an optimization problem, we can look at how the error of a given set of weights is calculated.
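To make the idea of following the error gradient concrete, here is a minimal sketch, assuming a one-weight linear model y = w * x and a mean squared error. The data, the learning rate, and the function names are illustrative assumptions, and for brevity it uses the full dataset at every step rather than the stochastic (mini-batch) variant described above.

```python
# Minimal sketch of gradient descent on a one-weight model y = w * x.
# All names and values here are illustrative, not taken from a library.

def mse(w, xs, ys):
    # Mean squared error of the predictions w * x against the targets y.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    # d/dw of mean((w*x - y)^2) is mean(2 * x * (w*x - y)).
    return sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, w=0.0, learning_rate=0.05, steps=100):
    history = []
    for _ in range(steps):
        history.append(mse(w, xs, ys))
        # Step against the gradient so the next evaluation has lower error.
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w, history

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by the weight w = 2
w, history = gradient_descent(xs, ys)
print(w, history[0], history[-1])
```

The recorded `history` shows the error shrinking as the weight moves toward the value that generated the data.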
What Is a Loss Function and Loss?
In the context of an optimization algorithm, the function used to evaluate a candidate solution (i.e. a set of weights) is referred to as the objective function.
We may seek to maximize or minimize the objective function, meaning that we are searching for a candidate solution that has the highest or lowest score respectively.
Typically, with neural networks, we seek to minimize the error. As such, the objective function is often referred to as a cost function or a loss function, and the value calculated by the loss function is referred to as simply “loss.”
The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. — Page 82, Deep Learning, 2016.
The cost or loss function has an important job: it must faithfully distill all aspects of the model down into a single number in such a way that improvements in that number are a sign of a better model.
The cost function reduces all the various good and bad aspects of a possibly complex system down to a single number, a scalar value, which allows candidate solutions to be ranked and compared. — Page 155, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
In calculating the error of the model during the optimization process, a loss function must be chosen.
This can be a challenging problem, as the function must capture the properties of the problem and be motivated by concerns that are important to the project and its stakeholders.
It is important, therefore, that the function faithfully represent our design goals. If we choose a poor error function and obtain unsatisfactory results, the fault is ours for badly specifying the goal of the search. — Page 155, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
Now that we are familiar with the loss function and loss, we need to know what functions to use.
Maximum Likelihood
There are many functions that could be used to estimate the error of a set of weights in a neural network.
We prefer a function where the space of candidate solutions maps onto a smooth (but high-dimensional) landscape that the optimization algorithm can reasonably navigate via iterative updates to the model weights.
Maximum likelihood estimation, or MLE, is a framework for inference for finding the best statistical estimates of parameters from historical training data: exactly what we are trying to do with the neural network.
Maximum likelihood seeks to find the optimum values for the parameters by maximizing a likelihood function derived from the training data. — Page 39, Neural Networks for Pattern Recognition, 1995.
We have a training dataset with one or more input variables, and we require a model to estimate the weight parameters that best map examples of the inputs to the output or target variable.
Given an input, the model tries to make predictions that match the data distribution of the target variable. Under maximum likelihood, a loss function estimates how closely the distribution of predictions made by a model matches the distribution of the target variable in the training data.
One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution […] defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. […] Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions. — Page 132, Deep Learning, 2016.
A benefit of using maximum likelihood as a framework for estimating the model parameters (weights) for neural networks, and in machine learning in general, is that as the number of examples in the training dataset is increased, the estimate of the model parameters improves. This is called the property of “consistency.”
Under appropriate conditions, the maximum likelihood estimator has the property of consistency […], meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. — Page 134, Deep Learning, 2016.
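As a small illustration of maximum likelihood in action, the sketch below fits the single parameter of a Bernoulli (coin-flip) model by scanning candidate values and keeping the one with the lowest average negative log-likelihood. The dataset and the grid of candidates are hypothetical, chosen only to keep the example short; a neural network does the same thing with many weights and gradient-based search instead of a grid.

```python
from math import log

def negative_log_likelihood(p, outcomes):
    # Average negative log-likelihood of 0/1 outcomes under Bernoulli(p).
    total = sum(log(p) if y == 1 else log(1.0 - p) for y in outcomes)
    return -total / len(outcomes)

# Hypothetical data: 7 ones out of 10 observations.
outcomes = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Candidate parameter values 0.01, 0.02, ..., 0.99.
candidates = [i / 100.0 for i in range(1, 100)]
best_p = min(candidates, key=lambda p: negative_log_likelihood(p, outcomes))
print(best_p)
```

The selected value coincides with the observed frequency of ones, which is what the maximum likelihood estimate of a Bernoulli parameter should be.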
Now that we are familiar with the general approach of maximum likelihood, we can look at the error function.
Maximum Likelihood and Cross-Entropy
Under the framework of maximum likelihood, the error between two probability distributions is measured using cross-entropy.
When modeling a classification problem where we are interested in mapping input variables to a class label, we can model the problem as predicting the probability of an example belonging to each class. In a binary classification problem, there would be two classes, so we may predict the probability of the example belonging to the first class. In the case of multiple-class classification, we can predict a probability for the example belonging to each of the classes.
In the training dataset, the probability of an example belonging to a given class would be 1 or 0, as each sample in the training dataset is a known example from the domain. We know the answer.
Therefore, under maximum likelihood estimation, we would seek a set of model weights that minimize the difference between the model’s predicted probability distribution given the dataset and the distribution of probabilities in the training dataset. This is called the cross-entropy.
In most cases, our parametric model defines a distribution […] and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model’s predictions as the cost function. — Halaman 178, Deep Learning, 2016.
Technically, cross-entropy comes from the field of information theory and has the unit of “bits” when it is calculated with a base-2 logarithm (a natural logarithm gives “nats”). It is used to estimate the difference between the target and predicted probability distributions.
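The following sketch, using hypothetical distributions, computes the cross-entropy for one training example in both units and checks the relationship quoted earlier: cross-entropy equals the entropy of the target distribution plus the KL divergence. The function names are illustrative and not part of any library.

```python
from math import log, log2

def cross_entropy(p, q, log_fn=log2):
    # H(p, q) = -sum_k p_k * log(q_k); terms with p_k == 0 contribute nothing.
    return -sum(pk * log_fn(qk) for pk, qk in zip(p, q) if pk > 0.0)

def entropy(p, log_fn=log2):
    # The entropy of p is its cross-entropy with itself.
    return cross_entropy(p, p, log_fn)

def kl_divergence(p, q, log_fn=log2):
    # KL(p || q) = sum_k p_k * log(p_k / q_k).
    return sum(pk * log_fn(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

p = [1.0, 0.0, 0.0]    # one-hot target for a single training example
q = [0.5, 0.25, 0.25]  # hypothetical model prediction

bits = cross_entropy(p, q)        # base-2 logarithm
nats = cross_entropy(p, q, log)   # natural logarithm
print(bits, nats, entropy(p) + kl_divergence(p, q))
```

Because a one-hot target has zero entropy, minimizing the cross-entropy here is the same as minimizing the KL divergence.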
In the case of regression problems where a quantity is predicted, it is common to use the mean squared error (MSE) loss function instead.
A few basic functions are very commonly used. The mean squared error is popular for function approximation (regression) problems […] The cross-entropy error function is often used for classification problems when outputs are interpreted as probabilities of membership in an indicated class.
— Page 155-156, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
Nevertheless, under the framework of maximum likelihood estimation and assuming a Gaussian distribution for the target variable, mean squared error can be considered the cross-entropy between the distribution of the model predictions and the distribution of the target variable.
Many authors use the term “cross-entropy” to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.
— Page 132, Deep Learning, 2016.
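The claim that mean squared error is a cross-entropy under a Gaussian model can be checked with a short derivation. As a sketch, assume the model output f_w(x) is the mean of a Gaussian with a fixed variance σ² (the symbols f_w, σ and N are notation introduced here, not taken from the quoted book):

```latex
p(y \mid x; w) = \mathcal{N}\!\left(y;\, f_w(x),\, \sigma^2\right)

-\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i; w)
  = \frac{1}{2\sigma^2}\cdot\frac{1}{N}\sum_{i=1}^{N}\left(y_i - f_w(x_i)\right)^2
  + \frac{1}{2}\log\left(2\pi\sigma^2\right)
```

The second term does not depend on the weights w and the first is the mean squared error scaled by a positive constant, so the weights that minimize the negative log-likelihood are the weights that minimize the mean squared error.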
Therefore, when using the framework of maximum likelihood estimation, we will implement a cross-entropy loss function, which often in practice means a cross-entropy loss function for classification problems and a mean squared error loss function for regression problems.
Almost universally, deep learning neural networks are trained under the framework of maximum likelihood using cross-entropy as the loss function.
Most modern neural networks are trained using maximum likelihood. This means that the cost function is […] described as the cross-entropy between the training data and the model distribution.
— Page 178-179, Deep Learning, 2016.
In fact, adopting this framework may be considered a milestone in deep learning: before it was fully formalized, it was common for neural networks for classification to use a mean squared error loss function.
One of these algorithmic changes was the replacement of mean squared error with the cross-entropy family of loss functions. Mean squared error was popular in the 1980s and 1990s, but was gradually replaced by cross-entropy losses and the principle of maximum likelihood as ideas spread between the statistics community and the machine learning community.
— Page 226, Deep Learning, 2016.
The maximum likelihood approach was adopted almost universally not just because of the theoretical framework, but primarily because of the results it produces. Specifically, neural networks for classification that use a sigmoid or softmax activation function in the output layer learn faster and more robustly using a cross-entropy loss function.
The use of cross-entropy losses greatly improved the performance of models with sigmoid and softmax outputs, which had previously suffered from saturation and slow learning when using the mean squared error loss.
— Page 226, Deep Learning, 2016.
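The saturation effect can be seen in the gradient of each loss with respect to the input z of a sigmoid output unit. The sketch below uses the standard derivatives for a single example, with a hypothetical confidently wrong prediction; the function names are illustrative.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def mse_grad_wrt_logit(z, y):
    # Loss (a - y)^2 with a = sigmoid(z): the chain rule brings in the
    # factor a * (1 - a), which vanishes when the unit saturates.
    a = sigmoid(z)
    return 2.0 * (a - y) * a * (1.0 - a)

def cross_entropy_grad_wrt_logit(z, y):
    # Loss -[y log a + (1 - y) log(1 - a)]: the a * (1 - a) factor cancels,
    # leaving the plain prediction error.
    return sigmoid(z) - y

z, y = -10.0, 1.0  # hypothetical case: the model is confidently wrong
grad_mse = mse_grad_wrt_logit(z, y)
grad_ce = cross_entropy_grad_wrt_logit(z, y)
print(grad_mse, grad_ce)
```

With mean squared error the gradient is close to zero exactly when the prediction is most wrong, so learning stalls; with cross-entropy the gradient stays close to -1 and the weights receive a strong correction.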
What Loss Function to Use?
We can summarize the previous section and directly suggest the loss functions that you should use under a framework of maximum likelihood.
Importantly, the choice of loss function is directly related to the activation function used in the output layer of your neural network. These two design elements are connected.
Think of the configuration of the output layer as a choice about the framing of your prediction problem, and the choice of the loss function as the way to calculate the error for a given framing of your problem.
The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we simply use the cross-entropy between the data distribution and the model distribution. The choice of how to represent the output then determines the form of the cross-entropy function.
— Page 181, Deep Learning, 2016.
We will review best practice or default values for each problem type with regard to the output layer and loss function.
Regression Problem
A problem where you predict a real-value quantity.
- Output Layer Configuration: One node with a linear activation unit.
- Loss Function: Mean Squared Error (MSE).
Binary Classification Problem
A problem where you classify an example as belonging to one of two classes.
The problem is framed as predicting the likelihood of an example belonging to class one, e.g. the class that you assign the integer value 1, whereas the other class is assigned the value 0.
- Output Layer Configuration: One node with a sigmoid activation unit.
- Loss Function: Cross-Entropy, also referred to as Logarithmic loss.
Multi-Class Classification Problem
A problem where you classify an example as belonging to one of more than two classes.
The problem is framed as predicting the likelihood of an example belonging to each class.
- Output Layer Configuration: One node for each class using the softmax activation function.
- Loss Function: Cross-Entropy, also referred to as Logarithmic loss.
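The three defaults above can be collected in a small lookup table. This is only a sketch: the table and the helper `output_config` are hypothetical conveniences, while the strings are the identifiers Keras accepts for the `activation` argument of a `Dense` layer and the `loss` argument of `model.compile()`.

```python
# Default output layer and loss per problem type, as summarized above.
# OUTPUT_DEFAULTS and output_config are illustrative helpers, not Keras API.
OUTPUT_DEFAULTS = {
    "regression": {"units": 1, "activation": "linear",
                   "loss": "mean_squared_error"},
    "binary": {"units": 1, "activation": "sigmoid",
               "loss": "binary_crossentropy"},
    "multiclass": {"units": None, "activation": "softmax",
                   "loss": "categorical_crossentropy"},
}

def output_config(problem_type, n_classes=None):
    # Copy the defaults so the shared table is never modified.
    config = dict(OUTPUT_DEFAULTS[problem_type])
    if problem_type == "multiclass":
        if n_classes is None or n_classes < 3:
            raise ValueError("multiclass needs n_classes >= 3")
        config["units"] = n_classes  # one output node per class
    return config

print(output_config("binary"))
print(output_config("multiclass", n_classes=5))
```

In a Keras model, `units` and `activation` would be passed to the final `Dense` layer and `loss` to `model.compile()`, alongside an optimizer of your choice.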
How to Implement Loss Functions
In order to make the loss functions concrete, this section explains how each of the main types of loss function works and how to calculate the score in Python.
Mean Squared Error Loss
Mean Squared Error loss, or MSE for short, is calculated as the average of the squared differences between the predicted and actual values.
The result is never negative, regardless of the sign of the predicted and actual values, and a perfect value is 0.0. The loss value is minimized, although it can be used in a maximization optimization process by making the score negative.
The Python function below provides a pseudocode-like working implementation of a function for calculating the mean squared error for a list of actual and a list of predicted real-valued quantities.
    # calculate mean squared error
    def mean_squared_error(actual, predicted):
        sum_square_error = 0.0
        for i in range(len(actual)):
            sum_square_error += (actual[i] - predicted[i])**2.0
        mean_square_error = 1.0 / len(actual) * sum_square_error
        return mean_square_error
For an efficient implementation, I’d encourage you to use the scikit-learn mean_squared_error() function.
Cross-Entropy Loss (or Log Loss)
Cross-entropy loss is often simply referred to as “cross-entropy,” “logarithmic loss,” “logistic loss,” or “log loss” for short.
Each predicted probability is compared to the actual class output value (0 or 1) and a score is calculated that penalizes the probability based on the distance from the expected value. The penalty is logarithmic, offering a small score for small differences (0.1 or 0.2) and an enormous score for a large difference (0.9 or 1.0).
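To see how quickly the logarithmic penalty grows, the short sketch below prints the penalty for a single example at several distances from the expected value. The helper name `log_penalty` is illustrative.

```python
from math import log

def log_penalty(p_true_class):
    # Penalty for one example, given the probability assigned to its true class.
    return -log(p_true_class)

for distance in (0.1, 0.2, 0.5, 0.9, 0.99):
    p = 1.0 - distance  # probability left for the true class
    print(f"distance {distance:.2f} -> penalty {log_penalty(p):.3f}")
```

A distance of exactly 1.0 would require the log of 0.0, which is undefined, so practical implementations keep the probabilities slightly away from 0.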
Cross-entropy loss is minimized, where smaller values represent a better model than larger values. A model that predicts perfect probabilities has a cross entropy or log loss of 0.0.
Cross-entropy for a binary or two class prediction problem is actually calculated as the average cross entropy across all examples.
The Python function below provides a pseudocode-like working implementation of a function for calculating the cross-entropy for a list of actual 0 and 1 values compared to predicted probabilities for the class 1.
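    from math import log

    # calculate binary cross entropy
    def binary_cross_entropy(actual, predicted):
        sum_score = 0.0
        for i in range(len(actual)):
            # class 1 examples are scored with log(p), class 0 examples with log(1 - p)
            sum_score += actual[i] * log(1e-15 + predicted[i])
            sum_score += (1 - actual[i]) * log(1e-15 + 1.0 - predicted[i])
        mean_sum_score = 1.0 / len(actual) * sum_score
        return -mean_sum_score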
Note, we add a very small value (in this case 1E-15) to the predicted probabilities to avoid ever calculating the log of 0.0. This means that in practice, the best possible loss will be a value very close to zero, but not exactly zero.
Cross-entropy can be calculated for multiple-class classification. The classes have been one hot encoded, meaning that there is a binary feature for each class value and the predictions must have predicted probabilities for each of the classes. The cross-entropy is then summed across each binary feature and averaged across all examples in the dataset.
The Python function below provides a pseudocode-like working implementation of a function for calculating the cross-entropy for a list of actual one hot encoded values compared to predicted probabilities for each class.
    from math import log

    # calculate categorical cross entropy
    def categorical_cross_entropy(actual, predicted):
        sum_score = 0.0
        for i in range(len(actual)):
            for j in range(len(actual[i])):
                sum_score += actual[i][j] * log(1e-15 + predicted[i][j])
        mean_sum_score = 1.0 / len(actual) * sum_score
        return -mean_sum_score
For an efficient implementation, I’d encourage you to use the scikit-learn log_loss() function.
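As a quick sanity check on the two calculations, the sketch below restates both in compact form and confirms that a binary problem scored with two one hot encoded columns gives the same loss as the binary calculation. The labels and probabilities are hypothetical, and the short function names are illustrative.

```python
from math import log

EPS = 1e-15  # small constant that keeps log() away from 0.0

def binary_ce(actual, predicted):
    total = 0.0
    for y, p in zip(actual, predicted):
        # log(p) scores class 1 examples, log(1 - p) scores class 0 examples.
        total += y * log(EPS + p) + (1 - y) * log(EPS + 1.0 - p)
    return -total / len(actual)

def categorical_ce(actual, predicted):
    total = 0.0
    for row_y, row_p in zip(actual, predicted):
        total += sum(y * log(EPS + p) for y, p in zip(row_y, row_p))
    return -total / len(actual)

labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.6, 0.3]  # hypothetical predicted P(class 1)

# The same problem written as two one hot encoded columns: [class 0, class 1].
one_hot = [[1 - y, y] for y in labels]
two_col = [[1.0 - p, p] for p in probs]

loss_binary = binary_ce(labels, probs)
loss_categorical = categorical_ce(one_hot, two_col)
print(loss_binary, loss_categorical)
```

The two values agree up to floating-point rounding, which shows that the binary form is the two-class special case of the categorical form.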
Loss Functions and Reported Model Performance
Given a framework of maximum likelihood, we know that we want to use a cross-entropy or mean squared error loss function under stochastic gradient descent.
Nevertheless, we may or may not want to report the performance of the model using the loss function.
For example, logarithmic loss is challenging to interpret, especially for stakeholders who are not machine learning practitioners. The same can be said for the mean squared error. Instead, it may be more important to report the accuracy and root mean squared error for models used for classification and regression respectively.
It may also be desirable to choose models based on these metrics instead of loss. This is an important consideration, as the model with the minimum loss may not be the model with the best value of the metric that is important to project stakeholders.
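A small, hypothetical example shows how loss and metric can disagree. In the sketch below, one model is always right but never confident, while the other is confident and makes one mistake; the data and the helper names are assumptions made for illustration.

```python
from math import log

def mean_log_loss(actual, predicted, eps=1e-15):
    total = 0.0
    for y, p in zip(actual, predicted):
        total += y * log(eps + p) + (1 - y) * log(eps + 1.0 - p)
    return -total / len(actual)

def accuracy(actual, predicted, threshold=0.5):
    # A prediction counts as class 1 when its probability reaches the threshold.
    hits = sum(1 for y, p in zip(actual, predicted)
               if (p >= threshold) == (y == 1))
    return hits / len(actual)

actual = [1, 1, 0, 0]
model_a = [0.51, 0.51, 0.49, 0.49]  # always right, never confident
model_b = [0.99, 0.99, 0.01, 0.60]  # confident, one mistake

for name, preds in (("A", model_a), ("B", model_b)):
    print(name, round(mean_log_loss(actual, preds), 3), accuracy(actual, preds))
```

Model B has the lower loss but the lower accuracy, so selecting purely on loss would pick a different model than selecting on the metric.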
A good division to consider is to use the loss to evaluate and diagnose how well the model is learning. This includes all of the considerations of the optimization process, such as overfitting, underfitting, and convergence. An alternate metric can then be chosen that has meaning to the project stakeholders to both evaluate model performance and perform model selection.
- Loss: Used to evaluate and diagnose model optimization only.
- Metric: Used to evaluate and choose models in the context of the project.
The same metric can be used for both concerns but it is more likely that the concerns of the optimization process will differ from the goals of the project and different scores will be required. Nevertheless, it is often the case that improving the loss improves or, at worst, has no effect on the metric of interest.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Deep Learning, 2016.
- Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
- Neural Networks for Pattern Recognition, 1995.
Articles
- Maximum likelihood estimation, Wikipedia.
- Kullback–Leibler divergence, Wikipedia.
- Cross entropy, Wikipedia.
- Mean squared error, Wikipedia.
- Log Loss, FastAI Wiki.
Summary
In this post, you discovered the role of loss and loss functions in training deep learning neural networks and how to choose the right loss function for your predictive modeling problems.
Specifically, you learned:
- Neural networks are trained using an optimization process that requires a loss function to calculate the model error.
- Maximum Likelihood provides a framework for choosing a loss function when training neural networks and machine learning models in general.
- Cross-entropy and mean squared error are the two main types of loss functions to use when training neural network models.