Keras: Evaluate the Skill of Deep Learning Models

Source: https://machinelearningmastery.com/evaluate-skill-deep-learning-models/

Evaluating a deep learning model is a common source of confusion. Questions that come up often include:

What random seed should I use?
Why do we need a random seed?
Why don't I get the same results on subsequent runs?

In this post, we will look at a procedure for evaluating deep learning models and the rationale for using it.

We will also look at useful related statistics that can be calculated to present the skill of a model, such as standard deviation, standard error, and confidence intervals.

The Beginner's Mistake

We fit the model on the training data, evaluate it on the test dataset, and then report its skill.

Or perhaps we use k-fold cross validation to evaluate the model, and then report its skill.

This is a mistake that many beginners make.

It looks like we are doing the right thing, but there is a key issue that we have not accounted for:

Deep learning models are stochastic.
Artificial neural networks use randomness while being fit on a dataset, such as random initial weights and random shuffling of the data during each training epoch in stochastic gradient descent.
This means that each time the same model is fit on the same data, it may give different predictions and, in turn, a different overall skill.
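
The effect can be seen with a small sketch. The perceptron below is only a hypothetical stand-in for a neural network, written with the Python standard library; `make_data` and `fit_and_score` are illustrative names and not part of Keras. The data and the train/test split stay fixed, and only the seed that drives the weight initialization and the per-epoch shuffling changes.

```python
import random

def make_data(n=200, seed=0):
    """Hypothetical toy dataset: label is 1 when x1 + x2 > 1, with 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = 1 if x[0] + x[1] > 1.0 else 0
        if rng.random() < 0.1:  # flip roughly 10% of the labels
            y = 1 - y
        data.append((x, y))
    return data

def fit_and_score(train, test, seed, epochs=5, lr=0.1):
    """Train a perceptron with random initial weights and per-epoch shuffling."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(3)]  # random initial bias and weights
    rows = list(train)
    for _ in range(epochs):
        rng.shuffle(rows)  # random sample order in every epoch
        for (x1, x2), y in rows:
            pred = 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0
            err = y - pred
            w[0] += lr * err
            w[1] += lr * err * x1
            w[2] += lr * err * x2
    correct = sum(
        (1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0) == y
        for (x1, x2), y in test
    )
    return correct / len(test)

data = make_data()
train, test = data[:150], data[150:]  # the same split for every run

# Same model, same data, different seeds.
scores = [fit_and_score(train, test, seed) for seed in range(5)]
print(scores)
```

The printed accuracies will typically differ from seed to seed even though nothing about the data has changed.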

Estimating Model Skill

(Controlling for Model Variance)

We don't have all possible data; if we did, we would not need to make predictions.

We have a limited sample of data, and from it we need to discover the best model we can.

Use a Train-Test Split

We do that by splitting the data into two parts, fitting a model or specific model configuration on the first part of the data, using the fit model to make predictions on the rest, and then evaluating the skill of those predictions. This is called a train-test split, and we use the skill as an estimate of how well we think the model will perform in practice when it makes predictions on new data.

For example, here is some pseudocode for evaluating a model using a train-test split:

train, test = split(data)
model = fit(train.X, train.y)
predictions = model.predict(test.X)
skill = compare(test.y, predictions)
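
A runnable version of this pseudocode might look like the sketch below. It uses only the Python standard library; the dataset, the threshold "model", and the helper names `make_data`, `split`, `fit`, `predict`, and `compare` are assumptions made for illustration, not a Keras API.

```python
import random

def make_data(n_per_class=100, seed=0):
    """Hypothetical 1-D dataset: class 0 centred at 0.0, class 1 centred at 2.0."""
    rng = random.Random(seed)
    rows = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    rows += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_per_class)]
    return rows

def split(rows, test_fraction=0.33, seed=None):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit(train):
    """Toy model: a threshold halfway between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0

def predict(threshold, xs):
    """Predict class 1 for every input above the threshold."""
    return [1 if x > threshold else 0 for x in xs]

def compare(expected, predictions):
    """Classification accuracy."""
    return sum(e == p for e, p in zip(expected, predictions)) / len(expected)

data = make_data()
train, test = split(data, seed=1)
model = fit(train)
predictions = predict(model, [x for x, _ in test])
skill = compare([y for _, y in test], predictions)
print(skill)
```

Calling `split` with a different `seed` gives a different partition of the data and, usually, a slightly different skill score.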

A train-test split is a good approach to use if you have a lot of data or a model that is very slow to train, but the resulting skill score for the model will be noisy because of the randomness in the data (the variance of the model).

This means that the same model fit on different data will give different model skill scores.

Use k-Fold Cross Validation

We can often tighten this up and get more accurate estimates of model skill using techniques like k-fold cross validation. This is a technique that systematically splits the available data into k folds, fits the model on k-1 folds, evaluates it on the held-out fold, and repeats this process for each fold.

This results in k different models that have k different sets of predictions, and in turn, k different skill scores.

For example, here is some pseudocode for evaluating a model using k-fold cross validation:

scores = list()
for i in range(k):
	# fold i is held out for testing; the other k-1 folds are used for training
	train, test = split_old(data, i)
	model = fit(train.X, train.y)
	predictions = model.predict(test.X)
	skill = compare(test.y, predictions)
	scores.append(skill)
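
As a rough, runnable counterpart to the pseudocode, the sketch below implements the fold splitting by hand with the Python standard library. The dataset, the threshold "model", and the helper names (`make_data`, `k_fold_splits`, `fit`, `compare`) are hypothetical and chosen for illustration only.

```python
import random

def make_data(n_per_class=100, seed=0):
    """Hypothetical 1-D dataset: class 0 centred at 0.0, class 1 centred at 2.0."""
    rng = random.Random(seed)
    rows = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    rows += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_per_class)]
    return rows

def k_fold_splits(rows, k, seed=0):
    """Yield (train, test) pairs; every row appears in exactly one test fold."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, folds[i]

def fit(train):
    """Toy model: a threshold halfway between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0

def compare(threshold, test):
    """Accuracy of predicting class 1 whenever x is above the threshold."""
    return sum((1 if x > threshold else 0) == y for x, y in test) / len(test)

data = make_data()
k = 5
scores = []
for train, test in k_fold_splits(data, k):
    model = fit(train)                   # fit on k-1 folds
    scores.append(compare(model, test))  # evaluate on the held-out fold
print(scores)
```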

A population of skill scores is more useful, as we can take the mean and report the average expected skill of the model, which is likely to be closer to the actual skill of the model in practice. For example:

mean_skill = sum(scores) / count(scores)

We can also use mean_skill to calculate a standard deviation, which gives an idea of the average spread of the scores around mean_skill:

standard_deviation = sqrt(1/count(scores) * sum( (score - mean_skill)^2 ))
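
In Python, the two formulas can be written as in the sketch below. The fold scores are made-up values used purely for illustration. Note that `^` in the pseudocode denotes a power, which is `**` in Python, and that dividing by `count(scores)` corresponds to the population standard deviation (`statistics.pstdev`), not the sample version (`statistics.stdev`).

```python
import math
import statistics

# Hypothetical skill scores from a 5-fold cross validation run.
scores = [0.81, 0.79, 0.84, 0.80, 0.83]

mean_skill = sum(scores) / len(scores)

# Population standard deviation, written out as in the formula above.
standard_deviation = math.sqrt(
    sum((score - mean_skill) ** 2 for score in scores) / len(scores)
)

# The standard library gives the same values.
print(mean_skill, statistics.mean(scores))
print(standard_deviation, statistics.pstdev(scores))
```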

Estimating the Skill of a Stochastic Model

(Controlling for Model Stability)

Stochastic models, like deep neural networks, introduce an additional source of randomness.

This additional randomness gives the model more flexibility when learning, but can make the model less stable (e.g. different results when the same model is trained on the same data).

This is different from model variance, which gives different results when the same model is trained on different data.

To get a robust estimate of the skill of a stochastic model, we must take this additional source of variance into account; we must control for it.

Fix the Random Seed

One way is to use the same randomness every time the model is fit. We can do that by fixing the random number seed used by the system and then evaluating or fitting the model. For example:

seed(1)
scores = list()
for i in range(k):
	train, test = split_old(data, i)
	model = fit(train.X, train.y)
	predictions = model.predict(test.X)
	skill = compare(test.y, predictions)
	scores.append(skill)

This is good for tutorials and demonstrations when the same result is needed every time your code is run.

However, it is fragile and not recommended for evaluating models, because the reported skill then reflects one arbitrary seed only.

See the posts:
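
A minimal sketch of the idea, using Python's built-in `random` module: `noisy_evaluation` is a hypothetical stand-in for fitting and scoring a stochastic model, not real training code. With Keras, NumPy and the backend would need seeding as well; recent Keras releases appear to offer `keras.utils.set_random_seed` for that, but please confirm it against the documentation for your version.

```python
import random

def noisy_evaluation():
    """Hypothetical stand-in for one fit-and-evaluate run of a stochastic model."""
    # The score is simulated as 0.80 plus Gaussian noise drawn from the global RNG.
    return 0.80 + random.gauss(0.0, 0.02)

random.seed(1)
first_run = [noisy_evaluation() for _ in range(3)]

random.seed(1)  # the same seed again, so the same sequence of random numbers
second_run = [noisy_evaluation() for _ in range(3)]

print(first_run == second_run)
```

Both runs produce identical scores because the random number generator was reset to the same state before each of them.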

Embrace Randomness in Machine Learning
How to Get Reproducible Results with Keras

Repeat Evaluation Experiments

A more robust approach is to repeat the evaluation experiment multiple times, using the same procedure we would use for a non-stochastic model and letting the randomness vary between repeats.

For example:

scores = list()
for i in range(repeats):
	# one complete k-fold cross validation per repeat
	run_scores = list()
	for j in range(k):
		train, test = split_old(data, j)
		model = fit(train.X, train.y)
		predictions = model.predict(test.X)
		skill = compare(test.y, predictions)
		run_scores.append(skill)
	scores.append(mean(run_scores))

Note, we calculate the mean of the estimated mean model skill, the so-called grand mean.

This is my recommended procedure for estimating the skill of a deep learning model.
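
The procedure could be sketched in runnable form as follows. The perceptron is a hypothetical stand-in for a neural network and all helper names are illustrative; the sketch assumes that each repeat gets its own seed, which drives both the fold assignment and the randomness of training.

```python
import random
import statistics

def make_data(n=120, seed=0):
    """Hypothetical 1-D dataset: class 0 centred at 0.0, class 1 centred at 2.0."""
    rng = random.Random(seed)
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]

def fit(train, rng, epochs=3, lr=0.1):
    """Stochastic toy model: perceptron with random init and shuffled updates."""
    w0, w1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    rows = list(train)
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            err = y - (1 if w0 + w1 * x > 0 else 0)
            w0 += lr * err
            w1 += lr * err * x
    return w0, w1

def compare(model, test):
    """Classification accuracy of the fitted perceptron on the test rows."""
    w0, w1 = model
    return sum((1 if w0 + w1 * x > 0 else 0) == y for x, y in test) / len(test)

def k_fold_splits(rows, k, rng):
    """Yield (train, test) pairs; every row appears in exactly one test fold."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, folds[i]

data = make_data()
repeats, k = 30, 5
scores = []
for i in range(repeats):
    rng = random.Random(i)  # a different seed for every repeat
    run_scores = [
        compare(fit(train, rng), test)
        for train, test in k_fold_splits(data, k, rng)
    ]
    scores.append(statistics.mean(run_scores))  # mean skill of this repeat

grand_mean = statistics.mean(scores)  # mean of the per-repeat means
print(grand_mean)
```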

Because repeats is often >=30, we can easily calculate the standard error of the mean model skill, which estimates how much the estimated mean model skill score is expected to differ from the unknown actual mean model skill (i.e. how wrong mean_skill might be):

standard_error = standard_deviation / sqrt(count(scores))

Further, we can use the standard_error to calculate a confidence interval for mean_skill. This assumes that the distribution of the results is Gaussian, which you can check by looking at a histogram or a Q-Q plot, or by using statistical tests on the collected scores.

For example, the 95% confidence interval is mean_skill plus or minus (1.96 * standard_error).

interval = standard_error * 1.96
lower_interval = mean_skill - interval
upper_interval = mean_skill + interval
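
Put together in Python, the calculation might look like the sketch below. The 30 scores are simulated values standing in for the results of repeated evaluation; they are not real measurements. The sketch uses the sample standard deviation (`statistics.stdev`), which is an assumption on my part; with 30 or more scores it differs only slightly from the population formula given earlier.

```python
import math
import random
import statistics

# Hypothetical stand-in for the mean skill scores of 30 repeated evaluations.
rng = random.Random(42)
scores = [0.80 + rng.gauss(0.0, 0.02) for _ in range(30)]

mean_skill = statistics.mean(scores)
standard_deviation = statistics.stdev(scores)  # sample standard deviation
standard_error = standard_deviation / math.sqrt(len(scores))

# 95% confidence interval, assuming the scores are roughly Gaussian.
interval = standard_error * 1.96
lower_interval = mean_skill - interval
upper_interval = mean_skill + interval
print(f"{mean_skill:.3f} ({lower_interval:.3f} to {upper_interval:.3f})")
```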

There are other perhaps more statistically robust methods for calculating confidence intervals than using the standard error of the grand mean, such as:

Calculating the Binomial proportion confidence interval.
Using the bootstrap to estimate an empirical confidence interval.
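
Both alternatives can be sketched with the standard library. The function names, the simulated scores, and the example accuracy of 0.85 on 200 test samples are all assumptions for illustration. The bootstrap version is the simple percentile method, and the binomial version uses the normal approximation, which is only one of several ways to compute such an interval.

```python
import math
import random
import statistics

def bootstrap_interval(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of the scores."""
    rng = random.Random(seed)
    # Resample the scores with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

def binomial_interval(accuracy, n_test, z=1.96):
    """Normal-approximation interval for an accuracy measured on n_test samples."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)

# Hypothetical stand-in for the mean skill scores of 30 repeated evaluations.
rng = random.Random(42)
scores = [0.80 + rng.gauss(0.0, 0.02) for _ in range(30)]

boot_lower, boot_upper = bootstrap_interval(scores)
binom_lower, binom_upper = binomial_interval(0.85, n_test=200)
print(boot_lower, boot_upper)
print(binom_lower, binom_upper)
```

The bootstrap makes no Gaussian assumption about the scores, at the cost of extra computation.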

How Unstable Are Neural Networks?

It depends on your problem, on the network, and on its configuration.

I would recommend performing a sensitivity analysis to find out.

Evaluate the same model on the same data many times (30, 100, or thousands) and only vary the seed for the random number generator.

Then review the mean and standard deviation of the skill scores produced. The standard deviation (average distance of scores from the mean score) will give you an idea of just how unstable your model is.

How Many Repeats?

I would recommend at least 30, perhaps 100, even thousands, limited only by your time and computer resources, and diminishing returns (e.g. standard error on the mean_skill).

More rigorously, I would recommend an experiment that looked at the impact on estimated model skill versus the number of repeats and the calculation of the standard error (how much the mean estimated performance differs from the true underlying population mean).
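
Such an experiment could be sketched as follows. `evaluate_once` is a hypothetical stand-in that simulates the score of one complete fit-and-evaluate run with Gaussian noise; in a real study it would train and score the actual model with a fresh seed.

```python
import math
import random
import statistics

def evaluate_once(rng):
    """Hypothetical stand-in for one full fit-and-evaluate run of a stochastic model."""
    return 0.80 + rng.gauss(0.0, 0.03)

rng = random.Random(7)
scores = [evaluate_once(rng) for _ in range(1000)]

# Estimated mean skill and its standard error for a growing number of repeats.
results = {}
for repeats in (5, 10, 30, 100, 1000):
    subset = scores[:repeats]
    mean_skill = statistics.mean(subset)
    standard_error = statistics.stdev(subset) / math.sqrt(repeats)
    results[repeats] = (mean_skill, standard_error)
    print(f"repeats={repeats:4d} mean={mean_skill:.4f} standard_error={standard_error:.4f}")
```

Plotting the standard error against the number of repeats is one way to see where additional repeats stop paying off.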

Summary

In this post, you discovered how to evaluate the skill of deep learning models.

Specifically, you learned:

The common mistake made by beginners when evaluating deep learning models.
The rationale for using repeated k-fold cross validation to evaluate deep learning models.
How to calculate related model skill statistics, such as standard deviation, standard error, and confidence intervals.

References

https://machinelearningmastery.com/evaluate-skill-deep-learning-models/

Related Links

Keras
Python