Orange: Preprocess

From OnnoWiki
Revision as of 14:30, 18 April 2020 by Onnowpurbo

Source: https://docs.biolab.si//3/visual-programming/widgets/data/preprocess.html

The Preprocess widget preprocesses data with the selected methods.

Input

Data: input dataset

Output

Preprocessor: preprocessing method
Preprocessed Data: data preprocessed with selected methods

Preprocessing is crucial for achieving better-quality analysis results. The Preprocess widget offers several preprocessing methods that can be combined in a single preprocessing pipeline. Some methods are also available as separate widgets, which offer more advanced techniques and finer parameter tuning.

Preprocess-stamped.png
  • List of preprocessors. Double click the preprocessors you wish to use and shuffle their order by dragging them up or down. You can also add preprocessors by dragging them from the left menu to the right.
  • Preprocessing pipeline.
  • When the box is ticked (Send Automatically), the widget will communicate changes automatically. Alternatively, click Send.

Preprocessor

Preprocess1-stamped.png

Discretization of continuous values:

  • Entropy-MDL discretization by Fayyad and Irani that uses expected information to determine bins.
  • Equal frequency discretization splits by frequency (same number of instances in each bin).
  • Equal width discretization creates bins of equal width (span of each bin is the same).
  • Remove numeric features altogether.
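
The two unsupervised discretization options can be illustrated with a minimal pure-Python sketch. This is not Orange's implementation; the function names `equal_width_bins` and `equal_frequency_bins` and the sample `ages` list are hypothetical, and ties are handled naively by rank.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of identical span."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are equal, so everything falls into a single bin.
        return [0 for _ in values]
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds (nearly) the same number of instances."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        # Simplification: equal values may be split across neighbouring bins.
        bins[idx] = rank * n_bins // len(values)
    return bins


ages = [22, 25, 27, 31, 38, 45, 52, 80]  # made-up sample data
width_bins = equal_width_bins(ages, 4)
freq_bins = equal_frequency_bins(ages, 4)
print(width_bins)
print(freq_bins)
```

With skewed data such as `ages`, equal width puts most instances into the first bin, while equal frequency spreads them evenly.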

Continuization of discrete values:

  • Most frequent as base treats the most frequent discrete value as 0 and others as 1. For discrete attributes with more than 2 values, the most frequent value is taken as the base and contrasted with the remaining values in corresponding columns.
  • One feature per value creates a column for each value, placing 1 where an instance has that value and 0 where it doesn't. This is essentially one-hot encoding.
  • Remove non-binary features retains only categorical features that have values of either 0 or 1 and transforms them into continuous.
  • Remove categorical features removes categorical features altogether.
  • Treat as ordinal takes discrete values and treats them as numbers. If discrete values are categories, each category will be assigned a number as they appear in the data.
  • Divide by number of values is similar to treat as ordinal, but the final values will be divided by the total number of values and hence the range of the new continuous variable will be [0, 1].
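
The continuization options above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Orange's code: all function names and the `colors` column are hypothetical, categories are numbered in order of first appearance, and dividing by the number of values minus one is an assumption chosen so that the largest index maps to exactly 1.

```python
from collections import Counter


def one_hot(column, categories):
    """One feature per value: one 0/1 column for every category."""
    return [[1 if value == cat else 0 for cat in categories] for value in column]


def most_frequent_as_base(column, categories):
    """Drop the most frequent value (the base) and keep 0/1 columns for the rest."""
    base = Counter(column).most_common(1)[0][0]
    others = [cat for cat in categories if cat != base]
    return others, [[1 if value == cat else 0 for cat in others] for value in column]


def treat_as_ordinal(column, categories):
    """Replace every value by the index of its category."""
    return [categories.index(value) for value in column]


def divide_by_number_of_values(column, categories):
    """Ordinal index rescaled to [0, 1] (assumes division by n - 1)."""
    denominator = max(len(categories) - 1, 1)
    return [categories.index(value) / denominator for value in column]


colors = ["red", "green", "red", "blue", "red"]  # made-up sample column
categories = list(dict.fromkeys(colors))  # order of first appearance

encoded = one_hot(colors, categories)
kept, contrasted = most_frequent_as_base(colors, categories)
ordinal = treat_as_ordinal(colors, categories)
scaled = divide_by_number_of_values(colors, categories)
print(encoded, kept, contrasted, ordinal, scaled, sep="\n")
```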

Impute missing values:

  • Average/Most frequent replaces missing values (NaN) with the average (for continuous) or most frequent (for discrete) value.
  • Replace with random value replaces missing values with random ones within the range of each variable.
  • Remove rows with missing values.
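
A minimal sketch of the imputation strategies, assuming missing values are represented by `None` (Orange itself stores them as NaN). The helper names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean


def impute_average(column):
    """Replace missing numeric values with the mean of the known ones."""
    known = [v for v in column if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in column]


def impute_most_frequent(column):
    """Replace missing discrete values with the most common known value."""
    known = [v for v in column if v is not None]
    fill = Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


def remove_rows_with_missing(rows):
    """Keep only rows in which every value is known."""
    return [row for row in rows if all(v is not None for v in row)]


cholesterol = [200.0, None, 240.0, 220.0]              # made-up numeric column
chest_pain = ["typical", None, "atypical", "typical"]  # made-up discrete column
rows = [(200.0, "typical"), (None, "atypical"), (240.0, None)]

filled_numeric = impute_average(cholesterol)
filled_discrete = impute_most_frequent(chest_pain)
complete_rows = remove_rows_with_missing(rows)
print(filled_numeric, filled_discrete, complete_rows, sep="\n")
```

Removing rows is the simplest option but can discard a large share of the data when missing values are scattered across many columns.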

Select relevant features:

  • Similar to Rank, this preprocessor outputs only the most informative features. Score can be determined by information gain, gain ratio, gini index, ReliefF, fast correlation based filter, ANOVA, Chi2, RReliefF, and Univariate Linear Regression.
  • Strategy refers to how many variables should be on the output. Fixed returns a fixed number of top-scored variables, while Percentile returns the selected top percent of the features.
Preprocess2-stamped.png
  • Select random features outputs either a fixed number of features from the original data or a percentage. This is mainly used for advanced testing and educational purposes.
  • Normalize adjusts values to a common scale. Center values by mean or median, or omit centering altogether. Similarly, one can scale by SD (standard deviation), by span, or not at all.
  • Randomize instances. Randomize classes shuffles class values and destroys connection between instances and class. Similarly, one can randomize features or meta data. If replicable shuffling is on, randomization results can be shared and repeated with a saved workflow. This is mainly used for advanced testing and educational purposes.
  • Remove sparse features retains features that have more than user-defined threshold percentage of non-zero values. The rest are discarded.
  • Principal component analysis outputs results of a PCA transformation. Similar to the PCA widget.
  • CUR matrix decomposition is a dimensionality reduction method, similar to SVD.
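
The centering and scaling choices of Normalize can be sketched as below. This is a hypothetical helper, not the widget's code; in particular, using the population standard deviation (`pstdev`) rather than the sample one is an assumption.

```python
from statistics import mean, median, pstdev


def normalize(values, center="mean", scale="sd"):
    """Center by mean/median (or not at all), then scale by SD/span (or not at all)."""
    if center == "mean":
        offset = mean(values)
    elif center == "median":
        offset = median(values)
    else:
        offset = 0.0

    if scale == "sd":
        factor = pstdev(values)  # assumption: population standard deviation
    elif scale == "span":
        factor = max(values) - min(values)
    else:
        factor = 1.0
    if factor == 0:
        factor = 1.0  # constant column: leave it unscaled

    return [(v - offset) / factor for v in values]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample data
standardized = normalize(values, center="mean", scale="sd")
by_span = normalize(values, center="none", scale="span")
print(standardized)
print(by_span)
```

Centering by mean and scaling by SD yields values with mean 0 and unit standard deviation, while scaling by span alone makes the difference between the largest and smallest value equal to 1.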

Examples

In the first example, we use the heart_disease.tab dataset, available in the dropdown menu of the File widget. We then use the Preprocess widget to impute missing values and normalize features. We can observe the changes in the Data Table widget and compare them with the unprocessed data.

Preprocess-Example1.png

In the second example, we use the Preprocess widget for predictive modeling.

This time we use the heart_disease.tab dataset from the File widget. The data can be accessed through the dropdown menu. This dataset contains 303 patients who came to the doctor with chest pain. After tests were done, some patients were found to have diameter narrowing and others were not (this is the class variable).

The heart disease data have some missing values, and we want to account for that. First, we split the dataset into train and test data with the Data Sampler widget.

Then we send the Data Sample (train data) into the Preprocess widget. We use Impute Missing Values here, but any combination of preprocessors can be tried on the data. We send the preprocessed data to the Logistic Regression widget and the constructed model to the Predictions widget.

Finally, Predictions also needs data to predict on. We again use the output of the Data Sampler widget, but this time the Remaining Data (test data) rather than the Data Sample. This is the data that was not used for training the model.

Notice how we send the remaining data (test data) directly to the Predictions widget without applying any preprocessing. This is because Orange handles preprocessing of new data internally to prevent errors in model construction. The exact same preprocessor that was used on the train data is used for predictions. The same process applies to the Test & Score widget.
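
The idea that a preprocessor is fitted on the train data and then reused unchanged on new data can be sketched in plain Python. This is only an illustration of the principle, not how Orange implements it internally; the `MeanImputer` class and the tiny `train` and `test` tables are hypothetical.

```python
from statistics import mean


class MeanImputer:
    """Hypothetical helper: learns column means on train data only."""

    def fit(self, rows):
        n_cols = len(rows[0])
        # Means are computed from the known values of the train rows.
        self.means = [
            mean(row[c] for row in rows if row[c] is not None)
            for c in range(n_cols)
        ]
        return self

    def transform(self, rows):
        # The stored train means fill the gaps in any data, train or test.
        return [
            [self.means[c] if v is None else v for c, v in enumerate(row)]
            for row in rows
        ]


train = [[1.0, 10.0], [3.0, None], [None, 30.0]]  # made-up train data
test = [[None, 5.0], [2.0, None]]                 # made-up test data

imputer = MeanImputer().fit(train)     # fitted once, on train data
train_ready = imputer.transform(train)
test_ready = imputer.transform(test)   # test gaps filled with train means
print(imputer.means)
print(test_ready)
```

Fitting on the train data alone keeps information from the test data out of the model, which is why the test data are not preprocessed separately.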

Predictions-Example2.png
