Orange: Data Sampler

From OnnoWiki
Revision as of 07:10, 27 January 2020 by Onnowpurbo (talk | contribs)

Source: https://docs.biolab.si//3/visual-programming/widgets/data/datasampler.html

Selects a subset of data instances from an input dataset.

Input

Data: input dataset

Output

Data Sample: sampled data instances
Remaining Data: out-of-sample data

The Data Sampler widget implements several data sampling methods. It outputs a sampled dataset and a complementary dataset (containing the instances from the input set that are not included in the sample). The output is processed after the input dataset is provided and Sample Data is pressed.

DataSampler-stamped.png
  • Information on the input and output dataset.
  • The desired sampling method:
    • Fixed proportion of data returns a selected percentage of the entire data (e.g. 70% of all the data).
    • Fixed sample size returns a selected number of data instances, with the option to set Sample with replacement, which always samples from the entire dataset (it does not remove instances that are already in the subset). With replacement, you can generate more instances than are available in the input dataset.
    • Cross Validation partitions the data instances into complementary subsets, where you can select the number of folds (subsets) and which fold you want to use as the sample.
    • Bootstrap infers the sample from the population statistic.
  • Replicable sampling maintains a sampling pattern that can be carried across users, while Stratify sample mimics the composition of the input dataset.
  • Press Sample Data to output the data sample.
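
The sampling methods listed above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the widget's actual implementation: the function names are hypothetical, the rows are placeholder integers, Bootstrap is assumed to draw as many rows as the input has with replacement (the usual meaning of the term, which the text above does not spell out), and the selected cross-validation fold is returned as the sample, following the description above.

```python
import random

def split_by_indices(rows, picked):
    """Return (sample, remaining): the picked rows and the rows never picked."""
    chosen = set(picked)
    sample = [rows[i] for i in picked]
    remaining = [row for i, row in enumerate(rows) if i not in chosen]
    return sample, remaining

def fixed_proportion(rows, proportion, rng):
    """Sample a fixed share of the rows without replacement."""
    size = round(len(rows) * proportion)
    return split_by_indices(rows, rng.sample(range(len(rows)), size))

def fixed_size(rows, size, rng, replacement=False):
    """Sample a fixed number of rows, optionally with replacement."""
    n = len(rows)
    if replacement:
        # Every draw sees the whole dataset, so size may exceed n.
        picked = [rng.randrange(n) for _ in range(size)]
    else:
        picked = rng.sample(range(n), size)
    return split_by_indices(rows, picked)

def cross_validation(rows, n_folds, selected_fold, rng):
    """Shuffle, deal rows into n_folds subsets, return the selected fold as the sample."""
    order = list(range(len(rows)))
    rng.shuffle(order)
    return split_by_indices(rows, order[selected_fold::n_folds])

def bootstrap(rows, rng):
    """Assumed behaviour: draw len(rows) rows with replacement; the rest is never drawn."""
    n = len(rows)
    return split_by_indices(rows, [rng.randrange(n) for _ in range(n)])

rows = list(range(150))   # stand-in for 150 data instances
rng = random.Random(0)    # fixed seed so the run is repeatable

prop_sample, prop_rest = fixed_proportion(rows, 0.7, rng)
print(len(prop_sample), len(prop_rest))   # 105 45

big_sample, big_rest = fixed_size(rows, 200, rng, replacement=True)
print(len(big_sample))                    # 200, more than the 150 input rows

fold_sample, fold_rest = cross_validation(rows, 10, 0, rng)
print(len(fold_sample), len(fold_rest))   # 15 135

boot_sample, boot_rest = bootstrap(rows, rng)
print(len(boot_sample))                   # 150
```

With replacement, the remaining data holds only the rows that were never drawn, so its size varies from run to run unless the seed is fixed.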

If all data instances are selected (by setting the proportion to 100% or by setting the fixed sample size to the size of the entire data), the output instances are shuffled.

Examples
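
The replicable and stratified options, and the effect of selecting every instance, can be illustrated with a short sketch. It assumes that a replicable draw amounts to a fixed random seed and that stratification means sampling the same proportion from each class; `stratified_sample` and the placeholder rows are hypothetical and are not the widget's own code.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, label_of, proportion, seed=None):
    """Sample the same proportion from every class; return (sample, remaining)."""
    rng = random.Random(seed)   # a fixed seed makes the draw repeatable
    by_class = defaultdict(list)
    for index, row in enumerate(rows):
        by_class[label_of(row)].append(index)
    picked = []
    for indices in by_class.values():
        picked.extend(rng.sample(indices, round(len(indices) * proportion)))
    rng.shuffle(picked)         # mix the classes in the output order
    chosen = set(picked)
    sample = [rows[i] for i in picked]
    remaining = [row for i, row in enumerate(rows) if i not in chosen]
    return sample, remaining

def label(row):
    return row[1]

# Placeholder rows: (row id, class label), 50 rows per class.
rows = [(i, "abc"[i // 50]) for i in range(150)]

first, _ = stratified_sample(rows, label, 0.2, seed=1)
again, _ = stratified_sample(rows, label, 0.2, seed=1)
print(first == again)                        # True: same seed, same sample
print(sorted(Counter(map(label, first)).items()))
# [('a', 10), ('b', 10), ('c', 10)]

# Selecting 100% keeps every row but returns them in shuffled order.
everything, rest = stratified_sample(rows, label, 1.0, seed=1)
print(len(everything), len(rest))            # 150 0
```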

First, let’s see how the Data Sampler works. We will use the iris data from the File widget. We see there are 150 instances in the data. We sampled the data with the Data Sampler widget and we chose to go with a fixed sample size of 5 instances for simplicity. We can observe the sampled data in the Data Table widget (Data Table (in-sample)). The second Data Table (Data Table (out-of-sample)) shows the remaining 145 instances that weren’t in the sample. To output the out-of-sample data, double-click the connection between the widgets and rewire the output to Remaining Data –> Data.

DataSampler-Example1.png

Now, we will use the Data Sampler to split the data into training and testing parts. We are using the iris data, which we loaded with the File widget. In Data Sampler, we split the data with Fixed proportion of data, keeping 70% of the data instances in the sample.

Then we connected two outputs to the Test & Score widget, Data Sample –> Data and Remaining Data –> Test Data. Finally, we added Logistic Regression as the learner. This runs logistic regression on the Data input and evaluates the results on the Test Data.
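
The same train-and-test workflow can be sketched as a script. To stay within the standard library, this sketch makes two loud assumptions: the data is a synthetic three-class set standing in for iris, and a nearest-centroid classifier is swapped in for Logistic Regression. All function names are hypothetical. What it shows is the data flow: fit on the 70% sample, then score on the remaining 30%.

```python
import random

def make_toy_data(n_per_class=50, seed=0):
    """Synthetic two-feature, three-class data (a stand-in for iris)."""
    rng = random.Random(seed)
    centers = {"a": (0.0, 0.0), "b": (5.0, 5.0), "c": (10.0, 0.0)}
    rows = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            rows.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    return rows

def split_fixed_proportion(rows, proportion, seed=0):
    """Shuffle a copy of the rows and cut it into (sample, remaining)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * proportion)
    return shuffled[:cut], shuffled[cut:]

def fit_centroids(train):
    """Nearest-centroid learner (swapped in for Logistic Regression)."""
    sums = {}
    for (x, y), label in train:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict(centroids, point):
    """Return the label of the closest class centroid."""
    x, y = point
    return min(
        centroids,
        key=lambda lab: (x - centroids[lab][0]) ** 2 + (y - centroids[lab][1]) ** 2,
    )

def accuracy(centroids, test):
    """Share of test rows whose predicted label matches the true label."""
    hits = sum(predict(centroids, features) == label for features, label in test)
    return hits / len(test)

data = make_toy_data()
train, test = split_fixed_proportion(data, 0.7)   # Data Sample, Remaining Data
model = fit_centroids(train)                      # learn on the sample only
print(len(train), len(test))                      # 105 45
print(accuracy(model, test))                      # score on unseen rows
```

Scoring on the remaining data rather than on the sample gives an estimate of how the model behaves on instances it has not seen during training.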

DataSampler-Example2.png

Over/Undersampling

Data Sampler can also be used to oversample a minority class or undersample the majority class in the data. Let us show an example of oversampling. First, separate the minority class using a Select Rows widget. We are using the iris data from the File widget. The data set has 150 data instances, 50 of each class. Let us oversample, say, iris-setosa.

In Select Rows, set the condition to iris is iris-setosa. This will output 50 instances of the iris-setosa class. Now, connect Matching Data into the Data Sampler, select Fixed sample size, set it to, say, 100 and select Sample with replacement. Upon pressing Sample Data, the widget will output 100 instances of iris-setosa class, some of which will be duplicated (because we used Sample with replacement).

Finally, use Concatenate to join the oversampled instances and the Unmatched Data output of the Select Rows widget. This outputs a data set with 200 instances. We can observe the final results in the Distributions widget.
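
The arithmetic of this oversampling workflow can be checked with a short sketch. It is an illustration only: the rows are placeholders carrying just a row id and a class label, the measurements are omitted, and the variable names mirror the widget outputs for readability rather than any real API.

```python
import random
from collections import Counter

classes = ["iris-setosa", "iris-versicolor", "iris-virginica"]
# Placeholder rows: (row id, class label), 50 rows per class.
data = [(i, classes[i // 50]) for i in range(150)]

# Select Rows: split into Matching Data and Unmatched Data.
matching = [row for row in data if row[1] == "iris-setosa"]
unmatched = [row for row in data if row[1] != "iris-setosa"]

# Data Sampler: fixed sample size of 100, sampled with replacement.
rng = random.Random(42)
oversampled = rng.choices(matching, k=100)

# Concatenate: join the oversampled rows with the untouched classes.
combined = oversampled + unmatched
counts = Counter(label for _, label in combined)

print(len(matching), len(unmatched))   # 50 100
print(len(combined))                   # 200
print(counts["iris-setosa"], counts["iris-versicolor"], counts["iris-virginica"])
# 100 50 50
```

Because only 50 distinct iris-setosa rows exist, drawing 100 of them necessarily produces duplicates.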

DataSampler-Example-OverUnderSampling.png



