Orange: Bag of Words

Source: https://orange3-text.readthedocs.io/en/latest/widgets/bagofwords-widget.html

Generates a bag of words from the input corpus.

Input

Corpus: A collection of documents.

Output

Corpus: Corpus with bag of words features appended.

The Bag of Words model creates a corpus with word counts for each data instance (document). The count can be absolute, binary (the word is present or absent), or sublinear (the logarithm of the term frequency). The bag of words model is required in combination with the Word Enrichment widget and can also be used for predictive modelling.
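The three count types can be illustrated with a minimal sketch in plain Python. This is not the widget's implementation: the function name term_frequencies and the sample sentence are made up for illustration, and the sublinear form 1 + ln(count) is an assumed, commonly used convention that may differ from what the widget computes.

```python
import math
from collections import Counter


def term_frequencies(tokens, mode="count"):
    """Return a word -> weight mapping for one tokenized document.

    `mode` is a hypothetical parameter mirroring the three count types.
    """
    counts = Counter(tokens)
    if mode == "count":
        # Absolute number of occurrences of each word.
        return dict(counts)
    if mode == "binary":
        # 1 if the word appears at all; absent words are simply not listed.
        return {word: 1 for word in counts}
    if mode == "sublinear":
        # Assumption: 1 + ln(count), a common sublinear scaling.
        return {word: 1 + math.log(c) for word, c in counts.items()}
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical document, tokenized by whitespace.
doc = "the cat sat on the mat".split()
counts = term_frequencies(doc, "count")
binary = term_frequencies(doc, "binary")
sublinear = term_frequencies(doc, "sublinear")
```

In this sketch a word that occurs twice gets weight 2 under count, 1 under binary, and 1 + ln(2) under the assumed sublinear scaling, so repeated words matter less and less as the count grows.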

Parameters for bag of words model:
- Term Frequency:
  - Count: number of occurrences of a word in a document
  - Binary: word appears or does not appear in the document
  - Sublinear: logarithm of term frequency (count)
- Document Frequency:
  - (None)
  - IDF: inverse document frequency
  - Smooth IDF: adds one to document frequencies to prevent zero division.
- Regularization:
  - (None)
  - L1 (Sum of elements): scales each document vector so that the absolute values of its elements sum to 1
  - L2 (Euclidean): scales each document vector so that the sum of its squared elements is 1
Produce a report.
If Commit Automatically is on, changes are communicated automatically. Alternatively press Commit.
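The document frequency and regularization options listed above can be sketched in plain Python as follows. The function names and the toy corpus are hypothetical, and the formulas are assumptions: plain IDF is taken as ln(N / df), and the smoothed variant follows the convention used by scikit-learn, ln((1 + N) / (1 + df)) + 1. The widget's exact formulas may differ.

```python
import math


def document_frequencies(docs):
    """Count in how many documents each word appears."""
    df = {}
    for tokens in docs:
        for word in set(tokens):  # each document counts a word once
            df[word] = df.get(word, 0) + 1
    return df


def idf(df, n_docs, smooth=False):
    """Inverse document frequency per word (assumed formulas, see lead-in)."""
    if smooth:
        # Adding one to the document frequency keeps the denominator
        # away from zero.
        return {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}
    return {w: math.log(n_docs / d) for w, d in df.items()}


def normalize(vector, norm=None):
    """Scale a document vector with L1 or L2 normalization."""
    if norm is None:
        return list(vector)
    if norm == "l1":
        length = sum(abs(x) for x in vector)
    elif norm == "l2":
        length = math.sqrt(sum(x * x for x in vector))
    else:
        raise ValueError(f"unknown norm: {norm}")
    # An all-zero vector is returned unchanged to avoid dividing by zero.
    return [x / length for x in vector] if length else list(vector)


# Hypothetical toy corpus of three tokenized documents.
docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
df = document_frequencies(docs)
plain_idf = idf(df, len(docs))
smooth_idf = idf(df, len(docs), smooth=True)
l1 = normalize([1.0, 3.0], "l1")
l2 = normalize([3.0, 4.0], "l2")
```

Words that occur in fewer documents receive a larger IDF weight. After L1 normalization the elements of the vector sum to 1, and after L2 normalization the vector has Euclidean length 1.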

Example

In this example we will simply check what a bag of words model looks like. Load book-excerpts.tab with the Corpus widget and connect it to the Bag of Words widget. Here we deliberately keep the default parameters: the simplest setting, a plain count of term frequencies. Inspect the output of Bag of Words with the Data Table widget. The last columns hold the term frequencies for each document.

In the second example we will try to predict the document category. We again use the book-excerpts.tab data set, this time sent through Preprocess Text with default parameters. Preprocess Text is then connected to Bag of Words to obtain the term frequencies from which the model will be computed.

Connect Bag of Words to Test & Score for predictive modelling. Connect SVM or any other classifier to Test & Score as well (both on the left side). Test & Score will now compute performance scores for each learner on the input. Here we got quite impressive results with SVM. Now we can check where the model made mistakes.

Add Confusion Matrix to Test & Score. Confusion Matrix displays the numbers of correctly and incorrectly classified documents. Its Select Misclassified option outputs the misclassified documents, which we can inspect further with Corpus Viewer.
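What a confusion matrix and a misclassified selection contain can be sketched in plain Python. This is only an illustration, not how the widgets are implemented: the function names, the document names, and the class labels "A" and "B" are all made up, and no real results from book-excerpts.tab are shown.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count documents for every (actual class, predicted class) pair."""
    return Counter(zip(actual, predicted))


def misclassified(documents, actual, predicted):
    """Return the documents whose predicted class differs from the actual one."""
    return [
        doc
        for doc, a, p in zip(documents, actual, predicted)
        if a != p
    ]


# Hypothetical data: four documents with made-up labels and predictions.
documents = ["doc1", "doc2", "doc3", "doc4"]
actual = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]

matrix = confusion_matrix(actual, predicted)
errors = misclassified(documents, actual, predicted)
```

The diagonal pairs, such as ("A", "A"), count correctly classified documents, while off-diagonal pairs, such as ("A", "B"), count mistakes. The errors list plays the role of the misclassified documents passed on for inspection.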