Difference between revisions of "Orange: Topic Modelling"

Revision as of 18:33, 11 April 2020

Sumber: https://orange3-text.readthedocs.io/en/latest/widgets/topicmodelling-widget.html

Widget Topic Modelling melakukan Topic modelling menggunakan Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI) atau Hierarchical Dirichlet Process (HDP).

Input

Corpus: A collection of documents.

Output

Corpus: Corpus with topic weights appended.
Topics: Selected topics with word weights.
All Topics: Token weights per topic.

Widget Topic Modelling menemukan topik-topik abstrak dalam sebuah corpus berdasarkan kumpulan kata-kata yang ditemukan dalam setiap dokumen dan frekuensi masing-masing. Sebuah dokumen biasanya berisi banyak topik dalam proporsi berbeda, sehingga widget tersebut juga melaporkan bobot topik per dokumen.

Widget Topic Modelling membungkus gensim topic model (LSI, LDA, HDP).

Latent Semantic Indexing (LSI), dapat memberikan kata-kata positif dan negatif (kata-kata yang ada dalam topik dan yang tidak) dan bersamaan dengan bobot topik, yang bisa positif atau negatif. Seperti yang dinyatakan oleh pengembang gensim utama, Radim Řehůřek: “LSI topics are not supposed to make sense; since LSI allows negative numbers, it boils down to delicate cancellations between topics and there’s no straightforward way to interpret a topic.”

Latent Dirichlet Allocation (LDA) lebih mudah di interpretasikan, akan tetapi lebih lambat di bandingkan Latent Semantic Indexing (LSI). Hierarchical Dirichlet Process (HDP) mempunyai banyak parameter - parameter yang sesuai dengan jumlah topik adalah Top level truncation level (T). Jumlah topik terkecil yang dapat diambil seseorang adalah 10.

Topic modelling algorithm:
- Latent Semantic Indexing. Returns both negative and positive words and topic weights.
- Latent Dirichlet Allocation
- Hierarchical Dirichlet Process
Parameters for the algorithm. LSI and LDA accept only the number of topics modelled, with the default set to 10.

HDP, however, has more parameters. As this algorithm is computationally very demanding, we recommend you to try it on a subset or set all the required parameters in advance and only then run the algorithm (connect the input to the widget).

- First level concentration (γ): distribution at the first (corpus) level of Dirichlet Process
- Second level concentration (α): distribution at the second (document) level of Dirichlet Process
- The topic Dirichlet (α): concentration parameter used for the topic draws
- Top level truncation (Τ): corpus-level truncation (no of topics)
- Second level truncation (Κ): document-level truncation (no of topics)
- Learning rate (κ): step size
- Slow down parameter (τ)
Produce a report.
If Commit Automatically is on, changes are communicated automatically. Alternatively press Commit.

Contoh

Exploring Individual Topics

Dalam contoh ini, diperlihatkan penggunaan sederhana widget Topic Modelling. Pertama-tama kita load kumpulan data grimm-tales-selected.tab dan menggunakan widget Preprocess Text untuk tokenize hanya dengan kata-kata dan menghapus stopword. Kemudian kami menghubungkan widget Preprocess Text ke widget Topic Modelling, di mana kita menggunakan Latent Semantic Indexing (LSI) sederhana untuk menemukan 10 topik dalam text.

Latent Semantic Indexing (LSI) memberikan bobot positif dan negatif per topik. Bobot positif berarti kata yang digunakan sangat mewakili suatu topik, sedangkan bobot negatif berarti kata tersebut sangat tidak mewakili topik (semakin sedikit hal itu terjadi dalam text, semakin besar kemungkinan topik itu). Kata-kata positif berwarna hijau dan kata-kata negatif berwarna merah.

We then select the first topic and display the most frequent words in the topic in Word Cloud. We also connected Preprocess Text to Word Cloud in order to be able to output selected documents. Now we can select a specific word in the word cloud, say little. It will be colored red and also highlighted in the word list on the left.

Now we can observe all the documents containing the word little in Corpus Viewer.

Topic Visualization

In the second example, we will look at the correlation between topics and words/documents. We are still using the grimm-tales-selected.tab corpus. In Preprocess Text we are using the default preprocessing, with an additional filter by document frequency (0.1 - 0.9). In Topic Modelling we are using LDA model with 5 topics.

Connect Topic Modelling to MDS. Ensure the link is set to All Topics - Data. Topic Modelling will output a matrix of word weights by topic.

In MDS, the points are now topics. We have set the size of the points to Marginal topic probability, which is an additional columns of All Topics - it reports on the marginal probability of the topic in the corpus (how strongly represented is the topic in the corpus).

We can now explore which words are representative for the topic. Select, say, Topic 5 from the plot and connect MDS to Box Plot. Make sure the output is set to Data - Data (not Selected Data - Data).

In Box Plot, set the subgroup to Selected and check the Order by relevance to subgroups box. This option will sort the variables by how well they separate between the selected subgroup values. In our case, this means which words are the most representative for the topic we have selected in the plot (subgroup Yes means selected).

We can see that little, children and kings are the most representative words for Topic 5, with good separation between the word frequency for this topic and all the others. Select other topics in MDS and see how the Box Plot changes.

Referensi

https://orange3-text.readthedocs.io/en/latest/widgets/topicmodelling-widget.html

Pranala Menarik

Orange

@@ Line 45: / Line 45: @@
 ===Exploring Individual Topics===
-In the first example, we present a simple use of the Topic Modelling widget. First we load grimm-tales-selected.tab data set and use Preprocess Text to tokenize by words only and remove stopwords. Then we connect Preprocess Text to Topic Modelling, where we use a simple Latent Semantic Indexing to find 10 topics in the text.
+Dalam contoh ini, diperlihatkan penggunaan sederhana widget Topic Modelling. Pertama-tama kita load kumpulan data grimm-tales-selected.tab dan menggunakan widget Preprocess Text  untuk tokenize hanya dengan kata-kata dan menghapus stopword. Kemudian kami menghubungkan widget Preprocess Text ke widget Topic Modelling, di mana kita menggunakan Latent Semantic Indexing (LSI) sederhana untuk menemukan 10 topik dalam text.
-[[File:Topic-Modelling-Example1.png|center|200px|thumb]]
+[[File:Topic-Modelling-Example1.png|center|600px|thumb]]
-LSI provides both positive and negative weights per topic. A positive weight means the word is highly representative of a topic, while a negative weight means the word is highly unrepresentative of a topic (the less it occurs in a text, the more likely the topic). Positive words are colored green and negative words are colored red.
+Latent Semantic Indexing (LSI) memberikan bobot positif dan negatif per topik. Bobot positif berarti kata yang digunakan sangat mewakili suatu topik, sedangkan bobot negatif berarti kata tersebut sangat tidak mewakili topik (semakin sedikit hal itu terjadi dalam text, semakin besar kemungkinan topik itu). Kata-kata positif berwarna hijau dan kata-kata negatif berwarna merah.
 We then select the first topic and display the most frequent words in the topic in Word Cloud. We also connected Preprocess Text to Word Cloud in order to be able to output selected documents. Now we can select a specific word in the word cloud, say little. It will be colored red and also highlighted in the word list on the left.