Python: Gephi + MALLET + EMDA


As I prepare to attend Early Modern Digital Agendas next week, I’ve been exploring a few tools that have been on my to-try list for a while: things that have come up at many a DH-related event over the years.

Gephi

I’m embarrassed to say that I haven’t really done any data visualization myself beyond the occasional wordle. Reading a handful of data viz blogs or Tufte, for example, has therefore been an act of imagination rather than practicality. But today I tinkered with Gephi to get a visual glimpse of where EMDA participants’ interests lay. Here’s one of the better visualizations I made (words are stemmed):

Gephi text viz

To do this, I dumped all of our application essays into one big .txt file, stripping out essay titles and name/page number headers. Then I processed the text using Python and NLTK to make a Gephi-friendly XML file, following the algorithm and file format described in the article “Identifying the Pathways for Meaning Circulation using Text Network Analysis.” You can see my script on GitHub. (Don’t make too much fun of my novice code.)
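For readers who want to try the XML step, here is a minimal sketch of writing word pairs to a file Gephi can open. It assumes the GEXF format, which is one of the XML formats Gephi reads; the post does not say which XML format the original script produced. The function name `edges_to_gexf` and the toy data are hypothetical.

```python
import xml.etree.ElementTree as ET

def edges_to_gexf(edges):
    """Build a GEXF 1.2 document from {(word_a, word_b): weight}.

    Assumption: GEXF is used here for illustration only; the original
    script's exact output format is not stated in the post.
    """
    # Every word that appears in any pair becomes a node.
    words = sorted({w for pair in edges for w in pair})
    ids = {w: str(i) for i, w in enumerate(words)}

    root = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(root, "graph", mode="static", defaultedgetype="undirected")

    nodes_el = ET.SubElement(graph, "nodes")
    for w in words:
        ET.SubElement(nodes_el, "node", id=ids[w], label=w)

    edges_el = ET.SubElement(graph, "edges")
    for n, ((a, b), weight) in enumerate(sorted(edges.items())):
        ET.SubElement(edges_el, "edge", id=str(n),
                      source=ids[a], target=ids[b], weight=str(weight))

    return ET.tostring(root, encoding="unicode")

# Toy input: two weighted word pairs (made-up numbers).
gexf_text = edges_to_gexf({("digit", "human"): 4, ("human", "tool"): 2})
```

Saving `gexf_text` to a file with a .gexf extension should give something Gephi can import, with the pair counts carried as edge weights.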

This Python script spits out each stemmed non-stopword as a node, and counts word-pairs as edges. That is, an edge occurs whenever one word occurs within 4 words of another word. The edge weight increases with the frequency of the word-pair. So digit human is a strong word pair because we mention digital humanities quite a lot.
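The pair-counting idea can be sketched in a few lines. This is not the original script. It assumes the tokens are already stemmed and stopword-free, and it reads "within 4 words" as "at most 4 positions apart", which is one possible interpretation. The name `cooccurrence_edges` is hypothetical.

```python
from collections import Counter

def cooccurrence_edges(tokens, window=4):
    """Count unordered word pairs whose positions differ by at most `window`."""
    edges = Counter()
    for i, word in enumerate(tokens):
        # Look ahead only, so each pair of positions is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:  # skip self-loops
                edges[tuple(sorted((word, other)))] += 1
    return edges

# Toy input standing in for stemmed, stopword-free essay text.
tokens = "digit human tool digit human text".split()
edges = cooccurrence_edges(tokens)
strongest_pair, strongest_weight = edges.most_common(1)[0]
```

On the toy input the heaviest edge is the `digit`/`human` pair, for the same reason as in the essays: the two words keep turning up next to each other.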

My script’s output gave me 2,900 nodes and 11,000 edges. I filtered out nodes with a degree lower than 17 so we’d only be looking at the top 175 nodes. Then I used the modularity algorithm, which detects ‘communities’ (almost like topics?). With a modularity resolution of 2.0, I narrowed it down to 10 communities, which are indicated by color in the visualization above. They’re sort of clustered. I’m not really sure if this is a good visualization. It seems like it is, but I’m not experienced enough to critique knowledgeably.
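The degree filter was done inside Gephi, but the idea is easy to sketch outside it. The version below counts each node's distinct neighbors in the full graph and keeps nodes at or above a threshold in a single pass. Whether this matches Gephi's filter exactly is an assumption, and `filter_by_degree` and the toy data are hypothetical.

```python
from collections import defaultdict

def filter_by_degree(edges, min_degree):
    """Keep nodes with at least `min_degree` distinct neighbors.

    Degrees are computed once on the full graph; the surviving subgraph
    is not re-filtered afterwards (an assumption of this sketch).
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    keep = {node for node, nbrs in neighbors.items() if len(nbrs) >= min_degree}
    # An edge survives only if both of its endpoints survive.
    kept_edges = {pair: w for pair, w in edges.items()
                  if pair[0] in keep and pair[1] in keep}
    return keep, kept_edges

# Toy graph with made-up weights; the post used a threshold of 17.
toy_edges = {("digit", "human"): 4, ("digit", "tool"): 2,
             ("human", "tool"): 2, ("text", "tool"): 1}
keep, kept = filter_by_degree(toy_edges, min_degree=2)
```

Here `text` has only one neighbor, so it drops out along with its single edge.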

And what does it look like if the visualization considers all 2900 nodes? Here’s one look:

2900 nodes in 39 communities, not really clustered at all, no labels, data party!

Gephi text viz

Circle layout, ordered by community. Crisscrossing lines show relationships between word communities.

MALLET

I also tried out topic modeling using MALLET on the same essay dump. Here’s a list of topics limited to 5:


Well, this was rather fun. And all this from a relatively small text. Methinks my MacBook Air would explode if I cranked whole corpora through these exercises.

