Difference between revisions of "Python: Gephi + MALLET + EMDA"

Latest revision as of 06:35, 30 January 2017

As I prepare to attend Early Modern Digital Agendas next week, I’ve been exploring a few tools that have been on my to-try list for a while, things that have come up at many a DH-related event over the years.

Gephi

I’m embarrassed to say that I haven’t really done any data visualization myself beyond the occasional wordle. Reading a handful of data viz blogs or Tufte, for example, has therefore been an act of imagination rather than practicality. But today I tinkered with Gephi to get a visual glimpse of where EMDA participants’ interests lay. Here’s one of the better visualizations I made (words are stemmed):

Figure: Gephi text viz

To do this, I dumped all of our application essays into one big .txt file, stripping out essay titles and name/page number headers. Then I processed the text using Python and NLTK to make a Gephi-friendly XML file, following the algorithm and file format described in the article “Identifying the Pathways for Meaning Circulation using Text Network Analysis.” You can see my script on GitHub. (Don’t make too much fun of my novice code.)
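For a sense of what a Gephi-friendly XML file can look like, here is a minimal sketch. The post does not name the format the script wrote, so GEXF (one of the XML formats Gephi opens) is an assumption here, and `build_gexf` and its input shape are hypothetical names for illustration, not the original script.

```python
import xml.etree.ElementTree as ET

# Assumption: GEXF 1.2 is the target format; the post only says "XML".
GEXF_NS = "http://www.gexf.net/1.2draft"


def build_gexf(edge_weights):
    """Serialize {(word_a, word_b): count} as an undirected GEXF graph string."""
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"defaultedgetype": "undirected"})
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")

    # Every word that appears in any pair becomes one node.
    words = sorted({word for pair in edge_weights for word in pair})
    ids = {word: str(i) for i, word in enumerate(words)}
    for word in words:
        ET.SubElement(nodes_el, "node", {"id": ids[word], "label": word})

    # Every word pair becomes one weighted edge.
    for n, ((a, b), weight) in enumerate(sorted(edge_weights.items())):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(n), "source": ids[a], "target": ids[b], "weight": str(weight)},
        )
    return ET.tostring(gexf, encoding="unicode")


if __name__ == "__main__":
    print(build_gexf({("digit", "human"): 3, ("digit", "tool"): 1}))
```

Saving the returned string to a file with a .gexf extension should give something Gephi can open, with the pair counts showing up as edge weights.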

This Python script spits out each stemmed non-stopword as a node, and counts word-pairs as edges. That is, an edge occurs whenever one word occurs within 4 words of another word. The edge weight increases with the frequency of the word-pair. So “digit human” is a strong word pair because we mention digital humanities quite a lot.
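The node-and-edge counting described above can be sketched roughly like this. It is not the original script: the stopword list is a tiny stand-in for NLTK's, stemming is left as a pluggable `stem` function (NLTK's `PorterStemmer().stem` could be passed in), and counting the 4-word window after stopwords are removed is my assumption about how "within 4 words" is measured.

```python
import re
from collections import Counter

# Stand-in stopword list (assumption); the original used NLTK's resources.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "in", "to", "we", "is"})


def tokenize(text, stopwords=STOPWORDS, stem=lambda word: word):
    """Lowercase, keep alphabetic words, drop stopwords, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(word) for word in words if word not in stopwords]


def cooccurrence_edges(tokens, window=4):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    edges = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at the next `window` tokens only, so each pairing is counted once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                edges[tuple(sorted((left, right)))] += 1
    return edges


if __name__ == "__main__":
    sample = "We study digital humanities and the digital humanities community"
    for pair, weight in cooccurrence_edges(tokenize(sample)).most_common(3):
        print(pair, weight)
```

Sorting each pair before counting treats the graph as undirected, so "digital humanities" and "humanities digital" add to the same edge weight.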

My script’s output gave me 2,900 nodes and 11,000 edges. I filtered out nodes with a degree below 17 so we’d only be looking at the top 175 nodes. Then I used the modularity algorithm, which detects ‘communities’ (almost like topics?). With a modularity resolution of 2.0, I narrowed it down to 10 communities, which are indicated by color in the visualization above. They’re sort of clustered. I’m not really sure if this is a good visualization: it seems like it is, but I’m not experienced enough to critique it knowledgeably.
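The degree filter is easy to reproduce outside Gephi. Here is a minimal sketch, assuming edges are stored as a dict of word pairs to weights and that degree means the number of distinct neighbors; the function names are hypothetical, and whether Gephi's own filter behaves identically (for instance, whether it recomputes degrees after removing nodes) is not something I have checked. Community detection itself is left to Gephi.

```python
from collections import defaultdict


def degrees(edges):
    """Map each node to its number of distinct neighbors."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


def filter_by_degree(edges, min_degree):
    """Keep only edges whose two endpoints both have degree >= min_degree.

    Degrees are computed once on the full graph (a single pass, not repeated
    until stable).
    """
    degree = degrees(edges)
    keep = {node for node, d in degree.items() if d >= min_degree}
    return {pair: w for pair, w in edges.items() if pair[0] in keep and pair[1] in keep}


if __name__ == "__main__":
    toy = {("a", "b"): 1, ("a", "c"): 1, ("a", "d"): 1, ("b", "c"): 1}
    print(filter_by_degree(toy, min_degree=2))
```

With the post's numbers, the call would use `min_degree=17` on the full edge dict.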

And what does it look like if the visualization considers all 2,900 nodes? Here’s one look:

Figure: 2,900 nodes in 39 communities, not really clustered at all, no labels, data party!

Figure: Gephi text viz. Circle layout, ordered by community. Crisscrossing lines show relationships between word communities.

MALLET

I also tried out topic modeling using MALLET on the same essay dump. Here’s a list of topics limited to 5:

Topic 1: network seminar social historical milton reading scholarship field make approach sdfb form networks terms community long society fact benefit
Topic 2: digital humanities work early projects institute research english teaching scholarly editions current library working students future experience part scholarship
Topic 3: digital research shakespeare institute studies university tools methods dh hope language graduate develop study analysis based corpus large focus
Topic 4: early modern project texts agendas scholars resources eebo questions folger ways period works books online tcp information bring existing
Topic 5: data digital literary media history book time text database century archives order learn political share cultural narrative eager press

And limited to 10:

Topic 1: early modern scholars network social texts agendas words scholarship sdfb approach inquiry criticism mining world chapter ontologies actors persons
Topic 2: texts ways eebo questions folger period tcp bring existing online understand books reading corpus neh work present develop understanding
Topic 3: digital humanities modern university tools english teaching study part projects scholarship development practice future professional provide past technology developing
Topic 4: data work digital opportunity seminar database archives discussions literary narrative text eager press conversations relationships relationship archive interface interested
Topic 5: shakespeare research work dh hope language methods graduate analysis agendas based application plan training university literature approaches writing benefit
Topic 6: media milton interest field historical society theory means performance larger arts prose write reflect professor team college readings basic
Topic 7: projects resources library students experience faculty collections london make explore information place john practical curation center important end moeml
Topic 8: history early book time project technologies build share paper agendas experiences scale poems writers space ocr thinking courses form
Topic 9: early modern institute research project studies scholarly editions working current summer knowledge edition large collaborative textual electronic renaissance participation
Topic 10: literary century order political text methods historical works topic natural learn great scientific computer public complex discuss eighteenth applying

Well, this was rather fun. And all this from a relatively small text. Methinks my MacBook Air would explode if I cranked whole corpora through these exercises.

References

https://www.robincamille.com/2013-07-03-gephi-emda/