Difference between revisions of "Python: Gephi + MALLET + EMDA"

MALLET

I also tried out topic modeling using MALLET on the same essay dump. Here’s a list of topics limited to 5:

1. network seminar social historical milton reading scholarship field make approach sdfb form networks terms community long society fact benefit
2. digital humanities work early projects institute research english teaching scholarly editions current library working students future experience part scholarship
3. digital research shakespeare institute studies university tools methods dh hope language graduate develop study analysis based corpus large focus
4. early modern project texts agendas scholars resources eebo questions folger ways period works books online tcp information bring existing
5. data digital literary media history book time text database century archives order learn political share cultural narrative eager press

And limited to 10:

1. early modern scholars network social texts agendas words scholarship sdfb approach inquiry criticism mining world chapter ontologies actors persons
2. texts ways eebo questions folger period tcp bring existing online understand books reading corpus neh work present develop understanding
3. digital humanities modern university tools english teaching study part projects scholarship development practice future professional provide past technology developing
4. data work digital opportunity seminar database archives discussions literary narrative text eager press conversations relationships relationship archive interface interested
5. shakespeare research work dh hope language methods graduate analysis agendas based application plan training university literature approaches writing benefit
6. media milton interest field historical society theory means performance larger arts prose write reflect professor team college readings basic
7. projects resources library students experience faculty collections london make explore information place john practical curation center important end moeml
8. history early book time project technologies build share paper agendas experiences scale poems writers space ocr thinking courses form
9. early modern institute research project studies scholarly editions working current summer knowledge edition large collaborative textual electronic renaissance participation
10. literary century order political text methods historical works topic natural learn great scientific computer public complex discuss eighteenth applying
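
Word lists like these are what MALLET can write to a "topic keys" file (its `train-topics` command has an `--output-topic-keys` option, and `--num-topics` sets the topic count). The post doesn't say how the lists were produced, so the following is only a sketch, assuming the usual tab-separated layout of topic id, weight, and space-separated words. The function name `parse_topic_keys` and the sample text are made up for illustration.

```python
def parse_topic_keys(text):
    """Parse lines of the assumed form 'id<TAB>weight<TAB>word word ...'."""
    topics = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, _weight, words = line.split("\t", 2)
        topics[int(topic_id)] = words.split()
    return topics


# Placeholder sample: the ids and weights are invented, not the post's real output.
sample = "0\t0.5\tnetwork seminar social\n1\t0.5\tdigital humanities work\n"
topics = parse_topic_keys(sample)
for topic_id, words in topics.items():
    print(topic_id, " ".join(words))
```

Keeping each topic as a list of words makes it easy to compare runs, for example to see which words appear in both the 5-topic and the 10-topic output.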
 
  
 

Latest revision as of 06:35, 30 January 2017

As I prepare to attend Early Modern Digital Agendas next week, I’ve been exploring a few tools that have been on my to-try list for a while, things that have come up at many a DH-related event over the years.

Gephi

I’m embarrassed to say that I haven’t really done any data visualization myself beyond the occasional Wordle. Reading a handful of data viz blogs or Tufte, for example, has therefore been an act of imagination rather than practicality. But today I tinkered with Gephi to get a visual glimpse of where EMDA participants’ interests lie. Here’s one of the better visualizations I made (words are stemmed):

Gephi text viz

To do this, I dumped all of our application essays into one big .txt file, stripping out essay titles and name/page number headers. Then I processed the text using Python and NLTK to make a Gephi-friendly XML file, following the algorithm and file format described in the article “Identifying the Pathways for Meaning Circulation using Text Network Analysis.” You can see my script on GitHub. (Don’t make too much fun of my novice code.)
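
For a rough idea of what a Gephi-friendly XML file can look like, here is a minimal sketch that writes a tiny word network as GEXF, one XML graph format that Gephi can open. Whether the linked script wrote GEXF is an assumption, and the function name `edges_to_gexf` and the two sample pairs with their weights are made up for illustration.

```python
import xml.etree.ElementTree as ET


def edges_to_gexf(edges):
    """Serialize {(word_a, word_b): weight} as a minimal GEXF 1.2 string."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", mode="static", defaultedgetype="undirected")
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")

    # One node per distinct word; the word doubles as id and label.
    for word in sorted({w for pair in edges for w in pair}):
        ET.SubElement(nodes_el, "node", id=word, label=word)

    # One edge per word pair, carrying the pair count as its weight.
    for n, ((a, b), weight) in enumerate(sorted(edges.items())):
        ET.SubElement(edges_el, "edge", id=str(n), source=a, target=b, weight=str(weight))

    return ET.tostring(gexf, encoding="unicode")


# Hypothetical stemmed pairs and counts, for illustration only.
xml_text = edges_to_gexf({("digit", "human"): 12, ("earli", "modern"): 9})
print(xml_text)
```

Saving `xml_text` to a file with a `.gexf` extension should give something Gephi can import, with edge weights available for layout and filtering.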

This Python script spits out each stemmed non-stopword as a node, and counts word pairs as edges. That is, an edge occurs whenever one word occurs within 4 words of another word. The edge weight increases with the frequency of the word pair. So “digit human” is a strong word pair because we mention digital humanities quite a lot.
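
The pair counting can be sketched in a few lines of plain Python. This is not the script linked above: it skips stemming (the original used NLTK), uses a tiny made-up stopword list, and assumes the 4-word window is measured after stopwords are removed. The name `cooccurrence_edges` is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller one (e.g. NLTK's).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "we", "on", "for"}


def cooccurrence_edges(text, window=4, stopwords=STOPWORDS):
    """Count unordered word pairs that occur within `window` words of each other."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
    edges = Counter()
    for i, word in enumerate(words):
        # Pair this word with each of the next `window` words.
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                edges[tuple(sorted((word, other)))] += 1
    return edges


sample = "Digital humanities tools help digital humanities scholars."
edges = cooccurrence_edges(sample)
for pair, weight in edges.most_common(3):
    print(pair, weight)
```

Sorting each pair before counting treats the network as undirected, so "digital humanities" and "humanities digital" add to the same edge weight.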

The data my script output gave me 2,900 nodes and 11,000 edges. I filtered out nodes with a degree below 17 so we’d only be looking at the top 175 nodes. Then I used the modularity algorithm, which detects ‘communities’ (almost like topics?). With a modularity resolution of 2.0, I narrowed it down to 10 communities, which are indicated by color in the visualization above. They’re sort of clustered. I’m not really sure if this is a good visualization. It seems like it is, but I’m not experienced enough to critique knowledgeably.
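
The filtering step happened inside Gephi, but the idea behind a degree cutoff is simple enough to sketch. The snippet below assumes "degree" means the number of distinct neighbours a word has, computed once on the full graph. The function name `filter_by_degree` and the toy edges are made up, and community detection by modularity is not reproduced here.

```python
from collections import Counter


def filter_by_degree(edges, min_degree):
    """Keep only edges whose two endpoints both have degree >= min_degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    keep = {node for node, d in degree.items() if d >= min_degree}
    return {pair: w for pair, w in edges.items() if pair[0] in keep and pair[1] in keep}


# Hypothetical toy network: "ocr" has only one neighbour and gets dropped.
toy = {("digit", "human"): 12, ("digit", "text"): 3, ("human", "text"): 2, ("text", "ocr"): 1}
filtered = filter_by_degree(toy, min_degree=2)
print(sorted(filtered))
```

With the real data the cutoff would be 17 rather than 2. Note that this single pass does not recompute degrees after nodes are removed, which a stricter filter might do.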

And what does it look like if the visualization considers all 2900 nodes? Here’s one look:

2900 nodes in 39 communities, not really clustered at all, no labels, data party!

Gephi text viz

Circle layout, ordered by community. Crisscrossing lines show relationships between word communities.

MALLET

I also tried out topic modeling using MALLET on the same essay dump. Here’s a list of topics limited to 5:


Well, this was rather fun. And all this from a relatively small text. Methinks my MacBook Air would explode if I cranked whole corpora through these exercises.


References