Text Mining in R and Python: 8 Tips To Get Started
You want to get started with text mining, but most of the tutorials you try get complex very quickly? Or you can’t find a proper data set to work on?
DataCamp’s latest post will walk you through 8 tips and tricks that will help you start text mining and stay hooked on it.
1. Get Curious About Text
The first step to almost anything in data science is to get curious. Text mining is no exception to that.
You should get curious about text like David Robinson, data scientist at Stack Overflow, described in his blog a couple of weeks ago: “I saw a hypothesis […] that simply begged to be investigated with data”.
Or maybe, if you’re not really into verifying hypotheses, you should get curious about that cool word cloud you saw, realizing you want to reproduce it for yourself.
Do you still need to be convinced of how cool text mining can be?
Get inspired by one of the many text mining use cases that recently got a lot of attention in the media, like the text mining and analysis of South Park dialogue, film dialogue, …
2. Get The Skills and Knowledge You Need
Once you have gotten curious, it’s time to step up your game and start developing your text mining knowledge and skills. You can easily do this by completing some tutorials and courses.
What you should look out for in these courses is that they introduce you to at least some of the steps that you find in a data science workflow, such as data preparation or preprocessing, data exploration, data analysis, …
DataCamp offers some material for those who are looking to get started with text mining: recently, Ted Kwartler wrote a guest tutorial on mining data from Google Trends and Yahoo’s stock service. This easy-to-follow R tutorial lets you learn text mining by doing and is a great starting point for any text mining beginner.
In addition, Ted Kwartler is also the instructor of DataCamp’s R course “Text Mining: Bag of Words”, which will introduce you to a variety of essential topics for analyzing and visualizing data and lets you practice your acquired text mining skills on a real-world case study.
There is also material out there that is not limited to R. For Python, you could check out these tutorials and courses: for an introduction to text analysis in Python, go to this tutorial, or work through this introductory Kaggle tutorial.
Are you, however, more interested in other resources? Go to DataCamp’s Learn Data Science – Resources for Python & R tutorial!
3. Words, Words, Words – Finding Your Data
Once you have gotten the hang of the essential concepts and topics that you need to analyze and visualize your data, it is time to go and find the data!
And believe us when we tell you that there are a lot of ways to get your data. Besides the Google Trends and Yahoo data mentioned above, you can also access data from:
- Twitter! Both R and Python offer packages or libraries that will allow you to connect to the Twitter API and retrieve tweets. You will learn more about this in the next section.
- The Internet Archive, a non-profit library of millions of free books, movies, software, music, websites, and more.
- Project Gutenberg offers over 55,000 free ebooks. Most of them are established literature, which makes it a good source if you want to analyze the works of authors like Shakespeare, Jane Austen, or Edgar Allan Poe.
- For an academic approach to text mining, you can use JSTOR’s Data for Research: a free, self-service tool that allows computer scientists, digital humanists, and other researchers to select and interact with content on JSTOR.
- If you’re looking to do text mining on series or movies, just like in the examples given above, you might want to consider downloading the subtitles. A simple Google search can definitely provide you with what you need to form your own corpus and get started on text mining.
You can also get your data from corpora. Two of the well-known corpora are:
- Reuters Text Corpus. Some will argue that it is not the most diverse corpus, but it is excellent if you’re just starting to learn text mining.
- The Brown Corpus contains text from 500 sources, categorized by genre (a sketch for loading it with Python’s nltk package follows this list).
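Both corpora happen to be bundled with the nltk package that you’ll meet in the next section. A minimal sketch for the Brown Corpus, assuming nltk is installed and nltk.download("brown") has been run once:

```python
# Minimal sketch: loading the Brown Corpus through nltk.
# Assumes nltk is installed and nltk.download("brown") has been run once.
from nltk.corpus import brown

print(brown.categories())                   # the genre labels
print(brown.words(categories="news")[:10])  # first tokens of the "news" genre
```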
As you can see, the possibilities are endless. Everything that contains text can become the topic of your text mining case study.
4. Finding The Right Tools For The Job
Now that you have found the source of your data, you probably want to use the right tools to retrieve it and to perform your analysis on it.
The tutorials and courses you have followed will have given you some tools to get started.
But, depending on which courses or tutorials you have followed, you might be missing a couple. To be complete, here’s a list of some of the packages that are used for text mining in R:
- One of the most used packages for text mining in R is, without a doubt, the tm package. This package is often used in addition to more specific packages, like for example the twitteR package, which you can use to extract tweets and followers from the Twitter website.
- To do web scraping with R, you should go for the rvest library. For a short tutorial on the use of rvest, go here.
For Python, you can rely on these libraries:
- The text mining 1.0 package contains a variety of useful functions for text mining in Python.
- The natural language toolkit, contained within the nltk package. This package can be extremely useful because you have easy access to over 50 corpora and lexical resources. You can see a list of these on this page.
- If you want to mine Twitter data, you have a lot of package choices. One of the most widely used is the tweepy package.
- For web scraping, the scrapy package will come in handy to extract the data you need from websites. Also consider urllib2, a package for opening URLs. The requests package is often recommended instead, and it can be more convenient to use: some describe it as ‘more human’ or declarative, because tasks like setting the user-agent and requesting a page take only one line of code (see the sketch right after this list). You will also sometimes see people refer to the urllib package, but it doesn’t seem all that popular: most developers mention only the one or two functions from it that they find particularly useful.
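To make that one-line claim concrete, here is a minimal sketch with requests; the Project Gutenberg URL is only an example of a text you might fetch:

```python
# Minimal sketch: fetching raw text with requests.
# The URL is only an example; any page you are allowed to scrape works.
import requests

response = requests.get(
    "https://www.gutenberg.org/files/11/11-0.txt",   # example: Alice in Wonderland
    headers={"User-Agent": "text-mining-tutorial"},  # setting the user-agent inline
    timeout=10,
)
response.raise_for_status()  # fail loudly on HTTP errors
print(response.text[:200])   # peek at the raw text
```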
5. Preparation Is Half The Battle – Preprocessing Your Data
It probably doesn’t come as a surprise that data scientists spend 80% of their time cleaning their data.
Text mining is no exception in this respect.
Textual data can be dirty, so make sure that you spend enough time cleaning it.
If you’re unsure of what preprocessing your data means, some of the standard preprocessing steps include:
- Extracting text and structure so that you have the textual format you want to process,
- Removing stopwords such as “that” or “and”,
- Stemming, which you use to extract the root of words. This can be done with the help of a dictionary or with linguistic rules or algorithms such as Porter’s Algorithm.
These steps may seem hard, but preprocessing your data doesn’t need to be.
For the most part, the libraries and packages mentioned in the previous section can already help you a lot. For example, the tm library in R allows you to do some preprocessing with its built-in functions: you can do stemming and remove stop words, eliminate white spaces, and convert the words to lowercase. Similarly, the nltk package in Python covers much of the preprocessing with its built-in functions, as the sketch below shows.
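A minimal sketch of these steps in Python, assuming the punkt and stopwords resources have been downloaded with nltk.download():

```python
# Minimal preprocessing sketch with nltk: lowercase, tokenize,
# remove stopwords, and stem with Porter's algorithm.
# Assumes nltk.download("punkt") and nltk.download("stopwords") have been run.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

text = "Textual data can be dirty, so spend enough time cleaning it."

tokens = word_tokenize(text.lower())     # lowercase and tokenize
stops = set(stopwords.words("english"))  # standard English stopword list
words = [t for t in tokens if t.isalpha() and t not in stops]

stemmer = PorterStemmer()                # Porter's stemming algorithm
print([stemmer.stem(w) for w in words])
```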
However, you can still go a step further and also do some preprocessing based on regular expressions to describe the character patterns which interest you. This way, you will also speed up the process of data cleaning a bit.
For Python, you can make use of the re library and for R, there are a bunch of functions that can help you out, such as grep(), grepl(), regexpr(), gregexpr(), sub(), gsub(), and strsplit().
If you want to know more about these functions and regular expressions in R, you can always check out this page.
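As a small illustration of regex-based cleaning with Python’s re library (the patterns are examples, not a universal recipe):

```python
# Minimal sketch of regex-based cleaning with re.
# The patterns below are illustrative, not a one-size-fits-all recipe.
import re

raw = "Check out   https://www.datacamp.com &amp; say hi!"
no_urls = re.sub(r"https?://\S+", "", raw)        # strip URLs
no_entities = re.sub(r"&\w+;", " ", no_urls)      # strip leftover HTML entities
clean = re.sub(r"\s+", " ", no_entities).strip()  # collapse whitespace
print(clean)                                      # "Check out say hi!"
```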
6. Data Scientist’s Adventures in Wonderland – Exploring Your Data
By now, you will be excited to get started on your analysis. It is, however, always a good idea to take a look at your data first.
Some ideas to quickly get started on exploring your data with the help of the base packages or the libraries that have been mentioned above:
- Create a document-term matrix: elements in this matrix represent the occurrence of a term (a word or an n-gram) in a document of the corpus (a minimal sketch follows this list).
- Once you have built the document-term matrix, you can use a histogram to visualize the frequency of the words in your corpus.
- You might also be interested in knowing the correlation between two or more terms in your corpus.
- To visualize your corpus, you can also make a word cloud. In R, you can make use of the wordcloud library. A Python package with the same name also exists if you want to do the same in Python.
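A minimal sketch of the first idea in Python, using scikit-learn’s CountVectorizer; scikit-learn isn’t mentioned above, so treat it as one common choice rather than the only one:

```python
# Minimal sketch: a document-term matrix with scikit-learn's CountVectorizer.
# scikit-learn is an assumption here -- one common choice, not the only one.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Text mining is fun",
    "Mining text takes preprocessing",
    "Preprocessing is half the battle",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)     # documents x terms, sparse matrix
print(vectorizer.get_feature_names_out())  # the terms (matrix columns)
print(dtm.toarray())                       # term counts per document
print(dtm.toarray().sum(axis=0))           # corpus-wide term frequencies
```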
The nice thing about exploring your data before diving into your analysis is that you already have an idea of what you’ll be working with. If you see in the document-term matrix or the histogram that you have a lot of sparse terms, you can decide to remove them from your corpus.
7. Level Up Your Text Mining Skills
Once you have preprocessed your data and done a basic textual analysis of it with the tools mentioned in the previous step, you might also consider using your data set to broaden your text mining skills.
Because there is so much more.
You have only seen the tip of the iceberg when it comes to text mining.
Firstly, you should consider exploring the difference between text mining and Natural Language Processing (NLP). With NLP, you will discover Named Entity Recognition, POS tagging and parsers, sentiment analysis, … More NLP libraries for R can be found on this page. For Python, you can make use of the nltk package; you can find a full tutorial on sentiment analysis with the nltk package here.
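As a tiny taste, here is a sentiment sketch with nltk’s VADER analyzer; VADER is just one of several sentiment tools in nltk and not necessarily the approach the linked tutorial takes, and it assumes nltk.download("vader_lexicon") has been run:

```python
# Minimal sentiment sketch with nltk's VADER analyzer.
# Assumes nltk.download("vader_lexicon") has been run; VADER is only one
# of several sentiment tools in nltk, used here purely as an illustration.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("Text mining is surprisingly fun!"))
# -> a dict with 'neg', 'neu', 'pos', and 'compound' scores
```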
Besides these packages, you can check out more tools to get started on topics such as deep learning and statistical topic modeling (such as Latent Dirichlet Allocation or LDA), among the many others that exist. Some of the packages that you can use to approach these topics are listed below:
- Python packages: gensim implements word2vec, among other models, and there are also GloVe implementations available. theano should probably be on your list if you want to explore deep learning further. Lastly, use gensim if you want to implement LDA (see the sketch right after this list).
- R packages: for an approach to text mining with deep learning in R, use text2vec. If, however, you’re more interested in sentiment analysis, the syuzhet library in combination with the tm library is probably the way to go. Finally, the topicmodels library is ideal for statistical topic modeling in R.
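To make the LDA mention concrete, here is a minimal gensim sketch; the three toy documents are far too small for a real topic model and only demonstrate the API:

```python
# Minimal LDA sketch with gensim; the toy corpus only demonstrates the API.
from gensim import corpora, models

docs = [
    ["text", "mining", "corpus", "words"],
    ["deep", "learning", "word2vec", "vectors"],
    ["topic", "modeling", "corpus", "documents"],
]

dictionary = corpora.Dictionary(docs)            # map tokens to integer ids
bow = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words representation
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, random_state=1)

for topic_id, topic in lda.print_topics():
    print(topic_id, topic)                       # top terms per topic
```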
And these packages are far from all that exist.
Since text mining is a pretty hot topic, there has been a lot to discover in recent years in terms of research, and you can expect it to remain important in the years to come, with multimedia mining, multilingual text mining, …
8. More Than Words – Visualizing Your Results
Don’t forget to communicate the results of your analysis! This is probably one of the most rewarding things you can do, since visual representations attract people.
Your visualizations are your story.
So don’t hold back from visualizing the correlations or topics you have found in your analysis.
Both Python and R have packages that will help you do this. You should therefore round out your list of packages with these data visualization libraries to present your results:
For Python, you might consider using the NetworkX package to visualize complex networks. The matplotlib package can also come in handy for other types of visualizations, and a small sketch with it follows below. The plotly package, which allows you to make interactive, publication-quality graphs online, is also one of the go-to packages for presenting your results visually.
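For instance, a quick matplotlib sketch of the term-frequency chart from tip 6; the terms and counts are made up for illustration:

```python
# Minimal sketch: a term-frequency bar chart with matplotlib.
# The terms and counts are made up; in practice they would come
# from your document-term matrix.
import matplotlib.pyplot as plt

terms = ["text", "mining", "corpus", "data", "words"]
counts = [12, 9, 7, 5, 4]

plt.bar(terms, counts)
plt.xlabel("Term")
plt.ylabel("Frequency")
plt.title("Most frequent terms in the corpus")
plt.tight_layout()
plt.show()
```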
A tip for all of those who are huge fans of data visualization: try linking Python and D3, the JavaScript library for dynamic data manipulation and visualization that allows your audience to become active participants in the data visualization process.
For R, besides the libraries that you will already know, such as ggplot2, which is always a good choice, you can also use the igraph library to analyze follower, followed, and retweet relationships. Do you want even more? Consider checking out plotly and networkD3 to link R and JavaScript.
Starting Your Text Mining Journey With DataCamp
Do you want to get started with text mining in R? Head over to the “Text Mining: Bag of Words” course and learn text mining by doing! It introduces you to a variety of essential topics for analyzing and visualizing data and lets you practice your newly acquired text mining skills on a real-world case study. Don’t have the time to start the course yet? Try out Ted Kwartler’s interactive R tutorial: it’s a quick way to get started on a text mining case study.
In short, go ahead and get started on your text mining journey with DataCamp!