Python: Reference for Social Scientists

From OnnoWiki

Learning Python for Social Scientists

Neal Caren - University of North Carolina, Chapel Hill

I’ve compiled a list of Python tutorials and annotated analyses. I've tried to list pages that are accessible to social scientists with little background in Python and/or machine learning.

If you are totally new to Python, I would recommend installing Continuum's Anaconda Python distribution. It works on Macs and Windows, makes using IPython notebooks trivial, and solves most of the problems associated with installing various packages.

If you know of anything I've left out or if links go dead, please let me know.

Walkthroughs

One of the great things about IPython notebooks is that they can easily blend text and code. This has led to a sharp increase in the number of data analysis projects where people carefully explain an entire research project, including data collection/importation, management and analysis. The code is right there, and you can usually run it and/or modify it yourself. Looking at a few of these is an excellent introduction to what people are currently doing, even if you don't understand everything.

  • Diving into Open Data with IPython Notebook & Pandas by Julia Evans. An analysis of whether people bike when it rains using Pandas.
  • The Need for Openness in Data Journalism by Brian Keegan. Reanalysis of a 538 post on the Bechdel Test and films using Pandas and statsmodels. When I ran this one, I had an issue with the BeautifulSoup part.
  • Predicting customer churn with scikit-learn by Eric Chiang. I don't care about customer churn, but it's a well-written walkthrough of machine learning classification.
  • 538 Model by Skipper Seabold. Recreation of the classic 538 prediction model using Pandas.
  • Heat and Violence in Chicago by Brian Keegan. Walkthrough of an impressive analysis of crime trends.
  • Powerpoetry Analysis by SumAll Foundation. Analysis of how individual poetry styles change over time using pandas.
  • World Cup Learning by Juan Pedro Fisanotti. Predict winners of World Cup soccer matches using the PyBrain library for machine learning. Data is also on Github.
  • Is Seattle Really Seeing an Uptick In Cycling? by Jake Vanderplas. Yes. An excellent time-series analysis.
  • Predicting the World Cup with the Google Cloud Platform by Jordan Tigani. Soccer predictions. 99% of this doesn't have anything to do with the Google Cloud Platform.
  • Finding the World's Economic Center of Gravity by Daniel Velkov. Using pandas to find and plot the coordinates of the GDP-weighted center of the world by year.
  • Pinterest Analysis about the NYFW Fall 2014 by Rosario Gomez. Scraping, cleaning and descriptives. Nicely demonstrates how to store data in a MongoDB database. There is also a fancier version of the analysis.

Overviews

Introductions to using Python for data analysis that make sense to social scientists.

Using APIs

When a service wants you to use their data, they often provide it through an API. There are often specific Python libraries for accessing APIs that are popular, complex, and/or require authentication. Otherwise, requests is quite useful.
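
As a rough illustration of the requests route, here is a minimal sketch. The endpoint URL, the query parameter, and the shape of the JSON payload are all made up for the example; a real service will document its own. The parsing helper is kept separate from the network call so it can be tried on a sample payload first.

```python
import json


def fetch_json(url, params=None):
    """Request a URL and decode the JSON body."""
    # Third-party package; imported here so the parsing helper below
    # can be tried even where requests is not installed.
    import requests

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()


def titles_by_author(payload):
    """Group item titles by author from an already-decoded payload."""
    grouped = {}
    for item in payload["items"]:
        grouped.setdefault(item["author"], []).append(item["title"])
    return grouped


# Made-up payload in the shape the hypothetical endpoint is assumed to return.
sample = json.loads("""
{"items": [
    {"author": "ana", "title": "First post"},
    {"author": "ben", "title": "Second post"},
    {"author": "ana", "title": "Third post"}
]}
""")
grouped = titles_by_author(sample)

# Against a live service this would look something like:
# grouped = titles_by_author(fetch_json("https://api.example.com/v1/posts", {"q": "python"}))
```

Services that require authentication usually need extra headers or a dedicated library, which is where the API-specific packages come in.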

Web Scraping

When they don't want to give you the data, you can sometimes grab it anyway by visiting one or more web pages and then extracting the parts you need. requests is a useful library for accessing web pages, and BeautifulSoup is a popular choice for pulling out the good stuff. If you don't know any HTML, regular expressions can sometimes work well too.

  • Intro to Beautiful Soup by Jeri Wieringa. Turning a table on a website into a CSV file. Part of the useful Programming Historian set of tutorials.
  • Scraping Craigslist for sold out concert tickets by Greg Reda. Using BeautifulSoup to parse HTML. As a bonus, shows how to send yourself a text from Python, if you use Gmail & AT&T.
  • Web scraping in Python by me. Grabbing lacrosse scores and turning them into a CSV file.
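
To give a flavor of the regular-expression approach mentioned above, here is a small sketch that pulls rows out of a table and writes them as CSV. The HTML snippet, team names, and scores are invented; in practice the string would be the text of a page you downloaded. Regular expressions like this only hold up when the markup is very regular, which is why BeautifulSoup is the usual choice for anything messier.

```python
import csv
import io
import re

# Made-up HTML standing in for a downloaded page.
html = """
<table>
<tr><td>Home Team</td><td>12</td></tr>
<tr><td>Away Team</td><td>9</td></tr>
</table>
"""

# Capture the team name and the numeric score from each table row.
row_pattern = re.compile(r"<tr><td>(.*?)</td><td>(\d+)</td></tr>")
rows = [(team, int(score)) for team, score in row_pattern.findall(html)]

# Write to an in-memory buffer; swap in open("scores.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["team", "score"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```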

Data Management

Going from raw data--numbers or words--to Xs that can be included in a regression equation is about 80% of the work. There's a lot of data management in the walkthroughs, but I've found a couple of others that show the process quite clearly. Pandas is popular and super useful, especially the data frames.

  • Intro to pandas data structures by Greg Reda. A three part tutorial that is a very accessible overview to working with data using Pandas.
  • Aggregating & plotting time series in python by yhat. Basics of managing time series data in Pandas.
  • College Basketball three pointers by Justin Goodwin. Lightly commented guide to analyzing basketball stats. I'm pretty sure you can download the data yourself using the code in this gist.
  • Data Wrangling with Python by Chris Fonnesbeck. Also includes an overview of NumPy, which you can skim the first time through.
  • Pandas cookbook by Julia Evans. Lots of great examples for handling data in Pandas.
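
As a toy version of going from raw words to variables, the sketch below builds a Pandas data frame with one row per document and one column per word. The two documents are made up, and splitting on whitespace is a deliberately crude stand-in for real tokenizing.

```python
import pandas as pd

# Made-up documents; in practice these would come from files or a scraper.
docs = {
    "doc_a": "the senator said the bill would pass",
    "doc_b": "she said the vote was close",
}

# One row per (document, word) pair.
tokens = pd.DataFrame(
    [(name, word) for name, text in docs.items() for word in text.split()],
    columns=["doc", "word"],
)

# Count words within each document, then pivot so each word becomes a column.
counts = tokens.groupby(["doc", "word"]).size().unstack(fill_value=0)

# A derived variable that could go on the right-hand side of a regression.
counts["n_words"] = counts.sum(axis=1)
```

Each row of `counts` is now a case and each column a candidate X, which is the shape most modeling libraries expect.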

Text Management

Playing with words.

  • Text Analysis with Topic Models for the Humanities and Social Sciences by Allen Riddell. A series of tutorials, including preprocessing, computing text similarities, and finding distinctive words.
  • A Smattering of NLP in Python by Charlie Greenbacker. The basics of using NLTK to analyze text documents.
  • Using Pandas to Curate Data from the New York Public Library's What's On the Menu? Project by Trevor Muñoz. People spell potatoes au gratin a lot of different ways.
  • Statistical Natural Language Processing in Python or How To Do Things With Words. And Counters. or Everything I Needed to Know About NLP I learned From Sesame Street. Except Kneser-Ney Smoothing. The Count Didn't Cover That. by Peter Norvig. This actually isn't that useful to social scientists, but his code is beautiful.
  • Yet Another Python Encoding Tutorial by Guillermo Moncecchi. Analyzing strings with é or “ in Python 2.7 can be a real pain.
  • NLTK Syntax Trees by Guillermo Moncecchi. Sentence parsing.
  • Fuzzy Matching with Yhat by yhat. Matching using messy text strings.
  • Introductory Natural Language Techniques using Python by Douglas Starnes. An introduction to NLTK.

Introduction to data analysis

Introductions and/or overviews of data analysis, usually using scikit-learn.

  • Basic principles of machine learning by Jake Vanderplas. An introduction to scikit-learn, the most popular machine learning library in Python. It's really great.
  • Introduction to Scikit-Learn by Tim Hopper. This introduction assumes a modest level of familiarity with Python.
  • Using Python to see how the Times writes about men and women by me. Basics of using word counts.
  • Analyzing police daily activity logs by Dan Hill. Descriptive statistics.
  • An exploratory statistical analysis of the 2014 World Cup Final by Ricardo Tavares. Analysis of play-by-play data with some nifty visualizations.

Classification

When the outcome variable is categorical. Social scientists usually start and stop with variations on logistic regression. Turns out, there are a lot of other things out there.

  • Predicting NFL Field Goal Percentages by Justin Goodwin. Using Pandas and the scikit-learn Random Forests classifier.
  • Basic Random Forest Model by Trey Causey. Minimally commented but clear code for using Pandas and scikit-learn to analyze in-game NFL win probabilities.
  • Supervised Learning In-Depth: SVMs and Random Forests by Jake Vanderplas
  • Text Classification with Naïve Bayes by Guillermo Moncecchi. Python code from the second chapter of Learning scikit-learn: machine learning in Python.
  • Exercise to detect Algorithmically Generated Domain Names by Click Security. Text classification of a single string, so it uses lots of tricks besides word frequencies.
  • Are you talking fashion? Building a fashion classifier for Twitter data by Rosario Gomez. From collecting Twitter data with Tweepy to analyzing the text using scikit-learn.
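
For a sense of how little code it takes to step beyond logistic regression, here is a minimal scikit-learn sketch that compares it with a random forest using five-fold cross-validation. The data are simulated with `make_classification`, so the scores say nothing about any real problem, and the settings are illustrative defaults rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Simulated data: 300 cases, 8 predictors, binary outcome.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean out-of-sample accuracy across five folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.2f}")
```

Because every scikit-learn classifier shares the same interface, swapping in another model is usually a one-line change to the `models` dictionary.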

Unsupervised Learning

When you don't have an outcome variable and/or want to combine your explanatory variables. Sociologists usually learn about factor analysis and then never use it. For text data, topic modeling is what all the cool kids are doing.

  • Unsupervised Learning In-depth: PCA and K-Means by Jake Vanderplas.
  • Topic modeling with MALLET by Allen Riddell. Using Python and mallet to generate topic models of text.
  • Topic modeling in Python by Allen Riddell. It isn't the standard LDA topic modeling algorithm, but it's close and it's in scikit-learn.
  • Topics extraction with Non-Negative Matrix Factorization by Olivier Grisel. More on NMF for topic modeling text.
  • Yet another very gentle Tutorial on Latent Dirichlet Allocation and Inference using Gibbs Sampling by Călin-Rareș Turliuc. Theoretical overview of LDA in an IPython notebook.

Regression

While continuous outcomes are common in the social sciences, machine learning folks rarely talk about them.

  • Multiple Regression using Statsmodels by DataRobot. The stuff you already know how to do, but this time in Python.
  • Gradient Boosted Regression Trees by DataRobot. Scikit-learn analysis of a continuous outcome measure.

Model/Feature Selection

Picking which model or variables to use often happens offstage in social science research. It doesn't have to be that way, though.

  • Machine Learning with Scikit-Learn: Validation and Model Selection by Jake Vanderplas. Evaluating and improving your scikit-learn models.
  • Testing and Validation in Scikit-Learn by Sarah Guido. Short and to the point.
  • Feature by Trey Causey. Minimally commented but clear code for using Pandas and scikit-learn to find most important features.

Networks

NetworkX and igraph are both fairly powerful tools for network analysis. I don't think you can use them for regression analysis, but you can use them to do things like compute centrality measures and make pretty pictures. You can also use Python to create/manipulate your network data for analysis/display elsewhere.

  • Six Degrees of Kevin Bacon by Brian Kent. A classic example of network analysis using NetworkX.
  • Building a Foursquare Location Graph by Benedikt Koehler. Using the FourSquare API to find related venues and then graph them as a network using NetworkX.
  • Analysis of Twitter stream data with the IPython Notebook by Brian Granger. Words as nodes in a network analysis of Twitter data.
  • Creating a route planner for road network by Cyrille Rossant from the IPython Cookbook. Using NetworkX to find the shortest distance between two nodes. In this case, the network is roads and the nodes are latitude/longitude coordinates, so the result is a GPS-like route planner.
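
The two tasks that keep coming up above, centrality measures and shortest paths, look roughly like this in NetworkX. The friendship network is invented for the example.

```python
import networkx as nx

# Made-up friendship network: an edge means two people know each other.
G = nx.Graph()
G.add_edges_from([
    ("Ana", "Ben"),
    ("Ana", "Cruz"),
    ("Ben", "Dee"),
    ("Cruz", "Eli"),
    ("Dee", "Fay"),
])

# Degree centrality: share of the other nodes each node is directly tied to.
centrality = nx.degree_centrality(G)

# Fewest-steps chain of acquaintances linking Fay to Ana.
path = nx.shortest_path(G, source="Fay", target="Ana")
degrees_of_separation = len(path) - 1
```

The `centrality` dictionary can be dropped into a Pandas data frame if you want to use the measure as a variable elsewhere.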

Plotting

matplotlib is the default plotting library for data scientists and plays well with pandas. seaborn makes it prettier. Other libraries, like mpld3, Plotly, or Bokeh, are also worth trying out, especially for putting stuff together on the web.

  • Plotting and Visualization by Chris Fonnesbeck. Great overview of using matplotlib and pandas.
  • Financial market data manipulation and visualization with Python by Chris Degiere. Basics of plotting time series data in pandas.
  • Exploratory graphs by herrfz. Basic Pandas plots.
  • Computational data visualization in Python by Olga Botvinnik. Using Seaborn to make pretty graphs from your numbers.
  • A Gallery of Statistical Graphs in Matplotlib by Chris Beaumont. Pretty and practical examples.
  • Demo of mpld3 by Jake Vanderplas. Interested in making interactive graphics? This is where I would start. Turns your matplotlib code into d3 figures.
  • Nine matplotlib figures made in Plotly. Plotly makes excellent interactive graphs which are hosted on their servers.
  • & D3 in Python by z-m-k. Fairly complicated worked examples of an alternate way of producing D3 graphs.
  • Hierarchical clustering by Olga Botvinnik. Pretty heatmaps with seaborn and pandas.
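
As a small illustration of pandas and matplotlib working together, the sketch below plots a short time series straight from a pandas Series. The ride counts are made up, and the `Agg` backend line is only there so the script runs without a display; inside a notebook you would leave it out.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; omit this line inside a notebook

import pandas as pd

# Made-up daily counts indexed by date.
rides = pd.Series(
    [120, 135, 90, 160],
    index=pd.date_range("2014-01-01", periods=4, freq="D"),
    name="rides",
)

# pandas hands the data to matplotlib and returns the Axes for further tweaks.
ax = rides.plot(marker="o", title="Daily rides (made-up data)")
ax.set_ylabel("rides")
```

From here, `ax.figure.savefig("rides.png")` would write the figure to disk.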

Images as Data

Social scientists don't really analyze images much, but that might be the next big thing.

  • Image Processing with scikit-image by Eric Chiang. Basics of using photos as data.
  • Building a fashion recommender by Rosario Gomez. Analyzing pictures of models to extract the colors of the clothes they are wearing. It's really impressive.

Reference