Monitoring GitHub projects using GHTorrent and R

GHTorrent

GHTorrent monitors the GitHub public event timeline [1]. For each event, it exhaustively retrieves the event's contents and their dependencies. It stores the raw JSON responses in a MongoDB database and also extracts their structure into a MySQL database. At the time of writing, the database stores 7,994,119 GitHub projects, with information about commits, watchers, pull requests and issues. Because it contains so much information, heavy queries across all projects can be rather slow, but queries restricted to a single project run quickly. The full schema of the MySQL database is shown below.

GHTorrent relational schema

 

The MySQL client can be found here

R

R is a free programming language and software environment for statistical computing and graphics, which makes it well suited to statistics and data analysis. I prefer RStudio as an IDE, because it offers more functionality than the GUI that ships with R. For example, it can show data in a nice table. The Stack Overflow community is extensive, so most questions are easily answered by searching Stack Overflow or Google.

Example

Example query to get all the projects owned by ItudeMobile, together with their language:

SELECT p.id, p.name, p.language FROM projects p JOIN users u ON u.id = p.owner_id WHERE u.login = 'ItudeMobile';
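
To see what the query does without a database connection, the same join and filter can be reproduced in base R on two tiny data frames. This is only a sketch: every row below is made up for illustration and is not GHTorrent data, and the file is written to a temporary directory under the same name that the R script in this post reads.

```r
# Toy stand-ins for the GHTorrent `users` and `projects` tables.
# All rows are invented for illustration; they are not real GHTorrent data.
users <- data.frame(
  id    = c(1, 2),
  login = c("ItudeMobile", "someone-else"),
  stringsAsFactors = FALSE
)
projects <- data.frame(
  id       = c(10, 11, 12),
  owner_id = c(1, 1, 2),
  name     = c("project-a", "project-b", "project-c"),
  language = c("Java", "Objective-C", "Ruby"),
  stringsAsFactors = FALSE
)

# Join condition: u.id = p.owner_id
joined <- merge(projects, users, by.x = "owner_id", by.y = "id")

# WHERE u.login = 'ItudeMobile', then SELECT p.id, p.name, p.language
result <- joined[joined$login == "ItudeMobile", c("id", "name", "language")]

# Save the result as a CSV file that read.csv() can load later.
csv_path <- file.path(tempdir(), "IMLanguages.csv")
write.csv(result, csv_path, row.names = FALSE)

print(result)
```

Only the two projects whose owner_id matches the ItudeMobile user survive the filter; the third project belongs to a different login and is dropped.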

Example R code to put the languages in a graph.


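# Load ggplot2 for plotting; library() stops with an error if it is missing
library(ggplot2)
setwd("/Users/Cindy/Documents")

# Read the query result exported from the MySQL client.
# read.csv() treats the first row as a header; the names are replaced below.
information <- read.csv("IMLanguages.csv")
colnames(information) <- c("id", "name", "language")

# Count the projects per language; the result has the columns Var1 and Freq
languageCount <- as.data.frame(table(factor(information$language)))

# Bar chart with one bar per language
g <- ggplot(languageCount, aes(x = Var1, y = Freq))
g <- g + geom_bar(stat = "identity") + xlab("language") + ylab("# projects")
ggsave(filename = "languagePlot.png", plot = g)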
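
The column names Var1 and Freq used in aes() do not appear in the CSV file; they are the default names produced by as.data.frame(table(...)). A small sketch with a made-up language vector (standing in for information$language, not the actual ItudeMobile data) shows the shape of that intermediate data frame:

```r
# Made-up language column, standing in for information$language.
language <- c("Java", "Objective-C", "Java", "C#", "Java")

# table() counts each distinct value; as.data.frame() turns the counts
# into a data frame with the default column names Var1 and Freq.
languageCount <- as.data.frame(table(factor(language)))

print(names(languageCount))  # the names that aes(x = Var1, y = Freq) refers to
print(languageCount)
```

Each row of languageCount is one language with its number of projects, which is why geom_bar(stat = "identity") can use Freq directly as the bar height.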

View of the table in RStudio:
[Screenshot of the table view in RStudio]

The plot that comes out:
[Figure: languagePlot.png]

Sources
[1] Gousios, Georgios, and Diomidis Spinellis. “GHTorrent: Github’s data from a firehose.” Mining Software Repositories (MSR), 2012 9th IEEE Working Conference on. IEEE, 2012.
