GHTorrent monitors the GitHub public event timeline. For each event, it exhaustively retrieves the event's contents and their dependencies. It then stores the raw JSON responses in a MongoDB database and extracts their structure into a MySQL database. The database currently stores 7,994,119 GitHub projects, with information about commits, watchers, pull requests and issues. Because it holds so much data, heavy queries can be slow; queries restricted to a single project, however, run quickly. The full schema of the MySQL database is shown below.
The MySQL client can be found here
R is a free programming language and software environment for statistical computing and graphics, and it is well suited to statistics and data analysis. I prefer using RStudio as an IDE, since it provides a bit more functionality than the default interface that ships with R. For example, it can show the data in a nice table. The R community on Stack Overflow is extensive, so most questions are easily answered by searching there or on Google.
Example query to get all projects owned by ItudeMobile, along with their language:
SELECT p.id, p.name, p.language
FROM projects p
JOIN users u ON u.id = p.owner_id
WHERE u.login = 'ItudeMobile';
Example R code to plot the number of projects per language as a bar chart, reading the data from IMLanguages.csv:
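library(ggplot2)  # provides ggplot() and geom_bar()
# Load the CSV file and name its three columns
information <- read.csv("IMLanguages.csv")
colnames(information) <- c("id", "name", "language")
# Count the projects per language; the result has the columns Var1 and Freq
languageCount <- as.data.frame(table(factor(information$language)))
# Draw one bar per language, with the project count as its height
g <- ggplot(languageCount, aes(x = Var1, y = Freq))
g <- g + geom_bar(stat = "identity") + xlab("language") + ylab("# projects")
print(g)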
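To see what the counting step produces, here is a minimal sketch on a small made-up data frame. The project names and languages below are invented for illustration and are not taken from GHTorrent. table() tallies each distinct language, and as.data.frame() turns that tally into two columns named Var1 and Freq, which is why the plot maps those two names to the axes.

```r
# Hypothetical sample data standing in for the contents of IMLanguages.csv
information <- data.frame(
  id = 1:5,
  name = c("project-a", "project-b", "project-c", "project-d", "project-e"),
  language = c("Java", "Java", "Objective-C", NA, "JavaScript"),
  stringsAsFactors = FALSE
)

# table() counts each distinct language; rows with a missing language (NA)
# are left out by default
languageCount <- as.data.frame(table(factor(information$language)))

# One row per language: Var1 holds the language, Freq the number of projects
print(languageCount)
```

Projects without a detected language do not get a bar of their own, because table() skips NA values unless told otherwise.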
Gousios, Georgios, and Diomidis Spinellis. “GHTorrent: Github’s data from a firehose.” Mining Software Repositories (MSR), 2012 9th IEEE Working Conference on. IEEE, 2012.