Data Analysis: Facebook Hacker Cup 2013

Which programming language is used the most?

To answer this question we use results from the qualification round of Facebook Hacker Cup 2013.

If you are only interested in the statistics, skip ahead!

0. Intro

As you may know, Facebook Hacker Cup is a programming contest. You are given a programming problem and need to solve it, in most cases efficiently. You receive an input file and must submit the output file in less than 6 minutes. Pretty simple. The interesting part is that you also need to submit your source code, which any contestant can review after the round.

We use these sources to gather programming language statistics.

1. Data

The primary resource in statistics is data.

We have the scoreboard page (you must be logged in to Facebook to see the results) and are interested in all links named source. Opening every link manually would take too long, so we will automate it by writing a program. Yes, in Clojure.

Dependencies

For Facebook authentication we can use:

I didn't think too much about alternatives, because I wanted to try Selenium in Clojure. It's time!

In a few words, Selenium lets you drive browser events programmatically and is often used for automation. To use Selenium in a Clojure program, just add [clj-webdriver "0.6.0-beta2"] to your project dependencies.

To simplify HTTP GET access we use the library [clj-http "0.6.3"], and for a few extra helpers [org.clojure/clojure-contrib "1.2.0"].

In the code we use the following requires:

(:require [clj-http.client :as http])
(:require [clj-webdriver.taxi :as web])
(:require [clj-webdriver.core :as c])
(:require [clojure.contrib.math :as math])

Automation

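Before we automate the actions needed to gather the sources, we need to decide what those actions are. In short: log in to Facebook, open the scoreboard, walk through every scoreboard page, and download each file linked as source.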

Pretty clear, right? Let's code it sequentially.

Implementation

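(defn run [username password]
  ;; open Firefox on the login page and sign in
  (web/set-driver! {:browser :firefox} "http://facebook.com")
  (web/input-text "#email" username)
  (web/input-text "#pass" password)
  (web/submit "#pass")
  ;; open the qualification round scoreboard, then visit every page of it
  (web/to "https://www.facebook.com/hackercup/scoreboard?round=185564241586420")
  (doseq [page-num (range 1 (inc 114))]
    (process-page page-num)))

The number of pages (114) is hardcoded, which is fine for a one-off script. So what does the process-page function do?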

(defn process-page [n]
  (let [url (str "https://www.facebook.com/hackercup/scoreboard?round=185564241586420&page=" n)]
    (web/to url)
    ;; look at every link on the page and keep only those served from the source-code host
    (doseq [e (web/find-elements {:css "a"})
            :let [href (c/attribute e "href")]
            :when (and href (.startsWith href "https://fbcdn-dragon-a.akamaihd.net/"))]
      (process-url href))))

First, we concatenate the scoreboard link with the page number to get the actual link to each page. Then we go to that page, obtain all a elements, get their href values, and keep only the ones that point to source code.

(defn process-url [url]
  (let [source-code (:body (http/get url))
        file-name (generate-filename url)]
    (spit file-name source-code)))

Here we issue an HTTP GET for the URL and take the :body of the response map. Since every source URL returns plain text, no additional filtering is needed; we just save it to a file.

I don't want to use the URL as the file name, so for naming we use the following function: timestamp + underscore + absolute value of the URL's hash code.

(defn generate-filename [url]
  (str DIRECTORY (System/currentTimeMillis) "_" (math/abs (hash url)) ".txt"))

DIRECTORY is just a def holding the folder where you want to place all the sources.

The whole script source is available here.
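For readers without clojure.contrib at hand, the naming scheme can be sketched with plain Java interop. This is a minimal variant, not the original script: the "sources/" folder is an assumed value for DIRECTORY, and Math/abs replaces math/abs.

```clojure
;; Assumption: sources are stored under ./sources/ (any writable folder works).
(def DIRECTORY "sources/")

(defn generate-filename
  "Builds DIRECTORY + timestamp + underscore + absolute hash of url + .txt"
  [url]
  (str DIRECTORY
       (System/currentTimeMillis)
       "_"
       ;; widen to long first so the absolute value cannot overflow
       (Math/abs (long (hash url)))
       ".txt"))
```

The timestamp keeps names apart across repeated runs, while the hash separates different URLs saved within the same millisecond.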

Execution

To run the script, evaluate (run "username" "password") in a Clojure REPL, with correct values for username and password, obviously.

It runs for a while. On my machine, "a while" means eight hours. That's long, but it's an ideal job for overnight crawling!

In the morning I had all work done.

ls -1 | wc -l && du

prints

20348
291120  .

Good sign. We have more than 20K source files with a total size of almost 300 MB.

2. Analysis

The data is good. But no one is interested in raw data, so we need to process it.

Basically, we need to detect the programming language of each source file, and the files have no extensions.

You can write your own classifier or use some existing tool.

I spent a few minutes researching this topic and found the linguist project. It is written in Ruby and used in Gist to detect the language of a snippet. Exactly what we need!

Unfortunately, I do not know Ruby. I could not even build and run the linguist classifier to detect the language of my files. RVM, gems, and modules were driving me crazy. I surrendered.

Another solution is the JavaScript library Highlight.js. It is meant for syntax highlighting but has automatic language detection. Again, JavaScript and reading files from the filesystem... Don't tell me about Node.js.

I decided to write my own "classifier". Honestly, it's just a regexp matching mechanism over common language constructions: keywords, imports, most used functions, etc.

I used an iterative approach.

Select some popular language construction, say #include <iostream>, and label every file that contains it as C. That detects some subset of the C files; remove them from the list of all files and repeat with another construction.

I don't know how C++ differs from C, so I treat them as interchangeable and call the result C/C++. By the way, C - C++ = 0, so we assume they are equal.

I developed a number of patterns (they can be reviewed here) and processed all the gathered source files.
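The matching loop described above can be sketched roughly like this. It is not the original pattern list: the four regexps and the names patterns, detect-language, and classify-all are made up for illustration, and the real list is much longer.

```clojure
;; Hypothetical patterns, for illustration only.
;; Order matters: the first pattern that matches wins.
(def patterns
  [["C/C++"  #"#include\s*<\w+(\.h)?>"]
   ["Java"   #"public\s+static\s+void\s+main"]
   ["Python" #"(?m)^\s*(def\s+\w+\s*\(|import\s+\w+)"]
   ["Ruby"   #"(?m)^\s*puts\s"]])

(defn detect-language
  "Returns the name of the first pattern found in source, or nil."
  [source]
  (some (fn [[lang re]] (when (re-find re source) lang)) patterns))

(defn classify-all
  "Counts sources per detected language; unmatched ones go to \"not detected\"."
  [sources]
  (frequencies (map #(or (detect-language %) "not detected") sources)))
```

Because checking stops at the first hit, a Java file with import lines is labeled Java only as long as the Java pattern sits above the Python one. Overlapping patterns are therefore resolved purely by their position in the list.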

Unfortunately, not all files were processed successfully. I reviewed approximately 2000 files manually; a few new languages were detected, but a big share of them were crap: input data, binary files, plain text notes.

I think we don't lose much by saying that about 99% of the files were processed.

3. Statistics

If you don't like this visualization, you can create your own. Here is the data:

C/C++          10524
Java            3117
Python          3102
C#              1233
PHP              821
Ruby             488
Perl             142
Pascal/Delphi    136
Javascript       109
Haskell           85
Scala             72
Clojure           29
Go                28
Visual Basic      19
F#                12
Scheme             8
OCaml              7
Common Lisp        6
Lua                5
Matlab             4
Cocoa              3
Groovy             2
Dart               2
awk                1
Powershell         1
bash               1
Kotlin             1
ActionScript       1
Dylan              1
--------------------
crap             192
not detected     196
--------------------
TOTAL          20348
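To turn these counts into shares, a short sketch like the following is enough. Only the top five rows are copied here to keep it brief, and the names counts, total, and share are illustrative.

```clojure
;; Top five rows of the table above; add the rest the same way.
(def counts {"C/C++" 10524, "Java" 3117, "Python" 3102, "C#" 1233, "PHP" 821})

;; TOTAL row of the table, including crap and not detected files.
(def total 20348)

(defn share
  "Percentage of all gathered files that n files represent."
  [n]
  (* 100.0 (/ n total)))

;; Print languages from most to least used, with their share of the total.
(doseq [[lang n] (sort-by val > counts)]
  (println (format "%-8s %5d %5.1f%%" lang n (share n))))
```

By this calculation C/C++ alone accounts for a little over half of all submitted files.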

4. Conclusion

Warning: Do not treat these statistics as a real-world pattern. This is a programming competition with a lot of geeks, and they can use whatever they want. Also, do not judge the quality of their code: it was written only for correctness and speed. Pay attention to code quality in production, but always remember:

Your code may be elegant, but mine fucking works.

-- Leon Fayer

P.S. First of all, these statistics are not highly accurate. As I am not a programming guru, I don't know every language's constructions, so it is likely that one pattern overlaps with another. In that case only the first one is checked. Ideally, it would be good to have a language detection library in Clojure for future analysis. Maybe I will write one. Second, the code presented here is not beautiful, not optimized, and has a lot of hardcoded values, but it works. Just in the spirit of Facebook Hacker Cup.

mishadoff 01 February 2013