### Running Word Embedding on Piazza Posts

Word embedding (very bad explanation follows) translates each word in a corpus into a D-dimensional vector. These vectors are supposed to preserve semantic properties and are commonly used in NLP. For example, in a large corpus (say the Google News corpus), the vector for king, $v\_{king}$, should be near the vector for queen, $v\_{queen}$. Furthermore, relationships such as “king is to man as queen is to woman” are preserved via $v\_{king} - v\_{man} \approx v\_{queen} - v\_{woman}$. These embeddings are generated via neural networks, and a common tool for doing this is Word2Vec, produced by Mikolov et al. A fast implementation of Word2Vec is available in gensim.
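To make the “king is to man as queen is to woman” idea concrete, here is a tiny sketch with hand-made 2-D vectors. The numbers are invented purely for illustration (they are not output from any trained model), and the helper names `sub` and `euclidean` are my own.

```python
import math

# Hand-made toy vectors: axis 0 loosely means "royalty", axis 1 loosely means "gender".
toy = {
    "king":  (0.90, 0.80),
    "queen": (0.85, 0.15),
    "man":   (0.10, 0.75),
    "woman": (0.10, 0.20),
}


def sub(a, b):
    """Component-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The "royalty" offset should look about the same for both genders.
offset_male = sub(toy["king"], toy["man"])
offset_female = sub(toy["queen"], toy["woman"])

print("king - man    =", offset_male)
print("queen - woman =", offset_female)
print("gap between the two offsets:", euclidean(offset_male, offset_female))
```

The two offsets are not identical, only close, which is all the analogy property promises.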

Wouldn’t it be fun to run this on Piazza posts? I’m a TA for data structures this semester (for Prof Blaer and Prof Bauer) and we have a Piazza containing 800 posts (rife with answers, followups, etc). Speaking of that, we have an average response time of 5 min. Find me another team of TAs that can beat this hurhhurh.

# Creating Corpus

Turns out Piazza does not have an official API. However, there is an unofficial one: https://github.com/hfaran/piazza-api. That’s good enough for us.

Email: lq2137@columbia.edu
········

Now let’s grab all the posts as .jsons.

Downloading post 0
Downloading post 800
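For reference, a minimal sketch of the login and download steps, written for Python 3. It assumes the `piazza-api` package, whose `Piazza.user_login`, `Piazza.network` and `iter_all_posts` calls are used inside `main`. The network id, the `posts` output directory and the `download_posts` helper are hypothetical names, not necessarily what produced the output above.

```python
import json
import os


def download_posts(network, out_dir, limit=None):
    """Write each post from network.iter_all_posts() to <out_dir>/<index>.json."""
    os.makedirs(out_dir, exist_ok=True)
    saved = 0
    for index, post in enumerate(network.iter_all_posts(limit=limit)):
        if index % 100 == 0:
            print("Downloading post", index)
        with open(os.path.join(out_dir, "%d.json" % index), "w") as f:
            json.dump(post, f)
        saved += 1
    return saved


def main():
    # Requires `pip install piazza-api`, network access and a real Piazza login.
    from piazza_api import Piazza

    piazza = Piazza()
    piazza.user_login()  # prompts for email and password when none are passed
    course = piazza.network("NETWORK_ID")  # hypothetical placeholder id
    download_posts(course, "posts")  # hypothetical output directory
```

Call `main()` yourself once you have filled in a real network id. Keeping `download_posts` separate from the login means it can be tried against any object that has an `iter_all_posts` method.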

Then we parse the .jsons into a giant .txt by unscrambling the messy .json that the API provides. We also tokenize and perform some basic cleaning (mash everything to lower case). This is a very blunt tool, but should suffice for now.
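A rough sketch of that parsing step. I am assuming the text lives under keys named `subject` and `content` somewhere in each post's nested JSON, so check that against your own dumps. All function names here are my own, and each text field simply becomes one line of the corpus.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # blunt HTML tag stripper


def iter_text_fields(node, keys=("subject", "content")):
    """Recursively yield string values stored under `keys` anywhere in the JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in keys and isinstance(value, str):
                yield value
            else:
                yield from iter_text_fields(value, keys)
    elif isinstance(node, list):
        for item in node:
            yield from iter_text_fields(item, keys)


def clean(text):
    """Strip tags, decode HTML entities, lower-case and collapse whitespace."""
    # Tags go first so that escaped code such as &lt;T&gt; survives as <t>.
    text = html.unescape(TAG_RE.sub(" ", text))
    return " ".join(text.lower().split())


def post_to_lines(post):
    """Turn one parsed post into a list of cleaned, non-empty lines."""
    return [line for line in (clean(t) for t in iter_text_fields(post)) if line]


def write_corpus(posts, path):
    """Write all posts to a single text file, one cleaned field per line."""
    with open(path, "w", encoding="utf-8") as out:
        for post in posts:
            for line in post_to_lines(post):
                out.write(line + "\n")
```

Tokens are whatever is left between spaces, so punctuation stays glued to words. That is as blunt as it sounds.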

# Training the Model

Now let’s train the model using gensim.

Turns out gensim has a nice reader that iterates over a text file with one sentence a line. That’s exactly what we produced in the corpus section.
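In gensim that reader is `gensim.models.word2vec.LineSentence`. The sketch below is a plain-Python stand-in (the class name `OneSentencePerLine` is mine) that does roughly the same thing. It shows why the reader is a restartable iterable rather than a one-shot generator: training makes many passes over the corpus.

```python
class OneSentencePerLine:
    """Restartable iterable: every pass re-opens the file and yields token lists."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()  # whitespace tokenisation, nothing smarter
                if tokens:             # skip blank lines
                    yield tokens


# With gensim installed, the equivalent would be along the lines of:
#   from gensim.models.word2vec import LineSentence
#   sentences = LineSentence("corpus.txt")   # "corpus.txt" is a placeholder path
```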

Now let’s train the model on this corpus. These are some hyperparameters that I found to work well. I shan’t explain them too much, but if you want a little more detail, we are essentially using the Skip-Gram with Negative Sampling variant of Word2Vec, trained for 100 iterations with a context window of 15. We also discard all words that occur fewer than 5 times.

This should take no more than a minute since the corpus is tiny (which theoretically should give us crappy results, but let’s see).
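Here is roughly what the training setup looks like as code. This is a sketch under assumptions: the keyword names follow gensim 4 (older releases call `epochs` `iter`), and the number of negative samples is not something I stated above, so gensim's default of 5 is used as a stand-in. The vector dimension is left at the library default for the same reason.

```python
def word2vec_params():
    """Hyperparameters described in the text, as gensim keyword arguments."""
    return dict(
        sg=1,         # 1 = skip-gram (0 would be CBOW)
        hs=0,         # no hierarchical softmax ...
        negative=5,   # ... use negative sampling instead; 5 is an assumed value
        window=15,    # context window size
        min_count=5,  # drop words seen fewer than 5 times
        epochs=100,   # passes over the corpus (`iter` in older gensim)
    )


def train(corpus_path):
    """Train a model on a one-sentence-per-line text file."""
    # Imported lazily so the parameter dict can be inspected without gensim.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    return Word2Vec(sentences=LineSentence(corpus_path), **word2vec_params())
```

Something like `model = train("corpus.txt")` would kick it off, where `corpus.txt` is a placeholder for the file built in the corpus section.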

# Results

Now let’s look at some results.

The most_similar method takes in a word, converts it to its vector representation, and finds the other words whose vectors are closest to it by cosine similarity (the number next to each word below). These similar words should have been used in contexts similar to those of the original word. Let’s find some interesting results.
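To show what such a lookup does under the hood, here is a minimal pure-Python sketch of the ranking step. It is not gensim's actual implementation, and the three-dimensional vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Invented toy vectors, not taken from the trained model.
toy_vectors = {
    "heap":          (1.0, 0.2, 0.0),
    "percolatedown": (0.9, 0.3, 0.1),
    "quicksort":     (0.5, 0.8, 0.0),
    "email":         (0.0, 0.1, 1.0),
}

for neighbour, score in most_similar(toy_vectors, "heap"):
    print(neighbour, round(score, 3))
```

On a trained gensim model the equivalent call is `model.wv.most_similar("homework")` in recent versions; older versions also accept `model.most_similar("homework")`.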

First, let’s do a sanity check using the word homework. Turns out homework is indeed associated with words we’d expect to be associated with homework: solutions, grade, email, and even latex. That’s good.

[(u'solutions', 0.3142701983451843),
(u'email', 0.3077431321144104),
(u'inserted', 0.3056461215019226),
(u'description', 0.2953464388847351),
(u'programming', 0.2912822961807251),
(u'latex', 0.2862958312034607),
(u'theorem', 0.28351300954818726),
(u'signature', 0.2821905314922333),
(u'posts', 0.28168296813964844)]

Now for something more advanced. Let’s try the word heap. Turns out we have words that are pretty related to the heap concept:

[(u'build', 0.30242919921875),
(u'percolatedown', 0.2909688353538513),
(u'one-by-one', 0.28867167234420776),
(u'deletion', 0.2867096960544586),
(u'k', 0.28530699014663696),
(u'discussed', 0.2748313546180725),
(u'linear', 0.26605361700057983),
(u'increasing', 0.2620879113674164),
(u'quicksort', 0.25420695543289185),
(u'instructions', 0.2464582324028015)]

Now for the most awesome result ever. What concepts are associated with good? Turns out I’m one of them.

[(u'great', 0.35192880034446716),
(u'very', 0.3159177303314209),
(u'almost', 0.30515187978744507),
(u'concepts', 0.3004434108734131),
(u'linan', 0.2995752990245819),
(u'too', 0.2964840829372406),
(u'basic', 0.2916448712348938),
(u'fit', 0.28502458333969116),
(u'useful', 0.2824944257736206),
(u'answered', 0.28173112869262695)]

Let’s see what’s similar to the profs. Turns out Prof Blaer’s love for ocaml is well noted.

[(u'professor', 0.7219115495681763),
(u'prof.', 0.6643839478492737),
(u'bauer', 0.4823228120803833),
(u'today', 0.44543683528900146),
(u'went', 0.4129117727279663),
(u'pm', 0.4097597897052765),
(u'session', 0.40625864267349243),
(u'mentioned', 0.4010222554206848),
(u'ocaml', 0.3884058892726898),
(u'tonight', 0.38226139545440674)]
[(u'professor', 0.6429691314697266),
(u'slides', 0.5277301669120789),
(u'prof.', 0.5086715221405029),
(u'lecture', 0.4928019642829895),
(u'blaer', 0.4823228120803833),
(u'sign', 0.4362599849700928),
(u'pm', 0.3718754053115845),
(u'went', 0.3549380302429199),
(u'tonight', 0.326946496963501),
(u'perform', 0.3159770965576172)]

Sasha loves grades, which is unsurprising.

[(u'email', 0.5593311786651611),
(u'ta', 0.4426960051059723),
(u'monday', 0.4238441586494446),
(u'hope', 0.3431258201599121)]
'graph'
'linan'