Running Word Embedding on Piazza Posts

Word embedding (very rough explanation follows) is translating each word in a corpus into a D-dimensional vector. These vectors are supposed to preserve semantic properties and are commonly used in NLP. For example, in a large corpus (say the Google News corpus), the vector for king \(v\_{king}\) should be near the vector for queen \(v\_{queen}\). Furthermore, analogies such as “king is to man as queen is to woman” are preserved via \(v\_{king} - v\_{man} \approx v\_{queen} - v\_{woman}\). These embeddings are generated via neural networks, and a common tool for producing them is Word2Vec by Mikolov et al. A fast implementation of Word2Vec is available in gensim.
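
To make the analogy idea concrete, here is a quick sketch (not part of this post’s pipeline) of the classic query against pretrained Google News vectors. The file path is just wherever you downloaded the vectors to, and the exact loading call depends on your gensim version.

from gensim.models import KeyedVectors

# Assumes the pretrained Google News vectors have been downloaded locally;
# the filename below is the conventional one, adjust to taste.
vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# "king - man + woman" should land near "queen".
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))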

For more background, read the wiki or the paper itself. For interesting experiments, read these:

Wouldn’t it be fun to run this on Piazza posts? I’m a TA for Data Structures this semester (for Prof Blaer and Prof Bauer) and we have a Piazza containing 800 posts (rife with answers, followups, etc.). Speaking of that, we have an average response time of 5 minutes. Find me another team of TAs that can beat that hurhhurh.

Creating the Corpus

Turns out Piazza does not have an official API. However, there is an unofficial one: https://github.com/hfaran/piazza-api. That’s good enough for us.

from piazza_api import Piazza
piazza = Piazza()
piazza.user_login()
Email: lq2137@columbia.edu
········

Now let’s grab all the posts as .jsons.

import json

course = piazza.network('ijfyurrye2g1oc')

posts = course.iter_all_posts()

# Dump each post to its own numbered .json file, logging progress every 100 posts.
count = 0
for post in posts:
    if count % 100 == 0:
        print 'Downloading post %d' % count
    with open('%d.json' % count, 'w') as json_file:
        json_file.write(json.dumps(post))
    count += 1
Downloading post 0
Downloading post 100
Downloading post 200
Downloading post 300
Downloading post 400
Downloading post 500
Downloading post 600
Downloading post 700
Downloading post 800

Then we parse the .jsons into one giant .txt, unscrambling the messy structure that the API returns. We also tokenize and do some basic cleaning (mash everything to lower case). This is a very blunt tool, but it should suffice for now.

import glob
from bs4 import BeautifulSoup
from nltk import word_tokenize

# Walk a post's nested answers/followups and collect every revision's content.
def grab_children(post):
    content_children = []
    def traverse(children):
        for child in children:
            if 'history' in child:
                history = child['history']
                content_children.extend([history_item['content'] for history_item in history])
            if 'children' in child:
                traverse(child['children'])
    traverse(post['children'])
    return content_children

corpus = []

for glob_file in glob.glob('*.json'):
    with open(glob_file, 'r') as json_file:
        post = json.load(json_file)
    # Top-level post content plus everything nested under it.
    content = [history['content'] for history in post['history']]
    content.extend(grab_children(post))
    # Strip HTML, lowercase, and tokenize.
    content = [BeautifulSoup(text, 'html.parser').get_text() for text in content]
    content = [word_tokenize(text.lower()) for text in content]
    corpus.extend(content)

# Write one tokenized sentence per line.
with open('corpus.txt', 'w') as corpus_file:
    for line in corpus:
        corpus_file.write('%s\n' % ' '.join(line).strip().encode('utf-8'))

Training the Model

Now let’s train the model using gensim.

import gensim.models.word2vec as word2vec

Turns out gensim has a nice reader that iterates over a text file with one sentence per line. That’s exactly what we produced in the corpus section.

sentences = word2vec.LineSentence('corpus.txt')

Now let’s train the model on this corpus. These are some hyperparameters that I found to be good. I shan’t explain them too much, but if you want a little more detail: we are essentially using the Skip-Gram with Negative Sampling (SGNS) flavor of Word2Vec over 100 iterations with a skip-gram window of 15. We also discard all words that occur fewer than 5 times.

This should take no more than a minute since the corpus is tiny (which theoretically should give us crappy results, but let’s see).

model = word2vec.Word2Vec(sentences, min_count=5, workers=8, iter=100, window=15, size=300, negative=25)
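
Since we’ll poke at this model a fair bit, it’s worth persisting it so we don’t retrain every time. A quick sketch (the filename here is arbitrary):

# Save the trained model to disk.
model.save('piazza_word2vec.model')

# Later, reload it without retraining:
# model = word2vec.Word2Vec.load('piazza_word2vec.model')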

Results

Now let’s look at some results.

The most_similar method takes in a word, converts it to its vector representation, and finds the other words whose vectors are closest to it by cosine distance. These similar words should have appeared in contexts similar to those of the original word. Let’s find some interesting results.
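
Roughly speaking, this is just cosine similarity over the learned vectors. Here is a minimal sketch of computing it by hand with numpy (assuming both words survived the min_count cutoff); it should agree with gensim’s built-in similarity method:

import numpy as np

# model['word'] returns the learned 300-dimensional vector for that word.
v_homework = model['homework']
v_grade = model['grade']

# Cosine similarity: dot product divided by the product of the norms.
print(np.dot(v_homework, v_grade) / (np.linalg.norm(v_homework) * np.linalg.norm(v_grade)))

# Compare against gensim's built-in:
print(model.similarity('homework', 'grade'))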

First, let’s do a sanity check using the word homework. Turns out homework is indeed associated with words we’d expect to be associated with homework: solutions, grade, email, and even latex. That’s good.

model.most_similar('homework')
[(u'solutions', 0.3142701983451843),
 (u'grade', 0.3114492893218994),
 (u'email', 0.3077431321144104),
 (u'inserted', 0.3056461215019226),
 (u'description', 0.2953464388847351),
 (u'programming', 0.2912822961807251),
 (u'latex', 0.2862958312034607),
 (u'theorem', 0.28351300954818726),
 (u'signature', 0.2821905314922333),
 (u'posts', 0.28168296813964844)]

Now for something more advanced. Let’s try the word heap. Turns out we have words that are pretty related to the heap concept:

model.most_similar('heap')
[(u'build', 0.30242919921875),
 (u'percolatedown', 0.2909688353538513),
 (u'one-by-one', 0.28867167234420776),
 (u'deletion', 0.2867096960544586),
 (u'k', 0.28530699014663696),
 (u'discussed', 0.2748313546180725),
 (u'linear', 0.26605361700057983),
 (u'increasing', 0.2620879113674164),
 (u'quicksort', 0.25420695543289185),
 (u'instructions', 0.2464582324028015)]

Now for the most awesome result ever. What concepts are associated with good? Turns out I’m one of them.

model.most_similar('good')
[(u'great', 0.35192880034446716),
 (u'very', 0.3159177303314209),
 (u'almost', 0.30515187978744507),
 (u'concepts', 0.3004434108734131),
 (u'linan', 0.2995752990245819),
 (u'too', 0.2964840829372406),
 (u'basic', 0.2916448712348938),
 (u'fit', 0.28502458333969116),
 (u'useful', 0.2824944257736206),
 (u'answered', 0.28173112869262695)]

Let’s see what’s similar to the profs. Turns out Prof Blaer’s love for ocaml is well noted.

model.most_similar('blaer')
[(u'professor', 0.7219115495681763),
 (u'prof.', 0.6643839478492737),
 (u'bauer', 0.4823228120803833),
 (u'today', 0.44543683528900146),
 (u'went', 0.4129117727279663),
 (u'pm', 0.4097597897052765),
 (u'session', 0.40625864267349243),
 (u'mentioned', 0.4010222554206848),
 (u'ocaml', 0.3884058892726898),
 (u'tonight', 0.38226139545440674)]
model.most_similar('bauer')
[(u'professor', 0.6429691314697266),
 (u'slides', 0.5277301669120789),
 (u'prof.', 0.5086715221405029),
 (u'lecture', 0.4928019642829895),
 (u'blaer', 0.4823228120803833),
 (u'sign', 0.4362599849700928),
 (u'pm', 0.3718754053115845),
 (u'went', 0.3549380302429199),
 (u'tonight', 0.326946496963501),
 (u'perform', 0.3159770965576172)]

Sasha loves grades, which is unsurprising.

model.most_similar('sasha')
[(u'email', 0.5593311786651611),
 (u'ta', 0.4426960051059723),
 (u'monday', 0.4238441586494446),
 (u'grade', 0.42380213737487793),
 (u'talk', 0.40258437395095825),
 (u'grading', 0.3625785708427429),
 (u'mentioned', 0.3540249466896057),
 (u'approaches', 0.3509276807308197),
 (u'cs', 0.3439938724040985),
 (u'hope', 0.3431258201599121)]

The doesnt_match method picks out the word that does not belong with the others. The model knows that a graph is not a professor:

model.doesnt_match("bauer blaer graph".split())
'graph'

And of course it differentiates between a TA and a professor.

model.doesnt_match("bauer blaer linan".split())
'linan'

ISN’T THIS FUCKING AWESOME.