Class Score Analysis for CS 3134 Data Structures

After TA-ing COMS W3134 Data Structures for a semester, I was left with a sizable amount of data from the homeworks, exams, and Piazza usage. So I decided to make some pretty graphs out of it. This stemmed from my frustration with previous classes: why give students only a mean and standard deviation when there are so many more dimensions to the data? Perhaps this will even inspire a student or two to like statistics (which, given the way it is taught at Columbia, is unfortunately hardly inspiring).

So here’s some data porn.

To make sense of it all, I’m going to invoke an imaginary student called George. George is that annoying student at the front of the class who questions the professor non-stop about the nitty-gritty details of the course. Yes, there’s always a George.

Overall Distributions

George: What’s the mean and standard deviation for each homework, the midterm, and the final? Are they normal? You know, I have this hypothesis that homeworks are bimodal…

Now now George. Here’s all the info you need:

The mode of some homeworks tends to be around full marks, which is rather unsurprising given the nature of homeworks: some students fulfil all the requirements, while others simply throw their hands in the air and, well… don’t do them. This also reflects the content of the homeworks. Homeworks 3 and 4 have many small (unrelated) parts, so it is entirely possible to score well on one part and not on the others. The rest tend to have a single, larger programming component.

Homework Covariance

George: Do students do consistently well / badly in all the homeworks, or do they tend to vary quite a bit?

Great question, George! Let’s see. Here we are interested in whether homework grades, taken as a whole, move together. That is, if Student A scores well in Homeworks 1, 2, and 3, does he score well in Homeworks 4 and 5 as well? What about Student B? What about other homeworks?

Turns out the easiest way to answer this is to look at the correlations between the homework scores.
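For anyone who wants to try this on their own grades, here is a minimal Python sketch of a pairwise correlation matrix. The numbers in this post come from R; Python is used here purely for illustration, and the scores and the `pearson` helper below are made up for the example, not the class data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up scores for six hypothetical students (NOT the class data).
scores = {
    "hw1": [95, 80, 100, 60, 90, 70],
    "hw2": [90, 70, 100, 50, 85, 75],
    "hw3": [85, 60, 95, 65, 80, 55],
}

# Correlation of every homework with every other homework.
names = list(scores)
corr = {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}

for a in names:
    print(a, " ".join(f"{corr[a][b]:6.2f}" for b in names))
```

Each row shows how one homework moves with the others: values near 1 mean students who do well on one tend to do well on the other, and values near 0 mean the two are unrelated.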

Turns out most of them are moderately correlated. We can quantify this further by using Principal Component Analysis (PCA). Intuitively, PCA is a method that distills a common “factor” for all the data. After running PCA on the data, we have the following results:

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
data.hw1 -0.324 -0.103 0.139 0.927
data.hw2 -0.410 -0.433 0.686 0.317 -0.270
data.hw3 -0.612 -0.653 0.428
data.hw4 -0.510 -0.825 -0.240
data.hw5 -0.304 0.892 0.286 0.172

This means that PCA re-expressed our 5 dimensions of data (one per homework) as 5 new components, the first of which has all the homeworks moving in the same direction (every Comp.1 loading has the same sign). But how much of the actual variation is captured by that component? Turns out, quite a lot!

Importance of components:
                           Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
Standard deviation     40.9058984 25.1014708 23.9316752 20.2610806 19.4151851
Proportion of Variance  0.4567392  0.1719867  0.1563301  0.1120525  0.1028915
Cumulative Proportion   0.4567392  0.6287259  0.7850560  0.8971085  1.0000000

The first component, in which all 5 homeworks move in the same direction, accounts for around 46% of the total variation. That means a common “force” that moves all the homeworks in the same direction accounts for nearly half of the differences in homework scores from student to student.
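The “Proportion of Variance” row can be recomputed from the “Standard deviation” row: square each standard deviation to get a variance, then divide by the total. A small Python sketch of that arithmetic, using the numbers from the table (the table itself is R output; Python is only used here for illustration):

```python
# Standard deviations of the five components, copied from the table above.
std_devs = [40.9058984, 25.1014708, 23.9316752, 20.2610806, 19.4151851]

# Variance of each component, and each one's share of the total variance.
variances = [sd ** 2 for sd in std_devs]
total = sum(variances)
proportions = [v / total for v in variances]

# Running total of the shares, matching the "Cumulative Proportion" row.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Comp.{i}: proportion {p:.4f}, cumulative {c:.4f}")
```

The first share works out to about 0.457, matching the table.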

This means that students’ homework scores covary to a pretty large extent. So George, students who do well in some homeworks tend to do well in the others as well.
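If you are curious what “finding the first component” involves, here is a rough Python sketch that pulls the leading component out of a covariance matrix by power iteration. This is not how the output above was produced (that came from R); the technique, the helper names, and the scores are all assumptions made for illustration.

```python
from math import sqrt

def covariance_matrix(columns):
    """Sample covariance matrix of a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[x - m for x in col] for col, m in zip(columns, means)]
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]

def first_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    k = len(cov)
    vec = [1.0 / sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient: the variance captured along the unit vector vec.
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(k))
                     for i in range(k))
    return vec, eigenvalue

# Made-up scores for six hypothetical students (NOT the class data).
homeworks = [
    [95, 80, 100, 60, 90, 70],
    [90, 70, 100, 50, 85, 75],
    [85, 60, 95, 65, 80, 55],
]

cov = covariance_matrix(homeworks)
loadings, eigenvalue = first_component(cov)

# Share of the total variance captured by the first component.
total_variance = sum(cov[i][i] for i in range(len(cov)))
share = eigenvalue / total_variance

print("loadings:", [round(x, 3) for x in loadings])
print(f"share of variance: {share:.3f}")
```

The sign of a component is arbitrary: loadings that are all positive here and all negative in the R output describe the same thing, namely every homework moving in the same direction.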

Homework and Exams

George: What about the age old idiom that students who do well in homeworks tend to do better in exams? I was told by my elementary school teacher that…

Calm your tits George.

Let’s run a linear regression of midterm and final scores on homework scores. Intuitively, this answers the question: if a student scores one point better on homeworks, how much better does he score on midterms / finals?

Let’s look at midterms first:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.82253    4.52721   8.796 7.88e-16 ***
data$hw      0.07482    0.01105   6.768 1.52e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.79 on 193 degrees of freedom
Multiple R-squared: 0.1918, Adjusted R-squared: 0.1876
F-statistic: 45.81 on 1 and 193 DF, p-value: 1.523e-10

Turns out homeworks correlate reasonably well with the midterm. A 1 point increase in homework (remember homeworks are graded out of 500, since I excluded homework 6, which we were still grading at the time of writing) translates into a 0.075 point increase in the midterm. The effect is highly statistically significant (p ≈ 1.5e-10), but the R-squared is rather low (0.19), meaning that homework explains only about a fifth of the variation in midterm scores and the relationship does not hold very strictly for every student.
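To make the slope concrete, here is a small Python sketch that plugs a few homework totals into the fitted line. The intercept and slope are copied from the output above; the function name and the example totals are just for illustration, and since the R-squared is low, individual students will land well away from these predictions.

```python
# Fitted midterm coefficients, copied from the regression output above.
INTERCEPT = 39.82253
SLOPE = 0.07482  # midterm points per homework point (homework total out of 500)

def predicted_midterm(homework_total):
    """Midterm score the fitted line predicts for a given homework total."""
    return INTERCEPT + SLOPE * homework_total

# Illustrative homework totals, not actual students.
for hw in (300, 400, 500):
    print(f"homework {hw}/500 -> predicted midterm {predicted_midterm(hw):.1f}")

# Difference in predicted midterm score for a 100-point homework gap.
gap = predicted_midterm(500) - predicted_midterm(400)
print(f"100 homework points -> {gap:.2f} midterm points")
```

In other words, a full 100 points of homework (a whole assignment's worth) moves the predicted midterm by about 7.5 points.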

Let’s look at finals now:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.3966     4.7103   5.392 2.02e-07 ***
data$hw       0.1108     0.0115   9.637  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.35 on 193 degrees of freedom
Multiple R-squared: 0.3249, Adjusted R-squared: 0.3214
F-statistic: 92.87 on 1 and 193 DF, p-value: < 2.2e-16

Homeworks correlate a lot better with finals than with midterms. Here, a 1 point increase in homework translates into a 0.11 point increase in the final, and the R-squared is higher (0.32 versus 0.19).

These two correlations are visualized in the scatter plot above.
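To compare the two fits on a familiar scale, the R-squared values can be turned back into correlation coefficients: with a single predictor, the correlation is the square root of R-squared, with the sign of the slope. A short Python sketch using the numbers reported in the two outputs (the dictionary layout is just for illustration):

```python
from math import sqrt

# Values copied from the two regression outputs in this section.
fits = {
    "midterm": {"r_squared": 0.1918, "slope": 0.07482, "t_value": 6.768, "f_stat": 45.81},
    "final":   {"r_squared": 0.3249, "slope": 0.1108,  "t_value": 9.637, "f_stat": 92.87},
}

for name, fit in fits.items():
    # With one predictor, |r| = sqrt(R^2) and r takes the sign of the slope.
    sign = 1 if fit["slope"] > 0 else -1
    fit["r"] = sign * sqrt(fit["r_squared"])
    # Sanity check: with one predictor, the F-statistic equals the slope's t value squared.
    fit["t_squared"] = fit["t_value"] ** 2
    print(f"{name}: r = {fit['r']:.3f}, t^2 = {fit['t_squared']:.2f}, F = {fit['f_stat']}")
```

That puts the homework correlation at roughly 0.44 with the midterm and 0.57 with the final.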

Piazza

George: So I’ve been contributing a lot on Piazza. My grades are awesome. Does that relationship hold?

Finally, I always tell students that the more they use Piazza, the better they learn. Asking questions gets you answers, which makes you better. Answering questions makes you even better, since you learn way more by teaching than by being spoon-fed.

Well, you can answer these for yourself here.