Mallory Analytics

Efficient Pairwise Distance for Binary Data

One of the bottlenecks of any KNN algorithm is calculating the distances between data points. These pairwise distance calculations run in O(n²) time because every combination of two points must be compared for a dataset with n rows. While this doesn’t present too much of an issue for smaller datasets, the runtime rapidly (exponentially you might say) increases as n gets larger.

For example, R’s dist() function takes approximately 3 seconds for a matrix of dimensions 5000×100, and 15 seconds for 10000×100. This function also performs very poorly as the number of columns increases – dist() takes 30 seconds when the matrix is 2000×1000. While there is no way around computing the distance between every pair of rows, there are faster implementations depending on your data types.

If you have purely binary values (0,1) in your data, then simple matrix multiplication can save you seconds or even minutes off the runtime of the distance calculations. As outlined here, the Euclidean distance between binary data points can be quickly calculated using a matrix cross product. So to calculate the Euclidean distance of binary data points in R, you can do:

D <- (!X) %*% t(X) return(D + t(D))

This works by taking the cross product of NOT X and X. The flipped version of X means that the cross product will count all occurrences where !X is set (1) and t(X) is 0. In case you don’t have matrix multiplication memorized, this graphic from Interactive Mathematics is particularly useful (remember that t(X) means that the columns in the second matrix would actually be the rows from X, so it’s still comparing row vs. row).

So how does this distance calculation compare against R’s dist() function? If you’re a fan of 3D plots, here is one that compares the runtime of dist() to matrix multiplication based on the number of rows and columns.

As we can see from the plot above, the runtime of R’s dist() function scales exponentially as the number of rows and columns increase. It should be noted that for data with fewer than 100 columns, dist() and the matrix cross product have effectively the same runtime. But if you have binary data, odds are you’ll also have a lot of columns because you’ve created a lot of dummy columns based off a categorical column. So for categorical columns that have more than 100 distinct values, a matrix cross product will be significantly faster than R’s built-in dist() function.

So there you have it. Matrix cross products can be much faster than existing functions when it comes to calculating pairwise distances. I imagine this is because matrix operations are extremely optimized and implemented at a low level, while functions such as dist() have to deal with R’s set of data types, storage structures, and overhead.

The disparity in runtimes as the number of columns increases may be due to the fact that R stores data frames and matrices as lists of columns, so accessing a row’s vector of values could require iterating over every column. Meanwhile, taking the cross product of !X and the transpose of X means that the rows of X become columns for t(X), which could further reduce runtime.

Stay tuned for another article where I show how this distance method as well as a host of other speedups, can be used to calculate KNN predictions in a fraction of the time that Caret’s KNN method takes.

Writing your first trading algorithm in R

If you’re interested in building your own trading algorithms, check out CloudQuant, which lets you build your own strategies and backtest them on millisecond trade data.

That’s all for this post. Remember that this is a very basic trading algorithm, and that even when using advanced trading algorithms to trade in the market, you still have risk.

Manipulating IPEDS data to work in Tableau

Through my Institutional Research internship at Denison University, I work a lot with IPEDS data. IPEDS, or the Integrated Postsecondary Education Data System, is a website that has facts such as retention rates, size, religious affiliation, and revenues, for thousands of public and private colleges. This data is useful and goes back decades, but it is stored in summary form; or wide format rather than long. Example:

Institution.Name	Alabama	Alaska	Arizona	Arkansas	California	Colorado
Abilene Christian University	0	2	5	0	17	16
Adrian College	0	0	0	0	3	0
Adventist University of Health Sciences	1	0	0	0	1	1
Agnes Scott College	4	0	0	2	11	0
Alaska Pacific University	0	10	0	0	2	1

The problem I run into when using this kind of data in Tableau is that I can’t “color” maps or area charts by the location variable because every state/location is stored in a separate column.

Tableau works most smoothly with data that is stored as individual observations (i.e. a row for every student represented in the table above, with columns for ‘StudentID’ ‘College’ and ‘Location’) because it can quickly aggregate the observations and create that summary data on its own. To transform the summary data shown above into something that tableau can deal with, we have to “melt” it.

This will not create individual observations for each student, but it will combine all the states into one column called “Location”, and all the corresponding values into a column called “Value”.

When loading this data frame into Tableau, “Location” becomes the variable and “Value” is your measure.

The only package you will need to install to transform the data is “reshape”.

require(reshape)

Creating the list

To begin, we need to read in the data tables. Data downloaded from IPEDS is stored as a ‘csv’, so that makes importing it easy, but if your data is stored as an Excel workbook, you can either convert it (save as “.csv”) or install the package XLconnect, which can import excel sheets into R.

This code will store each of your data frames as a separate element in a list called “Residences”, and then name each element with the year it reflects.

temp = list.files(pattern="*.csv")
Residences <- list()
for (k in 1:length(temp)){
 Residences[[k]] <- read.csv(temp[k])
}
names(Residences) <- c(1986,1988,seq.int(1992,2014,2))

“list.files” is a function that retrieves the names of any file type you specify. It defaults its search to your working directory, but you can edit that to any folder on your computer.

#temp = list.files(path = "User/You/Your_Path", pattern="*.any_file_type_you_want", full.names = TRUE)

If you do this, make sure to specify that “full.names” equals TRUE, or “read.csv” won’t work when reading the vector elements.

Melting the list

To melt this list, use “melt.list”. This will recursively melt each data frame in the list, and then combine them all into one long data frame.

The argument “id.vars” tells the function which columns you want to use as your identifying variables, or which ones you don’t want going in the “Location” column.

Specifying “level = ‘Year’” tells the function to name our new column “Year”.

RESI <- melt.list(Residences, id.vars = c("UnitID", "Institution.Name"), level = "Year")

UnitID	Institution.Name	Location	Value	Year
222178	Abilene Christian University	Alabama	1	1986
168528	Adrian College	Alabama	0	1986
133872	Adventist University of Health Sciences	Alabama	0	1986
138600	Agnes Scott College	Alabama	7	1986
102669	Alaska Pacific University	Alabama	0	1986
188526	Albany College of Pharmacy and Health Sciences	Alabama	0	1986
128498	Albertus Magnus College	Alabama	0	1986
168546	Albion College	Alabama	0	1986
210571	Albright College	Alabama	0	1986
237118	Alderson Broaddus University	Alabama	0	1986

Just as a personal preference, I like to order my tables alphabetically.

RESI <- RESI[order(RESI$Institution.Name),]

UnitID	Institution.Name	Location	Value	Year
222178	Abilene Christian University	Alabama	1	1986
222178	Abilene Christian University	Alaska	0	1986
222178	Abilene Christian University	Arizona	12	1986
222178	Abilene Christian University	Arkansas	1	1986
222178	Abilene Christian University	California	18	1986

As you can see, this new table stores the state names under a single column “Location”, but it still tells you how many students live in each state by connecting it with the column “Value. This column ‘Location’ will become the variable in Tableau that you can use to color Area Charts or Maps, while the ‘Value’ column will become the measure variable in Tableau. Tableau’s”Sum” function works correctly here because it groups its operation by school and by Location.

That’s all- if you have any comments or suggestions please let me know, as I am still learning R and Tableau

Political donations among Ohio liberal arts colleges

By Logan Mallory

Money makes the political world go ‘round. Candidates rely on donations to reach voters, get their messages out, and help campaign for other candidates who support the cause. Millions of individual donations go into the 3-4 billion dollar totals for modern campaigns that come from a relatively small slice of the population. How does Denison fit into this? And more broadly, how have Denison’s political donations looked over time and in comparison to other colleges in Ohio? Do faculty earn their reputation as left-leaning?

Employees of liberal arts colleges in Ohio donated a total of $883,284 to political groups and candidates from 1989 to 2016, in 3,421 donations from just 684 people[1]. Moreover, 3% of those donations account for about half of the total dollar amount — donations are highly skewed from a small number of bigger givers.

There are 6,472 instructors employed by liberal arts colleges in Ohio[2] but the total number of people employed by these schools is much much larger. Denison, for instance, employs 296 teaching staff, but over 500 maintenance, administrative, and miscellaneous workers. This means that the actual number of people eligible to list Denison as their employer is much larger than the available data suggests. At best, then, 10 percent of faculty are contributing money to candidates and other political groups. This figure is low — in 2016 data Dr. Djupe collected from a national sample, 19 percent of those with postgraduate degrees gave to presidential candidates and 28 percent gave to any candidate.

Political donations from faculty at Denison University totaled $31,372, from 63 different people. This may not seem like a whole lot of money in the grand scheme of political contributions, but it translates to over $1,300 donated to candidates every year. Compared to the 27 other liberal arts colleges in Ohio, this was the 5th highest total sum. Colleges donating more than Denison over the last 27 years include John Carroll University, Hiram College, Ohio Wesleyan University, and Oberlin College. As the chart below shows, 95.7% of Denison’s donations went to Democrats, as did 90.2% of the money. The top 5 contributors from Denison donated about 40% of Denison’s total $31,372, with the number one contributor donating $6,150.

The median individual donation amount across all 27 colleges was $197, but varied from college to college. Denison is slightly below that average, at a median donation amount of $134. If that small number disappoints you, don’t fret. Denison had the third highest number of donations at 234, after Oberlin (1268) and Ohio Wesleyan (338). Denison also has a higher donation sum than most other colleges.The median sum of donations across all colleges was $12,974, while Denison’s total donation set was $31,372.

When looking at total donation amounts, the top two colleges stick out. Ohio Wesleyan and Oberlin’s combined donations make up 62.7% of the $883,284 donated by liberal arts colleges in Ohio. Oberlin College, with its incredulous number of political donations, managed to donate every cent of its $316,794 to Democrats. Ohio Wesleyan University stood out in that it had consistently larger individual donations than any other college. The median donation amount for Ohio Wesleyan was $702, more than $500 above the median. Reflective of these generous donations was the total amount of money Ohio Wesleyan donated. In just 338 donations, the university’s contributions totaled a whopping $237,350, with one professor donating half of that. Denison, with 234 donations, only got to $31,372. And Oberlin College, despite having the highest total donation amount, did so over 1,268 donations and was only $79,444 ahead of Ohio Wesleyan.

2016 also stood out as a popular year to donate. Likely due to the polarity and media coverage of the contentious election, political donations last year skyrocketed from 317 in the last presidential election in 2012 to 1,799 in 2016. We will have to wait until 2020 though to find out if the 2016 donation frequencies have become the norm.

Overall, it seems that Denison’s political donations are not too different from other liberal arts colleges in Ohio. While the university is in the top 20% in terms of total donation amounts, it does not stand out like Oberlin or Ohio Wesleyan. The results of this analysis also strongly support the prevailing notion that college professors are unapologetically liberal. 93% of donations went to Democratic candidates and groups, as did 87% of the money. But again, only 10% of college instructors are included in this, so it is difficult to generalize these results to other faculty. While it is probably true, one could easily imagine Republicans’ reluctance to declare their allegiance in such a public way when surrounded by Democrats. Whether the other 90% of people are simply having their spouse do the donating or just want to stay out of politics, it is important to remember that donating to a cause you believe in is a very effective way to further it.

_____________________________________________

Notes

The contributions data are available from Open Secrets.org, a website by the Center for Responsive Politics that publishes information about the intersection between money and politics.

Faculty employment data was collected from CollegeFactual.com

It is unclear how many of these donations are from students. While most students list their occupation as “Student” when donating, and most all of Denison’s donations are from faculty, Oberlin may have encouraged its students to list “Oberlin” as their occupation when donating.