Efficient Pairwise Distance for Binary Data

One of the bottlenecks of any KNN algorithm is calculating the distances between data points. These pairwise distance calculations run in O(n²) time because every pair of points must be compared for a dataset with n rows. While this doesn’t present much of an issue for smaller datasets, the runtime rapidly (quadratically, in fact) increases as n gets larger.

For example, R’s dist() function takes approximately 3 seconds for a matrix of dimensions 5000×100, and 15 seconds for 10000×100. This function also performs very poorly as the number of columns increases – dist() takes 30 seconds when the matrix is 2000×1000. While there is no way around computing the distance between every pair of rows, there are faster implementations depending on your data types.
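If you want to sanity-check those numbers on your own machine, here is a minimal sketch using random binary data (exact timings will vary with your hardware and BLAS):

set.seed(1)
X <- matrix(rbinom(5000 * 100, 1, 0.5), nrow = 5000)  # random 5000x100 binary matrix
system.time(dist(X))  # ~3 seconds in the timings above; yours will differ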

If you have purely binary values (0,1) in your data, then simple matrix multiplication can shave seconds or even minutes off the runtime of the distance calculations. As outlined here, the squared Euclidean distance between binary data points, which for 0/1 data is simply the number of mismatching positions (the Hamming distance), can be quickly calculated using a matrix cross product. Since the square root is monotonic, the squared distance ranks neighbors identically for KNN. So to calculate it in R, you can do:

D <- (!X) %*% t(X)  # counts positions where row i has a 0 and row j has a 1
D + t(D)            # add the mismatches in the other direction

This works by taking the cross product of NOT X and X. Flipping X means that the product counts every position where !X is 1 (i.e., the first row has a 0) and t(X) is 1 (the second row has a 1). In case you don’t have matrix multiplication memorized, this graphic from Interactive Mathematics is particularly useful (remember that t(X) means the columns in the second matrix are actually the rows from X, so it’s still comparing row vs. row).

[Figure: matrix multiplication diagram from Interactive Mathematics]
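As a quick sanity check, here is a minimal sketch on random data (variable names are mine) verifying that the cross product reproduces the squared distances from dist():

set.seed(42)
X <- matrix(rbinom(5 * 4, 1, 0.5), nrow = 5)  # small 5x4 binary matrix

D <- (!X) %*% t(X)  # mismatches where row i has 0 and row j has 1
H <- D + t(D)       # total pairwise mismatch counts (squared Euclidean)

all.equal(as.matrix(dist(X))^2, H, check.attributes = FALSE)  # should be TRUE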

So how does this distance calculation compare against R’s dist() function? If you’re a fan of 3D plots, here is one that compares the runtime of dist() to matrix multiplication based on the number of rows and columns.

As we can see from the plot above, the runtime of R’s dist() function grows quickly as the number of rows and columns increases (quadratically in the rows, roughly linearly in the columns). It should be noted that for data with fewer than 100 columns, dist() and the matrix cross product have effectively the same runtime. But if you have binary data, odds are you’ll also have a lot of columns, because you’ve created a lot of dummy columns from a categorical variable. So for categorical columns that have more than 100 distinct values, a matrix cross product will be significantly faster than R’s built-in dist() function.
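Here is a minimal head-to-head sketch you can run yourself (the 2000×1000 shape matches the timing quoted earlier; results depend on your machine and BLAS):

set.seed(1)
X <- matrix(rbinom(2000 * 1000, 1, 0.5), nrow = 2000)  # 2000x1000 binary matrix

system.time(as.matrix(dist(X))^2)              # built-in, squared for comparability
system.time({ D <- (!X) %*% t(X); D + t(D) })  # binary cross-product version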

So there you have it. Matrix cross products can be much faster than existing functions when it comes to calculating pairwise distances. I imagine this is because matrix operations are extremely optimized and implemented at a low level, while functions such as dist() have to deal with R’s set of data types, storage structures, and overhead.

The disparity in runtimes as the number of columns increases may be due to memory layout: R stores data frames as lists of columns and matrices in column-major order, so extracting a row’s vector of values means skipping across memory rather than reading it sequentially. Meanwhile, taking the cross product of !X and the transpose of X means that the rows of X become columns of t(X), turning row comparisons into cache-friendly column traversals, which could further reduce runtime.

Stay tuned for another article where I show how this distance method, along with a host of other speedups, can be used to calculate KNN predictions in a fraction of the time that caret’s KNN method takes.

Manipulating IPEDS data to work in Tableau

Through my Institutional Research internship at Denison University, I work a lot with IPEDS data. IPEDS, the Integrated Postsecondary Education Data System, is a website with statistics such as retention rates, enrollment, religious affiliation, and revenues for thousands of public and private colleges. This data is useful and goes back decades, but it is stored in summary form, i.e. wide format rather than long. Example:

Institution.Name                          Alabama  Alaska  Arizona  Arkansas  California  Colorado
Abilene Christian University                    0       2        5         0          17        16
Adrian College                                  0       0        0         0           3         0
Adventist University of Health Sciences         1       0        0         0           1         1
Agnes Scott College                             4       0        0         2          11         0
Alaska Pacific University                       0      10        0         0           2         1

The problem I run into when using this kind of data in Tableau is that I can’t “color” maps or area charts by the location variable because every state/location is stored in a separate column.

Tableau works most smoothly with data that is stored as individual observations (i.e. a row for every student represented in the table above, with columns for "StudentID", "College", and "Location") because it can quickly aggregate the observations and create that summary data on its own. To transform the summary data shown above into something Tableau can deal with, we have to "melt" it.

This will not create individual observations for each student, but it will combine all the states into one column called “Location”, and all the corresponding values into a column called “Value”.
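To preview what that looks like, here is a minimal sketch on a two-school slice of the table above (the reshape package used here is introduced below; note that melt names the new columns "variable" and "value" by default):

library(reshape)

# A tiny wide table in the same shape as the IPEDS extract above
wide <- data.frame(
  Institution.Name = c("Abilene Christian University", "Adrian College"),
  Alabama = c(0, 0), Alaska = c(2, 0)
)

melt(wide, id.vars = "Institution.Name")
# One row per school-state pair, with columns Institution.Name,
# variable (the state), and value (the count)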

When loading this data frame into Tableau, "Location" becomes a dimension you can use for coloring and "Value" becomes your measure.

The only package you will need to install to transform the data is “reshape”.

# install.packages("reshape")  # first install, if needed
library(reshape)

Creating the list

To begin, we need to read in the data tables. Data downloaded from IPEDS comes as CSV files, which makes importing it easy, but if your data is stored as an Excel workbook, you can either convert it (save as ".csv") or install the package XLConnect, which can import Excel sheets into R.
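For the Excel route, a minimal sketch (the file name here is a placeholder):

library(XLConnect)
df <- readWorksheetFromFile("your_workbook.xlsx", sheet = 1)  # read the first sheet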

This code will store each of your data frames as a separate element in a list called “Residences”, and then name each element with the year it reflects.

temp <- list.files(pattern = "\\.csv$")  # a regex, not a glob: files ending in .csv
Residences <- list()
for (k in seq_along(temp)) {
  Residences[[k]] <- read.csv(temp[k])
}
# Label each data frame with the survey year it reflects
names(Residences) <- c(1986, 1988, seq.int(1992, 2014, 2))

"list.files" is a function that retrieves the names of all files matching the pattern you specify. It searches your working directory by default, but you can point it at any folder on your computer:

# temp <- list.files(path = "Users/You/Your_Path", pattern = "\\.your_extension$", full.names = TRUE)

If you do this, make sure to set full.names = TRUE; otherwise list.files returns bare file names without the directory, and read.csv won’t be able to find the files.

Melting the list

To melt this list, use “melt.list”. This will recursively melt each data frame in the list, and then combine them all into one long data frame.

The argument "id.vars" tells the function which columns to keep as identifying variables, i.e. the ones you don’t want gathered into the "Location" column.

Specifying “level = ‘Year’” tells the function to name our new column “Year”.

RESI <- melt.list(Residences, id.vars = c("UnitID", "Institution.Name"), level = "Year")
UnitID  Institution.Name                                Location  Value  Year
222178  Abilene Christian University                    Alabama       1  1986
168528  Adrian College                                  Alabama       0  1986
133872  Adventist University of Health Sciences         Alabama       0  1986
138600  Agnes Scott College                             Alabama       7  1986
102669  Alaska Pacific University                       Alabama       0  1986
188526  Albany College of Pharmacy and Health Sciences  Alabama       0  1986
128498  Albertus Magnus College                         Alabama       0  1986
168546  Albion College                                  Alabama       0  1986
210571  Albright College                                Alabama       0  1986
237118  Alderson Broaddus University                    Alabama       0  1986
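Depending on your version of the package, the melted columns may come back with reshape’s default names "variable" and "value" rather than "Location" and "Value"; if so, a quick rename (a small sketch, assuming those defaults) gets you the layout shown above:

# reshape names the gathered columns "variable" and "value" by default
names(RESI)[names(RESI) == "variable"] <- "Location"
names(RESI)[names(RESI) == "value"]    <- "Value"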

Just as a personal preference, I like to order my tables alphabetically.

RESI <- RESI[order(RESI$Institution.Name),]
UnitID  Institution.Name              Location    Value  Year
222178  Abilene Christian University  Alabama         1  1986
222178  Abilene Christian University  Alaska          0  1986
222178  Abilene Christian University  Arizona        12  1986
222178  Abilene Christian University  Arkansas        1  1986
222178  Abilene Christian University  California     18  1986

As you can see, this new table stores the state names under a single column, "Location", but it still tells you how many students live in each state through the corresponding "Value" column. "Location" will become the dimension in Tableau that you can use to color area charts or maps, while "Value" will become the measure. Tableau’s "Sum" aggregation works correctly here because it groups its operation by school and by Location.

That’s all! If you have any comments or suggestions, please let me know, as I am still learning R and Tableau.