Efficient Pairwise Distance for Binary Data

One of the bottlenecks of any KNN algorithm is calculating the distances between data points. These pairwise distance calculations run in O(n²) time because every pair of points must be compared for a dataset with n rows. While this doesn’t present much of an issue for smaller datasets, the runtime rapidly (quadratically, in fact) increases as n gets larger.

For example, R’s dist() function takes approximately 3 seconds for a matrix of dimensions 5000×100, and 15 seconds for 10000×100. This function also performs very poorly as the number of columns increases – dist() takes 30 seconds when the matrix is 2000×1000. While there is no way around computing the distance between every pair of rows, there are faster implementations depending on your data types.
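If you want to sanity-check those numbers on your own machine, here is a minimal sketch using random binary data (exact timings will vary with your hardware and BLAS):

set.seed(1)
X <- matrix(rbinom(5000 * 100, 1, 0.5), nrow = 5000)  # random 5000x100 binary matrix
system.time(dist(X))  # ~3 seconds in the timings above; yours will differ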

If you have purely binary values (0,1) in your data, then simple matrix multiplication can shave seconds or even minutes off the runtime of the distance calculations. As outlined here, the squared Euclidean distance between binary data points, which for 0/1 data is simply the number of mismatching positions (the Hamming distance), can be quickly calculated using a matrix cross product. Since the square root is monotonic, the squared distance ranks neighbors identically for KNN. So to calculate it in R, you can do:

D <- (!X) %*% t(X)  # counts positions where row i has a 0 and row j has a 1
D + t(D)            # add the mismatches in the other direction

This works by taking the cross product of NOT X and X. Flipping X means that the product counts every position where !X is 1 (i.e., the first row has a 0) and t(X) is 1 (the second row has a 1). In case you don’t have matrix multiplication memorized, this graphic from Interactive Mathematics is particularly useful (remember that t(X) means the columns in the second matrix are actually the rows from X, so it’s still comparing row vs. row).

[Figure: matrix multiplication diagram from Interactive Mathematics]
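As a quick sanity check, here is a minimal sketch on random data (variable names are mine) verifying that the cross product reproduces the squared distances from dist():

set.seed(42)
X <- matrix(rbinom(5 * 4, 1, 0.5), nrow = 5)  # small 5x4 binary matrix

D <- (!X) %*% t(X)  # mismatches where row i has 0 and row j has 1
H <- D + t(D)       # total pairwise mismatch counts (squared Euclidean)

all.equal(as.matrix(dist(X))^2, H, check.attributes = FALSE)  # should be TRUE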

So how does this distance calculation compare against R’s dist() function? If you’re a fan of 3D plots, here is one that compares the runtime of dist() to matrix multiplication based on the number of rows and columns.

As we can see from the plot above, the runtime of R’s dist() function grows quickly as the number of rows and columns increases (quadratically in the rows, roughly linearly in the columns). It should be noted that for data with fewer than 100 columns, dist() and the matrix cross product have effectively the same runtime. But if you have binary data, odds are you’ll also have a lot of columns, because you’ve created a lot of dummy columns from a categorical variable. So for categorical columns that have more than 100 distinct values, a matrix cross product will be significantly faster than R’s built-in dist() function.
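Here is a minimal head-to-head sketch you can run yourself (the 2000×1000 shape matches the timing quoted earlier; results depend on your machine and BLAS):

set.seed(1)
X <- matrix(rbinom(2000 * 1000, 1, 0.5), nrow = 2000)  # 2000x1000 binary matrix

system.time(as.matrix(dist(X))^2)              # built-in, squared for comparability
system.time({ D <- (!X) %*% t(X); D + t(D) })  # binary cross-product version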

So there you have it. Matrix cross products can be much faster than existing functions when it comes to calculating pairwise distances. I imagine this is because matrix operations are extremely optimized and implemented at a low level, while functions such as dist() have to deal with R’s set of data types, storage structures, and overhead.

The disparity in runtimes as the number of columns increases may be due to memory layout: R stores data frames as lists of columns and matrices in column-major order, so extracting a row’s vector of values means skipping across memory rather than reading it sequentially. Meanwhile, taking the cross product of !X and the transpose of X means that the rows of X become columns of t(X), turning row comparisons into cache-friendly column traversals, which could further reduce runtime.

Stay tuned for another article where I show how this distance method, along with a host of other speedups, can be used to calculate KNN predictions in a fraction of the time that caret’s KNN method takes.

Manipulating IPEDS data to work in Tableau

Through my Institutional Research internship at Denison University, I work a lot with IPEDS data. IPEDS, the Integrated Postsecondary Education Data System, is a website with statistics such as retention rates, enrollment, religious affiliation, and revenues for thousands of public and private colleges. This data is useful and goes back decades, but it is stored in summary form, i.e. wide format rather than long. Example:

Institution.Name                          Alabama  Alaska  Arizona  Arkansas  California  Colorado
Abilene Christian University                    0       2        5         0          17        16
Adrian College                                  0       0        0         0           3         0
Adventist University of Health Sciences         1       0        0         0           1         1
Agnes Scott College                             4       0        0         2          11         0
Alaska Pacific University                       0      10        0         0           2         1

The problem I run into when using this kind of data in Tableau is that I can’t “color” maps or area charts by the location variable because every state/location is stored in a separate column.

Tableau works most smoothly with data that is stored as individual observations (i.e. a row for every student represented in the table above, with columns for "StudentID", "College", and "Location") because it can quickly aggregate the observations and create that summary data on its own. To transform the summary data shown above into something Tableau can deal with, we have to "melt" it.

This will not create individual observations for each student, but it will combine all the states into one column called “Location”, and all the corresponding values into a column called “Value”.
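To preview what that looks like, here is a minimal sketch on a two-school slice of the table above (the reshape package used here is introduced below; note that melt names the new columns "variable" and "value" by default):

library(reshape)

# A tiny wide table in the same shape as the IPEDS extract above
wide <- data.frame(
  Institution.Name = c("Abilene Christian University", "Adrian College"),
  Alabama = c(0, 0), Alaska = c(2, 0)
)

melt(wide, id.vars = "Institution.Name")
# One row per school-state pair, with columns Institution.Name,
# variable (the state), and value (the count)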

When loading this data frame into Tableau, "Location" becomes a dimension you can use for coloring and "Value" becomes your measure.

The only package you will need to install to transform the data is “reshape”.

# install.packages("reshape")  # first install, if needed
library(reshape)

Creating the list

To begin, we need to read in the data tables. Data downloaded from IPEDS comes as CSV files, which makes importing it easy, but if your data is stored as an Excel workbook, you can either convert it (save as ".csv") or install the package XLConnect, which can import Excel sheets into R.
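For the Excel route, a minimal sketch (the file name here is a placeholder):

library(XLConnect)
df <- readWorksheetFromFile("your_workbook.xlsx", sheet = 1)  # read the first sheet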

This code will store each of your data frames as a separate element in a list called “Residences”, and then name each element with the year it reflects.

temp <- list.files(pattern = "\\.csv$")  # a regex, not a glob: files ending in .csv
Residences <- list()
for (k in seq_along(temp)) {
  Residences[[k]] <- read.csv(temp[k])
}
# Label each data frame with the survey year it reflects
names(Residences) <- c(1986, 1988, seq.int(1992, 2014, 2))

"list.files" is a function that retrieves the names of all files matching the pattern you specify. It searches your working directory by default, but you can point it at any folder on your computer:

# temp <- list.files(path = "Users/You/Your_Path", pattern = "\\.your_extension$", full.names = TRUE)

If you do this, make sure to set full.names = TRUE; otherwise list.files returns bare file names without the directory, and read.csv won’t be able to find the files.

Melting the list

To melt this list, use “melt.list”. This will recursively melt each data frame in the list, and then combine them all into one long data frame.

The argument "id.vars" tells the function which columns to keep as identifying variables, i.e. the ones you don’t want gathered into the "Location" column.

Specifying “level = ‘Year’” tells the function to name our new column “Year”.

RESI <- melt.list(Residences, id.vars = c("UnitID", "Institution.Name"), level = "Year")
UnitID  Institution.Name                                Location  Value  Year
222178  Abilene Christian University                    Alabama       1  1986
168528  Adrian College                                  Alabama       0  1986
133872  Adventist University of Health Sciences         Alabama       0  1986
138600  Agnes Scott College                             Alabama       7  1986
102669  Alaska Pacific University                       Alabama       0  1986
188526  Albany College of Pharmacy and Health Sciences  Alabama       0  1986
128498  Albertus Magnus College                         Alabama       0  1986
168546  Albion College                                  Alabama       0  1986
210571  Albright College                                Alabama       0  1986
237118  Alderson Broaddus University                    Alabama       0  1986
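Depending on your version of the package, the melted columns may come back with reshape’s default names "variable" and "value" rather than "Location" and "Value"; if so, a quick rename (a small sketch, assuming those defaults) gets you the layout shown above:

# reshape names the gathered columns "variable" and "value" by default
names(RESI)[names(RESI) == "variable"] <- "Location"
names(RESI)[names(RESI) == "value"]    <- "Value"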

Just as a personal preference, I like to order my tables alphabetically.

RESI <- RESI[order(RESI$Institution.Name),]
UnitID  Institution.Name              Location    Value  Year
222178  Abilene Christian University  Alabama         1  1986
222178  Abilene Christian University  Alaska          0  1986
222178  Abilene Christian University  Arizona        12  1986
222178  Abilene Christian University  Arkansas        1  1986
222178  Abilene Christian University  California     18  1986

As you can see, this new table stores the state names under a single column, "Location", but it still tells you how many students live in each state through the corresponding "Value" column. "Location" will become the dimension in Tableau that you can use to color area charts or maps, while "Value" will become the measure. Tableau’s "Sum" aggregation works correctly here because it groups its operation by school and by Location.

That’s all! If you have any comments or suggestions, please let me know, as I am still learning R and Tableau.