Through my Institutional Research internship at Denison University, I work a lot with IPEDS data. IPEDS, or the Integrated Postsecondary Education Data System, is a website that has facts such as retention rates, size, religious affiliation, and revenues, for thousands of public and private colleges. This data is useful and goes back decades, but it is stored in summary form; or wide format rather than long. Example:
Institution.Name | Alabama | Alaska | Arizona | Arkansas | California | Colorado |
---|---|---|---|---|---|---|
Abilene Christian University | 0 | 2 | 5 | 0 | 17 | 16 |
Adrian College | 0 | 0 | 0 | 0 | 3 | 0 |
Adventist University of Health Sciences | 1 | 0 | 0 | 0 | 1 | 1 |
Agnes Scott College | 4 | 0 | 0 | 2 | 11 | 0 |
Alaska Pacific University | 0 | 10 | 0 | 0 | 2 | 1 |
The problem I run into when using this kind of data in Tableau is that I can’t “color” maps or area charts by the location variable because every state/location is stored in a separate column.
Tableau works most smoothly with data that is stored as individual observations (i.e. a row for every student represented in the table above, with columns for ‘StudentID’ ‘College’ and ‘Location’) because it can quickly aggregate the observations and create that summary data on its own. To transform the summary data shown above into something that tableau can deal with, we have to “melt” it.
This will not create individual observations for each student, but it will combine all the states into one column called “Location”, and all the corresponding values into a column called “Value”.
When loading this data frame into Tableau, “Location” becomes the variable and “Value” is your measure.
The only package you will need to install to transform the data is “reshape”.
require(reshape)
Creating the list
To begin, we need to read in the data tables. Data downloaded from IPEDS is stored as a ‘csv’, so that makes importing it easy, but if your data is stored as an Excel workbook, you can either convert it (save as “.csv”) or install the package XLconnect, which can import excel sheets into R.
This code will store each of your data frames as a separate element in a list called “Residences”, and then name each element with the year it reflects.
temp = list.files(pattern="*.csv")
Residences <- list()
for (k in 1:length(temp)){
Residences[[k]] <- read.csv(temp[k])
}
names(Residences) <- c(1986,1988,seq.int(1992,2014,2))
“list.files” is a function that retrieves the names of any file type you specify. It defaults its search to your working directory, but you can edit that to any folder on your computer.
#temp = list.files(path = "User/You/Your_Path", pattern="*.any_file_type_you_want", full.names = TRUE)
If you do this, make sure to specify that “full.names” equals TRUE, or “read.csv” won’t work when reading the vector elements.
Melting the list
To melt this list, use “melt.list”. This will recursively melt each data frame in the list, and then combine them all into one long data frame.
The argument “id.vars” tells the function which columns you want to use as your identifying variables, or which ones you don’t want going in the “Location” column.
Specifying “level = ‘Year’” tells the function to name our new column “Year”.
RESI <- melt.list(Residences, id.vars = c("UnitID", "Institution.Name"), level = "Year")
UnitID | Institution.Name | Location | Value | Year |
---|---|---|---|---|
222178 | Abilene Christian University | Alabama | 1 | 1986 |
168528 | Adrian College | Alabama | 0 | 1986 |
133872 | Adventist University of Health Sciences | Alabama | 0 | 1986 |
138600 | Agnes Scott College | Alabama | 7 | 1986 |
102669 | Alaska Pacific University | Alabama | 0 | 1986 |
188526 | Albany College of Pharmacy and Health Sciences | Alabama | 0 | 1986 |
128498 | Albertus Magnus College | Alabama | 0 | 1986 |
168546 | Albion College | Alabama | 0 | 1986 |
210571 | Albright College | Alabama | 0 | 1986 |
237118 | Alderson Broaddus University | Alabama | 0 | 1986 |
Just as a personal preference, I like to order my tables alphabetically.
RESI <- RESI[order(RESI$Institution.Name),]
UnitID | Institution.Name | Location | Value | Year |
---|---|---|---|---|
222178 | Abilene Christian University | Alabama | 1 | 1986 |
222178 | Abilene Christian University | Alaska | 0 | 1986 |
222178 | Abilene Christian University | Arizona | 12 | 1986 |
222178 | Abilene Christian University | Arkansas | 1 | 1986 |
222178 | Abilene Christian University | California | 18 | 1986 |
As you can see, this new table stores the state names under a single column “Location”, but it still tells you how many students live in each state by connecting it with the column “Value. This column ‘Location’ will become the variable in Tableau that you can use to color Area Charts or Maps, while the ‘Value’ column will become the measure variable in Tableau. Tableau’s”Sum” function works correctly here because it groups its operation by school and by Location.
That’s all- if you have any comments or suggestions please let me know, as I am still learning R and Tableau