2 Mapping and exploring real datasets: the case study of Obesity
In this second class we read data, map it, and then explore it.
2.1 CSV files: import and use them
In a csv file, cells are separated by a special character, usually a comma (hence Comma Separated Values), although other characters can be used as well. In Europe, a semicolon (;) is often used instead, to avoid confusion with the comma that serves as the decimal mark.
The usual way to import a csv file called mydata.csv into R is the read.csv function:
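A minimal sketch of that call follows. So that it runs on its own, it first writes a tiny file to a temporary location; the column names and values are invented for illustration, and with a real file only the `read.csv` line is needed.

```r
# Write a tiny illustrative file (made-up values) so the example is self-contained.
csv_path <- file.path(tempdir(), "mydata.csv")
writeLines(c("county,value", "A,1.5", "B,2.25"), csv_path)

# Comma-separated file with "." as decimal mark (the read.csv defaults).
mydata <- read.csv(csv_path)

# Semicolon-separated file with "," as decimal mark:
# read.csv2 uses sep = ";" and dec = "," by default.
csv2_path <- file.path(tempdir(), "mydata_eu.csv")
writeLines(c("county;value", "A;1,5", "B;2,25"), csv2_path)
mydata_eu <- read.csv2(csv2_path)

str(mydata)
```

Both calls return a data frame; `read.csv2` is simply `read.csv` with defaults suited to the semicolon convention described above.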
Feel free to use Excel or LibreOffice to format your data. By default these programs save files in their own format rather than csv, so to save a table created in Excel or similar software you may need to use the Export to csv option.
2.2 The CDC data sets
2.2.1 Download the data
The website of the Centers for Disease Control and Prevention (CDC) offers various datasets that you can use and explore.
Before using those datasets, you need to organize them in a format that R and ggplot can read and use. For this class we provide the datasets already formatted the way we need.
The files we need are:
* Obesity_2013.csv
* Obesity_2004.csv
* Income_2013.csv
* Obesity_data_2013_county.csv
You can download them by clicking the links above.
Then put all those csv files in your R working directory. (Advice: to keep your working folder organized, create a subdirectory called data and store all your csv files in it.)
2.2.2 Import the data in R
To import those files as data frames in R:
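One possible way to do this is sketched below. It assumes the files were saved in a `data` subdirectory, as advised earlier; the names `data_dir` and `datasets` are illustrative choices, not part of the course material. Later examples use a data frame called `obesity_cov_2013`; this section does not state which file it is read from.

```r
# Assumption: the four files were saved in a "data" subdirectory of the
# working directory; adjust data_dir if you stored them elsewhere.
data_dir <- "data"
files <- c("Obesity_2013.csv", "Obesity_2004.csv",
           "Income_2013.csv", "Obesity_data_2013_county.csv")
paths <- file.path(data_dir, files)

# Read only the files that are actually present, so the script does not
# stop with an error when a download is missing.
found <- paths[file.exists(paths)]
datasets <- lapply(found, read.csv)
names(datasets) <- tools::file_path_sans_ext(basename(found))

# Each element is a data frame, e.g. datasets$Obesity_2013
names(datasets)
```

Keeping the data frames in a named list is a design choice that avoids four nearly identical `read.csv` lines; assigning each one to its own variable works just as well.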
2.2.3 Visualize the data
A quick way to visualise those data is to use the choroplethr package.
Install and load a mapping package
The function we will use to create county choropleth maps is called county_choropleth. Pass it a data frame with one column named region and one column named value.
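The sketch below shows the expected shape of that data frame. It assumes the function comes from the choroplethr package (with its companion choroplethrMaps), that `region` holds numeric county FIPS codes, and it uses three invented values rather than the class data. The map is drawn only if the packages are installed.

```r
# Toy data frame: three county FIPS codes with made-up values.
# county_choropleth expects numeric county FIPS codes in `region`.
map_df <- data.frame(
  region = c(1001, 1003, 1005),   # Autauga, Baldwin, Barbour (Alabama)
  value  = c(30.1, 26.7, 40.2)    # invented obesity percentages
)

# Draw the map only if the packages are available
# (install.packages("choroplethr") would install the main one).
if (requireNamespace("choroplethr", quietly = TRUE) &&
    requireNamespace("choroplethrMaps", quietly = TRUE)) {
  library(choroplethr)
  toy_map <- county_choropleth(map_df,
                               title      = "Toy example",
                               num_colors = 1,          # continuous colour scale
                               state_zoom = "alabama")  # show one state only
  print(toy_map)
}
```

With the real data you would first rename the FIPS and obesity columns to `region` and `value`; counties absent from the data frame are reported in a warning and drawn as missing.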
2.3 Exploring Data
Warm-up:
* Show the structure, the first few rows, etc.
* Create a scatter plot: obesity versus income
* Fit trend lines through the scatter plots
* Explore correlations between variables
Structure of the obesity data
2.4 First several rows of data frame
Last several rows of data frame
Last five rows
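These inspection steps map onto `str`, `head` and `tail`. The sketch below uses a small invented data frame, `obesity_toy`, as a stand-in; with the class data you would pass `obesity_cov_2013` instead.

```r
# Toy stand-in for the obesity data frame (values are invented).
obesity_toy <- data.frame(
  State.1  = c("AL", "AL", "AK", "AZ", "AR", "CA", "CO", "CT"),
  Income   = c(38000, 42000, 61000, 45000, 36000, 58000, 55000, 67000),
  Obes__aa = c(34.1, 31.8, 27.5, 26.0, 35.2, 23.4, 20.1, 24.9)
)

str(obesity_toy)          # column names, types and first values
head(obesity_toy)         # first six rows by default
tail(obesity_toy)         # last six rows by default
tail(obesity_toy, n = 5)  # last five rows
```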
Plotting obesity data
* Create a scatter plot in ggplot: income vs. obesity in the USA
* Check the correlation between income and obesity
2.5 Plot income vs. obesity
Note that the obesity column name, Obes__aa, contains two underscores.
Change default (y-)axis label:
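A sketch of such a plot, assuming ggplot2 is installed. The data are simulated so the example runs on its own; `toy` and its values are not CDC figures, and with the class data you would pass `obesity_cov_2013`.

```r
library(ggplot2)

# Simulated stand-in for obesity_cov_2013 (not real CDC values).
set.seed(1)
toy <- data.frame(Income = runif(200, 25000, 90000))
toy$Obes__aa <- 40 - 0.0002 * toy$Income + rnorm(200, sd = 2)

p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = Income, y = Obes__aa)) +
  ylab("Obesity (%)")   # replaces the default y-axis label "Obes__aa"
p
```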
2.6 Use log-transformed income
Sometimes, to better grasp the nature of your data, you need to transform them. For data that span several orders of magnitude we often use a logarithmic transformation:
ggplot(data = obesity_cov_2013) +
  geom_point(mapping = aes(x = Log_income, y = Obes__aa)) +
  ylab("Obesity (%)") +
  xlab("ln(income)")
2.6.1 Add linear trend line with 95% CI
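One way to add the line is `geom_smooth` with `method = "lm"`, whose shaded band is a 95% confidence interval by default. The sketch below runs on simulated data (`toy` is invented, and `Log_income` is computed here with the natural logarithm as an assumption about how that column was built).

```r
library(ggplot2)

# Simulated stand-in for the class data (not real values).
set.seed(2)
toy <- data.frame(Income = runif(200, 25000, 90000))
toy$Log_income <- log(toy$Income)   # natural logarithm
toy$Obes__aa <- 90 - 5.5 * toy$Log_income + rnorm(200, sd = 2)

p_lm <- ggplot(toy, aes(x = Log_income, y = Obes__aa)) +
  geom_point() +
  geom_smooth(method = "lm", level = 0.95) +  # straight-line fit, 95% band
  ylab("Obesity (%)") +
  xlab("ln(income)")
p_lm
```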
2.6.2 Add loess trend line with 95% CI
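A loess curve follows local structure in the data instead of forcing a straight line. A sketch on simulated data (the `toy` data frame and its curved relationship are invented for illustration):

```r
library(ggplot2)

# Simulated data with a gently curved relationship (not real values).
set.seed(5)
toy <- data.frame(Log_income = runif(200, 10, 11.5))
toy$Obes__aa <- 30 - 8 * (toy$Log_income - 10.5)^2 + rnorm(200, sd = 1.5)

p_loess <- ggplot(toy, aes(x = Log_income, y = Obes__aa)) +
  geom_point() +
  # se = TRUE draws the confidence band; level sets its coverage.
  geom_smooth(method = "loess", se = TRUE, level = 0.95) +
  ylab("Obesity (%)") +
  xlab("ln(income)")
p_loess
```

The `span` argument of `geom_smooth` controls how wiggly the loess curve is; smaller values follow the points more closely.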
2.6.3 Correlation income vs. obesity
cor.test(obesity_cov_2013$Income,obesity_cov_2013$Obes__aa)
##
## Pearson's product-moment correlation
##
## data: obesity_cov_2013$Income and obesity_cov_2013$Obes__aa
## t = -30.09, df = 3109, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5016747 -0.4472267
## sample estimates:
## cor
## -0.4749051
The test above uses raw income.
cor.test(obesity_cov_2013$Log_income,obesity_cov_2013$Obes__aa)
##
## Pearson's product-moment correlation
##
## data: obesity_cov_2013$Log_income and obesity_cov_2013$Obes__aa
## t = -30.998, df = 3109, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5122901 -0.4585839
## sample estimates:
## cor
## -0.4858955
The test above uses log(income); the correlation is slightly stronger (-0.486 versus -0.475 with raw income).
2.6.4 Try a few other correlations or plots
* County_pop: population of county
* County_houses: households in county
* Urban_fract: % urban population in county
* Diab__aa: Diabetes %
* Leis__aa: “leisure” statistic
* lalowihalfshare: food deserts
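To look at several of these variables at once, `cor` accepts a whole data frame and returns a correlation matrix. The sketch below borrows three column names from the data but fills them with simulated values, so the numbers it prints say nothing about the real counties.

```r
# Simulated stand-in using three of the column names (values invented).
set.seed(3)
n <- 100
toy <- data.frame(Urban_fract = runif(n, 0, 100))
toy$Diab__aa <- 12 - 0.03 * toy$Urban_fract + rnorm(n)
toy$Obes__aa <- 20 + 1.2 * toy$Diab__aa + rnorm(n)

# Pairwise Pearson correlations between all three columns.
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# A single pair, with a significance test and confidence interval.
ct <- cor.test(toy$Diab__aa, toy$Obes__aa)
ct$estimate
```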
Try a facet plot…?
2.6.5 Facet plot
ggplot(data = obesity_cov_2013, aes(x = Log_income, y = Obes__aa)) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_wrap(~ State.1, nrow = 10)
[Figure: facet plot of Obes__aa (note the two underscores) against Log_income, one panel per state]
How can we limit the number of states?
One way: plot only a range of rows.
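A sketch of the row-range approach on simulated data. It relies on the rows being grouped by state, which is an assumption about how the data frame is ordered; `toy` and `first_rows` are illustrative names.

```r
library(ggplot2)

# Simulated data: four states, 25 rows each, grouped by state (invented values).
set.seed(4)
toy <- data.frame(
  State.1    = rep(c("AL", "AK", "AZ", "AR"), each = 25),
  Log_income = runif(100, 10, 11.5)
)
toy$Obes__aa <- 90 - 5.5 * toy$Log_income + rnorm(100, sd = 2)

# Rows 1 to 50 cover the first two states because rows are grouped by state.
first_rows <- toy[1:50, ]

p_facet <- ggplot(first_rows, aes(x = Log_income, y = Obes__aa)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ State.1)
p_facet
```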
Another way: subset the data
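# Counties of a single state, selected by its two-letter code
Alabama <- obesity_cov_2013[which(obesity_cov_2013$State.1 == "AL"), ]
# Counties whose numeric state code is at most 12
first12 <- obesity_cov_2013[which(obesity_cov_2013$State <= 12), ]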
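To keep an arbitrary set of states rather than one state or a range of codes, `%in%` is a convenient variant. The sketch uses a small invented data frame with the same two state columns.

```r
# Toy data with both state columns (values invented).
toy <- data.frame(
  State.1  = c("AL", "AL", "AK", "AZ", "GA", "GA"),
  State    = c(1, 1, 2, 4, 13, 13),                 # numeric state codes
  Obes__aa = c(34.1, 31.8, 27.5, 26.0, 30.3, 29.7)
)

# Keep the counties of several states at once with %in%.
south <- toy[toy$State.1 %in% c("AL", "GA"), ]

# subset() is an equivalent, more readable form for interactive use.
low_codes <- subset(toy, State <= 12)
```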
Correlations within each state: search online for how to compute a correlation by group.
You must then adapt the answer you find to your own data.
2.6.6 Correlations within each state
r <- by(obesity_cov_2013, obesity_cov_2013$State.1,
        FUN = function(X) cor(X$Income, X$Obes__aa, method = "spearman"))
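Now group by state and compute the correlation with plyr, again using obesity_cov_2013. Note that cor defaults to Pearson here, whereas the by() call above used Spearman.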
require(plyr)
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
# Correlation for one group; the argument is named df so that it does not
# shadow the full obesity_cov_2013 data frame.
func <- function(df) {
  data.frame(COR = cor(df$Income, df$Obes__aa))
}
ddply(obesity_cov_2013, .(State), func)
## State COR
## 1 1 -0.636469528
## 2 4 -0.006202271
## 3 5 -0.426673104
## 4 6 -0.574784251
## 5 8 -0.523242778
## 6 9 -0.936956745
## 7 10 -0.827902383
## 8 11 NA
## 9 12 -0.671901965
## 10 13 -0.373281010
## 11 15 -0.075720507
## 12 16 -0.415400469
## 13 17 -0.104035947
## 14 18 -0.373967800
## 15 19 -0.208062329
## 16 20 -0.282640486
## 17 21 -0.529492188
## 18 22 -0.573971335
## 19 23 -0.724489922
## 20 24 -0.697263519
## 21 25 -0.197622037
## 22 26 -0.346072316
## 23 27 -0.376580497
## 24 28 -0.702186289
## 25 29 -0.267925517
## 26 30 -0.190482756
## 27 31 -0.021299234
## 28 32 -0.002202525
## 29 33 -0.205207686
## 30 34 -0.727272029
## 31 35 -0.192393859
## 32 36 -0.624782747
## 33 37 -0.506000104
## 34 38 -0.249070838
## 35 39 -0.434311148
## 36 40 -0.153572480
## 37 41 -0.235557929
## 38 42 -0.654855243
## 39 44 -0.740630563
## 40 45 -0.810487637
## 41 46 -0.609416899
## 42 47 -0.314878618
## 43 48 -0.211351236
## 44 49 -0.278214857
## 45 50 -0.668122950
## 46 51 -0.537430073
## 47 53 -0.361359718
## 48 54 -0.319925084
## 49 55 -0.424645040
## 50 56 0.014855257
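The NA for State 11 deserves a note. Code 11 is the FIPS code of the District of Columbia, which has a single county-equivalent, so that group presumably contains one row, and a correlation is undefined for one observation. The base R sketch below reproduces the situation on invented data and makes the small-group case explicit; `safe_cor` is a hypothetical helper, not part of plyr.

```r
# Toy data: state 11 has a single row (values invented).
toy <- data.frame(
  State    = c(1, 1, 1, 11, 12, 12, 12),
  Income   = c(38000, 42000, 51000, 70000, 40000, 47000, 55000),
  Obes__aa = c(34.1, 31.8, 29.0, 22.5, 30.2, 27.9, 26.4)
)

# Return NA explicitly for groups too small to have a defined correlation.
safe_cor <- function(df) {
  if (nrow(df) < 2) return(NA_real_)
  cor(df$Income, df$Obes__aa)
}

# split() makes one data frame per state; sapply() applies safe_cor to each.
cors <- sapply(split(toy, toy$State), safe_cor)
cors
```

Correlations from states with only a handful of counties should be read with caution even when they are not NA.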