2 Mapping and exploring real datasets: the case study of Obesity

In this second class, we read data, map it, and then explore it.

2.1 CSV files: import and use them

In a CSV file, each cell is separated by a special character, usually a comma (hence Comma-Separated Values), although other characters can be used as well. In Europe, the semicolon (;) is often used instead to avoid confusion, because the comma serves as the decimal separator.

The usual way to import a CSV file called mydata.csv into R is with the read.csv function:
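A minimal sketch of that import. So that the snippet runs on its own, it first writes a tiny made-up table to a temporary folder; the column names and values are placeholders, not course data. With a real file, only the `read.csv` line is needed. The second half shows `read.csv2`, the base R variant for semicolon-separated files with decimal commas.

```r
# Toy table written to a temporary mydata.csv (placeholder values, not real data)
csv_path <- file.path(tempdir(), "mydata.csv")
writeLines(c("name,value", "a,1.5", "b,2.5"), csv_path)

# Comma-separated cells, "." as the decimal mark
mydata <- read.csv(csv_path)
print(mydata)

# European-style file: ";" separates cells and "," marks decimals
csv2_path <- file.path(tempdir(), "mydata_eu.csv")
writeLines(c("name;value", "a;1,5", "b;2,5"), csv2_path)
mydata_eu <- read.csv2(csv2_path)
print(mydata_eu)
```

Both calls should return the same data frame, with `value` read as a numeric column.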

Feel free to use Excel or LibreOffice to format your data. By default, these programs usually do not save files in CSV format. To save a table created in Excel or a similar program as a CSV file, you may need to use the Export to CSV option.

2.2 The CDC data sets

2.2.1 Download the data

The Centers for Disease Control and Prevention (CDC) website offers various datasets that you can use and explore.

Before using these datasets, however, you need to organize them in a format that R and ggplot can read and use. For this class, we provide the datasets already formatted the way we need them. The files we need are:

* Obesity_2013.csv
* Obesity_2004.csv
* Income_2013.csv
* Obesity_data_2013_county.csv

You can download them by clicking the links above. Then put all the CSV files in your R working directory. (Tip: to keep your working folder organized, create a subdirectory called data and store all your CSV files in it.)
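One possible way to follow that advice is sketched below. It assumes the four files were saved in a subdirectory named `data` of the current working directory; the object name `datasets` is just an illustration. The snippet first reports which files are in place, so it runs without error even before anything has been downloaded.

```r
# Assumption: the CSV files live in a "data" subdirectory of the working directory
data_dir <- "data"
files <- c("Obesity_2013.csv", "Obesity_2004.csv",
           "Income_2013.csv", "Obesity_data_2013_county.csv")
paths <- file.path(data_dir, files)

# Report which files are in place before trying to read them
found <- file.exists(paths)
print(data.frame(file = files, found = found))

# Read only the files that were found, into a named list of data frames
datasets <- lapply(paths[found], read.csv)
names(datasets) <- tools::file_path_sans_ext(files[found])
```

Each table is then available by name, for example `datasets$Obesity_2013`.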

2.2.3 Visualize the data

A quick way to visualise these data is to use the choroplethr package:

* Install and load a mapping package
* The function we will use to create county choropleth maps is called county_choropleth
* Pass it a data frame with one column named region and one column named value
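A hedged sketch of those steps, using a three-row toy table with made-up values. The column name `fips` is hypothetical, and the sketch assumes `region` holds numeric county FIPS codes. Some versions of choroplethr also rely on the companion choroplethrMaps package, so the map is drawn only if both are installed.

```r
# Toy county table (made-up values); "fips" is a hypothetical column name
obesity_toy <- data.frame(
  fips     = c(1001, 1003, 6037),   # Autauga AL, Baldwin AL, Los Angeles CA
  Obes__aa = c(30.1, 27.4, 21.5)    # placeholder obesity values
)

# county_choropleth() expects exactly these two column names
map_df <- data.frame(region = obesity_toy$fips, value = obesity_toy$Obes__aa)

# Draw the map only if choroplethr and its map data package are available
if (requireNamespace("choroplethr", quietly = TRUE) &&
    requireNamespace("choroplethrMaps", quietly = TRUE)) {
  library(choroplethr)
  # num_colors = 1 requests a continuous colour scale
  print(county_choropleth(map_df,
                          title = "Obesity by county (toy data)",
                          num_colors = 1))
}
```

Counties that are absent from the data frame are expected to show up as missing on the map, so a complete county table gives a far more useful picture than this toy one.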

Other, more complex packages for handling maps also exist.

2.3 Exploring Data

* Warm-up: show the structure, the first 2 values, etc.
* Create a scatterplot: obesity versus income
* Fit trend lines through scatterplots
* Explore correlations between variables

Structure of the obesity data

2.4 First several rows of data frame

Last several rows of data frame

Last five rows
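The inspection steps listed above (structure, first rows, last rows) can be sketched with base R's `str`, `head` and `tail`. The data frame below is a toy stand-in with made-up values; only the column names `State`, `Income` and `Obes__aa` come from the code later in these notes.

```r
# Toy stand-in for the county data (made-up values, real column names)
obesity_cov_2013 <- data.frame(
  State    = c(1, 1, 1, 4, 4, 4, 6, 6),
  Income   = c(38000, 42000, 51000, 45000, 47000, 60000, 55000, 82000),
  Obes__aa = c(34.2, 32.5, 29.8, 27.1, 28.3, 24.6, 25.9, 20.4)
)

str(obesity_cov_2013)          # structure: column names, types, first values
head(obesity_cov_2013)         # first six rows (the default)
head(obesity_cov_2013, n = 2)  # first two rows
tail(obesity_cov_2013)         # last six rows (the default)
tail(obesity_cov_2013, n = 5)  # last five rows
```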

Plotting obesity data

* Create a scatter plot in ggplot: income vs obesity in the USA
* Check the correlation between income and obesity
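A minimal sketch of both steps, again on a toy stand-in with made-up values. The linear trend line (`geom_smooth` with `method = "lm"`) is one common choice, not necessarily the one used in class.

```r
library(ggplot2)

# Toy stand-in for the county data (made-up values, real column names)
obesity_cov_2013 <- data.frame(
  Income   = c(38000, 42000, 51000, 45000, 47000, 60000, 55000, 82000),
  Obes__aa = c(34.2, 32.5, 29.8, 27.1, 28.3, 24.6, 25.9, 20.4)
)

# Scatter plot of obesity against income, with a linear trend line
p <- ggplot(obesity_cov_2013, aes(x = Income, y = Obes__aa)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Income", y = "Obes__aa")
print(p)

# Pearson correlation between the two variables
r <- cor(obesity_cov_2013$Income, obesity_cov_2013$Obes__aa)
print(r)
```

If the real columns contain missing values, `cor(..., use = "complete.obs")` drops the incomplete rows instead of returning NA.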

2.6 Use log-transformed income

Sometimes, to better grasp the nature of your data, you need to transform them. For data that span several orders of magnitude, we often use a logarithmic transformation:
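Two ways to apply such a transformation are sketched below on toy data with made-up values. The base of the logarithm (10 here) is an arbitrary choice: changing it rescales the axis but leaves the correlation unchanged. The column name `log_Income` is introduced only for this illustration.

```r
library(ggplot2)

# Toy stand-in for the county data (made-up values, real column names)
obesity_cov_2013 <- data.frame(
  Income   = c(25000, 32000, 41000, 55000, 78000, 120000),
  Obes__aa = c(35.0, 33.1, 30.4, 27.9, 24.8, 21.2)
)

# Option 1: add a log-transformed column and compare the correlations
obesity_cov_2013$log_Income <- log10(obesity_cov_2013$Income)
r_raw <- cor(obesity_cov_2013$Income, obesity_cov_2013$Obes__aa)
r_log <- cor(obesity_cov_2013$log_Income, obesity_cov_2013$Obes__aa)
print(c(raw = r_raw, log = r_log))

# Option 2: keep the raw values and put the x axis on a log scale
p_log <- ggplot(obesity_cov_2013, aes(x = Income, y = Obes__aa)) +
  geom_point() +
  scale_x_log10()
print(p_log)
```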

2.6.4 Try a few other correlations or plots

County_pop: population of county
County_houses: households in county
Urban_fract: % urban population in county
Diab__aa: Diabetes %
Leis__aa: “leisure” statistic
lalowihalfshare: food deserts
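One way to look at several of these variables at once is a correlation matrix. The sketch below uses randomly generated toy values (not CDC data) under the column names listed above, so the numbers it prints carry no meaning; with the real data frame, only the last two statements are needed.

```r
# Random toy values, for illustration only (not real data)
set.seed(1)
n <- 20
toy <- data.frame(
  Obes__aa    = runif(n, 20, 40),
  Income      = runif(n, 30000, 90000),
  Diab__aa    = runif(n, 5, 15),
  Leis__aa    = runif(n, 15, 35),
  Urban_fract = runif(n, 0, 100)
)

# Pairwise Pearson correlations between the chosen columns
vars <- c("Obes__aa", "Income", "Diab__aa", "Leis__aa", "Urban_fract")
cor_mat <- cor(toy[, vars], use = "pairwise.complete.obs")
print(round(cor_mat, 2))
```

The `use = "pairwise.complete.obs"` argument lets each pair of columns use all rows where both values are present, which helps when some counties have missing entries.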

Try a facet plot…?
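As a starting point, here is a sketch of a facet plot with one panel per state, using ggplot2's `facet_wrap`. The data are a toy stand-in with made-up values; faceting by `State` is an assumption about what the exercise intends.

```r
library(ggplot2)

# Toy stand-in for the county data (made-up values, real column names)
obesity_cov_2013 <- data.frame(
  State    = c(1, 1, 1, 6, 6, 6, 12, 12, 12),
  Income   = c(38000, 42000, 51000, 55000, 68000, 82000, 40000, 48000, 59000),
  Obes__aa = c(34.2, 32.5, 29.8, 25.9, 23.0, 20.4, 30.1, 27.7, 25.2)
)

# Same scatter plot as before, split into one panel per state
p_facet <- ggplot(obesity_cov_2013, aes(x = Income, y = Obes__aa)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ State)
print(p_facet)
```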

2.6.6 Correlations within each state

Now group by state and do the same thing, using obesity_cov_2013

require(plyr)
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
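# Correlation between income and obesity within one group of rows.
# A state with a single county gives NA (a correlation needs at least two points).
cor_by_group <- function(df) {
  data.frame(COR = cor(df$Income, df$Obes__aa))
}
ddply(obesity_cov_2013, .(State), cor_by_group)  # one row per state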
##    State          COR
## 1      1 -0.636469528
## 2      4 -0.006202271
## 3      5 -0.426673104
## 4      6 -0.574784251
## 5      8 -0.523242778
## 6      9 -0.936956745
## 7     10 -0.827902383
## 8     11           NA
## 9     12 -0.671901965
## 10    13 -0.373281010
## 11    15 -0.075720507
## 12    16 -0.415400469
## 13    17 -0.104035947
## 14    18 -0.373967800
## 15    19 -0.208062329
## 16    20 -0.282640486
## 17    21 -0.529492188
## 18    22 -0.573971335
## 19    23 -0.724489922
## 20    24 -0.697263519
## 21    25 -0.197622037
## 22    26 -0.346072316
## 23    27 -0.376580497
## 24    28 -0.702186289
## 25    29 -0.267925517
## 26    30 -0.190482756
## 27    31 -0.021299234
## 28    32 -0.002202525
## 29    33 -0.205207686
## 30    34 -0.727272029
## 31    35 -0.192393859
## 32    36 -0.624782747
## 33    37 -0.506000104
## 34    38 -0.249070838
## 35    39 -0.434311148
## 36    40 -0.153572480
## 37    41 -0.235557929
## 38    42 -0.654855243
## 39    44 -0.740630563
## 40    45 -0.810487637
## 41    46 -0.609416899
## 42    47 -0.314878618
## 43    48 -0.211351236
## 44    49 -0.278214857
## 45    50 -0.668122950
## 46    51 -0.537430073
## 47    53 -0.361359718
## 48    54 -0.319925084
## 49    55 -0.424645040
## 50    56  0.014855257
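The startup message above warns that loading plyr after dplyr can cause problems. As an alternative, the same per-state correlation can be sketched with dplyr alone. The data are a toy stand-in with made-up values, and the functions are called with the `dplyr::` prefix so that the result does not depend on which package was loaded last.

```r
library(dplyr)

# Toy stand-in for the county data (made-up values, real column names)
obesity_cov_2013 <- data.frame(
  State    = c(1, 1, 1, 6, 6, 6, 11),
  Income   = c(38000, 42000, 51000, 55000, 68000, 82000, 70000),
  Obes__aa = c(34.2, 32.5, 29.8, 25.9, 23.0, 20.4, 22.0)
)

# One correlation per state; explicit dplyr:: avoids masking by plyr
state_cor <- obesity_cov_2013 %>%
  dplyr::group_by(State) %>%
  dplyr::summarise(COR = cor(Income, Obes__aa))
print(state_cor)
```

In this toy table, state 11 has a single row, so its correlation is NA, just as in the plyr output above.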