3 Big-Data Lab (January 30): More plotting practice and beginning with Twitter data
3.1 Objectives of this lab
This tutorial will cover the following:
- Practice more plots in ggplot;
- Practice ggplots using the public health dataframe (obesity_cov_2013) from previous lab;
- Begin to use Twitter data from an archived dataset;
- If you want to access the Twitter API yourself (not required), you’ll need a Twitter account (see below).
The R functions in this lab program should:
- Find the maximum number of pages to be queried
- Generate all the subpages that make up the reviews
- Scrape the information from each of them
- Combine the information into one comprehensive data frame
3.3 Packages for Twitter exercise
You will need to install some of them first.
3.4 Exploring ggplot
3.5 MORE PRACTICE GGPLOT ON OBESITY AND COVARIATES
Here you will need the data frame from the obesity lab, obesity_cov_2013 (you may have to read it in again). Feel free to include results in your assignment.
Replace x-axis label
Or just remove x-axis label
Too many states on the x-axis? Try rotating the boxplot with coord_flip()
Notice that the states are still addressed by xlab() when using coord_flip()
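The three label and rotation steps above can be sketched as follows. This is a minimal illustration, not the lab's original code: the data frame `toy` and its columns `State` and `pct_obese` are hypothetical stand-ins for obesity_cov_2013, whose actual column names may differ.

```r
library(ggplot2)

# Hypothetical stand-in for obesity_cov_2013: a few counties per state
toy <- data.frame(
  State = rep(c("Tennessee", "Kentucky", "Georgia"), each = 4),
  pct_obese = c(31.2, 33.5, 29.8, 34.1,
                32.0, 35.4, 33.3, 30.9,
                28.7, 30.2, 29.5, 31.8)
)

# Base boxplot: one box per state
p <- ggplot(toy, aes(x = State, y = pct_obese)) + geom_boxplot()

# Replace the x-axis label
p_labeled <- p + xlab("State (counties pooled)")

# Or remove the x-axis label entirely
p_nolabel <- p + xlab(NULL)

# Rotate the plot; xlab() still labels the State axis,
# even though it is now drawn vertically
p_flipped <- p + coord_flip() + xlab("State")

print(p_flipped)
```

With the real data, the same calls should apply once `toy`, `State`, and `pct_obese` are swapped for the actual data frame and column names.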
Want average values for each state? Get the mean of columns 7 through 13 for each state with this command:
States_2013 now has values averaged across counties (not weighted by population) for each state. Is it useful? Note that a boxplot of these state-averaged data shows much less!
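One way to compute such per-state means is base R's `aggregate()`. The sketch below uses a small hypothetical data frame, so the column positions (3 and 4) and names are assumptions; with the real data you would select columns 7 through 13 of obesity_cov_2013 and group by whatever the state column is called.

```r
# Hypothetical stand-in: the real obesity_cov_2013 has more columns;
# here the numeric covariates sit in columns 3 and 4.
toy <- data.frame(
  State = c("TN", "TN", "KY", "KY", "KY"),
  County = c("Knox", "Shelby", "Fayette", "Jefferson", "Pike"),
  pct_obese = c(30, 34, 28, 32, 36),
  pct_inactive = c(25, 29, 24, 27, 33)
)

# Mean of the selected columns within each state.
# Every county counts equally, so this is NOT a population-weighted mean.
state_means <- aggregate(toy[, 3:4], by = list(State = toy$State), FUN = mean)

print(state_means)
```

Because each state collapses to a single row, a boxplot of `state_means` has only one value per state, which is why it shows far less than the county-level boxplot.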
3.6 BEGINNING WITH TWITTER DATASET
Once you are logged in with your account (or one created for the occasion), you can go to the Twitter API to request a token to use the API:
Read data into dataframe:
You will need to install the twitteR package and load it:
install.packages("twitteR")
## Installing package into '/usr/local/lib/R/site-library'
## (as 'lib' is unspecified)
## also installing the dependency 'rjson'
library(twitteR)
##
## Attaching package: 'twitteR'
## The following object is masked from 'package:plyr':
##
## id
## The following objects are masked from 'package:dplyr':
##
## id, location
#to get your consumerKey and consumerSecret see the twitteR documentation for instructions
consumer_key <- 'your key'
consumer_secret <- 'your secret'
access_token <- 'your access token'
access_secret <- 'your access secret'
setup_twitter_oauth(consumer_key,
consumer_secret,
access_token,
access_secret)
We first look for all tweets with the hashtag #GoVols:
Then we extract the date of each tweet and store it in a vector:
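A minimal sketch of the date-formatting step, assuming the tweet timestamps are available as a POSIXct vector. The vector `created` below is made up for illustration; it stands in for the timestamps of the retrieved tweets.

```r
# Hypothetical timestamps standing in for the creation times of the tweets
created <- as.POSIXct(
  c("2017-01-28 14:05:10", "2017-01-28 14:47:32",
    "2017-01-28 15:02:00", "2017-01-29 09:30:45"),
  tz = "UTC"
)

# Keep only the day of the month and the hour, e.g. "28 14"
day_hour <- format(created, "%d %H", tz = "UTC")

print(day_hour)
```

Dropping minutes and seconds means that all tweets posted within the same hour of the same day share one label, which makes them easy to count afterwards.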
As we formatted the tweets so that only the day and the hour are kept, we can count how many tweets we have for each day and hour and plot the result:
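The counting and plotting step could look like the sketch below. It uses base R's `table()` and `barplot()` as one possible approach; the `day_hour` vector is hypothetical example data in the "day hour" format described above, not real #GoVols tweets.

```r
# Hypothetical "day hour" labels, one per tweet
day_hour <- c("28 14", "28 14", "28 15", "29 09", "29 09", "29 09")

# Count how many tweets fall in each day-hour bin
counts <- table(day_hour)
print(counts)

# Bar chart of tweet volume over time (bins are sorted alphabetically,
# which matches chronological order within a single month)
barplot(counts, xlab = "Day and hour", ylab = "Number of tweets")
```

If the tweets span more than one month, including the month (or the full date) in the label would keep the bins in chronological order.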