Brad Boehmke, Ph.D., is an Operations Research Analyst at Headquarters Air Force Materiel Command, Studies and Analyses Division. He is also Assistant Professor in the Operational Sciences Department at the Air Force Institute of Technology. Dr. Boehmke's research interests are in the areas of cost analysis, economic modeling, decision analysis, and developing applied modeling applications through the R statistical language.
1. Preface
2. Introduction
a. The Role of Data Wrangling
i. Introduction to R
1. Open Source
2. Flexibility
3. Community
ii. R Basics
1. Assignment & Evaluation
2. Vectorization
3. Getting help
4. Workspace
5. Working with packages
6. Style guide
3. Working with Different Types of Data in R
a. Dealing with Numbers
i. Integer vs. Double
ii. Generating sequence of non-random numbers
iii. Generating sequence of random numbers
iv. Setting the seed for reproducible random numbers
v. Comparing numeric values
vi. Rounding numbers
b. Dealing with Character Strings
i. Character string basics
ii. String manipulation with base R
iii. String manipulation with stringr
iv. Set operations for character strings
c. Dealing with Regular Expressions
i. Regex Syntax
ii. Regex Functions
iii. Additional resources
d. Dealing with Factors
i. Creating, converting & inspecting factors
ii. Ordering levels
iii. Revalue levels
iv. Dropping levels
e. Dealing with Dates
i. Getting current date & time
ii. Converting strings to dates
iii. Extract & manipulate parts of dates
iv. Creating date sequences
v. Calculations with dates
vi. Dealing with time zones & daylight savings
vii. Additional resources
4. Managing Data Structures in R
a. Data Structure Basics
i. Identifying the Structure
ii. Attributes
b. Managing Vectors
i. Creating
ii. Adding on to
iii. Adding attributes
iv. Subsetting
c. Managing Lists
i. Creating
ii. Adding on to
iii. Adding attributes
iv. Subsetting
d. Managing Matrices
i. Creating
ii. Adding on to
iii. Adding attributes
iv. Subsetting
e. Managing Data Frames
i. Creating
ii. Adding on to
iii. Adding attributes
iv. Subsetting
f. Dealing with Missing Values
i. Testing for missing values
ii. Recoding missing values
iii. Excluding missing values
5. Importing, Scraping, and Exporting Data with R
a. Importing Data
i. Reading data from text files
ii. Reading data from Excel files
iii. Load data from saved R object file
iv. Additional resources
b. Scraping Data
i. Importing tabular and Excel files stored online
ii. Scraping HTML text
iii. Scraping HTML table data
iv. Working with APIs
v. Additional Resources
c. Exporting Data
i. Writing data to text files
ii. Writing data to Excel files
iii. Saving data as an R object file
iv. Additional resources
6. Creating Efficient & Readable Code in R
a. Functions
i. Function Components
ii. Arguments
iii. Scoping Rules
iv. Lazy Evaluation
v. Returning Multiple Outputs from a Function
vi. Dealing with Invalid Parameters
vii. Saving and Sourcing Functions
viii. Additional Resources
b. Loop Control Statements
i. Basic control statements (e.g., if, for, while)
ii. Apply family
iii. Other useful "loop-like" functions
iv. Additional Resources
c. Simplify Your Code with %>%
i. Pipe (%>%) Operator
ii. Additional Functions
iii. Additional Pipe Operators
iv. Additional Resources
7. Shaping & Transforming Your Data with R
a. Reshaping Your Data with tidyr
i. Making wide data long
ii. Making long data wide
iii. Splitting a single column into multiple columns
iv. Combining multiple columns into a single column
v. Additional tidyr functions
vi. Sequencing your tidyr operations
vii. Additional resources
b. Transforming Your Data with dplyr
i. Selecting variables of interest
ii. Filtering rows
iii. Grouping data by categorical variables
iv. Performing summary statistics on variables
v. Arranging variables by value
vi. Joining datasets
vii. Creating new variables
viii. Additional resources
This guide for practicing statisticians, data scientists, and R users and programmers teaches the essentials of preprocessing data, leveraging the R programming language to quickly turn noisy data into usable pieces of information. Data wrangling, also commonly referred to as data munging, transformation, manipulation, or janitor work, can be a painstakingly laborious process. Roughly 80% of the time spent on data analysis goes to cleaning and preparing data; because this step is a prerequisite to the rest of the data analysis workflow (visualization, analysis, reporting), it is essential to become fluent and efficient in data wrangling techniques.
This book will guide the user through the data wrangling process via a step-by-step tutorial approach and provide a solid foundation for working with data in R. The author's goal is to teach the user how to easily wrangle data in order to spend more time on understanding the content of the data. By the end of the book, the user will have learned: