
README for the Coursera "Getting and Cleaning Data" class assignment

Data Set Definition

All of the measurements in this data set are ultimately derived from two instruments within a Samsung Galaxy S II smartphone: an accelerometer and a gyroscope. Each instrument returns a three-dimensional measurement, that is, one value along each of the X, Y, and Z axes. Thirty test subjects wore the smartphone strapped to their waists. The accelerometer and gyroscope were sampled 50 times per second while the subjects performed one of six different activities, generating the raw data.

The original experimenters then did a significant amount of post-processing on the raw data, ultimately resulting in the data set that was used as the input for this analysis.

See the Code Book for the full details of the data set: where it came from, what variables it contains, and what processing was done. The Code Book also documents the contents of the data set that is the final output of this analysis.

Assumptions of the processing script

The processing script run_analysis.R assumes that the data set has already been manually downloaded (from the URL listed below). It further assumes that the data set has been unzipped into a folder named UCI HAR Dataset in R's current working directory.

As mentioned in the Code Book, you can download a copy of the original data set from this URL:

https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
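
If you prefer to script the download instead, a minimal sketch along these lines should work (the destination file name UCI-HAR-Dataset.zip is an arbitrary choice of mine; the zip is expected to extract into the UCI HAR Dataset folder that the script requires):

# Download and unzip the original data set into the current working
# directory, but only if the expected folder is not already present.
if (!file.exists("UCI HAR Dataset")) {
  url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
  download.file(url, destfile = "UCI-HAR-Dataset.zip", mode = "wb")
  unzip("UCI-HAR-Dataset.zip")
}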

NOTE: The processing script clears its environment before it runs, to guarantee that previous contents of the global environment cannot interfere with the analysis. It uses this command to do so:

rm(list = ls())

If you do not want the global environment to be emptied, you can comment out that line before running the script.

The processing script also requires that the dplyr package is installed locally. The script loads it with the library function, so it does not have to be loaded before the processing script is executed; it just has to be installed. If dplyr is not already installed, you can install it with this command:

install.packages("dplyr")

This script was developed and tested with R version 3.1.2 (2014-10-31) in a Windows 7 64-bit environment, using dplyr version 0.4.1.
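
If you want to confirm that your own environment is comparable, a quick check is:

# Print the R version and the installed dplyr version.
R.version.string
packageVersion("dplyr")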

Using the processing script

The processing script run_analysis.R requires no arguments. Since it clears its environment before it runs, it does not depend on what was previously loaded in the environment. There are no additional or external R scripts. All of the processing code is in one script. Therefore, you can run the processing script with this command:

source("run_analysis.R")

When the script has completed, you will find a new file summarized_data.txt in the current working directory. The R environment will also contain these variables:

  • X -- the test and training data sets merged, without the subject ID or activity ID columns. It does have an added row-ID column that is used in merging.
  • reduced_X -- the test and training sets merged, with only the desired subset of columns and with the subject ID and activity name columns added.
  • summarized -- the data set that is the output of the processing script, which is also saved in the file summarized_data.txt.
  • sanityCheck -- the function used to double-check the analysis output.
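
Assuming the script completed without error, a few quick checks on these variables might look like:

# Inspect the dimensions and column names of the final summarized data
# set, and the structure of the intermediate merged data set.
dim(summarized)
head(names(summarized))
str(reduced_X)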

All of the code is in a single script, for simplicity. The important sections of code are labeled with block comments to make it easy to identify what is being done at each step. Only one function is defined -- the function that verifies that the analysis was consistent and accurate. All other processing is done inline in the script.

Why the final data is tidy

For this exercise, either the "narrow" or the "wide" form of tidy data would be valid. I opted for the wide form, as it is more convenient. Each row represents all computed averages for a single observation, that is, the summarized result of a single set of measurements for a single subject and activity. Each column is a single variable (the average of one of the variables in the input data set) for a single observation. There is only one kind of "observational unit," so there is no need for a second table (i.e., a second data.frame) or data file.
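
For illustration only, the wide output could be converted to the narrow form with a few lines of reshape2. This is a sketch, not part of the analysis; it assumes the ID columns are named subject and activity (the actual column names are documented in the Code Book) and that reshape2 is installed:

library(reshape2)

# Melt every measurement column into (measurement, average) pairs,
# keeping one row per subject/activity/variable combination.
narrow <- melt(summarized, id.vars = c("subject", "activity"),
               variable.name = "measurement", value.name = "average")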

How to read the data file

Assuming that the file summarized_data.txt is in the current working directory, you can load this data set into a data.frame in R and view it with this code:

summarized_data <- read.table("summarized_data.txt", header = TRUE) 
View(summarized_data)
