gmaubach edited this page May 11, 2017 · 51 revisions

2017-05-11: Project Organisation for Market Research and Data Mining

In my current projects I use the project structure described in https://github.com/gmaubach/R-Project-Utilities/blob/master/t_setup_project.R. This structure proved to be inefficient because it does not separate files that are regenerated regularly. It also does not separate files under version control from files outside version control. The structure was designed to suit the process in market research; the processes in data mining are, however, somewhat different.

Putting together all best practices, I came up with a new project structure. It eliminates the above-mentioned shortcomings and is much more complete. At first sight it might be overwhelming, but it should be regarded as a template from which you can draw the parts and ideas suitable for your project.

The new structure looks like:

-- cut --

Principles

  • There are only two classes of files: data and programs.
  • Separate static files from files that change permanently.
  • Separate files under version control from those which are not under version control.
  • Design a project structure suitable for market research projects as well as data mining projects.

Files (in project folder)

  • CITATION
  • CONTRIBUTING
  • LICENSE
  • README.Rmd
  • requirements.txt
  • issues.txt
  • communication.txt
  • styleguide.txt
  • CHANGELOG (e.g. daily entries or reference to version control)
  • .gitignore (contains e.g. /bin, /data, /external, /temp)

Directories (and possible files in project folder)

  • fundamentals (all materials brought to start the project)
  • documentation (manuscripts, documentation of source code, electronic lab notebook)
  • permissions (credentials, passwords, private keys)
  • data (fix)
    • raw (never changed, for reference, including SHA hashes)
    • input (data prepared for import)
    • metadata
  • src (interpreted scripts)
    • configure (IDE setup, project config, load libraries)
    • declare (general definitions for other core sources)
    • core
      • collect (used to collect raw data)
      • import (imports input into analysis platform)
      • explore (exploratory data analysis)
      • clean
      • integrate (combine and merge data)
      • prepare (create new variables and cases)
      • analyse (create permanent results)
      • model
      • present
    • make (build scripts, controllers, make files)
    • sub (everything that is not worth putting into a library)
      • functions (single, atomic sub routines)
      • modules (complex, pooling sub routines)
    • lib (all used libraries, internal and external)
  • bin (binary programs)
  • external (binaries, libraries, packages, etc. from outside)
  • results (permanent results)
    • plots
    • papers
    • tables
    • slides
    • web
    • reports (e. g. according to CRISP-DM)
      • 001_Data_Mining_Report.Rmd
      • 100_Business_Understanding.Rmd
        • 110_Business_Objectives.Rmd
          • 111_Background.Rmd
          • 112_Business_Objectives.Rmd
          • 113_Business_Success_Criteria.Rmd
        • 120_Assess_Situation.Rmd
          • 121_Resource_Inventory.Rmd
          • 122_Requirements_Assumptions_Constraints.Rmd
          • 123_Risks.Rmd
        • 130_Determine_Data_Mining_Goals.Rmd
          • 131_Data_Mining_Goals.Rmd
          • 132_Data_Mining_Success_Criteria.Rmd
        • 140_Project_Planning.Rmd
          • 141_Project_Plan.Rmd
          • 142_Tools_and_Techniques_Assessment.Rmd
      • 200_Data_Understanding.Rmd
        • 210_Initial_Data_Collection_Report.Rmd (uses /src/core/collect)
        • 220_Data_Description_Report.Rmd
        • 230_Data_Exploration_Report.Rmd
        • 240_Data_Quality_Report.Rmd
      • 300_Data_Preparation.Rmd (uses /src/core/prepare)
        • 310_Select_Data.Rmd
        • 320_Data_Cleaning_Report.Rmd (uses /src/core/clean)
        • 330_Derived_Attributes.Rmd
        • 340_Generated_Records.Rmd
        • 350_Integrate.Rmd (uses /src/core/integrate)
        • 360_Reformat.Rmd
      • 400_Modeling.Rmd (uses /src/core/model)
        • 411_Modeling_Technique.Rmd
        • 412_Modeling_Assumptions.Rmd
        • 421_Test_Design.Rmd
        • 431_Parameter_Settings.Rmd
        • 432_Models.Rmd
        • 433_Model_Descriptions.Rmd
        • 434_Model_Assessment.Rmd
        • 435_Revised_Model_Parameter_Settings.Rmd
      • 500_Evaluation.Rmd
        • 511_Business_Assessment.Rmd
        • 512_Approved_Models.Rmd
        • 521_Process_Review_Report.Rmd
        • 531_Action_List.Rmd
        • 532_Decisions.Rmd
    • export (to other projects, interface to other programs, API's)
  • temp (temporary files)
    • src (automatically created scripts)
    • data
    • results
  • templates
  • tests
  • profiling
  • logs (program logs, sessionInfo output)
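Such a skeleton can also be created programmatically. A minimal sketch in R, using a hypothetical project root and only an excerpt of the folders listed above:

```r
# Create an excerpt of the project skeleton; `root` is a hypothetical path.
root <- file.path(tempdir(), "example_project")

dirs <- c("fundamentals", "documentation", "permissions",
          "data/raw", "data/input", "data/metadata",
          "src/configure", "src/core", "src/make", "src/sub", "src/lib",
          "bin", "external", "results", "temp", "templates",
          "tests", "profiling", "logs")

for (d in dirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

# Top-level project files
invisible(file.create(file.path(root, c("README.Rmd", "CHANGELOG", ".gitignore"))))
```

Extend `dirs` with the remaining folders as needed for your own project.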

Version Control

All files and directories except:

  • temp
  • bin (possibly)
  • external (possibly)
  • data (possibly)

Sources

-- cut --

This structure is far from perfect. The integration of the documents for the data mining part could be improved. I am eager to hear what you think and am open to comments and suggestions for enhancements.

That's all for today.

Regards

Georg

2016-07-04: Saving Graphics and Copying to Microsoft Excel

Today I talk about creating graphics in R and copying the result to Excel.

# Alternative 1: Using xlsx library

# Load libraries (the jpeg()/png() devices come from base grDevices, no extra package needed)
library("dplyr")
library("ggplot2")
library("xlsx")

# Create table
freq_table <- mtcars %>% dplyr::group_by(gear) %>% dplyr::summarize(count = n())

# Generate and store graph
png(file.path("C:", "Temp", "graphic.png"))  # open a png device (matches the .png extension)
print(  # inside a script, ggplot objects must be printed explicitly to render
  ggplot(freq_table) +
    geom_bar(aes(x = gear,
                 y = count),
             stat = "identity",
             fill = "gray")
)
dev.off()  # write out graph to file

# Write it out to Excel
wb <- createWorkbook(type = "xlsx")
sheet1 <- createSheet(wb, sheetName = "graph")
my_file <- file.path("C:", "Temp", "graphic.png")
addPicture(file = my_file, sheet = sheet1, scale = 2, startRow = 10, startColumn = 2)
saveWorkbook(wb, file.path("C:", "Temp", "Analysis.xlsx"))

# Credits:
# (1) http://researchsupport.unt.edu/class/Jon/Benchmarks/ExportExcel_L_JDS_Sep2013.pdf
# (2) Zumel, Nina: Practical Data Science with R, Shelter Island: Manning, 2014, p. 61

You can also write your graph directly to Excel using the excel.link library.

# Alternative 2: Using excel.link

library("dplyr")
library(ggplot2)
library("excel.link")

xl.workbook.add()
print(  # render the plot so current.graphics() can pick it up
  ggplot(freq_table) +
    geom_bar(aes(x = gear,
                 y = count),
             stat = "identity",
             fill = "gray")
)
xl[a1] = current.graphics()
xl.workbook.save(filename = file.path("C:", "Temp", "Analysis.xls"))

# Credits:
# (1) http://www.inside-r.org/packages/cran/excel.link/docs/current.graphics
# (2) Zumel, Nina: Practical Data Science with R, Shelter Island: Manning, 2014, p. 61

The quality of the graphic in Excel is better if you use excel.link.

2016-06-30: Unloading a package

Yesterday I experimented with xlsx while enhancing a script that uses openxlsx. Some functions from openxlsx were masked by functions from xlsx after loading xlsx on top of openxlsx. When I ran my script again, I got errors because R was using the functions from xlsx instead of openxlsx. I did not want to restart R. A possible solution is:

detach("package:xlsx", unload = TRUE)

Source: http://stackoverflow.com/questions/6979917/how-to-unload-a-package-without-restarting-r

2016-06-30: Configuration of RTools

I did not copy my RTools to the default directory, so the Makeconf files for MinGW have to be configured:

In

C:\R-Project\R-3.3.0\etc\x64\Makeconf

you need to configure

BINPREF ?= C:/R-Project/Rtools/mingw_64/bin/  
COMPILED_BY = g++  

and in

C:\R-Project\R-3.3.0\etc\i386\Makeconf

you need to configure

BINPREF ?= C:/R-Project/Rtools/mingw_32/bin/  
COMPILED_BY = g++  

Source: https://www.mail-archive.com/[email protected]/msg236630.html
Source: http://dirk.eddelbuettel.com/code/rcpp/Rcpp-FAQ.pdf

2016-06-29: Subsetting and Type Conversion

R often converts objects back and forth. Here is an example:

class(iris[ , 1])
#> [1] "numeric"
class(iris[ , 1, drop = FALSE])
#> [1] "data.frame"
library("tibble")  # provides as_data_frame()
class(as_data_frame(iris)[ , 1])
#> [1] "tbl_df"     "tbl"        "data.frame"

Newer packages like tibble operate in a more consistent way. Please see the source below for details.

Source: Grolemund / Wickham: R for Data Science (http://r4ds.had.co.nz/introduction-2.html)

2016-06-22: New R Help Package SOS

Hi folks,

today I did a search on R books and came across the package "sos". sos implements the ??? operator, which searches the R documentation for a given string.

Example:

???ls

If you do so, it responds with a web page containing a rated list of links to packages containing the search string, ls in this case.

If you are looking for something in R the sos package is your friend.

Source: http://www.burns-stat.com/r-navigation-tools/

2016-06-20: Package "rPython"

rPython is an interface to Python data and code. You can import and export objects from and to Python, as well as execute Python code. Since powerful data mining and machine learning libraries, like the famous scikit-learn library, are available in Python, this makes R a comprehensive workbench for machine learning. Putting this together with Weka and RWeka, you will be able to execute most projects in data mining, machine learning and predictive analytics.

2016-06-16: Package "RWeka"

Weka is the Waikato Environment for Knowledge Analysis, a workbench for manipulating data, data mining and machine learning. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
The RWeka package is an interface to the functions in Weka, so its machine learning algorithms can be used right away from within R. R and Weka are a good combination for doing data mining.

You can find more Information at

https://cran.r-project.org/web/packages/RWeka/RWeka.pdf

2016-06-15: Package "rio"

I found the package "rio" suitable for easy import and export of data to and from R. It is a Swiss army knife for all import and export jobs. Usage is easy:

-- cut --
install.packages("rio")
library("rio")
library("datasets")
rio::export(x = cars, file = "C:/temp/cars_export_test.arff", format = "arff")
-- cut --

You can find more information at

https://cran.r-project.org/web/packages/rio/rio.pdf

Entries of R_Cheat_Sheet.xlsx

2016-06-15: Numerical Error

"Even simple floating point operations have numerical errors.
Methods to deal with that:
(1) Use all.equal() instead of "==".
(2) If you have computed numbers which should logically be integers, use round() to make sure they are.
(3) Use print(numerical_expression, digits = 16) to see what R knows about the number."
Source: Burns: R Inferno, Chapter 1
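A minimal illustration of points (1) and (3):

```r
x <- 0.1 + 0.2

x == 0.3                    # FALSE: floating point representation error
isTRUE(all.equal(x, 0.3))   # TRUE: all.equal() compares with a tolerance
print(x, digits = 16)       # shows more digits than the default 7
```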

2016-06-13: Using ifelse on data with NA

If the data contains NA, ifelse() delivers correct results only if the expression tests for NA first using is.na() and has a nested ifelse() expression in the yes or no branch of the outer ifelse().
Example:
ds_example <- ifelse(is.na(ds_example["variable1"]),            # ifelse condition
                     ifelse(ds_example["variable2"] > 0, 1, 0), # TRUE branch: another ifelse condition
                     FALSE)                                     # FALSE branch: FALSE value (or any other value)
Source: Own Research
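A runnable sketch of the pattern, with hypothetical example data (the column names are invented for illustration):

```r
ds <- data.frame(variable1 = c(1, NA, 3),
                 variable2 = c(5, -1, 0))

# Test for NA first; the nested ifelse() sits in the non-NA branch
result <- ifelse(is.na(ds$variable1),
                 NA,                               # NA rows stay NA
                 ifelse(ds$variable2 > 0, 1, 0))   # regular recode
result
# 1 NA 0
```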

2016-06-09: Printing information on the console during script execution

There are at least 3 alternatives to print text information during script execution on the console:

  1. print(): Much overhead, but the simplest possibility. Automatically adds a newline to the given string. Text color is black by default.
  2. cat(): Simple text output without overhead. Does not add a newline to the text; a newline has to be added if needed, e.g. cat(paste0("Text", "\n")). Text color is black by default.
  3. message(): Simple text output with a newline added. Text color is red by default.

Source: http://adv-r.had.co.nz/Environments.html
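The three alternatives side by side:

```r
print("Step 1 done")     # prints [1] "Step 1 done" with a newline
cat("Step 2 done\n")     # the newline must be supplied explicitly
message("Step 3 done")   # writes to stderr, newline added automatically
```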

2016-06-09: Meaning of mtime, ctime, atime

"mtime = time the content of a file was last altered, e. g. file was edited. Analogy: date on a letter. Under user control, e. g. user can alter that time.
ctime = time the file information was last altered, e. g. mode has changed. Analogy: postmark on an envelope. Under system control, e. g. is only altered by the operating system.
atime = time a file was last read"
Source: http://www.unix.com/tips-and-tutorials/20526-mtime-ctime-atime.html

2016-06-09: Getting file Information

"To access the file information you can use:
file.info()
It returns size, isdir, mode, mtime, ctime, atime, exe."
Source: Run "?file.info()" from R command line without quotes.
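A short sketch:

```r
f <- tempfile()
writeLines("hello", f)

info <- file.info(f)
info$size    # file size in bytes
info$isdir   # FALSE: it is a file, not a directory
info$mtime   # time of last modification
```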

2016-06-09 Finding values of one vector that are NOT IN another vector

"To find the values of one vector that are NOT IN another vector you can use:
(1) !(dataset_2$variable %in% dataset_1$variable)
(2) setdiff(dataset_2$variable, dataset_1$variable)
R checks whether variable in dataset_1 contains elements of variable in dataset_2 and takes the opposite. That is, R checks which elements of variable in dataset_2 are NOT IN variable in dataset_1.
Alternatively you can create a new operator:
"%!in%" <- function(x, y) !('%in%'(x, y))
You can name the operator as you like, e.g. "%not_in%", "%w/o%", etc."
Source: http://stackoverflow.com/questions/5812478/how-i-can-select-rows-from-a-dataframe-that-do-not-match
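A small worked example of all three variants, using two hypothetical vectors:

```r
dataset_1 <- c(1, 2, 3, 4)
dataset_2 <- c(3, 4, 5, 6)

# (1) logical indexing
dataset_2[!(dataset_2 %in% dataset_1)]   # 5 6

# (2) set difference
setdiff(dataset_2, dataset_1)            # 5 6

# (3) custom operator, named as you like
"%!in%" <- function(x, y) !('%in%'(x, y))
dataset_2[dataset_2 %!in% dataset_1]     # 5 6
```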

2016-06-08: Handling of missing values in filter variables as logical vectors

If cases are to be selected using a logical vector, it may happen that the variable used to create the logical filter vector contains missing values (NA). As a consequence, the logical filter vector also contains missing values (NA). This leads to unexpected and unintended results, e.g. empty variables. To solve this issue you can convert the logical vector into a numerical index vector with which() and hence get rid of the missing values in your filter variable.
Source: Wollschlaeger: Grundlagen Datenanalyse mit R, 3. erw. u. überarb. Auflage, Berlin: Springer, 2014, S. 96
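A minimal illustration of the problem and the which() fix:

```r
x <- c(10, NA, 30, -5)

filter <- x > 0      # logical filter vector, contains NA
x[filter]            # NA leaks into the result: 10 NA 30
x[which(filter)]     # which() drops the NA positions: 10 30
```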

2016-05-02: Transformations with factors

The "|" operator is not valid for factor variables and results in NA values.
Source: http://r.789695.n4.nabble.com/Interdependencies-of-variable-types-logical-expressions-and-NA-td4720183.html
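A minimal illustration: applying "|" to factors yields a warning and NA, while comparisons on factors return logical vectors that can be combined safely.

```r
f <- factor(c("a", "b"))

suppressWarnings(f | f)    # NA NA: "|" is not meaningful for factors
(f == "a") | (f == "b")    # TRUE TRUE: comparisons return logicals
```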

2016-04-28: Missing Values in logical Expressions

"Missing values (= NA) are handled in logical expressions as follows:

"&":
T & T = T
F & F = F
T & NA = NA (you cannot decide, hence NA)
F & NA = F (you can decide that regardless of NA the result must be F)

"|":
T | T = T
F | F = F
T | NA = T (you can decide that regardless of the value behind NA the result must be T)
F | NA = NA (you cannot decide, hence NA)"
Source: https://stat.ethz.ch/pipermail/r-help/2016-April/438214.html
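These rules can be checked directly at the console:

```r
stopifnot(is.na(TRUE & NA))              # undecidable, hence NA
stopifnot(identical(FALSE & NA, FALSE))  # FALSE no matter what NA stands for
stopifnot(identical(TRUE | NA, TRUE))    # TRUE no matter what NA stands for
stopifnot(is.na(FALSE | NA))             # undecidable, hence NA
```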

2016-03-31: Finding strings which are empty with Regular Expressions

Info: The regular expression "^$" matches empty strings.
Source: Datacamp.com: Cleaning data
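For example, grepl() with this pattern flags exactly the empty strings in a character vector:

```r
x <- c("", "abc", " ")
grepl("^$", x)   # TRUE FALSE FALSE: only the truly empty string matches
```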