Pre-winter 2025 Cleanup #13

Open
7 of 45 tasks
nick-ulle opened this issue Dec 16, 2024 · 4 comments

@nick-ulle
Contributor

nick-ulle commented Dec 16, 2024

Update the reader to be more beginner-friendly and follow our guidelines in the DataLab handbook.

Content:

  • Chapter 1 "Command Line"
    • Update discussion of Vim (it confuses some aspects of how Vim works).
    • There is no discussion of file systems and paths in this chapter, but there probably should be.
  • Chapter 2 "Version Control"
    • The entire chapter is in need of a rewrite: some content is outdated, some Git commands presented are not relevant to beginners, and there are numerous typos (I fixed many of the typos in my edits).
    • The chapter treats local- vs. server-based VCS and centralized vs. distributed VCS as if they were two independent axes along which VCS exist, but in reality the former is a pattern of use rather than a property of the VCS. As far as I know, all centralized VCS require a server-based approach, and all distributed VCS allow both.
    • Computer memory is 1-dimensional, not 2-dimensional as presented in the reader.
    • The Unicode explanation makes Unicode seem more mysterious than it really is: computers speak numbers and Unicode is just a standard for numbering characters.
  • Chapter 3 "Introduction to R"
    • Why do all of the code cells in Section 3.2 "Mathematical Operations" have the results hidden?
    • @elisehellwig notes that the types section should probably come earlier.
  • Chapter 4 "Core Programming Concepts"
    • The entire chapter is in need of a rewrite: there are several issues (listed below).
    • This chapter lacks motivation. I think we could do better by teaching control flow structures at points where they naturally become necessary to solve or simplify a data science problem. Functions provide a way to organize code (the "paragraphs" of programming) and lead to if-statements (e.g., "how do I make my function handle this special case?"). Loops arise when we need to process multiple files or items.
    • This chapter is not well-integrated with the surrounding chapters. It assumes some concepts that are not taught until later, introduces some concepts that were already introduced in earlier chapters as if they are new, and introduces some concepts that are also introduced in later chapters.
    • Why do we teach switch? The problem it solves can be solved with other control flow structures, and an exhaustive list of control flow structures seems overwhelming rather than beginner-friendly.
    • The image in this chapter shows code, and should probably just be text rather than an image.
  • Chapter 5 "Files, Packages, and Data"
    • This chapter could use some editing to smooth out the flow and clarify some of the examples.
    • Many of the code cells are set to not be evaluated for some reason (?)
    • Is this really the right place to introduce du?
  • Chapter 6 "Data Structures"
    • This should probably come before or be integrated with Chapter 5.
    • The section on tabular data could probably be better integrated with the similar section in Chapter 7.
  • Chapter 7 "Data Forensics"
    • Lately we keep referring to this as "the statistics chapter," so maybe we should adjust the title?
    • Apply functions, covered here, were previously also covered briefly in Chapter 4, and this chapter was meant to build on that coverage rather than assume no prior knowledge. Update: we no longer cover apply functions in Chapter 4.
    • Tabular data sets are discussed independently in Chapters 3, 5, 6, and 7 (and possibly others). We should clean this up.
  • Chapter 8 "Data Visualization"
    • Missing a chapter abstract/introduction
    • The source listed on the ggplot2 Napoleon's march image is not a book. Is it the ggplot2 package itself, and what is the license on the image? (Posit tends to use CC no derivatives licenses).
    • This chapter is much longer than the others; perhaps it should be two chapters?
    • Consider replacing the wine data with the dogs data introduced in earlier chapters, for continuity and because it makes nicer plots. The current narrative also implies students have seen the wine data before, but that's not true.
  • Chapter 9 "Getting Data from the Web"
    • The distinction between POST and GET is unclear in the current text.
    • The section about query strings seems out of place: it would be more naturally motivated in the hands-on web APIs section.
    • We should probably update all of the references to Twitter.
  • Chapter 10 "Strings and Regular Expressions"
    • The Unicode section is very similar to a section we have in Chapter 1.
    • The Tidyverse section could probably be placed earlier, in Chapter 8 when ggplot2 is introduced.
  • Chapter 11 "Optical Character Recognition"
    • The last example seems to be broken (no output) when the reader is rendered, but works correctly in a regular R session.
  • Chapter 12 "Statistics"
    • This chapter is in need of a rewrite: it assumes a lot of prior statistical knowledge and is not really at the level of our learners.
    • Missing abstract and learning goals.

This list will grow (and I'll try to address some of these) as I spot problems while migrating the reader to Quarto (#12).

nick-ulle self-assigned this Dec 16, 2024
@nick-ulle
Contributor Author

nick-ulle commented Dec 21, 2024

In the interest of time, I'm only including detailed feedback for the core content (chapters 1-10).

@elisehellwig
Contributor

I think it might make sense to explicitly cover read.csv in the introduction to R (basically what I did this year), since it seems like we want to have people read in data as part of the homework for that lecture. I know we are going to cover files, paths, etc. later, but since they have already seen paths with the command line, it doesn't seem like too heavy a lift. It is also the primary way people actually create data.frames.

@PLNReynolds also mentioned that it would make more sense to do at least some of the reproducibility talk later on, once students have felt the struggle of running code a bit more.

@nick-ulle
Contributor Author

@elisehellwig I agree; I've always felt the sequence should be closer to R Basics.

I think we added in the bit about reproducibility with the intention to "motivate everything," but it seems premature to me.

@nick-ulle
Contributor Author

nick-ulle commented Jan 23, 2025

I've replaced most of Chapter 4 with some new examples and some parts of R Basics and Intermediate R. It should now be more on-point (e.g., not teaching infrequently-used things like repeat and switch), but we should definitely do another pass later on to make sure the material is all accessible to our audience and fits in the sequence of topics.
