Suggestions for Ch5 - Classification 1 #41

Closed
21 of 23 tasks
joelostblom opened this issue Sep 27, 2022 · 4 comments · Fixed by #78

Comments

joelostblom commented Sep 27, 2022

These are all suggestions; some might read as harshly written, but that is just me being succinct and writing quickly.

  • Should we cover pandas categoricals? Not necessary for classification, but it is the equivalent of factors in R.
  • Instead of groupby.count with lots of intermediate objects, we should use .value_counts with and without normalize=True.
  • Altair's default color scheme is colorblind-friendly, so there is no need to change it here as we have to when using ggplot.
  • Students have not learned how to use apply + lambda. The easiest approach here would be to map a dictionary to the series, but I don't think we teach that either. We could change the labels as in R, but see the point below.
  • R&Py: I think it is fragile to overwrite the data values in the legend and would not recommend anyone do this. What if the order of the data changes, either because new data is added or because we do some filtering? Then we might mistakenly label B as Malignant with no easy way to find out. It is much better to change the values of the individual data points using dictionary-like assignment, which is robust to data reordering and therefore more reproducible.
  • R&Py: "Based on our visualization, it seems like the prediction of an unobserved label might be possible." is a weird sentence.
  • Figs 62-71 are missing (and many more; need to go through the entire page and add figures).
  • Use method chaining to compute the distance, with proper indentation (see the sketch after this list). We haven't taught assign with a lambda df, so that needs to come first. np is not defined, and neither is HTML. This applies to both the two-predictor and the multi-predictor code chunks. Also use nsmallest instead of sorting first.
  • R&Py 5.5.3: I don't think it is correct to describe this as 4 steps. Steps 2 and 3 should be combined into "Find the K closest points", since we actually don't care about sorting the distances that are really far away.
  • Unintuitive usage of loc with : to select columns for cancer_train
  • Do we really need to mention weights='uniform' since that is the default? And we should not get into other types of weights here either.
  • Say something more about what happens after fitting so that students realize that knn_spec changes
  • unscaled_cancer can be assigned in one step.
  • make_column_transformer is not defined (imports in general seem not to be working as expected here); see the sketch after this list.
  • "Can be combined into fit_transform" requires more explanation (as does transform itself).
  • Explain why we are calling pd.DataFrame (some things in this cell seem unnecessary).
  • TODO Compare with how I introduced these concepts in BAIT 509.
  • R: bind_rows is never explained, and labels are set manually again.
  • pd.concat is never explained, and neither is query nor lambda in apply.
  • value_counts instead of groupby.count (in text and code)
  • Incorrect usage of loc[:, ...] again in make_pipeline.
  • TODO: go through the prediction plot code in detail (if we just include the screenshots from R for the other figures, do we really need the code for how to create one of these charts in this figure?)
  • Split multiple arguments over multiple lines so that they are easier to read (this goes for all chapters), e.g.
    knn_fit = make_pipeline(preprocessor, knn_spec).fit(
        X=unscaled_cancer.loc[:, ["Area", "Smoothness"]], y=unscaled_cancer["Class"]
    )
    
    should be:
    knn_fit = make_pipeline(preprocessor, knn_spec).fit(
        X=unscaled_cancer.loc[:, ["Area", "Smoothness"]],
        y=unscaled_cancer["Class"]
    )
    
    and
    pd.DataFrame({"Area": [500, 1500], "Smoothness": [0.075, 0.1]})
    
    should be:
    pd.DataFrame({
        "Area": [500, 1500],
        "Smoothness": [0.075, 0.1]
    })
    
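A minimal sketch of the chained distance computation referenced above, assuming a cancer data frame with Area and Smoothness columns and a purely illustrative new observation; nsmallest keeps the rows with the smallest distances, i.e. the closest points:

    import numpy as np

    # hypothetical new observation (values are illustrative only)
    new_obs = {"Area": 500, "Smoothness": 0.075}

    # add a distance column in one chained expression, then keep the 5 closest rows
    closest = (
        cancer
        .assign(
            dist_from_new=lambda df: np.sqrt(
                (df["Area"] - new_obs["Area"]) ** 2
                + (df["Smoothness"] - new_obs["Smoothness"]) ** 2
            )
        )
        .nsmallest(5, "dist_from_new")
    )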
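For the make_column_transformer and fit_transform bullets above, a short sketch of the missing import and of splitting fit_transform into its two halves, assuming an unscaled_cancer data frame with Area and Smoothness predictor columns (column names taken from the code example above); this is only an illustration, not the book's actual cell:

    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import StandardScaler

    # standardize only the two predictor columns; remaining columns are dropped by default
    preprocessor = make_column_transformer(
        (StandardScaler(), ["Area", "Smoothness"]),
    )

    # fit() learns the per-column means and standard deviations,
    # transform() applies that scaling; fit_transform() does both in one call
    preprocessor.fit(unscaled_cancer)
    scaled = preprocessor.transform(unscaled_cancer)
    # equivalent: scaled = preprocessor.fit_transform(unscaled_cancer)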
trevorcampbell changed the title from "Suggestions for Ch5 - Clustering 1" to "Suggestions for Ch5 - Classification 1" on Dec 15, 2022
trevorcampbell commented Dec 31, 2022

Instead of groupby.count with lots of intermediate objects, we should use .value_counts with and without normalize=True.

I'll just show both. Using groupby and count is a good repetition of general-purpose stuff that will help students reinforce earlier concepts.
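For instance, a small sketch showing both routes (assuming a cancer data frame with a Class column):

    # general-purpose route: group by class, then count
    cancer.groupby("Class").count()

    # value_counts gives the per-class counts directly,
    # and proportions when normalize=True
    cancer["Class"].value_counts()
    cancer["Class"].value_counts(normalize=True)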

@trevorcampbell

Students have not learned how to use apply + lambda. The easiest approach here would be to map a dictionary to the series, but I don't think we teach that either. We could change the labels as in R, but see the point below.

I just use replace, which is easier (and I think we should cover that in wrangling if we don't already).
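For example (assuming a cancer data frame whose Class column holds "M" and "B", as in the suggestion above):

    # relabel the values in the data frame itself, robust to row reordering
    cancer["Class"] = cancer["Class"].replace({"M": "Malignant", "B": "Benign"})
    # .map({...}) would also work, but it turns any unmapped value into NaN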

@trevorcampbell

I think it is fragile to overwrite the data values in the legend and would not recommend anyone do this

Agree. I just changed the values in the df. Will open an issue in the R book.

@trevorcampbell

Do we really need to mention weights='uniform' since that is the default? And we should not get into other types of weights here either.

We added this to the book because students ask this question essentially every semester. But it does need to be fixed a bit for the Python version.
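A possible way to show it briefly (parameter names are scikit-learn's; knn_spec is the variable name used earlier in this issue):

    from sklearn.neighbors import KNeighborsClassifier

    # weights="uniform" is the default: each of the K neighbours gets an equal vote;
    # weights="distance" would instead weight closer neighbours more heavily
    knn_spec = KNeighborsClassifier(n_neighbors=5, weights="uniform")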
