Suggestions for Ch5 - Classification 1 #41

Closed
21 of 23 tasks
joelostblom opened this issue Sep 27, 2022 · 4 comments · Fixed by #78

Comments

joelostblom commented Sep 27, 2022

These are all suggestions; some might read as harshly written, but that is just me being succinct and writing quickly.

  • Should we cover pandas categoricals? Not necessary for classification, but it is the equivalent of factors in R.
  • Instead of groupby.count with lots of intermediate objects, we should use .value_counts with and without normalize=True.
  • Altair's default color scheme is colorblind-friendly, so there is no need to change it here as we have to when using ggplot.
  • Students have not learned how to use apply + lambda. The easiest approach here would be to map a dictionary to the series, but I don't think we teach that either. We could change the labels as in R, but see the point below.
  • R&Py: I think it is fragile to overwrite the data values in the legend and would not recommend anyone do this. What if the order of the data changes, either because new data is added or because we do some filtering? Then we might mistakenly label B as Malignant with no easy way to find out. It is much better to change the values of the individual data points using dictionary-like assignment, which is robust to data reordering and therefore more reproducible.
  • R&Py: "Based on our visualization, it seems like the prediction of an unobserved label might be possible." is a weird sentence.
  • Figs 62-71 are missing (and many more; need to go through the entire page and add figures).
  • Use method chaining to compute the distance, with proper indentation (see the sketch after this list). We haven't taught assign with a lambda df, so that needs to come first. np is not defined, and neither is HTML. This applies to both the two-predictor and the multi-predictor code chunks. Also use nsmallest instead of sorting first.
  • R&Py 5.5.3: I don't think it is correct to describe this as 4 steps. Steps 2 and 3 should be combined into "Find the K closest points", since we actually don't care about sorting the distances that are really far away.
  • Unintuitive usage of loc with : to select columns for cancer_train
  • Do we really need to mention weights='uniform' since that is the default? And we should not get into other types of weights here either.
  • Say something more about what happens after fitting so that students realize that knn_spec changes
  • unscaled_cancer can be assigned in one step.
  • make_column_transformer is not defined (imports in general seem not to be working as expected here); see the sketch after this list.
  • "Can be combined into fit_transform" requires more explanation (as does transform itself).
  • Explain why we are calling pd.DataFrame (some things in this cell seem unnecessary).
  • TODO Compare with how I introduced these concepts in BAIT 509.
  • R: bind_rows is never explained, and labels are set manually again.
  • pd.concat is never explained, and neither is query nor lambda in apply.
  • value_counts instead of groupby.count (in text and code)
  • Incorrect usage of loc[:, ...] again in make_pipeline.
  • TODO: go through the prediction plot code in detail (if we just include the screenshots from R for the other figures, do we really need the code for how to create one of these charts in this figure?)
  • Split multiple arguments over multiple lines so that they are easier to read (this goes for all chapters), e.g.
    knn_fit = make_pipeline(preprocessor, knn_spec).fit(
        X=unscaled_cancer.loc[:, ["Area", "Smoothness"]], y=unscaled_cancer["Class"]
    )
    
    should be:
    knn_fit = make_pipeline(preprocessor, knn_spec).fit(
        X=unscaled_cancer.loc[:, ["Area", "Smoothness"]],
        y=unscaled_cancer["Class"]
    )
    
    and
    pd.DataFrame({"Area": [500, 1500], "Smoothness": [0.075, 0.1]})
    
    should be:
    pd.DataFrame({
        "Area": [500, 1500],
        "Smoothness": [0.075, 0.1]
    })
    
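A minimal sketch of the chained distance computation referenced above, assuming a cancer data frame with Area and Smoothness columns and a purely illustrative new observation; nsmallest keeps the rows with the smallest distances, i.e. the closest points:

    import numpy as np

    # hypothetical new observation (values are illustrative only)
    new_obs = {"Area": 500, "Smoothness": 0.075}

    # add a distance column in one chained expression, then keep the 5 closest rows
    closest = (
        cancer
        .assign(
            dist_from_new=lambda df: np.sqrt(
                (df["Area"] - new_obs["Area"]) ** 2
                + (df["Smoothness"] - new_obs["Smoothness"]) ** 2
            )
        )
        .nsmallest(5, "dist_from_new")
    )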
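For the make_column_transformer and fit_transform bullets above, a short sketch of the missing import and of splitting fit_transform into its two halves, assuming an unscaled_cancer data frame with Area and Smoothness predictor columns (column names taken from the code example above); this is only an illustration, not the book's actual cell:

    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import StandardScaler

    # standardize only the two predictor columns; remaining columns are dropped by default
    preprocessor = make_column_transformer(
        (StandardScaler(), ["Area", "Smoothness"]),
    )

    # fit() learns the per-column means and standard deviations,
    # transform() applies that scaling; fit_transform() does both in one call
    preprocessor.fit(unscaled_cancer)
    scaled = preprocessor.transform(unscaled_cancer)
    # equivalent: scaled = preprocessor.fit_transform(unscaled_cancer)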
trevorcampbell changed the title from "Suggestions for Ch5 - Clustering 1" to "Suggestions for Ch5 - Classification 1" on Dec 15, 2022
trevorcampbell commented Dec 31, 2022

Instead of groupby.count with lots of intermediate objects, we should use .value_counts with and without normalize=True.

I'll just show both. Using groupby and count is a good repetition of general-purpose stuff that will help students reinforce earlier concepts.
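For instance, a small sketch showing both routes (assuming a cancer data frame with a Class column):

    # general-purpose route: group by class, then count
    cancer.groupby("Class").count()

    # value_counts gives the per-class counts directly,
    # and proportions when normalize=True
    cancer["Class"].value_counts()
    cancer["Class"].value_counts(normalize=True)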

@trevorcampbell

Students have not learned how to use apply + lambda. The easiest approach here would be to map a dictionary to the series, but I don't think we teach that either. We could change the labels as in R, but see the point below.

I just use replace, which is easier (and I think we should cover that in wrangling if we don't already).
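For example (assuming a cancer data frame whose Class column holds "M" and "B", as in the suggestion above):

    # relabel the values in the data frame itself, robust to row reordering
    cancer["Class"] = cancer["Class"].replace({"M": "Malignant", "B": "Benign"})
    # .map({...}) would also work, but it turns any unmapped value into NaN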

@trevorcampbell

I think it is fragile to overwrite the data values in the legend and would not recommend anyone do this

Agree. I just changed the values in the df. Will open an issue in the R book.

@trevorcampbell

Do we really need to mention weights='uniform' since that is the default? And we should not get into other types of weights here either.

We added this to the book because students ask this question essentially every semester. But it does need to be fixed a bit for the Python version.
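A possible way to show it briefly (parameter names are scikit-learn's; knn_spec is the variable name used earlier in this issue):

    from sklearn.neighbors import KNeighborsClassifier

    # weights="uniform" is the default: each of the K neighbours gets an equal vote;
    # weights="distance" would instead weight closer neighbours more heavily
    knn_spec = KNeighborsClassifier(n_neighbors=5, weights="uniform")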
