vizualisation #25

Closed
nd265 opened this issue Jul 11, 2022 · 5 comments · Fixed by #77

Comments

@nd265
Contributor

nd265 commented Jul 11, 2022

  1. "may consult the Dates and Times chapter of R for Data Science [@wickham2016r]":
    what should we use as a Python substitute for this reference?
  2. @joelostblom While creating the plot for filtered dates, I am filtering the dates using .loc[] operator and then plotting using the following code:
    co2_df = pd.read_csv("mauna_loa_data.csv", parse_dates=['date_measured'])
    co2_dates = co2_df.loc[(co2_df.date_measured >= '1990-01-01') & (co2_df.date_measured < '1993-01-01')]
    co2_dates
    co2_line = alt.Chart(co2_dates).mark_line(color='black').encode(
        x=alt.X("date_measured:T", title="Year"),
        y=alt.Y("ppm", scale=alt.Scale(zero=False), title="Atmospheric CO2 (ppm)")
    ).configure_axis(
        titleFontSize=12
    )

The resulting plot is shown below. The x-axis labels are not shown year by year, but as a mixture of years and months:

image

Desired output:
image

@joelostblom
Contributor

  1. I think this is a good reference and we mentioned using this book elsewhere too. https://wesmckinney.com/book/time-series.html The pandas docs are also pretty good tbh.

  2. You could reduce the number of ticks via alt.Axis(tickCount=4). However, I would personally leave the default and title the axis "Measurement Date" rather than "Year", since the data points are at a finer granularity than years and a "Year" title could be misinterpreted. I would also do a few things slightly differently syntax-wise than in your example and end up with something like this:

    co2_df = pd.read_csv("~/Downloads/mauna_loa_data.csv", parse_dates=['date_measured'])
    
    # These two are equivalent, use the one that fits best with the context (ie do we teach query or not?)
    co2_dates = co2_df[co2_df['date_measured'].between('1990-01-01', '1993-01-01')]
    co2_dates = co2_df.query("'1990-01-01' <= date_measured <= '1993-01-01'")
    
    co2_line = alt.Chart(co2_dates).mark_line(color='black').encode(
        x=alt.X("date_measured", title="Measurement Date"),
        # x=alt.X("date_measured", title="Year", axis=alt.Axis(format='%Y-%b')), # Shows month and year, but a bit busy
        y=alt.Y("ppm", scale=alt.Scale(zero=False), title="Atmospheric CO2 (ppm)")
    ).configure_axis(
        titleFontSize=12
    )
    co2_line

    image

    Also note that the R example sets the xlim rather than filtering the data, which is why your y-axis is different. If you want to set the x limits instead, you could do the following. In most cases, though, I think it is more appropriate to have the y limits rescaled to the data shown in the chart (either manually or by filtering the data as above) than to have them scaled by data not shown in the plot. The exception is when a direct comparison on the same scale is required, but then less is gained from zooming in.

    co2_df = pd.read_csv("~/Downloads/mauna_loa_data.csv", parse_dates=['date_measured'])
    
    co2_line = alt.Chart(co2_df).mark_line(color='black', clip=True).encode(
        x=alt.X("date_measured", title="Measurement Date", axis=alt.Axis(tickCount=4), scale=alt.Scale(domain=['1990', '1993'])),
        # x=alt.X("date_measured", title="Year", axis=alt.Axis(format='%Y-%b')), # Shows month and year, but a bit busy
        y=alt.Y("ppm", scale=alt.Scale(zero=False), title="Atmospheric CO2 (ppm)")
    ).configure_axis(
        titleFontSize=12
    )
    co2_line

    image
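To make the filtering options in this comment concrete, here is a minimal sketch comparing them on a small made-up frame. The monthly dates and ppm values are hypothetical stand-ins for mauna_loa_data.csv, not the real measurements. It illustrates that `between` and the chained `query` comparison select the same rows with both endpoints included, while the `.loc[]` filter from the original question uses `<` on the upper bound and so drops the 1993-01-01 row.

```python
import pandas as pd

# Hypothetical stand-in for mauna_loa_data.csv: one row per month.
dates = pd.date_range("1989-11-01", "1993-03-01", freq="MS")
co2_df = pd.DataFrame({
    "date_measured": dates,
    "ppm": [350.0 + 0.1 * i for i in range(len(dates))],  # made-up values
})

# Inclusive on both ends (the two forms suggested in the comment above).
with_between = co2_df[co2_df["date_measured"].between("1990-01-01", "1993-01-01")]
with_query = co2_df.query("'1990-01-01' <= date_measured <= '1993-01-01'")

# Half-open interval, as in the original question (upper bound excluded).
half_open = co2_df.loc[
    (co2_df["date_measured"] >= "1990-01-01")
    & (co2_df["date_measured"] < "1993-01-01")
]

same_rows = with_between.equals(with_query)
extra_rows = len(with_between) - len(half_open)
```

If the plot should stop before 1993, the half-open filter is the one to keep; otherwise either inclusive form should work.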

@nd265
Contributor Author

nd265 commented Jul 12, 2022

Thanks Joel, this helped

@nd265
Contributor Author

nd265 commented Jul 12, 2022

  1. @joelostblom For geom_point() being used in ggplot, should I use mark_point(filled=True) or mark_circle()? In the tutorial, mark_point() is being used. Or should I just write some explanation that we can use mark_circle()as well?

@joelostblom
Contributor

I would use mark_point without filled=True as the default replacement for geom_point. I actually think the default hollow circle is slightly more useful in many cases, and either library would feel a bit clumsy if you needed to tweak every plot to look like the default aesthetics of the other library.

Somewhere we should explicitly teach how to make filled points. I would first teach it with mark_point(filled=True), for continuity with what we have done so far, then mention that mark_circle is a shortcut and use that from then on wherever we want filled points. The only difference in behavior is that the shape is fixed as circular for mark_circle, so we cannot use the shape encoding to change the shape per group as we can with mark_point(filled=True).

Also the comments here are just my thoughts, so make sure to bring it up in the meeting also just so that everyone is onboard.

@nd265
Contributor Author

nd265 commented Jul 12, 2022

Yes, one of the plots uses the shape encoding, so only mark_point can be used there.
