Commit
NEWS-only: community consultation link added to news item
mattdowle committed Feb 6, 2021
1 parent 6924483 commit 97c96b2
Showing 1 changed file with 6 additions and 4 deletions.
10 changes: 6 additions & 4 deletions NEWS.md
@@ -6,13 +6,15 @@

## POTENTIALLY BREAKING CHANGES

-1. In v1.13.0 (July 2020) native parsing of datetime was added to `fread` by Michael Chirico which dramatically improved reading datetime. Before then datetime was read as character by default which was slow. Since v1.13.0, UTC-marked datetime (e.g. 2020-07-24T10:11:12.134Z where the final `Z` is present) has been read automatically as POSIXct and quickly. We provided the migration option `datatable.old.fread.datetime.character` to revert to the previous slow character behavior. We also added the `tz=` argument to control unmarked datetime; i.e. where the `Z` (or equivalent UTC postfix) is missing in the data. The default `tz=""` reads unmarked datetime as character as before, slowly. We gave you the ability to set `tz='UTC'` to turn on the new behavior and read unmarked datetime as UTC, quickly. R sessions that are running in UTC by setting the TZ environment variable, as is good practice and common in production, have also been reading unmarked datetime as UTC since v1.13.0, much faster. Note 1 of v1.13.0 (below in this file) ended "In addition to convenience, `fread` is now significantly faster in the presence of dates, UTC-marked datetimes, and unmarked datetime when tz="UTC" is provided.".
+1. In v1.13.0 (July 2020) native parsing of datetime was added to `fread` by Michael Chirico which dramatically improved performance. Before then datetime was read as character by default which was slow. Since v1.13.0, UTC-marked datetime (e.g. 2020-07-24T10:11:12.134Z where the final `Z` is present) has been read automatically as POSIXct and quickly. We provided the migration option `datatable.old.fread.datetime.character` to revert to the previous slow character behavior. We also added the `tz=` argument to control unmarked datetime; i.e. where the `Z` (or equivalent UTC postfix) is missing in the data. The default `tz=""` reads unmarked datetime as character as before, slowly. We gave you the ability to set `tz='UTC'` to turn on the new behavior and read unmarked datetime as UTC, quickly. R sessions that are running in UTC by setting the TZ environment variable, as is good practice and common in production, have also been reading unmarked datetime as UTC since v1.13.0, much faster. Note 1 of v1.13.0 (below in this file) ended "In addition to convenience, `fread` is now significantly faster in the presence of dates, UTC-marked datetimes, and unmarked datetime when tz="UTC" is provided.".

At the `rstudio::global(2021)` conference, Neal Richardson, Director of Engineering at Ursa Labs, compared Arrow csv performance to data.table csv performance, [Bigger Data With Ease Using Apache Arrow](https://twitter.com/enpiar/status/1357729619420475392). He opened by comparing to data.table as his main point. Arrow was presented as 3 times faster than data.table. He talked at length about this result. This result is now being quoted in the community. However, no reproducible code was provided and we were not contacted in advance of the high profile talk in case we had any comments. Neal briefly mentioned New York Taxi data. That is a dataset known to us as containing unmarked datetime. We don't know if he set `tz='UTC'` or not. We could have suggested that if he had asked. We do know that setting `tz='UTC'` does speed up reading the New York Taxi dataset significantly. We don't know if the datetimes in the New York Taxi dataset really are in UTC, or local time, but we know it is common practice to read them as if they are UTC regardless.

We are open source developers just trying to do our best.

-As an angry reaction to Neal's presentation, the default change from `tz=""` to `tz=UTC` is accelerated. If you have been using `tz=` explicitly then there should be no change. The change to read UTC-marked datetime as POSIXct rather than character already happened in v1.13.0. The change now is that unmarked datetimes are now read as UTC too by default without needing to set `tz="UTC"`. None of the 1,004 CRAN packages directly using data.table are affected. As before, the migration option `datatable.old.fread.datetime.character` can still be set to TRUE to revert to the old character behaviour. This migration option is temporary and will be removed in the near future.

+As an angry reaction to Neal's presentation, the default change from `tz=""` to `tz=UTC` is accelerated. If you have been using `tz=` explicitly then there should be no change. The change to read UTC-marked datetime as POSIXct rather than character already happened in v1.13.0. The change now is that unmarked datetimes are now read as UTC too by default without needing to set `tz="UTC"`. None of the 1,004 CRAN packages directly using data.table are affected. As before, the migration option `datatable.old.fread.datetime.character` can still be set to TRUE to revert to the old character behavior. This migration option is temporary and will be removed in the near future.

+The community was consulted in [this tweet](https://twitter.com/MattDowle/status/1358011599336931328) before release.
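
Editor's illustration, not part of the commit: a minimal sketch of the `tz=` behavior described in item 1. The sample CSV and the variable names (`csv`, `dt_chr`, `dt_utc`, `dt_old`) are made up for illustration. Only `fread`, its `text=` and `tz=` arguments, and the option `datatable.old.fread.datetime.character` come from data.table itself, and the option may no longer exist in later releases.

```r
library(data.table)

# Hypothetical two-row CSV: 'unmarked' has no trailing Z, 'marked' does.
csv <- "id,unmarked,marked
1,2020-07-24 10:11:12,2020-07-24T10:11:12Z
2,2020-07-25 11:12:13,2020-07-25T11:12:13Z"

# Explicit tz="": unmarked datetime is left as character, unless the
# R session itself runs in UTC (TZ environment variable), as item 1 notes.
dt_chr <- fread(text = csv, tz = "")

# Explicit tz="UTC": unmarked datetime is parsed as POSIXct in UTC.
dt_utc <- fread(text = csv, tz = "UTC")

# Compare the first class of each column across the two reads.
sapply(dt_chr, function(col) class(col)[1L])
sapply(dt_utc, function(col) class(col)[1L])

# Migration option named in item 1: revert to reading datetime as character.
options(datatable.old.fread.datetime.character = TRUE)
dt_old <- fread(text = csv)
options(datatable.old.fread.datetime.character = NULL)  # unset again
```

Passing `tz=` explicitly, as in both reads above, should give the same result before and after the default change, which is why code that already sets `tz=` is described as unaffected.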

## BUG FIXES
