Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add minimum_interval setting to auto_date_histogram aggregation #41757

Closed
simianhacker opened this issue May 2, 2019 · 10 comments · Fixed by #42814
Closed

Add minimum_interval setting to auto_date_histogram aggregation #41757

simianhacker opened this issue May 2, 2019 · 10 comments · Fixed by #42814
Assignees
Labels

Comments

@simianhacker
Copy link
Member

simianhacker commented May 2, 2019

In TSVB, I introduced the concept of a "minimum auto interval" because metrics usually have a minimum resolution that works. For example, if a user installs Metricbeat and modifies collection interval to be 1 minute, any time they request data that returns buckets smaller than 1m they will have gaps in their data. Once that happens most pipeline aggregations stop working as well, especially the derivative pipeline aggregation which is very common with counters like network traffic. This scenario is what usually drives users to start requesting interpolation between the points.

The concept of >=1m interval data math was born in TSVB to address the issue above. Adding support for this was trivial because you just calculate the interval and if it's less the the minimum, the minimum is returned instead.

I would rather use auto_date_histogram for most of the metrics UI's I develop but without support for a minimum interval, it's a non-starter.

@dnhatn dnhatn added the :Analytics/Aggregations Aggregations label May 2, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo

@pcsanwald pcsanwald self-assigned this May 28, 2019
@pcsanwald
Copy link
Contributor

Thanks for opening this, I'll work on it this week. It's definitely something that would be beneficial for a lot of use cases.

@pcsanwald
Copy link
Contributor

@polyfractal @colings86 my assumption here is we want to support same class of intervals as Date Histogram Aggregation, fixed intervals as well as calendar. do you have a different opinion?

@polyfractal
Copy link
Contributor

Hmm that's an interesting question. What does the auto date histo do right now for intervals since the user only specifies number of buckets? Calendar?

I think at a minimum the min_interval setting should match whatever the auto date histo is doing internally right now. Supporting both would probably depend on if there is a way for the user to specify which type of bucket is used.

@pcsanwald
Copy link
Contributor

pcsanwald commented May 29, 2019

believe we use a fixed calculation for roundings.

@polyfractal
Copy link
Contributor

Ah, interesting. It probably makes sense to start with fixed intervals then, and save adding calendar support for a different enhancement.

It does make the API a bit tricky. Do we say min_fixed_interval? Or maybe interval_mode: fixed and then it's implied that buckets and min_interval are in that mode?

Another interesting bit is that the fixed time parsing used by date_histo and others only parses up to days in fixed time, whereas auto_date_histo can go up to months/years. The path of least resistance there is to probably add months/years to the fixed time parsing (using the same definitions that auto date histo uses)

@pcsanwald
Copy link
Contributor

hmmm. it feels a bit weird to have interval_mode: where fixed is the only option ("you can have any color you want, as long as you want black"), but, if we do add a different mode, then having the parameter named min_fixed_interval is bad because we'll want to rename it.

So my vote is that we make it clear in the docs that min_interval is calculated using fixed (which is how the agg works today anyways), and we can add a mode if we add calendar intervals.

WDYT?

@colings86
Copy link
Contributor

The auto date histogram builder uses only calendar intervals (see https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/search/aggregations/bucket/histogram/AutoDateHistogramAggregationBuilder.java#L72 where we create the roundings).

We can only use calendar intervals because merging lower roundings into higher ones would not work well with fixed intervals. This is because the length of a day and moreover the length of a month cannot be assumed regular so if we tried to merge hours into days or days into months we would end up with accumulated error which would make the error in the final result too high.

I agree that an interval_mode option doesn't make much sense while we only support one option so I think we should document that the intervals are calendar based only.

@colings86
Copy link
Contributor

I also think we should make the min_interval option only take the unit and not a multiple of the unit (so you could specfy min_interval: hours as opposed to min_interval: 5h. This is for two reasons:

  • It save a lot of complexity since we only need to care about min_interval in the aggregator and not in the internalAutoDateHistogram
  • It saves the user from having to know the specific multiples of the unit we support and therefore cannot pick a min_interval we don't support

@pcsanwald
Copy link
Contributor

makes sense @colings86, thanks for clarifying. I was wrong about the fixed calcs, it makes sense now, given it takes a timezone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants