data bias issues due to deferring reporting of all nav timing metrics to the load event #50
Comments
I'd be curious to know how existing integrations deal with this today, because even with NT1 (which updates the same entry as the timestamps become available) you'd have the same issue.
@bluesmoon @nicjansma @bmaurer @philipwalton how do you handle this today? My guess is it's the latter: wait until the load event fires?
boomerang waits for onload.
At Facebook we use a "capped average" for our metrics. This means that we take AVG(time > 60 seconds ? 60 seconds : time). The idea here is that we penalize ourselves for extremely slow requests, but we don't allow them to skew the average too much, since these requests are usually broken in ways that are difficult to fix. Because of this, we report metrics back to our servers at the 60-second mark even if the load event hasn't occurred. Thus, it is important for us to have access to this timing information before the load event has occurred. We'll probably end up using the NT1 API until this can be addressed.

I can also attest that, from our experience, situations like the one @bryanmcquade mentions have a serious impact on metrics reporting. Anything that makes it hard for slow requests to report back (e.g. large reporting payloads, data not being available until onload) skews metrics quite a bit.
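For concreteness, a minimal sketch of the capped-average approach described above (not Facebook's actual pipeline): the 60-second cap comes from the comment, while the metric name, the /beacon endpoint, and the use of NT1's performance.timing are illustrative assumptions.

```ts
const CAP_MS = 60_000;

// Server-side aggregation: cap each sample before averaging, so broken,
// extremely slow loads cannot dominate the mean.
function cappedAverage(samplesMs: number[], capMs: number = CAP_MS): number {
  const capped = samplesMs.map((t) => Math.min(t, capMs));
  return capped.reduce((sum, t) => sum + t, 0) / capped.length;
}

// Client-side: report at the cap even if the load event never fires. NT1's
// performance.timing fields are filled in as the milestones are reached.
setTimeout(() => {
  const t = performance.timing;
  const dcl = t.domContentLoadedEventEnd
    ? t.domContentLoadedEventEnd - t.navigationStart
    : CAP_MS; // DCL not reached yet: count the sample at the cap
  navigator.sendBeacon(
    "/beacon",
    JSON.stringify({ metric: "dcl", value: Math.min(dcl, CAP_MS) })
  );
}, CAP_MS);
```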
Ben, thanks for the feedback and context.
How do you handle the abort case today? Based on the above, if the page did not hit onload or the 60-second timeout, would it be omitted from your telemetry? Or is there some other code path for the early abort case?
We started using sendBeacon very early in the page to say "this trace is alive". So if we switched to NT2 we'd start seeing an increase in traces that are lost.
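Roughly, the early "alive" ping described above might look like the following sketch; the endpoints, trace-id scheme, and payload shape are hypothetical, and the backend would reconcile the later timing beacon against the earlier ping by trace id.

```ts
// Hypothetical trace id for this navigation.
const traceId = Math.random().toString(36).slice(2);

// Sent as early as possible in the page: tells the backend this navigation
// exists, so aborted loads are not silently lost.
navigator.sendBeacon("/trace/alive", JSON.stringify({ traceId, ts: Date.now() }));

// Later (onload, a timeout, or pagehide) the full NT1 timings are sent and
// joined server-side with the earlier ping.
addEventListener("pagehide", () => {
  navigator.sendBeacon(
    "/trace/timing",
    JSON.stringify({ traceId, timing: performance.timing.toJSON() })
  );
});
```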
As in, you periodically sendBeacon the NT1 object as the page is loading and then reconcile that on your backend? ... I'm not clear on how sendBeacon + an early ping solves this today. 😕
We discussed this on last week's call, and @toddreifsteck pointed out that Edge runs the "add the entry to the performance entry buffer" step as soon as the entry is created and updates the relevant attributes on the same record as they become available. This behavior is counter to what the current processing section specifies (add the entry once all attributes are finalized), but it also provides a reasonable solution to the issue we're discussing here. Adding the entry as soon as it is created provides similar semantics to NT1, which makes the upgrade path very simple: applications can query the timeline to retrieve the navigation record and log it / beacon it at any point in the page load cycle, and if the navigation is aborted, for whatever reason, analytics is still able to collect nav timing data (see the sketch below).

A few related thoughts and considerations:

(1) Should we apply similar logic to Resource Timing and run the "add" step as soon as the fetch is initiated, instead of waiting until it's done? The upside is consistency between NT and RT, and it would address a related request for a signal that a resource fetch has been initiated.

(2) How does all of the above affect PerformanceObserver? Should we similarly move the "queue" step to right after the record is created?
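As a sketch of what that enables (assuming the navigation entry is exposed in the timeline as soon as it is created, as proposed, and that not-yet-recorded attributes keep their initial value of 0), an application could beacon whatever is populated so far at any point in the page load; the /nav endpoint and the choice of events are made up for illustration.

```ts
// Sketch: read the navigation entry before onload and beacon whatever is
// populated so far. Assumes the entry is added to the timeline at creation.
function beaconNavTiming(): void {
  const [nav] = performance.getEntriesByType(
    "navigation"
  ) as PerformanceNavigationTiming[];
  if (!nav) return; // entry not exposed yet, or API unsupported

  // Attributes for milestones that have not happened yet (e.g. loadEventEnd
  // before onload) would still hold their initial value of 0.
  navigator.sendBeacon("/nav", JSON.stringify(nav.toJSON()));
}

// Collect at DOMContentLoaded, and again if the page is torn down early.
addEventListener("DOMContentLoaded", beaconNavTiming);
addEventListener("pagehide", beaconNavTiming);
```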
I think my proposal here would be:
WDYT?
+1
sakamoto / irori@ has raised the following concerns in the implementation review:
Proposal: can we use "-1" (or some pre-determined non-conflicting value) to indicate "not yet available" (as this is the new case we are adding with this change)?
For discussion details see: https://codereview.chromium.org/2647643004/
This is not a 'new' problem: NT1 consumers already have to deal with this, and AFAIK it has not been a problem in practice. For redirectStart in particular, note that we do set it to 0 when there is no redirect. Using "-1" would make it incompatible with the existing definition of these attributes; we should provide the same behavior as we did with NT1.
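From a consumer's point of view, the NT1-compatible contract would look roughly like this sketch: attributes that have not been recorded yet keep their initial value of 0, so no "-1" sentinel is needed (illustrative, not spec text).

```ts
const [nav] = performance.getEntriesByType(
  "navigation"
) as PerformanceNavigationTiming[];

if (nav && nav.domContentLoadedEventEnd > 0) {
  // NT2 timestamps are DOMHighResTimeStamps relative to the time origin.
  console.log("DCL (ms):", nav.domContentLoadedEventEnd);
} else {
  // 0 means DCL has not fired yet (or the attribute does not apply),
  // which is how NT1 consumers already handle unavailable values.
}
```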
Okay, sounds like it's not going to be a problem in practice. Thanks!
In Nav Timing Level 2, the processing model defers reporting of any nav timing data until the load event has fired.
This means that a page that reaches earlier events, such as DOMContentLoaded, but is aborted before the load event fires will never report those earlier events, even though they were reached during the page load. This leads to bias in aggregate data. Consider a few examples:
Example 1:
Consider 2 pages:
page 1: typically reaches domcontentloaded quickly, but onload very late
page 2: typically reaches domcontentloaded at the same time as page 1, and onload immediately after DCL fires
Because we only get to see data if the page reaches onload, page 1 is more likely to lose DCL samples for slower page loads that get aborted in the period between DCL and onload firing. This means
(a) we'll receive fewer DCL stats than actually happened - bias
(b) the lost data will be biased towards slower page loads, which means the aggregate DCL measured for this page will be artificially lower than the true DCL
The end result would be that even if the two pages have the same true DCL, the aggregated stats will likely suggest that page 2 is slower, because it loses fewer samples than page 1 and is therefore more likely to report DCL samples for slow page loads.
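To make the bias concrete, here is a toy calculation with made-up numbers; the DCL samples and the assumption that loads slower than 3000 ms abort before onload are purely illustrative.

```ts
// Both pages have the same true DCL distribution (hypothetical samples, ms).
const trueDclSamplesMs = [1000, 1200, 1500, 4000, 6000];
const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Page 2: onload fires right after DCL, so every sample is reported.
console.log("page 2 reported DCL:", avg(trueDclSamplesMs)); // 2740 ms

// Page 1: suppose loads slower than 3000 ms abort between DCL and onload,
// so their DCL samples are never reported even though DCL was reached.
const page1Reported = trueDclSamplesMs.filter((t) => t <= 3000);
console.log("page 1 reported DCL:", avg(page1Reported)); // ~1233 ms, looks faster
```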
Example 2:
Is there any opportunity to fix this issue and allow metrics to be reported as they happen, rather than only once the onload event has fired?