
data bias issues due to deferring reporting of all nav timing metrics to the load event #50

Closed
bryanmcquade opened this issue Nov 7, 2016 · 11 comments

@bryanmcquade commented Nov 7, 2016

In Nav Timing Level 2, the processing model defers reporting of any nav timing data until the load event has fired.

This means that pages that reach earlier events, such as DOMContentLoaded, but are aborted before the load event fires, will fail to report those earlier events even though they were reached during the page load. This leads to bias in aggregate data. Consider a few examples:

Example 1:
Consider 2 pages:
page 1: typically reaches domcontentloaded quickly, but onload very late
page 2: typically reaches domcontentloaded at the same time as page 1, and onload immediately after DCL fires

Because we only get to see data if the page reaches onload, page 1 is more likely to lose DCL samples for slower page loads that get aborted in the window between DCL and onload firing. This means:
(a) we'll receive fewer DCL samples than actually occurred - bias
(b) the lost data will be biased towards slower page loads, which means the aggregate DCL measured for page 1 will be artificially lower than the true DCL

The end result is that even if the two pages have the same true DCL, the aggregated stats will likely suggest that page 2 is slower, because it loses fewer samples than page 1 and is therefore more likely to report DCL samples for slow page loads.

Example 2:

  • I have a page with a fast DCL and a slow time to onload.
  • I make a change to my code to improve my time to onload
  • As a result, fewer of my page loads are aborted between DCL and onload. In particular, users on slower connections are less likely to abort.
  • These users on slower connections will report an increased number of onload metrics as well as DCL metrics after my change, as they are less likely to abort.
  • Thus I'll see my aggregate onload metrics change, but I'll also likely see my aggregate DCL metrics get worse, because fewer DCL samples from users on the slowest connections are lost to aborts before onload, even though my change didn't affect DCL at all. This is counterintuitive, subtle, and undesirable.

Is there any opportunity to fix this issue and allow for reporting metrics as they happen, rather than once when the onload event is fired?
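
To make the failure mode concrete, here is a minimal sketch of an analytics snippet that tries to capture DCL as it happens (the "/beacon" endpoint is a hypothetical placeholder). Under the processing model as currently specified, the navigation entry isn't in the buffer yet at DCL time, so there is nothing to report, and the sample is lost if the load is aborted before onload:

```ts
// Sketch only: illustrates the problem described above.
// The "/beacon" collection endpoint is a hypothetical placeholder.
document.addEventListener('DOMContentLoaded', () => {
  const [nav] = performance.getEntriesByType('navigation') as PerformanceNavigationTiming[];
  if (!nav) {
    // With the current NT2 processing model, the entry is only added to the
    // performance entry buffer once the load event has fired, so at DCL time
    // there is nothing here. If the user aborts before onload, the DCL sample
    // is never reported.
    return;
  }
  navigator.sendBeacon('/beacon', JSON.stringify({ dcl: nav.domContentLoadedEventStart }));
});
```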

@igrigorik

I'd be curious to know how existing integrations deal with this today, because even with NT1 (which updates the same entry as the timestamps become available) you'd have the same issue:

  • Either you continue polling and sending partial records
  • Or, you wait until some abort signal and harvest all/as much data as you can

@bluesmoon @nicjansma @bmaurer @philipwalton how do you handle this today? My guess is it's the latter: wait until onunload, harvest the data, and beacon it then? Also, do you have a sense, from your telemetry, how common this case is?

@bluesmoon

boomerang waits for onload before beaconing, but we also beacon if beforeunload fires before onload, and in all cases, we also collect the time when the DOMContentLoaded event fires. We have certainly seen cases where DOMContentLoaded fires but onload doesn't, but I can't say that it is very common.
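
A rough sketch of that pattern, for reference (not boomerang's actual code; the "/beacon" endpoint is a hypothetical placeholder): beacon at onload, or at beforeunload if onload hasn't fired yet, with the DOMContentLoaded time collected independently:

```ts
// Rough sketch of the pattern described above; not boomerang's implementation.
// The "/beacon" endpoint is a hypothetical placeholder.
let dclTime: number | undefined;
let sent = false;

document.addEventListener('DOMContentLoaded', () => {
  dclTime = performance.now(); // DCL time collected without relying on Nav Timing
});

function beacon(trigger: string): void {
  if (sent) return; // only beacon once per page view
  sent = true;
  navigator.sendBeacon('/beacon', JSON.stringify({ trigger, dcl: dclTime }));
}

window.addEventListener('load', () => beacon('onload'));
window.addEventListener('beforeunload', () => beacon('beforeunload'));
```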

@bmaurer commented Nov 29, 2016

At Facebook we use a "capped average" for our metrics. This means that we take:

AVG(time > 60 seconds ? 60 seconds : time)

The idea here is that we penalize ourselves for extremely slow requests, but we don't allow them to skew the average too much, since these requests are usually broken in ways that are difficult to fix.

Because of this, we report metrics back to our servers at the 60 second mark even if the load event hasn't occurred. Thus, it is important for us to have access to this timing information before the load event has occurred. We'll probably end up using the NT1 API until this can be addressed.
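
A minimal sketch of that 60-second fallback, using the NT1 performance.timing object (the "/beacon" endpoint is a hypothetical placeholder, and the capping itself happens server-side in the AVG aggregation):

```ts
// Sketch of the 60-second reporting fallback described above; the "/beacon"
// endpoint is a hypothetical placeholder, and capping happens server-side.
const CAP_MS = 60_000;
let sent = false;

function report(): void {
  if (sent) return;
  sent = true;
  const t = performance.timing; // NT1: attributes fill in as they become available
  navigator.sendBeacon('/beacon', JSON.stringify({
    dcl: t.domContentLoadedEventStart ? t.domContentLoadedEventStart - t.navigationStart : null,
    load: t.loadEventStart ? t.loadEventStart - t.navigationStart : null,
  }));
}

// Report at onload, or at the 60-second mark if onload hasn't fired yet.
setTimeout(report, CAP_MS);
window.addEventListener('load', report);
```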

I can also attest that, in our experience, situations like the one @bryanmcquade mentions have a serious impact on metrics reporting. Anything that makes it hard for slow requests to report back (e.g. large reporting payloads, data not being available until onload) skews metrics quite a bit.

@igrigorik

Ben, thanks for the feedback and context.

Because of this, we report metrics back to our servers at the 60 second mark even if the load event hasn't occurred. Thus, it is important for us to have access to this timing information before the load event has occurred. We'll probably end up using the NT1 API until this can be addressed.

How do you handle the abort case today? Based on above, if the page did not hit onload or 60s timeout, it would be omitted in your telemetry? Or, is there some other code path for the early abort case?

@bmaurer commented Dec 2, 2016

We started using sendBeacon very early in the page to say "this trace is alive". So if we switched to NT2 we'd start seeing an increase in traces that are lost.
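
Presumably something along these lines (a sketch only; the traceId scheme and "/alive" endpoint are hypothetical, not Facebook's actual code):

```ts
// Sketch of an early "this trace is alive" ping; the traceId scheme and the
// "/alive" endpoint are hypothetical, not Facebook's actual implementation.
const traceId = crypto.randomUUID(); // per-pageview identifier
navigator.sendBeacon('/alive', JSON.stringify({ traceId }));

// The full timing payload is later reported against the same traceId; any
// traceId that only ever has an "alive" ping can be counted as a lost trace.
```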

@igrigorik

As in, you periodically sendBeacon the NT1 object as the page is loading and then reconcile that on your backend? ... Not clear on how sendBeacon + an early ping solves this today. 😕

@igrigorik added this to the Level 2 milestone Dec 7, 2016
@igrigorik

We discussed this on last week's call and @toddreifsteck pointed out that Edge runs the "add the entry to the performance entry buffer" step as soon as the entry is created and updates the relevant attributes on the same record as they become available. This behavior is counter to what the current processing section specifies (add the entry once all attributes are finalized), but it also provides a reasonable solution to the issue we're discussing here...

Adding the entry as soon as it is created provides similar semantics to NT1, which makes the upgrade path very simple. Applications can query the timeline to retrieve the navigation record and log it / beacon it at any point in the pageload cycle. If the navigation is aborted, for whatever reason, analytics is still able to collect nav timing data.
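
Under that behavior, a consumer could read and beacon whatever is available at any point, along these lines (a sketch; the "/beacon" endpoint is a hypothetical placeholder):

```ts
// Sketch: with the entry added to the buffer as soon as it is created, the
// partially-populated record can be read and beaconed at any point in the
// page load. The "/beacon" endpoint is a hypothetical placeholder, and
// deduplication of beacons is omitted for brevity.
function beaconNavTiming(): void {
  const [nav] = performance.getEntriesByType('navigation') as PerformanceNavigationTiming[];
  if (!nav) return;
  navigator.sendBeacon('/beacon', JSON.stringify({
    dcl: nav.domContentLoadedEventStart, // 0 until DCL has fired
    load: nav.loadEventStart,            // 0 until the load event has fired
    responseEnd: nav.responseEnd,
  }));
}

// Collect whatever is available even if the navigation is abandoned before onload.
window.addEventListener('pagehide', beaconNavTiming);
window.addEventListener('load', beaconNavTiming);
```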


Also, a few related thoughts and considerations:

(1) Should we apply similar logic to ResourceTiming, and run the "add step" as soon as the fetch is initiated, instead of waiting until it's done? The upside is consistency between NT and RT; this would address a related request for having a signal that a resource fetch has been initiated.

(2) How does all of the above affect PerformanceObserver? Should we similarly move the "queue step" to right after the record is created?

  • For Nav Timing this doesn't work that well, since the record would be emitted early in the page load and most analytics scripts probably won't yet be registered. I think this suggests that we should hold the record until all the attributes are finalized.
  • For Resource Timing, similar constraints apply for any/all fetches initiated as part of the initial page load. You'd have to register very early to get those notifications. Further, if you get the early record, you now have a handle to an incomplete entry, which you'd then need to... poll?

I think my proposal here would be:

  • Move the "add the X entry to performance buffer" step, for both Nav and Resource Timing, to immediately after the entry is created.
  • Keep the "queue the X entry to observer" step, for both Nav and Resource Timing, at the end; observers will only see these records once the page load / resource load has finished.

WDYT?
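
To illustrate the split in observable behavior that this proposal implies (a sketch of what a consumer would see, not spec text):

```ts
// Sketch of the proposed split:
// - the performance entry buffer exposes the (possibly incomplete) entry as
//   soon as it is created, so it can be polled at any point;
// - a PerformanceObserver callback only fires once the entry is finalized,
//   i.e. after the page load / resource fetch has finished.
const [nav] = performance.getEntriesByType('navigation');
console.log('entry available (possibly incomplete):', nav);

new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log('finalized entry delivered to observer:', entry.entryType, entry.name);
  }
}).observe({ entryTypes: ['navigation', 'resource'] });
```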

@spanicker

+1
This seems like a reasonable way to overcome this limitation.

@spanicker

sakamoto / irori@ has raised the following concerns on the implementation review:

  1. The values of some fields in NT will now change depending on when they are queried, and there is no clear way to indicate that a value is "not available yet" vs. "not applicable" (e.g. redirectStart when there's no redirect) vs. zero.

Proposal: can we use "-1" (or some pre-determined non-conflicting value) to indicate "not yet available" (as this is the new case we are adding with this change)?

  2. Some values aren't populated in a timely fashion, for reasons different from the one introduced by this change, e.g. domainLookupStart will not be populated until the final response is available, although the domain lookup must have finished long before that.
    Is that okay?

For discussion details see: https://codereview.chromium.org/2647643004/

@irori @igrigorik @bryanmcquade

@igrigorik

The values of some fields in NT will now change depending on when they are queried, and there is no clear way to indicate that a value is "not available yet" vs. "not applicable" (e.g. redirectStart when there's no redirect) vs. zero.

This is not a 'new' problem. NT1 consumers already have to deal with this, and AFAIK it has not been a problem in practice. For redirectStart in particular, note that we do set it to fetchStart in our processing algorithm (i.e., even if there is no redirect the value is non-zero); we do reset it to 0 across redirects that don't pass the TAO algorithm, but we also increment redirectCount in the process, so once again you can distinguish between "not available yet" and "not applicable". Similar logic applies to other fields.
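
For example, a consumer could apply that disambiguation roughly like this (a sketch, assuming the processing behavior described above):

```ts
// Sketch: distinguishing "not applicable" from "timing hidden" for
// redirectStart, using redirectCount as described above.
const [nav] = performance.getEntriesByType('navigation') as PerformanceNavigationTiming[];
if (nav) {
  if (nav.redirectStart > 0) {
    console.log('redirect timing available:', nav.redirectStart);
  } else if (nav.redirectCount > 0) {
    console.log('redirects occurred, but their timing failed the TAO check');
  } else {
    console.log('no redirects for this navigation');
  }
}
```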

Proposal: can we use "-1" (or some pre-determined non-conflicting value) to indicate "not yet available" (as this is the new case we are adding with this change)?

That would make it incompatible with the definition of DOMHighResTimeStamp.

Some values aren't populated in a timely fashion, for reasons different from the one introduced by this change, e.g. domainLookupStart will not be populated until the final response is available, although the domain lookup must have finished long before that.
Is that okay?

We should provide the same behavior as we did with NT1.

@irori commented Mar 10, 2017

Okay, sounds like it's not going to be a problem in practice. Thanks!
