- Guido Urdaneta (Google)
The WebRTC Encoded Transform API allows applications to access encoded media flowing through a WebRTC RTCPeerConnection. Video data is exposed as RTCEncodedVideoFrames and audio data is exposed as RTCEncodedAudioFrames. Both types of frames have a getMetadata() method that returns a number of metadata fields containing more information about the frames.
This feature consists of adding a number of additional metadata fields containing timestamps, in line with recent additions to VideoFrameMetadata in WebCodecs and requestVideoFrameCallback.
For the purposes of this feature, we use the following definitions:
- The capturer system is a system that originally captures a media frame, typically from a local camera, microphone or screen-share session. This frame can be relayed through multiple systems before it reaches its final destination.
- The receiver system is the final destination of the captured frames. It receives the data via an RTCPeerConnection and uses the WebRTC Encoded Transform API with the changes proposed by this feature.
- The sender system is the system that communicates directly with the receiver system. It may be the same as the capturer system, but not necessarily. It is the last hop before the captured frames reach the receiver system.
The proposed new metadata fields are:
receiveTime
: The time when the frame was received from the sender system.

captureTime
: The time when the frame was captured by the capturer system. This timestamp is set by the capturer system.

senderCaptureTimeOffset
: An estimate of the offset between the capturer system clock and the sender system clock. The receiver system can compute the clock offset between itself and the sender system, and these two offsets can be used to adjust the captureTime to the receiver system clock.
captureTime and senderCaptureTimeOffset are provided in WebRTC by the Absolute Capture Time header extension.
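As a minimal sketch of how an application might consume these fields, the snippet below computes the end-to-end delay of a single frame. It assumes the fields are present on the object returned by getMetadata() (they may be absent if the Absolute Capture Time header extension is not negotiated) and that senderReceiverClockOffset has been estimated separately, as in the full example later in this document.

// Minimal sketch. Assumes captureTime, senderCaptureTimeOffset and receiveTime
// are present in the metadata, and that senderReceiverClockOffset has been
// estimated elsewhere (e.g. from RTCP-based stats, as shown later).
function computeEndToEndDelay(encodedFrame, senderReceiverClockOffset) {
  const { captureTime, senderCaptureTimeOffset, receiveTime } =
      encodedFrame.getMetadata();
  if (captureTime === undefined || senderCaptureTimeOffset === undefined) {
    return null; // Header extension not negotiated for this frame.
  }
  // Express the original capture time in this (receiver) system's clock.
  const adjustedCaptureTime =
      senderReceiverClockOffset + senderCaptureTimeOffset + captureTime;
  return receiveTime - adjustedCaptureTime;
}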
Note that the RTCRtpContributingSource interface also exposes these timestamps (see also [extensions](https://w3c.github.io/webrtc-extensions/#rtcrtpcontributingsource-extensions)), but in a way that is not suitable for applications using the WebRTC Encoded Transform API. The reason is that encoded transforms operate per frame, while the values in RTCRtpContributingSource are the most recent seen by the UA, which makes it impossible to know whether the values provided by RTCRtpContributingSource actually correspond to the frames being processed by the application.
This API supports applications where measuring the delay between the reception of a media frame and its original capture is useful.
Some example use cases are:
- Audio/video synchronization measurements
- Performance measurements
- Delay measurements
In all of these cases, the application can log the measurements for offline analysis or A/B testing, and can also adjust application parameters in real time.
- Provide Web applications using WebRTC Encoded Transform access to receive and capture timestamps in addition to existing metadata already provided.
- Align encoded frame metadata with metadata provided for raw frames.
- Provide mechanisms to improve WebRTC communication based on the information provided by these new metadata fields.
The following example shows an application that:
- Computes the delay between audio and video (A/V sync)
- Computes the end-to-end video delay and the video processing time
- Logs and/or updates remote parameters based on these measurements
// code in a DedicatedWorker
let lastVideoCaptureTime;
let lastAudioCaptureTime;
let lastVideoReceiveTime;
let lastVideoSenderCaptureTimeOffset;
let lastVideoProcessingTime;
let senderReceiverClockOffset = null;
function updateAVSync() {
  const avSyncDifference = lastVideoCaptureTime - lastAudioCaptureTime;
  doSomethingWithAVSync(avSyncDifference);
}
// Measures delay from original capture until reception by this system.
// Other forms of delay are also possible.
function updateEndToEndVideoDelay() {
  if (senderReceiverClockOffset == null) {
    return;
  }
  const adjustedCaptureTime =
      senderReceiverClockOffset + lastVideoSenderCaptureTimeOffset + lastVideoCaptureTime;
  const endToEndDelay = lastVideoReceiveTime - adjustedCaptureTime;
  doSomethingWithEndToEndDelay(endToEndDelay);
}
function updateVideoProcessingTime() {
  const processingTime = lastVideoProcessingTime - lastVideoReceiveTime;
  doSomethingWithProcessingTime(processingTime);
}
function createReceiverAudioTransform() {
  return new TransformStream({
    start() {},
    flush() {},
    async transform(encodedFrame, controller) {
      const metadata = encodedFrame.getMetadata();
      lastAudioCaptureTime = metadata.captureTime;
      updateAVSync();
      controller.enqueue(encodedFrame);
    }
  });
}
function createReceiverVideoTransform() {
  return new TransformStream({
    start() {},
    flush() {},
    async transform(encodedFrame, controller) {
      const metadata = encodedFrame.getMetadata();
      lastVideoCaptureTime = metadata.captureTime;
      updateAVSync();
      lastVideoReceiveTime = metadata.receiveTime;
      lastVideoSenderCaptureTimeOffset = metadata.senderCaptureTimeOffset;
      updateEndToEndVideoDelay();
      doSomeEncodedVideoProcessing(encodedFrame.data);
      lastVideoProcessingTime = performance.now();
      updateVideoProcessingTime();
      controller.enqueue(encodedFrame);
    }
  });
}
// Code to instantiate transforms and attach them to the receiver pipelines.
onrtctransform = (event) => {
  let transform;
  if (event.transformer.options.name == "receiverAudioTransform")
    transform = createReceiverAudioTransform();
  else if (event.transformer.options.name == "receiverVideoTransform")
    transform = createReceiverVideoTransform();
  else
    return;
  event.transformer.readable
      .pipeThrough(transform)
      .pipeTo(event.transformer.writable);
};

onmessage = (event) => {
  senderReceiverClockOffset = event.data;
};
// Code running on Window
const worker = new Worker('worker.js');
const pc = new RTCPeerConnection();

// Do ICE and offer/answer exchange. Removed from this example for clarity.

// Configure transforms in the worker.
pc.ontrack = e => {
  if (e.track.kind == "video")
    e.receiver.transform = new RTCRtpScriptTransform(worker, { name: "receiverVideoTransform" });
  else // audio
    e.receiver.transform = new RTCRtpScriptTransform(worker, { name: "receiverAudioTransform" });
};

// Compute the clock offset between the sender and this system.
// getStats() returns a promise, so this is assumed to run in an async context.
const stats = await pc.getStats();
const remoteOutboundRtpStats = getRequiredStats(stats, "remote-outbound-rtp");
const remoteInboundRtpStats = getRequiredStats(stats, "remote-inbound-rtp");
const senderReceiverTimeOffset =
    remoteOutboundRtpStats.timestamp -
    (remoteOutboundRtpStats.remoteTimestamp +
     // roundTripTime is reported in seconds; timestamps are in milliseconds.
     (remoteInboundRtpStats.roundTripTime / 2) * 1000);

worker.postMessage(senderReceiverTimeOffset);
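The getRequiredStats() helper used above is not a platform API; one possible implementation, sketched here, returns the first entry of the requested type from the RTCStatsReport and fails if there is none.

// Hypothetical helper used by the example above: return the first stats entry
// of the given type from an RTCStatsReport.
function getRequiredStats(report, type) {
  for (const stats of report.values()) {
    if (stats.type === type) {
      return stats;
    }
  }
  throw new Error(`Missing required stats of type ${type}`);
}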
Use the values already exposed in RTCRtpContributingSource.
RTCRtpContributingSource already exposes the same timestamps as this feature. The problem with using those timestamps is that it is impossible to reliably associate them with a specific encoded frame exposed by the WebRTC Encoded Transform API. This makes any of the computations in this feature unreliable.
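For comparison, the sketch below shows what an application can do with that existing surface (using the RTCRtpContributingSource extensions via getSynchronizationSources(), and a hypothetical doSomethingWithLatest() callback): it can poll the most recent values per source, but it cannot tell which encoded frame they correspond to.

// Sketch of the alternative: poll the latest values reported for each source.
// These are the most recent values seen by the UA, not values tied to the
// frame currently being processed by an encoded transform.
function pollLatestCaptureTimestamps(receiver) {
  for (const source of receiver.getSynchronizationSources()) {
    doSomethingWithLatest(source.captureTimestamp, source.senderCaptureTimeOffset);
  }
}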
Expose only captureTime and receiveTime.
senderCaptureTimeOffset is a value provided by the [Absolute Capture Time](https://webrtc.googlesource.com/src/+/refs/heads/main/docs/native-code/rtp-hdrext/abs-capture-time#absolute-capture-time) WebRTC header extension, but that extension updates the value only periodically, since there is little value in computing the estimate for every packet, so it is strictly speaking not a per-frame value. Arguably, an application could use the senderCaptureTimeOffset already exposed in RTCRtpContributingSource.
However, given that this value is coupled with captureTime in the header extension, it looks appropriate and more ergonomic to expose the pair in the frame as well. While clock offsets do not usually change significantly in a very short time, there is some extra accuracy in having the estimated offset between the capturer system and the sender for that particular frame. This could be more visible, for example, if the set of relays that frames go through from the capturer system to the sender system changes. Exposing senderCaptureTimeOffset also makes it clearer that the captureTime comes from the original capturer system, so it needs to be adjusted using the corresponding clock offset.
Expose a captureTime already adjusted to the receiver system's clock.
The problem with this option is that clock offsets are estimates. Using estimates makes computing A/V sync more difficult and less accurate. For example, if the UA uses a single estimate during the whole session, the A/V sync computation will be accurate, but the capture times themselves will be inaccurate because the clock offset estimate is never updated. Any other computation made with the captureTime and other local timestamps will be inaccurate.
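As an illustration with made-up numbers: suppose an audio frame and a video frame are captured at the same instant on the capturer's clock, and the UA updates its clock offset estimate from 10 ms to 13 ms between adjusting the two capture times.

// Illustrative numbers only. Both frames were captured at the same instant
// (t = 1000 ms on the capturer system's clock).
const adjustedVideoCaptureTime = 1000 + 10; // adjusted with the old estimate
const adjustedAudioCaptureTime = 1000 + 13; // adjusted with the updated estimate
// The 3 ms difference is an artifact of the changing estimate, not a real
// A/V sync difference; with unadjusted capture times it would be 0.
const avSyncDifference = adjustedVideoCaptureTime - adjustedAudioCaptureTime; // -3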
Expose a localClockOffset instead of a senderClockOffset.
This would certainly support the use cases presented here, but it would have the following downsides:
- It would introduce an inconsistency with the values exposed in RTCRtpContributingSource. This can lead to confusion, as the senderClockOffset is always paired with the captureTime in the header extension and developers expect this association.
- Applications can compute their own estimate of the offset between sender and receiver using WebRTC Stats and can control how often to update it.
- Some applications might be interested in computing delays using the sender as reference.
In short, while this would be useful, the additional value is limited compared with the clarity, consistency, and extra possibilities offered by exposing the senderClockOffset.
These timestamps are already available as part of the RTCRtpContributingSource API, although in a form less suitable for applications using WebRTC Encoded Transform.
- The captureTime field is available via the RTCRtpContributingSource.captureTimestamp field.
- The senderCaptureTimeOffset field is available via the RTCRtpContributingSource.senderCaptureTimeOffset field.
- The receiveTime field is available via the RTCRtpContributingSource.timestamp field.
While these fields are not 100% equivalent to the fields in this feature, they have the same privacy characteristics. Therefore, we consider that the privacy delta of this feature is zero.
Many thanks for valuable feedback and advice from:
- Florent Castelli
- Harald Alvestrand
- Henrik Boström