<span>WebRTC 101: How Real-time Video Communication is Facilitated</span>
<span>Giorgi Beruashvili</span>
<span>Fri, 01/27/2023 - 13:51</span>
<div class="f-body om-text-content">
<p>Real-time video communication has become an essential part of our daily lives, whether for work or socializing. The ability to have face-to-face conversations with anyone, anywhere, is made possible by WebRTC, a technology that enables real-time video and audio communication directly in the browser.</p>
<p>Connecting two users starts with locating each other via their public IP addresses, which is challenging because the number of public IP addresses is limited. Private IP addresses were introduced to work around this shortage, and NAT and STUN/TURN servers are used to obtain a public presence and establish direct connections. In addition, the ICE and SDP protocols standardize the exchange of information about connectivity, media streams, inputs, and outputs.</p>
<p>Let's start with the basics. Consider the scenario in which person A wants to call person B. The first step is to locate each other. If we imagine an empty Cartesian coordinate system, we need to locate two dots (A and B) in order to connect them with a line. On the internet, their locations are their public IP addresses, which determine their public presence.</p>
<p>When public IP addresses were first introduced, they used a 32-bit format, which mathematically means there can be at most 2<sup>32</sup> public IP addresses - approximately 4.3 billion. Nowadays, however, there are more internet users and connected devices than there are possible public IP addresses. As a result, not every device can have its own public IP address, and therefore not every device can have a public presence - no dot on the Cartesian plane.</p>
<p>How do we solve this issue?</p>
<p>Private IP addresses were introduced to solve this problem. Generally, only routers have public IP addresses, and all devices connected to a particular router have their own private IP addresses. These private IP addresses cannot communicate with the external world directly; they can neither send a request to the internet nor receive a response from it.</p>
<p>When a device sends a request through the router, it is given a public presence (a public IP address) through a process called <strong>NAT</strong> (Network Address Translation). The request is then sent to the destination, and the response that comes back is translated to the private IP address that sent the request.</p>
<p>There are four types of NAT:</p>
<ol>
<li>Full-cone NAT</li>
<li>Address-restricted NAT</li>
<li>Port-restricted NAT</li>
<li>Symmetric NAT</li>
</ol>
<p>Many routers operate with the full-cone NAT system, which allows responses from all IP addresses and ports to come through. The other types check the trustworthiness of the responder's IP address, port, or both. This will be important later on; for now, we know that a request sent from a private IP address is wrapped with a public IP address using NAT. The response from the destination then either passes straight through (full-cone) or is first checked for trustworthiness (for example: has this destination been contacted before? If so, it is considered trustworthy) and is allowed through or blocked accordingly.</p>
<p>So even though each device does not have its own public IP address, it can still communicate with the external world through its router using NAT. But how do we determine what our public IP address is?</p>
<p>We can use <strong>STUN</strong> (Session Traversal Utilities for NAT) to help with this.</p>
<p>STUN servers are cheap and easy to maintain, and most are public. A STUN server responds with our public IP address, port, and connectivity status. We package this data and send it to our peer as an ICE candidate (<strong>ICE</strong>, Interactive Connectivity Establishment, is a protocol used to transfer the data provided by the STUN server in a standardized way), and our peer does the same. This process allows us to locate each other and connect directly.</p>
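<p>In the browser, the STUN lookup and candidate gathering happen behind the <code>RTCPeerConnection</code> API. Below is a minimal sketch of this step, assuming Google's commonly used public STUN server; <code>sendToSignalingServer</code> is a hypothetical helper standing in for however you deliver candidates to the other peer (more on that exchange later).</p>
<pre> <code class="language-javascript">// A minimal sketch of ICE candidate gathering via a public STUN server.
const peerConnection = new RTCPeerConnection({
  // A widely used public STUN server; substitute your own if needed.
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
});

peerConnection.onicecandidate = (event: RTCPeerConnectionIceEvent) =&gt; {
  if (event.candidate) {
    // Each candidate describes one way this peer can be reached
    // (host, server-reflexive via STUN, or relayed via TURN).
    // sendToSignalingServer is a hypothetical transport helper.
    sendToSignalingServer({ type: 'iceCandidate', data: event.candidate });
  }
};</code></pre>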
<p>However, STUN servers can sometimes fail. Some NAT types, such as symmetric NAT, block direct connections even after the peers have exchanged information about each other.</p>
<p>For example:</p>
<p>A has a full-cone NAT while B has a symmetric NAT. They exchange their information and attempt to connect directly, browser to browser - this is what video calls are about: a direct connection without delays. A's router, which uses full-cone NAT, allows the signal from B to come through, but B's router encounters A's signal for the first time, does not trust it, and blocks it. This is where <strong>TURN</strong> (Traversal Using Relays around NAT) servers come into play.</p>
<p>TURN servers provide both peers with information about their IP addresses, ports, and other details, and then act as a mediator between them: the signal passes through the relay, which eliminates the need for trust between the caller's and receiver's endpoints. TURN servers are more expensive and difficult to maintain, and are often privately owned. They are used as a relay when STUN-based direct connections fail.</p>
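<p>In practice, STUN and TURN servers are simply listed together in the <code>RTCPeerConnection</code> configuration, and ICE falls back to the relay only when no direct route succeeds. The sketch below uses placeholder values: the TURN URL and credentials are hypothetical, since TURN servers are usually private and require authentication.</p>
<pre> <code class="language-javascript">// A sketch of a configuration combining STUN with a TURN fallback.
const rtcConfiguration: RTCConfiguration = {
  iceServers: [
    // Tried first: direct, STUN-assisted connectivity.
    { urls: 'stun:stun.l.google.com:19302' },
    // Fallback: relay the media through a TURN server.
    // turn.example.com and the credentials below are placeholders.
    {
      urls: 'turn:turn.example.com:3478',
      username: 'demo-user',
      credential: 'demo-password'
    }
  ]
};

const peerConnection = new RTCPeerConnection(rtcConfiguration);</code></pre>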
<p>We have now reached the point where A has determined their public IP address using STUN or TURN and created an object containing this information as an ICE candidate.</p>
<p>As mentioned above, ICE is the protocol used to standardize the construction of this information. On the other end of the line, B has done the same.</p>
<p>What should we do next? Do we need to gather any more information, and if so, how should we send it?</p>
<p>Yes, we need to gather information about our media streams, inputs, and outputs, among other things. We collect this information through WebRTC and create an offer containing it, also called an <strong>SDP</strong> (Session Description Protocol) offer, which is essentially a string of session information.</p>
<p>How do we exchange this information with each other? That part is left to your imagination - you can use paper, pigeons, or servers. Mostly, servers are chosen. We implement a WebSocket that fires an event when Peer A sends his SDP and ICE candidates to the server. This allows Peer B to react to the event and use this information, while also sending his own SDP. The same happens on Peer A's side: the WebSocket informs him that Peer B has fired an event, and he can respond to it.</p>
<p>The following code examples illustrate some of the concepts explained in the article - for example, getting a local video stream and rendering it in an HTML element.</p>
<p>In the following code snippet, we write a function called <code>requestMediaDevices</code>, which accesses the <code>navigator</code> object (the Navigator interface represents the state and identity of the user agent). Then, on the <code>mediaDevices</code> object, which contains information about the user's media devices, we call <code>getUserMedia</code>. This method prompts the user for permission to use a media input and produces a <code>MediaStream</code> with tracks containing the requested types of media.</p>
<pre> <code class="language-javascript">private async requestMediaDevices(): Promise&lt;void&gt; {
  try {
    // getUserMedia requires a constraints object; without one it rejects
    // with a TypeError, so we explicitly ask for both video and audio.
    this.localMediaStream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  } catch (error: any) {
    alert(`An error occurred while getting user media: ${error.name}`);
  }
}</code></pre>
<p>In order to attach the video stream to an HTML element, you need to iterate over the tracks of the <code>localMediaStream</code> and enable them. These tracks are video and audio. The <code>localVideoStream</code> is just a reference to the <code>&lt;video&gt;</code> HTML element, and you bind this flowing stream of media to the element's <code>srcObject</code>.</p>
<pre> <code class="language-javascript">this.localMediaStream.getTracks().forEach(track =&gt; {
  track.enabled = true;
});
this.localVideoStream.nativeElement.srcObject = this.localMediaStream;</code></pre>
<p>Now we can skim through the call process:</p>
<pre> <code class="language-javascript">async startCall(): Promise&lt;void&gt; {
  // The RTC configuration holds information about the STUN/TURN servers we use.
  this.peerConnection = new RTCPeerConnection(rtcConfiguration);

  // Here you should write a handler for onicecandidate (retrieve ICE candidates
  // and send them to the WebSocket), as well as handlers for
  // oniceconnectionstatechange, onsignalingstatechange and ontrack,
  // to properly handle call failures, retries, etc.

  // Add the local media tracks to the peer connection so they become part of the offer.
  this.localMediaStream.getTracks().forEach(
    track =&gt; this.peerConnection.addTrack(track, this.localMediaStream)
  );

  try {
    const offer: RTCSessionDescriptionInit = await this.peerConnection.createOffer();
    // Set the local SDP offer as the local description of the peer connection.
    await this.peerConnection.setLocalDescription(offer);
    this.RTCconnectionWebsocket.sendMessage({ type: 'offer', data: offer });
  } catch (err: any) {
    // handle the error here
  }
}</code></pre>
<p>Now some moments from handling Peer A's offer on Peer B's side:</p>
<pre> <code class="language-javascript">private handlePeerOffer(offer: RTCSessionDescriptionInit): void {
  // The SDP from A, which is an offer, is a remote description for B, and vice versa.
  this.peerConnection.setRemoteDescription(new RTCSessionDescription(offer))
    .then(() =&gt; {
      // Bind the media stream to the local video element's srcObject, just like Peer A did.
      this.localVideoStream.nativeElement.srcObject = this.localMediaStream;
      // Add the local media tracks to the peer connection, just like in Peer A's case.
      this.localMediaStream.getTracks().forEach(
        track =&gt; this.peerConnection.addTrack(track, this.localMediaStream)
      );
    })
    .then(() =&gt; {
      // Build the SDP for Peer B, which in this peer connection is an answer to the offer.
      return this.peerConnection.createAnswer();
    })
    .then(answer =&gt; {
      // The created answer is B's SDP and should be set as his local description.
      return this.peerConnection.setLocalDescription(answer);
    })
    .then(() =&gt; {
      this.RTCconnectionWebsocket.sendMessage({ type: 'peerAnswer', data: this.peerConnection.localDescription });
    })
    .catch(err =&gt; {
      // handle the error here
    });
}</code></pre>
<p>As the code snippet shows, B sets A's offer as the remote description while creating his own answer and setting it as his local description. He then alerts the WebSocket that the answer was set, so that further event handlers that depend on receiving this type of message can act upon it.</p>
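<p>To close the loop, Peer A still has to apply B's answer and the ICE candidates arriving over the WebSocket; the snippets above stop just before this step. Here is a hedged sketch of what that could look like, reusing the hypothetical <code>{type, data}</code> message shape from the examples above.</p>
<pre> <code class="language-javascript">// A sketch of Peer A's side once B's answer and ICE candidates arrive.
// The message shape ({ type, data }) mirrors the hypothetical WebSocket
// messages used in the snippets above.
private handleSignalingMessage(message: { type: string; data: any }): void {
  switch (message.type) {
    case 'peerAnswer':
      // B's answer becomes A's remote description, completing the SDP exchange.
      this.peerConnection.setRemoteDescription(new RTCSessionDescription(message.data))
        .catch(err =&gt; console.error('Failed to set remote description', err));
      break;
    case 'iceCandidate':
      // Each candidate discovered by the remote peer is handed to the local
      // ICE agent, which probes the possible routes until one connects.
      this.peerConnection.addIceCandidate(new RTCIceCandidate(message.data))
        .catch(err =&gt; console.error('Failed to add ICE candidate', err));
      break;
  }
}</code></pre>
<p>Once a candidate pair succeeds, the <code>ontrack</code> handler fires with the remote media, which can be bound to a second <code>&lt;video&gt;</code> element's <code>srcObject</code> exactly as the local stream was.</p>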
<p>Hopefully, this roundup sheds some light on the process of real-time video communication and contributes to the low-level knowledge needed for building such applications.</p>
</div>