WebRTC Architecture Types: A Detailed Description
Peer-to-Peer
WebRTC was designed to send media directly between browsers, an
arrangement known as peer-to-peer (P2P). In the peer-to-peer architecture, the
clients first establish a signaling connection to the application server
(sometimes referred to as a signaling server). The signaling method or protocol
is not specified in the WebRTC specifications, allowing the adoption of an
existing method (SIP, WebSockets, XMPP, etc.) or the implementation of a
proprietary signaling process. The application server holds the business logic
and acts as the intermediary for the Session Description Protocol (SDP)
exchange. Once the SDP exchange completes, direct media communication between
the two clients can begin.
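As a concrete illustration, here is a minimal sketch of the caller's side of that flow, assuming a WebSocket-based signaling channel (the endpoint URLs, STUN server, and message format are illustrative; as noted above, WebRTC leaves the signaling protocol to the application):

```typescript
// Minimal P2P call setup sketch; wss://app.example.com/signal is a
// hypothetical application (signaling) server.
const signaling = new WebSocket("wss://app.example.com/signal");
const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.example.com:3478" }],
});

// Trickle local ICE candidates to the remote peer via the signaling server.
pc.onicecandidate = ({ candidate }) => {
  if (candidate) signaling.send(JSON.stringify({ type: "candidate", candidate }));
};

// Caller side: capture media, create the SDP offer, and send it.
async function call(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send(JSON.stringify({ type: "offer", sdp: pc.localDescription }));
}

// Handle the answer (and remote candidates) relayed back by the application server.
signaling.onmessage = async ({ data }) => {
  const msg = JSON.parse(data);
  if (msg.type === "answer") await pc.setRemoteDescription(msg.sdp);
  else if (msg.type === "candidate") await pc.addIceCandidate(msg.candidate);
};
```

Once the answer is applied, media flows directly between the two browsers; the application server touches only signaling traffic.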
Peer-to-Server
While WebRTC is designed primarily for browser-to-browser
communication, a growing number of use cases benefit significantly when media
is anchored in the network with a server acting as a media peer, an arrangement
known as peer-to-server (P2S). As in peer-to-peer, the clients establish a
signaling connection to the application server. The application server
continues to manage the business logic but also uses a media control connection
to the media server to relay the SDP exchange between the client and the media
server. Once the SDP exchange completes, media communication between the client
and the media server can begin.
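From the client's point of view, a peer-to-server session can look as simple as the sketch below, which assumes a hypothetical REST endpoint (/sessions) on the application server that relays the SDP offer over its media control connection and returns the media server's answer:

```typescript
// Sketch of a peer-to-server connection; the /sessions endpoint and response
// shape are illustrative assumptions, not a standardized API.
async function connectToMediaServer(): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();
  const media = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
  media.getTracks().forEach((track) => pc.addTrack(track, media));

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // The application server owns the business logic and relays the SDP
  // to the media server over its media control connection.
  const res = await fetch("https://app.example.com/sessions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sdp: pc.localDescription }),
  });
  const { answer } = await res.json();
  await pc.setRemoteDescription(answer); // media now flows client <-> media server
  return pc;
}
```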
Server-side processing enables advanced functionality such as
centralized recording for compliance purposes, audio/video playback, media
analysis for speech-to-text, transcoding for connecting disparate networks, and
media mixing for multiparty conferencing. Depending on the architecture,
server-side processing can optimize bandwidth and minimize client compute,
benefiting mobile clients by increasing battery life, and can provide a
flexible user interface to clients.
Peer-to-Peer Mesh
The peer-to-peer mesh topology operates without a centralized
media server, requiring each client to send its encoded media to every other
participant in the conference. At the same time, each client must also receive
and decode each participant's media stream. Peer-to-peer mesh is often referred
to as "peer-to-peer mess" given the number of streams required. For instance,
in a four-person videoconference, each client browser encodes and transmits
three media streams while also receiving and decoding three media streams.
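The fan-out can be sketched as one RTCPeerConnection per remote participant; the sendOffer and attachToVideoElement helpers below are hypothetical stand-ins for application signaling and UI code:

```typescript
// Hypothetical application helpers (signaling relay and UI attachment).
declare function sendOffer(peerId: string, sdp: RTCSessionDescription | null): Promise<void>;
declare function attachToVideoElement(peerId: string, stream: MediaStream): void;

const peers = new Map<string, RTCPeerConnection>();

async function joinMesh(participantIds: string[], local: MediaStream): Promise<void> {
  for (const id of participantIds) {
    const pc = new RTCPeerConnection();
    // Each connection encodes and sends the local tracks to this participant...
    local.getTracks().forEach((track) => pc.addTrack(track, local));
    // ...and receives/decodes that participant's media in return.
    pc.ontrack = ({ streams }) => attachToVideoElement(id, streams[0]);
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);
    await sendOffer(id, pc.localDescription);
    peers.set(id, pc);
  }
}
```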
The WebRTC client has full control over the video layout, and
media latency is typically not an issue in a peer-to-peer mesh since media most
often travels directly to the receiving client. Peer-to-peer mesh appeals to
front-end developers because of its low implementation cost. However, this
topology is extremely limited in functionality and scale. For instance, without
a centralized media server, the client is left to perform advanced features
such as recording. Putting this type of functionality at the WebRTC client
level not only adds processing load on the client but can also jeopardize
compliance requirements.
Regarding scalability, encoding and decoding media streams is
compute-intensive, with encoding roughly four times more intensive than
decoding. If every client in the peer-to-peer mesh used the same codec, frame
rate, and resolution, the transmitting client could encode just once. In
practice, however, clients within the conference often do not share the same
codec, frame rate, and resolution, so additional encoding is required.
Furthermore, bandwidth can limit scale, most notably the constrained uplink
bandwidth of most devices. For example, the bitrate for a single 720p video
stream starts at ~1.5 Mbps, so in a four-client conference each client requires
~4.5 Mbps of uplink bandwidth (three outgoing 1.5 Mbps streams). For these
reasons, WebRTC clients, especially (but not limited to) mobile and tablet
devices where compute and bandwidth are limited, will see reduced scale.
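The uplink arithmetic generalizes directly from these figures:

```typescript
// Back-of-the-envelope mesh uplink, using the ~1.5 Mbps figure for one 720p stream.
function meshUplinkMbps(participants: number, perStreamMbps = 1.5): number {
  return (participants - 1) * perStreamMbps; // one outgoing stream per remote peer
}

console.log(meshUplinkMbps(4)); // 4.5 (Mbps) for a four-person conference
console.log(meshUplinkMbps(8)); // 10.5 (Mbps): beyond many consumer uplinks
```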
In summary, while peer-to-peer mesh topologies offer a
flexible user interface and low-latency media connections, the disadvantages of
limited scalability, no transcoding, and a lack of advanced functionality are a
significant deterrent to launching many applications.
Multipoint Control Unit (MCU)
Multipoint Control Unit (MCU) topologies were popular for real-time communications well before the inception of WebRTC. MCU topologies operate with a centralized media server to which WebRTC clients send their encoded media streams. The MCU media server receives each video stream, decodes it, tiles the decoded frames with the streams from the other participants, and then encodes the tiled video to send back to each participant. The MCU topology therefore simplifies what each WebRTC client must send and receive, reducing it to a single stream in each direction.
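Conceptually, the server-side pipeline can be sketched as follows; the Frame, Decoder, Encoder, and composeTiles primitives are hypothetical stand-ins for a native media stack:

```typescript
// Hypothetical media-stack primitives for illustration only.
interface Frame { width: number; height: number; data: Uint8Array }
declare class Decoder { decode(packet: Uint8Array): Frame; }
declare class Encoder { encode(frame: Frame): Uint8Array; }
declare function composeTiles(frames: Frame[]): Frame;

function mixAndFanOut(
  latestPacket: Map<string, Uint8Array>, // newest media packet per participant
  decoders: Map<string, Decoder>,
  encoders: Map<string, Encoder>,        // one encoder per participant profile
): Map<string, Uint8Array> {
  // 1. Decode every participant's incoming stream.
  const frames = [...latestPacket].map(([id, pkt]) => decoders.get(id)!.decode(pkt));
  // 2. Tile the decoded frames into a single composite; the server owns the layout.
  const composite = composeTiles(frames);
  // 3. Re-encode the composite once per participant, in that client's profile.
  const out = new Map<string, Uint8Array>();
  for (const [id, enc] of encoders) out.set(id, enc.encode(composite));
  return out;
}
```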
Reducing the required encode/decode to a single stream
decreases both client compute and bandwidth consumption, benefiting mobile
devices in particular. Furthermore, since each stream is decoded and transcoded
at the MCU media server, there is no requirement that every WebRTC client share
the same codec, frame rate, and resolution profile. The incoming media streams
from the various clients can be transcoded to another codec, trans-sized to a
different resolution, and trans-rated to a different frame rate, allowing each
client to be served its preferred profile. For instance, a mobile client
preferring H.264 at VGA resolution and 15 frames per second can be connected to
a laptop WebRTC client using the VP8 codec at 720p resolution and 30 frames per
second. And since the MCU media server acts as a peer, quality of service (QoS)
can be applied on a per-stream basis, preventing one poor connection from
dictating the quality for all users of a multiparty application.
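The two client profiles from that example might be described as follows; the MediaProfile shape is an illustrative assumption, since in practice this information is negotiated via SDP:

```typescript
// Illustrative per-client profiles an MCU could transcode between,
// taken from the mobile/laptop example above.
type MediaProfile = { codec: "H264" | "VP8"; width: number; height: number; fps: number };

const profiles: Record<string, MediaProfile> = {
  mobile: { codec: "H264", width: 640,  height: 480, fps: 15 }, // VGA @ 15 fps
  laptop: { codec: "VP8",  width: 1280, height: 720, fps: 30 }, // 720p @ 30 fps
};
```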
A key benefit of the MCU topology is that it shifts the
encode/decode processing from the client into the server, often as part of a
cloud compute service, where processing resources are less expensive. This
performance question is becoming more important as newer video coding systems
are very computationally intensive and conference users expect high-quality,
high-resolution video such as 720p and 1080p at 30 frames per second or higher.
As with the peer-to-peer topology, sharing encoders across multiple client
streams can address the performance drawback and result in greater server
scalability. Lastly, the MCU uses the same tiled video composition for each
connecting client, so the video layout in a multiparty conference is dictated
by the server for all client participants.
In summary, the multipoint control unit topology results in
significantly lower compute and bandwidth requirements for the clients and can
interconnect disparate networks through transcoding/trans-rating/trans-sizing,
with the disadvantages of high server load and an inflexible client user
interface. The centralized computational resource of an MCU can also become a
limiting factor in cost-sensitive, large-scale, one-to-many applications.
Selective Forwarding Unit (SFU)
Selective Forwarding Unit (SFU), also known as video
routing, is a topology in which WebRTC clients send their encoded video streams
to a centralized media server, which forwards/routes them to the other WebRTC
clients. The SFU topology is an attractive way to address the server
performance issue, as it avoids the compute expense of video decoding and
encoding. Without encoding/decoding, the latency added by the SFU media server
is also minimal. Lastly, because each client negotiates directly with the SFU
media server, it has complete control over which streams it receives, and
therefore full flexibility over its user interface.
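The core of an SFU can be sketched as a routing loop that never touches the media payload; the types and sendRtp helper here are hypothetical:

```typescript
// Conceptual SFU forwarding: packets are routed, never decoded or re-encoded.
type ClientId = string;
declare function sendRtp(to: ClientId, packet: Uint8Array): void;

// Subscription state: each receiver declares which senders it wants.
const subscriptions = new Map<ClientId, Set<ClientId>>();

function onRtpPacket(from: ClientId, packet: Uint8Array): void {
  for (const [receiver, wanted] of subscriptions) {
    // Forward only to clients subscribed to this sender; no transcoding occurs.
    if (receiver !== from && wanted.has(from)) sendRtp(receiver, packet);
  }
}
```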
While the SFU topology has become a popular choice among
WebRTC communities, perhaps its most commonly overlooked shortcoming is the
default to the "least common codec", meaning every participant in the
conference needs to use the same codec. For multiparty conferences where all
participants are on PCs/laptops, the issue is negligible, but a mobile device
with hardware-accelerated H.264 is forced into software encoding whenever the
conference settles on a different codec. The inability to transcode video
streams can limit the types of clients that can be connected together.
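On the client side, the standard setCodecPreferences API can at least steer negotiation toward a shared codec. The sketch below reorders the browser's advertised video codecs to favor one MIME type; supported codec lists vary by browser, so this is best-effort:

```typescript
// Reorder advertised video codecs so negotiation favors the given MIME type.
function preferCodec(pc: RTCPeerConnection, mimeType: string): void {
  const capabilities = RTCRtpReceiver.getCapabilities("video");
  if (!capabilities) return;
  const preferred = capabilities.codecs.filter((c) => c.mimeType === mimeType);
  const rest = capabilities.codecs.filter((c) => c.mimeType !== mimeType);
  if (preferred.length === 0) return; // codec not supported locally
  for (const transceiver of pc.getTransceivers()) {
    if (transceiver.receiver.track.kind === "video") {
      transceiver.setCodecPreferences([...preferred, ...rest]);
    }
  }
}

// e.g. preferCodec(pc, "video/H264") before calling createOffer()
```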
Furthermore, the multiple streams being routed/forwarded by
the SFU to the WebRTC client increase downlink bandwidth consumption and, in
turn, decode processing at the client. The issue can be mitigated by limiting
the streams forwarded to the client to the active talker, a subset of the
streams, or a combination of both. Additionally, a method called simulcast
allows the WebRTC client to encode multiple versions of its stream, typically
two: one at high resolution and one at lower resolution. This way, the SFU can
forward/route the high-definition stream of the active talker while sending the
lower-definition streams of the listeners.
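Simulcast is requested on the client through the standard sendEncodings option; the two encodings below, with hypothetical rid names and bitrates, mirror the high/low split described above:

```typescript
// Publish two simulcast encodings of one camera track; an SFU can then
// pick the appropriate layer per receiver.
async function publishSimulcast(pc: RTCPeerConnection): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  pc.addTransceiver(stream.getVideoTracks()[0], {
    direction: "sendonly",
    sendEncodings: [
      { rid: "hi", maxBitrate: 1_500_000 },                         // full resolution, active talker
      { rid: "lo", maxBitrate: 300_000, scaleResolutionDownBy: 4 }, // thumbnail for listeners
    ],
  });
}
```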
Lastly, traditional SIP-based platforms cannot handle the
multiple streams produced by the SFU. For this reason, without a gateway
function in the middle, SFU topologies are restricted to WebRTC-only
deployments.
In summary, the selective forwarding unit topology produces a
flexible client user interface and improved server-side performance; the
disadvantages of requiring a least common codec and increased client
compute/bandwidth are concerns to be weighed.
Converged Architectures - Applications for the Real World
As discussed, no topology is perfect, and each comes with
distinct advantages and disadvantages. Multipoint control unit architectures
are ideal when compute and bandwidth are limited and there is a need for
interoperability with disparate networks, but they come at the cost of high
server load and a fixed video layout. Selective forwarding unit topologies, on
the other hand, are ideal for high server performance and maximum client UI
flexibility, but they come at the cost of requiring all connecting clients to
share the same codec, frame-rate, and resolution profile. The tough decision is
which to use for your application. Launching an application into real-world
scenarios requires on-demand access to both the capabilities of an MCU and the
capabilities of an SFU. This complementary feature set has paved the way for a
new, next-generation hybrid-SFU/MCU architecture.
Next Generation Hybrid-SFU/MCU
The hybrid-SFU/MCU topology allows media to be delivered in
the form best suited to each individual client. For example, where the client
is a mobile or SIP device, the media server can deliver a single MCU-style
mixed stream. For WebRTC clients capable of handling multiple streams, with no
restrictions on bandwidth or compute, the media server can deliver
forwarded/routed streams. Additionally, the ability to transcode individual
streams while leaving all others forwarded/routed eliminates the
least-common-codec issue of the SFU. Making these capabilities available within
the same media server gives the application server's business logic the most
flexibility in delivering a scalable, reliable, feature-rich service.
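A sketch of the per-client delivery decision such a hybrid server might make; the ClientInfo fields and the bandwidth threshold are illustrative assumptions:

```typescript
// Per-client delivery decision for a hypothetical hybrid SFU/MCU.
interface ClientInfo {
  kind: "webrtc" | "sip";
  isMobile: boolean;
  downlinkMbps: number;
}

type DeliveryMode = "mixed" | "routed"; // MCU-style single stream vs. SFU-style streams

function chooseDelivery(client: ClientInfo, participantCount: number): DeliveryMode {
  // SIP endpoints and constrained mobile clients get one mixed stream.
  if (client.kind === "sip" || client.isMobile) return "mixed";
  // Clients with headroom for one ~1.5 Mbps stream per remote peer get routed streams.
  if (client.downlinkMbps >= 1.5 * (participantCount - 1)) return "routed";
  return "mixed";
}
```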