
Kawa 川 - Livestream

MON FEB 26 2024

Kawa 川: Live-streaming Backend Service

A simple personal project (?). The repository can be found on GitHub.

Background

I was scrolling around YouTube and found a video about a live streaming system that seemed very intricate, full of jargon I had never heard of. Instead of being put off, I was intrigued and decided it would be my next project.

Design Architecture

The first stage of creating a project like this, for me, is to lay out a plan that includes a bird's-eye view of the architecture. There will be two main components: the RTMP Ingest and the Transcoder. The stream manager is also essential for storing stream keys, but it is not as significant as the other two.

(Architecture diagram)

This streaming service is designed to work with Open Broadcaster Software (OBS), which is the most commonly used live streaming software. OBS can send video and audio data to a remote server using the Real-Time Messaging Protocol (RTMP).

RTMP is still the most widely used protocol for live streaming between the streamer and the server because it has significantly lower latency than protocols such as HTTP Live Streaming (HLS), although it fares slightly worse than the newer WebRTC protocol. The RTMP packets will be unpacked (as explained further in the sections below) and transcoded into HLS format using the popular open-source software FFmpeg; the HLS output is then uploaded to a cloud provider that serves as a Content Delivery Network (CDN).

HLS is chosen for the last-mile delivery between the CDN and the clients due to its wide support in many browsers and players, such as Apple's WebKit, which has native support, and also VLC Media Player.

Part 1: Stream Manager

The stream manager is a simple server that performs Create, Read, Update, and Delete (CRUD) operations on a Redis instance. These CRUD operations involve stream keys and their corresponding publishing URLs. It can be accessed through the usual REST API or gRPC for internal services (such as RTMP ingest).
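As a rough illustration, the storage layer could look like the sketch below, assuming the redis crate and a simple mapping from a stream key to its publishing URL. The StreamManager type and the stream_key: prefix here are illustrative only, not the project's actual API.

```rust
// Hypothetical sketch of the stream manager's storage layer on top of Redis.
use redis::Commands;

struct StreamManager {
    conn: redis::Connection,
}

impl StreamManager {
    fn new(url: &str) -> redis::RedisResult<Self> {
        let client = redis::Client::open(url)?;
        Ok(Self { conn: client.get_connection()? })
    }

    // Create / Update: map a stream key to its publishing URL.
    fn put_key(&mut self, key: &str, publish_url: &str) -> redis::RedisResult<()> {
        self.conn.set(format!("stream_key:{key}"), publish_url)
    }

    // Read: used by the RTMP ingest to verify a key before accepting a publish request.
    fn get_key(&mut self, key: &str) -> redis::RedisResult<Option<String>> {
        self.conn.get(format!("stream_key:{key}"))
    }

    // Delete: revoke a stream key.
    fn delete_key(&mut self, key: &str) -> redis::RedisResult<()> {
        self.conn.del(format!("stream_key:{key}"))
    }
}
```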

Part 2: RTMP Ingest

Verifying Stream Key

Now, the fun part begins. An RTMP server communicates with its clients through TCP sockets, which involves a handshaking process at the application layer. After this process, the client can start sending RTMP messages to the server. Given the complexity, I decided to use a library that handles the raw bytes and returns the abstracted packets (rml-rtmp).

The client will first send a publish request message to the server with a specific stream key and can start streaming once the server accepts the request. At this stage, the RTMP Ingest will send a request to the Stream Manager to verify whether the stream key is valid. If it is not valid, the server will not accept the request and will instead close the connection.

Once accepted, the client can now send video and audio messages to the RTMP ingest.
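Roughly, the event handling might look like the sketch below, based on rml-rtmp's ServerSession API. The handshake and TCP plumbing are omitted, and the helper functions at the bottom are placeholders for the stream manager lookup, the socket write, and the extraction steps described in the next sections.

```rust
// A minimal sketch of handling rml-rtmp server session events:
// verify the stream key on a publish request, then receive audio/video payloads.
use rml_rtmp::sessions::{ServerSession, ServerSessionEvent, ServerSessionResult};

fn handle_bytes(session: &mut ServerSession, bytes: &[u8]) {
    for result in session.handle_input(bytes).unwrap() {
        match result {
            // Bytes that must be written back to the client's TCP socket.
            ServerSessionResult::OutboundResponse(packet) => send_to_client(&packet.bytes),
            ServerSessionResult::RaisedEvent(event) => match event {
                ServerSessionEvent::PublishStreamRequested { request_id, app_name, stream_key, .. } => {
                    // Ask the stream manager whether this key is valid.
                    if verify_stream_key(&app_name, &stream_key) {
                        for r in session.accept_request(request_id).unwrap() {
                            if let ServerSessionResult::OutboundResponse(p) = r {
                                send_to_client(&p.bytes);
                            }
                        }
                    }
                    // Otherwise, drop the connection instead of accepting.
                }
                // FLV-style tag bodies: either headers or the actual frames.
                ServerSessionEvent::VideoDataReceived { data, .. } => process_video(&data),
                ServerSessionEvent::AudioDataReceived { data, .. } => process_audio(&data),
                _ => {}
            },
            _ => {}
        }
    }
}

fn send_to_client(_bytes: &[u8]) { /* write to the TCP socket */ }
fn verify_stream_key(_app: &str, _key: &str) -> bool { true /* query the stream manager */ }
fn process_video(_data: &[u8]) { /* see "Extracting Video" */ }
fn process_audio(_data: &[u8]) { /* see "Extracting Audio" */ }
```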

Extracting Video

Both video and audio RTMP packets are transported in an FLV-like container format inside the chunk data. There are two types of video messages: headers and the video frames themselves. We only need to focus on the AVCDecoderConfigurationRecord and the Network Abstraction Layer (NAL) units.

Our goal is to extract the useful bytes, known as NAL units, from the RTMP packets and merge them into a raw H.264 stream.

The header contains the Picture Parameter Set (PPS) and the Sequence Parameter Set (SPS), which hold essential information such as the resolution and frame rate of the video. They are also part of the NAL units required for the raw H.264 stream.

```
aligned(8) class AVCDecoderConfigurationRecord {
    unsigned int(8) configurationVersion = 1;
    unsigned int(8) AVCProfileIndication;
    unsigned int(8) profile_compatibility;
    unsigned int(8) AVCLevelIndication;
    bit(6) reserved = '111111'b;
    unsigned int(2) lengthSizeMinusOne;
    bit(3) reserved = '111'b;
    unsigned int(5) numOfSequenceParameterSets; // SPS
    for (i = 0; i < numOfSequenceParameterSets; i++) {
        unsigned int(16) sequenceParameterSetLength; // SPS
        bit(8*sequenceParameterSetLength) sequenceParameterSetNALUnit;
    }
    unsigned int(8) numOfPictureParameterSets;
    for (i = 0; i < numOfPictureParameterSets; i++) {
        unsigned int(16) pictureParameterSetLength; // PPS
        bit(8*pictureParameterSetLength) pictureParameterSetNALUnit; // PPS bit
    }
    if (profile_idc == 100 || profile_idc == 110 ||
        profile_idc == 122 || profile_idc == 144) {
        bit(6) reserved = '111111'b;
        unsigned int(2) chroma_format;
        bit(5) reserved = '11111'b;
        unsigned int(3) bit_depth_luma_minus8;
        bit(5) reserved = '11111'b;
        unsigned int(3) bit_depth_chroma_minus8;
        unsigned int(8) numOfSequenceParameterSetExt;
        for (i = 0; i < numOfSequenceParameterSetExt; i++) {
            unsigned int(16) sequenceParameterSetExtLength;
            bit(8*sequenceParameterSetExtLength) sequenceParameterSetExtNALUnit;
        }
    }
}
```

(extracted from Zhihu)
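For illustration, a simplified Rust sketch of walking this structure to pull out the SPS and PPS could look like the following. It assumes the slice starts at the AVCDecoderConfigurationRecord itself (i.e. after the FLV video tag header), assumes the common case of one SPS and one PPS, and does no bounds checking.

```rust
// Extract the SPS and PPS NAL units from an AVCDecoderConfigurationRecord,
// following the layout shown above.
fn parse_avc_decoder_config(record: &[u8]) -> (Vec<u8>, Vec<u8>) {
    // record[0..4]: configurationVersion, AVCProfileIndication,
    //               profile_compatibility, AVCLevelIndication
    // record[4]:    reserved(6) + lengthSizeMinusOne(2)
    let num_sps = (record[5] & 0x1F) as usize; // reserved(3) + numOfSequenceParameterSets(5)
    let mut pos = 6;

    let mut sps = Vec::new();
    for _ in 0..num_sps {
        let len = u16::from_be_bytes([record[pos], record[pos + 1]]) as usize;
        pos += 2;
        sps = record[pos..pos + len].to_vec();
        pos += len;
    }

    let num_pps = record[pos] as usize;
    pos += 1;
    let mut pps = Vec::new();
    for _ in 0..num_pps {
        let len = u16::from_be_bytes([record[pos], record[pos + 1]]) as usize;
        pos += 2;
        pps = record[pos..pos + len].to_vec();
        pos += len;
    }

    (sps, pps)
}
```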


Other than the header, the other type of video message contains the actual video frames encoded in H.264. The RTMP ChunkData carries a series of length-prefixed NAL units, which can be extracted by iterating through them: read the first 4 bytes, which indicate the size, then read the next n bytes, where n is that size.


Once the necessary information has been extracted from the RTMP packets, the NAL units are appended together. However, a separator goes between NAL units: a 0x00_00_01 or 0x00_00_00_01 start-code prefix is added before each unit.
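A rough sketch of that conversion is shown below, assuming a 4-byte length prefix (lengthSizeMinusOne == 3) and that the slice begins with the 5-byte FLV video tag header (frame type/codec ID, AVCPacketType, composition time).

```rust
// Convert length-prefixed NAL units (as carried in the RTMP ChunkData)
// into Annex B format by prepending start codes.
fn avcc_to_annex_b(chunk_data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut pos = 5; // skip the FLV video tag header

    while pos + 4 <= chunk_data.len() {
        // Read the 4-byte big-endian NAL unit size...
        let size = u32::from_be_bytes([
            chunk_data[pos],
            chunk_data[pos + 1],
            chunk_data[pos + 2],
            chunk_data[pos + 3],
        ]) as usize;
        pos += 4;

        // ...then copy the next `size` bytes, prefixed with a start code.
        out.extend_from_slice(&[0x00, 0x00, 0x00, 0x01]);
        out.extend_from_slice(&chunk_data[pos..pos + size]);
        pos += size;
    }
    out
}
```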

Extracting Audio

Similar to the video, there are two types of audio messages: header and data.

The header contains the AudioSpecificConfig, which holds the necessary information such as the sampling frequency and channel configuration.


The other type (data) is more complicated to handle. The message no longer contains the AudioSpecificConfig but the raw audio bytes encoded in Advanced Audio Coding (AAC), so we can simply skip the first two bytes of the RTMP ChunkData.

The complicated part is encapsulating these bytes in Audio Data Transport Stream (ADTS) frames, whose header carries the same configuration information; the raw audio bytes are then appended after the header. The details of the ADTS frame can be found here.
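A hedged sketch of building such a frame is shown below. The profile, sampling-frequency index, and channel configuration would normally come from the AudioSpecificConfig in the audio header message; here they are passed in directly, and CRC protection is left disabled (7-byte header).

```rust
// Wrap a raw AAC frame in an ADTS header (no CRC, one frame per ADTS frame).
fn wrap_in_adts(aac: &[u8], profile: u8, sample_rate_index: u8, channels: u8) -> Vec<u8> {
    let frame_len = aac.len() + 7; // header + payload
    let mut out = Vec::with_capacity(frame_len);

    out.push(0xFF); // syncword (high bits)
    out.push(0xF1); // syncword (low bits) + MPEG-4 + layer 0 + no CRC
    out.push(((profile - 1) << 6) | (sample_rate_index << 2) | (channels >> 2));
    out.push(((channels & 0x03) << 6) | ((frame_len >> 11) as u8 & 0x03));
    out.push((frame_len >> 3) as u8); // frame length, middle bits
    out.push(((frame_len as u8 & 0x07) << 5) | 0x1F); // frame length low bits + buffer fullness (high)
    out.push(0xFC); // buffer fullness (low) + 1 raw data block

    out.extend_from_slice(aac);
    out
}
```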

Transporting Audio and Video Bytes

Now that the video and audio are in raw H.264 and AAC formats respectively, the transcoding process can begin. In this system, the ingest server and transcoder are separated, and thus a method is needed to transport these bytes.

One option is to use gRPC, a Remote Procedure Call (RPC) framework developed by Google that supports streaming data. In this case, we will stream the raw bytes from the ingest server to the transcoder service.
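As a sketch only: assuming a hypothetical transcoder.proto with a client-streaming RPC (service Transcoder { rpc Ingest(stream MediaChunk) returns (IngestAck); }) compiled with tonic-build, the ingest side could stream chunks roughly like this. The message fields are made up for illustration.

```rust
// Hypothetical sketch: streaming raw media bytes from the ingest to the
// transcoder over a client-streaming gRPC call using tonic.
use tokio_stream::wrappers::ReceiverStream;

pub mod transcoder {
    tonic::include_proto!("transcoder"); // generated from the assumed transcoder.proto
}
use transcoder::{transcoder_client::TranscoderClient, MediaChunk};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = TranscoderClient::connect("http://transcoder:50051").await?;

    // Channel that the RTMP ingest would feed with raw H.264 / AAC bytes.
    let (tx, rx) = tokio::sync::mpsc::channel::<MediaChunk>(32);

    // Forward extracted NAL units / ADTS frames as they arrive (one example chunk here).
    tokio::spawn(async move {
        let _ = tx.send(MediaChunk { data: vec![0, 0, 0, 1], is_video: true }).await;
    });

    // Client-side streaming: tonic accepts any Stream of request messages.
    let response = client.ingest(ReceiverStream::new(rx)).await?;
    println!("transcoder ack: {:?}", response.into_inner());
    Ok(())
}
```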

Part 3: Transcoding


HTTP Live Streaming

HLS operates by generating a manifest file (.m3u8), which contains links to several segment files (.ts) that hold the video details. In this scenario, FFmpeg will continuously update the .m3u8 file while generating new .ts files and deleting the older ones.

One advantage of using the HLS format is the capability for adaptive bitrate streaming, allowing the client to dynamically adjust the video quality on demand based on the connection speed. The manifest will include multiple links, each pointing to a different manifest file for various resolutions.
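For illustration, such a master playlist might look like this (the renditions, bandwidths, and paths here are made up):

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/index.m3u8
```

The segment (media) playlist that FFmpeg keeps rewriting, as described above, looks like the following: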

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:5
#EXT-X-MEDIA-SEQUENCE:8
#EXTINF:5.000000,
file00008.ts
#EXTINF:5.000000,
file00009.ts
#EXTINF:5.000000,
file00010.ts
```

Named Pipe

Now that we have a gRPC server and FFmpeg for transcoding, how do we connect them together?

This was one of the issues I encountered, and I found the solution upon learning about named pipes. A named pipe acts as an inter-process communication (IPC) method between the gRPC server and the FFmpeg instance that is spawned upon receiving a request.

It behaves much like a file: the gRPC server writes into the pipe while FFmpeg reads from it. Unlike a traditional file, however, the data passes through memory instead of being written to disk, which requires the producing and consuming processes to access the pipe simultaneously.

A named pipe can be created using the mkfifo command.
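A rough sketch of the whole hookup is below; the pipe path and FFmpeg arguments are illustrative, and the real setup would use one pipe for video and one for audio.

```rust
// Wire the gRPC transcoder service to FFmpeg through a named pipe.
use std::io::Write;
use std::process::Command;

fn transcode_via_named_pipe(video_bytes_rx: std::sync::mpsc::Receiver<Vec<u8>>) -> std::io::Result<()> {
    let pipe_path = "/tmp/kawa_video.pipe";

    // Create the named pipe (equivalent to running `mkfifo` manually).
    Command::new("mkfifo").arg(pipe_path).status()?;

    // Spawn FFmpeg reading raw H.264 from the pipe and producing HLS output.
    let mut ffmpeg = Command::new("ffmpeg")
        .args([
            "-f", "h264", "-i", pipe_path,
            "-c:v", "libx264",
            "-f", "hls",
            "-hls_time", "5",
            "-hls_flags", "delete_segments",
            "/tmp/hls/index.m3u8",
        ])
        .spawn()?;

    // Opening the pipe for writing blocks until FFmpeg opens it for reading.
    let mut pipe = std::fs::OpenOptions::new().write(true).open(pipe_path)?;

    // Forward the raw H.264 bytes received over gRPC into the pipe.
    while let Ok(bytes) = video_bytes_rx.recv() {
        pipe.write_all(&bytes)?;
    }

    ffmpeg.wait()?;
    Ok(())
}
```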

Demo

You can refer to the GitHub repository for the guide on setting up the project (the repository can be found here).

Once it is running, you can send a POST request to the stream manager (default port 8888) to obtain a stream key.

Then, you can use OBS to stream to the RTMP server (default is port 1935) using the stream key obtained from the stream manager.

The output will be uploaded to the AWS S3 bucket, and the manifest file will be located at [AWS S3 Bucket URI]/[publish_path]/index.m3u8.

It can be played natively on WebKit-based browsers, such as Safari, or using VLC Media Player.

(26/2/24) A simple full-stack demo might be made in the future, but I am too busy with school :)

References

It is my first time using Rust and creating a project of this scale, so please expect that some things might not be very optimal :D. ChatGPT was used to fix grammatical errors.