
Getting started with our code

This page is currently being written (a bit every day, actively) so that new developers who want to join us don't have to learn the basics from scratch.

We often get questions about how to get started with our code. The most important thing would be: Don't try to read and understand every file, because it's pointless and there's no need. While it's definitely not the Linux kernel, CCExtractor's code is not trivial, and it's been written by a number of people over a long time. Often, those people were learning as they went, and it shows in parts of the code.

However, it's important to have a general idea of how things are organized so you know where to look for things and how to add new features.

This page tries to explain the most important concepts and introduces the important files in the core CCExtractor tool. Note that we have additional tools such as our regression test platform, or the real time subtitle database. Those will be explained in their own pages.

CCExtractor is written in C. If you are a C++ developer, that will have pretty much zero impact on your ability to contribute, because the really important differences are abstracted in functions anyway. Sure, we don't have classes and our I/O is different, but that's really not a big deal here - you will need to understand file formats anyway, or how to read specification documents. None of that depends on the language of choice.

CCExtractor reads binary streams (a stream may be a file, but it can also be data coming from the network - so don't assume) and writes subtitle files.

Container formats

The usual audio/video streams come in a number of variants. You know how in files you have .avi, .mkv, .mp4, .mpeg and so on? Those are container formats, because they “contain” the parts of the media: video, audio and subtitles. Each of those has some limitations, but in general, the container format doesn't specify how each part of the media is encoded. You can have a .mkv (Matroska) that contains the video encoded as MPEG-2, or H264, etc, then the audio as MP3, or AAC and so on.

In TV broadcast, the typical container is the Transport Stream (.ts). A Transport Stream can carry more than one TV program (for example, BBC One, BBC Two and BBC News), each of them with its own video, audio, and subtitles (and for each, maybe more than one language).

Streaming services such as iTunes use .mp4.

The parts of CCExtractor that handle the containers are called demuxers. A demuxer is capable of reading a specific container and returning parts of it.
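
To give a feel for the kind of work a demuxer does, here is a minimal sketch of the very first step for a Transport Stream. It relies only on the standard layout of a TS packet (188 bytes, starting with the sync byte 0x47, with a 13-bit PID that tells you which stream inside the container the packet belongs to). The helper name ts_packet_pid is made up for this page; it is not a function from CCExtractor.

```c
#include <stddef.h>
#include <stdint.h>

#define TS_PACKET_SIZE 188   /* fixed size of a Transport Stream packet */
#define TS_SYNC_BYTE   0x47  /* every TS packet starts with this byte */

/* Hypothetical helper (not part of CCExtractor).
 * Returns the 13-bit PID of one TS packet, or -1 if the buffer is
 * too short or does not start with the sync byte. */
static int ts_packet_pid(const uint8_t *pkt, size_t len)
{
    if (pkt == NULL || len < TS_PACKET_SIZE || pkt[0] != TS_SYNC_BYTE)
        return -1;
    /* The PID is the low 5 bits of byte 1 followed by all of byte 2. */
    return ((pkt[1] & 0x1F) << 8) | pkt[2];
}
```

A real demuxer then has to group packets by PID and work out which PIDs carry video, audio or subtitles, which is where most of the complexity lives.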


Our input streams are files that contain subtitles. These subtitles can be encoded in different ways depending on the country they come from or the technology used to make the recording. Focusing on recordings made from a TV broadcast, we have:

CEA-608 is the “old” format used in North America. It comes from the analog days of NTSC, but while the transmission was analog, in the end you had 2 bytes (that's digital) of subtitles in each frame, and that's the one thing that is important to keep in mind. You don't need to bother understanding the analog part of the transmission, because what we process is just those two bytes.

By the way, in North America those subtitles that you can turn on and off are called closed captions.

CEA-708 is the “new” format used in North America. It's all digital, and because TVs were a lot better by the time it was designed, it has much more bandwidth for subtitles and many more capabilities.

Teletext is the old format in Europe. It's still around, but it's quickly being replaced with DVB.

DVB is the current format in Europe. It's a bitmap-based format, which means that images are transmitted instead of characters (for example, for “CCExtractor” you would get a picture of the letters, not one byte for each letter as you might expect). This makes DVB more capable, but also a lot harder to transcribe to text, since OCR is required.

ISDB is the format used in Brazil.

In CCExtractor, the parts of code responsible for handling the different subtitle formats are called decoders.
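
As a small taste of what a decoder starts from, take the two CEA-608 bytes mentioned above. In CEA-608 each byte carries 7 bits of data plus an odd parity bit in the top position, so the first thing to do with a pair is check the parity and strip it. The sketch below does just that. The function names are invented for this page, and rejecting the whole pair on a parity error is an assumption made to keep the example short, not a description of what CCExtractor does.

```c
#include <stdint.h>

/* Hypothetical helper (not part of CCExtractor).
 * Returns 1 if the byte has odd parity, i.e. an odd number of 1 bits. */
static int cc608_parity_ok(uint8_t b)
{
    int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (b >> i) & 1;
    return ones & 1;
}

/* Hypothetical helper (not part of CCExtractor).
 * Checks both bytes of a pair and writes the 7-bit values to out.
 * Returns 0 on success, -1 if either byte fails the parity check
 * (a simplification chosen for this sketch). */
static int cc608_strip_pair(uint8_t first, uint8_t second, uint8_t out[2])
{
    if (!cc608_parity_ok(first) || !cc608_parity_ok(second))
        return -1;
    out[0] = first & 0x7F;   /* drop the parity bit */
    out[1] = second & 0x7F;
    return 0;
}
```

For example, the pair 0xC8 0xE9 passes the check and strips down to 0x48 0x69, which are the characters 'H' and 'i'.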

Combination of containers and subtitle formats

As explained, subtitles come in a number of encodings, and they can be carried in different containers. So you can have subtitles encoded in CEA-608 inside a .ts or a .mp4. And you can also have a .ts file or a .mp4 that contains subtitles in CEA-608 and DVB.

Once you have the subtitle data it doesn't matter where it came from (what the container type is). Similarly, when processing a container, it doesn't matter what type of subtitles are there.
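
One way to picture this separation is a small structure that a demuxer hands over to the rest of the program. The sketch below is only an illustration: the type and function names are hypothetical and do not match CCExtractor's real structures. The point is that the payload records which subtitle format it holds, but nothing about the container it was read from, so picking a decoder never needs to look at the container.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only (not CCExtractor's). */
enum sub_format { SUB_CEA608, SUB_CEA708, SUB_TELETEXT, SUB_DVB, SUB_ISDB };

/* What a demuxer returns: raw subtitle bytes tagged with their format.
 * Note that there is no field saying whether it came from .ts or .mp4. */
struct sub_payload {
    enum sub_format format;
    const uint8_t  *data;
    size_t          len;
};

/* Picks the decoder from the payload alone and returns its name. */
static const char *decoder_name(const struct sub_payload *p)
{
    switch (p->format) {
    case SUB_CEA608:   return "CEA-608";
    case SUB_CEA708:   return "CEA-708";
    case SUB_TELETEXT: return "Teletext";
    case SUB_DVB:      return "DVB";
    case SUB_ISDB:     return "ISDB";
    }
    return "unknown";
}
```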

Reading the containers

The first thing that we do is identify (unless the user specified it manually) the type of container we're going to process. This is done by reading the first bytes and figuring it out for ourselves.

This happens in the function

void detect_stream_type (struct ccx_demuxer *ctx)

which is in the file streams_functions.c

That function (please check the code) sets the stream type for the context (more on contexts later). This is a best guess: identifying a container without fault is a lot harder than you'd think, but that's not important for an introduction.
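
To make "reading the first bytes" concrete, here is a much simplified sketch of that kind of guessing. It is not the code of detect_stream_type; the enum and the function name are made up for this page, and it only checks a few well-known signatures: the EBML header of Matroska, the pack start code of an MPEG Program Stream, the "ftyp" box of MP4, and the 0x47 sync byte repeating every 188 bytes in a Transport Stream.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for illustration only (not CCExtractor's). */
enum container_guess { GUESS_UNKNOWN, GUESS_TS, GUESS_MP4, GUESS_MKV, GUESS_PS };

static enum container_guess guess_container(const uint8_t *buf, size_t len)
{
    static const uint8_t ebml_magic[4] = { 0x1A, 0x45, 0xDF, 0xA3 };
    static const uint8_t pack_start[4] = { 0x00, 0x00, 0x01, 0xBA };

    /* Matroska files start with the EBML header ID. */
    if (len >= 4 && memcmp(buf, ebml_magic, 4) == 0)
        return GUESS_MKV;
    /* MPEG Program Streams start with a pack start code. */
    if (len >= 4 && memcmp(buf, pack_start, 4) == 0)
        return GUESS_PS;
    /* MP4 files usually have an "ftyp" box first: 4 size bytes, then the tag. */
    if (len >= 8 && memcmp(buf + 4, "ftyp", 4) == 0)
        return GUESS_MP4;
    /* Transport Stream: sync byte at the start of three consecutive packets. */
    if (len >= 2 * 188 + 1 &&
        buf[0] == 0x47 && buf[188] == 0x47 && buf[376] == 0x47)
        return GUESS_TS;
    return GUESS_UNKNOWN;
}
```

Real streams are messier than this (for instance, a recording may not start exactly on a packet boundary), which is part of why detection is only a best guess.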

Once we know what type of stream we're processing we know which demuxer to use to read it.

We have demuxers for Transport Streams (in ts_functions.c), MP4 (in mp4.c) and more. The block that decides what to do once the type of container is known is in the main file:

        /* -----------------------------------------------------------------
        MAIN LOOP
        ----------------------------------------------------------------- */
        switch (stream_mode)

Last modified: 2018/02/12 23:48 by cfsmp3