CC3 Closed Captions Solved - I Feel Dumb
Sometimes you have the solution and don’t realize it. I’ve been using ccextractor for years – at least 4, probably longer, to pull Closed Captions from recorded TV and convert them into SRT files before including them in MKV containers. For years, I knew how to get CC1 and CC2 out – the ccextractor help was clear on that. However, there was no mention of CC3 or CC4 – which is where the English captions are placed by a popular Spanish language TV network.
I’ve been working on a project with a friend – he shall remain Nameless at his request. Anyway, we were playing with some changes to VLC software related to subtitles and I mentioned the issue getting CC3 for a telenovela that I’ve been watching. We were in the VLC code where subtitles are handled, so I had hope for my CC3 dreams. I could add a few printf() calls and have the CC3 data dumped from VLC. Of course, VLC handles all sorts of subtitles and closed captioning standards pretty well, but using it is not always an option, since it doesn’t work on all platforms. Hacking VLC for this purpose is a major hack and would certainly be rejected by the VLC team. I know that I’d reject it.
A few hours later the same evening, I get an email from Nameless with an English SRT attached, but no explanation. He likes to do that. It is after my bedtime, so I figure the explanation can wait until later. I’ve been trying to solve this for a while.
Explanation
The folks that do Closed Captions are TV/video people. They do not seem to be computer people. Someone who bridges that gap is pretty rare – lucky for me, Nameless is just that sort of person. He recognized the CEA-608 NTSC names and the use of Field 1 and 2 inside some code. He became interested and did some quick googling and found a program that handled subtitles and captions. That program had an option to dump field 2 text in line 21 streams. He used it with a few options and ended up with CC3 dumped into an SRT file. Perfect.
The program? ccextractor – yes, the program that I’ve been using for years. I’ve pulled the code and searched it for CC3/cc3 and never found anything. The support website makes reference to CC3 patches that were lost. Those two data points were enough to convince me that it wouldn’t have the desired capabilities. Obviously, I was wrong.
Hopefully, this outline will help explain the CEA-608 organization:
CEA-608 NTSC
  Line 21
    Field 1
      CC1
      CC2
      T1
      T2
    Field 2
      CC3
      CC4
      T3
      T4
      XDS
I Feel Stupid
Nameless didn’t have the same preconceived ideas. He simply pulled the ccextractor code down, compiled it, and read through the help. He was able to recognize that an option already existed in the code to pull data from either Field 1 or Field 2, separately or together. CC1 and CC2 are in field 1; CC3 and CC4 are in field 2. A quick test and he knew it worked. All that remained was sharing his success with me.
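The field layout above boils down to simple arithmetic. Here is a minimal sketch of that mapping; the function name `cc_to_field` is made up for illustration and is not part of ccextractor.

```shell
#!/bin/sh
# Map a CEA-608 caption channel number (1-4) to its line 21 field and
# its position within that field. Pure arithmetic; ccextractor is not used.
# cc_to_field is a hypothetical helper name.
cc_to_field() {
    n=$1
    case "$n" in
        1|2|3|4) ;;
        *) echo "usage: cc_to_field 1|2|3|4" >&2; return 1 ;;
    esac
    field=$(( (n - 1) / 2 + 1 ))   # CC1, CC2 -> field 1; CC3, CC4 -> field 2
    slot=$(( (n - 1) % 2 + 1 ))    # first or second channel within that field
    echo "field=$field channel=$slot"
}

cc_to_field 3   # prints: field=2 channel=1
```

So CC3 is the first channel of field 2, which is why the option that selects field 2 is the one that matters.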
I had pulled the latest source for ccextractor myself and compiled it using the unusual build script. I had already decided to help the project out by creating a makefile, cleaning up the multitude of compiler warnings and building libraries for the parts where that would make sense. The code appears to be a hodge-podge of C and C++ written by many different people over the years. It works and the feature set is pretty awesome, but the code is not what I’d call pretty. I still need to contact the core dev team to see if they are interested. If the changes will be rejected, this isn’t useful. I’ve helped a few companies migrate their source code to more maintainable setups.
ccextractor is cross-platform, so any changes risk breaking the code on MS-Windows, OSX, Linux, BSD, Solaris, and even ARM devices.
The CC3 Dump SRT Command
To extract CC3 from an MPEG2 recording, use
$ ccextractor -2 -cc1 tv_show_file.mpg
This will result in a file named tv_show_file.srt.
To extract CC1 and CC3 from an MPEG2 recording in a single command, use
$ ccextractor -12 -cc1 tv_show_file.mpg
This will result in files named tv_show_file_1.srt and tv_show_file_2.srt.
To get CC4 data, I think this command should work
$ ccextractor -2 -cc2 tv_show_file.mpg
however, I never tested this.
Other options that I use are:
--nofontcolor -utf8
to remove colors from the SRT files and to force the UTF8 character set, good for non-English languages.
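Since TV recordings rarely say which channel holds which language, it can be handy to dump all four channels in one go. The sketch below only prints the commands, so they can be checked before running. It reuses the flags from this article; the CC2 and CC4 combinations are untested guesses, and I am assuming `-o` is the option that names the output file, so check the ccextractor help for your version. The function name and the `_ccN.srt` naming are made up for illustration.

```shell
#!/bin/sh
# Print (do not run) one ccextractor command per caption channel.
# Assumptions: the -1/-2 and -cc1/-cc2 flags behave as described in the
# article, and -o sets the output filename. cc_dump_cmds is a hypothetical name.
# File names containing spaces are not handled.
cc_dump_cmds() {
    in=$1
    base=${in%.*}                             # strip the extension
    for field in 1 2; do
        for chan in 1 2; do
            n=$(( (field - 1) * 2 + chan ))   # caption channel number, 1-4
            echo "ccextractor -$field -cc$chan --nofontcolor -utf8 -o ${base}_cc$n.srt $in"
        done
    done
}

cc_dump_cmds tv_show_file.mpg
```

If the printed commands look right, piping the output to `sh` runs them.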
Adding SRT to MKV Files
Getting the SRT files out isn’t my end goal. I love MKV video containers as you can see. Putting multiple captions, subtitles, and other files into the MKV container with the audio and video streams is my end goal.
I use mkvmerge to do this in a script. The basic format is:
$ mkvmerge -o output.mkv input.h264 english.srt
That’s fine, but we can do better.
Add multiple SRT files to a single file.
$ mkvmerge -o output.mkv input.h264 spanish.srt english.srt
Better, but the player will have them as unknown languages. We can get the language label with this command
$ mkvmerge -o output.mkv input.h264 --language 0:spa spanish.srt --language 0:eng english.srt
Perfect. I’ve used descriptive file names, not the actual filenames output from ccextractor. Of course, ccextractor does let us specify the output SRT filenames, if you want to control them.
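For scripting, the same command can be assembled from language/file pairs. This is a rough sketch, not my actual script: `mkv_sub_cmd` and the `lang:file` argument format are invented here, and it prints the command rather than running it. It uses `spa`, the ISO 639-2 code for Spanish.

```shell
#!/bin/sh
# Build an mkvmerge command line from "lang:file.srt" pairs and print it.
# mkv_sub_cmd and the lang:file format are hypothetical conventions.
# File names containing spaces are not handled.
mkv_sub_cmd() {
    out=$1
    video=$2
    shift 2
    cmd="mkvmerge -o $out $video"
    for pair in "$@"; do
        lang=${pair%%:*}    # text before the first colon, e.g. spa
        file=${pair#*:}     # text after the first colon, e.g. spanish.srt
        cmd="$cmd --language 0:$lang $file"
    done
    echo "$cmd"
}

mkv_sub_cmd output.mkv input.h264 spa:spanish.srt eng:english.srt
```

The track ID is 0 each time because every SRT file contains exactly one track.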
So, I’ve finally gotten CC1 and CC3 out and will get them added to MKV files with the correct language labels.
I still need to suggest documentation changes to the ccextractor team. Hopefully, they will expand the explanation into the README docs and the help screen for the program.
So I’ve been adding SRT files to lots of videos here and for the most part, it has been pretty great.
Except that Android 3.x tablets don’t support MKV file playback in hardware. Let me explain. I encode MPEG2 TV recordings to h.264/MP4 files. Then I take those files and put them into MKV containers – the video/audio are copied into the MKV container – ok, not really copied, but there isn’t any re-transcoding performed. It is just a different container for the same video/audio. Yet Android doesn’t support this in hardware decoding.
Software-based video decoders simply can’t keep up when there is only an ARM CPU. Hardware-based decoders are required.
Ok, this is a nice-to-have capability. I’d like to play any video encode on a tablet and on an XBMC netbook. For 720p resolutions, even hardware decoding stresses the Android device and the dual-Atom CPUs can’t keep up either, so I’ve dropped back to 600p for all recordings. This resolution works just fine on the netbook, but the ARM tablet can’t handle it with MKV containers.
So all this led me to research MP4 container support for SRT subtitles. Turns out you can add multiple SRT files to an MP4 container using HandBrakeCLI, so I experimented with CC1 (Spanish) and CC3 (English). I was able to get both added to the MP4 container. Initially there was a language display problem for the Spanish due to a character set issue. After forcing Handbrake to use UTF8, both VLC and XBMC display the correct characters during playback. That is a win, but the on-screen display for the Spanish is off. It is hard to explain, but the MKV subtitles (from exactly the same SRT output) have line breaks and spacing as you expect. The MP4 SRT/subtitles just look odd: the spacing and line breaks are off, which is the only way I can explain it.
creates file_1.srt and file_2.srt.
Then
HandBrakeCLI $HB_OPTS --srt-codeset UTF8 --srt-file "$SRTOUT1,$SRTOUT2" -i "$IN" -o "$OUT"
forces the codeset to be UTF8 for both subtitles. This works, but the resulting Spanish SRT displays just a little off on complicated sections. The English SRT display is just fine. It seems that the linefeeds are being lost within a single stanza during display with mp4 srt embedded files. That could be a bug in VLC. Display for the exact same SRT file using either an external source with the MP4 during playback OR using an internal source for MKV playback works as expected.
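Because the first problem was a character set mismatch, it may help to confirm that the SRT files really are UTF-8 before handing them to HandBrakeCLI. A small sketch using `iconv`; the function name `srt_is_utf8` is made up, and this does nothing about the lost line breaks.

```shell
#!/bin/sh
# Succeed only if every file given decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid byte sequence.
# srt_is_utf8 is a hypothetical helper name.
srt_is_utf8() {
    for f in "$@"; do
        if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
            echo "not UTF-8: $f" >&2
            return 1
        fi
    done
}
```

In a script this could guard the encode step, for example `srt_is_utf8 "$SRTOUT1" "$SRTOUT2" || exit 1` just before the HandBrakeCLI line.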
Android doesn’t seem to support subtitles in any of the players that I’ve seen, so this isn’t all that critical, but I’d rather have just 1 file type around for all playback desires.
Perhaps all this won’t be an issue with ICS on Android? Seems that native support for MKV containers is built into ICS. Now if I could only get ICS for my tablet – but that is a different problem. ICS has been released, but my tablet is rooted which means no updates. I’m not using any of the root features, besides backups, so an update isn’t that big of a deal to me.
pls i don’t know how to work with comands, i need to extract english subtitle from cc3
@donatus: In the article, how to extract subtitles from CC3 is explained. Please re-read the section The CC3 Dump SRT Command. However, subtitles (image-based) are technically different from Closed Captions which are text based.
a) if you can’t work with “comands”, then I’m afraid I can’t help. The ccextractor tool is a command line tool. There is no GUI version that I’m aware of. I prefer the CLI version since it makes automation much easier.
b) subtitles from TV recordings aren’t usually labeled with the language, so you have to dump them all, CC1, CC2, CC3, CC4 and manually read the SRT files to determine if it is English or Spanish or whatever.
c) Subtitles or Closed Captions for other media like DVD and Bluray discs are stored differently. This article is about TV recordings in the USA that use CC1, CC2, CC3 and CC4. Other countries probably have different standards to embed closed captions into TV signals. I have seen a few DVDs with English CCs in addition to the normal glyph-based image subtitles.
d) Your TV tuner card must actually record the closed captions with the MPEG2 file. Not all TV tuners do. If it doesn’t, then you are without hope. The best tool that I’ve found to see if the MPEG2 recordings even contain the CC data is VLC. It has a GUI and seems to recognize subtitles and/or closed captions in most video media containers.
I have an idea on the GUI problem :) as well as others. Did you ever contact that core dev team?