Published on ONJava.com (http://www.onjava.com/)

Making Media from Scratch, Part 1

by Chris Adamson

QuickTime is often described as a "media creation" API, and that means a lot more than just the ability to edit your audio and video and export it to an arbitrary format. This month I'd like to take the term very literally and show you how to create your movies in Java, one frame at a time, without depending on a pre-existing movie.

To do that, we need to take another look at the format of a QuickTime movie. In "Parsing and Writing the QuickTime File Format," we saw how structures called "atoms" represented this format. For today, let's strip away those details and look at the big picture:

  1. A movie contains metadata (creation and modification time, current selection, preferred volume and rate, etc.) and zero or more tracks.

  2. A track contains metadata (creation and modification time, playback quality), exactly one media object, and an edit list describing which parts of the media are to be used.

  3. A media object contains a data reference that indicates where the audio, video, or other data actually is (in the movie file, in another file, on the network, etc.); information about which QuickTime "media handler" can load, save, and play the data; and a structure called a "sample table" to represent where the sample for a given time can be found in the data.

Graphically, this can be seen as a movie where the references are all to external sources (files, URLs, and other movies), as shown in Figure 1, or a "flattened" one, in which the data is all contained within the same .mov file as the movie's structure, as shown in Figure 2. Either way, the movie is the structure that represents where the samples are, how they're arranged, and what to do with them.

Figure 1. A movie with external sources

Figure 2. A movie with internal sources
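The movie, track, and media hierarchy described above can be sketched with a few plain Java classes. These are illustrative stand-ins rather than the QTJ classes, every class and field name is hypothetical, and details such as the edit list and sample table are left out:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a movie: metadata omitted, zero or more tracks.
final class SketchMovie {
    final List<SketchTrack> tracks = new ArrayList<SketchTrack>();

    /** True if every track's media data lives in the given movie file. */
    boolean isSelfContained(String movieFile) {
        for (SketchTrack track : tracks) {
            if (!track.media.dataReference.equals(movieFile)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SketchMovie movie = new SketchMovie();
        movie.tracks.add(new SketchTrack(new SketchMedia("clip.mov", "text", 100)));
        movie.tracks.add(new SketchTrack(new SketchMedia("narration.aiff", "sound", 44100)));
        System.out.println("self-contained: " + movie.isSelfContained("clip.mov"));
    }
}

// Hypothetical stand-in for a track: exactly one media object.
final class SketchTrack {
    final SketchMedia media;

    SketchTrack(SketchMedia media) {
        this.media = media;
    }
}

// Hypothetical stand-in for a media object.
final class SketchMedia {
    final String dataReference; // where the sample data actually is
    final String handlerType;   // which media handler understands the data
    final int timeScale;        // time units per second for this media

    SketchMedia(String dataReference, String handlerType, int timeScale) {
        this.dataReference = dataReference;
        this.handlerType = handlerType;
        this.timeScale = timeScale;
    }
}
```

In this toy model, a movie like Figure 1 has at least one data reference pointing outside the movie file, while a flattened movie like Figure 2 has none.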


Sampling Samples

By "samples," we mean what is to be seen or heard at some instant, over the smallest span of time relevant to that kind of media. For example, imagine a format where we have totally uncompressed video (equivalent to, say, North American television) and uncompressed CD-quality audio. The video, by our definition, is 30 frames per second, so there are 30 video samples in one second. CD-quality audio is 44.1 kHz, meaning there are 44,100 samples in a second.
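To get a feel for how different those two rates are, here is a small arithmetic sketch using the figures from the paragraph above. The class and method names are made up for this example:

```java
// Sample counts for the uncompressed example in the text:
// 30 video samples and 44,100 audio samples per second.
final class SampleCounts {
    static final int VIDEO_SAMPLES_PER_SECOND = 30;
    static final int AUDIO_SAMPLES_PER_SECOND = 44100;

    /** Number of samples needed to cover the given number of whole seconds. */
    static long samplesFor(int samplesPerSecond, int seconds) {
        return (long) samplesPerSecond * seconds;
    }

    public static void main(String[] args) {
        int oneMinute = 60;
        System.out.println("video samples: " + samplesFor(VIDEO_SAMPLES_PER_SECOND, oneMinute));
        System.out.println("audio samples: " + samplesFor(AUDIO_SAMPLES_PER_SECOND, oneMinute));
    }
}
```

One minute of this hypothetical format holds 1,800 video samples and 2,646,000 audio samples.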

QuickTime, interestingly, realizes that a player would generally like its data to be organized with regard to time. For example, you don't want to have a file with all of the video data first and then all the audio data, since playing back would require jumping back and forth between the two, and the read/write head on your hard drive would scream in agony. It's easier to mix them, so that the video data for a certain time and the audio data for that time are in the same place. In QuickTime's worldview, this is a process of "chunking": the media data combines video, audio, and any other data into one stream (a long run of bytes), with "chunks" of audio, video, and other samples grouped by time. It's up to the media object to manage several tables, like a time-to-sample table and a sample-to-chunk table, to allow it to find the samples at playback time.

Fortunately, you as a developer aren't responsible for all of that bookkeeping, but it's good to understand how it works.
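As a rough idea of what that bookkeeping involves, here is a simplified sketch of a time-to-sample lookup. It assumes the table is a list of runs, each saying "the next N samples all last D time units." The class is hypothetical and much simpler than what QuickTime actually maintains:

```java
// A simplified, hypothetical time-to-sample table. Entry i means:
// "the next sampleCounts[i] samples each last sampleDurations[i] time units."
final class TimeToSampleSketch {
    private final int[] sampleCounts;
    private final int[] sampleDurations;

    TimeToSampleSketch(int[] sampleCounts, int[] sampleDurations) {
        if (sampleCounts.length != sampleDurations.length) {
            throw new IllegalArgumentException("tables must have the same length");
        }
        this.sampleCounts = sampleCounts.clone();
        this.sampleDurations = sampleDurations.clone();
    }

    /** Zero-based index of the sample playing at mediaTime, or -1 if out of range. */
    int sampleIndexAt(long mediaTime) {
        if (mediaTime < 0) {
            return -1;
        }
        long runStart = 0;  // media time at which the current run begins
        int firstIndex = 0; // index of the first sample in the current run
        for (int i = 0; i < sampleCounts.length; i++) {
            long runLength = (long) sampleCounts[i] * sampleDurations[i];
            if (mediaTime < runStart + runLength) {
                return firstIndex + (int) ((mediaTime - runStart) / sampleDurations[i]);
            }
            runStart += runLength;
            firstIndex += sampleCounts[i];
        }
        return -1; // past the end of the media
    }

    public static void main(String[] args) {
        // Three samples of 100 units, then two samples of 50 units.
        TimeToSampleSketch table =
            new TimeToSampleSketch(new int[] {3, 2}, new int[] {100, 50});
        long[] times = {0, 250, 300, 399, 400};
        for (long t : times) {
            System.out.println("time " + t + " -> sample " + table.sampleIndexAt(t));
        }
    }
}
```

Storing runs instead of one entry per sample keeps the table tiny for media with a fixed sample duration, which is the common case for audio and video.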

Getting back to the point, to make a movie from scratch, we need to do the following:

  1. Lay down samples.
  2. Add these to a media object.
  3. Add the media to an appropriate track.
  4. Add the track to a movie.

You may have noticed in the diagrams above that our hypothetical movie contains not just an audio and a video track, but also a "text track." This is exactly what it sounds like: a time-based collection of text, commonly used for providing captions to QuickTime movies. More technically, it is a track where the media samples are ordinary text strings. This is a good place to start with creating our own media, since it doesn't require knowing anything about images or sounds.

Source Code

Download the source code for the MakeTextTrack sample application.

An Example: MakeTextTrack

The MakeTextTrack sample application creates a movie with a single text track. It starts by creating an empty movie file to write to:

            StdQTConstants.createMovieFileDeleteCurFile |

Next, it creates an empty text track and a text media object, which it will eventually insert into the track:

Track textTrack = movie.addTrack (TEXT_TRACK_WIDTH,
TextMedia textMedia = new TextMedia(textTrack,

The last argument is a time scale for the media. Movies, tracks, and media all have their own time scale, which is the number of time units that pass in one second. For a movie, this value defaults to 600, which has the advantage of being a whole multiple of many common frame rates: 30 (NTSC video), 25 (PAL and SECAM video), and 24 (film). Dean Perry of Abstract Plane also reminds me it's a whole multiple of the 60 "ticks" per second that older Macintoshes used for timekeeping. However, you're free to use and abuse the time scales as you see fit. I arbitrarily chose a value of 100 for my media, so my sample durations are measured in hundredths of a second.
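The time scale arithmetic is easy to check for yourself. The following sketch uses hypothetical helper names and is not part of the sample application:

```java
// Time scale arithmetic: a time scale is the number of units per second.
final class TimeScaleSketch {
    /** Converts a duration in time-scale units to seconds. */
    static double toSeconds(long units, int timeScale) {
        return (double) units / timeScale;
    }

    /** Units occupied by one frame, or -1 if the rate does not divide the scale evenly. */
    static int unitsPerFrame(int timeScale, int framesPerSecond) {
        return (timeScale % framesPerSecond == 0) ? timeScale / framesPerSecond : -1;
    }

    public static void main(String[] args) {
        int[] rates = {24, 25, 30, 60};
        for (int rate : rates) {
            System.out.println(rate + " per second -> "
                + unitsPerFrame(600, rate) + " units each at time scale 600");
        }
        System.out.println("50 units at time scale 100 = " + toSeconds(50, 100) + " s");
    }
}
```

At a time scale of 600, one frame takes 25, 24, 20, or 10 units at 24, 25, 30, or 60 per second respectively. A time scale of 100, by contrast, can't represent a single 30 fps frame exactly, which is fine for captions but would matter for video.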

Next, we tell the new Media object that we intend to do some edits:

textMedia.beginEdits();

We then get the media handler object, required in this case because it has a method for creating new text samples:

TextMediaHandler handler = textMedia.getTextHandler();

and we create a rectangle that will be used in every sample to describe the shape that the text is to be rendered into when played back:

QDRect textBox = new QDRect(0, 0,

We're finally ready to start adding samples. The sample application uses a static array of Strings, getting a QuickTime-compatible QTPointer to each one and passing that as the first argument to the TextMediaHandler.addTextSample() method. Here's how that call looks:

handler.addTextSample (msgPoint,

Obviously, this method has a lot of parameters. In order, they are:

  1. QTPointerRef text: a pointer to the string.

  2. int fontNumber: an integer to indicate font. 0 can always be used as a generic default, or use the QDFont class to get the ID for a font name.

  3. int fontSize: the font size, in points.

  4. int textFace: a style, such as bold, italics, etc., as defined by constants in QDConstants.

  5. QDColor textColor: the foreground color, expressed as a QDColor (not a java.awt.Color).

  6. QDColor backColor: the background color.

  7. int textJustification: the right/left/center justification. Possible values are in QDConstants: teJustLeft, teFlushDefault, etc.

  8. QDRect textBox: a QDRect rectangle describing the box in which the text is to be displayed.

  9. int displayFlags: zero or more behavior flags, logically OR'd together, describing behavior such as clipping or scaling the text when displayed over other video, etc. These flags are in StdQTConstants and a list of supported flags is documented for the native TextMediaAddTextSample function.

  10. int scrollDelay: a time to delay between scrolls if the dfScrollIn and/or dfScrollOut flags are set. Not useful in this app, with its short samples, but potentially useful for other purposes.

  11. int hiliteStart: the index of first character of text to highlight (select), if any.

  12. int hiliteEnd: the index of the last character of text to highlight.

  13. QDColor rgbHiliteColor: the color of the highlight, if used.

  14. int duration: the duration of this sample, expressed in the media's time scale.
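If you haven't worked with bit flags like displayFlags before, the following sketch shows the general technique of OR-ing flags together and testing for one. The constant names and values are placeholders invented for this example, not the real StdQTConstants values:

```java
// Combining and testing bit flags. The names and values are placeholders
// for illustration only; the real flags are defined in StdQTConstants.
final class DisplayFlagsSketch {
    static final int FLAG_CLIP_TO_BOX = 1 << 0;
    static final int FLAG_SCROLL_IN   = 1 << 1;
    static final int FLAG_KEYED_TEXT  = 1 << 2;

    /** True if the given flag bit is present in the combined flags value. */
    static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }

    public static void main(String[] args) {
        int flags = FLAG_CLIP_TO_BOX | FLAG_KEYED_TEXT; // two behaviors in one int
        System.out.println("keyed text set: " + isSet(flags, FLAG_KEYED_TEXT));
        System.out.println("scroll in set:  " + isSet(flags, FLAG_SCROLL_IN));
    }
}
```

Passing 0 means "no special behavior," which is why a single int parameter can carry any combination of options.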

The duration is interesting for a couple of reasons. First, it's expressed in terms of the media's time scale. In our case, the time scale is 100 and the duration is 100, so the sample is exactly one second long. Of course, we could have half-second samples by using a duration of 50, or any sample length that can be expressed as a fraction of duration over time scale. Moreover, despite the prevalence of fixed frame rates in audio and video (30 fps video, 44.1 kHz sound, etc.), QuickTime requires no such thing -- each sample can be of an arbitrary duration, different from the sample before or after it.
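To make the point about arbitrary durations concrete, here is a small sketch that works out when each sample would start, given a list of per-sample durations. The durations and the helper name are made up for illustration:

```java
// Start times for samples of varying duration, expressed in a media time scale.
final class CaptionSchedule {
    /** Start time in seconds of each sample, given durations in time-scale units. */
    static double[] startTimesInSeconds(int[] durations, int timeScale) {
        double[] starts = new double[durations.length];
        long elapsed = 0; // running total of time-scale units before this sample
        for (int i = 0; i < durations.length; i++) {
            starts[i] = (double) elapsed / timeScale;
            elapsed += durations[i];
        }
        return starts;
    }

    public static void main(String[] args) {
        int timeScale = 100;             // hundredths of a second, as in the article
        int[] durations = {100, 50, 250}; // 1 s, 0.5 s, and 2.5 s samples
        double[] starts = startTimesInSeconds(durations, timeScale);
        for (int i = 0; i < starts.length; i++) {
            System.out.println("sample " + i + " starts at " + starts[i] + " s");
        }
    }
}
```

With these durations the samples start at 0.0, 1.0, and 1.5 seconds, and no sample needs to know anything about its neighbors beyond where the previous one ended.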

Wrapping up the application, once the loop is done adding samples, we inform the Media that we're done editing:

textMedia.endEdits();

and insert this media into the text track:

textTrack.insertMedia (0, // trackStart
             0, // mediaTime
             textMedia.getDuration(), // mediaDuration
             1); // mediaRate

after which we save the file to disk as texttrack.mov, in the current directory.

To compile and run the sample code, make sure you've worked through any versioning or classpath issues as covered in our re-introduction to QTJ a few months back. When you're done, open the resulting texttrack.mov in QuickTime Player (or in a browser, assuming you have the QT plug-in) to watch the captions play back.

One of the nice things to notice is that we picked up word-wrap automatically, without hand-coding line-breaks.

Another Example: AddTimeCodeTrack


I'd like to point out that this tape has not
been tampered with or edited in any way.  It even has a timecode
on it, and those are very hard to fake.


For the benefit of the court, would you please 
explain "timecode"?


Just because I don't know what it is ...
doesn't mean I'm lying.
from the movie Strange Brew

Source Code

Download the source code for the AddTimeCodeTrack sample application.

Actually, Claude, you are lying, and timecodes -- which are just a system for encoding the current time in a movie -- are very easy to fake. In fact, the next example will add a timecode to any QuickTime movie.

To do this, we'll add a text track with timecode-like Strings to the existing tracks in a movie:

  1. Open a movie. Note that this has to be a real QuickTime movie, not just some other format that QuickTime can open, such as AVI or MPEG-4.

  2. Add a text track for the timecode text, set to use the bottom center of the movie's display.

  3. Write text samples every 1/30th of a second, in a typical timecode format (hh:mm:ss:ff).

  4. Flatten the movie out to a new disk file.
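Step 3 needs a string for each 1/30th of a second. One way to build it is sketched below; this is a hypothetical helper rather than the code from the sample application, and it assumes simple non-drop-frame counting at 30 frames per second:

```java
import java.util.Locale;

// Formats a running frame count as an hh:mm:ss:ff timecode string.
final class TimecodeFormat {
    static final int FRAMES_PER_SECOND = 30;

    /** Formats a zero-based frame count as hh:mm:ss:ff (non-drop-frame). */
    static String format(long frame) {
        long ff = frame % FRAMES_PER_SECOND;
        long totalSeconds = frame / FRAMES_PER_SECOND;
        long ss = totalSeconds % 60;
        long mm = (totalSeconds / 60) % 60;
        long hh = totalSeconds / 3600;
        return String.format(Locale.ROOT, "%02d:%02d:%02d:%02d", hh, mm, ss, ff);
    }

    public static void main(String[] args) {
        System.out.println(format(0));
        System.out.println(format(29));
        System.out.println(format(30));
        System.out.println(format(109865));
    }
}
```

Frame 30 rolls over to 00:00:01:00, and frame 109865 comes out as 01:01:02:05.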

We start off by getting a movie from a file rather than creating a new one. The movie already has some sizing information from its existing video track, which will help us later.

This time, I've used a time scale of 30 for the text media, which will correspond with the idea of having timestamps every 1/30th of a second. That means every sample will have a duration of 1. Of course, we could have accomplished the same thing by using a time scale of 60 and samples of duration 2, or a time scale of 600 and 20-unit samples, and so on.

What's interesting is what we don't have to do, namely care what the time scale or the frame rate of our video and audio is. Just as you can have audio and video at more or less arbitrary frame rates, freely changeable independent of one another, we can have 30 text samples a second regardless of the video's frame rate. Granted, this can't be truly accurate if the video's frame rate isn't 30 fps or something reasonably divisible, but that's not the point. The key is that for any given time in the movie, there is one appropriate sample in each track that QuickTime will retrieve for us, whether that's one of thousands of audio samples that fly by every second, one frame of video, or one of our text samples.
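Here is a sketch of that idea: for one movie time, each track resolves to its own sample index at its own rate. It assumes every track has a fixed sample rate, which QuickTime does not require, and the names are hypothetical:

```java
// For a given movie time, each track has exactly one current sample,
// regardless of how its rate relates to the other tracks.
final class SampleAtTime {
    /**
     * Index of the sample playing at movieTime (in movieTimeScale units)
     * for a track with a fixed number of samples per second.
     */
    static long sampleIndex(long movieTime, int movieTimeScale, int samplesPerSecond) {
        return movieTime * samplesPerSecond / movieTimeScale;
    }

    public static void main(String[] args) {
        long movieTime = 900; // 1.5 seconds at a movie time scale of 600
        int movieTimeScale = 600;
        System.out.println("video (24/s):    " + sampleIndex(movieTime, movieTimeScale, 24));
        System.out.println("text  (30/s):    " + sampleIndex(movieTime, movieTimeScale, 30));
        System.out.println("audio (44100/s): " + sampleIndex(movieTime, movieTimeScale, 44100));
    }
}
```

At 1.5 seconds, that is video sample 36, text sample 45, and audio sample 66150, none of which had to be coordinated with the others.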

On the other hand, we do have to worry about where our text will be placed over the video. When the (0,0)-based coordinates of the text frames are mapped into the movie's display space, we get a timecode at (0,0), which is not what we want, as shown in Figure 3.

Figure 3. The added timecode

Enter the Matrix

To place the caption box in a specific place relative to the other tracks in the movie, we can use a transformation matrix. In QuickTime, this is a 3x3 mathematical construct that maps points from one space into another. In our case, we need to map from a rectangle whose upper left corner is at (0,0) to a rectangle that is centered along the bottom of the movie's space. We do this by calling setMatrix() on our text track, with a Matrix object that describes the spatial transformation we want QuickTime to perform.

The formula for matrix transformations is shown in Figure 4. Don't run away. It's not that scary, at least not in practice.

Figure 4. The formula for matrix transformations

The formula means that, given a point (x,y), we get the new coordinates (x',y') by applying matrix multiplication. The transformation can be expressed more simply as a pair of formulas:

x' = ax + cy + tx
y' = bx + dy + ty

This buys us the ability to specify operations that move, rotate, and scale the source, all with one object. A full discussion of the possibilities is available on Apple's developer site.
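The pair of formulas above is simple enough to try out in plain Java. The class below is a hypothetical stand-in that keeps only the six values used by the formulas; it is not the QTJ Matrix class:

```java
// Applies x' = ax + cy + tx and y' = bx + dy + ty to a point.
final class MatrixSketch {
    final double a, b, c, d, tx, ty;

    MatrixSketch(double a, double b, double c, double d, double tx, double ty) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
        this.tx = tx;
        this.ty = ty;
    }

    /** A pure move: a and d are 1, b and c are 0. */
    static MatrixSketch translation(double tx, double ty) {
        return new MatrixSketch(1, 0, 0, 1, tx, ty);
    }

    /** A pure scale about the origin. */
    static MatrixSketch scale(double sx, double sy) {
        return new MatrixSketch(sx, 0, 0, sy, 0, 0);
    }

    /** Returns the transformed point as {x', y'}. */
    double[] apply(double x, double y) {
        return new double[] { a * x + c * y + tx, b * x + d * y + ty };
    }

    public static void main(String[] args) {
        double[] moved = translation(100, 200).apply(0, 0);
        double[] scaled = scale(2, 3).apply(10, 10);
        System.out.println("moved:  (" + moved[0] + ", " + moved[1] + ")");
        System.out.println("scaled: (" + scaled[0] + ", " + scaled[1] + ")");
    }
}
```

A translation leaves a, b, c, and d at their identity values and only sets tx and ty, while a scale only touches a and d.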

For our purposes, we only need to specify a move to a pair of coordinates we calculate as boxLeft and boxTop, which are then used to create a QDRect object called toBox. We can then create a Matrix that represents the moving of pixels from the original textBox, with an upper left corner of (0,0), to toBox, with upper left corner of (boxLeft, boxTop). Setting this as the text track's matrix causes QuickTime to use the matrix when drawing the text frames at playback time:

Matrix transformMatrix = new Matrix();
QDRect toBox = new QDRect (boxLeft, boxTop, 
transformMatrix.map (textBox, toBox);
textTrack.setMatrix (transformMatrix);

If you read the docs, you'll notice that the tx and ty values are the only ones used for moving pixels; i.e., for translating between coordinate spaces. So we could replace the map() call with:

transformMatrix.setTx (boxLeft);
transformMatrix.setTy (boxTop);

Either way, this puts the text box in its proper location relative to the rest of the movie, as seen in Figure 5.

Figure 5. A better timestamp location
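The calculation of boxLeft and boxTop isn't shown above. One plausible way to center a text box along the bottom edge is sketched here; the formulas and the example dimensions are assumptions, not taken from the sample application:

```java
// Computes the upper left corner of a text box centered along the
// bottom edge of a movie. Formulas and dimensions are assumptions.
final class BottomCenter {
    /** Left edge that horizontally centers a box of textWidth in movieWidth. */
    static int boxLeft(int movieWidth, int textWidth) {
        return (movieWidth - textWidth) / 2;
    }

    /** Top edge that makes a box of textHeight sit on the bottom of the movie. */
    static int boxTop(int movieHeight, int textHeight) {
        return movieHeight - textHeight;
    }

    public static void main(String[] args) {
        int movieWidth = 320, movieHeight = 240; // hypothetical movie size
        int textWidth = 120, textHeight = 24;    // hypothetical text box size
        int left = boxLeft(movieWidth, textWidth);
        int top = boxTop(movieHeight, textHeight);
        // With a pure translation (tx = left, ty = top), the text box's
        // corner (0,0) lands at (left, top).
        System.out.println("text box moves to (" + left + ", " + top + ")");
    }
}
```

For a 320x240 movie and a 120x24 text box, the box moves to (100, 216), so its far corner lands at (220, 240), flush with the bottom of the movie.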

The QTJ Matrix class provides several methods that allow you to define matrices that can perform scaling and rotation operations, all without you having to do your own trigonometry. For example, adding this rather silly call rotates our timecode counter-clockwise by 45 degrees, centered on the top left corner:

transformMatrix.rotate (315, boxLeft, boxTop);

The result looks amusing as a screenshot, but is more impressive (or just plain goofy) when played as an accurate, running timecode for the movie, as shown in Figure 6.

Figure 6. A rotated timestamp
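For the curious, this is roughly the trigonometry that a rotation about an anchor point saves you from writing. The sketch is plain Java rather than QTJ, and it assumes the angle convention implied above, where 315 degrees looks like 45 degrees counter-clockwise on a screen whose y axis points down:

```java
// Rotates a point around an anchor in screen coordinates (y grows downward).
final class RotateAboutPoint {
    /** Rotates (x, y) by the given angle in degrees around the anchor (ax, ay). */
    static double[] rotate(double degrees, double ax, double ay, double x, double y) {
        double r = Math.toRadians(degrees);
        double dx = x - ax; // offset of the point from the anchor
        double dy = y - ay;
        return new double[] {
            ax + dx * Math.cos(r) - dy * Math.sin(r),
            ay + dx * Math.sin(r) + dy * Math.cos(r)
        };
    }

    public static void main(String[] args) {
        // A point 10 pixels to the right of a hypothetical anchor at (100, 216).
        double[] p = rotate(315, 100, 216, 110, 216);
        System.out.println("(" + p[0] + ", " + p[1] + ")");
    }
}
```

The point ends up to the right of the anchor and above it (its y value shrinks), which on screen is a counter-clockwise swing, and the anchor itself doesn't move.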

One More Neat Trick

Overall, the code for this example is fairly similar to that of the first one. Again we create a text Track and accompanying TextMedia, which we populate with samples. The addTextSample() call differs slightly in order to superimpose the text onto the video: it passes the dfKeyedText display flag.

This use of dfKeyedText produces a chromakey effect, replacing the background color (QDColor.black, in our case) with the pixels from the video underneath. So the black box surrounding the text becomes invisible, and we just see the text on top of the video.
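The general idea of a chromakey can be sketched in a few lines of plain Java. This illustrates the technique on arrays of RGB values; it is not how QuickTime implements dfKeyedText, and all names are hypothetical:

```java
// A toy chromakey: wherever the overlay pixel matches the key color,
// the video pixel underneath shows through.
final class ChromaKeySketch {
    static final int BLACK = 0x000000;
    static final int WHITE = 0xFFFFFF;

    /** Composites overlay onto video, treating keyColor as transparent. */
    static int[] composite(int[] video, int[] overlay, int keyColor) {
        if (video.length != overlay.length) {
            throw new IllegalArgumentException("frames must be the same size");
        }
        int[] out = new int[video.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = (overlay[i] == keyColor) ? video[i] : overlay[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] video = {0x112233, 0x445566, 0x778899};
        int[] text = {BLACK, WHITE, BLACK}; // white "text" on a black background
        for (int pixel : composite(video, text, BLACK)) {
            System.out.println(String.format("%06X", pixel));
        }
    }
}
```

Only the middle pixel, where the white text is, replaces the video; the black background pixels let the video through.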

As before, the resulting movie is flatten()ed out to a file, this time called timecoded.mov, which you can open in QuickTime Player.

Onwards ...

Having done this simple little timecode with a text track, it should be noted that QuickTime offers a real "timecode track" as one of the many track types it supports. It is much more involved than is necessary for this tutorial, but if you have professional needs, check out the TimeCoder and TimeCodeMedia classes in QTJ.

Now that we've done some simple text tracks, the next step is to get into the good stuff: writing out video tracks from scratch. In our next article, we'll do just that, borrowing an image-to-movie effect from our favorite Civil War documentarian.

Chris Adamson is an author, editor, and developer specializing in iPhone and Mac.


Copyright © 2009 O'Reilly Media, Inc.