Palagpat Coding: Why data transformations are hard (part 1)

Per my plan for January, last Friday was my deadline for having several EFR segmentation files ready, for later use in testing the software I'm going to build for my Linguistics MA capstone project.

My task for that week was to take several XML movie transcripts (in segmented, time-aligned XCES format) and spit out text files in a rather simple format I'll call EFR-segmentation (or segfile for short), which can be parsed by other pre-existing tools in the EFR toolchain. Here's a sample of the input and target formats, so we can clearly define our start and end points:

XCES format (the source):

  
    
    So
    ,
    you
    do
    like
    it
    ,
    don'
    t
    you
    ?

EFR-segfile format (the target):

2460-2520 (00:01:20:20 - 00:01:22:11)
Scene 20: T14S
//T: So, you do like it, don't you?

Since the starting point was an XML format, my first thought was to use XSLT transformations to accomplish the reformatting. I created a stylesheet, segfile.xsl, to transform the XCES data files into something closely resembling the EFR-segfile format... but it was lacking in a couple of minor-but-not-insignificant ways. Observe:

Almost-but-not-quite-segfile format:

00:01:20,664
Scene 20: T14S
//T: So , you do like it , don' t you ?

The major problem with this output is the timestamp, which in XCES is formatted like hh:mm:ss,ms. EFR-segfile's timestamp format, on the other hand, is start_frame-end_frame (hh:mm:ss:frame - hh:mm:ss:frame). The milliseconds-to-frames part is easy:

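var old_format = "00:02:36,532";
var parts = old_format.split(',');
// X ms * (1 sec / 1000 ms) * (30 frames / 1 sec) = Y frames
// Clamp to 29 so values close to 1000 ms can't round up to a nonexistent frame 30
var frames = Math.min(29, Math.round(parseInt(parts[1], 10) * 0.03));
// Zero-pad so single-digit frame numbers still fit the two-digit ff slot
var new_format = parts[0] + ':' + (frames < 10 ? '0' : '') + frames;
// new_format == "00:02:36:16"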

... but the rest of that format change is not so simple. We're lucky in the above segment, because there are two timestamps: one at the beginning and one at the end. This seems to be the exception, however, and not the rule, so we'll still need to keep some kind of "current timecode" variable on hand while our parser works its way through the file. Once this is addressed, a second task is to convert the segment start and end times from the hh:mm:ss:ff format into raw frame counts, since EFR segfiles require both in order to have a valid timestamp line. This is a simple enough calculation, but again, doing it in XSLT is like trying to cut a 2x4 with a hammer, when we have other tools in our toolbox.
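For the record, here is roughly what that second task looks like in JavaScript. This is only a sketch: the helper names timecodeToFrames and timestampLine are made up for illustration, and it assumes a flat 30 frames per second with no drop-frame correction, the same rate used in the milliseconds conversion above. It doesn't attempt the "current timecode" bookkeeping at all.

```javascript
// Assumption: 30 fps, non-drop-frame (same rate as the ms-to-frames step).
var FPS = 30;

// Hypothetical helper: turn "hh:mm:ss:ff" into a raw frame count.
function timecodeToFrames(timecode) {
  var p = timecode.split(':').map(function (s) {
    return parseInt(s, 10);
  });
  // hours -> minutes -> seconds -> frames, then add the leftover frames
  return ((p[0] * 60 + p[1]) * 60 + p[2]) * FPS + p[3];
}

// Hypothetical helper: build a timestamp line in the
// "start_frame-end_frame (hh:mm:ss:ff - hh:mm:ss:ff)" layout.
function timestampLine(start, end) {
  return timecodeToFrames(start) + '-' + timecodeToFrames(end) +
    ' (' + start + ' - ' + end + ')';
}

console.log(timestampLine('00:02:36:16', '00:02:38:00'));
// prints "4696-4740 (00:02:36:16 - 00:02:38:00)"
```

Nothing here is hard, which is rather the point: it's a few lines in a general-purpose language, and a real chore in a stylesheet.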

Besides the timestamp-formatting issues, there was another glitch: the punctuation characters in the transcript line spit out by the XSLT are padded by unnecessary whitespace. For sentence terminators and splicers (periods, commas, etc.), this doesn't present a huge problem. But intra-word apostrophes, indicating contractions, split up a word in such a way that any downstream term lookup routine will fail to match it properly. This kind of minor text cleanup would be a snap with my Bat'leth tool; it's designed with text parsing in mind, and has a lot of flexibility. But unfortunately, the timestamp conversion is more complex than Bat'leth's search-and-replace regular expressions can handle, so even if I used it for transcript cleanup, I'd still need another tool to do the timestamps.
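To give an idea of how small the cleanup step is, here is a sketch of it as two regular-expression replacements in JavaScript. The function name tidyTranscript is hypothetical, and the patterns are only what the sample line above needs; they are not a complete tokenization undo.

```javascript
// Hypothetical helper: undo the padding around punctuation in a transcript line.
function tidyTranscript(line) {
  return line
    // Rejoin contractions split at the apostrophe: "don' t" -> "don't"
    .replace(/(\w)' (\w)/g, "$1'$2")
    // Drop the space before closing punctuation: "it ," -> "it,"
    .replace(/ +([,.?!;:])/g, '$1');
}

console.log(tidyTranscript("//T: So , you do like it , don' t you ?"));
// prints "//T: So, you do like it, don't you?"
```

One caveat: the first pattern can't tell a contraction from a closing single quote or a plural possessive followed by another word, so a line containing those would need more careful handling.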

Given these shortcomings, I ruled out both XSLT and Bat'leth as monolithic solutions, although each could contribute a step to the overall parsing workflow. I really wanted to have a single piece of software that could do both parts, however, so I decided to try doing it in JavaScript. Then I fell down the hole that is cross-browser XML, which is a whole different blog post I'll have to write soon (executive summary in one word: complicated). Given the difficulty of that approach and the pressing deadline, I ended up writing something quick and dirty in Visual Basic 6. Not my first choice, but it got the job done.

The story doesn't end there, though. One of the movies I wanted to process was 1985's The Goonies, something of a cultural touchstone for my generation. I found an XCES alignment file for the movie in the OPUS OpenSubtitles corpus that I'm using, fed it through my parser, and got this EFR-segfile. Looked pretty good, I thought... so I fired up my EFR authoring software to verify the transcript alignment. And... it didn't match up! Tune in later this week for part 2 of this post, when we try to figure out why.

Labels: EFR, JavaScript, school, VB, XML

Monday, January 25, 2010
