Why data transformations are hard (part 1)
Per my plan for January, last Friday was my deadline for having several EFR segmentation files ready, for later use in testing the software I'm going to build for my Linguistics MA capstone project.
My task for that week was to take several XML movie transcripts (in segmented, time-aligned XCES format) and spit out text files in a rather simple format I'll call EFR-segmentation (or segfile for short), which can be parsed by other pre-existing tools in the EFR toolchain. Here's a sample of the input and target formats, so we can clearly define our start and end points:
XCES format (the source):
So , you do like it , don' t you ?
EFR-segfile format (the target):
2460-2520 (00:01:20:20 - 00:01:22:11) Scene 20: T14S //T: So, you do like it, don't you?
Since the starting point was an XML format, my first thought was to use XSLT transformations to accomplish the reformatting. I created a stylesheet, segfile.xsl, to transform the XCES data files into something closely resembling the EFR-segfile format... but it was lacking in a couple of minor-but-not-insignificant ways. Observe:
Almost-but-not-quite-segfile format:
00:01:20,664 Scene 20: T14S //T: So , you do like it , don' t you ?
The major problem with this output is the timestamp, which in XCES is formatted like hh:mm:ss,ms. EFR-segfile's timestamp format, on the other hand, is start_frame-end_frame (hh:mm:ss:frame - hh:mm:ss:frame). The milliseconds-to-frames part is easy:
var old_format = "00:02:36,532";
// parts[0] is "hh:mm:ss", parts[1] is the milliseconds
var parts = old_format.split(',');
// X ms * (1 sec/1000 ms) * (30 frames/1 sec) = Y frames
var new_format = parts[0] + ':' + Math.round(parseInt(parts[1], 10) * 0.03);
// new_format == "00:02:36:16"
... but the rest of that format change is not so simple. We're lucky in the above segment, because there are two timestamps: one at the beginning and one at the end. This seems to be the exception, however, and not the rule, so we'll still need to keep some kind of "current timecode" variable on hand while our parser works its way through the file. Once this is addressed, a second task is to convert the segment start and end times from the hh:mm:ss:ff format into raw frame counts, since EFR segfiles require both forms in order to have a valid timestamp line. This is a simple enough calculation, but again, doing it in XSLT is like trying to cut a 2x4 with a hammer, when we have other tools in our toolbox.
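Here's a minimal sketch of the two timestamp conversions in plain JavaScript, outside of XSLT. The helper names (pad2, toFrameTimecode, toFrameCount, timestampLine) are made up for illustration, the 30 fps rate is carried over from the snippet above, and the two-digit zero-padding and the clamp on a rounded-up frame 30 are assumptions about what a valid segfile timecode looks like, not something taken from the EFR spec.

```javascript
// Assumption: 30 frames per second, as in the earlier snippet.
var FPS = 30;

// Zero-pad a non-negative integer to at least two digits (hypothetical helper).
function pad2(n) {
  return (n < 10 ? '0' : '') + n;
}

// Convert an XCES-style "hh:mm:ss,ms" timestamp to "hh:mm:ss:ff".
function toFrameTimecode(xcesTime) {
  var parts = xcesTime.split(',');
  var frame = Math.round(parseInt(parts[1], 10) * FPS / 1000);
  // 984 ms and up would round to frame 30, which doesn't exist at 30 fps;
  // clamping to the last frame is an assumption, not documented behavior.
  frame = Math.min(frame, FPS - 1);
  return parts[0] + ':' + pad2(frame);
}

// Convert "hh:mm:ss:ff" to a raw frame count.
function toFrameCount(timecode) {
  var f = timecode.split(':').map(function (s) {
    return parseInt(s, 10);
  });
  return ((f[0] * 60 + f[1]) * 60 + f[2]) * FPS + f[3];
}

// Build the "start_frame-end_frame (hh:mm:ss:ff - hh:mm:ss:ff)" portion
// of a segfile line from two XCES timestamps.
function timestampLine(startXces, endXces) {
  var start = toFrameTimecode(startXces);
  var end = toFrameTimecode(endXces);
  return toFrameCount(start) + '-' + toFrameCount(end) +
    ' (' + start + ' - ' + end + ')';
}
```

With these, timestampLine("00:02:36,532", "00:02:38,100") returns "4696-4743 (00:02:36:16 - 00:02:38:03)". The sketch assumes both timestamps are already known; tracking a "current timecode" for segments that only carry one is a separate problem.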
Besides the timestamp-formatting issues, there was another glitch: the punctuation characters in the transcript line spit out by the XSLT are padded by unnecessary whitespace. For sentence terminators and splicers (periods, commas, etc.), this doesn't present a huge problem. But intra-word apostrophes, indicating contractions, split up a word in such a way that any downstream term lookup routine will fail to match it properly. This kind of minor text cleanup would be a snap with my Bat'leth tool; it's designed with text parsing in mind, and has a lot of flexibility. But unfortunately, the timestamp conversion is more complex than Bat'leth's search-and-replace regular expressions can handle, so even if I used it for transcript cleanup, I'd still need another tool to do the timestamps.
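To make the cleanup concrete, here's a rough sketch using plain JavaScript regular expressions (this is not Bat'leth, just an illustration of the same search-and-replace idea). The function name cleanTranscript is hypothetical, and the list of contraction endings is an assumption based on common English contractions; a tokenizer that splits words differently would need different patterns.

```javascript
// Remove the whitespace padding a tokenizer leaves around punctuation.
function cleanTranscript(text) {
  return text
    // Drop whitespace before sentence terminators and splicers: "it ," -> "it,"
    .replace(/\s+([,.?!;:])/g, '$1')
    // Rejoin contractions split after the apostrophe: "don' t" -> "don't".
    // Limited to common endings (assumption) so that a plural possessive
    // such as "boys' toys" is left alone.
    .replace(/(\w)'\s+(t|s|re|ve|ll|d|m)\b/gi, "$1'$2");
}
```

Run on the sample line, cleanTranscript("So , you do like it , don' t you ?") returns "So, you do like it, don't you?".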
The story doesn't end there, though. One of the movies I wanted to treat was 1985's The Goonies, something of a cultural touchstone for my generation. I found an XCES alignment file for the movie in the OPUS OpenSubtitles corpus that I'm using, fed it through my parser, and got this EFR-segfile. Looked pretty good, I thought... so I fired up my EFR authoring software to verify the transcript alignment. And... it didn't match up! Tune in later this week for part 2 of this post, when we try to figure out why.