Why data transformations are hard (part 2)
Previously, on Palagpat Coding...
In Part 1 of this post, I talked about how I was going to attack my task of converting an XML-based format for file transcription (XCES) into the flat-but-structured EFR-segfile format. At the end of the day, I'd gone with a quick-and-dirty solution in Visual Basic 6, as both XSLT and my Bat'leth tool seemed to fall short. Here's a sample of the input and output of that effort:
// EFR-segfile output:
0-1164 (00:00:00:00 - 00:00:38:24) Scene 1: T1S
//T: Lunchtime!
As I mentioned last time, I took this EFR-segfile output and loaded it up in my EFR authoring software to test the segment boundaries, and found that it was off... not hugely off at the outset, just a few seconds... but things very quickly went off the rails. For example, by the end of Chapter 2 (just 13 minutes into the film), my segfile segments were already off by over 7 minutes!
37020-37231 (00:20:34:00 - 00:20:41:01) Scene 417: T303S
//T: Come on, before you catch a real cold.
//N: actual film time: 00:13:27-00:13:30
Clearly, this won't do.
Looking things over, I realized a couple of mistakes right away. First, I had assumed the XCES files covered the full film, so I started at frame 0. However, the first dialog in the film doesn't occur until about 40 seconds in, so my output started off wrong by that much. Second, I started each subsequent clip from the endpoint of the previous one, ignoring any stretches of the movie without spoken dialog, so the timestamps drifted further and further from the real film time as the film progressed.
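To illustrate the second mistake, here is a rough Python sketch (not the original VB6 code) that compares the chained approach with one that uses each clip's own timestamps. The 30 fps rate is an assumption inferred from the sample output above, where frame 1164 lines up with 00:00:38:24. The function names and the two sample clips are hypothetical.

```python
FPS = 30  # assumption: inferred from the sample, where frame 1164 == 00:00:38:24


def timecode_to_frame(tc, fps=FPS):
    """Convert an 'HH:MM:SS:FF' timecode to an absolute frame number."""
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    return (hh * 3600 + mm * 60 + ss) * fps + ff


def frame_to_timecode(frame, fps=FPS):
    """Convert an absolute frame number back to 'HH:MM:SS:FF'."""
    seconds, ff = divmod(frame, fps)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


def chained_segments(clips, fps=FPS):
    """The buggy approach: start at frame 0 and begin each clip where the last one ended."""
    segments, cursor = [], 0
    for start, end in clips:
        length = timecode_to_frame(end, fps) - timecode_to_frame(start, fps)
        segments.append((cursor, cursor + length))
        cursor += length
    return segments


def absolute_segments(clips, fps=FPS):
    """The fix: trust each clip's own start and end time, so silent gaps are preserved."""
    return [(timecode_to_frame(s, fps), timecode_to_frame(e, fps)) for s, e in clips]


# Hypothetical clips: (start, end) timecodes with a silent gap between them.
clips = [("00:00:38:24", "00:00:40:00"), ("00:01:10:00", "00:01:12:15")]
print(chained_segments(clips))   # [(0, 36), (36, 111)]
print(absolute_segments(clips))  # [(1164, 1200), (2100, 2175)]
```

With chaining, the second clip lands at frame 36 instead of frame 2100, and every silent gap after that pushes the error further out.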
At this point I took another look at my original XCES file, and it hit me that there are actually two overlapping data structures at play:
Get down , guys . Get down .
The <s> tags, which I thought represented segments, instead appear to represent individual sentences (with the <w> tags representing individual words... seems rather obvious in retrospect). When I looked at the format last time, I didn't give any thought to why the <time> elements sometimes came at the beginning of a sentence and sometimes at the end, but in my review I noticed a consistent trend: the instances that occur as the first child of an <s> element have an id ending in S, and those that are the last child of an <s> element have an id ending in E... S for "Start", and E for "End." So the XML structure of <w>ords has a sort of superstructure made up of T(ranscript?) blocks of one or more sentences each. I'm guessing XCES doesn't support this kind of 3-tier structure, hence the ad-hoc markup, but as a long-time user of XML, I really would have preferred something like this:
Get down , guys . Get down .
But beggars can't be choosers, and my parser isn't a pure-XSLT solution anyway, so in the long run it doesn't matter all that much, even if it does make the data purist in me cringe a little.
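For the curious, here is a minimal Python sketch of how the hidden blocks could be recovered from the ad-hoc markup by pairing each S-suffixed <time> element with the next E-suffixed one. The sample XML is an assumption: the element names and the S/E id suffixes come from the description above, but the root element, the `value` attribute, the specific ids, and the time values are made up for illustration, as is the `transcript_blocks` helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag names and S/E id suffixes follow the post;
# the <document> root, the "value" attribute, and the times are assumptions.
SAMPLE = """
<document>
  <s id="1">
    <time id="T1S" value="00:00:38,800" />
    <w id="1.1">Get</w>
    <w id="1.2">down</w>
    <w id="1.3">,</w>
    <w id="1.4">guys</w>
    <w id="1.5">.</w>
  </s>
  <s id="2">
    <w id="2.1">Get</w>
    <w id="2.2">down</w>
    <w id="2.3">.</w>
    <time id="T1E" value="00:00:40,500" />
  </s>
</document>
"""


def transcript_blocks(xml_text):
    """Group words into blocks delimited by <time> ids ending in S (start) and E (end)."""
    root = ET.fromstring(xml_text.strip())
    blocks, current = [], None
    for sentence in root.iter("s"):
        for child in sentence:
            if child.tag == "time":
                time_id = child.get("id", "")
                if time_id.endswith("S"):
                    # A start marker opens a new block, which may span several sentences.
                    current = {"id": time_id[:-1], "start": child.get("value"),
                               "end": None, "words": []}
                elif time_id.endswith("E") and current is not None:
                    # An end marker closes the block that is currently open.
                    current["end"] = child.get("value")
                    blocks.append(current)
                    current = None
            elif child.tag == "w" and current is not None:
                current["words"].append(child.text or "")
    return blocks


for block in transcript_blocks(SAMPLE):
    print(block["id"], block["start"], block["end"], " ".join(block["words"]))
```

On this sample, the two sentences collapse into a single block, T1, whose start and end times come from different <s> elements, which is exactly the overlap that tripped me up.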
As the title of this post asserts, it's been my experience that data transformations are only rarely as straightforward as they look. Several years ago, I worked for a B2B e-commerce company that handled a lot of EDI data, both reading it and writing it. In theory, this should be a standard format: if you know the specification, you know what the data will look like. In practice, however, we found that every new company we brought onto the system had its own "shadow" data structures hidden in the standard fields, just like the invisible <t>ranscript blocks in this XCES data.
The takeaway is this: when you're writing a parser for a new-to-you data format, no matter how "standard" it professes to be, never take your data's structure at face value. There may be hidden complexities, just waiting to be tripped over.