Palagpat Coding

Fun with JavaScript, game theory, and the occasional outbreak of seriousness

Wednesday, August 04, 2010

Slaying the Beast

So tell me if this sounds familiar:

goal: build up my online portfolio.
how I'll do it: maintain a regular blogging schedule (this, too, I'll write more about soon)

Yeah, not so much. Instead, the past 6 months of my life have been swallowed up by a ravenous beast known as "graduate school," that has grown more and more aggressive in the past 3 or 4 weeks. I dealt it a mortal blow when I passed my defense on Monday the 19th, but it limped along for a while, whispering to me of Target Audiences, Proper Whitespacing, and Nested Bookmarks. Today, I think it finally gasped its last breath, but I have to wait until tomorrow for the official word from the graduate reviewer, er, coroner.

So, this is me saying "I'm back." I have a half-dozen half-finished posts in draft, way more free time than I'm used to, and a lot to catch up on.

See you soon.


Tuesday, May 25, 2010

Tidbits

As of this morning, JSConf 2010 videos are starting to make their way online. I'm looking forward to virtually experiencing the Piratey goodness for myself.

Oh, and I'm starting to fix/update the Canvassa mapping tool. It still needs some optimizations and bugfixes, but at least it's not completely broken anymore... yay?

Finally, I've recently found myself on an insanely-accelerated timetable to finish my Computational Linguistics MA thesis in the next 3 weeks or so, so you can probably expect my next post or two to reflect that preoccupation. Provided I find anything in the course of writing it that deserves the (slightly) larger platform.


Saturday, May 22, 2010

Solved: Cannot find -lgcc_s on Cygwin 4.3.4

For the past week or so, I've been trying to build the Giza++ language modeling tools on my Windows machines, and have been having quite a bit of trouble: permissions errors on Windows 7, weird "missing separator" error messages that were completely unhelpful (it turns out this means your text editor has replaced all tabs with spaces), and so on. These compile and build errors have kept me up late into the night on several occasions lately, but I was ultimately able to untangle them with a little applied Google-Fu.

Read more »


Monday, February 01, 2010

Why data transformations are hard (part 2)

Previously, on Palagpat Coding...

In Part 1 of this post, I talked about how I was going to attack my task of converting an XML-based format for film transcription (XCES) into the flat-but-structured EFR-segfile format. At the end of the day, I'd gone with a quick-and-dirty solution in Visual Basic 6, as both XSLT and my Bat'leth tool seemed to fall short. Here's a sample of the input and output of that effort:


// XCES input:
<s id="...">
  <time id="T1S" value="..." />
  <w>Lunchtime</w>
  <w>!</w>
  <time id="T1E" value="..." />
</s>
// EFR-segfile output:
0-1164 (00:00:00:00 - 00:00:38:24)
Scene 1: T1S
//T: Lunchtime!

As I mentioned last time, I took this EFR-segfile output and loaded it up in my EFR authoring software to test the segment boundaries, and found that it was off... not hugely off at the outset, just a few seconds... but things very quickly went off the rails. For example, by the end of Chapter 2 (just 13 minutes into the film), my segfile segments were already off by over 7 minutes!

37020-37231 (00:20:34:00 - 00:20:41:01)
Scene 417: T303S
//T: Come on, before you catch a real cold.
//N: actual film time: 00:13:27-00:13:30

Clearly, this won't do.

What happened?

Looking things over, I realized a couple of mistakes right away: First, I had assumed the XCES files covered the full film, so I started at frame 0. However, the first dialog in the film doesn't occur until about 40 seconds in, so my output started off wrong by that much. Secondly, I started each subsequent clip from the endpoint of the previous one, thus ignoring any segments of the movie that didn't involve spoken dialog, so as the film progressed, the timestamps got more and more wrong.
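Both mistakes come down to timecode bookkeeping. A minimal sketch of the corrected logic (the flat segment objects and second-based times here are a hypothetical simplification, not my actual VB6 parser):

```javascript
// Walk the segments, keeping a running "current timecode". When a
// segment carries an explicit start time (from an XCES <time> value),
// jump the clock to it instead of chaining from the previous segment's
// end -- that's what keeps dialog-free stretches of film from
// accumulating error. All times are in whole seconds, for simplicity.
function alignSegments(segments) {
  var clock = 0; // current timecode, in seconds
  return segments.map(function (seg) {
    if (seg.explicitStart != null) {
      clock = seg.explicitStart; // resync to the transcript's timestamp
    }
    var start = clock;
    clock += seg.duration;
    return { text: seg.text, start: start, end: clock };
  });
}
```

So a first line of dialog at 38 seconds starts at 38, not 0, and a segment that follows a long dialog-free gap snaps to its own timestamp instead of inheriting the previous segment's endpoint.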

At this point I took another look at my original XCES file, and it hit me that there are actually two, overlapping data structures at play:

  <s id="...">
    <time id="T2S" value="..." />
    <w>Get</w>
    <w>down</w>
    <w>,</w>
    <w>guys</w>
    <w>.</w>
  </s>
  <s id="...">
    <w>Get</w>
    <w>down</w>
    <w>.</w>
    <time id="T2E" value="..." />
  </s>

The <s> tags, which I thought represented segments, instead appear to represent individual sentences (with the <w> tags representing individual words... seems rather obvious in retrospect). When I looked at the format last time, I didn't give any thought to why the <time> elements sometimes came at the beginning of a sentence and sometimes at the end, but in my review I noticed a consistent trend: the ones that occur as the first child of an <s> element have an id ending in S, and the ones that occur as the last child have an id ending in E... S for "Start", and E for "End." So the XML structure of <s>entences and <w>ords has a superstructure made up of T(ranscript?) blocks of one or more sentences each. I'm guessing XCES doesn't support this kind of 3-tier structure, hence the ad-hoc markup, but as a long-time user of XML, I really would have preferred something like this:


<t id="T2">
  <time id="T2S" value="..." />
  <s>
    <w>Get</w>
    <w>down</w>
    <w>,</w>
    <w>guys</w>
    <w>.</w>
  </s>
  <s>
    <w>Get</w>
    <w>down</w>
    <w>.</w>
  </s>
  <time id="T2E" value="..." />
</t>

But, beggars can't be choosers, and since my parser isn't a pure-XSLT solution anyway, in the long run it doesn't matter all that much, even if it does make the data purist in me cringe a little.
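For what it's worth, once you know the S/E convention, grouping sentences into their T blocks is mechanical enough. A sketch in JavaScript (the flat sentence objects are a hypothetical stand-in for whatever your XML parser hands you; only the id convention itself comes from the data):

```javascript
// Group parsed <s> elements into transcript ("T") blocks: a <time> id
// ending in "S" opens a block, and one ending in "E" closes it.
// Each sentence is assumed pre-parsed into a flat object like:
//   { words: ["Get", "down", "."], startId: "T303S" or null, endId: null }
function groupIntoBlocks(sentences) {
  var blocks = [];
  var current = null;
  sentences.forEach(function (s) {
    if (s.startId && /S$/.test(s.startId)) {
      current = { id: s.startId, sentences: [] }; // block opens
    }
    if (current) {
      current.sentences.push(s.words.join(' '));
    }
    if (s.endId && /E$/.test(s.endId)) {
      blocks.push(current); // block closes
      current = null;
    }
  });
  return blocks;
}
```

Feeding it the two "Get down" sentences above yields one block holding both, keyed by its starting id (T303S, say), which is exactly the id my segfiles print on the Scene line.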

Shadow data

As the title of this post asserts, it's been my experience that data transformations are only rarely as straightforward as they look. Several years ago, I worked for a B2B e-commerce company that handled a lot of EDI data, both reading it and writing it. In theory, this should be a standard format: if you know the specification, you know what the data will look like. In practice, however, we found that every new company we brought on the system had their own "shadow" data structures hidden in the standard fields, just like the invisible <t>ranscript blocks in this XCES data.

The takeaway is this: when you're writing a parser for a new-to-you data format, no matter how "standard" it professes to be, never take your data's structure at face value. There may be hidden complexities, just waiting to be tripped over.


Monday, January 25, 2010

Why data transformations are hard (part 1)

Per my plan for January, last Friday was the day I'd planned to have several EFR segmentation files ready for use later on in testing the software I'm going to build for my Linguistics MA capstone project.

My task for that week was to take several XML movie transcripts (in segmented, time-aligned XCES format) and spit out text files in a rather simple format I'll call EFR-segmentation (or segfile for short), which can be parsed by other pre-existing tools in the EFR toolchain. Here's a sample of the input and target formats, so we can clearly define our start and end points:

XCES format (the source):
<s id="...">
  <time id="T14S" value="00:01:20,664" />
  ...
</s>
EFR-segfile format (the target):
2460-2520 (00:01:20:20 - 00:01:22:11)
Scene 20: T14S
//T: So, you do like it, don't you?

Since the starting point was an XML format, my first thought was to use XSLT transformations to accomplish the reformatting. I created a stylesheet, segfile.xsl, to transform the XCES data files into something closely resembling the EFR-segfile format... but it was lacking in a couple of minor-but-not-insignificant ways. Observe:

Almost-but-not-quite-segfile format:
00:01:20,664
Scene 20: T14S
//T: So , you do like it , don' t you ?

The major problem with this output is the timestamp, which in XCES is formatted like hh:mm:ss,ms. EFR-segfile's timestamp format, on the other hand, is start_frame-end_frame (hh:mm:ss:frame - hh:mm:ss:frame). The milliseconds-to-frames part is easy:

var old_format = "00:02:36,532";
var parts = old_format.split(',');
// X ms * (1 sec/1000 ms) * (30 frames/1 sec) = Y frames
var new_format = parts[0] + ':' + Math.round(parseInt(parts[1], 10) * 0.03); // radix 10, to be safe
// new_format == "00:02:36:16"

... but the rest of that format change is not so simple. We're lucky in the above segment, because there are two timestamps: one at the beginning and one at the end. This seems to be the exception, however, and not the rule, so we'll still need to keep some kind of "current timecode" variable on hand while our parser works its way through the file. Once this is addressed, a second task is to convert the segment start and end times from the hh:mm:ss:ff format into raw frame counts, since EFR segfiles require both in order to have a valid timestamp line. This is a simple enough calculation, but again, doing it in XSLT is like trying to cut a 2x4 with a hammer, when we have other tools in our toolbox.
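For the record, the timecode-to-frames calculation looks something like this (30 fps is an illustrative assumption, and the function names are mine, not part of any EFR tool):

```javascript
// Convert an "hh:mm:ss:ff" timecode into a raw frame count, assuming a
// constant frame rate (30 fps here, purely illustrative).
var FPS = 30;

function timecodeToFrames(tc) {
  var p = tc.split(':'); // ["hh", "mm", "ss", "ff"]
  return ((parseInt(p[0], 10) * 60 + parseInt(p[1], 10)) * 60 +
          parseInt(p[2], 10)) * FPS + parseInt(p[3], 10);
}

// An EFR-segfile timestamp line needs both forms:
function segfileTimestamp(startTc, endTc) {
  return timecodeToFrames(startTc) + '-' + timecodeToFrames(endTc) +
         ' (' + startTc + ' - ' + endTc + ')';
}
```

Whatever the real EFR frame rate turns out to be, it's a one-line constant to change.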

Besides the timestamp-formatting issues, there was another glitch: the punctuation characters in the transcript line spit out by the XSLT are padded by unnecessary whitespace. For sentence terminators and splicers (periods, commas, etc), this doesn't present a huge problem. But inter-word apostrophes, indicating contractions, split up a word in such a way that any downstream term lookup routine will fail to match it properly. This kind of minor text cleanup would be a snap with my Bat'leth tool; it's designed with text parsing in mind, and has a lot of flexibility. But unfortunately, the timestamp conversion is more complex than Bat'leth's search-and-replace regular expressions can handle, so even if I used it for transcript cleanup, I'd still need another tool to do the timestamps.
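To give a flavor of the cleanup involved, a couple of regular expressions cover most of it (these two are illustrative, not Bat'leth's actual rules):

```javascript
// Collapse the whitespace the XSLT leaves around punctuation: remove
// the space before terminators/splicers, and rejoin contractions that
// were split across <w> tokens.
function cleanTranscript(line) {
  return line
    .replace(/\s+([.,!?;:])/g, '$1') // "it ," -> "it,"
    .replace(/'\s+/g, "'");          // "don' t" -> "don't"
}

// cleanTranscript("So , you do like it , don' t you ?")
//   -> "So, you do like it, don't you?"
```

The contraction rule is greedy enough to misfire on quoted text or trailing apostrophes, which is exactly the kind of edge case Bat'leth handles better than a bare regex.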

Given these shortcomings, I ruled out both XSLT and Bat'leth as monolithic solutions, although each could contribute a step to the overall parsing workflow. I really wanted a single piece of software that could do both parts, however, so I decided to try doing it in JavaScript. Then I fell down the hole that is cross-browser XML — which is a whole different blog post I'll have to write soon. (Executive summary, in one word: complicated.) Given the difficulty of that approach and the pressing deadline, I ended up writing something quick and dirty in Visual Basic 6. Not my first choice, but it got the job done.

The story doesn't end there, though. One of the movies I wanted to process was 1985's The Goonies, something of a cultural touchstone for my generation. I found an XCES alignment file for the movie in the OPUS OpenSubtitles corpus that I'm using, fed it through my parser, and got this EFR-segfile. It looked pretty good, I thought... so I fired up my EFR authoring software to verify the transcript alignment. And... it didn't match up! Tune in later this week for part 2 of this post, when we try to figure out why.
