work notes

the saga continues...

Although these notes are a bit cryptic about what I'm actually doing, they are enough for a friend at work to read and give me feedback about — which turns out to be a pretty good asynchronous communication mechanism — and he gave me some pointers. For one, it turns out I should be limiting my search for duplicates to duplicates by the same author. I reran my process and now get 14694 duplicates, which accounts for about a third of all duplicates.

Now I need to rebuild my feature vector template with room for a "duplicate" feature. The template also has counts for the various features, most of which are words, so that I can use them in some tf-idf capacity if desired, though I think I'll begin with just binary features and see where I'm at, now that the training dataset is significantly larger.
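For concreteness, here is a minimal sketch of the two weightings mentioned above. The function names and the `docCount` / `docFreq` parameters are made up for illustration, and the tf-idf form shown (raw count times log(N/df)) is only one common variant, not necessarily the one these notes would end up using.

```javascript
// Binary weighting: 1 if the feature occurs in the item at all, else 0.
function binaryValue(countInItem) {
    return countInItem > 0 ? 1 : 0;
}

// One common tf-idf variant (an assumption, not taken from the notes):
// raw count in the item times log(total items / items containing the feature).
function tfidfValue(countInItem, docCount, docFreq) {
    if (countInItem === 0 || docFreq === 0) return 0;
    return countInItem * Math.log(docCount / docFreq);
}
```

With binary features every value is a whole number; with tf-idf the values become fractional, so they need fixed-precision formatting when written out.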

Ok, I would have 288,893 features, but because I eliminate features that don't occur more than 100 times, I am left with 9,032 features.

At this point, I have a feature vector template, which is just a list of features. Now I need to go back through all the data items and build a feature vector for each one. I think I need this two-pass process because I don't know until after the first pass is done which features I'm going to eliminate.
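A minimal sketch of that two-pass idea, assuming each data item has already been reduced to an array of feature names. The function names, the input shape, and the 1-based indexing are assumptions for illustration; this is not the actual script.

```javascript
// Pass 1: count every feature, then keep only those seen more than
// minCount times. Returns a map from feature name to a 1-based index.
function buildTemplate(items, minCount) {
    // Object.create(null) avoids collisions with inherited keys such as
    // "constructor", which matters when features are arbitrary words.
    var counts = Object.create(null);
    items.forEach(function (features) {
        features.forEach(function (f) {
            counts[f] = (counts[f] || 0) + 1;
        });
    });
    var featureToIndex = Object.create(null);
    var next = 1;
    Object.keys(counts).sort().forEach(function (f) {
        if (counts[f] > minCount) featureToIndex[f] = next++;
    });
    return featureToIndex;
}

// Pass 2: build one sparse binary vector (index -> 1) per item,
// silently skipping features that were pruned in pass 1.
function buildVector(features, featureToIndex) {
    var vector = {};
    features.forEach(function (f) {
        var index = featureToIndex[f];
        if (index !== undefined) vector[index] = 1;
    });
    return vector;
}
```

The second pass can only run once the first has finished, because an item's vector depends on which features survived the pruning.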

Hm.. for some reason I have usually preferred global functions for things like map and filter, but as I use them more, I'm less convinced, because I often get things like this:

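// v maps feature name -> value; featureToIndex maps feature name -> index.
// Inside-out: pairs -> [index, value] -> drop unknown features ->
// sort by index -> format as "index:value".
_.map(_.sortBy(_.filter(_.map(_.pairs(v), function (e) {
    return [featureToIndex[e[0]], e[1]];
}), function (e) {
    return e[0];
}), function (e) {
    return e[0];
}), function (e) {
    return e[0] + ":" + ((e[1] % 1 === 0) ? e[1] : e[1].toFixed(6));
}).join(' ')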

which requires too much inside-out reading for me. I forget why I wanted the global functions. It had to do with some danger in touching JavaScript's prototype functions... anyway, I think I'll try method-style calls and re-encounter whatever problem I faced before.
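For comparison, the same steps can be written as a top-to-bottom method chain. This sketch uses only built-in array methods and `Object.entries`, so it assumes a reasonably modern JavaScript runtime; the wrapper name `formatVector` is made up. Underscore's `_.chain` is another way to get a similar reading order.

```javascript
// v maps feature name -> value; featureToIndex maps feature name -> index.
function formatVector(v, featureToIndex) {
    return Object.entries(v)
        .map(function (e) { return [featureToIndex[e[0]], e[1]]; })
        // As in the nested version, any falsy index (undefined, or 0)
        // is dropped here.
        .filter(function (e) { return e[0]; })
        // Ascending numeric order by index.
        .sort(function (a, b) { return a[0] - b[0]; })
        .map(function (e) {
            return e[0] + ":" + ((e[1] % 1 === 0) ? e[1] : e[1].toFixed(6));
        })
        .join(" ");
}
```

Each step now appears in the order it runs, and each callback sits next to the operation it belongs to.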

But back to features... it's processing. About halfway there. Ok, done. 260 MB. Not bad.

Ok, let's try running this thing: train -v 5 feature_vectors.txt... oh, forgot to install liblinear on the EC2 machine..

trying again.. ./liblinear/train -v 5 feature_vectors.txt... hm... "Wrong input format at line 1". Ah. I forgot to join the vectors with newlines. Now, when I examined the partial output to make sure everything was working before letting it sit for 5 minutes building the large vector, I only looked at a single vector, so I didn't see the missing newline. It impresses me sometimes how I trip over everything that can be tripped over, over and over... I've been doing this for years. Years. And still, I'm liable to make every mistake possible :)
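A small sketch of the fix and of a cheap check that would have caught it. liblinear expects one instance per line in the form "label index:value index:value ...". The row shape (`label`, `features`) and both function names are assumptions for illustration, and the check is deliberately loose: it does not verify that indices are ascending.

```javascript
// rows: array of { label, features }, where features is the already
// formatted "index:value ..." string for one item (hypothetical shape).
function toTrainingFile(rows) {
    return rows.map(function (row) {
        return row.label + " " + row.features;
    }).join("\n") + "\n";
}

// Sanity check over the whole output rather than a single vector:
// the line count must match the row count, and every line must look
// like "label index:value ...".
function looksValid(text, expectedRows) {
    var lines = text.split("\n").filter(function (l) { return l.length > 0; });
    var pattern = /^\S+( \d+:\S+)*$/;
    return lines.length === expectedRows &&
        lines.every(function (l) { return pattern.test(l); });
}
```

Running the check on a handful of rows before generating the full file is much cheaper than waiting several minutes and then finding out from the trainer.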

Great, ok, it's running! Now the moment of truth.. will liblinear be able to process a 260 MB training file on a medium EC2 machine... there are dots appearing... I think that means it's thinking... Hm... I have no idea how long it will take, and I may go to bed before it finishes. I think maybe I'll cancel it, run it in "screen", so I can log out if I need to and let it run... Ok, there it goes...
