work notes

still on this data processing / machine learning job...

I'm classifying a bunch of documents. I had been marking things as yes or no, but we actually have more information than just yes, like why. I thought maybe the classifier should try to say why as well. I was showing data about the counts of different whys to my friend at work, and he mentioned that the real purpose of the classifier is to identify nos that don't need to be processed further. Part of my brain was thinking that all along, yet somehow it didn't occur to me that the why is irrelevant for this purpose. So, good.

I need to identify duplicates. Or rather, I need to make sure the classifier doesn't mark duplicates as nos. This will probably need to be another feature, if not a first-pass algorithm. Let's see how many duplicates there are. Hm.. 37,040 are exact duplicates, which is about 6% of the items. What if we convert everything to lowercase and strip out punctuation and whitespace? 43,678, which is about 18% more duplicates than the exact match found. Not too many more, but worth doing I suppose.
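Roughly, the two counts can be sketched like this. The function names and the sample items are made up for illustration; the only thing carried over from the real data is the assumption that each item has a description field. The normalized key lowercases the text and drops every non-word character, so it ignores whitespace as well as punctuation.

```javascript
// Count items whose key has already been seen earlier in the list.
function countDuplicates(items, keyOf) {
    const seen = new Set();
    let dups = 0;
    for (const item of items) {
        const key = keyOf(item);
        if (seen.has(key)) dups++;
        else seen.add(key);
    }
    return dups;
}

// Two ways of deciding that items are "the same".
const exactKey = d => d.description;
const normalizedKey = d => d.description.toLowerCase().replace(/\W+/g, '');

// Made-up sample data, not the real documents.
const sample = [
    { description: 'Red bike, barely used' },
    { description: 'Red bike, barely used' },
    { description: 'red bike barely used!' },
    { description: 'Blue kettle' },
];

const exactDups = countDuplicates(sample, exactKey);
const normalizedDups = countDuplicates(sample, normalizedKey);
console.log(exactDups, normalizedDups);
```

Only the first occurrence of each key counts as an original; every later item with the same key counts as a duplicate.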

Programming note: the data is in about 15 JSON files. Each is an array. I could load all these arrays into memory and concatenate them, but that seems to take time, perhaps because it is shuffling around 1.7 GB of memory. I think it is faster to load each file, process all its items, and then move to the next one. (ahh, the trains are back.. sigh.. the heater is also on, but the trains are louder.)
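A minimal sketch of the file-at-a-time approach, assuming the files all sit in one directory and each holds a single JSON array. The eachItem name and the directory layout are assumptions for illustration, not the actual setup.

```javascript
const fs = require('fs');
const path = require('path');

// Parse one JSON array file at a time and hand each item to `handle`,
// so only one file's worth of items needs to be live at once.
// Returns the total number of items seen.
function eachItem(dir, handle) {
    const files = fs.readdirSync(dir)
        .filter(f => f.endsWith('.json'))
        .sort();
    let count = 0;
    for (const file of files) {
        const text = fs.readFileSync(path.join(dir, file), 'utf8');
        const items = JSON.parse(text);
        for (const item of items) {
            handle(item);
            count++;
        }
    }
    return count;
}
```

Each parsed array becomes garbage once the loop moves on to the next file, which avoids building one giant concatenated array.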

So what is a good programming abstraction for when I want to feed a bunch of items into a thing one at a time, and then get a final assessment of all the items back.. here we go:

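var countDups = Fiber(function (d) {
    var descs = {}   // hashes of the normalized descriptions seen so far
    var ouch = 0     // number of duplicates found
    while (d) {
        var hash = _.md5(d.description.toLowerCase().replace(/\W+/g, ''))
        // _.setAdd returns false when the hash was already in the set
        if (!_.setAdd(descs, hash)) ouch++
        d = Fiber.yield()   // pause until the next run(d) call
    }
    console.log("ouchies: " + ouch)
})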

...now I just call countDups.run(d) for each data item, and countDups.run(null) when I'm done. Node fibers are so awesome.
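The same feed-items-in, get-a-summary-out shape can also be sketched with a plain JavaScript generator instead of a fiber. This is a swapped-in technique, not the code above: it uses Node's built-in crypto module and a Set in place of the _.md5 and _.setAdd helpers, and the dupCounter name and sample docs are made up for illustration.

```javascript
const crypto = require('crypto');

// Generator-based accumulator: receives one item per next(d) call and
// returns the duplicate count once it is sent a falsy value.
function* dupCounter() {
    const seen = new Set();
    let dups = 0;
    let d = yield;               // wait for the first item
    while (d) {
        const key = crypto.createHash('md5')
            .update(d.description.toLowerCase().replace(/\W+/g, ''))
            .digest('hex');
        if (seen.has(key)) dups++;
        else seen.add(key);
        d = yield;               // pause until the next item arrives
    }
    return dups;
}

// Made-up sample data.
const docs = [
    { description: 'Hello, world' },
    { description: 'hello world' },
    { description: 'Goodbye' },
];

const counter = dupCounter();
counter.next();                  // prime: run up to the first yield
for (const d of docs) counter.next(d);
const total = counter.next(null).value;
console.log('ouchies: ' + total);
```

One difference from the fiber version: the generator needs a priming next() call before the first item, and the final count comes back as the return value rather than being logged inside the loop body.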

Ok, now I need to convert some scattered test code for extracting features and such into these well-contained utility functions...

Hm.. I had typed _.each(text.match(/[a-zA-Z]+/g), ...) and worried that this might hit the "length" issue in _.each (it treats any object with a numeric length property as an array) if text happened to contain the word "length". It doesn't, because match itself returns an array... but the fact that I keep needing to think about this issue is annoying. So annoying that I may switch away from underscore :( ... ugh... and there it is: _.each(getWordBag(text), ...). A word bag is a hashmap mapping words to counts, and it might contain the word "length". Do I really want to move away from underscore for this? Can I convince the underscore person that this is an issue? Ok, I just submitted an issue. We'll see what comes of it. For now, I'll work around this, I guess... Hm... I guess underscore says "no".
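To make the problem concrete, here is a self-contained sketch. The naiveEach function is a simplified stand-in that imitates the "numeric length means array" heuristic; it is not underscore's actual source. This getWordBag is a guess at what such a helper might look like, and eachKey is one possible workaround.

```javascript
// Build a word bag: an object mapping each word to its count.
// Object.create(null) avoids clashes with inherited names like "constructor".
function getWordBag(text) {
    const bag = Object.create(null);
    for (const word of text.match(/[a-zA-Z]+/g) || []) {
        bag[word] = (bag[word] || 0) + 1;
    }
    return bag;
}

// Stand-in for an "each" that guesses array-likeness from `length`.
function naiveEach(obj, fn) {
    if (typeof obj.length === 'number') {
        for (let i = 0; i < obj.length; i++) fn(obj[i], i);
    } else {
        for (const key of Object.keys(obj)) fn(obj[key], key);
    }
}

// Workaround: always iterate the object's own keys, never guess.
function eachKey(obj, fn) {
    for (const key of Object.keys(obj)) fn(obj[key], key);
}

const bag = getWordBag('the length of the string is the length');

const naive = [];
naiveEach(bag, (count, word) => naive.push([word, count]));

const safe = [];
eachKey(bag, (count, word) => safe.push([word, count]));

console.log(naive); // indices with undefined counts, because bag.length is 2
console.log(safe);  // the real word/count pairs
```

Because the bag contains the word "length" twice, bag.length is the number 2, so the guessing version walks indices 0 and 1 and never sees a single word.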
