real gl: data processing notes

notes..

I've been gathering data from a database using node-postgres. One thing that sucks about javascript is the lack of multiline strings. I have a large SQL query that would be much more readable and manageable spread over multiple lines. Python and Perl have good support for this. I think javascript would benefit. As a side note, I always wonder why regular double-quoted strings can't span multiple lines.
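For what it's worth, here is a minimal sketch of two common workarounds. The query itself is hypothetical (the table and column names are made up for illustration); only the `$1` placeholder style is what node-postgres uses for parameters. Joining an array of lines works in any engine, while the backtick form assumes an engine with ES2015 template literals.

```javascript
// Hypothetical query: table and column names are invented for this example.
// Option 1: join an array of lines (works in any JavaScript engine).
var queryFromArray = [
  'SELECT id, title, body',
  'FROM documents',
  'WHERE created_at > $1',
  'ORDER BY created_at'
].join('\n');

// Option 2: a template literal (backticks), assuming ES2015 or later.
var queryFromTemplate = `SELECT id, title, body
FROM documents
WHERE created_at > $1
ORDER BY created_at`;

// Both forms build the same string.
console.log(queryFromArray === queryFromTemplate);
```

The array form is noisier to type, but it keeps each line of SQL visually separate and needs no special language support.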

The resulting data is about 1gb of json. I've been doing preliminary processing on my own machine, but now I'm trying to put everything on an ec2 machine.

Usually I feel fine doing development in Dropbox, but when dealing with large datasets, I feel a bit worried. I'm not sure why. I have 100gb of space. It shouldn't be a problem.

Macfusion has been great: it makes the ec2 machine look like a hard drive on my mac.

The json data is 814mb on disk, but expands to about 1.76gb in memory. Not sure how much of that is garbage that hasn't been collected yet, since I'm doing some array concatenation to build the final data array from multiple files. It also takes 4.5 minutes to load. I feel like that should be reducible.
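One thing that might be worth trying, as a sketch only: append each file's items to a single array in place instead of calling concat, since concat allocates a fresh array on every call. Whether that accounts for much of the extra memory here is a guess, not something measured. The `chunks` array below is a hypothetical stand-in for the arrays parsed from each json file.

```javascript
// Hypothetical stand-ins for the arrays parsed from each JSON file.
var chunks = [[1, 2, 3], [4, 5], [6]];

// Append items to target in place. A plain loop is used instead of
// push.apply, which can overflow the call stack on very large arrays.
function appendAll(target, items) {
  for (var i = 0; i < items.length; i++) {
    target.push(items[i]);
  }
  return target;
}

var all = [];
chunks.forEach(function (chunk) {
  appendAll(all, chunk);
});

console.log(all.length);
```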

This ec2 machine won't need to do much most of the time. I could use a small. But every day, I want to load and process this dataset, which will require at least a medium. I wonder how easy people have made it to create a large ec2 machine, send a bunch of data there, process it, and send the results back.

I discovered another field I needed, so I had to re-download everything and re-copy it. The ec2 machine itself doesn't have access to the database. I'm not sure what the right way to fix that is. Giving the ec2 machine database access is not trivial, and may involve some bureaucracy.

I need a word stemmer in node. I type "node stemmer" into google. I get "porter-stemmer", a stemmer package for node. Excellent. Yes, I know this happens with many languages, and is not new — there evolves an ecosystem of packages for anything you want to do — but it is still cool.

What is the longest word? I want to remove stuff people type like "asdflisjflsekfjlaskdfjilewghlegjlsijislehg" that isn't a word, and isn't likely to be typed by other people. Wikipedia says Floccinaucinihilipilification is the "Longest unchallenged nontechnical word". Great. Anything over "Floccinaucinihilipilification".length is rejected. Or perhaps I can just include everything at first, and remove things that don't appear often enough.
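The length rule is small enough to sketch directly. The function and constant names here are made up for illustration.

```javascript
// Longest word we are willing to accept, per the Wikipedia entry quoted above.
var MAX_WORD_LENGTH = 'Floccinaucinihilipilification'.length;

// Keep a word only if it is no longer than the longest "real" word.
function isPlausibleWord(word) {
  return word.length <= MAX_WORD_LENGTH;
}

var words = ['cat', 'Floccinaucinihilipilification', 'asdflisjflsekfjlaskdfjilewghlegjlsijislehg'];
console.log(words.filter(isPlausibleWord));
```

This only catches long keyboard mashing; short junk like "asdf" would still need the frequency-based filtering mentioned above.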

I was about to worry about the issue with underscore's _.each treating hashmaps with a "length" as arrays, since I'm going to be throwing random words as keys into hashmaps, and one of those words might be "length", but then I realize I'm adding a prefix to all my word-keys, so I guess that's not an issue. For now.
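A minimal sketch of why the prefix helps, using plain objects and no underscore. The `w_` prefix and the helper name are assumptions for illustration; the actual prefix isn't stated above.

```javascript
// Hypothetical prefix; any string that no real word-key starts with would do.
var KEY_PREFIX = 'w_';

// Count a word under a prefixed key, so that words like "length" or
// "constructor" never collide with special or inherited property names.
function addWord(counts, word) {
  var key = KEY_PREFIX + word;
  counts[key] = (counts[key] || 0) + 1;
  return counts;
}

var counts = {};
['length', 'constructor', 'length'].forEach(function (word) {
  addWord(counts, word);
});

// The object itself still has no "length" property, so code that sniffs
// for array-likeness by checking obj.length will not be fooled.
console.log(counts.length === undefined);
console.log(counts);
```

The prefix also sidesteps inherited names: without it, `counts['constructor'] || 0` would pick up the inherited Object constructor instead of 0.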

Hm.. I just made my first pull request. I didn't even need to clone the project. I just saw an error in the readme file, I opened the readme file online, edited the file, and "proposed the change". Go github. And of course I actually modified it wrong, so I closed that pull request and submitted another one.

I'm building a large feature vector. I count how many examples use each feature, and filter away features that are possessed by fewer than 10 items. That reduces the features from 288891 to 30753.

I feel like it might be worth plotting a little graph to show how many features I would get for different cutoff choices, and choosing something at the "elbow" of the curve. Hm. How much effort would that take?
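Getting the numbers behind such a curve probably isn't much effort. Here is a sketch under some assumptions: `featureCounts` is a hypothetical map from feature key to the number of examples that have it (the tiny values below are invented), and a feature is kept when its count is at least the cutoff, which matches dropping features possessed by fewer than 10 items when the cutoff is 10.

```javascript
// Hypothetical counts: feature key -> number of examples that have it.
var featureCounts = { w_a: 120, w_dog: 15, w_cat: 9, w_bird: 2, w_asdf: 1 };

// Number of features that survive a given cutoff (count >= cutoff).
function featuresKept(counts, cutoff) {
  return Object.keys(counts).filter(function (key) {
    return counts[key] >= cutoff;
  }).length;
}

// One point of the curve per candidate cutoff.
function cutoffCurve(counts, cutoffs) {
  return cutoffs.map(function (cutoff) {
    return { cutoff: cutoff, kept: featuresKept(counts, cutoff) };
  });
}

cutoffCurve(featureCounts, [1, 2, 10, 100]).forEach(function (point) {
  console.log(point.cutoff + '\t' + point.kept);
});
```

The tab-separated output can be pasted into a spreadsheet or any plotting tool to look for the elbow.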

Oops. I was looking at how many times different words were used, and it's good I did, because I had meant to be counting how many documents include each word, but instead I was counting how many times each word appears in total. I noticed this because the word "a" appears over a million times, but I don't have over a million documents.
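The difference between the two counts can be sketched like this. The documents are toy data and the function names and `w_` key prefix are assumptions for illustration. The only change between the two functions is a per-document "seen" set, so each word is counted at most once per document.

```javascript
// Toy documents, already split into words.
var docs = [
  ['a', 'cat', 'and', 'a', 'dog'],
  ['a', 'bird'],
  ['the', 'dog']
];

// Total occurrences of each word across all documents (the accidental count).
function termFrequency(docs) {
  var counts = {};
  docs.forEach(function (words) {
    words.forEach(function (word) {
      var key = 'w_' + word;
      counts[key] = (counts[key] || 0) + 1;
    });
  });
  return counts;
}

// Number of documents containing each word (the intended count).
function documentFrequency(docs) {
  var counts = {};
  docs.forEach(function (words) {
    var seen = {}; // words already counted for this document
    words.forEach(function (word) {
      var key = 'w_' + word;
      if (!seen[key]) {
        seen[key] = true;
        counts[key] = (counts[key] || 0) + 1;
      }
    });
  });
  return counts;
}

console.log(termFrequency(docs)['w_a'], documentFrequency(docs)['w_a']);
```

A quick sanity check falls out of this: no document frequency can exceed the number of documents, which is exactly the check that exposed the bug.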

Hm.. I usually don't have test cases. I find them cumbersome. But I've been using test cases for myutil2.js, in a file called myutil2_tests.html, and it is pretty convenient. In this case, I feel like it isn't actually more work, since I need to test the functions I add anyway, so I just put those tests into that file.

Another reason that the test cases work out is that I'm having everything point to a single utility library. If I were copying the utility library around, then I feel like it would be easier to just test stuff in the code that was trying to use it, as I usually do. But in this case, to do that, I would need to make a change, check it in, test it where I'm using it, and then check in fixes.. meaning that the checked-in code would be broken sometimes, which seems dangerous.

Incidentally, I use myutil2.js from node projects as well — like the data processing I've been writing — and I load it dynamically like this: eval(_.wget('https://raw.github.com/dglittle/myutil/master/myutil2.js')). Of course, this relies on _.wget, which is not part of underscore, but is rather something I add to it in a nodeutil.js, which is not in a centralized place on github. I'm not sure what a good way to do that would be.. perhaps I could have most of the nodeutil.js stuff on github, except for the wget function itself, but much of what nodeutil.js does is the wget function.

Ok.. I plotted the curve showing how many features we would keep with each threshold value. It is a very very sharp elbow. So sharp that it's not worth plotting the curve because it just looks like a line going down, and then a line going to the side. After some poking around with it — zooming in — I'm going to go with 100 as the cutoff, which will leave about 7000 features.

1/28/13
