Q: I'll be classifying documents. How do I turn the words into "features"?
A: The most common approach seems to be having a separate feature for each word. For example, there would be a feature for the word "machine" and a different feature for the word "learning".
Q: This blog post has the word "machine", so what number would I put for the "machine" feature?
A: You could put "7", because "machine" occurs 7 times. Or you could put "1", because "machine" occurs at least once. You could also use tf-idf.
Q: What do you recommend?
A: Well, at the end of this paper is a section called "A Practical Guide to LIBLINEAR". After reading it, I'm leaning toward binary features, where "1" means the word is there and "0" means the word is not there, and then normalizing these values, so if a document has 10 distinct words, then we'd put "1/sqrt(10)" for each word instead of "1".
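Here is a rough sketch of that binary-then-normalize idea in Python. The `vocabulary` dict (word to 1-based feature index) and the function name are made up for illustration, and words outside the vocabulary are simply ignored, which is an assumption rather than anything LIBLINEAR requires.

```python
import math

def binary_normalized_features(words, vocabulary):
    """Map a document's words to {feature_index: value} with unit length.

    `vocabulary` is a hypothetical dict from word to 1-based feature index.
    """
    # Binary features: each distinct known word counts once,
    # no matter how many times it occurs in the document.
    present = {vocabulary[w] for w in words if w in vocabulary}
    if not present:
        return {}
    # With n features each set to 1/sqrt(n), the vector has length exactly 1.
    value = 1.0 / math.sqrt(len(present))
    return {index: value for index in sorted(present)}

vocabulary = {"machine": 1, "learning": 2, "blog": 3}
features = binary_normalized_features(["machine", "learning", "machine"], vocabulary)
print(features)
```

In this toy document "machine" appears twice but still contributes a single feature, so both features get the value 1/sqrt(2).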
Q: What is the format for LIBLINEAR training data?
A: Each line represents a document. The first number is the class of the document, e.g., "0" or "1". Then a space. Then a list of feature-value-pairs separated by spaces. Each feature-value-pair looks like this "32:3.141" where "32" is the index of the feature, and "3.141" is the value.
Q: It is giving me an error for this input: "0 0:1 5:1 3:1"
A: The features are 1-indexed, so you can't have a feature "0".
Q: It is still giving me an error: "0 1:1 5:1 3:1"
A: The features must be in order, e.g., "0 1:1 3:1 5:1".
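To avoid both of those errors, the lines can be generated by a small helper instead of by hand. This is only a sketch: `format_liblinear_line` is a name I made up, and it assumes the features arrive as a dict from 1-based index to value.

```python
def format_liblinear_line(label, features):
    """Format one document as a line of LIBLINEAR training data.

    `features` maps 1-based feature indices to values.
    """
    if any(index < 1 for index in features):
        raise ValueError("feature indices must start at 1")
    # Sort by index so the pairs are written in ascending order.
    pairs = " ".join(
        f"{index}:{value:.6g}" for index, value in sorted(features.items())
    )
    return f"{label} {pairs}"

line = format_liblinear_line(0, {5: 1, 1: 1, 3: 1})
print(line)  # 0 1:1 3:1 5:1
```

The `.6g` format keeps six significant digits, which is a choice made here for readability, not a requirement of the file format.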
./train -v 5 training_data.txt
yielding "Cross Validation Accuracy = 92.97%", which sounds great.
Now if I follow the advice from "A Practical Guide to LIBLINEAR" and normalize the features for each document (that is, if we think of each document as a vector in a high-dimensional space, we scale the vector to have length one) and re-run:
./train -v 5 training_data.txt
Ok, I think that by default it is using an SVM. But a friend and co-worker ran a Logistic Regression on this data and got really good accuracy, and I'd rather use that, since I think I can extract from the Logistic Regression a meaningful coefficient for each feature that will tell me how "useful" that feature is. The L1-regularized Logistic Regression is accessed with the "-s 6" option:
./train -s 6 -v 5 training_data.txt
Now to see which features are the most useful. (Man, I have a raspberry seed stuck in my tooth. That always happens. Stupid raspberries. Yet I still eat them. Hm..)
Ok, so there are about 40,000 features, and only about 250 have non-zero coefficients. That's nice. Should make it efficient to classify stuff.
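For the record, here is roughly how the non-zero coefficients could be pulled out of a saved model. This sketch assumes the model file is plain text with a few header lines, then a line containing just "w", then one weight per line in feature order (two classes, no bias term). The sample below is made up to match that assumed layout, so check it against a real model file before trusting it.

```python
def nonzero_weights(model_lines):
    """Return {feature_index: weight} for the non-zero weights.

    Assumes one weight per line after a line containing only "w".
    """
    lines = [line.strip() for line in model_lines if line.strip()]
    start = lines.index("w") + 1
    weights = {}
    for offset, line in enumerate(lines[start:]):
        weight = float(line.split()[0])
        if weight != 0.0:
            # Feature indices are 1-based, like the training data.
            weights[offset + 1] = weight
    return weights

# Made-up sample in the assumed layout, not real output.
sample_model = """solver_type L1R_LR
nr_class 2
label 1 0
nr_feature 4
bias -1
w
0
0.75
0
-1.2
""".splitlines()

useful = nonzero_weights(sample_model)
# Rank features by the size of their coefficient, largest first.
ranked = sorted(useful.items(), key=lambda item: abs(item[1]), reverse=True)
print(ranked)  # [(4, -1.2), (2, 0.75)]
```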
I would say which features are the most useful, but I'm not sure I should put that online. I'll not for now.
My friend plotted an ROC curve, and I wanted to understand that, so I went off to understand ROC curves...
Hm.. I notice that ROC curves do not include "precision". But I can see why that might be good. It could be that our classifier is identifying 100% of the problem items, while also falsely accusing the same number of other items. Hence, it would have 100% recall, but only 50% precision, since only half of the items it says are bad are actually bad.
Now 50% looks like low precision, but if there are 10,000 items, and only 100 of them are bad, and the system identifies 200 items as potentially bad, where all 100 bad items are in that set, that's actually doing something pretty good, and the ROC curve shows that (the FPR, or false positive rate, would be 100/9900 ≈ 0.01, where a low number is desirable for FPR).
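The arithmetic for that example, as a small Python sketch. The function name is mine, and the four counts are just the numbers from the scenario above: 100 bad items all caught, 100 good items falsely flagged, and the other 9,800 good items left alone.

```python
def classifier_rates(tp, fp, fn, tn):
    """Return (precision, recall, false positive rate) from confusion counts."""
    precision = tp / (tp + fp)  # of the items flagged, how many are really bad
    recall = tp / (tp + fn)     # of the bad items, how many were flagged
    fpr = fp / (fp + tn)        # of the good items, how many were flagged
    return precision, recall, fpr

# 10,000 items, 100 bad; 200 flagged, including all 100 bad ones.
precision, recall, fpr = classifier_rates(tp=100, fp=100, fn=0, tn=9800)
print(f"precision={precision:.2f} recall={recall:.2f} FPR={fpr:.4f}")
# precision=0.50 recall=1.00 FPR=0.0101
```

Recall here is the same thing as the true positive rate, which is the other axis of the ROC curve.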