Jay Taylor's notes

back to listing index

Bag of Tricks for Efficient Text Classification | Hacker News

[web search]
Original source (news.ycombinator.com)
Tags: nlp text-classification natural-language-processing news.ycombinator.com
Clipped on: 2016-07-22

Image (Asset 1/2) alt= Hacker News new | threads | comments | show | ask | jobs | submit jaytaylor (1969) | logout
Bag of Tricks for Efficient Text Classification (arxiv.org)
169 points by T-A 13 days ago | flag | hide | past | web | 15 comments | favorite

Image (Asset 2/2) alt=

For anyone who is interested in efficiently classifying text, I can't recommend Vowpal Wabbit[1] (VW) enough. It's blazingly fast and has been used in both production and research. It also has a billion options out of the box for various different set-ups.

Other researchers have noted[2] that with a set of command line flags that vw is almost the same as the system described in the paper, specifically, "vw --ngrams 2 --log_multi [K] --nn 10".

Behind the speed of both methods is use of ngrams^, the feature hashing trick (think Bloom filter except for features) that has been the basis of VW since it began, hierarchical softmax (think finding an item in O(log n) using a balanced binary tree instead of an O(n) array traversal) and using a shallow instead of deep model.

I am still interested in the more detailed insights the team from Facebook AI Research may provide but the initial paper is a little light and they're still in the process of releasing the source code.

^ Illustrating ngrams: "the cat sat on the mat" => "the cat", "cat sat", "sat on", "on the", "the mat" - you lose complex positional and ordering information but for many text classification tasks that's fine.

[1]: https://github.com/JohnLangford/vowpal_wabbit/wiki

[2]: https://twitter.com/haldaume3/status/751208719145328640

Note that the fact that this can be easily accomplished in VW doesn't really take away from the message the authors are trying to make; namely, simple models done carefully are nearly as (or more) effective in these sorts of problems as fancy deep models, but much cheaper to train and test.

I can attest to that: used even simpler algorithm with bi-gram hasing to generate user profile for http://news-AI.com presonalized news service.

Surprisingly, it produced satisfactory results with much smaller CPU requirements.

This is implied through research on reductions in machine learning.

That simple models can solve complex tasks.

For example, you can do multiclass classification, cost sensitive (importance weighted) multiclass and binary, quantile regression, structured prediction (as is done with HMMs, CRFs, MEMMs, structured SVMs etc.), just using a binary classifier.

So, if your implementation of that binary classifier is efficient and performant, you'll be (given that your reduction is consistent) efficient and performant on any of the above tasks.

What the authors of the paper above did, is that they rediscovered some old tricks, removed the theory of reductions, and that's that - without referencing vowpal wabbit that does way more useful tricks. I'm not sure why, because VW team consistently references Leon Bottou (out of all others) that is member of FAIR, and has been using implementation tricks for decades.

Their log(k) implementation is probably less performant than the one-against-some consistent reduction in VW due to the latter having better theoretical bounds on performance.

Thanks for the detailed comment! It's interesting that simple and classical techniques are so competitive for text, but not for images. What do you think is different about text that makes simple methods so effective, or equivalently what is different about images that yields such massive improvements from using convolutional deep neural nets?

Imagine looking at an image as a "bag of pixels": shuffle them all up, and look at the resulting image. What do you see? Nothing useful, right?

Now look at a bag-of-words view of a movie review:

    set(['and', 'predecessor,', 'immersive;', 'script,', 'is', 'an', 'engaging', 'as', 'home', 'still', 'its', 'film', 'identity.', 'puts', 'dazzling', 'issues', 'visually', 'colorful', 'While', 'not', 'spin', 'on', 'of', 'while', 'the', 'predictable,'])
It's not certain, but there's definitely a lot more information there.

The predict loop of a linear model works like this (written with sparse vectors, implemented as dictionaries):

    def predict(classes, weights, features):
        scores = {clas: 0 for clas in classes}
        for feature in features:
            for clas, weight in weights[feature].items():
                scores[clas] += weight
        return max(scores, key=lambda clas: scores[clas])
This function is the same for Naive Bayes, Maximum Entropy, linear-kernel SVM, Averaged Perceptron...etc.

All you get to do is attach a weight to each feature, which you'll sum. You can make the features larger or smaller slices of the input, but that's all the structure you get.

Note that linear models have massive disadvantages for speech recognition, too. Linear models don't work very well if you have an analog signal.

I think bag of pixels would be more analogous to a bag of characters. A bag of words is more like a bag of SIFT features.

Sure but n-gram feature extraction is what, five lines of code? It's a trivial transform compared to SIFT.

If you don't do SIFT manually prior to classification then your NN has to evolve something "similar" in order to work. Which is why it needs to be deep.

"This movie make all other movies look awful - do not fail to see, missing it would be a crime"

Is this a negative, or a positive review?

awful -1 fail -1 miss[ing] -1

Seems pretty negative to me...

You can build a fairly accurate nudity filter by detecting % skin-tone pixels, but that extra mile to distinguish bikini/beach pics from nudes is the real crux.

That's a really good question and why the initial promise of deep learning was best shown in vision. It is only recently that deep learning in text has given enough performance gains to justify the (large) computational overhead.

For many languages, and many tasks performed on these languages, features have already been discretely provided. If you take English, split on the space character, and then throw those into a standard classifier (Naive Bayes, SVM, logistic regression), you're likely to get a reasonable result. If you add n-grams, so a little bit of collocation information, you tend to do even better. For most tasks, you may even make the simplifying assumption that features don't even interact at all. Even naive use of these methods will get you 90% of the way in many NLP tasks with very little computational overhead (as is evidence in this article).

For other languages, generating those features is a more complicated matter. Chinese is written without spaces between words for example. Extracting such discrete features from there requires an extra step in the pipeline. The way in which you split the characters can also depend on context, requiring an interaction between these components. This interaction is very iffy when it's in a pipeline and you don't have gradients flowing end to end, as is the case with most deep learning systems.

Images are far more like the latter case. There are few discrete features that can be easily extracted from images. Images are also scarily high dimensional - the number of pixels, the number of colours, the interaction between pixels, etc. Using human generated methods of feature extraction on this is fraught with a lot of complicated hand tuning and a lot of misses.

tldr; Text, by virtue of being human generated, is far more structured and far more amenable to easily extracting discrete representative features. For images, we just never really found a highly effective way of producing discrete representative features by hand. Deep learning, by virtue of providing the feature extraction in an automated manner directed by the loss on the task, has helped solve a major issue that plagued computer vision.

It isn't that different than some tricks in image recognition. Look at basic training schemes for digit recognition. A feature you can start with is basically how many pixels are used.

The basic idea is how can you start reducing the pattern space. So, for images, don't consider all colors and only consider brightness.

I don't understand why they're doing this on the Yelp/IMDB datasets. Here's the paper they're citing that is doing the same thing with gated neural nets: http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP167.pdf

However that paper also has the polar score which is a 2-class classifier - 1 star + 2 star vs 3 star + 4 star.

Basically, the rating scores are clearly ordinal, but they're completely disregarding that information as if a 4 is a completely different star review than a 5 in a yelp restaurant review or a 9/10 is different from 8/10 for IMDB review.

More useful estimate of success would be how close they got to real rating and possibly penalize disproportionately when a model gets the rating completely wrong. Something like mean squared error of the final score vs predicted score.

I understand that these are just used to benchmark the algorithms against each other, but why not use something more relevant like a topic classification? That is a real n-classes problem.

General disregard for domain-specific information is very prevalent in machine learning papers and it makes real world applications difficult because real world evaluation metrics are different so classifiers that are marked as inferior in such evaluation might actually be better.

In both of their evaluations, why not try

    F(y) = exp(yj) / sum(exp(yi) for i in range(1, n+1))
    p1 = F(y1)
    pj = F(yj) - F(yj-1), for j ≥ 2
for softmax? In other words, subtract cum prob of previous category.

Or use kappa function?

At least the naive bag of words comparison classifier should have used ordinal logistic regression instead of n-class logistic classifier.

How a sequence of words is feeded to non-recurrent network?

They represent the sequence as a bag of n-grams, and feed that into the classifier, rather than feeding the sequence directly. The paper basically combines variants on a few old techniques (although a few of the variants are significant and recent), but the interesting result is that they show that put together in the right way and tweaked a little, they're competitive in accuracy with state-of-the-art deep neural network models, at least on some problems, while being much faster to train. Section 2 of the paper, although pretty brief, is where this info is.

Specifically the bag of n-grams can be viewed as a very sparse vector with non-zero entries corresponding to the n-grams in the bag. As a result, n-grams not seen during training need to be ignored.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact