News Article Categorization (Subject / Entity Analysis via NLP?); Preferably in Node.js

Objective: a node.js function that can be passed a news article (title, text, tags, etc.) and will return a category for that article ("Technology", "Fashion", "Food", etc.)

I'm not picky about exactly what categories are returned, as long as the list of possible results is finite and reasonable (10-50).

There are Web APIs that do this (e.g., Alchemy), but I'd prefer not to incur the extra cost (both in terms of external HTTP requests and also $$) if possible.

I've had a look at the node module "natural". I'm a bit new to NLP, but it seems like maybe I could achieve this by training a BayesClassifier on a reasonable word list. Does this seem like a good/logical approach? Can you think of anything better?

node.js
nlp

I don't know if you are still looking for an answer, but let me put in my two cents for anyone who happens to come back to this question.

Having worked in NLP, I would suggest the following approach to the problem. Don't look for a single-package solution. There are great packages out there for lots of things, no doubt. But when it comes to active research areas like NLP, ML, and optimization, the tools tend to be at least 3 or 4 iterations behind what is available in academia.

Now to the core problem: what you want to achieve is text classification. The simplest way to do this is a multiclass SVM classifier. Simplest, yes, but also with very, very (note the double stress) reasonable classification accuracy, runtime performance, and ease of use.

What you need to work on is the feature set used to represent your news article (title, text, tags). You could use a bag-of-words model and add named entities as additional features. You could also use the article's location or time as features, though for simple category classification this might not give you much improvement. The bottom line: SVMs work great, there are multiple implementations, and at runtime you don't really need much ML machinery. Feature engineering, on the other hand, is very task-specific. But given a basic set of features and good labelled data, you can train a very decent classifier.
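To make the bag-of-words idea concrete, here is a minimal sketch in plain Node.js. The helper names (`tokenize`, `buildVocabulary`, `toFeatureVector`) and the sample articles are made up for illustration and do not come from any library; a real feature set would likely add stop-word removal, stemming, and the named-entity features mentioned above.

```javascript
// Hypothetical helpers for illustration only; not part of any library.

// Lowercase the text and split it into alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Assign each distinct token seen in the training documents a column index.
function buildVocabulary(documents) {
  const vocab = new Map();
  for (const doc of documents) {
    for (const token of tokenize(doc)) {
      if (!vocab.has(token)) vocab.set(token, vocab.size);
    }
  }
  return vocab;
}

// Turn a text into a vector of token counts; unknown tokens are ignored.
function toFeatureVector(text, vocab) {
  const vector = new Array(vocab.size).fill(0);
  for (const token of tokenize(text)) {
    const index = vocab.get(token);
    if (index !== undefined) vector[index] += 1;
  }
  return vector;
}

// Made-up sample data.
const articles = [
  'New phone chip doubles battery life',
  'Chef shares pasta recipe',
];
const vocab = buildVocabulary(articles);
const features = toFeatureVector('Phone battery recipe battery', vocab);
console.log(features);
```

Each article becomes a fixed-length numeric vector, which is the input format an SVM trainer expects.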

Here are some resources for you: http://svmlight.joachims.org/ (SVM multiclass is what you would be interested in).

And here is a tutorial by the SVM zen master himself: http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf

I don't know about the stability of this, but from the code it's a binary SVM classifier. That means that if you have a known set of N category tags you want to classify the text into, you will have to train N binary SVM classifiers, one for each category tag.
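The one-classifier-per-category scheme (often called one-vs-rest) can be sketched roughly as follows. This is only an illustration under assumptions: `trainBinarySvm` is a toy linear SVM trained with plain subgradient descent on the hinge loss, standing in for whatever real binary SVM implementation you end up using, and all function names, hyperparameters, and the tiny data set are made up.

```javascript
// Toy stand-in for a real binary SVM: linear model, hinge loss, L2 penalty,
// trained by subgradient descent. Labels must be +1 or -1.
function trainBinarySvm(vectors, labels, { epochs = 200, learningRate = 0.1, lambda = 0.01 } = {}) {
  const weights = new Array(vectors[0].length).fill(0);
  let bias = 0;
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (let i = 0; i < vectors.length; i++) {
      const x = vectors[i];
      const y = labels[i];
      const margin = y * (dot(weights, x) + bias);
      for (let j = 0; j < weights.length; j++) {
        // The penalty always shrinks the weights; the hinge term only
        // contributes when the example falls inside the margin.
        const hingeGradient = margin < 1 ? -y * x[j] : 0;
        weights[j] -= learningRate * (lambda * weights[j] + hingeGradient);
      }
      if (margin < 1) bias += learningRate * y;
    }
  }
  return { weights, bias };
}

function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Train one binary classifier per category: that category vs. everything else.
function trainOneVsRest(vectors, categories, options) {
  const models = new Map();
  for (const category of new Set(categories)) {
    const labels = categories.map((c) => (c === category ? 1 : -1));
    models.set(category, trainBinarySvm(vectors, labels, options));
  }
  return models;
}

// Pick the category whose classifier gives the highest score.
function classify(models, vector) {
  let bestCategory = null;
  let bestScore = -Infinity;
  for (const [category, { weights, bias }] of models) {
    const score = dot(weights, vector) + bias;
    if (score > bestScore) {
      bestScore = score;
      bestCategory = category;
    }
  }
  return bestCategory;
}

// Made-up word-count vectors over a three-word vocabulary.
const trainingVectors = [[2, 0, 0], [1, 0, 0], [0, 2, 0], [0, 1, 0], [0, 0, 2], [0, 0, 1]];
const trainingCategories = ['Technology', 'Technology', 'Food', 'Food', 'Fashion', 'Fashion'];
const models = trainOneVsRest(trainingVectors, trainingCategories);
console.log(classify(models, [3, 0, 0]));
```

At runtime only the N weight vectors and biases are needed, which is why classification itself requires very little ML machinery.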

Hope this helps.