The fundamental concepts of NLP differ from those of Machine Learning or Software Engineering in general. I will start with the most low-level things (which doesn’t mean “simple”, though) and then try to show you how they build up to a production model.
Tokenizer
This is a core tool for every NLP framework. Many ML techniques, whether they aim at text classification or regression, use n-grams and the features produced from them. But before you can start extracting features, you need to get the words.
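For example, here is a minimal sketch of tokenization and n-gram extraction in plain Python (a real framework's tokenizer handles far more edge cases than this regex does):

```python
import re

def tokenize(text):
    """Naive tokenizer: lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """All contiguous n-grams over a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The movie was not bad at all.")
bigrams = ngrams(tokens, 2)  # [('the', 'movie'), ('movie', 'was'), ...]
```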
POS-tagger and lemmatizer
This is the next thing you will need, although maybe not directly. Words can take many forms, and the connections between them (as you will see below) depend on their POS. Lemmatizers are involved most often when something like a term-document matrix (TDM) is needed, because they naturally reduce the dimensionality and lead to greater overall robustness.
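To see why the POS matters for lemmatization, here is a toy, hand-rolled sketch (the suffix rules are illustrative assumptions, not real morphology; a real system would use something like NLTK's WordNetLemmatizer):

```python
# Illustrative suffix rules: the same surface form can reduce differently
# depending on its part of speech.
SUFFIX_RULES = {
    "NOUN": [("ies", "y"), ("s", "")],
    "VERB": [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")],
}

def lemmatize(word, pos):
    """Apply the first matching suffix rule for the given POS, if any."""
    for suffix, repl in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)] + repl
    return word

lemmatize("studies", "NOUN")  # 'study'
lemmatize("talked", "VERB")   # 'talk'
```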
NER
Which stands for Named Entity Recognizer. NERs rely on extracted parts of speech and basic grammars encoded in the frameworks. There is a separate part of NLP, called information extraction, where people do really cool things like automated generation of reports based on several messages about a topic. NER is certainly the biggest part of it. If you want to understand it deeply, you can read about Context-Free Grammars.
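As a crude illustration of the rule-based idea (this toy only groups consecutive capitalized tokens; real NERs combine POS tags, grammars and entity dictionaries):

```python
def find_entities(sentence):
    """Toy entity spotter: runs of capitalized, non-sentence-initial tokens."""
    tokens = sentence.replace(",", " ").replace(".", " ").split()
    entities, current = [], []
    for i, tok in enumerate(tokens):
        if i > 0 and tok[:1].isupper():
            current.append(tok)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

ents = find_entities("The plane from New York landed near John Smith.")
# ents == ['New York', 'John Smith']
```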
Sentiment analysis
Is this review good or bad? Did the critic like the movie? Put these 1 000 000 reviews into this machine and it will be able to tell. There are several ways to perform sentiment analysis; some people even use deep learning (e.g., word2vec). It starts with feature extraction: usually you compute a TDM from 2- and 3-grams that contain sentiment-related words taken from dictionaries (semi- and fully supervised models), or build the dictionaries from the word distribution itself (un- and semi-supervised models). The TDM is then used as a feature matrix, which is fed to a neural net, an SVM, or whatever the end-point algorithm happens to be.
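A minimal lexicon-based sketch of the dictionary approach (the word lists here are made up for illustration; the bigrams are what let you catch negations like "not bad"):

```python
POSITIVE = {"good", "great", "brilliant", "liked"}
NEGATIVE = {"bad", "boring", "awful", "hated"}
NEGATIONS = {"not", "never", "no"}

def sentiment(text):
    """Sum word polarities; a preceding negation (bigram) flips the sign."""
    tokens = text.lower().replace(".", "").replace(",", "").split()
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        # The bigram (previous word, current word) flips polarity after a negation.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score

sentiment("The movie was not bad, the acting was great")  # positive overall
```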
I will address a fairly well-known task called text regression. You have a text and a number associated with it. The problem is that the text itself is not a numerical dataset, so you can’t use it directly. One of the simplest approaches is an algorithm you can implement immediately after you read this answer. It doesn’t utilize the whole power of NLP, but it provides a good introduction.
Cast the text to lowercase, remove punctuation, numbers, etc.
Compute TF-IDF scores (check out the article on Wikipedia) for each word and put them in a table so that the columns represent words and the rows represent documents.
Eliminate the words whose columns contain an excessive number of zeroes. What counts as excessive is entirely up to you. I won’t tell you; try it out.
Hint: look at the distribution of “popularity” among words.
Fit a model and validate it.
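The whole recipe can be sketched in plain Python (a minimal sketch; in practice you would reach for scikit-learn's TfidfVectorizer):

```python
import math
import re

def preprocess(text):
    # Step 1: lowercase, drop punctuation and numbers.
    return re.findall(r"[a-z]+", text.lower())

def tfidf_table(docs):
    """Rows are documents, columns are words (here, a list of dicts)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(docs)
    df = {}  # document frequency of each word
    for tokens in tokenized:
        for w in set(tokens):
            df[w] = df.get(w, 0) + 1
    table = []
    for tokens in tokenized:
        row = {}
        for w in set(tokens):
            tf = tokens.count(w) / len(tokens)
            idf = math.log(n_docs / df[w])
            row[w] = tf * idf
        table.append(row)
    return table

table = tfidf_table(["The cat sat.", "The dog sat twice.", "The dog barked."])
# "the" occurs in every document, so its score is 0; rarer words score higher.
```

From here, the pruned table goes straight into any regression model as a feature matrix.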
To the void
There are endless ways to make your application more powerful. Every tool I described in the first part of the answer can provide you with hundreds of potential features. Add a column with a sentiment rating. Extract all entities from the corpus and use them as features when you compute the TDM. Cluster the documents using their TF-IDF representations. Filter the words by their POS: say, let's keep only nouns, verbs and adjectives. What will happen then?
I hope this gives you some perspective on how NLP can be learned for practical purposes. As for academic ones, you could read some papers from the ACL, for example.
Association for Computational Linguistics
word2vec – Tool for computing continuous distributed representations of words – Google Project Hosting
The Stanford NLP (Natural Language Processing) Group
Contributed by Roman Trusov to Quora