May 17, 2013

Lemma what? A guide to Text Processing and Machine Learning API terms

After we posted the list of NLP, Sentiment Analysis, and Machine Learning APIs a while ago, we noticed that some API descriptions take a bit of digging before you can fully appreciate what the APIs can do.  Here’s an example:

Text analysis API including wordnet synsets,relation extraction,named entity recognition and classification,lemmatization,part of speech tagging,tokenization, and semantic role labeling. 

If you’re not familiar with these terms, you could easily miss the features this API offers.

To help with that, we have listed below explanations of some of these terms in the NLP/Machine Learning context, along with APIs (represented as numbered links) whose descriptions mention them.  Hopefully a basic understanding of these terms will help you appreciate what these APIs are capable of.

Stemming and Lemmatization (1, 2, 3)

Stemming is the process of removing and replacing word suffixes to arrive at a common root form of the word.  Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word. (Reference)

For example, from “produced”, the lemma is “produce”, but the stem is “produc-”.  This is because the same stem is shared by related words such as “production”. (Reference)
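
To make the difference concrete, here is a minimal sketch in Python. It is a toy suffix stripper, not the Porter algorithm or any particular API’s implementation, and the suffix list and the tiny lemma dictionary are assumptions made up for this example.

```python
# Toy suffix-stripping stemmer plus a lookup-table lemmatizer.
# Both the suffix list and the lemma table are illustrative assumptions.
SUFFIXES = ("tion", "ing", "ed", "es", "e", "s")  # checked in this order

# Hypothetical mini-dictionary; real lemmatizers rely on a full lexicon.
LEMMAS = {"produced": "produce", "producing": "produce", "produces": "produce"}


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ("produced", "production", "produce"):
    print(w, "-> stem:", stem(w), "| lemma:", lemmatize(w))
```

All three words share the stem “produc”, which is not a real word, while the lemmatizer maps “produced” back to the dictionary form “produce”.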

Text Analytics (1, 2, 3, 4)

Describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. (Reference)

Roughly equivalent to the term “text data mining”, it usually involves structuring the input text (usually by parsing), deriving patterns within the structured data, and finally evaluating and interpreting the output.  Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e. learning relations between named entities).  (Reference)

Tokenization (1)

Is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.  The list of tokens becomes input for further processing such as parsing or text mining. (Reference)
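
As a rough illustration, a very simple tokenizer can be written with one regular expression. This is only a sketch: the pattern below (runs of word characters, or single punctuation marks) is an assumption chosen for brevity, and real tokenizers handle abbreviations, contractions, and URLs far more carefully.

```python
import re

# Words (letters, digits, underscore) or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Jim bought 300 shares of Acme Corp. in 2006."))
```

Note that this naive pattern splits “Corp.” into two tokens, which is exactly the kind of case a production tokenizer has to treat specially.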

Named-entity recognition/extraction (1, 2, 3, 4, 5, 6, 7, 8)

Seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. (Reference)

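For example, an NER system takes an unannotated block of text, such as this one:

Jim bought 300 shares of Acme Corp. in 2006.

Producing an annotated block of text, such as this one:

[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.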

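A hedged sketch of the idea in Python follows. It uses a handful of hand-written patterns, which is an assumption made purely for illustration; the label names and the patterns are hypothetical, and real NER systems typically rely on statistical models rather than a few regular expressions.

```python
import re

# Hypothetical hand-written rules, one per entity category.
PATTERNS = [
    ("ORG", re.compile(r"\b[A-Z]\w+ (?:Corp|Inc|Ltd)\.")),
    ("DATE", re.compile(r"\b(?:19|20)\d{2}\b")),
    ("QUANTITY", re.compile(r"\b\d+\b(?= shares)")),
    ("PERSON", re.compile(r"\b(?:Jim|Mary|John)\b")),
]


def annotate(text):
    """Wrap every matched entity as [entity]LABEL, left to right."""
    spans = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()

    pieces, position = [], 0
    for start, end, label in spans:
        if start < position:  # skip spans overlapping an earlier match
            continue
        pieces.append(text[position:start])
        pieces.append(f"[{text[start:end]}]{label}")
        position = end
    pieces.append(text[position:])
    return "".join(pieces)


print(annotate("Jim bought 300 shares of Acme Corp. in 2006."))
```
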
Sentiment Analysis (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)

Aims to determine the attitude of a speaker or writer with respect to some topic or the overall contextual polarity of a document.  A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level – whether the expressed opinion in a document, a sentence, or an entity is positive, negative, or neutral. (Reference)
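
The simplest possible version of document-level polarity classification is a word-list lookup, sketched below. The two word lists are made-up assumptions for this example, and the approach ignores negation, sarcasm, and context, all of which real sentiment APIs have to deal with.

```python
import re

# Hypothetical mini-lexicons; real systems use much larger resources or trained models.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}


def polarity(text):
    """Classify text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in ("I love this great phone",
                 "The battery is terrible",
                 "It arrived on Tuesday"):
    print(sentence, "->", polarity(sentence))
```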

Summarization (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

The process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. (Reference)
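
One common family of approaches is extractive: pick the sentences that seem most important and drop the rest. The sketch below scores sentences by average word frequency, which is just one simple heuristic chosen for illustration; the stopword list and the scoring rule are assumptions, not how any particular API works.

```python
import re
from collections import Counter

# Small, hypothetical stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "on", "for", "are", "was"}


def content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the whole document."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(frequency[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)


document = "Cats sleep a lot. Cats and dogs are popular pets. The weather was mild."
print(summarize(document))
```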

Chunking (1, 2, 3)

Also called “light parsing”, is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence. It is similar to the concept of lexical analysis for computer languages. (Reference)

(Word-sense) Disambiguation (1, 2, 3, 4)

The process of identifying which sense of a word (meaning) is used in a sentence, when the word has multiple meanings.

Consider the word “bass”, which could mean either “a type of fish” or “tones of low frequency”, depending on the sentence it is used in. (Reference)
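
A classic way to approach this is to compare the words around the ambiguous word with a short description of each sense and pick the sense with the most overlap (a simplified version of the Lesk algorithm). In the sketch below, the sense descriptions are extended versions of the two meanings above and, like the stopword list, are assumptions written for this example.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "in", "and", "or", "to", "on", "is"}

# Hypothetical sense descriptions (glosses) for one ambiguous word.
SENSES = {
    "bass": {
        "fish": "a type of fish caught in a lake river or sea",
        "music": "tones of low frequency in a song or other music sound",
    }
}


def content_words(text):
    """Set of lowercased words with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def disambiguate(word, sentence):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = content_words(sentence) - {word}
    overlaps = {sense: len(context & content_words(gloss))
                for sense, gloss in SENSES[word].items()}
    return max(overlaps, key=overlaps.get)


print(disambiguate("bass", "He caught a huge bass in the lake"))
print(disambiguate("bass", "The bass in this song is too loud"))
```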

Part-of-speech tagging (1, 2, 3, 4)

(POS) Also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph.  (Reference)

To check out the difference between chunking and POS, check here.
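
The difference can also be sketched in a few lines of Python. Here the POS tags are supplied by hand (Penn Treebank style tag names, assumed for this example) rather than produced by a tagger, and the chunker simply groups runs of determiners, adjectives, and nouns into noun chunks. It is a deliberately naive rule, not a full chunking grammar.

```python
# POS tagging labels each word; chunking groups tagged words into phrases.
# The tags below are written by hand for illustration.
TAGGED = [("Mary", "NNP"), ("sold", "VBD"), ("the", "DT"),
          ("book", "NN"), ("to", "TO"), ("John", "NNP")]


def noun_chunks(tagged):
    """Group consecutive determiner/adjective/noun tokens into noun chunks."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("DT", "JJ") or tag.startswith("NN"):
            current.append(word)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


print(noun_chunks(TAGGED))
```

POS tagging answers “what kind of word is this?” for each token, while chunking answers “which neighbouring words belong together?” without building a full parse tree.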

Semantic role labeling (1)

Also sometimes called shallow semantic parsing, is a task in natural language processing consisting of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles.  For example, given a sentence like “Mary sold the book to John”, the task would be to recognize the verb “to sell” as representing the predicate, “Mary” as representing the seller (agent), “the book” as representing the goods (theme), and “John” as representing the recipient.  (Reference)

Collaborative filtering

In the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).  The underlying assumption of the collaborative filtering approach is that if person A has the same opinion as a person B on an issue, A is more likely to have B’s opinion on a different issue x than to have the opinion on x of a person chosen randomly.  For example, a collaborative filtering recommendation system for television tastes could make predictions about which television show a user should like given a partial list of that user’s tastes (likes or dislikes).  Note that these predictions are specific to the user, but use information gleaned from many users.  This differs from the simpler approach of giving an average score for each item of interest, for example based on its number of votes.
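
Here is a minimal sketch of user-based collaborative filtering. The ratings are invented for this example, and cosine similarity over co-rated shows is just one of several similarity measures that could be used; treat it as an illustration rather than a recipe.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale.
RATINGS = {
    "alice": {"Show A": 5, "Show B": 4, "Show C": 1},
    "bob": {"Show A": 5, "Show B": 5, "Show D": 4},
    "carol": {"Show A": 1, "Show C": 5, "Show D": 2},
}


def similarity(u, v):
    """Cosine similarity over the shows both users have rated."""
    common = set(RATINGS[u]) & set(RATINGS[v])
    if not common:
        return 0.0
    dot = sum(RATINGS[u][i] * RATINGS[v][i] for i in common)
    norm_u = sqrt(sum(RATINGS[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(RATINGS[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = total_weight = 0.0
    for other, ratings in RATINGS.items():
        if other == user or item not in ratings:
            continue
        weight = similarity(user, other)
        weighted_sum += weight * ratings[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


print(round(predict("alice", "Show D"), 2))
```

Because Alice’s tastes are much closer to Bob’s than to Carol’s, the prediction for “Show D” lands nearer Bob’s rating of 4 than the plain average of 3.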

Cluster analysis or clustering

Is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).  It is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. (Reference)  Also see “Classification” below.
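
The “small distances among the cluster members” notion is what k-means clustering captures. Below is a minimal one-dimensional sketch; the data points and the starting centroids are assumptions picked for the example, and real implementations work in many dimensions and choose starting points more carefully.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 2.0])
print(centroids)
print(clusters)
```

Even with both starting centroids placed in the lower group, the algorithm settles on one cluster around 1.5 and another around 10.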

Classification (1, 2, 3, 4, 5, 6, 7, 8)

Is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.  Classification is considered an instance of supervised learning, i.e. learning where a training set of correctly-identified observations is available.  The corresponding unsupervised procedure is known as clustering (cluster analysis), and involves grouping data into categories based on some measure of inherent similarity. (Reference)

Supervised versus Unsupervised Learning

Machine learning algorithms are described as either ‘supervised’ or ‘unsupervised’.  The distinction is drawn from how the learner classifies data.  In supervised algorithms, the classes are predetermined.  These classes can be conceived of as a finite set, previously arrived at by a human.  In practice, a certain segment of data will be labelled with these classifications.  The machine learner’s task is to search for patterns and construct mathematical models.  These models are then evaluated on the basis of their predictive capability in relation to measures of variance in the data itself.  Decision tree induction and naive Bayes are some examples of supervised learning techniques.

Unsupervised learners are not provided with classifications.  Unsupervised algorithms seek out similarity between pieces of data in order to determine whether they can be characterized as forming a group.  These groups are termed clusters, and there is a whole family of clustering machine learning techniques. (Reference)
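
To show what “labelled data in, predictive model out” looks like, here is a small naive Bayes text classifier. The four training messages and the “spam”/“ham” labels are invented for this example, and the code is a bare-bones sketch with add-one smoothing rather than a tuned implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training data: the classes are fixed in advance by a human.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]


def train(examples):
    """Count how often each class and each (class, word) pair occurs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Pick the class with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        words_in_class = sum(word_counts[label].values())
        for word in text.split():
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1)
                                  / (words_in_class + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify("free prize", model))
print(classify("monday meeting", model))
```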

Recommender system (1, 2)

A subclass of information filtering systems that seek to predict the ‘rating’ or ‘preference’ that a user would give to an item (such as music, books, or movies) or social element (people or groups) they had not yet considered, using a model built from the characteristics of an item (content-based approaches) or the user’s social environment (collaborative filtering approaches). (Reference)

Recommender systems have become extremely common in recent years. A few examples of such systems:

  • When viewing a product in an online store, the store will recommend additional items based on a matrix of what other shoppers bought along with the currently selected item.
  • Pandora Radio takes an initial input of a song or musician and plays music with similar characteristics (based on a series of keywords attributed to the inputted artist or piece of music). The stations created by Pandora can be refined through user feedback (emphasizing or deemphasizing certain characteristics).
  • Netflix offers predictions of movies that a user might like to watch based on the user’s previous ratings and watching habits (as compared to the behavior of other users), also taking into account the characteristics (such as the genre) of the film.
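
The content-based approach (matching items by their characteristics, in the spirit of the keyword example above) can be sketched as follows. The catalog, the tags, and the use of Jaccard similarity are all assumptions made for illustration and do not describe how any of the services mentioned actually work.

```python
# Hypothetical catalog: each item is described by a set of keywords.
CATALOG = {
    "Song A": {"rock", "guitar", "fast"},
    "Song B": {"rock", "guitar", "slow"},
    "Song C": {"jazz", "piano", "slow"},
    "Song D": {"classical", "piano", "orchestra"},
}


def jaccard(a, b):
    """Shared keywords divided by all keywords used by either item."""
    return len(a & b) / len(a | b)


def recommend(seed, k=2):
    """Return the k items whose keywords are most similar to the seed item."""
    scored = [(jaccard(CATALOG[seed], tags), title)
              for title, tags in CATALOG.items() if title != seed]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, then title
    return [title for _, title in scored[:k]]


print(recommend("Song A", k=1))
```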

Neural Networks

A computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs. (Reference)

The power of neural networks is that they can make reasonable guesses about results for queries they have never seen before, based on similarity to other queries.  For example, search engines can use neural network algorithms to provide best-guess answers to queries that have not been typed in before.
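
To give a feel for what “simple, highly interconnected processing elements” means, here is a tiny network of three threshold neurons that computes XOR. The weights are picked by hand for this example; in a real neural network they are learned from training data, which is where the ability to generalize to unseen inputs comes from.

```python
def step(x):
    """Threshold activation: fire (1) if the input is positive, else 0."""
    return 1 if x > 0 else 0


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    return step(sum(i * w for i, w in zip(inputs, weights)) + bias)


def xor_network(x1, x2):
    """Two hidden neurons (OR and AND) feeding one output neuron."""
    h_or = neuron((x1, x2), (1, 1), -0.5)
    h_and = neuron((x1, x2), (1, 1), -1.5)
    return neuron((h_or, h_and), (1, -1), -0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

No single neuron here can compute XOR on its own; it is the connections between them that make it possible.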