Like other supervised learning techniques, structured prediction models are typically trained on observed data: the predicted value is compared with the ground truth, and the discrepancy is used to adjust the model parameters. Because of the complexity of the model and the interdependence of the predicted variables, exact training and inference are often computationally infeasible, so approximate inference and learning methods are used.
Applications
An example application is the problem of translating a natural language sentence into a syntactic representation such as a parse tree. This can be seen as a structured prediction problem[2] in which the structured output domain is the set of all possible parse trees. Structured prediction is used in a wide variety of domains including bioinformatics, natural language processing (NLP), speech recognition, and computer vision.
Example: sequence tagging
Sequence tagging is a class of problems prevalent in NLP in which input data are often sequential, for instance sentences of text. The sequence tagging problem appears in several guises, such as part-of-speech tagging (POS tagging) and named entity recognition. In POS tagging, for example, each word in a sequence must be 'tagged' with a class label representing the type of word:
This/DT is/VBZ a/DT tagged/JJ sentence/NN
The main challenge of this problem is to resolve ambiguity: in the above example, the words "sentence" and "tagged" in English can also be verbs.
While this problem can be solved by simply performing classification of individual tokens, this approach does not take into account the empirical fact that tags do not occur independently; instead, each tag displays a strong conditional dependence on the tag of the previous word. This fact can be exploited in a sequence model such as a hidden Markov model or conditional random field[2] that predicts the entire tag sequence for a sentence (rather than just individual tags) via the Viterbi algorithm.
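To illustrate how a sequence model can use the dependence between neighbouring tags, the following is a minimal sketch of Viterbi decoding for a first-order hidden Markov model. The function name `viterbi`, the tag set, and all probabilities are illustrative assumptions made up for this example, not estimates from any corpus. The emission probabilities are deliberately tied between competing tags, so only the transition probabilities resolve the ambiguity.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the highest-scoring state sequence for obs under a first-order HMM."""
    # best[t][s]: best log-score of any path that ends in state s at position t
    best = [{s: log_start[s] + log_emit[s].get(obs[0], -math.inf) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit[s].get(obs[t], -math.inf)
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

def logs(d):
    """Convert a dict of positive probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in d.items()}

# Toy model: every number below is an assumption chosen for illustration.
states = ["DT", "JJ", "NN", "VB"]
log_start = logs({"DT": 0.7, "JJ": 0.1, "NN": 0.1, "VB": 0.1})
log_trans = {
    "DT": logs({"DT": 0.01, "JJ": 0.39, "NN": 0.59, "VB": 0.01}),
    "JJ": logs({"DT": 0.01, "JJ": 0.10, "NN": 0.88, "VB": 0.01}),
    "NN": logs({"DT": 0.10, "JJ": 0.10, "NN": 0.30, "VB": 0.50}),
    "VB": logs({"DT": 0.50, "JJ": 0.20, "NN": 0.20, "VB": 0.10}),
}
# "tagged" and "sentence" are equally likely under two tags each.
log_emit = {
    "DT": logs({"a": 0.5}),
    "JJ": logs({"tagged": 0.2}),
    "NN": logs({"sentence": 0.2}),
    "VB": logs({"tagged": 0.2, "sentence": 0.2}),
}

tags = viterbi(["a", "tagged", "sentence"], states, log_start, log_trans, log_emit)
print(tags)  # -> ['DT', 'JJ', 'NN']
```

In this toy model a classifier that looked at each word in isolation could not choose between JJ and VB for "tagged", or between NN and VB for "sentence". The decoder prefers DT, JJ, NN because a determiner is far more likely to be followed by an adjective or noun than by a verb.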
One of the easiest ways to understand algorithms for general structured prediction is the structured perceptron by Collins.[3] This algorithm combines the perceptron algorithm for learning linear classifiers with an inference algorithm (classically the Viterbi algorithm when used on sequence data) and can be described abstractly as follows:
First, define a function $\phi(x, y)$ that maps a training sample $x$ and a candidate prediction $y$ to a vector of length $n$ ($x$ and $y$ may have any structure; $n$ is problem-dependent, but must be fixed for each model). Let $\mathrm{GEN}$ be a function that generates candidate predictions.
Then:
Let $w$ be a weight vector of length $n$.
For a predetermined number of iterations:
    For each sample $x$ in the training set with true output $t$:
        Make a prediction $\hat{y}$: $\hat{y} = \operatorname{arg\,max}_{y \in \mathrm{GEN}(x)} \, w^{T} \phi(x, y)$
        Update $w$ (from $\hat{y}$ towards $t$): $w = w + c\,(\phi(x, t) - \phi(x, \hat{y}))$, where $c$ is the learning rate.
In practice, finding the argmax over the set of candidate predictions is done using an algorithm such as Viterbi or max-sum, rather than an exhaustive search through an exponentially large set of candidates.
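The training loop described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not Collins's implementation: the names `phi`, `gen`, `predict`, and `train`, the tag set, the feature templates, and the single training sentence are all made up for the example. The fixed-length vectors are stored sparsely as dictionaries keyed by feature, and `gen` enumerates every tag sequence, which is exactly the exhaustive search that a real system would replace with Viterbi.

```python
from collections import Counter
from itertools import product

TAGS = ["DT", "JJ", "NN", "VBZ"]  # assumed toy tag set

def phi(x, y):
    """Feature map: counts of (word, tag) and (previous tag, tag) pairs."""
    feats = Counter()
    prev = "<s>"  # assumed start-of-sentence marker
    for word, tag in zip(x, y):
        feats[("emit", word, tag)] += 1
        feats[("trans", prev, tag)] += 1
        prev = tag
    return feats

def gen(x):
    """Generate every candidate tag sequence for x (viable only for toy inputs)."""
    return product(TAGS, repeat=len(x))

def score(w, feats):
    """Dot product of the sparse weight vector w with a sparse feature vector."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def predict(w, x):
    """Return the candidate with the highest score under the weights w."""
    return max(gen(x), key=lambda y: score(w, phi(x, y)))

def train(data, iterations=5, c=1.0):
    """Structured perceptron: move w from the prediction towards the true output."""
    w = {}
    for _ in range(iterations):
        for x, t in data:
            y_hat = predict(w, x)
            if y_hat == t:
                continue  # the update would be zero anyway
            for f, v in phi(x, t).items():
                w[f] = w.get(f, 0.0) + c * v
            for f, v in phi(x, y_hat).items():
                w[f] = w.get(f, 0.0) - c * v
    return w

x = ("this", "is", "a", "tagged", "sentence")
t = ("DT", "VBZ", "DT", "JJ", "NN")
w = train([(x, t)])
print(predict(w, x))  # -> ('DT', 'VBZ', 'DT', 'JJ', 'NN')
```

On this single sentence one update is enough: the initial all-zero weights produce an arbitrary first candidate, and after the correction the true tag sequence is the unique highest-scoring one. With realistic data, `predict` would use dynamic programming over the same features instead of enumerating `gen(x)`.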