Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as in semantic parsing and the generation of new samples to expand existing corpora.


Paraphrase generation


Multiple sequence alignment

Barzilay and Lee proposed a method to generate paraphrases through the use of monolingual parallel corpora, namely news articles covering the same event on the same day. Training consists of using multi-sequence alignment to generate sentence-level paraphrases from an unannotated corpus. This is done by:

* finding recurring patterns in each individual corpus, e.g. "X (injured/wounded) Y people, Z seriously", where X, Y, and Z are variables
* finding pairings between such patterns that represent paraphrases, e.g. "X (injured/wounded) Y people, Z seriously" and "Y were (wounded/hurt) by X, among them Z were in serious condition"

This is achieved by first clustering similar sentences together using n-gram overlap, as sketched below. Recurring patterns are found within clusters by using multi-sequence alignment. The positions of argument words are then determined by finding areas of high variability within each cluster, that is, positions between words shared by more than 50% of a cluster's sentences. Pairings between patterns are found by comparing similar variable words between different corpora. Finally, new paraphrases can be generated by choosing a matching cluster for a source sentence, then substituting the source sentence's arguments into any number of patterns in the cluster.
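The clustering step can be illustrated with a short Python sketch that groups sentences by n-gram overlap. The Jaccard measure, the greedy strategy, and the threshold below are illustrative assumptions, not details of Barzilay and Lee's system:

```python
# Sketch: cluster sentences by n-gram overlap (Jaccard similarity).
# The greedy strategy and threshold are illustrative assumptions only.

def ngrams(sentence, n=2):
    """Return the set of word n-grams in a sentence."""
    words = sentence.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(s1, s2, n=2):
    """Jaccard similarity between the n-gram sets of two sentences."""
    a, b = ngrams(s1, n), ngrams(s2, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster(sentences, threshold=0.2, n=2):
    """Greedily group sentences whose n-gram overlap exceeds a threshold."""
    clusters = []
    for s in sentences:
        for c in clusters:
            if any(overlap(s, member, n) > threshold for member in c):
                c.append(s)
                break
        else:  # no existing cluster matched: start a new one
            clusters.append([s])
    return clusters

corpus = [
    "5 people were injured, 2 seriously",
    "3 people were injured, 1 seriously",
    "the market closed higher today",
]
print(cluster(corpus))  # the two injury sentences group together
```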


Phrase-based machine translation

Paraphrases can also be generated through the use of phrase-based translation, as proposed by Bannard and Callison-Burch. The chief concept consists of aligning phrases in a pivot language to produce potential paraphrases in the original language. For example, the phrase "under control" in an English sentence is aligned with the phrase "unter kontrolle" in its German counterpart. The phrase "unter kontrolle" is then found in another German sentence with the aligned English phrase "in check", a paraphrase of "under control".

The probability distribution can be modeled as \Pr(e_2 \mid e_1), the probability that phrase e_2 is a paraphrase of e_1, which is equivalent to \Pr(e_2 \mid f) \Pr(f \mid e_1) summed over all f, a potential phrase translation in the pivot language. Additionally, the sentence S in which e_1 appears is added as a prior to give context to the paraphrase. Thus the optimal paraphrase, \hat{e}_2, can be modeled as:

: \hat{e}_2 = \arg \max_{e_2 \neq e_1} \Pr(e_2 \mid e_1, S) = \arg \max_{e_2 \neq e_1} \sum_f \Pr(e_2 \mid f, S) \Pr(f \mid e_1, S)

\Pr(e_2 \mid f) and \Pr(f \mid e_1) can be approximated by simply taking their frequencies. Adding S as a prior is modeled by calculating the probability of forming S when e_1 is substituted with e_2.
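A minimal sketch of the pivot computation, ignoring the sentence-context prior S; the phrase-table probabilities below are toy values invented for illustration:

```python
# Sketch of pivot-based paraphrase scoring (Bannard & Callison-Burch),
# without the sentence-context prior S. Phrase tables are toy numbers.

# Pr(f | e): English phrase -> German (pivot) translation probabilities
e2f = {"under control": {"unter kontrolle": 0.8, "im griff": 0.2}}
# Pr(e | f): German phrase -> English translation probabilities
f2e = {
    "unter kontrolle": {"under control": 0.7, "in check": 0.3},
    "im griff": {"under control": 0.5, "in check": 0.5},
}

def paraphrase_prob(e2, e1):
    """Pr(e2 | e1) = sum over pivot phrases f of Pr(e2 | f) * Pr(f | e1)."""
    return sum(f2e.get(f, {}).get(e2, 0.0) * p_f
               for f, p_f in e2f.get(e1, {}).items())

def best_paraphrase(e1, candidates):
    """arg max over e2 != e1 of Pr(e2 | e1)."""
    return max((c for c in candidates if c != e1),
               key=lambda c: paraphrase_prob(c, e1))

print(paraphrase_prob("in check", "under control"))  # 0.3*0.8 + 0.5*0.2 = 0.34
print(best_paraphrase("under control", ["under control", "in check"]))
```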


Long short-term memory

There has been success in using long short-term memory (LSTM) models to generate paraphrases. In short, the model consists of an encoder and a decoder component, both implemented using variations of a stacked residual LSTM. First, the encoding LSTM takes a one-hot encoding of all the words in a sentence as input and produces a final hidden vector, which can represent the input sentence. The decoding LSTM takes the hidden vector as input and generates a new sentence, terminating in an end-of-sentence token. The encoder and decoder are trained to take a phrase and reproduce the one-hot distribution of a corresponding paraphrase by minimizing perplexity using simple stochastic gradient descent. New paraphrases are generated by inputting a new phrase to the encoder and passing the output to the decoder.
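A minimal PyTorch sketch of such an encoder-decoder, assuming token-id inputs in place of explicit one-hot vectors and omitting the residual connections between stacked layers for brevity:

```python
# Minimal PyTorch sketch of an LSTM encoder-decoder for paraphrase
# generation. Sizes are arbitrary; the residual connections between
# stacked LSTM layers described in the text are omitted for brevity.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 10_000, 256, 512

class Seq2SeqLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)  # stands in for one-hot input
        self.encoder = nn.LSTM(EMB, HID, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, num_layers=2, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)       # distribution over vocabulary

    def forward(self, src_ids, tgt_ids):
        # Final encoder state acts as the hidden vector for the sentence.
        _, state = self.encoder(self.embed(src_ids))
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)               # logits per target position

model = Seq2SeqLSTM()
src = torch.randint(0, VOCAB, (8, 20))  # batch of source sentences (token ids)
tgt = torch.randint(0, VOCAB, (8, 22))  # corresponding paraphrases
logits = model(src, tgt[:, :-1])
# Minimizing cross-entropy on the shifted target minimizes perplexity.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
loss.backward()
```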


Transformers

With the introduction of Transformer models, paraphrase generation approaches improved their ability to generate text by scaling neural network parameters and heavily parallelizing training through feed-forward layers. These models are so fluent in generating text that human experts cannot identify whether an example was human-authored or machine-generated. Transformer-based paraphrase generation relies on autoencoding, autoregressive, or sequence-to-sequence methods. Autoencoder models predict word replacement candidates with a one-hot distribution over the vocabulary, while autoregressive and sequence-to-sequence models generate new text based on the source, predicting one word at a time. More advanced efforts also exist to make paraphrasing controllable according to predefined quality dimensions, such as semantic preservation or lexical diversity. Many Transformer-based paraphrase generation methods rely on unsupervised learning to leverage large amounts of training data and scale their methods.
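As a sketch, a fine-tuned sequence-to-sequence checkpoint can be applied to paraphrase generation through the Hugging Face Transformers API; the model name below is a placeholder, not a reference to a specific published model:

```python
# Sketch: paraphrase generation with a sequence-to-sequence Transformer
# via Hugging Face Transformers. "some-org/paraphrase-model" is a
# hypothetical placeholder; substitute any seq2seq checkpoint
# fine-tuned for paraphrasing.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "some-org/paraphrase-model"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("The situation is under control.", return_tensors="pt")
# Sampling (do_sample=True) trades some fidelity for lexical diversity.
outputs = model.generate(**inputs, num_return_sequences=3,
                         do_sample=True, max_new_tokens=32)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```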


Paraphrase recognition


Recursive autoencoders

Paraphrase recognition has been attempted by Socher et al. through the use of recursive autoencoders. The main concept is to produce a vector representation of a sentence and its components by recursively applying an autoencoder. Since paraphrases should have similar vector representations, the representations are processed and then fed as input into a neural network for classification.

Given a sentence W with m words, the autoencoder is designed to take two n-dimensional word embeddings as input and produce an n-dimensional vector as output. The same autoencoder is applied to every pair of words in W to produce \lfloor m/2 \rfloor vectors. The autoencoder is then applied recursively with the new vectors as inputs until a single vector is produced. Given an odd number of inputs, the first vector is forwarded as-is to the next level of recursion. The autoencoder is trained to reproduce every vector in the full recursion tree, including the initial word embeddings.

Given two sentences W_1 and W_2 of length 4 and 3 respectively, the autoencoders would produce 7 and 5 vector representations, including the initial word embeddings. The Euclidean distance is then taken between every combination of vectors in W_1 and W_2 to produce a similarity matrix S \in \mathbb{R}^{7 \times 5}. S is then passed through a dynamic min-pooling layer to produce a fixed-size n_p \times n_p matrix. Since S is not uniform in size among all potential sentence pairs, it is split into n_p roughly even sections. The output is then normalized to have mean 0 and standard deviation 1 and is fed into a fully connected layer with a softmax output. The dynamic pooling to softmax model is trained using pairs of known paraphrases.
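The similarity-matrix and dynamic min-pooling steps can be sketched in NumPy as follows; the even-chunking scheme below is a simplification of the published dynamic pooling procedure:

```python
# Sketch of the similarity matrix and dynamic min-pooling step (NumPy).
# The even-chunking scheme simplifies Socher et al.'s dynamic pooling;
# it only illustrates how a fixed-size output is obtained.
import numpy as np

def similarity_matrix(vecs1, vecs2):
    """Euclidean distance between every pair of node vectors."""
    return np.linalg.norm(vecs1[:, None, :] - vecs2[None, :, :], axis=-1)

def dynamic_min_pool(S, n_p=4):
    """Split S into an n_p x n_p grid of roughly even regions, keep each min."""
    rows = np.array_split(np.arange(S.shape[0]), n_p)
    cols = np.array_split(np.arange(S.shape[1]), n_p)
    return np.array([[S[np.ix_(r, c)].min() for c in cols] for r in rows])

vecs1 = np.random.randn(7, 50)   # 7 node vectors from a 4-word sentence
vecs2 = np.random.randn(5, 50)   # 5 node vectors from a 3-word sentence
S = similarity_matrix(vecs1, vecs2)               # shape (7, 5)
pooled = dynamic_min_pool(S, n_p=4)               # fixed shape (4, 4)
pooled = (pooled - pooled.mean()) / pooled.std()  # normalize for the classifier
```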


Skip-thought vectors

Skip-thought vectors are an attempt to create a vector representation of the semantic meaning of a sentence, similarly to the skip-gram model. Skip-thought vectors are produced through the use of a skip-thought model, which consists of three key components: an encoder and two decoders. Given a corpus of documents, the skip-thought model is trained to take a sentence as input and encode it into a skip-thought vector. The skip-thought vector is used as input for both decoders; one attempts to reproduce the previous sentence and the other the following sentence in its entirety. The encoder and decoders can be implemented through the use of a recurrent neural network (RNN) or an LSTM. Since paraphrases carry the same semantic meaning as one another, they should have similar skip-thought vectors. Thus a simple logistic regression can be trained to good performance with the absolute difference and component-wise product of two skip-thought vectors as input.
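A minimal scikit-learn sketch of that classifier, with random vectors standing in for real skip-thought encodings:

```python
# Sketch: paraphrase classification from skip-thought vectors with
# logistic regression. Random vectors stand in for real skip-thought
# encodings; features follow the |u - v| and u * v recipe in the text.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, dim = 200, 128
u = rng.normal(size=(n_pairs, dim))        # skip-thought vector of sentence 1
v = rng.normal(size=(n_pairs, dim))        # skip-thought vector of sentence 2
labels = rng.integers(0, 2, size=n_pairs)  # 1 = paraphrase pair, 0 = not

# Absolute difference and component-wise product, concatenated.
features = np.hstack([np.abs(u - v), u * v])
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(features[:5]))
```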


Transformers

Similar to how Transformer models influenced paraphrase generation, their application in identifying paraphrases showed great success. Models such as BERT can be adapted with a binary classification layer and trained end-to-end on identification tasks. Transformers achieve strong results when transferring between domains and paraphrasing techniques compared to more traditional machine learning methods such as logistic regression. Other successful methods based on the Transformer architecture include using adversarial learning and meta-learning.
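A sketch of the BERT-style setup, using Hugging Face Transformers; the classification head below is freshly initialized and would still need fine-tuning on labeled pairs:

```python
# Sketch: paraphrase identification with a Transformer encoder plus a
# binary classification head (Hugging Face Transformers). The head is
# randomly initialized here and would be fine-tuned end-to-end.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Sentence pairs are packed into one input with segment separators.
inputs = tokenizer("The cat sat on the mat.",
                   "A cat was sitting on the mat.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# After fine-tuning: [P(not paraphrase), P(paraphrase)]
print(torch.softmax(logits, dim=-1))
```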


Evaluation

Multiple methods can be used to evaluate paraphrases. Since paraphrase recognition can be posed as a classification problem, most standard evaluation metrics such as accuracy, F1 score, or an ROC curve do relatively well. However, F1 scores are difficult to calculate due to the trouble of producing a complete list of paraphrases for a given phrase, and the fact that good paraphrases depend on context. A metric designed to counter these problems is ParaMetric. ParaMetric aims to calculate the precision and recall of an automatic paraphrase system by comparing the automatic alignment of paraphrases to a manual alignment of similar phrases. Since ParaMetric simply rates the quality of phrase alignment, it can also be used to rate paraphrase generation systems, assuming they use phrase alignment as part of their generation process. A notable drawback of ParaMetric is the large and exhaustive set of manual alignments that must be created before a rating can be produced.

The evaluation of paraphrase generation has difficulties similar to those of the evaluation of machine translation. The quality of a paraphrase depends on its context, whether it is being used as a summary, and how it is generated, among other factors. Additionally, a good paraphrase is usually lexically dissimilar from its source phrase. The simplest method to evaluate paraphrase generation is through human judges; unfortunately, evaluation by human judges tends to be time-consuming. Automated approaches to evaluation prove challenging, as evaluation is essentially a problem as difficult as paraphrase recognition.

While originally used to evaluate machine translations, bilingual evaluation understudy (BLEU) has been used successfully to evaluate paraphrase generation models as well. However, paraphrases often have several lexically different but equally valid solutions, which hurts BLEU and similar evaluation metrics. Metrics specifically designed to evaluate paraphrase generation include paraphrase in n-gram change (PINC) and paraphrase evaluation metric (PEM), along with the aforementioned ParaMetric. PINC is designed to be used with BLEU and help cover its inadequacies. Since BLEU has difficulty measuring lexical dissimilarity, PINC measures the lack of n-gram overlap between a source sentence and a candidate paraphrase. It is essentially the Jaccard distance between the sentences, excluding n-grams that appear in the source sentence in order to maintain some semantic equivalence. PEM, on the other hand, attempts to evaluate the "adequacy, fluency, and lexical dissimilarity" of paraphrases by returning a single-value heuristic calculated using n-gram overlap in a pivot language. However, a large drawback of PEM is that it must be trained using large, in-domain parallel corpora as well as human judges; this is equivalent to training a paraphrase recognition system in order to evaluate a paraphrase generation system.

The Quora Question Pairs Dataset, which contains hundreds of thousands of duplicate questions, has become a common dataset for the evaluation of paraphrase detectors. The best performing models for paraphrase detection over the last three years have all used the Transformer architecture, and all have relied on large amounts of pre-training with more general data before fine-tuning with the question pairs.
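A minimal sketch of PINC as described above, averaging the fraction of candidate n-grams absent from the source over n-gram orders 1 through 4; details of the published metric may differ:

```python
# Sketch of PINC: the average, over n-gram orders 1..N, of the fraction
# of the candidate's n-grams that do NOT appear in the source. This
# follows the description above; the published metric may differ in detail.

def ngrams(sentence, n):
    words = sentence.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def pinc(source, candidate, max_n=4):
    scores = []
    for n in range(1, max_n + 1):
        src, cand = ngrams(source, n), ngrams(candidate, n)
        if cand:
            scores.append(1.0 - len(cand & src) / len(cand))
    return sum(scores) / len(scores) if scores else 0.0

print(pinc("the situation is under control",
           "the situation is in check"))  # higher = more lexically dissimilar
```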


See also

* Round-trip translation
* Text simplification
* Text normalization




External links


* Microsoft Research Paraphrase Corpus - a dataset consisting of 5800 pairs of sentences extracted from news articles, annotated to note whether each pair captures semantic equivalence
* Paraphrase Database (PPDB) - a searchable database containing millions of paraphrases in 16 different languages