Plagiarism detection or content similarity detection is the process of locating instances of

plagiarism Plagiarism is the fraudulent representation of another person's language, thoughts, ideas, or expressions as one's own original work.From the 1995 '' Random House Compact Unabridged Dictionary'': use or close imitation of the language and though ...

copyright infringement Copyright infringement (at times referred to as piracy) is the use of works protected by copyright without permission for a usage where such permission is required, thereby infringing certain exclusive rights granted to the copyright holder, ...

within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others.Bretag, T., & Mahmud, S. (2009). A model for determining student plagiarism: Electronic detection and academic judgement. ''Journal of University Teaching & Learning Practice, 6''(1). Retrieved from http://ro.uow.edu.au/jutlp/vol6/iss1/6 Detection of plagiarism can be undertaken in a variety of ways. Human detection is the most traditional form of identifying plagiarism from written work. This can be a lengthy and time-consuming task for the reader and can also result in inconsistencies in how plagiarism is identified within an organization. Text-matching software (TMS), which is also referred to as "plagiarism detection software" or "anti-plagiarism" software, has become widely available, in the form of both commercially available products as well as open-source software. TMS does not actually detect plagiarism per se, but instead finds specific passages of text in one document that match text in another document.

Software-assisted plagiarism detection

Computer-assisted plagiarism detection (CaPD) is an Information retrieval (IR) task supported by specialized IR systems, which is referred to as a plagiarism detection system (PDS) or document similarity detection system. A 2019 systematic literature review presents an overview of state-of-the-art plagiarism detection methods.

In text documents

Systems for text similarity detection implement one of two generic detection approaches, one being external, the other being intrinsic. External detection systems compare a suspicious document with a reference collection, which is a set of documents assumed to be genuine. Based on a chosen document model and predefined similarity criteria, the detection task is to retrieve all documents that contain text that is similar to a degree above a chosen threshold to text in the suspicious document. Intrinsic PDSes solely analyze the text to be evaluated without performing comparisons to external documents. This approach aims to recognize changes in the unique writing style of an author as an indicator for potential plagiarism. PDSes are not capable of reliably identifying plagiarism without human judgment. Similarities and writing style features are computed with the help of predefined document models and might represent false positives.

Effectiveness of those tools in higher education settings

A study was conducted to test the effectiveness of similarity detection software in a higher education setting. One part of the study assigned one group of students to write a paper. These students were first educated about plagiarism and informed that their work was to be run through a content similarity detection system. A second group of students was assigned to write a paper without any information about plagiarism. The researchers expected to find lower rates in group one but found roughly the same rates of plagiarism in both groups.

Approaches

The figure below represents a classification of all detection approaches currently in use for computer-assisted content similarity detection. The approaches are characterized by the type of similarity assessment they undertake: global or local. Global similarity assessment approaches use the characteristics taken from larger parts of the text or the document as a whole to compute similarity, while local methods only examine pre-selected text segments as input. PDS Classification

=Fingerprinting

= Fingerprinting is currently the most widely applied approach to content similarity detection. This method forms representative digests of documents by selecting a set of multiple substrings ( n-grams) from them. The sets represent the

fingerprint A fingerprint is an impression left by the friction ridges of a human finger. The recovery of partial fingerprints from a crime scene is an important method of forensic science. Moisture and grease on a finger result in fingerprints on surfa ...

s and their elements are called minutiae. A suspicious document is checked for plagiarism by computing its fingerprint and querying minutiae with a precomputed index of fingerprints for all documents of a reference collection. Minutiae matching with those of other documents indicate shared text segments and suggest potential plagiarism if they exceed a chosen similarity threshold. Computational resources and time are limiting factors to fingerprinting, which is why this method typically only compares a subset of minutiae to speed up the computation and allow for checks in very large collection, such as the Internet.

=String matching

= String matching is a prevalent approach used in computer science. When applied to the problem of plagiarism detection, documents are compared for verbatim text overlaps. Numerous methods have been proposed to tackle this task, of which some have been adapted to external plagiarism detection. Checking a suspicious document in this setting requires the computation and storage of efficiently comparable representations for all documents in the reference collection to compare them pairwise. Generally, suffix document models, such as

suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow parti ...

s or suffix vectors, have been used for this task. Nonetheless, substring matching remains computationally expensive, which makes it a non-viable solution for checking large collections of documents.

=Bag of words

= Bag of words analysis represents the adoption of vector space retrieval, a traditional IR concept, to the domain of content similarity detection. Documents are represented as one or multiple vectors, e.g. for different document parts, which are used for pair wise similarity computations. Similarity computation may then rely on the traditional cosine similarity measure, or on more sophisticated similarity measures.

=Citation analysis

= Citation-based plagiarism detection (CbPD) relies on

citation analysis Citation analysis is the examination of the frequency, patterns, and graphs of citations in documents. It uses the directed graph of citations — links from one document to another document — to reveal properties of the documents. A ty ...

, and is the only approach to plagiarism detection that does not rely on the textual similarity. CbPD examines the citation and reference information in texts to identify similar

pattern A pattern is a regularity in the world, in human-made design, or in abstract ideas. As such, the elements of a pattern repeat in a predictable manner. A geometric pattern is a kind of pattern formed of geometric shapes and typically repeated li ...

s in the citation sequences. As such, this approach is suitable for scientific texts, or other academic documents that contain citations. Citation analysis to detect plagiarism is a relatively young concept. It has not been adopted by commercial software, but a first prototype of a citation-based plagiarism detection system exists. Similar order and proximity of citations in the examined documents are the main criteria used to compute citation pattern similarities. Citation patterns represent subsequences non-exclusively containing citations shared by the documents compared. Factors, including the absolute number or relative fraction of shared citations in the pattern, as well as the probability that citations co-occur in a document are also considered to quantify the patterns’ degree of similarity.

=Stylometry

Stylometry Stylometry is the application of the study of linguistic style, usually to written language. It has also been applied successfully to music and to fine-art paintings as well. Argamon, Shlomo, Kevin Burns, and Shlomo Dubnov, eds. The structure of ...

subsumes statistical methods for quantifying an author’s unique writing style and is mainly used for authorship attribution or intrinsic plagiarism detection. Detecting plagiarism by authorship attribution requires checking whether the writing style of the suspicious document, which is written supposedly by a certain author, matches with that of a corpus of documents written by the same author. Intrinsic plagiarism detection, on the other hand, uncover plagiarism based on internal evidences in the suspicious document without comparing it with other documents. This is performed by constructing and comparing stylometric models for different text segments of the suspicious document, and passages that are stylistically different from others are marked as potentially plagiarized/infringed. Although they are simple to extract, character

n-grams In the fields of computational linguistics and probability, an ''n''-gram (sometimes also called Q-gram) is a contiguous sequence of ''n'' items from a given Sample (statistics), sample of text or speech. The items can be phonemes, syllables ...

are proven to be among the best stylometric features for intrinsic plagiarism detection.

=Neural networks

= More recent approaches to assess content similarity using

neural networks A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological ...

have achieved significantly greater accuracy, but come at great computational cost. Traditional neural network approaches embed both pieces of content into semantic vector embeddings to calculate their similarity, which is often their cosine similarity. More advanced methods perform end-to-end prediction of similarity or classifications using the

Transformer A transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or multiple circuits. A varying current in any coil of the transformer produces a varying magnetic flux in the transformer' ...

architecture. Particularly paraphrase detection benefits from highly parameterized pre-trained models.

Performance

Comparative evaluations of content similarity detection systems indicate that their performance depends on the type of plagiarism present (see figure). Except for citation pattern analysis, all detection approaches rely on textual similarity. It is therefore symptomatic that detection accuracy decreases the more plagiarism cases are obfuscated. PD Methods Detection Performance

Literal copies, aka copy and paste (c&p) plagiarism or blatant copyright infringement, or modestly disguised plagiarism cases can be detected with high accuracy by current external PDS if the source is accessible to the software. Especially substring matching procedures achieve a good performance for c&p plagiarism, since they commonly use lossless document models, such as

s. The performance of systems using fingerprinting or bag of words analysis in detecting copies depends on the information loss incurred by the document model used. By applying flexible chunking and selection strategies, they are better capable of detecting moderate forms of disguised plagiarism when compared to substring matching procedures. Intrinsic plagiarism detection using

stylometry Stylometry is the application of the study of linguistic style, usually to written language. It has also been applied successfully to music and to fine-art paintings as well. Argamon, Shlomo, Kevin Burns, and Shlomo Dubnov, eds. The structure of ...

can overcome the boundaries of textual similarity to some extent by comparing linguistic similarity. Given that the stylistic differences between plagiarized and original segments are significant and can be identified reliably, stylometry can help in identifying disguised and

paraphrased A paraphrase () is a restatement of the meaning of a text or passage using other words. The term itself is derived via Latin ', . The act of paraphrasing is also called ''paraphrasis''. History Although paraphrases likely abounded in oral tra ...

plagiarism. Stylometric comparisons are likely to fail in cases where segments are strongly paraphrased to the point where they more closely resemble the personal writing style of the plagiarist or if a text was compiled by multiple authors. The results of the International Competitions on Plagiarism Detection held in 2009, 2010 and 2011, as well as experiments performed by Stein, indicate that stylometric analysis seems to work reliably only for document lengths of several thousand or tens of thousands of words, which limits the applicability of the method to CaPD settings. An increasing amount of research is performed on methods and systems capable of detecting translated plagiarism. Currently, cross-language plagiarism detection (CLPD) is not viewed as a mature technology and respective systems have not been able to achieve satisfying detection results in practice. Citation-based plagiarism detection using citation pattern analysis is capable of identifying stronger paraphrases and translations with higher success rates when compared to other detection approaches, because it is independent of textual characteristics. However, since citation-pattern analysis depends on the availability of sufficient citation information, it is limited to academic texts. It remains inferior to text-based approaches in detecting shorter plagiarized passages, which are typical for cases of copy-and-paste or shake-and-paste plagiarism; the latter refers to mixing slightly altered fragments from different sources.

Software

The design of content similarity detection software for use with text documents is characterized by a number of factors: Most large-scale plagiarism detection systems use large, internal databases (in addition to other resources) that grow with each additional document submitted for analysis. However, this feature is considered by some as a violation of student copyright.

In source code

Plagiarism in computer source code is also frequent, and requires different tools than those used for text comparisons in document. Significant research has been dedicated to academic source-code plagiarism. A distinctive aspect of source-code plagiarism is that there are no

essay mill An essay mill (also term paper mill) is a business that allows customers to commission an original piece of writing on a particular topic so that they may commit academic fraud. Customers provide the company with specific information about the e ...

s, such as can be found in traditional plagiarism. Since most programming assignments expect students to write programs with very specific requirements, it is very difficult to find existing programs that already meet them. Since integrating external code is often harder than writing it from scratch, most plagiarizing students choose to do so from their peers. According to Roy and Cordy,Roy, Chanchal Kumar;Cordy, James R. (26 September 2007
"A Survey on Software Clone Detection Research"
School of Computing, Queen's University, Canada. source-code similarity detection algorithms can be classified as based on either * Strings – look for exact textual matches of segments, for instance five-word runs. Fast, but can be confused by renaming identifiers. * Tokens – as with strings, but using a

lexer In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of ''lexical tokens'' (strings with an assigned and thus identified m ...

to convert the program into

token Token may refer to: Arts, entertainment, and media * Token, a game piece or counter, used in some games * The Tokens, a vocal music group * Tolkien Black, a recurring character on the animated television series ''South Park,'' formerly known as ...

s first. This discards whitespace, comments, and identifier names, making the system more robust to simple text replacements. Most academic plagiarism detection systems work at this level, using different algorithms to measure the similarity between token sequences. * Parse Trees – build and compare parse trees. This allows higher-level similarities to be detected. For instance, tree comparison can normalize conditional statements, and detect equivalent constructs as similar to each other. * Program Dependency Graphs (PDGs) – a PDG captures the actual flow of control in a program, and allows much higher-level equivalences to be located, at a greater expense in complexity and calculation time. * Metrics – metrics capture 'scores' of code segments according to certain criteria; for instance, "the number of loops and conditionals", or "the number of different variables used". Metrics are simple to calculate and can be compared quickly, but can also lead to false positives: two fragments with the same scores on a set of metrics may do entirely different things. * Hybrid approaches – for instance, parse trees +

s can combine the detection capability of parse trees with the speed afforded by suffix trees, a type of string-matching data structure. The previous classification was developed for

code refactoring In computer programming and software design, code refactoring is the process of restructuring existing computer code—changing the '' factoring''—without changing its external behavior. Refactoring is intended to improve the design, structu ...

, and not for academic plagiarism detection (an important goal of refactoring is to avoid

duplicate code In computer programming, duplicate code is a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity. Duplicate code is generally considered undesirable for a nu ...

, referred to as code clones in the literature). The above approaches are effective against different levels of similarity; low-level similarity refers to identical text, while high-level similarity can be due to similar specifications. In an academic setting, when all students are expected to code to the same specifications, functionally equivalent code (with high-level similarity) is entirely expected, and only low-level similarity is considered as proof of cheating.

Complications with the use of text-matching software for plagiarism detection

Various complications have been documented with the use of text-matching software when used for plagiarism detection. One of the more prevalent concerns documented centers on the issue of intellectual property rights. The basic argument is that materials must be added to a database in order for the TMS to effectively determine a match, but adding users' materials to such a database may infringe on their intellectual property rights. The issue has been raised in a number of court cases. An additional complication with the use of TMS is that the software finds only precise matches to other text. It does not pick up poorly paraphrased work, for example, or the practice of plagiarizing by use of sufficient word substitutions to elude detection software, which is known as rogeting.

References

Literature

* Carroll, J. (2002). A'' handbook for deterring plagiarism in higher education''. Oxford: The Oxford Centre for Staff and Learning Development, Oxford Brookes University. (96 p.), * Zeidman, B. (2011). ''The Software IP Detective’s Handbook''. Prentice Hall. (480 p.), {{ISBN, 0137035330 Educational assessment and evaluation Plagiarism