The two figures above show HyPlag’s main analysis views: the Results Overview (Fig. 1) and the Detailed Comparison View (Fig. 2). The Results Overview enables users to quickly browse all identified similarities and check which parts of the input document are affected. The left part of the screen shows the full text of the input document (see (1) in Fig. 1). The right part shows a list of result summaries (2) for all documents for which similarities to the input document have been identified. Each result summary includes one or more match views (3). Each match view has two panels and represents the similarities that one analysis method identified, e.g., matching citations or similar formulae. The left panel (4a) represents the input document and the right panel (4b) the comparison document. Matching features appear in the match views connected by lines. For example, the match views in Fig. 1 show the similarity of text (left), citations (middle), and mathematical content (right) between a retracted article and two papers by other authors.
The Detailed Comparison View (Fig. 2) allows users to inspect identified similarities in detail. The screen displays the full text of the input document (8) and a selected comparison document (9) side by side. Between the full texts, a match view (10) similar to the match views in the Results Overview highlights all matching features in both documents. In this view, however, each feature match (11a,b) is assigned a separate color. Clicking on any highlight in the full-text panels or the central match view aligns the respective feature matches. Since the central match view represents the entire document, a darker shade indicates the current viewport, i.e., the text segment visible in the adjacent full-text panel and its position in the document.
For details on HyPlag’s visualizations or system architecture, see .
HyPlag includes the following analysis methods for the different categories of non-textual content:
HyPlag employs four citation-based analysis methods, which our prior research proved effective for discovering concealed forms of academic plagiarism (see our project page on Citation-based Plagiarism Detection or [4-10] for details). The code for the citation-based analysis is available as a separate GitHub repository.
- Bibliographic Coupling (BC) quantifies the absolute number or fraction of shared references while ignoring the number, position, and order of citations in the text.
- Longest Common Citation Sequence (LCCS) is the maximum number of citations that match in both documents in the same order, but not necessarily in a contiguous block. We showed that LCCS achieves good results for retrieving longer passages of reused text, in which the sequence of ideas remained unchanged.
- Greedy Citation Tiling (GCT) identifies all individually longest matching substrings of citations in two documents (‘citation tiles’), i.e., all blocks of consecutive shared citations in identical order. Longer citation tiles are a strong indicator for high semantic similarity of text passages, even if the order of the passages was changed.
- Citation Chunking (CC) is a class of heuristic measures to find variably-sized patterns of matching citations, in which the count and order of matching citations can differ.
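The first three measures can be illustrated with a minimal Python sketch. It is not HyPlag's implementation: the function names, the normalization of BC by the union of both reference lists, and the `min_len` threshold for tiles are assumptions made for illustration. Citations are represented as lists of reference identifiers in the order in which they occur in the text.

```python
from typing import List, Sequence, Set, Tuple


def bibliographic_coupling(refs_a: Set[str], refs_b: Set[str]) -> Tuple[int, float]:
    """Return the absolute number and the fraction of shared references.

    Assumption: the fraction is taken relative to the union of both
    reference lists; other normalizations are possible.
    """
    shared = len(refs_a & refs_b)
    union = len(refs_a | refs_b)
    return shared, (shared / union if union else 0.0)


def lccs(cit_a: Sequence[str], cit_b: Sequence[str]) -> int:
    """Length of the longest common citation sequence (classic LCS dynamic program)."""
    prev = [0] * (len(cit_b) + 1)
    for a in cit_a:
        curr = [0]
        for j, b in enumerate(cit_b, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def greedy_citation_tiles(
    cit_a: Sequence[str], cit_b: Sequence[str], min_len: int = 2
) -> List[Tuple[int, int, int]]:
    """Greedily collect the longest blocks of consecutive shared citations.

    Returns tiles as (start_in_a, start_in_b, length). Each citation is
    used in at most one tile. min_len is a hypothetical threshold.
    """
    used_a = [False] * len(cit_a)
    used_b = [False] * len(cit_b)
    tiles = []
    while True:
        best = (0, 0, 0)  # (length, start_a, start_b)
        for i in range(len(cit_a)):
            for j in range(len(cit_b)):
                k = 0
                while (
                    i + k < len(cit_a)
                    and j + k < len(cit_b)
                    and not used_a[i + k]
                    and not used_b[j + k]
                    and cit_a[i + k] == cit_b[j + k]
                ):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < min_len:
            break
        for k in range(length):
            used_a[i + k] = used_b[j + k] = True
        tiles.append((i, j, length))
    return tiles


doc_a = ["r1", "r2", "r3", "r4", "r5"]
doc_b = ["r4", "r5", "r9", "r1", "r2", "r3"]
print(bibliographic_coupling(set(doc_a), set(doc_b)))
print(lccs(doc_a, doc_b))
print(greedy_citation_tiles(doc_a, doc_b))
```

In this toy example, the two passages of `doc_a` appear in swapped order in `doc_b`. LCCS captures only the longer passage, whereas the tiling recovers both blocks, which matches the motivation for GCT given above.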
Currently, HyPlag includes four analysis methods to identify potentially suspicious image similarity (see  for details). The code for the image-based analysis is available as a separate GitHub repository.
- Perceptual hashing (pHash) is a well-established, fast, and reliable method to find highly similar images.
- Trigram text matching compares the text that has been extracted from images using OCR.
- Positional text matching improves the similarity analysis for OCR text of figures that includes significant recognition errors. The approach only considers text matches for computing the similarity of two images if the matching text occurs in broadly similar regions in both images.
- Ratio hashing identifies highly similar bar charts by comparing the relative heights of the bars sorted in decreasing order and calculating the sum of the differences of the bar heights.
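As an illustration of the ratio hashing idea, the following Python sketch compares two bar charts by their sorted relative bar heights. It assumes that the bar heights have already been extracted from the images, that heights are normalized by the tallest bar, and that both charts contain the same number of bars. These choices and the function names are assumptions for illustration, not details of HyPlag's implementation.

```python
from typing import List, Sequence


def ratio_hash(heights: Sequence[float]) -> List[float]:
    """Relative bar heights (tallest bar = 1.0), sorted in decreasing order."""
    tallest = max(heights)
    if tallest <= 0:
        raise ValueError("bar heights must include a positive value")
    return sorted((h / tallest for h in heights), reverse=True)


def ratio_distance(bars_a: Sequence[float], bars_b: Sequence[float]) -> float:
    """Sum of the differences of the sorted relative bar heights.

    A distance of 0.0 means the charts have identical height ratios,
    regardless of absolute scale or bar order.
    """
    hash_a, hash_b = ratio_hash(bars_a), ratio_hash(bars_b)
    if len(hash_a) != len(hash_b):
        # Assumption: this sketch only compares charts with equal bar counts.
        raise ValueError("charts must have the same number of bars")
    return sum(abs(x - y) for x, y in zip(hash_a, hash_b))


# Same ratios at a different scale and in a different bar order.
print(ratio_distance([2, 4, 8], [40, 10, 20]))
# Clearly different ratios.
print(ratio_distance([2, 4, 8], [8, 8, 8]))
```

Sorting the bars makes the comparison insensitive to reordered bars, and normalizing makes it insensitive to rescaled axes, so that a chart reused with cosmetic changes still yields a small distance.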
To create mathematics-based semantic fingerprints of documents, HyPlag uses three similarity measures that analyze mathematical identifiers. We showed that identifiers are most effective for this purpose in a previous study (see our project page on Math-based PD or  for details).
- Frequency histograms of mathematical identifiers (Histo) quantify the global overlap of mathematical identifiers in two documents. The measure analyzes the union of the identifiers in both documents and considers the relative difference in the number of occurrences of each identifier. The number of shared identifiers is normalized by the sum of identifiers in both documents. Thus, high scores require documents that contain a comparable number of identifiers, a requirement that is typically only met if the two documents are of similar length.
- Longest Common Subsequence of Identifiers (LCIS) is the maximum number of identifiers that match in both documents in the same order, but not necessarily in a contiguous block. HyPlag considers the number of identifiers in the query document that are part of the longest common identifier sequence. Like Histo, the LCIS measure quantifies the global similarity of documents, but it takes the order of identifiers into account, whereas Histo is order-agnostic.
- Greedy Identifier Tiles (GIT) are the set of all individually longest blocks of shared identifiers in identical order that cannot be extended to the left or right without encountering a non-matching identifier. The GIT score quantifies the number of identifiers in the query document that are part of identifier tiles with a minimum length of five.
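The effect of document length on the Histo measure can be illustrated with a small Python sketch. The exact formula is an assumption: shared occurrences are counted per identifier as the smaller of the two frequencies, and the factor 2 is added so that identical documents score 1.0. The function name and the sample identifiers are hypothetical.

```python
from collections import Counter
from typing import Sequence


def histo_similarity(ids_a: Sequence[str], ids_b: Sequence[str]) -> float:
    """Order-agnostic overlap of identifier frequency histograms.

    Assumed formula: 2 * (shared occurrences) / (occurrences in both documents).
    This equals 1 minus the summed per-identifier frequency differences
    divided by the total number of occurrences.
    """
    count_a, count_b = Counter(ids_a), Counter(ids_b)
    total = sum(count_a.values()) + sum(count_b.values())
    if total == 0:
        return 0.0
    # Iterate over the union of identifiers; a Counter returns 0 for missing keys.
    shared = sum(min(count_a[x], count_b[x]) for x in count_a.keys() | count_b.keys())
    return 2 * shared / total


short_doc = ["x", "y", "x", "n"]
long_doc = short_doc * 10  # same identifiers, ten times as many occurrences

print(histo_similarity(short_doc, short_doc))
print(histo_similarity(short_doc, long_doc))
```

Although `long_doc` uses exactly the same identifiers in the same proportions, its score against `short_doc` is low because the normalization penalizes the difference in the number of occurrences.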
To find similar text, HyPlag relies on established text retrieval methods.
- Text fingerprinting performs text chunking using word 3-grams and probabilistically selects a subset of chunks for computing a digital signature of the input text. The mean probability for chunk retention is 1/16. We realized this approach by adapting the Sherlock tool.
- Encoplot, developed by Grozea et al., is an efficient character 16-gram comparison that achieves a time complexity of O(n) by ignoring repeated matches.
- Boyer-Moore string matching identifies all strings (including repetitions) of 12 or more identical words.
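The text fingerprinting step can be sketched in Python as follows. This is a simplified illustration, not the adapted Sherlock code: the tokenization, the use of SHA-1, the selection rule (keep a chunk if its hash is divisible by 16, which retains about one chunk in 16 on average), and the resemblance score are all assumptions.

```python
import hashlib
import re
from typing import Set


def fingerprint(text: str, n: int = 3, keep_one_in: int = 16) -> Set[int]:
    """Hash all word n-grams and keep those whose hash is divisible by keep_one_in."""
    words = re.findall(r"\w+", text.lower())
    selected = set()
    for i in range(len(words) - n + 1):
        chunk = " ".join(words[i : i + n])
        digest = hashlib.sha1(chunk.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        # The selection depends only on the chunk content, so identical
        # chunks are retained or dropped consistently in every document.
        if value % keep_one_in == 0:
            selected.add(value)
    return selected


def resemblance(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Fraction of fingerprint hashes shared by both documents."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


text = "the quick brown fox jumps over the lazy dog"
all_chunks = fingerprint(text, keep_one_in=1)  # keep every 3-gram
sampled = fingerprint(text)                    # keep roughly 1 in 16
print(len(all_chunks), len(sampled))
```

Because the selection is a deterministic function of the chunk, two documents that share a passage also share the sampled hashes of that passage, which is what makes comparing the much smaller fingerprints meaningful.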