Code duplications
Source code similarity measurement is a fundamental activity for solving many code-related tasks in software engineering, including clone and reuse identification, plagiarism detection, malware and vulnerability analysis, and code recommendation. Duplicated source code blocks can harm maintainability of software systems.
Duplicated code blocks lead to higher defect rates and more complex resolutions: 17% of cloned code contains bugs (Wagner et al.), and 18.42% of cloned code containing bugs propagates those bugs into other copies (Mondal et al.).
Code clones decrease the stability of programs. On the other hand, several studies have argued that refactoring duplicated code is not always desirable. For instance, removing code clones increases dependencies between components, which may become tightly coupled because they refer to one shared piece of code instead of separate clones. Hence, many researchers argue that code clones should at least be identified, even if no action is taken. Unfortunately, detecting code clone instances is not straightforward: clone type definitions and taxonomies are sometimes unclear, and the border between code clones and code similarity is not well defined in the research.
Duplication is one of the six anti-patterns that make up STUPID (Singleton, Tight coupling, Untestable, Premature optimization, Indescriptive naming, Duplication).
When the same code is repeated several times, it comes with two human costs:
As long as the repetitions are kept synchronized, developers must repeat their edits on all occurrences. For N lines repeated D times, the overhead is roughly N × (D − 1) lines of code. This is a form of tight coupling: more code than ideally intended is linked by every change, creating inertia.
Once the repetitions are no longer synchronized, developers must also take care of the local variations in the code. Each reading brings the temptation to remove those variations, which may be present for a good reason.
Assuming duplicated blocks of code are supposed to do the same thing, any refactoring, even a simple one, must be duplicated too: unrewarding grunt work that puts pressure on the developer to find every place where the refactoring must be applied. Source code similarity is a concept used to measure the degree of similarity between a pair of code snippets with respect to their text, syntax, and semantics. It is important to differentiate between measuring code similarity and identifying similar or cloned code: the former is a general concept, while the latter is one application of code similarity measurement.
Various reasons lead to cloned code in software systems during the life cycle:
Development strategy. Specific development approaches, such as copy-paste programming to reuse software components, create clones. For instance, generating code with a tool or merging the source code of two similar systems often leads to clones.
Maintainability advantages. Developers often make changes that increase code similarity during the maintenance phase.
Programming language limitations and vendor lock-in. The lack of built-in capabilities and abstraction mechanisms for some widely used programming structures pushes developers to repeat specific pieces of code. For instance, the lack of inheritance and templates in the C programming language may lead to the same blocks of code being repeated with minor changes. Such repeating structures can create frequent clones.
Accidental similarity. Code similarities may be created inadvertently. For example, protocols for interacting with APIs and libraries typically require a series of function calls or sequences of commands that lead to code similarity. Moreover, two or more independent developers may use the same logic to solve a particular problem.
The widely-accepted definitions of clone types according to the existing research are summarized as follows:
• Type I. The code snippets are entirely the same except for changes that may exist in the white spaces and comments.
• Type II. The structure of the code snippets is the same. However, the identifiers' names, types, white spaces, and comments differ.
• Type III. In this type, in addition to changes in identifiers, variable names, data types, and comments, some parts of the code can be deleted or updated, or some new parts can be added.
• Type IV. Two pieces of code have different texts but the same functionality.
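The distinctions are easiest to see side by side. Below is a minimal sketch in Python; all function names and snippets are invented for the illustration:

```python
# Original snippet.
def total_price(items, tax):
    subtotal = sum(items)
    return subtotal * (1 + tax)

# Type I: identical except for whitespace and comments.
def total_price_t1(items, tax):
    subtotal = sum(items)   # running sum
    return subtotal * (1 + tax)

# Type II: same structure, but identifiers renamed.
def order_cost(products, vat):
    base = sum(products)
    return base * (1 + vat)

# Type III: renamed identifiers plus an added statement.
def order_cost_logged(products, vat):
    base = sum(products)
    print("subtotal:", base)        # new statement inserted
    return base * (1 + vat)

# Type IV: entirely different text, same functionality.
def total_price_t4(items, tax):
    total = 0
    for price in items:
        total += price
    return total + total * tax
```

All five variants compute the same result; only the Type IV clone does so with entirely different text, which is why it is the hardest type to detect.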
Automated tools like Static Reviewer can help detect all clone types and more with their Fine Duplication Detection.
The following is a panoramic view of existing similarity algorithms.
Static Reviewer uses a hybrid approach that combines a data-driven technique (using the version history of OSS codebases) with algorithmic ones (token-, graph-, and metric-based), without AI dependencies. It is language independent and is able to detect:
Duplicated Files
Duplicated Classes
Duplicated Methods/Functions
Duplicated Code Blocks
Duplicated Code Snippets (minimum 10 lines)
OSS Plagiarism
Duplication Anti-Patterns, such as Too-DRY, STUPID, Duplicate Code, and Copy-Paste Catastrophe
Static Reviewer runs its own very fast group of algorithms, some in parallel, some in serial. This is a short list of the algorithms used when comparing application folders:
Instruction Sequence Matching. This algorithm compares the first instruction of every source line in each pair of files extracted from the folders involved in the comparison, ignoring blank lines, comment lines, and auto-generated, ignored, and excluded files. It finds the longest common sequence within both files and yields a correlation score representing the number of source lines in the longest matching instruction sequence in the two files. It is used as the first filter for valid files that will be submitted to the other algorithms.
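The general idea can be sketched as a longest-common-match search over the first token of each statement line. This is a simplified, hypothetical reimplementation for illustration, not Static Reviewer's actual code:

```python
from difflib import SequenceMatcher

def first_instructions(source):
    """First token of each source line, skipping blanks and comment lines."""
    tokens = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "//")):
            continue
        tokens.append(stripped.split()[0])
    return tokens

def instruction_sequence_score(src_a, src_b):
    """Length of the longest contiguous matching instruction sequence."""
    a, b = first_instructions(src_a), first_instructions(src_b)
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size
```

Because only the leading instruction of each line is compared, renamed variables on the right-hand side do not break the match, which makes this a cheap first-pass filter.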
String Matching. It compares each comment and each string from both files, handling translated comments and strings too. Sequences of whitespace are converted to single spaces so that the syntactic structure of the line is preserved. The entire comment or string is compared in a language-agnostic way, regardless of whether it contains keywords. This algorithm yields a correlation score representing the number of matching comments and strings in the two files. It represents the second file-filtering level.
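A toy sketch of this filter, assuming hash-style comments and single-line string literals (the real tool is language independent and also handles translations):

```python
import re

# Matches a comment to end of line, or a single-line string literal.
LITERAL = re.compile(r"""#[^\n]*|"[^"\n]*"|'[^'\n]*'""")

def normalized_literals(source):
    """Comments and string literals, with runs of whitespace collapsed."""
    return {re.sub(r"\s+", " ", s).strip() for s in LITERAL.findall(source)}

def string_match_score(src_a, src_b):
    """Number of comments/strings shared by the two files."""
    return len(normalized_literals(src_a) & normalized_literals(src_b))
```

Collapsing whitespace before comparison means reformatting alone cannot hide a copied comment or string.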
Identifier Matching. For each file pair, the algorithm counts the number of matching identifiers that are not programming-language keywords, comparing them not by name but by type, length, or other attributes depending on the programming language. To determine whether an identifier is a keyword, it is compared against a whitelist. Only non-keywords are compared, in order to find matching translated function names, variable names, and other identifiers. The algorithm also examines each non-keyword identifier in the source code of one file and finds all identifiers that match a sequence within one or more non-keyword identifiers in the other file. For example, if the identifier xyz is found in one file and the identifiers axyz, xyz1111111, and xyzxxxyz are found in the other, then there is a partial match for the identifier xyz. Partial matching finds common symbols, function names, variable names, and other identifiers where someone has attempted to disguise them in a fairly basic way. This algorithm also yields a correlation score representing the number of matching and partially matching identifiers in the two files. It is the third file-filtering level.
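The exact-plus-partial matching described above can be sketched as follows, using Python's own keyword list as the whitelist (a simplification of the per-language whitelists the text describes):

```python
import keyword
import re

def identifiers(source):
    """Non-keyword identifiers appearing in the source."""
    names = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))
    return {n for n in names if not keyword.iskeyword(n)}

def identifier_match_score(src_a, src_b):
    """Exact matches plus partial matches, where an identifier from one
    file appears as a substring of a longer identifier in the other."""
    ids_a, ids_b = identifiers(src_a), identifiers(src_b)
    exact = ids_a & ids_b
    partial = {a for a in ids_a - exact
               if any(a in b and a != b for b in ids_b)}
    return len(exact) + len(partial)
```

With the example from the text, `xyz` partially matches `axyz`, `xyz1111111`, and `xyzxxxyz`, so a basic rename-by-decoration does not hide the identifier.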
Statement Matching. This algorithm yields a correlation score representing the number of matching instruction lines in the files. It is executed in parallel on multiple file pairs. It excludes non-statement lines (comments, blank lines, pre-processor or compiler directives, etc.) and has its own way of applying SimHash to code snippets. Even if the symbols, functions, identifiers, and variables are different, and there are some noisy lines in the middle, Static Reviewer can still recognize duplicated code.
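SimHash is a locality-sensitive fingerprint: near-identical statement sequences produce hashes that differ in only a few bits, so a small Hamming distance signals likely duplication even when some lines are noisy. A minimal sketch of the general technique (not Static Reviewer's own variant):

```python
import hashlib
import re

def simhash(statements, bits=64):
    """64-bit SimHash over statement tokens: each token votes on every
    bit position, and the sign of the total decides the final bit."""
    weights = [0] * bits
    for stmt in statements:
        for token in re.findall(r"\S+", stmt):
            h = int(hashlib.md5(token.encode()).hexdigest(), 16)
            for i in range(bits):
                weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two identical statement lists yield a Hamming distance of 0; mostly-shared lists yield a small distance, which can then be compared against a threshold.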
Syntax Tree Similarity. It finds duplicated code by comparing Abstract Syntax Trees and Dynamic Syntax Trees from parsers. It can find clone sets with multiple instances (we have seen applications with hundreds of clones of a single bit of code), and it can find clones across many source files simultaneously. The algorithm computes the ratio of identical elements (technically, AST and DST nodes) across a clone set divided by the total number of elements in that clone set. This ratio is a value between 0 and 1. It is compared against a threshold, and finally the correlation score is calculated; we have found that 95% is a surprisingly robust threshold in terms of the quality of reported clones.
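The node-ratio idea can be sketched with Python's standard `ast` module as a toy stand-in for a multi-language parser. Because identifier names are not node types, a pure rename (a Type II clone) scores a full 1.0:

```python
import ast

def node_types(source):
    """Multiset of AST node type names for a code snippet."""
    return sorted(type(n).__name__ for n in ast.walk(ast.parse(source)))

def ast_similarity(src_a, src_b):
    """Ratio of identical elements (shared node types) over the total
    number of elements across the pair, between 0 and 1."""
    a, b = node_types(src_a), node_types(src_b)
    remaining, shared = list(b), 0
    for t in a:
        if t in remaining:
            remaining.remove(t)
            shared += 1
    return 2 * shared / (len(a) + len(b))
```

The resulting ratio would then be compared against a threshold (the text reports 95% working well) before a clone is reported.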
Binary Code Similarity. It compares two or more binary code segments to identify their similarities and differences. Binary Code Similarity Detection (BCSD) algorithms compare the semantic differences between two binary programs rather than merely relying on byte-level comparisons. BCSD attempts to understand the meaning and behavior of programs, which helps in identifying code reuse, plagiarism, software piracy, and malware variants, or even in assessing the impact of software patches and updates, and it therefore plays a role in fields such as software development, reverse engineering, and security analysis. In binary code similarity analysis, the only input may be an executable binary file or program, which significantly limits the amount of information available. The binary code generation process encompasses multiple stages, and the specific steps and their potential impact on the generated binary are important for reverse-engineering that file: the original source file name without its path (which can be derived from extracted strings), the compiler (every compiler leaves its own signature in the binaries), the optimization level, the target platform (detected from the binary format, such as PE on Microsoft Windows, ELF on Linux and most other versions of Unix, or Mach-O on macOS and iOS), linking dependencies (through Software Composition Analysis), and decompiling.
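The target-platform step, at least, is easy to illustrate: each of the three container formats named above starts with well-known magic bytes, so a few prefix checks identify the platform before any deeper analysis.

```python
def binary_format(data: bytes) -> str:
    """Identify a binary's container format from its magic bytes."""
    if data[:2] == b"MZ":
        return "PE"       # Windows executables start with the "MZ" header
    if data[:4] == b"\x7fELF":
        return "ELF"      # Linux and most other Unix variants
    if data[:4] in (b"\xfe\xed\xfa\xce", b"\xfe\xed\xfa\xcf",
                    b"\xcf\xfa\xed\xfe", b"\xce\xfa\xed\xfe"):
        return "Mach-O"   # macOS / iOS, 32- and 64-bit, both byte orders
    return "unknown"
```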
Repository-Level Matching. It uses the Commit APIs offered by all of the major git providers (GitHub, GitLab, BitBucket, Azure DevOps, ABAPGit) to incrementally piece together the meaningful changes that occur within a repository over time. To detect clone blocks without having access to the full repository source code, the algorithm generates a one-way hash value representing each changed line, with identifiers anonymized. When a line is changed (in a non-ignored file), it gathers the 5 lines closest to the change, transforms each line into its hashed representation, and searches for any other files that contain the same Content Signature.
Once an instance of a separate code block matching the Content Signature is located, the algorithm first gathers all locations that match the 5-line "content signature", then searches outward from each Content Signature to assess the full extent of the overlap between the two code locations. Since 2015, it has collected over 2 billion lines of code from repositories to enhance this process.
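The content-signature idea can be sketched as follows. This is a hypothetical simplification: identifiers are replaced with a placeholder before hashing, so renamed clones of the same 5-line window produce the same one-way signature without exposing any source code.

```python
import hashlib
import re

def anonymize(line):
    """Replace identifiers with a placeholder so renames hash identically."""
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", "ID", line.strip())

def content_signature(lines, center, window=5):
    """One-way hash of the `window` lines closest to a changed line."""
    lo = max(0, center - window // 2)
    chunk = lines[lo:lo + window]
    payload = "\n".join(anonymize(l) for l in chunk)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Because only hashes are stored and compared, matching windows can be found across repositories without ever transferring the underlying source lines.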
Different groups of algorithms are used for OSS plagiarism and Duplication Anti-Patterns recognition.
Finally, a single correlation score is given for the similarity of each file pair. A higher score implies that the files are more similar, or may be plagiarized from each other or from a common third file. There are many kinds of plagiarism and many ways of fooling plagiarism-detection programs. For this reason, Static Reviewer produces a list of file pairs ordered by their total match scores, reducing the effort needed by experts by allowing them to narrow their focus from hundreds of thousands of lines of code in hundreds of files to dozens of lines of code in dozens of files. The list of file pairs is then used for discovering duplicated classes, methods/functions, code blocks, and code snippets.
It suggests the appropriate Refactoring Patterns too.
See: Duplicate Code Refactoring Pattern and the Template Method.
Duplication detection benchmarking
Datasets play an essential role in building and validating code similarity measurement tools. The investigation of primary studies demonstrates that there are 5 datasets still actively maintained for source code similarity and clone detection research.
Static Reviewer has been trained using those datasets, with current coverage above 85%. We will keep working to improve it.