Dot Plot

Dot Plot

C#

New versions will be in C# because I want to learn that language and a real application is the best way to learn.

Parsing

The original tokeniser splits on whitespace; this means that strings such as:

  using System.Windows.Forms;

appear as two tokens instead of the expected four (or five if you count the semicolon, seven if you count the to dots). This can obscure some similarities but is so much faster and so much simpler to implement than a general parsing algorithm that it is probably the best compromise. The real problem comes in analysing files created in general purpose text editors. Such files do not always include spaces between tokens such as operators. For instance the MS VB IDE editor adds spaces around infix operators, suppresses them before opening brackets, adds them before end of line comments and so on; this has the effect of ensuring that operators are seen as tokens, however, such spacing is not required by the compiler.

For instance the following line is perfectly legal VB:

  A=B+C
but if you enter it using the VB IDE it will become
  A = B + C    

The first will be seen as one token and the second as five tokens. This is a serious limitation if we think in terms of detecting plagiarism because then we must assume that the programmer of at least one of the files is hostile to the process, on the other hand in professional contexts we are generally looking for sloppy programming that involves verbatim copying without any intent to conceal, only an intent to get the job done as quickly as possible. This means that copied source will tend to have the same spacing as the original even if the programmer has altered it a little to fit its new home.

The problems associated with the original tokenisation method areeasily avoided with the .NET framework regular expression class because it has a Split method that allows a regular expression to be used to identify the delimiters.

Implementation Notes

New Features

Speed is better after the change to the way the tokens were stored. Now it is time to add more features. The first question is which ones. Possible features include:

Improvements to Algorithms

Dotplot in and for VB

Dotplots are two dimensional charts. To produce such a chart two text files are tokenised. The values along the axes are the tokens; imagine the whole text of the file written along the axis.

For example:

         The quick brown fox

The       X            
dirty          
brown                X
dog

Dots are placed at each intersection point where the tokens are the same. The same technique can look for repeated sequences within a file.

Example with case ignored:

         The quick brown fox and the dirty brown dog
The       X                       X                
quick          X                                 
brown                X                       X           
fox                       X                         
and                           X                         
the       X                       X               
dirty                                  X         
brown                X                       X
dog                                               X

Notice that the diagonal line is completely filled and that there are diagonals in the upper right and lower left quadrants.

The purpose of such a plot is to make detection of duplicated code easier. Many academic citations of this method, other than those in bilogical sciences, tend to couple it with plagiarism detection. While it can of course be used to detect stolen code in software development course and in detecting stolen passages from novels and scientific reports my feeling is that the best use for the technique is to aid in software maintenance by making it easy to detect code that has simply been copied and edited in order to get a program up and running quickly. There is nothing wrong with cut and paste development so long as the developer stays awake and refactors the code when the duplication reaches a certain level (one copy is alright, two should make you make a mental note, three means it's time to refactor)

Quick and Dirty Implementation

Dotplot versions

CurrentVersion
The development version. This file briefly describes changes to whatever I am working on at the moment.
DotPlotVersionThree
Token streaming from file list. Multiple files.
DotPlotVersionTwo
Improved data structures that speed up plotting by not requiring string comparisons.
DotPlotVersionOne
Simple version that plots the correspondence between two files and has a slow plotter.

Links to dotplot resources on the web

Google search for dotplot
finds altogether too many hits. Mostly from the molecular biology field, interesting but only occasionally directly relevant.
Google search for: dotplot source code
Much better.
Jonathan Helfman and Ken Church
This has to be the best, most compact and most easily digested introduction to the concepts of dotplots.
Visual Detection of Duplicated Code
Rieger and Ducasse at Berne.
Identifying Duplication in Program Traces
Interesting suggestion for using dotplots to examine not the static structure but the behaviour of a program.
Mi-Nyeong Hwang, Jeong-Hyeon Choi, Do-Hoon Lee, and Hwan-Gue Cho
An overview paper in M$ Word format (yuck) giving a little more background and some cross references to biological uses for dot plots.
Genetics Computer Group
The manual for a real commercial dot plotting programming used for DNA analysis.
Peter J. Balafas
A dissertation on plagiarism detection from Loughborough University. I've only scanned it briefly, largely because I am allergic to the whole idea of automated plagiarism detection. Just in case you care you can find my reasoning on the IrrelevantAsides page
Paul Clough - Dotplot: Visually exploring patterns in text
Another presentation of the Meter project that is the subject of Balafas' paper, this time from Sheffield University

Links to related pages

Real artificial life: Where we may be
Very interesting paper on software as a living system. Makes passing reference to dotplots in Software genetics.

<>