New versions will be in C# because I want to learn that language and a
real application is the best way to learn.
Parsing
The original tokeniser splits on whitespace; this means that strings
such as:
using System.Windows.Forms;
appear as two tokens instead of the expected four (or five if you
count the semicolon, seven if you count the to dots). This can
obscure some similarities but is so much faster and so much simpler to
implement than a general parsing algorithm that it is probably the
best compromise. The real problem comes in analysing files created in
general purpose text editors. Such files do not always include spaces
between tokens such as operators. For instance the MS VB IDE editor
adds spaces around infix operators, suppresses them before opening
brackets, adds them before end of line comments and so on; this has
the effect of ensuring that operators are seen as tokens, however,
such spacing is not required by the compiler.
For instance the following line is perfectly legal VB:
A=B+C
but if you enter it using the VB IDE it will become
A = B + C
The first will be seen as one token and the second as five tokens.
This is a serious limitation if we think in terms of detecting
plagiarism because then we must assume that the programmer of at least
one of the files is hostile to the process, on the other hand in
professional contexts we are generally looking for sloppy programming
that involves verbatim copying without any intent to conceal, only an
intent to get the job done as quickly as possible. This means that
copied source will tend to have the same spacing as the original even
if the programmer has altered it a little to fit its new home.
The problems associated with the original tokenisation method
areeasily avoided with the .NET framework regular expression class
because it has a Split method that allows a regular expression to be
used to identify the delimiters.
Implementation Notes
Make the xGetToken function responsible for the maintenance of the
current character position variable
Connect token axis values directly to token positions in the source
documents.
Drop the current file strings and read the file directly into the
text boxes. This should work properly now that I am using C#
because the regular expression Split method returns the delimiters
as well as the tokens.
New Features
Speed is better after the change to the way the tokens were stored.
Now it is time to add more features. The first question is which
ones. Possible features include:
Allow user to specify that files in sub-directories should be
processed.
Allow user to specify a regular expression to match filenames.
Split on non-word characters using a regular expression.
Provide flags to ignore delimiters or treat them as tokens
Allow user to supply a regular expression that identifies delimiters
or one that identifies tokens.
Supply sample regular expressions intended for specific languages
When clicking in the plot to select text in the files make sure that
the selection in the text boxes covers all of the tokens that are
represented by the selected pixel. For large texts one screen pixel
represents many tokens, perhaps even hundreds of lines.
All files in a directory,
Selected files
Files in a VBP
Files mentioned in another file such as a make file, SharpDevelop
combine, Visual Studio solution, etc.
zoom
synchronise selection in text box with selection in picture box so
that selecting in text moves a cross hair in the plot.
use cursor keys in plot
draw boundary lines between files on the plot
save plots as jpg, png etc.
save parsed files and plots so that a session can be restored
without reparsing the files or replotting. The reason for this is
that large files or sets of files will be very slow.
Improvements to Algorithms
Instead of parsing line by line just Split the whole file on spaces.
This should be faster but how do we connect a token number to the
position in the file? One way is to accumulate the offset by adding
the token lengths together, must take care of the delimiting
spaces. Then we must use the token offset instead of the line
offset because we will not know where the line boundaries are.
Read the whole file and split on end of line characters.
Dotplot in and for VB
Dotplots are two dimensional charts. To produce such a chart two text
files are tokenised. The values along the axes are the tokens;
imagine the whole text of the file written along the axis.
For example:
The quick brown fox
The X
dirty
brown X
dog
Dots are placed at each intersection point where the tokens are the
same. The same technique can look for repeated sequences within a
file.
Example with case ignored:
The quick brown fox and the dirty brown dog
The X X
quick X
brown X X
fox X
and X
the X X
dirty X
brown X X
dog X
Notice that the diagonal line is completely filled and that there are
diagonals in the upper right and lower left quadrants.
The purpose of such a plot is to make detection of duplicated code
easier. Many academic citations of this method, other than those in
bilogical sciences, tend to couple it with plagiarism detection.
While it can of course be used to detect stolen code in software development
course and in detecting stolen passages from novels and scientific
reports my feeling is that the best use for the technique is to aid in
software maintenance by making it easy to detect code that has simply
been copied and edited in order to get a program up and running
quickly. There is nothing wrong with cut and paste development so
long as the developer stays awake and refactors the code when the
duplication reaches a certain level (one copy is alright, two should
make you make a mental note, three means it's time to refactor)
Quick and Dirty Implementation
Create a simple VB project.
Read two files
Convert all to lower case
Tokenise each line with the Split function using space as the delimiter.
Discard empty tokens
Scale a picture box so that the width and height correspond to the
number of tokens in each file.
Iterate over the tokens and place a dot at each point in the
picture box that corresponds to identical tokens.
A dissertation on plagiarism detection from
Loughborough University. I've only scanned it briefly, largely
because I am allergic to the whole idea of automated plagiarism
detection. Just in case you care you can find my reasoning on the
IrrelevantAsides page