Checking for duplicate code

figaro · June 1, 2012, 6:39pm

I have a short one-liner that does a very rudimentary check for duplicate code:

sort myfile.cpp | uniq -c | grep -v "^ *1 " | grep -v "}"

It sorts the file, counts the occurrences of each line, removes lines that occur only once and removes the ubiquitous closing brace. The target language is C++, but the approach is easily extended to other programming languages.

I would like to make this a bit more advanced. A few examples:

1- Ignore leading spaces (indentation), so that the following lines of output are considered identical:

   2     for (i = 0; i < N; i++) {
   2        for (i = 0; i < N; i++) {

2- Ignore spaces within the code, so that the following lines of output are considered identical:

   2     for (i = 0; i < N; i++) {
   2     for ( i = 0; i < N; i++ ) {

If there are easy ways to do this, I would like to hear from you.

I am deliberately not excluding comment lines, such as those containing "/*", "*/" or "//", as this would weaken the case for telling developers to document their code better.

Any other one-liner ideas to check for duplicate code are also welcome.

jim_mcnamara · June 2, 2012, 4:07pm

What is your idea of duplicate code? I'm sure you cannot script out duplicate code and still keep a functioning, logically ordered execution path.

figaro · June 3, 2012, 3:15am

I want to be able to spot code that is a candidate for refactoring. There is no intention to script out lines of code.

bakunin · June 3, 2012, 5:12am

I'd start by removing blanks before doing anything else. In C/C++ blanks mostly serve two functions: to make code easier to read (indentation) or to appear in output (like "printf( " \n");"). They also separate tokens, as in "int i", but for comparing lines that does not matter. In the following, replace "<spc>" and "<tab>" with literal space/tab characters.

sed 's/[<spc><tab>]*//g'

This removes any space or tab character from the source, including indentation.
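
Combined with the counting pipeline from the first post, this could look roughly like the sketch below. The function name "dup_lines" is made up for illustration, and "[[:blank:]]" (the POSIX class for space and tab) stands in for the literal characters. As a further assumption, awk replaces the two grep filters: it keeps lines whose count is greater than 1 and skips a lone closing brace.

```shell
# dup_lines FILE: list lines that occur more than once, ignoring all blanks.
# (hypothetical helper name, only for illustration)
dup_lines() {
    sed 's/[[:blank:]]//g' "$1" |    # drop every space and tab
        grep -v '^$' |               # ignore lines that are now empty
        sort | uniq -c |             # count identical lines
        awk '$1 > 1 && $2 != "}"'    # keep repeats, skip a lone closing brace
}
```

Called as "dup_lines myfile.cpp", it reports the two "for" lines from both examples in the first post as a single line with a count of 2.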

An idea you might want to follow is to concatenate lines which do not end in a closing brace or semicolon. Consider the following two lines:

a=b+c;

a =
b + c;

They are equal to the compiler, but your procedure would count them as different.

You can do this concatenation with sed, but it involves a little hold space / pattern space gymnastics:

sed -n 's/[<spc><tab>]*//g
     $ { x
         G
         s/\n//g
         p
         q
       }
     /[;{}]$/ {
            x
            G
            s/\n//g
            p
            s/.*//
            x
            d
          }
     /[;{}]$/! {
            H
            d
           }' /path/to/input

What it does (I suggest you get a sed reference if you are not familiar with this): first, the opening substitution deletes all spaces/tabs from the current line. Then there are 3 types of lines to handle:

The last line is covered first, in the block "$ {..". The content of the hold space is exchanged with the pattern space, then the content of the hold space (the former pattern space content, i.e. the current line) is appended to the pattern space: we concatenate the line with the previously read lines. Next, all the line feeds are deleted ( s/\n//g ) and the line is printed out, then we quit.

The next type of lines are the ones ending in a ";", a "{" or a "}" (braces end expressions too). We do practically the same as with the last line, but after printing the line to output we clear the pattern space and swap it into the hold space to "flush the buffers". Otherwise portions of the text would be duplicated.

The last type of lines are the ones which do not end in a brace or semicolon. We append their content to the hold space, delete the pattern space and start over with the next line.

So, in principle, we are collecting text in the hold space and flush that out on specific occasions (whenever we feel a "program line" is completely read).
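
To check this against the two forms of "a=b+c;" from above, here is the same script wrapped in a shell function and fed a small sample file. Treat it as a sketch: the name "join_statements" and the temporary sample file are only for illustration, and "[[:blank:]]" is used instead of literal space/tab characters.

```shell
# join_statements FILE: strip blanks, then join lines until one ends in ; { or }
# (hypothetical helper name, only for illustration)
join_statements() {
    sed -n 's/[[:blank:]]//g
        $ { x
            G
            s/\n//g
            p
            q
          }
        /[;{}]$/ {
            x
            G
            s/\n//g
            p
            s/.*//
            x
            d
          }
        /[;{}]$/! {
            H
            d
          }' "$1"
}

# Sample input: the same statement written on one line and split over two.
sample=$(mktemp)
printf 'a=b+c;\na =\nb + c;\n' > "$sample"
join_statements "$sample" | sort | uniq -c
rm -f "$sample"
```

Both forms come out as "a=b+c;", so the final "uniq -c" reports that line once with a count of 2.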

I hope this helps.

bakunin