Looking to find files that are similar.

Hello all,

I have an AIX server running a tool that converts various print streams (AFP/metadata) to PDF. This is done with a REXX script and an off-the-shelf utility.

Each report (there are around 125) uses a particular script file, which is basically a text file.

I am trying to find out whether the scripts are similar (some of them might even be identical), and I'm wondering if anyone has advice on the best way to check.

Here is a sample of what it would be like:

One or more script files (.txt) might be in:

/app/transformation/project1/

Other files exist there as well, etc.

/app/transformation/project2/

Does anyone have feedback on the best way to compare them?

diff -lrs /app/transformation/project1/ /app/transformation/project2/ | more

Thank you for the prompt response. This method looks like it will work, but I am wondering if there is any way to compare more than 2 files.

Define "compare". How similar are we talking here? If "similar" means exactly the same, then use checksums.

for file in /app/transformation/project1/*.pdf /app/transformation/project2/*.pdf
do
  cksum "$file"
done | sort -n > mypdfs.txt
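Once the list is sorted by checksum, duplicates sit on adjacent lines, so they can be filtered out automatically. Here is a sketch using awk (the directories are the ones from the question; adjust the glob to whatever files you actually want to compare):

```shell
# Print only files whose checksum occurs more than once.
# Errors for non-existent paths are discarded so the pipeline still runs.
cksum /app/transformation/project1/*.txt /app/transformation/project2/*.txt 2>/dev/null |
  sort -n |
  awk 'prev == $1 { if (!shown) print prevline; print; shown = 1 }
       prev != $1 { shown = 0 }
       { prev = $1; prevline = $0 }'
```

Each group of lines in the output shares one checksum, i.e. those files are byte-for-byte identical.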

matching files will all have the same checksum. AIX cksum example output:

3995432187       1390    file.pdf

where 3995432187 is the checksum, 1390 is the file size in bytes, and file.pdf is the filename. Sorting by checksum places duplicates on adjacent lines, which is what makes multiple matches easy to spot.

Just wanted to point out a gotcha here: even a single extra blank line or space will produce a different checksum. So cksum finds byte-for-byte identical files, not merely similar ones.
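If whitespace is the only difference you expect, you could normalize each script before checksumming it, so cosmetic differences don't hide a match. A sketch (adjust the sed normalization to whatever counts as "the same" for your scripts):

```shell
# Checksum each script after stripping trailing whitespace and blank lines,
# so files that differ only cosmetically still get the same checksum.
# The directories are the ones from the question.
for f in /app/transformation/project1/*.txt /app/transformation/project2/*.txt
do
  [ -f "$f" ] || continue
  sum=$(sed -e 's/[[:space:]]*$//' -e '/^$/d' "$f" | cksum)
  printf '%s %s\n' "$sum" "$f"
done | sort -n
```

Note that cksum on stdin prints only the checksum and size, so the filename is appended separately here.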

Identical = exact, which should mean the checksums match. Similarity is a genuinely hard problem; google for Levenshtein distance or the Wagner-Fischer algorithm.
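Short of implementing a real edit-distance algorithm, a crude line-level similarity score can be squeezed out of plain diff. This is a sketch for quick triage, not Levenshtein distance; the function name is made up:

```shell
# Rough similarity score (0-100): percentage of lines in the two files
# that diff does NOT flag as added or removed. Quick triage only.
similarity() {
  total=$(cat "$1" "$2" | wc -l)
  # Guard against two empty files (avoid division by zero).
  [ "$total" -gt 0 ] || { echo 100; return; }
  changed=$(diff "$1" "$2" | grep -c '^[<>]')
  echo $(( (total - changed) * 100 / total ))
}
```

Usage would be something like `similarity /app/transformation/project1/report.txt /app/transformation/project2/report.txt`; scores near 100 mean mostly-matching scripts worth a closer look with diff.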