Two folder comparison

Hi,

I have a few files in a directory, and the same set of files in another directory. I need to remove the lines starting with the word 'HDR' or 'FTR', if present, from all the files in both directories. Then I need to sort the contents of all the files in both directories and compare the two directories. If two files are not the same, I should report that the two files are different. If a file exists in only one of the directories, I should report that the file is present in one directory and not in the other. The columns in the files are either tab separated or | (pipe) separated.

One thing is that the file size may be huge. It might be in the hundreds of MB.

Can anyone help me by giving a script to do this job?

This would be a lot simpler if you could make a temporary copy of both directories. Is this feasible, or are they too large?

mkdir /tmp/dir1 /tmp/dir2
for file in dir1/* dir2/*; do
  egrep -v '^(HDR|FTR)' "$file" | sort >/tmp/"$file"
done
diff -rad /tmp/dir1 /tmp/dir2 | egrep '^(diff|Only in )'

... assuming dir1 and dir2 are the original directories you want to compare. If they are in different places in the file system, perhaps it would be easiest to create symlinks for the duration of this script. And of course, once you are done, you can remove the copies in /tmp/dir1 and /tmp/dir2.

The output from diff is not particularly intuitive but it contains the information you request; you can post-process it with sed, or simply load it in your editor and massage it into something your manager can understand.
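As a rough sketch of the sed post-processing: the function name summarize_diff and the wording of the messages are made up for illustration, and it assumes your diff prints "diff -r A B" and "Only in DIR: FILE" lines and that the paths contain no spaces.

```shell
#!/bin/sh
# Sketch only: turn the interesting lines of "diff -r" output into plainer
# messages. summarize_diff is a hypothetical helper name.
summarize_diff() {
  # -n suppresses everything except the lines we explicitly print with /p
  sed -n \
    -e 's/^diff -r \(.*\) \(.*\)$/DIFFERENT: \1 and \2/p' \
    -e 's/^Only in \(.*\): \(.*\)$/MISSING: \2 is present only in \1/p'
}
```

You would use it in place of the final egrep, for example: diff -r /tmp/dir1 /tmp/dir2 | summarize_diff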

I'm not sure I captured the sorting requirement correctly. This creates sorted copies of each file, after removing the HDR and FTR lines; if you want all the content in a single file, of course, that can be done, too (but running diff on a single huge file is going to be painful).
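If the single-file reading is what you meant, something along these lines might do. It is only a sketch: combine_sorted is a hypothetical name, and grep -E is the POSIX spelling of egrep.

```shell
#!/bin/sh
# Sketch only: print all non-HDR/FTR lines from every regular file in
# directory $1 as one sorted stream. combine_sorted is a hypothetical name.
combine_sorted() {
  for f in "$1"/*; do
    test -f "$f" || continue          # skip subdirectories and unmatched globs
    grep -Ev '^(HDR|FTR)' "$f"
  done | sort
}
```

For example, run combine_sorted dir1 >/tmp/all1 and combine_sorted dir2 >/tmp/all2, then cmp -s /tmp/all1 /tmp/all2 || echo "contents differ". Note that this can only tell you whether the combined contents differ, not which file is responsible.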

diff is really overkill for this problem, but it produces exactly the information you wanted. It might be easier on the hardware to run cmp and then separately check for files which exist in one directory but not in the other.
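A minimal sketch of the cmp approach, assuming the filtered and sorted copies already exist in two directories. The function name compare_dirs and the exact message wording are illustrative, not anything standard.

```shell
#!/bin/sh
# Sketch only: report differing files and files present on one side only.
# compare_dirs is a hypothetical helper name.
compare_dirs() {
  a=$1
  b=$2
  for f in "$a"/*; do
    test -f "$f" || continue          # skip subdirectories and unmatched globs
    name=`basename "$f"`
    if test -f "$b/$name"; then
      # cmp -s prints nothing and only sets the exit status
      cmp -s "$f" "$b/$name" || echo "Files $f and $b/$name differ"
    else
      echo "Only in $a: $name"
    fi
  done
  # second pass: files that exist only in the second directory
  for f in "$b"/*; do
    test -f "$f" || continue
    name=`basename "$f"`
    test -f "$a/$name" || echo "Only in $b: $name"
  done
}
```

Called as compare_dirs /tmp/dir1 /tmp/dir2, it stops reading each pair of files at the first differing byte, which is cheaper than computing a full diff on files of several hundred MB.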

Hi era,

I am getting this message and the temp directories are not getting created and the files are not compared.
bash: /usr/local/bin/..: is a directory

The code I used is

mkdir /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/tmp/dir1 /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/tmp/dir2
for file in /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/testing/* /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/testing1/*; do
egrep -v '^(HDR|FTR)' "$file" | sort >/tmp/"$file"
done
diff -rad /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/tmp/dir1 /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/tmp/dir2 | egrep '^(diff|Only in )'

Why is this happening?

The variable "file" contains the whole full path, and if the same path doesn't exist under /tmp it will complain about the redirection. That's what I was alluding to with the suggestion to create symbolic links to the directories.

ln -s /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/testing dir1
ln -s /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/testing1 dir2
mkdir /tmp/tmp1 /tmp/dir2
for file in dir1/* dir2/* ...

Hi era,

I am new to Unix. I am getting this error when I run the script.

++ set -x
++ ln -s /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/testing dir1
++ ln -s /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/testing1 dir2
++ mkdir /tmp/dir1 /tmp/dir2
mkdir: Failed to make directory "/tmp/dir2"; File exists
++ egrep -v '^(HDR|FTR)' dir1/file1.txt
++ sort
++ egrep -v '^(HDR|FTR)' dir1/file2.txt
++ sort
++ egrep -v '^(HDR|FTR)' dir1/file3.txt
++ sort
++ egrep -v '^(HDR|FTR)' dir1/testing
++ sort
sort: missing NEWLINE added at end of input file STDIN
++ egrep -v '^(HDR|FTR)' dir2/file1.txt
++ sort
++ egrep -v '^(HDR|FTR)' dir2/file2.txt
++ sort
++ egrep -v '^(HDR|FTR)' dir2/file3.txt
++ sort
++ egrep -v '^(HDR|FTR)' dir2/testing1
++ sort
sort: missing NEWLINE added at end of input file STDIN
++ diff -rad /tmp/dir1 /tmp/dir2
++ egrep '^(diff|Only in )'
diff: illegal option -- a
usage: diff [-bitw] [-c | -e | -f | -h | -n] file1 file2
diff [-bitw] [-C number] file1 file2
diff [-bitw] [-D string] file1 file2
diff [-bitw] [-c | -e | -f | -h | -n] [-l] [-r] [-s] [-S name] directory1 directory2

The directories I need to compare are testing and testing1, which are in the path /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/

The code I used is
ln -s /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/testing dir1
ln -s /gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/testing1 dir2
mkdir /tmp/dir1 /tmp/dir2
for file in dir1/* dir2/*; do
egrep -v '^(HDR|FTR)' "$file" | sort >/tmp/"$file"
done
diff -rad /tmp/dir1 /tmp/dir2 | egrep '^(diff|Only in )'

But it's still failing. Can you give me the exact code?

You have a typo in the mkdir, you have "mkdir /tmp/dir11" with one 1 too many.

(Looking back, I had the wrong directory names there too, tmp1 and tmp2 instead of dir1 or dir2.)

Of course, once the symlinks and directories are in place, the code to create them won't need to be run again.

If you have the following:

  • symlink dir1 pointing to ...path to/testing
  • symlink dir2 pointing to ...path to/testing1
  • directory /tmp/dir1 exists
  • directory /tmp/dir2 exists

... then you should be ready to go with the for loop.

Looks like you will also need to drop the -a and -d options to diff, so just use diff -r.

(For the record, my diff has an -a option which says to always treat all files as text, and -d basically means use a more thorough algorithm in order to keep the diffs small. I guess you can cope without either of those.)

If testing and testing1 are subdirectories among the files, perhaps you want to skip those from the loop.

for file in dir1/* dir2/*; do
  test -d "$file" && continue  # skip if it's a directory
  egrep -v '^(HDR|FTR)' "$file" | sort >/tmp/"$file"
done
diff -r /tmp/dir1 /tmp/dir2 | egrep '^(diff|Only in )'

Testing and testing1 are the directories containing the files which have to be compared.

This script is going to be automated; it will be a generic one and will be run from Java code. So I can't create symbolic links each and every time I need to compare two directories. Can the code be modified to suit this purpose?

Oh, and if your diff is that different (sic) from mine, perhaps you want to check that its output is as expected before you plunge ahead and run it on all your files.

Try something quick in the /tmp directory:

vnix$ cd /tmp
vnix$ mkdir foo bar
vnix$ echo one >foo/one
vnix$ echo one >bar/one
vnix$ echo two >foo/too
vnix$ echo three >foo/three
vnix$ echo four >bar/three
vnix$ diff -r foo bar
diff -r foo/three bar/three
1c1
< three
---
> four
Only in foo: too
vnix$ 

The lines starting with "diff -r" and "Only in" are the ones grep will look for. If your output looks different, you need to adapt.

If the directories you want to compare will always exist side by side in the directory tree, then yes, it's trivial. If not, some additional gyrations are required, but it's not extremely complicated either.

#!/bin/sh

case $# in 2);; *) echo "syntax: $0 dir1 dir2" >&2; exit 2;; esac

dir1="$1"
dir2="$2"

base1=`basename "$dir1"`
base2=`basename "$dir2"`

# note: predictable temp names are a security problem!
# FIXME: use something fancier here
# Also, won't work if dir1 or dir2 is directly in /tmp ...
mkdir /tmp/$base1 /tmp/$base2

# clean out temporary directories if interrupted
trap 'rm -rf /tmp/$base1 /tmp/$base2' 0
trap 'exit 127' 1 2 3 5 15

while read d t; do
  for f in "$d"/*; do
    test -d "$f" && continue  # skip subdirectories
    egrep -v '^(HDR|FTR)' "$f" | sort >/tmp/"$t"/"${f#"$d"/}"
  done
done <<HERE
$dir1 $base1
$dir2 $base2
HERE

diff -r /tmp/$base1 /tmp/$base2 | egrep '^(Only in |diff )'

As you can see, the additional motions are rather unsightly but not altogether very complex.
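For the FIXME about predictable temp names, one possible direction is sketched below. It assumes your system has a mktemp command that accepts -d, which is not guaranteed on older Unixes; the function names and the left/right directory names are made up for illustration, and grep -E is the POSIX spelling of egrep.

```shell
#!/bin/sh
# Sketch only. prepare_sorted_copy and compare_filtered are hypothetical
# helper names; mktemp -d is assumed to be available.

# Copy every regular file from directory $1 into $2, minus HDR/FTR lines, sorted.
prepare_sorted_copy() {
  mkdir -p "$2" || return 1
  for f in "$1"/*; do
    test -f "$f" || continue          # skip subdirectories and unmatched globs
    name=`basename "$f"`
    grep -Ev '^(HDR|FTR)' "$f" | sort >"$2/$name"
  done
}

# Compare directories $1 and $2 via filtered copies under a fresh temp directory.
compare_filtered() {
  work=`mktemp -d` || return 2
  prepare_sorted_copy "$1" "$work/left"
  prepare_sorted_copy "$2" "$work/right"
  # run diff from inside the temp directory so the report says left/right
  (cd "$work" && diff -r left right) | grep -E '^(Only in |diff )'
  rm -rf "$work"
}
```

Called as compare_filtered /path/to/testing /path/to/testing1, "left" in the report stands for the first argument and "right" for the second. This also sidesteps the problem with directories that live directly in /tmp or that share the same base name.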

Are you getting paid to solve this problem?

Hi era,

Yes, I am paid for this. I tried to create two variables containing the paths of the directories that hold the files, and I am creating a symbolic link for each variable.

But I am getting stuck when creating the temporary folder.

a=/gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/testing/
b=/gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/testing1/
ln -s $a dir1
ln -s $b dir2
c=/gdw/shared/qa2/dev/scripts_regr_test/design3/extract_comparison/
mkdir $c/tmp/dir1 $c/tmp/dir2
for file in dir1/* dir2/*; do
test -d "$file" && continue # skip if it's a directory
egrep -v '^(HDR|FTR)' "$file" | sort >/tmp/"$file"
done
diff -r $c/tmp/dir1 $c/tmp/dir2 | egrep '^(diff|Only in )'

Can this code itself be fixed?

You also need the $c before /tmp in the redirection after the sort.
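In other words, the loop would look roughly like this. It is a sketch that assumes $c, the dir1 and dir2 symlinks, and the $c/tmp/dir1 and $c/tmp/dir2 directories are already set up as in your script; the function wrapper filter_into_tmp is only there for illustration, and grep -E is the POSIX spelling of egrep.

```shell
#!/bin/sh
# Sketch only: write the filtered, sorted copies under $c/tmp instead of /tmp.
# filter_into_tmp is a hypothetical wrapper name.
filter_into_tmp() {
  for file in dir1/* dir2/*; do
    test -d "$file" && continue   # skip if it's a directory
    grep -Ev '^(HDR|FTR)' "$file" | sort >"$c/tmp/$file"
  done
}
```

After that, the diff -r $c/tmp/dir1 $c/tmp/dir2 line you already have should find the files where it expects them.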