File Comparison

I have to compare two text files. Very few of the lines in these files will differ, and when they do, the difference is only in some columns.
The files are several GB in size.
Sample lines are as below:
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccddeddd

So, assuming these two lines come from file1 and file2 respectively, I should get the second file's line in a new output file, which is the difference file.

What I would like to do is read a line from file1 and loop through all the lines in file2, stopping when a match is found; otherwise, print that line to the output file. Then repeat the same steps for all the lines from file1.

I'd appreciate any help in this regard.

What do you mean by "stop when a match is found"? Do you then read more from file1?
Do you want the line number? "Stop" usually means to exit the read loop.

Yes, I want to exit the read loop when a match is found; I do not want to check any further for that line.
No, I do not need the line number.

If I understand what you're trying to do correctly, here's a quick bash script.

#!/bin/bash

# note: no spaces around = in shell assignments
compareFile="/path/to/file/to/compare.txt"
outputFile="/path/to/outputFile.txt"

for filename in /some/dir/of/text/files/*.txt; do

        numlines=`wc -l < "$filename"`

        for i in `seq 1 $numlines`; do
                # pull out line $i of the current file
                current=`head -$i "$filename" | tail -1`

                grep -q "${current}" "${compareFile}"

                if [ $? != 0 ]; then
                        # doesn't exist, append to $outputFile
                        echo "${filename}:${current}" >> "${outputFile}"
                fi
        done
done

Hi, thank you for the quick solution; it looks like pretty much what I want.
But I am unable to run this script; I use ksh.
One of the errors is "seq: command not found".

which seq (usually resides in /usr/bin/)

It's an individual executable command; it should be part of the coreutils package if you're using Linux.

If it exists on your system, add a variable to the script:
seq="/path/to/seq"

then modify the for statement to use the variable: for i in `${seq}...
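
Spelled out against the script above, that loop line would presumably become:

for i in `${seq} 1 $numlines`; do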

As mentioned by the OP, the files are in the GB range, so I suspect this approach will have some performance lag; just a guess.
Also, seq is not a standard command on some *nix OSes. If you want a loop that runs over a counter, a while loop can be used instead, e.g. while [ $num -le $numlines ]; see the sketch below.
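
A minimal sketch of such a counter loop (the file name is assumed from this thread, and the loop body is left as a placeholder):

#!/bin/ksh
filename=file1

numlines=`wc -l < "$filename"`
num=1

while [ $num -le $numlines ]; do
        # pull out line $num of the file
        current=`head -$num "$filename" | tail -1`
        # ... run the same grep test against the compare file here ...
        num=$((num + 1))
done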

Do we have to loop?

$ cat f1
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
91111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
81111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
$
$ cat f2
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccddeddd
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccddeddd
91111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
$
$ diff f1 f2 |grep "<" |cut -d"<" -f2 |cut -c2-
81111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd

HTH

I'd probably use diff too...

If the lines in the files are similar to the lines you put in your first post, meaning there are no spaces on the lines, you could:

#!/bin/sh
for k in `cat file1`
do
  grep -m 1 $k file2 > /dev/null
  if [ $? -eq 1 ]; then echo $k; fi 
done

The -m 1 will cause grep to exit after the first match is found. If no match is found, grep exits with status 1, and you can use that to determine whether the line exists in file2 or not. Keep in mind that the "for k in `cat`" construct will break if the lines in the file contain spaces; a whitespace-safe variant is sketched below.
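
If the lines may contain spaces, a while read loop is the usual workaround. A minimal sketch, assuming the same file1/file2 names as above (-m 1 is GNU grep, as in the script):

#!/bin/sh
# IFS= and -r preserve leading whitespace and backslashes in each line
while IFS= read -r k; do
  grep -q -m 1 -- "$k" file2 || echo "$k"
done < file1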

grep -v -f file1 file2

This will give you all the lines in file2 which are not in file1
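
One caveat: by default grep treats every line of file1 as a regular expression that may match anywhere within a line of file2. If the lines should be compared literally and in full, adding -F (fixed strings) and -x (whole-line match) is safer, and often faster as well:

grep -F -x -v -f file1 file2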

This is excellent stuff; this is exactly what I wanted. Thank you so much.

Guess I got excited too early.
It worked fine for small files, but when I tried it on large files (200+ MB), it ran for 3 hours and was still running, so I had to kill it.
I'd appreciate any alternative tips that take the size of the files into account.

Which one were you running?

grep -v -f file1 file2

I am trying stateful's method, but I am not getting any output.
I made your code into a script file and executed it in the directory where the files reside; it comes back without any output or error. I am trying it on small files to verify.

Hi.

By chance I am working with a text file of this size. It contains just over 1 GB and has 15 M (15,000,000) lines. The real time to count the lines with wc is 15-20 seconds (AMD-64/3000, SATA disk).

If this is correct, and you have 2 such files, then I think any method that reads a line from file1 and launches a program to look through file2 at each step will not end quickly, because there will be 15 M loads of that program involved, not to mention the cost of actually reading the file. For example, running grep against /dev/null 15,000 times takes about 10 seconds (10.2, actually) of real time. For 1,000 times that many invocations, I'd be looking at 2.75 hours just to load grep from the disk and read an immediate EOF. A single grep for a non-existent string takes about 18 seconds.
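
For reference, that measurement can be reproduced with something like the following (timings will of course vary by machine):

# time 15,000 fork/exec cycles of grep reading an empty input
time for i in `seq 1 15000`; do grep -q x /dev/null; done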

I suggest that the files be sorted and diff be run once on the two files (post #8, rikxik). That will be 2 passes across each file, a decrease of close to 100% from 15M passes over 1 file.
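
A minimal sketch of that sequence, using the file names from this thread:

sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted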

If my facts are wrong, then tell me where I missed something of importance or made a mistake. Otherwise, perhaps we should take a step back and you can tell us what the higher purpose of the problem is -- what problem you are really trying to solve -- and perhaps we can suggest some other approach ... cheers, drl

Hi drl - I was wondering whether there is any reason for, or performance gain from, sorting the files before running diff? Is it essential? Just thinking aloud.

Hi, rikxik.

I was thinking that the diff window to look for sequences would not be so large. However, if the files were very similar, then the sort could perhaps be skipped -- I hope for the best, but expect the worst :-)

It would be interesting to try it both ways, of course ... cheers, drl

I did sort both files and then tried diff as well as grep -v -f file1 file2; same problem.
It is running for too long.

Hi.

Perhaps I had more luck -- I didn't have to wait so long for a definitive answer. On 2 different machines, I had 2 large, similar, but different files, each about 1 GB in size. One machine had 2.5 GB of memory, the other 1 GB. When I used diff, I got the message:

diff: memory exhausted
 Exit status: 2

So I sorted the files and ran:

comm -3 file1 file2

On one machine the elapsed time for comm was 3 minutes (2.8 GHz Xeon, RHEL 4), and on the other, 2.5 minutes (AMD-64, 3000+, Debian sarge).

You may need to glance at man comm to see what it is doing -- it does require sorted input files, and then presents unique entries in both files.
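
Spelled out with the file names from this thread: comm -13 would suppress the columns for lines unique to file1 and lines common to both, leaving just the lines unique to file2, which is what the original post asked for.

sort file1 > file1.sorted
sort file2 > file2.sorted

comm -3 file1.sorted file2.sorted      # lines unique to either file, as above
comm -13 file1.sorted file2.sorted     # only the lines unique to file2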

Best wishes ... cheers, drl