I have to compare two text files; only a few of the lines in these files will differ in some column.
The files are gigabytes in size.
Sample lines are shown below:
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccddeddd
Assuming these two lines are from file1 and file2 respectively, I should get the second file's line in a new output file, which is the difference file.
What I would like to do is read line 1 from file1, loop through all the lines in file2, and stop when a match is found; otherwise, print that line to the output file. Then repeat the same steps for all the lines from file1.
What do you mean by "stop when a match is found"? After that, do you read more from file1?
Do you want the line number? "Stop" usually means to exit the read loop.
If I understand what you're trying to do correctly, here's a quick bash script.
#!/bin/bash
compareFile="/path/to/file/to/compare.txt"
outputFile="/path/to/outputFile.txt"
for filename in /some/dir/of/text/files/*.txt; do
    numlines=$(wc -l < "$filename")
    for i in $(seq 1 "$numlines"); do
        current=$(head -"$i" "$filename" | tail -1)
        if ! grep -q "${current}" "${compareFile}"; then
            # doesn't exist, append to $outputFile
            echo "${filename}:${current}" >> "${outputFile}"
        fi
    done
done
Hi, thank you for the quick solution; it looks pretty much like what I want.
But I am unable to run this script; I use ksh.
One of the errors is "seq: command not found".
As mentioned by the OP, the files are in GB, so I think there will be some performance lag; just a guess.
Also, seq is not a standard command on some *nix OSes. Therefore, if you want a loop over a counter, a while loop can be used instead, e.g. while [ $num -le $numlines ]
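As a minimal sketch of that while-loop idea (the file input.txt is a hypothetical sample created here purely for illustration; substitute your own file name):

```shell
#!/bin/sh
# Portable counter loop: avoids seq, which is missing on some *nix systems
# (as reported above for ksh). input.txt is a sample file for illustration.
printf 'line1\nline2\nline3\n' > input.txt

numlines=$(wc -l < input.txt)
num=1
while [ "$num" -le "$numlines" ]; do
    # sed -n "${num}p" prints only line number $num
    current=$(sed -n "${num}p" input.txt)
    echo "$num: $current"
    num=$((num + 1))   # POSIX arithmetic; works in ksh, bash, and sh
done
```

The $((...)) arithmetic and the [ ... ] test are both POSIX, so this should run unchanged under ksh.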
If the lines in the files are similar to the lines you put in your first post, meaning there are no spaces on the lines, you could:
#!/bin/sh
for k in $(cat file1)
do
    grep -m 1 "$k" file2 > /dev/null
    if [ $? -eq 1 ]; then echo "$k"; fi
done
The -m 1 will cause grep to exit after the first match is found. If no match is found, grep exits with status 1; you can use that to determine whether the line exists in file2 or not. Keep in mind that the `for k in $(cat file1)` construct will break if there are spaces in the lines of the file.
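If spaces in the lines are a concern, a `while read` loop avoids the word-splitting problem, since it keeps each whole line intact. A minimal sketch (file1, file2, and diff.out here are hypothetical sample names created for illustration):

```shell
#!/bin/sh
# Space-safe variant: "while read" reads whole lines, unlike
# `for k in $(cat file1)`, which splits on whitespace.
printf 'alpha one\nbeta two\n' > file1
printf 'alpha one\ngamma three\n' > file2

while IFS= read -r line; do
    # -F: treat the line as a fixed string (no regex metacharacters)
    # -x: match whole lines only;  -q: quiet, just set the exit status
    if ! grep -Fxq "$line" file2; then
        echo "$line"        # line from file1 not present in file2
    fi
done < file1 > diff.out
```

The -F and -x flags also make the comparison stricter than the plain grep above: a line is reported only if no identical whole line exists in file2, rather than being suppressed by a partial match.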
Guess I got excited too early.
It worked fine for small files, but when I tried it on large files (200+ MB), it ran for 3 hours and was still running, so I had to kill it.
I would appreciate any alternative tips that take the size of the files into account.
I am trying the suggested method, but I am not getting any output.
I made your code into a script file and executed it where the files reside, but I do not see anything; it comes back without any output or error. I am trying it on small files to verify.
By chance, I am working with a text file of this size (1 GB). It contains just over 1 GB and has 15 M (15,000,000) lines. The real time to count the lines with wc is 15-20 seconds (AMD-64/3000, SATA disk).
If this is correct, and you have 2 such files, then I think any method that reads a line from file1 and launches a program to look through file2 at each step will not end quickly, because there will be 15 M loads of that program involved, not to mention actually reading the file. For example, running grep against /dev/null 15,000 times takes about 10 seconds (10.2 actually) of real time. For 1,000 times that many invocations, I'd be looking at about 2.75 hours just to load grep from disk and read an immediate EOF. A grep for a non-existent string takes about 18 seconds for a single search.
I suggest that the files be sorted and that diff be run once on the two files (post #8, rikxik). That will be 2 passes across each file, instead of 15 M passes over one file.
If my facts are wrong, then tell me where I missed something of importance or made a mistake. Otherwise, perhaps we should take a step back and you can tell us the higher purpose of the problem -- what problem you are really trying to solve -- and perhaps we can suggest some other approach ... cheers, drl
I was thinking that the diff window to look for sequences would not be so large. However, if the files were very similar, then the sort could perhaps be skipped -- I hope for the best, but expect the worst
It would be interesting to try it both ways, of course ... cheers, drl
Perhaps I had more luck -- I didn't have to wait so long for a definitive answer. On 2 different machines, I had 2 large, similar, but different files of size about 1 GB. One machine had 2.5 GB memory, the other 1 GB. When I used diff, I got the message:
diff: memory exhausted
Exit status: 2
So I sorted the files and ran:
comm -3 file1 file2
On one machine the elapsed time for comm was 3 minutes (2.8 GHz Xeon, RHEL 4), and on the other, 2.5 minutes (AMD-64, 3000+, Debian sarge).
You may need to glance at man comm to see what it is doing -- it does require sorted input files, and then presents the lines that are unique to each file.
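Putting the sort-then-comm approach together as a minimal sketch (file1, file2, and diff.out are small hypothetical samples created here for illustration; on real GB-sized files the sorts dominate the run time):

```shell
#!/bin/sh
# One pass per file with comm, instead of one grep invocation per line.
printf 'bbb\naaa\nccc\n' > file1
printf 'aaa\nccc\nddd\n' > file2

# comm requires sorted input
sort file1 > file1.sorted
sort file2 > file2.sorted

# -3 suppresses lines common to both files, leaving column 1
# (lines only in file1) and column 2 (lines only in file2, tab-indented)
comm -3 file1.sorted file2.sorted > diff.out
```

To get only the lines unique to file2 (the output the original poster asked for), `comm -13 file1.sorted file2.sorted` suppresses column 1 as well as the common column 3.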