Compare multiple columns between 2 files

stevesmith · September 18, 2006, 10:05am

hello

I need to compare 2 text files. File 1 has 2 columns and file 2 has 1 to many.
Sample:

File 1:

111 555
222 666
333 777
444 755

File 2:

000 110 113 114
844 111 555 999
202 777 865 098 023
222 313 499
065 655 333 011 890 777
433

Results should be from file1:
222 666
444 755

I need to find all the lines from file 1 that the data in column 1 and 2 do not also exist in a row in anywhere in file 2. The files are a few thousand lines each.

I've looked at awk and nawk but to be honest I'm not sure where to begin with this.

Thanks very much

jim_mcnamara · September 18, 2006, 10:33am

try using associative arrays -

#!/bin/ksh
# this looks at all the columns in searchfile 
# prints a line from outputfile when no column in 
#    the outputfile line matches any column in 
#     searchfile
awk '
	FILENAME=="searchfile" {
	 for(i=1;i<=NF;i++ ) {Keys[$i]++}
	}
	FILENAME=="outputfile" {
		found=0
		for(i=1;i<=NF;i++) { if(Keys[$i]>0) {found=1; break;}}
		if (found == 0) {
			print $0
		}
	}
' searchfile outputfile

Doc_RuNNeR · September 18, 2006, 12:14pm

In the aux file you have the solution

#!/bin/bash

echo > aux
for i in `cat file1`
do
  bool=1
  for j in `cat file2`
  do
    if [ `echo $i` = `echo $j` ]
      then
        bool=0
    fi
  done
  if [ $bool -eq 1 ]
    then
      echo $i >> aux
  fi
done

Ask if you have questions, see you

stevesmith · September 18, 2006, 12:25pm

Awesome.

A couple of little changes I needed to make for it to work exactly as I needed:

awk '
FILENAME=="searchfile" {
for(i=1;i<=NF;i++ ) {Keys[$i]++}
}
FILENAME=="outputfile" {
found=0
for(i=1;i<=NF;i++) { if(Keys[$i]>0) {found++; }}
if (found != 2 ) {
print $0
}
}
' searchfile outputfile

Many thanks for your help on this!!!

dsravan · September 18, 2006, 1:18pm

can you guys tell me what's the search file and output file here. My guess is that output file is empty before the process and the search file is file2. Am i wrong in my assumption? Let me know.

stevesmith · September 18, 2006, 1:52pm

The search file is the example "file 2" from my original post, the output file would be "file 1." File 1 has 2 columns to find in any row of file 2. The output of the script goes to standard out and is all rows from "file 1" whose columns do not have matching values in either column of of any row of file 2. Take a good look at the example in my original post to understand what the script does.
Cheers

stevesmith · September 20, 2006, 9:22am

Greetings folks

I've found a problem with the code, it does everything described in the original thread except it doesn't do it row by row.
For example: if i have the following files

File1
111 222
333 444
555 666

File2
000 999 211 333
111 020 222
444 990 433

The output should be:
333 444
555 666

Actual result is:
555 666

Because it finds the columns 333 and 444 from file1 in 2 different rows of file2. I need the rows from file1 to be compared to match 2 columns in of any 1 row in file2. I've been playing with this for 2 days and am once again stumped.

Thanks in advance!

#!/bin/ksh
# this looks at all the columns in hcs.txt
# prints a line from wmh.txt when no column in
# the wmh.txt line matches any column in
# hcs.txt
awk '
FILENAME=="file2" {
for(i=1;i<=NF;i++ ) {Keys[$i]++}
}
FILENAME=="file1" {
found=0
for(i=1;i<=NF;i++)
{ if(Keys[$i]>0) {found++; }}
if (found != 2 ) {
print $0; found = 0; break;
}
}
' file2 file1

anbu23 · September 20, 2006, 10:45am

try this

while read a b
do
        grep -q ".*$a.*$b.*" file2 
        if [ $? -ne 0 ]
        then
                echo $a $b
        fi
done < file1

stevesmith · September 20, 2006, 10:56am

Thanks, this also looks promising but I'm getting a syntax error for the grep and without it the output is all the lines from file1 and file2. whereas I jsut want the lines from file1.

grep: illegal option -- q
Usage: grep -hblcnsviw pattern file . . .

Cheers

anbu23 · September 20, 2006, 10:59am

instead of this

grep -q ".*$a.*$b.*" file2

use this

grep ".*$a.*$b.*" f2 >/dev/null

stevesmith · September 20, 2006, 11:24am

Close, very close..I made one change to allow for random order of a and b and random number of characters in each column. The only thing I can't figure out is why when it finds this it doesn't print it, everything else seems to work great:

file1
8 1

file2
8
808 322 011

while read a b
do
grep " .*$a.* " 2 |grep " .*$b.*" 2 >/dev/null
if [ $? -ne 0 ]
then
echo $a $b
fi
done < 1

stevesmith · September 20, 2006, 11:41am

Got it!

This works.

grep -w $a 2 |grep -w $b 2 >/dev/null

Thanks very much for your help, I appreciate it!

anbu23 · September 20, 2006, 11:50am

try this

while read a b
do
grep -w $a f2 |grep -w $b  >/dev/null
if [ $? -ne 0 ]
then
echo $a $b
fi
done < f1

stevesmith · September 20, 2006, 11:53am

Ug. Same problem with the array and the grep. If I have the following:

File1:
222 666

File2:
222 111 333
756 666 000

The entry for 222 666 should be printed because it's values are on 2 separate rows of file 2.

Thanks

anbu23 · September 20, 2006, 11:58am

did you try the above code?
it will work.
Are you trying to do that using array and grep?
then show us your code

stevesmith · September 20, 2006, 12:04pm

My bad,

My problem was with this line which had an extra filename in the piped grep:
grep -w $a file2 |grep -w $b file2 >/dev/null

Should be like you said which works perfectly.

grep -w $a file2 |grep -w $b >/dev/null

My file 1 is about 1000 lines and my file 2 is over 1 million. It was taking about 4 hours with the array so I'm betting this will also be much faster as well.

Thanks a heck of a lot!