I need to find all the lines from file 1 that the data in column 1 and 2 do not also exist in a row in anywhere in file 2. The files are a few thousand lines each.
I've looked at awk and nawk but to be honest I'm not sure where to begin with this.
#!/bin/ksh
# this looks at all the columns in searchfile
# prints a line from outputfile when no column in
# the outputfile line matches any column in
# searchfile
awk '
FILENAME=="searchfile" {
for(i=1;i<=NF;i++ ) {Keys[$i]++}
}
FILENAME=="outputfile" {
found=0
for(i=1;i<=NF;i++) { if(Keys[$i]>0) {found=1; break;}}
if (found == 0) {
print $0
}
}
' searchfile outputfile
#!/bin/bash
echo > aux
for i in `cat file1`
do
bool=1
for j in `cat file2`
do
if [ `echo $i` = `echo $j` ]
then
bool=0
fi
done
if [ $bool -eq 1 ]
then
echo $i >> aux
fi
done
can you guys tell me what's the search file and output file here. My guess is that output file is empty before the process and the search file is file2. Am i wrong in my assumption? Let me know.
The search file is the example "file 2" from my original post, the output file would be "file 1." File 1 has 2 columns to find in any row of file 2. The output of the script goes to standard out and is all rows from "file 1" whose columns do not have matching values in either column of of any row of file 2. Take a good look at the example in my original post to understand what the script does.
Cheers
I've found a problem with the code, it does everything described in the original thread except it doesn't do it row by row.
For example: if i have the following files
File1
111 222
333 444
555 666
File2
000 999 211 333
111 020 222
444 990 433
The output should be:
333 444
555 666
Actual result is:
555 666
Because it finds the columns 333 and 444 from file1 in 2 different rows of file2. I need the rows from file1 to be compared to match 2 columns in of any 1 row in file2. I've been playing with this for 2 days and am once again stumped.
Thanks in advance!
#!/bin/ksh
# this looks at all the columns in hcs.txt
# prints a line from wmh.txt when no column in
# the wmh.txt line matches any column in
# hcs.txt
awk '
FILENAME=="file2" {
for(i=1;i<=NF;i++ ) {Keys[$i]++}
}
FILENAME=="file1" {
found=0
for(i=1;i<=NF;i++)
{ if(Keys[$i]>0) {found++; }}
if (found != 2 ) {
print $0; found = 0; break;
}
}
' file2 file1
Thanks, this also looks promising but I'm getting a syntax error for the grep and without it the output is all the lines from file1 and file2. whereas I jsut want the lines from file1.
Close, very close..I made one change to allow for random order of a and b and random number of characters in each column. The only thing I can't figure out is why when it finds this it doesn't print it, everything else seems to work great:
file1
8 1
file2
8
808 322 011
while read a b
do
grep " .*$a.* " 2 |grep " .*$b.*" 2 >/dev/null
if [ $? -ne 0 ]
then
echo $a $b
fi
done < 1
My problem was with this line which had an extra filename in the piped grep:
grep -w $a file2 |grep -w $b file2 >/dev/null
Should be like you said which works perfectly.
grep -w $a file2 |grep -w $b >/dev/null
My file 1 is about 1000 lines and my file 2 is over 1 million. It was taking about 4 hours with the array so I'm betting this will also be much faster as well.