Match strings in two files and compare columns of both

GarciasMuffin · August 26, 2009, 6:42am

Good Morning,

I was wondering if anybody could tell me how to achieve the following, preferably with a little commenting for understanding.

I have 2 files, each with multiple rows with multiple columns.

I need to find each row where the value in column 1 of file 1 matches column 1 in file 2. I then need to compare the 4th column from the rows and if the value in file 2 is greater than in file 1 then output to a third file the matching column 1 value, the 2nd column value,the 4th column value in file 1 and file 2 and the difference between them.

An example should make the above clearer.

File 1

123 john blah 2
456 tony blah 7
789 michelle blah 9
111 james blah 3

File 2

135 gary blah 6
456 tony blah 13
789 michelle blah 4
111 james blah 19

So the below should be output to file 3:

456 tony 7 13 6
111 james 3 19 16

Many thanks in advance for any answers

malcomex999 · August 26, 2009, 7:06am

Try...

 
awk 'NR==FNR{arr[$1]=$2","$4}NR!=FNR{for(i in arr){split(arr,ss,",");
if(i==$1 && ss[2]<$4) print i,ss[1],ss[2],$4,$4-ss[2]}}' file1 file2

GarciasMuffin · August 26, 2009, 7:24am

Worked 100% perfectly.

You're an absolute star mate

Not sure how the bits award thing works, but did it (hopefully correctly).

sagar_evc · August 26, 2009, 7:49am

awk 'NR==FNR{arr[$1]=$2","$4}NR!=FNR{for(i in arr){split(arr,ss,",");
if(i==$1 && ss[2]<$4) print i,ss[1],ss[2],$4,$4-ss[2]}}' file1 file2

I was trying to understand how this command works...

I get this error when I try to run this command. Please help debug this..

awk: arr is not an array
record number 1

malcomex999 · August 26, 2009, 7:58am

it is not giving me that kind of error when i run that code, may be you better try with nawk or gawk inplace of awk.

sagar_evc · August 26, 2009, 8:01am

Wow! nawk works for me.

jackygrahamez · February 7, 2010, 5:36pm

This example works, I have a little more complicated comparison opperation. I have an index of IDs. These IDs are part of several substrings of a file. I want to see which IDs are in the file and which ones are not. Here are a few lines.

23873_ChemDiv_000A-0001
AVtclcactv01291016372D 0 0.00000 0.00000 1

31 34 0 0 0 0 0 0 0 0999 V2000

The ID is the first 5 characters of this string 23873_ChemDiv_000A-0001

malcomex999 · February 8, 2010, 1:11am

can you give more data from your input files and show the desired output too?

jackygrahamez · February 8, 2010, 10:24am

Its so long. I put a couple records in a pastebin file. Each record represents a chemical structure
pastebin - collaborative debugging tool

The index would look like this

23874
12345
23876
23456

We want to to print the index values that are present in the first file

23874
23876

And the values that are in the index but not the index file
12345
23456

basso · February 8, 2010, 4:25pm

I'm working on a similar problem. Perhaps it's okay to piggyback on the thread?

I've two files:

First file:
data1
data2
...
string1:string2:string3

and the second:

string2:string3:n1:n2

I'd like each input line from file1 to appear in the output, and if string2:string3 appears in file2, append :n1:n2 to the matching
line in file1.

(My awk is really rusty...)

Thanks for any pointers you can provide!

malcomex999 · February 9, 2010, 4:54am

Assuming the serial numbers are not in the file as it is in the pastebin...
This prints the index values that are present in the file...

awk 'NR==FNR{if($0=="^[0-9]*_ChemDiv_*");
split($0,arr,"_");
a[arr[1]]=arr[1];next}
$0 in a{print $0}
' file index

This prints the index values that are not in the file...

awk 'NR==FNR{if($0=="^[0-9]*_ChemDiv_*");
split($0,arr,"_");
a[arr[1]]=arr[1];next}
$0 in a{next}1
' file index