Looping over a file to count common fields from another file

fadista · June 28, 2012, 4:29am

Hi,

I would like to know how can I get the number of rows in file1 that:

the 1st and 2nd field should be the same (text)
the 3rd field should be less or equal (numeric)
when comparing to file2.

So for each row of file1, I would like to have the number of rows in file2 that follow the above 2 rules. This is just a small example. In reality, my files have millions of rows and more columns (fields, tab separated)

Example of file1

A AB 3.7
B AB 2.5

Example of file2

A AB 3.5
A AB 3.7

Desired output file

A AB 3.7 2
B AB 2.5 0

Many thanks!

pravin27 · June 28, 2012, 4:36am

Could this help you ?

awk 'NR==FNR{a[$1$2]++;next} {if (a[$1$2]){print $0,a[$1$2]}else {print $0,"0"}}' File2 File1

Scrutinizer · June 28, 2012, 5:03am

It would we preferable to use some form of field separation to prevent unexpected mixing of the the two index fields.

awk -F '\t' '
  NR==FNR{
    A[$1 OFS $2]=$3
    C[$1 OFS $2]=0
    next
  } 
  $3<=A[$1 OFS $2]{
    C[$1 OFS $2]++
  } 
  END{
    for(i in A)print i,A,C
  }
' file1 file2

elixir_sinari · June 28, 2012, 5:05am

awk 'BEGIN{SUBSEP=FS} FNR==NR{a[$1,$2]=$3;next} {
for(i in a){
if($1SUBSEP$2==i) {
if($3<=a) {
b++;break}}}
} END{for(i in a) printf("%s %s %d\n",i,a,b)}' file1 file2

krsnadasa · June 28, 2012, 7:44am

Hi Scrut,

Can you please explain that in for loop what is the value of variable i?

As there are two array:

A[A AB]=3.7 etc
C[A AB]=0
Now here index is field 1 and 2.
 
When printing in for loop how system decide the value of var i whether it should be variable or string [A AB].

Please explain?

Scrutinizer · June 28, 2012, 8:13am

Hi, the content of the i variable in the loop is determined by for(i in A) , which enumerates the indexes... I am not sure if I understood your question correctly...