comparing two fields from two different files in AWK

amarn · October 8, 2011, 5:58pm

Hi,

I have two files formatted as following:

File 1: (user_num_ID , realID) (the NR here is 41671)

1  cust_034_60
2  cust_80_91
3  cust_406_4

..
..

File 2: (realID , clusterNumber) (total NR here is 1000)

cust_034_60  2
cust_406_4   3
..
..

and i want to compare these two files based on File 1 $2 field and File 2 $1 field and get a resulting file like

File 3: (user_num_ID, realID, clusterNumber)

1 cust_034_60 2
3 cust_406_4 3
..
..
..

just to summarize:
Create a new file3
Find where file2 ($1) is equal to file1($2)
and print to file3 file1($1) and file2($1,$2) (as a single record)

since im new in awk, i've tried several suggestions from previous threads but it seems that im doing something wrong...

the latest one I've tried:

awk > file3 'NR==FNR{ _[$2]=$1 next}{print $0, _[$1,$2] }' file1 file2

surely im doing something wrong

Thanks in advance

radoulov · October 9, 2011, 5:09am

awk > file3 'NR == FNR {
  f2[$1] = $2; next 
  }
$2 in f2 {
  print $0, f2[$2]
  }' file2 file1

amarn · October 9, 2011, 10:33am

Hi radoulov,

Thank you for your quick response. I've tried the script you suggested and it seems that i get an empty file3...is there an issue on whether i'm using awk under cygwin? I'm sure that i have types exactly what you suggested.

Thanks again

ps: i've tried using grep -f by just comparing the real_custID of file2 with file1 and it worked fine..but i'm curious on how this can be done using awk.

radoulov · October 9, 2011, 10:43am

No, this is not a Cygwin issue.

Consider the following (rigth now, I'm on Cygwin too):

This is the content of the two input files: file1 and file2:

% head file[12]
==> file1 <==
1  cust_034_60
2  cust_80_91
3  cust_406_4

==> file2 <==
cust_034_60  2
cust_406_4   3

This is what I get when I run the awk command:

% awk 'NR == FNR {
  f2[$1] = $2; next
  }
$2 in f2 {
  print $0, f2[$2]
  }' file2 file1
1  cust_034_60 2
3  cust_406_4 3

To debug further, try dumping the content of the array f2 and the content of file1:

% awk 'NR == FNR {
  f2[$1] = $2; next
  }
FNR == 1 {
  for (F in f2)
    printf "|%s| --> |%s|\n", f2[F], F
  }
{
  printf "|%s|\n", $2
  }' file2 file1
|2| --> |cust_034_60|
|3| --> |cust_406_4|
|cust_034_60|
|cust_80_91|
|cust_406_4|

Your output should be different than mine.

amarn · October 10, 2011, 8:07am

It worked perfectly well now:)

Thank you again radoulov, all the best!

Cybex2011 · December 7, 2011, 2:24pm

Hi, all:

I roughly understand that this code first reads in a smaller file(file_2) and stores every line into an array f2.

Next, when it reads in the other bigger one, file_1, line by line, it checks each line's 2nd field against the array f2. If there is a match, it prints out the current line in process and an element from the array f2.

awk > file3 'NR == FNR {
  f2[$1] = $2; next 
  }
$2 in f2 {
  print $0, f2[$2]
  }' file_2 file_1

#==> file_2 <==
#cust_034_60 2
#cust_406_4 3
#
#==> file_1 <==
#1 cust_034_60
#2 cust_80_91
#3 cust_406_4
#
#==> OUTPUT:
#1 cust_034_60 2
#3 cust_406_4 3

But I have some difficulty to figure out:

how file_2's lines being stored into f2?

Use the 1st line of file_2 as an example, "fcust_034_60 2"
==> f2[$1]=$2

It looks to me, the code only put the 2nd field, which is the number 2, into the array's position $1, not the entire current line goes in there.

how f2's element being retrieved by the code
==> f2[$2]?

If it had just put a field in the 1st step,
how could the code use "fcust_034_60" now read in from file_1 to search against the array f2?

To me, "fcust_034_60" is even not in f2.
And in the previous step, seems to me, the code "f2[$1]=$2", just put $2 from the line in the $1 position of f2. How come when retrieved, it is from the position $2 of f2?

I am new here and awk is new to me too.
The book I have at hand is talking about using awk to process one single file only.

Using awk to process 2 or more files at a time seems will take me some get used to it.

Thanks to the other person here on the other page clarified the NR/FNR issue very well,
Now I have a vague impression about how awk uses NR/FNR to handle 2 files's comparison at a time.

Hopefully someone here will give me some hints to my questions above too. Thank you in advance.

Regards,

tukuyomi · December 7, 2011, 3:21pm

In this example: f2['cust_034_60']=2
f2 is the array, and yes, 'cust_034_60' is an array index
(not sure where you got fcust_034_60 however...)

See my previous statement

radoulov · December 7, 2011, 3:42pm

Hi Cybex2011,
welcome to unix.com!

We associate the value of the first field (this is the key, the pseudo-index) with the value of the second one
(the value, the array element).
This is actually the entire line (because in this case file_2 contains only two fields).
Note that what you see in the output is the content of the entire line from file_1
and only the second column from file_2:

==> OUTPUT:
1 cust_034_60       2
|__ file_1___|  |_file_2_|
3 cust_406_4        3
|__ file_1___|  |_file_2_|

awk's arrays are associative: consider them key-value pairs.
"Array position $1" actually means that the value of the second field $2
is associated with the key - the value of the first field:

cust_034_60       2
|___key___|   |_value_|

Check this page for more detailed info.

The value is retrieved by comparing the value of the second field from file_1
with the keys present in the associative array f2
(in this case: cust_034_60 and cust_406_4).

Hope this helps.

Cybex2011 · December 8, 2011, 2:25am

tukuyomi & radoulov,
Many thanks to you both.
///////

f2[$1]=$2 and f2[$2] bogged me down for a whole afternoon.
I was confused by them. Thanks to your help, it is clear to me now.

In this block

$2 in f2 {
  print $0, f2[$2]
  }'

I know now $2 refers to the 2nd column of file_1,
and file_2 at this stage is already a thing of the past.

One line back, in this statement

f2[$1]=$2;

$1 refers to the 1st column of file_2.

These 2 colums have some elements in common.
In that case, the related lines will be printed out.

I also revised radoulov's original code as follows.
I feel it's more readable for a prior c/c++ person like me.
It seems I don't need 'next' in here any more.
The output is the same.
Correct me please if there is any error in my revision. Thank you both again.

awk  '{if(NR==FNR){#now read file_2 into f2
       f2[$1]=$2} 
      else #now read in file_1, because NR!=FNR
      {
         if($2 in f2){print $0, f2[$2]}}
      }' file_2 file_1

radoulov · December 8, 2011, 4:13am

Hi Cybex2011,
your revision/code rewrite is correct.
In my version the next statement is needed because in the second block
I don't check the NR == FNR condition:

NR == FNR {
  ... some code ...
  next                    # jump directly to the next input line
                          # so the actions in the following blocks
                          #  don't execute when NR == FNR
  }
... here we're sure NR != FNR ...

Cybex2011 · December 9, 2011, 2:01pm

Hi, radoulov:
Using your code as a foundation, making some remodeling, I got my solutions to solve my own problem.

I intended to interleave and merge my own 2 data files like below.
I realized they are not in array, so I cannot take any advantage of the pseudo index feature of awk.
But I took full advantage of its NR/FNR feature to build a traditional array and get my work done.

My code works for me pretty well except there is tiny imperfection at the very beginning of the output. I can manually fix it in a snap. Its not for commercial use, just for my own hobby, so it's good enough for me for now.

I need to thank you again. Without your support, I couldn't have finished what I thought of so quick. I wish you have a happy holiday.
regards,

== file_E

1
00:00:01,400 --> 00:00:10,300
In the air war of Vietnam, one day stands out among all the rest.
2
00:00:10,300 --> 00:00:21,666
May 10, 1972, the full fury of American air power is unleashed on North Vietnam.
3
00:00:22,600 --> 00:00:30,433
More Vietnamese MiGs are shot down on this day than on any other day of the war.

== file_C

1
00:00:01,400 --> 00:00:10,300

2
00:00:10,300 --> 00:00:21,666
1972510
3
00:00:22,600 --> 00:00:30,433

== awk code

awk > "SRT" '{if(NR==FNR && length($0)> 0)
{#now read file_E into f2
f2[FNR]=$0} 
 
else if(length($0)> 0)
{#now read in file_C, because NR!=FNR
{if(FNR==1){end=NR-1}}
{for (i=1; i <= end; i++)
if( $0==f2){print $0; next}
}
# {printf "|%d| --> |%s|\n", FNR, f2[FNR]}
printf "%s\n",f2[FNR];
printf "%s\n\n",$0
}
}' file_E file_C #

== output:

1
00:00:01,400 --> 00:00:10,300
In the air war of Vietnam, one day stands out among all the rest.

2
00:00:10,300 --> 00:00:21,666
May 10, 1972, the full fury of American air power is unleashed on North Vietnam.
1972510
3
00:00:22,600 --> 00:00:30,433
More Vietnamese MiGs are shot down on this day than on any other day of the war.

p.s.
The file is the subtitles of a History Channel program Dogfights. The 2nd data file is in Chinese. Chinese speaking people are not necessary from China. In Asia, there are 3 countries use Chinese as their official language.

radoulov · December 9, 2011, 2:24pm

Glad you enjoy awk!
If the input format is constant,
this should work too:

awk 'FNR == 1 {
  sub(/^\xef\xbb\xbf/, x)
  }
/-->/ { 
  idx = $0 
  $0 in d && $0 = $0 RS d[$0]
  }
  NR == FNR {
    !/^[0-9]+$/ && d[idx] = $0
    next
    }
  1' file_E file_C

Happy Holidays!