Read a value from one file, search for it in another, and create a third file using AWK

Hi,

I have two files in the formats shown below. I need to read the first field (the value before the comma) from file 1, search file 2 for a record whose "KEY=" field has the same value, and write the complete file 2 record, with the corresponding second field from file 1 appended, to a result file.

File 1:

000000000160191837,00140000637006925269
000000000160191837,00140000637006925270
000000000160191838,00140000637006925271
000000000160191840,00140000637006925272

File 2:

<DATA1><#KEY=000000000160191837><DATA2>
<DATA3><#KEY=000000000160191837><DATA4>
<DATA5><#KEY=000000000160191838><DATA6>
<DATA6><#KEY=000000000160191840><DATA8>

Result File:

<DATA1><#KEY=000000000160191837><DATA2><RESULT>00140000637006925269
<DATA3><#KEY=000000000160191837><DATA4><RESULT>00140000637006925270
<DATA5><#KEY=000000000160191838><DATA6><RESULT>00140000637006925271
<DATA6><#KEY=000000000160191840><DATA8><RESULT>00140000637006925272

I wrote an awk command for this, but my code doesn't handle duplicate records. Please look at the first two records of File 1 in the example above: field 1 is the same but field 2 is different. Likewise, File 2 will contain two entries with exactly the same KEY value, and I need to assign them different values.

My code:

awk '{
  if (FNR==NR) {
    FS=","               # NB: takes effect from the next record, not this one
    sample_array[$1]=$2
    next
  }
  FS="KEY="
  x=index($2,">")
  sample_num=substr($2,1,x-1)
  if (sample_num in sample_array)
    print $0 "<RESULT>" sample_array[sample_num]
}' file1 file2 > result_file

Thanks in advance!

Try setting FS per file on the command line; assigning FS inside an action only takes effect from the next record onward:

nawk -f king.awk FS=, file1 FS='(#KEY=|>)' file2

king.awk:

FNR==NR{f1[$1];next}
$3 in f1 {out[$3]=($3 in out)?$0:out[$3] $0}
END {
  for (i in out)
    print out[i]
}

Thanks for the quick response!

The code is kind of suppressing duplicates, and it's not giving the corresponding field 2 of file 1 in the output. I need all records in the output, with different field 2 values for the duplicates, as I showed in the example.

Just asking: does this require a multi-dimensional array to store the different values for duplicates? Not sure, as I'm not good at using multi-dimensional arrays.
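(For what it's worth, you don't strictly need one: awk has no true multi-dimensional arrays anyway. a[i,j] is really a flat array subscripted with the string "i" SUBSEP "j", where SUBSEP defaults to the unprintable "\034". The same trick can queue several values under a single key; a minimal sketch with made-up key and values:

BEGIN {
   q["key1"] = "v1"
   q["key1"] = q["key1"] SUBSEP "v2"   # append a second value under the same key
   n = split(q["key1"], a, SUBSEP)     # a[1]="v1", a[2]="v2"
   for (i = 1; i <= n; i++)
      print i, a[i]
}

This is exactly the mechanism the king.awk solutions below rely on.)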

sorry - misread the data samples.

Assuming there are equal numbers of the same 'keys' in file1 and file2.

king.awk:

FNR==NR{f1[$1]=($1 in f1)? f1[$1] SUBSEP $2 : $2; next}   # queue file1 values per key
$3 in f1 {
   n=split(f1[$3], a, SUBSEP)
   delete f1[$3]
   printf("%s<RESULT>%s\n", $0, a[1])    # serve the first queued value
   for(i=2;i<=n;i++)                     # rebuild the queue without it
      f1[$3]=(i==2)? a[i] : f1[$3] SUBSEP a[i]
}
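(For the record, running this against the sample files from the first post,

nawk -f king.awk FS=, file1 FS='(#KEY=|>)' file2

reproduces the Result File exactly: with FS='(#KEY=|>)' the key lands in $3, and each repeated key in file2 consumes the next queued file1 value in order.)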

Perfect!! Thanks a lot!!!
It works great!! I never thought of it from that angle.

I added one more part; please check and let me know if I did it right.
If there is no match in file 1 for a record in file 2, then I need to take the first 11 digits of any value, append zeros to it, and output the record with that.

It was working fine before, but now it's not working, and I'm not sure where I went wrong.

Addition:

FNR==NR{f1[$1]=($1 in f1)? f1[$1] SUBSEP $2 : $2;next}
$3 in f1 {
n=split(f1[$3], a, SUBSEP)
delete f1[$3]
printf("%s<RESULT>%s\n", $0, a[1])
for(i=2;i<=n;i++)
f1[$3]=(i==2)?a[i]:f1[$3] SUBSEP a[i]
next}
for ( temp in f1) {
tmp_value=substr(f1[temp],1,11)
print $0 "<RESULT>" tmp_value "000000000"
}

FNR==NR{f1[$1]=($1 in f1)? f1[$1] SUBSEP $2 : $2; next}
$3 in f1 {
   n=split(f1[$3], a, SUBSEP)
   delete f1[$3]
   printf("%s<RESULT>%s\n", $0, a[1])
   for(i=2;i<=n;i++)
      f1[$3]=(i==2)? a[i] : f1[$3] SUBSEP a[i]
   next
}
{
   # no match: borrow the first 11 digits of any value still left in f1
   for( i in f1) {
      print $0 "<RESULT>" substr(f1[i], 1, 11) "000000000"
      break
   }
}

Thanks!! You are the best!!
BTW, thanks for calling the awk code king.awk :cool:

This is not giving the desired results if the missing record is the last one in file 2. I figured it out: since we delete the array element every time, by the time we reach the last record we have deleted all the array elements, so there is nothing left to build the default from and the last record isn't printed.

I changed the code a little and it's working fine now.

FNR==NR{f1[$1]=($1 in f1)? f1[$1] SUBSEP $2 : $2; default_num=$2; next}   # default_num remembers a file1 value for unmatched records
$3 in f1 {
   n=split(f1[$3], a, SUBSEP)
   delete f1[$3]
   printf("%s<RESULT>%s\n", $0, a[1])
   for(i=2;i<=n;i++)
      f1[$3]=(i==2)? a[i] : f1[$3] SUBSEP a[i]
   next
}
{
   print $0 "<RESULT>" substr(default_num, 1, 11) "000000000"
}

This is my first post on this forum and I'm really astonished by the quality and speed of the responses.

Hope the Perl script below can help you some (file 1 is embedded after __DATA__; file 2 is read from a.txt).

# build one queue per key, using the key string itself as a (symbolic)
# array name; this works because the script doesn't "use strict"
while(<DATA>){
	my @tmp=split(",",$_);
	push @{$tmp[0]}, $tmp[1];     # $tmp[1] keeps its trailing newline
}
open $fh,"<","a.txt" or die $!;   # a.txt is file 2
while(<$fh>){
	chomp;
	if(/KEY=([0-9]+)/){
		my $tmp=shift @{$1};          # take the next queued value for this key
		print $_,"<RESULT>",$tmp;     # the newline travels with $tmp
	}
}
__DATA__
000000000160191837,00140000637006925269
000000000160191837,00140000637006925270
000000000160191838,00140000637006925271
000000000160191840,00140000637006925272

Or better yet, to take care of a mismatching NUMBER of keys in either one of the files:

FNR==NR{f1[$1]=($1 in f1)? f1[$1] SUBSEP $2 : $2; default_num=$2; next}
$3 in f1 {
   n=split(f1[$3], a, SUBSEP)
   printf("%s<RESULT>%s\n", $0, a[1])
   if (n==1) next                  # single value: keep it, skip the rebuild
   delete f1[$3]
   for(i=2;i<=n;i++)
      f1[$3]=(i==2)? a[i] : f1[$3] SUBSEP a[i]
   next
}
{
   print $0 "<RESULT>" substr(default_num, 1, 11) "000000000"
}

Thanks, Cherry! I don't know Perl, so this is kind of out of scope for me, but I hear Perl is very fast. I'll have to learn it in the future.

The awk code below runs fine for a small number of records, but now I'm running it on 200K records and it's taking a lot of time: it's been 25 minutes and it has written just 50 records to the output file. Not sure how much more time it will take to complete.

Is there anything wrong with the code that is making it run this long?
Generally awk is very fast, right?

Actually, this code is already available in C++, and I'm trying to rewrite it in awk because of performance issues, since awk is faster. :confused:

awk -f king.awk FS=, file1 FS='#KEY=' file2

 
FNR==NR{f1[$1]=($1 in f1)? f1[$1] SUBSEP $2 : $2; default_num=$2; next}
{
   x=index($2,">")
   key=substr($2,1,x-1)
}
key in f1 {
   n=split(f1[key], a, SUBSEP)
   delete f1[key]
   printf("%s<RESULT>%s\n", $0, a[1])
   for(i=2;i<=n;i++)
      f1[key]=(i==2)? a[i] : f1[key] SUBSEP a[i]
   next
}
{
   print $0 "<RESULT>" substr(default_num, 1, 11) "000000000"
}


My bad... I used test data that has only 7 unique records, with everything else being duplicates of those, which will not happen in the real world. So I'm good.

For 200K unique records with 1 duplicate each, it ran in ~3 minutes.

Thanks for all your support!

I'm not sure why you changed the invocation:

nawk -f king.awk FS=, file1 FS='(#KEY=|>)' file2

TO

awk -f king.awk FS=, file1 FS='#KEY=' file2

and do the 'index/substr' work for every record/line in file2; that's definitely adding time to the execution.
Also, if you take my last version, it should be a little faster, since I don't rebuild the array when it holds just 1 entry (probably the majority of your records in file2).
You could probably think of a different implementation that doesn't require rebuilding the array altogether. This is left as an exercise for the OP :wink:
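(One possible direction, as a rough sketch rather than the thread's code; it assumes the original FS='(#KEY=|>)' layout where the key is $3, so swap in your own key extraction if #KEY can move around. Store each file1 value under (key, n) and keep a per-key cursor, so serving a duplicate is one increment instead of a string rebuild:

FNR==NR { val[$1, ++cnt[$1]] = $2; default_num = $2; next }
$3 in cnt {
   if (ptr[$3] < cnt[$3]) ++ptr[$3]   # advance the cursor; reuse the last value if file2 has extras
   print $0 "<RESULT>" val[$3, ptr[$3]]
   next
}
{ print $0 "<RESULT>" substr(default_num, 1, 11) "000000000" }

Invoked the same way: nawk -f king.awk FS=, file1 FS='(#KEY=|>)' file2)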

Thanks for giving me a new direction!

I changed the invocation because the <#KEY> field can be anywhere in the record; it's not always in the second position. Sorry, I did not mention that in my first post.

Yes, I saw your last code but forgot to include that in mine; I've included it now (skipping the array rebuild when there is only one entry).

After you said that, I thought of a different implementation, and here it is... this one is even faster:

 
FNR==NR{f1[$1]=($1 in f1)? f1[$1] SUBSEP $2 : $2; default_num=$2; next}
{
   x=index($2,">")
   key=substr($2,1,x-1)
}
key in f1 {
   n=split(f1[key], a, SUBSEP)
   printf("%s<RESULT>%s\n", $0, a[1])
   if (n==1) { next }
   y=index(f1[key],SUBSEP)           # drop the served value in place: no rebuild loop
   f1[key]=substr(f1[key],y+1)
   next
}
{
   print $0 "<RESULT>" substr(default_num, 1, 11) "000000000"
}
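(The invocation stays the same: awk -f king.awk FS=, file1 FS='#KEY=' file2 > result_file, assuming you still redirect to a result file as in the first post. Dropping the served value with index/substr replaces the element-by-element rebuild loop with a single substring operation per duplicate, which is where the speedup comes from.)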