I know that a "brute-force" script with lots of cat/echo/cut/grep could accomplish this. But because my real file has 800k records and the lookup files have 10-20k records each, that approach is not feasible time-wise or efficient.
I have an input file:
> cat file_in
1234567890123456789012345678901234567890
Joe 123456 30 Main St 1234 F
Jim 101362 1492 Hugh 0101 P
Kerry 040419 6091 Lost St 0101 F
Linda 123456 50 High Way 1235
Matt 242424 48 Speedway Dr4343 F
Kerrin180118 99 Skaters Way2012 P *
(You can ignore the first line; it is just a column ruler to help read the fixed-width records.)
(tail -n +2 file_in skips over this line during testing.)
I begin by reviewing only the records where position 40 is blank, i.e. those that still need to be processed.
I want to flag the records that cannot be processed because (a) the value in columns 7-12 does not exist in the following file:
> cat file_cd1
040419
101362
180118
242424
789012
967539
988012
I know Joe does not match, so ideally I would like to put a "1" in position 39 to tell me that record failed the first test.
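One way to sketch test (a) on its own (assuming the records really are padded to 40 fixed-width characters; the sample record below is hypothetical, padded per the stated layout) is to load file_cd1 into an awk array and rewrite position 39 when the id at columns 7-12 is missing:

```shell
# Abbreviated hypothetical lookup file and one padded 40-char record.
printf '%s\n' '040419' '101362' > file_cd1
printf '%s\n' 'Joe   123456 30 Main St     1234 F      ' > file_in

awk '
NR == FNR { cd1[$1]; next }                    # first file: valid ids
substr($0, 40, 1) == " " && !(substr($0, 7, 6) in cd1) {
    $0 = substr($0, 1, 38) "1" substr($0, 40)  # flag failed test (a)
}
{ print }
' file_cd1 file_in > file_out
```

The NR==FNR trick marks lines from the first file only, so the lookup table is built before any data records are read.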
A second test (b) is to process only the records whose columns 29-32 look up to "abc" in the following file:
> cat file_cd2
0101 abc
1234 abc
1235 ghi
2012 ghi
4343 ghi
9012 abc
Linda & Matt should then have a "2" put in position 39.
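Test (b) is the same pattern with a condition on the second field of the lookup file (again a sketch with a hypothetical padded record):

```shell
# Abbreviated hypothetical code table and one padded 40-char record.
printf '%s\n' '1235 ghi' '0101 abc' > file_cd2
printf '%s\n' 'Linda 101362 50 High Way    1235        ' > file_in

awk '
NR == FNR { if ($2 == "abc") ok[$1]; next }    # keep only codes mapping to abc
substr($0, 40, 1) == " " && !(substr($0, 29, 4) in ok) {
    $0 = substr($0, 1, 38) "2" substr($0, 40)  # flag failed test (b)
}
{ print }
' file_cd2 file_in > file_out
```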
So, my start would be
awk 'substr($0,40,1)==" " {print}' file_in >file_out
which creates an output file containing only the records I want to consider, i.e. those not yet marked as processed. So yes, I intend to start with 6 records and produce a file of 5. I now need to add those two codes at position 39 where appropriate.
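The filter and both lookups could be folded into one awk pass over file_in, avoiding the intermediate file entirely. This is a sketch under the stated layout (name 1-6, id 7-12, code 29-32, fail flag 39, processed marker 40); the sample records are hypothetical, padded to 40 characters, since the paste above appears to have lost the trailing spaces:

```shell
cat > file_cd1 <<'EOF'
040419
101362
180118
242424
789012
967539
988012
EOF

cat > file_cd2 <<'EOF'
0101 abc
1234 abc
1235 ghi
2012 ghi
4343 ghi
9012 abc
EOF

# Four hypothetical records padded to 40 characters (last one already processed).
printf '%s\n' \
  'Joe   123456 30 Main St     1234 F      ' \
  'Matt  242424 48 Speedway Dr 4343 F      ' \
  'Kerry 040419 6091 Lost St   0101 F      ' \
  'Kerrin180118 99 Skaters Way 2012 P     *' \
  > file_in

awk '
FILENAME == "file_cd1" { cd1[$1]; next }            # valid ids
FILENAME == "file_cd2" { if ($2 == "abc") ok[$1]; next }  # codes mapping to abc
substr($0, 40, 1) == " " {                          # unprocessed records only
    id   = substr($0, 7, 6)
    code = substr($0, 29, 4)
    flag = " "
    if (!(id in cd1))       flag = "1"              # failed test (a)
    else if (!(code in ok)) flag = "2"              # failed test (b)
    $0 = substr($0, 1, 38) flag substr($0, 40)      # write flag at position 39
}
{ print }
' file_cd1 file_cd2 file_in > file_out
```

Every record is printed, changed or not, so file_out keeps the original record count; processed records (non-blank position 40) pass through untouched. Testing by FILENAME rather than NR==FNR keeps the dispatch readable with two lookup files.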