Match and Grep the nearest value in last field

jiam912 · June 22, 2015, 4:06pm

Gents

I have this input file
file1 (uniq records)

54503207851 170211240
54503207911 170210837
54503208111 170215105
54503208112 170215210
54655210011 170223140
54655210091 170223738
54655210172 170224355
54655210251 170224741
54655210331 170225039
54655210411 170225505
54655210492 170230224
54655210571 170230632
54655210651 170231004
54655210731 170231601

file2 (duplicate records)
54503207851  1     1         11       2600      54503207851 170211240
54503207911  1     1          7       2600      54503207911 170210837
54503208111  1     7    1301482       2600      54503208111 170215105
54503208112  2     1          8       2600      54503208112 170215210
54655210011  1     1          1       2597      54655210011 170223140
54655210091  1     1          4       2597      54655210091 170223738
54655210172  2     1          2       2597      54655210172 170224355
54655210251  1     1          2       2597      54655210251 170224748 
54655210251  1     1          2       2597      54655210251 170224742
54655210331  1     1          1       2597      54655210331 170225040
54655210411  1     1          1       2597      54655210411 170225505
54655210492  2     1         46       2597      54655210492 170230024 
54655210492  2     1          1       2597      54655210492 170230225
54655210571  1     1          3       2597      54655210571 170230632
54655210651  1     1          1       2597      54655210651 170231004
54655210731  1     1          2       2597      54655210731 170231601

desired output

54503207851 170211240 1     1         11       2600      54503207851 170211240
54503207911 170210837 1     1          7       2600      54503207911 170210837
54503208111 170215105 1     7    1301482       2600      54503208111 170215105
54503208112 170215210 2     1          8       2600      54503208112 170215210
54655210011 170223140 1     1          1       2597      54655210011 170223140
54655210091 170223738 1     1          4       2597      54655210091 170223738
54655210172 170224355 2     1          2       2597      54655210172 170224355
54655210251 170224741 1     1          2       2597      54655210251 170224742
54655210331 170225039 1     1          1       2597      54655210331 170225040
54655210411 170225505 1     1          1       2597      54655210411 170225505
54655210492 170230224 2     1          1       2597      54655210492 170230225
54655210571 170230632 1     1          3       2597      54655210571 170230632
54655210651 170231004 1     1          1       2597      54655210651 170231004
54655210731 170231601 1     1          2       2597      54655210731 170231601

i am using the following code to get the output

awk '{x=substr($0,1,22)} FNR==NR{a[substr($0,1,11)]=$0;next} {print x, a[substr($0,1,11)]?  OFS a[substr($0,1,11)] : OFS "9999" }' file2 file1

The point is to match the column 1 in both files but to grep from file2 the nearest value comparing columns 2 in file1 and 7 in file2..

With my code I got any of them,, the purpose is to get the close value matching column 1 and comparing last column.

Thanks for your help

RudiC · June 22, 2015, 4:41pm

Nearest to what? Sorry to be that thick.

jiam912 · June 22, 2015, 4:47pm

Hi RudiC
I say close value to column 2 in file1 comparing with value column 7 in file2 the last 6 dig correspond to time, Then as the duplicate value in column 1 file2 has 2 different time i would like to Keep the time which is more close to time in file 1.

Hope u can help me.

Chubler_XL · June 22, 2015, 5:39pm

You could try this (assumption is that file2 is sorted, as in your demo data):

awk '
FNR==NR{a[$1]=$2;next}
key && key != $1 { best=key=x; print ln }
$1 in a {
  if(a[$1]<$7) diff=$7-a[$1]
     else diff=a[$1]-$7
  if(!diff) {
     print
     key=best=x
     delete a[$1]
  } else if(!best || diff<best) {
     key=$1
     best=diff
     ln=$0
  }
}
END { if(key) print ln } ' file1 file2

jiam912 · June 23, 2015, 9:28am

Hi Xl
Thanks for your answer I will check it and I let you know

---------- Post updated at 08:28 AM ---------- Previous update was at 03:04 AM ----------

Hi XL

it works perfect, thanks a lot

but there is something that is not correct, per example if in the file2 there is 2 duplicate values and one of them have the same time of file1, so the script should keep the time like file1..

other thing please, how to ajust the code if we have also duplicate values in file1 column 1..

54503207851 170211240
54503207911 170210837
54503208111 170215105
54503208112 170215210
54655210011 170223140
54655210091 170223738
54655210172 170224355
54655210251 170224741
54655210251 170224741
54655210331 170225039
54655210411 170225505
54655210492 170230224
54655210492 170230224
54655210571 170230632
54655210651 170231004
54655210731 170231601

,

Then the time from file2 should adapt to the 2 duplicate values in file1.

Regards...

RudiC · June 23, 2015, 10:44am

Your desired output from post#1 is matched neither by your own code snippet's result nor by Chubler_XL's. To introduce file1's second column as column 2 in the output, try

awk '
FNR==NR         {a[$1]=$2
                 next
                }
key != $1 &&
key             {best=key=x
                 print ln
                }

$1 in a         {T=$1
                 if ((diff=$7-a[T]) < 0) diff=-diff
                 sub ($1, $1 OFS a[$1])

                 if (!diff)     {print
                                 key=best=x
#                                       delete a[$1]
                                }
                 else if(!best || diff<best)    {key=$1
                                                 best=diff
                                                 ln=$0
                                                }
                }
END             {if (key) print ln }
' file1 file2
54503207851 170211240  1     1         11       2600      54503207851 170211240
54503207911 170210837  1     1          7       2600      54503207911 170210837
54503208111 170215105  1     7    1301482       2600      54503208111 170215105
54503208112 170215210  2     1          8       2600      54503208112 170215210
54655210011 170223140  1     1          1       2597      54655210011 170223140
54655210091 170223738  1     1          4       2597      54655210091 170223738
54655210172 170224355  2     1          2       2597      54655210172 170224355
54655210251 170224741  1     1          2       2597      54655210251 170224742
54655210331 170225039  1     1          1       2597      54655210331 170225040

(Chubler_XL's code adapted)

To your last question: If in file1 there's exact duplicates, nothing will happen except that a[$1] will be overwritten with the identical value. If the values are not identical, the last one will apply.

jiam912 · June 23, 2015, 10:56am

Dear Rudi..

Thanks a lot

Just i was going to say something else

I get something like this:
file1

54655210492 170230224

file2

54655210492  2     1          1       2597 54655210492170230225  170230225
54655210492  2     1         46       2597 54655210492170230024  170230024

output

54655210492  2     1          1       2597 54655210492170230225  170230225

output desired

54655210492  2     1         46       2597 54655210492170230024  170230024

I will try your script, and check if give me the same result..

Please try to help me

---------- Post updated at 09:56 AM ---------- Previous update was at 09:47 AM ----------

Dear RudiC,

Just I try your code and it gives the same output.

Please try to help me to solve it..

As I say previously it needs to grep the rows where the time is exactly or the more closets.

RudiC · June 23, 2015, 12:04pm

From my point of view, 170230224 is closer to 170230225 than to 170230024. So - where's the problem?

jiam912 · June 23, 2015, 2:08pm

Dear RudiC
Sorry it is my mistake.
I am getting crazy
Works Fine

jiam912 · June 24, 2015, 3:58pm

RudiC or XL

Can you please modify your script to be able to use it with csh instead of bash...
I want to merge it with other script I have in csh.

Many thanks

RudiC · June 24, 2015, 4:09pm

Neither script has anything to do with bash nor csh . Both are pure awk .

jiam912 · June 24, 2015, 4:15pm

Dear RudiC,

When i run it using
#!/bin/bash

it works fine..

but when i put it inside of the csh script
give a error

Unmatched '

Why this error.

RudiC · June 24, 2015, 4:17pm

Because, I'd guess, a ' is missing. But, I can't help with csh .

Chubler_XL · June 24, 2015, 4:27pm

Probably be better of avoiding having the awk program on the command line try
putting the program in a file like match_val.awk
and change your command line to awk -f match_val.awk file1 file2

match_val.awk should contain the code you have between ' and ' eg:

FNR==NR         {a[$1]=$2
                 next
                }
...
END { if(key) print ln }