match range of different numbers by AWK

repinementer · July 21, 2009, 5:35am

If columns 1 and 2 hold the same key in both files (for example "a" and "a1"), compare the first range of each keyed record in input2 (for example 1-4 or 65-69, not the continuation lines 70-100 or 44-40) with all the ranges in input1.
If that first range in input2 falls outside the ranges in input1, name it outofrange; otherwise name it inrange.
Some of the ranges in input2 are in descending order and some are in ascending order. Based on this order we have to name them accordingly, as shown below.

If this seems complicated or time-consuming, please give me a basic idea or approach for comparing the ranges in the 2 files.

Help would be appreciated

input1

a	a1	5-10
		30-40
		45-60
		80-90
		100-120

input2

a	a1	1-4
a	a1	4-1
a	a1	120-140
a	a1	140-120
a	a1	65-69
		70-100
a	a1	70-65
		44-40
a	a1	30-33
		37-57
		63-83
a	a1	85-81
		30-25
b	b1	100-200
c	c2	1-200
d	d3	2-333

output

a	a1	1-4		outofrange1
a	a1	4-1		outofrange2
a	a1	120-140     outofrange3
a	a1	140-120     outofrange4
a	a1	65-69	        inrange1
		70-100
a	a1	70-65  	inrange2
		44-40	
a	a1	30-33 	inrange3
		37-57
		63-83 
a	a1	85-81 	inrange4
		30-25


radoulov · July 22, 2009, 5:12am

Could you please elaborate further?

stateperl · July 22, 2009, 8:45pm

There are 2 input files, as I mentioned: input1 and input2.
Input1 has 3 columns. The 1st has key values, the 2nd has subkey values and the 3rd has numerical ranges (such as 5-10, 30-40).

input1

a	a1	5-10
		30-40
		45-60
		80-90
		100-120
x       a2    10-20
                50-60

Input2 also has 3 columns, exactly like input1: the 1st with the key, the 2nd with the subkey and the 3rd with various ranges of numbers such as 1-4, 4-1, 120-140, 140-120.

input2

a	a1	1-4
a	a1	4-1
a	a1	120-140
a	a1	140-120
a	a1	65-69
		70-100
a	a1	70-65
		44-40
a	a1	6-7
		37-57
		63-83
a	a1	7-8

Now I need to name the ranges in input2 according to the ranges given in input1, as in the following output:

output

a	a1	1-4		outofrange1
a	a1	4-1		outofrange2
a	a1	120-140     outofrange3
a	a1	140-120     outofrange4
a	a1	65-69	        inrange1
		70-100
a	a1	70-65  	inrange2
		44-40	
a	a1	6-7            inrange3
a	a1	7-8            inrange4

As you can see, 1-4 (the values from 1 to 4 in input2 are absent from input1) is given the name outofrange1.

The 2nd one is a little bit tricky: the range runs from high to low, like 4-1 (the values from 4 to 1 in input2 are absent from input1). Though it looks the same as the 1st case, it runs from the high value to the low value, 4 to 1, and is given the name outofrange2.

outofrange3 is the same as outofrange1
outofrange4 is the same as outofrange2

5th one, 65-69, inrange1: the values from 65 to 69 in input2 fall between the ranges in input1 but not within any one of them, so the name is inrange1.

45-60 65-69 80-90

inrange2 is the trickier (high-to-low) version of inrange1, as I mentioned before for outofrange2.

7th one, 6-7, inrange3: the values 6 to 7 in input2 lie exactly within one of the ranges of input1, so the name is inrange3.

5 6-7 10

inrange4 is the trickier (high-to-low) version of inrange3.

And most importantly, we compare only the first range of each key in input2 with all the ranges for that key in input1, for example a a1 65-69 with all the values in input1 for a a1: 5-10, 30-40, 45-60, 80-90, 100-120.

I hope this explains everything clearly.
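
The naming rules described in this post can be sketched as a small awk helper. This is only an illustration under stated assumptions, not code from the thread: the function name classify_range and the labels "out", "in-between" and "in-exact" are hypothetical, and a range that only partly overlaps an input1 range is simply reported as "in-between".

```shell
# Hypothetical helper (not from the thread): classify one input2 range
# against a space-separated list of input1 ranges.
classify_range() {
  awk -v r="$1" -v list="$2" 'BEGIN {
    split(r, q, "-")
    # direction: high-to-low ranges such as 4-1 are "desc"
    dir = (q[1] + 0 > q[2] + 0) ? "desc" : "asc"
    lo = (dir == "asc") ? q[1] + 0 : q[2] + 0
    hi = (dir == "asc") ? q[2] + 0 : q[1] + 0
    n = split(list, item, " ")
    for (i = 1; i <= n; i++) {
      split(item[i], b, "-")
      blo = b[1] + 0; bhi = b[2] + 0
      if (i == 1 || blo < min) min = blo      # overall lower limit
      if (i == 1 || bhi > max) max = bhi      # overall upper limit
      if (lo >= blo && hi <= bhi) exact = 1   # fully inside one range
    }
    if (lo < min || hi > max) where = "out"
    else if (exact)           where = "in-exact"
    else                      where = "in-between"
    print dir, where
  }'
}
```

With the input1 ranges for a a1, classify_range 65-69 "5-10 30-40 45-60 80-90 100-120" prints "asc in-between", and classify_range 4-1 with the same list prints "desc out".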

rakeshawasthi · July 23, 2009, 2:02am

With the unclear requirement and the mistakes in the expected output, it was difficult to write a program. No program was provided either, so I had to write one from scratch.

Try:

re_arrange_file ()
{
infile=$1
out_file=$infile"x"
>$out_file
while read line
do
        set $line
        if [ $# -eq 3 ]; then
                key=$1
                subkey=$2
                min_range=$(echo $3 | cut -d"-" -f1)
                max_range=$(echo $3 | cut -d"-" -f2)
        else
                min_range=$(echo $1 | cut -d"-" -f1)
                max_range=$(echo $1 | cut -d"-" -f2)
        fi
        if [[ $min_range -gt $max_range ]]; then
           (( min_range = $max_range + $min_range))
                (( max_range = $min_range - $max_range))
                (( min_range = $min_range - $max_range))
        fi

        echo $key $subkey $min_range $max_range >> $out_file
done < $infile
}

re_arrange_file input1
re_arrange_file input2
>out_file
in_range_count=0
out_range_count=0
file2_lin_no=0
while read line
do
   set $line
   key=$1
   subkey=$2
   min_range=$3
   max_range=$4

        found=0
        ((file2_lin_no = $file2_lin_no + 1))
        file2_lin=`head -$file2_lin_no input2 | tail -1`
        cat input1x | grep "$key" | grep "$subkey" > tmp
        while read _key _subkey _min _max
        do
                if [[ ${_min} -le $min_range && ${_max} -ge $max_range ]]; then
                        ((in_range_count = $in_range_count + 1))
                        echo $file2_lin "inrange"$in_range_count
                        found=1
                        break
                fi
        done < tmp
        if [[ $found -eq 0 ]]; then
                ((out_range_count = $out_range_count + 1))
                echo $file2_lin outofrange"$out_range_count"
        fi
done < input2x

Output:

a a1 1-4 outofrange1
a a1 4-1 outofrange2
a a1 120-140 outofrange3
a a1 140-120 outofrange4
a a1 65-69 outofrange5
70-100 outofrange6
a a1 70-65 outofrange7
44-40 outofrange8
a a1 6-7 inrange1
37-57 outofrange9
63-83 outofrange10
a a1 7-8 inrange2

This is not exactly what was given in the question. But I think the question is wrong, as many lines there have neither inrange nor outofrange.
a a1 65-69 should be outofrange, but the question says otherwise.

repinementer · July 23, 2009, 2:18am

Hey, my apologies for the inconvenience.
Thank you very much for the script and the time you have spent on this problem.

I think your code is still missing the following:

Only the first range of each record needs to be compared (the others are not needed):

a a1 65-69 outofrange5 (need to compare)
70-100 outofrange6 (no need to compare)
a a1 70-65 outofrange7 (need to compare)
44-40 outofrange8 (no need to compare)

65-69 is out of range, but it lies between the ranges I have given in input1.

This is really, really important.

Ranges that fall between the input1 ranges need to be considered even though they do not match exactly, especially in the case of 65-69. This range lies between 45-60 and 80-90:
45-60 65-69 80-90

The same applies to the 70-65 case.
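
The "between the ranges" check asked for here can be sketched as follows. This is a minimal illustration, not the thread's code: gap_neighbours is a hypothetical name, the input1 ranges are assumed to be sorted ascending and non-overlapping, and only the first range of each input2 record would be passed to it.

```shell
# Hypothetical helper (not from the thread): if RANGE sits in the gap
# between two consecutive input1 ranges, print "left RANGE right".
gap_neighbours() {
  awk -v r="$1" -v list="$2" 'BEGIN {
    split(r, q, "-")
    lo = q[1] + 0; hi = q[2] + 0
    if (lo > hi) { t = lo; lo = hi; hi = t }  # normalise 70-65 to 65..70
    n = split(list, item, " ")
    for (i = 1; i < n; i++) {
      split(item[i], a, "-"); split(item[i + 1], b, "-")
      # strictly after the left range and strictly before the right one
      if (lo > a[2] + 0 && hi < b[1] + 0) {
        print item[i], r, item[i + 1]
        exit
      }
    }
    print "no gap"
  }'
}
```

For example, gap_neighbours 65-69 "5-10 30-40 45-60 80-90 100-120" prints "45-60 65-69 80-90", while 6-7 prints "no gap" because it lies inside 5-10.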

radoulov · July 23, 2009, 4:09am

I really don't understand ...
Could you post bigger samples of the input files and the expected output? Are the in/outofrange numbers always progressive, or are they specific to the combination?
You could start with something like this (use gawk, nawk or /usr/xpg4/bin/awk on Solaris):

awk 'NR == FNR {
  NF != 1 && k = $1
  in1[k] = in1[k] ? in1[k] FS $NF : $NF
  next
  }
$1 in in1 {
  n = split(in1[$1], t, "-")
  min = t[1]; max = t[n]; split($NF, tt, "-")
  tt[1] > tt[2] ? k1 = 2 && k2 = 1 : k1 = 1 && k2 = 2 
  range = tt[k1] >= min && tt[k2] <= max ? "inrange" : "outofrange"
  $0 = $0 "\t\t" range (++r[range]) 
    }1' input*

This is what I get:

zsh-4.3.10[t]% head -20 in*
==> input1 <==
a       a1      5-10
                30-40
                45-60
                80-90
                100-120
x       a2    10-20
                50-60

==> input2 <==
a       a1      1-4
a       a1      4-1
a       a1      120-140
a       a1      140-120
a       a1      65-69
                70-100
a       a1      70-65
                44-40
a       a1      6-7
                37-57
                63-83
a       a1      7-8

zsh-4.3.10[t]% awk 'NR == FNR {
  NF != 1 && k = $1
  in1[k] = in1[k] ? in1[k] FS $NF : $NF
  next
  }
$1 in in1 {
  n = split(in1[$1], t, "-")
  min = t[1]; max = t[n]; split($NF, tt, "-")
  tt[1] > tt[2] ? k1 = 2 && k2 = 1 : k1 = 1 && k2 = 2
  range = tt[k1] >= min && tt[k2] <= max ? "inrange" : "outofrange"
  $0 = $0 "\t\t" range (++r[range])
    }1' input*
a       a1      1-4             outofrange1
a       a1      4-1             outofrange2
a       a1      120-140         outofrange3
a       a1      140-120         outofrange4
a       a1      65-69           inrange1
                70-100
a       a1      70-65           inrange2
                44-40
a       a1      6-7             inrange3
                37-57
                63-83
a       a1      7-8             inrange4

repinementer · July 23, 2009, 4:29am

That is the exact output I'm looking for, but the names should be specific, not progressive.

Except for this, everything seems to be right.

I will post sample input files ASAP. Thanks a lot for the advice and the script.

repinementer · July 23, 2009, 6:42am

I've attached the Excel file with the input, the desired output and the logic behind it.
I hope it's clear this time.

radoulov · July 23, 2009, 7:24am

OK,
I think it's clear now. Is the format of your input files like the one you posted in the first post, or like the one in the xls file?

repinementer · July 23, 2009, 8:21am

The format in the XLS file is the correct one.

Is it possible to upgrade the script based on the values and format in the XLS file?

Thank you for your cooperation and patience in understanding my question
and getting back to me.

Really appreciated.

radoulov · July 23, 2009, 9:42am

You swapped the input filenames in your xls, so this time the order is input2 input1. Try this code:

awk 'BEGIN {
  def["ascoutlower"]    = "ARANGE"   
  def["ascoutupper"]    = "BRANGE"
  def["descoutlower"]   = "CRANGE"
  def["descoutupper"]   = "DRANGE"
  def["ascinnotexact"]  = "ERANGE"
  def["descinnotexact"] = "FRANGE"
  def["ascinexact"]     = "GRANGE"
  def["descinexact"]    = "HRANGE"
  }
NR == FNR && NF {
  NF > 2 && k = $1
  in2[k] = in2[k] ? in2[k] RS $1 FS $2 : $2 FS $3
  next
  }
$1 in in2 {
  n = split(in2[$1], tmp, RS) 
  split(tmp[1], Tmp); min = Tmp[1]
  m = split(tmp[n], Tmp); max = Tmp[m]
  # asc - desc
  Def = $2 > $3 ? "desc" : "asc"
  # inrange - outofrange
  if (Def == "asc")
    Def = Def ($2 >= min && $3 <= max ? "in" : "out") 
  else
    Def = Def ($3 >= min && $2 <= max ? "in" : "out")
  # lower - upper
  if ((Def ~ /ascout/ ? $3 : $2) <= min) {
    Def = Def "lower"
    print $0 "\t\t" def[Def]
    next
    }
  if ((Def ~ /ascout/ ? $3 : $2) >= max) {
    Def = Def "upper"
    print $0 "\t\t" def[Def]
    next
    }    
  # exact - not exact
  for (i=1; i<=n; i++) {
    split(tmp[i], range)
    if (Def ~ /asc/) { k1 = $2; k2 = $3 }      
    else { k1 = $3; k2 = $2 }
    if (k1 >= range[1] && k2 <= range[2]) {
      Def = Def "exact"
      print $0 "\t\t" def[Def]
      next
      }
    }
      Def = Def "notexact"
    print $0 "\t\t" def[Def]
    next    
}1' input2 input1

repinementer · July 23, 2009, 10:10am

Amazing!
I really didn't believe you could do this with just awk. Thanks a lot.
Could you please suggest anything for learning? I'm really interested in learning AWK.
The output I got looks like this:

c1	1	4	1	4	+	1	3	0		ARANGE
c1	120	140	120	140	+	1	20	0		BRANGE
c1	4	1	4	1	-	1	3	0		CRANGE
c1	140	120	140	120	-	1	20	0		DRANGE
c1	65	69	65	100	+	2	4,30	0,5		ERANGE
	70	100						
c1	71	65	71	40	-	2	4,10	0,21		FRANGE
	44	40						
c1	30	33	33	83	+	3	3.20,20	0,7,30GRANGE
	37	57						
	63	83						
c1	85	81	85	25	-	2	5,4	0,56		HRANGE
	30	25						
c2	4	1	4	1	-	1	3	0
c2	140	120	140	120	-	1	20	0
c3	65	69	65	100	+	2	4,30	0,5
	70	100						
c2	71	65	71	40	-	2	4,10	0,21
	44	40						
c9	140	120	140	120	-	1	20	0

Is it possible to modify it like this?

c1	1	4	1	4	+	1	3	0		ARANGE
c1	120	140	120	140	+	1	20	0		BRANGE
c1	4	1	4	1	-	1	3	0		CRANGE
c1	140	120	140	120	-	1	20	0		DRANGE
c1	65	69	65	100	+	2	4,30	0,5		ERANGE					
c1	71	65	71	40	-	2	4,10	0,21		FRANGE						
c1	30	33	33	83	+	3	3.20,20 0,7,30    GRANGE						
c1	85	81	85	25	-	2	5,4	0,56		HRANGE					
c2	4	1	4	1	-	1	3	0              UNKNOWN
c2	140	120	140	120	-	1	20	0              UNKNOWN
c3	65	69	65	100	+	2	4,30	0,5	        UNKNOWN				
c2	71	65	71	40	-	2	4,10	0,21	        UNKNOWN					
c9	140	120	140	120	-	1	20	0              UNKNOWN

radoulov · July 23, 2009, 10:35am

Sure:

awk 'BEGIN {
  def["ascoutlower"]    = "ARANGE"   
  def["ascoutupper"]    = "BRANGE"
  def["descoutlower"]   = "CRANGE"
  def["descoutupper"]   = "DRANGE"
  def["ascinnotexact"]  = "ERANGE"
  def["descinnotexact"] = "FRANGE"
  def["ascinexact"]     = "GRANGE"
  def["descinexact"]    = "HRANGE"
  }
NR == FNR && NF {
  NF > 2 && k = $1
  in2[k] = in2[k] ? in2[k] RS $1 FS $2 : $2 FS $3
  next
  }
$1 in in2 {
  n = split(in2[$1], tmp, RS) 
  split(tmp[1], Tmp); min = Tmp[1]
  m = split(tmp[n], Tmp); max = Tmp[m]
  # asc - desc
  Def = $2 > $3 ? "desc" : "asc"
  # inrange - outofrange
  if (Def == "asc")
    Def = Def ($2 >= min && $3 <= max ? "in" : "out") 
  else
    Def = Def ($3 >= min && $2 <= max ? "in" : "out")
  # lower - upper
  if ((Def ~ /ascout/ ? $3 : $2) <= min) {
    Def = Def "lower"
    print $0 "\t\t" def[Def]
    next
    }
  if ((Def ~ /ascout/ ? $3 : $2) >= max) {
    Def = Def "upper"
    print $0 "\t\t" def[Def]
    next
    }    
  # exact - not exact
  for (i=1; i<=n; i++) {
    split(tmp[i], range)
    if (Def ~ /asc/) { k1 = $2; k2 = $3 }      
    else { k1 = $3; k2 = $2 }
    if (k1 >= range[1] && k2 <= range[2]) {
      Def = Def "exact"
      print $0 "\t\t" def[Def]
      next
      }
    }
      Def = Def "notexact"
    print $0 "\t\t" def[Def]
    next    
}
!/^[ \t]/ { print $0 "\t\tUNKNOWN" }' input2 input1

Gawk: Effective AWK Programming by Arnold Robbins is free.

repinementer · July 23, 2009, 10:39am

Note: only consider this if you have time. Thank you.

Actually, before posting this question I made a small script that builds the input files for your script from the main input files (two examples are posted below).
I have 3 scripts to do this job. If you have time, could you incorporate something into your script that does this kind of job?

If you want, I will post the scripts I have used (not so professional, but they do the job).

INPUT

c1	5	120	+	5,10,5,10,20	0,25,40,75,95
c1	25	85	-	2	5,4	0,56

OUTPUT (THE ONES YOU USED AS INPUT1 AND INPUT2)

c1	5	10	5	120	+	5,10,5,10,20	0,25,40,75,95
	30	40					
	45	60					
	80	90					
	100	120	
c1	85	81	85	25	-	2	5,4	0,56
	30	25

As you can see, I'm doing 2 tasks to get the new columns in the output, i.e. the 2nd and 3rd columns.
Task 1: if the 6th column of the input contains "+", the 1st value of the 7th column is added to the 1st value of the 8th column to produce the new value in the second column of the output; then the same 1st value of the 7th column is added to the 2nd value of the 8th column to produce the new value in the third column of the output, and so on. It looks like this:
LOGIC

5,10,5,10,20	0,25,40,75,95
5+0=5  5+5=10        
5+25=30 30+10=40 and so on

Task 2: if the 6th column of the input contains "-" (minus sign), the 1st value of the 7th column is added to the 1st value of the 8th column to produce the new value in the second column of the output; then the same 1st value of the 7th column is added to the 2nd value of the 8th column to produce the new value in the third column of the output, and so on. Finally, everything is placed in reverse order. It looks like this:

25	85	-	2	5,4	0,56
25+0=25  25+5=30  
25+56=81 81+4=85  

and they are reversed, running from high to low, because of the "-":
	81	85	
	30	25

I will read and practice with the book you mentioned. So helpful.

radoulov · July 24, 2009, 4:37am

c1    5    120    +    5,10,5,10,20    0,25,40,75,95

c1    5    10    5    120    +    5,10,5,10,20    0,25,40,75,95
    30    40                    
    45    60                    
    80    90

The third line should be:

45    50

and not

45    60

Or am I missing something?

As far as the second task is concerned, you said:

25    85    -    2    5,4    0,56
25+0=25  25+5=30  
25+56=81 81+4=85

The first value of the 7th column (actually it's the fifth column in your original input) is 2, not 25. 25 is the second column in the original input ...

Could you clarify?

repinementer · July 24, 2009, 4:56am

First of all, I have to say sorry for the errors in my post. Please excuse me this time. You are right about the errors.

I'm including new inputs.

INPUT1

c1	5	120	+	5,10,5,10,20	0,25,40,75,95
c1	5	120	-	5,10,5,10,20	0,25,40,75,95

OUTPUT

c1	5	10	5	120	+	5,10,5,10,20	0,25,40,75,95
	30	40					
	45	60					
	80	90					
	100	120	

c1	120	100	5	120	-	5,10,5,10,20	0,25,40,75,95
	90	80					
	45	60					
	40	30				
	10	5

All the original input files look exactly like INPUT1.
The amazing awk script that you developed works on files like OUTPUT.

My request is to modify the script so that it can also do this conversion to OUTPUT.

radoulov · July 24, 2009, 5:13am

I understand how you calculate the lower bound of the range.
I don't understand how you calculate the upper bound.
Could you try to elaborate further?

first record

c1    5    120    +    5,10,5,10,20    0,25,40,75,95

relevant columns

+    5,10,5,10,20    0,25,40,75,95

So, for the lower bound - the new second column - we have:

5 + 0 = 5
5 + 25 = 30
5 + 40 = 45
and so on...

Could you explain how the new third column (the upper bound) should be calculated? How do you get the following numbers:

repinementer · July 24, 2009, 5:37am

Well, the lower-bound values are added to 5,10,15,10,20 to produce the upper bounds:

+    5,10,15,10,20    0,25,40,75,95

5 + 0 = 5
5 + 25 = 30
5 + 40 = 45
and so on...

5 + 5 = 10
30 +10 = 40
45 + 15 = 60
80 + 10 = 90
100 + 20 = 120
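
The arithmetic above can be sketched in awk. This is a rough illustration under stated assumptions, not the script requested in the thread: expand_blocks is a hypothetical name, the input is assumed to have the six columns key, start, end, sign, sizes, offsets, and for "-" records the sketch simply prints the same pairs in reverse order with high before low.

```shell
# Hypothetical helper (not from the thread): expand one record into its ranges.
#   lower bound = start + offset[i]
#   upper bound = lower bound + size[i]
expand_blocks() {
  awk '{
    start = $2; sign = $4
    n = split($5, size, ",")
    split($6, off, ",")
    for (i = 1; i <= n; i++) {
      lo[i] = start + off[i]
      hi[i] = lo[i] + size[i]
    }
    if (sign == "+")
      for (i = 1; i <= n; i++) print lo[i], hi[i]   # ascending, low high
    else
      for (i = n; i >= 1; i--) print hi[i], lo[i]   # reversed, high low
  }'
}
```

Fed the line "c1 5 120 + 5,10,15,10,20 0,25,40,75,95" on standard input, it prints 5 10, 30 40, 45 60, 80 90 and 100 120, one pair per line, matching the sums in this post.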

radoulov · July 24, 2009, 5:41am

OK,
so, as I already mentioned, your example output was wrong.

Your original input file contains:

5,10,5,10,20

and not:

5,10,15,10,20

repinementer · July 24, 2009, 6:07am

YAAAA. Sorry DA
That's why I mentioned it in bold.

The book you have suggested is really Awesome.
Thanx