match range of different numbers by AWK

radoulov · July 24, 2009, 6:18am

OK,
given the following input:

zsh-4.3.10[t]% cat infile
c1      5       120     +       5,10,15,10,20   0,25,40,75,95
c1      5       120     -       5,10,15,10,20   0,25,40,75,95

This code:

awk '{
  fifth = split($5, _fifth, ","); sixth = split($6, _sixth, ",")
  counter = 0; key = $1; flag = $4; sub(/[^ \t*]*/, "")
  dummy = sprintf("%*s", length(key),x)
  for (i=1; i<=sixth; i++) {
    second_third = _fifth[1] + _sixth FS _fifth + _fifth[1] + _sixth
    third_second = _fifth + _fifth[1] + _sixth FS _fifth[1] + _sixth 
    if (flag == "+") 
      rec = rec ? rec RS dummy OFS second_third : key OFS second_third OFS $0
    else  
      rec_rev = rec_rev ? \
        (++counter == sixth - 1 ? key OFS third_second OFS $0 : dummy OFS third_second ) RS rec_rev : \
        dummy OFS third_second
    }
  print (flag == "+" ? rec : rec_rev)    
 }' OFS='\t' ORS='\n\n' infile

... produces the following output:

zsh-4.3.10[t]% nawk '{   
  fifth = split($5, _fifth, ","); sixth = split($6, _sixth, ",")
  counter = 0; key = $1; flag = $4; sub(/[^ \t*]*/, "")
  dummy = sprintf("%*s", length(key),x)
  for (i=1; i<=sixth; i++) {
    second_third = _fifth[1] + _sixth FS _fifth + _fifth[1] + _sixth
third_second = _fifth + _fifth[1] + _sixth FS _fifth[1] + _sixth
if (flag == "+")
  rec = rec ? rec RS dummy OFS second_third : key OFS second_third OFS $0
else
  rec_rev = rec_rev ? \
    (++counter == sixth - 1 ? key OFS third_second OFS $0 : dummy OFS third_second ) RS rec_rev : \
dummy OFS third_second
    }
  print (flag == "+" ? rec : rec_rev)
 }' OFS='\t' ORS='\n\n' infile
c1      5 10            5       120     +       5,10,15,10,20   0,25,40,75,95
        30 40
        45 60
        80 90
        100 120

c1      120 100         5       120     -       5,10,15,10,20   0,25,40,75,95
        90 80
        60 45
        40 30
        10 5

Merging the two scripts is left as an exercise

repinementer · July 24, 2009, 7:15am

Thank you very much for your great scripts:)

I did the exercise. The book u referred is really really good. Now on my own created couple of super scripts but compare to yours nothing.

But I will try to learn GAWK book asap.

I think I have consumed your valuble time alot by posting impromptu queries:mad:. My apologies for that.

I have a problem regarding my LOGIC file that I posted earlier. Could you please take a look at new LOGICfile.

I have done some additions. Its almost same as the previous one but with new types of ranges. (Illustrated in XLS file attachment)

The mistake I did was only taking the first range. I need to consider all the ranges.

I think it's asking a lot. .

But really help would be appreciated:b:

repinementer · July 27, 2009, 4:56am

I have a made a small correction in the second script you made (sec = $2) as per my need. Every thing working fine but it printing 2 times if the flag is "-" and working fine with "+"

Could you help me to find the error tht printing twice the result

Thanx

  awk '{
  sec = $2; fifth = split($5, _fifth, ","); sixth = split($6, _sixth, ",")
  counter = 0; key = $1; flag = $4; sub(/[^ \t*]*/, "")
  dummy = sprintf("%*s", length(key),x)
  for (i=1; i<=sixth; i++) {
    second_third = sec + _sixth FS _fifth + sec + _sixth
    third_second = _fifth + sec + _sixth FS sec + _sixth 
    if (flag == "+") 
      rec = rec ? rec RS dummy OFS second_third : key OFS second_third OFS $0
    else  
      rec_rev = rec_rev ? \
        (++counter == sixth - 1 ? key OFS third_second OFS $0 : dummy OFS third_second ) RS rec_rev : \
        dummy OFS third_second
    }
  print (flag == "+" ? rec : rec_rev)    
 }' OFS='\t' ORS='\n\n' r1.txt

input

chr1    61223    1201233    -    5,10,15,10,20    0,25,40,75,95
chr1    61223    1201233    +   5,10,15,10,20    0,25,40,75,95

ouput

chr1    61338 61318        61223    1201233    -    5,10,15,10,20    0,25,40,75,95
        61308 61298
        61278 61263
        61258 61248
        61228 61223

chr1    61338 61318        61223    1201233    -    5,10,15,10,20    0,25,40,75,95
        61308 61298
        61278 61263
        61258 61248
        61228 61223
chr1    61223 61228        61223    1201233    +    5,10,15,10,20    0,25,40,75,95
        61248 61258
        61263 61278
        61298 61308
        61318 61338

radoulov · July 27, 2009, 5:13am

Most likely you have blank lines in your input file. Use this code:

awk 'NF {
  sec = $2; fifth = split($5, _fifth, ","); sixth = split($6, _sixth, ",")
  counter = 0; key = $1; flag = $4; sub(/[^ \t*]*/, "")
  dummy = sprintf("%*s", length(key),x)
  for (i=1; i<=sixth; i++) {
    second_third = sec + _sixth FS _fifth + sec + _sixth
    third_second = _fifth + sec + _sixth FS sec + _sixth 
    if (flag == "+") 
      rec = rec ? rec RS dummy OFS second_third : key OFS second_third OFS $0
    else  
      rec_rev = rec_rev ? \
        (++counter == sixth - 1 ? key OFS third_second OFS $0 : dummy OFS third_second ) RS rec_rev : \
        dummy OFS third_second
    }
  print (flag == "+" ? rec : rec_rev)    
 }' OFS='\t' ORS='\n\n' r1.txt

repinementer · July 27, 2009, 6:13am

Thanx Rado
Working fine now

---------- Post updated at 02:13 AM ---------- Previous update was at 01:26 AM ----------

Is it possible to modify the script for the problem I have posted earlier (copy of Logic file, 2nd one)

Please if your busy leave it to me. I will try my best

radoulov · July 27, 2009, 7:30am

Try to solve the problem yourself. You should verify your requirements: in your last xls file you indicate different definitions for the same ranges ...

repinementer · July 27, 2009, 11:11pm

Yes I did. But I did it on purpose. But you are right I should give the same names to same type of ranges. Anyways.I 'm reading arrays in awk now. The book is really helping me alot.

Could you please explain the code you have written before for the Logic. So that I could understand and re write and modify it according to my usage.

repinementer · July 30, 2009, 12:15am

I have gone through your code.
I have understood how you differentiated and assigned "desc", "asc", "in", "out", "lower", "upper", "exact", "inexact" of input 1 first range value and how you set maximum and minimum values for input2.
As I already told you I'm trying to modify it based on 2 editions.
Edition1. ) Instead of comparing input 1 first range value I would like to compare input 1 all range values based on their specific keys.

I assume you assigned the first range value of input1 as k1 and so on
If I want to assign all the ranges Do I have to remove the k1 and k2

# exact - not exact
  for (i=1; i<=n; i++) {
    split(tmp, range)
    if (Def ~ /asc/) { k1 = $2; k2 = $3 }      
    else { k1 = $3; k2 = $2 }
    if (k1 >= range[1] && k2 <= range[2]) {
      Def = Def "exact"
      print $0 "\t\t" def[Def]
      next
      }
    }
      Def = Def "notexact"
    print $0 "\t\t" def[Def]
    next

Edition2. ) If any range value of input 1 of their specific key overlaps with any range value of input 2 it comes under G and H RANGES (Highlited in green boxes) based on the ""lower"" or """upper""

No need any change

need change

Note : I have renamed the ranges in the new file last copy.xls

Could you please help me with this

radoulov · July 30, 2009, 9:41am

The script I posted compares the ranges in the file input2 with all the ranges
for the relative key in file input1 (not only the first one).

Could you try to elaborate further on the difference between the E/F and G/H ranges?

repinementer · July 30, 2009, 10:48am

Actually what I'm trying to modify is

####################################################
Anyways coming to the point

I mean the ranges in bold letters (We only compared the 1st range i.e, 32-36). But I would like to compare all the ranges of specific key like32-36,50-55 and 86-88

If c1 ranges(all) like 32-36, 50-55 and 86-88 are out of the ranges (all) of input2 like 5-10 then we will give range name (In this case Brange)

Like the above 2ndone has 2-3,4-55 and 86-88. In this case 2-3 range is out of range it comes under ARANGE according the previous script. But now we have to consider all the ranges 2-3,4-55 and 86-88. If any one is overlapping with the ranges it comes under G or H range according the lower or upper values. In this case 4-55 overlapping with 5-10 and it comes under GRANGE
Example:

Input 1

C1 	32 	36

	50 	55

	86 	88

C1	2       3

	4   55

	86	88

Input 2

c1 	5 	10

output

C1 	32 	36     BRANGE

	50 	55 	

	86 	88 	

C1 	2 	3 	GRANGE

	4      55 	

	86 	88

In the previous post of excel file I have mentioned different types of varieties overlapping with values. Simply I placed all overlapping values
under either GRANGE (If they are lower) or HRANGE (If they are upper).

The main thing modified is comparing not only the first range of a key value but also all the ranges of key value and basing on the all the ranges we decide which range they belong

I assume you will understand now.
In case you are not I will explain more elaborately each range individually. Please let me know. I 'll get back to you asap.
Please don't hesitate to criticize me or advice me.

You helped me alot by suggesting AWK book. I'm practicing it daily.I'm really amazed by its built in variables like NR, FNR, OFS,FS. Awk is really awesome.

Thanx rado

repinementer · July 30, 2009, 11:57am

Hey rado I'm including a simple figure and simpler values that elaborates my modifications.
Hope this helps. And don't forget the values I used in this figure are different from the XLS file.

radoulov · July 30, 2009, 3:53pm

OK,
now I understand the definition of G/H ranges. I need more info (or a sample) on how the actual output should look like.
Do you want to combine both results (the second one - ranges from input1 with their definitions - appended
after the first one - ranges from input2 with definitions, for example)?

repinementer · July 30, 2009, 11:46pm

I have uploaded an excel file that has 3 worksheets including input1, input2 and desired output.

Based on the input2 value signs (positive or negative) ranges of output are given.
If its positive sign it is exactly as I mentioned in the previous picture.
If its negative sign there will be little change.

For understanding this I'll include another picture

Thanx

repinementer · July 31, 2009, 12:07am

The figure that explains comparison of ranges of specific keys based on positive and negative signs

repinementer · August 1, 2009, 12:48pm

I hope you understood what I previously posted.
But post if you don't understand any:)

Thanx

repinementer · August 3, 2009, 4:19am

COuld you please explain this one

  NR == FNR && NF {
  NF > 2 && k = $1
  in2[k] = in2[k] ? in2[k] RS $1 FS $2 : $2 FS $3
  next
  }

radoulov · August 3, 2009, 4:52am

I was just looking at the last xls ...
Now, first of all, which ranges from input1 you want to compare with the ranges in input2 (column 1 or column 2)?
Isn't the range separator constant or variable (comma, dot or what)?

---------- Post updated at 10:52 AM ---------- Previous update was at 10:29 AM ----------

NR == FNR

While reading the first non-empty input file ...

&& NF

and the current record contains at least one field ...

NF > 2 && k = $1

if the number of fields is greater than 2 set the variable k to the value of the first field.

in2[k] = in2[k] ? in2[k] RS $1 FS $2 : $2 FS $3

Build an (associative) array in2 keyed by the current value of k, the values are all first and second fields matching the same key
(simulating a more complex data structure, using the current RS as a element separator).

next

You need the next statement here, because you don't want these actions to be performed on the next input file(s).

repinementer · August 3, 2009, 5:05am

Now, first of all, which ranges from input1 you want to compare with the ranges in input2 (column 1 or column 2)?

It has to be column2 or 3 because column 1 has keys like X1....X2...

Now, first of all, which ranges from input1 you want to compare with the ranges in input2 (column 2or column 3)?

Column2
Actually there has to be tab instead of comma in column2. By mistake I forgot to convert comma to tabs
If we do that it column 2 will become column2 and 3
Same thing with input2 commas has to be convert to tabs

Now coming to the main point I would like to compare col2 and col3 of input1 to col2 and col3 of input2

Input1 Has to be like this

X1	1	2	1	4	+	1,2	0,1
	2	4					
X1	120	130	120	140	+	10,4	0,16
	136	140					
X1	4	3	1	4	-	1,2	0,1
	2	1					
X1	140	130	120	140	-	10,4	0,16
	125	120					
X1	15	20	15	98	+	5,10,4	0,35,79
	50	60					
	94	98					
X1	98	96	15	98	-	5,10,4	0,35,79
	75	45					
	25	15					
X1	1	2	1	36	+	1,2,3	0,1,32
	2	4					
	33	36					
X1	88	84	84	140	-	4,10,4	0,36,52
	130	120					
	140	136					
X1	15	20	15	110	+	5,10,5	0,35,90
	50	60					
	105	110					
X1	98	94	50	98	-	10,4,4	0,34,44
	88	84					
	60	50					
X2	15	20	15	98	+	5,10,4	0,35,79
	50	60					
	94	98					
X2	98	96	15	98	-	5,10,4	0,35,79
	75	45					
	25	15					
X3	1	2	1	4	+	1,2	0,1
	2	4					
X3	120	130	120	140	+	10,4	0,16
	136	140					
X3	4	3	1	4	-	1,2	0,1
	2	1					
X3	140	130	120	140	-	10,4	0,16
	125	120					
X3	15	20	15	98	+	5,10,4	0,35,79
	50	60					
	94	98					
X3	98	96	15	98	-	5,10,4	0,35,79
	75	45					
	25	15					
X3	1	2	1	36	+	1,2,3	0,1,32
	2	4					
	33	36					
X3	88	84	84	140	-	4,10,4	0,36,52
	130	120					
	140	136

Input2 has to be like this

X1	5	10	5	118	+	5,10,10,18	0,25,75,95
	30	40					
	80	90					
	100	118					
X2	10	20	10	100	+	10,20,20	0,30,70
	40	60					
	80	100					
X3	118	100	5	118	-	5,10,10,18	0,25,75,95
	90	80					
	40	30					
	10	5					
X4	5	10	5	118	+	5,10,10,18	0,25,75,95
	30	40					
	80	90					
	100	118

I just converted commas into tabs. Every thins as same as the last XLS file

---------- Post updated at 01:05 AM ---------- Previous update was at 12:59 AM ----------

If you want to use these numbers copy and paste these in notepad. Dont paste in excel file.

Thanx

radoulov · August 3, 2009, 5:20am

I don't understand the range definitions in your example output in the last xls file ...
You defined the X1 -> 1,2 as GRANGE, why? Why not ARANGE?
X1 -> 15,20 is defined once as ERANGE, once as GRANGE, why?

---------- Post updated at 11:20 AM ---------- Previous update was at 11:14 AM ----------

I believe you should really try to do it yourself. I just don't have time to analyze these continuously changing specifics ...
Try to write the code yourself.
Feel free to ask when you have a problem with a specific micro-task, but you will learn only if you practice.

repinementer · August 4, 2009, 5:24am

Because in ARANGE 1,2 and 2,4 of X1are out of area when they compared to the ranges
5-10,30-40,80-90 and 100-118.

 .....5-10,30-40,80-90 and 100-118    ARANGE
1-2,2-4

In second case GRANGE, 1-2, 2-4 and 33-36 of X1are not out of area when they compared to the ranges because of 33-36

..5-10,30-40,80-90 and 100-118    GRANGE
1-2,2-4.33-36

It means we have to consider every range . If any one of them is with in the range like 33-36 in 30-40 we have to give the name GRANGE or HRANGE based on upper or lower.

---------- Post updated at 01:30 AM ---------- Previous update was at 01:24 AM ----------

I agree with you rado.
I should do that. I will start with micro task as you suggested.
Just reply me if you find free time otherwise just ignore it.
I'm very happy that you helped me alot so far even though I didn't finish the script.

---------- Post updated 08-04-09 at 01:24 AM ---------- Previous update was 08-03-09 at 01:30 AM ----------

awk 'NF {
  sec = $2; fifth = split($5, _fifth, ","); sixth = split($6, _sixth, ",")
  counter = 0; key = $1; flag = $4; sub(/[^ \t*]*/, "")
  dummy = sprintf("%*s", length(key),x)
  for (i=1; i<=sixth; i++) {
    second_third = sec + _sixth FS _fifth + sec + _sixth
    third_second = _fifth + sec + _sixth FS sec + _sixth 
    if (flag == "+") 
      rec = rec ? rec RS dummy OFS second_third : key OFS second_third OFS $0
    else  
      rec_rev = rec_rev ? \
        (++counter == sixth - 1 ? key OFS third_second OFS $0 : dummy OFS third_second ) RS rec_rev : \
        dummy OFS third_second
    }
  print (flag == "+" ? rec : rec_rev)    
 }' OFS='\t' ORS='\n\n' r1.txt

input

X1    100    200    +    10,20,30,30    10,20,30,40

X2    100    200    +    10,20,30,30    10,20,30,40

output

X1    110 120        100    200    +    10,20,30,30    10,20,30,40
      120 140
      130 160
      140 170

X1    110 120        100    200    +    10,20,30,30    10,20,30,40
      120 140
      130 160
      140 170
      110 120
      120 140
      130 160
      140 170

CORRECT oUTPUT NEEDED IS

X1    110 120        100    200    +    10,20,30,30    10,20,30,40
      120 140
      130 160
      140 170
X2   110 120        100    200    +    10,20,30,30    10,20,30,40
      120 140
      130 160
      140 170

Why I'm getting extra values. COuld you please explain. X2 results are not coming in output.