Awk: greater than sign is working upside down

beca123456 · April 23, 2018, 9:23am

Hi,

I noticed a weird behaviour with awk.

input:

A|B|1-100|blabla_35_40_blabla;blabla_53_60_blabla;blabla_90_110_blabla

Objective:
For each string separated by ';' in $4, if the first and second numbers are included in the interval in $3, then print "TRUE". Otherwise print "FALSE".

In order to get this output:

A|B|1-100|blabla_35_40_blabla|TRUE
A|B|1-100|blabla_53_60_blabla|TRUE
A|B|1-100|blabla_90_110_blabla|FALSE

My code:

awk '
BEGIN{FS=OFS="|"}
{
    START=FINISH=$3
    gsub(/-.+$/,"",START)                         # isolate the first number in the interval in $3
    gsub(/^.+-/,"",FINISH)                       # isolate the second number in the interval in $3

    a=split($4,b,";")
    for(i=1; i<=a; i++){
        beg=gensub(/(^[^_]+_)([0-9]+)(_.+$)/,"\\2","g",b[i])              # isolate first number in $4
        end=gensub(/(^[^_]+_[0-9]+_)([0-9]+)(_.+$)/,"\\2","g",b[i])       # isolate second number in $4

        if(beg > START && end < FINISH){
            print $1 FS $2 FS $3 FS b[i] FS "TRUE"
        }
        else{
            print $1 FS $2 FS $3 FS b[i] FS "FALSE"
        }
    }
}' input

But I get:

A|B|1-100|blabla_35_40_blabla|FALSE
A|B|1-100|blabla_53_60_blabla|FALSE
A|B|1-100|blabla_90_110_blabla|FALSE

---------- Post updated at 08:23 AM ---------- Previous update was at 07:57 AM ----------

It actually works when I build the values with split() arrays instead of gsub/gensub. So I assume awk treats the values as numbers when they come from split() and as text when they come from gsub/gensub.

rdrtx1 · April 23, 2018, 10:20am

awk '
BEGIN{FS=OFS="|"}
{
    split($3,interval, "[-]")

    a=split($4,string, ";")

    for(i=1; i<=a; i++){
       b=split(string[i], numbers, "_")
       print $1, $2, $3, string[i], (numbers[2] >= interval[1] && numbers[3] <= interval[2]) ? "TRUE" : "FALSE";
    }
}' input

Scrutinizer · April 23, 2018, 11:40am

@OP, try:

if(beg+0 > START+0 && end+0 < FINISH+0){
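As far as I understand it, the reason this helps is that gsub() and gensub() always leave strings behind, and awk compares two strings character by character, so "40" sorts after "100". Adding 0 forces a numeric comparison. A minimal sketch of the difference, where end and finish are just illustrative stand-ins for the values in the script above:

```shell
result=$(awk 'BEGIN {
    end = "40"; finish = "100"       # strings, like the values gsub/gensub leave behind
    print (end < finish)             # string comparison: "4" sorts after "1", so this prints 0
    print (end+0 < finish+0)         # +0 forces numbers: 40 < 100, so this prints 1
}')
printf '%s\n' "$result"
```

Elements created by split() that look like numbers are compared numerically, which would explain why the array version behaves as expected without the +0.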

RudiC · April 23, 2018, 3:33pm

rdrtx1's proposal works on the sample line posted, but it assumes that the second number is never smaller than the first. Under that assumption, N2 being at or below the high boundary implies the same for N1, and N1 being at or above the low boundary implies the same for N2, so two of the four boundary tests can be omitted.
If that assumption does not hold (and nothing in the post says it does), then use

awk '
BEGIN{FS=OFS="|"}
{
    split($3,interval, "[-]")

    a=split($4,string, ";")

    for(i=1; i<=a; i++){
       b=split(string[i], numbers, "_")
       print $1, $2, $3, string[i], (numbers[2] >= interval[1] && numbers[2] <= interval[2] && numbers[3] >= interval[1] && numbers[3] <= interval[2]) ? "TRUE" : "FALSE";
    }
}' file
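
To see where the two tests part ways, here is a small sketch run on a made-up record whose numbers are in reverse order (150 before 50). The input line and the variable names two and four are only for illustration and are not from the original data:

```shell
result=$(printf '%s\n' 'A|B|1-100|blabla_150_50_blabla' | awk '
BEGIN { FS = OFS = "|" }
{
    split($3, interval, "-")
    n = split($4, string, ";")
    for (i = 1; i <= n; i++) {
        split(string[i], numbers, "_")
        # two-boundary test: low bound on the first number, high bound on the second
        two  = (numbers[2] >= interval[1] && numbers[3] <= interval[2])
        # four-boundary test: both numbers checked against both bounds
        four = (two && numbers[2] <= interval[2] && numbers[3] >= interval[1])
        print string[i], (two ? "TRUE" : "FALSE"), (four ? "TRUE" : "FALSE")
    }
}')
printf '%s\n' "$result"
```

This should print blabla_150_50_blabla|TRUE|FALSE: the two-boundary test accepts the record even though 150 lies outside 1-100, while the four-boundary test rejects it.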