Loop and array problem

hubleo · March 24, 2013, 5:57pm

Hi, I have the following problem that is beyond what I can currently do with bash scripting.

In file 1, I have ~ 2500000 values. Note this file is not sorted.

3 19 LABEL_A
3 37 LABEL_B
2 12 LABEL_C
1 15 LABEL_D

I have a list of values in "file 2" ~ 25000 unique lines:
Note - LABEL_7 AND LABEL_8 overlap slightly in their column 2 and 3 values

1 11 20 LABEL_1
1 18 30 LABEL_2
1 31 40 LABEL_3
2 11 20 LABEL_4
2 21 30 LABEL_5
2 31 40 LABEL_6
3 11 20 LABEL_7
3 15 30 LABEL_8
3 31 40 LABEL_9
4 11 20 LABEL_10

ETC

To run through what I would like to do, as an example:

LABEL_A (FILE 1) has a 3 in column 1, and a value of 19 in column 2.
I want to compare this to every line in FILE 2.

So, if there is a 3 in column 1 of FILE2, and 19 is between the values of columns 2 and 3 of FILE2, see what label this corresponds to in FILE2.

In this example, 19 is between the values in column 2 and 3 (FILE2) for LABEL_7 and LABEL_8.

Desired output: (Note the value of 2 in column 4 below means there are 2 labels that contain the value 19).

LABEL_A LABEL_7 LABEL_8 2

Full output:

LABEL_A LABEL_7 LABEL_8 2
LABEL_B LABEL_9 1
LABEL_C LABEL_4 1
LABEL_D LABEL 1 1

I think the code for this will involve while loops and arrays, but I have no idea where to start. Any bash solutions would be great (as this is what I am currently learning), but any assistance at all would be very much appreciated.

Don_Cragun · March 24, 2013, 7:34pm

You explicitly state that file 1 is not sorted, but you don't say whether or not file 2 is sorted. Is file 2 sorted by increasing numeric values in fields 1, 2, and 3 as shown in your example, or is file 2 also unsorted?

What is the range of the values in fields 1, 2, and 3?

Are all of the values in fields 1, 2, and 3 integers or are floating point values also included?

hubleo · March 24, 2013, 7:51pm

Hi Don,

Sorry if I wasn't clear.

Field (column 1) goes up to 24 in both File 1 and File 2.
File 1 is unsorted.

File 2 is sorted, from 1-24 in column 1. It is then sorted by lowest number number in column 2 (real values are from 1-about 30,000,000).
So this means that if column 1 = 1, values could be from

1 1 100
1 400 2050
1 9000 19200

or

2 1234 9999
2 25000 10000
2 14000 192000

There is no fixed number by which column 2 increases by, nor is there a fixed number by which column 3 increases by. The lowest possible number in column 2, and highest possible number in column 3 changes depending on whether there is 1,2,3,4 etc in column 1.

The number in all columns are fixed integers.

I hope that makes things clearer...Thanks!

Don_Cragun · March 24, 2013, 8:27pm

hubleo:

Hi Don,

Sorry if I wasn't clear.

Field (column 1) goes up to 24 in both File 1 and File 2.
File 1 is unsorted.

File 2 is sorted, from 1-24 in column 1. It is then sorted by lowest number number in column 2 (real values are from 1-about 30,000,000).
So this means that if column 1 = 1, values could be from
1 1 100
1 400 2050
1 9000 19200
or
2 1234 9999
2 25000 10000
2 14000 192000
There is no fixed number by which column 2 increases by, nor is there a fixed number by which column 3 increases by. The lowest possible number in column 2, and highest possible number in column 3 changes depending on whether there is 1,2,3,4 etc in column 1.

The number in all columns are fixed integers.

I hope that makes things clearer...Thanks!

I understand that file 1 is not sorted and that doesn't matter to me. Lines from file 1 can be read and processed without having to store file 1 values.

File 2 values, however, need to be stored so they can be examined 2.5 million times (once for each line in file 1), so I want to minimize the effort needed to determine if a line in file 1 matches values saved from file 2.

You said above that file 2 is sorted by numeric value on fields 1 and 2 with field 1 being the primary sort key. But, the value marked in red in your example above shows that file 2 is not sorted as you described. It is also strange that that entry has the high end of the range with a value lower than the low end of the range. Did you perhaps intend for the second set of values to be:

2 1234 9999
2 2500 10000
2 14000 192000

instead of:

2 1234 9999
2 25000 10000
2 14000 192000

or do we have to reverse the order of values in fields 2 and 3 if field 2's value is greater than field 3's value?

hubleo · March 24, 2013, 8:35pm

Sorry that was a typo from me

2 1234 9999
2 2500 10000
2 14000 192000

Every value in column 2 is lower than column 3. Many thanks

Don_Cragun · March 24, 2013, 10:19pm

I don't have a set of files of the size you want to use, but the following works for the samples you provided:

awk '
FNR == NR {
        h[$1] = NR      # set "h"igh end of input lines for this label
        m[NR] = $2      # save "m"inimum range value for this line
        M[NR] = $3      # save "M"aximum range value for this line
        l[NR] = $4      # save "l"abel from this line
        next
}
{       e = h[$1]       # set high line number for field 1 value on this line
        o = $NF         # set initial output line to label on this line
        oc = 0          # set number of matched lines
        # Loop through the ranges associated with field 1 from this input line
        for(i = h[$1 - 1] + 1; i <= e; i++) {
                if($2 < m) continue  # range is too low; keep looking
                if(m > $2) break     # range is too high; we are done
                if($2 <= M) {        # we have a matching range
                        o = o " " l  # add corresponding label to output line
                        oc++            # increment match count
                }
        }
        print o, oc     # print the matched labels and the match count
}' "file 2" "file 1"

As always, if you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk .
With your sample input, this script produces:

LABEL_A LABEL_7 LABEL_8 2
LABEL_B LABEL_9 1
LABEL_C LABEL_4 1
LABEL_D LABEL_1 1

which matches the Full output you said you wanted except your output showed a space instead of the underline marked in red in the last output line.

hubleo · March 25, 2013, 7:22am

Don that works perfectly. Does exactly what I wanted. Also, thanks for so clearly annotating your lines, I can tell exactly what you've done and that's really helpful for a relative newbie like me. Thanks so much