Ranking data points from multiple files

ncwxpanther · June 17, 2016, 1:48pm

I need to rank a large number of data points that exist in multiple files. My data points (Column 3) are based on unique values in columns 1 and 2. I need to rank the values that are in File 1, Column 3.

For instance:

Input File 1

AAA BBB 10
CCC DDD 16
EEE FFF 20

Input File 2

AAA BBB 11
CCC DDD 14
EEE FFF 19

Input File 3

AAA BBB 12
CCC DDD 15
EEE FFF 19

Expected output

AAA BBB 1
CCC DDD 3
EEE FFF 3

Any ideas how to best perform this analysis? I am familiar with awk and bash.

vgersh99 · June 17, 2016, 2:49pm

what's the math to calculate the rank (given the samples)?

ncwxpanther · June 17, 2016, 2:53pm

I was using this script for ranking 1 column of a single file - In this case i was ranking column 4.

sort -k4 -n  file | awk '$4 != prev { rank = NR }{ print $1, $4, rank ; prev = $4 }'

But this new script needs to rank multiple files based on the same data points that are in columns 1 and 2.

Corona688 · June 17, 2016, 3:32pm

You've told us how you don't want them ranked.

What we need is how you actually do want them ranked.

ncwxpanther · June 17, 2016, 3:51pm

I would think that the awk variable NR would be able to count the number of records and create the rank based on that.

There are 2 stumping blocks for me:
1) I dont know how to make File 1 be the data value that it is ranking.
2) Im not sure how to best search for each data point (columns 1 and 2) in all the files. What is the most efficient way in doing this?

Corona688 · June 17, 2016, 4:20pm

NR is the line number, effectively. AAA BBB should have a rank of 1, CCC DDD should have a rank of 2, etc.

If that's not what you do want, you'll have to actually tell us what you want instead of showing something which gives you things you don't want.

ncwxpanther · June 20, 2016, 9:02am

I have the following script that will search for a data point and return the sorted values.

How would I:

1) Create a variable for each of the data points that are in File 1
2) Create an additional column with the ranking of the values

for year in "${test}"/{1900..2016}05.pnt 
do
cat "$year"
done |
grep "AAA BBB" $year | sort -k4 -n

 
 AAA BBB    24.48
 AAA BBB    24.70
 AAA BBB    24.95
 AAA BBB    25.10
 AAA BBB    25.28
 AAA BBB    25.36
 AAA BBB    25.57

Corona688 · June 20, 2016, 4:43pm

1) Depends what "create a variable" means.
2) Depends what "ranking" means.

How does your ranking work?

Why do the values "AAA BBB 10", "AAA BBB 11", and "AAA BBB 12" have a rank of 1?

Why do the values "CCC DDD 16", "CCC DDD 14", and "CCC DDD 15" have a rank of 3?

Why do the values "EEE FFF 20", "EEE FFF 19", and "EEE FFF 19" have a rank of 3?

Either your output is unrelated to your input, or there is a deeper pattern you must explain before we can do anything with it.

Please show relevant input, and related output, and explain how one becomes the other.

ncwxpanther · June 21, 2016, 9:05am

There is an input file with the following format

Main Input File

 24.5625  -81.8125    27.47
 24.5625  -81.7708    27.16
 24.5625  -81.7292    27.49

I need to rank the third column of that input file when compared to 100 other files with the same format. The first 2 columns need to be used as identifiers or data points. So when you search through the other 100 files, you only compare these unique data points. Each value with be ranked based on the original input file.

One of the 100 other files to compare to:

 24.5625  -81.8125    27.37
 24.5625  -81.7708    27.26
 24.5625  -81.7292    27.29

One of the 100 other files to compare to:

 24.5625  -81.8125    27.57
 24.5625  -81.7708    27.66
 24.5625  -81.7292    27.19

Expected Output:

 24.5625  -81.8125    27.47  2  
 24.5625  -81.7708    27.46  1
 24.5625  -81.7292    27.49  3

Corona688 · June 21, 2016, 11:18am

Ranked how?

You have 27.47, 27.37, and 27.57 as input. How does the output value of 27.47 get chosen? Is it an average, or some other method?

And where does the two come from?

ncwxpanther · June 21, 2016, 11:38am

27.47 comes from the original input file which is named YYYYMM.pnt. I was hoping that can be chosen from a variable.

set month = `date +"%m"`
set year = `date +"%Y"`

$year$month.pnt

All the other files to compare to are the same month in previous years.

201505.pnt
201405.pnt
201305.pnt
etc

The two is the rank of the data value in the original file when compared to the other files.

Corona688 · June 21, 2016, 12:04pm

Compared how? Why is it two, and not three, zero, or nine? Why is the result 27.47 and not 27.37 or 27.57? What criteria makes this decision?

Corona688 · June 21, 2016, 12:11pm

Why isn't 24.5625 -81.7708 27.16 chosen from the original input file?

ncwxpanther · June 21, 2016, 12:13pm

The ranking is compared based on the 3rd column.

Its two because 27.47 is the second highest value when compared to 27.37 and 27.57.

27.47 comes from the original input file. The original input file needs to be the one that is ranked, thus its values (27.47) are preserved.

Corona688 · June 21, 2016, 12:15pm

Thank you! That finally explains it.

I'm guessing the 27.46 is a typo then?

ncwxpanther · June 21, 2016, 12:16pm

Sorry, that output should be 24.5625 -81.7708 27.16 1
not
24.5625 -81.7708 27.46 1

Corona688 · June 21, 2016, 12:57pm

You can sort everything by value and just let awk decide which belongs to which:

REF="test/190005.pnt"

sort -k3 -n test/{1900..2016}05.pnt |
        awk '
        # Read the values you want to rank from the first file
        # This trick works by NR and FNR only being the same while reading
        # the first file, not the second.
        NR==FNR {  A[$1,$2]=$3 ; next }
        # Read everything sorted by value, counting order per prefix
        # as it goes.  When a matching value is found, print its order.
        ++C[$1,$2] && A[$1,$2] == $3 { print $1, $2, C[$1,$2] }
' $REF -

ncwxpanther · June 22, 2016, 8:55am

Thanks Corona. The scripts works as expected.

However there is an added intricacy that I need help with.

For ties - I need the script to rank that tied value the same.

For instance, If there is a rank of 1, 2, 3 but 4, 5, and 6 are tied. The ranks would be 1, 2, 3, 4, 4, 4, 7,....

To add to the complexity, I need to rank at both tails of the range. That way if the tie is on the upper half of the ranks, it would assume the higher value.

For instance, if there is a rank of 100, 99, 98, but 97 and 96 are tied. The ranks would be 100, 99, 98, 97, 97, 95,.....

In order to do this the total number of ranks would need to be divided by 2 to find the midpoint. Any ties less than the midpoint would assume the lower rank. While any ties more than the midpoint would assume the higher rank.

The total number of ranks could be found in the number of files in the sort statement - In this case it would be

({1900..2016} + 1) / 2 = 59

Is this something that can be done in awk?

Corona688 · June 22, 2016, 11:28am

So you want it to trawl through all the ties and pick the "tie" which has the most repeated values in common?

ncwxpanther · June 22, 2016, 12:11pm

The output that I currently have is great, I just need to add a layer of complexity to it. In that I need all the ties to be ranked as a tie.

This snipit does just that

awk '$3 != prev { rank = NR }{ print $0, "   "rank; prev = $3 }'

However, what I also need the script to do is to rank both ends of the values.

For instance

VALUE  |    RANK
100           1
200           2
200           2
300           4
400           5
500           6
600           8
600           8
700           9
800          10

Notice how the ties in the top half (100-400) default to the lower rank, while the ties in the bottom half (500-800) default to the higher rank.