Finding the right file with multiple sort criteria

LMHmedchem · August 4, 2014, 1:34pm

Hello,

I have files in a directory with names like,

./f0/84.40_E1200_85.39_E1300_f0_r00_1300-ON-0.25_S7A_v4_47.19.1.out.txt
./f0/84.40_E1200_85.83_E1200_f0_r00_1200-ON-0.25_S7A_v4_47.19.1.out.txt
./f0/84.60_E1100_86.45_E1100_f0_r00_1100-ON-0.25_S7A_v4_47.19.1.out.txt
./f0/85.20_E1000_87.26_E1000_f0_r00_1000-ON-0.25_S7A_v4_47.19.1.out.txt
./f0/86.42_E900_88.14_E900_f0_r00_900-ON-0.25_S7A_v4_47.19.1.out.txt
./f0/88.88_E800_90.07_E800_f0_r00_800-ON-0.25_S7A_v4_47.19.1.out.txt

I need to find the smallest value for the first '_' delimited field (84.40 in this case). Where more than one file has this value, as above, I need the one with the smallest value of the integer in 1300-ON-0.25, 1200-ON-0.25, etc. For this example, I would want to find the following file and assign its name to a bash variable.

84.40_E1200_85.83_E1200_f0_r00_1200-ON-0.25_S7A_v4_47.19.1.out.txt

There will never be more than 2 files with the same value for the field I am looking at.

Another caveat is that there are times when I would be looking at the float in the third '_' delimited field instead of the first. If $STOP_ON == T, I would sort on the third number, and if $STOP_ON == V, I would sort on the first.

I guess what I would do here is something like,

if [ "$STOP_ON" == "T" ]; then
   # this removes the path from the front of the filename, sorts real in position 3
   FILES=$(ls "$CURRENT_DIR"/*out.txt | \
           awk -F/ '{print $NF}' | \
           sort -t_ -k3,3n | \
           head -n 2)
fi
if [ "$STOP_ON" == "V" ]; then
   # this removes the path from the front of the filename, sorts real in position 1
   FILES=$(ls "$CURRENT_DIR"/*out.txt | \
           awk -F/ '{print $NF}' | \
           sort -t_ -k1,1n | \
           head -n 2)
fi

This would strip off the path, sort on the float that I need, and then pick off the top two. I could then process the list to find the one with the lowest value for the second sorting criterion.

This seems rather involved, and there is another problem: there may not be two files with the same value for the first sort field, as there are in this example. That means I would first have to check the results of the above code for that as well.

It seems as if there should be a better way to do this. Please let me know if I have botched my explanation and you want me to try again.

LMHmedchem

clx · August 4, 2014, 2:21pm

If the positions and widths of the fields you want to sort on are fixed,

sort -t "_" -nk1.6,1.7 -nk1.9,1.10 -nk7.1,7.4 file

Even if they aren't, I think you can adapt the above command to match your criteria.
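A sketch of the same idea keyed on whole fields rather than character offsets, so the width of the integer does not matter. It assumes the path has already been stripped so that field 1 starts with the float; the sample names come from the first post, and `names` and `best` are just illustrative variable names.

```shell
# Sample basenames from the first post (path already stripped).
names='84.40_E1200_85.39_E1300_f0_r00_1300-ON-0.25_S7A_v4_47.19.1.out.txt
84.40_E1200_85.83_E1200_f0_r00_1200-ON-0.25_S7A_v4_47.19.1.out.txt
84.60_E1100_86.45_E1100_f0_r00_1100-ON-0.25_S7A_v4_47.19.1.out.txt
86.42_E900_88.14_E900_f0_r00_900-ON-0.25_S7A_v4_47.19.1.out.txt'

# Primary key: field 1 as a number. Tie-breaker: field 7, where -n reads
# only the leading integer (1300, 1200, 900) and ignores the "-ON-0.25" tail.
# LC_ALL=C makes sure "." is treated as the decimal point.
best=$(printf '%s\n' "$names" | LC_ALL=C sort -t_ -k1,1n -k7,7n | head -n 1)
echo "$best"
```

Because the tie-breaker is part of the sort itself, `head -n 1` gives the wanted file whether or not two files share the same first field.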

LMHmedchem · August 4, 2014, 3:25pm

The position and width for the floats are fixed, but the second sorting criterion will not have a fixed width.

One thing I don't get at the moment is that my script seems to think that there is only one element in FILES.

If I print,

echo ${#FILES[@]}

I get 1.

If I print,

echo ${FILES[0]}

I get,

84.40_E1200_85.39_E1300_f0_r00_1300-ON-0.25_S7A_v4_47.19.1.out.txt 84.40_E1200_85.83_E1200_f0_r00_1200-ON-0.25_S7A_v4_47.19.1.out.txt

I thought that when you did something like FILES=$(ls ...) you would end up with an array if there was more than one file returned by ls. For some reason, it is treating the list of files as a single string in one element. I tried adding {OFS=" "} to the awk call, but that doesn't do anything.

For now, I can split up the single string in ${FILES[0]} on space, but it seems like I shouldn't have to do that.

LMHmedchem

RudiC · August 4, 2014, 4:34pm

Assigning an array (in bash) requires enclosing the command substitution in ( ... ). See the difference:

FILES=($(ls))
echo ${#FILES[@]}
33
echo ${!FILES[@]}
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

as opposed to

FILES=$(ls)
echo ${#FILES[@]}
1
echo ${!FILES[@]}
0
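If the goal is one array element per file name, a glob or `mapfile` (bash 4 or later) may be a safer route than `($(ls))`, since neither splits names on spaces. A minimal sketch; the scratch directory and the two file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory with two sample files, one with a space in its name.
dir=$(mktemp -d)
touch "$dir/a_1.out.txt" "$dir/b 2.out.txt"

# A glob expands to exactly one array element per matching file.
FILES=( "$dir"/*out.txt )
echo "${#FILES[@]}"

# mapfile -t reads one line per element; handy after a pipeline such as sort.
# "${FILES[@]##*/}" strips the directory part from every element.
mapfile -t SORTED < <(printf '%s\n' "${FILES[@]##*/}" | sort)
echo "${SORTED[0]}"

rm -r "$dir"
```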

LMHmedchem · August 7, 2014, 12:25am

I think I have this sorted out, thanks again for the assistance.

LMHmedchem

RudiC · August 7, 2014, 5:05am

To make your sorting on field 1 or 3 a bit easier, you might want to consider

[ "$STOP_ON" == "T" ]; IX=$((3-$?*2)); sort -nt_ -k$IX,$IX file
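Putting the pieces of this thread together, here is a sketch that picks the key field from STOP_ON, sorts numerically on it with field 7 as the tie-breaker, and stores the winning name in a variable. The function name `pick_best`, the variables `BEST_V` and `BEST_T`, and the scratch directory are hypothetical; the mapping of T to field 3 and V to field 1 is taken from the first post.

```shell
# pick_best DIR STOP_ON: print the basename with the smallest key float,
# ties broken by the leading integer of field 7.
pick_best() {
    case "$2" in
        T) key=3 ;;   # sort on the third '_' delimited field
        V) key=1 ;;   # sort on the first '_' delimited field
        *) echo "STOP_ON must be T or V" >&2; return 1 ;;
    esac
    for f in "$1"/*out.txt; do
        [ -e "$f" ] && printf '%s\n' "${f##*/}"   # strip the directory part
    done | LC_ALL=C sort -t_ -k"$key,$key"n -k7,7n | head -n 1
}

# Usage with three of the sample names, created as empty files.
dir=$(mktemp -d)
touch "$dir/84.40_E1200_85.39_E1300_f0_r00_1300-ON-0.25_S7A_v4_47.19.1.out.txt" \
      "$dir/84.40_E1200_85.83_E1200_f0_r00_1200-ON-0.25_S7A_v4_47.19.1.out.txt" \
      "$dir/84.60_E1100_86.45_E1100_f0_r00_1100-ON-0.25_S7A_v4_47.19.1.out.txt"
BEST_V=$(pick_best "$dir" V)
BEST_T=$(pick_best "$dir" T)
echo "$BEST_V"
echo "$BEST_T"
rm -r "$dir"
```

Looping over the glob instead of parsing `ls` output keeps the path stripping independent of how deep the directory is.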