A simpler way to do this (save a list of files based on part of their name)

LMHmedchem · August 1, 2013, 1:00pm

Hello,

I have a script that checks every file with a specific extension in a specific directory. The file names contain some numerical output and I am recording the file names with the best n outcomes.

The script finds all files in the directory with the extension .out.txt and uses awk to parse the filename on underscore. In this case, I am reading the first field and looking for the smallest three values across the set of files. In other cases, I may be reading the third field. I understand that in this simple case, all I would have to do is take the first three files, but there will be other cases where that would not work.

This is the script at this point and there is sample input in the attached zip. The input file names look like,

48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
49.15_E2700_51.98_E1200_32_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
49.16_E1600_52.54_E1600_44_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
50.36_E3400_55.09_E3000_35_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
50.62_E1700_51.92_E300_8_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt

#!/bin/bash
# loop through all files and save the top 3 filenames

   # initialize
   FILENAME=""
   CURRENT_MAE_VALUE=0
   # these are initialized to an arbitrarily large value
   EV_MAE_0=1000.0
   EV_MAE_1=1000.0
   EV_MAE_2=1000.0

   EV_FILES=(NULL0 NULL1 NULL2)

   # set fold value
   FOLD=f0

   # get directory list
   FILES='./'$FOLD'/'*'out.txt'

   for INFILE in $FILES
   do

   #  remove directory from path
      FILENAME=`echo $INFILE | awk 'BEGIN {FS="/"} {print $3}'`
   #  find ev mae value
      CURRENT_MAE_VALUE=`echo $FILENAME | awk 'BEGIN {FS="_"} {print $1}'`

   # save the names of the top 3 EV files and EV values
      if (( $(bc <<< "$CURRENT_MAE_VALUE < $EV_MAE_0") == 1 ))
      then
         #bump down current list items
         EV_FILES[2]=${EV_FILES[1]}; EV_MAE_2=$EV_MAE_1
         EV_FILES[1]=${EV_FILES[0]}; EV_MAE_1=$EV_MAE_0
         EV_FILES[0]=$FILENAME
         # assign EV_MAE_VALUE to top value
         EV_MAE_0=$CURRENT_MAE_VALUE

      elif (( $(bc <<< "$CURRENT_MAE_VALUE < $EV_MAE_1") == 1 ))
      then
         #bump down current list items
         EV_FILES[2]=${EV_FILES[1]}; EV_MAE_2=$EV_MAE_1
         EV_FILES[1]=$FILENAME
         # assign EV_MAE_VALUE to second value
         EV_MAE_1=$CURRENT_MAE_VALUE

      elif (( $(bc <<< "$CURRENT_MAE_VALUE < $EV_MAE_2") == 1 ))
      then
         #bump down current list items
         EV_FILES[2]=$FILENAME
         # assign EV_MAE_VALUE to third value
         EV_MAE_2=$CURRENT_MAE_VALUE

      fi

   done

# print results
   echo "1st EV file"
   echo ${EV_FILES[0]}
   echo "EV MAE 0"  $EV_MAE_0
   echo""
   echo "2nd EV file"
   echo ${EV_FILES[1]}
   echo "EV MAE 1"  $EV_MAE_1
   echo""
   echo "3rd EV file"
   echo ${EV_FILES[2]}
   echo "EV MAE 2"  $EV_MAE_2
   echo""

My main question is about how to keep a running record of the file names of the best three values as I loop through the file names. This script does it by brute force and works alright, but I may need to save the top 20 or 50, and I don't look forward to coding that up with the method I used above.

Any suggestions?

LMHmedchem

blackrageous · August 1, 2013, 1:50pm

Seems like an egrep would work, where the output of your grep would include the filename and the particular field you wanted, if the value you're interested in is actually in the file. Then you would sort by numeric value on that particular field, then use head or tail depending upon your sort and boom...done. I am not clear on whether you're using the filenames to extract the values yet, but in any case it will be similar. I will look at your data and script and post an example shortly. Someone will probably post a solution if I don't in a short time.

Based on filename approach...
Something like this

ls *.out.txt | sort -k1,1 -t\_ -n -r | tail -3
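
A minimal sketch of the same idea, run against the five sample names from the first post. It feeds the names in with printf instead of ls so it can be tried without the files, sorts ascending, and takes head rather than tail, which selects the same three names in smallest-first order. The variable names are made up for the illustration, and LC_ALL=C is an added assumption so the decimal point is read the same way everywhere.

```shell
#!/bin/bash
# Sample file names from the first post (the files do not need to exist).
names='50.62_E1700_51.92_E300_8_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
50.36_E3400_55.09_E3000_35_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
49.16_E1600_52.54_E1600_44_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
49.15_E2700_51.98_E1200_32_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt'

# Numeric sort on the first underscore-delimited field, keep the 3 smallest.
top3=$(printf '%s\n' "$names" | LC_ALL=C sort -t_ -k1,1n | head -n 3)
printf '%s\n' "$top3"
```

Changing -k1,1n to -k3,3n would rank the same names on the third field instead.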

LMHmedchem · August 1, 2013, 3:39pm

If I was doing this in cpp, I would definitely use some kind of sort, but I'm not at all familiar with how to do this in a shell. The key value is in the file, but not somewhere where it can be easily found (not in the same place in every file). I have already processed these files and added the value I am interested in to the file name so it will be easier to access. It's easy enough to grab the value out of the filename, but I don't know if that's compatible with your solution.

LMHmedchem

Don_Cragun · August 1, 2013, 3:43pm

Based on the data in your zip file and your current bash script, here is another bash script that seems to do what you want, but instead of hard coding the directory, field number, and number of files to be listed, it takes them as parameters:

#!/bin/bash
IAm=${0##*/}
Usage="Usage: $IAm directory field_number count"
if [ $# -ne 3 ] || ! cd "$1" > /dev/null || [ "$2" != "${2%*[^0-9]*}" ] ||
        [ "$3" != "${3%*[^0-9]*}" ]
then    echo "$Usage"
        exit 1
fi
ls *.out.txt | sort -t_ -k$2,$2n | awk -F_ -v f=$2 -v c=$3 '
NR > c {exit}
{       if(NR == 1) s = "st"
        else if(NR == 2) s = "nd"
        else if(NR == 3) s = "rd"
        else s = "th"
        printf("%d%s EV file\n%s\nEV MAE %d %s\n\n", NR, s, $0, NR - 1, $f)
}'

This script was tested using both bash and ksh, but should work with any POSIX conforming shell.

If you save this in a file named test2_copy.sh , make it executable with:

chmod +x test2_copy.sh

and execute it with:

./test2_copy.sh f0 1 3

you get the same output as you get if you run ./test_copy.sh :

1st EV file
48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 0 48.93

2nd EV file
49.15_E2700_51.98_E1200_32_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 1 49.15

3rd EV file
49.16_E1600_52.54_E1600_44_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 2 49.16

but you can also run it with:

./test2_copy.sh f0 3 5

to produce:

1st EV file
50.62_E1700_51.92_E300_8_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 0 51.92

2nd EV file
49.15_E2700_51.98_E1200_32_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 1 51.98

3rd EV file
49.16_E1600_52.54_E1600_44_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 2 52.54

4th EV file
50.36_E3400_55.09_E3000_35_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 3 55.09

5th EV file
48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 4 55.94

which gives you data sorted on the 3rd underscore delimited field and limited to the 1st 5 matching files. The color was added only to highlight the sort field; the actual output will not have red text. So, you could sort on the 5th field with:

./test2_copy.sh f0 5 5

to get:

1st EV file
50.62_E1700_51.92_E300_8_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 0 8

2nd EV file
49.15_E2700_51.98_E1200_32_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 1 32

3rd EV file
48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 2 34

4th EV file
50.36_E3400_55.09_E3000_35_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 3 35

5th EV file
49.16_E1600_52.54_E1600_44_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
EV MAE 4 44

Note, however, that it is doing a numeric sort, so the results are unspecified if you select a field that isn't entirely a number.

LMHmedchem · August 1, 2013, 3:49pm

Thanks, I will go over this and see if I can get it working. At the end of the day, I will be doing a cp of each file in the list to another directory. One of the problems I have is that I will probably want the top 20 out of 50 or so (not the top 3), so you can see why my method wasn't going to be practical.

It's not entirely clear to me what arguments 2 and 3 are. Is argument 3 the number of files being processed and argument 2 the field being sorted on?

LMHmedchem

Don_Cragun · August 1, 2013, 6:34pm

I'm sorry for not explaining it better. I thought the usage message comment was sufficient documentation along with the examples I gave. The arguments are:

1. A pathname of the directory containing the files to be processed.
2. The field to be used as your sort key.
3. The maximum number of files you want to list.
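
The usage check in the script rejects a field number or count that contains anything other than digits. A minimal sketch of that test in isolation follows. The helper name is_all_digits is hypothetical, and the sketch spells the bracket negation as [!0-9], which is the form POSIX specifies; the empty-string check is an addition that the script above does not make.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only when $1 is non-empty and all digits.
# ${1%*[!0-9]*} strips the shortest suffix that contains a non-digit,
# so the value is unchanged only when no such character is present.
is_all_digits() {
    [ -n "$1" ] && [ "$1" = "${1%*[!0-9]*}" ]
}

for arg in 3 20 2x EV
do
    if is_all_digits "$arg"
    then    echo "$arg: ok"
    else    echo "$arg: not a number"
    fi
done
```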

Your file names are of the form:

field1_field2_field3_field4_field5_field6_field7_field8_field9_field10_field11_field12_field13_field14

where field14 always ends with the string .out.txt . I showed you examples using fields 1, 3, and 5 as the sort key since they were the only numeric fields in the names of the files you used in your example that had values that were not a constant. The 12th field was numeric but all filenames had 4 in field12 so sorting on it didn't seem useful.

The count (3rd operand) in my examples was 3 and 5 since you used 3 in your example and you only had 5 files in your example. You can put any number you want there to specify the number of files you want listed. It is happy with 1; it is happy with 32000. Pick the number you want.
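
To see which number belongs to which field, a small sketch like the following prints every underscore-delimited field of one sample name next to its index. The variable names are only for illustration.

```shell
#!/bin/bash
# One of the sample file names from the first post.
name='48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt'

# Split on underscore and print "index value" for each field.
fields=$(printf '%s\n' "$name" |
    awk -F_ '{ for (i = 1; i <= NF; i++) printf("%d %s\n", i, $i) }')
printf '%s\n' "$fields"
```

The index printed in front of a value is the number to pass as the sort field.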

LMHmedchem · August 1, 2013, 7:24pm

After spending some more time looking through this, you did explain it quite well. I just didn't read it as well as you explained it.

I am in the process of trying to copy the files that are found by this to a different location and not having much success. Probably the best solution would be to dump the sorted list into a bash array. Then I can do all the rest I need to do.

This is my attempt to do this (I didn't include the parsing and exception code here but will post the entire working script, once it is...)

eval array=( $(df -h | ls *.out.txt | sort -t_ -k$2,$2n | awk -F_ -v f=$2 -v c=$3 'NR > c {exit} {printf("%s", $0)}') )

This is mildly successful in that it does capture the file names in an array, but all of them are in the first array element. I suppose I could parse array[0] on out.txt, or something kludgey like that, but I am guessing there is a better way.

I know there are some ways to copy in awk, and also with system, but I need to extract some additional information from the filename to locate an additional file, and the only way I know how to do that is in bash.

LMHmedchem

Don_Cragun · August 1, 2013, 7:52pm

If you just drop the eval you'll probably get the array you want, but creating an array sounds like an unnecessary complication for what you seem to want to do. If you give us a concrete example showing us exactly what you want to do, we can probably show you an easier way to do it just using a while read loop or a straight command substitution in a cp command line.
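
As a rough illustration of the "straight command substitution in a cp command line" option, here is a self-contained sketch. Everything in it is an assumption made for the demonstration: it builds a throwaway directory with mktemp, uses three made-up file names, and copies the two with the smallest first field. It relies on the names containing no whitespace or wildcard characters, since the command substitution is left unquoted on purpose.

```shell
#!/bin/bash
# Build a scratch area with three empty, made-up .out.txt files.
work=$(mktemp -d)
mkdir "$work/f0" "$work/dest"
for n in 50.62_a 48.93_b 49.15_c
do
    : > "$work/f0/${n}_x.out.txt"
done

cd "$work/f0" || exit 1
# The sorted, trimmed list is substituted directly into the cp command line.
cp -p $(ls *.out.txt | LC_ALL=C sort -t_ -k1,1n | head -n 2) ../dest/
copied=$(ls ../dest)
printf '%s\n' "$copied"

# Clean up the scratch area.
cd "$OLDPWD" && rm -rf "$work"
```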

As a general rule, ALWAYS determine what you want to do 1st and then figure out how to do it. If you start with the assumption that you need to use an array before you decide what you want to do, you'll frequently miss simpler or more efficient ways to get things done.

LMHmedchem · August 1, 2013, 9:17pm

This is the full script I have that does what I want.

#!/bin/bash


# loop through all files and copy the top 5 EV and CV to continue w/ random weight files

FOLDS=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)

for FOLD in "${FOLDS[@]}"
do

   # get directory list
   FILES='./'$FOLD'/'*'out.txt'

   # reinitialize
   FILENAME=""
   EV_MAE_VALUE=0
   CV_MAE_VALUE=0

   EV_MAE_0=1000.0
   EV_MAE_1=1000.0
   EV_MAE_2=1000.0
   EV_MAE_3=1000.0
   EV_MAE_4=1000.0

   EV_FILES=(NULL0 NULL1 NULL2 NULL3 NULL4)

   CV_MAE_0=1000.0
   CV_MAE_1=1000.0
   CV_MAE_2=1000.0
   CV_MAE_3=1000.0
   CV_MAE_4=1000.0

   CV_FILES=(NULL0 NULL1 NULL2 NULL3 NULL4)

   for INFILE in $FILES
   do

   #  remove directory from path
      FILENAME=`echo $INFILE | awk 'BEGIN {FS="/"} {print $3}'`
   #  find ev mae value
      EV_MAE_VALUE=`echo $FILENAME | awk 'BEGIN {FS="_"} {print $1}'`
   #  find cv mae value
      CV_MAE_VALUE=`echo $FILENAME | awk 'BEGIN {FS="_"} {print $3}'`

   # save the names of the top 5 EV files
      if (( $(bc <<< "$EV_MAE_VALUE < $EV_MAE_0") == 1 ))
      then
         #bump down current list items
         EV_FILES[4]=${EV_FILES[3]}; EV_MAE_4=$EV_MAE_3
         EV_FILES[3]=${EV_FILES[2]}; EV_MAE_3=$EV_MAE_2
         EV_FILES[2]=${EV_FILES[1]}; EV_MAE_2=$EV_MAE_1
         EV_FILES[1]=${EV_FILES[0]}; EV_MAE_1=$EV_MAE_0
         EV_FILES[0]=$FILENAME
         # assign EV_MAE_VALUE to top value
         EV_MAE_0=$EV_MAE_VALUE

      elif (( $(bc <<< "$EV_MAE_VALUE < $EV_MAE_1") == 1 ))
      then
         EV_FILES[4]=${EV_FILES[3]}; EV_MAE_4=$EV_MAE_3
         EV_FILES[3]=${EV_FILES[2]}; EV_MAE_3=$EV_MAE_2
         EV_FILES[2]=${EV_FILES[1]}; EV_MAE_2=$EV_MAE_1
         EV_FILES[1]=$FILENAME
         EV_MAE_1=$EV_MAE_VALUE

      elif (( $(bc <<< "$EV_MAE_VALUE < $EV_MAE_2") == 1 ))
      then
         EV_FILES[4]=${EV_FILES[3]}; EV_MAE_4=$EV_MAE_3
         EV_FILES[3]=${EV_FILES[2]}; EV_MAE_3=$EV_MAE_2
         EV_FILES[2]=$FILENAME
         EV_MAE_2=$EV_MAE_VALUE

      elif (( $(bc <<< "$EV_MAE_VALUE < $EV_MAE_3") == 1 ))
      then
         EV_FILES[4]=${EV_FILES[3]}; EV_MAE_4=$EV_MAE_3
         EV_FILES[3]=$FILENAME
         EV_MAE_3=$EV_MAE_VALUE

      elif (( $(bc <<< "$EV_MAE_VALUE < $EV_MAE_4") == 1 ))
      then
         EV_FILES[4]=$FILENAME
         EV_MAE_4=$EV_MAE_VALUE

      fi

   # save the names of the top 5 CV files
      if (( $(bc <<< "$CV_MAE_VALUE < $CV_MAE_0") == 1 ))
      then
         #bump down current list items
         CV_FILES[4]=${CV_FILES[3]}; CV_MAE_4=$CV_MAE_3
         CV_FILES[3]=${CV_FILES[2]}; CV_MAE_3=$CV_MAE_2
         CV_FILES[2]=${CV_FILES[1]}; CV_MAE_2=$CV_MAE_1
         CV_FILES[1]=${CV_FILES[0]}; CV_MAE_1=$CV_MAE_0
         CV_FILES[0]=$FILENAME
         # assign CV_MAE_VALUE to top value
         CV_MAE_0=$CV_MAE_VALUE

      elif (( $(bc <<< "$CV_MAE_VALUE < $CV_MAE_1") == 1 ))
      then
         CV_FILES[4]=${CV_FILES[3]}; CV_MAE_4=$CV_MAE_3
         CV_FILES[3]=${CV_FILES[2]}; CV_MAE_3=$CV_MAE_2
         CV_FILES[2]=${CV_FILES[1]}; CV_MAE_2=$CV_MAE_1
         CV_FILES[1]=$FILENAME
         CV_MAE_1=$CV_MAE_VALUE

      elif (( $(bc <<< "$CV_MAE_VALUE < $CV_MAE_2") == 1 ))
      then
         CV_FILES[4]=${CV_FILES[3]}; CV_MAE_4=$CV_MAE_3
         CV_FILES[3]=${CV_FILES[2]}; CV_MAE_3=$CV_MAE_2
         CV_FILES[2]=$FILENAME
         CV_MAE_2=$CV_MAE_VALUE

      elif (( $(bc <<< "$CV_MAE_VALUE < $CV_MAE_3") == 1 ))
      then
         CV_FILES[4]=${CV_FILES[3]}; CV_MAE_4=$CV_MAE_3
         CV_FILES[3]=$FILENAME
         CV_MAE_3=$CV_MAE_VALUE

      elif (( $(bc <<< "$CV_MAE_VALUE < $CV_MAE_4") == 1 ))
      then
         CV_FILES[4]=$FILENAME
         CV_MAE_4=$CV_MAE_VALUE

      fi

   done

   # copy list of filenames and corresponding ini weight sets to continue
   RAND_SET=""
   for I in "${EV_FILES[@]}"
   do
      # copy file to continue
      cp -p './'$FOLD'/'$I './'$FOLD'/'$FOLD'_continue/EV/'$I
      #  find random ini set number
      RAND_SET=`echo $I | awk 'BEGIN {FS="_"} {print $5}'`
      # copy random ini weight file to continue
      cp -p './rnd_ini/'$FOLD'/ri_'$RAND_SET'_'*'.wts'  './'$FOLD'/'$FOLD'_continue/EV/'
   done

   for I in "${CV_FILES[@]}"
   do
      # copy file to continue
      cp -p './'$FOLD'/'$I './'$FOLD'/'$FOLD'_continue/CV/'$I
      #  find random ini set number
      RAND_SET=`echo $I | awk 'BEGIN {FS="_"} {print $5}'`
      # copy random ini weight file to continue
      cp -p './rnd_ini/'$FOLD'/ri_'$RAND_SET'_'*'.wts'  './'$FOLD'/'$FOLD'_continue/CV/'
   done

   #move fold output files to stats folder
   mv './'$FOLD'/'*'out.txt' './'$FOLD'/'$FOLD'_stats/'

done

I thought it was overly long to post since my question was about the first part. This does the job that the first one I posted did, except that it does it twice. It loops through a set of sub folders f0-f9, finds the files with the top 5 EV MAE values (5 smallest field 1) and copies those files and a corresponding set of files to f*/f*_continue/EV/. Then it does the same thing for the top CV MAE values (5 smallest field 3) and copies to f*/f*_continue/CV/.

I have attached a new test dir with the script and supporting files. I have edited the script so that it is only working with f0, f1, f2 to help simplify things. The script will find the top EV MAE values by reading the first field, and then find the .wts file that goes with it. Both will be copied to the continue folder.

For example, the top EV MAE file for f0 is,
53.96_E3000_50.19_E2200_35_ri_OA_f0_S1C_v17_52.26.1_4_ON_0.25lr.out.txt
so this will be copied to ./f0/f0_continue/EV/

The .wts file associated with this is 35 (field 5), so the script will also copy,
./rnd_ini/f0/ri_35*.wts
to ./f0/f0_continue/EV/

The top CV MAE file for f0 is,
54.90_E3000_48.65_E4300_23_ri_OA_f0_S1C_v17_52.26.1_4_ON_0.25lr.out.txt
so this will be copied to ./f0/f0_continue/CV/

The .wts file associated with this is 23 (field 5), so the script will also copy,
./rnd_ini/f0/ri_23*.wts
to ./f0/f0_continue/CV/

After a f* directory is processed, the script moves all the .out.txt files to ./f*/f*_stats/.

Since the version you posted accepts arguments, there would be no need to do both CV and EV in the same run. The script could be called twice with the proper arguments for each. It is also very nice that your method allows you to pick any number of files to collect. It will probably end up being about 20, but I'm not sure yet.

Thanks for all the help.

LMHmedchem

LMHmedchem · August 5, 2013, 7:49pm

This is why I often don't post what I want to do in its entirety. It seems that most of the time when I post a long script, no one seems to want to wade into it (which is quite understandable). I guess I still need to work on making posts that are long enough to convey what I am asking and get a workable solution, but short enough that they will actually be read.

I modified the code that you posted and have something that gives me what I need. This is the modified code,

#!/bin/bash

# argument $1 is the field to sort on based on file names as below

# 53.96_E3000_50.19_E2200_35_ri_OA_f0_S1C_v17_52.26.1_4_ON_0.25lr.out.txt
#     1     2     3     4  5

# argument $2 is the file count, meaning the number of files to find and copy
# argument $3 is the set type EV/CV

# for the top 10 EV outcomes call as ./01_copy_top_outcomes.sh 1 10 EV
# for the top 10 CV outcomes call as ./01_copy_top_outcomes.sh 3 10 CV

USAGE="./script_name  sort_field   file_count   set_type"

# field to sort on
KEY_FIELD=$1
# number of files to find
FILE_COUNT=$2
# processing set type EV/CV
SET_TYPE=$3

# make sure there are 3 arguments and the first two are numbers
if [ $# -ne 3 ] || [ "$1" != "${1%*[^0-9]*}" ] || [ "$2" != "${2%*[^0-9]*}" ]
then    echo "$USAGE"
        exit 1
fi

# loop on all folds
FOLDS=(f0 f1 f2 f3 f4 f5 f6 f7 f8 f9)

for FOLD in ${FOLDS[@]}
do

#  check if the directory exists, this should never throw.
   if [[ ! -d "$FOLD" ]]
   then  echo 'directory' $FOLD'/ does not exist, exit script'
         exit 1
   fi

   # change directory to current fold
   cd $FOLD
   echo "processing" $FOLD

   #re-initialize
   OUTPUT=""
   FILE_TEMP=""
   FILE_NAME=""
   RAND_SET=""

   # sort the list of filenames and output the top number "n" as specified in argument $2
   FILE_LIST=( $(df -h | ls *.out.txt | sort -t_ -k$KEY_FIELD,$KEY_FIELD'n' | awk -F_ -v f=$KEY_FIELD -v c=$FILE_COUNT 'NR > c {exit} {printf("%s", $0)}') )

   # loop up to file count to parse output and copy files that were found by sort
   for (( LOOP_CT=1; LOOP_CT<=$FILE_COUNT; LOOP_CT++ ))
   do

      # parse output string on .out.txt to locate individual files
      FILE_TEMP=`echo $FILE_LIST | awk -v N=$LOOP_CT 'BEGIN {FS=".out.txt"} {print $N}'`
      # restore file extension
      FILE_NAME=$FILE_TEMP'.out.txt'

      echo $FILE_NAME

      # copy file and corresponding ini weight set to continue
      # copy file to continue
      cp -p './'$FILE_NAME './'$FOLD'_continue/'$SET_TYPE'/'$FILE_NAME

      #  find random ini set number
      RAND_SET=`echo $FILE_NAME | awk 'BEGIN {FS="_"} {print $5}'`
      # copy random ini weight file to continue
      cp -p '../rnd_ini/'$FOLD'/ri_'$RAND_SET'_'*'.wts'  './'$FOLD'_continue/'$SET_TYPE'/'

   done

   # return to start directory
   cd ../

done

The entire list of files that is found ends up in the variable FILE_LIST, so that gets parsed into individual file names. Those files are copied to the proper location and an associated file is also located and copied. This loops through all sub folders f0-f9, so that is no longer an argument.

This seems reasonable and works, but I don't know awk well enough to see if there are any hidden problems. There is probably an easier way to copy the files I need, but I don't know how to copy in awk, so I needed to get the file names in bash variables that I know how to manipulate to some extent.

Do you see anything dreadfully wrong here? This does give me the ability to specify the sort field and the number of files I want, which is a big improvement over what I first posted.

LMHmedchem

Don_Cragun · August 5, 2013, 11:03pm

The goal of The UNIX and Linux Forums is to help you learn how to do "stuff" on your own; not to write programs for you. I gave you a sample script to get you started, and from your message #9 in this thread it sounded like you were well on your way to getting a working solution. (And posting a 384Kb zipped archive that expands to over 1Mb without a clear indication of the desired outcome of processing it takes more space and time than most volunteers are willing to donate.)

From what you have shown here in message #10, you are learning quickly. I will make a few more comments that may help you speed this up a little bit: First, in the pipeline:

df -h | ls *.out.txt | sort -t_ -k$KEY_FIELD,$KEY_FIELD'n' | awk -F_ -v f=$KEY_FIELD -v c=$FILE_COUNT 'NR > c {exit} {printf("%s", $0)}'

what would happen if you remove the leading df -h | (shown in red in the original post)? The ls utility doesn't read from standard input, so the df command in this pipeline should make no difference to its output. (It will just make the pipeline run slower.)

Second, you seem to go to a lot of effort to store the output of this pipeline in an array and then spend a lot of time trying to extract individual file names from the array. It looks like the array will only have one element because the printf in your awk command doesn't put a space between the names of the files it prints. If you would change the printf statement from:

printf("%s", $0)

to:

printf(" %s", $0)

you could reference filenames in the array more simply by using ${FILE_LIST[0]} through ${FILE_LIST[$((FILE_COUNT-1))]} .
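
The effect of that one added space can be seen in a small sketch. The three short names and the array names joined and split are invented for the demonstration, and the sketch assumes the default IFS and names without whitespace or wildcard characters.

```shell
#!/bin/bash
# Three made-up names standing in for the sorted ls output.
names='a_1.out.txt
b_2.out.txt
c_3.out.txt'

# No separator: awk emits one long word, so the array gets one element.
joined=( $(printf '%s\n' "$names" | awk '{printf("%s", $0)}') )

# Leading space: each name is its own word, so each gets its own element.
split=( $(printf '%s\n' "$names" | awk '{printf(" %s", $0)}') )

echo "joined has ${#joined[@]} element(s), split has ${#split[@]}"
echo "second name: ${split[1]}"
```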

But why have an array at all? Why not just process the files one at a time as they come out of awk? As an example, what would happen if you replaced:

   # sort the list of filenames and output the top number "n" as specified in argument $3
   FILE_LIST=( $(df -h | ls *.out.txt | sort -t_ -k$KEY_FIELD,$KEY_FIELD'n' | awk -F_ -v f=$KEY_FIELD -v c=$FILE_COUNT 'NR > c {exit} {printf("%s", $0)}') )

   # loop up to file count to parse output and copy files that were found by sort
   for (( LOOP_CT=1; LOOP_CT<=$FILE_COUNT; LOOP_CT++ ))
   do

      # parse output string on .out.txt to locate individual files
      FILE_TEMP=`echo $FILE_LIST | awk -v N=$LOOP_CT 'BEGIN {FS=".out.txt"} {print $N}'`
      # restore file extension
      FILE_NAME=$FILE_TEMP'.out.txt'

      echo $FILE_NAME

      # copy file and corresponding ini weight set to continue
      # copy file to continue
      cp -p './'$FILE_NAME './'$FOLD'_continue/'$SET_TYPE'/'$FILE_NAME

      #  find random ini set number
      RAND_SET=`echo $FILE_NAME | awk 'BEGIN {FS="_"} {print $5}'`
      # copy random ini weight file to continue
      cp -p '../rnd_ini/'$FOLD'/ri_'$RAND_SET'_'*'.wts'  './'$FOLD'_continue/'$SET_TYPE'/'

   done

with the much simpler:

      ls *.out.txt | sort -t_ -k$KEY_FIELD,${KEY_FIELD}n |
      awk -F_ -v c="$FILE_COUNT" '
        NR > c {exit}
        {print $0, $5}' |
      while read FILE_NAME RAND_SET
      do
        # copy files that were found by sort
        echo "file_name: $FILE_NAME rand_set: $RAND_SET"

        # copy file and corresponding ini weight set to continue
        # copy file to continue
        cp -p './'$FILE_NAME './'$FOLD'_continue/'$SET_TYPE'/'$FILE_NAME

        # copy random ini weight file to continue
        cp -p '../rnd_ini/'$FOLD'/ri_'$RAND_SET'_'*'.wts'  './'$FOLD'_continue/'$SET_TYPE'/'
      done

Note that there is no array here, there is only one invocation of awk (instead of n+1 invocations to process n files), and RAND_SET is pulled from the file name at a time when we already have the fields in the file name split out (so we only have to split the name once). You can also get rid of some unneeded temporary variables since OUTPUT was not (and still is not) referenced after being set, and FILE_TEMP is no longer used.

LMHmedchem · August 11, 2013, 2:24pm

Sorry for the delay, it has been an unexpectedly busy end of the week.

I completely understand and agree with this. I always try to start with a post that contains at least some kind of a working script. This is to do as much as I can on my own and let the other users here know that I am working to solve the problem, not expecting others to do it for me. I also think that compared to a text explanation, programming code is easier to read in terms of understanding what a person is after. I read long prose explanations of code algorithms when I am having trouble falling asleep at night. I still haven't consistently found the sweet spot when it comes to exactly how much to post. It appears that my first attempt was too short to explain all I was trying to do, and my second was way too long to bother wading into.

It turns out that I didn't end up using an array. Everything that came out of the code you posted ended up in a single long string variable ($FILE_LIST). I just looped on the number of files I was expecting to find and parsed the long string to pull out the file name for each iteration of the loop.

The simple explanation for this is that I don't know awk very well at all. I can more or less use it to parse things, but only in the simplest implementations. I did spend some time trying to take the output from awk and work with it, but I think I was using a redirect instead of a pipe.

So as I read this now, ls is passing all .out.txt to sort, sort is sorting on the key field and passes the sorted list to awk. Awk processes each item in the list and outputs the specified fields. Then it looks like the output of awk is passed to read, which dumps the output into the vars $FILE_NAME and $RAND_SET. Once you are there, the rest is straightforward. I am not familiar with read, so that is something new to me. I would not have known how to get awk to output two variables and get them into something that I could use with cp. I am also not quite clear about how awk knows when it has read in enough lines to get to $FILE_COUNT. Is "NR" an implicit running counter of some kind so that when NR > $FILE_COUNT awk quits (I see that you have passed $FILE_COUNT to awk as c)?

From time to time I think I am getting better at this, then I try to do something new and find out how much I still don't know. I do really appreciate the help that is available here.

LMHmedchem

Don_Cragun · August 11, 2013, 5:43pm

When you assign a value to a shell variable using the syntax:

var=( list_of_values )

and your shell is a recent bash or ksh , you are defining var to be an array. So, the way you initialized FILE_LIST, it was an array containing only one element.
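
A short sketch of that behavior in bash, using invented names: referring to an array by its bare name yields element 0, which is why a one-element array can look like an ordinary string variable.

```shell
#!/bin/bash
# Parentheses make this an array with three elements.
list=( one two three )
echo "elements: ${#list[@]}"

# A bare $list is the same as ${list[0]} in bash.
echo "bare name: $list"

# One long word inside the parentheses still creates an array,
# just one with a single element.
single=( onetwothree )
echo "elements: ${#single[@]}, bare name: $single"
```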

Yes, you correctly interpreted what ls , sort , awk , and read are doing.

I suggest that you look at the read(1) man page. The read utility built into your shell will probably have additional options, but the POSIX description in the link above is all you need to understand what is going on in the simple script suggestions I provided. You might also want to read the awk(1) man page. Here is what each part of this awk command does:

      awk -F_ -v c="$FILE_COUNT" '
        NR > c {exit}
        {print $0, $5}'

The -F_ sets the input field separator to the underscore character, -v c="$FILE_COUNT" sets the awk variable c to the expansion of the shell variable FILE_COUNT, NR > c {exit} exits awk if the current number of input records read from all input files is greater than the awk variable c, and {print $0, $5} prints the entire current input record followed by a space followed by the 5th field from the current input record followed by a newline character. And then the:

while read FILE_NAME RAND_SET
do
 ... ... ...
done

reads one line at a time in a loop until end-of-file is detected on the input pipe, setting the shell variables FILE_NAME and RAND_SET to the two values written by awk on each line. And, as you said, the loop processes each line of output from awk to copy the appropriate files into their desired places.
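
The awk and read steps can be tried together without touching any files. The following is a minimal sketch under a few assumptions: the names are supplied by printf instead of ls, count is a made-up stand-in for FILE_COUNT, the sort key is field 3, and read is given -r (which the loop above does not use) so backslashes would be left alone.

```shell
#!/bin/bash
# The five sample file names from the first post.
names='48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
49.15_E2700_51.98_E1200_32_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
49.16_E1600_52.54_E1600_44_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
50.36_E3400_55.09_E3000_35_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt
50.62_E1700_51.92_E300_8_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt'

count=2   # hypothetical stand-in for FILE_COUNT

# awk stops after "count" records and prints "name field5" for each one;
# read then splits each line at the space into the two shell variables.
result=$(printf '%s\n' "$names" | LC_ALL=C sort -t_ -k3,3n |
    awk -F_ -v c="$count" 'NR > c {exit} {print $0, $5}' |
    while read -r file_name rand_set
    do
        echo "rand_set=$rand_set file=$file_name"
    done)
printf '%s\n' "$result"
```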

That's what we're here for. Don't be afraid to experiment. If you have a loop like this and want to see what it will do without actually copying files, put an echo in front of the cp to have the script show you what it will do when you remove the echos.
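
A minimal sketch of that dry-run trick, with the variables set by hand to values taken from the thread instead of coming from the loop. The variable dry_run is only there so the printed command can be inspected.

```shell
#!/bin/bash
# Values assigned by hand for the demonstration.
FOLD=f0
SET_TYPE=EV
FILE_NAME='48.93_E3200_55.94_E1900_34_ri_OA_f0_S1A_v17_52.26.1_4_ON_0.25lr.out.txt'

# With echo in front, the cp command is printed, not executed.
dry_run=$(echo cp -p "./$FILE_NAME" "./${FOLD}_continue/$SET_TYPE/$FILE_NAME")
printf '%s\n' "$dry_run"
```

Once the printed command lines look right, deleting the echo turns the dry run into the real copy.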

Get used to using:

set -xv
    code to trace
set +xv

to surround segments of shell code that you don't understand so you can see what commands are being called and what operands are being passed to them.
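
A small sketch of tracing one statement follows. It uses only -x to keep the output short, and the variable names are invented for the example. The trace is written to standard error, so it does not mix with anything the script prints to standard output.

```shell
#!/bin/bash
name='48.93_E3200_55.94.out.txt'

set -x
# Keep everything before the first underscore.
value=${name%%_*}
set +x

# While tracing was on, bash echoed the expanded assignment to stderr
# with a leading "+", so the value actually assigned is visible.
echo "$value"
```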