There are two parts to this. In the first part I need to read a list of files from a directory and split it into 4 arrays. I have done that with the following code:
# collect list of file names
STATS_INPUT_FILENAMES=($(ls './'$SET'/'$FOLD'/'*'in.txt'))
# get number of files
NUM_INPUT_FILES=${#STATS_INPUT_FILENAMES[@]}
# get size of each subset
PROC_SIZE=$((NUM_INPUT_FILES / 4))
# create array start and stop positions
let "START2 = $PROC_SIZE+1"; let "STOP2 = $START2+$PROC_SIZE";
let "START3 = $STOP2+1"; let "STOP3 = $START3+$PROC_SIZE";
let "START4 = $STOP3+1"
# create 4 arrays, each with 25% of filenames
NUM_FILE_LISTS='4'
FILE_LIST_0=("${STATS_INPUT_FILENAMES[@]:0:$PROC_SIZE}")
FILE_LIST_1=("${STATS_INPUT_FILENAMES[@]:$START2:$STOP2}")
FILE_LIST_2=("${STATS_INPUT_FILENAMES[@]:$START3:$STOP3}")
FILE_LIST_3=("${STATS_INPUT_FILENAMES[@]:$START4:$NUM_INPUT_FILES}")
This is not very elegant, but I think it splits the list up.
Next, I need to pass each of the 4 lists to a bash function, but I can't seem to find a reasonable syntax for doing that. Suggestions would be appreciated.
It may have the list split up, but the 4 lists are nowhere close to containing the same number of elements. The construct grabbing a subset of the array elements in ksh93 and recent versions of bash is not:
${array[@]:start_index:end_index}
it is:
${array[@]:start_index:number_of_elements}
If we had a list of 10 files named 1 through 10 the four lists created by your code would be:
2:1 2
5:4 5 6 7 8
4:7 8 9 10
1:10
where the number before the colon is the number of files in the list and the numbers after the colon are the files in that list. (Note that files 7, 8, and 10 each appear in two lists, file 3 isn't in any list, and the list sizes are 2, 5, 4, and 1.)
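The count semantics are easy to check directly; a minimal sketch (the FILES array here is just illustrative test data):

```shell
# Demonstrate that the third field of ${array[@]:offset:length}
# is a count of elements, not an end index.
FILES=(1 2 3 4 5 6 7 8 9 10)
echo "${FILES[@]:3:5}"   # 5 elements starting at index 3 -> 4 5 6 7 8
echo "${FILES[@]:6:2}"   # 2 elements starting at index 6 -> 7 8
```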
To get more even lists (and get each file in exactly one of your four lists), you could try something more like:
# collect list of file names
STATS_INPUT_FILENAMES=($(ls './'$SET'/'$FOLD'/'*'in.txt'))
STATS_INPUT_FILENAMES=(1 2 3 4 5 6 7 8 9 10) # For testing only.
# get number of files
NUM_INPUT_FILES=${#STATS_INPUT_FILENAMES[@]}
# create 4 arrays, each with ~25% of filenames
NUM_FILE_LISTS='4'
# get size of each subset
BASE_LIST_SIZE=$(((NUM_INPUT_FILES) / NUM_FILE_LISTS))
LEFTOVER=$((NUM_INPUT_FILES % NUM_FILE_LISTS))
LIST_SIZE0=$((BASE_LIST_SIZE + (LEFTOVER > 0)))
LIST_SIZE1=$((BASE_LIST_SIZE + (LEFTOVER > 1)))
LIST_SIZE2=$((BASE_LIST_SIZE + (LEFTOVER > 2)))
FILE_LIST_0=("${STATS_INPUT_FILENAMES[@]:0:$LIST_SIZE0}")
FILE_LIST_1=("${STATS_INPUT_FILENAMES[@]:$LIST_SIZE0:$LIST_SIZE1}")
FILE_LIST_2=("${STATS_INPUT_FILENAMES[@]:$((LIST_SIZE0 + LIST_SIZE1)):$LIST_SIZE2}")
FILE_LIST_3=("${STATS_INPUT_FILENAMES[@]:$((LIST_SIZE0 + LIST_SIZE1 + LIST_SIZE2))}")
echo ${#FILE_LIST_0[@]}:${FILE_LIST_0[@]}
echo ${#FILE_LIST_1[@]}:${FILE_LIST_1[@]}
echo ${#FILE_LIST_2[@]}:${FILE_LIST_2[@]}
echo ${#FILE_LIST_3[@]}:${FILE_LIST_3[@]}
Which with the same list of 10 files produces the output:
3:1 2 3
3:4 5 6
2:7 8
2:9 10
Passing arrays to a function is tricky. The easier approach is to pass any fixed arguments as the first arguments to your function and pass the filenames as a variable-length argument list with "${FILE_LIST_x[@]}".
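A minimal sketch of that calling convention (the function name and the echoed message here are illustrative, not from the original script):

```shell
# Pass the fixed arguments first, then expand the array; the function
# shifts off the fixed arguments and rebuilds the list from "$@".
process_list() {
    set_arg=$1
    fold_arg=$2
    shift 2
    local files=("$@")           # remaining arguments are the filenames
    echo "processing ${#files[@]} files for $set_arg/$fold_arg"
}

FILE_LIST_0=(a.txt b.txt c.txt)
process_list myset fold1 "${FILE_LIST_0[@]}"
# -> processing 3 files for myset/fold1
```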
I think that I am going to avoid passing the array for now and see how it goes. I can pass LIST_SIZE0 and LIST_SIZE* and let the function create each sub list. This will mean repeating STATS_INPUT_FILENAMES=($(ls './'$SET'/'$FOLD'/'*'in.txt')) for each function call, but I will put up with that for now.
I guess I misunderstood the syntax for grabbing part of an array. The most important issue here is making sure that each file is on exactly one list. The second priority is making the lists as even as possible.
This is what I have set up instead of passing the array.
calling code
# the number of available cores
if [ "$CORES" == "quad" ]; then
# create 4 arrays, each with ~25% of filenames
NUM_FILE_LISTS='4'
PROCESSED='0'
# get size of each subset
BASE_LIST_SIZE=$(((NUM_INPUT_FILES) / NUM_FILE_LISTS))
LEFTOVER=$((NUM_INPUT_FILES % NUM_FILE_LISTS))
# set up start elements and number of elements for all lists
# list 0
START_ELEMENT_0='0'
NUMBER_OF_ELEMENTS_0=$((BASE_LIST_SIZE + (LEFTOVER > 0)))
# keep track of number of files processed
let "PROCESSED=$PROCESSED+$NUMBER_OF_ELEMENTS_0"
# list 1
START_ELEMENT_1=$PROCESSED
#let "START_ELEMENT_1=$START_ELEMENT_0+$NUMBER_OF_ELEMENTS_0"
NUMBER_OF_ELEMENTS_1=$((BASE_LIST_SIZE + (LEFTOVER > 1)))
let "PROCESSED=$PROCESSED+$NUMBER_OF_ELEMENTS_1"
# list 2
START_ELEMENT_2=$PROCESSED
NUMBER_OF_ELEMENTS_2=$((BASE_LIST_SIZE + (LEFTOVER > 2)))
# keep track of number of files processed
let "PROCESSED=$PROCESSED+$NUMBER_OF_ELEMENTS_2"
# list 3
START_ELEMENT_3=$PROCESSED
# assign the rest to this list
let "NUMBER_OF_ELEMENTS_3=$NUM_INPUT_FILES-$PROCESSED"
# keep track of number of files processed
let "PROCESSED=$PROCESSED+$NUMBER_OF_ELEMENTS_3"
# call functions to process stats
run_stats_program $SET $FOLD $START_ELEMENT_0 $NUMBER_OF_ELEMENTS_0 &
# to prevent terminal overrun
sleep 2
run_stats_program $SET $FOLD $START_ELEMENT_1 $NUMBER_OF_ELEMENTS_1 &
sleep 2
run_stats_program $SET $FOLD $START_ELEMENT_2 $NUMBER_OF_ELEMENTS_2 &
sleep 2
run_stats_program $SET $FOLD $START_ELEMENT_3 $NUMBER_OF_ELEMENTS_3 &
sleep 2
# wait until subshells have returned
wait
fi
called function
function run_stats_program {
# function args
SET_F=$1
FOLD_F=$2
START_ELEMENT_F=$3
NUMBER_OF_ELEMENTS_F=$4
# get list of stats input files in fold directory
STATS_INPUT_FILENAMES_F=($(ls './'$SET_F'/'$FOLD_F'/'*'in.txt'))
# create file list as subset of STATS_INPUT_FILENAMES_F
FILE_LIST=("${STATS_INPUT_FILENAMES_F[@]:$START_ELEMENT_F:$NUMBER_OF_ELEMENTS_F}")
for INPUT_FILE in "${FILE_LIST[@]}"
do
echo $INPUT_FILE
done
}
All this does at this point is print the filenames. In the end, this will process the 4 file lists in 4 subshells. Processing involves calling a C++ widget to process each file. This setup allows 4 instances of the C++ app to run simultaneously and use available CPU resources. There will be a similar code block for hex core.
I grant that this is written in long form at the moment. It would be nice for the code to be a bit more compact and elegant, but I don't see a clear way to put the function calls in a loop or something like that.
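For what it's worth, the start/count bookkeeping above can be collapsed into a loop over the number of lists; a sketch under the same size-splitting scheme, with echo standing in for the real backgrounded function call:

```shell
# Split NUM_INPUT_FILES across NUM_FILE_LISTS lists, distributing the
# remainder one element at a time to the earliest lists.
NUM_FILE_LISTS=4
NUM_INPUT_FILES=10
BASE_LIST_SIZE=$((NUM_INPUT_FILES / NUM_FILE_LISTS))
LEFTOVER=$((NUM_INPUT_FILES % NUM_FILE_LISTS))
START=0
for ((i = 0; i < NUM_FILE_LISTS; i++)); do
    COUNT=$((BASE_LIST_SIZE + (LEFTOVER > i)))
    # replace echo with: run_stats_program "$SET" "$FOLD" "$START" "$COUNT" &
    echo "list $i: start=$START count=$COUNT"
    START=$((START + COUNT))
done
# wait   # uncomment when the real jobs run in the background
```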
The code I posted above is working with one caveat. In the function code,
function run_stats_program {
# function args
SET_F=$1
FOLD_F=$2
START_ELEMENT_F=$3
NUMBER_OF_ELEMENTS_F=$4
# get list of stats input files in fold directory
STATS_INPUT_FILENAMES_F=($(ls './'$SET_F'/'$FOLD_F'/'*'in.txt'))
# create file list as subset of STATS_INPUT_FILENAMES_F
FILE_LIST=("${STATS_INPUT_FILENAMES_F[@]:$START_ELEMENT_F:$NUMBER_OF_ELEMENTS_F}")
# get reference file name
REFERENCE_FILE_F=$(ls './'$SET_F'/'$FOLD'/00_'$FOLD'_reference_'*'.txt')
for INPUT_FILE in "${FILE_LIST[@]}"
do
# print current input file
echo $INPUT_FILE
#process stats input file
'./'$STATS_APP -r $REFERENCE_FILE_F -i $INPUT_FILE -l $BATCH_STOP_SUBSETS -s $BATCH_STOP_STATS -p $OA_PRINT_PRECISION -f $INPUT_FORMAT
#delete stats input file
rm -f $INPUT_FILE
done
}
my preference is to remove the files as they are processed, as the rm -f line indicates. This cannot be done as currently implemented because STATS_INPUT_FILENAMES_F is generated in the function, and if files are deleted, the size of the resulting array changes between function calls. This blows up the array ranges that the function is trying to select.
If I want to delete files as processed, it would appear that I would need to create the sub-lists outside of the function and then pass in the arrays. That puts me back to passing in the arrays as arguments or waiting on deletion.
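For reference, bash 4.3 and later offers a third option: pass the array's name and bind a nameref (local -n) to it inside the function, which avoids both copying the list into the argument list and regenerating it in each call. A minimal sketch (function and variable names are illustrative, and the rm is left commented out):

```shell
# The function receives the *name* of an array and binds a nameref to it,
# so it iterates the caller's array without copying it.
delete_as_processed() {
    local -n file_list=$1        # nameref: file_list aliases the caller's array
    for f in "${file_list[@]}"; do
        echo "would process and remove: $f"
        # rm -f -- "$f"          # safe here: the list was fixed up front
    done
}

MY_LIST=(one.txt two.txt)
delete_as_processed MY_LIST
```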
As long as the file list array is not defined as a local variable in the parent shell, the subshells running the function don't need to redefine the array; it will be inherited.
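A quick way to see that inheritance (a toy sketch, not the original script): a backgrounded function call forks a subshell that carries the parent's array with it, no export needed.

```shell
# Arrays defined in the parent shell are visible in forked subshells.
ARR=(a b c)
show() { echo "${#ARR[@]} elements: ${ARR[*]}"; }
show &        # background subshell inherits ARR without redefining it
wait
# -> 3 elements: a b c
```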
Perhaps the following script will provide a useful example. It contains a revised version of your function plus a new function that takes one operand specifying how many invocations of your function to run concurrently; the new function splits the list of files into subsets, invokes your function for each subset asynchronously, and waits for all invocations to complete:
#!/bin/bash
# Define functions...
function run_stats_program {
# function args
SET_F=$1
FOLD_F=$2
START_ELEMENT_F=$3
NUMBER_OF_ELEMENTS_F=$4
echo 'function run_stats_program called with args: ' "$@"
# get reference file name
#REFERENCE_FILE_F=$(ls './'$SET_F'/'$FOLD_F'/00_'$FOLD_F'_reference_'*'.txt')
REFERENCE_FILE_F=Reference
for INPUT_FILE in "${STATS_INPUT_FILENAMES[@]:START_ELEMENT_F:NUMBER_OF_ELEMENTS_F}"
do
# print current input file
echo $INPUT_FILE
#process stats input file
echo './'$STATS_APP -r $REFERENCE_FILE_F -i $INPUT_FILE -l $BATCH_STOP_SUBSETS -s $BATCH_STOP_STATS -p $OA_PRINT_PRECISION -f $INPUT_FORMAT
#delete stats input file
echo rm -f $INPUT_FILE
sleep 1
done
}
function split_and_run {
NGROUPS="$1"
# get number of files
NUM_INPUT_FILES=${#STATS_INPUT_FILENAMES[@]}
# Calculate number of files to be sent to each invocation of
# run_stats_program..
BASE_LIST_SIZE=$((NUM_INPUT_FILES / NGROUPS))
LEFTOVER=$((NUM_INPUT_FILES % NGROUPS))
SPLIT_START=0
# Run NGROUPS copies of run_stats_program asynchronously...
for ((n = 1; n <= NGROUPS; n++)) {
GROUP_SIZE=$((BASE_LIST_SIZE + (LEFTOVER >= n)))
run_stats_program "$SET" "$FOLD" $SPLIT_START $GROUP_SIZE&
sleep 2
SPLIT_START=$((SPLIT_START + GROUP_SIZE))
}
# Wait for run_stats_program invocations to finish...
wait
}
# Initialize variables:
BATCH_STOP_STATS='batch_stop_stats_value'
BATCH_STOP_SUBSETS='batch_stop_subsets_value'
FOLD='fold_value'
INPUT_FORMAT='input_format_value'
OA_PRINT_PRECISION='oa_print_precision_value'
SET='set_value'
STATS_APP='stats_app_value'
# Collect list of file names
#STATS_INPUT_FILENAMES=($(ls './'$SET'/'$FOLD'/'*'in.txt'))
STATS_INPUT_FILENAMES=(a b c d e f g h i j k l m n o p q r s t u v w x y z)
# Test run for dual processor system...
split_and_run 2
echo '*** 1st set done ***'
sleep 5
# Test run for quad processor system...
split_and_run 4
echo '*** 2nd set done ***'
sleep 5
# Test run for dual quad processor system...
split_and_run 8
echo '*** 3rd set done ***'
Note that I changed a couple of references to $FOLD in your function to instead refer to $FOLD_F. It isn't obvious to me whether $FOLD and $SET will be the same in all of your function calls or not. If they will be the same, you can probably drop the first two operands to your function and just inherit the values of $FOLD and $SET from the invoking shell. Similarly, if the reference file is the same in all invocations of your function, you can set it once in the invoking shell instead of duplicating that processing in each function invocation.
Note that if you might run this with fewer files than the number of concurrent invocations, you'll probably want to change: