Split list of files into an array and pass to function

LMHmedchem · January 13, 2015, 2:26am

There are two parts to this. In the first part I need to read a list of files from a directory and split it into 4 arrays. I have done that with the following code,

# collect list of file names
STATS_INPUT_FILENAMES=($(ls  './'$SET'/'$FOLD'/'*'in.txt'))
# get number of files
NUM_INPUT_FILES=${#STATS_INPUT_FILENAMES[@]}


# get size of each subset
PROC_SIZE=$((NUM_INPUT_FILES / 4))
# create array start and stop positions
let "START2 = $PROC_SIZE+1"; let "STOP2 = $START2+$PROC_SIZE"; 
let "START3 = $STOP2+1"; let "STOP3 = $START3+$PROC_SIZE";
let "START4 = $STOP3+1"

# create 4 arrays, each with 25% of filenames
NUM_FILE_LISTS='4'
FILE_LIST_0=("${STATS_INPUT_FILENAMES[@]:0:$PROC_SIZE}")
FILE_LIST_1=("${STATS_INPUT_FILENAMES[@]:$START2:$STOP2}")
FILE_LIST_2=("${STATS_INPUT_FILENAMES[@]:$START3:$STOP3}")
FILE_LIST_3=("${STATS_INPUT_FILENAMES[@]:$START4:$NUM_INPUT_FILES}")

This is not very elegant, but I think it splits the list up.

Next, I need to pass each of the 4 lists to a bash function, but I can't seem to find a reasonable syntax for doing that. Suggestions would be appreciated.

LMHmedchem

Don_Cragun · January 13, 2015, 5:08am

It may have the list split up, but the 4 lists are nowhere close to containing the same number of elements. The construct for grabbing a subset of the array elements in ksh93 and recent versions of bash is not:

${array[@]:start_index:end_index}

it is:

${array[@]:start_index:number_of_elements}

If we had a list of 10 files named 1 through 10 the four lists created by your code would be:

2:1 2
5:4 5 6 7 8
4:7 8 9 10
1:10

where the number before the colon is the number of files in the list and the numbers after the colon are the files in that list. (Note that files 7, 8, and 10 are in two lists and file 3 isn't in any list, and the number of files in the lists are 2, 5, 4, and 1.)
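
As a quick check of the offset:length behaviour described above, here is a minimal sketch. The array contents and the variable names slice and rest are placeholders for illustration only:

```bash
#!/bin/bash
# Illustrative array; the values stand in for file names.
files=(1 2 3 4 5 6 7 8 9 10)

# ${array[@]:offset:length}: the second number is a COUNT, not an end index.
slice=("${files[@]:3:5}")
echo "${#slice[@]}:${slice[*]}"      # 5:4 5 6 7 8

# Omitting the length takes everything from the offset to the end.
rest=("${files[@]:8}")
echo "${#rest[@]}:${rest[*]}"        # 2:9 10
```

The first slice is the same 5:4 5 6 7 8 line as in the list above: an offset of 3 followed by a count of 5, not a range ending at index 5.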

To get more even lists (and get each file in exactly one of your four lists), you could try something more like:

# collect list of file names
STATS_INPUT_FILENAMES=($(ls  './'$SET'/'$FOLD'/'*'in.txt'))
STATS_INPUT_FILENAMES=(1 2 3 4 5 6 7 8 9 10) # For testing only.

# get number of files
NUM_INPUT_FILES=${#STATS_INPUT_FILENAMES[@]}

# create 4 arrays, each with ~25% of filenames
NUM_FILE_LISTS='4'
# get size of each subset
BASE_LIST_SIZE=$(((NUM_INPUT_FILES) / NUM_FILE_LISTS))
LEFTOVER=$((NUM_INPUT_FILES % NUM_FILE_LISTS))
LIST_SIZE0=$((BASE_LIST_SIZE + (LEFTOVER > 0)))
LIST_SIZE1=$((BASE_LIST_SIZE + (LEFTOVER > 1)))
LIST_SIZE2=$((BASE_LIST_SIZE + (LEFTOVER > 2)))

FILE_LIST_0=("${STATS_INPUT_FILENAMES[@]:0:$LIST_SIZE0}")
FILE_LIST_1=("${STATS_INPUT_FILENAMES[@]:$LIST_SIZE0:$LIST_SIZE1}")
FILE_LIST_2=("${STATS_INPUT_FILENAMES[@]:$((LIST_SIZE0 + LIST_SIZE1)):$LIST_SIZE2}")
FILE_LIST_3=("${STATS_INPUT_FILENAMES[@]:$((LIST_SIZE0 + LIST_SIZE1 + LIST_SIZE2))}")

echo ${#FILE_LIST_0[@]}:${FILE_LIST_0[@]}
echo ${#FILE_LIST_1[@]}:${FILE_LIST_1[@]}
echo ${#FILE_LIST_2[@]}:${FILE_LIST_2[@]}
echo ${#FILE_LIST_3[@]}:${FILE_LIST_3[@]}

With the same list of 10 files, this produces the output:

3:1 2 3
3:4 5 6
2:7 8
2:9 10

Passing arrays to a function is tricky. The easier approach is to pass any fixed arguments as the first arguments to your function and pass the filenames as a variable-length argument list with "${FILE_LIST_x[@]}".
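
A minimal sketch of that calling convention, assuming a hypothetical function named process_list and made-up sample filenames (neither is from the original script):

```bash
#!/bin/bash
# Hypothetical function: two fixed arguments first, then any number of filenames.
process_list() {
    local set_name=$1 fold_name=$2
    shift 2                      # drop the fixed arguments
    local files=("$@")           # everything left is the file list
    echo "$set_name/$fold_name: ${#files[@]} file(s)"
    local f
    for f in "${files[@]}"; do
        echo "  $f"
    done
}

# Sample data only; the first name contains a space on purpose.
FILE_LIST_0=("a in.txt" "b_in.txt" "c_in.txt")
process_list set1 fold1 "${FILE_LIST_0[@]}"
```

Quoting "${FILE_LIST_0[@]}" expands each element to its own argument, so names containing spaces should arrive intact.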

Hope this helps...

pravin27 · January 13, 2015, 5:22am

Could this help you ?

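#!/bin/bash
# arrays are a bash feature, so this must not run under a plain /bin/sh

# print every element of the array whose name is passed as $1
print_output () {
  myArray=$1
  eval echo "\${${myArray}[*]}"
}

cd /path/to/yourdir || exit 1
# paste joins every 4 file names onto one tab-separated line
ls | paste - - - - | while IFS=$'\t' read -r -a FileArray
do
    print_output FileArray
done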

LMHmedchem · January 13, 2015, 1:10pm

I think that I am going to avoid passing the array for now and see how it goes. I can pass LIST_SIZE0 and LIST_SIZE* and let the function create each sub list. This will mean repeating STATS_INPUT_FILENAMES=($(ls './'$SET'/'$FOLD'/'*'in.txt')) for each function call, but I will put up with that for now.

I guess I misunderstood the syntax for grabbing part of an array. The most important issue here is making sure that each file is on exactly one list. The second priority is making the lists as even as possible.
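
A rough sketch of those two priorities, assuming a hypothetical helper named group_sizes (not part of the script above): the group sizes differ by at most one and always add up to the total, so each file can be assigned to exactly one list.

```bash
#!/bin/bash
# Print the size of each of $2 groups for $1 items, larger groups first.
group_sizes() {
    local total=$1 groups=$2 base left n
    base=$((total / groups))
    left=$((total % groups))
    for ((n = 0; n < groups; n++)); do
        # the first $left groups each take one extra item
        echo $((base + (n < left)))
    done
}

sizes=($(group_sizes 10 4))
echo "${sizes[*]}"        # 3 3 2 2
```

With fewer files than groups, some sizes come out as 0, which the caller would need to skip.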

LMHmedchem

RudiC · January 13, 2015, 1:50pm

I can see two ways to pass an array to a function, at least for my bash 4.3.30:

1. Pass the element count and then the elements, as in scr1.sh A B ${#LIST[@]} ${LIST[@]} C D , and run a for loop to assign the elements to a local array.
2. Pass the array as a single string, like

scr1.sh A B "${LIST[*]}" C D

and define the local array like ARR=($3)
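
The first approach might look roughly like this. The function name show_args is hypothetical, and the sketch uses a slice of the positional parameters instead of an explicit for loop:

```bash
#!/bin/bash
# Hypothetical function: args are A, B, the element count, the elements, then trailing args.
show_args() {
    local first=$1 second=$2 count=$3
    shift 3
    local arr=("${@:1:count}")   # the next $count arguments form the array
    shift "$count"
    echo "fixed: $first $second"
    echo "array (${#arr[@]}): ${arr[*]}"
    echo "trailing: $*"
}

LIST=(x "y z" w)                 # sample data only
show_args A B "${#LIST[@]}" "${LIST[@]}" C D
```

Note that the second approach, ARR=($3), splits on whitespace, so it is only safe when no element contains spaces or glob characters.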

LMHmedchem · January 13, 2015, 3:18pm

This is what I have set up instead of passing the array.

calling code

# the number of available cores
if [ "$CORES" == "quad" ]; then

   # create 4 arrays, each with ~25% of filenames
   NUM_FILE_LISTS='4'
   PROCESSED='0'

   # get size of each subset
   BASE_LIST_SIZE=$((NUM_INPUT_FILES / NUM_FILE_LISTS))
   LEFTOVER=$((NUM_INPUT_FILES % NUM_FILE_LISTS))

   # set up start elements and number of elements for all lists
   # list 0
   START_ELEMENT_0='0'
   NUMBER_OF_ELEMENTS_0=$((BASE_LIST_SIZE + (LEFTOVER > 0)))
   # keep track of number of files processed
   let "PROCESSED=$PROCESSED+$NUMBER_OF_ELEMENTS_0"
      
   # list 1   
   START_ELEMENT_1=$PROCESSED
   #let "START_ELEMENT_1=$START_ELEMENT_0+$NUMBER_OF_ELEMENTS_0"
   NUMBER_OF_ELEMENTS_1=$((BASE_LIST_SIZE + (LEFTOVER > 1)))
   let "PROCESSED=$PROCESSED+$NUMBER_OF_ELEMENTS_1"
 
   # list 2  
   START_ELEMENT_2=$PROCESSED
   NUMBER_OF_ELEMENTS_2=$((BASE_LIST_SIZE + (LEFTOVER > 2)))
   # keep track of number of files processed
   let "PROCESSED=$PROCESSED+$NUMBER_OF_ELEMENTS_2"
 
   # list 3  
   START_ELEMENT_3=$PROCESSED
   # assign the rest to this list
   let "NUMBER_OF_ELEMENTS_3=$NUM_INPUT_FILES-$PROCESSED"
   # keep track of number of files processed
   let "PROCESSED=$PROCESSED+$NUMBER_OF_ELEMENTS_3"

      # call functions to process stats
      run_stats_program  $SET  $FOLD  $START_ELEMENT_0  $NUMBER_OF_ELEMENTS_0 &
      # to prevent terminal overrun
      sleep 2
      run_stats_program  $SET  $FOLD  $START_ELEMENT_1  $NUMBER_OF_ELEMENTS_1 &
      sleep 2
      run_stats_program  $SET  $FOLD  $START_ELEMENT_2  $NUMBER_OF_ELEMENTS_2 &
      sleep 2
      run_stats_program  $SET  $FOLD  $START_ELEMENT_3  $NUMBER_OF_ELEMENTS_3 &
      sleep 2
      # wait until subshells have returned
      wait

 fi

called function

function run_stats_program {

   # function args
   SET_F=$1
   FOLD_F=$2
   START_ELEMENT_F=$3
   NUMBER_OF_ELEMENTS_F=$4
   
   # get list of stats input files in fold directory
   STATS_INPUT_FILENAMES_F=($(ls  './'$SET_F'/'$FOLD_F'/'*'in.txt'))
 
   # create file list as subset of STATS_INPUT_FILENAMES_F
   FILE_LIST=("${STATS_INPUT_FILENAMES_F[@]:$START_ELEMENT_F:$NUMBER_OF_ELEMENTS_F}")

   for INPUT_FILE in "${FILE_LIST[@]}"
   do
      echo $INPUT_FILE
   done
}

All this does at this point is print the filenames. In the end, this will process the 4 file lists in 4 subshells. Processing involves calling a C++ widget to process each file. This setup allows 4 instances of the C++ app to run simultaneously and use the available CPU resources. There will be a similar code block for hex core.

I get that this is written in long form at the moment. It would be nice for the code to be a bit more compact and elegant, but I don't see a clear way to put the function calls in a loop or something like that.

LMHmedchem

sea · January 13, 2015, 3:55pm

Ok, not the nicest, but it seems to work:

#!/bin/bash
ARRAY_ORGINAL=("${@}")
declare -a ARRAY1 ARRAY2 ARRAY3 ARRAY4
TOTAL=${#ARRAY_ORGINAL[@]}
MAX=$(( TOTAL / 4 ))

# ${array[@]:offset:length} copies a slice without modifying the source
# array, so each slice simply starts MAX elements after the previous one
# (unsetting elements between slices would skip and repeat entries)
count=0
ARRAY1=( "${ARRAY_ORGINAL[@]:$count:$MAX}" )

count=$((count + MAX))
ARRAY2=( "${ARRAY_ORGINAL[@]:$count:$MAX}" )

count=$((count + MAX))
ARRAY3=( "${ARRAY_ORGINAL[@]:$count:$MAX}" )

count=$((count + MAX))
ARRAY4=( "${ARRAY_ORGINAL[@]:$count:$MAX}" )

echo "1 : ${ARRAY1[@]}"
echo "2 : ${ARRAY2[@]}"
echo "3 : ${ARRAY3[@]}"
echo "4 : ${ARRAY4[@]}"

bash test.sh  a b c d e f g h i j k l
1 : a b c
2 : d e f
3 : g h i
4 : j k l

Leftovers (that is, when the provided argument list is not evenly divisible by 4) are not handled here.

LMHmedchem · January 13, 2015, 6:57pm

The code I posted above is working with one caveat. In the function code,

function run_stats_program {

   # function args
   SET_F=$1
   FOLD_F=$2
   START_ELEMENT_F=$3
   NUMBER_OF_ELEMENTS_F=$4
   
   # get list of stats input files in fold directory
   STATS_INPUT_FILENAMES_F=($(ls  './'$SET_F'/'$FOLD_F'/'*'in.txt'))

   # create file list as subset of STATS_INPUT_FILENAMES_F
   FILE_LIST=("${STATS_INPUT_FILENAMES_F[@]:$START_ELEMENT_F:$NUMBER_OF_ELEMENTS_F}")

   # get reference file name
   REFERENCE_FILE_F=$(ls './'$SET_F'/'$FOLD'/00_'$FOLD'_reference_'*'.txt')

   for INPUT_FILE in "${FILE_LIST[@]}"
   do
      # print current input file
      echo $INPUT_FILE
      #process stats input file
      './'$STATS_APP -r $REFERENCE_FILE_F -i $INPUT_FILE -l $BATCH_STOP_SUBSETS -s $BATCH_STOP_STATS -p $OA_PRINT_PRECISION -f $INPUT_FORMAT
      #delete stats input file
      rm -f $INPUT_FILE     
   done
}

My preference is to remove the files as they are processed, as the rm -f $INPUT_FILE line at the end of the loop indicates. This cannot be done as currently implemented because STATS_INPUT_FILENAMES_F is generated inside the function, and if files are deleted, the size of the resulting array changes between function calls. This blows up the array ranges that the function is trying to select.

If I want to delete files as processed, it would appear that I would need to create the sub-lists outside of the function and then pass in the arrays. That puts me back to passing in the arrays as arguments or waiting on deletion.

Thoughts?

LMHmedchem

Don_Cragun · January 14, 2015, 1:53am

As long as the file list array is not defined as a local variable in the parent shell, the subshells running the function don't need to redefine the array; it will be inherited.
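
A small sketch of that inheritance, using sample data and a hypothetical function named worker: the array is filled once in the parent, and each background invocation sees its own copy of it.

```bash
#!/bin/bash
# Sample data only; in the real script this would be the list of input files.
STATS_INPUT_FILENAMES=(a b c d e f)

# Hypothetical worker: reads the inherited array, receives only a start and a count.
worker() {
    local start=$1 count=$2 f
    for f in "${STATS_INPUT_FILENAMES[@]:start:count}"; do
        echo "worker $start got $f"
    done
}

worker 0 3 &      # each background job runs in a subshell with a copy of the array
worker 3 3 &
wait
```

Because the list is captured once before any worker starts, deleting files while they are processed should not shift the ranges the other workers select.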

Perhaps the following script will provide a useful example. It contains a revised version of your function and a new function that takes one operand (the number of invocations of your function to run concurrently), splits the list of files into subsets, invokes your function on each subset, and waits for all invocations to complete:

#!/bin/bash
# Define functions...
function run_stats_program {

   # function args
   SET_F=$1
   FOLD_F=$2
   START_ELEMENT_F=$3
   NUMBER_OF_ELEMENTS_F=$4
   echo 'function run_stats_program called with args: ' "$@"
   
   # get reference file name
   #REFERENCE_FILE_F=$(ls './'$SET_F'/'$FOLD_F'/00_'$FOLD_F'_reference_'*'.txt')
   REFERENCE_FILE_F=Reference

   for INPUT_FILE in "${STATS_INPUT_FILENAMES[@]:START_ELEMENT_F:NUMBER_OF_ELEMENTS_F}"
   do
      # print current input file
      echo $INPUT_FILE
      #process stats input file
      echo './'$STATS_APP -r $REFERENCE_FILE_F -i $INPUT_FILE -l $BATCH_STOP_SUBSETS -s $BATCH_STOP_STATS -p $OA_PRINT_PRECISION -f $INPUT_FORMAT
      #delete stats input file
      echo rm -f $INPUT_FILE     
      sleep 1
   done
}

function split_and_run {
	NGROUPS="$1"

	# get number of files
	NUM_INPUT_FILES=${#STATS_INPUT_FILENAMES[@]}

	# Calculate number of files to be sent to each invocation of
	# run_stats_program...
	BASE_LIST_SIZE=$((NUM_INPUT_FILES / NGROUPS))
	LEFTOVER=$((NUM_INPUT_FILES % NGROUPS))
	SPLIT_START=0

	# Run NGROUPS copies of run_stats_program asynchronously...
	for ((n = 1; n <= NGROUPS; n++)) {
		GROUP_SIZE=$((BASE_LIST_SIZE + (LEFTOVER >= n)))
		run_stats_program "$SET" "$FOLD" $SPLIT_START $GROUP_SIZE&
		sleep 2
		SPLIT_START=$((SPLIT_START + GROUP_SIZE))
	}
	# Wait for run_stats_program invocations to finish...
	wait
}

# Initialize variables:
BATCH_STOP_STATS='batch_stop_stats_value'
BATCH_STOP_SUBSETS='batch_stop_subsets_value'
FOLD='fold_value'
INPUT_FORMAT='input_format_value'
OA_PRINT_PRECISION='oa_print_precision_value'
SET='set_value'
STATS_APP='stats_app_value'

# Collect list of file names
#STATS_INPUT_FILENAMES=($(ls  './'$SET'/'$FOLD'/'*'in.txt'))
STATS_INPUT_FILENAMES=(a b c d e f g h i j k l m n o p q r s t u v w x y z)

# Test run for dual processor system...
split_and_run 2
echo '*** 1st set done ***'
sleep 5
# Test run for quad processor system...
split_and_run 4
echo '*** 2nd set done ***'
sleep 5
# Test run for dual quad processor system...
split_and_run 8
echo '*** 3rd set done ***'

Note that I changed a couple of references to $FOLD in your function to instead refer to $FOLD_F. It isn't obvious to me whether $FOLD and $SET will be the same in all of your function calls or not. If they will be the same, you can probably drop the first two operands to your function and just inherit the values of $FOLD and $SET from the invoking shell. Similarly, if the reference file is the same in all invocations of your function, you can set it once in the invoking shell instead of duplicating that processing in each function invocation.

Note that if you might run this with fewer files than the number of concurrent invocations, you'll probably want to change:

		run_stats_program "$SET" "$FOLD" $SPLIT_START $GROUP_SIZE&
		sleep 2

to something more like:

		if [ $GROUP_SIZE -gt 0 ]
		then	run_stats_program "$SET" "$FOLD" $SPLIT_START $GROUP_SIZE&
			sleep 2
		fi

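If this sample code looks like it is doing what you want, remove (or replace) the test-only code (for example the placeholder variable values, the hard-coded file list, and the echo in front of the real commands) to use your real data.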