Very Challenging: Copy Files in Multiple Threads

Hello all,

I asked this in the basic Unix forum but got no answer in a week.

So I believe this is an advanced-level question, hence I am posting it here.

Any suggestions welcome.

I have a directory of files of varying sizes.

I want to copy all these files in n threads to another directory such that each copy set is more or less the same size.

Example :

Say /mydirA

It has around say 23 files of various sizes.

Number of copy threads say = 3

total directory size of /mydirA = 25 GB

So each thread should copy files whose sizes add up to almost 25/3 ~ 8 GB

So I need to gather files for each thread such that their sizes add up to about 8 GB:

Thread 1 --> 8 GB ... could be 11 files which add up to 8 GB

Thread 2 --> 8 GB ... could be 5 files which add up to 8 GB

Thread 3 --> 9 GB ... could be 8 files which add up to 8 or 9 GB

I want roughly equal copy sets. It is also possible that, even though I ask for 3 sets of equal size, the available files do not allow all 3 sets to reach the 8 GB target exactly. So at least try to meet the target size of each copy set as far as possible.

All files need to go from /mydirA to /mydirB in N threads, where the target size of each thread is
(total size of directory)/N. Each thread may hold a different number of files, as long as their sizes add up to roughly that target.
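
The target size itself is simple arithmetic and can be computed up front. A minimal sketch, assuming field 5 of `ls -l` is the size in bytes (as in the usual long listing format); `copy_target` is a hypothetical helper name, not an existing command:

```shell
#!/bin/sh
# Print "<total bytes> <target bytes per thread>" for the regular files in a directory.
# copy_target is a hypothetical helper; $1 = directory, $2 = number of threads.
copy_target() {
    # Lines starting with "-" are regular files; field 5 is the size in bytes.
    ls -l "$1" | awk -v n="$2" '
        /^-/ { total += $5 }
        END  { printf "%.0f %.0f\n", total, int(total / n) }'
}
```

For the example above this would be called as `copy_target /mydirA 3`.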

In short: good luck solving an NP-complete problem.

Besides that, for any large copy operation the bottleneck will be the IO subsystem (disk, network, ...) rather than any CPU.
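
That said, an exact optimum is rarely needed here. A common greedy heuristic (sort the files largest first, then always give the next file to the set that is currently smallest) usually gets the sets reasonably close. The following is only a sketch of that heuristic, not an optimal solver; `partition_by_size` is a hypothetical helper, and it assumes file names contain no whitespace:

```shell
#!/bin/sh
# Greedy "largest file first" split into n sets.
# Input  (stdin):  one "<size> <name>" pair per line
# Output (stdout): "<set number> <size> <name>" per line
# partition_by_size is a hypothetical helper; $1 = number of sets (copy threads).
partition_by_size() {
    sort -rn | awk -v n="$1" '
    {
        # Find the set with the smallest running total so far
        best = 1
        for (k = 2; k <= n; k++)
            if (load[k] + 0 < load[best] + 0) best = k
        load[best] += $1
        print best, $1, $2
    }'
}
```

It could be fed from a listing, for example `ls -l /mydirA | awk '/^-/ {print $5, $9}' | partition_by_size 3`. Being a heuristic, it can miss the best split: for sizes 8, 7, 6, 5, 4 and two sets it produces totals of 17 and 13, although 15 and 15 is possible.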

Hi Samoo,

I have come up with a script for your requirement. When tested with sample files of the same size, it divided all the files into 3 sets of equal size.

Logic used:

As long as the running total ($thread1size) is less than the reference (1/3 of the total size), we append each file name to a string variable.
Once it reaches or exceeds the reference, we append that file name as well and exit the loop.
This ensures that the total size of all the files in the string (filelist1=$filelist1:$i) is equal to or slightly greater than the reference size.

Once we have the thread1 files, we exclude them from the sizelist file (i.e. the list of all files in the current directory) and apply the same logic to the remaining files to get the thread2 files.

Finally, whatever remains becomes the thread3 files.

I am not sure how this script will behave in a real-world scenario (i.e. files of different sizes). However, it may give you some idea of how to proceed.

Current directory files

$ ls -l
-rw-r--r--   1 userid   staff           166 Mar 01 04:21 f1
-rw-r--r--   1 userid   staff           166 Mar 01 04:21 f2
-rw-r--r--   1 userid   staff           166 Mar 01 04:21 f3
-rw-r--r--   1 userid   staff           166 Mar 01 04:21 f4
-rw-r--r--   1 userid   staff           166 Mar 01 04:21 f5
-rw-r--r--   1 userid   staff           166 Mar 01 04:22 f6
-rw-r--r--   1 userid   staff           166 Mar 01 04:22 f7
-rw-r--r--   1 userid   staff           166 Mar 01 04:22 f8
-rw-r--r--   1 userid   staff           166 Mar 01 04:22 f9
-rwxr-xr-x   1 userid   staff          2514 Mar 01 05:21 seperate.sh

Script seperate.sh:

# Delete files previously created by this script, if any
rm -r dir1 dir2 dir3 list* size* thread* >/dev/null 2>&1
# Build a file listing each regular file in the current directory with its size, as "name :size"
# (directories, this script and sizelist itself are skipped)
ls -l |awk -v self="`basename $0`" '/^-/ && $9 != self && $9 != "sizelist" {print $9" :" $5}' >sizelist
 

                        ##################### PART1 #########################
# Calculating the total files size in current directory and taking 1/3 of it as reference for getting files for thread 1,2,3
totsize=0
for i in `cat sizelist|awk -F: '{print $2}'`
do
((totsize=$totsize+$i))
done
((refsize=$totsize/3))
 

                        ##################### PART2 #########################
# Preparing a list of thread1 files
filelist1=" "
thread1size=0
for i in `cat sizelist|awk -F: '{print $1}'`
do
# Anchor the pattern so that f1 does not also match f10, f11, ...
filesize=`grep "^$i :" sizelist|awk -F: '{print $2}'`
((thread1size=$thread1size+$filesize))
if [ $thread1size -lt $refsize ]
then
filelist1=$filelist1:$i
else
filelist1=$filelist1:$i
break
fi
done
echo $filelist1 |tr -s : " " >thread1
 
                       ##################### PART3 ##########################
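# Build list2: sizelist minus the thread1 files, used to pick the thread2 files
cat sizelist >list2
for i in `cat thread1`
do
# A file cannot safely be read and overwritten in the same pipeline, so go through a temp file;
# the anchored pattern keeps f1 from also removing f10
grep -v "^$i :" list2 >list2.tmp
mv list2.tmp list2
done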
# Preparing a list of thread2 files
filelist2=" "
thread2size=0
for i in `cat list2|awk -F: '{print $1}'`
do
# Anchor the pattern so that f1 does not also match f10, f11, ...
filesize=`grep "^$i :" list2|awk -F: '{print $2}'`
((thread2size=$thread2size+$filesize))
if [ $thread2size -lt $refsize ]
then
filelist2=$filelist2:$i
else
filelist2=$filelist2:$i
break
fi
done
echo $filelist2 |tr -s : " " >thread2
 
                       ##################### PART4 #############################
# Preparing list of remaining files for thread3
#echo $thread2
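# Start from the names in list2, then drop every file already assigned to thread2
awk -F' :' '{print $1}' list2 >thread3
for i in `cat thread2`
do
# Go through a temp file: thread3 cannot be read and overwritten in the same pipeline
grep -v "^$i\$" thread3 >thread3.tmp
mv thread3.tmp thread3
done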
 
                       ##################### PART5 #############################
# Creating three directories and copying the thread1, thread2 and thread3 files into them
mkdir dir1 dir2 dir3
for i in `cat thread1`
do
cp $i dir1/
done
for i in `cat thread2`
do
cp $i dir2/
done
for i in `cat thread3`
do
cp $i dir3/
done
echo "Total size of all files in the current directory is $totsize"
echo "The reference size is 1/3 is $refsize"
echo "Thread one files are: `cat thread1`"
echo "Thread two files are: `cat thread2`"
echo "Thread three files are: `cat thread3`"

Output :

$ seperate.sh

Total size of all files in the current directory is 1494
The reference size is 1/3 is 498
Thread one files are:  f1 f2 f3
Thread two files are:  f4 f5 f6
Thread three files are: f7
f8
f9
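
Note that the script above copies the three sets one after another. To run the copies at the same time, each set can be started as a background job and the script can then wait for all of them. A minimal sketch, assuming the list files and directories produced above (thread1, thread2, thread3 and dir1, dir2, dir3) and file names without whitespace; `copy_sets` is a hypothetical helper, and the "threads" here are really separate shell processes:

```shell
#!/bin/sh
# Copy the files named in each list file into its target directory,
# one background job per list, then wait for all jobs to finish.
# copy_sets is a hypothetical helper; arguments are pairs: <listfile> <targetdir> ...
copy_sets() {
    while [ $# -ge 2 ]
    do
        list=$1
        dest=$2
        shift 2
        mkdir -p "$dest"
        # Each subshell acts as one "thread": it copies its own set sequentially.
        (
            for f in `cat "$list"`
            do
                cp "$f" "$dest"/
            done
        ) &
    done
    wait    # block until every background copy job has finished
}
```

It would be called as `copy_sets thread1 dir1 thread2 dir2 thread3 dir3`. Whether this is actually faster than a sequential copy depends on the disks or network involved, since the jobs compete for the same IO.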