Split a folder with a huge number of files into n folders

We have a folder XYZ with a large number of files (>350,000). How can I split the folder and create, say, 10 of them, XYZ1 to XYZ10, with 35,000 files each? (It doesn't matter which files go where.)

Try something like:

cd XYZ || { echo "directory does not exist" >&2; exit 1 ;}
n=0
for i in *
do
  if [ $((n+=1)) -gt 10 ]; then
    n=1
  fi
  todir=../XYZ$n
  [ -d "$todir" ] || mkdir "$todir" 
  mv "$i" "$todir" 
done

Try it out on a test directory first and/or prepend echo to the mv command to see if it works as desired...
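To illustrate, here is a self-contained dry run of the same round-robin loop on a throwaway directory (the /tmp path, filenames, and count of 25 are all invented for demonstration):

```shell
# Build a scratch tree with 25 dummy files (names are arbitrary).
rm -rf /tmp/rrdemo
mkdir -p /tmp/rrdemo/XYZ
cd /tmp/rrdemo/XYZ
for f in $(seq 1 25); do : > "file$f"; done

# Same round-robin logic as above.
n=0
for i in *
do
  if [ $((n+=1)) -gt 10 ]; then
    n=1
  fi
  todir=../XYZ$n
  [ -d "$todir" ] || mkdir "$todir"
  mv "$i" "$todir"
done

# With 25 files and 10 targets, XYZ1..XYZ5 end up with 3 files
# each and XYZ6..XYZ10 with 2 each; XYZ itself is left empty.
```

The counts per directory are deterministic regardless of the glob order, since only the cycle position matters.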


Thanks. I have something similar. But I was wondering if there is a way to do this without the explicit loop and moving one file at a time. This is too slow and we have to perform this operation on multiple folders of the given size.

What OS are you using?

Do any of the 350,000 files you're moving have names that contain any space, tab, or newline characters?

Aren't 350,000 files too many arguments for for i in * ?
Safer and faster is

find . -type f |
while IFS= read -r i
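Fleshed out into a complete sketch (the /tmp paths and file count are invented), with IFS= and read -r so that leading blanks and backslashes in names survive the pipe; a filename containing a newline remains the one case a line-oriented pipeline cannot handle. Note that -maxdepth 1 is a GNU find extension:

```shell
# Scratch tree for demonstration.
rm -rf /tmp/finddemo
mkdir -p /tmp/finddemo/XYZ
cd /tmp/finddemo/XYZ
for f in $(seq 1 12); do : > "file$f"; done

# Round-robin the files found by find over XYZ1..XYZ10.
n=0
find . -maxdepth 1 -type f |
while IFS= read -r i
do
  if [ $((n+=1)) -gt 10 ]; then
    n=1
  fi
  todir=../XYZ$n
  [ -d "$todir" ] || mkdir "$todir"
  mv "$i" "$todir"
done
```

With 12 files, XYZ1 and XYZ2 receive two files each and XYZ3..XYZ10 one each.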

Hi Don,

I am on Ubuntu Server 12.04 LTS. The files do not contain any special characters except _

Actually the format is something like

<xyz>_n1_n2_n3.pbx

where <xyz> is a prefix containing only alphabetic characters [a-zA-Z]
n1 is a number less than 10
n2 is a number less than 1000
n3 is a number less than 10000
pbx is the extension


Thanks. Although I did not get an error for too many files with for i in * , find seems faster and, as you mentioned, safer if the number of files increases.

In theory, no. The for i in * is processed entirely inside the shell without having to create an argument list for an external command, as it would for ls * (which would almost certainly fail by hitting ARG_MAX limits).

Another possibility would be something like set -- * which would then allow you to use $# to split groups of files instead of going round-robin. Invoking mv for each file to be moved is going to be much slower than invoking mv to move groups of files.
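A minimal sketch of that idea, assuming filenames without embedded newlines (the /tmp paths and the batch size of 4 are made up for illustration; in practice the batch could be hundreds of files):

```shell
# Scratch tree: 10 dummy files to be moved in batches.
rm -rf /tmp/setdemo
mkdir -p /tmp/setdemo/XYZ /tmp/setdemo/out
cd /tmp/setdemo/XYZ
for f in $(seq 1 10); do : > "file$f"; done

set -- *                 # $1..$N now hold every filename, $# their count
while [ $# -gt 0 ]; do
  if [ $# -ge 4 ]; then
    # One mv invocation moves a whole batch of four files.
    mv -- "$1" "$2" "$3" "$4" ../out
    shift 4
  else
    # Fewer than four left: move the remainder in one go.
    mv -- "$@" ../out
    shift $#
  fi
done
```

Quoting "$1" etc. keeps names with spaces intact, and set -- * never leaves the shell, so ARG_MAX is not an issue for building the list itself.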

If filenames are of the form <xyz>_n1_n2_n3.pbx, would it make more sense to create a directory for each n2 value rather than ten directories with an arbitrary spread of files into those new directories? If part of the filename can be used to determine which directory to search, it should be much faster to find needed files after they have been rearranged. If this is of interest to the submitter, we could help come up with a way to do it.

For search/listing efficiency you should try to avoid directories with more than 32,000 files, and this is why I guess you want to move these files to subfolders. However, 10 subdirectories will still leave you above this limit - why not go with 100 folders rather than 10? That way you're down to 3,500 files per folder and have plenty of room to move.

As Don Cragun mentioned using the last (or first) two digits of n2 could offer a nice logical split, as long as the distribution is fairly even. Also you will be able to easily determine which directory any new files belong in.

I was actually suggesting 1000 directories, where the name of the directory is the value of n2 extracted from the file's name.
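The n2 extraction itself only needs two parameter expansions. A minimal sketch with invented sample filenames following the <xyz>_n1_n2_n3.pbx pattern:

```shell
# Invented sample files matching <alpha>_n1_n2_n3.pbx.
rm -rf /tmp/n2demo
mkdir -p /tmp/n2demo/src
cd /tmp/n2demo/src
: > abc_1_42_7.pbx
: > abc_2_42_8.pbx
: > def_3_511_9.pbx

for i in *_*_*_*.pbx
do
  n2=${i%_*}       # drop the trailing _n3.pbx   -> abc_1_42
  n2=${n2##*_}     # drop everything up to the last remaining _ -> 42
  mkdir -p "../$n2"
  mv "$i" "../$n2"
done
```

Both files with n2=42 land in ../42 and the third in ../511, so any new file's destination can be computed the same way.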

If characteristics of filenames are used, then in addition to the format there would still need to be a reasonable understanding of the distribution of filenames along the filename-parts that are chosen as bins, otherwise some of them may still end up being too full.

----
Since you are using Ubuntu, an entirely different alternative might be to leave the files as-is and use locate and updatedb , but of course that would not be adequate for files newer than the last update.

-----

As Don mentioned: Safer? No. There is no limitation like ARG_MAX, since there are no external programs that arguments are being passed to. In theory find-and-pipe is slightly less safe, since it will not work for file names with newlines in them, but this is mostly theory since in practice I for one have never encountered files like that, other than the ones I had created myself for testing purposes...

Another way to speed up could be to do the moves in the background:

cd XYZ || { echo "directory does not exist" >&2; exit 1 ;}
n=0
for i in *
do
  if [ $((n+=1)) -gt 20 ]; then
    n=1
    wait
  fi
  todir=../XYZ$n
  [ -d "$todir" ] || mkdir "$todir" 
  mv "$i" "$todir" &
done
wait

--
Probably the best way, though, would be to build randomly or serially selected lists of file names and feed them to mv operations to specific directories, while observing ARG_MAX.
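One way to sketch that with standard tools is to let xargs do the batching, since it already keeps each command line under the system limit. The -print0/-0 pair and mv -t are GNU extensions, which is fine on Ubuntu; the /tmp paths and the cap of 25 names per batch are invented for demonstration:

```shell
# Scratch tree with 50 dummy files.
rm -rf /tmp/xargsdemo
mkdir -p /tmp/xargsdemo/XYZ /tmp/xargsdemo/XYZ1
cd /tmp/xargsdemo/XYZ
for f in $(seq 1 50); do : > "file$f"; done

# xargs batches the argument list itself (here capped at 25 names per
# mv call with -n); NUL delimiters make the pipeline safe for any
# filename, newlines included, and mv -t names the destination up
# front so the file list can follow.
find . -maxdepth 1 -type f -print0 |
  xargs -0 -n 25 mv -t ../XYZ1
```

Without -n, xargs packs as many names as ARG_MAX allows into each mv invocation, which is usually what you want for raw speed.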

If there is a relatively even distribution of files over the different values of n2 in filenames of the form <alpha>*_n1_n2_n3.pbx (where <alpha>* is a string of one or more alphabetic characters, n1 is a single decimal digit, n2 is one to three decimal digits, and n3 is one to four decimal digits), and either:

  1. n2 contains leading zeroes and you have a 1993 or later version of ksh, or
  2. n2 does not contain leading zeroes and you have a 1993 or later version of ksh or a version of bash that expands ${!arr[@]} to a list of the subscripts used in the array arr[] ,

then the following might do what was requested more efficiently:

#!/bin/ksh
IAm=${0##*/}
ec=0		# Final exit code.
mvc=100		# Maximum # of files to move in one invocation of mv.  (Adjust
		# to fit your environment based on the actual length of your
		# filenames, the value of ARG_MAX on your system, and the amount
		# of data being passed through environment variables when you
		# invoke the mv utility.)
typeset -A d	# Use string values (not numeric values) as subscripts.  Note:
		# This only works with ksh93.  This avoids having a string like
		# 010 treated as an octal value and being converted to decimal 8.
		#
		# For filenames of the form: <alpha>*_<digit>_n2_n3.pbx
		# where n2 is 3 decimal digits (with leading zero fill) or 1 to
		# 3 decimal digits with no leading zeros, and n3 is 1 to 4
		# decimal digits.
		#
		# If n2 contains leading 0 fill, this typeset is required.  If
		# there are no leading 0s in n2, this typeset can be left out
		# and this script will work with both bash and 1993 or later
		# versions of ksh.
cd src
for i in *_*_*_*.pbx
do	# Extract 3rd component of filename:
	n2=${i%_*}	# Remove _*.pbx from end of filename.
	n2=${n2##*_}	# Remove *_*_ from start of filename.
	d[$n2]=		# Add extracted value to list of directories to create.
done
if [ "${!d[*]}" = "*" ]
then	printf "%s: No files matching *_*_*_*.pbx found in %s\n" "$IAm" "$PWD" >&2
	exit 1
fi
for i in ${!d[@]}
do	printf "Processing files to go to directory: %s\n" $i
	# Create the directory if it doesn't already exist.
	[ ! -d ../$i ] && mkdir ../$i
	# Initialize number of files found for this directory and list of names.
	n=0
	p=
	for j in *_*_${i}_*.pbx
	do	p="$p $j"
		if [ $((++n)) -ge $mvc ]
		then	if mv $p ../$i 
			then	printf "moved %d files to ../%s\n" $n $i
				n=0
				p=
			else	# mv already printed a diagnostic, note error
				ec=1
			fi
		fi
	done
	# If we have files that weren't already moved in the loop, move them now
	if [ $n -gt 0 ] && mv $p ../$i
	then	printf "moved %d files to ../%s\n" $n $i
	else	# mv already printed a diagnostic, note error
		ec=1
	fi
done
exit $ec

which groups files to be moved by destination directory and moves up to a hundred files (although you can easily choose a larger or smaller number) with each invocation of mv .


There is a tool called 'fpart' which can be used for the first step of this (taking a list of files and splitting it into X groups). Pushing that list through xargs and mv shouldn't be too difficult. I'm about to do this myself, so I will post the script when I have it. EDIT: There is a section on "migrating data" in the fpart README, which might be enough.