How to delete a huge number of files at a time

I ran into a problem on HP-UX with 64 GB of RAM and 20 CPUs.

There are 5 million files, named file0000001.dat through file9999999.dat, in the same directory, along with some other files with random names.

I was trying to remove all of the files from file0000001.dat to file9999999.dat at once.

If I used 'rm file???????.dat', trying to remove them all in one command, I got an error.

If I instead used 'rm file1??????.dat', trying to remove 10% of them at a time, the shell did not respond for several hours and I had to kill the process.

Can any filesystem expert help? Or is it possible to do this at all?

Thanks a lot!

What version of HP-UX?
What type of filesystem on what physical disc arrangement?
Is the filesystem mirrored?
Is NFS or anything slow involved?
What is the approximate total size of the files to be deleted?
What is the size of the directory file and how many inodes?

ls -lad /directory_name
df -i /directory_name

What was the "error output" mentioned above?

Are there any subdirectories under the directory containing these files?
i.e. Does this "find" command find all of the wanted files, without unwanted hits and without pointless searching?

find /directory_tree -type f -name file\?\?\?\?\?\?\?\.dat -print

This might help: I suspect you're running into a shell limitation on the length of the argument list.
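If that is the cause (the usual symptom is an "arg list too long" error, which is an assumption here since the exact error wasn't posted), the limit can be checked with getconf, which both HP-UX and Linux provide:

# Maximum combined length of arguments and environment allowed for a
# single exec(); a glob expanding to millions of names easily exceeds it.
getconf ARG_MAX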


Let it run. Deleting that many files will take a very long time. It probably did delete some of them. Do "ls file1??????.dat | wc -l" to count how many are left.

This will give you some output while deleting, so you can monitor the progress:

ls file???????.dat | while read file
do
  echo "Deleting file $file"
  rm "$file"
done

That is adding half a million fork() calls to a procedure that is already painfully long. I must advise against that. Once an hour or so he can count the remaining files using the command I gave. (Put the delete in the background or use a second window.)

That'd probably die with 'too many arguments', too, just like the OP's attempt did. The answer to 'too many arguments' is not to cram them all into ls instead; the answer is to not use that many arguments, because you cannot cram an unlimited number in there. On some OSes the limit is surprisingly small.
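One way to count what's left without building a long argument list at all (a sketch; find generates the names itself and pipes them to wc, so nothing ever hits the exec limit, though it also descends into any subdirectories):

# Count the matching files; no huge command line is ever built.
find . -type f -name 'file???????.dat' | wc -l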

xargs could be used to cut the workload into pieces, but the trouble with (non-GNU) xargs is that it does not work for file names containing spaces, since the input delimiter cannot be specified.
But how about something like this:

ls | grep '^file.......\.dat$' |
( IFS="
"
  set --
  i=0
  while read -r file
  do
    set -- "$@" "$file"
    if [ $((i+=1)) -ge 32 ]; then
      rm "$@"
      i=0
      set --
    fi
  done
  # remove whatever is left in the last, partial batch
  [ $# -gt 0 ] && rm "$@"
)

The number 32 could be changed to 16 or some other number.
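For comparison, a sketch of the same batching done with plain xargs, which is fine in this particular case because the fileNNNNNNN.dat names contain no whitespace:

# xargs builds each rm command line itself, 32 names at a time,
# so the exec limit is never exceeded.
ls | grep '^file.......\.dat$' | xargs -n 32 rm -f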

This should probably work too:

( IFS="
"
  set --
  i=0
  for f in file???????.dat
  do
    set -- $@ "$f"
    if [ $((i+=1)) -ge 32 ]; then
      ls $@
      i=0
      set --
    fi
  done
  ls $@
)

The above code is not suitable for 5 million files for three reasons:
1) The expanded "ls" command will still be too long.
2) The "ls" program sorts the filenames to alphabetical order - a massive overhead in this circumstance.
3) It does not correctly deal with filenames containing space characters.
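For what it's worth, a sketch that sidesteps all three points, building on the find command from the earlier questions (assuming a POSIX find that supports -exec ... {} +):

# find builds the batches itself: no giant argument list, no sorting,
# and each name is passed as its own argument, so spaces are harmless.
find /directory_tree -type f -name 'file???????.dat' -exec rm -f {} +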

Any chance of an answer to my earlier questions?

@sandholm is quite correct:

xargs is your friend.
And so is grep.

If you want to delete the files and monitor how many are left:

cd directory
ls -f1 | grep '^file.......\.dat$' | xargs -P 4 rm -f &
while sleep 10; do
  ls -f1 | grep -c '^file.......\.dat$'
done

With a large number of files, the -f option keeps ls (on Linux; see man ls) from sorting. The order in which the files are deleted or counted should not matter.

This is a quick-n-dirty script, ymmv.


If the directory only contains these files, it would be easier to:

rm -rf directory
mkdir directory
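If you go that route, note the directory's owner, group, and mode first so it can be recreated the same way. A sketch with placeholder names and values:

ls -ld /path/to/directory                      # note owner, group and mode
rm -rf /path/to/directory
mkdir /path/to/directory
chown someuser:somegroup /path/to/directory    # placeholders for the noted owner/group
chmod 755 /path/to/directory                   # placeholder for the noted mode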