Fast processing (mv command) of 1 million+ files using find, mv and xargs

Hi, I'd like to ask if anybody can help improve my code to move 1 million+ files from one directory to another:

find /source/dir -name "file*" -type f | xargs -I '{}' mv {} /destination/dir

I learned this line of code from this forum as well and it works fine. However, file movement is kinda slow; about 1-2 files per second. At this rate, it may take days to move the files. I don't have much background with xargs yet, so I was wondering if there is a faster way to accomplish this.

Here are some more details:
-OS is HP-UX.
-The files in /source/dir are continually being added.
-Size per file is around 300-1000kb.
-Filename pattern includes YYYYMMDD date (might prove useful for batch processing).
-A plain mv with a wildcard already runs into "arg list too long," hence the use of find/xargs.
-/source/dir has no sub directories.
-After moving the files, I would later divide/mv them into different dirs corresponding to their YYYYMMDD date (rough idea sketched below).
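
Here's roughly what I have in mind for that later split. It's only a sketch -- it assumes each filename carries exactly one YYYYMMDD run of digits that sed can pull out; the real name layout may differ:

# sketch only: split files in /destination/dir into per-date subdirs
cd /destination/dir || exit 1
for f in file*
do
    # grab the 8-digit run out of the filename (assumed to be YYYYMMDD)
    d=`echo "$f" | sed 's/.*\([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\).*/\1/'`
    case "$d" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;  # found a date
        *) continue ;;                                # no date in the name, skip it
    esac
    mkdir -p "$d"
    mv "$f" "$d"/
done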

Hope the above info helps. Any advice would be greatly appreciated.

Thank you.

Were /source/dir and /destination/dir on the same file system, mv would just rename the files and not move a single byte.
Had you not said new files drop in constantly, I'd have tried to rename the directory itself (same file system!).
The way it is, I'm afraid I'm out of ideas.
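
Just to illustrate what that would have looked like (purely hypothetical here, since the dump never stops; the batch name and the follow-up mkdir are only a sketch):

mv /source/dir /destination/dir_batch1    # a single rename, no data copied
mkdir /source/dir                         # recreate it so the dump script can carry on
# restore ownership/permissions on the new /source/dir as needed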

Hi RudiC, yup, same file system. It would have been easier indeed to just rename the dir. :)

My main goal is just to redistribute the files to different dirs according to their YYYYMMDD. The mv to /destination/dir is just an extra step since we have more space there.

Are you sure it's one to two files per second?

$ ls | wc
   1993    1993   34863
$ time find . -name "*.*" -type f | xargs -I '{}' mv {} ../xxx
real    0m2.846s
user    0m0.668s
sys     0m2.104s
$ cd ../xxx
$ ls | wc
   1993    1993   34863

Seems like about 700 files per second. And this is running on kind of a dog of a Linux computer, nothing special. Unless your find command is taking days, maybe your operations are going faster than you think. :)

At 500 files per second, you could mv a million files in 2000 seconds, about 30 minutes.

Hi hanson44, yup, around 2 files per sec. :(
I have a counter on /destination/dir that executes ls | wc -l every 2 sec just so I can check the progress.
I'm thinking that since /source/dir already contains 1.2 million files (and still receiving more from an auto-dump script), it contributes to the slow processing.
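
(The counter itself is nothing fancy -- roughly the loop below, run in a separate session:)

# crude progress counter on the destination dir
while :
do
    ls /destination/dir | wc -l
    sleep 2
done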

Yes, perhaps the find command is the bottleneck. Maybe it has a hard time "dealing with" so many files.

What happens if you run the find command for a minute (forget about the mv part for the time being), save the output from find to a file, and see how many lines accumulate in the file?

If there are perhaps 120 lines after a minute (two per second), then find is the bottleneck. If there are tens of thousands of lines, then it's still a mystery.
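
Something along these lines would do for the test (the output file name is just an example):

# let find run for about a minute, then see how far it got
find /source/dir -name "file*" -type f > /tmp/find_test.out &
sleep 60
kill $!                       # stop find after roughly a minute
wc -l /tmp/find_test.out      # how many files it located in that minute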

Is there any chance you could just use 'ls' instead of find?

ls is guaranteed to perform badly here, because it must read the entire directory listing and sort the names before it can print anything. It might bog down for minutes or hours before it shows any output.

find doesn't have problems "dealing with" large numbers of files. In a sense find's job is rather simple -- opendir(), readdir(), print if match, loop until done. If it's struggling, that means it either has too much work to do -- finding 300 'good' files out of 1.2 million files you don't care about means scanning through all 1.2 million -- or the filesystem itself is responding slowly.

Small numbers of folders crammed full of millions of files generally perform rather badly, especially when already busy. The filesystem itself, rather than find, may be suffering here.