Increase the performance of find command.

mohtashims · December 7, 2019, 6:52am

I'm trying to exclude 'BACKUP', 'STORE', 'LOGGER' folders while searching for all files under a directory "/tmp/moht"

Once a file is found I wish to display the filename, the size of the file & the cksum value.

Below is the command I'm using:

/opt/freeware/bin/find /tmp/moht -type d -name 'BACKUP' -prune -o -type d -name 'STORE' -prune -o -type d -name 'LOGGER' -prune -o -type f -exec cksum {} \;

Output:

  701567198 47034 /tmp/moht/UPLOAD_DATA_OLD/WINTER/CORE14_46000.txt
  1165791713 39019 /tmp/moht/UPLOAD_DATA_OLD/CORE14_530000.txt
  3448997243 35258 /tmp/moht/UPLOAD_DATA_OLD/CORE14_487300.txt
  .......
  .......
  4294967295 0 /tmp/moht/UPLOAD_DATA_OLD/TEST/CORE14_613500.txt
  2875732103 46516 /tmp/moht/NEW/CORE14_753200.txt
  1525766291 46064 /tmp/moht/UPLOAD_DATA_OLD/CORE14_849300.txt
  2315828286 46532 /tmp/moht/UPLOAD_DATA_OLD/CORE14_902400.txt

Although the performance, i.e. the time taken by the above command, is reasonable, I wish to understand if there is any scope for performance improvement.

One thing I guess may help somewhat is:

cd /tmp/moht; /opt/freeware/bin/find . -type d -name "BACKUP" -prune -o -type d -name "STORE" -prune -o -type f -exec cksum {} \;

I'm on AIX 6.1.

Suggestions / recommendations are appreciated.

RudiC · December 7, 2019, 8:57am

Compare performance of

cksum /tmp/moht/* | grep -v "BACKUP\|STORE\|LOGGER"

mohtashims · December 7, 2019, 9:22am

But you have not considered the file size. Can you please include that in your answer?

Also note that the files should be searched recursively under subfolders.

jim_mcnamara · December 7, 2019, 9:28am

This standard library call, nftw (or ftw), documented in the IBM Knowledge Center, supports the find command traversing directory file trees, i.e., searching and locating files.

Assuming you want to keep the command you already have (and I am not sure that Rudi's suggested test is valid because of file and directory caching):

A limiting factor is known to be the number of sub-directories in the file tree, and possibly the number of available open file descriptors - a per process limit.
If you can parallelize your code using several processes it may improve performance. I'm not sure this will help much because it depends on the number of sub-directories being large to gain any benefit. The developers who write system code try to maximize throughput.
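The several-processes idea could be sketched roughly as below. This is only an illustration: parallel_cksum is a made-up helper name, it assumes mktemp is available, and the * glob skips hidden entries at the top level. Whether it beats a single find depends on how the tree is spread over the top-level directories.

```shell
# Hypothetical helper: one background find per top-level entry of the tree.
parallel_cksum() (
    top=$1
    out=$(mktemp -d)    # one result file per job keeps output lines from interleaving
    n=0
    for entry in "$top"/*; do
        n=$((n + 1))
        if [ -d "$entry" ]; then
            # An excluded top-level directory is pruned by find itself.
            find "$entry" -type d \( -name BACKUP -o -name STORE -o -name LOGGER \) -prune \
                -o -type f -exec cksum {} + > "$out/$n" &
        elif [ -f "$entry" ]; then
            cksum "$entry" > "$out/$n"    # plain files at the top: no extra job needed
        fi
    done
    wait                # block until every background find has finished
    cat "$out"/* 2>/dev/null
    rm -rf "$out"
)

# Usage: parallel_cksum /tmp/moht
```

The output order differs from a single find run, so sort it if you need to compare results.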

What I'm saying is: performance enhancement work is subjective and often a misplaced resource and a waste of programmer time.
Suppose your command runs in one minute in production. Then you work hard and get it down to 35 seconds. The user perception of "slow" will still be there, so you have to get it down to maybe 6 seconds to make users happy and see it as "faster". In this case getting an order of magnitude improvement may not be possible.

And in this case you would have to do something about directory caching messing up testing, because (you can check this yourself) once you open a directory the system caches it for speedier access. Use the time command and rerun the command to see what I mean:

time [my long command goes here]
# write down the result
time [my long command goes here]
# write down the result and compare the two resulting times

MadeInGermany · December 7, 2019, 9:29am

The shorter pathnames are a small improvement, and only when post-processing the output.
Then, you can bundle the names (this shortens the command, not so much the run time).
But a + instead of the \; will have an impact: find then runs cksum with many collected arguments, so fewer runs are needed.

cd /tmp/moht && find . -type d \( -name 'BACKUP' -o -name 'STORE' -o -name 'LOGGER' \) -prune -o -type f -exec cksum {} +
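To see the difference between \; and + without touching real data, something like the following should do. It is a small sketch on a throwaway tree; the file names are invented, mktemp is assumed to be available, and an echo stands in for cksum so the runs can be counted.

```shell
# Build a small throwaway tree (hypothetical file names) to count invocations.
demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b"
touch "$demo/a/f1" "$demo/a/f2" "$demo/b/f3"

# With \; find starts the command once per file: three runs for three files.
per_file=$(find "$demo" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# With + find appends as many pathnames as fit to one command line,
# so the same three files are handled by a single run.
batched=$(find "$demo" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

echo "per-file runs: $per_file"
echo "batched runs: $batched"

rm -rf "$demo"
```

On a large tree the saving is the cost of starting one cksum process per file.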

Further, compare the speeds of the /usr/bin/find and the freeware find.

RudiC · December 7, 2019, 12:00pm

File sizes are included in my cksum. For climbing down the dir tree, try

cksum * */* */*/* |& grep -v "BACKUP\|STORE\|LOGGER\|cksum"
268795035 355 file1
113460914 19 file2
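For anyone unsure which column is which: cksum prints the checksum, then the size in bytes, then the pathname. A minimal sketch that relabels the fields, assuming mktemp is available (cksum_report and the two file names are made up for the illustration):

```shell
# Hypothetical helper: relabel the three cksum fields for every file in a directory.
cksum_report() {
    cksum "$1"/* | while read -r crc size name; do
        printf '%s: %s bytes, checksum %s\n' "${name##*/}" "$size" "$crc"
    done
}

demo=$(mktemp -d)
: > "$demo/empty.txt"                # zero-length file
printf 'hello\n' > "$demo/six.txt"   # six bytes, newline included
cksum_report "$demo"
rm -rf "$demo"
```

The zero-length file reports 0 bytes and checksum 4294967295, the same pair shown for TEST/CORE14_613500.txt in the output of the first post.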

drl · December 7, 2019, 2:09pm

Hi.

Indeed. The first question one needs to answer is: "Does it have to be faster?" Otherwise you are spending time that probably could be better spent elsewhere.

That being said, I have been [trying to] learn rustc, and have compiled a few programs that are very fast. One is fd. You can see benchmarks comparing it to standard find on its GitHub page, sharkdp/fd ("A simple, fast and user-friendly alternative to 'find'").

Depending on the options chosen, fd is faster by a factor of 5 up to 9, or even faster if one ignores hidden directories.

However, it would require you to either download a compiled binary, or download the Rust system and compile fd yourself. I don't see a version for AIX, so this is academic.

I suppose if enough folks asked for Rust to be ported to platforms like Solaris, AIX, etc., it might happen. It might be worth a try if one really, really wanted that extra bit of speed.

I'll take the speed if it's easy to do and I really need it, but otherwise I have other stuff to do.

Best wishes ... cheers, drl