Recursive directory search using ls instead of find

I was working on a shell script and found that the find command took too long, especially when I had to execute it multiple times. After some thought and research I came up with two functions.
fileScan()
fileScan() will cd into a directory and perform any operations you would like from within it.
directoryScan()
directoryScan() will recursively cd into all directories beneath an initial, provided root directory. Once in a new directory, the directory is sent to fileScan() so that other functions can be executed.

I found that this is blazing fast compared to find, especially when searching large directory trees or when having to run more than one find in a script or cron job.

enjoy the code :b:

#!/bin/bash
# Directory scanner using recursive ls instead of find
# Do not make any of the local variables into globals:
# folder, numdirectories, and x should not be used outside fileScan() and directoryScan()
# directoryScan() will cd into all directories below the "root" directory passed to it
# fileScan() will perform operations on any directory passed to it
# Note: pass absolute paths; the [ "$folder" = "$PWD" ] test assumes them
fileScan()
{
    local folder=$1
    cd "$folder" || return
    if [ "$folder" = "$PWD" ]
    then
        # You are now inside the directory.  Do any operations you need
        # to do with files that may exist in this directory.
        :    # placeholder; a then-block must contain at least one command
    fi
}
directoryScan()
{
    local folder=$1
    cd "$folder" || return
    if [ "$folder" = "$PWD" ]
    then
        # Count the subdirectories of the current directory
        local numdirectories=$(ls -lS | egrep '^d' | wc -l)
        fileScan "$folder"
        local x=1
        while [ "$x" -le "$numdirectories" ]
        do
            # Pull the x-th subdirectory name out of the listing
            subdirectory=$(ls -lS | egrep '^d' | sed "s/[ \t][ \t]*/ /g" | cut -d" " -f9 | head -n "$x" | tail -n 1)
            subdirectory="${folder}/${subdirectory}"
            directoryScan "$subdirectory"
            x=$(($x + 1))
            cd "$folder" || return
        done
    fi
}
# sample call to directoryScan()
# directoryScan "$rootdirectory"
# sample call to fileScan()
# fileScan "$scandirectory"

Hi, newreverie:

Welcome to the forum.

I'd be interested in seeing what it is that this shell script is blazingly fast compared to. I'm inclined to believe that your find solution was suboptimal if that shell script, executing those pipelines for each visited directory, is faster.

If you are not familiar with AWK, you might enjoy the challenge of learning enough of it to simplify the egrep|sed|cut|head|tail pipeline to one concise AWK invocation.
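
For example, something along these lines (just a sketch, not newreverie's code; it assumes the usual ls -l layout where the name is the last field and contains no spaces):

# Print the name of the Nth directory entry in the ls -lS listing,
# replacing the egrep|sed|cut|head|tail chain with a single awk process
ls -lS | awk -v n="$x" '/^d/ && ++count == n { print $NF; exit }'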

Performance and efficiency aside, there are some potentially serious issues with that code. One that stands out: if a directory is deleted between the time $numdirectories is calculated and the time the subsequent while loop concludes, entire subtrees of the hierarchy will be visited more than once (a result of the input to head being shorter than expected). Depending on what's being done with each of the files, this could be a deal breaker.

Again, welcome to the forum and thanks for the contribution.

Regards,
Alister

I am also quite sceptical.

By the way, ls performs an ASCII sort by default. If you are in a directory with several thousand files, this sorting operation can be costly and may slow down the processing.

To avoid it you can use the -f option to get the entries in the order they come out of the directory structure. This avoids useless sorting, especially when you pipe your ls output into wc -l.
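
For example, something like this (just an illustration; keep in mind that -f also turns on -a, so . and .. end up in the count):

# Sorted: every name is sorted before wc ever sees it
ls | wc -l
# Unsorted: entries come back in directory order; subtract 2 if you
# don't want . and .. counted
ls -f | wc -l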

A lot of performance problems are caused by weak algorithm logic or a weak approach; going through a redesign step can then speed up processing considerably.

I am curious to see the code of the initial "poor performance" script that was using the find command.

Still, sharing your code is a nice gesture.

Here are some examples of performance problems caused by wrong logic or bad use of the find command:

find might be faster if I sent the results into an array or a text file and then looped through those results in my program.

My issue with the find command had more to do with the time it took to run to completion. Given the large directory structure and the variety and types of files I needed to search for, the find command took several minutes or more to run to completion.

The particular shell script I was writing has a UI, so the user is forced to wait several minutes or more between executing any search and being able to work with the results of that search. This was deemed unacceptable, and so a method was needed to execute searches closer to real time and allow the user to interact with files as they are found.

find could still be used in the fileScan() function with the -prune option to search only within the current directory. But I left the options open in that function to suit your purposes.
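
Something like this, for example (a sketch, not what I had in the original script; -maxdepth is a GNU/BSD extension, and the -prune form is the portable equivalent):

# GNU/BSD find: list only the regular files directly inside $folder
find "$folder" -maxdepth 1 -type f
# Portable form using -prune: start at $folder/. and prune every
# subdirectory before find can descend into it
find "$folder"/. ! -name . -prune -type f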

So perhaps I overstated the net speed of the functions in relation to find. find may work faster overall, but if a user is faced with waiting for a find command to run to completion versus the ability to interact with the results of a search in near real time, I believe this is a better method.

As for the comment about directory deletion while this script is running, I can see the pitfalls, but it can also be avoided by making subdirectories into a local array and storing the results of an ls there, without using the head and tail method. Attempts to cd into a nonexistent directory would be handled by the if [ $folder = $PWD ] logic.
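
Roughly like this, for instance (a sketch of how that part of directoryScan() could look; I've used a glob here rather than parsing ls, which also sidesteps names containing spaces):

# Capture the subdirectory names once, up front, into a local array
local -a subdirs
subdirs=( */ )                          # expands to every subdirectory, with a trailing /
if [ -d "${subdirs[0]}" ]               # guards against the literal */ when there are none
then
    for subdirectory in "${subdirs[@]}"
    do
        directoryScan "${folder}/${subdirectory%/}"
        cd "$folder" || return
    done
fi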

One of the phenomena I have noticed over my years of being involved in Unix/Linux is that people tend to over-use shell scripting.

The problem with shell scripting is, simply, performance. It's one thing to accomplish small to medium tasks with a shell script. But once you begin doing serious processing work, involving tight loops of text processing, you will very quickly run into trouble. The reason is that most things done in a shell script are done by small programs - cut, sed, awk, head, tail, etc. When you combine dozens of these in a loop that runs heavily, the computer has to launch THOUSANDS of tiny programs to accomplish the overall task.

I have seen large powerful Unix systems brought to their knees by simple DB loader scripts done in ksh for this very reason.

The solution is to use a more appropriate software tool to solve the problem. If you really want to do it in shell scripting style, why not try it in perl or python? These languages allow you to write a single program to accomplish the task, with no spawning of child programs required. This will vastly speed up the program. I know this to be a fact, because I've had a simple perl file-and-text-search program in my toolbox since 1994.

One of the phenomena I have noticed over my years of being involved with UNIX/Linux is that people tend to blame poor shell scripts on the language.

The program above isn't slow because it's shell. It's slow because of things like these:

ls -lS | egrep '^d' | sed "s/[ \t][ \t]*/ /g" | cut -d" " -f9 | head -n $x | tail -n 1

Six programs and five pipes, to do something you could've done in two or fewer! How find could be slower I can't imagine -- perhaps he didn't realize find is recursive?
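
For comparison, the entire recursive walk the script rebuilds by hand is what find does in one process (a minimal sketch; swap -type d for whatever test is actually needed):

# One process visits every directory under $rootdirectory
find "$rootdirectory" -type d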

If you'd been programming shell for years, you ought to know:

1) awk can make a decent replacement for all the tools you listed above -- even in combination -- being capable of quite complex programs in its own right. Putting it in the same class as head, tail, etc. is a bit of a misnomer, and jamming it in the middle of a long pipe chain is generally misuse: awk can often replace the entire chain, sometimes the entire script.

2) It's often not necessary to run thousands of tiny processes to accomplish a single task; people have simply chosen to do so. Efficient use of pipes and external programs is a powerful feature, but too often it's abused, causing terrible performance.

Funny thing -- I've done that with Perl. I've also done it in assembly language. It's possible to write terrible code in any language. :wink:

Because those don't resemble shell languages? Someone who writes a shell script precisely the same way they'd write a perl or python one isn't utilizing the shell's important features.

Did you know many modern shells have regular expressions, can do substrings and simple text replacement, can pipe text between entire code blocks, can read line by line or token by token and split lines on tokens, can open/close/seek in files, etc, etc, etc -- all as shell builtin features?

All too often, people don't, and use thousands of tiny external programs instead.

The trick is to do large amounts of work with each process you create; never use them for anything trivial.
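
A few illustrations of the builtin features I mean, in bash (everything below runs without spawning a single external process; the file name is just an example):

name="report-2011.txt"
[[ $name =~ ^report-([0-9]+)\.txt$ ]] && echo "year: ${BASH_REMATCH[1]}"   # regular expressions
echo "${name:0:6}"                      # substring
echo "${name/.txt/.log}"                # simple text replacement
while IFS=: read -r user _ uid _        # read line by line, split on a token
do
    echo "$user has uid $uid"
done < /etc/passwd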

@Corona688: You make very good points here, and I believe you are essentially affirming my main point - don't put lots of little programs together in tight loops and expect good performance from a shell script.

Honestly though -- and I know this is just my personal opinion -- I think awk is a scourge upon our land. Like kudzu, it should be ripped out wherever it is found and replaced with something less inscrutable. I don't think there is any reason in 2011 to continue using such an ancient, arcane, difficult-to-debug tool like awk when there are so many better choices available. - I'm Just Sayin'... :smiley:

I thought so for years, then spared the time to learn it. Now I can't do without it. It's not inscrutable, just not the same old procedural language you've relearned umpteen times.

People who live in perl houses shouldn't complain about arcane and difficult and hard to debug. I learned awk in two weeks.

awk is the first real language I'm learning.
So far, awk has proven to be invaluable for unix administration.
Not only for printing and formatting, but also for a better understanding of other languages (C and Java).

It is hard (since I have no programming experience other than shell & SQL), but I believe mastering it will enable me to grow :smiley:

Other than that, those small unix utilities we love are extremely optimized, since they were written in the days when machines had a couple of KB of memory and slow CPUs.
In today's multicore, multi-gigabyte environments those tools tend to do the job without any real overhead on the machine.

Just my two cents.
Regards
Peasant.