Counting files in a given directory

Hi all,

Need some help counting files... :slight_smile:

I'm trying to count the number of files in a given directory (and subdirectories) which reportedly contains "thousands" of files.

I'm using this:

ls -R | wc -l

However it's been an hour and looks like it's still running; there is no output at all.

Is there a faster/better way to count the exact number of files or get an approximate value?

Thanks.

Try:

find * -type f | wc -l

Ok, I cancelled the "ls" and ran a "find"... now it's been almost four hours and the command is still running. :eek:

Guess I'm out of luck.

I ran the find command on my entire root (/) and it finished in less than 2 minutes. In my case that is more than 600K files:

[root@aaa-build ~]# date
Tue Nov 16 10:01:22 IST 2010
[root@aaa-build ~]# time find / -type f | wc -l
662226

real    1m23.387s
user    0m1.149s
sys     0m6.536s

I'd advise you to run "find ${LOCATION} -type f | wc -l"

Well, I tried with 'find' yesterday and left it running for about 7 hours and it never finished. :frowning:

Apparently it's not just "thousands" of files, but several millions.

I'm still trying to figure out how to count those files. It would be enough for me if I could find out an approximate value, and not the exact number.

Any suggestions? Thanks.

---------- Post updated at 10:33 AM ---------- Previous update was at 10:12 AM ----------

Now I'm wondering, would the used inodes in the filesystem provide an approximate value of the total number of files? This is the output of df:

[root@atlas ~]# df -i
Filesystem               Inodes    IUsed    IFree      IUse%  Mounted on
/dev/mapper/volAvg-A1lv  15466496  4455023  11011473   30%    /export
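
If the inode count is a fair proxy, IUsed = 4455023 would mean roughly 4.5 million inodes in use on /export; that figure includes directories, symlinks and other non-file inodes, so the actual file count should be somewhat lower, but presumably in the same ballpark.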

Anything that counts files will have to do so the same way: by reading directory entries. So there's no special "faster ls". (If there were, why wouldn't we use it for everything?)

If you can compile on this machine, this program can provide a running total, updated once a second:

#include <stdio.h>
#include <time.h>

int lines=0;

int main(void)
{
	time_t timer=time(NULL);
	char buf[16384];
	while(fgets(buf, 16384, stdin) != NULL)
	{
		lines++;

		if((time(NULL)-timer) > 1)
		{
			fprintf(stderr, "\r%d", lines);
			timer=time(NULL);
		}
	}

	printf("%d\n", lines);

	return(0);
}

It only reads from stdin.
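
For example, assuming you save it as linecount.c and build it as linecount (names chosen here just for illustration), you would feed it the find output and watch the running total on stderr:

cc -o linecount linecount.c
find /export -type f | ./linecount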

---------- Post updated at 10:30 AM ---------- Previous update was at 10:26 AM ----------

Very approximate since it includes directory entries as well, but since counting 4 million files is going to be hard, it might have to do.


I compiled your code; somehow the counter is increasing really slowly (~200 files/min).

I think I can assume there are around ~4 million files. Hopefully this approximation will work fine for my process.

Thanks all for your help. :smiley:

This might be faster:

#include <stdio.h>
#include <ftw.h>

static long file_count = 0L;

int ftw_callback( const char *path, const struct stat *sb, int flag )
{
    if ( FTW_F == flag )
    {
        file_count++;
        /* print every 1000 files */
        if ( 0 == ( file_count % 1000 ) )
        {
            fprintf( stderr, "%ld\n", file_count );
        }
    }
    return( 0 );
}

int main( int argc, char **argv )
{
    int ii;
    for ( ii = 1; ii < argc; ii++ )
    {
        ftw( argv[ ii ], ftw_callback, 256 );
    }
    fprintf( stderr, "Final count: %ld\n", file_count );
    return( 0 );
}

That won't need to have another process reading the inode data and feeding it via a pipe. Compile it with the -m64 flag if you're on a 64-bit platform and you won't even need to worry if you have over 2 billion files....
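
For example, assuming the source is saved as achenle_counter.c (matching the binary name used in the timings below), it can be built and pointed directly at the directory:

cc -m64 -o achenle_counter achenle_counter.c
./achenle_counter /export/archives/2010/storage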


If it takes that long, maybe the machine is suffering from some other performance problem or bottleneck elsewhere ...

At 200 lines per minute I doubt my application's the bottleneck. On my system it was able to process tens of thousands of lines per second... I wonder just how badly this disk's fragmented.

Really large directories (lots and lots of entries) are very slow to read. Use the filesystem stats:

/* ffcnt.c - fast file count
   print the number of used and free inodes in a filesystem
   usage: ffcnt [pathname]
*/
#include <sys/types.h>
#include <sys/statvfs.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct statvfs st;
    unsigned long cnt = 0;

    if (argc != 2)
        fprintf(stderr, "usage: ffcnt [pathname]\n");
    else
    {
        if (statvfs(argv[1], &st) == -1)
        {
            perror("Cannot read filesystem data");
            exit(1);
        }
        /* used inodes = total inodes minus free inodes */
        cnt = (unsigned long)(st.f_files - st.f_ffree);
        printf("total files used %lu, free files %lu\n",
               cnt, (unsigned long)st.f_ffree);
    }
    return 0;
}
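
For reference, assuming the source is saved as ffcnt.c as in its header comment, compile it and point it at the mount point from the df -i output:

cc -o ffcnt ffcnt.c
./ffcnt /export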

And I learned a new system call, thanks. That's still inodes, not files, but close enough.

Thanks again for your help, here are the results:

Jim's code "instantly" reports the same values as df -i; so after all, I think I can use the inode count as an acceptable approximation.

achenle's code runs a little faster; I tried it first with a small directory like /opt with no issues:

[root@atlas ~]# time ./achenle_counter /opt
1000
2000
3000
4000
5000
6000
Final count: 6534
real    0m4.503s
user    0m0.010s
sys     0m0.161s

Now when I try to count the real directory it becomes slow again; I'm not sure why. E.g.:

[root@atlas ~]# time ./achenle_counter /export/archives/2010/storage
1000
2000
3000
4000
5000
 Ctrl^C
real    2m52.076s
user    0m0.138s
sys     0m8.671s

I also found that the directory in question, besides having millions of files, also has millions of directories (although I only want to count files). Could this be causing the slow counting?

Yes. find uses ftw() or nftw(); it opens every directory and stats every entry. When directories are large, this takes a long time. One directory with 1M entries can take literally minutes to process.
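
As an illustration, here's a minimal nftw() sketch (untested, and the fdcount name is just for this example) that counts files and directories separately. It walks the tree the same way, so it won't be any faster on a huge tree, but it keeps the two counts apart:

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>

static long files = 0, dirs = 0;

/* called once for every entry nftw() visits */
static int count_cb(const char *path, const struct stat *sb,
                    int typeflag, struct FTW *ftwbuf)
{
    if (typeflag == FTW_F)
        files++;            /* regular file */
    else if (typeflag == FTW_D)
        dirs++;             /* directory */
    return 0;               /* keep walking */
}

int main(int argc, char **argv)
{
    if (argc != 2)
    {
        fprintf(stderr, "usage: fdcount pathname\n");
        return 1;
    }
    /* FTW_PHYS: do not follow symbolic links */
    if (nftw(argv[1], count_cb, 256, FTW_PHYS) == -1)
    {
        perror("nftw");
        return 1;
    }
    printf("files: %ld  directories: %ld\n", files, dirs);
    return 0;
}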

Got it. I have gone through the ftw() documentation in IEEE Std 1003.1 to get a grasp of the concept.

Even though I was unable to count the exact number of files, this has helped me understand the issue. Thanks. :smiley:

Can a moderator take this thread to the GNU ls/find maintainers and ask them to provide a counter option? We all know that every admin counts files as part of daily operations and in almost every script. It would be a good addition to those command-line tools.

Why a mod? The mods here aren't on the GNU committee AFAIK. In other words your suggestion has as much clout as theirs.

The best way to get what you want is to make your own modifications and submit a patch for them. You're far more likely to get what you want when you do the work.

Is this filesystem physically attached to the computer on which you are running the "find" ? If it is actually attached to another computer I'd run the "find" there.

Normally "find" is much faster than "ls" because "find" does not sort the output.

Also, the plain ls command sorts its output by name by default, even when you skip aliases with \ls.

To avoid this you can use the -f option (availability may depend on your platform); this forces ls to display all entries WITHOUT sorting them ... which can be ... MUCH FASTER in some cases.

That is why, when piping to wc -l, and especially for directories containing numerous entries, the ls command should be used with such a "no sort" option.

---------- Post updated at 12:12 AM ---------- Previous update was at 12:08 AM ----------

Try this: ls -fR /users/home 2>/dev/null | wc -l

---------- Post updated at 12:12 AM ---------- Previous update was at 12:12 AM ----------

... or on any other PATH containing a bunch of entries

---------- Post updated at 12:15 AM ---------- Previous update was at 12:12 AM ----------

Look how fast it can be: more than 23,000 entries in less than a second!

[ctsgnb@shell ~/sand]$ date && ls -fR /users/home 2>/dev/null | wc -l && date
Sat Mar 19 17:14:16 MDT 2011
   23932
Sat Mar 19 17:14:16 MDT 2011
[ctsgnb@shell ~/sand]$