Counting total files with different file types in each folder

I am trying to count the files of each type in every folder, with thousands of files per folder.
Since some files do not have extensions, I have to use the criteria below.

Count total files starting with --> "^ERROR"
Count total files starting with --> "^Runtime"
Count everything else, including files without any extension

Sample input files in each sub-folder:

RuntimeProperties_296090758.xls
RuntimeProperties_296409844.xls
ERROR_261218287_296336046_20161213_101129
261218194_296090758_20161212_120448
RuntimeProperties_296413261.xls
ERROR_261218194_296090758_20161212_120448
261218287_296409844_20161213_120039
261218287_296336046_20161213_101129
ERROR_261218287_296409844_20161213_120040

I have to count this in a 12TB root folder with 6800 sub-folders and thousands of files in each, so the solution must not run into buffer overflow, out-of-memory, or "too many files" errors. It should also be fast.

I think either perl or awk can do this with the help of xargs, but I am not entirely sure how.

# I wish something like this can print counts for each sub-folder. 
for each targetDIR in $(6800 folders); do
     find $targetDIR -type f  | xargs -i awk -v file="{}"  -v td="$targetDIR" \
            'file ~ "./^ERROR" {CNT_ERROR += 1}; \
             file ~ "./^Runtime" {CNT_Runtime += 1}; \
             file !~ "./^ERROR|^./Runtime" {CNT_Others += 1}; \
             END {print td "," CNT_ERROR ","  CNT_Runtime "," CNT_Others}'
done

Then I can get overall counts myself.

Try (not thoroughly tested):

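find . -type f |
awk -F/ '
        {PTH = $0                       # full path as printed by find
         sub (/\/[^\/]*$/, "", PTH)     # strip the file name, keep the folder
         IX = $NF~/^ERROR/?"ERROR":$NF~/^Runtime/?"RUNTM":"OTHER"
         CNT[IX OFS PTH]++              # one counter per type and folder
        }
END     {for (c in CNT) print CNT[c], c
        }
' OFS="\t"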

You may want to test this on a subset of your directory tree. For that amount of dirs / files it may take a while...
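
One way to try it on a subset is a small throwaway tree. The following is only a sketch: it assumes mktemp is available, the folder names sub1 and sub2 are made up for illustration, and the file names are taken from the sample list in the question. The sort at the end is just there to group the lines by folder.

```shell
#!/bin/sh
# Build a tiny throwaway tree (sub1 and sub2 are hypothetical folder names)
root=$(mktemp -d)
mkdir -p "$root/sub1" "$root/sub2"
touch "$root/sub1/RuntimeProperties_296090758.xls" \
      "$root/sub1/ERROR_261218287_296336046_20161213_101129" \
      "$root/sub1/261218194_296090758_20161212_120448" \
      "$root/sub2/RuntimeProperties_296409844.xls" \
      "$root/sub2/ERROR_261218194_296090758_20161212_120448"

# Run the counting pipeline from inside the test root
result=$(cd "$root" && find . -type f |
awk -F/ '
        {PTH = $0
         sub (/\/[^\/]*$/, "", PTH)     # drop the file name, keep the folder
         IX = $NF~/^ERROR/?"ERROR":$NF~/^Runtime/?"RUNTM":"OTHER"
         CNT[IX OFS PTH]++
        }
END     {for (c in CNT) print CNT[c], c
        }
' OFS="\t" | sort -k3,3 -k2,2)

# One line per (type, folder) pair: count, type, folder
printf '%s\n' "$result"

# Clean up the test tree
rm -rf "$root"
```

Each output line should show a count, one of ERROR / RUNTM / OTHER, and the folder path relative to where find was started.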


I ran it on a small data sample in one folder, and it's working great. I'm not sure why I'm getting "." at the end, though.

299     ERROR   .
299     RUNTM   .
299     OTHER   .

That's the path. Evidently you ran it in the current working directory with all the files in a single folder, so the folder of every file is ".".
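
Since the third column is the folder, the per-folder lines can also be rolled up into overall totals per category. A rough sketch, assuming the output is tab-separated as count, type, path; the input lines and the folder names dirA and dirB here are invented purely for illustration:

```shell
#!/bin/sh
# Hypothetical per-folder output: count, type, path (tab-separated)
per_folder=$(printf '299\tERROR\t./dirA\n299\tRUNTM\t./dirA\n299\tOTHER\t./dirA\n10\tERROR\t./dirB\n5\tOTHER\t./dirB\n')

# Sum the counts (column 1) per category (column 2)
totals=$(printf '%s\n' "$per_folder" |
awk -F'\t' '
        {TOT[$2] += $1}
END     {for (t in TOT) print TOT[t], t}
' OFS='\t' | sort -k2)

printf '%s\n' "$totals"
```

In real use the printf feeding awk would be replaced by the saved output of the find pipeline.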