Count the number of files in each subset in a directory

piynik · January 18, 2013, 5:48am

Hi, I am trying to write a script to count the number of files in a directory, grouped by slightly different subset names.

For example, in the directory /data there are subsets of files that are named as follows:

 
/data/data_1_(1 to however many).txt
/data/data_2_(1 to however many).txt
/data/data_3_(1 to however many).txt
etc.

Now I want to write a script to count the number of files with each subset name (1 to 3, or however many there are).

Thanks for your help

Scrutinizer · January 18, 2013, 6:06am

Something like this?

cd /data
ls | awk -F_ '{A[$1 FS $2]++} END{for(i in A) print i, A[i]}'

piynik · January 18, 2013, 6:18am

Would you mind explaining the script so I can adapt it accordingly and test it?

Scrutinizer · January 18, 2013, 11:16am

Sure:

awk -F_ '                     # set the field separator to underscore
  {
    A[$1 FS $2]++             # count how many times $1 FS $2 occurs (fields 1 and 2 joined by an underscore, for example "data_1")
  }
END{
  for(i in A) print i, A[i]   # at the end, print each prefix and its count
}
'

piynik · January 20, 2013, 12:42pm

Thanks for your reply. What is

A[$1 FS $2]++

?

Is that an array?

Scrutinizer · January 20, 2013, 1:28pm

That is an element in the associative array A with index $1 FS $2 (FS="_" , so this means $1_$2 ) that gets incremented by 1 ( ++ )
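To see the counting in isolation, here is a minimal sketch that feeds a few made-up file names to the same awk program instead of running ls on a real directory. The function name count_prefixes and the sample names are only for illustration, and sort is added because the order of a for (i in A) loop is not guaranteed.

```shell
# Count how often each "field1_field2" prefix occurs on standard input.
# count_prefixes is a hypothetical helper name, not part of the thread's code.
count_prefixes() {
  awk -F_ '{ A[$1 FS $2]++ } END { for (i in A) print i, A[i] }' | sort
}

# Made-up file names standing in for the output of ls in /data.
printf '%s\n' data_1_1.txt data_1_2.txt data_2_1.txt | count_prefixes
```

With these three names, data_1 is counted twice and data_2 once.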

piynik · January 28, 2013, 6:35am

The code has worked impressively, many thanks for that.

Now I want to write a line that prints, on a single line separated by single spaces, the names of all prefixes that have more than 5 files, such as

data_1 data_2

I wrote this line

ls | awk -F_ '{A[$1 FS $2]++} END {for (j in A) {if (A[j] > 5) {printf j, " "}}}'

However the output from this line is

data_1data_2

It doesn't seem to recognise the single space I asked for between the prefixes. Do you know what I may have done wrong?

Scrutinizer · January 28, 2013, 6:57am

Try something like this:

awk -F_ '{A[$1 FS $2]++} END {for (j in A) if (A[j] > 5) printf "%s ",j; print ""}'

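The first argument to printf is the format string. In your version, j was taken as the format and " " was passed as an extra argument that no conversion in the format ever printed.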
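A small side-by-side sketch of the two forms, using a hard-coded value for j rather than real ls output (the variable names wrong and right are just for this illustration):

```shell
# j is used as the format string; the extra argument " " is never printed.
wrong=$(awk 'BEGIN { j = "data_1"; printf j, " " }')

# "%s " is the format string; j fills the %s conversion, followed by a space.
right=$(awk 'BEGIN { j = "data_1"; printf "%s ", j }')

# Brackets make the presence or absence of the trailing space visible.
printf '[%s]\n[%s]\n' "$wrong" "$right"
```

The first bracket pair shows the prefix with no space after it, the second shows it followed by one space.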

piynik · January 28, 2013, 9:32am

That worked, thanks so much

But while waiting for your reply I also found that if I remove the "," in the printf argument in my original code, so that

ls | awk -F_ '{A[$1 FS $2]++} END {for (j in A) if (A[j] > 5) printf j " "}'

It worked, which goes against what I have read about the syntax of awk's printf. I don't know whether it has something to do with the shell (zsh) I am using or some other reason I don't understand.

Scrutinizer · January 28, 2013, 9:35am

Yes, in that case j and " " are concatenated, so printf then uses the resulting string as its single argument, the format. However, I would not recommend using printf with data in the format field.
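The reason for that advice: if a prefix ever contains a % character, printf would try to interpret it as a conversion specification, and what happens then depends on the awk implementation. A minimal sketch of the safe form, with a made-up prefix that contains a percent sign:

```shell
# Hypothetical prefix containing "%"; it is passed as data, not as the format,
# so it is printed literally.
safe=$(awk 'BEGIN { j = "50%_1"; printf "%s ", j; print "" }')
echo "$safe"
```

Because j only ever fills the %s conversion, its contents cannot change how printf behaves.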

--
Could you please use code tags BTW

piynik · January 28, 2013, 10:23am

I can't seem to get printf to work properly. For example

ls | awk -F_ '{A[$1 FS $2]++} END {for (j in A) print j, A[j]}'

would output

data_1 200
data_2 34

while the equivalent command with printf

ls | awk -F_ '{A[$1 FS $2]++} END {for (j in A) printf j, A[j]}'

would only output

data_1data_2

while ignoring the A[j]

The reason I want to do this is that I want the output to line up more neatly. At the moment, for my test directory, I am getting this (using \t as the separator):

loooooooooger_prefix1     200
shorter_prefix2         34

but I want to get

loooooooooger_prefix1     200
shorter_prefix2                  34

---------- Post updated at 10:23 AM ---------- Previous update was at 10:21 AM ----------

Sorry, my message formatting wasn't displayed properly.

What I want is to output the prefix and the number of files with each prefix aligned in columns. Using \t doesn't work too well if the prefixes have different lengths.

Scrutinizer · January 28, 2013, 10:31am

Like this?

awk -F_ '{A[$1 FS $2]++} END {for (j in A) if (A[j] > 5) printf "%-30s%5d\n",j, A[j]}'

Please use code tags, piynik.
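To illustrate what that format does: %-30s left-justifies the prefix in a field 30 characters wide and %5d right-justifies the count in a field 5 characters wide. A minimal sketch using the two example prefixes and counts from your post as plain input (no ls involved); the widths 30 and 5 are just the ones from the command above and assume prefixes shorter than 30 characters:

```shell
# Feed "prefix count" pairs to awk and print them in fixed-width columns.
aligned=$(printf '%s %s\n' loooooooooger_prefix1 200 shorter_prefix2 34 |
  awk '{ printf "%-30s%5d\n", $1, $2 }')
echo "$aligned"
```

Every output line is then 35 characters wide, so the counts end in the same column regardless of the prefix length.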

piynik · January 28, 2013, 10:47am

Thanks, it worked.

Apologies for the code tag issue.