awk to count occurrences of strings and loop over multiple columns

Hi all,

If I would like to process an input file like the one below:

col1 col2 col3 ...col100
1 A C E A ...
3 D E G A
5 T T A A 
6 D C A G 

how can I perform a for loop to count the occurrences of the letters in each column (just like uniq -c), for every column?

On top of that, I would also like to print out col1 in concatenated form. The expected output would be:

col2 count col1 col2 count col1 col3 count col1 ...col100 count col1
A 1 1 C 2 1,6
D 2 3,6 E 1 3
T 1 5 T 1 5

I'd appreciate it if you could help me resolve this problem... it's urgent!!

Thanks

What OS are you using? (Show output from uname -a .)

What is the value of LINE_MAX on your system? (Show output from getconf LINE_MAX .)

The obvious way to do things like this is with awk , but the standards only define the behavior of awk on text files. On many systems, a text file cannot have any lines longer than 2048 bytes and if you have 1100 fields in your input files, you have crossed that threshold.

How many lines are in your input file? What is the total size of your input file?
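
If it helps, something along these lines (just a sketch, assuming your input is in a file simply named file) would show all of those numbers:

uname -a              # OS and kernel version
getconf LINE_MAX      # line-length limit assumed by the standard text utilities
wc -lc file           # number of lines and total bytes in the input
awk '{if (length($0) > max) max = length($0)} END {print max " bytes in the longest line"}' file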

The header for your desired output file:

col2 count col1 col2 count col1 col3 count col1 ...col100 count col1

shows the 1st three output fields being duplicated but if I understand what you're trying to do, shouldn't the header be?:

col2 count col1 col3 count col1 ...col100 count col1

Hi Don,

Thanks for your response. I'm currently using Red Hat. The line max is about 100k lines, while the number of fields is 100.

Yup, my desired output should be what you have corrected. I managed to do the count column by column, but I wonder how I could perform a for loop in one shot for 100 columns. :rolleyes:
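
What I have so far is basically this, one column at a time (just a sketch; the column number is hard-coded and has to be changed for each run):

awk 'NR > 1 { print $2 }' file | sort | uniq -c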

:slight_smile:

You didn't answer most of my questions. But, now that I know that you're using a Linux system, I know that your version of awk will handle pretty much unlimited line lengths.

I'm still trying to get a feel for what the input and output data are going to look like. I'm assuming that the 100k number you gave is the number of lines in your input file (not the number of bytes in the longest line in your input file). Assuming that the numbers in the 1st column of your input are unique and that there are 6 distinct values in the other columns (A, G, C, T, D, and E), that means we are converting 100k input rows, each with a maximum line length of about 210 bytes, into six output data lines (plus one header line) with a maximum line length approaching 69.3 million bytes and an average line length approaching 11.5 million bytes.
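
Laid out as rough arithmetic (same assumptions as above), the worst case for a single output line would be something like:

99 columns * 100,000 col1 entries * (6 digits + 1 comma)   = 69,300,000 bytes
99 columns * (1 letter + 5 count digits + 3 separators)    =        891 bytes

which is where the 69.3 million byte figure comes from; spreading the 100k col1 entries evenly over 6 output rows gives the roughly 11.5 million byte average.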

Am I in the right ballpark here, or are my assumptions off? If my assumptions are off, where am I guessing wrong? Are there values other than A, G, C, T, D, and E that will appear in col2 through col100 in the input file?

Once you have created this file, do you have something that is going to be able to use this data?

Hi Don,

I presume that under Linux, awk has no limitation on file size? And I've no clue about byte calculations :confused:

Your assumptions about my problem are pretty accurate, except that the values in each column are not fixed; in other words, there might be other letters or combinations of two letters.

I would need to use this kind of program in my other analysis work (I'm working in quantitative genetics, dealing with DNA data).

Not sure if this is entirely what you are looking for, but try:

awk '
  FNR==1{
    # Skip the header line (on both passes)
    next
  }
  {
    for(i=2; i<=NF; i++) {
      if(NR==FNR){
        # 1st pass: for each value/column pair, build the list of col1
        # values (A) and count the occurrences (C)
        A[$i,i]=(($i,i) in A?A[$i,i]",":x) $1
        C[$i,i]++
      }
      else {
        # 2nd pass: append the count and the col1 list to each field
        $i=$i FS C[$i,i] FS A[$i,i]
      }
    }
  }
  NR>FNR    # true only on the 2nd pass; no action given, so the default action (print) is used
' file file

Note: the input file is specified twice, so it is read in two passes (NR==FNR is true only during the first pass).

Output:

1 A 1 1 C 2 1,6 E 1 1 A 3 1,3,5
3 D 2 3,6 E 1 3 G 1 3 A 3 1,3,5
5 T 1 5 T 1 5 A 2 5,6 A 3 1,3,5
6 D 2 3,6 C 2 1,6 A 2 5,6 G 1 6

Thanks a lot, Scrutinizer!

The program works fine, but I just realized that the numbers will appear many times, because the output has the same number of records (NR) as the input; it's not like uniq -c anymore.

Btw, we don't have to specify print with this program?

Do you mean something more like this:

awk '
  FNR==1{
    n=NF;       # remember the number of fields from the header line
    next
  }
  {
    for(i=2; i<=NF; i++) {
      O[$i]                                    # record every distinct value seen
      A[$i,i]=(($i,i) in A?A[$i,i]",":x) $1    # col1 list per value/column pair
      C[$i,i]++                                # occurrence count per value/column pair
    }
  }
  END {
    # One output line per distinct value; for each data column print the
    # value, its count (0 if absent) and its col1 list ("-" if absent)
    for(i in O) {
      $0=x
      for(j=2; j<=n; j++) {
          $(j-1)=i FS C[i,j]+0 FS ((i,j) in A?A[i,j]:"-")
      }
      print
    }
  }
' file

Output:

A 1 1 A 0 - A 2 5,6 A 3 1,3,5
C 0 - C 2 1,6 C 0 - C 0 -
D 2 3,6 D 0 - D 0 - D 0 -
E 0 - E 1 3 E 1 1 E 0 -
G 0 - G 0 - G 1 3 G 1 6
T 1 5 T 1 5 T 0 - T 0 -

If I correctly understand what you're trying to do, maybe the following is closer to what you want:

awk '
NR == 1 {
        # We have the 1st line in the input file...  Print output header...
        for(i = 2; i <= NF; i++)
                printf("%s count %s%s", $i, $1, i == NF ? "\n" : " ")
        # Save # of fields for use in END clause.
        nf = NF
        next
}
{       # For each remaining line in the input file...  Accumulate data...
        d[NR, 1] = $1
        for(i = 2; i <= NF; i++) {
                d[NR, i] = $i
                if(!((i, $i) in vc)) {
                        # Add new value for this field...
                        v[i, ++nv[i]] = $i
                        # If the # of distinct values seen in this field is
                        # greater than the # of output rows to be produced,
                        # increment the # of output rows.
                        if(nv[i] > orows)
                                orows++
                }
                # Increment count of occurrences of this value in this field.
                vc[i, $i]++
        }
}
END {   # Process accumulated data and print results...
        # For each output row to be printed...
        for(row = 1; row <= orows; row++) {
                # For each set of 3 columns to be printed for this row...
                for(field = 2; field <= nf; field++) {
                        if(row > nv[field]) {
                                # No data for this field for this row.
                                # Print dashes...
                                printf("- - -%s", field == nf ? "\n" : " ")
                                # Skip to next field.
                                continue
                        }
                        # Print field value and count columns for this set.
                        printf("%s %d ", v[field, row],
                                vc[field, v[field, row]])
                        # Print Column 1 values list for this set.
                        oc = 0
                        for(line = 2; oc < vc[field, v[field, row]]; line++) {
                                # If the data in this input line or this field
                                # is the string we are looking for, print the
                                # field 1 value for this line...
                                if(d[line, field] == v[field, row])
                                        printf("%s%s", d[line, 1], 
                                                ++oc < vc[field, v[field, row]]\
                                                ? "," : \
                                                field == nf ? "\n" : " ")
                        }
                }
        }
}' file

With the following in file :

col1 col2 col3 col4 col5 col6 col7
1 A C E A XX AA
3 D E G A XX AG
5 T T A A XX AC
6 D C A G XX AT
100 E C A A XX CA
5002 E G G G XX CC
99999 A C E A AB CG

it produces the output:

col2 count col1 col3 count col1 col4 count col1 col5 count col1 col6 count col1 col7 count col1
A 2 1,99999 C 4 1,6,100,99999 E 2 1,99999 A 5 1,3,5,100,99999 XX 6 1,3,5,6,100,5002 AA 1 1
D 2 3,6 E 1 3 G 2 3,5002 G 2 6,5002 AB 1 99999 AG 1 3
T 1 5 T 1 5 A 3 5,6,100 - - - - - - AC 1 5
E 2 100,5002 G 1 5002 - - - - - - - - - AT 1 6
- - - - - - - - - - - - - - - CA 1 100
- - - - - - - - - - - - - - - CC 1 5002
- - - - - - - - - - - - - - - CG 1 99999

Is something like this what you want?
