Finding most common substrings

verse123 · February 22, 2014, 8:24pm

Hello, I would like to know what is the three most abundant substrings of length 6 from col2. The file is quite large and looks like this

col1        col2
EN03    typehellobyedogcatcatdog
EN09    typehellobyebyebyebye
EN08    dogcatcatdogbyebyebyebye
EN09    catcattypehellobyebyebyebye
EN10    typehellobyedogcatcatdogbyebyebyebye
EN10    typehellobyedogcatcatdogdogbyebye

the output should be something like:

byebye
catdog
typehe

Don_Cragun · February 22, 2014, 8:51pm

What have you tried?

If you don't have working code, can you describe the algorithm that should be used to get the results you want?

verse123 · February 22, 2014, 11:00pm

I don't have working code for this. To make the example simpler, the algorithm should look for any pattern of length 6 of letters A C E G in any order and find the most abundant pattern across the entire column.

Don_Cragun · February 23, 2014, 12:05am

With the simple awk program:

#!/bin/ksh
awk '
NR > 1 {for(i = length($2) - 5; i >= 1; i--)
                c[substr($2, i, 6)]++
}
END {   for(i in c)
                print c, i | "sort -rn"
}' file

and with file containing the data you provided in message #1, the output produced is:

13 byebye
8 yebyeb
8 ebyeby
5 ypehel
5 typehe
5 pehell
5 llobye
5 hellob
5 elloby
5 ehello
5 catcat
4 tcatdo
4 ogcatc
4 gcatca
4 dogcat
4 catdog
4 atcatd
3 yedogc
3 ogbyeb
3 obyedo
3 lobyed
3 gbyeby
3 edogca
3 dogbye
3 byedog
2 tdogby
2 obyeby
2 lobyeb
2 atdogb
1 ttypeh
1 tdogdo
1 tcatty
1 ogdogb
1 gdogby
1 dogdog
1 cattyp
1 attype
1 atdogd
1 atcatt

so "byebye" is indeed the most common six character substring in the 2nd column of your input. But "catdog" and "typehe" don't even come close. To just print the most commonly occurring substring or (if more than one substring appears the same number of times as the first most common substring) substrings, try:

#!/bin/ksh
awk '
NR > 1 {for(i = length($2) - 5; i >= 1; i--)
                c[substr($2, i, 6)]++
}
END {   for(i in c)
                print c, i | "sort -rn"
}' file | awk '
NR == 1 {c = $1}
$1 == c {print $2}'

which just prints:

byebye

If you want to try this on a Solaris/SunOS system, change awk from the default /usr/bin/awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk .

I use the Korn shell, but any shell that supports basic Bourne shell or POSIX standard shell syntax can be used instead.

verse123 · February 23, 2014, 12:17am

thanks, don. This is really helpful. Could you explain your awk code.

Don_Cragun · February 23, 2014, 12:43am

Does this help?

#!/bin/ksh
awk '
NR > 1 {# For all lines except line 1 (which contains headers; not data), loop
        # through the 2nd field and increment the number of times the six
        # character substring starting at offset i in tn the 2nd field appears:
        for(i = length($2) - 5; i >= 1; i--)
                c[substr($2, i, 6)]++
}
END {   # After all input lines have been processed, print and sort (using a
        # descreasing numeric sort) the number of times that substring occurred
        # and the substring.
        for(i in c)
                print c, i | "sort -rn"
}' file |       # The data to be processed comes from a file named file.
# The sorted output is then piped tnto another awk script to just print the
# most commonly appearing substrings.
awk '
# Save the count from the 1st input line (the most frequent substring).
NR == 1 {c = $1}
# Print the substring from every input line that has the same count as the
# most commonly appearing substring.
$1 == c {print $2}'

verse123 · February 24, 2014, 5:30pm

thanks, don. I have 2 last questions.

Some helpful individuals from the forums helped with this code below but it seems to be generating the same output for any string pattern I give it when I run it with a large file (200GB). However, when I run it with a small test file, that output is different for each string. Why do you think this could be happening?

Also, am I recording the length appropriately for all of the column 2 lines that have the same ID from column 1? Here's the code:

JOIN="joined.join"
cod1="ypehe"
cod2="edog"

awk '
        NR==1{
                print "ID", "cod_freq", "ID_freq", "length"
                next
             }
      FNR==NR{
                A[$1]++
                B[$1]+=gsub(/'${cod1}' || '${cod2}'/,x, $2)
                C[$1]+=length($2)
                next
             } 
    ($1 in A){ 
                print $1,B[$1],A[$1],C[$1]
                delete A[$1]
                delete B[$1]
                delete C[$1]
             }

    ' OFS='\t' ${JOIN}.out{,} > ${JOIN}.freq

Don_Cragun · February 25, 2014, 11:09am

verse123:

thanks, don. I have 2 last questions.

Some helpful individuals from the forums helped with this code below but it seems to be generating the same output for any string pattern I give it when I run it with a large file (200GB). However, when I run it with a small test file, that output is different for each string. Why do you think this could be happening?

Also, am I recording the length appropriately for all of the column 2 lines that have the same ID from column 1? Here's the code:
JOIN="joined.join"
cod1="ypehe"
cod2="edog"

awk '
   NR==1{
   print "ID", "cod_freq", "ID_freq", "length"
   next
   }
   FNR==NR{
   A[$1]++
   B[$1]+=gsub(/'${cod1}' || '${cod2}'/,x, $2)
   C[$1]+=length($2)
   next
   } 
   ($1 in A){ 
   print $1,B[$1],A[$1],C[$1]
   delete A[$1]
   delete B[$1]
   delete C[$1]
   }

   ' OFS='\t' ${JOIN}.out{,} > ${JOIN}.freq

I don't understand how this code applies to this thread???

verse123 · February 27, 2014, 9:02pm

Hi Don, I noticed that when I change the length of the substring combinations to look for above 10, I get smaller than 10 character patterns in the output. How can I prevent awk from doing this?

Don_Cragun · February 27, 2014, 11:15pm

You said you wanted strings containing 6 characters, and the code I gave you processes substrings containing 6 characters. Please show us how you modified the code to handle strings of the length you want now.

In your sample input, the strings to be searched always contained more than 6 characters. Do you ever have an input field that is shorter then the substring length you're looking for? The code I gave you didn't check for this possibility.