Replicate merging and frequency calculation

Hello, I have a 2 column file with an ID column and a column with some string.

ID     String
EN03 typehellobyedogcatcatdog
EN09 typehellobye
EN08 dogcatcatdog
EN09 catcattypehello
EN10 typehellobyedogcatcatdog
EN10 typehellobyedogcatcatdogdog

I would like to count the amount of times an ID repeats in the "ID" column and record it in a new column. Similarly, I would like to count the amount of times the substring "dog" appears for the lines with the same "ID" and record it in a new column. The output should look like this:

ID      dog_frequency   ID_frequency
EN03  2                       1
EN09  0                       2
EN08  2                       1
EN10  5                       2

any ideas?

$ cat file
ID String
EN03 typehellobyedogcatcatdog
EN09 typehellobye
EN08 dogcatcatdog
EN09 catcattypehello
EN10 typehellobyedogcatcatdog
EN10 typehellobyedogcatcatdogdog 
awk '
        NR==1{
		print "ID", "dog_frequency", "ID_frequency"
		next
	     }
      FNR==NR{
		A[$1]++
		B[$1]+=gsub(/dog/,x,$2)
		next
	     } 
    ($1 in A){ 
		print $1,B[$1],A[$1]
		delete A[$1]
		delete B[$1]
	     }
    ' OFS='\t' file{,} 

Resulting

ID	dog_frequency	ID_frequency
EN03	2	1
EN09	0	2
EN08	2	1
EN10	5	2
1 Like

Try:

awk 'NR>1{D[$1]+=gsub(/dog/,x,$2); F[$1]++} END{for (i in D) print i,D,F}' file
1 Like

Thanks. Any way to find say the top 5 most abundant substrings from column 2 that are of length 6 without knowing them beforehand?

---------- Post updated at 06:20 PM ---------- Previous update was at 04:40 PM ----------

also, any idea how to add together the length of each of the strings for each ID and print them as well? For example the output should look like this:

ID      dog_frequency   ID_frequency string_length
EN03  2                       1                24
EN09  0                       2                27
EN08  2                       1                12
EN10  5                       2                51

Okay it looks like you continued your discussion in new thread with different title

Finding most common substrings | Unix Linux Forums | Shell Programming and Scripting

sorry about that! I wasn't aware they were the same forum.

Scrutinizer, I am trying to incorporate your code from #3 in a while loop but I need awk to use the substrings from file1 to search for a match in file2 and to count it based on the info from file2. However, awk is just counting the frequency of the ID and not counting the frequency of the substring that exists in file2. How can I get this to work? Here is the code:

while read A B
do
         mawk 'NR>1{D[$1]+=gsub(/${A}/,x,$2); F[$1]++}; END{for (i in D) print i,D,F}' file2 > $A.out
done; < file1

Try:

while read A B
do
         mawk -v a="$A" 'NR>1{D[$1]+=gsub(a,x,$2); F[$1]++}; END{for (i in D) print i,D,F}' file2 > $A.out
done < file1
1 Like

that doesn't work either. I get 0's for the string frequency column. It does, however, count the occurrence of the ID.

Oops, did not have a chance to test it. Made a correction to my post.

1 Like