Replicate merging and frequency calculation

verse123 · February 22, 2014, 2:25pm

Hello, I have a 2 column file with an ID column and a column with some string.

ID     String
EN03 typehellobyedogcatcatdog
EN09 typehellobye
EN08 dogcatcatdog
EN09 catcattypehello
EN10 typehellobyedogcatcatdog
EN10 typehellobyedogcatcatdogdog

I would like to count the amount of times an ID repeats in the "ID" column and record it in a new column. Similarly, I would like to count the amount of times the substring "dog" appears for the lines with the same "ID" and record it in a new column. The output should look like this:

ID      dog_frequency   ID_frequency
EN03  2                       1
EN09  0                       2
EN08  2                       1
EN10  5                       2

any ideas?

Akshay_Hegde · February 22, 2014, 2:39pm

$ cat file
ID String
EN03 typehellobyedogcatcatdog
EN09 typehellobye
EN08 dogcatcatdog
EN09 catcattypehello
EN10 typehellobyedogcatcatdog
EN10 typehellobyedogcatcatdogdog

awk '
        NR==1{
		print "ID", "dog_frequency", "ID_frequency"
		next
	     }
      FNR==NR{
		A[$1]++
		B[$1]+=gsub(/dog/,x,$2)
		next
	     } 
    ($1 in A){ 
		print $1,B[$1],A[$1]
		delete A[$1]
		delete B[$1]
	     }
    ' OFS='\t' file{,}

Resulting

ID	dog_frequency	ID_frequency
EN03	2	1
EN09	0	2
EN08	2	1
EN10	5	2

Scrutinizer · February 22, 2014, 2:42pm

Try:

awk 'NR>1{D[$1]+=gsub(/dog/,x,$2); F[$1]++} END{for (i in D) print i,D,F}' file

verse123 · February 22, 2014, 5:20pm

Thanks. Any way to find say the top 5 most abundant substrings from column 2 that are of length 6 without knowing them beforehand?

---------- Post updated at 06:20 PM ---------- Previous update was at 04:40 PM ----------

also, any idea how to add together the length of each of the strings for each ID and print them as well? For example the output should look like this:

ID      dog_frequency   ID_frequency string_length
EN03  2                       1                24
EN09  0                       2                27
EN08  2                       1                12
EN10  5                       2                51

Akshay_Hegde · February 23, 2014, 1:37am

Okay it looks like you continued your discussion in new thread with different title

Finding most common substrings | Unix Linux Forums | Shell Programming and Scripting

verse123 · February 24, 2014, 5:13pm

sorry about that! I wasn't aware they were the same forum.

verse123 · March 4, 2014, 3:18am

Scrutinizer, I am trying to incorporate your code from #3 in a while loop but I need awk to use the substrings from file1 to search for a match in file2 and to count it based on the info from file2. However, awk is just counting the frequency of the ID and not counting the frequency of the substring that exists in file2. How can I get this to work? Here is the code:

while read A B
do
         mawk 'NR>1{D[$1]+=gsub(/${A}/,x,$2); F[$1]++}; END{for (i in D) print i,D,F}' file2 > $A.out
done; < file1

Scrutinizer · March 4, 2014, 3:21am

Try:

while read A B
do
         mawk -v a="$A" 'NR>1{D[$1]+=gsub(a,x,$2); F[$1]++}; END{for (i in D) print i,D,F}' file2 > $A.out
done < file1

verse123 · March 4, 2014, 3:42am

that doesn't work either. I get 0's for the string frequency column. It does, however, count the occurrence of the ID.

Scrutinizer · March 4, 2014, 4:03am

Oops, did not have a chance to test it. Made a correction to my post.