Number of matches and matched pattern(s) in awk

beca123456 · December 26, 2015, 6:11pm

input:

!@#$%2QW5QWERTAB$%^&*

The string above is not separated (or FS="").
For clarity's sake, the string can be rewritten with "|" as the FS, as follows:

!|@|#|$|%|2QW|5QWERT|A|B|$|%|^|&|*

Here, I am only interested in patterns (their numbers are variable between records) containing capital letters, i.e.:
2QW
5QWERT
A
B

Note that patterns with more than one capital letter are preceded by a digit which equals the length of the pattern.

The output I am trying to obtain is:

String   # of A   #of B   # of longer patterns
!@#$%2QW5QWERTABCD$%^&*   1   1   1:QW; 1:QWERT

What I tried so far:

awk 'BEGIN{OFS="   "; print "String   # of A   #of B   # of longer patterns"}
{
   string=$0
   
   # number of 'A'
   num_A=gsub(/A/,"A",string)
   
   #number of 'B'
   num_B=gsub(/B/,"B",string)
   
   # extract long pattern #1
   num_pattern_1==0   
   match(string, /[0-9]+/)
   length_pattern_1=substr(string, RSTART,RLENGTH)
   pattern_1=substr(string, RSTART+1, length_pattern_1)

   # extract long_pattern #2 (stuck here)
   Is there a way to skip the first digit match?  
   If I use 'split(string, b, "[0-9]+")' I could use a for loop through the different indexes of array b, but I will lose the pattern length.

   # count the number of same pattern
   Since I cannot use 'split' I don't see how I could iterate the count through the different motifs 

   print string, num_A, num_B, num_pattern_1":"pattern_1";"num_pattern_2":"pattern_2"; "num_pattern_X":"pattern-X
}'
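On the question of skipping the first digit match: one possible approach is to call match() again on whatever remains of the string after each hit, since match() only ever reports the leftmost match. The sketch below assumes a POSIX awk and that every number is followed by at least that many characters; list_runs is a hypothetical helper name, not something from the original post.

```shell
# list_runs (hypothetical name): print "length:pattern" for every digit run in $1.
list_runs() {
    printf '%s\n' "$1" | awk '
    {
        rest = $0
        # match() always finds the leftmost digit run, so cut the string
        # after each hit and search the remainder again.
        while (match(rest, /[0-9]+/)) {
            len = substr(rest, RSTART, RLENGTH) + 0
            print len ":" substr(rest, RSTART + RLENGTH, len)
            rest = substr(rest, RSTART + RLENGTH + len)
        }
    }'
}

list_runs '!@#$%2QW5QWERTAB$%^&*'
# prints:
# 2:QW
# 5:QWERT
```

Because the digits are read with a regular expression, the pattern length is kept, which is what split() on "[0-9]+" would have thrown away.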

durden_tyler · December 26, 2015, 7:36pm

$ 
$ cat f36.awk
BEGIN {ind=0; str=""}
{
     n = split($0, a, "");
     i = 1;
     while (i <= n) {
         if (a[i] >= 1 && a[i] <= 9) {
             # A digit: store it together with the next a[i] characters.
             str = a[i];
             for (j=i+1; j<=i+a[i]; j++) {
                 str = str""a[j];
             }
             ind++;
             pattern[ind] = str;
             str = "";
             i = i + a[i] + 1;
         } else if (a[i] >= "A" && a[i] <= "Z") {
             # An uppercase letter outside a counted run is a pattern by itself.
             ind++;
             pattern[ind] = a[i];
             i++;
         } else {
             i++;
         }
     }
}
END {
    for (k=1; k<=ind; k++) { printf("pattern[%d] = [%s]\n", k, pattern[k]) }
}
$ 
$ echo "\!@#$%2QW5QWERTAB$%^&*" | awk -f f36.awk
pattern[1] = [2QW]
pattern[2] = [5QWERT]
pattern[3] = [A]
pattern[4] = [B]
$ 
$ 
$ echo "\!@#$%2QW7QWERTABXY3PQR$%^Z&*LMn#O" | awk -f f36.awk
pattern[1] = [2QW]
pattern[2] = [7QWERTAB]
pattern[3] = [X]
pattern[4] = [Y]
pattern[5] = [3PQR]
pattern[6] = [Z]
pattern[7] = [L]
pattern[8] = [M]
pattern[9] = [O]
$ 
$

Don_Cragun · December 26, 2015, 9:39pm

1. Are the numbers specifying the length of a pattern limited to a single digit?
   Does the string 12XYCCCCCCCCCCD contain 2 patterns: 12XYCCCCCCCCCC and D (each occurring once)? Or does it contain the pattern 2XY occurring once, the pattern C occurring 10 times, and the pattern D occurring once?
2. What happens if a digit is followed by fewer uppercase letters than are specified by that digit?
   In the string 3D@C, is there one pattern (3D@C) or two patterns (D and C)?
3. Your code explicitly counts occurrences of A and B separately from counting patterns that might contain them. Is that what you want?
   Should the string 4AABB just report one occurrence of the pattern 4AABB? Or should it report the pattern 4AABB occurring once, two occurrences of the pattern A, and two occurrences of the pattern B?
4. Is the digit 1 special?
   Should the string 1XX be treated as one occurrence of the pattern 1X and one occurrence of the pattern X? Or should it be treated as two occurrences of the pattern X?

durden_tyler · December 26, 2015, 9:55pm

Addendum to Don's question # 2:

What happens if a digit is followed by a mix of uppercase and lowercase letters? Eg. in the string "3AbCD", is the pattern "3ACD" or something else?

beca123456 · December 26, 2015, 10:13pm

Sorry for the lack of clarity.

1. No. The number specifying the length of the pattern is always 2 or greater.
   In the example Don Cragun mentioned ('12XYCCCCCCCCCCD'), there are 2 patterns: '12XYCCCCCCCCCC' and 'D'.
   '12' should be considered as the number '12' and not as the single digits '1' then '2'.

2. It cannot happen. A number is always followed by uppercase or lowercase letters only (no symbols or other non-letter characters). The number of letters that follow is always at least the number that precedes them.

3. Correct. '4AABB' is one pattern only, 'AABB', as defined by the number '4'.

4. There is never the number '1' in the string. The numbers present in the string are always 2 or greater (e.g. 2, 34, 2000...).

If '3AbCD' occurs, then we have 2 patterns: 'AbC' and 'D'.
A number X is always followed by letters that form an X-long pattern. The case doesn't matter as long as the characters are letters.
If a letter occurs directly after the X-long motif, it is considered a pattern itself.

example 1:

!@#$%4AvDf2QWER

There are 4 patterns ('AvDf', 'QW', 'E' and 'R')

example 2:

3BHuI4RtYU2vGP

There are 5 patterns ('BHu', 'I', 'RtYU', 'vG' and 'P')

example 3:

$%$6ABcdEf)-2yg*%/LK@~~()

There are 4 patterns ('ABcdEf', 'yg', 'L' and 'K')
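The rules above can be sketched as a small tokenizer: a number consumes the next X characters as one pattern, and any other letter is a pattern by itself. This is only an illustration under those stated rules, assuming a POSIX awk; split_patterns is a hypothetical name.

```shell
# split_patterns (hypothetical name): print one pattern per line for the string in $1.
split_patterns() {
    printf '%s\n' "$1" | awk '
    {
        rest = $0
        # Find the next number or the next single letter, whichever comes first.
        while (match(rest, /[0-9]+|[A-Za-z]/)) {
            tok = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
            if (tok ~ /^[0-9]/) {
                # A number X: the next X characters form one pattern.
                print substr(rest, 1, tok + 0)
                rest = substr(rest, tok + 1)
            } else {
                # A letter outside an X-long run is a pattern by itself.
                print tok
            }
        }
    }'
}

split_patterns '3BHuI4RtYU2vGP'
# prints BHu, I, RtYU, vG and P, one per line
```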

durden_tyler · December 26, 2015, 10:42pm

$ 
$ cat f36_v1.awk
BEGIN {ind=0; str=""}
{
     n = split($0, a, "");
     i = 1;
     while (i <= n) {
         if (a[i] >= 1 && a[i] <= 9) {
             # Collect all consecutive digits to form the pattern length.
             while (1) {
                 str = str""a[i];
                 if (a[i+1] < 0 || a[i+1] > 9) { break; }
                 else { i++; }
             }
             len = str;
             str = "";
             # The next len characters form one pattern.
             for (j=i+1; j<=i+len; j++) {
                 str = str""a[j];
             }
             ind++;
             pattern[ind] = str;
             str = "";
             i += len + 1;
         } else if ((a[i] >= "a" && a[i] <= "z") || (a[i] >= "A" && a[i] <= "Z")) {
             # A letter outside a counted run is a pattern by itself.
             ind++;
             pattern[ind] = a[i];
             i++;
         } else {
             i++;
         }
     }
}
END {
    for (k=1; k<=ind; k++) { printf("pattern[%d] = [%s]\n", k, pattern[k]) }
}
$ 
$ echo "\!@#$%2QW5QWERTAB$%^&*" | awk -f f36_v1.awk
pattern[1] = [QW]
pattern[2] = [QWERT]
pattern[3] = [A]
pattern[4] = [B]
$ 
$ echo "\!@#$%2QW7QWERTABXY3PQR$%^Z&*LMn#O" | awk -f f36_v1.awk
pattern[1] = [QW]
pattern[2] = [QWERTAB]
pattern[3] = [X]
pattern[4] = [Y]
pattern[5] = [PQR]
pattern[6] = [Z]
pattern[7] = [L]
pattern[8] = [M]
pattern[9] = [n]
pattern[10] = [O]
$ 
$ echo "\$%\$6ABcdEf)-2yg*%/LK@~~()" | awk -f f36_v1.awk
pattern[1] = [ABcdEf]
pattern[2] = [yg]
pattern[3] = [L]
pattern[4] = [K]
$ 
$ echo "3BHuI4RtYU2vGP" | awk -f f36_v1.awk
pattern[1] = [BHu]
pattern[2] = [I]
pattern[3] = [RtYU]
pattern[4] = [vG]
pattern[5] = [P]
$ 
$ echo "\!@#$%4AvDf2QWER" | awk -f f36_v1.awk
pattern[1] = [AvDf]
pattern[2] = [QW]
pattern[3] = [E]
pattern[4] = [R]
$ 
$

beca123456 · December 26, 2015, 11:38pm

Thanks durden_tyler, it helps a lot!

Now I have to work on the format of the output as mentioned in my original post, i.e. counting the number of occurrences of each pattern as follows (multiple-letter patterns in the same field separated by "; ", and single-letter patterns one per field):

example:

!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&

We have:

pattern[1] = [ABC]
pattern[2] = [D]
pattern[3] = [E]
pattern[4] = [Fghi]
pattern[5] = [ABC]
pattern[6] = [D]

What I am trying to get is:

!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&|2:D|1:E|2:ABC; 1:Fghi

The order of the multiple-letter patterns within the field doesn't matter.
The order of the single-letter patterns doesn't matter either.
But it would be useful to have the single-letter patterns before the multiple-letter patterns, as above.

For clarity I used "|" as FS, but I could change it to " " like in my original post.

Aia · December 27, 2015, 1:23am

cat beca123456.input
!@#$%2QW5QWERTAB$%^&*
!@#$%4AvDf2QWER
3BHuI4RtYU2vGP
$%$6ABcdEf)-2yg*%/LK@~~()

awk '
{
    for(i=1;i<=length($0);i++){
        ch = substr($0, i, 1)
        if(ch ~ /[0-9]/){
            pat = substr($0, i+1, ch)
            multi[pat]++;
            i += ch
        }
        else if(ch ~ /[a-zA-Z]/){
            single[ch]++
        }
        
    }
}

{
    printf "%s", $0
    for (s in single){
       printf "|%d:%s", single[s], s
       delete single[s]
    }

    for (s in multi){
      # Start the field with "|" and join later entries with "; ".
      m = (m == "" ? "|" : m "; ") multi[s] ":" s
      delete multi[s]
    }
    print m
    m = ""

}' beca123456.input

!@#$%2QW5QWERTAB$%^&*|1:A|1:B|1:QWERT; 1:QW
!@#$%4AvDf2QWER|1:R|1:E|1:AvDf; 1:QW
3BHuI4RtYU2vGP|1:P|1:I|1:RtYU; 1:vG; 1:BHu
$%$6ABcdEf)-2yg*%/LK@~~()|1:K|1:L|1:yg; 1:ABcdEf

I am confused about your affair with the "|"; at this point I do not know whether you want it in the actual output or not. However, the output above matches your example.

Don_Cragun · December 27, 2015, 1:31am

Here is an alternative approach that will work with any standards-conforming version of awk. (Note that the standards say the behavior is unspecified if FS (or the ERE used in split()) is an empty string.)

awk '
BEGIN {	printf("String   #_of_occurrences_of__pattern:pattern...\n")
}
{	printf("%s", left = $0)
	while(match(left, /[[:alnum:]]+/)) {
		# Throw away leading non-digit, non-alpha characters.
		if(RSTART > 1)
			left = substr(left, RSTART)
		if((num = left + 0) > 0) {
			# We have a string starting with a leading digit string.
			p = substr(left, len = length(num) + 1, num)
			left = substr(left, len + num)
			if(p in mcnt) {
				# We have seen this pattern before.
				mcnt[p]++
			} else {# We have not seen this pattern before.
				mcnt[mplist[++nmp] = p] = 1
			}
		} else {
			# We have a single alphabetic character string.
			p = substr(left, 1, 1)
			left = substr(left, 2)
			if(p in scnt) {
				# We have seen this pattern before.
				scnt[p]++
			} else {# We have not seen this pattern before.
				scnt[splist[++nsp] = p] = 1
			}
		}
	}
	# Print the results for this input line.
	# Print single character patterns.
	for(i = 1; i <= nsp; i++) {
		printf("   %d:%s", scnt[splist[i]], splist[i])
		delete scnt[splist[i]]
		delete splist[i]
	}
	# Print multiple character patterns.
	for(i = 1; i <= nmp; i++) {
		printf("   %d:%s", mcnt[mplist[i]], mplist[i])
		delete mcnt[mplist[i]]
		delete mplist[i]
	}
	print ""
	nmp = nsp = 0
}' file

If file contains:

!@#$%2QW5QWERTAB$%^&*
!|@|#|$|%|2QW|5QWERT|A|B|$|%|^|&|*
!@#$%2QW5QWERTAB$%^&*2QW5QWERTABAB
12ABCDEFGHIJKLMNABC
12ABCDEFGHIJKLMNABC#12ABCDEFGHIJKLMNDEF
~!@#$%^&*()_+
Aa@52ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#aA
AAAAAAAAAAAAAAAAAAAAAAAaaaaaabbbbbbAAAAAAAAAAAAAAAAAA
!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&

it produces the output:

String   #_of_occurrences_of__pattern:pattern...
!@#$%2QW5QWERTAB$%^&*   1:A   1:B   1:QW   1:QWERT
!|@|#|$|%|2QW|5QWERT|A|B|$|%|^|&|*   1:A   1:B   1:QW   1:QWERT
!@#$%2QW5QWERTAB$%^&*2QW5QWERTABAB   3:A   3:B   2:QW   2:QWERT
12ABCDEFGHIJKLMNABC   1:M   1:N   1:A   1:B   1:C   1:ABCDEFGHIJKL
12ABCDEFGHIJKLMNABC#12ABCDEFGHIJKLMNDEF   2:M   2:N   1:A   1:B   1:C   1:D   1:E   1:F   2:ABCDEFGHIJKL
~!@#$%^&*()_+
Aa@52ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#aA   2:A   2:a   1:ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
AAAAAAAAAAAAAAAAAAAAAAAaaaaaabbbbbbAAAAAAAAAAAAAAAAAA   41:A   6:a   6:b
!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&   2:D   1:E   2:ABC   1:Fghi

If someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk. (I don't think nawk knows how to handle the character class expression [[:alnum:]].)

PS: Note that this script uses a consistent field separator of three <space> characters instead of a mixture of pipe symbols and semicolons.
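Regarding the note that an empty-string separator is unspecified: a way to get one character per array element without relying on split(s, a, "") is a plain substr() loop. A minimal sketch, assuming a POSIX awk; chars is a hypothetical helper name.

```shell
# chars (hypothetical name): print "index: character" for each character of $1.
chars() {
    printf '%s\n' "$1" | awk '
    {
        n = length($0)
        # Fill a[1..n] with single characters; no empty-string separator needed.
        for (i = 1; i <= n; i++)
            a[i] = substr($0, i, 1)
        for (i = 1; i <= n; i++)
            print i ": " a[i]
    }'
}

chars '2QW$'
# prints:
# 1: 2
# 2: Q
# 3: W
# 4: $
```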

Scrutinizer · December 27, 2015, 2:00am

Hi Don, it works fine with /usr/xpg4/bin/awk on Solaris 10. Just tested it. Indeed, nawk cannot handle POSIX character classes.

--
EDIT: misread Don's post as a request instead of a suggestion..
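For an awk that lacks POSIX character classes, one possible workaround is to spell the class out as explicit ranges. The sketch below assumes ASCII input and forces the C locale, since the meaning of ranges such as A-Z can depend on the locale; first_alnum_run is a hypothetical name.

```shell
# first_alnum_run (hypothetical name): print the first run of letters and digits in $1.
first_alnum_run() {
    printf '%s\n' "$1" | LC_ALL=C awk '
    {
        # [A-Za-z0-9] stands in for [[:alnum:]] here (ASCII only).
        if (match($0, /[A-Za-z0-9]+/))
            print substr($0, RSTART, RLENGTH)
    }'
}

first_alnum_run '!@#$%2QW5QWERTAB$%^&*'
# prints: 2QW5QWERTAB
```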

RudiC · December 27, 2015, 7:50am

Different approach:

awk '
        {printf "%s", $0
         gsub (/[^A-Za-z0-9]/, "")
         n = split ($0, DIG, "[A-Za-z]*")
         m = split ($0, CHR, "[0-9]*")
         S = CHR[1]
         B = 1 + !(DIG[1])
         for (i=B; i<n; i++)    {IX = i - B + 2
                                 TMP = substr (CHR[IX], 1, DIG[i])
                                 PAT[TMP]++
                                 sub ("^" TMP, _, CHR[IX])
                                 S = S CHR[IX]
                                }
         for (i=split (S, T, ""); i>0; i--) SGL[T[i]]++
         for (p in PAT) printf "\t%d:%s", PAT[p], p
         for (s in SGL) printf "\t%d:%s", SGL[s], s
         printf "\n"
         S = ""
         delete PAT
         delete SGL
        }
' file

beca123456 · December 27, 2015, 3:19pm

Thanks Don Cragun for the script and the explanations ! It works perfectly.

Thanks RudiC! However, could you tell me what the following line means? I have never seen this expression before:

B = 1 + !(DIG[1])

Does it mean 'B = index of CHR starting after same DIG index'?

Thanks Aia! The code is easier to understand, but it only works for numbers with a single digit.
example:

!@#$%12QWtttttttttt5QWERTAB$%^&*

returns:

!@#$%12QWtttttttttt5QWERTAB$%^&*   1:A   1:B   10:t   1:Q   1:W   1:2; 1:QWERT

instead of:

!@#$%12QWtttttttttt5QWERTAB$%^&*   1:A   1:B   1:QWtttttttttt   1:QWERT


Aia · December 27, 2015, 4:06pm

beca123456:

Thanks Aia! The code is easier to understand, but it only works for numbers with a single digit.
example:
!@#$%12QWtttttttttt5QWERTAB$%^&*
returns:
!@#$%12QWtttttttttt5QWERTAB$%^&*   1:A   1:B   10:t   1:Q   1:W   1:2; 1:QWERT
instead of:
!@#$%12QWtttttttttt5QWERTAB$%^&*   1:A   1:B   1:QWtttttttttt   1:QWERT

Sorry, I missed that in your request.

awk '
{
    for(i=1;i<=length($0);i++){
        ch = substr($0, i, 1)
        if(ch ~ /[0-9]/){
            d = ""
            while(ch ~ /[0-9]/){
               d = d ch
               ch = substr($0, ++i, 1)
            }
            pat = substr($0, i, d)
            multi[pat]++
            i += (d-1)
       }
        else if(ch ~ /[a-zA-Z]/){
            single[ch]++
        }
    }
}
{
    printf "%s", $0
    for (s in single){
       printf " %d:%s", single[s], s
       delete single[s]
    }

    for (s in multi){
      # Start the field with " " and join later entries with "; ".
      m = (m == "" ? " " : m "; ") multi[s] ":" s
      delete multi[s]
    }
    print m
    m = ""
}' beca123456.file

beca123456 · December 27, 2015, 4:19pm

Thanks Aia. The new version works great !

Thanks Don Cragun and RudiC as well!

RudiC · December 28, 2015, 2:12am

beca123456:

.
.
.
However, could you tell me what the following line means? I have never seen this expression before:
B = 1 + !(DIG[1])
Does it mean 'B = index of CHR starting after same DIG index'?
.
.
.

!(DIG[1]) is a boolean expression that evaluates to 0 or 1 depending on whether that array element has a value. Above, it is a (sometimes discussed) shortcut for: if DIG[1] is empty or 0, then B = 2, else B = 1.
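The effect of the negation can be checked in isolation. In this small sketch, start_index and the variable first are hypothetical names standing in for B and DIG[1]; it assumes a POSIX awk.

```shell
# start_index (hypothetical name): print 1 + !(first), where first is taken from $1.
start_index() {
    awk -v first="$1" 'BEGIN {
        # !(first) is 1 when first is empty or zero, and 0 otherwise.
        print 1 + !(first)
    }'
}

start_index 2     # prints 1
start_index ''    # prints 2
```

So when the cleaned-up line starts with a number, the loop starts at index 1; when it starts with letters, the first element is empty and the loop starts at index 2.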

beca123456 · December 28, 2015, 1:32pm

Alright. I understand. Thanks RudiC !

durden_tyler · December 28, 2015, 3:17pm

$ 
$ cat -n f36
     1	!@#$%2QW5QWERTAB$%^&*
     2	!@#$%2QW7QWERTABXY3PQR$%^Z&*LMn#O
     3	$%$6ABcdEf)-2yg*%/LK@~~()
     4	3BHuI4RtYU2vGP
     5	!@#$%4AvDf2QWER
     6	##AAAABBBCCD##2RTC##=3XYZ##?4WWWW##3PQaR##=3XYZ#
     7	##AAAA#BBBvCCD##
     8	##AA#Bv#CC#4RSTUV#
     9	#?2AA#3XYZvN#4PQrsN#3XYZ=2wq#
    10	!^#=3AAAB$$?2CCR^&?2DD*=4EEEEY()
$ 
$ cat f36_v3.awk
function fetch_num (s) {
    # This function returns the number at the start of a string.
    # If "2ABCD" is passed, then 2 is returned.
    # If "23XYZW" is passed, then 23 is returned.
    l = length(s);
    num = "";
    i = 0;
    while (++i <= l && substr(s,i,1) ~ /[2-9]/) {
        num = num""substr(s,i,1);
    }
    return num;
}
function join (a, kvsep, arrsep) {
    # This function joins all elements of an associative array with arrsep.
    # Each key/value pair of the array is joined with kvsep.
    # a = associative array
    # kvsep = separator between key/value pairs
    # arrsep = separator between array elements
    iter = 1;
    result = "";
    for (i in a) {
       if (iter == 1) { result = a[i] kvsep i; }
       else { result = result arrsep a[i] kvsep i; }
       iter++;
    }
    return result;
}
{   # There are 4 associative arrays: s, m, q, e
    # s => to store number of occurrences of single character patterns
    # m => to store number of occurrences of multi-character patterns
    # q => to store number of occurrences of patterns that follow "?" character
    # e => to store number of occurrences of patterns that follow "=" character
    str = $0;
    ind = 1;
    len = length(str);
    printf("Input  : %s\n", str);
    while (ind <= len) {
        ch = substr(str, ind, 1);
        if (ch == "?" && substr(str,ind+1,1) ~ /[2-9]/) { # Pattern following "?"
            n = fetch_num(substr(str, ind+1));
            q[substr(str, ind+2, n)]++;
            ind += n + 2;
        } else if (ch == "=" && substr(str,ind+1,1) ~ /[2-9]/) { # Pattern following "="
            n = fetch_num(substr(str, ind+1));
            e[substr(str, ind+2, n)]++;
            ind += n + 2;
        } else if (ch ~ /[2-9]/) { # Multi-character pattern
            n = fetch_num(substr(str, ind));
            m[substr(str, ind+1, n)]++;
            ind += n + 1;
        } else if (ch ~ /[A-Za-z]/) { # Single-character pattern
            s[ch]++;
            ind++;
        } else {
            ind++;
        }
    }
    # Rebuild all four strings for every record, so that an empty array
    # gives an empty field instead of the previous record's value.
    s_str = join(s,";"," ");
    m_str = join(m,";","/");
    q_str = join(q,";","|");
    e_str = join(e,";","|");
    printf("Output : %s => %s %s %s %s\n", str, s_str, m_str, q_str, e_str);
    printf("\n");
    # Flush all arrays and start over again
    split("",s);
    split("",m);
    split("",q);
    split("",e);
}

$ 
$ awk -f f36_v3.awk f36
Input  : !@#$%2QW5QWERTAB$%^&*
Output : !@#$%2QW5QWERTAB$%^&* => 1;A 1;B 1;QWERT/1;QW  

Input  : !@#$%2QW7QWERTABXY3PQR$%^Z&*LMn#O
Output : !@#$%2QW7QWERTABXY3PQR$%^Z&*LMn#O => 1;O 1;n 1;X 1;L 1;Y 1;M 1;Z 1;QWERTAB/1;PQR/1;QW  

Input  : $%$6ABcdEf)-2yg*%/LK@~~()
Output : $%$6ABcdEf)-2yg*%/LK@~~() => 1;K 1;L 1;yg/1;ABcdEf  

Input  : 3BHuI4RtYU2vGP
Output : 3BHuI4RtYU2vGP => 1;P 1;I 1;RtYU/1;vG/1;BHu  

Input  : !@#$%4AvDf2QWER
Output : !@#$%4AvDf2QWER => 1;R 1;E 1;AvDf/1;QW  

Input  : ##AAAABBBCCD##2RTC##=3XYZ##?4WWWW##3PQaR##=3XYZ#
Output : ##AAAABBBCCD##2RTC##=3XYZ##?4WWWW##3PQaR##=3XYZ# => 4;A 3;B 3;C 1;D 1;R 1;PQa/1;RT 1;WWWW 2;XYZ

Input  : ##AAAA#BBBvCCD##
Output : ##AAAA#BBBvCCD## => 4;A 1;v 3;B 2;C 1;D   

Input  : ##AA#Bv#CC#4RSTUV#
Output : ##AA#Bv#CC#4RSTUV# => 2;A 1;v 1;B 2;C 1;V 1;RSTU  

Input  : #?2AA#3XYZvN#4PQrsN#3XYZ=2wq#
Output : #?2AA#3XYZvN#4PQrsN#3XYZ=2wq# => 2;N 1;v 1;PQrs/2;XYZ 1;AA 1;wq

Input  : !^#=3AAAB$$?2CCR^&?2DD*=4EEEEY()
Output : !^#=3AAAB$$?2CCR^&?2DD*=4EEEEY() => 1;B 1;R 1;Y  1;CC|1;DD 1;AAA|1;EEEE

$ 
$