Count occurrences of a word without counting repeats

Hi, I would like to count the number of ALA occurrences without counting the repeats. The script I have written reports 40 occurrences of ALA, but the answer should be 8. ALA is one of the 20 values that AAA can take when the script asks for input; for this example, ALA was entered.

The script I have:

#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
then 
	for i in HS_data_*.txt; 
		do
			cat $i | grep -o -i $AAA | wc -l | awk '{print $1}'
#			awk -F"[ ]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' $i
        done
else
	exit 1
fi

The content of one of the HS_data_*.txt files is this:

ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N  
ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C  
ATOM   2353  C   ALA B  10      13.899  10.389 -17.757  1.00 29.99           C  
ATOM   2354  O   ALA B  10      14.653  10.538 -16.788  1.00 30.44           O  
ATOM   2355  CB  ALA B  10      12.686  12.323 -18.679  1.00 26.90           C  
ATOM   2423  N   ALA B  26      11.645  18.555   7.864  1.00 32.06           N  
ATOM   2424  CA  ALA B  26      11.938  19.955   7.579  1.00 35.40           C  
ATOM   2425  C   ALA B  26      13.080  20.496   8.431  1.00 37.27           C  
ATOM   2426  O   ALA B  26      13.742  21.478   8.087  1.00 39.36           O  
ATOM   2427  CB  ALA B  26      10.716  20.815   7.844  1.00 34.56           C  
ATOM   2643  N   ALA B  56       5.654  16.636 -19.419  1.00 27.14           N  
ATOM   2644  CA  ALA B  56       4.306  16.969 -19.795  1.00 27.77           C  
ATOM   2645  C   ALA B  56       4.139  18.435 -20.144  1.00 29.41           C  
ATOM   2646  O   ALA B  56       3.619  18.808 -21.204  1.00 30.63           O  
ATOM   2647  CB  ALA B  56       3.373  16.628 -18.664  1.00 28.99           C  
ATOM   2887  N   ALA B  88      -3.023   7.753 -19.907  1.00 20.84           N  
ATOM   2888  CA  ALA B  88      -3.018   7.206 -18.575  1.00 17.38           C  
ATOM   2889  C   ALA B  88      -1.627   6.647 -18.364  1.00 18.59           C  
ATOM   2890  O   ALA B  88      -1.086   5.920 -19.197  1.00 14.88           O  
ATOM   2891  CB  ALA B  88      -4.015   6.090 -18.472  1.00 18.60           C  
ATOM   3187  N   ALA B 130      -4.398   5.962 -24.620  1.00 22.40           N  
ATOM   3188  CA  ALA B 130      -3.225   5.141 -24.341  1.00 20.70           C  
ATOM   3189  C   ALA B 130      -3.170   4.921 -22.854  1.00 19.83           C  
ATOM   3190  O   ALA B 130      -3.725   5.716 -22.066  1.00 17.31           O  
ATOM   3191  CB  ALA B 130      -1.913   5.797 -24.700  1.00 22.82           C  
ATOM   3516  N   ALA B 177       0.656  -7.277 -20.930  1.00 19.87           N  
ATOM   3517  CA  ALA B 177      -0.367  -8.059 -20.250  1.00 19.38           C  
ATOM   3518  C   ALA B 177      -0.263  -9.541 -20.590  1.00 20.35           C  
ATOM   3519  O   ALA B 177       0.029  -9.962 -21.720  1.00 19.92           O  
ATOM   3520  CB  ALA B 177      -1.747  -7.592 -20.659  1.00 15.99           C  
ATOM   3541  N   ALA B 181      -4.381 -14.273 -14.076  1.00 16.90           N  
ATOM   3542  CA  ALA B 181      -4.649 -13.158 -13.194  1.00 16.14           C  
ATOM   3543  C   ALA B 181      -3.446 -12.893 -12.306  1.00 18.15           C  
ATOM   3544  O   ALA B 181      -2.692 -13.819 -12.014  1.00 20.60           O  
ATOM   3545  CB  ALA B 181      -5.817 -13.463 -12.335  1.00 15.23           C  
ATOM   3626  N   ALA B 194       8.308 -12.434 -17.665  1.00 29.11           N  
ATOM   3627  CA  ALA B 194       9.387 -12.364 -18.631  1.00 28.89           C  
ATOM   3628  C   ALA B 194      10.604 -11.653 -18.089  1.00 31.02           C  
ATOM   3629  O   ALA B 194      10.592 -11.177 -16.949  1.00 31.88           O  
ATOM   3630  CB  ALA B 194       8.920 -11.616 -19.844  1.00 25.66           C  

As you can see from the input, ALA appears on 40 lines, but each ALA takes up 5 of those lines, so there are 8 distinct ALAs in total. The 4th column gives the ALA value, while the 6th column identifies which ALA it is. For example, the ALA at 10 (6th column) appears on 5 lines, the ALA at 26 appears on 5 lines, the ALA at 56 also appears on 5 lines, etc.

The output has to count ALA 8 times instead of 40, which is what my script currently gives (bold: cat $i | grep -o -i $AAA | wc -l | awk '{print $1}' ).

I was also trying to figure out how to count ALA 8 times using only the commented-out command, # awk -F"[ ]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' $i, but I am struggling to get the awk command right.

Thus, I would like to ask a few questions:
1) How could I make the bolded command count ALA 8 times instead of 40?
2) How could I make the commented-out awk command on its own also count ALA 8 times instead of the 5 it gives now, which does not make sense as there are many more ALA words?

Hi,

The awk statement works if you simply leave out -F"[ ]":

awk -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' "$i" 

renders 8

I would suggest slightly modifying it to make it more exact:

awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print CNT+0}' "$i"

Mind you, with the approach of the loop, you are counting per file. So perhaps you would like the filename too:

So

for i in HS_data_*.txt; 
do
  awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print FILENAME ": ",CNT+0}' "$i"
done

Or if you want the total of all the HS_data files in the directory, try:

awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print CNT+0}' HS_data_*.txt
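
One caveat with a single awk call over all files: the OCC array is shared between the files, so a residue number that occurs in two different files is counted only once. If each file should be counted on its own before adding up, a minimal sketch would be to key the array on FILENAME as well. The two sample files below, and the variable names shared and per_file, are made up purely for illustration:

```shell
#!/bin/bash
# Hypothetical sample data: two tiny files that both contain ALA residue 10.
tmpdir=$(mktemp -d)
cat > "$tmpdir/HS_data_1.txt" <<'EOF'
ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N
ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C
ATOM   2423  N   ALA B  26      11.645  18.555   7.864  1.00 32.06           N
EOF
cat > "$tmpdir/HS_data_2.txt" <<'EOF'
ATOM   1001  N   ALA B  10       1.000   2.000   3.000  1.00 20.00           N
ATOM   1002  CA  ALA B  10       1.500   2.500   3.500  1.00 20.00           C
EOF

AAA=ALA

# Shared array: residue 10 from the second file is treated as already seen.
shared=$(awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++} END {print CNT+0}' "$tmpdir"/HS_data_*.txt)

# Array keyed on file name plus residue number: each file is counted on its own.
per_file=$(awk -v AMINO="$AAA" '$4 == AMINO && !OCC[FILENAME, $6]++ {CNT++} END {print CNT+0}' "$tmpdir"/HS_data_*.txt)

echo "shared array:      $shared"
echo "keyed on FILENAME: $per_file"
rm -r "$tmpdir"
```

Which of the two totals is the right one depends on whether the same residue number in different files refers to the same residue, which only you can tell from your data.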

Or, if there are too many files and you get an "argument list too long" error, try:

for i in HS_data_*.txt; 
do
  cat "$i"
done | 
awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print CNT+0}' 
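
As for the first question, the grep pipeline itself cannot tell the repeats apart, because grep -o prints only the matched word and throws away the 6th column. One possible way to stay close to the original pipeline, as a sketch only, is to drop -o, pull out the 6th column and count the unique values. The sample file below (including the GLY line) is invented for illustration, and this assumes the residue number in column 6 is enough to tell the residues apart:

```shell
#!/bin/bash
# Hypothetical sample file: two ALA residues (10 and 26) plus one GLY line.
i=$(mktemp)
cat > "$i" <<'EOF'
ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N
ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C
ATOM   2423  N   ALA B  26      11.645  18.555   7.864  1.00 32.06           N
ATOM   2424  CA  ALA B  26      11.938  19.955   7.579  1.00 35.40           C
ATOM   2500  N   GLY B  30       1.000   2.000   3.000  1.00 20.00           N
EOF
AAA=ALA

# grep keeps the whole lines that contain the word, awk prints the residue
# number, sort -u drops the repeats, and wc -l counts what is left.
# tr strips the padding that some wc implementations add.
count=$(grep -w -i "$AAA" "$i" | awk '{print $6}' | sort -u | wc -l | tr -d ' ')
echo "$count"
rm "$i"
```

The awk-only version above does the same job in a single process, so this pipeline is mainly useful if you prefer to keep grep in the picture.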

Regards,

S.


Thank you very much Scrutinizer for a lengthy response!!! You're wonderfully generous :)

Everything in your response gives 8, or whatever other value is expected, except for the first awk command. When I run awk -F"[ ]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' "$i" it gives 5, and I have no idea how it came up with that. Any ideas why that might be the case?


As Scrutinizer posted, the modified field separator is responsible, as it changes the field numbering. $6 now takes the values "C", "N", "O", "CA", "CB", of which there are 5 distinct ones.

Understood, but when I added the field separator and tried $6 as well as every other field up to $12, none of them gave me 8. Where might the problem be?

That's because of the different lengths of $6 (1 or 2) resulting in a different FS count after it, so sometimes "ALA" shows up in $8, sometimes in $9, and the fields to follow as well.
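
To see the shift for yourself, here is a small sketch that reports which field holds "ALA" on two lines of the sample input, once with -F"[ ]" and once with the default separator. The helper name ala_field is made up for this illustration:

```shell
#!/bin/bash
# Two lines from the sample input; they differ in the width of the atom name.
line_n='ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N'
line_ca='ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C'

# Hypothetical helper: print the index of the first field equal to "ALA".
# $1 is the value passed to -F, $2 is the input line.
ala_field() {
    printf '%s\n' "$2" |
        awk -F"$1" '{for (n = 1; n <= NF; n++) if ($n == "ALA") {print n; exit}}'
}

# With -F"[ ]" every single space separates two fields, so each extra
# space in a run of spaces produces an empty field.
single_n=$(ala_field '[ ]' "$line_n")
single_ca=$(ala_field '[ ]' "$line_ca")

# With a plain space (the default) runs of spaces count as one separator.
default_n=$(ala_field ' ' "$line_n")
default_ca=$(ala_field ' ' "$line_ca")

echo "-F'[ ]' : ALA is field $single_n on the N line, field $single_ca on the CA line"
echo "default : ALA is field $default_n on the N line, field $default_ca on the CA line"
```

With the single-space separator the position of "ALA" moves between the two lines, while with the default it stays in the 4th field on both.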


RudiC, thanks for confirming my thought that different numerical values have to do with uneven spacings. Thanks!

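Note: If you do not specify a field separator ( FS ) in awk, it uses the default of a single space (" "), which has a special meaning: leading and trailing blanks are skipped, and fields are delimited by runs of one or more blanks (spaces or tabs) or newlines. See: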

The Open Group Base Specifications Issue 7, 2018 edition