Hi, I would like to count the number of ALA occurrences without counting repeats. My current script reports 40 occurrences of ALA, but the result should be 8. The script prompts for an amino acid code (AAA), which can be any one of 20 values; for this example I entered ALA.
The script I have:
#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
then
for i in HS_data_*.txt;
do
cat $i | grep -o -i $AAA | wc -l | awk '{print $1}'
# awk -F"[ ]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' $i
done
else
exit 1
fi
The contents of one of the HS_data_*.txt files look like this:
ATOM 2351 N ALA B 10 13.856 10.830 -20.161 1.00 27.93 N
ATOM 2352 CA ALA B 10 13.893 11.449 -18.853 1.00 27.45 C
ATOM 2353 C ALA B 10 13.899 10.389 -17.757 1.00 29.99 C
ATOM 2354 O ALA B 10 14.653 10.538 -16.788 1.00 30.44 O
ATOM 2355 CB ALA B 10 12.686 12.323 -18.679 1.00 26.90 C
ATOM 2423 N ALA B 26 11.645 18.555 7.864 1.00 32.06 N
ATOM 2424 CA ALA B 26 11.938 19.955 7.579 1.00 35.40 C
ATOM 2425 C ALA B 26 13.080 20.496 8.431 1.00 37.27 C
ATOM 2426 O ALA B 26 13.742 21.478 8.087 1.00 39.36 O
ATOM 2427 CB ALA B 26 10.716 20.815 7.844 1.00 34.56 C
ATOM 2643 N ALA B 56 5.654 16.636 -19.419 1.00 27.14 N
ATOM 2644 CA ALA B 56 4.306 16.969 -19.795 1.00 27.77 C
ATOM 2645 C ALA B 56 4.139 18.435 -20.144 1.00 29.41 C
ATOM 2646 O ALA B 56 3.619 18.808 -21.204 1.00 30.63 O
ATOM 2647 CB ALA B 56 3.373 16.628 -18.664 1.00 28.99 C
ATOM 2887 N ALA B 88 -3.023 7.753 -19.907 1.00 20.84 N
ATOM 2888 CA ALA B 88 -3.018 7.206 -18.575 1.00 17.38 C
ATOM 2889 C ALA B 88 -1.627 6.647 -18.364 1.00 18.59 C
ATOM 2890 O ALA B 88 -1.086 5.920 -19.197 1.00 14.88 O
ATOM 2891 CB ALA B 88 -4.015 6.090 -18.472 1.00 18.60 C
ATOM 3187 N ALA B 130 -4.398 5.962 -24.620 1.00 22.40 N
ATOM 3188 CA ALA B 130 -3.225 5.141 -24.341 1.00 20.70 C
ATOM 3189 C ALA B 130 -3.170 4.921 -22.854 1.00 19.83 C
ATOM 3190 O ALA B 130 -3.725 5.716 -22.066 1.00 17.31 O
ATOM 3191 CB ALA B 130 -1.913 5.797 -24.700 1.00 22.82 C
ATOM 3516 N ALA B 177 0.656 -7.277 -20.930 1.00 19.87 N
ATOM 3517 CA ALA B 177 -0.367 -8.059 -20.250 1.00 19.38 C
ATOM 3518 C ALA B 177 -0.263 -9.541 -20.590 1.00 20.35 C
ATOM 3519 O ALA B 177 0.029 -9.962 -21.720 1.00 19.92 O
ATOM 3520 CB ALA B 177 -1.747 -7.592 -20.659 1.00 15.99 C
ATOM 3541 N ALA B 181 -4.381 -14.273 -14.076 1.00 16.90 N
ATOM 3542 CA ALA B 181 -4.649 -13.158 -13.194 1.00 16.14 C
ATOM 3543 C ALA B 181 -3.446 -12.893 -12.306 1.00 18.15 C
ATOM 3544 O ALA B 181 -2.692 -13.819 -12.014 1.00 20.60 O
ATOM 3545 CB ALA B 181 -5.817 -13.463 -12.335 1.00 15.23 C
ATOM 3626 N ALA B 194 8.308 -12.434 -17.665 1.00 29.11 N
ATOM 3627 CA ALA B 194 9.387 -12.364 -18.631 1.00 28.89 C
ATOM 3628 C ALA B 194 10.604 -11.653 -18.089 1.00 31.02 C
ATOM 3629 O ALA B 194 10.592 -11.177 -16.949 1.00 31.88 O
ATOM 3630 CB ALA B 194 8.920 -11.616 -19.844 1.00 25.66 C
As you can see from the input, ALA appears on 40 lines, but these are only 8 distinct ALA entries listed 5 times each (40 / 5 = 8). The 4th column gives the ALA value, while the 6th column identifies which ALA a line belongs to. For example, the ALA at 10 (6th column) appears on 5 lines, the ALA at 26 appears on 5 lines, the ALA at 56 also appears on 5 lines, etc.
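The counting rule described above (one count per distinct value in the 6th column, for lines whose 4th column is the chosen amino acid) can be sketched roughly as follows. This is only a minimal sketch under assumptions: the function name count_residues is hypothetical, the fields are assumed to be separated by blanks as in the sample, and the 5th column (B) is assumed to be a chain label that is included in the key so the same number in another chain would be counted separately.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): count distinct
# (column 5, column 6) pairs on lines whose 4th column matches the name.
count_residues() {
    local name=$1 file=$2
    awk -v name="$name" '
        # Default field splitting: runs of blanks count as one separator.
        toupper($4) == toupper(name) && !seen[$5, $6]++ { n++ }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    ' "$file"
}

# Small sample in the same layout as the data above: 2 entries, 3 lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ATOM 2351 N ALA B 10 13.856 10.830 -20.161 1.00 27.93 N
ATOM 2352 CA ALA B 10 13.893 11.449 -18.853 1.00 27.45 C
ATOM 2423 N ALA B 26 11.645 18.555 7.864 1.00 32.06 N
EOF

result=$(count_residues ALA "$sample")
echo "$result"
rm -f "$sample"
```

If there is only ever one chain per file, keying on $6 alone should give the same count.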
The output should count ALA 8 times instead of 40, which is what I currently get from this command in my script (the bolded one): cat $i | grep -o -i $AAA | wc -l | awk '{print $1}'
I was also trying to figure out how to count ALA 8 times using only the awk command that is commented out in the script:
# awk -F"[ ]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' $i
However, I am struggling to get the awk command right.
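One thing that may matter for the awk attempt is the field separator. With -F"[ ]" every single space is a separator, so runs of spaces produce empty fields and $6 is no longer the 6th visible column. The sketch below illustrates this on one line; it assumes the real files are padded with several spaces between columns (the sample in this post shows single spaces only, so the padding here is made up for illustration).

```shell
#!/bin/bash
# One line padded with runs of spaces (assumed layout, for illustration only).
line='ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C'

# -F"[ ]" is a regex matching exactly one space, so consecutive spaces
# create empty fields and $6 lands on an earlier visible column.
single=$(printf '%s\n' "$line" | awk -F"[ ]" '{print $6}')

# The default separator collapses runs of blanks, so $6 is the 6th
# visible column.
default=$(printf '%s\n' "$line" | awk '{print $6}')

echo "with -F\"[ ]\": '$single'"
echo "default FS:   '$default'"
```

If the files are padded like this, !OCC[$6]++ would be deduplicating on a different column (here the atom name) rather than on the number in the 6th visible column.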
Thus, I would like to ask a few questions:
1) How could I make the bolded command count ALA 8 times instead of 40?
2) How could I make the awk command (commented out) count ALA 8 times as well, instead of 5 as it does now? The 5 does not make sense to me, since ALA appears many more times than that.
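For question 1, one possible direction is to reduce each matching line to the columns that identify the entry, drop duplicates, and only then count. The following is a rough sketch, not a definitive fix; it assumes blank-separated fields as in the sample, and that the pair of columns 5 and 6 identifies one ALA.

```shell
#!/bin/bash
# Sample data in the same layout as the post: 2 entries, 5 lines in total.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ATOM 2351 N ALA B 10 13.856 10.830 -20.161 1.00 27.93 N
ATOM 2352 CA ALA B 10 13.893 11.449 -18.853 1.00 27.45 C
ATOM 2353 C ALA B 10 13.899 10.389 -17.757 1.00 29.99 C
ATOM 2423 N ALA B 26 11.645 18.555 7.864 1.00 32.06 N
ATOM 2424 CA ALA B 26 11.938 19.955 7.579 1.00 35.40 C
EOF

AAA=ALA

# Pipeline as in the script (without cat): one count per match.
per_line=$(grep -o -i "$AAA" "$sample" | wc -l | awk '{print $1}')

# Variant: keep columns 5 and 6 of matching lines, remove duplicates
# with sort -u, then count the remaining lines.
per_residue=$(grep -i "$AAA" "$sample" | awk '{print $5, $6}' | sort -u | wc -l | awk '{print $1}')

echo "lines: $per_line, residues: $per_residue"
rm -f "$sample"
```

Inside the for loop, "$sample" would be replaced by "$i".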