How to extract specific data and count the number of matching entries per time step in a file?

Hello everybody!

I am quite new here and hope you can help me.

Using an awk script I am trying to extract data from several files. The structure of the input files is as follows:

TimeStep parameter1 parameter2 parameter3 parameter4

e.g.

1 X Y Z L
1 D H Z I
1 H Y E W
2 D H G F
2 R T U V
3
.
.
.

I would like to count, for each time step (1, 2, 3, ...), the number of entries that contain certain parameters.

I already wrote a script that extracts entries with specific parameters. However, I am still struggling to get the number of entries in time steps 1, 2 and 3 that contain, e.g., parameter Z in $4 (2 for time step 1 and 0 for all others in the example above).

I would prefer to do everything within a single awk script because I have a lot of data and later only want to change the parameter selection.

I already tried to do it with for and while loops, but it did not work the way I wanted... well, I am just starting with awk :)

Thanks for your help guys!

Hi

Not sure whether I understood you correctly.

# awk '{if ($x==y)a[$1]++;else a[$1]+=0;}END{for (i in a)print i,a[i]}' x=4 y=Z file
1 2
2 0
#

Here x is the number of the column you want to search in, and y is the parameter you want to search for.

Guru.

Thanks a lot, Guru, it is almost doing what I wanted :)

I use the following script to count, for each value of $1 (1, 2, 3, ...), the entries that are consistent with the values defined for parameters A, B, C and D.

BEGIN {
    r=5;          # parameterA
    x=9;          # parameterB
    solv2="TFE";  # parameterC
    solv1="TIP3"; # parameterD
}


/CA 1/ # pattern selecting the rows to search, in order to skip the header of the file

    {if ( ($9*1 < r) && (( $7 ~ solv1 )||($7 ~ solv2)) )    a[$1]++;else a[$1]+=0;} # $9*1 to avoid the wrong counts that occurred sometimes



END{for (i in a)print i,a[i]}

The output looks something like this:

.
.
.
 90 CA 1 67 18 5744 TFE O1 8.17278
 90 CA 1 67 19 6988 TFE O1 8.51086
 90 CA 1 67 20 7806 TIP3 OH2 4.75067
 90 CA 1 67 21 10479 TIP3 OH2 4.67777
 90 CA 1 67 22 10845 TIP3 OH2 7.16528
 90 CA 1 67 23 11554 TIP3 OH2 4.19535
10 7
11 6
12 7
13 12
14 6
15 6
NAME 0
16 8
30 4

.
.
.

So it is messing up the order, including parts of the header in the output, and printing the whole input file at the beginning... I am confused :D

The input file looks like this

 NAME DWT26R1_CA1_PEP1.DAT
 FRAMES[PS] 5000
 SKIPPED 500
 STEP 50
 PROCESSED 90
 1 CA 1 98 1 2643 TFE F21 9.5831
 1 CA 1 98 2 2654 TFE O1 6.25134
 1 CA 1 98 3 2681 TFE O1 5.01697
 1 CA 1 98 4 2751 TFE O1 6.45506
 1 CA 1 98 15 5702 TFE O1 9.63541
 1 CA 1 98 16 6096 TFE O1 4.69877
 1 CA 1 98 17 6337 TFE O1 6.64662
 1 CA 1 98 18 8167 TIP3 OH2 5.73264 
 2 CA 1 103 18 6096 TFE O1 6.27655
 2 CA 1 103 19 6337 TFE O1 8.68132
 2 CA 1 103 20 8167 TIP3 OH2 3.85201
 2 CA 1 103 21 8178 TIP3 OH2 7.49269
 2 CA 1 103 22 8481 TIP3 OH2 6.79798
 2 CA 1 103 23 8591 TIP3 OH2 3.98057
 2 CA 1 103 24 9917 TIP3 OH2 5.53047
.
.
.

Cheers,
Daniel
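
A note on the three symptoms described above, with a sketch of one way around them. In awk, a pattern that sits alone on its own line is a complete rule whose default action is to print the matching record, and an action block with no pattern in front of it runs for every record. With /CA 1/ on one line and the { ... } block further down, the matching rows are echoed and the counting block also runs on the header lines, which is where an entry like "NAME 0" comes from. The order is a separate matter: "for (i in a)" visits the indices in no defined order. The following is only a minimal sketch under some assumptions that are not in the original script: the function name count_hits and the names count and max are made up for illustration, the row filter is written as a field test ($2 == "CA" && $3 == 1) instead of the /CA 1/ regex, and the solvent name is compared with == rather than ~.

```shell
# Sketch: per time step ($1), count the rows whose distance ($9) is below r
# and whose solvent name ($7) equals solv1 or solv2.
# count_hits is a hypothetical helper name, not part of the original script.
count_hits() {
    awk -v r=5 -v solv1="TIP3" -v solv2="TFE" '
        # Pattern and opening brace on the same line: the action only runs
        # for data rows, and no default print action is triggered.
        $2 == "CA" && $3 == 1 {
            if ($9 + 0 < r && ($7 == solv1 || $7 == solv2))
                count[$1]++
            # Remember the largest time step seen so the END loop knows
            # where to stop.
            if ($1 + 0 > max) max = $1 + 0
        }
        END {
            # A numeric loop gives a defined order; count[i] + 0 prints 0
            # for time steps without any hit.
            for (i = 1; i <= max; i++)
                print i, count[i] + 0
        }
    ' "$@"
}
```

It would be called as "count_hits file". Tracking max avoids hard-coding an upper bound for the loop in the END block.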

---------- Post updated at 06:18 AM ---------- Previous update was at 02:41 AM ----------

I managed to get the output in the correct order by changing the for loop:

END {for ( i=1; i<100; i++) print i,a[i]}

However, I still have a question.

I have several input files representing successive data sets. However, the time step ($1) starts at 1 in each file. I need that value to keep increasing across files instead of starting at 1 again whenever a new file is read.

cheers,
daniel
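
One possible way to keep the time step running across several files is to remember the largest step seen so far and add it as an offset whenever awk starts a new file (FNR is reset to 1 at the start of each input file). This is only a sketch under assumptions: count_hits_multi, offset, step and max are names made up here, the files are passed in their chronological order, and the last time step of each file shows up in at least one "CA 1" row. If a file could end with time steps that have no such rows, the offset would come out too small.

```shell
# Sketch: same counting as before, but $1 is shifted by the number of time
# steps already seen in earlier files.
# count_hits_multi is a hypothetical helper name.
count_hits_multi() {
    awk -v r=5 -v solv1="TIP3" -v solv2="TFE" '
        # First record of every input file: continue after the last
        # global time step seen so far (empty, i.e. 0, for the first file).
        FNR == 1 { offset = max }

        $2 == "CA" && $3 == 1 {
            step = offset + $1          # global time step across files
            if (step > max) max = step
            if ($9 + 0 < r && ($7 == solv1 || $7 == solv2))
                count[step]++
        }
        END {
            for (i = 1; i <= max; i++)
                print i, count[i] + 0
        }
    ' "$@"
}
```

It would be called with the files in order, e.g. "count_hits_multi file1 file2", where file1 and file2 stand in for the real file names.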