Counting consecutive numbers in column 2

Ryan_Kim · February 6, 2014, 1:09pm

Hi,
I have an input file which contains the following data:

I would like to count the runs of consecutive numbers in column 2, grouped by column 1, and also obtain the maximum run length in each group.

So the results I would like to get would be:

As for the maximum per group:

0 4
1 4
2 3

Thanks in advance

th0gz19 · February 6, 2014, 1:19pm

Are you sure this isn't homework?

Ryan_Kim · February 6, 2014, 1:29pm

This is part of a result that represents stacking interactions in DNA. Column 1 represents time; column 2 is the residue in the DNA participating in a stacking interaction. I just want to know how many stacks are formed consecutively.
Since the raw data is too complex to post here, I omitted it and gave an example instead, in order to explain more efficiently.

Scrutinizer · February 6, 2014, 2:10pm

Try something like:

awk '$1>p || $2!=q+1{if(NR>1)print p,c; c=0} {p=$1; q=$2; c++} END{print p,c}' file
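As a quick illustration of what this prints, here is a minimal sketch on an invented sample (the values are made up for the example, they are not the data from the question). Each output line is a column 1 value followed by the length of one run of consecutive column 2 values.

```shell
# Hypothetical sample input, invented for this illustration:
# column 1 = group (time), column 2 = residue number.
sample='0 1
0 2
0 3
0 7
1 4
1 5
2 9'

# Same one-liner as above, reading the sample from stdin instead of a file.
runs=$(printf '%s\n' "$sample" |
  awk '$1>p || $2!=q+1{if(NR>1)print p,c; c=0} {p=$1; q=$2; c++} END{print p,c}')

printf '%s\n' "$runs"
```

With this sample it should print `0 3`, `0 1`, `1 2` and `2 1`, one pair per line: group 0 has a run of three (1 2 3) and a run of one (7), group 1 has a run of two, and group 2 a run of one.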

Ryan_Kim · February 6, 2014, 3:16pm

Thanks. It really works.

Also, I would appreciate it a lot if you could tell me how to obtain the maximum value in each group.

The result I would like to get is the following:

0 4
1 4
2 3

Scrutinizer · February 6, 2014, 3:34pm

Try something like:

awk '$1>p || $2!=q+1{if(c>m)m=c; c=0} $1>p{if(NR>1)print p,m; m=0} {p=$1; q=$2; c++} END{if(c>m)m=c; print p,m}' file

Ryan_Kim · February 6, 2014, 3:40pm

I sincerely appreciate it!

Scrutinizer · February 6, 2014, 3:44pm

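You're welcome. I added a modification in the END section. All of this only works if column 1 is grouped, i.e. lines with the same column 1 value are adjacent and, because the test is $1>p, the groups appear in increasing order.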
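If column 1 is not grouped, one possible workaround is to keep the counters in arrays indexed by column 1 instead of relying on line order. This is only a sketch under an assumption that is not stated in the thread: a run is taken to mean consecutive column 2 values among the lines of the same group, in the order they appear, even when lines of other groups sit in between. The array names (order, last, cnt, max) and the sample values are made up for the illustration.

```shell
# Hypothetical sample with the two groups interleaved (invented values).
sample='0 1
1 4
0 2
1 5
0 3
1 9'

maxima=$(printf '%s\n' "$sample" | awk '
  # First time a group is seen: remember its position for ordered output.
  !($1 in max) { order[++k] = $1; max[$1] = 0 }
  {
    # Extend the run if this value follows the previous one of the same group,
    # otherwise start a new run of length 1.
    if (($1 in last) && $2 == last[$1] + 1) cnt[$1]++
    else cnt[$1] = 1
    last[$1] = $2
    if (cnt[$1] > max[$1]) max[$1] = cnt[$1]
  }
  # Print groups in order of first appearance with their longest run.
  END { for (i = 1; i <= k; i++) print order[i], max[order[i]] }
')

printf '%s\n' "$maxima"
```

For this sample it should print `0 3` and `1 2`. If a run should instead be broken whenever a line from another group intervenes, sorting the input on column 1 first and then using the one-liner above is probably the simpler route.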

Ryan_Kim · February 6, 2014, 4:12pm

Sorry for bothering you. I would like to know how to get the average of the numbers in column 2, grouped by column 1.

If I have the following data:

The result I would like to obtain would be

0 3
1 2

Thanks in advance

Scrutinizer · February 6, 2014, 4:35pm

Try:

awk '$1>p {if(NR>1)print p, t/n; t=n=0} {p=$1; n++; t+=$2} END{print p,t/n}' file

--edit--
I added if(NR>1) in all suggestions
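To check the averaging one-liner, here is a small sketch on an invented sample (again made-up values, not the data from the question). It also shows how a non-integer average comes out with awk's default number formatting.

```shell
# Hypothetical sample input, invented for this illustration.
sample='0 2
0 4
1 2
1 3'

# Same averaging one-liner as above, reading from stdin.
averages=$(printf '%s\n' "$sample" |
  awk '$1>p {if(NR>1)print p, t/n; t=n=0} {p=$1; n++; t+=$2} END{print p,t/n}')

printf '%s\n' "$averages"
```

This should print `0 3` and `1 2.5`. If a fixed number of decimals is wanted, replacing `print p, t/n` with something like `printf "%s %.2f\n", p, t/n` would be one way to do it.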

mbp · April 8, 2014, 12:55am

Hi,
I am working with very similar data and Scrutinizer's answers have been very helpful. However, I was wondering how one would alter the output a bit. For example, if I had Ryan Kim's data in his post 'One more question':

What if I wanted to print out all of the actual lines that correspond to a series of lines with at least n consecutive values in column 2? For example, if I had n=4 (or 5) and the above data, I would want to extract and print the following lines of data:

If I had n=3, I would extract all of the lines from the original dataset.

I altered Scrutinizer's awk solution slightly to allow filtering the series based on the number of lines with consecutive values in column two:

awk '$1>p || $2!=q+1{if(NR>1 && c>4)print p,c; c=0} {p=$1; q=$2; c++} END{if (c>4) {print p,c}}' file

But I can't figure out how to print the actual series of lines with the consecutive values in them. Any advice or explanation would be greatly appreciated!

Thank you very much in advance for any help.

Scrutinizer · April 8, 2014, 3:25am

Something like this?

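awk '
  # A new group or a break in the sequence ends the current run
  $1>p || $2!=q+1 {
    if(NR>1 && c>=n) print s   # print the buffered run if it is long enough
    c=0
    s=x                        # x is never assigned, so this empties the buffer
    p=$1
  }
  {
    s=s ORS $0                 # append the line; s starts with ORS, so each run prints after a blank line
    q=$2
    c++
  }
  END{
    if(c>=n)print s            # flush the last run
  }
' n=3 file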
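A minimal sketch of how this behaves, on an invented sample (made-up values, not the data from the earlier posts). The sketch passes n with -v because the input comes from a pipe; with a file, the trailing n=3 form works the same way.

```shell
# Hypothetical sample input, invented for this illustration.
sample='0 1
0 2
0 3
0 7
1 4
1 5
1 6
2 9'

# Same program as above: buffer each run and print it only if it has
# at least n lines.
blocks=$(printf '%s\n' "$sample" | awk -v n=3 '
  $1>p || $2!=q+1 {
    if(NR>1 && c>=n) print s
    c=0
    s=x
    p=$1
  }
  {
    s=s ORS $0
    q=$2
    c++
  }
  END{
    if(c>=n)print s
  }
')

printf '%s\n' "$blocks"
```

With n=3 this should print the lines `0 1`, `0 2`, `0 3` and then `1 4`, `1 5`, `1 6`, each block preceded by an empty line; the single-line runs `0 7` and `2 9` are dropped.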

mbp · April 8, 2014, 11:11pm

Hi Scrutinizer,
thanks very much, that works perfectly!