awk print values between consecutive lines

alex2005 · December 5, 2013, 1:29pm

I have a file in the format below:

file01.txt

TERM
TERM
TERM
ABC     12315   68.53   12042013   165144
ABC     12315   62.12   12042013   165145
ABC     12315  122.36   12052013   165146
ABC     12315  582.18   12052013   165147
ABC     12316    2.36   12052013   165141
ABC     12316   68.53   12042013   165142
ABC     12316   62.12   12042013   165143
ABC     12316  122.36   12052013   165144
ABC     12316  122.36   12052013   165145

My desired output is:

ABC     12315   68.53   12042013   165144
ABC     12315  582.18   12052013   165147
ABC     12316    2.36   12052013   165141
ABC     12316  122.36   12052013   165145

In this file, all rows are sorted by columns 2 and 5.

I've tried the following command:

awk '/^ABC/ {if (lastval != $5-1 ) { print line;print $0}  lastval = $5; line = $0 }' file01.txt

which adds an extra line at the beginning and skips the last row as well:

                                                        
ABC     12315   68.53   12042013   165144
ABC     12315  582.18   12052013   165147
ABC     12316    2.36   12052013   165141

I am looking for help modifying the one-liner so that it:

prints the last row
adds a count to each pair of lines and a blank line between pairs:

ABC     12315   68.53   12042013   165144
ABC     12315  582.18   12052013   165147  4

ABC     12316    2.36   12052013   165141
ABC     12316  122.36   12052013   165145  5

If a row between the first and last rows of a group is missing, the group should not be split. For example, if

ABC     12316   62.12   12042013   165143

is missing from file01.txt, the final output should be:

ABC     12315   68.53   12042013   165144
ABC     12315  582.18   12052013   165147  4

ABC     12316    2.36   12052013   165141
ABC     12316  122.36   12052013   165145  4

Thank you in advance for your help
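
A note on the two symptoms described above: the empty first line appears because the variable line is still unset when the first ABC row triggers the print, and the last row is lost because nothing runs after the final record. The following is a minimal sketch that fixes only those two symptoms while keeping the original $5-based test, so it still assumes consecutive values in column 5. The temporary directory and the out variable are assumptions added only to make the example self-contained.

```shell
# Recreate the sample input from the post in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file01.txt" <<'EOF'
TERM
TERM
TERM
ABC     12315   68.53   12042013   165144
ABC     12315   62.12   12042013   165145
ABC     12315  122.36   12052013   165146
ABC     12315  582.18   12052013   165147
ABC     12316    2.36   12052013   165141
ABC     12316   68.53   12042013   165142
ABC     12316   62.12   12042013   165143
ABC     12316  122.36   12052013   165144
ABC     12316  122.36   12052013   165145
EOF

out=$(awk '/^ABC/ {
    if (lastval != $5 - 1) {
        if (line != "") print line   # nothing is buffered before the first row
        print $0
    }
    lastval = $5; line = $0
}
END { if (line != "") print line }   # flush the final row after the last record
' "$tmp/file01.txt")
printf '%s\n' "$out"
```

On the sample file this prints the four rows of the first desired output. It does not add counts, and a run consisting of a single row would be printed twice.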

Corona688 · December 5, 2013, 1:34pm

So you want the first and last line of each group (as determined by $2) plus a count of how many lines there were in the group?

It will be difficult to make this a "one-liner" as printing the count requires it to read ahead, to know when the "group" ends.

alex2005 · December 5, 2013, 1:36pm

Yes, that's correct.

It doesn't have to be a one-liner. I only used a one-liner for my own attempts.

Corona688 · December 5, 2013, 1:41pm

$ cat grp2.awk
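!/^ABC/ { next }        # skip anything that is not a data row (e.g. TERM)

# First row of a new group: close the previous group (last row + count), then print this row
!($2 in A)      {       if(LAST) print LAST,A[LID] ; print      }
# Every data row: count it and remember it as the latest row of its group
                {       A[$2]++; LAST=$0; LID=$2                }
# End of input: close the final group
END             {       if(LAST) print LAST, A[LID]             }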

$ awk -f grp2.awk data

ABC     12315   68.53   12042013   165144
ABC     12315  582.18   12052013   165147 4
ABC     12316    2.36   12052013   165141
ABC     12316  122.36   12052013   165145 5

$

Your input data includes five lines for 12316, not four.
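
The first post also asked for an empty line between pairs, which the script above does not add. A possible variant is sketched below. It assumes, like the original, that rows of the same group (column 2) are contiguous. The variable names key, n and last, the temporary directory, and the out variable are illustrative choices, not part of the script above.

```shell
# Recreate the sample input in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file01.txt" <<'EOF'
TERM
TERM
TERM
ABC     12315   68.53   12042013   165144
ABC     12315   62.12   12042013   165145
ABC     12315  122.36   12052013   165146
ABC     12315  582.18   12052013   165147
ABC     12316    2.36   12052013   165141
ABC     12316   68.53   12042013   165142
ABC     12316   62.12   12042013   165143
ABC     12316  122.36   12052013   165144
ABC     12316  122.36   12052013   165145
EOF

out=$(awk '
!/^ABC/ { next }                      # ignore non-data rows such as TERM
$2 != key {                           # column 2 changed: a new group starts
    if (last != "") {                 # close the previous group, if any
        print last, n
        print ""                      # empty separator line between groups
    }
    print                             # first row of the new group
    key = $2; n = 0
}
{ n++; last = $0 }                    # count the row, remember it as latest
END { if (last != "") print last, n } # close the final group
' "$tmp/file01.txt")
printf '%s\n' "$out"
```

As with the script above, a group consisting of a single row would have that row printed twice, the second time with a count of 1.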

alex2005 · December 5, 2013, 1:49pm

Thank you, worked very well.

The output with a count of 4 was for the case where

ABC     12316   62.12   12042013   165143

was missing. I tested your script and it works well even if a value is missing from a group.

Akshay_Hegde · December 5, 2013, 1:56pm

alex2005:

Thank you, worked very well.

The output with a count of 4 was for the case where
ABC     12316   62.12   12042013   165143
was missing. I tested your script and it works well even if a value is missing from a group.

How?

I also get the same result as Corona688, assuming the file is sorted:

$ awk '!/^ABC/{next}p!=$5-1{printf last ? last FS x[l]++ RS $0 RS : $0 RS}{p=$5;last=$0;l=$2;x[$2]++}END{print last FS x[l]++}' file
ABC     12315   68.53   12042013   165144
ABC     12315  582.18   12052013   165147 4
ABC     12316    2.36   12052013   165141
ABC     12316  122.36   12052013   165145 5

alex2005 · December 5, 2013, 2:43pm

Hi,
Thank you for your reply.
I want to be able to use the script even if the values in column $5 are not consecutive.

For example, if the row

ABC     12316   62.12   12042013   165143

is missing, file01.txt would become:

TERM
TERM
TERM
ABC     12315   68.53   12042013   165144
ABC     12315   62.12   12042013   165145
ABC     12315  122.36   12052013   165146
ABC     12315  582.18   12052013   165147
ABC     12316    2.36   12052013   165141
ABC     12316   68.53   12042013   165142
ABC     12316  122.36   12052013   165144
ABC     12316  122.36   12052013   165145

Here is the result of your one-liner:

awk '!/^ABC/{next}p!=$5-1{printf last ? last FS x[l]++ RS $0 RS : $0 RS}{p=$5;last=$0;l=$2;x[$2]++}END{print last FS x[l]++}' file02.txt

ABC     12315   68.53   12042013   165144
ABC     12315  582.18   12052013   165147 4
ABC     12316    2.36   12052013   165141
ABC     12316   62.12   12042013   165143 3
ABC     12316  122.36   12052013   165145
ABC     12316  122.36   12052013   165145 5

My desired output would be:

ABC     12315   68.53   12042013   165144
ABC     12315  582.18   12052013   165147  4
ABC     12316    2.36   12052013   165141
ABC     12316  122.36   12052013   165145  4

Sorry that I didn't describe this more accurately the first time.

Best Regards

Akshay_Hegde · December 5, 2013, 3:20pm

In your first post you were checking $5, which caused the confusion. Anyway, this will work, as will Corona688's solution; both check $2:

$ cat file
TERM
TERM
TERM
ABC     12315   68.53   12042013   165144
ABC     12315   62.12   12042013   165145
ABC     12315  122.36   12052013   165146
ABC     12315  582.18   12052013   165147
ABC     12316    2.36   12052013   165141
ABC     12316   68.53   12042013   165142
ABC     12316  122.36   12052013   165144
ABC     12316  122.36   12052013   165145

$ awk '!/^ABC/{next}p!=$2{print l ? l FS x[p] RS $0 : $0}{p=$2;l=$0;x[$2]++}END{print l FS x[p] }' file
ABC     12315   68.53   12042013   165144
ABC     12315  582.18   12052013   165147 4
ABC     12316    2.36   12052013   165141
ABC     12316  122.36   12052013   165145 4
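
Both solutions rely on rows of the same group being contiguous. If that cannot be guaranteed, one option is to collect the first row, last row and count per key and print everything at the end. This is only a sketch: the array names first, last, count and order are made up for illustration, and "first" and "last" here mean order of appearance in the file, not order by column 5.

```shell
# Interleaved sample: the same rows as the second file, with groups mixed.
tmp=$(mktemp -d)
cat > "$tmp/mixed.txt" <<'EOF'
TERM
ABC     12315   68.53   12042013   165144
ABC     12316    2.36   12052013   165141
ABC     12315   62.12   12042013   165145
ABC     12316   68.53   12042013   165142
ABC     12315  122.36   12052013   165146
ABC     12316  122.36   12052013   165144
ABC     12315  582.18   12052013   165147
ABC     12316  122.36   12052013   165145
EOF

out=$(awk '
!/^ABC/ { next }                                        # ignore non-data rows
!($2 in count) { order[++nkeys] = $2; first[$2] = $0 }  # first sighting of a key
{ count[$2]++; last[$2] = $0 }                          # update count and latest row
END {
    for (i = 1; i <= nkeys; i++) {                      # keys in order of appearance
        k = order[i]
        print first[k]
        print last[k], count[k]
    }
}' "$tmp/mixed.txt")
printf '%s\n' "$out"
```

The trade-off is memory: two rows and a counter are held per distinct key until the end of input, whereas the streaming versions above keep only one row at a time.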

alex2005 · December 5, 2013, 3:31pm

Thank you so much for your time.