awk - comparing fields from the same column, finding discontinuities

acsg · April 14, 2011, 2:52am

Hello,

I have a file with two fields. The first field repeats itself for quite a while but the second field changes. What I want to do is to go through the first column until its value changes (and while it doesn't, verify that the second field is in a sequence from 0-15).

Example input:

160 13
160 14
160 15
160 0
160 1
160 4 <-- **
160 2 <-- **
409 2
409 3
409 5 <-- **
....

For the output I would like to have a report like:

Channel 160: 2 discontinuities
Channel 409: 1 discontinuity

I only have a quite tangled pseudo-code so far since I don't know how to refer to "the previous field in the same column":

{channel[$1];

   while ($1=i){

      if($2 < $(previous field, same column) && ($2==0 && $(prev.field,same column) !=15 && $(prevfield,same column) != 0) || $2 > $(prev.field, same column)+1)
discont++;
   }
}

Thanks!

mirni · April 14, 2011, 3:34am

How about this...

awk  '
   ch!=$1{ch=$1;seq=$2}  #initialize with new channel
   ch==$1{   #channel same as stored
             if((seq++%16)!=$2) #increment and cycle the counter; compare
                 cntr[$1]++
   }END{
     for(i in cntr) {
       print "Channel " i " has " cntr[i] " discontinuities" 
   }
 }' input

Note that the output will be in arbitrary order, because awk does not define the traversal order of for (i in array).
To sort by channel, pipe the awk output to sort:

awk '{...}' input | sort -n -k2

To sort by number of discontinuities, sort by fourth field:

| sort -n -k4
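
As a quick check of the two sort invocations, here is a minimal sketch. The report function and its three lines are made up for illustration and simply stand in for the output of the awk script.

```shell
# Hypothetical report lines, standing in for the output of the awk script.
report() {
  printf '%s\n' \
    'Channel 409 has 1 discontinuities' \
    'Channel 160 has 2 discontinuities' \
    'Channel 1022 has 7 discontinuities'
}

report | sort -n -k2    # by channel number (field 2)
report | sort -n -k4    # by number of discontinuities (field 4)
```

The first command lists channel 160 first, the second lists channel 409 first.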

_UVI · April 14, 2011, 3:44am

another way

awk 'BEGIN{ 
  while (getline  > 0){
    if ( NR > 1 ){
      C1 = $1
      C2 = $2

      if ( P1 == C1 ){
        if ( (P2+1)%16 != C2)
          data_array[$1]++
      }
    }
    P1 = $1
    P2 = $2
  }
}

END{
  for (var in data_array)
    if ( data_array[var] > 1)
      print "Channel " var ": " data_array[var] " discontinuities"
    else
      print "Channel " var ": " data_array[var] " discontinuity"
}'

acsg · April 14, 2011, 4:54am

mirni:

It doesn't seem to work: it doesn't print anything. Are the first couple of rules supposed to be wrapped in a BEGIN block?

---------- Post updated at 11:42 AM ---------- Previous update was at 11:34 AM ----------

|uvi|:

Thank you!!! It works... but there's a tiny problem. I should have specified that two consecutive lines with the same value in the second field shouldn't be counted as a discontinuity.

So for example

160 13
160 14
160 15
160 0
160 1
160 1
160 1
160 4 <-- **
160 2 <-- **
409 2
409 3
409 5 <-- **

Lines 5, 6 and 7 (the repeated 160 1) shouldn't be counted as discontinuities; I have a lot of those in the input file.

---------- Post updated at 11:54 AM ---------- Previous update was at 11:42 AM ----------

|uvi|:

Actually, I think it's only counting the number of times a channel is present in the first field... because I have another script that does that and returns the stats, and they're both giving the same results now...

_UVI · April 14, 2011, 5:20am

cat input.txt | awk 'BEGIN{ 
  while (getline  > 0){
    if ( NR > 1 ){
      C1 = $1 
      C2 = $2

      if ( P1 == C1 && P2 != C2){
        if ( (P2+1)%16 != C2 )
          data_array[$1]++
      }
    }
    P1 = $1
    P2 = $2
  }
}

END{
  for (var in data_array)
    if ( data_array[var] > 1)
      print "Channel " var " : "  data_array[var] " discontinuities"
    else
      print "Channel " var " : "  data_array[var] " discontinuity"
}'

Now it should work.

---------- Post updated at 04:20 AM ---------- Previous update was at 03:58 AM ----------

With this version the program also prints channels with 0 discontinuities:

cat input.txt | awk 'BEGIN{ 
  while (getline  > 0 && NF > 0){
    data_array[$1]+=0
    if ( NR > 1 ){
      C1 = $1 
      C2 = $2

      if ( P1 == C1 && P2 != C2){
        if ( (P2+1)%16 != C2 )
          data_array[$1]++
      }
    }
    P1 = $1
    P2 = $2
  }
}

END{
  for (var in data_array)
    if ( data_array[var] = 1)
      print "Channel " var " : "  data_array[var] " discontinuity"
    else
      print "Channel " var " : "  data_array[var] " discontinuities"
}'

acsg · April 14, 2011, 5:31am

|uvi|:

Thank you so much!! You were extremely helpful.

mirni · April 14, 2011, 1:16pm

Sorry, I had an extra brace there at the beginning... I fixed the original reply.

acsg · April 15, 2011, 2:40am

mirni:

I tried the new code with this input:

160 1
160 2
160 3
160 4
160 6 <-- **
160 7
160 8
160 9
160 10
160 10
160 11
160 12
160 13
160 14
160 15
160 0
160 15 <-- **
160 0
162 1
162 2
162 4 <-- **
162 6 <-- **
162 7
162 8

and I got this output:

Channel 160 has 13 discontinuities
Channel 162 has 4 discontinuities

There should be 2 discontinuities in channel 160 and 2 in channel 162. Is there an if statement missing, one that checks whether the channel is the same as the stored one?

I also re-checked the other code (the one provided by UVI) with this smaller input file, and it doesn't work properly either, so I still have the same problem.

mirni · April 15, 2011, 5:06am

Yes, that was missing. Test this out:

awk  '
   ch!=$1{ch=$1;seq=$2}  #initialize with new channel
   ch==$1{   #channel same as stored
             if(seq==$2) next;  #if same, skip to next line
             else if((++seq%16)!=$2) { #increment and cycle the counter; compare
                 cntr[$1]++
                 seq=$2     #reset seq
                 #print "Disc. " $0   #debug; uncomment to check what was grabbed
             }
   }END{
     for(i in cntr) {
       print "Channel " i " has " cntr[i] " discontinuities" 
   }
 }' input

acsg · April 19, 2011, 5:49am

Hi mirni,

Thanks for the reply. It works well overall, but it seems to flag every second repeated number as a discontinuity as well. Say this is the input (with the sequence running 0-3 instead of 0-15):

400 0
400 1 xx
400 1 xx
400 2
400 3
400 0 //
400 0 //
400 1
400 2
400 3 xx
400 3 xx
400 0
400 1 //
400 1 //

The places marked with // are reported as discontinuities. I think there's a problem with how "seq" is stored right after a repeated number is detected, but I haven't been able to fix it.

I've attached the input file I'm using to test it, and the result of the discontinuities is:

Channel: 400
Discontinuities: 2

Discontinuities

Line number: 25
400 6

Line number: 52
400 15

Thanks a lot for your time and help.

_UVI · April 19, 2011, 6:09am

cat input.txt | awk 'BEGIN{ 
  while (getline  > 0 && NF > 0){
      data_array[$1]+=0
      if ( NR > 1 ){
        C1 = $1 
        C2 = $2

        if ( P1 == C1 && P2 != C2){
            if ( (P2+1)%16 != C2 ){
              data_array[$1]++
            }
        }
      }
      P1 = $1
      P2 = $2
  }
}

END{
  for (var in data_array)
    if ( data_array[var] == 1)
      print "Channel " var " : "  data_array[var] " discontinuity"
    else
      print "Channel " var " : "  data_array[var] " discontinuities"
}'

There was an assignment error (= instead of == in the END block)!
It works correctly now.
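
The same previous/current comparison can also be written as ordinary awk rules instead of a getline loop inside BEGIN. The following is only a sketch under a few assumptions: the function name find_disc, the mod variable (16 for a 0-15 counter) and the "Line N:" report format are illustrative choices, loosely modelled on the line-number report shown earlier in the thread.

```shell
# Sketch: compare each line with the previous one, using normal awk rules.
# find_disc and mod are assumed names; mod=16 means the counter runs 0..15.
find_disc() {
  awk -v mod=16 '
    NF < 2 { next }                                   # skip blank or short lines
    seen && $1 == p1 && $2 != p2 && $2 != (p2 + 1) % mod {
      count[$1]++                                     # same channel, not a repeat, not the successor
      printf "Line %d: %s %s\n", NR, $1, $2
    }
    { count[$1] += 0; p1 = $1; p2 = $2; seen = 1 }    # remember this line for the next comparison
    END {
      for (ch in count)
        printf "Channel %s: %d %s\n", ch, count[ch], (count[ch] == 1 ? "discontinuity" : "discontinuities")
    }
  ' "$@"
}

printf '%s\n' '160 15' '160 0' '160 1' '160 1' '160 4' '160 2' \
              '409 2' '409 3' '409 5' | find_disc
```

For the sample above this reports lines 5, 6 and 9, followed by one summary line per channel (2 discontinuities for 160, 1 for 409). The order of the summary lines is not guaranteed.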

mirni · April 19, 2011, 3:25pm

I don't see what it is doing wrong. Data:

$ cat d2
400 0
400 1 xx
400 1 xxxx
400 2
400 3
400 0 //
400 0 ////
400 1
400 2
400 3 xx
400 3 xxxx
400 0
400 1 //
400 1 ////

Script:

$ cat test.sh
#!/bin/sh

awk  '
   ch!=$1{ch=$1;seq=$2}  #initialize with new channel
   ch==$1{   #channel same as stored
             if(seq==$2) next;  #if same, skip to next line
             else if((++seq%16)!=$2) { #increment and cycle the counter; compare
                 cntr[$1]++
                 seq=$2     #reset seq
                 print "Disc. " $0 " seq: " seq   #debug: show each line flagged as a discontinuity
             }
   }END{
     for(i in cntr) {
       print "Channel " i " has " cntr[i] " discontinuities" 
   }
 }' < "$1"

Run:

$ ./test.sh d2 
Disc. 400 0 // seq: 0
Disc. 400 0 seq: 0
Channel 400 has 2 discontinuities

Isn't that the desired output?
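
One caveat with real 0-15 data: seq is incremented but never wrapped, so after a 15 -> 0 step it holds 16, and a repeated 0 on the next line no longer matches seq==$2 and gets counted. A minimal sketch of a variant that always stores the actual (wrapped) value is below. The function name count_disc and the modulus argument are assumptions for illustration; 4 matches the 0-3 sample above and 16 matches a 0-15 counter.

```shell
# Sketch: keep seq equal to the last value actually seen, so it never exceeds mod-1.
# count_disc is an assumed name; $1 is the modulus, input is read from stdin.
count_disc() {
  awk -v mod="$1" '
    NF < 2 { next }                                   # skip blank or short lines
    !started || $1 != ch {                            # new channel: resynchronise
      started = 1; ch = $1; seq = $2
      cntr[ch] += 0                                   # also list channels with 0 discontinuities
      next
    }
    $2 == seq { next }                                # repeated value: not a discontinuity
    {
      if ($2 != (seq + 1) % mod) cntr[$1]++           # not the expected successor
      seq = $2                                        # store the value seen, already in range
    }
    END { for (i in cntr) print "Channel " i " has " cntr[i] " discontinuities" }
  '
}

printf '%s\n' '400 0' '400 1' '400 1' '400 2' '400 3' \
              '400 0' '400 0' '400 1' | count_disc 4
```

With the modulus set to 4, the repeated 0 after the wrap is skipped and the run prints "Channel 400 has 0 discontinuities".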

summer_cherry · April 20, 2011, 11:02pm

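use strict;
use warnings;

my %hash;
while(<DATA>){
	my @tmp = split;
	next unless @tmp >= 2;                # skip blank or short lines
	push @{$hash{$tmp[0]}}, $tmp[1];      # collect the counter values per channel
}
foreach my $key(sort { $a <=> $b } keys %hash){
	my $cnt = 0;
	my @arr = @{$hash{$key}};
	for(my $i=1;$i<=$#arr;$i++){
		next if $arr[$i]==$arr[$i-1];                 # repeated value: not a discontinuity
		$cnt++ if $arr[$i]!=($arr[$i-1]+1)%16;        # expected successor, wrapping 15 -> 0
	}
	print $key," has ", $cnt, ($cnt==1 ? " discontinuity\n" : " discontinuities\n") if $cnt;
}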
__DATA__
160 13
160 14
160 15
160 0
160 1
160 4
160 2
409 2
409 3
409 5

acsg · April 21, 2011, 2:00am

It's working now, thank you so much for your help and time mirni, |UVI| and summer_cherry :o