Awk regular expression - I need exactly 1 occurrence of it

ioannisp · January 6, 2009, 8:17am

Hi all,

I am processing a file with awk that looks like this:

"
0.0021 etc
0.0123 etc
0.1234 etc
...
0.5324 etc
0.5434 etc
0.6543 etc
...
1.0344 etc
1.1344 etc
...
1.5345 etc
1.5632 etc
"
I need to print only the lines that have '0' or '5' right after the decimal point, and I need only the first line for each such pattern. So, based on the structure above, I need this:

"
0.0021 etc
0.5324 etc
1.0344 etc
1.5345 etc
"

I managed to keep only the lines that have '0' or '5' after the decimal point with this regular expression (somewhere inside an "if" clause of an awk script):

$1~/0\.0/||$1~/0\.5/||$1~/1\.0/||$1~/1\.5/

so I get this:

"
0.0021 etc
0.0123 etc
0.5324 etc
0.5434 etc
1.0344 etc
1.5345 etc
1.5632 etc
"

What I can't accomplish is to keep only the first occurrence of each ".0" or ".5" pattern.

I found in the GNU manual that to keep exactly n occurrences of the expression r in awk, you should add {n} right after it (r{n}). So I tried it with {1} after the expression and it didn't compile. I also tried it with backslashes, and put it before or somewhere inside the expression, but no luck.

Do you have any idea what's wrong?

Thanx in advance.

rubin · January 6, 2009, 7:05pm

awk '$1~/[0-9]\.(0|5)/ && !a[substr($1,1,3)]++' file
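For anyone unsure what the second condition does, here is a minimal sketch on made-up lines shaped like the simplified file above (the sample data and the variable name `out` are only for illustration). `a[key]++` hands back the old count and then increments it, so `!a[key]++` is true only the first time a given key shows up; the key here is the first three characters of `$1`.

```sh
# Made-up sample lines, modelled on the simplified file in the question.
out=$(printf '%s\n' \
  '0.0021 etc' '0.0123 etc' '0.1234 etc' \
  '0.5324 etc' '0.5434 etc' '0.6543 etc' \
  '1.0344 etc' '1.1344 etc' '1.5345 etc' '1.5632 etc' |
awk '
  # key = first three characters of $1, e.g. "0.0", "0.5", "1.0"
  # a[key]++ yields the old count (0 on first sight), so !a[key]++ is true once per key
  $1 ~ /[0-9]\.(0|5)/ && !a[substr($1, 1, 3)]++
')
printf '%s\n' "$out"
```

On these ten lines it should keep only the 0.0021, 0.5324, 1.0344 and 1.5345 lines. Note that a three-character key only works while the integer part has a single digit.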

reborg · January 6, 2009, 7:15pm

There should not be any need for the ++ or the substr

awk '/^[0-9]+\.(0|5)/ && ! a[$0]'

frostmourn · January 6, 2009, 9:46pm

awk '/^.\.[05]/{if(substr($1,1,3)==a){a=substr($1,1,3);next}print;a=substr($1,1,3)}'

rujuta_rahalkar · January 7, 2009, 2:45am

Can you please tell me what exactly this part of the code does:
&& !a[substr($1,1,3)]++'

ioannisp · January 7, 2009, 6:12am

Thanx everybody for replying.

The problem was not solved :(

It seems that the second part of the expression

$1~/[0-9]\.(0|5)/ && ...2nd part...

is always true if the first part is true, whether the expression is "!a[substr($1,1,3)]++" or "!a[$0]".
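The `!a[$0]` case can be reproduced on a few made-up lines: the array entry is only read and never set, which would explain why that test is always true. For comparison, the `++` form on the same lines does drop the repeats, so if that one also looks always true, something else in the surrounding condition may be involved. The data and the names `out_plain` and `out_incr` are assumptions for this sketch.

```sh
# Made-up sample lines; not taken from the real trace file.
data='0.0021 etc
0.0123 etc
0.5324 etc
0.5434 etc'

# Without ++ the entry a[$0] is only read, never set, so every matching line passes.
out_plain=$(printf '%s\n' "$data" | awk '/^[0-9]+\.(0|5)/ && !a[$0]')

# With ++ and a three-character key, only the first line per key passes.
out_incr=$(printf '%s\n' "$data" | awk '/^[0-9]+\.(0|5)/ && !a[substr($1, 1, 3)]++')

printf 'without ++:\n%s\n\nwith ++:\n%s\n' "$out_plain" "$out_incr"
```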

Unfortunately, the solution frostmourn suggested didn't compile, and I am afraid I don't understand its structure well enough to make it compile.

Just for the record, the actual file looks like this (I posted a simplified view before because I thought it didn't matter):

"
+ 0.1 0 1 tcp 40 ------- 1 0.0 6.0 0 0
- 0.1 0 1 tcp 40 ------- 1 0.0 6.0 0 0
+ 0.1 7 2 tcp 40 ------- 2 7.0 6.1 0 1
- 0.1 7 2 tcp 40 ------- 2 7.0 6.1 0 1
+ 0.1 8 3 tcp 40 ------- 3 8.0 6.2 0 2
- 0.1 8 3 tcp 40 ------- 3 8.0 6.2 0 2
...
...
- 12.999072 2 7 ack 40 ------- 2 6.1 7.0 59 1228
r 13.002496 2 7 ack 40 ------- 2 6.1 7.0 59 1227
r 13.015712 3 2 ack 40 ------- 2 6.1 7.0 59 1229
+ 13.015712 2 7 ack 40 ------- 2 6.1 7.0 59 1229
- 13.015712 2 7 ack 40 ------- 2 6.1 7.0 59 1229
r 13.019136 2 7 ack 40 ------- 2 6.1 7.0 59 1228
r 13.035776 2 7 ack 40 ------- 2 6.1 7.0 59 1229
"

The '$1' in my previous example is '$2' in the actual problem. So this is the code I use based on your suggestions (I also need $1 to be "r", $3 to be 1 and $5 to be "tcp", but that doesn't change anything):

if($1=="r" && ($2~/\.(0|5)/ && !a[substr($2,1,3)]++) && $3==1 && $5=="tcp")
{
...
}

This didn't work either:
if($1=="r" && ($2~/\.(0|5)/ && !a[$2]) && $3==1 && $5=="tcp")
{
...
}

Am I still doing something wrong?

The code I used before, which returned all the occurrences and not just the first one, was:
if($1=="r" && ($2~ /\.0/ ||$2~ /\.5/) && $3==2 && $5=="tcp")
{
...
}
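One thing that might matter in the first `if` above (a guess, checked only on made-up lines): `&&` is evaluated left to right, so `!a[substr($2,1,3)]++` bumps the counter before `$3==1` and `$5=="tcp"` are tested. A line that matches the time pattern but fails a later test would then use up the key. The sketch below compares that order with one where the counter test comes last; the trace lines and the names `out_early` and `out_late` are illustrative only.

```sh
# Made-up trace lines in the same column order as the real file.
data='r 0.502 2 7 tcp 40
r 0.504 1 2 tcp 40
r 0.507 1 2 tcp 40'

# Counter bumped before $3 is checked: the first line ($3 == 2) uses up key "0.5".
out_early=$(printf '%s\n' "$data" |
  awk '$1=="r" && $2~/\.(0|5)/ && !a[substr($2,1,3)]++ && $3==1 && $5=="tcp" {print $2}')

# Counter bumped last: only lines that pass every other test can claim the key.
out_late=$(printf '%s\n' "$data" |
  awk '$1=="r" && $2~/\.(0|5)/ && $3==1 && $5=="tcp" && !a[substr($2,1,3)]++ {print $2}')

printf 'counter first: [%s]\ncounter last:  [%s]\n' "$out_early" "$out_late"
```

On these three lines the first form prints nothing and the second prints 0.504.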

Thanx in advance.

@rujuta_rahalkar: substr(a,b,c) returns the substring of the string a that begins at position b (counting from 1) and is c characters long. The effects of the negation and the increment are not clear to me either.

reborg · January 10, 2009, 7:54pm

Please post a sample of your desired output; it is a little unclear what you are trying to achieve.

ioannisp · January 11, 2009, 10:22am

Here is a sample of the desired output

"
0.020064
0.522624
1.00656
8.058944
"

based on this input:

"
- 0.020064 0 1 tcp 40 ------- 1 0.0 6.0 0 0
r 0.020064 1 2 tcp 40 ------- 1 0.0 6.0 0 0
r 0.020067 1 2 tcp 40 ------- 1 0.0 6.0 0 0
...
+ 0.522624 5 6 tcp 40 ------- 2 7.0 6.1 0 1
r 0.522624 1 5 tcp 40 ------- 2 6.1 7.0 0 12
r 0.522625 1 5 tcp 40 ------- 2 6.1 7.0 0 12
...
+ 0.998912 5 6 tcp 1040 ------- 3 8.0 6.2 1 21
r 1.00656 1 2 tcp 40 ------- 4 6.3 9.0 1 26
r 1.00657 1 9 tcp 40 ------- 4 6.3 9.0 1 26
...
r 8.058944 1 5 tcp 1040 ------- 2 7.0 6.1 52 883
r 8.058944 5 6 tcp 1040 ------- 2 7.0 6.1 52 883
r 8.062336 1 6 tcp 1040 ------- 5 10.0 6.4 200 837
"

This is actually the output of an ns-2 simulation; $2 represents the time. So what I basically want is only the time values of the form *.0* or *.5* on lines that have $3==1. Moreover, if there is more than one value with this property, say 8.058944 and then 8.063296, I only want the first 8.0* value, in this case 8.058944. The same goes for all the other values (0.0*, 0.5*, 1.0*, 1.5*, ..., 8.0*, 8.5*).

Thanks

Franklin52 · January 11, 2009, 3:03pm

If I don't misunderstand the question:

awk '{f=substr($2,1,3)} f!=s && $3==1{s=f;print $2}' file
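Spelled out on a few lines in the style of the posted sample (shortened and made up for this sketch; `data` and `out` are illustrative names): `f` is the first three characters of the time, `s` remembers the last prefix that was printed, and a line is printed only when the prefix differs from `s` and `$3` is 1. This relies on the times appearing in ascending order.

```sh
# Shortened, made-up lines in the same column order as the posted sample.
data='r 0.020064 1 2 tcp 40
r 0.020067 1 2 tcp 40
+ 0.522624 5 6 tcp 40
r 0.522624 1 5 tcp 40
r 0.522625 1 5 tcp 40
r 1.00656 1 2 tcp 40
r 1.00657 1 9 tcp 40'

out=$(printf '%s\n' "$data" | awk '
  { f = substr($2, 1, 3) }                # prefix of the time, e.g. "0.5"
  f != s && $3 == 1 { s = f; print $2 }   # new prefix on a $3==1 line: remember it and print
')
printf '%s\n' "$out"
```

On these lines it should print 0.020064, 0.522624 and 1.00656.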

Regards

ioannisp · January 11, 2009, 7:37pm

It worked, thank you very much Franklin52 and everybody who answered.

quirkasaurus · January 30, 2009, 1:49pm

Ah, I reread everything and saw Franklin's solution.
I thought that the 2nd column wasn't necessarily sorted,
so I came up with:

nawk '{
  if ( $2 ~ /\.[05]/ && $3 == 1 ){
    idx = sprintf( "%0.01f", $2 );
    if ( ! hash[idx] ){
      hash[idx] = $2;
    }
  }
}
END{
  for ( idx in hash ){
    print hash[idx];
  }
}'

By the way, the previous solution wouldn't handle 13.0 and 13.5 correctly: substr($2,1,3) gives "13." for both.
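A possible variant, offered as a sketch rather than a drop-in fix: `sprintf("%0.01f", $2)` rounds instead of truncating, so a time like 8.058944 would get the key "8.1". Cutting the text of `$2` one digit past the decimal point avoids that and also handles two-digit integer parts. Printing while reading, instead of in `END`, keeps the input order, since the order of `for (idx in hash)` is unspecified. The sample lines and the names `key`, `seen` and `out` are made up for illustration.

```sh
# Made-up trace lines; the times are taken from or modelled on values in this thread.
data='r 8.058944 1 5 tcp 1040
r 8.062336 1 6 tcp 1040
r 13.002496 1 7 tcp 40
r 13.035776 1 7 tcp 40
r 13.519136 1 7 tcp 40'

out=$(printf '%s\n' "$data" | awk '
  $2 ~ /\.[05]/ && $3 == 1 {
    # key = everything up to and including the first decimal digit, e.g. "13.0"
    key = substr($2, 1, index($2, ".") + 1)
    if (!(key in seen)) {   # first time this bucket appears
      seen[key] = 1
      print $2
    }
  }
')
printf '%s\n' "$out"
```

On these lines it should print 8.058944, 13.002496 and 13.519136.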

ioannisp · January 31, 2009, 3:44am

Thank you, I'm going to revisit the issue, though my problem was mostly solved with the previous suggestion.

Regards