How can I match lines with just one occurrence of a string in awk?

jonathanm · October 24, 2008, 9:46am

Hi,

I'm trying to use awk to match records that contain only one occurrence of my string. I know how to match one or more (+), but matching exactly one is eluding me without writing some convoluted bit of code. I was hoping there would be a simple pattern-matching operator similar to '+' that means 'one and only one occurrence of'.

My matching code looks like this:

$10 !~ /&| and | AND | And |\// && $11 !~ /FLAT|Flat|Apartment|APARTMENT/ && $10 ~ /MR|MISS|MRS|MS|Mr|Miss|Mrs|Ms/ {

But some records have in their name field multiple names, such as

and I want to not match those records.

Any help with this would be grand!

The only alternative I can think of is a convoluted counting loop that splits the name into an array and checks whether any of Mr, Mrs, MR, MRS, etc. occurs more than once, which sounds quite long-winded and unnecessary.

drl · October 25, 2008, 1:09pm

Hi.

I find that such things are relatively straight-forward in perl because of the power of regular expression infrastructure. I don't know if awk has this feature as visibly as does perl, but here is a shell script that drives a small perl script:

#!/bin/bash -

# @(#) s1       Demonstrate perl.

echo
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) perl
set -o nounset
echo

FILE=${1-data1}

echo " Data file $FILE:"
cat $FILE

echo
echo " perl script file:"
cat p1

echo
echo " Results:"
./p1 $FILE

exit 0

Producing:

% ./s1

(Versions displayed with local utility "version")
Linux 2.6.11-x1
GNU bash 2.05b.0
perl 5.8.4

 Data file data1:
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken

 perl script file:
#!/usr/bin/perl

# @(#) p1       Demonstrate skipping of line with repeated matches.

use warnings;
use strict;

my($debug);
$debug = 0;
$debug = 1;
my($t1);

my($lines) = 0;

# Make entire line lower case to simplify matches. Use the captured
# string to omit lines which contain more than one match.

while ( <> ) {
        chomp;
        print " Working on |$_|\n";
        $lines++;
        $t1 = lc $_;
        next if $t1 =~ /(mr|miss).*\1/;
        print "$_\n";
}

print STDERR " ( Lines read: $lines )\n";

exit(0);

 Results:
 Working on |Mr Magoo|
Mr Magoo
 Working on |Mr Magoo mr magoo|
 Working on |Mr Magoo Mr Smith Miss Demeanor|
 Working on |Mr Smith Miss Demeanor|
Mr Smith Miss Demeanor
 Working on |Miss Demeanor Miss Taken|
 Working on |Miss Taken|
Miss Taken
 ( Lines read: 6 )

Best wishes ... cheers, drl

otheus · October 25, 2008, 4:10pm

I prefer perl too, in cases like this, but this is easily solvable in awk. Basically, you want to match X but not X.*X.

$10 !~ /&| and | AND | And |\// && $11 !~ /FLAT|Flat|Apartment|APARTMENT/ && $10 ~ /MR|MISS|MRS|MS|Mr|Miss|Mrs|Ms/ && $10 !~ /(MR|MISS|MRS|MS|Mr|Miss|Mrs|Ms).*(MR|MISS|MRS|MS|Mr|Miss|Mrs|Ms)/ {

And yes, it's a bit ugly, but awk isn't always very pretty.
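
The "match X but not X.*X" idea reads a little more easily if the title list is written once and reused. What follows is only a minimal sketch under some assumptions: the function name `one_title` and the variable `t` are made up for illustration, and the whole line is tested rather than `$10`. awk builds the second regex by plain string concatenation:

```shell
# Hypothetical helper: print lines that contain exactly one title.
# t holds the alternatives once; (t ".*" t) means "a title, anything, a title".
one_title() {
    awk -v t='(MR|MISS|MRS|MS|Mr|Miss|Mrs|Ms)' '$0 ~ t && $0 !~ (t ".*" t)'
}

printf '%s\n' 'Mr Magoo' 'Mr Magoo Mr Smith' 'Mr Smith Miss Demeanor' 'Miss Taken' | one_title
```

On this sample input only "Mr Magoo" and "Miss Taken" should come through. Titles are matched as plain substrings, so a surname that happens to contain one of them would be counted as well.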

radoulov · October 25, 2008, 4:40pm

With GNU AWK:

$ cat file
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken
$ awk -F'm(r|iss)' 'NF==2' IGNORECASE=9 file
Mr Magoo
Miss Taken

Or another version of drl's solution:

perl -nle'!/(m(r|iss)).*\2/i&&print' file

Some versions of sed:

sed -nr '/(m(r|iss)).*\2/I!p' file

... I can't manage to make it work with grep.

danmero · October 25, 2008, 5:15pm

radoulov:

With GNU AWK:

$ cat file
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken
$ awk -F'm(r|iss)' 'NF==2' IGNORECASE=9 file
Mr Magoo
Miss Taken

..

Oops, what is the logic here?

# cat file
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken
# awk 'NF==2' file
Mr Magoo
Miss Taken
# awk -F'm(r|iss)' 'NF==2' IGNORECASE=9 file
Mr Magoo mr magoo

radoulov · October 25, 2008, 5:20pm

I said GNU AWK.

$ cat file
Mr Magoo A
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken B
$ awk -F'm(r|iss)' 'NF==2' IGNORECASE=9 file
Mr Magoo A
Miss Taken B
$ nawk -F'm(r|iss)' 'NF==2' IGNORECASE=9 file
Mr Magoo mr magoo

Just for completeness:

$ awk --version|head -1                      
GNU Awk 3.1.6
$ strings =nawk|grep -Fm1 version
version 20070501

The problem with your second example is the case sensitive search (IGNORECASE is GNU specific):

$ print 'mr
mr mr
miss
miss miss'|nawk -F'm(r|iss)' 'NF==2{print NR,$0}' 
1 mr
3 miss

You may try to make it case-insensitive using more verbose code.
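
One possible shape for a more verbose, case-insensitive version that does not rely on gawk's IGNORECASE, offered as a sketch rather than a tested recipe for every awk: lower-case a copy of the line with tolower() and let split() count the pieces, which plays the role NF plays in the one-liner. The names `single_title` and `parts` are illustrative only.

```shell
# Hypothetical helper: case-insensitive matching without IGNORECASE.
# split() returns the number of pieces; one separator match yields two pieces.
single_title() {
    awk '{ n = split(tolower($0), parts, /m(r|iss)/) } n == 2'
}

printf '%s\n' 'Mr Magoo' 'Mr Magoo mr magoo' 'Mr Smith Miss Demeanor' 'Miss Taken' | single_title
```

With this input the expectation is that "Mr Magoo" and "Miss Taken" are printed, the same lines the GNU AWK one-liner selects.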

drl · October 25, 2008, 6:19pm

Hi.

If grep is compiled with perl regular expressions, one can get farther. I had 2 versions where it was not compiled in. Here's a sample:

#!/bin/bash -

# @(#) s2       Demonstrate perl regular expressions in grep.

echo
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) grep
set -o nounset
echo

FILE=${1-data1}

echo " Data file $FILE:"
cat $FILE

echo
echo " Results:"
grep -v -i --perl-regexp '(mr).*\1' $FILE

exit 0

Producing (on openSUSE 11.0 (i586)):

$ ./s2

(Versions displayed with local utility "version")
Linux 2.6.25.16-0.1-pae
GNU bash 3.2.39
GNU grep 2.5.2

 Data file data1:
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken

 Results:
Mr Magoo
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken

cheers, drl

radoulov · October 26, 2008, 5:45am

Great point drl,
thank you!

It was not obvious to me that this option was needed:

$ cat file
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken

$ grep -viP '(m|(r|iss)).*\2' file
Mr Magoo
Miss Taken

Just an addition (I don't know how I missed that yesterday),
it seems to work with EREs too:

$ grep -Evi '(m|(r|iss)).*\2' file
Mr Magoo
Miss Taken

$ egrep -vi '(m|(r|iss)).*\2' file
Mr Magoo
Miss Taken

drl · October 26, 2008, 8:12am

Hi.

Re-reading the man page for GNU grep, it looks like backreferences do not require the -P option. However, version 2.5.1 fails with egrep and a backreference, so perhaps that's an error that was fixed in later versions.

I think you all understood the problem better than I did. I assumed that only lines repeating the same title, e.g. "Mr" twice, should be omitted, and that "Mr" and "Miss" on the same line, for example, would be allowed. This is, for me at least, another good lesson on writing and reading requirements, along with sufficient examples.

So far, I like your solution:

awk -F'm(r|iss)' 'NF==2' IGNORECASE=9 file

the best. It's concise and makes good use of awk features. I often forget that awk allows regular expressions as field separators ... cheers, drl
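
The two readings of the requirement give different answers on the sample data, which a small sketch can make concrete. Here `filter_titles` and its `mode` argument are hypothetical names: `any` keeps a line only if it holds exactly one title of any kind, while `same` rejects only a repeated title (and, like the grep -v approach, lets lines with no title through). Both count with gsub() on a lower-cased copy.

```shell
# Hypothetical helper; the modes "any" and "same" are illustrative names only.
filter_titles() {
    awk -v mode="$1" '{
        s = tolower($0)
        mr   = gsub(/mr/, "&", s)     # "&" puts the match back, so s is unchanged
        miss = gsub(/miss/, "&", s)   # gsub() returns the number of matches
        if (mode == "any"  && mr + miss == 1)     print
        if (mode == "same" && mr < 2 && miss < 2) print
    }'
}

printf '%s\n' 'Mr Magoo' 'Mr Smith Miss Demeanor' 'Miss Demeanor Miss Taken' | filter_titles any
printf '%s\n' 'Mr Magoo' 'Mr Smith Miss Demeanor' 'Miss Demeanor Miss Taken' | filter_titles same
```

The first call should print only "Mr Magoo"; the second should also keep "Mr Smith Miss Demeanor", because neither title is repeated there.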

radoulov · October 26, 2008, 8:32am

I believe you've got it right and I have not ...
I think that this should give the correct result:

$ grep -viP '(mr|miss).*\1' file
Mr Magoo
Mr Smith Miss Demeanor
Miss Taken

And I still don't understand why the command below returns a different result:

$ grep -viE '(mr|miss).*\1' file
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken

In Perl it should be:

!/(mr|miss).*\1/i

In AWK:

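awk '{ line = $0; $0 = tolower($0)   # keep the original; match on a lower-case copy
  # gsub() returns the number of substitutions, i.e. how often each title occurs
  if (gsub(/mr/, "") < 2 && gsub(/miss/, "") < 2)
     print line }
' file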
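
To tie this back to the original question, where the name lives in `$10`, the counting can be wrapped in a small function and added to the existing conditions. This is only a sketch: the names `match_single` and `count`, the comma field separator and the sample records are all assumptions, since the thread never shows the real data layout.

```shell
# Hypothetical wrapper around the conditions from the first post.
# count() works on its own copy of the string, so $10 itself is left untouched.
match_single() {
    awk -F, '
    function count(str, re) { return gsub(re, "&", str) }
    $10 !~ /&| and | AND | And |\// &&
    $11 !~ /FLAT|Flat|Apartment|APARTMENT/ &&
    count(tolower($10), "mr|miss|ms") == 1'   # "mrs" is counted through "mr"
}

printf '%s\n' \
    '1,2,3,4,5,6,7,8,9,Mr Magoo,1 High Street' \
    '1,2,3,4,5,6,7,8,9,Mr Magoo Mrs Magoo,2 High Street' \
    '1,2,3,4,5,6,7,8,9,Miss Taken,Flat 3' \
    '1,2,3,4,5,6,7,8,9,Ms Demeanor,5 Low Road' | match_single
```

Only the first and last sample records should be printed. As elsewhere in the thread, titles are matched as substrings, so a surname containing "mr" or "ms" would inflate the count.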