I'm trying to use awk to match records that contain only one occurrence of my string. I know how to match one or more (+), but matching exactly one is eluding me without developing some convoluted bit of code. I was hoping there would be some simple pattern-matching operator similar to '+' but meaning 'one and only one occurrence of'.
My matching code looks like this:
$10 !~ /&| and | AND | And |\// && $11 !~ /FLAT|Flat|Apartment|APARTMENT/ && $10 ~ /MR|MISS|MRS|MS|Mr|Miss|Mrs|Ms/ {
But some records have in their name field multiple names, such as
and I want to not match those records.
Any help with this would be grand!
The only alternative I can think of is some convoluted counting loop which goes through the name split as an array to count if any of the Mr, Mrs, MR, MRS, etc occur more than once, which sounds quite long-winded and unnecessary.
I find that such things are relatively straightforward in Perl because of the power of its regular expression infrastructure. I don't know whether awk exposes this feature as visibly as Perl does, but here is a shell script that drives a small Perl script:
#!/bin/bash -
# @(#) s1 Demonstrate perl.
echo
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) perl
set -o nounset
echo
FILE=${1-data1}
echo " Data file $FILE:"
cat $FILE
echo
echo " perl script file:"
cat p1
echo
echo " Results:"
./p1 $FILE
exit 0
Producing:
% ./s1
(Versions displayed with local utility "version")
Linux 2.6.11-x1
GNU bash 2.05b.0
perl 5.8.4
Data file data1:
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken
perl script file:
#!/usr/bin/perl
# @(#) p1 Demonstrate skipping of line with repeated matches.
use warnings;
use strict;
my($debug);
$debug = 0;
$debug = 1;
my($t1);
my($lines) = 0;
# Make entire line lower case to simplify matches. Use captured
# string to omit lines which contain more than one match.
while ( <> ) {
chomp;
print " Working on |$_|\n";
$lines++;
$t1 = lc $_;
next if $t1 =~ /(mr|miss).*\1/;
print "$_\n";
}
print STDERR " ( Lines read: $lines )\n";
exit(0);
Results:
Working on |Mr Magoo|
Mr Magoo
Working on |Mr Magoo mr magoo|
Working on |Mr Magoo Mr Smith Miss Demeanor|
Working on |Mr Smith Miss Demeanor|
Mr Smith Miss Demeanor
Working on |Miss Demeanor Miss Taken|
Working on |Miss Taken|
Miss Taken
( Lines read: 6 )
$ cat file
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken
$ awk -F'm(r|iss)' 'NF==2' IGNORECASE=9 file
Mr Magoo
Miss Taken
# cat file
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken
# awk 'NF==2' file
Mr Magoo
Miss Taken
# awk -F'm(r|iss)' 'NF==2' IGNORECASE=9 file
Mr Magoo mr magoo
$ cat file
Mr Magoo A
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken B
$ awk -F'm(r|iss)' 'NF==2' IGNORECASE=9 file
Mr Magoo A
Miss Taken B
$ nawk -F'm(r|iss)' 'NF==2' IGNORECASE=9 file
Mr Magoo mr magoo
Just for completeness:
$ awk --version|head -1
GNU Awk 3.1.6
$ strings =nawk|grep -Fm1 version
version 20070501
The problem with your second example is the case sensitive search (IGNORECASE is GNU specific):
$ print 'mr
mr mr
miss
miss miss'|nawk -F'm(r|iss)' 'NF==2{print NR,$0}'
1 mr
3 miss
You could make it case-insensitive with more verbose code.
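For what it's worth, one portable way to get case-insensitive behaviour without IGNORECASE might be to lower-case a copy of the record and let gsub() count the matches, since gsub() returns the number of substitutions it made. This is only a sketch: the function name one_title is made up here, and the pattern is the thread's m(r|iss), so it will also count those letters if they turn up inside a surname.

```shell
# Hypothetical helper (name invented for this sketch).
# Prints only the lines that contain exactly one "mr" or "miss", ignoring case.
one_title() {
  awk '{
    line = tolower($0)                        # work on a lower-cased copy
    if (gsub(/m(r|iss)/, "&", line) == 1)     # "&" keeps the text; we only want the count
      print                                   # print the original record
  }' "$@"
}

printf '%s\n' 'Mr Magoo' 'Mr Magoo mr magoo' 'Mr Smith Miss Demeanor' 'Miss Taken' | one_title
```

With the sample lines above this should print only "Mr Magoo" and "Miss Taken", and it should behave the same in gawk, nawk and mawk because it uses nothing beyond tolower() and gsub().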
If grep is compiled with perl regular expressions, one can get farther. I had 2 versions where it was not compiled in. Here's a sample:
#!/bin/bash -
# @(#) s2 Demonstrate perl regular expressions in grep.
echo
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) grep
set -o nounset
echo
FILE=${1-data1}
echo " Data file $FILE:"
cat $FILE
echo
echo " Results:"
grep -v -i --perl-regexp '(mr).*\1' $FILE
exit 0
Producing (on openSUSE 11.0 (i586)):
$ ./s2
(Versions displayed with local utility "version")
Linux 2.6.25.16-0.1-pae
GNU bash 3.2.39
GNU grep 2.5.2
Data file data1:
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken
Results:
Mr Magoo
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken
It was not obvious to me that this option was needed:
$ cat file
Mr Magoo
Mr Magoo mr magoo
Mr Magoo Mr Smith Miss Demeanor
Mr Smith Miss Demeanor
Miss Demeanor Miss Taken
Miss Taken
$ grep -viP '(m|(r|iss)).*\2' file
Mr Magoo
Miss Taken
Just an addition (I don't know how I missed that yesterday),
it seems it works with EREs too:
$ grep -Evi '(m|(r|iss)).*\2' file
Mr Magoo
Miss Taken
$ egrep -vi '(m|(r|iss)).*\2' file
Mr Magoo
Miss Taken
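One caveat with the backreference pattern, as far as I can tell: \2 refers to the inner group (r|iss), so the pattern really fires on a repeated "r" or "iss" anywhere in the line, not on a repeated title. It gives the right answer on this data partly because "Demeanor" happens to contain an "r". A sketch that spells the title out twice instead of using a backreference is below; the helper name single_title and the sample name "Mr Roger" are invented for illustration.

```shell
# Hypothetical helper (name invented for this sketch).
# Drops every line that contains two titles; no backreference involved.
single_title() {
  grep -Eiv 'm(r|iss).*m(r|iss)' "$@"
}

printf '%s\n' 'Mr Roger' 'Mr Magoo mr magoo' 'Mr Smith Miss Demeanor' 'Miss Taken' | single_title
```

This should keep "Mr Roger" and "Miss Taken". Note that, because of -v, it also keeps lines that contain no title at all, which the awk NF==2 version does not.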
Re-reading the man page for GNU grep, it looks like backreferences do not require the -P option. However, version 2.5.1 fails with egrep and a backreference, so perhaps that's an error that was fixed in later versions.
I think you all understood the problem better than I did. I assumed that only lines repeating the same title, e.g. "Mr" twice, should be omitted, and that "Mr" and "Miss" on the same line, for example, would be allowed. This is, for me at least, another good lesson on writing and reading requirements, along with sufficient examples.
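To make the two readings of the requirement concrete, here is a rough sketch that labels each line under both of them. The helper name classify_titles and the keep/drop labels are invented for this illustration. The first column is the "same title must not repeat" reading, the second is the "exactly one title of any kind" reading.

```shell
# Hypothetical helper (name and labels invented for this sketch).
classify_titles() {
  awk '{
    line = tolower($0)
    mr   = gsub(/mr/,   "&", line)    # how many times "mr" occurs
    miss = gsub(/miss/, "&", line)    # how many times "miss" occurs
    same_title = (mr <= 1 && miss <= 1) ? "keep" : "drop"   # no single title repeated
    any_title  = (mr + miss == 1)       ? "keep" : "drop"   # exactly one title overall
    print same_title, any_title, $0
  }' "$@"
}

printf '%s\n' 'Mr Magoo' 'Mr Magoo mr magoo' 'Mr Smith Miss Demeanor' | classify_titles
```

The line "Mr Smith Miss Demeanor" is where the two readings part ways: it is kept under the first and dropped under the second.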
So far, I like your solution:
awk -F'm(r|iss)' 'NF==2' IGNORECASE=9 file
the best. It's concise and makes good use of awk features. I often forget that awk allows regular expressions as field separators ... cheers, drl
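For anyone wondering why NF==2 does the job: with the title as the field separator, one title splits the record into two fields (possibly an empty one before it), two titles give three fields, and so on. A small sketch that prints the field count per line is below. It uses [Mm] in the separator instead of IGNORECASE so that it does not depend on gawk, and the helper name count_fields is made up for this example.

```shell
# Hypothetical helper (name invented for this sketch).
# Shows how many fields each record has when a title is the separator.
count_fields() {
  awk -F'[Mm](r|iss)' '{ print NF ": " $0 }' "$@"
}

printf '%s\n' 'Mr Magoo' 'Mr Magoo Mr Smith' 'Magoo' | count_fields
# 2: Mr Magoo
# 3: Mr Magoo Mr Smith
# 1: Magoo
```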