Script to find duplicate pattern in a file irrespective of case

johnjs · October 12, 2012, 1:55pm

We have a configuration file on Unix with entries like the one below. A line ending with ":" marks the end of a record. We need to find out whether there are any duplicate entries, such as ABCD, irrespective of case.

ABCD:\
  :conn.retry.stwait=00.00.30:\
  :sess.pnode.max=255:\
  :sess.snode.max=255:\
  :sess.default=1:\
  :comm.info=abcd.nam.nsroot.net;1364:\
  :pacing.send.count=0:

DGPickett · October 12, 2012, 2:37pm

Many tools support case insensitivity and back-references in their regexes, though regular expressions come in many flavors and extensions. For instance, this sed script finds such duplicates:

sed -in '
  :loop
  $d
  /\\$/{
    N
    b loop
    }
  /\(:[a-z][^:]*:\).*\1/p
 ' your_in_file

Narrative flow: sed runs in case-insensitive mode with no automatic output. I create a branch target so I can loop. If I hit EOF while collecting a \-continued line, I bail out with a delete. If there is a \ at the end of the line, I read another line into the buffer and recheck for EOF and for a \ at the end of the line. Then I capture, with \(\), a colon + letter + any number of non-colon characters + colon, and check with \1 whether the same text occurs again later; if it does, I print. Requiring the letter keeps me from grabbing inter-field areas like colon + \ + end-of-line + spaces + colon (':\\\n *:') as a field.

Don_Cragun · October 12, 2012, 5:32pm

This problem seems easier to me in awk than in sed:

awk -F" *:" '$1=="" {next}
{       if(list[toupper($1)]++)
                printf("%s on line %d has been seen %d times\n",
                        $1, NR, list[toupper($1)])
}' in

In case it isn't obvious what is going on here: this assumes that any line starting with zero or more spaces followed by a colon is a continuation line, and that any other line is the first line of a configuration record. It converts the first field to uppercase and counts how many times that name has been seen. When a name has been seen more than once, it reports the name, the input line number, and the number of times it has been seen, each time it finds a duplicate entry.
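
As a quick check of the behavior described above, here is a minimal sketch that runs the same awk logic against a made-up sample file. The function name find_dups, the temporary file, and the sample records are assumptions for illustration only; they are not from the original configuration.

```shell
# Hypothetical sample file; record names and values are invented for this demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ABCD:\
  :sess.default=1:\
  :pacing.send.count=0:
efgh:\
  :sess.default=1:
abcd:\
  :sess.default=2:
EOF

# find_dups is a hypothetical wrapper around the awk script from the post.
find_dups() {
  awk -F" *:" '
    $1 == "" { next }            # optional spaces then ":" means a continuation line
    {
      if (list[toupper($1)]++)   # true from the second sighting of a name onward
        printf("%s on line %d has been seen %d times\n",
               $1, NR, list[toupper($1)])
    }
  ' "$1"
}

find_dups "$sample"
# prints: abcd on line 6 has been seen 2 times
```

Only the second occurrence is reported, because the counter is still zero when a name is first seen.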

If your configuration file has comments on lines starting with a particular string, this script can easily be modified to skip them.

DGPickett · October 15, 2012, 4:51pm

Well, a more structured file would help pick the right tool. While awk is more oriented toward delimited fields and can span lines using alternate separators, sed is fast and simple, and its skills are easy to reuse on a wide variety of problems and work interactively in vi at the : prompt. Here, the separators are negative: line feeds that are not escaped.
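
To illustrate the point about records being delimited by unescaped line feeds, here is a rough sketch that joins \-continued lines into one logical record per output line. The function name join_records and the sample data are hypothetical, and this is only one of several ways to do it.

```shell
# Hypothetical sample file; record names and values are invented for this demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ABCD:\
  :sess.default=1:\
  :pacing.send.count=0:
efgh:\
  :sess.default=1:
EOF

# join_records is a hypothetical helper: one output line per logical record.
join_records() {
  awk '
    {
      line = $0
      if (sub(/\\$/, "", line)) {   # trailing backslash removed: record continues
        buf = buf line
        next
      }
      print buf line                # no trailing backslash: record is complete
      buf = ""
    }
    END { if (buf != "") print buf }  # flush a last record that was left open
  ' "$1"
}

join_records "$sample"
# prints:
# ABCD:  :sess.default=1:  :pacing.send.count=0:
# efgh:  :sess.default=1:
```

Once each record sits on one line, ordinary line-oriented tools can be applied to whole records.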

johnjs · October 16, 2012, 12:07pm

Thanks Don, it works. I would like to ignore lines starting with spaces or # as the start of a configuration. A line should count as the start of a record only if it starts with a letter or digit and ends with :\

ABCD:\

---------- Post updated at 12:07 PM ---------- Previous update was at 12:03 PM ----------

Thanks. I tried the sed command, but I am getting the error below.

sed: illegal option -- i

DGPickett · October 16, 2012, 4:39pm

My bad, -i is in-place edit (also very handy). For a case-insensitive regex, you need the I modifier, as described in the GNU manual "sed, a stream editor":

 /regexp/I
 \%regexp%I
     The I modifier to regular-expression matching is a GNU extension
     that causes the regexp to be matched in a case-insensitive manner.

$ sed -n '
  :loop
  $d
  /\\$/{
    N
    b loop
    }
  /\(:[a-z][^:]*:\).*\1/Ip
 ' your_in_file

johnjs · October 17, 2012, 11:25am

Thanks. Still no luck; I am getting the error below.

sed:   /\(:[a-z][^:]*:\).*\1/Ip is not a recognized function.

DGPickett · October 17, 2012, 11:47am

Is it GNU sed? I tried /.../Ip on cygwin and it worked fine.

Don_Cragun · October 17, 2012, 10:32pm

johnjs:

Thanks Don, it works. I would like to ignore lines starting with spaces or # as the start of a configuration. A line should count as the start of a record only if it starts with a letter or digit and ends with :\
ABCD:\

I'm sorry it has taken me so long to get back to you. This should do what you want and be portable to almost any version of awk (on Solaris, however, use nawk or /usr/xpg4/bin/awk rather than just awk):

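awk -F":" '
# Skip lines that start with a space or "#".
$0 ~ /^[ #]/ {next}
# A record starts with a non-empty alphanumeric name and ends with ":\".
$1 ~ /^[[:alnum:]]+$/ && $NF ~ /^\\$/ {
        if(list[toupper($1)]++)
                printf("%s on line %d has been seen %d times\n",
                        $1, NR, list[toupper($1)])
}' in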
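
For anyone who wants to try this on test data first, the following is a small sketch of the same idea run against an invented sample. The function name find_dup_records and the sample contents are assumptions. It also swaps [[:alnum:]] for [A-Za-z0-9] and requires at least one character in the name; both are choices made for this sketch (some older awk builds do not handle POSIX character classes), not part of the script above.

```shell
# Hypothetical sample file; record names and values are invented for this demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# sample comment
ABCD:\
  :sess.default=1:
 abcd:\
Abcd:\
  :sess.default=2:
EOF

# find_dup_records is a hypothetical wrapper for this demo.
find_dup_records() {
  awk -F":" '
    $0 ~ /^[ #]/ { next }                      # ignore indented lines and comments
    $1 ~ /^[A-Za-z0-9]+$/ && $NF ~ /^\\$/ {    # name, then ":\" at end of line
      if (list[toupper($1)]++)
        printf("%s on line %d has been seen %d times\n",
               $1, NR, list[toupper($1)])
    }
  ' "$1"
}

find_dup_records "$sample"
# prints: Abcd on line 5 has been seen 2 times
```

The indented " abcd:\" on line 4 is skipped, so only the unindented Abcd on line 5 is counted as a duplicate of ABCD.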

DGPickett · October 18, 2012, 4:49pm

Maybe too old a GNU sed? From "sed --version" I get 4.2.1; what do you get?
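
If the sed at hand turns out not to be GNU sed, one possible workaround is to fold the input to lowercase with tr and then run the same pattern without the I modifier. The sketch below is an assumption-laden variant, not the script from the posts above: the function name find_repeated_fields and the sample data are hypothetical, the printed records come out in lowercase, and the loop uses $! so that the last record is checked rather than deleted.

```shell
# Hypothetical sample file; record names and values are invented for this demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ABCD:\
  :sess.default=1:\
  :Sess.Default=1:
EFGH:\
  :sess.default=1:
EOF

# find_repeated_fields is a hypothetical helper for this demo.
find_repeated_fields() {
  # Fold case first, so no case-insensitive regex modifier is needed.
  tr '[:upper:]' '[:lower:]' < "$1" |
  sed -n '
    :loop
    /\\$/{
      $!{
        N
        b loop
      }
    }
    /\(:[a-z][^:]*:\).*\1/p
  '
}

find_repeated_fields "$sample"
# prints the first record only, folded to lowercase:
# abcd:\
#   :sess.default=1:\
#   :sess.default=1:
```

Like the sed scripts earlier in the thread, this reports a :field: that is repeated within a single record; it does not compare record names across records.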