GSUB/Regex Help

I am trying to write a gsub regex that replaces a bunch of special characters with spaces, so I can split the line into an array and look at each word independently.

However, my regex skills are slightly lacking and I appear to be missing a quote or something here.

I am trying to replace the following characters \/;:'"`()| with a space.

gsub ( "[\\'|\\"|\\:|\\;|\\`|\\(|\\)|\\\|\\/|\\|]"," ",$0 )

When I run this piece of the code, I get the following error:

line 240: syntax error at line 289: `)' unexpected

What OS are you on? Which shell?

Can you please try this:

awk '{gsub(/[\|\(\)\\\/\;\:\47\`\"]*/," ",$0);}1' input_file

Caution: this might be a naive solution, since I am a beginner here.
\47 is the octal value of the single quote (').
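
A quick way to check this, assuming an ASCII-based system: awk expands the octal escape \47 inside a string, so the following should print a single quote. The variable name quote_char is made up for this check.

```shell
# \47 is octal for decimal 39, the ASCII code of the single quote.
# quote_char is a hypothetical name used only for this check.
quote_char=$(awk 'BEGIN { print "\47" }')
printf '%s\n' "$quote_char"
```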

That almost works. You don't need (or want) the asterisk in a call to gsub(). With the asterisk, the expression matches every run of zero or more listed characters, including the empty string, so in this case it effectively adds a space before any character in the input that isn't in the list. Several of your backslashes escape characters that aren't special inside a bracket expression, but they shouldn't hurt anything. Also, if the last argument to gsub() is left off, it defaults to $0.
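
Here is a small sketch of the difference, using a made-up input line and a shortened character list (both are assumptions for illustration). The exact spacing the starred version produces may vary between awk implementations, so no output is shown for it; the unstarred version should give a b.

```shell
# Hypothetical sample input; only ; and : are in the list to keep it short.
sample='a;b'
# With the asterisk: the empty string also matches, so extra spaces appear.
starred=$(printf '%s\n' "$sample" | awk '{gsub(/[;:]*/," ")}1')
# Without the asterisk: only the listed characters are replaced.
plain=$(printf '%s\n' "$sample" | awk '{gsub(/[;:]/," ")}1')
# Brackets make leading and trailing spaces visible.
printf '[%s]\n[%s]\n' "$starred" "$plain"
```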

The following line works:

awk '{gsub(/[|()\\\/;:\47`"]/," ");}1' input_file

Note, however, that this solution won't work on a system with EBCDIC as the codeset for the C locale. (I think IBM still supports systems like this.) On a system using EBCDIC, you'd need to use \175 instead of \47. If you put the program in an awk program file (where the shell doesn't perform quote processing on the script before awk sees it), you can use a literal single quote instead. With the following line in a file named script:

{gsub(/[|()\\\/;:'`"]/," ");print}

the command:

awk -f script input_file

will work without codeset dependencies.
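
If you would rather create the program file from the shell, a quoted here-document is one way to do it. This is only a sketch: the file name and the sample input line are made up, and it assumes a writable temporary directory.

```shell
# Hypothetical temporary file name for the awk program.
script_file="${TMPDIR:-/tmp}/strip_special.$$.awk"
# The quoted 'EOF' delimiter stops the shell from interpreting the quotes,
# backslashes, and back-quote inside the here-document.
cat > "$script_file" <<'EOF'
{gsub(/[|()\\\/;:'`"]/," ");print}
EOF
# Made-up input line containing a single quote, a slash, and a semicolon.
result=$(printf '%s\n' "it's a/b;c" | awk -f "$script_file")
printf '%s\n' "$result"    # prints: it s a b c
rm -f "$script_file"
```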


Superb, Don! Your explanation will be useful to many here who, like me, are trying to learn awk (or any programming language) properly. Thanks a million :)

Don,

Thanks for the tips. I have it working on my HP and Red Hat boxes now.
I am running this as part of a ksh script, so I still need the quote processing.

Here is the current code (I added a few more symbols):

gsub ( "['\)''\('=;:/'\'''\`''\\''\"''\|''\.''\$''\-''\@''\%']"," ",$0 )

I'm not sure why, but Solaris does not like this. I keep getting the following error.

Just to be sure I understand what you're saying, you have a ksh shell script that at some point contains something like:

awk 'first line of awk program
second line of awk program
third line of awk program
fourth line of awk program
fifth line of awk program
sixth line of awk program
seventh line of awk program
pattern {gsub ( "['\)''\('=;:/'\'''\`''\\''\"''\|''\.''\$''\-''\@''\%']"," ",$0 )}
possibly more lines of awk program'
possibly followed by more lines in your ksh shell script

and you have chosen to use the call to gsub() shown above rather than the suggestion I made in an earlier post:

gsub(/[|()\\\/;:\47`"]/," ")

because you now also want to change the characters <period>, <dollar-sign>, <hyphen>, <at-sign>, and <percent-sign> to a <space> in addition to the characters you were changing before. Is that correct?

Note that having $-@ in a bracket expression in the first argument to gsub() after quote removal is a range expression matching <dollar-sign>, <at-sign>, and everything that collates between them in your current locale. In the POSIX locale, that should match the following characters:

$%&'()*+,-./0123456789:;<=>?@

not just $, -, and @.
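
One way to list the range yourself is to print every character between the two endpoints by code value. This sketch assumes an ASCII-based codeset (where $ is decimal 36 and @ is decimal 64); the variable name is made up.

```shell
# Assumes an ASCII-based codeset: $ is decimal 36 and @ is decimal 64.
# range_chars is a hypothetical variable name.
range_chars=$(awk 'BEGIN { for (i = 36; i <= 64; i++) printf "%c", i; print "" }')
# Prints every character from $ through @, including all ten digits.
printf '%s\n' "$range_chars"
```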

I put together an input file to use to test various calls to gsub:

in.gsubspecial:

backslash[\] slash[/] semi[;] colon[:]
single-quote['] double-quote["] back-quote[`]
open-paren[(] close-paren[)] open-brace[{] close-brace[}]
dollar-sign[$] at-sign[@] percent[%] hyphen[-] 
digits[0123456789]
range-expression[$%&'()*+,-./0123456789:;<=>@]

and used the following commands in a shell script to test out three sample gsub() calls I produced and the gsub() call you have above:

awk ' { #print input line
        print
        #make copies
        x=$0
        y=$0
        z=$0
        #Previously suggested gsub() with the added characters $, @, %, and -
        #This version uses \47 to represent the single-quote
        gsub(/[|()\\\/;:\47`"$@%-]/," ");print $0,"my original gsub"
        #The following versions use '\'' to get out of the quoted string
        #  containing the program, insert an escaped quote, and get back into
        #  the quoted string containing the rest of the program (which gets rid
        #  of the codeset dependency).
        #Previous gsub() with added characters, without a range expression
        #Note that - is at the end of the bracket expression
        gsub("[|()\\\\\/;:'\''`\"$@%-]"," ", x);print x,"expanded gsub, no range exp"
        #Previous gsub() with added characters, using a range expression
        #Note that - is in between $ and @
        gsub("[|()\\\\\/;:'\''`\"$-@%]"," ", y);print y,"Expanded gsub w/range"

        gsub("['\)''\('=;:/'\'''\`''\\''\"''\|''\.''\$''\-''\@''\%']"," ",z);print z,"gsub from nitrobass24"
}' in.gsubspecial

When I run this script, I get the following output:

backslash[\] slash[/] semi[;] colon[:]
backslash[ ] slash[ ] semi[ ] colon[ ] my original gsub
backslash[ ] slash[ ] semi[ ] colon[ ] expanded gsub, no range exp
backslash[ ] slash[ ] semi[ ] colon[ ] Expanded gsub w/range
backslash[\] slash[ ] semi[ ] colon[ ] gsub from nitrobass24
single-quote['] double-quote["] back-quote[`]
single quote[ ] double quote[ ] back quote[ ] my original gsub
single quote[ ] double quote[ ] back quote[ ] expanded gsub, no range exp
single quote[ ] double quote[ ] back quote[ ] Expanded gsub w/range
single quote[ ] double quote[ ] back quote[ ] gsub from nitrobass24
open-paren[(] close-paren[)] open-brace[{] close-brace[}]
open paren[ ] close paren[ ] open brace[{] close brace[}] my original gsub
open paren[ ] close paren[ ] open brace[{] close brace[}] expanded gsub, no range exp
open paren[ ] close paren[ ] open brace[{] close brace[}] Expanded gsub w/range
open paren[ ] close paren[ ] open brace[{] close brace[}] gsub from nitrobass24
dollar-sign[$] at-sign[@] percent[%] hyphen[-]
dollar sign[ ] at sign[ ] percent[ ] hyphen[ ] my original gsub
dollar sign[ ] at sign[ ] percent[ ] hyphen[ ] expanded gsub, no range exp
dollar sign[ ] at sign[ ] percent[ ] hyphen[ ] Expanded gsub w/range
dollar sign[ ] at sign[ ] percent[ ] hyphen[ ] gsub from nitrobass24
digits[0123456789]
digits[0123456789] my original gsub
digits[0123456789] expanded gsub, no range exp
digits[          ] Expanded gsub w/range
digits[          ] gsub from nitrobass24
range-expression[$%&'()*+,-./0123456789:;<=>@]
range expression[  &   *+, . 0123456789  <=> ] my original gsub
range expression[  &   *+, . 0123456789  <=> ] expanded gsub, no range exp
range expression[                            ] Expanded gsub w/range
range expression[                            ] gsub from nitrobass24
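
As a side note, the '\'' idiom from the test script can be tried in isolation. This is a minimal sketch with a made-up input line and variable name; the only point is to show what awk receives after the shell has removed the quotes.

```shell
# The shell joins three pieces: a quoted string ending in [, an escaped
# single quote, and a quoted string starting with ].
# awk therefore receives the program: { gsub("[']", " ") } 1
unquoted=$(printf '%s\n' "don't stop" | awk '{ gsub("['\'']", " ") } 1')
printf '%s\n' "$unquoted"    # prints: don t stop
```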

I don't know why your gsub() wouldn't work with /usr/xpg4/bin/awk on Solaris 10. Is there any chance that you're using a locale with a non-standard setting for the LC_COLLATE category? Are you sure that you are using exactly the same script on Solaris 10 that you're using on the other systems? Having another single-quote anywhere in your awk script (even in a comment) could greatly change the behavior. I do see that your gsub() call fails to change a backslash character into a space. If you intended to use $-@ as a range expression, we can drop several characters from the bracket expression that are listed individually but are also covered by the range (including the single-quote).
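
If the range is what you want, something like the following might be a starting point. It is only a sketch: it assumes the POSIX locale (forced here with LC_ALL=C), the function name is made up, and it also blanks digits and the other characters that fall inside the range.

```shell
# Sketch only: assumes the $-@ range is intentional and LC_ALL=C collation.
# blank_specials is a hypothetical helper name.
blank_specials() {
    # The range already covers the single-quote, parentheses, slash, colon,
    # semicolon, equals, period, hyphen, and percent; only double-quote, |,
    # backslash, and back-quote still need to be listed.
    LC_ALL=C awk '{ gsub(/[$-@"|\\`]/, " ") } 1' "$@"
}
printf '%s\n' 'a\b"c|d`e(f)g' | blank_specials    # prints: a b c d e f g
```

Since the single-quote no longer appears in the program text, no '\'' or \47 is needed inside the shell's single quotes.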

Hopefully, this will give you something you can adapt to your needs.