awk search pattern with special characters passed from CL

cue · February 22, 2010, 11:02am

I'm very new to awk and sed and I've been struggling with this for a while.

I'm trying to search a file for a string with special characters and this string is a command line argument to a simple script.

./myscript "searchpattern" file

#!/bin/sh

awk "/$1/" $2 > dupelistfilter.txt
sed "/$1/d" $2 >> deletelisttest.txt

Since the search pattern is a path containing / and whitespace, what's the best way to deal with special characters in it?
I have to do this in the original Bourne shell, which has no printf %q.

Nested quotes throw an awk error at me, even if I use eval to avoid the $1.

Any advice appreciated.

anbu23 · February 22, 2010, 11:28am

$ x="/tmp/tmp"
$ echo "/tmp/tmp121/" | awk ' $0 ~ "'"$x"'" '
/tmp/tmp121/

alister · February 22, 2010, 12:20pm

Hey, cue:

I would recommend against your awk/sed approach and I also recommend not using anbu23's code (nothing personal, anbu23 ;)). All of the proposed solutions are vulnerable to the presence of special characters (whether they be special to sed and awk regular expressions [cue's examples] or to awk strings [as in anbu23's case]).

You've already seen the problem with your attempts. Even if you escaped the forward slashes in the variable's value (or used a different delimiter, s#regexp#replacement#flag, which sed allows), you may still encounter problems if there is a "." or a "*" or any other metacharacter.

anbu23's code would fail with syntax errors if there's a double quote in the value, would match erroneously if backslash sequences are present, and so on, because the value collides with AWK's string parsing.

Example:

$ x='/tmp/tmp"'
$ echo '/tmp/tmp"121/' | awk ' $0 ~ "'"$x"'" '
awk: non-terminated string  tmp/tmp ... at source line 1
 context is
         >>>  <<<
awk: giving up
 source line number 2

In my opinion, the best (most futureproof) approach is to use something not vulnerable to any magical characters. I suggest:

awk -v x="$1" 'index($0,x)' "$2" > dupelistfilter.txt

If you want to negate the logic of the match:

awk -v x="$1" '!index($0,x)' "$2" >> deletelisttest.txt
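
A minimal sketch of the two commands above run against made-up data, to show matching and non-matching lines being split apart. The file contents, the needle values, and the variable names are assumptions for illustration only, and the $(...) substitutions assume a POSIX sh (a pre-POSIX Bourne shell would need backticks). One caveat worth checking on your own awk: POSIX specifies that backslash escape sequences in a -v assignment are interpreted, so a value containing a backslash may still be altered. The last command shows one possible way around that, reading the string from the environment through ENVIRON, which assumes a POSIX-style awk (nawk, gawk, mawk) rather than the oldest awk.

```sh
#!/bin/sh
# Hypothetical sample list; the second line differs from the needle only at the dot.
list=${TMPDIR:-/tmp}/index_demo.$$
printf '%s\n' '/data/my files/a.txt' '/data/my files/aXtxt' 'C:\new\dir' > "$list"

needle='/data/my files/a.txt'

# Lines that contain the needle as a literal substring.
matched=$(awk -v x="$needle" 'index($0, x)' "$list")

# Lines that do not contain it.
rest=$(awk -v x="$needle" '!index($0, x)' "$list")

# Variant for needles containing backslashes: pass the string through the
# environment so awk does not apply escape-sequence processing to it.
env_hits=$(NEEDLE='C:\new' awk 'index($0, ENVIRON["NEEDLE"])' "$list")

printf 'matched:\n%s\nrest:\n%s\nenv_hits:\n%s\n' "$matched" "$rest" "$env_hits"
rm -f "$list"
```

The whitespace and slashes in the needle need no special handling here because the value never passes through a regular expression parser.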

Regards,
Alister

anbu23 · February 22, 2010, 12:58pm

If the input has quotes, then:

echo '/tmp/tmp"121/' | awk -v x='/tmp/tmp"' ' $0 ~ x '

alister · February 22, 2010, 1:18pm

In that case, the value of x inside AWK is vulnerable to regular expression metacharacters. Say, for example, that you wanted to match a pathname that had a dot. The dot would not be treated literally, but would be a wildcard matching any character. In the following example, it yields a false positive.

$ echo '/tmp/tmpnext' | awk -v x='/tmp/tmp.ext' ' $0 ~ x '
/tmp/tmpnext

There's simply no way around it. Unless you are absolutely certain that there will be no metacharacters involved, you cannot pass a value through SED or AWK's regular expression parsers (or AWK's string parser) without first passing that value through some sort of sanitizing step to properly escape those special characters. That step would be something of a nightmare if the value had to be made safe to pass through AWK's string parsing before arriving at the regular expression parsing stage.
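
For illustration only, here is a rough sketch of what such a sanitizing step could look like for the sed case alone. The helper name escape_bre, the sample paths, and the temporary file are assumptions, not something from the posts above. It only covers sed basic regular expressions with / as the delimiter; it does not attempt the AWK string-plus-regex case, and it does not handle newlines in the pattern. It also assumes a POSIX sh, because inside backticks the backslashes in the sed script would need further doubling.

```sh
#!/bin/sh
# Hypothetical helper: put a backslash in front of every character that is
# special in a sed basic regular expression, plus the "/" delimiter.
# The bracket expression lists: ] \ / $ * . ^ [
escape_bre() {
    printf '%s\n' "$1" | sed 's|[]\/$*.^[]|\\&|g'
}

# Made-up sample data: the second line differs only where the pattern has a dot.
list=${TMPDIR:-/tmp}/escape_demo.$$
printf '%s\n' '/tmp/my dir/a.txt' '/tmp/my dir/aXtxt' > "$list"

escaped=$(escape_bre '/tmp/my dir/a.txt')

# With the escaped pattern, only the exact path is deleted.
remaining=$(sed "/$escaped/d" "$list")

printf 'escaped:   %s\nremaining: %s\n' "$escaped" "$remaining"
rm -f "$list"
```

Even in this narrow form the escaping is easy to get wrong, which is a fair argument for avoiding regular expressions entirely when a literal match is all that is needed.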

Alister

---------- Post updated at 01:18 PM ---------- Previous update was at 01:15 PM ----------

cue:

Now that I think about it, by far the simplest solution to this is fgrep. I became fixated on AWK and sed since they were listed in the original post. Unless I missed something, the following should work just fine and is not susceptible to metacharacter interference.

fgrep "$1" "$2" > dupelistfilter.txt
fgrep -v "$1" "$2" >> deletelisttest.txt
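
A small sketch of the same idea on made-up data, using grep -F, which POSIX specifies as the fixed-string form and which newer GNU grep releases prefer over the fgrep name. The sample lines and variable names are assumptions for illustration. The -e option is added so that a search string beginning with a dash is not mistaken for an option. The $(...) substitutions assume a POSIX sh.

```sh
#!/bin/sh
# Hypothetical sample list; the second line would match if "." were a wildcard.
list=${TMPDIR:-/tmp}/fgrep_demo.$$
printf '%s\n' '/tmp/tmp.ext' '/tmp/tmpnext' '/var/log/x' > "$list"

needle='/tmp/tmp.ext'

# Lines containing the needle as a fixed string.
kept=$(grep -F -e "$needle" "$list")

# Lines not containing it (the counterpart of sed's d command).
dropped=$(grep -F -v -e "$needle" "$list")

printf 'kept:\n%s\ndropped:\n%s\n' "$kept" "$dropped"
rm -f "$list"
```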

cue · February 22, 2010, 1:36pm

Thanks, anbu23 and alister, for the suggestions. It seems there is always one character that can cause a problem. As long as that specific metacharacter isn't permitted in a filename, it's fine for my use of the script, even if the input isn't fully sanitized. Thanks again.

Edit: never mind, you're right, I should have just used grep/fgrep.

alister · February 22, 2010, 1:50pm

The index($0,x) call checks whether the string you're searching for (stored in the variable x) is present in the current line (stored in $0). If so, index() returns a non-zero value (the position of the match), which AWK treats as a boolean true. If the string is not found, index() returns zero. When the result is true, AWK prints the line, because the implied default action is "{print $0}".

My post with the awk solutions included two commands; the second negates the return value with a "!", so that it excludes lines that match (what you are doing with sed's d command).
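
The behaviour described above can be checked with a few one-liners. This is only a sketch: the strings and variable names are made up for illustration, and the $(...) substitutions assume a POSIX sh.

```sh
#!/bin/sh
# index() returns the 1-based position of the substring, or 0 if it is absent.
found=$(awk 'BEGIN { print index("/tmp/tmp.ext", ".ext") }')
missing=$(awk 'BEGIN { print index("/tmp/tmpnext", ".ext") }')
printf 'found at position %s, missing gives %s\n' "$found" "$missing"

# A pattern with no action prints the line, the same as an explicit { print $0 }.
implicit=$(printf '%s\n' 'a.b' 'axb' | awk 'index($0, ".")')
explicit=$(printf '%s\n' 'a.b' 'axb' | awk 'index($0, ".") { print $0 }')

# Negating the result selects the lines that do not contain the string.
negated=$(printf '%s\n' 'a.b' 'axb' | awk '!index($0, ".")')

printf 'implicit: %s\nexplicit: %s\nnegated: %s\n' "$implicit" "$explicit" "$negated"
```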

All that said, you're probably best off using the fgrep commands at the end of my previous post.

Cheers,
Alister