Regex within IF statement in awk

Hello to all,

I have:

X="string 1-"
Y="-string 2"
Z="string 1-20-string 2"

Different numbers can appear in place of the 20, but I'm only interested when the number is 15, 20, 45 or 70.

I want to include an if statement with a regex in my awk code, in the following way:

if(Z!~/X(15|20|45|70)Y/)

but it doesn't seem to work. What is wrong? Is this possible?

The following way works:

if(Z!=X15Y && Z!=X20Y && Z!=X45Y && Z!=X70Y)

But I want to shorten the code, and a regex would be much better.

Thanks in advance.

This is an example of one way:

echo 'string 1-20-string 2' | awk -F, '{ if ($1 ~ /-(15|20|45|70)-/) print }'

You can use an expression in place of the regular expression literal.

Z !~ X "(15|20|45|70)" Y

Be aware that the contents of any string literals used as part of a regular expression must traverse two parsers, first the string parser, then the regular expression parser. This is significant if you need to use escape sequences. To learn more, see the gawk manual: computed regular expressions (dynamic/computed regular expressions are not gawk specific, this was just a convenient link near the top of a web search).
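
As a minimal sketch of the two-parser point (the strings "a.b" and "axb" and the variable names are made up for illustration, not taken from the question): a literal dot needs one backslash in a regex constant, but two in a string literal that is used as a dynamic regex.

```shell
result=$(awk 'BEGIN {
    re  = "a\\.b"            # after the string parser, re holds: a\.b
    lit = ("a.b" ~ /a\.b/)   # regex constant: one backslash is enough
    dyn = ("a.b" ~ re)       # dynamic regex built from the string
    neg = ("axb" ~ re)       # the dot is literal, so "axb" must not match
    print lit, dyn, neg
}')
echo "$result"   # prints: 1 1 0
```

The shell single quotes pass the backslashes to awk untouched, so the only two parsers involved here are awk's string parser and its regex parser.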

Regards,
Alister

The pattern within /PAT/ cannot include variables. spacebar's fixed pattern is one solution. A more general solution, exactly the way you want it, is as follows:

PAT=X "(15|20|45|70)" Y; if (Z !~ PAT)
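
For reference, here is a small runnable sketch of that idea using the X, Y and Z values from the question. The shell function name check and the "^" and "$" anchors are my own additions (the anchors make the regex behave like the original != comparisons); this assumes X and Y contain no regex metacharacters, which is true for these sample strings.

```shell
check() {
    # $1 is the candidate string Z; prints "wanted" or "other"
    awk -v Z="$1" 'BEGIN {
        X = "string 1-"
        Y = "-string 2"
        PAT = "^" X "(15|20|45|70)" Y "$"   # pattern built from variables
        if (Z !~ PAT)
            print "other"
        else
            print "wanted"
    }'
}

check 'string 1-20-string 2'
check 'string 1-33-string 2'
```

The first call should report the string as wanted and the second as other.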

Hello,

Thanks spacebar, alister and hanson for the answers.

I've tried them, but only the option of including the complete string literally inside the "//" worked.

I mean, the only way that worked for me is this one (like spacebar's solution):

if (Z !~ /string 1-(15|20|45|70)-string 2/)

I'm not sure why the other 2 solutions don't work. I'm using Cygwin.

The issue is that I was trying to replace "string 1-" and "-string 2" with variables because they are very long strings in the real code.

Thanks for the help.

I'm not sure why it is not working either. It "ought" to work. You can probably see the alister and hanson44 solutions are the same, the only difference being whether to assign the pattern to a separate variable.

To figure out why it is not working, post what you are doing, how you are trying to implement the solution, and what the output is. It will help if you can post more than a fragment, post something that produces output, something we can replicate. And post what version of awk is being used in your cygwin environment, which might make a difference. awk --version will probably print the version. If not, figure out what the awk version is however you can.

Hello hanson44,

The awk version is:

$ awk --version
GNU Awk 4.0.1
Copyright (C) 1989, 1991-2012 Free Software Foundation.

The sample input file is:

5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894315|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894334|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894320|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894302|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894391|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894345|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894320|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894345|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894370|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894315|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3

The output file should be (only the lines that do not match the regex between the slashes "//"):

5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894334|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894302|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894391|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3

The code that works for me is:

awk '{if($0!~/5\|35\|998367383\|5\|3\|\|,7\|44\|783738002\|3\|55\|JK\|,97\|16\|3337128943(15|20|45|70)\|87\|50\|2\|,8,3,32,0,1,0,1,7,8,9,2,2,3/)
print}' input.txt

But if I try the following 2 options, the output is wrong: they print all lines, not only the 3 lines shown in the desired output.
#1) This option doesn't give the desired output:

awk '
BEGIN {
X="5\|35\|998367383\|5\|3\|\|,7\|44\|783738002\|3\|55\|JK\|,97\|16\|3337128943"
Y="\|87\|50\|2\|,8,3,32,0,1,0,1,7,8,9,2,2,3"
Z=X "(15|20|45|70)" Y
}
{if($0 !~ Z); print}' input.txt

#2) This option doesn't give the desired output:

awk '
BEGIN {
X="5\|35\|998367383\|5\|3\|\|,7\|44\|783738002\|3\|55\|JK\|,97\|16\|3337128943"
Y="\|87\|50\|2\|,8,3,32,0,1,0,1,7,8,9,2,2,3"
Z1=X"15"Y
Z2=X"20"Y
Z3=X"45"Y
Z4=X"70"Y
}
{if($0!~Z1 && $0!~Z2 && $0!~Z3 && $0!~Z4); print}' input.txt

Thanks in advance for the help so far.

PS: I know it isn't necessary to use $0, because with FS="|" the test could be $15 !~ /3337128943(15|20|45|70)/. But this is only an example: in my real code, each line that is "$0" in this sample input is a variable containing such a string. For the purpose of my question the behaviour is the same: it only works when I put the string literally between the slashes "//".

Regards

The reason why those fail is explained in my post and in the link that I included.

Regards,
Alister

Hello alister,

Thanks for the answer and the link. I read there about the escape sequences for a regexp constant.

I haven't escaped the "|" because awk understands it as a literal "|", but I don't know why the code below keeps failing.

awk '
BEGIN {
X="5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|3337128943"
Y="|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3"
Z=X "(15|20|45|70)" Y
}
$0 !~ Z {print}' input.txt

Is it not possible because the string contains "|", or what is wrong?

Thanks in advance.

Regards

Thanks for sending the exact input file and script you are using. That really helped.

I took another look, tried several things, learned it's a sticky wicket, and found something that seems to work well.

------------------------------

When I added a diagnostic statement print "Z=" Z at the end of the BEGIN segment, awk printed a warning showing that it disregards the single backslash:

awk: cmd. line:3: warning: escape sequence `\|' treated as plain `|'
Z=5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|3337128943(15|20|45|70)|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3

------------------------------

I tried adding another \ character, in line with what awk expects, to make sure the | is escaped.

awk '
BEGIN {
X="5\\|35\\|998367383\\|5\\|3\\|\\|,7\\|44\\|783738002\\|3\\|55\\|JK\\|,97\\|16\\|3337128943"
Y="\\|87\\|50\\|2\\|,8,3,32,0,1,0,1,7,8,9,2,2,3"
Z=X "(15|20|45|70)" Y
print "Z=" Z
}
{if($0 !~ Z); print}' input.txt

That got rid of the warning message and produced the expected Z string, but unfortunately did not seem to help (I thought it would work):

$ ./test.sh
Z=5\|35\|998367383\|5\|3\|\|,7\|44\|783738002\|3\|55\|JK\|,97\|16\|3337128943(15|20|45|70)\|87\|50\|2\|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894315|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894334|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894320|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894302|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894391|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894345|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894320|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894345|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894370|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894315|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3

------------------

I tried adding further backslash characters, but it would not work. Maybe there is some way to add a series of preceding backslash characters and make it work, but at this point the large number of confusing backslash characters is a turn-off anyway.

The business of escaping escape characters and dealing with the two passes that awk makes over the string expression made me look for a way to avoid those byzantine complications. The following solution uses the simple // form, but creates it on the fly from within a shell script:

$ cat test.sh
B='5\|35\|998367383\|5\|3\|\|,7\|44\|783738002\|3\|55\|JK\|,97\|16\|3337128943'
M='(15|20|45|70)'
E='\|87\|50\|2\|,8,3,32,0,1,0,1,7,8,9,2,2,3'
Z="$B$M$E"
echo "\$0 !~ /$Z/ {print}" > script.awk
awk -f script.awk input.txt

It works correctly:

$ ./test.sh
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894334|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894302|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894391|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3

The dynamically generated script.awk file, which you can examine if you need to troubleshoot:

$ cat script.awk
$0 !~ /5\|35\|998367383\|5\|3\|\|,7\|44\|783738002\|3\|55\|JK\|,97\|16\|3337128943(15|20|45|70)\|87\|50\|2\|,8,3,32,0,1,0,1,7,8,9,2,2,3/ {print}

You can pass in the B (begin) and E (end) strings as shell script arguments, so this seems adaptable to changing them as needed. Hope this works!
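
Another way to hand B and E to awk without generating a script file is the environment. This is only a sketch and was not tried in this thread; the shell function name filter and the two-line sample input are my own. The idea relies on awk not applying string-escape processing to values read from ENVIRON, so a single backslash before each | reaches the regex parser unchanged.

```shell
# B and E hold the regex text exactly as it would appear between slashes.
B='5\|35\|998367383\|5\|3\|\|,7\|44\|783738002\|3\|55\|JK\|,97\|16\|3337128943'
E='\|87\|50\|2\|,8,3,32,0,1,0,1,7,8,9,2,2,3'
export B E

filter() {
    # Build the dynamic regex from the environment, print non-matching lines.
    awk 'BEGIN { Z = ENVIRON["B"] "(15|20|45|70)" ENVIRON["E"] }
         $0 !~ Z'
}

printf '%s\n' \
  '5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894315|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3' \
  '5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894334|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3' |
  filter
```

Only the second sample line (the one ending in ...334) should be printed. Note that values passed with -v are treated differently: they do go through escape processing, so the same strings would need doubled backslashes there.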


"\|" is an undefined sequence. It looks like an escape sequence, but there is no such defined escape sequence in string literals. Some awk string parsers (as the quoted error message makes clear) will discard the backslash and keep the pipe symbol, which is not special in a string. Other awks will keep both characters. However, an implementation that aborts with a compilation error (to my knowledge, none do) is not violating any standard.

The same is true when any character that is not part of a defined escape sequence (\n, \t, \\, etc.) follows a backslash.

This is also an issue with sed escape sequence handling. If someone actually wrote an implementation that strictly refused undefined escape sequences, most non-trivial scripts posted in these forums would fail (which one could argue would be preferable to unknowingly harboring unreliable behavior).
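
One way to avoid the undefined "\|" sequence altogether, offered here as an alternative technique rather than anything proposed earlier in the thread, is a bracket expression: [|] matches a literal pipe and contains no backslash, so the string parser has nothing to interpret. The short strings "a|15|b" and "a|34|b" below are invented stand-ins for the long real ones.

```shell
result=$(printf '%s\n' 'a|15|b' 'a|34|b' | awk '
BEGIN {
    # [|] is a one-character bracket expression: a literal pipe.
    X = "a[|]"
    Y = "[|]b"
    Z = X "(15|20|45|70)" Y   # the | inside the group is still alternation
}
$0 !~ Z')
echo "$result"   # prints: a|34|b
```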

The problem has nothing to do with the string literals, or with their values later being processed by the regular expression parser. The problem is a misplaced semicolon forming an empty if-statement.

You can move or drop the semicolon, or just use a bare pattern. With your string literals as I quoted them, the following will work fine.

BEGIN { X=...; Y=...; Z=X...Y }
$0 !~ Z
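
A minimal sketch of the semicolon effect, using a made-up two-line input ("keep" and "drop") instead of the real data: in "if (cond); print", the semicolon ends the if with an empty statement, and the print that follows is a separate statement that runs for every line.

```shell
# Stray semicolon: the if guards nothing, print always runs.
with_semicolon=$(printf '%s\n' keep drop | awk '{ if ($0 !~ /drop/); print }')

# Semicolon removed: print is the body of the if.
without_semicolon=$(printf '%s\n' keep drop | awk '{ if ($0 !~ /drop/) print }')

# Bare pattern: the default action is print.
bare_pattern=$(printf '%s\n' keep drop | awk '$0 !~ /drop/')

printf '%s\n' "with: $with_semicolon" "without: $without_semicolon" "bare: $bare_pattern"
```

The first form prints both lines, while the other two print only "keep".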

Regards,
Alister


That is correct. Thanks. As I said before, I did expect adding the extra \ would work, was surprised it did not. I just copied the posted script verbatim, did not see the extra semicolon. Here is the corrected script, now you have two ways that work:

$ cat test.sh
awk '
BEGIN {
X="5\\|35\\|998367383\\|5\\|3\\|\\|,7\\|44\\|783738002\\|3\\|55\\|JK\\|,97\\|16\\|3337128943"
Y="\\|87\\|50\\|2\\|,8,3,32,0,1,0,1,7,8,9,2,2,3"
Z=X "(15|20|45|70)" Y
}
{if($0 !~ Z) print}' input.txt
$ ./test.sh
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894334|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894302|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3
5|35|998367383|5|3||,7|44|783738002|3|55|JK|,97|16|333712894391|87|50|2|,8,3,32,0,1,0,1,7,8,9,2,2,3

Hello hanson and alister,

Many thanks for taking the time to help, and thanks for your explanations. Some things are clearer to me now.

hanson,

I also tried double escaping, but I had the semicolon in the same place too.

Alister,

Could you please explain why that misplaced semicolon creates an empty if-statement?

When do I need to put a semicolon (when is it mandatory and when is it not)?

In this case, the semicolon was actually ruining the output.

Thanks for all the help.