Grep with regex containing one string but not the other

stresing · April 8, 2015, 1:59pm

Hi to you all,

I'm struggling with a regex problem and I'm pretty sure that I'm missing something obvious...

I need a regex to feed my grep in order to find lines that contain one string but not the other.

Here's the data example:

2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo

I want all lines with "UserID1" not containing "foobuzz". I tried this one:

LC_ALL=C grep -P '(?!foobuzz).*UserID1.*' example.txt

But it gives me both lines. (I have the "LC_ALL=C" workaround from a Red Hat article; otherwise I get a core dump for free.)

What's wrong with me, uhm, with the regex?

And yes, it should be grep, because the new regex will be part of an existing regex file which is used by a script to grep some gigabytes of data multiple times a day. I don't know if it would be faster with awk or something similar. Any advice is appreciated!

I will get back to you tomorrow - it's late, I'm the last one in the office, sun is gone...

Best regards

Stephan

Additional information:

LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 6.6 (Santiago)
Release:        6.6
Codename:       Santiago

GNU grep 2.6.3

vgersh99 · April 8, 2015, 2:50pm

grep UserID1 example.txt | grep -v foobuzz
OR
awk '/UserID1/ && !/foobuzz/' example.txt
OR
awk '/UserID1/ && !/foobuzz.*UserID1/' example.txt
OR
and the list goes on and on....
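
A quick way to check that the piped grep and the single awk condition agree is to run both on the sample data. This is only a sketch: the third input line (without UserID1) is a hypothetical addition that is not in the original post, and the temporary file name comes from mktemp.

```shell
# Sample data from the thread, plus one hypothetical line without UserID1.
data=$(mktemp)
cat > "$data" <<'EOF'
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo
2015-04-08 19:04:57,001|zzzzzzzzzz|          |foobar|          |INFO |REQUEST|UserID2:99 | other
EOF

# Two-step filter: keep the UserID1 lines, then drop the foobuzz ones.
piped=$(grep UserID1 "$data" | grep -v foobuzz)

# The same logic as a single awk condition.
awked=$(awk '/UserID1/ && !/foobuzz/' "$data")

printf '%s\n' "$piped"
rm -f "$data"
```

Both variables should end up holding only the foobar/UserID1:42 line.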

sea · April 8, 2015, 6:53pm

With this two-line sample data, it would even work with:

$ grep -v foobuzz sample.txt 
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah

Not sure why this doesn't work though...

grep -e UserID1 -ve buzz sample.txt

Corona688 · April 8, 2015, 6:58pm

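You can't combine two regular expressions with "and"/"not" logic in one grep. Multiple -e patterns are simply ORed together, and -v inverts the whole match rather than a single pattern. grep doesn't do any sort of conditional logic, it just matches lines.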

awk is a programming language, and can:

awk '/UserID1/ && !/buzz/'
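
To see what grep actually does with several -e patterns and -v, here is a small sketch on the two sample lines from the first post. The variable names are mine; `|| true` is only there so that a zero count does not abort a script running under `set -e`.

```shell
data=$(mktemp)
cat > "$data" <<'EOF'
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo
EOF

# Several -e patterns are ORed: a line matching either one is counted.
either=$(grep -c -e foobar -e foobuzz "$data" || true)

# -v negates the combined match, so this counts lines containing NEITHER
# UserID1 NOR buzz. Both sample lines contain UserID1, so nothing is left.
neither=$(grep -c -e UserID1 -ve buzz "$data" || true)

echo "either=$either neither=$neither"   # prints: either=2 neither=0
rm -f "$data"
```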

stresing · April 9, 2015, 4:57am

To be more precise:

I have a shell script which calls a grep command (and does some more postprocessing).
This grep uses a file that holds a bunch of regexes and one of the regexes should be able to do a negative lookbehind (if that's the correct term for the regex logic that I want to achieve). So a piped "grep -v" doesn't help me in this case.

And wouldn't it slow down the processing if I refactored it to awk (which would then handle the file with the regexes)?

disedorgue · April 9, 2015, 12:07pm

Hi, with grep:

$ cat gr.txt
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuz|          |INFO |REQUEST|UserID1:23 | ohnooo
$ grep  -P '\|(?!foobuzz)[^|]*\|[^|]*\|[^|]*\|[^|]*\|UserID1' gr.txt
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuz|          |INFO |REQUEST|UserID1:23 | ohnooo
$

Regards.

stresing · April 9, 2015, 12:14pm

Yeah, that's what I wanted to see!

Could you please explain why it didn't work with a simple ".*" instead of the more exact "[^|]*\|[^|]*\|[^|]*\|[^|]*\|"? I'm not that familiar with the Perl-style mode of grep.

Thanks!

Best regards

Stephan

Corona688 · April 9, 2015, 12:21pm

Greedy matching is doing you in. All "(?!foobuzz)..." needs to find is one position before UserID1 where "foobuzz" does not begin, and its condition is satisfied. The .* after it swallows everything. So it looks at the start of the line, goes 'yay, it's not foobuzz', searches for the user ID and accepts.

So, you need to force the regex to check for (?!foobuzz) in a relevant spot. Right after a | character perhaps. And you have to make it check all of them so it won't cheat by skipping ahead.

You can match entire fields with [^|]*\|, force them not to start with foobuzz via (?!foobuzz)[^|]*\|, match several in a row by wrapping that in ()*, and force the match to start at the beginning of the line with ^.

So ^((?!foobuzz)[^|]*\|)*UserID1 will only accept a line containing zero or more | fields, none of which begin with foobuzz, after which it must immediately find 'UserID1'.

$ printf "2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo\n" |
        grep -P '^((?!foobuzz)[^|]*\|)*UserID1'

$ printf "2015-04-08 19:04:56,157|yyyyyyyyyy|          |frobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo\n" | grep -P '^((?!foobuzz)[^|]*\|)*UserID1'
2015-04-08 19:04:56,157|yyyyyyyyyy|          |frobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo
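
For comparison, here is a rough sketch that runs three patterns over the three-line sample file: the original attempt, the field-anchored pattern above, and a commonly used alternative, ^(?!.*foobuzz).*UserID1. That last pattern is my addition, not something posted in this thread; it rejects foobuzz anywhere on the line rather than only at the start of a field. It assumes a grep built with -P support.

```shell
data=$(mktemp)
cat > "$data" <<'EOF'
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuz|          |INFO |REQUEST|UserID1:23 | ohnooo
EOF

# Original attempt: the lookahead is only checked where the match starts,
# and no line begins with "foobuzz", so every line is accepted.
loose=$(grep -cP '(?!foobuzz).*UserID1.*' "$data" || true)

# Field-anchored pattern from the post above.
fields=$(grep -P '^((?!foobuzz)[^|]*\|)*UserID1' "$data")

# Alternative (assumption, not from the thread): one lookahead at the start
# of the line that scans the whole line for foobuzz.
tight=$(grep -P '^(?!.*foobuzz).*UserID1' "$data")

echo "original attempt matched $loose lines"
printf '%s\n' "$tight"
rm -f "$data"
```

On this data the last two patterns select the same two lines (foobar and foobuz), while the original attempt matches all three.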

I don't think rewriting it in awk or sed would necessarily make it slower, but if you have a big pile of regexes already, could be a painful amount of work.

stresing · April 10, 2015, 10:56am

Thank you for your explanation! Now I've got it!

Yeah, you could bet on it...

---------- Post updated 04-10-15 at 04:56 PM ---------- Previous update was 04-09-15 at 07:23 PM ----------

D'oh, I cheered too early. The one-liner did the job, but now grep surprised me by saying "grep: the -P option only supports a single pattern". OK, I will find a different solution by means of postprocessing.
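
One possible workaround for the single-pattern limit, sketched under assumptions: join all lines of the pattern file into one alternation and pass that to grep -P as a single pattern. The file names patterns.txt and data.txt are hypothetical. The sketch assumes every line is a valid Perl-style regex, that the file has no blank lines (an empty alternative would match everything), and that no pattern relies on numbered backreferences, since group numbers shift once the patterns are joined.

```shell
work=$(mktemp -d)
cat > "$work/data.txt" <<'EOF'
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuz|          |INFO |REQUEST|UserID1:23 | ohnooo
garble
unrelated line
EOF

# Hypothetical pattern file, one Perl-style regex per line.
cat > "$work/patterns.txt" <<'EOF'
garble
\|(?!foobuzz)[^|]*\|[^|]*\|[^|]*\|[^|]*\|UserID1
EOF

# Wrap each pattern in a non-capturing group and join the lines with "|",
# so that grep -P receives exactly one pattern.
combined=$(sed 's/.*/(?:&)/' "$work/patterns.txt" | paste -sd '|' -)

matches=$(grep -P "$combined" "$work/data.txt")
printf '%s\n' "$matches"
rm -rf "$work"
```

With this input, the foobar line, the foobuz line and "garble" are printed; the foobuzz line and "unrelated line" are not.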

Corona688 · April 10, 2015, 11:14am

Well, if you're rewriting from scratch anyway, you could have a file full of awk expressions, like:

# Do NOT print line if this regex matches
/regex1/ { next }

# print line if it contains this literal string.  Faster than regex
index($0, "literalstring") { print ; next }

# print line if one regex matches and another does not
/regex1/ && !/regex2/ { print ; next }

# print line if any of these regexes match
/regex1/ || /regex2/ || /regex3/ || /regex4/ { print ; next }

etc..

You can use that file like

awk -f expressions.awk

with a filename optionally specified afterwards.
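
As a minimal end-to-end sketch of that idea, the following writes a small expressions.awk and runs it against the sample data. The two rules and the extra "garble" and "unrelated line" input lines are only illustrative, not taken from the existing regex file.

```shell
work=$(mktemp -d)
cat > "$work/data.txt" <<'EOF'
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuz|          |INFO |REQUEST|UserID1:23 | ohnooo
garble
unrelated line
EOF

cat > "$work/expressions.awk" <<'EOF'
# Print the line if it contains UserID1 but not foobuzz.
/UserID1/ && !/foobuzz/ { print; next }

# Print the line if it contains this literal string.
index($0, "garble") { print; next }
EOF

result=$(awk -f "$work/expressions.awk" "$work/data.txt")
printf '%s\n' "$result"
rm -rf "$work"
```

The `next` in each rule keeps a line from being printed twice when more than one rule would match it.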

There is also a variant of awk that's further optimized for speed, mawk.

drl · April 10, 2015, 1:15pm

Hi.

There is a program, peg, that may do what you want: it accepts multiple Perl regular expressions (perlre). Using the pattern from disedorgue and an augmented data file, a demo is:

#!/usr/bin/env bash

# @(#) s2	Demonstrate complex matching expressions: peg
# peg (various versions):
# http://www.cpan.org/authors/id/A/AD/ADAVIES/

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C peg

FILE=${1-data2}

pl " Input data file $FILE:"
cat "$FILE"

pl " Results:"
peg -e garble -e '\|(?!foobuzz)[^|]*\|[^|]*\|[^|]*\|[^|]*\|UserID1' "$FILE"

exit 0

producing:

$ ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
peg (local) 3.10

-----
 Input data file data2:
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuz|          |INFO |REQUEST|UserID1:23 | ohnooo
garble

-----
 Results:
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuz|          |INFO |REQUEST|UserID1:23 | ohnooo
garble

Here is what -e is defined as:

       -e overloaded
           -e PERLEXPR
               Specify a PERLEXPR to match.

               If used more than once, then it is equivalent to using -o.  For
               example, "peg -e foo -e bar baz", "peg -o foo bar -- baz", and
               "peg "/foo/ or /bar/" baz" are all equivalent.

Best wishes ... cheers, drl

disedorgue · April 10, 2015, 2:10pm

Maybe there is a way with grep's "-v" option. For example, my regex file:

$ cat reg.gr 
!UserID1
foobuzz

my file to parse:

$ cat gr.txt 
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuz|          |INFO |REQUEST|UserID1:23 | ohnooo

The command grep:

$ grep -vf reg.gr gr.txt 
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuz|          |INFO |REQUEST|UserID1:23 | ohnooo

$

Edit: No, that's wrong... it doesn't work.
Regards.
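
A short sketch of why the pattern file above only appears to work: grep has no "!" negation inside a pattern, so "!UserID1" is matched as a literal string, and -v drops only the lines that match one of the patterns. The fourth input line (UserID2) is a hypothetical addition to make the problem visible.

```shell
work=$(mktemp -d)
cat > "$work/gr.txt" <<'EOF'
2015-04-08 19:04:55,926|xxxxxxxxxx|          |foobar|          |INFO |REQUEST|UserID1:42 | yeah
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuzz|          |INFO |REQUEST|UserID1:23 | ohnooo
2015-04-08 19:04:56,157|yyyyyyyyyy|          |foobuz|          |INFO |REQUEST|UserID1:23 | ohnooo
2015-04-08 19:04:57,001|zzzzzzzzzz|          |foobar|          |INFO |REQUEST|UserID2:99 | other
EOF
printf '%s\n' '!UserID1' 'foobuzz' > "$work/reg.gr"

# No line contains the literal text "!UserID1", so only the foobuzz line is
# removed. The UserID2 line is kept even though it lacks UserID1.
kept=$(grep -vf "$work/reg.gr" "$work/gr.txt")
printf '%s\n' "$kept"
rm -rf "$work"
```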