Possible to grep string based on surrounding strings?

jl487 · May 15, 2012, 10:41am

I was wondering if it is possible to grep a pattern based on the surrounding text. For example, if I have an input file like this:

titleA
titleB
titlex
titleC
titleD
titlex
titleE

And I want to grep "title" and save the results only if the line is not followed by a "titlex". My output would look like this:

titleA
titleC
titleE

Is this possible? I feel like awk should come into play, but I'm not great at awk at the moment.

47shailesh · May 15, 2012, 10:53am

Wouldn't your expected output also include the following two?

titleB
titleD

try this

grep -v titlex infile

jl487 · May 15, 2012, 10:59am

I don't want to have titleB and titleD in my output. With "grep -v" I would get titleA, titleB, titleC, titleD, and titleE.

I found the following, which produces an output of just titleB and titleD. I just need to make it do the opposite!

awk '/titlex/ { print prv_line; next } { prv_line = $0 }' input.txt

Corona688 · May 15, 2012, 11:01am

grep is not a programming language; it can't understand 'if x then do y'. It can't even remember lines for later. awk is a programming language, though, and can do both.

Perhaps something like

$ cat regafter.awk

# Recall N lines ago up to 9 lines
function last(N)
{
        if(N>L) return("");
        return(LINE[(L-N)%10]);
}

{ LINE[(++L)%10]=$0 } # Remember line for later

# If this line and the last line don't match titlex, print last line.
(last(1) ~ /title[^xX]/) && /title[^xX]/        { print last(1) }
# Do the same test for the last line by itself.
END {   if(last(0) ~ /title[^xX]/) print last(0); }

$ awk -f regafter.awk data

titleA
titleC
titleE

$

If awk doesn't work, try nawk or gawk.
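
For this particular case only one line of history is needed, so the same idea can be sketched with a single saved variable instead of a ring buffer. This is a minimal sketch, not a drop-in replacement: the function name not_followed and the variable prev are made up for illustration, and it rejects only lowercase "titlex" (the script above also rejects "titleX").

```shell
# Hypothetical helper: print every "title" line that is not titlex
# and is not immediately followed by titlex.
not_followed() {
    awk '
        # prev holds the previous line; print it once the current line
        # is known not to be titlex.
        NR > 1 && prev ~ /title/ && prev !~ /titlex/ && $0 !~ /titlex/ { print prev }
        { prev = $0 }
        # The last line has no successor, so test it by itself.
        END { if (prev ~ /title/ && prev !~ /titlex/) print prev }
    ' "$@"
}

printf '%s\n' titleA titleB titlex titleC titleD titlex titleE | not_followed
```

On the sample input this should print titleA, titleC and titleE. Because a titlex line is never printed itself, a titlex on the first line or several in a row need no special handling.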

alister · May 15, 2012, 11:38am

NEVERMIND. THIS DOES NOT WORK CORRECTLY when titlex is the first line in the file or when there are consecutive instances of titlex. I leave it here only for your amusement.

Corona's awk solution is more efficient, since it only reads the data once, but here's a simple ed solution.

ed -s data <<EOED
g/titlex/-,.d
g/title/
Q
EOED

Or, if you prefer a less readable one-liner

printf %s\\n g/titlex/-,.d g/title/ Q | ed -s data

Regards,
Alister

jl487 · May 15, 2012, 11:39am

Thanks, Corona. If I want to substitute "titlex" with another string, how can I do that? In the code, I replaced "title[^xX]" with another string to test, but the script prints the new search string instead of rejecting it.

Corona688 · May 15, 2012, 11:41am

The regex I put in accepts titleG but not titleX or titlex. If you give it a regex that doesn't reject titleX, it of course won't reject titleX...

Please show what you did.

Better yet, show the actual input you have and actual output you want, since your sample data doesn't seem to be it.

jl487 · May 15, 2012, 11:46am

I created a new input, to test the code:

titleA
titleB
TEST
titleC
TEST

and used the following:

# Recall N lines ago up to 9 lines
function last(N)
{
        if(N>L) return("");
        return(LINE[(L-N)%10]);
}

{ LINE[(++L)%10]=$0 } # Remember line for later

# If this line and the last line don't match titlex, print last line.
(last(1) ~ /TEST/) && /TEST/        { print last(1) }

# Do the same test for the last line by itself.
END {   if(last(0) ~ /TEST/) print last(0); }

and the output I get is the string "TEST"

Corona688 · May 15, 2012, 11:51am

That regex accepts TEST; it doesn't reject it. It even rejects titleA. The script won't print things the regex doesn't accept. Given your input data, that's what I thought you wanted.

Rewriting.
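
To make the accept/reject distinction concrete, here is a small sketch contrasting the two awk match operators on the same pattern. The helper name classify is made up for the demonstration.

```shell
# Hypothetical helper: label each input line by whether it matches /TEST/.
classify() {
    awk '
        $0 ~ /TEST/  { print "matches /TEST/: " $0 }   # ~ selects lines that match
        $0 !~ /TEST/ { print "no match: " $0 }         # !~ selects lines that do not
    '
}

printf '%s\n' titleA TEST | classify
```

A condition written with ~ /TEST/ therefore keeps TEST lines, while rejecting them needs !~ instead.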

jl487 · May 15, 2012, 11:55am

OK, I think I'm starting to understand it. I just assumed the code could be universal, where I could simply swap the strings.

Corona688 · May 15, 2012, 11:57am

This should let you pass in an exact string as the line to reject.

$ cat regafter2.awk

# Recall N lines ago up to 9 lines
function last(N)
{
        if(N>L) return("");
        return(LINE[(L-N)%10]);
}

{ LINE[(++L)%10]=$0 } # Remember line for later

# If neither this line nor the last line equals REJECT, print the last line.
# NR > 1 skips the first line, which has no previous line to print.
NR > 1 && (last(1) != REJECT) && $0 != REJECT { print last(1) }
# Do the same test for the last line by itself.
END {   if(NR > 0 && last(0) != REJECT) print last(0); }

$ awk -v REJECT="TEST" -f regafter2.awk data2

titleA

$

jl487 · May 16, 2012, 9:46am

I understand that the script works for an exact "REJECT" string, but how can it be modified to accept wildcards? For example, if the input is:

titleA
titleB
TEST FILLER
titleC
TEST

and REJECT=TEST, I would expect the output to be just titleA. However, the current code outputs titleA, titleB, and TEST FILLER.

drl · May 16, 2012, 12:27pm

Hi.

I had trouble all the way through this thread understanding the requirements. I'll assume that "followed with" means "immediately followed by". Here is a solution for the last sample:

#!/usr/bin/env bash

# @(#) s2	Demonstrate search for pattern-accept followed by pattern-reject.
# See http://sourceforge.net/projects/cgrep/

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C cgrep

FILE=${1-data3}
ACCEPT="title"
REJECT="TEST"

pl " Input data file $FILE:"
cat $FILE

pl " Results, $ACCEPT to $REJECT:"
cgrep -a "^.*$ACCEPT.*\n.*$REJECT" $FILE
pl " Results, invert $ACCEPT to $REJECT:"
cgrep -v -a "^.*$ACCEPT.*\n.*$REJECT" $FILE

exit 0

producing:

% ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
cgrep ATT cgrep 8.15

-----
 Input data file data3:
titleA
titleB
TEST FILLER
titleC
TEST

-----
 Results, title to TEST:
titleB
TEST FILLER
titleC
TEST

-----
 Results, invert title to TEST:
titleA

The cgrep utility is non-standard, but very useful. See the URL for the source for anyone to compile and use.
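
For anyone without cgrep, the inverted result can be approximated in plain awk. This is only a sketch under my reading of the sample above (a "pair" is an ACCEPT line immediately followed by a REJECT line, and both halves are dropped); it is not claimed to reproduce cgrep's matching in general. The name invert_pairs and the variables are made up.

```shell
# Hypothetical helper: $1 is the ACCEPT regex, $2 the REJECT regex.
invert_pairs() {
    awk -v ACCEPT="$1" -v REJECT="$2" '
        NR > 1 {
            # A pair is an ACCEPT line immediately followed by a REJECT line.
            pair = (prev ~ ACCEPT && $0 ~ REJECT)
            # Print the previous line only if it is in no pair, neither as
            # the first half (pair) nor as the second half (was_second).
            if (!pair && !was_second) print prev
            was_second = pair
        }
        { prev = $0 }
        # The final line can only be the second half of a pair.
        END { if (NR > 0 && !was_second) print prev }
    '
}

printf '%s\n' titleA titleB 'TEST FILLER' titleC TEST | invert_pairs title TEST
```

On the data3 sample this should leave only titleA.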

Best wishes ... cheers, drl

( Edit 1: replaced wrong version of script. )
( Edit 2: correct minor typo )

Corona688 · May 16, 2012, 12:46pm

I am flailing around trying to solve a problem for you that you keep changing. If you would lay out exactly what you need plainly the first time, I could solve it once.

$ cat regafter3.awk
# Recall N lines ago up to 9 lines
function last(N)
{
        if(N>L) return("");
        return(LINE[(L-N)%10]);
}

{ LINE[(++L)%10]=$0 } # Remember line for later

# If neither this line nor the last line matches REJECT, print the last line.
# NR > 1 skips the first line, which has no previous line to print.
NR > 1 && (last(1) !~ REJECT) && $0 !~ REJECT { print last(1) }
# Do the same test for the last line by itself.
END {   if(NR > 0 && last(0) !~ REJECT) print last(0); }

$ awk -v REJECT="TEST.*" -f regafter3.awk data3

titleA

$
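
The difference between regafter2 and regafter3 comes down to the comparison operator: != compares the whole line against the string, while !~ treats REJECT as a regex that may match anywhere in the line. A small sketch of the three possibilities follows; the helper name compare and the variable names are made up.

```shell
# Hypothetical helper: show how one line fares under each kind of test.
compare() {
    awk -v REJECT="$1" '{
        exact    = ($0 == REJECT)        # whole line equals the string
        anywhere = ($0 ~ REJECT)         # regex found anywhere in the line
        atstart  = ($0 ~ ("^" REJECT))   # regex anchored to the start of the line
        print $0 ": exact=" exact " anywhere=" anywhere " atstart=" atstart
    }'
}

printf '%s\n' TEST 'TEST FILLER' 'my TEST' titleA | compare TEST
```

Since ~ is not anchored, the trailing .* in REJECT="TEST.*" does not change which lines match; plain TEST selects the same lines. To reject only lines that begin with TEST, "^TEST" would be the regex to pass.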

shamrock · May 16, 2012, 2:07pm

Just create a wrapper for the awk script posted by Corona688. That way you can set the REJECT value from a shell parameter.
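
A minimal sketch of such a wrapper, written as a shell function. The name regafter and its argument order are assumptions, and to keep the example self-contained it inlines a one-line-memory version of the awk logic rather than calling regafter3.awk.

```shell
# Hypothetical wrapper: first argument is the REJECT regex, the rest are
# input files (standard input is read when no file is given).
regafter() {
    reject=${1:?usage: regafter REJECT [file...]}
    shift
    awk -v REJECT="$reject" '
        # Print the previous line when neither it nor the current line
        # matches REJECT.
        NR > 1 && prev !~ REJECT && $0 !~ REJECT { print prev }
        { prev = $0 }
        # The last line has no successor, so test it by itself.
        END { if (NR > 0 && prev !~ REJECT) print prev }
    ' "$@"
}

printf '%s\n' titleA titleB 'TEST FILLER' titleC TEST | regafter TEST
```

On that sample this should print only titleA. One caveat: awk -v processes backslash escapes in the value, so a REJECT regex containing backslashes may need them doubled.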