Capturing multi-line records containing a known value?

Some records in a file look like this, with any number of lines between start and end flags:

/Start
Some stuff
Banana 1
Some more stuff
End/

/Start
Some stuff
End/

/Start
Some stuff
Some more stuff
Banana 2
End/

...how would I process this file to find records containing the keyword "Banana", i.e. to output:

/Start
Some stuff
Banana 1
Some more stuff
End/

/Start
Some stuff
Some more stuff
Banana 2
End/

I can do it using two awk scripts, but I'm sure it is doable with one! TIA...

In case you want to do it using a script:


#!/usr/bin/ksh

# Flag set to "true" once the current record contains the keyword:
v_banana="false"
# Buffer that accumulates the lines of the current record:
v_output=""

# Read the file line by line (-r stops read from mangling any backslashes):
while read -r LINE
do
    if [ "$LINE" = "/Start" ]
    then
        v_output="${LINE}\n"

    elif [ "$LINE" = "End/" ]
    then
        v_output="${v_output}${LINE}\n"

        if [ "$v_banana" = "true" ]
        then
            echo "$v_output"
        fi
        v_output=""
        v_banana="false"
    else
        v_output="${v_output}${LINE}\n"

        # Remember that this record contains the keyword
        if [[ $LINE = *Banana* ]]
        then
            v_banana="true"
        fi
    fi
done < infile

Another one:

awk -v var="Banana" '
/Start/ { s="" }
$0 ~ var { print s; f=1 }
{ s=s?s "\n" $0:$0 }
f { print }
/End/ { f=0 }
' file

Thanks to you both, but Franklin52 - that is exactly what I was after, save for needing to use nawk instead of awk (presumably due to the size of my input file). Would you be so kind as to walk me through the script and explain how some of the less obvious lines work - specifically:

{ s=s?s "\n" $0:$0 }

This is called a conditional expression and is similar to:

if(s) {
  s=s "\n" $0
}
else {
  s=$0
}

Have a read of this:

Conditional Exp - The GNU Awk User's Guide
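
If it helps to see that in isolation, here's a tiny demo of the accumulation (nawk here, but any awk will do; the input lines are just placeholders):

printf 'one\ntwo\nthree\n' | nawk '{ s = s ? s "\n" $0 : $0 } END { print s }'

The ternary only matters for the very first line: while s is still empty it is assigned $0 as-is, and every later line is appended with a "\n" in front of it. Drop the ternary (s = s "\n" $0) and the joined string would start with a spurious blank line.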

Hello, all:

It's not as flexible as Franklin52's parameterized AWK script, but here's a solution for the sed freaks amongst us:

sed -n '/^\/Start/,/^End\//{H;/^End\//{x;/Banana/p;};/^\/Start/h;}'
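
For anyone unpicking the hold-space juggling: H appends each line in the /Start...End/ range to the hold space, h resets the hold space to just the /Start line at the top of each record, and on End/ the accumulated record is swapped into the pattern space with x and printed only if it contains Banana. Spread over several lines (same logic; "infile" is just a stand-in for your input file), it reads as:

sed -n '/^\/Start/,/^End\//{
        H
        /^End\//{
                x
                /Banana/p
        }
        /^\/Start/h
}' infile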

And another AWK option:

awk -v key=Banana '/^\/Start/,/^End\//{s=s $0 RS} /^End\//{printf "%s",index(s,key)?s:""; s=""}'

Cheers,
Alister

Thanks to you all for this; I have subsequently learnt a lot and have expanded my final script (posted below, should it help anyone in future). It allows multiple search patterns to be passed from the command line and works on records separated by one or more empty lines.

if on Solaris:

nawk -v RS='' '/Banana/ {print $0 ORS}' myFile

Thanks for the inspiration, vgersh99! Revised code:

# FindR - Find(e)r - Find( )R(ecords)
# Finds certain records within decoded files given one or more search patterns...
# v3.0 - 20100317 - cs03dmj

# Usage Information:
usage() {
        echo "Usage: $0 decoded_filname search_pattern(s)"
        echo && echo "e.g. to find records containing a value:"
        echo "  $0 myfile monkey"
        echo && echo "e.g. to find records containing several values:"
        echo "  $0 myfile monkey banana"
        echo && echo "e.g. to find records containing specific data, use one or more regular expressions in quotes:"
        echo "  $0 mydecodedfile \"m.*y\" \"b.*a\""
        echo && echo "N.B. If output is redirected to a file, user messages will not be included..." && echo
        # Exit with an errored return code:
        exit 1
}

# If the number of parameters ($#) is less than two (i.e. filename and one search pattern) then display usage information:
[ $# -lt 2 ] && usage

# If the first parameter ($1) is not a valid file, report the problem then display usage information:
[ ! -f "$1" ] && echo "File $1 does not exist." && usage

# Use nawk, as the old awk can't handle large files (or the ARGV/ARGC and delete features used below):
nawk '
BEGIN {
        # Records are split by one or more blank lines, so set the Record Separator (RS) appropriately:
        RS = ""
        # ARGV is the built-in array of passed parameters.
        # ARGC is the number of elements in array ARGV
        # ARGV[0] = nawk
        # ARGV[1] = decoded_filename
        # Parse remaining parameters (ARGV[2 ...]) into "searches" array:
        for (i = 2; i < ARGC; i++) {
                searches[i] = ARGV[i]
                # Delete the parameter so it is not treated as another input file:
                delete ARGV[i]
        }
        # Create a variable to count the total number of matches (if any):
        totalmatches = 0
}

# For every record:
{
        # For every search pattern in "searches" array:
        for (search in searches) {
                # If the record does not match this search pattern, skip to the next record:
                if ( $0 !~ searches[search] ) { next }
        }
        # If we have made it this far, all searches have been matched, so print the record followed by a blank line:
        print $0 "\n"
        # Increment the total matches variable:
        totalmatches++
}

# At the end of the file:
END {
        # Notify the user (via stderr) how many matches have been found:
        print "...found " totalmatches " matching record(s) from " NR " total record(s)." > "/dev/stderr/"
}
' "$@" # <-- Here's where we pass all the parameters to nawk for processing, using $@ in quotes to ensure that whitespaces are kept and wildcards are not expanded by the shell.