Bash script to extract paragraph with globs in it

Hi,

It's been a long time since I last used Bash to write a script, so I am really struggling here. I need the gurus to help me out.

uname -a
Linux lxserv01 2.6.18-417.el5

I have a text file with blocks of text, all written in a similar manner:

******* BEGIN MESSAGE *******

       Station / User:  129   800013   Batch Processing
 SDate / Time / PDate:  26.02.2017 17:07:05   26.02.2017
       Current System:  XXXXXX Production System       
   Institution Number:  00000043
Application / Version:  abc-inw   30.66.36   Release A (OMNI)
        Function Name:  FindOriginalPresentment

Warning !

Original presentment Not Found !

Institution No (Original Tran): [00000043]
Charge back slip: [70527509216]
Acquirer Reference: [85470355344549150697093]
Presentment Slip: [N/A]
Transaction Class: [002 - Clearing transactions]
Transaction Category: [001 - Presentments]
File Institution No: [00000043]
File No: [00041926]

******* END MESSAGE *******

******* BEGIN MESSAGE *******

       Station / User:  129   800013   Batch Processing
 SDate / Time / PDate:  26.02.2017 17:06:59   26.02.2017
       Current System:  XXXXXX Production System       
   Institution Number:  00000043
Application / Version:  abc-inw   30.66.36   Release A (OMNI)

Information message !

Exception Processing - Sundry Types!

Date: [20170226]
Time: [17:06:59]

003','040

******* END MESSAGE *******

******* BEGIN MESSAGE *******

       Station / User:  129   800013   Batch Processing
 SDate / Time / PDate:  26.02.2017 17:07:05   26.02.2017
       Current System:  XXXXXX Production System       
   Institution Number:  00000043
Application / Version:  abc-inw   30.66.36   Release A (OMNI)
        Function Name:  FindOriginalPresentment

Warning !

Original presentment Not Found !

Institution No (Original Tran): [00000043]
Charge back slip: [70527509216]
Acquirer Reference: [85470355344549150697093]
Presentment Slip: [N/A]
Transaction Class: [002 - Clearing transactions]
Transaction Category: [001 - Presentments]
File Institution No: [00000043]
File No: [00041926]

******* END MESSAGE *******

******* BEGIN MESSAGE *******

       Station / User:  129   800013   Batch Processing
 SDate / Time / PDate:  26.02.2017 17:06:59   26.02.2017
       Current System:  XXXXXX Production System        
   Institution Number:  00000043
Application / Version:  abc-inw   30.66.36   Release A (OMNI)

Information message !

Exception Processing - Sundry Types!

Date: [20170226]
Time: [17:06:59]

003','040

******* END MESSAGE *******

Each 'BEGIN MESSAGE' and the subsequent 'END MESSAGE' delimit a block. If a block contains the text 'Original presentment Not Found !', the script should print the entire block, from BEGIN to END. I started with a simple command to search for the BEGIN and END lines, but the bash script gives me errors on the glob characters in the BEGIN/END pattern of a block.

Help please, I am lost here.

Thanks a lot.

Please post your attempt so people in here can analyse it and possibly propose corrections and / or enhancements.

EDIT: If you don't insist on a bash solution, try

awk '
$0 ~ ST                         {TMP = GLPR = ""}
$0 ~ GL                         {GLPR = 1}
$0 ~ ST, $0 ~ EN                {TMP = TMP $0 ORS}
($0 ~ EN) && GLPR               {print TMP}
' ST="BEGIN MESSAGE" EN="END MESSAGE" GL="Original presentment Not Found !"  file

Hi,

I think I have a script that will do the trick for you:

#!/bin/bash
  
input=example.txt
tmp=`/bin/tempfile`

while read -r line
do
        case "$line" in
                "******* BEGIN MESSAGE *******")
                        echo "$line" > "$tmp"
                        ;;
                "******* END MESSAGE *******")
                        echo "$line" >> "$tmp"

                        if /bin/grep ^Original\ presentment\ Not\ Found\ \!$ "$tmp" >/dev/null 2>/dev/null
                        then
                                /bin/cat "$tmp"
                                echo
                        fi
                        ;;
                *)
                        echo "$line" >> "$tmp"
                        ;;
        esac

done < "$input"

Here is the output from a test run. In this case, example.txt was populated with the exact example text you provided in your post.

$ ./script.sh 
******* BEGIN MESSAGE *******

Station / User:  129   800013   Batch Processing
SDate / Time / PDate:  26.02.2017 17:07:05   26.02.2017
Current System:  XXXXXX Production System
Institution Number:  00000043
Application / Version:  abc-inw   30.66.36   Release A (OMNI)
Function Name:  FindOriginalPresentment

Warning !

Original presentment Not Found !

Institution No (Original Tran): [00000043]
Charge back slip: [70527509216]
Acquirer Reference: [85470355344549150697093]
Presentment Slip: [N/A]
Transaction Class: [002 - Clearing transactions]
Transaction Category: [001 - Presentments]
File Institution No: [00000043]
File No: [00041926]

******* END MESSAGE *******

******* BEGIN MESSAGE *******

Station / User:  129   800013   Batch Processing
SDate / Time / PDate:  26.02.2017 17:07:05   26.02.2017
Current System:  XXXXXX Production System
Institution Number:  00000043
Application / Version:  abc-inw   30.66.36   Release A (OMNI)
Function Name:  FindOriginalPresentment

Warning !

Original presentment Not Found !

Institution No (Original Tran): [00000043]
Charge back slip: [70527509216]
Acquirer Reference: [85470355344549150697093]
Presentment Slip: [N/A]
Transaction Class: [002 - Clearing transactions]
Transaction Category: [001 - Presentments]
File Institution No: [00000043]
File No: [00041926]

******* END MESSAGE *******

$ 

This seems to print only two blocks, and they both contain the search string. The blocks that do not contain it are not written to standard output, which if I understand what you've written correctly is exactly what you're after.
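One detail in the output above: the leading spaces of the header lines are gone. That is the normal behaviour of read with the default IFS, which trims leading and trailing whitespace from each line. A minimal sketch of the difference, using a made-up sample line; setting IFS to empty for the read command keeps the line intact:

```shell
#!/bin/bash
# Sample line with leading spaces, as in the log file.
sample='       Station / User:  129'

# Default IFS: read trims leading and trailing whitespace.
read -r trimmed <<< "$sample"

# Empty IFS for this one command: the line is kept as-is.
IFS= read -r kept <<< "$sample"

printf '[%s]\n' "$trimmed"
printf '[%s]\n' "$kept"
```

In the script this would mean writing "while IFS= read -r line". Note that trailing spaces are then kept too, so an exact-string match on the marker lines only works if the markers have none.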

Hope this helps.

@RudiC, your code runs perfectly. The reason I am insisting on a bash script is that I have worked with bash for some time and know the basics, so later on I can automate this with a script that others can use as well. My own code is just a simple print statement, so it's not even worth reading. That's the reason I did not post it at the start.

@drysdalk, your script is giving me an error: no such file or directory. I had created example.txt beforehand. Just to let you know, I have rights only under my own directory, so I have altered the script accordingly. Other than that, you have matched BEGIN and END using fixed-length strings, which will work fine in this case, but what should I do if the BEGIN and END strings are shorter or longer? That was the reason I was looking into using a regex.

#!/bin/bash

input=/home/dsiddiqui/basic_bash/example.txt
tmp=`/home/dsiddiqui/basic_bash/tempfile`

while read -r line
do
        case "$line" in
                "******* BEGIN MESSAGE *******")
                        echo "$line" > "$tmp"
                        ;;
                "******* END MESSAGE *******")
                        echo "$line" >> "$tmp"

                        if /bin/grep ^Original\ presentment\ Not\ Found\ \!$ "$tmp" >/dev/null 2>/dev/null
                        then
                                /bin/cat "$tmp"
                                echo
                        fi
                        ;;
                *)
                        echo "$line" >> "$tmp"
                        ;;
        esac

done < $input

Hi,

I suspect the error is coming from this change you made:

tmp=`/home/dsiddiqui/basic_bash/tempfile`

Keep this line the way it was originally:

tmp=`/bin/tempfile`

The quotes I'm using here are backticks, and have the effect of running an external command, namely /bin/tempfile . This line does not simply set the filename directly by assigning text into a variable.

The purpose of the tempfile program is to generate a temporary file under /tmp that is guaranteed not to have existed already, so you don't have to worry about clobbering someone else's output. When run, it creates the file and returns the filename as output, so this basically results in the variable 'tmp' being set to your newly-created temporary file.

So if you try again with the original line and let us know how it goes, we can take things from there.

As for the variable-length markers: I would have expected the beginning and end lines of every block to be identical, as a clear way of demarcating one data section from another. If not, could you give us some idea of how the block markers are expected to vary? I'll then see what I can suggest.

Hi drysdalk,

I am getting the error below after putting the original line back:

[dsiddiqui@lxserv01 scripts]$ ./para.sh
./para.sh: line 4: /bin/tempfile: No such file or directory
./para.sh: line 10: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 13: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 10: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 13: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 10: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 13: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 10: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 22: : No such file or directory
./para.sh: line 13: : No such file or directory
[dsiddiqui@lxserv01 scripts]$

your script

[dsiddiqui@lxserv01 scripts]$ more para.sh
#!/bin/bash

input=/home/dsiddiqui/basic_bash/example.txt
tmp=`/bin/tempfile`

while read -r line
do
        case "$line" in
                "******* BEGIN MESSAGE *******")
                        echo "$line" > "$tmp"
                        ;;
                "******* END MESSAGE *******")
                        echo "$line" >> "$tmp"

                        if /bin/grep ^Original\ presentment\ Not\ Found\ \!$ "$tmp" >/dev/null 2>/dev/null
                        then
                                /bin/cat "$tmp"
                                echo
                        fi
                        ;;
                *)
                        echo "$line" >> "$tmp"
                        ;;
        esac

done < $input
[dsiddiqui@lxserv01 scripts]$

To your question regarding block markers: I did find out that they will remain fixed. What I was thinking of was a case where, instead of 7 stars at the start of the line, there could be 8, or for that matter any character any number of times; the only consistent thing would be 'BEGIN MESSAGE' and 'END MESSAGE'. We could also add a check: if a BEGIN MESSAGE is followed by an END MESSAGE, then it is a block.

Hope I am making sense

Thanks a lot

Hi,

OK, thanks for the detail. This would seem to imply that your Linux system doesn't have /bin/tempfile installed on it. It may be worth checking to see if it's in /usr/bin/tempfile instead, but failing that change the line to something like this:

tmp=/tmp/script.tmp

or any other filename that you are sure it is safe for you to use.
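If mktemp happens to be installed (an assumption; it has not been checked on this system), it does the same job as tempfile: it creates a uniquely named empty file and prints its name. A sketch, with a trap added so the file is removed when the script exits; the "para" prefix is just an example name:

```shell
#!/bin/bash
# Assumption: mktemp is available. Bail out early if it fails,
# so later redirections never see an empty "$tmp".
tmp=$(mktemp "${TMPDIR:-/tmp}/para.XXXXXX") || {
        echo "cannot create temp file" >&2
        exit 1
}

# Remove the temp file whenever the script exits.
trap 'rm -f "$tmp"' EXIT

echo "scratch data" > "$tmp"
cat "$tmp"
```

Checking the exit status right away avoids the long run of ": No such file or directory" messages, since the script stops before the loop starts.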

An alternative version of the script that purely checks for the presence of "BEGIN MESSAGE" or "END MESSAGE" anywhere on a line follows.

Note that this isn't strictly speaking 100% safe or reliable, since if for any reason any other line happened to contain either of these strings as part of its text this would trigger the checks in question, and cause that particular block (at a minimum) to get mangled.

It's always a good idea to absolutely strictly define strings like this if you can, since they are fundamental to how parsing a file can safely be done. But if the number of asterisks is variable then you may have to go with the less-strict version below.

#!/bin/bash

input=example.txt
tmp=`/bin/tempfile`

while read -r line
do
        if echo "$line" | /bin/grep "BEGIN MESSAGE" >/dev/null 2>/dev/null
        then
                echo "$line" > "$tmp"
        elif echo "$line" | /bin/grep "END MESSAGE" >/dev/null 2>/dev/null
        then
                echo "$line" >> "$tmp"

                if /bin/grep ^Original\ presentment\ Not\ Found\ \!$ "$tmp" >/dev/null 2>/dev/null
                then
                        /bin/cat "$tmp"
                        echo
                fi
        else
                echo "$line" >> "$tmp"
        fi
done < "$input"

---------- Post updated at 02:30 PM ---------- Previous update was at 02:20 PM ----------

Hi,

One last version, this time avoiding the use of grep (which may make things run a little faster if you have a great deal of data to get through):

#!/bin/bash

input=example.txt
tmp=`/bin/tempfile`

while read -r line
do
        case "$line" in
                *BEGIN\ MESSAGE*)
                        echo "$line" > "$tmp"
                        ;;
                *END\ MESSAGE*)
                        echo "$line" >> "$tmp"

                        if /bin/grep ^Original\ presentment\ Not\ Found\ \!$ "$tmp" >/dev/null 2>/dev/null
                        then
                                /bin/cat "$tmp"
                                echo
                        fi
                        ;;
                *)
                        echo "$line" >> "$tmp"
                        ;;
        esac
done < "$input"

Again, the same caveats apply as outlined previously: if a line for any reason contains "BEGIN MESSAGE" or "END MESSAGE" as part of its own text and isn't itself a block marker, this would cause problems. So this is only safe if you're 100% sure that will never happen in your input.
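A middle ground between the exact-string match and the loose substring match is a regular expression that accepts any number of asterisks but nothing else on the marker line. A minimal sketch using bash's [[ ... =~ ... ]] operator (needs bash 3.0 or later; the function name is_begin and the sample lines are made up for illustration):

```shell
#!/bin/bash
# Keep the regex in a variable so shell quoting does not interfere.
# One or more '*', a space, the marker text, a space, one or more '*'.
begin_re='^\*+ BEGIN MESSAGE \*+$'

is_begin() {
        [[ $1 =~ $begin_re ]]
}

for line in '******* BEGIN MESSAGE *******' \
            '** BEGIN MESSAGE ********' \
            'Text that mentions BEGIN MESSAGE in passing'
do
        if is_begin "$line"
        then
                echo "marker: $line"
        else
                echo "other:  $line"
        fi
done
```

The same idea applies to END MESSAGE with a second pattern. Lines that merely mention the marker text are no longer mistaken for block boundaries.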

Hi drysdalk,

I modified your script and ran it again. I suspect it has something to do with my permissions, or SELinux, or something similar. I created the temp file manually under /tmp, but when I tried to create it through the script, it gave me an error. As a result, I removed all the variable substitutions and put the file names in directly, and this time it worked. But it is really confusing that I can't use the variables.

#!/bin/bash

input=/home/dsiddiqui/basic_bash/scripts/example.txt
#tmp=`/tmp/script.tmp`

temp=/home/dsiddiqui/basic_bash/scripts/tempfile
while read -r line
do
        case "$line" in
                "******* BEGIN MESSAGE *******")
                        echo "$line" > /home/dsiddiqui/basic_bash/scripts/tempfile
                        ;;
                "******* END MESSAGE *******")
                        echo "$line" >> /home/dsiddiqui/basic_bash/scripts/tempfile

                        if /bin/grep ^Original\ presentment\ Not\ Found\ \!$ /home/dsiddiqui/basic_bash/scripts/tempfile >/dev/null 2>/dev/null
                        then
                                /bin/cat /home/dsiddiqui/basic_bash/scripts/tempfile
                                echo
                        fi
                        ;;
                *)
                        echo "$line" >> /home/dsiddiqui/basic_bash/scripts/tempfile
                        ;;
        esac

done < $input

Regarding your suggestions, one thing for sure is BEGIN MESSAGE and the END MESSAGE will always occur as a block. No other lines will have the same words again as these are log messages for an application and it is not programmed like that.

Hi,

OK, great - glad to know you've got a working script now.

On the topic of your own script and the variables not working, I think it's because you've not actually defined the variable you're using.

You seem to have commented out tmp and defined temp instead, but unless you also changed every occurrence of $tmp in the script to $temp (or even better just don't bother defining temp at all and modify the tmp definition instead), then that wouldn't actually work, since you'd be trying to use an un-defined variable.
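The message ": No such file or directory" seen earlier appears whenever the redirection target expands to an empty string, which is exactly what an undefined $tmp does. A small sketch of the effect, and of "set -u", which makes bash stop at the first use of an undefined variable instead of carrying on (the variable names mirror the script above):

```shell
#!/bin/bash
temp=/tmp/script.tmp      # defined, but under a different name
unset tmp                 # make sure tmp really is undefined

# Redirecting to an empty filename fails; capture the message.
msg=$( { echo hello > "$tmp"; } 2>&1 )
echo "without set -u: $msg"

# With set -u the same mistake is reported by variable name.
# Run in a subshell so the failure does not end this script.
msg_u=$( ( set -u; echo hello > "$tmp" ) 2>&1 )
echo "with set -u:    $msg_u"
```

Note that "set -u" only catches variables that were never set. It does not catch a variable that is set but empty, as happened when /bin/tempfile was missing.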

---------- Post updated at 02:49 PM ---------- Previous update was at 02:45 PM ----------

Also just noticed: you wouldn't want to write:

tmp=`/tmp/script.tmp`

That's using backticks, so again would try to run an external binary or script called /tmp/script.tmp which almost certainly does not exist.

You don't want backticks here; just a simple:

tmp=/tmp/script.tmp

will suffice.
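The difference between the two forms can be seen in a short sketch. Here $(...) is the modern spelling of backticks and behaves the same way, and echo stands in for a real command such as tempfile:

```shell
#!/bin/bash
# Plain assignment: the text itself becomes the value.
plain=/tmp/script.tmp

# Command substitution: the command runs and its output becomes the value.
from_cmd=$(echo /tmp/script.tmp)

# Substituting a command that does not exist yields an empty value
# (plus an error message on stderr, silenced here).
missing=$( { /no/such/command; } 2>/dev/null )

echo "plain:    [$plain]"
echo "from_cmd: [$from_cmd]"
echo "missing:  [$missing]"
```

The third case is what happened with the missing tempfile binary: the variable ended up empty and every later redirection to "$tmp" failed.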


Hi drysdalk,

This is so stupid of me. I have totally lost it, it seems :) I made the changes that you suggested in your earlier post, and everything is working fine.

input=/home/dsiddiqui/basic_bash/scripts/example.txt
tmp=/tmp/tmpfile.scpt

while read -r line
do
        case "$line" in
                "******* BEGIN MESSAGE *******")
                        echo "$line" > "$tmp"
                        ;;
                "******* END MESSAGE *******")
                        echo "$line" >> "$tmp"

                        if /bin/grep ^Original\ presentment\ Not\ Found\ \!$ "$tmp" >/dev/null 2>/dev/null
                        then
                                /bin/cat "$tmp"
                                echo
                        fi
                        ;;
                *)
                        echo "$line" >> "$tmp"
                        ;;
        esac

done < "$input"

Thanks a lot for your patience.

Hi.

And yet grep is still there ... cheers, drl

Ah, yes. I meant without the if...grep tests that had replaced the case statement (which for some reason I didn't think to keep the first time I rewrote it). Still, one grep is better than three :)
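For completeness, the remaining grep and the temporary file can both be dropped by collecting each block in a shell variable and testing it with a case pattern. A sketch under the same assumptions as before (markers matched as substrings; the function name extract_blocks is made up):

```shell
#!/bin/bash
# Print every BEGIN..END block that contains the search string.
# Usage: extract_blocks FILE
extract_blocks() {
        local line block='' want='Original presentment Not Found !'
        while IFS= read -r line
        do
                case "$line" in
                        *"BEGIN MESSAGE"*)
                                block=$line$'\n'        # start a new block
                                ;;
                        *"END MESSAGE"*)
                                block=$block$line$'\n'
                                # Print the block only if it holds the string.
                                case "$block" in
                                        *"$want"*) printf '%s\n' "$block" ;;
                                esac
                                block=''
                                ;;
                        *)
                                block=$block$line$'\n'
                                ;;
                esac
        done < "$1"
}

# Run only when a file name is given on the command line.
if [ $# -gt 0 ]; then extract_blocks "$1"; fi
```

Because the lines are read with an empty IFS, the indentation of the header lines is preserved in the output. Each printed block is followed by one blank line, as in the earlier scripts.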

Hi.

For comparison, here is a single command of the grep family which extracts bounded blocks containing the required string. It does not use temporary files. Ignoring the scaffolding and supporting code, this is the single command that obtains the results:

cgrep -D -w 'BEGIN MESSAGE' +w 'END MESSAGE' 'Original presentment Not Found' $FILE

Here is the demonstration script and the resulting solution:

#!/usr/bin/env bash

# @(#) s1       Demonstrate bounded block extraction, cgrep.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C specimen cgrep dixf

FILE=${1-data1}

pl " Input data file $FILE:"
specimen $FILE

pl " Results:"
cgrep -D -w 'BEGIN MESSAGE' +w 'END MESSAGE' 'Original presentment Not Found' $FILE

pl " Details for utility cgrep:"
dixf cgrep

exit 0

which produces:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.7 (jessie) 
bash GNU bash 4.3.30
specimen (local) 1.17
cgrep ATT cgrep 8.15
dixf (local) 1.42

-----
 Input data file data1:
Edges: 5:0:5 of 85 lines in file "data1"
******* BEGIN MESSAGE *******

       Station / User:  129   800013   Batch Processing
 SDate / Time / PDate:  26.02.2017 17:07:05   26.02.2017
       Current System:  XXXXXX Production System       
   ---
Time: [17:06:59]

003','040

******* END MESSAGE *******

-----
 Results:
******* BEGIN MESSAGE *******

       Station / User:  129   800013   Batch Processing
 SDate / Time / PDate:  26.02.2017 17:07:05   26.02.2017
       Current System:  XXXXXX Production System       
   Institution Number:  00000043
Application / Version:  abc-inw   30.66.36   Release A (OMNI)
        Function Name:  FindOriginalPresentment

Warning !

Original presentment Not Found !

Institution No (Original Tran): [00000043]
Charge back slip: [70527509216]
Acquirer Reference: [85470355344549150697093]
Presentment Slip: [N/A]
Transaction Class: [002 - Clearing transactions]
Transaction Category: [001 - Presentments]
File Institution No: [00000043]
File No: [00041926]

******* END MESSAGE *******
******* BEGIN MESSAGE *******

       Station / User:  129   800013   Batch Processing
 SDate / Time / PDate:  26.02.2017 17:07:05   26.02.2017
       Current System:  XXXXXX Production System       
   Institution Number:  00000043
Application / Version:  abc-inw   30.66.36   Release A (OMNI)
        Function Name:  FindOriginalPresentment

Warning !

Original presentment Not Found !

Institution No (Original Tran): [00000043]
Charge back slip: [70527509216]
Acquirer Reference: [85470355344549150697093]
Presentment Slip: [N/A]
Transaction Class: [002 - Clearing transactions]
Transaction Category: [001 - Presentments]
File Institution No: [00000043]
File No: [00041926]

******* END MESSAGE *******

-----
 Details for utility cgrep:
cgrep   shows context of matching patterns found in files (man)
Path    : ~/executable/cgrep
Version : 8.15
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Home    : http://sourceforge.net/projects/cgrep/

I have installed cgrep on numerous systems, and, while a C compiler is needed, the compilation is a single step.

I have also benchmarked cgrep and it is as fast (in broad terms) as the fastest grep instances available.

Of course, if you do not wish to obtain and compile the code, or you do not do this kind of task frequently, then you are better off using the other suggestions.

Best wishes ... cheers, drl