Extract paragraphs and count them

dsid · March 13, 2017, 8:21am

Hi,

I have a text file made up of a number of paragraphs (blocks). I need to locate certain errors/warnings in it, then extract and count the blocks that contain them. The problem is that I do not know how many blocks there are with a particular type of error/warning. My idea was to count the number of blocks in the complete text file, then extract all blocks with a particular type of warning/error, removing them from the original file as I go, so that the original would eventually be left with a count of 0. But whatever I have found on Google related to Perl or bash is just confusing me, since I know little of actual scripting.

I am on
Linux 2.6.18-417.el5 #1 SMP Sat Nov 19 14:54:59 EST 2016 x86_64 x86_64 x86_64 GNU/Linux

I have also included the complete text file for the gurus

Thanks a lot again.

drysdalk · March 13, 2017, 8:36am

Hi,

If these errors or warnings can only ever occur once per section of your input file, then all you'd need to do is search for all instances of those errors or warnings and count how many you've found. That would then tell you how many sections contained these errors. If they can occur multiple times per section, of course, that would complicate things.

If you can provide information on what these error/warning messages in your input are expected to look like, and if they will only ever appear once per paragraph, that would be a good way forward for starters. As of just now you haven't actually said exactly what it is in the file that constitutes the warning/error that you're interested in.
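For illustration only, here is a minimal sketch of that counting approach. The sample layout (blocks opened by a "BEGIN MESSAGE" line, at most one line starting with "Warning" or "Error" per block) is an assumption made for the example, not something confirmed at this point in the thread.

```shell
# Build a tiny sample file (hypothetical layout, not the real log).
sample=$(mktemp)
cat > "$sample" <<'EOF'
BEGIN MESSAGE
Error!
Original Transaction Not Found !
END MESSAGE
BEGIN MESSAGE
all fine here
END MESSAGE
BEGIN MESSAGE
Warning !
Something looks odd
END MESSAGE
EOF

# grep -c prints the number of matching lines; -E enables alternation.
flagged=$(grep -c -E '^(Warning|Error)' "$sample")
total=$(grep -c 'BEGIN MESSAGE' "$sample")
echo "$flagged of $total blocks contain a warning or error"
rm -f "$sample"
```

This only holds while each block has at most one such line; otherwise the first count would overstate the number of affected blocks.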

dsid · March 13, 2017, 10:32am

Hi @drysdalk,

Sorry about that. I totally forgot to write about the specific errors/warnings.

Each block starts with a BEGIN MESSAGE line and ends with an END MESSAGE line.
Each block would have just one error or warning.
Each error/warning message is preceded by a Warning(space)! or Error!

Eg, 'Original Transaction Not Found !' -> This particular error message would occur only once in a block.

The challenge I am facing is how do I find out what that error/warning message text is.

drysdalk · March 13, 2017, 10:42am

Hi,

OK, thanks. Are you just wanting to count up how many errors/warnings there are for informational purposes, or do you need to find all blocks that contain these and print them out in their entirety ?

dsid · March 13, 2017, 10:52am

Hi,

To the point -> I am actually looking for the entire block which gives me that error/warning.

Here is what people in my office normally do using Windows -> Open up notepad++, search for a warning/error text -> on the first occurrence that is found, we cut that block out and move it to a new file, thereby reducing the overall count in the original file. For every further occurrence of that error/warning in the original file, we cut that block out and paste it into the new file we opened. This way each new file contains only one kind of error/warning. This gives us the count of the error messages as well as a sorted output, since each new file only contains those particular errors/warnings.

And then when that is done, we do another manual step of finding the (tab)Institution Number: and the Acquirer Reference: but that is a totally different requirement.

So you see what I mean. It's just a long and frustrating way of finding information, and it is repetitive and error-prone as well.

Sorry to be throwing all of the information at once. But I am just tired of this manual way. And thanks for all your help

drysdalk · March 13, 2017, 11:04am

Hi,

This is an adaptation of the script I provided to your question from the other week, which I think will do what you need.

#!/bin/bash

input=EXTRN071_copy.txt
tmp=/tmp/script.tmp

while read -r line
do
        case "$line" in
                *BEGIN\ MESSAGE*)
                        unset print
                        echo "$line" > "$tmp"
                        ;;
                *END\ MESSAGE*)
                        echo "$line" >> "$tmp"

                        if [ "$print" == "1" ]
                        then
                                /bin/cat "$tmp"
                                echo
                        fi
                        ;;
                Warning*|Error*)
                        print=1
                        echo "$line" >> "$tmp"
                        ;;
                *)
                        echo "$line" >> "$tmp"
                        ;;
        esac
done < "$input"

So the basic idea is:

Read in the file, and consider it one line at a time

If the line contains "BEGIN MESSAGE" clear the 'print' variable, and over-write the temp file with the current line

If the line contains "END MESSAGE" add the line to the temp file, and if the 'print' variable is set, print out the whole temp file to stdout

If the line starts with "Error" or "Warning" set the variable 'print' to 1, and add the line to the temp file

If the line starts with anything else, add the line to the temp file

Hope this does the trick. If not, let me know and I'll have another crack at it.

EDIT: If you need to preserve the spaces, tabs and other formatting at the start of the lines before the text begins, add a line like this at the top of the script after the shebang line:

IFS='' (that's two single-quotes, and not a double-quote)
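A small sketch of what that setting changes, using a made-up indented line. With the default IFS, read strips leading and trailing whitespace; with an empty IFS it keeps the line as is. Here the empty IFS is applied to a single read only, which is an alternative to setting it for the whole script.

```shell
# A line with leading spaces, as in an indented log (made-up sample).
sample='    Institution Number: 00000029'

# Default IFS: read strips leading and trailing whitespace.
read -r stripped <<EOF
$sample
EOF

# Empty IFS for this one read only: whitespace is kept.
IFS= read -r kept <<EOF
$sample
EOF

printf '[%s]\n' "$stripped" "$kept"
```

In the loop above, the same effect could be had by writing while IFS= read -r line, which leaves IFS untouched for the rest of the script.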

dsid · March 13, 2017, 11:31am

Thanks. You're a genius. I remember the other script you gave me and I did try to fiddle around with it, but always messed up trying to find error and warning messages.

The output does print out all the errors and warnings. Now if I want to find a particular type of warning/error and its related blocks, do I need to copy the output to a file and then do the related search? Is there a way to sort complete blocks, the way I would sort lines, run uniq and then count them with wc -l? I'm sorry if I am asking a lot, but since you give me outputs in a jiffy, I thought I'd take the risk.

drysdalk · March 13, 2017, 11:38am

Hi,

Sure, no problem. One last question then. Would output like this:

Station / User...
SDate / Time / PDate...
Institution Number...

Warning|Error

<message>

be what you're after ? Or is there a particular kind of summary you'd like as output ?

dsid · March 13, 2017, 11:46am

The original output is fine. I just want it sorted so that I am not doing unnecessary scrolling in notepad++ trying to find out whether a similar warning/error message occurs again, if you know what I mean.

I have a particular type of output that I am looking for, but I'll try to do some research on my own and try to script it. At this moment I just want to see how you write your logic so that I can learn from it. I don't want to bug you with stupid questions again and again. Once I do come up with some output, I'll try to post the script, and maybe if you have time you could comment on it.

drysdalk · March 13, 2017, 11:54am

Hi,

Sorting the output would be a bit trickier than you might imagine, since while on the face of it that would be easy to do, you'd end up with a mixture of lines all run together with no way to tie them back to the block they were sorted from, if you see what I mean. But parsing the blocks to print out some kind of neatly summarised information on a single line is possible, if there are certain key parts that you'd want included in the summary.

If you think you've got enough to go on for now that's great, but if you would like anything further then if you can provide the details we can take things from there.

vgersh99 · March 13, 2017, 12:04pm

What kind of sorting do you have in mind? What are the sorting criteria?
I foresee building up an awk hash index by "criteria" with the actual block as the hashed value and sorting the hash once built...
Just my $.02
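A rough sketch of that idea, under assumptions that may not match the real log: blocks are delimited by lines containing "BEGIN MESSAGE" and "END MESSAGE", the criterion is the error/warning text, and that text sits two lines below the line starting with "Warning" or "Error". The sample data is made up.

```shell
# Made-up sample in the assumed layout; the real log may differ.
sample=$(mktemp)
cat > "$sample" <<'EOF'
BEGIN MESSAGE
Error!

Original presentment Not Found !
END MESSAGE
BEGIN MESSAGE
no problems
END MESSAGE
BEGIN MESSAGE
Warning !

Check the amount
END MESSAGE
BEGIN MESSAGE
Error!

Original presentment Not Found !
END MESSAGE
EOF

grouped=$(awk '
    { sub(/\r$/, "") }                       # drop a trailing CR, if any
    /BEGIN MESSAGE/ { block = ""; msg = ""; msgline = 0 }
    { block = block $0 "\n" }                # collect the current block
    /^(Warning|Error)/ { msgline = NR + 2 }  # message assumed 2 lines below
    NR == msgline { msg = $0 }
    /END MESSAGE/ && msg != "" {
        count[msg]++                         # tally per message text
        blocks[msg] = blocks[msg] block "\n" # keep whole blocks together
    }
    END {
        for (m in count) {
            printf "== %s (%d block(s)) ==\n%s", m, count[m], blocks[m]
        }
    }
' "$sample")
printf '%s\n' "$grouped"
rm -f "$sample"
```

The groups come out in whatever order awk iterates the array, so the grouping itself replaces the need for a line-based sort; each group header also carries the block count.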

dsid · March 13, 2017, 12:25pm

Hi drysdalk,

Yes I do understand. sorting would kind of get tricky. But a summarized information would also do the trick I guess. I did try to find a pattern in the 'Original presentment Not Found !' block and the only things of use were the 'Institution number'; 6th line from the 'BEGIN MESSAGE' and 'Acquirer Reference:'; 16th line from 'BEGIN MESSAGE'. The other warnings/errors don't have this particular information so I need to further analyze the logs to find a common pattern.

For the moment printing out the 'Institution number' and 'Acquirer Reference:' would kind of do the trick at least for the 'Original presentment Not Found !' block

Thanks for your help again.

---------- Post updated at 04:25 PM ---------- Previous update was at 04:06 PM ----------

I was looking for some sort of block sorting. For example, a single block consists of everything from a
BEGIN MESSAGE to an END MESSAGE. The attachment at the start of the thread has the blocks. In each block there would be an error/warning message. If my blocks are sorted by that error/warning message, it would be a bit easier for me to figure out how many blocks with that error/warning message there are in the original file.

Hope my words made some sense. Let me know if it's not clear.

drysdalk · March 13, 2017, 12:38pm

Hi,

This solution is a bit less efficient since it now relies on external binaries rather than shell built-ins, but for every block that has a Warning or Error, this will print out the Institution ID and the text of the error or warning.

#!/bin/bash

IFS=''

input=EXTRN071_copy.txt
tmp=/tmp/script.tmp

echo institution,errormessage
while read -r line
do
        case "$line" in
                *BEGIN\ MESSAGE*)
                        unset print
                        echo "$line" > "$tmp"
                        ;;
                *END\ MESSAGE*)
                        echo "$line" >> "$tmp"

                        if [ "$print" == "1" ]
                        then
                                institution=$(/usr/bin/awk '$0 ~ /   Institution/ {sub(/\r$/,""); print $NF}' "$tmp")
                                errormessage=$(/bin/grep -E -A2 "^Warning|^Error" "$tmp" | /usr/bin/tail -n 1)
                                echo "$institution,$errormessage"
                        fi
                        ;;
                Warning*|Error*)
                        print=1
                        echo "$line" >> "$tmp"
                        ;;
                *)
                        echo "$line" >> "$tmp"
                        ;;
        esac
done < "$input"

Sample output:

$ ./script.sh 
institution,errormessage
00000029,Original presentment Not Found !
00000029,Non-financial original Slip Not Found !
00000029,Processing Failed For Transaction!
00000046,Transaction type of chargeback is not the same as that of original presentment.
00000046,Transaction type of chargeback is not the same as that of original presentment.
00000041,Original presentment Not Found !
00000041,Non-financial original Slip Not Found !
00000041,Processing Failed For Transaction!
00000041,Original presentment Not Found !
00000041,Non-financial original Slip Not Found !
00000041,Processing Failed For Transaction!
00000050,Original presentment Not Found !
00000050,Non-financial original Slip Not Found !
00000050,Processing Failed For Transaction!
00000050,Original presentment Not Found !
00000050,Non-financial original Slip Not Found !
00000050,Processing Failed For Transaction!
00000007,Original Transaction Not Found !
00000007,Processing Failed For Transaction!
00000007,No transactions processed!
00000007,PROCESSING ERROR! - check log for error messages.
$

Hope this helps in the meantime.

EDIT: If you want the output sorted, change the last line to:

done < "$input" | /usr/bin/sort
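If a count per message type is also wanted, the CSV output can be tallied with standard tools. This is only a sketch: the rows below are a hand-copied subset in the same institution,errormessage shape as the sample output, and in practice the pipeline would be fed from ./script.sh instead of a variable.

```shell
# A few rows in the same institution,errormessage shape as the output above.
csv='institution,errormessage
00000029,Original presentment Not Found !
00000041,Original presentment Not Found !
00000007,Processing Failed For Transaction!
00000041,Original presentment Not Found !'

# Skip the header, keep everything after the first comma, then tally.
summary=$(printf '%s\n' "$csv" | tail -n +2 | cut -d, -f2- | sort | uniq -c | sort -rn)
printf '%s\n' "$summary"
```

uniq -c only merges adjacent identical lines, which is why the first sort is needed; the final sort -rn puts the most frequent message first.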

dsid · March 13, 2017, 12:53pm

@drysdalk, @vgersh99 I did try to google and found a simple perl script and did some replacements with my own text

#!/bin/perl -w
$/ = '******* BEGIN MESSAGE *******';
$pattern = 'Original presentment Not Found !';
while ( <> )
{
    chomp;
    /$pattern/ or next;
    print $/;
    print $_;
}

but then again it just gave me those blocks. drysdalk's script already does that. For all different type of errors/warning messages, I would need to enter the pattern manually by first searching for it from the original file and then replacing it in the above script.

What if I did not need to enter the pattern manually, and the script automatically gave me the output for each of these patterns?

Actually would it not be possible to say first search for all similar patterns and when the next line is not similar it moves on. This way there is no requirement for a sort explicitly?

Please let me know if I am not clear

---------- Post updated at 04:43 PM ---------- Previous update was at 04:42 PM ----------

@drysdalk Let me try to understand your script and I'll get back to you.

Thanks a lot again

drysdalk · March 13, 2017, 12:54pm

Hi,

Basically, print is a variable that we clear at the start of every block. We then set the variable if and only if we encounter a block that we want to print (that is, a block which contains an Error or Warning line). When we get to the end of the current block, we check to see if the print variable is set. If it is, we then proceed with printing out what we need. If it isn't, then we know we don't need to print anything from this block, as it contains no errors or warnings. So we then move on, and at the next start of a block we unset the variable, and so on.

dsid · March 13, 2017, 1:01pm

Your script is amazing. Thanks for that. It gives me more or less the output I was looking for. Thanks a lot again.

However, I want to know why you are using the print keyword, unsetting it and then setting it again?

drysdalk · March 13, 2017, 1:02pm

Hi,

I could have called it anything, yes. I just happened to call it print. The fact it's a variable name, and always has a $ symbol before it to make it clear to the shell it's a variable name, means this does not interfere with anything else that may or may not exist as a built-in, or elsewhere.

dsid · March 13, 2017, 1:21pm

But what was the advantage of using print? If I understood correctly, we could have used any variable name for that matter, and using print is putting in some sort of a check?

And I guess I am wrong in mentioning print as a keyword in bash

---------- Post updated at 05:10 PM ---------- Previous update was at 05:02 PM ----------

@drysdalk I got what you are doing with the grep and tail command, but could you please explain to me what you are doing with the awk command and the characters you are using in that statement

drysdalk · March 13, 2017, 3:14pm

Hi,

Sure, no problem. There are a few different things to this /usr/bin/awk '$0 ~ /   Institution/ {sub(/\r$/,""); print $NF}' "$tmp" line, so we'll take them in turn.

$0 ~ /   Institution/
This is pattern-matching. What we're saying here is that we only want to consider the current input line (represented by $0) further if it contains (the meaning of ~ in this context) the exact string "   Institution" (that's the word 'Institution' with three spaces in front of it). If that pattern-matching check passes, we move on to the next bit of the line.

sub(/\r$/,"");
Now this was something I didn't actually expect to have to do, and it kind of caught me out. As it turns out, the example file you've provided has Windows-style end-of-lines, rather than UNIX-style. This was catching me out when trying to print the Institution ID numbers, since being the last field on the line, they also included the Windows-style end-of-line characters, and it messed with the output.

So what this awk substitution command is doing is looking for a carriage return character at the very end of the line and replacing it with nothing, so we only have the line feed character to mark the end of a line. This makes the end of line "normal", from the perspective of a UNIX-style system.

Now that the line has been sanitised and stripped of all characters we don't need and would interfere with our later output (after already being sure we've found a line with the exact string we're looking for), we move on to the last bit of the awk line.

print $NF
This is the easiest one of the bunch, and prints the last field on the line (which in our case, is the Institution Number).

So the full explanation of this awk line in English would be:

Look for lines that contain the exact string "   Institution" ...

and then strip them of Windows-style line-ends, leaving UNIX-style line ends...

and finally print out the last field of the remaining line.
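To see the effect of the sub() step in isolation, here is a small sketch run against a single made-up line with a Windows-style ending. The field value and spacing are assumptions for the demo, not copied from the real log.

```shell
# One made-up line with three leading spaces and a CRLF ending.
input='   Institution Number:  00000029\r\n'

# Without the substitution, the last field still carries the carriage return.
raw=$(printf "$input" | awk '$0 ~ /   Institution/ { print $NF }')

# With it, the carriage return is removed before the last field is printed.
clean=$(printf "$input" | awk '$0 ~ /   Institution/ { sub(/\r$/, ""); print $NF }')

# The lengths differ by one: the hidden CR character.
echo "raw length: ${#raw}, clean length: ${#clean}"
printf '[%s]\n' "$clean"
```

The stray carriage return is invisible on screen but moves the cursor back to the start of the line, which is why it garbles output when something else is printed after it.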

Hope this helps.

dsid · March 14, 2017, 5:36am

Yes. I almost forgot to tell you that the log file has carriage return characters at the end of each line. What I normally do at my end is run dos2unix and then move on from there.

One last question when it comes to this thread: Say I want to input a different pattern every time I run the script. (You had provided me with the script for searching for a pattern in the other thread.) I have no problem when entering a string with spaces in between, storing it in a variable and then grepping for that variable in the log file. However, the problem comes when there is an exclamation mark at the end of the string, which is normally the case in the log file if you remember. The grepping does not work when I include the exclamation mark in the stored variable. How do I overcome that problem?

Thanks a lot for all your help
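Without seeing the exact command this is only a guess, but two causes seem plausible. In an interactive bash, an exclamation mark inside double quotes can trigger history expansion, which single quotes (or reading the pattern with read -r) avoid. And if the pattern is anchored to the end of the line, the carriage returns in the log get in the way. A sketch on a made-up two-line log:

```shell
# Made-up log with CRLF line endings, like the file in this thread.
log=$(mktemp)
printf 'Original presentment Not Found !\r\nProcessing Failed For Transaction!\r\n' > "$log"

# Single quotes keep "!" literal; in a real run this could be: read -r pattern
pattern='Original presentment Not Found !'

# -F treats the pattern as a fixed string, -c counts matching lines.
hits=$(grep -c -F -- "$pattern" "$log" || true)

# Anchoring at end of line fails on CRLF input unless the CR is removed first.
anchored=$(grep -c -- "${pattern}\$" "$log" || true)
cleaned=$(tr -d '\r' < "$log" | grep -c -- "${pattern}\$" || true)

echo "fixed-string: $hits, anchored: $anchored, anchored after tr: $cleaned"
rm -f "$log"
```

The || true is there only because grep exits non-zero when nothing matches, which would otherwise stop a script running under set -e.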