Find keywords in multiple log files

I have several problems with my program and I hope you can help me.

1) The if/else statement isn't working. The intended logic is:
If MEMSIZE OR SASFoundation (SASEXE) OR Real Time (Second) > 1.0, and Filename, then output the column name and value to the CSV file; otherwise output nothing.

Example progflag.csv:

Memsize    Second    SASEXE           filename
400        4.0       SASFoundation    file11.log.20120314

2) I am not getting any data in the csv file

3) The email syntax isn't working. I am not receiving the CSV file attachment via email.

My program reads in multiple files with a .log extension, for example file12.log.20120314. The program searches for 3 selected items in each log file.

Item 1# : Memsize. The Memsize statement stores a numeric value, for example memsize=400. The program outputs the column name (memsize), its value, and the filename to a CSV file.

example - progflag.csv:

memsize              filename
 400                       file12.log.20120314

Item 2# : Real Time row value. For example, in the line Real Time : 4.0 the row value is 4.0.
In my program Real Time is named Second. For example, SECOND stores 4.0. If SECOND > 1.0, then the program outputs the column name
and its value to a CSV file.

example - progflag.csv:

Second                 filename
 4.0                      file11.log.20120314

If the Real Time value is less than 1.0, then no data is output to the CSV file.

Example: Real Time: 0.2 (0.2 is less than 1.0)

Item 3#: if the program finds the directory path /SASFoundation (SASEXE), then it outputs the directory path to a CSV file.

Example progflag.cvs

Second    SASEXE           filename
4.0       SASFoundation    file11.log.20120314

Here is the code:
cd /tmp/*.log.*
awk -F '[=:;.]' '
  function pr() {if(NR>1) printf "%s\t%s\t%s\t%s\n", K[1],K[2],K[3],K[0]}
  BEGIN {
      printf "MEMSIZE\tSECOND\tSASEXE\tFilename\n"
      for(i=split("memsize ,Real Time ,SASFoundation",A,",");i;i--) L[A]=i
  }
  FNR==1 {
      pr()
       K[0]=FILENAME
      K[1]=K[2]=K[3]=x
  }
  $1 in L {v=$2;gsub("^[/ ]*","",v);gsub(/ *$/,"",v);K[L[$1]]=v}
  END{pr()}
if MEMSIZE OR SECOND >1.0 OR  SASEXE AND Filename then
' *.log.* > progflag.csv

[ -s progflag.csv ] && mailx -s "subject text -a "Programs flagged" receiver@domain.com < progflag.csv
ELSE ''

So what are the symptoms of your problem?

Are you getting syntax errors for an incomplete awk program?

Are you getting failures from cd for trying to change directory to a list of four regular files (instead of to one directory)?

Are you always getting mail because progflag.csv is never empty since you always print a header line into that file even if no data follows the header?
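The last question can be checked directly. The following is only a minimal sketch, assuming a throwaway file created with mktemp; the helper name has_data_rows is hypothetical and not part of the posted script. The test [ -s file ] only checks that the file is non-empty, so a header line alone satisfies it. Counting lines past the header is one way to test for real data:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file has at least one line
# after the header line.
has_data_rows() {
    [ "$(awk 'END { print NR }' "$1")" -gt 1 ]
}

report=$(mktemp)
printf 'MEMSIZE\tSECOND\tSASEXE\tFilename\n' > "$report"

# The header alone makes the file non-empty, so -s succeeds...
[ -s "$report" ] && echo "-s: file is not empty"
# ...but there is nothing worth mailing yet.
has_data_rows "$report" || echo "no data rows"
```

Replacing [ -s progflag.csv ] with a check like this would keep mail from being sent for a header-only report.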

I'm giving up.

On top of what Don Cragun said,

  • the attached files' names don't match the ones mentioned in the text
  • the attached files are not *nix text files, as they lack the trailing <newline> character
  • the attached files' structure (case, spaces around "=", maybe more) doesn't match the one mentioned in the text or inferred from the code sample.

Why don't you take a step back, rephrase the specification, and explain the logic needed, using sample input data and showing how it should appear in the output?

To answer your questions:

Are you getting syntax errors for an incomplete awk program?
I am getting syntax errors for an incomplete awk program.

I am getting an error in the if/else statement.

Are you always getting mail because progflag.csv is never empty since you always print a header line into that file even if no data follows the header?

I am not receiving email

Maybe you should consider the questions I asked and the comments RudiC made as suggestions for things to change in your code to make it work correctly and avoid the problems you are having. If you try fixing those problems in your code and are still having problems, come back to us and:

  1. Tell us what operating system you're using.
  2. Tell us what shell you're using.
  3. Clearly describe the format of the input and output files you are processing. (Are they UNIX format text files? If not, why not and what format are they?)
  4. Show us sample input files (in CODE tags).
  5. Show us sample output files showing the exact output you are trying to produce from your sample input files (in CODE tags).
  6. Give us a CLEAR specification of what you are trying to do (using filenames in your specification that match the sample input and output files specified above).
  7. Show us your updated code (in CODE tags).
  8. Show us all of the diagnostic messages that are being produced from your code (in CODE tags). (And, don't tell us that mailx is failing when your script died long before it got to mailx .)

Answers to your questions:
Tell us what operating system you're using? AIX

Tell us what shell you're using? bash

The text files are SAS log files produced by generated SAS programs. The SAS programmers sometimes add the Memsize parameter and the directory path /sas/sasfoundation to their code, but not always.

When a programmer leaves them out, the log file will not contain Memsize or the directory path
/sas/sasfoundation.

Every log file contains an entry named Real Time with a numeric value. The Real Time value is normally low, in the range 0.0 to 0.9. It is considered high when it reaches 1.0.

I have several problems with the program: I hope you can help me.

1) the If else statement is throwing an error message. syntax error can't

    read
    {if ($1)|| ($2>1.0) || ($ 3) && ( $0)) printf $1 "\t" $2 "\t"" $3"\t" $0"\t";   
    elseif($2 < 1.0  else print ''}'
    ' *.log > progflag.csv.txt

What the syntax above is meant to say:

  1. if ($1)
    Memsize = ; if there is a numeric value after the = in the log file, output the value to progflag.csv.txt.
  2. or if ($2>1.0)
    Second, which is an alias for Real Time : ; if the numeric value after the : in the log file is greater than 1.0, output the value to progflag.csv.txt.
  3. or if ($ 3)
    SASFoundation is the value stored under the alias SASEXE.
    If SASFoundation exists in the log file, output the value to progflag.csv.txt.
  4. and ($0)
    The filename. Each log file has a name. If ($1) || ($2>1.0) || ($ 3) && ($0) is true,
    output the log file's record together with the filename to progflag.csv.txt.
  5. elseif($2 < 1.0 else print ''}
    This means that if $2 is less than 1.0, no value is output to the column
    named Second in progflag.csv.txt.

For example, if filew.log contains neither Memsize nor SASFoundation, and its Real Time value is less than 1.0, then no data is output to progflag.cvs.txt.

Below is a sample of the exact output I'm trying to produce in progflag.cvs.txt from the sample input files. The data from filew.log.txt isn't in progflag.cvs.txt because it meets none of the criteria (Memsize, SASFoundation, or a Real Time value greater than 1.0):

Memsize        Second        SASEXE                  filename.txt
    200                            SASFoundation           file1x.log.txt
    100                            SASFoundation           file2x.log.txt
    400           5.1                                            filez.log.txt 

2) I am not getting any data in progflag.cvs.txt, even though Memsize and SASFoundation are in some of the log files that the program reads.

3) I am not receiving the progflag.cvs attachment via email

4) I added *.log | awk because I want the program to read in log files with the .log extension only. There are other files in the directory that have different extensions.

'*.log | awk -F '[=:;.]' '
  function pr() {if(NR>1) printf "%s\t%s\t%s\t%s\n", K[1],K[2],K[3],K[0]

I am getting the following error

*.log |awk -F [=:: not found.
   .] not found. syntax error at line 3: '(' not expected

What the program does is the following:

It searches for 3 selected items in each log file:

  1. Memsize= ; 'a numeric value is after the ='
  2. sasfoundation - the path in a directory,
  3. Real time : 'a numeric value is after the :'
#!/bin/bash
cd /tmp/logs

'*.log | awk -F '[=:;.]' '
  function pr() {if(NR>1) printf "%s\t%s\t%s\t%s\n", K[1],K[2],K[3],K[0]}
  BEGIN {
      printf "MEMSIZE\tSECOND\tSASEXE\tFilename\n"
      for(i=split("memsize ,Real Time ,SASFoundation",A,",");i;i--) L[A]=i
  }
  FNR==1 {
      pr()
       K[0]=FILENAME
      K[1]=K[2]=K[3]=x
  }
  $1 in L {v=$2;gsub("^[/ ]*","",v);gsub(/ *$/,"",v);K[L[$1]]=v}
  END{pr(
{if ($1) || ($2>1.0 ) || ( $ 3 ) &&  ($0)) printf $1 "\t" $2 "\t" $3"\t" $0"\t; elseif($2 < 1.0 else print ''}'
' *.log > progflag.csv

[ -s progflag.csv ] && mailx -s "subject text -a "Programs flagged" receiver@domain.com < progflag.csv

We are lost here.
You have uploaded four sample files: file1.log.02896.txt , file2.log.02897.txt , filew.log.02820.txt , and filez.log.02899.txt
None of these files are referenced in any of your posts in this thread.

You have referenced files with names (or names that match pathname matching patterns) *.log , progflag.cvs , progflag.cvs.txt , and several others; but you have not shown us samples of the contents of any of these files.

You have shown us some code and you have sort of said what some of that code is trying to do, but the syntax is so different from the syntax expected by awk and bash and your explanations are not in complete sentences, so I am unable to figure out the format of your input files and I am unable to figure out the logic you are trying to use to produce the output you want.

The OS is AIX 7.1.

My program searches multiple text files for certain keywords and their values, writes the information to a text file, and sends that file as an email attachment. One of the keywords is real time. If the real time value in a text file is greater than 5:00:00, then the column name, its value, and the name of the text file containing it are output to progflag.txt.

Another keyword in the search is the assignment Memsize and its value. Memsize, its value, and the name of the text file containing them are output to progflag.txt.

The last keyword in the search is the directory name SASFoundation. SASFoundation and the name of the text file containing it are output to progflag.txt.

My problem is that in progflag.txt I am getting the headers with no column values. Below is the output when I run the code:

MEMSIZE SECOND   SASEXE   FILENAME

Here is what the output results need to show in progflag.txt

MEMSIZE   SECOND     SASEXE                     Filename
200                                                        SASFoundation_MEMSIZE.txt
400       06:00:00         SASFoundation        GT_5hr.txt

In this example, there should be only 2 filenames in progflag.txt, not three. For example, no_SASFoundation_no_MEMSIZE.txt doesn't meet the criteria, so there shouldn't be any data for this file in progflag.txt.

Here is my code:

#!/bin/bash


cd /log/tmp/*.txt | awk -F '[=:]' '
  function pr() {printf FORMAT, K[1],K[2],K[3],K[0]}
  BEGIN {FORMAT="%s\t%s\t%16s\t%s\n"
      printf FORMAT, "MEMSIZE","SECOND","SASEXE","Filename\n"
        for(i=split("/Memsize/ $2, ,/Real Time/ $2 ,/SASFoundation/ $3",A,",");i;i--) L[A]=i
      FORMAT="%s\t%.1f\t%16s\t%s\n"
  }
  FNR==1 {
      if(K[1] || K[2]>'5:00:00' || K[3]) pr()
       K[0]=FILENAME
      K[1]=K[2]=K[3]=x
  }
  $1 in L {v=$2;gsub("^[/ ]*","",v);gsub(/ *$/,"",v);K[L[$1]]=v}
  END{if(K[1] || K[2]>'5:00:00' || K[3]) pr()}' *.txt > progflag.txt

[ -s progflag.txt ] && mailx -s "subject text" -a  progflag.txt receiver@domain.com < "Code Need to be Evaluated"

There seem to be multiple issues with this code:

  • There is a semicolon missing between the first two gsubs. Is that a typo?
  • Also, the third gsub seems to have a spurious v=$2 in it, and if you leave that out it becomes identical to the first gsub, so the third serves no purpose.
  • There is a single quote missing at the end of the awk statements.
  • In the field separator specification -F '[= '':;.]' the two quotes in the middle serve no purpose. It also seems ill adapted to splitting the fields of the input file. With the given input, $1 will only ever contain "MEMSIZE", so that is the only time the $1 in L condition could be true, and then $2 is empty. But since the label in array L is "MEMSIZE " with a trailing space, even that will not match.
  • K[2]>'5:00:00' contains single quotes instead of double quotes, so this evaluates to K[2]>5:00:00 , which is a syntax error.
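The two quoting points above can be seen by printing the strings exactly as the shell hands them to awk. This is only an illustrative sketch; the function names show_fs and show_test are hypothetical. Adjacent quoted pieces are concatenated by the shell, so the inner single quotes never reach awk:

```shell
#!/bin/bash
# Print the strings exactly as awk would receive them from the shell.
# Adjacent quoted pieces are concatenated, so the inner quotes vanish.
show_fs()   { printf '%s\n' '[= '':;.]'; }
show_test() { printf '%s\n' 'K[2]>'5:00:00' || K[3]'; }

show_fs     # the -F argument as awk sees it
show_test   # the comparison as awk sees it
```

The first call shows a bracket expression with no quote characters in it, and the second shows the bare 5:00:00 that awk cannot parse. Inside a single-quoted awk program, string constants need double quotes, as in K[2] > "5:00:00".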

Thanks, Scrutinizer

Is there any way the gsub can be fixed that it will output the correct values?

I'm going to ignore most of your sample shell script for the moment because it doesn't seem to match any of your stated requirements. But, it is the only thing we have where you state what the explicit key words are that you are looking for in your text file. The key words your script defines are the literal strings: /Memsize/ $2 , a literal single space character, /Real Time/ $2 , and /SASFoundation/ $3 . Except for the second keyword in this (the single <space> character), I have not been able to find any of these key words in any of your sample files.

Searching through your sample input files for the data shown in your desired output above, I can find a line that would be matched by the ERE *real time * on a line that does NOT also contain the string seconds . Note that regular expressions and filename pattern matches are case-sensitive on UNIX and UNIX-like systems. Real Time and real time are NOT the same! Note that printing the value 6:00:00 from the input line:

      real time     6:00:00

(which does not contain the word seconds like other "real time" values:

      real time         0.06 seconds
      real time     3.01  seconds
      real time     0.3  seconds
      real time     3.0   seconds

under the heading SECONDS ) is highly counterintuitive, and it will NOT be displayed as you have requested using the printf format string %.1f . (Using that format with the input 6:00:00 would produce the output 6.0 .) The string 6:00:00 seems to be hours, minutes, and seconds; not just seconds. And the test you're using to determine if a line should be printed is a string comparison; not a numeric comparison. With your test, a value of 6:00 (less than 1 hour) would compare greater than 5:00:00 and a value of 10:00:01 (more than 10 hours) would compare less than 5:00:00 . Please provide a much clearer description of which lines containing real time should be reported and explain what should happen if more than one of those lines in a single input file is selected. (Your code would only report the last selected line, if your code actually selected any lines matching this pattern. Is that what you want?)
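The difference between the two kinds of comparison can be sketched as follows. The helper names to_seconds, string_gt, and seconds_gt are hypothetical, and to_seconds assumes its argument is always in the H:MM:SS form; this is an illustration, not the logic of the posted script:

```shell
#!/bin/bash
# Hypothetical helper: convert H:MM:SS to a number of seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# String comparison, as in the posted test: prints 1 when the first string
# sorts after the second, otherwise 0.
string_gt() {
    awk -v a="$1" -v b="$2" 'BEGIN { print ((a "") > (b "")) }'
}

# Numeric comparison on elapsed seconds.
seconds_gt() {
    [ "$(to_seconds "$1")" -gt "$(to_seconds "$2")" ]
}

string_gt 6:00 5:00:00        # 6 minutes "beats" 5 hours as a string
string_gt 10:00:01 5:00:00    # 10 hours "loses" to 5 hours as a string
seconds_gt 10:00:01 5:00:00 && echo "10:00:01 is longer than 5:00:00"
```

Converting to seconds first makes the test depend on elapsed time rather than on which digit happens to come first in the string.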

The ERE MEMSIZE *= * seems to match the lines you are trying to grab from your input files:

MEMSIZE = 200;
MEMSIZE= 400;

The only line in any of your input files containing the string SASFoundation is:

z=/SAS/SAS94/SASFoundation/9.4;

which seems to have the key word z , which is not mentioned anywhere in your description. Why is the value to be placed in your output under the heading SASEXE just the 3rd of the three or four directories named in the z key word's value?

The final field in your output is described in your explanation above as "the text filename that stores the information", and the MEMSIZE = 200; data in your output file comes from a file named SASFoundation_MEMSIZE.txt . But, the data for the last line of your sample output file comes from a file named more_than_5_hr.txt not from the file listed in your sample output: GT_5hr.txt .

My program runs without error. The problem I am having is this:

The program isn't outputting field values under the column headers in file.txt.

Each of the column headers in file.txt has no data.

MEMSIZE  SECOND SASFoundation  Filename

The output results in file.txt should show:

MEMSIZE   SECOND      SASFoundation            Filename
200                                                             LT_5h_MEMSIZE.txt
400          06:00:00       SASFoundation            GT_5hr.txt

I realized the problem is gsub. I don't know enough about gsub to fix this
issue.

$1 in L{v=$2;gsub("^[/]*","",v)gsub(/*$/,"",v);gsub(v=$2"^[/]*","",v);K[L[$1]]=v}

The first gsub is meant to store the field value for MEMSIZE, the second gsub
the field value for real time, and the last gsub the field value for
SASFoundation. The field values should then be output under the headers in file.txt.


#!/bin/bash

cd /tmp/log/*.log
awk -F '[= '':;.]' '
function pr() {if(NR>1) printf "%s\t%s\t%s\t%s\n", K[1],K[2],K[3],K[0]}
BEGIN {
printf "MEMSIZE\tSECOND\tSASFoundation\tFilename\n"
for(i=split("MEMSIZE ,real time ,SASFoundation",A,",");i;i--) L[A]=i
}
FNR==1 {
pr()
K[0]=FILENAME
K[1]=K[2]=K[3]=x
}
$1 in L {v=$2;gsub("^[/ ]*","",v)gsub(/ *$/,"",v);gsub(v=$2"^[/ ]*","",v);K[L[$1]]=v}
 END{if(K[1] || K[2]>'5:00:00' || K[3]) pr()} *.txt > file.txt
[ -s file.txt ] && mailx -s "subject text" -a  file.txt receiver@domain.com < "Code Need to be Evaluated"

You say "My program run without error.", but with the 3rd line of your script being:

cd /tmp/log/*.log

I find that very hard to believe. This line will succeed if and only if there is exactly one file matching the pattern /tmp/log/*.log and that matching file is of type directory. Otherwise, that command will produce a diagnostic message. Since you are processing .txt files in that directory, please show us the exact output you get (in CODE tags) from the command:

ls -l /tmp/log/*.log/*.txt
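The behavior of cd with a pattern can be reproduced in a scratch directory. This is a minimal sketch, assuming mktemp -d is available; the helper name can_cd and the file names are hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: try cd in a subshell so the caller's directory is
# untouched; the exit status says whether cd would have worked.
can_cd() {
    ( cd "$@" ) >/dev/null 2>&1
}

demo=$(mktemp -d)
touch "$demo/a.log" "$demo/b.log"

can_cd "$demo"/*.log || echo "cd to the pattern fails: it expands to regular files"
can_cd "$demo"       && echo "cd to the directory itself works"
```

The pattern belongs on the awk command line (as file operands), while cd should be given only the directory.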

You said: "I realized the problem is gsub. I don't know enough about gsub to fix this issue." I am not sure how you realized that gsub() is your problem (and it may be part of your problem), but you have a problem before you ever get to gsub() . With the field separators specified to be each occurrence of an equal sign, a space, a colon, a semicolon, or a period character in your input line and the strings that you are looking for in field 1 being MEMSIZE (which contains a trailing space character), real time (which contains an embedded and a trailing space character), and SASFoundation (which does not appear at the start of any line in any of your sample input files); there would seem to be zero chance that the condition $1 in L is ever going to be true for any of your sample input files. Therefore, none of your gsub() function calls will ever be executed in your script.
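What $1 actually contains with that separator set can be checked on the sample lines quoted earlier in the thread. This is a small sketch; the helper name first_field is hypothetical, and the separator is written the way it reaches awk once the shell has removed the inner quotes:

```shell
#!/bin/bash
# Print the first field, in brackets, using the separator set from the
# posted script ('[= '':;.]' reaches awk as [= :;.]).
first_field() {
    awk -F '[= :;.]' '{ print "[" $1 "]" }'
}

printf 'MEMSIZE = 200;\n'                  | first_field
printf '      real time     6:00:00\n'     | first_field
printf 'z=/SAS/SAS94/SASFoundation/9.4;\n' | first_field
```

None of the bracketed values equals "MEMSIZE " (with the trailing space), "real time ", or "SASFoundation", so the $1 in L condition cannot match these lines.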

Instead of asking us to debug your gsub() function calls, please give us a CLEAR description in English of the logic used to determine:

  1. What keyword is being processed to get the value SASFoundation from the input line z=/SAS/SAS94/SASFoundation/9.4; . If you are processing the z keyword, why isn't the value for that keyword /SAS/SAS94/SASFoundation/9.4 ?
  2. Which lines containing real time need to be processed, what are the possible formats of the times specified on those lines, and how is that data supposed to be displayed in your output file?
  3. Since the last field in your output file is not always the name of the input file in which the rest of the data on that line was found, how is the data in that field determined?
  4. Why do your input and output files have incomplete last lines (with no line terminator) and why do they have DOS line separators? Why aren't your input and output files text files if they are named with the file extension .txt ?
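The two conditions in the last question can be tested on any suspect file. The following is a sketch under the assumption that grep and tail -c behave as POSIX describes; the helper names has_cr and ends_with_newline are hypothetical:

```shell
#!/bin/bash
# Hypothetical helpers for inspecting a suspect "text" file.

# Succeeds if any line contains a carriage return (DOS line ending).
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# Succeeds if the last byte of a non-empty file is a newline.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

sample=$(mktemp)
printf 'MEMSIZE = 200;\r\nreal time     6:00:00' > "$sample"

has_cr "$sample" && echo "DOS line endings present"
ends_with_newline "$sample" || echo "last line is incomplete"
```

Command substitution strips a trailing newline, which is why an empty result from tail -c 1 indicates that the file ends properly.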

I'm disappointed that you have chosen not to answer any of my questions (which would have helped give you code that might work for you), but maybe this will give you something you can adapt to your needs. It makes some wild assumptions based on sample input files you have provided in this thread, sample output files you have provided in this thread, sample code segments you have provided in this thread, statements you have made in this thread, and me reading a lot in between the lines:

  1. The input files you want to process are in the directory /tmp/log .
  2. The output file you want to produce should be placed in the directory /tmp/log .
  3. The name of the output file you want to produce is either file.txt or progflag.txt . (The following script uses the name progflag.txt .)
  4. You do not want to process your output file as an input file. (The following script ignores both file.txt and progflag.txt as input files.)
  5. All files in the directory /tmp/log whose names end with the string .txt (other than the two mentioned possible output files) are to be processed as input files.
  6. Your input files might or might not have DOS (CR-LF) line terminators instead of UNIX (LF) line terminators. If CR-LF line terminators are present, the CR should be removed before further processing an input line.
  7. Your input files might not have a line terminator on the last line. If an input file does not have a line terminator on the last line, a UNIX line terminator should be added.
  8. Your output file should be a properly formatted text file with UNIX line terminators.
  9. If an input file contains the string /SASFoundation/ , an output line should be created in your output file with the string SASFoundation as the 3rd field in that line.
  10. If an input file contains a line matching the ERE ^MEMSIZE *= *[^;]*;{0,1} , an output line should be created in your output file with the string matched by the [^;]* portion of that ERE as the 1st field in that line.
  11. If an input line contains three words and the 1st word is real , and 2nd word is time , and the 3rd word matches the ERE [0-9]+:[0-9]{2}:[0-9]{2} (where the leading digit(s) represent hours, the middle digits represent minutes, and the last digits represent seconds) and the elapsed time represented by the 3rd word is greater than 5 hours; an output line should be created in your output file with the 3rd word (with a leading zero prepended if there is only one leading digit in that word) as the 2nd field in that line.
  12. If more than one line matching any one of the above three criteria would cause an output line to be created, the last line encountered in an input file meeting that criteria is the one used to determine what appears in the output line.
  13. If more than one of the criteria is found in a single input file, only one line of output should be produced for that input file and the 4th field in that output line should be the name of the input file from which that data was extracted.
#!/bin/bash
cd /tmp/log
for f in *.txt
do	# Skip output files
	[ "$f" = "file.txt" ] && continue
	[ "$f" = "progflag.txt" ] && continue

	# Add a header line for each remaining file to be processed, copy the
	# file to awk's standard input, and add a line terminator to the end of
	# each input file...
	printf '***File=%s\n' "$f"	# Header
	cat "$f"			# File contents
	echo				# Terminate last incomplete line
done | awk '
BEGIN {	FMT[0] = "%-9s%08s  %-15s%s\n"	# SECOND field format for HH:MM:SS
	FMT[1] = "%-9s%-10s%-15s%s\n"	# SECOND field format for other values
}
# Function to print data from data for one input file (including output file
# header before the first output produced).
function pr() {
	if(ms || rt || se) {
		# If we have not printed a header yet...
		if(!header) {
			# print a header.
			header = 1
			printf(FMT[1], "MEMSIZE", "SECOND", "SASEXE",
			    "Filename")
		}
		# Print data gathered from this input file...
		printf(FMT[length(rt) == 0], ms, rt, se, fn)
		ms = rt = se = ""
	}
}
{	# Convert DOS line terminators to UNIX line terminators.
	sub(/\r$/, "")
}
/^\*\*\*File=/ {
	# File header found for a new input file...
	# Print data from previous file.
	pr()

	# Grab filename from this line.
	fn = substr($0, 9)
#	printf("fn=\"%s\" extracted from \"%s\"\n", fn, $0)
	next
}
/^MEMSIZE *=/ {
	# Grab MEMSIZE field data.
	split($0, fields, / *= *|;/)
	ms = fields[2]
#	printf("ms=\"%s\" extracted from \"%s\"\n", ms, $0)
	next
}
/\/SASFoundation\// {
	# If any line contains the literal string "/SASFoundation/", set se to
	# "SASFoundation".
	se = "SASFoundation"
#	printf("se=\"%s\" extracted from \"%s\"\n", se, $0)
	next
}
$1 == "real" && $2 == "time" && NF == 3 && split($3, fields, /:/) == 3 {
	# We have found a "real time" line with 3 fields and the 3rd field is of
	# the form hours:minutes:seconds.  Set rt to $3 if hours > 5 OR
	# (hours == 5 AND (minutes > 0 || seconds > 0)).
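	# Prepend a leading zero when hours has one digit; the "0" flag in
	# "%08s" is not portable with the s conversion.
	if(fields[1] + 0 > 5 ||
		(fields[1] + 0 == 5 && (fields[2] + 0 > 0 || fields[3] + 0 > 0)))
		rt = (length(fields[1]) == 1 ? "0" : "") $3
#	printf("rt=\"%s\" extracted from \"%s\"\n", rt, $0)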
	next
}
END {	# Print results from last input file.
	pr()
}' > progflag.txt

# Send mail if output was produced.
[ -s progflag.txt ] && echo "Code Need to be Evaluated" |
    mailx -s "subject text" -a  progflag.txt receiver@domain.com

This script was written using a Korn shell and tested with a Korn shell and with bash . It should work with any POSIX-conforming shell. If you want to try this on a Solaris/SunOS system, change awk in this script to /usr/xpg4/bin/awk or nawk . If the files you uploaded as sample data for this thread are located in the directory /tmp/log , this script creates a file named progflag.txt containing:

MEMSIZE  SECOND    SASEXE         Filename
400      06:00:00  SASFoundation  GT_5hr.txt
200                               SASFoundation_MEMSIZE.txt
400      06:00:00  SASFoundation  more_than_5_hr.txt

Of course, the script won't work if receiver@domain.com is not a valid e-mail address, nor if your system's version of mailx does not include a -a file option to include file as an attachment to your mail message. (The POSIX standards do not include a mailx -a file option.)