Using awk to Parse File

Hi all, I have a file that contains a few hundred job definitions like the one below:

Job Name                                                         Last Start           Last End             ST Run     Pri/Xit
________________________________________________________________ ____________________ ____________________ __ _______ ___
B9043CC_APP_DMLD_025_FR_xpabbdu1_D                               03/12/2014 18:21:32  03/12/2014 18:22:07  SU 49744331/3

  Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
  --------------  --------------------- --  --  --------------------- ----------------------------------------
  [FORCE_STARTJOB]  03/12/2014 17:30:52    0  PD  03/12/2014 17:30:53
    < >
  STARTING        03/12/2014 17:30:53    1  PD  03/12/2014 17:30:53   machine.enviorment.net
  RUNNING         03/12/2014 17:31:06    1  PD  03/12/2014 17:31:07   machine.enviorment.net
  SUCCESS         03/12/2014 17:31:46    1  PD  03/12/2014 17:31:47
  [FORCE_STARTJOB]  03/12/2014 18:16:06    0  PD  03/12/2014 18:16:07
    < >
  STARTING        03/12/2014 18:16:07    2  PD  03/12/2014 18:16:07   machine.enviorment.net
  RUNNING         03/12/2014 18:16:19    2  PD  03/12/2014 18:16:20   machine.enviorment.net
  FAILURE         03/12/2014 18:17:02    2  PD  03/12/2014 18:17:03
  [*** ALARM ***]
    JOBFAILURE    03/12/2014 18:17:03    2  PD  03/12/2014 18:17:04
  [FORCE_STARTJOB]  03/12/2014 18:21:18    0  PD  03/12/2014 18:21:19
    < >
  STARTING        03/12/2014 18:21:19    3  PD  03/12/2014 18:21:19   machine.enviorment.net
  RUNNING         03/12/2014 18:21:32    3  PD  03/12/2014 18:21:32   machine.enviorment.net
  SUCCESS         03/12/2014 18:22:07    3  PD  03/12/2014 18:22:08

The actual start/end dates and times come from the "ProcessTime" column. I only want the data above; none of the header text or the "----" separator lines should appear in the file I write the output to. As mentioned above, I have a few hundred of these definitions in a single file.
I was originally doing this in Python and am now going to try it with awk.

I know to read in the file it would be:

awk /dir/filepath/input.txt

And the output file I need would look like this:

System Number  Job Name                           Target Machine     Status     Actual Start Date     Actual Start Time      Actual End Date    Actual End Time
9043           B9043CC_APP_DMLD_025_FR_xpabbdu1_D machine.enviorment.net    SUCCESS       03/12/2014               17:30:53            03/12/2014         17:31:47
9043           B9043CC_APP_DMLD_025_FR_xpabbdu1_D machine.enviorment.net    FAILURE       03/12/2014               18:16:07            03/12/2014         18:17:03
9043           B9043CC_APP_DMLD_025_FR_xpabbdu1_D machine.enviorment.net    SUCCESS       03/12/2014               18:21:19            03/12/2014         18:22:08
> /dir/filepath/output.txt

However, I'm looking for help with regards to the parsing aspect.
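For reference, awk expects the program text before the input file, so a complete invocation has roughly the shape sketched below. The file names and the temporary directory are placeholders made up for this example, and the program shown only copies its input.

```shell
# Placeholder files for this sketch; substitute the real input and output paths.
tmp=$(mktemp -d)
printf 'line one\nline two\n' > "$tmp/input.txt"

# Program text comes first (here it just copies every line), then the input file;
# the shell redirection sends the result to the output file.
awk '{ print }' "$tmp/input.txt" > "$tmp/output.txt"
```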

And where are we going to find the system number?

Under the "Job Name" column:

Job Name                                                         Last Start           Last End             ST Run     Pri/Xit
________________________________________________________________ ____________________ ____________________ __ _______ ___
B9043CC_APP_DMLD_025_FR_xpabbdu1_D                               03/12/2014 18:21:32  03/12/2014 18:22:07  SU 49744331/3
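
A minimal sketch of pulling the system number out of the job name, assuming it is always the first run of digits in the name (as in B9043CC_...). The helper name system_number is hypothetical; it relies only on POSIX awk's match(), RSTART and RLENGTH.

```shell
# Hypothetical helper: print the first run of digits in a job name.
# Assumes the system number is the first digit run (e.g. B9043CC_... -> 9043).
system_number() {
    printf '%s\n' "$1" | awk '{
        # match() sets RSTART/RLENGTH to the position and length of the match
        if (match($1, /[0-9]+/))
            print substr($1, RSTART, RLENGTH)
    }'
}

system_number "B9043CC_APP_DMLD_025_FR_xpabbdu1_D"
```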

Can you tell us a bit more about the input file format?
I understand this is an extract. Does each block correspond to one specific job, or might the same job appear again later in the file?
Anything that clarifies how the parsing should work would help, e.g.:
Will we always have to look at the 3rd line once we find "^Job Name " to get the string containing the system number?

Sure!

Here is the general logic:

Read six lines (header)
Get system number and batch name

Until end of file:
    Read five lines
    Get machine name, status, start and end dates and times
    If status is FAILURE
        Read two lines (clear error message)
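
Since the number of event lines per run varies in the sample (extra [FORCE_STARTJOB], < > and alarm lines), one alternative to counting lines is to decide what each line is from its first field. The sketch below only labels lines; the helper name classify_lines is hypothetical, and it assumes job names start with B followed by digits.

```shell
# Hypothetical helper: label each report line by what its first field says,
# rather than by its position in the file.
classify_lines() {
    awk '
        $1 ~ /^B[0-9]+/            { print "job";   next }  # job row
        $1 == "STARTING"           { print "start"; next }  # a run begins
        $1 ~ /^(SUCCESS|FAILURE)$/ { print "end";   next }  # a run ends
                                   { print "skip" }         # headings, rules, events
    ' "$@"
}

classify_lines <<'EOF'
B9043CC_APP_DMLD_025_FR_xpabbdu1_D   03/12/2014 18:21:32  03/12/2014 18:22:07  SU 49744331/3
  [FORCE_STARTJOB]  03/12/2014 17:30:52    0  PD  03/12/2014 17:30:53
  STARTING        03/12/2014 17:30:53    1  PD  03/12/2014 17:30:53   machine.enviorment.net
  SUCCESS         03/12/2014 17:31:46    1  PD  03/12/2014 17:31:47
EOF
```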

No duplicate job names will be present; however, jobs can share the same system number.

Also, some jobs may not have run on that specific day, so there will be no data in them. In that case the fields in the output file would just be empty or null.

For example:

Job Name                     Last Start           Last End             ST Run     Pri/Xit
____________________________ ____________________ ____________________ __ _______ ___
B9043CC_uwsprem_l_thd013sv_D 08/04/2010 22:03:55  03/05/2012 07:51:33  OI 22333537/0    

  Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
  --------------  --------------------- --  --  --------------------- -------

B9043CC_uwsprem_l_thd024sv_D 03/06/2012 22:00:34  03/06/2012 22:00:42  OI 22333536/1    

  Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
  --------------  --------------------- --  --  --------------------- -------

B9043BC_bond_ba_mf_loss_thd013sv_D                               03/06/2012 08:54:11  03/06/2012 11:44:06  OI 22303721/1    

  Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
  --------------  --------------------- --  --  --------------------- ----------------------------------------
  [STARTJOB]      03/19/2014 17:45:00    0  PD  03/19/2014 17:45:00   
    <Event was Scheduled based on Job Definition.>

 B9043CC_bcmsloss_l_thd013sv_D                                   03/21/2014 08:46:48  03/21/2014 10:38:31  SU 22303721/110    

   Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
   --------------  --------------------- --  --  --------------------- ----------------------------------------
   SUCCESS         03/19/2014 14:04:49   108  PD  03/19/2014 14:04:49   
   [FORCE_STARTJOB]  03/20/2014 13:39:15    0  PD  03/20/2014 13:39:15   
     < >
   STARTING        03/20/2014 13:39:15   109  PD  03/20/2014 13:39:16   machine.enviorment.net
   RUNNING         03/20/2014 13:39:17   109  PD  03/20/2014 13:39:17   machine.enviorment.net
   SUCCESS         03/20/2014 14:24:56   109  PD  03/20/2014 14:24:56   
   [FORCE_STARTJOB]  03/21/2014 08:46:47    0  PD  03/21/2014 08:46:47   
     < >
   STARTING        03/21/2014 08:46:47   110  PD  03/21/2014 08:46:48   tmachine.enviorment.net
   RUNNING         03/21/2014 08:46:48   110  PD  03/21/2014 08:46:49   machine.enviorment.net
   SUCCESS         03/21/2014 10:38:31   110  PD  03/21/2014 10:38:31   

... based off your original data ...

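gawk '
	# Print the column headings once.
	BEGIN {
		printf("%-14s %-65s %-41s %-8s %-18s %-18s %-16s %-16s\n",
		"System Number","Job Name","Target Machine","Status","Actual Start Date",
		"Actual Start Time","Actual End Date","Actual End Time") }
	# Job row (line starts with a capital letter). The 3-argument match() is a
	# gawk extension: it resets s and stores the first digit run of $1 in s[1].
	$0~/^[A-Z]/ {
		match($1,/([0-9]+)/,s); s[2]=$1 }
	# Run start: keep the first machine seen for this job, plus start date/time.
	$1=="STARTING" {
		if(!s[3]) s[3]=$8; s[5]=$2; s[6]=$3 }
	# Run end: print one output row.
	$1~/^(SUCCESS|FAILURE)$/ {
		printf("%-14s %-65s %-41s %-8s %-18s %-18s %-16s %-16s\n",
		s[1], s[2], s[3], $1, s[5], s[6], $2, $3) }
' your_file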
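
One thing to check: the original post asks for times from the ProcessTime column, while fields $2 and $3 are the event Time column. With default whitespace splitting, ProcessTime appears to land in $6 and $7 on the run rows in the sample. The sketch below just prints both pairs side by side; the helper name show_times is hypothetical, and it assumes both columns are filled in on the row.

```shell
# Print status, event Time (fields 2-3) and ProcessTime (fields 6-7)
# for run-end rows, to show where each column lands after whitespace splitting.
show_times() {
    awk '$1 ~ /^(SUCCESS|FAILURE)$/ {
        print $1, "time=" $2 " " $3, "process=" $6 " " $7
    }' "$@"
}

show_times <<'EOF'
  SUCCESS         03/12/2014 17:31:46    1  PD  03/12/2014 17:31:47
EOF
```

If ProcessTime is what is wanted, swapping $2/$3 for $6/$7 in the STARTING and SUCCESS/FAILURE rules would be the corresponding change.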

Thanks! I just ran the script against the data below. I have many of these jobs in one file, and every job has a different job name that I want to grab, even if the job did not run. It is my fault for not mentioning this in the original post. The script only picks up the first job name it sees and repeats it for every entry, and I am trying to modify that.

B3709BC_GCFCT_MONTHLY_tpabbtu1_D                                 03/12/2014 09:13:23  03/13/2014 00:43:10  FA 54759595/1 1  

  Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
  --------------  --------------------- --  --  --------------------- ----------------------------------------
  RUNNING         03/12/2014 09:13:23    1  PD  03/12/2014 09:13:24    
  FAILURE         03/13/2014 00:43:10    1  PD  03/13/2014 00:43:11   
  [STARTJOB]      03/26/2014 18:45:00    0  UP                        
    <Event was Scheduled based on Job Definition.>

 B3709CC_GCFCT_MONTHLY_VALIDATION_tpabbtu1_D                     03/12/2014 10:59:52  03/12/2014 11:01:11  SU 54759595/1    

   Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
   --------------  --------------------- --  --  --------------------- ----------------------------------------
   [FORCE_STARTJOB]  03/12/2014 10:59:46    0  PD  03/12/2014 10:59:46   
     < >
   STARTING        03/12/2014 10:59:46    1  PD  03/12/2014 10:59:46   machine.enviorment.net
   RUNNING         03/12/2014 10:59:52    1  PD  03/12/2014 10:59:52    machine.enviorment.net
   SUCCESS         03/12/2014 11:01:11    1  PD  03/12/2014 11:01:11   

 B3709CC_GCFCT_Monthly_LKUP_Creation_tpabbtu1_D                  03/12/2014 10:24:43  03/12/2014 10:27:57  SU 54759595/1    

   Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
   --------------  --------------------- --  --  --------------------- ----------------------------------------
   [FORCE_STARTJOB]  03/12/2014 10:24:37    0  PD  03/12/2014 10:24:37   
     < >
   STARTING        03/12/2014 10:24:37    1  PD  03/12/2014 10:24:38  machine.enviorment.net
   RUNNING         03/12/2014 10:24:43    1  PD  03/12/2014 10:24:44   machine.enviorment.net
   SUCCESS         03/12/2014 10:27:57    1  PD  03/12/2014 10:27:58   

 B3709CC_GCFCT_IP_Target_Load_tpabbtu1_D                         04/11/2013 15:42:10  04/11/2013 15:45:31  IN 39115173/0    

   Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
   --------------  --------------------- --  --  --------------------- ----------------------------------------

 B3709CC_GCFCT_ERROR_PROCESSING_tpabbtu1_D                       04/11/2013 15:45:41  04/11/2013 16:45:42  IN 39115173/0    

   Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
   --------------  --------------------- --  --  --------------------- ----------------------------------------

output:

System Number  Job Name                                                          Target Machine                            Status   Actual Start Date  Actual Start Time  Actual End Date  Actual End Time 
3709           B3709BC_GCFCT_MONTHLY_tpabbtu1_D                                                                            FAILURE                                        03/13/2014       00:43:10        
3709           B3709BC_GCFCT_MONTHLY_tpabbtu1_D                                  machine.enviorment.net     SUCCESS  03/12/2014         10:59:46           03/12/2014       11:01:11        
3709           B3709BC_GCFCT_MONTHLY_tpabbtu1_D                                  machine.enviorment.net     SUCCESS  03/12/2014         10:24:37           03/12/2014       10:27:57   

Targeted output:

System Number  Job Name                                                          Target Machine                            Status   Actual Start Date  Actual Start Time  Actual End Date  Actual End Time 
3709           B3709BC_GCFCT_MONTHLY_tpabbtu1_D                                                                            FAILURE                                        03/13/2014       00:43:10        
3709           B3709CC_GCFCT_MONTHLY_VALIDATION_tpabbtu1_D                     machine.enviorment.net     SUCCESS  03/12/2014         10:59:46           03/12/2014       11:01:11        
3709           B3709CC_GCFCT_Monthly_LKUP_Creation_tpabbtu1_D                  machine.enviorment.net     SUCCESS  03/12/2014         10:24:37           03/12/2014       10:27:57        
3709           B3709CC_GCFCT_IP_Target_Load_tpabbtu1_
3709           B3709CC_GCFCT_ERROR_PROCESSING_tpabbtu1_D

Thank you thus far, jethrow!

Answered my own question: just add a " ?" after the "^" in the pattern, so an optional leading space is allowed. And since every job name starts with a B, I edited the string-matching case.

gawk '
	BEGIN {
		printf("%s,%s,%s,%s,%s,%s,%s,%s\n",
		"System Number","Job Name","Target Machine","Status","Actual Start Date",
		"Actual Start Time","Actual End Date","Actual End Time") }
	/^ ?B/ {	# job row: optional leading space, then a name starting with B
		match($1,/([0-9]+)/,s); s[2]=$1 }
	$1=="STARTING" {
		if(!s[3]) s[3]=$8; s[5]=$2; s[6]=$3 }
	$1~/^(SUCCESS|FAILURE)$/ {
		printf("%s,%s,%s,%s,%s,%s,%s,%s\n",
		s[1], s[2], s[3], $1, s[5], s[6], $2, $3) }
' your_file
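
The targeted output above also lists jobs that never ran, which a script that prints only on SUCCESS/FAILURE lines will not produce. Below is a rough sketch of one way to add those rows, written for POSIX awk rather than gawk. The names report_jobs and flush_job are hypothetical, job names are assumed to start with B followed by digits, and unlike the script above it clears the machine and start time after each finished run.

```shell
# Hypothetical variant: CSV rows for finished runs, plus one bare row
# for every job that has no SUCCESS/FAILURE line at all.
report_jobs() {
    awk '
        function flush_job() {
            # Emit an empty row when the previous job produced no output.
            if (job != "" && rows == 0)
                printf("%s,%s,,,,,,\n", sys, job)
        }
        $1 ~ /^B[0-9]+/ {            # job row; leading spaces do not affect $1
            flush_job()
            job = $1; rows = 0
            sys = ""; mach = ""; sdate = ""; stime = ""
            if (match(job, /[0-9]+/)) sys = substr(job, RSTART, RLENGTH)
            next
        }
        $1 == "STARTING" {           # remember where and when the run began
            mach = $8; sdate = $2; stime = $3
        }
        $1 ~ /^(SUCCESS|FAILURE)$/ { # run finished: print it, then reset run state
            printf("%s,%s,%s,%s,%s,%s,%s,%s\n", sys, job, mach, $1, sdate, stime, $2, $3)
            rows++
            mach = ""; sdate = ""; stime = ""
        }
        END { flush_job() }          # the last job in the file may also be empty
    ' "$@"
}

report_jobs <<'EOF'
 B3709CC_GCFCT_MONTHLY_VALIDATION_tpabbtu1_D   03/12/2014 10:59:52  03/12/2014 11:01:11  SU 54759595/1
   STARTING        03/12/2014 10:59:46    1  PD  03/12/2014 10:59:46   machine.enviorment.net
   SUCCESS         03/12/2014 11:01:11    1  PD  03/12/2014 11:01:11
 B3709CC_GCFCT_IP_Target_Load_tpabbtu1_D       04/11/2013 15:42:10  04/11/2013 15:45:31  IN 39115173/0
EOF
```

Because the job rule tests $1 rather than the start of the line, it should not matter whether a job row begins with a space.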