AWK pattern matching, first and last

smb_uk · December 21, 2007, 2:17pm

In a nutshell, I need to work out how to return the last match of an awk /start/,/end/ range pattern. I can bring back the first, but am unsure how to obtain the last, and a simple tail won't work as the match could span multiple lines.

Secondly, I would like some way of pattern matching within an already matched section. The program should work by returning the first section matching /*H,H## from a file, and then performing multiple searches for other tags within it (#P,P# #C,C#). I did think of returning the /*H pattern into a variable and searching that, but that converts it to a string without line breaks (I think), and I want to maintain the format of the input.

The code below works in part (albeit not as described above, and the last #M,M# is handled incorrectly), but there must be a much more efficient way to do it, as it's searching the full file every time.
Any general advice on optimising the code would also be useful, as I appreciate there's a lot of piping going on.

The current code, input and output follow:

#!/bin/ksh 

for file in `find . -type f -name "*.code"` 
do 
  echo ">>>>>>> PROGRAM: " $file 
  echo ================================================================================ 
  awk '/#P/ && ++m==1,/P#/ {print $0}' $file | sed 's/^#[A-Z#] //g' | sed 's/ [A-Z#]#$//g' 
  awk '/#A/ && ++m==1,/A#/ {print $0}' $file | sed 's/^#[A-Z#] //g' | sed 's/ [A-Z#]#$//g' 
  awk '/#C/ && ++m==1,/C#/ {print $0}' $file | sed 's/^#[A-Z#] //g' | sed 's/ [A-Z#]#$//g' 
  awk '/#I/ && ++m==1,/I#/ {print $0}' $file | sed 's/^#[A-Z#] //g' | sed 's/ [A-Z#]#$//g' 
  echo -------------------------------------------------------------------------------- 
  awk '/#D/ && ++m==1,/D#/ {print $0}' $file | sed 's/^#[A-Z#] //g' | sed 's/ [A-Z#]#$//g' 
  awk '/#M/,/M#/ {print $0}' $file | tail -1 | sed 's/^#[A-Z#] //g' | sed 's/ [A-Z#]#$//g' 
  echo 
done

/*H#############################################################################
##                                                                            ## 
#P Purpose : Creates generation datasets from the original source file        P# 
##                                                                            ## 
#A Author  : D Turpin                                                         A# 
#C Date    : 1st November 2007                                                C# 
##                                                                            ## 
#I Inputs  : data.table1                                                      ## 
##           data.table2                                                      I# 
#O Outputs : data.table3                                                      O# 
##                                                                            ## 
################################################################################ 
## Change History                                                             ## 
#D Who When       Why                                                Version  D# 
##                                                                            ## 
#M DT  01.11.2007 Initial Development                                1.00     M# 
#M DT  15.11.2007 Modified to include new reqs                       1.01     ## 
##                and other things                                            M# 
##                                                                            ## 
#############################################################################H##
*...+....1....+....2....+....3....+....4....+....5....+....6....+....7....+...*/

>>>>>>> PROGRAM:  ./header.code 
================================================================================ 
Purpose : Creates generation datasets from the original source file 
Author  : D Turpin 
Date    : 1st November 2007 
Inputs  : data.table1 
          data.table2 
-------------------------------------------------------------------------------- 
Who When       Why                                                Version 
               and other things
--------------------------------------------------------------------------------

shamrock · December 21, 2007, 5:06pm

#!/bin/ksh 

for file in `find . -type f -name "*.code"` 
do
   echo ">>>>>>> PROGRAM: " $file 
   echo ================================================================================
   awk '/^#P/ || /P#/ {gsub("#P|P#|##",""); print}
        /^#A/ || /A#/ {gsub("#A|A#|##",""); print}
        /^#C/ || /C#/ {gsub("#C|C#|##",""); print}
        /^#I/ || /I#/ {gsub("#I|I#|##",""); print}
        /^#D/ || /D#/ {gsub("#D|D#|##",""); print}
        /^#M/ || /M#/ {gsub("#M|M#|##",""); print}' $file
done

smb_uk · December 22, 2007, 7:11am

That's much cleaner, and I assume quicker as it's doing a single scan. I can't test this at the moment, but I assume this is addressing the single pass, and the clean, and not the returning of either first or last?
The main problem remaining is the ability to return the last #M,M# (or nth if possible for future reference), as I want to show just the last modification to a file.

Furthermore, is there any way to confine the searching to the top section of the file, between the /*H and H##? If there are many large files to document, it would make it much quicker.

Thanks for the help so far.

Franklin52 · December 22, 2007, 8:22am

This should give the desired output without trailing spaces:

#!/bin/ksh

for file in `find . -type f -name "*.code"` 
do
  echo ">>>> Program :" $file
  echo "========================================================================="
  sed -n '/\#P /,/H..$/p' $file |
  sed 's/.*###.*/-------------------------------------------------------------------------/
  s/^#. //g
  s/.#$//g
  s/[ \t]*$//g
  /^$/d'
done

Regards

smb_uk · December 22, 2007, 9:16am

This solution loses the ability to handle the tags in a custom manner. It simply displays what's already there in a different format, although maybe I should have pointed that requirement out earlier.

Also, I still have the issue of not being able to return only the last #M,M# tag.

In summary, what I want to do is get the /*H, H## section, and individually retrieve tags within this block, choosing all, the first or the last (or nth/-nth if possible) of the tag (e.g. only the modification on the 15th).
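
One possible way to pick the first, last or nth occurrence of a tag is to collect each #X,X# range as a single multi-line entry and index into that collection at the end. This is only a minimal sketch: the sample file, the pick_tag function and the variables tag, n, blk and cnt are made up for illustration, and a negative n is assumed to count from the end.

```sh
#!/bin/sh
# Hypothetical sample header; only the lines between /*H and H## should count.
f=${TMPDIR:-/tmp}/sample.$$.code
cat > "$f" <<'EOF'
/*H####
#M DT  01.11.2007 Initial Development  1.00  M#
#M DT  15.11.2007 Modified to include new reqs  1.01  ##
##     and other things  M#
####H##
#M this line is outside the header and is never read  M#
EOF

# usage: pick_tag TAG N FILE   (N = 1 is the first block, N = -1 the last)
pick_tag() {
  awk -v tag="$1" -v n="$2" '
    /H## *$/ { exit }                     # end of header: jump straight to END
    !inblk && index($0, "#" tag) == 1 {   # a line starting with #TAG opens a block
      inblk = 1; cnt++
    }
    inblk {
      line = $0
      closing = (line ~ (tag "# *$"))     # TAG# at end of line closes the block
      sub(/^#[A-Z#] /, "", line)          # strip the leading marker
      sub(/ *[A-Z#]# *$/, "", line)       # strip the trailing marker
      blk[cnt] = (blk[cnt] == "" ? line : blk[cnt] "\n" line)
      if (closing) inblk = 0
    }
    END {
      if (n < 0) n = cnt + n + 1          # negative n counts from the end
      if (n >= 1 && n <= cnt) print blk[n]
    }
  ' "$3"
}

first=$(pick_tag M 1 "$f")
last=$(pick_tag M -1 "$f")
printf '%s\n' "$last"
rm -f "$f"
```

Because each block is stored with its line breaks intact, the multi-line modification on the 15th comes back as two lines rather than just its final line.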

fpmurphy · December 22, 2007, 11:16am

Why not simply use CVS or some other version control software?

ghostdog74 · December 22, 2007, 12:25pm

awk '
/H##$/{exit}
/#M/,/M#$/ { 
  # assuming Who, When, Why structure 
  # and 3rd field is the "When" column
  if ( $3 ~ /[0-3][0-9]\.[01][0-9]\.20[0-9][0-9]/) { 
       lastmod = $3      
  }  
}
/#[PACIDM#]/{ 
    gsub(/#[PACIDM#]|[PACIDM#]#/,"")  
}
/^###*|*\.\.|\/*H|[oO]utputs/{next}
1
END {
  print "Last modified: " lastmod
}
' *code

output:

# ./test.sh

 Purpose : Creates generation datasets from the original source file

 Author  : D Turpin
 Date    : 1st November 2007

 Inputs  : data.table1
           data.table2


 Who When       Why                                                Version

 DT  01.11.2007 Initial Development                                1.00
 DT  15.11.2007 Modified to include new reqs                       1.01
                and other things

Last modified: 15.11.2007

shamrock · December 22, 2007, 2:03pm

So out of all the #M,M# tags present you want to show just the last one, as it is the last modification made to the file?

#!/bin/ksh

for file in `find . -type f -name "*.code"`
do
   echo ">>>>>>> PROGRAM: " $file
   echo ================================================================================
   awk '/^#P/ || /P#/ { gsub("#P|P#|##",""); print }
        /^#A/ || /A#/ { gsub("#A|A#|##",""); print }
        /^#C/ || /C#/ { gsub("#C|C#|##",""); print }
        /^#I/ || /I#/ { gsub("#I|I#|##",""); print }
        /^#D/ || /D#/ { gsub("#D|D#|##",""); print }
        /^#M/ || /M#/ { gsub("#M|M#|##",""); mtag[M] = $0 }
        END { print mtag[M] }' $file
done

smb_uk · December 23, 2007, 4:02am

I'm not going to get a chance to play with this until Thursday now, but it looks as though I've all the elements I need now from the many replies, so thanks all for the input.

smb_uk · December 27, 2007, 2:33pm

Just thought I'd let you know what I've ended up with. It's ugly, but it does what I want. I reckon there are probably many ways to make it more elegant, as it's really only a few lines of code repeated. This way I capture all the individual elements, and can therefore present them flexibly.

The problem with all the other suggestions was that they didn't capture the last modification tag #M,M# where it was multi-line, but the examples put me on the right track.

Suggestions always welcome.

#!/bin/ksh 

for file in `find . -type f -name "*.code"` 
do 

  awk ' 
    /H##$/{exit} 

    /^#M/ { # set to zero on each new tag to only retain the last 
      mctr=0 
    } 

    /^#P/,/P#$/{ 
      gsub(/^#[PACIDMO#] | *[PACIDMO#]#$/,"") 
      ptag[++pctr]=$0 
    } 
    /^#A/,/A#$/{ 
      gsub(/^#[PACIDMO#] | *[PACIDMO#]#$/,"") 
      atag[++actr]=$0 
    } 
    /^#C/,/C#$/{ 
      gsub(/^#[PACIDMO#] | *[PACIDMO#]#$/,"") 
      ctag[++cctr]=$0 
    } 
    /^#I/,/I#$/{ 
      gsub(/^#[PACIDMO#] | *[PACIDMO#]#$/,"") 
      itag[++ictr]=$0 
    } 
    /^#D/,/D#$/{ 
      gsub(/^#[PACIDMO#] | *[PACIDMO#]#$/,"") 
      dtag[++dctr]=$0 
    } 
    /^#M/,/M#$/{ 
      gsub(/^#[PACIDMO#] | *[PACIDMO#]#$/,"") 
      mtag[++mctr]=$0 
    } 
    /^#O/,/O#$/{ 
      gsub(/^#[PACIDMO#] | *[PACIDMO#]#$/,"") 
      otag[++octr]=$0 
    } 

    0

    END { 
      print ">>>>>>>>> PROGRAM : " FILENAME 
      print "================================================================================" 
      for (i=1; i<=pctr; i++) 
        print ptag[i] 
      for (i=1; i<=actr; i++) 
        print atag[i] 
      for (i=1; i<=cctr; i++) 
        print ctag[i] 
      for (i=1; i<=ictr; i++) 
        print itag[i] 
      for (i=1; i<=octr; i++) 
        print otag[i] 
      print "--------------------------------------------------------------------------------" 
      for (i=1; i<=dctr; i++) 
        print dtag[i] 
      for (i=1; i<=mctr; i++) 
        print mtag[i] 
    } 
  ' $file 

  print 

done
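
Since the script above is the same few lines repeated per tag, the rules could probably be collapsed into one generic rule that remembers which tag is currently open. A rough sketch under that assumption follows; the sample file and the names cur, n, text and order are illustrative and not taken from the script above, and the separator is shortened to four dashes.

```sh
#!/bin/sh
# Hypothetical sample header for the sketch.
f=${TMPDIR:-/tmp}/header.$$.code
cat > "$f" <<'EOF'
/*H######
#P Purpose : demo  P#
#I Inputs  : data.table1  ##
##           data.table2  I#
#D Who When       Why  D#
#M DT  01.11.2007 Initial Development  M#
#M DT  15.11.2007 Modified  ##
##                and other things  M#
######H##
EOF

out=$(awk '
  /H## *$/ { exit }                      # stop reading at the end of the header
  cur == "" && /^#[PACIDMO]/ {           # a tag opens: remember its letter
    cur = substr($0, 2, 1)
    if (cur == "M") n["M"] = 0           # restart M so only the last block is kept
  }
  cur != "" {
    line = $0
    closes = (line ~ (cur "# *$"))       # matching closing marker at end of line
    sub(/^#[PACIDMO#] /, "", line)
    sub(/ *[PACIDMO#]# *$/, "", line)
    text[cur, ++n[cur]] = line           # one array, keyed by tag letter and index
    if (closes) cur = ""
  }
  END {
    cnt = split("P A C I O - D M", order, " ")   # print order, "-" is the separator
    for (k = 1; k <= cnt; k++) {
      t = order[k]
      if (t == "-") { print "----"; continue }
      for (i = 1; i <= n[t]; i++) print text[t, i]
    }
  }
' "$f")

printf '%s\n' "$out"
rm -f "$f"
```

Adding a new tag then only means extending the character classes and the order string, rather than copying another rule and another print loop.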

shamrock · December 27, 2007, 8:03pm

The latest suggestion is shorter and contains the search to within the top section of the file.

#!/bin/ksh

for file in `find . -type f -name "file"`
do

  BLOCK=$(sed -n '/\/\*H/,/H##/p' $file)

  # awk reads from a pipe here, so FILENAME does not hold the file name; pass it in as fname
  echo "$BLOCK" | awk -v fname="$file" '(/^#P/||/P#/) || (/^#A/||/A#/) || (/^#C/||/C#/) || (/^#I/||/I#/) || (/^#O/||/O#/) {
                           gsub(/^#[PACIO#] | *[PACIO#]# $/,"")
                           tag[++ctr]=$0
                       }
                       (/^#D/||/D#/) {
                           gsub(/^#[D#] | *[D#]# $/,"")
                           dtag[++dctr]=$0
                       }
                       /^#M/||/M#/ {
                           gsub(/^#[M#] | *[M#]# $/,"")
                           mtag[m] = $0
                       } END {
                           print ">>>>>>>>> PROGRAM : " fname
                           print "================================================================================"
                           for (i=1; i<=ctr; i++)
                             print tag[i]
                           print "--------------------------------------------------------------------------------"
                           for (i=1; i<=dctr; i++)
                             print dtag[i]
                           print mtag[m]
                       }'

done
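
As a small variation on confining the search to the top section, sed can be told to quit as soon as it has printed the closing H## line, so the rest of a large file is never read. A minimal sketch, assuming the header opens with /*H and closes with a line ending in H##; the sample file is made up for illustration.

```sh
#!/bin/sh
# Hypothetical sample file: a short header followed by body text.
f=${TMPDIR:-/tmp}/big.$$.code
cat > "$f" <<'EOF'
/*H####
#P Purpose : demo  P#
####H##
#P not part of the header  P#
EOF

# Print the /*H .. H## range, then quit on the closing line.
block=$(sed -n -e '/\/\*H/,/H## *$/p' -e '/H## *$/q' "$f")

printf '%s\n' "$block"
rm -f "$f"
```

Note that the slash and the asterisk both need escaping in the start address; an unescaped `/*` inside the regex would mean "zero or more slashes" and so match any line containing an H.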