A Stanza File Parser in Pure ksh

bakunin · January 3, 2019, 7:03pm

As it was ultimately Don Cragun's idea that saved the whole project, I can just as well give something back to the community. This is my stanza file parser, which is written using only ksh, without any external programs.

The stanza structure
There is some inconsistency as to what exactly is meant by "stanza". I use it in the sense IBM uses it, as in the file /etc/filesystems. It encodes tabular data in the following form:

tablename1:
     column-name1=value1
     column-name2=value2
     column-name3=value3
     .....

tablename2:
     column-name1=value1
     column-name2=value2
     column-name3=value3
     .....

String values can be quoted to enclose white space like this:

tablename:
     string="value"
     .....

I also added support for inline as well as whole-line comments, which work the same way as in the shell: everything after an unquoted and unescaped comment sign ("#") up to the end of the line is a comment. Notice that escaping needs to be done with a double backslash:

tablename:
     item=StringWith\\#EscapedChar
     .....
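The quoting, escaping and comment rules above can be tried out in isolation. The following is only a minimal sketch of those rules, not the parser itself: the function name scan_line is made up for illustration, escaped quote characters are not handled, and blanks are not trimmed. It uses nothing but shell builtins, so it should run in ksh as well as in other POSIX-style shells.

```shell
# scan_line LINE   (hypothetical helper, for illustration only)
# Print LINE with the comment removed, quotes dropped and escapes resolved.
# The body is a subshell "( ... )" so that its variables stay local.
scan_line ()
(
    rest=$1 ; out="" ; in_quotes=0 ; esc=0
    while [ -n "$rest" ] ; do
        ch=$(printf '%.1s' "$rest")         # first character of the rest
        rest=${rest#?}                      # remove it from the rest
        case $ch in
            \\)                             # backslash: escape what follows
                esc=1
                continue
                ;;
            \")                             # quote: toggle the quoting state
                esc=0
                in_quotes=$(( 1 - in_quotes ))
                continue
                ;;
            \#)                             # unquoted, unescaped: comment starts
                if [ "$esc" -eq 0 ] && [ "$in_quotes" -eq 0 ] ; then
                    break
                fi
                ;;
        esac
        esc=0
        out="$out$ch"
    done
    printf '%s\n' "$out"
)

scan_line 'item=StringWith\\#EscapedChar'
scan_line 'string="value # kept" # dropped'
```

The first call keeps the escaped "#" as a literal character, the second one keeps the quoted "#" and cuts the line at the unquoted one.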

The Parser
The parser is implemented as a shell function. It takes the following parameters:

1) the input file to parse

2) the name of a handler function. This function is called for every successfully parsed stanza and is passed three parameters: the table name and the names of two arrays, one containing the column names and the other the respective values. Both arrays are 1-based.

3) An optional flag (0 or 1) to switch "double-checking" off or on. Consider the following stanza where "item2" is mentioned twice:

table:
     item1=value1
     item2=value2
     item2=value3

This flag determines whether such a condition is considered to be wrong (a warning is issued and the duplicate line is dropped) or not.

4) Optionally, the name of an array which contains table/column name combinations along with "data types", separated by colons (":"). If such an array is passed, the listed table/column combinations are checked for the value being of the correct data type. Notice that I use "data type" rather loosely here. Right now the allowed types are:

"file" - name of an existing file
"int" - an integer
"num" - any numeric (float, only decimal notation)
"char" - any string

I plan to add more "data types" (e.g. a valid file name of a non-existing file) in a later revision. If you have ideas, just tell me. Here is a sample array definition and the subsequent call to the parser function:

typeset    aTypes[1]="table1:field1:int"
typeset    aTypes[2]="table1:field2:num"
typeset    aTypes[3]="table1:field3:file"
typeset    aTypes[4]="table1:field4:char"
typeset -i lDblCheck=0

f_ProcessStanza /input/file pHandler $lDblCheck "aTypes"
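To give an idea of what the four types amount to, here is a rough sketch of such checks. The parser itself calls f_CheckInteger() and f_CheckNumeric() from my library, which are not shown in this post, so the function check_type below is a made-up stand-in and its rules (an optional sign, digits, at most one decimal point) are assumptions, not necessarily what the library does.

```shell
# check_type TYPE VALUE   (hypothetical stand-in, not the library code)
# Returns 0 if VALUE is acceptable for TYPE, 1 if not, 2 for an unknown type.
check_type ()
{
    case $1 in
        char)                               # practically everything passes
            return 0
            ;;
        int)                                # optional sign, then digits only
            case ${2#[-+]} in
                ''|*[!0-9]*) return 1 ;;
            esac
            return 0
            ;;
        num)                                # digits with at most one "."
            case ${2#[-+]} in
                ''|.|*[!0-9.]*|*.*.*) return 1 ;;
            esac
            return 0
            ;;
        file)                               # must be an existing regular file
            [ -f "$2" ]
            ;;
        *)
            return 2
            ;;
    esac
}

check_type int 25   && printf '%s\n' "25 is an int"
check_type int 25.1 || printf '%s\n' "25.1 is not an int"
```

Only pattern matching and the test builtin are used, which fits the "no external programs" idea of the parser.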

The Handler Function
The function name passed to the parser must refer to a valid function in your script. That function is called for every successfully parsed stanza entry. Here is a template for what the function could look like:

#------------------------------------------ function pHandler
pHandler()
{
typeset -i iCnt=0
typeset    chTable=$1                      # records name
#typeset    aField                         # given by REFERENCE and
#typeset    aValue                         # holding field names/values


(( iCnt = 1 ))                             # extract record
while [ $iCnt -le $(eval print \${#$2[*]}) ] ; do
    typeset aField[$iCnt]=$(eval print \${$2[$iCnt]})
    typeset aValue[$iCnt]=$(eval print \${$3[$iCnt]})
    (( iCnt += 1 ))
done

case $chTable in
          .
          .
          .
          .   # process the different table types here
          .

     *)
          f_die 1 "error: unknown stanza entry $chTable"
          ;;

esac

return 0
}
#------------------------------------------ function pHandler
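As a variation on the template, the two arrays can also be reached through name references instead of eval. This is only a sketch and assumes a shell where "typeset -n" is available (ksh93, or bash from 4.3 on). The names pShowRecord, aDemoField and aDemoValue are made up for the example. The reference variables carry a prefix so that they cannot collide with the names of the arrays passed in.

```shell
# pShowRecord TABLE FIELDARRAY VALUEARRAY   (hypothetical handler)
# Prints one record, reading both arrays through name references.
function pShowRecord {
    typeset -n sr_aField=$2 sr_aValue=$3    # refer to the caller's arrays
    typeset -i sr_iCnt=1                    # the arrays are 1-based

    printf '%s:\n' "$1"
    while [ "$sr_iCnt" -le "${#sr_aField[@]}" ] ; do
        printf '     %s=%s\n' "${sr_aField[sr_iCnt]}" "${sr_aValue[sr_iCnt]}"
        (( sr_iCnt += 1 ))
    done
    return 0
}

# two 1-based arrays, filled the way the parser would pass them
aDemoField[1]="name" ; aDemoValue[1]="Smith"
aDemoField[2]="age"  ; aDemoValue[2]="25"
pShowRecord address aDemoField aDemoValue
```

Called like this, the handler prints the record back in stanza form. The eval-based template above remains the version that does not depend on name references.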

Compatibility Issues

I have tested the parser only with ksh93, but I see no reason why it shouldn't work with ksh88 too. Test it thoroughly before depending on it, though. Getting the whole thing to work in bash will probably need some changes. The code uses some functions from my library; their names should make their purpose pretty obvious. If there is interest, I can provide the rest of the library.
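One spot that would certainly need attention in bash is the way the type array entries are split: the parser pipes each entry into "read", which works in ksh because ksh runs the last element of a pipeline in the current shell, whereas bash by default runs it in a subshell, so the variables would be lost. Also, "print" is a ksh builtin that bash does not have. A possible way around both, sketched here with made-up names (split_type_entry and the te_ variables), is plain parameter expansion, which behaves the same in both shells:

```shell
# split_type_entry ENTRY   (hypothetical helper)
# Splits "table:field:type" into te_table, te_field and te_type
# without a pipeline, so it does not depend on where "read" runs.
split_type_entry ()
{
    te_table=${1%%:*}                       # everything before the first ":"
    te_rest=${1#*:}                         # everything after the first ":"
    te_field=${te_rest%%:*}                 # second part
    te_type=${te_rest#*:}                   # third part
}

split_type_entry "address:age:int"
printf '%s / %s / %s\n' "$te_table" "$te_field" "$te_type"
```

This is not a tested port, just an illustration of the kind of change to expect.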

The Code

# ------------------------------------------------------------------------------
# f_ProcessStanza                           parsing and processing stanza files
# ------------------------------------------------------------------------------
# Author.....: Wolf Machowitsch
# last update: 2018 12 10    by: Wolf Machowitsch
# ------------------------------------------------------------------------------
# Revision Log:
# - 0.99   1999 03 16   Original Creation
#                       Revision description
#
# - 1.00   1999 07 26   Production Release
#                       elaborated on the documentation, straightened out
#                       some minor glitches
#
# - 1.01   2000 01 01   Minor Revision
#                       straightened out faulty metacharacter expansion
#                       when content of $chLine resembles "*" and clarified
#                       the documentation
#
# - 2.00   2017 06 12   Major Update
#                       - new way of parsing implemented
#                       - makeover of old code
#                       - changed double-checking to be optional
#
# - 2.01   2018 12 10   type-checking
#                       - typechecking implemented, types are
#                         char  (practically everything)
#                         int   (integer)
#                         num   (numeric in any kind)
#                         file  (existing file)
#                         There should be: non-existing file.
#
# ------------------------------------------------------------------------------
# Usage:
#     f_ProcessStanza char File, char *Fnc, [ bool checkdbl, \
#                     [ char *Type[] ] ]
#
#     Reads and parses the stanza file given in $1 and calls a function
#     given in $2 to process the read records.
#
#     Example:
#              f_ProcessStanza /tmp/MyFile MyFunc 0 aTypes
#
#              This causes the file '/tmp/MyFile' to be parsed. After
#              each record the function MyFunc() is called to process
#              the record read.
#
#      Additional checks are being performed after each line and after
#      each stanza is parsed and before the processing function is called.
#      Stanza entries failing these checks lead to an error message.
#
#     Double-checking
#     ===============
#     The third parameter (double-checking, allowed values are 0=off, 1=on)
#     checks for double entries within stanzas. When this check is off,
#     entries like:
#
#              ....
#              address:
#                   name="Smith"
#                   age=25
#                   income=1000.15
#                   income=2000.10
#
#      are allowed while enabling double-checking will lead to an error for
#      the fourth line. The line will be dropped, parsing will continue,
#      however.
#
#      Type-checking
#      =============
#      The types of the lines are typechecked using the array passed in the
#      fourth parameter. It has the structure of "tablename:fieldname:type"
#      and has to have one entry for every field of each record which is to
#      be checked. Field/table combinations not listed are simply not checked.
#
#      Example:
#              ...
#              aTypes[1]="address:name:char"
#              aTypes[2]="address:age:int"
#              aTypes[3]="address:income:num"
#              aTypes[4]="address:notes:file"
#
#              f_ProcessStanza /tmp/MyFile MyFunc 0 aTypes
#              ...
#
#      In the above example only records with a table designator of 'address'
#      are checked. In every table-record the four fields, "name", "age",
#      "income" and "notes" are being typechecked:
#
#              "name"      has to have character content,
#              "age"       has to have integer content,
#              "income"    has to have number (float) content and
#              "notes"     has to contain a path name.
#
#      Example: the record
#
#              address:
#                   name="Smith"
#                   age=25
#                   income=1000.15
#                   notes=/path/to/notes.file
#
#      will be passed, whereas this:
#
#              address:
#                   name="Smith"
#                   age=25.1
#                   income=1000.15
#                   notes=/path/to/notes.file
#
#              will not because the content of "age" is allowed to be
#              of type integer only.
#
# Prerequisites:
# -   This function requires ksh93 to be used.
#
# -   The FPATH environment variable must be set to use this function.
#
# -   functional dependencies: f_CheckInteger(), f_CheckNumeric()
#
# ------------------------------------------------------------------------------
# Documentation:
#     Reads and parses the stanza file given in $1 and calls a function
#     given in $2 to process the read records.
#
#     Optionally a value of 0 or 1 can be passed as third parameter; a value
#     of 1 causes duplicate items in a single record to be dropped.
#
#     If an optional fourth parameter is supplied to f_ProcessStanza() the
#     types of fields read are being typechecked. The format of the fields
#     array entries is 'tablename:fieldname:type' where 'tablename' is the
#     name of the record designator and 'fieldname' is the name of the
#     field allowed in this type of record. Valid types are:
#
#              char: any alphanumerical string
#              int:  integer
#              num:  numerical (rational numbers)
#              file: a valid filename of an existing file (only
#                    'test -f' is used, the handler function will
#                    have to verify the accessability conditions
#                    itself)
#
#     The stanza file can contain comments (like in ksh-scripts introduced
#     by '#', which is masking everything until EOL), a literal '#' has to
#     be escaped by a preceding double backslash ('\\#'). Inside quoted
#     strings comment signs are ignored and used literally too. Trailing
#     and leading blanks are stripped but inside double-quoted strings they
#     are preserved.
#
#     The function referenced by the function pointer given in $2 has to
#     accept three parameters. In $1 the record designator (=table name) is
#     passed. In $2 the name of an array of field names is passed. In $3
#     the name of an array with field values is passed. An example for a
#     processing function could look like:
#
#     #------------------------------------------ function pMain
#     pMain ()
#     {
#     typeset -i iCnt=0
#     typeset    chTable=$1                      # records name
#     #typeset    aField                         # given by REFERENCE and
#     #typeset    aValue                         # holding field names/values
#
#
#     (( iCnt = 1 ))                             # extract record
#     while [ $iCnt -le $(eval print \${#$2[*]}) ] ; do
#         typeset aField[$iCnt]=$(eval print \${$2[$iCnt]})
#         typeset aValue[$iCnt]=$(eval print \${$3[$iCnt]})
#         (( iCnt += 1 ))
#     done
#
#     case $chTable in
#          .
#          .
#          .
#          .   # process the different table types here
#          .
#
#          *)
#               f_die 1 "error: unknown stanza entry $chTable"
#               ;;
#
#     esac
#
#     return 0
#     }
#     #------------------------------------------ function pMain
#
#     The parsing loop
#     ================
#     Notice that the strange formulation of the line parsing loop is
#     necessary because a naive
#
#            while [ -n "$chLine" ] ; do
#                 chChar="${chLine%${chLine#?}}"
#                 chLine="${chLine#?}"
#                 ....
#            done
#     would break with input containing escape chars. See the discussion
#     at https://www.unix.com/shell-programming-and-scripting/\
#               273239-possible-ksh93-bug-expanding-variables.html
#
#
#     Parameters: char file  char *function  [ bool checkdbl  [ char *array ] ]
#     Returns   : 0 on success, >0 otherwise:
#                 1: stanza file not available (not readable/non-existent)
#                 2: format error: no table designator for record
#                 3: format error: misformatted stanza field line
#                 4: field error: unknown field encountered
#                 5: type error: wrong data type for this field
#                 6: other (internal) error: (undetermined, mostly params)
#                 All of these errors will cause processing to
#                 terminate.
# ------------------------------------------------------------------------------
# known bugs:
#
#     none
# ------------------------------------------------------------------------------
# .....................(C) 2017 Wolf Machowitsch ...............................
# ------------------------------------------------------------------------------

f_ProcessStanza ()
{

$chFullDebug

                                                 # internal variables
typeset -i iRetVal=0                             # return value
typeset -x fInFile="$1"                          # name of stanza file
typeset -x pProcessRecord="$2"                   # name of processing function
typeset -x chLine=""                             # buffer for a input file line
typeset -x achTables[1]=""                       # table with table names
typeset -x achFields[1]=""                       # table with field names
typeset -x achTypes[1]=""                        # table with field types
typeset -i lTypCheck=0                           # flag: field type checking
                                                 #    0=disabled
typeset -i lDblCheck=0                           # flag: check for double items
                                                 #    0=disabled (default)
typeset -i lCheckDoubleError=0                   # flag: double item encountered
                                                 #    0=unraised
typeset -i iLinCnt=1                             # line counter f. input file
typeset -i iCnt=0                                # general counter
typeset -x chTmp=""                              # general string
typeset    chMaskString='????????????????????'   # mask for string manipulation

                                                 # variables/flags for parsing
typeset -x chItem=""                             # fields name
typeset -x chValue=""                            # value of field read
typeset -x chTable=""                            # tables name of actual rec.
typeset -i lInQuotes=0                           # flag for being inside quotes
typeset -i lEsc=0                                # flag: this char is escaped
typeset -i lBeforeText=1                         # flag: no text on line so far
typeset -i lInComment=0                          # flag: inside comment
typeset -i lPastEqual=0                          # before/after an "=" sign
typeset    chResult=""                           # parsed part of the line
typeset    chBlankBuffer=""                      # blanks in parsed text go
                                                 # there first to eventually get
                                                 # trimmed
typeset -i iMaxTypeArr=0                         # number of elements in type
                                                 # array for type checking
                                                 # (only if $4 was passed)
[ -n "$4" ] && iMaxTypeArr=$(eval print \${#$4[@]})

                                                 # these are unset/set for
                                                 # every record
# achRecField[*]                                 # array w. field names
# achRecType[*]                                  # array w. data types
# achRecValue[*]                                 # array w. field data

                                                 # initialization phase
if [ -n "$3" ] ; then                            # double-item checking on/off
     if   [ "$3" == "1" ] ; then
          (( lDblCheck = 1 ))
     elif [ "$3" == "0" ] ; then
          :
     else
          f_CmdWarning "ignoring argument ${3}, only specify 0 or 1"
     fi
fi

if [ -n "$4" ] ; then                            # type checking array passed
     lTypCheck=1                                 # enable type checking
     (( iCnt = 1 ))
     while (( iCnt <= iMaxTypeArr )) ; do
          (eval "print - \${$4[$iCnt]}") |\
          IFS=':' read achTables[$iCnt] achFields[$iCnt] achTypes[$iCnt]
          case "${achTypes[$iCnt]}" in
               char|int|num|file)
                    ;;

               *)
                    f_CmdError "unknown type to check for: ${achTypes[$iCnt]}"
                    return 1
                    ;;

          esac
          (( iCnt += 1 ))
     done
fi

if [ ! -r "$fInFile" ] ; then                    # file available ?
     iRetVal=1
     return $iRetVal
fi

(( iLinCnt = 1 ))                                # init line counter
while read -r chLine ; do                        # main parsing loop
     chItem=""
     chValue=""

     lInQuotes=0
     lEsc=0
     lBeforeText=1
     lInComment=0
     lPastEqual=0
     chResult=""
     chBlankBuffer=""
                                                 # double mask if too short
     while [ ${#chLine} -gt ${#chMaskString} ] ; do
          chMaskString="${chMaskString}${chMaskString}"
     done

     # print - "\n-- Begin Line $iLinCnt \"$chLine\""

                                                 # main parsing loop
     while [ -n "$chLine" ] ; do
                                                 # this is to avoid problems
                                                 # with escaped strings, see
                                                 # documentation above
          chChar="${chLine%$(printf '%*.*s' $(( ${#chLine} - 1 )) \
                                            $(( ${#chLine} - 1 )) \
                                            "$chMaskString")}"
          chLine="${chLine#?}"

          # print - "   char: \"$chChar\""
          # print - "   Line: \"$chLine\""

          if (( lInComment )) ; then
               chChar=""
          fi
          case "$chChar" in
               "")
                    ;;

               " "|$'\t')                        # blank or tab
                    if ! (( lBeforeText )) ; then # trimming blanks
                         chBlankBuffer="${chBlankBuffer}${chChar}"
                    fi
                    chChar=""
                    lEsc=0
                    ;;

               \\)
                    chChar=""
                    lEsc=1
                    ;;

               \#)
                    if (( lEsc + lInQuotes )) ; then
                         lEsc=0
                    else
                         chChar=""
                         lInComment=1
                    fi
                    ;;

               \")
                    if (( lInQuotes )) ; then
                         lInQuotes=0
                         chResult="${chResult}${chBlankBuffer}"
                         if (( lPastEqual )) ; then
                              chValue="${chValue}${chBlankBuffer}"
                         else
                              chItem="${chItem}${chBlankBuffer}"
                         fi
                    else
                         if [ -n "$chBlankBuffer" ] ; then
                              chResult="${chResult} "
                              if (( lPastEqual )) ; then
                                   chValue="${chValue} "
                              else
                                   chItem="${chItem} "
                              fi
                         fi
                         lInQuotes=1
                    fi
                    chChar=""
                    chBlankBuffer=""
                    lBeforeText=0
                    ;;

               =)
                    if (( lEsc + lInQuotes )) ; then
                         lEsc=0
                    else
                         chChar=""
                         chBlankBuffer=""
                         lBeforeText=1
                         lPastEqual=1
                    fi
                    ;;

               *)
                    lEsc=0
                    lBeforeText=0
                    if (( lInQuotes )) ; then
                         chResult="${chResult}${chBlankBuffer}"
                         if (( lPastEqual )) ; then
                              chValue="${chValue}${chBlankBuffer}"
                         else
                              chItem="${chItem}${chBlankBuffer}"
                         fi
                    else
                         if [ -n "$chBlankBuffer" ] ; then
                              chResult="${chResult} "
                              if (( lPastEqual )) ; then
                                   chValue="${chValue} "
                              else
                                   chItem="${chItem} "
                              fi
                         fi
                    fi
                    chBlankBuffer=""
                    ;;

          esac
          chResult="${chResult}${chChar}"
          if (( lPastEqual )) ; then
               chValue="${chValue}${chChar}"
          else
               chItem="${chItem}${chChar}"
          fi

     done

     case "$chResult" in
          "")                                    # empty line, ignore
               ;;

          *:)                                    # new record starts
               if (( ${#achRecField[@]} > 0 )) ; then
                    $pProcessRecord "$chTable" achRecField achRecValue
                    (( iCnt = ${#achRecField[@]} ))
                    while (( iCnt >= 0 )) ; do
                         unset achRecField[$iCnt]
                         unset achRecValue[$iCnt]
                         (( iCnt -= 1 ))
                    done
               fi
               chTable="${chResult%:}"
               ;;

          *)                                     # item line
               if [ -n "$chValue" -a -n "$chItem" ] ; then
                    (( lCheckDoubleError = 0 ))  # check for double items
                                                 # lines in already read array
                    if (( lDblCheck )) ; then
                         (( iCnt = 1 ))
                         while [ \( $iCnt -le ${#achRecField[@]} \) -a \
                                 \( lCheckDoubleError -eq 0 \) \
                               ] ; do
                              if [ "$chItem" == "${achRecField[$iCnt]}" ] ; then
                                   f_CmdWarning "Line ${iLinCnt}: double item $chItem"
                                   (( lCheckDoubleError = 1 ))
                              else
                                   (( iCnt += 1 ))
                              fi
                         done
                    fi
                    if ! (( lCheckDoubleError )) ; then
                         typeset achRecField[$((${#achRecField[@]}+1))]="$chItem"
                         typeset achRecValue[$((${#achRecValue[@]}+1))]="$chValue"
                    fi
               else
                    f_CmdError "Line ${iLinCnt} badly formed: \"$chResult\""
               fi
               if (( lTypCheck )) ; then         # type checking enabled
                    (( iCnt = 1 ))
                    while (( iCnt <= ${#achTables[@]} )) ; do
                         if [ "${achTables[$iCnt]}" = "${chTable}"  -a \
                              "${achFields[$iCnt]}" = "$chItem" ] ; then
                              case "${achTypes[$iCnt]}" in
                                   char)
                                        ;;

                                   int)
                                        if ! f_CheckInteger "$chValue" ; then
                                             f_CmdError "type error \"int\" in line $iLinCnt"
                                             return 5
                                        fi
                                        ;;

                                   num)
                                        if ! f_CheckNumeric "$chValue" ; then
                                             f_CmdError "type error \"num\" in line $iLinCnt"
                                             return 5
                                        fi
                                        ;;

                                   file)
                                        if ! [ -f "$chValue" ] ; then
                                             f_CmdError "type error \"file\" in line $iLinCnt"
                                             return 5
                                        fi
                                        ;;

                              esac
                         fi
                         (( iCnt += 1 ))
                    done
               fi
               ;;

     esac

     (( iLinCnt += 1 ))
done < "$fInFile"

if (( ${#achRecField[@]} > 0 )) ; then           # process the last record
     $pProcessRecord "$chTable" achRecField achRecValue
fi

return $iRetVal
}

# --- EOF f_ProcessStanza
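The least obvious part of the code is probably how the main loop cuts the first character off the line. Here is that step taken out of its context as a small sketch. The function name first_char is made up, and the sketch assumes a non-empty line. The idea is to build a pattern of as many "?" as the line has characters minus one, and to strip that pattern from the end of the line, so that the line content itself never ends up inside a pattern.

```shell
# first_char LINE   (hypothetical helper showing the mask technique)
# Prints the first character of a non-empty LINE.
# The body is a subshell "( ... )" so that its variables stay local.
first_char ()
(
    line=$1
    mask='????????????????????'
    while [ ${#line} -gt ${#mask} ] ; do    # double the mask if too short
        mask="$mask$mask"
    done
    n=$(( ${#line} - 1 ))
    tail_pattern=$(printf '%*.*s' "$n" "$n" "$mask")   # n question marks
    printf '%s\n' "${line%$tail_pattern}"   # strip the last n characters
)

first_char 'field1 = bla foo \\# bla'
first_char '\\#EscapedChar'
```

The second call shows the case that matters: a line starting with a backslash still yields exactly that backslash.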

Here is a sample script that shows how to use the function. The handler function only prints the name of the table and the items and their values. It should give an idea about how to use the whole construct, though:

#! /bin/ksh

pTest ()
{
typeset chTbl="$1"
# typeset aFld[]=
# typeset aVal[]=
typeset -i iCnt=1
typeset -i iMax=$( eval print \${#$2[@]} )

print - "--- inside pTest()"
print - "--- $chTbl"
while (( iCnt <= iMax )) ; do
     print - "    " $(eval print \${$2[$iCnt]} = \${$3[$iCnt]})
     (( iCnt += 1 ))
done

return 0
}

# main()
. ~/lib/f_env    # this brings in the library and sets the environment

typeset achTypeCheck[1]="value1:field4:char"
typeset achTypeCheck[2]="value1:field5:num"
typeset achTypeCheck[3]="value1:field6:file"

f_ProcessStanza ./input.stanza pTest 0 achTypeCheck
print - "function returned: $?"

exit 0

Here is a sample input file that shows what is possible:

# test stanza file

value1:
     field4="bla foo # bla"   # comment
     field5=25   # comment
     field6=/home/bakunin/.profile   # this file exists
     # field6=/does/not/exist          # this doesn't. Uncomment to get a parsing error raised

# comments can be placed here
value1:   # or here
     # even here
     field1 = bla foo \\# bla

     # empty lines do not hurt either
     field2        =          "bla foo" # whitespace before/after the assignment operator is ignored, as is unquoted leading/trailing whitespace
field3 = "bla foo" # i usually indent properly for better readability but the parser does not rely on that

I did my best to test the code above. Still, test for yourself before you use it in a critical setting. If you find bugs: PLEASE TELL ME! I'll be glad to weed them out.

bakunin