Shell script to read multiple options from file, line by line

haggismn · June 9, 2012, 3:45pm

Hi all
I have spent half a day trying to create a shell script which reads a configuration file on a line by line basis.

The idea of the file is that each line will contain server information, such as IP address and various port numbers. The line could also be blank (The file is user created). Here is an example

$ cat /tmp/servers
172.18.1.1

172.18.50.1 tcp SSH SSL=8443

The script should ignore any empty lines, obviously. For the line with server 172.18.1.1, default settings should be used as nothing else is specified (Default is UDP mode port 500). For the line 172.18.50.1, the specified settings are that tcp mode is to be used, and SSH and SSL on port 8443 options are also to be used.

This is what I have created, as an example. The final product will obviously do much more, it is the reading in of the lines and variables that I need help with.

#!/bin/sh
cat /tmp/servers | while read SRV ; do
IP=$(grep -o '^[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' $SRV)
if [ ! -z $IP ] ; then
if grep -i "tcp" $SRV ; then TCP=1 ; fi
if grep -i "udp" $SRV ; then UDP=1 ; fi
if ! grep -i "tcp" $SRV ; then UDP=1 ; fi
SSL=$(grep -i -o 'ssl=[0-9]\{1,5\}' $SRV | cut -d = -f2) ; if [ -z $SSL] ; then if grep -i "ssl" $SRV ; then SSL=443 ; fi ; fi
SSH=$(grep -i -o 'ssh=[0-9]\{1,5\}' $SRV | cut -d = -f2) ; if [ -z $SSH] ; then if grep -i "ssh" $SRV ; then SSH=22 ; fi ; fi
HTTP=$(grep -i -o 'http=[0-9]\{1,5\}' $SRV | cut -d = -f2) ; if [ -z $HTTP] ; then if grep -i "http" $SRV ; then HTTP=80 ; fi ; fi 
echo "$IP $SSL $SSH $HTTP" 
fi ; done

The idea here is that each line will be read in. grep will attempt to find a valid IP address in that line, and if it does ( if [ ! -z $IP ] ) then the line will be checked for all possible options. Otherwise if there is no valid IP address on the line it will go to the next line.

I know that I am going wrong somewhere. I cannot use the $SRV string to get any information, although this pasted script simply gives no response. I also believe that using

if [ ! -z $SRV ]

will not work with the servers file where the line contains spaces. Is this true?

Can anyone advise on the commands I should be using to read the lines and gain these variables. Could I use a for loop instead?

Thanks for any help

PS also, is the usage of grep ok? I had used awk before but this seems to work ok and be shorter. Is there any advantage of one over the other? Thanks

bakunin · June 9, 2012, 9:45pm

Ok, let us start at the end, because this is easiest:

Actually neither awk nor grep is recommended. If you call an external program (regardless which one) from the shell you start a fork()-ed process. To create such a process costs an awful lot of time. You might want to read this thread where i learned the same lesson the very hard way.
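
As a rough sketch of what avoiding external programs can look like: the shell's own "case" pattern matching can test a line for a keyword without starting a process. The function name line_has_word is made up for this illustration, and the match is case-sensitive, unlike the "grep -i" calls in the question. Note also that in the original script `grep -i "tcp" $SRV` hands the contents of the line to grep as file names to search, not as text, which is one reason it finds nothing.

```shell
#!/bin/sh
# line_has_word LINE WORD: succeed if WORD appears in LINE as a whole
# blank-separated word. Only shell builtins are used, so nothing is forked.
# (Hypothetical helper, not part of the original script.)
line_has_word() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

line="172.18.50.1 tcp SSH SSL=8443"
if line_has_word "$line" tcp ; then
    echo "tcp requested"
fi
```

Padding the line with a blank on both sides lets the same pattern match a word at the start, in the middle or at the end of the line.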

To come back to your main question: what you need is a "parser". If you really want to indulge in the theory and practice of (recursively) parsing arbitrary languages you might want to read the classic "Dragon Book" ("Compilers: Principles, Techniques, and Tools"; Aho, Sethi, Ullman). It is a fantastic book about an intriguing field, but for your rather limited purposes a very simple approach will suffice.

Let's start with a few thoughts about your input file:

We want empty space, like leading blanks, trailing blanks, empty lines, etc., not to influence the outcome, because this would make for really awkward handling of files which users should prepare. A single misplaced space char, which would be invisible, would prevent correct parsing, so we don't want that.

Well, the best way to prevent space from having any meaning is to remove it prior to even looking at the file. Question: do we have to handle quoted strings? If not, we can even throw all successive whitespace between words out and replace them with a single space. So let us start with a sketch of a script. We use sed for this, because it is only called ONCE for the whole input (replace <spc> and <tab> with literal space/tab chars):

sed 's/^[<spc><tab>]*//
     s/[<spc><tab>]*$//
     /^$/d
     s/[<spc><tab>][<spc><tab>]*/<spc>/g' $chFile |\
while read chLine ; do
     print - "$chLine"
done

The "print" line will later be removed, it is just there to let us see what we do. On to the next part:

We want to have comments, because it is easier for people to be able to comment directly in place what they do instead of having to use separate documents. As configuration files get longer some comments might be practical. We could implement multi-line comments like in C, but this would be overkill, so we settle for the same comments as in shell scripts: everything following a "#" is considered a comment. Now, it might be that "#" is part of a word and we do not want to remove half the word because it could be a comment, therefore we consider "#" to be introducing a comment only if it is either at the start of a line or preceded by a space.

Let us change the sed-statement accordingly, to remove everything we don't need our parser to see:

sed 's/^[<spc><tab>]*//
     s/[<spc><tab>]*$//
     s/[<spc><tab>][<spc><tab>]*/<spc>/g
     s/^#.*$//
     s/<spc>#.*$//
     /^$/d' $chFile |\
while read chLine ; do
     print - "$chLine"
done
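
For anyone who wants to try this without typing literal space and tab characters, here is a minimal sketch of the same clean-up using the POSIX character class [[:blank:]] (space and tab) in place of the <spc>/<tab> placeholders. The function name normalize_config is an assumption made for this example, and printf is used instead of ksh's print so it runs in any POSIX shell.

```shell
#!/bin/sh
# normalize_config: read raw config text on stdin, write cleaned lines to
# stdout. Same steps as above: trim both ends, squeeze blanks, strip
# comments, drop lines that end up empty.
normalize_config() {
    sed -e 's/^[[:blank:]]*//' \
        -e 's/[[:blank:]]*$//' \
        -e 's/[[:blank:]][[:blank:]]*/ /g' \
        -e 's/^#.*$//' \
        -e 's/ #.*$//' \
        -e '/^$/d'
}

# Sample input: padded line with inline comment, empty line, comment-only
# line, and a word that merely contains "#".
printf '  1.2.3.4 \t tcp   # first system\n\n# only a comment\nfoo#bar\n' |
    normalize_config
```

The sample prints "1.2.3.4 tcp" and "foo#bar": the inline comment goes away, while the "#" inside a word is left alone.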

Now that we have taken care of the preliminaries we have to start on the real work: what should our config file look like? Which format do we want to be able to recognize?

We start first with identifying necessary and optional values: the IP-address is obviously mandatory. The port-list is optional and we define a default for that. Then the mode: is it optional and we create a default? Is it mandatory? Is there any other information which should/could come up on a line? You want to consider this first and prepare a list like the following:

Item    Format                          mandatory/optional
----------------------------------------------------------
IP-Adr  fixed format                    mandatory
Ports   list delimited by comma         optional
Mode    "tcp"|"udp" (?)                 optional(?)

Be very thorough with this list, you will see why.

There are three basic layouts for your config file: stanza, delimited file and what i call option-file. The easiest to parse is the option-file, which contains only declarations of the form "identifier=value". For instance, it could look like:

# sample configuration file
IP=1.2.3.4          # this is our system
ports=5,6,7,8         # this is a list of ports
mode=tcp
option=some-value  # some other option

The problem (or advantage?) with this is that it can only contain a single system. You could put all these config files into a directory and cycle through them. For some problems this is a good choice; you decide whether it is in your case. A parser would look like this (i have left out consistency checks to make it easier to follow, we will fill these in later):

for chFile in /path/to/conf/files/* ; do     # the glob yields full paths, so sed can open each file

     chIP=""                     # no default for this
     chPort=500                  # default for port, overwritten if we read it
     chMode="udp"                # default for mode (UDP, as stated in the question), overwritten if we read it

     sed 's/^[<spc><tab>]*//
          s/[<spc><tab>]*$//
          s/[<spc><tab>][<spc><tab>]*/<spc>/g
          s/^#.*$//
          s/<spc>#.*$//
          /^$/d' $chFile |\
     while read chLine ; do
          chField="${chLine%%=*}"       # split into option name and value
          chValue="${chLine#*=}"        # everything after the first "="
          case $chField in
               "IP")
                    # perform consistency check here
                    chIP="$chValue"
                    ;;

               "ports")
                    # perform consistency check here
                    chPort="$chValue"
                    ;;

               "mode")
                    # perform consistency check here
                    chMode="$chValue"
                    ;;

               # extend for other options by adding more branches here

               *)
                    # last, the catch-all for unknown options
                    print -u2 "Unknown option: $chField in file $chFile, line:\n$chLine"
                    ;;

          esac
     done

     # here we have read a whole file and could process the system:
     if [ "$chIP" = "" ] ; then
          print -u2 "Error: no IP specified in $chFile"
     elif [ "$chOtherMandatoryOption" = "" ] ; then
          print -u2 "Error: no <OtherOption> specified in $chFile"
     elif [ <some other KO-criteria for processing the system> ] ; then
          print -u2 "Error: cannot process $chFile because of ..."
     else
          # all checks OK and we finally get to work
          <process system here>
     fi
done
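
One caveat, in case the script ends up being run by a shell other than ksh: the code above relies on variables assigned inside "sed ... | while read" still being set after the loop. AT&T ksh runs the last part of a pipeline in the current shell, so that works there; bash (by default) and dash run the loop in a subshell and the values are gone afterwards. A sketch of a layout that behaves the same everywhere feeds the loop from a here-document instead. The function name parse_option_lines and the output format are illustrative only, the input is assumed to be already cleaned by the sed step, and the UDP default is taken from the question.

```shell
#!/bin/sh
# parse_option_lines TEXT: parse already-normalized "name=value" lines.
# The loop reads from a here-document, not from a pipe, so it runs in the
# current shell and its assignments are still visible after "done".
parse_option_lines() {
    chIP=""            # mandatory, no default
    chPort=500         # default port
    chMode="udp"       # default mode (assumption: UDP, as in the question)
    while IFS= read -r chLine ; do
        chField="${chLine%%=*}"
        chValue="${chLine#*=}"
        case $chField in
            "")    ;;                          # ignore an empty line
            IP)    chIP="$chValue" ;;
            ports) chPort="$chValue" ;;
            mode)  chMode="$chValue" ;;
            *)     printf 'Unknown option: %s\n' "$chField" >&2 ;;
        esac
    done <<EOF
$1
EOF
    if [ -z "$chIP" ] ; then
        printf 'Error: no IP specified\n' >&2
        return 1
    fi
    printf '%s %s %s\n' "$chIP" "$chPort" "$chMode"
}

parse_option_lines "IP=1.2.3.4
mode=tcp"
```

In a real script the here-document body would be a command substitution around the sed call rather than a function argument.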

It is a good idea btw., to put the processing of the system into a separate function and call that instead of doing all the work in one single program. It makes the code more readable and easier to maintain.
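
The consistency checks that are only marked as comments in the parser are natural candidates for such functions too. The following is one possible sketch, not the only way to do it: is_valid_ip and is_valid_port_list are hypothetical names, and the rules (four octets of 0..255, comma-separated ports of 1..65535) are my assumptions about what "consistent" should mean here.

```shell
#!/bin/sh
# is_valid_ip ADDR: succeed if ADDR is four dot-separated numbers 0..255.
is_valid_ip() {
    case $1 in
        *[!0-9.]*|*..*|.*|*.) return 1 ;;   # stray chars or empty octets
    esac
    oldIFS=$IFS ; IFS=.
    set -- $1                               # split on dots into $1..$n
    IFS=$oldIFS
    [ $# -eq 4 ] || return 1
    for octet in "$@" ; do
        [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# is_valid_port_list LIST: succeed if LIST is "n[,n...]" with 1 <= n <= 65535.
is_valid_port_list() {
    case $1 in
        ""|*[!0-9,]*|*,,*|,*|*,) return 1 ;;
    esac
    oldIFS=$IFS ; IFS=,
    set -- $1                               # split on commas
    IFS=$oldIFS
    for port in "$@" ; do
        [ "${#port}" -le 5 ] && [ "$port" -ge 1 ] && [ "$port" -le 65535 ] || return 1
    done
    return 0
}

if is_valid_ip "172.18.50.1" && is_valid_port_list "5,6,7,8" ; then
    echo "record looks sane"
fi
```

Both functions stick to shell builtins, in keeping with the advice about not forking a process per line.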

The next possibility would be the delimited file. It is a table with a certain delimiter character as field separator. Spreadsheet programs use this format for data interchange frequently ("comma-separated file"). It will allow for all the configuration data in a single file, but optional values will have to be left explicitly empty. In the option file you could simply leave out an optional value for which a default exists, not so here. Furthermore you have to decide on a delimiter char which cannot be used in text, unless we want to further complicate matters by introducing escaping:

# sample configuration via a delimited file
# we will use ":" as a delimiter here and the three fields from above
#IP:port1[,port2,..,portN]:[mode]
1.2.3.4:5,6,7,8:tcp    # first system
2.3.4.5::udp           # second system, ports left blank
3.4.5.6::              # third system, all optional fields left blank
...

This file type is relatively easy to parse: we chop off from the start of the line to the next delimiter until we reach the end. Because we have a fixed succession of fields we do not need field names like in the first type, but this also makes it easier for people to make errors by swapping field values, especially as the number of fields grows. This is what a parser could look like:

sed 's/^[<spc><tab>]*//
     s/[<spc><tab>]*$//
     s/[<spc><tab>][<spc><tab>]*/<spc>/g
     s/^#.*$//
     s/<spc>#.*$//
     /^$/d' $chFile |\
while read chLine ; do
     chIP=""                     # no default for this
     chPort=500                  # default for port, overwritten if we read it
     chMode="udp"                # default for mode (UDP, as stated in the question), overwritten if we read it
     chTmpPort=""                # we need one of these for every optional
     chTmpMode=""

                                        # chop off the IP and trim the remainder
     chIP="${chLine%%:*}"               # we use the ":" from the sample file
     # chIP="${chLine%%<delimiter-char>*}"    # the general form
     chLine="${chLine#*:}"
     # chLine="${chLine#*<delimiter-char>}"   # the general form

     # perform IP consistency checks here

                                        # same for ports, an optional parameter
     chTmpPort="${chLine%%:*}"
     chLine="${chLine#*:}"
     if [ "$chTmpPort" != "" ] ; then     
          # perform port consistency check here
          if [ <everything checked out OK> ] ; then
               chPort="$chTmpPort"
          else
               print -u2 "Error: ports $chTmpPort for IP $chIP are not possible."
          fi
     fi
                                        # same again for Mode
     chTmpMode="${chLine%%:*}"
     chLine="${chLine#*:}"
     if [ "$chTmpMode" != "" ] ; then     
          # perform mode consistency check here
          if [ <everything checked out OK> ] ; then
               chMode="$chTmpMode"
          else
               print -u2 "Error: mode $chTmpMode for IP $chIP is not possible."
          fi
     fi

     # here we have read a whole line and could process the system:
     <process system here>

done
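
As an aside, the same splitting can also be left to the shell's read builtin by setting IFS to the delimiter for that one command. This is a swapped-in technique, not the chopping approach shown above; the function name parse_delimited_line is made up, and the defaults (port 500, UDP) are taken from the question.

```shell
#!/bin/sh
# parse_delimited_line LINE: split "IP:ports:mode" at ":" and apply
# defaults to empty optional fields. With a non-blank delimiter in IFS,
# two delimiters in a row yield an empty field.
parse_delimited_line() {
    IFS=: read -r chIP chTmpPort chTmpMode <<EOF
$1
EOF
    chPort="${chTmpPort:-500}"      # default port if the field is empty
    chMode="${chTmpMode:-udp}"      # default mode if the field is empty
    printf '%s %s %s\n' "$chIP" "$chPort" "$chMode"
}

parse_delimited_line "1.2.3.4:5,6,7,8:tcp"
parse_delimited_line "2.3.4.5::udp"
parse_delimited_line "3.4.5.6::"
```

The consistency checks would still go between the read and the assignment of the defaults, exactly as in the longer version.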

The last possibility is the stanza file format. It allows for easy handling of default options because fields can simply be left out. It is also possible to have multiple entries in a single file (which - see above - might be a good or bad thing, depending on your environment).

The stanza file looks like this:

# general stanza file format
identifier:
     field1=value
     field2=value
     field3=value
     ....

identifier:
     field1=value
     field2=value
     ...

...

In your case it could look like this:

# sample stanza file format

1.2.3.4:
     # some comment about this machine
     ports=5,6,7,8     # an inline comment
     mode=tcp

2.3.4.5:
     mode=udp         # ports left to default

4.5.6.7:                  # everything left to default
     # mode=??        # commented-out line

Unfortunately this is the most complicated to parse of the three formats, but it is definitely the most flexible. Let's get to it:

We start with an identifier (in our case the IP address) and read and store one line after the other until we encounter another identifier (or the end of the input file). This tells us we have read the whole record and we process it before we start over to read. We will - for the purpose of the example - suppose that "mode" is mandatory to show how mandatory fields are handled.

                            # reinit these for every new record 
chIP=""                     # no default for IP
chPort=500                  # default for port, will be overwritten if we read it
chMode=""                   # no default for mode
lProcessRecord=0         # 0=do not process record, 1=process it

sed 's/^[<spc><tab>]*//
     s/[<spc><tab>]*$//
     s/[<spc><tab>][<spc><tab>]*/<spc>/g
     s/^#.*$//
     s/<spc>#.*$//
     /^$/d' $chFile |\
while read chLine ; do
     case $chLine in
          *:)                           # identifier, process last record, start new one
                if [ "$chIP" != "" ] && [ "$chMode" = "" ] ; then     # a record was read but lacks the mandatory mode
                     lProcessRecord=0
                     print -u2 "Error: skipping record, mode=-directive missing."
                fi
                if (( lProcessRecord )) ; then
                     <process record>
                fi
                chIP="${chLine%:}"      # reinit data structure
                chPort=500
                chMode=""
                lProcessRecord=1

                <perform consistency checks for IP>
                if [ <NOT everything checked out OK> ] ; then
                     lProcessRecord=0
                     print -u2 "Error: IP $chIP is malformed, skipping record."
                fi
                ;;

          ports=*)                      # ports line, collect and proceed
                <perform consistency checks for ports>
                if [ <everything is OK> ] ; then
                     chPort="${chLine#*=}"
                else
                     # notice we do not clear the process flag, just proceed with defaults 
                     print -u2 "Error: IP $chIP has wrong ports directive, using defaults."
                fi
                ;;

          mode=*)                       # mode line, collect and proceed
                <perform consistency checks for mode>
                if [ <everything is OK> ] ; then
                     chMode="${chLine#*=}"
                else
                     print -u2 "Error: IP $chIP has wrong Mode directive, skipping record."
                     lProcessRecord=0 
                fi
                ;;

          ?*=*)                         # general form of option line
                chFieldname="${chLine%%=*}"
                chValue="${chLine#*=}"
                if [ <checks> = FAILED ] ; then
                     lProcessRecord=0          # prohibit processing of record
                fi
                ;;

          *)                           # catch-all, malformed lines
                print -u2 "Error: cannot decipher in stanza ${chIP}, line:\n${chLine}"
                ;;
     esac
done
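if [ "$chIP" != "" ] && [ "$chMode" = "" ] ; then   # same mandatory check for the last record
     lProcessRecord=0
     print -u2 "Error: skipping record, mode=-directive missing."
fi
if (( lProcessRecord )) ; then         # process last record read
     <process record>
fi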
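
To make the control flow easier to try out, here is a cut-down version with the pseudo-code filled in by deliberately simple stand-ins. Everything specific in it is an assumption for illustration: the names parse_stanzas and emit_record, the rule that mode must be "tcp" or "udp", and "processing" a record by printing it. The input is expected to be already cleaned by the sed step, and printf replaces print so that it runs in any POSIX shell.

```shell
#!/bin/sh
# emit_record: stand-in for the real processing step. Prints the record if
# it is complete, reports a missing mandatory "mode" otherwise.
emit_record() {
    [ -n "$chIP" ] || return 0                  # nothing read yet
    [ "$lProcessRecord" -eq 1 ] || return 0     # error already reported
    if [ -z "$chMode" ] ; then
        printf 'Error: skipping %s, mode missing.\n' "$chIP" >&2
        return 0
    fi
    printf '%s %s %s\n' "$chIP" "$chPort" "$chMode"
}

# parse_stanzas: read normalized stanza lines on stdin.
parse_stanzas() {
    chIP="" chPort=500 chMode="" lProcessRecord=0
    while IFS= read -r chLine ; do
        case $chLine in
            *:)                     # identifier: finish old record, start new
                emit_record
                chIP="${chLine%:}" chPort=500 chMode="" lProcessRecord=1
                ;;
            ports=*)
                chPort="${chLine#*=}"
                ;;
            mode=*)
                case ${chLine#*=} in
                    tcp|udp) chMode="${chLine#*=}" ;;
                    *)  printf 'Error: bad mode for %s\n' "$chIP" >&2
                        lProcessRecord=0 ;;
                esac
                ;;
            *)
                printf 'Error: cannot decipher line: %s\n' "$chLine" >&2
                ;;
        esac
    done
    emit_record                     # the last record has no identifier after it
}

printf '%s\n' '1.2.3.4:' 'ports=5,6,7,8' 'mode=tcp' \
              '2.3.4.5:' 'mode=udp' '4.5.6.7:' | parse_stanzas
```

Because the final emit_record sits inside the function, the whole parser can be the last part of a pipeline without depending on how the shell handles subshells.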

OK, as you see there is a lot of pseudo-code in there, which you have to fill with your checks. This post is getting very long so i would like to discuss this in a separate post. Please give me some kind of feedback first, it is quite some work to write this and i wouldn't want to do this unwanted.

Some last suggestions:

1) You should decide what to do with doubled directives, which could occur in the option-file and the stanza-file. For instance:

1.2.3.4:
     ports=5,6,7
     ports=8,9,10
     mode=tcp

You could: let the last one take precedence; warn the user and skip the record for ambiguity; or merge all the occurrences into one, so that the example would be equivalent to "ports=5,6,7,8,9,10".

2) you might want to allow for spaces between the equal signs and the field names/values:

1.2.3.4:
     ports = 5,6,7
     mode = tcp

To achieve this it is only necessary to put the following directive into the sed-statement (which throws these out so that the provided code would go unchanged):

s/[<spc><tab>]*=[<spc><tab>]*/=/

3) A similar device could be employed in the delimited file, where blanks surrounding delimiter chars could be thrown out prior to parsing:

s/[<spc><tab>]*<delimiter>[<spc><tab>]*/<delimiter>/g
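
The first two suggestions can be sketched in a few lines. Both function names are invented for this example, [[:blank:]] again stands in for the <spc>/<tab> placeholders, and append_ports shows only the "merge" policy; a real parser would also have to remember whether the port value is still the default before merging.

```shell
#!/bin/sh
# trim_around_equals: remove blanks on either side of the first "=" of
# each line read from stdin.
trim_around_equals() {
    sed 's/[[:blank:]]*=[[:blank:]]*/=/'
}

# append_ports OLD NEW: merge a repeated ports directive into one list.
# If OLD is empty, NEW is returned unchanged.
append_ports() {
    printf '%s\n' "${1:+$1,}$2"
}

printf 'ports = 5,6,7\n' | trim_around_equals     # prints: ports=5,6,7
append_ports "5,6,7" "8,9,10"                     # prints: 5,6,7,8,9,10
```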

I hope this helps.

bakunin