Ok, let us start at the end, because this is easiest:
Actually neither awk nor grep is recommended. Whenever you call an external program (regardless of which one) from the shell, the shell has to fork() a new process, and creating such a process costs an awful lot of time compared to using a shell built-in. You might want to read this thread, where I learned the same lesson the very hard way.
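To make the difference concrete, here is a small sketch (the variable names are made up for this example). Both variants extract the option name from a line; the first starts external processes, the second stays entirely inside the shell:

```shell
line='ports=5,6,7,8'

# external program: the shell has to fork and exec "cut" for every line
name_ext=$(printf '%s\n' "$line" | cut -d= -f1)

# built-in parameter expansion: no new process is created
name_builtin=${line%%=*}      # strip the first "=" and everything after it
value_builtin=${line#*=}      # strip everything up to and including the first "="

printf '%s -> %s\n' "$name_builtin" "$value_builtin"
```

Inside a loop over many lines, the second form is usually the one you want.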
To come back to your main question: what you need is a "parser". If you really want to indulge in the theory and practice of (recursively) parsing arbitrary languages you might want to read the classic "Dragon Book" ("Compilers: Principles, Techniques, and Tools"; Aho, Sethi, Ullman). It is a fantastic book about an intriguing field, but for your rather limited purposes a very simple approach will suffice.
Let's start with a few thoughts about your input file:
- We want empty space, like leading blanks, trailing blanks, empty lines, etc., not to influence the outcome, because otherwise the files, which users have to prepare by hand, would be really awkward to handle. A single misplaced space char, which would be invisible, would prevent correct parsing, so we don't want that.
Well, the best way to prevent space from having any meaning is to remove it prior to even looking at the file. Question: do we have to handle quoted strings? If not, we can even collapse every run of whitespace between words into a single space. So let us start with a sketch of a script. We use sed for this, because it is only called ONCE for the whole input (replace <spc> and <tab> with literal space/tab chars):
sed 's/^[<spc><tab>]*//
s/[<spc><tab>]*$//
/^$/d
s/[<spc><tab>][<spc><tab>]*/<spc>/g' $chFile |\
while read chLine ; do
print - "$chLine"
done
The "print" line will later be removed, it is just there to let us see what we do. On to the next part:
- We want to have comments, because it is easier for people to comment directly in place on what they do instead of having to use separate documents. As configuration files get longer, some comments might be practical. We could implement multi-line comments like in C, but this would be overkill, so we settle for the same comments as in shell scripts: everything following a "#" is considered a comment. Now, it might be that "#" is part of a word and we do not want to remove half the word because it could be a comment, therefore we consider "#" to be introducing a comment only if it is either at the start of a line or preceded by a space.
Let us change the sed-statement accordingly, to remove everything we don't need our parser to see:
sed 's/^[<spc><tab>]*//
s/[<spc><tab>]*$//
s/[<spc><tab>][<spc><tab>]*/<spc>/g
s/^#.*$//
s/<spc>#.*$//
/^$/d' $chFile |\
while read chLine ; do
print - "$chLine"
done
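To try the filter without typing literal tab characters, here is a sketch that assumes your sed understands the POSIX character class [[:blank:]] (space and tab) in place of the <spc>/<tab> placeholders. The function name clean_config is made up for this example, and printf is used instead of print so the sketch also runs in shells other than ksh:

```shell
# Read a raw config file on stdin, write the cleaned lines to stdout.
clean_config() {
    sed -e 's/^[[:blank:]]*//' \
        -e 's/[[:blank:]]*$//' \
        -e 's/[[:blank:]][[:blank:]]*/ /g' \
        -e 's/^#.*$//' \
        -e 's/ #.*$//' \
        -e '/^$/d'
}

# Sample input: a comment line, an empty line, a line with an inline
# comment, and a word containing "#" that must survive.
printf '  # full comment\n\n  ports =  5,6 \t# inline\nname#1=x\n' | clean_config
```

Only the two content lines should come out, with the inline comment and the surplus blanks removed and "name#1=x" left intact.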
- Now that we have taken care of the preliminaries we have to start on the real work: what should our config file look like? Which format do we want to be able to recognize?
We start first with identifying necessary and optional values: the IP-address is obviously mandatory. The port-list is optional and we define a default for that. Then the mode: is it optional and we create a default? Is it mandatory? Is there any other information which should/could come up on a line? You want to consider this first and prepare a list like the following:
Item      Format                      mandatory/optional
----------------------------------------------------------
IP-Adr    fixed format                mandatory
Ports     list delimited by comma     optional
Mode      "tcp"|"udp" (?)             optional(?)
Be very thorough with this list, you will see why.
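Each row of such a list translates into one consistency check. The following is only a sketch of what these checks could look like: the function names are hypothetical, the port range 1-65535 is my assumption, and the mode values are the ones from the table. It uses nothing but shell built-ins, so no process is forked:

```shell
# Succeeds if $1 looks like a dotted-quad IPv4 address (four numbers 0-255).
is_valid_ip() {
    ip=$1
    case $ip in
        *[!0-9.]*|.*|*.|*..*|"") return 1 ;;   # wrong chars or empty parts
    esac
    old_ifs=$IFS
    IFS=.
    set -- $ip                                 # split at the dots
    IFS=$old_ifs
    [ $# -eq 4 ] || return 1
    for octet in "$@" ; do
        [ ${#octet} -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# Succeeds if $1 is a comma-separated list of port numbers (1-65535 assumed).
is_valid_port_list() {
    list=$1
    case $list in
        *[!0-9,]*|,*|*,|*,,*|"") return 1 ;;
    esac
    old_ifs=$IFS
    IFS=,
    set -- $list                               # split at the commas
    IFS=$old_ifs
    for port in "$@" ; do
        [ ${#port} -le 5 ] && [ "$port" -ge 1 ] && [ "$port" -le 65535 ] || return 1
    done
    return 0
}

# Succeeds if $1 is one of the modes from the table.
is_valid_mode() {
    case $1 in
        tcp|udp) return 0 ;;
        *)       return 1 ;;
    esac
}
```

A call like `if is_valid_ip "$chValue" ; then ... fi` could then stand in wherever a parser says "perform consistency check here".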
There are three basic layouts for your config file: the stanza file, the delimited file and what I call the option-file. The easiest to parse is the option-file, which contains only declarations of the form "identifier=value". For instance, it could look like:
# sample configuration file
machine=1.2.3.4 # this is our system
ports=5,6,7,8 # this is a list of ports
mode=tcp
option=some-value # some other option
The problem (or advantage?) with this is that it can only contain a single system. You could put all these config-files into one directory and cycle through them. For some problems this is a good choice; you decide if it is good in your case. A parser would look like this (I have left out consistency checks to make it easier to follow, we will fill these in later):
# cycle through the files with a glob: this yields full path names and
# does not need an external "ls"
for chFile in /path/to/conf/files/* ; do
     chIP=""                         # no default for this
     chPort=500                      # default for port, overwritten if we read it
     chMode="tcp"                    # default for mode, overwritten if we read it
     sed 's/^[<spc><tab>]*//
          s/[<spc><tab>]*$//
          s/[<spc><tab>][<spc><tab>]*/<spc>/g
          s/^#.*$//
          s/<spc>#.*$//
          /^$/d' "$chFile" |\
     while read chLine ; do
          chField="${chLine%%=*}"    # split into option name and value
          chValue="${chLine#*=}"     # cut at the FIRST "=", so values may contain "="
          case $chField in
               "machine")            # the IP address, see the sample file
                    # perform consistency check here
                    chIP="$chValue"
                    ;;
               "ports")
                    # perform consistency check here
                    chPort="$chValue"
                    ;;
               "mode")
                    # perform consistency check here
                    chMode="$chValue"
                    ;;
               # extend for other options by adding more branches here
               *)
                    # last, the catch-all for unknown options
                    print -u2 "Unknown option: $chField in file $chFile, line:\n$chLine"
                    ;;
          esac
     done
     # here we have read a whole file and could process the system
     # (ksh88/ksh93 run the loop in the current shell, so the variables
     # set inside it are still visible here):
     if [ "$chIP" = "" ] ; then
          print -u2 "Error: no IP specified in $chFile"
     elif [ "$chOtherMandatoryOption" = "" ] ; then
          print -u2 "Error: no <OtherOption> specified in $chFile"
     elif [ <some other KO-criteria for processing the system> ] ; then
          print -u2 "Error: cannot process $chFile because of ..."
     else
          # all checks OK and we finally get to work
          <process system here>
     fi
done
By the way, it is a good idea to put the processing of the system into a separate function and call that instead of doing all the work in one single program. It makes the code more readable and easier to maintain.
The next possibility would be the delimited file. It is a table with a certain delimiter character as field separator. Spreadsheet programs use this format for data interchange frequently ("comma-separated file"). It will allow for all the configuration data in a single file, but optional values will have to be left explicitly empty. In the option file you could simply leave out an optional value for which a default exists, not so here. Furthermore you have to decide on a delimiter char which cannot be used in text, unless we want to further complicate matters by introducing escaping:
# sample configuration via a delimited file
# we will use ":" as a delimiter here and the three fields from above
#IP:port1[,port2,..,portN]:[mode]
1.2.3.4:5,6,7,8:tcp # first system
2.3.4.5::udp # second system, ports left blank
3.4.5.6:: # third system, all optional fields left blank
...
This file type is relatively easy to parse: we chop off from the start of the line to the next delimiter, again and again, until we reach the end. Because we have a fixed succession of fields we do not need field names like in the first type, but this also makes it easier for people to make errors by swapping field values, especially as the number of fields grows. This is what a parser could look like:
sed 's/^[<spc><tab>]*//
     s/[<spc><tab>]*$//
     s/[<spc><tab>][<spc><tab>]*/<spc>/g
     s/^#.*$//
     s/<spc>#.*$//
     /^$/d' "$chFile" |\
while read chLine ; do
     chIP=""                         # no default for this
     chPort=500                      # default for port, overwritten if we read it
     chMode="tcp"                    # default for mode, overwritten if we read it
     chTmpPort=""                    # we need one of these for every optional
     chTmpMode=""

     # chop off the IP and trim the remainder
     chIP="${chLine%%:*}"            # we use the ":" from the sample file
     # chIP="${chLine%%<delimiter-char>*}"       # the general form
     chLine="${chLine#*:}"
     # chLine="${chLine#*<delimiter-char>}"      # the general form
     # perform IP consistency checks here

     # same for ports, an optional parameter
     chTmpPort="${chLine%%:*}"
     chLine="${chLine#*:}"
     if [ "$chTmpPort" != "" ] ; then
          # perform port consistency check here
          if [ <everything checked out OK> ] ; then
               chPort="$chTmpPort"
          else
               # report the value we read, not the default
               print -u2 "Error: ports $chTmpPort for IP $chIP is not possible."
          fi
     fi

     # same again for Mode
     chTmpMode="${chLine%%:*}"
     chLine="${chLine#*:}"
     if [ "$chTmpMode" != "" ] ; then
          # perform mode consistency check here
          if [ <everything checked out OK> ] ; then
               chMode="$chTmpMode"
          else
               # report the value we read, not the default
               print -u2 "Error: mode $chTmpMode for IP $chIP is not possible."
          fi
     fi

     # here we have read a whole line and could process the system:
     <process system here>
done
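As an alternative to chopping the line by hand, the shell can do the splitting itself when IFS is set to the delimiter for the read. This is not the method used above, just a sketch of a different technique; the function name parse_delimited_line is hypothetical, and the defaults 500 and tcp are the ones from the example:

```shell
# Split one already-cleaned line "IP:ports:mode" and print the three values
# with the defaults filled in for empty optional fields.
parse_delimited_line() {
    # a here-document (not a pipe) keeps "read" in the current shell
    IFS=: read -r ip ports mode <<EOF
$1
EOF
    printf '%s %s %s\n' "$ip" "${ports:-500}" "${mode:-tcp}"
}

parse_delimited_line '1.2.3.4:5,6,7,8:tcp'
parse_delimited_line '2.3.4.5::udp'
parse_delimited_line '3.4.5.6::'
```

Because ":" is not a whitespace character, two adjacent delimiters yield an empty field instead of being merged, which is exactly what the "left explicitly empty" convention needs.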
The last possibility is the stanza file format. It allows for easy handling of default options because fields can simply be left out. It is also possible to have multiple entries in a single file (which - see above - might be a good or bad thing, depending on your environment).
The stanza file looks like this:
# general stanza file format
identifier:
field1=value
field2=value
field3=value
....
identifier:
field1=value
field2=value
...
...
In your case it could look like this:
# sample stanza file format
1.2.3.4:
# some comment about this machine
ports=5,6,7,8 # an inline comment
mode=tcp
2.3.4.5:
mode=udp # ports left to default
4.5.6.7: # everything left to default
# mode=?? # commented-out line
Unfortunately this is the most complicated to parse of the three formats, but it is definitely the most flexible. Let's get to it:
We start with an identifier (in our case the IP address) and read and store one line after the other until we encounter another identifier (or the end of the input file). This tells us we have read the whole record and we process it before we start over to read. We will - for the purpose of the example - suppose that "mode" is mandatory to show how mandatory fields are handled.
# reinit these for every new record
chIP=""                              # no default for IP
chPort=500                           # default for port, will be overwritten if we read it
chMode=""                            # no default for mode
lProcessRecord=0                     # 0=do not process record, 1=process it
sed 's/^[<spc><tab>]*//
     s/[<spc><tab>]*$//
     s/[<spc><tab>][<spc><tab>]*/<spc>/g
     s/^#.*$//
     s/<spc>#.*$//
     /^$/d' "$chFile" |\
while read chLine ; do
     case $chLine in
          *:)                        # identifier, process last record, start new one
               # check if all mandatory options were read
               # (skipped before the very first record, when chIP is still empty)
               if [ "$chIP" != "" ] && [ "$chMode" = "" ] ; then
                    lProcessRecord=0
                    print -u2 "Error: skipping record, mode=-directive missing."
               fi
               if (( lProcessRecord )) ; then
                    <process record>
               fi
               chIP="${chLine%:}"    # reinit data structure
               chPort=500
               chMode=""
               lProcessRecord=1
               <perform consistency checks for IP>
               if [ <NOT everything is checked OK> ] ; then
                    lProcessRecord=0
                    print -u2 "Error: IP $chIP is malformed, skipping record."
               fi
               ;;
          ports=*)                   # ports line, collect and proceed
               <perform consistency checks for ports>
               if [ <everything is OK> ] ; then
                    chPort="${chLine#*=}"
               else
                    # notice we do not clear the process flag, just proceed with defaults
                    print -u2 "Error: IP $chIP has wrong ports directive, using defaults."
               fi
               ;;
          mode=*)                    # mode line, collect and proceed
               <perform consistency checks for mode>
               if [ <everything is OK> ] ; then
                    chMode="${chLine#*=}"
               else
                    print -u2 "Error: IP $chIP has wrong Mode directive, skipping record."
                    lProcessRecord=0
               fi
               ;;
          ?*=*)                      # general form of option line
               chFieldname="${chLine%%=*}"   # split at the FIRST "="
               chValue="${chLine#*=}"
               if [ <checks> = FAILED ] ; then
                    lProcessRecord=0 # prohibit processing of record
               fi
               ;;
          *)                         # catch-all, misformed lines
               print -u2 "Error: cannot decipher in stanza ${chIP}, line:\n${chLine}"
               ;;
     esac
done
# process last record read, after the same check for mandatory options
if [ "$chIP" != "" ] && [ "$chMode" = "" ] ; then
     lProcessRecord=0
     print -u2 "Error: skipping record, mode=-directive missing."
fi
if (( lProcessRecord )) ; then
     <process record>
fi
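If you want something you can run right away, here is a stripped-down sketch of the same idea without any consistency checks except the mandatory "mode". It assumes the input has already been cleaned by the sed filter, all function names are hypothetical, process_record is only a stand-in that prints the record, and printf replaces print so it also runs outside ksh:

```shell
# Stand-in for the real work: just print the record.
process_record() {
    printf '%s %s %s\n' "$1" "$2" "$3"
}

# Process the record collected so far, if there is one and it is complete.
flush_record() {
    [ -n "$ip" ] || return 0                   # nothing read yet
    if [ -z "$mode" ] ; then
        printf 'Error: skipping %s, mode= directive missing.\n' "$ip" >&2
        return 0
    fi
    process_record "$ip" "$ports" "$mode"
}

# Read cleaned stanza lines from stdin.
parse_stanzas() {
    ip="" ports=500 mode=""
    while read -r line ; do
        case $line in
            *:)      flush_record              # new identifier closes the old record
                     ip=${line%:} ports=500 mode="" ;;
            ports=*) ports=${line#*=} ;;
            mode=*)  mode=${line#*=} ;;
            *)       printf 'Error: cannot decipher line: %s\n' "$line" >&2 ;;
        esac
    done
    flush_record                               # do not forget the last record
}

printf '1.2.3.4:\nports=5,6,7,8\nmode=tcp\n2.3.4.5:\nmode=udp\n4.5.6.7:\n' | parse_stanzas
```

With the sample stanza file from above, the third record is skipped with an error message because "mode" was declared mandatory for this example.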
OK, as you see there is a lot of pseudo-code in there, which you have to fill with your checks. This post is getting very long, so I would like to discuss this in a separate post. Please give me some kind of feedback first; it is quite some work to write this and I wouldn't want to do it if it is not wanted.
Some last suggestions:
1) You should decide what to do with doubled directives, which could occur in the option-file and the stanza-file. For instance:
1.2.3.4:
ports=5,6,7
ports=8,9,10
mode=tcp
You could: let the last one take precedence; warn the user and skip the record because of the ambiguity; or merge all the options into one, so that the example would be equivalent to "ports=5,6,7,8,9,10".
2) You might want to allow for spaces around the equal signs, between the field names and the values:
1.2.3.4:
ports = 5,6,7
mode = tcp
To achieve this it is only necessary to put the following directive into the sed-statement (which throws these out so that the provided code would go unchanged):
s/[<spc><tab>]*=[<spc><tab>]*/=/
3) A similar device could be employed in the delimited file, where blanks surrounding delimiter chars could be thrown out prior to parsing:
s/[<spc><tab>]*<delimiter>[<spc><tab>]*/<delimiter>/g
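For the third choice in suggestion 1), merging repeated directives, a small helper is enough. This is only a sketch; the function name add_ports is made up for this example:

```shell
# $1 = ports collected so far (may be empty), $2 = list from the new directive
add_ports() {
    # ${1:+$1,} expands to "<old list>," only if an old list exists
    printf '%s\n' "${1:+$1,}$2"
}

chPort=""
chPort=$(add_ports "$chPort" "5,6,7")
chPort=$(add_ports "$chPort" "8,9,10")
printf '%s\n' "$chPort"
```

In the stanza parser the variable would then have to start out empty for every record, with the default filled in only if no ports directive was read at all.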
I hope this helps.
bakunin