Processing data that contains space and quote delimiters

I need to write a Bash script to process a data file that is in this format:

1 A B C D E
2 F G "H H" I J

As you can see, the data is delimited by a space, but there are also some fields that contain spaces and are surrounded by double-quotes. An example of that is "H H".

I wrote this test script to display the 4th parameter:

#!/bin/bash
while read line
do
        echo "line=$line"
        param4=$(echo $line | cut -d" " -f4)
        echo "param4=$param4"
done

Here's what it displays:

line=1 A B C D E
param4=C
line=2 F G "H H" I J
param4="H

For the second line of data, I wanted the fourth parameter to be "H H" (without the quotes) instead of one double quote and one H. It is using the other H and the trailing double quote as parameter 5. That is not what I wanted.

How can I process this data?

  • Are the quotes always enclosing the same values?
  • Could there be more than two quotes in one line?
  • Do you want all parameters tested or only some?

No.

Yes.

All of them.

Sample data might look like this:

A B C D E F
"A A" B C D E F
A "B B" C D E F
A B "Hi there" D E F
A B C "Lots of words" E F
A B C D "E E E E E E E E" F
A B C D E "F F"

Thanks for your help.

First convert your data into a comma delimited file (csv):

$ cat sample.dat
A B C D E F
"A A" B C D E F
A "B B" C D E F
A B "Hi there" D E F
A B C "Lots of words" E F
A B C D "E E E E E E E E" F
A B C D E "F F"
$ perl -ne 'my @x=split(/ /); for (0..$#x) {
     if ($x[$_] !~ /"/ && $a != 1) {print $x[$_] . ","; }
  elsif ($x[$_] =~ /^"/) { $x[$_] =~ s/"//g; print $x[$_] . " "; $a=1; }
  elsif ($x[$_] !~ /"$/ && $a == 1) { print $x[$_] . " "; }
   else { $x[$_] =~ s/"//g; print $x[$_] . ","; $a=0 } } ' sample.dat > sample.tmp
$ 
$ sed 's/^,//' sample.tmp | sed '$d' > sample.csv
$ rm sample.tmp
$
$ cat sample.csv
A,B,C,D,E,F
A A,B,C,D,E,F
A,B B,C,D,E,F
A,B,Hi there,D,E,F
A,B,C,Lots of words,E,F
A,B,C,D,E E E E E E E E,F
A,B,C,D,E,F F
$ 

Now you could use

param4=$(echo "$line" | awk -F, '{print $4}')

to assign the value of the fourth column to the param4 variable.
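As a rough sketch of the same idea without awk, bash can split a CSV line itself. The function name csv_field is made up for illustration, and the sketch assumes that no field contains a comma, which holds for the sample data above.

```shell
#!/bin/bash
# csv_field is a hypothetical helper, not part of the original answer.
# Usage: csv_field N 'comma,separated,line'
# Assumption: fields never contain a comma themselves.
csv_field() {
  local n=$1 line=$2 fields
  # Split the line on commas into the array "fields".
  IFS=, read -r -a fields <<< "$line"
  # Arrays are zero-based, so field N lives at index N-1.
  printf '%s\n' "${fields[n-1]}"
}

csv_field 4 'A,B,C,Lots of words,E,F'
```

Setting IFS only for the read command keeps the change local to that one command, so the rest of the script still splits on whitespace.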

while read line
do
  eval set -- "$line"
  echo "$4"
done < infile

or:

print4() 
{ 
  echo "$4"
}

while read line
do
  eval print4 "$line"
done < infile

or (ksh93/bash):

while read line
do
  eval A=($line)
  echo "${A[3]}"
done < infile

output:

C
H H

Thank you! This code is short and elegant and works nicely:

while read line
do
  eval set -- "$line"
  echo "$4"
done < infile

How does it work? Are you setting a variable named -- or something?

Hi RickS,

Without options, set assigns its arguments to the positional parameters $1, $2, and so on. For instance:

# set a b 
# echo $1
a
# echo $2
b

-- marks the end of options. Anything that comes after it is not interpreted as an option to the set command. This ensures that if a value starts with a "-" sign, it does not unintentionally set a shell option.
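A small sketch of why the -- matters. The function name demo_set and the value -x are invented for this illustration; without the --, the shell would read the value as the -x option and switch on command tracing instead of storing it.

```shell
#!/bin/bash
# demo_set is a hypothetical helper used only to illustrate "set --".
demo_set() {
  local value=$1
  # "--" ends option parsing, so a value such as -x is stored in $1
  # instead of being treated as the shell's -x (trace) option.
  set -- "$value" extra
  printf '%s|%s\n' "$1" "$#"
}

demo_set -x
```

This prints -x|2: the value arrives unchanged in $1, and there are two positional parameters.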

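The eval command plays a crucial role here. The shell first expands "$line", and eval then parses the resulting text again as a command, so the double quotes inside the data are treated as shell quoting. For the second line of the input file, the command that runs is: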

set -- 2 F G "H H" I J

So $1 becomes 2, $2 becomes F, $3 becomes G, $4 is set to H H (the shell removes the quotes), and so on.
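The explanation above can be wrapped into a small bash function for experimenting. The name nth_field is hypothetical, and ${!n} is bash-specific indirect expansion, so this is only a sketch for bash. Because eval runs whatever text it is given, it is only reasonable for data you trust.

```shell
#!/bin/bash
# nth_field is a hypothetical helper, not from the thread.
# Usage: nth_field N 'line of data'
nth_field() {
  local n=$1 line=$2
  # eval re-parses the text of $line, so "H H" becomes one word.
  # Caution: eval would also run any command substitution in the data.
  eval "set -- $line"
  # ${!n} is bash indirect expansion: positional parameter number $n.
  printf '%s\n' "${!n}"
}

nth_field 4 '2 F G "H H" I J'
nth_field 4 '1 A B C D E'
```

The first call prints H H and the second prints C, matching the output shown earlier in the thread.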

Thank you so much for taking the time to help me with that. I could not have figured that out on my own, and I spent quite a long time trying. I very much appreciate you sharing your gifts.