How to extract fields from a CSV i.e comma separated where some of the fields having comma as value?

J.Jena · February 19, 2019, 6:03am

can anyone help me!!!! How to I parse the CSV file
file name : abc.csv (csv file) The above file containing data like

abv,sfs,,hju,',',jkk wff,fst,,rgr,',',rgr ere,edf,erg,',',rgr,rgr

I have a requirement like i have to extract different field and assign them into different variables.
My Code:

cat $file | awk 'NR!=1' | while read -r line   do      a=`echo "$line" | awk -F',' '{print $1}'`      b=`echo "$line" | awk -F',' '{print $2}'`      c=`echo "$line" | awk -F',' '{print $3}'`      d=`echo "$line" | awk -F',' '{print $4}'`      e=`echo "$line" | awk -F',' '{print $5}'`      f=`echo "$line" | awk -F',' '{print $6}'`      echo "$e"      echo "$f" done

output:(it gave the output as single quote)
' ' ' ' ' ' Required Output Should be like:(my 5th field having value "comma" )
, jkk , rgr , rgr rgr

RudiC · February 19, 2019, 8:38am

Welcome to the forum.

You're problem is quite ubiquitous and has been solved / commented on in these forums more than several times. Try the search function, or one of the links given at the lower left of this page under "More UNIX and Linux Forum Topics You Might Find Helpful".

The code with which you addressed your problem is not the most efficient one. Did you consider just read ing the variables? Like

{ read dummy
  while IFS=, read a b c d e f dummy
     do   echo $e
          echo $f

     .
     .
     .
     done
 } < $file

bakunin · February 19, 2019, 9:25am

j.jena:

can anyone help me!!!! How to I parse the CSV file
file name : abc.csv (csv file) The above file containing data like
abv,sfs,,hju,',',jkk wff,fst,,rgr,',',rgr ere,edf,erg,',',rgr,rgr
I have a requirement like i have to extract different field and assign them into different variables.
My Code:
cat $file | awk 'NR!=1' | while read -r line   do      a=`echo "$line" | awk -F',' '{print $1}'`      b=`echo "$line" | awk -F',' '{print $2}'`      c=`echo "$line" | awk -F',' '{print $3}'`      d=`echo "$line" | awk -F',' '{print $4}'`      e=`echo "$line" | awk -F',' '{print $5}'`      f=`echo "$line" | awk -F',' '{print $6}'`      echo "$e"      echo "$f" done
output: (it gave the output as single quote)
' ' ' ' ' ' Required Output Should be like: (my 5th field having value "comma" )
, jkk , rgr , rgr rgr

First: even if your script line is working, it will be slow as a slug. The reason is that you call external (that is: external to the shell) programs (here: awk) for every line over and over again. Calling a program costs a considerable amount of time. You will not notice that if you call it once but if your file has, say, 1000 lines and each lines has 10 fields, you will call awk 10.000 times. That takes a considerable amount of time.

What you want to do is called "parsing" and if you want to assign the parsed values to (shell) variables in the end you can as well do the parsing itself in the shell too. Let us start with something simple: a loop that reads a string one character at a time. I suppose your input file does not contain anything fancy (like escaped characters, unprintable sequences, etc.) but only printable characters. In this case it can be done in shell alone quite simply:

#! /bin/ksh

chLine=""
chChar=""

while read chLine ; do                           # go through the whole file
     echo "Whole line: $chLine"                  # test output 
     while [ -n "$chLine" ] ; do                 # go through the line
          chChar="${chLine%${chLine#?}}"         # chop off the first character
          chLine="${chLine#?}"                   # and remove that from the line
          echo "$chChar"                         # test output
     done
done

Now let this run against any (short!) file with a few lines and see how it works. It reads one line at a time, prints it (this is just to control its workings) completely, then starts to remove one character at a time until the end of the line. The removed character is also printed to show the mechanism.

Now we need to determine if we are inside a quotation or outside. This can be done simply by introducing a logical flag which we switch every time we find a quote. We start with the flag in "off", switch it to "on" when we find a quote, switch it to "off" again when we find another quote, etc.. Here we go:

#! /bin/ksh

chLine=""
chChar=""
lInsideQuote=0

while read chLine ; do                           
     echo "Whole line: $chLine"                  
     lInsideQuote=0                              # switch off for new line
     while [ -n "$chLine" ] ; do                 
          chChar="${chLine%${chLine#?}}"          
          chLine="${chLine#?}"                    
          case "$chChar" in
               \')                               # if we found a single quote
                    if (( lInsideQuote )) ; then # flip the status of the flag
                         lInsideQuote=0
                    else
                         lInsideQuote=1
                    fi
                    ;;

               *)                                # ignore all other characters
                    ;;

          esac
                    
          if (( lInsideQuote )) ; then           # test output shows quote-status
               echo "$chChar"   (inside quote)"
          else
               echo "$chChar"   (outside quote)"
          fi
     done
done

Now we are prepared to deal with the fields themselves: we read a character a time and add that to a growing string which represents our next field. When we encounter a comma we react in one of two ways: when the quotation-flag is switched off, we found a legal separator and we output the field and start over with the next one. If the quotation-field is switched on we just the comma as a normal character to the fields value and carry on.

Here it is:

#! /bin/ksh

chLine=""
chChar=""
chField=""
lInsideQuote=0

while read chLine ; do
     echo "Whole line: $chLine"
     lInsideQuote=0
     chField=""                            # reset field buffer for next line
     while [ -n "$chLine" ] ; do
          chChar="${chLine%${chLine#?}}"
          chLine="${chLine#?}"
          case "$chChar" in
               "'")
                    if (( lInsideQuote )) ; then
                         lInsideQuote=0
                    else
                         lInsideQuote=1
                    fi
                    ;;

               ",")                                # if we found a comma
                    if (( lInsideQuote )) ; then   # inside a quoted string
                         chField="${chField},"     # we just add the comma
                    else
                                                   # outside a quoted string
                         echo "Field: $chField"  # field is finished, display it
                         chField=""                # empty buffer for next field
                    fi
                    ;;

               *)
                    chField="${chField}${chChar}"  # all other chars are just added
                    ;;

          esac
     done
     echo "last Field: $chField"  # display the last field in the line
done

Try this with a small sample file and see if it works for you. Try to understand its inner workings.

You can use this as a basic structure and refine it to your needs. Where now the echo -statements just display the field values you can insert code to actually deal with the values instead. Things i have left out for clarity are: there are two ways to quote, single and double quotes. How should the script deal with that, i.e.:

"double ' quoted", 'single " quoted'

My script only takes single-quotes into account but if you allow single- and double-quotes it gets a bit more complicated when one quote is inside a string quoted with the other type. Usually the rule is that quote characters lose their special meanings inside a quoted string. The example above would be two quoted strings, one with a single quote character in it, one with a double-quote character in it.

Also i left out some context checking: suppose this input line:

'field1','field2

The last opened quotation is not closed again. Should this lead to an error? My script ignores it simply, but you might want to raise some error because of this. How you react to such a line is your decision and you need to decide based on your purpose.

Another thing is if you have escaped quote characters like this:

"a double quote character \" inside a double-quoted string"

It is possible to parse these (hint: introduce an "escape-flag", which you set whenever you encounter a "\") but again it makes the parser more complicated.

I hope this helps.

bakunin

rdrtx1 · February 19, 2019, 10:03am

A start:

awk '
FNR > 1 {
   while (match($0, "\x027[^\x027]*\x027")) {
      quote_field=substr($0, RSTART, RLENGTH);
      quote_field_id="_:" c++ ":_"
      quote_fields
[quote_field_id]
=quote_field;
      sub(quote_field, quote_field_id);
   }
   comma_fields_count=split($0, comma_fields, ",");
   for (j=1; j<=comma_fields_count; j++) print (quote_fields[comma_fields[j]] ? quote_fields[comma_fields[j]] : comma_fields[j]);
}
' infile

RudiC · February 20, 2019, 4:42am

In a recent shell, e.g. bash , all can be done with a "here string" and "parameter expansion: Pattern substitution", like

$ T=$'\001'                                                                        # set a token improbable to appear in file
$ while read LINE
    do    IFS=, read a b c d e f REST <<< ${LINE//\',\'/$T}                        # read from $LINE but with ',' replaced by token
          echo ${a//$T/,}:${b//$T/,}:${c//$T/,}:${d//$T/,}:${e//$T/,}:${f//$T/,}   # print variables with token replaced by comma
    done < file
abv:sfs::hju:,:jkk wff