How to split a field into two fields?

vbrown · February 20, 2008, 10:07am

Hi,

I have a comma delimited text file where character fields (as opposed to numeric and date fields) are always enclosed with double quotes. Records are separated by the newline character. In a shell script I would like to split a particular field into two separate fields (enclosed with double quotes). The field I would like to split always begins with <description> and ends with </description> and is always the 5th field in a record.

e.g. I would like to convert this:

18,"A",2008-02-11,"Y","<description> some long text </description>","N",1

to this:

18,"A",2008-02-11,"Y","<description> some lo","ng text </description>","N",1

I'm not bothered where in the field the split occurs - somewhere in the middle is optimal.

Really grateful for any help on this one.

Thanks, Vicky

jim_mcnamara · February 20, 2008, 12:11pm

try awk:

awk -F, '{
       for(i=1; i<NF; i++) {                                                
           if($i ~ /<description>/) { 
                   # int() keeps the split point a whole number when the length is odd
                   half=int(length($i)/2);
                   printf("%s\",\"%s,", substr($i,1,half),
                                        substr($i,half+1))
           }
           else {
                   printf("%s,", $i)
           }
       }
       print $NF }' oldfile > newfile

bobbygsk · February 20, 2008, 12:23pm

Jim solved it while I was still working on the problem.
I'm new to scripting too, but I had a go at it as well.
Hope it helps.

Note that my script leaves extra spaces at the delimiters.

vbrown · February 21, 2008, 5:15am

Thanks a lot for your help.

I should have said in my initial post that the text between the enclosing double quotes may itself contain double-quoted text, and that this inner text may contain commas,

e.g. 18,"<description><job_title value="some text, more text" /></description>",2008-02-19,"N"

I think this makes it a lot more complicated?
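
A quick check shows why splitting on commas breaks down here. This is only a sketch: the function name count_comma_fields is made up for illustration, and plain awk is used (on Solaris, nawk should behave the same way for this). The sample record has 4 logical fields, but awk sees one more because of the comma inside the description:

```shell
# Sample record from this post: 4 logical fields, with a comma inside the description.
line='18,"<description><job_title value="some text, more text" /></description>",2008-02-19,"N"'

# Hypothetical helper: count the fields awk sees when it splits on every comma.
count_comma_fields() {
    printf '%s\n' "$1" | awk -F, '{ print NF }'
}

count_comma_fields "$line"   # prints 5, not 4
```

So any field number after the description can no longer be trusted when -F, is used.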

I'm also having to use nawk (I'm on Solaris) as each record is likely to be more than 3000 characters (max for awk), but I think the syntax is the same/similar to awk.

Any ideas?

Thanks again
Vicky

vbrown · February 21, 2008, 5:50am

Follow up:

I think I have managed to get what I want with only a minor modification to jim_mcnamara's solution:

I used nawk and set </description> as the field separator (-F) instead of ,

Solution:
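nawk -F "</description>" '{
for(i=1; i<NF; i++) {
if($i ~ /<description>/) {
# split halfway through the text that follows the opening tag
tagend=index($i, "<description>")+length("<description>")-1
half=tagend+int((length($i)-tagend)/2);
# -F consumed the closing tag, so write it back after the second half
printf("%s\",\"%s</description>", substr($i,1,half), substr($i,half+1))
}
else {
printf("%s</description>", $i)
}
}
print $NF }' oldfile > newfile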

Thanks so much for your help.

Vicky
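
For anyone finding this thread later, here is an alternative sketch that leaves the field separator alone and locates the two tags with index() on the whole record, so commas and quotes inside the description do not matter. It assumes each record holds at most one <description>...</description> pair and that the tags appear only in the field to be split. The function name split_description is made up for illustration, and plain awk is used; on Solaris, substitute nawk.

```shell
# Hypothetical helper: reads records on stdin, writes the split records to stdout.
split_description() {
    awk '{
        open_tag  = "<description>"
        close_tag = "</description>"
        s = index($0, open_tag)     # position of the opening tag, 0 if absent
        e = index($0, close_tag)    # position of the closing tag, 0 if absent

        # Pass through records without a well-formed description unchanged.
        if (s == 0 || e == 0 || e < s) { print; next }

        # Split halfway through the text between the two tags.
        text_start = s + length(open_tag)
        mid = text_start + int((e - text_start) / 2)

        # Close the first field, open the second one, and keep the rest as is.
        printf("%s\",\"%s\n", substr($0, 1, mid - 1), substr($0, mid))
    }'
}
```

Usage would be along the lines of: split_description < oldfile > newfile. Because the record is never re-split on commas, the other fields are copied through untouched.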