sven44
April 23, 2012, 11:13am
1
Hi,
I'm trying to parse a text file which uses commas as field separators. Fields are double quoted, and may themselves contain commas, like this:
"1","John Smith","London","123"
"2","Mary Robertson","Horsham, Sussex","456"
This causes problems for the following command
cut -d"," -f4
For the first line, it correctly extracts the fourth field ["123"], but for the second, it is fooled by the comma in Mary's adress, and thinks the fourth field is [, Sussex"].
Happy to use awk or any other standard utility available. Please help!
thanks
The problem is your field separator is not comma if it's in the field.
At least not a proper one.
One way to get around is to use sed to replace "," with "|" and make pipe your separator (if it's not in the fields of course)
One way would be :
sed 's#","#"|"#g' input | cut -d "|" -f 4
Hope that helps
Regards
Peasant.
1 Like
Using awk you can set the separator to include the quotes:
$ echo '"2","Mary Robertson","Horsham, Sussex","456"' | awk -F'","' '{print $3}'
Horsham, Sussex
You'd have to strip the leading/trailing quote tho when accessing the first/last field, respectively.
$ echo '"2","Mary Robertson","Horsham, Sussex","456"' | awk -F'","' '{print $1}'
"2
$ echo '"2","Mary Robertson","Horsham, Sussex","456"' | awk -F'","' '{gsub(/(^"|"$)/,"");print $1}'
2
$ echo '"2","Mary Robertson","Horsham, Sussex","456"' | awk -F'","' '{gsub(/(^"|"$)/,"");print $4}'
456
1 Like
With Perl:
perl -MText::ParseWords -nle'
print +(parse_line(",",0, $_))[3];
' infile
GNU awk 4+:
awk '{ print $4 }' FPAT='([^,]+)|("[^"]+")' infile
1 Like
awk -F\" '{print $8}' infile
Or in this case:
awk -F\" '{print $(NF-1)}' infile
including double quotes:
awk -F, '{print $NF}' infile
1 Like
birei
April 23, 2012, 12:03pm
6
Hi sven44,
One way with perl:
$ cat infile
"1","John Smith","London","123"
"2","Mary Robertson","Horsham, Sussex","456"
$ perl -MText::CSV_XS -e 'while ( $colref = Text::CSV_XS->new->getline( ARGV ) ) { printf qq[%s\n], $colref->[-1] }' infile
123
456
1 Like
joeyg
April 23, 2012, 12:47pm
7
$ awk -F'"' '{print $8}' sample4.txt
123
456
$ awk -F'"' '{print $(NF-1)}' sample4.txt
123
456
1 Like
sven44
April 23, 2012, 1:01pm
8
Many thanks for helpful replies!
The most robust solution (in that it doesn't make assumptions about whether fields are quoted or not, etc.) seems to be radoulov's - except that my version of awk doesn't seem to support this 'FPAT' variable!
For anyone stumbling across this thread, I also found the following fiendish solution which is similarly robust. Just working out what this means really made my head hurt, but it works!
sed 's/,\("[^"]*"\)*/\n\1/g' infile
Many thanks