Replace double quotes with a single quote within a double quoted string

pchang · May 2, 2014, 4:23pm

Hi Froum.

I have tried in vain to find a solution for this problem - I'm trying to replace any double quotes within a quoted string with a single quote, leaving everything else as is.

I have the following data:

Before:

0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"

After:

0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"

In this example, VAL-D"OR is replaced as VAL-D'OR.

Thanks.

rdrtx1 · May 2, 2014, 4:29pm

try:

sed 's/VAL-D"OR/VAL-D\x27OR/' in

pchang · May 2, 2014, 4:47pm

thanks for your reply.

Sorry that was just an example.

I need to replace all double quotes occurences within a quote string with a single quote.

Text can be anything.

wisecracker · May 2, 2014, 4:53pm

Longhand on OSX 10.7.5, default bash terminal.

#!/bin/bash
# quote.sh
echo '0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"' > /tmp/text
text=$(cat < /tmp/text)
echo "$text"
text=${text/D\"O/D\'O}
echo "$text" > /tmp/txt
cat < /tmp/txt
exit 0

Results:-

Last login: Fri May  2 21:36:37 on ttys000
AMIGA:barrywalker~> ./quote.sh
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
AMIGA:barrywalker~> _

bakunin · May 2, 2014, 4:56pm

Could you please explain what a "quote[d] string" as per your definition is?

Suppose a "quoted string" in the usual sense:

quoted string begins here ->"quoted string"<-and ends here

Now suppose there would be a double quote within these two:

quoted string begins here ->"quoted"string"<-and ends here

The quoted string would now be "quoted" and not "quoted["]string", no? If not, please explain how a program is supposed to find out.

Fundamental rule of parsing: its easy once you can define what you really want. ;-))

I hope this helps.

bakunin

wisecracker · May 2, 2014, 5:01pm

OK, so in this "example" is the field constant?
If not give us something fixed to work on...

pchang · May 2, 2014, 5:05pm

I hope I can be clear on my explanation.

We are basically receiving a csv file from the vendor and any field they deemed as text, they will enclose with a double quote.

The problem arises when they also have/use a double quote as part of the data.

In my example, it's an address field of a customer and the name of the street happens to have a quote in it and vendor has used double quotes for the data (Ex: VAL D"OR instead of VAL D'OR). If vendor used a single quote instead, everything would have been okay.

This is causing us issue when cleaning up the file using sed because there's an unmatched quotes in the file. That's why I'm looking to replace any double quotes that is part of the data with a single one.

Hope this answers your questions.

Thanks.

wisecracker · May 2, 2014, 5:15pm

OK, we have established that these _odd_ quotation marks can be anywhere in any field inside the file.

1) Will there be more than three quotation marks in any one field?
2) Will there be ANY possibility of any other SINGLE quotes in any particular field?
3) Will every field be COMMA separated?

RudiC · May 2, 2014, 5:19pm

This is based on the assumption that the field separator is a comma and it is immediatly (no spaces etc.) placed beside the "quoting" double quotes, so a double quote without a comma should be replaced:

awk 'match($0,/[^,]\"[^,]/) {$0=substr($0,1,RSTART) "\047" substr($0,RSTART+2)}1' file
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"

If there's more double quotes to be replaced, you need to repeat the command on that line.

bakunin · May 2, 2014, 5:25pm

OK. The following may not be a perfect solution (coming from a less-than-rock-solid definition), but check how far you get with it.

Let us say that quotes you want to preserve are the ones immediately preceeding or following commas (which seem to be the field separators here). In addition there is a single double-quote at the beginning of the line and one at the end of the line. All the other double quotes should become single quotes.

This would work for your example, but there are cases conceivable where this ruleset could be tricked. This is why i suggest you doubly check if it works on your data or if we need to make the ruleset more solid.

Solution: first, all the sequences of "," are replaced by a placeholder (i use "@@", change it to something else if this is used in your data). Also the double-quotes at BOL and EOL are replaced. Then i change the remaining double-quotes to single-quotes and finally transfer the placeholders back.

This sounds complicated, but it makes the regexps necessary a lot easier to handle (and to understand).

sed "s/^\"/@@/;s/\"$/@@/;s/\",\"/@@/g;s/\",/@@@/g;s/,\"/@@@@/g
     s/\"/\'/g
     s/^@@/\"/;s/@@$/\"/;s/@@@@/,\"/g;s/@@@/\",/g;s/@@/\",\"/g" /path/to/input

I hope this helps.

bakunin

wisecracker · May 2, 2014, 7:00pm

Longhand using __builtins__, OSX 10.7.5, default bash terminal...

This assumes that the VERY first field is not in inverted commas...

#!/bin/bash
# quote.sh
ifs_str="$IFS"
IFS=","
echo '0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"' > /tmp/text
text=$(cat < /tmp/text)
echo "$text"
quote_array=($text)
field=1
txt="${quote_array[0]}"
while [ $field -lt ${#quote_array[@]} ]
do
	string="${quote_array[$field]}"
	if [ "${string:0:1}" == '"' ] && [ ${#string} -ge 4 ]
	then
		string="${string:1:$((${#string}-2))}"
		string="${string/\"'/'}"
		txt=$txt,\"$string\"
	else
		txt=$txt,$string
	fi
	field=$((field+1))
done
echo "$txt" > /tmp/txt
echo ""
cat < /tmp/txt
IFS="$ifs_str"
exit 0

Results:-

Last login: Fri May  2 23:22:20 on ttys000
AMIGA:barrywalker~> ./quote1.sh
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"

0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5'
000000111","IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5'
000000111","IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
AMIGA:barrywalker~> _

EDIT:
IGNORE THIS; I noticed a bug and will cure tomorrow...

HTH

Don_Cragun · May 2, 2014, 8:20pm

Do you always have two input lines per record?

Are there always 32 (comma separated) fields per record?

If neither of the above are true, how are we supposed to know whether a double-quote at the end of a line is the terminator for the last quoted field on a line (that should be left as is) or an embedded double-quote in the middle of a quoted field (that should be converted to a single-quote)?

drl · May 4, 2014, 1:20pm

Hi.

Like Don Cragun, I noticed that the (too short) sample had 2 lines. I assume they should be a single line, and so I joined them into a file data1.

My approach is to use a code that understands CSV files. Here it is, and following this listing are demonstrations on a short sample, and the supplied sample:

#!/usr/bin/env bash

# @(#) s1	Demonstrate CSV parsing, replacing, combining with perl module.
# See: perldoc Text/CSV

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C

FILE=${1-data1}

pl " Input data file $FILE:"
cat $FILE

pl " Results:"
perl -MText::CSV -lne '
BEGIN {
	$csv = Text::CSV->new(
	{ allow_loose_quotes => 1,
	always_quote => 1,
	escape_char        => "\\" });
	$sq = chr(39);
}
chomp;
# Parse, search/replace, combine, write.
if ($csv->parse($_)) {
	@cols = $csv->fields();
	for ($i=0;$i<$#cols;$i++) {
	  $cols[$i] =~ s/["]/$sq/g;
	}
	$csv->combine (@cols);
	print $csv->string();
} else {
	print " Error = ", Text::CSV->error_diag(), "\n";
	die " Parse error at line $.\n";
}
' $FILE

exit 0

producing on a short, readable sample:

$ ./s1 data2

Environment: LC_ALL = , LANG = en_US.UTF-8
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39

-----
 Input data file data2:
"ABCD2","EFGH2","XXXX","1"
"ABCD2",EFGH2,"XX"XX",2

-----
 Results:
"ABCD2","EFGH2","XXXX","1"
"ABCD2","EFGH2","XX'XX","2"

and on the OP sample:

$ ./s1 data1

Environment: LC_ALL = , LANG = en_US.UTF-8
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39

-----
 Input data file data1:
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"

-----
 Results:
"0000001111","IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR QCCA","0200","","","WD","CH","","4000320275","","124","124","","60.00","60.00","60.00","0.00","0.45","60.45","0.037500","APP","00","EXC","5"

Best wishes ... cheers, drl

wisecracker · May 4, 2014, 5:12pm

CygWin bash terminal, under Windows Vista...
(It might need a dos2unix conversion first.)
Longhand using __builtins__...

#!/bin/bash
# quote.sh
> /tmp/text
> /tmp/txt
ifs_str="$IFS"
IFS=","
echo '0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"' > /tmp/text
text=$(cat < /tmp/text)
echo "$text"
quote_array=($text)
n=0
m=0
field=1
string=""
newstring=""
txt="${quote_array[0]}"
while [ $field -lt ${#quote_array[@]} ]
do
	string="${quote_array[$field]}"
	newstring=""
	length=${#string}
	n=0
	m=0
	while [ $n -lt $length ]
	do
		if [ "${string:$n:1}" == '"' ]
		then
			m=$((m+1))
		fi
		n=$((n+1))
	done
	if [ $m -ge 3  ]
	then
		newstring=${string:1:$((${#string}-2))}
		newstring=${newstring/\"/\'}
		string=\"$newstring\"
	fi
	txt=$txt,$string
	field=$((field+1))
done
echo "$txt" > /tmp/txt
echo ""
cat < /tmp/txt
IFS="$ifs_str"
exit 0

Results using 3 copies of the original...

AMIGA:~> cd /tmp
AMIGA:/tmp> dos2unix quote.sh
dos2unix: converting file quote.sh to Unix format ...
AMIGA:/tmp> ./quote.sh
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"

0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
AMIGA:/tmp> _

EDIT:
5th May 2014, 15:25, UK time.
Now tested on OSX 10.7.5, default bash terminal and PCLinuxOS 2009, default bash terminal.
Also tested with extra newlines and quotes in random places.
Also with the line newstring=${newstring/\"/\'} replaced with newstring=${newstring//\"/\'} for multiple
instances of " but not shown here...

Aia · May 5, 2014, 12:57am

Wonder if this would perform the wanted task

perl -wple 's/(\w)"(\w)/$1\x27$2/g' wrong_quoted.file > right_quoted.file

SriniShoo · May 5, 2014, 2:06am

awk -v q="'" 'BEGIN{FS = OFS = ","}
  {for(i = 1; i <= NF; i++)
    {n = length($i);
    if(n > 1 && $i ~ /^\".*\"$/)
      T = substr($i, 2, n-2)
    gsub(/\"/, q, T)
    if(length(T) + 2 == n)
      $i = ("\"" T "\"")}}1' file

Scrutinizer · May 5, 2014, 2:49am

Hopefully, comma's are not allowed within double quotes ...

awk -F, '{for(i=1; i<=NF; i++) {s=$i; if(gsub(/^"|"$/,x,s)==2 && gsub(d,q,s)) $i=d s d}}1' OFS=, q=\' d=\" file

--edit--
OK, just noticed the approach is similar to SriniShoo's ...

pchang · May 5, 2014, 9:42am

My requirement is simple - I have an csv input file with some string fields enclosed with double quotes.

If any double quotes (data) appear in the string field, I want to replace with a single quote.

Can the script be generic enough so that it will solve any number of text strings in a record?

Thanks.

bakunin · May 5, 2014, 9:51am

As we have established in the first posts in this thread the task looks simple, but isn't. Take your pick among most of the answers here, try them on sample data and report back in which way they fail (if they fail at all) or that they are doing what you want, respectively. Probably the biggest difference will be the runtime: the fewer programs are involved and the more builtin shell functions are used the faster it will generally execute.

As far as i could see at a passing glance none of the suggeted solutions is limited in this regard. Probably execution time will differ but the difference will probably not be noticeable until you process thousands of lines. So just try it out, stop the time and find out yourself.

bakunin

Subbeh · May 5, 2014, 10:40am

I don't understand the commotion, a simple sed would suffice, no?

 
$ sed 's/\([^,]\)\"\([^,]\)/\1\x27\2/g' file
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR
 QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"