Replace double quotes with a single quote within a double quoted string

Thank you Subbeh - I decided to try your code (since it looks the simplest) but I didn't get the expected output. Instead of a single quote, I got x27. Please see the sample data below:

Input data:

0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR     QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"

Output data:

0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-Dx27OR     QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"

Try this instead:

sed "s/\([^,]\)\"\([^,]\)/\1'\2/g" file

WOW!!! - thank you.

Your code works great!!! The only thing is that it only substitutes the first occurrence of a double quote, not all occurrences within the string. See the sample data below. Could you enhance your code to replace all occurrences of double quotes within the text string?

Input Data:

0000005335,"IBD","601725","6017257002503849","0430","153854","007907","0079070
00000","E0107725","2995 BL.DAGENAIS "H"   LAVAL        QCCA","0200","","","WD","
CH","","4001857090","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"
APP","00","EXC","5"

Output Data:

0000005335,"IBD","601725","6017257002503849","0430","153854","007907","0079070
00000","E0107725","2995 BL.DAGENAIS 'H"   LAVAL        QCCA","0200","","","WD","
CH","","4001857090","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"
APP","00","EXC","5"

Hi.

In the case of "H", the H has already been consumed by a regular expression match (the trailing [^,]), and (AFAIK) the sed RE engine does not allow overlapping matches ... cheers, drl
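
To illustrate the point about non-overlapping matches, here is a minimal Python 3 sketch (Python is my assumption here; the command under discussion is sed). It applies the same pattern in a single left-to-right pass. The names PATTERN, single_pass and sample are hypothetical, and the sample is a shortened piece of the OP's record.

```python
import re

# Same idea as the sed pattern: a double quote with a non-comma on each side.
PATTERN = re.compile(r'([^,])"([^,])')

def single_pass(line):
    """Replace matches in one left-to-right pass, like a single s///g."""
    return PATTERN.sub(r"\1'\2", line)

# Shortened piece of the OP's record (assumption: one record per line).
sample = '"E0107725","2995 BL.DAGENAIS "H"   LAVAL        QCCA","0200"'
print(single_pass(sample))
```

The first match consumes the space, the quote and the H, so scanning resumes at the second quote with nothing left in front of it to satisfy the leading [^,]. That second quote is therefore left alone, which is the behaviour the OP reported.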

This could be done using look-behind and look-ahead assertions in the regexp.
A python example:

>>> import re
>>> text = '''0000005335,"IBD","601725","6017257002503849","0430","153854","007907","007907000000","E0107725","2995 BL.DAGENAIS "H"   LAVAL        QCCA","0200","","","WD","CH","","4001857090","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"'''
>>> 
>>> print re.sub(r'(?<![,\A])"(?!(?:(,|\Z)))',"'",text)
0000005335,"IBD","601725","6017257002503849","0430","153854","007907","007907000000","E0107725","2995 BL.DAGENAIS 'H'   LAVAL        QCCA","0200","","","WD","CH","","4001857090","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
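
For anyone on Python 3, a rough sketch of the same look-around idea follows. Note that this is not the pattern posted above: as far as I know, Python 3 rejects \A inside a character class, so this version instead requires a character on each side of the quote that is neither a comma nor a newline. EMBEDDED_QUOTE and fix_embedded_quotes are hypothetical names.

```python
import re

# A quote counts as embedded when there is a character on both sides of it
# and neither of them is a comma or a newline.
# Look-arounds consume nothing, so adjacent quotes such as "H" are both seen.
EMBEDDED_QUOTE = re.compile(r'(?<=[^,\n])"(?=[^,\n])')

def fix_embedded_quotes(text):
    """Return text with embedded double quotes turned into single quotes."""
    return EMBEDDED_QUOTE.sub("'", text)

# Shortened version of the record from the thread.
text = '0000005335,"IBD","2995 BL.DAGENAIS "H"   LAVAL        QCCA","0200","",,60.00,"5"'
print(fix_embedded_quotes(text))
```

Like the original, it does nothing special for a field that consists of a lone comma in quotes.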

Hi klashxx...

Just a couple of points here.
1) It looks as though you are using Python 2.x (2.7.x), so your print statement would not
work on 3.x. Parentheses are needed to make 'print' a function: print(your stuff here)
2) Can you guarantee that all UNIX, Linux, Cygwin, etc. systems have a Python version installed?

EDIT:-
To the OP: is the part above the ^ character an error?

Last login: Mon May  5 19:09:15 on ttys000
AMIGA:barrywalker~> python
Python 2.7.1 (r271:86832, Aug  5 2011, 03:30:24) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> text='''1234,"","qwerty",," abcd efg "h" bcd"efg
...  dfg",","","5"'''
>>> print text
1234,"","qwerty",," abcd efg "h" bcd"efg
 dfg",","","5"
>>> 
>>> 
>>> print re.sub(r'(?<![,\A])"(?!(?:(,|\Z)))',"'",text)
1234,"","qwerty",," abcd efg 'h' bcd'efg
 dfg",","","5"
>>> # ^ is this an error?
... # Should it be a "'"?
... 
>>> exit()
AMIGA:barrywalker~> _

Hi pchang,
As has been stated many times, your simple request is ambiguous. With your stated requirements and the 4-line input sample:

0000005335,"IBD","601725","6017257002503849","0430","153854","007907","0079070
00000","E0107725","2995 BL.DAGENAIS "H"   LAVAL        QCCA","0200","","","WD","
CH","","4001857090","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"
APP","00","EXC","5"

there are at least 2**46 (i.e., 2 raised to the 46th power) different answers that meet your stated criteria. For example, one possible result that meets all of your stated requirements is:

0000005335,"IBD','601725','6017257002503849','0430','153854','007907','0079070
00000','E0107725','2995 BL.DAGENAIS "H"   LAVAL        QCCA','0200','','','WD','
CH','','4001857090','','124','124',,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,'
APP','00','EXC','5"

You have to tell us what constitutes a quoted string. In most CSV format files (using comma as the field separator) with quoted strings, comma can appear as a regular character in a quoted string, newline can appear as a regular character in a quoted string, in fact anything except an unescaped quoting character and an unescaped escape character can appear in a quoted string. But you don't have an escape character and you have unescaped quoting characters in your quoted string. So we need unambiguous rules that specify which double-quote characters start a quoted field and which double-quote characters end a quoted field.

For example, if the following rules correctly state your requirements, I can give you an awk script that will do what you want:

  1. The input file is a text file. (By definition this means there are no null bytes in the file, there are no lines longer than LINE_MAX bytes, and (unless the file is an empty file) the last character in the file is a newline character.)
  2. The start of a double-quoted field occurs when the first character of a field is a double-quote character. (This character can be referred to as an opening double-quote.)
  3. The end of a double-quoted field is delimited by a double-quote character that is not an opening double-quote and that is immediately followed by a comma or a newline character. (This character can be referred to as a closing double-quote.)
  4. Fields shall be separated by a comma that is not in a double-quoted field.
  5. Any double-quote character in a double-quoted field other than the opening double-quote and the closing double-quote shall be converted to a single-quote character.
  6. Double-quote characters that are not an opening double-quote, not a closing double-quote, and not in a double-quoted field shall not be changed.
  7. A record shall be terminated by a newline character that is not in a double-quoted field.

Do these rules accurately describe your input file format?

If they do, I'll clean up my awk script and post it.

If they don't, give us your set of UNAMBIGUOUS rules and maybe we'll be able to help you.

This should fix it:

sed -e :a -e "s/\([^,]\)\"\([^,]\)/\1'\2/g;ta" file
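
The :a label and the ta branch make sed repeat the substitution on the same line until a pass changes nothing. A rough Python 3 equivalent of that loop, for illustration only (replace_until_stable is a hypothetical name, and the input is assumed to be a single line without its newline, as in sed's pattern space):

```python
import re

# Same pattern as the sed command: a quote with a non-comma on each side.
PATTERN = re.compile(r'([^,])"([^,])')

def replace_until_stable(line):
    """Repeat the substitution until a pass makes no replacement."""
    while True:
        new_line, count = PATTERN.subn(r"\1'\2", line)
        if count == 0:
            return new_line
        line = new_line

sample = '"E0107725","2995 BL.DAGENAIS "H"   LAVAL        QCCA","0200"'
print(replace_until_stable(sample))
```

Every replacement removes one double quote, so the loop always terminates.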

Hi subbeh...
Again, without knowing anything about the OP's data, there could be something like .......,",....... from an accidental typo on input, and it WILL still create an odd number of inverted commas (double quotes)...
None of us have accounted for it in any code written so far...

Last login: Mon May  5 21:47:34 on ttys000
AMIGA:barrywalker~> echo '1234,"","qwerty",," abcd efg "h" bcd"efg
>   dfg",","","5"' > /tmp/text
AMIGA:barrywalker~> sed -e :a -e "s/\([^,]\)\"\([^,]\)/\1'\2/g;ta" < /tmp/text
1234,"","qwerty",," abcd efg 'h' bcd'efg
  dfg",","","5"
AMIGA:barrywalker~> #  dfg",","","5"
AMIGA:barrywalker~> # Error ^ here?
AMIGA:barrywalker~> # We have no idea from the OP if this will ever occur.
AMIGA:barrywalker~> _

Hi wisecracker, I think I understand the OP's request quite clearly the way he described it. And although you make a valid point, I consider it a data entry issue, which I think is out of scope here.

Nice solution Subbeh

I think this is an awk equivalent:

awk '{
   for(i=2;length(v=substr($0,i++,2));)
      if (v==",\"") i++
      else if(v ~ "\"[^,]") $0=substr($0,1,i-2) "\x27" substr($0,i)
}1' file 

---------- Post updated at 08:29 AM ---------- Previous update was at 08:17 AM ----------

Or using GNU awk:

awk '{while((t=gensub(/([^,])"([^,])/, "\\1\x27\\2",$0))!=$0)$0=t}1' file

Hi @wisecracker, I just wanted to show a practical use of look-ahead/look-behind assertions. For me the important thing here is not the print statement, and of course I can't guarantee Python on all *nix boxes.

Also, I didn't take care of fields containing just a single double quote.

Cheers.

Sample awk translator

cat translator.awk
 
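# Split records on every comma. This assumes no field contains a comma.
BEGIN { FS="," }
{
  for (i=1; i<=NF; i++) {
    printf "%s", translateEmbeddedQuote($i)
    if ( i< NF ) printf ","
    else printf "\n"
  }
}
# Return s with embedded double quotes changed to single quotes.
# m and ts are local variables.
function translateEmbeddedQuote(s, m,ts) {
  m=match(s,/^".*"$/)                   # is the field wrapped in double quotes?
  if (m) ts=substr(s, 2, length(s)-2 )  # strip the enclosing quotes
  else ts=s
  if (length(s)>1) gsub("\"","'",ts)    # a one-character field is left alone
  if (m) ts="\"" ts "\""                # restore the enclosing quotes
  return ts
}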

Input data:

cat valdor.txt
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D"OR QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"

Run the test:

awk -f translator.awk valdor.txt
0000001111,"IBD","601725","6017257000681563","0430","163458","002820","002820000000","E0107815","1801 3E AVENUE         VAL-D'OR QCCA","0200","","","WD","CH","","4000320275","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"APP","00","EXC","5"
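
For comparison, the same per-field approach can be sketched in Python 3. This is only an illustration of the technique, with hypothetical names translate_field and translate_line, and it shares the translator's assumption that no field contains a comma.

```python
def translate_field(field):
    """Convert embedded double quotes in one comma-separated field."""
    # Wrapped means a quote at both ends, so at least two characters.
    wrapped = len(field) >= 2 and field.startswith('"') and field.endswith('"')
    inner = field[1:-1] if wrapped else field
    if len(field) > 1:                 # a one-character field is left alone
        inner = inner.replace('"', "'")
    return '"' + inner + '"' if wrapped else inner

def translate_line(line):
    """Split on every comma, translate each field, and join again."""
    return ",".join(translate_field(f) for f in line.split(","))

print(translate_line('"E0107815","1801 3E AVENUE         VAL-D"OR QCCA","0200"'))
# Caveat: a comma inside a quoted field splits it, so its quotes get converted.
print(translate_line('"a,b"'))
```

The second call shows the limitation: the enclosing quotes of a field that contains a comma are themselves converted, so this approach only suits data where commas never appear inside fields.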