I have pipe-delimited (|) data in a file, and all the fields are enclosed in double quotes (" "). If a " is present inside the data, I have to replace it with \".
Example:
Input: "abc"|"test " user""|"A"B" user"
Output: "abc"|"test \" user\""|"A\"B\" user"
I tried the command below, but it does not give me the expected result.
sed 's/\([^|"]\)\"\([^"|]\)/\1\\"\2/g;'
@suneelkumar.mekala
Welcome to the community.
Get used to wrapping your code/data samples in markdown code tags.
I did it for you this time, but please do so going forward.
Otherwise it's hard to see what you're after, particularly in THIS case.
@suneelkumar.mekala
Embedded double quotes should be perfectly normal for CSV (in your case PSV, pipe-separated values) files.
But if you insist, here's one way with gawk:
gawk -F'|' -f sunee.awk OFS='|'
where sunee.awk is:
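# Escape the double quotes embedded inside a quoted field.
# "a" is a local array filled by gawk's three-argument match().
function deq(str,    a) {
    # a[1]/a[3]: whitespace around the field, a[2]: text between the outer quotes
    if (!match(str, /^(\s*)\x22(.*)\x22(\s*)$/, a))
        return str
    else {
        gsub(/\x22/, "\\\\&", a[2])    # prefix each embedded quote with a backslash
        return a[1] "\x22" a[2] "\x22" a[3]
    }
}
{
    for (i = 1; i <= NF; i++)
        $i = deq($i)
    print
}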
yielding:
$ echo '"abc"|"test " user""|"A"B" user"'| gawk -F'|' -f sunee.awk OFS='|'
"abc"|"test \" user\""|"A\"B\" user"
and
$ echo '"abc"| "test " user""|"A"B" user" '| gawk -F'|' -f sunee.awk OFS='|'
"abc"| "test \" user\""|"A\"B\" user"
You didn't mention your OS; the above is gawk-specific (gawk is the usual awk on Linux), but it can be modified to be awk-version-agnostic.
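As a rough sketch of what an awk-version-agnostic variant could look like, the wrapper below avoids the gawk-only features (three-argument match(), \s, \x22) and relies only on sub(), gsub() and substr(). The function name escape_fields and the local variable names are made up for illustration, and the sketch assumes that | never occurs inside a field.

```shell
#!/bin/sh
# escape_fields is a hypothetical helper name, not part of the original answer.
escape_fields() {
    awk -F'|' -v OFS='|' '
    function deq(str,    lead, trail, body) {
        # Leave the field alone unless it is wrapped in double quotes
        if (str !~ /^[ \t]*".*"[ \t]*$/)
            return str
        lead = str
        sub(/".*$/, "", lead)      # whitespace before the opening quote
        trail = str
        sub(/^.*"/, "", trail)     # whitespace after the closing quote
        # Text between the outer quotes
        body = substr(str, length(lead) + 2, length(str) - length(lead) - length(trail) - 2)
        gsub(/"/, "\\\\&", body)   # put a backslash in front of each embedded quote
        return lead "\"" body "\"" trail
    }
    {
        for (i = 1; i <= NF; i++)
            $i = deq($i)
        print
    }'
}

# Same two samples as above; the second one has whitespace around the fields
printf '%s\n' '"abc"|"test " user""|"A"B" user"' | escape_fields
printf '%s\n' '"abc"| "test " user""|"A"B" user" ' | escape_fields
```

The replacement string "\\\\&" is kept from the gawk version because that form is handled the same way by the common awk implementations.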
Both neighbors of a matching " must be anything but a |:
sed -E 's/([^|])"([^|])/\1\\"\2/g'
But this misses a second " that closely follows a match, because each /g iteration resumes after the text consumed by the previous match, and that text includes the neighboring character.
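A small sketch may make the problem visible. The first function is the single-pass command from above; the second is one possible sed-only workaround (an assumption of this sketch, not something proposed in the thread) that repeats the substitution until nothing changes. The function names and the label name are made up for illustration.

```shell
#!/bin/sh
sample='"abc"|"test " user""|"A"B" user"'

# Single pass with /g: the B in A"B is consumed by the first match,
# so the quote right after it has no left neighbor left to match.
escape_once() {
    sed -E 's/([^|])"([^|])/\1\\"\2/g'
}

# Hypothetical workaround: substitute one quote at a time and branch back
# while a substitution succeeded. A backslash is excluded on the left so
# that quotes escaped in an earlier pass do not match again.
escape_loop() {
    sed -E -e ':again' -e 's/([^|\\])"([^|])/\1\\"\2/' -e 't again'
}

printf '%s\n' "$sample" | escape_once   # prints: "abc"|"test \" user\""|"A\"B" user"
printf '%s\n' "$sample" | escape_loop   # prints: "abc"|"test \" user\""|"A\"B\" user"
```

The loop has costs: it rescans the line after every replacement, and it skips a quote that already follows a literal backslash in the data. A tool that can test the neighbors without consuming them avoids both.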
Perl supports lookbehind/lookahead assertions, which test the neighbors without consuming them, so let's switch to Perl!
perl -lpe 's/(?<=[^|])"(?=[^|])/\\"/g'
BTW, RFC 4180 wants embedded quotes to be doubled instead:
perl -lpe 's/(?<=[^|])"(?=[^|])/""/g'
Thank you.
It works except for one scenario, where it does not give the required output.
Script used:
perl -lpe 's/\\/\\\\/g;s/(?<=[^|])"(?=[^|])/\\"/g'
Input data:
""27/2/24(defg)
"123456
TESTING BLACK""|"FALSE"
Current Output:
"\"27/2/24(defg)
"123456
TESTING BLACK\""|"FALSE"
Expected Output:
"\"27/2/24(defg)
\"123456
TESTING BLACK\""|"FALSE"
The simple substitution works line by line, and without further flag variables it treats a " at the beginning of a line as the start of a new record.
The following (executable!) sed script might better suit your needs:
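#!/usr/bin/sed -f
# A record starts with a quote: append lines until one ends with a quote
/^"/{
:L1
/"$/b endrec
$b endrec
N
b L1
}
:endrec
# Escape every quote, then restore the quotes that delimit the fields
s/"/\\"/g
s/^\\"/"/
s/\\"$/"/
s/\\"|\\"/"|"/g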
It gathers a record in the pattern space; a " at the end of a line marks the end of the record.
Since sed lacks lookbehind/lookahead, the substitution is applied globally first and then undone at the field boundaries.
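To try the script without creating a file, it can be passed to sed inline. The sketch below wraps it in a function (escape_records is a made-up name) and feeds it the three-line record from the question.

```shell
#!/bin/sh
# escape_records is a hypothetical wrapper around the sed script above.
escape_records() {
    sed '
/^"/{
:L1
/"$/b endrec
$b endrec
N
b L1
}
:endrec
s/"/\\"/g
s/^\\"/"/
s/\\"$/"/
s/\\"|\\"/"|"/g
'
}

# The record spans three input lines; sed joins them before substituting
printf '%s\n' '""27/2/24(defg)' '"123456' 'TESTING BLACK""|"FALSE"' | escape_records
# prints:
# "\"27/2/24(defg)
# \"123456
# TESTING BLACK\""|"FALSE"
```

Two limitations follow from how the script finds boundaries: a line inside a record that happens to end with an embedded " closes the record early, and an embedded "|" sequence inside a field is restored as if it were a field boundary.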