Removing " from a text using awk

Jotne · October 24, 2013, 7:26am

I was experimenting and am trying to remove every " from this string, but not \"

cat file
The quick brown fox "jumps", over  the 'lazy \"dog\"'

result requested: The quick brown fox jumps, over the 'lazy \"dog\"'

I have seen a working solution for sed , but I prefer awk

This code seems to work, but for some reason it also removes the blank space after fox , why?

awk '{gsub(/[^\\]\"/,x)}1' file
The quick brown foxjump, over  the 'lazy \"dog\"'

As far as I understand, this means "not \ , followed by " ", so why is the space gone?

EDIT:
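I found out why.
[^\\] matches any single character except \ , and that character is removed together with the " , so the space is gone and so is the s in jumps.
Any idea on how to fix this in awk ?
gensub should work, but it is a GNU awk extension and so not very portable.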
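
One way to see what the pattern consumes is to wrap each match in brackets instead of deleting it. This is only a minimal sketch, assuming a POSIX awk; the function name show_matches and the shortened sample line are made up for illustration.

```shell
# Wrap each match of /[^\\]"/ in [ ] so the consumed characters are visible.
# (show_matches is a hypothetical helper name, not part of the thread.)
show_matches() {
    awk '{gsub(/[^\\]"/,"[&]")}1'
}

printf '%s\n' 'fox "jumps", over \"dog\"' | show_matches
# prints: fox[ "]jump[s"], over \"dog\"
```

Each bracketed match holds two characters: the quote and whatever precedes it. The escaped quotes are left alone because the character before them is a backslash.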

---------- Post updated at 13:26 ---------- Previous update was at 12:57 ----------

I found two working versions.
The first is not perfect, but is portable.
The second is less portable.

awk '{gsub(/\\\"/,"_#_");gsub(/\"/,x);gsub(/_#_/,"\\\"")}1'

awk '{print gensub(/([^\\])\"/, "\\1", "g")}'

pamu · October 24, 2013, 7:36am

Try one more

$ awk -F '\\\\"' '{for(i=1;i<=NF;i++){gsub("\"","",$i)}}1' OFS='\\"' file

The quick brown fox jumps, over  the 'lazy \"dog\"'

Akshay_Hegde · October 24, 2013, 7:45am

Would this help?

$ awk '{gsub(/[^\\A-Za-z]\"/," ");gsub(/"\,/,",")}1' file

Resulting

The quick brown fox jumps, over  the 'lazy \"dog\"'

Your s is missing because gsub(/[^\\]\"/," ") replaces the " together with the character in front of it, whether that is a space or a letter. It was supposed to be
gsub(/[^\\A-Za-z]\"/," ")

For example

$ echo "test \"demo\" " | awk '{gsub(/[^\\]\"/," ");}1'
test dem  # o is missing here

$ echo "test \"demo\" " | awk '{gsub(/[^\\A-Za-z]\"/," ");}1'
test demo"

Jotne · October 24, 2013, 7:57am

@Akshay Hegde
Your solution fails with this line:

The quick brown fox "jumps", over  the 'lazy \"dog\"' here are some "more data"

it gives

The quick brown fox jumps, over  the 'lazy \"dog\"' here are some more data"

The final " is not removed.

Akshay_Hegde · October 24, 2013, 8:04am

jotne:

@Akshay Hegde
Your solution fails with this line:
The quick brown fox "jumps", over  the 'lazy \"dog\"' here are some "more data"
it gives
The quick brown fox jumps, over  the 'lazy \"dog\"' here are some  more data"
The " in front of more is lost.

@pamu
Yours fails with both of the last " giving
The quick brown fox jumps, over  the 'lazy \"dog\"' here are some  more data
Edit: and my solution fails with this too

Thanks Jotne

Try this; it works for the given input:

$ awk '{gsub(/[^\\A-Za-z]\"|"$/," ");gsub(/"\, /,",")}1' file
The quick brown fox jumps,over  the 'lazy \"dog\"' here are some more data

OR

$ awk '{for(i=1;i<=NF;i++){if($i~/^"|"$/)gsub(/"/,"",$i);printf "%s%s",$i,FS}print ""}' file
The quick brown fox jumps, over the 'lazy \"dog\"' here are some more data

Jotne · October 24, 2013, 8:31am

I mixed some things up, sorry.
pamu's solution is OK, and so is mine.
And Akshay Hegde's now works fine.

disedorgue · October 24, 2013, 8:50am

Hi,
If you use & (the matched text) in the replacement:

awk '{gsub(/[^\\]"/,"&\"");gsub(/""/,"")}1' file

Regards.

Jotne · October 24, 2013, 9:04am

Smart idea.
You just add something extra after each standalone " , then remove both.

Just a small variation (saves two characters):

awk '{gsub(/[^\\]"/,"&_");gsub(/"_/,x)}1' file
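
Running the two steps separately makes the trick visible. The following is a sketch only, assuming a POSIX awk; the function names tag_quotes and strip_tagged and the shortened sample line are invented for illustration, and "" is used in place of the unset variable x.

```shell
# Step 1 only: append "_" to every quote that is not preceded by a backslash.
tag_quotes() {
    awk '{gsub(/[^\\]"/,"&_")}1'
}

# Steps 1 and 2: tag the quotes, then delete each quote together with its tag.
strip_tagged() {
    awk '{gsub(/[^\\]"/,"&_");gsub(/"_/,"")}1'
}

line='fox "jumps", over \"dog\"'
printf '%s\n' "$line" | tag_quotes
# prints: fox "_jumps"_, over \"dog\"
printf '%s\n' "$line" | strip_tagged
# prints: fox jumps, over \"dog\"
```

Because & puts the matched text back, the character in front of the quote survives the first gsub, and the second gsub only has to delete the quote and its tag.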

RavinderSingh13 · October 24, 2013, 10:50am

Hello Akshay,

Could you please explain both pieces of code a bit?

Thanks,
R. Singh

Scrutinizer · October 24, 2013, 11:43am

Both suggestions will run into trouble if the replacement string is already present on the line, or if a double quote is in the first position.

--
Try:

awk '{gsub(/\\"/,RS); gsub(/"/,x); gsub(RS,"\\\"")}1' file

Since RS is the record separator ("\n" by default), we can be sure that it never appears within a line, so it is a suitable intermediate character.
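
As a rough check, the command can be fed a line that combines the awkward cases mentioned so far: a quote in the first position, consecutive quotes, and a literal _#_ . This is a sketch assuming a POSIX awk with the default RS; the function name strip_quotes and the test line are made up, and "" stands in for the unset variable x.

```shell
# Park escaped quotes on RS (newline), drop the remaining quotes, restore.
# (strip_quotes is a hypothetical wrapper around the one-liner above.)
strip_quotes() {
    awk '{gsub(/\\"/,RS); gsub(/"/,""); gsub(RS,"\\\"")}1'
}

printf '%s\n' '"a" ""b"" \"c\" _#_' | strip_quotes
# prints: a b \"c\" _#_
```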

disedorgue · October 24, 2013, 12:28pm

We can always construct a string that is not present. In the meantime, here is a fix for the case of a double quote in the first position:

awk '{gsub(/^"|[^\\]"/,"&ASTRINGTHATNOTEXIST");gsub(/"ASTRINGTHATNOTEXIST/,"")}1' file

Regards.

Scrutinizer · October 24, 2013, 12:48pm

disedorgue:

We can always construct a string that is not present. In the meantime, here is a fix for the case of a double quote in the first position:
awk '{gsub(/^"|[^\\]"/,"&ASTRINGTHATNOTEXIST");gsub(/"ASTRINGTHATNOTEXIST/,"")}1' file
Regards.

That will still pose a problem if there are two consecutive double quotes ( "" ) in the input file.
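
The consecutive-quote problem is easy to reproduce. This is a minimal sketch, assuming a POSIX awk; the function name strip_with_marker and the sample input are invented, and the function simply wraps the command quoted above.

```shell
# Tag unescaped quotes with a marker string, then delete quote plus marker.
# (strip_with_marker is a hypothetical name for the quoted one-liner.)
strip_with_marker() {
    awk '{gsub(/^"|[^\\]"/,"&ASTRINGTHATNOTEXIST");gsub(/"ASTRINGTHATNOTEXIST/,"")}1'
}

printf '%s\n' 'say ""hi"" now' | strip_with_marker
# prints: say "hi" now
```

The second quote of each pair is never tagged, because the character that would have to precede it (the first quote) has already been consumed by the previous match.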

shamrock · October 24, 2013, 2:00pm

Try this awk one liner...

awk -F'"' '{for (i=1;i<=NF;i++) printf("%s%s",$i,(i < NF ? ($i ~ /\\$/ ? FS : "") : "\n"))}' file

alister · October 24, 2013, 2:45pm

Golfing with ORS:

awk '{ORS=(/\\$/ ? RS : x)}1' RS=\"

That should always work, except in the unlikely case that ....

Can you figure it out?

Regards,
Alister

wisecracker · October 24, 2013, 2:48pm

Sorry Jotne...

As I know little awk and its derivatives I decided to use shell builtins and /bin commands
only to see how easy it is...

#!/bin/sh
# Using shell builtins and /bin ONLY...
# Note: ${var:offset:length} is a bash extension, so /bin/sh must be bash (as on OS X).
# Generate the string.
printf '%s' 'The quick brown fox "jumps", over the lazy \"dog\".' > /tmp/text
# Load the file into a string variable.
text=$(cat /tmp/text)
# Show it...
printf '%s\n' "$text"
newtext=""
subscript=0
length=${#text}
while [ "$subscript" -lt "$length" ]
do
	if [ "${text:$subscript:2}" = '\"' ]
	then
		# Escaped quote: keep both characters.
		newtext=$newtext'\"'
		subscript=$((subscript + 2))
	elif [ "${text:$subscript:1}" = '"' ]
	then
		# Bare quote: skip it.
		subscript=$((subscript + 1))
	else
		newtext=$newtext${text:$subscript:1}
		subscript=$((subscript + 1))
	fi
done
# Print the final string...
printf '%s\n' "$newtext"

Results on OSX 10.7.5 using /bin/sh...

Last login: Thu Oct 24 17:44:02 on ttys000
AMIGA:barrywalker~> ./quotes.sh
The quick brown fox "jumps", over the lazy \"dog\".
The quick brown fox jumps, over the lazy \"dog\".
AMIGA:barrywalker~>

I liked this challenge but only wish I knew more about awk...

Scrutinizer · October 24, 2013, 3:00pm

@alister, very nice! The only caveat would be that if a stretch of input without double quotes is too long, the maximum record length gets exceeded...

@wisecracker, that is off-topic, considering that this thread is specifically about awk .

disedorgue · October 25, 2013, 5:41am

The revised script below seems to work fine (I have not found an example where it fails):

awk '{gsub(/^"|[^\\]"/,"&\\");gsub(/^"|"\\"?/,"")}1' file

Advantage: no need to find a string that does not exist in the input...

Problem for me (at the moment): I don't know how to explain why it works. :o

Regards.

Scrutinizer · October 25, 2013, 11:25am

disedorgue:

The revised script below seems to work fine (I have not found an example where it fails):
awk '{gsub(/^"|[^\\]"/,"&\\");gsub(/^"|"\\"?/,"")}1' file
Advantage: no need to find a string that does not exist in the input...

Problem for me (at the moment): I don't know how to explain why it works. :o

Regards.

This does, however, pose a problem with \"\ in the input.
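
A short input shows the failure. This is a sketch assuming a POSIX awk; the function name strip_with_backslash and the sample input are made up, and the function just wraps the command quoted above.

```shell
# Tag unescaped quotes with a trailing backslash, then delete quote plus tag.
# (strip_with_backslash is a hypothetical name for the quoted one-liner.)
strip_with_backslash() {
    awk '{gsub(/^"|[^\\]"/,"&\\");gsub(/^"|"\\"?/,"")}1'
}

printf '%s\n' 'a \"\ b' | strip_with_backslash
# prints: a \ b
```

The escaped quote is not tagged by the first gsub, but since it happens to be followed by a backslash it looks exactly like a tagged quote to the second gsub and is deleted.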

disedorgue · October 25, 2013, 2:09pm

Thanks for finding this problem with \"\ . I don't think a solution along these lines exists without using a string that does not occur in the file.

Regards.

alister · October 25, 2013, 2:55pm

True, but that can be an issue regardless of record delimiter. I hope that implementations with a hardcoded limit are smart enough to fail loudly (scream and stop) in such cases.

The caveat I had in mind was the case of input data ending with a backslash (which would produce a spurious trailing double quote). This would not be a valid text file, but "text" that does not end with a newline isn't unheard of.
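
The trailing-backslash case can be reproduced as follows. This is only a sketch, assuming an awk that accepts a command-line RS assignment when reading standard input (as the original one-liner does); the function name strip_by_record and both sample inputs are invented for illustration.

```shell
# Split the input on double quotes; put a quote back only after a record
# that ends with a backslash. (strip_by_record is a hypothetical name.)
strip_by_record() {
    awk '{ORS=(/\\$/ ? RS : x)}1' RS=\"
}

# Ordinary text file: escaped quotes survive, bare quotes are dropped.
printf '%s\n' 'a "b" \"c\"' | strip_by_record
# prints: a b \"c\"

# Input ending in a backslash with no final newline: a spurious quote appears.
printf 'a "b" c\\' | strip_by_record
# prints: a b c\"   (without a trailing newline)
```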

Regards,
Alister