Replace double quotes inside the string data for all the columns

SSrini · May 15, 2018, 10:33pm

Hi,

I have input data is below format and n of column in the multiple flat files. the string data has any double quotes(") values replaced to double double quotes for all the columns{"").

Also, my input flat file each column string data has carriage of new line too. this new lines also should replace double double quotes if data has presence single double quotes.

Please advise me how to handle this using awk or sed comand.

Input :
emp|^^|ename|^^|sal|^^|refname
"PM"|^^|"R"A"M"|^^|100|^^|"X""YZ"
"RM"|^^|"PR
RED"
ORANG"
SRI"|^^|500|^^|"ABCD"
"AM"|^^|""MAR""|^^|200|^^|"ABC""

Output:
emp|^^|ename|^^|sal|^^|refname
"PM"|^^|"R""A""M"|^^|100|^^|"X"""YZ"
"RM"|^^|"PR
RED""
ORANG""
SRI"|^^|500|^^|"ABCD"
"AM"|^^|"""MAR"""|^^|200|^^|"ABC"""

Thanks,

Regard,
Srini

jim_mcnamara · May 15, 2018, 11:54pm

What code have you tried? Please let us know what OS and shell you are using - it helps us to give you the best possible answer.

SSrini · May 16, 2018, 12:00am

Hi jim mcnamara,

Thanks for the reply,
we are using redhat 7 version and tried using sed command.
The flatfile data have carriage of new line for some of the columns, in that scenarios sed replacement is not working.

Sample Code :

### change Double Quote value (") change to double double quote (""). 
sed  -i 's/"/""/g;s/\\b^""""|^|\b\/""""/"""/g; s/\\b|^|""""\b\/""""/"""/g' $Inpufile
#### Add envlose character double quote value for each column
sed  -i 's/|^^|""/|^^|"/g; s/""|^^|/"|^^|/g' $Inpufile
##remove " value in begin and end of each row###########
sed -i  's/^""/"/;s/""$/"/' $Inpufile

please advise easiest way to replace double quote values inside the string for each column.

Thanks,

rovf · May 16, 2018, 1:17am

In the example you gave, the input line has 10 double quotes, while the output line has 12 double quotes. If every double quote is supposed to be doubled, there should be 20 double quotes in the output.

RudiC · May 16, 2018, 3:51am

I have to second rovf in that neither your description nor the samples are consistent. Even with the assumption that double quotes WITHIN the fields should be doubled, excluding the ones enclosing the field, I can't find a logical explanation why "X""YZ" should become "X"""YZ" .

RudiC · May 16, 2018, 3:54am

How about

awk -F\| '
        {while (NF < 7) {getline X
                         $0 = $0 "\001" X
                        }
         for (i=1; i<=NF; i++)  {gsub (/"/, "&&", $i)
                                 sub (/^"/, "", $i)
                                 sub (/"$/, "", $i)
                                }
         gsub (/\001/, "\n")
        }
1
' OFS=\| file
emp|^^|ename|^^|sal|^^|refname
"PM"|^^|"R""A""M"|^^|100|^^|"X""""YZ"
"RM"|^^|"PR
RED""
ORANG""
SRI"|^^|500|^^|"ABCD"
"AM"|^^|"""MAR"""|^^|200|^^|"ABC"""

SSrini · May 16, 2018, 7:54am

Hi Rudic,

Thanks a lot for your email.
I have tested with sample data, it's working as expected. I will test multiple flat files and let you know in case of any other issues.

Could you please explain me how above awk command replacing double double quotes where ever the double quotes inside the string value for each columns.

Thanks again.

Regards,
Srini

RudiC · May 16, 2018, 8:23am

It's following the same logic as does your sed approach but takes advantage of awk having a "finer grained" access to parts of the line, i.e. fields. Double any double quote in each field with gsub (/"/, "&&", $i) , and then remove the first and the last in the field.

SSrini · May 17, 2018, 3:54am

Hi Rudic,

Thanks for the email.
Could you please explain what is the reason NF<7(number of field) conditions? due to this,some of the flat files are not working i guess.

awk  -F\| '
        {while (NF < 7) {getline X
                         $0 = $0 "\001" X
                        }
         for (i=1; i<=NF; i++)  {gsub (/"/, "&&", $i)
                                 sub (/^"/, "", $i)
                                 sub (/"$/, "", $i)
                                }
         gsub (/\001/, "\n")
        }
1
' OFS=\| test file

Thanks,

Regards,
Srinivasan S

Thanks,

Srini

RudiC · May 17, 2018, 5:05am

Then, the sample given was not representative, I guess.

The <new line> char in some fields starts a new line in *nix text files, and awk sees exactly those. So the code repeatedly appends lines until all fields are present. If the fields count is not constant across lines or even files, things become way more complicated. You need to define a criterion to tell a complete line from a broken one.

wisecracker · May 17, 2018, 10:18am

H srini...

emp|^^|ename|^^|sal|^^|refname
"PM"|^^|"R""A""M"|^^|100|^^|"X"""YZ"
"RM"|^^|"PR
RED""
ORANG""
SRI"|^^|500|^^|"ABCD"
"AM"|^^|"""MAR"""|^^|200|^^|"ABC"""

Just an observation here, on line 2 there are 13 double quotes. Should there be 14 or should there be 12 as an odd number seems just exactly that - very ODD.

SSrini · May 18, 2018, 3:56am

Hi Rudic and wisecracker,

Thanks for the email.
We have requested the source team for changing the delimited format from |^^| to |~~|.

After changing this. the NF command count is working as expected.
somehow the carrot symbol delimiter number of field count is not working.

dlmt="[|]~~[|]"
vcol_cnt=`head -1 $Inpufile|awk -F $dlmt '{print NF}'`
echo "Number of Columns:$vcol_cnt"
 

awk  -v colcount=$vcol_cnt -F $dlmt  '
        {while (NF < colcount) {getline X
                         $0 = $0 "\001" X
                        }
         for (i=1; i<=NF; i++)  {gsub (/"/, "&&", $i)
                                 sub (/^"/, "", $i)
                                 sub (/"$/, "", $i)
                                }
         gsub (/\001/, "\n")
        }
1
' OFS=\, $Inpufile

Thanks,
Srini

RudiC · May 18, 2018, 4:04am

I don't think the delimiter influences the logics. And, you seem to be sure that the first line definitely determines the field count. If you can rely on that - why not do all of it in one awk script?

SSrini · May 18, 2018, 6:13am

Hi Rudic,

Thanks for the email.
Could you please possible to provide the sed command for the above scenario instead of awk command? the awk command is working fine but am seeing the performance wise slow.

Thanks,

Regard,
Srinivasan S

RudiC · May 19, 2018, 1:37pm

I'd be surprised if the sed version gave a noticeable performance improvement over the awk solution, but if it did, I'd ask you to report back some numbers. Try

sed 's/"/""/g; s/"\?|"\?/|/g; /|/ s/^"\|"$//g' file