How to ignore Pipe in Pipe delimited file?

rohit_shinez · August 8, 2013, 1:50pm

Hi guys,

I need to know how i can ignore Pipe '|' if Pipe is coming as a column in Pipe delimited file

for eg:


file 1:
xx|yy|"xyz|zzz"|zzz|12...
using below awk command

awk 'BEGIN {FS=OFS="|" } print $3

i would get xyz

But i want as :

xyz|zzz to consider as whole column representing as 3rd coulmn in that file

Jotne · August 8, 2013, 2:51pm

You can do this

awk -F'"' '{print $2}' file
xyz|zzz

rohit_shinez · August 8, 2013, 3:01pm

Hi ths would give me searching "" and printing but i dont want to avoid Pipe delimited as well my file is totally having columns with pipe dilimted so i want to cosider if "xyz|zzz" occurs it should take as single col

xx|yy|"xyz|zzz"|12 ...

$1 - xx
$2 - yy
$3 - xyz|zzz
$4 - 12

Yoda · August 8, 2013, 3:42pm

You could try something like:

sed 's#\("[^|]*\)\([|]\)\([^|]*"\)#\1,\3#g' file | awk -F\| '{gsub(/"/,x);sub(",","|",$3);print $3}'

The sed is substituting the pipe to comma so that awk can recognize it as one field.

Later awk substitute the comma back to pipe.

Jotne · August 8, 2013, 3:48pm

Found this ugly code using google so can not help you with it, only that its replace the | , to make it work
Some guru may be able to simplify it

sed -e 's/|/,/g' file | awk '{
$0=$0",";
while($0) {
  match($0,/[^,]*,| *"[^"]*" *,/);
  sf=f=substr($0,RSTART,RLENGTH);
  gsub(/^ *"?|"? *,$/,"",f);
  print "Field " ++c " is " f;
  sub(sf,"");
}}'
Field 1 is xx
Field 2 is yy
Field 3 is xyz,zzz
Field 4 is zzz
Field 5 is 12.

PS you may find some here: Splitting By Content (The GNU Awk User’s Guide)

RudiC · August 8, 2013, 4:31pm

Maybe you can adapt this post to fit your needs? It's for commas in comma delimited fields, but it should do the task after some fiddling...

rohit_shinez · August 9, 2013, 2:26am

Thanks Yoda i will try to apply this but wat i needed is
i am having a file with | seperated in which i need to search char in 3rd column and replace with null. i need to replace only the coulmn where character occurs in 3rd field
for eg:

Code:
file1.txt

xx|yy|xx|12

output file:

xx|yy||12

the above one i achieved with this below code using http://www.unix.com/shell-programming-scripting/232871-find-char-field-replace-null.html#post302841467

awk 'BEGIN {FS=OFS="|" } $3 ~ /[[:alnum:]]/ { $3="" }1' file

but wat i faced is if there is any column having pipe that should consider as single column

xx|yy|"xyz|xx"|AAA|12...

not i should achieve my requirement like this

xx|yy|"xyz|xx"||12

now AAA should replace with null considering as AAA as 4th column if use

awk 'BEGIN {FS=OFS="|" } $4 ~ /[[:alnum:]]/ { $4="" }1' file

Jotne · August 9, 2013, 2:43am

Please do use code tags

I think the only good solution for you is to change the pipe | inside double quotes "" to something else, like yoda does here:
sed 's#$"[^|]*$$[|]$$[^|]*"$#\1,\3#g'

Like this

echo 'xx|yy|"xyz|xx"|AAA|12...'  |sed 's#\("[^|]*\)\([|]\)\([^|]*"\)#\1,\3#g' | awk 'BEGIN {FS=OFS="|" } $4 ~ /[[:alnum:]]/ { $4="" } {sub(",","|",$3)}1'
xx|yy|"xyz|xx"||12...

alister · August 9, 2013, 2:46pm

yoda:

You could try something like:
sed 's#$"[^|]*$$[|]$$[^|]*"$#\1,\3#g' file | awk -F\| '{gsub(/"/,x);sub(",","|",$3);print $3}'
The sed is substituting the pipe to comma so that awk can recognize it as one field.

The idea is good: transform the data, before it's seen by the main script, then undo the transform after the main script has done its job.

The implementation, however, is not. Modifying field contents should be avoided. It is much less intrusive to eliminate collisions between delimiter-pipes and quoted-pipes by replacing the delimiter-pipes. Why does this matter? Because delimiters are not visible from AWK script logic. Field contents, however, are visible and used in unpredictable ways.

Instead of changing xx|yy|"xyz|xx"|AAA|12... into xx|yy|"xyz,xx"|AAA|12... , it should become xx,yy,"xyz|xx",AAA,12... .

One alternative:

awk '{for (i=1; i<=NF; i+=2) gsub(/\|/, d, $i); print}' FS=\" OFS=\" d=,

Regards,
Alister

rohit_shinez · August 10, 2013, 2:37pm

its taking lot of time when using below code since sed is being used , my file contains 27k records with 200 columns pipe separated

echo 'xx|yy|"xyz|xx"|AAA|12...'  |sed 's#\("[^|]*\)\([|]\)\([^|]*"\)#\1,\3#g' | awk 'BEGIN {FS=OFS="|" } $4 ~ /[[:alnum:]]/ { $4="" } {sub(",","|",$3)}1'xx|yy|"xyz|xx"||12...

Jotne · August 11, 2013, 3:24am

You can get hold of GNU Awk 4.0+ and then use FPAT to solve this.
Splitting By Content - The GNU Awk User's Guide

rohit_shinez · August 11, 2013, 4:10am

Hi I tried using fpat but couldn't able to frame as per by desired requirement can u help me

Jotne · August 12, 2013, 5:15am

This may work

echo 'xx|yy|"xyz|zzz"|zzz|12' | awk '{c=0
$0=$0"|"
while($0) {
  match($0,/ *"[^"]*" *\||[^\|]*\|/)
  f=substr($0,RSTART,RLENGTH)
  gsub(/^ *"?|"? *\|$/,"",f)
  print "Field " ++c " is " f
  $0=substr($0,RLENGTH+1)
}}'
Field 1 is xx
Field 2 is yy
Field 3 is xyz|zzz
Field 4 is zzz
Field 5 is 12

RudiC · August 12, 2013, 10:48am

Try this one:

awk     '               {for (i=2; i<=NF; i+=2) {                       # every second field is one inside double quotes
                                 gsub (/\|/, "\001", $i)                # replace every delimiter in quoted field by something
                                }
                         FS="|"                                         # set desired field separator
                         $0=$0                                          # rebuild fields
                         for (i=1; i<=NF; i++) {                        # in every new field
                                 gsub ("\001", "|", $i)                 # replace something back to commas
                                }
                        }
                        {for (i=1; i<=NF; i++) print $i}                # print new $i fields
        ' FS="\"" OFS="\"" file
xx
yy
"xyz|zzz"
zzz
12