Awk: split column if special characters

genome · January 27, 2018, 5:14pm

Hi,

I have data like this:

Gene1,Gene2 snp1
Gene3 snp2
Gene4 snp3

I'd like to split a line when its first column contains commas, and print the remaining information for each gene.

My code:


awk '{       
if($1 ~ /,/){
n = split($0, t, ",")
for (i = 0; ++i <= n;) {
print t[i],$2
}

}
else{
print $0
}
}' smalldata.txt

It gives me output:

Gene1 snp1
Gene2 snp1 snp1
Gene3 snp2
Gene4 snp3

I want an output like:

Gene1 snp1
Gene2 snp1
Gene3 snp2
Gene4 snp3

A line can have multiple commas.

Linux platform: 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64 GNU/Linux

Yoda · January 27, 2018, 5:25pm

Try:

awk '
        {
                if ( $0 ~ /,/ )
                {
                        n = split ( $1, T, "," )
                        for ( i = 1; i <= n; i++ )
                                print T[i], $NF
                }
                else
                        print $1, $NF
        }
' file

rdrtx1 · January 27, 2018, 8:00pm

awk ' { for (i=1; i<=NF-1; i++) print $i, $NF } ' FS="[ ,]" file

Aia · January 27, 2018, 10:31pm

In case you don't mind using Perl:

perl -pale 's/,/ $F[1]\n/' genome.file

Output:

Gene1 snp1
Gene2 snp1
Gene3 snp2
Gene4 snp3

rdrtx1 · January 28, 2018, 12:18am

Use s/,/ $F[1]\n/g for multiple commas.

MadeInGermany · January 28, 2018, 2:56am

Your original code needs to split $1 (the first field), not $0.

...
n = split($1, t, ",")
...
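
With that one change applied, the whole script might look like the sketch below. It assumes the sample data from the first post; the temporary file and the variable name fixed_output are only there for illustration and are not part of the original code.

```shell
# Sample input from the first post, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'Gene1,Gene2 snp1' 'Gene3 snp2' 'Gene4 snp3' > "$tmp"

fixed_output=$(awk '{
    if ($1 ~ /,/) {
        # Split only the first field, so the SNP never ends up in the array.
        n = split($1, t, ",")
        for (i = 1; i <= n; i++)
            print t[i], $2
    } else {
        print $0
    }
}' "$tmp")

printf '%s\n' "$fixed_output"
rm -f "$tmp"
```

Splitting $0 put "Gene2 snp1" into the last array element, which is where the doubled snp1 came from.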

RudiC · January 28, 2018, 4:33am

And you don't need the test for commas in $1; split will yield a single element in the absence of separators.
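
A minimal sketch of that simplification, assuming two-column input as in the first post. Gene5 is a made-up name added here to show a line with more than one comma, and simplified_output is just an illustrative variable.

```shell
simplified_output=$(printf '%s\n' 'Gene1,Gene2,Gene5 snp1' 'Gene3 snp2' |
awk '{
    # No comma test: without a separator, split returns 1 and t[1] holds all of $1.
    n = split($1, t, ",")
    for (i = 1; i <= n; i++)
        print t[i], $2
}')

printf '%s\n' "$simplified_output"
```

This should print one line per gene (Gene1, Gene2 and Gene5 each with snp1, then Gene3 with snp2), so lines with and without commas go through the same loop.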

genome · February 3, 2018, 9:00am

Oh, thanks.