Parse and reformat

cmccabe · October 25, 2014, 11:44am

Trying to parse column C ($3) of the attached file (104 rows). The data is in the format below, all in one string. Each entry in the string should become a separate row, with the data in column A ($1) and column B ($2) as the header. All the data is in separate columns as well. Thank you :).

 ACTA 59 A_16_P32713632=chr10:90695750-90695810, A_16_P32713635=chr10:90696573-90696633, A_16_P32713680=chr10:90697419-90697479

 ADAMTS10 7 A_16_P41135847=chr19:8647429-8647489, A_16_P03421012=chr19:8659282-8659342

Desired Output:

 
ACTA	59
A_16_P32713632     chr10     90695750     90695810
A_16_P32713635     chr10     90696573     90696633
A_16_P32713680     chr10     90697419     90697479

 
ADAMTS10	7
A_16_P41135847     chr19     8647429     8647489
A_16_P03421012     chr19     8659282     8659342

Akshay_Hegde · October 25, 2014, 12:05pm

awk '{print $1,$2;for(i=3;i<=NF;i++){ gsub(/[=:-]/,OFS,$i); sub(/,/,"",$i); print $i }}' OFS='\t' file

RudiC · October 25, 2014, 12:07pm

Like this:

awk     '       {print $1, $2
                 for (i=3; i<=NF; i++)
                        {n=split ($i, T, "[=:,-]")
                         print T[1],T[2],T[3],T[4]
                        }
                }
        ' OFS="\t" file
ACTA    59
A_16_P32713632    chr10    90695750    90695810
A_16_P32713635    chr10    90696573    90696633
A_16_P32713680    chr10    90697419    90697479
ADAMTS10    7
A_16_P41135847    chr19    8647429    8647489
A_16_P03421012    chr19    8659282    8659342

cmccabe · October 25, 2014, 12:11pm

Thank you :), works perfectly!

Akshay_Hegde · October 25, 2014, 12:13pm

Or something like this

awk '{match($0,/^[^ ]* [^ ]* /);s=substr($0,RLENGTH+1); gsub(/[=:-]/,OFS,s); gsub(/, /,RS,s); $0 = $1 OFS $2 RS s}1' file

cmccabe · October 25, 2014, 12:33pm

Can the code be modified to output the same data, just without the header?

Desired Output:

A_16_P32713632     chr10     90695750     90695810
A_16_P32713635     chr10     90696573     90696633
A_16_P32713680     chr10     90697419     90697479

A_16_P41135847     chr19     8647429     8647489
A_16_P03421012     chr19     8659282     8659342

Thank you :).

Akshay_Hegde · October 25, 2014, 12:37pm

Remove print $1,$2; from the first command:

awk '{for(i=3;i<=NF;i++){ gsub(/[=:-]/,OFS,$i); sub(/,/,"",$i); print $i }}' OFS='\t' file
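The command above prints the probes of all rows back to back. If the empty line between groups shown in the desired output matters, a minimal sketch could look like the following. It assumes whitespace-separated input as in the sample rows; the function name probes_only is made up for this example, and OFS is set with -v rather than as a trailing assignment.

```shell
# probes_only: hypothetical helper name, used only for this example.
# Prints one probe per line (ID, chr, start, end) without the header,
# and separates the probes of consecutive input rows with an empty line.
probes_only() {
  awk -v OFS='\t' '{
    # Emit the separator before every row except the first
    if (NR > 1) print ""
    for (i = 3; i <= NF; i++) {
      gsub(/[=:-]/, OFS, $i)   # turn = : and - into tabs
      sub(/,/, "", $i)         # drop the trailing comma
      print $i
    }
  }' "$@"
}

printf 'ACTA 59 A_16_P32713632=chr10:90695750-90695810, A_16_P32713635=chr10:90696573-90696633\nADAMTS10 7 A_16_P41135847=chr19:8647429-8647489\n' | probes_only
```

With a file, it would be called as probes_only file.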

cmccabe · October 25, 2014, 1:00pm

How is the order of the output determined in the script?

Original data:

 ACTA 59 A_16_P32713632=chr10:90695750-90695810, A_16_P32713635=chr10:90696573-90696633, A_16_P32713680=chr10:90697419-90697479

For example, if instead of:

A_16_P32713632     chr10     90695750     90695810
A_16_P32713635     chr10     90696573     90696633
A_16_P32713680     chr10     90697419     90697479

a different output is needed:

chr10     90695750     90695810     A_16_P32713632
chr10     90696573     90696633     A_16_P32713635
chr10     90697419     90697479     A_16_P32713680

Same data just different order.

Thank you :).

Akshay_Hegde · October 25, 2014, 1:12pm

Read RudiC's answer and change the order of the array indices in the print statement:

print T[1],T[2],T[3],T[4] to print T[2],T[3],T[4],T[1]
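To spell out how the order is determined: split() fills the array T from left to right, so for each probe T[1] is the ID, T[2] the chromosome, T[3] the start and T[4] the end, and the print statement alone decides the column order. A rough sketch, assuming the same whitespace-separated input as the sample line (the function name reorder is only for illustration, and the hyphen is placed last in the bracket expression so it is taken literally):

```shell
# reorder: hypothetical helper name, used only for this example.
# Prints chr, start, end, probe ID for every probe in fields 3..NF.
reorder() {
  awk -v OFS='\t' '{
    for (i = 3; i <= NF; i++) {
      # Split "ID=chr:start-end," on = : , and -
      split($i, T, "[=:,-]")
      # T[1]=ID, T[2]=chr, T[3]=start, T[4]=end
      print T[2], T[3], T[4], T[1]
    }
  }' "$@"
}

printf 'ACTA 59 A_16_P32713632=chr10:90695750-90695810, A_16_P32713635=chr10:90696573-90696633\n' | reorder
```

Any other column order works the same way, by listing the T[] elements in the order wanted.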

cmccabe · October 25, 2014, 1:47pm

 awk     '               {for (i=3; i<=NF; i++)
                        {n=split ($i, T, "[=:,-]")
                         print T[2],T[3],T[4],T[1]
                        }
                }
        ' OFS="\t" header_sort.txt > sort.txt

Like this:

 sort.txt (no headers)
chr10     90695750     90695810     A_16_P32713632
chr10     90696573     90696633     A_16_P32713635

Thanks :).