Clustering data by matching columns

geetu · March 20, 2012, 2:39pm

I am stuck with by DNA clustering analysis. I thought this forum will be a great help with data manipulations. Please help me.

I have a table with 91 columns. First I want to trim the table to only having rows where the column values are single characters which are A,T,G,C or 0. So any row having column values such as AA,AAG, AATG , Y, K etc has to be filtered out.
I figured out the regular expression will be something like [0ATGC]

Next I want to compare all the columns pairwise and group the columns which have the exact same values.The intermediate table output is not required.

Example input for 6 columns

Col1 Col2 Col3 Col4 Col5 Col6
A G G G T A
A Y R R TT A
A G T T T A
A G G T 0 A
A 0 R T TT AGGGGTT

Trimmed table (no output reqd)

 Col1 Col2 Col3 Col4 Col5 Col6
A G G G T A
A G T T T A
A G G T 0 A

Clustering output (desired output)

Col1,Col6 
Col2
Col3
Col4
Col5

bartus11 · March 20, 2012, 3:11pm

Try:

awk 'NR>1{for (i=1;i<=NF;i++) if (length($i)>1) next}1' file | awk 'NR==1{for (i=1;i<=NF;i++) n=$i}
NR>1{for (i=1;i<=NF;i++) a=a""$i}
END{for (i=1;i<=NF;i++) b[a]=b[a]","n;
for (i in b) {sub("^,","",b);print b}}'

geetu · March 20, 2012, 3:16pm

Hi Bartus11
Can you also help me filter the single character values other than A,T,G,C and 0? Single characters like Y, M , R etc are present in my input which are not allowed.
Many Thanks

bartus11 · March 20, 2012, 3:21pm

awk 'NR>1{for (i=1;i<=NF;i++) if (length($i)>1||$i!~"A|G|T|C|0") next}1' file | awk 'NR==1{for (i=1;i<=NF;i++) n=$i}
NR>1{for (i=1;i<=NF;i++) a=a""$i}
END{for (i=1;i<=NF;i++) b[a]=b[a]","n;
for (i in b) {sub("^,","",b);print b}}'

geetu · March 20, 2012, 3:36pm

Works like a charm ! Thank you bartus11