I am stuck with by DNA clustering analysis. I thought this forum will be a great help with data manipulations. Please help me.
I have a table with 91 columns. First I want to trim the table to only having rows where the column values are single characters which are A,T,G,C or 0. So any row having column values such as AA,AAG, AATG , Y, K etc has to be filtered out.
I figured out the regular expression will be something like [0ATGC]
Next I want to compare all the columns pairwise and group the columns which have the exact same values.The intermediate table output is not required.
Example input for 6 columns
Col1 Col2 Col3 Col4 Col5 Col6
A G G G T A
A Y R R TT A
A G T T T A
A G G T 0 A
A 0 R T TT AGGGGTT
Trimmed table (no output reqd)
Col1 Col2 Col3 Col4 Col5 Col6
A G G G T A
A G T T T A
A G G T 0 A
Clustering output (desired output)
Col1,Col6
Col2
Col3
Col4
Col5