Help removing a column that appears twice

perl_beginner · September 5, 2013, 12:12am

Input file 1

                    S1                            S2          S3
comp95_c1    1.00      comp95_c1     1.00       3.00
comp4_c0      6.00      comp4_c0       6.00      6.00
comp3_c0      0.00      comp3_c0       0.00      4.00
comp15_c1    3.00      comp15_c1      3.00      3.00
comp28_c0    33.00    comp28_c0      33.00     2.00
comp23_c0    4.00      comp23_c0      4.00       3.00

Desired output file 1

                    S1        S2          S3
comp95_c1    1.00      1.00       3.00
comp4_c0      6.00      6.00      6.00
comp3_c0      0.00      0.00      4.00
comp15_c1    3.00      3.00      3.00
comp28_c0    33.00    33.00     2.00
comp23_c0    4.00      4.00       3.00

Input file 2

                       S1             S2                             S3
comp5_c1         1.00           1.00       comp5_c1      3.00
comp40_c0       6.00            6.00      comp40_c0     6.00
comp31_c0       0.00            0.00      comp31_c0     4.00
comp51_c1       3.00            3.00      comp51_c1     3.00
comp82_c0       33.00          33.00     comp82_c0     2.00
comp3_c0        4.00            4.00       comp3_c0      3.00

Desired output file 2

                       S1             S2         S3
comp5_c1         1.00           1.00      3.00
comp40_c0       6.00            6.00      6.00
comp31_c0       0.00            0.00      4.00
comp51_c1       3.00            3.00      3.00
comp82_c0       33.00          33.00     2.00
comp3_c0        4.00            4.00       3.00

I would like to remove the column (compXXX) that appears twice.
All the files are tab-delimited.

Thanks for any advice.

pamu · September 5, 2013, 12:23am

Try this, assuming you only want to compare against column 1:

awk '{S=$1;for(i=2;i<=NF;i++){if($i != $1){S=S OFS $i}}print S;}' OFS="\t" file
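For readability, here is the same logic spread over several lines with comments. It is only a sketch: the file name sample.tsv and its two rows are made up for illustration, and it assumes no field contains spaces, since awk splits on any whitespace by default.

```shell
# Build a small tab-delimited sample (hypothetical file name: sample.tsv).
printf 'comp95_c1\t1.00\tcomp95_c1\t1.00\t3.00\n'  > sample.tsv
printf 'comp4_c0\t6.00\tcomp4_c0\t6.00\t6.00\n'   >> sample.tsv

# Same logic as the one-liner, spread out and commented.
dedup_out=$(awk -v OFS='\t' '
{
    out = $1                      # always keep the first field
    for (i = 2; i <= NF; i++)     # scan the remaining fields
        if ($i != $1)             # skip any field that repeats field 1
            out = out OFS $i
    print out
}' sample.tsv)

printf '%s\n' "$dedup_out"
```

Note that every field equal to field 1 is dropped, wherever it sits in the line, so the same command should cover both input layouts.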

perl_beginner · September 5, 2013, 12:29am

Hi pamu,

I tried your awk command on Input file 1.
It returns the following result:

S1                S3
comp95_c1    1.00      1.00       3.00
comp4_c0      6.00      6.00      6.00
comp3_c0      0.00      0.00      4.00
comp15_c1    3.00      3.00      3.00
comp28_c0    33.00    33.00     2.00
comp23_c0    4.00      4.00       3.00

It is slightly different from the desired output.
The header line starts with a "\t" above the "compXXXX" column, and the content below "S1", "S2", "S3" is numeric.

Sorry for troubling you again.

pamu · September 5, 2013, 1:21am

Is this what you want?

awk '{T=NR==1?"\t":"";S=T $1;for(i=2;i<=NF;i++){if($i != $1){S=S OFS $i}}print S;}' OFS="\t" file

        S1      S2      S3
comp95_c1       1.00    1.00    3.00
comp4_c0        6.00    6.00    6.00
comp3_c0        0.00    0.00    4.00
comp15_c1       3.00    3.00    3.00
comp28_c0       33.00   33.00   2.00
comp23_c0       4.00    4.00    3.00

perl_beginner · September 5, 2013, 2:10am

Hi pamu,

It is almost there.
But I am curious what happens if my header S1, S2, S3 is instead S1, S1, S3.
Is it possible to make it still print the following result?

                      S1      S1      S3
comp95_c1       1.00    1.00    3.00
comp4_c0        6.00    6.00    6.00
comp3_c0        0.00    0.00    4.00
comp15_c1       3.00    3.00    3.00
comp28_c0       33.00   33.00   2.00
comp23_c0       4.00    4.00    3.00

Sorry again.
I just noticed that some cases work fine, but others do not work correctly when the header S1, S2, S3 is instead S1, S1, S3.

pamu · September 5, 2013, 2:20am

What about this?

 awk 'NR==1{$1=OFS OFS $1}1 NR>1{S=$1;for(i=2;i<=NF;i++){if($i != $1){S=S OFS $i}}print S;}' OFS="\t" file

                S1      S1      S3
comp95_c1       1.00    1.00    3.00
comp4_c0        6.00    6.00    6.00
comp3_c0        0.00    0.00    4.00
comp15_c1       3.00    3.00    3.00
comp28_c0       33.00   33.00   2.00
comp23_c0       4.00    4.00    3.00

perl_beginner · September 5, 2013, 2:35am

Hi pamu,

When I run the following commands, I get:

awk 'NR==1{$1=OFS OFS $1}1 NR>1{S=$1;for(i=2;i<=NF;i++){if($i != $1){S=S OFS $i}}print S;}' OFS="\t" file > file.out

awk -F"\t" '{print $1"\t"}' file.out

comp95_c1       
comp4_c0        
comp3_c0        
comp15_c1       
comp28_c0       
comp23_c0       

awk -F"\t" '{print $2"\t"}' file.out

1.00   
6.00    
0.00   
3.00    
33.00   
4.00    

awk -F"\t" '{print $3"\t"}' file.out
S1
1.00   
6.00    
0.00   
3.00    
33.00   
4.00    

awk -F"\t" '{print $4"\t"}' file.out
S1
3.00
6.00
4.00
3.00
2.00
3.00

awk -F"\t" '{print $5"\t"}' file.out
S3

I would expect the following result:

awk -F"\t" '{print $1"\t"}' file.out

comp95_c1       
comp4_c0        
comp3_c0        
comp15_c1       
comp28_c0       
comp23_c0       

awk -F"\t" '{print $2"\t"}' file.out
S1
1.00   
6.00    
0.00   
3.00    
33.00   
4.00    

awk -F"\t" '{print $3"\t"}' file.out
S1
1.00   
6.00    
0.00   
3.00    
33.00   
4.00    

awk -F"\t" '{print $4"\t"}' file.out
S3
3.00
6.00
4.00
3.00
2.00
3.00

awk -F"\t" '{print $5"\t"}' file.out

Thanks for any advice on keeping "S1, S1, S3" aligned with their corresponding records for further analysis.
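A quicker way to spot this kind of misalignment is to count the tab-separated fields on each line instead of printing one column at a time. This is a sketch only: misaligned.tsv is a made-up scratch file that mimics the header with two leading tabs.

```shell
# Recreate the misaligned output in a scratch file
# (hypothetical name: misaligned.tsv): the header has two leading tabs.
printf '\t\tS1\tS1\tS3\n'               > misaligned.tsv
printf 'comp95_c1\t1.00\t1.00\t3.00\n' >> misaligned.tsv
printf 'comp4_c0\t6.00\t6.00\t6.00\n'  >> misaligned.tsv

# Print "line-number: field-count" for every line, splitting on tabs only.
field_counts=$(awk -F'\t' '{ print NR ": " NF }' misaligned.tsv)
printf '%s\n' "$field_counts"
```

If the header line reports one more field than the data lines, the header has an extra leading tab.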

pamu · September 5, 2013, 2:41am

Then remove one OFS from the awk code:

awk 'NR==1{$1=OFS $1}1 NR>1{S=$1;for(i=2;i<=NF;i++){if($i != $1){S=S OFS $i}}print S;}' OFS="\t" file > file.out
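A more explicit sketch of the same result keeps the header and the data rows in separate rules, which may be easier to adapt later. The file name dup.tsv and its header layout are assumptions made for illustration. As with the one-liner, fields are split on any whitespace, so it assumes no field contains spaces.

```shell
# Scratch input (hypothetical name: dup.tsv) with a repeated header name
# and the compXXX column appearing twice. The header layout is assumed.
printf '\tS1\t\tS1\tS3\n'                          > dup.tsv
printf 'comp95_c1\t1.00\tcomp95_c1\t1.00\t3.00\n' >> dup.tsv
printf 'comp4_c0\t6.00\tcomp4_c0\t6.00\t6.00\n'   >> dup.tsv

clean_out=$(awk -v OFS='\t' '
NR == 1 {                         # header: keep every name, even repeats
    out = ""                      # empty cell above the compXXX column
    for (i = 1; i <= NF; i++)
        out = out OFS $i
    print out
    next                          # do not run the data rule on the header
}
{                                 # data rows: drop repeats of field 1
    out = $1
    for (i = 2; i <= NF; i++)
        if ($i != $1)
            out = out OFS $i
    print out
}' dup.tsv)

printf '%s\n' "$clean_out"
```

Because the header is never compared against its own first name, a repeated name such as S1, S1, S3 is kept, and the single leading tab keeps each name above its data column.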

perl_beginner · September 5, 2013, 2:47am

Perfect, pamu.
Thanks a lot, I really appreciate your talent.
Thumbs up