[Solved] Sorting a column based on another column

Homa · December 14, 2012, 10:51am

Hello,

I have a file as follows:

F0100010 A C     F0100040 A G    BTA-28763-no-rs     77.2692
F0100020 A G      F0100030 A T    BTA-29334-no-rs     11.4989
F0100030 A T      F0100020 A G    BTA-29515-no-rs     127.006
F0100040 A G      F0100010 A C    BTA-29644-no-rs     7.29827
F0100050 A T      F0100050 A T    BTA-32647-no-rs     70.9005

I want to sort the fourth column based on the first column to get the same order.

Thank you in advance for any help.

jim_mcnamara · December 14, 2012, 10:53am

Please show us your expected output, based on the sample above.

Corona688 · December 14, 2012, 10:53am

What is the difference between sorting on the fourth column, and sorting 'based on' the fourth column? Do you want to sort on both columns, but group on the fourth?

In short -- what output would you expect for this input?

Homa · December 14, 2012, 11:00am

output:

F0100010 A C      F0100010 A C    BTA-29644-no-rs     7.29827  
F0100020 A G      F0100020 A G    BTA-29515-no-rs     127.006
F0100030 A T      F0100030 A T    BTA-29334-no-rs     11.4989
F0100040 A G      F0100040 A G    BTA-28763-no-rs     77.2692

This is what I want to have.

The difference is that if I simply sort on the fourth column, the rows are ordered by that column's own values, but I want the fourth column to follow the order that already exists in the first column. I cannot re-sort the first column either, because it is crucial to keep its order.

Corona688 · December 14, 2012, 11:24am

Oh, I see. You don't want it sorted. You want columns 4 through n of all rows moved such that column 1 lines up with column 4.

Working on it.

Corona688 · December 14, 2012, 11:28am

$ awk 'NR==FNR { ARR[$4]=$0 ; next }; $1 in ARR { A=$1 ; B=$2; C=$3; $0=ARR[$1]; $1=A; $2=B; $3=C } 1' datafile datafile

F0100010 A C F0100010 A C BTA-29644-no-rs 7.29827
F0100020 A G F0100020 A G BTA-29515-no-rs 127.006
F0100030 A T F0100030 A T BTA-29334-no-rs 11.4989
F0100040 A G F0100040 A G BTA-28763-no-rs 77.2692
F0100050 A T F0100050 A T BTA-32647-no-rs 70.9005

$

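Yes, I gave it the input file twice; that is not a typo. The first pass reads every line into memory, indexed by its fourth column. The second pass walks the file again and, for each line, recalls the stored line whose fourth column matches the current first column, puts the current first three columns back in front, and prints the result.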
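
For anyone who finds the one-liner hard to read, here is the same two-pass idea spelled out with comments. This is only a sketch: it writes the sample data from the first post to a temporary file, and the names tmpdir, result and saved are picked just for illustration.

```shell
#!/bin/sh
# Sample data copied from the first post, written to a temporary file.
tmpdir=$(mktemp -d)
datafile="$tmpdir/datafile"
cat > "$datafile" <<'EOF'
F0100010 A C     F0100040 A G    BTA-28763-no-rs     77.2692
F0100020 A G      F0100030 A T    BTA-29334-no-rs     11.4989
F0100030 A T      F0100020 A G    BTA-29515-no-rs     127.006
F0100040 A G      F0100010 A C    BTA-29644-no-rs     7.29827
F0100050 A T      F0100050 A T    BTA-32647-no-rs     70.9005
EOF

result=$(awk '
    # Pass 1: NR==FNR holds only while the first file argument is being read.
    # Remember each whole line under the value of its fourth column.
    NR == FNR { saved[$4] = $0; next }

    # Pass 2: if some saved line carries the current first column as its
    # fourth column, swap that line in, then restore columns 1 to 3.
    $1 in saved {
        a = $1; b = $2; c = $3
        $0 = saved[$1]
        $1 = a; $2 = b; $3 = c
    }

    # Print every line, whether it was changed or not.
    { print }
' "$datafile" "$datafile")

printf '%s\n' "$result"
rm -r "$tmpdir"
```

Assigning to $1, $2 and $3 makes awk rebuild the line with single spaces between fields, which is why the original column alignment is lost in the output.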

Homa · December 14, 2012, 11:29am

Great, thanks a lot!

Corona688 · December 14, 2012, 11:39am

A moment, that doesn't look quite right.

[edit] I was restoring four columns when I only needed three. With the $4=D assignment removed from the code (as it now stands above), it works.

Homa · December 14, 2012, 11:54am

Oh yes, you are right. Thank you very much!

MadeInGermany · December 14, 2012, 12:07pm

In general, an efficient merge operation needs two input streams.
If you only have one file, the shell can open it a second time on another file descriptor.

#!/bin/sh
sort -k4,4 infile |
(
# in this sub shell, stdin is the sorted output; duplicate it onto fd 3
exec 3<&0
# now the while loop can take infile as its stdin
while read -r f1 f2 f3 junk
do
 # read the corresponding line of the sorted stream from fd 3
 read -r j1 j2 j3 k4 k5 k6 rest <&3
 printf "%s %s %s  %s %s %s  %s\n" "$f1" "$f2" "$f3" "$k4" "$k5" "$k6" "$rest"
done < infile
)
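
Note that this merge pairs lines purely by position, so as far as I can tell it only gives the right answer when the first column is already in sorted order and holds exactly the same keys as the fourth column. A minimal sketch of a sanity check one could run first, using the sample data from this thread (the temporary directory and the names col1, col4 and status are only for illustration):

```shell
#!/bin/sh
# Sample data copied from the first post, written to a temporary file.
tmpdir=$(mktemp -d)
infile="$tmpdir/infile"
cat > "$infile" <<'EOF'
F0100010 A C     F0100040 A G    BTA-28763-no-rs     77.2692
F0100020 A G      F0100030 A T    BTA-29334-no-rs     11.4989
F0100030 A T      F0100020 A G    BTA-29515-no-rs     127.006
F0100040 A G      F0100010 A C    BTA-29644-no-rs     7.29827
F0100050 A T      F0100050 A T    BTA-32647-no-rs     70.9005
EOF

# Column 1 keys in file order, and column 4 keys after sorting.
awk '{ print $1 }' "$infile" > "$tmpdir/col1"
awk '{ print $4 }' "$infile" | sort > "$tmpdir/col4"

# cmp -s prints nothing and succeeds only if both lists are identical,
# i.e. column 1 is already sorted and has the same keys as column 4.
if cmp -s "$tmpdir/col1" "$tmpdir/col4"; then
    status="ok"
else
    status="mismatch"
fi
echo "$status"
rm -r "$tmpdir"
```

If this reports a mismatch, the positional merge would silently pair the wrong lines, and a lookup by key such as the awk solution earlier in the thread is probably the safer choice.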