F0100010 A C F0100040 A G BTA-28763-no-rs 77.2692
F0100020 A G F0100030 A T BTA-29334-no-rs 11.4989
F0100030 A T F0100020 A G BTA-29515-no-rs 127.006
F0100040 A G F0100010 A C BTA-29644-no-rs 7.29827
F0100050 A T F0100050 A T BTA-32647-no-rs 70.9005
I want to sort the fourth column based on the first column to get the same order.
What is the difference between sorting on the fourth column, and sorting 'based on' the fourth column? Do you want to sort on both columns, but group on the fourth?
In short -- what output would you expect for this input?
F0100010 A C F0100010 A C BTA-29644-no-rs 7.29827
F0100020 A G F0100020 A G BTA-29515-no-rs 127.006
F0100030 A T F0100030 A T BTA-29334-no-rs 11.4989
F0100040 A G F0100040 A G BTA-28763-no-rs 77.2692
This is what I want to have.
The difference is that if I simply sort on the fourth column, the lines are ordered by the values in that column, but I want the fourth column to follow the order that already exists in the first column. I cannot sort the first column either, because it is crucial to keep its order.
$ awk 'NR==FNR { ARR[$4]=$0 ; next } $1 in ARR { A=$1 ; B=$2 ; C=$3 ; $0=ARR[$1] ; $1=A ; $2=B ; $3=C } 1' datafile datafile
F0100010 A C F0100010 A C BTA-29644-no-rs 7.29827
F0100020 A G F0100020 A G BTA-29515-no-rs 127.006
F0100030 A T F0100030 A T BTA-29334-no-rs 11.4989
F0100040 A G F0100040 A G BTA-28763-no-rs 77.2692
F0100050 A T F0100050 A T BTA-32647-no-rs 70.9005
$
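Yes, I gave it the input file twice; that is not a typo. The first pass reads every line into memory, indexed by its fourth column. The second pass walks the file again in its original order, looks up the stored line whose fourth column equals the current first column, and overwrites that line's first three columns with the current ones before printing it.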
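If the one-liner is hard to follow, roughly the same two-pass idea can be spelled out as a small shell function. This is only a sketch: the function name realign and the array name rhs are made up for illustration, and it assumes whitespace-separated fields with the keys in columns 1 and 4. Like the one-liner, it prints a line unchanged when its first column has no match in the fourth column.

```shell
#!/bin/sh
# realign FILE  (hypothetical helper name)
# Reads FILE twice with awk and realigns columns 4..NF to the order of column 1.
realign() {
    awk '
        # Pass 1 (NR == FNR only while reading the first copy of the file):
        # store columns 4..NF of each line, keyed by column 4.
        NR == FNR {
            tail = $4
            for (i = 5; i <= NF; i++) tail = tail " " $i
            rhs[$4] = tail
            next
        }
        # Pass 2: keep columns 1-3 in file order and append the stored
        # right-hand part whose key equals column 1.
        {
            if ($1 in rhs) print $1, $2, $3, rhs[$1]
            else print    # no match: leave the line as it is
        }
    ' "$1" "$1"
}
```

Calling it as realign datafile on the sample input should give the same five lines as the one-liner above. Because the lookup is by key rather than by position, neither column needs to be sorted.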
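In general, an efficient merge reads two input streams in step, so you need two files.
If you have only one file, the shell can open it twice, using a second file descriptor for the second stream. Note that the script below pairs lines purely by position, so it relies on the first column already being in sorted order and on the fourth column containing the same set of keys.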
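#!/bin/sh
# Sort on the fourth column and pipe the result into the subshell.
sort -k4,4 infile |
(
    # In this subshell, duplicate stdin (the sorted stream) onto fd 3.
    exec 3<&0
    # The while loop's own stdin is redirected to the unsorted infile.
    while read -r f1 f2 f3 junk
    do
        # Take the next line of the sorted stream from fd 3.
        read -r j1 j2 j3 k4 k5 k6 rest <&3
        printf "%s %s %s %s %s %s %s\n" "$f1" "$f2" "$f3" "$k4" "$k5" "$k6" "$rest"
    done < infile
)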
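Since the script above matches line N of the file with line N of the sorted stream, it may be worth checking the input first. The following is a minimal sketch of such a check, not part of the original script: the function name can_pair_by_position is hypothetical, and it assumes whitespace-separated fields and that sort runs under the same locale as in the script.

```shell
#!/bin/sh
# can_pair_by_position FILE  (hypothetical helper name)
# Succeeds only if column 1 is already sorted and columns 1 and 4
# hold exactly the same keys.
can_pair_by_position() {
    # sort -c exits non-zero if its input is not in sorted order.
    awk '{ print $1 }' "$1" | sort -c 2>/dev/null || return 1
    # Compare the sorted key lists of both columns.
    keys1=$(awk '{ print $1 }' "$1" | sort)
    keys4=$(awk '{ print $4 }' "$1" | sort)
    [ "$keys1" = "$keys4" ]
}
```

A possible use is can_pair_by_position infile || echo "use the awk lookup instead", falling back to a key-based lookup when the check fails.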