Sort and extract based on two files

phil_heath · March 9, 2011, 3:30pm

Hi,

I am having trouble sorting one file based on another file. I tried the grep -f function and failed. Basically what I have is two files that look like this:

File 1 (the list)

gh
aba
for
hmm

File 2 (the file that needs to be sorted)

aba  2  4  6  7
for   2  4  7  4
hmm  1  2  7  4
gh  2  5  7  9

So file 1 is a list that has names in a particular order and I want to sort file 2 according to that order while also extracting the other columns.

So the end output would look like this.

Final file

gh  2  5  7  9
aba  2  4  6  7
for   2  4  7  4
hmm  1  2  7  4

Thanks

Phil

---------- Post updated at 03:30 PM ---------- Previous update was at 03:29 PM ----------

The file is tab-separated.

jim_mcnamara · March 9, 2011, 3:46pm

awk 'FILENAME=="file2"  {arr[$1]=$0}
       FILENAME=="file1"  {print arr[$1]} '  file2 file1

There has to be a one-to-one correspondence between file1 and file2; i.e., if file1 is missing one of the keys that is in file2, that line will not print at all.
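
For anyone who wants to try this on the sample data, here is a minimal sketch. It assumes the data file really is tab-separated, builds two scratch files in a temporary directory (the paths and the `result` variable are made up for the demo), and uses the common `NR==FNR` test in place of the `FILENAME` comparison so it does not depend on the literal file names.

```shell
#!/bin/sh
# Scratch copies of the sample data; the directory and names are assumptions.
dir=$(mktemp -d)
printf 'gh\naba\nfor\nhmm\n' > "$dir/file1"
printf 'aba\t2\t4\t6\t7\nfor\t2\t4\t7\t4\nhmm\t1\t2\t7\t4\ngh\t2\t5\t7\t9\n' > "$dir/file2"

# NR==FNR is true only while awk reads the first file named (file2 here):
# store each data line under its key, then print the stored lines in file1 order.
result=$(awk -F'\t' 'NR==FNR {arr[$1]=$0; next} {print arr[$1]}' "$dir/file2" "$dir/file1")
printf '%s\n' "$result"

rm -rf "$dir"
```

The data file is read first so that every line is already stored by the time the list is read; the list then drives the output order.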

drl · March 9, 2011, 5:32pm

Hi.

Here is a script that uses a non-standard sort utility that admits alternate collating sequences, msort:

#!/usr/bin/env bash

# @(#) s1	Demonstrate alternate collating sequence.
# msort-home http://freshmeat.net/projects/msort

# Section 1, setup, pre-solution.
# Infrastructure details, environment, commands for forum posts. 
# Uncomment export command to test script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
C=$HOME/bin/context && [ -f $C ] && . $C specimen msort
set -o nounset
pe

FILE=${1-data1}
shift
CS=${1-data2}

# Section 2, display input file and collating sequence file.
# Display sample of data file, with head & tail as a last resort.
pe " || start [ first:middle:last ]"
specimen $FILE $CS \
|| { pe "(head/tail)"; head -n 5 $FILE; pe " ||"; tail -n 5 $FILE; }
pe " || end"

# Section 3, solution.
pl " Results:"
msort -q -n 1,1 -u n -l -c lexicographic -s $CS -1 $FILE

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.7 (lenny) 
GNU bash 3.2.39
specimen (local) 1.17
msort - ( /usr/bin/msort Apr 24 2008 )

 || start [ first:middle:last ]
Whole: 5:0:5 of 4 lines in file "data1"
aba  2  4  6  7
for   2  4  7  4
hmm  1  2  7  4
gh  2  5  7  9

Whole: 5:0:5 of 4 lines in file "data2"
gh
aba
for
hmm
 || end

-----
 Results:
gh  2  5  7  9
aba  2  4  6  7
for   2  4  7  4
hmm  1  2  7  4

If you are using Debian GNU/Linux, msort is in the repository for lenny and squeeze, but not in wheezy yet. The freshmeat site has links to a number of packages for other OSs.

Good luck ... cheers, drl

Chubler_XL · March 9, 2011, 8:35pm

jim mcnamara:

awk 'FILENAME=="file2"  {arr[$1]=$0}
   FILENAME=="file1"  {print arr[$1]} '  file2 file1
There has to be a one-to-one correspondence between file1 and file2; i.e., if file1 is missing one of the keys that is in file2, that line will not print at all.

And if file1 contains a key that is missing from file2 it will print a blank line (this can be addressed with a slight change in the awk script).
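
One possible form of that slight change, as a sketch only: guard the print with `($1 in arr)` so that keys from the list with no matching data line are skipped instead of producing a blank line. The scratch files, the unmatched key `zzz`, and the `result` variable are invented for the demo, and tab-separated input is assumed.

```shell
#!/bin/sh
# Scratch files; "zzz" is a made-up key with no matching line in file2.
dir=$(mktemp -d)
printf 'gh\nzzz\naba\n' > "$dir/file1"
printf 'aba\t2\t4\t6\t7\ngh\t2\t5\t7\t9\n' > "$dir/file2"

# Same two-pass idea as above, but print only when the key was seen in file2.
# The "in" test checks membership without creating an empty array entry.
result=$(awk -F'\t' 'NR==FNR {arr[$1]=$0; next} ($1 in arr) {print arr[$1]}' "$dir/file2" "$dir/file1")
printf '%s\n' "$result"

rm -rf "$dir"
```

If you would rather see the unmatched keys than drop them silently, the same test can be negated to print them to stderr.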