Sort on second column only, based on first column

malcomex999 · March 15, 2010, 5:52am

I have an input file like this...

AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Sunlight
AAAlkalines     Sunlight
AAAlkalines     Sunlight
AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Energizer
AAASalines      Energizer
AAASalines      Energizer
AAASalines      Sunlight
Batteries       Energizer
Batteries       Sunlight
Batteries       Energizer
RechargableAAA  Energizer
RechargableAAA  Energizer
RechargableAAA  Duracell
RechargableAAA  Duracell
EmergencyLight  AlFaris
EmergencyLight  AlFaris
EmergencyLight  Geepas
EmergencyLight  Geepas

I want to get the output below: the first column stays intact (its order does not change), and the second column is sorted only within each group of identical names in the first column...

AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Sunlight
AAAlkalines     Sunlight
AAAlkalines     Sunlight
AAASalines      Energizer
AAASalines      Energizer
AAASalines      Sunlight
Batteries       Energizer
Batteries       Energizer
Batteries       Sunlight
RechargableAAA  Duracell
RechargableAAA  Duracell
RechargableAAA  Energizer
RechargableAAA  Energizer
EmergencyLight  AlFaris
EmergencyLight  AlFaris
EmergencyLight  Geepas
EmergencyLight  Geepas

clx · March 15, 2010, 7:40am

If the sequence of the col1 values is not important:

for col1 in $(awk '{print $1}' file | sort -u); do grep "^${col1}[[:space:]]" file | sort -k2; done

Output:

AAASalines      Energizer
AAASalines      Energizer
AAASalines      Sunlight
AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Energizer
AAAlkalines     Sunlight
AAAlkalines     Sunlight
AAAlkalines     Sunlight
Batteries       Energizer
Batteries       Energizer
Batteries       Sunlight
EmergencyLight  AlFaris
EmergencyLight  AlFaris
EmergencyLight  Geepas
EmergencyLight  Geepas
RechargableAAA  Duracell
RechargableAAA  Duracell
RechargableAAA  Energizer
RechargableAAA  Energizer

malcomex999 · March 15, 2010, 7:44am

Actually, the sequence of the first column is very important to me.

radoulov · March 15, 2010, 7:47am

perl -lane'
  $_{$F[0]}++ or $c++;
  push @x, [$c, @F];
  END {
    print $_->[1], "\t", $_->[2] for 
      sort {
        $a->[0] <=> $b->[0] || 
        $a->[2] cmp $b->[2]
        } @x
    }' infile

malcomex999 · March 15, 2010, 7:55am

Well, I don't have Perl on the system I am working on right now, but I will give it a try later on another system.

Thanks,

radoulov · March 15, 2010, 7:59am

Ok

awk '$1 != p { c++ }
{ print c, $0; p = $1 }
' p=1 infile |
  sort -k1,1n -k3,3 |
    cut -d\  -f2-
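
The pipeline above numbers each run of equal first-column values, so a name that shows up again further down the file would start a new group. If such input is possible, here is a minimal sketch that instead numbers names by first appearance, so scattered lines are gathered into the group where the name first occurred. The function name group_sort_seen is made up for illustration, and the sketch assumes two whitespace-separated columns as in the sample.

```shell
# group_sort_seen (hypothetical name): decorate, sort, undecorate.
# Each line is prefixed with the rank of its first-column value,
# where rank = order of first appearance in the input.
group_sort_seen() {
  awk '!($1 in id) { id[$1] = ++n }   # first time this name is seen
       { print id[$1], $0 }           # decorate: "<rank> <original line>"
  ' "$@" |
    sort -k1,1n -k3,3 |               # by rank, then by the original column 2
    cut -d" " -f2-                    # undecorate: drop the rank
}
```

Usage would be "group_sort_seen infile". For input where every name is already contiguous, as in the sample, this should give the same result as the pipeline above.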

malcomex999 · March 15, 2010, 8:14am

Yes, that's it. It works perfectly.

Thanks,

drl · March 15, 2010, 10:46am

Hi.

I like the concise awk code of radoulov.

I sometimes prefer to think in terms of large tasks before I code up a solution in awk, Perl, C, etc. (if needed for performance). For example, this problem could be considered one of an alternate collating sequence: that of the first field. It could also be considered a grouping problem.

Because the input is already in groups in the specific, desired order, the grouping view lets me think about what I need to do to each group: namely, sort it by the second field. I cannot normally do that to just part of a file. However, if I could identify each section, I'd be a step in the right direction.

There are no standard commands that do that, but you can find some interesting scripts on the net that will. Here's how it can be done with some of them. Their names should suggest what they do:

#!/usr/bin/env bash

# @(#) s2	Demonstrate group sort, missing textutils.

# Infrastructure details, environment, commands for forum posts. 
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo ; echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
c=$( ps | grep $$ | awk '{print $NF}' )
version >/dev/null 2>&1 && s=$(_eat $0 $1) || s=""
[ "$c" = "$s" ] && p="$s" || p="$c"
version >/dev/null 2>&1 && version "=o" $p blockwise sort
set -o nounset
echo

FILE=${1-data1}

# If "specimen" does not exist, replace with "cat".
specimen $FILE

echo " Preliminary conditions:"
t1=$( diff $FILE expected-output.txt | wc -l )
echo " About $t1 lines differ."

echo
echo " Results:"
split_at_colchange 1 $FILE |
tee t1 |
blockwise "sort -k2,2" |
tee t2 |
remove_blank_lines > tf

if ! cmp expected-output.txt tf
then
  sdiff -w78 expected-output.txt tf
else 
  echo
  echo " Pass - generated output and expected-output.txt are identical."
  echo
  specimen tf
fi

exit 0

producing:

% ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
blockwise - ( ~/bin/blockwise Sep 29 12:53 )
sort (GNU coreutils) 6.10

Edges: 10 of 23 lines in data1
     1	AAAlkalines	Energizer
     2	AAAlkalines	Energizer
     3	AAAlkalines	Energizer
     4	AAAlkalines	Sunlight
     5	AAAlkalines	Sunlight
   ...
    19	RechargableAAA	Duracell
    20	EmergencyLight	AlFaris
    21	EmergencyLight	AlFaris
    22	EmergencyLight	Geepas
    23	EmergencyLight	Geepas

 Preliminary conditions:
 About 18 lines differ.

 Results:

 Pass - generated output and expected-output.txt are identical.

Edges: 10 of 23 lines in tf
     1	AAAlkalines	Energizer
     2	AAAlkalines	Energizer
     3	AAAlkalines	Energizer
     4	AAAlkalines	Energizer
     5	AAAlkalines	Energizer
   ...
    19	RechargableAAA	Energizer
    20	EmergencyLight	AlFaris
    21	EmergencyLight	AlFaris
    22	EmergencyLight	Geepas
    23	EmergencyLight	Geepas

I canonicalized the data and reference output file so that the separators were TABs.

The steps are:

1. Separate the blocks.
2. For each block, sort on the second field.
3. Remove the separator between blocks.
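
For anyone without those scripts installed, the three steps can be approximated with plain awk and sort. This is only a sketch: the function names below are hypothetical stand-ins, not the actual split_at_colchange, blockwise, or remove_blank_lines scripts, and it assumes that an empty line is an acceptable block separator (that is, no input line is itself blank).

```shell
# Step 1: print an empty line whenever column 1 changes.
split_on_col1_change() {
  awk 'NR > 1 && $1 != prev { print "" } { print; prev = $1 }' "$@"
}

# Step 2: pipe each blank-line-delimited block through its own sort.
sort_each_block() {
  awk -v cmd="sort -k2,2" '
    NF == 0 { close(cmd); print ""; fflush(); next }  # block done: wait for sort, keep separator
    { print | cmd }                                   # feed the current block to sort
    END { close(cmd) }                                # finish the last block
  '
}

# Step 3: remove the separators again.
drop_blank_lines() {
  awk 'NF > 0'
}

# All three steps chained together.
group_sort() {
  split_on_col1_change "$@" | sort_each_block | drop_blank_lines
}
```

Running "group_sort data1" should correspond to the pipeline in the script above. Unlike a single global sort, a name that reappears later in the file stays a separate block here, because blocks are defined purely by where column 1 changes.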

The temporary files from the tee commands can be examined to see the intermediate-step results.

The collection of Perl scripts can be found at The Missing Textutils.

Best wishes ... cheers, drl