Sorting file content by file extensions

ajaypatil_am · May 11, 2012, 12:48am

Hi Experts,

I have a .txt file that lists filenames with various extensions, e.g. .gz, .dat, .CTL, .xml. I want to sort the filenames by extension and delete all the names with the .xml extension.

Please help.
PS : I am using Sun OS Generic_122300-60.

Thanks,
Ajay

otheus · May 11, 2012, 1:19am

To do the last step, you need only:

grep -v '\.xml$'

as for sorting, you can try:

sort -t. -k 2,2 -k 1,1

but that will not sort correctly if you have files with two periods.
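
Put together, the two steps might look like the sketch below. The function name drop_xml_and_sort and the sample names are made up for illustration, and LC_ALL=C is an assumption added here so that the ordering is the same on every system (in that locale uppercase sorts before lowercase).

```shell
# Hypothetical helper: read file names on stdin, drop the *.xml entries,
# then sort by the field after the first dot, and by the first field within ties.
drop_xml_and_sort() {
    grep -v '\.xml$' | LC_ALL=C sort -t. -k2,2 -k1,1
}

# Sample input (made-up names).
printf '%s\n' b.gz a.xml c.dat a.CTL a.gz | drop_xml_and_sort
```

With the sample input this should print a.CTL, c.dat, a.gz, b.gz, one per line. For a list stored in a file, the same function can read it via redirection, e.g. drop_xml_and_sort < filelist.txt (a placeholder name).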

drl · May 11, 2012, 2:38pm

Hi.

Here are two other methods. The first uses msort, which allows fields to be specified from the right-hand side. The other uses a quickly written perl one-liner that reverses the characters on each line:

#!/usr/bin/env bash

# @(#) s1	Demonstrate collect by extension, msort, perl, sort
# msort home: http://freshmeat.net/projects/msort

pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
re() { perl -wn -e 'print scalar reverse;' $1; pe ; }
C=$HOME/bin/context && [ -f $C ] && $C msort perl sort

FILE=${1-data1}
pl " Input file $FILE:"
head $FILE

pl " Results with msort:"
msort -q --line -d"." --position=-1,-1 --position=1,1 $FILE

pl " Results with (perl) reverse, sort, reverse:"
re $FILE |
sort -t"." |
re |
tee f1

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
msort 8.44
perl 5.10.0
sort (GNU coreutils) 6.10

-----
 Input file data1:
a.xml
a.doc
a.txt
a.jpg
b.xml
b.doc
b.txt
b.jpg
c.xml
c.doc

-----
 Results with msort:
a.doc
b.doc
c.doc
a.jpg
b.jpg
c.jpg
a.txt
b.txt
c.txt
a.xml
b.xml
c.xml

-----
 Results with (perl) reverse, sort, reverse:


a.doc
b.doc
c.doc
a.jpg
b.jpg
c.jpg
a.xml
b.xml
c.xml
a.txt
b.txt
c.txt

The perl function is not completely satisfactory, but perhaps someone will stop by with a suggestion to omit the extra newlines.

I haven't tried to install msort on Solaris, but there is a link on MSORT for it.

Best wishes ... cheers, drl

otheus · May 12, 2012, 12:24am

I thought of two better ways to do this.

First, you can use the traditional sort, and this will work fine for 99% of the cases:

ls -1 | sort -t. -k 3,3 -k 2,2 -k 1,1

You tell sort to order by the 3rd dot-separated field, then the 2nd, then the 1st, and sort treats non-existent fields as empty. The only problem with this sort method is that you get this kind of weird ordering:

bar
foo
bar.zip
foo.zip
foo.bar.jpg
foo.bar.zip
bar.foo.zip

That is, names with two dots are ordered by their third field first, so they all land after the names with only one dot. As a result, the .zip files end up split into two separate groups, and the last two seem out of place.

So for the best ordering, the one most likely to be expected, you make the last extension "special" by inserting a special character before the final period. Then sort, then remove the special character. You can use the path separator because it is never part of a filename.

ls -1 | sed 's/\(\.[^.]*\)$/\/\1/' | sort -t/ -k 2,2 -k 1,1  |  sed 's/\/\([^/]*\)$/\1/'

I'll admit: That's ugly for the command line. It could be a bit nicer if you don't need to worry about full path-names in your list.
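
For readability, the same decorate / sort / undecorate pipeline can be wrapped in a function and tried on the sample names listed earlier. This is only a sketch: the name sort_by_last_ext is invented here, LC_ALL=C is added as an assumption to make the order reproducible, and the input is assumed to be bare names without any "/".

```shell
# Hypothetical wrapper around the sed | sort | sed pipeline.
sort_by_last_ext() {
    # 1. Insert "/" before the final ".ext" (names without a dot are left alone).
    sed 's/\(\.[^.]*\)$/\/\1/' |
    # 2. Sort on the extension (field 2), then on the rest of the name (field 1).
    LC_ALL=C sort -t/ -k2,2 -k1,1 |
    # 3. Remove the "/" that step 1 inserted.
    sed 's/\/\([^/]*\)$/\1/'
}

printf '%s\n' bar foo bar.zip foo.zip foo.bar.jpg foo.bar.zip bar.foo.zip |
    sort_by_last_ext
```

On that sample the names without an extension come first (bar, foo), then foo.bar.jpg, then all four .zip names together (bar.zip, bar.foo.zip, foo.zip, foo.bar.zip).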

Postscript: On Linux, the "rev" command comes with the util-linux suite. It prints each line of its input reversed, so you can use drl's technique in that environment:

ls -1 | rev | sort | rev

mirni · May 12, 2012, 2:45am

Not sure what ls there is on Solaris, but GNU ls has an -X option that does exactly that -- sorts by extension.

Edit: Solaris ls doesn't support -X. Never mind. How about this:
Prepend with the extension, sort on it, and then take it out (DSU = decorate-sort-undecorate):

ls | awk -F. '{print $NF,$0}' | sort -k1  | cut -d" " -f2-
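
To see what the decoration step does, here is a small sketch of the same pipeline fed from printf instead of ls. The function name dsu_sort and the sample names are made up, and LC_ALL=C is an added assumption to keep the order reproducible. Note that a name without any dot is keyed on the whole name, because $NF is then the entire line.

```shell
# Hypothetical wrapper around the awk | sort | cut pipeline.
dsu_sort() {
    # Decorate: print "<last dot-separated field> <original line>".
    awk -F. '{ print $NF, $0 }' |
    # Sort: the key starts at the prepended field.
    LC_ALL=C sort -k1 |
    # Undecorate: drop everything up to the first space.
    cut -d' ' -f2-
}

printf '%s\n' a.xml a.doc b.txt b.doc README | dsu_sort
```

With this input the result should be README, a.doc, b.doc, b.txt, a.xml. README sorts first only because uppercase precedes lowercase in the C locale, not because it has no extension.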

Scrutinizer · May 12, 2012, 3:33am

Both rev commands also reverse the suffixes, and so they do not get sorted in alphabetical order.

A different strategy would be to prepend each name with its suffix and a dot (or just a space and a dot if there is no suffix) and remove them again after the sort. The sort would still need to use the dot as its field separator (-t.).

Another option might be just to list the suffixes, if they are not too many:

ls -l *.gz *.dat *.ctl *.xml
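
A caveat worth checking on your system: ls normally sorts all of its file operands together by name, so a single ls call may not keep the names grouped by suffix. A loop over the suffixes keeps the groups apart. The sketch below is an illustration only: the function name list_by_suffix and the demo file names are invented, and it assumes a POSIX shell (the ${f##*/} expansion is not available in the old Solaris /bin/sh).

```shell
# Hypothetical helper: list_by_suffix DIR EXT...
# Prints the names in DIR one suffix group at a time, in the order the
# suffixes are given. Within a group the shell's glob order applies.
list_by_suffix() {
    dir=$1
    shift
    for ext in "$@"; do
        for f in "$dir"/*."$ext"; do
            # An unmatched pattern stays literal, so skip anything that does not exist.
            if [ -e "$f" ]; then
                printf '%s\n' "${f##*/}"
            fi
        done
    done
}

# Demo in a scratch directory with made-up file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/b.gz" "$demo_dir/a.dat" "$demo_dir/a.gz" "$demo_dir/c.xml"
list_by_suffix "$demo_dir" gz dat ctl xml
rm -rf "$demo_dir"
```

The demo should print a.gz, b.gz, a.dat, c.xml. Leaving xml out of the suffix list also takes care of dropping the .xml names.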

drl · May 12, 2012, 6:51am

Hi.

Good selection of techniques and problem solving.

So far I like those 2 solutions the best for utilizing standard tools, at least on the data files posted so far in this thread. The msort solution is a single-command (but non-standard) solution: the ability to specify fields from the right-hand-side is invaluable in this situation.

That is true; however, my impression was that the OP desired grouping rather than strict sorting, in which case the revs work except where there are no extensions. In those situations, the files without a suffix are not in a group by themselves. The possibility of more than one dot does complicate the issue, and I'm glad that it was raised.

After some thought, a better re function for my script is:

re() { perl -wn -e 'chomp;print scalar reverse,"\n";' $1 ; }

Best wishes ... cheers, drl

Scrutinizer · May 12, 2012, 7:09am

Further to this, any sort must have the suffix as the primary sort key and, as drl suggests, make provision for files without extensions, or these will end up all over the place. Something like this, perhaps:

ls | awk -F. 'NF==1{NF++}{print $NF,$0}' OFS=. | sort -t. -k1,1 | cut -d. -f2-

otheus · May 12, 2012, 2:35pm

Did the OP die of info overload ?

drl · May 12, 2012, 5:48pm

Hi.

For the previous solution, we'd need to think of a way to remove added trailing dots, but preserve any that might be on the originals:

#!/usr/bin/env bash

# @(#) user3	Demonstrate collect by extension.

pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
re() { perl -wn -e 'chomp;print scalar reverse,"\n";' $1 ; }
C=$HOME/bin/context && [ -f $C ] && $C msort perl sort

FILE=${1-data1}
pl " Input file $FILE:"
head $FILE

pl " Results for awk / sort / cut:"
awk -F. 'NF==1{NF++}{print $NF,$0}' OFS=. $FILE | sort -t. -k1,1 | cut -d. -f2-

exit 0

producing:

% ./user3 data3

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
msort 8.44
perl 5.10.0
sort (GNU coreutils) 6.10

-----
 Input file data3:
bar
foo
bar.zip
foo.zip
foo.bar.jpg
foo.bar.zip
bar.foo.zip
baz.

-----
 Results for awk / sort / cut:
bar.
baz.
foo.
foo.bar.jpg
bar.foo.zip
bar.zip
foo.bar.zip
foo.zip
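
One possible way around the added trailing dots, offered only as a sketch: leave the record itself untouched and prepend the suffix as a separate key, using "/" as the key separator (an assumption that holds as long as the list contains bare names rather than paths). The function name sort_by_ext_keep_names is made up, and LC_ALL=C is added to make the order reproducible.

```shell
# Hypothetical variant of the awk | sort | cut pipeline that does not modify names.
sort_by_ext_keep_names() {
    # Decorate: "<suffix>/<original name>", with an empty suffix when there is no dot.
    awk -F. '{ if (NF > 1) ext = $NF; else ext = ""; print ext "/" $0 }' |
    # Sort on the suffix first, then on the original name.
    LC_ALL=C sort -t/ -k1,1 -k2 |
    # Undecorate: drop the key and its separator.
    cut -d/ -f2-
}

printf '%s\n' bar foo bar.zip foo.zip foo.bar.jpg foo.bar.zip bar.foo.zip baz. |
    sort_by_ext_keep_names
```

On the data3 names this should give the same order as the output above, but with bar and foo printed without a trailing dot and baz. keeping its own.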

cheers, drl