Reverse sort on delimited chunks within a file

gimley · September 17, 2012, 9:32pm

Hello,
I have a large file of names grouped by their homographs. The database has the following structure: each set of homographs, with their corresponding equivalents in Devanagari, is separated from the next set by a hard return (a blank line). An example will make this clear:

The revsort routines I have in Gawk/Perl sort in reverse order, but in doing so they do not respect the structure of the file, which gets jumbled up.
I have tried to write a sort in which each set is sorted in reverse order separately, maintaining the integrity of the data structure. I am quite frustrated with the results: I know the logic, but I just cannot handle delimiting the sets and then sorting in reverse within each set.
As an example of the desired output, the first two sets would look something like this (manually sorted, and correctly I hope):

Many thanks in advance for help. I work under Windows, so an awk or perl script would be of great use.

Chubler_XL · September 17, 2012, 10:08pm

Try this gawk solution:

gawk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_desc" }
  { delete L
    printf "%s\n",$1
    for(i=2;i<=NF;i++)
       L[gensub(/ /,"","g",$i)]=$i
    for(l in L) printf "%s\n", L[l]
    printf "\n" }' FS='\n' RS='' infile

gimley · September 18, 2012, 12:19am

Many thanks. Am not in at present. But will definitely get back to you with feedback. Your solutions always work.
Thanks once again

---------- Post updated at 10:59 PM ---------- Previous update was at 09:55 PM ----------

Hello,
Sorry to hassle you, but I am getting a consistent error on line 7 of the code. I tried to correct it in every way I could think of, but I still get the same error.
Could you please help. Am reproducing the awk message below:

gawk: sortonsets.gk:7:     printf "\n" }' FS='\n' RS='
gawk: sortonsets.gk:7:                  ^ Invalid char ''' in expression

Many thanx

---------- Post updated at 11:19 PM ---------- Previous update was at 10:59 PM ----------

Sorry for the goof-up. Guess I was too tired. Here's the working code; I put in comments for clarity.

BEGIN { PROCINFO["sorted_in"] = "@ind_str_desc" }
  { delete L
    printf "%s\n",$1
    for(i=2;i<=NF;i++)
       L[gensub(/ /,"","g",$i)]=$i
    for(l in L) printf "%s\n", L[l]
    printf "\n" }
# change the record separator from newline to nothing
RS=""
# change the field separator from whitespace to newline
FS="\n"

Chubler_XL · September 18, 2012, 11:50am

NP, probably better to put RS and FS assignments in the BEGIN block.

BEGIN {
    # change the record separator from newline to empty line
    RS="";

    # change the field separator from whitespace to newline
    FS="\n"; 

    # change array sort order from unsorted to by index descending
    PROCINFO["sorted_in"] = "@ind_str_desc";
}

gimley · September 18, 2012, 12:29pm

Many thanks. Will try it out and see the output.
For the nonce and out of curiosity, I did want to put the FS and RS at the top, but you had placed them at the end: does it make any difference to execution?
Many thanks once more for taking pains to make this useful suggestion.

Chubler_XL · September 18, 2012, 6:39pm

Yes, it does make a difference. Originally I had the assignments on the command line (outside of the single quotes), and these assignments are done just before the input file that follows them is read, so they are already in effect for the first record.

Assignments in the program outside of the BEGIN block are executed each time a record is read in. This is bad, as the first record will already have been read and parsed with the default RS and FS before they take effect.

I don't tend to change RS in the middle of the code so I'm not sure what will happen to $0 after this is done, but I assume that nothing will happen with the current line as it's already been read from the file at this stage.
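
A small sketch of the different placements, for anyone who wants to see the effect on the first record. The sample data and the temp file are made up for illustration, and only POSIX awk features are assumed.

```shell
# Made-up sample: two blank-line-separated sets; the first line has three words.
tmp=$(mktemp)
printf 'a b c\nd e\n\nf g\nh i\n' > "$tmp"

# 1. -v assignments are in place before anything is read: record 1 is the whole first set.
with_v=$(awk -v RS= -v FS='\n' 'NR == 1 { print NF }' "$tmp")

# 2. Assignments in BEGIN are also in place before the first record is read.
in_begin=$(awk 'BEGIN { RS = ""; FS = "\n" } NR == 1 { print NF }' "$tmp")

# 3. Operand assignments (after the program text) are done before the file
#    that follows them is read.
as_operand=$(awk 'NR == 1 { print NF }' RS= FS='\n' "$tmp")

# 4. Assignments in the main body come too late for record 1, which has already
#    been read as the single line "a b c" and split on whitespace.
in_body=$(awk 'NR == 1 { print NF } { RS = ""; FS = "\n" }' "$tmp")

echo "-v: $with_v  BEGIN: $in_begin  operand: $as_operand  body: $in_body"
rm -f "$tmp"
```

The first three report 2 fields (the two lines of the first set); the last reports 3 (the three words of the first line).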

gimley · September 19, 2012, 12:37pm

Sorry my broadband has been playing truant. I changed the position as you suggested and the sort has become more accurate. Many thanks for educating me on the reason why these should be placed first.
I just used to put these two in the beginning without knowing that their placement makes a difference. I learned awk by trial and error and I remember reading somewhere that the placement of AWK commands does not matter. Now I know it does and the value of such placements.
Many thanks once again

---------- Post updated 09-19-12 at 11:37 AM ---------- Previous update was 09-18-12 at 11:10 PM ----------

Hello,
I modified the script as per your suggestion and it worked just fine when I ran it on a small sample; however, when the sample size increased, the sort seemed to go wrong.
Instead of sorting the look-alikes in reverse order (i.e. last to first), the script sometimes produces what looks like a random order. In other cases it seems to work just fine.
I am attaching a larger sample for testing. Here is the output of the file after applying the awk script. As can be seen all similar words in Hindi are not clustered together but are pretty well scattered.

The expected output should have been with all the look-alikes sorted from last to first clustered together.

I have gone through the script pretty carefully, line by line, and tested each condition laid down, and I am a bit perplexed as to why the data is behaving in this fashion. There are no trailing spaces and the data is absolutely "clean".

Please help. Sorting by hand is a time-consuming and also error-prone process.
Many thanks in advance.

Corona688 · September 19, 2012, 1:22pm

Not the most efficient, but seems to work:

$ awk -v RS="" -v ORS="\n" -v OFS="\n" -F"\n" '{ print "\n"$1 ; sub(/^[^\n]*\n/, ""); print | "sort" ; close("sort") }' data

#awsekar
! aaosekar=.....
! aausekar=.....
! aousekar=.....
! auosekar=.....
! ausekar=......
! aushekar=.....
! avasekar=......
! avsekar=......
! awasekar=......
! awsekar=......

#ayaaj
! aayaj=....
! aayaz=....
! aiyaz=.....
! ayaaj=
! ayaaz=.......
! ayaj=.....
! ayaja=.....
! ayaz=.......
! ayaza=.....
! ayyaj=.......
! ayyaz=......

#ayeza
! aaeesa=....
! aaeesha=....
! aaesha=.....
! aaisa=....
! aaisha=....
! aayasa=....
! aayasha=....
! aayeesha=.....
! aayesha=....
! aayeshaa=....
! aayisha=.....
! aaysa=....
! aaysha=....
! aeesa=....
! aeesha=....
! aesha=.....
! aeysha=....
! aiesha=.....
! aisa=....
! aisha=....
! aiyasha=....
! aiyesha=....
! aiysha=....
! ayaesha=.....
! ayasa=....
! ayasha=....
! ayeesha=.....
! ayesha=....
! ayeshah=.....
! ayeshaha=....
! ayeza=....
! ayisha=.....
! aysa=....
! aysha=....

Make it | "sort -r" to reverse the order.

Chubler_XL · September 19, 2012, 4:59pm

PROCINFO[] is only supported in GNU awk; awk --version should report something like "GNU Awk 4.0.1".

Otherwise Corona688's solution is quite acceptable for small to mid-sized data files.
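
If upgrading gawk is not an option, the descending sort can also be done by hand inside awk, with no external sort at all. A minimal sketch, assuming each set is small enough that an insertion sort is acceptable; the sample data and the array name line are made up, and unlike the gawk version earlier in the thread it compares the lines as they are, without stripping spaces first.

```shell
# Made-up sample in the thread's layout: a # header, ! members, blank line between sets.
tmp=$(mktemp)
printf '%s\n' '#ayaaj' '! aayaj=....' '! ayyaz=......' '! ayaj=.....' '' \
              '#ayeza' '! aisha=....' '! aysha=....' > "$tmp"

# LC_ALL=C keeps string comparison byte-wise, whatever the awk implementation.
sorted=$(LC_ALL=C awk 'BEGIN { RS = ""; FS = "\n" }
{
    n = 0
    for (i = 2; i <= NF; i++) line[++n] = $i       # copy the members, skip the header

    # Insertion sort into descending order; "" forces a string comparison.
    for (i = 2; i <= n; i++) {
        v = line[i]
        for (j = i - 1; j >= 1 && (line[j] "") < (v ""); j--)
            line[j + 1] = line[j]
        line[j + 1] = v
    }

    print $1                                        # header stays on top
    for (i = 1; i <= n; i++) print line[i]
    print ""                                        # blank line closes the set
}' "$tmp")

printf '%s\n' "$sorted"
rm -f "$tmp"
```

Since everything happens in a single awk process, nothing depends on PROCINFO or on launching a sort per set.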

Corona688 · September 19, 2012, 5:53pm

Here's a version which uses one sort. Should be much more efficient than my first version.

$ awk -v RS="" -v ORS="\n" -v OFS="\n" -F"\n" '{
        $1=sprintf("%08d\t%s", ++Z, $1);
        Z++; for(N=2; N<=NF; N++)
        $N=sprintf("%08d\t%s", Z, $N) } 1' data |
        sort | sed 's/^[^\t]*\t//;s/#/\n#/'


#awsekar
! aaosekar=.....
! aausekar=.....
! aousekar=.....
! auosekar=.....
! ausekar=......
! aushekar=.....
! avasekar=......
! avsekar=......
! awasekar=......
! awsekar=......

#ayaaj
! aayaj=....
! aayaz=....
! aiyaz=.....
! ayaaj=
! ayaaz=.......
! ayaj=.....
! ayaja=.....
! ayaz=.......
! ayaza=.....
! ayyaj=.......
! ayyaz=......

#ayeza
! aaeesa=....
! aaeesha=....
! aaesha=.....
! aaisa=....
! aaisha=....
! aayasa=....
! aayasha=....
! aayeesha=.....
! aayesha=....
! aayeshaa=....
! aayisha=.....
! aaysa=....
! aaysha=....
! aeesa=....
! aeesha=....
! aesha=.....
! aeysha=....
! aiesha=.....
! aisa=....
! aisha=....
! aiyasha=....
! aiyesha=....
! aiysha=....
! ayaesha=.....
! ayasa=....
! ayasha=....
! ayeesha=.....
! ayesha=....
! ayeshah=.....
! ayeshaha=....
! ayeza=....
! ayisha=.....
! aysa=....
! aysha=....

$

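It works by prepending a numeric sort key to every line: each header gets its own number, and all the names in its group share the next one, which forces sort to keep every group together with its header first. Once they're sorted, sed strips the numbers back off and puts the blank lines back in.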
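
To get each group in reverse order, as originally asked, sort can be told to treat the number and the text as separate keys. A sketch, assuming a POSIX sort that supports -t and -k and that only header lines start with #; the sample data is made up, and the last stage uses awk instead of sed.

```shell
# Made-up sample in the thread's layout.
tmp=$(mktemp)
printf '%s\n' '#awsekar' '! aaosekar=.....' '! awsekar=......' '! ausekar=......' '' \
              '#ayaaj' '! ayyaz=......' '! aayaj=....' > "$tmp"
tab=$(printf '\t')

# Stage 1 (awk):  prepend the keys.
# Stage 2 (sort): key ascending, text descending within equal keys.
# Stage 3 (awk):  strip the keys and restore the blank lines.
result=$(awk -v RS="" -v OFS="\n" -F"\n" '{
        $1 = sprintf("%08d\t%s", ++Z, $1)       # header key: 1, 3, 5, ...
        Z++                                      # members of the set share the next key
        for (N = 2; N <= NF; N++)
                $N = sprintf("%08d\t%s", Z, $N)
} 1' "$tmp" |
        LC_ALL=C sort -t "$tab" -k1,1 -k2,2r |
        awk '{ sub(/^[0-9]+\t/, "") }
             /^#/ && NR > 1 { print "" }
             1')

printf '%s\n' "$result"
rm -f "$tmp"
```

Because the header carries a smaller key than its members, reversing only the second key leaves the header on top of each group.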

gimley · September 19, 2012, 10:13pm

Many thanks.
The first solution handles medium-sized files. As soon as a large file is given, it seems to scatter the data.
I can't use the second solution: since I am working under Windows, the sed command doesn't work for me.
Many thanks all the same for attacking the problem.

Chubler_XL · September 19, 2012, 10:59pm

Gimley,

See if you can get gawk 4.0.1 for Windows. I just downloaded it and tested it here, and the sort works fine (3.1.6 doesn't sort, since PROCINFO["sorted_in"] was only introduced in gawk 4.0).

Corona688 · September 21, 2012, 11:20am

Windows is known to react badly to rapidly launching thousands of small processes -- some sort of table fills up faster than it can empty and processes start failing to launch -- so yeah, I can see my first version not working properly in Windows. In UNIX, it would work, but at suboptimal speed.

Here's the single-sort version rewritten without sed.

awk -v RS="" -v ORS="\n" -v OFS="\n" -F"\n" '{
        $1=sprintf("%08d\t%s", ++Z, $1);
        Z++; for(N=2; N<=NF; N++)
        $N=sprintf("%08d\t%s", Z, $N) } 1' data |
        sort | awk '{ sub(/^[^!#]*/, ""); sub(/^#/, "\n#"); } 1'


#awsekar
! aaosekar=.....
! aausekar=.....
! aousekar=.....
! auosekar=.....
! ausekar=......
! aushekar=.....
! avasekar=......
! avsekar=......
! awasekar=......
! awsekar=......

#ayaaj
! aayaj=....
! aayaz=....
! aiyaz=.....
! ayaaj=
! ayaaz=.......
! ayaj=.....
! ayaja=.....
! ayaz=.......
! ayaza=.....
! ayyaj=.......
! ayyaz=......

#ayeza
! aaeesa=....
! aaeesha=....
! aaesha=.....
! aaisa=....
! aaisha=....
! aayasa=....
! aayasha=....
! aayeesha=.....
! aayesha=....
! aayeshaa=....
! aayisha=.....
! aaysa=....
! aaysha=....
! aeesa=....
! aeesha=....
! aesha=.....
! aeysha=....
! aiesha=.....
! aisa=....
! aisha=....
! aiyasha=....
! aiyesha=....
! aiysha=....
! ayaesha=.....
! ayasa=....
! ayasha=....
! ayeesha=.....
! ayesha=....
! ayeshah=.....
! ayeshaha=....
! ayeza=....
! ayisha=.....
! aysa=....
! aysha=....

$

You could also use Busybox for Windows to fill out your missing system utilities. It's a single executable which bundles all of these:

[, [[, ar, ash, awk, base64, basename, bash, bbconfig, bunzip2, bzcat,
bzip2, cal, cat, catv, cksum, cmp, comm, cp, cpio, cut, date, dc, dd,
diff, dirname, dos2unix, echo, ed, egrep, env, expand, expr, false,
fgrep, find, fold, getopt, grep, gunzip, gzip, hd, head, hexdump, kill,
killall, length, ls, lzcat, lzma, lzop, lzopcat, md5sum, mkdir, mv, od,
pgrep, pidof, printenv, printf, ps, pwd, rm, rmdir, rpm2cpio, sed, seq,
sh, sha1sum, sha256sum, sha512sum, sleep, sort, split, strings, sum,
tac, tail, tar, tee, test, touch, tr, true, uncompress, unexpand, uniq,
unix2dos, unlzma, unlzop, unxz, unzip, usleep, uudecode, uuencode, vi,
wc, wget, which, whoami, xargs, xz, xzcat, yes, zcat

Particularly useful is the built-in shell, since it can run pipe chains properly, and will know how to use all the above commands without prepending 'busybox.exe ' to them.

Its built-in bash is not actually the Bash shell, just the same Bourne-style shell you get with sh, but it is a quite reasonably functional shell nonetheless.