Sorting an HTML file with an external sort order

I am working on a web concordance of Old Avestan, and my concordance has produced an HTML file [attached in zipped format].
The sort order used in the HTML file is not the one we normally use. I have tried my best to force a sort within the concordance itself, but the sort order does not work.
The required sort order, in UTF-8 format, is given below:

a,,�,,,,b,,c,d,,e,,,,f,g,,,h,i,,j,k,l,m,m,n,,,,,,o,,p,r,s,�,�,,t,t,,u,,v,x,x,x,y,,z,�

Is there a Perl script which could do the trick? The data is part of an open-source project on Old Avestan and will be made available to all scholars working in the field.
Many thanks in advance for your help.

Hi.

I have often used msort, which is available in many repositories. I don't know whether it will be useful for your problem, but it has a number of features beyond GNU/*nix sort: MSORT

Best wishes ... cheers, drl


Many thanks. I tried msort, but the problem is that the input is an HTML file and the sort does not come out accurately.
I hope someone has an answer to the problem.

Hi,

A ready-to-use solution for your problem probably does not exist, since what you want to sort is an individual HTML file with its own layout.
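One way to approach an individual HTML file is to pull the entries out of the markup, sort them by headword, and put them back. The following is only a sketch: the sample markup, the one-entry-per-line layout and the helper name `headword` are assumptions for illustration, since the real file is not shown here, and plain `cmp` stands in for the custom sort order.

```perl
use strict;
use warnings;

# Hypothetical sample: one entry per line, headword first. The real file may differ.
my $html = join '',
    "<html><body>\n",
    "<p><b>cow</b> entry text</p>\n",
    "<p><b>ant</b> entry text</p>\n",
    "<p><b>bee</b> entry text</p>\n",
    "</body></html>\n";

# Split the document into the part before the entries, the entries, and the rest.
my ($head, $body, $tail) = $html =~ m{\A(.*?<body>\n)(.*?)(</body>.*)\z}s
    or die "unexpected HTML layout\n";

# One entry per line; split /^/ keeps the trailing newline on each entry.
my @entries = split /^/, $body;

# Hypothetical helper: strip the tags and return the first word of an entry.
sub headword {
    my ($line) = @_;
    (my $text = $line) =~ s/<[^>]+>//g;
    my ($word) = $text =~ /(\S+)/;
    return $word // '';
}

# Plain cmp is a placeholder; a custom comparison would go here instead.
my @sorted = sort { headword($a) cmp headword($b) } @entries;

# Reassemble the file around the sorted entries.
my $result = $head . join('', @sorted) . $tail;
print $result;
```

A regular expression is enough for a flat, machine-generated layout like the one assumed here; nested or irregular markup would call for a real HTML parser.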

Writing a custom sort function is not that complex. In short,
you write a small function that returns -1, 0 or 1 depending on
whether, of the two given values $a and $b, the value of $a is less than,
equal to or greater than $b. Within that function you can keep a hash that stores the sort weight for each character. Something like this:

sub old_avesian {

   my %chars = (
    
         "a" => 1,
         "" => 2,
         "�" => 3,
         "" => 4,
         "" => 5,
         "" => 6,
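         # ... remaining characters of the sort order
  );
  # The weights are numbers, so compare them with <=> rather than the string operator cmp.
  # Note: this looks up $a and $b as whole keys, so as written it only orders single characters.
  return $chars{$a} <=> $chars{$b};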
}

@sorted_list = sort old_avesian @wordlist;

As an additional difficulty, you have to handle the multibyte characters of Unicode. Perl should have all the necessary tools built in to do this, but that is beyond my experience.

My Perl skills are rarely used, so be sure to test my code before using it.
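On the multibyte point, a minimal sketch of what changes once the input is decoded: before decoding, `length` and `substr` count bytes, and afterwards they count characters, which is what a per-character weight table needs. The sample word is made up, and the core `Encode` module is used here as one possible way to decode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes, as they would arrive from a file opened without an encoding layer.
# "\xC4\x81" is the two-byte UTF-8 encoding of U+0101 (a with macron); the word is made up.
my $bytes = "\xC4\x81sna";

# Decoding turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

my $byte_length = length($bytes);        # counts bytes
my $char_length = length($chars);        # counts characters
my $first_char  = substr($chars, 0, 1);  # one whole character, not half of one

# Encode the output again so the terminal receives valid UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print "bytes: $byte_length, characters: $char_length, first character: $first_char\n";
```

Opening the input with an encoding layer, for example `open(my $fh, '<:encoding(UTF-8)', $path)`, should give the same effect without an explicit `decode` call.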


I will definitely try it out and see the result and get back to you.
Many thanks for your kind help

Btw: I tried this myself over the past weekend and ran into the UTF-8 multibyte problem.

Hi gimley,

A good starting point for reading is certainly this one:

perlunicode - perldoc.perl.org

Here is my attempt with plain old 8-bit characters. You only need to add the Unicode handling.

#!/usr/bin/env perl
use strict;
use warnings;

# Sort weight for each character, built once instead of on every comparison
my %chars = (
        "a" => 1,
        "b" => 2,
        "c" => 3,
        "d" => 4,
        "e" => 5,
        "f" => 6,
        "g" => 7,
        "h" => 8,
        "i" => 9,
        "j" => 10,
        "k" => 11,
        "l" => 12,
        "m" => 13,
        "n" => 14,
        "o" => 15,
        "p" => 16,
        "q" => 17,
        "r" => 18,
        "s" => 19,
        "t" => 20,
        "u" => 21,
        "v" => 22,
        "w" => 23,
        "x" => 24,
        "y" => 25,
        "z" => 26);

# Compare two words character by character using the weights in %chars.
# Returns -1, 0 or 1, like <=> and cmp.
sub compare_words {
        my ($word_a, $word_b) = @_;

        # If either word is used up, the shorter one sorts first (0 if both are empty)
        return length($word_a) <=> length($word_b)
                if length($word_a) == 0 or length($word_b) == 0;

        # Get the first chars, which we need to compare
        my $a1 = substr($word_a, 0, 1);
        my $b1 = substr($word_b, 0, 1);

        # Characters missing from the table get weight 0 and sort first
        my $order = ($chars{$a1} // 0) <=> ($chars{$b1} // 0);

        # If the current chars differ, we're finished now
        return $order if $order != 0;

        # If the current chars are equal, compare the substrings beginning at the second char
        return compare_words(substr($word_a, 1), substr($word_b, 1));
}

# Perl sets $a and $b for the values to compare when this is used with sort
sub char_sort {
        return compare_words($a, $b);
}

my @list = ("my","favorite","animal","book","for","advanced","biologists");
my @sorted = sort char_sort @list;

print("\n");
print("*** Unsorted ***\n");
print("$_\n") foreach @list;
print("\n");

print("*** Sorted ***\n");
print("$_\n") foreach @sorted;
print("\n");
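For the Unicode part, one possible direction is to build the weight table from an ordered list of characters and to turn each word into a sort key once, instead of walking both words on every comparison. The sketch below makes several assumptions: the `@order` list is a short made-up excerpt and not the real Old Avestan order, every letter is a single code point, and the names `sort_key` and `sort_by_order` and the sample words are invented for illustration.

```perl
use strict;
use warnings;
use utf8;                                  # the source file itself is UTF-8

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical excerpt of a custom order; replace it with the real sequence.
my @order = ('a', 'ā', 'b', 'd', 'e', 'ē', 'n', 's', 't');

# Weight of each character = its position in @order, starting at 1.
my %weight;
@weight{@order} = (1 .. scalar @order);

# Pack the weights of a word into 16-bit big-endian values, so that a plain
# string comparison of two keys orders them by weight. Characters that are
# not in the table get weight 0 and sort first.
sub sort_key {
    my ($word) = @_;
    return pack('n*', map { $weight{$_} // 0 } split(//, $word));
}

# Compute each key once, sort on the keys, then return the original words.
sub sort_by_order {
    my @words = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ sort_key($_), $_ ] } @words;
}

# Made-up words using only characters from the excerpt above.
my @sorted = sort_by_order('ēna', 'ast', 'āsna', 'bada', 'ana', 'eta');
print "$_\n" for @sorted;
```

If the real order treats some letter sequences as a single unit, splitting on every character is no longer enough, and the word would have to be tokenized against the list of units first.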