Re: [Corpora-List] sorting OHG (non-ASCII) in PERL

From: Jan Strunk (strunk@linguistics.ruhr-uni-bochum.de)
Date: Tue Feb 04 2003 - 17:09:41 MET


    Hi,

    if you want it quick and dirty, you can define your own sorting routine for the
    Perl sort function.
    I wrote an example. You could use the subs "mysort" and "initialize" as they are in a
    Perl program, provided you use the two global variables @order and %sorthash.
    @order should contain the exact ordering of letters (including capitalized and non-capitalized letters).
    %sorthash will be needed by the two subs.
    Then you need to call initialize(); once before doing any sorting.
    When you want to sort, you have to use "sort mysort @list".
    It is very important for correct sorting that every character that ever occurs in anything
    you want to sort is included in the ordering, i.e. the list @order. A character that is
    missing from @order has no entry in %sorthash and is compared as 0, so it sorts before
    all listed characters.
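
    Since a missing character is easy to overlook, a quick check before sorting may help.
    The following is only a minimal sketch: the helper name missing_chars is hypothetical
    (it is not part of the example further down), and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;   # assumption: this file is saved as UTF-8

# Hypothetical helper (not in the original post): returns, in code-point
# order, every character of @$strings that is absent from @$order.
sub missing_chars {
    my ($order, $strings) = @_;
    my %known = map { $_ => 1 } @$order;
    my %missing;
    foreach my $string (@$strings) {
        foreach my $char (split(//, $string)) {
            $missing{$char} = 1 unless $known{$char};
        }
    }
    return sort keys %missing;
}

my @order   = ("a", "A", "â", "Â", "b", "e", "ê", "z");
my @strings = ("abe", "êza", "zît");

binmode(STDOUT, ":encoding(UTF-8)");
my @missing = missing_chars(\@order, \@strings);
print "Not in \@order: @missing\n" if @missing;
```

    With the sample data above, the check reports "t" and "î", which would then have to
    be added to @order at the right position.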

    As I am not a real Perl hacker myself, it may well be that there is some more
    efficient way, or maybe there is even a bug in the program, but it seemed to work.

    Best,

    Jan Strunk
    strunk@linguistics.ruhr-uni-bochum.de

    An example is the following code:

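    use utf8;                            # Assumes this file is saved as UTF-8; omit both lines for a Latin-1 source.
    binmode(STDOUT, ":encoding(UTF-8)"); # Print the accented characters as UTF-8.

    my @order=("a", "A", "â", "Â", "b", "e", "ê", "z"); # Has to list every character, in the desired order

    my %sorthash; # For quicker sorting, the sub initialize() puts the list @order into a hash.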

    my @strings=("a", "â", "e", "âbe", "êz", "abe", "êza"); # Things you want to sort.

    initialize(); # Puts the ordering into a hash of the format ("a" => 1, "A" => 2, "â" => 3, "Â" => 4, ...)

    my $string;
    foreach $string (sort mysort @strings) { # Normal way of sorting in perl, but sort now calls "mysort" for getting the right ordering
        print $string."\n";
    }

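    sub mysort { # Compares the two elements $a and $b that sort passes in
        my $word1=$a;
        my $word2=$b;

        return 0 if ($word1 eq $word2);

        my @word1=split("", $word1);
        my @word2=split("", $word2);

        # Compare character by character until one word runs out.
        while ((@word1 > 0) and (@word2 > 0)) {
            my $char1=shift @word1;
            my $char2=shift @word2;

            my $compare=($sorthash{$char1}<=>$sorthash{$char2});

            return $compare if ($compare != 0);
        }

        # One word is a prefix of the other: the shorter word sorts first.
        # If both ran out together (characters of equal rank), they count as equal.
        return (scalar(@word1) <=> scalar(@word2));
    }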

    sub initialize {
        my $i=1;
        my $entry;
        foreach $entry (@order) {
            $sorthash{$entry}=$i;
            $i++;
        }
    }
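
    On the "more efficient way" mentioned above: one common option is to compute a sort key
    for each string once, instead of splitting both strings in every comparison. The sketch
    below is one possible variant, not a drop-in replacement for the subs above. The helper
    names sort_key and sort_by_order are hypothetical, the file is assumed to be saved as
    UTF-8, and the "//" operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use utf8;   # assumption: this file is saved as UTF-8

my @order = ("a", "A", "â", "Â", "b", "e", "ê", "z");

# Rank of each character, starting at 1 as in initialize().
my %rank;
@rank{@order} = (1 .. @order);

# Hypothetical helper: turns a string into a byte string of 16-bit
# big-endian ranks, so that plain "cmp" on the keys follows @order.
# Characters missing from @order get rank 0 and sort first.
sub sort_key {
    my ($string) = @_;
    return pack("n*", map { $rank{$_} // 0 } split(//, $string));
}

# Hypothetical helper: compute each key once, sort by key, drop the keys.
sub sort_by_order {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ sort_key($_), $_ ] } @_;
}

binmode(STDOUT, ":encoding(UTF-8)");
print "$_\n" for sort_by_order("a", "â", "e", "âbe", "êz", "abe", "êza");
```

    For the sample strings this prints a, abe, â, âbe, e, êz, êza, one per line. Because
    each rank is packed into 16 bits, the sketch only works for orderings with fewer than
    65536 entries.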

      ----- Original Message -----
      From: Henning Reetz
      To: corpora@hd.uib.no
      Sent: Tuesday, February 04, 2003 3:56 PM
      Subject: [Corpora-List] sorting OHG (non-ASCII) in PERL

      Hi,

      stupid question but perhaps the freaks can help me:

      we're building a database of Old High German words. Obviously, there are some characters that are not in ASCII (diacritics like stress marks ' and carets ^) and chars that do not follow the 'normal' sorting order (like 'uu' for 'w'). One possibility would be to recode these chars (e.g. get rid of the diacritics for sorting and put them back on in the output), but is there a more elegant and general way (e.g. in case one would like to have a long 'e' after the short 'e' etc.) so that one could use it for other scripts as well (UTF puts chars in an order that does not necessarily reflect the 'intuitive' sequence in a language). - Is there a module to tell PERL which sorting sequence one would like to use or do I have to program it myself?

      Thanx for any hints.

      Henning Reetz
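
      The 'uu' for 'w' case from the question needs units longer than one character, which
      a character-by-character comparison does not cover. A minimal sketch of one way to
      handle it follows. The ordering shown (with "uu" placed directly after "u") and the
      helper name sort_key are assumptions for illustration only, not the actual Old High
      German collation.

```perl
use strict;
use warnings;

# Assumption for illustration: "uu" is one sorting unit placed after "u".
my @order = ("a", "e", "i", "o", "u", "uu", "z");

my %rank;
@rank{@order} = (1 .. @order);

# Longest units first, so that "uu" is tried before "u".
my $unit = join "|",
           map  { quotemeta($_) }
           sort { length($b) <=> length($a) } @order;

# Hypothetical helper: splits a string into sorting units and packs their
# ranks. Any character not matched by a unit is taken alone and gets rank 0.
sub sort_key {
    my ($string) = @_;
    my @units = $string =~ /($unit|.)/gs;
    return pack("n*", map { exists $rank{$_} ? $rank{$_} : 0 } @units);
}

my @sorted = sort { sort_key($a) cmp sort_key($b) } ("uuaz", "uz", "ua", "za");
print "$_\n" for @sorted;
```

      This prints ua, uz, uuaz, za: "uuaz" lands after "uz" because "uu" is ranked as a
      single unit after "u", whereas a per-character comparison would put it between "ua"
      and "uz".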


