Re: [Corpora-List] sorting OHG (non-ASCII) in PERL

From: Thomas Schmidt (thomas.schmidt@uni-hamburg.de)
Date: Tue Feb 04 2003 - 16:31:26 MET

    Dear Henning,

    I don't think there is an easy solution to this. When you say that you use
    diacritics, do you mean "ordinary" characters followed by a combining
    diacritical mark (i.e. TWO chars), or the precomposed combinations of a
    base character and a diacritic (i.e. ONE char, e.g. 'e' with grave accent)
    that are in Latin-Extended etc.? If the latter, you may be lucky and find a
    locale that has the right sorting order for you; you could then tell Perl
    to use that locale. If the former, you'd probably have to write your own
    piece of code. Maybe these links will help you (they helped me with a
    similar problem):

    http://rf.net/~james/perli18n.html
    http://www.sysarch.com/perl/sort_paper.html
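
    To make the ONE-char versus TWO-char distinction concrete, here is a
    minimal sketch. It is written in Python rather than Perl purely for
    illustration, and the helper names to_precomposed and to_combining are
    made up for this example. It only shows that the two encodings of
    'e' with grave accent differ in length and that Unicode normalization
    (NFC/NFD) converts between them; it says nothing about which form a
    given database actually contains.

```python
import unicodedata

# 'e' with grave accent in its two possible encodings.
precomposed = "\u00e8"   # ONE char: LATIN SMALL LETTER E WITH GRAVE
combining = "e\u0300"    # TWO chars: 'e' followed by COMBINING GRAVE ACCENT


def to_precomposed(text):
    """Compose base letter + combining mark into one code point where possible (NFC)."""
    return unicodedata.normalize("NFC", text)


def to_combining(text):
    """Decompose precomposed letters into base letter + combining mark(s) (NFD)."""
    return unicodedata.normalize("NFD", text)
```

    Normalizing all data to one of the two forms before sorting means the
    sorting code only has to deal with a single representation.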

    Kind regards,

            Thomas

    ---------------------------------------
    Thomas Schmidt
    SFB 538 'Mehrsprachigkeit' Teilprojekt Z
    Tel: ++ 49 (040) 42838-6425
    Fax: ++ 49 (040) 42838-6116
    http://www.rrz.uni-hamburg.de/exmaralda
    http://www.rrz.uni-hamburg.de/SFB538/
    ---------------------------------------

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Henning Reetz
    Sent: Tuesday, 4 February 2003 15:57
    To: corpora@hd.uib.no
    Subject: [Corpora-List] sorting OHG (non-ASCII) in PERL

    Hi,

    stupid question, but perhaps the freaks can help me:

    we're building a database of Old High German words. Obviously, there are
    some characters that are not in ASCII (diacritics like stress marks ' and
    carets ^) and chars that do not follow the 'normal' sorting order (like 'uu'
    for 'w'). One possibility would be to recode these chars (e.g. get rid of
    the diacritics for sorting and put them back on in the output), but is there
    a more elegant and general way (e.g. in case one would like to have a long
    'e' after the short 'e' etc.) that one could use for other scripts as
    well? (UTF puts chars in an order that does not necessarily reflect the
    'intuitive' sequence in a language.) Is there a module to tell Perl which
    sorting sequence one would like to use, or do I have to program it myself?

    Thanks for any hints.
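
    The recoding idea described above can be sketched as a sort key that is
    computed from each word while the word itself stays unchanged, so
    nothing has to be "put back" afterwards. This is only an illustration
    in Python, not an existing Perl module. The function names and the
    sample words are invented for the example, and the blanket rule that
    every 'uu' counts as 'w' is an assumption that real data may well
    violate.

```python
import unicodedata


def sort_key(word):
    """Two-level key: base letters first, diacritics only as a tie-breaker."""
    # Decompose so that every diacritic becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Primary level: drop the combining marks and treat 'uu' as 'w'
    # (assumption: 'uu' always stands for 'w' in the data).
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    primary = base.replace("uu", "w")
    # Secondary level: the decomposed spelling. Combining marks have higher
    # code points than ASCII letters, so a plain 'e' sorts before a marked 'e'.
    return (primary, decomposed)


def sort_words(words):
    """Return the words in the custom order; the originals are not modified."""
    return sorted(words, key=sort_key)
```

    With the invented sample list lera, lêra, wazzar, uuîs, uuort, zît given
    in any order, sort_words returns exactly that sequence: the plain 'e'
    precedes the marked one, and the 'uu' words fall under 'w'.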

    Henning Reetz


