Re: Corpora: a program needed

From: Alexander Clark (asc@aclark.demon.co.uk)
Date: Thu May 30 2002 - 07:27:48 MET DST


    Something like this?

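    The tokenisation is obviously very poor: it splits on whitespace only,
    so punctuation and capitalisation produce spurious new types. But if you
    first run a tokenisation tool to put the text in one-word-per-line
    format, it will work correctly.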

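    #!/usr/bin/perl -w
    use strict;

    my $numberTypes = 0;   # number of distinct words (types) seen so far
    my %dict;              # word => index at which it was first seen
    #$/ = " ";             # disabled: would read space-delimited records
    while (my $line = <>)
    {
         # split(' ', ...) splits on runs of whitespace, ignoring leading space
         my @words = split(' ', $line);
         foreach my $word (@words){
            if (!exists($dict{$word})){
                $dict{$word} = $numberTypes++;
            }
            # one line of output per token: the cumulative type count
            print("$numberTypes\n");
         }
    }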
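
    For the one-word-per-line step, here is a minimal sketch of a
    tokeniser. It rests on assumptions that may not suit your data: it
    lowercases everything, treats any run of letters, digits and
    apostrophes as a word, and only handles plain ASCII text (accented
    characters would need locale or encoding handling). The sub name
    tokenise is just an illustrative choice.

```perl
#!/usr/bin/perl -w
use strict;

# tokenise: hypothetical helper, not part of the script above.
# Lowercases a line and returns the runs of letters, digits and
# apostrophes it contains; everything else counts as a separator.
# Assumption: ASCII input only.
sub tokenise {
    my ($line) = @_;
    return lc($line) =~ /[[:alnum:]']+/g;
}

# Print one token per line, but only when given files or piped input.
if (@ARGV or not -t STDIN) {
    while (my $line = <>) {
        print "$_\n" for tokenise($line);
    }
}
```

    Saved under hypothetical names, the two scripts could be chained as
    perl tokenise.pl text.txt | perl types.pl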

    Sampo Nevalainen wrote:

    > Dear corporal mates,
    >
    > I am in an acute need for a simple program (dos, Windows, Unix) that
    > would provide me with cumulative numbers of different words (types) as
    > it skims through a text word by word. In other words, the program should
    > print out a number for each word but increase the number only when a new
    > type is encountered. The output would be something like that:
    > 1
    > 2
    > 3
    > 4
    > 4
    > 5
    > 6
    > 6
    > 6
    > ...
    > Probably I could write this kind of program myself, but I do not have
    > time or ardour to reinvent the wheel. Maybe a simple Perl script would
    > do the trick? Thank you in advance for your support.
    >
    > yours,
    > sampo
    >
    >
    > ( : ============================================= : )
    >
    > Sampo Nevalainen, M.A.
    > Researcher
    > University of Joensuu
    > Savonlinna School of Translation Studies
    > P.O.Box 48
    > FIN-57101 Savonlinna
    > FINLAND
    >
    > tel +358-15-511 70 (operator)
    > +358-15-511 7704
    > fax +358-15-515 096
    > email samponev@cc.joensuu.fi
    > http://www.joensuu.fi/slnkvl/
    >
    >
    >

    -- 
    Alexander Clark
    asc@aclark.demon.co.uk
    http://www.issco.unige.ch/staff/clark/index.html
    ISSCO/ETI, University of Geneva
    


