Re: Corpora: a program needed

From: sebhoff@es.unizh.ch
Date: Thu May 30 2002 - 09:52:11 MET DST

    <pre>
    Dear corporal mates,

    I am in acute need of a simple program (DOS, Windows, Unix) that would
    provide me with cumulative numbers of different words (types) as it skims
    through a text word by word. In other words, the program should print out a
    number for each word but increase the number only when a new type is
    encountered. The output would be something like this:
    1
    2
    3
    4
    4
    5
    6
    6
    6
    ...
    Probably I could write this kind of program myself, but I do not have time
    or ardour to reinvent the wheel. Maybe a simple Perl script would do the
    trick? Thank you in advance for your support.

    yours,
    sampo

    </pre>

    How about this:

    ---------------
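    #!/usr/bin/perl
    use strict;
    use warnings;

    my $countDifferent = 0;   # number of distinct word types seen so far
    my %seen;                 # word types already encountered

    open(my $in, '<', '/path/to/file') or die "can't open the file: $!";

    while (my $line = <$in>) {
            # Split on single whitespace characters; split drops the
            # trailing empty field left by the newline.
            my @words = split(/\s/, $line);
            foreach my $word (@words) {
                    # Increase the count only when a new type is encountered.
                    if (!$seen{$word}) {
                            $countDifferent++;
                            $seen{$word} = 1;
                    }
                    # Print the cumulative type count once per word.
                    print "$countDifferent\n";
            }
    }
    close($in);

    exit(0);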
    ---------------

    It's primitive, but it does what you want.
    It assumes that you are interested in orthographic words (so "Word",
    "word" and "word," count as three different types) and that there is
    always exactly one whitespace character between words.
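
    If those assumptions do not hold for your text, something along the
    following lines might serve as a starting point. It is only a sketch:
    the subroutine name cumulative_types, the lowercasing and the stripping
    of leading and trailing punctuation are my own assumptions about what
    should count as the same type, and the sample sentences are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): returns the cumulative
# number of distinct word types after each word in the given lines.
sub cumulative_types {
    my @lines = @_;
    my %seen;          # types encountered so far
    my $count = 0;     # number of distinct types
    my @counts;        # one entry per word

    for my $line (@lines) {
        # split ' ' skips leading whitespace and treats any run of
        # whitespace as a single separator.
        for my $token (split ' ', $line) {
            # Assumed normalisation: lowercase and strip punctuation
            # from both ends of the token.
            my $type = lc $token;
            $type =~ s/^[[:punct:]]+//;
            $type =~ s/[[:punct:]]+$//;
            next if $type eq '';    # token was punctuation only

            # Count the type only the first time it is seen.
            $count++ unless $seen{$type}++;
            push @counts, $count;
        }
    }
    return @counts;
}

# Made-up sample input with doubled and leading spaces.
print "$_\n" for cumulative_types("The cat  saw the dog.", "  The dog saw a cat!");
```

    For the two sample lines this prints 1 2 3 3 4 4 4 4 5 5, one number
    per line. To run it on a real text, pass the lines read from your file
    to the subroutine instead of the sample strings.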

    Best,
    Sebastian

    -- 
    

    Sebastian Hoffmann
    Englisches Seminar der Univ. Zürich
    Plattenstrasse 47
    CH-8032 Zürich
    Tel: +41-1-634 3551
    Fax: +41-1-634 4908


