Re: Corpora: a program needed - a kinda summary

From: Sampo Nevalainen
Date: Fri May 31 2002 - 09:08:53 MET DST



    I was asked for a summary of the responses I got to my request for a
    simple program that would calculate the cumulative number of types in a
    text file. So here it comes (although you'll see that I was not given the
    gift of summarising things!).

    First, a couple of links kindly provided by Paul Clough:
    Dan Melamed has a number of Perl scripts which are very useful for
    linguistic tasks:
    Another good source of Perl modules is CPAN:

    And now to the solutions I got. Not surprisingly, all the scripts were
    written in Perl, and this summary shows pretty well the abilities of this
    language as we proceed from a dozen lines to a single command line... I
    have edited the mails a little, but the scripts, of course, are intact. I
    personally do not know Perl very well (I have some programming experience
    in Basic, Turbo Pascal and C++), and I have not tested all of the following
    scripts, so I WILL NOT be responsible for any nasty things they may do on
    your puter... for example, format your hard disk ;-)

    Sebastian Hoffmann:

    open (IN, "</path/to/file") || die "can't open the file!";
    while (<IN>) {
            $line = $_;
            @words = split(/\s/, $line);
            foreach $word (@words) {
                    # remember each new word; %words holds the types seen so far
                    if (!$words{$word}) {
                            $words{$word} = 1;
                            $countDifferent++;
                    }
                    # print the running number of types after every token
                    print "$countDifferent\n";
            }
    }
    close (IN);
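
    (If I read the script right, for an input line like "the cat sat on the
    mat" it would print 1 2 3 4 4 5, one number per token, the last number
    being the total number of types seen so far.)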
    The script "assumes that you are interested in orthographic words and that
    there is always one whitespace between words". As a response to Sebastian
    Hoffmann, Klas Prytz suggests that couldn't it “be a good idea to 'chomp'
    the lines before splitting them so that not words at the end of lines are
    counted as separate words just because they have a end of line character at
    the end?” Sebastian encounters a couple of other problems with the script:
    - It doesn't distinguish between lower and upper case (which could easily
    be remedied by adding "$line=lc($line);")
    - What happens to punctuation? If you add "$line=~s/[,.;:-!?]//g;" this
    would be taken care of - but no difference is being made between sentence
    boundaries and abbreviations.
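
    For what it is worth, here is how those fixes might look when folded into
    Sebastian's script - my own untested sketch, so the same disclaimer about
    your hard disk applies ;-)

    open (IN, "</path/to/file") || die "can't open the file!";
    while (<IN>) {
            $line = $_;
            chomp($line);                 # Klas's fix: drop the end-of-line character
            $line = lc($line);            # ignore case differences
            $line =~ s/[,.;:!?-]//g;      # strip (some) punctuation
            @words = split(/\s+/, $line);
            foreach $word (@words) {
                    if (!$words{$word}) {
                            $words{$word} = 1;
                            $countDifferent++;
                    }
                    print "$countDifferent\n";
            }
    }
    close (IN);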

    Alexander Clark takes another approach to the problem:
    The tokenisation is obviously very poor. But if you run a tokenisation tool
    to put the text into a one-word-per-line format, it would work correctly.
    #!/usr/bin/perl -w

    $numberTypes = 0;
    #$/ = " ";
    while ($line = <>) {
            @words = split(' ', $line);
            foreach $word (@words) {
                    if (!exists($dict{$word})) {
                            $dict{$word} = $numberTypes++;
                    }
                    # print the cumulative type count after each token
                    print "$numberTypes\n";
            }
    }

    And a pretty similar solution from Kaarel Kaljurand:
    This is a Perl program which expects its input from STDIN, with each token
    (word) on a separate line. Each type is stored in a hash (%wordlist), so
    you might run out of memory when the input file is really huge.
    #!/usr/bin/perl -w

    use strict;
    my %wordlist = ();
    my $i = 0;
    while (<>) {
            chomp;                        # one token per line; drop the newline
            if (!defined($wordlist{$_})) {
                    $wordlist{$_} = 1;
                    $i++;                 # a new type was seen
            }
            print "$i\n";
    }

    Dave Graff also points out the problem of tokenization:
    The harder part of the problem is tokenization -- deciding what patterns
    constitute actual "types" (excluding all sorts of punctuation, normalizing
    case, deciding whether to treat hyphen-connected forms as if they were
    "space separated" or "not-space-separated", etc).

    Assume you have a suitable tokenizer for your data that simply puts out one
    word per line:
    tokenize data.file | \
    perl -pe 's/(\S+)/if(exists($t{$1})){ $t{$1} } else { $t{$1}=++$tc }/ge'

    Or more briefly, again, granting that the data is already tokenized to one
    word token per line:
    cat data.file | \
    perl -pe 's/(\S+)/exists($t{$1}) ? $t{$1}:($t{$1}=++$tc)/e'

    As for the tokenization itself, a separate Perl command line could do it:
    cat data.file | \
    perl -ne '@t=split /[_\d\W]+/;print join($/,map{lc}@t,"")'

    Substitute this bit for the "tokenize data.file" above, and you have your
    program -- if this is the correct method of tokenization for your data.
    (The output will include some blank lines, which you can ignore.) To handle
    a full ISO accented character set in the tokenizer command, change the
    split pattern "/[_\d\W]+/" to "/[^a-z\xa1-\xff]+/i".

    And finally, Daniel Walker gives another elegant one-line solution to the
    problem (I am impressed!):
    Actually, I believe the numbers are supposed to be incremented when a new
    type is encountered and otherwise stay the same: the numbers change less
    frequently towards the end of the file, and the last one printed is the
    number of different types. So, an even terser one-liner (got to love Perl)...

    $ cat file | perl -pe 's/.+/$t{$_}?$i:($t{$_}=++$i)/e'
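
    (To make the difference concrete: given the input lines a, b, a, c, Dave's
    version prints 1 2 1 3 - each token's own type number - whereas Daniel's
    prints 1 2 2 3, so the last number is always the cumulative number of
    types.)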

    Hopefully I did not miss anything. Thank you all again for your responses!


    ( : ============================ : )

    Sampo Nevalainen, FM
    University of Joensuu (Joensuun yliopisto)
    Department of International Communication
    P.O. Box 48
    57101 Savonlinna

    tel +358-15-511 70 (switchboard)
        +358-15-511 7704
    fax +358-15-515 096
