Re: Corpora: a program needed - a kinda summary

From: Sampo Nevalainen
Date: Fri May 31 2002 - 09:08:53 MET DST



    I was asked for a summary of the responses I got to my request for a
    simple program that would calculate the cumulative number of types in a
    text file. So here it comes (although you'll see that I was not given the
    gift of summarising things!).

    First, a couple of links kindly provided by Paul Clough:
    Dan Melamed has a number of Perl scripts which are very useful for
    linguistic tasks:
    Another good source of Perl modules is CPAN:

    And now to the solutions I got. Not surprisingly, all the scripts were
    written in Perl, and this summary shows pretty well the abilities of this
    language as we proceed from a dozen lines to a single command line... I
    have edited the mails a little, but the scripts, of course, are intact. I
    personally do not know Perl very well (I have some programming experience
    in Basic, Turbo Pascal and C++), and I have not tested all of the following
    scripts, so I WILL NOT be responsible for any nasty things they may do on
    your puter... for example, format your hard disk ;-)

    Sebastian Hoffmann:

    open (IN, "</path/to/file") || die "can't open the file!";
    while (<IN>) {
            $line = $_;
            @words = split(/\s/, $line);
            foreach $word (@words) {
                    # remember each new word; %words holds the types seen so far
                    if (!$words{$word}) {
                            $words{$word} = 1;
                            $countDifferent++;
                    }
                    # print the running number of types after every token
                    print "$countDifferent\n";
            }
    }
    close (IN);
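
    (If I read the script right, for an input line like "the cat sat on the
    mat" it would print 1 2 3 4 4 5, one number per token, the last number
    being the total number of types seen so far.)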
    The script "assumes that you are interested in orthographic words and that
    there is always one whitespace between words". As a response to Sebastian
    Hoffmann, Klas Prytz suggests that couldn't it “be a good idea to 'chomp'
    the lines before splitting them so that not words at the end of lines are
    counted as separate words just because they have a end of line character at
    the end?” Sebastian encounters a couple of other problems with the script:
    - It doesn't distinguish between lower and upper case (which could easily
    be remedied by adding "$line=lc($line);")
    - What happens to punctuation? If you add "$line=~s/[,.;:-!?]//g;" this
    would be taken care of - but no difference is being made between sentence
    boundaries and abbreviations.
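
    For what it is worth, here is how those fixes might look when folded into
    Sebastian's script - my own untested sketch, so the same disclaimer about
    your hard disk applies ;-)

    open (IN, "</path/to/file") || die "can't open the file!";
    while (<IN>) {
            $line = $_;
            chomp($line);                 # Klas's fix: drop the end-of-line character
            $line = lc($line);            # ignore case differences
            $line =~ s/[,.;:!?-]//g;      # strip (some) punctuation
            @words = split(/\s+/, $line);
            foreach $word (@words) {
                    if (!$words{$word}) {
                            $words{$word} = 1;
                            $countDifferent++;
                    }
                    print "$countDifferent\n";
            }
    }
    close (IN);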

    Alexander Clark takes another approach to the problem:
    The tokenisation is obviously very poor. But if you run a tokenisation tool
    to put the text into a one-word-per-line format, it would work correctly.
    #!/usr/bin/perl -w

    $numberTypes = 0;
    #$/ = " ";
    while ($line = <>) {
            @words = split(' ', $line);
            foreach $word (@words) {
                    if (!exists($dict{$word})) {
                            $dict{$word} = $numberTypes++;
                    }
                    # print the cumulative type count after each token
                    print "$numberTypes\n";
            }
    }

    And a pretty similar solution from Kaarel Kaljurand:
    This is a Perl program which expects its input from STDIN, with each token
    (word) on a separate line. Each type is stored in a hash (%wordlist), so
    you might run out of memory when the input file is really huge.
    #!/usr/bin/perl -w

    use strict;
    my %wordlist = ();
    my $i = 0;
    while (<>) {
            chomp;                        # one token per line; drop the newline
            if (!defined($wordlist{$_})) {
                    $wordlist{$_} = 1;
                    $i++;                 # a new type was seen
            }
            print "$i\n";
    }

    Dave Graff also points out the problem of tokenization:
    The harder part of the problem is tokenization -- deciding what patterns
    constitute actual "types" (excluding all sorts of punctuation, normalizing
    case, deciding whether to treat hyphen-connected forms as if they were
    "space separated" or "not-space-separated", etc).

    Assume you have a suitable tokenizer for your data that simply puts out one
    word per line:
    tokenize data.file | \
    perl -pe 's/(\S+)/if(exists($t{$1})){ $t{$1} } else { $t{$1}=++$tc }/ge'

    Or more briefly, again, granting that the data is already tokenized to one
    word token per line:
    cat data.file | \
    perl -pe 's/(\S+)/exists($t{$1}) ? $t{$1}:($t{$1}=++$tc)/e'

    As for the tokenization itself, a separate Perl command line could do it:
    cat data.file | \
    perl -ne '@t=split /[_\d\W]+/;print join($/,map{lc}@t,"")'

    Substitute this bit for the "tokenize data.file" above, and you have your
    program -- if this is the correct method of tokenization for your data.
    (The output will include some blank lines, which you can ignore.) To handle
    a full ISO accented character set in the tokenizer command, change the
    split pattern "/[_\d\W]+/" to "/[^a-z\xa1-\xff]+/i".

    And finally, Daniel Walker gives another elegant one-line solution to the
    problem (I am impressed!):
    Actually, I believe the numbers are supposed to be incremented when a new
    type is encountered and otherwise stay the same: the numbers change less
    frequently towards the end of the file, and the last one printed is the
    number of different types. So, an even terser one-liner (got to love Perl)...

    $ cat file | perl -pe 's/.+/$t{$_}?$i:($t{$_}=++$i)/e'
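
    (To make the difference concrete: given the input lines a, b, a, c, Dave's
    version prints 1 2 1 3 - each token's own type number - whereas Daniel's
    prints 1 2 2 3, so the last number is always the cumulative number of
    types.)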

    Hopefully I did not miss anything. Thank you all again for your responses!


    ( : ============================ : )

    Sampo Nevalainen, FM
    University of Joensuu (Joensuun yliopisto)
    Department of International Communication
    P.O. Box 48
    57101 Savonlinna

    tel +358-15-511 70 (switchboard)
        +358-15-511 7704
    fax +358-15-515 096
