Re: Corpora: a program needed

From: sebhoff@es.unizh.ch
Date: Thu May 30 2002 - 10:55:05 MET DST

    Hi,

    Maybe I am missing something important here, but wouldn't it be a good idea
    to 'chomp' the lines before splitting them, so that words at the end of
    lines are not counted as separate words just because they have an
    end-of-line character at the end?

    Regards

    Klas Prytz
    Institutionen för lingvistik
    Uppsala universitet

    Yes - indeed. I had forgotten about that.

    There are further problems with the script:

    - It treats upper- and lower-case forms as different words, so "The" and
    "the" are counted as separate types.
    This could easily be remedied by adding "$line=lc($line);"

    - What happens to punctuation? Usually, there is no space between the actual
    word and the punctuation mark, so in the sentence "Something is missing.", there
    would be a new type "missing." which isn't the same as "missing" in the middle
    of a sentence...
    If you add "$line=~s/[,.;:!?-]//g;" this would be taken care of - but it
    makes no distinction between sentence-final full stops and those in
    abbreviations.
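
    Put together, the three fixes (chomp, lower-casing, stripping punctuation)
    might look roughly like the sketch below. The original script is not
    quoted in this message, so the sub name tokenize_line, the %freq hash and
    the two sample lines are made up for illustration only. Note that the
    hyphen sits at the end of the character class so that Perl reads it as a
    literal character rather than as part of a range.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): normalise one line
# and return its word tokens.
sub tokenize_line {
    my ($line) = @_;
    chomp $line;                 # drop the trailing end-of-line character
    $line = lc $line;            # fold case so "The" and "the" are one type
    $line =~ s/[,.;:!?-]//g;     # delete punctuation; "-" is last, so literal
    # Split on whitespace and discard any empty leading field.
    return grep { length } split /\s+/, $line;
}

# Count word types over two made-up sample lines.
my %freq;
for my $line ("Something is missing.\n", "Is something MISSING\n") {
    $freq{$_}++ for tokenize_line($line);
}

# Print each type with its frequency, sorted alphabetically.
printf "%-10s %d\n", $_, $freq{$_} for sort keys %freq;
```

    With these two sample lines, "is", "missing" and "something" should each
    be counted twice. Deleting hyphens also turns a word like "well-known"
    into "wellknown", and "e.g." into "eg", which may or may not be what is
    wanted.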

    I'm sure someone will point out a few other problems... ;-)
    Best,
    Sebastian

    -- 
    

    Sebastian Hoffmann
    Englisches Seminar der Univ. Zürich
    Plattenstrasse 47
    CH-8032 Zürich
    Tel: +41-1-634 3551
    Fax: +41-1-634 4908


