Re: Corpora: a program needed

From: sebhoff@es.unizh.ch
Date: Thu May 30 2002 - 10:55:05 MET DST

Next message: Sampo Nevalainen: "Re: Corpora: a program needed"

Previous message: sebhoff@es.unizh.ch: "Re: Corpora: a program needed"
Maybe in reply to: Sampo Nevalainen: "Corpora: a program needed"
Next in thread: Sampo Nevalainen: "Re: Corpora: a program needed"

Hi,

Maybe I am missing something important here, but wouldn't it be a good idea
to 'chomp' the lines before splitting them, so that words at the end of
lines are not counted as separate words just because they have an
end-of-line character at the end?

Regards

Klas Prytz
Institutionen för lingvistik
Uppsala universitet

Yes - indeed. I had forgotten about that.

There are further problems with the script:

- It doesn't normalize case, so lower- and upper-case forms of the same word
are counted as different types.
This could easily be remedied by adding "$line=lc($line);"

- What happens to punctuation? Usually, there is no space between the actual
word and punctuation marks, so in the sentence "Something is missing.", there
would be a new type "missing." which isn't the same as "missing" in the middle
of a sentence...
If you add "$line=~s/[,.;:!?-]//g;" this would be taken care of (the hyphen
has to come last in the character class, otherwise ":-!" is read as an invalid
range) - but no distinction is made between sentence boundaries and
abbreviations.
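
Put together, the three fixes (chomp, lower-casing, stripping punctuation) might look roughly like the sketch below. This is not the script from the thread: the helper name tokenize_line, the sample lines and the split on single spaces are assumptions made for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): clean up one line
# and return the words it contains.
sub tokenize_line {
    my ($line) = @_;
    chomp $line;                 # drop the trailing end-of-line character
    $line = lc $line;            # fold case so "The" and "the" are one type
    $line =~ s/[,.;:!?-]//g;     # strip punctuation; '-' comes last so it is literal
    # Split on single spaces (an assumption) and skip empty fields.
    return grep { length } split / /, $line;
}

# Count word types over two made-up sample lines.
my %count;
for my $line ("Something is missing.\n", "Is something MISSING?\n") {
    $count{$_}++ for tokenize_line($line);
}

# Print each type with its frequency, sorted alphabetically.
print "$_\t$count{$_}\n" for sort keys %count;
```

With these two sample lines, "is", "missing" and "something" each come out with a count of 2. Note that removing the hyphen also glues hyphenated words together, and that the periods of abbreviations are deleted just like sentence-final ones.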

I'm sure someone will point out a few other problems... ;-)
Best,
Sebastian

Sebastian Hoffmann
Englisches Seminar der Univ. Zürich
Plattenstrasse 47
CH-8032 Zürich
Tel: +41-1-634 3551
Fax: +41-1-634 4908

This archive was generated by hypermail 2b29 : Thu May 30 2002 - 10:55:08 MET DST