Re: Corpora: kwic concordances with Perl

Doug Cooper (doug@th.net)
Fri, 08 Oct 1999 15:06:50 +0700

At 22:15 7/10/99 +0200, Noord G.J.M. van wrote:
>no, this is not a good idea for large files (like corpora). You
>have the full file in memory; you don't want that.

Oh, if it's spectacularly big you can just pick an embedded separator
tag that never occurs in the text itself (e.g. </END-OF-BOOK>) and use
it as the record separator, to cut the memory footprint:

# undef $/;                   # don't slurp the whole file; instead ...
$/ = 'some separator tag';    # reset the input record separator
# $data = <F>;                # don't read it all at once; instead ...
while ($data = <F>) {         # snarf in one record, then
    while (...) {             # match within $data, as before
        ...
    }
}                             # done with this record

At 09:28 8/10/99 +0200, <jochen.leidner@sap.com> wrote:
>There's a techno-sociological problem: the true hacker usually has
>the programming skills to do this, but considers text processing boring
>and easy; there is usually also no awareness of internationalization issues.

Au contraire, mon frère! I might as easily say that the linguist expects
the programmer not only to understand his problems, but to do his
work for him as well ;-).

But before this goes further, I would be tempted to ask both the
original poster, Christer Geisler, and Jochen Leidner just how big
their respective datasets are. I'm also curious whether anybody
might hazard a guess at the pace of increase in typical 'big corpus'
size, if there is such an animal.

Yours in avoiding creeping featurism,
Doug

__________________________________________________
1425 VP Tower, 21/45 Soi Chawakun
Rangnam Road, Rajthevi, Bangkok, 10400
doug@th.net (662) 246-8946 fax (662) 246-8789

Southeast Asian Software Research Center, Bangkok
http://seasrc.th.net --> SEASRC Web site
http://seasrc.th.net/sealang --> SEALANG Web site