Re: Corpora: kwic concordances with Perl...or rather use a database?

Max Schulze (bschulze@scansoft.com)
Tue, 12 Oct 1999 12:20:50 -0400

I wouldn't use an RDBMS to store text corpora at all, but rather a
specialized text retrieval system, such as Verity Topic. Most of these
systems come with an API, usually in C. With the help of Perl's XS
interface you can wrap that API and continue to program in Perl.

Max

---
Bruno Maximilian Schulze
Software Architect
ScanSoft, Inc.
9 Centennial Drive
Peabody, MA 01960, USA
email:   bschulze@scansoft.com
phone: +1 (978) 977-2131
fax:      +1 (978) 977-2425

----- Original Message -----
From: Leidner, Jochen <jochen.leidner@sap.com>
To: 'Noord G.J.M. van' <vannoord@let.rug.nl>; Doug Cooper <doug@th.net>
Cc: <corpora@hd.uib.no>; Christer Geisler <christer.geisler@engelska.uu.se>
Sent: Friday, October 08, 1999 3:28 AM
Subject: RE: Corpora: kwic concordances with Perl...or rather use a database?

> Hi all,
>
> I suggest you use the Perl 'tie' command to keep
> any hashes in a Berkeley DB database [1] and use a
> sentence- or wordwise processing strategy. This is
> easy, as the access to your tied hashes will be
> completely transparent, i.e. you don't have to
> use special function calls once the link between
> a database file and a hash reference is established.
>
> On second thought, for large scale corpus
> processing we should start thinking about employing
> special purpose- or standard RDBMS-based systems
> (e.g. using [2]) for keeping the corpus texts.
> UNIX text tools are powerful and easy, but cannot
> cope with SGML tags easily that may contain important
> metadata or linguistic tags. Well yes, you can ignore
> them, but you can't query them. What we need is
> more research in special purpose (persistent as well
> as transient) storage of textual structures.
>
> Does anybody use RDBMS for corpus storage? I'm only
> aware of one forthcoming work at U Erlangen and
> the efforts of Gerry Knowles and co-workers at U Lancaster
> on the spoken side. I'd be interested to hear of
> others on this list.
>
> There's a techno-sociological problem: the true
> hacker usually has the programming skills to do this,
> but considers text processing boring and easy; there
> is usually also no awareness of internationalization
> issues (Unicode storage, ...). The linguist, on the
> other hand, would like to get the corpus processing
> done without having to worry and concentrate on the
> phenomena instead...
>
> Regards,
> Jochen
>
> [1] <http://www.sleepycat.com/>
> [2] <http://mysql.org/>
>
> > -----Original Message-----
> > From: Noord G.J.M. van [mailto:vannoord@let.rug.nl]
> > Sent: Thursday, October 07, 1999 10:16 PM
> > To: Doug Cooper
> > Cc: corpora@hd.uib.no; Christer Geisler
> > Subject: Re: Corpora: kwic concordances with Perl
> >
> >
> > Doug Cooper writes:
> > > At 14:20 7/10/99 +0000, you wrote:
> > > >a) will not detect multiple occurrences on a line,
> > > >b) nor find complex patterns across several lines
> > > >Can someone suggest other ways of writing simple kwic programs in Perl?
> > >
> > > Try this:
> > >
> > > #!/usr/bin/perl
> > > ($fileName, $string, $width) = @ARGV; # eg: kwic data "find me" 10
> > > open (F, "$fileName") || die "Could not open file $fileName. Bailing";
> > > undef $/; # eliminate the input record delimiter
> > > $data = <F>; # snarf in the entire file
> > > $string =~ s/ /\\s/g; # Let spaces match across _and print_ newlines
> > > #$data =~ s/\n/ /g; # Uncomment this to match/print newlines as spaces
> > > while ($data =~ /(.{0,$width}$string.{0,$width})/g ) { #$1 holds the match
> > > print "$1\n"; # print the string with 0..width characters on either side
> > > } #all done
> > >
> >
> > no, this is not a good idea for large files (like corpora). You
> > have the full file in memory; you don't want that.
> >
> > Gj
>
> --
> Jochen Leidner, M.A. <mailto:jochen.leidner@sap.com>
> Developer <http://www.sap.com/>
> Knowledge Warehouse -- All views expressed are my own.
> SAP AG, Walldorf, Germany. phone +49 (6227) 7-63773 fax +49 6227 7-73773
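
The memory objection quoted above, together with the suggestion of a wordwise processing strategy, can be illustrated with a minimal sketch. It is not the script from the thread: it matches a single keyword at word level rather than an arbitrary pattern, and it measures context in words rather than characters. The names kwic_stream and format_hit are hypothetical, chosen only for this illustration. The sketch assumes whitespace-separated tokens and ignores SGML markup entirely.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Render one hit as "left words [keyword] right words".
sub format_hit {
    my ($hit) = @_;
    return join ' ', @{ $hit->{left} }, "[$hit->{key}]", @{ $hit->{right} };
}

# Read $fh line by line and call $emit with a KWIC string for every
# occurrence of $keyword, using up to $n words of context per side.
# Only the last $n words and the hits still awaiting right context
# are held in memory, never the whole file.
sub kwic_stream {
    my ($fh, $keyword, $n, $emit) = @_;
    my @left;       # sliding window of the last $n words
    my @pending;    # hits whose right context is still incomplete
    while (my $line = <$fh>) {
        for my $word (split ' ', $line) {
            # Supply this word as right context to every open hit.
            for my $hit (@pending) {
                push @{ $hit->{right} }, $word if @{ $hit->{right} } < $n;
            }
            # The oldest hit always fills up first, so flush from the front.
            while (@pending && @{ $pending[0]{right} } >= $n) {
                $emit->(format_hit(shift @pending));
            }
            # Compare case-insensitively, ignoring surrounding punctuation.
            (my $bare = lc $word) =~ s/^\W+|\W+$//g;
            if ($bare eq lc $keyword) {
                push @pending, { left => [@left], key => $word, right => [] };
            }
            push @left, $word;
            shift @left if @left > $n;
        }
    }
    # At end of input, emit hits that never got a full right context.
    $emit->(format_hit($_)) for @pending;
}

# Demonstration on an in-memory file handle (made-up sample text).
my $text = "the quick brown fox\njumps over the lazy dog. The end\n";
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
my @lines;
kwic_stream($fh, 'the', 2, sub { push @lines, $_[0] });
close $fh;
print "$_\n" for @lines;
```

Because the window is kept across line reads, several occurrences on one line are all reported and context may span line breaks. Matching multi-word or regular-expression patterns would need a character-level buffer instead, which this sketch does not attempt.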