Max
--
Bruno Maximilian Schulze
Software Architect
ScanSoft, Inc.
9 Centennial Drive
Peabody, MA 01960, USA
email: bschulze@scansoft.com
phone: +1 (978) 977-2131
fax: +1 (978) 977-2425
----- Original Message -----
From: Leidner, Jochen <jochen.leidner@sap.com>
To: 'Noord G.J.M. van' <vannoord@let.rug.nl>; Doug Cooper <doug@th.net>
Cc: <corpora@hd.uib.no>; Christer Geisler <christer.geisler@engelska.uu.se>
Sent: Friday, October 08, 1999 3:28 AM
Subject: RE: Corpora: kwic concordances with Perl...or rather use a database?
> Hi all,
>
> I suggest you use the Perl 'tie' command to keep
> any hashes in a Berkeley DB database [1] and use a
> sentence- or word-wise processing strategy. This is
> easy, as the access to your tied hashes will be
> completely transparent, i.e. you don't have to
> use special function calls once the link between
> a database file and a hash reference is established.
>
> On second thought, for large-scale corpus
> processing we should start thinking about employing
> special-purpose or standard RDBMS-based systems
> (e.g. using [2]) for keeping the corpus texts.
> UNIX text tools are powerful and easy, but cannot
> easily cope with SGML tags, which may contain important
> metadata or linguistic tags. Well yes, you can ignore
> them, but you can't query them. What we need is
> more research in special-purpose (persistent as well
> as transient) storage of textual structures.
>
> Does anybody use an RDBMS for corpus storage? I'm only
> aware of one forthcoming work at U Erlangen and
> the efforts of Gerry Knowles and co-workers at U Lancaster
> on the spoken side. I'd be interested to hear of
> others on this list.
>
> There's a techno-sociological problem: the true
> hacker usually has the programming skills to do this,
> but considers text processing boring and easy; there
> is usually also no awareness of internationalization
> issues (Unicode storage, ...). The linguist, on the
> other hand, would like to get the corpus processing
> done without having to worry, and concentrate on the
> phenomena instead...
>
> Regards,
> Jochen
>
> [1] <http://www.sleepycat.com/>
> [2] <http://mysql.org/>
>
> > -----Original Message-----
> > From: Noord G.J.M. van [mailto:vannoord@let.rug.nl]
> > Sent: Thursday, October 07, 1999 10:16 PM
> > To: Doug Cooper
> > Cc: corpora@hd.uib.no; Christer Geisler
> > Subject: Re: Corpora: kwic concordances with Perl
> >
> > Doug Cooper writes:
> > > At 14:20 7/10/99 +0000, you wrote:
> > > > a) will not detect multiple occurrences on a line,
> > > > b) nor find complex patterns across several lines
> > > > Can someone suggest other ways of writing simple kwic programs in Perl?
> > >
> > > Try this:
> > >
> > > #!/usr/bin/perl
> > > ($fileName, $string, $width) = @ARGV; # eg: kwic data "find me" 10
> > > open (F, "$fileName") || die "Could not open file $fileName. Bailing";
> > > undef $/; # eliminate the input record delimiter
> > > $data = <F>; # snarf in the entire file
> > > $string =~ s/ /\\s/g; # Let spaces match across _and print_ newlines
> > > #$data =~ s/\n/ /g; # Uncomment this to match/print newlines as spaces
> > > while ($data =~ /(.{0,$width}$string.{0,$width})/g ) { # $1 holds the match
> > > print "$1\n"; # print the string with 0..width characters on either side
> > > } # all done
> >
> > No, this is not a good idea for large files (like corpora). You
> > have the full file in memory; you don't want that.
> >
> > Gj
>
> --
> Jochen Leidner, M.A. <mailto:jochen.leidner@sap.com>
> Developer, Knowledge Warehouse <http://www.sap.com/>
> SAP AG, Walldorf, Germany.
> phone +49 (6227) 7-63773, fax +49 (6227) 7-73773
> All views expressed are my own.
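For concreteness, the tied-hash setup Jochen describes might look like the sketch below. He suggests Berkeley DB via the DB_File module; this version uses the core SDBM_File module instead so it runs with a stock Perl -- the tie interface is the same shape, only the module name differs (with DB_File the call would be `tie %freq, 'DB_File', 'freq.db', O_RDWR|O_CREAT, 0666, $DB_HASH;`). The file name and sample text are illustrative, not from the thread.

```perl
#!/usr/bin/perl
# Minimal sketch of the tied-hash approach: once tied, every read
# and write of %freq goes to the database file transparently, with
# no special function calls, exactly as described in the mail.
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;

my $dbfile = "/tmp/freqdb_$$";        # illustrative; SDBM creates $dbfile.pag/.dir
my %freq;
tie %freq, 'SDBM_File', $dbfile, O_RDWR | O_CREAT, 0666
    or die "Cannot tie $dbfile: $!";

# Word-wise processing over an in-memory sample "corpus"; in real use
# this would read the corpus file line by line.
open my $text, '<', \"The cat sat on the mat\nThe mat sat still\n"
    or die "open: $!";
while (my $line = <$text>) {
    $freq{lc $1}++ while $line =~ /(\w+)/g;   # counts land in the DB file
}

print "$_: $freq{$_}\n" for sort keys %freq;

untie %freq;                          # flush and close the database
```

Because the counts live in the database file, a later run (or a re-tie) sees them again without re-reading the corpus.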
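Gj's objection -- that slurping the whole file into memory does not scale to corpora -- can be met with a rolling buffer: read line by line and keep only as much trailing text as a future match could still need. The sketch below is one way to do that; the function name and the buffer-size heuristic are my own illustration, not from the thread.

```perl
#!/usr/bin/perl
# Memory-friendly kwic sketch: same regex idea as the script quoted
# above, but the buffer is trimmed after each line so memory use stays
# bounded by $width and the pattern length, not by the file size.
use strict;
use warnings;

sub kwic {
    my ($fh, $string, $width) = @_;
    $string =~ s/ /\\s/g;             # let spaces in the pattern match newlines
    my @hits;
    my $buf = '';
    while (my $line = <$fh>) {
        $buf .= $line;
        my $scanned = 0;
        while ($buf =~ /(.{0,$width}$string.{0,$width})/gs) {
            (my $hit = $1) =~ s/\n/ /g;   # print newlines as spaces
            push @hits, $hit;
            $scanned = pos($buf);         # nothing before here can match again
        }
        # Keep only the tail a future match could need: $width of left
        # context plus a possibly partial occurrence of the pattern.
        my $cut = length($buf) - ($width + length($string));
        $cut = $scanned if $scanned > $cut;
        substr($buf, 0, $cut) = '' if $cut > 0;
    }
    return @hits;
}

# e.g.: kwic.pl data "find me" 10
if (@ARGV) {
    my ($fileName, $string, $width) = @ARGV;
    open my $fh, '<', $fileName or die "Could not open $fileName: $!";
    print "$_\n" for kwic($fh, $string, $width);
    close $fh;
}
```

This still finds multiple occurrences per line and patterns that straddle line breaks, but only ever holds a pattern-sized window of the corpus in memory.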