Re: Corpora: kwic concordances with Perl

Noord G.J.M. van (vannoord@let.rug.nl)
Thu, 7 Oct 1999 22:15:56 +0200 (METDST)

Doug Cooper writes:
> At 14:20 7/10/99 +0000, you wrote:
> >a) will not detect multiple occurrences on a line,
> >b) nor find complex patterns across several lines
> >Can someone suggest other ways of writing simple kwic programs in Perl?
>
> Try this:
>
> #!/usr/bin/perl
> ($fileName, $string, $width) = @ARGV; # eg: kwic data "find me" 10
> open (F, "$fileName") || die "Could not open file $fileName. Bailing";
> undef $/; # eliminate the input record delimiter
> $data = <F>; # snarf in the entire file
> $string =~ s/ /\\s/g; # Let spaces match across _and print_ newlines
> #$data =~ s/\n/ /g; # Uncomment this to match/print newlines as spaces
> while ($data =~ /(.{0,$width}$string.{0,$width})/g ) { #$1 holds the match
> print "$1\n"; # print the string with 0..width characters on either side
> } #all done
>

No, this is not a good idea for large files (like corpora). It reads
the full file into memory, and you don't want that.
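
One way around that is to read the file in fixed-size chunks and keep
only a small tail of text between reads. What follows is a minimal
sketch of that idea, not a tested tool. The names kwic_stream,
$chunk_size and $max_match are made up for illustration. It assumes
that every match is non-empty and no longer than $max_match bytes, and
it prints newlines as spaces (like the commented-out line in the script
above).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: stream a KWIC concordance from filehandle $fh.
# $emit is called once per hit, with the hit plus up to $width bytes of
# context on either side. Assumption: every match is non-empty and at
# most $max_match bytes long.
sub kwic_stream {
    my ($fh, $pattern, $width, $emit, $chunk_size, $max_match) = @_;
    $chunk_size ||= 65536;   # bytes read per step
    $max_match  ||= 1024;    # assumed upper bound on the length of a match
    my $re   = qr/$pattern/;
    my $buf  = '';           # sliding window over the file
    my $from = 0;            # offset in $buf where the next search starts
    my $eof  = 0;
    until ($eof) {
        my $chunk = '';
        my $n = read($fh, $chunk, $chunk_size);
        die "Read failed: $!" unless defined $n;
        $eof = ($n == 0);
        $chunk =~ tr/\n/ /;  # match/print newlines as spaces
        $buf .= $chunk;

        my $deferred = 0;
        pos($buf) = $from;
        while ($buf =~ /$re/g) {
            my ($s, $e) = ($-[0], $+[0]);   # start and end of this match
            if (!$eof && $e + $width > length $buf) {
                # Right context not read yet: retry this match later.
                $from = $s;
                $deferred = 1;
                last;
            }
            my $start = $s > $width ? $s - $width : 0;
            $emit->(substr($buf, $start, $e + $width - $start));
            $from = $e;      # resume right after the match, not the context
        }
        unless ($deferred) {
            # A match may still begin in the last $max_match bytes.
            my $tail = length($buf) - $max_match;
            $from = $tail if $tail > $from;
        }
        # Drop everything that cannot be left context for a later match.
        my $cut = $from - $width;
        if ($cut > 0) {
            substr($buf, 0, $cut, '');
            $from -= $cut;
        }
    }
}

# Command-line use, with the same arguments as the script above.
if (@ARGV == 3) {
    my ($fileName, $string, $width) = @ARGV;
    open(my $fh, '<', $fileName)
        or die "Could not open file $fileName: $!";
    $string =~ s/ /\\s/g;    # let spaces match any whitespace
    kwic_stream($fh, $string, $width, sub { print "$_[0]\n" });
    close $fh;
}
```

Saved under a name such as kwic_stream.pl (hypothetical), it is called
like the original: perl kwic_stream.pl data "find me" 10. The window
never grows much beyond one chunk plus the retained context, so memory
use does not depend on the size of the corpus. Because the search
resumes right after each match rather than after its right context,
occurrences that lie closer together than the context width are each
reported.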

Gj