Re: Corpora: kwic concordances with Perl

Fredrik Olsson (fredriko@sics.se)
Fri, 8 Oct 1999 10:23:17 +0200 (MET DST)

Hi all!

On Fri, 8 Oct 1999, Doug Cooper wrote:
<snip,snip>
> while ($data =~ /(.{0,$width}$string.{0,$width})/g ) { #$1 holds the match

Occurences of $string within $width range to the right will not be
captured by this expression. You'll probably need something like this:

(.{0,$CONTEXT}$PATTERN)(?=(.{0,$CONTEXT}))

i.e., a zero-width positive lookahead assertion to pick up matching where
$string left off instead of $width characters to the right of $string.
Here's a script that implements this approach.

#!/usr/bin/perl -w

use strict;
use vars qw($PATTERN $WIDTH $CONTEXT $FILE);

die ("\n\t$0: <file> <string> <outputwidth>\n") unless scalar @ARGV > 2;
$FILE = shift @ARGV;
$PATTERN = quotemeta(shift @ARGV);
$WIDTH = shift @ARGV;
warn ("$0: ignoring @ARGV\n") if @ARGV;

open(FH, $FILE) or die "** $0 couldn't open $FILE:$!\n";
$CONTEXT = int(($WIDTH - length($PATTERN))/2);
$/ = '';

my(@tmp, $build, $prev);

while(<FH>) {
$_ =~ s/\n//g;
push @tmp, ($_ =~ /(.{0,$CONTEXT}$PATTERN)(?=(.{0,$CONTEXT}))/gio);

foreach (@tmp) {
if($build) {
($prev = " " x ($CONTEXT - length($prev) + 3) . $prev)
unless length($prev) == $CONTEXT;
print $prev . $_, "\n";
$build = 0;
} else {
$prev = $_;
$build++;
}
}
@tmp = ();

}
close(FH);

exit;

Regards,

Fredrik Olsson
_________________________________________________________________________
Fredrik Olsson SICS, the Information and Language Engineering Group

SICS http: www.sics.se/~fredriko fax: +46 (0)8 751 72 30
Box 1263 smtp: fredriko@sics.se phone: +46 (0)8 633 15 32
SE-164 29 Kista
SWEDEN