[Corpora-List] Summary - grep or perl concordancer?

From: Tony Berber Sardinha (tony4@uol.com.br)
Date: Mon Aug 12 2002 - 16:32:53 MET DST

  • Next message: Klas Prutz: "Re: [Corpora-List] English POS tagged copus"

    Dear list members

    Thanks to everyone who responded to my query about perl /grep scripts for
    generating concordances:

    J.Joao Dias de Almeida
    Dave Graff
    Martin Kay
    Petra Maier
    Ronald P. Reck
    Serge Sharoff
    Danko Sipka
    Jérome Vachey
    Daniel Walker
    Pete Whitelock

    Daniel Walker reminded me that a similar question had already been asked on this
    list: http://www.hit.uib.no/corpora/1999-3/0375.html
    I apologize; I was unaware of that - a search in Google did not bring up any
    pointers to this previous discussion.

    Petra Maier and J.Joao Dias de Almeida suggested the Unix ptx command,
    available in Linux and Windows Cygwin as well, which generates indexed
    (concordances). (The cygwin version - for Windows - did not seem to work,
    though.)

    The other pointers and scripts appear below, separated by '=/=/=/=/=/'

    Thanks again.

    Tony.

    =/=/=/=/=/

    Danko Sipka

    Take a look at:
    http://main.amu.edu.pl/~sipkadan/lingo.htm
    You can pick up the script from that page and see how it works.

    =/=/=/=/=/

    Ronald P. Reck
    rreck@iama.rrecktek.com

    Hi maybe this will help

    see #4
    http://iama.rrecktek.com/text/

    http://iama.rrecktek.com/cgi-bin/cvsweb.cgi/rr-text/wordparse?rev=1.2&content-ty
    pe=text/x-cvsweb-markup

    =/=/=/=/=/

    Serge Sharoff

    there is a set of examples for using Perl in Do-it-yourself
    corpus studies, including KWIC concordances at:
    http://www.geocities.com/SoHo/Square/3472/program.html#scripts

    =/=/=/=/=/

    Dave Graff

    You should check out this web site:

    www.perlmonks.org

    In fact, if you or any of your students/colleagues use perl a lot, you
    should join "the monastery" (but you don't have to).

    I tried putting "key word in context" into the "search" box at the top
    of the perlmonks home page, and found a number of hits that provide code
    to solve the problem. (It turns out that someone had posed the KWIC
    task as a "golfing" challenge, where perl hackers try to come in with
    the "lowest score", based on the number of bytes of perl code needed to
    solve the problem.)

    =/=/=/=/=/

    Pete Whitelock \ Principal Research Scientist
     E-mail: pete@sharp.co.uk \ Sharp Laboratories of Europe Ltd

    Here's one that generates KWIC output for any pattern (Perl regular
    expression, inc plain text)

    #!/usr/local/bin/perl

    # Author: Pete Whitelock
    # Start Date: 15.5.94

    # simple KWIC
    $usage = q!

    Usage: kwic [-ikb] [-c int|-l int -r int] pattern (filename)

    prints pattern in $opt_c characters of context on either side
    or L,R characters of context on left or right respectively (defaults to 50)

    -i means case-insensitive
    -k means print a tab-separated initial key in output as well
    -b means print &nbsp instead of spaces for html to browser

    set -c to 0 and pipe to sus (sort|uniq -c|sort -nr) to count instances of string
    in file

    !;

    use Getopt::Std;
    getopts('hikbc:l:r:');

    die "$usage" if $opt_h;

    $printing_key = $opt_k;
    $case_insensitive = $opt_i;
    $html = $opt_b;
    $left_context = $opt_l || $opt_c || 50;
    $right_context = $opt_r || $opt_c || 50;

    $pattern = shift(@ARGV);

    if ($pattern eq '') {print "$usage\n"; exit;}

    # protect pattern
    $pattern =~ s#/#\\/#g;

    if (defined $ARGV[0]) {
      open(INPUT,$ARGV[0]) || die "Couldn't open file $ARGV[0]\n";
    }
    else {open (INPUT,"-");}

    while(<>) {
      if ($printing_key) {
        ($key) = /^([^\t]*\t)/;
        $key =~ s#//P/Corpora and Linguistic Tools/BNCTagged/./../##;
        s/^[^\t]*\t//;
      }
      else {
        $key = '';
      }
      $key =~ s/\t/ /g;
      $key_length = length($key);

      # updated this 9.12.99 to display tabs correctly
      s/\t/ /g;
      $left = '';
      $match = '';
      if ($case_insensitive) {
        while (/($pattern)/ig) {
          &do;
        }
      }
      else {
        while (/($pattern)/g) {
          &do
        }
      }
    }

    sub do {

        ($left,$match,$right) = ($`,$&,$');
        $left =~ /.{0,$left_context}$/;
        $tleft = $&;
        $pad = ' ' x ($left_context - length($tleft));
        $right =~ /^.{0,$right_context}/;

        printf "%-20s", $key if $key;
        if ($html) {
          $pad =~ s/ /&nbsp;/g;
        }
        print $pad,$tleft,$match,$&,"\n";
      }

      exit;

    =/=/=/=/=/
    Martin Kay

    You might try something like this. No guarantees given or implied!

    --------------------------------------
    #!/usr/bin/perl
    $help_text=<<EOF;
    ------------------------------------------------------------------------
    kwic [switches] pattern files

    Search for a pattern given by a Perl regular expression in a file and
    show the matches in key-word-in-context (KWIC) format. Use a pattern
    like "/s/S+/s" to get all words. The program makes sure that there is
    a space before the first word on each line to make this easy.

    switches -l: length of left context.
             -r: length of right context.
             -i: length of id at the beginning of each line. None if the
                 switch is omitted.
             -s: sort by key and right context. No sort if switch is
                 omitted.

    The value that goes with a switch follows it with no intervening space.

    ------------------------------------------------------------------------

    EOF

    $leftlen=30;
    $rightlen=30;
    $id_width=0;
    $output=STDOUT;
    while($_=@ARGV[0], /^-/)
      { shift;
        /^-l(.*)/ && ($leftlen=$1);
        /^-r(.*)/ && ($rightlen=$1);
        /^-i(.*)/ && ($id_width=$1);
        /^-s/ && ($do_sort=1)
      }

    if(@ARGV>=1)
      { $pattern=shift @ARGV;
      if($do_sort)
        { $key_loc=$id_width+$leftlen;
          $output = open(OUT, "| sort -t '' +0.$key_loc");
          $output = OUT
        }

      while(<>)
        { chomp;
          s/\s+ / /g; # Remove multiple white space
            if($id_width)
      { $id = substr($_, 0, $id_width);
                $_ = " " . substr($_, $id_width)
      }
            $left="";
            while(/$pattern/g)
              { $left .= $`;
                $right=$&.$';
                if(($len = length($left))<$leftlen)
                  { $left=" " x ($leftlen-$len).$left; # Pad on left
                  }
                elsif($len>$leftlen) # or truncate left
                  { $left=substr($left, -$leftlen);
                  }
                if(($len=length($right)) > $rightlen) # Truncate right
                  { $right=substr($right, 0, $rightlen);
        };
                print $output "$id" if($id_width);
                print $output "$left$right\n";
                $left .= $&;
              }
          };
        if($do_sort)
          { close($output)
          }
      }
    else
      { print "Usage:\n$help_text";
      }

    =/=/=/=/=/

    jérome vachey
    jvachey@free.fr

    the joined perl script is a customized version of the program
    provided by g.j.m. van noord on thu, 7 oct 1999 to corpora-list
    in answer to a similar question.

    the original script looked for one special word given as an argument.
    this version produces a concordance for all words of the text.

    to sort the words in the output apply a sort command with tab delimiter:

      perl couique2.pl mytext.txt | sort -t' ' +1

    ------

    #!/home/bin/perl -w
    # ----------------------------------------------------------
    # concordance (kwic) en perl trouvée sur internet
    # et légèrement adaptée. voir les commentaires originaux.
    # jevy, 2000-12-15
    # ----------------------------------------------------------

    # ----------------------------------------------------------
    # jevy, 2000-12-18
    # - on traite tous les mots au lieu de demander une entrée
    # - tabulation devant le mot pour faciliter tris et recherches
    # ----------------------------------------------------------
    # couique2.pl [-f W -l W -n W -r W -s expr] [files]
    # geeft _per _zin_ die met Word matcht_ de linker- en rechtercontext of Word
    # - iedere match per zin gerapporteerd
    # - alleen context binnen dezelfde zin
    # - breedte van context wordt gegeven door $opt_l en $opt_r
    # - geeft ook bestandnaam en regelnummer (breedte met $opt_n $opt_f)
    # - $opt_s bepaalt hoe einde van de zin gedefinieerd is.
    # ----------------------------------------------------------
    # Corpora May 1999 to Jun 1999: Corpora: kwic concordances with Perl
    # Corpora: kwic concordances with Perl
    # Noord G.J.M. van (vannoord@let.rug.nl)
    # Thu, 7 Oct 1999 17:04:47 +0200 (METDST)
    #
    # Christer Geisler writes:
    # > The Perl script below (adapted from Dan Malamed's 2kwic.pl) will produce
    # > kwic concordances on a match, but
    # > a) will not detect multiple occurrencences on a line,
    # > b) nor find complex patterns across several lines.
    # >
    # > Can someone suggest other ways of writing simple kwic programs in Perl?
    # > Should I split into an array, use Perl's format, etc?
    #
    # both a) and b) are treated by the script below. No warrenties!
    # Some comments are in Dutch (which is useful for some..).
    # It obtains b) by treating paragraphs at the time. It also does
    # sentence splitting which might not be what you want (exercise left
    # to the reader).
    # ----------------------------------------------------------

    use strict;
    use vars qw($opt_f $opt_h $opt_l $opt_n $opt_r $opt_s);
    use Getopt::Std;

    # assign command line options:
    getopts('f:hl:n:r:s:');

    # assign default values to options
    $opt_f = defined($opt_f) ? $opt_f : 0;
    $opt_l = defined($opt_l) ? $opt_l : 30;
    $opt_n = defined($opt_n) ? $opt_n : 0;
    $opt_r = defined($opt_r) ? $opt_r : 30;
    $opt_s ||= '[\.\?\!][\'\"]?\s';

    # there must be at least one option remaining: the file

    if ((@ARGV < 1 ) or (defined($opt_h))) { die
    "
    usage: $0 [-f w -l w -n w -r w -s expr] [files]
        -f argument determines width of file name (0 for full file name),
           default: $opt_f
           nb. file name is printed only if there is more than one input file.
        -h displays this help message
        -l argument determines width of left context,
           default: $opt_l
        -n argument determines width of line number field,
           default: $opt_n
        -r argument determines width of right context,
           default: $opt_r
        -s argument is a perl regular expression for end of sentence.
           default: $opt_s

    ";
               }

    my $report_file_name=1;
    $report_file_name=0 if @ARGV < 2;

    # any remaining arguments are file names. If more than one file name,
    # we report file name for each match.

    $/=""; # reads a paragraph at a time. This gives unexpected results on
            # dos files (more like slurp them...

    while(<>) {
      close ARGV if eof; # for $. (current record nr of input)
      foreach $_ (split $opt_s) {
        tr/\n\t\r / /s; # removes ^M, ^J, ^I
    # while (/$word/gio) { # report each match
          while(/[^\s\(\["]+/gio) { # tous les mots
          if ($report_file_name) {
     printf("%*s ",$opt_f,
            length($ARGV)>$opt_f ? substr($ARGV,-$opt_f) : $ARGV);
          }
          printf("%*s %*s\t%s %-*s\n",
          $opt_n,$.,
          $opt_l,$opt_l ? (length($`)>$opt_l ? substr($`,-$opt_l):$`): "",
          $&,
          $opt_r,substr($',0,$opt_r));
        }
      }
    }

    # ----------------------------------------------------------

    =/=/=/=/=/

    Dr Daniel Robertson
    Centre for English Language Teaching
    University of Stirling

    I have a perl script which I call "kwic" which does the job. It's
    very slow, though, and I find it best to pre-process the corpus with a
    grep type utility which I call "cgrep" (written by Richard Caley,
    HCRC, University of Edinburgh). I also use a script called "kwicsort"
    which sorts the output of "kwic" by left or right context. The coding
    is not very elegant and I'm sure the scripts could be made more
    efficient and elegant by a perl guru but they work for me. I append
    copies of both.

    cgrep
    =======================================================================
    #!/usr/bin/perl
    # usage: cgrep [-lines] pattern [files]

    $usage_msg = "Usage: cgrep [-lines] pattern [files]\n";
    $context = 1;

    if ($ARGV[0] eq "-h") {
        die $usage_msg;
    };

    # A switch to set the number of lines of context before and after

    if ($ARGV[0] =~ /^-(\d+)$/) {
        $context = $1;
        shift;
    }

    # Get the pattern and protect the delimiter.

    $pat = shift;
    $pat =~ s#/#\\/#g;

    # First line of input will be middle of array.
    # In the eval below, it will be $ary[$context].

    $_ = <>;
    push(@ary,$_);

    # Add blank lines before, more input after first line.

    for (1 .. $context) {
        unshift(@ary,'');
        $_ = <>;
        push(@ary,$_) if $_;
    }

    # Now use @ary as a silo, shifting and pushing.

    eval <<LOOP_END;
        while (\$ary[$context]) {
    if (\$ary[$context] =~ /$pat/) {
        print "------\n" if \$seq++;
        print \@ary,"\n";
    }
    \$_ = <> if \$_;
    shift(\@ary);
    push(\@ary,\$_);
        }
    LOOP_END
    ==================================================================

    kwic
    ==================================================================
    #!/usr/bin/perl
    # usage: kwic pattern infiles
    # advisable to preprocess database with cgrep

    $screen = 78; # default
    $delimiter = "\n\n"; # default

    $usage_msg = "Usage: kwic [-options] pattern file[s]
         pattern is any string (not a regular expression)
         options:
           -f print file name
           -l print line number
           -h help
           -lob input files are in LOB format
           -Cn print with effective width of context n columns
           -Dchar delimiter is char
           -Ffile patterns in file\n";

    if ($ARGV[0] =~ /\-h/) { die $usage_msg };

    # process command line switches
    while ($ARGV[0] =~ /^-/) {
        $_ = shift;
        if (/\-f/) {
    $printfilename = 1;
        } elsif (/\-lob/) {
    $lob = 1;
        } elsif (/\-l/) {
    $printlinenumber = 1;
        } elsif (/\-C(\d+)/) {
    $screen = $1; # width of context
        } elsif (/\-D(\S+)/) {
    $delimiter = $1; # pattern space delimiter
        } elsif (/\-F(\w+)/) {
    $patterns_in_file;
    $pattern_file_name = $1;
        } else {
    die "Unrecognized switch\n";
        };
    };

    if ($patterns_in_file) {
        open(PATTERNS, $pattern_file) || die "Can't find pattern file\n";
        while(<PATTERNS>) {
    chop;
    push(@patterns,$_);
        };
    };

    $pattern = shift(@ARGV);
    @P = split(//,$pattern); # array of all characters in pattern

    $startcol = int($screen/2) - int(length($pattern)/2) ;
                              # position of first letter of pattern on screen

    $* = 1; # enable multi-line patterns
    $/ = $delimiter; # enable paragraph mode
    # $/ = "------"; # delimiter introduced by cgrep

    if (!-e $ARGV[0]) { # if no input file specified on command line
        die "No input file\n";
    };

    while (@ARGV) { # with remaining argument line files
        $file = shift @ARGV; # take next command line argument
        open(IN, $file);
        while (<IN>) {
    if ($lob) {
        s/\^//g; # remove hats
                s/\|//g; # remove pipe symbols
                s/\\0//g;
                s/\*0//g;
                s/\*"/"/g;
                s/\*\*"/"/g;
                s/\*<\*\d/</g;
                s/\*>/>/g;
                s/\*//g;
                if (/^(\w\d\d)\s+(\d+)/) {
    $lob_id = $1; $lob_line = $2;
        };
                s/^(\w\d\d)\s+(\d+) (.*)$/$3/g;
    };
    s/\n/ /g; # replace new line with space
    s/\t/ /g; # replace tab with space
            s/------//g; # remove cgrep delimiter
    tr/\ //s; # replace multiple spaces with space
    @PAR1 = split(//); # character array of current par
            @PAR2 = @PAR1;
            $i = 0; $p = 0; $hit = ""; # initialize counters
            while (@PAR2) { # search string non-empty
                $ch = shift @PAR2; # take next character in para
                $p++; # augment current para position
                if ($ch ne $P[$i]) { # no match
                    $i = 0; # re-initialize pattern counter
                    $hit = ""; # re-initialize hit
                } elsif ($ch eq $P[$i]) { # if para pos matches pattern pos
                    $hit .= $ch; # add to hit
                    $i++; # increment pattern pos
                    if ($hit eq $pattern) {
        if ($printfilename) {
                            $file =~ s/^(.*)\.txt$/\1/;
            print $file, ":";
        };
                        if ($printlinenumber) {
            #printf("%3d:", ($lob_line + 2));
    printf("%3d:", $.);
        };
                $bpos = $p - $i;
                if ($bpos <= $startcol) {
                    $shr = $startcol - $bpos; # shift right
                            for ($x=1; $x<=$shr; $x++) { # print leading blanks
                                printf("%s"," ");
                            };
                            $l = $screen - $shr; # length of remainder
                            for ($x=0; $x<=$l; $x++) { # print para from beg
                                printf("%s",$PAR1[$x]); # to fill up line
                            };
                            printf("\n");
                        } elsif ($bpos > $startcol) {
                            $start = $bpos - $startcol;
                            for ($x=$start; $x<=($start+$screen); $x++) {
                                printf("%s",$PAR1[$x]);
                            };
                            printf("\n");
                        };
                    };
                };
            };
        };
        close(IN);
    };
    ============================================================================

    kwicsort
    ============================================================================
    #!/usr/bin/perl
    # usage: kwicsort -[lcr] pattern kwic.file

    $usage_msg = "Usage: kwicsort -[lcr] pattern kwic.file
        pattern is any perl regular expression
        options:
           -l : sort by left context
           -c : sort by pattern
           -r : sort by right context\n";

    # process options
    if ($ARGV[0] =~ /\-h/) {
        die $usage_msg;
    } elsif ($ARGV[0] =~ /\-l/) {
        $lsort = 1;
    } elsif ($ARGV[0] =~ /\-r/) {
        $rsort = 1;
    } elsif ($ARGV[0] =~ /\-c/) {
        $csort = 1;
    } else { die $usage_msg };

    $pattern = $ARGV[1];
    $l = length($pattern);

    open(IN, $ARGV[2]) || open(IN, '-') || die "Can't find input file\n";
    while (<IN>) {
        chop; # remove new line
        $is_kwic = 0;
        $midpoint = 36; # half length of line
        $line{$.} = $_; # preserve input
        $dollarstring = '$' x $l;
      # check that pattern is centred in line
        do {
            m/$pattern/g;
    if (pos $_ >= $midpoint) {
        $is_kwic = 1; # this instance of kwic is centred
        s/($pattern)/~\1~/g; # insert tilde round kwic
    } else { # replace temporarily with $'s
                s/$pattern/$dollarstring/;
    };
        } until $is_kwic;

        s/$dollarstring/$pattern/g; # restore kwic
        tr/A-Z/a-z/; # canonicalize to lower-case
        ($left, $target, $right) = split(/~/); # split line on tildes
        if ($lsort) { # reverse order of words in left context
            $key = join("", reverse split(/(\S+)/, $left));
        } elsif ($rsort) {
            $key = $right;
        } elsif ($csort) {
    $key = $target . $right ;
        };
        $sortkey = join('@', $key, $.);
        push(@KEYS, $sortkey);
    };
    close(IN);
    ================================================================================
    @SORTED = sort @KEYS;

    foreach $sortkey (@SORTED) {
        ($key, $l) = split(/@/, $sortkey);
        printf("%s\n", $line{$l});
    };

    =/=/=/=/=/

    cheers
    tony.
    -------------------------------------
    Dr Tony Berber Sardinha
    LAEL, PUC/SP
    (Catholic University of Sao Paulo, Brazil)
    tony4@uol.com.br
    http://lael.pucsp.br/~tony
    [New website]



    This archive was generated by hypermail 2b29 : Mon Aug 12 2002 - 16:44:47 MET DST