Re: [Corpora-List] grep or perl concordancer?

From: Martin Kay (kay@csli.Stanford.EDU)
Date: Fri Jul 26 2002 - 00:50:26 MET DST

  • Next message: Ruvan Weerasinghe: "Re: [Corpora-List] pronunciation"

    On Thu, 25 Jul 2002, Tony Berber Sardinha wrote:

    > Dear colleagues
    >
    > I wonder if anyone has a bit of perl or unix (grep, etc) script that can
    > generate KWIC concordances from plain text? I found some awk script for this in
    > Ken Church's Unix for Poets.
    >
    > thanks very much in advance
    >
    > cheers
    > tony.
    > -------------------------------------
    > Dr Tony Berber Sardinha
    > LAEL, PUC/SP
    > (Catholic University of Sao Paulo, Brazil)
    > tony4@uol.com.br
    > http://lael.pucsp.br/~tony
    > [New website]
    >

    You might try something like this. No guarantees given or implied!

    --Martin Kay

    --------------------------------------
    #!/usr/bin/perl
    $help_text=<<EOF;
    ------------------------------------------------------------------------
    kwic [switches] pattern files

    Search for a pattern given by a Perl regular expression in a file and
    show the matches in key-word-in-context (KWIC) format. Use a pattern
    like "/s/S+/s" to get all words. The program makes sure that there is
    a space before the first word on each line to make this easy.

    switches -l: length of left context.
             -r: length of right context.
             -i: length of id at the beginning of each line. None if the
                 switch is omitted.
             -s: sort by key and right context. No sort if switch is
                 omitted.

    The value that goes with a switch follows it with no intervening space.

    ------------------------------------------------------------------------

    EOF

    $leftlen=30;
    $rightlen=30;
    $id_width=0;
    $output=STDOUT;
    while($_=@ARGV[0], /^-/)
      { shift;
        /^-l(.*)/ && ($leftlen=$1);
        /^-r(.*)/ && ($rightlen=$1);
        /^-i(.*)/ && ($id_width=$1);
        /^-s/ && ($do_sort=1)
      }

    if(@ARGV>=1)
      { $pattern=shift @ARGV;
      if($do_sort)
        { $key_loc=$id_width+$leftlen;
          $output = open(OUT, "| sort -t '' +0.$key_loc");
          $output = OUT
        }

      while(<>)
        { chomp;
          s/\s+ / /g; # Remove multiple white space
            if($id_width)
              { $id = substr($_, 0, $id_width);
                $_ = " " . substr($_, $id_width)
              }
            $left="";
            while(/$pattern/g)
              { $left .= $`;
                $right=$&.$';
                if(($len = length($left))<$leftlen)
                  { $left=" " x ($leftlen-$len).$left; # Pad on left
                  }
                elsif($len>$leftlen) # or truncate left
                  { $left=substr($left, -$leftlen);
                  }
                if(($len=length($right)) > $rightlen) # Truncate right
                  { $right=substr($right, 0, $rightlen);
                };
                print $output "$id" if($id_width);
                print $output "$left$right\n";
                $left .= $&;
              }
          };
        if($do_sort)
          { close($output)
          }
      }
    else
      { print "Usage:\n$help_text";
      }



    This archive was generated by hypermail 2b29 : Sat Jul 27 2002 - 13:36:24 MET DST