Corpora: Tgrep2

From: Douglas Rohde (dr+@cs.cmu.edu)
Date: Wed May 23 2001 - 18:20:38 MET DST


    The readers of this list may be interested in a new tool, tgrep2, that I
    have developed for searching parsed corpora such as those included in
    the Penn Treebank.

    As the name might suggest, tgrep2 is based on tgrep and is largely
    backward compatible. However, tgrep2 adds a number of new features,
    including the following major enhancements:

     * Rather than simply having a set of required relationships and a set of
       prohibited relationships, nodes can have full boolean expressions of
       relationships to other nodes.
     * Nodes can be given unique labels and may then be referred to by those
       labels in the pattern specification or in selecting trees for printing.
     * Patterns are no longer restricted to simple tree architectures. The use
       of node labels and segmented patterns allows links in a pattern to form
       back-edges as well, permitting cycles of links.
     * Customizable output formats allow a variety of information to be
       reported in a flexible manner.
     * Multiple search patterns may be specified and one can retrieve the
       first subtree matching any pattern, the first subtree matching each
       pattern, or all subtrees matching all patterns.
     * Subtrees can be reported using a code rather than by printing the whole
       structure. The trees themselves can later be retrieved using the codes.
     * A variety of new links have been added and the immediately-precedes
       link now has a more conventional meaning.
     * Tgrep2 corpus files are substantially smaller than tgrep corpora.
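
To make the first enhancement concrete, here is a rough Python sketch of what
"full boolean expressions of relationships" means for a node in a parse tree.
This is not tgrep2's pattern syntax or its implementation; the Node class and
the helper names (has_child, has_descendant, all_of, any_of, negate, search)
are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class Node:
    """A parse-tree node (hypothetical structure, not tgrep2's own)."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def subtrees(self) -> Iterator["Node"]:
        """Yield this node and every node below it, in pre-order."""
        yield self
        for child in self.children:
            yield from child.subtrees()


Predicate = Callable[[Node], bool]


def has_child(label: str) -> Predicate:
    """True if the node immediately dominates a node with this label."""
    return lambda node: any(c.label == label for c in node.children)


def has_descendant(label: str) -> Predicate:
    """True if the node dominates a node with this label at any depth."""
    return lambda node: any(
        d.label == label for c in node.children for d in c.subtrees()
    )


def all_of(*preds: Predicate) -> Predicate:
    return lambda node: all(p(node) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    return lambda node: any(p(node) for p in preds)


def negate(pred: Predicate) -> Predicate:
    return lambda node: not pred(node)


def search(root: Node, label: str, condition: Predicate) -> List[Node]:
    """Return every subtree with the given label that meets the condition."""
    return [n for n in root.subtrees() if n.label == label and condition(n)]


# A small hand-built tree (words omitted, category labels only).
tree = Node("S", [
    Node("NP", [Node("DT"), Node("NN")]),
    Node("VP", [
        Node("VBD"),
        Node("NP", [
            Node("DT"),
            Node("NNS"),
            Node("PP", [Node("IN"), Node("NP", [Node("NN")])]),
        ]),
    ]),
])

# NPs that have an NN or NNS child AND do not dominate a PP.
condition = all_of(
    any_of(has_child("NN"), has_child("NNS")),
    negate(has_descendant("PP")),
)
matches = search(tree, "NP", condition)
```

With only a fixed set of required and a fixed set of prohibited relationships,
the "NN or NNS" alternative above could not be stated in a single pattern;
nesting and, or, and not over relationships is what makes it expressible.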

    More information and the tgrep2 software can be found at the following
    site:

    http://www.cs.cmu.edu/~dr/Tgrep2/

    Doug Rohde
    Carnegie Mellon University


