[Corpora-List] New Ngram package in Perl

From: Vlado Keselj (vlado@cs.dal.ca)
Date: Fri Jun 06 2003 - 22:17:30 MET DST

    Text::Ngrams - a new Perl package for n-gram analysis, is made
    available at the site:

      http://www.cs.dal.ca/~vlado/srcperl/Ngrams

    and it will soon be indexed by CPAN (www.cpan.org).

    It is a small and flexible piece of code that comes with a script
    ngrams.pl for direct processing of files.

    I am aware that this is `yet another' n-gram package, but it is novel in
    some ways. References to other packages are included.

    The man pages for the script and the module are included below.

    Vlado

    -------

    SYNOPSIS
             ngrams.pl [--version] [--help] [--n=3] [--type=character] [--orderbyfrequency] [input files]

    DESCRIPTION
           This script writes n-gram tables of the input files to
           the standard output.

           Options:

           --version
                  Prints version.

           --help Prints help.

           --n=NUMBER
                  N-gram size, produces 3-grams by default.

           --type=character|byte|word
                  Type of n-grams produced. See Text::Ngrams module.

           --orderbyfrequency
                  By default, the n-grams are ordered lexicographically.
                  If this option is specified, then they are ordered by
                  frequency in descending order.

    PREREQUISITES
           Text::Ngrams, Getopt::Long

    SCRIPT CATEGORIES
           Text::Statistics

    SEE ALSO
           Text::Ngrams module.

    COPYRIGHT
           Copyright 2003 Vlado Keselj http://www.cs.dal.ca/~vlado

           This module is provided "as is" without expressed or
           implied warranty. This is free software; you can
           redistribute it and/or modify it under the same terms as
           Perl itself.

           The latest version can be found at
           http://www.cs.dal.ca/~vlado/srcperl/.

    ------------------------------------------------------------------------

    NAME
           Text::Ngrams - Flexible Ngram analysis (for characters,
           words, and more)

    SYNOPSIS
           For default character n-gram analysis of string:

             use Text::Ngrams;
             my $ng3 = Text::Ngrams->new;
             $ng3->process_text('abcdefg1235678hijklmnop');
             print $ng3->to_string;

           One can also feed tokens manually:

             use Text::Ngrams;
             my $ng3 = Text::Ngrams->new;
             $ng3->feed_tokens('a');
             $ng3->feed_tokens('b');
             $ng3->feed_tokens('c');
             $ng3->feed_tokens('d');
             $ng3->feed_tokens('e');
             $ng3->feed_tokens('f');
             $ng3->feed_tokens('g');
             $ng3->feed_tokens('h');

           We can choose n-grams of various sizes, e.g.:

             my $ng = Text::Ngrams->new( windowsize => 6 );

           or different types of n-grams, e.g.:

             my $ng = Text::Ngrams->new( type => 'byte' );
             my $ng = Text::Ngrams->new( type => 'word' );

    DESCRIPTION
           This module implements text n-gram analysis, supporting
           several types of analysis, including character and word
           n-grams.

           The module Text::Ngrams is very flexible. For example, it
           allows a user to manually feed a sequence of any tokens.
           It handles several types of tokens (character, word), and
           also allows a lot of flexibility in the automatic
           recognition and feeding of tokens and in the way they are
           combined into an n-gram. It counts all n-gram frequencies
           up to the maximal specified length. The output format is
           meant to be human-readable, while also loadable by the
           module.
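
           As an illustration of counting "all n-gram frequencies up
           to the maximal specified length", here is a minimal sketch
           in plain Perl. It does not use Text::Ngrams and is not the
           module's implementation; count_ngrams is a hypothetical
           helper introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Ngrams): count every n-gram
# of size 1..$n over a list of tokens, joining tokens with $sep.
sub count_ngrams {
    my ($n, $sep, @tokens) = @_;
    my %table;                                  # $table{size}{ngram} = count
    for my $i (0 .. $#tokens) {
        for my $size (1 .. $n) {
            last if $i + $size > @tokens;       # window would pass the end
            my $ngram = join $sep, @tokens[$i .. $i + $size - 1];
            $table{$size}{$ngram}++;
        }
    }
    return \%table;
}

# Character tokens of a short string, n-grams up to size 3.
my $t = count_ngrams(3, '', split //, 'abcdefgh');
for my $size (sort { $a <=> $b } keys %$t) {
    my $total = 0;
    $total += $_ for values %{ $t->{$size} };
    print "$size-GRAMS (total count: $total)\n";
    print "$_ $t->{$size}{$_}\n" for sort keys %{ $t->{$size} };
    print "\n";
}
```

           For eight tokens the totals come out as 8, 7 and 6, since a
           sequence of k tokens contains k - size + 1 n-grams of each
           size.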

           The module can be used from the command line through the
           script ngrams.pl provided with the package.

    OUTPUT FORMAT
           The output looks like this:

             BEGIN OUTPUT BY Text::Ngrams version 0.01

             1-GRAMS (total count: 8)
             ------------------------
             a 1
             b 1
             c 1
             d 1
             e 1
             f 1
             g 1
             h 1

             2-GRAMS (total count: 7)
             ------------------------
             ab 1
             bc 1
             cd 1
             de 1
             ef 1
             fg 1
             gh 1

             3-GRAMS (total count: 6)
             ------------------------
             abc 1
             bcd 1
             cde 1
             def 1
             efg 1
             fgh 1

             END OUTPUT BY Text::Ngrams

           N-grams are encoded using encode_S
           (www.cs.dal.ca/~vlado/srcperl/snip/encode_S), so that they
           can always be recognized as \S+. For example, for word
           n-grams, space is replaced by underscore (_):

             BEGIN OUTPUT BY Text::Ngrams version 0.01

             1-GRAMS (total count: 8)
             ------------------------
             The 1
             brown 3
             fox 3
             quick 1

             2-GRAMS (total count: 7)
             ------------------------
             The_brown 1
             brown_fox 2
             brown_quick 1
             fox_brown 2
             quick_fox 1

             END OUTPUT BY Text::Ngrams

           Or, in case of byte type of processing:

             BEGIN OUTPUT BY Text::Ngrams version 0.01

             1-GRAMS (total count: 55)
             -------------------------
             \t 3
             \n 3
             _ 12
             , 2
             . 3
             T 1
             b 3
             c 1
             ... etc

             2-GRAMS (total count: 54)
             -------------------------
             \t_ 1
             \tT 1
             \tb 1
             \n\t 2
             __ 5
             _. 1
             _b 2
             _f 3
             _q 1
             ,\n 2
             .\n 1
             .. 2
             Th 1
             br 3
             ck 1
             e_ 1
             ... etc

             END OUTPUT BY Text::Ngrams
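
           The idea of encoding n-grams so that they always match \S+
           can be sketched as follows. This is NOT the actual encode_S
           routine: encode_ngram is a hypothetical stand-in that only
           mirrors the substitutions visible in the sample output
           (space, tab, newline). The handling of a literal underscore
           and backslash is an assumption added to keep the encoding
           unambiguous.

```perl
use strict;
use warnings;

# Hypothetical stand-in for encode_S (the real routine may differ).
sub encode_ngram {
    my ($s) = @_;
    $s =~ s/\\/\\\\/g;   # assumption: escape backslash first
    $s =~ s/_/\\_/g;     # assumption: escape a literal underscore
    $s =~ s/ /_/g;       # space becomes underscore, as in the word tables
    $s =~ s/\t/\\t/g;    # tab becomes the two characters \t
    $s =~ s/\n/\\n/g;    # newline becomes the two characters \n
    return $s;
}

print encode_ngram("brown fox"), "\n";   # a word 2-gram
print encode_ngram("\n\t"), "\n";        # a byte 2-gram: newline, tab
```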

    METHODS

           new ( windowsize => POS_INTEGER,
                 type => character|byte|word )

             my $ng = Text::Ngrams->new;
             my $ng = Text::Ngrams->new( windowsize=>10 );
             my $ng = Text::Ngrams->new( type=>'word' );
             and similar.

           Creates a new "Text::Ngrams" object and returns it.
           Parameters:

           windowsize
               n-gram size (i.e., `n' itself). Default is 3 if not
               given. It is stored in $object->{windowsize}.

           type
               Specifies a predefined type of n-grams:

               character (default)
                   Default character n-grams: Letters are read and
                   turned uppercase; each sequence of any other
                   characters is replaced by a space.

               byte
                   Raw character n-grams: Don't ignore any bytes and
                   don't pre-process them.

               word
                   Default word n-grams: One token is a word
                   consisting of letters; numbers (digits, including
                   decimal digits) are replaced by <NUMBER>, and
                   everything else is ignored. A space is inserted
                   when n-grams are formed.

               One can also modify a type, creating a custom type, by
               fine-tuning several parameters (any of them can be
               undefined):

               $o->{tokenseparator} - string inserted between tokens
               in an n-gram (for characters it is empty, and for
               words it is a space).

               $o->{skiprex} - regular expression for ignoring stuff
               between tokens.

               $o->{tokenrex} - regular expression for recognizing a
               token.

               $o->{processtoken} - routine for token preprocessing.
               Token is given and returned in $_.
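
               How the four parameters could fit together is sketched
               below in plain Perl. This is an assumed reading of
               their roles, not the module's actual code; tokenize
               and the %word_like settings are hypothetical and
               introduced only for illustration.

```perl
use strict;
use warnings;

# Hypothetical tokenizer driven by skiprex, tokenrex and processtoken.
sub tokenize {
    my ($o, $text) = @_;
    my $skip = $o->{skiprex};
    my $rex  = defined $o->{tokenrex} ? $o->{tokenrex} : qr/./s;
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length $text) {
        # skiprex: ignore material between tokens
        next if defined $skip && $text =~ /\G$skip/gc;
        # tokenrex: recognize the next token at the current position
        if ($text =~ /\G($rex)/gc) {
            local $_ = $1;
            # processtoken: the token is given and returned in $_
            $o->{processtoken}->() if defined $o->{processtoken};
            push @tokens, $_;
        }
        else {
            pos($text) = pos($text) + 1;   # unrecognized character: skip it
        }
    }
    return @tokens;
}

# An illustrative custom type: lowercase words separated by a space.
my %word_like = (
    tokenseparator => ' ',
    skiprex        => qr/[^A-Za-z]+/,
    tokenrex       => qr/[A-Za-z]+/,
    processtoken   => sub { $_ = lc },
);

my @tokens  = tokenize(\%word_like, "The brown fox, the QUICK fox.");
my @bigrams = map { join $word_like{tokenseparator}, @tokens[$_, $_ + 1] }
              0 .. $#tokens - 1;
print "$_\n" for @bigrams;
```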

           feed_tokens ( list of tokens )

           This function manually supplies tokens.

           process_text ( list of strings )

           Process text, i.e., break each string into tokens and feed
           them.

           process_files ( file_names or file_handle_references)

           Process files, similarly to text. The files are processed
           line by line, so there should not be any multi-line
           tokens.

           to_string ( orderby => frequency )

           Produce string representation of the n-gram tables. If
           parameter 'orderby=>frequency' is specified, each table
           is ordered by decreasing frequency.
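
           The two orderings can be illustrated with a plain hash of
           counts. This sketch does not call the module; the counts
           are copied from the word 1-gram table above, and breaking
           ties lexicographically is an assumption, since the text
           does not say how equal frequencies are ordered.

```perl
use strict;
use warnings;

# Counts taken from the word 1-gram table in OUTPUT FORMAT.
my %count = (The => 1, brown => 3, fox => 3, quick => 1);

# Default: lexicographic order of the n-grams.
my @lexicographic = sort keys %count;

# Decreasing frequency; the lexicographic tie-break is an assumption.
my @by_frequency =
    sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

print "$_ $count{$_}\n" for @by_frequency;
```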

    HISTORY AND RELATED WORK
           This code originated in my "monkeys and rhinos" project in
           2000, and is related to an authorship attribution project.
           Some similar projects are (URLs can be found at my site):

           Ngram Statistics Package in Perl, by T. Pedersen et al.
               This is a package that includes a script for word
               n-grams.

           Text::Ngram Perl Package by Simon Cozens
               This is a similar package for character n-grams. As
               an XS-implementation it is supposed to be very
               efficient.

           Perl script ngram.pl by Jarkko Hietaniemi
               This is a script for analyzing character n-grams.

           Waterloo Statistical N-Gram Language Modeling Toolkit, in
               C++ by Fuchun Peng
               An n-gram language modeling package written in C++.

    BUGS AND LIMITATIONS
           If a user customizes a type, it is possible that a
           resulting n-gram will be ambiguous. In this way, two
           different n-grams may be counted as one. With predefined
           types of n-grams, this should not happen.

           For example, if a user chooses that a token can contain a
           space, and uses space as an n-gram separator, then a
           trigram like this "x x x x" is ambiguous.
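
           The ambiguity can be reproduced with a few lines of plain
           Perl (a sketch that does not use the module; the token
           lists are made up for illustration):

```perl
use strict;
use warnings;

my $sep = ' ';                       # space used as n-gram separator
my @trigram_a = ('x x', 'x', 'x');   # first token contains a space
my @trigram_b = ('x', 'x', 'x x');   # last token contains a space

my %count;
$count{ join $sep, @$_ }++ for \@trigram_a, \@trigram_b;

# Two different trigrams end up under the same key.
print "$_ $count{$_}\n" for sort keys %count;
```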

    AUTHOR
           Copyright 2003 Vlado Keselj www.cs.dal.ca/~vlado

           This module is provided "as is" without expressed or
           implied warranty. This is free software; you can
           redistribute it and/or modify it under the same terms as
           Perl itself.

           The latest version can be found at
           http://www.cs.dal.ca/~vlado/srcperl/.

    SEE ALSO
           Ngram Statistics Package in Perl, by T. Pedersen et al.,
           Waterloo Statistical N-Gram Language Modeling Toolkit in
           C++ by Fuchun Peng, Perl script ngram.pl by Jarkko
           Hietaniemi, Simon Cozens's Text::Ngram module in CPAN.

           The links should be available at
           http://www.cs.dal.ca/~vlado/nlp.


