[Corpora-List] New Ngram package in Perl

From: Vlado Keselj (vlado@cs.dal.ca)
Date: Fri Jun 06 2003 - 22:17:30 MET DST


Text::Ngrams, a new Perl package for n-gram analysis, is now
available at:

http://www.cs.dal.ca/~vlado/srcperl/Ngrams

and it will soon be indexed by CPAN (www.cpan.org).

It is a small and flexible piece of code that comes with a script
ngrams.pl for direct processing of files.

I am aware that this is `yet another' n-gram package, but it is novel in
some ways. References to other packages are included.

The man pages for the script and the module are included below.

Vlado

-------

SYNOPSIS
ngrams.pl [--version] [--help] [--n=3] [--type=character] [--orderbyfrequency] [input files]

DESCRIPTION
This script writes n-gram tables for the input files to
the standard output.

Options:

--version
Prints version.

--help
Prints help.

--n=NUMBER
N-gram size; produces 3-grams by default.

--type=character|byte|word
Type of n-grams produced. See the Text::Ngrams module.

--orderbyfrequency
By default, the n-grams are ordered lexicographically. If this
option is specified, they are ordered by frequency in descending
order.
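
The two orderings can be sketched in plain Perl as follows. This is only
an illustration over a hypothetical %count hash; it does not show how
ngrams.pl sorts internally, and the lexicographic tie-breaking for equal
frequencies is an assumption of the sketch.

```perl
use strict;
use warnings;

# Hypothetical n-gram counts, for illustration only.
my %count = ( ab => 2, bc => 5, cd => 2, de => 1 );

# Default behaviour: lexicographic order of the n-grams.
sub by_name {
    my ($c) = @_;
    return sort keys %$c;
}

# --orderbyfrequency: descending frequency. Ties are broken
# lexicographically (an assumption of this sketch).
sub by_frequency {
    my ($c) = @_;
    return sort { $c->{$b} <=> $c->{$a} or $a cmp $b } keys %$c;
}

print "$_ $count{$_}\n" for by_frequency(\%count);
```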

PREREQUISITES
Text::Ngrams, Getopt::Long

SCRIPT CATEGORIES
Text::Statistics

SEE ALSO
Text::Ngrams module.

       This module is provided "as is" without expressed or
       implied warranty. This is free software; you can
       redistribute it and/or modify it under the same terms as
       Perl itself.

The latest version can be found at
http://www.cs.dal.ca/~vlado/srcperl/.

------------------------------------------------------------------------

NAME
Text::Ngrams - Flexible Ngram analysis (for characters,
words, and more)

SYNOPSIS
For default character n-gram analysis of a string:

         use Text::Ngrams;
         my $ng3 = Text::Ngrams->new;
         $ng3->process_text('abcdefg1235678hijklmnop');
         print $ng3->to_string;

One can also feed tokens manually:

         use Text::Ngrams;
         my $ng3 = Text::Ngrams->new;
         $ng3->feed_tokens('a');
         $ng3->feed_tokens('b');
         $ng3->feed_tokens('c');
         $ng3->feed_tokens('d');
         $ng3->feed_tokens('e');
         $ng3->feed_tokens('f');
         $ng3->feed_tokens('g');
         $ng3->feed_tokens('h');

We can choose n-grams of various sizes, e.g.:

my $ng = Text::Ngrams->new( windowsize => 6 );

or different types of n-grams, e.g.:

my $ng = Text::Ngrams->new( type => 'byte' );
my $ng = Text::Ngrams->new( type => 'word' );

DESCRIPTION
       This module implements text n-gram analysis, supporting
       several types of analysis, including character and word
       n-grams.

       The module Text::Ngrams is very flexible. For example, it
       allows a user to manually feed a sequence of any tokens.
       It handles several types of tokens (character, word), and
       also allows a lot of flexibility in the automatic
       recognition and feeding of tokens and in the way they are
       combined into an n-gram. It counts all n-gram frequencies
       up to the maximal specified length. The output format is
       meant to be human-readable, while also loadable by the
       module.

The module can be used from the command line through the
script ngrams.pl provided with the package.
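
To make "counts all n-gram frequencies up to the maximal specified
length" concrete, here is a minimal sliding-window sketch in plain Perl.
It does not use Text::Ngrams and is not the module's implementation; the
function name count_ngrams and its argument order are made up for this
illustration.

```perl
use strict;
use warnings;

# Count all n-grams of length 1..$n over a list of tokens.
# Returns an array reference: $tables->[$k]{$ngram} is the count
# of a particular k-gram.
sub count_ngrams {
    my ($n, $sep, @tokens) = @_;
    my @tables = map { +{} } 0 .. $n;
    my @window;                       # the last (at most $n) tokens
    for my $tok (@tokens) {
        push @window, $tok;
        shift @window if @window > $n;
        # Record every n-gram that ends at the current token.
        for my $k (1 .. scalar @window) {
            my $ngram = join $sep, @window[ -$k .. -1 ];
            $tables[$k]{$ngram}++;
        }
    }
    return \@tables;
}

# Character-like tokens 'a' .. 'h' with an empty separator.
my $t = count_ngrams(3, '', 'a' .. 'h');
for my $k (1 .. 3) {
    print "$k-GRAMS\n";
    print "$_ $t->[$k]{$_}\n" for sort keys %{ $t->[$k] };
}
```

For eight distinct tokens this yields 8 unigrams, 7 bigrams and 6
trigrams.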

OUTPUT FORMAT
The output looks like this:

BEGIN OUTPUT BY Text::Ngrams version 0.01

         1-GRAMS (total count: 8)
         ------------------------
         a 1
         b 1
         c 1
         d 1
         e 1
         f 1
         g 1
         h 1

         2-GRAMS (total count: 7)
         ------------------------
         ab 1
         bc 1
         cd 1
         de 1
         ef 1
         fg 1
         gh 1

         3-GRAMS (total count: 6)
         ------------------------
         abc 1
         bcd 1
         cde 1
         def 1
         efg 1
         fgh 1

END OUTPUT BY Text::Ngrams

       N-grams are encoded using encode_S
       (www.cs.dal.ca/~vlado/srcperl/snip/encode_S), so that they
       can always be recognized as \S+. For example, for word
       n-grams, space is replaced by underscore (_):

BEGIN OUTPUT BY Text::Ngrams version 0.01

         1-GRAMS (total count: 8)
         ------------------------
         The 1
         brown 3
         fox 3
         quick 1

         2-GRAMS (total count: 7)
         ------------------------
         The_brown 1
         brown_fox 2
         brown_quick 1
         fox_brown 2
         quick_fox 1

END OUTPUT BY Text::Ngrams

Or, in the case of byte-type processing:

BEGIN OUTPUT BY Text::Ngrams version 0.01

         1-GRAMS (total count: 55)
         -------------------------
         \t 3
         \n 3
         _ 12
         , 2
         . 3
         T 1
         b 3
         c 1
         ... etc

         2-GRAMS (total count: 54)
         -------------------------
         \t_ 1
         \tT 1
         \tb 1
         \n\t 2
         __ 5
         _. 1
         _b 2
         _f 3
         _q 1
         ,\n 2
         .\n 1
         .. 2
         Th 1
         br 3
         ck 1
         e_ 1
         ... etc

END OUTPUT BY Text::Ngrams
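
The practical effect of the encoding is that every table line has the
shape "n-gram, whitespace, count" and can be read back with \S+. The
sketch below illustrates only that idea. The routine encode_simple is a
made-up stand-in that handles nothing but spaces; it is not encode_S,
whose full rules (for instance the \t and \n forms seen in the byte
example) are not reproduced here.

```perl
use strict;
use warnings;

# Simplified stand-in for the encoding step: only spaces are handled.
sub encode_simple {
    my ($ngram) = @_;
    (my $enc = $ngram) =~ tr/ /_/;
    return $enc;
}

# Parse one table line back into (n-gram, count), relying on the
# n-gram matching \S+. Returns an empty list if the line does not fit.
sub parse_line {
    my ($line) = @_;
    return $line =~ /^\s*(\S+)\s+(\d+)\s*$/ ? ($1, $2) : ();
}

my $line = encode_simple('brown fox') . ' 2';
my ($ngram, $count) = parse_line($line);
print "$ngram => $count\n";
```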

METHODS

new ( windowsize => POS_INTEGER, type => character|byte|word )

         my $ng = Text::Ngrams->new;
         my $ng = Text::Ngrams->new( windowsize=>10 );
         my $ng = Text::Ngrams->new( type=>'word' );
         and similar.

Creates a new "Text::Ngrams" object and returns it.
Parameters:

       windowsize
           n-gram size (i.e., `n' itself). Default is 3 if not
           given. It is stored in $object->{windowsize}.

type
Specifies a predefined type of n-grams:

           character (default)
               Default character n-grams: Letters are read and
               turned to uppercase; sequences of all other
               characters are replaced by a space.

           byte
               Raw character n-grams: Don't ignore any bytes and
               don't pre-process them.

           word
               Default word n-grams: One token is a word
               consisting of letters; sequences of digits,
               including decimal numbers, are replaced by
               <NUMBER>, and everything else is ignored. A space
               is inserted when n-grams are formed.

           One can also modify a type, creating one's own type,
           by fine-tuning several parameters (they can be
           undefined):

           $o->{tokenseparator} - string inserted between tokens
           in an n-gram (for characters it is empty, and for
           words it is a space).

$o->{skiprex} - regular expression for ignoring stuff
between tokens.

$o->{tokenrex} - regular expression for recognizing a
token.

$o->{processtoken} - routine for token preprocessing.
Token is given and returned in $_.
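
As a rough illustration of how such parameters could work together,
here is a stand-alone tokenizer in plain Perl. Everything in it is an
assumption made for the sketch: the helper name tokenize, the order in
which the skip and token patterns are tried, the handling of characters
that match neither, and the example patterns themselves. It is not the
module's code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Ngrams. The option names
# mirror the parameters described above.
sub tokenize {
    my ($text, %o) = @_;
    my @tokens;
    while (length $text) {
        my $before = length $text;
        # Drop ignorable material at the front, if a skip pattern is set.
        $text =~ s/^(?:$o{skiprex})// if defined $o{skiprex};
        # Then try to take one token from the front.
        if ($text =~ s/^($o{tokenrex})// && length $1) {
            local $_ = $1;            # token is given and returned in $_
            $o{processtoken}->() if defined $o{processtoken};
            push @tokens, $_;
        }
        # No progress at all: discard one character and go on.
        substr($text, 0, 1, '') if length($text) == $before;
    }
    return @tokens;
}

# Illustrative patterns loosely modelled on the "word" type.
my @t = tokenize(
    "The quick, brown fox 42",
    tokenrex     => qr/[A-Za-z]+|\d+/,
    skiprex      => qr/[^A-Za-z\d]+/,
    processtoken => sub { s/^\d+$/<NUMBER>/ },
);
print join(' ', @t), "\n";
```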

feed_tokens ( list of tokens )

This function manually supplies tokens.

process_text ( list of strings )

Process text, i.e., break each string into tokens and feed
them.

process_files ( file_names or file_handle_references )

       Process files, similarly to text. The files are processed
       line by line, so there should not be any multi-line
       tokens.

to_string ( orderby => frequency )

       Produce string representation of the n-gram tables. If
       parameter 'orderby=>frequency' is specified, each table
       is ordered by decreasing frequency.

HISTORY AND RELATED WORK
       This code originated in my "monkeys and rhinos" project in
       2000, and is related to an authorship attribution project.
       Some similar projects are (URLs can be found at my site):

       Ngram Statistics Package in Perl, by T. Pedersen et al.
           This is a package that includes a script for word
           n-grams.

       Text::Ngram Perl Package by Simon Cozens
           This is a similar package for character n-grams. As
           an XS-implementation it is supposed to be very
           efficient.

Perl script ngram.pl by Jarkko Hietaniemi
This is a script for analyzing character n-grams.

       Waterloo Statistical N-Gram Language Modeling Toolkit, in
           C++ by Fuchun Peng
           An n-gram language modeling package written in C++.

BUGS AND LIMITATIONS
       If a user customizes a type, it is possible that a
       resulting n-gram will be ambiguous. In this way, two
       different n-grams may be counted as one. With predefined
       types of n-grams, this should not happen.

       For example, if a user chooses that a token can contain a
       space, and uses space as an n-gram separator, then a
       trigram like this "x x x x" is ambiguous.
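
The ambiguity is easy to reproduce in plain Perl. The token lists below
are made up for the illustration; the point is only that two different
token sequences join into the same string once the separator can also
occur inside a token.

```perl
use strict;
use warnings;

# Two different trigrams; tokens are allowed to contain a space here.
my @first  = ('x x', 'x', 'x');
my @second = ('x', 'x x', 'x');

# With a space as the token separator, both collapse into one key.
my $first_key  = join ' ', @first;
my $second_key = join ' ', @second;

print "'$first_key' vs '$second_key': ",
      ($first_key eq $second_key ? "same key" : "different keys"), "\n";
```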

       This module is provided "as is" without expressed or
       implied warranty. This is free software; you can
       redistribute it and/or modify it under the same terms as
       Perl itself.

The latest version can be found at
http://www.cs.dal.ca/~vlado/srcperl/.

SEE ALSO
       Ngram Statistics Package in Perl, by T. Pedersen et al.,
       Waterloo Statistical N-Gram Language Modeling Toolkit in
       C++ by Fuchun Peng, Perl script ngram.pl by Jarkko
       Hietaniemi, Simon Cozens's Text::Ngram module in CPAN.

The links should be available at
http://www.cs.dal.ca/~vlado/nlp.

