[Corpora-List] POS tagging via relational databases (follow-up)

From: Mark Davies (Mark_Davies@byu.edu)
Date: Thu Sep 25 2003 - 13:30:43 MET DST

    Thanks to those who responded to my earlier message -- most via private
    communication. Here's a bit of an update.

    I ran some sample updates on a 1 million word extract of the BNC that I
    have in database format in SQL Server. First I "corrupted" the POS tags
    for 20,000-30,000 rows by running a query that would, for example,
    change "AJ0" to "AJ0-xxx" in any row that follows an "AT0" row and whose
    word starts with "s" or "t". Then I'd run the "correction" update that
    would set "AJ0-xxx" to "AJ0" after "AT0" (modeling the resolution of
    ambiguity). I ran about twenty such "correction" UPDATE queries in
    sequence and noted the total elapsed time.
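
    A minimal sketch of one such corrupt/correct pair, not the actual
    queries. It uses SQLite through Python's sqlite3 module as a stand-in
    for SQL Server, and the table name "corpus", its columns (id, word,
    pos) and the toy tokens are all assumptions made for illustration.

```python
import sqlite3
import time

# Hypothetical schema: one row per token, ids in corpus order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (id INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
tokens = [
    ("the", "AT0"), ("small", "AJ0"), ("dog", "NN1"),
    ("a", "AT0"), ("tall", "AJ0"), ("tree", "NN1"),
    ("very", "AV0"), ("sad", "AJ0"), ("the", "AT0"), ("old", "AJ0"),
]
conn.executemany("INSERT INTO corpus (word, pos) VALUES (?, ?)", tokens)

# Condition shared by both queries: the preceding row is tagged AT0.
AFTER_AT0 = """EXISTS (SELECT 1 FROM corpus AS prev
                       WHERE prev.id = corpus.id - 1 AND prev.pos = 'AT0')"""

# "Corrupt": AJ0 -> AJ0-xxx after AT0, only for words starting with s or t.
corrupted = conn.execute(
    f"""UPDATE corpus SET pos = 'AJ0-xxx'
        WHERE pos = 'AJ0'
          AND (word LIKE 's%' OR word LIKE 't%')
          AND {AFTER_AT0}"""
).rowcount

# "Correct": AJ0-xxx -> AJ0 after AT0, timed as a single UPDATE.
start = time.perf_counter()
corrected = conn.execute(
    f"UPDATE corpus SET pos = 'AJ0' WHERE pos = 'AJ0-xxx' AND {AFTER_AT0}"
).rowcount
elapsed = time.perf_counter() - start

print(f"corrupted {corrupted} rows, corrected {corrected} rows in {elapsed:.6f} s")
```

    On a real corpus table the same pattern applies unchanged. Timings on
    this toy data say nothing about the figures reported here.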

    Each update of 20,000-30,000 rows takes about 0.4 seconds, meaning that
    you could run about thirty of them in 10-12 seconds. This is after the
    initial UPDATE of the POS column for all rows in the database from the
    lexicon -- which takes about 8-10 seconds. Also, any updates on rows
    with specific lexical items (even relatively high-frequency items) are
    essentially instantaneous.
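
    The two kinds of update mentioned here might look roughly like this.
    Again this is only a sketch in SQLite via Python: the "lexicon" table,
    the index on the word column, and the walks -> VVZ rule are
    hypothetical and chosen purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE corpus  (id INTEGER PRIMARY KEY, word TEXT, pos TEXT);
    CREATE TABLE lexicon (word TEXT PRIMARY KEY, pos TEXT);
    -- An index on word is assumed; it is what keeps per-item updates cheap.
    CREATE INDEX corpus_word ON corpus (word);
""")
conn.executemany(
    "INSERT INTO corpus (word) VALUES (?)",
    [("the",), ("dog",), ("walks",), ("the",), ("walks",)],
)
conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?)",
    [("the", "AT0"), ("dog", "NN1"), ("walks", "NN2-VVZ")],
)

# Initial pass: fill the POS column for every row from the lexicon.
conn.execute(
    "UPDATE corpus SET pos = (SELECT l.pos FROM lexicon AS l WHERE l.word = corpus.word)"
)
initial_tags = [row[0] for row in conn.execute("SELECT pos FROM corpus ORDER BY id")]

# Update restricted to one lexical item (illustrative only, not a real rule).
conn.execute("UPDATE corpus SET pos = 'VVZ' WHERE word = 'walks' AND pos = 'NN2-VVZ'")
final_tags = [row[0] for row in conn.execute("SELECT pos FROM corpus ORDER BY id")]

print(initial_tags)
print(final_tags)
```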

    Anyway, all of this suggests that it would take about 20 seconds to tag
    a 1,000,000 word corpus with about thirty rewrite rules, and perhaps 30
    seconds for sixty or so rewrite rules. At this rate, and assuming the
    time scales linearly with corpus size, one could tag the entire
    100,000,000 word BNC in about half an hour. This seems fairly
    acceptable to me, although some have suggested that this is still
    rather slow compared with state-of-the-art taggers.
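
    A back-of-the-envelope check of these figures, using only the numbers
    quoted in this message (8-10 seconds for the lexicon pass and about
    0.4 seconds per rule, per million words). Linear scaling with corpus
    size is an assumption, not something that was measured.

```python
# Figures quoted in the message, per 1,000,000 words.
LEXICON_PASS_S = (8.0, 10.0)  # initial UPDATE from the lexicon (low, high)
PER_RULE_S = 0.4              # one "correction" UPDATE

def estimate_seconds(n_rules, million_words=1):
    """Return (low, high) elapsed seconds, assuming linear scaling."""
    rules = n_rules * PER_RULE_S
    return tuple((base + rules) * million_words for base in LEXICON_PASS_S)

for n_rules in (30, 60):
    low, high = estimate_seconds(n_rules)
    print(f"{n_rules} rules, 1M words: {low:.0f}-{high:.0f} s")

low, high = estimate_seconds(30, million_words=100)
print(f"30 rules, 100M words: {low / 60:.0f}-{high / 60:.0f} min")
```

    Under these assumptions thirty rules come to 20-22 seconds per million
    words and sixty rules to 32-34 seconds, so the full BNC with thirty
    rules lands at roughly 33-37 minutes, i.e. on the order of half an
    hour.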

    Mark Davies

    P.S. One or two others questioned the complexity of the SQL
    rewrite/UPDATE rules, but these can easily be derived via simple scripts
    from more standard rules, such as [NN2-VVZ > NN2 / AT0 __]. Also, any
    rule-ordering problems could -- it seems -- be handled as easily with
    SQL as with the rewrite rules in the Brill tagger.
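
    One way such a script could look, as a sketch only. It reads the
    context tag in the rule above as the BNC article tag AT0 (with a
    zero), treats "__" as the position of the token being retagged, and
    targets the same hypothetical "corpus" table (id, word, pos). The
    function name rule_to_sql is made up for this example, and SQLite via
    Python again stands in for SQL Server.

```python
import re
import sqlite3

TAG = r"[A-Z0-9-]+"
# [OLD > NEW / LEFT __ RIGHT], with LEFT and RIGHT both optional.
RULE = re.compile(
    rf"\[\s*({TAG})\s*>\s*({TAG})\s*/\s*({TAG})?\s*__\s*({TAG})?\s*\]"
)

def rule_to_sql(rule):
    """Translate one rewrite rule into an UPDATE on the assumed corpus table."""
    match = RULE.fullmatch(rule.strip())
    if match is None:
        raise ValueError(f"unrecognised rule: {rule!r}")
    old, new, left, right = match.groups()
    # Tags were validated by the regex above, so inlining them is safe here.
    conditions = [f"pos = '{old}'"]
    for tag, offset in ((left, -1), (right, +1)):
        if tag:
            conditions.append(
                "EXISTS (SELECT 1 FROM corpus AS ctx "
                f"WHERE ctx.id = corpus.id + ({offset}) AND ctx.pos = '{tag}')"
            )
    return f"UPDATE corpus SET pos = '{new}' WHERE " + " AND ".join(conditions)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (id INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
conn.executemany(
    "INSERT INTO corpus (word, pos) VALUES (?, ?)",
    [("the", "AT0"), ("walks", "NN2-VVZ"), ("he", "PNP"), ("walks", "NN2-VVZ")],
)

# Rules run in list order, so ordering is controlled by the list itself.
rules = ["[NN2-VVZ > NN2 / AT0 __]"]
for rule in rules:
    conn.execute(rule_to_sql(rule))

tags = [row[0] for row in conn.execute("SELECT pos FROM corpus ORDER BY id")]
print(tags)
```

    Only the "walks" that follows an AT0 row is resolved to NN2; the one
    after the pronoun keeps its ambiguity tag.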

    =================================================
    Mark Davies
    Assoc. Prof., Linguistics
    Brigham Young University
    (phone) 801-422-9168 / (fax) 801-422-0906
    http://davies-linguistics.byu.edu

    ** Corpus design and use // Web-database scripting **
    ** Historical linguistics // Functional-typological grammar **
    ** Spanish and Portuguese historical and dialectal syntax **
    =================================================
