Re: [Corpora-List] Tag-set conversion

From: Detmar Meurers (dm@ling.ohio-state.edu)
Date: Fri Jan 31 2003 - 06:51:01 MET

    > > Does anybody know of an existing tool to translate between the BNC C5
    > > tag-set and the Penn Tree Bank tag-set?
    >
    > [...]
    > You could alternatively just retag the BNC using a Penn-style tagger, of
    > course, given that the BNC data was for the most part automatically tagged.

    I'd be very careful there. The 2-million-word BNC core corpus is
    hand-corrected, which according to Leech (1997) reduced the error
    rate to less than 0.3%. For the full 100-million-word BNC, that
    paper mentions an error rate of 1.7% (of all words, excluding
    punctuation marks). For the BNC2, the "BNC2 POS-tagging Manual" that
    comes with the corpus estimates the overall error rate at 1.15% (cf.
    also the BNC Tagging Enhancement Project). So "simple automatic
    retagging with a Penn-style tagger" is likely to double or triple
    your error rate compared with the tags already in the corpus.
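
    To make the arithmetic concrete, here is a rough sketch in Python.
    The three BNC rates are the ones quoted above; the 3% error rate for
    the Penn-style retagger and the round corpus size of 100 million
    tokens are assumptions for illustration only, not measured figures,
    and the helper names are made up for this example.

```python
# Error rates quoted in the message, as fractions of word tokens.
BNC_CORE_RATE = 0.003   # upper bound: "less than 0.3%" (Leech 1997)
BNC_FULL_RATE = 0.017   # 1.7%, excluding punctuation (Leech 1997)
BNC2_RATE = 0.0115      # 1.15% (BNC2 POS-tagging Manual)

# ASSUMPTION: hypothetical error rate of an automatic Penn-style retagger.
# This is an illustrative value, not a figure from the message.
RETAGGER_RATE = 0.03

# ASSUMPTION: nominal corpus size, rounded to 100 million word tokens.
CORPUS_TOKENS = 100_000_000


def expected_errors(rate: float, tokens: int) -> int:
    """Expected number of mistagged tokens for a given error rate."""
    return round(rate * tokens)


def error_ratio(new_rate: float, old_rate: float) -> float:
    """How many times larger the new error rate is than the old one."""
    return new_rate / old_rate


if __name__ == "__main__":
    for label, rate in [("BNC (Leech 1997)", BNC_FULL_RATE),
                        ("BNC2 manual", BNC2_RATE),
                        ("assumed retagger", RETAGGER_RATE)]:
        print(f"{label}: about {expected_errors(rate, CORPUS_TOKENS):,} errors")
    print(f"retagger vs BNC2: {error_ratio(RETAGGER_RATE, BNC2_RATE):.1f}x")
    print(f"retagger vs BNC:  {error_ratio(RETAGGER_RATE, BNC_FULL_RATE):.1f}x")
```

    With a different assumed retagger rate the ratios change accordingly,
    so plug in the accuracy actually reported for the tagger you have in
    mind.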

    Lieben Gruss,
    Detmar

    @Manual{leech:97,
      title = {A Brief Users' Guide to the Grammatical Tagging of the British
                     National Corpus},
      author = {Geoffrey Leech},
      organization = {UCREL, Lancaster University},
      year = 1997,
      note = {\url{http://www.hcu.ox.ac.uk/BNC/what/gramtag.html}}}

    --
    Detmar Meurers                              Fax: Int + 614 292-8833 
    The Ohio State University                   Tel: Int + 614 292-0461
    Department of Linguistics                   E-Mail: dm@ling.osu.edu
    1712 Neil Avenue, Oxley Hall     Homepage: http://ling.osu.edu/~dm/
    Columbus OH 43210-1298, USA    PGP key on web page (use encouraged)
    

    "It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts." Sherlock Holmes in "A Scandal in Bohemia" (A. C. Doyle)


