Re: Corpora: sgml detagger

From: Michael Betsch (
Date: Wed Apr 17 2002 - 09:44:54 MET DST


    It will probably be easier to use an existing SGML parser than to
    write a script that can really identify _all_ possible tags and
    remove them.

    The (freely available) parser onsgmls puts all data content in its
    output on lines of their own, each prefixed by a "-". So you can
    simply run onsgmls on your SGML files and keep only the lines that
    start with "-" (using 'grep -e "^-"'); then you can easily remove
    the leading "-" with perl or something similar. This assumes that
    all data content is text you actually want, and not e.g. JavaScript
    code, which you will probably not want to include in your corpus.
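
    As a rough sketch of the filtering step described above, here is a
    small Python version that does the job of the grep and perl steps in
    one pass. It works on onsgmls output that has already been captured
    as a string. The names extract_text and _unescape are made up for
    this illustration. The escape handling (\\ for a backslash, \n for a
    record end, \| around SDATA entities, \nnn for an octal character
    code) is my reading of the onsgmls output format; other versions may
    emit further escapes that this sketch does not cover.

```python
import re

# Escape sequences assumed to occur in onsgmls data lines:
# \\ (backslash), \n (record end), \| (SDATA bracket), \nnn (octal code).
_ESCAPE = re.compile(r"\\(\\|n|\||[0-7]{3})")


def _unescape(match):
    """Translate one escape sequence into the text it stands for."""
    token = match.group(1)
    if token == "\\":
        return "\\"
    if token == "n":
        return "\n"
    if token == "|":
        return ""  # bracket around an SDATA entity, carries no text itself
    return chr(int(token, 8))


def extract_text(esis_output):
    """Keep only the data lines (those starting with '-') and strip the '-'."""
    chunks = []
    for line in esis_output.split("\n"):
        if line.startswith("-"):
            chunks.append(_ESCAPE.sub(_unescape, line[1:]))
    # Data lines are joined without a separator: the parser splits running
    # text wherever a tag occurs, so adjacent chunks belong together.
    return "".join(chunks)


if __name__ == "__main__":
    sample = "(P\n-Hello \n(B\n-world\n)B\n-!\n)P\nC\n"
    print(extract_text(sample))
```

    Joining without a separator keeps words intact when a tag occurs in
    the middle of a sentence, but it also means that paragraph breaks
    survive only where the parser reported a record end. Content such as
    scripts would still have to be filtered out separately.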


    _______________________________________________________________________
    Dr. Michael Betsch              private:
    SFB 441, Projekt B1
    Nauklerstraße 35                Rappenberghalde 27
    72074 Tübingen                  72070 Tübingen
    Tel. 07071/29-77161             Tel. 07071/51917
    email:
    _______________________________________________________________________
