Re: Corpora: sgml detagger

From: Alexander S. Yeh
Date: Tue Apr 16 2002 - 20:43:59 MET DST

    The script below will work for most tags, but because it reads the
    input one line at a time and ends each match at the first ">", it may
    fail in the following more complicated cases:

    1. A tag is spread out over more than one line (usual cases: comment
    tags, tags with attribute/value pairs).

    2. A tag has an attribute value that has a ">" in it.

    3. A comment tag has a ">" embedded in it.

    I have encountered all three in HTML files of journal articles
    downloaded from the web. Thanks.
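
    A minimal sketch of one way to cover the three cases above, assuming
    the whole file fits in memory and that quotes inside tags are balanced.
    The helper name strip_tags is hypothetical and not part of the quoted
    script. This is a regex approximation, not an SGML parser, so a stray
    "<" in running text can still confuse it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_tags is a hypothetical helper name, used here for illustration only.
sub strip_tags {
    my ($text) = @_;

    # Case 3: remove comments first, so a ">" inside a comment is never
    # mistaken for the end of a tag. /s lets "." match newlines.
    $text =~ s/<!--.*?-->//gs;

    # Cases 1 and 2: a tag is "<", then any mix of ordinary characters and
    # quoted attribute values, then ">". Quoted values may contain ">",
    # and negated character classes match newlines, so tags may span lines.
    $text =~ s/<(?:[^>"']|"[^"]*"|'[^']*')*>//g;

    return $text;
}

# Only touch files when a file name is given on the command line.
if (@ARGV) {
    my ($infile) = @ARGV;

    open my $in, '<', $infile or die "Cannot read $infile: $!";
    local $/;            # slurp mode: read the whole file, not one line
    my $text = <$in>;
    close $in;

    open my $out, '>', "$infile.out" or die "Cannot write $infile.out: $!";
    print {$out} strip_tags($text);
    close $out or die "Cannot close $infile.out: $!";
}
```

    Reading the whole file at once is what lets a match cross line
    boundaries; the line-by-line loop in the quoted script cannot do that.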

    -Alex Yeh

    Danko Sipka wrote:

    > Hi:
    >
    > This Perl script should do the job:
    >
    > print "What is your input file name:\n";
    > chomp($infile=<STDIN>);
    > open IN, $infile or die "No file, no fun!";
    > open OUT, ">$infile.out" or die "No file, no fun!";
    > while (<IN>) {
    > $_=~s/\<.+?\>//g;
    > print OUT "$_";
    > }
    > close (IN) or die "D'oh!";
    > close (OUT) or die "D'oh!";
    >
    > Best,
    > Danko
    > Danko.Sipka@asu.edu
    > ----- Original Message -----
    > From: Tine & Colleen
    > Sent: Tuesday, April 16, 2002 8:13 PM
    > Subject: Corpora: sgml detagger
    > Hi
    >
    > I am compiling a corpus for research reasons and some of
    > the texts are sgml-tagged. Does anybody know an easy way to
    > remove the tags and save the texts as 'raw' .txt files? Maybe
    > a PERL script?
    >
    > Thanks in advance
    > Tine Lassen
    > Copenhagen
