Re: Corpora: line joining

From: Susana Sotelo Docio (sdocio@usc.es)
Date: Sat Feb 24 2001 - 12:33:42 MET

    Hello,

    > I need to fix an output from a tagger and join consecutive lines of text, so
    > that, for example, this:
    > de PREP
    > a ART
    > turns into this:
    > da CPR
    > Does anyone know how to do this in sed or perl?

    If the output of the tagger is a big file, you might prefer flex (under
    Unix/Linux). The source would be:

    ------------------------------file contrac.lex------------------
    %%
    ^de\tPREP\na\tART\n { printf("da\tCPR\n"); }
    %%
    ------------------------------end-------------------------------

    You must compile this code and then run it on the tagger output:

       flex contrac.lex; gcc -o contrac lex.yy.c -lfl

       ./contrac < tagged_text.in > tagged_text.out

    If you prefer perl, the script could be something like:

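    ------------------------------file contrac.pl------------------
    #!/usr/bin/perl

    # Join "de PREP" followed by "a ART" into a single "da CPR" line.
    while(<>)
    {
      if(/^de\tPREP\n/)
      {
        $newline = <>;
        # "de PREP" was the last line: print it and stop, because calling <>
        # again after end-of-file would start reading from STDIN.
        if(!defined $newline) { print; last }
        # No contraction: print the first line, then re-examine the second.
        if($newline !~ /^a\tART\n/) { print; $_ = $newline; redo }
        print "da\tCPR\n";
      }
      else { print }
    }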
    --------------------------------end----------------------------

    Usage:
      perl contrac.pl tagged_text.in > tagged_text.out
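
    Since the question also mentioned sed, here is a minimal sketch of the
    same join using a two-line sliding window. The function name
    contract_sed is only illustrative, and the sketch assumes the same
    tab-separated format with Unix line endings. The tab is passed in
    literally because \t inside a sed pattern is not portable.

```shell
# Hypothetical helper: join "de<TAB>PREP" + "a<TAB>ART" into "da<TAB>CPR".
contract_sed() {
  tab=$(printf '\t')   # literal tab character for use inside the pattern
  # $!N  appends the next line to the pattern space (except on the last line)
  # s    rewrites the pair if the two lines are exactly the contraction
  # P    prints the first line of the pattern space
  # D    drops that first line and restarts with whatever is left
  sed -e '$!N' \
      -e "s/^de${tab}PREP\\na${tab}ART\$/da${tab}CPR/" \
      -e P -e D "$@"
}

# Usage, with the file names from the examples above:
#   contract_sed tagged_text.in > tagged_text.out
```

    Because the window slides one line at a time, a "de PREP" line that is
    not followed by "a ART" is printed unchanged and the following line is
    still checked against the line after it.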

    If the input file has DOS line endings, you must replace \n with \r\n in
    the patterns. I assume that word forms and tags are separated by tabs.

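
    An alternative to editing the patterns, sketched here under the
    assumption that Unix line endings are acceptable in the output, is to
    strip the carriage returns before the text reaches the filter. The name
    strip_cr is only illustrative.

```shell
# Hypothetical helper: delete every carriage return from standard input,
# so that patterns written with a bare \n also match DOS-style files.
strip_cr() {
  tr -d '\r'
}

# Usage, with the compiled flex filter from the example above:
#   strip_cr < tagged_text.in | ./contrac > tagged_text.out
```

    Note that the result then has Unix line endings throughout, not only on
    the joined lines.
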
    Greetings,
    Susana.

    ----------------------------------------------------------------------
    Susana Sotelo Docío
    Facultade de Filoloxía sdocio@usc.es _o)
    Universidade de Santiago http://web.usc.es/~fesdocio / \\
    "Neunu ti at a abberrer mai si thocceddas a sas jannas _(___V
    cun mudos thoccos de ocros" #96506
    ----------------------------------------------------------------------


