[Corpora-List] Some comments on aligners

From: Santos Diana (Diana.Santos@sintef.no)
Date: Thu Sep 05 2002 - 13:04:31 MET DST


    Dear colleagues,

    It seems to me something of a waste of time and resources to be discussing
    aligners for one particular commercial application such as ParaConc on this
    list (I know that was the initial question...), given that there are so many
    other existing systems that offer better search functionality for parallel
    corpora and are, moreover, free.

    So, after some reflection, I decided to talk about our approach in COMPARA,
    so that naive readers of the list do not conclude that the only existing
    aligners are the ones discussed in the previous mail thread. Basically, I
    suggest that anyone involved in parallel corpora work use

    1) the IMS Corpus Workbench developed at Stuttgart (Stefan Evert and Ulrich
    Heid)
    2) and the EasyAlign aligner that comes with it and has all the
    functionalities that have been described in the previous mails (namely it
    aligns, or accepts a previous alignment, so that one can easily incorporate
    the results of manual revision into a powerful corpus querying system)

    For those who would complain that the system runs on Unix / Linux and is
    therefore not usable by naive users, the obvious solution is to create a
    Web frontend, as we did in COMPARA; see http://www.portugues.mct.pt/COMPARA

    I'm not paid to advertise IMS-CWB nor to align texts for other projects
    (although we do it occasionally for some people when one of the languages
    of the parallel texts is Portuguese), but I really think, after careful
    consideration of many other systems and approaches, that this is the best
    way to go.

    People interested in the technical details of exactly how the DISPARA setup
    works can also read, after the Web pages, the paper

    Santos, Diana. "DISPARA, a system for distributing parallel corpora on the
    Web", in Elisabete Ranchhod & Nuno J. Mamede (eds.), Advances in Natural
    Language Processing (Third International Conference, PorTAL 2002, Faro,
    Portugal, June 2002, Proceedings), LNAI 2389, Springer, 2002, pp.209-218.

    and here is a gentler introduction for non-technical users

    Frankenberg-Garcia, Ana & Diana Santos. "Introducing COMPARA, the
    Portuguese-English parallel translation corpus", paper presented at
    CULT'2000, to appear in a volume of selected contributions, St.Jerome,
    http://www.linguateca.pt/Diana/download/FrankenbergSantos.rtf
    http://www.linguateca.pt/Diana/download/FrankenbergSantos.ps

    The service we occasionally provide (NB! only when one of the languages is
    Portuguese!!! -- to be fair, we have so far only tried English-Portuguese
    and Norwegian-Portuguese pairs...) is to accept texts in text-only format
    (e.g., TEXT1.po and TEXT1.en) already aligned by paragraph (this means one
    paragraph per line in each text), submit them to EasyAlign, and send the
    output back sentence-aligned. (Paragraphs can of course be titles or other
    things.) I've prepared an example of text input and text output for those
    interested in the service at
    http://acdc.linguateca.pt/example_alignment.html. (Note that it has to
    involve Portuguese as one of the languages.)
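
    For anyone preparing such files, here is a minimal Python sketch of a
    sanity check one might run before sending them. It is not part of our
    setup or of EasyAlign; the function names, the "empty on one side"
    heuristic and the default encoding are all assumptions made for
    illustration. It only checks the one-paragraph-per-line convention
    described above: both files should have the same number of lines.

```python
from pathlib import Path


def check_paragraph_alignment(source_text, target_text):
    """Return a list of problems; an empty list means the texts look aligned.

    Assumes one paragraph per line in each text (hypothetical helper).
    """
    source = source_text.splitlines()
    target = target_text.splitlines()
    problems = []

    # Paragraph-aligned texts must have the same number of lines.
    if len(source) != len(target):
        problems.append(
            f"paragraph count differs: {len(source)} vs {len(target)}"
        )

    # Illustrative heuristic: a line that is blank on one side only
    # often means a paragraph break slipped in by accident.
    for number, (src, tgt) in enumerate(zip(source, target), start=1):
        if bool(src.strip()) != bool(tgt.strip()):
            problems.append(f"line {number}: paragraph empty on one side only")

    return problems


def check_files(source_path, target_path, encoding="utf-8"):
    """Same check for two files, e.g. TEXT1.po and TEXT1.en.

    The encoding is an assumption; adjust it to match your files.
    """
    return check_paragraph_alignment(
        Path(source_path).read_text(encoding=encoding),
        Path(target_path).read_text(encoding=encoding),
    )
```

    Calling check_files("TEXT1.po", "TEXT1.en") would return an empty list
    when the two files have matching paragraph lines, and a list of messages
    otherwise.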

    However, I would warmly encourage people to actually use the IMS-CWB
    themselves and create their own Web services. The advantages of using the
    query power (also in translation corpora) are tremendous.

    Diana
    ************************************************************************
    Diana Santos Computational processing of Portuguese

    SINTEF Telecom & Informatics Tel. (direct line) +47 22 06 73 12
    Forskningsveien 1 Tel. +47 22 06 73 00
    Box 124 Blindern Fax. +47 22 06 73 50
    N-0314 Oslo Email: Diana.Santos@sintef.no
    Norway http://www.portugues.mct.pt/
    ************************************************************************


