RE: [Corpora-List] Some comments on aligners

From: Santos Diana (Diana.Santos@sintef.no)
Date: Sat Sep 07 2002 - 16:52:38 MET DST


    Dear Ute and Sampo, and corpora-list members in general,

    I believe my message was doubly misunderstood.

    About Ute's remark:

    I was not criticizing Sampo's question about a particular commercial
    aligner; I was suggesting that the answers to the question were more
    encompassing. In fact, Sampo's message later mentioned that he had after
    all found another program (so he was not THAT concerned with ParaConc after
    all). That's why I posted my message: to give other people on the list the
    idea that there are other, more powerful tools out there. It was not to
    criticize people for asking specific questions.

    About Sampo's answer, and this is where I'm most sorry for not having been
    understood: I was not discussing the use of IMS-CWB for large projects,
    where I think its advantages are uncontroversial.

    On the contrary, I was suggesting (and was actually hoping to have
    demonstrated) that it was also the best way to proceed for cases like the
    ones discussed by Sampo: some students working with their own corpora.

    I tried to explain that it was easy to set up a Web service that would align
    texts for the users, let them revise the alignment if needed, and then show
    the texts in an easy, user-friendly, and platform-independent way using a
    Web interface to the IMS-CWB, as we do for COMPARA.

    I was not suggesting that everyone who wanted to look at parallel corpora
    had to copy and devise a system such as COMPARA, which was designed from the
    beginning for a wide range of users and to be made publicly available.

    Rather, I was suggesting another kind of service (incidentally, that we are
    also planning to offer at Linguateca, a distributed resource center for
    Portuguese, at the forthcoming pole in Porto, with Belinda Maia), namely,
    the possibility of having different students and researchers working on
    their own corpora with a common infrastructure.

    If you have ONE student, it may be just as easy for you to tell him to go
    and fetch a Windows-based program with limited functionality, etc. But if
    you have more than one student or user, it would be to your advantage that
    all of them use the same tools and input the texts the same way so that you
    could even reuse (or at least look at) the texts they are using, all with
    the same Web functionality. (Even if for copyright reasons it would have to
    be password protected, that is a straightforward matter...)

    So, that was what I was proposing: set up a simple service based on IMS-CWB
    that aligns the texts and displays them with a Web interface, which the
    users can then access from wherever they are. (Then it would be up to you to
    define what is a "concordancer that would be relatively simple in use and
    not too picky with texts to be used as a corpus". My experience is that the
    second criterion is already met by the IMS-CWB, for we have used large
    amounts of all kinds of Portuguese text in the AC/DC project,
    http://acdc.linguateca.pt/acesso/info_acesso_English.html.)
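
    As a minimal sketch of the kind of pre-check such a service might run
    before handing texts to the aligner: the input format mentioned in the
    quoted message below is one paragraph per line in each text, so the two
    files should have the same number of lines. The function names here are
    hypothetical, and skipping blank lines is an assumption of this sketch,
    not something EasyAlign or the IMS-CWB is known to require.

```python
from typing import List, Tuple


def read_paragraphs(text: str) -> List[str]:
    """Return the paragraphs of a text-only file, one per non-blank line.

    Assumption: blank lines are not paragraphs and are ignored.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def check_paragraph_alignment(source: str, target: str) -> Tuple[bool, str]:
    """Check that two texts have the same number of paragraphs.

    This only compares paragraph counts; it does not inspect the content
    and does not perform any sentence alignment itself.
    """
    src = read_paragraphs(source)
    tgt = read_paragraphs(target)
    if len(src) != len(tgt):
        return False, f"paragraph counts differ: {len(src)} vs {len(tgt)}"
    return True, f"{len(src)} paragraph pairs"


if __name__ == "__main__":
    # Toy stand-ins for the contents of TEXT1.po and TEXT1.en.
    portuguese = "Um título\nPrimeiro parágrafo.\n"
    english = "A title\nFirst paragraph.\n"
    print(check_paragraph_alignment(portuguese, english))
```

    A check like this would let the service reject a mismatched pair of
    texts with a clear message before any alignment is attempted.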

    I won't be bothering the list with further technical details...

    Thank you Ute and Sampo for your answers so that I could have another go at
    this subject :-)
    Diana

    > -----Original Message-----
    > From: Ute Römer [mailto:ute.roemer@uni-koeln.de]
    > Sent: 7. september 2002 11:46
    > To: corpora@hd.uib.no
    > Subject: Re: [Corpora-List] Some comments on aligners
    >
    >
    > Dear all,
    >
    > Some weekend thoughts on Corpora List discussions -- in reply to Diana
    > Santos' recent posting.
    >
    > I was just wondering, is it really "a waste of time" to discuss -- on an
    > email list the purpose of which it is, or ought to be, to exchange ideas on
    > certain specific topics and to help people solve corpus linguistic
    > problems -- special software tools, their use, and problems you encounter
    > while using them? And does it make a difference then whether the tools in
    > question are freely available or not? What's wrong with explicitly asking
    > for help with a certain program like Sampo Nevalainen did? I actually do not
    > very much like the idea of having to think twice before sending queries on
    > commercially available corpora and corpus analysis tools to the list and I
    > suspect that other list members might feel the same.
    >
    > Have a good weekend all of you!
    >
    > Best,
    > Ute
    >
    >
    > ----- Original Message -----
    > From: "Santos Diana" <Diana.Santos@sintef.no>
    > To: <corpora@hd.uib.no>
    > Sent: Thursday, September 05, 2002 1:04 PM
    > Subject: [Corpora-List] Some comments on aligners
    >
    >
    > > Dear colleagues,
    > >
    > > It sounds to me somehow a waste of time and resources to be discussing
    > > aligners for a particular commercial application such as ParaConc in this
    > > list (I know that was the initial question...), given that there are so many
    > > other systems that may cater for better functionalities of search in
    > > parallel corpora and which are moreover free and already existing.
    > >
    > > So, after some reflection, I decided, to prevent some naive readers of the
    > > list to conclude that the only existing aligners were the ones discussed in
    > > the previous mail thread, to talk about our approach in COMPARA, basically
    > > to suggest to anyone involved in parallel corpora work to use
    > >
    > > 1) the IMS Corpus Workbench developed at Stuttgart (Stefan Evert and Ulrich
    > > Heid)
    > > 2) and the EasyAlign aligner that comes with it and has all the
    > > functionalities that have been described in the previous mails (namely it
    > > aligns, or accepts a previous alignment, so that one can easily incorporate
    > > the results of manual revision into a powerful corpus querying system)
    > >
    > > For those that would complain that the system is in Unix / Linux and
    > > therefore not usable for naive users, the obvious solution is to create a
    > > Web frontend as we did in COMPARA, see http://www.portugues.mct.pt/COMPARA
    >
    > I'm not paid to make any advertisements for IMS-CWB nor to align texts for
    > other projects (although we do it occasionally for some people when one of
    > the languages of the parallel texts is Portuguese), but I really think after
    > careful consideration of many other systems and approaches that this is the
    > best way to go.
    >
    > People interested in technical details of exactly how the DISPARA setup
    > works can read as well, after the Web pages, the paper
    >
    > Santos, Diana. "DISPARA, a system for distributing parallel corpora on the
    > Web", in Elisabete Ranchhod & Nuno J. Mamede (eds.), Advances in Natural
    > Language Processing (Third International Conference, PorTAL 2002, Faro,
    > Portugal, June 2002, Proceedings), LNAI 2389, Springer, 2002, pp.209-218.
    >
    > and here is a soft presentation for non-technical users
    >
    > Frankenberg-Garcia, Ana & Diana Santos. "Introducing COMPARA, the
    > Portuguese-English parallel translation corpus", paper presented at
    > CULT'2000, to appear in a volume of selected contributions, St.Jerome,
    > http://www.linguateca.pt/Diana/download/FrankenbergSantos.rtf
    > http://www.linguateca.pt/Diana/download/FrankenbergSantos.ps
    >
    > The service we occasionally do (NB! only when one of the languages is
    > Portuguese!!! -- to be fair, we have so far only tried with
    > English-Portuguese and Norwegian-Portuguese pairs...) is to accept texts in
    > text-only format (eg, TEXT1.po and TEXT1.en) already aligned by paragraph
    > (this means one paragraph per line in each text), submit them to EasyAlign
    > and send the output back sentence aligned. (Paragraphs can of course be
    > titles or other things.) I've prepared an example of text input and text
    > output for those interested in the service in
    > http://acdc.linguateca.pt/example_alignment.html. (Note that it has to
    > involve Portuguese as one of the languages)
    >
    > However, I would warmly encourage people to actually use the IMS-CWB
    > themselves and create their own Web services. The advantages of using the
    > query power (also in translation corpora) are tremendous.
    >
    > Diana
    > ************************************************************************
    > Diana Santos Computational processing of Portuguese
    >
    > SINTEF Telecom & Informatics Tel. (direct line) +47 22 06 73 12
    > Forskningsveien 1 Tel. +47 22 06 73 00
    > Box 124 Blindern Fax. +47 22 06 73 50
    > N-0314 Oslo Email: Diana.Santos@sintef.no
    > Norway http://www.portugues.mct.pt/
    > ************************************************************************
    >
    >
    >
    >


