[Corpora-List] Aligner for ParaConc? - summary

From: Sampo Nevalainen (samponev@cc.joensuu.fi)
Date: Tue Sep 03 2002 - 11:20:13 MET DST

    Dear all,

    Some time ago I asked for an aligner that could be used with ParaConc. I
    got two replies and a request for a summary. Unfortunately, I do not have
    time for a proper summary; instead, I have attached the original message and
    the replies I received. I would like to thank Martin Wynne
    and Raphael Salkie for their assistance.

    By the way, after sending my request I learned about a free TM
    program called Wordfast. It is fully integrated into MS Word, and
    it seems to me an excellent tool, considering that it is freeware. (This is
    not a paid advertisement, just my personal opinion!) Wordfast has an
    add-on called +Tools, which includes an aligner, also based on MS Word. The
    aligner automates some things that you would otherwise do manually in Word
    (such as breaking text into sentences and line numbering), but I am afraid
    the alignment method is not very intelligent: a lot of work must still be
    done manually. However, it is one possibility worth mentioning. And, for other
    corpus fans and enthusiasts, Wordfast also comes with a fairly fast but
    modest concordancer :-) Both Wordfast and +Tools can be downloaded
    from the following URL: http://www.champollion.net/

    sincerely,
    sampo

    The original message below:
    -------------------------------------------------------------------------------------------------------------------------------
    I wonder if there are any (freely available) alignment tools to be used with
    ParaConc? That is, the aligner should let users save the original and
    target texts into separate files. I know there is an aligner in the WS
    Tools pack, but for some reason the program tends to "re-join" the
    sentences you already "un-joined"... Well, you can use the WSTools Aligner
    if you get the job done at once, in one go, without saving and re-opening
    the files. (I don't know whether it's my fault - I cannot use the program
    correctly - or there's a bug in the prog.) I also know there are alignment
    tools for "filling up" translation memories (e.g. Trans Suite 2000 Align,
    which is distributed freely), but they seem not to have an option of saving
    the source and the target texts into separate files. OK, I could save the
    output file as a text file with a separator between the segments, then open
    it in Excel using these separators as column separators, and, finally, save
    each column as a separate text file... but this makes a simple task too
    complicated, IMHO. So, could someone help me find an aligner
    (preferably Windows GUI, to be used in a classroom) that would simply split
    the texts into sentences and let the user correct the alignment by joining
    and unjoining sentences? The program should then save the files into
    separate (ascii) text files. Many thanks in advance for your tips and advice!
    ----------------------------------------------------------------------------------------------------------------------------------
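
The Excel workaround described in the message above (one aligned file with a separator, split into one file per column) can also be scripted. The following is only a minimal Python sketch under stated assumptions: the aligner's export is assumed to hold one aligned pair per line with a tab between source and target, and the function names and file names are hypothetical, not part of any tool mentioned here.

```python
from pathlib import Path


def split_aligned(lines, sep="\t"):
    """Split 'source<sep>target' lines into two parallel lists.

    Assumption: one aligned pair per line, separated by `sep`.
    Blank lines are skipped.
    """
    source, target = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # partition() splits at the first separator only
        left, _, right = line.partition(sep)
        source.append(left.strip())
        target.append(right.strip())
    return source, target


def write_parallel(aligned_path, source_path, target_path,
                   sep="\t", encoding="utf-8"):
    """Write the two columns to separate text files, one segment per line."""
    with open(aligned_path, encoding=encoding) as fh:
        source, target = split_aligned(fh, sep)
    Path(source_path).write_text("\n".join(source) + "\n", encoding=encoding)
    Path(target_path).write_text("\n".join(target) + "\n", encoding=encoding)
```

A call such as write_parallel("aligned.txt", "english.txt", "german.txt") would then produce the two files; whether a given concordancer accepts them as they are depends on the segment delimiter it expects.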

    -----------------------------------------------------------------------------------------------------
    From: Martin Wynne <martin.wynne@ota.ahds.ac.uk>
    To: "'Sampo Nevalainen'" <samponev@cc.joensuu.fi>
    -----------------------------------------------------------------------------------------------------
    I have used a simple Perl aligner written by Pernilla Danielsson and Daniel
    Ridings. When I taught with Pernilla on a course at the Tuscan Word Centre
    we used this program (which she calls the "vanilla aligner") to align texts
    specifically to use with ParaConc, so I know it can do this job. We may
    have done a bit of tweaking on the output. You can contact her at
    pernilla@ccl.bham.ac.uk.
    best,
    Martin

    -----------------------------------------------------------------------------------------------------------------
    From: R.M.Salkie@bton.ac.uk
    To: samponev@cc.joensuu.fi
    ------------------------------------------------------------------------------------------------------------------
    I've been struggling with the same problem, including using Trans Suite
    2000 Align. I don't have a good answer, just two suggestions.
    Firstly, it's possible to use the replace function in Word using the output
    of Trans Suite, saved in TMX format. This is what a typical pair of
    sentences looks like:
    <tu
    creationdate="20020723T151150Z"
    creationid="TS2!ALIGN"
    changedate="20020723T151150Z"
    >
    <tuv lang="EN-GB">
    <seg>World consumption has expanded at an unprecedented pace over the 20th
    century, with private and public consumption expenditures reaching $24
    trillion in 1998, twice the level of 1975 and six times that of 1950. </seg>
    </tuv>
    <tuv lang="DE-DE">
    <seg>Der weltweite Konsum hat sich im Verlauf des 20. Jahrhundert in
    beispiellosem Tempo ausgeweitet. 1998 erreichen die privaten und
    öffentlichen Konsumausgaben 24 Billionen Dollar, sie sind damit doppelt so
    hoch wie 1975 und sechsmal so hoch wie 1950. </seg>
    </tuv>
    </tu>
    The aim is to remove all the English sentences, leaving the German ones in
    place. Load the document into Word, choose "Replace", then tick "use
    wildcards". In the "Find what" box paste in:
    \<tuv lang="EN-GB"\>*\</tuv\>
    (Notice that the < and > characters need a backslash before them so that
    Word does not treat them as wildcards). If you choose "Replace all", this
    will now delete all the English sentences. Then use "save as" to save the
    file as German only. To create the English file, do the same thing to the
    original file but change the language code in the "Find what" box to
    "DE-DE". You can then use some similar techniques to remove the remaining
    XML codes and the creation dates. I realise that this is even more
    elaborate than your suggestion of using Excel, but it's something that
    students could perhaps manage. I agree entirely that it would be better if
    students didn't have to do this.
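
For anyone who can run a script, the same separation can be sketched in a few lines of Python instead of Word's wildcard replace. This is an illustration only, assuming the Trans Suite output is a well-formed TMX file whose tuv elements carry the language code in a lang (or xml:lang) attribute, as in the sample above; the function name tmx_segments and the file name in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree's expanded name for the xml:lang attribute
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_segments(root, lang):
    """Return one string per <tu> element, holding its <seg> text for `lang`.

    A translation unit without that language yields an empty string, so
    the lists for two languages stay parallel.
    """
    segments = []
    for tu in root.iter("tu"):
        text = ""
        for tuv in tu.iter("tuv"):
            code = tuv.get("lang") or tuv.get(XML_LANG) or ""
            seg = tuv.find("seg")
            if code.upper() == lang.upper() and seg is not None:
                # join wrapped lines and collapse runs of whitespace
                text = " ".join("".join(seg.itertext()).split())
        segments.append(text)
    return segments
```

With root = ET.parse("aligned.tmx").getroot(), the lists tmx_segments(root, "EN-GB") and tmx_segments(root, "DE-DE") can each be written out one segment per line, which also removes the XML codes and creation dates in the same pass.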

    Suggestion 2: Write to Mike Barlow and suggest that he adds to ParaConc the
    ability to handle files which are in this typical translation memory format
    where the source and target sentences are in pairs. Presumably this is a
    simpler task for a computer programme than relating texts in two separate
    files: as long as the computer knows which is the source language, then it
    would have to produce the sentence (or KWIC) containing the source word,
    along with the sentence which follows. For searches in the target language
    it would be the sentence that precedes. I couldn't write a programme to do
    this, but I think a programmer could. I hope that someone comes up with a
    better solution, and I'd be grateful if you could publicise anything useful.
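
The lookup described in Suggestion 2 can be illustrated with a short Python sketch. It is not a description of how ParaConc works or might work; it simply assumes the translation memory has already been read into a list of (source, target) sentence pairs, and the function name paired_concordance is hypothetical.

```python
import re


def paired_concordance(pairs, word, search_target=False):
    """Return every (source, target) pair whose searched side contains `word`.

    pairs: list of (source_sentence, target_sentence) tuples.
    search_target: search the target side instead of the source side.
    Matching is case-insensitive and on whole words only.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for source, target in pairs:
        searched = target if search_target else source
        if pattern.search(searched):
            hits.append((source, target))
    return hits
```

Because each unit already holds both sentences, the "sentence that follows" or "precedes" is just the other member of the pair, so no alignment step is needed at search time.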

    Best wishes. - Raphael
    -------------------------------------------------------------------------------------------------------------------

    ( : ============================================= : )

    Sampo Nevalainen, M.A.
    Researcher
    University of Joensuu
    Savonlinna School of Translation Studies
    P.O.Box 48
    FIN-57101 Savonlinna
    FINLAND

    tel +358-15-511 70 (operator)
             +358-15-511 7704
    fax +358-15-515 096
    email samponev@cc.joensuu.fi
    http://www.joensuu.fi/slnkvl/
