[Corpora-List] Aligner for ParaConc? - summary

From: Sampo Nevalainen (samponev@cc.joensuu.fi)
Date: Tue Sep 03 2002 - 11:20:13 MET DST

Next message: Michael Barlow: "Re: [Corpora-List] Aligner for ParaConc? - summary"

Previous message: P bI K O B_ B.B.: "Re: [Corpora-List] [osander@gmx.de: How to extract N-grams]"
Next in thread: Michael Barlow: "Re: [Corpora-List] Aligner for ParaConc? - summary"
Reply: Michael Barlow: "Re: [Corpora-List] Aligner for ParaConc? - summary"

Dear all,

Some time ago I asked for an aligner that could be used with ParaConc. I
got two replies and a request for a summary. Unfortunately I do not have
time for a proper summary; instead, I have attached the original message and
the replies I got. I would like to thank Martin Wynne
and Raphael Salkie for their assistance.

By the way, after I had sent my request I learned about a free TM
program called Wordfast. The program is fully integrated into MS Word and
to me it seems an excellent tool, considering it is freeware. (This is
not a paid advertisement, just my personal opinion!) Wordfast has an
add-on called +Tools, which includes an aligner, also based on MS Word. The
aligner automates some things that you would otherwise do manually in Word
(such as breaking text into sentences and line numbering), but I am afraid the
aligning method is not too intelligent: a lot of work must be done manually
anyway. However, it is one possibility worth mentioning. And, for other
corpus fans and enthusiasts, Wordfast is provided with a pretty fast but
modest concordancer, too :-) Both Wordfast and +Tools can be downloaded
from the following URL: http://www.champollion.net/

sincerely,
sampo

The original message below:
-------------------------------------------------------------------------------------------------------------------------------
I wonder if there are any (freely available) alignment tools to be used with
ParaConc? That is, the aligner should let users save the original and
target texts into separate files. I know there is an aligner in the WS
Tools pack, but for some reason the program tends to "re-join" the
sentences you already "un-joined"... Well, you can use the WSTools Aligner
if you get the job done at once, in one go, without saving and re-opening
the files. (I don't know whether it's my fault - I cannot use the program
correctly - or there's a bug in the prog.) I also know there are alignment
tools for "filling up" translation memories (e.g. Trans Suite 2000 Align,
which is distributed freely), but they seem not to have an option of saving
the source and the target texts into separate files. Ok, I could save the
output file as a text file with a separator between the segments, then open
it in Excel using these separators as column separators, and, finally, save
each column as a separate text file... but this makes a simple task too
complicated, IMHO. So, could someone help me find an aligner
(preferably Windows GUI, to be used in a classroom) that would simply split
the texts into sentences and let the user correct the alignment by joining
and unjoining sentences? The program should then save the files into
separate (ascii) text files. Many thanks in advance for your tips and advice!
----------------------------------------------------------------------------------------------------------------------------------
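
The Excel workaround described in the message above could also be scripted.
What follows is only a minimal sketch in Python. It assumes the aligner
exports one segment pair per line with a tab between source and target; the
separator, the function names and the file names are illustrative assumptions
and not features of any tool mentioned in this thread.

```python
from pathlib import Path


def split_aligned(lines, sep="\t"):
    """Split 'source<sep>target' lines into two parallel lists."""
    source, target = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first separator only; a missing separator
        # leaves the target side empty.
        left, _, right = line.partition(sep)
        source.append(left.strip())
        target.append(right.strip())
    return source, target


def write_parallel(source, target, src_path, tgt_path, encoding="utf-8"):
    """Write one segment per line so that line N of each file corresponds."""
    Path(src_path).write_text("\n".join(source) + "\n", encoding=encoding)
    Path(tgt_path).write_text("\n".join(target) + "\n", encoding=encoding)
```

Calling split_aligned on the lines of the exported file and passing the two
lists to write_parallel would give two plain text files with one segment per
line, in matching order.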

-----------------------------------------------------------------------------------------------------
From: Martin Wynne <martin.wynne@ota.ahds.ac.uk>
To: "'Sampo Nevalainen'" <samponev@cc.joensuu.fi>
-----------------------------------------------------------------------------------------------------
I have used a simple Perl aligner written by Pernilla Danielsson and Daniel
Ridings. When I taught with Pernilla on a course at the Tuscan Word Centre
we used this program (which she calls the "vanilla aligner") to align texts
specifically to use with ParaConc, so I know it can do this job. We may
have done a bit of tweaking on the output. You can contact her on
pernilla@ccl.bham.ac.uk.
best,
Martin

-----------------------------------------------------------------------------------------------------------------
From: R.M.Salkie@bton.ac.uk
To: samponev@cc.joensuu.fi
------------------------------------------------------------------------------------------------------------------
I've been struggling with the same problem, including using Trans Suite
2000 Align. I don't have a good answer, just two suggestions.
Firstly, it's possible to use the replace function in Word using the output
of Trans Suite, saved in TMX format. This is what a typical pair of
sentences looks like:
<tu
creationdate="20020723T151150Z"
creationid="TS2!ALIGN"
changedate="20020723T151150Z"
>
<tuv lang="EN-GB">
<seg>World consumption has expanded at an unprecedented pace over the 20th
century, with private and public consumption expenditures reaching $24
trillion in 1998, twice the level of 1975 and six times that of 1950. </seg>
</tuv>
<tuv lang="DE-DE">
<seg>Der weltweite Konsum hat sich im Verlauf des 20. Jahrhundert in
beispiellosem Tempo ausgeweitet. 1998 erreichen die privaten und
öffentlichen Konsumausgaben 24 Billionen Dollar, sie sind damit doppelt so
hoch wie 1975 und sechsmal so hoch wie 1950. </seg>
</tuv>
</tu>
The aim is to remove all the English sentences, leaving the German ones in
place. Load the document into Word, choose "Replace", then tick "use
wildcards". In the "Find what" box paste in:
\<tuv lang="EN-GB"\>*\</tuv\>
(Notice that the < and > characters need a backslash before them so that
Word does not treat them as wildcards). If you choose "Replace all", this
will now delete all the English sentences. Then use "save as" to save the
file as German only. To create the English file, do the same thing to the
original file but change the language code in the "Find what" box to
"DE-DE". You can then use some similar techniques to remove the remaining
XML codes and the creation dates. I realise that this is even more
elaborate than your suggestion of using Excel, but it's something that
students could perhaps manage. I agree entirely that it would be better if
students didn't have to do this.
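
For readers who prefer a script to the Word procedure, the same separation
can be sketched in Python. This is only an illustration under stated
assumptions: the file is taken to be well-formed XML, the language code is
read from either a "lang" or an "xml:lang" attribute, and the enclosing
<tmx><body> elements, the shortened sample text and the function name are
assumptions rather than details confirmed for Trans Suite output. If a unit
lacks one of the languages, the resulting lists will no longer line up.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def segments_by_language(tmx_text):
    """Return {language code: [segment, ...]} with units kept in file order."""
    root = ET.fromstring(tmx_text)
    result = {}
    for tu in root.iter("tu"):
        for tuv in tu.iter("tuv"):
            lang = tuv.get("lang") or tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang is None or seg is None:
                continue
            # Collapse the line breaks inside a segment into single spaces.
            text = " ".join("".join(seg.itertext()).split())
            result.setdefault(lang, []).append(text)
    return result


# Shortened sample in the shape quoted above (wrapper elements assumed).
SAMPLE = """<tmx><body>
<tu>
<tuv lang="EN-GB"><seg>World consumption has expanded. </seg></tuv>
<tuv lang="DE-DE"><seg>Der weltweite Konsum hat sich
ausgeweitet. </seg></tuv>
</tu>
</body></tmx>"""

by_lang = segments_by_language(SAMPLE)
```

Each list in by_lang can then be written to its own text file, one segment
per line, which is the form ParaConc is described as needing in this thread.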

Suggestion 2: Write to Mike Barlow and suggest that he adds to ParaConc the
ability to handle files which are in this typical translation memory format
where the source and target sentences are in pairs. Presumably this is a
simpler task for a computer programme than relating texts in two separate
files: as long as the computer knows which is the source language, then it
would have to produce the sentence (or KWIC) containing the source word,
along with the sentence which follows. For searches in the target language
it would be the sentence that precedes. I couldn't write a programme to do
this, but I think a programmer could. I hope that someone comes up with a
better solution, and I'd be grateful if you could publicise anything useful.
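
The lookup described in Suggestion 2 can be sketched roughly as follows,
assuming the translation memory has already been read into a list of
(source, target) sentence pairs. This is a hypothetical illustration in
Python, not a description of how ParaConc works; the function name and the
whole-word, case-insensitive matching are assumptions.

```python
import re


def search_pairs(pairs, word, in_source=True):
    """Yield (hit, translation) for every pair whose chosen side contains word."""
    # Match the word as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for source, target in pairs:
        # Searching the source returns the sentence that follows it in the
        # pair; searching the target returns the one that precedes it.
        hit, other = (source, target) if in_source else (target, source)
        if pattern.search(hit):
            yield hit, other
```

Because the pairs are already aligned, no alignment step is needed at search
time: each hit simply carries its partner sentence with it.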

Best wishes. - Raphael
-------------------------------------------------------------------------------------------------------------------

( : ============================================= : )

Sampo Nevalainen, M.A.
Researcher
University of Joensuu
Savonlinna School of Translation Studies
P.O.Box 48
FIN-57101 Savonlinna
FINLAND

tel +358-15-511 70 (operator)
+358-15-511 7704
fax +358-15-515 096
email samponev@cc.joensuu.fi
http://www.joensuu.fi/slnkvl/


This archive was generated by hypermail 2b29 : Tue Sep 03 2002 - 11:30:32 MET DST