Sample Turkish and English aligned texts are now available for general
use. They have been automatically aligned (by Kursat Ince) at the
sentence level using Gale and Church's align code (Computational
Linguistics, Vol. 19, No. 1, March 1993). There may be occasional problems
due to misidentification of sentence boundaries. The Turkish text is coded
in all lower case, with the 6 upper case ASCII characters (C, G, I, O, S,
U) representing the 6 non-ASCII Turkish characters.
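
For readers who want the Turkish side in its normal orthography, the coding above could be undone with something like the following Python sketch. The exact correspondence is an assumption: the announcement does not spell it out, so the sketch maps each upper case letter to the Turkish letter it most resembles (C to ç, G to ğ, I to ı, O to ö, S to ş, U to ü). The name decode_turkish is made up for illustration and is not part of the corpus distribution.

```python
# Assumed mapping (not stated in the announcement): each upper case ASCII
# letter stands for the non-ASCII Turkish letter it most resembles.
TURKISH_DECODE = str.maketrans({
    "C": "ç",
    "G": "ğ",
    "I": "ı",  # dotless i
    "O": "ö",
    "S": "ş",
    "U": "ü",
})


def decode_turkish(text: str) -> str:
    """Replace the six upper case placeholders with Turkish letters.

    All other characters (including the ordinary lower case letters)
    are passed through unchanged.
    """
    return text.translate(TURKISH_DECODE)


if __name__ == "__main__":
    print(decode_turkish("tUrkCe"))  # prints "türkçe" under the assumed mapping
```

Because the texts are all lower case, any upper case letter other than these six would be left untouched, which makes stray characters from OCR or alignment errors easy to spot after decoding.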
Currently there are 6 parallel texts. Text 1 is a foreign ministry
press release, Texts 2 and 3 are the texts of two treaties, and
Texts 4 - 6 are sample texts OCR'ed from a journal on translation.
These can be accessed on the WWW at
http://www.cs.bilkent.edu.tr/~ko/Turklang/corpus/par-corpus/
Any corrections and suggestions are welcome.
Kemal Oflazer
ko@cs.bilkent.cs.edu.tr