Re: Corpora: Frequency of use of accents in French texts

Pierre Isabelle (isabelle@IRO.UMontreal.CA)
Thu, 12 Feb 1998 12:34:50 -0500 (EST)

Alan F. Smeaton wrote:

>
> Would anybody out there know where I could get hold of a table of
> the relative distributions of use of accented characters in French.
> I have a collection of French newspapers and can (and have) simply
> counted these but I suspect the use of accented characters in this
> source is less than it should be as accented characters are being dropped
> for their unaccented equivalents. I'd like to measure this if possible.
>
> Replies to me and I'll post a summary, thanks.

One thing you could do is run some portion of your text through our
automatic French reaccentuation system (REACC) and see how much the
output differs from the input. The typical accuracy of REACC is
99.3%, in the sense that the mean distance between 2 incorrectly
accentuated words is about 130 words.

REACC disregards any accents that might appear in its input,
recalculating from scratch all accents that will appear in its
output. As a consequence, if you feed REACC a correctly accentuated
French text, the difference between REACC's input and REACC's output
is basically the set of reaccentuation errors produced by the
system. And as mentioned above there is typically one such difference
for every 130 words of text.

If, as you suspect, your text is missing part of its accents, the
number of differences between REACC's input and REACC's output will
become (much) larger.
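
The comparison described above could be sketched roughly as follows.
This is only an illustration, not part of REACC: it assumes you have
already obtained REACC's output as a string by some other means, that
both texts split into the same sequence of words, and the names
BASELINE_RATE, strip_accents and accent_difference_rate are
hypothetical helpers made up for this example.

```python
import unicodedata

# Assumption: roughly one REACC error per 130 words, the figure quoted above.
BASELINE_RATE = 1 / 130


def strip_accents(word: str) -> str:
    """Remove combining marks so that 'été' and 'ete' compare equal."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_difference_rate(original: str, reaccented: str) -> float:
    """Fraction of words whose accentuation differs between the two texts."""
    src = original.split()
    out = reaccented.split()
    if len(src) != len(out):
        raise ValueError("texts must have the same number of words")
    if not src:
        return 0.0
    differing = 0
    for a, b in zip(src, out):
        # The two texts are expected to differ in accents only.
        if strip_accents(a) != strip_accents(b):
            raise ValueError(f"words differ beyond accents: {a!r} vs {b!r}")
        # Normalize to NFC so equivalent encodings are not counted as differences.
        if unicodedata.normalize("NFC", a) != unicodedata.normalize("NFC", b):
            differing += 1
    return differing / len(src)


if __name__ == "__main__":
    rate = accent_difference_rate("le eleve a ete recu", "le élève a été reçu")
    print(f"difference rate: {rate:.3f} (baseline about {BASELINE_RATE:.4f})")
```

A rate close to BASELINE_RATE would suggest the input was already
correctly accentuated; a rate well above it would suggest that accents
have been dropped from the source text.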

You can try REACC online, both from our lab's Web pages:

http://www-rali.iro.umontreal.ca/ProjetReacc.en.html

and from Alis Technologies' Web pages (Alis is our commercial partner):

http://www.alis.com/castil/reacc/index.en.html

-- 
Pierre Isabelle, RALI, DIRO
  Universite de Montreal, C.P. 6128, Succ. Centre-Ville
    Montreal, Quebec, Canada H3C 3J7
 tel: (514) 343-6161                 fax: (514) 343-2496 
e-mail: isabelle@iro.umontreal.ca    W3: http://www-rali.iro.umontreal.ca