Corpora: Frequency of use of accents in French texts (clarification)

asmeaton@compapp.dcu.ie
Thu, 12 Feb 1998 17:19:05 GMT

Folks

Yesterday I posted a request for information on the use of accented characters
in French. After a few replies, I think I need to clarify my message a bit.

I have a corpus of Swiss-French news agency reports, and I've taken a sample
of about 86 million characters from it and counted character occurrences
as follows:

e - 7.67M a - 4.44M i - 4.06M o - 2.96M u - 2.86M

e-aigu - 1.3M a-grave - 0.22M *ALL* other accented characters - 0.28M
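
Taken at face value, the figures above give 1.3M + 0.22M + 0.28M = 1.8M accented characters out of roughly 86M, or about 2.1% of the sample. The following is a minimal sketch of how such a count could be reproduced on any text; it is not the procedure used for the figures above. The function names and the sample sentence are illustrative assumptions, and a character is treated as "accented" if it decomposes into a base letter plus a combining mark.

```python
import unicodedata
from collections import Counter


def is_accented(ch: str) -> bool:
    """True if ch decomposes into a base letter plus a combining mark.

    Ligatures such as 'oe' do not decompose this way and are not counted.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    return len(decomposed) > 1 and any(unicodedata.combining(c) for c in decomposed)


def count_letters(text: str) -> Counter:
    """Count alphabetic characters, lower-cased, keeping accented forms distinct."""
    # Normalise to NFC first so that each accented letter is a single character.
    text = unicodedata.normalize("NFC", text).lower()
    return Counter(ch for ch in text if ch.isalpha())


def accented_share(counts: Counter) -> float:
    """Fraction of counted letters that carry an accent."""
    total = sum(counts.values())
    accented = sum(n for ch, n in counts.items() if is_accented(ch))
    return accented / total if total else 0.0


# Share implied by the figures quoted in the post (in millions of characters).
reported_accented = 1.30 + 0.22 + 0.28
reported_share = reported_accented / 86

# Illustrative sample sentence (an assumption, not taken from the corpus).
sample = "Le président a été élu à Genève"
counts = count_letters(sample)
sample_share = accented_share(counts)
```

Note that this sketch divides by letters only, whereas the 86M figure presumably includes spaces and punctuation, so the two shares are not directly comparable.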

This struck me as rather a small percentage for accented characters, so I took the
word "pre/sident" (where e/ stands for e-aigu) and found 48,677 occurrences, while
for the word "president" I found 9,644 occurrences. Initially this struck me as
evidence of "lazy" journalism, though somebody pointed out that these are two
different words. On inspecting a couple of pages of these occurrences I do indeed
find examples of "le president des Etats-Unis" (which is correct), but I also find
"le president Francois Mitterrand" (which is incorrect), and I don't want to have
to count all of these.
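
On the figures quoted, the unaccented form accounts for 9,644 of 58,321 occurrences, or about 16.5%. A minimal sketch of how the accented and unaccented variants of a word could be tallied automatically is given below. The function names and the sample text are illustrative assumptions, and the sketch only groups spellings; it cannot tell a legitimate unaccented word from a dropped accent.

```python
import re
import unicodedata
from collections import Counter


def strip_accents(word: str) -> str:
    """Remove combining marks, e.g. the acute accent on e-aigu."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def variant_counts(text: str, base: str) -> Counter:
    """Count every spelling in text that reduces to `base` once accents are removed."""
    text = unicodedata.normalize("NFC", text).lower()
    # Runs of letters only: word characters minus digits and underscore.
    tokens = re.findall(r"[^\W\d_]+", text)
    return Counter(t for t in tokens if strip_accents(t) == base)


# Illustrative sample text (an assumption, not taken from the corpus).
sample = (
    "Le président Mitterrand et le president des Etats-Unis ; "
    "le Président parle."
)
variants = variant_counts(sample, "president")

# Proportion of unaccented occurrences implied by the counts in the post.
unaccented_ratio = 9644 / (48677 + 9644)
```

Deciding which unaccented occurrences are actually errors would still need context, for example the words on either side of each hit.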

I have no basis for my intuition that accented characters occur less often than
expected, except that I do remember a similar situation arising for a
corpus of Mexican Spanish newspaper texts, where "lazy" journalism led to the dropping
of accents, and I wondered whether the same situation was true here.

So my question is this: does anybody know whether the relative numbers of occurrences
of accented characters shown above are normal?

The reason I'm chasing this information is that I am evaluating an information
retrieval application based on the shapes of words and letters where the accented
characters and the letter "i" all have the same shape.
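
To illustrate why the accent frequency matters for such an application, here is a purely hypothetical sketch of a character shape code. The class symbols, the letter groupings and the function name are all assumptions made for illustration and are not the scheme used in the application described above; the only property taken from the post is that accented characters and "i" fall into the same class.

```python
import unicodedata

# Hypothetical shape classes (assumed for illustration only).
ASCENDERS = set("bdfhklt")   # letters rising above x-height
DESCENDERS = set("gjpqy")    # letters dropping below the baseline


def shape_code(word: str) -> str:
    """Map each character of a word to a coarse shape class."""
    codes = []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        has_mark = any(unicodedata.combining(c) for c in decomposed)
        if has_mark or ch == "i":
            codes.append("i")   # accented letters share a class with "i"
        elif ch.isupper() or ch in ASCENDERS:
            codes.append("A")   # tall characters
        elif ch in DESCENDERS:
            codes.append("g")   # characters with a descender
        else:
            codes.append("x")   # plain x-height characters
    return "".join(codes)


accented = shape_code("président")
unaccented = shape_code("president")
```

Under these assumed classes the two spellings receive different codes, so a dropped accent would change the shape of the word as seen by the retrieval system.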

Thanks for helping

- Alan Smeaton