Corpora: Re: Arabic vs Spanish diacritics

From: Steven Krauwer (Steven.Krauwer@let.uu.nl)
Date: Tue Apr 24 2001 - 00:23:46 MET DST


    Tim Buckwalter wrote:

    > The big difference between Arabic and accented languages such as Spanish
    > in this regard is that accent-less Spanish is probably sub-standard or
    > at least informal orthography. Whereas it is the norm for an entire
    > formal Arabic newspaper to have only a dozen or so thoughtfully-placed
    > short vowels & diacritics, an unaccented Spanish newspaper would be hard
    > to imagine (I've never seen one, at least), or one with accents placed
    > only where there is not enough context to know what is intended.

    So, the picture is (in a very black and white version): the
    Spanish have fewer diacritics (both types and tokens) but use
    them virtually all the time, and the Arabs have a lot more of
    them, but they hardly ever use them.
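
The "types and tokens" contrast can be made concrete with a small counting sketch. This is only an illustration: it assumes that "diacritic" can be approximated by Unicode nonspacing combining marks (category Mn) after NFD decomposition, which also counts the tilde of Spanish ñ and would miss any mark that is not encoded as a combining character. The function name `diacritic_profile` and the sample strings are made up for this example.

```python
import unicodedata
from collections import Counter


def diacritic_profile(text: str) -> tuple[int, int]:
    """Return (types, tokens) of combining marks found in text.

    Assumption: a diacritic is any character of Unicode category Mn
    that appears once the text has been decomposed with NFD.
    """
    decomposed = unicodedata.normalize("NFD", text)
    marks = Counter(ch for ch in decomposed if unicodedata.category(ch) == "Mn")
    return len(marks), sum(marks.values())


# Spanish sample: tilde on ñ, acute accents on ó and á.
spanish = "El niño comió rápido"

# Arabic sample written with escapes: kaf, teh, beh, each followed by fatha.
arabic_vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
# The same word as it would normally be printed, without short vowels.
arabic_plain = "\u0643\u062A\u0628"

for label, sample in [
    ("Spanish", spanish),
    ("Arabic, vocalized", arabic_vocalized),
    ("Arabic, unvocalized", arabic_plain),
]:
    types, tokens = diacritic_profile(sample)
    print(f"{label}: {types} types, {tokens} tokens")
```

Run over a real corpus instead of these toy strings, the same two numbers would give a rough measure of how many distinct marks each orthography has and how often they actually appear in running text.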

    I have three questions:
    - Does this difference have any measurable effect on the
      learning process (for native speakers who learn to read
      and write)?
    - Does it have a measurable effect on parsing and processing
      by humans?
    - Does it have a measurable effect on NLP?

    Any pointers to any empirical data?

    I realize that we are now really moving away from this list's
    core business, so I'll be happy to continue this discussion
    somewhere else if people prefer that.

    [ One place to go could be the email list elsnet-arabic@elsnet.org
      that we have just set up for discussing Arabic NLP and Speech
      processing issues, but that hasn't been officially launched
      yet. Subscription is already open at
      http://utrecht.elsnet.org/subscriptions.html ]

    Steven


