Re: [Corpora-List] language-specific harvesting of texts from the Web

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Tue Aug 31 2004 - 19:45:10 MET DST

  • Next message: Sebastian Hoffmann: "Re: [Corpora-List] Searching BNC for adverbs followed by verb"

    Marco Baroni (Hi, Marco!) wrote:
    >>One situation where your approach may not work so well, is when a
    >>language's websites use multiple character encodings...
    >
    > At least for Japanese, our way to get around this problem in our
    > web-mining scripts was to look for the charset declaration in the html
    > code of each page...
    > ...
    > Btw: I thought Japanese was tough (as you can find euc-jp, shiftjis, utf8
    > and iso-2022-jp), but the situation you describe for Hindi sounds like a
    > true encoding nightmare!
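
    Marco's trick of reading the charset declaration out of the page itself
    can be sketched in a few lines of Python (a hypothetical illustration;
    real pages may also declare the charset in the HTTP Content-Type header):

```python
import re

def sniff_charset(raw_bytes, default="utf-8"):
    """Look for a charset declaration in the raw HTML, e.g.
    <meta http-equiv="Content-Type" content="text/html; charset=euc-jp">."""
    # Decode leniently just to run the regex; declarations are ASCII.
    head = raw_bytes[:2048].decode("ascii", errors="ignore")
    m = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', head, re.IGNORECASE)
    return m.group(1).lower() if m else default

page = b'<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
print(sniff_charset(page))  # shift_jis
```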

    For the websites we've looked at (mostly in South Asia, particularly
    India, and in Eritrea for Tigrinya), there are no charset declarations.
    Rather, it's all font-based: there are html tags (I forget the exact
    syntax right now, but it's something like <font face="foobar.ttf">) to
    indicate the font to use. At some sites, text in one font is embedded
    inside another, while at others you get a series of <font...> ...
    </font> spans. (Again, I can't remember the exact html tag.)

    So once you know what font a particular site is using, you have to find
    an encoding converter that someone else has written (if you're lucky),
    or write one yourself. Since the fonts are often undocumented, this is
    non-trivial.

    It would be bad enough if there were a 1-to-1 mapping between
    proprietary code points and Unicode code points. There usually isn't.

    For Indic languages, the encoding usually does not use the same
    conventions as Unicode. For example, the short 'i' in many Indic
    scripts appears in writing to the left of the consonant after which it
    is pronounced. In Unicode, the short 'i' character is after the
    consonant in the text stream (phonological order), and making it appear
    to the left of the consonant is delegated to the rendering system;
    whereas in most of the 8-bit encodings, it comes before the consonant
    in the text stream (visual order).
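
    The reordering step just described can be sketched as follows (a toy
    Python illustration; the legacy code point for short 'i' is invented,
    not taken from any real font encoding):

```python
# Converting a visual-order 8-bit encoding to Unicode's phonological order:
# a short-i vowel sign that precedes a consonant in the byte stream must be
# emitted after that consonant.
SHORT_I_LEGACY = 0xD7        # invented legacy code for the short-i sign
SHORT_I_UNICODE = "\u093F"   # DEVANAGARI VOWEL SIGN I

def reorder(codes, table):
    """Map a list of legacy byte values to a Unicode string, moving a
    prefixed short-i after the consonant it attaches to."""
    out = []
    i = 0
    while i < len(codes):
        if codes[i] == SHORT_I_LEGACY and i + 1 < len(codes):
            out.append(table[codes[i + 1]])   # consonant first...
            out.append(SHORT_I_UNICODE)       # ...then the vowel sign
            i += 2
        else:
            out.append(table[codes[i]])
            i += 1
    return "".join(out)

table = {0x41: "\u0915"}  # hypothetical: byte 0x41 renders KA
print(reorder([0xD7, 0x41], table))  # कि
```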

    Also, many of these Indic scripts, as well as Amharic and related
    scripts (such as Tigrinya), have more than 256 characters, or at least
    glyphs. Some of the latter are for conjoint consonants (where two
    adjacent consonants get written together, usually in reduced form), or
    other combined symbols. Unicode generally leaves these
    context-sensitive glyphs to the rendering system; 8-bit encodings have
    what can be best described as imaginative solutions to this problem.
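
    To make the division of labor concrete: where an 8-bit font encoding
    spends a single code point on a pre-composed conjunct glyph, Unicode
    spells out consonant + virama + consonant and leaves the joining to the
    renderer. The legacy code point below is invented for illustration:

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, which marks consonant joining
CONJUNCTS = {
    # Hypothetical: one legacy byte stands for the whole KA+TA conjunct.
    0xE1: "\u0915" + VIRAMA + "\u0924",
}

def expand(code):
    """Expand a legacy conjunct code to its Unicode character sequence."""
    return CONJUNCTS.get(code, "")

print(expand(0xE1))  # क्त (rendered as a single joined glyph)
```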

    I might also add that there are some very nice looking websites for
    certain Southeast Asian languages (I forget which ones). Unfortunately,
    looking is all you can do: the web pages are giant GIFs, so there are
    no text characters to extract.

    >>I gave a talk at the ALLC/ACH meeting in June on our search technique,
    >>including its pros and cons. The abstract was published, but not the
    >>full paper. I suppose I should post it somewhere...
    >
    > Please do!

    I'll see what I can do...

    -- 
    	Mike Maxwell
    	Linguistic Data Consortium
    	maxwell@ldc.upenn.edu
    



    This archive was generated by hypermail 2b29 : Tue Aug 31 2004 - 19:39:40 MET DST