Re: [Corpora-List] language-specific harvesting of texts from the Web

From: Marco Baroni (baroni@einstein.sslmit.unibo.it)
Date: Tue Aug 31 2004 - 18:51:46 MET DST


    > One situation where your approach may not work so well, is when a
    > language's websites use multiple character encodings. Unfortunately,
    > this is quite common in languages that have non-Roman writing systems,

    At least for Japanese, our way to get around this problem in our
    web-mining scripts was to look for the charset declaration in the html
    code of each page, and then to convert (inside the script) the page from
    that charset to utf8.
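
A minimal sketch of that approach, for illustration only: it is written in Python rather than taken from the original scripts, and the names `declared_charset` and `to_utf8`, the 4096-byte scan window and the utf-8 fallback are all assumptions. It looks for a charset declaration in a `<meta>` tag near the top of the page, decodes the raw bytes with that charset, and re-encodes them as utf8.

```python
import re

# Matches <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def declared_charset(raw, scan_limit=4096):
    """Return the charset named in the page's meta tag, or None.

    Only the first `scan_limit` bytes are searched (an assumed cutoff),
    since the declaration normally sits in the <head>.
    """
    match = CHARSET_RE.search(raw[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def to_utf8(raw, fallback="utf-8"):
    """Decode raw page bytes with the declared charset and re-encode as utf8.

    Falls back to `fallback` when no charset is declared or when Python
    has no codec under the declared name. Undecodable bytes are replaced
    rather than raising an error.
    """
    charset = declared_charset(raw) or fallback
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")
```

This trusts the declaration in the page itself, so pages with a missing or wrong declaration would still need some other check, and a charset given only in the HTTP headers is not looked at here.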

    I would be interested in hearing about other ways to deal with multiple
    encodings.

    Btw: I thought Japanese was tough (as you can find euc-jp, shiftjis, utf8
    and iso-2022-jp), but the situation you describe for Hindi sounds like a
    true encoding nightmare!
     
    > I gave a talk at the ALLC/ACH meeting in June on our search technique,
    > including its pros and cons. The abstract was published, but not the
    > full paper. I suppose I should post it somewhere...

    Please do!

    Regards,

    Marco

    -- 
    Marco Baroni
    University of Bologna
    http://sslmit.unibo.it/~baroni
    


