Re: [Corpora-List] language-specific harvesting of texts from the Web

From: Stuart A Yeates (stuart.yeates@computing-services.oxford.ac.uk)
Date: Wed Sep 01 2004 - 09:55:11 MET DST


    Marco Baroni wrote:
    >>One situation where your approach may not work so well, is when a
    >>language's websites use multiple character encodings. Unfortunately,
    >>this is quite common in languages that have non-Roman writing systems,
    >
    >
    > At least for Japanese, our way to get around this problem in our
    > web-mining scripts was to look for the charset declaration in the html
    > code of each page, and then to convert (inside the script) the page from
    > that charset to utf8.
    >
    > I would be interested in hearing about other ways to deal with multiple
    > encodings.

    textcat (http://odur.let.rug.nl/~vannoord/TextCat/) is an open-source
    language and encoding guesser. It reliably guesses a text's language and
    encoding based solely on example texts and statistics, and it knows 69
    natural languages.

    I've had good experience using the built-in Java encoding converters
    (readers and writers shipped for ~100 encodings as standard) to convert
    between encodings. They are freely available.
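
    As a rough sketch of what that looks like, combined with Marco's
    approach of reading the charset declaration: the class below looks for
    a "charset=..." declaration near the top of a page and then uses the
    standard readers and writers to re-encode the page as UTF-8. The class
    name ToUtf8, the method names, the regular expression and the 2048-byte
    look-ahead are all my own assumptions for illustration, not anything
    from Marco's scripts, and a real harvester would also want to check the
    HTTP headers and cope with wrong or missing declarations.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToUtf8 {
    // Hypothetical pattern: matches charset=NAME, with or without quotes.
    private static final Pattern CHARSET_DECL = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Look for a charset declaration in the first 2048 bytes (an assumed limit).
    // The bytes are read as ISO-8859-1 only so that the ASCII markup can be searched.
    static String declaredCharset(byte[] page, String fallback) {
        int len = Math.min(page.length, 2048);
        String head = new String(page, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET_DECL.matcher(head);
        return m.find() ? m.group(1) : fallback;
    }

    // Decode the input with the named charset and write it back out as UTF-8.
    // Charset.forName throws an unchecked exception for unknown charset names.
    static void convert(InputStream in, String sourceCharset, OutputStream out)
            throws IOException {
        Charset source = Charset.forName(sourceCharset);
        try (Reader reader = new InputStreamReader(in, source);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    // Convenience wrapper for a page already held in memory.
    static byte[] toUtf8(byte[] page, String fallbackCharset) {
        String charset = declaredCharset(page, fallbackCharset);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            convert(new ByteArrayInputStream(page), charset, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

    Calling ToUtf8.toUtf8(pageBytes, "ISO-8859-1") returns the page as UTF-8
    bytes, falling back to ISO-8859-1 when no declaration is found. Which
    charset names are accepted depends on the Java installation.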

    cheers
    stuart

    -- 
    Stuart Yeates            stuart.yeates@computing-services.oxford.ac.uk
    OSS Watch                                  http://www.oss-watch.ac.uk/
    Oxford Text Archive                             http://ota.ahds.ac.uk/
    Humbul Humanities Hub                         http://www.humbul.ac.uk/
    


