Re: Corpora: control chars

From: Tom Emerson
Date: Fri Jun 07 2002 - 16:00:54 MET DST

    Gil Graf writes:
    > is there any encoding, except utf16, which uses the
    > control range (0-31) in a way different than ASCII ?
    > more specifically, is it safe to cut off text at 10
    > (normally newline) or 32 (normally space) bytes?

    The question presumes you are looking at characters in terms of 8-bit
    bytes instead of abstract character units consisting of one or more
    bytes.

    There are some C0 code points you may want to keep:

    0x09 Horizontal Tab
    0x0A Line Feed
    0x0D Carriage Return
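
    As a rough sketch of that filtering, assuming the text has already
    been decoded into a string of code points: the helper name strip_c0
    and the KEEP set are illustrative only, and DEL (0x7F) is left alone
    because it lies outside the C0 range.

```python
# Hypothetical helper: drop C0 control characters (U+0000..U+001F)
# except the three listed above (tab, line feed, carriage return).
KEEP = {0x09, 0x0A, 0x0D}


def strip_c0(text: str) -> str:
    """Return text without C0 controls, keeping the code points in KEEP."""
    return "".join(
        ch for ch in text
        if ord(ch) >= 0x20 or ord(ch) in KEEP
    )


# NUL, BEL and ESC are removed; tab, CR and LF survive.
cleaned = strip_c0("col1\tcol2\x00\x07\r\nnext\x1b line")
print(repr(cleaned))
```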

    I presume you are using a multibyte character encoding in your data:
    in that case all instances I can think of (including UTF-8) share the
    C0 range. The two- and four-byte encodings of Unicode also have the C0
    code points, but at the byte level these have leading or trailing
    0x00 bytes, depending on the byte order of the data (usually the
    endianness of the machine that wrote it).
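
    To make the byte-level difference concrete, here is a small sketch
    that encodes a single Line Feed (U+000A) in several Unicode encoding
    forms. The codec names are Python's; the explicit -le/-be variants
    are used so that no byte order mark is emitted.

```python
# Encode one Line Feed in several Unicode encoding forms and show the
# resulting bytes. Explicit little/big-endian codecs do not add a BOM.
LF = "\n"
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]

encoded = {name: LF.encode(name) for name in encodings}

for name, raw in encoded.items():
    # bytes.hex(sep) needs Python 3.8 or later.
    print(f"{name:10} {raw.hex(' ')}")
```

    In UTF-8 the Line Feed is the single byte 0x0A; in the two- and
    four-byte forms it is padded with 0x00 bytes before or after,
    depending on the byte order.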

    If you are working with C and are using the wchar_t type, then it is
    possible that the system is using UTF-32/UCS-4 as the underlying
    character type, in which case the encoding is less of an issue and you
    can think only in terms of code points.
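
    The same idea can be sketched outside C. Python strings are
    sequences of code points, so decoding first and splitting afterwards
    avoids the byte-level pitfalls. The sample text below is made up for
    illustration: it contains U+010A, whose big-endian UTF-16 form
    happens to include the byte 0x0A.

```python
# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) encodes to the bytes
# 01 0A in big-endian UTF-16, so a byte-level cut at 0x0A splits it.
text = "a\u010ab\nc"
raw = text.encode("utf-16-be")

# Naive approach: cut the raw bytes at every 0x0A byte.
byte_chunks = raw.split(b"\x0a")

# Code point approach: decode first, then cut at the Line Feed character.
codepoint_chunks = raw.decode("utf-16-be").split("\n")

print(len(byte_chunks), byte_chunks)
print(len(codepoint_chunks), codepoint_chunks)
```

    The byte-level split produces three fragments, none of which is
    valid UTF-16 on its own, while the code point split yields the two
    intended lines.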



    Tom Emerson                                          Basis Technology Corp.
    Sr. Computational Linguist               
      "Beware the lollipop of mediocrity: lick it once and you suck forever"

    This archive was generated by hypermail 2b29 : Fri Jun 07 2002 - 16:20:36 MET DST