Re: Corpora: Relative text length

From: Alex Chengyu Fang (alex_chengyu@yahoo.co.uk)
Date: Mon Apr 29 2002 - 17:34:00 MET DST


    What I wanted to say is that there are different ways
    of measuring relative length and that, if counts of
    characters, syllables and morphemes are used, you are
    likely to see differences between language pairs.
    If, however, the semantic proposition is used as the
    unit, languages may not be so different, as the number
    of propositions should be nearly constant across
    multilingual texts that are translations of each
    other.

    So, my simplistic view is that to see the differences,
    use characters, syllables and morphemes as
    measurements. To see similarities (the other
    direction), the number of semantic propositions can
    serve the purpose.
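
To make the "differences" side concrete, here is a minimal sketch that
expresses each language's length relative to English under two surface
measures. The word and character counts are copied from the FIFA Laws
figures Ramesh posted (quoted further down); the dictionary layout and
the function name relative_length are illustrative assumptions, not
anything taken from the thread.

```python
# Word and character counts copied from the FIFA Laws table quoted in this
# thread. The structure and names below are illustrative assumptions.
FIFA_LAWS = {
    "English": (10216, 56874),
    "German": (9173, 63402),
    "Spanish": (11030, 63765),
    "French": (11763, 67537),
}


def relative_length(counts, baseline="English"):
    """Return {language: (word_ratio, char_ratio)} relative to the baseline."""
    base_words, base_chars = counts[baseline]
    return {
        lang: (words / base_words, chars / base_chars)
        for lang, (words, chars) in counts.items()
    }


ratios = relative_length(FIFA_LAWS)

# Rank the languages from shortest to longest under each measure.
by_words = sorted(ratios, key=lambda lang: ratios[lang][0])
by_chars = sorted(ratios, key=lambda lang: ratios[lang][1])

for lang, (word_ratio, char_ratio) in ratios.items():
    print(f"{lang:8s} words x{word_ratio:.2f}  chars x{char_ratio:.2f}")
```

On these figures the two measures disagree about German: it has fewer
words than the English text but more characters, so the ranking depends
on the unit chosen.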

    Regards,

    Alex

    --- Yorick Wilks <yorick@dcs.shef.ac.uk> wrote:
    > Sorry, I don't quite follow this--I thought the original question was
    > just about length (whether text, characters, morphemes or words) and I
    > didn't know when reading the question what the questioner's
    > purpose was--I HOPE it wasn't language discrimination because Ramesh's
    > figures show pretty clearly that length (as words) doesn't separate
    > Slavic languages like Czech from Estonian/Hungarian--though
    > length as characters does a bit better, although there's no separation
    > from the Slavic family as a whole at all!
    > None of that seems terribly simplistic, just natural, given the question
    > and answer (though which is unhelpful as it turns out).
    >
    > What I don't follow is the link to alignment that you make--alignment is
    > clearly interesting but what does it or can it say about the relative
    > length of languages that the simpler counts do not?
    > What is this 'other direction' you write of--is it that, if you align
    > at the sentence level many-one, it says something about some property of
    > the languages that can distinguish them?
    > Or won't all that depend on the existence and shared significance of
    > punctuation marks--which seems a bit implausible?
    > Regards
    > Yorick Wilks
    >
    >
    >
    > Alex Chengyu Fang wrote:
    >
    > > Which measure to use depends on the purpose of the
    > > study, whether to bring out differences or
    > > similarities of the languages concerned.
    > >
    > > A rather simplistic view is that counts of words,
    > > characters, syllables, morphemes etc tend to be used
    > > to discriminate between languages. An attempt in the
    > > other direction is the use of the number of
    > > propositions to, for instance, automatically align
    > > multilingual texts:
    > >
    > > Campbell, J. and A.C. Fang. 1995. Automated Alignment
    > > in Multilingual Corpora. In Proceedings of the 10th
    > > Pacific Asia Conference on Language, Information and
    > > Computation (PACLIC10), 27-28 December 1995, Hong Kong
    > > City University, Hong Kong. pp 185-193.
    > >
    > > Regards,
    > >
    > > Alex Fang
    > >
    > > --- ramesh@ccl.bham.ac.uk wrote:
    > > > Dear Yorick
    > > >
    > > > Would morpheme counts not be even more accurate (or
    > > > linguistically valid) than counting orthographic
    > > > characters?
    > > > Unfortunately, I don't think anyone has done these
    > > > yet...
    > > >
    > > > Anyway, I agree that for the moment, character counts
    > > > are a useful addition to word counts.
    > > >
    > > > Problems about translation (compensation, explication,
    > > > zero translation, etc) obviously apply throughout.
    > > >
    > > > Here are some figures from my own research:
    > > >
    > > > 1. FIFA Laws in English, German, Spanish, and French.
    > > > French is longest, then Spanish, German, and English.
    > > >
    > > > lines  words  characters  text
    > > >
    > > >   726  10216       56874  Laws97GB.txt
    > > >   724   9173       63402  Laws97DE.txt
    > > >  1342  11030       63765  Laws97SP.txt
    > > >  1169  11763       67537  Laws97FR.txt
    > > >
    > > > 2. Canadian Hansard in English and French.
    > > > French is longer in both samples.
    > > >
    > > > lines  words   chars  text
    > > >
    > > >  1569  20336  104015  c1.001.E.A
    > > >  1569  22413  124457  c1.002.F.A
    > > >
    > > >  1120  12260   62421  c2.002.E.A
    > > >  1120  12135   62622  c2.003.F.A
    > > >
    > > > 3. George Orwell's 1984 (thanks to Multext-East and TELRI)
    > > > in several languages. These figures were provided by
    > > > Dr Tomaz Erjavec (Ljubljana) with various additional
    > > > caveats:
    > > >
    > > >             line    word    char
    > > >
    > > > English    16053  102787  584803
    > > > Bulgarian  11172   85878  536977
    > > > Czech      11087   79022  498216
    > > > Estonian   17872   78792  545984
    > > > Hungarian   8813   79814  575219
    > > > Romanian   16684  103704  603868
    > > > Slovene    14938   91336  541461
    > > >
    > > > 4. Le Monde Diplomatique in English and French:
    > > >
    > > > lines  words  characters  text
    > > >
    > > >   116    956        6410  LEMAE1.txt
    > > >   133    941        7457  LEMAF1.txt
    > > >
    > > > 5. From research with Dr Maria Cristina Borba (Rio
    > > > Grande, Brazil).
    > > > Alice in Wonderland in English, 2
    > > > Brazilian-Portuguese translations
    > > > (one for adults, one for children), and a Catalan
    > > > translation (MARIST).
    > > >
    > > >                             CARROLL    LEITE  SEVCENKO   MARIST
    > > >
    > > > File length (bytes)         204,288  148,889   150,235  143,055
    > > > Running words (tokens)       31,731   25,348    26,245   25,566
    > > > Different words (types)       3,417    3,896     3,614    4,400
    > > > type/token ratio (mean)      44.99%   51.61%    51.25%   51.19%
    > > > ave. word length (letters)     3.63     4.36      4.31     4.16
    > > >
    > > > Best
    > > > Ramesh
    > > >
    > > > Ramesh Krishnamurthy
    > > > Honorary Research Fellow, University of Birmingham;
    > > > Honorary Research Fellow, University of Wolverhampton;
    > > > Consultant, Cobuild and Bank of English Corpus,
    > > > Collins Dictionaries.
    > > >
    > > >
    > > > On Thu, Apr 25, 2002 at 04:56:15PM +0100, Yorick
    > > > Wilks wrote:
    > > > >
    > > > > Isn't there some (minor) confusion here? If the
    > > > > question really is relative TEXT length,
    > > > > then nothing to do with word counts will settle
    > > > > it--what matters is character counts, since word length
    > > > > varies considerably between languages. The table
    >
    === message truncated ===




    This archive was generated by hypermail 2b29 : Mon Apr 29 2002 - 17:53:19 MET DST