Re: [Corpora-List] Chomsky

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Thu Oct 14 2004 - 21:37:09 MET DST


    Someone wrote:
    >> I'm looking for the exact bibliographical reference where
    >> we can find Chomsky's idea that a corpus presents a language
    >> that is defective or corrupted.
    >
    Ronald J. Craig wrote:
    > I don't have Aspects at hand (I think maybe I burned it?)

    "A record of natural speech will show numerous false starts, deviations
    from rules, changes of plan in mid-course, and so on. The problem for
    the linguist, as well as for the child learning the language, is to
    determine from the data of performance the underlying system of rules
    that has been mastered by the speaker-hearer and that he puts to use in
    actual performance." Aspects, pg. 4

    FWIW, I don't find anything in the above to disagree with. If you think
    otherwise, you might want to consider your reaction to the various
    "Bushisms" that are floating around :-).

    Notice also that Chomsky says _speech_, i.e. he's talking about spoken
    (transcribed) corpora, not prepared texts. I would say the same is
    probably true of written (non-transcribed) texts, though to a lesser
    degree. I just had occasion to worry about how the name "Kim
    Jong Il" was to be translated into Panjabi. (Long story.) The
    translator had represented the third part of that name using Latin
    letters, rather than Gurmukhi (the Panjabi writing system), as "Il"
    (eye-el). When questioned, he said that it stood for "the second", and
    that it didn't make sense to translate that into Panjabi. Now there are
    two problems. First, he's presumably thinking of "II" (eye-eye), even
    though he had written "Il" (eye-el). Second, there's an empirical
    question (to use Chomsky's term): what is the last word _supposed_ to
    be? Of course it's Korean (borrowed into the Korean language from
    Chinese, I'm told), where it's written in a different writing system
    (Hangul); so to rephrase the question, what is the appropriate
    transliteration into Latin letters and semi-English spelling? If you go
    on the web, Google finds 231,000 instances of "Kim Jong Il" and
    6,300 instances of "Kim Jong II"--and for good measure, a couple hundred
    instances of "Kim Jong ll" (el-el). So as corpus linguists, you can
    rejoice that the corpus search gave the right answer, i.e. the one that
    comes closest to an English spelling of the Korean name. But you also
    have to ask: what is the status of the 6,300 cases where it was spelled
    wrong--are those errors, or just different data? I think I know what
    Kim Jong Il would tell you...I think he would tell you the Web was both
    defective and corrupted!
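
    To put rough numbers on those counts, here is a minimal Python sketch
    of each spelling's share of the hits, plus the code points that tell
    the look-alike strings apart. The first two counts are the ones quoted
    above; the third is an assumption, since "a couple hundred" is not an
    exact figure, and the function names are made up for illustration.

```python
# Hit counts quoted in the message. The third figure is an ASSUMPTION:
# the message only says "a couple hundred", so 200 is a placeholder.
hits = {
    "Kim Jong Il": 231_000,  # eye-el
    "Kim Jong II": 6_300,    # eye-eye
    "Kim Jong ll": 200,      # el-el, placeholder value
}


def variant_shares(counts):
    """Return each variant's share of the total hits, as a fraction."""
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.items()}


def last_part_code_points(name):
    """List the Unicode code points of the last part of the name."""
    return [f"U+{ord(ch):04X}" for ch in name.split()[-1]]


shares = variant_shares(hits)
for variant, share in shares.items():
    print(f"{variant!r}: {share:.2%}  {last_part_code_points(variant)}")
```

    Under that placeholder, the eye-el spelling accounts for roughly 97%
    of the hits. The code points (U+0049 for capital I, U+006C for
    lowercase l) are what actually distinguish strings that look nearly
    identical in many fonts. Whether the minority spellings count as
    errors or as data is, of course, not something the arithmetic settles.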

    -- 
    	Mike Maxwell
    	Linguistic Data Consortium
    	maxwell@ldc.upenn.edu
    


