[Corpora-List] Re: Passing the Turing Test: My Holy Grail

From: FIDELHOLTZ DOOCHIN JAMES LAWRENCE (jfidel@siu.buap.mx)
Date: Sun May 18 2003 - 16:28:26 MET DST


    [note to CorporaList: here's my answer to a query I was sent. If anyone
    thinks I'm totally off the wall, please send him (& me) your comments--Jim]

    Hi Michael:

    answers (such as they are) below, among your Qs.

    Michael Bramante wrote:

    > I ask your assistance in seeking information on how to acquire the three
    > things listed below (in order of importance). Please let me know where
    > and how I can acquire these things. Also, please include all relevant
    > names, addresses, phone numbers, email addresses and URL's.
    >
    > I am seeking:
    >
    > 1) The top 10,000 most commonly used words in the English language.

    This one is basically impossible as stated, since, except for 'the' and a
    couple of other words, the frequency of words depends *very much* on the
    particular corpus you are using. Probably what you want, and what would be
    most useful for you, is the top 10,000 words of a very large corpus
    that tries to be representative, such as the BNC (British National Corpus).
    Use Google to find URLs. The corpus is available, and you can probably even
    get the info you need for free, but of that I'm not 100% sure. Check the
    'corpora list' archives for more info on the BNC.
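
    A rough sketch of how such a list could be pulled from any plain-text
    corpus you have on hand. The name top_words and the crude tokenizer are
    assumptions for illustration only, not part of any BNC tooling, and a
    different tokenizer will give somewhat different counts:

```python
import re
from collections import Counter


def top_words(text, n=10_000):
    """Return the n most frequent word forms in text as (word, count) pairs."""
    # Crude tokenizer (an assumption): lowercase everything and keep runs of
    # letters and apostrophes. Real corpora need more careful tokenization.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(n)


sample = "The cow jumped over the moon. The snake slithered under the car."
print(top_words(sample, n=3))
```

    Run on two different corpora, this will generally agree on 'the' at the
    top and diverge further down, which is the corpus dependence mentioned
    above.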
    >
    > 2) I am seeking exhaustive, empirical data containing the correct and
    > complete assignment of all possible parts of speech to each and every
    > word in the above list.

    Again, see previous comment, but the BNC, eg, is tagged with POS, so you
    could get most of the info you might need. Of course, 'exhaustive' is
    tough, since recent results indicate:

    1) As you add new text to a corpus, however large, you will always add new
    words, although proportionally fewer per million words; nevertheless, this
    'addition curve' does not seem to level off at an asymptote. Ie, every
    language seems to have literally an *infinite number of words* (my
    interpretation of Baayen's recent book). BTW, an increasing percentage of
    the 'new words' encountered are proper nouns (personal observation, and no
    doubt also made by others). PS to the BTW: this does not invalidate the
    observation, since, in my view, proper nouns are just as much part of the
    language as anything else, contrary to the practice of eg most dictionary
    makers.

    2) POS tagging is at least partly an art, as even linguists agree only
    about 99% or so of the time when tagging specific texts (in the best of
    cases). Taking that figure, at least 100 of your 10,000 entries (1%) would
    be questionable. Also note that the more frequent a word, the more
    different meanings it tends to have (and thus the more parts of speech as
    well). Another factor is that there are for each language a very large
    number of possible sets of parts of speech, depending on the deviser's
    particular ideas about some structures, on the one hand, and their desire to
    be very specific or very general in their approach, on the other hand.
    *Sometimes* more specific divisions are easily translatable into more
    general ones (usually only if they have been devised by the same person or
    team).
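
    The 'addition curve' in point 1) can be traced on any text you have. A
    minimal sketch, assuming the same crude tokenizer as a stand-in for real
    tokenization; the function name vocabulary_growth and the step parameter
    are mine, not taken from Baayen's book:

```python
import re


def vocabulary_growth(text, step=1000):
    """Return (tokens_seen, distinct_words_seen) pairs, one per `step` tokens.

    The final pair always covers the whole text, even if its length is not
    a multiple of `step`.
    """
    # Crude tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


sample = "the cow jumped over the moon and the snake slithered under the car"
for n_tokens, n_types in vocabulary_growth(sample, step=4):
    print(n_tokens, n_types)
```

    On a real corpus, the second column keeps rising as the first grows,
    only more and more slowly. Whether it ever flattens out is exactly the
    question raised in point 1).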
    >
    > 3) This one is a tall order and is the least important item for me:
    >
    > I am interested in locating the exhaustive and finite list of all
    > possible grammatically correct English sentence structures that contain
    > at most ten words. By sentence structure I mean a sentence composed of
    > 'part of speech markers' instead of actual words.

    This might be possible, and conceivably could have been done by someone
    already, but I don't know about it. Still, even if it has been done, you
    would need to be careful, since you would need to include hierarchical
    information as well as the POS of each word, since, eg, you would have to
    distinguish between
        (Bill and Sally) and (John and Suzie) came home. (two couples)
    and
        Bill and Sally and John and Suzie came home. (our 4 kids).
    This is not to mention restrictive vs. nonrestrictive modifiers, etc. I
    suspect that, practically speaking, this list would be astronomically large.
    Also, of course, its size would depend on the size of the tagset used. From
    memory, smallish tagsets used in real corpora might have about 30-35
    different tags. Using 30, we might set an upper limit at 30^10 for
    ten-word sentences (this without considering hierarchical structures), or
    about 6 x 10^14, ie on the order of 10 to the 15th. We could obviously
    drop this down a lot (maybe to 10 to the ninth?) but that's still a billion
    structures to contend with.
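
    The arithmetic behind that upper limit, as a quick check. The tagset size
    of 30 is the from-memory figure above; the sum over shorter sentences is
    my own addition, since the question asks for "at most" ten words. It
    counts flat tag sequences only and ignores both grammaticality and
    hierarchical structure:

```python
TAGSET_SIZE = 30   # "smallish" tagset, the from-memory figure in the text
MAX_WORDS = 10     # sentences of at most ten words

# Every possible tag sequence of exactly ten words.
exactly_ten = TAGSET_SIZE ** MAX_WORDS

# Every possible tag sequence of one to ten words.
up_to_ten = sum(TAGSET_SIZE ** k for k in range(1, MAX_WORDS + 1))

print(f"exactly {MAX_WORDS} words: {exactly_ten:.2e}")
print(f"1 to {MAX_WORDS} words:    {up_to_ten:.2e}")
```

    Both figures come out near 6 x 10^14, so including the shorter sentences
    barely changes the bound: the ten-word sequences dominate.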
    >
    > I am not seeking an exhaustive and finite list of all grammatically
    > correct English sentences. That would be crazy.
    >
    > For instance the following two sentences are identical in structure
    > yet are two completely different sentences. "The cow jumped over the
    > moon." is identical in structure to "The snake slithered under the car."
    >
    > The structure could be represented as:
    >
    > "[Article Definite] ([Object Direct] [Noun Common] and [Noun
    > Concrete] and [Noun Countable]) [Verb Active Past] [Preposition]
    > [Article Definite] ([Noun Common] and [Noun Concrete] and [Noun
    > Countable])"
    >
    > I am enormously curious as to whether or not something like this
    > exists.
    >
    >
    > Thank you very much for your time and consideration.
    >
    >
    > Sincerely,
    >
    >
    > Michael J. Bramante
    > Cell (206) 227-1111
    > Email Bramante@attbi.com
    >
    Well, I hope this is some help. I'm sending this along to the Corpora List,
    in case anybody else might have more enlightening comments than me.

    Jim

    James L. Fidelholtz
    Posgrado en Ciencias del Lenguaje
    Benemérita Universidad Autónoma de Puebla MÉXICO



    This archive was generated by hypermail 2b29 : Sun May 18 2003 - 16:35:50 MET DST