Corpora: Re: What is a Corpus

From: Vladimir Rykov (rykov@iling.msk.su)
Date: Mon Feb 07 2000 - 07:32:10 MET


         I found the "What is a corpus" discussion very interesting.
         There is a real problem here: what is a corpus, and is it balanced
    and/or representative?
         Take a corpus of proverbs as an example: who can say that it is a
    corpus and not an archive, a set, or a dump of proverbs? We can find
    many interesting things in a dump, but what is the value of our
    findings? If we did no pre-processing (filtering) while creating our
    set of proverbs, then what is the value of the following statement:
    "There are no Italian proverbs about unlucky marriages"?
         This statement is reliable, or scientific, only for a
    representative proverb corpus. Otherwise it is "dump as input, dump as
    output (dust to dust)". Is there a quasi-logical procedure for deciding
    whether a given collection (dump) of textual data is a representative
    corpus? This is the starting point for all subsequent work: is it a
    scientific activity or a paid hobby?

    ---
        YS Vladimir Rykov, PhD in Computational Linguistics
        Linguistic Institute
        1/12 B.Kislovsky per., Moscow, 103009
        www.blkbox.com/~gigawatt/rykov.html
        WWW.GOL.RU/~iling
        M_M_M_M_M_M_M_M_M_M_M_M_M
        KREMLIN WALL IS WHERE YOU MAKE IT !!!
        Please - do NOT send Internet (attached, multimedia etc) files - we can read ASCII files ONLY
        Please - send us *.html, *.doc, other non-ASCII files to the addr: ILING@GOL.RU with RE: For Rykov


