Corpora: A Welsh lexical database and frequency count

From: Nick Ellis (n.ellis@bangor.ac.uk)
Date: Wed Jan 16 2002 - 11:52:14 MET


    Cronfa Electroneg o Gymraeg (CEG)

    A 1 million word lexical database and frequency count for Welsh

    Please circulate to those interested

    This is a word frequency analysis of 1,079,032 words of written Welsh
    prose, based on 500 samples of approximately 2000 words each,
    selected from a representative range of text types to illustrate
    modern (mainly post-1970) Welsh prose writing. It was conceived as
    providing a Welsh parallel to the Kucera and Francis analysis for
    American English, and the LOB corpus for British English, in the
    expectation that such an analysed corpus would provide research tools
    for a number of academic disciplines: psychology and
    psycholinguistics, child and second language acquisition, general
    linguistics, and the linguistics of Modern Welsh, including literary
    analysis.

         The sample included materials from the fields of novels and short
    stories, religious writing, children's literature, both factual and
    fiction, non-fiction materials in the fields of education, science,
    business, leisure activities, etc., public lectures, newspapers and
    magazines, both national and local, reminiscences, academic writing,
    and general administrative materials (letters, reports, minutes of
    meetings).

         The resultant corpus was analysed to produce frequency counts of
    words both in their raw form and as counts of lemmas where each token
    is demutated and tagged to its root. This analysis also derives basic
    information concerning the frequencies of different word classes,
    inflections, mutations, and other grammatical features.
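
    The difference between raw-form counts and lemma counts can be
    sketched as below. This is only an illustration, not the procedure
    used to build CEG: the mutation table is a small subset of Welsh
    initial mutations, and LEXICON, demutate and frequency_counts are
    hypothetical names invented for this example.

```python
from collections import Counter

# Toy subset of Welsh initial mutations: mutated initial -> candidate radicals.
# Longer initials come first so that "ngh" is tried before "ng" or "n".
MUTATION_TABLE = [
    ("ngh", ["c"]), ("mh", ["p"]), ("nh", ["t"]),   # nasal: c, p, t
    ("ch", ["c"]), ("ph", ["p"]), ("th", ["t"]),    # aspirate: c, p, t
    ("dd", ["d"]), ("ng", ["g"]),                   # soft d, nasal g
    ("g", ["c"]), ("b", ["p"]), ("d", ["t"]),       # soft: c, p, t
    ("f", ["b", "m"]), ("l", ["ll"]), ("r", ["rh"]),  # soft: b/m, ll, rh
    ("m", ["b"]), ("n", ["d"]),                     # nasal: b, d
]

# Hypothetical mini-lexicon mapping radical word forms to their lemma.
LEXICON = {
    "cath": "cath", "cathod": "cath",
    "pen": "pen", "tad": "tad", "gardd": "gardd",
}


def demutate(token):
    """Return the radical (unmutated) form of token if the lexicon supports one."""
    if token in LEXICON:
        return token
    for mutated, radicals in MUTATION_TABLE:
        if token.startswith(mutated):
            rest = token[len(mutated):]
            for radical in radicals:
                candidate = radical + rest
                if candidate in LEXICON:
                    return candidate
    # Soft mutation deletes an initial "g", so try restoring it.
    if "g" + token in LEXICON:
        return "g" + token
    return token  # unknown word: leave unchanged


def frequency_counts(tokens):
    """Count tokens in raw form and as lemmas after demutation."""
    raw = Counter()
    lemmas = Counter()
    for token in tokens:
        token = token.lower()
        raw[token] += 1
        radical = demutate(token)
        lemmas[LEXICON.get(radical, radical)] += 1
    return raw, lemmas


raw, lemmas = frequency_counts("y gath a fy nghath ei chath hi cathod".split())
print(raw.most_common(3))
print(lemmas.most_common(3))
```

    In this toy sample the forms gath, nghath, chath and cathod are
    four separate entries in the raw count but fall under the single
    lemma cath. Checking each candidate against a lexicon matters
    because the mutations are ambiguous: an initial f, for example,
    may come from either b or m.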

         Available on-line:

         Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., &
    Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million
    word lexical database and frequency count for Welsh. [On-line],
    Available: http://www.bangor.ac.uk/ar/cb/ceg/ceg_eng.html

    -------------------------------------------------------------------

    Cronfa Electroneg o Gymraeg (CEG)

    A one-million-word lexical database that counts the frequency of
    word usage in Welsh

    Please circulate this to everyone who is interested in it.

      This is a word frequency analysis of 1,079,032 words of written
    Welsh prose, based on 500 samples of about 2000 words each. They
    were selected from a representative range of contemporary Welsh
    prose texts (mainly from 1970 onwards). The aim was to offer
    something comparable to the Kucera and Francis analysis of American
    English, and the LOB corpus of British English. The expectation was
    that a corpus analysed in this way would offer research tools for a
    number of academic disciplines:

    * psychology and psycholinguistics
    * children acquiring a second language
    * general linguistics
    * the linguistics of Modern Welsh, including literary analysis.

         The sample included:

    * materials from the fields of novels and short stories
    * religious writing
    * children's literature (factual and fiction)
    * materials in the fields of education, science, business,
    leisure activities, etc.
    * public lectures
    * newspapers and magazines, national and local
    * reminiscences
    * academic writing
    * general administrative materials (letters, reports, minutes of
    meetings)

         The corpus was analysed to produce frequency counts of words in
    their raw form as well as counts of lemmas where each token has been
    demutated and tagged according to its root. This analysis also gives
    basic information about the frequency of the different word classes,
    inflections, mutations and other grammatical features.

         Available on-line:
         Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., &
    Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million
    word lexical database and frequency count for Welsh. [On-line],
    Available: http://www.bangor.ac.uk/ar/cb/ceg/ceg_cym.html


