Corpora: MA studentship available to study South Asian languages

From: Mcenery, Tony (
Date: Tue May 14 2002 - 16:18:01 MET DST


    Dear All,
    please feel free to pass details of this studentship on to anybody you think
    may be interested in applying. Regards,

    MA Studentship - The corpus-based study of South Asian languages


    Lancaster University Department of Linguistics and MEL is offering an MA
    studentship in computer corpus-related research into the languages of South
    Asia. The studentship is part of the EPSRC-funded EMILLE project, which is
    collecting 67 million words of corpus data in Bengali, Gujarati, Hindi,
    Punjabi, Sinhala, Tamil and Urdu. This corpus data will form the basis of the MA
    student's research. See the following website for more details:


     Applicants should be native speakers of whichever language they wish to
    undertake research on. Note that the University also requires documentary proof
    of an average IELTS score of 6.5 for all non-native speakers of English.

     The studentship will run from the start of October 2002 to the end of
    September 2003. The studentship will cover fees (home or overseas) and provide
    a living allowance. No assistance with relocation is available from the

     Applicants should be willing to undertake research in one of the four research
    areas listed below. To apply, download the application forms from the following


     In making the application, candidates should complete the application form and
    write 'EMILLE' on the form where a source of funding is asked for.
    Additionally, candidates should include a one page description of how they
    propose to pursue the research topic they have chosen. The closing date for
    applications is 1st July 2002.

     Students may choose one of the following research topics:

     1. Corpus-based dictionary creation

     Many of the current standard dictionaries of South Asian languages are quite
    old and are generally not corpus-based. Using the EMILLE corpora as a source of
    data, the student will develop lexicographic resources for Bengali, Gujarati,
    Hindi, Punjabi, Sinhala, Tamil or Urdu. Throughout, the goal will be to apply
    the latest research in the field of corpus-based lexicography to South Asian

     2. Anaphora and anaphor resolution in Bengali, Gujarati, Hindi, Punjabi,
    Sinhala, Tamil or Urdu

     Much research has been undertaken on automated anaphor resolution for West
    European languages. Research focused on the languages of South Asia is, by
    contrast, relatively undeveloped. Students wishing to pursue research on
    anaphor resolution for South Asian languages may care to focus either on a
    corpus-based account of the anaphors of one of the languages listed above, or
    may seek to develop algorithms for automated anaphor resolution for one of
    Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil or Urdu.

     3. Machine translation between Hindi and Urdu

     Hindi and Urdu are very similar languages in their spoken form, but differ
    greatly in their written form. Using the EMILLE corpus as a data source, the
    student will develop, test and evaluate software that can translate Hindi texts
    into Urdu (and vice versa).

     4. Studying spoken language

     The student will study the spoken data in the EMILLE corpus in order to fulfil
    one of the following research goals:

    * examining the differences between the spoken and written forms of the
    * contrasting the dialects of the language spoken in the UK and South
    * analysing code-switching in spoken texts.

     The languages which may be studied for this project are Bengali, Gujarati,
    Hindi-Urdu or Punjabi.



