Corpora: NSF Supported Internships for Undergraduates

From: Laura Graham (
Date: Thu Jan 31 2002 - 21:32:21 MET

    Dear Colleague:
    The Center for Language and Speech Processing at Johns Hopkins University
    is offering a unique summer internship opportunity, which we would like
    you to bring to the attention of your best students in the current junior
    class. Only two weeks remain for students to apply for these internships.
    This internship is unique in the sense that the selected students will
    participate in cutting edge research as full members alongside leading
    scientists from industry, academia, and the government. What makes
    the internship exciting is the exposure of undergraduate students
    to the emerging fields of language engineering, such as automatic speech
    recognition (ASR), natural language processing (NLP), machine
    translation (MT), and speech synthesis (TTS).
    We are specifically looking to attract new talent into the field and,
    as such, do not require the students to have prior knowledge of language
    engineering technology. Please take a few moments to nominate suitable
    bright students who may be interested in this internship. On-line
    applications for the program can be found at
    along with additional information regarding plans for the 2002 Workshop
    and information on past workshops. The application deadline is
    February 15, 2002.
    If you have questions, please contact us by phone (410-516-4237),
    e-mail ( or via the Internet

    Frederick Jelinek
    J.S. Smith Professor and Director

    Project Descriptions for this Summer
    1. Weakly Supervised Learning For Wide-Coverage Parsing
    Before a computer can try to understand or translate a human sentence,
    it must identify the phrases and diagram the grammatical relationships
    among them. This is called parsing.
    State-of-the-art parsers correctly guess over 90% of the phrases and
    relationships, but make some errors on nearly half the sentences
    analyzed. Many of these errors distort any subsequent automatic
    interpretation of the sentence.
    Much of the problem is that these parsers, which are statistical,
    are not "trained" on enough example parses to know about many of the
    millions of potentially related word pairs. Human labor can produce
    more examples, but still too few by orders of magnitude.
    In this project, we seek to achieve a quantum advance by automatically
    generating large volumes of novel training examples. We plan to
    bootstrap from up to 350 million words of raw newswire stories,
    using existing parsers to generate the new parses together with
    confidence measures.
    We will use a method called co-training, in which several reasonably
    good parsing algorithms collaborate to automatically identify one
    another's weaknesses (errors) and to correct them by supplying new
    example parses to one another. This accuracy-boosting technique has
    widespread application in other areas of machine learning, natural
    language processing and artificial intelligence.
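
    The co-training loop described above can be sketched on a toy problem.
    This is only an illustration under stated assumptions, not the workshop's
    method: the two "parsers" are replaced by hypothetical nearest-centroid
    classifiers that each see one feature (one "view") of the data, the
    confidence measure is a simple distance margin, and all names and the
    threshold are made up for the example.

```python
# Toy co-training sketch (illustrative assumptions only; not real parsers).
# Two simple classifiers each see a different "view" (one feature index) and
# hand their confident labels to each other as new training examples.

def train_centroids(examples, view):
    """Return the per-label mean of the feature at index `view`."""
    sums, counts = {}, {}
    for features, label in examples:
        sums[label] = sums.get(label, 0.0) + features[view]
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, features, view):
    """Return (label, confidence); confidence is the distance margin
    between the nearest and second-nearest centroid."""
    dists = sorted((abs(features[view] - c), label) for label, c in centroids.items())
    best_dist, best_label = dists[0]
    margin = dists[1][0] - best_dist if len(dists) > 1 else 0.0
    return best_label, margin

def co_train(labeled, unlabeled, rounds=3, threshold=1.0):
    """Each view labels pool items it is confident about and adds them
    to the OTHER view's training set; unsure items stay in the pool."""
    train = {0: list(labeled), 1: list(labeled)}
    pool = list(unlabeled)
    for _ in range(rounds):
        models = {v: train_centroids(train[v], v) for v in (0, 1)}
        remaining = []
        for features in pool:
            used = False
            for v in (0, 1):
                label, conf = predict(models[v], features, v)
                if conf >= threshold:
                    train[1 - v].append((features, label))
                    used = True
            if not used:
                remaining.append(features)
        pool = remaining
    final_models = {v: train_centroids(train[v], v) for v in (0, 1)}
    return final_models, pool

# Hypothetical data: two labeled seeds and three unlabeled points.
labeled = [((0.0, 0.0), "A"), ((10.0, 10.0), "B")]
unlabeled = [(1.0, 2.0), (9.0, 8.0), (5.0, 5.0)]
models, leftover = co_train(labeled, unlabeled)
print(models, leftover)
```

    In this toy run the two clear-cut points are absorbed into the training
    sets, while the ambiguous midpoint never reaches the confidence threshold
    and stays unlabeled, which is the role the confidence measures play in
    the description above.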
    Numerous challenges must be faced: How do we parse 350 million words
    of text in less than a year (we have 6 weeks)? How do we use partly
    incompatible parsers to train one another? Which machine learning
    techniques scale up best? What kind of grammars, probability models,
    and confidence measures work best? The project will involve a
    significant amount of programming, but the rewards should be high.

    2. Novel Speech Recognition Models for Arabic
    Previous research on large-vocabulary automatic speech recognition
    (ASR) has mainly concentrated on European and Asian languages.
    Other language groups have been explored to a lesser extent,
    for instance Semitic languages like Hebrew and Arabic. These
    languages possess certain characteristics that present problems
    for standard ASR systems. For example, their written representation
    does not contain most of the vowels present in the spoken form,
    which makes it difficult to utilize textual training data.
    Furthermore, they have a complex morphological structure, which is
    characterized not only by a high degree of affixation but also by
    the interleaving of vowel and consonant patterns (so-called
    "non-concatenative morphology"). This leads to a large number of
    possible word forms, which complicates the robust estimation of
    statistical language models.
    In this workshop group we aim to develop new modeling approaches
    to address these and related problems, and to apply them to the
    task of conversational Arabic speech recognition. We will develop
    and evaluate a multi-linear language model, which decomposes the
    task of predicting a given word form into predicting more basic
    morphological patterns and roots. Such a language model can be
    combined with a similarly decomposed acoustic model, which
    necessitates new decoding techniques based on modeling statistical
    dependencies between loosely coupled information streams. Since
    one pervading issue in language processing is the tradeoff between
    language-specific and language-independent methods, we will also
    pursue an alternative control approach which relies on the
    capabilities of existing, language-independent recognition technology.
    Under this approach no morphological analysis will be performed and
    all word forms will be treated as basic vocabulary units. Furthermore,
    acoustic model topologies will be used which specify short vowels as
    optional rather than obligatory elements, in order to facilitate the
    use of text documents as language model training data. Finally, we
    will investigate the possibility of using large, generally available
    text and audio sources to improve the accuracy of conversational Arabic
    speech recognition.
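
    The idea of decomposing word prediction into root and pattern prediction
    can be illustrated with a small sketch. It rests on loud assumptions that
    are not in the project description: roots and patterns are treated as
    independent bigram streams, estimates are plain maximum likelihood with
    no smoothing, and the symbols r1, r2, p1, p2 are abstract placeholders
    rather than real Arabic roots or patterns.

```python
from collections import Counter

def bigram_model(sequence):
    """Maximum-likelihood bigram estimates P(cur | prev) from one sequence."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    prev_counts = Counter(sequence[:-1])
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}

def factored_prob(prev, cur, root_lm, pattern_lm):
    """Approximate P(cur | prev) as P(root | prev root) * P(pattern | prev pattern).
    The independence of the two streams is a simplifying assumption."""
    (prev_root, prev_pat), (cur_root, cur_pat) = prev, cur
    return (root_lm.get((prev_root, cur_root), 0.0)
            * pattern_lm.get((prev_pat, cur_pat), 0.0))

# Hypothetical corpus: each word form is a (root, pattern) pair.
corpus = [("r1", "p1"), ("r2", "p2"), ("r1", "p2"), ("r2", "p1")]

word_lm = bigram_model(corpus)                       # whole word forms as units
root_lm = bigram_model([root for root, _ in corpus])
pattern_lm = bigram_model([pat for _, pat in corpus])

# A word pair that never occurs in the corpus.
prev, cur = ("r2", "p2"), ("r1", "p1")
print(word_lm.get((prev, cur), 0.0))                 # word-level estimate
print(factored_prob(prev, cur, root_lm, pattern_lm)) # factored estimate
```

    The word-level model gives this unseen pair no probability mass, while
    the factored estimate is nonzero because the root transition and the
    pattern transition were each observed separately. That is the sparsity
    argument made above, in miniature.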

    3. Generation from Deep Syntactic Representation in Machine Translation
    Let's imagine a system for translating a sentence from a foreign
    language (say Arabic) into your native language (say English). Such a
    system works as follows. It analyzes the foreign-language sentence to
    obtain a structural representation that captures its essence, i.e.
    "who did what to whom where." It then translates (or transfers) the
    actors, actions, etc. into words in your language while "copying over"
    the deeper relationship between them. Finally it synthesizes a
    syntactically well-formed sentence that conveys the essence of the
    original sentence. Each step in this process is a hard technical
    problem, to which the best-known solutions are either not adequate
    for applications, or good enough only in narrow application domains,
    failing when applied to other domains. This summer, we will concentrate
    on improving one of these three steps, namely the synthesis (or
    generation) step. The target language for generation will be English,
    and the source language to the MT system will be a language of a
    completely different type (Arabic and Czech). We will further assume
    that the transfer produces a fairly deeply analyzed sentence
    structure. The
    incorporation of the deep analysis makes the whole approach very novel -
    so far no large-coverage translation system has tried to operate with
    such a structure, and the application to very diverse languages makes
    it an even more exciting enterprise!
    Within the generation process, we will focus on the structural
    (syntactic) part, assuming that a morphological generation module
    exists to complete the generation process, and will be added to the
    suite so as to be able to evaluate the final result, namely, the
    goodness of the plain English text coming out of the system.
    Statistical methods will be used throughout. A significant part of
    the workshop preparation will be devoted to assembling and running a
    simplified MT system from Arabic/Czech to English (up to the
    syntactic structure level), in order to have realistic training data
    for the workshop project. As a consequence, we will not only
    understand and solve the generation problem, but also learn the
    mechanics of an end-to-end MT system, preparing team members to
    work on other parts of the MT system in the future.

    4. SuperSID: Exploiting High-level Information for High-performance
    Speaker Recognition
    Identifying individuals based on their speech is an important component
    technology in many applications, be it automatically tagging speakers
    in the transcription of a board-room meeting (to track who said what),
    user verification for computer security, or picking out a known
    terrorist or narcotics trader among millions of ongoing satellite
    telephone calls.
    How do we recognize the voices of the people we know? Generally, we
    use multiple levels of speaker information conveyed in the speech signal.
    At the lowest level, we recognize a person based on the sound of his/her
    voice (e.g., low/high pitch, bass, nasality, etc.). But we also use
    other types of information in the speech signal to recognize a speaker,
    such as a unique laugh, particular phrase usage, or speed of speech
    among other things.
    Most current state-of-the-art automatic speaker recognition systems,
    however, use only the low level sound information (specifically, very
    short-term features based on purely acoustic signals computed on 10-20
    ms intervals of speech) and ignore higher-level information. While
    these systems have shown reasonably good performance, there is much
    more information in speech that can be used, and it could potentially
    improve accuracy and robustness greatly.
    In this workshop we will look at how to augment the traditional
    signal-processing based speaker recognition systems with such
    higher-level knowledge sources. We will be exploring ways to define
    speaker-distinctive markers and create new classifiers that make use
    of these multi-layered knowledge sources. The team will be working
    on a corpus of recorded telephone conversations (Switchboard I and II
    corpora) that have been transcribed both by humans and by machine and
    have been augmented with a rich database of phonetic and prosodic
    features. A well-defined performance evaluation procedure will be
    used to measure progress and utility of newly developed techniques.
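
    One simple way to combine low-level and higher-level knowledge sources is
    to fuse their scores. The sketch below is an assumption for illustration
    only, not the classifiers the team will build: the source names, weights,
    scores, and threshold are all hypothetical, and a weighted sum is just
    one of many possible fusion rules.

```python
def fuse_scores(scores, weights):
    """Weighted sum of per-source scores; higher means the trial looks
    more like the claimed speaker."""
    return sum(weights[name] * scores[name] for name in weights)

def verify(scores, weights, threshold=0.0):
    """Accept the claimed identity if the fused score reaches the threshold."""
    return fuse_scores(scores, weights) >= threshold

# Hypothetical weights for three knowledge sources.
weights = {"acoustic": 0.5, "prosodic": 0.25, "lexical": 0.25}

# Hypothetical trial: the short-term acoustic score alone argues for
# rejection, but prosodic and lexical (phrase usage) evidence is positive.
trial = {"acoustic": -0.5, "prosodic": 1.0, "lexical": 2.0}

acoustic_only = verify({"acoustic": trial["acoustic"]}, {"acoustic": 1.0})
fused = verify(trial, weights)
print(acoustic_only, fused)
```

    In this made-up trial the acoustic-only decision and the fused decision
    differ, which shows how higher-level markers could change an outcome.
    Whether they improve accuracy on real data is what the evaluation
    procedure mentioned above would have to measure.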
