[Corpora-List] NSF-supported Summer Internships

From: Fred Jelinek via listmember Jason Eisner (jason@cs.jhu.edu)
Date: Tue Feb 11 2003 - 13:50:11 MET


    Dear Colleague:

    The Center for Language and Speech Processing at the Johns Hopkins
    University is offering a unique summer internship opportunity, which we
    would like you to bring to the attention of your best students in the
    current junior class. Preliminary applications for these internships
    are due at the end of this week.

    This internship is unique in that the selected students will
    participate in cutting-edge research as full team members alongside
    leading scientists from industry, academia, and government. Equally
    exciting is the undergraduates' exposure to emerging fields of language
    engineering, such as automatic speech recognition (ASR), natural
    language processing (NLP) and machine translation (MT).

    We are specifically looking to attract new talent into the field and, as
    such, do not require the students to have prior knowledge of language
    engineering technology. Please take a few moments to nominate suitable
    bright students for this internship. On-line applications for the program
    can be found at http://www.clsp.jhu.edu/ along with additional information
    regarding plans for the 2003 Workshop and information on past workshops.
    The application deadline is February 15, 2003.

    If you have questions, please contact us by phone (410-516-4237), by
    e-mail (sec@clsp.jhu.edu), or via the web at http://www.clsp.jhu.edu.

    Sincerely,

    Frederick Jelinek
    J.S. Smith Professor and Director

    ---------------------------------------------------------------------------
    Team Project Descriptions for this Summer
    ---------------------------------------------------------------------------

    1. Syntax for Statistical Machine Translation

    In recent evaluations of machine translation systems, statistical systems
    based on probabilistic models have outperformed classical approaches based
    on interpretation, transfer, and generation. Nonetheless, the output of
    statistical systems often contains obvious grammatical errors. This can be
    attributed to the fact that syntactic well-formedness is influenced only
    by local n-gram language models and simple alignment models. We aim to
    integrate syntactic structure into statistical models to address this
    problem. A convenient and promising approach for this integration is the
    maximum entropy framework, which makes it possible to integrate many
    different knowledge sources into an overall model and to train the
    combination weights discriminatively. This approach will allow us to
    extend a baseline system easily by adding new feature functions.
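    The log-linear combination at the heart of the maximum entropy
    framework can be sketched as follows; the feature values and weights
    below are purely illustrative, not taken from any actual system:

```python
import math

def loglinear_score(features, weights):
    """Score of one candidate translation: sum_m lambda_m * h_m(e, f)."""
    return sum(w * h for w, h in zip(weights, features))

def nbest_posteriors(feature_vectors, weights):
    """Normalize exponentiated scores over an n-best list, giving
    a posterior p(e | f) for each candidate translation."""
    scores = [loglinear_score(h, weights) for h in feature_vectors]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Two hypothetical candidates with three feature values each, e.g. a
# language-model score, a translation-model score, and a binary
# syntactic feature (all values illustrative).
feats = [[-4.2, -6.1, 1.0], [-3.9, -7.0, 0.0]]
posteriors = nbest_posteriors(feats, weights=[1.0, 0.5, 0.3])
```

    Adding a new knowledge source then amounts to appending one more
    feature value and one more weight, which is what makes the framework
    easy to extend.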

    The workshop will start with a strong baseline -- the alignment template
    statistical machine translation system that obtained best results in the
    2002 DARPA MT evaluations. During the workshop, we will incrementally add
    new features representing syntactic knowledge that deal with specific
    problems of the underlying baseline. We want to investigate a broad range
    of possible feature functions, from very simple binary features to
    sophisticated tree-to-tree translation models. Simple feature functions
    might test if a certain constituent occurs in the source and the target
    language parse tree. More sophisticated features will be derived from an
    alignment model where whole sub-trees in source and target can be aligned
    node by node. We also plan to investigate features based on projection of
    parse trees from one language onto strings of another, a useful technique
    when parses are available for only one of the two languages. We will
    extend previous tree-based alignment models by allowing partial tree
    alignments when the two syntactic structures are not isomorphic.
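    As a concrete illustration, one of the simple binary features described
    above might be sketched like this (the function name and the label sets
    are hypothetical):

```python
def constituent_match(source_labels, target_labels, label):
    """Hypothetical binary feature function: fires (returns 1.0) when a
    constituent of the given label appears in both the source-language
    and the target-language parse tree."""
    return 1.0 if label in source_labels and label in target_labels else 0.0

# Illustrative constituent-label sets extracted from two parse trees.
src = {"S", "NP", "VP", "PP"}
tgt = {"S", "NP", "VP"}
```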

    We will work with the Chinese-English data from the recent evaluations,
    since large amounts of sentence-aligned training corpora, as well as
    multiple reference translations, are available. This will also allow us to compare
    our results with the various systems participating in the evaluations. In
    addition, annotation is underway on a Chinese-English parallel tree-bank.
    We plan to evaluate the improvement of our system using both automatic
    metrics for comparison with reference translations (BLEU and NIST) and
    subjective evaluations of adequacy and fluency. We hope both to improve
    machine translation performance and advance the understanding of how
    linguistic representations can be integrated into statistical models of
    language.
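    The automatic metrics mentioned above are built on clipped n-gram
    precision. A minimal sketch of that core quantity, for a single sentence
    and a single reference (real BLEU additionally combines several n-gram
    orders and a brevity penalty over a whole test set):

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram count is
    clipped by its count in the reference before computing precision."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
p1 = modified_ngram_precision(cand, ref, 1)  # 5 of 6 unigrams match
```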

    ---------------------------------------------------------------------------

    2. Semantic Analysis Over Sparse Data

    The aim of the task is to verify the feasibility of a machine
    learning-based semantic approach to the data sparseness problem that is
    encountered in many areas of natural language processing such as language
    modeling, text classification, question answering and information
    extraction. The suggested approach takes advantage of several
    technologies for supervised and unsupervised sense disambiguation that
    have been developed in the last decade and of several resources that have
    been made available.

    The task is motivated by the fact that current language processing models
    are considerably affected by sparseness of training data, and current
    solutions, like class-based approaches, do not elicit appropriate
    information: the semantic nature and linguistic expressiveness of
    automatically derived word classes are unclear. Many of these limitations
    originate from the fact that fine-grained automatic sense disambiguation
    is not applicable on a large scale.

    The workshop will develop a weakly supervised method for sense modeling
    (i.e. reduction of possible word senses in corpora according to their
    genre) and apply it to a huge corpus in order to coarsely
    sense-disambiguate it. This can be viewed as an incremental step towards
    fine-grained sense disambiguation. The created semantic repository as well
    as the developed techniques will be made available as resources for future
    work on language modeling, semantic acquisition for text extraction,
    question answering, summarization, and most other natural language
    processing tasks.

    ---------------------------------------------------------------------------

    3. Dialectal Chinese Speech Recognition

    There are eight major dialectal regions in addition to Mandarin (Northern
    China) in China, including Wu (Southern Jiangsu, Zhejiang, and Shanghai),
    Yue (Guangdong, Hong Kong, Nanning Guangxi), Min (Fujian, Shantou
    Guangdong, Haikou Hainan, Taipei Taiwan), Hakka (Meixian Guangdong,
    Hsin-chu Taiwan), Xiang (Hunan), Gan (Jiangxi), Hui (Anhui), and Jin
    (Shanxi). These dialects can be further divided into more than 40
    sub-categories. Although the Chinese dialects share a written language and
    standard Chinese (Putonghua) is widely spoken in most regions, speech is
    still strongly influenced by the native dialects. This great linguistic
    diversity poses problems for automatic speech and language technology.
    Automatic speech recognition relies to a great extent on the consistent
    pronunciation and usage of words within a language. In Chinese, word
    usage, pronunciation, and grammar vary depending on the speaker's
    dialect. As a result, speech recognition systems constructed to
    process standard Chinese (Putonghua) perform poorly for the great majority
    of the population.

    The goal of our summer project is to develop a general framework to model
    phonetic, lexical, and pronunciation variability in dialectal Chinese
    automatic speech recognition tasks. The baseline system is a standard
    Chinese recognizer. The goal of our research is to find suitable methods
    that employ dialect-related knowledge and training data (in relatively
    small quantities) to modify the baseline system to obtain a dialectal
    Chinese recognizer for the specific dialect of interest. For practical
    reasons during the summer, we will focus on one specific dialect, for
    example the Wu dialect or the Chuan dialect. However the techniques we
    intend to develop should be broadly applicable.

    Our project will build on established ASR tools and systems developed for
    standard Chinese. In particular, our previous studies in pronunciation
    modeling have established baseline Mandarin ASR systems along with their
    component lexicons and language model collections. However, little
    previous work exists, and few resources are available, to support research
    in Chinese dialect variation for ASR. Our pre-workshop effort will
    therefore focus on
    further infrastructure development:

      * Dialectal Lexicon Construction. We will establish an electronic
      dialect dictionary for the chosen dialect. The lexicon will be
      constructed to represent both standard and dialectal pronunciations.

      * Dialectal Chinese Database Collection. We will set up a dialectal
      Chinese speech database with canonical pinyin level and dialectal
      pinyin level transcriptions. The database will contain two parts:
      read speech and spontaneous speech. For the spontaneous speech part,
      a generalized initial/final (GIF) level transcription should also be
      included.
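    A minimal sketch of how such a two-level lexicon entry might be
    represented; the field names and the dialectal pronunciation shown are
    hypothetical placeholders, not entries from the planned dictionary:

```python
# Hypothetical lexicon entry mapping a word to its canonical
# (Putonghua) pinyin and a list of dialectal pronunciation variants.
lexicon = {
    "上海": {
        "canonical": ["shang4", "hai3"],
        "dialectal": [["zaan4", "he2"]],  # made-up Wu-style variant
    },
}

def pronunciations(word):
    """Return all pronunciations for a word, standard one first."""
    entry = lexicon[word]
    return [entry["canonical"]] + entry["dialectal"]
```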

    Our effort at the workshop will be to employ these materials to develop
    ASR system components that can be adapted from standard Chinese to the
    chosen dialect. Emphasis will be placed on developing techniques that work
    robustly with relatively small (or even no) dialect data. Research will
    focus primarily on acoustic phenomena, rather than syntax or grammatical
    variation, which we intend to pursue after establishing baseline ASR
    experiments.

    ---------------------------------------------------------------------------

    4. Confidence Estimation for Natural Language Applications

    Significant progress has been made in natural language processing (NLP)
    technologies in recent years, but most still do not match human
    performance. Since many applications of these technologies require
    human-quality results, some form of manual intervention is necessary.

    The success of such applications therefore depends heavily on the extent
    to which errors can be automatically detected and signaled to a human
    user. In our project we will attempt to devise a generic method for NLP
    error detection by studying the problem of Confidence Estimation (CE) in
    NLP results within a Machine Learning (ML) framework.

    Although widely used in Automatic Speech Recognition (ASR) applications,
    this approach has not yet been extensively pursued in other areas of NLP.
    In ASR, error recovery is entirely based on confidence measures: results
    with a low level of confidence are rejected and the user is asked to
    repeat his or her statement. We argue that a large number of other NLP
    applications could benefit from such an approach. For instance, when
    post-editing MT output, a human translator could revise only those
    automatic translations that have a high probability of being wrong. Apart
    from improving user interactions, CE methods could also be used to improve
    the underlying technologies. For example, bootstrap learning could be
    based on outputs with a high confidence level, and NLP output re-scoring
    could depend on probabilities of correctness.

    Our basic approach will be to use a statistical Machine Learning (ML)
    framework to post-process NLP results: an additional ML layer will be
    trained to discriminate between correct and incorrect NLP results and
    compute a confidence measure (CM) that is an estimate of the probability
    of an output being correct. We will test this approach on a statistical MT
    application using a very strong baseline MT system. Specifically, we will
    start off with the same training corpus (Chinese-English data from recent
    NIST evaluations), and baseline system as the Syntax for Statistical
    Machine Translation team.
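    The additional ML layer can be sketched as a simple logistic model
    mapping confidence features to a probability of correctness; the
    update rule and learning rate below are one illustrative choice, not
    the system that will actually be built:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def confidence(features, weights, bias):
    """Confidence measure: estimated probability that an NLP output
    is correct, from a logistic model over confidence features."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def sgd_step(features, label, weights, bias, lr=0.1):
    """One stochastic-gradient step on the log loss for a single
    (features, is-correct) training pair."""
    err = confidence(features, weights, bias) - label
    new_weights = [w - lr * err * x for w, x in zip(weights, features)]
    return new_weights, bias - lr * err
```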

    During the workshop we will investigate a variety of confidence features
    and test their effects on the discriminative power of our CM using
    Receiver Operating Characteristic (ROC) curves. We will investigate
    features intended to capture the amount of overlap, or consensus, among
    the system's n-best translation hypotheses, features focusing on the
    reliability of estimates from the training corpus, ones intended to
    capture the inherent difficulty of the source sentence under translation,
    and those that exploit information from the base statistical MT system.
    Other themes for investigation include a comparison of different ML
    frameworks such as Neural Nets or Support Vector Machines, and a
    determination of the optimal granularity for confidence estimates
    (sentence-level, word-level, etc.).
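    A minimal sketch of how an ROC curve is traced from confidence scores:
    a threshold is swept over the scores, and the true- and false-positive
    rates are recorded (labels mark whether an output was in fact correct):

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping a threshold over the
    confidence scores; labels are 1 for correct outputs, 0 for errors."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts
```

    A confidence measure with strong discriminative power pushes the curve
    toward the top-left corner (high TPR at low FPR).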

    Two methods will be used to evaluate final results. First, we will perform
    a re-scoring experiment where the n-best translation alternatives output
    by the baseline system will be re-ordered according to their confidence
    estimates. The results will be measured using the standard automatic
    evaluation metric BLEU, and should be directly comparable to those
    obtained by the Syntax for Statistical Machine Translation team. We expect
    this to lead to many insights about the differences between our approach
    and theirs. Another method of evaluation will be to estimate the tradeoff
    between final translation quality and amount of human effort invested, in
    a simulated post-editing scenario.
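    The re-scoring experiment reduces to sorting the n-best list by
    estimated confidence; a minimal sketch (the hypothesis strings are
    placeholders):

```python
def rescore_nbest(hypotheses, confidences):
    """Re-order an n-best translation list so that the hypothesis with
    the highest estimated probability of correctness comes first."""
    ranked = sorted(zip(hypotheses, confidences),
                    key=lambda pair: pair[1], reverse=True)
    return [hyp for hyp, _ in ranked]
```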

    ---------------------------------------------------------------------------



    This archive was generated by hypermail 2b29 : Tue Feb 11 2003 - 13:56:02 MET