Re: Corpora: Using SARA to query other corpora than the BNC

From: Thomas Kuenneth (tommi@linguistik.uni-erlangen.de)
Date: Mon Jun 25 2001 - 10:53:04 MET DST

    First of all, please accept my apologies for having taken such a long time to
    respond. During the weekend I was quite busy preparing several things for
    COMPLEX2001.

    I'd then like to respond to Lou Burnard:

    > Good to see that we have some agreement on that at any rate.

    Well, it is absolutely necessary to have corpus data encoded in a
    well-documented format that can be read and interpreted by any system that is
    interested in the data. And metalanguages are undoubtedly an ideal basis for
    this interchange purpose.

    Nonetheless, I think that there are better ways to represent corpus data
    internally, inside the system. I was referring to this when I said:

    > > will handle SGML data describing 100 million annotated word forms
    > > efficiently.

    Your (I guess surprised) response was:

    > On what grounds do you make this assertion? I suppose it all depends what
    > you mean by "handle efficiently", but it's simply not true that NO
    > software can handle SGML data on that scale

    I was quite happy to see that Stefan Evert imagined just the right thing. :-)

    As you probably recall, he said:

    > Perhaps he should have written "raw SGML data", in which case I will
    > absolutely second that opinion. All XML encodings that I have seen so
    > far waste more space (in terms of characters) on markup than on the
    > actual data. An XML-encoded version of a 100 million word corpus (with
    > PoS and lemma annotations) will usually take up several gigabytes of
    > disk space.

    That - in a nutshell - is what I should have said. :-)
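
    To make the point about markup overhead concrete, here is a minimal sketch
    that measures how many characters of a small annotated sample belong to tags.
    The sample sentence, the element name w and the attribute names pos and lemma
    are assumptions made up for illustration; they are not taken from any
    particular corpus encoding. Note that this count treats attribute values
    (the PoS tags and lemmas) as part of the markup.

```python
import re

# Hypothetical sample: element and attribute names are illustrative only.
SAMPLE = (
    '<s n="1">'
    '<w pos="AT0" lemma="the">The</w> '
    '<w pos="NN1" lemma="corpus">corpus</w> '
    '<w pos="VBZ" lemma="be">is</w> '
    '<w pos="AJ0" lemma="large">large</w>'
    '</s>'
)

# Matches one complete tag, including its attributes.
TAG = re.compile(r"<[^>]*>")


def markup_share(text: str) -> float:
    """Return the fraction of characters in text that belong to tags."""
    markup_chars = sum(len(m.group(0)) for m in TAG.finditer(text))
    return markup_chars / len(text)


share = markup_share(SAMPLE)
print(f"{share:.0%} of the characters are markup")
```

    Even in this toy sample, well over half of the characters are tags rather
    than running text, which is the effect Stefan describes.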

    > And what would you advocate as an alternative?

    Well, my basic assumption is that implementing one's own database technology
    for storing corpus data raises a lot of problems that can be avoided if
    off-the-shelf systems are used instead. I claim that an RDBMS can in fact be an
    ideal basis for storing and retrieving corpus data. Since SQL is not really a
    user-friendly language (at least for linguists :-)), client programs provide a
    user interface and handle the actual communication with the RDBMS. That, by the
    way, is what I am going to talk about at Bham this week.
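
    As a rough illustration of this division of labour, here is a minimal sketch
    using SQLite from Python. It is not the system I will be presenting: the table
    name token, its columns, the sample rows and the helper find_by_lemma are all
    hypothetical, and SQLite merely stands in for whatever RDBMS is actually used.

```python
import sqlite3

# In-memory database standing in for an off-the-shelf RDBMS (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per annotated word form.
conn.execute(
    """CREATE TABLE token (
           position INTEGER PRIMARY KEY,
           word     TEXT NOT NULL,
           pos      TEXT NOT NULL,
           lemma    TEXT NOT NULL
       )"""
)
# An index lets the RDBMS answer lemma queries without scanning every row.
conn.execute("CREATE INDEX token_lemma ON token (lemma)")

# Made-up sample data, for illustration only.
rows = [
    (1, "The", "AT0", "the"),
    (2, "corpora", "NN2", "corpus"),
    (3, "were", "VBD", "be"),
    (4, "large", "AJ0", "large"),
]
conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?)", rows)


def find_by_lemma(connection, lemma, pos_prefix=""):
    """Client-side helper: builds the SQL so the user never has to write it."""
    sql = (
        "SELECT position, word, pos FROM token "
        "WHERE lemma = ? AND pos LIKE ? ORDER BY position"
    )
    return connection.execute(sql, (lemma, pos_prefix + "%")).fetchall()


print(find_by_lemma(conn, "corpus"))
```

    The linguist only supplies a lemma and, optionally, the beginning of a PoS
    tag; the client turns that into SQL, and indexing and retrieval are left to
    the RDBMS.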

    Regards
    Thomas

    ---
    Thomas Kuenneth M.A.           Universitaet Erlangen-Nuernberg
    Institut fuer Germanistik         Abteilung Computerlinguistik
    Bismarckstr. 6  *  D-91054 Erlangen  *  Tel.: +49 9131 8529250
    http://www.linguistik.uni-erlangen.de/~tommi
    


