[Corpora-List] On tools for indexing and searching large corpora

From: Serge Sharoff (sharoff@aha.ru)
Date: Tue Nov 19 2002 - 12:03:59 MET

Next message: mdavies@ilstu.edu: "Re: [Corpora-List] On tools for indexing and searching large corpora"

Previous message: geoffrey.williams: "Re: [Corpora-List] Plea for help"
Next in thread: mdavies@ilstu.edu: "Re: [Corpora-List] On tools for indexing and searching large corpora"
Reply: mdavies@ilstu.edu: "Re: [Corpora-List] On tools for indexing and searching large corpora"
Reply: Sylvain Loiseau: "Re: [Corpora-List] On tools for indexing and searching large corpora"
Reply: Pavel Rychly: "Re: [Corpora-List] On tools for indexing and searching large corpora"
Reply: Serge Sharoff: "Re: [Corpora-List] On tools for indexing and searching large corpora"

Dear all,

I'm compiling a corpus of modern Russian comparable to the BNC in size
and coverage. The corpus format is based on TEI, for instance:
<s id="nashi.535">
...
   <w>глава
      <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
      <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
   </w>
   <w>Владивостока
      <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
   </w>
...
</s>
In the first case, the POS tagger detects an ambiguity it cannot resolve
between two possible readings (masculine animate, i.e. "the head of", and
feminine inanimate, i.e. "the chapter of"), so both analyses are kept.
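
To make the structure concrete, here is a minimal Python sketch of reading
one such sentence and keeping every analysis attached to its word. It is
only an illustration, not the tooling used for the corpus: the function name
read_sentence and the tuple layout are assumptions, and the sample is the
fragment above with the elided parts left out.

```python
import xml.etree.ElementTree as ET

# The sentence fragment from the message, without the elided ("...") parts.
SAMPLE = """<s id="nashi.535">
   <w>глава
      <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
      <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
   </w>
   <w>Владивостока
      <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
   </w>
</s>"""


def read_sentence(xml_text):
    """Return (sentence id, [(word form, [(lemma, pos, feats), ...]), ...]).

    Hypothetical helper: a word with an unresolved ambiguity simply
    carries more than one analysis in its list.
    """
    s = ET.fromstring(xml_text)
    tokens = []
    for w in s.iter("w"):
        # The word form is the text that precedes the first <ana> child.
        form = (w.text or "").strip()
        analyses = [
            (a.get("lemma"), a.get("pos"), tuple(a.get("feats", "").split(",")))
            for a in w.findall("ana")
        ]
        tokens.append((form, analyses))
    return s.get("id"), tokens


sent_id, tokens = read_sentence(SAMPLE)
for form, analyses in tokens:
    print(form, len(analyses))
```

The ambiguous word comes back with two analyses and the unambiguous one
with a single analysis, so later processing can decide how to treat each.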

Currently I search the corpus with custom tools written in Perl and based
on regular expressions. As the corpus grows (currently 40 million words),
this indexing scheme is becoming too inefficient, and I'm reluctant to
reinvent the wheel by improving it.
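
For comparison with regex scanning, the following is a minimal sketch of an
inverted index that copes with ambiguity by filing a token under every one
of its analyses. It is not the technology of the BNC or of any existing
tool; the in-memory corpus layout and the names build_index and lookup are
assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical in-memory corpus: sentence id -> list of tokens, each token
# being (word form, [(lemma, pos, feats), ...]) with one entry per analysis.
corpus = {
    "nashi.535": [
        ("глава", [("глава", "noun", "мр,од,ед,им"),
                   ("глава", "noun", "жр,но,ед,им")]),
        ("Владивостока", [("Владивосток", "noun", "мр,но,ед,рд,геог")]),
    ],
}


def build_index(corpus):
    """Map (attribute, value) keys to sets of (sentence id, token offset)."""
    index = defaultdict(set)
    for sent_id, tokens in corpus.items():
        for offset, (form, analyses) in enumerate(tokens):
            where = (sent_id, offset)
            index[("form", form.lower())].add(where)
            # An ambiguous token is indexed under each candidate analysis.
            for lemma, pos, feats in analyses:
                index[("lemma", lemma.lower())].add(where)
                index[("pos", pos)].add(where)
                for feat in feats.split(","):
                    index[("feat", feat)].add(where)
    return index


def lookup(index, *keys):
    """Intersect the postings of all keys; no scan over the text is needed."""
    postings = [index.get(key, set()) for key in keys]
    return set.intersection(*postings) if postings else set()


index = build_index(corpus)
print(lookup(index, ("lemma", "глава"), ("feat", "жр")))
```

One caveat of this simple layout: constraints are intersected per token,
not per analysis, so a query combining features from two different readings
of the same word would still match. Indexing each analysis separately would
avoid that at the cost of a larger index.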

What is the technology used in the BNC and other annotated corpora of
similar size? Can it be applied in this case (given the need to cope with
possible ambiguity)? The corpus uses Win-1251 encoding, but eventually I
plan to convert it to Unicode. Any suggestions?
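
As a small illustration of the planned conversion, here is a sketch that
re-encodes Windows-1251 bytes as UTF-8 using Python's built-in "cp1251"
codec. The function name cp1251_to_utf8 is an assumption; for real XML
files, any encoding declaration in the XML prolog would have to be updated
as well.

```python
def cp1251_to_utf8(data: bytes) -> bytes:
    """Decode Windows-1251 bytes and re-encode them as UTF-8."""
    return data.decode("cp1251").encode("utf-8")


# Simulate a fragment of a Win-1251 corpus file.
raw = "глава Владивостока".encode("cp1251")
converted = cp1251_to_utf8(raw)

# Cyrillic letters take one byte in Windows-1251 and two in UTF-8.
print(len(raw), len(converted))
print(converted.decode("utf-8"))
```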

Best,
Serge

This archive was generated by hypermail 2b29 : Tue Nov 19 2002 - 12:09:31 MET