[Corpora-List] Wordnet file format

From: Yuri Leikind (YuriLeikind@scnsoft.com)
Date: Fri Aug 23 2002 - 16:01:45 MET DST

Next message: American Association for Applied Linguistics (AAAL) 2003: "[Corpora-List] FINAL REMINDER: AAAL 2003 Call for Papers deadline is August 26!"

Previous message: Tanja Gaustad: "[Corpora-List] 2nd Call for Papers CLIN 2002"

Hello all,

Maybe someone on the list can help me understand
how the WordNet database is organised.

Here is a typical entry in the file data.verb:

01288779 36 v 03 conduct 0 lead 0 direct 2 002 @ 01274998 v 0000 $ 01289007 v 0000 01 + 08 00 | lead, as in the performance of a musical composition; "conduct an orchestra; Bairenboim conducted the Chicago symphony for years"

This line represents a synset with the words "conduct", "lead", and "direct".

The numbers after each word are the so-called lex_ids:
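As an illustration, here is a minimal Python sketch that pulls the leading fields, the word/lex_id pairs, and the gloss out of the line above. It assumes the field order given in wndb(5WN): synset_offset, lex_filenum, ss_type, a hexadecimal w_cnt, then one lemma and one lex_id per word. The function name parse_data_line is made up for this example; pointers and verb frames are left unparsed.

```python
def parse_data_line(line):
    """Split one data.* line into its leading fields, words, and gloss.

    Assumes the wndb(5WN) layout; pointers and frames are not parsed.
    """
    fields_part, _, gloss = line.partition("|")
    tokens = fields_part.split()
    w_cnt = int(tokens[3], 16)  # word count is written in hexadecimal
    words = []
    for i in range(w_cnt):
        lemma = tokens[4 + 2 * i]
        lex_id = int(tokens[5 + 2 * i], 16)  # lex_id is one hexadecimal digit
        words.append((lemma, lex_id))
    return {
        "synset_offset": tokens[0],
        "lex_filenum": int(tokens[1]),
        "ss_type": tokens[2],
        "words": words,
        "gloss": gloss.strip(),
    }


line = ('01288779 36 v 03 conduct 0 lead 0 direct 2 002 @ 01274998 v 0000 '
        '$ 01289007 v 0000 01 + 08 00 | lead, as in the performance of a '
        'musical composition; "conduct an orchestra; Bairenboim conducted '
        'the Chicago symphony for years"')
entry = parse_data_line(line)
print(entry["words"])
```

The offset is kept as a string so that its leading zero survives.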

lex_id    One digit hexadecimal integer that, when appended onto lemma, uniquely identifies a sense within a
          lexicographer file. lex_id numbers usually start with 0, and are incremented as additional senses of the
          word are added to the same file, although there is no requirement that the numbers be consecutive or
          begin with 0. Note that a value of 0 is the default, and therefore is not present in lexicographer
          files.

OK, I get it: "conduct" in meaning 0, "direct" in meaning 2.

But in the output of the wn program the sense numbers are different:

1. (40) lead, take, direct, conduct, guide -- (take somebody somewhere; "We lead him to our chief"; "can you take me to the main entrance?"; "He conducted us to the palace")
..........
10. (8) conduct, lead, direct -- (lead, as in the performance of a musical composition; "conduct an orchestra; Bairenboim conducted the Chicago symphony for years")

Here, our "lead" in meaning 0 is sense number 10.

How these sense numbers are obtained is explained in the docs:

Sense Numbers
       Senses in WordNet are generally ordered from most to least frequently used, with the most common sense numbered 1.
       Frequency of use is determined by the number of times a sense is tagged in the various semantic concordance texts.
       Senses that are not semantically tagged follow the ordered senses. The tagsense_cnt field for each entry in the
       index.pos files indicates how many of the senses in the list have been tagged.

The cntlist(5WN) file provided with the database lists the number of times each sense is tagged in the semantic
concordances. The data from cntlist is used by grind(1WN) to order the senses of each word.
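The ordering rule quoted above can be sketched in a few lines of Python. This is only a toy model, not what grind(1WN) actually does: the function name number_senses is invented, the third sense is a hypothetical untagged one, and keeping the input order for ties is an assumption. The two tag counts (40 and 8) are the ones shown in the wn output earlier.

```python
def number_senses(senses):
    """Order (label, tag_count) pairs from most to least frequently tagged.

    Untagged senses (count 0) end up last. Ties keep their input order,
    which is an assumption and not documented grind behaviour.
    """
    ordered = sorted(senses, key=lambda s: -s[1])  # sorted() is stable
    return [(n, label, cnt) for n, (label, cnt) in enumerate(ordered, start=1)]


senses = [
    ("conduct, lead, direct", 8),
    ("hypothetical untagged sense", 0),
    ("lead, take, direct, conduct, guide", 40),
]
numbered = number_senses(senses)
for n, label, cnt in numbered:
    print(f"{n}. ({cnt}) {label}")
```

With only three toy senses the numbers are 1 to 3; in the real database the second one is sense 10 because eight other senses of the word have higher tag counts.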

Now the questions:

1) Where can I see the so-called lexicographer files?

2) What does it mean that a lex_id value of 0 is the default?

3) Sense numbers are obtained via the cntlist file. I was unable to find an explanation of the format of this file:
27 lead%2:42:12:: 3

4) If a synset can be viewed as a set of words sharing a common meaning, and each word has its own lex_id, which is
   also a unique meaning identifier for that word, then how is it possible that there are different synsets in which
   one and the same word has the same lex_id?
   For example:

4258476 spark_advance 0 lead 1
3077077 jumper_cable 0 jumper_lead 0 lead 1

To me this makes no sense, or I am missing something important.

I'd be grateful if someone could enlighten me.

___
Best regards,
Yuri Leikind

To iterate is human,
to recurse is divine.


This archive was generated by hypermail 2b29 : Fri Aug 23 2002 - 16:12:16 MET DST