Corpora: Wordclassed tagged Swedish corpus on the web

Daniel Ridings (ridings@svenska.gu.se)
Sat, 24 Jan 1998 16:50:56 +0100 (MET)

There is now access to 10,000,000 words of corpus material via the web.
It has been tagged with the Swedish version of the PAROLE tagset (156 tags).
It is possible to search for individual words, phrases, or tags, thus making
it possible to extract patterns based on the morphosyntactic tags.
(see http://www2.echo.lu/langeng/en/le2/le-parole/le-parole.html for
more information about PAROLE)

Please note that in the following examples the quotation marks are part of
the query language and are therefore essential.

Truncation is not a simple * but .* (period-star).

WORD (followed by truncated examples)

[word="skattemedel"]
(tax revenue)

Since word searches are so frequent, the above can be abbreviated to:

"skattemedel"

With truncations:

"skatte.*"
".*medel"
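
As a rough illustration of what period-star does, the following sketch uses
Python's re module and assumes that a pattern has to match the whole word
form, which is how the truncation examples above behave. The word list and
the helper name matching_words are invented for the illustration; this is
not the server's own implementation.

```python
import re


def matching_words(pattern, words):
    """Return the words that the pattern matches in full (not as a substring)."""
    compiled = re.compile(pattern)
    return [w for w in words if compiled.fullmatch(w)]


# Hypothetical word list, for illustration only.
words = ["skattemedel", "skattebetalare", "läkemedel", "medel", "skatt"]

print(matching_words("skatte.*", words))     # right truncation
print(matching_words(".*medel", words))      # left truncation
print(matching_words("skattemedel", words))  # exact word form
```

Note that under this assumption a bare "skatt" does not find "skattemedel";
the truncation has to be written out as "skatt.*".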

PHRASES

"för" "egen" "del"
(for his/her part)

"för" [] "del"
(för followed by any word and then "del")

"för" []{1,3} "del"
(för followed by 1-3 words then followed by "del")

"för" []{1,3} "del" within S
(as above, but limited in range to within an s-unit (sentence))
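
The phrase patterns above can be paraphrased in ordinary code. The sketch
below is only meant to show what "för" []{1,3} "del" within S asks for; it
assumes that the input is the token list of a single sentence (so a match
can never cross a sentence boundary) and it reports every possible match
rather than following whatever matching strategy the query engine uses. The
function name find_phrases and the sample sentence are made up.

```python
def find_phrases(tokens, first, last, min_gap, max_gap):
    """Return (start, end) index pairs where `first` is followed by
    min_gap..max_gap arbitrary tokens and then by `last`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + 1 + gap  # position where `last` would have to be
            if j < len(tokens) and tokens[j] == last:
                hits.append((i, j))
    return hits


# Made-up sentence, already split into tokens.
sentence = "han har för sin egen del inget emot det".split()

# "för" []{1,3} "del"
print(find_phrases(sentence, "för", "del", 1, 3))
# "för" [] "del" (exactly one word in between)
print(find_phrases(sentence, "för", "del", 1, 1))
```

Here the first call finds "för sin egen del" (two words in between), while
the second finds nothing because exactly one intervening word is required.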

TAGS (msd = MorphoSyntactic Description)

[msd="DF@US@S"] []{0,4} [msd="NCUSN@DS"]
(all NP's consisting of a determiner in the definite form, 0-4 words and
a noun with genus=utrum, numerus=singular, case=normal and the feature for
definite or indefinite set at "definite").

Similar searches can be done for prepositions (msd=SPS).

The first letter of a tag gives the word class (N=noun, A=adjective,
V=verb, S=preposition, R=adverb, D=determiner, etc.). The other positions
are features: genus, numerus, case and definite/indefinite for nouns; mood,
tense and passive/active (actually s-form, not passive, for these tags) for
verbs. There are two tables providing correspondences between the Swedish
PAROLE tags and the tags used in the Stockholm-Umeå Corpus.
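
For anyone post-processing search results, the tag layout described above
can be unpacked along these lines. This is a minimal sketch: the table only
contains the six word-class letters listed in this message (the full tagset
has more), the names WORD_CLASSES, word_class and features are invented
here, and no attempt is made to name the individual feature positions.

```python
# Word-class letters as listed in this message; the real tagset has more.
WORD_CLASSES = {
    "N": "noun",
    "A": "adjective",
    "V": "verb",
    "S": "preposition",
    "R": "adverb",
    "D": "determiner",
}


def word_class(msd):
    """Map the first letter of a tag to its word class."""
    return WORD_CLASSES.get(msd[:1], "unknown")


def features(msd):
    """Return the remaining positions of the tag, one character each."""
    return list(msd[1:])


for tag in ["DF@US@S", "NCUSN@DS", "SPS"]:
    print(tag, word_class(tag), features(tag))
```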

Fairly advanced queries can be made. For example, the periphrastic future
in Swedish consists of "kommer" (come) + infinitive marker ("att") + infinitive.
In recent times, the infinitive marker is being left out with growing
frequency. This can be confirmed by comparing the corpus from 1965 with the
PAROLE corpus (both are on the web page). The search string would be as
follows:

"kommer" [word!="att" & msd!="(V@I.*|FI)"]{0,4} [msd="V@N.*"] within S

!= (not equal), & (and), | (or)

"kommer" followed by 0-4 words (which are not "att", not verbs in the
indicative, and not internal punctuation (FI)), followed by an infinitive
(V@N.*), all within a sentence.
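
To make the logic of that query concrete, here is a small sketch that
applies the same conditions to one sentence given as (word, msd) pairs. It
assumes, as in the truncation examples, that a pattern must match the whole
tag, and it reports every possible match. The function names and the sample
sentences are invented, and the tag values in the samples are illustrative
only and have not been checked against the tagset.

```python
import re

# Conditions taken from the query: [word!="att" & msd!="(V@I.*|FI)"]
GAP_MSD_EXCLUDED = re.compile(r"(V@I.*|FI)")
# Final token of the query: [msd="V@N.*"]
INFINITIVE = re.compile(r"V@N.*")


def gap_token_ok(word, msd):
    """True if the token may stand between "kommer" and the infinitive."""
    return word != "att" and not GAP_MSD_EXCLUDED.fullmatch(msd)


def find_kommer_infinitive(sentence, max_gap=4):
    """Return (start, end) index pairs for "kommer" ... infinitive matches."""
    hits = []
    for i, (word, _) in enumerate(sentence):
        if word != "kommer":
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j >= len(sentence):
                break
            if INFINITIVE.fullmatch(sentence[j][1]):
                hits.append((i, j))
            # A longer gap is only possible if this token is allowed in it.
            if not gap_token_ok(*sentence[j]):
                break
    return hits


# Illustrative sentences; the tags are examples, not verified output.
with_att = [("han", "PF@USS@S"), ("kommer", "V@IPAS"),
            ("att", "CIS"), ("resa", "V@N0AS")]
without_att = [("han", "PF@USS@S"), ("kommer", "V@IPAS"),
               ("inte", "RG0S"), ("resa", "V@N0AS")]

print(find_kommer_infinitive(with_att))
print(find_kommer_infinitive(without_att))
```

The sentence containing "att" yields no match, while the one without it
does, which is exactly the contrast the query is meant to bring out.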

The same query run against the contemporary material and the material from
thirty years ago is revealing.

The address is: http://ldb20.svenska.gu.se

The query engine is the one from IMS in Stuttgart (Oli Christ et al.).
http://www.ims.uni-stuttgart.de/Tools/CorpusTools

Granted, not everyone is interested in Swedish, but for those who are, this
could be quite helpful. Eventually the interface will be made nicer
and in the course of the next few weeks the material will be
lemmatized. The tagging has been performed by yours truly with a version
of Eric Brill's tagger that I'm working on. I'm almost satisfied with it,
but not quite. There are mistakes, and this tagged version is one phase in
my efforts to improve the tagger.

Daniel Ridings
Språkdata
Göteborgs universitet
Sweden