Re: Corpora: french parser

Fiammetta NAMER (namer@clsh.univ-nancy2.fr)
Sun, 11 Oct 1998 16:55:46 +0100

Hello,

I am taking advantage of the message quoted below to announce the
availability of a program that performs inflectional morphological
analysis of French texts.

Fiammetta Namer
e-mail: namer@clsh.univ-nancy2.fr
Université Nancy 2
======================
Loris Cardullo wrote:
>
> Hi all members,
>
> I would now where I can found some morphological parser of French
> running in Win '95 .Thanks in advance.
>
> Best regards
>
> Loris Cardullo
>
> e-mail: palermo@imeuniv.unime.it

==================================================================
Announcement : availability of an inflectional analyser for French
==================================================================

I have written a Perl5 program, called FLEMM, that performs
inflectional analysis on French texts which have previously
been tagged (e.g. by the Brill tagger). It is a small program
(50 KB zipped) and mainly rule-based: only a 3000-word lexicon is
used, to deal with exceptions. It runs on PCs or workstations, under
Unix, Linux or Windows 95/NT. I have not tried it so far, but it may
also run on a Mac (provided that Perl5 is installed).

============
Availability
============
This program is currently being integrated within WinBrill, the
Windows port of the BRILL tagger (trained for French) at
the Inalf institute, and the whole system will be freely distributed
in November. The current version of WinBrill (without morphological
analysis) is available at the following URL:

http://www.ciril.fr/~gsouvay/WinBrill/

If you want to use the FLEMM program alone, you need to provide it
with a tagged text as input. For the time being, the only recognized
tagset is the Brill one; see

http://www.ciril.fr/INALF/inalf.presentation/analyseur.htm#Brill

FLEMM will soon be available on my forthcoming website.
In the meantime, I can send it by e-mail to those who want to try it.

===========
Description
===========
FLEMM computes the lemma of each inflected word (according
to the tag) and also provides its main morphological features :
- gender and number for adjectives, determiners, participles
- number for nouns
- gender, number, person and case for pronouns
- number, person, tense, mood and conjugation group for verbs

The first table below summarizes the output format of the analysed
words for each grammatical category; the second table lists the
possible values of the features:

==================================================================
Gram. cat. (TAG)   | Format
==================================================================
verbs (VCJ)        | InflectedWd/Tag:person:nb:tense:mood/Lemma:group/
==================================================================
participles,       |
adjectives, nouns, |
determiners,       |
relative pronouns  |
(VPAR, VNCNT,      |
ADJ1PAR, ADJ2PAR,  |
EPAR, APAR, ANCNT, |
ENCNT, ADJ, SBC,   |
DTN, DTC, REL)     | InflectedWd/Tag:gender:nb/Lemma/
==================================================================
personal pronouns  |
(PRO/PRV)          | InflectedWd/Tag:person:gender:nb:case/Lemma/
==================================================================
Other categories   | Word/Tag/Word
==================================================================

==================================================================
Feature | Possible values
==================================================================
person | 1p , 2p , 3p , _
gender | m, f, _
nb | s, p, _
tense | pst, impft, fut, ps
mood | ind, subj, cond, imper
group | 1g, 2g, 3g
case | n, a, d, o, _
==================================================================

Remarks :
---------
- "_" is the undefined value.
- The tense and mood values are morphosyntactic ones: "pst", "impft",
"fut" and "ps" respectively mean "present", "imperfect", "future" and
"simple past", while "ind", "subj", "cond" and "imper" stand for
"indicative", "subjunctive", "conditional" and "imperative".
- The case values are : (n)ominative, (a)ccusative, (d)ative and
(o)blique.
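
For readers who post-process the output, the abbreviations above can be
decoded with simple lookup tables. The following is a minimal Python
sketch, not part of FLEMM itself: the function name describe_verb is
hypothetical, and the readings "first/second/third" for 1p/2p/3p and
"singular/plural" for s/p are my own assumption about the obvious
meaning of those values.

```python
# Abbreviation tables transcribed from the feature table and remarks above.
# The person and number glosses are assumed readings of "1p/2p/3p" and "s/p".
PERSON = {"1p": "first", "2p": "second", "3p": "third", "_": "undefined"}
NUMBER = {"s": "singular", "p": "plural", "_": "undefined"}
TENSE = {"pst": "present", "impft": "imperfect",
         "fut": "future", "ps": "simple past"}
MOOD = {"ind": "indicative", "subj": "subjunctive",
        "cond": "conditional", "imper": "imperative"}


def describe_verb(features: str) -> dict:
    """Decode a 'person:nb:tense:mood' string such as '3p:s:impft:ind'."""
    person, number, tense, mood = features.split(":")
    return {
        "person": PERSON[person],
        "number": NUMBER[number],
        "tense": TENSE[tense],
        "mood": MOOD[mood],
    }
```

An unknown abbreviation raises a KeyError, which is probably the safest
behaviour when the tagset or the feature inventory changes.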

Ambiguous analyses are factored into disjunctive sets delimited by "{"
and "}", whose alternatives are separated by "|".

Examples:
ex1 : {bruissant/PPRES:m:s/bruisser:1g/|bruissant/PPRES:m:s/bruire:3g/}

ex2 : allions/VCJ:1p:{impft:ind|pst:subj}/aller:3g/
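
If a downstream tool needs one reading per line, the factored notation
shown in ex1 and ex2 can be expanded mechanically. Here is a small
Python sketch of that idea (the function name expand is hypothetical
and this is not FLEMM code). It assumes that "{", "}" and "|" occur
only as ambiguity delimiters, never as literal characters of a word.

```python
import re
from typing import List

# Matches one innermost disjunctive set such as "{impft:ind|pst:subj}".
_GROUP = re.compile(r"\{([^{}]*)\}")


def expand(analysis: str) -> List[str]:
    """Return every unambiguous reading encoded by a factored analysis."""
    match = _GROUP.search(analysis)
    if match is None:
        return [analysis]  # nothing left to expand
    readings = []
    for alternative in match.group(1).split("|"):
        # Substitute one alternative, then expand any remaining sets.
        candidate = (analysis[:match.start()] + alternative
                     + analysis[match.end():])
        readings.extend(expand(candidate))
    return readings


for reading in expand("allions/VCJ:1p:{impft:ind|pst:subj}/aller:3g/"):
    print(reading)
```

Applied to ex2, this yields one reading for the imperfect indicative and
one for the present subjunctive; applied to ex1, it yields the two
complete analyses (bruisser and bruire).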

=======
Example
=======
The example below illustrates the input format and the output result :

1) Input file :
---------------

La/DTN:sg IIIe/ADJ:sg République/SBC:sg nous/PRV:pl avait/ACJ:sg
promis/VPAR:sg que/SUB$ la/DTN:sg Première/SBP:sg Guerre/SBP:sg
mondiale/ADJ:sg serait/ECJ:sg aussi/ADV la/DTN:sg dernière/SBC:sg
,/, "/" la/DTN:sg der/SBC:sg des/DTC:pl der/SBC:sg "/" ;/; pour/PREP
tenir/VNCFF parole/SBC:sg ,/, elle/PRV:sg nous/PRV:pl offrit/VCJ:sg
la/DTN:sg ligne/SBC:sg Maginot/SBP:sg ,/, qui/REL eut/ACJ:sg l'/DTN:sg
utilité/SBC:sg que/SUB$ l'_on/PRV:sg sait/VCJ:sg ./.
Mais/COO il/PRV:sg serait/ECJ:sg malvenu/ADJ:sg de/PREP gloser/VNCFF
sur/PREP le/DTN:sg pitoyable/ADJ:sg désastre/SBC:sg de/PREP 1940/CAR ./.

Mieux/ADV vaut/VCJ:sg se/PRV:++ souvenir/VNCFF de/PREP l'/DTN:sg
éclatante/ADJ:sg victoire/SBC:sg de/PREP 1945/CAR ,/, victoire/SBC:sg
célébrée/ADJ2PAR:sg "/" entre_nous/ADV "/" ,/, puisque/SUB de/PREP
Gaulle/SBP:sg descendit/VCJ:sg seul/ADJ:sg les/DTN:pl
Champs-élysées/SBP:pl ,/, Churchill/SBP:sg restant/VNCNT à/PREP
Londres/SBP:sg ,/, Roosevelt/SBP:sg infirme/ADJ:sg à/PREP
Washington/SBP:sg ,/, et/COO Staline/SBP:sg à/PREP
Moscou/SBP:sg ./.

2) Output file :
----------------

La/DTN:f:s/le IIIe/ADJ:f:s/iii République/SBC:_:s/république
nous/PRV:1p:_:p:_/lui avait/ACJ:3p:s:impft:ind/avoir:3g
promis/VPAR:m:s/promettre que/SUB$/que la/DTN:f:s/le
Première/SBP/première Guerre/SBP/guerre mondiale/ADJ:f:s/mondial
serait/ECJ:3p:s:pst:cond/être:3g aussi/ADV/aussi la/DTN:f:s/le
dernière/SBC:_:s/dernière ,/, "/" la/DTN:f:s/le der/SBC:_:s/der
des/DTC:_:p/du der/SBC:_:s/der "/" ;/; pour/PREP/pour tenir/VNCFF/tenir
parole/SBC:_:s/parole ,/, elle/PRV:3p:f:s:{n|d|o}/lui
nous/PRV:1p:_:p:_/lui offrit/VCJ:3p:s:ps:ind/offrir:3g la/DTN:f:s/le
ligne/SBC:_:s/ligne Maginot/SBP/maginot ,/, qui/REL:_:_/qui
eut/ACJ:3p:s:ps:ind/avoir:3g l'/DTN:_:s/le utilité/SBC:_:s/utilité
que/SUB$/que l'_on/PRV:3p:m:s:_/l'_on
sait/VCJ:3p:s:pst:ind/savoir:3g ./.
Mais/COO/mais il/PRV:3p:m:s:n/lui serait/ECJ:3p:s:pst:cond/être:3g
malvenu/ADJ:m:s/malvenu de/PREP/de gloser/VNCFF/gloser sur/PREP/sur
le/DTN:m:s/le pitoyable/ADJ:_:s/pitoyable désastre/SBC:_:s/désastre
de/PREP/de 1940/CAR/1940 ./.
Mieux/ADV/mieux vaut/VCJ:3p:s:pst:ind/valoir:3g se/PRV:3p:_:_:{a|d}/lui
souvenir/VNCFF/souvenir de/PREP/de l'/DTN:_:s/le
éclatante/ADJ:f:s/éclatant victoire/SBC:_:s/victoire de/PREP/de
1945/CAR/1945 ,/, victoire/SBC:_:s/victoire
célébrée/ADJ2PAR:f:s/célébrer "/" entre_nous/ADV/entre_nous "/" ,/,
puisque/SUB/puisque de/PREP/de Gaulle/SBP/gaulle
descendit/VCJ:3p:s:ps:ind/descendre:3g seul/ADJ:m:s/seul
les/DTN:_:p/le Champs-élysées/SBP/champs-élysées ,/,
Churchill/SBP/churchill restant/VNCNT:m:s/rester:1g à/PREP/à
Londres/SBP/londres ,/, Roosevelt/SBP/roosevelt
infirme/ADJ:_:s/infirme à/PREP/à Washington/SBP/washington ,/,
et/COO/et Staline/SBP/staline à/PREP/à Moscou/SBP/moscou ./.
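
To give an idea of how such an output file can be read back by another
program, here is a minimal Python sketch (not part of FLEMM; the names
Analysis and parse_token are hypothetical). It assumes whitespace-
separated tokens as in the example above, treats the trailing "/" as
optional, and handles unambiguous tokens only: tokens containing
"{...}" sets are not handled by this sketch.

```python
from typing import List, NamedTuple, Optional


class Analysis(NamedTuple):
    word: str
    tag: str
    features: List[str]    # e.g. ["3p", "s", "impft", "ind"]
    lemma: Optional[str]   # None for tokens such as ",/,"
    group: Optional[str]   # conjugation group, verbs only


def parse_token(token: str) -> Analysis:
    """Split one 'Word/Tag:features/Lemma:group' token into its fields."""
    parts = token.rstrip("/").split("/")
    if len(parts) < 2:
        raise ValueError("not a Word/Tag token: %r" % token)
    word = parts[0]
    tag, *features = parts[1].split(":")
    lemma = group = None
    if len(parts) > 2:
        # "avoir:3g" -> lemma "avoir", group "3g"; no colon -> no group.
        lemma, _, group_text = parts[2].partition(":")
        group = group_text or None
    return Analysis(word, tag, features, lemma, group)


with_group = parse_token("avait/ACJ:3p:s:impft:ind/avoir:3g")
print(with_group.lemma, with_group.group, with_group.features)
```

Punctuation tokens such as ",/," simply come back with an empty feature
list and no lemma.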

=====================
Other Functionalities
=====================
In addition, FLEMM detects and corrects some segmentation and tagging
errors. On request, the detected errors, together with the
corresponding corrections, are reported in separate log files.

Examples :

1) tagging log file
-------------------

phytoplancton / VNCFF ==> phytoplancton/SBC
phytoplanctivores / ADJ2PAR ==> phytoplanctivores/ADJ

2) Segmentation log file
-------------------------

,inhibiteurs est réduit à inhibiteurs (SBC)
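
The tagging log lines shown above are regular enough to be collected
automatically, for instance to count which tags are most often
corrected. The Python sketch below is only an illustration: the line
layout ("word / OLD_TAG ==> word/NEW_TAG") is inferred from the two
sample lines, and the function name parse_tagging_correction is
hypothetical.

```python
import re
from typing import Optional, Tuple

# Layout inferred from the two sample lines: "word / OLD ==> word/NEW".
_CORRECTION = re.compile(r"(\S+)\s*/\s*(\S+)\s*==>\s*(\S+)/(\S+)")


def parse_tagging_correction(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (word, old_tag, new_tag), or None if the line does not match."""
    match = _CORRECTION.fullmatch(line.strip())
    if match is None:
        return None
    word, old_tag, _corrected_word, new_tag = match.groups()
    return word, old_tag, new_tag


print(parse_tagging_correction("phytoplancton / VNCFF ==> phytoplancton/SBC"))
```

Lines that do not follow this layout are returned as None rather than
guessed at.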

=================
Program structure
=================

init_lemm.perl
entrees_sorties
lemmatizer
exceptions
EXCEP/

The startup program file is "init_lemm.perl". It calls the
"entrees_sorties" module, which handles the input and output
formats and in turn calls the main module, "lemmatizer".
That module performs the morphological analysis and calls the
exception-list handler (the "exceptions" module, which reads the
exception files in the "EXCEP" directory).

==============
Command line :
==============

perl init_lemm.perl --entree INPUT_FILE
(--repertoire PROGRAM_DIRECTORY)
(--sortie OUTPUT_FILE)
(--log |--nolog)

All option values (INPUT_FILE, PROGRAM_DIRECTORY, OUTPUT_FILE) are
path names.
Only --entree INPUT_FILE is mandatory; the other options are optional.

- if the --repertoire option is not provided, the working directory
defaults to the current directory ( . ).
- if the --sortie option is not provided, the output file defaults to
the INPUT_FILE path with the ".lemm" extension appended.
- if --log is given, the files INPUT_FILE.seg and INPUT_FILE.etiq are
created; they store, respectively, the segmentation and the tagging
errors detected and corrected by the program. If --nolog is given, or
if neither option is given, no log files are produced.
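
For those who want to call FLEMM from a script and need to know in
advance which files will be written, the default rules above can be
summarized as follows. This is a Python sketch of the documented
behaviour only, not a transcription of the Perl code; the function name
resolve_paths and its parameter names are hypothetical.

```python
from typing import Optional


def resolve_paths(input_file: str,
                  output_file: Optional[str] = None,
                  program_dir: Optional[str] = None,
                  log: bool = False) -> dict:
    """Apply the documented defaults of --repertoire, --sortie and --log."""
    return {
        # --repertoire defaults to the current directory.
        "repertoire": program_dir if program_dir is not None else ".",
        # --sortie defaults to INPUT_FILE with ".lemm" appended.
        "sortie": (output_file if output_file is not None
                   else input_file + ".lemm"),
        # With --log: segmentation (.seg) and tagging (.etiq) log files.
        "logs": ([input_file + ".seg", input_file + ".etiq"]
                 if log else []),
    }


print(resolve_paths("f:/DATA/fic1", log=True))
```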

Examples of command lines, each followed by the message displayed on
standard output before the analysis starts:

f:\LEMMAT\PGM> perl init_lemm.perl --entree f:/DATA/fic1 --log

Valeur par defaut du repertoire de travail : .
Fichier d'entree : f:\DATA\fic1
Par defaut, le fichier de sortie s'appelle : f:\DATA\fic1.lemm
Les fichiers log s'appellent f:\DATA\fic1.etiq (etiquetage) et
f:\DATA\fic1.seg (segmentation)

f:\LEMMAT> perl init_lemm.perl --entree f:/DATA/fic1
--repertoire ./PGM --sortie ./fic1.out

Repertoire de travail : .\PGM
Fichier d'entree : f:\DATA\fic1
Fichier de sortie : f:\LEMMAT\fic1.out
Les fichiers log s'appellent f:\DATA\fic1.etiq (etiquetage) et
f:\DATA\fic1.seg (segmentation)
=================================================