Re: Corpora: MS Word to text

Arishka (arishka@bay1.bjt.net)
Thu, 2 Sep 1999 18:56:30 -0700 (PDT)

Here's a Perl script that strips a Word file down to plain ASCII text on Unix.
Save it as word2txt in your bin directory, then run:

% chmod u+x word2txt
% word2txt [file1] > [file2]

Please write if you have questions.

######################################
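#!/usr/bin/perl
# Keep tab, LF, CR and printable ASCII; delete every other byte
# (control characters and all 8-bit characters, accented letters included).

while (<>) {
    tr/\x09\x0A\x0D\x20-\x7E//cd;
    print;
}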
######################################
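
Since the original question was about converting many files without doing
them one by one, a loop along these lines might help. This is only a sketch:
it assumes a Bourne-style shell (sh or bash, not csh), that word2txt is
already on your PATH, and that your files end in .doc. The function name
convert_all is made up for illustration.

```shell
# Hypothetical helper: run word2txt on every .doc file in a directory
# and write the result next to it with a .txt extension.
convert_all() {
    dir=${1:-.}                          # directory to scan; defaults to the current one
    for f in "$dir"/*.doc; do
        [ -e "$f" ] || continue          # no .doc files: skip the unexpanded pattern
        word2txt "$f" > "${f%.doc}.txt"  # foo.doc -> foo.txt
    done
}
```

Calling convert_all with no argument works on the current directory;
convert_all /path/to/corpus works on another one. The original .doc files
are left untouched.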

Ari

On Thu, 2 Sep 1999, Marco Antonio Esteves da Rocha wrote:

> Dear all,
> Someone has collected a sizable corpus of literary works and documents
> written in Brazilian Portuguese throughout the nineteenth century. It is a
> valuable asset for us here and it has all been typed in MS Word, thus it is
> impossible to use all those software resources you all know. Does anyone
> know about a way to transform these .doc files into ASCII text files
> without having to do that one by one? If you feel tempted to suggest
> sitting on the curb and crying, please don't.
> Marco Rocha
> marcor@cce.ufsc.br
>
>