Re[2]: Corpora: MS Word to text

mike_maxwell@sil.org
Fri, 03 Sep 1999 10:03:33 -0400

arishka@bay1bjt.net wrote:
>Here's a perl script that will translate your Word file to ASCII in Unix.
>Put it in your bin directory, name it word2txt, then do
>
>%chmod u+x word2txt
>%word2txt [file1] > [file2]
>######################################
>#!/usr/bin/perl
>
>while (<>) {
> # Delete control characters (0x00-0x1F) and upper ANSI characters (0xA0-0xFF).
> tr/\x00-\x1F\xA0-\xFF//d;
> print;
>}
>######################################

That will indeed strip out the control characters (0-31) and the upper ANSI
characters from 160-255, leaving you with a sort of ASCII file (still including
characters from 128-159). But it probably
won't give what I think Marco Antonio Esteves da Rocha was wanting, which (I
assume) is an ASCII file containing just the text. In particular, it leaves in
all the other sorts of information Word puts in the file--things like the name of
the person whose computer the Word doc was created on, information about any
modified formats used, the file name including the directory, some font
information (including munged versions of font tables for any fonts that were
embedded into the doc when it was saved), etc. It also removes any newlines
(although that could be fixed by changing the tr line in the Perl script).
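
For what it's worth, here is a minimal sketch of that kind of filter with the
newline problem fixed. It is in Python rather than Perl, and the set of bytes
to keep (printable ASCII plus tab, newline and carriage return) is my own
assumption about what "ASCII" should mean here. Unlike the script above, it
also drops characters 128-159. It still leaves in all the non-text
information that Word stores in the file.

```python
# Bytes to keep: printable ASCII (0x20-0x7E) plus tab, newline, carriage return.
# This set is an assumption, not something taken from the original script.
KEEP = bytes(range(0x20, 0x7F)) + b"\t\n\r"

# Every other byte value is deleted.
DROP = bytes(b for b in range(256) if b not in KEEP)


def strip_non_ascii(data: bytes) -> bytes:
    """Return data with every byte outside KEEP removed."""
    # bytes.translate(None, delete) deletes the listed bytes, no remapping.
    return data.translate(None, DROP)
```

To use it as a filter, read the file in binary mode (for instance from
sys.stdin.buffer), pass the contents through strip_non_ascii(), and write the
result out in binary mode as well.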

If the above Perl script doesn't do what you want, then you can use Word itself
to export the files into text format. Word can save a file in a number of
formats, including several forms of "text" (with or without line breaks for each
line, etc.). I just tried that on a Word doc that contained a "symbol" font
(the SIL Doulos IPA font, which Word thinks is a "symbol" font), and it
substituted question marks for all the symbol font chars. There are a number of
ways one could get around that, but the first thing would be to check whether
it's in fact a problem for the files you want to process by exporting a sample
file. If the Word doc has been typed in a standard font (and I assume
Portuguese can be represented in the standard ANSI font), there will probably be
no problem.

As for doing this on a bazillion files one by one, you could set up a Word macro
that would be invoked on opening a file, and which would automatically export
the file in a selected format to some directory (preserving the file name, but
automatically changing the suffix to .TXT or whatever), then close Word.
Assuming you're running Win95 (or a more recent version of Windows), you could then
invoke Word on every file in a given directory from an MS-DOS box using a
for-loop. It would be slow, but you wouldn't have to hand-step through the
process.
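
The looping part of that could be sketched roughly as below (in Python rather
than an MS-DOS for-loop). The names plan_exports, export_all and convert are
hypothetical, and convert is only a placeholder for whatever actually starts
Word with the export macro. Nothing in this sketch knows how to drive Word
itself. It just pairs each .doc file with an output file of the same name and
a .txt suffix, and calls convert on each pair.

```python
from pathlib import Path


def plan_exports(src_dir, out_dir, pattern="*.doc", suffix=".txt"):
    """Pair each Word file in src_dir with a same-named text file in out_dir."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    return [(doc, out_dir / (doc.stem + suffix))
            for doc in sorted(src_dir.glob(pattern))]


def export_all(src_dir, out_dir, convert):
    """Call convert(doc_path, txt_path) for every planned pair.

    `convert` is a placeholder supplied by the caller; it stands for
    whatever launches Word on doc_path and saves the text to txt_path.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pairs = plan_exports(src_dir, out_dir)
    for doc, txt in pairs:
        convert(doc, txt)
    return pairs
```

Passing a convert function that only prints its two arguments is a cheap way
to check the file names before letting Word loose on the whole directory.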

If you decide this is the way to go, and need more details on the above, let me
know.

Mike Maxwell
Mike_Maxwell@sil.org
Summer Institute of Linguistics