Re: Corpora: MS Word to text

Ted E. Dunning (ted@hncais.com)
Fri, 3 Sep 1999 14:20:29 -0700 (PDT)

>>>>> "mm" == mike maxwell <mike_maxwell@sil.org> writes:

mm> arishka@bay1bjt.net. wrote:
>> Here's a perl script that will translate your Word file to
>> ASCII in Unix.

mm> That will indeed strip out the lower and some upper ANSI
mm> characters, leaving you with a sort of ASCII file (including
mm> characters from 128-159).

I would like to point out further that with more recent versions of
Word, you won't get any of the text out with this script.

mm> As for doing this on a bazillion files one by one, you could
mm> set up a Word macro that would be invoked on opening a file,

"Word macro" is a bit of Microsoft-oriented jargon and thus may not be
clear to all. I will try to explain in an alternative jargon. :-)

The key point here is that all major Microsoft applications have an
embedded BASIC interpreter which allows access to pretty much all of
the functionality of the underlying application. Thus, you can write
BASIC programs for Word or for Excel or for Access.

<unix_bigot_warning> Of course, this access is completely ad-hoc,
internally inconsistent and not subject to simple logical description.
It is also not documented well anywhere </unix_bigot_warning>.

<lisp_bigot_warning> Furthermore, the underlying extension language is
also a terrible choice from a technical standpoint since it doesn't
have a decent security model, reasonable reflection or any degree of
introspection </lisp_bigot_warning>.

<voice_of_reason> All these horrific technical defects
notwithstanding, you *can* still do useful things in
{Word,Excel...}Basic. You just can't do truly wondrous things, and
most people have little use for the truly wondrous.
</voice_of_reason>

You can therefore build a WordBasic program which will read every file
in a directory and convert it to text. Recent versions of Word (at
least on NT) can export text as Unicode, which is particularly nice if
you are working with multilingual documents. Once you have documents
in Unicode, you can do all sorts of things.
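
As a rough illustration of the "once you have Unicode" step, here is a
minimal Python sketch that re-encodes a directory of exported text
files as UTF-8. It assumes the exports are UTF-16 with a byte-order
mark, which is my understanding of what Word's Unicode text export
produces but is not confirmed here. The function name, the "*.txt"
pattern and the directory names are hypothetical.

```python
from pathlib import Path


def convert_exports_to_utf8(src_dir, dst_dir, pattern="*.txt"):
    """Re-encode every matching file in src_dir from UTF-16 to UTF-8.

    Assumption: each input file is UTF-16 and starts with a byte-order mark.
    Returns the list of output paths that were written.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob(pattern)):
        # The "utf-16" codec uses the byte-order mark to pick the
        # endianness and drops the mark from the decoded text.
        text = path.read_bytes().decode("utf-16")
        # Normalise DOS and old Mac line endings to "\n".
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        out = dst / path.name
        out.write_bytes(text.encode("utf-8"))
        written.append(out)
    return written
```

Something like convert_exports_to_utf8("word_exports", "corpus_utf8")
would then leave one UTF-8 file per export in the second directory. A
file that is not valid UTF-16 raises UnicodeDecodeError rather than
being silently mangled, which seems the safer default for a corpus.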

I personally recommend Tcl version 8.1 and above for its ability to
work very nicely with all sorts of character set encodings. Java has
a similar ability to read and convert data, but interesting text
processing would be much more difficult in it. Perl is getting or
has recently gotten reasonable support for Unicode, but may have a
lower-level emphasis than most people would like.
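
To make "working with character set encodings" concrete, here is a
small sketch in Python (not one of the languages discussed above, and
used only for illustration). It assumes the legacy bytes are in
Windows-1252, one common "ANSI" code page, where the 128-159 range
holds printable characters such as curly quotes. The transcode helper
is a hypothetical name.

```python
def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes from source_encoding and re-encode as target_encoding."""
    return data.decode(source_encoding).encode(target_encoding)


# Assumption: the input is Windows-1252, where bytes 0x93 and 0x94
# are the left and right curly double quotes.
legacy = b"\x93quoted\x94"
text = legacy.decode("cp1252")
print([hex(ord(ch)) for ch in (text[0], text[-1])])  # ['0x201c', '0x201d']

# The same bytes converted directly to UTF-8.
utf8_bytes = transcode(legacy, "cp1252")
```

The point of going through a real decode step is that nothing in the
128-159 range gets stripped or guessed at. Each byte maps to a defined
Unicode character, or the decode fails loudly.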