Corpora: Belated Summary: MS Word to text

Philip Resnik (resnik@umiacs.umd.edu)
Wed, 27 Oct 1999 12:38:30 -0400 (EDT)

Greg Kondrak <kondrak@cs.toronto.edu> wrote:
> Have you ever received a useful response to that question?
> I have some MS Word files right now and I would like to convert them
> to ascii on a Unix platform.

Shoot, I forgot to post a summary. Sorry for the delay! The solution
I wound up with, thanks to many people suggesting it, was to use
StarOffice, an MS-compatible office suite that is freely available
on the Web. I've used it and it's very nice -- it was trivial to
install under Solaris and it does what I need it to. You can find it
at:

http://www.sun.com/staroffice/

Alternative suggestions included:

- A Linux package, rtf2latex, that can convert Word documents to LaTeX.

- http://www.csn.ul.ie/~caolan/docs/MSWordView.html
which will convert MSWord to HTML.

- http://www.w3.org/Tools/Word_proc_filters.html
for a bunch of links for conversion routines, some of which even work
on Unix.

- http://www.fz-juelich.de/isr/1/texconv/pctotex.html
for a number of PC to Unix converters.

- catdoc (a Unix utility written by V.B.Wagner <vitus@fe.msk.su>)

- There are at least two commercial solutions, one from Inso
(OutsideIn) and one from Verity (KeyView). Both can convert many
versions of Word (and a number of other formats) to HTML. You might
have to cleanse the resulting HTML for your purposes, since the two
products tend to focus on layout fidelity.
I know that Inso has a web server version of OutsideIn that I'm
betting can be set up on various Unix servers (they support Solaris,
HP-UX and AIX if I'm not mistaken), and I know that Verity supports
Unix as well, but they might only have an SDK product available.
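
For the routes above that produce HTML rather than plain text, the
cleanup step might look roughly like the following. This is only a
minimal sketch using Python's standard html.parser module, not
something any of the respondents suggested; the names TextExtractor
and html_to_text are made up for illustration, and the set of tags
treated as line breaks is an assumption you would likely want to tune
for the HTML a given converter emits.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, dropping tags, scripts and styles.

    Hypothetical helper for illustration; not part of any converter
    mentioned in this message.
    """

    SKIP = {"script", "style"}  # content of these tags is discarded
    # Assumed set of tags that should start a new line in the output.
    BLOCK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "<html><body><p>Hello &amp; <b>world</b></p><p>Second   line</p></body></html>"
    print(html_to_text(sample))
```

Layout-oriented converters tend to emit a lot of presentational
markup (tables, font tags and so on), so expect to adjust the BLOCK
set, or post-process the result, before feeding it to corpus tools.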

I'm grateful to the following people for their replies.

Bob Krovetz <krovetz@research.nj.nec.com>
Larry Spitz <spitz@docrec.com>
John McNaught <jock@ccl.umist.ac.uk>
Stephen Green <sjgreen@ics.mq.edu.au>
Ted E. Dunning <ted@hncais.com>
Gregory Grefenstette <Gregory.Grefenstette@xrce.xerox.com>
Stefan Evert <evert@IMS.Uni-Stuttgart.DE>
Jean Carletta <jeanc@mail.cogsci.ed.ac.uk>
Constantin ORASAN <in6093@wlv.ac.uk>
Lluís Padró <padro@lsi.upc.es>
Ian Hersey <ihersey@inxight.com>
Robert Freeman <rjfreeman@email.com>

Best,

Philip
----------------------------------------------------------------
Philip Resnik, Assistant Professor
Department of Linguistics and Institute for Advanced Computer Studies

1401 Marie Mount Hall            UMIACS phone:      (301) 405-6760
University of Maryland           Linguistics phone: (301) 405-8903
College Park, MD 20742 USA       Fax:               (301) 405-7104
http://umiacs.umd.edu/~resnik    E-mail: resnik@umiacs.umd.edu