Re: Corpora: How to transform DOC files to PDF or PS?

George Demetriou (g.demetriou@dcs.shef.ac.uk)
Thu, 17 Jun 1999 15:13:38 +0100

========================================================================
Jose Maria Gomez Hidalgo wrote:
[stuff deleted]
> I have also a question:
>
> How can I convert PDF documents to text or HTML?
>
> This is required for building a search program over a set of documents in
> PDF format. I am aware of two solutions: (1) Using Ghost View with the pdf
> to text option, and (2) sending the document to an email address which is
> provided by Adobe. Neither of these two solutions is reasonable for the
> search program, because the first involves one by one online conversion,
> and the second one implies to publish documentation over the Internet, and
> it is quite slow. Do you know about other solutions?
========================================================================

Dear Jose,

(1) You can go to PDFZone at

http://www.pdfzone.com/products/software/toolinfo_all.asp

for a full list of PDF conversion tools. Some of them may work for your
needs.

(2) You can also use the following ps2ascii script (it uses Ghostscript),
which may work with simple PDF files:

-----------------------------------------------------------------------
#!/bin/sh
# Extract ASCII text from a PostScript file. Usage:
# ps2ascii [infile.ps [outfile.txt]]
# If outfile is omitted, output goes to stdout.
# If both infile and outfile are omitted, ps2ascii acts as a filter,
# reading from stdin and writing on stdout.

trap "rm -f _temp_.err _temp_.out" 0 1 2 15

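if [ $# -eq 0 ]; then
    # No arguments: act as a filter (stdin to stdout).
    gs -q -dNODISPLAY -dNOBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f \
        ps2ascii.ps - -c quit
elif [ $# -eq 1 ]; then
    # One argument: read the named file, write to stdout.
    gs -q -dNODISPLAY -dNOBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f \
        ps2ascii.ps "$1" -c quit
else
    # Two arguments: read the first file, write to the second.
    gs -q -dNODISPLAY -dNOBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f \
        ps2ascii.ps "$1" -c quit >"$2"
fi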
-----------------------------------------------------------------------
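
Since the search program needs unattended conversion rather than
one-by-one conversion, a script like the one above could be driven from
a loop. What follows is only a rough sketch, not something I have run
against your documents: the function name convert_all and the CONVERTER
variable are made up for illustration, and it assumes the converter
accepts "infile outfile" arguments as ps2ascii does.

```shell
#!/bin/sh
# Sketch: convert every *.pdf in one directory to a .txt file in another.
# convert_all and CONVERTER are hypothetical names chosen for this example.
# CONVERTER must be a command taking "infile outfile"; it defaults to the
# ps2ascii script shown above.

convert_all() {
    src_dir=$1
    out_dir=$2
    converter=${CONVERTER:-ps2ascii}

    mkdir -p "$out_dir" || return 1

    for pdf in "$src_dir"/*.pdf; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -f "$pdf" ] || continue

        base=$(basename "$pdf" .pdf)

        # Report a failed conversion but carry on with the remaining files.
        if ! "$converter" "$pdf" "$out_dir/$base.txt"; then
            echo "conversion failed: $pdf" >&2
        fi
    done
}
```

After defining the function you would call it as, for example,
convert_all ./pdfs ./text and then index the resulting text files.
Setting CONVERTER to another tool with the same argument order lets you
compare converters on the same set of documents.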

(3) You can try the pstotext utility from

http://www.research.digital.com/SRC/virtualpaper/pstotext.html

This also requires Aladdin Ghostscript. It's supposed to work for both
PostScript and PDF conversion, although I've found that it fails for
PDF documents of a complex technical nature.

(4) We are currently evaluating a (commercial) PDF-to-text tool called
Argus (it's in the PDFZone list). It seems to work fairly well, and its
main advantage is that it is configurable. As with other tools, though,
it seems to have problems when the document includes a lot of equations
and tables, and we're trying to find a way around that.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Dr George Demetriou

Dept. of Computer Science        Room:   219
The University of Sheffield      Tel:    +44 (0) 114 2221894
Regent Court                     FAX:    +44 (0) 114 2229237
211 Portobello Street            e-mail: demetri@dcs.shef.ac.uk
Sheffield, S1 4DP, UK
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%