RE: Corpora: Converting PDF files

From: Tolkin, Steve (Steve.Tolkin@FMR.COM)
Date: Fri Dec 28 2001 - 17:43:01 MET


    Oh, I wish it were so easy!

    Summary:
    I believe there are several problems that affect all the approaches.
    1. Ligatures e.g. fi, ff, ffi, Fi, etc. are emitted as special
    control characters, e.g. the single character ^L.
    2. Words that had a hyphen introduced due to a line ending
    are emitted in two pieces.

    Details:
    1. Just as an example here is the last part of page 7 of
    http://www.cs.columbia.edu/~min/papers/cucs-002-01.pdf
    that I created by copying with the text tool and then pasting into my
    editor (emacs). Note that I have replaced the actual single
    characters ^L and ^K by a two character pair so you would see them in
    this email. The original file contained a single character ^L (aka
    Control-l, C-l, octal 014, hexadecimal 0xc etc.) Note also that ^L is
    used for two different purposes: for the ligature fi and to denote a
    page break. ^K is used for "ff".
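
    A minimal sketch, in Python, of the substitution described above:
    each control character is shown as a two character ^X pair. The
    function name caret_notation is made up for illustration, and leaving
    tab and newline untouched is an assumption.

```python
def caret_notation(text: str) -> str:
    """Render ASCII control characters as two-character ^X pairs."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x20 and ch not in "\n\t":
            # Control-L is 0x0c; adding 0x40 gives "L", hence "^L".
            out.append("^" + chr(code + 0x40))
        elif code == 0x7F:
            out.append("^?")
        else:
            # Ordinary characters, tab and newline pass through unchanged.
            out.append(ch)
    return "".join(out)


sample = "its \x0cnal feature set based on the di\x0berences"
print(caret_notation(sample))
```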

    <quote>
    The relative di^Kerence between these features across headers within a
    document seems to dictate their nesting depth. Header thus computes
    its ^Lnal feature set based on the di^Kerences in the values of these
    initial features in adjacent headers, shown in Table 3. This
    corresponds to learning whether one header dominates, is dominated by,
    or is on parity with an adjacent header. These pairwise features are
    Header's output and are passed on to the Combiner ^Lnal machine
    learning module.
    7
    ^L
    </quote>
    Unfortunately the approach of having the file read by
    Ghostview (and processed by Ghostscript) is even worse.
    All the above errors appear, as well as another kind of error where it
    cannot read the contents due to some font problem or other issue,
    and so uses ### instead, e.g. the last sentence becomes:
    <quote>
    These pairwise features are ######'s output and are
    passed on to the ######## ^Lnal machine learning module.
    </quote>

    Unfortunately there are many more ligatures than this, e.g. fl,
    including some with three letters: ffi, etc. They also
    can occur anywhere in a word, e.g. specific became "speci^Lc".

    I seem to recall that the particular assignments used by Acrobat,
    i.e. which control code is used for which ligature,
    vary. (If anyone could provide more information about
    this I would appreciate it.)

    Assuming you have a big dictionary this problem can be
    partially remedied as follows:
    Find all words containing a ligature and scan the text
    looking for the assignment (i.e. on a per document level).
    Then fix them using the inferred mapping.
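
    A rough sketch of that remedy in Python, not a tested tool. The names
    infer_mapping and fix_ligatures, the candidate list LIGATURES, and the
    tiny stand-in dictionary are all assumptions made for illustration.
    Each word containing exactly one control character votes for whichever
    ligature turns it into a dictionary word, and the majority wins for
    that document. A control character standing alone (e.g. ^L used as a
    page break) is left as is.

```python
import re
from collections import Counter, defaultdict

# Candidate ligatures to try (an assumed list; extend as needed).
LIGATURES = ["ffi", "ffl", "ff", "fi", "fl"]

# Control characters other than tab, newline and carriage return.
CTRL = r"\x00-\x08\x0b\x0c\x0e-\x1f"
TOKEN = re.compile(rf"[A-Za-z{CTRL}]+")
CTRL_CHAR = re.compile(rf"[{CTRL}]")


def infer_mapping(text, dictionary):
    """Guess, per document, which control character stands for which ligature."""
    votes = defaultdict(Counter)
    for token in TOKEN.findall(text):
        found = CTRL_CHAR.findall(token)
        # Use only words with exactly one control character as evidence;
        # a lone control character (page break) is skipped.
        if len(found) != 1 or len(token) == 1:
            continue
        ctrl = found[0]
        for lig in LIGATURES:
            if token.replace(ctrl, lig).lower() in dictionary:
                votes[ctrl][lig] += 1
    return {ctrl: counts.most_common(1)[0][0] for ctrl, counts in votes.items()}


def fix_ligatures(text, mapping):
    """Apply the inferred mapping inside words only."""
    def repair(match):
        token = match.group(0)
        if not any(c.isalpha() for c in token):
            return token  # e.g. a lone ^L page break
        return CTRL_CHAR.sub(lambda m: mapping.get(m.group(0), m.group(0)), token)
    return TOKEN.sub(repair, text)


words = {"final", "differences", "specific"}  # stand-in for a big dictionary
page = "its \x0cnal feature set based on the di\x0berences; speci\x0cc\n7\n\x0c\n"
mapping = infer_mapping(page, words)
fixed = fix_ligatures(page, mapping)
```

    Voting over the whole document is one way to cope with ambiguous
    words, where more than one ligature gives a dictionary word.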

    Aside: This is similar to the problem with ligatures in *.ps files
    which the ps2text program tries to fix, e.g. here is an excerpt:
    <quote>
    #
    # Process the filtered PostScript with $ps2txt_cmd and clean up its output.
    # Substitute \ddd characters with correct combinations.
    #
    open(PS2TXT, "$ps2txt_cmd $dviflag < $tmpfile |") || die "Cannot run ps2txt";
    while (<PS2TXT>) {
            next if (/^\n/o);
            chop;
            if (/^.*\\.*$/o) {
                    s/\\214/fi/g;
                    s/\\256/fi/g;
                    s/\\257/fl/g;
                    s/\\320//g;
    </quote>

    2. When converting an Adobe Acrobat *.pdf file to text
    there are often many hyphenated words.
    Here is an example from p. 11 of the same document above.
    <quote>
    To further analyze CLASP's performance,
    we assess the features used by Ripper, since it implicitly does feature
    selec-
    tion when constructing its hypothesis.
    </quote>

    In certain cases the frequency of hyphenated words is very high.
    For example the U.S. IRS presents its publications
    using 3 columns, and so there are many hyphenated words introduced.

    Assuming you have a big dictionary this problem can be
    partially remedied as follows:
    If removing the hyphen produces a word, and neither fragment
    is a word, then we simply store the joined word, e.g.
    "ap-propriate" becomes "appropriate".
    My coinage for this process: "dehyphenization".
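
    A minimal Python sketch of this rule, under stated assumptions: the
    name dehyphenize, the regular expression, and the small stand-in
    dictionary are illustrative only. Only hyphens at a line ending are
    considered, and any case the rule does not cover is left unchanged.

```python
import re

# A word fragment, a hyphen at the end of a line, then the next fragment.
LINE_END_HYPHEN = re.compile(r"([A-Za-z]+)-\n[ \t]*([A-Za-z]+)")


def dehyphenize(text, dictionary):
    """Join line-end hyphenated fragments when only the joined form is a word."""
    def join(match):
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        if (joined.lower() in dictionary
                and head.lower() not in dictionary
                and tail.lower() not in dictionary):
            return joined
        # Otherwise keep the original text, hyphen and line break included.
        return match.group(0)
    return LINE_END_HYPHEN.sub(join, text)


words = {"selection", "feature", "well", "known"}  # stand-in for a big dictionary
excerpt = "since it implicitly does feature\nselec-\ntion when constructing"
print(dehyphenize(excerpt, words))
```

    Requiring that neither fragment is a word keeps genuine compounds
    such as "well-known" intact when they happen to break at a line end.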

    Requests for Additional information:

    If anyone has tools, e.g. in perl, to perform either of the
    fix up workarounds above I would like to know about them.

    It may be that these problems can be minimized by
    the use of some options when creating the *.pdf file.
    If so I would like to learn about that. (But I believe
    once the file is created you are stuck.)

    Google seems to have a decent *.pdf to *.html convertor
    and I would be interested in any information about that.

     
    Hopefully helpfully yours,
    Steve

    -- 
    Steven Tolkin          steve.tolkin@fmr.com      617-563-0516 
    Fidelity Investments   82 Devonshire St. V1D     Boston MA 02109
    There is nothing so practical as a good theory.  Comments are by me, 
    not Fidelity Investments, its subsidiaries or affiliates.
    

    > -----Original Message-----
    > From: ramesh@clg2.bham.ac.uk [mailto:ramesh@clg2.bham.ac.uk]
    > Sent: Friday, December 28, 2001 9:55 AM
    > To: corpora@hd.uib.no
    > Subject: Corpora: Converting PDF files
    >
    > Dear All
    >
    > In May 2001, I asked:
    > I'm working on a PC with Windows95.
    > I have MSWord 2000, Acrobat Reader5, and GSview3.6.
    > Can anyone tell me if it is possible to convert
    > PDF files into ASCII or MSWord?
    > And how....
    >
    > I received many helpful replies, and
    > promised to post a summary, but forgot.
    >
    > A colleague has just asked me about the same problem,
    > which reminded me that I did not post the summary.
    >
    > So here it is. Apologies to anyone I have
    > forgotten.
    >
    > Best
    > Ramesh Krishnamurthy
    > Consultant: COBUILD, Collins Dictionaries.
    > Hon. Res. Fellow: University of Birmingham.
    > Hon. Res. Fellow: University of Wolverhampton.
    >
    > 1. Kevin McTait (UMIST):
    > try the auto-email service at:
    > http://www.pdfzone.com/services/access.html
    >
    > 2. Ha Le An (Wolverhampton Uni):
    > the simplest way is select all, copy from Acrobat Reader, and
    > paste into word, but there is no way to keep the format, and
    > images, and tables etc.
    >
    > 3. Fabio Tamburini (Bologna):
    > Open the file with GhostView, then choose menu EDIT, then "Text
    > Extract..." and an ASCII text file will be produced...
    > Pay attention to the formatting of the new file! ;-)
    > I have GSview3.3, but such feature should be available also in 3.6...
    >
    > 4. Mike Scott (Liverpool):
    > Adobe Acrobat, the full version, not just the Reader,
    > will export to various formats, haven't checked
    > them all yet though.
    >
    > 5. Chris Tribble (Sri Lanka):
    > I do this with the full Acrobat - I use version 4. This has a text
    > selection tool. Once you've clicked on this you can use Ctrl A to
    > select all text in the document if you've selected View, Continuous.
    > This text can then be pasted to a notepad or word document.
    >
    > 6. Acrobat has an export to Postscript option.
    > Then you can use a `postscript-to-text' converter.
    >
    > 7. Everita Milconoka (Latvia):
    > You may try to send your .pdf file to
    > access-b@Adobe.COM
    > and then in subject line you have to write either pdf2txt or pdf2htm,
    > and after some minutes they will send you back the file in
    > .txt or .htm format.
    >
    > 8. Steven Krauwer (Netherlands):
    > Adobe offers on-line and email facilities for this
    > at http://access.adobe.com:80/simple_form.html
    >
    > 9. Philip Resnik (Maryland):
    > The solution was at
    > http://www.research.compaq.com/SRC/virtualpaper/pstotext.html --
    > it seems to work very nicely for pdf2txt conversion at least
    > in the Unix version.
    >
    > 10. Simon G. J. Smith (Birmingham):
    > MSword -- www.adobe.com will do free conversions FROM word
    > (they get emailed back to you, and you can only do abt 5 per
    > email address), but I don't know about the other way round.
    > To extract text from acrobat (mine is 4.0) choose the text
    > select tool (capital T with a little box). Then just cut and
    > paste the text you want. This works one page at a time.
    > From ghostview (if it can read your particular PDF, sometimes
    > doesn't work for me), do the whole thing at once by
    > Edit|Text Extract. It's in the gsview help.
    > You can convert whole pages to bitmaps with gsview, and I
    > think in Acrobat you can select graphics from the pdf file
    > (the Acrobat help says use the graphics select tool, but I
    > can't find this tool). The bitmap file can then be viewed
    > from Word.
    >
    > 14. Jerome Richalot (Lyon)
    > Acrobat 5 apparently makes the whole difference. You can
    > download a plug-in from adobe.com called Access and add it on
    > Acrobat to convert from pdf to rtf.
    >



    This archive was generated by hypermail 2b29 : Fri Dec 28 2001 - 19:02:39 MET