Re: [Corpora-List] Translator_HTML_to_XML

From: Scott James Cederberg (cederber@csli.Stanford.EDU)
Date: Sat May 03 2003 - 01:48:43 MET DST

    Hey there,

        HTML is an SGML application, and it allows some constructs (namely
        opening tags without matching closing tags and attribute values
        without surrounding quotation marks) that are not permitted in
        well-formed XML. XHTML is a reformulation of HTML that has been
        designed to be conforming XML.
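
To make the difference concrete, here is a minimal sketch in Python (my choice for illustration, not something from the original question) that feeds both styles of markup to a standard XML parser. The helper name is_well_formed and the two sample fragments are made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML style: unquoted attribute value, <br> and <p> never closed.
html_fragment = "<body><p class=intro>First<br>Second</body>"

# XHTML style: quoted attribute value, every element explicitly closed.
xhtml_fragment = '<body><p class="intro">First<br />Second</p></body>'

if __name__ == "__main__":
    print("HTML fragment well-formed: ", is_well_formed(html_fragment))
    print("XHTML fragment well-formed:", is_well_formed(xhtml_fragment))
```

This only checks well-formedness; it says nothing about validity against a DTD.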

        The osx program, part of the OpenSP package (a successor to James
        Clark's sp package) can automatically convert SGML files to
        corresponding XML files; you could give that a try. OpenSP is
        maintained along with OpenJade (http://openjade.sourceforge.net/).

        I believe osx is written in C++, though, rather than Java.
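
If you want a feel for what such a conversion involves before installing anything, the following is a rough stand-in written in Python. To be clear, this is not osx and not how osx works internally: it uses Python's standard html.parser instead of a real SGML parser, knows nothing about the HTML DTD, and only handles the two problems mentioned above (unquoted attributes and unclosed tags). The names HtmlToXml, html_to_xml and VOID_ELEMENTS are invented for this sketch, and it assumes the input has a single root element.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML (partial list, an assumption).
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlToXml(HTMLParser):
    """Builds an XML element tree from loosely written HTML (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of elements that still need an end tag

    def handle_starttag(self, tag, attrs):
        # A valueless attribute such as "checked" gets its own name as value.
        attributes = {name: (value if value is not None else name)
                      for name, value in attrs}
        self.builder.start(tag, attributes)
        if tag in VOID_ELEMENTS:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag, ignore it
        # Close everything left open inside this element, then the element.
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def finish(self):
        self.close()  # flush any text the parser is still buffering
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def html_to_xml(markup: str) -> str:
    parser = HtmlToXml()
    parser.feed(markup)
    return ET.tostring(parser.finish(), encoding="unicode")


if __name__ == "__main__":
    print(html_to_xml("<body><p class=intro>First<br>Second</body>"))
```

Note that this closes unclosed elements only when an enclosing end tag (or the end of input) is reached, so two consecutive unclosed p elements end up nested rather than as siblings. A DTD-aware tool would get that right, which is one reason to prefer a real SGML parser for serious work.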

        You should be able to validate the resulting XML documents against
        the XHTML DTDs, although I would imagine you'll need to make minor
        changes for full validity.

        Hope that helps.

                                                            Scott

    On Sat, May 03, 2003 at 12:48:59AM +0200, wassim souayah wrote:
    > Dear all,
    >
    > I'm working on an Internet query system.
    > Can somebody point me to a system for translating
    > HTML to XML (in Java)?
    >
    > Thanks a lot,
    > wassim
    >
    >
    >
    > ************************************************
    > Wassim Souayah
    > DEA student
    > LARIS Laboratory
    > Sfax, Tunisia
    >
    > Email : wsouayah@yahoo.fr

    -- 
    Scott Cederberg
    Researcher
    

    Infomap Project
    Computational Semantics Lab
    Center for the Study of Language and Information (CSLI)
    Stanford University

    http://infomap.stanford.edu/


