RE: [Corpora-List] Translator_HTML_to_XML

From: Burnard Towers (lou.burnard@computing-services.oxford.ac.uk)
Date: Sat May 03 2003 - 13:22:42 MET DST

    Dave Raggett's tidy utility is THE best way of converting HTML to XML.
    However, this is almost certainly not enough for your purposes, since
    presumably you will want to be using meaningful tagging in your XML if it is
    to be part of a query system. In other words, you might find
    <b>123</b> or <em>123</em> in your HTML or the XML version of it, where
    your query system really wants to find <partNumber>123</partNumber>.

    This is obviously easy to fix *if* the HTML or XML input is completely
    regular, and you never find <b> or <em> used to mark things which are not
    part numbers. But, of course, the world ain't like that and you need fairly
    sophisticated tools to add semantics to the purely presentational markup
    that HTML will give you, even when it's converted to something that is valid
    XML.

    The good news is that such tools become much more capable once they can
    act on XML structures. So, for example, if the part numbers are
    always in column two of a table, you can apply the transformation I
    suggested above only to <b> or <em> elements appearing in the second column
    of a table. There are lots of good XML-aware tools, many of them in Java,
    which can do this kind of thing. And there is also XSLT, which is the
    language I would recommend for such jobs.
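
    As a rough illustration of the column-two case, here is a minimal sketch
    in Java that runs a small XSLT 1.0 stylesheet through the JDK's standard
    javax.xml.transform API. The class name PartNumberTagger, the method
    name tag and the sample table are invented for the example. It also
    assumes that the table cells are plain <td> elements in no namespace; if
    tidy has produced XHTML, the match patterns would need a namespace
    prefix.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: the names here are illustrative only.
class PartNumberTagger {

    // Identity transform plus one rule: <b> or <em> anywhere inside the
    // second <td> of a row is rewritten as <partNumber>. Everything else
    // is copied through unchanged.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='tr/td[2]//b | tr/td[2]//em'>"
      + "<partNumber><xsl:apply-templates/></partNumber>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XML string and returns the result.
    static String tag(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Sample input: <b> appears in both columns, but only the one in
        // column two is a part number.
        String input =
            "<table><tr><td><b>Widget</b></td><td><b>123</b></td></tr></table>";
        System.out.println(tag(input));
    }
}
```

    The <b> in the first column is left alone, and only the one in the
    second column becomes <partNumber>. The same stylesheet could equally be
    run with any standalone XSLT processor; the Java wrapper is only there
    because the original question asked for Java.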

    > -----Original Message-----
    > From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    > Behalf Of wassim souayah
    > Sent: 02 May 2003 23:53
    > To: corpora@hd.uib.no
    > Subject: [Corpora-List] Translator_HTML_to_XML
    >
    >
    > Dear all,
    >
    > I'm working on an Internet query system.
    > Can somebody point me to any system for translating
    > HTML to XML (in Java)?
    >
    > Thanks a lot,
    > wassim
    >
    >
    >
    > ************************************************
    > Wassim Souayah
    > Etudiant DEA
    > Laboratoire de LARIS
    > Sfax-TUNISIE
    >
    > Email : wsouayah@yahoo.fr


