Re: Corpora: On-line KWIC system in PHP

From: Mark Davies (mdavies@ilstu.edu)
Date: Fri Jan 05 2001 - 15:42:21 MET


    [Sorry for any duplicate messages. My email client is behaving strangely
    today]

    >I am interested in processing corpora (mainly in Spanish and Japanese) and
    >now I am preparing some exercises for my students for the new course
    >beginning next April. What I am trying to do is a Web KWIC system using only
    >(or mainly) PHP.
    >
    >Is there anybody using PHP for this purpose? For big corpora I am developing
    >a system with PHP and MySQL, and I think that its response time is quite
    >fast compared with PERL even without a backend database.

    A while back I posted a notice about a web-accessible corpus of Spanish
    texts that works on more or less the same basis as what you've
    proposed. The corpus is composed of 3,000,000 words in nearly 200 texts
    from the 1200s to the 1900s (including 1,000,000 words from Modern Spanish,
    divided equally among LatAm-Spoken, LatAm-Written, Spain-Spoken,
    Spain-Written). The URL is:

             http://mdavies.for.ilstu.edu/corpus

    The data is stored in a SQL Server database and is indexed via the "Full
    Text" indexing in SQL Server, which allows for proximity searches and
    searches for several types of word forms. The database is linked to the
    web via Active Server Pages, including ADO (ActiveX Data Objects) and VBScript.
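
    To make the idea concrete, here is a minimal sketch in Python of what
    a positional word index, a KWIC display and a proximity search involve.
    It is not the SQL Server / ASP code behind the site above; the function
    names (tokenize, build_index, kwic, near) and the sample sentence are
    made up for illustration, and a real corpus would keep the index in
    the database rather than in memory.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase word tokens; the regex is Unicode-aware, so accented letters are kept."""
    return re.findall(r"\w+", text.lower())

def build_index(tokens):
    """Map each word form to the list of token positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index

def kwic(tokens, index, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    lines = []
    for pos in index.get(keyword.lower(), []):
        left = " ".join(tokens[max(0, pos - width):pos])
        right = " ".join(tokens[pos + 1:pos + 1 + width])
        lines.append((left, tokens[pos], right))
    return lines

def near(index, first, second, max_gap=5):
    """Positions of `first` with `second` at most max_gap tokens away, on either side."""
    second_positions = index.get(second.lower(), [])
    return [p for p in index.get(first.lower(), [])
            if any(abs(p - q) <= max_gap for q in second_positions)]

# Hypothetical sample text, standing in for a corpus document.
text = "En un lugar de la Mancha, de cuyo nombre no quiero acordarme"
tokens = tokenize(text)
index = build_index(tokens)

for left, word, right in kwic(tokens, index, "de", width=2):
    print(f"{left:>20}  [{word}]  {right}")
```

    Looking up a word in the index is a single dictionary access, so the
    cost of a query depends on the number of hits rather than on the size
    of the corpus; that is roughly the property a full-text index in a
    database is meant to provide at scale.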

    The speed is quite good -- about 1-2 seconds for most searches -- which
    compares nicely with the solution that you've suggested. In addition, the
    SQL Server approach is quite scalable. Searches on a 200 million word
    corpus of Modern Portuguese (http://mdavies.for.ilstu.edu/corpus/publico)
    are nearly as fast -- within 5-10 seconds for nearly all searches.

    What would be interesting is to use the PHP/MySQL approach with a large
    database -- 50 million words or more -- and see what the performance is
    like. If it's still fast -- like what you have right now -- then I think
    that it would be an ideal solution for the NT platform. And of course one
    of the main advantages of the PHP/MySQL solution is the cost (or lack of
    cost :), as compared to the NT Server / SQL Server approach, which can be a
    bit pricey.

    Mark Davies
    Illinois State University


