Re: Corpora: On-line KWIC system in PHP

From: Antonio Ruiz Tinoco (a-ruiz@hoffman.cc.sophia.ac.jp)
Date: Sat Jan 06 2001 - 09:57:18 MET


    Mark Davies said:

    > A while back I posted a notice about a web-accessible corpus of Spanish
    > texts that works on more or less the same basis as what you've...

    I visited your page and it is also fast from this part of the world (Japan).

    > The speed is quite good -- about 1-2 seconds for most searches -- which
    > compares nicely with the solution that you've suggested. In addition, the
    > SQL Server approach is quite scalable. Searches on a 200 million word
    > corpus of Modern Portuguese (http://mdavies.for.ilstu.edu/corpus/publico)
    > are nearly as fast -- less than 5-10 seconds for nearly all searches.

    Well, for small corpora, say a few million words, processing flat ASCII
    files is not so slow. After your mail, I measured (using the function
    microtime()) the response time, from the keyword input until the last
    concordance line is printed out, for Don Quijote (about 2 MB). The results
    were about 0.4 to 0.6 sec. in "plain" PHP4 without an optimizer, or 4 sec.
    in the older PHP3, which unfortunately is what the test page I posted runs
    on. Getting more than 1200 matches of the Spanish article "el" in a smaller
    file (La Gaviota, 0.5 MB) took about 1.3 sec. in PHP3, and about 0.8 sec.
    for less frequent keywords. But with PHP4 I had a response time of less
    than 0.07 sec. in almost any case. In the near future I will (try to)
    install the Zend Optimizer, Cache and Loader so that response times will
    be faster even with heavy traffic. I will report the results to this list
    if there is some interest.
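
    A minimal sketch of this kind of measurement, written in Python rather
    than PHP. The function names, the context width and the sample sentence
    are assumptions for illustration; this is not the code behind the test
    page, and time.perf_counter() merely plays the role of microtime().

```python
import re
import time


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for whole-word, case-insensitive matches."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        # Take up to `width` characters on each side and flatten line breaks.
        left = text[max(0, m.start() - width):m.start()].replace("\n", " ")
        right = text[m.end():m.end() + width].replace("\n", " ")
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines


def timed_kwic(text, keyword, width=30):
    """Run kwic() and return (lines, elapsed seconds)."""
    start = time.perf_counter()
    lines = kwic(text, keyword, width)
    elapsed = time.perf_counter() - start
    return lines, elapsed


# Hypothetical sample; a real test would read the whole text file instead.
sample = (
    "En un lugar de la Mancha, de cuyo nombre no quiero acordarme, "
    "vivía el hidalgo. El hidalgo leía."
)
lines, elapsed = timed_kwic(sample, "el")
for line in lines:
    print(line)
print(f"{len(lines)} matches in {elapsed:.4f} sec.")
```

    Note that this times only the search itself; the figures above also
    include printing the concordance lines.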

    > What would be interesting is to use the PHP/mySQL approach with a large
    > database -- 50 million words or more -- and see what the performance is
    > like. If it's still fast -- like what you have right now -- then I think
    > that it would be an ideal solution for the NT platform. And of course one
    > of the main advantages of the PHP/mySQL solution is the cost (or lack of
    > cost :), as compared to the NT Server / SQL Server approach, which can be
    > a bit pricey.

    Yes, it would be very interesting. I don't have such a big corpus, but in
    the near future I would like to test it; if not with a Spanish corpus, I
    could manage with any other available corpus, perhaps a Japanese one. Of
    course, when PHP is used with MySQL, the code must be different in order
    to get better performance.
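
    One way the database-backed code might differ: instead of scanning the
    text on every query, the tokens are stored once with their positions, and
    an index on the word form answers the lookup. The sketch below uses
    Python with the standard sqlite3 module standing in for MySQL; the table,
    column and function names are hypothetical, and nothing here has been
    measured on a large corpus.

```python
import sqlite3


def build_index(conn, tokens):
    """Store one row per token and index the word column for fast lookup."""
    conn.execute(
        "CREATE TABLE tokens (pos INTEGER PRIMARY KEY, word TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO tokens (pos, word) VALUES (?, ?)", enumerate(tokens)
    )
    conn.execute("CREATE INDEX idx_word ON tokens (word)")
    conn.commit()


def concordance(conn, keyword, span=3):
    """Return each hit with `span` tokens of context on either side."""
    hits = conn.execute(
        "SELECT pos FROM tokens WHERE word = ? ORDER BY pos", (keyword,)
    ).fetchall()
    out = []
    for (pos,) in hits:
        rows = conn.execute(
            "SELECT word FROM tokens WHERE pos BETWEEN ? AND ? ORDER BY pos",
            (pos - span, pos + span),
        ).fetchall()
        out.append(" ".join(w for (w,) in rows))
    return out


conn = sqlite3.connect(":memory:")
tokens = "en un lugar de la mancha de cuyo nombre no quiero acordarme".split()
build_index(conn, tokens)
results = concordance(conn, "de", span=2)
for line in results:
    print(line)
```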

    As you pointed out, one of the main advantages of this type of approach is
    the lack of cost. Any student can install the whole system on a $500 PC
    (hardware only) and it works flawlessly. It also works on Windows 98 and
    other platforms.

    One other interesting point is that proprietary software can be like a
    black box, but with open source software you know what is inside and
    can modify it (only sometimes ;-). Perhaps we will have to try different
    approaches for different purposes.

    Antonio Ruiz Tinoco
    Sophia University, Tokyo
    a-ruiz@hoffman.cc.sophia.ac.jp



    This archive was generated by hypermail 2b29 : Sat Jan 06 2001 - 11:10:22 MET