jonathan's blog

Counting the number of a category in a search result sped up.

Now there are no remaining speed issues: the slowest query on my laptop takes less than a second, and the whole page returns in 1.5 to 2.5 seconds. This is still pretty poor performance compared to some search tools. I'm not using the SPARQL views or elaborate caching, and I haven't tried YARS yet, although Sesame in memory seems to be doing quite well.

With more time, I'll squeeze more speed out of this thing. But first:
There are some issues about the correctness of subcategories, and other minor issues, but the hardest part is now complete.

minor quercus bugs

Using json_encode in the Quercus PHP environment caused some crashes recently; I've filed a bug for it and created a workaround.

I've also created this bug http://bugs.caucho.com/view.php?id=2014 on array_multisort. I need it fixed :(

JFYI, the Quercus library is licensed under the GPL and can run in Java servers other than Caucho's Resin. It is a fairly complete implementation of PHP 5.

introducing SPARQL views

SPARQL views are a caching mechanism for prepared queries: a simple way of storing subgraphs to make the queries faster. They can use more resources (disk space and connection resources) in exchange for improved speed.

at this point, Sesame's SPARQL prepared queries only seem slightly faster

Here are the results of running the scripts against the database of 90,000 statements: 34 queries in about 0.3 seconds.

It retrieves all distinct properties, then loops through and retrieves all distinct values for those properties.
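The two query shapes described above can be sketched roughly as follows. This is only an illustration of the pattern (one query for the distinct properties, then one query per property for its distinct values); the class and method names are hypothetical, and the property URIs are assumed to be trusted values that came back from the first query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the SPARQL text only, it does not talk to Sesame.
class PropertyQueries {

    // First query: every distinct property used in the store.
    static final String DISTINCT_PROPERTIES =
            "SELECT DISTINCT ?p WHERE { ?s ?p ?o }";

    // Follow-up query: every distinct value of one property.
    // Assumes propertyUri is a trusted URI (no escaping is done here).
    static String distinctValuesQuery(String propertyUri) {
        return "SELECT DISTINCT ?o WHERE { ?s <" + propertyUri + "> ?o }";
    }

    // One follow-up query per property returned by the first query.
    static List<String> valueQueries(List<String> propertyUris) {
        List<String> queries = new ArrayList<>();
        for (String uri : propertyUris) {
            queries.add(distinctValuesQuery(uri));
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> props = new ArrayList<>();
        props.add("http://example.org/type"); // made-up property URI
        System.out.println(DISTINCT_PROPERTIES);
        for (String q : valueQueries(props)) {
            System.out.println(q);
        }
    }
}
```

With 33 distinct properties this pattern yields 1 + 33 = 34 queries, which matches the run above.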

The punchline: after a number of tests, prepared queries are at this point only slightly faster. I may cache the tuple query in Quercus APC, but for now the regular queries are sufficiently fast.

COUNT of 33 DISTINCT TOTAL PROPERTIES

no SPARQL RDF aggregate functions leading to other sorting routines

If SPARQL won't do it for me, I'll just have to do it myself... Well, it's easy in php:
foreach ($prepared_aggregates as $key => $row) {
    $count[$key]  = $row["count"];
}
array_multisort($count, SORT_NUMERIC,SORT_DESC, $prepared_aggregates);
I started to write one in Java, but it was slower under Quercus because of type conversions:

private Hashtable[] doOrderBy( 
            Hashtable[] results,
            String[] order_by,
            String[] sorting
            ){
        
        /**
         * a container for the return values
         * */
        //Hashtable<String,String>
        int resultlength = results.length;
        Hashtable[]  final_result = new Hashtable[resultlength] ; //new Hashtable<Integer, Hashtable<String,String>>();
        /**
         * a mapping for looking up keys of the provided results when the sorting is complete
         * */
        Hashtable<String,Integer> mapTable = new Hashtable<String,Integer>(); 
        
        /**
         * the array of strings for sorting
         * */
        ArrayList<String> sortingTable = new ArrayList<String>();
        
        /**
         * a container for the sort string
         * */
        String sortstring;
        
        //Enumeration<Integer> resultKeys = results.keys();
        //Enumeration<String> orders;
        String order_key;
        String order_value;
        Iterator it;
        
        Hashtable<String,String>  row = new Hashtable<String,String>();
        int rownumber=0;
        int or_num = order_by.length;
 
        
        //iterate over the results
        for(int j=0;j<resultlength;j++){
            sortstring = "";
            //rownumber = resultKeys.nextElement();
            row = results[j];
            //orders = order_by.keys();
            
            //iterate over the orders
            for(int i=0;i< or_num ; i++) {
                order_key = order_by[i];
                //order_key = orders.nextElement();
                //order_value = order_by.get(order_key);
                sortstring += row.get(order_key);
            }
            
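            //append the row index so rows with identical sort values keep distinct keys
            //('\0' sorts before any other character, so the order of differing values is unchanged)
            sortstring = sortstring + '\0' + j;
            //map the sort string back to its row index and queue it for sorting
            mapTable.put(sortstring,j);
            sortingTable.add(sortstring);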
        }
        
        //sort the sortTable
        Collections.sort(sortingTable); //add MIXED ASC DESC
        it = sortingTable.iterator();
        //iterate the sort table
        //retrieve the result key from the sortable
        //get the hashtable result row from the results
        //place in the final results with the new sorted index
        rownumber = 0;
        String[] debugger = new String[results.length];
        while (it.hasNext()) {
            
           order_key =  (String)it.next();
           final_result[rownumber] = results[mapTable.get(order_key)];
          // debugger[rownumber] = order_key;
           rownumber++;
        }
        //return debugger;
        return final_result;
    }
http://java.sun.com/docs/books/tutorial/collections/algorithms/index.html
http://java.sun.com/javase/6/docs/api/java/util/Collections.html
http://java.sun.com/javase/6/docs/api/java/util/Comparator.html
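The Comparator documentation linked above points at another way to do the same sort without building concatenated sort strings. The following is a minimal sketch, not the code I ended up running: it assumes each result row is a map with a numeric "count" column (as in the PHP example), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: sorts result rows by a numeric "count" column, highest first.
class CountSorter {

    static List<Map<String, String>> sortByCountDesc(List<Map<String, String>> rows) {
        // Copy first so the caller's list is left untouched.
        List<Map<String, String>> sorted = new ArrayList<>(rows);
        // Compare parsed integers rather than strings, so "10" ranks above "9".
        Comparator<Map<String, String>> byCount =
                Comparator.comparingInt(row -> Integer.parseInt(row.get("count")));
        // List.sort is stable, so rows with equal counts keep their original order.
        sorted.sort(byCount.reversed());
        return sorted;
    }

    // Small convenience for building a row; the column names are assumptions.
    static Map<String, String> row(String value, String count) {
        Map<String, String> r = new HashMap<>();
        r.put("value", value);
        r.put("count", count);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row("a", "9"));
        rows.add(row("b", "10"));
        rows.add(row("c", "2"));
        for (Map<String, String> r : sortByCountDesc(rows)) {
            System.out.println(r.get("value") + " " + r.get("count"));
        }
    }
}
```

Comparing integers also avoids the string-sort trap where "10" sorts before "9".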

Faceted Search - the hard part without aggregate functions in RDF

Faceted search also means that people often want to know not only the general category, but also the "Count", which could mean three things:
1. How many objects are there like this
2. How many objects are there like this in the search I just completed
3. How many objects are there like this if I add a different facet

The computational challenge is big, and the programmatic challenge is even greater, especially when, like most SPARQL developers, I'm still waiting for aggregate functions.
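To make the three counts concrete, here is a rough sketch of emulating them on the client side once the rows have been fetched. It assumes each object is a flat map of facet name to value; the facet names ("type", "region") and the class name are invented for the example, and it ignores the cost of fetching the rows in the first place.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical client-side stand-in for the missing SPARQL aggregate functions.
class FacetCounter {

    // Counts how many rows carry each value of the given facet.
    static Map<String, Integer> countFacet(List<Map<String, String>> rows, String facet) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> row : rows) {
            String value = row.get(facet);
            if (value != null) { // skip rows that have no value for this facet
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keeps only the rows whose facet has the given value (emulates adding a facet).
    static List<Map<String, String>> withFacet(
            List<Map<String, String>> rows, String facet, String value) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> row : rows) {
            if (value.equals(row.get(facet))) {
                kept.add(row);
            }
        }
        return kept;
    }

    static Map<String, String> row(String type, String region) {
        Map<String, String> r = new HashMap<>();
        r.put("type", type);
        r.put("region", region);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, String>> all = new ArrayList<>();
        all.add(row("clinic", "north"));
        all.add(row("clinic", "south"));
        all.add(row("shelter", "north"));

        // 1. Counts over every object.
        System.out.println(countFacet(all, "type"));
        // 2. Counts within the search just completed (here: region = north).
        List<Map<String, String>> search = withFacet(all, "region", "north");
        System.out.println(countFacet(search, "type"));
        // 3. Counts if a further facet is added to that search.
        System.out.println(countFacet(withFacet(search, "type", "clinic"), "region"));
    }
}
```

Each count is a full pass over the rows in memory, which is why the number of results that has to be transferred matters so much.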

What I am hoping will rescue me from the evil of slow queries is Sesame's prepared queries.

Sesame RDF PHP JAVA bridge on Caucho yields massive performance increase

You are looking at the speed of AHIRC front page loads now, after another 18 hour day. This is 3-12X faster than the last post!!

Caucho Resin PHP bridge showing renewed promise for Speed Improvements

Looks like I'll be using a hybrid approach: the HTTP client, plus a custom class running on Java to do the aggregation emulation.

Here's a recap of the consequences.
The first points of failure are (since the HTTP client for Sesame works great):

* SPARQL/Sesame not having Aggregate functions
* Sesame not having ORDER BY

This produces large numbers of results and/or queries, which then need to be parsed from JSON, leading to the second point of failure:

* Zend JSON and native php 5.2 JSON are not fast enough (perhaps they should not be expected to be for 6000 results)

Caucho Resin native PHP Java bridge to mitigate speed issues in RDF stores with lack of aggregation function support

I'm going to take an application-specific approach to solving the speed problems I'm facing. By application-specific, I mean that I will depend on another piece of software in the dependency chain: Caucho's Resin, a fast Java servlet container. Why? To avoid sending lots of data and doing lots of queries over HTTP.
