[Topaz-dev] lazy collections

Pradeep Krishnan pradeepk at soft-point.com
Wed Jan 30 14:52:16 PST 2008


[Long e-mail warning]

This e-mail is to solicit comments regarding the implementation of lazy 
collection support in OTM.

To set the background for this discussion, a quick overview of OTM's
query generation follows. If you are familiar with all this, please skip
ahead to the section on lazy collections.

OTM query and object mapping overview:

OTM treats the field marked with an @Id annotation as the subject-uri
and the other fields as holders for the values, i.e. OTM sees the graph
as containing statements of the form:

<@Id field value>  <@Predicate uri value of a field> field-value

   or for inverse mappings:

<field-value>  <@Predicate uri value of a field> <@Id field-value>

   or for rdf:List, rdf:Seq etc.

<@Id field value>  <@Predicate uri value of a field> $collection-node


So when an application requests that an object be instantiated given a 
java class and a subject-uri (i.e. by calling
session.get(SomeClass.class, id)) what OTM does is as follows:

(discounting the caching that happens in session)

1. The following conceptually simple query is generated:
     select $s $p $o from <...> where $s $p $o and
           $s <mulgara:is> <id>;

    The actual query is more complex to account for inverse mapping,
    rdf:type look ahead and filters. But let us ignore that for this
    discussion.

2. The rdf:type values from the query result and the SomeClass.class
    given by the app are used to determine the correct sub-class of
    SomeClass.class to be instantiated.

3. For all fields that are mapped to an rdf:List, rdf:Seq etc. further
    queries are performed to retrieve the values of the collection.

4. The object is instantiated with the proper sub-class of
    SomeClass.class as determined in step #2 and its fields are
    populated.

There are a couple of things that are noteworthy here:

1. The query is an $s $p $o wild card query instead of a query for just
the predicates defined in SomeClass.class. This is because of
sub-classing: the fields of the sub-class are captured in the same
query.

2. The rdf:List, rdf:Seq etc. queries are performed as a second step for
the same reason: the sub-class determination has to happen first, and
only then are all the fields of rdf collection types known.
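
To make the sub-class determination concrete, here is a minimal sketch
of how the rdf:type values from the query result could be matched
against registered classes. SubclassResolver, register(), resolve() and
the urn:ex: type URIs are hypothetical names for illustration only;
this is not how OTM's SessionFactory is actually implemented.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class SubclassResolver {
    // Hypothetical entity classes, used only for illustration.
    public static class Article {}
    public static class ResearchArticle extends Article {}

    // rdf:type URI -> registered Java class (assumed registry shape)
    private final Map<String, Class<?>> byType = new HashMap<>();

    public void register(String rdfType, Class<?> clazz) {
        byType.put(rdfType, clazz);
    }

    /**
     * Returns the most specific registered class that is assignable to
     * the requested class and whose rdf:type appears in the result.
     * Falls back to the requested class when nothing more specific
     * is found.
     */
    public <T> Class<? extends T> resolve(Class<T> requested,
                                          Collection<String> rdfTypes) {
        Class<? extends T> best = requested;
        for (String type : rdfTypes) {
            Class<?> candidate = byType.get(type);
            // Only narrow: candidate must be 'best' or a sub-class of it.
            if (candidate != null && best.isAssignableFrom(candidate)) {
                best = candidate.asSubclass(requested);
            }
        }
        return best;
    }
}
```

Only after resolve() has returned does the caller know which fields
exist at all, which is why the collection queries cannot be issued
earlier.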

How collection fields are loaded:

Collection fields are fields that are either rdf collections or have 
predicates with max-cardinality > 1. They are loaded and populated by 
the TripleStore implementation by the time #4 above is done just like 
any other field.

For Associations however, there is an additional step. Associations show 
up in the query results above as a URI value. These URIs then need to be 
treated as an @Id of the association class and an appropriate object 
needs to be instantiated. This is where the type look-ahead comes in. 
Since we now know the proper sub-class type, we simply call 
session.load(ClassOfTheAssociation.class, id) and stick the resulting 
object into the collection. The Session#load function will create a proxy
object (javassist BCE object) if the object is not already in the 
Session cache.

This all works fine - so where does the lazy collection come in?
-----------------------------------------------------------------

The idea with the lazy collection is to delay the loading of collection 
fields so as to optimize the load times and in some cases even skip the 
loading of the collection fields altogether if the application only uses 
these fields conditionally. There are a couple of variants to this strategy:

1. Make the TripleStore impl in OTM capable of doing partial object loads
2. Or selectively apply this to collections of associations where the
session.load() call is made lazily

The main interest in doing a lazy collection implementation now is with 
collections of associations that are java.util.Sets. When adding an 
object to a Set, the equals() or hashCode() method on that or any 
previously added object is called to enforce the no-duplicate guarantee 
of the Set. This in turn will trigger an actual load of the proxy object
(javassist BCE object), thus defeating all the advantages gained by a
lazy loaded association.
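
The effect is easy to reproduce outside OTM. The following is a small
stand-alone sketch, not OTM code: LazyArticle is a hypothetical stand-in
for a proxy whose state is fetched on first use, and it assumes that
equals()/hashCode() touch that loaded state, as they would when a proxy
intercepts every method call.

```java
import java.util.HashSet;
import java.util.Set;

public class SetLoadDemo {
    /** Hypothetical stand-in for a lazily loaded association proxy. */
    static class LazyArticle {
        static int loads = 0;      // counts simulated triple-store hits
        private final String id;
        private String title;      // null until "loaded"

        LazyArticle(String id) { this.id = id; }

        private void ensureLoaded() {
            if (title == null) {
                loads++;           // a real proxy would query the store here
                title = "title-of-" + id;
            }
        }

        // Both methods depend on loaded state, so both force a load.
        @Override public int hashCode() {
            ensureLoaded();
            return title.hashCode();
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof LazyArticle)) return false;
            LazyArticle other = (LazyArticle) o;
            ensureLoaded();
            other.ensureLoaded();
            return title.equals(other.title);
        }
    }

    /** Adds n unloaded objects to a HashSet and reports how many loads ran. */
    static int loadsAfterAdding(int n) {
        LazyArticle.loads = 0;
        Set<LazyArticle> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(new LazyArticle("urn:a:" + i)); // add() calls hashCode()
        }
        return LazyArticle.loads;
    }

    public static void main(String[] args) {
        System.out.println(loadsAfterAdding(3));
    }
}
```

Every element is loaded merely by being put into the Set, before the
application has looked at any of them.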

Option #1 above allows lazy loading of any field, not necessarily a
collection. For this to work the field must be accessed only via its
getter method, and a javassist proxy must be created even for a
session.get() call. (Currently only session.load() may create a proxy
object.) However this use case is rather limited and I am guessing it
won't be a problem if this implementation option is slated for a later
version.

#1 (partial object load) just for collections:

For this we should do the rdf:type query first to determine the
sub-class, followed by a query of all the non-lazy loaded (eager
fetched) predicates. The problem however is that get() operations will
now always result in 2 queries. Some optimizations can be done however -
if we can determine that no sub-class of the class to be loaded is
registered in the SessionFactory, the rdf:type query can be skipped. An
additional option is to have a cache (ehcache, oscache or whatever the
app decides to use) of subject-uri vs rdf:types partitioned by
models/graphs.
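
As a rough illustration of the cache idea, here is a sketch of an
in-memory subject-uri to rdf:types cache partitioned by graph. It uses
plain ConcurrentHashMaps rather than ehcache or oscache, and
RdfTypeCache, typesOf() and invalidate() are hypothetical names, not
existing OTM API. The typeQuery function stands in for the actual
rdf:type query.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class RdfTypeCache {
    // graph URI -> (subject URI -> rdf:type URIs)
    private final Map<String, Map<String, Set<String>>> byGraph =
        new ConcurrentHashMap<>();

    /**
     * Returns the cached rdf:types for the subject in the given graph,
     * running typeQuery (the assumed rdf:type lookup) only on a miss.
     */
    public Set<String> typesOf(String graph, String subject,
                               Function<String, Set<String>> typeQuery) {
        return byGraph
            .computeIfAbsent(graph, g -> new ConcurrentHashMap<>())
            .computeIfAbsent(subject, typeQuery);
    }

    /** Drops one entry, e.g. after the subject's rdf:type is updated. */
    public void invalidate(String graph, String subject) {
        Map<String, Set<String>> subjects = byGraph.get(graph);
        if (subjects != null) {
            subjects.remove(subject);
        }
    }
}
```

With something like this in place, a cache hit brings get() back to a
single query; invalidation on writes would still need to be worked out.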

In any case when the object is instantiated all the lazy loaded 
collection fields will have a dummy implementation supplied by OTM. A 
method call on the collection will trigger the actual load. For 
associations then a further optimization can be done to delay the call 
to Session#load. It is just this optimization that is required to 
satisfy the #2 case above.
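
A minimal sketch of what such a dummy collection could look like for a
Set follows. LazySet and its loader are hypothetical names; the loader
stands in for whatever runs the collection query (or, for associations,
the deferred Session#load calls). This is one possible shape, not a
settled design.

```java
import java.util.AbstractSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Supplier;

public class LazySet<E> extends AbstractSet<E> {
    private final Supplier<Set<E>> loader; // assumed: fetches the real values
    private Set<E> delegate;               // null until first access

    public LazySet(Supplier<Set<E>> loader) {
        this.loader = loader;
    }

    /** True once a method call has triggered the actual load. */
    public boolean isLoaded() {
        return delegate != null;
    }

    // Loads at most once; every collection method goes through here.
    private Set<E> delegate() {
        if (delegate == null) {
            delegate = new LinkedHashSet<>(loader.get());
        }
        return delegate;
    }

    @Override public Iterator<E> iterator() { return delegate().iterator(); }
    @Override public int size() { return delegate().size(); }
    @Override public boolean contains(Object o) { return delegate().contains(o); }
    @Override public boolean add(E e) { return delegate().add(e); }
    @Override public boolean remove(Object o) { return delegate().remove(o); }
}
```

Because AbstractSet implements equals(), hashCode() and toString() on
top of iterator() and size(), those calls trigger the load as well. The
sketch is not thread-safe.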

That is basically it. The dummy/proxy implementation needs to be done in
both cases. So I am leaning towards doing just that for now and the
partial object loads later.

Thoughts? Comments? Should we just start with #2 (i.e. lazy load support
only for collections of associations) and do #1 (partial object load
with support for lazy collections for all types) later?

Thanks,
Pradeep

