[Topaz-dev] lazy collections
Pradeep Krishnan
pradeepk at soft-point.com
Wed Jan 30 14:52:16 PST 2008
[Long e-mail warning]
This e-mail is to solicit comments regarding the implementation of lazy
collection support in OTM.
Just to set a background to this discussion, a quick overview of OTM's
query generation follows. If you are familiar with all this, please skip
ahead to the section on lazy collections.
OTM query and object mapping overview:
OTM treats the field marked with an @Id annotation as the subject-uri
and the other fields as holders for the values, i.e. OTM sees the graph as
containing statements of the form:
<@Id field value> <@Predicate uri value of a field> field-value
or for inverse mappings:
<field-value> <@Predicate uri value of a field> <@Id field-value>
or for rdf:List, rdf:Seq etc.
<@Id field value> <@Predicate uri value of a field> $collection-node
So when an application requests that an object be instantiated given a
Java class and a subject-uri (i.e. by calling
session.get(SomeClass.class, id)), OTM does the following
(discounting the caching that happens in the session):
1. The following conceptually simple query is generated:
select $s $p $o from <...> where $s $p $o and
$s <mulgara:is> <id>;
The actual query is more complex to account for inverse mapping,
rdf:type look ahead and filters. But let us ignore that for this
discussion.
2. The rdf:type values from the query result and the SomeClass.class
given by the app are used to determine the correct sub-class of
SomeClass.class to be instantiated.
3. For all fields that are mapped to an rdf:List, rdf:Seq etc. further
queries are performed to retrieve the values of the collection.
4. Instantiate the object with the proper sub-class of SomeClass.class
as determined in step #2 and populate the fields.
There are a couple of things that are noteworthy here:
1. The query is an $s $p $o wild card query rather than a query for just
the predicates defined in SomeClass.class. This is because of
sub-classing: the wild card captures the fields of the sub-class in
the same query.
2. The rdf:List, rdf:Seq etc. queries are performed as a second step for
the same reason: the sub-class determination has to happen first, and
only then do you know all the fields that are of rdf collection
types.
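To make the sub-class determination in step #2 concrete, here is a minimal
sketch. It is not OTM code: SubClassResolver, the registry map from rdf:type
URI to Java class, and the Doc/Report/AuditReport classes are all made up for
illustration. It only handles a linear class hierarchy; with branching
hierarchies the result would depend on the order of the rdf:type values.

```java
import java.util.Collection;
import java.util.Map;

// Hypothetical entity classes, used only to exercise the resolver.
class Doc {}
class Report extends Doc {}
class AuditReport extends Report {}

// Hypothetical helper: not part of OTM.
class SubClassResolver {
    /**
     * Starts from the class the application asked for and narrows it to the
     * most specific registered class named by the rdf:type values.
     */
    static Class<?> resolve(Class<?> requested,
                            Collection<String> rdfTypes,
                            Map<String, Class<?>> registry) {
        Class<?> best = requested;
        for (String type : rdfTypes) {
            Class<?> candidate = registry.get(type);
            // Narrow only when the candidate is 'best' itself or a sub-class of it.
            if (candidate != null && best.isAssignableFrom(candidate)) {
                best = candidate;
            }
        }
        return best;
    }
}
```

An rdf:type that is unknown to the registry, or that maps to a super-class of
the requested class, leaves the requested class unchanged.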
How collection fields are loaded:
Collection fields are fields that are either rdf collections or have
predicates with max-cardinality > 1. They are loaded and populated by
the TripleStore implementation by the time step #4 above is done, just
like any other field.
For Associations however, there is an additional step. Associations show
up in the query results above as a URI value. These URIs then need to be
treated as an @Id of the association class and an appropriate object
needs to be instantiated. This is where the type look-ahead comes in.
Since we now know the proper sub-class type, we simply call
session.load(ClassOfTheAssociation.class, id) and stick the resulting
object into the collection. The Session#load function will create a
proxy object (javassist BCE object) if the object is not already in the
Session cache.
This all works fine - so where does the lazy collection come in?
-----------------------------------------------------------------
The idea with the lazy collection is to delay the loading of collection
fields so as to optimize the load times and in some cases even skip the
loading of the collection fields altogether if the application only uses
these fields conditionally. There are a couple of variants to this strategy:
1. Make the TripleStore impl in OTM capable of doing partial object loads.
2. Or selectively apply this to collections of associations, where the
session.load() call is made lazily.
The main interest in doing a lazy collection implementation now is with
collections of associations that are java.util.Sets. When adding an
object to a Set, the equals() or hashCode() method on that or any
previously added object is called to enforce the no-duplicate guarantee
of the Set. This in turn triggers an actual load of the proxy object
(javassist BCE object), thus defeating all the advantages gained by a
lazy loaded association.
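The effect is easy to reproduce outside OTM. The following is a minimal
sketch, assuming a proxy that loads itself whenever any of its methods is
called. CountingProxy and SetLoadDemo are made-up names, and a counter stands
in for the real load; this is not the javassist proxy OTM generates.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a lazily loaded proxy object.
class CountingProxy {
    static int loads = 0;          // number of simulated loads so far
    private final String id;
    private boolean loaded = false;

    CountingProxy(String id) { this.id = id; }

    // Simulates the real load that any intercepted method call would trigger.
    private void ensureLoaded() {
        if (!loaded) {
            loaded = true;
            loads++;
        }
    }

    boolean isLoaded() { return loaded; }

    @Override public int hashCode() {
        ensureLoaded();
        return id.hashCode();
    }

    @Override public boolean equals(Object o) {
        ensureLoaded();
        return o instanceof CountingProxy && ((CountingProxy) o).id.equals(id);
    }
}

class SetLoadDemo {
    // Adds one unloaded proxy per id to a HashSet, the way a collection
    // field of associations would be populated.
    static Set<CountingProxy> fill(String... ids) {
        Set<CountingProxy> set = new HashSet<>();
        for (String id : ids) {
            set.add(new CountingProxy(id)); // add() calls hashCode() on the proxy
        }
        return set;
    }

    public static void main(String[] args) {
        fill("urn:a", "urn:b", "urn:c");
        System.out.println("proxies loaded by add(): " + CountingProxy.loads);
    }
}
```

Every proxy is loaded by the time the Set is populated, even though the
application has not touched a single element yet.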
Option #1 above makes it possible to lazy load any field, not
necessarily a collection. For this to work the field must be
accessed only via its getter method, and we would have to create a
javassist proxy even for a session.get() call. (Currently only
session.load() may create a proxy object.) However this use case is
rather limited and I am guessing it won't be a problem if this
implementation option is slated for a later version.
#1 (partial object load) just for collections:
For this we should do the rdf:type query first to determine the
sub-class followed by a query of all the non-lazy loaded (eager fetched)
predicates. The problem however is that get() operations will now always
result in 2 queries. Some optimizations are possible, however: if we can
determine that no sub-class of the class to be loaded is
registered in the SessionFactory, the rdf:type query can be skipped. An
additional option is to have a cache (ehcache, oscache or whatever the
app decides to use) of subject-uri vs rdf:types partitioned by
models/graphs.
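As a rough sketch of these two optimizations, assuming a plain in-memory map
in place of ehcache/oscache: TypeCache and TypeQueryPlanner are made-up names,
and the subclassRegistered flag stands in for whatever lookup the
SessionFactory would actually provide.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache of subject-uri vs rdf:types, partitioned by graph.
class TypeCache {
    // graph URI -> (subject URI -> rdf:type URIs)
    private final Map<String, Map<String, Set<String>>> byGraph =
        new ConcurrentHashMap<>();

    void put(String graph, String subject, Set<String> types) {
        Set<String> copy = Collections.unmodifiableSet(new HashSet<>(types));
        byGraph.computeIfAbsent(graph, g -> new ConcurrentHashMap<>())
               .put(subject, copy);
    }

    Optional<Set<String>> get(String graph, String subject) {
        Map<String, Set<String>> bySubject = byGraph.get(graph);
        return bySubject == null
            ? Optional.empty()
            : Optional.ofNullable(bySubject.get(subject));
    }
}

// Hypothetical decision helper: is the extra rdf:type query needed?
class TypeQueryPlanner {
    static boolean needsTypeQuery(boolean subclassRegistered, TypeCache cache,
                                  String graph, String subject) {
        if (!subclassRegistered) {
            return false; // only one possible class, so nothing to determine
        }
        return !cache.get(graph, subject).isPresent(); // cache miss: must query
    }
}
```

Invalidation is deliberately left out here; a real cache would need to drop or
update an entry whenever the rdf:type statements of a subject change.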
In any case when the object is instantiated all the lazy loaded
collection fields will have a dummy implementation supplied by OTM. A
method call on the collection will trigger the actual load. For
associations then a further optimization can be done to delay the call
to Session#load. It is just this optimization that is required to
satisfy the #2 case above.
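A dummy collection of this kind could look roughly like the sketch below. It
is only an illustration under stated assumptions: LazySet is a made-up name,
and the Supplier stands in for whatever deferred query or Session#load calls
OTM would make. It is not thread-safe.

```java
import java.util.AbstractSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical lazy Set: nothing is loaded until the first method call.
class LazySet<E> extends AbstractSet<E> {
    private final Supplier<Set<E>> loader; // stands in for the deferred load
    private Set<E> delegate;               // null until first access

    LazySet(Supplier<Set<E>> loader) { this.loader = loader; }

    boolean isLoaded() { return delegate != null; }

    // Runs the loader once, on first use, and keeps the result.
    private Set<E> delegate() {
        if (delegate == null) {
            delegate = new LinkedHashSet<>(loader.get());
        }
        return delegate;
    }

    @Override public Iterator<E> iterator() { return delegate().iterator(); }
    @Override public int size() { return delegate().size(); }
    @Override public boolean contains(Object o) { return delegate().contains(o); }
    @Override public boolean add(E e) { return delegate().add(e); }
    @Override public boolean remove(Object o) { return delegate().remove(o); }
}
```

Because the wrapper itself is what gets assigned to the field, no equals() or
hashCode() call reaches the association proxies until the application
actually uses the collection.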
That is basically it. The dummy/proxy implementation needs to be done in
both cases. So I am leaning towards doing just that for now and the
partial object loads later.
Thoughts? Comments? Should we just start with #2 (i.e. lazy load support
only for collections of associations) and do #1 later (partial
object load with support for lazy collections for all types)?
Thanks,
Pradeep