[oslc-core] ChangeLog Proposal moving to Convergence Phase

Wed Apr 13 15:24:23 EDT 2011

Jim des Rivieres/Ottawa/IBM wrote on 04/12/2011 04:48:00 PM:

> Re: [oslc-core] ChangeLog Proposal moving to Convergence Phase
>
> Jim des Rivieres
>
> to:
>
> Frank Budinsky
>
> 04/12/2011 04:48 PM
>
> Cc:
>
> oslc-core, RELM Development
>
> Hi Frank,
>
> I've been giving the OSLC indexing spec proposal a lot of thought.
> I've been looking for a good way to layer the specification, so that
> it is sufficiently general to garner wide use, and not overly
> dependent on other OSLC mechanisms that would make it more awkward
> for applications that have nothing to do with any OSLC domains.
>
> First, a couple of general comments on the 3 capabilities in the
> current spec:
>
> re: Resource Publishing Capability. The need to enumerate is clear.
> However, it is unclear how an index-building client would use this.
> Even if there was stable paging, it's unclear how the client would
> then catch up on all the changes that happened since the client
> started walking the enumeration. I believe the client needs to have
> a way to learn exactly which event it needs to carry on from using
> the change log. The only way to do this currently is for the client
> to make a separate request to retrieve the current event number from
> the change log before it requests the enumeration. Also, the
> resources enumeration is tied to the time of the request. For a
> server with a very large set of resources, this may be expensive. If
> it were possible for the server to answer an enumeration of its
> resource set at a point in the past of its choosing, the server
> would have more flexibility as to how and when to enumerate its
> resource set.

It's not a bad idea, but before we add additional complexity we prefer to
first make sure that it's necessary. We can easily add this kind of
optimization in the future. I think that all that would be needed is a way
to let the service provider pass a "starting at" changeLog sequence number
that corresponds to the ChangeLog position from which the changes will need
to be processed after loading the enumerated resources.
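
As a rough illustration of the "starting at" idea, here is a minimal Python sketch. Everything in it is hypothetical: the `Enumeration` and `ChangeEntry` types, the `starting_at` field, the event kinds, and the in-memory sample data merely stand in for whatever the spec might eventually define.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChangeEntry:
    sequence: int   # event sequence number in the change log
    uri: str        # resource affected by the event
    kind: str       # "created", "modified" or "deleted" (assumed event kinds)


@dataclass
class Enumeration:
    starting_at: int  # hypothetical: log position that the enumeration reflects
    uris: List[str]   # resources in the set as of that position


def bootstrap(enum: Enumeration, log: List[ChangeEntry]) -> Dict[str, int]:
    """Load the enumerated resources, then replay only the later changes.

    Returns a map of URI -> sequence number of the last event applied to it
    (the enumeration's own position for resources untouched since then).
    """
    index = {uri: enum.starting_at for uri in enum.uris}
    for entry in sorted(log, key=lambda e: e.sequence):
        if entry.sequence <= enum.starting_at:
            continue  # already reflected in the enumeration
        if entry.kind == "deleted":
            index.pop(entry.uri, None)
        else:
            index[entry.uri] = entry.sequence
    return index


snapshot = Enumeration(
    starting_at=10,
    uris=["http://example.com/r/1", "http://example.com/r/2"],
)
log = [
    ChangeEntry(9, "http://example.com/r/2", "created"),
    ChangeEntry(11, "http://example.com/r/3", "created"),
    ChangeEntry(12, "http://example.com/r/1", "deleted"),
]
index = bootstrap(snapshot, log)
```

With a position supplied alongside the enumeration, the client no longer needs a separate request for the current event number before it starts enumerating.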

>
> re: Resource Changelog Capability. A stable, paged representation of
> change entries in reverse chronological order based on event
> sequence number - feels right, except for format of the
> representation. Using an RDF representation for this seems awkward
> and unwarranted (RDF does not do ordering easily). Using Atom and
> AtomPub would be more appropriate, and arguably closer to what
> people would expect of an internet change log protocol.

I think the RDF format we have now is simple enough. We could always allow
additional formats in the future.
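
Whatever the serialization, a client consuming a log shaped as described above (paged, newest entries first) might walk it roughly as follows. This is only a sketch: the `(sequence, uri)` entry shape and the list-of-pages input are assumptions made for illustration, not part of the proposal.

```python
from typing import Iterable, List, Tuple

Entry = Tuple[int, str]  # (sequence number, resource URI); hypothetical shape


def new_entries(pages: Iterable[List[Entry]], last_processed: int) -> List[Entry]:
    """Collect entries newer than last_processed from newest-first pages.

    Stops reading as soon as an already-processed entry is seen, and returns
    the new entries oldest first so they can be applied in order.
    """
    collected: List[Entry] = []
    for page in pages:
        for seq, uri in page:
            if seq <= last_processed:
                return collected[::-1]
            collected.append((seq, uri))
    return collected[::-1]


pages = [
    [(7, "http://example.com/r/c"), (6, "http://example.com/r/b")],
    [(5, "http://example.com/r/a"), (4, "http://example.com/r/z")],
]
pending = new_entries(pages, last_processed=4)
```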

> Also, given
> that the server is allowed to truncate change entries from the
> change log, it might make sense to tell the client up front the
> number of the lowest numbered event in the entire log. That way the
> client can at least determine when they've "missed the boat" -
> missed out on some events that were crucial to incrementally
> updating their picture.

That sounds simple enough to add, but I wonder if the case it helps should
(almost) never happen anyway. I think that a client will typically be OK,
or know it's been down so long that it won't even try to use the ChangeLog.
That said, if others agree this is worth it, I would be OK with adding an
optional property for this.
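
For what it's worth, the client-side check that such a property would enable is tiny. A sketch, assuming the log advertises the sequence number of its oldest retained event (the parameter names here are made up):

```python
def can_resume(last_processed: int, oldest_available: int) -> bool:
    """Return True if no event after last_processed has been truncated.

    last_processed:   sequence number of the last event the client applied.
    oldest_available: hypothetical property giving the lowest-numbered event
                      still present in the change log.
    """
    return last_processed + 1 >= oldest_available


# The client applied events up to 41 and the log still starts at 42: no gap.
resume_ok = can_resume(last_processed=41, oldest_available=42)
# The client stopped at 40 and event 41 is gone: it has "missed the boat".
resume_gap = can_resume(last_processed=40, oldest_available=42)
```

When the check fails, the client would fall back to a fresh enumeration instead of incremental updating.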

>
> re: Resource Security Capability. I didn't look at this - but agree
> we will need something that addresses security. One observation: the
> Resource Publishing Capability and Resource Changelog Capability are
> designed to be used by a different clientele from that of the other
> capabilities found in an OSLC service provider. The latter
> capabilities implicitly require an authenticated user, and constrain
> access based on the permission of that user. The former capabilities
> likely require an authenticated client application, will need to
> reify access constraints, and later apply those access constraints
> when running queries on behalf of an authenticated user.

Interesting observation.

>
> More generally, here's how I've come to think about this problem.
>
> A server maintains a particular set of resources, and wants to make
> that set of resources available to its clients. These clients, who
> have no a priori knowledge of which resources are or are not in the
> set, need a way to enumerate the URIs of the resources in the set.
> (Hence the Resource Publishing Capability.) The set of resources may
> be continually changing under foot, and clients need a way to track
> how those changes affect the set of resources. (Hence the Resource
> Changelog Capability.) Our primary envisioned clients do both. They
> start off enumerating, and afterwards switch to incremental
> updating. And the reason our clients are interested in certain sets
> of resources is that they are trying to retrieve them to get RDF
> triples to put into a RDF triple store. However, 99.99% of the
> protocol is about dealing with a large active set of resources, and
> only 0.01% about these resources being bearers of RDF triples.

You're right that both the ChangeLog and Publishing capabilities don't care
about the resource contents (whether it's RDF or something else). This
would change if we decide in the future to add finer-grained information to
the change log, as some people have been suggesting.

>
> Rather than specify it as two separate capabilities, it would make
> sense to specify them as a single capability, with a single
> endpoint. Conceived of this way, the capability at the heart of
> things is a protocol for dealing with big sets of resources. Call
> this the Big Resource Set protocol. A server would implement Big
> Resource Set protocol to expose its set of resources; a client would
> consume the Big Resource Set protocol to initially enumerate the
> resource set, and afterwards to continue monitoring for incremental
> changes affecting resources in the set. The Big Resource Set
> protocol would be neutral on how the resource set comes into being,
> what causes it to change, and which resources might end up as
> members of the set. The protocol would also be neutral on the
> representation of the resources; all the spec would need to promise
> is that all resources are identified by URIs, and that HTTP etags
> are used to identify distinct resource states.

I think the "Big Resource Set" protocol you're describing is the same as
the current ChangeLog Capability/spec, if I make a few small changes to it:
1) remove the part that says how a ChangeLog URI is published in the
ServiceProvider, 2) remove the linkage to QueryCapabilities (i.e., what I
was calling Publishing Capability) and simply say it needs the URIs for
the published resources, 3) change the wording a little to be more generic
and not use OSLC terms like service provider - which implies lots of things
that don't need to be specified but are assumed.

I like this idea and will take a pass at making these changes.
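
To make the "generic" version concrete, here is a small sketch that relies only on the two promises mentioned above: resources are identified by URIs, and ETags distinguish resource states. The function name and the dictionary-based view of a resource set are illustrative assumptions, not anything taken from the spec.

```python
from typing import Dict, Set, Tuple


def diff_states(
    known: Dict[str, str], current: Dict[str, str]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Compare two URI -> ETag views of a resource set.

    Returns (added, removed, changed) sets of URIs. Nothing here depends on
    what the resources contain, only on their URIs and ETags.
    """
    added = set(current) - set(known)
    removed = set(known) - set(current)
    changed = {
        uri for uri in set(known) & set(current) if known[uri] != current[uri]
    }
    return added, removed, changed


known = {"http://example.com/r/1": '"v1"', "http://example.com/r/2": '"v1"'}
current = {"http://example.com/r/2": '"v2"', "http://example.com/r/3": '"v1"'}
added, removed, changed = diff_states(known, current)
```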

>
> The Big Resource Set protocol would serve as the lowest level
> protocol spec. We would build a second separate layer atop it. For
> our problem at hand - defining general purpose sources of indexable
> RDF content - a server would provide an endpoint implementing the
> Big Resource Set protocol, with the added proviso that all
> resources in the resource set are dereferenceable to RDF content,
> with the etag varying with significant changes to that RDF content.
> (We would also address the matter of security at this level, and
> spell out the expectations about how these endpoints are available
> only to trusted indexer clients that can pick up ACL information for
> the resources and correctly apply it when the index data is shown to
> regular users. We would also need to spell out expectations
> regarding the logical consistency of the RDF content across
> resources in the resource set, since it will likely be undesirable
> if the fact base contains contradictions. The matter of overlapping
> resource sets that you raise would also be addressed at this level.)

Isn't this essentially what I was calling the Indexing Profile, only not
defining the OSLC capabilities that it's based on?

>
> We would add a thin layer on top of the second to tie things in to an
> OSLC domain. In the context of an OSLC domain specification, we
> would further specify that an OSLC service provider should expose
> one or more RDF index source endpoints and make them discoverable
> via markup in the OSLC service provider. The resources in the
> resource sets would be the "published resources" belonging to that
> OSLC service provider.

If I understand you, this is where we say how the ChangeLog and published
resources are added to the OSLC discovery model. Maybe we don't even need
to rush on this part, since OSLC service providers can implement the
ChangeLog spec even before we provide a "standard" way to publish/discover
it.

>
> Regards,
> Jim
>
> From:
>
> Frank Budinsky/Toronto/IBM at IBMCA
>
> To:
>
> oslc-core at open-services.net
>
> Cc:
>
> RELM Development <RELM_Development%IBMCA at ca.ibm.com>
>
> Date:
>
> 04/05/2011 04:23 PM
>
> Subject:
>
> [oslc-core] ChangeLog Proposal moving to Convergence Phase
>
> Sent by:
>
> oslc-core-bounces at open-services.net
>
> I've updated the Change Log proposal to include all the issues we've
> discussed, and to provide a little more elaboration for things that
> people didn't seem to easily pick up from earlier drafts. It is
> available here:
>
> http://open-services.net/pub/Main/IndexingProposals/OSLC_indexing_0404.doc
>
> I'll look at converting it to the proper OSLC TWiki format next.
>
> Thanks,
> Frank.