
OSLC Indexing

STATUS: topic overview, not considered an OSLC specification.

Overview

Many independent tools are used in a collaborative lifecycle management system. Each tool manages its own data of some specific type (for example, requirements, test cases, or defects). However, much of each tool’s data refers to data managed by the other tools. The problem faced by developers and managers is how to understand and work with this web of data as a whole. For example, which test cases are associated with a specific requirement, and which of them are currently passing or failing?

An existing solution to this problem involves “extracting” the data from the individual tools, “transforming” it into a corresponding relational database (RDB) model, and then “loading” it all into an RDB, referred to as a “data warehouse”, where it can be queried, analyzed, reported on, etc. This ETL (Extract, Transform, and Load) process for building a data warehouse has several drawbacks:

  1. The process is time-consuming and therefore typically only done once or twice a day. As a result, the content of the data warehouse is always somewhat stale.
  2. The process is complex and difficult to maintain in the face of new types of data or changes to existing data types. Since each tool defines its own way of referring to other data, the ETL process needs to implement a lot of business logic to resolve the various kinds of cross-tool data references.
  3. It is not possible to automatically enforce the access control rules from the original tools on the data in the warehouse.

The Linked Lifecycle Data (LLD) specifications from OSLC provide the foundation for a better solution to this data integration problem. LLD provides a uniform way to identify data, HTTP Uniform Resource Identifiers (URIs), and a common representation for it, the Resource Description Framework (RDF). With OSLC, each lifecycle tool assigns a URI to each of its resources, provides an RDF representation of that resource, and refers to other data using RDF properties whose values are the URIs of the target data.

Using LLD, data integration across the multiple lifecycle tools can be achieved by loading the RDF representations of all the data into a shared triple store, also known as an “index”. The RDF index then provides a SPARQL query service that can answer queries on data across all the tools in the system.
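The idea can be sketched with a toy in-memory index. This is only an illustration under stated assumptions, not an OSLC-defined API: the `TripleIndex` class, the `status_of_tests_for` helper, the `http://example.org/ns#` vocabulary, and all resource URIs are hypothetical, and simple pattern matching stands in for a real SPARQL query service.

```python
from collections import defaultdict

EX = "http://example.org/ns#"  # hypothetical vocabulary, for illustration only


class TripleIndex:
    """Toy stand-in for a shared triple store (not a real SPARQL engine)."""

    def __init__(self):
        # source resource URI -> set of (subject, predicate, object) triples
        self._by_resource = defaultdict(set)

    def load(self, resource_uri, triples):
        """Index exactly the triples found in one resource's RDF representation."""
        self._by_resource[resource_uri] = set(triples)

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t
            for triples in self._by_resource.values()
            for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


def status_of_tests_for(index, requirement_uri):
    """Cross-tool join: test cases that validate a requirement, with their status."""
    result = {}
    for test_uri, _, _ in index.match(p=EX + "validates", o=requirement_uri):
        statuses = index.match(s=test_uri, p=EX + "status")
        result[test_uri] = statuses[0][2] if statuses else None
    return result


# Resources from two hypothetical tools: a requirements tool and a test tool.
REQ = "http://rm.example.org/requirements/1"
TC1 = "http://qm.example.org/testcases/7"
TC2 = "http://qm.example.org/testcases/8"

index = TripleIndex()
index.load(REQ, [(REQ, EX + "title", "Login must time out")])
index.load(TC1, [(TC1, EX + "validates", REQ), (TC1, EX + "status", "passed")])
index.load(TC2, [(TC2, EX + "validates", REQ), (TC2, EX + "status", "failed")])

report = status_of_tests_for(index, REQ)
```

Because the test cases refer to the requirement by its URI, the query needs no tool-specific logic to resolve the cross-tool reference.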

This approach has several advantages over a data warehouse:

  1. There are no tables, and no mapping from resources to tables, to maintain.
  2. We index exactly the triples in the resources, thereby avoiding the “transform” part of the ETL process.
  3. Because resources in the index map one-to-one to resources in the underlying tools, the index can mirror the tools’ access rules.

Because no expensive transformation step is needed, extraction is fast and can be run every few seconds, so the RDF data in the index can be kept current. For example, a report run 15 minutes before a status meeting reflects the latest data, and a newly added data source (or new properties in an existing data source) appears in new queries almost immediately.

Indexing Profile

Implementing an index-based query and search engine involves several challenges:

  1. The RDF indexer must be able to initially load large amounts of RDF data.
  2. Updates to data in the individual tools must appear in the index with low latency.
  3. Fine-grained access control as defined by the tools must be enforced by the query service.

Several capabilities are required of participating tools in order to achieve these goals. To be specific, any OSLC service provider that wishes its resources to be available for inclusion in an RDF index MUST provide the following capabilities, collectively referred to as the Indexing Profile:

  1. Resource Publishing Capability (TBD) exposes a service provider’s complete set of resources. These resources are referred to as the service provider’s "published resources".
  2. Resource Changelog Capability (OslcCoreChangelog) provides a change log that lists change events for the set of published resources in (1), as well as creation events for new resources that would have been included in (1).
  3. Resource Security Capability (TBD) provides access to security information (ACLs) for the published resources, to allow an index to mirror the access rules of the underlying tools.

In other words, a service provider that wishes to be indexable must publish its resources and then provide change information and access-control information for those resources.
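How an indexer might combine the first two capabilities can be sketched as follows. This is a minimal sketch under stated assumptions: the `ChangeEvent` shape (a sequence number, a kind of "creation", "modification" or "deletion", and a resource URI), the `Indexer` class, and the `fetch` callable are all hypothetical and are not taken from the change log specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical shape of one change log entry."""
    sequence: int       # assumed to increase with each event
    kind: str           # "creation", "modification" or "deletion"
    resource_uri: str


class Indexer:
    def __init__(self, fetch):
        self.fetch = fetch        # callable: resource URI -> iterable of triples
        self.index = {}           # resource URI -> set of triples
        self.last_sequence = 0    # highest change event already applied

    def initial_load(self, published_uris):
        """Bulk-load every published resource once."""
        for uri in published_uris:
            self.index[uri] = set(self.fetch(uri))

    def apply_changes(self, events):
        """Apply unseen events in order; return how many were applied."""
        applied = 0
        for event in sorted(events, key=lambda e: e.sequence):
            if event.sequence <= self.last_sequence:
                continue  # already processed on an earlier poll
            if event.kind == "deletion":
                self.index.pop(event.resource_uri, None)
            else:
                # creation or modification: re-fetch and replace the triples
                self.index[event.resource_uri] = set(self.fetch(event.resource_uri))
            self.last_sequence = event.sequence
            applied += 1
        return applied


# A dict stands in for the tool's published resources (illustrative data only).
TC7 = "http://qm.example.org/testcases/7"
TC8 = "http://qm.example.org/testcases/8"
tool_data = {TC7: [(TC7, "status", "failed")]}

indexer = Indexer(fetch=lambda uri: tool_data[uri])
indexer.initial_load(list(tool_data))

# The tool later modifies one resource and creates another.
tool_data[TC7] = [(TC7, "status", "passed")]
tool_data[TC8] = [(TC8, "status", "failed")]
events = [ChangeEvent(1, "modification", TC7), ChangeEvent(2, "creation", TC8)]

first_poll = indexer.apply_changes(events)   # applies both events
second_poll = indexer.apply_changes(events)  # nothing new to apply
```

Remembering the last applied event lets the indexer poll the change log frequently without re-reading resources that have not changed.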

All three of these capabilities are, themselves, optional OSLC features that can also be used independently for other purposes. However, to participate in an RDF index, especially with acceptable performance, a service provider must implement them all.
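As an illustration of how a query service might use the security information, the sketch below filters indexed triples by a per-resource read list. The ACL representation (a mapping from resource URI to the set of user ids allowed to read it), the `visible_triples` function, and the sample data are assumptions made for this example; the Resource Security Capability itself is still marked TBD above.

```python
def visible_triples(index, acls, user):
    """Return only the triples whose source resource the user may read.

    index: resource URI -> set of (subject, predicate, object) triples
    acls:  resource URI -> set of user ids allowed to read (hypothetical format)
    """
    visible = set()
    for resource_uri, triples in index.items():
        # Default deny: a resource with no ACL entry is hidden from everyone.
        if user in acls.get(resource_uri, set()):
            visible |= triples
    return visible


# Illustrative data only.
REQ = "http://rm.example.org/requirements/1"
DEFECT = "http://cm.example.org/defects/42"

index = {
    REQ: {(REQ, "title", "Login must time out")},
    DEFECT: {(DEFECT, "severity", "critical")},
}
acls = {
    REQ: {"alice", "bob"},
    DEFECT: {"alice"},
}

alice_view = visible_triples(index, acls, "alice")
bob_view = visible_triples(index, acls, "bob")
```

Filtering per source resource is possible because each indexed resource corresponds to exactly one resource in the originating tool, so that tool's access rules can be applied unchanged.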

Topic revision: r2 - 24 Apr 2011 - 21:20:51 - DaveJohnson
 