Saturday, November 05, 2005

Automated Methods of Semantic Reconciliation of Structured Enterprise Data Sources

One of the biggest difficulties in building enterprise-wide data warehouses is assembling logical models of entities from the data held in ERP/SCM/CRM/BPM schemas, where the same entity (e.g. customer, product, employee) exists in multiple locations. In any given corporation, there are extraordinarily few individuals, if any, who know the details of more than one of the underlying schemas. For example, I may know where customer data is held in a Siebel CRM system, but I would have no idea where it is held in Oracle Financials or SAP R/3. To enable the construction of BI applications that source data from multiple transactional systems, vendors will have to lower the barrier to building an enterprise-wide information model; otherwise I don’t believe such models will be adopted. The paper below discusses the challenges in accomplishing this, which are substantial, but there is tremendous room here for the innovation needed to make enterprise-wide performance management feasible.

An excellent article from ACM Queue discusses some recent research in this area that shows some promise for making this crucial task much easier:

Making heterogeneous schemas play nicely together has challenged computer
scientists for years, but we're on the path to better behavior.

Reconciling the vocabularies of different data sources is also the subject of Dr. AnHai Doan's thesis, which won the 2003 ACM Doctoral Dissertation Award. It's a fascinating read.