Monday, October 17, 2005

Simple Semi-Structured Data

This is an excellent article by David Loshin on the value of what he terms "semi-structured data". I've seen this term used to describe a wide variety of data, including raw HTML and XML, but I think Loshin captures a more precise, and hence more useful, definition.

"There is an intermediate classification of content called “semi-structured data.” This refers to sets of data in which there is some implicit structure that is generally followed, but not enough of a regular structure to “qualify” for the kinds of management and automation usually applied to structured data. We are bombarded by semi-structured data on a daily basis, both in technical and non-technical environments. For example, web pages follow certain typical forms, and content embedded within HTML often have some degree of metadata within the tags. This automatically implies certain details about the data being presented. A non-technical example would be traffic signs posted along highways. While different areas use their own local protocols, you will probably figure out which exit is yours after reviewing a few highway signs."

"This is what makes semi-structured data interesting—while there is no strict formatting rule, there is enough regularity that some interesting information can be extracted. Often, the interesting knowledge involves entity identification and entity relationships. " This doesn't sound a lot different than classic ERD modeling or relational data warehouse modeling. With existing pattern recognition techniques, I wonder how difficult it would build a generic parser across semi-structured and structured data that could come up with composite entity models across multiple information domains?
