Here Comes the Enterprise AI Data Platform
Business managers are under increasing pressure to use generative AI (GenAI) as a business management tool, with high expectations of improved productivity at lower cost. For GenAI to provide a fully informed picture of the business, however, it must have access to complete, high-quality, and consistent data from across the enterprise. Sadly, most business databases today are stuck in departmental or function-specific silos.
The solution to this problem, and what’s needed to achieve the dream of making AI at least a junior partner in business management, is to provide comprehensive access to data, along with the metadata necessary to make the meaning and context of that data clear. In short, what’s called for is an Enterprise AI Data Platform.
Background
From the time digital computers were first used to assist with business management, it has been necessary to store data on media that allow it to be processed in successive steps. The initial means were sequential formats: paper tape, punched cards, and magnetic tape. Data stored this way served as input to some calculating process for applications such as accounting, inventory management, sales recordkeeping, and tax reporting. Such a process would read the input, perform calculations, and produce output to be consumed by another program, in a succession of steps that formed a batch process, with reports generated along the way. Eventually, random access media, including magnetic drum and spinning disk, displaced the older forms and afforded a wider variety of processing activities while obviating the need for intermediate sorting and merging steps.
Each application had its own data files, and the files for each application were in a format different from those of any other application, so data sharing was virtually impossible. Converting and combining the data for analytic purposes was done only occasionally, since such processes had to be continually updated to match the formats of the source data, and the temporary analytic files they produced were used to generate reports and then discarded. What was needed was a way to select data from each application, convert and combine it, and generate analytic reports without running the same processes over and over. What was needed, in short, was a collection point, continuously and automatically updated, that could be used to generate reports at any time.
Bill Inmon and the Data Warehouse
Although IBM researchers Barry Devlin and Paul Murphy are credited with having invented the concept of the data warehouse in the late 1980s, Bill Inmon is thought to have written the first comprehensive book on the subject and has provided leadership in the area of data warehousing and enterprise data intelligence ever since.
Simply put, a data warehouse is a collection of data having enterprise analytic value, culled from the various relevant operational applications and reconciled into a single relational database with a comprehensive schema. When databases still depended heavily on physical disk storage for their operation, data warehouses were often compromised by such non-relational hacks as star schemas, adopted to reduce disk I/O and improve performance (a star schema is a hack because it anticipates the queries to be posed, whereas the relational ideal is to support any query one might dream up). Today, however, thanks to very large memory models and solid-state storage such as flash, data warehouses generally feature a full third normal form schema.
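To make the contrast concrete, here is a minimal sketch using SQLite; all table and column names are illustrative, not drawn from any particular warehouse design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: a central fact table ringed by flat dimension tables.
# It pre-shapes the data around anticipated queries (e.g., sales by region),
# which is why the text calls it a hack.
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    amount       REAL
);
""")

# Third normal form: every non-key attribute depends on its table's key alone,
# so region lives in its own table rather than being repeated per customer.
# No query is anticipated; any join path the schema permits is fair game.
conn.executescript("""
CREATE TABLE region   (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT,
                       region_id INTEGER REFERENCES region(region_id));
CREATE TABLE sale     (sale_id INTEGER PRIMARY KEY, amount REAL,
                       customer_id INTEGER REFERENCES customer(customer_id));
""")
```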
Data Marts
Because data warehouses have large, complex schemas and extensive storage requirements, they lack the nimbleness necessary for more tactical, short-term analytic projects. Businesses therefore turned to smaller, more focused databases called data marts. These tended to be single-subject; some were subsets of the data warehouse, usually with project-specific additions such as survey data, while others involved data that was not in the data warehouse at all.
This approach provides some nimbleness and is cheaper, but it still requires jumping through many formal hoops, and when done on premises it demands requisition approval for systems and storage.
Hadoop and MapReduce
In an effort to create an expandable large-scale data platform that could run on commodity servers, developers Doug Cutting and Mike Cafarella began work in 2002 on the open-source Apache Nutch search engine project. Drawing on Google’s published papers describing its distributed file system and its MapReduce processing model, they built distributed storage and data processing software for Nutch, and that combination was later spun out as the Apache Hadoop project.
The processing mechanism for Hadoop is based on a model called MapReduce, which involves a Map function (building key-value pairs from the input records) and a Reduce function (merging all the values that share a key into a result), in a pattern inspired (or so I have heard) by the map and reduce functions of functional programming languages such as LISP.
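A toy word count in plain Python shows the shape of the model; the grouping ("shuffle") step that Hadoop performs between the two phases is simulated here with a dictionary.

```python
from collections import defaultdict
from itertools import chain

# Map: emit (key, value) pairs from each input record.
def map_fn(line):
    for word in line.split():
        yield (word.lower(), 1)

# Reduce: merge all the values that share a key into one result.
def reduce_fn(key, values):
    return (key, sum(values))

def map_reduce(records, map_fn, reduce_fn):
    # Shuffle: group intermediate pairs by key (the Hadoop framework
    # does this between the Map and Reduce phases).
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map(map_fn, records)):
        groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(map_reduce(["the quick brown fox", "the lazy dog"], map_fn, reduce_fn))
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]
```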
The plus side of this is that it is highly scalable, and its ability to operate on a cluster of commodity servers and storage makes the hardware quite affordable, especially when the cost is compared with that of a large-scale data warehouse. The negative is that MapReduce programming is neither easy nor straightforward, and because MapReduce operates as a series of batch operations, it is not especially fast. The system is great for searching through massive data collections for results that satisfy specific combinations of criteria, but less useful for typical analytic queries.
The Cloud and Data Lake Platforms
The data collection that accumulates on such a platform is commonly called a “data lake”. Some enterprises attempted to reproduce their existing data warehouse functionality using Hadoop, but found that, for the most part, Hadoop was too expensive (from a labor perspective) and lacked the algebraic regularity that comes naturally with data governed by a third normal form schema.
They did find it somewhat useful for less formal large-dataset queries and for searching. To overcome the performance and labor-cost (complexity) issues, cloud services emerged that offered a more scalable platform with a highly scalable query capability. New cloud data collection and analysis services pioneered this approach, offering a cloud-based data lake platform built around Spark, the open-source data processing engine. The advantage of the cloud as a platform is that, since analytic projects often require large amounts of storage and processing resources for brief periods of time, users can execute their projects without tying up such resources on premises, and pay only for what they use.
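As a minimal sketch of what an ad hoc lake query looks like on such a platform, assuming the pyspark package is available; the object-storage path and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

# A local session for illustration; on a cloud platform the session is
# provisioned by the service and pointed at object storage.
spark = SparkSession.builder.appName("lake-query").getOrCreate()

# The path is an assumption; cloud lakes use URIs such as s3://, gs://, abfss://.
events = spark.read.option("header", True).csv("s3://my-lake/raw/events/")

# An ad hoc analytic query over raw files, with no prior schema design.
daily = (events
         .groupBy("event_date", "event_type")
         .agg(F.count("*").alias("event_count"))
         .orderBy("event_date"))
daily.show()
```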
As this platform gained popularity, and as users clamored for a more familiar query language, platform providers also offered support for open-source SQL engines such as Presto and Trino. This technology fed a desire for a more comprehensive query environment, one that could blend the ad hoc analytic data of the data lake with the formalized ongoing data collection in the data warehouse.
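With a SQL engine in place, a lake query looks like ordinary database access. Here is a sketch using the open-source trino Python client; the host, catalogs, and table names are illustrative assumptions, not any particular vendor’s configuration.

```python
from trino.dbapi import connect

# Coordinator host and user are illustrative; a Trino deployment typically
# exposes several catalogs (here, a lake and a warehouse).
conn = connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One SQL statement joins a data lake table with a data warehouse table,
# each addressed through its own catalog.
cur.execute("""
    SELECT w.customer_name, SUM(CAST(l.amount AS DOUBLE)) AS lake_total
    FROM lake.web.click_purchases AS l
    JOIN warehouse.sales.customers AS w
      ON l.customer_id = w.customer_id
    GROUP BY w.customer_name
""")
for row in cur.fetchall():
    print(row)
```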
Data Lakehouses
The data lakehouse is a response to that need. It blends data from the data warehouse with data in the data lake in a malleable environment based on a variety of open-source technologies, creating a full-spectrum environment for analytic data of all kinds in the enterprise. It is also seen as a suitable environment for generative AI to pull data from a range of sources in response to wide-ranging business prompts, without exposing users to the complexity of varying schematic structures.
Data lakehouses typically feature a data lake environment that includes relational-style table support, usually provided through the open-source Apache Iceberg table format, along with support for less formal data in sequential files such as comma-separated value (CSV) files. Data in this environment can be joined or unioned across data lake and data warehouse boundaries. Vendors offering a data lakehouse platform typically provide the informal data-holding environment with common query and search capabilities built in, a relational data warehouse or the ability to dynamically link to one, and some data movement and transformation capabilities. Increasingly, they also offer semantic metadata support, enabling data managers to establish durable relationships between data in the two environments based on the meaning and business context of the data.
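Here is a sketch of such a cross-boundary join, assuming a Spark session with the Iceberg runtime jar on the classpath and an Iceberg catalog configured under the illustrative name “lake”; all paths and table names are assumptions.

```python
from pyspark.sql import SparkSession

# Catalog configuration is illustrative; it follows Iceberg's documented
# Spark catalog properties but the names and warehouse path are invented.
spark = (SparkSession.builder
         .appName("lakehouse-join")
         .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.lake.type", "hadoop")
         .config("spark.sql.catalog.lake.warehouse", "s3://my-lake/iceberg")
         .getOrCreate())

# Informal data: raw CSV files in the lake, registered as a temporary view.
(spark.read.option("header", True)
      .csv("s3://my-lake/raw/returns/")
      .createOrReplaceTempView("raw_returns"))

# One query spanning a formal Iceberg table and the informal CSV data.
spark.sql("""
    SELECT o.product_id, COUNT(r.return_id) AS returns, SUM(o.amount) AS sales
    FROM lake.sales.orders AS o
    LEFT JOIN raw_returns AS r ON r.order_id = o.order_id
    GROUP BY o.product_id
""").show()
```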
The Challenge: Integrating Live Operational Data with AI
What is missing in all this is a dynamic link to the live operational data that drives the business. Today, operational business data must be transferred to the data warehouse by means of extract, transform, and load (ETL) processes, and those ETL processes must be manually maintained as systems change. This means the operational data accessed in a query or report from the lakehouse can’t be current. Sometimes one can get close by building the ETL function on top of a change-detection mechanism in the database known as “change data capture” (CDC), but the time required for conversion and transfer still guarantees that the data will be at least a little old.
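Here is a minimal sketch of CDC-style incremental extraction done by polling a change table; real systems read the database’s transaction log instead, and the table, columns, and loader function here are illustrative stand-ins.

```python
import sqlite3

# Simulated operational database with a change table (illustrative schema).
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE orders_changes (change_seq INTEGER PRIMARY KEY,
                                 op TEXT, order_id INTEGER, amount REAL);
    INSERT INTO orders_changes VALUES (1, 'INSERT', 101, 40.0),
                                      (2, 'UPDATE', 101, 35.0);
""")

last_seen = 0  # high-water mark: last change already shipped downstream

def ship_to_lakehouse(op, order_id, amount):
    # Stand-in for the transform-and-load hop; even at its fastest, this
    # hop is why lakehouse copies lag the live operational data.
    print(f"{op} order {order_id}: {amount}")

def poll_changes():
    global last_seen
    rows = source.execute(
        "SELECT change_seq, op, order_id, amount FROM orders_changes "
        "WHERE change_seq > ? ORDER BY change_seq", (last_seen,)).fetchall()
    for change_seq, op, order_id, amount in rows:
        ship_to_lakehouse(op, order_id, amount)
        last_seen = change_seq

poll_changes()  # each scheduled run picks up only the new changes
```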
Some data lakehouse platform vendors are starting to offer database systems built into their environments that include transaction processing, but the idea that these could replace hundreds of databases managing petabytes of live, continuously changing data, and do so with acceptable performance, strains the imagination.
The Enterprise AI Data Platform
The ultimate answer must involve something so messy that cloud service providers have been avoiding it up to now: real-time data integration with legacy systems that performs at scale. Offering some support for transactions in select cases is a step in the right direction, but it is not sufficient. Users will expect to be able to use AI to get a picture of the enterprise in real time, and to get detailed answers to questions of any kind based on the latest state of the data.
This is the Enterprise AI Data Platform: a facility that enables real-time analytics to be performed on any data, not just data resident in the data warehouse or data lake. Thus, it can expose the current state of the enterprise to the user and enable AI to traverse the data and retrieve it based on its meaning. The facility must have sufficient semantic metadata to support examination of the data in terms of its meaning and context, which is especially important when the query is driven by an AI prompt.
A complete Enterprise AI Data Platform comprises a number of required elements, listed below.
Data lake: a flexible data container that supports a variety of formats, including CSV (comma-separated-values) files and tables. A facility of growing popularity for holding tables is Apache Iceberg. The data lake is used to collect data for analysis and run searches, queries, or AI analyses on the data.
Data warehouse: a formally defined relational database used for enterprise-wide business intelligence reporting and analysis. Normally, it is kept in third normal form.
Data lakehouse: an intermediate space that links data warehouse and data lake data together for comprehensive queries, searches, and AI analyses.
A means for periodic or dynamic data selection from running operational applications, usually involving ETL (extract-transform-load) tools and CDC (change data capture) on the databases for real-time data movement.
A means of master data extraction for the data warehouse. Master data is data that underlies business processes but normally does not include the data engaged in the processes (i.e., detail data), except in summary form. For instance, it may capture the names of products and customers, but only summarize sales, since the actual transaction dates and sizes are less useful for analysis. Projects doing analysis at that level of detail are better left to the data lake.
A semantic metadata property graph. A property graph represents structures of relationships in the form of nodes and edges (or relationships), each of which has defining properties. This is necessary to support AI traversal of the data. The semantics capture both connectivity rules (context) and definitions (meaning). AI systems are arising that can largely infer the connectivity rules from analysis of existing data combinations, but meaning must be provided, or at least overseen, by humans. (A minimal sketch of such a graph follows this list.)
The AI engine: not a unitary entity but a complex of programmatic elements. These include the interactive tool that receives the “prompts” (requests, usually in human-language form) and the large language model (LLM) that processes each prompt based on patterns learned from a very large sample of data, applying those patterns to data of interest (such as corporate data) and using probabilistic predictive algorithms to identify likely responses. Other elements include retrieval-augmented generation (RAG) and agents that condition the processing based on fixed constraints. Roughly speaking, the AI operation uses the semantic property graph to identify the data to analyze in response to the prompt and to put it together in a meaningful way. This means the AI engine must have full access to all the data of interest, and all the data of interest should have defining relationship structures in the property graph.
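To make the property graph idea concrete, here is a minimal sketch using the open-source networkx library; the entities, properties, and store names are illustrative assumptions, not a standard vocabulary.

```python
import networkx as nx

# Nodes describe data elements; "meaning" properties carry the business
# definitions an AI engine would consult, and "store" says where data lives.
g = nx.MultiDiGraph()
g.add_node("Customer", kind="entity", store="warehouse.sales.customers",
           meaning="A party that has purchased or may purchase products")
g.add_node("Order", kind="entity", store="orders_db.orders",
           meaning="A confirmed request by a customer to buy products")
g.add_node("Product", kind="entity", store="warehouse.sales.products",
           meaning="An item offered for sale")

# Edges carry the connectivity rules (context): which joins are meaningful.
g.add_edge("Customer", "Order", relation="places", join="customer_id")
g.add_edge("Order", "Product", relation="contains", join="product_id")

# An AI engine can traverse the graph to plan a query: which stores hold
# the data and which keys connect them.
path = nx.shortest_path(g, "Customer", "Product")
for a, b in zip(path, path[1:]):
    edge = list(g.get_edge_data(a, b).values())[0]
    print(f"{a} -[{edge['relation']} on {edge['join']}]-> {b}")
```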
The following figure illustrates the elements described above.
Most AI data configurations we see include most of the elements in this figure, but the one that enables a full AI data platform is the semantic model property graph. This graph contains nodes referencing database elements at both a structural level and a meaning level. Structural data indicates how data may be related and combined and is much more detailed than the simple formalisms of a relational schema: in it, one can see the combination of, for example, customers, products, sales records, and sales items as a business structure, whereas the schema sees only tables with rows, columns, unique keys, and foreign keys. Meaning semantics add business terms, so that the AI system can more precisely determine which database elements need be brought to bear in response to a prompt.
What the diagram indicates is that while operational applications contribute selected data to data lakes, and master data to the data warehouse, on an ongoing, usually scheduled basis (using ETL), only a fraction of the operational data is involved. With the semantic property graph present, however, the AI system may query the operational databases directly for data that is not present in the data lakehouse environment but is still necessary to satisfy the request. The order of operations in this case is as follows (a schematic sketch in code follows the list):
A prompt comes to the AI engine.
The AI engine consults the semantic metadata property graph.
Using the graph as a map of data, the AI engine queries the data lakehouse, which reconciles data from both the data lake and warehouse and returns an analytic query result.
The AI engine also issues queries using the graph against the operational databases, yielding a semantic operational search result.
The AI engine combines the results to form a response to the user prompt.
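To make the sequence concrete, here is a schematic, runnable sketch; every class, query, and result in it is a hypothetical stand-in for illustration, not any vendor’s API.

```python
class SemanticGraph:
    def plan(self, prompt):
        # A real engine would traverse the property graph; here the plan
        # is hard-coded for the sample prompt.
        return {"lakehouse_sql": "SELECT region, SUM(amount) FROM sales GROUP BY region",
                "operational_sql": "SELECT order_id FROM orders WHERE status = 'open'"}

class Lakehouse:
    def query(self, sql):
        return {"north": 120000, "south": 95000}       # canned analytic result

class OperationalDB:
    def query(self, sql):
        return [{"order_id": 9913, "status": "open"}]  # canned live result

def answer_prompt(prompt, graph, lakehouse, op_db):
    plan = graph.plan(prompt)                           # steps 1-2: consult the graph
    analytic = lakehouse.query(plan["lakehouse_sql"])   # step 3: lakehouse query
    operational = op_db.query(plan["operational_sql"])  # step 4: live operational query
    return {"prompt": prompt,                           # step 5: combine into a response
            "analytic": analytic,
            "operational": operational}

print(answer_prompt("How are sales trending, and which orders are still open?",
                    SemanticGraph(), Lakehouse(), OperationalDB()))
```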
In most cases involving this type of functionality, the analytic level is progressing quite nicely, but the operational aspect still needs considerable work.
Vendors and Products Leading the Fight
The following vendors and products are providing leadership in this area. This list is not comprehensive, and its order does not imply relative completeness.
| Vendor | Product | Web Page |
| --- | --- | --- |
| AWS | SageMaker | https://aws.amazon.com/sagemaker/lakehouse/ |
| Databricks | Data Intelligence Platform | https://www.databricks.com/product/data-lakehouse |
| Google | Data Cloud | https://cloud.google.com/data-cloud |
| IBM | watsonx.data | https://www.ibm.com/products/watsonx-data |
| InterSystems | IRIS | https://www.intersystems.com/products/intersystems-iris/ |
| Microsoft | Microsoft Fabric Lakehouse | https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-overview |
| Oracle | OCI Big Data | https://www.oracle.com/big-data/what-is-data-lakehouse/ |
| SAP | SAP Business Data Cloud | https://www.sap.com/products/data-cloud.html |
| Snowflake | Open Lakehouse | https://www.snowflake.com/en/product/use-cases/data-lakehouse/ |
| Starburst | Galaxy | https://www.starburst.io/starburst-galaxy/ |