The Relevance of Data in the Age of AI
Carl Olofson
Independent Analyst | Recognized Thought Leader | Market Research
I have found that there are many who believe that having systems that can answer questions posed in human language with human language responses means that structured data is somehow passé. Nothing could be further from the truth. AI, driven by very large sets of text such as scanned Internet content, can only give approximate answers to questions because the source of the information is itself approximate. In fact, sometimes the patterns in the source text are so inconsistent that the answer a GenAI system produces can be unintentionally comical.
Formal settings such as business or science require precise and verifiably correct answers, which can only be achieved by basing those answers on facts carefully recorded and kept consistent and current. The underlying elements of such facts are data. In fact, extending AI to answer fact-based questions calls not for less well managed data, but for much more. So, going forward, if we want a fact-based anchor for AI responses, we must do the heavy lifting of building and maintaining current and consistent data collections, which calls for the use of databases. Notice I say databases in the plural, because no one database can effectively manage all the different kinds of data, and all the different kinds of data interaction required.
Databases
A database is managed by a database management system (DBMS). There is a wide variety of DBMS formats and architectures, each targeting a specific range of use cases. Although every DBMS differs from every other in some respects, they can and should be classified according to usage.
Nearly every database is optimized for either operational or analytical use, but usually not both. Some DBMSs are better for high-speed data processing where the input format is uniform, and others for widely varying types of data. Some are optimized for highly complex queries in support of strategic business analytics, and most are not. They also vary based on timeliness: some are fine for data actions that need to be done in the next few hours, or hour, or minute, whereas others deal with data issues by the microsecond. Some analytical databases are good for answering detailed questions based on known patterns; others are better at finding patterns.
The rest of this discussion focuses on classifying database technologies for evaluation.
Operational vs Analytical
An operational database accepts and returns data as part of operational business processes such as recording sales, marking inventory items for shipment, capturing payments, and so on. The emphasis here is on speed and efficiency of data storage and retrieval. Some operational DBMSs are optimized for data directly tied to specific application actions and maintain relatively simple data structures for that purpose. Document databases (databases that store data in the form of documents, usually encoded in JSON) and relational databases optimized for transaction throughput are generally preferred for this use case. Other transactional DBMSs are optimized for managing complex operational business data structures, often shared by a wide range of applications. These are usually relational, owing to that model’s emphasis on complex data retrieval and database-wide continuous data consistency. A relational database organizes data based on the relational data model of tables having rows with unique primary keys and related to rows in other tables via foreign key relationships. The most common language for accessing relational data is the Structured Query Language (SQL).
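The relational concepts described above (tables, rows, primary keys, foreign keys, SQL) can be sketched concretely. The following is a minimal illustration using Python's built-in SQLite engine; the table and column names are invented for the example, not taken from any particular product.

```python
import sqlite3

# In-memory SQLite database illustrating the relational model:
# tables of rows, unique primary keys, and a foreign key relating
# rows in one table to rows in another.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL NOT NULL
    )""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO sale VALUES (100, 1, 250.0), (101, 1, 99.5)")

# A typical operational query: follow the foreign key from sale
# back to customer and total the sales per customer.
rows = conn.execute("""
    SELECT c.name, SUM(s.amount)
    FROM sale s JOIN customer c ON s.customer_id = c.customer_id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('Acme Corp', 349.5)]
```

The `PRAGMA foreign_keys = ON` line is SQLite-specific; in most relational DBMSs, foreign key enforcement and database-wide consistency are on by default.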
An analytical database is measured not on its rate of throughput, but on the speed of query execution. For some, high query performance against a very large database is valued. For others, high query performance against a very complex data structure is best. DBMSs vary in their ability to address these two somewhat different requirements, and some are aimed at supporting both. Most analytical databases use either the full form of relational data management, or a simplified form aimed at optimizing load or query speed. Others are simpler forms such as key-value stores, which tend to be faster but require more work on the part of the developer. SQL is the most favored language for analytical data query support, but Spark and Python are also commonly used.
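The trade-off noted above, that key-value stores tend to be faster but push more work onto the developer, can be shown side by side. This is a minimal sketch with invented data: a plain Python dict stands in for a key-value store, contrasted with the same aggregation expressed declaratively in SQL.

```python
import sqlite3

# Illustrative data: (region, sale amount) pairs.
sales = [("east", 100.0), ("west", 50.0), ("east", 25.0)]

# Key-value style: the store only maps keys to values, so the
# developer writes the aggregation logic by hand.
kv_store = {}
for region, amount in sales:
    kv_store[region] = kv_store.get(region, 0.0) + amount
print(kv_store)  # {'east': 125.0, 'west': 50.0}

# Relational style: the same aggregation is declared in SQL and
# the DBMS decides how to execute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sale VALUES (?, ?)", sales)
sql_result = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sale GROUP BY region"))
assert sql_result == kv_store
```

The key-value path is a single hash lookup per operation, which is why such stores are fast; the SQL path trades some of that speed for a query optimizer that does the developer's work.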
Real Time
An orthogonal requirement is that of processing data in “real time”. In some cases this means actual integration of data storage and retrieval with the flow of the application in which it is embedded; in others it simply means very fast, that is, sub-microsecond performance. Database technologies classified as “real time” are often applied to both operational and analytical use cases. The design compromises that deliver that speed mean that such systems must be used in addition to, rather than instead of, more conventional operational or analytical database systems.
AI
Artificial intelligence (AI) imposes a new set of requirements on any data that is to be included in AI responses to data-based prompts. Generative AI (GenAI) has caused an explosion of interest in AI generally. GenAI, however, is driven by the recombination of text components gleaned from very large collections of data with constructs based on patterns of coincident occurrence emerging from the “training” of large language models (LLMs). This approach works because human language has formal structural patterns that reveal the meaning of words in context; those patterns are called semantics.
Currently, database systems don’t integrate structured data into GenAI processes. They enable the use of GenAI to generate queries from textual requests (prompts), to design databases, and to support database application coding. In other words, database technology allows GenAI to be used to find and manipulate data, but not to incorporate that data into a generalized AI operation. For that to happen, the databases in question must be carefully and completely documented using semantic metadata.
Structured data in databases carries no inherent semantics, so for it to participate in broader data-driven AI analysis, those semantics must be provided explicitly, revealing the meaning of the data both typologically and in its context at the instance level. This must be done through metadata. The best platform for capturing such metadata is a property graph.
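A property graph of the kind just described can be sketched in a few lines: nodes carry properties, and labeled edges relate them. The node names, edge labels, and properties below are illustrative assumptions, not a standard metadata vocabulary.

```python
# A minimal sketch of semantic metadata as a property graph.
# Nodes: dict of node id -> properties. Edges: (source, label, target).
nodes = {
    "table:sale":         {"kind": "Table", "meaning": "a completed customer purchase"},
    "column:sale.amount": {"kind": "Column", "meaning": "purchase total", "unit": "USD"},
    "concept:Revenue":    {"kind": "BusinessConcept"},
}
edges = [
    ("column:sale.amount", "BELONGS_TO", "table:sale"),
    ("column:sale.amount", "MEASURES",   "concept:Revenue"),
]

def related(node, label):
    """Follow all edges with the given label out of a node."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]

# An AI layer could traverse this to learn that the sale.amount
# column measures the business concept of Revenue, in US dollars.
print(related("column:sale.amount", "MEASURES"))  # ['concept:Revenue']
```

In practice this role would be played by a graph DBMS (several are listed later in this article), where such typological and instance-level metadata can be queried alongside the data it describes.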
When considering AI technologies for databases, it is important to keep in mind the fact that this space is evolving rapidly, and infrastructure, techniques, and services in AI are in a state of flux. Most DBMS products include AI features, and it may be best to start with those, because they represent relatively low risk.
Data Platforms and Lakehouses
Over the past couple of decades, data science has emerged as a way to glean insight from data that is not organized for analytic purposes, and where meaning is often inferred from patterns of data occurrences. Data lakes were created for this purpose, initially using Hadoop, but more recently by simply collecting subject-related data in open table format datasets, the most common format today being Apache Iceberg.
These data lakes can serve as an open space for collecting data and abstracting insight, but a fuller benefit from this effort may be realized when the data in the lakes is reliably related to formal enterprise data in an analytic database such as a data warehouse. The mechanism that provides blended access across the two is called a data lakehouse.
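The idea of blended access can be sketched in miniature. In this illustration, a relational table plays the role of the warehouse and a raw CSV string stands in for lake data; all names are invented, and a real lakehouse would read open table formats such as Apache Iceberg rather than bare CSV.

```python
import csv
import io
import sqlite3

# "Warehouse" side: formal, curated enterprise data in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO product VALUES (1, 'widget'), (2, 'gadget')")

# "Lake" side: loosely structured data arriving as a raw file.
lake_csv = "product_id,clicks\n1,42\n2,7\n"
conn.execute("CREATE TABLE lake_clicks (product_id INTEGER, clicks INTEGER)")
for row in csv.DictReader(io.StringIO(lake_csv)):
    conn.execute("INSERT INTO lake_clicks VALUES (?, ?)",
                 (int(row["product_id"]), int(row["clicks"])))

# Blended access: one query relates lake data to warehouse data.
rows = conn.execute("""
    SELECT p.name, l.clicks
    FROM product p JOIN lake_clicks l USING (product_id)
    ORDER BY p.product_id
""").fetchall()
print(rows)  # [('widget', 42), ('gadget', 7)]
```

The value of the lakehouse is exactly this join: lake data becomes reliably relatable to formal enterprise data, rather than sitting in an unconnected pool.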
In order for the lakehouse to be useful over time, it needs a management layer called a data platform. This platform includes a data control plane sufficient to manage the data altogether as well as technology that maintains the meaning metadata and manages access across databases.
Cloud, Cloudless, or Partly Cloudy
It is important to think strategically about deployment in the public cloud. In general, it is advisable to colocate operational databases that are closely associated in workflows, to avoid the overhead of data shipment and the resulting processing delays. Analytical databases are a safer choice for the cloud because they are updated periodically rather than continuously, and because they are used mostly for queries that likewise occur only from time to time, so they yield the greatest savings over an on-premises deployment: on premises you pay a base cost for the hardware, whereas in the cloud you pay only for what you use.
For operational databases, one must have a strategy that includes a schedule for migration. This is especially important if there are in-house developed applications that must be rewritten for cloud deployment. Many enterprises are taking a step-wise approach to such migration, using a hybrid cloud (applications partly integrated with the on-premises environment and partly with cloud-based components) for safer, gradual migration.
It is also important to consider whether the DBMS technology used should be offered by the cloud provider or should be cloud independent. The former may offer benefits in terms of integration with other cloud components and platform-based discounts, but the latter enables movement from one cloud to another if need be, and also enables a team of database professionals to use the same technology on any cloud, reducing the need for training on multiple cloud database platforms.
For Example…
The following are examples of options for the various technologies mentioned above. This is by no means an exhaustive list.
Cloud Provider Products
AWS: Operational relational DBMS – Aurora, operational document DBMS – DocumentDB or DynamoDB, analytical relational DBMS – Redshift, high speed DBMS – ElastiCache or MemoryDB, graph DBMS: Neptune, analytic data platform – Amazon SageMaker
Microsoft Azure: Operational/analytical relational DBMS – Azure SQL Database, high speed DBMS – Azure Cache for Redis, operational document DBMS – Cosmos DB (can also support graphs), analytic data platform – Azure HDInsight
Google Cloud Platform (GCP): Operational relational DBMS – AlloyDB and geoscalable RDBMS – Spanner, high speed DBMS – MemoryStore, data lake – Dataproc, analytic data platform – BigQuery
Cloud Independent by Product Type
Operational Relational DBMS: Microsoft SQL Server, Oracle Database, IBM Db2, Oracle MySQL, SAP HANA, SingleStore, MariaDB, PerconaDB, VoltDB, PostgreSQL (community open source), NuoDB (geodistributed), CockroachDB (geodistributed)
Analytical Relational DBMS: Oracle Database, SAP HANA, Teradata, Snowflake, Vertica, IBM Db2
High speed DBMS: Redis, GridGain, Hazelcast, Terracotta, Riak, Influx
Document DBMS: MongoDB, Couchbase, Progress MarkLogic
Graph DBMS: Neo4j, TigerGraph, BlazeGraph
Wide Column Store: DataStax Enterprise (built on Apache Cassandra)
Data lake: Cloudera Data Platform, Databricks, HPE Ezmeral
Analytic Data Platform: Databricks, Snowflake, Teradata Vantage, Starburst
For More Information
To learn more about database technology, you might consult this tutorial by Simplilearn. Forbes offers this explanation of a data lakehouse.