A data lake is a system or repository that stores data in its raw format along with transformed, trusted data sets, and provides both programmatic and SQL-based access to this data for diverse analytics tasks such as data exploration, interactive analytics, and machine learning. The data stored in a data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).
A key challenge in building a data lake is avoiding lock-in to proprietary formats or systems. Lock-in restricts the ability to move data in and out for other uses or to process it with other tools, and can tie a data lake to a single cloud environment. That’s why businesses should strive to build open data lakes, where data is stored in an open format and accessed through open, standards-based interfaces. Adherence to an open philosophy should permeate every aspect of the system, including data storage, data management, data processing, operations, data access, governance, and security.
An open format is one based on an underlying open standard, developed and shared through a public, community-driven process without vendor-specific proprietary extensions. For example, an open data format is a platform-independent, machine-readable data format, such as ORC or Parquet, whose specification is published to the community, such that any organization can create tools and applications to read data in the format.
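The portability this buys can be shown with a short round trip. The sketch below uses Python's standard library and line-delimited JSON, one of the open, semi-structured formats listed above, as a stand-in for Parquet or ORC (which would need a third-party reader such as pyarrow); the point is that the writer and reader share nothing but the published format.

```python
import json
import tempfile
from pathlib import Path

# "Producer": one tool writes records in an open, line-delimited JSON format.
def write_records(path: Path, records: list[dict]) -> None:
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# "Consumer": any other tool can read the data back knowing only the
# published format specification -- no proprietary reader required.
def read_records(path: Path) -> list[dict]:
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]

path = Path(tempfile.mkdtemp()) / "events.jsonl"
write_records(path, [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 7}])
print(read_records(path))  # the round trip preserves the records exactly
```

The same property holds for Parquet and ORC: because the specifications are public, an engine that wrote the files is never the only engine that can read them.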
A typical data lake has the following capabilities:
- Data ingestion and storage
- Data processing and support for continuous data engineering
- Data access and consumption
- Data governance, including discoverability, security, and compliance
- Infrastructure and operations
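As a concrete sketch of the first capability, the snippet below lands raw events in a date-partitioned directory layout. The `raw/` zone name and the `dt=YYYY-MM-DD` partition key are illustrative conventions, not part of any standard, and a local temporary directory stands in for object storage.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from tempfile import mkdtemp

LAKE_ROOT = Path(mkdtemp())  # stand-in for object storage such as a cloud bucket

def ingest_raw(source: str, payload: dict) -> Path:
    """Land one raw record under raw/<source>/dt=YYYY-MM-DD/, a common
    partitioning convention that downstream engines can prune on."""
    dt = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    partition = LAKE_ROOT / "raw" / source / f"dt={dt}"
    partition.mkdir(parents=True, exist_ok=True)
    out = partition / "part-0000.json"
    # Append as line-delimited JSON so the raw zone stays in an open format.
    with out.open("a", encoding="utf-8") as f:
        f.write(json.dumps(payload) + "\n")
    return out

path = ingest_raw("clickstream", {"user": "a", "page": "/home"})
print(path.relative_to(LAKE_ROOT))  # raw/clickstream/dt=.../part-0000.json
```

Keeping the raw zone immutable and partitioned this way leaves the transformed, trusted data sets to separate downstream processing, which is where the continuous data engineering capability comes in.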