Definition of a Data Pipeline

A data pipeline refers to a workflow that ingests multi-structured data, schemas, and other types of metadata from sources to targets and transforms that data for analytics. Ingestion entails one or more of the following tasks:

  • Extracting or capturing data from a source, such as one or more records from a database
  • Streaming data messages in memory between sources and targets, for example to enable real-time transformation, delivery, and/or analytics
  • Loading either batch data or incremental updates into a target such as a data lake
  • Appending data to a target by adding it to existing datasets
  • Merging data into a target by combining it with existing objects such as tables or files
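The ingestion tasks above can be illustrated with a minimal Python sketch. It is a toy example under stated assumptions, not a reference implementation: the functions extract, append, and merge, the "id" key, and the sample records are all hypothetical, and in-memory lists and dictionaries stand in for a real database and data lake.

```python
from typing import Iterable, Iterator

Record = dict  # one row, e.g. {"id": 1, "name": "Ada"} (hypothetical shape)


def extract(source: Iterable[Record]) -> Iterator[Record]:
    """Extract: yield records one at a time, the way a stream passes messages in memory."""
    for record in source:
        yield dict(record)  # copy so the source is never mutated


def append(target: list, records: Iterable[Record]) -> None:
    """Append: add records to an existing dataset without touching what is there."""
    target.extend(records)


def merge(target: dict, records: Iterable[Record]) -> None:
    """Merge: combine records with existing objects, keyed on "id" (an upsert)."""
    for record in records:
        target.setdefault(record["id"], {}).update(record)


# Hypothetical source table.
source_table = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Grace", "city": "New York"},
]

# Batch load: copy every source record into two targets.
event_log = []   # append-only target
customers = {}   # keyed target that supports merging
batch = list(extract(source_table))
append(event_log, batch)
merge(customers, batch)

# Incremental update: only the changed and new records arrive.
changes = [
    {"id": 2, "city": "Arlington"},                  # change to an existing record
    {"id": 3, "name": "Alan", "city": "Wilmslow"},   # new record
]
append(event_log, changes)
merge(customers, changes)
```

After the incremental update, event_log holds all four records in arrival order, while customers holds one merged record per id. That difference is the practical distinction between appending and merging.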

Transformation, meanwhile, can take place before or after the pipeline loads data to the target. It includes tasks such as the following:

  • Filtering data to identify and remove unneeded subsets such as columns, tables, or images, for example to protect personally identifiable information
  • Combining multi-sourced data, for example to add columns to a table or join tables for a query
  • Formatting data, for example by converting various tables to a single format such as Parquet
  • Structuring data, for example by applying a schema to organize tables and columns in a database
  • Cleansing data by removing duplicates, fixing errors, or taking other steps to improve data quality
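The transformation tasks above can be chained into a single pass over the data. The following Python sketch is one way to do that, under stated assumptions: the order rows, the customer lookup, the SCHEMA mapping, and every function name are hypothetical, and CSV from the standard library stands in for a columnar format such as Parquet, which would need a third-party library.

```python
import csv
import io

# Hypothetical inputs: raw order rows (all strings, one duplicate) and a customer lookup.
orders = [
    {"order_id": "1", "customer_id": "10", "amount": "19.99", "email": "ada@example.com"},
    {"order_id": "2", "customer_id": "11", "amount": "5.00", "email": "grace@example.com"},
    {"order_id": "2", "customer_id": "11", "amount": "5.00", "email": "grace@example.com"},
]
customers = {"10": "Ada", "11": "Grace"}

# Hypothetical target schema: column name mapped to the type it is cast to.
SCHEMA = {"order_id": int, "customer_id": int, "amount": float, "customer_name": str}


def cleanse(rows):
    """Cleanse: remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            yield row


def filter_pii(rows, pii_columns=("email",)):
    """Filter: drop columns that hold personally identifiable information."""
    for row in rows:
        yield {k: v for k, v in row.items() if k not in pii_columns}


def combine(rows, lookup):
    """Combine: join each order to the customer lookup, adding a column."""
    for row in rows:
        yield {**row, "customer_name": lookup.get(row["customer_id"], "")}


def structure(rows, schema):
    """Structure: keep only the schema's columns, in order, cast to its types."""
    for row in rows:
        yield {column: cast(row[column]) for column, cast in schema.items()}


def to_csv(rows, schema):
    """Format: write all rows in one format (CSV here, as a stand-in for Parquet)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(schema), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


transformed = list(structure(combine(filter_pii(cleanse(orders)), customers), SCHEMA))
csv_text = to_csv(transformed, SCHEMA)
```

Each step is a generator, so rows flow through the chain one at a time. The same functions could run before the load or against data already in the target, which matches the point that transformation can happen on either side of loading.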

Modern pipelines span on-premises, hybrid, cloud, and multi-cloud ecosystems that include various pipelines, languages, open-source projects, interfaces, tools, and now AI bots, as shown in the examples in this diagram.

What is a data pipeline?

Additional Resources