Data pipelines must be set up to move data into a centralized data store. A pipeline's primary objective is to extract, transform, and load data, a process known by the acronym ETL.
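The three ETL stages can be sketched in plain Python. This is a minimal illustration, not a production pipeline; the CSV payload, field names, and the list standing in for a warehouse are all hypothetical.

```python
import csv
import io

RAW_CSV = "order_id,amount\n1,19.99\n2,5.00\n"  # hypothetical raw source data

def extract(raw: str) -> list:
    """Extract: read raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list) -> list:
    """Transform: cast string fields into typed values."""
    return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])}
            for r in rows]

def load(rows: list, store: list) -> None:
    """Load: append transformed rows to a stand-in data store."""
    store.extend(rows)

warehouse = []  # stand-in for a real centralized store
load(transform(extract(RAW_CSV)), warehouse)
```

Each stage is a separate function so it can be tested, scheduled, and retried independently, which is the usual reason pipelines are decomposed this way.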
Additionally, data pipelines clean data before it is placed in the centralized database. Data cleaning includes removing anomalies, correcting errors, and performing data governance checks.
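A cleaning step might look like the following sketch. The validation rules (an age range check as the anomaly filter, whitespace and casing normalization as the error correction) are hypothetical examples, not rules from this series.

```python
raw_rows = [
    {"customer": "  alice ", "age": 34},
    {"customer": "BOB", "age": -5},      # anomalous value: impossible age
    {"customer": "carol", "age": 29},
]

def clean(rows):
    """Drop anomalous rows and normalize text fields."""
    cleaned = []
    for row in rows:
        if not 0 <= row["age"] <= 120:   # anomaly / governance check
            continue                     # reject the row entirely
        cleaned.append({
            "customer": row["customer"].strip().title(),  # correct formatting errors
            "age": row["age"],
        })
    return cleaned

cleaned_rows = clean(raw_rows)
```

Rejected rows would typically be logged or quarantined rather than silently dropped, so governance reviews can inspect them.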
Data pipelines also handle data transformations: converting data from one format or structure into another. Transformation is necessary because, for example, data returned by an API is typically semi-structured JSON; the pipeline is responsible for flattening it into a tabular format for analysis.
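Flattening nested JSON into rows can be sketched with the standard library alone. The payload shape and field names here are invented for illustration.

```python
import json

# Hypothetical API response with nested customer records
payload = json.loads("""
{"orders": [
  {"id": 1, "customer": {"name": "Alice"}, "total": 19.99},
  {"id": 2, "customer": {"name": "Bob"},   "total": 5.0}
]}
""")

def to_rows(doc):
    """Flatten nested order objects into flat, tabular row dicts."""
    return [
        {"id": o["id"],
         "customer_name": o["customer"]["name"],  # nested field promoted to a column
         "total": o["total"]}
        for o in doc["orders"]
    ]

rows = to_rows(payload)
```

Once the data is a list of flat dicts, it maps directly onto the columns of a warehouse table.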
Transformations also include appending derived columns to structured data. Suppose a table contains the mean Net Promoter Score, an integer from one to ten, for each product order by date. A transformation step can compute a rolling average of these scores and append the result as an additional column.
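That derived-column step can be sketched as follows. The daily scores and the three-day window are hypothetical; earlier days with fewer than three observations simply use a shorter window.

```python
# Hypothetical mean daily NPS scores (integers from one to ten), ordered by date
scores = [7, 8, 6, 9, 10, 8, 7]

def rolling_mean(values, window=3):
    """Trailing moving average; shorter window at the start of the series."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(round(sum(chunk) / len(chunk), 2))
    return out

# Append the rolling average as an additional column
table = [{"nps": s, "rolling_nps": r}
         for s, r in zip(scores, rolling_mean(scores))]
```

In a real warehouse this would more likely be a window function in SQL or a `rolling().mean()` call in a dataframe library, but the computation is the same.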
Finally, data pipelines are responsible for ensuring that the data extracted from the sources fits the data warehouse's schema. This process is referred to as data modeling: describing the structure, associations, and constraints of the available data.
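A schema conformance check before load can be sketched like this. The schema itself (field names and types) is a made-up example; real warehouses enforce this through table DDL and constraints.

```python
# Hypothetical target schema: field name -> required Python type
SCHEMA = {"order_id": int, "customer": str, "total": float}

def conforms(row, schema=SCHEMA):
    """True if the row has exactly the schema's fields, each with the right type."""
    return (row.keys() == schema.keys()
            and all(isinstance(row[k], t) for k, t in schema.items()))

good = {"order_id": 1, "customer": "Alice", "total": 19.99}
bad = {"order_id": "1", "customer": "Alice"}  # wrong type, missing field
```

Rows failing the check would be rejected before they ever reach the warehouse, keeping the central store consistent with the model.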
After the data pipelines have extracted and transformed data from the sources, they load it into centralized storage. For this organization, a cloud data warehouse will serve as the central data store.
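The load step can be illustrated with an in-memory SQLite database standing in for a cloud warehouse; the table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite as a stand-in for a cloud data warehouse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, total REAL)")

transformed_rows = [(1, 19.99), (2, 5.0)]  # output of earlier pipeline stages
conn.executemany("INSERT INTO orders VALUES (?, ?)", transformed_rows)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

A real pipeline would use the warehouse vendor's bulk-load API rather than row inserts, but the shape of the step is the same: transformed rows go into a schema-defined table.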
This post is part of a 23-part mini-series about implementing a cost-effective modern data infrastructure for a small organization. It is a small part of a whitepaper that will be released at the end of the series.