Julius Hetzel

Unified Data Abstraction Layer

Challenge

Planblue's data pipeline was divided into multiple layers, each providing a specific maturity of processed data. While each layer consisted of multiple data processors, we wanted a single data access point per layer for the results of each data processor. As the number of layers grew but the core logic of data ingestion and access stayed the same, we wanted a unified abstraction that could simply be configured and spun up for each layer.

My Contribution

I designed and implemented a Python framework that let the team register new data processors and their data products through configuration alone. The framework provided an interface to start data ingestion whenever data processing finished. It also dynamically created API routes to access the data products, with configurable filtering and custom routes where needed.
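A minimal sketch of what configuration-only registration could look like. All class, field, and product names here are illustrative assumptions, not the framework's actual API; in the real framework, API routes and ingestion hooks would be derived from this configuration at startup.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DataProduct:
    name: str                   # product / route name (illustrative)
    filter_fields: tuple = ()   # fields exposed as query filters

@dataclass
class DataProcessor:
    name: str
    products: list = field(default_factory=list)

class Registry:
    """Holds processor/product configuration; routes are derived from it."""

    def __init__(self):
        self._processors = {}

    def register(self, processor: DataProcessor) -> None:
        self._processors[processor.name] = processor

    def routes(self) -> list:
        # One read route per registered data product.
        return [
            f"/{proc.name}/{prod.name}"
            for proc in self._processors.values()
            for prod in proc.products
        ]

registry = Registry()
registry.register(
    DataProcessor(
        name="benthic_classifier",  # hypothetical processor name
        products=[DataProduct("coverage", filter_fields=("dive_id",))],
    )
)
print(registry.routes())  # ['/benthic_classifier/coverage']
```

Adding a new data access point then amounts to one more `register(...)` call (or its declarative equivalent) rather than hand-writing routes.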

Key features:

  • Data processor and data product registration
  • Dynamic API route creation with custom filtering
  • Data ingestion interface
  • Ingestion of data processor metadata
    • e.g. storing Airflow run IDs, CloudWatch log paths…
  • Deployment template for AWS Lambda
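The ingestion interface and metadata features above can be sketched as follows. This is a simplified stand-in, assuming an in-memory list in place of the MongoDB collection; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class IngestionMetadata:
    """Processor run metadata stored alongside the data products."""
    airflow_run_id: str
    cloudwatch_log_path: str

class IngestionClient:
    """Hypothetical client a data processor calls once processing finishes."""

    def __init__(self):
        self.store = []  # stand-in for a MongoDB collection

    def ingest(self, product: str, records: list, meta: IngestionMetadata) -> int:
        # Persist the records together with the run metadata, so each data
        # product can be traced back to the pipeline run that produced it.
        self.store.append({"product": product, "records": records, "meta": meta})
        return len(records)

client = IngestionClient()
count = client.ingest(
    "coverage",
    [{"dive_id": "d42", "coverage_pct": 63.5}],
    IngestionMetadata(
        airflow_run_id="manual__2023-05-01T00:00:00",  # illustrative value
        cloudwatch_log_path="/aws/lambda/ingest",      # illustrative value
    ),
)
print(count)  # 1
```

Storing the Airflow run ID and CloudWatch log path with each ingestion makes every data product traceable to the run and logs that produced it.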

Impact

With the framework in place, we could spin up new data access points within minutes instead of hours, and register new data processors and their data products through configuration alone. It also strengthened consistency in API design and data ingestion across layers.

python · airflow · aws lambda · mongodb · fastapi · docker · rest api