For data engineers, one of the most common tasks is ingesting data from various sources into a data warehouse or lake for analysis. DataHub has emerged as a popular open source metadata platform that helps organize and catalog data assets across an organization.
Ingesting data from DataHub into your own systems can provide tremendous value by giving you access to its centralized metadata store. However, it does require some effort to set up and configure the data pipelines. In this comprehensive guide, we’ll walk through the entire process step-by-step.
Prerequisites
Before diving into the specifics, there are a few prerequisites you’ll need in place:
- Access to a DataHub instance, either self-hosted or using DataHub Cloud
- Basic familiarity with DataHub’s architecture and metadata model
- A destination to ingest the data into, such as a data warehouse, lake, or other database
- Familiarity with the tools you’ll use to move the data, such as Airflow, dbt, or Spark
Assuming those elements are in place, you’re ready to start setting up the data ingestion pipelines.
Key DataHub APIs
DataHub provides several APIs that allow you to programmatically interact with its metadata. The two core APIs we’ll focus on for ingestion are:
- Rest.li: The REST API for reading metadata
- MAE Consumer: An API for consuming Metadata Audit Events (MAEs), which are emitted whenever metadata changes
Rest.li provides full read access to all metadata within DataHub. Using standard REST patterns, you can request information on entities like datasets, dashboards, data flows, and more. This will be our primary mechanism for extracting metadata.
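To make this concrete, here is a minimal Python sketch of fetching a single dataset entity over Rest.li. The GMS address, the /entities/{urn} path, and the sample URN are assumptions for illustration; check the Rest.li docs for your DataHub version for the exact endpoints.

```python
import urllib.parse

import requests

GMS_URL = "http://localhost:8080"  # assumed local GMS endpoint
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"

def fetch_entity(urn: str) -> dict:
    """Fetch a single entity's metadata from the Rest.li API."""
    # URNs contain characters that must be percent-encoded in the path.
    encoded = urllib.parse.quote(urn, safe="")
    resp = requests.get(f"{GMS_URL}/entities/{encoded}", timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    entity = fetch_entity(DATASET_URN)
    print(entity.keys())
```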
MAE Consumer allows you to subscribe to a stream of metadata audit events through Kafka. As metadata is created, updated, or deleted in DataHub, an event is pushed out to downstream consumers. This is helpful for keeping a replica database in sync.
General Ingestion Approach
The general approach we’ll take for ingesting DataHub metadata is:
- Use Rest.li to perform an initial full extract of metadata into the destination
- Incrementally pull updated metadata on a schedule using Rest.li
- Optional: Use MAE Consumer to get a stream of change events for more real-time synchronization
This provides a robust and fault-tolerant pipeline for maintaining an up-to-date copy of DataHub metadata in your own systems. The bulk of the work lies in the initial extraction and transformation of the metadata into your destination schema.
Initial Extraction Steps
Now let’s walk through how to perform the initial metadata extraction:
- Model destination schema: Based on your use cases, model out the schema in your destination database for storing the metadata. Identify which entities and attributes you want to mirror from DataHub.
- Explore Rest.li API: Using an API client like Postman, start querying different Rest.li endpoints and payloads to understand the shape of the metadata.
- Develop extraction logic: Write code that hits the Rest.li API, extracts the metadata payload, transforms it, and loads it into your database. An orchestration framework like Airflow is well suited for this.
- Incremental extraction: Update your pipeline to extract only updated metadata on each run. The Rest.li API supports filtering on a lastUpdated timestamp (a minimal sketch follows this list).
- Error handling: Build in error handling, alerting, and retry mechanisms to make the pipeline robust.
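Here is a rough sketch of what an incremental run could look like in Python. The endpoint path, the lastUpdated query parameter, the response shape, and the file-based watermark are all illustrative assumptions rather than the exact API contract.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import requests

GMS_URL = "http://localhost:8080"            # assumed GMS endpoint
STATE_FILE = Path("last_run_watermark.txt")  # hypothetical watermark store

def load_watermark() -> int:
    """Return the last successful run's watermark (epoch millis), or 0."""
    return int(STATE_FILE.read_text()) if STATE_FILE.exists() else 0

def extract_updated_datasets(since_millis: int) -> list[dict]:
    """Pull dataset metadata updated after the watermark.

    The 'lastUpdated' parameter shown here is illustrative; consult your
    DataHub version's Rest.li docs for the exact query syntax.
    """
    resp = requests.get(
        f"{GMS_URL}/entities",
        params={"entity": "dataset", "lastUpdated": since_millis},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("entities", [])

def run() -> None:
    watermark = load_watermark()
    rows = extract_updated_datasets(watermark)
    # ... transform and load `rows` into the destination here ...
    new_watermark = int(datetime.now(timezone.utc).timestamp() * 1000)
    STATE_FILE.write_text(str(new_watermark))

if __name__ == "__main__":
    run()
```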
The details will vary based on your technology stack and infrastructure, but those are the key tasks for the initial extraction. Next, let’s explore some examples.
Ingestion Examples
Let’s walk through a few examples of what ingesting DataHub metadata might look like using common data tools:
Airflow + Postgres
Using Apache Airflow for orchestration, we can use the Rest.li API to extract metadata and store it in Postgres. Some key steps would include:
- Define Postgres tables such as datasets, columns, and dashboards to hold the metadata
- Within Airflow, use the Rest.li API to pull metadata for each entity as needed
- Parse the JSON response, transform it, and insert into Postgres
- On subsequent runs, use lastUpdated filters and only ingest changed rows
We can put this logic within Airflow PythonOperators to build out the full pipeline. Airflow provides visibility, error handling, and scheduling capabilities to make this production-grade.
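A minimal sketch of such a DAG might look like the following, assuming Airflow 2.4+, a reachable GMS instance, a pre-created datasets table keyed on urn, and an illustrative response shape:

```python
import json
from datetime import datetime

import psycopg2
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

GMS_URL = "http://datahub-gms:8080"  # assumed GMS address on the cluster network
PG_DSN = "dbname=metadata user=etl host=warehouse-db"  # hypothetical connection string

def extract_and_load_datasets(**_context) -> None:
    """Pull dataset metadata from the Rest.li API and upsert it into Postgres."""
    resp = requests.get(f"{GMS_URL}/entities", params={"entity": "dataset"}, timeout=60)
    resp.raise_for_status()
    entities = resp.json().get("entities", [])  # illustrative response shape

    with psycopg2.connect(PG_DSN) as conn, conn.cursor() as cur:
        for entity in entities:
            cur.execute(
                """
                INSERT INTO datasets (urn, payload, loaded_at)
                VALUES (%s, %s, now())
                ON CONFLICT (urn) DO UPDATE
                  SET payload = EXCLUDED.payload, loaded_at = now()
                """,
                (entity["urn"], json.dumps(entity)),
            )

with DAG(
    dag_id="datahub_metadata_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_and_load_datasets",
        python_callable=extract_and_load_datasets,
    )
```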
Spark + Parquet
For pipelines using Spark and a data lakehouse architecture, we can extract metadata into Parquet:
- Design Parquet schemas for entities like datasets, charts, and dashboards
- Develop Spark jobs that call the Rest.li API to extract metadata
- Parse the JSON response and transform into Spark DataFrames
- Write DataFrames out to Parquet on S3/ADLS/HDFS
Spark allows us to parallelize extractions and leverage data lake storage for scalability. The structured nature of Parquet lends itself well to metadata storage and querying.
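As a rough sketch, assuming the same illustrative /entities endpoint and response shape, a PySpark job could look like this:

```python
import requests
from pyspark.sql import Row, SparkSession

GMS_URL = "http://datahub-gms:8080"                    # assumed GMS address
OUTPUT_PATH = "s3a://metadata-lake/datahub/datasets/"  # hypothetical lake path

spark = SparkSession.builder.appName("datahub-metadata-extract").getOrCreate()

# Pull dataset metadata from the Rest.li API on the driver. For very large
# DataHub instances you would page through results and parallelize the calls.
resp = requests.get(f"{GMS_URL}/entities", params={"entity": "dataset"}, timeout=60)
resp.raise_for_status()
entities = resp.json().get("entities", [])  # illustrative response shape

# Flatten each entity into a row; the field names here are illustrative.
rows = [
    Row(urn=e.get("urn"), name=e.get("name"), platform=e.get("platform"))
    for e in entities
]

if rows:
    df = spark.createDataFrame(rows)
    # Write the flattened metadata out as Parquet on the data lake.
    df.write.mode("overwrite").parquet(OUTPUT_PATH)
```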
Singer Taps
For pipelines using Singer for ELT, we could build a Singer Tap that extracts DataHub metadata:
- Follow the Singer spec for taps that connect to APIs
- Develop tap to call Rest.li API and parse response
- Output metadata Singer records for consumption
- A corresponding Target would write the records to Postgres/Redshift/Snowflake
The Singer ecosystem works nicely for extracting metadata and loading it into data warehouses. Existing Singer targets could also help sync the metadata into downstream tools.
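A bare-bones tap built on the singer-python library might look like the sketch below; the stream name, schema fields, and Rest.li call are illustrative assumptions:

```python
import requests
import singer

GMS_URL = "http://localhost:8080"  # assumed GMS endpoint

# Minimal schema for a 'datasets' stream; extend with the fields you need.
SCHEMA = {
    "properties": {
        "urn": {"type": "string"},
        "name": {"type": ["string", "null"]},
    }
}

def main() -> None:
    singer.write_schema("datasets", SCHEMA, key_properties=["urn"])

    resp = requests.get(f"{GMS_URL}/entities", params={"entity": "dataset"}, timeout=60)
    resp.raise_for_status()

    for entity in resp.json().get("entities", []):
        singer.write_record(
            "datasets",
            {"urn": entity.get("urn"), "name": entity.get("name")},
        )

    # Emit state so the next run can pick up incrementally.
    singer.write_state({"datasets": {"last_run": "replace-with-watermark"}})

if __name__ == "__main__":
    main()
```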
MAE Consumer for Change Events
In addition to Rest.li bulk extraction, we can use the MAE Consumer API to subscribe to a stream of metadata change events from DataHub. This powers more real-time synchronization of metadata by pushing updates rather than polling.
Some key steps include:
- Confirm the Kafka topics for MAE events are enabled in your DataHub deployment
- Implement MAE consumer that reads from stream
- Deserialize each event (Avro-encoded by default), identify the entity, and determine what changed
- Use change events to update destination store
The MAE stream integrates nicely with stream processors such as Spark Structured Streaming or Flink, or with a lightweight standalone consumer orchestrated by a tool like Airflow. The change stream can then efficiently update downstream systems.
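For illustration, here is a minimal standalone consumer sketch using confluent-kafka’s Avro support. The broker and schema registry addresses, and the MetadataAuditEvent_v4 topic name, are assumptions; confirm them against your DataHub deployment’s configuration.

```python
from confluent_kafka.avro import AvroConsumer

# Connection details are assumptions; point these at your own deployment.
consumer = AvroConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "datahub-metadata-sync",
    "schema.registry.url": "http://localhost:8081",
    "auto.offset.reset": "earliest",
})

# The topic name varies by DataHub version; MetadataAuditEvent_v4 is used by
# older releases, so confirm against your deployment's configuration.
consumer.subscribe(["MetadataAuditEvent_v4"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        event = msg.value()  # Avro-decoded into a Python dict
        # Compare the old and new snapshots to work out what changed,
        # then apply the delta to the destination store.
        print(event.get("newSnapshot"))
finally:
    consumer.close()
```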
Destination Options
In terms of destinations for the DataHub metadata, popular options include:
- Relational database: Postgres, MySQL, SQL Server to power dashboards and applications.
- Data warehouse: Redshift, BigQuery, Snowflake for analytics use cases.
- Lakehouse: Delta Lake, Hudi for large-scale analytics.
- Search index: Elasticsearch to enable metadata search and discovery.
- Graph database: Neo4j, JanusGraph for graph traversal and analytics.
The destination depends on your specific infrastructure and use cases for utilizing the metadata. A graph database can be especially useful for powering lineage applications.
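As a small illustration, a sketch of loading lineage edges into Neo4j with the official Python driver (5.x) might look like this; the node label, relationship type, and example URNs are made up for the example:

```python
from neo4j import GraphDatabase

# Connection details and node/relationship names are illustrative.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_lineage_edge(tx, upstream_urn: str, downstream_urn: str) -> None:
    """Create dataset nodes and a DOWNSTREAM_OF edge between them."""
    tx.run(
        """
        MERGE (u:Dataset {urn: $up})
        MERGE (d:Dataset {urn: $down})
        MERGE (d)-[:DOWNSTREAM_OF]->(u)
        """,
        up=upstream_urn,
        down=downstream_urn,
    )

with driver.session() as session:
    session.execute_write(
        load_lineage_edge,
        "urn:li:dataset:(urn:li:dataPlatform:hive,raw.events,PROD)",
        "urn:li:dataset:(urn:li:dataPlatform:hive,analytics.daily_events,PROD)",
    )
driver.close()
```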
Use Cases
What are some example use cases enabled by ingesting DataHub metadata?
Centralized Metadata Management
Ingesting DataHub metadata into a database or warehouse combines it with your existing metadata, enabling a single unified metadata layer. This improves discoverability, quality, and trust.
Self-Service Access
Syncing metadata to a data warehouse or lakehouse makes it readily available for analysts and engineers. They can query it on-demand for self-service insights.
Granular Lineage
Combining DataHub metadata with your own lineage sources provides end-to-end granular lineage visualizations powered by graph databases.
Compliance
Ingesting comprehensive DataHub metadata provides the raw material for supporting standards like CISQ and OpenLineage and for meeting regulatory compliance requirements.
AI/ML Training
Rich metadata can be used to train AI models for auto-tagging, classification, entity resolution, and other techniques.
As you can see, ingesting DataHub metadata opens up many possibilities for driving value across data domains. The flexibility of the REST API allows integrating it into any modern data stack.
Key Considerations
When implementing DataHub ingestion pipelines, keep these tips in mind:
- Plan schemas and extraction logic upfront for efficiency.
- Implement incremental extractions to minimize load volumes.
- Scale out parallelism for large DataHub instances.
- Handle errors and retries to make the pipeline robust.
- Use MAE Consumer for low-latency change events.
- Let downstream use cases drive what metadata you ingest and how.
- Document architecture and data flow thoroughly.
Conclusion
In closing, ingesting metadata from DataHub can enable extremely valuable use cases by harnessing its centralized knowledge base. Combining flexibility with simplicity, the Rest.li and MAE APIs make integration straightforward for any modern stack. With proper planning and implementation, you can build robust pipelines for powering metadata-driven approaches in your organization. The benefits for data discovery, compliance, ML applications, and more are immense. We hope this guide provided some ideas and starting points as you embark on your own DataHub integration journey.