For data engineers, one of the most common tasks is ingesting data from various sources into a data warehouse or lake for analysis. DataHub has emerged as a popular open source metadata platform that helps organize and catalog data assets across an organization.
Ingesting data from DataHub into your own systems can provide tremendous value by giving you access to its centralized metadata store. However, it does require some effort to set up and configure the data pipelines. In this comprehensive guide, we’ll walk through the entire process step-by-step.
Prerequisites
Before diving into the specifics, there are a few prerequisites you’ll need in place:
- Access to a DataHub instance, either self-hosted or using DataHub Cloud
- Basic familiarity with DataHub’s architecture and metadata model
- A destination to ingest the data into, such as a data warehouse, lake, or other database
- Familiarity with the tools you’ll use to move the data, such as Airflow, dbt, or Spark
Assuming those elements are in place, you’re ready to start setting up the data ingestion pipelines.
Key DataHub APIs
DataHub provides several APIs that allow you to programmatically interact with its metadata. The two core APIs we’ll focus on for ingestion are:
- Rest.li: The REST API for reading metadata
- MAE Consumer: An API for consuming Metadata Audit Events (MAEs), which are emitted whenever metadata changes
Rest.li provides full read access to all metadata within DataHub. Using standard REST patterns, you can request information on entities like datasets, dashboards, data flows, and more. This will be our primary mechanism for extracting metadata.
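To make this concrete, here is a minimal Python sketch of fetching a single dataset entity over Rest.li. The GMS address, the /entities/{urn} path, and the sample URN are assumptions for illustration; check the Rest.li docs for your DataHub version for the exact endpoints.

```python
import urllib.parse

import requests

GMS_URL = "http://localhost:8080"  # assumed local GMS endpoint
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"

def fetch_entity(urn: str) -> dict:
    """Fetch a single entity's metadata from the Rest.li API."""
    # URNs contain characters that must be percent-encoded in the path.
    encoded = urllib.parse.quote(urn, safe="")
    resp = requests.get(f"{GMS_URL}/entities/{encoded}", timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    entity = fetch_entity(DATASET_URN)
    print(entity.keys())
```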
MAE Consumer allows you to subscribe to a stream of metadata audit events through Kafka. As metadata is created, updated, or deleted in DataHub, an event is pushed out to downstream consumers. This is helpful for keeping a replica database in sync.
General Ingestion Approach
The general approach we’ll take for ingesting DataHub metadata is:
- Use Rest.li to perform an initial full extract of metadata into the destination
- Incrementally pull updated metadata on a schedule using Rest.li
- Optional: Use MAE Consumer to get a stream of change events for more real-time synchronization
This provides a robust and fault-tolerant pipeline for maintaining an up-to-date copy of DataHub metadata in your own systems. The bulk of the work lies in the initial extraction and transformation of the metadata into your destination schema.
Initial Extraction Steps
Now let’s walk through how to perform the initial metadata extraction:
- Model destination schema: Based on your use cases, model out the schema in your destination database for storing the metadata. Identify which entities and attributes you want to mirror from DataHub.
- Explore Rest.li API: Using an API client like Postman, start querying different Rest.li endpoints and payloads to understand the shape of the metadata.
- Develop extraction logic: Write code that hits the Rest.li API, extracts the metadata payload, transforms it, and loads it into your database. An orchestration framework like Airflow is well suited for this.
- Incremental extraction: Update your pipeline to extract only updated metadata on each run. The Rest.li API supports filtering on a lastUpdated timestamp (a minimal sketch follows this list).
- Error handling: Build in error handling, alerting, and retry mechanisms to make the pipeline robust.
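Here is a rough sketch of what an incremental run could look like in Python. The endpoint path, the lastUpdated query parameter, the response shape, and the file-based watermark are all illustrative assumptions rather than the exact API contract.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import requests

GMS_URL = "http://localhost:8080"            # assumed GMS endpoint
STATE_FILE = Path("last_run_watermark.txt")  # hypothetical watermark store

def load_watermark() -> int:
    """Return the last successful run's watermark (epoch millis), or 0."""
    return int(STATE_FILE.read_text()) if STATE_FILE.exists() else 0

def extract_updated_datasets(since_millis: int) -> list[dict]:
    """Pull dataset metadata updated after the watermark.

    The 'lastUpdated' parameter shown here is illustrative; consult your
    DataHub version's Rest.li docs for the exact query syntax.
    """
    resp = requests.get(
        f"{GMS_URL}/entities",
        params={"entity": "dataset", "lastUpdated": since_millis},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("entities", [])

def run() -> None:
    watermark = load_watermark()
    rows = extract_updated_datasets(watermark)
    # ... transform and load `rows` into the destination here ...
    new_watermark = int(datetime.now(timezone.utc).timestamp() * 1000)
    STATE_FILE.write_text(str(new_watermark))

if __name__ == "__main__":
    run()
```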
The details will vary based on your technology stack and infrastructure, but those are the key tasks for the initial extraction. Next, let’s explore some examples.
Ingestion Examples
Let’s walk through a few examples of what ingesting DataHub metadata might look like using common data tools:
Airflow + Postgres
Using Apache Airflow for orchestration, we can use the Rest.li API to extract metadata and store it in Postgres. Some key steps would include:
- Define Postgres tables such as datasets, columns, and dashboards to hold the metadata
- Within Airflow, use the Rest.li API to pull metadata for each entity as needed
- Parse the JSON response, transform it, and insert into Postgres
- On subsequent runs, use lastUpdated filters and only ingest changed rows
We can put this logic within Airflow PythonOperators to build out the full pipeline. Airflow provides visibility, error handling, and scheduling capabilities to make this production-grade.
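A minimal sketch of such a DAG might look like the following, assuming Airflow 2.4+, a reachable GMS instance, a pre-created datasets table keyed on urn, and an illustrative response shape:

```python
import json
from datetime import datetime

import psycopg2
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

GMS_URL = "http://datahub-gms:8080"  # assumed GMS address on the cluster network
PG_DSN = "dbname=metadata user=etl host=warehouse-db"  # hypothetical connection string

def extract_and_load_datasets(**_context) -> None:
    """Pull dataset metadata from the Rest.li API and upsert it into Postgres."""
    resp = requests.get(f"{GMS_URL}/entities", params={"entity": "dataset"}, timeout=60)
    resp.raise_for_status()
    entities = resp.json().get("entities", [])  # illustrative response shape

    with psycopg2.connect(PG_DSN) as conn, conn.cursor() as cur:
        for entity in entities:
            cur.execute(
                """
                INSERT INTO datasets (urn, payload, loaded_at)
                VALUES (%s, %s, now())
                ON CONFLICT (urn) DO UPDATE
                  SET payload = EXCLUDED.payload, loaded_at = now()
                """,
                (entity["urn"], json.dumps(entity)),
            )

with DAG(
    dag_id="datahub_metadata_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_and_load_datasets",
        python_callable=extract_and_load_datasets,
    )
```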
Spark + Parquet
For pipelines using Spark and a data lakehouse architecture, we can extract metadata into Parquet:
- Design Parquet schemas for entities like datasets, charts, and dashboards
- Develop Spark jobs that call the Rest.li API to extract metadata
- Parse the JSON response and transform into Spark DataFrames
- Write DataFrames out to Parquet on S3/ADLS/HDFS
Spark allows us to parallelize extractions and leverage data lake storage for scalability. The structured nature of Parquet lends itself well to metadata storage and querying.
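As a rough sketch, assuming the same illustrative /entities endpoint and response shape, a PySpark job could look like this:

```python
import requests
from pyspark.sql import Row, SparkSession

GMS_URL = "http://datahub-gms:8080"                    # assumed GMS address
OUTPUT_PATH = "s3a://metadata-lake/datahub/datasets/"  # hypothetical lake path

spark = SparkSession.builder.appName("datahub-metadata-extract").getOrCreate()

# Pull dataset metadata from the Rest.li API on the driver. For very large
# DataHub instances you would page through results and parallelize the calls.
resp = requests.get(f"{GMS_URL}/entities", params={"entity": "dataset"}, timeout=60)
resp.raise_for_status()
entities = resp.json().get("entities", [])  # illustrative response shape

# Flatten each entity into a row; the field names here are illustrative.
rows = [
    Row(urn=e.get("urn"), name=e.get("name"), platform=e.get("platform"))
    for e in entities
]

if rows:
    df = spark.createDataFrame(rows)
    # Write the flattened metadata out as Parquet on the data lake.
    df.write.mode("overwrite").parquet(OUTPUT_PATH)
```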
Singer Taps
For pipelines using Singer for ELT, we could build a Singer Tap that extracts DataHub metadata:
- Follow the Singer spec for taps that connect to APIs
- Develop tap to call Rest.li API and parse response
- Output metadata Singer records for consumption
- A corresponding Target would write the records to Postgres/Redshift/Snowflake
The Singer ecosystem works nicely for extracting metadata and loading it into data warehouses. Existing Singer targets could also help sync the metadata into downstream tools.
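A bare-bones tap built on the singer-python library might look like the sketch below; the stream name, schema fields, and Rest.li call are illustrative assumptions:

```python
import requests
import singer

GMS_URL = "http://localhost:8080"  # assumed GMS endpoint

# Minimal schema for a 'datasets' stream; extend with the fields you need.
SCHEMA = {
    "properties": {
        "urn": {"type": "string"},
        "name": {"type": ["string", "null"]},
    }
}

def main() -> None:
    singer.write_schema("datasets", SCHEMA, key_properties=["urn"])

    resp = requests.get(f"{GMS_URL}/entities", params={"entity": "dataset"}, timeout=60)
    resp.raise_for_status()

    for entity in resp.json().get("entities", []):
        singer.write_record(
            "datasets",
            {"urn": entity.get("urn"), "name": entity.get("name")},
        )

    # Emit state so the next run can pick up incrementally.
    singer.write_state({"datasets": {"last_run": "replace-with-watermark"}})

if __name__ == "__main__":
    main()
```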
MAE Consumer for Change Events
In addition to Rest.li bulk extraction, we can use the MAE Consumer API to subscribe to a stream of metadata change events from DataHub. This powers more real-time synchronization of metadata by pushing updates rather than polling.
Some key steps include:
- Confirm the Kafka topics for MAE events are enabled in your DataHub deployment
- Implement MAE consumer that reads from stream
- Deserialize each event (Avro-encoded by default), identify the entity, and determine what changed
- Use change events to update destination store
The MAE stream integrates nicely with stream processors such as Spark Structured Streaming or Flink, or with a lightweight standalone consumer orchestrated by a tool like Airflow. The change stream can then efficiently update downstream systems.
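For illustration, here is a minimal standalone consumer sketch using confluent-kafka’s Avro support. The broker and schema registry addresses, and the MetadataAuditEvent_v4 topic name, are assumptions; confirm them against your DataHub deployment’s configuration.

```python
from confluent_kafka.avro import AvroConsumer

# Connection details are assumptions; point these at your own deployment.
consumer = AvroConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "datahub-metadata-sync",
    "schema.registry.url": "http://localhost:8081",
    "auto.offset.reset": "earliest",
})

# The topic name varies by DataHub version; MetadataAuditEvent_v4 is used by
# older releases, so confirm against your deployment's configuration.
consumer.subscribe(["MetadataAuditEvent_v4"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        event = msg.value()  # Avro-decoded into a Python dict
        # Compare the old and new snapshots to work out what changed,
        # then apply the delta to the destination store.
        print(event.get("newSnapshot"))
finally:
    consumer.close()
```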
Destination Options
In terms of destinations for the DataHub metadata, popular options include:
- Relational database: Postgres, MySQL, SQL Server to power dashboards and applications.
- Data warehouse: Redshift, BigQuery, Snowflake for analytics use cases.
- Lakehouse: Delta Lake, Hudi for large-scale analytics.
- Search index: Elasticsearch to enable metadata search and discovery.
- Graph database: Neo4j, JanusGraph for graph traversal and analytics.
The destination depends on your specific infrastructure and use cases for utilizing the metadata. A graph database can be especially useful for powering lineage applications.
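As a small illustration, a sketch of loading lineage edges into Neo4j with the official Python driver (5.x) might look like this; the node label, relationship type, and example URNs are made up for the example:

```python
from neo4j import GraphDatabase

# Connection details and node/relationship names are illustrative.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_lineage_edge(tx, upstream_urn: str, downstream_urn: str) -> None:
    """Create dataset nodes and a DOWNSTREAM_OF edge between them."""
    tx.run(
        """
        MERGE (u:Dataset {urn: $up})
        MERGE (d:Dataset {urn: $down})
        MERGE (d)-[:DOWNSTREAM_OF]->(u)
        """,
        up=upstream_urn,
        down=downstream_urn,
    )

with driver.session() as session:
    session.execute_write(
        load_lineage_edge,
        "urn:li:dataset:(urn:li:dataPlatform:hive,raw.events,PROD)",
        "urn:li:dataset:(urn:li:dataPlatform:hive,analytics.daily_events,PROD)",
    )
driver.close()
```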
Use Cases
What are some example use cases enabled by ingesting DataHub metadata?
Centralized Metadata Management
Ingesting DataHub metadata into a database or warehouse combines it with your existing metadata, enabling a single unified metadata layer. This improves discoverability, quality, and trust.
Self-Service Access
Syncing metadata to a data warehouse or lakehouse makes it readily available for analysts and engineers. They can query it on-demand for self-service insights.
Granular Lineage
Combining DataHub metadata with your own lineage sources provides end-to-end granular lineage visualizations powered by graph databases.
Compliance
Ingesting comprehensive DataHub metadata provides the raw material for supporting standards like CISQ and OpenLineage and for meeting regulatory compliance requirements.
AI/ML Training
Rich metadata can be used to train AI models for auto-tagging, classification, entity resolution, and other techniques.
As you can see, ingesting DataHub metadata opens up many possibilities for driving value across data domains. The flexibility of the REST API allows integrating it into any modern data stack.
Key Considerations
When implementing DataHub ingestion pipelines, keep these tips in mind:
- Plan schemas and extraction logic upfront for efficiency.
- Implement incremental extractions to minimize load volumes.
- Scale out parallelism for large DataHub instances.
- Handle errors and retries to make the pipeline robust.
- Use MAE Consumer for low-latency change events.
- Let downstream use cases drive what metadata you ingest and how.
- Document architecture and data flow thoroughly.
Conclusion
In closing, ingesting metadata from DataHub can enable extremely valuable use cases by harnessing its centralized knowledge base. Combining flexibility with simplicity, the Rest.li and MAE APIs make integration straightforward for any modern stack. With proper planning and implementation, you can build robust pipelines for powering metadata-driven approaches in your organization. The benefits for data discovery, compliance, ML applications, and more are immense. We hope this guide provided some ideas and starting points as you embark on your own DataHub integration journey.