Version: Next

DataHub

Testing

CLI based Ingestion

Install the Plugin

pip install 'acryl-datahub[datahub]'

Config Details

Note that a . is used to denote nested fields in the YAML recipe.
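As an illustration of the dotted notation, the sketch below expands a few dotted field names from the table into the nested structure a recipe's source config would contain. The helper `expand_dotted` is hypothetical and written only for this example; it is not part of the DataHub package.

```python
from typing import Any, Dict


def expand_dotted(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Turn {'a.b': 1} into {'a': {'b': 1}} (hypothetical helper)."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        parts = dotted_key.split(".")
        node = nested
        for part in parts[:-1]:
            # Create intermediate mappings on demand.
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


# Field names and values taken from the defaults listed on this page.
flat_fields = {
    "include_all_versions": False,
    "kafka_connection.bootstrap": "localhost:9092",
    "kafka_connection.client_timeout_seconds": 60,
    "mysql_connection.host_port": "localhost:3306",
}

config = expand_dotted(flat_fields)
print(config)
```

Each dotted segment becomes one level of nesting, so `kafka_connection.bootstrap` ends up as the `bootstrap` key under `kafka_connection` in the YAML recipe. The sketch does not handle a key that is given both as a scalar and as a parent of nested fields.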

Field | Description
commit_state_interval
integer
Number of records to process before committing state
Default: 1000
commit_with_parse_errors
boolean
Whether to update the createdon timestamp and Kafka offset even when parse errors occur. Enable this if you want to ignore the errors.
Default: False
include_all_versions
boolean
If enabled, include all versions of each aspect. Otherwise, only include the latest version of each aspect.
Default: False
kafka_topic_name
string
Name of the Kafka topic containing timeseries MCLs
Default: MetadataChangeLog_Timeseries_v1
mysql_batch_size
integer
Number of records to fetch from MySQL at a time
Default: 10000
mysql_table_name
string
Name of MySQL table containing all versioned aspects
Default: metadata_aspect_v2
kafka_connection
KafkaConsumerConnectionConfig
Kafka connection config
Default: {'bootstrap': 'localhost:9092', 'schema_registry_u...
kafka_connection.bootstrap
string
Default: localhost:9092
kafka_connection.client_timeout_seconds
integer
The request timeout used when interacting with the Kafka APIs.
Default: 60
kafka_connection.consumer_config
object
Extra consumer config serialized as JSON. These options will be passed into Kafka's DeserializingConsumer. See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#deserializingconsumer and https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md.
kafka_connection.schema_registry_config
object
Extra schema registry config serialized as JSON. These options will be passed into Kafka's SchemaRegistryClient. See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html?#schemaregistryclient.
kafka_connection.schema_registry_url
string
Default: http://localhost:8080/schema-registry/api/
mysql_connection
MySQLConnectionConfig
MySQL connection config
Default: {'username': None, 'host_port': 'localhost:3306', ...
mysql_connection.database
string
Name of the database (catalog) to connect to.
mysql_connection.database_alias
string
[Deprecated] Alias to apply to database when ingesting.
mysql_connection.host_port
string
MySQL host URL.
Default: localhost:3306
mysql_connection.options
object
Any options specified here will be passed to SQLAlchemy.create_engine as kwargs.
mysql_connection.password
string(password)
Password for the MySQL connection.
mysql_connection.scheme
string
Default: mysql+pymysql
mysql_connection.sqlalchemy_uri
string
URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.
mysql_connection.username
string
Username for the MySQL connection.
stateful_ingestion
StatefulIngestionConfig
Stateful Ingestion Config
Default: {'enabled': True, 'max_checkpoint_state_size': 167...
stateful_ingestion.enabled
boolean
Whether stateful ingestion is enabled.
Default: False (the stateful_ingestion default shown above sets enabled to True for this source)
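To show how the fields above fit together, here is a minimal sketch of a recipe expressed as a Python dictionary. Several parts are assumptions not stated on this page: the source type name `datahub` (assumed to match the plugin name in the install command), the `datahub-rest` sink with its `server` option, the `pipeline_name` value, and all credentials and the database name, which are placeholders. The `run_recipe` helper relies on `Pipeline.create` from the `acryl-datahub` package and imports it lazily, so the rest of the block runs without the package installed.

```python
from typing import Any, Dict


def build_recipe(sink_server: str) -> Dict[str, Any]:
    """Assemble an example recipe from fields documented in the table above."""
    return {
        # Hypothetical name; stateful ingestion needs a stable pipeline name.
        "pipeline_name": "datahub_source_example",
        "source": {
            "type": "datahub",  # assumed plugin name
            "config": {
                "include_all_versions": False,
                "mysql_connection": {
                    "host_port": "localhost:3306",
                    "database": "datahub",   # placeholder
                    "username": "datahub",   # placeholder
                    "password": "datahub",   # placeholder
                },
                "kafka_connection": {
                    "bootstrap": "localhost:9092",
                    "schema_registry_url": "http://localhost:8080/schema-registry/api/",
                },
                "stateful_ingestion": {"enabled": True},
            },
        },
        # Assumed sink: the REST sink of the destination DataHub instance.
        "sink": {"type": "datahub-rest", "config": {"server": sink_server}},
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe; requires acryl-datahub and reachable MySQL/Kafka."""
    from datahub.ingestion.run.pipeline import Pipeline  # lazy import

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


recipe = build_recipe("http://localhost:8081")  # hypothetical destination URL
print(recipe["source"]["config"]["mysql_connection"]["host_port"])
```

The same structure can be written as a YAML recipe file and passed to the CLI; the dictionary form is used here only to keep the example self-contained.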

Code Coordinates

  • Class Name: datahub.ingestion.source.datahub.datahub_source.DataHubSource

Questions

If you've got any questions on configuring ingestion for DataHub, feel free to ping us on our Slack.