Data-pipeline troubleshooting

To verify if data-pipeline is working properly

  • Ensure that a unique, valid, recent event produced to the Kafka ingestion topic (e.g. dev.telemetry.ingest) reaches the topic Druid ingests from. To prepare such an event:
    • get the golden data set from test-data/flink_golden_dataset.json
    • update data.params.msgid to a unique UUID
    • for each event, update mid to a unique UUID
    • update the time fields to a recent time, because the druid-event-validator job silently drops events older than 3 months
  • Once the event is confirmed to be reaching the Druid ingestion topic, check that a data source has been created in Druid and that the message has been added to it
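The preparation steps above can be scripted. The following is a minimal Python sketch, not part of the pipeline itself. The function name refresh_batch is hypothetical, the events are assumed to live under data.events, and this guide does not name the time fields, so they are passed in as a parameter (the default "ets" is only an assumption; replace it with the fields your events actually carry).

```python
import copy
import time
import uuid


def refresh_batch(batch, time_fields=("ets",), now_ms=None):
    """Return a copy of `batch` with fresh ids and recent timestamps.

    Assumptions (not confirmed by this guide): events are stored under
    data.events, and time fields hold epoch milliseconds.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    fresh = copy.deepcopy(batch)  # leave the golden data set untouched

    # data.params.msgid must be unique for every batch that is sent
    fresh["data"]["params"]["msgid"] = str(uuid.uuid4())

    for event in fresh["data"]["events"]:
        # every event needs its own unique mid
        event["mid"] = str(uuid.uuid4())
        # move the time fields that are present to a recent time
        for field in time_fields:
            if field in event:
                event[field] = now_ms
    return fresh
```

A typical use would be to load test-data/flink_golden_dataset.json with json.load, pass the result through refresh_batch, and write json.dumps of the returned object as a single line to the console producer.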


# ssh to kafka server (KP)
cd opt/kafka/bin

# to create a topic
./kafka-topics.sh --create --topic <topic> --replication-factor 1 --partitions 1 --bootstrap-server localhost:9092

# to list topics
./kafka-topics.sh --list --zookeeper localhost:2181

# to start a consumer (add `--from-beginning` to read all messages from the start)
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic <topic>

# to start a producer
./kafka-console-producer.sh --topic <topic> --broker-list localhost:9092

Flink jobs

check logs for flink jobs

# ssh to kubernetes server (jenkins in our case)
# export Kubernetes config file environment variable
export KUBECONFIG=/path/to/kube-config.yaml

# list pods in the flink-dev namespace
kubectl get po -n flink-dev

# get logs for a pod
kubectl logs <pod> -n flink-dev

If the logs contain errors caused by missing topics, check that the config for the Flink jobs is correct (ansible/kubernetes/helm_charts/datapipeline/flink-jobs/values.j2). To create a missing topic, ssh to the KP (Kafka) server and create it:

# ssh to kafka server (KP)
cd opt/kafka/bin
./kafka-topics.sh --create --topic <topic> --replication-factor 1 --partitions 1 --bootstrap-server localhost:9092

flink state backend

If Flink is unable to connect to the state backend, check your state-backend config. Alternatively, you can turn off the state backend: Flink will then keep its state in memory, but will not be able to recover that state if a pod crashes.



Change log level for druid services

To change the log level for any of the Druid services, edit that service's log4j2.xml file. For example, to set the broker log level to warn, edit its log4j2.xml as shown below.

Note: setting the log level to anything more verbose than WARN (e.g. INFO) makes the logs very busy, and the log files can grow to MBs within a couple of minutes. Set the log level back to ERROR as soon as you are done debugging.

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Appenders>
    <RollingFile name="File" fileName="/var/log/druid//broker.log" filePattern="/var/log/druid//broker.%i.log">
      <PatternLayout><Pattern>"%d{ISO8601} %p [%t] %c - %m%n"</Pattern></PatternLayout>
      <Policies><SizeBasedTriggeringPolicy size="50 MB"/></Policies>
      <DefaultRolloverStrategy max="20"/>
    </RollingFile>
  </Appenders>
  <Loggers>
    <Root level="warn">
      <AppenderRef ref="File"/>
    </Root>
  </Loggers>
</Configuration>

Druid config

Common Druid config for all services is present at /data/druid/conf/druid/_common/. Config for individual services is present at /data/druid/conf/druid/<service e.g. broker or overlord>/.

for s3 compatible deep storage

To use S3 as deep storage, make sure the Druid config contains the following:



# set protocol and endpoint together

# or separately as
# druid.s3.endpoint.url=<host>
# druid.s3.endpoint.protocol=<protocol>

for non-aws s3-like stores (like ceph), we might have to add additional config

# enable access of bucket from any region

# to enable path like access
# if true,  url=<protocol>://<host>/<bucket> 
# if false, url=<protocol>://<bucket>.<host> 

to allow Druid to publish task logs to s3 add following config

# path to logs within the bucket

additional config for s3 deep storage (optional)

# uncomment to enable server side encryption for s3

# uncomment to enable v4 signing of requests
# druid.s3.endpoint.signingRegion=<aws-region-code>

# uncomment to disable chunk encoding
# druid.s3.disableChunkedEncoding=true

S3 bucket policy

Druid should have permission to read from and write to the druid dir of the bucket. For S3, this requires the GetObject, PutObject, GetObjectAcl and PutObjectAcl permissions.

An example policy might look like:


{
  "Statement": [
    { "Action": [ ... ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::*" },
    { "Action": [ ... ],
      "Effect": "Allow",
      "Resource": [ ... ] }
  ]
}

to update bucket policy using s3cmd, first install s3cmd and configure using s3cmd --configure, then run

s3cmd setpolicy policy.json s3://<bucket>
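A policy.json granting the four permissions listed above could be generated as follows. This is a hedged Python sketch, not the policy used by the authors: the function name druid_bucket_policy, the principal ARN parameter and the default "druid" prefix are assumptions for illustration, and non-AWS stores such as Ceph may expect a different principal format.

```python
import json

# The four permissions this guide lists for Druid deep storage.
DRUID_ACTIONS = [
    "s3:GetObject",
    "s3:PutObject",
    "s3:GetObjectAcl",
    "s3:PutObjectAcl",
]


def druid_bucket_policy(bucket, principal_arn, prefix="druid"):
    """Build a bucket policy limited to objects under <bucket>/<prefix>/.

    `principal_arn` identifies the user or role Druid runs as
    (hypothetical parameter, supply your own value).
    """
    return {
        "Version": "2012-10-17",  # current AWS policy language version
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": [principal_arn]},
                "Action": list(DRUID_ACTIONS),
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }


def write_policy(path, policy):
    """Write the policy as JSON so it can be passed to s3cmd setpolicy."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(policy, handle, indent=2)
```

Calling write_policy("policy.json", druid_bucket_policy("<bucket>", "<principal-arn>")) produces a file that can then be applied with the s3cmd command above.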

for azure deep storage

To use Azure as deep storage, make sure the Druid config contains the following:


to allow Druid to publish task logs to azure add following config

druid.indexer.logs.prefix=<prefix e.g. druidlogs>

misc config

# uncomment to disable acl for deep storage

# uncomment to disable acl for only logs
# druid.indexer.logs.disableAcl=true

Druid graceful restart / rolling update

For configuration changes to take effect, the Druid services whose config has changed must be restarted. All Druid services except middlemanager can be restarted safely through systemctl:

# ssh to druid
systemctl restart druid_broker.service
systemctl restart druid_coordinator.service
systemctl restart druid_historical.service
systemctl restart druid_overlord.service

To gracefully restart middlemanager, first suspend all running supervisors. Suspending a supervisor publishes the segments that have not been published yet.

# ssh to druid
# get running supervisor names
curl -X GET http://localhost:8090/druid/indexer/v1/supervisor -i

# do this for all running supervisors
# suspend supervisor (stop running tasks and publish segments)
curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/<supervisor-name>/suspend

# restart middlemanager service
systemctl restart druid_middlemanager.service

# resume suspended supervisors
curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/<supervisor-name>/resume
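The suspend, restart and resume sequence above can be automated when there are many supervisors. The following Python sketch assumes the overlord is reachable on its default port 8090 and that the script runs on the Druid host with permission to call systemctl. The function names are hypothetical, and the HTTP and restart steps are passed in as parameters so the sequence can be dry-run.

```python
import json
import subprocess
import urllib.parse
import urllib.request

OVERLORD = "http://localhost:8090"  # default overlord port listed in this guide
SUPERVISOR_API = "/druid/indexer/v1/supervisor"


def overlord_request(method, path):
    """Send one request to the overlord and return the decoded JSON body."""
    req = urllib.request.Request(OVERLORD + path, method=method)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read() or b"null")


def restart_middlemanager():
    """Restart the service through systemctl, as in the manual steps."""
    subprocess.run(
        ["systemctl", "restart", "druid_middlemanager.service"], check=True
    )


def graceful_restart(request=overlord_request, restart=restart_middlemanager):
    """Suspend every listed supervisor, restart middlemanager, resume them."""
    suspended = []
    try:
        for name in request("GET", SUPERVISOR_API):
            quoted = urllib.parse.quote(name, safe="")
            request("POST", f"{SUPERVISOR_API}/{quoted}/suspend")
            suspended.append(name)
        restart()
    finally:
        # Resume whatever was suspended, even if the restart failed.
        for name in suspended:
            quoted = urllib.parse.quote(name, safe="")
            request("POST", f"{SUPERVISOR_API}/{quoted}/resume")
    return suspended
```

Two caveats apply to this sketch. It resumes every supervisor it suspended, so any supervisor that was deliberately suspended beforehand should be filtered out of the list first. It also does not wait for tasks to finish publishing after the suspend call, so checking the supervisor status before the restart may still be necessary.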

Druid API

ports - to find out which port each service is running on, check the file in /data/druid/conf/druid/<service>/

default ports -

# coordinator - 8081
# broker - 8082
# historical - 8083
# overlord - 8090
# middlemanager - 8091

Check status, get data sources

# check status of overlord service
curl -X GET http://localhost:8090/status

# show data sources
curl -X GET http://localhost:8081/druid/coordinator/v1/datasources -i
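To check all services in one pass, the /status call can be repeated against each of the default ports listed above. The following Python sketch assumes the services run on localhost with those default ports. The names check_services and fetch_status are hypothetical, and the reported "version" value is read from the JSON body that /status returns.

```python
import json
import urllib.request

# Default ports as listed in this guide; adjust if your config differs.
DEFAULT_PORTS = {
    "coordinator": 8081,
    "broker": 8082,
    "historical": 8083,
    "overlord": 8090,
    "middlemanager": 8091,
}


def fetch_status(url):
    """GET a /status URL and return the decoded JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())


def check_services(host="localhost", ports=DEFAULT_PORTS, fetch=fetch_status):
    """Return {service: version string, or a DOWN message if unreachable}."""
    results = {}
    for service, port in ports.items():
        try:
            body = fetch(f"http://{host}:{port}/status")
            results[service] = body.get("version", "unknown")
        except OSError as exc:
            # urllib raises URLError (an OSError) when the port is closed
            results[service] = f"DOWN ({exc})"
    return results
```

Printing the returned dictionary gives a quick overview of which services answer on their expected ports before digging into individual logs.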

Manage Ingestion

# get running supervisor names
curl -X GET http://localhost:8090/druid/indexer/v1/supervisor -i

# inspect particular supervisor ingestion config
curl -X GET http://localhost:8090/druid/indexer/v1/supervisor/<supervisor-name> -i

# inspect particular supervisor status
curl -X GET http://localhost:8090/druid/indexer/v1/supervisor/<supervisor-name>/status -i

# inspect particular supervisor task stats
curl -X GET http://localhost:8090/druid/indexer/v1/supervisor/<supervisor-name>/stats -i

# inspect tasks
curl -X GET http://localhost:8090/druid/indexer/v1/tasks -i

# inspect pending tasks
curl -X GET http://localhost:8090/druid/indexer/v1/pendingTasks -i

# inspect running tasks
curl -X GET http://localhost:8090/druid/indexer/v1/runningTasks -i

# add new supervisor
curl -X POST -H 'Content-Type: application/json' -d @spec.json http://localhost:8090/druid/indexer/v1/supervisor

# stop and delete supervisor
curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/<supervisor-name>/terminate -i

# suspend supervisor (stop running tasks and publish segments)
curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/<supervisor-name>/suspend

# resume supervisor
curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/<supervisor-name>/resume