fix: data quality may fail in docker mode (apache#15563)
zhongjiajie committed Feb 5, 2024
1 parent 01eb8f8 commit 91d56f4
Showing 17 changed files with 60 additions and 36 deletions.
2 changes: 1 addition & 1 deletion deploy/kubernetes/dolphinscheduler/README.md
@@ -121,7 +121,7 @@ Please refer to the [Quick Start in Kubernetes](../../../docs/docs/en/guide/inst
| conf.common."alert.rpc.port" | int | `50052` | rpc port |
| conf.common."appId.collect" | string | `"log"` | way to collect applicationId: log, aop |
| conf.common."conda.path" | string | `"/opt/anaconda3/etc/profile.d/conda.sh"` | set path of conda.sh |
| conf.common."data-quality.jar.name" | string | `"dolphinscheduler-data-quality-dev-SNAPSHOT.jar"` | data quality option |
| conf.common."data-quality.jar.dir" | string | `nil` | data quality option |
| conf.common."data.basedir.path" | string | `"/tmp/dolphinscheduler"` | user data local directory path, please make sure the directory exists and have read write permissions |
| conf.common."datasource.encryption.enable" | bool | `false` | datasource encryption enable |
| conf.common."datasource.encryption.salt" | string | `"!@#$%^&*"` | datasource encryption salt |
2 changes: 1 addition & 1 deletion deploy/kubernetes/dolphinscheduler/values.yaml
@@ -328,7 +328,7 @@ conf:
datasource.encryption.salt: '!@#$%^&*'

# -- data quality option
data-quality.jar.name: dolphinscheduler-data-quality-dev-SNAPSHOT.jar
data-quality.jar.dir:

# -- Whether hive SQL is executed in the same session
support.hive.oneSession: false
2 changes: 1 addition & 1 deletion docs/docs/en/architecture/configuration.md
@@ -226,7 +226,7 @@ The default configuration is as follows:
| yarn.job.history.status.address | http://ds1:19888/ws/v1/history/mapreduce/jobs/%s | job history status url of yarn |
| datasource.encryption.enable | false | whether to enable datasource encryption |
| datasource.encryption.salt | !@#$%^&* | the salt of the datasource encryption |
| data-quality.jar.name | dolphinscheduler-data-quality-dev-SNAPSHOT.jar | the jar of data quality |
| data-quality.jar.dir | | the directory containing the data quality jar |
| support.hive.oneSession | false | specify whether hive SQL is executed in the same session |
| sudo.enable | true | whether to enable sudo |
| alert.rpc.port | 50052 | the RPC port of Alert Server |
2 changes: 1 addition & 1 deletion docs/docs/en/guide/data-quality.md
@@ -12,7 +12,7 @@ The execution logic of the data quality task is as follows:
- The current data quality task result is stored in the `t_ds_dq_execute_result` table of `dolphinscheduler`
`Worker` sends the task result to `Master`. After `Master` receives the `TaskResponse`, it checks whether the task type is `DataQualityTask`; if so, it reads the corresponding result from `t_ds_dq_execute_result` by `taskInstanceId` and judges it according to the check mode, operator and threshold configured by the user.
- If the result is a failure, the corresponding operation, alarm or interruption will be performed according to the failure policy configured by the user.
- If you package `data-quality` separately, remember to rename the package to match the `data-quality.jar.name` attribute in `common.properties`
- If you package `data-quality` separately, place the jar in the directory set by the `data-quality.jar.dir` attribute in `common.properties`, and make sure the jar name starts with `dolphinscheduler-data-quality`
- If you are upgrading from an older version, run the `sql` update script to initialize the database before running.
- `dolphinscheduler-data-quality-dev-SNAPSHOT.jar` was built with no dependencies. If a `JDBC` driver is required, you can set the `--jars` parameter in the `node settings` `Option Parameters`, e.g. `--jars /lib/jars/mysql-connector-java-8.0.16.jar`.
- Currently only `MySQL`, `PostgreSQL` and `HIVE` data sources have been tested; other data sources are untested.
7 changes: 4 additions & 3 deletions docs/docs/en/guide/resource/configuration.md
@@ -152,9 +152,10 @@ datasource.encryption.enable=false
# datasource encryption salt
datasource.encryption.salt=!@#$%^&*

# data quality jar absolute path; it is auto-discovered from the libs directory. You can also specify the jar name in the libs directory
# if you rebuild it separately, or if the auto-discovery mechanism fails
data-quality.jar.name=
# data quality jar directory path; the data quality jar is auto-discovered from this directory. Keep it empty if you have not changed anything in
# data-quality, and dolphinscheduler will discover the jar by itself. Change it only if you want to use your own data-quality jar and it is not in the worker-server
# libs directory (but make sure your jar name starts with `dolphinscheduler-data-quality`).
data-quality.jar.dir=

#data-quality.error.output.path=/tmp/data-quality-error-data

1 change: 1 addition & 0 deletions docs/docs/en/guide/upgrade/incompatible.md
@@ -11,6 +11,7 @@ This document records the incompatible updates between each version. You need to
* Change the default unix shell executor from sh to bash ([#12180](https://github.com/apache/dolphinscheduler/pull/12180)).
* Remove `deleteSource` in `download()` of `StorageOperate` ([#14084](https://github.com/apache/dolphinscheduler/pull/14084))
* Remove default key for attribute `data-quality.jar.name` in `common.properties` ([#15551](https://github.com/apache/dolphinscheduler/pull/15551))
* Rename attribute `data-quality.jar.name` to `data-quality.jar.dir` in `common.properties`; it now specifies a directory instead of a jar name ([#15563](https://github.com/apache/dolphinscheduler/pull/15563))

## 3.2.0

2 changes: 1 addition & 1 deletion docs/docs/zh/architecture/configuration.md
@@ -226,7 +226,7 @@ The common.properties configuration file currently mainly configures hadoop/s3/yarn/applicationId
| yarn.job.history.status.address | http://ds1:19888/ws/v1/history/mapreduce/jobs/%s | job history status URL of yarn |
| datasource.encryption.enable | false | whether to enable datasource encryption |
| datasource.encryption.salt | !@#$%^&* | the salt used for datasource encryption |
| data-quality.jar.name | dolphinscheduler-data-quality-dev-SNAPSHOT.jar | the jar used by data quality |
| data-quality.jar.dir | | the directory containing the jar used by data quality |
| support.hive.oneSession | false | whether hive SQL is executed in the same session |
| sudo.enable | true | whether to enable sudo |
| alert.rpc.port | 50052 | the RPC port of Alert Server |
2 changes: 1 addition & 1 deletion docs/docs/zh/guide/data-quality.md
@@ -13,7 +13,7 @@
>
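## Notes

- If you package `data-quality` separately, remember to rename the package to match `data-quality.jar.name`, which is configured in `common.properties`
- If you package `data-quality` separately, remember to make the package path match `data-quality.jar.dir`, which is configured in `common.properties`
- If you are upgrading from an older version, run the `SQL` update script to initialize the database before running.
- The current `dolphinscheduler-data-quality-dev-SNAPSHOT.jar` is a thin jar and does not include any `JDBC` driver.
If a `JDBC` driver is needed, set the `--jars` parameter under `Option Parameters` in `Node Settings`,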
7 changes: 4 additions & 3 deletions docs/docs/zh/guide/resource/configuration.md
@@ -156,9 +156,10 @@ datasource.encryption.enable=false
# datasource encryption salt
datasource.encryption.salt=!@#$%^&*

# data quality jar absolute path; it is auto-discovered from the libs directory. You can also specify the jar name in the libs directory
# if you rebuild it separately, or if the auto-discovery mechanism fails
data-quality.jar.name=
# data quality jar directory path; the data quality jar is auto-discovered from this directory. Keep it empty if you have not changed anything in
# data-quality, and dolphinscheduler will discover the jar by itself. Change it only if you want to use your own data-quality jar and it is not in the worker-server
# libs directory (but make sure your jar name starts with `dolphinscheduler-data-quality`).
data-quality.jar.dir=

#data-quality.error.output.path=/tmp/data-quality-error-data

@@ -84,9 +84,10 @@ datasource.encryption.enable=false
# datasource encryption salt
datasource.encryption.salt=!@#$%^&*

# data quality jar absolute path; it is auto-discovered from the libs directory. You can also specify the jar name in the libs directory
# if you rebuild it separately, or if the auto-discovery mechanism fails
data-quality.jar.name=
# data quality jar directory path; the data quality jar is auto-discovered from this directory. Keep it empty if you have not changed anything in
# data-quality, and dolphinscheduler will discover the jar by itself. Change it only if you want to use your own data-quality jar and it is not in the worker-server
# libs directory (but make sure your jar name starts with `dolphinscheduler-data-quality`).
data-quality.jar.dir=

#data-quality.error.output.path=/tmp/data-quality-error-data

7 changes: 4 additions & 3 deletions dolphinscheduler-common/src/main/resources/common.properties
@@ -120,9 +120,10 @@ datasource.encryption.enable=false
# datasource encryption salt
datasource.encryption.salt=!@#$%^&*

# data quality jar absolute path; it is auto-discovered from the libs directory. You can also specify the jar name in the libs directory
# if you rebuild it separately, or if the auto-discovery mechanism fails
data-quality.jar.name=
# data quality jar directory path; the data quality jar is auto-discovered from this directory. Keep it empty if you have not changed anything in
# data-quality, and dolphinscheduler will discover the jar by itself. Change it only if you want to use your own data-quality jar and it is not in the worker-server
# libs directory (but make sure your jar name starts with `dolphinscheduler-data-quality`).
data-quality.jar.dir=

#data-quality.error.output.path=/tmp/data-quality-error-data
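
The empty default above is what lets auto-discovery take over: `java.util.Properties` parses `data-quality.jar.dir=` as an empty string, not as a missing key. The sketch below shows that behaviour in isolation. `JarDirProperty` and `configuredDir` are hypothetical names used only for illustration; they are not part of DolphinScheduler, and the project's own `PropertyUtils` may behave differently in details.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of DolphinScheduler.
final class JarDirProperty {

    // Property key introduced by this commit.
    static final String KEY = "data-quality.jar.dir";

    /** Returns the configured directory, or an empty string when the key is absent or blank. */
    static String configuredDir(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return props.getProperty(KEY, "").trim();
    }

    public static void main(String[] args) {
        // Default shipped in common.properties: no override, so discovery would be used.
        System.out.println(configuredDir("data-quality.jar.dir=\n").isEmpty());
        // An explicit override; the path is an example value, not a project default.
        System.out.println(configuredDir("data-quality.jar.dir=/opt/custom/dq\n"));
    }
}
```

Treating a missing key and an empty value the same way keeps the "leave it empty unless you ship your own jar" guidance in the comment simple to follow.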

7 changes: 4 additions & 3 deletions dolphinscheduler-common/src/test/resources/common.properties
@@ -115,9 +115,10 @@ datasource.encryption.enable=false
# datasource encryption salt
datasource.encryption.salt=!@#$%^&*

# data quality jar absolute path; it is auto-discovered from the libs directory. You can also specify the jar name in the libs directory
# if you rebuild it separately, or if the auto-discovery mechanism fails
data-quality.jar.name=
# data quality jar directory path; the data quality jar is auto-discovered from this directory. Keep it empty if you have not changed anything in
# data-quality, and dolphinscheduler will discover the jar by itself. Change it only if you want to use your own data-quality jar and it is not in the worker-server
# libs directory (but make sure your jar name starts with `dolphinscheduler-data-quality`).
data-quality.jar.dir=

#data-quality.error.output.path=/tmp/data-quality-error-data

@@ -18,7 +18,7 @@
package org.apache.dolphinscheduler.plugin.datasource.api.utils;

import static org.apache.dolphinscheduler.common.constants.Constants.RESOURCE_STORAGE_TYPE;
import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.DATA_QUALITY_JAR_NAME;
import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.DATA_QUALITY_JAR_DIR;
import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.HADOOP_SECURITY_AUTHENTICATION;
import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.HADOOP_SECURITY_AUTHENTICATION_STARTUP_STATE;
import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.JAVA_SECURITY_KRB5_CONF;
@@ -133,14 +133,28 @@ public static boolean loadKerberosConf(String javaSecurityKrb5Conf, String login
}

public static String getDataQualityJarPath() {
String dqsJarPath = PropertyUtils.getString(DATA_QUALITY_JAR_NAME);
log.info("Trying to get data quality jar in path");
String dqJarDir = PropertyUtils.getString(DATA_QUALITY_JAR_DIR);

if (StringUtils.isNotEmpty(dqJarDir)) {
log.info(
"Configuration data-quality.jar.dir is not empty, will try to get data quality jar from directory {}",
dqJarDir);
getDataQualityJarPathFromPath(dqJarDir).ifPresent(jarName -> DEFAULT_DATA_QUALITY_JAR_PATH = jarName);
}

if (StringUtils.isEmpty(DEFAULT_DATA_QUALITY_JAR_PATH)) {
log.info("data quality jar path is empty, will try to auto discover it from built-in rules.");
getDefaultDataQualityJarPath();
}

if (StringUtils.isEmpty(dqsJarPath)) {
log.info("data quality jar path is empty, will try to get it from data quality jar name");
return getDefaultDataQualityJarPath();
if (StringUtils.isEmpty(DEFAULT_DATA_QUALITY_JAR_PATH)) {
log.error(
"Cannot find the data quality jar via configuration or auto-discovery, please check your configuration or report a bug.");
throw new RuntimeException("data quality jar path is empty");
}

return dqsJarPath;
return DEFAULT_DATA_QUALITY_JAR_PATH;
}

private static String getDefaultDataQualityJarPath() {
@@ -173,14 +187,15 @@ private static Optional<String> getDataQualityJarPathFromPath(String path) {
log.info("Try to get data quality jar from path {}", path);
File[] jars = new File(path).listFiles();
if (jars == null) {
log.warn("No data quality related jar found from path {}", path);
log.warn("No files found in the given path {}", path);
return Optional.empty();
}
for (File jar : jars) {
if (jar.getName().startsWith("dolphinscheduler-data-quality")) {
return Optional.of(jar.getAbsolutePath());
}
}
log.warn("No data quality related jar found from path {}", path);
return Optional.empty();
}
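
The lookup order introduced by the two hunks above (configured directory first, then built-in discovery, then a hard failure) can be sketched as a stand-alone class. This is a simplified illustration, not the project's implementation. `DataQualityJarLocator`, `findJarIn` and `resolve` are hypothetical names, and the list of fallback directories stands in for the built-in discovery rules, which this diff does not show. The real method also caches its result in a static field, which the sketch omits.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only; names and the fallback list are assumptions, not DolphinScheduler API.
final class DataQualityJarLocator {

    // Prefix the diff matches on when scanning a directory.
    static final String JAR_PREFIX = "dolphinscheduler-data-quality";

    /** Returns the absolute path of the first file in dir whose name starts with JAR_PREFIX. */
    static Optional<String> findJarIn(String dir) {
        // listFiles() returns null when dir does not exist or is not a directory.
        File[] files = new File(dir).listFiles();
        if (files == null) {
            return Optional.empty();
        }
        return Arrays.stream(files)
                .filter(f -> f.getName().startsWith(JAR_PREFIX))
                .map(File::getAbsolutePath)
                .findFirst();
    }

    /** Tries the configured directory first, then each fallback directory, and fails if none has the jar. */
    static String resolve(String configuredDir, List<String> fallbackDirs) {
        if (configuredDir != null && !configuredDir.isEmpty()) {
            Optional<String> fromConfig = findJarIn(configuredDir);
            if (fromConfig.isPresent()) {
                return fromConfig.get();
            }
        }
        for (String dir : fallbackDirs) {
            Optional<String> found = findJarIn(dir);
            if (found.isPresent()) {
                return found.get();
            }
        }
        throw new IllegalStateException("data quality jar path is empty");
    }

    public static void main(String[] args) {
        String configured = args.length > 0 ? args[0] : "";
        // Fallback directory taken from the Dockerfile hunk in this commit; adjust for other layouts.
        List<String> fallbacks = Arrays.asList("/opt/dolphinscheduler/libs/worker-server");
        try {
            System.out.println(resolve(configured, fallbacks));
        } catch (IllegalStateException e) {
            System.out.println("not found: " + e.getMessage());
        }
    }
}
```

Failing with an exception when nothing is found, as the commit does, surfaces a misconfigured directory as soon as the path is resolved instead of passing an empty jar path further along.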

@@ -95,9 +95,10 @@ datasource.encryption.enable=false
# datasource encryption salt
datasource.encryption.salt=!@#$%^&*

# data quality option; it is auto-discovered from the libs directory. You can also specify the jar name in the libs directory
# if you rebuild it separately, or if the auto-discovery mechanism fails
data-quality.jar.name=
# data quality jar directory path; the data quality jar is auto-discovered from this directory. Keep it empty if you have not changed anything in
# data-quality, and dolphinscheduler will discover the jar by itself. Change it only if you want to use your own data-quality jar and it is not in the worker-server
# libs directory (but make sure your jar name starts with `dolphinscheduler-data-quality`).
data-quality.jar.dir=

#data-quality.error.output.path=/tmp/data-quality-error-data

@@ -20,6 +20,7 @@ FROM eclipse-temurin:8-jdk
ENV DOCKER true
ENV TZ Asia/Shanghai
ENV DOLPHINSCHEDULER_HOME /opt/dolphinscheduler
ENV DATA_QUALITY_JAR_DIR /opt/dolphinscheduler/libs/worker-server

RUN apt update ; \
apt install -y sudo ; \
@@ -358,9 +358,9 @@ private TaskConstants() {
public static final String RESOURCE_UPLOAD_PATH = "resource.storage.upload.base.path";

/**
* data-quality.jar.name
* data-quality.jar.dir
*/
public static final String DATA_QUALITY_JAR_NAME = "data-quality.jar.name";
public static final String DATA_QUALITY_JAR_DIR = "data-quality.jar.dir";

public static final String TASK_TYPE_CONDITIONS = "CONDITIONS";
