- 2024/12/27: Project Initialization
Dingo is a data quality evaluation tool that helps you automatically detect data quality issues in your datasets. Dingo provides a variety of built-in rules and model evaluation methods, and also supports custom evaluation methods. Dingo supports commonly used text datasets and multimodal datasets, including pre-training datasets, fine-tuning datasets, and evaluation datasets. In addition, Dingo supports multiple usage methods, including local CLI and SDK, making it easy to integrate into various evaluation platforms, such as OpenCompass.
Users can use Dingo in two ways as shown below.
Install dingo
pip install dingo-python
Try to run the SDK
Call the method below:
from dingo.io import InputArgs
from dingo.exec import Executor
input_data = {
    "eval_group": "sft",               # rule group for SFT data; other options: 'default', 'pretrain', ...
    "input_path": "tatsu-lab/alpaca",  # dataset from Hugging Face
    "data_format": "plaintext",        # data format; other options: 'json', 'jsonl', 'plaintext'
    "save_data": True,                 # save evaluation results locally
}
input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()
print(result)
For more usage examples, refer to examples; for more evaluation results, refer to evaluation; and for more configuration options, refer to config.
Try to run the CLI
Call the rule-set evaluation below:
python -m dingo.run.cli --input_path tatsu-lab/alpaca -e sft --data_format plaintext --save_data True
Or try to run the CLI
Call the gpt-4o model evaluation below:
python -m dingo.run.cli --input_path test/data/test_local_json.json --dataset local -e openai --data_format json --column_content prediction --custom_config test/config/config_gpt.json --save_data True
Note that model evaluation requires the corresponding configuration, such as the one used in the example above:
$ cat test/config/config_gpt.json
{
  "llm_config": {
    "openai": {
      "model": "gpt-4o",
      "key": "xxxx",
      "api_url": "https://api.openai.com/v1/chat/completions"
    }
  }
}
After the CLI run finishes, if the user set the save_data parameter to True, a frontend page is automatically generated from the quality inspection results.
If the user wants to start a frontend page manually, enter the following command:
python -m dingo.run.vsl --input xxx
The --input argument is the directory containing the quality inspection results; users need to ensure the directory contains a summary.json file.
Dingo supports local files, Hugging Face datasets, and S3 storage files as data sources; pre-training, fine-tuning, and evaluation datasets as data types; and text and image as data modalities.
Dingo has 20+ built-in general heuristic rule evaluations, evaluations with common LLMs (such as OpenAI, Kimi, etc.), and evaluations with locally launched models (such as Llama3). The built-in heuristic rules are organized into rule-set combinations such as pretrain and sft according to the dataset type. Both rules and model evaluations can be customized or modified. Data security evaluation is also supported, for example via the Perspective API.
Dingo supports multiple interface usage methods, including local CLI and SDK, making it easy to integrate into various evaluation platforms, such as OpenCompass.
Dingo supports two execution engines, local and Spark, making it convenient to run data evaluation tasks of various sizes.
Dingo can output a summary report covering 7 Quality Metrics as well as detailed reports for tracing abnormal data.
The installation in the quick-start section above only installs the packages required for basic runs; packages for some special features are not installed. If you need them during use, refer to: Install Dependencies
If the built-in heuristic rules do not meet your quality inspection requirements, you can also register custom rules or models.
If the user wants to create a new rule, such as CommonPatternDemo, the first step is to add a decorator to the rule to inject it into the project. Second, a metric_type, such as QUALITY_BAD_RELEVANCE, needs to be set for the rule, while group does not need to be set. Then the user needs to define a DynamicRuleConfig object so that the rule's properties can be configured dynamically. In addition, the rule's method must be named eval and must be a class method, and its return value must be a ModelRes object. For example: Register Rules
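Putting these steps together, a minimal sketch of such a rule might look like the following. The import paths, the Model.rule_register decorator signature, and the ModelRes field names are assumptions that mirror the steps above and may differ between Dingo versions.

```python
import re

# Assumed import paths; adjust to the module layout of your Dingo version.
from dingo.config.config import DynamicRuleConfig
from dingo.io import MetaData
from dingo.model import Model
from dingo.model.modelres import ModelRes
from dingo.model.rule.base import BaseRule


@Model.rule_register("QUALITY_BAD_RELEVANCE", [])  # metric_type is set; group is left empty
class CommonPatternDemo(BaseRule):
    # DynamicRuleConfig lets the pattern be overridden via custom_config.
    dynamic_config = DynamicRuleConfig(pattern=r"\^+")

    @classmethod
    def eval(cls, input_data: MetaData) -> ModelRes:
        res = ModelRes()
        matches = re.findall(cls.dynamic_config.pattern, input_data.content)
        if matches:
            res.error_status = True
            res.type = "QUALITY_BAD_RELEVANCE"
            res.name = cls.__name__
            res.reason = matches
        return res
```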
Users can also register prompts; the procedure is similar to registering rules. For example: Register Prompts
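As a rough illustration, a registered prompt might look like the sketch below; the prompt_register decorator and BasePrompt base class are assumptions that mirror the rule example, and the prompt text itself is only a placeholder.

```python
from dingo.model import Model
from dingo.model.prompt.base import BasePrompt  # assumed import path


@Model.prompt_register("QUALITY_BAD_SIMILARITY", [])  # assumed decorator, mirroring rule_register
class PromptRepeatDemo(BasePrompt):
    # The prompt text sent to the evaluating LLM when this prompt is selected.
    content = """
    Please check whether the following text contains heavily repeated content.
    Return a JSON object with the fields "score" and "reason".
    """
```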
Registering models is slightly different: the user needs to implement a call_api method that accepts a MetaData parameter and returns a ModelRes result. The project already provides the base model class BaseOpenAI, which users can inherit directly; if special behaviour is needed, the corresponding methods can be overridden.
For example: Register Models
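A hedged sketch of a custom model is shown below. The llm_register decorator name and import paths are assumptions; call_api, MetaData, ModelRes, and BaseOpenAI follow the description above.

```python
from dingo.io import MetaData
from dingo.model import Model
from dingo.model.llm.base_openai import BaseOpenAI  # assumed import path
from dingo.model.modelres import ModelRes


@Model.llm_register("my_custom_openai")  # assumed decorator, mirroring rule registration
class MyCustomOpenAI(BaseOpenAI):
    # BaseOpenAI already provides call_api; override it only for special behaviour,
    # keeping the MetaData -> ModelRes contract described above.
    @classmethod
    def call_api(cls, input_data: MetaData) -> ModelRes:
        res = super().call_api(input_data)
        # Example of extra post-processing on the result.
        if res.error_status and not res.reason:
            res.reason = ["flagged by my_custom_openai"]
        return res
```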
Dingo can run locally or on a Spark cluster.
Regardless of the choice of engine, the executor supports some common methods:
| function name | description |
| --- | --- |
| get_summary | get the summary of the evaluation |
| get_bad_info_list | get the bad data |
| get_good_info_list | get the good data |
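For example, continuing the local-executor SDK example from the quick start, these methods can be called on the executor (here assumed to be after execute() has run):

```python
# Continuing the local-executor SDK example above.
result = executor.execute()

summary = executor.get_summary()            # overall score and per-metric ratios
bad_data = executor.get_bad_info_list()     # records flagged by the quality checks
good_data = executor.get_good_info_list()   # records that passed

print(summary)
print(len(bad_data), "bad records,", len(good_data), "good records")
```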
When choosing the local engine, users can freely choose rules and models for quality inspection.
When choosing the Spark engine, users can only choose rules for quality inspection; models cannot be used. In addition, only eval_group, save_data, save_correct, and custom_config in InputArgs remain valid.
Therefore, the user needs to pass in a spark_session to initialize Spark, and a spark_rdd (composed of MetaData records) as the data for quality inspection. Note that if save_data is False, the data in memory is cleared immediately after the quality inspection completes, and the spark_session is stopped as well.
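A minimal sketch of running the Spark engine is shown below. It assumes the Spark executor accepts spark_session and spark_rdd as keyword arguments, as the description above implies; the exact parameter names, and which InputArgs fields are required by validation, may differ between versions.

```python
from pyspark.sql import SparkSession

from dingo.exec import Executor
from dingo.io import InputArgs, MetaData

spark = SparkSession.builder.appName("dingo-spark-demo").getOrCreate()

# Build an RDD of MetaData records to be quality-checked.
rdd = spark.sparkContext.parallelize([
    MetaData(data_id="0", prompt="", content="I am 8 years old. I love apples."),
])

# Only eval_group, save_data, save_correct, and custom_config are honoured here.
input_args = InputArgs(eval_group="default", save_data=False)

executor = Executor.exec_map["spark"](
    input_args,
    spark_session=spark,  # assumption: keyword name
    spark_rdd=rdd,        # assumption: keyword name
)
result = executor.execute()
print(result)
```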
After completing an evaluation, Dingo generates a summary report (summary) and a detailed report (detail). The summary includes the overall Score and the scores for each of the 7 Quality Metrics dimensions of the evaluation. The detailed report contains the specific data content flagged under each Quality Metric, making it convenient to trace the cause.
An example summary.json profile file is as follows:
{
  "task_id": "d6c922ec-981c-11ef-b723-7c10c9512fac",
  "task_name": "dingo",
  "eval_group": "default",
  "input_path": "test/data/test_local_jsonl.jsonl",
  "output_path": "outputs/d6c921ac-981c-11ef-b723-7c10c9512fac",
  "create_time": "20241101_144510",
  "score": 50.0,
  "num_good": 1,
  "num_bad": 1,
  "total": 2,
  "type_ratio": {
    "QUALITY_BAD_COMPLETENESS": 0.5,
    "QUALITY_BAD_RELEVANCE": 0.5
  },
  "name_ratio": {
    "QUALITY_BAD_COMPLETENESS-RuleColonEnd": 0.5,
    "QUALITY_BAD_RELEVANCE-RuleSpecialCharacter": 0.5
  }
}
An example detailed report, such as the RuleColonEnd.json file, is as follows:
{"data_id": "1", "prompt": "", "content": "�I am 8 years old. ^I love apple because:", "type_list": ["QUALITY_BAD_COMPLETENESS", "QUALITY_BAD_RELEVANCE"], "name_list": ["QUALITY_BAD_COMPLETENESS-RuleColonEnd", "QUALITY_BAD_RELEVANCE-RuleSpecialCharacter"], "reason_list": ["�I am 8 years old. ^I love apple because:", ["�"]]}
- Richer graphic and text evaluation indicators;
- New audio and video data modality evaluation;
- New small-model evaluations, such as fastText and QuRating;
- New data diversity evaluation;
- The evaluation tool's built-in detection rules and model methods mostly come from papers, open-source projects, etc., and mainly target common data quality problems. If special data problems need to be evaluated, it is recommended to customize corresponding detection rules;
We appreciate all the contributors for their efforts to improve and enhance Dingo. Please refer to the Contribution Guide for guidance on contributing to the project.
This project uses the Apache 2.0 Open Source License.
If you find this project useful, please consider citing our tool:
@misc{dingo,
  title={Dingo: A Comprehensive Data Quality Evaluation Tool for Large Models},
  howpublished={\url{https://github.com/DataEval/dingo}},
  year={2024}
}