http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good-quality question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, when generating directly in the target language, GPT-4
Turbo can produce questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
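
The abstract contrasts automatic adaptation of an existing English dataset with direct generation in the target language. As a rough illustration of the latter, the sketch below asks GPT-4 Turbo for one culturally grounded multiple-choice item in Indonesian. This is not the paper's pipeline: the prompt wording, the JSON schema, and the model identifier are assumptions made for demonstration only.

```python
# Illustrative sketch only: prompt text, JSON schema, and model name are
# assumptions; the paper's actual prompts and categories are not reproduced here.
import json
from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Indonesian prompt asking for one culturally grounded commonsense MCQ,
# returned as JSON with 'question', 'choices', and 'answer' keys.
PROMPT = (
    "Buatlah satu soal pilihan ganda commonsense dalam bahasa Indonesia "
    "yang berkaitan dengan budaya Indonesia (misalnya makanan, adat, atau "
    "kebiasaan sehari-hari). Balas hanya dengan JSON berisi kunci "
    "'question', 'choices' (5 opsi), dan 'answer'."
)

def generate_qa_item(model: str = "gpt-4-turbo") -> dict:
    """Ask the model for one culturally grounded commonsense QA item."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        response_format={"type": "json_object"},  # JSON mode; prompt mentions JSON
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    item = generate_qa_item()
    print(item["question"])
    for choice in item["choices"]:
        print("-", choice)
```

In practice, items generated this way would still need the human review step the paper describes, since the abstract reports that model-generated questions tend to be less culturally "deep" and, for Sundanese, less fluent.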
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensive, in-depth evaluation
of the commonsense reasoning ability of large language models (LLMs) in
Chinese, covering both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing a clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
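
Since the abstract centers on comparing prompt strategies such as Chain-of-Thought on multiple-choice commonsense items, the minimal sketch below shows what such a comparison loop can look like. The item format, prompt wording, and answer-extraction heuristic are simplified assumptions; CHARM's official data and evaluation code live at the linked repository.

```python
# Minimal sketch: compare a direct-answer prompt with a Chain-of-Thought prompt
# on multiple-choice items. Item format, prompts, and scoring are assumptions,
# not CHARM's official protocol.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    question: str
    choices: list[str]   # e.g. ["A. ...", "B. ...", ...]
    answer: str          # gold label, e.g. "B"

def direct_prompt(item: Item) -> str:
    opts = "\n".join(item.choices)
    return f"{item.question}\n{opts}\nAnswer with the option letter only."

def cot_prompt(item: Item) -> str:
    opts = "\n".join(item.choices)
    return (f"{item.question}\n{opts}\n"
            "Let's think step by step, then give the option letter "
            "on the last line as 'Answer: X'.")

def accuracy(items: list[Item], ask: Callable[[str], str],
             build: Callable[[Item], str]) -> float:
    """`ask` wraps whatever LLM is being evaluated and returns its raw reply."""
    correct = 0
    for item in items:
        reply = ask(build(item))
        # Crude heuristic: take the last capital letter A-E in the reply
        # as the predicted label.
        pred = next((c for c in reversed(reply) if c in "ABCDE"), "")
        correct += int(pred == item.answer)
    return correct / len(items) if items else 0.0
```

Running `accuracy(items, ask, direct_prompt)` against `accuracy(items, ask, cot_prompt)` for several models is the kind of comparison the abstract reports, where the benefit of a strategy depends on the model's language orientation and the task's domain.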
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
to gain more confidence in these boundaries than anecdotal evidence alone provides. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
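
To make the idea of dialectical evaluation concrete, the sketch below probes a model twice about the same spatial scene and checks the two turns against each other rather than scoring a single answer. The scene, the follow-up question, and the `ask(history)` chat wrapper are assumptions for illustration, not the paper's protocol.

```python
# Illustrative sketch of a dialectical probe: keep questioning the model about
# one spatial scene and check the turns against each other for consistency.
# The scene, questions, and the `ask(history)` wrapper are assumptions.
def dialectical_spatial_probe(ask) -> dict:
    """`ask(history)` takes a list of {'role', 'content'} dicts and returns text."""
    history = [{
        "role": "user",
        "content": "The cup is to the left of the plate, and the plate is "
                   "to the left of the knife. Is the cup to the left of the knife?"
    }]
    first = ask(history)
    history.append({"role": "assistant", "content": first})

    # Follow-up turn: the logically equivalent question asked from the other side.
    history.append({
        "role": "user",
        "content": "And is the knife to the right of the cup? Answer yes or no."
    })
    second = ask(history)

    # The two questions are logically equivalent, so the replies should agree.
    # (Crude keyword check; a real probe would parse the answers more carefully.)
    return {
        "first_reply": first,
        "second_reply": second,
        "consistent": ("yes" in first.lower()) == ("yes" in second.lower()),
    }
```

A disagreement between the two turns is exactly the kind of failure this style of evaluation is designed to surface: it marks a boundary of the system's spatial reasoning rather than contributing to an aggregate score.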
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce the SynCPKL Pipeline, which
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge with a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
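
As a rough, hypothetical sketch of the kind of pipeline the abstract describes, the snippet below uses an LLM as a noisy annotator to label (dialogue, persona-commonsense fact) pairs and writes them out as synthetic training data for a downstream linker. The prompt, the JSONL row format, and the judge stub are assumptions made here; the released SynCPKL code is the authoritative reference.

import json
from typing import Callable, Iterable, Tuple

def build_synthetic_dataset(judge: Callable[[str], str],
                            pairs: Iterable[Tuple[str, str]],
                            out_path: str) -> int:
    """Label (dialogue, fact) pairs with an LLM judge and write JSONL training rows."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for dialogue, fact in pairs:
            prompt = (
                f"Dialogue:\n{dialogue}\n\nPersona commonsense fact:\n{fact}\n\n"
                "Is the fact relevant to this dialogue? Answer 'yes' or 'no'."
            )
            label = 1 if judge(prompt).strip().lower().startswith("yes") else 0
            f.write(json.dumps({"dialogue": dialogue, "fact": fact, "label": label}) + "\n")
            count += 1
    return count

if __name__ == "__main__":
    demo_pairs = [("A: I just got back from a 10k run.", "The speaker enjoys exercising.")]
    # Offline stub in place of a real LLM judge.
    written = build_synthetic_dataset(lambda prompt: "yes", demo_pairs, "syncpkl_demo.jsonl")
    print(f"wrote {written} synthetic examples")
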
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need to integrate persona and commonsense knowledge
in open-domain dialogue systems. We introduce the SynCPKL Pipeline, which
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge with a 16%
improvement in F1 score. We have released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
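
To make the idea of LLM-generated training data for a persona knowledge linker concrete, the sketch below shows one way such synthetic examples might be produced: an LLM labels whether a candidate persona fact is relevant to a dialogue, yielding binary training pairs. This is a hedged approximation under stated assumptions, not the SynCPKL Pipeline itself; `generate`, the prompt wording, and the data fields are all illustrative.

from typing import Callable, Dict, Iterable, List

# Illustrative prompt; not taken from the paper.
PROMPT = ("Dialogue:\n{dialogue}\n\n"
          "Candidate persona fact: {fact}\n"
          "Is this fact relevant to the dialogue? Answer yes or no.")

def synthesize(pairs: Iterable[Dict[str, str]],
               generate: Callable[[str], str]) -> List[Dict]:
    examples = []
    for pair in pairs:  # each pair has "dialogue" and "fact" fields (assumed schema)
        reply = generate(PROMPT.format(**pair)).strip().lower()
        examples.append({"dialogue": pair["dialogue"],
                         "fact": pair["fact"],
                         "label": 1 if reply.startswith("yes") else 0})
    return examples

The resulting examples could then be used to fine-tune a classifier-style linker; the paper's top model name suggests a DeBERTa variant, though that detail is not confirmed here.
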
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v2
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensive and in-depth
evaluation of the commonsense reasoning ability of large language models
(LLMs) in Chinese, covering both globally known and Chinese-specific
commonsense. We evaluated 7 English and 12 Chinese-oriented LLMs on CHARM,
employing 5 representative prompt strategies for improving LLMs' reasoning
ability, such as Chain-of-Thought. Our findings indicate that the LLM's
language orientation and the task's domain influence the effectiveness of the
prompt strategy, which enriches previous research findings. We built closely
interconnected reasoning and memorization tasks and found that some LLMs
struggle with memorizing Chinese commonsense, which affects their reasoning
ability, while others show differences in reasoning despite similar
memorization performance. We also evaluated the LLMs' memorization-independent
reasoning abilities and analyzed the typical errors. Our study precisely
identifies the LLMs' strengths and weaknesses, providing clear directions for
optimization, and can also serve as a reference for studies in other fields.
We will release CHARM at https://github.com/opendatalab/CHARM.
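For context on the prompt strategies mentioned in the abstract, a hedged
sketch of how a Chain-of-Thought prompt might be applied to a multiple-choice
commonsense item is shown below; the instruction wording and the
answer-extraction step are assumptions and not CHARM's actual evaluation
protocol.

import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for any of the evaluated LLMs

def answer_with_cot(question: str, choices: list[str]) -> int:
    # Present lettered options and ask for step-by-step reasoning first.
    options = "\n".join(f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    prompt = (
        f"{question}\n{options}\n"
        "Let's think step by step, then finish with a line of the form "
        "'Answer: <letter>'."
    )
    reply = call_llm(prompt)
    match = re.search(r"Answer:\s*([A-Z])", reply)
    if not match:
        return -1  # unparseable reply, counted as an error
    idx = ord(match.group(1)) - ord("A")
    return idx if idx < len(choices) else -1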
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce the SynCPKL Pipeline, which
leverages Large Language Models to generate high-quality synthetic datasets
for training commonsense persona knowledge linkers. To demonstrate the
efficacy of our approach, we present SynCPKL, a new dataset specifically
designed for this task. Our experiments validate the effectiveness of SynCPKL
for training commonsense persona knowledge linkers. Additionally, our
top-performing model, Derberta-SynCPKL, secured first place in the CPKL
challenge with a 16% improvement in F1 score. We released both SynCPKL and
Derberta-SynCPKL at https://github.com/irislin1006/CPKL.
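The core idea, as the abstract describes it, is to have an LLM label whether a
commonsense persona fact is relevant to a dialogue and then train a smaller
linker on those labels. A minimal sketch of one such labelling step is given
below; the prompt wording and the call_llm helper are assumptions.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM client

def label_example(dialogue: str, persona_fact: str) -> dict:
    prompt = (
        "Dialogue:\n" + dialogue + "\n\n"
        f"Persona/commonsense fact: {persona_fact}\n"
        "Is this fact relevant to the dialogue? Reply with exactly 'yes' or 'no'."
    )
    relevant = call_llm(prompt).strip().lower().startswith("yes")
    # Accumulating many (dialogue, fact, label) triples yields a synthetic
    # dataset on which a smaller classifier can then be fine-tuned.
    return {"dialogue": dialogue, "fact": persona_fact, "label": int(relevant)}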
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcomes through analyses facilitated by the LLMs and through human
annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
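A COPA item consists of a premise, two alternatives, and a cause/effect
question. A minimal sketch of prompting an LLM to author such an item is shown
below; the JSON schema, prompt wording, and call_llm helper are illustrative
assumptions rather than the authors' exact setup.

import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM client

def author_copa_item(relation: str = "cause") -> dict:
    prompt = (
        "Write one new item in the style of the Choice of Plausible Alternatives "
        f"(COPA). The question asks for the most plausible {relation}. Return JSON "
        "with keys 'premise', 'choice1', 'choice2', and 'label' (0 or 1)."
    )
    item = json.loads(call_llm(prompt))
    assert item["label"] in (0, 1)  # authored items still need human vetting
    return item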
http://arxiv.org/abs/2410.23844v1
Commonsense Knowledge Editing Based on Free-Text in LLMs
Knowledge editing technology is crucial for maintaining the accuracy and
timeliness of large language models (LLMs). However, the setting of this task
overlooks a significant portion of real-world commonsense knowledge expressed
as free text, which is characterized by a broad knowledge scope, long content,
and non-instantiation. The editing objects of previous methods (e.g., MEMIT)
were a single token or entity, which is not suitable for commonsense knowledge
in free-text form. To address these challenges, we conduct experiments from two
perspectives: knowledge localization and knowledge editing. First, we introduce
the Knowledge Localization for Free-Text (KLFT) method, revealing the
challenges posed by the distribution of commonsense knowledge across the MLP
and Attention layers, as well as its decentralized distribution. Next, we
propose a Dynamics-aware Editing Method (DEM), which uses a Dynamics-aware
Module to locate the parameter positions corresponding to commonsense knowledge
and a Knowledge Editing Module to update that knowledge. DEM fully exploits the
potential of the MLP and Attention layers and successfully edits commonsense
knowledge expressed as free text. The experimental results indicate that DEM
achieves excellent editing performance.
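
To make the MLP-versus-Attention localization idea concrete, here is a deliberately simplified gradient-norm proxy: back-propagate the language-modeling loss of one free-text commonsense statement and compare how much gradient mass lands on MLP versus attention parameters. This is only a rough stand-in for localization analysis, not the paper's KLFT or DEM; the model and statement are placeholders.

```python
# Much-simplified localization sketch (not KLFT/DEM): compare summed gradient
# norms of MLP vs. attention parameters for one free-text commonsense statement,
# as a rough proxy for where that knowledge is "stored".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "People usually open an umbrella when it starts to rain heavily."
inputs = tokenizer(text, return_tensors="pt")

# Language-modeling loss of the statement under the current parameters.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()

grad_norms = {"mlp": 0.0, "attn": 0.0}
for name, param in model.named_parameters():
    if param.grad is None:
        continue
    if ".mlp." in name:
        grad_norms["mlp"] += param.grad.norm().item()
    elif ".attn." in name:
        grad_norms["attn"] += param.grad.norm().item()

print(grad_norms)  # larger values suggest stronger involvement of that block type
```
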
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcomes through analyses facilitated by the LLMs themselves and by human
annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
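
A small sketch of the item-authoring setup: prompt an LLM for one COPA-style item as JSON and run a minimal structural check before handing it to annotators. The prompt wording, JSON schema, and model name are assumptions for illustration, not the paper's exact protocol; it also assumes the model returns bare JSON.

```python
# Illustrative sketch of LLM item authoring in the COPA style (not the paper's
# exact prompt). Assumes the openai package, an OPENAI_API_KEY, and that the
# model returns bare JSON without code fences.
import json
from openai import OpenAI

client = OpenAI()

prompt = (
    "Write one new commonsense question in the style of COPA as JSON with keys "
    "'premise', 'question' ('cause' or 'effect'), 'choice1', 'choice2', and "
    "'label' (1 or 2 for the more plausible alternative). Return only the JSON."
)
reply = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
item = json.loads(reply.choices[0].message.content)

# Minimal structural check before the item would go to human annotation.
assert item["question"] in {"cause", "effect"} and item["label"] in {1, 2}
print(item["premise"], "|", item["choice1"], "vs", item["choice2"])
```
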
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce the SynCPKL Pipeline, which
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge with a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
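
For context, a compact sketch of what training a commonsense persona knowledge linker on SynCPKL-style pairs might look like: a binary classifier over (dialogue, persona-knowledge) inputs. The toy examples, backbone choice, and hyperparameters are placeholders and do not reproduce the released Derberta-SynCPKL model.

```python
# Minimal linker-training sketch on SynCPKL-style pairs (not the released model).
# Requires torch, transformers, and sentencepiece (for the DeBERTa-v3 tokenizer).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/deberta-v3-small"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny stand-in for an LLM-generated synthetic dataset: 1 = knowledge is relevant.
pairs = [
    ("I jog by the river every morning before work.", "PersonX enjoys staying fit.", 1),
    ("I jog by the river every morning before work.", "PersonX dislikes the outdoors.", 0),
]
encodings = tokenizer(
    [d for d, _, _ in pairs], [k for _, k, _ in pairs],
    truncation=True, padding=True, return_tensors="pt",
)
labels = torch.tensor([y for _, _, y in pairs])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few toy epochs on the toy pairs
    optimizer.zero_grad()
    out = model(**encodings, labels=labels)
    out.loss.backward()
    optimizer.step()
    print(f"loss={out.loss.item():.4f}")
```
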
http://arxiv.org/abs/2410.23844v1
Commonsense Knowledge Editing Based on Free-Text in LLMs
Knowledge editing technology is crucial for maintaining the accuracy and
timeliness of large language models (LLMs) . However, the setting of this task
overlooks a significant portion of commonsense knowledge based on free-text in
the real world, characterized by broad knowledge scope, long content and non
instantiation. The editing objects of previous methods (e.g., MEMIT) were
single token or entity, which were not suitable for commonsense knowledge in
free-text form. To address the aforementioned challenges, we conducted
experiments from two perspectives: knowledge localization and knowledge
editing. Firstly, we introduced Knowledge Localization for Free-Text(KLFT)
method, revealing the challenges associated with the distribution of
commonsense knowledge in MLP and Attention layers, as well as in decentralized
distribution. Next, we propose a Dynamics-aware Editing Method(DEM), which
utilizes a Dynamics-aware Module to locate the parameter positions
corresponding to commonsense knowledge, and uses Knowledge Editing Module to
update knowledge. The DEM method fully explores the potential of the MLP and
Attention layers, and successfully edits commonsense knowledge based on
free-text. The experimental results indicate that the DEM can achieve excellent
editing performance.
http://arxiv.org/abs/2403.14112v2
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2410.23844v1
Commonsense Knowledge Editing Based on Free-Text in LLMs
Knowledge editing technology is crucial for maintaining the accuracy and
timeliness of large language models (LLMs) . However, the setting of this task
overlooks a significant portion of commonsense knowledge based on free-text in
the real world, characterized by broad knowledge scope, long content and non
instantiation. The editing objects of previous methods (e.g., MEMIT) were
single token or entity, which were not suitable for commonsense knowledge in
free-text form. To address the aforementioned challenges, we conducted
experiments from two perspectives: knowledge localization and knowledge
editing. Firstly, we introduced Knowledge Localization for Free-Text(KLFT)
method, revealing the challenges associated with the distribution of
commonsense knowledge in MLP and Attention layers, as well as in decentralized
distribution. Next, we propose a Dynamics-aware Editing Method(DEM), which
utilizes a Dynamics-aware Module to locate the parameter positions
corresponding to commonsense knowledge, and uses Knowledge Editing Module to
update knowledge. The DEM method fully explores the potential of the MLP and
Attention layers, and successfully edits commonsense knowledge based on
free-text. The experimental results indicate that the DEM can achieve excellent
editing performance.
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v3
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2410.23844v1
Commonsense Knowledge Editing Based on Free-Text in LLMs
Knowledge editing technology is crucial for maintaining the accuracy and
timeliness of large language models (LLMs) . However, the setting of this task
overlooks a significant portion of commonsense knowledge based on free-text in
the real world, characterized by broad knowledge scope, long content and non
instantiation. The editing objects of previous methods (e.g., MEMIT) were
single token or entity, which were not suitable for commonsense knowledge in
free-text form. To address the aforementioned challenges, we conducted
experiments from two perspectives: knowledge localization and knowledge
editing. Firstly, we introduced Knowledge Localization for Free-Text(KLFT)
method, revealing the challenges associated with the distribution of
commonsense knowledge in MLP and Attention layers, as well as in decentralized
distribution. Next, we propose a Dynamics-aware Editing Method(DEM), which
utilizes a Dynamics-aware Module to locate the parameter positions
corresponding to commonsense knowledge, and uses Knowledge Editing Module to
update knowledge. The DEM method fully explores the potential of the MLP and
Attention layers, and successfully edits commonsense knowledge based on
free-text. The experimental results indicate that the DEM can achieve excellent
editing performance.
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2410.23844v1
Commonsense Knowledge Editing Based on Free-Text in LLMs
Knowledge editing technology is crucial for maintaining the accuracy and
timeliness of large language models (LLMs) . However, the setting of this task
overlooks a significant portion of commonsense knowledge based on free-text in
the real world, characterized by broad knowledge scope, long content and non
instantiation. The editing objects of previous methods (e.g., MEMIT) were
single token or entity, which were not suitable for commonsense knowledge in
free-text form. To address the aforementioned challenges, we conducted
experiments from two perspectives: knowledge localization and knowledge
editing. Firstly, we introduced Knowledge Localization for Free-Text(KLFT)
method, revealing the challenges associated with the distribution of
commonsense knowledge in MLP and Attention layers, as well as in decentralized
distribution. Next, we propose a Dynamics-aware Editing Method(DEM), which
utilizes a Dynamics-aware Module to locate the parameter positions
corresponding to commonsense knowledge, and uses Knowledge Editing Module to
update knowledge. The DEM method fully explores the potential of the MLP and
Attention layers, and successfully edits commonsense knowledge based on
free-text. The experimental results indicate that the DEM can achieve excellent
editing performance.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2403.14112v3
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensive and in-depth
evaluation of the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing a clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
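Evaluating LLMs under different prompt strategies, as described above, amounts to wrapping each item in a strategy-specific template and comparing accuracies. A minimal sketch follows; the two toy items and the answer_with stub are placeholders and are unrelated to the actual CHARM data or models.

# Sketch of comparing prompt strategies on multiple-choice commonsense items.

ITEMS = [
    {"question": "Which festival marks the lunar new year in China?",
     "choices": ["Spring Festival", "Mid-Autumn Festival"], "answer": "Spring Festival"},
    {"question": "Which direction does the sun rise from?",
     "choices": ["East", "West"], "answer": "East"},
]

def build_prompt(item: dict, strategy: str) -> str:
    """Wrap one item in either a direct or a Chain-of-Thought-style template."""
    options = " / ".join(item["choices"])
    prompt = f"Question: {item['question']}\nOptions: {options}\n"
    if strategy == "cot":
        prompt += "Think step by step, then state the final option.\n"
    return prompt + "Answer:"

def answer_with(prompt: str) -> str:
    """Placeholder model: always picks the first option listed in the prompt."""
    return prompt.split("Options: ")[1].split(" / ")[0]

def accuracy(strategy: str) -> float:
    correct = sum(answer_with(build_prompt(it, strategy)) == it["answer"] for it in ITEMS)
    return correct / len(ITEMS)

for strategy in ("direct", "cot"):
    print(strategy, accuracy(strategy))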
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
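Treating an LLM as an item author, as in the paper above, means asking it to emit a complete COPA-style record and then screening the output before human review. The sketch below illustrates that flow with a stubbed author_item call and simple structural checks; the field names and the hand-written example are assumptions made for this sketch, not the COPA release format or the authors' prompts.

# Sketch of LLM item authoring followed by automatic structural screening.

def author_item() -> dict:
    """Placeholder for an LLM authoring call; returns one hand-written example."""
    return {
        "premise": "The man lost his balance on the icy sidewalk.",
        "ask_for": "effect",             # COPA-style items ask for a cause or an effect
        "choice1": "He fell to the ground.",
        "choice2": "He bought a new pair of shoes.",
        "label": 1,                      # 1 -> choice1 is the more plausible alternative
    }

def structurally_valid(item: dict) -> bool:
    """Reject items with missing fields, duplicate choices, or an unknown label."""
    return (
        item.get("ask_for") in {"cause", "effect"}
        and item.get("label") in {1, 2}
        and all(isinstance(item.get(k), str) and item[k] for k in ("premise", "choice1", "choice2"))
        and item["choice1"] != item["choice2"]
    )

item = author_item()
print(item)
print("valid:", structurally_valid(item))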
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge with a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
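The pipeline idea above, using an LLM to judge which persona commonsense facts a dialogue supports and saving the judgments as training data for a linker, can be sketched as follows. The llm_judge stub (crude word overlap here), the example dialogue, and the JSONL schema are assumptions made for illustration, not the released SynCPKL format.

# Sketch of turning LLM judgments into training data for a persona-knowledge linker.
import json

def llm_judge(dialogue: str, persona_fact: str) -> bool:
    """Placeholder for an LLM relevance judgment: crude content-word overlap."""
    stop = {"the", "a", "an", "speaker", "my", "i", "in", "with", "as"}
    d_words = set(dialogue.lower().replace(".", "").split()) - stop
    f_words = set(persona_fact.lower().split()) - stop
    return bool(d_words & f_words)

dialogue = "I spent the whole weekend hiking in the mountains with my dog."
candidates = [
    "speaker likes hiking",
    "speaker owns a dog",
    "speaker works as an accountant",
]

# Write one labeled (dialogue, persona_fact) pair per line, ready for training.
with open("synthetic_linker_data.jsonl", "w", encoding="utf-8") as f:
    for fact in candidates:
        record = {"dialogue": dialogue, "persona_fact": fact,
                  "label": int(llm_judge(dialogue, fact))}
        f.write(json.dumps(record) + "\n")
        print(record)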
http://arxiv.org/abs/2410.23844v1
Commonsense Knowledge Editing Based on Free-Text in LLMs
Knowledge editing technology is crucial for maintaining the accuracy and
timeliness of large language models (LLMs). However, the setting of this task
overlooks a significant portion of real-world commonsense knowledge expressed
as free text, which is characterized by a broad knowledge scope, long content,
and non-instantiation. The editing objects of previous methods (e.g., MEMIT)
were single tokens or entities, which are not suitable for commonsense
knowledge in free-text form. To address these challenges, we conducted
experiments from two perspectives: knowledge localization and knowledge
editing. First, we introduce the Knowledge Localization for Free-Text (KLFT)
method, revealing the challenges posed by the distribution of commonsense
knowledge across MLP and attention layers, as well as its decentralized
distribution. Next, we propose a Dynamics-aware Editing Method (DEM), which
uses a Dynamics-aware Module to locate the parameter positions corresponding
to commonsense knowledge and a Knowledge Editing Module to update that
knowledge. DEM fully exploits the potential of the MLP and attention layers
and successfully edits commonsense knowledge expressed as free text. The
experimental results indicate that DEM achieves excellent editing performance.
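The localization step described above asks which parts of the network carry a given piece of commonsense knowledge. A toy Python/PyTorch illustration of that kind of bookkeeping follows: it ranks the layers of a tiny randomly initialized network by the gradient norm their parameters receive for one input. The toy network has only linear layers, so it only demonstrates the attribution mechanics, not the KLFT or DEM methods themselves.

# Toy illustration of per-layer knowledge localization: rank layers by the
# gradient norm their parameters receive for one "statement". The model is a
# tiny randomly initialized network, not an LLM, so only the bookkeeping matters.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 16),
)

# Stand-ins for an encoded statement and its target representation.
statement = torch.randn(1, 16)
target = torch.randn(1, 16)

loss = nn.functional.mse_loss(model(statement), target)
loss.backward()

scores = {}
for name, param in model.named_parameters():
    layer = name.split(".")[0]                      # group weights/biases per layer
    scores[layer] = scores.get(layer, 0.0) + param.grad.norm().item()

for layer, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"layer {layer}: gradient norm {score:.4f}")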
http://arxiv.org/abs/2403.14112v3
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2403.14112v3
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2403.14112v3
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2410.23844v1
Commonsense Knowledge Editing Based on Free-Text in LLMs
Knowledge editing technology is crucial for maintaining the accuracy and
timeliness of large language models (LLMs) . However, the setting of this task
overlooks a significant portion of commonsense knowledge based on free-text in
the real world, characterized by broad knowledge scope, long content and non
instantiation. The editing objects of previous methods (e.g., MEMIT) were
single token or entity, which were not suitable for commonsense knowledge in
free-text form. To address the aforementioned challenges, we conducted
experiments from two perspectives: knowledge localization and knowledge
editing. Firstly, we introduced Knowledge Localization for Free-Text(KLFT)
method, revealing the challenges associated with the distribution of
commonsense knowledge in MLP and Attention layers, as well as in decentralized
distribution. Next, we propose a Dynamics-aware Editing Method(DEM), which
utilizes a Dynamics-aware Module to locate the parameter positions
corresponding to commonsense knowledge, and uses Knowledge Editing Module to
update knowledge. The DEM method fully explores the potential of the MLP and
Attention layers, and successfully edits commonsense knowledge based on
free-text. The experimental results indicate that the DEM can achieve excellent
editing performance.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2410.23844v1
Commonsense Knowledge Editing Based on Free-Text in LLMs
Knowledge editing technology is crucial for maintaining the accuracy and
timeliness of large language models (LLMs) . However, the setting of this task
overlooks a significant portion of commonsense knowledge based on free-text in
the real world, characterized by broad knowledge scope, long content and non
instantiation. The editing objects of previous methods (e.g., MEMIT) were
single token or entity, which were not suitable for commonsense knowledge in
free-text form. To address the aforementioned challenges, we conducted
experiments from two perspectives: knowledge localization and knowledge
editing. Firstly, we introduced Knowledge Localization for Free-Text(KLFT)
method, revealing the challenges associated with the distribution of
commonsense knowledge in MLP and Attention layers, as well as in decentralized
distribution. Next, we propose a Dynamics-aware Editing Method(DEM), which
utilizes a Dynamics-aware Module to locate the parameter positions
corresponding to commonsense knowledge, and uses Knowledge Editing Module to
update knowledge. The DEM method fully explores the potential of the MLP and
Attention layers, and successfully edits commonsense knowledge based on
free-text. The experimental results indicate that the DEM can achieve excellent
editing performance.
http://arxiv.org/abs/2403.14112v3
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2403.14112v3
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2410.23844v1
Commonsense Knowledge Editing Based on Free-Text in LLMs
Knowledge editing technology is crucial for maintaining the accuracy and
timeliness of large language models (LLMs) . However, the setting of this task
overlooks a significant portion of commonsense knowledge based on free-text in
the real world, characterized by broad knowledge scope, long content and non
instantiation. The editing objects of previous methods (e.g., MEMIT) were
single token or entity, which were not suitable for commonsense knowledge in
free-text form. To address the aforementioned challenges, we conducted
experiments from two perspectives: knowledge localization and knowledge
editing. Firstly, we introduced Knowledge Localization for Free-Text(KLFT)
method, revealing the challenges associated with the distribution of
commonsense knowledge in MLP and Attention layers, as well as in decentralized
distribution. Next, we propose a Dynamics-aware Editing Method(DEM), which
utilizes a Dynamics-aware Module to locate the parameter positions
corresponding to commonsense knowledge, and uses Knowledge Editing Module to
update knowledge. The DEM method fully explores the potential of the MLP and
Attention layers, and successfully edits commonsense knowledge based on
free-text. The experimental results indicate that the DEM can achieve excellent
editing performance.
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate a good quality of question answering (QA) dataset
that incorporates knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2410.23844v1
Commonsense Knowledge Editing Based on Free-Text in LLMs
Knowledge editing technology is crucial for maintaining the accuracy and
timeliness of large language models (LLMs) . However, the setting of this task
overlooks a significant portion of commonsense knowledge based on free-text in
the real world, characterized by broad knowledge scope, long content and non
instantiation. The editing objects of previous methods (e.g., MEMIT) were
single token or entity, which were not suitable for commonsense knowledge in
free-text form. To address the aforementioned challenges, we conducted
experiments from two perspectives: knowledge localization and knowledge
editing. Firstly, we introduced Knowledge Localization for Free-Text(KLFT)
method, revealing the challenges associated with the distribution of
commonsense knowledge in MLP and Attention layers, as well as in decentralized
distribution. Next, we propose a Dynamics-aware Editing Method(DEM), which
utilizes a Dynamics-aware Module to locate the parameter positions
corresponding to commonsense knowledge, and uses Knowledge Editing Module to
update knowledge. The DEM method fully explores the potential of the MLP and
Attention layers, and successfully edits commonsense knowledge based on
free-text. The experimental results indicate that the DEM can achieve excellent
editing performance.
http://arxiv.org/abs/2403.14112v3
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge by a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
http://arxiv.org/abs/2403.14112v3
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2410.14897v1
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
LLMs can now perform a variety of complex writing tasks. They also excel in
answering questions pertaining to natural language inference and commonsense
reasoning. Composing these questions is itself a skilled writing task, so in
this paper we consider LLMs as authors of commonsense assessment items. We
prompt LLMs to generate items in the style of a prominent benchmark for
commonsense reasoning, the Choice of Plausible Alternatives (COPA). We examine
the outcome according to analyses facilitated by the LLMs and human annotation.
We find that LLMs that succeed in answering the original COPA benchmark are
also more successful in authoring their own items.
http://arxiv.org/abs/2403.14112v3
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
We introduce CHARM, the first benchmark for comprehensively and in-depth
evaluating the commonsense reasoning ability of large language models (LLMs) in
Chinese, which covers both globally known and Chinese-specific commonsense. We
evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5
representative prompt strategies for improving LLMs' reasoning ability, such as
Chain-of-Thought. Our findings indicate that the LLM's language orientation and
the task's domain influence the effectiveness of the prompt strategy, which
enriches previous research findings. We built closely-interconnected reasoning
and memorization tasks, and found that some LLMs struggle with memorizing
Chinese commonsense, affecting their reasoning ability, while others show
differences in reasoning despite similar memorization performance. We also
evaluated the LLMs' memorization-independent reasoning abilities and analyzed
the typical errors. Our study precisely identified the LLMs' strengths and
weaknesses, providing the clear direction for optimization. It can also serve
as a reference for studies in other fields. We will release CHARM at
https://github.com/opendatalab/CHARM .
http://arxiv.org/abs/2407.15281v1
SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking
Understanding rich dialogues often requires NLP systems to access relevant
commonsense persona knowledge, but retrieving this knowledge is challenging due
to complex contexts and the implicit nature of commonsense. This paper presents
our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge,
addressing the critical need for integrating persona and commonsense knowledge
in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that
leverages Large Language Models to generate high-quality synthetic datasets for
training commonsense persona knowledge linkers. To demonstrate the efficacy of
our approach, we present SynCPKL, a new dataset specifically designed for this
task. Our experiments validate the effectiveness of SynCPKL for training
commonsense persona knowledge linkers. Additionally, our top-performing model,
Derberta-SynCPKL, secured first place in the CPKL challenge with a 16%
improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at
https://github.com/irislin1006/CPKL.
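A loose sketch of the kind of LLM-driven synthetic labeling pipeline the abstract describes: for each (dialogue, persona-knowledge) pair, an LLM judges relevance, and the result becomes a training example for a linker. The `call_llm` stub and the data are placeholders, not the SynCPKL implementation.

```python
import json

# Placeholder judge; the actual pipeline would prompt a large model here.
def call_llm(prompt: str) -> str:
    return "yes"

def make_example(dialogue: str, persona_fact: str) -> dict:
    prompt = (f"Dialogue:\n{dialogue}\n\n"
              f"Persona commonsense fact: {persona_fact}\n"
              "Is this fact relevant to the dialogue? Answer yes or no.")
    label = 1 if call_llm(prompt).strip().lower().startswith("yes") else 0
    return {"dialogue": dialogue, "persona_fact": persona_fact, "label": label}

pairs = [("A: I ran a marathon last weekend!\nB: Wow, congratulations!",
          "People who run marathons usually train for months.")]
print(json.dumps([make_example(d, f) for d, f in pairs], indent=2))
```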
http://arxiv.org/abs/2304.11164v1
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
Language models have become very popular recently and many claims have been
made about their abilities, including for commonsense reasoning. Given the
increasingly better results of current language models on previous static
benchmarks for commonsense reasoning, we explore an alternative dialectical
evaluation. The goal of this kind of evaluation is not to obtain an aggregate
performance value but to find failures and map the boundaries of the system.
Dialoguing with the system gives the opportunity to check for consistency and
get more reassurance of these boundaries beyond anecdotal evidence. In this
paper we conduct some qualitative investigations of this kind of evaluation for
the particular case of spatial reasoning (which is a fundamental aspect of
commonsense reasoning). We conclude with some suggestions for future work both
to improve the capabilities of language models and to systematise this kind of
dialectical evaluation.
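A small sketch of what such a dialectical probe might look like for spatial reasoning: instead of scoring a one-shot answer, the evaluator keeps dialoguing and asks follow-ups that are entailed by earlier context, flagging contradictions. The `call_llm` stub is a placeholder for a real model.

```python
# Placeholder model; a real evaluation would send the running dialogue to an LLM.
def call_llm(dialogue: list[str]) -> str:
    return "yes"

context = ["The cup is to the left of the plate."]
first = call_llm(context + ["Is the cup to the left of the plate?"])
second = call_llm(context + ["Is the plate to the right of the cup?"])  # entailed by the context

if first.strip().lower() == second.strip().lower() == "yes":
    print("consistent so far; ask a harder follow-up")
else:
    print("boundary found: the answers contradict each other")
```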
http://arxiv.org/abs/2402.17302v3
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Large Language Models (LLMs) are increasingly being used to generate
synthetic data for training and evaluating models. However, it is unclear
whether they can generate good-quality question answering (QA) datasets
that incorporate the knowledge and cultural nuance embedded in a language,
especially for low-resource languages. In this study, we investigate the
effectiveness of using LLMs in generating culturally relevant commonsense QA
datasets for Indonesian and Sundanese languages. To do so, we create datasets
for these languages using various methods involving both LLMs and human
annotators, resulting in ~4.5K questions per language (~9K in total), making
our dataset the largest of its kind. Our experiments show that automatic data
adaptation from an existing English dataset is less effective for Sundanese.
Interestingly, using the direct generation method on the target language, GPT-4
Turbo can generate questions with adequate general knowledge in both languages,
albeit not as culturally 'deep' as those written by humans. We also observe a higher occurrence
of fluency errors in the Sundanese dataset, highlighting the discrepancy
between medium- and lower-resource languages.
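A minimal sketch of the "direct generation in the target language" setup described above: the model is prompted in Indonesian to write one culturally grounded multiple-choice question. The prompt wording and the `call_llm` stub are assumptions for illustration, not the paper's exact templates.

```python
import json

# Placeholder model call returning one Indonesian example so the sketch runs.
def call_llm(prompt: str) -> str:
    return json.dumps({
        # "What food is usually served at Lebaran (Eid)?" -- answer: ketupat.
        "question": "Makanan apa yang biasanya disajikan saat Lebaran?",
        "choices": ["Ketupat", "Sushi", "Pizza", "Taco"],
        "answer": "Ketupat",
    })

# Indonesian prompt: "Write one general-knowledge multiple-choice question related
# to Indonesian culture, in JSON with the keys question, choices, answer."
PROMPT_ID = ("Buatlah satu soal pilihan ganda pengetahuan umum yang berkaitan dengan "
             "budaya Indonesia, dalam format JSON dengan kunci question, choices, answer.")

item = json.loads(call_llm(PROMPT_ID))
assert item["answer"] in item["choices"]
print(item["question"], "->", item["answer"])
```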
http://arxiv.org/abs/2410.23844v1
Commonsense Knowledge Editing Based on Free-Text in LLMs
Knowledge editing technology is crucial for maintaining the accuracy and
timeliness of large language models (LLMs). However, the setting of this task
overlooks a significant portion of commonsense knowledge based on free-text in
the real world, characterized by broad knowledge scope, long content, and
non-instantiation. The editing objects of previous methods (e.g., MEMIT) were
single tokens or entities, which are not suitable for commonsense knowledge in
free-text form. To address the aforementioned challenges, we conducted
experiments from two perspectives: knowledge localization and knowledge
editing. First, we introduce the Knowledge Localization for Free-Text (KLFT)
method, revealing the challenges posed by the decentralized distribution of
commonsense knowledge across the MLP and Attention layers. Next, we propose a
Dynamics-aware Editing Method (DEM), which utilizes a Dynamics-aware Module to
locate the parameter positions corresponding to commonsense knowledge and uses
a Knowledge Editing Module to
update knowledge. The DEM method fully explores the potential of the MLP and
Attention layers, and successfully edits commonsense knowledge based on
free-text. The experimental results indicate that the DEM can achieve excellent
editing performance.
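For intuition only, here is a generic layer-ablation sketch of the knowledge-localization idea (not the paper's KLFT or DEM algorithms): ablate each layer's contribution, measure how much the probability of a knowledge-bearing continuation drops, and rank layers by that drop. The scorer below is a stub; a real implementation would hook a transformer's MLP/Attention outputs.

```python
# Stubbed scorer: pretend layers 5 and 6 carry most of this piece of knowledge.
def prob_of_continuation(prompt: str, continuation: str, ablate_layer=None) -> float:
    base = 0.8
    if ablate_layer is None:
        return base
    return base * (0.3 if ablate_layer in (5, 6) else 0.95)

prompt, continuation, n_layers = "Lightning is usually followed by", " thunder", 12
baseline = prob_of_continuation(prompt, continuation)
drops = {layer: baseline - prob_of_continuation(prompt, continuation, ablate_layer=layer)
         for layer in range(n_layers)}
print(sorted(drops, key=drops.get, reverse=True)[:3])  # layers most responsible for the fact
```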
shintaro-ozaki/cs_bot
About
An automated system updates the survey information in this README every day.