WikiDes: A Wikipedia-based dataset for generating short descriptions from paragraphs

As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many Natural Language Processing (NLP) tasks, such as information retrieval, knowledge base building, machine translation, text classification, and text summarization. In this paper, we introduce WikiDes,...


Bibliographic details
Main author:	Tạ, Hoàng Thắng
Format:	Research article
Language:	English
Published:	Elsevier 2023
Subjects:	Text summarization; Contrastive learning; Sentiment analysis; Metric fusion; Wikipedia; Wikidata
Online access:	https://scholar.dlu.edu.vn/handle/123456789/2005 https://doi.org/10.1016/j.inffus.2022.09.022
Holding library:	Thư viện Trường Đại học Đà Lạt

id	oai:scholar.dlu.edu.vn:123456789-2005
record_format	dspace
spelling	oai:scholar.dlu.edu.vn:123456789-2005; 2023-12-13T04:33:45Z; 2023-04-20T04:48:06Z; 2023-04-20T04:48:06Z; 2022-09; Research article; Article published in a SCOPUS-indexed journal, including book chapters; https://scholar.dlu.edu.vn/handle/123456789/2005; https://doi.org/10.1016/j.inffus.2022.09.022; en; Information Fusion; Elsevier
institution	Thư viện Trường Đại học Đà Lạt
collection	Digital Library
language	English
topic	Text summarization; Contrastive learning; Sentiment analysis; Metric fusion; Wikipedia; Wikidata
spellingShingle	Text summarization; Contrastive learning; Sentiment analysis; Metric fusion; Wikipedia; Wikidata; Tạ, Hoàng Thắng; WikiDes: A Wikipedia-based dataset for generating short descriptions from paragraphs
description	As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many Natural Language Processing (NLP) tasks, such as information retrieval, knowledge base building, machine translation, text classification, and text summarization. In this paper, we introduce WikiDes, a novel dataset for generating short descriptions of Wikipedia articles, framed as a text summarization problem. The dataset consists of over 80k English samples on 6987 topics. We set up a two-phase summarization method, description generation (Phase I) followed by candidate ranking (Phase II), as a strong approach that relies on transfer learning and contrastive learning. For description generation, T5 and BART outperform other small-scale pre-trained models. By applying contrastive learning with the diverse input from beam search, the metric fusion-based ranking models significantly outperform the direct description generation models, by up to 22 ROUGE, in both the topic-exclusive split and the topic-independent split. Furthermore, in human evaluation against the gold descriptions, the Phase II descriptions were chosen in over 45.33% of cases, compared to 23.66% for Phase I. In terms of sentiment analysis, the generated descriptions cannot effectively capture all sentiment polarities of the paragraphs, although they capture those of the gold descriptions better. The automatic generation of new descriptions reduces the human effort of creating them and enriches Wikidata-based knowledge graphs. Our paper has a practical impact on Wikipedia and Wikidata, since thousands of descriptions are missing. Finally, we expect WikiDes to be a useful dataset for related work on capturing salient information from short paragraphs. The curated dataset is publicly available at: https://github.com/declare-lab/WikiDes.
format	Research article
author	Tạ, Hoàng Thắng
title	WikiDes: A Wikipedia-based dataset for generating short descriptions from paragraphs
title_sort	wikides: a wikipedia-based dataset for generating short descriptions from paragraphs
publisher	Elsevier
publishDate	2023
url	https://scholar.dlu.edu.vn/handle/123456789/2005 https://doi.org/10.1016/j.inffus.2022.09.022
_version_	1785973013329477632
