
gorovuha/CleanComedy

CleanComedy

Humour generation is a challenging task in natural language processing because resources are limited and existing datasets are of uneven quality. Available humour resources often suffer from toxicity and duplication, which limits their effectiveness for training robust models. In this paper, we present CleanComedy, a specialised, partially annotated corpus of jokes in English and Russian. The dataset is a filtered collection of existing sources, from which toxic jokes and duplicates are removed with various algorithmic filters. The final quality of the dataset is validated by human assessment. We also present subjective human humour score annotations for 1,000 Russian and 1,000 English jokes, providing a detailed, ethical, and comprehensive dataset for humour detection and generation tasks.

CleanComedy English

Ethically filtered jokes with a score on a 2-point scale: 44,481 instances

CleanComedy English Gold

Ethically filtered jokes with a human humour score on a 5-point scale: 1,000 instances

CleanComedy Russian

Ethically filtered jokes with a score on a 2-point scale: 40,926 instances

CleanComedy Russian Gold

Ethically filtered jokes with a human humour score on a 5-point scale: 1,000 instances

Source

We also provide the filtering pipeline as Jupyter notebooks in both the English and Russian folders.
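The notebooks are the reference for the actual filters. As a rough illustration of the kind of step described above (removing duplicates and unwanted jokes), here is a minimal sketch in Python. It is not the repository's pipeline: the function names `normalize` and `filter_jokes`, the normalisation rules, and the word-level `blocklist` are all assumptions made for this example, and a simple blocklist is only a stand-in for a real toxicity filter.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    text = unicodedata.normalize("NFKC", text).lower()
    # \w is Unicode-aware for str patterns, so Cyrillic letters are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def filter_jokes(jokes, blocklist=frozenset()):
    """Keep the first copy of each joke; drop jokes containing blocked words.

    `blocklist` is a hypothetical set of lowercase words, used here only as
    a placeholder for a proper toxicity classifier.
    """
    blocked = set(blocklist)
    seen = set()
    kept = []
    for joke in jokes:
        key = normalize(joke)
        if not key or key in seen:
            continue  # empty after normalisation, or a duplicate
        if blocked & set(key.split()):
            continue  # contains a blocked word
        seen.add(key)
        kept.append(joke)
    return kept
```

Calling `filter_jokes(jokes, blocklist={"rude"})` on a list of strings returns the surviving jokes in their original order and original spelling; only the comparison uses the normalised form.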
