Humour generation is a challenging task in natural language processing due to limited resources and the low quality of existing datasets. Available humour language resources often suffer from toxicity and duplication, limiting their effectiveness for training robust models. In this paper, we present CleanComedy, a specialised, partially annotated corpus of jokes in English and Russian. The dataset is a filtered collection of existing sources, from which toxic jokes and duplicates are removed with various algorithmic filters. The final quality of the dataset is validated through human assessment. We also present subjective human humour score annotations for 1,000 Russian and 1,000 English jokes, providing a detailed, ethical, and comprehensive dataset for humour detection and generation tasks.
Ethically filtered jokes with a 2-scale score: 44,481 instances
Ethically filtered jokes with a human humour 5-scale score: 1,000 instances
Ethically filtered jokes with a 2-scale score: 40,926 instances
Ethically filtered jokes with a human humour 5-scale score: 1,000 instances
We also provide the filtering pipeline as Jupyter notebooks in both the English and Russian folders.
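To give a rough idea of what toxicity and duplicate filtering can look like, here is a minimal Python sketch. It is not the pipeline from the notebooks: the `BLOCKLIST` contents, the `filter_jokes` function, the Jaccard similarity measure, and the `0.8` threshold are all illustrative assumptions, and a real pipeline would likely rely on a trained toxicity classifier rather than a word list.

```python
import re

# Hypothetical placeholder blocklist (an assumption for illustration only);
# a real filter would likely use a trained toxicity classifier instead.
BLOCKLIST = {"badword"}


def normalise(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def jaccard(a, b):
    """Jaccard similarity between two sets of tokens."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_jokes(jokes, threshold=0.8):
    """Drop jokes containing blocklisted words and near-duplicates.

    A joke counts as a near-duplicate when its token set has a Jaccard
    similarity of at least `threshold` with an already kept joke.
    The threshold value is an assumption, not a tuned setting.
    """
    kept = []
    kept_tokens = []
    for joke in jokes:
        tokens = set(normalise(joke).split())
        if not tokens:
            continue  # skip empty entries
        if tokens & BLOCKLIST:
            continue  # toxic according to the placeholder blocklist
        if any(jaccard(tokens, seen) >= threshold for seen in kept_tokens):
            continue  # near-duplicate of a joke we already kept
        kept.append(joke)
        kept_tokens.append(tokens)
    return kept


if __name__ == "__main__":
    sample = [
        "Why did the chicken cross the road? To get to the other side!",
        "why did the chicken cross the road?? to get to the other side",
        "This one has a badword in it.",
        "I told my computer a joke. It did not laugh.",
    ]
    for joke in filter_jokes(sample):
        print(joke)
```

The pairwise comparison is quadratic in the number of jokes, which is fine for a small sample; for a corpus of tens of thousands of jokes, an approximate method such as MinHash would be a more practical choice.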