I've tried to reproduce the results. Everything looks normal except for the summarization questions, where I get 20+ compared to the 40+ reported in the paper.
I found that many of these questions contain patterns like "what happens between xx:xx-xx:xx", which existing models find nearly impossible to answer. Did you apply some kind of pre-processing, such as clipping the video to the referenced time range?
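
To make the question concrete, this is roughly the kind of pre-processing I have in mind. It is only a sketch of my guess, not something taken from the paper or the released code: the regex and the helper names (`TIME_RANGE`, `to_seconds`, `extract_clip_range`) are hypothetical, and I am assuming timestamps are written as `mm:ss` or `hh:mm:ss`.

```python
import re

# Hypothetical pattern (my assumption): matches ranges such as
# "01:15-02:40" or "1:02:03 - 1:05:00" inside a question string.
TIME_RANGE = re.compile(
    r"(\d{1,2}(?::\d{2}){1,2})\s*[-–]\s*(\d{1,2}(?::\d{2}){1,2})"
)


def to_seconds(stamp: str) -> int:
    """Convert 'mm:ss' or 'hh:mm:ss' to a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def extract_clip_range(question: str):
    """Return (start_s, end_s) if the question names a time range, else None."""
    match = TIME_RANGE.search(question)
    if match is None:
        return None
    start, end = (to_seconds(group) for group in match.groups())
    # Ignore malformed ranges where the end is not after the start.
    return (start, end) if start < end else None
```

With something like this, the returned `(start, end)` pair could be used to sample frames only from that segment before passing the video to the model. If the reported numbers were obtained that way, it would explain the gap I am seeing.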