Details on evaluating `summarization` questions

I tried to reproduce the results. All of them look normal except for the `summarization` questions, where I get 20+ compared to 40+ in the paper.

I found that many of these questions contain patterns like "what happens between xx:xx-xx:xx", which are nearly impossible for existing models to answer from the full input. Did you apply some kind of pre-processing, such as clipping the input to the referenced time range?
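
To make the question concrete, here is a minimal sketch of the kind of pre-processing I have in mind. It is only my guess at what such a step might look like, not something taken from the paper or the repo: the helper name `extract_time_range` is hypothetical, and I am assuming the timestamps are in `MM:SS` format.

```python
import re

# Assumed pattern: "MM:SS-MM:SS", optionally with spaces around the dash.
TIME_RANGE = re.compile(r"(\d{1,2}):(\d{2})\s*-\s*(\d{1,2}):(\d{2})")


def extract_time_range(question):
    """Return (start_sec, end_sec) if the question names a time range, else None.

    Hypothetical helper for illustration only.
    """
    match = TIME_RANGE.search(question)
    if match is None:
        return None
    m1, s1, m2, s2 = (int(g) for g in match.groups())
    start, end = m1 * 60 + s1, m2 * 60 + s2
    # Treat empty or reversed ranges as "no usable range".
    if end <= start:
        return None
    return start, end


print(extract_time_range("What happens between 01:30-02:45?"))
print(extract_time_range("Summarize the whole video."))
```

The returned seconds could then be used to trim the input before it is passed to the model. If the evaluation did something like this, that might explain the gap I am seeing.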