-
Notifications
You must be signed in to change notification settings - Fork 65
Detect article-like pages entirely in two columns when generating markdown #1269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
elems.sort(key=generate_elem_prefer_left(left_col)) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just return here, making the special case self-contained and not needing the else and indentation.
left_col = [] | ||
right_col = [] | ||
for elem in elems: | ||
if elem.type == "Page-footer" or elem.type == "Page-header": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably also exclude titles. Not sure about section headers.
if x1 < 0.6 and x2 < 0.6: | ||
left_col.append(elem) | ||
elif x1 > 0.4 and x2 > 0.4: | ||
right_col.append(elem) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually, this gets fooled by a 5-column newspaper-ish layout. It puts the first three columns in left. It also calls something two-column if it's one column on the side, but that may end up OK after processing.
The .4 and .6 constants are kind-of magic. I'm sure there will be false positives and false negatives.
left_col = [] | ||
right_col = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing uses right_col. Might as well nuke it, unless you plan to use it. Perhaps if either array is empty, return false and sort by y.
right_col.append(elem) | ||
else: | ||
return False, [], [] | ||
return True, left_col, right_col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of returning a list of "lefts", we could use the existing tagging approach used in this file. Then we might not need a fancy closure-based sort key.
is_entirely_two_columns, left_col, _ = check_entirely_two_columns(elems) | ||
if is_entirely_two_columns: | ||
# if the entire page is two columns, sort the left column first | ||
elems.sort(key=generate_elem_prefer_left(left_col)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if we just used elem_left_top
here?
While this solution improves robustness, we have chosen an even more robust solution here: #1273. |
No description provided.