-
Notifications
You must be signed in to change notification settings - Fork 27
Use PDF version of input for output file #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi, thanks for the pull request. I noticed that CI wasn't running for your PR, so I made a quick fix to enable CI for PRs (#8). Unfortunately, your PR fails the
The problem is that Additionally, the PR fails the
In the There are also some stylistic issues. The code style of the PR does not match the existing code, causing the file to lose internal consistency.
|
I agree that choosing a higher compatibility level can create smaller PDFs, but I'm not sure if the best strategy is to carry over the input compat level to the output. If the script does not specify It would be a lot easier to add a command line option for the user to specify an optional compatibility version, than to extract the PDF version using shell tools. |
First, thank you for the detailed review/response. I just rebased the branch to include your changes to the Workflow, but it seems like you must first approve it, so that the workflows can run on this Pull Request. I will have a look at your technical review later. |
So, the used commands should now be POSIX compliant and Shellcheck is also happy. Regarding the change itself: I created a Regarding compatibility: If the input file is PDF version 1.6 or even higher (I do not know what kind of versions already exist), then I do not see a reason why shrinkpdf should change that PDF version. Having a newer PDF version is already a compatibility "issue" of the original file so leaving it at that version does not introduce a compatibility issue. |
shrinkpdf.sh
Outdated
get_pdf_version() { | ||
# $1 is the input file. The PDF version is contained in the | ||
# first 1024 bytes and will be extracted from the PDF file. | ||
version=$(cut -b "-1024" "$1" | sed -n '/%PDF-[0-9]\.[0-9]/{s/.*%PDF-/%PDF-/;p;q}' | cut -c "-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message with cut
(https://github.com/aklomp/shrinkpdf/actions/runs/3275070529/jobs/5392222355#step:5:7) comes from the first cut
invocation in this line, I suppose. The sed
command closes its incoming pipe after finding the string %PDF-x.y
. cut
obviously does not handle it gracefully…
I guess we could remove the ;q
from the sed
program. That way it will process the whole input. if multiple %PDF-x.y
are contained in multiple lines of the PDF within the first 1024 bytes, then all of them will be printed by sed
, but the cut
afterwards will make sure, that only the first version survives. I will push a new commit with the change. If you better like it this way, you can always remove that commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole expression looks a bit overcomplicated to me. What we are trying to accomplish is parse the PDF signature, which has a fixed location near the start of the file. There might be a way to get sed
to directly return the first match in the first "line".
I also have the feeling that extraction and formatting could be handled nicely by an AWK one-liner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say: I am not proud of it. I read, that the header can be anywhere within the first 1024 bytes of the file according to Adobes documents (see Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file https://bugs.ghostscript.com/show_bug.cgi?id=687125 for example). I did not find the actual documentation by Adobe, I guess it is not available freely.
By prepending the cut
command I wanted to ensure, that only a fixed amount of the file needs to be read by the function. But yeah, a AWK script could also do the trick and be easier on the eyes…
shrinkpdf.sh
Outdated
|
||
# If the pdf version could not be extracted, do not change it | ||
if [ -z "$version" ]; then | ||
return | ||
return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function still returns successfully (zero) on errors. The return code should be 1
when an error is encountered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If no version is found, then no version is reported by the function, but then the default will be used.
This way the whole program still works even if this function does not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now changed it, so that an error is thrown and when the error is caught, the default version is set. I like it a lot better this way.
shrinkpdf.sh
Outdated
@@ -64,16 +64,16 @@ shrink () | |||
get_pdf_version() { | |||
# $1 is the input file. The PDF version is contained in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR still does not follow the style conventions of this project: tabs for indent, opening function parenthesis on a new line, etc. I know that style is very subjective and I don't personally have a problem with the common two-space indenting style, but the important thing is consistency. A project must use the same style throughout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry. I read over that. Will change it accordingly 👍
shrinkpdf.sh
Outdated
get_pdf_version() { | ||
# $1 is the input file. The PDF version is contained in the | ||
# first 1024 bytes and will be extracted from the PDF file. | ||
version=$(cut -b "-1024" "$1" | sed -n '/%PDF-[0-9]\.[0-9]/{s/.*%PDF-/%PDF-/;p;q}' | cut -c "-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole expression looks a bit overcomplicated to me. What we are trying to accomplish is parse the PDF signature, which has a fixed location near the start of the file. There might be a way to get sed
to directly return the first match in the first "line".
I also have the feeling that extraction and formatting could be handled nicely by an AWK one-liner.
Regarding your point about the change itself: I did some testing myself and I fully accept that using higher compatibility versions can lead to smaller PDFs. Regarding your point about compatibility: I'm not that concerned about introducing incompatibilities. A small file size is what the user wants. But I think we can reach that outcome more easily by removing the |
shrinkpdf.sh
Outdated
{ | ||
# $1 is the input file. The PDF version is contained in the | ||
# first 1024 bytes and will be extracted from the PDF file. | ||
pdf_version=$(cut -b -1024 "$1" | awk '{ if (match($0, "%PDF-[0-9]\\.[0-9]")) { print substr($0, RSTART + 5, 3); exit } }') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cut could be left out here, but then the whole file will be parsed instead of only the first 1024 bytes. I like it better this way, but if you prefer an only awk solution, then cut could also be removed here…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a separate commit that lets awk process the whole 1024 bytes and does not exit early. This way the error message that cut sometimes throws should be avoided.
Thanks for all the updates, I think the code is fine to merge now. I pushed an extra small commit with some minor style fixes, to fix the last inconsistencies. There are a number of commits in the PR which undo changes made by previous commits; do you mind if I squash this PR into a single commit? Do you appreciate a mention in the copyright header? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a number of commits in the PR which undo changes made by previous commits; do you mind if I squash this PR into a single commit?
I was about to ask you, whether you wanted it to be squashed. I now made three separate commits out of it.
If you do not like the last one, then you can leave it out or change it to your liking.
Do you appreciate a mention in the copyright header?
I do not need that, but thank you for asking!
shrinkpdf.sh
Outdated
{ | ||
# $1 is the input file. The PDF version is contained in the | ||
# first 1024 bytes and will be extracted from the PDF file. | ||
pdf_version=$(cut -b -1024 "$1" | awk '{ if (match($0, "%PDF-[0-9]\\.[0-9]")) { print substr($0, RSTART + 5, 3); exit } }') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a separate commit that lets awk process the whole 1024 bytes and does not exit early. This way the error message that cut sometimes throws should be avoided.
Thanks for the updates. I squashed them into one commit and merged it to master. I'm satisfied with how the code turned out, and I think it will be a useful addition for our users. Thanks again for your contribution. |
With this change it happens much less that the output file is larger than the input.