+
Skip to content

Use PDF version of input for output file #7

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

Use PDF version of input for output file #7

wants to merge 3 commits into from

Conversation

He-Ro
Copy link
Contributor

@He-Ro He-Ro commented Oct 17, 2022

With this change it happens much less that the output file is larger than the input.

@aklomp
Copy link
Owner

aklomp commented Oct 17, 2022

Hi, thanks for the pull request. I noticed that CI wasn't running for your PR, so I made a quick fix to enable CI for PRs (#8).

Unfortunately, your PR fails the shellcheck and test workflows. Not sure if you can see the job output (you should be able to see future job reports now that CI is enabled for PRs), but the error for the shellcheck workflow is this:

In shrinkpdf.sh line 80:
  pdf_version="${version:5}"
               ^----------^ SC2039: In POSIX sh, string indexing is undefined.

For more information:
  https://www.shellcheck.net/wiki/SC2039 -- In POSIX sh, string indexing is u...

The problem is that shrinkpdf targets POSIX sh, and cannot use bashisms. I didn't check whether head -c and grep -Eoa are POSIX compliant, but the same issue may arise.

Additionally, the PR fails the test workflow with the following output:

./shrinkpdf.sh: 80: Bad substitution

In the test workflow, a throwaway PDF is created with pandoc and LaTeX, and fed through the script. Apparently the script does not find the version correctly. This is exacerbated by the fact that get_pdf_version always returns success, even on errors.

There are also some stylistic issues. The code style of the PR does not match the existing code, causing the file to lose internal consistency.

  • The script uses tabs for indent; the PR uses two spaces. Regardless of preference, a single standard should be followed.
  • The script places a space between a function name and its parentheses, and places the opening bracket on the following line.

@aklomp
Copy link
Owner

aklomp commented Oct 17, 2022

I agree that choosing a higher compatibility level can create smaller PDFs, but I'm not sure if the best strategy is to carry over the input compat level to the output. If the script does not specify -dCompatibilityLevel, then GhostScript will use its own default, which is 1.5 for the /screen target in GhostScript 10. Maybe that's a sane default? Leave it unspecified?

It would be a lot easier to add a command line option for the user to specify an optional compatibility version, than to extract the PDF version using shell tools.

@aklomp aklomp added the enhancement New feature or request label Oct 17, 2022
@He-Ro
Copy link
Contributor Author

He-Ro commented Oct 18, 2022

First, thank you for the detailed review/response. I just rebased the branch to include your changes to the Workflow, but it seems like you must first approve it, so that the workflows can run on this Pull Request. I will have a look at your technical review later.

@He-Ro
Copy link
Contributor Author

He-Ro commented Oct 18, 2022

So, the used commands should now be POSIX compliant and Shellcheck is also happy.

Regarding the change itself:

I created a orig.pdf file just like the pipeline which is 97,205 bytes in size. When I run ./shrinkpdf.sh -o new.pdf orig.pdf on it, then without my changes the new file is not smaller than the input. However with my changes the output is 35,488 bytes in size. So using a newer PDF version than 1.3 makes a big difference, if the input is also newer (my orig.pdf has PDF version 1.5).

Regarding compatibility:

If the input file is PDF version 1.6 or even higher (I do not know what kind of versions already exist), then I do not see a reason why shrinkpdf should change that PDF version. Having a newer PDF version is already a compatibility "issue" of the original file so leaving it at that version does not introduce a compatibility issue.

shrinkpdf.sh Outdated
get_pdf_version() {
# $1 is the input file. The PDF version is contained in the
# first 1024 bytes and will be extracted from the PDF file.
version=$(cut -b "-1024" "$1" | sed -n '/%PDF-[0-9]\.[0-9]/{s/.*%PDF-/%PDF-/;p;q}' | cut -c "-8")
Copy link
Contributor Author

@He-Ro He-Ro Oct 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message with cut (https://github.com/aklomp/shrinkpdf/actions/runs/3275070529/jobs/5392222355#step:5:7) comes from the first cut invocation in this line, I suppose. The sed command closes its incoming pipe after finding the string %PDF-x.y. cut obviously does not handle it gracefully…

I guess we could remove the ;q from the sed program. That way it will process the whole input. if multiple %PDF-x.y are contained in multiple lines of the PDF within the first 1024 bytes, then all of them will be printed by sed, but the cut afterwards will make sure, that only the first version survives. I will push a new commit with the change. If you better like it this way, you can always remove that commit.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole expression looks a bit overcomplicated to me. What we are trying to accomplish is parse the PDF signature, which has a fixed location near the start of the file. There might be a way to get sed to directly return the first match in the first "line".

I also have the feeling that extraction and formatting could be handled nicely by an AWK one-liner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say: I am not proud of it. I read, that the header can be anywhere within the first 1024 bytes of the file according to Adobes documents (see Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file https://bugs.ghostscript.com/show_bug.cgi?id=687125 for example). I did not find the actual documentation by Adobe, I guess it is not available freely.

By prepending the cut command I wanted to ensure, that only a fixed amount of the file needs to be read by the function. But yeah, a AWK script could also do the trick and be easier on the eyes…

shrinkpdf.sh Outdated

# If the pdf version could not be extracted, do not change it
if [ -z "$version" ]; then
return
return 0
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function still returns successfully (zero) on errors. The return code should be 1 when an error is encountered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no version is found, then no version is reported by the function, but then the default will be used.

This way the whole program still works even if this function does not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now changed it, so that an error is thrown and when the error is caught, the default version is set. I like it a lot better this way.

shrinkpdf.sh Outdated
@@ -64,16 +64,16 @@ shrink ()
get_pdf_version() {
# $1 is the input file. The PDF version is contained in the
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR still does not follow the style conventions of this project: tabs for indent, opening function parenthesis on a new line, etc. I know that style is very subjective and I don't personally have a problem with the common two-space indenting style, but the important thing is consistency. A project must use the same style throughout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry. I read over that. Will change it accordingly 👍

shrinkpdf.sh Outdated
get_pdf_version() {
# $1 is the input file. The PDF version is contained in the
# first 1024 bytes and will be extracted from the PDF file.
version=$(cut -b "-1024" "$1" | sed -n '/%PDF-[0-9]\.[0-9]/{s/.*%PDF-/%PDF-/;p;q}' | cut -c "-8")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole expression looks a bit overcomplicated to me. What we are trying to accomplish is parse the PDF signature, which has a fixed location near the start of the file. There might be a way to get sed to directly return the first match in the first "line".

I also have the feeling that extraction and formatting could be handled nicely by an AWK one-liner.

@aklomp
Copy link
Owner

aklomp commented Oct 18, 2022

Regarding your point about the change itself: I did some testing myself and I fully accept that using higher compatibility versions can lead to smaller PDFs.

Regarding your point about compatibility: I'm not that concerned about introducing incompatibilities. A small file size is what the user wants. But I think we can reach that outcome more easily by removing the -dCompatibilityLevel=1.3 line from the script and letting GhostScript use its default (1.5 for the /screen target). It accomplishes the same goal, but without introducing lots of new code to parse the input. Which means that I'm not convinced that this PR is the right solution.

shrinkpdf.sh Outdated
{
# $1 is the input file. The PDF version is contained in the
# first 1024 bytes and will be extracted from the PDF file.
pdf_version=$(cut -b -1024 "$1" | awk '{ if (match($0, "%PDF-[0-9]\\.[0-9]")) { print substr($0, RSTART + 5, 3); exit } }')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cut could be left out here, but then the whole file will be parsed instead of only the first 1024 bytes. I like it better this way, but if you prefer an only awk solution, then cut could also be removed here…

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think it's fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a separate commit that lets awk process the whole 1024 bytes and does not exit early. This way the error message that cut sometimes throws should be avoided.

@aklomp
Copy link
Owner

aklomp commented Oct 20, 2022

Thanks for all the updates, I think the code is fine to merge now. I pushed an extra small commit with some minor style fixes, to fix the last inconsistencies.

There are a number of commits in the PR which undo changes made by previous commits; do you mind if I squash this PR into a single commit?

Do you appreciate a mention in the copyright header?

Copy link
Contributor Author

@He-Ro He-Ro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of commits in the PR which undo changes made by previous commits; do you mind if I squash this PR into a single commit?

I was about to ask you, whether you wanted it to be squashed. I now made three separate commits out of it.

If you do not like the last one, then you can leave it out or change it to your liking.

Do you appreciate a mention in the copyright header?

I do not need that, but thank you for asking!

shrinkpdf.sh Outdated
{
# $1 is the input file. The PDF version is contained in the
# first 1024 bytes and will be extracted from the PDF file.
pdf_version=$(cut -b -1024 "$1" | awk '{ if (match($0, "%PDF-[0-9]\\.[0-9]")) { print substr($0, RSTART + 5, 3); exit } }')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a separate commit that lets awk process the whole 1024 bytes and does not exit early. This way the error message that cut sometimes throws should be avoided.

@aklomp
Copy link
Owner

aklomp commented Oct 20, 2022

Thanks for the updates. I squashed them into one commit and merged it to master. I'm satisfied with how the code turned out, and I think it will be a useful addition for our users. Thanks again for your contribution.

@aklomp aklomp closed this Oct 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载