Use PDF version of input for output file #7

He-Ro · 2022-10-17T20:19:49Z

With this change it happens much less that the output file is larger than the input.

aklomp · 2022-10-17T21:21:07Z

Hi, thanks for the pull request. I noticed that CI wasn't running for your PR, so I made a quick fix to enable CI for PRs (#8).

Unfortunately, your PR fails the shellcheck and test workflows. Not sure if you can see the job output (you should be able to see future job reports now that CI is enabled for PRs), but the error for the shellcheck workflow is this:

In shrinkpdf.sh line 80:
  pdf_version="${version:5}"
               ^----------^ SC2039: In POSIX sh, string indexing is undefined.

For more information:
  https://www.shellcheck.net/wiki/SC2039 -- In POSIX sh, string indexing is u...

The problem is that shrinkpdf targets POSIX sh, and cannot use bashisms. I didn't check whether head -c and grep -Eoa are POSIX compliant, but the same issue may arise.

Additionally, the PR fails the test workflow with the following output:

./shrinkpdf.sh: 80: Bad substitution

In the test workflow, a throwaway PDF is created with pandoc and LaTeX, and fed through the script. Apparently the script does not find the version correctly. This is exacerbated by the fact that get_pdf_version always returns success, even on errors.

There are also some stylistic issues. The code style of the PR does not match the existing code, causing the file to lose internal consistency.

The script uses tabs for indent; the PR uses two spaces. Regardless of preference, a single standard should be followed.
The script places a space between a function name and its parentheses, and places the opening bracket on the following line.

aklomp · 2022-10-17T21:24:41Z

I agree that choosing a higher compatibility level can create smaller PDFs, but I'm not sure if the best strategy is to carry over the input compat level to the output. If the script does not specify -dCompatibilityLevel, then GhostScript will use its own default, which is 1.5 for the /screen target in GhostScript 10. Maybe that's a sane default? Leave it unspecified?

It would be a lot easier to add a command line option for the user to specify an optional compatibility version, than to extract the PDF version using shell tools.

He-Ro · 2022-10-18T15:21:45Z

First, thank you for the detailed review/response. I just rebased the branch to include your changes to the Workflow, but it seems like you must first approve it, so that the workflows can run on this Pull Request. I will have a look at your technical review later.

He-Ro · 2022-10-18T16:26:15Z

So, the used commands should now be POSIX compliant and Shellcheck is also happy.

Regarding the change itself:

I created a orig.pdf file just like the pipeline which is 97,205 bytes in size. When I run ./shrinkpdf.sh -o new.pdf orig.pdf on it, then without my changes the new file is not smaller than the input. However with my changes the output is 35,488 bytes in size. So using a newer PDF version than 1.3 makes a big difference, if the input is also newer (my orig.pdf has PDF version 1.5).

Regarding compatibility:

If the input file is PDF version 1.6 or even higher (I do not know what kind of versions already exist), then I do not see a reason why shrinkpdf should change that PDF version. Having a newer PDF version is already a compatibility "issue" of the original file so leaving it at that version does not introduce a compatibility issue.

He-Ro · 2022-10-18T19:45:23Z

shrinkpdf.sh

+get_pdf_version() {
+  # $1 is the input file. The PDF version is contained in the
+  # first 1024 bytes and will be extracted from the PDF file.
+  version=$(cut -b "-1024" "$1" | sed -n '/%PDF-[0-9]\.[0-9]/{s/.*%PDF-/%PDF-/;p;q}' | cut -c "-8")


The error message with cut (https://github.com/aklomp/shrinkpdf/actions/runs/3275070529/jobs/5392222355#step:5:7) comes from the first cut invocation in this line, I suppose. The sed command closes its incoming pipe after finding the string %PDF-x.y. cut obviously does not handle it gracefully…

I guess we could remove the ;q from the sed program. That way it will process the whole input. if multiple %PDF-x.y are contained in multiple lines of the PDF within the first 1024 bytes, then all of them will be printed by sed, but the cut afterwards will make sure, that only the first version survives. I will push a new commit with the change. If you better like it this way, you can always remove that commit.

The whole expression looks a bit overcomplicated to me. What we are trying to accomplish is parse the PDF signature, which has a fixed location near the start of the file. There might be a way to get sed to directly return the first match in the first "line".

I also have the feeling that extraction and formatting could be handled nicely by an AWK one-liner.

Let's say: I am not proud of it. I read, that the header can be anywhere within the first 1024 bytes of the file according to Adobes documents (see Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file https://bugs.ghostscript.com/show_bug.cgi?id=687125 for example). I did not find the actual documentation by Adobe, I guess it is not available freely.

By prepending the cut command I wanted to ensure, that only a fixed amount of the file needs to be read by the function. But yeah, a AWK script could also do the trick and be easier on the eyes…

aklomp · 2022-10-18T19:47:49Z

shrinkpdf.sh


  # If the pdf version could not be extracted, do not change it
  if [ -z "$version" ]; then
-    return
+    return 0


The function still returns successfully (zero) on errors. The return code should be 1 when an error is encountered.

If no version is found, then no version is reported by the function, but then the default will be used.

This way the whole program still works even if this function does not.

I now changed it, so that an error is thrown and when the error is caught, the default version is set. I like it a lot better this way.

aklomp · 2022-10-18T19:49:55Z

shrinkpdf.sh

@@ -64,16 +64,16 @@ shrink ()
 get_pdf_version() {
  # $1 is the input file. The PDF version is contained in the


This PR still does not follow the style conventions of this project: tabs for indent, opening function parenthesis on a new line, etc. I know that style is very subjective and I don't personally have a problem with the common two-space indenting style, but the important thing is consistency. A project must use the same style throughout.

Ah, sorry. I read over that. Will change it accordingly 👍

aklomp · 2022-10-18T19:55:20Z

shrinkpdf.sh

+get_pdf_version() {
+  # $1 is the input file. The PDF version is contained in the
+  # first 1024 bytes and will be extracted from the PDF file.
+  version=$(cut -b "-1024" "$1" | sed -n '/%PDF-[0-9]\.[0-9]/{s/.*%PDF-/%PDF-/;p;q}' | cut -c "-8")


The whole expression looks a bit overcomplicated to me. What we are trying to accomplish is parse the PDF signature, which has a fixed location near the start of the file. There might be a way to get sed to directly return the first match in the first "line".

I also have the feeling that extraction and formatting could be handled nicely by an AWK one-liner.

aklomp · 2022-10-18T19:59:55Z

Regarding your point about the change itself: I did some testing myself and I fully accept that using higher compatibility versions can lead to smaller PDFs.

Regarding your point about compatibility: I'm not that concerned about introducing incompatibilities. A small file size is what the user wants. But I think we can reach that outcome more easily by removing the -dCompatibilityLevel=1.3 line from the script and letting GhostScript use its default (1.5 for the /screen target). It accomplishes the same goal, but without introducing lots of new code to parse the input. Which means that I'm not convinced that this PR is the right solution.

He-Ro · 2022-10-18T20:41:16Z

shrinkpdf.sh

+{
+	# $1 is the input file. The PDF version is contained in the
+	# first 1024 bytes and will be extracted from the PDF file.
+	pdf_version=$(cut -b -1024 "$1" | awk '{ if (match($0, "%PDF-[0-9]\\.[0-9]")) { print substr($0, RSTART + 5, 3); exit } }')


The cut could be left out here, but then the whole file will be parsed instead of only the first 1024 bytes. I like it better this way, but if you prefer an only awk solution, then cut could also be removed here…

No, I think it's fine.

I added a separate commit that lets awk process the whole 1024 bytes and does not exit early. This way the error message that cut sometimes throws should be avoided.

shrinkpdf.sh

aklomp · 2022-10-20T12:15:23Z

Thanks for all the updates, I think the code is fine to merge now. I pushed an extra small commit with some minor style fixes, to fix the last inconsistencies.

There are a number of commits in the PR which undo changes made by previous commits; do you mind if I squash this PR into a single commit?

Do you appreciate a mention in the copyright header?

He-Ro

There are a number of commits in the PR which undo changes made by previous commits; do you mind if I squash this PR into a single commit?

I was about to ask you, whether you wanted it to be squashed. I now made three separate commits out of it.

If you do not like the last one, then you can leave it out or change it to your liking.

Do you appreciate a mention in the copyright header?

I do not need that, but thank you for asking!

He-Ro · 2022-10-20T17:38:09Z

shrinkpdf.sh

+{
+	# $1 is the input file. The PDF version is contained in the
+	# first 1024 bytes and will be extracted from the PDF file.
+	pdf_version=$(cut -b -1024 "$1" | awk '{ if (match($0, "%PDF-[0-9]\\.[0-9]")) { print substr($0, RSTART + 5, 3); exit } }')


I added a separate commit that lets awk process the whole 1024 bytes and does not exit early. This way the error message that cut sometimes throws should be avoided.

aklomp · 2022-10-20T22:28:27Z

Thanks for the updates. I squashed them into one commit and merged it to master. I'm satisfied with how the code turned out, and I think it will be a useful addition for our users. Thanks again for your contribution.

aklomp added the enhancement New feature or request label Oct 17, 2022

He-Ro commented Oct 18, 2022

View reviewed changes

aklomp reviewed Oct 18, 2022

View reviewed changes

He-Ro commented Oct 18, 2022

View reviewed changes

He-Ro added 3 commits October 20, 2022 19:23

Use PDF version of input for output file

6e72cb5

Change default version to 1.5

fbf7b14

Process whole 1024 bytes to supress warning

6f5951a

He-Ro commented Oct 20, 2022

View reviewed changes

aklomp closed this Oct 20, 2022

		@@ -64,16 +64,16 @@ shrink ()
		get_pdf_version() {
		# $1 is the input file. The PDF version is contained in the

Use PDF version of input for output file #7

Use PDF version of input for output file #7

Uh oh!

Conversation

He-Ro commented Oct 17, 2022

Uh oh!

aklomp commented Oct 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aklomp commented Oct 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

He-Ro commented Oct 18, 2022

Uh oh!

He-Ro commented Oct 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

He-Ro Oct 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aklomp commented Oct 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

aklomp commented Oct 20, 2022

Uh oh!

He-Ro left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aklomp commented Oct 20, 2022

Uh oh!

Uh oh!

aklomp commented Oct 17, 2022 •

edited

Loading

aklomp commented Oct 17, 2022 •

edited

Loading

He-Ro commented Oct 18, 2022 •

edited

Loading

He-Ro Oct 18, 2022 •

edited

Loading

aklomp commented Oct 18, 2022 •

edited

Loading