这是indexloc提供的服务,不要输入任何密码
Skip to content

Add gitoutput package, rework all git hashing #1239

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 84 commits into from
Jun 23, 2022

Conversation

nathanhammond
Copy link
Contributor

@nathanhammond nathanhammond commented May 13, 2022

Migrate all calls out to git to a streaming reading strategy.

Along the way:

  • Addresses the git path handling by using -z.
  • 15% performance improvement on large-monorepo.
  • Added path types.

History grabbed with:
git log --pretty=email --patch-with-stat --reverse --full-index --binary -- src/pkg/csv/reader.go src/pkg/csv/reader_test.go src/pkg/encoding/csv/reader.go src/pkg/encoding/csv/reader_test.go src/encoding/csv/reader.go src/encoding/csv/reader_test.go

@vercel
Copy link

vercel bot commented May 13, 2022

@nathanhammond is attempting to deploy a commit to the Vercel Team on Vercel.

A member of the Team first needs to authorize it.

@vercel
Copy link

vercel bot commented May 17, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Updated
turbo-site ⬜️ Ignored (Inspect) Jun 17, 2022 at 10:04AM (UTC)

@nathanhammond nathanhammond force-pushed the lstree-with-history branch 2 times, most recently from 08be0da to 0007f2f Compare May 18, 2022 14:37
@nathanhammond nathanhammond force-pushed the lstree-with-history branch 2 times, most recently from 2d4ff8b to de13a56 Compare May 19, 2022 14:56
@nathanhammond nathanhammond changed the title Add lstree package Add gitoutput package, rework all git hashing May 19, 2022
Copy link
Contributor

@gsoltis gsoltis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome. I like the approach a lot.

Most of my comments are related to adding comments. Since this is performance-sensitive code, it appears that there are a bunch of non-obvious ways of doing things that I'm assuming are for streaming or avoiding allocations. Commenting those would go a long way towards making that complexity understandable. Most of my reading was like "wait, why don't you just... oh, I guess it's reusing a buffer to avoid an allocation".

@nathanhammond nathanhammond force-pushed the lstree-with-history branch 3 times, most recently from a343732 to fa8ca56 Compare May 30, 2022 18:42
@nathanhammond nathanhammond force-pushed the lstree-with-history branch 2 times, most recently from 26551d9 to 72d077f Compare June 6, 2022 13:06
globalDepsArray := globalDeps.UnsafeListOfStrings()
globalDepsPaths := make([]turbopath.AbsoluteSystemPath, len(globalDepsArray))
for i, path := range globalDepsArray {
globalDepsPaths[i] = turbopath.AbsoluteSystemPath(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have the intention of removing these casts someday, I think it's better to have an explicit function for them. It serves two purposes:

  • highlights when an IDE has "helpfully" inserted a cast
  • makes them more greppable / find-referencesable. Currently, the cast overlaps with type declarations in both cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm game to add a function; I'll take a pass at naming it.

@@ -133,7 +134,7 @@ func manuallyHashPackage(pkg *fs.PackageJSON, inputs []string, rootPath fs.Absol
if err != nil {
return fmt.Errorf("could not hash file %v. \n%w", name, err)
}
hashObject[strings.TrimPrefix(name, toTrim)] = hash
hashObject[turbopath.RelativeUnixPath(strings.TrimPrefix(name, toTrim))] = hash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sort of cast makes me nervous. If we're not checking this, then we might want an UnsafeToRelativeUnixPath. If we are checking this, we should at least add some comments and maybe even still have an explicit function call instead of a cast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually did check the upstream inputs of this, but since it isn't type-constrained in the upstream sources that could change out from underneath me.

It's right that this makes you nervous; I should indeed force this one in particular through a checked path. (The boundary isn't library code that we would expect to maintain a consistent interface.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whelp, this one was a cross-platform bug and turbopath caught it.

By satisfying the type constraints the tests started breaking on Windows: https://github.com/vercel/turborepo/runs/6913146718?check_suite_focus=true

After looking at it for a bit I decided that it was a bit weird, and sure enough: #1413

Going to include a fix in this PR.

import "path/filepath"

// RepoRelativeSystemPath is a relative path from the repository using system separators.
type RepoRelativeSystemPath string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this type?

I understand that some other types return it, but if we don't actually ever need it, maybe that's a signal to rework our interfaces so that we aren't forced to define it.

Copy link
Contributor Author

@nathanhammond nathanhammond Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Start by reading this comment, otherwise it ends up ordered weirdly: #1239 (comment)


Two reasons it exists:

  • Maintain consistent behavior for library code.
  • I believe we'll need it eventually, I'm imagining using it in the symlink stuff we have coming up.

type FilePathInterface interface {
filePathStamp()

ToSystemPath() SystemPathInterface
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried commenting out this method and the one below it and nothing seemed to go horribly wrong.

I also tried un-exporting the interface, and that seemed to go well too. I think the only place that we use it is in toStringArray below? And that one just requires ToString(), and, implicitly filePathStamp().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier iterations of this PR extensively used FilePathInterface but I eventually decided to push the types up to call sites where I already knew what they were.

This is one of the takeaways for me: the more-general interfaces will primarily be used in situations where you're integrating with an existing repository. That forces you to make sure that everything that you do is safe while making it possible to avoid pushing all of the types up through the call sites.

}

func toStringArray[T FilePathInterface](source []T) []string {
output := make([]string, len(source))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bummer. I don't have a better idea right now, but it's a bummer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two things about this:

  1. Introduces generics. I can refactor away from that if you think it isn't worth it for this PR.
  2. Introduces runtime cost to accomodate types.

I think that if we eventually pull this out to a library we could just AST-transform this thing out of existence at compile time. They're an array of strings already and it should be a straightforward walk. Honestly way more-reasonable than the AST-walking for ENV variables we saw recently...

//
// Much of this is dreadfully repetitive because of intentional
// limitations in the Go type system.
package turbopath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got curious and started deleting things from this package to see what I could get away with. I think some of our type system woes can potentially be addressed by just not defining some things. For instance, it appears we never call RepoRelativeUnixPath.ToRepoRelativeSystemPath. That, combined with other things like that, mean that I don't think we need RepoRelativeSystemPath.

I'll put up a PR to show what I mean, but I think by dropping some thing from interfaces when we aren't actually using the interface methods, we can return more useful types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to show what I'm talking about: nathanhammond#2

Copy link
Contributor Author

@nathanhammond nathanhammond Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I did actively choose to implement the full set because my intent was to treat this as a library—and possibly one which could be fully-externalized (move out of internal, and then maybe even into a separate repo). That would also make it a bit easier to optionally include it since it does have runtime costs. In some (far-)future opportunity we could even add a cute little AST transform that enables it to remove itself from runtime code.

The tradeoff here is that if we optimize it for the paths that we have right now it results in an inconsistent interface and stops feeling like a library and starts to feel more like a whole bunch of utility functions.

I do understand the appeal of reducing the use of interfaces: using them can add to verbosity and generally feels unergonomic. But I think it's worth it in this case.

Also, I'm ~sure that this is the full set of things will eventually be used. Repo-relative system paths (the one explicitly added to mirror repo-relative Unix paths which is 100% unused) are going to be useful if/when we add rewrite-symlink-on-cache for Windows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the utility approach. I don't think there's much value in spending effort to make this externally consumable. Given that we're the main or only consumers, we should optimize for what we need. If a useful library happens to emerge from that, great. If it doesn't, I don't particularly think we're missing out on much. We also are not in a rush to produce such a library, and in the event that we want one in the future, I think it would benefit from growing from what we actually use, rather than making a guess up front.

If all we ever need are some inconsistent interfaces, I don't think that's a problem. There may well be some transitions that we don't need, and in that case I don't think we gain anything by having to support them. And in practice, with Go's type system, we are losing information by supporting interfaces rather than direct types. So it's not just the burden of unused code, but it bleeds over into code we do intend to use.

As to what we need, the goal is for us to be able to use the type system, to the extent possible, to ensure we are interacting with our various external APIs in the way we expect. So, for the code in turbopath does that, we should keep it. By the same token, I think if there's code in turbopath that we aren't using, we should remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I further refactored application code more to enable removing interfaces:
ae13799

I continued on to remove all Interfaces from turbopath:
71b515e

But the path to get to where we are goes through the interfaces that I removed, for example:
c27caff

I'm very much not a fan of losing the tool I used to get where this PR is. package_deps_hash.go is nearly a full rewrite and removing the interfaces makes traveling that path impossible. It's all showing up as a single PR at the end, but the incremental changes to get there are captured in the history, go through the interfaces, and are absolutely necessary when adopting a complicated mess of unknown paths into the system.

Having gone through this process I feel very strongly that removing the interfaces is a bad idea. If I were to do the same thing for, say, caching, my first step would be to reintroduce them as guiderails for the effort.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time reading through these.

Maybe we need to chat offline about it, because I'm not 100% sure I'm understanding the issue. For instance, going from c27caff to the current implementation looks like a clear improvement to me. To pick one example, having filesToHash as an interface argument to gitHashObject suggests that we don't know what kind of paths we're dealing with at that point. That stands out as the sort of thing these types are intended to fix: we should know exactly what kind of path we have there. And in what we have now, that is in fact the case, we know they are AnchoredSystemPath instances.

ae13799 also looks like a clear win to me: aside from the Relative -> Anchored change, the only change outside of the turbopath package, I think, is changing from an AbsolutePathInterface to AbsoluteSystemPath. That's doing what I think we want: giving us more explicit information about exactly what kind of path we have.

I also have no objection to using interfaces as a tool for getting to an implementation using explicit types, or in cases where we do not know or care what the underlying type is. I just don't want unused interfaces and unused interface methods in main if we can help it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me start with: I'm sorry. I was writing this while frustrated for wanting to get this thing landed, knowing that it wasn't going to make it on Friday because of time zones. Everything about the reviews have improved the code and I'm immensely grateful for them.

The point I (unsuccessfully) was trying to present in my previous comment is that I don't think I can (easily) go from c27caff8073868eee9e62a5fce2784badf8de098 (which was very early in this process) to ae13799cc89d2090827b94a896f375da39554f67 (which is nearly a platonic ideal) without using the tools which get removed in 71b515e2a11ebf185eeccd1755942624d61b724b.

The interfaces were the stepping stone; they're indicative of an incomplete migration. The fact that they leak into needing application code to deal with typechecks on them makes them feel gross, and strongly encourages their eventual removal. However, shipping something that is an interface as part of a migration step is still a net improvement since it acts as a forcing function for the author handle the complexity of, "we don't know what type of path we've been handed."

This PR itself has, by virtue of continued investment, evolved beyond needing the interfaces. But that's not necessarily the place where the effort should be stopped on all changes (and arguably, not where this one should have gone to either). The interfaces enable incrementalism. We could go through and tag everything as a FilePathInterface tomorrow and within a month we'd have fully-typed paths throughout the codebase because a FilePathInterface is such a pain to work with.

I don't know how good Go is at dropping things, but I assume that since it does a SSA pass (that only bails for generics), anything that isn't used should get dropped at compile time. I'd like to revert 71b515e2a11ebf185eeccd1755942624d61b724b, but the other changes were a net positive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is not with Go's ability to do tree-shaking. I'm happy to assume that it can either do a fine job or that the extra code size is negligible.

So far, the I think using concrete types at the point of interacting with another system, where we should know the type exactly, (remote cache, os filesystem) has been a decent forcing function for converting paths, but it's certainly not the only way to approach the problem. If FilePathInterface is helpful for working through some of those scenarios, then great, that sounds like a useful tool. I agree that in this case we've ended up in a better spot, and obviously the interface code was part of that journey. I would recommend keeping it in a patch to be applied when we are tackling the future conversions. As we move forward with the path migration, I'm sure some of what's currently unused will end up useful. I just don't know which parts yet.

As for now, I don't want to add unused code to main. There are a few reasons:

  • Without a consumer, we don't actually know that it's the correct code to have. I think the cross-product of different aspects of paths in this case produced some code to satisfy those interfaces that we explicitly don't want, such as AbsoluteUnixPath. I'm also not sure what exactly some of the Join methods should be doing, but it's currently impossible to tell, since they are unused. If and when we have a need for them, we will by definition have a better idea of what they should do.
  • It takes away from the understandability of the codebase. The default assumption about every line is that it has some reason for being there. If we find things that are not being used, we should work to remove them. There are pieces of unused code that I've already removed, and I would be unsurprised to learn that there are more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to add unused code to main. Let's cut this and open another PR at another time. PR has been open over a month and we need to ship

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can ship path abstraction in 1.3.x

@nathanhammond
Copy link
Contributor Author

nathanhammond commented Jun 15, 2022

Open conversations:

Fast follows:

  • Migrate AbsolutePath to turbopath.AbsoluteSystemPath
  • Move package_deps_hash.go out of fs

@nathanhammond nathanhammond requested a review from gsoltis June 16, 2022 04:38
@nathanhammond nathanhammond force-pushed the lstree-with-history branch 3 times, most recently from 6e6dc0e to 99e2751 Compare June 16, 2022 08:52
Copy link
Contributor

@gsoltis gsoltis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: the open questions:

  • turbopath: I wrote a longer comment, but tl;dr is let's not ship code we aren't immediately using, we can extract a library later
  • Re: linting: I don't want to blanket ignore them, but I am open to methods for making the situation more explicit and less tedious. I proposed one, but I haven't tested it.
  • Re: manuallyHashPackage, I left some comments on the function and the test, but I think the behavior change is good
  • [][]string is fine for now.

//
// Much of this is dreadfully repetitive because of intentional
// limitations in the Go type system.
package turbopath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the utility approach. I don't think there's much value in spending effort to make this externally consumable. Given that we're the main or only consumers, we should optimize for what we need. If a useful library happens to emerge from that, great. If it doesn't, I don't particularly think we're missing out on much. We also are not in a rush to produce such a library, and in the event that we want one in the future, I think it would benefit from growing from what we actually use, rather than making a guess up front.

If all we ever need are some inconsistent interfaces, I don't think that's a problem. There may well be some transitions that we don't need, and in that case I don't think we gain anything by having to support them. And in practice, with Go's type system, we are losing information by supporting interfaces rather than direct types. So it's not just the burden of unused code, but it bleeds over into code we do intend to use.

As to what we need, the goal is for us to be able to use the type system, to the extent possible, to ensure we are interacting with our various external APIs in the way we expect. So, for the code in turbopath does that, we should keep it. By the same token, I think if there's code in turbopath that we aren't using, we should remove it.

@@ -21,6 +21,9 @@ linters-settings:
max-func-lines: 3

issues:
exclude:
# This is golangci-lint's EXC0001, but no way to use that collection as an exclude.
- .((os\.)?std(out|err)\..*|.*Close|.*Flush|os\.Remove(All)?|.*print(f|ln)?|os\.(Un)?Setenv). is not checked
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm conflicted on this one. On the one hand, 99% of the time I think we don't care and defer func() { _ = thing.Close() }() is tedious. But that other 1% is things like "log files for a task" where we really do care.

Maybe there's another way to address this? Would something like this work?

func ignoreCloseError(closer interface{ Close() error }) {
  _ = closer.Close()
}
...
defer ignoreCloseError(thingToBeClosed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a new util/closer.go function and adopted it here.

@@ -114,26 +114,32 @@ func manuallyHashPackage(pkg *fs.PackageJSON, inputs []string, rootPath fs.Absol
}

pathPrefix := rootPath.Join(pkg.Dir).ToString()
toTrim := filepath.FromSlash(pathPrefix + "/")
convertedPathPrefix := turbopath.AbsoluteSystemPath(pathPrefix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use an explicit function rather than a cast. Casts are hard to search for and IDEs can add them automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed all casts from the application, but not from the tests; I considered that a good balance. I can go back and automatically add them to the tests if you'd like (but don't want to block this PR on that).

fs.Walk(pathPrefix, func(name string, isDir bool) error {
rootMatch := ignore.MatchesPath(name)
otherMatch := ignorePkg.MatchesPath(name)
convertedName := turbopath.AbsoluteSystemPath(name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With fs.AbsolutePath, we now have AbsolutePathFromUpstream. I think an equivalent would be warranted here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a bonus, so many of the casts will evaporate once we land this and the fast-follow which combines fs.AbsolutePath and turbopath.AbsoluteSystemPath lands. That'll make things much nicer.

@@ -94,8 +94,8 @@ func (pfs *packageFileSpec) hash(pkg *fs.PackageJSON, repoRoot fs.AbsolutePath)
return hashOfFiles, nil
}

func manuallyHashPackage(pkg *fs.PackageJSON, inputs []string, rootPath fs.AbsolutePath) (map[string]string, error) {
hashObject := make(map[string]string)
func manuallyHashPackage(pkg *fs.PackageJSON, inputs []string, rootPath fs.AbsolutePath) (map[turbopath.RelativeUnixPath]string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the switch to the typed return here, and I agree that this is the right type for hashing.

Copy link
Contributor Author

@nathanhammond nathanhammond Jun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's fixing these method signatures (and then plumbing through the consequences of that) that really make a difference for correctness.

My comments on Friday about this (while frustrated, I'm sorry, I wanted to get this landed, but each revision makes it better) were focused on traveling the path from "I changed the method signature" to "I have types that go through the entire system. It's that interim state when I didn't know which type I wanted where, and where changing one meant a cascade up to multiple different call sites which may not be doing the right thing (as demonstrated by #1413).

All of the changes I've made with this have resulted in better code, except for in my opinion c27caff8073868eee9e62a5fce2784badf8de098 which just feels like pulling the ladder up after you've climbed the tree.

filepath.FromSlash("libA/ignorethisdir/anything"): {"anything", ""},
filepath.FromSlash("libA/pkgignoreme"): {"anything", ""},
filepath.FromSlash("libA/pkgignorethisdir/file"): {"anything", ""},
"top-level-file": {"top-level-file-contents", ""},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These paths are passed to os.Create below. This works because windows accepts /-delimited paths. However, I think we should convert at the point of interfacing w/ the filesystem so that we're clear.

Perhaps typing this map as map[RelativeUnixPath]fileHash would highlight this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch, I caught just one of the ways this was working; we need to get the utility functions from fs.AbsolutePath merged into these so this sort of mistake doesn't happen.

(We were using it for creation of the files and wanted System paths, but also using it to check the outputs where we want AnchoredUnixPaths.)

Copy link
Contributor

@gsoltis gsoltis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving for 1.3

@kodiakhq kodiakhq bot merged commit b591e8a into vercel:main Jun 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.