Episode 4 - Go dep with Sam Boyer - https://manifest.fm/4
Andrew Nesbitt: Welcome to The Manifest, a podcast all about package management. I'm Andrew Nesbitt.
Alex Pounds: And I'm Alex Pounds.
Andrew Nesbitt: Together, we're exploring the technical ins and outs of package management. The stories of the people behind the code and the communities around the projects.
Alex Pounds: Today we're joined by Sam Boyer, lead maintainer of Dep and author of the Go packaging solver, gps. Sam, welcome to The Manifest.
Sam Boyer: Thank you.
Alex Pounds: Let's get started by hearing how you found yourself involved in Go. Because it's a fairly young language.
Sam Boyer: Well, really, I love graphs. Somewhere along the line, I discovered that I love graphs and I started writing graph libraries. I wrote one in PHP and then realized that was a terrible idea. I wanted to find a language that would help me write a graph library that was a little closer to let's call it the algorithmic ideal. I ended up picking Go. The whole other story, I now realize that writing graph libraries is mostly a terrible idea most of the time. And then if I really cared about algorithmic superiority, maybe I wouldn't have picked Go. But it doesn't matter, because it got me into Go.
Alex Pounds: When we say, algorithmic superiority, what do we mean by that?
Sam Boyer: I mean, whatever the theoretical ideal is for any given algorithm, you can implement it in a way where the language supports that. That's the Rust zero-cost abstractions type of idea: I can make a nice interface for it, but still have theoretical maximums.
Alex Pounds: How has Go worked out for you with that?
Sam Boyer: It actually does pretty well in a lot of things, and is good enough that I'm almost never actually worrying about that. If I really, really care, then I can write it in something else. But most of the time, the things that I appreciate about Go now are much more the things that are often touted about it. It's a language that places value on a certain form of simplicity, this sort of what you see right in front of you is exactly what it is. It reduces some cognitive load in that way, at the expense of other things, of course, but as the authors argue, it does feel like it helps me most of the time to focus on the goal that I'm trying to reach, and not the cute abstractions that I could make around it.
Alex Pounds: Not everyone listening might have had a chance to write some Go or really explored it in depth. But I found it to be a language which has made some pretty unique decisions about how you structure your code and how you interact with it. Could you tell us a little bit more about both specifics and the decisions that are unique to Go?
Sam Boyer: Sure. Someone else might be able to give a better line-by-line overview of this. But the thing that I've been thinking about as a good example of this recently has to do with a very well-worn argumentative path in the world of Go, which is the fact that Go does not have generics. The effect of this is that you can be very sure that the types that are literally written in the source code in front of you are the types that you're operating on. Nothing is getting slid in outside of your knowledge; you rarely have to dig several layers into the code to see what actual types are operating behind the scenes.
Sam Boyer: There's only really one mechanism for doing that, and that's the interface. To be clear, at times I feel very constricted by this. I've often heard it said that Go would be fantastic if I had generics and nobody else did. But we can't have that world. The benefit I'm trying to talk about here is that because we forgo the power and the expressiveness that comes from generic systems like this, I can simply look at the code in front of me and not need to trace lookup paths to see what actual types are fulfilling an interface or fulfilling a signature. It's much more readily available right in front of me, and for me, at least, that reduces cognitive load and helps me to focus on the thing that I'm trying to accomplish.
Alex Pounds: You initially came to Go because you were looking for something to build your graph library in, and now you are lead maintainer of Dep and also created gps. How did you get from A to B?
Sam Boyer: I have some background doing things related to this. In 2010, I was the lead engineer and architect responsible for moving the Drupal project from CVS to Git. Drupal's code is all self-hosted on drupal.org. There are tens of thousands of projects. I think we migrated maybe 16,000 or something like that, distinct projects from people all over the community, and we needed to move those over and get repositories out of one giant CVS monorepo.
Sam Boyer: The project took about six months, but it meant that I had some background with things sort of related to package management, at the very least.
Sam Boyer: I got my hands dirty with Git. Within Go, though, there is a history of how its package management has worked. It's been sparse, spare perhaps, we could say, since Go 1 was released in 2012.
Sam Boyer: go get is essentially the only official package manager. It took a few years for the community to agree that this really isn't enough. So I had been working with the guys responsible for Glide, one of the major package management efforts that was out there. I was getting more and more interested in this problem, and considering diving in deep on it, around December 2015.
Sam Boyer: Right around then, some discussion started breaking out in the community around where we should go with package management. This coincided with the vendor directory, a space that the compiler treats specially inside of a Go project structure. It's a place where you can put your dependencies, and the Go compiler will search there before it searches the other places. It allows you to achieve encapsulation of your dependencies inside of your project.
Sam Boyer: This discussion came up, and I basically said, "Okay, this is a hard problem." Really what I saw happening was I saw a bunch of people who were working on their own tools, a lot of people talking past each other, a lot of people really well-meaning people who just had made slightly different decisions six months or a year ago. And that had ended up making it very difficult for them to talk about what they saw as the main problems through the lens of the particular projects that they were working on.
Sam Boyer: I decided that I was going to try to address this by writing a manifesto, because I guess that's what you do in our world. It took about six weeks, and I wrote this giant long essay called So You Want to Write a Package Manager, in which I tried to just describe the whole domain of language package management and what the relevant choices are, and the trade-offs and how they interleave with the language that you're writing it for, and the competing concerns, the balance that you have to strike between the mushiness of human requirements and the precision of making things work for computers.
Alex Pounds: Okay. Let's take a step back for a second and hear a little bit more about how you got into computing and your background.
Sam Boyer: That is actually relevant here. I don't have any formal education in computer science or engineering. I was a medieval studies, international politics, and linguistics major as an undergrad, and then dropped out of a master's program, an interdisciplinary program in the social sciences. But I was also engaged in lots of social movements all over the place.
Sam Boyer: The thing that actually drove me into engineering was that I was working with social movements that really, really valued the equal participation of their members. These were social movements that would literally spread all over the world. What I had found was that the communication technologies that they were using really prevented them from being able to realize the goal of horizontal participation. So I was like, "All right, fine. I know computers, sort of." Really, I had just played video games for years, and didn't know anything about them.
Sam Boyer: I'll dive in and I'll figure out how to build platforms to facilitate the social structures that these movements say that they want. That's what got me into Drupal, and it goes from there. But I say it's relevant today because I've always been interested in communities, because that's what got me into computing in the first place, and in ways of making communities work better together.
Andrew Nesbitt: That maps really nicely onto an entry into the world of package management. So let's skip over a little bit of kind of the history of where Go managed to get itself into the mess of package management and look at the future of where Go is going with its package manager. I guess that's mostly focused around Dep. What kind of decisions have been made around Dep that stand out to you as being interesting or different from other application level package managers?
Sam Boyer: First, let me outline how things work today with Go. The only official package manager that we have is what's built into the Go toolchain, which is really just one command, for the most part. It's go get, and go get works by you giving it an address, basically a repository, and this is crucial: there is no central registry in the world of Go. You'll say, "go get github.com/whatever/whatever."
Sam Boyer: If you look at Go projects, and I see a lot of them, the installation instructions are go get and the address of the project. The Go tooling will then go and clone down a repository, and it will walk the import paths in that repository, which it can do because most import paths look like FQDNs. It walks those down and finds the dependencies. It goes and dumps them all into one big giant pool that's called the GOPATH. The thing is, go get essentially says there is really only one version, and it is the default tip branch, whatever it's called for the underlying version control system, and that's what it uses. You can also tell go get to update, go get -u, and it will go through and recursively update all of those dependencies in the giant shared pool that is the GOPATH.
Sam Boyer: This makes it very difficult to do a number of things. Because the GOPATH is one big space, and you can't have different versions of a given thing on the GOPATH. It's very difficult to develop multiple projects at the same time, if they have shared dependencies, because they're all looking back at GOPATH, and if they need different versions of them, then what are you going to do? Go switching the version of the thing on GOPATH? What if that's the repository that actually I was hacking on because we also hack on the things that are in our GOPATH.
Sam Boyer: There's no separation between the things that you are working on and the things that you just happen to depend on inside of GOPATH. The vendor directory that I mentioned earlier helps with this a lot. It lets us at the very least take things off of GOPATH and encapsulate them inside of our projects so that we no longer have multiple projects bouncing into each other.
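A minimal sketch of the lookup order described here, assuming a simplified model in which every path is relative to GOPATH/src and GOROOT is ignored. The function name vendorCandidates and the example project paths are hypothetical; this is not the Go toolchain's own code.

```go
package main

import (
	"fmt"
	"path"
)

// vendorCandidates lists, in search order, the places where a package living
// in importerDir could find importPath: the innermost vendor directory first,
// then the vendor directory of each parent, and finally the shared GOPATH copy.
func vendorCandidates(importerDir, importPath string) []string {
	var out []string
	for dir := path.Clean(importerDir); dir != "." && dir != "/"; dir = path.Dir(dir) {
		out = append(out, path.Join(dir, "vendor", importPath))
	}
	// Fall back to the one big shared pool.
	return append(out, importPath)
}

func main() {
	for i, c := range vendorCandidates("github.com/me/app/cmd", "github.com/foo/bar") {
		fmt.Printf("%d. %s\n", i+1, c)
	}
}
```

Because the project's own vendor directory comes before the shared pool, two projects can each carry a different copy of the same dependency without colliding on GOPATH.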
Sam Boyer: But it sort of induces new problems. Vendor is something that we're reliant on now. It's what we're built around now, but it's also something that, in the longer term, we hope to be able to move beyond. So, notable design features. One is there is no registry. We just talk directly to version control repositories. This is not ideal, but this is one of those things where Dep cannot make a different decision, because we need to operate inside of the constraints that the toolchain provides, primarily because the aim is for Dep to move into the toolchain. I can't say merge, it's not going to be a one-to-one thing, but we do want to integrate into the toolchain. The fewer things that we have which represent departures from the way that the toolchain works now, the easier that process is going to be.
Sam Boyer: We have a bunch of decisions which break with how the toolchain currently works that we've deferred until after or during that integration time. Registries are one. The other is the nature of names, which is kind of related. In most systems, the name that you use for your import or require or whatever statement is given its meaning by a registry. It's just a string until the tool takes it and hands it over to a registry. The registry is essentially operating in the role of DNS, when you really think about it. Given some input string, it decides what resource that ends up corresponding to.
Sam Boyer: Contrast that with Go, where we have these import paths that, like I said, actually contain github.com/foo/bar in them. That has meaning. That meaning is given to it by actual DNS. So, we have to work on top of that. That also gets really complicated because we have to be able to look at an arbitrary import string and decide how to deal with it without having to consult any external source. We have to do this sort of purely symbolic analysis of strings. Anybody who's ever done that before knows how terrifying that sentence should be. We have to look at an import path and say, this is github.com/foo/bar/[inaudible 00:13:12] or whatever. I have to be able to know, for any given string, where the root of that project is.
Sam Boyer: It's easy enough when it's GitHub, but you can import from anywhere. So, how do you deal with that? There are pattern rules around this, but it ends up informing a lot of things.
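To make the "where is the root of this project" question concrete, here is a small sketch of purely symbolic root deduction. It assumes only two well-known hosts whose repositories always sit three path segments deep; the table knownHosts, the function deduceRoot, and the error value are hypothetical names, and Dep's real pattern rules cover far more cases.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// knownHosts maps a hosting domain to the number of path segments
// (host included) that name a repository root on that host.
var knownHosts = map[string]int{
	"github.com":    3, // github.com/user/repo
	"bitbucket.org": 3, // bitbucket.org/user/repo
}

// errUnknownHost signals that the string alone does not reveal the root.
var errUnknownHost = errors.New("cannot deduce project root from the import path alone")

// deduceRoot returns the repository root for an import path using string
// analysis only, with no network access.
func deduceRoot(importPath string) (string, error) {
	parts := strings.Split(importPath, "/")
	depth, ok := knownHosts[parts[0]]
	if !ok {
		return "", errUnknownHost
	}
	if len(parts) < depth {
		return "", fmt.Errorf("%q is too short to name a repository on %s", importPath, parts[0])
	}
	return strings.Join(parts[:depth], "/"), nil
}

func main() {
	for _, p := range []string{"github.com/foo/bar/baz/qux", "example.com/some/pkg"} {
		root, err := deduceRoot(p)
		fmt.Printf("%s -> %q (err: %v)\n", p, root, err)
	}
}
```

The second path shows the hard case: for an arbitrary domain, nothing in the string says where the repository ends and the subpackage begins.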
Andrew Nesbitt: Do the import statements in Go currently allow you to describe the version numbers that are required, or is that specified elsewhere?
Sam Boyer: Nothing precludes you from doing that. There is a service out there called gopkg.in, which was created several years ago, that basically facilitates that. gopkg.in works by being sort of a proxy layer on top of GitHub. A common package that's often imported is gopkg.in/[inaudible 00:13:56]. That maps to github.com/go-yaml/yaml.
Sam Boyer: But gopkg.in lets you put a suffix on it. So you can say, like, import gopkg.in/yaml.v2. What will happen is, when go get or Dep talks to the gopkg.in server, gopkg.in proxies the clone of the Git repository through and actually modifies the set of references that Git reports on the fly to only those that correspond to that version number constraint.
Sam Boyer: This is the only place, though, where versions are intermixed with that import statement. That is something that we otherwise try to avoid. And in fact, Dep has entirely reimplemented the behavior of gopkg.in internally. So, we never talk to gopkg.in directly, because we have to map its model of versioning and constraints onto the model that Dep imposes on the world.
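As a rough illustration of the mapping, the sketch below turns a gopkg.in import path into the GitHub repository behind it plus the major version named by the suffix. It assumes the two documented gopkg.in shapes (gopkg.in/pkg.vN and gopkg.in/user/pkg.vN) with a plain integer major version; resolveGopkg is a hypothetical name and this is not Dep's internal implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveGopkg maps a gopkg.in import path to its GitHub repository and the
// major version requested by the ".vN" suffix.
func resolveGopkg(importPath string) (repo string, major int, err error) {
	const prefix = "gopkg.in/"
	if !strings.HasPrefix(importPath, prefix) {
		return "", 0, fmt.Errorf("%q is not a gopkg.in path", importPath)
	}
	segs := strings.Split(strings.TrimPrefix(importPath, prefix), "/")

	// The versioned segment is either first (pkg.vN) or second (user/pkg.vN).
	user, versioned := "", segs[0]
	if !strings.Contains(versioned, ".v") && len(segs) > 1 {
		user, versioned = segs[0], segs[1]
	}
	i := strings.LastIndex(versioned, ".v")
	if i <= 0 {
		return "", 0, fmt.Errorf("%q has no .vN version suffix", importPath)
	}
	name := versioned[:i]
	major, err = strconv.Atoi(versioned[i+2:])
	if err != nil {
		return "", 0, fmt.Errorf("%q has a non-numeric version suffix", importPath)
	}
	if user == "" {
		user = "go-" + name // gopkg.in/yaml.v2 lives under github.com/go-yaml
	}
	return "github.com/" + user + "/" + name, major, nil
}

func main() {
	repo, major, err := resolveGopkg("gopkg.in/yaml.v2")
	fmt.Println(repo, major, err)
}
```

A tool that reimplements this locally can treat the suffix as a constraint on the major version and then talk to the GitHub repository directly.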
Andrew Nesbitt: If you have two different Go files in the same project, that are looking to use a specific Go file from GitHub that are within the same repo, are they able to import two different versions of that same file?
Sam Boyer: They are not. We do not allow this. We debated this for a long time. It's something we probably could do, except that it would create a crazy fractal of complexity if we allowed it. It'd be very difficult to understand why it is that you have some portions of a repository imported at one version, and then other portions imported at a different version. And then that ends up spreading out to the dependencies of that repository as well. So, you have this split version brain thing happening that infects the entire dep graph.
Andrew Nesbitt: That's basically what the Node community has been running on for a long time, where they chose not to ensure that everyone is using the same version of the same package within a particular application. That actually lowers the barrier and removes some of those conflicts at, as I call them, the first and second levels of dependency, deferring them instead to the seventh or eighth level, which you're not going to run into so often, but when you do, it's going to cause some interesting things. Is that a limitation of Go, the language, as well that enforces that?
Sam Boyer: It actually is. We don't really have much choice about that. It's technically possible with the way that Vendor directories work. We could nest in the same way that NPM does. That basically everything gets its own encapsulated copy of the code that it relies on. The problem is the compiler blows up. The names of types are a product of the full path to them. If I rely on project foo, and some other project over there relies on project foo, but we each have our own copies of it, then instances of types from foo will be incompatible.
Sam Boyer: This doesn't always present a problem. But if I leak out the types from foo, and the other project leaks out the types from foo, and something else is relying on both of them, then when those types from foo come together, they will be incompatible, and the compiler will fail. And there's nothing that you can do about it. So, yes, with the way that Go works now, we pretty much have no choice but to flatten up to the topmost level.
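The incompatibility can be pictured with a toy model in which a named type is identified by the full path of its package plus its name. The typeID struct, the vendored helper, and the project paths are all hypothetical; the real check happens inside the compiler, not in code like this.

```go
package main

import (
	"fmt"
	"path"
)

// typeID models the identity of a named type: the full package path plus
// the type name.
type typeID struct {
	pkgPath string
	name    string
}

// vendored returns the full path of dep when it is nested inside owner's
// vendor directory.
func vendored(owner, dep string) string {
	return path.Join(owner, "vendor", dep)
}

func main() {
	// Two projects each carry their own nested copy of foo.
	mine := typeID{pkgPath: vendored("github.com/me/app", "github.com/x/foo"), name: "Client"}
	theirs := typeID{pkgPath: vendored("github.com/them/lib", "github.com/x/foo"), name: "Client"}

	// After flattening, both refer to the single top-level copy.
	flatA := typeID{pkgPath: vendored("github.com/me/app", "github.com/x/foo"), name: "Client"}
	flatB := typeID{pkgPath: vendored("github.com/me/app", "github.com/x/foo"), name: "Client"}

	fmt.Println(mine == theirs) // false: same source, different full paths
	fmt.Println(flatA == flatB) // true: one copy, one identity
}
```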
Andrew Nesbitt: At least the compiler will find those problems for you before you actually deploy those things, right? In the world of JavaScript, you may never know that you are running two different versions of that same thing. That might be fine. It also might cause some very strange issues. I find it interesting that the NPM community for the longest time decided to go with that trade-off. It definitely shows in the way that NPM was seen as solving dependency hell by not making a big deal of the subtlety. It meant that they could actually push deeper and deeper dependency trees and build up a bigger community over a longer period of time with much more complex dependency trees, because they didn't have to do this collaboration and coordination between different package maintainers.
Andrew Nesbitt: Do you think that has had an impact on the adoption of dependencies within the Go community?
Sam Boyer: I do. I absolutely do. Yeah, I entirely agree with that assessment of NPM. To be very clear, I believe that this is a trade-off. I don't believe there is a right answer here. Interestingly, it's a trade-off which might be surprising for folks who are not package management nerds, but this is one of those trade-offs which is deeply interlinked with the semantics of the language itself.
Sam Boyer: I think that lack of good dependency management has harmed the Go community in other ways. What we're looking at going forward is some of the trade-off of growth versus, I guess, correctness, though safety is probably a better word; growth versus safety, that's inherent in this: do we flatten and require just one version, or do we make trees like NPM?
Sam Boyer: But to my mind, I do think that lack of dependency management has harmed Go's growth. I think that either of those scenarios, whether it is tree-based, which like I said isn't really feasible for Go right now, or flattening, is better than the historical situation for Go. So, incremental improvement either way.
Alex Pounds: Where did that historical context come from? Because they must have seemed like good ideas at the time.
Sam Boyer: I think there are a few things there. The reality is that when they were originally working on all this, Go was first announced, I think, in November 2009, I really hope I get that right, and then Go 1 was released in 2012, I think. There were a couple of things going on. These ideas were more nascent then. Yes, NPM existed, yes, Bundler existed, but part of it, I think, is the static versus dynamic language divide. Part of it as well is the Go 1 compatibility guarantee, which promises that the standard library, at the very least, will not have backwards-incompatible changes. They have a semi-formal definition of that, which we are actually turning into a more formal definition and putting into a tool that, similar to how [inaudible 00:20:37] package works, will tell you what your next [inaudible 00:20:40] release should be. We are codifying what [inaudible 00:20:43] means in the context of Go.
Sam Boyer: So, yeah, I think they were focused on this compatibility guarantee that they had within the standard library. There's also just the basic fact that the folks who were designing it work at Google. Google has a large monorepo. They just don't, on a day-to-day basis, have to deal with these kinds of problems. So, it wasn't something that was necessarily a felt experience for them. I can't blame anyone for not having those things pressing on their lives all day.
Sam Boyer: From there, I think what ended up happening was, we had a couple of years of, all right, we need to really try to make these things work. A bunch of arguments got interlaced with this, right? Like, Go is a simple language, and don't write backwards-incompatible code. A bunch of the things that are really positive about Go ended up being brought up as cultural approaches to countering the deficiencies in the package management situation.
Sam Boyer: What's really interesting now is that I firmly believe that the Go community, they have a lot of culture around things like not making backwards incompatible changes. Which I don't care how good your package manager is, that's always a complicated situation. I am happy that we have this past that we've had because now going into this era where we have better package management, we still have a lot of the residue of this culture that will help us be better stewards of our ecosystem even when we have tools that let us be a lot lazier.
Alex Pounds: Go is a compiled language, and it's also a static language. Do those two elements have any implications for how it does package management?
Sam Boyer: Yes, it does. It definitely does have an impact on how we do package management. I'll talk about the compiled aspect first, because that is the easier and more obvious one. Whereas a dynamic language needs the dependencies to be present in order to actually run, which means that you are NPM installing or Composer installing on your production boxes, Go being a compiled language means that the last place that you actually need to collect your dependencies is your build box. Go is all about the static [inaudible 00:22:49] binaries; you just chuck that thing in production and you're done.
Sam Boyer: Nothing really jumps to mind, though, as a terribly big difference in terms of how we do the rest of our work just because it's compiled. Being statically typed, however, is really interesting. I'm really interested in doing on-the-fly type analysis, in much the same way that a compiler does, to determine whether or not versions are compatible. This is something that I was much more hopeful about maybe last year; I thought that we could do a lot more with it. As I've looked into it more, I've realized that even in a language with a very simple type system like Go's, it's not feasible to prove everything all the way down. But I have this hope growing in the back of my mind that we can create a complementary system that exists alongside versions and version constraints and [inaudible 00:23:42] and all of that, which is actually doing the checking on the fly to say, "Oh, no, you can't actually use this dependency at this version, even though your constraints say that you can," because we can look and see that you are trying to access a field on a struct that doesn't actually exist in that version. We cannot say that code is compatible, but we can say that it is incompatible.
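A minimal sketch of the "we can say that it is incompatible" idea, using the standard go/parser and go/ast packages. It assumes the dependency is a single source file and only checks named struct fields; structFields, fieldUse, provablyIncompatible, and the two sample sources are hypothetical, and nothing here is taken from gps.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// Two hypothetical versions of the same dependency.
const v1Source = `package foo

type Config struct {
	Host string
}
`

const v2Source = `package foo

type Config struct {
	Host string
	Port int
}
`

// structFields parses Go source and returns, for each struct type declared
// in it, the set of its named fields (embedded fields are ignored).
func structFields(src string) (map[string]map[string]bool, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "dep.go", src, 0)
	if err != nil {
		return nil, err
	}
	out := make(map[string]map[string]bool)
	ast.Inspect(file, func(n ast.Node) bool {
		spec, ok := n.(*ast.TypeSpec)
		if !ok {
			return true
		}
		st, ok := spec.Type.(*ast.StructType)
		if !ok {
			return true
		}
		fields := make(map[string]bool)
		for _, f := range st.Fields.List {
			for _, name := range f.Names {
				fields[name.Name] = true
			}
		}
		out[spec.Name.Name] = fields
		return true
	})
	return out, nil
}

// fieldUse is a struct field that the depending project refers to.
type fieldUse struct{ Struct, Field string }

// provablyIncompatible returns the uses that cannot compile against src.
// An empty result does not prove that the code is compatible.
func provablyIncompatible(src string, uses []fieldUse) ([]fieldUse, error) {
	api, err := structFields(src)
	if err != nil {
		return nil, err
	}
	var missing []fieldUse
	for _, u := range uses {
		if !api[u.Struct][u.Field] {
			missing = append(missing, u)
		}
	}
	return missing, nil
}

func main() {
	uses := []fieldUse{{Struct: "Config", Field: "Port"}}
	for name, src := range map[string]string{"v1": v1Source, "v2": v2Source} {
		missing, err := provablyIncompatible(src, uses)
		fmt.Println(name, "missing:", missing, "err:", err)
	}
}
```

The asymmetry matches the point made above: a missing field is a proof of incompatibility, while finding every field only means this particular check found nothing wrong.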
Andrew Nesbitt: That's quite an interesting aspect that could be brought back into the discovery side of the package management, which I guess Go has punted on that. Mostly out to say, GitHub can solve this and you find your projects by just searching for Go projects on GitHub or Bitbucket. One thing that you might be able to do with a system like that will be able to introspect each version of each package and its dependencies, and basically be able to knock out all of the incompatible packages that essentially wouldn't be able to build a tree, potentially reducing the amount of dead packages or packages that are incompatible with other ones that you know to be in your dependency tree already. And actually reduce down to here are some good probably known working versions of things that you'd be able to use. Could you imagine that being a tool that would essentially index and connect the dots between every different Go package?
Sam Boyer: Yes, I absolutely think that's something that we can do. I designed gps around the idea that all of the work that it's doing right now to extract information from source code, to feed it into the solver that's making decisions about what things work together, could be incrementally pushed towards the edge, cached somewhere, whatever.
Sam Boyer: The point is, yes. Right now, we do almost all of the work locally. If we develop registries where we can trust that they've done some of the work already, then we can reuse that work, and we can avoid having to explore a whole bunch of already-known-to-be-incompatible versions. We don't have to do relatively expensive type compatibility analysis, because it can be done once and then just looked up.
Andrew Nesbitt: You could even potentially go as far as pre-compiling, or at least collecting sections of that dependency tree especially if you know that it doesn't clash with any other pieces of the dependency tree of the whole application to save doing that resolving over and over.
Sam Boyer: Yeah, it does get tricky because it's graphs, man, and graphs are sneaky. I've thought some way down these paths; I haven't gone all the way down them because we have plenty of work right in front of us to do. But a core idea in Dep is this notion of a sync-based tool, which is to say that we think about there being a few discrete states that we want to synchronize between. There are basically four states. There is your project code itself, there is your manifest-type file where you express constraints, there is your lock-type file, which is the fully [inaudible 00:26:49] resolved snapshot of your builds, and then there's the space where your actual dependencies are stored. Again, source code.
Sam Boyer: The goal of being sync-based as a design philosophy is that the tool should always seek to exit where all four of these states are in sync, are correct with respect to some function that defines a relationship between them. The hardest part of that, for some languages at least, tends to be mapping the project code onto the lock file. For Go, it's easy because we have fast static analysis and we have unambiguous names to work with. We can just keep all this stuff in sync all the time.
Sam Boyer: The reason that the relationship between the different states matters, and the function that maps project and manifest to lock matters, is that in Dep, what we do is we take a hash of all of the inputs. That is all your import statements, all of your constraints, all of the information that is relevant to constituting a lock, and we hash it. We have a SHA-256 hash digest, which sits in your lock file. That information can, interestingly, potentially be used as a key that we could fire off to remote servers.
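A small sketch of hashing solver inputs into a digest, assuming the inputs are just a list of import paths and a map of constraints. The function inputsDigest and the line format it writes are hypothetical and do not reproduce the exact bytes Dep hashes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// inputsDigest returns a SHA-256 hex digest over the imports and constraints.
// Inputs are sorted first so the digest does not depend on discovery order.
func inputsDigest(imports []string, constraints map[string]string) string {
	h := sha256.New()

	sorted := append([]string(nil), imports...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	for _, imp := range sorted {
		fmt.Fprintf(h, "import %s\n", imp)
	}

	names := make([]string, 0, len(constraints))
	for name := range constraints {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Fprintf(h, "constraint %s %s\n", name, constraints[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	digest := inputsDigest(
		[]string{"github.com/foo/bar", "gopkg.in/yaml.v2"},
		map[string]string{"github.com/foo/bar": "^1.2.0"},
	)
	fmt.Println(digest)
}
```

If the digest stored in the lock file matches the digest of the current inputs, the lock is still in sync with the project and manifest; the same value could serve as a lookup key for a previously computed solution.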
Sam Boyer: I've done some work, I've worked some things out. For these inputs at this time, this was a good solution. This one worked.
Andrew Nesbitt: Interestingly, I've done a similar thing with the Libraries.io database, where I've indexed every version of everything. Go, being slightly different in not having discrete versions, currently doesn't work with this implementation. But most package managers have that kind of published discrete version that can't be easily changed or have its code replaced, in the sense of a Git force push to a particular tag. Because I have the timestamps for the release dates of all of the versions, I've written a poor man's dependency resolver that you can pass a date, which will scope the SQL queries it runs against Postgres as it does a recursive resolve, to be able to say, can you give me the dependency tree back on this date.
Andrew Nesbitt: You can actually roll back in time and say, I'd like you to get me that dependency tree from last week, because I don't have the lock file to be able to reproduce what was there. For some package managers that don't have lock files, that's really useful. The R package manager doesn't actually have the ability to set an upper bound on version numbers when declaring a version range.
Andrew Nesbitt: If someone publishes a newer version that you don't want, you actually can't say no, I don't want to install that version. I want the previous one. The solution to that is that they take a snapshot of their registry every day, and actually, you can then say, I will point my package manager at last week's registry and I will get exactly back what I want.
Andrew Nesbitt: The nice thing about doing that is that at least you can say, I've frozen the world and I can be sure that every time I resolve this, and any caching that I do around this will be identical every time.
Sam Boyer: That's fascinating on multiple levels. One, simply because by precluding the possibility of specifying an upper bound on ranges, you can avoid satisfiability, which is interesting. I'm thinking about this now, though, as a thing that we can do at least for reconstructive purposes, and how much it would take to make Dep operate in that way. It actually wouldn't be that hard with the way that our system works. We could pass in basically just a date to the master object, the source manager, which works with all of the upstream repositories, and say, don't allow any versions newer than this date, and it would work.
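The date cutoff can be sketched as a simple filter over known versions. The version struct, the helper day, and the function notNewerThan are hypothetical names used for illustration; this is not the interface of Dep's source manager.

```go
package main

import (
	"fmt"
	"time"
)

// version pairs a version name with the time it was published upstream.
type version struct {
	Name      string
	Published time.Time
}

// day builds a UTC date, to keep the examples short.
func day(year, month, dayOfMonth int) time.Time {
	return time.Date(year, time.Month(month), dayOfMonth, 0, 0, 0, 0, time.UTC)
}

// notNewerThan keeps only the versions published at or before the cutoff,
// so a solver working from the result sees the world as it was on that date.
func notNewerThan(versions []version, cutoff time.Time) []version {
	var out []version
	for _, v := range versions {
		if !v.Published.After(cutoff) {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	all := []version{
		{Name: "v1.0.0", Published: day(2017, 1, 10)},
		{Name: "v1.1.0", Published: day(2017, 3, 2)},
		{Name: "v2.0.0", Published: day(2017, 6, 20)},
	}
	for _, v := range notNewerThan(all, day(2017, 4, 1)) {
		fmt.Println(v.Name)
	}
}
```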
Andrew Nesbitt: Yeah, assuming that no one has changed those repositories since that point. The force push is always going to cause you pain where you just say, I hope that this mutable file system over there hasn't changed since the last time that I asked for it.
Alex Pounds: This is a nice segue into one of the questions that I had, in that lacking a centralized repository and a centralized place where there is a canonical package, it seems like Go might be more vulnerable to things like history and a Git repo being rewritten, or a Git repo completely going away, or something else changing underneath. Have there been problems with that in Go?
Sam Boyer: Yes. And guarding against some of those issues has probably doubled my development time last year when working on gps. There's the left-pad problem: the upstream project just goes away. There's some split in the Go community as to whether or not you should commit your vendor directory into your project. Really, the main reason to do it, and this will always be a reason to do it, is because you're immune from left-pad if you do. You've got a copy of the code locally, so you can survive without the upstream.
Sam Boyer: Dep, actually does not work well with that right now. Dep really does continue to require the upstream to be present. We treat Vendor as dead code. You can have something in there, but if it's gone away upstream, then Dep will blow up.
Sam Boyer: Tags moving though is definitely something that we had to accommodate from the beginning. When I approached this, I designed a system for versions that was centrally designed around Git, but is equally applicable to the four version control systems that we have to support because that's what GoGet already supports. That is get, Mercurial bizarre and subversion. That version system has four types in it. You've got semantic versions which correspond to they have to be tags in the Git sense and then they have to actually conform to the shape that is specified by [inaudible 00:32:57]
Sam Boyer: There can be branches, which are whatever underlying constraints are imposed by the version control system, or the boundaries there. And then you can have these plain tags, which again are like a Git tag, but one which doesn't match with [inaudible 00:33:10] But the semantics that we assigned to them are that ... There's branches, which are references that we expect to move. It is a normal thing for them to move. By move, I mean they correspond to a different underlying revision.
Sam Boyer: The other piece of this is that all three of [inaudible 00:33:30], branch, and plain tag are assumed to have, or it is possible to pair them with, exactly one underlying revision. And then revisions are something which are immutable. That is true for at least all four of those systems that we have to work with. There is an immutable revision that we can get out of them.
Sam Boyer: We expect branches to move, and we accept that it is possible that either [inaudible 00:33:59] or plain tags will move. Even though it's frowned upon, it's entirely possible, and the system cannot break if it happens. Functionally, we treat these mostly the same. We index everything by revision. And then we have a map on top of that, which are the plain, [inaudible 00:34:14] or branch, that correspond to those revisions. And when they move, really the only functional difference is if it's a branch, well, it's normal for it to move. If it's one of the tag types that moved, then that's something that we would produce a warning about, or would ideally, but I don't think we currently do. But yeah, we definitely had to architect around this from the very beginning, because we don't have something like a registry to provide us guarantees like [inaudible 00:34:38] does, that it's immutable and append-only. It's one of the properties that I really, really want to have in a registry.
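The version model described above (movable names paired with immutable revisions, indexed by revision, with a warning when a tag moves) can be sketched roughly as follows. This is a minimal illustration and not gps's actual code: the names Kind, Ref, Index, and Update are hypothetical, and the warning is the behavior described as ideal rather than something Dep is confirmed to do.

```go
package main

import "fmt"

// Kind distinguishes the three movable version types from the discussion.
// All names here are hypothetical, not gps's actual API.
type Kind int

const (
	Semver   Kind = iota // a tag that parses as a semantic version
	PlainTag             // a tag that does not
	Branch               // a reference that is expected to move
)

// Revision is the immutable identifier from the VCS (for Git, a commit hash).
type Revision string

// Ref is a named version; at any moment it pairs with exactly one revision.
type Ref struct {
	Kind Kind
	Name string
}

// Index stores everything by revision, with a map from refs on top.
type Index struct {
	byRev map[Revision][]Ref
	revOf map[Ref]Revision
}

func NewIndex() *Index {
	return &Index{byRev: map[Revision][]Ref{}, revOf: map[Ref]Revision{}}
}

// Update records that ref now points at rev. A branch moving is normal;
// a tag moving yields a non-empty warning.
func (ix *Index) Update(ref Ref, rev Revision) string {
	old, seen := ix.revOf[ref]
	if seen && old == rev {
		return ""
	}
	warning := ""
	if seen {
		// Detach the ref from the revision it used to point at.
		kept := ix.byRev[old][:0]
		for _, r := range ix.byRev[old] {
			if r != ref {
				kept = append(kept, r)
			}
		}
		ix.byRev[old] = kept
		if ref.Kind != Branch {
			warning = fmt.Sprintf("tag %q moved from %s to %s", ref.Name, old, rev)
		}
	}
	ix.revOf[ref] = rev
	ix.byRev[rev] = append(ix.byRev[rev], ref)
	return warning
}

func main() {
	ix := NewIndex()
	tag := Ref{Kind: Semver, Name: "v1.0.0"}
	ix.Update(tag, "aaa111")
	fmt.Println(ix.Update(tag, "bbb222")) // the tag moved, so a warning is printed
}
```

Because the revision is the key and the names are only a layer on top, a moved branch and a moved tag go through the same code path; the only difference is whether a warning comes back.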
Alex Pounds: It also seems like Go could be vulnerable to some kind of impersonation attacks as well. One thing we've seen with other package managers is malicious packages being published which have names very close to an official package name and hoping that people fat-finger it. But in Go, it seems like you could have a package with the same package name. It's a different repository path on GitHub that would be from a different owner, but anyone can publish their own Git repo. Has that been an issue?
Sam Boyer: Not to my knowledge. There's different degrees of impersonation. Sure, there's just fat-fingering it and actually getting a malicious package. We don't have any defense against that right now. Because there is no canonicality, how would we even know that you fat-fingered? We can't help you defend against that. What we do have, or are working on now anyway, is an additional layer of parsing that we do on dependency source code, so over and above the SHA-1 that we pull from Git, there is a SHA-256 hash digest, which will go into the lock file, which identifies the Go code. We can do that for any of the four different backends. And we're ultimately going to use that for a bunch more things.
Sam Boyer: That at the very least helps with attacks where, if someone is somehow intercepted, if you're using a proxy or something like that. If someone has attacked your company's proxy and they've stuck a false Git repository into there. As long as someone had a good version of whatever upstream project was there before the attack, then those hashes won't match and everything will blow up and everything will stop. But as long as we're in the domain of what is the right project to grab, yeah, there's not a lot we can do about that. But maybe I misunderstood the problem you were posing.
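The idea of recording a SHA-256 digest of a dependency's code in the lock file, and refusing to proceed when a later fetch does not match, can be sketched like this. The transcript does not describe Dep's actual hashing scheme, so this is an assumption-laden illustration: DigestTree and Verify are hypothetical names, and the files are passed as an in-memory map rather than read from a vendor directory.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// DigestTree returns a hex SHA-256 digest over a set of files given as
// path -> contents. Paths are hashed in sorted order so the result is
// deterministic. (Hypothetical scheme, not Dep's real one.)
func DigestTree(files map[string][]byte) string {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths)

	h := sha256.New()
	for _, p := range paths {
		// Length-prefix the path and the contents so that moving bytes
		// between a name and a body cannot produce the same stream.
		fmt.Fprintf(h, "%d:%s%d:", len(p), p, len(files[p]))
		h.Write(files[p])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// Verify reports whether freshly fetched files match the digest that was
// recorded in the lock file.
func Verify(files map[string][]byte, locked string) bool {
	return DigestTree(files) == locked
}

func main() {
	good := map[string][]byte{"pad.go": []byte("package pad\n")}
	locked := DigestTree(good) // what would be written to the lock file

	tampered := map[string][]byte{"pad.go": []byte("package pad // evil\n")}
	fmt.Println(Verify(good, locked), Verify(tampered, locked))
}
```

This only helps in the scenario described: someone must have recorded a digest of a good copy before the attack. It says nothing about whether the project was the right one to fetch in the first place.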
Alex Pounds: No, that is the problem I was posing. It was less a question of what does Go do to fix this, because there may not be things that you can do to fix it. I'm more just asking, how's it come up?
Sam Boyer: I think part of what I'm saying is how would we even know if it did?
Andrew Nesbitt: That's a much scarier question. You have that with compilers, where how do you know that the compiler that was compiled with a different compiler didn't compile something horrible into that compiler? You need a full chain of provenance to be able to prove where everything came from. At least Go is downloading the source code and not downloading binaries. It's at least right there to be able to introspect it before you work on it and be able to see everything that's going into it. But yeah, you're always going to be relying on, at some point, I downloaded something from the internet, I should probably review this or at least have some way to look at other people's previous reviews to make sure that that is what is expected.
Andrew Nesbitt: But I get the feeling that that hasn't really been solved by any package manager community, at least at the application level. The system level, people have a much better grasp on that, but they're also dealing with a lot more binaries and pre compiled things from different sources that they need to be able to prove that actually came from where we expect it came from.
Sam Boyer: This is where we get into, are you going to sign your releases, and whose signatures do we trust? It becomes much more of a human problem. I think the other factor of why it is more feasible perhaps for system package managers to take steps like that, whether they do or not, or whether they do them well or not, is because the community of people who are working on putting together a set of packages that constitute an overall release for Ubuntu or whatever, they know each other, it's a finite group, they can do things like sign each other's keys and actually attach signatures to these.
Sam Boyer: It's possible to know all of the signatures that you should trust for a given release of Ubuntu, which you can then use to verify the packages that you get on the other side. But because language communities are pretty much by definition open and not finite, you have this just constant problem of, which all people do I trust? It's much harder to implement a solution of that shape.
Alex Pounds: One thing you mentioned earlier was Satisfiability, what is Satisfiability, and how does it relate to package management?
Sam Boyer: Satisfiability is a general problem in math and computer science. It is NP complete. It's the first problem that was proven to be NP complete. It's one of Karp's original 21 NP complete problems. I like to refer to it as the original gangster of NP complete problems, because you can reduce everything to SAT. Indeed, one of the things that's interesting about SAT solvers, Boolean Satisfiability solvers, is that you can reduce everything to them, because they are this nice, general way of expressing these constraint problems. In general, what they're doing is: I have a whole bunch of different propositions, can I find an assignment for all of the variables that will evaluate to true for all of the propositions?
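To make the question "can I find an assignment for all of the variables that evaluates to true for all of the propositions" concrete, here is a deliberately naive sketch that tries every assignment. It is not how real SAT solvers work; it only illustrates the problem statement. The Clause encoding (positive integer n for variable n, negative for its negation) is a common convention, and the function names are made up for this example.

```go
package main

import "fmt"

// Clause is a disjunction of literals. A positive n means "variable n is
// true", a negative n means "variable n is false". Variables start at 1.
type Clause []int

// Satisfiable tries all 2^numVars assignments and returns the first one
// that makes every clause true. Exponential on purpose: this is the
// brute-force baseline, not a practical solver.
func Satisfiable(numVars int, clauses []Clause) ([]bool, bool) {
	assign := make([]bool, numVars+1) // index 0 is unused
	for mask := 0; mask < 1<<numVars; mask++ {
		for v := 1; v <= numVars; v++ {
			assign[v] = mask&(1<<(v-1)) != 0
		}
		if allTrue(assign, clauses) {
			return assign, true
		}
	}
	return nil, false
}

// allTrue reports whether every clause has at least one true literal.
func allTrue(assign []bool, clauses []Clause) bool {
	for _, c := range clauses {
		sat := false
		for _, lit := range c {
			if lit > 0 && assign[lit] || lit < 0 && !assign[-lit] {
				sat = true
				break
			}
		}
		if !sat {
			return false
		}
	}
	return true
}

func main() {
	// (x1 OR x2) AND (NOT x1 OR x2) AND (NOT x2 OR x3)
	clauses := []Clause{{1, 2}, {-1, 2}, {-2, 3}}
	assign, ok := Satisfiable(3, clauses)
	fmt.Println(ok, assign)
}
```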
Sam Boyer: But because you can reduce all these different problems to them, SAT solvers are very useful because you can write a general SAT solver, and in theory, at least, it can solve problems in a whole ton of different domains. Everything from weather prediction to problems in AI to version selection. Indeed, the problem of version selection, that is, figuring out ... And I can put a narrative spin on this one more easily. As long as you have the requirements that one, we want exactly one version of any given project in our dependency graph; two, we want to honor all of the constraints expressed by all of the different projects in the dependency graph, including any situation where you have multiple things depending upon one dependency; and then there's a bunch of versions of these individual ones that we might try, then we get into a situation where we need to potentially search this very large combinatorial space of different versions and different constraints in order to try to find an overall dependency graph where all of the different constraints are satisfied, and all the different required packages are present, and in the graph there is exactly one version of each of them. It's a knotty problem.
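The two requirements above (exactly one version per project, and every constraint from every selected project honored) can be expressed as a small backtracking search. This is a toy sketch under simplifying assumptions, not gps's algorithm: versions are plain integers, constraints are inclusive ranges, and Registry, Req, Range, and Solve are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Range is an inclusive range of acceptable versions.
type Range struct{ Min, Max int }

func (r Range) Allows(v int) bool { return r.Min <= v && v <= r.Max }

// Req is one constraint: some project must be within some range.
type Req struct {
	Name string
	R    Range
}

// Registry maps project -> version -> that version's own requirements.
type Registry map[string]map[int][]Req

// Solve picks exactly one version per required project such that all
// constraints hold, or reports failure.
func Solve(reg Registry, root []Req) (map[string]int, bool) {
	selected := map[string]int{}
	if solve(reg, root, selected) {
		return selected, true
	}
	return nil, false
}

func solve(reg Registry, reqs []Req, selected map[string]int) bool {
	// Every already-selected project must satisfy every constraint so far.
	for _, rq := range reqs {
		if v, ok := selected[rq.Name]; ok && !rq.R.Allows(v) {
			return false
		}
	}
	for _, rq := range reqs {
		if _, ok := selected[rq.Name]; ok {
			continue
		}
		// Try versions of the first unselected project, newest first.
		versions := make([]int, 0, len(reg[rq.Name]))
		for v := range reg[rq.Name] {
			versions = append(versions, v)
		}
		sort.Sort(sort.Reverse(sort.IntSlice(versions)))
		for _, v := range versions {
			if !allowedByAll(reqs, rq.Name, v) {
				continue
			}
			selected[rq.Name] = v
			next := append(append([]Req{}, reqs...), reg[rq.Name][v]...)
			if solve(reg, next, selected) {
				return true
			}
			delete(selected, rq.Name) // backtrack and try the next version
		}
		return false // no version of this project works here
	}
	return true // every required project has a compatible version
}

func allowedByAll(reqs []Req, name string, v int) bool {
	for _, rq := range reqs {
		if rq.Name == name && !rq.R.Allows(v) {
			return false
		}
	}
	return true
}

func main() {
	reg := Registry{
		"A": {1: {{"C", Range{1, 1}}}, 2: {{"C", Range{2, 2}}}},
		"B": {1: {{"C", Range{1, 1}}}},
		"C": {1: nil, 2: nil},
	}
	fmt.Println(Solve(reg, []Req{{"A", Range{1, 2}}, {"B", Range{1, 1}}}))
}
```

In the example, the newest A wants C at version 2 while B wants C at version 1, so the search has to back out of its first choice for A. With more projects and more versions per project, the number of combinations to try grows quickly, which is the combinatorial space being described.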
Alex Pounds: We talked a little bit about a number of complicated mathematical and computer science problems that go into package management. How did you go about researching and learning about these different things?
Sam Boyer: I started looking at this a number of months ago. Really, what interested me was that, given that SAT is NP complete, and given that NP complete problems are generally terrible problems, where any decent engineer should look at an NP complete problem and the first reaction is, well, I'm not solving that, how do we go around this? The question to me was, how is it that we have so many different projects out there for solving problems more or less like this, both at the system package management level and at the language level? How is it not more terrible?
Sam Boyer: This problem is terrible, why is everything not burning down around us all the time? Which is a real question. These problems are NP complete, people don't even necessarily know that the problem that they're solving is NP complete, why isn't everything broken all the time?
Sam Boyer: I started looking at MiniSat, which is a sort of entry-level gateway project in the world of Satisfiability, which is its own little interesting subculture of computer science. MiniSat is designed to really be as simple a solver as possible, but still a very highly effective one. Such that you can get in and get your hands dirty, and start understanding how some of these basic Satisfiability algorithms work.
Sam Boyer: I started trying to do that. But eventually I realized that the problem that I was having was that I did not have ... Sure, I can go read about Horn clauses all day. But I'm not going to be able to get my head around what it is that is perhaps, so much more powerful about these generic SAT solvers versus what it is that we have today?
Sam Boyer: I went out, and I bought Donald Knuth's, The Art of Computer Programming, volume four, fascicle six on Satisfiability. I actually read it, or at least most of it, enough of it to try to figure out what was going on. Why aren't things more terrible, like I said? I came to some interesting conclusions. The way that Satisfiability usually gets talked about, at least from my perspective, it feels like, oh, we talked about SAT solvers, it's these amazing mythical things that have magical powers for solving this horribly hard problem.
Sam Boyer: But the thing is that it's not just like, you can take some problem that you have that exists in some domain over here, and have it magically fed to the SAT solver. No, you need to actually take your problem and figure out how to encode it in a way that the SAT solver can read. This is more or less a straight quote from Knuth right now, there is at least as much art and complexity in the process of encoding SAT, as there is in actually solving SAT. Once I read that, I was like, okay, everything makes sense now, because Boolean algebra is sparse, you've just got true and false, and then clauses that stack up on top of each other.
Sam Boyer: Pretty obviously, if you're going to take a domain which is as complex and has as many different types of constraints and overlapping concepts as version selection, then you're going to have to do a whole lot of translation and re-modeling of that problem in order to feed it to a SAT solver in a way that it will recognize.
Sam Boyer: The thing that I learned is that part of the reason that SAT research is so difficult, and that fully general SAT solving is so difficult, is indeed because solving on that incredibly sparse input makes it very, very difficult to know what sort of patterns you might look for in order to take advantage of patterns and structures in the data. The general solvers have to figure out what the patterns are as they're going in order to figure out what the right algorithm is to apply.
Sam Boyer: Many industrial SAT solvers actually employ multiple different strategies and will sometimes dynamically pick different strategies depending on what they've seen from the input so far. Now, a little history: in the late '90s, a technique for fully general SAT solvers was discovered. It's referred to as conflict-driven clause learning. This was basically the difference between SAT being a ridiculous problem that no one could ever solve, and suddenly like, oh wait, this is useful for industry now, once CDCL solvers were discovered.
Sam Boyer: When I started reading about the basic way that CDCL solvers work, they have a concept of a trail and later discovered notions of back jumping, and then they have this bounded model checking stuff that's going on in there as well. I'm like, oh, my God, I have all of these things in my solver, but I didn't have to think about them. Because these are just things that make sense inside of the domain of package management. Once we get out of this incredibly just abstracted thing that are Boolean proposition clauses, and we move back to the richer environment, that is, I have projects and they have versions and they have dependencies and they have constraints, there are relatively obvious steps that an implementer can take in order to solve the problem much more efficiently that are much harder to see in the abstracted context. In fact, the notion of back jumping, which is essentially when a solver encounters an irreconcilable conflict, it has to decide what to do next.
Sam Boyer: If I'm trying to pull in project A, and then I've already pulled in Project B, and Project B has a dependency on project C. A also has a dependency on C, and I see that the constraints that A and B declare on C are disjoint. There's no way that we could ever find something that would satisfy both of them. I have to decide what to do next.
Sam Boyer: So, back jumping in general says, given that we've encountered a failure at this point, instead of just blindly backtracking to the next choice that we could make, we're going to back jump to the last known failure point. It's a smarter backtrack that knows the next place that is likely to have, or even could possibly have, a resolution.
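The A, B, and C example and the back jump can be sketched as follows. This is a heavily simplified illustration, not gps's solver and not a full CDCL implementation: it assumes the solver keeps an ordered trail of decisions and, on a conflict, knows which projects contributed the clashing constraints (the "culprits"). Decision, Disjoint, and Backjump are hypothetical names, and real solvers track conflict information more carefully than this.

```go
package main

import "fmt"

// Range is an inclusive range of acceptable versions.
type Range struct{ Min, Max int }

// Disjoint reports whether no version can satisfy both ranges.
func Disjoint(a, b Range) bool { return a.Max < b.Min || b.Max < a.Min }

// Decision is one entry on the solver's trail: a project pinned to a version.
type Decision struct {
	Project string
	Version int
}

// Backjump walks the trail from newest to oldest and stops at the most
// recent decision that (a) contributed to the conflict and (b) still has
// an untried alternative. Decisions in between are discarded without
// trying their alternatives, since changing them could not resolve the
// conflict. It returns the trail up to (not including) that decision.
func Backjump(trail []Decision, culprits map[string]bool, hasAlt func(Decision) bool) ([]Decision, Decision, bool) {
	for i := len(trail) - 1; i >= 0; i-- {
		d := trail[i]
		if culprits[d.Project] && hasAlt(d) {
			return trail[:i], d, true
		}
	}
	return nil, Decision{}, false
}

func main() {
	// B was chosen first, then two unrelated projects, then A.
	trail := []Decision{{"B", 1}, {"D", 3}, {"E", 2}, {"A", 2}}

	// A and B both constrain C, and their constraints cannot overlap.
	fromA, fromB := Range{2, 2}, Range{1, 1}
	if Disjoint(fromA, fromB) {
		culprits := map[string]bool{"A": true, "B": true}
		// Assume A has no other versions left to try, but B does.
		hasAlt := func(d Decision) bool { return d.Project != "A" }
		kept, retry, ok := Backjump(trail, culprits, hasAlt)
		fmt.Println(len(kept), retry.Project, ok)
	}
}
```

Plain backtracking would next try other versions of E and then D, none of which touch C. The back jump skips straight to B, the most recent decision that could actually change the outcome.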
Sam Boyer: That notion is fairly obvious when you're looking at things in the richness of the domain of real versions. It took an entire dissertation to come up with the notion of back jumping in the context of CDCL solvers in this totally abstracted domain.
Sam Boyer: So, this is my basic answer. The reason why we have managed to do relatively well with this is because it's quite possible, it seems for engineers who are more or less following their noses to come up with algorithms that strongly approximate the way that CDCL solvers work. They're just domain specific and CDCL solvers are very powerful. They have known failure modes and there are issues with that, but it's totally possible to write a domain specific CDCL solver for this problem that we have right here in front of us without having the mathematical background or even knowing that that's what you're doing, or even knowing that CDCL solvers are a thing.
Alex Pounds: What you're saying is anyone could build a package manager without any of this previous knowledge, and accidentally end up in the right place.
Sam Boyer: I think there's certainly plenty of places that you could go wrong, but yeah, that seems to be the case.
Alex Pounds: Yeah, it's not just that it could happen, but that it has happened repeatedly in every language.
Sam Boyer: Yes, over and over again.
Alex Pounds: Looking around the decisions and architectures of other package managers, Are there any of those things that you look upon with envy and that you wish you had in the Go world?
Sam Boyer: Yes, there are a few. I really want Cargo's yanks a lot. That's something that we just can't have without registries.
Alex Pounds: Tell us about yanks, what's a yank?
Sam Boyer: Sorry. Cargo allows you to yank a version. With crates.io being immutable and append only, you can't delete something off of it, but you can yank it. I haven't actually seen the code in Cargo, this is my understanding of how it works. When it's going through and trying to pick versions, it will see that a given version has been yanked, and it will not consider that in any new solves that are going through. However, if you currently have a version that's been yanked in your lock, then it's at least possible that it would remain in your lock until you explicitly choose to move it. But once you move away from it, you can't get back there.
Sam Boyer: It's a way of pulling versions that are known bad or have security problems without causing immediate downstream chaos, because the version is just gone. We can differentiate between something that should no longer be used, and something that perhaps never existed at all, and how did you have that thing in your lock file in the first place? Why were you editing your lock file? That's not what you do.
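The yank behavior as described here (skip yanked versions in new solves, but tolerate one that is already in the lock) can be sketched as a candidate filter. This follows the speaker's stated understanding and is not taken from Cargo's implementation; Release and Candidates are hypothetical names, and an empty locked string stands for "nothing locked yet".

```go
package main

import "fmt"

// Release is one published version in an append-only registry.
// Yanking flips a flag; it never removes the entry.
type Release struct {
	Version string
	Yanked  bool
}

// Candidates returns the versions a solver may consider. Yanked versions
// are skipped, except the one that is already recorded in the lock file.
// Pass locked == "" when there is no existing lock entry.
func Candidates(releases []Release, locked string) []string {
	var out []string
	for _, r := range releases {
		if r.Yanked && r.Version != locked {
			continue
		}
		out = append(out, r.Version)
	}
	return out
}

func main() {
	releases := []Release{
		{Version: "1.0.0"},
		{Version: "1.1.0", Yanked: true}, // known bad, yanked by its author
		{Version: "1.2.0"},
	}
	fmt.Println(Candidates(releases, ""))      // a fresh solve never sees 1.1.0
	fmt.Println(Candidates(releases, "1.1.0")) // an existing lock can keep it
	fmt.Println(Candidates(releases, "1.2.0")) // after moving away, it is gone
}
```

Because the yanked entry still exists in the registry, a tool can tell "this version should no longer be used" apart from "this version never existed", which is the distinction drawn above.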
Alex Pounds: Yeah, that exists in RubyGems as well. So, rubygems.org from ... I'm not exactly sure when it came in, but it was definitely Gemcutter, the new version of rubygems.org, that had that. It was kind of a reaction to people asking for code to be removed, and the fallout of that being incredibly painful in particular situations. It actually reduced the amount of support requests that they were getting, massively. And it's a really good way of saying, "Okay, well, you can stop people downloading this, but if they're going to specifically ask for that version, it's still going to be available. If you really, really want to remove it, you're going to have to send us a DMCA takedown request to actually yank that thing."
Alex Pounds: Which whenever you look at these kind of immutable systems, they still need to actually be able to follow the laws of the countries that they're based in, which often means that that code should never have been there in the first place even if there are many people that depend on that, it may have to go. But the yank is a nice middle ground and allows the users to work on it themselves without having to have some admin person be the ruler and the decision maker around a particular package manager. Which happens in some and others still manage to have a registry without needing a gatekeeper.
Sam Boyer: Yeah, that whole class of things you just described is one of the biggest sets of concerns that folks in the Go community have brought up about introducing a registry. Access control and having people who know this thing or that thing over who controls it, all those sorts of questions, which inevitably get complicated, and yanks are just great on every level. But as you say, not a replacement for deletion. There are still cases for that, but we just don't want deletion to be the normal workflow.
Alex Pounds: If people want to learn more about Go Dep and gps, where should they go?
Sam Boyer: The main place to go is github.com/golang/dep. We conduct most of our development just by issues. We do have a mailing list. It's not that well trafficked. Unfortunately, I'm not great at managing contributions by a mailing list. But we also then have a Slack channel in the Gophers Slack, the Vendor channel where we are very friendly and present and happy to answer questions and provide guidance to folks.
Sam Boyer: I also have been writing periodic updates on Dep cheekily called Dep Status, since that is a command that you can run. Those are all at sdboyer.io/depstatus.
Alex Pounds: Great. Thanks so much for coming on and sharing some of the more technical details and the low level problems involved in package management. It's been really interesting.
Sam Boyer: It's been wonderful. I rarely get to nerd out on things like this. I really enjoyed it.
Alex Pounds: That wraps everything up for this episode of The Manifest. The next time, we'll be talking to somebody new and exploring more of these package nerdy things. Bye for now.