
Configuration-based deploy scripts #99

Open
slifty opened this issue Feb 28, 2021 · 3 comments

Comments

@slifty
Contributor

slifty commented Feb 28, 2021

Right now the deploy processes for various competitions are explicit about the files they expect to exist (this is good!). This is done by having python code that explicitly hard codes the files that are being decrypted (and unzipped). The downside of this approach is that it is a bit more verbose than it might otherwise be, and some logic is repeated across the competitions.

What do we think of shifting to a deploy script model where the competition-specific stuff is handled via yaml or json, and the deploy script is a single script that is passed a configuration? For instance, the configuration might have the list of expected gpg files with optional checksums.

The script could then iterate through that list (and maybe even automatically detect zip files and inflate them).

This approach would mean we could still benefit from things like checksums and ensuring that all of the expected files are there, but we wouldn't have to repeat things like GPG parameter logic and various types of path mapping.
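To sketch what I mean (every file name, config key, and helper below is made up for illustration, not something that exists in the repo today), the deploy step might boil down to something like:

import hashlib
import json
import subprocess
import zipfile
from pathlib import Path

def sha256(path):
    # Hex sha256 of a file, streamed in chunks so large archives are fine.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def deploy(config_path, data_dir):
    # Decrypt (and optionally unzip) every file listed in the configuration.
    config = json.loads(Path(config_path).read_text())
    for entry in config["files"]:
        encrypted = Path(data_dir) / entry["file"]   # e.g. "competition.csv.gpg"
        decrypted = encrypted.with_suffix("")        # drop the ".gpg"
        if "checksum" in entry and sha256(encrypted) != entry["checksum"]:
            raise ValueError(f"checksum mismatch for {encrypted}")
        subprocess.run(
            ["gpg", "--batch", "--yes",
             "--output", str(decrypted), "--decrypt", str(encrypted)],
            check=True)
        if entry.get("unzip"):
            with zipfile.ZipFile(decrypted) as archive:
                archive.extractall(decrypted.parent)

The per-competition json would then just be the list of expected files, e.g. {"files": [{"file": "competition.csv.gpg", "checksum": "..."}, {"file": "attachments.zip.gpg", "checksum": "...", "unzip": true}]}.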

(I may have some similar thoughts about the compose-and-upload processes once I get into that.)

@frankduncan
Collaborator

frankduncan commented Feb 28, 2021

The compose-and-upload scripts have largely been refactored into the etl package at the toplevel. The scripts themselves are mostly just configuration, but the language is python rather than a secondary data format that is then parsed by python. The reason for that is mainly personal preference: I like my data formats to be turing complete.

That refactoring hasn't been done to the deployment scripts, and there's no reason it shouldn't be (the only reason it hasn't is time). There's some overlap between the deploy scripts, and there's also a lot of cruft that has been carried over from the first version of things. If anything, I would say moving them from bash to python and putting the common code in the etl package would bring things more in line with the rest of the pipeline and simplify the whole system. There's some challenge there because a lot of what's going on in those scripts is very shell-oriented, but I think that could easily be abstracted out to python, and even though python code that just wraps shell commands is a small code smell, it should be contained enough that it won't be a problem.
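For instance, the shell-oriented parts could sit behind a couple of thin wrappers in the etl package (the helper names here are just illustrative, not existing code):

import subprocess

def run(*args):
    # Run a command and fail loudly, the way `set -e` does in the bash scripts.
    subprocess.run(list(args), check=True)

def gpg_decrypt(encrypted_path, output_path):
    run("gpg", "--batch", "--yes", "--output", output_path, "--decrypt", encrypted_path)

def unzip(archive_path, target_dir):
    run("unzip", "-o", archive_path, "-d", target_dir)

The deploy scripts would then be ordinary python calling those helpers, with the shelling-out contained in one small module.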

In that case, a large part of the compose-and-upload scripts (the shell documentation and argument handling) could be removed, as the prescribed way to call those scripts would no longer be from the commandline. The original versions of all this were built around the idea of bash calling python, so we wanted to make the python well formed. I can definitely agree that we're at the point where no one is going to call compose-and-upload directly anymore; it is always called from deploy.

I think I disagree with going so far as to do things like detecting zip files and inflating them. That moves a fair bit further into implicit software, which is always a bit more challenging to debug and extend. I also don't think we'll ever get to the point that the files we receive for competitions always fit into defined buckets, so keeping the configuration in code allows us the flexibility to handle all the edge cases. In the world of personal preference, I find the former nicer to work with:

validateAndExpand("attachments.zip", "<checksum>")
validate("competition.csv", "<checksum>")
validateAndExapnd("otherattachments.zip", "<checksome>", "tmpdir")
for dir in tmpdir:
    for file in dir:
        file.move(path.combine(attachments_dir, dir, 'COVID_' + file.filename)

versus

files = [
    {"file": "attachments.zip", "checksum": "<checksum>"},
    {"file": "competition.csv", "checksum": "<checksum>"},
    {"file": "otherattachments.zip", "checksum": "<checksum>", "temporary_attachment_prefix": "COVID_"},
]

# Code that runs that is left as an exercise for the reader
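
Roughly, that driver ends up looking like the following (it reuses the same hypothetical helpers as the explicit version above, and has to sniff for zip files, which is exactly the sort of implicitness I'd rather avoid):

# Sketch of the driver for the config version; validateAndExpand, validate,
# and attachments_dir are the same made-up names as in the explicit example.
import os
import shutil

for entry in files:
    if not entry["file"].endswith(".zip"):
        validate(entry["file"], entry["checksum"])
    elif "temporary_attachment_prefix" not in entry:
        validateAndExpand(entry["file"], entry["checksum"])
    else:
        tmpdir = validateAndExpand(entry["file"], entry["checksum"], "tmpdir")
        prefix = entry["temporary_attachment_prefix"]
        for dirname in os.listdir(tmpdir):
            for filename in os.listdir(os.path.join(tmpdir, dirname)):
                shutil.move(
                    os.path.join(tmpdir, dirname, filename),
                    os.path.join(attachments_dir, dirname, prefix + filename))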

Standard disclaimer: This is all just personal opinion, which was only the law because the number of developers working on it was one :)

@slifty
Contributor Author

slifty commented Feb 28, 2021

Good to know about compose-and-upload!

In terms of the deploy scripts I would even go so far as to lobby for that files variable to be defined entirely in a json file, so that whoever sets up a new competition wouldn't have to open a python IDE at all! But I'm still getting the hang of it all.

It seems like we're already starting to move into the realm of thinking of ETL as a product with a feature set, where the features are activated by config files (in the current state it is config-as-code). I'm thinking of a model where there would be a SINGLE instance of deploy (and compose-and-upload) which could be unit tested and documented. New features would be needed of course, and the core product could be improved to support them -- as opposed to each competition having its own bespoke code base.

In theory this approach could mean that new competitions and files could actually be created by non-developers because we could create a UX that is able to generate the configuration files and execute the scripts using those files! I get ahead of myself.

Totally separate note: I'm with you on avoiding implicit behavior, but I do think of file types as a gray area... For instance, it's common for software to look at a file's type and process it differently depending on the detected type. But I also appreciate that we might want an unzip flag as part of the file definition. Let's talk more about zip later.

@frankduncan
Collaborator

Ah, yes! If your end goal is to have competition setup done by non-developers, then your approach makes a ton of sense. As long as it's being set up by developers, it may be a step backwards. Consider the following:

https://github.com/OpenTechStrategies/torque-sites/blob/main/competitions/ECW2020/etl/compose-and-upload#L110

This is code that processes a spreadsheet that was created in a one-off situation for ECW. If I understand correctly, the new workflow would be to put that code into the core etl with a new type, and then reference that new type in whatever json file we have. That's a solid approach, and I did consider it. I found it more natural to keep that competition-specific code local to the competition, for the workflow we have now. Otherwise, when you extrapolate to all competitions, the number of largely useless types defined centrally would create too much noise among your options when setting up a new one. I wanted to keep the central etl pipeline library limited to the kinds of things that are needed by three or more competitions. The downside, of course, is that you may accidentally be repeating yourself without knowing it.

However, this all assumes someone acting as a developer who is creating a script powered by a library, rather than someone configuring a system. And most of the same philosophies apply to the deploy scripts as well. It is easier for me, at least, to reason about code where I instantiate a new object right in the script, as I can go look at that specific class and see what the code is doing. The more indirections (how is that not a word?!) we create, the harder reasoning becomes. And we already have one indirection from class hierarchies in the current scripts.

This is all moot if we do decide that the future direction is a service fronted by some UI. Then all this configuration goes into a database, all the files get uploaded (not by us!) into a database rather than subversion, there's error checking on the actual content of the files to see whether the script can handle them, and everything gets a lot more self-service. That's definitely a compelling vision! But without it, this evolution of the project feels like putting the cart before the horse.
