SegmentScripts/3-GitFramework.txt

Git framework
25 minutes - live
-Overview
--branches, file states, relationships
--tracking changes
-Automations vs Manual versioning
-How to apply a Git-like framework to office documents
-Questions

1. Overview
5:00

Ok, so we know now that Git and GitHub are optimized for code,
We know that Git tracks the changes to your files over time, so you don't keep renamed copies of whole files like v1.docx, v2.docx, etc.
We know that other people can get a look at your code on GitHub and download it there and suggest changes
Let's get a little more in depth with each of these ideas

We'll start with a Glossary
Repository - repo for shorthand.
Repos are both the file and folder structure of the code and also the changes, the users, the comments on a commit, etc.
-There are two kinds of repositories, local and remote.
Local is code on your computer that you're using right now.
Remote is the code on GitHub (or similar). You don't need a remote repo to run Git.

Commit
A commit is a set of changes to one or more files.
Commits are how you assemble all the changes you've made to your work.
You can restore old commits.
You can apply one commit at a time or lots of them at once.
You can comment on commits.

"Pull" is another important Git term and it means to retrieve code from somewhere else and integrate it into the existing code.
That "somewhere else" can be the remote repo or another repo entirely

Pull of course is the opposite of push
"Pushing" changes in Git means to send the changes to another repo (could be remote, could be someone else's repo)

So, you push to something
And, you pull from something

Ok let's get into some workflows
Branches
Here's a very important diagram about Git (branch diagram)
When you're doing your work, writing code, making updates, you're working on a "branch" of your codebase
Branches help distinguish between coders, or between coding tasks
You might have a branch "New Feature" or "Brendan's updates"
Every repo has a default branch, and on GitHub new repos name it "main" -
renamed from the previous default "master" which is a change everyone like me was definitely asking for
and we're all glad they prioritized that above ending their ICE contracts

Anyway, you can make changes to one branch and they don't affect the other branches
So any updates I make to Brendan's Fantastic Branch are separate from the code on main
They carry the same filenames and folder structure unless I change them,
there's no _BC in a filename to distinguish that I worked on it
although that information is logged in a commit

Ok, moving between branches
We say that you "check out" a branch when you switch to it, so you might create Branch B from Branch A, check out Branch B, and make your changes there
To bring those changes back from Branch B to Branch A, we call that a "merge"

You can have lots and lots of different branches, and they all refer back to the point where they branched off from main
And, when you merge them back into main, the changes are automatically integrated into the code
So, two people can make updates to the same file and,
as long as those changes don't conflict with each other,
they're automatically integrated

We'll get to what happens when those changes do conflict later
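
Here's a toy sketch of that automatic integration, if it helps to see it
This is NOT Git's real merge algorithm, just the idea behind it
It assumes all three copies of the file have the same number of lines, and merge_lines is a name I made up for illustration

```python
# Toy three-way merge: compare two edited copies against their common starting point.
# Assumption: all three versions have the same number of lines.

def merge_lines(base, mine, theirs):
    """Return the merged lines, or raise ValueError if both sides changed a line differently."""
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this toy only handles same-length files")
    merged = []
    for line_no, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:        # both sides agree (unchanged, or changed the same way)
            merged.append(m)
        elif m == b:      # only they changed this line
            merged.append(t)
        elif t == b:      # only I changed this line
            merged.append(m)
        else:             # we both changed it, differently
            raise ValueError(f"conflict on line {line_no}")
    return merged


base = ["Title", "Intro", "Conclusion"]
mine = ["Title", "Better intro", "Conclusion"]
theirs = ["Title", "Intro", "Stronger conclusion"]

merged = merge_lines(base, mine, theirs)
print(merged)  # ['Title', 'Better intro', 'Stronger conclusion']
```

Two people edited the same file, on different lines, and both edits survive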

So, you might notice that there's a really important relationship
between a remote repository and a local repository.
They are linked but they are not the same.

The remote repo is where the best code is
The most up to date, most usable and refined code, everyone knows it's there in remote main.
That's like, what it means to use Git and GitHub - there's a remote main where the best work ends up
There's a code of conduct in the remote repo that everyone abides by.
There's a license in the remote repo that people can refer back to when deploying your code for themselves.
Remote is where others go to use or modify code that you wrote.
Remote also gives you a backup if your local hard drive gets corrupted.

So, you make your changes locally, you test that they work, you send them to your remote repo
People with access to your remote repo can then change the code you wrote and send it back to you and you can integrate the changes
You can undo the changes, you can create a hyperlink to the changes, you can comment on the changes

I'm going to pause for questions now because this is where it starts to get really complicated
But, after this next section, we'll do the tutorials and maybe some things will become more clear, too


[5 mins?]


File states
4:30

So, how does Git actually know about changes to a file?
You have to tell it!
When you initialize a repo, you generate a .git folder in your repository
It's normally hidden, but this is how Git knows what's going on in your repo
If you remove or mess with this folder, bad things tend to happen

Git's data is stored in an object database
This object database is basically its own filesystem
It's optimized for software version control
There are three objects in the database that we need to concern ourselves with: blobs, trees, and commits

Blobs are the file data, but not the files themselves
Blobs are identified by the SHA-1 hash of their content, not by a filename
Blobs contain only the raw content of a file, not its name or its info from the OS's file allocation table (or whatever)
Git is content-addressable, which means that each file is not necessarily its own thing in Git, but each package of unique data is
Imagine two files with different names, in different folders, that contain exactly the same data, byte for byte
Git stores that data once, and that's the kind of thing it's doing under the hood, for files
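
If you want to see content-addressing in action, here's a minimal sketch in Python
It assumes a repo using Git's default SHA-1 object format, and git_blob_id is a name I made up
The ID is the hash of a short "blob <size>" header plus the file's bytes, and the filename never enters into it

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git would give a blob with this content (SHA-1 repos)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Pretend these are the bytes of three files with three different names.
notes = b"Same words, different filename\n"
copy_of_notes = b"Same words, different filename\n"
edited_notes = b"Same words, different filename!\n"

print(git_blob_id(notes) == git_blob_id(copy_of_notes))  # True: same data, same blob
print(git_blob_id(notes) == git_blob_id(edited_notes))   # False: one character changed
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty file)
```

You can check the result against the real thing with git hash-object on a file with the same content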

Stick with me here

Those blobs are then organized into tree objects, which more directly correspond to the file/folder structure of the repo
The tree is basically how Git moves between its object database and the filesystem on your machine
Tree objects are lists
and they're composed of tuples containing a filename, a file type, and a hash
Trees are where your filenames and folder structure live (branch info lives elsewhere - a branch is just a pointer to a commit)
Trees refer back to blob objects

The trees are then organized into commits
A commit, again, is a set of changes to one or more files
A commit also contains a message describing what those changes are, written by you, the user

It's important to note that the filename and file content and integration of file changes are separated from each other here
If you change a filename in your repo, that change is tracked on the tree level - ALL OF YOUR BLOBS REMAIN THE SAME
If you don't commit your changes, your repo doesn't get updated, even though you pressed Ctrl+S in your text editor
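
Here's a rough sketch of why a rename leaves your blobs alone
It's a simplified model, not Git's actual storage format: the objects dictionary, the store function, and the tuple layout are all stand-ins I made up

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash of the content only (SHA-1 object format assumed)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Stand-in for the object database: hash -> content. No filenames in here.
objects = {}


def store(content: bytes) -> str:
    """Save content under its hash and return that hash."""
    oid = blob_id(content)
    objects[oid] = content
    return oid


# A simplified tree: a list of (filename, type, hash) tuples.
tree_before = [("draft.txt", "blob", store(b"Chapter one\n"))]

# Rename the file without touching its content.
tree_after = [("final.txt", "blob", store(b"Chapter one\n"))]

print(len(objects))                            # 1: the content is stored once
print(tree_before[0][2] == tree_after[0][2])   # True: same blob hash
print(tree_before == tree_after)               # False: only the tree changed
```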

Because there is redundancy on the content-level and the file/folder level,
and those changes are rolled into a single object called a "commit"
the system is very robust. It's very, very difficult to change something and have it unintentionally missed by Git

The way this is managed on the user end is through the following three commands

git add
this tells Git to stage the files you name (or everything that changed, with "git add ."): it writes their content to the object database and lines them up for the next commit

git commit
this tells Git to bundle everything you've staged: the file content changes, the filename and folder structure data, and various timestamps and admin info (like your name and email)
and package them into an update

git push
send all of the changes/ commits you made locally to the remote repository, so others can have the most up-to-date version of your code
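
If it helps, here's a toy model of those three steps in Python
ToyRepo and everything in it is made up for illustration - it's not how Git is implemented, just the order things move in
The point is that saving, adding, committing, and pushing are four separate events

```python
class ToyRepo:
    """A made-up stand-in for a repo, to show where your work sits at each step."""

    def __init__(self):
        self.working = {}   # files as you see them in your editor
        self.staged = {}    # what "add" has lined up for the next commit
        self.commits = []   # local history: a list of (message, snapshot) pairs
        self.remote = []    # what has been pushed for others to see

    def save(self, name, text):
        """Ctrl+S in your editor. Git hasn't been told anything yet."""
        self.working[name] = text

    def add(self, name):
        """Like git add: line up the current content of one file."""
        self.staged[name] = self.working[name]

    def commit(self, message):
        """Like git commit: bundle what's staged, plus a message, into history."""
        previous = self.commits[-1][1] if self.commits else {}
        snapshot = {**previous, **self.staged}
        self.commits.append((message, snapshot))
        self.staged = {}

    def push(self):
        """Like git push: send local commits to the remote."""
        self.remote = list(self.commits)


repo = ToyRepo()
repo.save("index.html", "<h1>Hello</h1>")
print(len(repo.commits))                     # 0: saved, but not committed

repo.add("index.html")
repo.commit("Add home page")
print(len(repo.commits), len(repo.remote))   # 1 0: committed locally, not shared

repo.push()
print(len(repo.remote))                      # 1: now others can get it
```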

This diagram says it better than I ever could


Versioning
5:00

You might have made files like index.html and index.html.bak - where you keep an old version in case you mess something up
As mentioned, Git doesn't save things like this

There are v1, v2, vN for the entire repository though
Those are called releases
You may be familiar with updating Firefox to the latest version, and it has a long string of numbers 60.2.37.244
That's the release number
the release info refers to your entire repo, not an individual file
There's really no "version" of a file in Git, just its latest commit
I'm gonna keep repeating this kind of thing for the whole workshop so it sinks in


2. Modeling Git for non-code documents
You may now be realizing that there are significant barriers to using Git with your office documents
But, that doesn't mean we can't do something git-like anyway, or maybe closer to GitHub-like

This becomes very important in a situation where you have multiple people in multiple locations
who all need to work on the same file/ set of files and you need to track who's worked on it when
You know, like in quarantine

Remember the essentials of the local-remote repository relationship:
There is 1 stable version that is broadly accessible
Everyone involved agrees on where it is
Everyone agrees on the methods required to update it
Everyone involved knows that this 1 stable version may not be "done"
may not be everything we all want
but, it's the best we have right now

If you have all of those elements in place, you can do something Git-like

I'll walk through an example from my own work: transcripts

We have lots of different transcript versions
Basically no transcripts in our Visual History Program Collection are "done"
but all of them need to be available for curatorial and research use by staff and by patrons
And, crucially, they need to be in a format that's broadly readable and shareable and understandable by non-technical people

So, I set up an attachment field in Airtable
It's called "current transcript"
It contains the most up-to-date transcript
Everyone can look there first, if they need a transcript

Everyone in the department has transcript versions which aren't on Airtable
Earlier versions, before something was edited out
Curatorial versions where segments are highlighted
Those versions are also spread across a series of RAIDs and networked drives, in addition to our individual computers

When we need a transcript though, we don't need those arcane versions
We usually need the best version, or at least the best version we have now
and that version is on Airtable
The version on Airtable may not exist anywhere else, but that's okay

So, this Airtable field is like remote main
It doesn't matter what you have locally, it doesn't matter what's on a shared drive that you can get through the VPN
The best version is on remote main
If you need to dig deeper, you can

It's this agreement between people, and the documentation that supports it, that makes the system work
It doesn't have to be technical
The reason Git is so powerful isn't just because it's well thought out, as a system, but also because everyone has kind of agreed to use it
We've agreed to
 keep discussion of the code to the commit comments and wiki
 the branch structure for large and small updates
 the circumstances necessary to release a new version
We've agreed to use Git, essentially

That kind of consensus is achievable in small teams without using Git, though
And, if you set up a remote space where 1 single "best, right now" version of a document lives,
and for us it's helpful to have good filenames, but not essential,
you can take advantage of the distributed team setup without as much of the headache of versioning