Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add stats to codon tables #350

Merged
merged 38 commits into from
Nov 15, 2023

Conversation

cachemoi
Copy link
Contributor

This PR takes care of this issue

We want to add a count of the observed start codon occurences in a generated codon table, this might be of interest to check occurences of alternative start codons.

I've added a table test for generateCodonTable, checking that we get the expected full object, and I made it so we add the stats together in AddCodonTable and use the stats from the first table in CompromiseCodonTable (I think that's the intended behavior but let me know if not!)

I realise that the original issue wanted to add the codon stats to the table struct, but after doing this I think it might be better to create a ComputeTableStats(t Table) Stats, or a ComputeStats() function to the table interface instead which does not save the count to the codonTable struct.

This is because otherwise we might forget to update the stats when mutating the table (e.g when calling OptimizeTable, which I think requires updating the start codon stats). Let me know what you think.

@Koeng101
Copy link
Contributor

Koeng101 commented Sep 18, 2023

This is an interesting example of I think how we could improve the readability of this package. I'd like to know your take @cachemoi

So the problem here is that there are actually two things we're talking about: a codon table, and a codon usage table. A codon table is just the mapping of codons to amino acids (defaultCodonTablesByNumber is the NCBI definitions of each codon table). For translating genes, this is just fine - we can take a set of codons and, since they are a degenerate code, match to 1 amino acid (64->20). For optimizing genes, this is not enough information - we need to know the weights of each codon to amino acid mapping, because if we don't we'll use crappy codons which will significantly impact expression of our proteins (not all codons are made equal!)

What you've done with stats.StartCodonCount[triplet]++ shows this package lacks clear documentation or on the difference between these 2 concepts. It specifically impacts the codon table, but not the codon usage table, which is the interesting bit biologically-speaking (since we already know the the start codons in the codon table).

Hence:

how many gtg start codons there are vs ATG start codons

So actually, all the real magic for the use of this package can be found in the tests

func ExampleOptimize() {
	gfpTranslation := "MASKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTISFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYITADKQKNGIKANFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK*"

	sequence, _ := genbank.Read("../../data/puc19.gbk")
	codonTable := codon.GetCodonTable(11)

	// a string builder to build a single concatenated string of all coding regions
	var codingRegionsBuilder strings.Builder

	// initiate genes
	genes := 0
	// iterate through the features of the genbank file and if the feature is a coding region, append the sequence to the string builder
	for _, feature := range sequence.Features {
		if feature.Type == "CDS" {
			sequence, _ := feature.GetSequence()
			// Note: sometimes, genbank files will have annotated CDSs that are pseudo genes (not having triplet codons).
			// This will shift the entire codon table, messing up the end results. To fix this, make sure to do a modulo
			// check.
			if len(sequence)%3 == 0 {
				codingRegionsBuilder.WriteString(sequence)

				// Another good double check is to count genes, then count stop codons.
				genes++
			}
		}
	}

	// get the concatenated sequence string of the coding regions
	codingRegions := codingRegionsBuilder.String()

	// weight our codon optimization table using the regions we collected from the genbank file above
	optimizationTable := codonTable.OptimizeTable(codingRegions)

	// Here, we double check if the number of genes is equal to the number of stop codons
	stopCodonCount := 0
	for _, aa := range optimizationTable.GetAminoAcids() {
		if aa.Letter == "*" {
			for _, codon := range aa.Codons {
				stopCodonCount = stopCodonCount + codon.Weight
			}
		}
	}
	if stopCodonCount != genes {
		fmt.Println("Stop codons don't equal number of genes!")
	}

	optimizedSequence, _ := codon.Optimize(gfpTranslation, optimizationTable)
	optimizedSequenceTranslation, _ := codon.Translate(optimizedSequence, optimizationTable)

	fmt.Println(optimizedSequenceTranslation == gfpTranslation)
	// output: true
}

Ideally, within the loop finding feature sequences we would add in the add for start codons. So, actually what might be really useful here is integrating that example function to be a method of the codon table, such that we can simply call a method and we get a table ready to optimize a protein that just so happens to have start codon stats

Comment on lines 106 to 96
for codon, count := range toAdd.StartCodonCount {
s.StartCodonCount[codon] = s.StartCodonCount[codon] + count
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add should getting added to as we add example protein sequences to the codon table describing the usage of different codons. This would not be a feature of the codon table itself, but something that gets modified over time.

IsEmpty() bool
OptimizeTable(string) Table
}

// Stats denotes a set of computed codon table statistics
type Stats struct {
StartCodonCount map[string]int `json:"start_codons_counts"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with, instead of having a Stats struct that is added to the codon table, we have ComputeStats(). I think that is much cleaner. Although we'd still need to have StartCodonCount, since that can't be contained in the current struct.

@@ -338,6 +364,7 @@ func generateCodonTable(aminoAcids, starts string) codonTable {
triplet := string([]byte{base1[i], base2[i], base3[i]})
aminoAcidMap[aminoAcid] = append(aminoAcidMap[aminoAcid], Codon{triplet, 1})
if starts[i] == 77 { // M rune
stats.StartCodonCount[triplet]++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't implement a check for the number of codons in a genome, just a count (which will be here, by definition, 1) of the codons that can be start codons, not the weighted probability that any given one will be a start codon, which is what we are looking for.

synthesis/codon/codon_test.go Outdated Show resolved Hide resolved
synthesis/codon/codon_test.go Outdated Show resolved Hide resolved
@cachemoi
Copy link
Contributor Author

cachemoi commented Sep 19, 2023

Thanks for the thorough explanation @Koeng101 ! Sorry about the misunderstanding my biology is (very) rusty 😅

So if I understand correctly the meat of this package is to change a coding sequence so that we use some codons more frequently than other for e.g making translation of a gene present in one specie easier in another.

Looking through the code my instinct is that the Table interface is not really necessary since as far as I can tell there is only a single implentation (codonTable) and it looks like it's only really used as a data container. The 2 main functions that the consumers of the package will use are:

func Translate(sequence string, codonTable Table) (string, error) 
func Optimize(aminoAcids string, codonTable Table, randomState ...int) (string, error)

the Table interface looks like this:

// Table is an interface that specifies the functions that all table types must implement
type Table interface {
	Chooser() (map[string]weightedRand.Chooser, error)
	GenerateTranslationTable() map[string]string
	GenerateStartCodonTable() map[string]string
	GetAminoAcids() []AminoAcid
	GetStartCodons() []string
	GetStopCodons() []string
	IsEmpty() bool
	OptimizeTable(string) Table
}

Apart from the data accessor funcs the functions with extra logic in them are OptimizeTable and Chooser()

My understanding of OptimizeTable is that it computes the codons observed in a sequence and mutates the table to fit those observed codon weights, and then returns the table. Chooser() is only used in Optimize() to give a weighted amino acid "picker"

I think it's nice to have structs which are named like <Thing>er which contains the functions that the package users cares about, only exposing the bare minimum they need to do the tasks. The consumer can write the interfaces themselves (accept interfaces return structs). So for e.g the functionChooser() can be private since it should never really be used outside of Optimize()

What do you think about separating the key functions in small structs? For example:

type Optimizer struct {
	table   codonTable
	chooser weightedRand.Chooser
}

// (previously codonTable.OptimizeTable)
func NewOptimizerFromSequence(sequence string) *Optimizer {
	// code
}

func NewOptimizerFromTable(table codonTable) *Optimizer {
	// code
}

func (o *Optimizer) Optimize(sequence string) string {
	// code
	
	return optimizedSequence
}

// (Translator could probably live in another package)
type Translator struct {
	codonTable Table
}

func (t *Translator) Translate(sequence string) string {
	// code
	
       return translatedSequence
}

This would (to me) make the purpose of the package a bit clearer and reduce the number of exposed funcs, making things a bit simpler.

I've pushed an example of what that would look like for the stats computation, the codon.Analyser which computes stats given a genbank sequence. I think it's not exactly what you want (you want some stats on the start codon frequencies observed every time we do a Translate or Optimize) but it's just to give an idea of the structures i usually work with. I'll work more on it to get the stats we want tomorrow.

Let me know if you think that would make the package clearer (not suggesting I'd do this in this PR, just some thoughts).

return nil, err
}

stats.StartCodonCount[sequence[:3]]++
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a bit of a naive way to find the start codon (just picking the first 3 letters of a cds), is there a better way to find the start codon @Koeng101 ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an instance where magic numbers are ok - unless we start dealing with xenobiotic stuff (O_o), which IMO is outside the scope of this package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was asking more because I'm not 100% sure that a CDS will always begin with the start codon. In typical biology fashion you can get things like upstream start codons (or many other special cases I'm sure). No need to overcomplicate if not necessary, I was just wondering if this was too naive

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lab I worked in actually did a study on this. Turns out, biology is really... wack!

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5397182/

In E.coli, out of 64 codons, 47 give detectable levels of start-codon expression. Sometimes with f-methionine, sometimes they are just initiated with a different amino acid. Very wobbly, but generally the expression was kind of crap compared to canonical start codons.

That being said, a start codon is definitionally the first codon of the CDS, so if there was an alternative upstream codon - guess it shoulda been annotated! In this case, we can safely say 3 is always safe and an alright use of a magic number (since it isn't even really a magic number if you know about codons).

@carreter
Copy link
Collaborator

I know the <action>er interface pattern is idiomatic Go, but I'm personally wary of overusing single-method interfaces.

Separating functions across all these interfaces makes usage more complicated as clients have to instantiate and keep track of an additional object. If all Tables should be able to be optimized + take a sequence to be translated (which IMO they should), these methods should be part of the Table interface.

Reviewing now and leaving more detailed thoughts.

Comment on lines 5 to 14
// Stats denotes a set of computed codon table statistics
type Stats struct {
StartCodonCount map[string]int `json:"start_codons_counts"`
}

func NewStats() *Stats {
return &Stats{
StartCodonCount: map[string]int{},
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why define a new struct which has a single member and no methods?

This should either be a type alias (type Stats map[string] int) or not be a type at all. IMO the latter is preferable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use type aliases more to enhance readabilty/make it harder to mess up (e.g type DNA string), this is meant as a container struct because we presumably will add more stats in the future. Rather than having a large list of ints and maps in a struct that has nothing to do with codon stats, it's nicer to have it in its own type.

Where would you put it if you'd like to keep it as an untyped map? codonTable?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there's a case where the stats will be used together often, I think they should be returned by different methods or functions.

IMO it should be something like the following would be best:

func ComputeStartCodonStatistics(codonTable Table) map[string]int {
  // code
  return statisticsMap
}

func ComputeAnotherStatistic(codonTable Table) SomeType {
  // code
  return instanceOfSomeType
}

Comment on lines 16 to 17
// Analyzer is used to compute codon usage statistics
type Analyzer struct{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty struct with a single method could just be a function. Structs are for storing state/bundling data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not always, here's one in kubernetes. You use structs like this to make it easier to mock in unit tests, for example we might want to test something which relies on the analyser without actually calling the struct's code.

e.g we can imagine elsewhere:

type Analyser interface{
	ComputeStats(sequence genbank.Genbank) (*Stats, error)
}


// test.go

type mockAnalyzer struct{}

func (a *Analyzer) ComputeStats(sequence genbank.Genbank) (*Stats, error) {
	return //stats, mockError
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense in the Kubernetes example because these structs implement a common interface. I don't see a use case for defining an interface here.

Getting the statistics from a codon table should be a pure function we can extensively unit test, what scenarios do you have in mind where we would need to mock it?

type Analyzer struct{}

// ComputeStats returns a set of codon related statistics from a genebank sequence
func (a *Analyzer) ComputeStats(sequence genbank.Genbank) (*Stats, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to have this take a genbank entry, or just a list of arbitrary sequences? @Koeng101 idk what your intentions were for this feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly just wanna be able to generate an optimization table reliably from a genbank file. Stats are kinda optional, mostly because I can just stare down into the struct if I really need them (also, I don't actually use any of the stats I do look at, it's mostly just curiosity for me. Like how Mesoplasma florum has extremely few CGG codons...)


// iterate through the features of the genbank file and if the feature is a coding region, get the start codon
for _, feature := range sequence.Features {
if feature.Type == "CDS" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Koeng101 hmm, maybe we should store common feature types in the genbank package as an enum (type FeatureType string).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea! Though first we have to merge the genbank package... maybe add as an issue?

// iterate through the features of the genbank file and if the feature is a coding region, get the start codon
for _, feature := range sequence.Features {
if feature.Type == "CDS" {
sequence, err := feature.GetSequence()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, why does this method of Feature have an error in its return type? Reading through the implementation, the error will always be nil. @Koeng101 please confirm whether this is intended behavior, if not I will open a new issue for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the intended behavior, though ironically I think most of the time its just panicked for me and I had to go fix the panic. The reason is that genbank location strings can be stupid, and impossible to parse.

return nil, err
}

stats.StartCodonCount[sequence[:3]]++
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an instance where magic numbers are ok - unless we start dealing with xenobiotic stuff (O_o), which IMO is outside the scope of this package.

"github.com/google/go-cmp/cmp/cmpopts"
)

func TestComputeStats(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests should not rely on file I/O unless absolutely necessary. These are already table-driven tests, the tests struct should be of the following form:

struct {
    name string

    input genbank.Genbank
    want *Stats
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was basing it off the existing codon test which read the genebank files, I assume that's because the genebank input you'd need to include in your table test would be rather large? I can change it to the object if we're ok with it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Koeng101 I guess this just brings up my question again as to whether we want to handle genbank objects or just string sequences.

I totally forgot that the genbank.Genbank is a massive struct - I'm ok with reading it from file, you're right that it would be annoying to type out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

genbank.Genbank structs are silly big, so I think we should just read from a file for tests, though for examples I think we should always have it as a string (so it can run on the docs website well).

The reason the tests here are off of genbank files is because that is where pretty much all the codon tables will come from until we implement #247 , which also requires quite a bit of engineering effort, I think. If we do it off of just sequences, for example, you can miss a lot of the small things that matter when parsing a real codon table from a real file.

To put #247 in perspective, the codon usage database, the most widely used database for codon usage, was last updated in 2007. So there is lots of room for us to improve over the state-of-the-art.

synthesis/codon/analyzer_test.go Outdated Show resolved Hide resolved
Comment on lines 45 to 50
analyzer := &Analyzer{}

stats, err := analyzer.ComputeStats(sequence)
if err != nil {
t.Fatal(err)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this shows why the Analyzer struct is unnecessary. You're instantiating an empty struct and immediately calling one of its methods, then never use the struct again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test is meant to only test ComputeStats, so which other method would I be calling? The point of the single method struct is to make mocking/interfacing easier for consumers.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test is meant to only test ComputeStats, so which other method would I be calling?

Oop yeah, nevermind on this.

The point of the single method struct is to make mocking/interfacing easier for consumers.

I still don't see the the use case for this. I'll leave a more detailed response in the main PR thread.

Comment on lines 549 to 324
if !cmp.Equal(got, tt.wantCodonTable, sortCodons, sortAminoAcids, sortStrings) {
t.Errorf("got and tt.wantCodonTable didn't match %s", cmp.Diff(got, tt.wantCodonTable))
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the same if diff := cmp.Diff(); diff != "" {} syntax I used elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a particular reason?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a style choice than anything, though I guess it reduces the number of calls to the cmp package. Should have marked this as a nit!

@cachemoi
Copy link
Contributor Author

I know the er interface pattern is idiomatic Go, but I'm personally wary of overusing single-method interfaces.

Separating functions across all these interfaces makes usage more complicated as clients have to instantiate and keep track of an additional object. If all Tables should be able to be optimized + take a sequence to be translated (which IMO they should), these methods should be part of the Table interface.

If the object instantiation is kept clear and we make it impossible (or at least hard) to end up with an invalid object that's not really an issue (which is why I'm suggesting the NewOptimizerFromTable and NewOptimizerFromSequence constructors)

I'm not overly fond of single method interfaces either (and my suggested names stutter), I was just struggling to come up with a good name for a struct which does both Translation and Optimisation 😅

Just to reiterate the point of those structs is to make interfacing/mocking/testing easier for consumers, if you keep them as top level package functions they become harder to swap out

@carreter
Copy link
Collaborator

If the object instantiation is kept clear and we make it impossible (or at least hard) to end up with an invalid object that's not really an issue (which is why I'm suggesting the NewOptimizerFromTable and NewOptimizerFromSequence constructors)

It's an issue for readability, as it adds a new interface + associated struct to our API where we could just have a single function instead.

I'm not overly fond of single method interfaces either (and my suggested names stutter), I was just struggling to come up with a good name for a struct which does both Translation and Optimisation 😅

Translation is going to be something we do with pretty much every codon table, so I think it belongs in the Table interface. I could see an argument for writing it the following way though:

type Translator interface {
  Translate(string) string
}

type Table interface {
  Translator
  // Other methods
}

Just to reiterate the point of those structs is to make interfacing/mocking/testing easier for consumers, if you keep them as top level package functions they become harder to swap out

How so? Take this example:

// Our library code.
func Do(input string) string { //code here }

// Client's code.
type DoFunc func(string) string

type Something struct {
  doFunc DoFunc
  // other members
}

If the client wants to mock Do, they can just write their own function that satisfies the DoFunc type. No need for additional structs/interfaces.

If we could reasonably expect Do to need to store state, then I would understand making a Doer struct. However, all of the functions we're talking about here (namely: Translate, Optimize, and ComputeStartCodonStatistics) should be pure.

@Koeng101
Copy link
Contributor

So if I understand correctly the meat of this package is to change a coding sequence so that we use some codons more frequently than other for e.g making translation of a gene present in one specie easier in another.

Correct. It is also used in our (gene fixer)[https://github.com/TimothyStiles/poly/blob/main/synthesis/fix/synthesis.go] for fixing up coding sequences so they can be synthesized more easily. So it's basically Translate, Optimize, and whatever fix.Cds uses - which I think is the struct.

the Table interface looks like this:

I think the table interface could definitely be simplifed. I'm not tied to it, and would encourage any simplification.

What do you think about separating the key functions in small structs? For example:

I like it, I'm sold.

it's just to give an idea of the structures i usually work with. I'll work more on it to get the stats we want tomorrow.

Honestly, I think the stats are useful, but separately from the start codon issue. For start codons, it's pretty simple: Just gimme the occurence count of the first 3 base pairs for every protein used. When I think of codon statistics, I'm thinking of codon distribution and rare codon counts, stuff like that. Though beyond all that, I think simplification of this package is more important than stats or start codons - I think you've laid out a way it could be made easier and more readable.

Separating functions across all these interfaces makes usage more complicated as clients have to instantiate and keep track of an additional object. If all Tables should be able to be optimized + take a sequence to be translated (which IMO they should), these methods should be part of the Table interface.

Translation tables are very easy to get (we actually have literally all the translation tables in a map), optimization tables not so much. It might be nice to more easily say "this is a translation table" and "this is a optimization table", since the first we can make very easily accessible, and the second can have methods for generating.

Translation is going to be something we do with pretty much every codon table

Here's some more context: You must know a translation table in order to generate an optimization table. Then an optimization table is just a translation table extended with counts for each codon occurence. I stuck em both in the same struct cause I sucked more at Golang back then. Nowadays I'd probably just have a translation table and then a separate map[string]int to count each codon, and that is where the weights go. An optimization table is just a translation table + that map. It more clearly differentiates what makes them actually different.

Then, we kinda just added functionality on top of that foundation. But fundamentally:

  1. We have a map of amino acids to triplets (so, map[rune][]string - stop codons can just be encoded in this, and ATG assumed to be start for most synthetic biology applications)
  2. We have a map of weights (map[string]int)
  3. The two functions we have are Translate() on top of the 1st map[rune][]string, and Optimize on top of the 2nd map[string]int

Then, we have a couple things we can do with those codon tables:

  1. AddCodonTable to build for multi-chromosomal organisms
  2. CompromiseCodonTable to build for cross-organism synthesis

Of course, there are nice things about the struct right now (it generates clearer JSON, which I consider a win, because I mainly want this to replace the codon usage database at some point), so don't take the map literally - but that said, we can remove literally any code that doesn't help us accomplish those ~4ish tasks, and should only add code that helps us generate or use those 2 distinct tables more effectively.

@Koeng101
Copy link
Contributor

I guess I'll reiterate that simplification and communication of what we want in this package is more important than any stats about codon tables, because the pretty much the only thing stats of codon tables will be used for is my own edification when I browse through the genbank databank converted to codon tables...

// apply weights to codonTable
for codonIndex, codon := range aminoAcid.Codons {
table.AminoAcids[aminoAcidIndex].Codons[codonIndex].Weight = codonFrequencyMap[codon.Triplet]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken this function updated the (global) default NCBI tables, it seems that this may have been intended but I assumed not so changed getCodonTable() to return a copy of the table rather than the hardcoded one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global NCBI tables was intended so you only had to instantiate it once, but might be better to have it under a function. The NCBI tables really don't mutate quickly at all, so it'd be nicest to have them as constants, but that is kinda hard in Golang I think

synthesis/codon/codon.go Outdated Show resolved Hide resolved
@@ -1,22 +1,20 @@
package codon
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic of all existing tests is kept the same (and they pass) but I should probably add some more. Let me know if you want a particular one!

@cachemoi
Copy link
Contributor Author

I tried to have a go at re-jigging the structures a bit @Koeng101 but this means the PR has grown 😅 . Let me know if it makes sense to you, basically I removed most simple data accessor functions from the Table interface and moved it to a constructor function for a codonTable. I also moved the loop that looked for CDS in most examples to a package function.

This means it's harder to end up with an invalid table, we don't recompute translation maps on every run and the exposed interface is cleaner (most of those functions were onnly used in internal codon package logic)

I did not create a separate OptimizationTable struct because to me it make sense that you could optimise with any table, since optimizing with a table with no special weights would just me a no-op.

You could also repeateadly update weights on the same table, hence the codonTable.UpdateWeights function.

Let me know what you think too @carreter, I removed the separate Analyser struct since we now have what we need in the Table struct functions to compute stats.

@Koeng101
Copy link
Contributor

I am running some GoldenGate + transformations today, so I'll have to get to this PR a little later - will in next few days!

Thank you for being patient with us :)

@cachemoi cachemoi marked this pull request as ready for review October 18, 2023 16:16
Copy link
Contributor

@Koeng101 Koeng101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few changes, and then I think it looks good!

I mostly went and just read the code in the file, I think the changes were sufficient that a from-scratch re-read would be nice. I mostly think there should be comment / style updates, the code itself looks good.

Comment on lines 75 to 77
// Stats denotes a set of statistics we maintain throughout the translation table's lifetime. For example we track
// the start codons observed when we update the codon table's weights with other DNA sequences
type Stats struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One edit I'd like: Place stats below the Codon / Amino acid / Table interface. I think these should be arranged by priority to understand the code: Stats are rather low here.

synthesis/codon/codon.go Show resolved Hide resolved
synthesis/codon/codon.go Show resolved Hide resolved
synthesis/codon/codon.go Show resolved Hide resolved
synthesis/codon/codon.go Show resolved Hide resolved
Copy link
Contributor

@Koeng101 Koeng101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Let's get @TimothyStiles in here for a final review n merge!

@Koeng101
Copy link
Contributor

Koeng101 commented Nov 5, 2023

@cachemoi

Sorry, one more change: Could you add the feature to the changelog? Then I think ready for merge.

@cachemoi cachemoi force-pushed the FEAT-262/start-codon-stats branch from 2c35512 to 58a511c Compare November 5, 2023 17:19
@cachemoi
Copy link
Contributor Author

cachemoi commented Nov 5, 2023

Sorry, one more change: Could you add the feature to the changelog? Then I think ready for merge.

I had to rebase to get the changelog on my branch, done in 58a511c

Copy link
Collaborator

@TimothyStiles TimothyStiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is good. Test coverage has dropped from the previous 100% to 95.3% and I've commented on the lines that are not covered.

github wouldn't let me add this inline but line 359 in codon.go is also no longer covered.

return nil, fmt.Errorf("weightedRand.NewChooser() error: %s", err)

I do like users just being able to weight a codon table using a genbank but I think we should also have an explicit example for how to weight codon tables manually in the case that the user isn't fortunate (or misfortunate) enough to be using genbank. The codon optimizer was actually shipped to work this way originally but we changed it when someone mentioned they weren't able to use it with fasta sequences.

}

// codonTable holds information for a codon table.
type codonTable struct {
func NewStats() *Stats {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a doc string // NewStats does ...

// This function is run at buildtime and failure here means we have an invalid codon table.
chooser, err := newAminoAcidChoosers(aminoAcidSlice)
if err != nil {
panic(fmt.Errorf("tried to generate an invalid codon table %w", err))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this should panic and is also not covered by tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generateCodonTable runs at build time when the hardcoded tables are initialised, having an invalid table is not an error the caller can/will ever handle hence the panic. Should I refactor the initialisation code to handle the error? What would I do with the error apart from panicing?

if feature.Type == "CDS" {
sequence, err := feature.GetSequence()
if err != nil {
return nil, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not covered by tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will never error (issue),should I ignore the error here or make mocks?

}

if len(codingRegions) == 0 {
return errNoCodingRegions
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not covered by tests.

func (table *TranslationTable) UpdateWeightsWithSequence(data genbank.Genbank) error {
codingRegions, err := extractCodingRegion(data)
if err != nil {
return err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not covered by tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractCodingRegion would only error due to a feature.GetSequence() call, which never errors as per this comment

// This will shift the entire codon table, messing up the end results. To fix this, make sure to do a modulo
// check.
if len(sequence)%3 != 0 {
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not covered by tests.


err := mergedTable.UpdateWeights(finalAminoAcids)
if err != nil {
return nil, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not covered by tests.


err := mergedTable.UpdateWeights(finalAminoAcids)
if err != nil {
return nil, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not covered by tests.

// Update Chooser
updatedChoosers, err := newAminoAcidChoosers(table.AminoAcids)
if err != nil {
return err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not covered by tests.

Copy link
Collaborator

@TimothyStiles TimothyStiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is good. Test coverage has dropped from the previous 100% to 95.3% and I've commented on the lines that are not covered.

github wouldn't let me add this inline but line 359 in codon.go is also no longer covered.

return nil, fmt.Errorf("weightedRand.NewChooser() error: %s", err)

I do like users just being able to weight a codon table using a genbank but I think we should also have an explicit example for how to weight codon tables manually in the case that the user isn't fortunate (or misfortunate) enough to be using genbank. The codon optimizer was actually shipped to work this way originally but we changed it when someone mentioned they weren't able to use it with fasta sequences.

@TimothyStiles TimothyStiles merged commit 96b5cff into bebop:main Nov 15, 2023
3 checks passed
@cachemoi cachemoi mentioned this pull request Jan 4, 2024
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants