Skip to content

Commit

Permalink
Improved validation&docs, update CI, rm skbio (#5)
Browse files Browse the repository at this point in the history
* DEV: Travis --> GitHub Actions (close #3)

* REL: bump version number to 0.1.0

* STY: update based on new black vsn

* DOC: Update CI badge in README

* TST: more explicit coverage report formats

* BUG/DOC: Fix & improve edge decl err msg

Previously, I was just outputting {} instead of the actual
problematic declaration. Guess I forgot to use .format(), whoops.

* MNT: on second thought, don't use f-strings

no need to limit this to >= py 3.6 if we can avoid it

* TST: thats a lotta python

* DOC: topology note

See marbl/MetagenomeScope#234. Thought
it'd be good to be extra explicit about this.

* DOC: clarify parsing more

* TST: no-rc test

* DOC: improve multi-colon error message

Just in case someone rolls up with a FASTG file with properties

* TST: "undefined target edge" integration test

This was already basically guaranteed, but now it's unambiguous
that the final check works.

could make the error message more descriptive (for which edges
is this attribute missing), but that'd necessitate refactoring
and not important to do that atm imo

* TST: more paranoid testing

* DEP/ENH: remove skbio dep; compute GC manually

This change makes the implicit sequence validation more explicit (we do
it), and also more lax (now uracil [U] is ok).

Closes #2.

* DOC: shorten README title

* TST: don't test on python 3.4

We could mess with the ubuntu version to try to make this work,
but i think it makes sense to prioritize support for newer python
versions. see actions/setup-python#185

* TST: Fix 3.10 in YAML

* TST: axe py 3.5, add py 3.11

from the github actions logs:
"DEPRECATION: Python 3.5 reached the end of its life on September 13th,
2020. Please upgrade your Python as Python 3.5 is no longer maintained.
pip 21.0 will drop support for Python 3.5 in January 2021."

* TST: 3.11-dev

* DOC/ENH: tidy up declaration info extraction

- Include ^ and $ in the regex
- Document the (?P<groupname>...) stuff
- Change the name of the first capture group from node --> name
- Remove the | in the middle of the coverage set of accepted
  characters (this was a small bug -- wouldn't have done any damage
  since the attempted conversion to float would've immediately broken
  due to any |s)

* MNT: Remove +? from name and length regex

Since the sets of accepted characters for name and length don't
include "_", the minimality thing shouldn't make a difference. This
makes the code clearer. See
https://docs.python.org/3/library/re.html#index-6 for details on what
this was doing.

* TST: test bad-coverage cases

* DOC: mention invalid coverage floats in readme

* TST: test leading/trailing info not ok in decls

* TST/ENH: Add check that decl lines ends with ;

(since the spec technically allows them not to...)

* TST: another end-with-; test

* ENH/TST: Declaration consistency check!

also:

-removed the >? from the start of the main edge declaration line
 regex to make life easier
-made the (lack of) space on the line after "Parameters\n---------"
 consistent
-lotta tests

* DOC: tidy up readme some

* TST: test another decl-inconsistency case (1 line)

* DOC: better error message on inconsistent decls

wrt the node/edge debacle

* ENH: Detect and throw error if duplicate out adjs

* REL: add draft changelog

not necessarily thorough yet -- will look over again tmw

* DOC: make readme intro read better

* DOC: yassify readme intro more

* DOC: various README improvements

* TST: verify that spaces next to adj commas not ok

The error should be self-explanatory.

* ENH: hit the yoinky sploinky and disallow ~

a situation where we can be very explicit in the error message

* ENH: more blocking problematic things explicitly

* DOC: tidy up the intro of the README a bit

* DOC: Update README

- Move dependency docs up to installation
- Add dev instructions
- Link to changelog

* DOC: tidy up FASTG details in README

* DOC: impv orientation, node name formatting, etc

* DOC: Update the changelog

phew

* DOC: grammar

* DOC: grammar but again
  • Loading branch information
fedarko authored Jun 27, 2022
1 parent 3b99ba8 commit 30b0a44
Show file tree
Hide file tree
Showing 21 changed files with 1,122 additions and 184 deletions.
40 changes: 40 additions & 0 deletions .github/workflows/main.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# From
# https://github.com/fedarko/strainFlye/blob/main/.github/workflows/main.yml
name: pyfastg CI
on: [push, pull_request]
jobs:
build:
runs-on: ubuntu-latest

strategy:
matrix:
# Gotta specify 3.10 as a string to avoid it being converted to 3.1:
# https://github.com/actions/setup-python/issues/160
# The -dev after 3.11 is so that we can test on the latest dev version
# of python 3.11, since I believe no stable version of 3.11 exists yet
# (see https://github.com/actions/setup-python/issues/160 again)
python-version: [3.6, 3.7, 3.8, 3.9, "3.10", "3.11-dev"]

steps:

# https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
- name: Check out code
uses: actions/checkout@v3

# https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v3
with:
python-version: ${{ matrix.python-version }}

- name: Install pyfastg (and its pip dependencies)
run: pip install -e .[dev]

- name: Run tests
run: make test

- name: Lint and stylecheck
run: make stylecheck

- name: Upload code coverage information to Codecov
uses: codecov/codecov-action@v2
12 changes: 0 additions & 12 deletions .travis.yml

This file was deleted.

121 changes: 121 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
# pyfastg changelog

## pyfastg v0.1.0 (June 26, 2022)

It's been a while! This release includes a fair amount of changes; the bulk of
these make the parser more strict when dealing with unsupported or malformed
FASTG files.

### New features
- Sequences containing `U` (uracil) are now explicitly allowed. Previously,
these would have triggered an error from scikit-bio about them not being
valid DNA characters.

### Documentation
- Improved some of the error messages produced when pyfastg encounters
malformed graphs.

- Added information to the README (and various error messages) to be
clearer about the fact that the sequences in the input FASTG file are "edges,"
but are represented as nodes in the output NetworkX graph.

- Updated the README to clarify pyfastg's behavior when an edge's implied
reverse-complement (e.g. for 1+ → 2-, this would be 2+ → 1-) does not
explicitly exist in the FASTG file. (The implied edge is not added if it
isn't described in the FASTG file.)

- Restructured and improved the README in various small ways (e.g. moved
the information on pyfastg's dependencies to be alongside the installation
instructions).

### Backward-incompatible changes

N/A. Some of the ways in which we've made validation more strict might be
technically "backward-incompatible," but these sorts of situations were already
unsupported by pyfastg due to not being part of the general SPAdes FASTG dialect.

### Bug fixes
- The error thrown upon seeing an invalid declaration was meant to show the
problematic declaration, but it just showed the string `{}` (due to Python's
`.format()` function not being used). This has been fixed -- this error
message now shows the declaration that caused a problem.

- The `|` character was previously
[included](https://github.com/fedarko/pyfastg/blob/3b99ba8624339e3efdc87de1d122e5f401b4537a/pyfastg/pyfastg.py#L43)
in the set of accepted characters our regular expression used when extracting
an edge's coverage (e.g. the coverage `12|3.5` would have technically been
accepted by the regular expression).

- This shouldn't have caused any major problems, since [a few lines later](https://github.com/fedarko/pyfastg/blob/3b99ba8624339e3efdc87de1d122e5f401b4537a/pyfastg/pyfastg.py#L60)
Python would have thrown an error about such a string not being convertible
to a float. However, this is now caught further up in this function: the
regular expression no longer accepts `|`.

- Added checks for rare problems with inconsistent edge declarations.

- Previously, edge declarations were parsed using a [regular expression](https://github.com/fedarko/pyfastg/blob/3b99ba8624339e3efdc87de1d122e5f401b4537a/pyfastg/pyfastg.py#L43)
that did not use the `^` or `$` characters (see
[Python's documentation on regular expressions](https://docs.python.org/3/library/re.html)
for details).

- Long story short, this meant that an edge formatted like
`asdfEDGE_1_length_9_cov_3ghij` would have technically been accepted, even
though it contained extra information before (`asdf`) and after (`ghij`) the
relevant parts of the edge name.

- This could have caused problems in rare cases (e.g. if a graph contained
both an edge labelled `asdfEDGE_1_length_9_cov_3ghij` and an edge named
`EDGE_1_length_9_cov_3`, these edges would have been treated as if they
were the same edge since the ID `1` is the same).

- To prevent this sort of problem, we now enforce that edge
names cannot have any extra leading or trailing information.

- Additionally, we now check each occurrence of an edge's declaration in
the FASTG file (both on the line that this edge is declared, and on any
lines where other edges have outgoing adjacencies to this edge). If any
declarations are inconsistent (e.g. we see `EDGE_1_length_20_cov_5.2` on
one line, but `EDGE_1_length_20_cov_4.5` on another line), we will raise
an error about this inconsistency.

- We now detect and throw an error if an edge has multiple outgoing adjacencies
to the same edge (e.g. we see a line formatted like `>x:y,z,y`).

- Previously, nothing would have happened in this case (only one copy of
the edge from `x`'s node to `y`'s node would be created), since the type
of graph we create (`networkx.DiGraph`) does not support multi-edges.
Throwing an error in this case seems like the better way to handle this.

- Detect and throw an error if an edge declaration line does not end with `;`.

- Detect and throw clear errors if we see syntax associated with certain
unsupported things in the FASTG spec (e.g. the `~` or `[]` notations).

### Performance improvements

N/A. The extra layers of validation added in this version might slow down
pyfastg slightly, but (in our opinion) this is worth it.

### Development improvements
- Switched from Travis CI to GitHub Actions ([#3](https://github.com/fedarko/pyfastg/issues/3)).

- Set up the GitHub Actions CI to test against Python 3.6, 3.7, 3.8, 3.9, 3.10,
and the latest development version of 3.11. Previously, the CI only checked
against Python 3.6 and 3.7.

- Added development instructions to the README.

### Miscellaneous
- Removed dependency on scikit-bio for validating sequences and computing GC
content ([#2](https://github.com/fedarko/pyfastg/issues/2)). Now, the only
non-development dependency of pyfastg is NetworkX. (That being said, we may
add more dependencies in the future if absolutely necessary.)

## pyfastg v0.0.0 (October 14, 2019)

Initial release.

Note that pyfastg v0.0.0 was released on PyPI on October 7, 2019; there are a
few small changes to pyfastg's documentation that were committed to the GitHub
repository from October 7, 2019 to October 14, 2019. These small changes are
thus not reflected in the PyPI version of the README for v0.0.0.
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
.PHONY: test stylecheck style

test:
python3 -B -m pytest pyfastg/tests/ --cov
python3 -B -m pytest pyfastg/tests/ --cov-report xml --cov-report term

stylecheck:
flake8 pyfastg
Expand Down
Loading

0 comments on commit 30b0a44

Please sign in to comment.