
# Documentation and Metadata

If I asked you next week, next month, 3 months, or 1 year from now where a dataset or figure came from,
would you be able to tell me within 10 minutes?

You can write all the metadata information in the top-level `README.md` file.
If you don't want to deal with merge conflicts, you can 

## Code documentation

1. Each R script should perform a single task
(if you have a file that is 1000+ lines long, you're probably doing something wrong).
For example:

    1. `01-data_ingestion.R`
    2. `02-data_clean.R`
    3. `03-data_visualize.R`
    4. `04-data_output.R`

2. Each script should have a short description at the top that explains what it is doing.
If the script is part of a pipeline, it should also document where the input data/script is coming from.

3. If you wrote a function in the script, make sure it has a docstring
that explains what it does and what the inputs and outputs are. For example:

```r
#' Square a given value
#'
#' @param x a numeric value to square
#' @return a numeric value, `x` squared
my_square <- function(x) {
    return(x^2)
}
```

4. Make sure the libraries that are loaded are towards the top of the script.
There should not be a `library` call in the middle of your script.
This makes it easier to figure out which packages are needed.

5. Functions you've written should also be towards the top of the script (and documented).
Usually they go right below the `library` calls.
This helps separate your functions from the rest of the code.

6. If the script does not take too long to run, you should test it by restarting your R session
and running the script from top to bottom.
To reset your R session, you can:

    1. Click the red button in the top right corner of RStudio
    2. In RStudio: command/ctrl + shift + F10
    3. In RStudio, type `.rs.restartR()` in the console

This will make sure you have a totally clean environment when you are testing and running your script.
It's even 'better' than using `rm(list = ls())`, since it will also detach loaded packages.
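
Points 2, 4, and 5 from the list above can be combined into one script layout.
The following is only a minimal sketch: the file name is borrowed from the example list,
while the `drop_incomplete()` helper and the toy data frame are hypothetical stand-ins for your own code.

```r
# 02-data_clean.R
#
# Purpose: drop incomplete rows from the ingested data.
# Input:   the data produced by 01-data_ingestion.R
#          (assumption: a toy data frame stands in for it below)
# Output:  a cleaned data frame for 03-data_visualize.R

# Libraries: every library() call lives here, at the top ----------------
library(stats)  # ships with R; stands in for whatever packages you need

# Functions: defined and documented before the main code ----------------

#' Drop rows that contain missing values
#'
#' @param df a data frame
#' @return a data frame with only the complete rows of `df`
drop_incomplete <- function(df) {
    df[stats::complete.cases(df), , drop = FALSE]
}

# Main code --------------------------------------------------------------
raw <- data.frame(id = 1:4, score = c(10, NA, 7, 3))  # hypothetical input
clean <- drop_incomplete(raw)
print(clean)
```

With this layout, a reader can see the purpose, the required packages, and the helper functions before reaching any of the analysis code.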

### lintr

Using a linter helps find potential errors in your code, such as variables that you never use.
It also checks that your code conforms to a common [code style](http://adv-r.had.co.nz/Style.html).
Both make code easier to read for other people/collaborators.

To lint your script, you can run:

```r
lintr::lint('my_r_script.R')
```

RStudio will then open a "Markers" tab that lists the static code analysis results.
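
To see what the linter reports without touching a real project, you can lint a throwaway file.
This is a small sketch that assumes the `lintr` package is installed
(the check is skipped otherwise); the two-line script is made up for illustration.

```r
# Write a tiny script with a deliberate style problem:
# `=` is used for assignment instead of `<-`.
script_path <- tempfile(fileext = ".R")
writeLines(c("x = 1", "print(x)"), script_path)

if (requireNamespace("lintr", quietly = TRUE)) {
    lints <- lintr::lint(script_path)
    print(lints)  # one entry per issue, with file, line, and message
    message("Number of issues found: ", length(lints))
} else {
    message("lintr is not installed; skipping the lint check")
}
```

Outside of RStudio, printing the result is the easiest way to read the same list of issues.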

## Metadata

Now that your scripts are documented, you need to start documenting the inputs and outputs of your scripts.

Ideally all the non-`original` data can be recreated from your `R` code,
and those instructions are placed in a master script file like a `Makefile` or `bash` script.
But we'll keep it simple for now, and just document the process.
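
For reference, the master-script idea can also be sketched in R itself rather than in a `Makefile` or `bash` script.
The function name `run_pipeline()` and the `src` folder are assumptions made for this illustration;
the only thing it relies on is the numbered file naming scheme (`01-...R`, `02-...R`) shown earlier.

```r
#' Run every numbered script in a folder, in order
#'
#' @param script_dir folder holding scripts named like `01-data_ingestion.R`
#' @return (invisibly) the paths of the scripts that were run
run_pipeline <- function(script_dir) {
    # Keep only files that start with a number and end in .R
    scripts <- sort(list.files(script_dir,
                               pattern = "^[0-9]+-.*\\.R$",
                               full.names = TRUE))
    for (script in scripts) {
        message("Running ", script)
        # Each script gets its own environment so that variables
        # from one step do not leak into the next one.
        source(script, local = new.env(parent = globalenv()))
    }
    invisible(scripts)
}

# Usage (assumes your scripts live in ./src):
# run_pipeline("src")
```

Note that `source()` reuses the current R session, so loaded packages carry over between steps.
Launching each step with `Rscript` instead would give every script a fresh session, at the cost of a slower run.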

#### Datasets

For each input dataset, list what is in the dataset, where it came from,
and what you used it for.
You can use a format like this:

```
- my_awesome_dataset.csv
    - ./data/folder/original/awesome/my_awesome_dataset.csv
    - contains data about how awesome the various datasets are
    - used to calculate the 'awesome' metric
```

We use the `original`/`working`/`final` folders in our data folder.
Each `working` and `final` dataset should list which script it comes from.

```
- web_scraped_data.RData
    - data scraped from the web
    - comes from ./src/dan/web/scraping.R
```
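
If you would rather generate these entries than type them, a small helper can produce the same bulleted format.
This is a sketch only: `metadata_entry()` is a hypothetical function written for this example, not part of any package.

```r
#' Format one metadata entry in the bulleted style shown above
#'
#' @param name file name or path used as the top-level bullet
#' @param details character vector, one element per sub-bullet
#' @return a character vector with one element per output line
metadata_entry <- function(name, details) {
    c(paste0("- ", name), paste0("    - ", details))
}

entry <- metadata_entry(
    "web_scraped_data.RData",
    c("data scraped from the web",
      "comes from ./src/dan/web/scraping.R")
)
writeLines(entry)

# To append the entry to a README instead of printing it:
# cat(entry, "", file = "README.md", sep = "\n", append = TRUE)
```

Calling the helper at the end of the script that writes the dataset keeps the entry and the data in sync.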

#### Reports/Posters/etc

Create a `doc` folder at the top level.
Whatever the 'final' version of your poster or report is, it should live here and be checked in.

#### Figures and Tables

Each figure and table used in the poster should list the script it comes from.
The exact steps to regenerate that figure should be thoroughly listed:

```
- ./output/poster_fig_1.png
    - scatter plot for the amazing data
    - generated by: `./src/dan/amazing/plot_code.R`
```
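
A figure is easiest to regenerate when the script itself writes the file, instead of relying on someone exporting it by hand.
The sketch below shows that pattern with base R graphics.
It makes a few assumptions: the built-in `cars` dataset stands in for the "amazing" data,
a temporary folder stands in for `./output`, and the figure is saved as a PDF because the `pdf()` device is available on every R installation.

```r
# Hypothetical stand-in for a plotting script such as plot_code.R.
# Assumption: a temporary folder replaces ./output so this runs anywhere.
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
fig_path <- file.path(out_dir, "poster_fig_1.pdf")

pdf(fig_path, width = 6, height = 4)  # open the file device
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Scatter plot of the built-in cars data")
dev.off()                             # close the device so the file is written

message("Wrote ", fig_path)
```

Running the script from top to bottom is then the complete set of instructions for recreating the figure.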

## README files

The top-level `README` file should contain all the information about what the project is
and how to get started.
It should link to (or contain) the metadata information from the previous section.
Remember, this is the first thing people will see when they open a repository.
The more information here about where things are, the better.


### Example File

```
# Datasets

- my_awesome_dataset.csv
    - ./data/folder/original/awesome/my_awesome_dataset.csv
    - contains data about how awesome the various datasets are
    - used to calculate the 'awesome' metric

- web_scraped_data.RData
    - data scraped from the web
    - comes from ./src/dan/web/scraping.R

# Figures

- ./output/poster_fig_1.png
    - scatter plot for the amazing data
    - generated by: `./src/dan/amazing/plot_code.R`


- ./output/poster_fig_2.pdf
    - violin plot for the amazing data
    - generated by: `./src/dan/amazing/plot_code_42.R`
```
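
A metadata file is only useful while the paths in it are still correct.
As a rough sketch, the check below pulls every `./...` path out of a README and reports the ones that no longer exist.
The regular expression and the three sample lines are assumptions for illustration;
in a real project you would read the lines with `readLines("README.md")` and run the check from the project root.

```r
# Sample lines copied from the example file; in practice use:
# readme_lines <- readLines("README.md")
readme_lines <- c(
    "- ./output/poster_fig_1.png",
    "    - scatter plot for the amazing data",
    "    - generated by: `./src/dan/amazing/plot_code.R`"
)

# Assumption: paths start with "./" and contain only letters, digits,
# underscores, dots, slashes, and hyphens.
path_pattern <- "\\./[A-Za-z0-9_./-]+"
listed_paths <- unique(unlist(
    regmatches(readme_lines, gregexpr(path_pattern, readme_lines))
))

missing_paths <- listed_paths[!file.exists(listed_paths)]
if (length(missing_paths) > 0) {
    message("Listed in the README but not found on disk:")
    writeLines(missing_paths)
}
```

Because the sample paths are made up, both of them will be reported as missing unless matching files exist in your working directory.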