156 changes: 156 additions & 0 deletions chapters/02_primary_practices.md
@@ -229,6 +229,162 @@ toy_data/            Very (very!) small pieces of data for dev testing
If a directory contains many files or subdirectories, consider whether it's
clearer to write a separate manifest specifically for that directory.

### Document the Data

**Metadata**, or data that describes data, is critical to the research process.
It delineates how the data were collected, what assumptions were made, what
biases might be present, any ethical concerns, the overall structure, what each
observation means, what each feature means, and more. Good data
documentation guides researchers towards appropriate, responsible use of the
data in future studies.

Good metadata should answer the questions who, what, when, where, why, and how,
though exactly how it answers them will depend on your field of research. If you
are submitting your data to a particular data repository, it will likely have a
required metadata scheme to follow. Otherwise, pick a metadata scheme that
aligns with other researchers in your field. If you have to submit a Data
Management Plan, it will specifically ask how you will apply and adhere to
field-specific data standards.

If you aren't sure what the standard is in your field, there are several online
repositories to help you out. The [Metadata Standards Catalog][msc] has a fairly
exhaustive list of metadata schemes, which you can browse [by
subject][msc-subject]. [Fairsharing.org][fairshare] also stores metadata and
other documentation standards. By using an existing community standard metadata
scheme, you make it possible for future researchers (including you!) to compare
your data to data from other, heterogeneous sources.

```{note}
Many metadata resources refer to something called a **[controlled
vocabulary][c-vocab]**. This is a list of specific values, each with a
predefined meaning. It is designed to provide consistency and uniqueness across
data sources. One common example of a controlled vocabulary is a list of
geographic names, like the [Thesaurus of Geographic Names (TGN)][tgn]. There are
many ways you can refer to New York City (NYC, the Big Apple, Manhattan, etc.).
But if you want to be able to group together all data about New York City, it is
helpful if everyone calls it the same thing.
```
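
To make this concrete, here is a minimal sketch in Python of checking incoming
values against a controlled vocabulary. The vocabulary and the place names are
invented for illustration; in practice the approved terms would come from an
authority such as the TGN.

```python
# A tiny, invented controlled vocabulary of place names. In practice the
# approved terms would come from an authority such as the Getty TGN.
PLACE_VOCABULARY = {"New York City", "Los Angeles", "Chicago"}

# Place names as they arrived from different data sources.
raw_places = ["New York City", "NYC", "the Big Apple", "Chicago"]

# Flag anything that isn't an approved term, so it can be reviewed and
# mapped to the controlled value before the data sources are combined.
unrecognized = [place for place in raw_places if place not in PLACE_VOCABULARY]
print(unrecognized)  # ['NYC', 'the Big Apple']
```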

Even if you don't know where your data will end up, documenting your data when
you collect it will help ensure your documentation doesn't have any gaps. Timely
documentation also maximizes the likelihood that your research can be
reproduced, and that your data can be reused by other researchers. If your project uses
data collected earlier or by someone else, it's a good practice to fill gaps in
the existing documentation with your own. Thorough documentation isn't just
beneficial to other researchers, it's also beneficial to future you---small
details you notice and document about features could be important later in the
project.

```{figure} /images/michener_information_entropy.png
---
name: information-entropy
figwidth: 550px
align: center
alt: Line graph from Michener et al. 1997 showing how information about a data set degrades over time.
---
Information entropy (Figure 1 from [Michener et al. 1997][michener]).
```

One of the simplest and most widely used metadata standards is the [Dublin
Core][dublin-core], a set of 15 metadata elements originally defined at a 1995
workshop in Dublin, Ohio. The exact definition of the Dublin Core elements can
be a bit technical, but the University College Dublin (Ireland) Library provides
simplified explanations and examples [here][dublin-ex].
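
As a rough illustration, here is what the 15 Dublin Core elements might look
like for a hypothetical survey data set, written out as a plain Python
dictionary. The element names follow the Dublin Core standard; all of the
values are invented.

```python
# The 15 Dublin Core elements, filled in for a hypothetical data set.
# Element names follow the Dublin Core standard; all values are examples.
dublin_core_record = {
    "title": "Urban Commute Survey 2023",
    "creator": "Example University Transportation Lab",
    "subject": "commuting; public transit; survey data",
    "description": "Responses from a 2023 survey of commute modes and times.",
    "publisher": "Example University",
    "contributor": "City Transit Agency (survey distribution)",
    "date": "2023-09-15",
    "type": "Dataset",
    "format": "text/csv",
    "identifier": "doi:10.0000/example.12345",
    "source": "Primary data collection",
    "language": "en",
    "relation": "Companion to the 2021 Urban Commute Survey",
    "coverage": "Example City, January to June 2023",
    "rights": "CC BY 4.0",
}
```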

If all of this seems overwhelming, that's okay. The Consortium of European
Social Science Data Archives (CESSDA) has a great [video][cessda-video] for
those who have never documented data before. CESSDA also provides detailed
explanations of what information to document at both project and data level in
their [Data Management Expert Guide][cessda-guide]. This includes detailed
information about documenting quantitative and qualitative data. Just make sure
to expand all the collapsed sections.


:::{seealso}
There are many resources available on documenting your data. Here is a
selection of them:
- [Metadata Standards Catalog][msc]
- [Fairsharing.org][fairshare]
- [README, Write Me! DataLab workshop reader][datalab-readme]
- [UC Davis Research Data Management LibGuide][lib-metadata]
- [CESSDA's Data Management Expert Guide][cessda-guide]
- [The Turing Way on Documentation and Metadata][turing-metadata]
- [MIT Metadata Info][mit-metadata]
- [Harvard Biomedical Documentation and Metadata][harvard-metadata]
- University College Dublin on [Metadata][dublin-ex] and [Documentation][ucd-doc]
:::


[lib-metadata]: https://guides.library.ucdavis.edu/data-management/documentation
[msc]: https://rdamsc.bath.ac.uk/
[msc-subject]: https://rdamsc.bath.ac.uk/subject-index
[fairshare]: https://fairsharing.org/
[michener]: https://esajournals.onlinelibrary.wiley.com/doi/10.1890/1051-0761%281997%29007%5B0330%3ANMFTES%5D2.0.CO%3B2
[mit-metadata]: https://libraries.mit.edu/data-management/store/documentation/
[dublin-core]: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#section-3
[dublin-ex]: https://libguides.ucd.ie/data/metadata
[cessda-video]: https://www.youtube.com/watch?v=cjGz-I0GgKk
[cessda-guide]: https://dmeg.cessda.eu/Data-Management-Expert-Guide/2.-Organise-Document/Documentation-and-metadata
[tgn]: https://www.getty.edu/research/tools/vocabularies/tgn/index.html
[turing-metadata]: https://book.the-turing-way.org/reproducible-research/rdm/rdm-metadata/
[harvard-metadata]: https://datamanagement.hms.harvard.edu/collect-analyze/documentation-metadata
[c-vocab]: https://rdf-vocabulary.ddialliance.org/
[ucd-doc]: https://libguides.ucd.ie/data/documentation

(create-data-dictionary)=
#### Create a Data Dictionary

A **data dictionary** is a document that explains what every field or element in
your dataset means, as well as any restrictions on its values. This includes
things like the data type (e.g., number, date, text, boolean) and whether that
field can be missing. The more information you include, the more helpful it will
be down the line (see [Captain Obvious][captain_o]). Data dictionaries are the
most efficient way to communicate the structure and content of your data to
other collaborators, including future you! A very basic one could look like
this:

|Field Name |Field Description |Data Type |Notes |
|-----------|------------------------------------------|------------|----------|
|person_id  |Autogenerated by the database             |integer     |          |
|name       |Legal full name (family name, given name) |string      |          |
|occupation |A person's job or vocation                |string      |Must come from the Bureau of Labor Statistics Occupation List |
|... |... |... |... |

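A data dictionary is useful to code, not just to people. The sketch below
(Python with pandas; the file names and the extra "Required" column are
assumptions, not part of the example above) checks a data set against its
data dictionary:

```python
import pandas as pd

# File names are hypothetical; adjust them to match your project.
data = pd.read_csv("people.csv")
data_dictionary = pd.read_csv("people_data_dictionary.csv")

# Every field in the data should be described in the data dictionary.
documented_fields = set(data_dictionary["Field Name"])
undocumented = set(data.columns) - documented_fields
if undocumented:
    print(f"Fields missing from the data dictionary: {sorted(undocumented)}")

# Fields the dictionary marks as required should never be missing. This
# assumes the dictionary has a "Required" column with yes/no values.
required = data_dictionary.loc[
    data_dictionary["Required"].str.lower() == "yes", "Field Name"
]
for field in required:
    if field in data.columns and data[field].isna().any():
        print(f"Required field '{field}' has missing values.")
```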


If you aren't sure where to start with creating a data dictionary, DataLab has a
[template][datalab_dd_template] you can use as a jumping-off point. [Open
Science Framework][osf_dd] has resources on what details to add to your data
dictionary, and the [USGS][usgs_dd] provides many examples of data dictionaries
and how they are used in different contexts. If you are working with multiple
data sets, make sure to clarify which data dictionary to use with each data set.

If your dataset looks less like a series of rows and columns, and more like a
long list of files, consider creating a **data inventory** instead. A data
inventory should include the author or source, title, publication year (if
published), and file name for each file, but can include more file metadata as
necessary. A data inventory for a public domain fiction data set would look
something like this:

|Author |Title |Year |Filename |
|--------------------|--------------------|-----|------------------------------------------|
|Bronte, Charlotte   |Jane Eyre             |1847 |EN_1847_BronteCharlotte_JaneEyre.txt      |
|Austen, Jane        |Sense and Sensibility |1811 |EN_1811_AustenJane_SenseandSensibility.txt|
|Wollstonecraft, Mary|Maria                 |1798 |EN_1798_WollstonecraftMary_Maria.txt      |
|... |... |... |... |
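
If your files follow a consistent naming convention like the one above, you can
draft a data inventory programmatically and then fill in the rest by hand. This
is just a sketch in Python with pandas; the `texts/` directory and the
LANGUAGE_YEAR_Author_Title.txt convention are assumptions based on the example
filenames.

```python
from pathlib import Path

import pandas as pd

rows = []
# Assumes files named like EN_1847_BronteCharlotte_JaneEyre.txt in a
# "texts" directory next to this script.
for path in sorted(Path("texts").glob("*.txt")):
    _language, year, author, title = path.stem.split("_", maxsplit=3)
    # The author and title still need light cleanup by hand afterwards
    # (e.g., "BronteCharlotte" -> "Bronte, Charlotte").
    rows.append(
        {"Author": author, "Title": title, "Year": int(year), "Filename": path.name}
    )

inventory = pd.DataFrame(rows)
inventory.to_csv("data_inventory.csv", index=False)
print(inventory)
```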


If you also need to keep track of things like the provenance or license
associated with each file or data set, DataLab's
[data inventory template][datalab_di_template] provides a pretty comprehensive
starting point.

[osf_dd]: https://help.osf.io/article/217-how-to-make-a-data-dictionary
[usgs_dd]: https://www.usgs.gov/data-management/data-dictionaries
[captain_o]: https://dataedo.com/blog/captain-obivous-guide-to-column-descriptions-data-dictionary-best-practices
[datalab_dd_template]: https://docs.google.com/spreadsheets/d/12N0hKyeT0ndZnt7rVZsz7LTW--BHhbb6TOegXEKQoxE/edit?usp=sharing
[datalab_di_template]: https://docs.google.com/spreadsheets/d/1nUb-eu82Q7VplDpk0np5rYuaN52mYHLdql18pRD0i4Y/edit?usp=sharing


(workflows)=
#### Workflows
25 changes: 0 additions & 25 deletions chapters/03_secondary_practices.md
@@ -12,31 +12,6 @@ relevant, and we recommend you do too.
Documentation
-------------

### Document the Data

In a perfect world, every data set would come with detailed documentation or
**metadata** about how the data were collected, what assumptions were made,
what biases might be present, any ethical concerns, the overall structure, what
each observation means, what each feature means, and more. Good data
documentation guides researchers towards appropriate, responsible use of the
data.

Collecting data as part of a project gives you and your collaborators control
over how the data are documented, so you can ensure there are no gaps. If your
project uses data collected earlier or by someone else, it's a good practice to
fill gaps in the existing documentation with your own. Thorough documentation
isn't just beneficial to other researchers, it's also beneficial to future
you---small details you notice and document about features could be important
later in the project.

:::{seealso}
See DataLab's [README, Write Me! workshop reader][datalab-readme] for more
about how to document data.
:::

[datalab-readme]: https://ucdavisdatalab.github.io/workshop_how-to-data-documentation/


(document-every-experiment)=
### Document Every Experiment

Binary file added images/michener_information_entropy.png