156 changes: 156 additions & 0 deletions chapters/02_primary_practices.md
@@ -229,6 +229,162 @@ toy_data/            Very (very!) small pieces of data for dev testing
If a directory contains many files or subdirectories, consider whether it's
clearer to write a separate manifest specifically for that directory.

### Document the Data

**Metadata**, or data that describes data, is critical to the research process.
It delineates how the data were collected, what assumptions were made, what
biases might be present, any ethical concerns, the overall structure, what each
observation means, what each feature means, and more. Good data
documentation guides researchers towards appropriate, responsible use of the
data in future studies.

Good metadata should answer the questions who, what, when, where, why, and how,
though exactly how it answers them will depend on your field of research. If you
are submitting your data to a particular data repository, it will likely have a
required metadata scheme to follow. Otherwise, pick a metadata scheme that
aligns with other researchers in your field. If you have to submit a Data
Management Plan, it will specifically ask how you will apply and adhere to
field-specific data standards.

If you aren't sure what the standard is in your field, there are several online
repositories to help you out. The [Metadata Standards Catalog][msc] has a fairly
exhaustive list of metadata schemes, which you can browse [by
subject][msc-subject]. [Fairsharing.org][fairshare] also stores metadata and
other documentation standards. By using an existing community standard metadata
scheme, you make it possible for future researchers (including you!) to compare
your data to data from other, heterogeneous sources.

```{note}
Many metadata resources refer to something called a **[controlled
vocabulary][c-vocab]**. This is a list of specific values, each with a
predefined meaning. It is designed to provide consistency and uniqueness across
data sources. One common example of a controlled vocabulary is a list of
geographic names, like the [Thesaurus of Geographic Names (TGN)][tgn]. There are
many ways you can refer to New York City (NYC, the Big Apple, Manhattan, etc.).
But if you want to be able to group together all data about New York City, it is
helpful if everyone calls it the same thing.
```
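
To make this concrete, here is a minimal sketch in Python of checking incoming
values against a controlled vocabulary. The vocabulary and the place names are
invented for illustration; in practice the approved terms would come from an
authority such as the TGN.

```python
# A tiny, invented controlled vocabulary of place names. In practice the
# approved terms would come from an authority such as the Getty TGN.
PLACE_VOCABULARY = {"New York City", "Los Angeles", "Chicago"}

# Place names as they arrived from different data sources.
raw_places = ["New York City", "NYC", "the Big Apple", "Chicago"]

# Flag anything that isn't an approved term, so it can be reviewed and
# mapped to the controlled value before the data sources are combined.
unrecognized = [place for place in raw_places if place not in PLACE_VOCABULARY]
print(unrecognized)  # ['NYC', 'the Big Apple']
```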

Even if you don't know where your data will end up, documenting your data when
you collect it will help ensure your documentation doesn't have any gaps. Timely
documentation also maximizes the likelihood that your research can be
reproduced, and that your data can be reused by other researchers. If your project uses
data collected earlier or by someone else, it's a good practice to fill gaps in
the existing documentation with your own. Thorough documentation isn't just
beneficial to other researchers, it's also beneficial to future you---small
details you notice and document about features could be important later in the
project.

```{figure} /images/michener_information_entropy.png
---
name: information-entropy
figwidth: 550px
align: center
alt: Line graph from Michener et al. 1997 showing how information about a data set degrades over time.
---
Information entropy (Figure 1 from [Michener et al. 1997][michener]).
```

One of the simplest and most widely used metadata standards is the [Dublin
Core][dublin-core], a set of 15 metadata elements originally defined at a 1995
workshop in Dublin, Ohio. The exact definition of the Dublin Core elements can
be a bit technical, but the University College Dublin (Ireland) Library provides
simplified explanations and examples [here][dublin-ex].
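
As a rough illustration, here is what the 15 Dublin Core elements might look
like for a hypothetical survey data set, written out as a plain Python
dictionary. The element names follow the Dublin Core standard; all of the
values are invented.

```python
# The 15 Dublin Core elements, filled in for a hypothetical data set.
# Element names follow the Dublin Core standard; all values are examples.
dublin_core_record = {
    "title": "Urban Commute Survey 2023",
    "creator": "Example University Transportation Lab",
    "subject": "commuting; public transit; survey data",
    "description": "Responses from a 2023 survey of commute modes and times.",
    "publisher": "Example University",
    "contributor": "City Transit Agency (survey distribution)",
    "date": "2023-09-15",
    "type": "Dataset",
    "format": "text/csv",
    "identifier": "doi:10.0000/example.12345",
    "source": "Primary data collection",
    "language": "en",
    "relation": "Companion to the 2021 Urban Commute Survey",
    "coverage": "Example City, January to June 2023",
    "rights": "CC BY 4.0",
}
```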

If all of this seems overwhelming, that's okay. The Consortium of European
Social Science Data Archives (CESSDA) has a great [video][cessda-video] for
those who have never documented data before. CESSDA also provides detailed
explanations of what information to document at both project and data level in
their [Data Management Expert Guide][cessda-guide]. This includes detailed
information about documenting quantitative and qualitative data. Just make sure
to expand all the collapsed sections.


:::{seealso}
There are many resources available on documenting your data. Here is a
selection of them:
- [Metadata Standards Catalog][msc]
- [Fairsharing.org][fairshare]
- [README, Write Me! DataLab workshop reader][datalab-readme]
- [UC Davis Research Data Management LibGuide][lib-metadata]
- [CESSDA's Data Management Expert Guide][cessda-guide]
- [The Turing Way on Documentation and Metadata][turing-metadata]
- [MIT Metadata Info][mit-metadata]
- [Harvard Biomedical Documentation and Metadata][harvard-metadata]
- University College Dublin on [Metadata][dublin-ex] and [Documentation][ucd-doc]
:::


[lib-metadata]: https://guides.library.ucdavis.edu/data-management/documentation
[msc]: https://rdamsc.bath.ac.uk/
[msc-subject]: https://rdamsc.bath.ac.uk/subject-index
[fairshare]: https://fairsharing.org/
[michener]: https://esajournals.onlinelibrary.wiley.com/doi/10.1890/1051-0761%281997%29007%5B0330%3ANMFTES%5D2.0.CO%3B2
[mit-metadata]: https://libraries.mit.edu/data-management/store/documentation/
[dublin-core]: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#section-3
[dublin-ex]: https://libguides.ucd.ie/data/metadata
[cessda-video]: https://www.youtube.com/watch?v=cjGz-I0GgKk
[cessda-guide]: https://dmeg.cessda.eu/Data-Management-Expert-Guide/2.-Organise-Document/Documentation-and-metadata
[tgn]: https://www.getty.edu/research/tools/vocabularies/tgn/index.html
[turing-metadata]: https://book.the-turing-way.org/reproducible-research/rdm/rdm-metadata/
[harvard-metadata]: https://datamanagement.hms.harvard.edu/collect-analyze/documentation-metadata
[c-vocab]: https://rdf-vocabulary.ddialliance.org/
[ucd-doc]: https://libguides.ucd.ie/data/documentation

(create-data-dictionary)=
#### Create a Data Dictionary

A **data dictionary** is a document that explains what every field or element in
your dataset means, as well as any restrictions on its values. This includes
things like the data type (e.g., number, date, text, boolean) and whether that
field can be missing. The more information you include, the more helpful it will
be down the line (see [Captain Obvious][captain_o]). Data dictionaries are the
most efficient way to communicate the structure and content of your data to
other collaborators, including future you! A very basic one could look like
this:

|Field Name |Field Description |Data Type |Notes |
|-----------|------------------------------------------|------------|----------|
|person_id  |Autogenerated by the database             |integer     |          |
|name       |Legal full name (family name, given name) |string      |          |
|occupation |A person's job or vocation                |string      |Must come from the Bureau of Labor Statistics Occupation List |
|... |... |... |... |

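A data dictionary is useful to code, not just to people. The sketch below
(Python with pandas; the file names and the extra "Required" column are
assumptions, not part of the example above) checks a data set against its
data dictionary:

```python
import pandas as pd

# File names are hypothetical; adjust them to match your project.
data = pd.read_csv("people.csv")
data_dictionary = pd.read_csv("people_data_dictionary.csv")

# Every field in the data should be described in the data dictionary.
documented_fields = set(data_dictionary["Field Name"])
undocumented = set(data.columns) - documented_fields
if undocumented:
    print(f"Fields missing from the data dictionary: {sorted(undocumented)}")

# Fields the dictionary marks as required should never be missing. This
# assumes the dictionary has a "Required" column with yes/no values.
required = data_dictionary.loc[
    data_dictionary["Required"].str.lower() == "yes", "Field Name"
]
for field in required:
    if field in data.columns and data[field].isna().any():
        print(f"Required field '{field}' has missing values.")
```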


If you aren't sure where to start with creating a data dictionary, DataLab has a
[template][datalab_dd_template] you can use as a jumping-off point. [Open
Science Framework][osf_dd] has resources on what details to add to your data
dictionary, and the [USGS][usgs_dd] provides many examples of data dictionaries
and how they are used in different contexts. If you are working with multiple
data sets, make sure to clarify which data dictionary to use with each data set.

If your dataset looks less like a series of rows and columns, and more like a
long list of files, consider creating a **data inventory** instead. A data
inventory should include the author or source, title, publication year (if
published), and file name for each file, but can include more file metadata as
necessary. A data inventory for a public domain fiction data set would look
something like this:

|Author |Title |Year |Filename |
|--------------------|--------------------|-----|------------------------------------------|
|Bronte, Charlotte   |Jane Eyre             |1847 |EN_1847_BronteCharlotte_JaneEyre.txt      |
|Austen, Jane        |Sense and Sensibility |1811 |EN_1811_AustenJane_SenseandSensibility.txt|
|Wollstonecraft, Mary|Maria                 |1798 |EN_1798_WollstonecraftMary_Maria.txt      |
|... |... |... |... |
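
If your files follow a consistent naming convention like the one above, you can
draft a data inventory programmatically and then fill in the rest by hand. This
is just a sketch in Python with pandas; the `texts/` directory and the
LANGUAGE_YEAR_Author_Title.txt convention are assumptions based on the example
filenames.

```python
from pathlib import Path

import pandas as pd

rows = []
# Assumes files named like EN_1847_BronteCharlotte_JaneEyre.txt in a
# "texts" directory next to this script.
for path in sorted(Path("texts").glob("*.txt")):
    _language, year, author, title = path.stem.split("_", maxsplit=3)
    # The author and title still need light cleanup by hand afterwards
    # (e.g., "BronteCharlotte" -> "Bronte, Charlotte").
    rows.append(
        {"Author": author, "Title": title, "Year": int(year), "Filename": path.name}
    )

inventory = pd.DataFrame(rows)
inventory.to_csv("data_inventory.csv", index=False)
print(inventory)
```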


If you also need to keep track of things like the provenance or license
associated with each file or data set, DataLab's
[data inventory template][datalab_di_template] provides a pretty comprehensive
starting point.

[osf_dd]: https://help.osf.io/article/217-how-to-make-a-data-dictionary
[usgs_dd]: https://www.usgs.gov/data-management/data-dictionaries
[captain_o]: https://dataedo.com/blog/captain-obivous-guide-to-column-descriptions-data-dictionary-best-practices
[datalab_dd_template]: https://docs.google.com/spreadsheets/d/12N0hKyeT0ndZnt7rVZsz7LTW--BHhbb6TOegXEKQoxE/edit?usp=sharing
[datalab_di_template]: https://docs.google.com/spreadsheets/d/1nUb-eu82Q7VplDpk0np5rYuaN52mYHLdql18pRD0i4Y/edit?usp=sharing


(workflows)=
#### Workflows
25 changes: 0 additions & 25 deletions chapters/03_secondary_practices.md
@@ -12,31 +12,6 @@ relevant, and we recommend you do too.
Documentation
-------------

### Document the Data

In a perfect world, every data set would come with detailed documentation or
**metadata** about how the data were collected, what assumptions were made,
what biases might be present, any ethical concerns, the overall structure, what
each observation means, what each feature means, and more. Good data
documentation guides researchers towards appropriate, responsible use of the
data.

Collecting data as part of a project gives you and your collaborators control
over how the data are documented, so you can ensure there are no gaps. If your
project uses data collected earlier or by someone else, it's a good practice to
fill gaps in the existing documentation with your own. Thorough documentation
isn't just beneficial to other researchers, it's also beneficial to future
you---small details you notice and document about features could be important
later in the project.

:::{seealso}
See DataLab's [README, Write Me! workshop reader][datalab-readme] for more
about how to document data.
:::

[datalab-readme]: https://ucdavisdatalab.github.io/workshop_how-to-data-documentation/


(document-every-experiment)=
### Document Every Experiment

Binary file added images/michener_information_entropy.png