diff --git a/chapters/02_primary_practices.md b/chapters/02_primary_practices.md index 2f16230..6cd048f 100644 --- a/chapters/02_primary_practices.md +++ b/chapters/02_primary_practices.md @@ -229,6 +229,162 @@ toy_data/ Very (very!) small pieces of data for dev testing If a directory contains many files or subdirectories, consider whether it's clearer to write a separate manifest specifically for that directory. +### Document the Data + +**Metadata**, or data that describes data, is critical to the research process. +It delineates how the data were collected, what assumptions were made, what +biases might be present, any ethical concerns, the overall structure, what each +observation means, what each feature means, and more. Good data +documentation guides researchers towards appropriate, responsible use of the +data in future studies. + +Good metadata should answer the questions who, what, when, where, why, and how, +though the way it answers these questions will depend on your field of +research. If you are submitting your data to a particular data repository, they +will likely have a required metadata scheme to follow. Otherwise, pick a +metadata scheme that aligns with other researchers in your field. If you have to +submit a Data Management Plan, it will specifically ask how you will apply and +adhere to field-specific data standards. + +If you aren't sure what the standard is in your field, there are several online +repositories to help you out. The [Metadata Standards Catalog][msc] has a fairly +exhaustive list of metadata schemes, which you can browse [by +subject][msc-subject]. [Fairsharing.org][fairshare] also stores metadata and +other documentation standards. By using an existing community standard metadata +scheme, you make it possible for future researchers (including you!) to compare +your data to data from other, heterogeneous sources. 
+ +```{note} +Many metadata resources refer to something called a **[controlled +vocabulary][c-vocab]**. This is a list of specific values, each with a +predefined meaning. It is designed to provide consistency and uniqueness across +data sources. One common example of a controlled vocabulary is a list of +geographic names, like the [Thesaurus of Geographic Names (TGN)][tgn]. There are +many ways you can refer to New York City (NYC, the Big Apple, Manhattan, etc.). +But if you want to be able to group together all data about New York City, it is +helpful if everyone calls it the same thing. +``` + +Even if you don't know where your data will end up, documenting your data when +you collect it will help ensure your documentation doesn't have any gaps. Timely +documentation also maximizes the likelihood that your research can be +reproduced and that your data can be reused by other researchers. If your project uses +data collected earlier or by someone else, it's a good practice to fill gaps in +the existing documentation with your own. Thorough documentation isn't just +beneficial to other researchers, it's also beneficial to future you---small +details you notice and document about features could be important later in the +project. + +```{figure} /images/michener_information_entropy.png +--- +name: information-entropy +figwidth: 550px +align: center +alt: +--- +Information Entropy (Figure 1) from ([Michener et al. 1997][michener]). +``` + +One of the simplest and most widely used metadata standards is the [Dublin +Core][dublin-core], a set of 15 metadata elements originally defined at a 1995 +workshop in Dublin, Ohio. The exact definitions of the Dublin Core elements can +be a bit technical, but the University College Dublin (Ireland) Library provides +simplified explanations and examples [here][dublin-ex]. + +If all of this seems overwhelming, that's okay. 
The Consortium of European +Social Science Data Archives (CESSDA) has a great [video][cessda-video] for +those who have never documented data before. CESSDA also provides detailed +explanations of what information to document at both project and data level in +their [Data Management Expert Guide][cessda-guide]. This includes detailed +information about documenting quantitative and qualitative data. Just make sure +to expand all the collapsed sections. + + +:::{seealso} +There are many resources available on documenting your data. Here is a selection +of them: +- [Metadata Standards Catalog][msc] +- [Fairsharing.org][fairshare] +- [README, Write Me! DataLab workshop reader][datalab-readme] +- [UC Davis Research Data Management LibGuide][lib-metadata] +- [CESSDA's Data Management Expert Guide][cessda-guide] +- [The Turing Way on Documentation and Metadata][turing-metadata] +- [MIT Metadata Info][mit-metadata] +- [Harvard Biomedical Documentation and Metadata][harvard-metadata] +- University College Dublin on [Metadata][dublin-ex] and [Documentation][ucd-doc] +::: + + +[lib-metadata]: https://guides.library.ucdavis.edu/data-management/documentation +[msc]: https://rdamsc.bath.ac.uk/ +[msc-subject]: https://rdamsc.bath.ac.uk/subject-index +[fairshare]: https://fairsharing.org/ +[michener]: https://esajournals.onlinelibrary.wiley.com/doi/10.1890/1051-0761%281997%29007%5B0330%3ANMFTES%5D2.0.CO%3B2 +[mit-metadata]: https://libraries.mit.edu/data-management/store/documentation/ +[dublin-core]: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#section-3 +[dublin-ex]: https://libguides.ucd.ie/data/metadata +[cessda-video]: https://www.youtube.com/watch?v=cjGz-I0GgKk +[cessda-guide]: https://dmeg.cessda.eu/Data-Management-Expert-Guide/2.-Organise-Document/Documentation-and-metadata +[tgn]: https://www.getty.edu/research/tools/vocabularies/tgn/index.html +[turing-metadata]: https://book.the-turing-way.org/reproducible-research/rdm/rdm-metadata/ 
+[harvard-metadata]: https://datamanagement.hms.harvard.edu/collect-analyze/documentation-metadata +[c-vocab]: https://rdf-vocabulary.ddialliance.org/ +[ucd-doc]: https://libguides.ucd.ie/data/documentation + +(create-data-dictionary)= +#### Create a Data Dictionary + +A **data dictionary** is a document that explains what every field or element in +your dataset means as well as any restrictions on its values. This includes +things like the data type (e.g., number, date, text, boolean), and whether that +field can be missing. The more information you include, the more helpful it will +be down the line (see [Captain Obvious][captain_o]). Data dictionaries are the +most efficient way to communicate the structure and content of your data to +other collaborators, including future you! A very basic one could look like +this: + +|Field Name |Field Description |Data Type |Notes | +|-----------|------------------------------------------|------------|----------| +|person_id |autogenerated by database |integer | | +|name |legal full name (family name, given name) |string | | +|occupation |A person's job or vocation |string |Must come from the Bureau of Labor Statistics Occupation List | +|... |... |... |... | + + + +If you aren't sure where to start with creating a data dictionary, DataLab has a +[template][datalab_dd_template] you can use as a jumping-off point. [Open +Science Framework][osf_dd] has resources on what details to add to your data +dictionary, and the [USGS][usgs_dd] provides many examples of data dictionaries +and how they are used in different contexts. If you are working with multiple +data sets, make sure to clarify which data dictionary to use with each data set. + +If your dataset looks less like a series of rows and columns, and more like a +long list of files, consider creating a **data inventory** instead. 
A data +inventory should include the author or source, title, publication year (if +published), and file name for each file, but can include more file metadata as +necessary. A data inventory for a public domain fiction data set would look +something like this. + +|Author |Title |Year |Filename | +|--------------------|--------------------|-----|------------------------------------------| +|Bronte,Charlotte |JaneEyre |1847 |EN_1847_BronteCharlotte_JaneEyre.txt | +|Austen,Jane |SenseandSensibility |1811 |EN_1811_AustenJane_SenseandSensibility.txt| +|Wollstonecraft,Mary |Maria |1798 |EN_1798_WollstonecraftMary_Maria.txt | +|... |... |... |... | + + +If you also need to keep track of things like the provenance or license +associated with each file or data set, DataLab's +[data inventory template][datalab_di_template] provides a pretty comprehensive +starting point. + +[osf_dd]: https://help.osf.io/article/217-how-to-make-a-data-dictionary +[usgs_dd]: https://www.usgs.gov/data-management/data-dictionaries +[captain_o]: https://dataedo.com/blog/captain-obivous-guide-to-column-descriptions-data-dictionary-best-practices +[datalab_dd_template]: https://docs.google.com/spreadsheets/d/12N0hKyeT0ndZnt7rVZsz7LTW--BHhbb6TOegXEKQoxE/edit?usp=sharing +[datalab_di_template]: https://docs.google.com/spreadsheets/d/1nUb-eu82Q7VplDpk0np5rYuaN52mYHLdql18pRD0i4Y/edit?usp=sharing + (workflows)= #### Workflows diff --git a/chapters/03_secondary_practices.md b/chapters/03_secondary_practices.md index 6af33ea..6c6bbaa 100644 --- a/chapters/03_secondary_practices.md +++ b/chapters/03_secondary_practices.md @@ -12,31 +12,6 @@ relevant, and we recommend you do too. 
Documentation ------------- -### Document the Data - -In a perfect world, every data set would come with detailed documentation or -**metadata** about how the data were collected, what assumptions were made, -what biases might be present, any ethical concerns, the overall structure, what -each observation means, what each feature means, and more. Good data -documentation guides researchers towards appropriate, responsible use of the -data. - -Collecting data as part of a project gives you and your collaborators control -over how the data are documented, so you can ensure there are no gaps. If your -project uses data collected earlier or by someone else, it's a good practice to -fill gaps in the existing documentation with your own. Thorough documentation -isn't just beneficial to other researchers, it's also beneficial to future -you---small details you notice and document about features could be important -later in the project. - -:::{seealso} -See DataLab's [README, Write Me! workshop reader][datalab-readme] for more -about how to document data. -::: - -[datalab-readme]: https://ucdavisdatalab.github.io/workshop_how-to-data-documentation/ - - (document-every-experiment)= ### Document Every Experiment diff --git a/images/michener_information_entropy.png b/images/michener_information_entropy.png new file mode 100644 index 0000000..87ccd73 Binary files /dev/null and b/images/michener_information_entropy.png differ