Dataset creation guide

JulienLamour edited this page Dec 3, 2024 · 30 revisions

This guide provides a comprehensive overview of all the necessary steps to add a new dataset. The folder “Lamour_et_al_2021” can be used as a template since the code within the folder has been commented and corresponds to each step outlined in this guide.

An overview of the process and database organization is presented here.

1) Creation of a Dataset folder

Each dataset should be placed in a folder named “Names_Year”, for example “Serbin_et_al_2019”. Please also include any article associated with the dataset, together with the measurement protocol, which should detail the gas exchange measurements, the leaf reflectance measurements, and the equipment used. The protocol should also provide information about the location, growing conditions (e.g., natural environment, greenhouse, agricultural or experimental field, plants in pots), species (e.g., natural or agricultural), and plant status (e.g., stressed or not stressed).

In addition, please include details about your stability criterion to initiate the A-Ci curves. For example, did you wait for the stability of the photosynthesis rate and stomatal conductance before starting the curve? What was the average acclimation time of the leaf within the leaf chamber? Also include this information if you used the “one point method” to estimate Vcmax (De Kauwe et al. 2016, Burnett et al. 2019).

2) Adding a dataset description CSV file

Within each dataset folder, a CSV file called Description.csv has to be included.

This file will be used to list the authors, associated papers, and acknowledgments.

Table 1 Required Description.csv file

| Authors | Acknowledgment | Dataset_DOI | Publication_DOI | Email |
|---|---|---|---|---|
| List of authors of the dataset | Acknowledgment of funding and help to generate the dataset | Digital Object Identifier (DOI) associated with the dataset if the dataset was published | Digital Object Identifier (DOI) associated with a publication | Contact email for the dataset |
| Julien Lamour, Kenneth J. Davidson, Kim S. Ely, Jeremiah A. Anderson, Alistair Rogers, Jin Wu, Shawn P. Serbin | This work was supported by the Next-Generation Ecosystem Experiments (NGEE Tropics) project that is supported by the Office of Biological and Environmental Research in the Department of Energy, Office of Science, and through the United States Department of Energy contract No. DE-SC0012704 to Brookhaven National Laboratory. | 10.15486/ngt/1781003, 10.15486/ngt/1781004 | 10.1371/journal.pone.0258791 | [email protected] |
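
One possible way to generate this file from R is sketched below. Only the five column names come from Table 1; the folder name and all field values are placeholders, and using NA for a missing DOI is an assumption rather than a documented requirement.

```r
# Hypothetical dataset folder; in practice, use your own "Names_Year" folder.
dataset_dir <- file.path(tempdir(), "Author_et_al_2024")
dir.create(dataset_dir, recursive = TRUE, showWarnings = FALSE)

# One row with the five columns of Table 1 (all values are placeholders).
description <- data.frame(
  Authors         = "First Author, Second Author",
  Acknowledgment  = "Funding and help received to generate the dataset",
  Dataset_DOI     = NA_character_,   # assumption: NA when no DOI exists
  Publication_DOI = NA_character_,
  Email           = "name@example.org",
  stringsAsFactors = FALSE
)

write.csv(description, file.path(dataset_dir, "Description.csv"),
          row.names = FALSE)

# Read the file back to confirm that the column names are preserved.
description_check <- read.csv(file.path(dataset_dir, "Description.csv"),
                              stringsAsFactors = FALSE)
```

Reading the file back is a cheap way to catch a misspelled column name before the dataset is submitted.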

3) Adding a site description CSV file

A file named Site.csv must be included in the dataset folder with the columns listed below. The latitude and longitude coordinates will be used to position the dataset on a world map.

If a dataset covers several sites that are far enough apart to be distinguishable on a world map, or that belong to different biomes, add one row per site to your Site.csv file.

Table 2 Required Site.csv file

| Site_name | Latitude | Longitude | Elevation | Biome_number |
|---|---|---|---|---|
| Short text name for the site where the measurements were taken | Latitude in decimal units | Longitude in decimal units | Elevation in meters above sea level | Biome number as described in the documentation (1 to 19) |
| BCI | 9.1562792 | -79.862707 | 30 | 1 |
| PNM | 8.9943457 | -79.543073 | 100 | 2 |
| BRF | 41.413708 | -74.010606 | 1100 | 5 |

For the Biome_number column, please choose a number from the list below. We use the Olson et al. (2001) list of 14 natural biomes, complemented with five agricultural and managed biomes.

Table 3 Olson et al. (2001) list of ecosystems of the world, adapted to add managed biomes

| Biome | Biome_number |
|---|---|
| Tropical and subtropical moist broadleaf forests | 1 |
| Tropical and subtropical dry broadleaf forests | 2 |
| Tropical and subtropical coniferous forests | 3 |
| Tropical and subtropical grasslands, savannas and shrublands | 4 |
| Temperate broadleaf and mixed forests | 5 |
| Temperate coniferous forests | 6 |
| Temperate grasslands and savannas | 7 |
| Flooded grasslands and savannas | 8 |
| Montane grasslands and shrublands | 9 |
| Tundra | 10 |
| Mediterranean forests, woodlands and scrub | 11 |
| Boreal forests/Taiga | 12 |
| Desert and xeric shrublands | 13 |
| Mangroves | 14 |
| Managed Grasslands | 15 |
| Field Crop Ecosystems | 16 |
| Tree Crop Ecosystems | 17 |
| Greenhouse Ecosystems | 18 |
| Other Managed Ecosystems | 19 |
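
As an illustration, the sketch below builds Site.csv from the example rows of Table 2 and runs a few basic sanity checks before writing the file. The helper check_site() is hypothetical and is not part of the repository; the checks it performs (coordinate ranges, biome number between 1 and 19, unique site names) are assumptions about what a valid file should look like.

```r
# Example rows copied from Table 2.
site <- data.frame(
  Site_name    = c("BCI", "PNM", "BRF"),
  Latitude     = c(9.1562792, 8.9943457, 41.413708),
  Longitude    = c(-79.862707, -79.543073, -74.010606),
  Elevation    = c(30, 100, 1100),
  Biome_number = c(1L, 2L, 5L),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the repository): basic sanity checks.
check_site <- function(site) {
  stopifnot(
    all(site$Latitude  >= -90  & site$Latitude  <= 90),
    all(site$Longitude >= -180 & site$Longitude <= 180),
    all(site$Biome_number %in% 1:19),   # biome numbers listed in Table 3
    !anyDuplicated(site$Site_name)      # one row per site name
  )
  invisible(TRUE)
}

check_site(site)

# In practice, write the file into the dataset folder rather than tempdir().
site_file <- file.path(tempdir(), "Site.csv")
write.csv(site, site_file, row.names = FALSE)
```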

4) Adding the gas exchange A-Ci data to the dataset folder

The raw A-Ci data must be included in each dataset folder, in any parsable format. The final A-Ci data used by the fitting procedure must meet certain requirements (see the table below for the required column names), but there are no hard requirements on how this final data is obtained. Note that the fitting procedure uses the SampleID_num column to identify individual A-Ci curves. If you made several A-Ci curves on the same leaf, we recommend keeping only the best one. We use a SampleID_num column in addition to the SampleID column: SampleID should correspond to the original identifier of the leaves in the raw dataset, which is often a complex string, whereas SampleID_num should be an integer. The integer identifier makes the QA/QC of the curves and the plotting of the figures easier.

The A-Ci data should be cleaned of spurious measurements, and points that would bias the estimation of Vcmax or Jmax should not be included. If several measurements were taken at a given Ci, please keep only one so that each Ci has the same number of measurements. We are usually quite conservative in the quality analysis and only keep the curves for which the estimation of Vcmax will be reliable. If we have doubts about the quality of the data, we tend to remove them from the final curated data.

The curated A-Ci data should be present in the dataset folder as an Rdata file called ‘2_Fitted_ACi_data.Rdata’, which contains the A-Ci data in a data.frame with at least the columns listed in the table below.

Note that the dataset folder also includes the raw data, as well as the R code used to read, import, and transform the raw data. All those preliminary steps are performed in two R scripts called ‘0_Import_transform_ACi_data.R’ and ‘1_QaQc_curated_ACi.R’.

Table 4 Required column names and units for A-Ci data

| Column | Description | Units |
|---|---|---|
| SampleID | Identifier of the measured leaf | |
| SampleID_num | Integer identifier of the measured leaf | |
| Record | Observation record number | |
| A | Net CO2 exchange per leaf area | micromol m-2 s-1 |
| Ci | Intercellular CO2 concentration in air | micromol mol-1 |
| CO2s | CO2 concentration in wet air inside chamber | micromol mol-1 |
| Patm | Atmospheric pressure | kPa |
| Qin | In-chamber photosynthetic flux density incident on the leaf in quanta per area | micromol m-2 s-1 |
| RHs | Relative humidity of air inside the chamber | percent (0 - 100) |
| Tleaf | Leaf surface temperature | degrees Celsius |
| gsw | Stomatal conductance to water vapor per leaf area | mmol m-2 s-1 |
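
The sketch below shows one way to derive SampleID_num from SampleID and to verify that the required columns are present before saving the curated data. The toy values, the object name curated, and the output folder are illustrative assumptions; only the column names and the file name come from this guide.

```r
required_cols <- c("SampleID", "SampleID_num", "Record", "A", "Ci", "CO2s",
                   "Patm", "Qin", "RHs", "Tleaf", "gsw")

# Toy data: two leaves with three points each (all values are made up).
curated <- data.frame(
  SampleID = rep(c("BNL20202_leaf1", "BNL20202_leaf2"), each = 3),
  Record   = rep(1:3, times = 2),
  A        = c(2.1, 10.4, 18.9, 1.8, 9.7, 17.5),
  Ci       = c(60, 250, 800, 55, 240, 790),
  CO2s     = c(100, 400, 1200, 100, 400, 1200),
  Patm     = 101,
  Qin      = 1800,
  RHs      = 65,
  Tleaf    = 30,
  gsw      = 250,
  stringsAsFactors = FALSE
)

# Integer identifier: 1 for the first SampleID encountered, 2 for the next, ...
curated$SampleID_num <- match(curated$SampleID, unique(curated$SampleID))

# Stop early if any required column is missing.
missing_cols <- setdiff(required_cols, names(curated))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

# The file name follows this guide; tempdir() stands in for the dataset folder.
save(curated, file = file.path(tempdir(), "2_Fitted_ACi_data.Rdata"))
```

Using match() against unique() keeps the numbering stable within a script run and guarantees that all points of a curve share the same integer.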

5) Fitting the A-Ci data to estimate the photosynthetic traits

Estimation of Vcmax is done in the ‘2_Fit_ACi.R’ script included in each dataset folder. This script calls the function f.fit_Aci() to estimate the photosynthetic parameters Vcmax25, Jmax25, TPU25 and Rday25 of each A-Ci curve. It produces several PDF files:

  • 2_ACi_fitting_Ac.pdf

  • 2_ACi_fitting_Ac_Aj.pdf

  • 2_ACi_fitting_Ac_Aj_Ap.pdf

and

  • 2_ACi_fitting_best_model.pdf

The first three PDFs show the fit of each A-Ci curve when the model includes, respectively, Ac only, Ac and Aj, and Ac, Aj and Ap, where Ac is the carboxylation-limited rate, Aj the electron-transport-limited rate, and Ap the triose-phosphate-utilization-limited rate.

The best model is the one with the lowest AIC among Ac, Ac + Aj, and Ac + Aj + Ap. Note that if the model with the best AIC includes Ac only, then Vcmax25 and Rday25 are the only parameters estimated. If the model with the best AIC is Ac + Aj, then Jmax25 is also estimated. TPU25 is estimated only if the best model is Ac + Aj + Ap, that is, if Ac, Aj and Ap are all limiting. In all cases, the transitions between the Ac, Aj, and Ap rates are determined automatically by the fitting procedure to avoid manual and somewhat subjective choices.

This code produces a data frame called Bilan, with the following columns:
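
The selection rule described above can be illustrated with made-up AIC values. The sketch below is not the code of f.fit_Aci(); the model labels "Ac", "Ac_Aj" and "Ac_Aj_Ap" and the AIC values are hypothetical and only show how the lowest AIC determines which parameters are reported.

```r
# Made-up AIC values for two curves and the three candidate models.
fits <- data.frame(
  SampleID_num = rep(1:2, each = 3),
  Model        = rep(c("Ac", "Ac_Aj", "Ac_Aj_Ap"), times = 2),
  AIC          = c(41.2, 18.7, 20.1, 35.0, 36.4, 38.9),
  stringsAsFactors = FALSE
)

# For each curve, keep the row with the lowest AIC.
best <- do.call(rbind, lapply(split(fits, fits$SampleID_num), function(d) {
  d[which.min(d$AIC), ]
}))
rownames(best) <- NULL

# Parameters reported for each candidate model, as described in the text.
estimated <- list(
  Ac       = c("Vcmax25", "Rday25"),
  Ac_Aj    = c("Vcmax25", "Jmax25", "Rday25"),
  Ac_Aj_Ap = c("Vcmax25", "Jmax25", "TPU25", "Rday25")
)
best$Parameters <- unname(vapply(estimated[best$Model], paste,
                                 character(1), collapse = ", "))
```

With these values, the first curve is best described by Ac + Aj and the second by Ac alone, so Jmax25 would be reported for the first curve only.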

Table 5 Table produced by the A-Ci fitting

| Column | Description | Units |
|---|---|---|
| SampleID | Identifier of the measured leaf | |
| SampleID_num | Integer identifier of the measured leaf | |
| Vcmax25 | Maximum rate of carboxylation at the reference temperature of 25 degrees Celsius, calculated assuming infinite mesophyll conductance, i.e. apparent Vcmax | micromol m-2 s-1 |
| Jmax25 | Maximum rate of electron transport per leaf area at the reference temperature of 25 degrees Celsius, calculated assuming infinite mesophyll conductance and saturating light | micromol m-2 s-1 |
| TPU25 | Triose phosphate utilization rate per leaf area at the reference temperature of 25 degrees Celsius | micromol m-2 s-1 |
| Rday25 | CO2 release from the leaf in the light at the reference temperature of 25 degrees Celsius | micromol m-2 s-1 |
| StdError_Vcmax25 | Standard error of Vcmax25 estimation | micromol m-2 s-1 |
| StdError_Jmax25 | Standard error of Jmax25 estimation | micromol m-2 s-1 |
| StdError_TPU25 | Standard error of TPU25 estimation | micromol m-2 s-1 |
| StdError_Rday25 | Standard error of Rday25 estimation | micromol m-2 s-1 |
| Tleaf | Leaf surface temperature | degrees Celsius |
| sigma | Standard error of the residuals of the fitted A-Ci curve | micromol m-2 s-1 |
| AIC | Akaike information criterion | |
| Model | Model used for the fitting of the A-Ci curves | |
| Fitting_method | Method used to estimate Vcmax. Can be ‘A-Ci curve’ or ‘One point’ | |

It is also possible to estimate Vcmax25 with the one-point method (De Kauwe et al. 2016; Burnett et al. 2019). In that case, the measurements should be made at saturating irradiance and ambient CO2. You can use the function f.fit_One_Point() to estimate Vcmax25. It produces a data frame with the same structure as the one produced by f.fit_Aci().

Importantly, the same temperature correction is used for all the datasets to estimate the parameters at 25°C. Since Tleaf is also given in the output table, it will be possible to re-estimate the parameters at the leaf temperature and to try other temperature dependence parameterizations if needed. The A-Ci fitting can also be re-run with different parameterizations for all the datasets using the output from step 1.
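
As a rough illustration of re-estimating a parameter at leaf temperature, the sketch below applies a simple Arrhenius function to Vcmax25. This is an assumption made for the example: the guide does not state which temperature response the repository uses, the function name arrhenius_from_25() is hypothetical, and the activation energy Ha is an illustrative value only.

```r
# Simple Arrhenius scaling from 25 degrees Celsius to leaf temperature.
# Ha (J mol-1) is an illustrative activation energy, an assumption here and
# not necessarily the value used by the repository.
arrhenius_from_25 <- function(param25, Tleaf, Ha = 65330) {
  R_gas <- 8.314            # universal gas constant, J mol-1 K-1
  Tref  <- 298.15           # 25 degrees Celsius in kelvin
  Tk    <- Tleaf + 273.15   # leaf temperature in kelvin
  param25 * exp(Ha * (Tk - Tref) / (Tref * R_gas * Tk))
}

# Toy values standing in for the Vcmax25 and Tleaf columns of Bilan.
Bilan_example <- data.frame(Vcmax25 = c(50, 50), Tleaf = c(25, 30))
Bilan_example$Vcmax_Tleaf <- arrhenius_from_25(Bilan_example$Vcmax25,
                                               Bilan_example$Tleaf)
```

At a leaf temperature of 25 degrees Celsius the scaling factor is exactly 1, so the value is unchanged; above 25 degrees Celsius the scaled value is larger than Vcmax25.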

6) Adding dark-adapted leaf respiration data (optional)

If you measured the dark respiration of leaves, you can also add it to the dataset. All you need is to include a file with the following columns:

  • SampleID (the leaf identifier that is used everywhere to link different data)
  • Rdark (the dark respiration value, in micromol m-2 s-1, which corresponds to the CO2 release from the leaf in the dark, at measurement temperature, reported as a positive value)
  • Tleaf_Rdark, in degrees Celsius, the leaf temperature.

As for A-Ci curves, don't forget to also include the raw gas exchange instrument output files.
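
The sign convention can be sketched as follows, assuming the instrument reports net CO2 exchange (A), which is negative in the dark. The input object dark_raw and its values are made up; only the three output column names come from the list above.

```r
# Made-up dark measurements: net CO2 exchange (A) is negative in the dark.
dark_raw <- data.frame(
  SampleID = c("BNL20202", "BNL10101"),
  A        = c(-0.85, -1.20),   # micromol m-2 s-1
  Tleaf    = c(27.3, 25.1),     # degrees Celsius
  stringsAsFactors = FALSE
)

Rdark_data <- data.frame(
  SampleID    = dark_raw$SampleID,
  Rdark       = -dark_raw$A,    # CO2 release reported as a positive value
  Tleaf_Rdark = dark_raw$Tleaf,
  stringsAsFactors = FALSE
)

# Respiration should never be negative once the sign has been flipped.
stopifnot(all(Rdark_data$Rdark >= 0))
```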

7) Adding the leaf spectra data

The spectral information is ideally a full-range reflectance measurement (350 nm to 2500 nm) with a 1 nm resolution.

If you don’t have values for all the wavelengths (for example from 350 nm to 500 nm or from 2400 nm to 2500 nm), you can put NA for those wavelengths.

A script called “3_Import_transform_Reflectance.R” should be used to create an R data frame, saved in a file called “3_QC_Reflectance_data.Rdata”, with six columns:

  • SampleID which has to be consistent with the previous files for each leaf,

  • Spectrometer, the spectrometer model used (SE PSR+ 3500, SVC HR-1024i, SVC XHR-1024i, ASD FieldSpec 3, ASD FieldSpec 4, ASD FieldSpec 4 Hi-Res, …)

  • Probe_type, the type of probe (Integrating sphere, Leaf clip, Imager)

  • Probe_model, the reference of the probe model (SVC LC-RP, SVC LC-RP Pro, ASD Leaf Clip, …)

  • Spectra_trait_pairing, a column that explains how the gas exchange information is paired with the spectral data. If the gas exchange and leaf spectra were measured in the same leaf, choose “Same”. If they were measured in similar leaves, choose “Similar”. Finally, if hyperspectral data was measured at the plant scale (and paired to gas exchange at the leaf scale), choose “Plant scale”.

  • Reflectance, which is a matrix of reflectance values with one column per wavelength, expressed in percent from 0 to 100.

We use a matrix in the column Reflectance, following the “pls” package requirement (Mevik & Wehrens, 2007). More information is given in the “pls” package documentation and manual (https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf).

Bjørn-Helge Mevik and Ron Wehrens. The pls package: Principal component and partial least squares regression in R. Journal of Statistical Software, 18(2):1–24, 2007.
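
A data frame with a matrix column can be built as sketched below. The spectra are made up, the wavelength column names ("Wave_350", ...) and the object name Reflectance are illustrative assumptions, and tempdir() stands in for the dataset folder.

```r
wavelengths <- 350:2500                     # full range at 1 nm resolution
sample_ids  <- c("BNL20202", "BNL10101")    # must match the other files

# Made-up spectra: one row per leaf, one column per wavelength, in percent.
refl <- matrix(45, nrow = length(sample_ids), ncol = length(wavelengths),
               dimnames = list(NULL, paste0("Wave_", wavelengths)))
refl[2, wavelengths < 500] <- NA            # wavelengths not measured for leaf 2

Reflectance <- data.frame(
  SampleID              = sample_ids,
  Spectrometer          = "SVC HR-1024i",
  Probe_type            = "Leaf clip",
  Probe_model           = "SVC LC-RP Pro",
  Spectra_trait_pairing = "Same",
  stringsAsFactors = FALSE
)

# I() keeps the matrix as a single column instead of 2151 separate columns.
Reflectance$Reflectance <- I(refl)

save(Reflectance, file = file.path(tempdir(), "3_QC_Reflectance_data.Rdata"))
```

Wrapping the matrix in I() is the approach shown in the “pls” manual; the resulting data frame has six columns, and Reflectance$Reflectance returns the full matrix.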

8) Adding a leaf sample information description csv file

A code called “4_Import_transform_SampleDetails.R” should be used to create a ‘SampleDetails’ dataframe with the following columns:

Table 6 Required SampleDetails dataframe

| Column | Description |
|---|---|
| SampleID | Identifier of the measured leaf |
| Dataset_name | Name of the dataset |
| Site_name | Name of the site |
| Species | Full species name, for example Cecropia insignis |
| Sun_Shade | Was the leaf at the top of the canopy and usually receiving light (sun), or was it a shaded leaf? Choose between Sun and Shade, or leave empty |
| Phenological_stage | Leaf phenological stage (Young, Mature, Old) |
| Photosynthetic_pathway | Leaf photosynthetic pathway (C3, C4, C2, CAM) |
| Plant_type | Choose between Wild, Agricultural, or Ornamental |
| Soil | Choose between Natural, Pot, Managed, or Hydroponic soil |
| LMA | Leaf dry mass per unit area in g m-2 |
| Narea | Nitrogen content of leaf per unit leaf area in g m-2 |
| Nmass | Nitrogen content of leaf by dry mass in mg g-1 |
| Parea | Phosphorus content of leaf per unit leaf area in g m-2 |
| Pmass | Phosphorus content of leaf by dry mass in mg g-1 |
| LWC | Leaf water content: percent water content of fresh leaf by mass, in % (0-100) |

Example rows:

| SampleID | Dataset_name | Site_name | Species | Sun_Shade | Phenological_stage | Photosynthetic_pathway | Plant_type | Soil | LMA | Narea | Nmass | Parea | Pmass | LWC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BNL20202 | Davidson_et_al_2020 | SLZ | Cecropia insignis | Sun | Mature | C3 | Wild | Natural | 90.24 | 2.35 | 26 | 0.27 | 3 | 62 |
| BNL10101 | Burnett_et_al_2018 | BNL | Cucurbita pepo | Sun | Old | C3 | Agricultural | Pot | 118.65 | 2.59 | 21.8 | 0.42 | 3.5 | 70 |
| W9_932 | Ting_et_al_2013 | AAPF | Oryza sativa | | Mature | C3 | Agricultural | Pot | 101.81 | 3.28 | 32.2 | | | |

Importantly, the Site_name has to be consistent with the Site_name written in the Site.csv file. The SampleID has to be consistent with the identifier used for the gas exchange and for the spectra, since the SampleID is used to merge all the different data.

The first columns have to be filled (SampleID, Dataset_name, Site_name, Species, Sun_Shade, Phenological_stage, Photosynthetic_pathway, Plant_type, Soil), although Sun_Shade may be left empty as noted in Table 6. The columns related to the leaf traits (LMA, Narea, Nmass, Parea, Pmass, LWC) can be left empty if you don’t have the data. Note that the leaf water content, LWC, corresponds to (fresh weight - dry weight) / fresh weight, expressed in percent.

For the species name, please write “Genus species”, for example “Cecropia insignis”. If you know the genus but not the species, write for example “Cecropia sp.”. If you do not know the genus, write “Family sp.”, for example “Urticaceae sp.”. If nothing is known about the species, write “Unknown”.
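
The LWC definition above, and the unit relationship between mass-based and area-based nutrient content, can be checked with a short calculation. The fresh and dry weights are made-up values; LMA and Nmass are taken from the first example row of Table 6. The Narea formula is not stated in this guide and simply follows from the units (mg g-1 times g m-2, divided by 1000 to convert mg to g).

```r
# Leaf water content from fresh and dry weights (made-up values, in g).
fresh_weight <- 0.500
dry_weight   <- 0.190
LWC <- 100 * (fresh_weight - dry_weight) / fresh_weight   # percent (0-100)

# Area-based nitrogen content from mass-based content and LMA.
LMA   <- 90.24                  # g m-2, first example row of Table 6
Nmass <- 26                     # mg g-1, first example row of Table 6
Narea <- LMA * Nmass / 1000     # g m-2; the division converts mg to g
```

With these inputs, LWC is 62 percent and Narea rounds to 2.35 g m-2, which matches the Narea reported in the first example row of Table 6.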

The SampleDetails dataframe is stored in the Rdata file ‘4_SampleDetails.Rdata’.

9) Checking the overall dataset information

The function “f.Check_data()” can be used to validate that the format of the curated dataset is correct and that all the required files are provided. The function checks if required variables are present in the data files and if all the information can be merged together.

This function is called in the last R script, “4_Import_transform_SampleDetails.R”.
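
To illustrate what "can be merged together" means in practice, the sketch below checks SampleID consistency across three toy tables. The helper check_sample_ids() is hypothetical and is not the repository's f.Check_data(), whose arguments and internal checks are not described in this guide.

```r
# Hypothetical helper, NOT the repository's f.Check_data(): it only illustrates
# what "can be merged together" means for the SampleID key.
check_sample_ids <- function(sample_details, fitted, reflectance) {
  ids <- sample_details$SampleID
  list(
    fitted_without_details      = setdiff(fitted$SampleID, ids),
    reflectance_without_details = setdiff(reflectance$SampleID, ids),
    duplicated_details          = ids[duplicated(ids)]
  )
}

# Toy tables; the Vcmax25 values are made up.
sample_details <- data.frame(SampleID  = c("BNL20202", "BNL10101"),
                             Site_name = c("SLZ", "BNL"),
                             stringsAsFactors = FALSE)
fitted         <- data.frame(SampleID = c("BNL20202", "BNL10101"),
                             Vcmax25  = c(52.3, 88.1),
                             stringsAsFactors = FALSE)
reflectance    <- data.frame(SampleID = c("BNL20202", "W9_932"),
                             stringsAsFactors = FALSE)

problems <- check_sample_ids(sample_details, fitted, reflectance)

# Only leaves present in all three tables survive an inner merge.
merged <- merge(merge(sample_details, fitted, by = "SampleID"),
                reflectance, by = "SampleID")
```

In this toy case, the spectrum for "W9_932" has no matching row in the sample details, so it is flagged in problems and dropped from merged.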

References

Burnett, A. C., Davidson, K. J., Serbin, S. P., & Rogers, A. (2019). The “one-point method” for estimating maximum carboxylation capacity of photosynthesis: A cautionary tale. Plant, Cell & Environment, 42, 2472–2481. https://doi.org/10.1111/pce.13574

De Kauwe, M. G., Lin, Y. S., Wright, I. J., Medlyn, B. E., Crous, K. Y., Ellsworth, D. S., … Domingues, T. F. (2016). A test of the “one-point method” for estimating maximum carboxylation capacity from field-measured, light-saturated photosynthesis. New Phytologist, 210(3), 1130–1144. https://doi.org/10.1111/nph.13815

Olson, D. M., Dinerstein, E., Wikramanayake, E. D., Burgess, N. D., Powell, G. V. N., Underwood, E. C., D’Amico, J. A., Itoua, I., Strand, H. E., Morrison, J. C., Loucks, C. J., Allnutt, T. F., Ricketts, T. H., Kura, Y., Lamoreux, J. F., Wettengel, W. W., Hedao, P., & Kassem, K. R. (2001). Terrestrial ecoregions of the world: A new map of life on Earth: A new global map of terrestrial ecoregions provides an innovative tool for conserving biodiversity. BioScience, 51(11), 933–938. https://doi.org/10.1641/0006-3568(2001)051%5B0933:TEOTWA%5D2.0.CO;2