Dataset creation guide

This guide provides a comprehensive overview of all the necessary steps to add a new dataset. The folder “Lamour_et_al_2021” can be used as a template since the code within the folder has been commented and corresponds to each step outlined in this guide.

An overview of the process and database organization is presented here.

1) Creation of a Dataset folder

Each dataset should be placed in a folder named “Names_Year”, for example “Serbin_et_al_2019”. Please also include any article associated with the dataset and the measurement protocol, which should detail the gas exchange measurements, the leaf reflectance measurements, and the equipment used. The protocol should also provide information about the location, growing conditions (e.g., natural environment, greenhouse, agricultural or experimental field, plants in pots), species (e.g., natural or agricultural), and plant status (e.g., stressed or not stressed).

In addition, please include details about your stability criterion to initiate the A-Ci curves. For example, did you wait for the stability of the photosynthesis rate and stomatal conductance before starting the curve? What was the average acclimation time of the leaf within the leaf chamber? Also include this information if you used the “one point method” to estimate Vcmax (De Kauwe et al. 2016, Burnett et al. 2019).

2) Adding a dataset description CSV file

Within each dataset folder, a CSV file called Description.csv has to be included.

This file will be used to list the authors, associated papers, and acknowledgments.

Table 1 Required Description.csv file

Authors	Acknowledgment	Dataset_DOI	Publication_DOI	Email
List of authors of the dataset	Acknowledgement of funding and help to generate the dataset	Digital Object Identifier (DOI) associated with the dataset if the dataset was published	Digital Object Identifier (DOI) associated with a publication	Contact email for the dataset
Julien Lamour, Kenneth J. Davidson, Kim S. Ely, Jeremiah A. Anderson, Alistair Rogers, Jin Wu, Shawn P. Serbin	This work was supported by the Next-Generation Ecosystem Experiments (NGEE Tropics) project that is supported by the Office of Biological and Environmental Research in the Department of Energy, Office of Science, and through the United States Department of Energy contract No. DE-SC0012704 to Brookhaven National Laboratory.	10.15486/ngt/1781003, 10.15486/ngt/1781004	10.1371/journal.pone.0258791	[email protected]
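As an illustration, a Description.csv file with the columns of Table 1 could be written from R as sketched below. The author names, acknowledgment text, email address, and folder name are placeholders invented for this example, and a temporary folder stands in for the real dataset folder.

```r
# Placeholder values; replace them with the details of your own dataset
description <- data.frame(
  Authors = "Jane Doe, John Smith",
  Acknowledgment = "Funded by an example grant.",
  Dataset_DOI = NA_character_,       # NA here because this example dataset is unpublished
  Publication_DOI = NA_character_,
  Email = "jane.doe@example.org",
  stringsAsFactors = FALSE
)

# A temporary folder is used here; in practice this is your dataset folder
dataset_dir <- file.path(tempdir(), "Doe_et_al_2024")
dir.create(dataset_dir, showWarnings = FALSE, recursive = TRUE)
write.csv(description, file.path(dataset_dir, "Description.csv"), row.names = FALSE)

# Read the file back to confirm that the column names match Table 1
check <- read.csv(file.path(dataset_dir, "Description.csv"), stringsAsFactors = FALSE)
required <- c("Authors", "Acknowledgment", "Dataset_DOI", "Publication_DOI", "Email")
stopifnot(identical(names(check), required))
```

Reading the file back is a cheap way to catch a misspelled column name before the dataset is submitted.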

3) Adding a site description CSV file

A file named Site.csv must be included in the dataset folder with the columns listed below. The latitude and longitude coordinates will be used to position the dataset on a world map.

If a dataset covers several sites that are far enough apart to be distinguishable on a world map, or that belong to different biomes, add one row per site to your Site.csv file.

Table 2 Required Site.csv file

Site_name	Latitude	Longitude	Elevation	Biome_number
Short text name for the site where the measurements were taken	Latitude in decimal units	Longitude in decimal units	Elevation in meters above sea level	Biome number as described in the documentation (1 to 19)
BCI	9.1562792	-79.862707	30	1
PNM	8.9943457	-79.543073	100	2
BRF	41.413708	-74.010606	1100	5

For the Biome_number column, please choose a number from the list below. We use the 14 natural biomes of Olson et al. (2001), complemented with agricultural and managed biomes.

Table 3 Olson et al. (2001) list of ecosystems of the world, adapted to add managed biomes

Biome	Biome_number
Tropical and subtropical moist broadleaf forests	1
Tropical and subtropical dry broadleaf forests	2
Tropical and subtropical coniferous forests	3
Tropical and subtropical grasslands, savannas and shrublands	4
Temperate broadleaf and mixed forests	5
Temperate coniferous forests	6
Temperate grasslands and savannas	7
Flooded grasslands and savannas	8
Montane grasslands and shrublands	9
Tundra	10
Mediterranean forests, woodlands and scrub	11
Boreal forests/Taiga	12
Desert and xeric shrublands	13
Mangroves	14
Managed Grasslands	15
Field Crop Ecosystems	16
Tree Crop Ecosystems	17
Greenhouse Ecosystems	18
Other Managed Ecosystems	19
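A Site.csv file can be built and sanity-checked in R before it is saved. The sketch below reuses two rows of Table 2. The helper check_site() is hypothetical (it is not part of the repository functions) and only tests the constraints stated in this section: required columns, coordinates in decimal degrees, and a biome number between 1 and 19.

```r
site <- data.frame(
  Site_name = c("BCI", "BRF"),
  Latitude = c(9.1562792, 41.413708),
  Longitude = c(-79.862707, -74.010606),
  Elevation = c(30, 1100),
  Biome_number = c(1L, 5L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stops with an error if the table breaks a rule
check_site <- function(site) {
  required <- c("Site_name", "Latitude", "Longitude", "Elevation", "Biome_number")
  missing_cols <- setdiff(required, names(site))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  stopifnot(
    all(site$Latitude >= -90 & site$Latitude <= 90),
    all(site$Longitude >= -180 & site$Longitude <= 180),
    all(site$Biome_number %in% 1:19),
    !anyDuplicated(site$Site_name)
  )
  invisible(TRUE)
}

check_site(site)
# A temporary folder is used here; in practice write into the dataset folder
write.csv(site, file.path(tempdir(), "Site.csv"), row.names = FALSE)
```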

4) Adding the gas exchange A-C_i data to the dataset folder

The raw A-C_i data must be included in each dataset folder, in any parsable format. The final A-C_i data used by the fitting procedure must meet certain requirements (see the table below for the required column names), but there are no hard requirements on how this final data is obtained. If you made several A-C_i curves on the same leaf, we recommend keeping only the best one.

The fitting procedure uses the SampleID_num column to identify individual A-C_i curves. We use SampleID_num in addition to SampleID: SampleID should correspond to the original identifier of the leaf in the raw dataset, which is often a complex string, whereas SampleID_num should be an integer. The integer identifier facilitates the QA/QC of the curves and the plotting of the figures.

The A-C_i data should be cleaned of spurious measurements, and points that would bias the estimation of V_cmax or J_max should not be included. If several measurements were taken at a given C_i, please keep only one so that each C_i has the same number of measurements. We are usually quite conservative in the quality analysis and keep only the curves for which V_cmax can be estimated reliably. If we have doubts about the quality of the data, we tend to remove them from the final curated data.

The curated A-C_i data should be present in the dataset folder as an Rdata file called ‘2_Fitted_ACi_data.Rdata’, which contains the A-C_i data in a data.frame with at least the columns listed in the table below.

Note that we include in the dataset folder the raw data as well as the R code used to read, import, and transform the raw data. All those preliminary steps are done in two R scripts called ‘0_Import_transform_ACi_data.R’ and ‘1_QaQc_curated_ACi.R’.

Table 4 Required column names and units for A-Ci data

SampleID	SampleID_num	Record	A	Ci	CO2s	Patm	Qin	RHs	Tleaf	gsw
Identifier of the measured leaf	Integer Identifier of the measured leaf	Observation record number	Net CO2 exchange per leaf area	Intercellular CO2 concentration in air	CO2 concentration in wet air inside chamber	Atmospheric pressure	In chamber photosynthetic flux density incident on the leaf in quanta per area	Relative humidity of air inside the chamber	Leaf surface temperature	Stomatal conductance to water vapor per leaf area
			micromol m-2 s-1	micromol mol-1	micromol mol-1	kPa	micromol m-2 s-1	percent (0 - 100)	degrees Celsius	mmol m-2 s-1
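The relation between SampleID and SampleID_num, and the check of the required columns, can be sketched as follows. The data values and leaf identifiers are invented for this example, and numbering the leaves in order of first appearance is only one possible convention.

```r
# Toy A-Ci data with two curves of three points each (invented values)
aci <- data.frame(
  SampleID = rep(c("BNL_leaf_A", "BNL_leaf_B"), each = 3),
  Record = rep(1:3, times = 2),
  A = c(2.1, 10.5, 18.2, 1.8, 9.7, 16.4),
  Ci = c(60, 250, 600, 55, 240, 580),
  CO2s = c(100, 400, 1000, 100, 400, 1000),
  Patm = 101, Qin = 1800, RHs = 60, Tleaf = 30,
  gsw = c(250, 240, 230, 210, 205, 200),
  stringsAsFactors = FALSE
)

# One integer per unique SampleID, numbered in order of first appearance
aci$SampleID_num <- match(aci$SampleID, unique(aci$SampleID))

# Check that every required column of the table above is present
required <- c("SampleID", "SampleID_num", "Record", "A", "Ci", "CO2s",
              "Patm", "Qin", "RHs", "Tleaf", "gsw")
missing_cols <- setdiff(required, names(aci))
stopifnot(length(missing_cols) == 0, is.integer(aci$SampleID_num))

# Reorder the columns to match the table
aci <- aci[, required]
```

Keeping SampleID untouched and deriving SampleID_num from it means the original identifiers can always be traced back to the raw files.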

5) Fitting the A-C_i data to estimate the photosynthetic traits

The photosynthetic parameters are estimated by the ‘2_Fit_ACi.R’ script included in each dataset folder. This script calls the function f.fit_Aci() to estimate V_cmax25, J_max25, TPU25 and R_day25 for each A-C_i curve. It produces several PDF files:

2_ACi_fitting_Ac.pdf
2_ACi_fitting_Ac_Aj.pdf
2_ACi_fitting_Ac_Aj_Ap.pdf

and

2_ACi_fitting_best_model.pdf

The first three PDFs show the fit of each A-C_i curve when the model includes only the rate of maximum carboxylation (A_c), then A_c and the rate of electron transport (A_j), and finally A_c, A_j and the rate of triose phosphate utilization (A_p).

The best model is the one with the lowest AIC among the models that include A_c, A_c + A_j, or A_c + A_j + A_p. If the model with the lowest AIC includes A_c only, then V_cmax25 and R_day25 are the only parameters estimated. If it is A_c + A_j, then J_max25 is also estimated. TPU25 is estimated only if A_c, A_j and A_p are all limiting. In all cases, the transitions between the A_c, A_j, and A_p rates are determined automatically by the fitting procedure to avoid manual and somewhat subjective choices.
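The selection rule can be sketched in a few lines of R. The AIC values below are invented, and the model labels are hypothetical: they may differ from the labels that f.fit_Aci() writes in its output.

```r
# Hypothetical AIC values for the three candidate models fitted to two curves
fits <- data.frame(
  SampleID_num = rep(1:2, each = 3),
  Model = rep(c("Ac", "Ac_Aj", "Ac_Aj_Ap"), times = 2),
  AIC = c(52.3, 41.8, 43.1, 30.2, 33.5, 35.0),
  stringsAsFactors = FALSE
)

# For each curve, keep the candidate model with the lowest AIC
best <- do.call(rbind, lapply(split(fits, fits$SampleID_num), function(d) {
  d[which.min(d$AIC), ]
}))
rownames(best) <- NULL
print(best)
```

With these values, curve 1 keeps the A_c + A_j model (so J_max25 would be reported) and curve 2 keeps the A_c model (so only V_cmax25 and R_day25 would be reported).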

This code produces a data frame called Bilan, with the following columns:

Table 5 Table produced by the A-Ci fitting

SampleID	SampleID_num	Vcmax25	Jmax25	TPU25	Rday25	StdError_Vcmax25	StdError_Jmax25	StdError_TPU25	StdError_Rday25	Tleaf	sigma	AIC	Model	Fitting_method
Identifier of the measured leaf	Integer identifier of the measured leaf	Maximum rate of carboxylation at the reference temperature of 25 degrees Celsius, calculated assuming infinite mesophyll conductance (i.e., apparent Vcmax)	Maximum rate of electron transport per leaf area at the reference temperature of 25 degrees Celsius, calculated assuming infinite mesophyll conductance and saturating light	Triose phosphate utilization rate per leaf area at the reference temperature of 25 degrees Celsius	CO2 release from the leaf in the light at the reference temperature of 25 degrees Celsius	Standard error of Vcmax25 estimation	Standard error of Jmax25 estimation	Standard error of TPU25 estimation	Standard error of Rday25 estimation	Leaf surface temperature	Standard error of the residuals of the fitted A-Ci curve	Akaike information criterion	Model used for the fitting of the A-Ci curves	Method used to estimate Vcmax. Can be ‘A-Ci curve’ or ‘One point’
		micromol m-2 s-1	micromol m-2 s-1	micromol m-2 s-1	micromol m-2 s-1	micromol m-2 s-1	micromol m-2 s-1	micromol m-2 s-1	micromol m-2 s-1	degrees Celsius	micromol m-2 s-1

It is also possible to estimate V_cmax25 by the one-point method (De Kauwe et al. 2016; Burnett et al. 2019). In that case, the measurements should be done at saturating irradiance in ambient CO₂ conditions. You can use the function f.fit_One_Point() to estimate V_cmax25. It produces exactly the same data frame as the function f.fit_Aci().
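For readers unfamiliar with the one-point method, the sketch below illustrates the calculation described by De Kauwe et al. (2016) for a leaf measured at 25 °C. It is not the code of f.fit_One_Point(), whose internals are not described in this guide. The function name one_point_vcmax is hypothetical, and the kinetic constants at 25 °C (values commonly attributed to Bernacchi et al. 2001), the O2 concentration, and the assumption that R_day equals 1.5% of V_cmax are assumptions of this example.

```r
# Assumed kinetic constants at 25 degrees Celsius
Kc <- 404.9          # Michaelis constant for CO2, micromol mol-1
Ko <- 278.4          # Michaelis constant for O2, mmol mol-1
O2 <- 210            # O2 concentration, mmol mol-1
gamma_star <- 42.75  # CO2 compensation point without R_day, micromol mol-1
Km <- Kc * (1 + O2 / Ko)

# Hypothetical helper: one-point estimate of Vcmax from light-saturated A and Ci
one_point_vcmax <- function(Asat, Ci, gamma_star, Km, rday_frac = 0.015) {
  Asat / ((Ci - gamma_star) / (Ci + Km) - rday_frac)
}

# Invented measurement: Asat in micromol m-2 s-1, Ci in micromol mol-1
vcmax <- one_point_vcmax(Asat = 15, Ci = 280, gamma_star = gamma_star, Km = Km)
print(vcmax)
```

For a leaf measured at another temperature, Km and gamma_star would first have to be adjusted to leaf temperature and the result scaled to 25 °C, which this sketch does not attempt.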

Importantly, the same temperature correction is used for all datasets to estimate the parameters at 25 °C. Since Tleaf is also given in the output table, it is possible to re-estimate the parameters at leaf temperature and to try other temperature dependence parameterizations if needed. The A-C_i fitting can also be re-run with different parameterizations for all the datasets using the output from step 1.

6) Adding dark-adapted leaf respiration data (optional)

If you measured the dark respiration of leaves, you can also add it to the dataset. All you need is to include a file with the following columns:

SampleID (the leaf identifier that is used everywhere to link different data)
Rdark (the dark respiration value, in micromol m-2 s-1, which corresponds to the CO2 release from the leaf in the dark, at measurement temperature, reported as a positive value)
Tleaf_Rdark (the leaf temperature, in degrees Celsius)

As for A-Ci curves, don't forget to also include the raw gas exchange instrument output files.
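A minimal sketch of such a file is given below. It assumes that the raw instrument values are net CO2 exchange rates measured in the dark, which are negative because the leaf releases CO2, so the sign is flipped to report Rdark as a positive value. The numbers, leaf identifiers, and the file name Rdark.csv are invented for this example; the guide does not prescribe a file name or format.

```r
# Assumption: raw values are net CO2 exchange in the dark (negative values)
A_dark <- c(-0.85, -1.10)  # micromol m-2 s-1, invented values

Rdark_data <- data.frame(
  SampleID = c("BNL_leaf_A", "BNL_leaf_B"),
  Rdark = -A_dark,              # reported as a positive CO2 release
  Tleaf_Rdark = c(27.4, 28.1),  # degrees Celsius
  stringsAsFactors = FALSE
)

# Rdark must be positive by convention
stopifnot(all(Rdark_data$Rdark > 0))

# Hypothetical file name, written to a temporary folder for illustration
write.csv(Rdark_data, file.path(tempdir(), "Rdark.csv"), row.names = FALSE)
```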

7) Adding the leaf spectra data

The spectral information is ideally a full-range reflectance measurement (350 nm to 2500 nm) with a 1 nm resolution.

If you don’t have values for all the wavelengths (for example from 350 nm to 500 nm or from 2400 nm to 2500 nm), you can put NA in those wavelengths.

A script named “3_Import_transform_Reflectance.R” should be used to create an R data file called “3_QC_Reflectance_data.Rdata” containing a data frame with six columns:

SampleID which has to be consistent with the previous files for each leaf,
Spectrometer, the spectrometer model used (SE PSR+ 3500, SVC HR-1024i, SVC XHR-1024i, ASD FieldSpec 3, ASD FieldSpec 4, ASD FieldSpec 4 Hi-Res, …)
Probe_type, the type of probe (Integrating sphere, Leaf clip, Imager)
Probe_model, the reference of the probe model (SVC LC-RP, SVC LC-RP Pro, ASD Leaf Clip, …)
Spectra_trait_pairing, a column that explains how the gas exchange information is paired with the spectral data. If the gas exchange and leaf spectra were measured in the same leaf, choose “Same”. If they were measured in similar leaves, choose “Similar”. Finally, if hyperspectral data was measured at the plant scale (and paired to gas exchange at the leaf scale), choose “Plant scale”.
Reflectance, a matrix holding the reflectance in percent (0 to 100), with one row per leaf and the wavelengths in columns.

We use a matrix in the column Reflectance, following the “pls” package requirement (Mevik & Wehrens, 2007). More information is given in the “pls” package documentation and manual (https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf)
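Storing a matrix inside a single data frame column is less common than storing vectors, so a minimal sketch may help. The spectra below are random numbers, the data frame name Reflectance and the column names of the matrix (“Wave_350” and so on) are hypothetical choices for this example, and the file is written to a temporary folder.

```r
wavelengths <- 350:2500
n_leaves <- 2

# Fake spectra in percent (0 to 100); real values come from the spectrometer files
set.seed(1)
refl <- matrix(runif(n_leaves * length(wavelengths), min = 5, max = 50),
               nrow = n_leaves,
               dimnames = list(NULL, paste0("Wave_", wavelengths)))

# Wavelengths without measurements are set to NA (here 350 nm to 500 nm)
refl[, paste0("Wave_", 350:500)] <- NA

Reflectance <- data.frame(
  SampleID = c("BNL_leaf_A", "BNL_leaf_B"),
  Spectrometer = "SVC HR-1024i",
  Probe_type = "Leaf clip",
  Probe_model = "SVC LC-RP",
  Spectra_trait_pairing = "Same",
  stringsAsFactors = FALSE
)

# I() keeps the matrix as one column instead of spreading it over 2151 columns
Reflectance$Reflectance <- I(refl)

print(dim(Reflectance))              # number of rows and columns of the data frame
print(dim(Reflectance$Reflectance))  # number of leaves and wavelengths

save(Reflectance, file = file.path(tempdir(), "3_QC_Reflectance_data.Rdata"))
```

The data frame has six columns even though the Reflectance column holds one value per wavelength, which is the layout expected by the formula interface of the “pls” package.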

Bjørn-Helge Mevik and Ron Wehrens. The pls package: Principal component and partial least squares regression in R. Journal of Statistical Software, 18(2):1–24, 2007.

8) Adding a leaf sample information description csv file

A code called “4_Import_transform_SampleDetails.R” should be used to create a ‘SampleDetails’ dataframe with the following columns:

Table 6 Required SampleDetails dataframe

SampleID	Dataset_name	Site_name	Species	Sun_Shade	Phenological_stage	Photosynthetic_pathway	Plant_type	Soil	LMA	Narea	Nmass	Parea	Pmass	LWC
Identifier of the measured leaf	Name of the dataset	Name of the site	Full species name, for example Cecropia insignis	Was the leaf at the top of the canopy and usually receiving light (sun) or a shaded leaf? Choose between Sun, Shade or leave empty	Leaf phenological stage (Young, Mature, Old)	Leaf photosynthetic pathway (C3, C4, C2, CAM)	Choose between Wild, Agricultural, or Ornamental	Please choose Natural, Pot, Managed, or Hydroponic soil	Leaf dry mass per unit area in g m-2	Nitrogen content of leaf per unit leaf area in g m-2	Nitrogen content of leaf by dry mass in mg g-1	Phosphorus content of leaf per unit leaf area in g m-2	Phosphorus content of leaf by dry mass in mg g-1	Leaf water content. Percent water content of fresh leaf by mass in % (0-100)
BNL20202	Davidson_et_al_2020	SLZ	Cecropia insignis	Sun	Mature	C3	Wild	Natural	90.24	2.35	26	0.27	3	62
BNL10101	Burnett_et_al_2018	BNL	Cucurbit pepo	Sun	Old	C3	Agricultural	Pot	118.65	2.59	21.8	0.42	3.5	70
W9_932	Ting_et_al_2013	AAPF	Oryza sativa		Mature	C3	Agricultural	Pot	101.81	3.28	32.2

Importantly, the Site_name has to be consistent with the Site_name written in the Site.csv file. The SampleID has to be consistent with the identifier used for the gas exchange and for the spectra, as the SampleID will be used to merge all the different data.

The first columns (SampleID, Dataset_name, Site_name, Species, Sun_Shade, Phenological_stage, Photosynthetic_pathway, Plant_type, Soil) have to be filled. The columns related to the leaf traits (LMA, Narea, Nmass, Parea, Pmass, LWC) can be left empty if you don’t have the data. Note that the leaf water content, LWC, corresponds to (fresh weight - dry weight) / fresh weight, expressed in percent.

For the species name, please write “Genus species”, for example “Cecropia insignis”. If you know the genus but not the species, write for example “Cecropia sp.”. If you do not know the genus, write “Family sp.”, for example “Urticaceae sp.”. If you know neither, write “Unknown”.

The SampleDetails dataframe is stored in the Rdata file ‘4_SampleDetails.Rdata’.
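The sketch below builds a small SampleDetails data frame, computing LMA and LWC from weights and leaf area. All measurements, identifiers, and the dataset name are invented for this example, and the file is saved in a temporary folder rather than in a real dataset folder.

```r
# Invented leaf measurements
fresh_weight <- c(0.520, 0.610)  # g
dry_weight <- c(0.198, 0.183)    # g
leaf_area <- c(0.0022, 0.0016)   # m2

SampleDetails <- data.frame(
  SampleID = c("BNL_leaf_A", "BNL_leaf_B"),
  Dataset_name = "Doe_et_al_2024",
  Site_name = "BCI",
  Species = c("Cecropia insignis", "Cecropia sp."),
  Sun_Shade = c("Sun", "Shade"),
  Phenological_stage = "Mature",
  Photosynthetic_pathway = "C3",
  Plant_type = "Wild",
  Soil = "Natural",
  LMA = dry_weight / leaf_area,  # g m-2
  Narea = NA_real_,              # trait columns may stay empty
  Nmass = NA_real_,
  Parea = NA_real_,
  Pmass = NA_real_,
  LWC = 100 * (fresh_weight - dry_weight) / fresh_weight,  # percent
  stringsAsFactors = FALSE
)

print(SampleDetails[, c("SampleID", "LMA", "LWC")])
save(SampleDetails, file = file.path(tempdir(), "4_SampleDetails.Rdata"))
```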

9) Checking the overall dataset information

The function “f.Check_data()” can be used to validate that the format of the curated dataset is correct and that all the required files are provided. The function checks if required variables are present in the data files and if all the information can be merged together.

This function is called in the last R file “4_Import_transform_SampleDetails.R”
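The kind of consistency check described above can be approximated by hand, for example while preparing the files. The sketch below is not f.Check_data(), whose arguments are not documented in this guide. The helper check_consistency() and all table contents are hypothetical, and only two rules stated in this guide are tested: every Site_name must exist in Site.csv, and every SampleID must exist in SampleDetails.

```r
# Toy tables standing in for the curated files of one dataset (invented values)
site <- data.frame(Site_name = "BCI", stringsAsFactors = FALSE)
samples <- data.frame(SampleID = c("leaf_A", "leaf_B"),
                      Site_name = c("BCI", "BCI"), stringsAsFactors = FALSE)
fitted <- data.frame(SampleID = c("leaf_A", "leaf_B"), Vcmax25 = c(45.2, 61.0),
                     stringsAsFactors = FALSE)
spectra <- data.frame(SampleID = c("leaf_A", "leaf_C"), stringsAsFactors = FALSE)

# Hypothetical helper: returns a character vector of problems (empty if none)
check_consistency <- function(site, samples, fitted, spectra) {
  problems <- character(0)
  unknown_sites <- setdiff(samples$Site_name, site$Site_name)
  if (length(unknown_sites) > 0) {
    problems <- c(problems, paste("Site_name not in Site.csv:",
                                  paste(unknown_sites, collapse = ", ")))
  }
  for (tab in list(list("fitted traits", fitted), list("spectra", spectra))) {
    unmatched <- setdiff(tab[[2]]$SampleID, samples$SampleID)
    if (length(unmatched) > 0) {
      problems <- c(problems, paste0("SampleID in ", tab[[1]],
                                     " but not in SampleDetails: ",
                                     paste(unmatched, collapse = ", ")))
    }
  }
  problems
}

problems <- check_consistency(site, samples, fitted, spectra)
print(problems)

# Leaves that have fitted traits, spectra, and sample details
merged <- merge(merge(fitted, spectra, by = "SampleID"), samples, by = "SampleID")
print(nrow(merged))
```

Here the spectra table contains an identifier, leaf_C, that is absent from the sample details, so one problem is reported and only one leaf survives the merge.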

References

Burnett, A. C., Davidson, K. J., Serbin, S. P., Rogers, A. (2019). The “one-point method” for estimating maximum carboxylation capacity of photosynthesis: A cautionary tale. Plant, Cell & Environment, 42, 2472–2481. https://doi.org/10.1111/pce.13574

De Kauwe, M. G., Lin, Y. S., Wright, I. J., Medlyn, B. E., Crous, K. Y., Ellsworth, D. S., … Domingues, T. F. (2016). A test of the “one-point method” for estimating maximum carboxylation capacity from field-measured, light-saturated photosynthesis. New Phytologist, 210(3), 1130–1144. https://doi.org/10.1111/nph.13815

David M. Olson, Eric Dinerstein, Eric D. Wikramanayake, Neil D. Burgess, George V. N. Powell, Emma C. Underwood, Jennifer A. D’amico, Illanga Itoua, Holly E. Strand, John C. Morrison, Colby J. Loucks, Thomas F. Allnutt, Taylor H. Ricketts, Yumiko Kura, John F. Lamoreux, Wesley W. Wettengel, Prashant Hedao, Kenneth R. Kassem, Terrestrial Ecoregions of the World: A New Map of Life on Earth: A new global map of terrestrial ecoregions provides an innovative tool for conserving biodiversity, BioScience, Volume 51, Issue 11, November 2001, Pages 933–938, https://doi.org/10.1641/0006-3568(2001)051%5B0933:TEOTWA%5D2.0.CO;2
