Discussion: long-term data storage #2222

brockho opened this issue Oct 24, 2023 · 0 comments
During our first COCO sprint, we discussed the idea of a general long-term storage format for benchmarking data (an idea put forward by the IOH Profiler people together with @olafmersmann). A first suggestion from our side is the following.

For each experiment (with a concrete experiment id or timestamp or ...), we store its metadata (things that stay constant over the entire experiment) in a metadata table like this:

| exp. id | timestamp | key | value |
|---------|-----------|-----|-------|
| 1 | | "author" | "Mr. Coco" |
| 1 | | "doi" | https://doi.org/876.123 |
| ... | ... | ... | ... |
| 2 | | "author" | "H. Simpson" |
| 2 | | "cma.opts.sigma0" | 1.5 |
| 2 | | "suite" | "bbob-biobj-ext" |
| 2 | | "experimental setup" | "bbob-biobj" |
| 3 | | "link to code" | |
| 3 | | "url" | |
| 3 | | "publication" | "GECCO 2023 Proceedings" |
| 3 | | "algo implementation" | "pycma, version 3.1" |
| 3 | | "software" | COCO |
| 3 | | "suite" | bbob-constrained |
| 3 | | "experimental setup" | bbob-constrained |
| 3 | | "algname" | "COBYLA" |

To be discussed: which entries are mandatory and which are optional.
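For illustration, such a long-format key-value metadata table could be read and written as plain CSV. The column names (`exp_id`, `timestamp`, `key`, `value`) and the helper functions below are hypothetical choices, not part of the proposal; the long format has the advantage that adding a new metadata key never requires a schema change:

```python
import csv
import io

# One row per (experiment, key) pair, mirroring the table above.
metadata_rows = [
    {"exp_id": 1, "timestamp": "", "key": "author", "value": "Mr. Coco"},
    {"exp_id": 1, "timestamp": "", "key": "doi", "value": "https://doi.org/876.123"},
    {"exp_id": 3, "timestamp": "", "key": "algname", "value": "COBYLA"},
]

def write_metadata(rows, fileobj):
    """Write the metadata table as CSV (column names are an assumption)."""
    writer = csv.DictWriter(fileobj, fieldnames=["exp_id", "timestamp", "key", "value"])
    writer.writeheader()
    writer.writerows(rows)

def metadata_for(rows, exp_id):
    """Collect all key/value pairs belonging to one experiment."""
    return {r["key"]: r["value"] for r in rows if r["exp_id"] == exp_id}

buf = io.StringIO()
write_metadata(metadata_rows, buf)
print(metadata_for(metadata_rows, 1)["author"])  # -> Mr. Coco
```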

For each experiment, we can then store the single evaluations (or a subset thereof) in one big table like this:

| timestamp | function id | dim | instance | #funevals | indicator value | #g-calls | #non-feasible points evaluated so far | target reached | f-values | g-values | x-values | experimental data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2 | 12 | 123 | 1.20E+03 | nan | nan | 1.30E+03 | nan | nan | [1,23] | {"param1": 12, "param2": 44} |
| 1 | 1 | 2 | 12 | 124 | 1.12E+03 | nan | nan | 1.25E+03 | nan | nan | [1.1,19.7] | {"param1": 10, "param2": 44.9} |
| 4 | 1 | 2 | 12 | 123 | 1.42E+03 | nan | nan | 2.00E+03 | nan | nan | [0.7,23.62] | {"name": "exp2", "doi": "https://doi.org/10.287.22"} |
| 2 | 1 | 3 | 12 | 123 | 1.23E+03 | nan | nan | 4.00E+03 | [1.2, 3.234] | nan | [232,2376,21] | |
| 3 | 1 | 5 | 11 | 234 | 1.29E+02 | 54 | 288 | 1.30E+02 | [1.1, 0,85] | [7654, 8987, 3123] | [43,123,22,23,1] | |
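As a sketch of one row's semantics, the record below mirrors the columns above as a Python dataclass. The field names and types are illustrative assumptions only, with `None` standing in for nan in the optional columns:

```python
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvaluationRecord:
    """One row of the evaluation table; names are illustrative, not a spec."""
    timestamp: int
    function_id: int
    dim: int
    instance: int
    funevals: int                 # "effort spent", must never decrease within a run
    indicator_value: float        # mandatory, like all fields above
    g_calls: Optional[int] = None               # columns from here on are optional
    non_feasible_so_far: Optional[int] = None
    target_reached: Optional[float] = None
    f_values: Optional[list] = None
    g_values: Optional[list] = None
    x_values: Optional[list] = None
    experimental_data: dict = field(default_factory=dict)  # free-form JSON blob

# First row of the example table above.
rec = EvaluationRecord(
    timestamp=1, function_id=1, dim=2, instance=12,
    funevals=123, indicator_value=1.20e3,
    target_reached=1.30e3, x_values=[1, 23],
    experimental_data={"param1": 12, "param2": 44},
)
print(json.dumps(rec.experimental_data))
```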

Our ideas behind all this are:

  • All columns up to (and including) indicator value appear to be mandatory (at least for most experiments, and certainly for all COCO data produced so far).
  • All other columns are optional and may differ between experiment tables (which are then no longer compatible with each other).
  • The #funevals column is really an "effort spent" column and must be monotonically increasing; for constrained problems, for example, it holds the combined number of f- and g-evaluations. In some cases it might even contain vectors, such as the number of calls to each individual objective function when the objectives can be evaluated independently (and, for example, take different amounts of time to evaluate).
  • The indicator value column contains the quantity to be optimized: the best f-value so far in the unconstrained, single-objective case, a quality indicator in the multiobjective case, the Lagrangian in the constrained case, etc.
  • The target reached column seems a nice-to-have in the COCO context, even if we do not write these data ourselves right now (it should be easy to reconstruct, because the targets are fixed in our case).
  • Entries in the same experiment table should, in principle, be comparable with each other.
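For illustration, the monotonicity requirement on #funevals could be validated per run with a small check like this. The grouping key (function id, dim, instance) is an assumed notion of "one run" and not part of the proposal:

```python
from itertools import groupby

def funevals_monotone(rows):
    """Return True if #funevals never decreases within any
    (function id, dim, instance) group; the grouping key is a
    hypothetical choice of what constitutes one run."""
    key = lambda r: (r["function_id"], r["dim"], r["instance"])
    # sorted() is stable, so the original row order survives within each group
    for _, group in groupby(sorted(rows, key=key), key=key):
        evals = [r["funevals"] for r in group]
        if any(b < a for a, b in zip(evals, evals[1:])):
            return False
    return True

rows = [
    {"function_id": 1, "dim": 2, "instance": 12, "funevals": 123},
    {"function_id": 1, "dim": 2, "instance": 12, "funevals": 124},
]
print(funevals_monotone(rows))  # -> True
```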

Note that this is a first draft that will hopefully be extended here.
