Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ThermoML Archive dataset #118

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ml-evs
Copy link
Contributor

@ml-evs ml-evs commented Mar 16, 2023

This PR adds the ThermoML dataset as a flat csv file. I'm not sure if this is the most appropriate way of doing it yet, and its not entirely clear to me how to define the targets for this dataset!

The compressed archive is about 200 MB, around of XML and JSON files 4 GB when extracted. The data takes about an hour to parse on my machine (for ~10k files into around ~2.6m rows rows)

Here, we use @marocsfelt's thermopyl fork to extract the xml files into a dataframe. I imagine this will need an amount of additional processing to be useful in this project, but its not clear to me what that would involve yet (and probably needs to be done at the transform step per file, rather than cleaning of the final csv).

Other matters arising:

  • This dataset has a requirements file associated with it for the transform step. I don't think it is worth adding these to the project level deps. In the end, there was no issue with old deps (thanks to using the fork).

Example rows for current state:

,filename,components,component_1_inchi,"Pressure, kPa","Temperature, K","Speed of sound, m/s",phase,"Speed of sound, m/s_std",component_2_inchi,"Mass density, kg/m3","Mass density, kg/m3_std",component_3_inchi,component_4_inchi,Mole fraction Variable metadata,Mole fraction
0,10.1007/s10765-010-0874-x.xml,hexan-1-ol,"InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",101.0,293.15,1320.11,Liquid,,,,,,,,
1,10.1007/s10765-010-0874-x.xml,hexan-1-ol,"InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",101.0,298.15,1303.19,Liquid,,,,,,,,
2,10.1007/s10765-010-0874-x.xml,hexan-1-ol,"InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",101.0,293.15,,Liquid,,"InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",818.856,,,,,
3,10.1007/s10765-010-0874-x.xml,hexan-1-ol,"InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",101.0,298.15,,Liquid,,"InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",815.281,,,,,
4,10.1007/s10765-010-0874-x.xml,"1,2-dichloroethane","InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",101.0,293.15,1212.92,Liquid,,"InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",,,InChI=1S/C2H4Cl2/c3-1-2-4/h1-2H2,,,
5,10.1007/s10765-010-0874-x.xml,"1,2-dichloroethane","InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",101.0,298.15,1193.6,Liquid,,"InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",,,InChI=1S/C2H4Cl2/c3-1-2-4/h1-2H2,,,
6,10.1007/s10765-010-0874-x.xml,"1,2-dichloroethane","InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",101.0,293.15,,Liquid,,"InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",1252.766,,InChI=1S/C2H4Cl2/c3-1-2-4/h1-2H2,InChI=1S/C2H4Cl2/c3-1-2-4/h1-2H2,,
7,10.1007/s10765-010-0874-x.xml,"1,2-dichloroethane","InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",101.0,298.15,,Liquid,,"InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",1245.555,,InChI=1S/C2H4Cl2/c3-1-2-4/h1-2H2,InChI=1S/C2H4Cl2/c3-1-2-4/h1-2H2,,
8,10.1007/s10765-010-0874-x.xml,"1,2-dibromoethane","InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",101.0,293.15,1007.04,Liquid,,"InChI=1S/C6H14O/c1-2-3-4-5-6-7/h7H,2-6H2,1H3",,,InChI=1S/C2H4Cl2/c3-1-2-4/h1-2H2,InChI=1S/C2H4Cl2/c3-1-2-4/h1-2H2,,

Closes #117

@ml-evs ml-evs changed the title Initial ThermoML transform script Add ThermoML Archive dataset Mar 16, 2023
@ml-evs ml-evs marked this pull request as ready for review March 16, 2023 19:15
Copy link

@ekalosak ekalosak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without speaking to the content, I worry about the maintainability of this module - please see inline comments for specifics mostly around repeating code snippets that may be implemented by a stdlib.

data/thermoml_archive/transform.py Show resolved Hide resolved
data/thermoml_archive/transform.py Show resolved Hide resolved
data/thermoml_archive/transform.py Show resolved Hide resolved
"description": "ThermoML is an XML-based IUPAC standard for the storage and exchange of experimental thermophysical and thermochemical property data. The ThermoML archive is a subset of Thermodynamics Research Center (TRC) data holdings corresponding to cooperation between NIST TRC and five journals.", # noqa
"identifiers": [
{
"id": "",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it ok to not provide an id value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is unfinished (as mentioned in PR desc), this is the main bit that needs feedback/further discussion.

@ml-evs
Copy link
Contributor Author

ml-evs commented Mar 16, 2023

Without speaking to the content, I worry about the maintainability of this module - please see inline comments for specifics mostly around repeating code snippets that may be implemented by a stdlib.

Point taken, but I think you are misunderstanding the point of these transform scripts which are meant to be standalone and aren't really targeting reusability/maintainability.

If the stdlib file digest can do chunked hashing then fine by me, it's not clear from the docs whether the whole file needs to be in memory (the final CSV is quite big so probably not desirable).

@ml-evs
Copy link
Contributor Author

ml-evs commented Mar 16, 2023

In fact, file digest was only added in 3.11 so we can't use it here (this project is 3.8 only). Happy to refactor into a two line function though.

@kjappelbaum
Copy link
Collaborator

Point taken, but I think you are misunderstanding the point of these transform scripts which are meant to be standalone and aren't really targeting reusability/maintainability.

I agree, If there is some code we can share across the scripts, great. But we don't envision those scripts to be used other than for collecting the initial dataset.

@kjappelbaum
Copy link
Collaborator

I think this dataset is one example for which we might want to use our more "advanced" yaml templates, that specify prompts such as "what is the {speed of sound} for mixture of {component 1} and {component 2}".

The documentation on how to specify this is currently not so great. I can help after the weekend with an example if you would like me to do so

@ml-evs
Copy link
Contributor Author

ml-evs commented Mar 17, 2023

I think this dataset is one example for which we might want to use our more "advanced" yaml templates, that specify prompts such as "what is the {speed of sound} for mixture of {component 1} and {component 2}".

The documentation on how to specify this is currently not so great. I can help after the weekend with an example if you would like me to do so

Sounds good to me -- I'm not in any rush with this, so it could wait until the dataset hackathon perhaps (unless you are already heavily committed on other stuff for that!) I should be able to attend for a couple of hours in the morning at least.

I think there is a lot of data being left on the table with this current parsing approach -- happy to investigate further about customising the thermopyl parser usage and getting access to more of the (https://trc.nist.gov/ThermoML/)[listed properties] (see below). It may even be worth splitting into multiple datasets.

image

@phalem phalem mentioned this pull request Mar 17, 2023
@kjappelbaum
Copy link
Collaborator

@ml-evs, we have revised the prompt template syntax in the contribution guide - do you want to give it a shot?

Otherwise, if you're overcommitted, I can also look at it. Just let me know

@ml-evs
Copy link
Contributor Author

ml-evs commented May 5, 2023

I am definitely over-committed right now but don't just want to shift the burden to you... if you find time to look at it yourself then just write here, otherwise I will add it around the bottom of my to-do list again!

@kjappelbaum kjappelbaum added the help wanted Extra attention is needed label Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding the ThermoML dataset
3 participants