Support for deserialization of file types into Array, DataFrame #482

Merged 33 commits on May 26, 2023

sdrobert
Contributor

This PR follows #382, #470 as well as discussion with @jbednar and @jlstevens.

param currently models deserialization routines after serialization routines. This is fine if you're only ever deserializing something that has previously been automatically serialized. However, my use case (and I believe that of plenty of others) focuses on reading in hand-written config files for e.g. specifying neural network parameters. In such cases, writing a Series, Array, or DataFrame could involve dumping a giant list of numbers by hand into the configuration file.

A better solution would be to specify a path to a data file in the configuration, which, when deserialized, is quietly parsed into the value. The reference to the file can be thrown away and just the value stored in the parameter. The type of file, and thus the routine for parsing it, is inferred from the file extension. In the deserialize() method of the relevant Parameter classes, before interpreting the value as arguments to a constructor (of either an ndarray or a DataFrame), we first check whether it matches a file on disk, then whether it has an extension we know, and finally, if NumPy or pandas has a routine to read that extension, we call it.
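
As a rough illustration of that lookup order (a sketch, not the code in this PR), the check could look like the following. The names `ARRAY_READERS` and `deserialize_array` are hypothetical, and returning non-path values unchanged is an assumption made to keep the example short. NumPy is imported lazily, so it is only needed once a matching file is actually read.

```python
import os


def _load_npy(path):
    import numpy  # imported lazily: only needed when a .npy file is read
    return numpy.load(path)


def _load_txt(path):
    import numpy
    return numpy.loadtxt(path)  # numpy.loadtxt decompresses .gz files itself


# Hypothetical extension-to-reader table (an assumption, not param's actual one).
ARRAY_READERS = {
    ".npy": _load_npy,
    ".txt": _load_txt,
    ".txt.gz": _load_txt,
}


def deserialize_array(value, readers=ARRAY_READERS):
    """Return the file contents if `value` names a readable file, else `value`."""
    # 1. Does the value match a file on disk?
    if isinstance(value, str) and os.path.isfile(value):
        # 2. Does it end with an extension we know? Using endswith() also
        #    covers double extensions such as ".txt.gz".
        for ext, reader in readers.items():
            if value.lower().endswith(ext):
                # 3. Call the reader registered for that extension.
                return reader(value)
        raise ValueError(f"No reader registered for {value!r}")
    # Not a path: leave the value for the usual constructor-argument handling.
    return value
```

Passing the `readers` table as an argument keeps the dispatch logic independent of which libraries happen to be installed.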

This is not a bulletproof solution - some files will have misleading or no extensions, or might require non-standard arguments to parse - but it will yield correct results in most situations. The user can always overload or subclass the relevant Parameters if she needs a special method of deserialization.

I have so far handled the easiest file types: those with both read and write routines in either numpy or pandas. Some require extra dependencies. If those dependencies were easily installed, I added guards in the test file and added them to the test environment in tox.ini. For numpy, these are

  • .npy (Numpy archive)
  • .txt[.gz] (Numpy text file)

For pandas, these are

  • .csv (comma-separated)
  • .dta (stata)
  • .feather
  • .json
  • .ods (OpenOffice sheet)
  • .parquet
  • .pkl (pickle)
  • .tsv (tab-separated)
  • .xls{m,x} (Excel sheet)
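
A guard of the kind mentioned above could look roughly like this. It is a standard-library sketch, not the test file from this PR. The helper `has_module` and the test class are made up for illustration, and the example assumes that `pyarrow` is the optional package backing `.parquet` support in the test environment.

```python
import importlib.util
import unittest


def has_module(name):
    """True if the top-level module `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


class TestOptionalReaders(unittest.TestCase):
    # Skipped (rather than failed) when the optional dependency is missing.
    @unittest.skipUnless(
        has_module("pandas") and has_module("pyarrow"),
        "pandas and pyarrow are required for .parquet files",
    )
    def test_parquet_roundtrip(self):
        import os
        import tempfile

        import pandas as pd

        df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
        with tempfile.TemporaryDirectory() as d:
            path = os.path.join(d, "table.parquet")
            df.to_parquet(path)
            self.assertTrue(pd.read_parquet(path).equals(df))
```

The same pattern would apply to the other formats, with one guard per optional package.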

Pandas has a lot more I/O routines that are trivial to add but harder to test. The two glaring absences are:

  • .hdf5
  • .xls (Pre-2007 Excel)

Pandas does have a write routine for ".xls" but its backend has been deprecated. HDF5 support relies on PyTables. PyTables is easily installed on Conda but has only a source distribution on PyPI. To install that source distribution, you need access to hdf5's headers.

I hope this is a good starting point for this functionality.

Thank you for your time,
Sean

@philippjfr
Member

Nice! I like this a lot, are you also anticipating implementing the other direction and adding serialization?

@sdrobert
Contributor Author

@philippjfr In my use case there isn't a need to serialize back to file. @jbednar and @jlstevens bandied the idea back and forth and I think came to the conclusion that, whereas deserialization can be performed transparently from file to value, serialization would need some mechanism to tether the Parameter value to a specific file/encoding. This PR as-is makes no changes to param's API (except perhaps removing the ability to deserialize an existing path as an array of characters - which was probably not intended in the first place).

@jbednar
Member

jbednar commented May 31, 2021

Right; at this point it only covers deserialization, and fully supporting transparent roundtripping (ensuring the file type doesn't change in the process) sounds complicated. Still, I think we'll be able to review and merge this and add serialization later. I'd guess the filetype won't be preserved, e.g. a .csv might turn into .parquet, which is what Intake does for caching, but that seems ok to me.

@jlstevens
Contributor

Looks great!

This is not a bulletproof solution - some files will have misleading or no extensions, or might require non-standard arguments to parse - but it will yield correct results in most situations.

I realize this is still WIP but I've made one comment that I think would help users when the file fails to load: essentially, it would be nice to state which extensions are supported for that type.

@sdrobert
Contributor Author

Thanks for the review jlstevens. Letting users know what file types are supported is a really good idea.

I also wanted to mention a quiet bug that I'm glossing over right now, in case you want me to handle it differently. Python 2.7 supports only up to Pandas 0.24. In that version, pandas.read_excel did not support .ods files. I am currently just skipping the .ods test for Python 2.7 and the package will incorrectly report the ability to handle .ods files. A more correct solution would involve checking the Pandas version and excluding the file type appropriately, or nixing the type altogether. That said, it's Python 2.7. I'm sure there are also minimum version requirements for Pandas (pre 1.0) and Numpy that I've overlooked.
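
For reference, the "more correct solution" of checking the version could be sketched as below. Everything here is hypothetical: the helper names, the `minimums` table, and in particular the `(0, 25)` threshold, which is only a placeholder (the discussion establishes no more than that 0.24 lacks .ods support). In practice the installed version string would come from `pandas.__version__`.

```python
def parse_version(text):
    """Turn a version string such as '0.24.2' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(digits))
    return tuple(parts)


def supported_extensions(extensions, installed, minimums):
    """Keep only the extensions whose minimum version is met by `installed`."""
    current = parse_version(installed)
    # Extensions without an entry in `minimums` are always kept: () <= any tuple.
    return sorted(ext for ext in extensions if current >= minimums.get(ext, ()))


# Placeholder threshold, an assumption for illustration only.
MIN_PANDAS = {".ods": (0, 25)}
```

With this table, `supported_extensions([".csv", ".ods"], "0.24.2", MIN_PANDAS)` would drop `.ods`, so the package would no longer claim to handle it on old installations.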

@jlstevens
Contributor

I am currently just skipping the .ods test for Python 2.7 and the package will incorrectly report the ability to handle .ods files.

I think we can just mention this in the release notes. Even though param will probably support 2.7 for a while, many of our downstream projects are now switching to Python 3. At this point, it isn't critical if there are a few holes in the Python 2 support.

Member

@jbednar jbednar left a comment

Looks great! I agree that it would be nice to have the mapping from file extensions to reader functions at a global level, but those reader functions are in code that might not have been imported and may not even exist in this environment. Thus we'd have to store the mapping as text, which is doable but a bit ugly. So I'd be inclined to leave those mappings where they are, and use the keys() of the dict they are in to list the available file types.
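
To make the two options concrete, here is a small sketch (not code from this PR) of a text-only mapping whose keys() are used to list the supported file types in the error message. The table `DATAFRAME_READER_NAMES` and the function `resolve_reader` are hypothetical names; the table covers only a few of the extensions as an example.

```python
import importlib

# Hypothetical text-only table: extension -> name of a reader function in pandas.
# Storing names rather than functions means pandas need not be imported up front.
DATAFRAME_READER_NAMES = {
    ".csv": "read_csv",
    ".json": "read_json",
    ".parquet": "read_parquet",
}


def resolve_reader(ext, reader_names=DATAFRAME_READER_NAMES, module_name="pandas"):
    """Import `module_name` on demand and return the reader registered for `ext`."""
    if ext not in reader_names:
        # The dict keys double as the list of supported file types.
        supported = ", ".join(sorted(reader_names.keys()))
        raise ValueError(
            f"Unsupported file extension {ext!r}; supported extensions: {supported}"
        )
    module = importlib.import_module(module_name)
    return getattr(module, reader_names[ext])
```

The unsupported-extension check runs before the import, so the error message is available even in an environment where the library is missing.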

tox.ini Outdated
@@ -41,17 +41,55 @@ deps = {[testenv]deps}
gmpy
setenv = PARAM_TEST_GMPY = 1

# xlrd is the reader for xls files in pandas. xlrd is also the reader
# for xlsx and xlsm files for pandas<1.2, but this is only possible using
# xlrd<2. If pandas>=1.2 is guaranteed, you can remove the version spec
Member

I'd be ok with testing this new file-reading functionality only on pandas>=1.2; it's a new feature and not supporting it with old versions wouldn't be a big deal.

Contributor Author

Can do, but I think this'll nix Python 3.6? I think pandas releases for 3.6 only go up to 1.1?

Member

I'm ok with not supporting Python 3.6, since we are trying to merge this into Param 2, our forward-looking codebase.

Co-authored-by: James A. Bednar <[email protected]>
@jbednar jbednar added this to the v1.11.1 milestone Jul 2, 2021
@MridulS MridulS modified the milestones: v1.11.2, v1.12.1 Feb 7, 2022
@philippjfr philippjfr modified the milestones: v1.12.1, 2.0 Mar 31, 2022
@maximlt
Member

maximlt commented Apr 5, 2023

@jlstevens do you think you'll have time to review this for 2.0 or prefer to postpone that?

@jlstevens
Contributor

I think this probably should go in param 2.0 but it would be good to have a little time to test this change as well.

If you could fix the merge conflict, I'll do a quick review then merge.

@maximlt
Member

maximlt commented Apr 16, 2023

@jlstevens I fixed the merge conflict.

@sdrobert
Contributor Author

@maximlt Sorry for the trouble! It's been a busy few weeks.

@maximlt
Member

maximlt commented Apr 21, 2023

@maximlt Sorry for the trouble! It's been a busy few weeks.

Oh no worries :) We're pushing to get Param 2.0 out, hence the recent movement on this PR and others.

@maximlt
Member

maximlt commented May 4, 2023

@jlstevens fixed the conflicts that were introduced recently after a few big merged PRs.

Member

@jbednar jbednar left a comment

I'm eager to get this merged at last, but I remain confused about the expected user-level API. What I was expecting is for a user to be able to replace pd.read_csv("file.tsv") with simply "file.tsv" in code like:

import param, numpy as np, pandas as pd

class A(param.Parameterized):
    f = param.DataFrame(pd.read_csv("file.tsv"), rows=(2,None), columns=set(['a','b']))

a = A()
a.f

That doesn't work, and I'm not sure what the conditions should be for a user to invoke this functionality, since replacing pd.read_csv("file.tsv") with param.DataFrame.deserialize("file.tsv") doesn't seem like a net win:
[screenshot omitted]

Presumably we need some examples for the docs before merging?

Co-authored-by: James A. Bednar <[email protected]>
@jbednar
Member

jbednar commented May 16, 2023

@sdrobert , even after all this time, I'm still missing a key bit of the intended use case and motivation, because there still aren't any examples of actual usage. As best I can tell, since there is no serialization implemented, what this PR will address is someone who writes their own JSON file and wants to specify a filename rather than the actual contents of the DataFrame or Array. Can you give us any example of an actual JSON file that would be used in this way? I consider JSON to be a read-only format, and would never edit it by hand since that just leads to file-format errors, but I understand that editing JSON can be feasible for some people in an editor like VSCode that has better support than what I use. Is that really the intended use case? Directly authoring JSON? If so we need to include an example in the docs of doing that, or no one will ever use this functionality.

@sdrobert
Contributor Author

@jbednar, to answer your immediate question, my use case has always been machine learning. You can find an example here in my supplementary library to param (N.B. this is not a plug; I would rather all the functionality exist in param so I could sunset my library). This example does not contain any arrays or data frames, but may easily and plausibly be augmented to include one, e.g.

{
  "training": {
    "lr": 1e-05,
    "max_epochs": 10,
    "model_regex": "model-{epoch:05d}.pkl"
  },
  "model": {
    "activations": "relu",
    "layers": [
      "conv",
      "conv",
      "fc"
    ],
    "mean": "mean.npy",
    "std": "std.npy"
  }
}

Here mean.npy and std.npy point to files where, e.g., feature means and standard deviations reside. When deserializing from file, the mean and std params are populated with the contents of those files. This configuration can be specified by hand; other methods would be tedious or impossible.

DataFrame parameters could provide an easy means of dynamically specifying training sets for smaller ML tasks, e.g. for scikit-learn routines. It could also be used similarly to script visualization routines for e.g., seaborn or possibly HoloViz, avoiding notebooks.

I'm not really sure how to answer the other question about modifying JSON by hand. I agree that JSON is unwieldy, which is why I have also implemented YAML (de)serialization in my repo and am willing to make a PR for it here. Based on the framing of the question, however, it seems perhaps that the JSON (de)serialization mechanism in param is intended solely for machine consumption? There doesn't appear to be a standard means of getting parameters into a Parameterized instance beyond manipulating them after instantiation programmatically. This is a broader question than that of just arrays and data frames. If the team doesn't see much value in ingesting manual configurations overall, this PR isn't going to do much. I'm sorry for the noise at that point.

@jbednar
Member

jbednar commented May 17, 2023

That's perfect, thanks! Indeed, we do see much use for a declarative YAML spec, and now it makes sense. This PR is only half of what's needed, which is what was confusing me! Ok, now we can move forward. Thanks.

@maximlt
Member

maximlt commented May 25, 2023

Given that you've reviewed this extensively @jbednar, I'll let you decide whether you want to hit merge.

@jbednar jbednar merged commit b86d8d6 into holoviz:main May 26, 2023