Support for deserialization of file types into Array, DataFrame #482
Conversation
Nice! I like this a lot. Are you also anticipating implementing the other direction and adding serialization?
@philippjfr In my use case there isn't a need to serialize back to a file. @jbednar and @jlstevens bandied the idea back and forth, and I think they came to the conclusion that, whereas deserialization can be performed transparently from file to value, serialization would need some mechanism to tether the Parameter value to a specific file/encoding. This PR as-is makes no changes to param's API (except perhaps removing the ability to deserialize an existing path as an array of characters, which was probably not intended in the first place).
Right; at this point it only covers deserialization, and fully supporting transparent roundtripping (ensuring the file type doesn't change in the process) sounds complicated. Still, I think we'll be able to review and merge this and add serialization later. I'd guess the file type won't be preserved, e.g. a .csv might turn into .parquet, which is what Intake does for caching, but that seems ok to me.
Looks great!
I realize this is still WIP, but I've made one comment that I think would help users when a file fails to load: essentially, it would be nice to state which extensions are supported for that type.
Thanks for the review, jlstevens. Letting users know which file types are supported is a really good idea. I also wanted to mention a quiet bug that I'm glossing over right now, in case you want me to handle it differently: Python 2.7 supports only up to Pandas
I think we can just mention this in the release notes. Even though param will probably support 2.7 for a while, many of our downstream projects are now switching to Python 3. At this point, it isn't critical if there are a few holes in the Python 2 support.
Looks great! I agree that having the mapping from file extension to reader functions at a global level would be nice, but those reader functions live in code that might not have been imported and may not even exist in this environment. Thus we'd have to store the mapping as text, which is doable but a bit ugly. So I'd be inclined to leave those mappings where they are, and use the keys() of the dict they are in to list the available file types.
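A minimal sketch of that suggestion, assuming a hypothetical module-level dict `_extension_readers` standing in for wherever each parameter type keeps its mapping:

```python
import os
import numpy as np

# Hypothetical extension-to-reader mapping; the PR keeps a similar dict
# next to the reader functions for each parameter type.
_extension_readers = {".npy": np.load}

def _read_file(path):
    _, ext = os.path.splitext(path)
    reader = _extension_readers.get(ext.lower())
    if reader is None:
        # Use the dict's keys to tell the user which extensions work.
        raise ValueError(
            "cannot deserialize %r: supported extensions are %s"
            % (path, ", ".join(sorted(_extension_readers)))
        )
    return reader(path)
```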
tox.ini (outdated diff):
```
@@ -41,17 +41,55 @@ deps = {[testenv]deps}
    gmpy
setenv = PARAM_TEST_GMPY = 1

# xlrd is the reader for xls files in pandas. xlrd is also the reader
# for xlsx and xlsm files for pandas<1.2, but this is only possible using
# xlrd<2. If pandas>=1.2 is guaranteed, you can remove the version spec
```
I'd be ok with testing this new file-reading functionality only on pandas>=1.2; it's a new feature and not supporting it with old versions wouldn't be a big deal.
Can do, but I think this will nix Python 3.6? I believe pandas only goes up to 1.1 there.
I'm ok with not supporting Python 3.6, since we are trying to merge this into Param 2, our forward-looking codebase.
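A sketch of how such a version guard might look in the test suite, assuming unittest-style tests; the class name and skip condition are illustrative, not the PR's actual tests:

```python
import unittest
import pandas as pd

# Compare only the major.minor components of the pandas version.
_PANDAS_GE_12 = tuple(int(p) for p in pd.__version__.split(".")[:2]) >= (1, 2)

@unittest.skipIf(not _PANDAS_GE_12,
                 "only testing xlsx/xlsm deserialization on pandas>=1.2 (openpyxl backend)")
class TestExcelDeserialization(unittest.TestCase):
    def test_xlsx(self):
        pass  # would deserialize a small .xlsx fixture here
```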
@jlstevens do you think you'll have time to review this for 2.0, or would you prefer to postpone it?
I think this probably should go into Param 2.0, but it would be good to have a little time to test this change as well. If you could fix the merge conflict, I'll do a quick review and then merge.
@jlstevens I fixed the merge conflict.
@maximlt Sorry for the trouble! It's been a busy few weeks.
Oh no worries :) We're pushing to get Param 2.0 out, hence the recent movement on this PR and others.
@jlstevens I fixed the conflicts that were introduced recently by a few big merged PRs.
I'm eager to get this merged at last, but I remain confused about the expected user-level API. What I was expecting is for a user to be able to replace `pd.read_csv("file.tsv")` with simply `"file.tsv"` in code like:

```python
import param, numpy as np, pandas as pd

class A(param.Parameterized):
    f = param.DataFrame(pd.read_csv("file.tsv"), rows=(2, None), columns=set(['a', 'b']))

a = A()
a.f
```

That doesn't work, and I'm not sure what the conditions should be for a user to invoke this functionality, since replacing `pd.read_csv("file.tsv")` with `param.DataFrame.deserialize("file.tsv")` doesn't seem like a net win.
Presumably we need some examples for the docs before merging?
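For reference, here is a sketch of the kind of docs example that might answer this, going through param's existing JSON deserialization entry point rather than a bare constructor argument; the class `A` and the file `data.csv` are hypothetical:

```python
import param

class A(param.Parameterized):
    f = param.DataFrame()

# With this PR, a string value naming an existing file is read with the
# matching pandas routine (here pd.read_csv) during JSON deserialization;
# "data.csv" is assumed to exist on disk.
values = A.param.deserialize_parameters('{"f": "data.csv"}')
a = A(**values)
```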
@sdrobert, even after all this time, I'm still missing a key bit of the intended use case and motivation, because there still aren't any examples of actual usage. As best I can tell, since there is no serialization implemented, what this PR will address is someone who writes their own JSON file and wants to specify a filename rather than the actual contents of the DataFrame or Array. Can you give us an example of an actual JSON file that would be used in this way? I consider JSON to be a read-only format, and would never edit it by hand, since that just leads to file-format errors, but I understand that editing JSON can be feasible for some people in an editor like VSCode that has better support than what I use. Is that really the intended use case? Directly authoring JSON? If so, we need to include an example in the docs of doing that, or no one will ever use this functionality.
@jbednar, to answer your immediate question, my use case has always been machine learning. You can find an example here in my supplementary library:

```json
{
    "training": {
        "lr": 1e-05,
        "max_epochs": 10,
        "model_regex": "model-{epoch:05d}.pkl"
    },
    "model": {
        "activations": "relu",
        "layers": [
            "conv",
            "conv",
            "fc"
        ],
        "mean": "mean.npy",
        "std": "std.npy"
    }
}
```

DataFrame parameters could provide an easy means of dynamically specifying training sets for smaller ML tasks, e.g. for scikit-learn routines. They could also be used similarly to script visualization routines for, e.g., seaborn or possibly HoloViz, avoiding notebooks. I'm not really sure how to answer the other question about modifying JSON by hand. I agree that JSON is unwieldy, which is why I have also implemented YAML (de)serialization in my repo and am willing to make a PR for it here. Based on the framing of the question, however, it seems perhaps that the JSON (de)serialization mechanism in
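As a concrete illustration of how the "model" block above could bind to param under this PR, here is a hypothetical sketch (the class and parameter names simply mirror the JSON keys):

```python
import param

class ModelParams(param.Parameterized):
    activations = param.String("relu")
    layers = param.List(["conv", "conv", "fc"])
    mean = param.Array()
    std = param.Array()

# With this PR, "mean.npy" and "std.npy" are loaded via np.load rather than
# being spelled out as giant inline lists of numbers in the config file.
spec = '{"mean": "mean.npy", "std": "std.npy"}'
values = ModelParams.param.deserialize_parameters(spec)
model = ModelParams(**values)
```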
That's perfect, thanks! Indeed, we do see much use for a declarative YAML spec, and now it makes sense. This PR is only half of what's needed, which is what was confusing me! Ok, now we can move forward. Thanks.
Given that you've reviewed this extensively @jbednar, I'll let you decide whether you want to hit merge.
This PR follows #382 and #470, as well as discussion with @jbednar and @jlstevens.
param currently models deserialization routines after serialization routines. This is fine if you're only ever deserializing something that was previously serialized automatically. However, my use case (and, I believe, that of plenty of others) focuses on reading in hand-written config files for, e.g., specifying neural network parameters. In such cases, writing a Series, Array, or DataFrame could involve dumping a giant list of numbers by hand into the configuration file.
A better solution is to specify a path to a data file in the configuration which, when deserialized, is quietly parsed into the value. The reference to the file can be thrown away and just the value stored in the parameter. The type of file, and thus the routine for parsing it, is inferred from the file extension. In the `deserialize()` method of the relevant Parameters, before interpreting the value as arguments to a constructor (either an `ndarray` or a `DataFrame`), we first check whether it matches a file on disk; then whether it matches an extension we know; and then, if numpy or pandas has a routine to read that extension, we call it. This is not a bulletproof solution - some files will have misleading or no extensions, or might require non-standard arguments to parse - but it will yield correct results in most situations. The user can always overload or subclass the relevant Parameters if she needs a special method of deserialization.
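In rough pseudocode, the check sequence amounts to something like the following; the reader table's contents and the helper name are illustrative, not the PR's exact code:

```python
import os
import numpy as np
import pandas as pd

# Illustrative extension-to-reader table; the PR keeps one per parameter type.
_READERS = {".npy": np.load, ".csv": pd.read_csv, ".parquet": pd.read_parquet}

def _maybe_read_file(value):
    """Return the file's contents if value names a readable file, else value."""
    if isinstance(value, str) and os.path.isfile(value):
        ext = os.path.splitext(value)[1].lower()
        if ext in _READERS:
            return _READERS[ext](value)
    return value  # fall through to normal constructor-argument handling
```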
I have so far handled the easiest file types: those with both read and write routines in either numpy or pandas. Some require extra dependencies; where those dependencies were easy to install, I added guards in the test file and added the dependencies to the test environment in tox.ini. For numpy, these are
For pandas, these are
Pandas has a lot more I/O routines that are trivial to add but harder to test. The two glaring absences are ".xls" and HDF5. Pandas does have a write routine for ".xls", but its backend has been deprecated. HDF5 support relies on PyTables; PyTables is easily installed on Conda but has only a source distribution on PyPI, and to install that source distribution you need access to HDF5's headers.
I hope this is a good starting point for this functionality.
Thank you for your time,
Sean