Support for deserialization of file types into Array, DataFrame #482
Conversation
Nice! I like this a lot. Are you also anticipating implementing the other direction and adding serialization?
@philippjfr In my use case there isn't a need to serialize back to a file. @jbednar and @jlstevens bandied the idea back and forth, and I think they came to the conclusion that, whereas deserialization can be performed transparently from file to value, serialization would need some mechanism to tether the Parameter value to a specific file/encoding. This PR as-is makes no changes to param's API (except perhaps removing the ability to deserialize an existing path as an array of characters, which was probably not intended in the first place).
Right; at this point it only covers deserialization, and fully supporting transparent roundtripping (ensuring the file type doesn't change in the process) sounds complicated. Still, I think we'll be able to review and merge this and add serialization later. I'd guess the file type won't be preserved, e.g. a .csv might turn into .parquet, which is what Intake does for caching, but that seems ok to me.
Looks great!
I realize this is still WIP, but I've made one comment that I think would help users when a file fails to load: essentially, it would be nice to state which extensions are supported for that type.
Thanks for the review, jlstevens. Letting users know which file types are supported is a really good idea. I also wanted to mention a quiet bug that I'm glossing over right now, in case you want me to handle it differently: Python 2.7 supports only up to Pandas
I think we can just mention this in the release notes. Even though param will probably support 2.7 for a while, many of our downstream projects are now switching to Python 3. At this point, it isn't critical if there are a few holes in the Python 2 support.
Looks great! I agree that having the mapping from file extension to reader functions at a global level would be nice, but those reader functions live in code that might not have been imported and may not even exist in this environment. Thus we'd have to store the mapping as text, which is doable but a bit ugly. So I'd be inclined to leave those mappings where they are, and use the keys() of the dict they are in to list the available file types.
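A minimal sketch of that suggestion, assuming a hypothetical module-level dict `_extension_readers` standing in for wherever each parameter type keeps its mapping:

```python
import os
import numpy as np

# Hypothetical extension-to-reader mapping; the PR keeps a similar dict
# next to the reader functions for each parameter type.
_extension_readers = {".npy": np.load}

def _read_file(path):
    _, ext = os.path.splitext(path)
    reader = _extension_readers.get(ext.lower())
    if reader is None:
        # Use the dict's keys to tell the user which extensions work.
        raise ValueError(
            "cannot deserialize %r: supported extensions are %s"
            % (path, ", ".join(sorted(_extension_readers)))
        )
    return reader(path)
```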
tox.ini (outdated diff):
```
@@ -41,17 +41,55 @@ deps = {[testenv]deps}
    gmpy
setenv = PARAM_TEST_GMPY = 1

# xlrd is the reader for xls files in pandas. xlrd is also the reader
# for xlsx and xlsm files for pandas<1.2, but this is only possible using
# xlrd<2. If pandas>=1.2 is guaranteed, you can remove the version spec
```
I'd be ok with testing this new file-reading functionality only on pandas>=1.2; it's a new feature and not supporting it with old versions wouldn't be a big deal.
Can do, but I think this will nix Python 3.6? I believe pandas only goes up to 1.1 there.
I'm ok with not supporting Python 3.6, since we are trying to merge this into Param 2, our forward-looking codebase.
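A sketch of how such a version guard might look in the test suite, assuming unittest-style tests; the class name and skip condition are illustrative, not the PR's actual tests:

```python
import unittest
import pandas as pd

# Compare only the major.minor components of the pandas version.
_PANDAS_GE_12 = tuple(int(p) for p in pd.__version__.split(".")[:2]) >= (1, 2)

@unittest.skipIf(not _PANDAS_GE_12,
                 "only testing xlsx/xlsm deserialization on pandas>=1.2 (openpyxl backend)")
class TestExcelDeserialization(unittest.TestCase):
    def test_xlsx(self):
        pass  # would deserialize a small .xlsx fixture here
```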
@jlstevens do you think you'll have time to review this for 2.0, or would you prefer to postpone it?
I think this probably should go into Param 2.0, but it would be good to have a little time to test this change as well. If you could fix the merge conflict, I'll do a quick review and then merge.
@jlstevens I fixed the merge conflict.
@maximlt Sorry for the trouble! It's been a busy few weeks.
Oh no worries :) We're pushing to get Param 2.0 out, hence the recent movement on this PR and others.
@jlstevens I fixed the conflicts that were introduced recently by a few big merged PRs.
I'm eager to get this merged at last, but I remain confused about the expected user-level API. What I was expecting is for a user to be able to replace `pd.read_csv("file.tsv")` with simply `"file.tsv"` in code like:

```python
import param, numpy as np, pandas as pd

class A(param.Parameterized):
    f = param.DataFrame(pd.read_csv("file.tsv"), rows=(2, None), columns=set(['a', 'b']))

a = A()
a.f
```

That doesn't work, and I'm not sure what the conditions should be for a user to invoke this functionality, since replacing `pd.read_csv("file.tsv")` with `param.DataFrame.deserialize("file.tsv")` doesn't seem like a net win.
Presumably we need some examples for the docs before merging?
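For reference, here is a sketch of the kind of docs example that might answer this, going through param's existing JSON deserialization entry point rather than a bare constructor argument; the class `A` and the file `data.csv` are hypothetical:

```python
import param

class A(param.Parameterized):
    f = param.DataFrame()

# With this PR, a string value naming an existing file is read with the
# matching pandas routine (here pd.read_csv) during JSON deserialization;
# "data.csv" is assumed to exist on disk.
values = A.param.deserialize_parameters('{"f": "data.csv"}')
a = A(**values)
```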
@sdrobert, even after all this time, I'm still missing a key bit of the intended use case and motivation, because there still aren't any examples of actual usage. As best I can tell, since there is no serialization implemented, what this PR will address is someone who writes their own JSON file and wants to specify a filename rather than the actual contents of the DataFrame or Array. Can you give us an example of an actual JSON file that would be used in this way? I consider JSON to be a read-only format, and would never edit it by hand, since that just leads to file-format errors, but I understand that editing JSON can be feasible for some people in an editor like VSCode that has better support than what I use. Is that really the intended use case? Directly authoring JSON? If so, we need to include an example in the docs of doing that, or no one will ever use this functionality.
@jbednar, to answer your immediate question, my use case has always been machine learning. You can find an example here in my supplementary library:

```json
{
    "training": {
        "lr": 1e-05,
        "max_epochs": 10,
        "model_regex": "model-{epoch:05d}.pkl"
    },
    "model": {
        "activations": "relu",
        "layers": [
            "conv",
            "conv",
            "fc"
        ],
        "mean": "mean.npy",
        "std": "std.npy"
    }
}
```

DataFrame parameters could provide an easy means of dynamically specifying training sets for smaller ML tasks, e.g. for scikit-learn routines. They could also be used similarly to script visualization routines for, e.g., seaborn or possibly HoloViz, avoiding notebooks. I'm not really sure how to answer the other question about modifying JSON by hand. I agree that JSON is unwieldy, which is why I have also implemented YAML (de)serialization in my repo and am willing to make a PR for it here. Based on the framing of the question, however, it seems perhaps that the JSON (de)serialization mechanism in
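As a concrete illustration of how the "model" block above could bind to param under this PR, here is a hypothetical sketch (the class and parameter names simply mirror the JSON keys):

```python
import param

class ModelParams(param.Parameterized):
    activations = param.String("relu")
    layers = param.List(["conv", "conv", "fc"])
    mean = param.Array()
    std = param.Array()

# With this PR, "mean.npy" and "std.npy" are loaded via np.load rather than
# being spelled out as giant inline lists of numbers in the config file.
spec = '{"mean": "mean.npy", "std": "std.npy"}'
values = ModelParams.param.deserialize_parameters(spec)
model = ModelParams(**values)
```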
That's perfect, thanks! Indeed, we do see much use for a declarative YAML spec, and now it makes sense. This PR is only half of what's needed, which is what was confusing me! Ok, now we can move forward. Thanks.
Given that you've reviewed this extensively @jbednar, I'll let you decide whether you want to hit merge.
This PR follows #382 and #470, as well as discussion with @jbednar and @jlstevens.
param currently models deserialization routines after serialization routines. This is fine if you're only ever deserializing something that was previously serialized automatically. However, my use case (and, I believe, that of plenty of others) focuses on reading in hand-written config files for, e.g., specifying neural network parameters. In such cases, writing a Series, Array, or DataFrame could involve dumping a giant list of numbers by hand into the configuration file.
A better solution is to specify a path to a data file in the configuration which, when deserialized, is quietly parsed into the value. The reference to the file can be thrown away and just the value stored in the parameter. The type of file, and thus the routine for parsing it, is inferred from the file extension. In the `deserialize()` method of the relevant Parameters, before interpreting the value as arguments to a constructor (either an `ndarray` or a `DataFrame`), we first check whether it matches a file on disk; then whether it matches an extension we know; and then, if numpy or pandas has a routine to read that extension, we call it. This is not a bulletproof solution - some files will have misleading or no extensions, or might require non-standard arguments to parse - but it will yield correct results in most situations. The user can always overload or subclass the relevant Parameters if she needs a special method of deserialization.
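In rough pseudocode, the check sequence amounts to something like the following; the reader table's contents and the helper name are illustrative, not the PR's exact code:

```python
import os
import numpy as np
import pandas as pd

# Illustrative extension-to-reader table; the PR keeps one per parameter type.
_READERS = {".npy": np.load, ".csv": pd.read_csv, ".parquet": pd.read_parquet}

def _maybe_read_file(value):
    """Return the file's contents if value names a readable file, else value."""
    if isinstance(value, str) and os.path.isfile(value):
        ext = os.path.splitext(value)[1].lower()
        if ext in _READERS:
            return _READERS[ext](value)
    return value  # fall through to normal constructor-argument handling
```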
I have so far handled the easiest file types: those with both read and write routines in either numpy or pandas. Some require extra dependencies; where those dependencies were easy to install, I added guards in the test file and added the dependencies to the test environment in tox.ini. For numpy, these are
For pandas, these are
Pandas has a lot more I/O routines that are trivial to add but harder to test. The two glaring absences are ".xls" and HDF5. Pandas does have a write routine for ".xls", but its backend has been deprecated. HDF5 support relies on PyTables; PyTables is easily installed on Conda but has only a source distribution on PyPI, and to install that source distribution you need access to HDF5's headers.
I hope this is a good starting point for this functionality.
Thank you for your time,
Sean