Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow udf in run_udf to be array of strings #374

Open
soxofaan opened this issue Jun 22, 2022 · 4 comments
Open

Allow udf in run_udf to be array of strings #374

soxofaan opened this issue Jun 22, 2022 · 4 comments

Comments

@soxofaan
Copy link
Member

soxofaan commented Jun 22, 2022

"name": "udf",
"description": "Either source code, an absolute URL or a path to a UDF script.",
"schema": [
{
"description": "Absolute URL to a UDF",
"type": "string",
"format": "uri",
"subtype": "uri",
"pattern": "^https?://"
},
{
"description": "Path to a UDF uploaded to the server.",
"type": "string",
"subtype": "file-path",
"pattern": "^[^\r\n\\:'\"]+$"
},
{
"description": "The multi-line source code of a UDF, must contain a newline/line-break.",
"type": "string",
"subtype": "udf-code",
"pattern": "(\r\n|\r|\n)"
}

In all practical use cases of run_udf I've seen, the udf code is provided as an inline string (not as URL or file path).
Putting the UDF code as a single one-line string with newlines enced as \n is however horrible for version control (when the process graph is stored in a repository, which we often do).

Feature request: allow the udf code also to be an array of strings (one string per line of code), which is a lot friendlier for version control tools. (Note that this is the JSON encoding approach used by jupyter notebooks)

@m-mohr
Copy link
Member

m-mohr commented Jun 25, 2022

Hmm, I'm not sure about this. We also experiment with use cases (in R) where we may need to pass multiple UDF files (i.e. multiple file paths or uris) and then we run into ambiguities as an array of strings could be anything and we don't have a good pattern anymore to distinguish them.

That the file path is rarely used is IMHO not a good argument yet as user workspaces are simply not supported by back-ends yet and as such file-path can't be used although peoply may want to use it.

@soxofaan
Copy link
Member Author

we may need to pass multiple UDF files (i.e. multiple file paths or uris) and then we run into ambiguities as an array of strings could be anything and we don't have a good pattern anymore to distinguish them.

Interesting. But if you want to pass multiple "files", you probably want to encode this as a mapping (filename -> contents) instead of just an array, because these files probably want to refer to each other (e.g. based on a filename).

That the file path is rarely used is IMHO not a good argument yet as user workspaces are simply not supported by back-ends yet and as such file-path can't be used although peoply may want to use it.

The problem with file path or URL is that you add a dependency to an external resource that might change unannounced, disappear, cause permission issues ... . Having your UDF fully embedded within your process graph is a good thing if you want to make it completely self-contained.

@m-mohr
Copy link
Member

m-mohr commented Jun 27, 2022

Interesting. But if you want to pass multiple "files", you probably want to encode this as a mapping (filename -> contents) instead of just an array, because these files probably want to refer to each other (e.g. based on a filename).

For file paths this is not required, you have the filename/path directly available and can load the content from it.
For your use case, you'd indeed want to encode it in an object or an array of objects rather than in an array of strings, I assume.

It just gives you a bit more structure for complex UDFs instead of squashing everything into one file (and some languages are relatively strict what they allow in a file, think Java with one class per File or C++ with h and cpp files, if you think about the future).

The problem with file path or URL is that you add a dependency to an external resource that might change unannounced, disappear, cause permission issues ... . Having your UDF fully embedded within your process graph is a good thing if you want to make it completely self-contained.

Agreed for URLs, but I'd sincerly hope that this doesn't occur for my user workspace at the back-end. And I personally, would prefer to not have the UDF available as a file as for me it is easier to just up- and download files. But that's a matter of taste, indeed.

@soxofaan
Copy link
Member Author

soxofaan commented Sep 6, 2022

I'd like to raise the importance of this feature request for version control friendly inline UDFs (e.g. as list of strings)

I just introduced a bug in an internal project because I could not properly review the diff of the UDF code.
Luckily the unit tests caught some side effect in my case, but in general this is pretty bad for the developer experience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants