Add hts_open_cb() interface which allows callback function used as data source #647
base: develop
The new interface will take a callback function as data source.
For reference, this change is to help allow htslib integration into bedtools. If the htslib maintainers have any suggestions, please let us know.
This looks like a good idea in principle, but we might want to tweak the implementation a bit.
In Some of the names here are quite terse. It would be better to change Being able to optionally pass a file name into If we want to match what Finally, it would be a good idea to add some tests to |
- Add test cases
- Make unsupported read and write calls fail instead of returning 0
I like the principle behind this, to permit use of htslib with arbitrary source and sinks, as it brings a lot more power to the library. It does indeed also have some similarity to fopencookie as @daviesrob points out. I'm wondering though whether this is a duplicate of the functionality already implemented in the @jmarshall this is your area of expertise as I think you wrote most of this. Any comments or preferences before we proceed?
Do you have to use C++ istream? Why not use the
hFILE *hp = hopen(file_name, "r");
hpeek(hp, buffer, 64);
if (is_hts_file(buffer, 64)) { // this is a file recognized by htslib
    htsFile *fp = hts_hopen(hp, file_name, "r");
    // then use htslib
} else { // this is NOT a file htslib can read
    // call hread() etc to access the file in your own way
}
Once you use
@jkbonfield wrote:
It could be implemented this way, with the caller making an hFILE where backend points to the caller's function. When the backend functions are called, they are passed an
@lh3 wrote:
As I understand it, this is to allow bedtools to hand over files that HTSlib understands without having to make major changes to bedtools itself. I guess it should be possible to make an istream that wraps an hFILE, but I don't know how hard it would be, or how difficult to integrate it with bedtools.
@lh3 The requirement is a little more complicated, since bedtools itself detects file formats such as a gzipped BAM, or even a BAM that has been compressed multiple times (although this is extremely rare). As @daviesrob mentioned, this solution seems to be the only way to avoid major code changes in bedtools. Plus, I do think an API in this flavor makes a lot of other uses of htslib easier. As for @jkbonfield's question, as I understand it, this change exposes the internal logic without changing existing APIs. The callback logic is already there; the point of this change is to give applications a way to use it.
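To make the detection step above concrete, here is a minimal sketch of sniffing the first bytes of a stream for gzip or BGZF framing. It is not the bedtools or htslib code: `sniff_result` and `sniff_compression` are hypothetical names, and the check assumes the BGZF `BC` extra subfield comes first in the gzip header, which is the usual layout but not the only one the format allows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a stream by its leading bytes. */
typedef enum { SNIFF_PLAIN, SNIFF_GZIP, SNIFF_BGZF } sniff_result;

/* Inspect up to the first 16 bytes of a stream.
 * gzip members start with the magic bytes 0x1f 0x8b.
 * BGZF blocks are gzip members with the FEXTRA flag set and a
 * 'B','C' extra subfield of length 2 (assumed here to be the first subfield). */
static sniff_result sniff_compression(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 2 || buf[0] != 0x1f || buf[1] != 0x8b)
        return SNIFF_PLAIN;
    if (len >= 16 && buf[2] == 8 && (buf[3] & 4) &&
        buf[12] == 'B' && buf[13] == 'C' && buf[14] == 2 && buf[15] == 0)
        return SNIFF_BGZF;
    return SNIFF_GZIP;
}
```

A stream that has been compressed more than once would need this check repeated on the decompressed output of each layer, which is why the application wants to keep control of the stream until detection is finished.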
Ok, I agree wrapping hFILE won't achieve these functionalities easily.
Thanks for the explanation. I confess I haven't studied the internals in detail to know if this is the best route, but was making sure you were aware of the alternatives and also pinging John who does know this inside out. I'm happy given others have looked at it and deemed it appropriate. Edit: no longer relevant.
As James noted, HTSlib already has a mechanism for putting arbitrary streams underneath So it would be more in the existing HTSlib style to write a hFILE backend for Conversely, it would be trivial to write a Bedtools's I/O is mostly localised in src/utils/bedFile/bedFile.cpp et al, no? HTSlib also provides gzipping and bgzipping facilities, and to the extent that bedtools's I/O is localised in those utility classes, I think it would be worth investigating converting bedtools from iostreams to native hFILE. This would allow for improvements in I/O error handling and filetype detection (cf |
Thanks for the updates. I'd still like
typedef struct hFILE_callback_ops {
    ssize_t (* const read)(void *callback_data, void *buf, size_t sz) HTS_RESULT_USED;
    ssize_t (* const write)(void *callback_data, const void *buf, size_t sz) HTS_RESULT_USED;
    off_t (* const seek)(void *callback_data, off_t ofs, int whence) HTS_RESULT_USED;
    int (* const flush)(void *callback_data) HTS_RESULT_USED;
    int (* const close)(void *callback_data);
} hFILE_callback_ops;
hFILE *hopen_callback(void *callback_data, const char *mode, const hFILE_callback_ops *ops);
htsFile *hts_open_callback(void *callback_data, const char *fn, const char *mode, const struct hFILE_callback_ops *ops);
And usage would be something like this:
const static hFILE_callback_ops my_callback = {
    callback_read, callback_write, callback_lseek, callback_flush, callback_close
};
// ...
int *fd_ptr = malloc(sizeof(int));
if (!fd_ptr) return EXIT_FAILURE;
*fd_ptr = open(name, O_RDONLY);
if (*fd_ptr < 0) return EXIT_FAILURE;
hFILE *f = hopen_callback(fd_ptr, "r", &my_callback);
// ...
Also, I've noticed a couple of coding style points. Please can you replace any tabs with four spaces? Tabs just cause too much trouble with different editors, so we've eliminated them. And please could you align the
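For anyone following along, the callback-table idea can be tried out without htslib at all. The sketch below is only an illustration under assumptions: `demo_callback_ops`, `mem_source` and `drain` are hypothetical names, the table is a read-only subset modelled on the proposed `hFILE_callback_ops`, and `drain` stands in for whatever the library would do with the callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical read-only subset of a callback table. */
typedef struct demo_callback_ops {
    ssize_t (*read)(void *callback_data, void *buf, size_t sz);
    int (*close)(void *callback_data);
} demo_callback_ops;

/* Caller-owned state; handed back to every callback as callback_data. */
typedef struct mem_source {
    const char *data;
    size_t len;
    size_t pos;
    int closed;
} mem_source;

/* Copy up to sz bytes from the in-memory buffer; 0 means end of data. */
static ssize_t mem_read(void *callback_data, void *buf, size_t sz)
{
    mem_source *src = callback_data;
    size_t left = src->len - src->pos;
    size_t n = sz < left ? sz : left;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return (ssize_t) n;
}

static int mem_close(void *callback_data)
{
    mem_source *src = callback_data;
    src->closed = 1;
    return 0;
}

static const demo_callback_ops mem_ops = { mem_read, mem_close };

/* Stand-in for the library side: read through the table until end of data
 * or until out is full, then close. Returns bytes read, or -1 on error. */
static ssize_t drain(void *callback_data, const demo_callback_ops *ops,
                     char *out, size_t cap)
{
    size_t total = 0;
    assert(ops != NULL && ops->read != NULL && ops->close != NULL);
    while (total < cap) {
        ssize_t n = ops->read(callback_data, out + total, cap - total);
        if (n < 0) return -1;
        if (n == 0) break;
        total += (size_t) n;
    }
    if (ops->close(callback_data) < 0) return -1;
    return (ssize_t) total;
}
```

The design point is that the library never needs to know what `callback_data` is; it only passes the pointer back, so the same table shape works for a file descriptor, a memory buffer, or a wrapped C++ stream.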
If the maintainers are considering this PR as is (as noted, the original intention with this code was for the |
@jmarshall Which bits of API do you think we would need to make public to get this to work via Having looked a bit more, it appears using the backend pointers directly would in most cases avoid one function call on the way to the callback. So there is a small benefit to doing it this way, but it would involve more knowledge of hFILE internals on the part of anyone writing callbacks. |
The use case I am currently working on is using htslib with a C++ istream. This change lets htslib consume data from a set of customized callback functions.
For my scenario, I can wrap the C++ stream object in callback functions, so that htslib can work with the istream.
This also allows a program to detect the data format in a pipe by reading a few bytes from it and then deciding whether htslib should be used. Without it, since a UNIX pipe supports neither unget nor seek, by the time the program detects that the data should be handled by htslib, it is too late to hand off the FD.
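The pipe problem described above can be sketched in plain C. This is a rough illustration under assumptions, not the code in this PR: `replay_source`, `replay_fill`, `replay_read`, and the `forward_only` stand-in for a pipe are all hypothetical names. The idea is that the bytes consumed for detection are kept and served again by the read callback before any new data is pulled from the non-seekable source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical reader signature, same shape as a read callback. */
typedef ssize_t (*read_fn)(void *data, void *buf, size_t sz);

/* Stand-in for a pipe: data can only be consumed forwards. */
typedef struct forward_only {
    const char *data;
    size_t len;
    size_t pos;
} forward_only;

static ssize_t forward_read(void *data, void *buf, size_t sz)
{
    forward_only *src = data;
    size_t left = src->len - src->pos;
    size_t n = sz < left ? sz : left;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return (ssize_t) n;
}

/* Wraps a forward-only source and remembers the bytes used for sniffing. */
typedef struct replay_source {
    unsigned char peeked[64];
    size_t peeked_len;
    size_t peeked_pos;
    read_fn inner_read;
    void *inner_data;
} replay_source;

/* Consume up to want bytes (at most 64) from the inner source for detection. */
static ssize_t replay_fill(replay_source *rs, size_t want)
{
    assert(rs != NULL && rs->inner_read != NULL);
    if (want > sizeof rs->peeked) want = sizeof rs->peeked;
    rs->peeked_len = rs->peeked_pos = 0;
    while (rs->peeked_len < want) {
        ssize_t n = rs->inner_read(rs->inner_data, rs->peeked + rs->peeked_len,
                                   want - rs->peeked_len);
        if (n < 0) return -1;
        if (n == 0) break;
        rs->peeked_len += (size_t) n;
    }
    return (ssize_t) rs->peeked_len;
}

/* Read callback: replay the sniffed bytes first, then continue from the source. */
static ssize_t replay_read(void *data, void *buf, size_t sz)
{
    replay_source *rs = data;
    size_t left = rs->peeked_len - rs->peeked_pos;
    if (left > 0) {
        size_t n = sz < left ? sz : left;
        memcpy(buf, rs->peeked + rs->peeked_pos, n);
        rs->peeked_pos += n;
        return (ssize_t) n;
    }
    return rs->inner_read(rs->inner_data, buf, sz);
}
```

After `replay_fill`, the application can inspect `peeked` to decide on the format and then hand `replay_read` plus the `replay_source` pointer to whichever reader it picks, which sees the stream from its first byte.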