Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hts_open_cb() interface which allows callback function used as data source #647

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

38
Copy link

@38 38 commented Jan 10, 2018

The use case I am currently working on is use htslib with C++ istream. And this change makes htslib able to consume data from a set of consumized callback functions.
For my senario, I can just wrap the C++ stream object as an callback function, thus the htslib can work with the istream.

struct stream_data_t {
	std::istream& stream;
	stream_data_t(std::istream& is) : stream(is){}
	
	static ssize_t read(void* data, void* resbuf, size_t sz)
	{
		stream_data_t *stream_data = static_cast<stream_data_t*>(data);
		stream_data->stream.read((char*)resbuf, sz);
		if(!stream_data->stream)
			sz = (ssize_t)stream_data->stream.gcount();
		return (ssize_t)sz;
	}
	
	static int close(void* data)
	{
		delete (stream_data_t*)data;
		return 0;
	}
};
stream_data_t* cb_data = new stream_data_t(*is);
hFILE_ops ops;
memset(&ops, 0, sizeof(ops));
ops.read = stream_data_t::read;
ops.close = stream_data_t::close;
ops.cb_data = cb_data;
samFile* fp = hts_open_cb(&ops, "rb");

This is also allows program detect the data format in a pipe by reading a few bytes from the pipe and then detect if htslib should be used. Otherwise, since UNIX pipe cannot do either unget or seek, thus once the program detect the data should be handled by htslib, it's too late to handle off the FD.

38 added 2 commits January 9, 2018 08:38
The new interface will take a callback function as data source.
@arq5x
Copy link
Contributor

arq5x commented Jan 13, 2018

For reference, this change is to help allow htslib integration into bedtools. If the htslib maintainers have any suggestions, please let us know.

@daviesrob
Copy link
Member

This looks like a good idea in principle, but we might want to tweak the implementation a bit.
It looks like you're implementing something like the glibc extension fopencookie, which has the signature:

FILE *fopencookie(void *cookie, const char *mode, cookie_io_functions_t io_funcs);

In fopencookie, the data (cookie) passed to the callback functions is supplied separately to the callback function pointers (in io_funcs), while in this PR it's a member of hFILE_ops. Keeping it separate would allow the function pointers to be declared statically (as for hFILE_backend) which would simplify use of the hts_open_cb() function (no need to create and populate a struct on each call).

Some of the names here are quite terse. It would be better to change hFILE_ops to something like hFILE_callback_ops, hcbopen() to hopen_callback() and hts_open_cb to hts_open_callback so it's easier to see what the functions do.

Being able to optionally pass a file name into hts_open_cb would allow it to populate the htsFile fn field. Currently it looks like it gets set to '-', which would be misleading to anything that tried to use it.

If we want to match what fopencookie() does, cb_write() should discard the input data if the callback function is NULL (i.e. it should just return nbytes). Currently it returns 0, which I think it might cause an endless loop...

Finally, it would be a good idea to add some tests to test/hfile.c to ensure the new interfaces get exercised regularly.

- Add test cases
- Make unsupported read and write call fails instead of returning 0
@jkbonfield
Copy link
Contributor

I like the principle behind this, to permit use of htslib with arbitrary source and sinks, as it brings a lot more power to the library. It does indeed also have some similarity to fopencookie as @daviesrob points out.

I'm wondering though whether this is a duplicate of the functionality already implemented in the hfile_add_scheme_handler function and the struct hFILE_backend. Eg see https://github.com/samtools/htslib/blob/develop/hfile_internal.h#L59. Some of this is non-public, hidden away in hFILE_internal.h. However if there is demand I could see benefit from permitting calling code to register their own backends. This is sort of how the plugins work already, but they still have additional access to non-public functions, so maybe it should be opened further.

@jmarshall this is your area of expertise as I think you wrote most of this. Any comments or preferences before we proceed?

@lh3
Copy link
Member

lh3 commented Jan 19, 2018

since UNIX pipe cannot do either unget or seek, thus once the program detect the data should be handled by htslib, it's too late to handle off the FD.

Do you have to use C++ istream? Why not use the hFILE APIs instead? I haven't used hFile APIs myself. I am imaging something like this:

hFILE *hp = hopen(file_name, "r");
hpeek(hp, buffer, 64);
if (is_hts_file(buffer, 64)) { // this is a file recognized by htslib
  htsFile *fp = hts_hopen(hp, file_name, "r");
  // then use htslib
} else { // this is NOT a file htslib can read
  // call hread() etc to access the file in your own way
}

Once you use hFILE, your program naturally gains the ability to read URLs.

@daviesrob
Copy link
Member

@jkbonfield wrote:

I'm wondering though whether this is a duplicate of the functionality already implemented in the hfile_add_scheme_handler function and the struct hFILE_backend.

It could be implemented this way, with the caller making an hFILE where backend points to the caller's function. When the backend functions are called, they are passed an hFILE * as the first parameter, so they need to cast this and then extract the pointer to the enclosed istream (or whatever else is being used). This is what the cb_read(), cb_write(), etc. functions in the pull request are doing. So I don't think implementing this via the backend pointer would make much difference to execution times, and it would make the callback functions more complicated to write.

@lh2 wrote:

Do you have to use C++ istream? Why not use the hFILE APIs instead?

As I understand it, this is to allow bedtools to hand over files that HTSlib understands without having to make major changes to bedtools itself. I guess it should be possible to make an istream that wraps an hFILE, but I don't know how hard it would be, or how difficult to integrate it with bedtools.

@38
Copy link
Author

38 commented Jan 21, 2018

Do you have to use C++ istream? Why not use the hFILE APIs instead?

@lh3 The requirement is a little bit complicated. Since the bedtools actually detects file format like a gziped bam or even a bam that has been compressed multiple times (although this is extermely rare).
And the C++ stream in here isn't a simple file stream, it could be a gziped stream, etc.

As @daviesrob mentioned, the solution seems the only way to avoid major code change in bedtools. Plus I do think the API in this flavor makes a lot of other usage of htslib easier.

For @jkbonfield 's question, as my understand, what this change is trying to do is exposing the internal logic without changing existing APIs. The callback logic is already there, and the point of this change is it provides the application a way to use it.

@lh3
Copy link
Member

lh3 commented Jan 21, 2018

the bedtools actually detects file format like a gziped bam or even a bam that has been compressed multiple times

Ok, I agree wrapping hFILE won't achieve these functionalities easily.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 22, 2018

Thanks for the explanation. I confess I haven't studied the internals in detail to know if this is the best route, but was making sure you were aware of the alternatives and also pinging John who does know this inside out. I'm happy given others have looked at it and deemed it appropriate.

Edit: no longer relevant. I'd agree with @daviesrob though that the function names could be a little more descriptive

@jmarshall
Copy link
Member

As James noted, HTSlib already has a mechanism for putting arbitrary streams underneath hFILE: hFILE_backend. When I designed that mechanism, I weighed up callback vs. object-oriented approaches, and considered that the hFILE_backend approach would lead to clearer code with less casting in C.

So it would be more in the existing HTSlib style to write a hFILE backend for std::istream (or rather the underlying std::streambuf). It would be appropriate to make the functions James noted public in htslib/hfile.h, or cf https://github.com/samtools/htslib-plugins which has its own copy of the relevant header. Or in this case it might be worthwhile to add a C++ iostream backend to HTSlib itself (but the maintainers might not be keen to start compiling one HTSlib source file as C++!).

Conversely, it would be trivial to write a std::streambuf subclass for hFILE (or perhaps better hFILE_backend, but there are buffer draining considerations), though this would lead to double-buffering — unless you used similar ->backend->read trickery to HTSlib's multipart.c, but this is fragile.

Bedtools's I/O is mostly localised in src/utils/bedFile/bedFile.cpp et al, no? HTSlib also provides gzipping and bgzipping facilities, and to the extent that bedtools's I/O is localised in those utility classes, I think it would be worth investigating converting bedtools from iostreams to native hFILE. This would allow for improvements in I/O error handling and filetype detection (cf hts_detect_format()), areas in which bedtools has historically had bugs.

@daviesrob
Copy link
Member

Thanks for the updates. I'd still like cb_data to be removed from struct hFILE_callback_ops, though. It would make using the interface easier, as you could make a single const static hFILE_callback_ops and then use it on each call to hopen_callback. So the interfaces would be something like this:

typedef struct  hFILE_callback_ops {
    ssize_t (* const read)(void *callback_data, void *buf, size_t sz) HTS_RESULT_USED;
    ssize_t (* const write)(void *callback_data, const void *buf, size_t sz) HTS_RESULT_USED;
    off_t (* const seek)(void *callback_data, off_t ofs, int whence) HTS_RESULT_USED;
    int (* const flush)(void *callback_data) HTS_RESULT_USED;
    int (* const close)(void *callback_data);
} hFILE_callback_ops;

hFILE *hopen_callback(void *callback_data, const char *mode, const hFILE_callback_ops *ops);

htsFile *hts_open_callback(void *callback_data,  const char *fn,  const char *mode, const struct hFILE_callback_ops *ops);

And usage would be something like this:

const static hFILE_callback_ops my_callback = {
    callback_read, callback_write, callback_lseek, callback_flush, callback_close
};

// ...
int *fd_ptr = malloc(sizeof(int));
if (!fd_ptr) return EXIT_FAILURE;
*fd_ptr = open(name, O_RDONLY);
if (*fd_ptr < 0) return EXIT_FAILURE;
hFILE *f = hopen_callback(fd_ptr, "r", &my_callback);
// ...

Also, I've noticed a couple of coding style points. Please can you replace any tabs with four spaces? Tabs just cause too much trouble with different editors, so we've eliminated them. And please could you align the * for pointer declarations with the variable name and not the type (i.e. void *callback_data instead of void* callback_data)?

@jmarshall
Copy link
Member

If the maintainers are considering this PR as is (as noted, the original intention with this code was for the hFILE_backend mechanism to become public in due course, and by now it is truly bedded in), then other style infelicities are the missing spaces in if (foo) and (hFILE_cb *) fpv (casts).

@daviesrob
Copy link
Member

@jmarshall Which bits of API do you think we would need to make public to get this to work via hFILE_backend? I guess it would need at least struct hFILE_backend, hfile_init() and hfile_destroy()? Or were you thinking of something more like the hopen_callback() function here that would call hfile_init() itself, populate backend with the given function pointers and return an hFILE that the caller can then cast into their own type?

Having looked a bit more, it appears using the backend pointers directly would in most cases avoid one function call on the way to the callback. So there is a small benefit to doing it this way, but it would involve more knowledge of hFILE internals on the part of anyone writing callbacks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants