Can't use cyvcf2 against AWS S3 #174
You might need to build your own htslib instead of using the one that cyvcf2 builds for you; I'm not sure it has everything needed to support S3 access (curl + ?). |
I'm using conda to manage my packages. I use pysam, and its version 0.16.0.1 fails in a similar way, so I'm using 0.15.4. |
@ccwang002, do you know the best way to do this (assuming it's related to the compilation of htslib included with cyvcf2)? |
The bioconda version of cyvcf2 links to the conda-wide htslib, which implements file access over different protocols (HTTP, AWS S3, etc.). This conda-wide htslib is compiled with all the supported protocols (see bioconda's htslib recipe). Unfortunately, I think the current htslib build is a bit broken on macOS (I ran into issues like bioconda/bioconda-recipes#15415), and I am not sure whether that is what breaks the S3 function. Since the latest version of htslib is 1.11 and cyvcf2 is on 1.10, I will investigate a bit more and try to file an issue/PR upstream to fix the older version's build. Do you have a publicly available VCF on S3 that I can test? It will help me dissect the problem. I will put down some notes here on how to locally build cyvcf2 and htslib in the next comment. |
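For a quick check of whether the htslib that cyvcf2 links against supports remote access at all, something like this minimal sketch can help (an assumption: it needs network access, and it uses the repository's public test file together with the `raw_header` attribute shown later in this thread):

```python
# Minimal sketch: probe remote-protocol support in the htslib linked into cyvcf2.
# Assumes network access; the URL is this repository's public test file.
from cyvcf2 import VCF

try:
    v = VCF('https://github.com/brentp/cyvcf2/raw/master/cyvcf2/tests/test.vcf.gz')
    print(v.raw_header.splitlines()[0])  # '##fileformat=VCFv4.1' when libcurl support is present
except OSError as err:
    # cyvcf2 raises OSError when htslib cannot open the file (e.g. no libcurl support)
    print('remote access failed:', err)
```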
Meanwhile, you can try to build cyvcf2 and htslib locally, following the README's instructions to install from the GitHub source:
```bash
git clone --recursive https://github.com/brentp/cyvcf2
cd cyvcf2/htslib
autoheader
autoconf
# See htslib's INSTALL to add any other relevant flags
./configure --enable-libcurl --with-libdeflate --enable-plugins --enable-gcs --enable-s3
make
cd ..
pip install -r requirements.txt
CYTHONIZE=1 pip install -e .
```
This approach requires all the external dependencies to be installed manually before compilation, and it might fail if any dependency is not found. Happy to look at any error messages together. |
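If the build goes through, a quick sanity check is to import the extension and read the bundled test file (a sketch; the relative path assumes the current directory is the cloned repository root):

```python
# Sketch of a post-build sanity check, run from the repository root.
import cyvcf2

print(cyvcf2.__version__)
v = cyvcf2.VCF('cyvcf2/tests/test.vcf.gz')  # test file shipped with the repository
print(v.raw_header.splitlines()[0])         # should print '##fileformat=VCFv4.1'
```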
Regarding the conda-based installation, an easy way to test whether this issue is related to cyvcf2 is to install bcftools in the same environment, since both of them link to the same htslib:
```bash
$ conda create -n cyvcf2 python=3.8 cyvcf2 bcftools=1.10 htslib=1.10
$ conda activate cyvcf2
$ bcftools view -h https://github.com/brentp/cyvcf2/raw/master/cyvcf2/tests/test.vcf.gz | head -n 2
[E::idx_test_and_fetch] Format of index file 'https://github.com/brentp/cyvcf2/raw/master/cyvcf2/tests/test.vcf.gz.tbi' is not supported
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
```
```python
>>> from cyvcf2 import VCF
>>> v = VCF('https://github.com/brentp/cyvcf2/raw/master/cyvcf2/tests/test.vcf.gz')
[E::idx_test_and_fetch] Format of index file 'https://github.com/brentp/cyvcf2/raw/master/cyvcf2/tests/test.vcf.gz.tbi' is not supported
>>> v.raw_header.splitlines()[:2]
['##fileformat=VCFv4.1', '##FILTER=<ID=PASS,Description="All filters passed">']
```
|
Thank you, @ccwang002! |
I haven't built anything yet, just using conda so far, but I created an example using a public VCF file in S3:
```bash
aws s3 ls s3://3kricegenome/test/test.vcf.gz
```
Using pysam with S3
```python
import boto3
import botocore
import pysam
from botocore.client import Config
config = Config(signature_version=botocore.UNSIGNED)
s3 = boto3.client('s3', config=config)
vcf_index = s3.generate_presigned_url('get_object', Params={'Bucket': '3kricegenome', 'Key': 'test/test.vcf.gz.tbi'}, ExpiresIn=5000)
vcf_file = s3.generate_presigned_url('get_object', Params={'Bucket': '3kricegenome', 'Key': 'test/test.vcf.gz'}, ExpiresIn=5000)
variant_file = pysam.VariantFile(vcf_file, index_filename=vcf_index)
vsam = variant_file.fetch('9311_chr01', 1009, 1010)
v = next(vsam)
print(dict(v.info))
{'AN': 2, 'DP': 10, 'MQ': 29.079999923706055, 'MQ0': 0}
# Worked as expected
```
Using cyvcf2 with a local file (downloaded from the S3 bucket)
```python
from cyvcf2 import VCF
variant_file = VCF("test.vcf.gz")
variant_file.set_index(index_path = "test.vcf.gz.tbi")
for v in variant_file('9311_chr01:1010-1020'):
print(str(v))
vv = variant_file('9311_chr01:1010-1020')
v = next(vv)
print(dict(v.INFO))
...
{'AN': 2, 'DP': 10, 'MQ': 29.079999923706055, 'MQ0': 0}
# Worked as expected
```
Using cyvcf2 with S3
```python
import boto3
import botocore
from cyvcf2 import VCF
from botocore.client import Config
config = Config(signature_version=botocore.UNSIGNED)
s3 = boto3.client('s3', config=config)
vcf_index = s3.generate_presigned_url('get_object', Params={'Bucket': '3kricegenome', 'Key': 'test/test.vcf.gz.tbi'}, ExpiresIn=5000)
vcf_file = s3.generate_presigned_url('get_object', Params={'Bucket': '3kricegenome', 'Key': 'test/test.vcf.gz'}, ExpiresIn=5000)
vcfs3 = VCF(vcf_file)
vcfs3.set_index(vcf_index)
for v in vcfs3('9311_chr01:1010-1020'):
print(str(v))
# does nothing
v = vcfs3('9311_chr01:1010-1020')
next(v)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-7-460873294da3> in <module>
1 v = vcfs3('9311_chr01:1010-1020')
----> 2 next(v)
StopIteration:
```
Note that here the behaviour is different from when I tried with a private S3 bucket (see my first post). |
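An alternative to presigned URLs, if the htslib build has S3 support (`--enable-s3`), is to let it read the bucket directly, with any credentials and region taken from the environment. A minimal sketch (the region value is a placeholder, and this public bucket needs no credentials):

```python
# Sketch: open an s3:// URL directly instead of a presigned HTTPS URL.
# Assumes cyvcf2 was built against an htslib configured with --enable-s3.
import os

os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'  # placeholder: set to the bucket's region

from cyvcf2 import VCF

vcfs3 = VCF('s3://3kricegenome/test/test.vcf.gz')  # public test bucket used above
for variant in vcfs3('9311_chr01:1010-1020'):
    print(str(variant))
```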
I followed the instructions of @ccwang002 above, was able to run cyvcf2 on your final example, and saw the variants printed. There's probably some way to incorporate this into the build process, but the simplest is probably to include the necessary flags (`--enable-s3`) in the bioconda recipe. |
I did it in my Docker container (using a slim Debian Buster) and all worked as expected. I feel ashamed that I may have been lured all this time by this issue being essentially a macOS one. Yet, for development, I will need to get it working on my Mac.
I'm lost; whatever I've tried hasn't worked on my Mac so far.
```bash
# For compilers to find openssl you may need to set:
export LDFLAGS="-L/usr/local/opt/openssl/lib"
export CPPFLAGS="-I/usr/local/opt/openssl/include"
#For pkg-config to find openssl you may need to set:
export PKG_CONFIG_PATH="/usr/local/opt/openssl/lib/pkgconfig"
# Buggy Accelerate Backend on Mac https://github.com/numpy/numpy/issues/15947
OPENBLAS="$(brew --prefix openblas)" pip install -U --no-cache-dir --force-reinstall --ignore-installed --no-binary :all: cython numpy coloredlogs click
brew install libdeflate autoconf
git clone --recursive https://github.com/brentp/cyvcf2
cd cyvcf2/htslib
autoheader
autoconf
# See htslib's INSTALL to add any other relevant flags
./configure --enable-libcurl --with-libdeflate --enable-plugins --enable-gcs --enable-s3
make
#make test
cd ..
pip install -r requirements.txt
CYTHONIZE=1 pip install -e .
```
and I got this long error message:
|
I've tried to sort this out myself. I've created this patch:
```diff
diff --git a/hfile_gcs.c b/hfile_gcs.c
index e6f72ae..c757d84 100644
--- a/hfile_gcs.c
+++ b/hfile_gcs.c
@@ -116,7 +116,8 @@ static hFILE *gcs_vopen(const char *url, const char *mode_colon, va_list args0)
return fp;
}
-int PLUGIN_GLOBAL(hfile_plugin_init,_gcs)(struct hFILE_plugin *self)
+// int PLUGIN_GLOBAL(hfile_plugin_init,_gcs)(struct hFILE_plugin *self)
+int hfile_plugin_init_gcs(struct hFILE_plugin *self)
{
static const struct hFILE_scheme_handler handler =
{ gcs_open, hfile_always_remote, "Google Cloud Storage",
diff --git a/hfile_libcurl.c b/hfile_libcurl.c
index 235b4c1..bfacff5 100644
--- a/hfile_libcurl.c
+++ b/hfile_libcurl.c
@@ -1438,7 +1438,8 @@ static hFILE *vhopen_libcurl(const char *url, const char *modes, va_list args)
return fp;
}
-int PLUGIN_GLOBAL(hfile_plugin_init,_libcurl)(struct hFILE_plugin *self)
+// int PLUGIN_GLOBAL(hfile_plugin_init,_libcurl)(struct hFILE_plugin *self)
+int hfile_plugin_init_libcurl(struct hFILE_plugin *self)
{
static const struct hFILE_scheme_handler handler =
{ hopen_libcurl, hfile_always_remote, "libcurl",
diff --git a/hfile_s3.c b/hfile_s3.c
index 3f094d3..df10d27 100644
--- a/hfile_s3.c
+++ b/hfile_s3.c
@@ -1236,7 +1236,8 @@ static hFILE *s3_vopen(const char *url, const char *mode_colon, va_list args0)
return fp;
}
-int PLUGIN_GLOBAL(hfile_plugin_init,_s3)(struct hFILE_plugin *self)
+// int PLUGIN_GLOBAL(hfile_plugin_init,_s3)(struct hFILE_plugin *self)
+int hfile_plugin_init_s3(struct hFILE_plugin *self)
{
static const struct hFILE_scheme_handler handler =
{ s3_open, hfile_always_remote, "Amazon S3", 2000 + 50, s3_vopen
diff --git a/hfile_s3_write.c b/hfile_s3_write.c
index 9008622..d1ef808 100644
--- a/hfile_s3_write.c
+++ b/hfile_s3_write.c
@@ -832,8 +832,8 @@ static void s3_write_exit() {
}
-int PLUGIN_GLOBAL(hfile_plugin_init,_s3_write)(struct hFILE_plugin *self) {
-
+// int PLUGIN_GLOBAL(hfile_plugin_init,_s3_write)(struct hFILE_plugin *self) {
+int hfile_plugin_init_s3_write(struct hFILE_plugin *self) {
static const struct hFILE_scheme_handler handler =
{ hopen_s3_write, hfile_always_remote, "S3 Multipart Upload",
2000 + 50, vhopen_s3_write
```
so I could get around the compilation error. It did compile this time, but when testing:
```bash
conda create -n cyvcf2 python=3.9 -y
conda activate cyvcf2
brew install libdeflate autoconf
#For compilers to find openssl you may need to set:
export LDFLAGS="-L/usr/local/opt/openssl/lib"
export CPPFLAGS="-I/usr/local/opt/openssl/include"
#For pkg-config to find openssl you may need to set:
export PKG_CONFIG_PATH="/usr/local/opt/openssl/lib/pkgconfig"
# https://github.com/numpy/numpy/issues/15947
OPENBLAS="$(brew --prefix openblas)" pip install --no-binary :all: cython coloredlogs click
# in principle this was to install numpy, but it's failing, so numpy is installed via wheel
git clone --recursive https://github.com/brentp/cyvcf2
cd cyvcf2/htslib
git apply ~/Programmes/patch2.diff
autoheader
autoconf
# See htslib's INSTALL to add any other relevant flags
./configure --enable-libcurl --with-libdeflate --enable-plugins --enable-gcs --enable-s3
make test
```
All fine up to here.
```bash
cd ..
pip install -r requirements.txt
# binary numpy 1.19.4 is installed
CYTHONIZE=1 pip install -e .
Obtaining file:///Users/alan/Programmes/cyvcf2
Requirement already satisfied: numpy in /usr/local/Caskroom/miniconda/base/envs/cyvcf2/lib/python3.9/site-packages (from cyvcf2==0.20.9) (1.19.4)
Requirement already satisfied: coloredlogs in /usr/local/Caskroom/miniconda/base/envs/cyvcf2/lib/python3.9/site-packages (from cyvcf2==0.20.9) (14.0)
Requirement already satisfied: click in /usr/local/Caskroom/miniconda/base/envs/cyvcf2/lib/python3.9/site-packages (from cyvcf2==0.20.9) (7.1.2)
Requirement already satisfied: humanfriendly>=7.1 in /usr/local/Caskroom/miniconda/base/envs/cyvcf2/lib/python3.9/site-packages (from coloredlogs->cyvcf2==0.20.9) (8.2)
Installing collected packages: cyvcf2
Running setup.py develop for cyvcf2
Successfully installed cyvcf2
```
No error messages! Let's test:
```bash
python setup.py test
running test
WARNING: Testing via this command is deprecated and will be removed in a future version. Users looking for a generic test entry point independent of test runner are encouraged to use tox.
running egg_info
writing cyvcf2.egg-info/PKG-INFO
writing dependency_links to cyvcf2.egg-info/dependency_links.txt
writing entry points to cyvcf2.egg-info/entry_points.txt
writing requirements to cyvcf2.egg-info/requires.txt
writing top-level names to cyvcf2.egg-info/top_level.txt
reading manifest file 'cyvcf2.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'cyvcf2.egg-info/SOURCES.txt'
running build_ext
copying build/lib.macosx-10.9-x86_64-3.9/cyvcf2/cyvcf2.cpython-39-darwin.so -> cyvcf2
/Users/alan/Programmes/cyvcf2/.eggs/nose-1.3.7-py3.9.egg/nose/config.py:264: RuntimeWarning: Option 'with-coverage' in config file 'setup.cfg' ignored: excluded by runtime environment
warn(msg, RuntimeWarning)
Failure: ImportError (dlopen(/Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so, 2): Symbol not found: _close_plugin
Referenced from: /Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so
Expected in: flat namespace
in /Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so) ... ERROR
======================================================================
ERROR: Failure: ImportError (dlopen(/Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so, 2): Symbol not found: _close_plugin
Referenced from: /Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so
Expected in: flat namespace
in /Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/alan/Programmes/cyvcf2/.eggs/nose-1.3.7-py3.9.egg/nose/failure.py", line 39, in runTest
raise self.exc_val.with_traceback(self.tb)
File "/Users/alan/Programmes/cyvcf2/.eggs/nose-1.3.7-py3.9.egg/nose/loader.py", line 417, in loadTestsFromName
module = self.importer.importFromPath(
File "/Users/alan/Programmes/cyvcf2/.eggs/nose-1.3.7-py3.9.egg/nose/importer.py", line 47, in importFromPath
return self.importFromDir(dir_path, fqname)
File "/Users/alan/Programmes/cyvcf2/.eggs/nose-1.3.7-py3.9.egg/nose/importer.py", line 94, in importFromDir
mod = load_module(part_fqname, fh, filename, desc)
File "/usr/local/Caskroom/miniconda/base/envs/cyvcf2/lib/python3.9/imp.py", line 244, in load_module
return load_package(name, filename)
File "/usr/local/Caskroom/miniconda/base/envs/cyvcf2/lib/python3.9/imp.py", line 216, in load_package
return _load(spec)
File "<frozen importlib._bootstrap>", line 711, in _load
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 790, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/Users/alan/Programmes/cyvcf2/cyvcf2/__init__.py", line 1, in <module>
from .cyvcf2 import (VCF, Variant, Writer, r_ as r_unphased, par_relatedness,
ImportError: dlopen(/Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so, 2): Symbol not found: _close_plugin
Referenced from: /Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so
Expected in: flat namespace
in /Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so
----------------------------------------------------------------------
Ran 1 test in 0.583s
FAILED (errors=1)
Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>
error: Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>
Traceback (most recent call last):
File "/usr/local/Caskroom/miniconda/base/envs/cyvcf2/bin/cyvcf2", line 33, in <module>
sys.exit(load_entry_point('cyvcf2', 'console_scripts', 'cyvcf2')())
File "/usr/local/Caskroom/miniconda/base/envs/cyvcf2/bin/cyvcf2", line 25, in importlib_load_entry_point
return next(matches).load()
File "/usr/local/Caskroom/miniconda/base/envs/cyvcf2/lib/python3.9/importlib/metadata.py", line 77, in load
module = import_module(match.group('module'))
File "/usr/local/Caskroom/miniconda/base/envs/cyvcf2/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 790, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/Users/alan/Programmes/cyvcf2/cyvcf2/__init__.py", line 1, in <module>
from .cyvcf2 import (VCF, Variant, Writer, r_ as r_unphased, par_relatedness,
ImportError: dlopen(/Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so, 2): Symbol not found: _close_plugin
Referenced from: /Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so
Expected in: flat namespace
in /Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so
```
Well, I wouldn't be surprised if this is because of my patch above but, alas, I'm desperate to get cyvcf2 working and I've tried many things. Could anyone, especially a Mac user/developer, tell me which environment you are using to succeed? |
Hi @alanwilter, I created a private S3 bucket and put your test file there. I was able to access the private S3 file, so I think this is an issue regarding the S3 authentication, and less likely related to how cyvcf2 is built. In fact, the current cyvcf2 from bioconda works on macOS. I have two identical files on S3:
The public VCF can be read by htslib (and is thus also readable by cyvcf2):
```bash
$ htsfile -vv 's3://3kricegenome/test/test.vcf.gz'
[D::init_add_plugin] Loaded "knetfile"
[D::init_add_plugin] Loaded "mem"
[D::init_add_plugin] Loaded "/Users/liang/miniconda3/envs/cyvcf2/libexec/htslib/hfile_s3.bundle"
[D::init_add_plugin] Loaded "/Users/liang/miniconda3/envs/cyvcf2/libexec/htslib/hfile_s3_write.bundle"
[D::init_add_plugin] Loaded "/Users/liang/miniconda3/envs/cyvcf2/libexec/htslib/hfile_libcurl.bundle"
[D::init_add_plugin] Loaded "/Users/liang/miniconda3/envs/cyvcf2/libexec/htslib/hfile_gcs.bundle"
s3://3kricegenome/test/test.vcf.gz: VCF version 4.1 BGZF-compressed variant calling data
```
```python
>>> from cyvcf2 import VCF
>>> v = VCF('s3://3kricegenome/test/test.vcf.gz')
[E::idx_test_and_fetch] Format of index file 's3://3kricegenome/test/test.vcf.gz.tbi' is not supported
>>> next(v)
Variant(9311_chr01:1001 C/)
```
And the private VCF works too (with cyvcf2 as well):
```bash
$ export AWS_DEFAULT_REGION=us-east-2 # Set the region of the S3 file for htslib
$ htsfile -vv 's3://cyvcf2-s3-test/test.vcf.gz'
[D::init_add_plugin] Loaded "knetfile"
[D::init_add_plugin] Loaded "mem"
[D::init_add_plugin] Loaded "/Users/liang/miniconda3/envs/cyvcf2/libexec/htslib/hfile_s3.bundle"
[D::init_add_plugin] Loaded "/Users/liang/miniconda3/envs/cyvcf2/libexec/htslib/hfile_s3_write.bundle"
[D::init_add_plugin] Loaded "/Users/liang/miniconda3/envs/cyvcf2/libexec/htslib/hfile_libcurl.bundle"
[D::init_add_plugin] Loaded "/Users/liang/miniconda3/envs/cyvcf2/libexec/htslib/hfile_gcs.bundle"
s3://cyvcf2-s3-test/test.vcf.gz: VCF version 4.1 BGZF-compressed variant calling data
```
As shown by the full output, htslib is able to detect my AWS credentials from ~/.aws/credentials. To sum up, this looks like an S3 authentication/configuration issue rather than a problem with how cyvcf2 is built.
Hope it helps. |
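For reference, the shared-credentials file that htslib falls back to is the standard AWS one; a placeholder example of its layout:

```ini
# ~/.aws/credentials -- placeholder values only
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```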
I spent my whole morning testing this, and I found a related bug in htslib 1.11 (either from bioconda or compiled from source), where none of the htslib plugins work on macOS. htslib 1.10 works fine. Will file a bug upstream. |
Thanks @ccwang002. I've been spending days on it. So far I've got this awkward behaviour where the format is not identified; I can't explain this:
```bash
$ htsfile -vv /Users/alan/Downloads/annotations/Fujinami2020.vcf.gz
/Users/alan/Downloads/annotations/Fujinami2020.vcf.gz: VCF version 4.2 BGZF-compressed variant calling data
$ aws s3 cp /Users/alan/Downloads/annotations/Fujinami2020.vcf.gz s3://vcf-test/atest/
upload: Downloads/annotations/Fujinami2020.vcf.gz to s3://vcf-test/atest/Fujinami2020.vcf.gz
$ htsfile -vvvv s3://vcf-test/atest/Fujinami2020.vcf.gz
[D::init_add_plugin] Loaded "knetfile"
[D::init_add_plugin] Loaded "mem"
[D::init_add_plugin] Loaded "crypt4gh-needed"
[D::init_add_plugin] Loaded "libcurl"
[D::init_add_plugin] Loaded "gcs"
[D::init_add_plugin] Loaded "s3"
[D::init_add_plugin] Loaded "s3w"
s3://vcf-test/atest/Fujinami2020.vcf.gz: unknown text
$ htsfile -vv s3://3kricegenome/test/test.vcf.gz
...
[D::init_add_plugin] Loaded "s3w"
s3://3kricegenome/test/test.vcf.gz: VCF version 4.1 BGZF-compressed variant calling data
```
and hence, tabix does not work. I frankly don't know why that file in S3 is not recognised as a VCF, while the public one and my other private one are. But later I will test using your steps. |
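One way to rule out the object itself (given that the same file is valid BGZF locally but shows up as `unknown text` via S3) is to peek at its first bytes, since BGZF data starts with the gzip magic `1f 8b`. A diagnostic sketch, assuming boto3 plus credentials for the bucket, reusing the bucket/key names from the transcript above:

```python
# Sketch: check whether the uploaded object still starts with the gzip/BGZF magic bytes.
import boto3

s3 = boto3.client('s3', region_name='eu-west-2')  # assumed region for this bucket
obj = s3.get_object(Bucket='vcf-test', Key='atest/Fujinami2020.vcf.gz',
                    Range='bytes=0-3')            # fetch only the first four bytes
head = obj['Body'].read()
print(head.hex(), head[:2] == b'\x1f\x8b')        # True if the object looks BGZF-compressed
```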
I'm using |
Many thanks @ccwang002, I got a workable solution; and no, I tried with that too. But when I run this:
```bash
htsfile -vv 's3://3kricegenome/test/test.vcf.gz'
[D::init_add_plugin] Loaded "knetfile"
[D::init_add_plugin] Loaded "mem"
[D::init_add_plugin] Loaded "/usr/local/Caskroom/miniconda/base/envs/cyvcf2/libexec/htslib/hfile_s3.bundle"
[D::init_add_plugin] Loaded "/usr/local/Caskroom/miniconda/base/envs/cyvcf2/libexec/htslib/hfile_s3_write.bundle"
[D::init_add_plugin] Loaded "/usr/local/Caskroom/miniconda/base/envs/cyvcf2/libexec/htslib/hfile_libcurl.bundle"
[D::init_add_plugin] Loaded "/usr/local/Caskroom/miniconda/base/envs/cyvcf2/libexec/htslib/hfile_gcs.bundle"
htsfile: can't open "s3://3kricegenome/test/test.vcf.gz": Input/output error
```
I'm closing this issue, but the only solution that seemed to work for me was:
```bash
brew install python
python3 -m venv myenv2
source myenv2/bin/activate
pip install --upgrade pip
pip install -U boto3 wheel cython ipython
export HTSLIB_CONFIGURE_OPTIONS=--enable-plugins
export HTSLIB_LIBRARY_DIR=/usr/local/lib
export HTSLIB_INCLUDE_DIR=/usr/local/include
pip install pysam cyvcf2
# pysam 0.16.0.1
# cyvcf2 0.20.9
```
then either use `~/.aws/credentials` or:
```bash
export AWS_ACCESS_KEY_ID=....
export AWS_SECRET_ACCESS_KEY=....
```
and my tests:
```python
from pysam import VariantFile
from cyvcf2 import VCF
# Public S3 VCF ex1
vcfs3 = VCF('s3://3kricegenome/test/test.vcf.gz')
vv = vcfs3('9311_chr01:1011-1011')
v1 = next(vv)
print(dict(v1.INFO))
{'AN': 2, 'DP': 12, 'MQ': 26.549999237060547, 'MQ0': 2}
# OK: Mac, Docker
# Private S3 VCF ex1
vcfs3 = VCF('s3://vcf-test/atest/test.vcf.gz')
vv = vcfs3('9311_chr01:1011-1011')
v1 = next(vv)
print(dict(v1.INFO))
{'AN': 2, 'DP': 12, 'MQ': 26.549999237060547, 'MQ0': 2}
# OK: Mac, Docker
```
However, for pysam:
```python
import os
import boto3
import botocore
import pysam
s3 = boto3.client('s3', aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
region_name="eu-west-2",
config=boto3.session.Config(signature_version='s3v4'))
vcf_index = s3.generate_presigned_url('get_object', Params={'Bucket': 'vcf-test', 'Key': 'atest/test.vcf.gz.tbi'}, ExpiresIn=5000)
vcf_file = s3.generate_presigned_url('get_object', Params={'Bucket': 'vcf-test', 'Key': 'atest/test.vcf.gz'}, ExpiresIn=5000)
# Private S3+Boto PYSAM ex1
variant_file = pysam.VariantFile(vcf_file, index_filename=vcf_index)
vsam = variant_file.fetch('9311_chr01', 1010, 1011)
v = next(vsam)
print(dict(v.info))
# OK: Mac, Docker
{'AN': 2, 'DP': 12, 'MQ': 26.549999237060547, 'MQ0': 2}
```
The bottom line for me is that, unfortunately, there's something broken there. Anyway, now I hope to get to the point where I can compare the two. Thanks again, and sorry for taking so much of your time. |
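Since the end goal is to compare pysam and cyvcf2 on the same data, a small helper can make that comparison explicit. A sketch using the public test file and region from this thread (it assumes both libraries link against an htslib with S3 support):

```python
# Sketch: fetch the same record with both libraries and compare the INFO fields.
import pysam
from cyvcf2 import VCF

URL = 's3://3kricegenome/test/test.vcf.gz'  # public test file used throughout this thread

def cyvcf2_info(url, region):
    variant = next(VCF(url)(region))
    return dict(variant.INFO)

def pysam_info(url, chrom, start, end):
    variant = next(pysam.VariantFile(url).fetch(chrom, start, end))
    return dict(variant.info)

print(cyvcf2_info(URL, '9311_chr01:1011-1011'))
print(pysam_info(URL, '9311_chr01', 1010, 1011))  # same record, 0-based half-open coordinates
```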
Sorry guys, now I'm trying to get this working elsewhere. This is my test case:
```python
from cyvcf2 import VCF
# Public S3 VCF ex1
vcfs3 = VCF('s3://3kricegenome/test/test.vcf.gz')
vv = vcfs3('9311_chr01:1011-1011')
v1 = next(vv)
print(dict(v1.INFO))
```
That works fine on my macOS Big Sur and prints `{'AN': 2, 'DP': 12, 'MQ': 26.549999237060547, 'MQ0': 2}`.

1. Using my existing Python 3.6 venv:
```python
In [3]: vcfs3 = VCF('s3://3kricegenome/test/test.vcf.gz')
[E::hts_open_format] Failed to open file "s3://3kricegenome/test/test.vcf.gz" : Input/output error
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-3-4ffad091729e> in <module>
----> 1 vcfs3 = VCF('s3://3kricegenome/test/test.vcf.gz')
~/phenopolis_browser/venv/lib/python3.6/site-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.VCF.__init__()
~/phenopolis_browser/venv/lib/python3.6/site-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.HTSFile._open_htsfile()
OSError: Error opening s3://3kricegenome/test/test.vcf.gz
```
2. Using a fresh (virtual) Ubuntu 20.04 with
3. Trying with
4. Again, on a fresh Ubuntu, trying to install from source (version 0.30.6):
```bash
sudo apt-get install libbz2-dev liblzma-dev libdeflate-dev libcurl4-openssl-dev
git clone --recursive https://github.com/brentp/cyvcf2
cd cyvcf2/htslib
autoheader
autoconf
./configure --enable-libcurl --with-libdeflate --enable-plugins --enable-gcs --enable-s3
make
make test
...
#Number of tests:
# total .. 153
# passed .. 153
# failed .. 0
cd ..
pip install -r requirements.txt
# all fine so far
CYTHONIZE=1 pip install -e .
```
failed!
|
I don't think cyvcf2 has changed a lot between v0.11.6 and v0.30.4 except that the former version uses an older htslib (1.9 vs 1.10). If you cannot use conda and you don't mind an older htslib, v0.11.6 is probably the best alternative option at the moment.
|
I would be more than happy to have v0.11.6 from a deb package, but I need it for Ubuntu 18.04 and it's not available there AFAIK; python3-cyvcf2 is only in Ubuntu 20.04. Re approach (4), I tried this (keep in mind that for Ubuntu 18 an eventual solution will be different, since I don't have the same packages there):
```bash
mkdir tmp_test
cd tmp_test
cat << EOF >| Dockerfile_ubuntu
FROM ubuntu:20.04
# set work directory
WORKDIR /app
# set environment variables, to avoid pyc files and flushing buffer
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1
RUN apt-get update \
&& apt-get install -y python3-pip libbz2-dev liblzma-dev libdeflate-dev \
libssl-dev libcurl4-openssl-dev git autoconf \
&& pip3 --no-cache-dir install --upgrade pip
RUN git clone --recursive https://github.com/brentp/cyvcf2 \
&& cd cyvcf2/htslib \
&& autoheader \
&& autoconf \
&& ./configure --enable-libcurl --with-libdeflate --enable-plugins --enable-gcs --enable-s3 \
&& make \
&& cd .. \
&& pip install -r requirements.txt
RUN cd /app/cyvcf2 \
&& CYTHONIZE=1 python3 setup.py install
EOF
docker build -f Dockerfile_ubuntu -t ubuntu20_test .
```
Got the same error as seen in approach (4) above. I have re-read this whole ticket, and I noticed that I never managed to build it. |
I got it working by changing the configuration of htslib (changing only one line of your Dockerfile):
The culprit here is htslib's |
Thanks @ccwang002! That nailed the problem; I at least got one solution finally working. I'll close this, and I hope you guys can address the pip issues. |
Here I am again, now trying to install cyvcf2 version 0.30.8 from GitHub on macOS Big Sur 11.3.1, following the instructions for Mac with brew, but using:
```bash
curl -O http://ftp.gnu.org/gnu/autoconf/autoconf-2.69.tar.gz
tar zxvf autoconf-2.69.tar.gz
cd autoconf-2.69
./configure && make && sudo make install
```
I got it all installed (htslib configured with `./configure --enable-libcurl --with-libdeflate --enable-lzma --enable-bz2 --enable-gcs --enable-s3`), but when I try to run:
```bash
python3 -c "import cyvcf2; print(cyvcf2.__version__)"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/alan/Programmes/cyvcf2/cyvcf2/__init__.py", line 1, in <module>
from .cyvcf2 import (VCF, Variant, Writer, r_ as r_unphased, par_relatedness,
ImportError: dlopen(/Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so, 2): Symbol not found: _libdeflate_alloc_compressor
Referenced from: /Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so
Expected in: flat namespace
in /Users/alan/Programmes/cyvcf2/cyvcf2/cyvcf2.cpython-39-darwin.so
```
Any idea? |
Long post! Mac details:
I've tried what I could figure out, with no success. Here are the details: 1. Trying the basic standard
|
Did you try configuring without `--with-libdeflate`? libdeflate is optional, and that's what was giving the first error. |
Thanks a lot @brentp, it did work. |
oh excellent! |
Sorry guys, but something broke between cyvcf2 0.30.12 and 0.30.14. Try this to reproduce the error:
```bash
docker run --rm -i -t amazonlinux:latest bash
# inside docker
yum update -y
yum install -y --setopt install_weak_deps=false python3
pip3 install --upgrade pip
pip3 install cyvcf2==0.30.12
export AWS_SECRET_ACCESS_KEY=_use_yours_
export AWS_ACCESS_KEY_ID=_use_yours_
python3 -c "import cyvcf2; print(cyvcf2.__version__)"
# 0.30.12
# test_case
python3 <<EOF
from cyvcf2 import VCF
# Public S3 VCF
vcfs3 = VCF('s3://3kricegenome/test/test.vcf.gz')
vv = vcfs3('9311_chr01:1011-1011')
v1 = next(vv)
print(dict(v1.INFO))
EOF
# {'AN': 2, 'DP': 12, 'MQ': 26.549999237060547, 'MQ0': 2}
# Upgrade to latest cyvcf2:
pip3 install cyvcf2==0.30.14
python3 -c "import cyvcf2; print(cyvcf2.__version__)"
# 0.30.14
# run test_case again and it will fail
Traceback (most recent call last):
File "<stdin>", line 5, in <module>
StopIteration
```
|
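When a region query silently raises StopIteration like this, it helps to separate "the file failed to open" from "the indexed query returned no records". A small diagnostic sketch, using only calls shown earlier in this thread:

```python
# Sketch: distinguish an open failure from an empty indexed-region query.
from cyvcf2 import VCF

vcfs3 = VCF('s3://3kricegenome/test/test.vcf.gz')  # raises OSError if the open itself fails
print(vcfs3.raw_header.splitlines()[0])            # header readable => the file opened fine

records = list(vcfs3('9311_chr01:1011-1011'))      # empty => the index/region query is what broke
print(len(records))
```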
We started with issue #154 and thought that it would have addressed our issues but, alas, setting an index in AWS S3 does not solve the problem.