I'm getting an error when training checkpoints with style.py, and the output seems to point to "Not found: ./bin/ptxas not found" as the source of the error.
Do you have any idea what the issue is here?
Submission Script
#!/bin/bash
#$ -pwd
# Usage: bash fastTrainer.bash /images/ /outpath/ /testpath/
##
## An embarrassingly parallel script to train many style transfer networks on an HPC cluster.
## Access to the SLURM job scheduler and fast-style-transfer is required to run this program.
## The three mandatory paths must be specified in the indicated order.
IMG=$(readlink -f "${1%/}") # path_to_train_images
OUT_DIR=$(readlink -f "${2%/}") # path_to_checkpoints
TEST=$(readlink -f "${3%/}") # path_to_tests
mkdir -p ${OUT_DIR}/jobs
JID=0 # job ID for SLURM job name
for f in ${IMG}/*; do
JID=$((JID + 1))
cat > ${OUT_DIR}/jobs/style_${JID}.bash << EOT # write job information for each job
#!/bin/bash
#SBATCH --gres=gpu:1 # request GPU
#SBATCH --account=def-mtarailo
#SBATCH --cpus-per-task=10 # maximum CPU cores per GPU request
#SBATCH --time=12:00:00 # request 12 hours of walltime
#SBATCH --mem=10G # request 10G (or 1G per core)
#SBATCH --job-name="fst_${JID}"
#SBATCH --output=${OUT_DIR}/jobs/%N-%j.out # %N for node name, %j for jobID
#SBATCH --error=${OUT_DIR}/jobs/%N-%j.err # %N for node name, %j for jobID
### JOB SCRIPT BELOW ###
# Load Modules
source activate tf-gpu
module load cuda/10.1
mkdir ${OUT_DIR}/${JID}
#mkdir ${TEST}/${JID}
python style.py --style $f \
--checkpoint-dir ${OUT_DIR}/${JID} \
--test examples/content/chicago.jpg \
--test-dir ${OUT_DIR}/${JID} \
--content-weight 1.5e1 \
--checkpoint-iterations 1000 \
--batch-size 20
EOT
chmod 754 $(readlink -f "${OUT_DIR}")/jobs/style_${JID}.bash
sbatch $(readlink -f "${OUT_DIR}")/jobs/style_${JID}.bash
done
Error
Due to MODULEPATH changes, the following have been reloaded:
1) openmpi/3.1.2
2021-03-22 21:11:45.184962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-03-22 21:11:45.240458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:1d:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-03-22 21:11:45.283971: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-03-22 21:11:45.473571: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-03-22 21:11:45.634452: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-03-22 21:11:45.716070: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-03-22 21:11:45.890345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-03-22 21:11:45.927906: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-03-22 21:11:46.114498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-22 21:11:46.116365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2021-03-22 21:11:46.116859: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-03-22 21:11:46.164009: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2400000000 Hz
2021-03-22 21:11:46.165725: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564a7240ad60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-22 21:11:46.165751: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-22 21:11:46.168706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:1d:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-03-22 21:11:46.168744: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-03-22 21:11:46.168760: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-03-22 21:11:46.168775: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-03-22 21:11:46.168789: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-03-22 21:11:46.168803: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-03-22 21:11:46.168817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-03-22 21:11:46.168831: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-22 21:11:46.170436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2021-03-22 21:11:46.170479: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-03-22 21:11:46.305358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-22 21:11:46.305405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2021-03-22 21:11:46.305424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2021-03-22 21:11:46.308686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15059 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:1d:00.0, compute capability: 7.0)
2021-03-22 21:11:46.312111: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564a72cf2f30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-22 21:11:46.312137: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2021-03-22 21:11:49.670398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:1d:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-03-22 21:11:49.670498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-03-22 21:11:49.670522: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-03-22 21:11:49.670543: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-03-22 21:11:49.670561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-03-22 21:11:49.670577: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-03-22 21:11:49.670594: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-03-22 21:11:49.670613: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-22 21:11:49.672271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2021-03-22 21:11:49.672324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-22 21:11:49.672338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2021-03-22 21:11:49.672349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2021-03-22 21:11:49.674005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15059 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:1d:00.0, compute capability: 7.0)
WARNING:tensorflow:From /home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2021-03-22 21:12:02.532101: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-22 21:12:04.268532: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
2021-03-22 21:12:04.438014: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
Traceback (most recent call last):
File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/plugins/pillow.py", line 669, in pil_try_read
im.getdata()[0]
File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/PIL/Image.py", line 1271, in getdata
self.load()
File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/PIL/ImageFile.py", line 260, in load
"image file is truncated "
OSError: image file is truncated (20 bytes not processed)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "style.py", line 167, in <module>
main()
File "style.py", line 147, in main
for preds, losses, i, epoch in optimize(*args, **kwargs):
File "src/optimize.py", line 105, in optimize
X_batch[j] = get_img(img_p, (256,256,3)).astype(np.float32)
File "src/utils.py", line 18, in get_img
img = imageio.imread(src, pilmode='RGB') # misc.imresize(, (256, 256, 3))
File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/core/functions.py", line 265, in imread
reader = read(uri, format, "i", **kwargs)
File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/core/functions.py", line 186, in get_reader
return format.get_reader(request)
File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/core/format.py", line 170, in get_reader
return self.Reader(self, request)
File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/core/format.py", line 221, in __init__
self._open(**self.request.kwargs.copy())
File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/plugins/pillow.py", line 429, in _open
return PillowFormat.Reader._open(self, pilmode=pilmode, as_gray=as_gray)
File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/plugins/pillow.py", line 135, in _open
pil_try_read(self._im)
File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/plugins/pillow.py", line 680, in pil_try_read
raise ValueError(error_message)
ValueError: Could not load ""
Reason: "image file is truncated (20 bytes not processed)"
Please see documentation at: http://pillow.readthedocs.io/en/latest/installation.html#external-libraries
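Reading the log again: the ptxas line is logged at level W (a warning), and the next line says TensorFlow falls back to the driver for PTX compilation. The exception that actually stops the run is the one in the traceback, where get_img in src/utils.py fails with "image file is truncated (20 bytes not processed)". Below is a minimal sketch for locating such files before training. It assumes the training images are JPEG or PNG, which the script above does not show. It only checks for the end-of-file marker, so it is a heuristic rather than a full decode, and the names looks_truncated and find_truncated are hypothetical helpers, not part of fast-style-transfer.

```python
import os
import sys

JPEG_SOI = b"\xff\xd8"            # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"            # JPEG end-of-image marker
PNG_SIG = b"\x89PNG\r\n\x1a\n"    # PNG file signature
PNG_IEND = b"IEND\xaeB`\x82"      # last 8 bytes of a well-formed PNG


def looks_truncated(path):
    """Heuristic: True if a JPEG/PNG file does not end with its end marker."""
    with open(path, "rb") as fh:
        head = fh.read(8)
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(size - 8, 0))
        tail = fh.read()
    if head.startswith(JPEG_SOI):
        return not tail.endswith(JPEG_EOI)
    if head.startswith(PNG_SIG):
        return not tail.endswith(PNG_IEND)
    return False  # other formats are not checked by this sketch


def find_truncated(root):
    """Walk root and return the paths of files that look truncated."""
    bad = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if looks_truncated(path):
                bad.append(path)
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one suspicious file per line for the directory given as argument.
    for path in find_truncated(sys.argv[1]):
        print(path)
```

Run against the directory of training images, this prints candidates to re-download or remove. A JPEG with extra bytes after its end marker would be reported as well, so the output is a list to inspect, not a verdict.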