This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix large array tests #16328

Merged
merged 13 commits into from
Oct 14, 2019

Conversation

ChaiBapchya
Contributor

@ChaiBapchya ChaiBapchya commented Sep 29, 2019

Description

  • fix activation, copy_to, random_multinomial
  • removed 2 redundant functions (imod, expand_dims)
  • Lint fixes
  • Random_* functions used the shape
MEDIUM_X, SMALL_Y, SMALL_X, SMALL_Y -> (10000, 50, 100, 50), which has < 2**32 elements

Hence changed it to

MEDIUM_X, SMALL_X, SMALL_X, SMALL_Y -> (10000, 100, 100, 50), which has > 2**32 elements
  • Pooling, on the other hand, used
MEDIUM_X, MEDIUM_X, SMALL_Y, SMALL_Y -> (10000, 10000, 50, 50), which has >> 2**32 elements

A shape this large needs more than the available CPU memory and gives the error

mxnet.base.MXNetError: [20:08:27] ../src/storage/./cpu_device_storage.h:75: Failed to allocate CPU Memory

Hence reduced it to just above 2**32

MEDIUM_X, MEDIUM_X, SMALL_Y, SMALL_Y -> (10000, 200, 50, 50) > 2**32 elements
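The element counts quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary labels are made up for this example, and the 4-bytes-per-element figure assumes float32 data, which the description does not state.

```python
import math

THRESHOLD = 2**32  # element-count threshold quoted in the description

# Shapes quoted in the description; the labels are illustrative only.
shapes = {
    "random_* (old)": (10000, 50, 100, 50),
    "random_* (new)": (10000, 100, 100, 50),
    "pooling (old)": (10000, 10000, 50, 50),
    "pooling (new)": (10000, 200, 50, 50),
}


def num_elements(shape):
    """Return the total number of elements in an array of this shape."""
    return math.prod(shape)


for label, shape in shapes.items():
    n = num_elements(shape)
    # Assumption: 4 bytes per element (float32).
    gib = n * 4 / 2**30
    print(f"{label}: {n:,} elements, exceeds 2**32: {n > THRESHOLD}, ~{gib:,.0f} GiB")
```

Under the float32 assumption, the old pooling shape alone would need roughly 931 GiB, which is consistent with the allocation failure reported above.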

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • tests/nightly/test_large_array.py

Comments

@marcoabreu
Contributor

Please remove commented out code

@access2rohit
Contributor

Thanks for the quick fix! I left 1 comment; the rest LGTM. Please post the complete test run for test_large_array.py here, then it should be good to go.

@ChaiBapchya ChaiBapchya force-pushed the fix_large_array branch 2 times, most recently from 8d3ea39 to fd69dfb Compare October 4, 2019 18:18
@ChaiBapchya
Contributor Author

ChaiBapchya commented Oct 9, 2019

test_large_array.test_gluon_embedding ... ok
test_large_array.test_ndarray_zeros ... ok
test_large_array.test_ndarray_ones ... ok
test_large_array.test_ndarray_convert ... ok
test_large_array.test_ndarray_random_uniform ... ok
test_large_array.test_ndarray_random_randint ... ok
test_large_array.test_ndarray_random_exponential ... ok
test_large_array.test_ndarray_random_gamma ... ok
test_large_array.test_ndarray_random_multinomial ... ok
test_large_array.test_ndarray_random_generalized_negative_binomial ... ok
test_large_array.test_ndarray_random_negative_binomial ... ok
test_large_array.test_ndarray_random_normal ... ok
test_large_array.test_ndarray_random_poisson ... ok
test_large_array.test_ndarray_random_randn ... ok
test_large_array.test_ndarray_random_shuffle ... ok
test_large_array.test_ndarray_empty ... ok
test_large_array.test_elementwise ... ok
test_large_array.test_reduce ... ok
test_large_array.test_dot ... ok
test_large_array.test_FullyConnected ... ok
test_large_array.test_broadcast ... ok
test_large_array.test_clip ... ok
test_large_array.test_split ... ok
test_large_array.test_argmin ... ok
test_large_array.test_tile ... ok
test_large_array.test_take ... ok
test_large_array.test_slice ... ok
test_large_array.test_slice_assign ... ok
test_large_array.test_expand_dims ... ok
test_large_array.test_squeeze ... ok
test_large_array.test_broadcast_div ... ok
test_large_array.test_Dense ... ok
test_large_array.test_where ... ok
test_large_array.test_pick ... ok
test_large_array.test_depthtospace ...Killed

The run gets killed when it reaches the maximum memory limit (480G/480G) of the p3 instance.
Resumed the run from test_depthtospace:

test_large_array.test_depthtospace ...
ok
test_large_array.test_spacetodepth ... ok
test_large_array.test_diag ... ok
test_large_array.test_ravel_multi_index ... ok
test_large_array.test_unravel_index ... ok
test_large_array.test_transpose ... ok
test_large_array.test_swapaxes ... ok
test_large_array.test_flip ... ok
test_large_array.test_softmax ... ok
test_large_array.test_argsort ... ok
test_large_array.test_sort ... ok
test_large_array.test_topk ... ok
test_large_array.test_exponent_logarithm_operators ... ok
test_large_array.test_power_operators ... ok
test_large_array.test_sequence_mask ... ERROR
test_large_array.test_sequence_reverse ... ok
test_large_array.test_sequence_last ... ok
test_large_array.test_softmax_cross_entropy ... ERROR
test_large_array.test_index_copy ... ok
test_large_array.testSoftmaxOutput ... [23:22:11] ../src/executor/graph_executor.cc:2014: Subgraph backend MKLDNN is activated.
ok
test_large_array.test_leaky_relu ... ok
test_large_array.test_pooling ... ok
test_large_array.test_layer_norm ... [23:34:54] ../src/executor/graph_executor.cc:1936: Subgraph backend MKLDNN is activated.
ERROR
test_large_array.test_dropout ... [23:35:57] ../src/executor/graph_executor.cc:1936: Subgraph backend MKLDNN is activated.
ok
test_large_array.test_activation ... ok
test_large_array.test_batchnorm ... ok
test_large_array.test_add ... ok
test_large_array.test_sub ... ok
test_large_array.test_rsub ... ok
test_large_array.test_neg ... ok
test_large_array.test_mul ... ok
test_large_array.test_div ... ok
test_large_array.test_rdiv ... ok
test_large_array.test_mod ... ok
test_large_array.test_rmod ... ok
test_large_array.test_imod ... ok
test_large_array.test_pow ... ok
test_large_array.test_rpow ... ok
test_large_array.test_shape ... ERROR
test_large_array.test_size ... ok
test_large_array.test_copy ... ok
test_large_array.test_copy_to ... ok
test_large_array.test_zeros_like ... Killed

@access2rohit
Contributor

@anirudh2290 can you review and merge this

@anirudh2290
Member

There are a few tests failing after you resumed the run. Why?

test_large_array.test_softmax_cross_entropy ... ERROR

@ChaiBapchya
Contributor Author

ChaiBapchya commented Oct 10, 2019

Tests that error out when run together pass when run individually.

nosetests tests/nightly/test_large_array.py:test_softmax_cross_entropy --verbose -s
test_large_array.test_softmax_cross_entropy ... ok

----------------------------------------------------------------------
Ran 1 test in 199.836s

OK

I'm guessing this has to do with memory overflow (not sure, though). But we've been seeing the following for quite some time:

  1. tests getting killed
  2. tests error out when run together but individually they pass

@marcoabreu
Contributor

I assume you're testing with the teardown for now. But before making the PR final, I'd appreciate if you could elaborate. Feel free to ping me once you figured it out.

@marcoabreu marcoabreu changed the title Fix large array tests [WIP] Fix large array tests Oct 10, 2019
@marcoabreu marcoabreu added the pr-work-in-progress PR is still work in progress label Oct 10, 2019
@access2rohit
Contributor

I assume you're testing with the teardown for now. But before making the PR final, I'd appreciate if you could elaborate. Feel free to ping me once you figured it out.

@marcoabreu: we need the teardown to free the memory allocated by each test once it finishes, which wasn't happening before. These tests need a lot of memory, and running all of them in a single execution caused out-of-memory errors because memory was not being freed in time (as observed using htop). Therefore we need to clear the cache and run the teardown after each individual test. Let me know if this makes sense.
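The per-test cleanup described here can be sketched as a small helper that always runs a cleanup step after a test, whether the test passes or fails. This is a minimal sketch, not the code merged in this PR: `run_with_cleanup`, `release_memory`, and `sample_test` are hypothetical names, and `release_memory` is only a stand-in for the MXNet calls that appear in the snippet reviewed in this PR (`mx.nd.waitall()` followed by `mx.cpu().empty_cache()`).

```python
import gc


def run_with_cleanup(test_fn, cleanup):
    """Run test_fn, then always run cleanup, even if the test raises."""
    try:
        return test_fn()
    finally:
        # Runs on success and on failure; an exception from test_fn
        # still propagates to the caller afterwards.
        cleanup()


def release_memory():
    # Hypothetical stand-in for the MXNet-specific cleanup; here it only
    # triggers Python's garbage collector.
    gc.collect()


def sample_test():
    # Hypothetical test body that allocates a sizeable temporary object.
    data = [0] * 1_000_000
    assert len(data) == 1_000_000


run_with_cleanup(sample_test, release_memory)
```

Putting the cleanup in a `finally` clause means one failing test cannot leave its allocations behind for the next one.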

@anirudh2290
Member

anirudh2290 commented Oct 11, 2019

Is there an issue for the disabled nightly large tensor tests? If not, can you please open one for them?

@ChaiBapchya
Contributor Author

ChaiBapchya commented Oct 11, 2019

Nope. I made one to document these findings. #16447
This one - #14980

@anirudh2290
Member

Also, why does the title still say WIP?

@ChaiBapchya ChaiBapchya changed the title [WIP] Fix large array tests Fix large array tests Oct 11, 2019
        raise
    finally:
        mx.nd.waitall()
        mx.cpu().empty_cache()
Member


Are the tests only for the CPU context?

Contributor Author

@ChaiBapchya ChaiBapchya Oct 11, 2019


Yes. We don't test on GPU (large tensor support is not available on GPU).

@anirudh2290 anirudh2290 merged commit 858a52e into apache:master Oct 14, 2019
aaronmarkham pushed a commit to aaronmarkham/incubator-mxnet that referenced this pull request Oct 16, 2019
* fix activation

* remove comments

* fix copy_to

* fix lint, remove redundant function, fix shape sizes for random functions

* fix sigmoid issue

* fix leaky relu

* fix random shuffle

* fix pooling

* fix dropout

* fix index copy

* add teardown and fix lint

* post test cleanup

* removed decorator since it needs C API for CPU memory release