Performance not scaling with multithreading #15874

Closed
abhijithch opened this issue Apr 14, 2016 · 18 comments
Labels
domain:multithreading (Base.Threads and related functionality), performance (Must go faster)

Comments

@abhijithch

abhijithch commented Apr 14, 2016

While running an example from the recommender system package on the MovieLens data, the performance of the multithreaded version does not seem to scale well.

[Image: screen shot 2016-04-16 at 5 30 01 pm]

To run the example, download the dataset from here and unzip the files to any folder. Then clone this repository and run the MovieLens example in a Julia build with multithreading enabled:

  • have a thread-enabled Julia build
  • include the Julia file
  • set OPENBLAS_NUM_THREADS to 1
  • set JULIA_NUM_THREADS to the desired number of threads
  • call the test_thread(dataset_path) method in the example script
@andreasnoack
Member

I think this is a duplicate of #15871 and #15276. Could you try running the threading benchmarks with commit dc6b0de?

@abhijithch
Author

Thanks @andreasnoack, also this is segfaulting for nthreads() > 30.

@ranjanan
Contributor

Just adding versioninfo() for reference:

Julia Version 0.5.0-dev+1918
Commit 80b14f3 (2015-12-24 13:04 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@ViralBShah added the domain:multithreading and performance labels Apr 14, 2016
@ViralBShah
Member

I think this is not a dup. @abhijithch Didn't you use an older commit on julia master? Could you point out which commit you used, and is it possible to bisect and find where things broke?

@ViralBShah
Member

Oops - ignore. I see @ranjanan did post the commit.

@ViralBShah
Member

Cc @tanmaykm

@amitmurthy
Contributor

How many cores?

@ranjanan
Contributor

30 cores with 2 hardware threads per core.

@kpamnany
Contributor

There appear to be 3 different things here.

  1. Performance regression. This doesn't seem connected with threads; with 1 thread alone, test_thread() completes in 60.011244 seconds on commit dc6b0de and in 75.866557 seconds on master. I think @andreasnoack is right about this part.
  2. Segfault. Occurs with 32 threads on commit dc6b0de, and with 16 threads on master. This is #15875 (Multithreading segmentation fault).
  3. Poor scalability. On commit dc6b0de, performance with 1, 2, 4, 8, and 16 threads is 60.011244, 53.702526, 49.711778, 47.373577, and 47.155389 seconds respectively. On inspection, update_user() and update_item() take ~1.5 and ~1.1 seconds on 138493 users and 26744 items. This should show better benefit from threads. Barrier time is negligible which means the loops are regular and load is balanced. How big and unique is the data touched by each thread? Perhaps the cache is being thrashed?
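
For reference, the timings in item 3 can be turned into speedup and parallel efficiency figures. The following is a minimal sketch, assuming a current Julia 1.x session; the variable names and the `karp_flatt` helper are illustrative and not part of the example script. The Karp-Flatt metric used here is a standard estimate of the serial fraction implied by a measured speedup.

```julia
# Timings reported above for commit dc6b0de (seconds).
nthreads_used = [1, 2, 4, 8, 16]
seconds = [60.011244, 53.702526, 49.711778, 47.373577, 47.155389]

# Speedup relative to the 1-thread run, and parallel efficiency.
speedup = seconds[1] ./ seconds
efficiency = speedup ./ nthreads_used

# Karp-Flatt metric (hypothetical helper name): experimentally determined
# serial fraction, e = (1/S - 1/p) / (1 - 1/p), defined only for p > 1.
karp_flatt(S, p) = (1 / S - 1 / p) / (1 - 1 / p)

for (p, S, E) in zip(nthreads_used, speedup, efficiency)
    e = p > 1 ? karp_flatt(S, p) : NaN
    println("threads=", p,
            "  speedup=", round(S, digits=2),
            "  efficiency=", round(E, digits=2),
            "  serial fraction=", round(e, digits=2))
end
```

With these numbers the 16-thread run is only about 1.27 times faster than the 1-thread run, which is consistent with the observation that the threaded loops account for a small share of the total run time.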

@yuyichao
Contributor

I suspect the performance/scalability issues are probably all/mainly #15276. This includes #15740, #15871, and #15874.

@kpamnany
Contributor

kpamnany commented Apr 15, 2016

Threads still scale on master so I think the scalability issue here is unrelated to the performance regression.
[Image: laplace3d_scaling]

@ViralBShah
Member

ViralBShah commented Apr 16, 2016

Given that each user will have different interactions with the item set, there shouldn't be much cache thrashing. You can randomly permute the rows and columns of the R matrix and check if it changes performance. At each iteration, a submatrix is selected from R to work on.
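
The permutation experiment could look roughly like the sketch below. It assumes a current Julia 1.x session with the `Random` and `SparseArrays` standard libraries, and it uses a small random sparse matrix as a hypothetical stand-in for R, since the layout of the real ratings matrix is not shown in this thread.

```julia
using Random, SparseArrays

Random.seed!(1234)               # fixed seed so repeated runs are comparable
R = sprand(1_000, 200, 0.01)     # hypothetical stand-in for the ratings matrix

# Draw independent random permutations of the row and column indices.
rowperm = randperm(size(R, 1))
colperm = randperm(size(R, 2))

# Indexing with the permutation vectors reorders rows and columns:
# Rp[i, j] == R[rowperm[i], colperm[j]].
Rp = R[rowperm, colperm]

println("nonzeros before: ", nnz(R), ", after: ", nnz(Rp))
```

Timing the same run on `R` and on `Rp` would then show whether the ordering of users and items affects performance.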

@abhijithch
Author

On commit dc6b0de, the segfault already occurred with 30 threads. The performance with 1, 2, 5, 10, 20, and 29 threads is 68, 61, 58, 53, 58, and 79 seconds respectively.

@kpamnany
Contributor

@ViralBShah: what I meant was, is it possible that each thread is bringing too much data into the cache and evicting other threads' data?

@pingzou

pingzou commented Apr 27, 2016

I want to try multi-threading, but when I type nthreads() in the REPL, the
result is 1. Would you please tell me how to set JULIA_NUM_THREADS?

@nalimilan
Member

@pingzou Just start Julia with e.g. JULIA_NUM_THREADS=2 julia. (But note this only works with a recent git master.)
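
Once Julia has been started that way, a quick check that the setting took effect might look like the sketch below. It is written for current Julia 1.x, where the same thing can also be requested with `julia --threads 2`; the `squares` array is only an illustration.

```julia
using Base.Threads

# Reports the thread count Julia was started with (1 if nothing was set).
println("Julia is running with ", nthreads(), " thread(s)")

# A small threaded loop: each iteration writes only to its own slot,
# so there is no data race between threads.
squares = zeros(Int, 100)
@threads for i in 1:100
    squares[i] = i^2
end

println("sum of squares: ", sum(squares))
```

The environment variable has to be set before Julia starts; changing it from inside a running session does not alter the thread count.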

@ViralBShah
Member

@abhijithch Can you try this again and see where we are?

@KristofferC
Sponsor Member

Closing due to inactivity.

This issue was closed.