
Conversation

@derekwaynecarr
Member

This PR adds support to List Projects based on the user's authorization.

Contributor

Is there a WeakHashMap construct in Go? That makes a Locker easier to reason about since you never have to worry about deletion logic. Once the references are gone, the object can be GC'ed. Otherwise, you'll need a sort of ref counter prior to delete to protect from multiple GetOrCreates racing with a Delete on a different thread and ending up with different locks on each thread.

Member Author

Have not seen that, but yes, good point on ref counters being needed for each lock.

On Feb 11, 2015, at 8:32 AM, David Eads wrote, in pkg/project/auth/locker.go:

> +		return keyLock
> +	}
> +	l.lock.RUnlock()
> +	// we need to create one
> +	keyLock = &sync.RWMutex{}
> +	l.lock.Lock()
> +	defer l.lock.Unlock()
> +	l.items[key] = keyLock
> +	return keyLock
> +}
> +
> +// Delete the lock for the specified key
> +func (l *locker) Delete(key string) {

Member Author

Will probably make Delete a Release func and ensure it's called after every GetOrCreate. Will probably rename GetOrCreate to Acquire.


Contributor

> Will probably make Delete a Release func and ensure it's called after every GetOrCreate. Will probably rename GetOrCreate to Acquire.

Sounds good.
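For readers following the thread, here is a minimal sketch of the ref-counted Acquire/Release idea discussed above. It is illustrative only, not the PR's actual pkg/project/auth/locker.go; the keyLocker and lockEntry names are made up.

```go
package keyedlock

import "sync"

// keyLocker hands out one mutex per key. The reference count guards the race
// described above: an Acquire on one goroutine running concurrently with a
// Release (and delete) on another must not leave the two goroutines holding
// different locks for the same key.
type keyLocker struct {
	mu    sync.Mutex
	items map[string]*lockEntry
}

type lockEntry struct {
	sync.Mutex
	refs int
}

func newKeyLocker() *keyLocker {
	return &keyLocker{items: map[string]*lockEntry{}}
}

// Acquire returns the lock for key, creating it if needed, and bumps its
// reference count so a concurrent Release cannot delete it out from under us.
func (l *keyLocker) Acquire(key string) *lockEntry {
	l.mu.Lock()
	defer l.mu.Unlock()
	entry, ok := l.items[key]
	if !ok {
		entry = &lockEntry{}
		l.items[key] = entry
	}
	entry.refs++
	return entry
}

// Release drops one reference and removes the entry once nobody holds it.
func (l *keyLocker) Release(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if entry, ok := l.items[key]; ok {
		entry.refs--
		if entry.refs <= 0 {
			delete(l.items, key)
		}
	}
}
```

A caller would pair the two, e.g. `e := l.Acquire(key); e.Lock(); defer func() { e.Unlock(); l.Release(key) }()`.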

@derekwaynecarr force-pushed the acl_cache branch 2 times, most recently from bd7b88c to 2c7fcbd on February 13, 2015 at 20:57
Contributor

Either scope items in the for loop or name it more precisely. The reuse makes it harder to read.
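Purely illustrative (the real function under review is not shown here): the point is that declaring the slice inside the loop body, or naming it for what it holds, keeps each iteration self-contained instead of reusing one items variable across the whole function. All names below are made up.

```go
package example

// namespaceCounts shows the scoping suggestion: each iteration declares its
// own slice, so nothing carries over between users.
func namespaceCounts(visible map[string][]string) map[string]int {
	counts := map[string]int{}
	for user, namespaces := range visible {
		// Scoped to this iteration and named for what it holds.
		uniqueNamespaces := dedupe(namespaces)
		counts[user] = len(uniqueNamespaces)
	}
	return counts
}

func dedupe(in []string) []string {
	seen := map[string]bool{}
	out := []string{}
	for _, s := range in {
		if !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	return out
}
```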

@derekwaynecarr force-pushed the acl_cache branch 4 times, most recently from 4e27842 to 86ab35f on February 17, 2015 at 17:16
@derekwaynecarr
Member Author

Prerequisite PR: #1038

@derekwaynecarr force-pushed the acl_cache branch 3 times, most recently from a18d41c to ccda543 on February 17, 2015 at 20:58
@derekwaynecarr changed the title from "WIP: Caching proxy to list projects" to "Make List Projects authorization aware" on Feb 17, 2015
@derekwaynecarr
Member Author

This is ready for review.

It cannot be merged until the following are merged:

  1. Project is Namespace #1038 - project is namespace
  2. ignore setSelfLink errors #1049 - fix resource access reviews after latest rebase.

A user is able to create a project but will not see it in a list until the sync runs.

/cc @smarterclayton @deads2k @liggitt

@derekwaynecarr force-pushed the acl_cache branch 2 times, most recently from a9c0190 to 9d3bb85 on February 17, 2015 at 22:14
@smarterclayton
Contributor

For #1049, @deads2k, can you add tests to RAR that prevent this from being broken again?


@derekwaynecarr force-pushed the acl_cache branch 2 times, most recently from 4d1c4d3 to 16a0674 on February 18, 2015 at 19:19
@derekwaynecarr
Member Author

This works now.

Note:
Due to #1049, you see the following in the log at startup:

E0218 14:17:21.322616   26099 resthandler.go:351] error generating link: unable to find object fields on reflect.Value{typ:(*reflect.rtype)(0x1159cc0), ptr:(unsafe.Pointer)(0xc208fdd6e0), flag:0xd9}: couldn't find Namespace field in api.TypeMeta{Kind:"", APIVersion:""}
E0218 14:17:21.442787   26099 resthandler.go:351] error generating link: unable to find object fields on reflect.Value{typ:(*reflect.rtype)(0x1159cc0), ptr:(unsafe.Pointer)(0xc2090a9380), flag:0xd9}: couldn't find Namespace field in api.TypeMeta{Kind:"", APIVersion:""}
E0218 14:17:21.578445   26099 resthandler.go:351] error generating link: unable to find object fields on reflect.Value{typ:(*reflect.rtype)(0x1159cc0), ptr:(unsafe.Pointer)(0xc208b72060), flag:0xd9}: couldn't find Namespace field in api.TypeMeta{Kind:"", APIVersion:""}
E0218 14:17:21.703789   26099 resthandler.go:351] error generating link: unable to find object fields on reflect.Value{typ:(*reflect.rtype)(0x1159cc0), ptr:(unsafe.Pointer)(0xc208d3fa40), flag:0xd9}: couldn't find Namespace field in api.TypeMeta{Kind:"", APIVersion:""}

Conceptually, you will see one of those messages whenever you add or modify a namespace, policy, or binding.

@derekwaynecarr
Member Author

Ignore my previous comment about log spam; the log level for that issue has been reduced.

@deads2k
Contributor

deads2k commented Feb 19, 2015

gofmt?

Contributor

You're indexing by name but retrieving by UID, so no projects ever get returned. Switching this to GetName() makes it work properly.

Member Author

Will fix that up and try to get a unit test with different values to make sure it doesn't happen again.
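An illustrative sketch of that kind of test (not the PR's actual cache code; the project and projectIndex types below are stand-ins): the important part is that the fixture's Name and UID deliberately hold different values, so an accidental UID lookup fails instead of passing by coincidence.

```go
package cache

import "testing"

type project struct {
	UID  string
	Name string
}

// projectIndex stands in for the real cache: it is keyed by name.
type projectIndex map[string]project

func (i projectIndex) add(p project) {
	i[p.Name] = p
}

func TestLookupUsesNameNotUID(t *testing.T) {
	idx := projectIndex{}
	// Name and UID intentionally differ so a lookup by the wrong field
	// cannot accidentally succeed.
	p := project{UID: "1234-abcd-5678", Name: "hello-openshift"}
	idx.add(p)

	if _, ok := idx[p.UID]; ok {
		t.Fatalf("lookup by UID should miss an index keyed by name")
	}
	if got, ok := idx[p.Name]; !ok || got.UID != p.UID {
		t.Fatalf("lookup by name should return the project, got %#v", got)
	}
}
```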

@derekwaynecarr
Member Author

@deads2k - fixed up. Thanks for the heads-up; for some reason I thought we were using the UID.

@deads2k
Contributor

deads2k commented Feb 20, 2015

Trouble in paradise. With #971 and 500 projects, I see one CPU fully pegged just idling. Exact same etcd with #971 pulled out, and I see 3% CPU utilization.

@derekwaynecarr
Member Author

This has not yet merged, so I added a commit to fix the pegged CPU and added my project-spawner script.

@derekwaynecarr
Member Author

FYI: Fixing that bad loop brought create time to < 4 minutes. It will improve more when @deads2k's change goes in to further improve policy. CPU utilization is back down below 1% in steady state.

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_openshift3/993/) (Image: devenv-fedora_852)

@smarterclayton
Contributor

Cache test fails gofmt

@derekwaynecarr
Member Author

Ugh, gofmt rules change with Go versions.

@smarterclayton
Contributor

Yeah, gofmt on 1.3 for now


@derekwaynecarr
Member Author

That was a pain to figure out, but it should be fixed now.

@openshift-bot
Contributor

[Test]ing while waiting on the merge queue

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_openshift3/1129/)

@derekwaynecarr
Member Author

I forgot to fix my integration test; will do that now.


@derekwaynecarr
Member Author

Well, now hack/test-cmd.sh looks to have flaked; it works fine for me locally.

@derekwaynecarr
Member Author

[Test]

@derekwaynecarr
Member Author

Passed the test, so re-issuing [Merge].

@openshift-bot
Contributor

Evaluated for origin up to 3fb1a4d

openshift-bot pushed a commit that referenced this pull request Feb 21, 2015
@openshift-bot merged commit 35ee25d into openshift:master on Feb 21, 2015
@smarterclayton
Contributor

Looks like the bulk of the performance issue is allocation and the CPU taken by GC - 23GB (33% of CPU) allocated while the create-project spawner ran, with only 20MB actually in use at any one time.

The alloc_total sample below was taken over 30 seconds while the project spawner was running:

(pprof) top15 -cum
100.21MB of 23354.72MB total ( 0.43%)
Dropped 653 nodes (cum <= 116.77MB)
Showing top 15 nodes out of 180 (cum >= 11232.09MB)
      flat  flat%   sum%        cum   cum%
       2MB 0.0086% 0.0086% 18090.55MB 77.46%  github.com/GoogleCloudPlatform/kubernetes/pkg/conversion.(*Scheme).DecodeInto
         0     0% 0.0086% 17960.54MB 76.90%  github.com/GoogleCloudPlatform/kubernetes/pkg/runtime.(*Scheme).DecodeInto
       3MB 0.013% 0.021% 17866.75MB 76.50%  github.com/ghodss/yaml.Unmarshal
    2.50MB 0.011% 0.032% 17404.44MB 74.52%  github.com/ghodss/yaml.yamlToJSON
         0     0% 0.032% 16181.20MB 69.28%  github.com/GoogleCloudPlatform/kubernetes/pkg/tools.(*EtcdHelper).bodyAndExtractObj
         0     0% 0.032% 15924.18MB 68.18%  github.com/GoogleCloudPlatform/kubernetes/pkg/tools.(*EtcdHelper).ExtractObj
   77.21MB  0.33%  0.36% 15721.20MB 67.31%  github.com/GoogleCloudPlatform/kubernetes/pkg/tools.(*EtcdHelper).extractObj
   10.50MB 0.045%  0.41% 15564.69MB 66.64%  gopkg.in/yaml%2ev2.Unmarshal
         0     0%  0.41% 15303.70MB 65.53%  github.com/GoogleCloudPlatform/kubernetes/pkg/registry/generic/etcd.(*Etcd).Get
         0     0%  0.41% 15032.03MB 64.36%  github.com/openshift/origin/pkg/authorization/registry/etcd.(*Etcd).GetPolicy
         0     0%  0.41% 12871.41MB 55.11%  github.com/openshift/origin/pkg/authorization/rulevalidation.(*DefaultRuleResolver).getPolicy
       4MB 0.017%  0.42% 12354.36MB 52.90%  github.com/openshift/origin/pkg/authorization/rulevalidation.(*DefaultRuleResolver).GetRole
       1MB 0.0043%  0.43% 11524.13MB 49.34%  github.com/openshift/origin/pkg/authorization/rulevalidation.(*DefaultRuleResolver).GetEffectivePolicyRules
         0     0%  0.43% 11232.09MB 48.09%  gopkg.in/yaml%2ev2.(*parser).skip
         0     0%  0.43% 11232.09MB 48.09%  gopkg.in/yaml%2ev2.yaml_parser_parse
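For anyone wanting to reproduce this kind of profile, a generic sketch (not the PR's actual setup) is to expose the standard net/http/pprof endpoints inside the process under load and then pull the allocation profile with go tool pprof; the listen address here is arbitrary.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running inside the process under test, an allocation profile
	// like the one above can be collected and browsed with something like:
	//   go tool pprof --alloc_space <binary> http://127.0.0.1:6060/debug/pprof/heap
	//   (pprof) top15 -cum
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}
```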

@smarterclayton
Contributor

Also seeing this on startup:

E0221 15:23:18.673707    2552 master.go:478] Error creating namespace: &{{ } {master    791232fb-ba07-11e4-a576-7831c1b76042  2015-02-21 15:23:18.67307407 -0500 EST map[] map[]} {} {}} due to request [&{Method:POST URL:https://192.168.1.103:8443/api/v1beta1/namespaces Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[] Body:{Reader:} ContentLength:172 TransferEncoding:[] Close:false Host:192.168.1.103:8443 Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil>}] failed (403) 403 Forbidden: Forbidden: "/api/v1beta1/namespaces" denied by default

@deads2k ?

@liggitt
Contributor

liggitt commented Feb 21, 2015

Probably the loop that ensures the master policy namespace exists is failing the first time because the bootstrap policy doesn't exist yet.

@derekwaynecarr
Member Author

Need to flip the ensure call order. I will submit a PR.


@smarterclayton
Contributor

1GB allocated per second is pretty impressive.

@derekwaynecarr
Member Author

We need a realistic workload to measure performance of the total system more accurately and make improvements moving forward.

Measuring project and policy in isolation from the resources they hold is problematic. After all, once we start sticking 2 replication controllers per project, we will start to see completely different issues. Same for 6-10 pods per project, etc.

Let's chat about how we can build a Go library that can populate data and simulate load (which is really far more read-heavy). We may want to see if we can build a common core population library with upstream that we can then build around: populate N namespaces, each with X to Y pods, replication controllers, and services. This is a good topic for this week's Kubernetes meetup, as I think many groups building around the project are well served by building against an anticipated data set. Once we have something, we should run and report those stats each sprint.

On this topic though, the improvements are obvious and known.

  1. We should upgrade etcd to match upstream.
  2. We need to measure performance when not embedding etcd in our process.
  3. If we have a latest resource version on a cache.Store, we can optimize our synch loop more in steady state to avoid iterating at all.
  4. If I can make resource access review calls without invoking a rest client, we can avoid encode/decode cost and associated garbage it generates.
  5. If policy stops fetching the same thing from etcd for each request we improve more by avoiding the encode/decode process there and the garbage it collects. Policy should hold frequently accessed items in memory. The global policy and policy bindings seem like an easy candidate. The project auth cache is holding this already so we need to share that data between modules. (A rough sketch of this idea follows below.)
  6. We need to take a serious look at holding subject-to-namespace and group-to-namespace records in a data store, and doing an LRU over the data set. Etcd is an option since there is a natural lookup key. This would let us avoid having to warm our cache at process start. Some caution is needed here: I think we should do the previous items first, because if we do this first, things may look worse before they get better. I am not sure what to do with namespace review records; they may need the same treatment.

I think 1-5 will get us a long way toward a better understanding of the real system, but I suspect other issues will come up in other parts of the system at that point.

Time to get hacking and measuring now that we have a working core :)
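As a rough illustration of point 5 above, here is a hypothetical read-through policy cache; the Policy and Getter types are placeholders rather than the origin API types, and invalidation is assumed to be driven by the same watch the project auth cache already maintains.

```go
package policycache

import "sync"

// Policy is a stand-in for the authorization policy object, not the real
// origin API type.
type Policy struct {
	Namespace       string
	ResourceVersion string
}

// Getter is whatever currently fetches policy from etcd.
type Getter interface {
	GetPolicy(namespace string) (*Policy, error)
}

// readThroughCache keeps the most recently fetched policy per namespace in
// memory so repeated authorization checks do not re-read and re-decode the
// same etcd value.
type readThroughCache struct {
	mu       sync.RWMutex
	delegate Getter
	byNS     map[string]*Policy
}

func New(delegate Getter) *readThroughCache {
	return &readThroughCache{delegate: delegate, byNS: map[string]*Policy{}}
}

func (c *readThroughCache) GetPolicy(namespace string) (*Policy, error) {
	c.mu.RLock()
	if p, ok := c.byNS[namespace]; ok {
		c.mu.RUnlock()
		return p, nil
	}
	c.mu.RUnlock()

	// Miss: fall back to etcd, then remember the result.
	p, err := c.delegate.GetPolicy(namespace)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.byNS[namespace] = p
	c.mu.Unlock()
	return p, nil
}

// Invalidate would be driven by policy change notifications, so the hit path
// never touches etcd in steady state.
func (c *readThroughCache) Invalidate(namespace string) {
	c.mu.Lock()
	delete(c.byNS, namespace)
	c.mu.Unlock()
}
```

The steady-state read path stays entirely in memory; correctness then hinges on invalidation being wired to the existing watch on policy and policy bindings.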


@smarterclayton
Contributor

> We need a realistic workload to measure performance of the total system more accurately and make improvements moving forward.
>
> Measuring project and policy in isolation from the resources they hold is problematic. After all, once we start sticking 2 replication controllers per project, we will start to see completely different issues. Same for 6-10 pods per project, etc.
>
> Let's chat about how we can build a Go library that can populate data and simulate load (which is really far more read-heavy). We may want to see if we can build a common core population library with upstream that we can then build around: populate N namespaces, each with X to Y pods, replication controllers, and services. This is a good topic for this week's Kubernetes meetup, as I think many groups building around the project are well served by building against an anticipated data set. Once we have something, we should run and report those stats each sprint.
>
> On this topic though, the improvements are obvious and known.
>
> 1. We should upgrade etcd to match upstream.

I don't think we can upgrade until the stability issues folks are seeing are addressed.

> 2. We need to measure performance when not embedding etcd in our process.

Honestly, from the trace this is entirely GC load from allocations in policy. I don't even think we need to look at etcd until we get this down to something more accurate.

> 3. If we have a latest resource version on a cache.Store, we can optimize our synch loop more in steady state to avoid iterating at all.

Don't you know whether you've had a watch passed? High water is one, but any change notification is sufficient.

> 4. If I can make resource access review calls without invoking a rest client, we can avoid encode/decode cost and associated garbage it generates.

The evaluation here definitely needs to start at the lower levels:

  1. Is it the minimal set of calls being made to calculate the request?
  2. As you note in 5, what are the core things we can cache?
  3. Are we doing more copies than we should for other reasons (i.e. are we using map[string]Type instead of map[string]*Type)? (A short illustration follows below.)

> 5. If policy stops fetching the same thing from etcd for each request we improve more by avoiding the encode/decode process there and the garbage it collects. Policy should hold frequently accessed items in memory. The global policy and policy bindings seem like an easy candidate. The project auth cache is holding this already so we need to share that data between modules.

I think this is probably the first item.

> 6. We need to take a serious look at holding subject-to-namespace and group-to-namespace records in a data store, and doing an LRU over the data set. Etcd is an option since there is a natural lookup key. This would let us avoid having to warm our cache at process start. Some caution is needed here: I think we should do the previous items first, because if we do this first, things may look worse before they get better. I am not sure what to do with namespace review records; they may need the same treatment.

I agree, this is last.

> I think 1-5 will get us a long way toward a better understanding of the real system, but I suspect other issues will come up in other parts of the system at that point.
>
> Time to get hacking and measuring now that we have a working core :)

Like I said, generating 1GB of garbage is an achievement. Try that in Ruby...

jpeeler pushed a commit to jpeeler/origin that referenced this pull request Aug 10, 2017
…service-catalog/' changes from 568a7b9..8f07b7b

8f07b7b origin: add required patches
ee57bfb Cleanup of ups broker example + making controller follow the OSB API (openshift#807)
45a11ed Revert "Rename our resources to have ServiceCatalog prefix (openshift#1054)" (openshift#1061)
4e47ec1 Rename our resources to have ServiceCatalog prefix (openshift#1054)
2bb334a Rebase on 1.7 API machinery (openshift#944)
5780b59 Run broker reconciler when spec is changed. (openshift#1026)
9c22d04 Merge branch 'pr/1006'
d077915 check number of expected events before dereferencing to avoid panic (openshift#1052)
90d615f Merge branch 'pr/1055'
bb6d6d8 fix log output to use formatted output (openshift#1056)
c7abc81 Adding examples to the README
ccc93c9 Remove different-org rule for LGTM (openshift#1050)
be04cd5 Allow for a period in the GUID of the External ID (openshift#1034)
8c246df Make it so that binding.spec.secretName defaults to binding name (openshift#851)
6745418 Bump OSB Client (openshift#1049)
8346a0d apiserver etcd healthcheck as suggested to address k/k#48215 (openshift#1039)
11d0d4a use GKE's latest 1.6.X cluster version for Jenkins (openshift#1036)
7d71b5b Cross-build all the things!
8ec0874 RBAC setup behind the aggregator. (openshift#936)
0864a2e Upsert retry loop for Secret, set/check ownerReference for Secret owned by Binding (openshift#979)
6be9886 add info about weekly calls (openshift#1027)
a242b26 add OSB API Header version flag (openshift#1014)
66e2ce6 Update REVIEWING doc with changes to LGTM process (openshift#1016)
699e016 Writing the returned progress description from the broker (openshift#998)
02642f4 Adding target to test on the host (openshift#1020)
78ca572 v0.0.13 (openshift#1024)
9e79ec2 use GKE's default K8S version for Jenkins (openshift#1023)
d3c915a Fix curl on API server start error (openshift#1015)
b50be75 Merge branch 'pr/1013'
2c98ba1 Using tag URLs
687f091 Parameterizing the priority fields
34ed5cd update apiregistration yaml to v1.7 final (openshift#1011)
91fa1ad make e2e look for pods' existence before checking status (openshift#1012)
0f90705 explicitly disable leader election if it is not enabled (openshift#965)
f5761e7 controller-manager health checks (openshift#694)
da260f2 Add logging for normal Unbind errors (openshift#992)
4c916a5 make the apiserver test use tls (openshift#991)
1a62ecc refactor reconcileBroker (openshift#986)
cc179bc Add logging for normal Bind errors (openshift#993)
a1458dd add parameterization for user-broker image to e2e tests (openshift#995)
fb15891 Bump OSB client (openshift#1000)
79d5206 v0.0.12 (openshift#996)
39c7407 Merge branch 'pr/975'
a553b2d Merge branch 'pr/974'
d573339 reconcileBinding error checking (openshift#973)
39a1061 Making events and actions checks generic (openshift#960)
73136a4 Bump osb client (openshift#971)
878a987 reconcileInstance error checking in unit-tests
4991d57  reconcileBroker error checking in unit-tests
9ed6812 Extract methods for binding test setup (openshift#961)
b69a1ee Make ups-broker return valid unbind response (openshift#964)
8b37d2f Releasing 0.0.11 (openshift#962)
52fec8b Merge branch 'pr/954'
d49cdeb Swap client
445fa71 Add dependency on pmorie/go-open-service-broker-client
9f743b2 Instructions for enabling API Aggregation (openshift#895)
512508d Use correct infof calls in controller_manager (openshift#950)
77943ba fix regex that determines if a tag is deployable (openshift#947)
8a226b8 Updates for v0.0.10 release (openshift#943)
REVERT: 568a7b9 origin build: add origin tooling

git-subtree-dir: cmd/service-catalog/go/src/github.com/kubernetes-incubator/service-catalog
git-subtree-split: 8f07b7bbf3acb2b557f23596a92b5e775ae9321c
jpeeler pushed a commit to jpeeler/origin that referenced this pull request Feb 1, 2018
* Bump OSB client

* Controller changes for OSB client bump