Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ui: race condition during initialization with login enabled #25771

Closed
couchand opened this issue May 21, 2018 · 20 comments
Closed

ui: race condition during initialization with login enabled #25771

couchand opened this issue May 21, 2018 · 20 comments
Assignees
Labels
A-kv-server Relating to the KV-level RPC server C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Milestone

Comments

@couchand
Copy link
Contributor

When login is enabled, there seems to be a race condition during initialization, manifesting as the error "settings/cluster/settings.go:164 Version() was called before having been initialized". It seems to occur when an existing admin ui browser tab is open during node startup, issuing requests such as health checks. It was previously theorized that #24945 fixed this issue, but it looks like it's come up since.

Stack trace for my repro just now:

$ COCKROACH_EXPERIMENTAL_REQUIRE_WEB_LOGIN=true ./cockroach start --certs-dir clusters/login-test/certs --store clusters/login-test/cockroach-data/
F180521 18:20:17.173359 127 settings/cluster/settings.go:164  Version() was called before having been initialized
goroutine 127 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc4204fa500, 0xc4204fa540, 0x6fe6c00, 0x1c)
	/Users/couch/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:956 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x76ab5e0, 0xc400000004, 0x6fe6cb6, 0x1c, 0xa4, 0xc4207f6400, 0x33)
	/Users/couch/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:832 +0x753
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x63f4aa0, 0xc420054098, 0x4, 0x2, 0x0, 0x0, 0xc420b96c08, 0x1, 0x1)
	/Users/couch/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:154 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x63f4aa0, 0xc420054098, 0x1, 0xc400000004, 0x0, 0x0, 0xc420b96c08, 0x1, 0x1)
	/Users/couch/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:55 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x63f4aa0, 0xc420054098, 0xc420b96c08, 0x1, 0x1)
	/Users/couch/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:158 +0x6c
github.com/cockroachdb/cockroach/pkg/settings/cluster.(*ExposedClusterVersion).Version(0xc420705830, 0x0, 0x0, 0x0, 0x5ea7e60)
	/Users/couch/go/src/github.com/cockroachdb/cockroach/pkg/settings/cluster/settings.go:164 +0x161
github.com/cockroachdb/cockroach/pkg/settings/cluster.(*ExposedClusterVersion).IsActive(0xc420705830, 0x7, 0x608d904)
	/Users/couch/go/src/github.com/cockroachdb/cockroach/pkg/settings/cluster/settings.go:199 +0x2b
github.com/cockroachdb/cockroach/pkg/kv.(*RangeDescriptorCache).performRangeLookup(0xc42052a420, 0x63f4b20, 0xc42024f260, 0xc4202b1c90, 0xe, 0x10, 0x6417800, 0xc4202868c0, 0xc4206eb000, 0x1000, ...)
	/Users/couch/go/src/github.com/cockroachdb/cockroach/pkg/kv/range_cache.go:419 +0x103
github.com/cockroachdb/cockroach/pkg/kv.(*RangeDescriptorCache).lookupRangeDescriptorInternal.func3(0x0, 0x0, 0x0, 0x0)
	/Users/couch/go/src/github.com/cockroachdb/cockroach/pkg/kv/range_cache.go:310 +0x132
github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall(0xc42052a460, 0xc42025e780, 0xc420312820, 0x14, 0xc4206b7d00)
	/Users/couch/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:118 +0x2e
created by github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).DoChan
	/Users/couch/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:111 +0x2d0
@couchand couchand added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. A-webui-general Issues on the DB Console that span multiple areas or don't have another clear category. labels May 21, 2018
@couchand couchand added this to the 2.1 milestone May 21, 2018
@tbg
Copy link
Member

tbg commented May 21, 2018

We might be opening up the web endpoint too early in server/server.go ((*Server).Start). We intentionally open it early so that the debug pages can be accessed when the cluster can't connect to the cluster.

@vilterp
Copy link
Contributor

vilterp commented Jun 4, 2018

This used to be easy to repro, but strangely now I can't, with COCKROACH_EXPERIMENTAL_REQUIRE_WEB_LOGIN=true cockroach start --certs-dir certs --logtostderr --host localhost and 10 tabs open. @couchand, can you repro?

@vilterp
Copy link
Contributor

vilterp commented Jun 4, 2018

@couchand did a bisect and thinks this was fixed by ee6d448 in #26329, but we don't understand why.

We thought that this crash was happening because the web authentication code was running SQL before the SQL system was initialized. It's not evident to me how that PR would have fixed this, but we didn't completely understand the bug in the first place.

@vilterp vilterp assigned couchand and unassigned vilterp Jun 4, 2018
@couchand
Copy link
Contributor Author

couchand commented Jun 5, 2018

This bug was created in bedcbed when @vilterp added the maybeAuthMux to pkg/server to handle the root route "/".

bedcbed#diff-09856fe9becddf0199651f451409356aR1187

Previously, the authenticationMux had been mounted around line 1475, well after the pkg/server.Node has been initialized around line 1350 or 1400. The new mux was mounted around line 1185, way before the node is initialized. This provides a wide window between the route being available and the node being initialized.

Node initialization winds up setting the version cluster setting, and so only accesses after this initialization are legal. When a request is received after the root route is available but before the version cluster setting has been initialized, the process panics.

The bug was fixed by @nvanbenschoten in ee6d448. This change removes the method performLegacyRangeLookup from pkg/kv. This removes the need to check the version cluster setting along this code path, so the panic no longer occurs.

ee6d448#diff-ed00a68165bb66ba6524401660df29aaL419

It's great that this specific panic is no longer an issue, but it is a bit troubling that it was so easy to make this situation happen and then very difficult to debug and identify the root issue. Is there something that can be done to prevent this sort of thing from happening again? @tschottdorf @nvanbenschoten @vilterp

@couchand couchand added A-kv-server Relating to the KV-level RPC server C-question A question rather than an issue. No code/spec/doc change needed. and removed A-webui-general Issues on the DB Console that span multiple areas or don't have another clear category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. labels Jun 5, 2018
@a-robinson
Copy link
Contributor

I just ran into this locally on a 1 node cluster on top of a recent master (61325cd):

F180905 16:47:23.248104 72 settings/cluster/settings.go:164  Version() was called before having been initialized
goroutine 72 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc4200c8b00, 0xc4200c8ba0, 0x7871b00, 0x1c)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:972 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x7f85580, 0xc400000004, 0x7871bc1, 0x1c, 0xa4, 0xc4204fac80, 0x33)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:841 +0x804
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x6a38360, 0xc4200520d0, 0x4, 0x2, 0x0, 0x0, 0xc420a680c8, 0x1, 0x1)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:154 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x6a38360, 0xc4200520d0, 0x1, 0x4, 0x0, 0x0, 0xc420a680c8, 0x1, 0x1)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x6a38360, 0xc4200520d0, 0xc420a680c8, 0x1, 0x1)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:181 +0x6c
github.com/cockroachdb/cockroach/pkg/settings/cluster.(*ExposedClusterVersion).Version(0xc4200a11b0, 0x15518ee0b99de1ba, 0x0, 0x0, 0x0)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/settings/cluster/settings.go:164 +0x161
github.com/cockroachdb/cockroach/pkg/settings/cluster.(*ExposedClusterVersion).IsActive(0xc4200a11b0, 0x1c, 0x0)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/settings/cluster/settings.go:199 +0x2b
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).augmentMetaLocked(0xc42048def0, 0x9a446263b8fbf05e, 0x752022fea9e7e392, 0x0, 0x0, 0x0, 0x0, 0x0, 0x15518ee09bd07cba, 0x0, ...)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:356 +0x58
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).augmentMetaLocked(0xc42048dc00, 0x9a446263b8fbf05e, 0x752022fea9e7e392, 0x0, 0x0, 0x0, 0x0, 0x0, 0x15518ee09bd07cba, 0x0, ...)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:514 +0x11b
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSenderFactory).TransactionalSender(0xc42044bee0, 0x1, 0x9a446263b8fbf05e, 0x752022fea9e7e392, 0x0, 0x0, 0x0, 0x0, 0x0, 0x15518ee09bd07cba, ...)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:468 +0x6a8
github.com/cockroachdb/cockroach/pkg/internal/client.NewTxnWithCoordMeta(0xc420516800, 0xc400000000, 0x1, 0x9a446263b8fbf05e, 0x752022fea9e7e392, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:141 +0x13f
github.com/cockroachdb/cockroach/pkg/internal/client.NewTxnWithProto(0xc420516800, 0x0, 0x1, 0x9a446263b8fbf05e, 0x752022fea9e7e392, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:126 +0x150
github.com/cockroachdb/cockroach/pkg/internal/client.NewTxn(0xc420516800, 0x0, 0x1, 0x0)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:116 +0x134
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).prepare(0xc420113300, 0x6a383e0, 0xc420a4d020, 0x6a3ba60, 0xc4207b4800, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_prepare.go:175 +0x291
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).addPreparedStmt(0xc420113300, 0x6a383e0, 0xc420a4d020, 0x0, 0x0, 0x6a3ba60, 0xc4207b4800, 0x0, 0x0, 0x0, ...)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_prepare.go:122 +0xe1
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execPrepare(0xc420113300, 0x6a383e0, 0xc420a4d020, 0x0, 0x0, 0x6a3ba60, 0xc4207b4800, 0xc420a4cfc0, 0x0, 0x0, ...)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_prepare.go:55 +0x15e
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc420113300, 0x6a383e0, 0xc420a4cdb0, 0x0, 0x0, 0x0)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1143 +0x2b44
github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).initConnEx.func1(0xc420113300, 0x6a383e0, 0xc420a4cdb0, 0xc4203a0480, 0x0, 0xc42050cbd0)
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:210 +0x8a
created by github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).initConnEx
        /Users/alex/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:199 +0x275

I had a couple logged-in tabs of the admin UI already open from before restarting the process. I also had a tab trying to load the make watch endpoint, with make watch TARGET=https://localhost:8080 running as well.

@vilterp
Copy link
Contributor

vilterp commented Sep 6, 2018

Attempting to repro and bisect. May have to do with #29230.

@vilterp
Copy link
Contributor

vilterp commented Sep 6, 2018

Nope, much older than that. Bisect seems to show that first commit with this bug is bedcbed.

@couchand
Copy link
Contributor Author

That's unfortunate. Perhaps we should just live with a round trip for an API call as a short-term fix?

@couchand couchand added S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed C-question A question rather than an issue. No code/spec/doc change needed. labels Sep 10, 2018
@couchand couchand assigned vilterp and unassigned couchand Sep 17, 2018
@couchand
Copy link
Contributor Author

couchand commented Sep 18, 2018

it also just highlights how little control our startup process gives about what components of the system get touched at which point in the boot sequence

Well that's exactly what I mean -- how can we make explicit which components are safe to touch and when? I don't think the answer can just be ¯\_(ツ)_/¯...

Currently, in a single 500 line function, we create a bunch of shared mutable state (server/Server has 40 fields and NewServer has 40 local variables), call into a lot of various subsystems (server/server.go directly imports 40 other modules), and hope and pray that we're doing things reasonably (server/server_test.go is half the length of server/server.go).

@tbg
Copy link
Member

tbg commented Sep 18, 2018

Well that's exactly what I mean

Oh, I somehow thought you meant something about the Version thing specifically. I agree with you that the boot code just needs to be... better.

@mberhault
Copy link
Contributor

Occurred again today on cockroach-cyan-0002, running 280d8c7:

F180918 20:03:51.492753 38 settings/cluster/settings.go:164  Version() was called before having been initialized
goroutine 38 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc4200b9201, 0xc4200b92c0, 0x3e7bf00, 0x1c)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:972 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x4758340, 0xc400000004, 0x3e7bff0, 0x1c, 0xa4, 0xc4208b2400, 0x33)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:843 +0x7ea
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x2fb8980, 0xc4200d6010, 0x4, 0x2, 0x0, 0x0, 0xc420ab6070, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x2fb8980, 0xc4200d6010, 0x1, 0x4, 0x0, 0x0, 0xc420ab6070, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x2fb8980, 0xc4200d6010, 0xc420ab6070, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:181 +0x6c
github.com/cockroachdb/cockroach/pkg/settings/cluster.(*ExposedClusterVersion).Version(0xc42008f1a8, 0x15559724c4c3b87f, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/settings/cluster/settings.go:164 +0x161
github.com/cockroachdb/cockroach/pkg/settings/cluster.(*ExposedClusterVersion).IsActive(0xc42008f1a8, 0x1c, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/settings/cluster/settings.go:199 +0x2b
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).augmentMetaLocked(0xc4209286f8, 0x7f4edef963289f72, 0x808961136b56b4be, 0x0, 0x0, 0x0, 0x0, 0x0, 0x15559724a6f6537f, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:371 +0x58
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).augmentMetaLocked(0xc420928400, 0x7f4edef963289f72, 0x808961136b56b4be, 0x0, 0x0, 0x0, 0x0, 0x0, 0x15559724a6f6537f, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:534 +0x11b
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSenderFactory).TransactionalSender(0xc4205a45b0, 0x1, 0x7f4edef963289f72, 0x808961136b56b4be, 0x0, 0x0, 0x0, 0x0, 0x0, 0x15559724a6f6537f, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:479 +0x6da
github.com/cockroachdb/cockroach/pkg/internal/client.NewTxnWithCoordMeta(0x2fb8a00, 0xc4205ec0f0, 0xc4204b8580, 0xc400000000, 0x1, 0x7f4edef963289f72, 0x808961136b56b4be, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:145 +0x14c
github.com/cockroachdb/cockroach/pkg/internal/client.NewTxnWithProto(0x2fb8a00, 0xc4205ec0f0, 0xc4204b8580, 0x0, 0x1, 0x7f4edef963289f72, 0x808961136b56b4be, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:126 +0x16a
github.com/cockroachdb/cockroach/pkg/internal/client.NewTxn(0x2fb8a00, 0xc4205ec0f0, 0xc4204b8580, 0x0, 0x1, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:116 +0x14e
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).prepare(0xc4209c2000, 0x2fb8a00, 0xc4205ec0f0, 0x2fbc100, 0xc42081e240, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_prepare.go:179 +0x2a4
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).addPreparedStmt(0xc4209c2000, 0x2fb8a00, 0xc4205ec0f0, 0x0, 0x0, 0x2fbc100, 0xc42081e240, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_prepare.go:122 +0xd4
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execPrepare(0xc4209c2000, 0x2fb8a00, 0xc4205ec0f0, 0x0, 0x0, 0x2fbc100, 0xc42081e240, 0xc4205ec090, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_prepare.go:55 +0x144
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc4209c2000, 0x2fb8a00, 0xc42045bef0, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1144 +0x2b63
github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).initConnEx.func1(0xc4209c2000, 0x2fb8a00, 0xc42045bef0, 0xc4203f6280, 0x0, 0xc4203206e0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:211 +0x8a
created by github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).initConnEx
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:200 +0x275

@couchand
Copy link
Contributor Author

@vilterp what's the status here? Have you made any progress on this?

@vilterp
Copy link
Contributor

vilterp commented Sep 19, 2018

Not yet, besides identifying the commit which seems to have introduced it. Hoping to look at it today.

@tbg
Copy link
Member

tbg commented Sep 25, 2018

Friendly ping. I see this in crashes every now and then and presumably this was released in betas already? It's relatively urgent to fix.

@vilterp
Copy link
Contributor

vilterp commented Sep 25, 2018

Hey Tobi, @couchand and I were debugging this yesterday and will be again today. We have a solid repro. The basic problem is that the authentication mux is running SQL with the internal executor before the Version cluster setting is initialized. The initialization of all the HTTP stuff is pretty spread out in server.Start; not sure why. Hoping to find a reordering such that everything the internal executor needs will be initialized by the time the endpoints are exposed.

@tbg
Copy link
Member

tbg commented Sep 25, 2018

Sounds good, thanks! FWIW, I'd be fine by me if you just added a way to check if the thing is initialized before you use it (or even just wait for it).

That's a wart, but as long as we're calling it that, I think it's defensible at this point in the cycle. Plus, the version thing is basically impossible to use correctly. Maybe all callers should implicitly wait until there's a sane way to write code that doesn't accidentally trigger this.

@vilterp
Copy link
Contributor

vilterp commented Sep 25, 2018

Hoping not to have to do something like that. Btw, noticed this method IsInitialized that seems to be a duplicate of HasBeenInitialized (IsInitialized added later):

// IsInitialized returns true if the cluster version has been initialized and is
// ready for use.
func (ecv *ExposedClusterVersion) IsInitialized() bool {
return *ecv.baseVersion.Load().(*ClusterVersion) != ClusterVersion{}
}
I can remove that while I'm here.

vilterp pushed a commit to vilterp/cockroach that referenced this issue Sep 25, 2018
This fixes cockroachdb#25771, in which the auth mux was using the internal executor
to run SQL before the cluster version, which the internal executor is
dependent upon, was initialized.

A preferable solution would be to not register any handlers using the
auth mux until the cluster setting is initialized, but it's not
immediately apparent what reordering of things in Server.Start would
achieve this without breaking anything else.

Release note: None
@vilterp
Copy link
Contributor

vilterp commented Sep 25, 2018

Ok, minimal fix in #30640. Would be better to reorder things s.t. that check is not necessary, but I'm a little afraid of rejiggering a bunch of things in Server.Start at this point in the cycle. Open to better fixes if we can think of them.

@vilterp
Copy link
Contributor

vilterp commented Sep 27, 2018

Discovered while working on a unit test to repro this that the crash only happens when the request contains a cookie; otherwise there is no session to look up and the auth mux never runs any SQL.

So, the user has to log in, then restart a node or start a new node at the same address and send it a request containing that login cookie to trigger this. Doesn't really change the severity, since it's probably likely to be logged in and looking at the admin UI while you restart and add new nodes.

benesch added a commit to benesch/cockroach that referenced this issue Sep 27, 2018
Teach the cluster-init test to reproduce the crash during bootup
described in issue cockroachdb#25771. The trick is to send requests with a session
cookie, as the admin UI would, while the cluster is waiting for init.

Release note: None
vilterp pushed a commit to vilterp/cockroach that referenced this issue Sep 27, 2018
Teach the cluster-init test to reproduce the crash during bootup
described in issue cockroachdb#25771. The trick is to send requests with a session
cookie, as the admin UI would, while the cluster is waiting for init.

Note that the crash was caused by the authenticationMux, which is used
to protect secure clusters, but this test uses an insecure cluster. This
is relying on a wart: an authenticationMux is installed for UI assets
whether or not the cluster is secure. This could change in the future. A
better test would use a secure cluster, however roachprod/roachtest do
not easily support that at this time, nor is it easy to write a unit
test that exercises the crash more robustly.

Release note: None
vilterp pushed a commit to vilterp/cockroach that referenced this issue Sep 27, 2018
Teach the cluster-init test to reproduce the crash during bootup
described in issue cockroachdb#25771. The trick is to send requests with a session
cookie, as the admin UI would, while the cluster is waiting for init.

Note that the crash was caused by the authenticationMux, which is used
to protect secure clusters, but this test uses an insecure cluster. This
is relying on a wart: an authenticationMux is installed for UI assets
whether or not the cluster is secure. This could change in the future. A
better test would use a secure cluster, however roachprod/roachtest do
not easily support that at this time, nor is it easy to write a unit
test that exercises the crash more robustly.

Release note: None
vilterp pushed a commit to vilterp/cockroach that referenced this issue Oct 1, 2018
Teach the cluster-init test to reproduce the crash during bootup
described in issue cockroachdb#25771. The trick is to send requests with a session
cookie, as the admin UI would, while the cluster is waiting for init.

Note that the crash was caused by the authenticationMux, which is used
to protect secure clusters, but this test uses an insecure cluster. This
is relying on a wart: an authenticationMux is installed for UI assets
whether or not the cluster is secure. This could change in the future. A
better test would use a secure cluster, however roachprod/roachtest do
not easily support that at this time, nor is it easy to write a unit
test that exercises the crash more robustly.

Release note: None
craig bot pushed a commit that referenced this issue Oct 1, 2018
30746: server/ui: fix crash caused by race in server initialization r=vilterp a=vilterp

Fixes #25771

See commits for details.

Co-authored-by: Pete Vilter <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
@craig craig bot closed this as completed in #30746 Oct 1, 2018
vilterp pushed a commit to vilterp/cockroach that referenced this issue Oct 1, 2018
Teach the cluster-init test to reproduce the crash during bootup
described in issue cockroachdb#25771. The trick is to send requests with a session
cookie, as the admin UI would, while the cluster is waiting for init.

Note that the crash was caused by the authenticationMux, which is used
to protect secure clusters, but this test uses an insecure cluster. This
is relying on a wart: an authenticationMux is installed for UI assets
whether or not the cluster is secure. This could change in the future. A
better test would use a secure cluster, however roachprod/roachtest do
not easily support that at this time, nor is it easy to write a unit
test that exercises the crash more robustly.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-server Relating to the KV-level RPC server C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants