Kubernetes is a large, growing community of volunteers, users, and vendors. The Kubernetes community has adopted this security disclosures and response policy to ensure we responsibly handle critical issues.
Table of Contents
- Product Security Committee (PSC)
- Disclosures
- Patch, Release, and Public Communication
- Private Distributors List
- Severity Thresholds - How We Do Vulnerability Scoring
Security vulnerabilities should be handled quickly and sometimes privately. The primary goal of this process is to reduce the total time users are vulnerable to publicly known exploits.
The Product Security Committee (PSC) is responsible for organizing the entire response including internal communication and external disclosure but will need help from relevant developers and release managers to successfully run this process.
The initial Product Security Committee will consist of volunteers subscribed to the private Kubernetes Security list. These are the people who have been involved in the initial discussion and volunteered:
- Brandon Philips (@philips) <[email protected]> [4096R/154343260542DF34]
- CJ Cullen (@cjcullen) <[email protected]>
- Tim Allclair (@tallclair) <[email protected]> [4096R/0x5E6F2E2DA760AF51]
- Jordan Liggitt (@liggitt) <[email protected]> [4096R/0x39928704103C7229]
- Joel Smith (@joelsmith) <[email protected]>
- Jonathan Pulsifer (@jonpulsifer) <[email protected]>
Associate members include:
- Micah Hausler (@micahhausler) <[email protected]>
- Luke Hinds (@lukehinds) <[email protected]>
- Swamy Shivaganga Nagaraju (@swamymsft) <[email protected]>
There are a number of contact points for the PSC and release managers in charge of security releases. Please use the correct forum for the best and fastest response.
| List or Group | Visibility | Uses |
|---|---|---|
| [email protected] | Private | Kubernetes security disclosures. This list is closely monitored and triaged by the PSC. See the disclosure guide for full details. |
| kubernetes-security-discuss Google Group | Public | Discussion about security disclosure handling, this document, and other updates. |
| [email protected] | Private | Release manager private discussion. All members are subscribed to [email protected]. |
| [email protected] | Private | PSC private discussion. All members are subscribed to [email protected]. |
New potential members of the PSC will first serve a minimum of a 3-month rotation in the Associate role.
These individuals will be nominated by members of the Steering Committee, the PSC, or the release manager lead and patch release manager.
The PSC has a goal of expanding to 7 members by the end of 2018 or early 2019. [Please delete this when this has been done :)]
To encourage diversity, members will also abide by a limit of at most 1/2 representation from any one company at any time. If representation changes due to job shifts or mergers, then members are encouraged to grow the team or replace themselves by mentoring new members. Being over the limit should not continue for longer than 12 months, to allow time to identify and mentor new members.
Selection of new members from the Associates will be done by lazy consensus.
- Use lazy consensus amongst members, with fallback on majority vote, for adding new people (up to 7) from the pool of Associate members.
- Use lazy consensus amongst the other members, with fallback on majority vote, to accept a step-down proposal.
Members may step down at any time and propose a replacement from the pool of eligible Associate members.
- Members should remain active and responsive in their Roles.
- Members taking an extended leave of 1 or more months should coordinate with other members to ensure the role is adequately staffed during the leave.
- Members going on leave for 1-3 months may identify a temporary replacement.
- Members of a role should remove any other members that have not communicated a leave of absence and either cannot be reached for more than 1 month or are not fulfilling their documented responsibilities for more than 1 month. This may be done through a super-majority vote of members.
Membership on the Product Security Committee is based on filling a specific role. These are defined below to make sure everyone "carries water".
Per issue, sees it through to the end. The sheep herder. A rotating role so we can make sure everyone helps and no one is stuck constantly with the sheep herding.
Makes sure the people who should be in "the know" (aka notified) are notified, and responds to reports that are not actually security issues, letting the OP know. This person is the escalation path for a bug if it is one. The on-call PSC member should triage the request. If the on-call member is unavailable then it is first come, first served on the alert escalation.
Making sure we can test the fixes appropriately. This person is the build cop. Who do you call when the cluster is not well or you gotta test something weird, or the release branch is all messed up? This person.
Handles public messaging around the bug. Documentation on how to upgrade. Changelog. Explaining to public the severity. Notifications of bugs sent to mailing lists etc. Requests CVEs.
A role for those wishing to join the PSC. They should not currently be a member of [email protected].
Their rotation will involve the following:
- lead disclosures that are publicly disclosed on a public GitHub issue tracker (often done because of reporter request, a low CVSS score, or design issue that requires long-term refactoring).
- assisting in process improvements, bug bounty administration, audits, or other non-disclosure activities
Each week has a primary & secondary for each role. If something comes up that week, primary owns it. Secondary is there if we can't get a hold of the primary, or if primary has a really good reason why they can’t handle it.
Each role has an ordered list of “who’s handling the next issue.” If something comes up, top of the list takes it. If we can’t contact them, or they have a good reason not to, we go down the list. We’d probably need some coordination of vacation schedules to make sure we don’t leave a role unstaffed.
Also included on the private Security Release Team list are all Release Managers.
It is the responsibility of the Release Managers to add and remove people from the Security Release Team group as Kubernetes minor releases are created and deprecated.
The Kubernetes Community asks that all suspected vulnerabilities be privately and responsibly disclosed via the Private Disclosure process available at https://kubernetes.io/security.
If you know of a publicly disclosed security vulnerability please IMMEDIATELY email [email protected] to inform the Product Security Committee (PSC) about the vulnerability so they may start the patch, release, and communication process.
If possible, the PSC will ask the person making the public report whether the issue can be handled via a private disclosure process. If the reporter declines, the PSC will move swiftly with the fix and release process. In extreme cases you can ask GitHub to delete the issue, but this generally isn't necessary and is unlikely to make a public disclosure less damaging.
For each vulnerability, a member of the PSC will volunteer to lead coordination with the Fix Team and Release Managers, and is responsible for sending disclosure emails to the rest of the community. This lead will be referred to as the Fix Lead.
The role of Fix Lead should rotate round-robin across the PSC.
All of the timelines below are suggestions and assume a Private Disclosure. The Fix Lead drives the schedule using their best judgment based on severity, development time, and release manager feedback. If the Fix Lead is dealing with a Public Disclosure all timelines become ASAP. If the fix relies on another upstream project's disclosure timeline, that will adjust the process as well. We will work with the upstream project to fit their timeline and best protect our users.
These steps should be completed within the first 24 hours of Disclosure.
- The Fix Lead will work quickly to identify relevant engineers from the affected projects and packages and CC those engineers into the disclosure thread. These selected developers are the Fix Team. A best guess is to invite all assignees in the OWNERS file from the affected packages.
- The Fix Lead will get the Fix Team access to private security repos in the kubernetes-security GitHub org to develop the fix.
Note: The kubernetes-security GitHub org is co-owned and viewable by the PSC and Kubernetes Release Managers. Management of the org is done by SIG Contributor Experience's GitHub management subproject.
These steps should be completed within 1-7 days of Disclosure.
- The Fix Lead and the Fix Team will create a CVSS score using the CVSS Calculator. They will also use the Severity Thresholds - How We Do Vulnerability Scoring to determine the effect and severity of the bug. The Fix Lead makes the final call on the calculated risk; it is better to move quickly than make the perfect assessment.
- The Fix Lead will request a CVE from the Kubernetes CVE Numbering Authority.
- The Fix Team will notify the Fix Lead that work on the fix branch is complete once there are LGTMs on all commits in the private repo from one or more relevant assignees in the relevant OWNERS file.
If the CVSS score is under ~4.0 (a low severity score) or the assessed risk is low the Fix Team can decide to slow the release process down in the face of holidays, developer bandwidth, etc. These decisions must be discussed on the [email protected] mailing list.
If the CVSS score is under ~7.0 (a medium severity score), the Fix Lead may choose to carry out the fix semi-publicly. This means that PRs are made directly in the public kubernetes/kubernetes repo, while restricting discussion of the security aspects to private channels. The fix lead will make the determination whether there would be user harm in handling the fix publicly that outweighs the benefits of open engagement with the community.
Note: CVSS is convenient but imperfect. Ultimately, the fix lead has discretion on classifying the severity of a vulnerability.
No matter the CVSS score, if the vulnerability requires User Interaction, especially in client components like kubectl, or otherwise has a straightforward, non-disruptive mitigation, the Fix Lead may choose to disclose the vulnerability before a fix is developed if they determine that users would be better off being warned against a specific interaction.
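The score bands above (roughly, under 4.0 low, under 7.0 medium) can be sketched as a simple mapping. This is an illustrative sketch of the policy's bands only, not an official tool; the function name and return labels are invented, and the Fix Lead always retains final discretion.

```python
# Illustrative sketch of the severity bands described above.
# The function name and return labels are invented for illustration;
# the Fix Lead has final discretion on severity and handling.
def handling_for_score(cvss_score: float) -> str:
    """Map a CVSS v3 base score to the handling guidance in this policy."""
    if cvss_score < 4.0:
        # Low severity: the Fix Team may slow the release down for
        # holidays, developer bandwidth, etc. (discussed on security@).
        return "may slow release"
    if cvss_score < 7.0:
        # Medium severity: the Fix Lead may carry out the fix
        # semi-publicly in the public kubernetes/kubernetes repo.
        return "semi-public fix allowed"
    # Higher severity: develop the fix in the private security repo.
    return "private fix"
```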
With the Fix Development underway the Fix Lead needs to come up with an overall communication plan for the wider community. This Disclosure process should begin after the Fix Team has developed a Fix or mitigation so that a realistic timeline can be communicated to users.
Optional Fix Disclosure to Private Distributors List (Completed within 1-14 days of Disclosure):
- The Fix Lead will make a determination with the help of the Fix Team if an issue is critical enough to require early disclosure to distributors. Generally this Private Distributor Disclosure process should be reserved for remotely exploitable or privilege escalation issues. Otherwise, this process can be skipped.
- The Fix Lead will email the patches to [email protected] so distributors can prepare builds to be available to users on the day of the issue's announcement. Distributors should read about the Private Distributors List to find out the requirements for being added to this list.
- What if a vendor breaks embargo? The PSC will assess the damage. The Fix Lead will make the call to release earlier or continue with the plan. When in doubt push forward and go public ASAP.
Fix Release Day (Completed within 1-21 days of Disclosure)
- The Private Distributors List will be notified at least 24 hours in advance of a pending release with the public messaging, date, and time of the announcement.
- The Fix Lead will cherry-pick the patches onto the master branch and all relevant release branches. The Fix Team will `/lgtm` and `/approve`.
- The Release Managers will merge these PRs as quickly as possible. Changes shouldn't be made to the commits at this point, to prevent potential conflicts with the patches sent to distributors, and conflicts as the fix is cherry-picked around branches.
- The Release Managers will ensure all the binaries are built, publicly available, and functional.
- The Fix Lead will announce the new releases, the CVE number, severity, and impact, and the location of the binaries to get wide distribution and user action. As much as possible this announcement should be actionable, and include any mitigating steps users can take prior to upgrading to a fixed version. The recommended target time is 4pm UTC on a non-Friday weekday. This means the announcement will be seen morning Pacific, early evening Europe, and late evening Asia. The announcement will be sent via the following channels:
- The Fix Lead will remove the Fix Team from the private security repo.
This list is used to provide actionable information to multiple distribution vendors at once.
See the private distributor list doc for more information.
These steps should be completed 1-3 days after the Release Date. The retrospective process should be blameless.
- The Fix Lead will send a retrospective of the process to [email protected] including details on everyone involved, the timeline of the process, links to relevant PRs that introduced the issue, if relevant, and any critiques of the response and release process.
- The Release Managers and Fix Team are also encouraged to send their own feedback on the process to [email protected]. Honest critique is the only way we are going to get good at this as a community.
The content presented below outlines basic criteria when considering the effect of bugs disclosed to the Product Security Committee and their severity. These apply mostly to items in "core" but could be abstractly applied to those in other repos as well.
The "server" pertains to any "server-side components" (see below) we distribute, but you can specifically think of `kube-apiserver` and `kubelet` as the main ones.
"Server-side components" pertains to:
- Components built from source in the `kubernetes` GitHub organization and all current organizations in use
- Container images under the `k8s.gcr.io` repository
- Containers included in the core addons directory
The server classification is usually not appropriate when user interaction is part of the exploitation process; those cases should fall under the client classifications, as "client" implies user interaction with a client. If a "Critical" vulnerability exists only in server components, and is exploited in a way that requires user interaction and results in the compromise of the server, the severity may be reduced from "Critical" to "High", in accordance with the definition of extensive user interaction presented at the start of the client severity pivot.
Network worms or unavoidable cases where the server is "compromised."
The ability to either execute arbitrary code or obtain more privilege than authorized.
- Remote Anonymous User
  - Unauthorized access. Examples:
    - read, write, or delete access to sensitive API objects
    - arbitrary writing/deletion to the file system or anywhere in the cluster that should not typically be modifiable
    - getting data from the server or server components
  - Execution of arbitrary code (remote code execution in the server)
This includes all write access violations.
Cases where an unprivileged user can locate and read highly sensitive information from anywhere on the system, including system information that was not intended or designed to be exposed.
- Remote Anonymous User
  - Unauthorized access. Examples:
    - Personally identifiable information (PII) disclosure (email addresses, cloud provider credentials)
    - Attacker can collect private data without user consent or in a covert fashion
    - Secrets data
    - System credentials
"Critical" attacks are downgraded to "High" when requiring adjacent network access.
Non-default (or non-typical) critical scenarios or cases where mitigations exist that can help prevent critical scenarios.
Must be "easy to exploit" by sending a small amount of data or be otherwise quickly induced.
- Persistent DoS. Examples:
  - Sending a single malicious request results in service failure
  - Sending a small number of requests that causes a service failure
- Temporary DoS. Examples:
  - Sending a small number of requests that causes the system to be unusable for a period of time
  - Server being down for a minute or longer
  - A single remote client consuming all available resources (sessions, memory) on a server by establishing sessions and keeping them open
- Persistent DoS against a high value asset. Example:
  - Sending a small number of requests that causes a service failure for a high value asset
The ability to either execute arbitrary code or obtain more privilege than authorized. This includes all write access violations.
- Authenticated User
  - Unauthorized access. Examples:
    - arbitrary writing to the file system or server components like etcd, where the user should not have the ability to write
    - reading or writing to API objects where the user should not be able to
    - getting data from the server or server components, where the user should not have the ability to read
    - execution of arbitrary code (like creating a pod or remote code execution attacks)
Breaking or bypassing any security feature provided. A vulnerability in a security feature is rated “High” by default, but the rating may be adjusted based on other considerations as documented here.
Examples:
- Disabling or bypassing a `NetworkPolicy` or `PodSecurityPolicy` without informing users or gaining consent
- Reconfiguring a `NetworkPolicy` and allowing connections to other processes without consent
An entity (computer, server, user, process) is able to masquerade as a specific entity (user or computer) of his/her choice.
Examples:
- Service masquerades as the API server for the cluster and effectively man-in-the-middle's the real API server.
- API server uses client certificate authentication (SSL) improperly to allow an attacker to be identified as any user of his/her choice
- RBAC bug that allows a malicious remote user to be seen as a different user of his or her choice
Cases where an anonymous user can easily read sensitive information on the system from specific locations, including system information that was not intended/designed to be exposed.
Examples:
- Targeted disclosure of the existence of an arbitrary file
- Arbitrary filesystem or application data that should not typically be accessed
- Arbitrary API resources that should not typically be accessed
- Workload & namespace names or identifying metadata
- System & workload logs that are not typically exposed
- Workload or user-created metrics
"High" attacks are downgraded to "Medium" when requiring local access or when there is a straightforward mitigation.
Cases where an anonymous user can easily read anonymous (non-sensitive) information on the system, including system metrics that were not intended to be exposed.
Examples:
- System metrics common to most Kubernetes installs that were not intended to be public
- Non-identifying workload metadata (e.g. UIDs, creationTimestamps, resourceVersions)
"Medium" attacks are downgraded to "Low" when there is a straightforward mitigation and the attack requires quite a few non-default or very atypical scenarios to trigger.
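The downgrade rules stated through this section (Critical to High for adjacent network access, High to Medium for local access or a straightforward mitigation, Medium to Low for a mitigated and very atypical attack) can be sketched as follows. This is an illustrative sketch only, with invented function and parameter names, not official policy tooling.

```python
# Illustrative sketch of the severity downgrades described in this section.
# Function and parameter names are invented for illustration only.
SEVERITIES = ["Low", "Medium", "High", "Critical"]

def downgrade(severity, adjacent_network=False, local_access=False,
              straightforward_mitigation=False, atypical_scenario=False):
    """Apply at most one downgrade step, per the rules above."""
    i = SEVERITIES.index(severity)
    if severity == "Critical" and adjacent_network:
        i -= 1  # Critical -> High: attack requires adjacent network access
    elif severity == "High" and (local_access or straightforward_mitigation):
        i -= 1  # High -> Medium: local access required or mitigation exists
    elif severity == "Medium" and straightforward_mitigation and atypical_scenario:
        i -= 1  # Medium -> Low: mitigated and needs very atypical scenarios
    return SEVERITIES[i]
```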
Temporary modification of data in a specific scenario that does not persist.
"User interaction" can only happen in a client-driven scenario. Using `kubectl` or a client library is considered user interaction.
The target is a user or user's computer in these scenarios.
The effect of user interaction is not an automatic one-level reduction in severity, but a reduction in certain circumstances where user interaction is required. The distinction exists to help differentiate fast-spreading and wormable attacks from those that are slowed down because the user must interact.
Network worms or unavoidable cases where the client is "compromised" without warnings or prompts.
The ability to either execute arbitrary code or obtain more privilege than authorized. This includes all write access violations.
- Remote Anonymous User
  - Unauthorized access. Examples:
    - read, write, and delete access to sensitive API objects
    - arbitrary writing/reading/deletion to/from the file system
  - Execution of arbitrary code (remote code execution on the client's system)
Cases where the attacker can locate and read information from anywhere on the system, including system information that was not intended or designed to be exposed.
- Remote Anonymous User
  - Unauthorized access. Examples:
    - Personally identifiable information (PII) disclosure (email addresses, local data)
    - Client "phoning home" without permission
Non-default critical scenarios or cases where mitigations exist that can help prevent critical scenarios.
System corruption DoS requires re-installation of the system and/or components.
Examples:
- Calling a `kubectl` command makes the local machine unbootable.
The ability to either execute arbitrary code or obtain more privilege than authorized. This includes all write access violations.
- Authenticated User
  - Unauthorized access. Examples:
    - arbitrary writing/deletion on the file system, where the user should not have the ability to do so
    - reading or writing to API objects where the user should not be able to
    - local low privilege user can elevate themselves to another user, administrator, or local system
  - Execution of arbitrary code (remote code execution via the client)
Breaking or bypassing any security feature provided. A vulnerability in a security feature is rated “High” by default, but the rating may be adjusted based on other considerations as documented here.
Examples:
- Disabling or bypassing RBAC without informing users or gaining consent
- Reconfiguring RBAC without consent
Modification of any user data or data used to make trust decisions in a common or default scenario where the modification persists. This includes permanent or persistent modification of any user or system data used in a common or default scenario.
Examples:
- Modification of application data files such as `~/.kube` in a common or default scenario
- Modification of cluster or objects without user consent in a common or default scenario
Modification of any user data or data used to make trust decisions in a specific scenario where the modification persists. This includes permanent or persistent modification of any user or system data used in a specific scenario.
Examples:
- Modification of application data files such as `~/.kube` in a specific scenario
- Modification of cluster or objects without user consent in a specific scenario
Temporary DoS requires restart of the client application.
Example:
- Running a `kubectl` command crashes itself or the API server in a way that is recoverable.
Temporary modification of data in a specific scenario that does not persist.
On the private network. They can access internal IP addresses.
An authenticated user can be:
- Unprivileged: authenticated, but with no privileges
- Privileged: authenticated, with some privileges
On the same physical host (or VM).
Remote implies from an arbitrary network location. The attacker can only access externally exposed, public facing IP addresses.
Anonymous implies the attacker requires no credentials or Authentication.
A mission critical component such that if/when it has failed the entire cluster is unusable. For example, if the master services/nodes are unreachable or in a failure mode, the entire cluster is basically unusable.
Failure in such a way that the system/service requires intervention from a human operator to recover.