Skip to content

mdw-nl/v6-km-studyathon

Repository files navigation

Federated Kaplan-Meier Curve Calculation with vantage6

This repository contains an implementation of the Kaplan-Meier curve calculation designed for federated learning environments via the vantage6 framework. It allows for the estimation of survival probabilities across distributed datasets without sharing the patient-specific information. This method supports privacy-preserving data analysis in medical research and other fields where event-time analysis is critical.

The algorithm operates within the vantage6 infrastructure, a platform supporting federated learning, to enable institutions to perform survival analysis while maintaining data privacy. The initial idea was based on contributions from Benedetta Gottardelli ([email protected]).

Follow the instructions in subsequent sections to set up and execute the federated Kaplan-Meier analysis.

Usage

This section provides a comprehensive guide on how to use the repository to perform federated Kaplan-Meier analysis, from initializing the client to executing the task and retrieving the results.

To perform Kaplan-Meier curve calculation in a federated learning context using vantage6, follow these instructions:

  1. Install vantage6 Client (if not already installed):
pip install vantage6-client
  1. Initialize vantage6 Client
from vantage6.client import Client

# Load your configuration settings from a file or environment
config = {
    'server_url': '<API_ENDPOINT>',
    'server_port': <API_PORT>,
    'server_api': '<API_VERSION>',
    'username': '<USERNAME>',
    'password': '<PASSWORD>',
    'organization_key': '<ORGANIZATION_PRIVATE_KEY>'
}

client = Client(config['server_url'], config['server_port'], config['server_api'])
client.authenticate(username=config['username'], password=config['password'])
client.setup_encryption(config['organization_key'])

Replace the placeholders in config with your actual configuration details.

  1. Define Algorithm Input
input_ = {
    'method': 'master',
    'kwargs': {
        'time_column_name': 'time_to_event',
        'censor_column_name': 'event_occurred',
        'organization_ids': [1, 2, 3], # Example organization IDs
        'bin_size': None  # Or a specific bin size
    }
}

Set your specific time and censor column names, organization IDs, and bin size if needed.

  1. Create and Run the Task
task = client.task.create(
    collaboration=3,  # Use your specific collaboration ID
    organizations=[1, 2, 3],  # List your organization IDs
    name='Kaplan-Meier Task',  # Give your task a specific name
    image='ghcr.io/mdw-nl/v6-km-studyathon:v1',  # Specify the desired algorithm Docker image version
    description='Survival analysis using Kaplan-Meier',  # Describe the task
    databases=[{'label': 'my_database_label'}],  # Use your database label
    input_=input_
)

Provide actual values for the collaboration, organizations, name, image, description, and databases fields.

  1. Monitor and Retrieve Results: Utilize the vantage6 client methods to check the status of the task and retrieve the results when the task is complete.

Ensure all prerequisites are met and configurations are set by referring to the 'Installation and Setup' section before proceeding with the above steps.

Data Format and Preprocessing

To ensure successful Kaplan-Meier curve calculation, databases at each node need to be structured with the necessary columns:

  • time_column_name: Indicates the time from the start point (e.g., diagnosis) to either an event of interest (e.g., death) or right censoring. Should be of a numeric dtype (integer or float).

  • censor_column_name: A binary column indicating whether the event of interest occurred (1) or if the data was censored (0). Needs to be of integer dtype.

Optionally, a patient_id column can be included as a unique identifier for each subject, but it is not required for the analysis.

Sample Table Structure:

Column Name Description Dtype Required
patient_id Unique identifier for each patient (optional) String No
time_to_event Duration until event of interest or censoring Numeric Yes
event_occurred Event occurrence indicator (1: yes, 0: no) Integer Yes
additional_column1 Description of optional additional data ... No
additional_column2 Description of optional additional data ... No
... ... ... ...

time_to_event refers to your time_column_name and event_occurred to your censor_column_name, as defined in the input parameters of the algorithm.

Preprocessing Steps:

  1. Confirm no missing values in numeric columns like time_column_name. Handle any missing data through imputation or exclusion before proceeding.

  2. Ensure censor_column_name is binary (containing only 0s and 1s) and of integer dtype.

  3. Perform any necessary data cleaning, normalization, or datatype conversion on additional columns according to the specifics of your study and requirements for the analysis.

Be mindful that any domain-specific preprocessing, such as adjusting time units or categorizing features, should be completed prior to analysis.

Follow these specifications to prepare your data correctly for a federated analysis with the Kaplan-Meier algorithm on vantage6.

Output Interpretation

The Kaplan-Meier curve calculation returns a DataFrame with the following columns, including their data types and descriptions:

Column Name Dtype Description
<time_column_name> Numeric (float or int) Timestamps of the events or censored data, based on the provided time data.
removed Integer Number of subjects removed from the risk set in each time interval.
observed Integer Observed number of events of interest (e.g., death or failure) at each timestamp.
censored Integer Number of subjects censored at each timestamp.
at_risk Integer Number of individuals at risk at each timestamp.
hazard Float Hazard rate at each timestamp, calculated as observed / at_risk.
survival_cdf Float Cumulative survival probability up to and including each timestamp.
  • Replace <time_column_name> with the column name you specified in the input configuration for the time data.

How to Interpret the Output:

  • <time_column_name> shows each recorded or estimated event/censoring timestamp, which is not an interval but discrete points in time.

  • observed provides the count of events that occurred, while censored shows how many subjects' data did not reach an event by the end of observation.

  • at_risk is critical as it denotes the number of subjects that could potentially experience the event at each timestamp.

  • The hazard rate gives an indication of the instant risk of event occurrence over time.

  • survival_cdf is the key metric representing the estimated probability of surviving beyond each timestamp in <time_column_name>.

The analysis is commonly graphed as the Kaplan-Meier curve plotting survival_cdf versus <time_column_name> to depict survival trends over time. Periods with a high censored count should be carefully interpreted, as they may affect the accuracy of the survival analysis.