diff --git a/examples/advanced/finance-end-to-end/README.md b/examples/advanced/finance-end-to-end/README.md
new file mode 100644
index 0000000000..d438085108
--- /dev/null
+++ b/examples/advanced/finance-end-to-end/README.md
@@ -0,0 +1,360 @@
+# End-to-End Process Illustration of Federated XGBoost Methods
+
+This example demonstrates an end-to-end process for credit card fraud detection using XGBoost.
+
+The original dataset is based on the [kaggle credit card fraud dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud).
+
+To make the end-to-end process more realistic for financial applications, we manually duplicated the records to extend the data time span from 2 days to over 2 years and added random transactional information. As our primary goal is to showcase the process, there is no need to focus too much on the data itself.
+
+The end-to-end process consists of the following steps:
+
+## Step 1: Data Preparation
+
+In a real-world application, this step is not necessary, since each site would already have its own data.
+
+* To prepare the data, we expand the credit card data by adding additional randomly generated columns,
+including sender and receiver BICs, currency, etc.
+* We then split the data based on the Sender BIC. Each sender represents one financial institution,
+thus serving as one site (client) for federated learning.
+
+We illustrate this step in the notebook [prepare_data](./prepare_data.ipynb). The resulting dataset looks like the following:
+
+![data](./figures/generated_data.png)
+
+Once we have this synthetic data, we would like to split it into:
+* historical data (the oldest data) -- 55%
+* training data -- 35%
+* test data -- the remaining 10%
+
+```
+Historical DataFrame size: 626575
+Training DataFrame size: 398729
+Testing DataFrame size: 113924
+```
+Next, we split the data among the different clients, i.e. the different Sender_BICs.
+For example, for Sender = Bank_1 with BIC = ZHSZUS33,
+the client directory is **ZHSZUS33_Bank_1**.
+
+For this site, we will have three files:
+```
+/tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/history.csv
+/tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/test.csv
+/tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/train.csv
+```
+![split_data](./figures/split_data.png)
+
+The Python code for data generation is located at [prepare_data.py](./utils/prepare_data.py).
+
+## Step 2: Feature Analysis
+
+In this stage, we would like to analyze the data, understand the features, and derive (and encode) secondary features that can be more useful for building the model.
+
+Towards this goal, there are two options:
+1. **Feature Enrichment**: This process involves adding new features based on the existing data. For example, we can calculate the average transaction amount for each currency and add this as a new feature.
+2. **Feature Encoding**: This process involves encoding the current features and transforming them into an embedding space via machine learning models. The model can be either pre-trained or trained with the candidate dataset.
+
+Considering that the only two numerical features in the dataset are "Amount" and "Time", we will perform feature enrichment first. Optionally, we can also perform feature encoding. In this example, we use a graph neural network (GNN): we train the GNN model in a federated, unsupervised fashion and then use the model to encode the features for all sites.
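+
+To make option 1 concrete, below is a minimal, hypothetical pandas sketch of such an enrichment. The column names follow the generated data shown above; the feature name `currency_avg_amount` is only illustrative, and the actual enrichment used in this example is described in Step 2.1 below.
+
+```
+import pandas as pd
+
+# Load one site's training data (path follows the layout created in Step 1)
+df = pd.read_csv("/tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/train.csv")
+
+# Option 1 (feature enrichment): average transaction amount per currency,
+# merged back onto every transaction as a new hand-crafted feature
+avg_amount = (
+    df.groupby("Currency")["Amount"]
+    .mean()
+    .rename("currency_avg_amount")   # illustrative feature name
+    .reset_index()
+)
+df = df.merge(avg_amount, on="Currency")
+```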
+
+### Step 2.1: Rule-based Feature Enrichment
+In this step, we enrich the data by adding a few new derived features to illustrate the process.
+Whether such enrichment makes sense is task and data dependent; essentially, this process adds hand-crafted features to the classifier inputs.
+
+#### Single-site operation example: enrichment
+Since all sites follow the same procedure, we only need to look at one site. For example, we will look at the site with
+the name "ZHSZUS33_Bank_1".
+
+The data enrichment process involves the following steps:
+
+1. **Grouping by Currency**: Calculate hist_trans_volume, hist_total_amount, and hist_average_amount for each currency.
+2. **Aggregation for Training and Test Data**: Aggregate the data in 1-hour intervals, grouped by currency. The aggregated value is then divided by hist_trans_volume, and this new column is named x2_y1.
+3. **Repeating for Beneficiary BIC**: Perform the same process for Beneficiary_BIC to generate another feature called x3_y2.
+4. **Merging Features**: Merge the two enriched features based on Time and Beneficiary_BIC.
+
+The resulting dataset looks like this:
+![enrich_data](./figures/enrichment.png)
+
+We save the enriched data into new CSV files:
+```
+/tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/train_enrichment.csv
+/tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/test_enrichment.csv
+```
+#### Single-site operation example: additional processing
+After feature enrichment, we can normalize the numerical features and perform one-hot encoding for the categorical
+features. Without loss of generality, we skip the categorical feature encoding in this example to avoid significantly increasing
+the file size (from 11 MB to over 2 GB).
+
+Similar to the feature enrichment process, we consider only one site for now. The steps are straightforward:
+we apply the scaler transformation to the numerical features and then merge them back with the categorical features.
+
+```
+    scaler = MinMaxScaler()
+
+    # Fit and transform the numerical data
+    numerical_normalized = pd.DataFrame(scaler.fit_transform(numerical_features), columns=numerical_features.columns)
+
+    # Combine the normalized numerical features with the categorical features
+    df_combined = pd.concat([categorical_features, numerical_normalized], axis=1)
+```
+The files are then saved with the "_normalized.csv" suffix:
+
+```
+/tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/train_normalized.csv
+/tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/test_normalized.csv
+```
+
+#### Federated Enrichment and Normalization for All Sites
+We can easily convert the notebook code into Python code for federated execution on each site.
+
+##### Task code
+To convert the single-site enrichment code to federated learning, refer to [enrich.py](./nvflare/enrich.py).
+
+The main execution flow is the following:
+```
+def main():
+    print("\n enrichment starts \n ")
+
+    args = define_parser()
+
+    input_dir = args.input_dir
+    output_dir = args.output_dir
+
+    site_name = args.site_name  # assumed: in the standalone version the site name comes from a CLI argument
+    print(f"\n {site_name =} \n ")
+
+    merged_dfs = enrichment(input_dir, site_name)
+
+    for ds_name in merged_dfs:
+        save_to_csv(merged_dfs[ds_name], output_dir, site_name, ds_name)
+
+```
+To change this code into federated ETL code, we just add a few lines:
+
+`flare.init()` to initialize the flare library,
+`etl_task = flare.receive()` to receive the global message from NVFlare,
+and `end_task = GenericTask()`, `flare.send(end_task)` to send a message back to the controller.
+
+```
+def main():
+    print("\n enrichment starts \n ")
+
+    args = define_parser()
+    flare.init()
+
+    input_dir = args.input_dir
+    output_dir = args.output_dir
+
+    site_name = flare.get_site_name()
+    print(f"\n {site_name =} \n ")
+
+    # receive the global message from NVFlare
+    etl_task = flare.receive()
+    merged_dfs = enrichment(input_dir, site_name)
+
+    for ds_name in merged_dfs:
+        save_to_csv(merged_dfs[ds_name], output_dir, site_name, ds_name)
+
+    # send a message back to the controller indicating the end of the task
+    end_task = GenericTask()
+    flare.send(end_task)
+```
+
+A similar adaptation is required for the normalization code; refer to [pre_process.py](./nvflare/pre_process.py) for details.
+
+##### Job code
+The job code is executed to trigger and dispatch the ETL tasks from the previous step.
+For this purpose, we wrote the following script: [enrich_job.py](./nvflare/enrich_job.py)
+
+```
+def main():
+    args = define_parser()
+
+    site_names = args.sites
+    work_dir = args.work_dir
+    job_name = args.job_name
+    task_script_path = args.task_script_path
+    task_script_args = args.task_script_args
+
+    job = FedJob(name=job_name)
+
+    # Define the enrich_ctrl workflow and send to server
+    enrich_ctrl = ETLController(task_name="enrich")
+    job.to(enrich_ctrl, "server", id="enrich")
+
+    # Add clients
+    for site_name in site_names:
+        executor = ScriptExecutor(task_script_path=task_script_path, task_script_args=task_script_args)
+        job.to(executor, site_name, tasks=["enrich"], gpu=0)
+
+    if work_dir:
+        print(f"{work_dir=}")
+        job.export_job(work_dir)
+
+    if not args.config_only:
+        job.simulator_run(work_dir)
+```
+Here we define an ETLController for the server and a ScriptExecutor for the client-side ETL script.
+
+Similarly, the normalization job code is in [pre_process_job.py](./nvflare/pre_process_job.py).
+
+### (Optional) Step 2.2: GNN Training & Feature Encoding
+Based on the raw features, or combined with the derived features from **Step 2.1**, we can use machine learning models to encode the features.
+In this example, we use a federated GNN to learn and generate the feature embeddings.
+
+First, we construct a graph based on the transaction data. Each node represents a transaction, and the edges represent the relationships between transactions. We then use the GNN to learn the embeddings of the nodes, which represent the transaction features.
+
+#### Single-site operation example: graph construction
+Since all transactions at a site share the same Sender_BIC, we use the following rules to define a graph edge:
+1. The two transactions have the same Receiver_BIC.
+2. The time difference between the two transactions is smaller than 6000.
+
+The resulting graph is shown below: essentially an undirected graph with transactions (identified by `UETR`) as nodes and edges connecting two nodes that satisfy the above two rules.
+![edge_map](./figures/edge_map.png)
+
+#### Single-site operation example: GNN training and encoding
+We use the graph constructed in the previous step to train the GNN model. The GNN model is trained in a federated, unsupervised fashion, and the embeddings are generated for each transaction.
+The GNN training procedure is similar to the unsupervised Protein Classification task in our [GNN example](../gnn/README.md), with customized data preparation steps.
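+
+For reference, a condensed sketch of this unsupervised training loop (GraphSAGE with link prediction, adapted from the accompanying [gnn_train_encode notebook](./notebooks/gnn_train_encode.ipynb)) is shown below; `node_features` and `edge_index` are assumed to be the tensors produced by the graph construction step.
+
+```
+import torch
+import torch.nn.functional as F
+from torch_geometric.data import Data
+from torch_geometric.loader import LinkNeighborLoader
+from torch_geometric.nn import GraphSAGE
+
+# node_features: [num_transactions, num_features], edge_index: [2, num_edges]
+train_data = Data(x=node_features, edge_index=edge_index)
+loader = LinkNeighborLoader(train_data, batch_size=2048, shuffle=True,
+                            neg_sampling_ratio=1.0, num_neighbors=[10, 10])
+
+model = GraphSAGE(in_channels=node_features.shape[1], hidden_channels=64,
+                  num_layers=2, out_channels=64)
+optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
+
+for epoch in range(1, 101):
+    model.train()
+    for batch in loader:
+        optimizer.zero_grad()
+        h = model(batch.x, batch.edge_index)
+        # inner product between the two endpoints of each sampled (positive/negative) edge
+        link_pred = (h[batch.edge_label_index[0]] * h[batch.edge_label_index[1]]).sum(dim=-1)
+        loss = F.binary_cross_entropy_with_logits(link_pred, batch.edge_label)
+        loss.backward()
+        optimizer.step()
+```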
+
+The results of the GNN training are:
+- a GNN model
+- the embeddings of the transactions (in this example, of dimension 64)
+![embedding](./figures/embeddings.png)
+
+#### Federated GNN Training and Encoding for All Sites
+Similar to Step 2.1, we can easily convert the notebook code into Python code for federated execution on each site. For simplicity, we will skip the code examples for this step.
+Please refer to the scripts:
+- [graph_construct.py](./nvflare/graph_construct.py) and [graph_construct_job.py](./nvflare/graph_construct_job.py) for graph construction
+- [gnn_train_encode.py](./nvflare/gnn_train_encode.py) and [gnn_train_encode_job.py](./nvflare/gnn_train_encode_job.py) for GNN training and encoding
+
+## Step 3: Federated Training of XGBoost
+Now that we have the enriched/encoded features, the last step is to run federated XGBoost over them.
+Below is the XGBoost job code:
+
+```
+def main():
+    args = define_parser()
+
+    site_names = args.sites
+    work_dir = args.work_dir
+    job_name = args.job_name
+    root_dir = args.input_dir
+    file_postfix = args.file_postfix
+
+    num_rounds = 10
+    early_stopping_rounds = 10
+    xgb_params = {
+        "max_depth": 8,
+        "eta": 0.1,
+        "objective": "binary:logistic",
+        "eval_metric": "auc",
+        "tree_method": "hist",
+        "nthread": 16,
+    }
+
+    job = FedJob(name=job_name)
+
+    # Define the controller workflow and send to server
+    controller = XGBFedController(
+        num_rounds=num_rounds,
+        training_mode="horizontal",
+        xgb_params=xgb_params,
+        xgb_options={"early_stopping_rounds": early_stopping_rounds},
+    )
+    job.to(controller, "server")
+
+    # Add clients
+    for site_name in site_names:
+        executor = FedXGBHistogramExecutor(data_loader_id="data_loader")
+        job.to(executor, site_name, gpu=0)
+        data_loader = CreditCardDataLoader(root_dir=root_dir, file_postfix=file_postfix)
+        job.to(data_loader, site_name, id="data_loader")
+
+    if work_dir:
+        job.export_job(work_dir)
+
+    if not args.config_only:
+        job.simulator_run(work_dir)
+```
+
+In this code, all we need to write is a customized `CreditCardDataLoader`, which is an `XGBDataLoader`;
+the rest of the code is handled by the XGBoost Controller and Executor. For simplicity, we only load the numerical features in this example.
+
+## End-to-end Experiment
+You can run this from the command line interface (CLI) or orchestrate it using a workflow tool such as Airflow.
+Here, we will demonstrate how to run this with the simulator. You can always export the job configuration and run
+it anywhere in a real deployment.
+
+Assuming you have already downloaded the credit card dataset and the creditcard.csv file is located in the current directory:
+
+* prepare data
+```
+python ./utils/prepare_data.py -i ./creditcard.csv -o /tmp/nvflare/xgb/credit_card
+```
+>Note: All Sender BICs are considered clients. They are:
+> * 'ZHSZUS33_Bank_1'
+> * 'SHSHKHH1_Bank_2'
+> * 'YXRXGB22_Bank_3'
+> * 'WPUWDEFF_Bank_4'
+> * 'YMNYFRPP_Bank_5'
+> * 'FBSFCHZH_Bank_6'
+> * 'YSYCESMM_Bank_7'
+> * 'ZNZZAU3M_Bank_8'
+> * 'HCBHSGSG_Bank_9'
+> * 'XITXUS33_Bank_10'
+> In total, there are 10 banks.
+
+* enrich data
+```
+cd nvflare
+python enrich_job.py -c 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'FBSFCHZH_Bank_6' 'YMNYFRPP_Bank_5' 'WPUWDEFF_Bank_4' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'YSYCESMM_Bank_7' 'ZHSZUS33_Bank_1' 'HCBHSGSG_Bank_9' -p enrich.py -a "-i /tmp/nvflare/xgb/credit_card/ -o /tmp/nvflare/xgb/credit_card/"
+cd ..
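+# Flags used above (and in the following job commands):
+#   -c  list of client site names (one per bank)
+#   -p  task script executed on each client
+#   -a  arguments forwarded to the task script (input/output directories)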
+``` + +* pre-process data +``` +cd nvflare +python pre_process_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -p pre_process.py -a "-i /tmp/nvflare/xgb/credit_card -o /tmp/nvflare/xgb/credit_card/" +cd .. +``` + +* construct graph +``` +cd nvflare +python graph_construct_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -p graph_construct.py -a "-i /tmp/nvflare/xgb/credit_card -o /tmp/nvflare/xgb/credit_card/" +cd .. +``` + +* GNN Training and Encoding +``` +cd nvflare +python gnn_train_encode_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -p gnn_train_encode.py -a "-i /tmp/nvflare/xgb/credit_card -o /tmp/nvflare/xgb/credit_card/" +cd .. +``` + + +* XGBoost Job + +We run XGBoost Job on two types of data: normalized, and GNN embeddings +For normalized data, we run the following command +``` +cd nvflare +python xgb_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -i /tmp/nvflare/xgb/credit_card -w /tmp/nvflare/workspace/xgb/credit_card/ +cd .. +``` +Below is the output of last round of training (starting round = 0) +``` +... +[9] eval-auc:0.67596 train-auc:0.70582 +``` +For GNN embeddings, we run the following command +``` +cd nvflare +python xgb_job_embed.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -i /tmp/nvflare/xgb/credit_card -w /tmp/nvflare/workspace/xgb/credit_card_embed/ +cd .. +``` +Below is the output of last round of training (starting round = 0) +``` +... +[9] eval-auc:0.53788 train-auc:0.61659 +``` +For this example, the normalized data performs better than the GNN embeddings. This is expected as the GNN embeddings are produced with randomly generated transactional information, which adds noise to the data. + diff --git a/examples/advanced/finance-end-to-end/feature_enrichment.ipynb b/examples/advanced/finance-end-to-end/feature_enrichment.ipynb deleted file mode 100644 index 884a0066b0..0000000000 --- a/examples/advanced/finance-end-to-end/feature_enrichment.ipynb +++ /dev/null @@ -1,1106 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "e6d10159-9c02-4bdd-ad6a-f9b7e19ac575", - "metadata": {}, - "source": [ - "## Feature Enrichment\n", - "\n", - "### Historical data enrichment\n", - "\n", - "Pick one client (Site, aka sender_BIC) to do the enrichment as every site will be the same process" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "7130bd7a-bda0-4592-818f-bd65c505baa3", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "site_input_dir = \"/tmp/dataset/horizontal_credit_fraud_data/\"\n", - "site_name = \"ZHSZUS33_Bank_1\"" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "9375ffaa-1143-43f5-b1a3-3ef45918e4bf", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
TimeAmountClassSender_BICReceiver_BICUETRCurrencyBeneficiary_BICCurrency_Country
00.02.690ZHSZUS33YSYCESMMR7PCTKF9R1PVGXRXU9AB3JAUDZNZZAU3MAustralia
1200.03.670ZHSZUS33ZNZZAU3M28P261NQ3D4WIZUY4RDXFOUSDXITXUS33United States
2900.03.680ZHSZUS33YMNYFRPP2XJ54L8ED31VMBC1MYIK8LAUDZNZZAU3MAustralia
31700.034.090ZHSZUS33XITXUS33Y3ZW8BUEF5UTB5LWVNEFPGGBPYXRXGB22United Kingdom
42900.020.530ZHSZUS33XITXUS33FHOWZR8Q77BXKIZHAC0781USDZHSZUS33United States
..............................
6255839325300.05.000ZHSZUS33FBSFCHZHEC9HYAUYQ3UARN1CMXER1CAUDZNZZAU3MAustralia
6255939325900.061.000ZHSZUS33ZNZZAU3M6CT5WHMATEO4Z6UYDECPWRUSDXITXUS33United States
6256039325900.01.000ZHSZUS33ZHSZUS33GFCUM49U6M2LRN5NBEB9PKGBPYXRXGB22United Kingdom
6256139327200.074.750ZHSZUS33ZHSZUS33BLP8GYMXG6JWR104DT3Z8DUSDZHSZUS33United States
6256239327800.00.990ZHSZUS33YMNYFRPP5QEHQCHK8JFTVNCYC4KQKGSGDHCBHSGSGSingapore
\n", - "

62563 rows × 9 columns

\n", - "
" - ], - "text/plain": [ - " Time Amount Class Sender_BIC Receiver_BIC \\\n", - "0 0.0 2.69 0 ZHSZUS33 YSYCESMM \n", - "1 200.0 3.67 0 ZHSZUS33 ZNZZAU3M \n", - "2 900.0 3.68 0 ZHSZUS33 YMNYFRPP \n", - "3 1700.0 34.09 0 ZHSZUS33 XITXUS33 \n", - "4 2900.0 20.53 0 ZHSZUS33 XITXUS33 \n", - "... ... ... ... ... ... \n", - "62558 39325300.0 5.00 0 ZHSZUS33 FBSFCHZH \n", - "62559 39325900.0 61.00 0 ZHSZUS33 ZNZZAU3M \n", - "62560 39325900.0 1.00 0 ZHSZUS33 ZHSZUS33 \n", - "62561 39327200.0 74.75 0 ZHSZUS33 ZHSZUS33 \n", - "62562 39327800.0 0.99 0 ZHSZUS33 YMNYFRPP \n", - "\n", - " UETR Currency Beneficiary_BIC Currency_Country \n", - "0 R7PCTKF9R1PVGXRXU9AB3J AUD ZNZZAU3M Australia \n", - "1 28P261NQ3D4WIZUY4RDXFO USD XITXUS33 United States \n", - "2 2XJ54L8ED31VMBC1MYIK8L AUD ZNZZAU3M Australia \n", - "3 Y3ZW8BUEF5UTB5LWVNEFPG GBP YXRXGB22 United Kingdom \n", - "4 FHOWZR8Q77BXKIZHAC0781 USD ZHSZUS33 United States \n", - "... ... ... ... ... \n", - "62558 EC9HYAUYQ3UARN1CMXER1C AUD ZNZZAU3M Australia \n", - "62559 6CT5WHMATEO4Z6UYDECPWR USD XITXUS33 United States \n", - "62560 GFCUM49U6M2LRN5NBEB9PK GBP YXRXGB22 United Kingdom \n", - "62561 BLP8GYMXG6JWR104DT3Z8D USD ZHSZUS33 United States \n", - "62562 5QEHQCHK8JFTVNCYC4KQKG SGD HCBHSGSG Singapore \n", - "\n", - "[62563 rows x 9 columns]" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import os\n", - "import random\n", - "import string\n", - "\n", - "import pandas as pd\n", - "history_file_name = os.path.join(site_input_dir, site_name,\"history.csv\" )\n", - "df_history = pd.read_csv(history_file_name)\n", - "df_history" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "3fe8e513-f041-4165-88b1-3b21607ca734", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Currencyhist_trans_volumehist_total_amounthist_average_amount
0AUD125721094630.7587.068943
1CHF124941090937.4687.316909
2GBP124961121443.9989.744237
3SGD124601121692.5090.023475
4USD125411124650.2389.677875
\n", - "
" - ], - "text/plain": [ - " Currency hist_trans_volume hist_total_amount hist_average_amount\n", - "0 AUD 12572 1094630.75 87.068943\n", - "1 CHF 12494 1090937.46 87.316909\n", - "2 GBP 12496 1121443.99 89.744237\n", - "3 SGD 12460 1121692.50 90.023475\n", - "4 USD 12541 1124650.23 89.677875" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "\n", - "\n", - "history_summary = df_history.groupby('Currency').agg(\n", - " hist_trans_volume=('UETR', 'count'),\n", - " hist_total_amount=('Amount', 'sum'),\n", - " hist_average_amount=('Amount', 'mean')\n", - ").reset_index()\n", - "\n", - "history_summary" - ] - }, - { - "cell_type": "markdown", - "id": "025ac920-c1c3-401f-b420-18c39b7d04d2", - "metadata": {}, - "source": [ - "# Enrich Feature with Currency" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "7aa07b6d-dc96-45e6-a467-8c770cafb84e", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "dataset_names = [\"train\", \"test\"]\n", - "results = {}\n", - "\n", - "temp_ds_df = {}\n", - "temp_resampled_df = {}\n", - "\n", - "\n", - "for ds_name in dataset_names:\n", - " file_name = os.path.join(site_input_dir, site_name , f\"{ds_name}.csv\" )\n", - " ds_df = pd.read_csv(file_name)\n", - " ds_df['Time'] = pd.to_datetime(ds_df['Time'], unit='s')\n", - "\n", - " # Set the Time column as the index\n", - " ds_df.set_index('Time', inplace=True)\n", - " \n", - " resampled_df = ds_df.resample('1H').agg(\n", - " trans_volume=('UETR', 'count'),\n", - " total_amount=('Amount', 'sum'),\n", - " average_amount=('Amount', 'mean')\n", - " ).reset_index()\n", - " \n", - " temp_ds_df[ds_name] = ds_df\n", - " temp_resampled_df[ds_name] = resampled_df\n", - " \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a2e86bc5-e8ad-41f5-b343-29595a378c03", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "for ds_name in dataset_names:\n", - " \n", - " ds_df = temp_ds_df[ds_name]\n", - " resampled_df = temp_resampled_df[ds_name]\n", - " \n", - " c_df = ds_df[['Currency']].resample('1H').agg({'Currency': 'first'}).reset_index()\n", - " # Add Currency_Country to the resampled data by joining with the original DataFrame\n", - " resampled_df2 = pd.merge(resampled_df, \n", - " c_df,\n", - " on='Time'\n", - " )\n", - " resampled_df3 = pd.merge(resampled_df2, \n", - " history_summary,\n", - " on='Currency'\n", - " )\n", - " resampled_df4 = resampled_df3.copy()\n", - " resampled_df4['x2_y1'] = resampled_df4['average_amount']/resampled_df4['hist_trans_volume']\n", - " \n", - " ds_df = ds_df.sort_values('Time')\n", - " resampled_df4 = resampled_df4.sort_values('Time')\n", - " merged_df = pd.merge_asof(ds_df, resampled_df4, on='Time' )\n", - " \n", - " merged_df = merged_df.drop(columns=['Currency_y']).rename(columns={'Currency_x': 'Currency'})\n", - "\n", - " \n", - " results[ds_name] = merged_df\n", - " \n", - " \n", - " \n", - "\n", - "print(results)" - ] - }, - { - "cell_type": "markdown", - "id": "7051468f-2de0-4e41-a227-7fad4c9110af", - "metadata": { - "tags": [] - }, - "source": [ - "# Enrich feature for beneficiary country" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "605095b7-a514-4346-b984-3590d79d13e4", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Beneficiary_BIChist_trans_volumehist_total_amounthist_average_amount
0FBSFCHZH124941090937.4687.316909
1HCBHSGSG124601121692.5090.023475
2XITXUS336211572653.9392.199957
3YXRXGB22124961121443.9989.744237
4ZHSZUS336330551996.3087.203207
5ZNZZAU3M125721094630.7587.068943
\n", - "
" - ], - "text/plain": [ - " Beneficiary_BIC hist_trans_volume hist_total_amount hist_average_amount\n", - "0 FBSFCHZH 12494 1090937.46 87.316909\n", - "1 HCBHSGSG 12460 1121692.50 90.023475\n", - "2 XITXUS33 6211 572653.93 92.199957\n", - "3 YXRXGB22 12496 1121443.99 89.744237\n", - "4 ZHSZUS33 6330 551996.30 87.203207\n", - "5 ZNZZAU3M 12572 1094630.75 87.068943" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "\n", - "history_summary2 = df_history.groupby('Beneficiary_BIC').agg(\n", - " hist_trans_volume=('UETR', 'count'),\n", - " hist_total_amount=('Amount', 'sum'),\n", - " hist_average_amount=('Amount', 'mean')\n", - ").reset_index()\n", - "\n", - "history_summary2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "edabd7be-4864-4964-9e25-df543d5985c6", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "dataset_names = [\"train\", \"test\"]\n", - "results2 = {}\n", - "for ds_name in dataset_names:\n", - " ds_df = temp_ds_df[ds_name]\n", - " resampled_df = temp_resampled_df[ds_name]\n", - " \n", - " c_df = ds_df[['Beneficiary_BIC']].resample('1H').agg({'Beneficiary_BIC': 'first'}).reset_index()\n", - " \n", - " # Add Beneficiary_BIC to the resampled data by joining with the original DataFrame\n", - " resampled_df2 = pd.merge(resampled_df, \n", - " c_df,\n", - " on='Time'\n", - " )\n", - " \n", - " resampled_df3 = pd.merge(resampled_df2, \n", - " history_summary2,\n", - " on='Beneficiary_BIC'\n", - " )\n", - " \n", - " \n", - " resampled_df4 = resampled_df3.copy()\n", - " resampled_df4['x3_y2'] = resampled_df4['average_amount']/resampled_df4['hist_trans_volume']\n", - " \n", - " ds_df = ds_df.sort_values('Time')\n", - " resampled_df4 = resampled_df4.sort_values('Time')\n", - "\n", - " merged_df2 = pd.merge_asof(ds_df, resampled_df4, on='Time' )\n", - " merged_df2 = merged_df2.drop(columns=['Beneficiary_BIC_y']).rename(columns={'Beneficiary_BIC_x': 'Beneficiary_BIC'})\n", - " \n", - " \n", - " results2[ds_name] = merged_df2\n", - "\n", - "print(results2)" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "a44309a2-e252-458d-a9dc-2691aea9360f", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/train_enrichment.csv\n", - "/tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/test_enrichment.csv\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
TimeClassAmountSender_BICReceiver_BICUETRCurrencyBeneficiary_BICCurrency_Countrytrans_volumetotal_amountaverage_amounthist_trans_volumehist_total_amounthist_average_amountx2_y1x3_y2
01971-04-01 04:30:000348.06ZHSZUS33YXRXGB22MV2B0B0S1NUTY8OCOEQ2QEUSDXITXUS33United States4422.18105.545125411124650.2389.6778750.0084160.016993
11971-04-01 04:35:0002.69ZHSZUS33YMNYFRPPCQD9INGI7GJATKWRK0D44ZSGDHCBHSGSGSingapore4422.18105.545125411124650.2389.6778750.0084160.016993
21971-04-01 04:40:00016.63ZHSZUS33XITXUS33IJXYXLV8SF72RU3MRSJ542CHFFBSFCHZHSwitzerland4422.18105.545125411124650.2389.6778750.0084160.016993
31971-04-01 04:51:40054.80ZHSZUS33XITXUS33B1850ZUIHTMT61N7HMIZYMCHFFBSFCHZHSwitzerland4422.18105.545125411124650.2389.6778750.0084160.016993
41971-04-01 05:16:40031.96ZHSZUS33ZHSZUS334BBLS9B31LWHZFF17RODX1GBPYXRXGB22United Kingdom4292.6473.160124961121443.9989.7442370.0058550.005855
......................................................
408041972-03-10 19:01:40012.99ZHSZUS33WPUWDEFFEBY8SA8UZOWNNJ2X7OUBZ2USDXITXUS33United States112.9912.990125411124650.2389.6778750.0010360.002091
408051972-03-10 21:30:00052.34ZHSZUS33YXRXGB223D4772259A6PY7Q7XVJ302GBPYXRXGB22United Kingdom2272.62136.310124961121443.9989.7442370.0109080.010908
408061972-03-10 21:36:400220.28ZHSZUS33YSYCESMMZ5VK0S69KASH3B82M6W5XVUSDZHSZUS33United States2272.62136.310124961121443.9989.7442370.0109080.010908
408071972-03-10 22:30:00060.50ZHSZUS33YXRXGB22HA4WJAB98YR8M9FIE0C2A1USDXITXUS33United States285.2942.645125411124650.2389.6778750.0034000.006866
408081972-03-10 22:58:20024.79ZHSZUS33ZHSZUS339SJQ6WVX8CGS0P1DYYGQ45GBPYXRXGB22United Kingdom285.2942.645125411124650.2389.6778750.0034000.006866
\n", - "

40809 rows × 17 columns

\n", - "
" - ], - "text/plain": [ - " Time Class Amount Sender_BIC Receiver_BIC \\\n", - "0 1971-04-01 04:30:00 0 348.06 ZHSZUS33 YXRXGB22 \n", - "1 1971-04-01 04:35:00 0 2.69 ZHSZUS33 YMNYFRPP \n", - "2 1971-04-01 04:40:00 0 16.63 ZHSZUS33 XITXUS33 \n", - "3 1971-04-01 04:51:40 0 54.80 ZHSZUS33 XITXUS33 \n", - "4 1971-04-01 05:16:40 0 31.96 ZHSZUS33 ZHSZUS33 \n", - "... ... ... ... ... ... \n", - "40804 1972-03-10 19:01:40 0 12.99 ZHSZUS33 WPUWDEFF \n", - "40805 1972-03-10 21:30:00 0 52.34 ZHSZUS33 YXRXGB22 \n", - "40806 1972-03-10 21:36:40 0 220.28 ZHSZUS33 YSYCESMM \n", - "40807 1972-03-10 22:30:00 0 60.50 ZHSZUS33 YXRXGB22 \n", - "40808 1972-03-10 22:58:20 0 24.79 ZHSZUS33 ZHSZUS33 \n", - "\n", - " UETR Currency Beneficiary_BIC Currency_Country \\\n", - "0 MV2B0B0S1NUTY8OCOEQ2QE USD XITXUS33 United States \n", - "1 CQD9INGI7GJATKWRK0D44Z SGD HCBHSGSG Singapore \n", - "2 IJXYXLV8SF72RU3MRSJ542 CHF FBSFCHZH Switzerland \n", - "3 B1850ZUIHTMT61N7HMIZYM CHF FBSFCHZH Switzerland \n", - "4 4BBLS9B31LWHZFF17RODX1 GBP YXRXGB22 United Kingdom \n", - "... ... ... ... ... \n", - "40804 EBY8SA8UZOWNNJ2X7OUBZ2 USD XITXUS33 United States \n", - "40805 3D4772259A6PY7Q7XVJ302 GBP YXRXGB22 United Kingdom \n", - "40806 Z5VK0S69KASH3B82M6W5XV USD ZHSZUS33 United States \n", - "40807 HA4WJAB98YR8M9FIE0C2A1 USD XITXUS33 United States \n", - "40808 9SJQ6WVX8CGS0P1DYYGQ45 GBP YXRXGB22 United Kingdom \n", - "\n", - " trans_volume total_amount average_amount hist_trans_volume \\\n", - "0 4 422.18 105.545 12541 \n", - "1 4 422.18 105.545 12541 \n", - "2 4 422.18 105.545 12541 \n", - "3 4 422.18 105.545 12541 \n", - "4 4 292.64 73.160 12496 \n", - "... ... ... ... ... \n", - "40804 1 12.99 12.990 12541 \n", - "40805 2 272.62 136.310 12496 \n", - "40806 2 272.62 136.310 12496 \n", - "40807 2 85.29 42.645 12541 \n", - "40808 2 85.29 42.645 12541 \n", - "\n", - " hist_total_amount hist_average_amount x2_y1 x3_y2 \n", - "0 1124650.23 89.677875 0.008416 0.016993 \n", - "1 1124650.23 89.677875 0.008416 0.016993 \n", - "2 1124650.23 89.677875 0.008416 0.016993 \n", - "3 1124650.23 89.677875 0.008416 0.016993 \n", - "4 1121443.99 89.744237 0.005855 0.005855 \n", - "... ... ... ... ... 
\n", - "40804 1124650.23 89.677875 0.001036 0.002091 \n", - "40805 1121443.99 89.744237 0.010908 0.010908 \n", - "40806 1121443.99 89.744237 0.010908 0.010908 \n", - "40807 1124650.23 89.677875 0.003400 0.006866 \n", - "40808 1124650.23 89.677875 0.003400 0.006866 \n", - "\n", - "[40809 rows x 17 columns]" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "final_results = {}\n", - "for name in results:\n", - " df = results[name]\n", - " df2 = results2[name]\n", - " df3 = df2[[\"Time\", \"Beneficiary_BIC\", \"x3_y2\"]].copy()\n", - " df4 = pd.merge(df, df3, on=['Time', 'Beneficiary_BIC'])\n", - " final_results[name] = df4\n", - "\n", - " \n", - "for name in final_results:\n", - " site_dir = os.path.join(site_input_dir, site_name)\n", - " os.makedirs(site_dir, exist_ok=True)\n", - " enrich_file_name = os.path.join(site_dir, f\"{name}_enrichment.csv\")\n", - " print(enrich_file_name)\n", - " final_results[name].to_csv(enrich_file_name) \n", - " \n", - "final_results[\"train\"]" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "47c958c3-bf73-4ab3-a66f-414be10870ea", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[01;34m/tmp/dataset/horizontal_credit_fraud_data/\u001b[0m\n", - "├── \u001b[01;34mFBSFCHZH_Bank_6\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mHCBHSGSG_Bank_9\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── history.csv\n", - "├── \u001b[01;34mSHSHKHH1_Bank_2\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── test.csv\n", - "├── train.csv\n", - "├── \u001b[01;34mWPUWDEFF_Bank_4\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mXITXUS33_Bank_10\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mYMNYFRPP_Bank_5\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mYSYCESMM_Bank_7\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mYXRXGB22_Bank_3\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mZHSZUS33_Bank_1\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   ├── test_enrichment.csv\n", - "│   ├── train.csv\n", - "│   └── train_enrichment.csv\n", - "└── \u001b[01;34mZNZZAU3M_Bank_8\u001b[0m\n", - " ├── history.csv\n", - " ├── test.csv\n", - " └── train.csv\n", - "\n", - "10 directories, 35 files\n" - ] - } - ], - "source": [ - "! tree {site_input_dir}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "791ba1db-0ccf-4b31-b838-828d8c6a98a6", - "metadata": {}, - "outputs": [], - "source": [ - "ls -al /tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eae3d95a-180a-4fb6-b006-1fc1c144c5c4", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "! 
find /tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/ -exec wc -l {} \\;" - ] - }, - { - "cell_type": "markdown", - "id": "02b30e6d-6433-4b42-9950-d1ca3d83e697", - "metadata": {}, - "source": [] - }, - { - "cell_type": "markdown", - "id": "f9966065-80cb-4f85-adab-8c44f01fc8d1", - "metadata": {}, - "source": [ - "Let's go back to the [XGBoost Notebook](./xgboost.ipynb)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "28efd726-3a92-49bd-ac4f-b70627f1df57", - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "85517284-0593-4c5a-ab02-cf31024b88db", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.19" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/advanced/finance-end-to-end/figures/edge_map.png b/examples/advanced/finance-end-to-end/figures/edge_map.png new file mode 100644 index 0000000000..62c0cc3b3c Binary files /dev/null and b/examples/advanced/finance-end-to-end/figures/edge_map.png differ diff --git a/examples/advanced/finance-end-to-end/figures/embeddings.png b/examples/advanced/finance-end-to-end/figures/embeddings.png new file mode 100644 index 0000000000..2864d580ee Binary files /dev/null and b/examples/advanced/finance-end-to-end/figures/embeddings.png differ diff --git a/examples/advanced/finance-end-to-end/figures/enrichment.png b/examples/advanced/finance-end-to-end/figures/enrichment.png new file mode 100644 index 0000000000..2d7b8c2cca Binary files /dev/null and b/examples/advanced/finance-end-to-end/figures/enrichment.png differ diff --git a/examples/advanced/finance-end-to-end/figures/generated_data.png b/examples/advanced/finance-end-to-end/figures/generated_data.png new file mode 100644 index 0000000000..46a1aeef1f Binary files /dev/null and b/examples/advanced/finance-end-to-end/figures/generated_data.png differ diff --git a/examples/advanced/finance-end-to-end/figures/split_data.png b/examples/advanced/finance-end-to-end/figures/split_data.png new file mode 100644 index 0000000000..b717a6cb5d Binary files /dev/null and b/examples/advanced/finance-end-to-end/figures/split_data.png differ diff --git a/examples/advanced/finance-end-to-end/images/enrichment.png b/examples/advanced/finance-end-to-end/images/enrichment.png deleted file mode 100644 index 68f70b57e3..0000000000 Binary files a/examples/advanced/finance-end-to-end/images/enrichment.png and /dev/null differ diff --git a/examples/advanced/finance-end-to-end/images/generated_data.png b/examples/advanced/finance-end-to-end/images/generated_data.png deleted file mode 100644 index 4428a5f12d..0000000000 Binary files a/examples/advanced/finance-end-to-end/images/generated_data.png and /dev/null differ diff --git a/examples/advanced/finance-end-to-end/images/split_data.png b/examples/advanced/finance-end-to-end/images/split_data.png deleted file mode 100644 index b0d739bfc4..0000000000 Binary files a/examples/advanced/finance-end-to-end/images/split_data.png and /dev/null differ diff --git a/examples/advanced/finance-end-to-end/notebooks/feature_enrichment.ipynb 
b/examples/advanced/finance-end-to-end/notebooks/feature_enrichment.ipynb new file mode 100644 index 0000000000..13ca766500 --- /dev/null +++ b/examples/advanced/finance-end-to-end/notebooks/feature_enrichment.ipynb @@ -0,0 +1,319 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "e6d10159-9c02-4bdd-ad6a-f9b7e19ac575", + "metadata": {}, + "source": [ + "## Feature Enrichment\n", + "\n", + "### Historical data enrichment\n", + "\n", + "Pick one client (Site, aka sender_BIC) to do the enrichment as every site will be the same process" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7130bd7a-bda0-4592-818f-bd65c505baa3", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "site_input_dir = \"/tmp/dataset/horizontal_credit_fraud_data/\"\n", + "site_name = \"ZHSZUS33_Bank_1\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9375ffaa-1143-43f5-b1a3-3ef45918e4bf", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "import random\n", + "import string\n", + "\n", + "import pandas as pd\n", + "history_file_name = os.path.join(site_input_dir, site_name,\"history.csv\" )\n", + "df_history = pd.read_csv(history_file_name)\n", + "df_history" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3fe8e513-f041-4165-88b1-3b21607ca734", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "history_summary = df_history.groupby('Currency').agg(\n", + " hist_trans_volume=('UETR', 'count'),\n", + " hist_total_amount=('Amount', 'sum'),\n", + " hist_average_amount=('Amount', 'mean')\n", + ").reset_index()\n", + "\n", + "history_summary" + ] + }, + { + "cell_type": "markdown", + "id": "025ac920-c1c3-401f-b420-18c39b7d04d2", + "metadata": {}, + "source": [ + "# Enrich Feature with Currency" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7aa07b6d-dc96-45e6-a467-8c770cafb84e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "dataset_names = [\"train\", \"test\"]\n", + "results = {}\n", + "\n", + "temp_ds_df = {}\n", + "temp_resampled_df = {}\n", + "\n", + "\n", + "for ds_name in dataset_names:\n", + " file_name = os.path.join(site_input_dir, site_name , f\"{ds_name}.csv\" )\n", + " ds_df = pd.read_csv(file_name)\n", + " ds_df['Time'] = pd.to_datetime(ds_df['Time'], unit='s')\n", + "\n", + " # Set the Time column as the index\n", + " ds_df.set_index('Time', inplace=True)\n", + " \n", + " resampled_df = ds_df.resample('1H').agg(\n", + " trans_volume=('UETR', 'count'),\n", + " total_amount=('Amount', 'sum'),\n", + " average_amount=('Amount', 'mean')\n", + " ).reset_index()\n", + " \n", + " temp_ds_df[ds_name] = ds_df\n", + " temp_resampled_df[ds_name] = resampled_df\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2e86bc5-e8ad-41f5-b343-29595a378c03", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "for ds_name in dataset_names:\n", + " \n", + " ds_df = temp_ds_df[ds_name]\n", + " resampled_df = temp_resampled_df[ds_name]\n", + " \n", + " c_df = ds_df[['Currency']].resample('1H').agg({'Currency': 'first'}).reset_index()\n", + " # Add Currency_Country to the resampled data by joining with the original DataFrame\n", + " resampled_df2 = pd.merge(resampled_df, \n", + " c_df,\n", + " on='Time'\n", + " )\n", + " resampled_df3 = pd.merge(resampled_df2, \n", + " history_summary,\n", + " on='Currency'\n", + " )\n", + " resampled_df4 = resampled_df3.copy()\n", + " 
resampled_df4['x2_y1'] = resampled_df4['average_amount']/resampled_df4['hist_trans_volume']\n", + " \n", + " ds_df = ds_df.sort_values('Time')\n", + " resampled_df4 = resampled_df4.sort_values('Time')\n", + " \n", + " merged_df = pd.merge_asof(ds_df, resampled_df4, on='Time' )\n", + " merged_df = merged_df.drop(columns=['Currency_y']).rename(columns={'Currency_x': 'Currency'})\n", + " \n", + " results[ds_name] = merged_df\n", + " \n", + "print(results)" + ] + }, + { + "cell_type": "markdown", + "id": "7051468f-2de0-4e41-a227-7fad4c9110af", + "metadata": { + "tags": [] + }, + "source": [ + "# Enrich feature for beneficiary country" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "605095b7-a514-4346-b984-3590d79d13e4", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "\n", + "history_summary2 = df_history.groupby('Beneficiary_BIC').agg(\n", + " hist_trans_volume=('UETR', 'count'),\n", + " hist_total_amount=('Amount', 'sum'),\n", + " hist_average_amount=('Amount', 'mean')\n", + ").reset_index()\n", + "\n", + "history_summary2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "edabd7be-4864-4964-9e25-df543d5985c6", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "dataset_names = [\"train\", \"test\"]\n", + "results2 = {}\n", + "for ds_name in dataset_names:\n", + " ds_df = temp_ds_df[ds_name]\n", + " resampled_df = temp_resampled_df[ds_name]\n", + " \n", + " c_df = ds_df[['Beneficiary_BIC']].resample('1H').agg({'Beneficiary_BIC': 'first'}).reset_index()\n", + " \n", + " # Add Beneficiary_BIC to the resampled data by joining with the original DataFrame\n", + " resampled_df2 = pd.merge(resampled_df, \n", + " c_df,\n", + " on='Time'\n", + " )\n", + " \n", + " resampled_df3 = pd.merge(resampled_df2, \n", + " history_summary2,\n", + " on='Beneficiary_BIC'\n", + " )\n", + " \n", + " \n", + " resampled_df4 = resampled_df3.copy()\n", + " resampled_df4['x3_y2'] = resampled_df4['average_amount']/resampled_df4['hist_trans_volume']\n", + " \n", + " ds_df = ds_df.sort_values('Time')\n", + " resampled_df4 = resampled_df4.sort_values('Time')\n", + "\n", + " merged_df2 = pd.merge_asof(ds_df, resampled_df4, on='Time' )\n", + " merged_df2 = merged_df2.drop(columns=['Beneficiary_BIC_y']).rename(columns={'Beneficiary_BIC_x': 'Beneficiary_BIC'})\n", + " \n", + " results2[ds_name] = merged_df2\n", + "\n", + "print(results2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a44309a2-e252-458d-a9dc-2691aea9360f", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "final_results = {}\n", + "for name in results:\n", + " df = results[name]\n", + " df2 = results2[name]\n", + " df3 = df2[[\"Time\", \"Beneficiary_BIC\", \"x3_y2\"]].copy()\n", + " df4 = pd.merge(df, df3, on=['Time', 'Beneficiary_BIC'])\n", + " final_results[name] = df4\n", + "\n", + " \n", + "for name in final_results:\n", + " site_dir = os.path.join(site_input_dir, site_name)\n", + " os.makedirs(site_dir, exist_ok=True)\n", + " enrich_file_name = os.path.join(site_dir, f\"{name}_enrichment.csv\")\n", + " print(enrich_file_name)\n", + " final_results[name].to_csv(enrich_file_name) \n", + " \n", + "final_results[\"train\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47c958c3-bf73-4ab3-a66f-414be10870ea", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "! 
tree {site_input_dir}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "791ba1db-0ccf-4b31-b838-828d8c6a98a6", + "metadata": {}, + "outputs": [], + "source": [ + "ls -al /tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eae3d95a-180a-4fb6-b006-1fc1c144c5c4", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "! find /tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1/ -exec wc -l {} \\;" + ] + }, + { + "cell_type": "markdown", + "id": "f9966065-80cb-4f85-adab-8c44f01fc8d1", + "metadata": {}, + "source": [ + "Let's go back to the [XGBoost Notebook](../xgboost.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8855463-ce23-44e5-b0ad-4e05d256ba8d", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "nvflare_example", + "language": "python", + "name": "nvflare_example" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.18" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/advanced/finance-end-to-end/notebooks/gnn_train_encode.ipynb b/examples/advanced/finance-end-to-end/notebooks/gnn_train_encode.ipynb new file mode 100644 index 0000000000..cc688bf729 --- /dev/null +++ b/examples/advanced/finance-end-to-end/notebooks/gnn_train_encode.ipynb @@ -0,0 +1,319 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "3d892b5e-2f3b-4182-bedb-d332bfc3a353", + "metadata": {}, + "source": [ + "# GNN Training and Encoding\n", + "\n", + "* Train a GNN based on enriched features in an unsupervised fashion, and use the resulting model to encode the input features." 
+ ] + }, + { + "cell_type": "markdown", + "id": "b8498bf1-d368-4d15-a5bf-559eb6e3918b", + "metadata": {}, + "source": [ + "## Load Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db9d04f0-a64d-457b-aacf-1a3737e07e12", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "site_input_dir = \"/tmp/dataset/horizontal_credit_fraud_data/\"\n", + "site_name = \"ZHSZUS33_Bank_1\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d84f89f-fe0a-4387-92a2-49ca9143c141", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "import pandas as pd\n", + "\n", + "dataset_names = [\"train\", \"test\"]\n", + "df_feats = {}\n", + "df_edges = {}\n", + "for ds_name in dataset_names:\n", + " # Get feature and class\n", + " file_name = os.path.join(site_input_dir, site_name, f\"{ds_name}_normalized.csv\")\n", + " df = pd.read_csv(file_name, index_col=0)\n", + " # Drop irrelevant columns\n", + " df = df.drop(columns=[\"Currency_Country\",\n", + " \"Beneficiary_BIC\",\n", + " \"Currency\",\n", + " \"Receiver_BIC\",\n", + " \"Sender_BIC\"]) \n", + " df_feats[ds_name] = df\n", + " # Get edge map\n", + " file_name = os.path.join(site_input_dir, site_name, f\"{ds_name}_edgemap.csv\")\n", + " df = pd.read_csv(file_name, header=None)\n", + " # Add column names to the edge map\n", + " df.columns = [\"UETR_1\", \"UETR_2\"]\n", + " df_edges[ds_name] = df" + ] + }, + { + "cell_type": "markdown", + "id": "a95b6b9d-7046-4ed4-8a7e-ce1f74ddf694", + "metadata": { + "tags": [] + }, + "source": [ + "## Prepared Data for Unsupervised GNN Training" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5bd5be54-c5e7-43c7-ad4f-de29a09bc7ec", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import torch\n", + "\n", + "node_ids = {}\n", + "node_features = {}\n", + "edge_indices = {}\n", + "weights = {}\n", + "labels = {}\n", + "\n", + "for ds_name in dataset_names:\n", + " df_feat_class = df_feats[ds_name]\n", + " df_edge = df_edges[ds_name]\n", + "\n", + " # Sort the data by UETR\n", + " df_feat_class = df_feat_class.sort_values(by=\"UETR\").reset_index(drop=True)\n", + "\n", + " # Generate UETR-index map with the feature list\n", + " node_id = df_feat_class[\"UETR\"].values\n", + " map_id = {j: i for i, j in enumerate(node_id)} # mapping nodes to indexes\n", + " node_ids[ds_name] = node_id\n", + " \n", + " # Get class labels\n", + " labels[ds_name] = df_feat_class[\"Class\"].values\n", + "\n", + " # Map UETR to indexes in the edge map\n", + " edges = df_edge.copy()\n", + " edges.UETR_1 = edges.UETR_1.map(map_id)\n", + " edges.UETR_2 = edges.UETR_2.map(map_id)\n", + " edges = edges.astype(int)\n", + "\n", + " # for undirected graph\n", + " edge_index = np.array(edges.values).T\n", + " edge_index = torch.tensor(edge_index, dtype=torch.long).contiguous()\n", + " edge_indices[ds_name] = edge_index\n", + " weights[ds_name] = torch.tensor([1] * edge_index.shape[1], dtype=torch.float)\n", + "\n", + " # UETR mapped to corresponding indexes, drop UETR and class\n", + " node_feature = df_feat_class.drop([\"UETR\", \"Class\"], axis=1).copy()\n", + " node_feature = torch.tensor(np.array(node_feature.values), dtype=torch.float)\n", + " node_features[ds_name] = node_feature\n" + ] + }, + { + "cell_type": "markdown", + "id": "70b192a7-05be-4591-b937-7bab878277ac", + "metadata": { + "tags": [] + }, + "source": [ + "## Unsupervised GNN Training" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "id": "f326a613-e683-4f67-810d-aece3d90349e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import torch.nn.functional as F\n", + "from torch.utils.tensorboard import SummaryWriter\n", + "from torch_geometric.data import Data\n", + "from torch_geometric.loader import LinkNeighborLoader\n", + "from torch_geometric.nn import GraphSAGE\n", + "\n", + "output_dir = os.path.join(site_input_dir, site_name)\n", + "DEVICE = \"cuda:0\"\n", + "writer = SummaryWriter(output_dir)\n", + "epochs = 100\n", + "\n", + "# Converting data to PyG graph data format\n", + "train_data = Data(\n", + " x=node_features['train'], edge_index=edge_indices['train'], edge_attr=weights['train']\n", + ")\n", + "\n", + "# Define the dataloader for graphsage training\n", + "loader = LinkNeighborLoader(\n", + " train_data,\n", + " batch_size=2048,\n", + " shuffle=True,\n", + " neg_sampling_ratio=1.0,\n", + " num_neighbors=[10, 10],\n", + " num_workers=6,\n", + " persistent_workers=True,\n", + ")\n", + "\n", + "# Model\n", + "model = GraphSAGE(\n", + " in_channels=node_features['train'].shape[1],\n", + " hidden_channels=64,\n", + " num_layers=2,\n", + " out_channels=64,\n", + ")\n", + "optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)\n", + "model.to(DEVICE)\n", + "\n", + "for epoch in range(1, epochs + 1):\n", + " model.train()\n", + " running_loss = instance_count = 0\n", + "\n", + " for data in loader:\n", + " # get the inputs data\n", + " data = data.to(DEVICE)\n", + " # zero the parameter gradients\n", + " optimizer.zero_grad()\n", + " # forward + backward + optimize\n", + " h = model(data.x, data.edge_index)\n", + " h_src = h[data.edge_label_index[0]]\n", + " h_dst = h[data.edge_label_index[1]]\n", + " link_pred = (h_src * h_dst).sum(dim=-1) # Inner product.\n", + " loss = F.binary_cross_entropy_with_logits(link_pred, data.edge_label)\n", + " loss.backward()\n", + " optimizer.step()\n", + " # add record\n", + " running_loss += float(loss.item()) * link_pred.numel()\n", + " instance_count += link_pred.numel()\n", + " print(f\"Epoch: {epoch:02d}, Loss: {running_loss / instance_count:.4f}\")\n", + " writer.add_scalar(\"train_loss\", running_loss / instance_count, epoch)\n", + "\n", + "# Save the model\n", + "torch.save(model.state_dict(), os.path.join(output_dir, \"model.pt\"))" + ] + }, + { + "cell_type": "markdown", + "id": "d7a5b581-1688-4c43-a83a-f3b152d05729", + "metadata": { + "tags": [] + }, + "source": [ + "## GNN Inference - Encoding the Raw Feature" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9dfe6156-1049-41c5-82d5-b81fa1814160", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Load the model and perform inference / encoding\n", + "model_enc = GraphSAGE(\n", + " in_channels=node_features['train'].shape[1],\n", + " hidden_channels=64,\n", + " num_layers=2,\n", + " out_channels=64,\n", + ")\n", + "model_enc.load_state_dict(torch.load(os.path.join(output_dir, \"model.pt\")))\n", + "model_enc.eval()\n", + "\n", + "embeds = {}\n", + "# Perform encoding\n", + "for ds_name in dataset_names:\n", + " h = model_enc(node_features[ds_name], edge_indices[ds_name])\n", + " embed = pd.DataFrame(h.cpu().detach().numpy())\n", + " # Add column names as V_0, V_1, ... 
V_63\n",
+    "    embed.columns = [f\"V_{i}\" for i in range(embed.shape[1])]\n",
+    "    # Concatenate the node ids and class labels with the encoded features\n",
+    "    embed[\"UETR\"] = node_ids[ds_name]\n",
+    "    embed[\"Class\"] = labels[ds_name]\n",
+    "    # Move the UETR and Class columns to the front\n",
+    "    embed = embed[[\"UETR\", \"Class\"] + [col for col in embed.columns if col not in [\"UETR\", \"Class\"]]]\n",
+    "    embed.to_csv(os.path.join(output_dir, f\"{ds_name}_embedding.csv\"), index=False)\n",
+    "    embeds[ds_name] = embed"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c1b8925c-6890-4a45-a9c4-f80399b463cc",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "! tree /tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5adcd468-edaf-4759-ac2d-09902811c97a",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "embeds[\"train\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8591e4e1-74b1-465c-8124-eaf9829a6a8e",
+   "metadata": {},
+   "source": [
+    "Let's go back to the [XGBoost Notebook](../xgboost.ipynb)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d926970e-a4e9-41a7-a166-0d11f8e9e320",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "nvflare_example",
+   "language": "python",
+   "name": "nvflare_example"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.18"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/examples/advanced/finance-end-to-end/notebooks/graph_construct.ipynb b/examples/advanced/finance-end-to-end/notebooks/graph_construct.ipynb
new file mode 100644
index 0000000000..cd02ae469b
--- /dev/null
+++ b/examples/advanced/finance-end-to-end/notebooks/graph_construct.ipynb
@@ -0,0 +1,197 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "3d892b5e-2f3b-4182-bedb-d332bfc3a353",
+   "metadata": {},
+   "source": [
+    "# Graph Construction Step\n",
+    "\n",
+    "* Construct the graph for each site's transaction data\n",
+    "\n",
+    "Each node represents a transaction, and the edges represent the relationships between transactions. Since all transactions at a site share the same Sender_BIC, we define the graph edges using the following rules:\n",
+    "\n",
+    "1. The two transactions have the same Receiver_BIC.\n",
+    "2. The time difference between the two transactions is smaller than 6000.\n",
+    "\n",
+    "Note that in real applications, such rules should be designed according to the characteristics of the candidate data."
+ ] + }, + { + "cell_type": "markdown", + "id": "b8498bf1-d368-4d15-a5bf-559eb6e3918b", + "metadata": {}, + "source": [ + "### Load Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db9d04f0-a64d-457b-aacf-1a3737e07e12", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "site_input_dir = \"/tmp/dataset/horizontal_credit_fraud_data/\"\n", + "site_name = \"ZHSZUS33_Bank_1\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d84f89f-fe0a-4387-92a2-49ca9143c141", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "import pandas as pd\n", + "dataset_names = [\"train\", \"test\"]\n", + "datasets = {}\n", + "\n", + "for ds_name in dataset_names:\n", + " file_name = os.path.join(site_input_dir, site_name, f\"{ds_name}.csv\" )\n", + " df = pd.read_csv(file_name)\n", + " datasets[ds_name] = df\n", + " print(df)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ccdc785e-9597-4083-b74a-2cacb25b20cb", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5bd5be54-c5e7-43c7-ad4f-de29a09bc7ec", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "edge_maps = {}\n", + "\n", + "info_columns = ['Time', 'Receiver_BIC', 'UETR']\n", + "time_threshold = 6000\n", + "\n", + "for ds_name in dataset_names:\n", + " df = datasets[ds_name]\n", + " \n", + " # Find transaction pairs that are within the time threshold\n", + " # First sort the table by 'Time'\n", + " df = df.sort_values(by=\"Time\")\n", + " # Keep only the columns that are needed for the graph edge map\n", + " df = df[info_columns]\n", + "\n", + " # Then for each row, find the next rows that is within the time threshold\n", + " graph_edge_map = []\n", + " for i in range(len(df)):\n", + " # Find the next rows that is:\n", + " # - within the time threshold\n", + " # - has the same Receiver_BIC\n", + " j = 1\n", + " while (i + j < len(df) and df[\"Time\"].values[i + j] < df[\"Time\"].values[i] + time_threshold):\n", + " if (df[\"Receiver_BIC\"].values[i + j] == df[\"Receiver_BIC\"].values[i]):\n", + " graph_edge_map.append([df[\"UETR\"].values[i], df[\"UETR\"].values[i + j]])\n", + " j += 1\n", + "\n", + " print(f\"Generated edge map for {ds_name}, in total {len(graph_edge_map)} valid edges for {len(df)} transactions\")\n", + "\n", + " edge_maps[ds_name] = pd.DataFrame(graph_edge_map) \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7780ab4d-7d1d-4eda-96e1-eed9243eff11", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "edge_maps[\"train\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f326a613-e683-4f67-810d-aece3d90349e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "for name in edge_maps:\n", + " site_dir = os.path.join(site_input_dir, site_name)\n", + " os.makedirs(site_dir, exist_ok=True)\n", + " edge_map_file_name = os.path.join(site_dir, f\"{name}_edgemap.csv\")\n", + " print(\"save to = \", edge_map_file_name)\n", + " # save to csv file without header and index\n", + " edge_maps[name].to_csv(edge_map_file_name, header=False, index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c1b8925c-6890-4a45-a9c4-f80399b463cc", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "! 
tree /tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1" + ] + }, + { + "cell_type": "markdown", + "id": "8591e4e1-74b1-465c-8124-eaf9829a6a8e", + "metadata": {}, + "source": [ + "Let's go back to the [XGBoost Notebook](../xgboost.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d926970e-a4e9-41a7-a166-0d11f8e9e320", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "nvflare_example", + "language": "python", + "name": "nvflare_example" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.18" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/advanced/finance-end-to-end/pre_process.ipynb b/examples/advanced/finance-end-to-end/notebooks/pre_process.ipynb similarity index 79% rename from examples/advanced/finance-end-to-end/pre_process.ipynb rename to examples/advanced/finance-end-to-end/notebooks/pre_process.ipynb index 1a12df720a..d47167f261 100644 --- a/examples/advanced/finance-end-to-end/pre_process.ipynb +++ b/examples/advanced/finance-end-to-end/notebooks/pre_process.ipynb @@ -21,7 +21,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "id": "db9d04f0-a64d-457b-aacf-1a3737e07e12", "metadata": { "tags": [] @@ -91,27 +91,12 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "id": "ccdc785e-9597-4083-b74a-2cacb25b20cb", "metadata": { "tags": [] }, - "outputs": [ - { - "data": { - "text/plain": [ - "Index(['Unnamed: 0', 'Time', 'Class', 'Amount', 'Sender_BIC', 'Receiver_BIC',\n", - " 'UETR', 'Currency', 'Beneficiary_BIC', 'Currency_Country',\n", - " 'trans_volume', 'total_amount', 'average_amount', 'hist_trans_volume',\n", - " 'hist_total_amount', 'hist_average_amount', 'x2_y1', 'x3_y2'],\n", - " dtype='object')" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "df.columns" ] @@ -154,8 +139,7 @@ " \n", " # Combine the normalized numerical features with the categorical features\n", " df_combined = pd.concat([categorical_features, numerical_normalized], axis=1)\n", - " \n", - " \n", + " \n", "# # one-hot encoding\n", "# df_combined = pd.get_dummies(df_combined, columns=category_columns)\n", "\n", @@ -175,57 +159,41 @@ }, "outputs": [], "source": [ - " \n", "for name in processed_dfs:\n", " site_dir = os.path.join(site_input_dir, site_name)\n", " os.makedirs(site_dir, exist_ok=True)\n", " pre_processed_file_name = os.path.join(site_dir, f\"{name}_normalized.csv\")\n", " print(pre_processed_file_name)\n", - " processed_dfs[name].to_csv(pre_processed_file_name) \n" + " processed_dfs[name].to_csv(pre_processed_file_name) " ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "id": "c1b8925c-6890-4a45-a9c4-f80399b463cc", "metadata": { "tags": [] }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[01;34m/tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1\u001b[0m\n", - "├── history.csv\n", - "├── test.csv\n", - "├── test_enrichment.csv\n", - "├── test_normalized.csv\n", - "├── train.csv\n", - "├── train_enrichment.csv\n", - "└── train_normalized.csv\n", - "\n", - "0 directories, 7 files\n" - ] - } - ], + "outputs": [], "source": [ "! 
tree /tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1" ] }, - { - "cell_type": "markdown", - "id": "e0a33628-acc7-4f42-b2fa-d066699e23eb", - "metadata": {}, - "source": [] - }, { "cell_type": "markdown", "id": "8591e4e1-74b1-465c-8124-eaf9829a6a8e", "metadata": {}, "source": [ - "Let's go back to the [XGBoost Notebook](./xgboost.ipynb)" + "Let's go back to the [XGBoost Notebook](../xgboost.ipynb)" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "26989ac6-cedf-4c9d-8b25-60e0af758cfe", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { @@ -244,7 +212,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.19" + "version": "3.8.18" } }, "nbformat": 4, diff --git a/examples/advanced/finance-end-to-end/prepare_data.ipynb b/examples/advanced/finance-end-to-end/notebooks/prepare_data.ipynb similarity index 50% rename from examples/advanced/finance-end-to-end/prepare_data.ipynb rename to examples/advanced/finance-end-to-end/notebooks/prepare_data.ipynb index ff1b09a9b7..76d9264ae2 100644 --- a/examples/advanced/finance-end-to-end/prepare_data.ipynb +++ b/examples/advanced/finance-end-to-end/notebooks/prepare_data.ipynb @@ -14,6 +14,7 @@ "metadata": {}, "source": [ "## Prepare Data\n", + "First download data from [kaggle credit card fraud dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) and save it to the below `data_path`\n", "### Based on the riginal data, add randome synthentic data to make full dataset\n", "* expand time in seconds x 200 times to cover 26 months\n", "* double the data record size\n", @@ -24,60 +25,43 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "id": "84fe43c3-2e99-414f-91ef-b104578d8b0e", "metadata": { "tags": [] }, "outputs": [], "source": [ - "data_path=\"creditcard.csv\"\n", + "data_path=\"../creditcard.csv\"\n", "out_folder=\"/tmp/dataset/horizontal_credit_fraud_data\"\n", "\n", "import shutil\n", "import os\n", "\n", "if os.path.exists(out_folder):\n", - " shutil.rmtree(out_folder)\n" + " shutil.rmtree(out_folder)" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "id": "09248e1f-2066-459d-bf20-8ffc47b7f272", "metadata": { "tags": [] }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "284808 creditcard.csv\n" - ] - } - ], + "outputs": [], "source": [ "! wc -l {data_path}" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "id": "5b8edae6-b906-4631-9294-dbe2e11391f1", "metadata": { "tags": [] }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "old_max_days=1.9999074074074075\n", - "min_months=0.0, max_months=26.665432098765432\n" - ] - } - ], + "outputs": [], "source": [ "# %load_ext cudf.pandas\n", "import argparse\n", @@ -192,223 +176,17 @@ " return df\n", "\n", "# Add random BIC and currency details to the DataFrame\n", - "df = generate_random_details(df)\n", - "\n" + "df = generate_random_details(df)" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "id": "25b36fbf-f6b4-4a85-a022-748c21e6e309", "metadata": { "tags": [] }, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Time Amount Class Sender_BIC Receiver_BIC \\\n", - "0 0.0 149.62 0 FBSFCHZH WPUWDEFF \n", - "1 0.0 2.69 0 ZHSZUS33 YSYCESMM \n", - "2 100.0 378.66 0 HCBHSGSG FBSFCHZH \n", - "3 100.0 123.50 0 YXRXGB22 YMNYFRPP \n", - "4 200.0 69.99 0 XITXUS33 FBSFCHZH \n", - "... ... ... ... ... ... \n", - "1139223 69116200.0 0.77 0 XITXUS33 SHSHKHH1 \n", - "1139224 69116300.0 24.79 0 ZHSZUS33 ZHSZUS33 \n", - "1139225 69116400.0 67.88 0 YXRXGB22 WPUWDEFF \n", - "1139226 69116400.0 10.00 0 WPUWDEFF WPUWDEFF \n", - "1139227 69116800.0 217.00 0 YMNYFRPP YSYCESMM \n", - "\n", - " UETR Currency Beneficiary_BIC Currency_Country \n", - "0 V4ID8QTCIROHAP683AOX78 AUD ZNZZAU3M Australia \n", - "1 R7PCTKF9R1PVGXRXU9AB3J AUD ZNZZAU3M Australia \n", - "2 RP1SBN0Q5U58XBS8LQNE0J USD ZHSZUS33 United States \n", - "3 MAPFA8RU98VZP4MD6VFN1J USD ZHSZUS33 United States \n", - "4 3WX5XAGWK7F3CXRX6RZZK3 USD ZHSZUS33 United States \n", - "... ... ... ... ... \n", - "1139223 BEEX2F5NEHDU3YV8G17005 GBP YXRXGB22 United Kingdom \n", - "1139224 9SJQ6WVX8CGS0P1DYYGQ45 GBP YXRXGB22 United Kingdom \n", - "1139225 CGUZH7AV1YPIQCLCQMAWV6 AUD ZNZZAU3M Australia \n", - "1139226 9FZFL7WK3AA7K5C0Q6X5W3 SGD HCBHSGSG Singapore \n", - "1139227 AGKF0NGK83CJQTT5CU36PA AUD ZNZZAU3M Australia \n", - "\n", - "[1139228 rows x 9 columns]" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "df" ] @@ -431,28 +209,16 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "id": "47961d9f-c0fb-47fc-b901-5512be98ebf0", "metadata": { "tags": [] }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Historical DataFrame size: 626575\n", - "Training DataFrame size: 398729\n", - "Testing DataFrame size: 113924\n" - ] - } - ], + "outputs": [], "source": [ - "\n", "# Sort the DataFrame by the Time column\n", "df = df.sort_values(by='Time').reset_index(drop=True)\n", "\n", - "\n", "# Calculate the number of samples for each split\n", "total_size = len(df)\n", "historical_size = int(total_size * 0.55)\n", @@ -475,14 +241,12 @@ "# Display sizes of each dataset\n", "print(f\"Historical DataFrame size: {len(df_history)}\")\n", "print(f\"Training DataFrame size: {len(df_train)}\")\n", - "print(f\"Testing DataFrame size: {len(df_test)}\")\n", - "\n", - "\n" + "print(f\"Testing DataFrame size: {len(df_test)}\")" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "id": "785c6028-e792-450b-a294-a6460b03fd9f", "metadata": { "tags": [] @@ -499,23 +263,12 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "id": "8cb273b7-4273-414b-9ef9-0da9f7d3e839", "metadata": { "tags": [] }, - "outputs": [ - { - "data": { - "text/plain": [ - "'/tmp/dataset/horizontal_credit_fraud_data'" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "out_folder" ] @@ -529,7 +282,7 @@ }, "outputs": [], "source": [ - "!ls -al {out_folder}\n" + "!ls -al {out_folder}" ] }, { @@ -551,8 +304,7 @@ "source": [ "## Split Data for differnt Client sites\n", "\n", - "Now, split train, test, history data evenly for n = 2 training sites (Clients)\n", - "\n" + "Now, split train, test, history data according to Sender_BICs" ] }, { @@ -564,7 +316,6 @@ }, "outputs": [], "source": [ - "\n", "files = [\"history\", \"train\", \"test\"]\n", "client_names = set()\n", "\n", @@ -585,11 +336,7 @@ " group.to_csv(filename, index=False)\n", " 
print(f\"Saved {name} {f} transactions to {filename}\")\n", "\n", - "print(client_names)\n", - " \n", - "\n", - "\n", - " \n" + "print(client_names)" ] }, { @@ -628,63 +375,10 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": null, "id": "1c2c1e5e-8d95-4bc0-97bc-25e15e878433", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[01;34m/tmp/dataset/horizontal_credit_fraud_data/\u001b[0m\n", - "├── \u001b[01;34mFBSFCHZH_Bank_6\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mHCBHSGSG_Bank_9\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── history.csv\n", - "├── \u001b[01;34mSHSHKHH1_Bank_2\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── test.csv\n", - "├── train.csv\n", - "├── \u001b[01;34mWPUWDEFF_Bank_4\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mXITXUS33_Bank_10\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mYMNYFRPP_Bank_5\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mYSYCESMM_Bank_7\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mYXRXGB22_Bank_3\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "├── \u001b[01;34mZHSZUS33_Bank_1\u001b[0m\n", - "│   ├── history.csv\n", - "│   ├── test.csv\n", - "│   └── train.csv\n", - "└── \u001b[01;34mZNZZAU3M_Bank_8\u001b[0m\n", - " ├── history.csv\n", - " ├── test.csv\n", - " └── train.csv\n", - "\n", - "10 directories, 33 files\n" - ] - } - ], + "outputs": [], "source": [ "!tree /tmp/dataset/horizontal_credit_fraud_data/" ] @@ -694,13 +388,13 @@ "id": "30661d30-7032-4bde-9fb2-ce67897a2f55", "metadata": {}, "source": [ - "Let's go back to the [XGBoost Notebook](./xgboost.ipynb)" + "Let's go back to the [XGBoost Notebook](../xgboost.ipynb)" ] }, { "cell_type": "code", "execution_count": null, - "id": "87163fff-d4cb-485c-ba59-3582104dfcda", + "id": "e6a448c4-a7c3-4ec6-a44b-bd8d123a3200", "metadata": {}, "outputs": [], "source": [] @@ -722,7 +416,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.19" + "version": "3.8.18" } }, "nbformat": 4, diff --git a/examples/advanced/finance-end-to-end/enrich.py b/examples/advanced/finance-end-to-end/nvflare/enrich.py similarity index 100% rename from examples/advanced/finance-end-to-end/enrich.py rename to examples/advanced/finance-end-to-end/nvflare/enrich.py diff --git a/examples/advanced/finance-end-to-end/enrich_job.py b/examples/advanced/finance-end-to-end/nvflare/enrich_job.py similarity index 97% rename from examples/advanced/finance-end-to-end/enrich_job.py rename to examples/advanced/finance-end-to-end/nvflare/enrich_job.py index ac9fbab9ea..73d170050f 100644 --- a/examples/advanced/finance-end-to-end/enrich_job.py +++ b/examples/advanced/finance-end-to-end/nvflare/enrich_job.py @@ -31,7 +31,6 @@ def main(): job = FedJob(name=job_name) - # Define the enrich_ctrl workflow and send to server enrich_ctrl = ETLController(task_name="enrich") job.to(enrich_ctrl, "server", id="enrich") diff --git a/examples/advanced/finance-end-to-end/nvflare/gnn_train_encode.py b/examples/advanced/finance-end-to-end/nvflare/gnn_train_encode.py new file mode 100644 
index 0000000000..b50abea8d2 --- /dev/null +++ b/examples/advanced/finance-end-to-end/nvflare/gnn_train_encode.py @@ -0,0 +1,233 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import pandas as pd +import torch +import torch.nn.functional as F +from torch.utils.tensorboard import SummaryWriter +from torch_geometric.data import Data +from torch_geometric.loader import LinkNeighborLoader +from torch_geometric.nn import GraphSAGE + +DEVICE = "cuda:0" + +# (1) import nvflare client API +import nvflare.client as flare + + +def edge_index_gen(df_feat_class, df_edges): + # Sort the data by UETR + df_feat_class = df_feat_class.sort_values(by="UETR").reset_index(drop=True) + + # Generate UETR-index map with the feature list + node_id = df_feat_class["UETR"].values + map_id = {j: i for i, j in enumerate(node_id)} # mapping nodes to indexes + + # Get class labels + label = df_feat_class["Class"].values + + # Map UETR to indexes in the edge map + edges = df_edges.copy() + edges.UETR_1 = edges.UETR_1.map(map_id) + edges.UETR_2 = edges.UETR_2.map(map_id) + edges = edges.astype(int) + + # for undirected graph + edge_index = np.array(edges.values).T + edge_index = torch.tensor(edge_index, dtype=torch.long).contiguous() + weight = torch.tensor([1] * edge_index.shape[1], dtype=torch.float) + + # UETR mapped to corresponding indexes, drop UETR and class + node_feat = df_feat_class.drop(["UETR", "Class"], axis=1).copy() + node_feat = torch.tensor(np.array(node_feat.values), dtype=torch.float) + + return node_feat, edge_index, weight, node_id, label + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument( + "-i", + "--data_path", + type=str, + default="/tmp/dataset/credit_data", + ) + parser.add_argument( + "--epochs", + type=int, + default=1, + ) + parser.add_argument( + "-o", + "--output_path", + type=str, + default="/tmp/dataset/credit_data", + ) + args = parser.parse_args() + + # (2) initializes NVFlare client API + flare.init() + site_name = flare.get_site_name() + + # Set up tensorboard + writer = SummaryWriter(os.path.join(args.output_path, site_name)) + + # Load the data + dataset_names = ["train", "test"] + + node_features = {} + edge_indices = {} + weights = {} + node_ids = {} + labels = {} + + for ds_name in dataset_names: + # Get feature and class + file_name = os.path.join(args.data_path, site_name, f"{ds_name}_normalized.csv") + df = pd.read_csv(file_name, index_col=0) + # Drop irrelevant columns + df = df.drop(columns=["Currency_Country", "Beneficiary_BIC", "Currency", "Receiver_BIC", "Sender_BIC"]) + df_feat_class = df + # Get edge map + file_name = os.path.join(args.data_path, site_name, f"{ds_name}_edgemap.csv") + df = pd.read_csv(file_name, header=None) + # Add column names to the edge map + df.columns = ["UETR_1", "UETR_2"] + df_edges = df + + # Preprocess data + node_feat, edge_index, weight, node_id, label = edge_index_gen(df_feat_class, df_edges) + 
node_features[ds_name] = node_feat + edge_indices[ds_name] = edge_index + weights[ds_name] = weight + node_ids[ds_name] = node_id + labels[ds_name] = label + + # Converting training data to PyG graph data format + train_data = Data(x=node_features["train"], edge_index=edge_indices["train"], edge_attr=weights["train"]) + + # Define the dataloader for graphsage training + loader = LinkNeighborLoader( + train_data, + batch_size=2048, + shuffle=True, + neg_sampling_ratio=1.0, + num_neighbors=[10, 10], + num_workers=6, + persistent_workers=True, + ) + + # Model + model = GraphSAGE( + in_channels=node_features["train"].shape[1], + hidden_channels=64, + num_layers=2, + out_channels=64, + ) + + while flare.is_running(): + # (3) receives FLModel from NVFlare + input_model = flare.receive() + print(f"current_round={input_model.current_round}/{input_model.total_rounds}") + + # (4) loads model from NVFlare + model.load_state_dict(input_model.params) + + # (5) perform encoding for both training and test data + def gnn_encode(model_param, node_feature, edge_index, id, label): + # Load the model and perform inference / encoding + model_enc = GraphSAGE( + in_channels=node_feature.shape[1], + hidden_channels=64, + num_layers=2, + out_channels=64, + ) + model_enc.load_state_dict(model_param) + model_enc.to(DEVICE) + model_enc.eval() + node_feature = node_feature.to(DEVICE) + edge_index = edge_index.to(DEVICE) + + # Perform encoding + h = model_enc(node_feature, edge_index) + embed = pd.DataFrame(h.cpu().detach().numpy()) + # Add column names as V_0, V_1, ... V_63 + embed.columns = [f"V_{i}" for i in range(embed.shape[1])] + # Concatenate the node ids and class labels with the encoded features + embed["UETR"] = id + embed["Class"] = label + # Move the UETR and Class columns to the front + embed = embed[["UETR", "Class"] + [col for col in embed.columns if col not in ["UETR", "Class"]]] + return embed + + # Only do encoding for the last round + if input_model.current_round == input_model.total_rounds - 1: + print("Encoding the data with the final model") + for ds_name in dataset_names: + embed = gnn_encode( + input_model.params, + node_features[ds_name], + edge_indices[ds_name], + node_ids[ds_name], + labels[ds_name], + ) + embed.to_csv(os.path.join(args.output_path, site_name, f"{ds_name}_embedding.csv"), index=False) + + optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) + model.to(DEVICE) + steps = args.epochs * len(loader) + for epoch in range(1, args.epochs + 1): + model.train() + running_loss = instance_count = 0 + for data in loader: + # get the inputs data + data = data.to(DEVICE) + # zero the parameter gradients + optimizer.zero_grad() + # forward + backward + optimize + h = model(data.x, data.edge_index) + h_src = h[data.edge_label_index[0]] + h_dst = h[data.edge_label_index[1]] + link_pred = (h_src * h_dst).sum(dim=-1) # Inner product. 
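+                # Unsupervised link prediction: the inner product scores each sampled
+                # node pair, and data.edge_label (1 for true edges, 0 for the negative
+                # pairs drawn by LinkNeighborLoader) is the target of the BCE loss below.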
+ loss = F.binary_cross_entropy_with_logits(link_pred, data.edge_label) + loss.backward() + optimizer.step() + # add record + running_loss += float(loss.item()) * link_pred.numel() + instance_count += link_pred.numel() + print(f"Epoch: {epoch:02d}, Loss: {running_loss / instance_count:.4f}") + writer.add_scalar( + "train_loss", running_loss / instance_count, input_model.current_round * args.epochs + epoch + ) + + print("Finished Training") + # Save the model + torch.save(model.state_dict(), os.path.join(args.output_path, site_name, "model.pt")) + + # (6) construct trained FL model + output_model = flare.FLModel( + params=model.cpu().state_dict(), + metrics={"loss": running_loss}, + meta={"NUM_STEPS_CURRENT_ROUND": steps}, + ) + # (7) send model back to NVFlare + flare.send(output_model) + + +if __name__ == "__main__": + main() diff --git a/examples/advanced/finance-end-to-end/nvflare/gnn_train_encode_job.py b/examples/advanced/finance-end-to-end/nvflare/gnn_train_encode_job.py new file mode 100644 index 0000000000..16518a74c9 --- /dev/null +++ b/examples/advanced/finance-end-to-end/nvflare/gnn_train_encode_job.py @@ -0,0 +1,116 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import argparse + +from torch_geometric.nn import GraphSAGE + +from nvflare import FedJob +from nvflare.app_common.workflows.fedavg import FedAvg +from nvflare.app_opt.pt.job_config.model import PTModel +from nvflare.job_config.script_runner import ScriptRunner + + +def main(): + args = define_parser() + + site_names = args.sites + n_clients = len(site_names) + + work_dir = args.work_dir + task_script_path = args.task_script_path + task_script_args = args.task_script_args + + job = FedJob(name="gnn_train_encode_job") + + # Define the controller workflow and send to server + controller = FedAvg( + num_clients=n_clients, + num_rounds=args.num_rounds, + ) + job.to(controller, "server") + + # Define the model + model = GraphSAGE( + in_channels=10, + hidden_channels=64, + num_layers=2, + out_channels=64, + ) + job.to(PTModel(model), "server") + + # Add clients + for site_name in site_names: + executor = ScriptRunner(script=task_script_path, script_args=task_script_args) + job.to(executor, site_name) + + if work_dir: + print(f"{work_dir=}") + job.export_job(work_dir) + + if not args.config_only: + job.simulator_run(work_dir) + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument( + "-c", + "--sites", + nargs="*", # 0 or more values expected => creates a list + type=str, + default=[], # default if nothing is provided + help="Space separated site names", + ) + parser.add_argument( + "-n", + "--num_rounds", + type=int, + default=100, + help="number of FL rounds", + ) + parser.add_argument( + "-w", + "--work_dir", + type=str, + nargs="?", + default="/tmp/nvflare/jobs/xgb/workdir", + help="work directory, default to '/tmp/nvflare/jobs/xgb/workdir'", + ) + + parser.add_argument( + "-p", + "--task_script_path", + type=str, + nargs="?", 
+ help="task script", + ) + + parser.add_argument( + "-a", + "--task_script_args", + type=str, + nargs="?", + default="", + help="", + ) + + parser.add_argument("-co", "--config_only", action="store_true", help="config only mode, will not run simulator") + + return parser.parse_args() + + +if __name__ == "__main__": + main() diff --git a/examples/advanced/finance-end-to-end/nvflare/graph_construct.py b/examples/advanced/finance-end-to-end/nvflare/graph_construct.py new file mode 100644 index 0000000000..ae90c2f24b --- /dev/null +++ b/examples/advanced/finance-end-to-end/nvflare/graph_construct.py @@ -0,0 +1,125 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import argparse +import os + +import pandas as pd + +# (1) import nvflare client API +import nvflare.client as flare + +dataset_names = ["train", "test"] +datasets = {} + + +def main(): + print("\n pre-process starts \n ") + args = define_parser() + input_dir = args.input_dir + output_dir = args.output_dir + + flare.init() + site_name = flare.get_site_name() + + # receives global message from NVFlare + etl_task = flare.receive() + + print("\n receive task \n ") + edge_maps = edge_map_gen(input_dir, site_name) + + save_edge_map(output_dir, edge_maps, site_name) + + print("end task") + + # send message back the controller indicating end. 
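+    # The received task object doubles as the reply: mark it as done and send it
+    # back so the server-side ETL workflow knows this site has finished.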
+ etl_task.meta["status"] = "done" + flare.send(etl_task) + + +def save_edge_map(output_dir, edge_maps, site_name): + for name in edge_maps: + site_dir = os.path.join(output_dir, site_name) + os.makedirs(site_dir, exist_ok=True) + + edge_map_file_name = os.path.join(site_dir, f"{name}_edgemap.csv") + print("save to = ", edge_map_file_name) + # save to csv file without header and index + edge_maps[name].to_csv(edge_map_file_name, header=False, index=False) + + +def edge_map_gen(input_dir, site_name): + edge_maps = {} + info_columns = ["Time", "Receiver_BIC", "UETR"] + time_threshold = 6000 + for ds_name in dataset_names: + + file_name = os.path.join(input_dir, site_name, f"{ds_name}.csv") + df = pd.read_csv(file_name) + datasets[ds_name] = df + + # Find transaction pairs that are within the time threshold + # First sort the table by 'Time' + df = df.sort_values(by="Time") + # Keep only the columns that are needed for the graph edge map + df = df[info_columns] + + # Then for each row, find the next rows that is within the time threshold + graph_edge_map = [] + for i in range(len(df)): + # Find the next rows that is: + # - within the time threshold + # - has the same Receiver_BIC + j = 1 + while i + j < len(df) and df["Time"].values[i + j] < df["Time"].values[i] + time_threshold: + if df["Receiver_BIC"].values[i + j] == df["Receiver_BIC"].values[i]: + graph_edge_map.append([df["UETR"].values[i], df["UETR"].values[i + j]]) + j += 1 + + print( + f"Generated edge map for {ds_name}, in total {len(graph_edge_map)} valid edges for {len(df)} transactions" + ) + + edge_maps[ds_name] = pd.DataFrame(graph_edge_map) + + return edge_maps + + +def define_parser(): + parser = argparse.ArgumentParser() + + parser.add_argument( + "-i", + "--input_dir", + type=str, + nargs="?", + default="/tmp/dataset/credit_data", + help="input directory where csv files for each site are expected, default to /tmp/dataset/credit_data", + ) + + parser.add_argument( + "-o", + "--output_dir", + type=str, + nargs="?", + default="/tmp/dataset/credit_data", + help="output directory, default to '/tmp/dataset/credit_data'", + ) + + return parser.parse_args() + + +if __name__ == "__main__": + main() diff --git a/examples/advanced/finance-end-to-end/nvflare/graph_construct_job.py b/examples/advanced/finance-end-to-end/nvflare/graph_construct_job.py new file mode 100644 index 0000000000..81f8033c30 --- /dev/null +++ b/examples/advanced/finance-end-to-end/nvflare/graph_construct_job.py @@ -0,0 +1,100 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
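+
+# Job configuration script: defines an ETLController for the "graph_construct" task on
+# the server and a ScriptRunner per site that executes the graph-construction task
+# script, then exports the job and/or runs it in the simulator.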
+ + +import argparse + +from nvflare import FedJob +from nvflare.app_common.workflows.etl_controller import ETLController +from nvflare.job_config.script_runner import ScriptRunner + + +def main(): + args = define_parser() + + site_names = args.sites + work_dir = args.work_dir + job_name = args.job_name + task_script_path = args.task_script_path + task_script_args = args.task_script_args + + job = FedJob(name=job_name) + + graph_construct_ctrl = ETLController(task_name="graph_construct") + job.to(graph_construct_ctrl, "server", id="graph_construct") + + # Add clients + for site_name in site_names: + executor = ScriptRunner(script=task_script_path, script_args=task_script_args) + job.to(executor, site_name, tasks=["graph_construct"]) + + if work_dir: + print(f"{work_dir=}") + job.export_job(work_dir) + + if not args.config_only: + job.simulator_run(work_dir) + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument( + "-c", + "--sites", + nargs="*", # 0 or more values expected => creates a list + type=str, + default=[], # default if nothing is provided + help="Space separated site names", + ) + parser.add_argument( + "-n", + "--job_name", + type=str, + nargs="?", + default="credit_card_graph_construct_job", + help="job name, default to xgb_job", + ) + parser.add_argument( + "-w", + "--work_dir", + type=str, + nargs="?", + default="/tmp/nvflare/jobs/xgb/workdir", + help="work directory, default to '/tmp/nvflare/jobs/xgb/workdir'", + ) + + parser.add_argument( + "-p", + "--task_script_path", + type=str, + nargs="?", + help="task script", + ) + + parser.add_argument( + "-a", + "--task_script_args", + type=str, + nargs="?", + default="", + help="", + ) + + parser.add_argument("-co", "--config_only", action="store_true", help="config only mode, will not run simulator") + + return parser.parse_args() + + +if __name__ == "__main__": + main() diff --git a/examples/advanced/finance-end-to-end/pre_process.py b/examples/advanced/finance-end-to-end/nvflare/pre_process.py similarity index 100% rename from examples/advanced/finance-end-to-end/pre_process.py rename to examples/advanced/finance-end-to-end/nvflare/pre_process.py diff --git a/examples/advanced/finance-end-to-end/pre_process_job.py b/examples/advanced/finance-end-to-end/nvflare/pre_process_job.py similarity index 99% rename from examples/advanced/finance-end-to-end/pre_process_job.py rename to examples/advanced/finance-end-to-end/nvflare/pre_process_job.py index ce7c4a1c92..fbeba5df44 100644 --- a/examples/advanced/finance-end-to-end/pre_process_job.py +++ b/examples/advanced/finance-end-to-end/nvflare/pre_process_job.py @@ -81,6 +81,7 @@ def define_parser(): nargs="?", help="task script", ) + parser.add_argument( "-a", "--task_script_args", diff --git a/examples/advanced/finance-end-to-end/xgb_data_loader.py b/examples/advanced/finance-end-to-end/nvflare/xgb_data_loader.py similarity index 97% rename from examples/advanced/finance-end-to-end/xgb_data_loader.py rename to examples/advanced/finance-end-to-end/nvflare/xgb_data_loader.py index 0b88e21e46..a5eef83dad 100644 --- a/examples/advanced/finance-end-to-end/xgb_data_loader.py +++ b/examples/advanced/finance-end-to-end/nvflare/xgb_data_loader.py @@ -30,6 +30,7 @@ def __init__(self, root_dir: str, file_postfix: str): self.file_postfix = file_postfix for name in self.dataset_names: self.base_file_names[name] = name + file_postfix + self.numerical_columns = [ "Timestamp", "Amount", @@ -53,6 +54,8 @@ def load_data(self) -> Tuple[xgb.DMatrix, xgb.DMatrix]: for ds_name 
in self.dataset_names: print("\nloading for site = ", self.client_id, f"{ds_name} dataset \n") file_name = os.path.join(self.root_dir, self.client_id, self.base_file_names[ds_name]) + print(file_name) + print(self.numerical_columns) df = pd.read_csv(file_name) data_num = len(data) diff --git a/examples/advanced/finance-end-to-end/nvflare/xgb_embed_data_loader.py b/examples/advanced/finance-end-to-end/nvflare/xgb_embed_data_loader.py new file mode 100644 index 0000000000..aec50a5711 --- /dev/null +++ b/examples/advanced/finance-end-to-end/nvflare/xgb_embed_data_loader.py @@ -0,0 +1,64 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import os +from typing import Tuple + +import pandas as pd +import xgboost as xgb + +from nvflare.app_opt.xgboost.data_loader import XGBDataLoader + + +class CreditCardEmbedDataLoader(XGBDataLoader): + def __init__(self, root_dir: str, file_postfix: str): + self.dataset_names = ["train", "test"] + self.base_file_names = {} + self.root_dir = root_dir + self.file_postfix = file_postfix + for name in self.dataset_names: + self.base_file_names[name] = name + file_postfix + self.numerical_columns = [f"V_{i}" for i in range(64)] + + def initialize( + self, client_id: str, rank: int, data_split_mode: xgb.core.DataSplitMode = xgb.core.DataSplitMode.ROW + ): + super().initialize(client_id, rank, data_split_mode) + + def load_data(self) -> Tuple[xgb.DMatrix, xgb.DMatrix]: + data = {} + for ds_name in self.dataset_names: + print("\nloading for site = ", self.client_id, f"{ds_name} dataset") + file_name = os.path.join(self.root_dir, self.client_id, self.base_file_names[ds_name]) + print(file_name) + print(self.numerical_columns) + print("\n") + df = pd.read_csv(file_name) + data_num = len(data) + + # split to feature and label + y = df["Class"] + x = df[self.numerical_columns] + data[ds_name] = (x, y, data_num) + + # training + x_train, y_train, total_train_data_num = data["train"] + dmat_train = xgb.DMatrix(x_train, label=y_train, data_split_mode=self.data_split_mode) + + # validation + x_valid, y_valid, total_valid_data_num = data["test"] + dmat_valid = xgb.DMatrix(x_valid, label=y_valid, data_split_mode=self.data_split_mode) + + return dmat_train, dmat_valid diff --git a/examples/advanced/finance-end-to-end/xgb_job.py b/examples/advanced/finance-end-to-end/nvflare/xgb_job.py similarity index 100% rename from examples/advanced/finance-end-to-end/xgb_job.py rename to examples/advanced/finance-end-to-end/nvflare/xgb_job.py diff --git a/examples/advanced/finance-end-to-end/nvflare/xgb_job_embed.py b/examples/advanced/finance-end-to-end/nvflare/xgb_job_embed.py new file mode 100644 index 0000000000..26e76b60a0 --- /dev/null +++ b/examples/advanced/finance-end-to-end/nvflare/xgb_job_embed.py @@ -0,0 +1,120 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +from xgb_embed_data_loader import CreditCardEmbedDataLoader + +from nvflare import FedJob +from nvflare.app_opt.xgboost.histogram_based_v2.fed_controller import XGBFedController +from nvflare.app_opt.xgboost.histogram_based_v2.fed_executor import FedXGBHistogramExecutor + + +def main(): + args = define_parser() + + site_names = args.sites + work_dir = args.work_dir + job_name = args.job_name + root_dir = args.input_dir + file_postfix = args.file_postfix + + num_rounds = 10 + early_stopping_rounds = 10 + xgb_params = { + "max_depth": 8, + "eta": 0.1, + "objective": "binary:logistic", + "eval_metric": "auc", + "tree_method": "hist", + "nthread": 16, + } + + job = FedJob(name=job_name) + + # Define the controller workflow and send to server + controller = XGBFedController( + num_rounds=num_rounds, + data_split_mode=0, + secure_training=False, + xgb_params=xgb_params, + xgb_options={"early_stopping_rounds": early_stopping_rounds}, + ) + job.to(controller, "server") + + # Add clients + for site_name in site_names: + executor = FedXGBHistogramExecutor(data_loader_id="data_loader") + job.to(executor, site_name) + data_loader = CreditCardEmbedDataLoader(root_dir=root_dir, file_postfix=file_postfix) + job.to(data_loader, site_name, id="data_loader") + + if work_dir: + print("work_dir=", work_dir) + job.export_job(work_dir) + + if not args.config_only: + job.simulator_run(work_dir) + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument( + "-c", + "--sites", + nargs="*", # 0 or more values expected => creates a list + type=str, + default=[], # default if nothing is provided + help="Space separated site names", + ) + parser.add_argument( + "-n", + "--job_name", + type=str, + nargs="?", + default="xgb_job", + help="job name, default to xgb_job", + ) + parser.add_argument( + "-w", + "--work_dir", + type=str, + nargs="?", + default="/tmp/nvflare/jobs/xgb/workdir", + help="work directory, default to '/tmp/nvflare/jobs/xgb/workdir'", + ) + parser.add_argument( + "-i", + "--input_dir", + type=str, + nargs="?", + default="", + help="root directory for input data", + ) + parser.add_argument( + "-p", + "--file_postfix", + type=str, + nargs="?", + default="_embedding.csv", + help="file ending postfix, such as '.csv', or '_embedding.csv'", + ) + + parser.add_argument("-co", "--config_only", action="store_true", help="config only mode, will not run simulator") + + return parser.parse_args() + + +if __name__ == "__main__": + main() diff --git a/examples/advanced/finance-end-to-end/readme.md b/examples/advanced/finance-end-to-end/readme.md deleted file mode 100644 index 579748f474..0000000000 --- a/examples/advanced/finance-end-to-end/readme.md +++ /dev/null @@ -1,398 +0,0 @@ -# End-to-End Process Illustration of Federated XGBoost Methods - -This example demonstrates the use of an end-to-end process for credit card fraud detection using XGBoost. - -The original dataset is based on the [kaggle credit card fraud dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud). 
- -To illustrate the end-to-end process, we manually duplicated the records to extend the data time span from 2 days to over 2 years. Don't focus too much on the data itself, as our primary goal is to showcase the process. - -The overall steps of the end-to-end process include the following: - -## Prepare Data - -In a real-world application, this step is not necessary. - -* To prepare the data, we expand the credit card data by adding additional randomly generated columns, -including sender and receiver BICs, currency, etc. -* We then split the data based on the Sender BIC. Each Sender represents one financial institution, -thus serving as one site (client) for federated learning. - -We illustrate this step in the notebook [prepare_data] (./prepare_data.ipynb). The resulting dataset looks like the following: - -![data](images/generated_data.png) - -Once we have the this synthetic data, we like to split the data into -* historical data ( oldest data) -- 55% -* training data 35 % -* test data remaining 10% - -``` -Historical DataFrame size: 626575 -Training DataFrame size: 398729 -Testing DataFrame size: 113924 -``` -Next we will split the data among different clients, i.e. different Sender_BICs. -For example: Sender = JPMorgan_Case, BIC =CHASUS33 -the client directory is **CHASUS33_JPMorgan_Chase** - -For this site, we will have three files. -``` -5343086 Jul 29 08:31 history.csv -977888 Jul 29 08:31 test.csv -3409228 Jul 29 08:31 train.csv -``` -![split_data](images/split_data.png) -The python code for data generation is located at [prepare_data.py] (./prepare_data.py) - -## Initial Analysis - -We choose one site data for analysis - -### Feature Engineering -In this process, we will enrich the data and add a few new derived features to illustrate the process. -Whether this enrichment makes sense or not is not important, as you can always replace these steps with procedures that -make sense to you. - -Since all sites follow the same procedures, we only need to look at one site. For example, we will look at the site with -the name "CHASUS33_JPMorgan_Chase." - -The data enrichment process involves the following steps: - -1. **Grouping by Currency**: Calculate hist_trans_volume, hist_total_amount, and hist_average_amount for each currency. -2. **Aggregation for Training and Test Data**: Aggregate the data in 1-hour intervals, grouped by currency. The aggregated value is then divided by hist_trans_volume, and this new column is named x2_y1. -3. **Repeating for Beneficiary BIC**: Perform the same process for Beneficiary_BIC to generate another feature called x3_y2. -4. **Merging Features**: Merge the two enriched features based on Time and Beneficiary_BIC. -The resulting dataset looks like this: - -The resulting Dataset looks like this. -![enrich_data](images/enrichment.png) - -We save the enriched data into a new csv file. -``` -CHASUS33_JPMorgan_Chase/train_enrichment.csv -CHASUS33_JPMorgan_Chase/test_enrichment.csv -``` -### Pre-processing -Once we enrich the features, we need to normalize the numerical features and perform one-hot encoding for the categorical -features. However, we will skip the categorical feature encoding in this example to avoid significantly increasing -the file size (from 11 MB to over 2 GB). - -Similar to the feature enrichment process, we will consider only one site for now. The steps are straightforward: -we apply the scaler transformation to the numerical features and then merge them back with the categorical features. 
- -``` - scaler = MinMaxScaler() - - # Fit and transform the numerical data - numerical_normalized = pd.DataFrame(scaler.fit_transform(numerical_features), columns=numerical_features.columns) - - # Combine the normalized numerical features with the categorical features - df_combined = pd.concat([categorical_features, numerical_normalized], axis=1) -``` -the file is then saved to "_normalized.csv" - -``` -CHASUS33_JPMorgan_Chase/train_normalized.csv -CHASUS33_JPMorgan_Chase/test_normalized.csv -``` -## Federated ETL - -We can easily convert the notebook code into the python code -### Feature Enrichment - -#### ETL Script -convert the enrichment code for one-site to the federated learning is easy -look at the [enrich.py](enrich.py) -we capture the logic of enrichment in - -```python -def enrichment(input_dir, site_name) -> dict: - # code skipped -``` -the main function will be similar to the following. - -``` -def main(): - print("\n enrichment starts \n ") - - args = define_parser() - - input_dir = args.input_dir - output_dir = args.output_dir - - site_name = - print(f"\n {site_name =} \n ") - - merged_dfs = enrichment(input_dir, site_name) - - for ds_name in merged_dfs: - save_to_csv(merged_dfs[ds_name], output_dir, site_name, ds_name) - -``` -change this code to Federated ETL code, we just add few lines of code - -flare.init() - -etl_task = flare.receive() - -end_task = GenericTask() - -flare.send(end_task) - -``` - -def main(): - print("\n enrichment starts \n ") - - args = define_parser() - flare.init() - - input_dir = args.input_dir - output_dir = args.output_dir - - site_name = flare.get_site_name() - print(f"\n {site_name =} \n ") - - # receives global message from NVFlare - etl_task = flare.receive() - merged_dfs = enrichment(input_dir, site_name) - - for ds_name in merged_dfs: - save_to_csv(merged_dfs[ds_name], output_dir, site_name, ds_name) - - # send message back the controller indicating end. - end_task = GenericTask() - flare.send(end_task) -``` -This is the feature enrichment script. - -#### ETL Job - -Federated ETL requires both server-side and client-side code. The above ETL script is the client-side code. -To complete the setup, we need server-side code to configure and specify the federated job. -For this purpose, we wrote the following script: [enrich_job.py](enrich.py) - -``` - -def main(): - args = define_parser() - - site_names = args.sites - work_dir = args.work_dir - job_name = args.job_name - task_script_path = args.task_script_path - task_script_args = args.task_script_args - - job = FedJob(name=job_name) - - # Define the enrich_ctrl workflow and send to server - enrich_ctrl = ETLController(task_name="enrich") - job.to(enrich_ctrl, "server", id="enrich") - - # Add clients - for site_name in site_names: - executor = ScriptExecutor(task_script_path=task_script_path, task_script_args=task_script_args) - job.to(executor, site_name, tasks=["enrich"], gpu=0) - - if work_dir: - print(f"{work_dir=}") - job.export_job(work_dir) - - if not args.config_only: - job.simulator_run(work_dir) -``` -Here we define a ETLController for server, and ScriptExecutor for client side ETL script. - -### Pre-process - -#### ETL Script - -Converting the pre-processing code for one site to federated learning is straightforward. -Refer to the [pre_process.py](pre_process.py) script for details. 
- -``` - -dataset_names = ["train", "test"] -datasets = {} - -def main(): - args = define_parser() - input_dir = args.input_dir - output_dir = args.output_dir - - flare.init() - site_name = flare.get_site_name() - etl_task = flare.receive() - processed_dfs = process_dataset(input_dir, site_name) - save_normalized_files(output_dir, processed_dfs, site_name) - - end_task = GenericTask() - flare.send(end_task) - -``` -#### ETL Job - -This is almost identical to the Enrichment job, besides the task name - -``` -def main(): - args = define_parser() - - site_names = args.sites - work_dir = args.work_dir - job_name = args.job_name - task_script_path = args.task_script_path - task_script_args = args.task_script_args - - job = FedJob(name=job_name) - - pre_process_ctrl = ETLController(task_name="pre_process") - job.to(pre_process_ctrl, "server", id="pre_process") - - # Add clients - for site_name in site_names: - executor = ScriptExecutor(task_script_path=task_script_path, task_script_args=task_script_args) - job.to(executor, site_name, tasks=["pre_process"], gpu=0) - - if work_dir: - job.export_job(work_dir) - - if not args.config_only: - job.simulator_run(work_dir) -``` -## Federated Training of XGBoost - -Now we have enriched and normalized features, we can directly run XGBoost. -Here is the xgboost job code - -``` -def main(): - args = define_parser() - - site_names = args.sites - work_dir = args.work_dir - job_name = args.job_name - root_dir = args.input_dir - file_postfix = args.file_postfix - - num_rounds = 10 - early_stopping_rounds = 10 - xgb_params = { - "max_depth": 8, - "eta": 0.1, - "objective": "binary:logistic", - "eval_metric": "auc", - "tree_method": "hist", - "nthread": 16, - } - - job = FedJob(name=job_name) - - # Define the controller workflow and send to server - - controller = XGBFedController( - num_rounds=num_rounds, - training_mode="horizontal", - xgb_params=xgb_params, - xgb_options={"early_stopping_rounds": early_stopping_rounds}, - ) - job.to(controller, "server") - - # Add clients - for site_name in site_names: - executor = FedXGBHistogramExecutor(data_loader_id="data_loader") - job.to(executor, site_name, gpu=0) - data_loader = CreditCardDataLoader(root_dir=root_dir, file_postfix=file_postfix) - job.to(data_loader, site_name, id="data_loader") - if work_dir: - job.export_job(work_dir) - - if not args.config_only: - job.simulator_run(work_dir) -``` - -In this code, all we need to write is ```CreditCardDataLoader```, which is XGBDataLoader, -the rest of code is handled by XGBoost Controller and Executor. - -in -``` -class CreditCardDataLoader(XGBDataLoader): -``` -we only loaded the numerical feature in this example, in your case, you might to chagne that. - -## Running end-by-end code -You can run this from the command line interface (CLI) or orchestrate it using a workflow tool such as Airflow. -Here, we will demonstrate how to run this from a simulator. You can always export the job configuration and run -it anywhere in a real deployment. 
- -Assuming you have already downloaded the credit card dataset and the creditcard.csv file is located in the current directory: - -* prepare data -``` -python prepare_data.py -i ./creditcard.csv -o /tmp/nvflare/xgb/credit_card -``` ->>note -> -> All Sender SICs are considered clients: they are -> * 'BARCGB22_Barclays_Bank' -> * 'BSCHESMM_Banco_Santander' -> * 'CITIUS33_Citibank' -> * 'SCBLSGSG_Standard_Chartered_Bank' -> * 'UBSWCHZH80A_UBS' -> * 'BNPAFRPP_BNP_Paribas' -> * 'CHASUS33_JPMorgan_Chase' -> * 'HSBCHKHH_HSBC' -> * 'ANZBAU3M_ANZ_Bank' -> * 'DEUTDEFF_Deutsche_Bank' -> Total 10 banks - -* enrich data - - -``` -python enrich_job.py -c CHASUS33_JPMorgan_Chase HSBCHKHH_HSBC DEUTDEFF_Deutsche_Bank BARCGB22_Barclays_Bank BNPAFRPP_BNP_Paribas UBSWCHZH80A_UBS BSCHESMM_Banco_Santander ANZBAU3M_ANZ_Bank SCBLSGSG_Standard_Chartered_Bank CITIUS33_Citibank -p enrich.py -a "-i /tmp/nvflare/xgb/credit_card/ -o /tmp/nvflare/xgb/credit_card/" -``` - -* pre-process data - -``` -python pre_process_job.py -c CHASUS33_JPMorgan_Chase HSBCHKHH_HSBC DEUTDEFF_Deutsche_Bank BARCGB22_Barclays_Bank BNPAFRPP_BNP_Paribas UBSWCHZH80A_UBS BSCHESMM_Banco_Santander ANZBAU3M_ANZ_Bank SCBLSGSG_Standard_Chartered_Bank CITIUS33_Citibank -p pre_process.py -a "-i /tmp/nvflare/xgb/credit_card -o /tmp/nvflare/xgb/credit_card/" - -``` - -* XGBoost Job -Finally we take the normalized data and run XGboost Job - -``` -python xgb_job.py -c CHASUS33_JPMorgan_Chase HSBCHKHH_HSBC DEUTDEFF_Deutsche_Bank BARCGB22_Barclays_Bank BNPAFRPP_BNP_Paribas UBSWCHZH80A_UBS BSCHESMM_Banco_Santander ANZBAU3M_ANZ_Bank SCBLSGSG_Standard_Chartered_Bank CITIUS33_Citibank -i /tmp/nvflare/xgb/credit_card -w /tmp/nvflare/workspace/xgb/credit_card/ -``` -Here is the output of last 9th and 10th round of training (starting round = 0) -``` -... 
- -[19:58:27] [8] eval-auc:0.67126 train-auc:0.71717 -[19:58:27] [8] eval-auc:0.67126 train-auc:0.71717 -[19:58:27] [8] eval-auc:0.67126 train-auc:0.71717 -[19:58:27] [8] eval-auc:0.67126 train-auc:0.71717 -[19:58:27] [8] eval-auc:0.67126 train-auc:0.71717 -[19:58:27] [8] eval-auc:0.67126 train-auc:0.71717 -[19:58:27] [8] eval-auc:0.67126 train-auc:0.71717 -[19:58:27] [8] eval-auc:0.67126 train-auc:0.71717 -[19:58:27] [8] eval-auc:0.67126 train-auc:0.71717 -[19:58:27] [8] eval-auc:0.67126 train-auc:0.71717 - -[19:58:30] [9] eval-auc:0.67348 train-auc:0.71769 -[19:58:30] [9] eval-auc:0.67348 train-auc:0.71769 -[19:58:30] [9] eval-auc:0.67348 train-auc:0.71769 -[19:58:30] [9] eval-auc:0.67348 train-auc:0.71769 -[19:58:30] [9] eval-auc:0.67348 train-auc:0.71769 -[19:58:30] [9] eval-auc:0.67348 train-auc:0.71769 -[19:58:30] [9] eval-auc:0.67348 train-auc:0.71769 -[19:58:30] [9] eval-auc:0.67348 train-auc:0.71769 -[19:58:30] [9] eval-auc:0.67348 train-auc:0.71769 -[19:58:30] [9] eval-auc:0.67348 train-auc:0.71769 -[07:33:54] Finished training -``` - - - diff --git a/examples/advanced/finance-end-to-end/prepare_data.py b/examples/advanced/finance-end-to-end/utils/prepare_data.py similarity index 100% rename from examples/advanced/finance-end-to-end/prepare_data.py rename to examples/advanced/finance-end-to-end/utils/prepare_data.py diff --git a/examples/advanced/finance-end-to-end/xgboost.ipynb b/examples/advanced/finance-end-to-end/xgboost.ipynb index 0163088d21..eca92a0d88 100644 --- a/examples/advanced/finance-end-to-end/xgboost.ipynb +++ b/examples/advanced/finance-end-to-end/xgboost.ipynb @@ -9,9 +9,7 @@ "\n", "This notebooks shows the how do we convert and existing tabular credit data, enrich and pre-process data using one-site (like centralized dataset) and then convert this centralized process into a federated ETL steps, easily. Then construct a federated XGBoost, the only thing user need to define is the XGboost data loader. \n", "\n", - "## Install requirements\n", - "\n", - "\n" + "## Install requirements\n" ] }, { @@ -33,9 +31,10 @@ "tags": [] }, "source": [ - "## Data Prepare Data \n", + "## Step 1: Data Preparation \n", + "First, we prepare the data by adding random transactional information to the base creditcard dataset following the below script:\n", "\n", - "* [prepare data](./prepare_data.ipynb)\n" + "* [prepare data](./notebooks/prepare_data.ipynb)" ] }, { @@ -43,20 +42,51 @@ "id": "f69c008a-b19b-4c1a-b3c4-c376eccf53ba", "metadata": {}, "source": [ - "## Feature Enrichment\n", + "## Step 2: Feature Analysis\n", + "\n", + "For this stage, we would like to analyze the data, understand the features, and derive (and encode) secondary features that can be more useful for building the model.\n", + "\n", + "Towards this goal, there are two options:\n", + "1. **Feature Enrichment**: This process involves adding new features based on the existing data. For example, we can calculate the average transaction amount for each currency and add this as a new feature. \n", + "2. **Feature Encoding**: This process involves encoding the current features and transforming them to embedding space via machine learning models. This model can be either pre-trained, or trained with the candidate dataset.\n", + "\n", + "Considering the fact that the only two numerical features in the dataset are \"Amount\" and \"Time\", we will perform feature enrichment first. Optionally, we can also perform feature encoding. 
In this example, we use graph neural network (GNN): we will train the GNN model in a federated unsupervised fashion, and then use the model to encode the features for all sites. " + ] + }, + { + "cell_type": "markdown", + "id": "05dcf825-6e31-4d10-9968-2f353eaa4cea", + "metadata": {}, + "source": [ + "### Step 2.1: Rule-based Feature Enrichment\n", + "\n", + "#### Single-site Enrichment and Additional Processing\n", + "The detailed feature enrichment step is illustrated using one site as example: \n", "\n", - "We can first examine how the feature enrichment is processed using just one-site. \n", + "* [feature_enrichments with-one-site](./notebooks/feature_enrichment.ipynb)\n", "\n", - "* [feature_enrichments with-one-site](./feature_enrichment.ipynb)\n", + "Similarly, we examine the additional pre-processing step using one site: \n", "\n", - "in order to run feature job on each site similar to above feature enrichment steps, we wrote an enrichment ETL job.\n", + "* [pre-processing with one-site](./notebooks/pre_process.ipynb)\n" + ] + }, + { + "cell_type": "markdown", + "id": "8bc8bb99-a253-415e-8953-91af62ef22a2", + "metadata": {}, + "source": [ + "#### Federated Job to Perform on All Sites\n", + "In order to run feature enrichment and processing job on each site similar to above steps, we wrote federated ETL job scripts for client-side based on single-site implementations.\n", "\n", - "[enrichment script](./enrich.py)\n", + "* [enrichment script](./nvflare/enrich.py)\n", + "* [pre-processing script](./nvflare/pre_process.py) \n", "\n", - "Define a job to trigger running enrichnment script on each site: \n", + "Then we define job scripts on server-side to trigger and coordinate running client-side scripts on each site: \n", "\n", - "[enrich_job.py](./enrich_job.py)\n", + "* [enrich_job.py](./nvflare/enrich_job.py)\n", + "* [pre-processing-job](./nvflare/pre_process_job.py)\n", "\n", + "Example script as below:\n", "```\n", "# Define the enrich_ctrl workflow and send to server\n", " enrich_ctrl = ETLController(task_name=\"enrich\")\n", @@ -66,41 +96,39 @@ " for site_name in site_names:\n", " executor = ScriptExecutor(task_script_path=task_script_path, task_script_args=task_script_args)\n", " job.to(executor, site_name, tasks=[\"enrich\"], gpu=0)\n", - "```\n", - "\n", - "\n" + "```" ] }, { "cell_type": "markdown", - "id": "8bc8bb99-a253-415e-8953-91af62ef22a2", + "id": "8068c808-fce8-4f27-b0cf-76b486a24903", "metadata": {}, "source": [ - "## Pre-Processing \n", + "### (Optional) Step 2.2: GNN-based Feature Encoding\n", + "Based on raw features, or combining the derived features from **Step 2.1**, we can use machine learning models to encode the features. \n", + "In this example, we use federated GNN to learn and generate the feature embeddings.\n", "\n", - "We exam examine the steps for pre-processing using only one-site (one client) \n", + "First, we construct a graph based on the transaction data. Each node represents a transaction, and the edges represent the relationships between transactions. 
We then use the GNN to learn the embeddings of the nodes, which represent the transaction features.\n", "\n", - "* [pre-processing with one-site](./pre_process.ipynb)\n", + "#### Single-site operation example: graph construction\n", + "The detailed graph construction step is illustrated using one site as example:\n", "\n", - "Based on one-site, we create the pre-processing script\n", + "* [graph_construction with one-site](./notebooks/graph_construct.ipynb)\n", "\n", - "* [pre-processing script](./pre_process.py) \n", + "The detailed GNN training and encoding step is illustrated using one site as example:\n", "\n", - "then we define the pre-processing job to coordinate the pre-processing for all sites\n", + "* [gnn_training_encoding with one-site](./notebooks/gnn_train_encode.ipynb)\n", "\n", - "* [pre-processing-job](./pre_process_job.py)\n", + "#### Federated Job to Perform on All Sites\n", + "In order to run feature graph construction job on each site similar to the enrichment and processing steps, we wrote federated ETL job scripts for client-side based on single-site implementations.\n", "\n", - "```\n", - " pre_process_ctrl = ETLController(task_name=\"pre_process\")\n", - " job.to(pre_process_ctrl, \"server\", id=\"pre_process\")\n", + "* [graph_construction script](./nvflare/graph_construct.py)\n", + "* [gnn_train_encode script](./nvflare/gnn_train_encode.py)\n", "\n", - " # Add clients\n", - " for site_name in site_names:\n", - " executor = ScriptExecutor(task_script_path=task_script_path, task_script_args=task_script_args)\n", - " job.to(executor, site_name, tasks=[\"pre_process\"], gpu=0)\n", + "Similarily, we define job scripts on server-side to trigger and coordinate running client-side scripts on each site: \n", "\n", - "```\n", - " Similarly to the ETL job, we simply issue a task to trigger pre-process running pre-process script. " + "* [graph_construction_job.py](./nvflare/graph_construct_job.py)\n", + "* [gnn_train_encode_job.py](./nvflare/gnn_train_encode_job.py)" ] }, { @@ -110,39 +138,15 @@ "tags": [] }, "source": [ + "## Step 3: Federated XGBoost \n", "\n", - " def load_data(self, client_id: str, split_mode: int) -> Tuple[xgb.DMatrix, xgb.DMatrix]:\n", - " data = {}\n", - " for ds_name in self.dataset_names:\n", - " print(\"\\nloading for site = \", client_id, f\"{ds_name} dataset \\n\")\n", - " file_name = os.path.join(self.root_dir, client_id, self.base_file_names[ds_name])\n", - " df = pd.read_csv(file_name)\n", - " data_num = len(data)\n", + "Now that we have the data ready, either enriched and normalized features, or GNN feature embeddings, we can fit them with XGBoost. NVIDIA FLARE has already has written XGBoost Controller and Executor for the job. All we need to provide is the data loader to fit into the XGBoost.\n", "\n", - " # split to feature and label\n", - " y = df[\"Class\"]\n", - " x = df[self.numerical_columns]\n", - " data[ds_name] = (x, y, data_num)\n", - "\n", - "\n", - " # training\n", - " x_train, y_train, total_train_data_num = data[\"train\"]\n", - " data_split_mode = DataSplitMode(split_mode)\n", - " dmat_train = xgb.DMatrix(x_train, label=y_train, data_split_mode=data_split_mode)\n", + "To specify the controller and executor, we need to define a Job. 
You can find the job construction in\n", "\n", - " # validation\n", - " x_valid, y_valid, total_valid_data_num = data[\"test\"]\n", - " dmat_valid = xgb.DMatrix(x_valid, label=y_valid, data_split_mode=data_split_mode)\n", - "\n", - " return dmat_train, dmat_valid\n", - "## Define XGBoost Job \n", + "* [xgb_job.py](./nvflare/xgb_job.py). \n", "\n", - "Now that we have the data ready, We can fit the data into XGBoost. NVIDIA FLARE has already has written XGBoost Controller and Executor for the job. All we need to provide is the data loader to fit into the XGBoost\n", - "To specify the controller and executor, we need to define a Job. You can find the job construction can be find in\n", - "\n", - "* [xgb_job.py](./xgb_job.py). \n", - "\n", - "Here is main part of the code\n", + "Below is main part of the code\n", "\n", "```\n", " controller = XGBFedController(\n", @@ -167,11 +171,9 @@ " * test__normalized.csv\n", " \n", "\n", - "Notice we assign defined a [```CreditCardDataLoader```](./xgb_data_loader.py), this a XGBLoader we defined to load the credit card dataset. \n", + "Notice we assign defined a [```CreditCardDataLoader```](./nvflare/xgb_data_loader.py), this a XGBLoader we defined to load the credit card dataset. \n", "\n", "```\n", - "\n", - "\n", "import os\n", "from typing import Optional, Tuple\n", "\n", @@ -227,8 +229,6 @@ " dmat_valid = xgb.DMatrix(x_valid, label=y_valid, data_split_mode=data_split_mode)\n", "\n", " return dmat_train, dmat_valid\n", - "\n", - "\n", "```\n", "\n", "We are now ready to run all the code" @@ -239,12 +239,11 @@ "id": "036417d1-ad58-4835-b59b-fae94aafded3", "metadata": {}, "source": [ - "## Run all the Jobs\n", + "## Run All the Jobs End-to-end\n", "Here we are going to run each job in sequence. For real-world use case,\n", "\n", "* prepare data is not needed, as you already have the data\n", - "* feature enrichment scripts need to be define based on your own enrichment rules\n", - "* pre-processing, you also need to change the pre-process script to define normalization and categorical encodeing\n", + "* feature enrichment / encoding scripts need to be defined based on your own technique\n", "* for XGBoost Job, you will need to write your own data loader \n", "\n", "Note: All Sender SICs are considered clients: they are \n", @@ -259,6 +258,7 @@ "* 'HCBHSGSG_Bank_9'\n", "* 'XITXUS33_Bank_10' \n", "Total 10 banks\n", + "\n", "### Prepare Data" ] }, @@ -271,7 +271,7 @@ }, "outputs": [], "source": [ - "! python3 prepare_data.py -i ./creditcard.csv -o /tmp/nvflare/xgb/credit_card" + "! python3 ./utils/prepare_data.py -i ./creditcard.csv -o /tmp/nvflare/xgb/credit_card" ] }, { @@ -291,7 +291,9 @@ }, "outputs": [], "source": [ - "! python enrich_job.py -c 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'FBSFCHZH_Bank_6' 'YMNYFRPP_Bank_5' 'WPUWDEFF_Bank_4' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'YSYCESMM_Bank_7' 'ZHSZUS33_Bank_1' 'HCBHSGSG_Bank_9' -p enrich.py -a \"-i /tmp/nvflare/xgb/credit_card/ -o /tmp/nvflare/xgb/credit_card/\"\n" + "%cd nvflare\n", + "! python3 enrich_job.py -c 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'FBSFCHZH_Bank_6' 'YMNYFRPP_Bank_5' 'WPUWDEFF_Bank_4' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'YSYCESMM_Bank_7' 'ZHSZUS33_Bank_1' 'HCBHSGSG_Bank_9' -p enrich.py -a \"-i /tmp/nvflare/xgb/credit_card/ -o /tmp/nvflare/xgb/credit_card/\"\n", + "%cd .." ] }, { @@ -311,7 +313,29 @@ }, "outputs": [], "source": [ - "! 
python pre_process_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -p pre_process.py -a \"-i /tmp/nvflare/xgb/credit_card -o /tmp/nvflare/xgb/credit_card/\"" + "%cd nvflare\n", + "! python3 pre_process_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -p pre_process.py -a \"-i /tmp/nvflare/xgb/credit_card -o /tmp/nvflare/xgb/credit_card/\"\n", + "%cd .." + ] + }, + { + "cell_type": "markdown", + "id": "530f95e5-d104-43d3-8320-dd077d885799", + "metadata": {}, + "source": [ + "### Construct Graph" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a775b7a4-32de-4791-b17f-bc8291a5f885", + "metadata": {}, + "outputs": [], + "source": [ + "%cd nvflare\n", + "! python graph_construct_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -p graph_construct.py -a \"-i /tmp/nvflare/xgb/credit_card -o /tmp/nvflare/xgb/credit_card/\"\n", + "%cd .." ] }, { @@ -331,7 +355,9 @@ }, "outputs": [], "source": [ - "! python xgb_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -i /tmp/nvflare/xgb/credit_card -w /tmp/nvflare/workspace/xgb/credit_card/" + "%cd nvflare\n", + "! python3 xgb_job.py -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -i /tmp/nvflare/xgb/credit_card -w /tmp/nvflare/workspace/xgb/credit_card/\n", + "%cd .." ] }, { @@ -343,7 +369,7 @@ "source": [ "## Prepare Job for POC and Production\n", "\n", - "This seems to work well with Job running in simulator. Now we are ready to run in a POC mode, so we can simulate the deployment in localhost or simply deploy to production. \n", + "With job running well in simulator, we are ready to run in a POC mode, so we can simulate the deployment in localhost or simply deploy to production. \n", "\n", "All we need is the job definition. we can use job.export_job() method to generate the job configuration and export to given directory. For example, in xgb_job.py, we have the following\n", "\n", @@ -368,7 +394,9 @@ }, "outputs": [], "source": [ - "! python xgb_job.py -co -w /tmp/nvflare/workspace/xgb/credit_card/config -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -i /tmp/nvflare/xgb/credit_card " + "%cd nvflare\n", + "! python xgb_job.py -co -w /tmp/nvflare/workspace/xgb/credit_card/config -c 'YSYCESMM_Bank_7' 'FBSFCHZH_Bank_6' 'YXRXGB22_Bank_3' 'XITXUS33_Bank_10' 'HCBHSGSG_Bank_9' 'YMNYFRPP_Bank_5' 'ZHSZUS33_Bank_1' 'ZNZZAU3M_Bank_8' 'SHSHKHH1_Bank_2' 'WPUWDEFF_Bank_4' -i /tmp/nvflare/xgb/credit_card \n", + "%cd .." ] }, { @@ -459,7 +487,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.19" + "version": "3.8.18" } }, "nbformat": 4,