@@ -4,10 +4,11 @@
Shepherd
<br>
</h1>
-<h4 align="center"><em><span style="font-size:18pt">Large Language Models with Parameter-Efficient Federated Finetuning in the Presence of Heterogeneous Instructions</span></em></h4>
+<h4 align="center"><em><span style="font-size:18pt">A Platform Supporting Federated Instruction Tuning</span></em></h4>
<p align="center">
<a href="#Overview">Overview</a> •
+ <a href="https://arxiv.org/pdf/2305.05644.pdf">Paper</a> •
<a href="#Installation">Installation</a> •
<a href="#Data_Preparation">Data_Preparation</a> •
<a href="#Federated_Finetuning">Federated_Finetuning</a> •
@@ -16,15 +17,22 @@
</p>
+ [](https://github.com/JayZhang42/FederatedGPT-Shepherd/blob/main/LICENSE) \
+ **Usage and License Notices**: The data, code, and checkpoints are intended and licensed for research use only.
## Overview
-Recent advancements in fine-tuning large language models (LLMs) have leveraged instructions created by humans or APIs (such as ChatGPT and GPT-4) to revolutionize NLP research and industry applications. However, the collection of instructions from a wide array of individuals presents challenges in privacy and heterogeneity. Federated Learning, a well-studied and well-developed learning approach, provides a solution to addresses these challenges and paves the way for designing personalized LLMs tailored to individual users.
+Recent advancements in fine-tuning large language models (LLMs) have leveraged instructions created by humans or APIs (such as ChatGPT and GPT-4) to revolutionize NLP research and industry applications. However, collecting instructions from a wide array of individuals raises challenges in cost and privacy. For instance, users' daily conversations are a valuable source of guidance for LLMs, helping them generate authentic responses; yet privacy concerns may keep users from sharing those conversations, leaving a limited quantity of instructions that is not fully representative of the target population. Federated Learning, a well-studied and well-developed learning approach, provides a solution to address these challenges and paves the way for designing personalized LLMs tailored to individual users.
-This repository offers a foundational framework for exploring federated fine-tuning of LLMs using heterogeneous instructions across diverse categories. The framework is designed for ease of use, adaptability, and scalability to accommodate large datasets. Additionally, it facilitates seamless integration of novel algorithms and configurations, making it a convenient tool for both researchers and practitioners in the NLP community.
+This repository, *Shepherd*, offers a foundational framework for exploring federated finetuning of LLMs using heterogeneous instructions across diverse categories. The framework is designed for ease of use, adaptability, and scalability to accommodate large datasets. Additionally, it facilitates seamless integration of novel algorithms and configurations, making it a convenient tool for researchers and practitioners in both the FL and NLP communities.
+## Paper
+We are pleased to share our [***FedIT*** paper](https://arxiv.org/pdf/2305.05644.pdf), "*Towards Building the Federated GPT: Federated Instruction Tuning.*" We kindly invite you to read it for an in-depth understanding of Federated Instruction Tuning for LLMs and further insights into our repository.
+<p align="center">
+ <img src="assets/FedIT.png" width="100%">
+</p>
## Installation
@@ -41,39 +49,28 @@ Prior to commencing the federated fine-tuning, make sure to create a data file f
```bash
num_client=10 # The number of clients
diff_quantity=0 # Whether clients have different amounts of data
-python clients_datasets.py $num_client $diff_quantity
+python client_data_allocation.py $num_client $diff_quantity
```
Running this command will save the data files in the folder `./data/str(num_client)` (e.g., `./data/10` for ten clients). The data file `new-databricks-dolly-15k.json` used to generate each client's local dataset is the first version of `databricks-dolly-15k`, a corpus of more than 15,000 records across 8 categories generated by thousands of [Databricks Lab](https://www.databricks.com/learn/labs) employees. Please refer to their official repository [dolly](https://github.com/databrickslabs/dolly) for the latest version of the data.
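If you want to sanity-check the instruction pool before splitting it across clients, a quick way is to count how many records fall into each category. The snippet below is a minimal sketch, not part of the repo; it assumes the JSON file is in your working directory and that each record carries a `category` field, as in the original `databricks-dolly-15k` schema.

```python
import json
from collections import Counter

# Assumed path: adjust to wherever new-databricks-dolly-15k.json lives in your checkout.
with open("new-databricks-dolly-15k.json", "r", encoding="utf-8") as f:
    records = json.load(f)

# Each record is assumed to carry a "category" label (open_qa, classification, ...).
category_counts = Counter(rec["category"] for rec in records)

for category, count in category_counts.most_common():
    print(f"{category:<25} {count}")
print("total records:", sum(category_counts.values()))
```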
### Category distribution and Heterogeneity
-The first version of `databricks-dolly-15k` contains 8 Categories, with the distribution of each category shown in the following figure.
+The first version of `databricks-dolly-15k` contains 8 categories; the distribution of each category is shown in the right subfigure of the figure below.
<p align="center">
- <img src="assets/pie_chart_viridis_style.png" width="100%">
+ <img src="assets/twodonuts.png" width="150%">
</p>
+Without federated learning, the model can be trained only on each user's particular local instruction categories (left) due to privacy or cost issues. By implementing our Federated Instruction Tuning ([***FedIT***](https://arxiv.org/pdf/2305.05644.pdf)) framework with this repo, *Shepherd*, the LLM can be trained on the local instruction datasets of all clients, with a greater diversity and quantity of data points that cover the entire range of the subject matter (right).
+The following figure presents an illustrative depiction of the category distribution of each client, exemplifying the heterogeneous nature of clients' instructions.
-The following table presents an illustrative depiction of the category distributions among each client, serving to exemplify the diverse nature of clients' instructions
-
-| | Open_qa | General_qa | Classification | Closed_qa | Brainstorming | Information_extraction | Summarization | Creative_writing |
-|----------|---------|------------|----------------|-----------|---------------|------------------------|---------------|------------------|
-| Client 0 | 0 | 0 | **149** | **598** | 0 | 0 | **746** | 0 |
-| Client 1 | **747** | 0 | **747** | 0 | 0 | 0 | 0 | 0 |
-| Client 2 | **377** | **747** | 0 | 0 | 0 | **370** | 0 | 0 |
-| Client 3 | **985** | 0 | 0 | 0 | 0 | 0 | **507** | 0 |
-| Client 4 | 0 | 0 | 0 | **747** | 0 | **747** | 0 | 0 |
-| Client 5 | **746** | **747** | 0 | 0 | 0 | 0 | 0 | 0 |
-| Client 6 | 0 | **362** | 0 | 0 | **747** | **385** | 0 | 0 |
-| Client 7 | **746** | 0 | **483** | 0 | **264** | 0 | 0 | 0 |
-| Client 8 | 0 | **325** | 0 | **468** | 0 | 0 | 0 | **701** |
-| Client 9 | 0 | 0 | **747** | 0 | **747** | 0 | 0 | 0 |
-
-
+<p align="center">
+ <img src="assets/hetero.png" width="150%">
+</p>
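For intuition, here is a minimal sketch of how such a heterogeneous (non-IID) split can be produced: each client is handed only a couple of categories, and each category's records are divided among the clients that hold it. This is an illustration of the idea only, not the repo's actual allocation logic; `client_data_allocation.py` implements its own scheme, and the `category` field name is assumed from `databricks-dolly-15k`.

```python
import json
import random
from collections import defaultdict

NUM_CLIENTS = 10
CATEGORIES_PER_CLIENT = 2   # each client only sees a couple of instruction categories
random.seed(42)

with open("new-databricks-dolly-15k.json", "r", encoding="utf-8") as f:
    records = json.load(f)

# Group the instruction pool by category.
by_category = defaultdict(list)
for rec in records:
    by_category[rec["category"]].append(rec)

# Hand each client a small random subset of categories.
categories = list(by_category)
owners = {cat: [] for cat in categories}
for cid in range(NUM_CLIENTS):
    for cat in random.sample(categories, CATEGORIES_PER_CLIENT):
        owners[cat].append(cid)

# Split every category's records round-robin among the clients that own it.
client_data = {cid: [] for cid in range(NUM_CLIENTS)}
for cat, cids in owners.items():
    if not cids:
        continue  # nobody drew this category in this toy run
    random.shuffle(by_category[cat])
    for i, rec in enumerate(by_category[cat]):
        client_data[cids[i % len(cids)]].append(rec)

for cid, data in client_data.items():
    print(f"client {cid}: {len(data)} records")
```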
### Use your own data
-You can simply modify `clients_datasets.py` to load your own dataset for federated training.
+You can simply modify `client_data_allocation.py` to load your own dataset for federated training.
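
A minimal sketch of what that typically involves: bring your data into the same record layout the Dolly-based file uses, save it as JSON, then point the allocation script at the new file. The field names below follow the original `databricks-dolly-15k` (adjust them if the repo's copy differs), and the CSV file and its columns (`question`, `answer`, `topic`) are purely hypothetical placeholders for your own format.

```python
import csv
import json

records = []
# "my_instructions.csv" and its column names are hypothetical; map your own fields here.
with open("my_instructions.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        records.append({
            "instruction": row["question"],               # the user request
            "context": row.get("context", ""),            # optional supporting text
            "response": row["answer"],                     # the target output
            "category": row.get("topic", "general_qa"),   # used when simulating non-IID clients
        })

with open("my-instruction-dataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```

Once the file matches that layout, swapping it in for `new-databricks-dolly-15k.json` inside `client_data_allocation.py` should only be a small change.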
## Federated_Finetuning
@@ -99,7 +96,7 @@ python main.py --global_model 'chavinlo/alpaca-native'\
--output_dir './lora-shepherd-7b/'\
--num_communication_rounds 10 \
--num_clients 10 \
- --client_selection_frac 0.05 \
+ --client_selection_frac 0.1 \
--local_num_epochs 2 \
--local_batch_size 64 \
--local_micro_batch_size 32 \
@@ -129,11 +126,21 @@ python GlobalModel_generate.py \
## Citation
-Please cite this repo if you find our repository helpful for your research.
+Please cite our FedIT paper and this repo if you find them helpful for your research. Thank you!
+```
+@misc{zhang2023building,
+ title={Towards Building the Federated GPT: Federated Instruction Tuning},
+ author={Jianyi Zhang and Saeed Vahidian and Martin Kuo and Chunyuan Li and Ruiyi Zhang and Guoyin Wang and Yiran Chen},
+ year={2023},
+ eprint={2305.05644},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+}
+```
```
-@misc{Shepherd,
- author = {Jianyi Zhang, Martin Kuo, Ruiyi Zhang, Guoyin Wang, Saeed Vahidian, Yiran Chen},
- title = {Shepherd: Large Language Models with Parameter-Efficient Federated Finetuning in the Presence of Heterogeneous Instructions},
+@misc{Shepherdgithub,
+ author = {Jianyi Zhang and Martin Kuo and Ruiyi Zhang and Guoyin Wang and Saeed Vahidian and Yiran Chen},
+ title = {Shepherd: A Lightweight GitHub Platform Supporting Federated Instruction Tuning},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},