2 months ago · 40e4a64998
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
 
				+data/
			
--- a/README.md
+++ b/README.md
@@ -0,0 +1,1400 @@
 
				+# CodeFusion: A Study of Call-Chain-Based Code Slice Fusion
			
 
				+
			
 
				+## Abstract
			
 
				+
			
 
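				+This work proposes CodeFusion, a code slice fusion technique based on function call chains. It splits a target code fragment and embeds the pieces into several functions of an existing program, drawing on methods from program analysis, compiler theory, and large language models (LLMs).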
			
 
				+
			
 
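				+Specifically, the approach first builds the control flow graph (CFG) of the target program through lexical analysis and parsing, then computes the dominance relation between basic blocks within a data-flow analysis framework to identify the critical points that every execution must pass through. On that basis, an LLM interprets the semantics of the code to be fused and splits it into a sequence of slices that satisfies the dependency constraints. Finally, each slice is inserted at a fusion point of a function on the call chain, and state is shared across functions through global variables or parameter passing.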
			
 
				+
			
 
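				+Experiments indicate that the method can distribute a complete piece of code logic across multiple functions while preserving the semantics of the program. Potential applications include code obfuscation, software watermark embedding, security vulnerability testing, and software protection, which makes the technique of both theoretical and practical interest.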
			
 
				+
			
 
				+**Keywords**: code fusion; control flow graph; dominance analysis; large language models; program transformation
			
 
				+
			
 
				+---
			
 
				+
			
 
				+## 1. Background and Objectives
			
 
				+
			
 
				+### 1.1 Motivation
			
 
				+
			
 
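				+In software security and reverse engineering, how a program is structured directly affects how hard it is to analyze. Traditional code obfuscation focuses on transformations inside a single function, such as control-flow flattening and opaque predicate insertion. These techniques tend to overlook the obfuscation potential of the call relationships between functions.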
			
 
				+
			
 
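				+The core insight of this work is: **using the function call chain of an existing program as a "carrier" and scattering sensitive code across it can markedly improve the stealth of that code**. The advantages of this idea are: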
			
 
				+
			
 
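				+1. **Reuse of existing code structure**: no new control flow has to be constructed; existing functions are reused directly
				+2. **Semantic-level dispersion**: slices are separated at the semantic level, not merely at the syntactic level
				+3. **Resistance to analysis**: analyzing any single function in isolation cannot recover the complete logic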
			
 
				+
			
 
				+### 1.2 Formal Problem Definition
			
 
				+
			
 
				+Let the target program $\mathcal{P}$ contain the function set $\mathcal{F}_{all}$, within which there is a call chain of depth $n$:
			
 
				+
			
 
				+$$
			
 
				+\mathcal{F} = \{f_1, f_2, \ldots, f_n\} \subseteq \mathcal{F}_{all}
			
 
				+$$
			
 
				+
			
 
				+The call relation satisfies:
			
 
				+
			
 
				+$$
			
 
				+\forall i \in [1, n-1]: f_i \xrightarrow{\text{call}} f_{i+1}
			
 
				+$$
			
 
				+
			
 
				+Given the target code fragment $C_{target}$ to be fused, the goal is to find a splitting function $\phi$ and a fusion function $\psi$ such that:
			
 
				+
			
 
				+$$
			
 
				+\phi: C_{target} \rightarrow \{c_1, c_2, \ldots, c_n\}
			
 
				+$$
			
 
				+
			
 
				+$$
			
 
				+\psi: (\mathcal{F}, \{c_1, \ldots, c_n\}) \rightarrow \mathcal{F}' = \{f_1', f_2', \ldots, f_n'\}
			
 
				+$$
			
 
				+
			
 
				+where the fused function set $\mathcal{F}'$ must satisfy the following **semantic equivalence constraint**:
			
 
				+
			
 
				+$$
			
 
				+\boxed{\text{Exec}(f_1') \equiv \text{Exec}(f_1) \circ \text{Exec}(C_{target})}
			
 
				+$$
			
 
				+
			
 
				+That is, executing $f_1'$ has the same effect as executing the original $f_1$ and then the target code $C_{target}$.
			
 
				+
			
 
				+More precisely, let $\sigma$ be a program state and $\llbracket \cdot \rrbracket$ the semantic function; then:
			
 
				+
			
 
				+$$
			
 
				+\llbracket f_1' \rrbracket(\sigma_0) = \llbracket C_{target} \rrbracket(\llbracket f_1 \rrbracket(\sigma_0))
			
 
				+$$
			
 
				+
			
 
				+### 1.3 Constraints
			
 
				+
			
 
				+A split must satisfy the following constraints:
			
 
				+
			
 
				+**Constraint 1 (completeness)**: the union of all slices covers every statement of the original code:
			
 
				+
			
 
				+$$
			
 
				+\bigcup_{i=1}^{n} \text{Stmts}(c_i) \supseteq \text{Stmts}(C_{target})
			
 
				+$$
			
 
				+
			
 
				+**Constraint 2 (dependency)**: if statement $s_j$ is data-dependent on statement $s_i$ (written $s_i \xrightarrow{dep} s_j$), with $s_i \in c_k$ and $s_j \in c_l$, then:
			
 
				+
			
 
				+$$
			
 
				+s_i \xrightarrow{dep} s_j \Rightarrow k \leq l
			
 
				+$$
			
 
				+
			
 
				+**Constraint 3 (reachability)**: for every slice $c_i$ with $i < n$, its insertion point $p_i \in f_i$ must execute before the call to $f_{i+1}$:
			
 
				+
			
 
				+$$
			
 
				+\text{Dominates}(p_i, \text{CallSite}(f_{i+1}))
			
 
				+$$
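Constraints 1 and 2 can be checked mechanically once a candidate split is available. The following is a minimal sketch under stated assumptions: statements are represented by string labels, dependencies by label pairs, and the helper name `check_split` is hypothetical rather than part of CodeFusion. Constraint 3 is not covered because it needs the CFG of each function.

```python
from typing import List, Sequence, Tuple


def check_split(slices: Sequence[Sequence[str]],
                stmts: Sequence[str],
                deps: Sequence[Tuple[str, str]]) -> Tuple[bool, bool]:
    """Return (completeness_ok, dependency_ok) for a candidate split."""
    # Map every statement label to the index of the slice that contains it.
    index = {s: i for i, sl in enumerate(slices) for s in sl}
    # Constraint 1: every original statement appears in some slice.
    complete = set(stmts) <= set(index)
    # Constraint 2: a statement never lands in an earlier slice than
    # a statement it depends on.
    ordered = all(index[a] <= index[b]
                  for a, b in deps if a in index and b in index)
    return complete, ordered


stmts: List[str] = ["s1", "s2", "s3"]
deps = [("s1", "s3")]                 # s3 depends on s1
good = [["s1"], ["s2", "s3"]]
bad = [["s3"], ["s1", "s2"]]          # s3 is placed before s1
good_result = check_split(good, stmts, deps)
bad_result = check_split(bad, stmts, deps)
```

Here `good` satisfies both constraints, while `bad` covers all statements but violates the dependency order.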
			
 
				+
			
 
				+### 1.4 Research Objectives
			
 
				+
			
 
				+The specific objectives of this work are:
			
 
				+
			
 
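				+1. **Design an efficient CFG construction algorithm**: support control flow analysis of C/C++ code
				+2. **Implement precise dominator computation**: based on an iterative data-flow analysis framework
				+3. **Develop an intelligent code splitting method**: use an LLM for semantics-aware code slicing
				+4. **Build a complete fusion system**: support multiple state passing strategies
				+5. **Validate the method**: evaluate the fusion results experimentally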
			
 
				+
			
 
				+---
			
 
				+
			
 
				+## 2. Theoretical Foundations
			
 
				+
			
 
				+### 2.1 Control Flow Graph (CFG)
			
 
				+
			
 
				+#### 2.1.1 Definition and Properties
			
 
				+
			
 
				+**Definition 2.1 (control flow graph)**: the control flow graph of a program $P$ is a 4-tuple:
			
 
				+
			
 
				+$$
			
 
				+G_{CFG} = (V, E, v_{entry}, V_{exit})
			
 
				+$$
			
 
				+
			
 
				+where:
				+- $V = \{v_1, v_2, \ldots, v_m\}$ is the finite set of **basic blocks**
				+- $E \subseteq V \times V$ is the set of **control flow edges**
				+- $v_{entry} \in V$ is the unique **entry basic block**
				+- $V_{exit} \subseteq V$ is the set of **exit basic blocks**
			
 
				+
			
 
				+**Definition 2.2 (basic block)**: a basic block is a maximal instruction sequence $B = \langle i_1, i_2, \ldots, i_k \rangle$ that satisfies:
			
 
				+
			
 
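				+1. **Single entry**: only $i_1$ can be reached by a jump from outside
				+2. **Single exit**: only $i_k$ can jump to the outside
				+3. **Sequential execution**: if $i_j$ executes, then $i_{j+1}, \ldots, i_k$ execute in order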
			
 
				+
			
 
				+Formally:
			
 
				+
			
 
				+$$
			
 
				+\text{BasicBlock}(B) \Leftrightarrow \begin{cases}
			
 
				+\text{Entry}(B) = \{i_1\} \\
			
 
				+\text{Exit}(B) = \{i_k\} \\
			
 
				+\forall j \in [1, k-1]: \text{Succ}(i_j) = \{i_{j+1}\}
			
 
				+\end{cases}
			
 
				+$$
			
 
				+
			
 
				+#### 2.1.2 Basic Block Identification
			
 
				+
			
 
				+The first instruction of a basic block (its leader) is identified by the following rule:
			
 
				+
			
 
				+$$
			
 
				+\text{Leader}(i) = \begin{cases}
			
 
				+\text{True} & \text{if } i \text{ is the first instruction of the program} \\
				+\text{True} & \text{if } i \text{ is the target of some jump instruction} \\
				+\text{True} & \text{if } i \text{ immediately follows some jump instruction} \\
			
 
				+\text{False} & \text{otherwise}
			
 
				+\end{cases}
			
 
				+$$
			
 
				+
			
 
				+**Algorithm 2.1: Basic block partitioning**
			
 
				+
			
 
				+```
			
 
				+Input: instruction sequence I = [i_1, i_2, ..., i_n]
				+Output: set of basic blocks B
				+
				+1:  Leaders ← {i_1}  // the first instruction is a leader
			
 
				+2:  for each instruction i_j in I do
			
 
				+3:      if i_j is a branch instruction then
			
 
				+4:          Leaders ← Leaders ∪ {target(i_j)}
			
 
				+5:          if j < n then
			
 
				+6:              Leaders ← Leaders ∪ {i_{j+1}}
			
 
				+7:  B ← ∅
			
 
				+8:  for each leader l in sorted(Leaders) do
			
 
				+9:      b ← new BasicBlock starting at l
			
 
				+10:     extend b until next leader or end
			
 
				+11:     B ← B ∪ {b}
			
 
				+12: return B
			
 
				+```
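Algorithm 2.1 translates almost line by line into Python. The sketch below is illustrative only: the instruction encoding (an opcode string plus an optional jump target index) and the opcode names `jmp` and `br` are assumptions made for this example, and blocks are returned as lists of 0-based instruction indices.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: (opcode, jump target index or None).
Instr = Tuple[str, Optional[int]]
BRANCH_OPS = {"jmp", "br"}  # assumed opcode names for jumps


def split_basic_blocks(instrs: List[Instr]) -> List[List[int]]:
    """Partition instructions into basic blocks using the leader rules."""
    if not instrs:
        return []
    leaders = {0}  # the first instruction is a leader
    for j, (op, target) in enumerate(instrs):
        if op in BRANCH_OPS:
            if target is not None:
                leaders.add(target)      # jump target is a leader
            if j + 1 < len(instrs):
                leaders.add(j + 1)       # instruction after a jump is a leader
    ordered = sorted(leaders)
    blocks = []
    for pos, start in enumerate(ordered):
        # A block runs until the next leader or the end of the program.
        end = ordered[pos + 1] if pos + 1 < len(ordered) else len(instrs)
        blocks.append(list(range(start, end)))
    return blocks


program: List[Instr] = [
    ("mov", None),  # 0
    ("br", 3),      # 1: conditional branch to instruction 3
    ("add", None),  # 2
    ("sub", None),  # 3
    ("jmp", 0),     # 4: jump back to instruction 0
]
blocks = split_basic_blocks(program)
```

For this five-instruction program the leaders are instructions 0, 2, and 3, which gives three basic blocks.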
			
 
				+
			
 
				+#### 2.1.3 Edge Construction
			
 
				+
			
 
				+A control flow edge $(v_i, v_j)$ belongs to $E$ if and only if:
			
 
				+
			
 
				+$$
			
 
				+(v_i, v_j) \in E \Leftrightarrow \begin{cases}
			
 
				+\text{last}(v_i) \text{ is an unconditional jump to } \text{first}(v_j) \\
				+\lor\ \text{last}(v_i) \text{ is a conditional jump and } v_j \text{ is a possible target} \\
				+\lor\ \text{last}(v_i) \text{ is not a jump and } v_j \text{ is the sequential successor}
			
 
				+\end{cases}
			
 
				+$$
			
 
				+
			
 
				+#### 2.1.4 Properties of the CFG
			
 
				+
			
 
				+**Property 2.1 (connectivity)**: every $v \in V$ is reachable from $v_{entry}$:
			
 
				+
			
 
				+$$
			
 
				+\forall v \in V: v_{entry} \leadsto v
			
 
				+$$
			
 
				+
			
 
				+**Property 2.2 (normal form)**: every $v_{exit} \in V_{exit}$ has an empty successor set:
			
 
				+
			
 
				+$$
			
 
				+\forall v \in V_{exit}: \text{Succ}(v) = \emptyset
			
 
				+$$
			
 
				+
			
 
				+### 2.2 Dominance Relation
			
 
				+
			
 
				+#### 2.2.1 Basic Definitions
			
 
				+
			
 
				+**Definition 2.3 (dominance)**: in a CFG $G = (V, E, v_{entry}, V_{exit})$, node $d$ **dominates** node $n$ (written $d\ \text{dom}\ n$) if and only if every path from $v_{entry}$ to $n$ passes through $d$:
			
 
				+
			
 
				+$$
			
 
				+d\ \text{dom}\ n \Leftrightarrow \forall \text{ path } \pi: v_{entry} \leadsto n,\ d \in \pi
			
 
				+$$
			
 
				+
			
 
				+Equivalently, in set notation:
			
 
				+
			
 
				+$$
			
 
				+d\ \text{dom}\ n \Leftrightarrow d \in \text{Dom}(n)
			
 
				+$$
			
 
				+
			
 
				+where $\text{Dom}(n)$ is the set of dominators of node $n$.
			
 
				+
			
 
				+**Definition 2.4 (strict dominance)**: $d$ **strictly dominates** $n$ (written $d\ \text{sdom}\ n$):
			
 
				+
			
 
				+$$
			
 
				+d\ \text{sdom}\ n \Leftrightarrow d\ \text{dom}\ n \land d \neq n
			
 
				+$$
			
 
				+
			
 
				+**Definition 2.5 (immediate dominator)**: the **immediate dominator** $\text{idom}(n)$ of a node $n \neq v_{entry}$ is the strict dominator of $n$ that is closest to $n$:
			
 
				+
			
 
				+$$
			
 
				+\text{idom}(n) = d \Leftrightarrow d\ \text{sdom}\ n \land \forall d': d'\ \text{sdom}\ n \Rightarrow d'\ \text{dom}\ d
			
 
				+$$
			
 
				+
			
 
				+**Theorem 2.1**: every node other than the entry node has exactly one immediate dominator.
			
 
				+
			
 
				+#### 2.2.2 Computing Dominator Sets
			
 
				+
			
 
				+The dominance relation can be computed with an iterative data-flow analysis. The data-flow equation is:
			
 
				+
			
 
				+$$
			
 
				+\text{Dom}(n) = \begin{cases}
			
 
				+\{v_{entry}\} & \text{if } n = v_{entry} \\
			
 
				+\{n\} \cup \left( \displaystyle\bigcap_{p \in \text{Pred}(n)} \text{Dom}(p) \right) & \text{otherwise}
			
 
				+\end{cases}
			
 
				+$$
			
 
				+
			
 
				+**Algorithm 2.2: Iterative computation of dominator sets**
			
 
				+
			
 
				+```
			
 
				+Input: CFG G = (V, E, v_entry, V_exit)
				+Output: dominator set Dom for every node
				+
				+1:  Dom(v_entry) ← {v_entry}
				+2:  for each v ∈ V \ {v_entry} do
				+3:      Dom(v) ← V  // initialize to the full set
			
 
				+4:  repeat
			
 
				+5:      changed ← false
			
 
				+6:      for each v ∈ V \ {v_entry} do
			
 
				+7:          new_dom ← {v} ∪ (⋂_{p ∈ Pred(v)} Dom(p))
			
 
				+8:          if new_dom ≠ Dom(v) then
			
 
				+9:              Dom(v) ← new_dom
			
 
				+10:             changed ← true
			
 
				+11: until not changed
			
 
				+12: return Dom
			
 
				+```
			
 
				+
			
 
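				+**Complexity analysis**: let $|V| = n$ and $|E| = m$; then:
				+- Space complexity: $O(n^2)$ (storing all dominator sets)
				+- Time complexity: $O(n \cdot m)$ per pass, since each edge contributes one $O(n)$ set intersection; the number of passes depends on the node visiting order and on the loop structure of the CFG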
			
 
				+
			
 
				+#### 2.2.3 Dominator Tree
			
 
				+
			
 
				+**Definition 2.6 (dominator tree)**: the dominator tree $T_{dom} = (V, E_{dom})$ of a CFG is a tree rooted at $v_{entry}$, where:
			
 
				+
			
 
				+$$
			
 
				+(d, n) \in E_{dom} \Leftrightarrow d = \text{idom}(n)
			
 
				+$$
			
 
				+
			
 
				+Property of the dominator tree:
			
 
				+
			
 
				+$$
			
 
				+d\ \text{dom}\ n \Leftrightarrow d = n \lor d \text{ is an ancestor of } n \text{ in } T_{dom}
			
 
				+$$
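Given the dominator sets, the immediate dominators (and therefore the edges of the dominator tree) can be read off directly. The sketch below relies on the fact that the strict dominators of a node form a chain, so the closest one is the one with the largest dominator set. The function name `immediate_dominators` is hypothetical, and every node is assumed to be reachable from the entry (Property 2.1).

```python
from typing import Dict, Set


def immediate_dominators(dom: Dict[int, Set[int]], entry: int) -> Dict[int, int]:
    """Derive idom(n) for every non-entry node from the dominator sets."""
    idom: Dict[int, int] = {}
    for n, doms in dom.items():
        if n == entry:
            continue
        strict = doms - {n}
        # Strict dominators of n are totally ordered by dominance, so the
        # one closest to n has the largest dominator set of its own.
        idom[n] = max(strict, key=lambda d: len(dom[d]))
    return idom


# Diamond-shaped CFG followed by a tail: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 3 -> 4
dom = {0: {0}, 1: {0, 1}, 2: {0, 2}, 3: {0, 3}, 4: {0, 3, 4}}
idom = immediate_dominators(dom, entry=0)
tree_edges = sorted((d, n) for n, d in idom.items())
```

Each pair `(d, n)` in `tree_edges` is an edge of the dominator tree with `d = idom(n)`.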
			
 
				+
			
 
				+### 2.3 Critical Point
			
 
				+
			
 
				+#### 2.3.1 Definition
			
 
				+
			
 
				+**Definition 2.7 (critical point)**: in a CFG $G$, node $v$ is a **critical point** if and only if, once $v$ is removed, no exit node is reachable from $v_{entry}$:
			
 
				+
			
 
				+$$
			
 
				+v \in \mathcal{C}(G) \Leftrightarrow \forall v_{exit} \in V_{exit}: v_{entry} \not\leadsto_{G \setminus \{v\}} v_{exit}
			
 
				+$$
			
 
				+
			
 
				+where $G \setminus \{v\}$ denotes the subgraph obtained from $G$ by removing node $v$ and its incident edges.
			
 
				+
			
 
				+Equivalent definition:
			
 
				+
			
 
				+$$
			
 
				+v \in \mathcal{C}(G) \Leftrightarrow v\ \text{dom}\ v_{exit},\ \forall v_{exit} \in V_{exit}
			
 
				+$$
			
 
				+
			
 
				+#### 2.3.2 Deciding Critical Points
			
 
				+
			
 
				+**Algorithm 2.3: Critical point test**
			
 
				+
			
 
				+```
			
 
				+Input: CFG G, node v to check
				+Output: whether v is a critical point
				+
				+1:  if v = v_entry then
				+2:      return True
				+3:  G' ← G \ {v}  // remove node v
			
 
				+4:  for each v_exit ∈ V_exit do
			
 
				+5:      if Reachable(G', v_entry, v_exit) then
			
 
				+6:          return False
			
 
				+7:  return True
			
 
				+```
			
 
				+
			
 
				+**Theorem 2.2**: the set of critical points $\mathcal{C}(G)$ equals the intersection of the dominator sets of all exit nodes:
			
 
				+
			
 
				+$$
			
 
				+\mathcal{C}(G) = \bigcap_{v_{exit} \in V_{exit}} \text{Dom}(v_{exit})
			
 
				+$$
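Theorem 2.2 and Algorithm 2.3 describe the same set in two ways, which can be cross-checked on a small graph. The sketch below is illustrative: the function names are hypothetical, the CFG is given as a successor map, and the dominator sets are written out by hand for the example.

```python
from typing import Dict, Iterable, List, Set


def critical_points(dom: Dict[int, Set[int]], exits: Iterable[int]) -> Set[int]:
    """Theorem 2.2: intersect the dominator sets of all exit nodes."""
    result = None
    for e in exits:
        result = set(dom[e]) if result is None else result & dom[e]
    return result or set()


def is_critical_by_removal(succ: Dict[int, List[int]], entry: int,
                           exits: Iterable[int], v: int) -> bool:
    """Algorithm 2.3: v is critical if no exit is reachable without v."""
    if v == entry:
        return True
    seen, stack = {entry}, [entry]
    while stack:
        u = stack.pop()
        for w in succ.get(u, []):
            if w != v and w not in seen:  # skip the removed node
                seen.add(w)
                stack.append(w)
    return not any(e in seen for e in exits)


# CFG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 3 -> 4; node 4 is the only exit.
succ = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
dom = {0: {0}, 1: {0, 1}, 2: {0, 2}, 3: {0, 3}, 4: {0, 3, 4}}
by_dominance = critical_points(dom, exits=[4])
by_removal = {v for v in succ if is_critical_by_removal(succ, 0, [4], v)}
```

Both computations select the entry, the join block, and the exit; the two branch blocks are not critical because either one can be bypassed.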
			
 
				+
			
 
				+#### 2.3.3 Properties of Critical Points
			
 
				+
			
 
				+**Property 2.3 (chain structure)**: the critical points form a chain in the dominator tree, running from the root down to some node:
			
 
				+
			
 
				+$$
			
 
				+\forall c_1, c_2 \in \mathcal{C}(G): c_1\ \text{dom}\ c_2 \lor c_2\ \text{dom}\ c_1
			
 
				+$$
			
 
				+
			
 
				+**Property 2.4 (criticality is inherited by dominators)**: if $c_1\ \text{dom}\ c_2$ and $c_2 \in \mathcal{C}(G)$, then $c_1 \in \mathcal{C}(G)$.
			
 
				+
			
 
				+### 2.4 Fusion Point
			
 
				+
			
 
				+#### 2.4.1 Definition and Conditions
			
 
				+
			
 
				+**Definition 2.8 (fusion point)**: a fusion point is a location suitable for code insertion; it must satisfy:
			
 
				+
			
 
				+$$
			
 
				+v \in \mathcal{P}_{fusion}(G) \Leftrightarrow v \in \mathcal{C}(G) \land \Phi_{struct}(v) \land \Phi_{flow}(v)
			
 
				+$$
			
 
				+
			
 
				+where:
			
 
				+
			
 
				+**Structural condition** $\Phi_{struct}(v)$:
			
 
				+
			
 
				+$$
			
 
				+\Phi_{struct}(v) \Leftrightarrow |\text{Pred}(v)| \leq 1 \land |\text{Succ}(v)| \leq 1
			
 
				+$$
			
 
				+
			
 
				+**Control-flow condition** $\Phi_{flow}(v)$: the jump from the predecessor and the jump to the successor must both be unconditional:
			
 
				+
			
 
				+$$
			
 
				+\Phi_{flow}(v) \Leftrightarrow \neg\text{IsConditionalBranch}(\text{Pred}(v) \to v) \land \neg\text{IsConditionalBranch}(v \to \text{Succ}(v))
			
 
				+$$
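Definition 2.8 amounts to a filter over the critical points. The sketch below is an illustration under assumptions: the CFG is given as predecessor and successor maps, conditional edges are listed explicitly as a set of `(source, target)` pairs, and the function name `fusion_points` is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def fusion_points(critical: Set[int],
                  pred: Dict[int, List[int]],
                  succ: Dict[int, List[int]],
                  conditional_edges: Set[Tuple[int, int]]) -> Set[int]:
    """Keep the critical points that satisfy both extra conditions."""
    points = set()
    for v in critical:
        ps, ss = pred.get(v, []), succ.get(v, [])
        # Structural condition: at most one predecessor and one successor.
        if len(ps) > 1 or len(ss) > 1:
            continue
        # Control-flow condition: no adjacent edge is a conditional branch.
        edges = [(p, v) for p in ps] + [(v, s) for s in ss]
        if any(e in conditional_edges for e in edges):
            continue
        points.add(v)
    return points


# CFG: 0 -> 1, 0 -> 2 (conditional), 1 -> 3, 2 -> 3, 3 -> 4
pred = {1: [0], 2: [0], 3: [1, 2], 4: [3]}
succ = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
conditional = {(0, 1), (0, 2)}
points = fusion_points({0, 3, 4}, pred, succ, conditional)
```

In this example the entry block is rejected because it has two successors and the join block because it has two predecessors, leaving only the exit block.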
			
 
				+
			
 
				+#### 2.4.2 Fusion Point Priority
			
 
				+
			
 
				+When several fusion points exist, they are ranked by the following priority:
			
 
				+
			
 
				+$$
			
 
				+\text{Priority}(v) = \alpha \cdot \text{Depth}(v) + \beta \cdot \text{Centrality}(v) + \gamma \cdot \text{Stability}(v)
			
 
				+$$
			
 
				+
			
 
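				+where:
				+- $\text{Depth}(v)$: depth in the dominator tree
				+- $\text{Centrality}(v)$: a centrality measure in the CFG
				+- $\text{Stability}(v)$: size of the basic block (larger blocks are considered more stable)
				+- $\alpha, \beta, \gamma$: weight coefficients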
			
 
				+
			
 
				+---
			
 
				+
			
 
				+## 3. Method Design
			
 
				+
			
 
				+### 3.1 System Architecture
			
 
				+
			
 
				+The CodeFusion system is modular and consists of five core components:
			
 
				+
			
 
				+```
			
 
				+┌─────────────────────────────────────────────────────────────────────────────┐
			
 
				+│                           CodeFusion System                                  │
			
 
				+├─────────────────────────────────────────────────────────────────────────────┤
			
 
				+│                                                                             │
			
 
				+│  ┌─────────────────┐                                                        │
			
 
				+│  │   Input Layer   │                                                        │
			
 
				+│  │  ┌───────────┐  │                                                        │
			
 
				+│  │  │Source data│  │                                                        │
			
 
				+│  │  │ (JSONL)   │  │                                                        │
			
 
				+│  │  └─────┬─────┘  │                                                        │
			
 
				+│  └────────┼────────┘                                                        │
			
 
				+│           │                                                                 │
			
 
				+│           ▼                                                                 │
			
 
				+│  ┌─────────────────────────────────────────────────────────────────────┐   │
			
 
				+│  │                      Data Processing Layer                           │   │
			
 
				+│  │  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │   │
			
 
				+│  │  │ Call extraction │───▶│ Chain grouping  │───▶│ Depth filtering │  │   │
				+│  │  │ extract_call_   │    │ by connected    │    │ filter_by_      │  │   │
				+│  │  │ relations.py    │    │ components      │    │ call_depth.py   │  │   │
			
 
				+│  │  └─────────────────┘    └─────────────────┘    └───────┬─────────┘  │   │
			
 
				+│  └────────────────────────────────────────────────────────┼────────────┘   │
			
 
				+│                                                           │                 │
			
 
				+│                                                           ▼                 │
			
 
				+│  ┌─────────────────────────────────────────────────────────────────────┐   │
			
 
				+│  │                        Analysis Layer                                │   │
			
 
				+│  │  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │   │
			
 
				+│  │  │ CFG building    │───▶│ Dom. analysis   │───▶│ Fusion points   │  │   │
			
 
				+│  │  │ cfg_analyzer.py │    │ dominator_      │    │                 │  │   │
			
 
				+│  │  │                 │    │ analyzer.py     │    │                 │  │   │
			
 
				+│  │  └─────────────────┘    └─────────────────┘    └───────┬─────────┘  │   │
			
 
				+│  └────────────────────────────────────────────────────────┼────────────┘   │
			
 
				+│                                                           │                 │
			
 
				+│                                                           ▼                 │
			
 
				+│  ┌─────────────────────────────────────────────────────────────────────┐   │
			
 
				+│  │                       Splitting Layer                                │   │
			
 
				+│  │  ┌─────────────────────────────────────────────────────────────┐    │   │
			
 
				+│  │  │                     LLM Code Splitter                        │    │   │
			
 
				+│  │  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │    │   │
			
 
				+│  │  │  │ Build prompt│───▶│ LLM call    │───▶│ Parse result│      │    │   │
			
 
				+│  │  │  │             │    │ (Qwen API)  │    │             │      │    │   │
			
 
				+│  │  │  └─────────────┘    └─────────────┘    └──────┬──────┘      │    │   │
			
 
				+│  │  └──────────────────────────────────────────────┼──────────────┘    │   │
			
 
				+│  └─────────────────────────────────────────────────┼───────────────────┘   │
			
 
				+│                                                    │                        │
			
 
				+│                                                    ▼                        │
			
 
				+│  ┌─────────────────────────────────────────────────────────────────────┐   │
			
 
				+│  │                        Fusion Layer                                  │   │
			
 
				+│  │  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │   │
			
 
				+│  │  │ State generation│───▶│ Code insertion  │───▶│ Code generation │  │   │
			
 
				+│  │  │ (Global/Param)  │    │ code_fusion.py  │    │  main.py       │  │   │
			
 
				+│  │  └─────────────────┘    └─────────────────┘    └───────┬─────────┘  │   │
			
 
				+│  └────────────────────────────────────────────────────────┼────────────┘   │
			
 
				+│                                                           │                 │
			
 
				+│                                                           ▼                 │
			
 
				+│  ┌─────────────────┐                                                        │
			
 
				+│  │  Output Layer   │                                                        │
			
 
				+│  │  ┌───────────┐  │                                                        │
			
 
				+│  │  │ Fused code│  │                                                        │
				+│  │  │ (.c files)│  │                                                        │
			
 
				+│  │  └───────────┘  │                                                        │
			
 
				+│  └─────────────────┘                                                        │
			
 
				+│                                                                             │
			
 
				+└─────────────────────────────────────────────────────────────────────────────┘
			
 
				+```
			
 
				+
			
 
				+### 3.2 Call Chain Analysis
			
 
				+
			
 
				+#### 3.2.1 Extracting Function Call Relations
			
 
				+
			
 
				+Function call relations are extracted from the code to build the call graph $G_{call} = (V_{func}, E_{call})$:
			
 
				+
			
 
				+$$
			
 
				+(f_i, f_j) \in E_{call} \Leftrightarrow \text{the body of } f_i \text{ contains a call to } f_j
			
 
				+$$
			
 
				+
			
 
				+Call relations are extracted by regular-expression matching:
			
 
				+
			
 
				+$$
			
 
				+\text{Callees}(f) = \{g \mid \exists \text{ pattern } ``g\text{(}'' \in \text{Body}(f)\}
			
 
				+$$
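A regular expression of the form "identifier followed by an opening parenthesis" is enough to illustrate the idea. This is a minimal sketch, not the code in `extract_call_relations.py`: the names `CALL_RE`, `C_KEYWORDS`, and `callees` are hypothetical, and the pattern has known limits (it also matches inside string literals and macros, and it misses calls through function pointers).

```python
import re
from typing import Set

# An identifier, optional whitespace, then an opening parenthesis.
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")
# Keywords that look like calls but are not (assumed, non-exhaustive list).
C_KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}


def callees(body: str, known_functions: Set[str]) -> Set[str]:
    """Return the known functions that appear to be called in a C body."""
    names = set(CALL_RE.findall(body))
    # Restricting to known functions drops keywords and library calls.
    return (names - C_KEYWORDS) & known_functions


body = "if (x) { helper(x); } return parse(buf) + sizeof(int);"
found = callees(body, {"helper", "parse", "unused"})
```

Restricting the result to the functions present in the dataset keeps the call graph closed over $V_{func}$.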
			
 
				+
			
 
				+#### 3.2.2 Call Chain Depth
			
 
				+
			
 
				+Define the call chain depth function $d: V_{func} \times V_{func} \to \mathbb{N} \cup \{\infty\}$:
			
 
				+
			
 
				+$$
			
 
				+d(f_i, f_j) = \begin{cases}
			
 
				+0 & \text{if } f_i = f_j \\
			
 
				+1 + \min_{f_k \in \text{Callees}(f_i)} d(f_k, f_j) & \text{if } f_i \neq f_j \land f_i \leadsto f_j \\
			
 
				+\infty & \text{otherwise}
			
 
				+\end{cases}
			
 
				+$$
			
 
				+
			
 
				+Maximum call chain depth (pairs with $d = \infty$ are excluded):
				+
				+$$
				+D_{max}(G_{call}) = \max_{f_i, f_j \in V_{func},\ f_i \leadsto f_j} d(f_i, f_j)
			
 
				+$$
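Because $d$ is defined with a minimum over callees, it is the shortest-path distance in the call graph and can be computed by breadth-first search. The sketch below is illustrative; the function names `call_depths` and `max_depth` and the adjacency-map encoding are assumptions.

```python
from collections import deque
from typing import Dict, List


def call_depths(callees: Dict[str, List[str]], src: str) -> Dict[str, int]:
    """Shortest call distance from src to every function reachable from it."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        f = queue.popleft()
        for g in callees.get(f, ()):
            if g not in dist:          # first visit gives the minimum depth
                dist[g] = dist[f] + 1
                queue.append(g)
    return dist


def max_depth(callees: Dict[str, List[str]]) -> int:
    """Largest finite d(f_i, f_j) over all reachable pairs."""
    funcs = set(callees) | {g for gs in callees.values() for g in gs}
    return max(max(call_depths(callees, f).values()) for f in funcs)


graph = {"a": ["b", "c"], "b": ["c"], "c": ["d"]}
depths_from_a = call_depths(graph, "a")
deepest = max_depth(graph)
```

Note that `a` reaches `d` through the three-call path `a -> b -> c -> d`, but the depth is 2 because the definition takes the shortest path `a -> c -> d`.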
			
 
				+
			
 
				+#### 3.2.3 Call Chain Grouping
			
 
				+
			
 
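				+Functions related by calls are grouped with the Union-Find algorithm. Let $\sim$ be the equivalence relation generated by the call relation, that is, connectivity in $G_{call}$ with edge directions ignored:
				+
				+$$
				+f_i \sim f_j \Leftrightarrow f_i \text{ and } f_j \text{ are connected in the undirected version of } G_{call}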
			
 
				+$$
			
 
				+
			
 
				+The grouping $\mathcal{G}$ is then the set of equivalence classes:
			
 
				+
			
 
				+$$
			
 
				+\mathcal{G} = V_{func} / \sim = \{[f]_\sim \mid f \in V_{func}\}
			
 
				+$$
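A Union-Find pass over the call edges yields these groups. The sketch below is a minimal illustration with path compression only; the function name `group_functions` and the edge-list input are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def group_functions(functions: Iterable[str],
                    edges: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group functions that are connected by call edges (direction ignored)."""
    parent: Dict[str, str] = {f: f for f in functions}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression (halving)
            x = parent[x]
        return x

    for caller, callee in edges:
        parent[find(caller)] = find(callee)  # union the two groups

    groups: Dict[str, List[str]] = {}
    for f in parent:
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(g) for g in groups.values())


functions = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "c"), ("d", "e")]
groups = group_functions(functions, edges)
```

Functions `a` and `b` end up in the same group even though neither calls the other, because both call `c`.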
			
 
				+
			
 
				+### 3.3 Code Splitting Algorithm
			
 
				+
			
 
				+#### 3.3.1 Problem Modeling
			
 
				+
			
 
				+Code splitting can be modeled as a constraint satisfaction problem (CSP):
			
 
				+
			
 
				+$$
			
 
				+\text{CSP}_{split} = (X, D, C)
			
 
				+$$
			
 
				+
			
 
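				+where:
				+- **Variables** $X = \{x_1, x_2, \ldots, x_n\}$: each variable represents one code slice
				+- **Domain** $D$: each variable ranges over subsets of the statements of the original code
				+- **Constraints** $C$: completeness, dependency, and balance constraints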
			
 
				+
			
 
				+**Constraint C1 (completeness)**:
			
 
				+
			
 
				+$$
			
 
				+\bigcup_{i=1}^{n} x_i = \text{Stmts}(C_{target})
			
 
				+$$
			
 
				+
			
 
				+**Constraint C2 (no overlap)**:
			
 
				+
			
 
				+$$
			
 
				+\forall i \neq j: x_i \cap x_j = \emptyset
			
 
				+$$
			
 
				+
			
 
				+**Constraint C3 (dependency preservation)**:
			
 
				+
			
 
				+$$
			
 
				+\forall s_a \xrightarrow{dep} s_b: (\text{Index}(s_a) \leq \text{Index}(s_b))
			
 
				+$$
			
 
				+
			
 
				+where $\text{Index}(s)$ returns the index of the slice that contains statement $s$.
			
 
				+
			
 
				+#### 3.3.2 LLM-Assisted Splitting
			
 
				+
			
 
				+A large language model performs semantics-aware code splitting. Model the LLM as a function $\mathcal{L}$:
			
 
				+
			
 
				+$$
			
 
				+\mathcal{L}: (\text{Prompt}, \text{Context}) \rightarrow \text{Response}
			
 
				+$$
			
 
				+
			
 
				+Prompt template construction:
			
 
				+
			
 
				+$$
			
 
				+\text{Prompt} = \text{Template}(C_{target}, n, \mathcal{F}, M, \text{Examples})
			
 
				+$$
			
 
				+
			
 
				+where:
				+- $C_{target}$: the target code
				+- $n$: the number of slices
				+- $\mathcal{F}$: the list of function names on the call chain
				+- $M \in \{\text{global}, \text{parameter}\}$: the state passing method
				+- $\text{Examples}$: few-shot examples
			
 
				+
			
 
				+Parsing the LLM output:
			
 
				+
			
 
				+$$
			
 
				+\text{Parse}: \text{JSON} \rightarrow (\{c_i\}_{i=1}^n, \mathcal{S}, \text{Decl})
			
 
				+$$
			
 
				+
			
 
				+where $\mathcal{S}$ is the set of shared state variables and $\text{Decl}$ is the declaration code.
			
 
				+
			
 
				+#### 3.3.3 Fallback Mechanism
			
 
				+
			
 
				+When the LLM call fails, a heuristic split is used instead:
			
 
				+
			
 
				+**Algorithm 3.1: Heuristic code splitting**
			
 
				+
			
 
				+```
			
 
				+Input: code C, number of slices n
				+Output: list of code slices {c_1, ..., c_n}
			
 
				+
			
 
				+1:  stmts ← ParseStatements(C)
			
 
				+2:  k ← |stmts|
			
 
				+3:  if k < n then
			
 
				+4:      // pad with empty slices
			
 
				+5:      for i = 1 to k do
			
 
				+6:          c_i ← stmts[i]
			
 
				+7:      for i = k+1 to n do
			
 
				+8:          c_i ← "// empty"
			
 
				+9:  else
			
 
				+10:     // split evenly
			
 
				+11:     chunk_size ← ⌊k / n⌋
			
 
				+12:     for i = 1 to n do
			
 
				+13:         start ← (i-1) × chunk_size + 1
			
 
				+14:         end ← min(i × chunk_size, k) if i < n else k
			
 
				+15:         c_i ← Join(stmts[start:end])
			
 
				+16: return {c_1, ..., c_n}
			
 
				+```
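Algorithm 3.1 can be sketched in a few lines of Python. This is an illustration under assumptions: the input is already a list of statement strings (the `ParseStatements` step is skipped), and the function name `fallback_split` is hypothetical.

```python
from typing import List


def fallback_split(stmts: List[str], n: int) -> List[str]:
    """Split statements into n slices without any semantic analysis."""
    k = len(stmts)
    if k < n:
        # Fewer statements than slices: one statement each, then padding.
        return list(stmts) + ["// empty"] * (n - k)
    chunk = k // n
    slices = []
    for i in range(n):
        start = i * chunk
        # The last slice absorbs the remainder when k is not a multiple of n.
        end = (i + 1) * chunk if i < n - 1 else k
        slices.append("\n".join(stmts[start:end]))
    return slices


stmts = ["int a = 1;", "int b = 2;", "a += b;", "b *= a;", 'printf("%d", b);']
two_slices = fallback_split(stmts, 2)
seven_slices = fallback_split(stmts, 7)
```

Splitting in source order keeps every statement in the same or a later slice than the statements it depends on, so the dependency constraint holds as long as the original code was valid.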
			
 
				+
			
 
				+### 3.4 State Passing Methods
			
 
				+
			
 
				+#### 3.4.1 Global Variable Method
			
 
				+
			
 
				+**Definition 3.1 (global state space)**: let the set of shared variables be $\mathcal{S} = \{s_1, s_2, \ldots, s_k\}$; the global state space is:
			
 
				+
			
 
				+$$
			
 
				+\mathcal{G} = \{g_i = \text{global}(s_i) \mid s_i \in \mathcal{S}\}
			
 
				+$$
			
 
				+
			
 
				+Variable renaming map $\rho_{global}: \mathcal{S} \to \mathcal{G}$:
			
 
				+
			
 
				+$$
			
 
				+\rho_{global}(s_i) = g\_s_i \quad (\text{prefix } g\_ \text{ added})
			
 
				+$$
			
 
				+
			
 
				+**Global declaration generation**:
			
 
				+
			
 
				+$$
			
 
				+\text{Decl}_{global} = \bigcup_{s_i \in \mathcal{S}} \text{``static } T_i\ g\_s_i\text{;''}
			
 
				+$$
			
 
				+
			
 
				+where $T_i$ is the type of $s_i$.
			
 
				+
			
 
				+**Code transformation**:
			
 
				+
			
 
				+$$
			
 
				+c_i' = c_i[s_j \mapsto g\_s_j,\ \forall s_j \in \mathcal{S}]
			
 
				+$$
			
 
				+
			
 
				+**Formal semantics**:
			
 
				+
			
 
				+Let $\sigma_G$ be the global state and $\sigma_L$ the local state; then:
			
 
				+
			
 
				+$$
			
 
				+\llbracket c_i' \rrbracket(\sigma_G, \sigma_L) = \llbracket c_i \rrbracket(\sigma_G \cup \sigma_L)
			
 
				+$$
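The renaming map and the declaration generation can be illustrated with a single regular-expression pass. This is a hedged sketch, not the project's implementation: the function names `rename_shared` and `global_declarations` are hypothetical, shared variables are given as a name-to-type mapping, and the textual rewrite would also touch identifiers inside string literals.

```python
import re
from typing import Dict


def rename_shared(code: str, shared: Dict[str, str], prefix: str = "g_") -> str:
    """Apply the renaming s_j -> g_s_j to whole-word occurrences."""
    if not shared:
        return code
    # One alternation pattern so each identifier is rewritten exactly once.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, shared)) + r")\b")
    return pattern.sub(lambda m: prefix + m.group(1), code)


def global_declarations(shared: Dict[str, str], prefix: str = "g_") -> str:
    """Emit one static declaration per shared variable."""
    return "\n".join(f"static {ctype} {prefix}{name};"
                     for name, ctype in shared.items())


shared = {"sum": "int", "i": "int"}
slice_code = 'sum += i; printf("%d", sum);'
renamed = rename_shared(slice_code, shared)
decls = global_declarations(shared)
```

The word-boundary anchors keep `printf` intact even though it contains the letter `i`.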
			
 
				+
			
 
				+#### 3.4.2 Parameter Passing Method
			
 
				+
			
 
				+**Definition 3.2 (state struct)**: define the struct type $\Sigma$:
			
 
				+
			
 
				+$$
			
 
				+\Sigma = \text{struct FusionState} \{T_1\ s_1;\ T_2\ s_2;\ \ldots;\ T_k\ s_k;\}
			
 
				+$$
			
 
				+
			
 
				+**Function signature transformation**:
			
 
				+
			
 
				+$$
			
 
				+f_i: (A_1, \ldots, A_m) \to R \quad \Longrightarrow \quad f_i': (A_1, \ldots, A_m, \Sigma^*\ state) \to R
			
 
				+$$
			
 
				+
			
 
				+**Variable access transformation**:
			
 
				+
			
 
				+$$
			
 
				+\rho_{param}(s_j) = state \to s_j
			
 
				+$$
			
 
				+
			
 
				+**Code transformation**:
			
 
				+
			
 
				+$$
			
 
				+c_i' = c_i[s_j \mapsto state \to s_j,\ \forall s_j \in \mathcal{S}]
			
 
				+$$
			
 
				+
			
 
				+**Function call transformation**:
			
 
				+
			
 
				+$$
			
 
				+\text{Call}(f_{i+1}, args) \Longrightarrow \text{Call}(f_{i+1}', args, state)
			
 
				+$$
			
 
				+
			
 
				+**Initialization code**:
			
 
				+
			
 
				+```c
			
 
				+/* Needs <string.h> for memset; assumes FusionState is a typedef of struct FusionState. */
				+FusionState state_data;
			
 
				+memset(&state_data, 0, sizeof(state_data));
			
 
				+FusionState* state = &state_data;
			
 
				+```
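The struct definition and the access rewriting for the parameter passing method can be generated in the same style. The sketch below is illustrative only: the function names are hypothetical, shared variables are given as a name-to-type mapping, and a `typedef` is emitted so that the bare type name `FusionState` used in the initialization code is valid C.

```python
import re
from typing import Dict


def struct_definition(shared: Dict[str, str]) -> str:
    """Emit the FusionState struct with one field per shared variable."""
    fields = " ".join(f"{ctype} {name};" for name, ctype in shared.items())
    return "typedef struct FusionState { " + fields + " } FusionState;"


def to_state_access(code: str, shared: Dict[str, str],
                    state: str = "state") -> str:
    """Rewrite each shared variable s_j as state->s_j."""
    if not shared:
        return code
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, shared)) + r")\b")
    return pattern.sub(lambda m: f"{state}->{m.group(1)}", code)


shared = {"sum": "int", "i": "int"}
definition = struct_definition(shared)
rewritten = to_state_access("sum += i;", shared)
```

The rewrite is meant to be applied once per slice; applying it again would also rewrite the member names after `->`.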
			
 
				+
			
 
				+#### 3.4.3 Comparison of the Two Methods
			
 
				+
			
 
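				+| Property | Global variable method | Parameter passing method |
				+|------|-----------|-----------|
				+| Implementation complexity | $O(k)$ | $O(k + n)$ |
				+| Modifies function signatures | No | Yes |
				+| Thread-safe | ❌ | ✅ |
				+| Reentrant | ❌ | ✅ |
				+| Side effects on global state | Yes | No |
				+| Suitable for | Single-threaded code | Multithreaded code / library functions |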
			
 
				+
			
 
				+Formal comparison:
			
 
				+
			
 
				+$$
			
 
				+\text{Overhead}_{global} = O(1) \quad \text{vs} \quad \text{Overhead}_{param} = O(n \cdot \text{sizeof}(\Sigma^*))
			
 
				+$$
			
 
				+
			
 
				+### 3.5 Fusion Algorithm
			
 
				+
			
 
				+#### 3.5.1 Complete Algorithm
			
 
				+
			
 
				+**Algorithm 3.2: CodeFusion main algorithm**
			
 
				+
			
 
				+```
			
 
				+Input:
				+  - target code C_target
				+  - call chain function set F = {f_1, ..., f_n}
				+  - passing method M ∈ {global, parameter}
				+Output: fused function set F' = {f_1', ..., f_n'}
			
 
				+
			
 
				+Phase 1: Analysis
			
 
				+1:  for i = 1 to n do
			
 
				+2:      G_i ← BuildCFG(f_i)
			
 
				+3:      Dom_i ← ComputeDominators(G_i)
			
 
				+4:      C_i ← FindCriticalPoints(G_i, Dom_i)
			
 
				+5:      P_i ← FilterFusionPoints(C_i)
			
 
				+6:  end for
			
 
				+
			
 
				+Phase 2: Splitting
			
 
				+7:  (slices, S, decl) ← LLM_Split(C_target, n, F, M)
			
 
				+8:  if slices = ∅ then
			
 
				+9:      slices ← FallbackSplit(C_target, n, M)
			
 
				+10: end if
			
 
				+
			
 
				+Phase 3: State generation
			
 
				+11: if M = global then
			
 
				+12:     state_code ← GenerateGlobalDeclarations(S)
			
 
				+13: else
			
 
				+14:     state_code ← GenerateStructDefinition(S)
			
 
				+15: end if
			
 
				+
			
 
				+Phase 4: Fusion
			
 
				+16: for i = 1 to n do
			
 
				+17:     p_i ← SelectBestFusionPoint(P_i)
			
 
				+18:     c_i ← slices[i]
			
 
				+19:     if M = parameter then
			
 
				+20:         c_i ← TransformToParameterAccess(c_i, S)
			
 
				+21:     end if
			
 
				+22:     f_i' ← InsertCodeAtPoint(f_i, p_i, c_i)
			
 
				+23: end for
			
 
				+
			
 
				+Phase 5: Output
			
 
				+24: output ← CombineCode(state_code, F')
			
 
				+25: return output
			
 
				+```
			
 
				+
			
 
				+#### 3.5.2 Complexity Analysis
			
 
				+
			
 
				+Let $n$ be the call chain length, $m$ the average function size (number of basic blocks), and $k$ the number of shared variables:
			
 
				+
			
 
				+| Phase | Time complexity | Space complexity |
				+|------|-----------|-----------|
				+| CFG construction | $O(n \cdot m)$ | $O(n \cdot m)$ |
				+| Dominance analysis | $O(n \cdot m^2)$ | $O(n \cdot m^2)$ |
				+| LLM splitting | $O(T_{LLM})$ | $O(|C_{target}|)$ |
				+| State generation | $O(k)$ | $O(k)$ |
				+| Code fusion | $O(n \cdot m)$ | $O(n \cdot m)$ |
				+| **Total** | $O(n \cdot m^2 + T_{LLM})$ | $O(n \cdot m^2)$ |
			
 
				+
			
 
				+where $T_{LLM}$ is the latency of the LLM API call.
			
 
				+
			
 
				+#### 3.5.3 Correctness Proof
			
 
				+
			
 
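				+**Theorem 3.1 (semantic equivalence)**: if Algorithm 3.2 succeeds, the fused program is semantically equivalent to the original program followed by the target code.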
			
 
				+
			
 
				+**Proof**:
			
 
				+
			
 
				+Let the initial program state be $\sigma_0$. We need to show:
				+
				+$$
				+\llbracket f_1' \rrbracket(\sigma_0) = \llbracket C_{target} \rrbracket(\llbracket f_1 \rrbracket(\sigma_0))
			
 
				+$$
			
 
				+
			
 
				+Because the split satisfies the completeness constraint:
			
 
				+
			
 
				+$$
			
 
				+\bigcup_{i=1}^{n} c_i \equiv C_{target}
			
 
				+$$
			
 
				+
			
 
				+and each $c_i$ is inserted before $f_i$ calls $f_{i+1}$ (guaranteed by the fusion point property), executing $f_1'$ proceeds as follows:
			
 
				+
			
 
				+1. Execute $c_1$
				+2. Call $f_2'$, which executes $c_2$
				+3. ...
				+4. Call $f_n'$, which executes $c_n$
			
 
				+
			
 
				+By the dependency constraint, this is equivalent to executing $c_1; c_2; \ldots; c_n$ in sequence, that is, $C_{target}$.
			
 
				+
			
 
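				+Correctness of state passing follows from $\rho_{global}$ and $\rho_{param}$ being bijective renamings, assuming the renamed variables do not collide with names already used in $f_1, \ldots, f_n$. $\square$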
			
 
				+
			
 
				+---
			
 
				+
			
 
				+## 4. Implementation Details
			
 
				+
			
 
				+### 4.1 Project Structure
			
 
				+
			
 
				+```
			
 
				+Vul/
			
 
				+├── README.md                      # project documentation
				+├── requirements.txt               # dependency list
				+│
				+├── data/                          # dataset directory
				+│   ├── primevul_train.jsonl       # training set (raw vulnerability data)
				+│   ├── primevul_train_paired.jsonl
				+│   ├── primevul_valid.jsonl       # validation set
				+│   ├── primevul_valid_paired.jsonl
				+│   ├── primevul_test.jsonl        # test set
				+│   └── primevul_test_paired.jsonl
				+│
				+├── utils/                         # utility modules
				+│   └── data_process/              # data processing tools
				+│       ├── extract_call_relations.py   # call relation extraction
				+│       └── filter_by_call_depth.py     # call depth filtering
				+│
				+├── src/                           # core source code
				+│   ├── __init__.py               # package initialization
				+│   ├── cfg_analyzer.py           # CFG analyzer
				+│   ├── dominator_analyzer.py     # dominator analyzer
				+│   ├── llm_splitter.py           # LLM code splitter
				+│   ├── code_fusion.py            # code fusion engine
				+│   └── main.py                   # main entry point
				+│
				+├── output/                        # output directory
				+│   ├── fused_code/               # fused code files
				+│   │   ├── all_fused_code.c      # combined file
				+│   │   └── fused_group_*.c       # fused code per group
				+│   ├── primevul_valid_grouped.json
				+│   ├── primevul_valid_grouped_depth_*.json
				+│   └── fusion_results.json
				+│
				+└── SliceFusion/                   # reference project (C++ LLVM implementation)
			
 
				+    └── src/
			
 
				+        ├── Fusion/
			
 
				+        └── Util/
			
 
				+```
			
 
				+
			
 
				+### 4.2 Core Modules
			
 
				+
			
 
				+#### 4.2.1 CFG Analyzer (`cfg_analyzer.py`)
			
 
				+
			
 
				+**Main classes**:
			
 
				+
			
 
				+```python
			
 
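				+from dataclasses import dataclass
				+from typing import Dict, List, Optional, Tuple
				+
				+@dataclass
				+class BasicBlock:
				+    id: int                    # basic block ID
				+    name: str                  # basic block name
				+    statements: List[str]      # list of statements
				+    start_line: int            # first line number
				+    end_line: int              # last line number
				+    is_entry: bool             # whether this is the entry block
				+    is_exit: bool              # whether this is an exit block
				+
				+@dataclass
				+class ControlFlowGraph:
				+    function_name: str                    # function name
				+    blocks: Dict[int, BasicBlock]         # basic blocks keyed by ID
				+    edges: List[Tuple[int, int]]          # list of edges
				+    entry_block_id: Optional[int]         # entry block ID
				+    exit_block_ids: List[int]             # exit block IDs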
			
 
				+```
			
 
				+
			
 
				+**Key methods**:
			
 
				+
			
 
				+| Method | Purpose | Complexity |
				+|------|------|--------|
				+| `_remove_comments()` | Strip code comments | $O(n)$ |
				+| `_extract_function_body()` | Extract the function body | $O(n)$ |
				+| `_tokenize_statements()` | Split code into statements | $O(n)$ |
				+| `_is_control_statement()` | Test for a control statement | $O(1)$ |
				+| `_build_basic_blocks()` | Build basic blocks | $O(n)$ |
				+| `_build_edges()` | Build control flow edges | $O(m)$ |
			
 
				+
			
 
				+#### 4.2.2 Dominator Analyzer (`dominator_analyzer.py`)
			
 
				+
			
 
				+**Implementation of the data-flow equation**:
			
 
				+
			
 
				+```python
			
 
				+def compute_dominators(self) -> Dict[int, Set[int]]:
			
 
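				+    # Initialization: every non-entry node starts with the full node set.
				+    # all_nodes and entry are assumed to come from the ControlFlowGraph fields.
				+    all_nodes = set(self.cfg.blocks.keys())
				+    entry = self.cfg.entry_block_id
				+    dominators = {node: all_nodes.copy() for node in all_nodes}
				+    dominators[entry] = {entry}
				+    
				+    # Iterate until a fixed point is reached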
			
 
				+    changed = True
			
 
				+    while changed:
			
 
				+        changed = False
			
 
				+        for node in all_nodes:
			
 
				+            if node == entry:
			
 
				+                continue
			
 
				+            # Dom(n) = {n} ∪ (∩ Dom(p) for p in pred(n))
			
 
				+            new_dom = all_nodes.copy()
			
 
				+            for pred in self.cfg.get_predecessors(node):
			
 
				+                new_dom &= dominators[pred]
			
 
				+            new_dom.add(node)
			
 
				+            
			
 
				+            if new_dom != dominators[node]:
			
 
				+                dominators[node] = new_dom
			
 
				+                changed = True
			
 
				+    
			
 
				+    return dominators
			
 
				+```
			
 
				+
			
 
				+#### 4.2.3 LLM Splitter (`llm_splitter.py`)
			
 
				+
			
 
				+**API 配置**：
			
 
				+
			
 
				+```python
			
 
				+import os
				+
				+from openai import OpenAI
				+
				+client = OpenAI(
				+    api_key=os.getenv("DASHSCOPE_API_KEY"),
				+    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
				+)
				+model = "qwen-plus"  # or qwen-turbo, qwen-max
			
 
				+```
			
 
				+
			
 
				+**Key part of the prompt template**:
			
 
				+
			
 
				+```
			
 
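				+[IMPORTANT] Each fragment runs in a different function, so local variables cannot be passed directly!
				+You must:
				+1. Declare variables shared across functions as global variables or struct members
				+2. Have the first fragment perform the initialization
				+3. Have later fragments use the shared state
				+4. Have the last fragment perform the final operation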
			
 
				+```
			
 
				+
			
 
				+#### 4.2.4 Code Fusion Engine (`code_fusion.py`)
				+
				+**Fusion plan data structure**:
			
 
				+
			
 
				+```python
			
 
				+@dataclass
			
 
				+class FusionPlan:
			
 
				+    target_code: str              # Target code
				+    call_chain: CallChain         # Call chain
				+    slice_result: SliceResult     # Splitting result
				+    insertion_points: List[Tuple[str, int, str]]  # Insertion points
			
 
				+```
			
 
				+
			
 
				+**Code insertion strategy**:
			
 
				+
			
 
				+$$
			
 
				+\text{InsertPosition}(f_i, p_i) = \begin{cases}
			
 
				+\text{AfterDeclarations} & \text{if } p_i = v_{entry} \\
			
 
				+\text{BeforeStatement}(p_i) & \text{otherwise}
			
 
				+\end{cases}
			
 
				+$$
			
 
				+
			
 
				+### 4.3 Environment Setup
				+
				+#### 4.3.1 Installing Dependencies
				+
				+```bash
				+# Create a virtual environment
				+conda create -n vul python=3.10
				+conda activate vul
				+
				+# Install dependencies
				+pip install openai networkx graphviz
				+```
				+
				+#### 4.3.2 API Configuration
				+
				+```bash
				+# Set the Alibaba Cloud DashScope API key
				+export DASHSCOPE_API_KEY="your-api-key-here"
				+```
			
 
				+
			
 
				+### 4.4 Usage
				+
				+#### 4.4.1 Data Preprocessing
				+
				+```bash
				+# Step 1: extract call relations
				+python utils/data_process/extract_call_relations.py \
				+    --input data/primevul_valid.jsonl \
				+    --output output/primevul_valid_grouped.json
				+
				+# Step 2: filter by call depth
				+python utils/data_process/filter_by_call_depth.py \
				+    --input output/primevul_valid_grouped.json \
				+    --depth 4
				+```
			
 
				+
			
 
				+#### 4.4.2 Code Fusion
				+
				+```bash
				+# Global-variable method
				+python src/main.py \
				+    --input output/primevul_valid_grouped_depth_4.json \
				+    --output output/fusion_results.json \
				+    --target-code "int secret = 42; int key = secret ^ 0xABCD; printf(\"key=%d\", key);" \
				+    --method global \
				+    --max-groups 5
				+
				+# Parameter-passing method
				+python src/main.py \
				+    --input output/primevul_valid_grouped_depth_4.json \
				+    --output output/fusion_results.json \
				+    --target-file my_code.c \
				+    --method parameter \
				+    --max-groups 10
				+```
				+
				+#### 4.4.3 Analysis-Only Mode
				+
				+```bash
				+python src/main.py \
				+    --input output/primevul_valid_grouped_depth_4.json \
				+    --analyze-only
				+```
			
 
				+
			
 
				+---
			
 
				+
			
 
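				+## 5. Experiments and Analysis
				+
				+### 5.1 Dataset
				+
				+This study uses the PrimeVul dataset, which contains real-world vulnerable code extracted from multiple open-source projects.
				+
				+**Dataset statistics**:
				+
				+| Metric | Value |
				+|---------|------|
				+| Total records | 25,430 |
				+| Functions successfully extracted | 24,465 |
				+| Projects covered | 218 |
				+| Total groups | 4,777 |
				+| Single-function groups (no call relations) | 3,646 (76.3%) |
				+| Groups with call relations | 1,131 (23.7%) |
				+| Maximum call-chain depth | 25 |
				+| Average call-chain depth | 2.68 |
				+
				+**Distribution across major projects**:
				+
				+| Project | Functions | Share |
				+|---------|---------|------|
				+| Linux Kernel | 7,120 | 28.0% |
				+| MySQL Server | 920 | 3.6% |
				+| HHVM | 911 | 3.6% |
				+| GPAC | 875 | 3.4% |
				+| TensorFlow | 656 | 2.6% |
				+| Other | 14,948 | 58.8% |
				+
				+**Language distribution**: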
			
 
				+
			
 
				+$$
			
 
				+P(\text{Language} = l) = \begin{cases}
			
 
				+0.815 & l = \text{C} \\
			
 
				+0.185 & l = \text{C++}
			
 
				+\end{cases}
			
 
				+$$
			
 
				+
			
 
				+### 5.2 Call-Depth Distribution
				+
				+Let $X$ be the random variable for call-chain depth over the set of groups $\mathcal{G}$. Its probability mass function is:
				+
				+$$
				+P(X = d) = \frac{|\{g \in \mathcal{G} : \text{depth}(g) = d\}|}{|\mathcal{G}|}
				+$$
				+
				+**Measured distribution**:
				+
				+| Depth $d$ | Groups | Probability $P(X=d)$ | Cumulative probability $F(d)$ |
				+|---------|------|--------------|----------------|
				+| 1 | 4,057 | 0.849 | 0.849 |
				+| 2 | 489 | 0.102 | 0.951 |
				+| 3 | 135 | 0.028 | 0.979 |
				+| 4 | 50 | 0.010 | 0.990 |
				+| 5 | 13 | 0.003 | 0.993 |
				+| 6 | 16 | 0.003 | 0.996 |
				+| 7+ | 17 | 0.004 | 1.000 |
				+
				+**Distribution characteristics**:
				+
				+- **Mode**: $\text{Mo}(X) = 1$
				+- **Mean**: $E[X] = \sum_d d \cdot P(X=d) \approx 1.24$
				+- **Variance**: $\text{Var}(X) = E[X^2] - (E[X])^2 \approx 0.89$
				+- **Skewness**: positive (right-skewed), with a long tail
				+
				+The distribution is approximately geometric:
			
 
				+
			
 
				+$$
			
 
				+P(X = d) \approx p(1-p)^{d-1}, \quad p \approx 0.85
			
 
				+$$
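
The moments and the geometric fit can be checked directly against the measured-distribution table. The sketch below is an illustration under one stated assumption: the `7+` bin is collapsed to depth 7, so the mean and variance it yields are lower bounds. With that binning the mean reproduces the quoted value of about 1.24, while the variance comes out near 0.51 rather than 0.89. The names `DEPTH_COUNTS`, `depth_moments`, and `geometric_pmf` are hypothetical.

```python
# Group counts per depth, copied from the measured-distribution table.
# Assumption: the "7+" bin is treated as depth 7, which gives lower bounds.
DEPTH_COUNTS = {1: 4057, 2: 489, 3: 135, 4: 50, 5: 13, 6: 16, 7: 17}


def depth_moments(counts):
    """Return (number of groups, mean depth, variance of depth)."""
    total = sum(counts.values())
    mean = sum(d * c for d, c in counts.items()) / total
    second_moment = sum(d * d * c for d, c in counts.items()) / total
    return total, mean, second_moment - mean ** 2


def geometric_pmf(d, p):
    """P(X = d) = p * (1 - p)^(d - 1) for d = 1, 2, ..."""
    return p * (1 - p) ** (d - 1)


TOTAL, MEAN, VARIANCE = depth_moments(DEPTH_COUNTS)

# Side-by-side comparison of empirical and geometric probabilities.
COMPARISON = {
    d: (c / TOTAL, geometric_pmf(d, 0.85)) for d, c in DEPTH_COUNTS.items()
}
```

With $p = 0.85$ the geometric model gives about 0.128 at depth 2 and 0.019 at depth 3, against the measured 0.102 and 0.028, so the fit is approximate and understates the tail.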
			
 
				+
			
 
				+### 5.3 Fusion Evaluation
				+
				+#### 5.3.1 Fusion Success Rate
				+
				+The fusion success rate is defined as:
				+
				+$$
				+\text{SuccessRate} = \frac{|\{g : \text{Fusion}(g) = \text{Success}\}|}{|\mathcal{G}_{processed}|}
				+$$
				+
				+**Results**:
				+
				+| Configuration | Groups processed | Successes | Success rate |
				+|------|---------|--------|--------|
				+| Global-variable method | 50 | 50 | 100% |
				+| Parameter-passing method | 50 | 50 | 100% |
				+| LLM split succeeded | 50 | 48 | 96% |
				+| Fallback split | 50 | 2 | 4% |
			
 
				+
			
 
				+#### 5.3.2 Code Bloat
				+
				+The code bloat rate of a function is defined as:
				+
				+$$
				+\text{Bloat}(f_i) = \frac{|\text{LOC}(f_i')| - |\text{LOC}(f_i)|}{|\text{LOC}(f_i)|}
				+$$
				+
				+Average bloat:
				+
				+$$
				+\overline{\text{Bloat}} = \frac{1}{n} \sum_{i=1}^{n} \text{Bloat}(f_i) \approx 0.15
				+$$
				+
				+That is, the line count grows by about 15% on average.
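
As a small worked example of the two formulas above, the sketch below computes per-function bloat and its average. The line counts are hypothetical and chosen only for illustration; they are not measurements from the experiments.

```python
from typing import List, Tuple


def bloat(loc_before: int, loc_after: int) -> float:
    """Relative growth in lines of code for one function."""
    if loc_before <= 0:
        raise ValueError("loc_before must be positive")
    return (loc_after - loc_before) / loc_before


def mean_bloat(pairs: List[Tuple[int, int]]) -> float:
    """Average of the per-function bloat rates (not the ratio of totals)."""
    return sum(bloat(before, after) for before, after in pairs) / len(pairs)


# Hypothetical (before, after) line counts for a four-function call chain
EXAMPLE = [(20, 23), (40, 42), (10, 12), (30, 33)]
AVG = mean_bloat(EXAMPLE)
```

Because the average is taken over per-function ratios, a short function that receives a fragment weighs as much as a long one; here the four rates are 0.15, 0.05, 0.20, and 0.10.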
			
 
				+
			
 
				+#### 5.3.3 Fusion Example
				+
				+**Input target code** (format-string vulnerability):
			
 
				+
			
 
				+```c
			
 
				+void vulnerable_function(char *input) {
			
 
				+    char buffer[256];
			
 
				+    printf(input);  // vulnerable call
			
 
				+    strncpy(buffer, input, sizeof(buffer) - 1);
			
 
				+    buffer[sizeof(buffer) - 1] = '\0';
			
 
				+    printf("\nInput processed: %s\n", buffer);
			
 
				+}
			
 
				+
			
 
				+int test() {
			
 
				+    char malicious_input[] = "Hello World! %x %x %x %x\n"; 
			
 
				+    vulnerable_function(malicious_input);
			
 
				+    return 0;
			
 
				+}
			
 
				+```
			
 
				+
			
 
				+**Code distribution after fusion** (parameter-passing method, call-chain depth = 4):
			
 
				+
			
 
				+```
			
 
				+┌─────────────────────────────────────────────────────────────────┐
			
 
				+│  typedef struct {                                                │
			
 
				+│      char buffer[256];                                          │
			
 
				+│      char* input;                                               │
			
 
				+│      char malicious_input[256];                                 │
			
 
				+│  } FusionState;                                                 │
			
 
				+└─────────────────────────────────────────────────────────────────┘
			
 
				+                              │
			
 
				+                              ▼
			
 
				+┌─────────────────────────────────────────────────────────────────┐
			
 
				+│  crypto_get_certificate_data() [outermost]                      │
			
 
				+│  ┌─────────────────────────────────────────────────────────────┐│
			
 
				+│  │ /* Fused Code */                                            ││
			
 
				+│  │ strcpy(state->malicious_input, "Hello World! %x...");       ││
			
 
				+│  │ state->input = state->malicious_input;                      ││
			
 
				+│  └─────────────────────────────────────────────────────────────┘│
			
 
				+│  ... original code ...                                          │
			
 
				+│  crypto_cert_fingerprint(xcert);  ──────────────────────────┐   │
			
 
				+└─────────────────────────────────────────────────────────────│───┘
			
 
				+                                                              │
			
 
				+                              ┌────────────────────────────────┘
			
 
				+                              ▼
			
 
				+┌─────────────────────────────────────────────────────────────────┐
			
 
				+│  crypto_cert_fingerprint() [second level]                       │
			
 
				+│  ┌─────────────────────────────────────────────────────────────┐│
			
 
				+│  │ /* Fused Code */                                            ││
			
 
				+│  │ printf(state->input);  // vulnerability trigger             ││
			
 
				+│  └─────────────────────────────────────────────────────────────┘│
			
 
				+│  ... original code ...                                          │
			
 
				+│  crypto_cert_fingerprint_by_hash(xcert, "sha256");  ────────┐   │
			
 
				+└─────────────────────────────────────────────────────────────│───┘
			
 
				+                                                              │
			
 
				+                              ┌────────────────────────────────┘
			
 
				+                              ▼
			
 
				+┌─────────────────────────────────────────────────────────────────┐
			
 
				+│  crypto_cert_fingerprint_by_hash() [third level]                │
			
 
				+│  ┌─────────────────────────────────────────────────────────────┐│
			
 
				+│  │ /* Fused Code */                                            ││
			
 
				+│  │ strncpy(state->buffer, state->input, 255);                  ││
			
 
				+│  │ state->buffer[255] = '\0';                                  ││
			
 
				+│  └─────────────────────────────────────────────────────────────┘│
			
 
				+│  ... original code ...                                          │
			
 
				+│  crypto_cert_hash(xcert, hash, &fp_len);  ──────────────────┐   │
			
 
				+└─────────────────────────────────────────────────────────────│───┘
			
 
				+                                                              │
			
 
				+                              ┌────────────────────────────────┘
			
 
				+                              ▼
			
 
				+┌─────────────────────────────────────────────────────────────────┐
			
 
				+│  crypto_cert_hash() [innermost]                                 │
			
 
				+│  ┌─────────────────────────────────────────────────────────────┐│
			
 
				+│  │ /* Fused Code */                                            ││
			
 
				+│  │ printf("\nInput processed: %s\n", state->buffer);           ││
			
 
				+│  └─────────────────────────────────────────────────────────────┘│
			
 
				+│  ... original code ...                                          │
			
 
				+└─────────────────────────────────────────────────────────────────┘
			
 
				+```
			
 
				+
			
 
				+### 5.4 Performance Analysis
				+
				+#### 5.4.1 Processing Time
				+
				+Let $T$ be the total processing time, decomposed as:
				+
				+$$
				+T = T_{load} + T_{analyze} + T_{llm} + T_{fuse} + T_{output}
				+$$
				+
				+**Time per stage** (50 groups processed):
				+
				+| Stage | Time (s) | Share |
				+|------|---------|------|
				+| Data loading $T_{load}$ | 0.5 | 1.5% |
				+| CFG/dominator analysis $T_{analyze}$ | 2.3 | 6.9% |
				+| LLM calls $T_{llm}$ | 28.5 | 85.6% |
				+| Code fusion $T_{fuse}$ | 1.2 | 3.6% |
				+| File output $T_{output}$ | 0.8 | 2.4% |
				+| **Total** | **33.3** | **100%** |
				+
				+**LLM calls are the main bottleneck**, accounting for 85.6% of the total time.
				+
				+#### 5.4.2 Memory Usage
				+
				+Peak memory usage:
				+
				+$$
				+M_{peak} \approx M_{data} + M_{cfg} + M_{llm\_context}
				+$$
				+
				+Measured at about 150-200 MB (50 groups processed).
			
 
				+
			
 
				+---
			
 
				+
			
 
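				+## 6. Application Scenarios
				+
				+### 6.1 Code Obfuscation
				+
				+#### 6.1.1 Principle
				+
				+Sensitive code (such as license checks or encryption algorithms) is spread across several ordinary functions, which makes reverse engineering harder.
				+
				+**Measuring obfuscation strength**:
				+
				+Define the dispersion:
				+
				+$$
				+D(C_{target}, \mathcal{F}') = \frac{H(\text{Dist}(C_{target}, \mathcal{F}'))}{H_{max}}
				+$$
				+
				+where $H$ is the entropy function, $\text{Dist}$ is the distribution of the code across functions, and $H_{max} = \log n$ is the maximum entropy over $n$ functions.
				+
				+The higher the dispersion, the stronger the obfuscation:
				+
				+$$
				+D \to 1 \Rightarrow \text{code is spread evenly across all functions}
				+$$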
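
The dispersion measure can be sketched as a normalized Shannon entropy. This is an illustration under two assumptions that the text does not fix: $\text{Dist}$ is taken to be each function's share of the fragment lines, and $H_{max}$ is taken to be $\log n$. The function name `dispersion` and the fragment sizes are hypothetical.

```python
import math
from typing import Sequence


def dispersion(fragment_sizes: Sequence[int]) -> float:
    """Normalized Shannon entropy of fragment sizes across n functions."""
    n = len(fragment_sizes)
    total = sum(fragment_sizes)
    if n < 2 or total <= 0:
        return 0.0
    # Empty fragments contribute nothing (0 * log 0 is taken as 0).
    entropy = -sum(
        (s / total) * math.log(s / total) for s in fragment_sizes if s > 0
    )
    return entropy / math.log(n)


UNIFORM = dispersion([5, 5, 5, 5])        # evenly spread over four functions
SKEWED = dispersion([17, 1, 1, 1])        # most code sits in one function
CONCENTRATED = dispersion([20, 0, 0, 0])  # nothing was actually dispersed
```

An even split reaches the maximum of 1, while placing the whole target in a single function yields 0, which matches the reading of $D \to 1$ above.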
			
 
				+
			
 
				+#### 6.1.2 Example
				+
				+Original license-check code:
			
 
				+
			
 
				+```c
			
 
				+int check_license(char* key) {
			
 
				+    int hash = compute_hash(key);
			
 
				+    if (hash == VALID_HASH) {
			
 
				+        return AUTHORIZED;
			
 
				+    }
			
 
				+    return UNAUTHORIZED;
			
 
				+}
			
 
				+```
			
 
				+
			
 
				+After fusion, the logic is distributed across 4 functions:
			
 
				+
			
 
				+- $f_1$: `hash_part1 = key[0] ^ SALT1;`
			
 
				+- $f_2$: `hash_part2 = hash_part1 + key[1];`
			
 
				+- $f_3$: `hash = hash_part2 << 4;`
			
 
				+- $f_4$: `return (hash == VALID_HASH) ? 1 : 0;`
			
 
				+
			
 
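				+### 6.2 Software Watermarking
				+
				+#### 6.2.1 Principle
				+
				+The watermark is encoded, split into fragments, and embedded, for copyright protection and piracy tracing.
				+
				+**Watermark encoding**:
				+
				+Let the watermark be $W$, encoded as a bit string:
				+
				+$$
				+W \xrightarrow{\text{encode}} b_1 b_2 \ldots b_m
				+$$
				+
				+The bit string is mapped to code fragments, $k$ bits per fragment:
				+
				+$$
				+c_i = \text{CodeGen}(b_{(i-1)k+1}, \ldots, b_{ik})
				+$$
				+
				+**Extraction algorithm**:
				+
				+$$
				+\text{Extract}(\mathcal{F}') = \text{decode}\left(\bigcup_{i=1}^{n} \text{Parse}(c_i)\right)
				+$$
				+
				+#### 6.2.2 Robustness Analysis
				+
				+The watermark survives if at least $\tau$ fragments remain intact: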
			
 
				+
			
 
				+$$
			
 
				+P(\text{Survive}) = P\left(\sum_{i=1}^{n} \mathbf{1}_{c_i \text{ intact}} \geq \tau\right)
			
 
				+$$
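
To make the survival condition concrete, the sketch below evaluates it under an added assumption that the text does not state: each fragment stays intact independently with the same probability $q$, so the number of intact fragments is binomial. The function name `survival_probability` and the numeric values are hypothetical.

```python
from math import comb


def survival_probability(n: int, tau: int, q: float) -> float:
    """P(at least tau of n fragments intact), assuming independent fragments
    that each survive with probability q."""
    if not 0 <= tau <= n:
        raise ValueError("tau must be between 0 and n")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability")
    return sum(
        comb(n, k) * q ** k * (1 - q) ** (n - k) for k in range(tau, n + 1)
    )


P_ALL = survival_probability(4, 4, 0.9)    # every fragment must survive
P_THREE = survival_probability(4, 3, 0.9)  # one lost fragment is tolerated
```

Under this model, lowering $\tau$ from 4 to 3 for a four-fragment watermark raises the survival probability from about 0.66 to about 0.95, which is the usual argument for adding redundancy to the encoding.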
			
 
				+
			
 
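				+### 6.3 Security Testing
				+
				+#### 6.3.1 Principle
				+
				+Vulnerable code distributed across functions is generated to test the detection capability of static analysis tools.
				+
				+**Detection rate**: for a detection tool $T$ and a set $\mathcal{C}$ of vulnerable code samples,
				+
				+$$
				+\text{DetectionRate}(T, \mathcal{C}) = \frac{|\{C \in \mathcal{C} : T(C) = \text{Vulnerable}\}|}{|\mathcal{C}|}
				+$$
				+
				+**Hypothesis**: a good detection tool should satisfy:
				+
				+$$
				+\text{DetectionRate}(T, \mathcal{C}_{vuln}) \approx \text{DetectionRate}(T, \{\text{Fused}(C) : C \in \mathcal{C}_{vuln}\})
				+$$
				+
				+If the detection rate drops significantly after fusion, the tool has a blind spot.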
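
The comparison can be expressed as a small evaluation harness. The sketch below is illustrative only: the detector is a stand-in substring check rather than a real static analysis tool, and the sample strings, `detection_rate`, and `detection_drop` are hypothetical names introduced here.

```python
from typing import Callable, Iterable


def detection_rate(detector: Callable[[str], bool], samples: Iterable[str]) -> float:
    """Fraction of known-vulnerable samples that the detector flags."""
    items = list(samples)
    if not items:
        raise ValueError("samples must be non-empty")
    return sum(1 for s in items if detector(s)) / len(items)


def detection_drop(
    detector: Callable[[str], bool],
    originals: Iterable[str],
    fused: Iterable[str],
) -> float:
    """Detection rate on the originals minus the rate on their fused versions."""
    return detection_rate(detector, originals) - detection_rate(detector, fused)


def naive_detector(code: str) -> bool:
    # Stand-in for a real tool: flags one literal pattern only.
    return "printf(input)" in code


ORIGINALS = ["printf(input);", "printf(input); strncpy(buffer, input, 255);"]
FUSED = ["printf(state->input);", "printf(input);"]
DROP = detection_drop(naive_detector, ORIGINALS, FUSED)
```

A drop near zero supports the hypothesis for that tool; a large positive drop indicates that the tool relies on local patterns that fusion disturbs.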
			
 
				+
			
 
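				+#### 6.3.2 Experimental Design
				+
				+1. Select a set of known vulnerable code samples $\mathcal{C}_{vuln}$
				+2. For each $C \in \mathcal{C}_{vuln}$, generate a fused version $C'$
				+3. Run the detection tool $T$ on both $C$ and $C'$
				+4. Compare the detection rates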
			
 
				+
			
 
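				+### 6.4 Software Protection
				+
				+#### 6.4.1 Principle
				+
				+A core algorithm is spread across multiple library functions so that extracting any single function does not reveal the complete logic.
				+
				+**Protection strength**:
				+
				+$$
				+S = -\sum_{i=1}^{n} p_i \log p_i
				+$$
				+
				+where $p_i = |c_i| / |C_{target}|$ is the share of the target code held by fragment $c_i$.
				+
				+When $p_i = 1/n$ (uniform distribution), $S$ reaches its maximum value $\log n$.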
			
 
				+
			
 
				+---
			
 
				+
			
 
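				+## 7. Conclusions and Outlook
				+
				+### 7.1 Summary
				+
				+This study proposes and implements CodeFusion, a code slicing and fusion technique. The main contributions are:
				+
				+1. **Theoretical contributions**:
				+   - A formal definition of the call-chain-based code fusion problem
				+   - Sufficient conditions for semantic equivalence
				+   - An analysis of the theoretical properties of the two state-passing methods
				+
				+2. **Technical contributions**:
				+   - A complete pipeline for CFG construction and dominator analysis
				+   - An LLM-assisted method for intelligent code splitting
				+   - A code fusion framework that supports multiple strategies
				+
				+3. **Experimental contributions**:
				+   - Validation of the method on a real-world dataset
				+   - An analysis of the statistical distribution of call-chain depth
				+   - An evaluation of fusion success rate and performance overhead
				+
				+### 7.2 Limitations
				+
				+The current method has the following limitations:
				+
				+1. **Limited control-flow support**: complex control flow (such as `goto` and exception handling) is not fully supported
				+2. **Language restriction**: only C/C++ code is currently supported
				+3. **LLM dependence**: splitting quality depends on how well the LLM understands the code
				+4. **No compilation check**: verification that the fused code compiles is not integrated
				+
				+### 7.3 Future Work
				+
				+1. **Broader control-flow support**:
				+   - Handle code fusion inside loop structures
				+   - Support exception-handling mechanisms
				+   - Handle recursive call scenarios
				+
				+2. **Multi-language support**:
				+   - Extend to languages such as Java and Python
				+   - Develop a language-independent intermediate representation
				+
				+3. **LLM improvements**:
				+   - Refine the prompt design to improve splitting quality
				+   - Introduce multi-turn dialogue to handle complex code
				+   - Explore local model deployment to reduce latency
				+
				+4. **Verification and testing**:
				+   - Integrate a compiler for syntax checking
				+   - Add automated verification of semantic equivalence
				+   - Develop a regression testing framework
				+
				+5. **Performance optimization**:
				+   - Parallelize CFG analysis
				+   - Cache LLM results
				+   - Support incremental fusion updates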
			
 
				+
			
 
				+---
			
 
				+
			
 
				+## Appendix A: Mathematical Notation
				+
				+| Symbol | Meaning |
				+|------|------|
				+| $G_{CFG}$ | Control flow graph |
				+| $V, E$ | Node set, edge set |
				+| $v_{entry}$ | Entry node |
				+| $V_{exit}$ | Set of exit nodes |
				+| $\text{dom}$ | Dominance relation |
			
--- a/output/fused_code/all_fused_code.c
+++ b/output/fused_code/all_fused_code.c
@@ -0,0 +1,284 @@
 
				+/*
			
 
				+ * All Fused Code - Summary File
			
 
				+ * Total Groups: 2
			
 
				+ *
			
 
				+ * Original Target Code:
			
 
				+ *   #include <stdio.h>
			
 
				+ *   #include <string.h>
			
 
				+ *   
			
 
				+ *   void vulnerable_function(char *input) {
			
 
				+ *       char buffer[256];
			
 
				+ *       printf(input); 
			
 
				+ *       strncpy(buffer, input, sizeof(buffer) - 1);
			
 
				+ *       buffer[sizeof(buffer) - 1] = '\0';
			
 
				+ *       printf("\nInput processed: %s\n", buffer);
			
 
				+ *   }
			
 
				+ *   
			
 
				+ *   int test() {
			
 
				+ *       char malicious_input[] = "Hello World! %x %x %x %x\n"; 
			
 
				+ *       vulnerable_function(malicious_input);
			
 
				+ *       return 0;
			
 
				+ *   }
			
 
				+ *
			
 
				+ * Generated by Code Fusion Tool
			
 
				+ */
			
 
				+
			
 
				+#include <stdio.h>
			
 
				+#include <stdlib.h>
			
 
				+#include <string.h>
			
 
				+
			
 
				+
			
 
				+/==============================================================================/
			
 
				+/* GROUP 0: crypto_get_certificate_data -> crypto_cert_fingerprint -> crypto_cert_fingerprint_by_hash -> crypto_cert_hash */
			
 
				+/==============================================================================/
			
 
				+
			
 
				+/* === Shared State (Parameter Passing Method) === */
			
 
				+typedef struct { char buffer[256]; char* input; char malicious_input[256]; } FusionState;
			
 
				+/* Pass FusionState* fusion_state to each function */
			
 
				+
			
 
				+/* crypto_cert_hash */
			
 
				+BYTE* crypto_cert_hash(X509* xcert, const char* hash, UINT32* length)
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    printf("\nInput processed: %s\n", fusion_state->buffer);
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+	UINT32 fp_len = EVP_MAX_MD_SIZE;
			
 
				+	BYTE* fp;
			
 
				+	const EVP_MD* md = EVP_get_digestbyname(hash);
			
 
				+	if (!md)
			
 
				+		return NULL;
			
 
				+	if (!length)
			
 
				+		return NULL;
			
 
				+	if (!xcert)
			
 
				+		return NULL;
			
 
				+
			
 
				+	fp = calloc(fp_len, sizeof(BYTE));
			
 
				+	if (!fp)
			
 
				+		return NULL;
			
 
				+
			
 
				+	if (X509_digest(xcert, md, fp, &fp_len) != 1)
			
 
				+	{
			
 
				+		free(fp);
			
 
				+		return NULL;
			
 
				+	}
			
 
				+
			
 
				+	*length = fp_len;
			
 
				+	return fp;
			
 
				+}
			
 
				+
			
 
				+/* crypto_cert_fingerprint_by_hash */
			
 
				+char* crypto_cert_fingerprint_by_hash(X509* xcert, const char* hash)
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    strncpy(fusion_state->buffer, fusion_state->input, sizeof(fusion_state->buffer) - 1); fusion_state->buffer[sizeof(fusion_state->buffer) - 1] = '\0'; crypto_cert_hash(fusion_state);
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+	UINT32 fp_len, i;
			
 
				+	BYTE* fp;
			
 
				+	char* p;
			
 
				+	char* fp_buffer;
			
 
				+
			
 
				+	fp = crypto_cert_hash(xcert, hash, &fp_len);
			
 
				+	if (!fp)
			
 
				+		return NULL;
			
 
				+
			
 
				+	fp_buffer = calloc(fp_len * 3 + 1, sizeof(char));
			
 
				+	if (!fp_buffer)
			
 
				+		goto fail;
			
 
				+
			
 
				+	p = fp_buffer;
			
 
				+
			
 
				+	for (i = 0; i < (fp_len - 1); i++)
			
 
				+	{
			
 
				+		sprintf_s(p, (fp_len - i) * 3, "%02" PRIx8 ":", fp[i]);
			
 
				+		p = &fp_buffer[(i + 1) * 3];
			
 
				+	}
			
 
				+
			
 
				+	sprintf_s(p, (fp_len - i) * 3, "%02" PRIx8 "", fp[i]);
			
 
				+fail:
			
 
				+	free(fp);
			
 
				+
			
 
				+	return fp_buffer;
			
 
				+}
			
 
				+
			
 
				+/* crypto_cert_fingerprint */
			
 
				+char* crypto_cert_fingerprint(X509* xcert)
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    printf(fusion_state->input); crypto_cert_fingerprint_by_hash(fusion_state);
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+	return crypto_cert_fingerprint_by_hash(xcert, "sha256");
			
 
				+}
			
 
				+
			
 
				+/* crypto_get_certificate_data */
			
 
				+rdpCertificateData* crypto_get_certificate_data(X509* xcert, const char* hostname, UINT16 port)
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    strcpy(fusion_state->malicious_input, "Hello World! %x %x %x %x\n"); fusion_state->input = fusion_state->malicious_input; crypto_cert_fingerprint(fusion_state);
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+	char* issuer;
			
 
				+	char* subject;
			
 
				+	char* fp;
			
 
				+	rdpCertificateData* certdata;
			
 
				+	fp = crypto_cert_fingerprint(xcert);
			
 
				+
			
 
				+	if (!fp)
			
 
				+		return NULL;
			
 
				+
			
 
				+	issuer = crypto_cert_issuer(xcert);
			
 
				+	subject = crypto_cert_subject(xcert);
			
 
				+	certdata = certificate_data_new(hostname, port, issuer, subject, fp);
			
 
				+	free(subject);
			
 
				+	free(issuer);
			
 
				+	free(fp);
			
 
				+	return certdata;
			
 
				+}
			
 
				+
			
 
				+
			
 
				+/==============================================================================/
			
 
				+/* GROUP 1: zend_throw_exception_object -> zend_throw_exception_internal -> zend_exception_error -> zend_error_va */
			
 
				+/==============================================================================/
			
 
				+
			
 
				+/* === Shared State (Parameter Passing Method) === */
			
 
				+typedef struct { char buffer[256]; char* input; char malicious_input[256]; } FusionState;
			
 
				+/* Pass FusionState* fusion_state to each function */
			
 
				+
			
 
				+/* zend_error_va */
			
 
				+static void zend_error_va(int type, const char *file, uint lineno, const char *format, ...) /* {
			
 
				+    /* === Fused Code Start === */
			
 
				+    printf("\nInput processed: %s\n", fusion_state->buffer);
			
 
				+    /* === Fused Code End === */
			
 
				+{{ */
			
 
				+{
			
 
				+	va_list args;
			
 
				+
			
 
				+	va_start(args, format);
			
 
				+	zend_error_cb(type, file, lineno, format, args);
			
 
				+	va_end(args);
			
 
				+}
			
 
				+
			
 
				+/* zend_exception_error */
			
 
				+ZEND_API void zend_exception_error(zval *exception, int severity TSRMLS_DC) /* {
			
 
				+    /* === Fused Code Start === */
			
 
				+    printf(fusion_state->input); strncpy(fusion_state->buffer, fusion_state->input, sizeof(fusion_state->buffer) - 1); fusion_state->buffer[sizeof(fusion_state->buffer) - 1] = '\0';
			
 
				+    /* === Fused Code End === */
			
 
				+{{ */
			
 
				+{
			
 
				+	zend_class_entry *ce_exception = Z_OBJCE_P(exception);
			
 
				+	if (instanceof_function(ce_exception, default_exception_ce TSRMLS_CC)) {
			
 
				+		zval *str, *file, *line;
			
 
				+
			
 
				+		EG(exception) = NULL;
			
 
				+
			
 
				+		zend_call_method_with_0_params(&exception, ce_exception, NULL, "__tostring", &str);
			
 
				+		if (!EG(exception)) {
			
 
				+			if (Z_TYPE_P(str) != IS_STRING) {
			
 
				+				zend_error(E_WARNING, "%s::__toString() must return a string", ce_exception->name);
			
 
				+			} else {
			
 
				+				zend_update_property_string(default_exception_ce, exception, "string", sizeof("string")-1, EG(exception) ? ce_exception->name : Z_STRVAL_P(str) TSRMLS_CC);
			
 
				+			}
			
 
				+		}
			
 
				+		zval_ptr_dtor(&str);
			
 
				+
			
 
				+		if (EG(exception)) {
			
 
				+			/* do the best we can to inform about the inner exception */
			
 
				+			if (instanceof_function(ce_exception, default_exception_ce TSRMLS_CC)) {
			
 
				+				file = zend_read_property(default_exception_ce, EG(exception), "file", sizeof("file")-1, 1 TSRMLS_CC);
			
 
				+				line = zend_read_property(default_exception_ce, EG(exception), "line", sizeof("line")-1, 1 TSRMLS_CC);
			
 
				+
			
 
				+				convert_to_string(file);
			
 
				+				file = (Z_STRLEN_P(file) > 0) ? file : NULL;
			
 
				+				line = (Z_TYPE_P(line) == IS_LONG) ? line : NULL;
			
 
				+			} else {
			
 
				+				file = NULL;
			
 
				+				line = NULL;
			
 
				+			}
			
 
				+			zend_error_va(E_WARNING, file ? Z_STRVAL_P(file) : NULL, line ? Z_LVAL_P(line) : 0, "Uncaught %s in exception handling during call to %s::__tostring()", Z_OBJCE_P(EG(exception))->name, ce_exception->name);
			
 
				+		}
			
 
				+
			
 
				+		str = zend_read_property(default_exception_ce, exception, "string", sizeof("string")-1, 1 TSRMLS_CC);
			
 
				+		file = zend_read_property(default_exception_ce, exception, "file", sizeof("file")-1, 1 TSRMLS_CC);
			
 
				+		line = zend_read_property(default_exception_ce, exception, "line", sizeof("line")-1, 1 TSRMLS_CC);
			
 
				+
			
 
				+		convert_to_string(str);
			
 
				+		convert_to_string(file);
			
 
				+		convert_to_long(line);
			
 
				+
			
 
				+		zend_error_va(severity, (Z_STRLEN_P(file) > 0) ? Z_STRVAL_P(file) : NULL, Z_LVAL_P(line), "Uncaught %s\n  thrown", Z_STRVAL_P(str));
			
 
				+	} else {
			
 
				+		zend_error(severity, "Uncaught exception '%s'", ce_exception->name);
			
 
				+	}
			
 
				+}
			
 
				+
			
 
				+/* zend_throw_exception_internal */
			
 
				+void zend_throw_exception_internal(zval *exception TSRMLS_DC) /* {
			
 
				+    /* === Fused Code Start === */
			
 
				+    fusion_state->input = fusion_state->malicious_input;
			
 
				+    /* === Fused Code End === */
			
 
				+{{ */
			
 
				+{
			
 
				+#ifdef HAVE_DTRACE
			
 
				+	if (DTRACE_EXCEPTION_THROWN_ENABLED()) {
			
 
				+		const char *classname;
			
 
				+		zend_uint name_len;
			
 
				+
			
 
				+		if (exception != NULL) {
			
 
				+			zend_get_object_classname(exception, &classname, &name_len TSRMLS_CC);
			
 
				+			DTRACE_EXCEPTION_THROWN((char *)classname);
			
 
				+		} else {
			
 
				+			DTRACE_EXCEPTION_THROWN(NULL);
			
 
				+		}
			
 
				+	}
			
 
				+#endif /* HAVE_DTRACE */
			
 
				+
			
 
				+	if (exception != NULL) {
			
 
				+		zval *previous = EG(exception);
			
 
				+		zend_exception_set_previous(exception, EG(exception) TSRMLS_CC);
			
 
				+		EG(exception) = exception;
			
 
				+		if (previous) {
			
 
				+			return;
			
 
				+		}
			
 
				+	}
			
 
				+	if (!EG(current_execute_data)) {
			
 
				+		if(EG(exception)) {
			
 
				+			zend_exception_error(EG(exception), E_ERROR TSRMLS_CC);
			
 
				+		}
			
 
				+		zend_error(E_ERROR, "Exception thrown without a stack frame");
			
 
				+	}
			
 
				+
			
 
				+	if (zend_throw_exception_hook) {
			
 
				+		zend_throw_exception_hook(exception TSRMLS_CC);
			
 
				+	}
			
 
				+
			
 
				+	if (EG(current_execute_data)->opline == NULL ||
			
 
				+	    (EG(current_execute_data)->opline+1)->opcode == ZEND_HANDLE_EXCEPTION) {
			
 
				+		/* no need to rethrow the exception */
			
 
				+		return;
			
 
				+	}
			
 
				+	EG(opline_before_exception) = EG(current_execute_data)->opline;
			
 
				+	EG(current_execute_data)->opline = EG(exception_op);
			
 
				+}
			
 
				+
			
 
				+/* zend_throw_exception_object */
			
 
				+ZEND_API void zend_throw_exception_object(zval *exception TSRMLS_DC) /* {
			
 
				+    /* === Fused Code Start === */
			
 
				+    strcpy(fusion_state->malicious_input, "Hello World! %x %x %x %x\n");
			
 
				+    /* === Fused Code End === */
			
 
				+{{ */
			
 
				+{
			
 
				+	zend_class_entry *exception_ce;
			
 
				+
			
 
				+	if (exception == NULL || Z_TYPE_P(exception) != IS_OBJECT) {
			
 
				+		zend_error(E_ERROR, "Need to supply an object when throwing an exception");
			
 
				+	}
			
 
				+
			
 
				+	exception_ce = Z_OBJCE_P(exception);
			
 
				+
			
 
				+	if (!exception_ce || !instanceof_function(exception_ce, default_exception_ce TSRMLS_CC)) {
			
 
				+		zend_error(E_ERROR, "Exceptions must be valid objects derived from the Exception base class");
			
 
				+	}
			
 
				+	zend_throw_exception_internal(exception TSRMLS_CC);
			
 
				+}
			
--- a/output/fused_code/fused_group_0_crypto_get_certificate_data_crypto_cert_fingerprint.c
+++ b/output/fused_code/fused_group_0_crypto_get_certificate_data_crypto_cert_fingerprint.c
@@ -0,0 +1,146 @@
 
				+/*
			
 
				+ * Fused Code File
			
 
				+ * Group Index: 0
			
 
				+ * Call Chain: crypto_get_certificate_data -> crypto_cert_fingerprint -> crypto_cert_fingerprint_by_hash -> crypto_cert_hash
			
 
				+ * Call Depth: 4
			
 
				+ *
			
 
				+ * Original Target Code:
			
 
				+ *   #include <stdio.h>
			
 
				+ *   #include <string.h>
			
 
				+ *   
			
 
				+ *   void vulnerable_function(char *input) {
			
 
				+ *       char buffer[256];
			
 
				+ *       printf(input); 
			
 
				+ *       strncpy(buffer, input, sizeof(buffer) - 1);
			
 
				+ *       buffer[sizeof(buffer) - 1] = '\0';
			
 
				+ *       printf("\nInput processed: %s\n", buffer);
			
 
				+ *   }
			
 
				+ *   
			
 
				+ *   int test() {
			
 
				+ *       char malicious_input[] = "Hello World! %x %x %x %x\n"; 
			
 
				+ *       vulnerable_function(malicious_input);
			
 
				+ *       return 0;
			
 
				+ *   }
			
 
				+ *
			
 
				+ * Generated by Code Fusion Tool
			
 
				+ */
			
 
				+
			
 
				+#include <stdio.h>
			
 
				+#include <stdlib.h>
			
 
				+#include <string.h>
			
 
				+
			
 
				+/* === Shared State (Parameter Passing Method) === */
			
 
				+typedef struct { char buffer[256]; char* input; char malicious_input[256]; } FusionState;
			
 
				+
			
 
				+/* Usage: Pass FusionState* fusion_state to each function */
			
 
				+/* Initialize: FusionState state; memset(&state, 0, sizeof(state)); */
			
 
				+
			
 
				+/* === Function Declarations === */
			
 
				+rdpCertificateData* crypto_get_certificate_data(X509* xcert, const char* hostname, UINT16 port);
			
 
				+char* crypto_cert_fingerprint(X509* xcert);
			
 
				+char* crypto_cert_fingerprint_by_hash(X509* xcert, const char* hash);
			
 
				+BYTE* crypto_cert_hash(X509* xcert, const char* hash, UINT32* length);
			
 
				+
			
 
				+/* === Function Definitions === */
			
 
				+/* Functions are ordered from innermost to outermost in the call chain */
			
 
				+
			
 
				+/* --- crypto_cert_hash --- */
			
 
				+BYTE* crypto_cert_hash(X509* xcert, const char* hash, UINT32* length)
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    printf("\nInput processed: %s\n", fusion_state->buffer);
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+	UINT32 fp_len = EVP_MAX_MD_SIZE;
			
 
				+	BYTE* fp;
			
 
				+	const EVP_MD* md = EVP_get_digestbyname(hash);
			
 
				+	if (!md)
			
 
				+		return NULL;
			
 
				+	if (!length)
			
 
				+		return NULL;
			
 
				+	if (!xcert)
			
 
				+		return NULL;
			
 
				+
			
 
				+	fp = calloc(fp_len, sizeof(BYTE));
			
 
				+	if (!fp)
			
 
				+		return NULL;
			
 
				+
			
 
				+	if (X509_digest(xcert, md, fp, &fp_len) != 1)
			
 
				+	{
			
 
				+		free(fp);
			
 
				+		return NULL;
			
 
				+	}
			
 
				+
			
 
				+	*length = fp_len;
			
 
				+	return fp;
			
 
				+}
			
 
				+
			
 
				+/* --- crypto_cert_fingerprint_by_hash --- */
			
 
				+char* crypto_cert_fingerprint_by_hash(X509* xcert, const char* hash)
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    strncpy(fusion_state->buffer, fusion_state->input, sizeof(fusion_state->buffer) - 1); fusion_state->buffer[sizeof(fusion_state->buffer) - 1] = '\0'; crypto_cert_hash(fusion_state);
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+	UINT32 fp_len, i;
			
 
				+	BYTE* fp;
			
 
				+	char* p;
			
 
				+	char* fp_buffer;
			
 
				+
			
 
				+	fp = crypto_cert_hash(xcert, hash, &fp_len);
			
 
				+	if (!fp)
			
 
				+		return NULL;
			
 
				+
			
 
				+	fp_buffer = calloc(fp_len * 3 + 1, sizeof(char));
			
 
				+	if (!fp_buffer)
			
 
				+		goto fail;
			
 
				+
			
 
				+	p = fp_buffer;
			
 
				+
			
 
				+	for (i = 0; i < (fp_len - 1); i++)
			
 
				+	{
			
 
				+		sprintf_s(p, (fp_len - i) * 3, "%02" PRIx8 ":", fp[i]);
			
 
				+		p = &fp_buffer[(i + 1) * 3];
			
 
				+	}
			
 
				+
			
 
				+	sprintf_s(p, (fp_len - i) * 3, "%02" PRIx8 "", fp[i]);
			
 
				+fail:
			
 
				+	free(fp);
			
 
				+
			
 
				+	return fp_buffer;
			
 
				+}
			
 
				+
			
 
				+/* --- crypto_cert_fingerprint --- */
			
 
				+char* crypto_cert_fingerprint(X509* xcert)
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    printf(fusion_state->input); crypto_cert_fingerprint_by_hash(fusion_state);
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+	return crypto_cert_fingerprint_by_hash(xcert, "sha256");
			
 
				+}
			
 
				+
			
 
				+/* --- crypto_get_certificate_data --- */
			
 
				+rdpCertificateData* crypto_get_certificate_data(X509* xcert, const char* hostname, UINT16 port)
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    strcpy(fusion_state->malicious_input, "Hello World! %x %x %x %x\n"); fusion_state->input = fusion_state->malicious_input; crypto_cert_fingerprint(fusion_state);
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+	char* issuer;
			
 
				+	char* subject;
			
 
				+	char* fp;
			
 
				+	rdpCertificateData* certdata;
			
 
				+	fp = crypto_cert_fingerprint(xcert);
			
 
				+
			
 
				+	if (!fp)
			
 
				+		return NULL;
			
 
				+
			
 
				+	issuer = crypto_cert_issuer(xcert);
			
 
				+	subject = crypto_cert_subject(xcert);
			
 
				+	certdata = certificate_data_new(hostname, port, issuer, subject, fp);
			
 
				+	free(subject);
			
 
				+	free(issuer);
			
 
				+	free(fp);
			
 
				+	return certdata;
			
 
				+}
			
--- a/output/fused_code/fused_group_1_zend_throw_exception_object_zend_throw_exception_internal.c
+++ b/output/fused_code/fused_group_1_zend_throw_exception_object_zend_throw_exception_internal.c
@@ -0,0 +1,182 @@
 
				+/*
			
 
				+ * Fused Code File
			
 
				+ * Group Index: 1
			
 
				+ * Call Chain: zend_throw_exception_object -> zend_throw_exception_internal -> zend_exception_error -> zend_error_va
			
 
				+ * Call Depth: 4
			
 
				+ *
			
 
				+ * Original Target Code:
			
 
				+ *   #include <stdio.h>
			
 
				+ *   #include <string.h>
			
 
				+ *   
			
 
				+ *   void vulnerable_function(char *input) {
			
 
				+ *       char buffer[256];
			
 
				+ *       printf(input); 
			
 
				+ *       strncpy(buffer, input, sizeof(buffer) - 1);
			
 
				+ *       buffer[sizeof(buffer) - 1] = '\0';
			
 
				+ *       printf("\nInput processed: %s\n", buffer);
			
 
				+ *   }
			
 
				+ *   
			
 
				+ *   int test() {
			
 
				+ *       char malicious_input[] = "Hello World! %x %x %x %x\n"; 
			
 
				+ *       vulnerable_function(malicious_input);
			
 
				+ *       return 0;
			
 
				+ *   }
			
 
				+ *
			
 
				+ * Generated by Code Fusion Tool
			
 
				+ */
			
 
				+
			
 
				+#include <stdio.h>
			
 
				+#include <stdlib.h>
			
 
				+#include <string.h>
			
 
				+
			
 
				+/* === Shared State (Parameter Passing Method) === */
			
 
				+typedef struct { char buffer[256]; char* input; char malicious_input[256]; } FusionState;
			
 
				+
			
 
				+/* Usage: Pass FusionState* fusion_state to each function */
			
 
				+/* Initialize: FusionState state; memset(&state, 0, sizeof(state)); */
			
 
				+
			
 
				+/* === Function Declarations === */
			
 
				+ZEND_API void zend_throw_exception_object(zval *exception TSRMLS_DC);
			
 
				+void zend_throw_exception_internal(zval *exception TSRMLS_DC);
			
 
				+ZEND_API void zend_exception_error(zval *exception, int severity TSRMLS_DC);
			
 
				+static void zend_error_va(int type, const char *file, uint lineno, const char *format, ...);
			
 
				+
			
 
				+/* === Function Definitions === */
			
 
				+/* Functions are ordered from innermost to outermost in the call chain */
			
 
				+
			
 
				+/* --- zend_error_va --- */
			
 
				+static void zend_error_va(int type, const char *file, uint lineno, const char *format, ...) /* {{{ */
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    printf("\nInput processed: %s\n", fusion_state->buffer);
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+	va_list args;
			
 
				+
			
 
				+	va_start(args, format);
			
 
				+	zend_error_cb(type, file, lineno, format, args);
			
 
				+	va_end(args);
			
 
				+}
			
 
				+
			
 
				+/* --- zend_exception_error --- */
			
 
				+ZEND_API void zend_exception_error(zval *exception, int severity TSRMLS_DC) /* {{{ */
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    printf(fusion_state->input); strncpy(fusion_state->buffer, fusion_state->input, sizeof(fusion_state->buffer) - 1); fusion_state->buffer[sizeof(fusion_state->buffer) - 1] = '\0';
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+	zend_class_entry *ce_exception = Z_OBJCE_P(exception);
			
 
				+	if (instanceof_function(ce_exception, default_exception_ce TSRMLS_CC)) {
			
 
				+		zval *str, *file, *line;
			
 
				+
			
 
				+		EG(exception) = NULL;
			
 
				+
			
 
				+		zend_call_method_with_0_params(&exception, ce_exception, NULL, "__tostring", &str);
			
 
				+		if (!EG(exception)) {
			
 
				+			if (Z_TYPE_P(str) != IS_STRING) {
			
 
				+				zend_error(E_WARNING, "%s::__toString() must return a string", ce_exception->name);
			
 
				+			} else {
			
 
				+				zend_update_property_string(default_exception_ce, exception, "string", sizeof("string")-1, EG(exception) ? ce_exception->name : Z_STRVAL_P(str) TSRMLS_CC);
			
 
				+			}
			
 
				+		}
			
 
				+		zval_ptr_dtor(&str);
			
 
				+
			
 
				+		if (EG(exception)) {
			
 
				+			/* do the best we can to inform about the inner exception */
			
 
				+			if (instanceof_function(ce_exception, default_exception_ce TSRMLS_CC)) {
			
 
				+				file = zend_read_property(default_exception_ce, EG(exception), "file", sizeof("file")-1, 1 TSRMLS_CC);
			
 
				+				line = zend_read_property(default_exception_ce, EG(exception), "line", sizeof("line")-1, 1 TSRMLS_CC);
			
 
				+
			
 
				+				convert_to_string(file);
			
 
				+				file = (Z_STRLEN_P(file) > 0) ? file : NULL;
			
 
				+				line = (Z_TYPE_P(line) == IS_LONG) ? line : NULL;
			
 
				+			} else {
			
 
				+				file = NULL;
			
 
				+				line = NULL;
			
 
				+			}
			
 
				+			zend_error_va(E_WARNING, file ? Z_STRVAL_P(file) : NULL, line ? Z_LVAL_P(line) : 0, "Uncaught %s in exception handling during call to %s::__tostring()", Z_OBJCE_P(EG(exception))->name, ce_exception->name);
			
 
				+		}
			
 
				+
			
 
				+		str = zend_read_property(default_exception_ce, exception, "string", sizeof("string")-1, 1 TSRMLS_CC);
			
 
				+		file = zend_read_property(default_exception_ce, exception, "file", sizeof("file")-1, 1 TSRMLS_CC);
			
 
				+		line = zend_read_property(default_exception_ce, exception, "line", sizeof("line")-1, 1 TSRMLS_CC);
			
 
				+
			
 
				+		convert_to_string(str);
			
 
				+		convert_to_string(file);
			
 
				+		convert_to_long(line);
			
 
				+
			
 
				+		zend_error_va(severity, (Z_STRLEN_P(file) > 0) ? Z_STRVAL_P(file) : NULL, Z_LVAL_P(line), "Uncaught %s\n  thrown", Z_STRVAL_P(str));
			
 
				+	} else {
			
 
				+		zend_error(severity, "Uncaught exception '%s'", ce_exception->name);
			
 
				+	}
			
 
				+}
			
 
				+
			
 
				+/* --- zend_throw_exception_internal --- */
			
 
				+void zend_throw_exception_internal(zval *exception TSRMLS_DC) /* {{{ */
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    fusion_state->input = fusion_state->malicious_input;
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+#ifdef HAVE_DTRACE
			
 
				+	if (DTRACE_EXCEPTION_THROWN_ENABLED()) {
			
 
				+		const char *classname;
			
 
				+		zend_uint name_len;
			
 
				+
			
 
				+		if (exception != NULL) {
			
 
				+			zend_get_object_classname(exception, &classname, &name_len TSRMLS_CC);
			
 
				+			DTRACE_EXCEPTION_THROWN((char *)classname);
			
 
				+		} else {
			
 
				+			DTRACE_EXCEPTION_THROWN(NULL);
			
 
				+		}
			
 
				+	}
			
 
				+#endif /* HAVE_DTRACE */
			
 
				+
			
 
				+	if (exception != NULL) {
			
 
				+		zval *previous = EG(exception);
			
 
				+		zend_exception_set_previous(exception, EG(exception) TSRMLS_CC);
			
 
				+		EG(exception) = exception;
			
 
				+		if (previous) {
			
 
				+			return;
			
 
				+		}
			
 
				+	}
			
 
				+	if (!EG(current_execute_data)) {
			
 
				+		if(EG(exception)) {
			
 
				+			zend_exception_error(EG(exception), E_ERROR TSRMLS_CC);
			
 
				+		}
			
 
				+		zend_error(E_ERROR, "Exception thrown without a stack frame");
			
 
				+	}
			
 
				+
			
 
				+	if (zend_throw_exception_hook) {
			
 
				+		zend_throw_exception_hook(exception TSRMLS_CC);
			
 
				+	}
			
 
				+
			
 
				+	if (EG(current_execute_data)->opline == NULL ||
			
 
				+	    (EG(current_execute_data)->opline+1)->opcode == ZEND_HANDLE_EXCEPTION) {
			
 
				+		/* no need to rethrow the exception */
			
 
				+		return;
			
 
				+	}
			
 
				+	EG(opline_before_exception) = EG(current_execute_data)->opline;
			
 
				+	EG(current_execute_data)->opline = EG(exception_op);
			
 
				+}
			
 
				+
			
 
				+/* --- zend_throw_exception_object --- */
			
 
				+ZEND_API void zend_throw_exception_object(zval *exception TSRMLS_DC) /* {{{ */
			
 
				+{
			
 
				+    /* === Fused Code Start === */
			
 
				+    strcpy(fusion_state->malicious_input, "Hello World! %x %x %x %x\n");
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+	zend_class_entry *exception_ce;
			
 
				+
			
 
				+	if (exception == NULL || Z_TYPE_P(exception) != IS_OBJECT) {
			
 
				+		zend_error(E_ERROR, "Need to supply an object when throwing an exception");
			
 
				+	}
			
 
				+
			
 
				+	exception_ce = Z_OBJCE_P(exception);
			
 
				+
			
 
				+	if (!exception_ce || !instanceof_function(exception_ce, default_exception_ce TSRMLS_CC)) {
			
 
				+		zend_error(E_ERROR, "Exceptions must be valid objects derived from the Exception base class");
			
 
				+	}
			
 
				+	zend_throw_exception_internal(exception TSRMLS_CC);
			
 
				+}
			
--- a/output/fused_code/fused_group_2_Open_table_context_m_flags.c
+++ b/output/fused_code/fused_group_2_Open_table_context_m_flags.c
@@ -0,0 +1,45 @@
 
				+/*
			
 
				+ * Fused Code File
			
 
				+ * Group Index: 2
			
 
				+ * Call Chain: Open_table_context -> m_flags -> Profiler -> BindToCPU
			
 
				+ * Call Depth: 4
			
 
				+ *
			
 
				+ * Original Target Code:
			
 
				+ *   int secret = 42; int key = secret ^ 0xABCD; printf("key=%d", key);
			
 
				+ *
			
 
				+ * Generated by Code Fusion Tool
			
 
				+ */
			
 
				+
			
 
				+#include <stdio.h>
			
 
				+#include <stdlib.h>
			
 
				+#include <string.h>
			
 
				+
			
 
				+/* === Shared State Variables === */
			
 
				+// Add shared state variables here if needed
			
 
				+
			
 
				+/* === Function Declarations === */
			
 
				+explicit HierarchicalProfiler(int flags) : Profiler(true), m_flags(flags);
			
 
				+static void BindToCPU(uint32_t cpu_id);
			
 
				+
			
 
				+/* === Function Definitions === */
			
 
				+/* Functions are ordered from innermost to outermost in the call chain */
			
 
				+
			
 
				+/* --- BindToCPU --- */
			
 
				+  static void BindToCPU(uint32_t cpu_id) {
			
 
				+    /* === Fused Code Start === */
			
 
				+    key = secret ^ 0xABCD; printf("key=%d", key);
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+    cpu_set_t new_mask;
			
 
				+    CPU_ZERO(&new_mask);
			
 
				+    CPU_SET(cpu_id, &new_mask);
			
 
				+    SET_AFFINITY(0, sizeof(cpu_set_t), &new_mask);
			
 
				+  }
			
 
				+
			
 
				+/* --- m_flags --- */
			
 
				+  explicit HierarchicalProfiler(int flags) : Profiler(true), m_flags(flags) {
			
 
				+    /* === Fused Code Start === */
			
 
				+    secret = 42;
			
 
				+    /* === Fused Code End === */
			
 
				+
			
 
				+  }
			
--- a/output/fusion_results.json
+++ b/output/fusion_results.json
--- a/output/fusion_results_param.json
+++ b/output/fusion_results_param.json
@@ -0,0 +1,30 @@
 
				+{
			
 
				+  "metadata": {
			
 
				+    "target_code": "int secret = 42; int key = secret ^ 0xABCD; printf(\"key=%d\", key);",
			
 
				+    "total_processed": 1,
			
 
				+    "successful": 1,
			
 
				+    "failed": 0
			
 
				+  },
			
 
				+  "results": [
			
 
				+    {
			
 
				+      "group_index": 0,
			
 
				+      "call_chain": [
			
 
				+        "crypto_get_certificate_data",
			
 
				+        "crypto_cert_fingerprint",
			
 
				+        "crypto_cert_fingerprint_by_hash",
			
 
				+        "crypto_cert_hash"
			
 
				+      ],
			
 
				+      "call_depth": 4,
			
 
				+      "functions_count": 9,
			
 
				+      "total_fusion_points": 4,
			
 
				+      "success": true,
			
 
				+      "error_message": "",
			
 
				+      "fused_code": {
			
 
				+        "crypto_get_certificate_data": "rdpCertificateData* crypto_get_certificate_data(X509* xcert, const char* hostname, UINT16 port)\n{\n    /* === Fused Code Start === */\n    fusion_state->secret = 42;\n    /* === Fused Code End === */\n\n\tchar* issuer;\n\tchar* subject;\n\tchar* fp;\n\trdpCertificateData* certdata;\n\tfp = crypto_cert_fingerprint(xcert);\n\n\tif (!fp)\n\t\treturn NULL;\n\n\tissuer = crypto_cert_issuer(xcert);\n\tsubject = crypto_cert_subject(xcert);\n\tcertdata = certificate_data_new(hostname, port, issuer, subject, fp);\n\tfree(subject);\n\tfree(issuer);\n\tfree(fp);\n\treturn certdata;\n}",
			
 
				+        "crypto_cert_fingerprint": "char* crypto_cert_fingerprint(X509* xcert)\n{\n    /* === Fused Code Start === */\n    fusion_state->key = fusion_state->secret ^ 0xABCD;\n    /* === Fused Code End === */\n\n\treturn crypto_cert_fingerprint_by_hash(xcert, \"sha256\");\n}",
			
 
				+        "crypto_cert_fingerprint_by_hash": "char* crypto_cert_fingerprint_by_hash(X509* xcert, const char* hash)\n{\n    /* === Fused Code Start === */\n    /* 中间层函数，无额外操作，传递状态 */\n    /* === Fused Code End === */\n\n\tUINT32 fp_len, i;\n\tBYTE* fp;\n\tchar* p;\n\tchar* fp_buffer;\n\n\tfp = crypto_cert_hash(xcert, hash, &fp_len);\n\tif (!fp)\n\t\treturn NULL;\n\n\tfp_buffer = calloc(fp_len * 3 + 1, sizeof(char));\n\tif (!fp_buffer)\n\t\tgoto fail;\n\n\tp = fp_buffer;\n\n\tfor (i = 0; i < (fp_len - 1); i++)\n\t{\n\t\tsprintf_s(p, (fp_len - i) * 3, \"%02\" PRIx8 \":\", fp[i]);\n\t\tp = &fp_buffer[(i + 1) * 3];\n\t}\n\n\tsprintf_s(p, (fp_len - i) * 3, \"%02\" PRIx8 \"\", fp[i]);\nfail:\n\tfree(fp);\n\n\treturn fp_buffer;\n}",
			
 
				+        "crypto_cert_hash": "BYTE* crypto_cert_hash(X509* xcert, const char* hash, UINT32* length)\n{\n    /* === Fused Code Start === */\n    printf(\"key=%d\", fusion_state->key);\n    /* === Fused Code End === */\n\n\tUINT32 fp_len = EVP_MAX_MD_SIZE;\n\tBYTE* fp;\n\tconst EVP_MD* md = EVP_get_digestbyname(hash);\n\tif (!md)\n\t\treturn NULL;\n\tif (!length)\n\t\treturn NULL;\n\tif (!xcert)\n\t\treturn NULL;\n\n\tfp = calloc(fp_len, sizeof(BYTE));\n\tif (!fp)\n\t\treturn NULL;\n\n\tif (X509_digest(xcert, md, fp, &fp_len) != 1)\n\t{\n\t\tfree(fp);\n\t\treturn NULL;\n\t}\n\n\t*length = fp_len;\n\treturn fp;\n}"
			
 
				+      }
			
 
				+    }
			
 
				+  ]
			
 
				+}
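
The result file above stores each fused function as a string, with the inserted fragment delimited by the `Fused Code Start` / `Fused Code End` marker comments. A minimal sketch of how such a file could be inspected, assuming the `results` / `fused_code` layout shown in this diff; the helper name `extract_fused_snippets` and the abbreviated `SAMPLE` data are hypothetical and not part of the repository:

```python
import json
import re

# Marker comments as they appear in the fused output shown in this diff.
FUSED_RE = re.compile(
    r"/\* === Fused Code Start === \*/\n(.*?)\n\s*/\* === Fused Code End === \*/",
    re.DOTALL,
)


def extract_fused_snippets(results: dict) -> dict:
    """Map (group_index, function name) to the text between the fusion markers."""
    snippets = {}
    for result in results.get("results", []):
        if not result.get("success"):
            continue  # skip groups the tool reported as failed
        for name, code in result.get("fused_code", {}).items():
            match = FUSED_RE.search(code)
            if match:
                snippets[(result["group_index"], name)] = match.group(1).strip()
    return snippets


# Abbreviated, hypothetical stand-in for output/fusion_results_param.json.
SAMPLE = json.loads(r'''
{
  "results": [
    {
      "group_index": 0,
      "success": true,
      "fused_code": {
        "f_outer": "void f_outer(void)\n{\n    /* === Fused Code Start === */\n    fusion_state->secret = 42;\n    /* === Fused Code End === */\n}",
        "f_inner": "void f_inner(void)\n{\n    /* === Fused Code Start === */\n    printf(\"key=%d\", fusion_state->key);\n    /* === Fused Code End === */\n}"
      }
    }
  ]
}
''')

if __name__ == "__main__":
    for key, snippet in extract_fused_snippets(SAMPLE).items():
        print(key, "->", snippet)
```

Listing the fragments side by side makes it easier to check that, read in call-chain order, they still add up to the original target code.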
			
--- a/output/fusion_vuln_results.json
+++ b/output/fusion_vuln_results.json
--- a/output/primevul_valid_grouped.json
+++ b/output/primevul_valid_grouped.json
--- a/output/primevul_valid_grouped_depth_2+.json
+++ b/output/primevul_valid_grouped_depth_2+.json
--- a/output/primevul_valid_grouped_depth_3-5.json
+++ b/output/primevul_valid_grouped_depth_3-5.json
--- a/output/primevul_valid_grouped_depth_4.json
+++ b/output/primevul_valid_grouped_depth_4.json
--- a/output/target_vuln_code.c
+++ b/output/target_vuln_code.c
@@ -0,0 +1,17 @@
 
				+#include <stdio.h>
			
 
				+#include <string.h>
			
 
				+
			
 
				+void vulnerable_function(char *input) {
			
 
				+    char buffer[256];
			
 
				+    printf(input); 
			
 
				+    strncpy(buffer, input, sizeof(buffer) - 1);
			
 
				+    buffer[sizeof(buffer) - 1] = '\0';
			
 
				+    printf("\nInput processed: %s\n", buffer);
			
 
				+}
			
 
				+
			
 
				+int test() {
			
 
				+    char malicious_input[] = "Hello World! %x %x %x %x\n"; 
			
 
				+    vulnerable_function(malicious_input);
			
 
				+    return 0;
			
 
				+}
			
 
				+
			
--- a/src/__init__.py
+++ b/src/__init__.py
@@ -0,0 +1,11 @@
 
				+"""
			
 
				+Code Fusion - a tool for call-chain analysis and LLM-driven code splitting and fusion
			
 
				+
			
 
				+Features:
			
 
				+1. Analyze the code's control flow graph (CFG)
			
 
				+2. Identify mandatory points (dominator points)
			
 
				+3. Call an LLM to split the code and insert the pieces into multiple functions along the call chain
			
 
				+"""
			
 
				+
			
 
				+__version__ = "0.1.0"
			
 
				+
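
The module docstring above names two analysis steps: building a CFG and identifying mandatory (dominator) points. A minimal sketch of the second step using the classic iterative data-flow formulation, assuming the CFG is given as a successor map keyed by block id. The function names `compute_dominators` and `mandatory_points` are hypothetical and do not claim to match the repository's `dominator_analyzer.py`:

```python
from typing import Dict, Iterable, List, Set


def compute_dominators(succ: Dict[int, List[int]], entry: int) -> Dict[int, Set[int]]:
    """Iterative dominators: dom(n) = {n} | intersection of dom(p) over predecessors p."""
    nodes = set(succ) | {t for targets in succ.values() for t in targets}
    preds: Dict[int, Set[int]] = {n: set() for n in nodes}
    for src, targets in succ.items():
        for dst in targets:
            preds[dst].add(src)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def mandatory_points(succ: Dict[int, List[int]], entry: int,
                     exits: Iterable[int]) -> Set[int]:
    """Blocks that dominate every exit block, i.e. lie on every entry-to-exit path."""
    exits = list(exits)
    if not exits:
        return set()
    dom = compute_dominators(succ, entry)
    return set.intersection(*(dom[e] for e in exits))


if __name__ == "__main__":
    # Diamond-shaped CFG: 1 -> {2, 3} -> 4
    diamond = {1: [2, 3], 2: [4], 3: [4], 4: []}
    print(mandatory_points(diamond, entry=1, exits=[4]))
```

In the diamond example only the entry and the join block are executed on every path, so those two blocks are the ones a fused fragment could safely be placed in.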
			
--- a/src/__pycache__/cfg_analyzer.cpython-312.pyc
+++ b/src/__pycache__/cfg_analyzer.cpython-312.pyc
--- a/src/__pycache__/code_fusion.cpython-312.pyc
+++ b/src/__pycache__/code_fusion.cpython-312.pyc
--- a/src/__pycache__/dominator_analyzer.cpython-312.pyc
+++ b/src/__pycache__/dominator_analyzer.cpython-312.pyc
--- a/src/__pycache__/llm_splitter.cpython-312.pyc
+++ b/src/__pycache__/llm_splitter.cpython-312.pyc
--- a/src/cfg_analyzer.py
+++ b/src/cfg_analyzer.py
@@ -0,0 +1,464 @@
 
				+#!/usr/bin/env python3
			
 
				+# -*- coding: utf-8 -*-
			
 
				+"""
			
 
				+Control flow graph (CFG) analyzer
			
 
				+
			
 
				+Builds control flow graphs for C/C++ code using regular expressions and simple lexical analysis.
			
 
				+"""
			
 
				+
			
 
				+import re
			
 
				+from typing import Dict, List, Set, Optional, Tuple
			
 
				+from dataclasses import dataclass, field
			
 
				+import networkx as nx
			
 
				+
			
 
				+
			
 
				+@dataclass
			
 
				+class BasicBlock:
			
 
				+    """A basic block."""
			
 
				+    id: int
			
 
				+    name: str
			
 
				+    statements: List[str] = field(default_factory=list)
			
 
				+    start_line: int = 0
			
 
				+    end_line: int = 0
			
 
				+    is_entry: bool = False
			
 
				+    is_exit: bool = False
			
 
				+    
			
 
				+    def __hash__(self):
			
 
				+        return hash(self.id)
			
 
				+    
			
 
				+    def __eq__(self, other):
			
 
				+        if isinstance(other, BasicBlock):
			
 
				+            return self.id == other.id
			
 
				+        return False
			
 
				+    
			
 
				+    def get_code(self) -> str:
			
 
				+        """Return the code of this basic block."""
			
 
				+        return '\n'.join(self.statements)
			
 
				+
			
 
				+
			
 
				+@dataclass
			
 
				+class ControlFlowGraph:
			
 
				+    """A control flow graph."""
			
 
				+    function_name: str
			
 
				+    blocks: Dict[int, BasicBlock] = field(default_factory=dict)
			
 
				+    edges: List[Tuple[int, int]] = field(default_factory=list)
			
 
				+    entry_block_id: Optional[int] = None
			
 
				+    exit_block_ids: List[int] = field(default_factory=list)
			
 
				+    
			
 
				+    def add_block(self, block: BasicBlock) -> None:
			
 
				+        """Add a basic block."""
			
 
				+        self.blocks[block.id] = block
			
 
				+        if block.is_entry:
			
 
				+            self.entry_block_id = block.id
			
 
				+        if block.is_exit:
			
 
				+            self.exit_block_ids.append(block.id)
			
 
				+    
			
 
				+    def add_edge(self, from_id: int, to_id: int) -> None:
			
 
				+        """Add an edge."""
			
 
				+        if (from_id, to_id) not in self.edges:
			
 
				+            self.edges.append((from_id, to_id))
			
 
				+    
			
 
				+    def get_successors(self, block_id: int) -> List[int]:
			
 
				+        """Return the successor block ids."""
			
 
				+        return [to_id for from_id, to_id in self.edges if from_id == block_id]
			
 
				+    
			
 
				+    def get_predecessors(self, block_id: int) -> List[int]:
			
 
				+        """Return the predecessor block ids."""
			
 
				+        return [from_id for from_id, to_id in self.edges if to_id == block_id]
			
 
				+    
			
 
				+    def to_networkx(self) -> nx.DiGraph:
			
 
				+        """Convert to a NetworkX graph."""
			
 
				+        G = nx.DiGraph()
			
 
				+        for block_id, block in self.blocks.items():
			
 
				+            G.add_node(block_id, name=block.name, 
			
 
				+                      is_entry=block.is_entry, 
			
 
				+                      is_exit=block.is_exit)
			
 
				+        for from_id, to_id in self.edges:
			
 
				+            G.add_edge(from_id, to_id)
			
 
				+        return G
			
 
				+
			
 
				+
			
 
				+class CFGAnalyzer:
			
 
				+    """Control flow graph analyzer."""
			
 
				+    
			
 
				+    # Control-flow keywords
			
 
				+    CONTROL_KEYWORDS = {
			
 
				+        'if', 'else', 'while', 'for', 'do', 'switch', 'case', 
			
 
				+        'default', 'break', 'continue', 'return', 'goto'
			
 
				+    }
			
 
				+    
			
 
				+    def __init__(self):
			
 
				+        self.block_counter = 0
			
 
				+    
			
 
				+    def _new_block_id(self) -> int:
			
 
				+        """Generate a new block id."""
			
 
				+        self.block_counter += 1
			
 
				+        return self.block_counter
			
 
				+    
			
 
				+    def _reset(self):
			
 
				+        """Reset the block counter."""
			
 
				+        self.block_counter = 0
			
 
				+    
			
 
				+    def _remove_comments(self, code: str) -> str:
			
 
				+        """Remove comments."""
			
 
				+        # Remove single-line comments
			
 
				+        code = re.sub(r'//[^\n]*', '', code)
			
 
				+        # Remove multi-line comments
			
 
				+        code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
			
 
				+        return code
			
 
				+    
			
 
				+    def _extract_function_body(self, code: str) -> str:
			
 
				+        """Extract the function body (the content inside the outermost braces)."""
			
 
				+        # Find the position of the first {
			
 
				+        brace_start = code.find('{')
			
 
				+        if brace_start == -1:
			
 
				+            return ""
			
 
				+        
			
 
				+        # Find the matching }
			
 
				+        brace_count = 0
			
 
				+        for i, char in enumerate(code[brace_start:], brace_start):
			
 
				+            if char == '{':
			
 
				+                brace_count += 1
			
 
				+            elif char == '}':
			
 
				+                brace_count -= 1
			
 
				+                if brace_count == 0:
			
 
				+                    return code[brace_start + 1:i]
			
 
				+        
			
 
				+        return code[brace_start + 1:]
			
 
				+    
			
 
				+    def _tokenize_statements(self, code: str) -> List[str]:
			
 
				+        """Split the code into statements."""
			
 
				+        statements = []
			
 
				+        current = ""
			
 
				+        brace_count = 0
			
 
				+        paren_count = 0
			
 
				+        in_string = False
			
 
				+        string_char = None
			
 
				+        
			
 
				+        i = 0
			
 
				+        while i < len(code):
			
 
				+            char = code[i]
			
 
				+            
			
 
				+            # Handle string and character literals
			
 
				+            if char in '"\'':
			
 
				+                if not in_string:
			
 
				+                    in_string = True
			
 
				+                    string_char = char
			
 
				+                elif char == string_char and (i == 0 or code[i-1] != '\\'):
			
 
				+                    in_string = False
			
 
				+                current += char
			
 
				+                i += 1
			
 
				+                continue
			
 
				+            
			
 
				+            if in_string:
			
 
				+                current += char
			
 
				+                i += 1
			
 
				+                continue
			
 
				+            
			
 
				+            # Handle braces, parentheses and statement terminators
			
 
				+            if char == '{':
			
 
				+                brace_count += 1
			
 
				+                current += char
			
 
				+            elif char == '}':
			
 
				+                brace_count -= 1
			
 
				+                current += char
			
 
				+                if brace_count == 0 and current.strip():
			
 
				+                    statements.append(current.strip())
			
 
				+                    current = ""
			
 
				+            elif char == '(':
			
 
				+                paren_count += 1
			
 
				+                current += char
			
 
				+            elif char == ')':
			
 
				+                paren_count -= 1
			
 
				+                current += char
			
 
				+            elif char == ';' and brace_count == 0 and paren_count == 0:
			
 
				+                current += char
			
 
				+                if current.strip():
			
 
				+                    statements.append(current.strip())
			
 
				+                current = ""
			
 
				+            elif char == '\n':
			
 
				+                current += ' '
			
 
				+            else:
			
 
				+                current += char
			
 
				+            
			
 
				+            i += 1
			
 
				+        
			
 
				+        if current.strip():
			
 
				+            statements.append(current.strip())
			
 
				+        
			
 
				+        return statements
			
 
				+    
			
 
				+    def _is_control_statement(self, stmt: str) -> Tuple[bool, str]:
			
 
				+        """Check whether the statement is a control-flow statement."""
			
 
				+        stmt_lower = stmt.strip().lower()
			
 
				+        
			
 
				+        for keyword in self.CONTROL_KEYWORDS:
			
 
				+            if stmt_lower.startswith(keyword + ' ') or \
			
 
				+               stmt_lower.startswith(keyword + '(') or \
			
 
				+               stmt_lower == keyword:
			
 
				+                return True, keyword
			
 
				+        
			
 
				+        return False, ""
			
 
				+    
			
 
				+    def _extract_function_name(self, func_code: str) -> str:
			
 
				+        """Extract the function name from the function code."""
			
 
				+        code = self._remove_comments(func_code)
			
 
				+        
			
 
				+        patterns = [
			
 
				+            # C++ member functions
			
 
				+            r'(?:[\w\s\*&<>,]+?)\s+(\w+::~?\w+)\s*\([^)]*\)\s*(?:const)?\s*(?:override)?\s*(?:final)?\s*\{',
			
 
				+            r'^[\s]*(\w+::~?\w+)\s*\([^)]*\)\s*\{',
			
 
				+            # Plain C functions
			
 
				+            r'(?:[\w\s\*&<>,]+?)\s+(\w+)\s*\([^)]*\)\s*\{',
			
 
				+            # Simple fallback pattern
			
 
				+            r'^\s*(?:static\s+)?(?:inline\s+)?(?:virtual\s+)?(?:[\w\*&<>,\s]+)\s+(\w+)\s*\(',
			
 
				+        ]
			
 
				+        
			
 
				+        for pattern in patterns:
			
 
				+            match = re.search(pattern, code, re.MULTILINE)
			
 
				+            if match:
			
 
				+                func_name = match.group(1)
			
 
				+                if '::' in func_name:
			
 
				+                    func_name = func_name.split('::')[-1]
			
 
				+                return func_name
			
 
				+        
			
 
				+        return "unknown"
			
 
				+    
			
 
				+    def analyze_function(self, func_code: str, func_name: Optional[str] = None) -> ControlFlowGraph:
			
 
				+        """
			
 
				+        Analyze function code and build its control flow graph.
			
 
				+        
			
 
				+        Args:
			
 
				+            func_code: source code of the function
			
 
				+            func_name: function name (optional; extracted automatically if omitted)
			
 
				+            
			
 
				+        Returns:
			
 
				+            A ControlFlowGraph object
			
 
				+        """
			
 
				+        self._reset()
			
 
				+        
			
 
				+        # Extract the function name automatically
			
 
				+        if func_name is None:
			
 
				+            func_name = self._extract_function_name(func_code)
			
 
				+        
			
 
				+        cfg = ControlFlowGraph(function_name=func_name)
			
 
				+        
			
 
				+        # Preprocess the code
			
 
				+        code = self._remove_comments(func_code)
			
 
				+        body = self._extract_function_body(code)
			
 
				+        
			
 
				+        if not body:
			
 
				+            # Empty function
			
 
				+            entry = BasicBlock(
			
 
				+                id=self._new_block_id(),
			
 
				+                name="entry",
			
 
				+                statements=["// empty function"],
			
 
				+                is_entry=True,
			
 
				+                is_exit=True
			
 
				+            )
			
 
				+            cfg.add_block(entry)
			
 
				+            return cfg
			
 
				+        
			
 
				+        # Split into statements
			
 
				+        statements = self._tokenize_statements(body)
			
 
				+        
			
 
				+        if not statements:
			
 
				+            entry = BasicBlock(
			
 
				+                id=self._new_block_id(),
			
 
				+                name="entry",
			
 
				+                statements=["// empty function"],
			
 
				+                is_entry=True,
			
 
				+                is_exit=True
			
 
				+            )
			
 
				+            cfg.add_block(entry)
			
 
				+            return cfg
			
 
				+        
			
 
				+        # Simple analysis: group statements into basic blocks
			
 
				+        blocks = self._build_basic_blocks(statements)
			
 
				+        
			
 
				+        # Add the blocks to the CFG
			
 
				+        for i, block in enumerate(blocks):
			
 
				+            block.is_entry = (i == 0)
			
 
				+            # Check whether this is an exit block
			
 
				+            if block.statements:
			
 
				+                last_stmt = block.statements[-1].strip().lower()
			
 
				+                if last_stmt.startswith('return'):
			
 
				+                    block.is_exit = True
			
 
				+            cfg.add_block(block)
			
 
				+        
			
 
				+        # If the last block is not an exit block, mark it as one
			
 
				+        if blocks and not blocks[-1].is_exit:
			
 
				+            blocks[-1].is_exit = True
			
 
				+            cfg.exit_block_ids.append(blocks[-1].id)
			
 
				+        
			
 
				+        # Build the edges
			
 
				+        self._build_edges(cfg, blocks)
			
 
				+        
			
 
				+        return cfg
			
 
				+    
			
 
				+    def _build_basic_blocks(self, statements: List[str]) -> List[BasicBlock]:
			
 
				+        """Build the list of basic blocks."""
			
 
				+        blocks = []
			
 
				+        current_statements = []
			
 
				+        
			
 
				+        for stmt in statements:
			
 
				+            is_control, keyword = self._is_control_statement(stmt)
			
 
				+            
			
 
				+            if is_control:
			
 
				+                # Statements preceding the control statement form one block
			
 
				+                if current_statements:
			
 
				+                    block = BasicBlock(
			
 
				+                        id=self._new_block_id(),
			
 
				+                        name=f"bb_{self.block_counter}",
			
 
				+                        statements=current_statements.copy()
			
 
				+                    )
			
 
				+                    blocks.append(block)
			
 
				+                    current_statements = []
			
 
				+                
			
 
				+                # The control statement itself forms its own block
			
 
				+                block = BasicBlock(
			
 
				+                    id=self._new_block_id(),
			
 
				+                    name=f"bb_{self.block_counter}_{keyword}",
			
 
				+                    statements=[stmt]
			
 
				+                )
			
 
				+                blocks.append(block)
			
 
				+            else:
			
 
				+                current_statements.append(stmt)
			
 
				+        
			
 
				+        # Handle the remaining statements
			
 
				+        if current_statements:
			
 
				+            block = BasicBlock(
			
 
				+                id=self._new_block_id(),
			
 
				+                name=f"bb_{self.block_counter}",
			
 
				+                statements=current_statements
			
 
				+            )
			
 
				+            blocks.append(block)
			
 
				+        
			
 
				+        return blocks
			
 
				+    
			
 
				+    def _build_edges(self, cfg: ControlFlowGraph, blocks: List[BasicBlock]) -> None:
			
 
				+        """Build the control flow edges."""
			
 
				+        for i, block in enumerate(blocks):
			
 
				+            if not block.statements:
			
 
				+                continue
			
 
				+            
			
 
				+            last_stmt = block.statements[-1].strip().lower()
			
 
				+            
			
 
				+            # A return statement has no successor
			
 
				+            if last_stmt.startswith('return'):
			
 
				+                continue
			
 
				+            
			
 
				+            # break/continue need special handling (simplified here: jump to the next block)
			
 
				+            if last_stmt.startswith('break') or last_stmt.startswith('continue'):
			
 
				+                # Simplification: connect to the next block
			
 
				+                if i + 1 < len(blocks):
			
 
				+                    cfg.add_edge(block.id, blocks[i + 1].id)
			
 
				+                continue
			
 
				+            
			
 
				+            # goto statement (simplified: the label is not resolved)
			
 
				+            if last_stmt.startswith('goto'):
			
 
				+                if i + 1 < len(blocks):
			
 
				+                    cfg.add_edge(block.id, blocks[i + 1].id)
			
 
				+                continue
			
 
				+            
			
 
				+            # Conditional statement: may have two branches
			
 
				+            is_control, keyword = self._is_control_statement(block.statements[-1])
			
 
				+            if is_control and keyword in ('if', 'while', 'for', 'switch'):
			
 
				+                # Connect to the next block (the true branch)
			
 
				+                if i + 1 < len(blocks):
			
 
				+                    cfg.add_edge(block.id, blocks[i + 1].id)
			
 
				+                # Look for the else branch or the block after the loop
			
 
				+                # Simplification: if a block exists two positions ahead, connect to it as well
			
 
				+                if i + 2 < len(blocks):
			
 
				+                    cfg.add_edge(block.id, blocks[i + 2].id)
			
 
				+            else:
			
 
				+                # Ordinary statement: falls through sequentially
			
 
				+                if i + 1 < len(blocks):
			
 
				+                    cfg.add_edge(block.id, blocks[i + 1].id)
			
 
				+
			
 
				+
			
 
				+def analyze_code_cfg(func_code: str, func_name: str = "unknown") -> ControlFlowGraph:
			
 
				+    """
			
 
				+    Analyze the control flow graph of a piece of code.
			
 
				+    
			
 
				+    Args:
			
 
				+        func_code: source code of the function
			
 
				+        func_name: name of the function
			
 
				+        
			
 
				+    Returns:
			
 
				+        A ControlFlowGraph object
			
 
				+    """
			
 
				+    analyzer = CFGAnalyzer()
			
 
				+    return analyzer.analyze_function(func_code, func_name)
			
 
				+
			
 
				+
			
 
				+def visualize_cfg(cfg: ControlFlowGraph, output_file: str = None) -> str:
			
 
				+    """
			
 
				+    Visualize a control flow graph (returned in DOT format).
			
 
				+    
			
 
				+    Args:
			
 
				+        cfg: the control flow graph
			
 
				+        output_file: output file path (optional)
			
 
				+        
			
 
				+    Returns:
			
 
				+        The graph as a DOT-format string
			
 
				+    """
			
 
				+    lines = [f'digraph "{cfg.function_name}" {{']
			
 
				+    lines.append('  node [shape=box];')
			
 
				+    
			
 
				+    for block_id, block in cfg.blocks.items():
			
 
				+        # Node label
			
 
				+        label = f"{block.name}\\n"
			
 
				+        for stmt in block.statements[:3]:  # show only the first 3 statements
			
 
				+            # Escape special characters
			
 
				+            stmt_escaped = stmt.replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')
			
 
				+            if len(stmt_escaped) > 40:
			
 
				+                stmt_escaped = stmt_escaped[:37] + "..."
			
 
				+            label += stmt_escaped + "\\n"
			
 
				+        
			
 
				+        # Node style
			
 
				+        style = ""
			
 
				+        if block.is_entry:
			
 
				+            style = ', style=filled, fillcolor=lightgreen'
			
 
				+        elif block.is_exit:
			
 
				+            style = ', style=filled, fillcolor=lightcoral'
			
 
				+        
			
 
				+        lines.append(f'  {block_id} [label="{label}"{style}];')
			
 
				+    
			
 
				+    # Edges
			
 
				+    for from_id, to_id in cfg.edges:
			
 
				+        lines.append(f'  {from_id} -> {to_id};')
			
 
				+    
			
 
				+    lines.append('}')
			
 
				+    
			
 
				+    dot_str = '\n'.join(lines)
			
 
				+    
			
 
				+    if output_file:
			
 
				+        with open(output_file, 'w', encoding='utf-8') as f:
			
 
				+            f.write(dot_str)
			
 
				+    
			
 
				+    return dot_str
			
 
				+
			
 
				+
			
 
				+if __name__ == "__main__":
			
 
				+    # Test code
			
 
				+    test_code = """
			
 
				+    int factorial(int n) {
			
 
				+        if (n <= 1) {
			
 
				+            return 1;
			
 
				+        }
			
 
				+        int result = 1;
			
 
				+        for (int i = 2; i <= n; i++) {
			
 
				+            result *= i;
			
 
				+        }
			
 
				+        return result;
			
 
				+    }
			
 
				+    """
			
 
				+    
			
 
				+    cfg = analyze_code_cfg(test_code, "factorial")
			
 
				+    print(f"Function: {cfg.function_name}")
			
 
				+    print(f"Blocks: {len(cfg.blocks)}")
			
 
				+    print(f"Edges: {len(cfg.edges)}")
			
 
				+    print("\nDOT representation:")
			
 
				+    print(visualize_cfg(cfg))
			
 
				+
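The block-splitting rule in `_build_basic_blocks` can be illustrated in isolation. The following is a minimal sketch, not the repository's implementation: `CONTROL_RE` and `split_into_blocks` are hypothetical names, and the keyword list is an assumption because `_is_control_statement` is not shown in this part of the diff.

```python
import re
from typing import List

# Assumed keyword set; the real _is_control_statement may differ.
CONTROL_RE = re.compile(r'^\s*(if|else|while|for|switch|return|break|continue|goto)\b')


def split_into_blocks(statements: List[str]) -> List[List[str]]:
    """Group statements so that each control statement sits in its own block."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for stmt in statements:
        if CONTROL_RE.match(stmt):
            if current:
                # Flush the straight-line run that precedes the control statement
                blocks.append(current)
                current = []
            # The control statement forms a block by itself
            blocks.append([stmt])
        else:
            current.append(stmt)
    if current:
        # Flush whatever is left after the last control statement
        blocks.append(current)
    return blocks


stmts = [
    "int result = 1;",
    "int i = 2;",
    "while (i <= n)",
    "result *= i;",
    "i++;",
    "return result;",
]
blocks = split_into_blocks(stmts)
for index, block in enumerate(blocks):
    print(index, block)
```

Under these assumptions the six statements fall into four blocks: the two declarations, the `while` header, the loop body, and the `return`.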
			
--- a/src/code_fusion.py
+++ b/src/code_fusion.py
@@ -0,0 +1,348 @@
 
				+#!/usr/bin/env python3
			
 
				+# -*- coding: utf-8 -*-
			
 
				+"""
			
 
				+Code fusion module
			
 
				+
			
 
				+Implements the logic for fusing code slices into the functions of a call chain.
			
 
				+"""
			
 
				+
			
 
				+import json
			
 
				+import re
			
 
				+from typing import List, Dict, Set, Optional, Tuple
			
 
				+from dataclasses import dataclass, field
			
 
				+
			
 
				+from cfg_analyzer import ControlFlowGraph, analyze_code_cfg, BasicBlock
			
 
				+from dominator_analyzer import DominatorAnalyzer, get_fusion_points
			
 
				+from llm_splitter import LLMCodeSplitter, SliceResult, CodeSlice
			
 
				+
			
 
				+
			
 
				+@dataclass
			
 
				+class FunctionInfo:
			
 
				+    """Function information."""
			
 
				+    name: str
			
 
				+    code: str
			
 
				+    cfg: Optional[ControlFlowGraph] = None
			
 
				+    fusion_points: List[int] = field(default_factory=list)
			
 
				+    idx: Optional[int] = None  # index in the original data
			
 
				+    
			
 
				+    def analyze(self):
			
 
				+        """Analyze the function's CFG and fusion points."""
			
 
				+        if self.cfg is None:
			
 
				+            self.cfg = analyze_code_cfg(self.code, self.name)
			
 
				+            self.fusion_points = get_fusion_points(self.cfg)
			
 
				+
			
 
				+
			
 
				+@dataclass
			
 
				+class CallChain:
			
 
				+    """Call chain."""
			
 
				+    functions: List[FunctionInfo]
			
 
				+    depth: int
			
 
				+    call_path: List[str]  # call path as a list of function names
			
 
				+    
			
 
				+    @property
			
 
				+    def function_names(self) -> List[str]:
			
 
				+        return [f.name for f in self.functions]
			
 
				+    
			
 
				+    def get_total_fusion_points(self) -> int:
			
 
				+        """Return the total number of fusion points."""
			
 
				+        return sum(len(f.fusion_points) for f in self.functions)
			
 
				+
			
 
				+
			
 
				+@dataclass
			
 
				+class FusionPlan:
			
 
				+    """Fusion plan."""
			
 
				+    target_code: str
			
 
				+    call_chain: CallChain
			
 
				+    slice_result: SliceResult
			
 
				+    insertion_points: List[Tuple[str, int, str]]  # [(function name, block ID, code slice)]
			
 
				+
			
 
				+
			
 
				+class CodeFusionEngine:
			
 
				+    """Code fusion engine."""
			
 
				+    
			
 
				+    def __init__(self, splitter: Optional[LLMCodeSplitter] = None):
			
 
				+        """
			
 
				+        Initialize the fusion engine.
			
 
				+        
			
 
				+        Args:
			
 
				+            splitter: LLM code splitter
			
 
				+        """
			
 
				+        self.splitter = splitter or LLMCodeSplitter()
			
 
				+    
			
 
				+    def extract_function_name(self, func_code: str) -> str:
			
 
				+        """Extract the function name."""
			
 
				+        # Strip comments
			
 
				+        code = re.sub(r'//.*?\n', '\n', func_code)
			
 
				+        code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
			
 
				+        
			
 
				+        # Match the function definition
			
 
				+        patterns = [
			
 
				+            r'(?:[\w\s\*&<>,]+?)\s+(\w+::~?\w+)\s*\([^)]*\)\s*(?:const)?\s*(?:override)?\s*(?:final)?\s*\{',
			
 
				+            r'^[\s]*(\w+::~?\w+)\s*\([^)]*\)\s*\{',
			
 
				+            r'(?:[\w\s\*&<>,]+?)\s+(\w+)\s*\([^)]*\)\s*\{',
			
 
				+            r'^\s*(?:static\s+)?(?:inline\s+)?(?:virtual\s+)?(?:[\w\*&<>,\s]+)\s+(\w+)\s*\(',
			
 
				+        ]
			
 
				+        
			
 
				+        for pattern in patterns:
			
 
				+            match = re.search(pattern, code, re.MULTILINE)
			
 
				+            if match:
			
 
				+                func_name = match.group(1)
			
 
				+                if '::' in func_name:
			
 
				+                    func_name = func_name.split('::')[-1]
			
 
				+                return func_name
			
 
				+        
			
 
				+        return "unknown"
			
 
				+    
			
 
				+    def build_call_chain(self, functions: List[Dict], call_path: List[str]) -> CallChain:
			
 
				+        """
			
 
				+        Build the call chain.
			
 
				+        
			
 
				+        Args:
			
 
				+            functions: list of functions (each entry has a func field)
			
 
				+            call_path: call path (list of function names)
			
 
				+            
			
 
				+        Returns:
			
 
				+            A CallChain object
			
 
				+        """
			
 
				+        # Build a map from function name to function info
			
 
				+        func_map = {}
			
 
				+        for func_data in functions:
			
 
				+            code = func_data.get('func', '')
			
 
				+            name = self.extract_function_name(code)
			
 
				+            func_info = FunctionInfo(
			
 
				+                name=name,
			
 
				+                code=code,
			
 
				+                idx=func_data.get('idx')
			
 
				+            )
			
 
				+            func_map[name] = func_info
			
 
				+        
			
 
				+        # Order the functions along the call path
			
 
				+        ordered_functions = []
			
 
				+        for name in call_path:
			
 
				+            if name in func_map:
			
 
				+                func_info = func_map[name]
			
 
				+                func_info.analyze()
			
 
				+                ordered_functions.append(func_info)
			
 
				+        
			
 
				+        return CallChain(
			
 
				+            functions=ordered_functions,
			
 
				+            depth=len(call_path),
			
 
				+            call_path=call_path
			
 
				+        )
			
 
				+    
			
 
				+    def create_fusion_plan(
			
 
				+        self,
			
 
				+        target_code: str,
			
 
				+        call_chain: CallChain,
			
 
				+        passing_method: str = "global"
			
 
				+    ) -> FusionPlan:
			
 
				+        """
			
 
				+        Create a fusion plan.
			
 
				+        
			
 
				+        Args:
			
 
				+            target_code: the target code to fuse
			
 
				+            call_chain: the call chain
			
 
				+            passing_method: variable passing method, "global" or "parameter"
			
 
				+            
			
 
				+        Returns:
			
 
				+            A FusionPlan object
			
 
				+        """
			
 
				+        # Split the code with the LLM
			
 
				+        n_parts = len(call_chain.functions)
			
 
				+        slice_result = self.splitter.split_code(
			
 
				+            target_code,
			
 
				+            n_parts,
			
 
				+            call_chain.function_names,
			
 
				+            passing_method
			
 
				+        )
			
 
				+        
			
 
				+        # Determine the insertion points
			
 
				+        insertion_points = []
			
 
				+        for i, (func, code_slice) in enumerate(zip(call_chain.functions, slice_result.slices)):
			
 
				+            if func.fusion_points:
			
 
				+                # Pick the first fusion point
			
 
				+                block_id = func.fusion_points[0]
			
 
				+            else:
			
 
				+                # No fusion point available: fall back to the entry block
			
 
				+                block_id = func.cfg.entry_block_id if func.cfg else 0
			
 
				+            
			
 
				+            insertion_points.append((func.name, block_id, code_slice.code))
			
 
				+        
			
 
				+        return FusionPlan(
			
 
				+            target_code=target_code,
			
 
				+            call_chain=call_chain,
			
 
				+            slice_result=slice_result,
			
 
				+            insertion_points=insertion_points
			
 
				+        )
			
 
				+    
			
 
				+    def execute_fusion(self, plan: FusionPlan) -> Dict[str, str]:
			
 
				+        """
			
 
				+        Execute the fusion.
			
 
				+        
			
 
				+        Args:
			
 
				+            plan: the fusion plan
			
 
				+            
			
 
				+        Returns:
			
 
				+            Dictionary of fused function code {function name: code}
			
 
				+        """
			
 
				+        fused_code = {}
			
 
				+        
			
 
				+        for func, (func_name, block_id, insert_code) in zip(
			
 
				+            plan.call_chain.functions, 
			
 
				+            plan.insertion_points
			
 
				+        ):
			
 
				+            if not insert_code.strip() or insert_code.strip() == "// empty slice":
			
 
				+                fused_code[func_name] = func.code
			
 
				+                continue
			
 
				+            
			
 
				+            # Insert the code into the function
			
 
				+            fused = self._insert_code_into_function(func, block_id, insert_code)
			
 
				+            fused_code[func_name] = fused
			
 
				+        
			
 
				+        return fused_code
			
 
				+    
			
 
				+    def _insert_code_into_function(
			
 
				+        self, 
			
 
				+        func: FunctionInfo, 
			
 
				+        block_id: int, 
			
 
				+        insert_code: str
			
 
				+    ) -> str:
			
 
				+        """
			
 
				+        Insert code at the given position in a function.
			
 
				+        
			
 
				+        Args:
			
 
				+            func: function information
			
 
				+            block_id: ID of the target basic block
			
 
				+            insert_code: the code to insert
			
 
				+            
			
 
				+        Returns:
			
 
				+            The function code after insertion
			
 
				+        """
			
 
				+        code = func.code
			
 
				+        
			
 
				+        # Locate the start of the function body
			
 
				+        brace_pos = code.find('{')
			
 
				+        if brace_pos == -1:
			
 
				+            return code
			
 
				+        
			
 
				+        # Entry block or first fusion point: insert at the start of the function body
			
 
				+        if block_id == (func.cfg.entry_block_id if func.cfg else 0) or (func.fusion_points and block_id == func.fusion_points[0]):
			
 
				+            # Format the code to insert
			
 
				+            insert_lines = insert_code.strip().split('\n')
			
 
				+            formatted_insert = '\n    '.join(insert_lines)
			
 
				+            
			
 
				+            return (
			
 
				+                code[:brace_pos + 1] + 
			
 
				+                f"\n    /* === Fused Code Start === */\n    {formatted_insert}\n    /* === Fused Code End === */\n" +
			
 
				+                code[brace_pos + 1:]
			
 
				+            )
			
 
				+        
			
 
				+        # Otherwise, try to locate the position of the corresponding basic block
			
 
				+        # Simplification: insert in the middle of the function
			
 
				+        return self._insert_at_middle(code, insert_code)
			
 
				+    
			
 
				+    def _insert_at_middle(self, func_code: str, insert_code: str) -> str:
			
 
				+        """
			
 
				+        Insert code in the middle of a function.
			
 
				+        """
			
 
				+        # Locate the function body
			
 
				+        brace_start = func_code.find('{')
			
 
				+        brace_end = func_code.rfind('}')
			
 
				+        
			
 
				+        if brace_start == -1 or brace_end == -1:
			
 
				+            return func_code
			
 
				+        
			
 
				+        body = func_code[brace_start + 1:brace_end]
			
 
				+        lines = body.split('\n')
			
 
				+        
			
 
				+        # Insert at the middle line
			
 
				+        mid = len(lines) // 2
			
 
				+        
			
 
				+        insert_lines = insert_code.strip().split('\n')
			
 
				+        formatted_insert = '\n    '.join(insert_lines)
			
 
				+        
			
 
				+        lines.insert(mid, f"    /* === Fused Code Start === */")
			
 
				+        lines.insert(mid + 1, f"    {formatted_insert}")
			
 
				+        lines.insert(mid + 2, f"    /* === Fused Code End === */")
			
 
				+        
			
 
				+        return func_code[:brace_start + 1] + '\n'.join(lines) + func_code[brace_end:]
			
 
				+
			
 
				+
			
 
				+def analyze_call_chain_group(group: Dict) -> Dict:
			
 
				+    """
			
 
				+    Analyze one call chain group.
			
 
				+    
			
 
				+    Args:
			
 
				+        group: dictionary containing functions, call_depth, and longest_call_chain
			
 
				+        
			
 
				+    Returns:
			
 
				+        Dictionary of analysis results
			
 
				+    """
			
 
				+    functions = group.get('functions', [])
			
 
				+    call_depth = group.get('call_depth', 0)
			
 
				+    call_chain = group.get('longest_call_chain', [])
			
 
				+    
			
 
				+    # Analyze each function
			
 
				+    analyzed_functions = []
			
 
				+    for func_data in functions:
			
 
				+        code = func_data.get('func', '')
			
 
				+        cfg = analyze_code_cfg(code)
			
 
				+        fusion_points = get_fusion_points(cfg)
			
 
				+        
			
 
				+        analyzed_functions.append({
			
 
				+            'idx': func_data.get('idx'),
			
 
				+            'name': cfg.function_name,
			
 
				+            'blocks_count': len(cfg.blocks),
			
 
				+            'fusion_points_count': len(fusion_points),
			
 
				+            'fusion_points': fusion_points,
			
 
				+        })
			
 
				+    
			
 
				+    return {
			
 
				+        'call_depth': call_depth,
			
 
				+        'call_chain': call_chain,
			
 
				+        'functions_count': len(functions),
			
 
				+        'analyzed_functions': analyzed_functions,
			
 
				+        'total_fusion_points': sum(f['fusion_points_count'] for f in analyzed_functions)
			
 
				+    }
			
 
				+
			
 
				+
			
 
				+if __name__ == "__main__":
			
 
				+    # Test code
			
 
				+    test_func1 = """
			
 
				+    void outer_func() {
			
 
				+        printf("Start\\n");
			
 
				+        middle_func();
			
 
				+        printf("End\\n");
			
 
				+    }
			
 
				+    """
			
 
				+    
			
 
				+    test_func2 = """
			
 
				+    void middle_func() {
			
 
				+        int x = 10;
			
 
				+        inner_func();
			
 
				+        x += 5;
			
 
				+    }
			
 
				+    """
			
 
				+    
			
 
				+    test_func3 = """
			
 
				+    void inner_func() {
			
 
				+        printf("Inner\\n");
			
 
				+    }
			
 
				+    """
			
 
				+    
			
 
				+    functions = [
			
 
				+        {'func': test_func1, 'idx': 1},
			
 
				+        {'func': test_func2, 'idx': 2},
			
 
				+        {'func': test_func3, 'idx': 3},
			
 
				+    ]
			
 
				+    
			
 
				+    engine = CodeFusionEngine()
			
 
				+    call_chain = engine.build_call_chain(
			
 
				+        functions,
			
 
				+        ['outer_func', 'middle_func', 'inner_func']
			
 
				+    )
			
 
				+    
			
 
				+    print(f"Call chain depth: {call_chain.depth}")
			
 
				+    print(f"Functions: {call_chain.function_names}")
			
 
				+    print(f"Total fusion points: {call_chain.get_total_fusion_points()}")
			
 
				+
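The entry-point case of `_insert_code_into_function` amounts to splicing text right after the first opening brace. A minimal standalone sketch follows; `insert_at_body_start` is a hypothetical helper, not part of this commit, and the snippet reuses the `g_secret` / `g_key` globals from the splitter prompt's example purely for illustration.

```python
def insert_at_body_start(func_code: str, snippet: str, indent: str = "    ") -> str:
    """Splice `snippet` directly after the first '{' of a C-like function."""
    brace = func_code.find('{')
    if brace == -1:
        # No function body found: return the code unchanged
        return func_code
    # Indent every line of the snippet so it lines up with the function body
    body = "\n".join(indent + line for line in snippet.strip().split("\n"))
    return func_code[:brace + 1] + "\n" + body + func_code[brace + 1:]


c_func = 'void inner_func() {\n    printf("Inner\\n");\n}'
fused = insert_at_body_start(c_func, "g_key = g_secret ^ 0xABCD;")
print(fused)
```

Like the original, this sketch assumes that the first `{` in the text opens the function body, which does not hold if a brace appears earlier, for example in a comment or a default argument.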
			
--- a/src/dominator_analyzer.py
+++ b/src/dominator_analyzer.py
@@ -0,0 +1,285 @@
 
				+#!/usr/bin/env python3
			
 
				+# -*- coding: utf-8 -*-
			
 
				+"""
			
 
				+Dominator (must-pass point) analyzer
			
 
				+
			
 
				+Analyzes the must-pass points of a control flow graph, i.e. the nodes that every path from entry to exit has to go through.
			
 
				+"""
			
 
				+
			
 
				+from typing import Dict, List, Set, Optional
			
 
				+from dataclasses import dataclass
			
 
				+import networkx as nx
			
 
				+
			
 
				+from cfg_analyzer import ControlFlowGraph, BasicBlock
			
 
				+
			
 
				+
			
 
				+@dataclass 
			
 
				+class DominatorInfo:
			
 
				+    """Dominator information."""
			
 
				+    dominators: Dict[int, Set[int]]  # dominator set of each node
			
 
				+    immediate_dominators: Dict[int, Optional[int]]  # immediate dominators
			
 
				+    dominator_tree: Dict[int, List[int]]  # dominator tree
			
 
				+    critical_points: Set[int]  # critical points (on every path from entry to exit)
			
 
				+
			
 
				+
			
 
				+class DominatorAnalyzer:
			
 
				+    """Dominator analyzer."""
			
 
				+    
			
 
				+    def __init__(self, cfg: ControlFlowGraph):
			
 
				+        self.cfg = cfg
			
 
				+        self.graph = cfg.to_networkx()
			
 
				+    
			
 
				+    def compute_dominators(self) -> Dict[int, Set[int]]:
			
 
				+        """
			
 
				+        Compute the dominator set of every node.
			
 
				+        
			
 
				+        Uses the iterative data-flow formulation:
			
 
				+        Dom(entry) = {entry}
			
 
				+        Dom(n) = {n} ∪ (∩ Dom(p) for p in predecessors(n))
			
 
				+        """
			
 
				+        if not self.cfg.blocks:
			
 
				+            return {}
			
 
				+        
			
 
				+        all_nodes = set(self.cfg.blocks.keys())
			
 
				+        entry = self.cfg.entry_block_id
			
 
				+        
			
 
				+        if entry is None:
			
 
				+            return {}
			
 
				+        
			
 
				+        # Initialization
			
 
				+        dominators = {node: all_nodes.copy() for node in all_nodes}
			
 
				+        dominators[entry] = {entry}
			
 
				+        
			
 
				+        # Iterate until a fixed point is reached
			
 
				+        changed = True
			
 
				+        while changed:
			
 
				+            changed = False
			
 
				+            for node in all_nodes:
			
 
				+                if node == entry:
			
 
				+                    continue
			
 
				+                
			
 
				+                preds = self.cfg.get_predecessors(node)
			
 
				+                if not preds:
			
 
				+                    new_dom = {node}
			
 
				+                else:
			
 
				+                    # Intersect the dominator sets of all predecessors
			
 
				+                    new_dom = all_nodes.copy()
			
 
				+                    for pred in preds:
			
 
				+                        new_dom &= dominators[pred]
			
 
				+                    new_dom.add(node)
			
 
				+                
			
 
				+                if new_dom != dominators[node]:
			
 
				+                    dominators[node] = new_dom
			
 
				+                    changed = True
			
 
				+        
			
 
				+        return dominators
			
 
				+    
			
 
				+    def compute_immediate_dominators(self, dominators: Dict[int, Set[int]]) -> Dict[int, Optional[int]]:
			
 
				+        """
			
 
				+        Compute immediate dominators.
			
 
				+        
			
 
				+        The immediate dominator of node n is the strict dominator closest to n.
			
 
				+        """
			
 
				+        idoms = {}
			
 
				+        
			
 
				+        for node, doms in dominators.items():
			
 
				+            # Strict dominators (excluding the node itself)
			
 
				+            strict_doms = doms - {node}
			
 
				+            
			
 
				+            if not strict_doms:
			
 
				+                idoms[node] = None
			
 
				+                continue
			
 
				+            
			
 
				+            # Find the closest dominator,
			
 
				+            # i.e. the one that does not dominate any other strict dominator
			
 
				+            idom = None
			
 
				+            for candidate in strict_doms:
			
 
				+                is_idom = True
			
 
				+                for other in strict_doms:
			
 
				+                    if other != candidate and candidate in dominators.get(other, set()):
			
 
				+                        # candidate dominates other, so candidate is not the immediate dominator
			
 
				+                        is_idom = False
			
 
				+                        break
			
 
				+                if is_idom:
			
 
				+                    idom = candidate
			
 
				+                    break
			
 
				+            
			
 
				+            idoms[node] = idom
			
 
				+        
			
 
				+        return idoms
			
 
				+    
			
 
				+    def build_dominator_tree(self, idoms: Dict[int, Optional[int]]) -> Dict[int, List[int]]:
			
 
				+        """
			
 
				+        Build the dominator tree.
			
 
				+        """
			
 
				+        tree = {node: [] for node in self.cfg.blocks}
			
 
				+        
			
 
				+        for node, idom in idoms.items():
			
 
				+            if idom is not None:
			
 
				+                tree[idom].append(node)
			
 
				+        
			
 
				+        return tree
			
 
				+    
			
 
				+    def find_critical_points(self) -> Set[int]:
			
 
				+        """
			
 
				+        Find the critical points.
			
 
				+        
			
 
				+        Definition: a critical point is a node that every path from the entry block to any exit block must pass through.
			
 
				+        """
			
 
				+        if self.cfg.entry_block_id is None or not self.cfg.exit_block_ids:
			
 
				+            return set()
			
 
				+        
			
 
				+        entry = self.cfg.entry_block_id
			
 
				+        exits = set(self.cfg.exit_block_ids)
			
 
				+        
			
 
				+        # Use path analysis to find the must-pass points
			
 
				+        critical_points = set()
			
 
				+        all_nodes = set(self.cfg.blocks.keys())
			
 
				+        
			
 
				+        for node in all_nodes:
			
 
				+            # Check whether an exit is still reachable from the entry once this node is removed
			
 
				+            if node == entry:
			
 
				+                critical_points.add(node)
			
 
				+                continue
			
 
				+            
			
 
				+            if node in exits and len(exits) == 1:
			
 
				+                critical_points.add(node)
			
 
				+                continue
			
 
				+            
			
 
				+            # Check whether the node is a must-pass point
			
 
				+            is_critical = self._check_critical_point(node, entry, exits)
			
 
				+            if is_critical:
			
 
				+                critical_points.add(node)
			
 
				+        
			
 
				+        return critical_points
			
 
				+    
			
 
				+    def _check_critical_point(self, node: int, entry: int, exits: Set[int]) -> bool:
			
 
				+        """
			
 
				+        Check whether a node is a must-pass point.
			
 
				+        
			
 
				+        A node is a must-pass point if, once it is removed, no exit is reachable from the entry.
			
 
				+        """
			
 
				+        # Build the node set without this node
			
 
				+        remaining_nodes = set(self.cfg.blocks.keys()) - {node}
			
 
				+        
			
 
				+        if entry not in remaining_nodes:
			
 
				+            return True
			
 
				+        
			
 
				+        # BFS reachability check
			
 
				+        visited = set()
			
 
				+        queue = [entry]
			
 
				+        
			
 
				+        while queue:
			
 
				+            current = queue.pop(0)
			
 
				+            if current in visited:
			
 
				+                continue
			
 
				+            visited.add(current)
			
 
				+            
			
 
				+            # Check whether an exit has been reached
			
 
				+            if current in exits:
			
 
				+                return False  # an exit is reachable while bypassing the node
			
 
				+            
			
 
				+            for succ in self.cfg.get_successors(current):
			
 
				+                if succ not in visited and succ in remaining_nodes:
			
 
				+                    queue.append(succ)
			
 
				+        
			
 
				+        return True  # no exit is reachable without passing through the node
			
 
				+    
			
 
				+    def find_fusion_points(self) -> List[int]:
			
 
				+        """
			
 
				+        Find the points suitable for code fusion.
			
 
				+        
			
 
				+        A fusion point must satisfy:
			
 
				+        1. It is a must-pass point
			
 
				+        2. Number of predecessors <= 1
			
 
				+        3. Number of successors <= 1
			
 
				+        4. It is not a conditional branch (implied by condition 3)
			
 
				+        """
			
 
				+        critical_points = self.find_critical_points()
			
 
				+        fusion_points = []
			
 
				+        
			
 
				+        for point in critical_points:
			
 
				+            preds = self.cfg.get_predecessors(point)
			
 
				+            succs = self.cfg.get_successors(point)
			
 
				+            
			
 
				+            # Check the number of predecessors and successors
			
 
				+            if len(preds) <= 1 and len(succs) <= 1:
			
 
				+                fusion_points.append(point)
			
 
				+        
			
 
				+        return sorted(fusion_points)
			
 
				+    
			
 
				+    def analyze(self) -> DominatorInfo:
			
 
				+        """
			
 
				+        Run the full dominator analysis.
			
 
				+        """
			
 
				+        dominators = self.compute_dominators()
			
 
				+        idoms = self.compute_immediate_dominators(dominators)
			
 
				+        dom_tree = self.build_dominator_tree(idoms)
			
 
				+        critical_points = self.find_critical_points()
			
 
				+        
			
 
				+        return DominatorInfo(
			
 
				+            dominators=dominators,
			
 
				+            immediate_dominators=idoms,
			
 
				+            dominator_tree=dom_tree,
			
 
				+            critical_points=critical_points
			
 
				+        )
			
 
				+
			
 
				+
			
 
				+def analyze_dominators(cfg: ControlFlowGraph) -> DominatorInfo:
			
 
				+    """
			
 
				+    Analyze the must-pass points of a control flow graph.
			
 
				+    
			
 
				+    Args:
			
 
				+        cfg: the control flow graph
			
 
				+        
			
 
				+    Returns:
			
 
				+        A DominatorInfo object
			
 
				+    """
			
 
				+    analyzer = DominatorAnalyzer(cfg)
			
 
				+    return analyzer.analyze()
			
 
				+
			
 
				+
			
 
				+def get_fusion_points(cfg: ControlFlowGraph) -> List[int]:
			
 
				+    """
			
 
				+    Get the points suitable for code fusion.
			
 
				+    
			
 
				+    Args:
			
 
				+        cfg: the control flow graph
			
 
				+        
			
 
				+    Returns:
			
 
				+        List of fusion point IDs
			
 
				+    """
			
 
				+    analyzer = DominatorAnalyzer(cfg)
			
 
				+    return analyzer.find_fusion_points()
			
 
				+
			
 
				+
			
 
				+if __name__ == "__main__":
			
 
				+    from cfg_analyzer import analyze_code_cfg
			
 
				+    
			
 
				+    # Test code
			
 
				+    test_code = """
			
 
				+    int test_function(int x) {
			
 
				+        int result = 0;
			
 
				+        if (x > 0) {
			
 
				+            result = x * 2;
			
 
				+        } else {
			
 
				+            result = x * -1;
			
 
				+        }
			
 
				+        result += 10;
			
 
				+        return result;
			
 
				+    }
			
 
				+    """
			
 
				+    
			
 
				+    cfg = analyze_code_cfg(test_code, "test_function")
			
 
				+    dom_info = analyze_dominators(cfg)
			
 
				+    
			
 
				+    print(f"Function: {cfg.function_name}")
			
 
				+    print(f"Blocks: {len(cfg.blocks)}")
			
 
				+    print(f"Critical Points: {dom_info.critical_points}")
			
 
				+    print(f"Fusion Points: {get_fusion_points(cfg)}")
			
 
				+    
			
 
				+    print("\nDominators:")
			
 
				+    for node, doms in dom_info.dominators.items():
			
 
				+        block_name = cfg.blocks[node].name
			
 
				+        print(f"  {block_name}: {doms}")
			
 
				+
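The two analyses in this file can be checked by hand on a small diamond-shaped graph. The sketch below is a standalone illustration working on a plain successor dictionary rather than the `ControlFlowGraph` class; the function names and the graph are assumptions made for the example.

```python
from typing import Dict, List, Set


def compute_dominators(succs: Dict[int, List[int]], entry: int) -> Dict[int, Set[int]]:
    """Iterative data-flow dominators: Dom(n) = {n} | intersection of Dom(p) over preds p."""
    nodes = set(succs)
    preds: Dict[int, List[int]] = {n: [] for n in nodes}
    for src, dsts in succs.items():
        for dst in dsts:
            preds[dst].append(src)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in sorted(nodes - {entry}):
            if preds[n]:
                new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            else:
                new = {n}  # unreachable node: dominated only by itself
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def critical_points(succs: Dict[int, List[int]], entry: int, exits: Set[int]) -> Set[int]:
    """Nodes whose removal leaves no exit reachable from the entry."""
    def exit_reachable_without(removed: int) -> bool:
        if removed == entry:
            return False
        seen: Set[int] = set()
        stack = [entry]
        while stack:
            cur = stack.pop()
            if cur in seen or cur == removed:
                continue
            seen.add(cur)
            if cur in exits:
                return True
            stack.extend(succs[cur])
        return False

    return {n for n in succs if not exit_reachable_without(n)}


# Diamond: 0 branches to 1 and 2, both of which rejoin at 3
diamond = {0: [1, 2], 1: [3], 2: [3], 3: []}
dom = compute_dominators(diamond, entry=0)
crit = critical_points(diamond, entry=0, exits={3})
print(dom)
print(crit)
```

On this graph neither branch dominates the join node, so node 3 is dominated only by 0 and itself, and only the entry and the join are on every path.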
			
--- a/src/llm_splitter.py
+++ b/src/llm_splitter.py
@@ -0,0 +1,652 @@
 
				+#!/usr/bin/env python3
			
 
				+# -*- coding: utf-8 -*-
			
 
				+"""
			
 
				+LLM code splitter
			
 
				+
			
 
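				+Calls a large language model to split a piece of code into several slices so that they can be inserted into multiple functions of a call chain.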
			
 
				+"""
			
 
				+
			
 
				+import os
			
 
				+import json
			
 
				+import re
			
 
				+from typing import List, Dict, Optional, Tuple
			
 
				+from dataclasses import dataclass
			
 
				+
			
 
				+from openai import OpenAI
			
 
				+
			
 
				+
			
 
				+@dataclass
			
 
				+class CodeSlice:
			
 
				+    """Code slice."""
			
 
				+    index: int
			
 
				+    code: str
			
 
				+    description: str
			
 
				+    dependencies: List[str]  # variables/state this slice depends on
			
 
				+    outputs: List[str]  # variables/state this slice produces
			
 
				+
			
 
				+
			
 
				+@dataclass
			
 
				+class SliceResult:
			
 
				+    """Split result."""
			
 
				+    original_code: str
			
 
				+    slices: List[CodeSlice]
			
 
				+    shared_state: Dict[str, str]  # shared state: variable name -> type
			
 
				+    global_declarations: str  # global variable declaration code
			
 
				+    setup_code: str  # initialization code
			
 
				+    cleanup_code: str  # cleanup code
			
 
				+    passing_method: str = "global"  # variable passing method: "global" or "parameter"
			
 
				+    parameter_struct: str = ""  # struct definition used when passing by parameter
			
 
				+
			
 
				+
			
 
				+class LLMCodeSplitter:
			
 
				+    """LLM code splitter."""
			
 
				+    
			
 
				+    # Variable passing methods
			
 
				+    METHOD_GLOBAL = "global"      # pass state through global variables
			
 
				+    METHOD_PARAMETER = "parameter"  # pass state through parameters
			
 
				+    
			
 
				+    def __init__(self, api_key: Optional[str] = None, base_url: Optional[str] = None, model: Optional[str] = None):
			
 
				+        """
			
 
				+        Initialize the LLM splitter.
			
 
				+        
			
 
				+        Args:
			
 
				+            api_key: API key (read from the DASHSCOPE_API_KEY environment variable by default)
			
 
				+            base_url: API base URL
			
 
				+            model: model name
			
 
				+        """
			
 
				+        self.api_key = api_key or os.getenv("DASHSCOPE_API_KEY")
			
 
				+        self.base_url = base_url or "https://dashscope.aliyuncs.com/compatible-mode/v1"
			
 
				+        self.model = model or "qwen-plus"  # options: qwen-plus, qwen-turbo, qwen-max
			
 
				+        
			
 
				+        if not self.api_key:
			
 
				+            raise ValueError("API key not found. Please set DASHSCOPE_API_KEY environment variable.")
			
 
				+        
			
 
				+        self.client = OpenAI(
			
 
				+            api_key=self.api_key,
			
 
				+            base_url=self.base_url
			
 
				+        )
			
 
				+    
			
 
				+    def _create_split_prompt(self, code: str, n_parts: int, function_names: List[str]) -> str:
			
 
				+        """
			
 
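				+        Build the prompt for splitting code (global-variable method).
			
 
				+        
			
 
				+        Args:
			
 
				+            code: the code to split
			
 
				+            n_parts: number of parts to split into
			
 
				+            function_names: names of the functions in the call chain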
			
 
				+        """
			
 
				+        prompt = f"""你是一个代码分析专家。请将以下代码拆分为 {n_parts} 个相互依赖的片段。
			
 
				+
			
 
				+这些片段将被插入到一个调用链中的 {n_parts} 个函数中：
			
 
				+调用链：{' -> '.join(function_names)}
			
 
				+
			
 
				+【重要】由于每个片段在不同的函数中执行，局部变量无法直接传递！
			
 
				+你必须：
			
 
				+1. 将需要跨函数共享的变量声明为全局变量（放在 shared_state 中）
			
 
				+2. 第一个片段负责初始化全局变量
			
 
				+3. 后续片段使用这些全局变量
			
 
				+4. 最后一个片段执行最终操作
			
 
				+
			
 
				+要求：
			
 
				+1. 每个片段应该是语义完整的代码块
			
 
				+2. 片段之间通过【全局变量】传递状态，不能依赖局部变量
			
 
				+3. 按照调用顺序，第一个片段在调用链最外层函数中执行，最后一个片段在最内层函数中执行
			
 
				+4. 所有片段按顺序执行后，效果应该与原始代码相同
			
 
				+5. shared_state 中声明所有需要跨函数共享的变量
			
 
				+
			
 
				+原始代码：
			
 
				+```c
			
 
				+{code}
			
 
				+```
			
 
				+
			
 
				+请按以下 JSON 格式返回结果：
			
 
				+```json
			
 
				+{{
			
 
				+    "shared_state": {{
			
 
				+        "变量名": "类型（如 int, char*, etc.）"
			
 
				+    }},
			
 
				+    "global_declarations": "全局变量声明代码，如：static int g_secret; static int g_key;",
			
 
				+    "slices": [
			
 
				+        {{
			
 
				+            "index": 0,
			
 
				+            "function": "函数名",
			
 
				+            "code": "代码片段（使用全局变量，如 g_secret = 42;）",
			
 
				+            "description": "描述这段代码做什么",
			
 
				+            "dependencies": ["依赖的全局变量"],
			
 
				+            "outputs": ["输出/修改的全局变量"]
			
 
				+        }}
			
 
				+    ],
			
 
				+    "cleanup_code": "清理代码（如释放内存、重置全局变量等）"
			
 
				+}}
			
 
				+```
			
 
				+
			
 
				+示例：如果原始代码是 `int secret = 42; int key = secret ^ 0xABCD; printf("key=%d", key);`
			
 
				+拆分为3个片段应该是：
			
 
				+- shared_state: {{"g_secret": "int", "g_key": "int"}}
			
 
				+- global_declarations: "static int g_secret; static int g_key;"
			
 
				+- 片段1: "g_secret = 42;"
			
 
				+- 片段2: "g_key = g_secret ^ 0xABCD;"
			
 
				+- 片段3: "printf(\\"key=%d\\", g_key);"
			
 
				+
			
 
				+只返回 JSON，不要有其他内容。
			
 
				+"""
			
 
				+        return prompt
			
 
				+    
			
 
				+    def _create_parameter_split_prompt(self, code: str, n_parts: int, function_names: List[str]) -> str:
			
 
				+        """
			
 
				+        Build the prompt for splitting code with the parameter-passing method.
			
 
				+        """
			
 
				+        prompt = f"""你是一个代码分析专家。请将以下代码拆分为 {n_parts} 个相互依赖的片段。
			
 
				+
			
 
				+这些片段将被插入到一个调用链中的 {n_parts} 个函数中：
			
 
				+调用链：{' -> '.join(function_names)}
			
 
				+
			
 
				+【重要】使用参数传递方法！
			
 
				+你需要：
			
 
				+1. 定义一个结构体来保存共享状态
			
 
				+2. 每个函数需要添加一个指向该结构体的指针参数
			
 
				+3. 每个片段通过这个结构体指针访问和修改共享状态
			
 
				+
			
 
				+要求：
			
 
				+1. 定义结构体 `FusionState` 包含所有需要共享的变量
			
 
				+2. 每个函数添加参数 `FusionState* fusion_state`
			
 
				+3. 片段中通过 `fusion_state->变量名` 访问变量
			
 
				+4. 调用下层函数时传递 `fusion_state` 指针
			
 
				+
			
 
				+原始代码：
			
 
				+```c
			
 
				+{code}
			
 
				+```
			
 
				+
			
 
				+请按以下 JSON 格式返回结果：
			
 
				+```json
			
 
				+{{
			
 
				+    "shared_state": {{
			
 
				+        "变量名": "类型"
			
 
				+    }},
			
 
				+    "parameter_struct": "typedef struct {{ int secret; int key; }} FusionState;",
			
 
				+    "slices": [
			
 
				+        {{
			
 
				+            "index": 0,
			
 
				+            "function": "函数名",
			
 
				+            "code": "代码片段（使用 fusion_state->secret = 42;）",
			
 
				+            "description": "描述",
			
 
				+            "dependencies": ["依赖的变量"],
			
 
				+            "outputs": ["输出的变量"]
			
 
				+        }}
			
 
				+    ],
			
 
				+    "init_code": "FusionState fusion_state_data; memset(&fusion_state_data, 0, sizeof(fusion_state_data)); FusionState* fusion_state = &fusion_state_data;"
			
 
				+}}
			
 
				+```
			
 
				+
			
 
				+示例：如果原始代码是 `int secret = 42; int key = secret ^ 0xABCD; printf("key=%d", key);`
			
 
				+- parameter_struct: "typedef struct {{ int secret; int key; }} FusionState;"
			
 
				+- 片段1: "fusion_state->secret = 42;"
			
 
				+- 片段2: "fusion_state->key = fusion_state->secret ^ 0xABCD;"
			
 
				+- 片段3: "printf(\\"key=%d\\", fusion_state->key);"
			
 
				+
			
 
				+只返回 JSON，不要有其他内容。
			
 
				+"""
			
 
				+        return prompt
			
 
				+    
			
 
				+    def _parse_llm_response(self, response: str) -> Optional[Dict]:
			
 
				+        """
			
 
				+        Parse the LLM response into a dict; return None if no JSON can be extracted.
			
 
				+        """
			
 
				+        # Try to extract JSON
			
 
				+        try:
			
 
				+            # First attempt: parse the whole response directly
			
 
				+            return json.loads(response)
			
 
				+        except json.JSONDecodeError:
			
 
				+            pass
			
 
				+        
			
 
				+        # Second attempt: extract from a markdown code block
			
 
				+        json_match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', response)
			
 
				+        if json_match:
			
 
				+            try:
			
 
				+                return json.loads(json_match.group(1))
			
 
				+            except json.JSONDecodeError:
			
 
				+                pass
			
 
				+        
			
 
				+        # Last attempt: take the span from the first "{" to the last "}"
			
 
				+        json_match = re.search(r'\{[\s\S]*\}', response)
			
 
				+        if json_match:
			
 
				+            try:
			
 
				+                return json.loads(json_match.group(0))
			
 
				+            except json.JSONDecodeError:
			
 
				+                pass
			
 
				+        
			
 
				+        return None
			
 
				+    
			
 
				+    def split_code(self, code: str, n_parts: int, function_names: List[str], 
			
 
				+                   method: str = "global") -> SliceResult:
			
 
				+        """
			
 
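				+        Split the code into multiple slices.
			
 
				+        
			
 
				+        Args:
			
 
				+            code: the code to split
			
 
				+            n_parts: number of parts to split into
			
 
				+            function_names: names of the functions in the call chain
			
 
				+            method: state-passing method, "global" (global variables) or "parameter" (parameter passing)
			
 
				+            
			
 
				+        Returns:
			
 
				+            A SliceResult object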
			
 
				+        """
			
 
				+        if n_parts <= 0:
			
 
				+            raise ValueError("n_parts must be positive")
			
 
				+        
			
 
				+        if method not in [self.METHOD_GLOBAL, self.METHOD_PARAMETER]:
			
 
				+            method = self.METHOD_GLOBAL
			
 
				+        
			
 
				+        if n_parts == 1:
			
 
				+            # Nothing to split
			
 
				+            return SliceResult(
			
 
				+                original_code=code,
			
 
				+                slices=[CodeSlice(
			
 
				+                    index=0,
			
 
				+                    code=code,
			
 
				+                    description="Original code",
			
 
				+                    dependencies=[],
			
 
				+                    outputs=[]
			
 
				+                )],
			
 
				+                shared_state={},
			
 
				+                global_declarations="",
			
 
				+                setup_code="",
			
 
				+                cleanup_code="",
			
 
				+                passing_method=method,
			
 
				+                parameter_struct=""
			
 
				+            )
			
 
				+        
			
 
				+        # Pick the prompt that matches the passing method
			
 
				+        if method == self.METHOD_PARAMETER:
			
 
				+            prompt = self._create_parameter_split_prompt(code, n_parts, function_names)
			
 
				+        else:
			
 
				+            prompt = self._create_split_prompt(code, n_parts, function_names)
			
 
				+        
			
 
				+        try:
			
 
				+            completion = self.client.chat.completions.create(
			
 
				+                model=self.model,
			
 
				+                messages=[
			
 
				+                    {
			
 
				+                        "role": "system", 
			
 
				+                        "content": "你是一个专业的代码分析和重构专家，擅长将代码拆分为多个相互依赖的片段。请只返回 JSON 格式的结果。"
			
 
				+                    },
			
 
				+                    {"role": "user", "content": prompt}
			
 
				+                ],
			
 
				+                temperature=0.3,
			
 
				+            )
			
 
				+            
			
 
				+            response_text = completion.choices[0].message.content
			
 
				+            
			
 
				+            # Parse the response
			
 
				+            result_dict = self._parse_llm_response(response_text)
			
 
				+            
			
 
				+            if not result_dict:
			
 
				+                print("Warning: Failed to parse LLM response. Using fallback splitting.")
			
 
				+                return self._fallback_split(code, n_parts, function_names, method)
			
 
				+            
			
 
				+            # Build the result
			
 
				+            slices = []
			
 
				+            for slice_data in result_dict.get("slices", []):
			
 
				+                slices.append(CodeSlice(
			
 
				+                    index=slice_data.get("index", 0),
			
 
				+                    code=slice_data.get("code", ""),
			
 
				+                    description=slice_data.get("description", ""),
			
 
				+                    dependencies=slice_data.get("dependencies", []),
			
 
				+                    outputs=slice_data.get("outputs", [])
			
 
				+                ))
			
 
				+            
			
 
				+            return SliceResult(
			
 
				+                original_code=code,
			
 
				+                slices=slices,
			
 
				+                shared_state=result_dict.get("shared_state", {}),
			
 
				+                global_declarations=result_dict.get("global_declarations", ""),
			
 
				+                setup_code=result_dict.get("setup_code", result_dict.get("init_code", "")),
			
 
				+                cleanup_code=result_dict.get("cleanup_code", ""),
			
 
				+                passing_method=method,
			
 
				+                parameter_struct=result_dict.get("parameter_struct", "")
			
 
				+            )
			
 
				+            
			
 
				+        except Exception as e:
			
 
				+            print(f"Warning: LLM call failed: {e}. Using fallback splitting.")
			
 
				+            return self._fallback_split(code, n_parts, function_names, method)
			
 
				+    
			
 
				+    def _fallback_split(self, code: str, n_parts: int, function_names: List[str], 
			
 
				+                        method: str = "global") -> SliceResult:
			
 
				+        """
			
 
				+        Fallback split: divide the non-empty source lines evenly across the parts.
			
 
				+        """
			
 
				+        # Split on lines, dropping blank ones
			
 
				+        lines = [line for line in code.strip().split('\n') if line.strip()]
			
 
				+        
			
 
				+        if len(lines) < n_parts:
			
 
				+            # Fewer lines than parts: one line per slice
			
 
				+            slices = []
			
 
				+            for i, line in enumerate(lines):
			
 
				+                slices.append(CodeSlice(
			
 
				+                    index=i,
			
 
				+                    code=line,
			
 
				+                    description=f"Part {i+1}",
			
 
				+                    dependencies=[],
			
 
				+                    outputs=[]
			
 
				+                ))
			
 
				+            # Pad with empty slices
			
 
				+            while len(slices) < n_parts:
			
 
				+                slices.append(CodeSlice(
			
 
				+                    index=len(slices),
			
 
				+                    code="// empty slice",
			
 
				+                    description=f"Part {len(slices)+1} (empty)",
			
 
				+                    dependencies=[],
			
 
				+                    outputs=[]
			
 
				+                ))
			
 
				+        else:
			
 
				+            # Even split; the last slice takes any remainder
			
 
				+            chunk_size = len(lines) // n_parts
			
 
				+            slices = []
			
 
				+            for i in range(n_parts):
			
 
				+                start = i * chunk_size
			
 
				+                end = start + chunk_size if i < n_parts - 1 else len(lines)
			
 
				+                slice_code = '\n'.join(lines[start:end])
			
 
				+                slices.append(CodeSlice(
			
 
				+                    index=i,
			
 
				+                    code=slice_code,
			
 
				+                    description=f"Part {i+1}",
			
 
				+                    dependencies=[],
			
 
				+                    outputs=[]
			
 
				+                ))
			
 
				+        
			
 
				+        # Generate the state-passing code that matches the method
			
 
				+        if method == self.METHOD_PARAMETER:
			
 
				+            param_info = self._generate_fallback_parameters(code)
			
 
				+            return SliceResult(
			
 
				+                original_code=code,
			
 
				+                slices=slices,
			
 
				+                shared_state=param_info.get("shared_state", {}),
			
 
				+                global_declarations="",
			
 
				+                setup_code=param_info.get("init_code", ""),
			
 
				+                cleanup_code="",
			
 
				+                passing_method=method,
			
 
				+                parameter_struct=param_info.get("parameter_struct", "")
			
 
				+            )
			
 
				+        else:
			
 
				+            # Global-variable method
			
 
				+            global_decl = self._generate_fallback_globals(code)
			
 
				+            return SliceResult(
			
 
				+                original_code=code,
			
 
				+                slices=slices,
			
 
				+                shared_state=global_decl.get("shared_state", {}),
			
 
				+                global_declarations=global_decl.get("declarations", ""),
			
 
				+                setup_code="",
			
 
				+                cleanup_code="",
			
 
				+                passing_method=method,
			
 
				+                parameter_struct=""
			
 
				+            )
			
 
				+    
			
 
				+    def _generate_fallback_parameters(self, code: str) -> Dict:
			
 
				+        """
			
 
				+        Generate the struct needed for parameter passing in the fallback split.
			
 
				+        """
			
 
				+        import re
			
 
				+        
			
 
				+        # Match simple initialized declarations: type name = value;
			
 
				+        var_pattern = r'\b(int|char|float|double|long|short|unsigned)\s+(\w+)\s*='
			
 
				+        matches = re.findall(var_pattern, code)
			
 
				+        
			
 
				+        shared_state = {}
			
 
				+        struct_fields = []
			
 
				+        
			
 
				+        for var_type, var_name in matches:
			
 
				+            shared_state[var_name] = var_type
			
 
				+            struct_fields.append(f"    {var_type} {var_name};")
			
 
				+        
			
 
				+        if struct_fields:
			
 
				+            parameter_struct = "typedef struct {\n" + "\n".join(struct_fields) + "\n} FusionState;"
			
 
				+        else:
			
 
				+            parameter_struct = "typedef struct { int _placeholder; } FusionState;"
			
 
				+        
			
 
				+        init_code = "FusionState fusion_state_data; memset(&fusion_state_data, 0, sizeof(fusion_state_data)); FusionState* fusion_state = &fusion_state_data;"
			
 
				+        
			
 
				+        return {
			
 
				+            "shared_state": shared_state,
			
 
				+            "parameter_struct": parameter_struct,
			
 
				+            "init_code": init_code
			
 
				+        }
			
 
				+    
			
 
				+    def _generate_fallback_globals(self, code: str) -> Dict:
			
 
				+        """
			
 
				+        Generate global variable declarations for the fallback split.
			
 
				+        Scans the code for variable declarations and turns them into globals.
			
 
				+        """
			
 
				+        import re
			
 
				+        
			
 
				+        # Match simple initialized declarations: type name = value;
			
 
				+        var_pattern = r'\b(int|char|float|double|long|short|unsigned)\s+(\w+)\s*='
			
 
				+        matches = re.findall(var_pattern, code)
			
 
				+        
			
 
				+        shared_state = {}
			
 
				+        declarations = []
			
 
				+        
			
 
				+        for var_type, var_name in matches:
			
 
				+            global_name = f"g_{var_name}"
			
 
				+            shared_state[global_name] = var_type
			
 
				+            declarations.append(f"static {var_type} {global_name};")
			
 
				+        
			
 
				+        return {
			
 
				+            "shared_state": shared_state,
			
 
				+            "declarations": "\n".join(declarations)
			
 
				+        }
			
 
				+
			
 
				+
			
 
				+def split_code_for_call_chain(
			
 
				+    code: str, 
			
 
				+    call_chain: List[str],
			
 
				+    api_key: str = None
			
 
				+) -> SliceResult:
			
 
				+    """
			
 
				+    Split the code so that it fits a call chain.
			
 
				+    
			
 
				+    Args:
			
 
				+        code: the code to split
			
 
				+        call_chain: the call chain (list of function names)
			
 
				+        api_key: API key (optional)
			
 
				+        
			
 
				+    Returns:
			
 
				+        A SliceResult object
			
 
				+    """
			
 
				+    splitter = LLMCodeSplitter(api_key=api_key)
			
 
				+    n_parts = len(call_chain)
			
 
				+    return splitter.split_code(code, n_parts, call_chain)
			
 
				+
			
 
				+
			
 
				+class CodeFusionGenerator:
			
 
				+    """Code fusion generator."""
			
 
				+    
			
 
				+    def __init__(self, splitter: LLMCodeSplitter = None):
			
 
				+        """
			
 
				+        Initialize the fusion generator.
			
 
				+        
			
 
				+        Args:
			
 
				+            splitter: an LLMCodeSplitter instance
			
 
				+        """
			
 
				+        self.splitter = splitter or LLMCodeSplitter()
			
 
				+    
			
 
				+    def _create_fusion_prompt(
			
 
				+        self, 
			
 
				+        target_code: str,
			
 
				+        call_chain_functions: List[Dict],
			
 
				+        slice_result: SliceResult
			
 
				+    ) -> str:
			
 
				+        """
			
 
				+        Build the prompt for fusing the slices into the call-chain functions.
			
 
				+        """
			
 
				+        functions_desc = "\n".join([
			
 
				+            f"{i+1}. {f['name']}:\n```c\n{f['code']}\n```"
			
 
				+            for i, f in enumerate(call_chain_functions)
			
 
				+        ])
			
 
				+        
			
 
				+        slices_desc = "\n".join([
			
 
				+            f"片段 {s.index + 1} (插入到 {call_chain_functions[s.index]['name']}):\n```c\n{s.code}\n```"
			
 
				+            for s in slice_result.slices
			
 
				+        ])
			
 
				+        
			
 
				+        prompt = f"""请将以下代码片段融合到对应的函数中。
			
 
				+
			
 
				+调用链中的函数：
			
 
				+{functions_desc}
			
 
				+
			
 
				+要插入的代码片段：
			
 
				+{slices_desc}
			
 
				+
			
 
				+共享状态变量：
			
 
				+{json.dumps(slice_result.shared_state, indent=2)}
			
 
				+
			
 
				+初始化代码：
			
 
				+```c
			
 
				+{slice_result.setup_code}
			
 
				+```
			
 
				+
			
 
				+要求：
			
 
				+1. 在每个函数的合适位置（通常是必经点）插入对应的代码片段
			
 
				+2. 正确处理共享状态的传递
			
 
				+3. 确保融合后的代码能够正确编译和执行
			
 
				+4. 保持原函数的功能不变
			
 
				+
			
 
				+请按以下 JSON 格式返回每个函数融合后的代码：
			
 
				+```json
			
 
				+{{
			
 
				+    "fused_functions": [
			
 
				+        {{
			
 
				+            "name": "函数名",
			
 
				+            "code": "融合后的完整函数代码"
			
 
				+        }}
			
 
				+    ],
			
 
				+    "global_declarations": "需要添加的全局声明（如共享状态变量）"
			
 
				+}}
			
 
				+```
			
 
				+
			
 
				+只返回 JSON，不要有其他内容。
			
 
				+"""
			
 
				+        return prompt
			
 
				+    
			
 
				+    def generate_fused_code(
			
 
				+        self,
			
 
				+        target_code: str,
			
 
				+        call_chain_functions: List[Dict],
			
 
				+        slice_result: SliceResult = None
			
 
				+    ) -> Dict:
			
 
				+        """
			
 
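				+        Generate the fused code.
			
 
				+        
			
 
				+        Args:
			
 
				+            target_code: the target code to fuse
			
 
				+            call_chain_functions: the call-chain functions; each element has a name and a code entry
			
 
				+            slice_result: the split result (optional; the code is split automatically if omitted)
			
 
				+            
			
 
				+        Returns:
			
 
				+            A dict describing the fusion result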
			
 
				+        """
			
 
				+        if slice_result is None:
			
 
				+            function_names = [f['name'] for f in call_chain_functions]
			
 
				+            slice_result = self.splitter.split_code(
			
 
				+                target_code, 
			
 
				+                len(call_chain_functions),
			
 
				+                function_names
			
 
				+            )
			
 
				+        
			
 
				+        prompt = self._create_fusion_prompt(
			
 
				+            target_code,
			
 
				+            call_chain_functions,
			
 
				+            slice_result
			
 
				+        )
			
 
				+        
			
 
				+        try:
			
 
				+            completion = self.splitter.client.chat.completions.create(
			
 
				+                model=self.splitter.model,
			
 
				+                messages=[
			
 
				+                    {
			
 
				+                        "role": "system",
			
 
				+                        "content": "你是一个专业的代码融合专家，擅长将代码片段安全地插入到现有函数中。请只返回 JSON 格式的结果。"
			
 
				+                    },
			
 
				+                    {"role": "user", "content": prompt}
			
 
				+                ],
			
 
				+                temperature=0.3,
			
 
				+            )
			
 
				+            
			
 
				+            response_text = completion.choices[0].message.content
			
 
				+            result_dict = self.splitter._parse_llm_response(response_text)
			
 
				+            
			
 
				+            if result_dict:
			
 
				+                return result_dict
			
 
				+            else:
			
 
				+                return self._fallback_fusion(call_chain_functions, slice_result)
			
 
				+                
			
 
				+        except Exception as e:
			
 
				+            print(f"Warning: LLM fusion call failed: {e}. Using fallback fusion.")
			
 
				+            return self._fallback_fusion(call_chain_functions, slice_result)
			
 
				+    
			
 
				+    def _fallback_fusion(
			
 
				+        self,
			
 
				+        call_chain_functions: List[Dict],
			
 
				+        slice_result: SliceResult
			
 
				+    ) -> Dict:
			
 
				+        """
			
 
				+        Fallback fusion: insert each slice at the start of its function body.
			
 
				+        """
			
 
				+        fused_functions = []
			
 
				+        
			
 
				+        for i, func in enumerate(call_chain_functions):
			
 
				+            if i < len(slice_result.slices):
			
 
				+                slice_code = slice_result.slices[i].code
			
 
				+                # Simply insert the slice at the start of the function
			
 
				+                fused_code = self._insert_code_at_start(func['code'], slice_code)
			
 
				+            else:
			
 
				+                fused_code = func['code']
			
 
				+            
			
 
				+            fused_functions.append({
			
 
				+                "name": func['name'],
			
 
				+                "code": fused_code
			
 
				+            })
			
 
				+        
			
 
				+        return {
			
 
				+            "fused_functions": fused_functions,
			
 
				+            "global_declarations": ""
			
 
				+        }
			
 
				+    
			
 
				+    def _insert_code_at_start(self, func_code: str, insert_code: str) -> str:
			
 
				+        """
			
 
				+        Insert code at the start of the function body.
			
 
				+        """
			
 
				+        # Find the { that opens the function body
			
 
				+        brace_pos = func_code.find('{')
			
 
				+        if brace_pos == -1:
			
 
				+            return func_code
			
 
				+        
			
 
				+        # Insert the code right after the {
			
 
				+        return (
			
 
				+            func_code[:brace_pos + 1] + 
			
 
				+            f"\n    // --- Inserted code start ---\n    {insert_code}\n    // --- Inserted code end ---\n" +
			
 
				+            func_code[brace_pos + 1:]
			
 
				+        )
			
 
				+
			
 
				+
			
 
				+if __name__ == "__main__":
			
 
				+    # Smoke test
			
 
				+    test_code = """
			
 
				+    int secret = 42;
			
 
				+    int key = secret ^ 0xFF;
			
 
				+    printf("Key: %d\\n", key);
			
 
				+    """
			
 
				+    
			
 
				+    call_chain = ["outer_func", "middle_func", "inner_func"]
			
 
				+    
			
 
				+    try:
			
 
				+        result = split_code_for_call_chain(test_code, call_chain)
			
 
				+        print(f"Split into {len(result.slices)} slices:")
			
 
				+        for code_slice in result.slices:
			
 
				+            print(f"\nSlice {code_slice.index}:")
			
 
				+            print(f"  Code: {code_slice.code}")
			
 
				+            print(f"  Description: {code_slice.description}")
			
 
				+    except Exception as e:
			
 
				+        print(f"Error: {e}")
			
 
				+        print("Make sure DASHSCOPE_API_KEY is set in environment variables.")
			
 
				+
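Reviewer note on `_parse_llm_response` in the file above: it tries three extraction strategies in order (the raw text, a markdown code fence, then the widest brace-delimited span). The sketch below restates that idea as a standalone function so it can be exercised without an API key. It is an illustration under assumptions, not the repository's code: the name `extract_json` is hypothetical, and unlike the original it only accepts a JSON object (a bare list yields `None`).

```python
import json
import re
from typing import Optional

# Built at runtime so this snippet contains no literal fence marker.
FENCE = "`" * 3


def extract_json(response: str) -> Optional[dict]:
    """Try progressively looser ways of pulling a JSON object out of text."""
    # Strategy 1: the whole response is already JSON.
    candidates = [response]

    # Strategy 2: a markdown code fence, with or without a "json" tag.
    fenced = re.search(FENCE + r"(?:json)?\s*([\s\S]*?)\s*" + FENCE, response)
    if fenced:
        candidates.append(fenced.group(1))

    # Strategy 3: the widest span from the first "{" to the last "}".
    braces = re.search(r"\{[\s\S]*\}", response)
    if braces:
        candidates.append(braces.group(0))

    for text in candidates:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        # Assumption of this sketch: only a JSON object counts as a result.
        if isinstance(parsed, dict):
            return parsed
    return None
```

Keeping the strategies in one candidate list makes the order explicit and leaves a single place to add validation, for example checking that a `slices` key is present before the result is accepted.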
			
--- a/src/main.py
+++ b/src/main.py
@@ -0,0 +1,646 @@
 
				+#!/usr/bin/env python3
			
 
				+# -*- coding: utf-8 -*-
			
 
				+"""
			
 
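				+Code Fusion main program
			
 
				+
			
 
				+Features:
			
 
				+1. Read the data for call chains of depth 4
			
 
				+2. Analyze the control flow graph and critical points of the code
			
 
				+3. Use an LLM to split the target code and fuse it into the call-chain functions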
			
 
				+"""
			
 
				+
			
 
				+import os
			
 
				+import sys
			
 
				+import json
			
 
				+import argparse
			
 
				+from typing import List, Dict, Optional
			
 
				+from dataclasses import dataclass
			
 
				+
			
 
				+# Add this file's directory to the import path
			
 
				+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
			
 
				+
			
 
				+from cfg_analyzer import analyze_code_cfg, visualize_cfg
			
 
				+from dominator_analyzer import analyze_dominators, get_fusion_points
			
 
				+from llm_splitter import LLMCodeSplitter, split_code_for_call_chain
			
 
				+from code_fusion import CodeFusionEngine, CallChain, FunctionInfo, analyze_call_chain_group
			
 
				+
			
 
				+
			
 
				+@dataclass
			
 
				+class ProcessingResult:
			
 
				+    """Result of processing one call-chain group."""
			
 
				+    group_index: int
			
 
				+    call_chain: List[str]
			
 
				+    call_depth: int
			
 
				+    functions_count: int
			
 
				+    total_fusion_points: int
			
 
				+    fused_code: Dict[str, str]
			
 
				+    success: bool
			
 
				+    error_message: str = ""
			
 
				+    global_declarations: str = ""  # global variable declarations
			
 
				+    passing_method: str = "global"  # state-passing method
			
 
				+    parameter_struct: str = ""  # parameter struct definition
			
 
				+
			
 
				+
			
 
				+class CodeFusionProcessor:
			
 
				+    """Code fusion processor."""
			
 
				+    
			
 
				+    def __init__(self, api_key: str = None):
			
 
				+        """
			
 
				+        Initialize the processor.
			
 
				+        
			
 
				+        Args:
			
 
				+            api_key: API key
			
 
				+        """
			
 
				+        self.api_key = api_key or os.getenv("DASHSCOPE_API_KEY")
			
 
				+        self.splitter = None
			
 
				+        self.engine = None
			
 
				+        
			
 
				+        if self.api_key:
			
 
				+            try:
			
 
				+                self.splitter = LLMCodeSplitter(api_key=self.api_key)
			
 
				+                self.engine = CodeFusionEngine(splitter=self.splitter)
			
 
				+            except Exception as e:
			
 
				+                print(f"Warning: Failed to initialize LLM splitter: {e}")
			
 
				+    
			
 
				+    def load_data(self, input_path: str) -> Dict:
			
 
				+        """
			
 
				+        Load the data file.
			
 
				+        
			
 
				+        Args:
			
 
				+            input_path: path of the input file
			
 
				+            
			
 
				+        Returns:
			
 
				+            The parsed data as a dict
			
 
				+        """
			
 
				+        with open(input_path, 'r', encoding='utf-8') as f:
			
 
				+            return json.load(f)
			
 
				+    
			
 
				+    def analyze_group(self, group: Dict) -> Dict:
			
 
				+        """
			
 
				+        Analyze a single call-chain group.
			
 
				+        
			
 
				+        Args:
			
 
				+            group: the call-chain group data
			
 
				+            
			
 
				+        Returns:
			
 
				+            The analysis result
			
 
				+        """
			
 
				+        return analyze_call_chain_group(group)
			
 
				+    
			
 
				+    def process_group(
			
 
				+        self,
			
 
				+        group: Dict,
			
 
				+        target_code: str,
			
 
				+        group_index: int = 0,
			
 
				+        passing_method: str = "global"
			
 
				+    ) -> ProcessingResult:
			
 
				+        """
			
 
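				+        Process a single call-chain group and perform the code fusion.
			
 
				+        
			
 
				+        Args:
			
 
				+            group: the call-chain group data
			
 
				+            target_code: the target code to fuse
			
 
				+            group_index: index of the group
			
 
				+            passing_method: state-passing method, "global" or "parameter"
			
 
				+            
			
 
				+        Returns:
			
 
				+            A ProcessingResult object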
			
 
				+        """
			
 
				+        functions = group.get('functions', [])
			
 
				+        call_depth = group.get('call_depth', 0)
			
 
				+        call_chain = group.get('longest_call_chain', [])
			
 
				+        
			
 
				+        if not self.engine:
			
 
				+            return ProcessingResult(
			
 
				+                group_index=group_index,
			
 
				+                call_chain=call_chain,
			
 
				+                call_depth=call_depth,
			
 
				+                functions_count=len(functions),
			
 
				+                total_fusion_points=0,
			
 
				+                fused_code={},
			
 
				+                success=False,
			
 
				+                error_message="LLM engine not initialized",
			
 
				+                global_declarations="",
			
 
				+                passing_method=passing_method,
			
 
				+                parameter_struct=""
			
 
				+            )
			
 
				+        
			
 
				+        try:
			
 
				+            # Build the call chain
			
 
				+            chain = self.engine.build_call_chain(functions, call_chain)
			
 
				+            
			
 
				+            # Create the fusion plan (forwarding passing_method)
			
 
				+            plan = self.engine.create_fusion_plan(target_code, chain, passing_method)
			
 
				+            
			
 
				+            # Execute the fusion
			
 
				+            fused_code = self.engine.execute_fusion(plan)
			
 
				+            
			
 
				+            # Collect the state-passing information
			
 
				+            slice_result = plan.slice_result
			
 
				+            global_decl = slice_result.global_declarations if slice_result else ""
			
 
				+            param_struct = slice_result.parameter_struct if slice_result else ""
			
 
				+            
			
 
				+            return ProcessingResult(
			
 
				+                group_index=group_index,
			
 
				+                call_chain=call_chain,
			
 
				+                call_depth=call_depth,
			
 
				+                functions_count=len(functions),
			
 
				+                total_fusion_points=chain.get_total_fusion_points(),
			
 
				+                fused_code=fused_code,
			
 
				+                success=True,
			
 
				+                global_declarations=global_decl,
			
 
				+                passing_method=passing_method,
			
 
				+                parameter_struct=param_struct
			
 
				+            )
			
 
				+            
			
 
				+        except Exception as e:
			
 
				+            return ProcessingResult(
			
 
				+                group_index=group_index,
			
 
				+                call_chain=call_chain,
			
 
				+                call_depth=call_depth,
			
 
				+                functions_count=len(functions),
			
 
				+                total_fusion_points=0,
			
 
				+                fused_code={},
			
 
				+                success=False,
			
 
				+                error_message=str(e),
			
 
				+                global_declarations="",
			
 
				+                passing_method=passing_method,
			
 
				+                parameter_struct=""
			
 
				+            )
			
 
				+    
			
 
				+    def process_file(
			
 
				+        self,
			
 
				+        input_path: str,
			
 
				+        output_path: str,
			
 
				+        target_code: str,
			
 
				+        max_groups: int = 10,
			
 
				+        passing_method: str = "global"
			
 
				+    ) -> List[ProcessingResult]:
			
 
				+        """
			
 
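				+        Process the whole data file.
			
 
				+        
			
 
				+        Args:
			
 
				+            input_path: path of the input file
			
 
				+            output_path: path of the output file
			
 
				+            target_code: the target code to fuse
			
 
				+            max_groups: stop once this many groups have been fused successfully (failed groups do not count)
			
 
				+            passing_method: state-passing method, "global" or "parameter"
			
 
				+            
			
 
				+        Returns:
			
 
				+            The list of processing results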
			
 
				+        """
			
 
				+        print(f"Loading data from: {input_path}")
			
 
				+        data = self.load_data(input_path)
			
 
				+        groups = data.get('groups', [])
			
 
				+        
			
 
				+        print(f"Total groups: {len(groups)}")
			
 
				+        
			
 
				+        results = []
			
 
				+        processed = 0
			
 
				+        
			
 
				+        for i, group in enumerate(groups):
			
 
				+            if processed >= max_groups:
			
 
				+                break
			
 
				+            
			
 
				+            print(f"\nProcessing group {i + 1}/{len(groups)}...")
			
 
				+            
			
 
				+            # Analyze the group first
			
 
				+            analysis = self.analyze_group(group)
			
 
				+            print(f"  Call chain: {' -> '.join(analysis['call_chain'])}")
			
 
				+            print(f"  Functions: {analysis['functions_count']}")
			
 
				+            print(f"  Fusion points: {analysis['total_fusion_points']}")
			
 
				+            
			
 
				+            # Process the group
			
 
				+            result = self.process_group(group, target_code, i, passing_method)
			
 
				+            results.append(result)
			
 
				+            
			
 
				+            if result.success:
			
 
				+                print("  Status: SUCCESS")
			
 
				+                processed += 1
			
 
				+            else:
			
 
				+                print(f"  Status: FAILED - {result.error_message}")
			
 
				+        
			
 
				+        # Save the results
			
 
				+        self._save_results(results, output_path, target_code)
			
 
				+        
			
 
				+        return results
			
 
				+    
			
 
				+    def _save_results(
			
 
				+        self,
			
 
				+        results: List[ProcessingResult],
			
 
				+        output_path: str,
			
 
				+        target_code: str
			
 
				+    ):
			
 
				+        """
			
 
				+        Save the processing results as JSON.
			
 
				+        """
			
 
				+        output_data = {
			
 
				+            "metadata": {
			
 
				+                "target_code": target_code,
			
 
				+                "total_processed": len(results),
			
 
				+                "successful": sum(1 for r in results if r.success),
			
 
				+                "failed": sum(1 for r in results if not r.success)
			
 
				+            },
			
 
				+            "results": []
			
 
				+        }
			
 
				+        
			
 
				+        for result in results:
			
 
				+            output_data["results"].append({
			
 
				+                "group_index": result.group_index,
			
 
				+                "call_chain": result.call_chain,
			
 
				+                "call_depth": result.call_depth,
			
 
				+                "functions_count": result.functions_count,
			
 
				+                "total_fusion_points": result.total_fusion_points,
			
 
				+                "success": result.success,
			
 
				+                "error_message": result.error_message,
			
 
				+                "fused_code": result.fused_code
			
 
				+            })
			
 
				+        
			
 
				+        os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)  # a bare file name has no directory part
			
 
				+        with open(output_path, 'w', encoding='utf-8') as f:
			
 
				+            json.dump(output_data, f, ensure_ascii=False, indent=2)
			
 
				+        
			
 
				+        print(f"\nResults saved to: {output_path}")
			
 
				+        
			
 
				+        # Save the fused code as source files
			
 
				+        self._save_fused_code_files(results, output_path, target_code)
			
 
				+        
			
 
				+        # Report how many successful results used the parameter passing method
			
 
				+        param_results = [r for r in results if r.passing_method == "parameter" and r.success]
			
 
				+        if param_results:
			
 
				+            print(f"  Parameter passing method results: {len(param_results)}")
			
 
				+    
			
 
				+    def _save_fused_code_files(
			
 
				+        self,
			
 
				+        results: List[ProcessingResult],
			
 
				+        output_path: str,
			
 
				+        target_code: str
			
 
				+    ):
			
 
				+        """
			
 
				+        Save the fused code of each group as a separate source file.
			
 
				+        """
			
 
				+        # Create the code output directory
			
 
				+        output_dir = os.path.dirname(output_path)
			
 
				+        code_dir = os.path.join(output_dir, "fused_code")
			
 
				+        os.makedirs(code_dir, exist_ok=True)
			
 
				+        
			
 
				+        for result in results:
			
 
				+            if not result.success or not result.fused_code:
			
 
				+                continue
			
 
				+            
			
 
				+            # Build the file name
			
 
				+            chain_name = "_".join(result.call_chain[:2]) if len(result.call_chain) >= 2 else "unknown"
			
 
				+            filename = f"fused_group_{result.group_index}_{chain_name}.c"
			
 
				+            filepath = os.path.join(code_dir, filename)
			
 
				+            
			
 
				+            # Generate the content of the fused code file
			
 
				+            code_content = self._generate_fused_code_file(result, target_code, result.global_declarations)
			
 
				+            
			
 
				+            with open(filepath, 'w', encoding='utf-8') as f:
			
 
				+                f.write(code_content)
			
 
				+            
			
 
				+            print(f"  Fused code saved to: {filepath}")
			
 
				+        
			
 
				+        # Generate the summary file
			
 
				+        summary_path = os.path.join(code_dir, "all_fused_code.c")
			
 
				+        all_code = self._generate_all_fused_code(results, target_code)
			
 
				+        with open(summary_path, 'w', encoding='utf-8') as f:
			
 
				+            f.write(all_code)
			
 
				+        print(f"  All fused code saved to: {summary_path}")
			
 
				+    
			
 
				+    def _generate_fused_code_file(
			
 
				+        self,
			
 
				+        result: ProcessingResult,
			
 
				+        target_code: str,
			
 
				+        global_declarations: str = ""
			
 
				+    ) -> str:
			
 
				+        """
			
 
				+        Generate the content of a single fused code file.
			
 
				+        """
			
 
				+        lines = []
			
 
				+        
			
 
				+        # File header
			
 
				+        lines.append("/*")
			
 
				+        lines.append(" * Fused Code File")
			
 
				+        lines.append(f" * Group Index: {result.group_index}")
			
 
				+        lines.append(f" * Call Chain: {' -> '.join(result.call_chain)}")
			
 
				+        lines.append(f" * Call Depth: {result.call_depth}")
			
 
				+        lines.append(" *")
			
 
				+        lines.append(" * Original Target Code:")
			
 
				+        for line in target_code.strip().split('\n'):
			
 
				+            lines.append(f" *   {line}")
			
 
				+        lines.append(" *")
			
 
				+        lines.append(" * Generated by Code Fusion Tool")
			
 
				+        lines.append(" */")
			
 
				+        lines.append("")
			
 
				+        
			
 
				+        # Include common headers
			
 
				+        lines.append("#include <stdio.h>")
			
 
				+        lines.append("#include <stdlib.h>")
			
 
				+        lines.append("#include <string.h>")
			
 
				+        lines.append("")
			
 
				+        
			
 
				+        # Choose the variable declaration style based on the passing method
			
 
				+        passing_method = getattr(result, 'passing_method', 'global')
			
 
				+        parameter_struct = getattr(result, 'parameter_struct', '')
			
 
				+        
			
 
				+        if passing_method == "parameter":
			
 
				+            # Parameter passing method: use a struct
			
 
				+            lines.append("/* === Shared State (Parameter Passing Method) === */")
			
 
				+            if parameter_struct:
			
 
				+                lines.append(parameter_struct)
			
 
				+            else:
			
 
				+                lines.append("typedef struct {")
			
 
				+                lines.append("    int secret;")
			
 
				+                lines.append("    int key;")
			
 
				+                lines.append("} FusionState;")
			
 
				+            lines.append("")
			
 
				+            lines.append("/* Usage: Pass FusionState* fusion_state to each function */")
			
 
				+            lines.append("/* Initialize: FusionState state; memset(&state, 0, sizeof(state)); */")
			
 
				+        else:
			
 
				+            # Global variable method
			
 
				+            lines.append("/* === Shared State Variables (Global) === */")
			
 
				+            if global_declarations:
			
 
				+                lines.append(global_declarations)
			
 
				+            else:
			
 
				+                lines.append("static int g_secret;")
			
 
				+                lines.append("static int g_key;")
			
 
				+        lines.append("")
			
 
				+        
			
 
				+        # Function declarations
			
 
				+        lines.append("/* === Function Declarations === */")
			
 
				+        for func_name in result.call_chain:
			
 
				+            if func_name in result.fused_code:
			
 
				+                # Extract the function signature
			
 
				+                code = result.fused_code[func_name]
			
 
				+                sig = self._extract_function_signature(code)
			
 
				+                if sig:
			
 
				+                    lines.append(f"{sig};")
			
 
				+        lines.append("")
			
 
				+        
			
 
				+        # Function definitions (following the call chain, innermost to outermost)
			
 
				+        lines.append("/* === Function Definitions === */")
			
 
				+        lines.append("/* Functions are ordered from innermost to outermost in the call chain */")
			
 
				+        lines.append("")
			
 
				+        
			
 
				+        # Reverse the order so that callees are defined before their callers
			
 
				+        for func_name in reversed(result.call_chain):
			
 
				+            if func_name in result.fused_code:
			
 
				+                lines.append(f"/* --- {func_name} --- */")
			
 
				+                lines.append(result.fused_code[func_name])
			
 
				+                lines.append("")
			
 
				+        
			
 
				+        return '\n'.join(lines)
			
 
				+    
			
 
				+    def _generate_all_fused_code(
			
 
				+        self,
			
 
				+        results: List[ProcessingResult],
			
 
				+        target_code: str
			
 
				+    ) -> str:
			
 
				+        """
			
 
				+        Generate a summary file containing all fused code.
			
 
				+        """
			
 
				+        lines = []
			
 
				+        
			
 
				+        # File header
			
 
				+        lines.append("/*")
			
 
				+        lines.append(" * All Fused Code - Summary File")
			
 
				+        lines.append(f" * Total Groups: {len([r for r in results if r.success])}")
			
 
				+        lines.append(" *")
			
 
				+        lines.append(" * Original Target Code:")
			
 
				+        for line in target_code.strip().split('\n'):
			
 
				+            lines.append(f" *   {line}")
			
 
				+        lines.append(" *")
			
 
				+        lines.append(" * Generated by Code Fusion Tool")
			
 
				+        lines.append(" */")
			
 
				+        lines.append("")
			
 
				+        
			
 
				+        lines.append("#include <stdio.h>")
			
 
				+        lines.append("#include <stdlib.h>")
			
 
				+        lines.append("#include <string.h>")
			
 
				+        lines.append("")
			
 
				+        
			
 
				+        # Each successful group
			
 
				+        for result in results:
			
 
				+            if not result.success or not result.fused_code:
			
 
				+                continue
			
 
				+            
			
 
				+            lines.append("")
			
 
				+            lines.append("/*" + "=" * 76 + "*/")  # valid C comment banner, still 80 columns wide
			
 
				+            lines.append(f"/* GROUP {result.group_index}: {' -> '.join(result.call_chain)} */")
			
 
				+            lines.append("/*" + "=" * 76 + "*/")  # valid C comment banner, still 80 columns wide
			
 
				+            lines.append("")
			
 
				+            
			
 
				+            # Choose the variable declarations based on the passing method
			
 
				+            if result.passing_method == "parameter":
			
 
				+                lines.append("/* === Shared State (Parameter Passing Method) === */")
			
 
				+                if result.parameter_struct:
			
 
				+                    lines.append(result.parameter_struct)
			
 
				+                else:
			
 
				+                    lines.append("typedef struct { int secret; int key; } FusionState;")
			
 
				+                lines.append("/* Pass FusionState* fusion_state to each function */")
			
 
				+            else:
			
 
				+                lines.append("/* === Shared State Variables (Global) === */")
			
 
				+                if result.global_declarations:
			
 
				+                    lines.append(result.global_declarations)
			
 
				+                else:
			
 
				+                    lines.append("static int g_secret;")
			
 
				+                    lines.append("static int g_key;")
			
 
				+            lines.append("")
			
 
				+            
			
 
				+            # Function definitions
			
 
				+            for func_name in reversed(result.call_chain):
			
 
				+                if func_name in result.fused_code:
			
 
				+                    lines.append(f"/* {func_name} */")
			
 
				+                    lines.append(result.fused_code[func_name])
			
 
				+                    lines.append("")
			
 
				+        
			
 
				+        return '\n'.join(lines)
			
 
				+    
			
 
				+    def _extract_function_signature(self, func_code: str) -> Optional[str]:
			
 
				+        """
			
 
				+        Extract the function signature from the function code.
			
 
				+        """
			
 
				+        # Take everything before the first '{'
			
 
				+        brace_pos = func_code.find('{')
			
 
				+        if brace_pos == -1:
			
 
				+            return None
			
 
				+        
			
 
				+        sig = func_code[:brace_pos].strip()
			
 
				+        # Collapse redundant whitespace and newlines
			
 
				+        sig = ' '.join(sig.split())
			
 
				+        return sig
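Review note: the signature extraction above can be tried in isolation. The sketch below is a minimal standalone version, assuming the same "everything before the first `{`" heuristic; the name `extract_signature` and the sample C source are hypothetical and only serve as illustration. A `{` that appears before the function body (for example inside a leading comment) would confuse this heuristic.

```python
from typing import Optional


def extract_signature(func_code: str) -> Optional[str]:
    """Return the text before the first '{' with whitespace collapsed, or None."""
    brace_pos = func_code.find('{')
    if brace_pos == -1:
        return None
    # split() without arguments drops newlines and runs of spaces
    return ' '.join(func_code[:brace_pos].split())


# Hypothetical C function whose signature spans several lines
c_source = """static int
add_values(int a,
           int b)
{
    return a + b;
}"""

signature = extract_signature(c_source)
# Appending ';' turns the signature into a forward declaration
prototype = f"{signature};" if signature else None
print(prototype)
```

Appending `;` to the collapsed signature is how the generated file obtains its forward declarations, so a multi-line signature ends up as a single-line prototype.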
			
 
				+
			
 
				+
			
 
				+def demo_analysis(input_path: str):
			
 
				+    """
			
 
				+    Demonstrate the analysis features (without calling the LLM).
			
 
				+    """
			
 
				+    print("=" * 60)
			
 
				+    print("Code Fusion Analysis Demo")
			
 
				+    print("=" * 60)
			
 
				+    
			
 
				+    # Load the data
			
 
				+    with open(input_path, 'r', encoding='utf-8') as f:
			
 
				+        data = json.load(f)
			
 
				+    
			
 
				+    groups = data.get('groups', [])
			
 
				+    print(f"\nTotal groups: {len(groups)}")
			
 
				+    
			
 
				+    # Analyze the first few groups
			
 
				+    for i, group in enumerate(groups[:5]):
			
 
				+        print(f"\n--- Group {i + 1} ---")
			
 
				+        
			
 
				+        call_depth = group.get('call_depth', 0)
			
 
				+        call_chain = group.get('longest_call_chain', [])
			
 
				+        functions = group.get('functions', [])
			
 
				+        
			
 
				+        print(f"Call depth: {call_depth}")
			
 
				+        print(f"Call chain: {' -> '.join(call_chain)}")
			
 
				+        print(f"Functions count: {len(functions)}")
			
 
				+        
			
 
				+        # Analyze each function
			
 
				+        for func_data in functions[:3]:
			
 
				+            code = func_data.get('func', '')  # analyze the full body; only the preview below is truncated
			
 
				+            cfg = analyze_code_cfg(code)
			
 
				+            fusion_points = get_fusion_points(cfg)
			
 
				+            
			
 
				+            print(f"\n  Function: {cfg.function_name}")
			
 
				+            print(f"  Blocks: {len(cfg.blocks)}")
			
 
				+            print(f"  Fusion points: {len(fusion_points)}")
			
 
				+            print(f"  Code preview: {code[:100]}...")
			
 
				+
			
 
				+
			
 
				+def main():
			
 
				+    parser = argparse.ArgumentParser(
			
 
				+        description='Code Fusion - call-chain analysis and code fusion tool',
			
 
				+        formatter_class=argparse.RawDescriptionHelpFormatter,
			
 
				+        epilog="""
			
 
				+Examples:
			
 
				+  # Analyze data with a call-chain depth of 4
			
 
				+  python main.py --input output/primevul_valid_grouped_depth_4.json --analyze-only
			
 
				+  
			
 
				+  # Run code fusion
			
 
				+  python main.py --input output/primevul_valid_grouped_depth_4.json \\
			
 
				+                 --output output/fusion_results.json \\
			
 
				+                 --target-code "int secret = 42; printf(\\"secret: %d\\n\\", secret);"
			
 
				+                 
			
 
				+  # Use a source file as the target code
			
 
				+  python main.py --input output/primevul_valid_grouped_depth_4.json \\
			
 
				+                 --output output/fusion_results.json \\
			
 
				+                 --target-file target_code.c
			
 
				+        """
			
 
				+    )
			
 
				+    
			
 
				+    parser.add_argument(
			
 
				+        '--input', '-i',
			
 
				+        type=str,
			
 
				+        required=True,
			
 
				+        help='Path to the input grouped JSON file'
			
 
				+    )
			
 
				+    
			
 
				+    parser.add_argument(
			
 
				+        '--output', '-o',
			
 
				+        type=str,
			
 
				+        default=None,
			
 
				+        help='Path to the output file'
			
 
				+    )
			
 
				+    
			
 
				+    parser.add_argument(
			
 
				+        '--target-code', '-t',
			
 
				+        type=str,
			
 
				+        default=None,
			
 
				+        help='Target code string to fuse'
			
 
				+    )
			
 
				+    
			
 
				+    parser.add_argument(
			
 
				+        '--target-file', '-f',
			
 
				+        type=str,
			
 
				+        default=None,
			
 
				+        help='Path to a file containing the target code to fuse'
			
 
				+    )
			
 
				+    
			
 
				+    parser.add_argument(
			
 
				+        '--max-groups', '-m',
			
 
				+        type=int,
			
 
				+        default=5,
			
 
				+        help='Maximum number of groups to process (default: 5)'
			
 
				+    )
			
 
				+    
			
 
				+    parser.add_argument(
			
 
				+        '--analyze-only', '-a',
			
 
				+        action='store_true',
			
 
				+        help='Only run the analysis; do not perform fusion'
			
 
				+    )
			
 
				+    
			
 
				+    parser.add_argument(
			
 
				+        '--method',
			
 
				+        type=str,
			
 
				+        choices=['global', 'parameter'],
			
 
				+        default='global',
			
 
				+        help='Variable passing method: global (global variables) or parameter (parameter passing) (default: global)'
			
 
				+    )
			
 
				+    
			
 
				+    args = parser.parse_args()
			
 
				+    
			
 
				+    # Check the input file
			
 
				+    if not os.path.exists(args.input):
			
 
				+        print(f"Error: Input file not found: {args.input}")
			
 
				+        sys.exit(1)
			
 
				+    
			
 
				+    # Analysis-only mode
			
 
				+    if args.analyze_only:
			
 
				+        demo_analysis(args.input)
			
 
				+        return
			
 
				+    
			
 
				+    # Get the target code
			
 
				+    target_code = args.target_code
			
 
				+    if args.target_file:
			
 
				+        if os.path.exists(args.target_file):
			
 
				+            with open(args.target_file, 'r', encoding='utf-8') as f:
			
 
				+                target_code = f.read()
			
 
				+        else:
			
 
				+            print(f"Error: Target file not found: {args.target_file}")
			
 
				+            sys.exit(1)
			
 
				+    
			
 
				+    if not target_code:
			
 
				+        # Fall back to the default example code
			
 
				+        target_code = """
			
 
				+        // Example target code to be fused
			
 
				+        int secret_value = 0x12345678;
			
 
				+        int key = secret_value ^ 0xDEADBEEF;
			
 
				+        printf("Computed key: 0x%x\\n", key);
			
 
				+        """
			
 
				+        print("Using default example target code.")
			
 
				+    
			
 
				+    # Set the default output path
			
 
				+    if args.output is None:
			
 
				+        base_name = os.path.splitext(os.path.basename(args.input))[0]
			
 
				+        output_dir = os.path.dirname(args.input) or '.'
			
 
				+        args.output = os.path.join(output_dir, f'{base_name}_fused.json')
			
 
				+    
			
 
				+    # Create the processor and run it
			
 
				+    processor = CodeFusionProcessor()
			
 
				+    
			
 
				+    print(f"Using variable passing method: {args.method}")
			
 
				+    
			
 
				+    results = processor.process_file(
			
 
				+        args.input,
			
 
				+        args.output,
			
 
				+        target_code,
			
 
				+        args.max_groups,
			
 
				+        args.method
			
 
				+    )
			
 
				+    
			
 
				+    # Print the summary
			
 
				+    successful = sum(1 for r in results if r.success)
			
 
				+    print(f"\n{'=' * 60}")
			
 
				+    print("Processing Summary")
			
 
				+    print(f"{'=' * 60}")
			
 
				+    print(f"Total processed: {len(results)}")
			
 
				+    print(f"Successful: {successful}")
			
 
				+    print(f"Failed: {len(results) - successful}")
			
 
				+
			
 
				+
			
 
				+if __name__ == '__main__':
			
 
				+    main()
			
 
				+
			
--- a/src/requirements.txt
+++ b/src/requirements.txt
@@ -0,0 +1,8 @@
 
				+# Code Fusion Project Dependencies
			
 
				+openai>=1.0.0
			
 
				+tree-sitter>=0.20.0
			
 
				+tree-sitter-c>=0.20.0
			
 
				+tree-sitter-cpp>=0.20.0
			
 
				+networkx>=3.0
			
 
				+graphviz>=0.20
			
 
				+
			
--- a/utils/data_process/extract_call_relations.py
+++ b/utils/data_process/extract_call_relations.py
@@ -0,0 +1,501 @@
 
				+#!/usr/bin/env python3
			
 
				+# -*- coding: utf-8 -*-
			
 
				+"""
			
 
				+Analyze caller/callee relations between functions and merge functions connected by calls into groups.
			
 
				+"""
			
 
				+
			
 
				+import json
			
 
				+import re
			
 
				+import os
			
 
				+import argparse
			
 
				+from collections import defaultdict
			
 
				+from typing import Dict, List, Set, Tuple, Optional
			
 
				+
			
 
				+
			
 
				+# Common C/C++ library functions and system calls; these should not be used to link different function groups
			
 
				+COMMON_LIB_FUNCTIONS = {
			
 
				+    # Memory management
			
 
				+    'malloc', 'calloc', 'realloc', 'free', 'memcpy', 'memmove', 'memset',
			
 
				+    'memcmp', 'memchr', 'alloca', 'aligned_alloc',
			
 
				+    # String handling
			
 
				+    'strlen', 'strcpy', 'strncpy', 'strcat', 'strncat', 'strcmp', 'strncmp',
			
 
				+    'strchr', 'strrchr', 'strstr', 'strtok', 'strdup', 'strndup', 'strspn',
			
 
				+    'strcspn', 'strpbrk', 'strerror', 'sprintf', 'snprintf', 'vsprintf',
			
 
				+    'vsnprintf', 'sscanf',
			
 
				+    # Input/output
			
 
				+    'printf', 'fprintf', 'vprintf', 'vfprintf', 'puts', 'fputs', 'putc',
			
 
				+    'fputc', 'putchar', 'gets', 'fgets', 'getc', 'fgetc', 'getchar',
			
 
				+    'scanf', 'fscanf', 'fopen', 'fclose', 'fread', 'fwrite', 'fseek',
			
 
				+    'ftell', 'rewind', 'fflush', 'feof', 'ferror', 'clearerr', 'perror',
			
 
				+    # Type conversion
			
 
				+    'atoi', 'atol', 'atoll', 'atof', 'strtol', 'strtoll', 'strtoul',
			
 
				+    'strtoull', 'strtof', 'strtod', 'strtold',
			
 
				+    # Math functions
			
 
				+    'abs', 'labs', 'llabs', 'fabs', 'floor', 'ceil', 'round', 'sqrt',
			
 
				+    'pow', 'exp', 'log', 'log10', 'sin', 'cos', 'tan', 'asin', 'acos',
			
 
				+    'atan', 'atan2', 'min', 'max',
			
 
				+    # Time functions
			
 
				+    'time', 'clock', 'difftime', 'mktime', 'strftime', 'localtime',
			
 
				+    'gmtime', 'asctime', 'ctime', 'gettimeofday', 'sleep', 'usleep',
			
 
				+    'nanosleep',
			
 
				+    # Processes and signals
			
 
				+    'exit', 'abort', '_exit', 'atexit', 'system', 'getenv', 'setenv',
			
 
				+    'fork', 'exec', 'execl', 'execv', 'execle', 'execve', 'execlp',
			
 
				+    'execvp', 'wait', 'waitpid', 'kill', 'signal', 'raise',
			
 
				+    # Assertions and error handling
			
 
				+    'assert', 'errno', 'setjmp', 'longjmp',
			
 
				+    # POSIX and system calls
			
 
				+    'open', 'close', 'read', 'write', 'lseek', 'stat', 'fstat', 'lstat',
			
 
				+    'access', 'chmod', 'chown', 'link', 'unlink', 'rename', 'mkdir',
			
 
				+    'rmdir', 'opendir', 'closedir', 'readdir', 'getcwd', 'chdir',
			
 
				+    'pipe', 'dup', 'dup2', 'fcntl', 'ioctl', 'select', 'poll', 'mmap',
			
 
				+    'munmap', 'mprotect', 'socket', 'bind', 'listen', 'accept', 'connect',
			
 
				+    'send', 'recv', 'sendto', 'recvfrom', 'shutdown', 'setsockopt',
			
 
				+    'getsockopt', 'pthread_create', 'pthread_join', 'pthread_exit',
			
 
				+    'pthread_mutex_lock', 'pthread_mutex_unlock', 'pthread_cond_wait',
			
 
				+    'pthread_cond_signal',
			
 
				+    # Common C++ names
			
 
				+    'std', 'make_shared', 'make_unique', 'move', 'forward', 'swap',
			
 
				+    'begin', 'end', 'size', 'empty', 'push_back', 'pop_back', 'front',
			
 
				+    'back', 'insert', 'erase', 'clear', 'find', 'count', 'sort',
			
 
				+    'unique', 'reverse', 'copy', 'fill', 'transform', 'accumulate',
			
 
				+    # Type and assertion checks
			
 
				+    'static_assert', 'ASSERT', 'DCHECK', 'CHECK', 'EXPECT', 'VERIFY',
			
 
				+    # Logging
			
 
				+    'LOG', 'DLOG', 'VLOG', 'ERR', 'WARN', 'INFO', 'DEBUG', 'TRACE',
			
 
				+    # Other common macros/functions
			
 
				+    'DISALLOW_COPY_AND_ASSIGN', 'NOTREACHED', 'UNIMPLEMENTED',
			
 
				+    'offsetof', 'container_of', 'likely', 'unlikely', 'BUG', 'BUG_ON',
			
 
				+    'WARN_ON', 'IS_ERR', 'PTR_ERR', 'ERR_PTR', 'ERR_CAST',
			
 
				+    # Testing
			
 
				+    'TEST', 'TEST_F', 'TEST_P', 'EXPECT_TRUE', 'EXPECT_FALSE',
			
 
				+    'EXPECT_EQ', 'EXPECT_NE', 'EXPECT_LT', 'EXPECT_LE', 'EXPECT_GT',
			
 
				+    'EXPECT_GE', 'ASSERT_TRUE', 'ASSERT_FALSE', 'ASSERT_EQ', 'ASSERT_NE',
			
 
				+    'MOCK_METHOD', 'INSTANTIATE_TEST_SUITE_P',
			
 
				+}
			
 
				+
			
 
				+
			
 
				+def extract_function_name(func_code: str) -> Optional[str]:
			
 
				+    """
			
 
				+    Extract the function name from function code.
			
 
				+    Supports C/C++ style function definitions.
			
 
				+    """
			
 
				+    # Strip comments
			
 
				+    code = re.sub(r'//.*?\n', '\n', func_code)
			
 
				+    code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
			
 
				+    
			
 
				+    # Patterns that match function definitions
			
 
				+    # Format: [return type] [ClassName::]function_name(parameter list)
			
 
				+    patterns = [
			
 
				+        # C++ member function: ReturnType ClassName::FunctionName(...)
			
 
				+        r'(?:[\w\s\*&<>,]+?)\s+(\w+::~?\w+)\s*\([^)]*\)\s*(?:const)?\s*(?:override)?\s*(?:final)?\s*(?:\{|:)',
			
 
				+        # Constructor/destructor: ClassName::ClassName(...) or ClassName::~ClassName(...)
			
 
				+        r'^[\s]*(\w+::~?\w+)\s*\([^)]*\)\s*(?:\{|:)',
			
 
				+        # Plain C function: ReturnType FunctionName(...)
			
 
				+        r'(?:[\w\s\*&<>,]+?)\s+(\w+)\s*\([^)]*\)\s*\{',
			
 
				+        # Simple fallback pattern
			
 
				+        r'^\s*(?:static\s+)?(?:inline\s+)?(?:virtual\s+)?(?:[\w\*&<>,\s]+)\s+(\w+)\s*\(',
			
 
				+    ]
			
 
				+    
			
 
				+    for pattern in patterns:
			
 
				+        match = re.search(pattern, code, re.MULTILINE)
			
 
				+        if match:
			
 
				+            func_name = match.group(1)
			
 
				+            # For the ClassName::FunctionName form, keep only the function name
			
 
				+            if '::' in func_name:
			
 
				+                func_name = func_name.split('::')[-1]
			
 
				+            return func_name
			
 
				+    
			
 
				+    return None
			
 
				+
			
 
				+
			
 
				+def extract_function_calls(
			
 
				+    func_code: str, 
			
 
				+    self_name: Optional[str] = None,
			
 
				+    exclude_common_libs: bool = True
			
 
				+) -> Set[str]:
			
 
				+    """
			
 
				+    Extract the names of all functions called (callees) in the function code.
			
 
				+    
			
 
				+    Args:
			
 
				+        func_code: Function source code
			
 
				+        self_name: Name of the current function (excluded from the result)
			
 
				+        exclude_common_libs: Whether to exclude common library functions
			
 
				+    """
			
 
				+    # Strip comments and string literals
			
 
				+    code = re.sub(r'//.*?\n', '\n', func_code)
			
 
				+    code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
			
 
				+    code = re.sub(r'"(?:[^"\\]|\\.)*"', '""', code)  # empty out string literals
			
 
				+    code = re.sub(r"'(?:[^'\\]|\\.)*'", "''", code)  # empty out character literals
			
 
				+    
			
 
				+    # Extract function calls of the form: name(
			
 
				+    # Exclude keywords and other common constructs that are not calls
			
 
				+    keywords = {
			
 
				+        'if', 'else', 'while', 'for', 'switch', 'case', 'return', 'break',
			
 
				+        'continue', 'sizeof', 'typeof', 'alignof', 'decltype', 'static_cast',
			
 
				+        'dynamic_cast', 'reinterpret_cast', 'const_cast', 'new', 'delete',
			
 
				+        'throw', 'catch', 'try', 'namespace', 'class', 'struct', 'enum',
			
 
				+        'union', 'typedef', 'using', 'template', 'typename', 'public',
			
 
				+        'private', 'protected', 'virtual', 'override', 'final', 'explicit',
			
 
				+        'inline', 'static', 'extern', 'const', 'volatile', 'mutable',
			
 
				+        'register', 'auto', 'default', 'goto', 'asm', '__asm', '__asm__',
			
 
				+    }
			
 
				+    
			
 
				+    # Match function calls
			
 
				+    pattern = r'\b([a-zA-Z_]\w*)\s*\('
			
 
				+    matches = re.findall(pattern, code)
			
 
				+    
			
 
				+    # Filter out keywords, the function itself, and common library functions
			
 
				+    callees = set()
			
 
				+    for name in matches:
			
 
				+        if name in keywords:
			
 
				+            continue
			
 
				+        if self_name is not None and name == self_name:
			
 
				+            continue
			
 
				+        if exclude_common_libs and name in COMMON_LIB_FUNCTIONS:
			
 
				+            continue
			
 
				+        callees.add(name)
			
 
				+    
			
 
				+    return callees
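Review note: the regex-based callee extraction above can be illustrated on a small input. The following is a simplified sketch, not the function from this diff: `find_callees`, the shortened `KEYWORDS` and `LIB_FUNCTIONS` sets, and the sample C function are all assumptions made for illustration. It shows that calls mentioned only in comments or string literals are ignored, and that the function's own name is dropped.

```python
import re
from typing import Optional, Set

# Shortened stand-ins for the keyword and library-function sets
KEYWORDS = {'if', 'while', 'for', 'switch', 'return', 'sizeof'}
LIB_FUNCTIONS = {'printf', 'malloc', 'free'}


def find_callees(func_code: str, self_name: Optional[str] = None) -> Set[str]:
    """Collect identifiers followed by '(' after stripping comments and strings."""
    code = re.sub(r'//[^\n]*', '', func_code)                # line comments
    code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)   # block comments
    code = re.sub(r'"(?:[^"\\]|\\.)*"', '""', code)          # string literals
    names = re.findall(r'\b([a-zA-Z_]\w*)\s*\(', code)
    return {
        name for name in names
        if name not in KEYWORDS
        and name not in LIB_FUNCTIONS
        and name != self_name
    }


sample = '''
int handle(int fd) {
    // parse_header(fd) is disabled for now
    if (check(fd)) {
        printf("calling helper(%d)\\n", fd);
        return decode(fd) + handle(fd - 1);
    }
    return 0;
}
'''

callees = find_callees(sample, self_name='handle')
print(sorted(callees))
```

Because this is a purely lexical heuristic, function-like macros and declarations of other functions inside the body would also be reported as callees; a parser-based approach (the project already lists tree-sitter as a dependency) would be needed to tell them apart.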
			
 
				+
			
 
				+
			
 
				+def load_jsonl(file_path: str) -> List[Dict]:
			
 
				+    """
			
 
				+    Load a JSONL file.
			
 
				+    """
			
 
				+    data = []
			
 
				+    with open(file_path, 'r', encoding='utf-8') as f:
			
 
				+        for line in f:
			
 
				+            line = line.strip()
			
 
				+            if line:
			
 
				+                data.append(json.loads(line))
			
 
				+    return data
			
 
				+
			
 
				+
			
 
				+def build_call_graph(
			
 
				+    records: List[Dict],
			
 
				+    exclude_common_libs: bool = True
			
 
				+) -> Tuple[Dict[str, Set[str]], Dict[int, str], Dict[str, List[int]]]:
			
 
				+    """
			
 
				+    Build the function call graph.
			
 
				+    
			
 
				+    Args:
			
 
				+        records: List of data records
			
 
				+        exclude_common_libs: Whether to exclude common library functions
			
 
				+    
			
 
				+    Returns:
			
 
				+        - call_graph: {function name: {set of called function names}}
			
 
				+        - idx_to_func: {record index: function name}
			
 
				+        - func_to_idxs: {function name: [record indices]} (one function name may map to several records)
			
 
				+    """
			
 
				+    call_graph = {}
			
 
				+    idx_to_func = {}
			
 
				+    func_to_idxs = defaultdict(list)
			
 
				+    
			
 
				+    for i, record in enumerate(records):
			
 
				+        func_code = record.get('func', '')
			
 
				+        func_name = extract_function_name(func_code)
			
 
				+        
			
 
				+        if func_name:
			
 
				+            callees = extract_function_calls(func_code, func_name, exclude_common_libs)
			
 
				+            call_graph.setdefault(func_name, set()).update(callees)  # merge callees when several records share a name
			
 
				+            idx_to_func[i] = func_name
			
 
				+            func_to_idxs[func_name].append(i)
			
 
				+    
			
 
				+    return call_graph, idx_to_func, func_to_idxs
			
 
				+
			
 
				+
			
 
				+def find_high_frequency_functions(
			
 
				+    call_graph: Dict[str, Set[str]],
			
 
				+    all_funcs: Set[str],
			
 
				+    threshold_percentile: float = 99.0
			
 
				+) -> Set[str]:
			
 
				+    """
			
 
				+    Find functions that are called very frequently (likely generic utility functions).
			
 
				+    
			
 
				+    Args:
			
 
				+        call_graph: Function call graph
			
 
				+        all_funcs: All function names in the dataset
			
 
				+        threshold_percentile: Threshold percentile (default: 99)
			
 
				+    
			
 
				+    Returns:
			
 
				+        Set of frequently called functions
			
 
				+    """
			
 
				+    # Count how many distinct callers invoke each function
			
 
				+    callee_count = defaultdict(int)
			
 
				+    for callees in call_graph.values():
			
 
				+        for callee in callees:
			
 
				+            if callee in all_funcs:
			
 
				+                callee_count[callee] += 1
			
 
				+    
			
 
				+    if not callee_count:
			
 
				+        return set()
			
 
				+    
			
 
				+    # Compute the threshold
			
 
				+    counts = sorted(callee_count.values())
			
 
				+    threshold_idx = int(len(counts) * threshold_percentile / 100)
			
 
				+    threshold = counts[min(threshold_idx, len(counts) - 1)]
			
 
				+    
			
 
				+    # Only filter when the threshold reaches a minimum value (to avoid dropping ordinary call relations)
			
 
				+    if threshold < 10:
			
 
				+        return set()
			
 
				+    
			
 
				+    high_freq_funcs = {fn for fn, count in callee_count.items() if count >= threshold}
			
 
				+    return high_freq_funcs
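Review note: the percentile cut-off above is easier to follow with concrete numbers. The sketch below is a standalone approximation of the same idea; the function name `high_frequency`, the `min_threshold` parameter, and the toy call graph are assumptions for illustration, not part of this diff.

```python
from collections import defaultdict
from typing import Dict, Set


def high_frequency(
    call_graph: Dict[str, Set[str]],
    percentile: float = 99.0,
    min_threshold: int = 10,
) -> Set[str]:
    """Return callees whose caller count reaches the given percentile."""
    counts = defaultdict(int)
    for callees in call_graph.values():
        for callee in callees:
            if callee in call_graph:  # only count functions present in the dataset
                counts[callee] += 1
    if not counts:
        return set()
    ordered = sorted(counts.values())
    idx = min(int(len(ordered) * percentile / 100), len(ordered) - 1)
    threshold = ordered[idx]
    if threshold < min_threshold:  # too few callers to be considered a utility
        return set()
    return {name for name, count in counts.items() if count >= threshold}


# Toy graph: 12 callers all call `log_msg`; each also calls its own helper
graph = {f"caller_{i}": {"log_msg", f"helper_{i}"} for i in range(12)}
graph["log_msg"] = set()
graph.update({f"helper_{i}": set() for i in range(12)})

frequent = high_frequency(graph)
print(frequent)
```

In the toy graph the sorted caller counts are twelve 1s followed by a single 12, so the 99th-percentile index lands on 12 and only `log_msg` is reported. With a small graph where no function has at least 10 callers, nothing is filtered.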
			
 
				+
			
 
				+
			
 
				+def find_related_groups(
			
 
				+    records: List[Dict],
			
 
				+    call_graph: Dict[str, Set[str]],
			
 
				+    func_to_idxs: Dict[str, List[int]],
			
 
				+    auto_filter_high_freq: bool = True,
			
 
				+    high_freq_threshold: float = 99.0
			
 
				+) -> List[List[Dict]]:
			
 
				+    """
			
 
				+    Find groups of functions related by calls.
			
 
				+    Merges functions connected by call relations by finding connected components with BFS.
			
 
				+    
			
 
				+    Args:
			
 
				+        records: List of data records
			
 
				+        call_graph: Function call graph
			
 
				+        func_to_idxs: Mapping from function name to record indices
			
 
				+        auto_filter_high_freq: Whether to automatically filter out frequently called functions
			
 
				+        high_freq_threshold: Threshold percentile for frequently called functions
			
 
				+    """
			
 
				+    # Collect all function names
			
 
				+    all_funcs = set(call_graph.keys())
			
 
				+    
			
 
				+    # Find frequently called functions
			
 
				+    high_freq_funcs = set()
			
 
				+    if auto_filter_high_freq:
			
 
				+        high_freq_funcs = find_high_frequency_functions(
			
 
				+            call_graph, all_funcs, high_freq_threshold
			
 
				+        )
			
 
				+        if high_freq_funcs:
			
 
				+            print(f"  Automatically filtered {len(high_freq_funcs)} frequently called functions")
			
 
				+    
			
 
				+    # Keep only call relations whose endpoints exist in the dataset
			
 
				+    # Build a bidirectional relation graph (caller -> callee, callee -> caller)
			
 
				+    related_graph = defaultdict(set)
			
 
				+    
			
 
				+    for caller, callees in call_graph.items():
			
 
				+        for callee in callees:
			
 
				+            # Add an edge only when the callee is also in our dataset
				+            # and is not a frequently called function
			
 
				+            if callee in all_funcs and callee not in high_freq_funcs:
			
 
				+                related_graph[caller].add(callee)
			
 
				+                related_graph[callee].add(caller)
			
 
				+    
			
 
				+    # Find connected components with BFS
			
 
				+    visited = set()
			
 
				+    groups = []
			
 
				+    
			
 
				+    for func_name in all_funcs:
			
 
				+        if func_name not in visited:
			
 
				+            # BFS to collect every function connected to func_name
			
 
				+            group_funcs = set()
			
 
				+            queue = [func_name]
			
 
				+            
			
 
				+            while queue:
			
 
				+                current = queue.pop(0)
			
 
				+                if current in visited:
			
 
				+                    continue
			
 
				+                visited.add(current)
			
 
				+                group_funcs.add(current)
			
 
				+                
			
 
				+                # Enqueue related functions
			
 
				+                for related in related_graph.get(current, []):
			
 
				+                    if related not in visited:
			
 
				+                        queue.append(related)
			
 
				+            
			
 
				+            # Map function names back to their records
			
 
				+            group_records = []
			
 
				+            for fn in group_funcs:
			
 
				+                for idx in func_to_idxs.get(fn, []):
			
 
				+                    group_records.append(records[idx])
			
 
				+            
			
 
				+            if group_records:
			
 
				+                groups.append(group_records)
			
 
				+    
			
 
				+    return groups
			
 
				+
			
 
				+
			
 
				+def process_file(
			
 
				+    input_path: str, 
			
 
				+    output_path: str, 
			
 
				+    min_group_size: int = 1,
			
 
				+    max_group_size: int = 0,
			
 
				+    exclude_common_libs: bool = True
			
 
				+):
			
 
				+    """
			
 
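				+    Process a single JSONL file.
				+    
				+    Args:
				+        input_path: input file path
				+        output_path: output file path
				+        min_group_size: minimum group size (default 1; set to 2 to keep only groups with call relationships)
				+        max_group_size: maximum group size (0 means unlimited; larger groups are split into single-record groups)
				+        exclude_common_libs: whether to exclude common library functions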
			
 
				+    """
			
 
				+    print(f"Loading data: {input_path}")
				+    records = load_jsonl(input_path)
				+    print(f"Loaded {len(records)} records")
				+    
				+    print("Building the function call graph...")
				+    call_graph, idx_to_func, func_to_idxs = build_call_graph(records, exclude_common_libs)
				+    print(f"Identified {len(call_graph)} functions")
				+    
				+    print("Analyzing call relationships and merging related functions...")
			
 
				+    groups = find_related_groups(
			
 
				+        records, call_graph, func_to_idxs,
			
 
				+        auto_filter_high_freq=True,
			
 
				+        high_freq_threshold=99.0
			
 
				+    )
			
 
				+    
			
 
				+    # Handle oversized groups: if max_group_size is set, split them into single-record groups
			
 
				+    if max_group_size > 0:
			
 
				+        new_groups = []
			
 
				+        oversized_count = 0
			
 
				+        for g in groups:
			
 
				+            if len(g) > max_group_size:
			
 
				+                oversized_count += 1
			
 
				+                # Turn each record of the oversized group into its own group
			
 
				+                for record in g:
			
 
				+                    new_groups.append([record])
			
 
				+            else:
			
 
				+                new_groups.append(g)
			
 
				+        if oversized_count > 0:
			
 
				+            print(f"  (split {oversized_count} oversized groups into single records)")
			
 
				+        groups = new_groups
			
 
				+    
			
 
				+    # Filter by group size
			
 
				+    if min_group_size > 1:
			
 
				+        groups = [g for g in groups if len(g) >= min_group_size]
			
 
				+    
			
 
				+    # Statistics
			
 
				+    total_funcs = sum(len(g) for g in groups)
			
 
				+    groups_with_relations = [g for g in groups if len(g) > 1]
			
 
				+    single_func_groups = len([g for g in groups if len(g) == 1])
			
 
				+    
			
 
				+    # Group-size distribution
				+    size_distribution = defaultdict(int)
				+    for g in groups:
				+        size = len(g)
				+        if size == 1:
				+            size_distribution["1 (single function)"] += 1
				+        elif size <= 5:
				+            size_distribution["2-5"] += 1
				+        elif size <= 10:
				+            size_distribution["6-10"] += 1
				+        elif size <= 50:
				+            size_distribution["11-50"] += 1
				+        elif size <= 100:
				+            size_distribution["51-100"] += 1
				+        elif size <= 500:
				+            size_distribution["101-500"] += 1
				+        elif size <= 1000:
				+            size_distribution["501-1000"] += 1
				+        else:
				+            size_distribution["1000+"] += 1
				+    
				+    print("\n==================== Statistics ====================")
				+    print(f"  Total records (original): {len(records)}")
				+    print(f"  Total functions (after grouping): {total_funcs}")
				+    print(f"  Total groups: {len(groups)}")
				+    print(f"    - Single-function groups (no call relationships): {single_func_groups}")
				+    print(f"    - Groups with call relationships (size > 1): {len(groups_with_relations)}")
				+    
				+    if groups_with_relations:
				+        actual_max_size = max(len(g) for g in groups_with_relations)
				+        avg_group_size = sum(len(g) for g in groups_with_relations) / len(groups_with_relations)
				+        print(f"  Largest group size: {actual_max_size}")
				+        print(f"  Average size of groups with relationships: {avg_group_size:.2f}")
				+    
				+    print("\n  Group-size distribution:")
				+    # Print the buckets in a fixed order; the keys must match those used above
				+    order = ["1 (single function)", "2-5", "6-10", "11-50", "51-100", "101-500", "501-1000", "1000+"]
				+    for key in order:
				+        if key in size_distribution:
				+            count = size_distribution[key]
				+            percentage = count / len(groups) * 100
				+            print(f"    - Size {key}: {count} groups ({percentage:.1f}%)")
				+    print("====================================================")
			
 
				+    
			
 
				+    # Write the results
			
 
				+    output_data = {
			
 
				+        "metadata": {
			
 
				+            "source_file": os.path.basename(input_path),
			
 
				+            "total_records": len(records),
			
 
				+            "total_functions_grouped": total_funcs,
			
 
				+            "total_groups": len(groups),
			
 
				+            "single_function_groups": single_func_groups,
			
 
				+            "groups_with_relations": len(groups_with_relations),
			
 
				+            "max_group_size": max(len(g) for g in groups) if groups else 0,
			
 
				+            "avg_related_group_size": round(sum(len(g) for g in groups_with_relations) / len(groups_with_relations), 2) if groups_with_relations else 0,
			
 
				+            "size_distribution": dict(size_distribution),
			
 
				+        },
			
 
				+        "groups": groups
			
 
				+    }
			
 
				+    
			
 
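				+    # Fall back to '.' when output_path has no directory part; os.makedirs('') raises FileNotFoundError
				+    os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)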
			
 
				+    with open(output_path, 'w', encoding='utf-8') as f:
			
 
				+        json.dump(output_data, f, ensure_ascii=False, indent=2)
			
 
				+    
			
 
				+    print(f"\nResults saved to: {output_path}")
			
 
				+
			
 
				+
			
 
				+def main():
			
 
				+    parser = argparse.ArgumentParser(description='Analyze call relationships between code functions')
			
 
				+    parser.add_argument(
			
 
				+        '--input', '-i',
			
 
				+        type=str,
			
 
				+        required=True,
			
 
				+        help='Path to the input JSONL file'
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        '--output', '-o',
			
 
				+        type=str,
			
 
				+        default=None,
			
 
				+        help='Path to the output JSON file (default: output/<input file name>_grouped.json)'
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        '--min-group-size', '-m',
			
 
				+        type=int,
			
 
				+        default=1,
			
 
				+        help='Minimum group size; set to 2 to keep only groups with call relationships (default: 1)'
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        '--max-group-size', '-M',
			
 
				+        type=int,
			
 
				+        default=0,
			
 
				+        help='Maximum group size; larger groups are split (0 means unlimited, default: 0)'
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        '--include-common-libs',
			
 
				+        action='store_true',
			
 
				+        default=False,
			
 
				+        help='Include common library functions as call relationships (excluded by default)'
			
 
				+    )
			
 
				+    
			
 
				+    args = parser.parse_args()
			
 
				+    
			
 
				+    # Set the default output path
			
 
				+    if args.output is None:
			
 
				+        base_name = os.path.splitext(os.path.basename(args.input))[0]
			
 
				+        # Go two levels up from the script directory (the project root)
			
 
				+        script_dir = os.path.dirname(os.path.abspath(__file__))
			
 
				+        project_root = os.path.dirname(os.path.dirname(script_dir))
			
 
				+        args.output = os.path.join(project_root, 'output', f'{base_name}_grouped.json')
			
 
				+    
			
 
				+    process_file(
			
 
				+        args.input, 
			
 
				+        args.output, 
			
 
				+        args.min_group_size,
			
 
				+        args.max_group_size,
			
 
				+        exclude_common_libs=not args.include_common_libs
			
 
				+    )
			
 
				+
			
 
				+
			
 
				+if __name__ == '__main__':
			
 
				+    main()
			
 
				+
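Review note on `find_related_groups`: the grouping step amounts to computing connected components of the undirected call graph. The sketch below illustrates that idea on a toy graph. It is not the code from this diff: `connected_groups` and `toy_graph` are hypothetical names, the high-frequency filter is left out, and `collections.deque` is swapped in for the list-based queue so that dequeuing is O(1).

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def connected_groups(call_graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group functions into connected components of the undirected call graph."""
    known = set(call_graph)
    # Build undirected adjacency, ignoring callees that are not in the dataset.
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for caller, callees in call_graph.items():
        for callee in callees & known:
            adjacency[caller].add(callee)
            adjacency[callee].add(caller)

    visited: Set[str] = set()
    groups: List[Set[str]] = []
    for start in sorted(known):  # sorted only to make the result order deterministic
        if start in visited:
            continue
        component: Set[str] = set()
        queue = deque([start])
        visited.add(start)
        while queue:
            current = queue.popleft()  # O(1), unlike list.pop(0)
            component.add(current)
            for neighbor in adjacency[current] - visited:
                visited.add(neighbor)
                queue.append(neighbor)
        groups.append(component)
    return groups


# Hypothetical toy call graph: caller name -> set of callee names.
toy_graph = {
    "parse": {"lex", "printf"},  # printf is not a key, so it is ignored
    "lex": set(),
    "render": {"draw"},
    "draw": set(),
    "orphan": set(),
}
groups = connected_groups(toy_graph)
```

Marking a node as visited when it is enqueued, rather than when it is dequeued, keeps each function in the queue at most once, which may matter for the very large groups that the size-distribution buckets anticipate.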
			
--- a/utils/data_process/filter_by_call_depth.py
+++ b/utils/data_process/filter_by_call_depth.py
@@ -0,0 +1,328 @@
 
				+#!/usr/bin/env python3
			
 
				+# -*- coding: utf-8 -*-
			
 
				+"""
			
 
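				+Select the groups with a specific call-chain depth from a grouped JSON file.
				+
				+Call-chain depth is the number of functions on the chain:
				+- caller -> callee has depth 2
				+- caller -> caller -> func has depth 3
				+- caller -> caller -> caller -> func has depth 4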
			
 
				+"""
			
 
				+
			
 
				+import json
			
 
				+import re
			
 
				+import os
			
 
				+import argparse
			
 
				+from collections import defaultdict
			
 
				+from typing import Dict, List, Set, Optional, Tuple
			
 
				+
			
 
				+
			
 
				+def extract_function_name(func_code: str) -> Optional[str]:
			
 
				+    """
			
 
				+    Extract the function name from a function's source code.
			
 
				+    """
			
 
				+    code = re.sub(r'//[^\n]*', '', func_code)  # also strips a // comment on the last line
			
 
				+    code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
			
 
				+    
			
 
				+    patterns = [
			
 
				+        r'(?:[\w\s\*&<>,]+?)\s+(\w+::~?\w+)\s*\([^)]*\)\s*(?:const)?\s*(?:override)?\s*(?:final)?\s*(?:\{|:)',
			
 
				+        r'^[\s]*(\w+::~?\w+)\s*\([^)]*\)\s*(?:\{|:)',
			
 
				+        r'(?:[\w\s\*&<>,]+?)\s+(\w+)\s*\([^)]*\)\s*\{',
			
 
				+        r'^\s*(?:static\s+)?(?:inline\s+)?(?:virtual\s+)?(?:[\w\*&<>,\s]+)\s+(\w+)\s*\(',
			
 
				+    ]
			
 
				+    
			
 
				+    for pattern in patterns:
			
 
				+        match = re.search(pattern, code, re.MULTILINE)
			
 
				+        if match:
			
 
				+            func_name = match.group(1)
			
 
				+            if '::' in func_name:
			
 
				+                func_name = func_name.split('::')[-1]
			
 
				+            return func_name
			
 
				+    
			
 
				+    return None
			
 
				+
			
 
				+
			
 
				+def extract_function_calls(func_code: str, self_name: Optional[str] = None) -> Set[str]:
			
 
				+    """
			
 
				+    Extract the names of all functions called in a function's source code.
			
 
				+    """
			
 
				+    code = re.sub(r'//[^\n]*', '', func_code)  # also strips a // comment on the last line
			
 
				+    code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
			
 
				+    code = re.sub(r'"(?:[^"\\]|\\.)*"', '""', code)
			
 
				+    code = re.sub(r"'(?:[^'\\]|\\.)*'", "''", code)
			
 
				+    
			
 
				+    keywords = {
			
 
				+        'if', 'else', 'while', 'for', 'switch', 'case', 'return', 'break',
			
 
				+        'continue', 'sizeof', 'typeof', 'alignof', 'decltype', 'static_cast',
			
 
				+        'dynamic_cast', 'reinterpret_cast', 'const_cast', 'new', 'delete',
			
 
				+        'throw', 'catch', 'try', 'namespace', 'class', 'struct', 'enum',
			
 
				+        'union', 'typedef', 'using', 'template', 'typename', 'public',
			
 
				+        'private', 'protected', 'virtual', 'override', 'final', 'explicit',
			
 
				+        'inline', 'static', 'extern', 'const', 'volatile', 'mutable',
			
 
				+        'register', 'auto', 'default', 'goto', 'asm', '__asm', '__asm__',
			
 
				+    }
			
 
				+    
			
 
				+    pattern = r'\b([a-zA-Z_]\w*)\s*\('
			
 
				+    matches = re.findall(pattern, code)
			
 
				+    
			
 
				+    callees = set()
			
 
				+    for name in matches:
			
 
				+        if name not in keywords:
			
 
				+            if self_name is None or name != self_name:
			
 
				+                callees.add(name)
			
 
				+    
			
 
				+    return callees
			
 
				+
			
 
				+
			
 
				+def compute_call_depth(group: List[Dict]) -> Tuple[int, List[str]]:
			
 
				+    """
			
 
				+    Compute the maximum call-chain depth within a group.
				+    
				+    Returns:
				+        (maximum depth, path of the longest call chain)
			
 
				+    """
			
 
				+    if len(group) <= 1:
			
 
				+        return 1, []
			
 
				+    
			
 
				+    # Extract each function's name and the functions it calls
			
 
				+    func_names = {}  # idx -> func_name
			
 
				+    func_codes = {}  # func_name -> code
			
 
				+    call_graph = {}  # func_name -> set of callees
			
 
				+    
			
 
				+    for i, record in enumerate(group):
			
 
				+        func_code = record.get('func', '')
			
 
				+        func_name = extract_function_name(func_code)
			
 
				+        if func_name:
			
 
				+            func_names[i] = func_name
			
 
				+            func_codes[func_name] = func_code
			
 
				+            callees = extract_function_calls(func_code, func_name)
			
 
				+            call_graph[func_name] = callees
			
 
				+    
			
 
				+    # Collect all function names in the group
			
 
				+    group_funcs = set(func_names.values())
			
 
				+    
			
 
				+    # Keep only calls to functions inside the group
			
 
				+    filtered_graph = {}
			
 
				+    for caller, callees in call_graph.items():
			
 
				+        filtered_callees = callees & group_funcs
			
 
				+        filtered_graph[caller] = filtered_callees
			
 
				+    
			
 
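				+    # Use an exhaustive DFS over simple paths to find the longest call chain
				+    # (this can be slow for large, densely connected groups)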
			
 
				+    def dfs(func: str, visited: Set[str], path: List[str]) -> Tuple[int, List[str]]:
			
 
				+        """
			
 
				+        Find the longest call chain starting at func.
			
 
				+        """
			
 
				+        if func in visited:
			
 
				+            return len(path), path.copy()
			
 
				+        
			
 
				+        visited.add(func)
			
 
				+        path.append(func)
			
 
				+        
			
 
				+        max_depth = len(path)
			
 
				+        max_path = path.copy()
			
 
				+        
			
 
				+        for callee in filtered_graph.get(func, []):
			
 
				+            if callee not in visited:
			
 
				+                depth, p = dfs(callee, visited, path)
			
 
				+                if depth > max_depth:
			
 
				+                    max_depth = depth
			
 
				+                    max_path = p
			
 
				+        
			
 
				+        path.pop()
			
 
				+        visited.remove(func)
			
 
				+        
			
 
				+        return max_depth, max_path
			
 
				+    
			
 
				+    # Try every function as a starting point to find the longest call chain
			
 
				+    overall_max_depth = 1
			
 
				+    overall_max_path = []
			
 
				+    
			
 
				+    for func_name in group_funcs:
			
 
				+        depth, path = dfs(func_name, set(), [])
			
 
				+        if depth > overall_max_depth:
			
 
				+            overall_max_depth = depth
			
 
				+            overall_max_path = path
			
 
				+    
			
 
				+    return overall_max_depth, overall_max_path
			
 
				+
			
 
				+
			
 
				+def load_grouped_json(file_path: str) -> Dict:
			
 
				+    """
			
 
				+    Load a grouped JSON file.
			
 
				+    """
			
 
				+    with open(file_path, 'r', encoding='utf-8') as f:
			
 
				+        return json.load(f)
			
 
				+
			
 
				+
			
 
				+def filter_groups_by_depth(
			
 
				+    groups: List[List[Dict]], 
			
 
				+    min_depth: int = 1, 
			
 
				+    max_depth: float = float('inf')
			
 
				+) -> Tuple[List[Dict], Dict[int, int]]:
			
 
				+    """
			
 
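				+    Filter groups by call-chain depth.
				+    
				+    Args:
				+        groups: all groups
				+        min_depth: minimum depth (inclusive)
				+        max_depth: maximum depth (inclusive)
				+    
				+    Returns:
				+        (list of matching groups with depth information, depth distribution)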
			
 
				+    """
			
 
				+    filtered_groups = []
			
 
				+    depth_distribution = defaultdict(int)
			
 
				+    
			
 
				+    print("Analyzing call-chain depth...")
			
 
				+    total = len(groups)
			
 
				+    
			
 
				+    for i, group in enumerate(groups):
			
 
				+        if (i + 1) % 500 == 0:
			
 
				+            print(f"  Progress: {i + 1}/{total}")
			
 
				+        
			
 
				+        depth, path = compute_call_depth(group)
			
 
				+        depth_distribution[depth] += 1
			
 
				+        
			
 
				+        if min_depth <= depth <= max_depth:
			
 
				+            # Attach the depth information to the group
			
 
				+            group_with_info = {
			
 
				+                "call_depth": depth,
			
 
				+                "longest_call_chain": path,
			
 
				+                "group_size": len(group),
			
 
				+                "functions": group
			
 
				+            }
			
 
				+            filtered_groups.append(group_with_info)
			
 
				+    
			
 
				+    return filtered_groups, dict(depth_distribution)
			
 
				+
			
 
				+
			
 
				+def main():
			
 
				+    parser = argparse.ArgumentParser(
			
 
				+        description='Filter function groups by call-chain depth',
			
 
				+        formatter_class=argparse.RawDescriptionHelpFormatter,
			
 
				+        epilog="""
			
 
				+Examples:
				+  # Select groups with depth 3
				+  python filter_by_call_depth.py -i output/grouped.json -d 3
				+
				+  # Select groups with depth between 2 and 5
				+  python filter_by_call_depth.py -i output/grouped.json --min-depth 2 --max-depth 5
				+
				+  # Select groups with depth >= 4
				+  python filter_by_call_depth.py -i output/grouped.json --min-depth 4
			
 
				+        """
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        '--input', '-i',
			
 
				+        type=str,
			
 
				+        required=True,
			
 
				+        help='Path to the input grouped JSON file'
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        '--output', '-o',
			
 
				+        type=str,
			
 
				+        default=None,
			
 
				+        help='Path to the output JSON file (generated automatically by default)'
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        '--depth', '-d',
			
 
				+        type=int,
			
 
				+        default=None,
			
 
				+        help='Exact call-chain depth to match (takes precedence over --min-depth/--max-depth)'
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        '--min-depth',
			
 
				+        type=int,
			
 
				+        default=1,
			
 
				+        help='Minimum call-chain depth (inclusive, default: 1)'
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        '--max-depth',
			
 
				+        type=int,
			
 
				+        default=None,
			
 
				+        help='Maximum call-chain depth (inclusive, unlimited by default)'
			
 
				+    )
			
 
				+    
			
 
				+    args = parser.parse_args()
			
 
				+    
			
 
				+    # Resolve the depth arguments
			
 
				+    if args.depth is not None:
			
 
				+        min_depth = args.depth
			
 
				+        max_depth = args.depth
			
 
				+    else:
			
 
				+        min_depth = args.min_depth
			
 
				+        max_depth = args.max_depth if args.max_depth is not None else float('inf')
			
 
				+    
			
 
				+    # Set the default output path
			
 
				+    if args.output is None:
			
 
				+        base_name = os.path.splitext(os.path.basename(args.input))[0]
			
 
				+        output_dir = os.path.dirname(args.input)
			
 
				+        if max_depth == float('inf'):
			
 
				+            depth_str = f"depth_{min_depth}+"
			
 
				+        elif min_depth == max_depth:
			
 
				+            depth_str = f"depth_{min_depth}"
			
 
				+        else:
			
 
				+            depth_str = f"depth_{min_depth}-{max_depth}"
			
 
				+        args.output = os.path.join(output_dir, f'{base_name}_{depth_str}.json')
			
 
				+    
			
 
				+    # Load the data
				+    print(f"Loading data: {args.input}")
				+    data = load_grouped_json(args.input)
				+    groups = data.get('groups', [])
				+    print(f"Loaded {len(groups)} groups")
				+    
				+    # Filter
				+    if max_depth == float('inf'):
				+        print(f"\nSelecting groups with call-chain depth >= {min_depth}...")
				+    elif min_depth == max_depth:
				+        print(f"\nSelecting groups with call-chain depth = {min_depth}...")
				+    else:
				+        print(f"\nSelecting groups with call-chain depth between {min_depth} and {max_depth}...")
			
 
				+    
			
 
				+    filtered_groups, depth_distribution = filter_groups_by_depth(groups, min_depth, max_depth)
			
 
				+    
			
 
				+    # Statistics
				+    print("\n==================== Statistics ====================")
				+    print(f"Original groups: {len(groups)}")
				+    print(f"Groups after filtering: {len(filtered_groups)}")
				+    print(f"Total functions after filtering: {sum(g['group_size'] for g in filtered_groups)}")
				+    
				+    print("\nCall-chain depth distribution (all data):")
				+    for depth in sorted(depth_distribution.keys()):
				+        count = depth_distribution[depth]
				+        pct = count / len(groups) * 100
				+        # float('inf') compares correctly with ints, so no special case is needed
				+        marker = " <--" if min_depth <= depth <= max_depth else ""
				+        print(f"  Depth {depth}: {count} groups ({pct:.1f}%){marker}")
				+    
				+    if filtered_groups:
				+        depths = [g['call_depth'] for g in filtered_groups]
				+        print("\nFiltered result statistics:")
				+        print(f"  Minimum depth: {min(depths)}")
				+        print(f"  Maximum depth: {max(depths)}")
				+        print(f"  Average depth: {sum(depths)/len(depths):.2f}")
				+    print("====================================================")
			
 
				+    
			
 
				+    # Write the results
			
 
				+    output_data = {
			
 
				+        "metadata": {
			
 
				+            "source_file": os.path.basename(args.input),
			
 
				+            "filter_min_depth": min_depth,
			
 
				+            "filter_max_depth": max_depth if max_depth != float('inf') else "unlimited",
			
 
				+            "original_groups": len(groups),
			
 
				+            "filtered_groups": len(filtered_groups),
			
 
				+            "total_functions": sum(g['group_size'] for g in filtered_groups),
			
 
				+            "depth_distribution": depth_distribution,
			
 
				+        },
			
 
				+        "groups": filtered_groups
			
 
				+    }
			
 
				+    
			
 
				+    os.makedirs(os.path.dirname(args.output) if os.path.dirname(args.output) else '.', exist_ok=True)
			
 
				+    with open(args.output, 'w', encoding='utf-8') as f:
			
 
				+        json.dump(output_data, f, ensure_ascii=False, indent=2)
			
 
				+    
			
 
				+    print(f"\nResults saved to: {args.output}")
			
 
				+
			
 
				+
			
 
				+if __name__ == '__main__':
			
 
				+    main()
			
 
				+
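Review note on `extract_function_calls`: `//` comments are removed before string literals are, so a `//` inside a string (for example a URL) discards the rest of that line, including any calls on it. One possible way around this is a single regex pass in which whichever construct starts first wins. The sketch below is an illustration only: `strip_comments_and_literals`, `called_names` and `sample` are hypothetical names, and the keyword and self-name filtering from the diff is omitted.

```python
import re

# A single alternation: scanning left to right, a string literal that starts
# before a "//" is consumed whole, so the "//" inside it is never seen as a comment.
_TOKEN = re.compile(
    r'"(?:[^"\\\n]|\\.)*"'    # string literal
    r"|'(?:[^'\\\n]|\\.)*'"   # character literal
    r"|//[^\n]*"              # line comment (the newline itself is kept)
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)


def strip_comments_and_literals(code: str) -> str:
    """Blank out literals and drop comments in one pass over C/C++-like code."""
    def replace(match: re.Match) -> str:
        text = match.group(0)
        if text.startswith('"'):
            return '""'
        if text.startswith("'"):
            return "''"
        return " "  # a comment becomes one space so neighboring tokens stay separated
    return _TOKEN.sub(replace, code)


def called_names(code: str) -> set:
    """Return every identifier that is directly followed by '('."""
    cleaned = strip_comments_and_literals(code)
    return set(re.findall(r"\b([a-zA-Z_]\w*)\s*\(", cleaned))


# Hypothetical input: a URL inside a string, then a real call on the same line.
sample = (
    'void f() {\n'
    '  log("see http://x"); helper(1); // cleanup(2)\n'
    '  /* old(3) */ g();\n'
    '}'
)
names = called_names(sample)
```

With the two-step order used in the diff, `helper` would be lost from this sample because everything after `http:` on that line is treated as a comment. This remains a regex heuristic and does not handle raw string literals or preprocessor tricks.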