feat: first commit

“Shellmiao” 2 months ago
commit
40e4a64998

+ 1 - 0
.gitignore

@@ -0,0 +1 @@
+data/

+ 1400 - 0
README.md

@@ -0,0 +1,1400 @@
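+# CodeFusion: A Study of Call-Chain-Based Code Slice Fusion
+
+## Abstract
+
+This work proposes CodeFusion, a code slice fusion technique based on function call chains. It splits a target code fragment and embeds the pieces into several functions of an existing program, combining methods from program analysis, compiler theory, and large language models (LLMs).
+
+Concretely, we first build the control flow graph (CFG) of each host function through lexical analysis and parsing, then compute the dominance relation between basic blocks with a data-flow analysis framework and identify the critical points that every execution must pass through. On that basis, an LLM interprets the code to be fused and splits it into a sequence of slices that respects the dependency constraints. Finally, each slice is inserted at a fusion point of a function on the call chain, and state is shared across functions through global variables or parameter passing.
+
+Experiments show that the method can distribute a complete piece of code logic over several functions while preserving program semantics. The technique applies to code obfuscation, software watermarking, security vulnerability testing, and software protection.
+
+**Keywords**: code fusion; control flow graph; dominance analysis; large language model; program transformation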
+
+---
+
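+## 1. Background and Goals
+
+### 1.1 Motivation
+
+In software security and reverse engineering, how well structured a piece of code is directly affects how hard it is to analyze. Traditional obfuscation techniques mostly transform code inside a single function, for example control flow flattening or opaque predicate insertion, and tend to ignore the obfuscation potential of inter-procedural call relations.
+
+The core observation of this work is: **using the function call chain of an existing program as a "carrier" and scattering sensitive code along it can markedly improve how well that code is hidden**. The advantages are:
+
+1. **Reuse of existing code structure**: no new control flow has to be built; existing functions are reused directly
+2. **Semantic-level dispersion**: the slices are separated at the semantic level, not only syntactically
+3. **Resistance to analysis**: analyzing any single function in isolation cannot recover the complete logic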
+
+### 1.2 Formal Problem Definition
+
+Let the target program $\mathcal{P}$ contain the function set $\mathcal{F}_{all}$, in which there is a call chain of depth $n$:
+
+$$
+\mathcal{F} = \{f_1, f_2, \ldots, f_n\} \subseteq \mathcal{F}_{all}
+$$
+
+The call relation satisfies:
+
+$$
+\forall i \in [1, n-1]: f_i \xrightarrow{\text{call}} f_{i+1}
+$$
+
+Given a target code fragment $C_{target}$ to be fused, the goal is to find a splitting function $\phi$ and a fusion function $\psi$ such that:
+
+$$
+\phi: C_{target} \rightarrow \{c_1, c_2, \ldots, c_n\}
+$$
+
+$$
+\psi: (\mathcal{F}, \{c_1, \ldots, c_n\}) \rightarrow \mathcal{F}' = \{f_1', f_2', \ldots, f_n'\}
+$$
+
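+The fused function set $\mathcal{F}'$ must satisfy the following **semantic equivalence constraint**:
+
+$$
+\boxed{\text{Exec}(f_1') \equiv \text{Exec}(C_{target}) \circ \text{Exec}(f_1)}
+$$
+
+That is, executing $f_1'$ has the same effect as executing the original $f_1$ and then the target code $C_{target}$ (the composition is read right to left).
+
+More precisely, with $\sigma$ a program state and $\llbracket \cdot \rrbracket$ the semantic function: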
+
+$$
+\llbracket f_1' \rrbracket(\sigma_0) = \llbracket C_{target} \rrbracket(\llbracket f_1 \rrbracket(\sigma_0))
+$$
+
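+### 1.3 Constraints
+
+The code split must satisfy the following constraints:
+
+**Constraint 1 (completeness)**: the union of all slices covers every statement of the original code: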
+
+$$
+\bigcup_{i=1}^{n} \text{Stmts}(c_i) \supseteq \text{Stmts}(C_{target})
+$$
+
+**Constraint 2 (dependency)**: if statement $s_j$ is data-dependent on statement $s_i$ (written $s_i \xrightarrow{dep} s_j$), with $s_i \in c_k$ and $s_j \in c_l$, then:
+
+$$
+s_i \xrightarrow{dep} s_j \Rightarrow k \leq l
+$$
+
+**Constraint 3 (reachability)**: for every slice $c_i$ with $i \in [1, n-1]$, the insertion position $p_i \in f_i$ must execute before the call to $f_{i+1}$:
+
+$$
+\text{Dominates}(p_i, \text{CallSite}(f_{i+1}))
+$$
+
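+### 1.4 Research Goals
+
+The specific goals of this work are:
+
+1. **Design an efficient CFG construction algorithm** that supports control flow analysis of C/C++ code
+2. **Implement precise dominator computation** based on an iterative data-flow analysis framework
+3. **Develop an intelligent code splitting method** that uses an LLM for semantics-aware slicing
+4. **Build a complete fusion system** that supports several state passing strategies
+5. **Validate the method** by evaluating fusion results experimentally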
+
+---
+
+## 2. Theoretical Foundations
+
+### 2.1 Control Flow Graph (CFG)
+
+#### 2.1.1 Definition and Properties
+
+**Definition 2.1 (control flow graph)**: the control flow graph of a program $P$ is a 4-tuple:
+
+$$
+G_{CFG} = (V, E, v_{entry}, V_{exit})
+$$
+
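+where:
+- $V = \{v_1, v_2, \ldots, v_m\}$ is the finite set of **basic blocks**
+- $E \subseteq V \times V$ is the set of **control flow edges**
+- $v_{entry} \in V$ is the unique **entry block**
+- $V_{exit} \subseteq V$ is the set of **exit blocks**
+
+**Definition 2.2 (basic block)**: a basic block is a maximal instruction sequence $B = \langle i_1, i_2, \ldots, i_k \rangle$ such that:
+
+1. **Single entry**: control can enter from outside only at $i_1$
+2. **Single exit**: control can leave only from $i_k$
+3. **Sequential execution**: if $i_j$ executes, then $i_{j+1}, \ldots, i_k$ execute in order
+
+Formally: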
+
+$$
+\text{BasicBlock}(B) \Leftrightarrow \begin{cases}
+\text{Entry}(B) = \{i_1\} \\
+\text{Exit}(B) = \{i_k\} \\
+\forall j \in [1, k-1]: \text{Succ}(i_j) = \{i_{j+1}\}
+\end{cases}
+$$
+
+#### 2.1.2 Basic Block Identification
+
+The first instruction of each basic block (its leader) is identified by the following rules:
+
+$$
+\text{Leader}(i) = \begin{cases}
+\text{True} & \text{if } i \text{ is the first instruction of the program} \\
+\text{True} & \text{if } i \text{ is the target of a jump instruction} \\
+\text{True} & \text{if } i \text{ immediately follows a jump instruction} \\
+\text{False} & \text{otherwise}
+\end{cases}
+$$
+
+**Algorithm 2.1: Basic block partitioning**
+
+```
+Input:  instruction sequence I = [i_1, i_2, ..., i_n]
+Output: set of basic blocks B
+
+Leaders ← {i_1}  // the first instruction is a leader
+for each instruction i_j in I do
+    if i_j is a branch instruction then
+        Leaders ← Leaders ∪ {target(i_j)}
+        if j < n then
+            Leaders ← Leaders ∪ {i_{j+1}}
+B ← ∅
+for each leader l in sorted(Leaders) do
+    b ← new BasicBlock starting at l
+    extend b until next leader or end
+    B ← B ∪ {b}
+return B
+```
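
The leader rules of Algorithm 2.1 can be sketched in a few lines of Python. This is a minimal illustration, not the project's `cfg_analyzer.py`: the instruction encoding `(opcode, target_index)` and the opcode names `jmp` and `br` are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical instruction encoding: (opcode, jump target index or None).
Instr = Tuple[str, Optional[int]]
BRANCH_OPS = {"jmp", "br"}  # assumed names for unconditional / conditional jumps


def split_basic_blocks(instrs: List[Instr]) -> List[List[int]]:
    """Partition instruction indices into basic blocks using the leader rules."""
    if not instrs:
        return []
    leaders = {0}  # rule 1: the first instruction is a leader
    for j, (op, target) in enumerate(instrs):
        if op in BRANCH_OPS:
            if target is not None:
                leaders.add(target)      # rule 2: the target of a jump
            if j + 1 < len(instrs):
                leaders.add(j + 1)       # rule 3: the instruction after a jump
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]    # each block runs up to the next leader
    return [list(range(s, e)) for s, e in zip(starts, ends)]


program = [
    ("mov", None),  # 0
    ("br", 4),      # 1: conditional jump to 4
    ("add", None),  # 2
    ("jmp", 5),     # 3: unconditional jump to 5
    ("sub", None),  # 4
    ("ret", None),  # 5
]
blocks = split_basic_blocks(program)
print(blocks)
```

For this six-instruction program the leaders are instructions 0, 2, 4, and 5, so four blocks result.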
+
+#### 2.1.3 Edge Construction
+
+A control flow edge $(v_i, v_j) \in E$ exists if and only if:
+
+$$
+(v_i, v_j) \in E \Leftrightarrow \begin{cases}
+\text{last}(v_i) \text{ is an unconditional jump to } \text{first}(v_j) \\
+\lor\ \text{last}(v_i) \text{ is a conditional jump and } v_j \text{ is a possible target} \\
+\lor\ \text{last}(v_i) \text{ is not a jump and } v_j \text{ is the sequential successor}
+\end{cases}
+$$
+
+#### 2.1.4 Properties of the CFG
+
+**Property 2.1 (connectivity)**: every $v \in V$ is reachable from $v_{entry}$:
+
+$$
+\forall v \in V: v_{entry} \leadsto v
+$$
+
+**Property 2.2 (normal form)**: every $v_{exit} \in V_{exit}$ has an empty successor set:
+
+$$
+\forall v \in V_{exit}: \text{Succ}(v) = \emptyset
+$$
+
+### 2.2 Dominance Relation
+
+#### 2.2.1 Basic Definitions
+
+**Definition 2.3 (dominance)**: in a CFG $G = (V, E, v_{entry}, V_{exit})$, node $d$ **dominates** node $n$ (written $d\ \text{dom}\ n$) if and only if every path from $v_{entry}$ to $n$ passes through $d$:
+
+$$
+d\ \text{dom}\ n \Leftrightarrow \forall \text{ path } \pi: v_{entry} \leadsto n,\ d \in \pi
+$$
+
+Equivalently, in set notation:
+
+$$
+d\ \text{dom}\ n \Leftrightarrow d \in \text{Dom}(n)
+$$
+
+where $\text{Dom}(n)$ is the set of dominators of $n$.
+
+**Definition 2.4 (strict dominance)**: $d$ **strictly dominates** $n$ (written $d\ \text{sdom}\ n$):
+
+$$
+d\ \text{sdom}\ n \Leftrightarrow d\ \text{dom}\ n \land d \neq n
+$$
+
+**Definition 2.5 (immediate dominator)**: the **immediate dominator** $\text{idom}(n)$ of a node $n \neq v_{entry}$ is the strict dominator of $n$ closest to $n$:
+
+$$
+\text{idom}(n) = d \Leftrightarrow d\ \text{sdom}\ n \land \forall d': d'\ \text{sdom}\ n \Rightarrow d'\ \text{dom}\ d
+$$
+
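+**Theorem 2.1**: every node except the entry node has exactly one immediate dominator.
+
+#### 2.2.2 Computing Dominator Sets
+
+The dominance relation can be computed with an iterative data-flow algorithm. The data-flow equations are: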
+
+$$
+\text{Dom}(n) = \begin{cases}
+\{v_{entry}\} & \text{if } n = v_{entry} \\
+\{n\} \cup \left( \displaystyle\bigcap_{p \in \text{Pred}(n)} \text{Dom}(p) \right) & \text{otherwise}
+\end{cases}
+$$
+
+**Algorithm 2.2: Iterative computation of dominator sets**
+
+```
+Input:  CFG G = (V, E, v_entry, V_exit)
+Output: dominator set Dom for every node
+
+Dom(v_entry) ← {v_entry}
+for each v ∈ V \ {v_entry} do
+    Dom(v) ← V  // initialize to the full node set
+repeat
+    changed ← false
+    for each v ∈ V \ {v_entry} do
+        new_dom ← {v} ∪ (⋂_{p ∈ Pred(v)} Dom(p))
+        if new_dom ≠ Dom(v) then
+            Dom(v) ← new_dom
+            changed ← true
+until not changed
+return Dom
+```
+
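+**Complexity**: with $|V| = n$ and $|E| = m$:
+- Space: $O(n^2)$ (storing every dominator set)
+- Time: $O(n \cdot m)$ per pass, since each pass intersects one set of up to $n$ elements per edge; the number of passes depends on the visiting order and is usually small when nodes are visited in reverse postorder
+
+#### 2.2.3 Dominator Tree
+
+**Definition 2.6 (dominator tree)**: the dominator tree $T_{dom} = (V, E_{dom})$ of a CFG is a tree rooted at $v_{entry}$ in which: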
+
+$$
+(d, n) \in E_{dom} \Leftrightarrow d = \text{idom}(n)
+$$
+
+Property of the dominator tree:
+
+$$
+d\ \text{dom}\ n \Leftrightarrow d = n \text{ or } d \text{ is an ancestor of } n \text{ in } T_{dom}
+$$
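
As an illustration of Definitions 2.5 and 2.6, the sketch below derives immediate dominators from full dominator sets and then answers dominance queries by walking up the tree. It assumes every node is reachable from the entry and that `dom` already holds correct dominator sets; the function names are hypothetical and not taken from `dominator_analyzer.py`.

```python
from typing import Dict, Optional, Set


def immediate_dominators(dom: Dict[int, Set[int]], entry: int) -> Dict[int, Optional[int]]:
    """Derive idom(n) from the full dominator sets Dom(n)."""
    idom: Dict[int, Optional[int]] = {entry: None}
    for n, doms in dom.items():
        if n == entry:
            continue
        strict = doms - {n}
        # The strict dominators of n form a chain, so the one closest to n
        # is the one whose own dominator set is largest.
        idom[n] = max(strict, key=lambda d: len(dom[d]))
    return idom


def dominates(idom: Dict[int, Optional[int]], d: int, n: int) -> bool:
    """d dom n iff d is n itself or an ancestor of n in the dominator tree."""
    cur: Optional[int] = n
    while cur is not None:
        if cur == d:
            return True
        cur = idom[cur]
    return False


# Diamond CFG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
dom = {0: {0}, 1: {0, 1}, 2: {0, 2}, 3: {0, 3}}
idom = immediate_dominators(dom, entry=0)
print(idom)
```

In the diamond, neither branch dominates the join node, so the entry is the immediate dominator of all three other nodes.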
+
+### 2.3 Critical Points
+
+#### 2.3.1 Definition
+
+**Definition 2.7 (critical point)**: in a CFG $G$, node $v$ is a **critical point** if and only if, once $v$ is removed, no exit node is reachable from $v_{entry}$:
+
+$$
+v \in \mathcal{C}(G) \Leftrightarrow \forall v_{exit} \in V_{exit}: v_{entry} \not\leadsto_{G \setminus \{v\}} v_{exit}
+$$
+
+where $G \setminus \{v\}$ is the subgraph obtained from $G$ by removing $v$ and its incident edges.
+
+Equivalent definition:
+
+$$
+v \in \mathcal{C}(G) \Leftrightarrow v\ \text{dom}\ v_{exit},\ \forall v_{exit} \in V_{exit}
+$$
+
+#### 2.3.2 Deciding Critical Points
+
+**Algorithm 2.3: Critical point test**
+
+```
+Input:  CFG G, node v to check
+Output: whether v is a critical point
+
+if v = v_entry then
+    return True
+G' ← G \ {v}  // remove node v
+for each v_exit ∈ V_exit do
+    if Reachable(G', v_entry, v_exit) then
+        return False
+return True
+```
+
+**Theorem 2.2**: the critical point set $\mathcal{C}(G)$ equals the intersection of the dominator sets of all exit nodes:
+
+$$
+\mathcal{C}(G) = \bigcap_{v_{exit} \in V_{exit}} \text{Dom}(v_{exit})
+$$
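
The two characterizations of critical points (node removal in Algorithm 2.3, dominator intersection in Theorem 2.2) can be checked against each other on a small graph. The sketch below is illustrative only: the CFG is a made-up example given as a successor map, and the dominator sets are written out by hand rather than computed.

```python
from typing import Dict, Iterable, List, Optional, Set


def reachable(succ: Dict[int, List[int]], start: int, removed: Optional[int]) -> Set[int]:
    """Nodes reachable from start when the node `removed` is deleted."""
    seen: Set[int] = set()
    stack = [start]
    while stack:
        v = stack.pop()
        if v in seen or v == removed:
            continue
        seen.add(v)
        stack.extend(succ.get(v, []))
    return seen


def critical_by_removal(succ: Dict[int, List[int]], entry: int, exits: Iterable[int]) -> Set[int]:
    """Algorithm 2.3 applied to every node."""
    exits = list(exits)
    nodes = set(succ) | {w for ws in succ.values() for w in ws}
    result = set()
    for v in nodes:
        if v == entry or not any(x in reachable(succ, entry, v) for x in exits):
            result.add(v)
    return result


def critical_by_dominators(dom: Dict[int, Set[int]], exits: Iterable[int]) -> Set[int]:
    """Theorem 2.2: intersect the dominator sets of all exit nodes."""
    return set.intersection(*(dom[x] for x in exits))


# Example CFG: 0 -> 1, 1 -> {2, 3}, 2 -> 4, 3 -> 4, 4 -> 5 (single exit 5)
succ = {0: [1], 1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
dom = {0: {0}, 1: {0, 1}, 2: {0, 1, 2}, 3: {0, 1, 3}, 4: {0, 1, 4}, 5: {0, 1, 4, 5}}
by_removal = critical_by_removal(succ, entry=0, exits=[5])
by_dom = critical_by_dominators(dom, exits=[5])
print(sorted(by_removal), sorted(by_dom))
```

Both routes single out the nodes outside the branch (0, 1, 4, 5); the two branch nodes 2 and 3 can each be bypassed through the other.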
+
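+#### 2.3.3 Properties of Critical Points
+
+**Property 2.3 (chain structure)**: the critical points form a chain in the dominator tree, running from the root down to some node: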
+
+$$
+\forall c_1, c_2 \in \mathcal{C}(G): c_1\ \text{dom}\ c_2 \lor c_2\ \text{dom}\ c_1
+$$
+
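+**Property 2.4 (upward closure)**: if $c_1\ \text{dom}\ c_2$ and $c_2 \in \mathcal{C}(G)$, then $c_1 \in \mathcal{C}(G)$.
+
+### 2.4 Fusion Points
+
+#### 2.4.1 Definition and Conditions
+
+**Definition 2.8 (fusion point)**: a position suitable for code insertion, which must satisfy: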
+
+$$
+v \in \mathcal{P}_{fusion}(G) \Leftrightarrow v \in \mathcal{C}(G) \land \Phi_{struct}(v) \land \Phi_{flow}(v)
+$$
+
+where:
+
+**Structural condition** $\Phi_{struct}(v)$:
+
+$$
+\Phi_{struct}(v) \Leftrightarrow |\text{Pred}(v)| \leq 1 \land |\text{Succ}(v)| \leq 1
+$$
+
+**Control flow condition** $\Phi_{flow}(v)$: the edge from the predecessor and the edge to the successor must both be unconditional:
+
+$$
+\Phi_{flow}(v) \Leftrightarrow \neg\text{IsConditionalBranch}(\text{Pred}(v) \to v) \land \neg\text{IsConditionalBranch}(v \to \text{Succ}(v))
+$$
+
+#### 2.4.2 Fusion Point Priority
+
+When several fusion points exist, the one with the highest priority score is chosen:
+
+$$
+\text{Priority}(v) = \alpha \cdot \text{Depth}(v) + \beta \cdot \text{Centrality}(v) + \gamma \cdot \text{Stability}(v)
+$$
+
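+where:
+- $\text{Depth}(v)$: depth in the dominator tree
+- $\text{Centrality}(v)$: a centrality measure in the CFG
+- $\text{Stability}(v)$: size of the basic block (larger blocks are more stable)
+- $\alpha, \beta, \gamma$: weight coefficients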
+
+---
+
+## 3. Method Design
+
+### 3.1 System Architecture
+
+CodeFusion is modular and consists of five core components:
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                           CodeFusion System                                  │
+├─────────────────────────────────────────────────────────────────────────────┤
+│                                                                             │
+│  ┌─────────────────┐                                                        │
+│  │   Input Layer   │                                                        │
+│  │  ┌───────────┐  │                                                        │
+│  │  │ Source    │  │                                                        │
+│  │  │ (JSONL)   │  │                                                        │
+│  │  └─────┬─────┘  │                                                        │
+│  └────────┼────────┘                                                        │
+│           │                                                                 │
+│           ▼                                                                 │
+│  ┌─────────────────────────────────────────────────────────────────────┐   │
+│  │                      Data Processing Layer                           │   │
+│  │  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │   │
+│  │  │ Call extraction │───▶│ Chain grouping  │───▶│ Depth filtering │  │   │
+│  │  │ extract_call_   │    │ by connected    │    │ filter_by_      │  │   │
+│  │  │ relations.py    │    │ components      │    │ call_depth.py   │  │   │
+│  │  └─────────────────┘    └─────────────────┘    └───────┬─────────┘  │   │
+│  └────────────────────────────────────────────────────────┼────────────┘   │
+│                                                           │                 │
+│                                                           ▼                 │
+│  ┌─────────────────────────────────────────────────────────────────────┐   │
+│  │                        Analysis Layer                                │   │
+│  │  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │   │
+│  │  │   CFG build     │───▶│   Dominators    │───▶│ Fusion points   │  │   │
+│  │  │ cfg_analyzer.py │    │ dominator_      │    │                 │  │   │
+│  │  │                 │    │ analyzer.py     │    │                 │  │   │
+│  │  └─────────────────┘    └─────────────────┘    └───────┬─────────┘  │   │
+│  └────────────────────────────────────────────────────────┼────────────┘   │
+│                                                           │                 │
+│                                                           ▼                 │
+│  ┌─────────────────────────────────────────────────────────────────────┐   │
+│  │                       Splitting Layer                                │   │
+│  │  ┌─────────────────────────────────────────────────────────────┐    │   │
+│  │  │                     LLM Code Splitter                        │    │   │
+│  │  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │    │   │
+│  │  │  │ Prompt      │───▶│  LLM call   │───▶│ Response    │      │    │   │
+│  │  │  │ construction│    │ (Qwen API)  │    │ parsing     │      │    │   │
+│  │  │  └─────────────┘    └─────────────┘    └──────┬──────┘      │    │   │
+│  │  └──────────────────────────────────────────────┼──────────────┘    │   │
+│  └─────────────────────────────────────────────────┼───────────────────┘   │
+│                                                    │                        │
+│                                                    ▼                        │
+│  ┌─────────────────────────────────────────────────────────────────────┐   │
+│  │                        Fusion Layer                                  │   │
+│  │  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │   │
+│  │  │ State generation│───▶│ Code insertion  │───▶│ Code generation │  │   │
+│  │  │ (Global/Param)  │    │ code_fusion.py  │    │  main.py        │  │   │
+│  │  └─────────────────┘    └─────────────────┘    └───────┬─────────┘  │   │
+│  └────────────────────────────────────────────────────────┼────────────┘   │
+│                                                           │                 │
+│                                                           ▼                 │
+│  ┌─────────────────┐                                                        │
+│  │  Output Layer   │                                                        │
+│  │  ┌───────────┐  │                                                        │
+│  │  │ Fused code│  │                                                        │
+│  │  │ (.c files)│  │                                                        │
+│  │  └───────────┘  │                                                        │
+│  └─────────────────┘                                                        │
+│                                                                             │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+### 3.2 Call Chain Analysis
+
+#### 3.2.1 Extracting Function Call Relations
+
+Function call relations are extracted from the code to build the call graph $G_{call} = (V_{func}, E_{call})$:
+
+$$
+(f_i, f_j) \in E_{call} \Leftrightarrow \text{the body of } f_i \text{ contains a call to } f_j
+$$
+
+Call relations are extracted by regular expression matching:
+
+$$
+\text{Callees}(f) = \{g \mid \exists \text{ pattern } ``g\text{(}'' \in \text{Body}(f)\}
+$$
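
A regex-based extractor in the spirit of $\text{Callees}(f)$ might look like the sketch below. The source does not show the actual pattern used by `extract_call_relations.py`, so the regular expression, the keyword list, and the `known_functions` filter are all assumptions for illustration. A pattern like this also reports calls that appear inside comments, string literals, or macros.

```python
import re
from typing import Set

# Keywords that are followed by "(" in C but are not function calls (assumed list).
C_KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}
# An identifier followed by optional whitespace and an opening parenthesis.
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def extract_callees(body: str, known_functions: Set[str]) -> Set[str]:
    """Return the known functions that appear to be called inside `body`."""
    names = set(CALL_RE.findall(body))
    return (names - C_KEYWORDS) & known_functions


body = "if (x) { foo(a); bar (b); } return baz;"
callees = extract_callees(body, {"foo", "bar", "baz"})
print(sorted(callees))
```

Here `baz` is mentioned but never called, and `if` is filtered out as a keyword, so only `foo` and `bar` are reported.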
+
+#### 3.2.2 Call Chain Depth
+
+Define the call chain depth function $d: V_{func} \times V_{func} \to \mathbb{N} \cup \{\infty\}$:
+
+$$
+d(f_i, f_j) = \begin{cases}
+0 & \text{if } f_i = f_j \\
+1 + \min_{f_k \in \text{Callees}(f_i)} d(f_k, f_j) & \text{if } f_i \neq f_j \land f_i \leadsto f_j \\
+\infty & \text{otherwise}
+\end{cases}
+$$
+
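+Longest call chain depth, taken over reachable pairs only (since $d = \infty$ for unreachable pairs):
+
+$$
+D_{max}(G_{call}) = \max_{f_i, f_j \in V_{func},\ f_i \leadsto f_j} d(f_i, f_j)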
+$$
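
Because $d$ is defined with a minimum over callees, it is a shortest-path distance in the call graph, and a breadth-first search computes it directly. The sketch below is an illustration with hypothetical function names; it is not the code of `filter_by_call_depth.py`.

```python
from collections import deque
from typing import Dict, List


def call_depths(callees: Dict[str, List[str]], src: str) -> Dict[str, int]:
    """d(src, g) for every g reachable from src (unreachable g are omitted)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        f = queue.popleft()
        for g in callees.get(f, []):
            if g not in dist:
                dist[g] = dist[f] + 1
                queue.append(g)
    return dist


def max_depth(callees: Dict[str, List[str]]) -> int:
    """D_max: the largest finite d(f_i, f_j) over all reachable pairs."""
    funcs = set(callees) | {g for gs in callees.values() for g in gs}
    return max(max(call_depths(callees, f).values()) for f in funcs)


# a calls b and c, b calls c, c calls d
graph = {"a": ["b", "c"], "b": ["c"], "c": ["d"]}
depths_from_a = call_depths(graph, "a")
d_max = max_depth(graph)
print(depths_from_a, d_max)
```

Note that the path a, b, c, d has three calls, yet $d(a, d) = 2$ because the shorter route through the direct call from `a` to `c` wins under the minimum.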
+
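+#### 3.2.3 Call Chain Grouping
+
+Functions connected by call relations are grouped with the Union-Find algorithm. Let $\sim$ be the smallest equivalence relation that contains the call relation, so that two functions are related exactly when they lie in the same weakly connected component of $G_{call}$:
+
+$$
+f_i \sim f_j \Leftrightarrow f_i \text{ and } f_j \text{ are connected in } G_{call} \text{ when edge direction is ignored}
+$$
+
+The groups $\mathcal{G}$ are then the equivalence classes: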
+
+$$
+\mathcal{G} = V_{func} / \sim = \{[f]_\sim \mid f \in V_{func}\}
+$$
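
A compact Union-Find grouping could be sketched as follows. The function and variable names are hypothetical; the point of the example is that two callers of a common callee end up in the same group even though neither calls the other.

```python
from typing import Dict, Iterable, List, Tuple


def group_functions(functions: Iterable[str], calls: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group functions into connected components of the call graph."""
    functions = list(functions)
    parent: Dict[str, str] = {f: f for f in functions}

    def find(x: str) -> str:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for caller, callee in calls:
        root_a, root_b = find(caller), find(callee)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two components

    groups: Dict[str, List[str]] = {}
    for f in functions:
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(g) for g in groups.values())


groups = group_functions(["a", "b", "c", "d"], [("a", "c"), ("b", "c")])
print(groups)
```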
+
+### 3.3 Code Splitting Algorithm
+
+#### 3.3.1 Problem Modeling
+
+Code splitting can be modeled as a constraint satisfaction problem (CSP):
+
+$$
+\text{CSP}_{split} = (X, D, C)
+$$
+
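+where:
+- **Variables** $X = \{x_1, x_2, \ldots, x_n\}$: each variable represents one code slice
+- **Domain** $D$: each variable ranges over subsets of the statements of the original code
+- **Constraints** $C$: completeness, non-overlap, and dependency preservation
+
+**Constraint C1 (completeness)**: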
+
+$$
+\bigcup_{i=1}^{n} x_i = \text{Stmts}(C_{target})
+$$
+
+**Constraint C2 (non-overlap)**:
+
+$$
+\forall i \neq j: x_i \cap x_j = \emptyset
+$$
+
+**Constraint C3 (dependency preservation)**:
+
+$$
+\forall s_a \xrightarrow{dep} s_b: (\text{Index}(s_a) \leq \text{Index}(s_b))
+$$
+
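+where $\text{Index}(s)$ is the index of the slice that contains statement $s$.
+
+#### 3.3.2 LLM-Assisted Splitting
+
+A large language model performs the semantics-aware split. Model the LLM as a function $\mathcal{L}$: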
+
+$$
+\mathcal{L}: (\text{Prompt}, \text{Context}) \rightarrow \text{Response}
+$$
+
+Prompt template construction:
+
+$$
+\text{Prompt} = \text{Template}(C_{target}, n, \mathcal{F}, M, \text{Examples})
+$$
+
+where:
+- $C_{target}$: the target code
+- $n$: number of slices
+- $\mathcal{F}$: list of function names on the call chain
+- $M \in \{\text{global}, \text{parameter}\}$: state passing method
+- $\text{Examples}$: few-shot examples
+
+Parsing the LLM output:
+
+$$
+\text{Parse}: \text{JSON} \rightarrow (\{c_i\}_{i=1}^n, \mathcal{S}, \text{Decl})
+$$
+
+where $\mathcal{S}$ is the set of shared state variables and $\text{Decl}$ is the declaration code.
+
+#### 3.3.3 Fallback Mechanism
+
+When the LLM call fails, a heuristic split is used instead:
+
+**Algorithm 3.1: Heuristic code splitting**
+
+```
+Input:  code C, number of slices n
+Output: list of code slices {c_1, ..., c_n}
+
+stmts ← ParseStatements(C)
+k ← |stmts|
+if k < n then
+    // one statement per slice, padded with empty slices
+    for i = 1 to k do
+        c_i ← stmts[i]
+    for i = k+1 to n do
+        c_i ← "// empty"
+else
+    // even split; the last slice takes the remainder
+    chunk_size ← ⌊k / n⌋
+    for i = 1 to n do
+        start ← (i-1) × chunk_size + 1
+        end ← min(i × chunk_size, k) if i < n else k
+        c_i ← Join(stmts[start:end])
+return {c_1, ..., c_n}
+```
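
The heuristic split of Algorithm 3.1 translates almost line for line into Python. In this sketch, statement parsing is reduced to splitting on `;`, which is an assumption made for brevity: it breaks on `for (...;...;...)` headers and on semicolons inside string literals, so a real `ParseStatements` would need to be more careful.

```python
from typing import List


def fallback_split(code: str, n: int) -> List[str]:
    """Split `code` into exactly n slices without any semantic analysis."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Naive statement parsing: one statement per ';' (assumption, see above).
    stmts = [s.strip() + ";" for s in code.split(";") if s.strip()]
    k = len(stmts)
    if k < n:
        # One statement per slice, padded with empty slices.
        return stmts + ["// empty"] * (n - k)
    chunk = k // n
    slices = []
    for i in range(n):
        start = i * chunk
        end = (i + 1) * chunk if i < n - 1 else k  # last slice takes the remainder
        slices.append(" ".join(stmts[start:end]))
    return slices


target = 'int secret = 42; int key = secret ^ 0xABCD; printf("key=%d", key);'
two = fallback_split(target, 2)
four = fallback_split(target, 4)
print(two)
print(four)
```

With three statements and `n = 2`, the second slice receives two statements; with `n = 4`, the fourth slice is the `// empty` placeholder.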
+
+### 3.4 State Passing Methods
+
+#### 3.4.1 Global Variable Method
+
+**Definition 3.1 (global state space)**: let the set of shared variables be $\mathcal{S} = \{s_1, s_2, \ldots, s_k\}$. The global state space is:
+
+$$
+\mathcal{G} = \{g_i = \text{global}(s_i) \mid s_i \in \mathcal{S}\}
+$$
+
+Variable renaming map $\rho_{global}: \mathcal{S} \to \mathcal{G}$:
+
+$$
+\rho_{global}(s_i) = g\_s_i \quad (\text{prefix } g\_ \text{ added})
+$$
+
+**Global declaration generation**:
+
+$$
+\text{Decl}_{global} = \bigcup_{s_i \in \mathcal{S}} \text{``static } T_i\ g\_s_i\text{;''}
+$$
+
+where $T_i$ is the type of $s_i$.
+
+**Code transformation**:
+
+$$
+c_i' = c_i[s_j \mapsto g\_s_j,\ \forall s_j \in \mathcal{S}]
+$$
+
+**Formal semantics**:
+
+Let $\sigma_G$ be the global state and $\sigma_L$ the local state. Then:
+
+$$
+\llbracket c_i' \rrbracket(\sigma_G, \sigma_L) = \llbracket c_i \rrbracket(\sigma_G \cup \sigma_L)
+$$
+
+#### 3.4.2 Parameter Passing Method
+
+**Definition 3.2 (state struct)**: define the struct type $\Sigma$:
+
+$$
+\Sigma = \text{struct FusionState} \{T_1\ s_1;\ T_2\ s_2;\ \ldots;\ T_k\ s_k;\}
+$$
+
+**Function signature transformation**:
+
+$$
+f_i: (A_1, \ldots, A_m) \to R \quad \Longrightarrow \quad f_i': (A_1, \ldots, A_m, \Sigma^*\ state) \to R
+$$
+
+**Variable access transformation**:
+
+$$
+\rho_{param}(s_j) = state \to s_j
+$$
+
+**Code transformation**:
+
+$$
+c_i' = c_i[s_j \mapsto state \to s_j,\ \forall s_j \in \mathcal{S}]
+$$
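
The substitution $c_i[s_j \mapsto state \to s_j]$ can be approximated with a word-boundary regular expression. This is a rough sketch under stated assumptions, not the project's `TransformToParameterAccess`: it rewrites bare identifiers only, skips names that already follow `->` or `.`, and does not protect string literals or comments.

```python
import re
from typing import Set


def to_parameter_access(code: str, shared: Set[str]) -> str:
    """Rewrite each shared variable s as state->s."""
    if not shared:
        return code
    # Longest names first so that a shorter name never shadows a longer one.
    names = sorted(shared, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w>.])(" + "|".join(re.escape(n) for n in names) + r")\b"
    )
    return pattern.sub(lambda m: "state->" + m.group(1), code)


rewritten = to_parameter_access("key = secret ^ 0xABCD; secret_key++;", {"secret", "key"})
print(rewritten)
```

The unrelated identifier `secret_key` is left alone, and running the function a second time leaves the already rewritten accesses unchanged.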
+
+**Function call transformation**:
+
+$$
+\text{Call}(f_{i+1}, args) \Longrightarrow \text{Call}(f_{i+1}', args, state)
+$$
+
+**Initialization code**:
+
+```c
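+#include <string.h>  /* memset */
+
+/* Assumes FusionState is a typedef of the state struct defined above. */
+FusionState state_data;
+memset(&state_data, 0, sizeof(state_data));
+FusionState* state = &state_data;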
+```
+
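+#### 3.4.3 Comparison of the Two Methods
+
+| Property | Global variables | Parameter passing |
+|------|-----------|-----------|
+| Implementation complexity | $O(k)$ | $O(k + n)$ |
+| Changes function signatures | No | Yes |
+| Thread-safe | ❌ | ✅ |
+| Reentrant | ❌ | ✅ |
+| Side effects | Yes | No |
+| Suitable for | Single-threaded code | Multi-threaded code / library functions |
+
+Formal comparison: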
+
+$$
+\text{Overhead}_{global} = O(1) \quad \text{vs} \quad \text{Overhead}_{param} = O(n \cdot \text{sizeof}(\Sigma^*))
+$$
+
+### 3.5 Fusion Algorithm
+
+#### 3.5.1 Complete Algorithm
+
+**Algorithm 3.2: CodeFusion main algorithm**
+
+```
+Input:
+  - target code C_target
+  - call chain function set F = {f_1, ..., f_n}
+  - passing method M ∈ {global, parameter}
+Output: fused function set F' = {f_1', ..., f_n'}
+
+Phase 1: analysis
+for i = 1 to n do
+    G_i ← BuildCFG(f_i)
+    Dom_i ← ComputeDominators(G_i)
+    C_i ← FindCriticalPoints(G_i, Dom_i)
+    P_i ← FilterFusionPoints(C_i)
+end for
+
+Phase 2: splitting
+(slices, S, decl) ← LLM_Split(C_target, n, F, M)
+if slices = ∅ then
+    slices ← FallbackSplit(C_target, n, M)
+end if
+
+Phase 3: state generation
+if M = global then
+    state_code ← GenerateGlobalDeclarations(S)
+else
+    state_code ← GenerateStructDefinition(S)
+end if
+
+Phase 4: fusion
+for i = 1 to n do
+    p_i ← SelectBestFusionPoint(P_i)
+    c_i ← slices[i]
+    if M = parameter then
+        c_i ← TransformToParameterAccess(c_i, S)
+    end if
+    f_i' ← InsertCodeAtPoint(f_i, p_i, c_i)
+end for
+
+Phase 5: output
+output ← CombineCode(state_code, F')
+return output
+```
+
+#### 3.5.2 Complexity Analysis
+
+Let $n$ be the call chain length, $m$ the average function size (number of basic blocks), and $k$ the number of shared variables:
+
+| Phase | Time complexity | Space complexity |
+|------|-----------|-----------|
+| CFG construction | $O(n \cdot m)$ | $O(n \cdot m)$ |
+| Dominance analysis | $O(n \cdot m^2)$ | $O(n \cdot m^2)$ |
+| LLM splitting | $O(T_{LLM})$ | $O(|C_{target}|)$ |
+| State generation | $O(k)$ | $O(k)$ |
+| Code fusion | $O(n \cdot m)$ | $O(n \cdot m)$ |
+| **Total** | $O(n \cdot m^2 + T_{LLM})$ | $O(n \cdot m^2)$ |
+
+where $T_{LLM}$ is the latency of the LLM API call.
+
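+#### 3.5.3 Correctness Proof
+
+**Theorem 3.1 (semantic equivalence)**: if Algorithm 3.2 succeeds, the fused program is semantically equivalent to the original program followed by the target code.
+
+**Proof**:
+
+Let the initial program state be $\sigma_0$. We need to show:
+
+$$
+\llbracket f_1' \rrbracket(\sigma_0) = \llbracket C_{target} \rrbracket(\llbracket f_1 \rrbracket(\sigma_0))
+$$
+
+The split satisfies the completeness constraint:
+
+$$
+\bigcup_{i=1}^{n} c_i \equiv C_{target}
+$$
+
+and each $c_i$ is inserted in $f_i$ before the call to $f_{i+1}$ (guaranteed by the fusion point property), so executing $f_1'$ proceeds as follows:
+
+1. Execute $c_1$
+2. Call $f_2'$, which executes $c_2$
+3. ...
+4. Call $f_n'$, which executes $c_n$
+
+By the dependency constraint, the slices therefore run in the order $c_1; c_2; \ldots; c_n$, which is $C_{target}$. Assuming the slices read and write only the shared state $\mathcal{S}$ and the original functions never touch it, interleaving the slices with the original statements changes the result of neither.
+
+Correctness of state passing follows from $\rho_{global}$ and $\rho_{param}$ being bijections. $\square$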
+
+---
+
+## 4. Implementation Details
+
+### 4.1 Project Structure
+
+```
+Vul/
+├── README.md                      # Project documentation
+├── requirements.txt               # Dependency list
+│
+├── data/                          # Dataset directory
+│   ├── primevul_train.jsonl       # Training set (original vulnerability data)
+│   ├── primevul_train_paired.jsonl
+│   ├── primevul_valid.jsonl       # Validation set
+│   ├── primevul_valid_paired.jsonl
+│   ├── primevul_test.jsonl        # Test set
+│   └── primevul_test_paired.jsonl
+│
+├── utils/                         # Utility modules
+│   └── data_process/              # Data processing tools
+│       ├── extract_call_relations.py   # Call relation extraction
+│       └── filter_by_call_depth.py     # Call depth filtering
+│
+├── src/                           # Core source code
+│   ├── __init__.py               # Package initialization
+│   ├── cfg_analyzer.py           # CFG analyzer
+│   ├── dominator_analyzer.py     # Dominator analyzer
+│   ├── llm_splitter.py           # LLM code splitter
+│   ├── code_fusion.py            # Code fusion engine
+│   └── main.py                   # Main entry point
+│
+├── output/                        # Output directory
+│   ├── fused_code/               # Fused code files
+│   │   ├── all_fused_code.c      # Combined file
+│   │   └── fused_group_*.c       # Fused code per group
+│   ├── primevul_valid_grouped.json
+│   ├── primevul_valid_grouped_depth_*.json
+│   └── fusion_results.json
+│
+└── SliceFusion/                   # Reference project (C++ LLVM implementation)
+    └── src/
+        ├── Fusion/
+        └── Util/
+```
+
+### 4.2 Core Modules
+
+#### 4.2.1 CFG Analyzer (`cfg_analyzer.py`)
+
+**Main classes**:
+
+```python
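+from dataclasses import dataclass
+from typing import Dict, List, Optional, Tuple
+
+@dataclass
+class BasicBlock:
+    id: int                    # Basic block ID
+    name: str                  # Basic block name
+    statements: List[str]      # Statements in the block
+    start_line: int           # First source line
+    end_line: int             # Last source line
+    is_entry: bool            # Whether this is the entry block
+    is_exit: bool             # Whether this is an exit block
+
+@dataclass
+class ControlFlowGraph:
+    function_name: str                    # Function name
+    blocks: Dict[int, BasicBlock]         # Basic blocks by ID
+    edges: List[Tuple[int, int]]          # Edge list
+    entry_block_id: Optional[int]         # Entry block ID
+    exit_block_ids: List[int]             # Exit block IDs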
+```
+
+**Key methods**:
+
+| Method | Purpose | Complexity |
+|------|------|--------|
+| `_remove_comments()` | Strip code comments | $O(n)$ |
+| `_extract_function_body()` | Extract the function body | $O(n)$ |
+| `_tokenize_statements()` | Split the body into statements | $O(n)$ |
+| `_is_control_statement()` | Test for a control statement | $O(1)$ |
+| `_build_basic_blocks()` | Build basic blocks | $O(n)$ |
+| `_build_edges()` | Build control flow edges | $O(m)$ |
+
+#### 4.2.2 Dominator Analyzer (`dominator_analyzer.py`)
+
+**Data-flow equation implementation**:
+
+```python
+def compute_dominators(self) -> Dict[int, Set[int]]:
+    # Node set and entry block, assumed to come from the ControlFlowGraph fields
+    all_nodes = set(self.cfg.blocks)
+    entry = self.cfg.entry_block_id
+
+    # Initialize: Dom(entry) = {entry}, Dom(n) = all nodes otherwise
+    dominators = {node: all_nodes.copy() for node in all_nodes}
+    dominators[entry] = {entry}
+
+    # Iterate to a fixed point
+    changed = True
+    while changed:
+        changed = False
+        for node in all_nodes:
+            if node == entry:
+                continue
+            # Dom(n) = {n} ∪ (∩ Dom(p) for p in pred(n))
+            new_dom = all_nodes.copy()
+            for pred in self.cfg.get_predecessors(node):
+                new_dom &= dominators[pred]
+            new_dom.add(node)
+
+            if new_dom != dominators[node]:
+                dominators[node] = new_dom
+                changed = True
+
+    return dominators
+```
+
+#### 4.2.3 LLM Splitter (`llm_splitter.py`)
+
+**API configuration**:
+
+```python
+import os
+
+from openai import OpenAI
+
+client = OpenAI(
+    api_key=os.getenv("DASHSCOPE_API_KEY"),
+    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
+)
+model = "qwen-plus"  # or qwen-turbo, qwen-max
+```
+
+**Key part of the prompt template** (the prompt itself is written in Chinese):
+
+```
+【重要】由于每个片段在不同的函数中执行,局部变量无法直接传递!
+你必须:
+1. 将需要跨函数共享的变量声明为全局变量/结构体成员
+2. 第一个片段负责初始化
+3. 后续片段使用共享状态
+4. 最后一个片段执行最终操作
+```
+
+#### 4.2.4 Code Fusion Engine (`code_fusion.py`)
+
+**Fusion plan data structure**:
+
+```python
+@dataclass
+class FusionPlan:
+    target_code: str              # Target code
+    call_chain: CallChain         # Call chain
+    slice_result: SliceResult     # Split result
+    insertion_points: List[Tuple[str, int, str]]  # Insertion points
+```
+
+**Code insertion strategy**:
+
+$$
+\text{InsertPosition}(f_i, p_i) = \begin{cases}
+\text{AfterDeclarations} & \text{if } p_i = v_{entry} \\
+\text{BeforeStatement}(p_i) & \text{otherwise}
+\end{cases}
+$$
+
+### 4.3 Environment Setup
+
+#### 4.3.1 Installing Dependencies
+
+```bash
+# Create a virtual environment
+conda create -n vul python=3.10
+conda activate vul
+
+# Install dependencies
+pip install openai networkx graphviz
+```
+
+#### 4.3.2 API Configuration
+
+```bash
+# Set the Alibaba Cloud DashScope API key
+export DASHSCOPE_API_KEY="your-api-key-here"
+```
+
+### 4.4 Usage
+
+#### 4.4.1 Data Preprocessing
+
+```bash
+# Step 1: extract call relations
+python utils/data_process/extract_call_relations.py \
+    --input data/primevul_valid.jsonl \
+    --output output/primevul_valid_grouped.json
+
+# Step 2: filter by call depth
+python utils/data_process/filter_by_call_depth.py \
+    --input output/primevul_valid_grouped.json \
+    --depth 4
+```
+
+#### 4.4.2 Code Fusion
+
+```bash
+# Using the global variable method
+python src/main.py \
+    --input output/primevul_valid_grouped_depth_4.json \
+    --output output/fusion_results.json \
+    --target-code "int secret = 42; int key = secret ^ 0xABCD; printf(\"key=%d\", key);" \
+    --method global \
+    --max-groups 5
+
+# Using the parameter passing method
+python src/main.py \
+    --input output/primevul_valid_grouped_depth_4.json \
+    --output output/fusion_results.json \
+    --target-file my_code.c \
+    --method parameter \
+    --max-groups 10
+```
+
+#### 4.4.3 Analysis-Only Mode
+
+```bash
+python src/main.py \
+    --input output/primevul_valid_grouped_depth_4.json \
+    --analyze-only
+```
+
+---
+
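+## 5. Experiments and Analysis
+
+### 5.1 Dataset
+
+This work uses the PrimeVul dataset, which contains real vulnerable code extracted from many open-source projects.
+
+**Dataset statistics**:
+
+| Statistic | Value |
+|---------|------|
+| Total records | 25,430 |
+| Functions successfully extracted | 24,465 |
+| Projects covered | 218 |
+| Total groups | 4,777 |
+| Single-function groups (no call relation) | 3,646 (76.3%) |
+| Groups with call relations | 1,131 (23.7%) |
+| Maximum call chain depth | 25 |
+| Average call chain depth | 2.68 |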
+
+**Distribution over the main projects**:
+
+| Project | Functions | Share |
+|---------|---------|------|
+| Linux Kernel | 7,120 | 28.0% |
+| MySQL Server | 920 | 3.6% |
+| HHVM | 911 | 3.6% |
+| GPAC | 875 | 3.4% |
+| TensorFlow | 656 | 2.6% |
+| Other | 14,948 | 58.8% |
+
+**Language distribution**:
+
+$$
+P(\text{Language} = l) = \begin{cases}
+0.815 & l = \text{C} \\
+0.185 & l = \text{C++}
+\end{cases}
+$$
+
+### 5.2 Call Depth Distribution
+
+Let $X$ be the random variable for call chain depth. Its distribution is:
+
+$$
+P(X = d) = \frac{|\{g \in \mathcal{G} : \text{depth}(g) = d\}|}{|\mathcal{G}|}
+$$
+
+**Measured distribution**:
+
+| Depth $d$ | Groups | Probability $P(X=d)$ | Cumulative $F(d)$ |
+|---------|------|--------------|----------------|
+| 1 | 4,057 | 0.849 | 0.849 |
+| 2 | 489 | 0.102 | 0.951 |
+| 3 | 135 | 0.028 | 0.979 |
+| 4 | 50 | 0.010 | 0.990 |
+| 5 | 13 | 0.003 | 0.993 |
+| 6 | 16 | 0.003 | 0.996 |
+| 7+ | 17 | 0.004 | 1.000 |
+
+**Distribution characteristics**:
+
+- **Mode**: $\text{Mo}(X) = 1$
+- **Mean**: $E[X] = \sum_d d \cdot P(X=d) \approx 1.24$
+- **Variance**: $\text{Var}(X) = E[X^2] - (E[X])^2 \approx 0.89$
+- **Skewness**: positive, with a long right tail
+
+The distribution is roughly geometric:
+
+$$
+P(X = d) \approx p(1-p)^{d-1}, \quad p \approx 0.85
+$$
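
The probabilities and the mean can be recomputed from the group counts in the table. In the sketch below the open-ended `7+` bucket is treated as depth 7, which is an assumption and makes the computed mean a lower bound.

```python
# Group counts per depth, copied from the table; the key 7 stands for the "7+" bucket.
counts = {1: 4057, 2: 489, 3: 135, 4: 50, 5: 13, 6: 16, 7: 17}

total = sum(counts.values())
probs = {d: c / total for d, c in counts.items()}
# Lower bound on E[X], because every "7+" group is counted as depth 7.
mean_lower = sum(d * p for d, p in probs.items())

# Geometric model P(X = d) = p * (1 - p) ** (d - 1) with p taken as P(X = 1).
p = probs[1]
geometric = {d: p * (1 - p) ** (d - 1) for d in counts}

print(total, round(probs[1], 3), round(mean_lower, 2))
```

The total matches the 4,777 groups reported for the dataset, and the lower bound on the mean rounds to the stated 1.24.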
+
+### 5.3 Fusion Evaluation
+
+#### 5.3.1 Fusion Success Rate
+
+Define the fusion success rate:
+
+$$
+\text{SuccessRate} = \frac{|\{g : \text{Fusion}(g) = \text{Success}\}|}{|\mathcal{G}_{processed}|}
+$$
+
+**Results**:
+
+| Configuration | Groups processed | Successes | Success rate |
+|------|---------|--------|--------|
+| Global variable method | 50 | 50 | 100% |
+| Parameter passing method | 50 | 50 | 100% |
+| Split by LLM | 50 | 48 | 96% |
+| Split by fallback | 50 | 2 | 4% |
+
+#### 5.3.2 Code Bloat
+
+Define the code bloat rate:
+
+$$
+\text{Bloat}(f_i) = \frac{|\text{LOC}(f_i')| - |\text{LOC}(f_i)|}{|\text{LOC}(f_i)|}
+$$
+
+Average bloat rate:
+
+$$
+\overline{\text{Bloat}} = \frac{1}{n} \sum_{i=1}^{n} \text{Bloat}(f_i) \approx 0.15
+$$
+
+That is, the line count grows by about 15% on average.
+
+#### 5.3.3 Fusion Example
+
+**Input target code** (format string vulnerability):
+
+```c
+void vulnerable_function(char *input) {
+    char buffer[256];
+    printf(input);  // vulnerable call
+    strncpy(buffer, input, sizeof(buffer) - 1);
+    buffer[sizeof(buffer) - 1] = '\0';
+    printf("\nInput processed: %s\n", buffer);
+}
+
+int test() {
+    char malicious_input[] = "Hello World! %x %x %x %x\n"; 
+    vulnerable_function(malicious_input);
+    return 0;
+}
+```
+
+**Distribution of the fused code** (parameter passing method, call chain depth = 4):
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│  typedef struct {                                                │
+│      char buffer[256];                                          │
+│      char* input;                                               │
+│      char malicious_input[256];                                 │
+│  } FusionState;                                                 │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  crypto_get_certificate_data() [outermost]                      │
+│  ┌─────────────────────────────────────────────────────────────┐│
+│  │ /* Fused Code */                                            ││
+│  │ strcpy(state->malicious_input, "Hello World! %x...");       ││
+│  │ state->input = state->malicious_input;                      ││
+│  └─────────────────────────────────────────────────────────────┘│
+│  ... original code ...                                          │
+│  crypto_cert_fingerprint(xcert);  ──────────────────────────┐   │
+└─────────────────────────────────────────────────────────────│───┘
+                                                              │
+                              ┌────────────────────────────────┘
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  crypto_cert_fingerprint() [level 2]                            │
+│  ┌─────────────────────────────────────────────────────────────┐│
+│  │ /* Fused Code */                                            ││
+│  │ printf(state->input);  // 🔴 vulnerable call                ││
+│  └─────────────────────────────────────────────────────────────┘│
+│  ... original code ...                                          │
+│  crypto_cert_fingerprint_by_hash(xcert, "sha256");  ────────┐   │
+└─────────────────────────────────────────────────────────────│───┘
+                                                              │
+                              ┌────────────────────────────────┘
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  crypto_cert_fingerprint_by_hash() [level 3]                    │
+│  ┌─────────────────────────────────────────────────────────────┐│
+│  │ /* Fused Code */                                            ││
+│  │ strncpy(state->buffer, state->input, 255);                  ││
+│  │ state->buffer[255] = '\0';                                  ││
+│  └─────────────────────────────────────────────────────────────┘│
+│  ... original code ...                                          │
+│  crypto_cert_hash(xcert, hash, &fp_len);  ──────────────────┐   │
+└─────────────────────────────────────────────────────────────│───┘
+                                                              │
+                              ┌────────────────────────────────┘
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  crypto_cert_hash() [innermost]                                 │
+│  ┌─────────────────────────────────────────────────────────────┐│
+│  │ /* Fused Code */                                            ││
+│  │ printf("\nInput processed: %s\n", state->buffer);           ││
+│  └─────────────────────────────────────────────────────────────┘│
+│  ... original code ...                                          │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### 5.4 Performance Analysis
+
+#### 5.4.1 Processing Time
+
+Let $T$ be the total processing time, decomposed as:
+
+$$
+T = T_{load} + T_{analyze} + T_{llm} + T_{fuse} + T_{output}
+$$
+
+**Time per phase** (processing 50 groups):
+
+| Phase | Time (s) | Share |
+|------|---------|------|
+| Data loading $T_{load}$ | 0.5 | 1.5% |
+| CFG/dominance analysis $T_{analyze}$ | 2.3 | 6.9% |
+| LLM calls $T_{llm}$ | 28.5 | 85.6% |
+| Code fusion $T_{fuse}$ | 1.2 | 3.6% |
+| File output $T_{output}$ | 0.8 | 2.4% |
+| **Total** | **33.3** | **100%** |
+
+**LLM calls are the main bottleneck**, accounting for 85.6% of the total time.
+
+#### 5.4.2 Memory Usage
+
+Peak memory usage:
+
+$$
+M_{peak} \approx M_{data} + M_{cfg} + M_{llm\_context}
+$$
+
+Measured at about 150-200 MB (processing 50 groups).
+
+---
+
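+## 6. Application Scenarios
+
+### 6.1 Code Obfuscation
+
+#### 6.1.1 Principle
+
+Sensitive code (such as license checks or encryption algorithms) is dispersed across several ordinary functions, which makes reverse engineering harder.
+
+**Measuring obfuscation strength**:
+
+Define the dispersion: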
+
+$$
+D(C_{target}, \mathcal{F}') = \frac{H(\text{Dist}(C_{target}, \mathcal{F}'))}{H_{max}}
+$$
+
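+where $H$ is the entropy function, $\text{Dist}$ is the distribution of the code across functions, and $H_{max}$ is the maximum attainable entropy (that of a uniform distribution).
+
+The higher the dispersion, the stronger the obfuscation:
+
+$$
+D \to 1 \Rightarrow \text{code is evenly distributed across all functions}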
+$$
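A minimal sketch of the dispersion metric, under two assumptions that the text does not fix: fragment size is measured in statements, and $H_{max} = \log n$ for $n$ functions (the entropy of a uniform distribution). The function name `dispersion` is chosen here for illustration.

```python
import math


def dispersion(fragment_sizes):
    """Normalized entropy of how target code is spread across n functions.

    fragment_sizes[i] is the amount of target code placed in function i
    (statement count is an assumed size measure).
    """
    n = len(fragment_sizes)
    total = sum(fragment_sizes)
    if n < 2 or total == 0:
        return 0.0  # a single function (or no code) has no dispersion
    probs = [size / total for size in fragment_sizes if size > 0]
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    return entropy / math.log(n)  # assumed H_max = log n


print(dispersion([1, 1, 1, 1]))  # evenly spread over four functions
print(dispersion([4, 0, 0, 0]))  # everything in one function
print(dispersion([2, 1, 1, 0]))  # partially spread
```

Values close to 1 correspond to an even spread; a value of 0 means all target code sits in one function.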
+
+#### 6.1.2 Example
+
+Original license-check code:
+
+```c
+int check_license(char* key) {
+    int hash = compute_hash(key);
+    if (hash == VALID_HASH) {
+        return AUTHORIZED;
+    }
+    return UNAUTHORIZED;
+}
+```
+
+After fusion, the logic is distributed across 4 functions:
+
+- $f_1$: `hash_part1 = key[0] ^ SALT1;`
+- $f_2$: `hash_part2 = hash_part1 + key[1];`
+- $f_3$: `hash = hash_part2 << 4;`
+- $f_4$: `return (hash == VALID_HASH) ? 1 : 0;`
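The four fragments listed above can be simulated to check that the dispersed version computes the same result as a single function. This is only a sketch: the values of `SALT1` and `VALID_HASH` are hypothetical, shared state is modelled as a dictionary, and the carrier functions contain nothing but their fragment, whereas in real fused code each would also keep its original body.

```python
SALT1 = 0x5A        # hypothetical salt value
VALID_HASH = 0x9D0  # hypothetical expected hash (matches the key b"ab")


def f4(state):
    # Fragment 4: final comparison, stored in the shared state.
    state["authorized"] = state["hash"] == VALID_HASH


def f3(state):
    state["hash"] = state["hash_part2"] << 4  # fragment 3
    f4(state)


def f2(state):
    state["hash_part2"] = state["hash_part1"] + state["key"][1]  # fragment 2
    f3(state)


def f1(state):
    state["hash_part1"] = state["key"][0] ^ SALT1  # fragment 1
    f2(state)


def check_license_fused(key: bytes) -> bool:
    """Run the fragments along the call chain f1 -> f2 -> f3 -> f4."""
    state = {"key": key}
    f1(state)
    return state["authorized"]


def check_license_monolithic(key: bytes) -> bool:
    """The same computation written as one expression, for comparison."""
    return (((key[0] ^ SALT1) + key[1]) << 4) == VALID_HASH


for key in (b"ab", b"zz"):
    print(key, check_license_fused(key), check_license_monolithic(key))
```

No single function in the chain contains both the hash computation and the comparison, which is the property the obfuscation relies on.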
+
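+### 6.2 Software Watermarking
+
+#### 6.2.1 Principle
+
+Watermark information is encoded, split into fragments, and embedded, for copyright protection and piracy tracing.
+
+**Watermark encoding**:
+
+Let the watermark information be $W$, encoded as a bit string: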
+
+$$
+W \xrightarrow{\text{encode}} b_1 b_2 \ldots b_m
+$$
+
+Map the bit string to code fragments, $k$ bits per fragment:
+
+$$
+c_i = \text{CodeGen}(b_{(i-1)k+1}, \ldots, b_{ik})
+$$
+
+**Extraction algorithm**:
+
+$$
+\text{Extract}(\mathcal{F}') = \text{decode}\left(\bigcup_{i=1}^{n} \text{Parse}(c_i)\right)
+$$
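The encode, split, and extract steps can be sketched end to end. Everything concrete here is an assumption made for illustration: `code_gen` hides each group of $k$ bits in an integer constant, `parse` reverses that, and extraction concatenates the parsed bits in fragment order (the formula writes this as a union).

```python
def encode_bits(watermark: bytes) -> str:
    """W -> b_1 b_2 ... b_m as a string of '0'/'1' characters."""
    return "".join(f"{byte:08b}" for byte in watermark)


def split_bits(bits: str, k: int) -> list:
    """Group the bit string into payloads of k bits each."""
    if len(bits) % k != 0:
        raise ValueError("bit length must be a multiple of k")
    return [bits[i:i + k] for i in range(0, len(bits), k)]


def code_gen(chunk: str, index: int) -> str:
    """Hypothetical CodeGen: hide the k bits in an integer constant."""
    return f"wm_{index} = 0x{int(chunk, 2):X};"


def parse(fragment: str, k: int) -> str:
    """Inverse of code_gen: recover the k bits from the constant."""
    value = int(fragment.split("0x")[1].rstrip(";"), 16)
    return f"{value:0{k}b}"


def extract(fragments: list, k: int) -> bytes:
    """Concatenate parsed bits in fragment order and decode to bytes."""
    bits = "".join(parse(fragment, k) for fragment in fragments)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


k = 8
payloads = split_bits(encode_bits(b"CF25"), k)
fragments = [code_gen(chunk, i) for i, chunk in enumerate(payloads, 1)]
print(fragments)
print(extract(fragments, k))
```

Each element of `fragments` would be embedded in one function of the call chain; extraction needs the fragments back in chain order.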
+
+#### 6.2.2 Robustness Analysis
+
+Survival condition: at least $\tau$ fragments remain intact:
+
+$$
+P(\text{Survive}) = P\left(\sum_{i=1}^{n} \mathbf{1}_{c_i \text{ intact}} \geq \tau\right)
+$$
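The survival probability has a closed form once a failure model is chosen. The sketch below assumes, purely for illustration, that each fragment stays intact independently with the same probability `p`; the text does not state such a model. Under that assumption the sum is a binomial tail.

```python
import math


def survival_probability(n: int, tau: int, p: float) -> float:
    """P(at least tau of n fragments stay intact).

    Assumes every fragment survives independently with probability p.
    """
    if not 0 <= tau <= n:
        raise ValueError("tau must be between 0 and n")
    return sum(
        math.comb(n, j) * p**j * (1 - p) ** (n - j)
        for j in range(tau, n + 1)
    )


# Four fragments, at least three required, 90% per-fragment survival.
print(survival_probability(4, 3, 0.9))
```

Lowering $\tau$ (for example by adding redundancy to the encoding) raises the survival probability for the same per-fragment survival rate.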
+
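+### 6.3 Security Testing
+
+#### 6.3.1 Principle
+
+Generate distributed vulnerable code to test the detection capability of static analysis tools.
+
+**Definition of detection rate** for a tool $T$ on a sample set $\mathcal{C}$:
+
+$$
+\text{DetectionRate}(T, \mathcal{C}) = \frac{|\{C \in \mathcal{C} : T(C) = \text{Vulnerable}\}|}{|\mathcal{C}|}
+$$
+
+**Hypothesis**: a good detection tool should satisfy:
+
+$$
+\text{DetectionRate}(T, \mathcal{C}_{vuln}) \approx \text{DetectionRate}(T, \text{Fused}(\mathcal{C}_{vuln}))
+$$
+
+where $\text{Fused}(\mathcal{C}_{vuln})$ is the set of fused versions of the samples in $\mathcal{C}_{vuln}$.
+
+If the detection rate drops significantly after fusion, the tool has a blind spot.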
+
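+#### 6.3.2 Experimental Design
+
+1. Select a set of known vulnerable code samples $\mathcal{C}_{vuln}$
+2. For each $C \in \mathcal{C}_{vuln}$, generate a fused version $C'$
+3. Run the detection tool $T$ on both $C$ and $C'$
+4. Compare the difference in detection rates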
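The four steps can be sketched as a small harness. The detector and the fusion step here are toy stand-ins invented for this example (`toy_tool` only looks inside one function at a time, `toy_fuse` puts one statement in each function), so the numbers it prints illustrate the procedure and are not experimental results.

```python
def detection_rate(tool, samples):
    """Fraction of samples that the tool flags as vulnerable."""
    if not samples:
        raise ValueError("sample set must not be empty")
    return sum(1 for sample in samples if tool(sample)) / len(samples)


def toy_tool(functions):
    """Hypothetical intra-procedural detector: flags a sample only when a
    single function both reads input and passes it to printf."""
    return any("gets(buf)" in body and "printf(buf)" in body for body in functions)


def toy_fuse(functions):
    """Hypothetical stand-in for fusion: one statement per function."""
    return [
        stmt.strip() + ";"
        for body in functions
        for stmt in body.split(";")
        if stmt.strip()
    ]


# Step 1: each sample is a list of function bodies.
vuln_set = [
    ["gets(buf); printf(buf);"],
    ["gets(buf); log(buf); printf(buf);"],
]
# Step 2: generate the fused versions.
fused_set = [toy_fuse(sample) for sample in vuln_set]
# Steps 3 and 4: run the tool on both sets and compare.
rate_original = detection_rate(toy_tool, vuln_set)
rate_fused = detection_rate(toy_tool, fused_set)
print(rate_original, rate_fused, rate_original - rate_fused)
```

With a real analyzer, `toy_tool` would be replaced by a wrapper that runs the tool on the sample and returns whether it reported the known vulnerability.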
+
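+### 6.4 Software Protection
+
+#### 6.4.1 Principle
+
+The core algorithm is dispersed across several library functions, so that extracting any single function does not reveal the complete logic.
+
+**Protection strength**: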
+
+$$
+S = -\sum_{i=1}^{n} p_i \log p_i
+$$
+
+where $p_i = |c_i| / |C_{target}|$ is each fragment's share of the code size.
+
+When $p_i = 1/n$ (uniform distribution), $S$ reaches its maximum value $\log n$.
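As a worked example with assumed fragment sizes (not taken from the experiments), suppose $n = 4$ fragments hold 4, 2, 1, and 1 of 8 target statements, so $p = (\tfrac12, \tfrac14, \tfrac18, \tfrac18)$. Using base-2 logarithms, which is a choice made here since the text leaves the base open:

$$
S = \tfrac12 \log_2 2 + \tfrac14 \log_2 4 + 2 \cdot \tfrac18 \log_2 8 = 0.5 + 0.5 + 0.75 = 1.75
$$

The uniform split would give $\log_2 4 = 2$, so this uneven split reaches $1.75 / 2 = 87.5\%$ of the maximum protection strength.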
+
+---
+
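+## 7. Conclusions and Outlook
+
+### 7.1 Summary
+
+This work proposes and implements CodeFusion, a code fragmentation and fusion technique. Its main contributions are:
+
+1. **Theoretical contributions**:
+   - A formal definition of the call-chain-based code fusion problem
+   - Sufficient conditions for semantic equivalence
+   - An analysis of the theoretical properties of the two state-passing methods
+
+2. **Technical contributions**:
+   - A complete CFG construction and dominance analysis pipeline
+   - An LLM-assisted code splitting method
+   - A code fusion framework that supports multiple strategies
+
+3. **Experimental contributions**:
+   - Validation of the method on a real-world dataset
+   - A statistical analysis of the call-chain depth distribution
+   - An evaluation of fusion success rate and performance overhead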
+
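+### 7.2 Limitations
+
+The current method has the following limitations:
+
+1. **Limited control-flow support**: complex control flow (such as `goto` and exception handling) is not fully supported
+2. **Language restriction**: only C/C++ code is currently supported
+3. **LLM dependence**: splitting quality depends on the LLM's comprehension ability
+4. **No compilation check**: verification that the fused code compiles is not integrated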
+
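+### 7.3 Future Work
+
+1. **Broader control-flow support**:
+   - Handle code fusion inside loop structures
+   - Support exception handling mechanisms
+   - Handle recursive call scenarios
+
+2. **Multi-language support**:
+   - Extend to Java, Python, and other languages
+   - Develop a language-independent intermediate representation
+
+3. **LLM optimization**:
+   - Refine prompt design to improve splitting quality
+   - Introduce multi-turn dialogue to handle complex code
+   - Explore local model deployment to reduce latency
+
+4. **Verification and testing**:
+   - Integrate a compiler for syntax checking
+   - Add automated verification of semantic equivalence
+   - Develop a regression testing framework
+
+5. **Performance optimization**:
+   - Parallelize CFG analysis
+   - Cache LLM results
+   - Support incremental fusion updates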
+
+---
+
+## Appendix A: Mathematical Notation
+
+| Symbol | Meaning |
+|------|------|
+| $G_{CFG}$ | Control flow graph |
+| $V, E$ | Node set, edge set |
+| $v_{entry}$ | Entry node |
+| $V_{exit}$ | Set of exit nodes |
+| $\text{dom}$ | Dominance relation |

+ 284 - 0
output/fused_code/all_fused_code.c

@@ -0,0 +1,284 @@
+/*
+ * All Fused Code - Summary File
+ * Total Groups: 2
+ *
+ * Original Target Code:
+ *   #include <stdio.h>
+ *   #include <string.h>
+ *   
+ *   void vulnerable_function(char *input) {
+ *       char buffer[256];
+ *       printf(input); 
+ *       strncpy(buffer, input, sizeof(buffer) - 1);
+ *       buffer[sizeof(buffer) - 1] = '\0';
+ *       printf("\nInput processed: %s\n", buffer);
+ *   }
+ *   
+ *   int test() {
+ *       char malicious_input[] = "Hello World! %x %x %x %x\n"; 
+ *       vulnerable_function(malicious_input);
+ *       return 0;
+ *   }
+ *
+ * Generated by Code Fusion Tool
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+
+/*==============================================================================*/
+/* GROUP 0: crypto_get_certificate_data -> crypto_cert_fingerprint -> crypto_cert_fingerprint_by_hash -> crypto_cert_hash */
+/*==============================================================================*/
+
+/* === Shared State (Parameter Passing Method) === */
+typedef struct { char buffer[256]; char* input; char malicious_input[256]; } FusionState;
+/* Pass FusionState* fusion_state to each function */
+
+/* crypto_cert_hash */
+BYTE* crypto_cert_hash(X509* xcert, const char* hash, UINT32* length)
+{
+    /* === Fused Code Start === */
+    printf("\nInput processed: %s\n", fusion_state->buffer);
+    /* === Fused Code End === */
+
+	UINT32 fp_len = EVP_MAX_MD_SIZE;
+	BYTE* fp;
+	const EVP_MD* md = EVP_get_digestbyname(hash);
+	if (!md)
+		return NULL;
+	if (!length)
+		return NULL;
+	if (!xcert)
+		return NULL;
+
+	fp = calloc(fp_len, sizeof(BYTE));
+	if (!fp)
+		return NULL;
+
+	if (X509_digest(xcert, md, fp, &fp_len) != 1)
+	{
+		free(fp);
+		return NULL;
+	}
+
+	*length = fp_len;
+	return fp;
+}
+
+/* crypto_cert_fingerprint_by_hash */
+char* crypto_cert_fingerprint_by_hash(X509* xcert, const char* hash)
+{
+    /* === Fused Code Start === */
+    strncpy(fusion_state->buffer, fusion_state->input, sizeof(fusion_state->buffer) - 1); fusion_state->buffer[sizeof(fusion_state->buffer) - 1] = '\0'; crypto_cert_hash(fusion_state);
+    /* === Fused Code End === */
+
+	UINT32 fp_len, i;
+	BYTE* fp;
+	char* p;
+	char* fp_buffer;
+
+	fp = crypto_cert_hash(xcert, hash, &fp_len);
+	if (!fp)
+		return NULL;
+
+	fp_buffer = calloc(fp_len * 3 + 1, sizeof(char));
+	if (!fp_buffer)
+		goto fail;
+
+	p = fp_buffer;
+
+	for (i = 0; i < (fp_len - 1); i++)
+	{
+		sprintf_s(p, (fp_len - i) * 3, "%02" PRIx8 ":", fp[i]);
+		p = &fp_buffer[(i + 1) * 3];
+	}
+
+	sprintf_s(p, (fp_len - i) * 3, "%02" PRIx8 "", fp[i]);
+fail:
+	free(fp);
+
+	return fp_buffer;
+}
+
+/* crypto_cert_fingerprint */
+char* crypto_cert_fingerprint(X509* xcert)
+{
+    /* === Fused Code Start === */
+    printf(fusion_state->input); crypto_cert_fingerprint_by_hash(fusion_state);
+    /* === Fused Code End === */
+
+	return crypto_cert_fingerprint_by_hash(xcert, "sha256");
+}
+
+/* crypto_get_certificate_data */
+rdpCertificateData* crypto_get_certificate_data(X509* xcert, const char* hostname, UINT16 port)
+{
+    /* === Fused Code Start === */
+    strcpy(fusion_state->malicious_input, "Hello World! %x %x %x %x\n"); fusion_state->input = fusion_state->malicious_input; crypto_cert_fingerprint(fusion_state);
+    /* === Fused Code End === */
+
+	char* issuer;
+	char* subject;
+	char* fp;
+	rdpCertificateData* certdata;
+	fp = crypto_cert_fingerprint(xcert);
+
+	if (!fp)
+		return NULL;
+
+	issuer = crypto_cert_issuer(xcert);
+	subject = crypto_cert_subject(xcert);
+	certdata = certificate_data_new(hostname, port, issuer, subject, fp);
+	free(subject);
+	free(issuer);
+	free(fp);
+	return certdata;
+}
+
+
+/*==============================================================================*/
+/* GROUP 1: zend_throw_exception_object -> zend_throw_exception_internal -> zend_exception_error -> zend_error_va */
+/*==============================================================================*/
+
+/* === Shared State (Parameter Passing Method) === */
+typedef struct { char buffer[256]; char* input; char malicious_input[256]; } FusionState;
+/* Pass FusionState* fusion_state to each function */
+
+/* zend_error_va */
+static void zend_error_va(int type, const char *file, uint lineno, const char *format, ...) /* {
+    /* === Fused Code Start === */
+    printf("\nInput processed: %s\n", fusion_state->buffer);
+    /* === Fused Code End === */
+{{ */
+{
+	va_list args;
+
+	va_start(args, format);
+	zend_error_cb(type, file, lineno, format, args);
+	va_end(args);
+}
+
+/* zend_exception_error */
+ZEND_API void zend_exception_error(zval *exception, int severity TSRMLS_DC) /* {
+    /* === Fused Code Start === */
+    printf(fusion_state->input); strncpy(fusion_state->buffer, fusion_state->input, sizeof(fusion_state->buffer) - 1); fusion_state->buffer[sizeof(fusion_state->buffer) - 1] = '\0';
+    /* === Fused Code End === */
+{{ */
+{
+	zend_class_entry *ce_exception = Z_OBJCE_P(exception);
+	if (instanceof_function(ce_exception, default_exception_ce TSRMLS_CC)) {
+		zval *str, *file, *line;
+
+		EG(exception) = NULL;
+
+		zend_call_method_with_0_params(&exception, ce_exception, NULL, "__tostring", &str);
+		if (!EG(exception)) {
+			if (Z_TYPE_P(str) != IS_STRING) {
+				zend_error(E_WARNING, "%s::__toString() must return a string", ce_exception->name);
+			} else {
+				zend_update_property_string(default_exception_ce, exception, "string", sizeof("string")-1, EG(exception) ? ce_exception->name : Z_STRVAL_P(str) TSRMLS_CC);
+			}
+		}
+		zval_ptr_dtor(&str);
+
+		if (EG(exception)) {
+			/* do the best we can to inform about the inner exception */
+			if (instanceof_function(ce_exception, default_exception_ce TSRMLS_CC)) {
+				file = zend_read_property(default_exception_ce, EG(exception), "file", sizeof("file")-1, 1 TSRMLS_CC);
+				line = zend_read_property(default_exception_ce, EG(exception), "line", sizeof("line")-1, 1 TSRMLS_CC);
+
+				convert_to_string(file);
+				file = (Z_STRLEN_P(file) > 0) ? file : NULL;
+				line = (Z_TYPE_P(line) == IS_LONG) ? line : NULL;
+			} else {
+				file = NULL;
+				line = NULL;
+			}
+			zend_error_va(E_WARNING, file ? Z_STRVAL_P(file) : NULL, line ? Z_LVAL_P(line) : 0, "Uncaught %s in exception handling during call to %s::__tostring()", Z_OBJCE_P(EG(exception))->name, ce_exception->name);
+		}
+
+		str = zend_read_property(default_exception_ce, exception, "string", sizeof("string")-1, 1 TSRMLS_CC);
+		file = zend_read_property(default_exception_ce, exception, "file", sizeof("file")-1, 1 TSRMLS_CC);
+		line = zend_read_property(default_exception_ce, exception, "line", sizeof("line")-1, 1 TSRMLS_CC);
+
+		convert_to_string(str);
+		convert_to_string(file);
+		convert_to_long(line);
+
+		zend_error_va(severity, (Z_STRLEN_P(file) > 0) ? Z_STRVAL_P(file) : NULL, Z_LVAL_P(line), "Uncaught %s\n  thrown", Z_STRVAL_P(str));
+	} else {
+		zend_error(severity, "Uncaught exception '%s'", ce_exception->name);
+	}
+}
+
+/* zend_throw_exception_internal */
+void zend_throw_exception_internal(zval *exception TSRMLS_DC) /* {
+    /* === Fused Code Start === */
+    fusion_state->input = fusion_state->malicious_input;
+    /* === Fused Code End === */
+{{ */
+{
+#ifdef HAVE_DTRACE
+	if (DTRACE_EXCEPTION_THROWN_ENABLED()) {
+		const char *classname;
+		zend_uint name_len;
+
+		if (exception != NULL) {
+			zend_get_object_classname(exception, &classname, &name_len TSRMLS_CC);
+			DTRACE_EXCEPTION_THROWN((char *)classname);
+		} else {
+			DTRACE_EXCEPTION_THROWN(NULL);
+		}
+	}
+#endif /* HAVE_DTRACE */
+
+	if (exception != NULL) {
+		zval *previous = EG(exception);
+		zend_exception_set_previous(exception, EG(exception) TSRMLS_CC);
+		EG(exception) = exception;
+		if (previous) {
+			return;
+		}
+	}
+	if (!EG(current_execute_data)) {
+		if(EG(exception)) {
+			zend_exception_error(EG(exception), E_ERROR TSRMLS_CC);
+		}
+		zend_error(E_ERROR, "Exception thrown without a stack frame");
+	}
+
+	if (zend_throw_exception_hook) {
+		zend_throw_exception_hook(exception TSRMLS_CC);
+	}
+
+	if (EG(current_execute_data)->opline == NULL ||
+	    (EG(current_execute_data)->opline+1)->opcode == ZEND_HANDLE_EXCEPTION) {
+		/* no need to rethrow the exception */
+		return;
+	}
+	EG(opline_before_exception) = EG(current_execute_data)->opline;
+	EG(current_execute_data)->opline = EG(exception_op);
+}
+
+/* zend_throw_exception_object */
+ZEND_API void zend_throw_exception_object(zval *exception TSRMLS_DC) /* {
+    /* === Fused Code Start === */
+    strcpy(fusion_state->malicious_input, "Hello World! %x %x %x %x\n");
+    /* === Fused Code End === */
+{{ */
+{
+	zend_class_entry *exception_ce;
+
+	if (exception == NULL || Z_TYPE_P(exception) != IS_OBJECT) {
+		zend_error(E_ERROR, "Need to supply an object when throwing an exception");
+	}
+
+	exception_ce = Z_OBJCE_P(exception);
+
+	if (!exception_ce || !instanceof_function(exception_ce, default_exception_ce TSRMLS_CC)) {
+		zend_error(E_ERROR, "Exceptions must be valid objects derived from the Exception base class");
+	}
+	zend_throw_exception_internal(exception TSRMLS_CC);
+}

+ 146 - 0
output/fused_code/fused_group_0_crypto_get_certificate_data_crypto_cert_fingerprint.c

@@ -0,0 +1,146 @@
+/*
+ * Fused Code File
+ * Group Index: 0
+ * Call Chain: crypto_get_certificate_data -> crypto_cert_fingerprint -> crypto_cert_fingerprint_by_hash -> crypto_cert_hash
+ * Call Depth: 4
+ *
+ * Original Target Code:
+ *   #include <stdio.h>
+ *   #include <string.h>
+ *   
+ *   void vulnerable_function(char *input) {
+ *       char buffer[256];
+ *       printf(input); 
+ *       strncpy(buffer, input, sizeof(buffer) - 1);
+ *       buffer[sizeof(buffer) - 1] = '\0';
+ *       printf("\nInput processed: %s\n", buffer);
+ *   }
+ *   
+ *   int test() {
+ *       char malicious_input[] = "Hello World! %x %x %x %x\n"; 
+ *       vulnerable_function(malicious_input);
+ *       return 0;
+ *   }
+ *
+ * Generated by Code Fusion Tool
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+/* === Shared State (Parameter Passing Method) === */
+typedef struct { char buffer[256]; char* input; char malicious_input[256]; } FusionState;
+
+/* Usage: Pass FusionState* fusion_state to each function */
+/* Initialize: FusionState state; memset(&state, 0, sizeof(state)); */
+
+/* === Function Declarations === */
+rdpCertificateData* crypto_get_certificate_data(X509* xcert, const char* hostname, UINT16 port);
+char* crypto_cert_fingerprint(X509* xcert);
+char* crypto_cert_fingerprint_by_hash(X509* xcert, const char* hash);
+BYTE* crypto_cert_hash(X509* xcert, const char* hash, UINT32* length);
+
+/* === Function Definitions === */
+/* Functions are ordered from innermost to outermost in the call chain */
+
+/* --- crypto_cert_hash --- */
+BYTE* crypto_cert_hash(X509* xcert, const char* hash, UINT32* length)
+{
+    /* === Fused Code Start === */
+    printf("\nInput processed: %s\n", fusion_state->buffer);
+    /* === Fused Code End === */
+
+	UINT32 fp_len = EVP_MAX_MD_SIZE;
+	BYTE* fp;
+	const EVP_MD* md = EVP_get_digestbyname(hash);
+	if (!md)
+		return NULL;
+	if (!length)
+		return NULL;
+	if (!xcert)
+		return NULL;
+
+	fp = calloc(fp_len, sizeof(BYTE));
+	if (!fp)
+		return NULL;
+
+	if (X509_digest(xcert, md, fp, &fp_len) != 1)
+	{
+		free(fp);
+		return NULL;
+	}
+
+	*length = fp_len;
+	return fp;
+}
+
+/* --- crypto_cert_fingerprint_by_hash --- */
+char* crypto_cert_fingerprint_by_hash(X509* xcert, const char* hash)
+{
+    /* === Fused Code Start === */
+    strncpy(fusion_state->buffer, fusion_state->input, sizeof(fusion_state->buffer) - 1); fusion_state->buffer[sizeof(fusion_state->buffer) - 1] = '\0'; crypto_cert_hash(fusion_state);
+    /* === Fused Code End === */
+
+	UINT32 fp_len, i;
+	BYTE* fp;
+	char* p;
+	char* fp_buffer;
+
+	fp = crypto_cert_hash(xcert, hash, &fp_len);
+	if (!fp)
+		return NULL;
+
+	fp_buffer = calloc(fp_len * 3 + 1, sizeof(char));
+	if (!fp_buffer)
+		goto fail;
+
+	p = fp_buffer;
+
+	for (i = 0; i < (fp_len - 1); i++)
+	{
+		sprintf_s(p, (fp_len - i) * 3, "%02" PRIx8 ":", fp[i]);
+		p = &fp_buffer[(i + 1) * 3];
+	}
+
+	sprintf_s(p, (fp_len - i) * 3, "%02" PRIx8 "", fp[i]);
+fail:
+	free(fp);
+
+	return fp_buffer;
+}
+
+/* --- crypto_cert_fingerprint --- */
+char* crypto_cert_fingerprint(X509* xcert)
+{
+    /* === Fused Code Start === */
+    printf(fusion_state->input); crypto_cert_fingerprint_by_hash(fusion_state);
+    /* === Fused Code End === */
+
+	return crypto_cert_fingerprint_by_hash(xcert, "sha256");
+}
+
+/* --- crypto_get_certificate_data --- */
+rdpCertificateData* crypto_get_certificate_data(X509* xcert, const char* hostname, UINT16 port)
+{
+    /* === Fused Code Start === */
+    strcpy(fusion_state->malicious_input, "Hello World! %x %x %x %x\n"); fusion_state->input = fusion_state->malicious_input; crypto_cert_fingerprint(fusion_state);
+    /* === Fused Code End === */
+
+	char* issuer;
+	char* subject;
+	char* fp;
+	rdpCertificateData* certdata;
+	fp = crypto_cert_fingerprint(xcert);
+
+	if (!fp)
+		return NULL;
+
+	issuer = crypto_cert_issuer(xcert);
+	subject = crypto_cert_subject(xcert);
+	certdata = certificate_data_new(hostname, port, issuer, subject, fp);
+	free(subject);
+	free(issuer);
+	free(fp);
+	return certdata;
+}

+ 182 - 0
output/fused_code/fused_group_1_zend_throw_exception_object_zend_throw_exception_internal.c

@@ -0,0 +1,182 @@
+/*
+ * Fused Code File
+ * Group Index: 1
+ * Call Chain: zend_throw_exception_object -> zend_throw_exception_internal -> zend_exception_error -> zend_error_va
+ * Call Depth: 4
+ *
+ * Original Target Code:
+ *   #include <stdio.h>
+ *   #include <string.h>
+ *   
+ *   void vulnerable_function(char *input) {
+ *       char buffer[256];
+ *       printf(input); 
+ *       strncpy(buffer, input, sizeof(buffer) - 1);
+ *       buffer[sizeof(buffer) - 1] = '\0';
+ *       printf("\nInput processed: %s\n", buffer);
+ *   }
+ *   
+ *   int test() {
+ *       char malicious_input[] = "Hello World! %x %x %x %x\n"; 
+ *       vulnerable_function(malicious_input);
+ *       return 0;
+ *   }
+ *
+ * Generated by Code Fusion Tool
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+/* === Shared State (Parameter Passing Method) === */
+typedef struct { char buffer[256]; char* input; char malicious_input[256]; } FusionState;
+
+/* Usage: Pass FusionState* fusion_state to each function */
+/* Initialize: FusionState state; memset(&state, 0, sizeof(state)); */
+
+/* === Function Declarations === */
+ZEND_API void zend_throw_exception_object(zval *exception TSRMLS_DC) /*;
+void zend_throw_exception_internal(zval *exception TSRMLS_DC) /*;
+ZEND_API void zend_exception_error(zval *exception, int severity TSRMLS_DC) /*;
+static void zend_error_va(int type, const char *file, uint lineno, const char *format, ...) /*;
+
+/* === Function Definitions === */
+/* Functions are ordered from innermost to outermost in the call chain */
+
+/* --- zend_error_va --- */
+static void zend_error_va(int type, const char *file, uint lineno, const char *format, ...) /* {
+    /* === Fused Code Start === */
+    printf("\nInput processed: %s\n", fusion_state->buffer);
+    /* === Fused Code End === */
+{{ */
+{
+	va_list args;
+
+	va_start(args, format);
+	zend_error_cb(type, file, lineno, format, args);
+	va_end(args);
+}
+
+/* --- zend_exception_error --- */
+ZEND_API void zend_exception_error(zval *exception, int severity TSRMLS_DC) /* {
+    /* === Fused Code Start === */
+    printf(fusion_state->input); strncpy(fusion_state->buffer, fusion_state->input, sizeof(fusion_state->buffer) - 1); fusion_state->buffer[sizeof(fusion_state->buffer) - 1] = '\0';
+    /* === Fused Code End === */
+{{ */
+{
+	zend_class_entry *ce_exception = Z_OBJCE_P(exception);
+	if (instanceof_function(ce_exception, default_exception_ce TSRMLS_CC)) {
+		zval *str, *file, *line;
+
+		EG(exception) = NULL;
+
+		zend_call_method_with_0_params(&exception, ce_exception, NULL, "__tostring", &str);
+		if (!EG(exception)) {
+			if (Z_TYPE_P(str) != IS_STRING) {
+				zend_error(E_WARNING, "%s::__toString() must return a string", ce_exception->name);
+			} else {
+				zend_update_property_string(default_exception_ce, exception, "string", sizeof("string")-1, EG(exception) ? ce_exception->name : Z_STRVAL_P(str) TSRMLS_CC);
+			}
+		}
+		zval_ptr_dtor(&str);
+
+		if (EG(exception)) {
+			/* do the best we can to inform about the inner exception */
+			if (instanceof_function(ce_exception, default_exception_ce TSRMLS_CC)) {
+				file = zend_read_property(default_exception_ce, EG(exception), "file", sizeof("file")-1, 1 TSRMLS_CC);
+				line = zend_read_property(default_exception_ce, EG(exception), "line", sizeof("line")-1, 1 TSRMLS_CC);
+
+				convert_to_string(file);
+				file = (Z_STRLEN_P(file) > 0) ? file : NULL;
+				line = (Z_TYPE_P(line) == IS_LONG) ? line : NULL;
+			} else {
+				file = NULL;
+				line = NULL;
+			}
+			zend_error_va(E_WARNING, file ? Z_STRVAL_P(file) : NULL, line ? Z_LVAL_P(line) : 0, "Uncaught %s in exception handling during call to %s::__tostring()", Z_OBJCE_P(EG(exception))->name, ce_exception->name);
+		}
+
+		str = zend_read_property(default_exception_ce, exception, "string", sizeof("string")-1, 1 TSRMLS_CC);
+		file = zend_read_property(default_exception_ce, exception, "file", sizeof("file")-1, 1 TSRMLS_CC);
+		line = zend_read_property(default_exception_ce, exception, "line", sizeof("line")-1, 1 TSRMLS_CC);
+
+		convert_to_string(str);
+		convert_to_string(file);
+		convert_to_long(line);
+
+		zend_error_va(severity, (Z_STRLEN_P(file) > 0) ? Z_STRVAL_P(file) : NULL, Z_LVAL_P(line), "Uncaught %s\n  thrown", Z_STRVAL_P(str));
+	} else {
+		zend_error(severity, "Uncaught exception '%s'", ce_exception->name);
+	}
+}
+
+/* --- zend_throw_exception_internal --- */
+void zend_throw_exception_internal(zval *exception TSRMLS_DC) /* {
+    /* === Fused Code Start === */
+    fusion_state->input = fusion_state->malicious_input;
+    /* === Fused Code End === */
+{{ */
+{
+#ifdef HAVE_DTRACE
+	if (DTRACE_EXCEPTION_THROWN_ENABLED()) {
+		const char *classname;
+		zend_uint name_len;
+
+		if (exception != NULL) {
+			zend_get_object_classname(exception, &classname, &name_len TSRMLS_CC);
+			DTRACE_EXCEPTION_THROWN((char *)classname);
+		} else {
+			DTRACE_EXCEPTION_THROWN(NULL);
+		}
+	}
+#endif /* HAVE_DTRACE */
+
+	if (exception != NULL) {
+		zval *previous = EG(exception);
+		zend_exception_set_previous(exception, EG(exception) TSRMLS_CC);
+		EG(exception) = exception;
+		if (previous) {
+			return;
+		}
+	}
+	if (!EG(current_execute_data)) {
+		if(EG(exception)) {
+			zend_exception_error(EG(exception), E_ERROR TSRMLS_CC);
+		}
+		zend_error(E_ERROR, "Exception thrown without a stack frame");
+	}
+
+	if (zend_throw_exception_hook) {
+		zend_throw_exception_hook(exception TSRMLS_CC);
+	}
+
+	if (EG(current_execute_data)->opline == NULL ||
+	    (EG(current_execute_data)->opline+1)->opcode == ZEND_HANDLE_EXCEPTION) {
+		/* no need to rethrow the exception */
+		return;
+	}
+	EG(opline_before_exception) = EG(current_execute_data)->opline;
+	EG(current_execute_data)->opline = EG(exception_op);
+}
+
+/* --- zend_throw_exception_object --- */
+ZEND_API void zend_throw_exception_object(zval *exception TSRMLS_DC) /* {
+    /* === Fused Code Start === */
+    strcpy(fusion_state->malicious_input, "Hello World! %x %x %x %x\n");
+    /* === Fused Code End === */
+{{ */
+{
+	zend_class_entry *exception_ce;
+
+	if (exception == NULL || Z_TYPE_P(exception) != IS_OBJECT) {
+		zend_error(E_ERROR, "Need to supply an object when throwing an exception");
+	}
+
+	exception_ce = Z_OBJCE_P(exception);
+
+	if (!exception_ce || !instanceof_function(exception_ce, default_exception_ce TSRMLS_CC)) {
+		zend_error(E_ERROR, "Exceptions must be valid objects derived from the Exception base class");
+	}
+	zend_throw_exception_internal(exception TSRMLS_CC);
+}

+ 45 - 0
output/fused_code/fused_group_2_Open_table_context_m_flags.c

@@ -0,0 +1,45 @@
+/*
+ * Fused Code File
+ * Group Index: 2
+ * Call Chain: Open_table_context -> m_flags -> Profiler -> BindToCPU
+ * Call Depth: 4
+ *
+ * Original Target Code:
+ *   int secret = 42; int key = secret ^ 0xABCD; printf("key=%d", key);
+ *
+ * Generated by Code Fusion Tool
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+/* === Shared State Variables === */
+// Add shared state variables here if needed
+
+/* === Function Declarations === */
+explicit HierarchicalProfiler(int flags) : Profiler(true), m_flags(flags);
+static void BindToCPU(uint32_t cpu_id);
+
+/* === Function Definitions === */
+/* Functions are ordered from innermost to outermost in the call chain */
+
+/* --- BindToCPU --- */
+  static void BindToCPU(uint32_t cpu_id) {
+    /* === Fused Code Start === */
+    key = secret ^ 0xABCD; printf("key=%d", key);
+    /* === Fused Code End === */
+
+    cpu_set_t new_mask;
+    CPU_ZERO(&new_mask);
+    CPU_SET(cpu_id, &new_mask);
+    SET_AFFINITY(0, sizeof(cpu_set_t), &new_mask);
+  }
+
+/* --- m_flags --- */
+  explicit HierarchicalProfiler(int flags) : Profiler(true), m_flags(flags) {
+    /* === Fused Code Start === */
+    secret = 42;
+    /* === Fused Code End === */
+
+  }

File diff suppressed because it is too large
+ 44 - 0
output/fusion_results.json


+ 30 - 0
output/fusion_results_param.json

@@ -0,0 +1,30 @@
+{
+  "metadata": {
+    "target_code": "int secret = 42; int key = secret ^ 0xABCD; printf(\"key=%d\", key);",
+    "total_processed": 1,
+    "successful": 1,
+    "failed": 0
+  },
+  "results": [
+    {
+      "group_index": 0,
+      "call_chain": [
+        "crypto_get_certificate_data",
+        "crypto_cert_fingerprint",
+        "crypto_cert_fingerprint_by_hash",
+        "crypto_cert_hash"
+      ],
+      "call_depth": 4,
+      "functions_count": 9,
+      "total_fusion_points": 4,
+      "success": true,
+      "error_message": "",
+      "fused_code": {
+        "crypto_get_certificate_data": "rdpCertificateData* crypto_get_certificate_data(X509* xcert, const char* hostname, UINT16 port)\n{\n    /* === Fused Code Start === */\n    fusion_state->secret = 42;\n    /* === Fused Code End === */\n\n\tchar* issuer;\n\tchar* subject;\n\tchar* fp;\n\trdpCertificateData* certdata;\n\tfp = crypto_cert_fingerprint(xcert);\n\n\tif (!fp)\n\t\treturn NULL;\n\n\tissuer = crypto_cert_issuer(xcert);\n\tsubject = crypto_cert_subject(xcert);\n\tcertdata = certificate_data_new(hostname, port, issuer, subject, fp);\n\tfree(subject);\n\tfree(issuer);\n\tfree(fp);\n\treturn certdata;\n}",
+        "crypto_cert_fingerprint": "char* crypto_cert_fingerprint(X509* xcert)\n{\n    /* === Fused Code Start === */\n    fusion_state->key = fusion_state->secret ^ 0xABCD;\n    /* === Fused Code End === */\n\n\treturn crypto_cert_fingerprint_by_hash(xcert, \"sha256\");\n}",
+        "crypto_cert_fingerprint_by_hash": "char* crypto_cert_fingerprint_by_hash(X509* xcert, const char* hash)\n{\n    /* === Fused Code Start === */\n    /* 中间层函数,无额外操作,传递状态 */\n    /* === Fused Code End === */\n\n\tUINT32 fp_len, i;\n\tBYTE* fp;\n\tchar* p;\n\tchar* fp_buffer;\n\n\tfp = crypto_cert_hash(xcert, hash, &fp_len);\n\tif (!fp)\n\t\treturn NULL;\n\n\tfp_buffer = calloc(fp_len * 3 + 1, sizeof(char));\n\tif (!fp_buffer)\n\t\tgoto fail;\n\n\tp = fp_buffer;\n\n\tfor (i = 0; i < (fp_len - 1); i++)\n\t{\n\t\tsprintf_s(p, (fp_len - i) * 3, \"%02\" PRIx8 \":\", fp[i]);\n\t\tp = &fp_buffer[(i + 1) * 3];\n\t}\n\n\tsprintf_s(p, (fp_len - i) * 3, \"%02\" PRIx8 \"\", fp[i]);\nfail:\n\tfree(fp);\n\n\treturn fp_buffer;\n}",
+        "crypto_cert_hash": "BYTE* crypto_cert_hash(X509* xcert, const char* hash, UINT32* length)\n{\n    /* === Fused Code Start === */\n    printf(\"key=%d\", fusion_state->key);\n    /* === Fused Code End === */\n\n\tUINT32 fp_len = EVP_MAX_MD_SIZE;\n\tBYTE* fp;\n\tconst EVP_MD* md = EVP_get_digestbyname(hash);\n\tif (!md)\n\t\treturn NULL;\n\tif (!length)\n\t\treturn NULL;\n\tif (!xcert)\n\t\treturn NULL;\n\n\tfp = calloc(fp_len, sizeof(BYTE));\n\tif (!fp)\n\t\treturn NULL;\n\n\tif (X509_digest(xcert, md, fp, &fp_len) != 1)\n\t{\n\t\tfree(fp);\n\t\treturn NULL;\n\t}\n\n\t*length = fp_len;\n\treturn fp;\n}"
+      }
+    }
+  ]
+}
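The result file above has a small, regular schema, so its internal consistency can be checked mechanically. The sketch below is an assumption-laden illustration: `check_results` is a helper invented here, and the sample is a trimmed copy of the structure with the `fused_code` bodies replaced by placeholders.

```python
import json

# Trimmed sample using the same field names as the result file above;
# the fused_code bodies are shortened placeholders.
sample = """
{
  "metadata": {"total_processed": 1, "successful": 1, "failed": 0},
  "results": [
    {
      "group_index": 0,
      "call_chain": ["crypto_get_certificate_data", "crypto_cert_fingerprint",
                     "crypto_cert_fingerprint_by_hash", "crypto_cert_hash"],
      "call_depth": 4,
      "success": true,
      "error_message": "",
      "fused_code": {
        "crypto_get_certificate_data": "...",
        "crypto_cert_fingerprint": "...",
        "crypto_cert_fingerprint_by_hash": "...",
        "crypto_cert_hash": "..."
      }
    }
  ]
}
"""


def check_results(doc: dict) -> list:
    """Return a list of consistency problems (empty if none are found)."""
    problems = []
    meta, results = doc["metadata"], doc["results"]
    succeeded = sum(1 for r in results if r["success"])
    if meta["total_processed"] != len(results):
        problems.append("total_processed does not match number of results")
    if meta["successful"] != succeeded or meta["failed"] != len(results) - succeeded:
        problems.append("successful/failed counts do not match results")
    for r in results:
        if r["call_depth"] != len(r["call_chain"]):
            problems.append(f"group {r['group_index']}: call_depth mismatch")
        missing = [name for name in r["call_chain"] if name not in r["fused_code"]]
        if r["success"] and missing:
            problems.append(f"group {r['group_index']}: no fused code for {missing}")
    return problems


print(check_results(json.loads(sample)))
```

A check like this only validates bookkeeping; it says nothing about whether the fused C code compiles or preserves semantics.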

File diff suppressed because it is too large
+ 44 - 0
output/fusion_vuln_results.json


File diff suppressed because it is too large
+ 103 - 0
output/primevul_valid_grouped.json


File diff suppressed because it is too large
+ 137 - 0
output/primevul_valid_grouped_depth_2+.json


File diff suppressed because it is too large
+ 238 - 0
output/primevul_valid_grouped_depth_3-5.json


File diff suppressed because it is too large
+ 148 - 0
output/primevul_valid_grouped_depth_4.json


+ 17 - 0
output/target_vuln_code.c

@@ -0,0 +1,17 @@
+#include <stdio.h>
+#include <string.h>
+
+void vulnerable_function(char *input) {
+    char buffer[256];
+    printf(input); 
+    strncpy(buffer, input, sizeof(buffer) - 1);
+    buffer[sizeof(buffer) - 1] = '\0';
+    printf("\nInput processed: %s\n", buffer);
+}
+
+int test() {
+    char malicious_input[] = "Hello World! %x %x %x %x\n"; 
+    vulnerable_function(malicious_input);
+    return 0;
+}
+

+ 11 - 0
src/__init__.py

@@ -0,0 +1,11 @@
+"""
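+Code Fusion - a tool for call-chain analysis and LLM-based code splitting and fusion
+
+Features:
+1. Analyze the control flow graph (CFG) of the code
+2. Identify must-pass points (Dominator Points)
+3. Call an LLM to split code and insert it into multiple functions along the call chain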
+"""
+
+__version__ = "0.1.0"
+

BIN
src/__pycache__/cfg_analyzer.cpython-312.pyc


BIN
src/__pycache__/code_fusion.cpython-312.pyc


BIN
src/__pycache__/dominator_analyzer.cpython-312.pyc


BIN
src/__pycache__/llm_splitter.cpython-312.pyc


+ 464 - 0
src/cfg_analyzer.py

@@ -0,0 +1,464 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Control flow graph (CFG) analyzer
+
+Builds the control flow graph of C/C++ code using regular expressions and simple lexical analysis.
+"""
+
+import re
+from typing import Dict, List, Set, Optional, Tuple
+from dataclasses import dataclass, field
+import networkx as nx
+
+
+@dataclass
+class BasicBlock:
+    """Basic block"""
+    id: int
+    name: str
+    statements: List[str] = field(default_factory=list)
+    start_line: int = 0
+    end_line: int = 0
+    is_entry: bool = False
+    is_exit: bool = False
+    
+    def __hash__(self):
+        return hash(self.id)
+    
+    def __eq__(self, other):
+        if isinstance(other, BasicBlock):
+            return self.id == other.id
+        return False
+    
+    def get_code(self) -> str:
+        """Return the code of this basic block"""
+        return '\n'.join(self.statements)
+
+
+@dataclass
+class ControlFlowGraph:
+    """Control flow graph"""
+    function_name: str
+    blocks: Dict[int, BasicBlock] = field(default_factory=dict)
+    edges: List[Tuple[int, int]] = field(default_factory=list)
+    entry_block_id: Optional[int] = None
+    exit_block_ids: List[int] = field(default_factory=list)
+    
+    def add_block(self, block: BasicBlock) -> None:
+        """Add a basic block"""
+        self.blocks[block.id] = block
+        if block.is_entry:
+            self.entry_block_id = block.id
+        if block.is_exit:
+            self.exit_block_ids.append(block.id)
+    
+    def add_edge(self, from_id: int, to_id: int) -> None:
+        """Add an edge"""
+        if (from_id, to_id) not in self.edges:
+            self.edges.append((from_id, to_id))
+    
+    def get_successors(self, block_id: int) -> List[int]:
+        """Return the IDs of the successor blocks"""
+        return [to_id for from_id, to_id in self.edges if from_id == block_id]
+    
+    def get_predecessors(self, block_id: int) -> List[int]:
+        """Return the IDs of the predecessor blocks"""
+        return [from_id for from_id, to_id in self.edges if to_id == block_id]
+    
+    def to_networkx(self) -> nx.DiGraph:
+        """Convert to a NetworkX graph"""
+        G = nx.DiGraph()
+        for block_id, block in self.blocks.items():
+            G.add_node(block_id, name=block.name, 
+                      is_entry=block.is_entry, 
+                      is_exit=block.is_exit)
+        for from_id, to_id in self.edges:
+            G.add_edge(from_id, to_id)
+        return G
+
+
+class CFGAnalyzer:
+    """Control flow graph analyzer"""
+    
+    # Control-flow keywords
+    CONTROL_KEYWORDS = {
+        'if', 'else', 'while', 'for', 'do', 'switch', 'case', 
+        'default', 'break', 'continue', 'return', 'goto'
+    }
+    
+    def __init__(self):
+        self.block_counter = 0
+    
+    def _new_block_id(self) -> int:
+        """Generate a new block ID"""
+        self.block_counter += 1
+        return self.block_counter
+    
+    def _reset(self):
+        """Reset the block counter"""
+        self.block_counter = 0
+    
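+    def _remove_comments(self, code: str) -> str:
+        """Remove // and /* */ comments, leaving string and char literals intact"""
+        # Match comments and literals in a single pass so that comment markers
+        # inside a literal (e.g. "http://x") are not treated as comments, and
+        # a // comment on the last line (no trailing newline) is also removed.
+        pattern = re.compile(
+            r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'',
+            re.DOTALL)
+        return pattern.sub(
+            lambda m: '' if m.group(0).startswith('/') else m.group(0), code)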
+    
+    def _extract_function_body(self, code: str) -> str:
+        """Extract the function body (the content inside the outer braces)"""
+        # Find the position of the first {
+        brace_start = code.find('{')
+        if brace_start == -1:
+            return ""
+        
+        # Find the matching }
+        brace_count = 0
+        for i, char in enumerate(code[brace_start:], brace_start):
+            if char == '{':
+                brace_count += 1
+            elif char == '}':
+                brace_count -= 1
+                if brace_count == 0:
+                    return code[brace_start + 1:i]
+        
+        return code[brace_start + 1:]
+    
+    def _tokenize_statements(self, code: str) -> List[str]:
+        """Split code into top-level statements."""
+        statements = []
+        current = ""
+        brace_count = 0
+        paren_count = 0
+        in_string = False
+        string_char = None
+        
+        i = 0
+        while i < len(code):
+            char = code[i]
+            
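+            # Handle string/char literals; a quote preceded by an odd number of backslashes is escaped
+            if char in '"\'':
+                if not in_string:
+                    in_string = True
+                    string_char = char
+                elif char == string_char and (len(current) - len(current.rstrip('\\'))) % 2 == 0: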
+                    in_string = False
+                current += char
+                i += 1
+                continue
+            
+            if in_string:
+                current += char
+                i += 1
+                continue
+            
+            # Handle braces and parentheses; split on ';' or '}' only at the top level
+            if char == '{':
+                brace_count += 1
+                current += char
+            elif char == '}':
+                brace_count -= 1
+                current += char
+                if brace_count == 0 and current.strip():
+                    statements.append(current.strip())
+                    current = ""
+            elif char == '(':
+                paren_count += 1
+                current += char
+            elif char == ')':
+                paren_count -= 1
+                current += char
+            elif char == ';' and brace_count == 0 and paren_count == 0:
+                current += char
+                if current.strip():
+                    statements.append(current.strip())
+                current = ""
+            elif char == '\n':
+                current += ' '
+            else:
+                current += char
+            
+            i += 1
+        
+        if current.strip():
+            statements.append(current.strip())
+        
+        return statements
+    
+    def _is_control_statement(self, stmt: str) -> Tuple[bool, str]:
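+        """Check whether a statement starts with a control-flow keyword."""
+        stmt_lower = stmt.strip().lower()
+        
+        for keyword in self.CONTROL_KEYWORDS:
+            # Match on a word boundary so that 'break;', 'return;' and 'else{' are
+            # recognised, while identifiers such as 'format' still do not match 'for'
+            if re.match(re.escape(keyword) + r'\b', stmt_lower):
+                return True, keyword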
+        
+        return False, ""
+    
+    def _extract_function_name(self, func_code: str) -> str:
+        """Extract the function name from the function's source code."""
+        code = self._remove_comments(func_code)
+        
+        patterns = [
+            # C++ member functions
+            r'(?:[\w\s\*&<>,]+?)\s+(\w+::~?\w+)\s*\([^)]*\)\s*(?:const)?\s*(?:override)?\s*(?:final)?\s*\{',
+            r'^[\s]*(\w+::~?\w+)\s*\([^)]*\)\s*\{',
+            # Plain C functions
+            r'(?:[\w\s\*&<>,]+?)\s+(\w+)\s*\([^)]*\)\s*\{',
+            # Fallback pattern
+            r'^\s*(?:static\s+)?(?:inline\s+)?(?:virtual\s+)?(?:[\w\*&<>,\s]+)\s+(\w+)\s*\(',
+        ]
+        
+        for pattern in patterns:
+            match = re.search(pattern, code, re.MULTILINE)
+            if match:
+                func_name = match.group(1)
+                if '::' in func_name:
+                    func_name = func_name.split('::')[-1]
+                return func_name
+        
+        return "unknown"
+    
+    def analyze_function(self, func_code: str, func_name: str = None) -> ControlFlowGraph:
+        """
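+        Analyze a function's code and build its control-flow graph.
+        
+        Args:
+            func_code: Source code of the function
+            func_name: Function name (optional; extracted automatically when omitted)
+            
+        Returns:
+            A ControlFlowGraph object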
+        """
+        self._reset()
+        
+        # Extract the function name automatically
+        if func_name is None:
+            func_name = self._extract_function_name(func_code)
+        
+        cfg = ControlFlowGraph(function_name=func_name)
+        
+        # Preprocess the code
+        code = self._remove_comments(func_code)
+        body = self._extract_function_body(code)
+        
+        if not body:
+            # Empty function
+            entry = BasicBlock(
+                id=self._new_block_id(),
+                name="entry",
+                statements=["// empty function"],
+                is_entry=True,
+                is_exit=True
+            )
+            cfg.add_block(entry)
+            return cfg
+        
+        # Split the body into statements
+        statements = self._tokenize_statements(body)
+        
+        if not statements:
+            entry = BasicBlock(
+                id=self._new_block_id(),
+                name="entry",
+                statements=["// empty function"],
+                is_entry=True,
+                is_exit=True
+            )
+            cfg.add_block(entry)
+            return cfg
+        
+        # Simple analysis: group the statements into basic blocks
+        blocks = self._build_basic_blocks(statements)
+        
+        # Add the blocks to the CFG
+        for i, block in enumerate(blocks):
+            block.is_entry = (i == 0)
+            # Mark the block as an exit if it ends with a return
+            if block.statements:
+                last_stmt = block.statements[-1].strip().lower()
+                if last_stmt.startswith('return'):
+                    block.is_exit = True
+            cfg.add_block(block)
+        
+        # If the last block is not an exit block yet, mark it as one
+        if blocks and not blocks[-1].is_exit:
+            blocks[-1].is_exit = True
+            cfg.exit_block_ids.append(blocks[-1].id)
+        
+        # Build the edges
+        self._build_edges(cfg, blocks)
+        
+        return cfg
+    
+    def _build_basic_blocks(self, statements: List[str]) -> List[BasicBlock]:
+        """Build the list of basic blocks."""
+        blocks = []
+        current_statements = []
+        
+        for stmt in statements:
+            is_control, keyword = self._is_control_statement(stmt)
+            
+            if is_control:
+                # Statements preceding a control statement form one block
+                if current_statements:
+                    block = BasicBlock(
+                        id=self._new_block_id(),
+                        name=f"bb_{self.block_counter}",
+                        statements=current_statements.copy()
+                    )
+                    blocks.append(block)
+                    current_statements = []
+                
+                # The control statement itself forms its own block
+                block = BasicBlock(
+                    id=self._new_block_id(),
+                    name=f"bb_{self.block_counter}_{keyword}",
+                    statements=[stmt]
+                )
+                blocks.append(block)
+            else:
+                current_statements.append(stmt)
+        
+        # Flush the remaining statements
+        if current_statements:
+            block = BasicBlock(
+                id=self._new_block_id(),
+                name=f"bb_{self.block_counter}",
+                statements=current_statements
+            )
+            blocks.append(block)
+        
+        return blocks
+    
+    def _build_edges(self, cfg: ControlFlowGraph, blocks: List[BasicBlock]) -> None:
+        """Build the control-flow edges (simplified, position-based heuristics)."""
+        for i, block in enumerate(blocks):
+            if not block.statements:
+                continue
+            
+            last_stmt = block.statements[-1].strip().lower()
+            
+            # A return statement has no successor
+            if last_stmt.startswith('return'):
+                continue
+            
+            # break/continue would need real jump targets (simplified: fall through to the next block)
+            if last_stmt.startswith('break') or last_stmt.startswith('continue'):
+                # Simplification: connect to the next block
+                if i + 1 < len(blocks):
+                    cfg.add_edge(block.id, blocks[i + 1].id)
+                continue
+            
+            # goto statement (simplified: the label is not resolved)
+            if last_stmt.startswith('goto'):
+                if i + 1 < len(blocks):
+                    cfg.add_edge(block.id, blocks[i + 1].id)
+                continue
+            
+            # Conditional statement: may have two branches
+            is_control, keyword = self._is_control_statement(block.statements[-1])
+            if is_control and keyword in ('if', 'while', 'for', 'switch'):
+                # Connect to the next block (true branch)
+                if i + 1 < len(blocks):
+                    cfg.add_edge(block.id, blocks[i + 1].id)
+                # Look for the else branch or the block after the loop
+                # Simplification: if a block exists two positions ahead, connect to it as well
+                if i + 2 < len(blocks):
+                    cfg.add_edge(block.id, blocks[i + 2].id)
+            else:
+                # Ordinary statement: sequential execution
+                if i + 1 < len(blocks):
+                    cfg.add_edge(block.id, blocks[i + 1].id)
+
+
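+def analyze_code_cfg(func_code: str, func_name: str = None) -> ControlFlowGraph:
+    """
+    Analyze the control-flow graph of a function.
+    
+    Args:
+        func_code: Source code of the function
+        func_name: Function name (optional; extracted from the code when omitted)
+        
+    Returns:
+        A ControlFlowGraph object
+    """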
+    analyzer = CFGAnalyzer()
+    return analyzer.analyze_function(func_code, func_name)
+
+
+def visualize_cfg(cfg: ControlFlowGraph, output_file: str = None) -> str:
+    """
+    Visualize a control-flow graph (returns DOT source).
+    
+    Args:
+        cfg: Control-flow graph
+        output_file: Output file path (optional)
+        
+    Returns:
+        The graph as a DOT-format string
+    """
+    lines = [f'digraph "{cfg.function_name}" {{']
+    lines.append('  node [shape=box];')
+    
+    for block_id, block in cfg.blocks.items():
+        # Node label
+        label = f"{block.name}\\n"
+        for stmt in block.statements[:3]:  # Show at most the first 3 statements
+            # Truncate before escaping so the cut cannot split an escape sequence
+            if len(stmt) > 40:
+                stmt = stmt[:37] + "..."
+            stmt_escaped = stmt.replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')
+            label += stmt_escaped + "\\n"
+        
+        # Node style
+        style = ""
+        if block.is_entry:
+            style = ', style=filled, fillcolor=lightgreen'
+        elif block.is_exit:
+            style = ', style=filled, fillcolor=lightcoral'
+        
+        lines.append(f'  {block_id} [label="{label}"{style}];')
+    
+    # Edges
+    for from_id, to_id in cfg.edges:
+        lines.append(f'  {from_id} -> {to_id};')
+    
+    lines.append('}')
+    
+    dot_str = '\n'.join(lines)
+    
+    if output_file:
+        with open(output_file, 'w', encoding='utf-8') as f:
+            f.write(dot_str)
+    
+    return dot_str
+
+
+if __name__ == "__main__":
+    # Smoke test
+    test_code = """
+    int factorial(int n) {
+        if (n <= 1) {
+            return 1;
+        }
+        int result = 1;
+        for (int i = 2; i <= n; i++) {
+            result *= i;
+        }
+        return result;
+    }
+    """
+    
+    cfg = analyze_code_cfg(test_code, "factorial")
+    print(f"Function: {cfg.function_name}")
+    print(f"Blocks: {len(cfg.blocks)}")
+    print(f"Edges: {len(cfg.edges)}")
+    print("\nDOT representation:")
+    print(visualize_cfg(cfg))
+
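
The keyword detection used by `_is_control_statement` above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical standalone helper `leading_keyword` that is not part of `cfg_analyzer.py`; it shows why matching on a word boundary treats `break;` and `return;` as control statements while leaving identifiers such as `double` alone.

```python
import re

CONTROL_KEYWORDS = ("if", "else", "while", "for", "do", "switch", "case",
                    "default", "break", "continue", "return", "goto")

# Hypothetical helper for illustration only; the name is an assumption.
_KEYWORD_RE = re.compile(r"\s*(" + "|".join(CONTROL_KEYWORDS) + r")\b")


def leading_keyword(stmt: str) -> str:
    """Return the control keyword that starts `stmt`, or "" if there is none."""
    match = _KEYWORD_RE.match(stmt)
    return match.group(1) if match else ""


for sample in ["break;", "return;", "double x = 1.0;", "for (;;) { }"]:
    print(repr(sample), "->", repr(leading_keyword(sample)))
```

The `\b` anchor requires the keyword to end at a non-identifier character, so `do` cannot match the start of `double`, and `for` cannot match the start of `format`.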

+ 348 - 0
src/code_fusion.py

@@ -0,0 +1,348 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Code fusion module.
+
+Implements the logic for fusing code slices into the functions of a call chain.
+"""
+
+import json
+import re
+from typing import List, Dict, Set, Optional, Tuple
+from dataclasses import dataclass, field
+
+from cfg_analyzer import ControlFlowGraph, analyze_code_cfg, BasicBlock
+from dominator_analyzer import DominatorAnalyzer, get_fusion_points
+from llm_splitter import LLMCodeSplitter, SliceResult, CodeSlice
+
+
+@dataclass
+class FunctionInfo:
+    """Function information."""
+    name: str
+    code: str
+    cfg: Optional[ControlFlowGraph] = None
+    fusion_points: List[int] = field(default_factory=list)
+    idx: Optional[int] = None  # Index in the original data
+    
+    def analyze(self):
+        """Analyze the function's CFG and fusion points."""
+        if self.cfg is None:
+            self.cfg = analyze_code_cfg(self.code, self.name)
+            self.fusion_points = get_fusion_points(self.cfg)
+
+
+@dataclass
+class CallChain:
+    """Call chain."""
+    functions: List[FunctionInfo]
+    depth: int
+    call_path: List[str]  # Call path as a list of function names
+    
+    @property
+    def function_names(self) -> List[str]:
+        return [f.name for f in self.functions]
+    
+    def get_total_fusion_points(self) -> int:
+        """Return the total number of fusion points."""
+        return sum(len(f.fusion_points) for f in self.functions)
+
+
+@dataclass
+class FusionPlan:
+    """Fusion plan."""
+    target_code: str
+    call_chain: CallChain
+    slice_result: SliceResult
+    insertion_points: List[Tuple[str, int, str]]  # [(function name, block ID, code slice)]
+
+
+class CodeFusionEngine:
+    """Code fusion engine."""
+    
+    def __init__(self, splitter: LLMCodeSplitter = None):
+        """
+        Initialize the fusion engine.
+        
+        Args:
+            splitter: LLM code splitter
+        """
+        self.splitter = splitter or LLMCodeSplitter()
+    
+    def extract_function_name(self, func_code: str) -> str:
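+        """Extract the function name."""
+        # Strip comments (including a '//' comment on a final line with no trailing newline)
+        code = re.sub(r'//[^\n]*', '', func_code)
+        code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
+        
+        # Match the function definition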
+        patterns = [
+            r'(?:[\w\s\*&<>,]+?)\s+(\w+::~?\w+)\s*\([^)]*\)\s*(?:const)?\s*(?:override)?\s*(?:final)?\s*\{',
+            r'^[\s]*(\w+::~?\w+)\s*\([^)]*\)\s*\{',
+            r'(?:[\w\s\*&<>,]+?)\s+(\w+)\s*\([^)]*\)\s*\{',
+            r'^\s*(?:static\s+)?(?:inline\s+)?(?:virtual\s+)?(?:[\w\*&<>,\s]+)\s+(\w+)\s*\(',
+        ]
+        
+        for pattern in patterns:
+            match = re.search(pattern, code, re.MULTILINE)
+            if match:
+                func_name = match.group(1)
+                if '::' in func_name:
+                    func_name = func_name.split('::')[-1]
+                return func_name
+        
+        return "unknown"
+    
+    def build_call_chain(self, functions: List[Dict], call_path: List[str]) -> CallChain:
+        """
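+        Build a call chain.
+        
+        Args:
+            functions: List of functions (each entry carries a 'func' field)
+            call_path: Call path (list of function names)
+            
+        Returns:
+            A CallChain object
+        """
+        # Map function names to their FunctionInfo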
+        func_map = {}
+        for func_data in functions:
+            code = func_data.get('func', '')
+            name = self.extract_function_name(code)
+            func_info = FunctionInfo(
+                name=name,
+                code=code,
+                idx=func_data.get('idx')
+            )
+            func_map[name] = func_info
+        
+        # Order the functions along the call path
+        ordered_functions = []
+        for name in call_path:
+            if name in func_map:
+                func_info = func_map[name]
+                func_info.analyze()
+                ordered_functions.append(func_info)
+        
+        return CallChain(
+            functions=ordered_functions,
+            depth=len(call_path),
+            call_path=call_path
+        )
+    
+    def create_fusion_plan(
+        self,
+        target_code: str,
+        call_chain: CallChain,
+        passing_method: str = "global"
+    ) -> FusionPlan:
+        """
+        Create a fusion plan.
+        
+        Args:
+            target_code: Target code to fuse
+            call_chain: Call chain
+            passing_method: Variable passing method, "global" or "parameter"
+            
+        Returns:
+            A FusionPlan object
+        """
+        # Split the code with the LLM
+        n_parts = len(call_chain.functions)
+        slice_result = self.splitter.split_code(
+            target_code,
+            n_parts,
+            call_chain.function_names,
+            passing_method
+        )
+        
+        # Determine the insertion points
+        insertion_points = []
+        for i, (func, code_slice) in enumerate(zip(call_chain.functions, slice_result.slices)):
+            if func.fusion_points:
+                # Pick the first fusion point
+                block_id = func.fusion_points[0]
+            else:
+                # No fusion point: fall back to the entry block
+                block_id = func.cfg.entry_block_id if func.cfg else 0
+            
+            insertion_points.append((func.name, block_id, code_slice.code))
+        
+        return FusionPlan(
+            target_code=target_code,
+            call_chain=call_chain,
+            slice_result=slice_result,
+            insertion_points=insertion_points
+        )
+    
+    def execute_fusion(self, plan: FusionPlan) -> Dict[str, str]:
+        """
+        Execute the fusion.
+        
+        Args:
+            plan: Fusion plan
+            
+        Returns:
+            Dictionary of fused function code, {function name: code}
+        """
+        fused_code = {}
+        
+        for func, (func_name, block_id, insert_code) in zip(
+            plan.call_chain.functions, 
+            plan.insertion_points
+        ):
+            if not insert_code.strip() or insert_code.strip() == "// empty slice":
+                fused_code[func_name] = func.code
+                continue
+            
+            # Insert the code into the function
+            fused = self._insert_code_into_function(func, block_id, insert_code)
+            fused_code[func_name] = fused
+        
+        return fused_code
+    
+    def _insert_code_into_function(
+        self, 
+        func: FunctionInfo, 
+        block_id: int, 
+        insert_code: str
+    ) -> str:
+        """
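+        Insert code at the given position in a function.
+        
+        Args:
+            func: Function information
+            block_id: ID of the target basic block
+            insert_code: Code to insert
+            
+        Returns:
+            The function code with the snippet inserted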
+        """
+        code = func.code
+        
+        # Find the start of the function body
+        brace_pos = code.find('{')
+        if brace_pos == -1:
+            return code
+        
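+        # Insert at the start of the body for the entry block or the first fusion point (also when no CFG is available)
+        if (func.cfg is None or block_id == func.cfg.entry_block_id) or (func.fusion_points and block_id == func.fusion_points[0]):
+            # Format the code to insert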
+            insert_lines = insert_code.strip().split('\n')
+            formatted_insert = '\n    '.join(insert_lines)
+            
+            return (
+                code[:brace_pos + 1] + 
+                f"\n    /* === Fused Code Start === */\n    {formatted_insert}\n    /* === Fused Code End === */\n" +
+                code[brace_pos + 1:]
+            )
+        
+        # Otherwise the matching basic-block position should be located;
+        # simplified here to inserting in the middle of the function
+        return self._insert_at_middle(code, insert_code)
+    
+    def _insert_at_middle(self, func_code: str, insert_code: str) -> str:
+        """
+        Insert code in the middle of a function.
+        """
+        # Locate the function body
+        brace_start = func_code.find('{')
+        brace_end = func_code.rfind('}')
+        
+        if brace_start == -1 or brace_end == -1:
+            return func_code
+        
+        body = func_code[brace_start + 1:brace_end]
+        lines = body.split('\n')
+        
+        # Insert at the midpoint
+        mid = len(lines) // 2
+        
+        insert_lines = insert_code.strip().split('\n')
+        formatted_insert = '\n    '.join(insert_lines)
+        
+        lines.insert(mid, f"    /* === Fused Code Start === */")
+        lines.insert(mid + 1, f"    {formatted_insert}")
+        lines.insert(mid + 2, f"    /* === Fused Code End === */")
+        
+        return func_code[:brace_start + 1] + '\n'.join(lines) + func_code[brace_end:]
+
+
+def analyze_call_chain_group(group: Dict) -> Dict:
+    """
+    Analyze one call-chain group.
+    
+    Args:
+        group: Dictionary with the keys functions, call_depth and longest_call_chain
+        
+    Returns:
+        Dictionary of analysis results
+    """
+    functions = group.get('functions', [])
+    call_depth = group.get('call_depth', 0)
+    call_chain = group.get('longest_call_chain', [])
+    
+    # Analyze each function
+    analyzed_functions = []
+    for func_data in functions:
+        code = func_data.get('func', '')
+        cfg = analyze_code_cfg(code)
+        fusion_points = get_fusion_points(cfg)
+        
+        analyzed_functions.append({
+            'idx': func_data.get('idx'),
+            'name': cfg.function_name,
+            'blocks_count': len(cfg.blocks),
+            'fusion_points_count': len(fusion_points),
+            'fusion_points': fusion_points,
+        })
+    
+    return {
+        'call_depth': call_depth,
+        'call_chain': call_chain,
+        'functions_count': len(functions),
+        'analyzed_functions': analyzed_functions,
+        'total_fusion_points': sum(f['fusion_points_count'] for f in analyzed_functions)
+    }
+
+
+if __name__ == "__main__":
+    # Smoke test
+    test_func1 = """
+    void outer_func() {
+        printf("Start\\n");
+        middle_func();
+        printf("End\\n");
+    }
+    """
+    
+    test_func2 = """
+    void middle_func() {
+        int x = 10;
+        inner_func();
+        x += 5;
+    }
+    """
+    
+    test_func3 = """
+    void inner_func() {
+        printf("Inner\\n");
+    }
+    """
+    
+    functions = [
+        {'func': test_func1, 'idx': 1},
+        {'func': test_func2, 'idx': 2},
+        {'func': test_func3, 'idx': 3},
+    ]
+    
+    engine = CodeFusionEngine()
+    call_chain = engine.build_call_chain(
+        functions,
+        ['outer_func', 'middle_func', 'inner_func']
+    )
+    
+    print(f"Call chain depth: {call_chain.depth}")
+    print(f"Functions: {call_chain.function_names}")
+    print(f"Total fusion points: {call_chain.get_total_fusion_points()}")
+
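
The entry-point insertion performed by `_insert_code_into_function` above comes down to splicing a snippet right after the first opening brace. The following is a minimal sketch of that step under simplifying assumptions: `insert_at_body_start` is a hypothetical helper, not part of `code_fusion.py`, and it omits the `Fused Code Start/End` marker comments.

```python
def insert_at_body_start(func_code: str, snippet: str, indent: str = "    ") -> str:
    """Splice `snippet` right after the first '{' of `func_code`.

    Hypothetical helper for illustration; returns the code unchanged
    when no opening brace is found.
    """
    brace_pos = func_code.find("{")
    if brace_pos == -1:
        return func_code
    # Indent every line of the snippet so it lines up with the function body
    body_lines = [indent + line for line in snippet.strip().splitlines()]
    inserted = "\n" + "\n".join(body_lines)
    return func_code[:brace_pos + 1] + inserted + func_code[brace_pos + 1:]


original = "void f() {\n    work();\n}"
fused = insert_at_body_start(original, "g_secret = 42;")
print(fused)
```

Because the slice runs before any original statement of the callee, it also runs before the call to the next function in the chain, which is what lets later slices read the state written by earlier ones.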

+ 285 - 0
src/dominator_analyzer.py

@@ -0,0 +1,285 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
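+Dominator (critical point) analyzer.
+
+Finds the critical points of a control-flow graph: the nodes that every path from the entry to an exit must pass through.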
+"""
+
+from typing import Dict, List, Set, Optional
+from dataclasses import dataclass
+import networkx as nx
+
+from cfg_analyzer import ControlFlowGraph, BasicBlock
+
+
+@dataclass 
+class DominatorInfo:
+    """Dominator information."""
+    dominators: Dict[int, Set[int]]  # Dominator set of each node
+    immediate_dominators: Dict[int, Optional[int]]  # Immediate dominator of each node
+    dominator_tree: Dict[int, List[int]]  # Dominator tree
+    critical_points: Set[int]  # Critical points (on every entry-to-exit path)
+
+
+class DominatorAnalyzer:
+    """Dominator (critical point) analyzer."""
+    
+    def __init__(self, cfg: ControlFlowGraph):
+        self.cfg = cfg
+        self.graph = cfg.to_networkx()
+    
+    def compute_dominators(self) -> Dict[int, Set[int]]:
+        """
+        Compute the dominator set of every node.
+        
+        Uses the iterative data-flow formulation:
+        Dom(entry) = {entry}
+        Dom(n) = {n} ∪ (∩ Dom(p) for p in predecessors(n))
+        """
+        if not self.cfg.blocks:
+            return {}
+        
+        all_nodes = set(self.cfg.blocks.keys())
+        entry = self.cfg.entry_block_id
+        
+        if entry is None:
+            return {}
+        
+        # Initialization
+        dominators = {node: all_nodes.copy() for node in all_nodes}
+        dominators[entry] = {entry}
+        
+        # Iterate until a fixed point is reached
+        changed = True
+        while changed:
+            changed = False
+            for node in all_nodes:
+                if node == entry:
+                    continue
+                
+                preds = self.cfg.get_predecessors(node)
+                if not preds:
+                    new_dom = {node}
+                else:
+                    # Intersect the dominator sets of all predecessors
+                    new_dom = all_nodes.copy()
+                    for pred in preds:
+                        new_dom &= dominators[pred]
+                    new_dom.add(node)
+                
+                if new_dom != dominators[node]:
+                    dominators[node] = new_dom
+                    changed = True
+        
+        return dominators
+    
+    def compute_immediate_dominators(self, dominators: Dict[int, Set[int]]) -> Dict[int, Optional[int]]:
+        """
+        Compute immediate dominators.
+        
+        The immediate dominator of node n is the strict dominator closest to n.
+        """
+        idoms = {}
+        
+        for node, doms in dominators.items():
+            # Strict dominators (excluding the node itself)
+            strict_doms = doms - {node}
+            
+            if not strict_doms:
+                idoms[node] = None
+                continue
+            
+            # Find the closest dominator,
+            # i.e. the one that does not dominate any other strict dominator
+            idom = None
+            for candidate in strict_doms:
+                is_idom = True
+                for other in strict_doms:
+                    if other != candidate and candidate in dominators.get(other, set()):
+                        # candidate dominates other, so candidate is not the immediate dominator
+                        is_idom = False
+                        break
+                if is_idom:
+                    idom = candidate
+                    break
+            
+            idoms[node] = idom
+        
+        return idoms
+    
+    def build_dominator_tree(self, idoms: Dict[int, Optional[int]]) -> Dict[int, List[int]]:
+        """
+        Build the dominator tree.
+        """
+        tree = {node: [] for node in self.cfg.blocks}
+        
+        for node, idom in idoms.items():
+            if idom is not None:
+                tree[idom].append(node)
+        
+        return tree
+    
+    def find_critical_points(self) -> Set[int]:
+        """
+        Find the critical points.
+        
+        A critical point is a block that every path from the entry block to any exit block must pass through.
+        """
+        if self.cfg.entry_block_id is None or not self.cfg.exit_block_ids:
+            return set()
+        
+        entry = self.cfg.entry_block_id
+        exits = set(self.cfg.exit_block_ids)
+        
+        # Use path analysis to find the critical points
+        critical_points = set()
+        all_nodes = set(self.cfg.blocks.keys())
+        
+        for node in all_nodes:
+            # The entry block lies on every path by definition
+            if node == entry:
+                critical_points.add(node)
+                continue
+            
+            if node in exits and len(exits) == 1:  # A sole exit ends every path
+                critical_points.add(node)
+                continue
+            
+            # Otherwise (including one of several exits) test whether the node can be bypassed
+            is_critical = self._check_critical_point(node, entry, exits)
+            if is_critical:
+                critical_points.add(node)
+        
+        return critical_points
+    
+    def _check_critical_point(self, node: int, entry: int, exits: Set[int]) -> bool:
+        """
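+        Check whether a node is a critical point.
+        
+        A node is critical if, once it is removed, no exit can be reached from the entry.
+        """
+        # Work on the graph without this node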
+        remaining_nodes = set(self.cfg.blocks.keys()) - {node}
+        
+        if entry not in remaining_nodes:
+            return True
+        
+        # BFS reachability check
+        visited = set()
+        queue = [entry]
+        
+        while queue:
+            current = queue.pop(0)
+            if current in visited:
+                continue
+            visited.add(current)
+            
+            # Check whether an exit has been reached
+            if current in exits:
+                return False  # An exit is reachable while bypassing the node
+            
+            for succ in self.cfg.get_successors(current):
+                if succ not in visited and succ in remaining_nodes:
+                    queue.append(succ)
+        
+        return True  # No exit is reachable without passing through the node
+    
+    def find_fusion_points(self) -> List[int]:
+        """
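+        Find the points suitable for code fusion.
+        
+        A fusion point must:
+        1. be a critical point
+        2. have at most one predecessor
+        3. have at most one successor
+        4. not be a conditional branch (not checked separately; the code relies on condition 3)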
+        """
+        critical_points = self.find_critical_points()
+        fusion_points = []
+        
+        for point in critical_points:
+            preds = self.cfg.get_predecessors(point)
+            succs = self.cfg.get_successors(point)
+            
+            # Check the predecessor and successor counts
+            if len(preds) <= 1 and len(succs) <= 1:
+                fusion_points.append(point)
+        
+        return sorted(fusion_points)
+    
+    def analyze(self) -> DominatorInfo:
+        """
+        Run the full dominator analysis.
+        """
+        dominators = self.compute_dominators()
+        idoms = self.compute_immediate_dominators(dominators)
+        dom_tree = self.build_dominator_tree(idoms)
+        critical_points = self.find_critical_points()
+        
+        return DominatorInfo(
+            dominators=dominators,
+            immediate_dominators=idoms,
+            dominator_tree=dom_tree,
+            critical_points=critical_points
+        )
+
+
+def analyze_dominators(cfg: ControlFlowGraph) -> DominatorInfo:
+    """
+    Analyze the dominators of a control-flow graph.
+    
+    Args:
+        cfg: Control-flow graph
+        
+    Returns:
+        A DominatorInfo object
+    """
+    analyzer = DominatorAnalyzer(cfg)
+    return analyzer.analyze()
+
+
+def get_fusion_points(cfg: ControlFlowGraph) -> List[int]:
+    """
+    Return the points suitable for code fusion.
+    
+    Args:
+        cfg: Control-flow graph
+        
+    Returns:
+        List of fusion-point block IDs
+    """
+    analyzer = DominatorAnalyzer(cfg)
+    return analyzer.find_fusion_points()
+
+
+if __name__ == "__main__":
+    from cfg_analyzer import analyze_code_cfg
+    
+    # Smoke test
+    test_code = """
+    int test_function(int x) {
+        int result = 0;
+        if (x > 0) {
+            result = x * 2;
+        } else {
+            result = x * -1;
+        }
+        result += 10;
+        return result;
+    }
+    """
+    
+    cfg = analyze_code_cfg(test_code, "test_function")
+    dom_info = analyze_dominators(cfg)
+    
+    print(f"Function: {cfg.function_name}")
+    print(f"Blocks: {len(cfg.blocks)}")
+    print(f"Critical Points: {dom_info.critical_points}")
+    print(f"Fusion Points: {get_fusion_points(cfg)}")
+    
+    print("\nDominators:")
+    for node, doms in dom_info.dominators.items():
+        block_name = cfg.blocks[node].name
+        print(f"  {block_name}: {doms}")
+
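
The data-flow equations quoted in `compute_dominators` above can be exercised on a small graph without the rest of the project. The following is a minimal sketch, assuming the CFG is given as a plain successor dictionary in which every node appears as a key; the standalone function and the `diamond` example are illustrative and are not part of `dominator_analyzer.py`.

```python
from typing import Dict, List, Set


def compute_dominators(succs: Dict[int, List[int]], entry: int) -> Dict[int, Set[int]]:
    """Iterate Dom(n) = {n} | intersection of Dom(p) over predecessors p, to a fixed point."""
    nodes = set(succs)
    preds: Dict[int, List[int]] = {n: [] for n in nodes}
    for src, targets in succs.items():
        for dst in targets:
            preds[dst].append(src)

    # Start from the full node set everywhere except the entry
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}

    changed = True
    while changed:
        changed = False
        for n in sorted(nodes - {entry}):
            incoming = [dom[p] for p in preds[n]]
            new_dom = set.intersection(*incoming) if incoming else set()
            new_dom = new_dom | {n}
            if new_dom != dom[n]:
                dom[n] = new_dom
                changed = True
    return dom


# Diamond-shaped CFG: 1 branches to 2 and 3, which both join at 4
diamond = {1: [2, 3], 2: [4], 3: [4], 4: []}
dom = compute_dominators(diamond, entry=1)
for node in sorted(dom):
    print(node, sorted(dom[node]))
```

In the diamond, neither branch block dominates the join block, since each can be bypassed through the other; only the entry and the join itself remain in the join's dominator set.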

+ 652 - 0
src/llm_splitter.py

@@ -0,0 +1,652 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
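+LLM code splitter.
+
+Calls a large language model to split a piece of code into several slices that can be inserted into the functions of a call chain.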
+"""
+
+import os
+import json
+import re
+from typing import List, Dict, Optional, Tuple
+from dataclasses import dataclass
+
+from openai import OpenAI
+
+
+@dataclass
+class CodeSlice:
+    """Code slice."""
+    index: int
+    code: str
+    description: str
+    dependencies: List[str]  # Variables/state this slice depends on
+    outputs: List[str]  # Variables/state this slice produces
+
+
+@dataclass
+class SliceResult:
+    """Split result."""
+    original_code: str
+    slices: List[CodeSlice]
+    shared_state: Dict[str, str]  # Shared state: variable name -> type
+    global_declarations: str  # Global variable declarations
+    setup_code: str  # Initialization code
+    cleanup_code: str  # Cleanup code
+    passing_method: str = "global"  # Variable passing method: "global" or "parameter"
+    parameter_struct: str = ""  # Struct definition used for parameter passing
+
+
+class LLMCodeSplitter:
+    """LLM code splitter."""
+    
+    # Variable passing methods
+    METHOD_GLOBAL = "global"      # Pass state through global variables
+    METHOD_PARAMETER = "parameter"  # Pass state through parameters
+    
+    def __init__(self, api_key: str = None, base_url: str = None, model: str = None):
+        """
+        Initialize the LLM splitter.
+        
+        Args:
+            api_key: API key (read from the DASHSCOPE_API_KEY environment variable by default)
+            base_url: API base URL
+            model: Model name
+        """
+        self.api_key = api_key or os.getenv("DASHSCOPE_API_KEY")
+        self.base_url = base_url or "https://dashscope.aliyuncs.com/compatible-mode/v1"
+        self.model = model or "qwen-plus"  # Options: qwen-plus, qwen-turbo, qwen-max
+        
+        if not self.api_key:
+            raise ValueError("API key not found. Please set DASHSCOPE_API_KEY environment variable.")
+        
+        self.client = OpenAI(
+            api_key=self.api_key,
+            base_url=self.base_url
+        )
+    
+    def _create_split_prompt(self, code: str, n_parts: int, function_names: List[str]) -> str:
+        """
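+        Build the prompt for splitting code (global-variable method).
+        
+        Args:
+            code: Code to split
+            n_parts: Number of slices to produce
+            function_names: Names of the functions in the call chain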
+        """
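+        prompt = f"""You are a code analysis expert. Split the following code into {n_parts} interdependent slices.
+
+These slices will be inserted into the {n_parts} functions of a call chain:
+Call chain: {' -> '.join(function_names)}
+
+[IMPORTANT] Each slice runs in a different function, so local variables cannot be passed along directly!
+You must:
+1. Declare every variable shared across functions as a global variable (listed in shared_state)
+2. Let the first slice initialize the global variables
+3. Let the following slices use these global variables
+4. Let the last slice perform the final operation
+
+Requirements:
+1. Each slice should be a semantically complete block of code
+2. Slices pass state through [global variables] only and must not rely on local variables
+3. In call order, the first slice runs in the outermost function of the chain and the last slice in the innermost one
+4. Running all slices in order must have the same effect as the original code
+5. shared_state declares every variable that is shared across functions
+
+Original code: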
+```c
+{code}
+```
+
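+Return the result in the following JSON format:
+```json
+{{
+    "shared_state": {{
+        "variable name": "type (e.g. int, char*, etc.)"
+    }},
+    "global_declarations": "global variable declarations, e.g. static int g_secret; static int g_key;",
+    "slices": [
+        {{
+            "index": 0,
+            "function": "function name",
+            "code": "code slice (using global variables, e.g. g_secret = 42;)",
+            "description": "what this slice does",
+            "dependencies": ["global variables it depends on"],
+            "outputs": ["global variables it writes or modifies"]
+        }}
+    ],
+    "cleanup_code": "cleanup code (e.g. freeing memory, resetting global variables)"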
+}}
+```
+
+Example: if the original code is `int secret = 42; int key = secret ^ 0xABCD; printf("key=%d", key);`
+splitting it into 3 slices should give:
+- shared_state: {{"g_secret": "int", "g_key": "int"}}
+- global_declarations: "static int g_secret; static int g_key;"
+- Slice 1: "g_secret = 42;"
+- Slice 2: "g_key = g_secret ^ 0xABCD;"
+- Slice 3: "printf(\\"key=%d\\", g_key);"
+
+Return only the JSON, with nothing else.
+"""
+        return prompt
+    
+    def _create_parameter_split_prompt(self, code: str, n_parts: int, function_names: List[str]) -> str:
+        """
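+        Build the prompt for splitting code using parameter passing.
+        """
+        prompt = f"""You are a code analysis expert. Split the following code into {n_parts} interdependent slices.
+
+These slices will be inserted into the {n_parts} functions of a call chain:
+Call chain: {' -> '.join(function_names)}
+
+[IMPORTANT] Use parameter passing!
+You need to:
+1. Define a struct that holds the shared state
+2. Add a pointer to that struct as a parameter of every function
+3. Have every slice read and modify the shared state through that struct pointer
+
+Requirements:
+1. Define a struct `FusionState` containing every variable that must be shared
+2. Add the parameter `FusionState* fusion_state` to every function
+3. Access variables in the slices as `fusion_state->variable_name`
+4. Pass the `fusion_state` pointer along when calling the next function in the chain
+
+Original code: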
+```c
+{code}
+```
+
+Return the result in the following JSON format:
+```json
+{{
+    "shared_state": {{
+        "variable_name": "type"
+    }},
+    "parameter_struct": "typedef struct {{ int secret; int key; }} FusionState;",
+    "slices": [
+        {{
+            "index": 0,
+            "function": "function name",
+            "code": "code slice (e.g. fusion_state->secret = 42;)",
+            "description": "description",
+            "dependencies": ["variables this slice reads"],
+            "outputs": ["variables this slice writes"]
+        }}
+    ],
+    "init_code": "FusionState fusion_state_data; memset(&fusion_state_data, 0, sizeof(fusion_state_data)); FusionState* fusion_state = &fusion_state_data;"
+}}
+```
+
+Example: if the original code is `int secret = 42; int key = secret ^ 0xABCD; printf("key=%d", key);`
+- parameter_struct: "typedef struct {{ int secret; int key; }} FusionState;"
+- Slice 1: "fusion_state->secret = 42;"
+- Slice 2: "fusion_state->key = fusion_state->secret ^ 0xABCD;"
+- Slice 3: "printf(\\"key=%d\\", fusion_state->key);"
+
+Return only the JSON, with nothing else.
+"""
+        return prompt
+    
+    def _parse_llm_response(self, response: str) -> Optional[Dict]:
+        """
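+        Parse the LLM response into a dict, trying progressively looser strategies.
+        """
+        # Try to extract JSON
+        try:
+            # First try parsing the whole response directly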
+            return json.loads(response)
+        except json.JSONDecodeError:
+            pass
+        
+        # Otherwise try to extract it from a markdown code block
+        json_match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', response)
+        if json_match:
+            try:
+                return json.loads(json_match.group(1))
+            except json.JSONDecodeError:
+                pass
+        
+        # Finally look for a bare JSON object anywhere in the text
+        json_match = re.search(r'\{[\s\S]*\}', response)
+        if json_match:
+            try:
+                return json.loads(json_match.group(0))
+            except json.JSONDecodeError:
+                pass
+        
+        return None
+    
+    def split_code(self, code: str, n_parts: int, function_names: List[str], 
+                   method: str = "global") -> SliceResult:
+        """
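+        Split code into multiple slices.
+        
+        Args:
+            code: Code to split
+            n_parts: Number of slices to produce
+            function_names: Names of the functions in the call chain
+            method: How state is shared: "global" (global variables) or "parameter" (parameter passing)
+            
+        Returns:
+            A SliceResult object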
+        """
+        if n_parts <= 0:
+            raise ValueError("n_parts must be positive")
+        
+        if method not in [self.METHOD_GLOBAL, self.METHOD_PARAMETER]:
+            method = self.METHOD_GLOBAL
+        
+        if n_parts == 1:
+            # Nothing to split
+            return SliceResult(
+                original_code=code,
+                slices=[CodeSlice(
+                    index=0,
+                    code=code,
+                    description="Original code",
+                    dependencies=[],
+                    outputs=[]
+                )],
+                shared_state={},
+                global_declarations="",
+                setup_code="",
+                cleanup_code="",
+                passing_method=method,
+                parameter_struct=""
+            )
+        
+        # Pick the prompt that matches the passing method
+        if method == self.METHOD_PARAMETER:
+            prompt = self._create_parameter_split_prompt(code, n_parts, function_names)
+        else:
+            prompt = self._create_split_prompt(code, n_parts, function_names)
+        
+        try:
+            completion = self.client.chat.completions.create(
+                model=self.model,
+                messages=[
+                    {
+                        "role": "system", 
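+                        "content": "You are a professional code analysis and refactoring expert who is skilled at splitting code into multiple interdependent slices. Return only the result in JSON format."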
+                    },
+                    {"role": "user", "content": prompt}
+                ],
+                temperature=0.3,
+            )
+            
+            response_text = completion.choices[0].message.content
+            
+            # Parse the response
+            result_dict = self._parse_llm_response(response_text)
+            
+            if not result_dict:
+                print("Warning: Failed to parse LLM response. Using fallback splitting.")
+                return self._fallback_split(code, n_parts, function_names, method)
+            
+            # Build the result
+            slices = []
+            for slice_data in result_dict.get("slices", []):
+                slices.append(CodeSlice(
+                    index=slice_data.get("index", 0),
+                    code=slice_data.get("code", ""),
+                    description=slice_data.get("description", ""),
+                    dependencies=slice_data.get("dependencies", []),
+                    outputs=slice_data.get("outputs", [])
+                ))
+            
+            return SliceResult(
+                original_code=code,
+                slices=slices,
+                shared_state=result_dict.get("shared_state", {}),
+                global_declarations=result_dict.get("global_declarations", ""),
+                setup_code=result_dict.get("setup_code", result_dict.get("init_code", "")),
+                cleanup_code=result_dict.get("cleanup_code", ""),
+                passing_method=method,
+                parameter_struct=result_dict.get("parameter_struct", "")
+            )
+            
+        except Exception as e:
+            print(f"Warning: LLM call failed: {e}. Using fallback splitting.")
+            return self._fallback_split(code, n_parts, function_names, method)
+    
+    def _fallback_split(self, code: str, n_parts: int, function_names: List[str], 
+                        method: str = "global") -> SliceResult:
+        """
+        Fallback splitting method (simply divides the non-empty lines evenly).
+        """
+        # Split naively by line
+        lines = [line for line in code.strip().split('\n') if line.strip()]
+        
+        if len(lines) < n_parts:
+            # Fewer lines than slices: one line per slice
+            slices = []
+            for i, line in enumerate(lines):
+                slices.append(CodeSlice(
+                    index=i,
+                    code=line,
+                    description=f"Part {i+1}",
+                    dependencies=[],
+                    outputs=[]
+                ))
+            # Pad with empty slices
+            while len(slices) < n_parts:
+                slices.append(CodeSlice(
+                    index=len(slices),
+                    code="// empty slice",
+                    description=f"Part {len(slices)+1} (empty)",
+                    dependencies=[],
+                    outputs=[]
+                ))
+        else:
+            # Split evenly; the last slice takes any remainder
+            chunk_size = len(lines) // n_parts
+            slices = []
+            for i in range(n_parts):
+                start = i * chunk_size
+                end = start + chunk_size if i < n_parts - 1 else len(lines)
+                slice_code = '\n'.join(lines[start:end])
+                slices.append(CodeSlice(
+                    index=i,
+                    code=slice_code,
+                    description=f"Part {i+1}",
+                    dependencies=[],
+                    outputs=[]
+                ))
+        
+        # Generate the state-passing code that matches the method
+        if method == self.METHOD_PARAMETER:
+            param_info = self._generate_fallback_parameters(code)
+            return SliceResult(
+                original_code=code,
+                slices=slices,
+                shared_state=param_info.get("shared_state", {}),
+                global_declarations="",
+                setup_code=param_info.get("init_code", ""),
+                cleanup_code="",
+                passing_method=method,
+                parameter_struct=param_info.get("parameter_struct", "")
+            )
+        else:
+            # Global variable method
+            global_decl = self._generate_fallback_globals(code)
+            return SliceResult(
+                original_code=code,
+                slices=slices,
+                shared_state=global_decl.get("shared_state", {}),
+                global_declarations=global_decl.get("declarations", ""),
+                setup_code="",
+                cleanup_code="",
+                passing_method=method,
+                parameter_struct=""
+            )
+    
+    def _generate_fallback_parameters(self, code: str) -> Dict:
+        """
+        Generate the struct needed for parameter passing in the fallback split.
+        """
+        import re
+        
+        # Match simple variable declarations: type name = value;
+        var_pattern = r'\b(int|char|float|double|long|short|unsigned)\s+(\w+)\s*='
+        matches = re.findall(var_pattern, code)
+        
+        shared_state = {}
+        struct_fields = []
+        
+        for var_type, var_name in matches:
+            shared_state[var_name] = var_type
+            struct_fields.append(f"    {var_type} {var_name};")
+        
+        if struct_fields:
+            parameter_struct = "typedef struct {\n" + "\n".join(struct_fields) + "\n} FusionState;"
+        else:
+            parameter_struct = "typedef struct { int _placeholder; } FusionState;"
+        
+        init_code = "FusionState fusion_state_data; memset(&fusion_state_data, 0, sizeof(fusion_state_data)); FusionState* fusion_state = &fusion_state_data;"
+        
+        return {
+            "shared_state": shared_state,
+            "parameter_struct": parameter_struct,
+            "init_code": init_code
+        }
+    
+    def _generate_fallback_globals(self, code: str) -> Dict:
+        """
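+        Generate global variable declarations for the fallback split.
+        Scans the code for variable declarations and turns them into globals.
+        """
+        import re
+        
+        # Match simple variable declarations: type name = value;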
+        var_pattern = r'\b(int|char|float|double|long|short|unsigned)\s+(\w+)\s*='
+        matches = re.findall(var_pattern, code)
+        
+        shared_state = {}
+        declarations = []
+        
+        for var_type, var_name in matches:
+            global_name = f"g_{var_name}"
+            shared_state[global_name] = var_type
+            declarations.append(f"static {var_type} {global_name};")
+        
+        return {
+            "shared_state": shared_state,
+            "declarations": "\n".join(declarations)
+        }
+
+
+def split_code_for_call_chain(
+    code: str, 
+    call_chain: List[str],
+    api_key: str = None
+) -> SliceResult:
+    """
+    Split code to fit a call chain.
+    
+    Args:
+        code: Code to split
+        call_chain: Call chain (list of function names)
+        api_key: API key (optional)
+        
+    Returns:
+        A SliceResult object
+    """
+    splitter = LLMCodeSplitter(api_key=api_key)
+    n_parts = len(call_chain)
+    return splitter.split_code(code, n_parts, call_chain)
+
+
+class CodeFusionGenerator:
+    """Code fusion generator."""
+    
+    def __init__(self, splitter: LLMCodeSplitter = None):
+        """
+        Initialize the fusion generator.
+        
+        Args:
+            splitter: LLM splitter instance
+        """
+        self.splitter = splitter or LLMCodeSplitter()
+    
+    def _create_fusion_prompt(
+        self, 
+        target_code: str,
+        call_chain_functions: List[Dict],
+        slice_result: SliceResult
+    ) -> str:
+        """
+        Build the prompt for fusing code slices into functions.
+        """
+        functions_desc = "\n".join([
+            f"{i+1}. {f['name']}:\n```c\n{f['code']}\n```"
+            for i, f in enumerate(call_chain_functions)
+        ])
+        
+        slices_desc = "\n".join([
+            f"Slice {s.index + 1} (insert into {call_chain_functions[s.index]['name']}):\n```c\n{s.code}\n```"
+            for s in slice_result.slices
+        ])
+        
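+        prompt = f"""Fuse the following code slices into their corresponding functions.
+
+Functions in the call chain:
+{functions_desc}
+
+Code slices to insert:
+{slices_desc}
+
+Shared state variables:
+{json.dumps(slice_result.shared_state, indent=2)}
+
+Initialization code: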
+```c
+{slice_result.setup_code}
+```
+
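+Requirements:
+1. Insert each slice at a suitable position in its function (usually a critical point that every execution passes through)
+2. Handle the passing of shared state correctly
+3. Make sure the fused code compiles and runs correctly
+4. Keep the original behavior of every function unchanged
+
+Return the fused code of every function in the following JSON format:
+```json
+{{
+    "fused_functions": [
+        {{
+            "name": "function name",
+            "code": "complete function code after fusion"
+        }}
+    ],
+    "global_declarations": "global declarations to add (such as shared state variables)"
+}}
+```
+
+Return only the JSON, with nothing else.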
+"""
+        return prompt
+    
+    def generate_fused_code(
+        self,
+        target_code: str,
+        call_chain_functions: List[Dict],
+        slice_result: SliceResult = None
+    ) -> Dict:
+        """
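+        Generate the fused code.
+        
+        Args:
+            target_code: Target code to fuse
+            call_chain_functions: Call-chain functions; each element contains name and code
+            slice_result: Code split result (optional; split automatically if omitted)
+            
+        Returns:
+            Dict with the fusion result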
+        """
+        if slice_result is None:
+            function_names = [f['name'] for f in call_chain_functions]
+            slice_result = self.splitter.split_code(
+                target_code, 
+                len(call_chain_functions),
+                function_names
+            )
+        
+        prompt = self._create_fusion_prompt(
+            target_code,
+            call_chain_functions,
+            slice_result
+        )
+        
+        try:
+            completion = self.splitter.client.chat.completions.create(
+                model=self.splitter.model,
+                messages=[
+                    {
+                        "role": "system",
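+                        "content": "You are a professional code fusion expert who is skilled at safely inserting code slices into existing functions. Return only the result in JSON format."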
+                    },
+                    {"role": "user", "content": prompt}
+                ],
+                temperature=0.3,
+            )
+            
+            response_text = completion.choices[0].message.content
+            result_dict = self.splitter._parse_llm_response(response_text)
+            
+            if result_dict:
+                return result_dict
+            else:
+                return self._fallback_fusion(call_chain_functions, slice_result)
+                
+        except Exception as e:
+            print(f"Warning: LLM fusion call failed: {e}. Using fallback fusion.")
+            return self._fallback_fusion(call_chain_functions, slice_result)
+    
+    def _fallback_fusion(
+        self,
+        call_chain_functions: List[Dict],
+        slice_result: SliceResult
+    ) -> Dict:
+        """
+        Fallback fusion method.
+        """
+        fused_functions = []
+        
+        for i, func in enumerate(call_chain_functions):
+            if i < len(slice_result.slices):
+                slice_code = slice_result.slices[i].code
+                # Simply insert the slice at the start of the function
+                fused_code = self._insert_code_at_start(func['code'], slice_code)
+            else:
+                fused_code = func['code']
+            
+            fused_functions.append({
+                "name": func['name'],
+                "code": fused_code
+            })
+        
+        return {
+            "fused_functions": fused_functions,
+            "global_declarations": ""
+        }
+    
+    def _insert_code_at_start(self, func_code: str, insert_code: str) -> str:
+        """
+        Insert code at the start of a function body.
+        """
+        # Find the { that opens the function body
+        brace_pos = func_code.find('{')
+        if brace_pos == -1:
+            return func_code
+        
+        # Insert the code right after the {
+        return (
+            func_code[:brace_pos + 1] + 
+            f"\n    // --- Inserted code start ---\n    {insert_code}\n    // --- Inserted code end ---\n" +
+            func_code[brace_pos + 1:]
+        )
+
+
+if __name__ == "__main__":
+    # Test code
+    test_code = """
+    int secret = 42;
+    int key = secret ^ 0xFF;
+    printf("Key: %d\\n", key);
+    """
+    
+    call_chain = ["outer_func", "middle_func", "inner_func"]
+    
+    try:
+        result = split_code_for_call_chain(test_code, call_chain)
+        print(f"Split into {len(result.slices)} slices:")
+        for code_slice in result.slices:
+            print(f"\nSlice {code_slice.index}:")
+            print(f"  Code: {code_slice.code}")
+            print(f"  Description: {code_slice.description}")
+    except Exception as e:
+        print(f"Error: {e}")
+        print("Make sure DASHSCOPE_API_KEY is set in environment variables.")
+
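The three-stage parsing in `_parse_llm_response` above (whole response, then fenced block, then bare object) can be sketched as a standalone function. This is a minimal illustration and not the repository's code: the name `extract_json` is hypothetical, and the `isinstance` check that rejects non-object JSON is an added assumption.

```python
import json
import re
from typing import Optional

# Build the fence marker programmatically so this example contains no literal fence.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*([\s\S]*?)\s*" + FENCE)
BARE_OBJECT = re.compile(r"\{[\s\S]*\}")


def extract_json(response: str) -> Optional[dict]:
    """Return the first JSON object found in an LLM reply, or None."""
    # Candidate 1: the whole reply is already JSON.
    candidates = [response]
    # Candidate 2: the body of a markdown code block.
    fenced = FENCED_BLOCK.search(response)
    if fenced:
        candidates.append(fenced.group(1))
    # Candidate 3: everything from the first "{" to the last "}".
    bare = BARE_OBJECT.search(response)
    if bare:
        candidates.append(bare.group(0))

    for text in candidates:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        # Assumption: callers expect an object, so lists and scalars are rejected.
        if isinstance(parsed, dict):
            return parsed
    return None


plain = '{"slices": []}'
fenced_reply = f"Here you go:\n{FENCE}json\n{plain}\n{FENCE}"
chatty_reply = f"Result: {plain} Hope that helps."
print(extract_json(fenced_reply), extract_json(chatty_reply))
```

Trying the strictest strategy first means a well-formed reply is never altered by the looser regular expressions, which only run when direct parsing fails.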

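The line-based chunking used by `_fallback_split` in `llm_splitter.py` can be illustrated in isolation. This is a rough sketch under stated assumptions: `split_lines_evenly` is a hypothetical helper that returns plain strings instead of `CodeSlice` objects, and it mirrors only the chunk arithmetic, not the generation of shared state.

```python
from typing import List


def split_lines_evenly(code: str, n_parts: int) -> List[str]:
    """Divide the non-empty lines of `code` into `n_parts` chunks."""
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")

    lines = [line for line in code.strip().split("\n") if line.strip()]

    # Fewer lines than chunks: one line per chunk, padded with placeholders.
    if len(lines) < n_parts:
        return lines + ["// empty slice"] * (n_parts - len(lines))

    chunk_size = len(lines) // n_parts
    chunks = []
    for i in range(n_parts):
        start = i * chunk_size
        # The last chunk absorbs the remainder of the integer division.
        end = start + chunk_size if i < n_parts - 1 else len(lines)
        chunks.append("\n".join(lines[start:end]))
    return chunks


sample = 'int a = 1;\nint b = 2;\nint c = a + b;\nprintf("%d", c);\nreturn c;'
parts = split_lines_evenly(sample, 2)
for index, part in enumerate(parts):
    print(f"--- chunk {index} ---\n{part}")
```

A split like this ignores statement boundaries and data dependencies: a statement that spans several lines can be cut in half, and locals are not rewritten into shared state, so the output should be treated as a last resort.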
+ 646 - 0
src/main.py

@@ -0,0 +1,646 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
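+Code Fusion main program
+
+Features:
+1. Load data for call chains of depth 4
+2. Analyze each function's control flow graph and critical points
+3. Use an LLM to split the target code and fuse it into the call-chain functions
+"""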
+
+import os
+import sys
+import json
+import argparse
+from typing import List, Dict, Optional
+from dataclasses import dataclass
+
+# Add the current directory to the import path
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+
+from cfg_analyzer import analyze_code_cfg, visualize_cfg
+from dominator_analyzer import analyze_dominators, get_fusion_points
+from llm_splitter import LLMCodeSplitter, split_code_for_call_chain
+from code_fusion import CodeFusionEngine, CallChain, FunctionInfo, analyze_call_chain_group
+
+
+@dataclass
+class ProcessingResult:
+    """Processing result."""
+    group_index: int
+    call_chain: List[str]
+    call_depth: int
+    functions_count: int
+    total_fusion_points: int
+    fused_code: Dict[str, str]
+    success: bool
+    error_message: str = ""
+    global_declarations: str = ""  # Global variable declarations
+    passing_method: str = "global"  # Variable passing method
+    parameter_struct: str = ""  # Parameter struct definition
+
+
+class CodeFusionProcessor:
+    """Code fusion processor."""
+    
+    def __init__(self, api_key: str = None):
+        """
+        Initialize the processor.
+        
+        Args:
+            api_key: API key
+        """
+        self.api_key = api_key or os.getenv("DASHSCOPE_API_KEY")
+        self.splitter = None
+        self.engine = None
+        
+        if self.api_key:
+            try:
+                self.splitter = LLMCodeSplitter(api_key=self.api_key)
+                self.engine = CodeFusionEngine(splitter=self.splitter)
+            except Exception as e:
+                print(f"Warning: Failed to initialize LLM splitter: {e}")
+    
+    def load_data(self, input_path: str) -> Dict:
+        """
+        Load the data file.
+        
+        Args:
+            input_path: Path to the input file
+            
+        Returns:
+            Data dict
+        """
+        with open(input_path, 'r', encoding='utf-8') as f:
+            return json.load(f)
+    
+    def analyze_group(self, group: Dict) -> Dict:
+        """
+        Analyze a single call-chain group.
+        
+        Args:
+            group: Call-chain group data
+            
+        Returns:
+            Analysis result
+        """
+        return analyze_call_chain_group(group)
+    
+    def process_group(
+        self,
+        group: Dict,
+        target_code: str,
+        group_index: int = 0,
+        passing_method: str = "global"
+    ) -> ProcessingResult:
+        """
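+        Process a single call-chain group and perform the code fusion.
+        
+        Args:
+            group: Call-chain group data
+            target_code: Target code to fuse
+            group_index: Index of the group
+            passing_method: Variable passing method, "global" or "parameter"
+            
+        Returns:
+            A ProcessingResult object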
+        """
+        functions = group.get('functions', [])
+        call_depth = group.get('call_depth', 0)
+        call_chain = group.get('longest_call_chain', [])
+        
+        if not self.engine:
+            return ProcessingResult(
+                group_index=group_index,
+                call_chain=call_chain,
+                call_depth=call_depth,
+                functions_count=len(functions),
+                total_fusion_points=0,
+                fused_code={},
+                success=False,
+                error_message="LLM engine not initialized",
+                global_declarations="",
+                passing_method=passing_method,
+                parameter_struct=""
+            )
+        
+        try:
+            # Build the call chain
+            chain = self.engine.build_call_chain(functions, call_chain)
+            
+            # Create the fusion plan (forwarding passing_method)
+            plan = self.engine.create_fusion_plan(target_code, chain, passing_method)
+            
+            # Execute the fusion
+            fused_code = self.engine.execute_fusion(plan)
+            
+            # Collect the variable-passing information
+            slice_result = plan.slice_result
+            global_decl = slice_result.global_declarations if slice_result else ""
+            param_struct = slice_result.parameter_struct if slice_result else ""
+            
+            return ProcessingResult(
+                group_index=group_index,
+                call_chain=call_chain,
+                call_depth=call_depth,
+                functions_count=len(functions),
+                total_fusion_points=chain.get_total_fusion_points(),
+                fused_code=fused_code,
+                success=True,
+                global_declarations=global_decl,
+                passing_method=passing_method,
+                parameter_struct=param_struct
+            )
+            
+        except Exception as e:
+            return ProcessingResult(
+                group_index=group_index,
+                call_chain=call_chain,
+                call_depth=call_depth,
+                functions_count=len(functions),
+                total_fusion_points=0,
+                fused_code={},
+                success=False,
+                error_message=str(e),
+                global_declarations="",
+                passing_method=passing_method,
+                parameter_struct=""
+            )
+    
+    def process_file(
+        self,
+        input_path: str,
+        output_path: str,
+        target_code: str,
+        max_groups: int = 10,
+        passing_method: str = "global"
+    ) -> List[ProcessingResult]:
+        """
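+        Process the whole data file.
+        
+        Args:
+            input_path: Path to the input file
+            output_path: Path to the output file
+            target_code: Target code to fuse
+            max_groups: Maximum number of groups to process successfully
+            passing_method: Variable passing method, "global" or "parameter"
+            
+        Returns:
+            List of processing results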
+        """
+        print(f"Loading data from: {input_path}")
+        data = self.load_data(input_path)
+        groups = data.get('groups', [])
+        
+        print(f"Total groups: {len(groups)}")
+        
+        results = []
+        processed = 0
+        
+        for i, group in enumerate(groups):
+            if processed >= max_groups:
+                break
+            
+            print(f"\nProcessing group {i + 1}/{len(groups)}...")
+            
+            # Analyze the group first
+            analysis = self.analyze_group(group)
+            print(f"  Call chain: {' -> '.join(analysis['call_chain'])}")
+            print(f"  Functions: {analysis['functions_count']}")
+            print(f"  Fusion points: {analysis['total_fusion_points']}")
+            
+            # Process the group
+            result = self.process_group(group, target_code, i, passing_method)
+            results.append(result)
+            
+            if result.success:
+                print(f"  Status: SUCCESS")
+                processed += 1
+            else:
+                print(f"  Status: FAILED - {result.error_message}")
+        
+        # Save the results
+        self._save_results(results, output_path, target_code)
+        
+        return results
+    
+    def _save_results(
+        self,
+        results: List[ProcessingResult],
+        output_path: str,
+        target_code: str
+    ):
+        """
+        Save the processing results.
+        """
+        output_data = {
+            "metadata": {
+                "target_code": target_code,
+                "total_processed": len(results),
+                "successful": sum(1 for r in results if r.success),
+                "failed": sum(1 for r in results if not r.success)
+            },
+            "results": []
+        }
+        
+        for result in results:
+            output_data["results"].append({
+                "group_index": result.group_index,
+                "call_chain": result.call_chain,
+                "call_depth": result.call_depth,
+                "functions_count": result.functions_count,
+                "total_fusion_points": result.total_fusion_points,
+                "success": result.success,
+                "error_message": result.error_message,
+                "fused_code": result.fused_code
+            })
+        
+        os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
+        with open(output_path, 'w', encoding='utf-8') as f:
+            json.dump(output_data, f, ensure_ascii=False, indent=2)
+        
+        print(f"\nResults saved to: {output_path}")
+        
+        # Save the merged code files
+        self._save_fused_code_files(results, output_path, target_code)
+        
+        # Report how many results used the parameter passing method
+        param_results = [r for r in results if r.passing_method == "parameter" and r.success]
+        if param_results:
+            print(f"  Parameter passing method results: {len(param_results)}")
+    
+    def _save_fused_code_files(
+        self,
+        results: List[ProcessingResult],
+        output_path: str,
+        target_code: str
+    ):
+        """
+        Save the fused code as separate source files.
+        """
+        # Create the code output directory
+        output_dir = os.path.dirname(output_path)
+        code_dir = os.path.join(output_dir, "fused_code")
+        os.makedirs(code_dir, exist_ok=True)
+        
+        for result in results:
+            if not result.success or not result.fused_code:
+                continue
+            
+            # Generate the file name
+            chain_name = "_".join(result.call_chain[:2]) if len(result.call_chain) >= 2 else "unknown"
+            filename = f"fused_group_{result.group_index}_{chain_name}.c"
+            filepath = os.path.join(code_dir, filename)
+            
+            # Build the contents of the merged code file
+            code_content = self._generate_fused_code_file(result, target_code, result.global_declarations)
+            
+            with open(filepath, 'w', encoding='utf-8') as f:
+                f.write(code_content)
+            
+            print(f"  Fused code saved to: {filepath}")
+        
+        # Generate the summary file
+        summary_path = os.path.join(code_dir, "all_fused_code.c")
+        all_code = self._generate_all_fused_code(results, target_code)
+        with open(summary_path, 'w', encoding='utf-8') as f:
+            f.write(all_code)
+        print(f"  All fused code saved to: {summary_path}")
+    
+    def _generate_fused_code_file(
+        self,
+        result: ProcessingResult,
+        target_code: str,
+        global_declarations: str = ""
+    ) -> str:
+        """
+        Build the contents of a single fused code file.
+        """
+        lines = []
+        
+        # File header
+        lines.append("/*")
+        lines.append(" * Fused Code File")
+        lines.append(f" * Group Index: {result.group_index}")
+        lines.append(f" * Call Chain: {' -> '.join(result.call_chain)}")
+        lines.append(f" * Call Depth: {result.call_depth}")
+        lines.append(" *")
+        lines.append(" * Original Target Code:")
+        for line in target_code.strip().split('\n'):
+            lines.append(f" *   {line}")
+        lines.append(" *")
+        lines.append(" * Generated by Code Fusion Tool")
+        lines.append(" */")
+        lines.append("")
+        
+        # Include common headers
+        lines.append("#include <stdio.h>")
+        lines.append("#include <stdlib.h>")
+        lines.append("#include <string.h>")
+        lines.append("")
+        
+        # Choose the declaration style that matches the passing method
+        passing_method = getattr(result, 'passing_method', 'global')
+        parameter_struct = getattr(result, 'parameter_struct', '')
+        
+        if passing_method == "parameter":
+            # Parameter passing method: use a struct
+            lines.append("/* === Shared State (Parameter Passing Method) === */")
+            if parameter_struct:
+                lines.append(parameter_struct)
+            else:
+                lines.append("typedef struct {")
+                lines.append("    int secret;")
+                lines.append("    int key;")
+                lines.append("} FusionState;")
+            lines.append("")
+            lines.append("/* Usage: Pass FusionState* fusion_state to each function */")
+            lines.append("/* Initialize: FusionState state; memset(&state, 0, sizeof(state)); */")
+        else:
+            # Global variable method
+            lines.append("/* === Shared State Variables (Global) === */")
+            if global_declarations:
+                lines.append(global_declarations)
+            else:
+                lines.append("static int g_secret;")
+                lines.append("static int g_key;")
+        lines.append("")
+        
+        # Function declarations
+        lines.append("/* === Function Declarations === */")
+        for func_name in result.call_chain:
+            if func_name in result.fused_code:
+                # Extract the function signature
+                code = result.fused_code[func_name]
+                sig = self._extract_function_signature(code)
+                if sig:
+                    lines.append(f"{sig};")
+        lines.append("")
+        
+        # Function definitions (innermost to outermost along the call chain)
+        lines.append("/* === Function Definitions === */")
+        lines.append("/* Functions are ordered from innermost to outermost in the call chain */")
+        lines.append("")
+        
+        # Reverse the order so that callees are defined first
+        for func_name in reversed(result.call_chain):
+            if func_name in result.fused_code:
+                lines.append(f"/* --- {func_name} --- */")
+                lines.append(result.fused_code[func_name])
+                lines.append("")
+        
+        return '\n'.join(lines)
+    
+    def _generate_all_fused_code(
+        self,
+        results: List[ProcessingResult],
+        target_code: str
+    ) -> str:
+        """
+        Build the summary file containing all fused code.
+        """
+        lines = []
+        
+        # File header
+        lines.append("/*")
+        lines.append(" * All Fused Code - Summary File")
+        lines.append(f" * Total Groups: {len([r for r in results if r.success])}")
+        lines.append(" *")
+        lines.append(" * Original Target Code:")
+        for line in target_code.strip().split('\n'):
+            lines.append(f" *   {line}")
+        lines.append(" *")
+        lines.append(" * Generated by Code Fusion Tool")
+        lines.append(" */")
+        lines.append("")
+        
+        lines.append("#include <stdio.h>")
+        lines.append("#include <stdlib.h>")
+        lines.append("#include <string.h>")
+        lines.append("")
+        
+        # Each successful group
+        for result in results:
+            if not result.success or not result.fused_code:
+                continue
+            
+            lines.append("")
+            lines.append("/*" + "=" * 76 + "*/")
+            lines.append(f"/* GROUP {result.group_index}: {' -> '.join(result.call_chain)} */")
+            lines.append("/*" + "=" * 76 + "*/")
+            lines.append("")
+            
+            # Choose the declarations that match the passing method
+            if result.passing_method == "parameter":
+                lines.append("/* === Shared State (Parameter Passing Method) === */")
+                if result.parameter_struct:
+                    lines.append(result.parameter_struct)
+                else:
+                    lines.append("typedef struct { int secret; int key; } FusionState;")
+                lines.append("/* Pass FusionState* fusion_state to each function */")
+            else:
+                lines.append("/* === Shared State Variables (Global) === */")
+                if result.global_declarations:
+                    lines.append(result.global_declarations)
+                else:
+                    lines.append("static int g_secret;")
+                    lines.append("static int g_key;")
+            lines.append("")
+            
+            # Function definitions, deepest callee first so each function is defined before its caller
+            for func_name in reversed(result.call_chain):
+                if func_name in result.fused_code:
+                    lines.append(f"/* {func_name} */")
+                    lines.append(result.fused_code[func_name])
+                    lines.append("")
+        
+        return '\n'.join(lines)
+    
+    def _extract_function_signature(self, func_code: str) -> Optional[str]:
+        """
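+        Extract the function signature from a function's source code.
+        """
+        # The signature is everything before the first '{'
+        brace_pos = func_code.find('{')
+        if brace_pos == -1:
+            return None
+        
+        sig = func_code[:brace_pos].strip()
+        # Collapse redundant whitespace and newlines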
+        sig = ' '.join(sig.split())
+        return sig
+
+
+def demo_analysis(input_path: str):
+    """
+    Demonstrate the analysis step (without calling the LLM).
+    """
+    print("=" * 60)
+    print("Code Fusion Analysis Demo")
+    print("=" * 60)
+    
+    # Load the data
+    with open(input_path, 'r', encoding='utf-8') as f:
+        data = json.load(f)
+    
+    groups = data.get('groups', [])
+    print(f"\nTotal groups: {len(groups)}")
+    
+    # Analyze the first few groups
+    for i, group in enumerate(groups[:5]):
+        print(f"\n--- Group {i + 1} ---")
+        
+        call_depth = group.get('call_depth', 0)
+        call_chain = group.get('longest_call_chain', [])
+        functions = group.get('functions', [])
+        
+        print(f"Call depth: {call_depth}")
+        print(f"Call chain: {' -> '.join(call_chain)}")
+        print(f"Functions count: {len(functions)}")
+        
+        # Analyze each function (only the first three per group)
+        for func_data in functions[:3]:
+            # Analyze the full source; truncating it first would hand the CFG builder a broken function
+            code = func_data.get('func', '')
+            cfg = analyze_code_cfg(code)
+            fusion_points = get_fusion_points(cfg)
+            
+            print(f"\n  Function: {cfg.function_name}")
+            print(f"  Blocks: {len(cfg.blocks)}")
+            print(f"  Fusion points: {len(fusion_points)}")
+            print(f"  Code preview: {code[:100]}...")
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description='Code Fusion - call chain analysis and code fusion tool',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Analyze data whose call-chain depth is 4
+  python main.py --input output/primevul_valid_grouped_depth_4.json --analyze-only
+  
+  # Run code fusion
+  python main.py --input output/primevul_valid_grouped_depth_4.json \\
+                 --output output/fusion_results.json \\
+                 --target-code "int secret = 42; printf(\\"secret: %d\\n\\", secret);"
+                 
+  # Use a source file as the target code
+  python main.py --input output/primevul_valid_grouped_depth_4.json \\
+                 --output output/fusion_results.json \\
+                 --target-file target_code.c
+        """
+    )
+    
+    parser.add_argument(
+        '--input', '-i',
+        type=str,
+        required=True,
+        help='Path to the input grouped JSON file'
+    )
+    
+    parser.add_argument(
+        '--output', '-o',
+        type=str,
+        default=None,
+        help='Output file path'
+    )
+    
+    parser.add_argument(
+        '--target-code', '-t',
+        type=str,
+        default=None,
+        help='Target code to fuse, given as a string'
+    )
+    
+    parser.add_argument(
+        '--target-file', '-f',
+        type=str,
+        default=None,
+        help='Path to a file containing the target code to fuse'
+    )
+    
+    parser.add_argument(
+        '--max-groups', '-m',
+        type=int,
+        default=5,
+        help='Maximum number of groups to process (default: 5)'
+    )
+    
+    parser.add_argument(
+        '--analyze-only', '-a',
+        action='store_true',
+        help='Run the analysis only, without performing fusion'
+    )
+    
+    parser.add_argument(
+        '--method',
+        type=str,
+        choices=['global', 'parameter'],
+        default='global',
+        help='Variable passing method: global (global variables) or parameter (parameter passing) (default: global)'
+    )
+    
+    args = parser.parse_args()
+    
+    # Check that the input file exists
+    if not os.path.exists(args.input):
+        print(f"Error: Input file not found: {args.input}", file=sys.stderr)
+        sys.exit(1)
+    
+    # Analyze-only mode
+    if args.analyze_only:
+        demo_analysis(args.input)
+        return
+    
+    # Resolve the target code (--target-file takes precedence over --target-code)
+    target_code = args.target_code
+    if args.target_file:
+        if os.path.exists(args.target_file):
+            with open(args.target_file, 'r', encoding='utf-8') as f:
+                target_code = f.read()
+        else:
+            print(f"Error: Target file not found: {args.target_file}", file=sys.stderr)
+            sys.exit(1)
+    
+    if not target_code:
+        # Fall back to the built-in example code
+        target_code = """
+        // Example target code to be fused
+        int secret_value = 0x12345678;
+        int key = secret_value ^ 0xDEADBEEF;
+        printf("Computed key: 0x%x\\n", key);
+        """
+        print("Using default example target code.")
+    
+    # Set the default output path
+    if args.output is None:
+        base_name = os.path.splitext(os.path.basename(args.input))[0]
+        output_dir = os.path.dirname(args.input) or '.'
+        args.output = os.path.join(output_dir, f'{base_name}_fused.json')
+    
+    # Create the processor and run it
+    processor = CodeFusionProcessor()
+    
+    print(f"Using variable passing method: {args.method}")
+    
+    results = processor.process_file(
+        args.input,
+        args.output,
+        target_code,
+        args.max_groups,
+        args.method
+    )
+    
+    # Print a summary
+    successful = sum(1 for r in results if r.success)
+    print(f"\n{'=' * 60}")
+    print("Processing Summary")
+    print(f"{'=' * 60}")
+    print(f"Total processed: {len(results)}")
+    print(f"Successful: {successful}")
+    print(f"Failed: {len(results) - successful}")
+
+
+if __name__ == '__main__':
+    main()
+
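Reviewer note: the summary generator above copies arbitrary target code into a C block comment. The following is a minimal sketch of a helper that keeps such text from terminating the comment early. The name `c_block_comment` is hypothetical and is not part of this commit.

```python
def c_block_comment(text: str) -> str:
    """Wrap text in a C block comment (hypothetical helper, not in the commit)."""
    # Split every "*/" so the embedded text cannot close the comment early.
    safe = text.replace("*/", "* /")
    # Prefix each line with " * " and drop trailing spaces on empty lines.
    body = "\n".join(f" * {line}".rstrip() for line in safe.splitlines())
    return f"/*\n{body}\n */"


if __name__ == "__main__":
    print(c_block_comment('int x = 1; /* note */\nprintf("%d", x);'))
```

The same replacement could be applied line by line inside the header loop; wrapping it in a helper only makes the intent easier to test.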

+ 8 - 0
src/requirements.txt

@@ -0,0 +1,8 @@
+# Code Fusion Project Dependencies
+openai>=1.0.0
+tree-sitter>=0.20.0
+tree-sitter-c>=0.20.0
+tree-sitter-cpp>=0.20.0
+networkx>=3.0
+graphviz>=0.20
+

+ 501 - 0
utils/data_process/extract_call_relations.py

@@ -0,0 +1,501 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Analyze caller/callee relations between functions and merge functions related by calls into groups.
+"""
+
+import json
+import re
+import os
+import argparse
+from collections import defaultdict
+from typing import Dict, List, Set, Tuple, Optional
+
+
+# Common C/C++ library functions and system calls; these should not be used to link different function groups
+COMMON_LIB_FUNCTIONS = {
+    # Memory management
+    'malloc', 'calloc', 'realloc', 'free', 'memcpy', 'memmove', 'memset',
+    'memcmp', 'memchr', 'alloca', 'aligned_alloc',
+    # String handling
+    'strlen', 'strcpy', 'strncpy', 'strcat', 'strncat', 'strcmp', 'strncmp',
+    'strchr', 'strrchr', 'strstr', 'strtok', 'strdup', 'strndup', 'strspn',
+    'strcspn', 'strpbrk', 'strerror', 'sprintf', 'snprintf', 'vsprintf',
+    'vsnprintf', 'sscanf',
+    # Input/output
+    'printf', 'fprintf', 'vprintf', 'vfprintf', 'puts', 'fputs', 'putc',
+    'fputc', 'putchar', 'gets', 'fgets', 'getc', 'fgetc', 'getchar',
+    'scanf', 'fscanf', 'fopen', 'fclose', 'fread', 'fwrite', 'fseek',
+    'ftell', 'rewind', 'fflush', 'feof', 'ferror', 'clearerr', 'perror',
+    # Type conversion
+    'atoi', 'atol', 'atoll', 'atof', 'strtol', 'strtoll', 'strtoul',
+    'strtoull', 'strtof', 'strtod', 'strtold',
+    # Math functions
+    'abs', 'labs', 'llabs', 'fabs', 'floor', 'ceil', 'round', 'sqrt',
+    'pow', 'exp', 'log', 'log10', 'sin', 'cos', 'tan', 'asin', 'acos',
+    'atan', 'atan2', 'min', 'max',
+    # Time functions
+    'time', 'clock', 'difftime', 'mktime', 'strftime', 'localtime',
+    'gmtime', 'asctime', 'ctime', 'gettimeofday', 'sleep', 'usleep',
+    'nanosleep',
+    # Processes and signals
+    'exit', 'abort', '_exit', 'atexit', 'system', 'getenv', 'setenv',
+    'fork', 'exec', 'execl', 'execv', 'execle', 'execve', 'execlp',
+    'execvp', 'wait', 'waitpid', 'kill', 'signal', 'raise',
+    # Assertions and error handling
+    'assert', 'errno', 'setjmp', 'longjmp',
+    # POSIX and system calls
+    'open', 'close', 'read', 'write', 'lseek', 'stat', 'fstat', 'lstat',
+    'access', 'chmod', 'chown', 'link', 'unlink', 'rename', 'mkdir',
+    'rmdir', 'opendir', 'closedir', 'readdir', 'getcwd', 'chdir',
+    'pipe', 'dup', 'dup2', 'fcntl', 'ioctl', 'select', 'poll', 'mmap',
+    'munmap', 'mprotect', 'socket', 'bind', 'listen', 'accept', 'connect',
+    'send', 'recv', 'sendto', 'recvfrom', 'shutdown', 'setsockopt',
+    'getsockopt', 'pthread_create', 'pthread_join', 'pthread_exit',
+    'pthread_mutex_lock', 'pthread_mutex_unlock', 'pthread_cond_wait',
+    'pthread_cond_signal',
+    # Common C++ names
+    'std', 'make_shared', 'make_unique', 'move', 'forward', 'swap',
+    'begin', 'end', 'size', 'empty', 'push_back', 'pop_back', 'front',
+    'back', 'insert', 'erase', 'clear', 'find', 'count', 'sort',
+    'unique', 'reverse', 'copy', 'fill', 'transform', 'accumulate',
+    # Checks (static and runtime assertion macros)
+    'static_assert', 'ASSERT', 'DCHECK', 'CHECK', 'EXPECT', 'VERIFY',
+    # Logging
+    'LOG', 'DLOG', 'VLOG', 'ERR', 'WARN', 'INFO', 'DEBUG', 'TRACE',
+    # Other common macros/functions
+    'DISALLOW_COPY_AND_ASSIGN', 'NOTREACHED', 'UNIMPLEMENTED',
+    'offsetof', 'container_of', 'likely', 'unlikely', 'BUG', 'BUG_ON',
+    'WARN_ON', 'IS_ERR', 'PTR_ERR', 'ERR_PTR', 'ERR_CAST',
+    # Test-related
+    'TEST', 'TEST_F', 'TEST_P', 'EXPECT_TRUE', 'EXPECT_FALSE',
+    'EXPECT_EQ', 'EXPECT_NE', 'EXPECT_LT', 'EXPECT_LE', 'EXPECT_GT',
+    'EXPECT_GE', 'ASSERT_TRUE', 'ASSERT_FALSE', 'ASSERT_EQ', 'ASSERT_NE',
+    'MOCK_METHOD', 'INSTANTIATE_TEST_SUITE_P',
+}
+
+
+def extract_function_name(func_code: str) -> Optional[str]:
+    """
+    Extract the function name from a function's source code.
+    Supports C/C++-style function definitions.
+    """
+    # Strip comments (a trailing // comment may lack a final newline)
+    code = re.sub(r'//[^\n]*', '', func_code)
+    code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
+    
+    # Patterns that match a function definition
+    # Form: [return type] [ClassName::]function_name(parameter list)
+    patterns = [
+        # C++ member function: ReturnType ClassName::FunctionName(...)
+        r'(?:[\w\s\*&<>,]+?)\s+(\w+::~?\w+)\s*\([^)]*\)\s*(?:const)?\s*(?:override)?\s*(?:final)?\s*(?:\{|:)',
+        # Constructor/destructor: ClassName::ClassName(...) or ClassName::~ClassName(...)
+        r'^[\s]*(\w+::~?\w+)\s*\([^)]*\)\s*(?:\{|:)',
+        # Plain C function: ReturnType FunctionName(...)
+        r'(?:[\w\s\*&<>,]+?)\s+(\w+)\s*\([^)]*\)\s*\{',
+        # Loose fallback pattern
+        r'^\s*(?:static\s+)?(?:inline\s+)?(?:virtual\s+)?(?:[\w\*&<>,\s]+)\s+(\w+)\s*\(',
+    ]
+    
+    for pattern in patterns:
+        match = re.search(pattern, code, re.MULTILINE)
+        if match:
+            func_name = match.group(1)
+            # For the ClassName::FunctionName form, keep only the function name
+            if '::' in func_name:
+                func_name = func_name.split('::')[-1]
+            return func_name
+    
+    return None
+
+
+def extract_function_calls(
+    func_code: str, 
+    self_name: Optional[str] = None,
+    exclude_common_libs: bool = True
+) -> Set[str]:
+    """
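+    Extract the names of all functions called (callees) in a function's source code.
+    
+    Args:
+        func_code: the function's source code
+        self_name: name of the current function (excluded from the result)
+        exclude_common_libs: whether to exclude common library functions
+    """
+    # Strip string/char literals and comments in a single pass, so that a
+    # "//" inside a string literal is not mistaken for the start of a comment
+    token_re = r'"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\'|//[^\n]*|/\*.*?\*/'
+    def blank(m):
+        return '""' if m.group(0)[0] == '"' else "''" if m.group(0)[0] == "'" else ' '
+    code = re.sub(token_re, blank, func_code, flags=re.DOTALL)
+    
+    # Extract function calls of the form: name(
+    # Exclude keywords and other common constructs that are not function calls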
+    keywords = {
+        'if', 'else', 'while', 'for', 'switch', 'case', 'return', 'break',
+        'continue', 'sizeof', 'typeof', 'alignof', 'decltype', 'static_cast',
+        'dynamic_cast', 'reinterpret_cast', 'const_cast', 'new', 'delete',
+        'throw', 'catch', 'try', 'namespace', 'class', 'struct', 'enum',
+        'union', 'typedef', 'using', 'template', 'typename', 'public',
+        'private', 'protected', 'virtual', 'override', 'final', 'explicit',
+        'inline', 'static', 'extern', 'const', 'volatile', 'mutable',
+        'register', 'auto', 'default', 'goto', 'asm', '__asm', '__asm__',
+    }
+    
+    # Match function calls
+    pattern = r'\b([a-zA-Z_]\w*)\s*\('
+    matches = re.findall(pattern, code)
+    
+    # Filter out keywords, the function itself, and common library functions
+    callees = set()
+    for name in matches:
+        if name in keywords:
+            continue
+        if self_name is not None and name == self_name:
+            continue
+        if exclude_common_libs and name in COMMON_LIB_FUNCTIONS:
+            continue
+        callees.add(name)
+    
+    return callees
+
+
+def load_jsonl(file_path: str) -> List[Dict]:
+    """
+    Load a JSONL file.
+    """
+    data = []
+    with open(file_path, 'r', encoding='utf-8') as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                data.append(json.loads(line))
+    return data
+
+
+def build_call_graph(
+    records: List[Dict],
+    exclude_common_libs: bool = True
+) -> Tuple[Dict[str, Set[str]], Dict[int, str], Dict[str, List[int]]]:
+    """
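+    Build the function call graph.
+    
+    Args:
+        records: list of data records
+        exclude_common_libs: whether to exclude common library functions
+    
+    Returns:
+        - call_graph: {function name: {set of called function names}}
+        - idx_to_func: {record index: function name}
+        - func_to_idxs: {function name: [list of record indices]} (one function name may map to several records)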
+    """
+    call_graph = {}
+    idx_to_func = {}
+    func_to_idxs = defaultdict(list)
+    
+    for i, record in enumerate(records):
+        func_code = record.get('func', '')
+        func_name = extract_function_name(func_code)
+        
+        if func_name:
+            callees = extract_function_calls(func_code, func_name, exclude_common_libs)
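+            # Several records may share a name; merge their callees instead of overwriting
+            call_graph.setdefault(func_name, set()).update(callees)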
+            idx_to_func[i] = func_name
+            func_to_idxs[func_name].append(i)
+    
+    return call_graph, idx_to_func, func_to_idxs
+
+
+def find_high_frequency_functions(
+    call_graph: Dict[str, Set[str]],
+    all_funcs: Set[str],
+    threshold_percentile: float = 99.0
+) -> Set[str]:
+    """
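+    Find functions that are called very frequently (likely generic utility functions).
+    
+    Args:
+        call_graph: the function call graph
+        all_funcs: all function names in the dataset
+        threshold_percentile: threshold percentile (default: 99)
+    
+    Returns:
+        The set of frequently called functions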
+    """
+    # Count how many callers each function has
+    callee_count = defaultdict(int)
+    for callees in call_graph.values():
+        for callee in callees:
+            if callee in all_funcs:
+                callee_count[callee] += 1
+    
+    if not callee_count:
+        return set()
+    
+    # Compute the threshold
+    counts = sorted(callee_count.values())
+    threshold_idx = int(len(counts) * threshold_percentile / 100)
+    threshold = counts[min(threshold_idx, len(counts) - 1)]
+    
+    # Filter only when the threshold reaches a minimum value (avoids discarding ordinary call relations)
+    if threshold < 10:
+        return set()
+    
+    high_freq_funcs = {fn for fn, count in callee_count.items() if count >= threshold}
+    return high_freq_funcs
+
+
+def find_related_groups(
+    records: List[Dict],
+    call_graph: Dict[str, Set[str]],
+    func_to_idxs: Dict[str, List[int]],
+    auto_filter_high_freq: bool = True,
+    high_freq_threshold: float = 99.0
+) -> List[List[Dict]]:
+    """
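+    Find groups of functions that are related by calls.
+    Treats call relations as undirected edges and collects connected components with BFS.
+    
+    Args:
+        records: list of data records
+        call_graph: the function call graph
+        func_to_idxs: mapping from function name to record indices
+        auto_filter_high_freq: whether to automatically filter out frequently called functions
+        high_freq_threshold: threshold percentile for frequently called functions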
+    """
+    # Collect all function names
+    all_funcs = set(call_graph.keys())
+    
+    # Find frequently called functions
+    high_freq_funcs = set()
+    if auto_filter_high_freq:
+        high_freq_funcs = find_high_frequency_functions(
+            call_graph, all_funcs, high_freq_threshold
+        )
+        if high_freq_funcs:
+            print(f"  Automatically filtered out {len(high_freq_funcs)} frequently called functions")
+    
+    # Keep only call relations whose endpoints both exist in the dataset
+    # Build an undirected relation graph (caller -> callee, callee -> caller)
+    related_graph = defaultdict(set)
+    
+    for caller, callees in call_graph.items():
+        for callee in callees:
+            # Link the two only when the callee is also in the dataset
+            # and is not one of the frequently called functions
+            if callee in all_funcs and callee not in high_freq_funcs:
+                related_graph[caller].add(callee)
+                related_graph[callee].add(caller)
+    
+    # Find connected components with BFS
+    visited = set()
+    groups = []
+    
+    for func_name in all_funcs:
+        if func_name not in visited:
+            # BFS to collect every function connected to this one
+            group_funcs = set()
+            queue = [func_name]
+            
+            while queue:
+                current = queue.pop(0)
+                if current in visited:
+                    continue
+                visited.add(current)
+                group_funcs.add(current)
+                
+                # Enqueue related functions
+                for related in related_graph.get(current, []):
+                    if related not in visited:
+                        queue.append(related)
+            
+            # Map function names back to their records
+            group_records = []
+            for fn in group_funcs:
+                for idx in func_to_idxs.get(fn, []):
+                    group_records.append(records[idx])
+            
+            if group_records:
+                groups.append(group_records)
+    
+    return groups
+
+
+def process_file(
+    input_path: str, 
+    output_path: str, 
+    min_group_size: int = 1,
+    max_group_size: int = 0,
+    exclude_common_libs: bool = True
+):
+    """
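+    Process a single JSONL file.
+    
+    Args:
+        input_path: input file path
+        output_path: output file path
+        min_group_size: minimum group size (default 1; set to 2 to keep only groups with call relations)
+        max_group_size: maximum group size (0 means unlimited; larger groups are split into single-record groups)
+        exclude_common_libs: whether to exclude common library functions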
+    """
+    print(f"Loading data: {input_path}")
+    records = load_jsonl(input_path)
+    print(f"Loaded {len(records)} records")
+    
+    print("Building the function call graph...")
+    call_graph, idx_to_func, func_to_idxs = build_call_graph(records, exclude_common_libs)
+    print(f"Identified {len(call_graph)} functions")
+    
+    print("Analyzing call relations and merging related functions...")
+    groups = find_related_groups(
+        records, call_graph, func_to_idxs,
+        auto_filter_high_freq=True,
+        high_freq_threshold=99.0
+    )
+    
+    # Handle oversized groups: if max_group_size is set, split them into single-record groups
+    if max_group_size > 0:
+        new_groups = []
+        oversized_count = 0
+        for g in groups:
+            if len(g) > max_group_size:
+                oversized_count += 1
+                # Turn every record of the oversized group into its own group
+                for record in g:
+                    new_groups.append([record])
+            else:
+                new_groups.append(g)
+        if oversized_count > 0:
+            print(f"  (split {oversized_count} oversized groups into individual records)")
+        groups = new_groups
+    
+    # Filter by group size
+    if min_group_size > 1:
+        groups = [g for g in groups if len(g) >= min_group_size]
+    
+    # Statistics
+    total_funcs = sum(len(g) for g in groups)
+    groups_with_relations = [g for g in groups if len(g) > 1]
+    single_func_groups = len([g for g in groups if len(g) == 1])
+    
+    # Tally the distribution of group sizes
+    size_distribution = defaultdict(int)
+    for g in groups:
+        size = len(g)
+        if size == 1:
+            size_distribution["1 (single function)"] += 1
+        elif size <= 5:
+            size_distribution["2-5"] += 1
+        elif size <= 10:
+            size_distribution["6-10"] += 1
+        elif size <= 50:
+            size_distribution["11-50"] += 1
+        elif size <= 100:
+            size_distribution["51-100"] += 1
+        elif size <= 500:
+            size_distribution["101-500"] += 1
+        elif size <= 1000:
+            size_distribution["501-1000"] += 1
+        else:
+            size_distribution["1000+"] += 1
+    
+    print("\n==================== Statistics ====================")
+    print(f"  Total records (original): {len(records)}")
+    print(f"  Total functions (after grouping): {total_funcs}")
+    print(f"  Total groups: {len(groups)}")
+    print(f"    - Single-function groups (no call relations): {single_func_groups}")
+    print(f"    - Groups with call relations (size > 1): {len(groups_with_relations)}")
+    
+    if groups_with_relations:
+        actual_max_size = max(len(g) for g in groups_with_relations)
+        avg_group_size = sum(len(g) for g in groups_with_relations) / len(groups_with_relations)
+        print(f"  Largest group size: {actual_max_size}")
+        print(f"  Average size of groups with relations: {avg_group_size:.2f}")
+    
+    print("\n  Group size distribution:")
+    # Print in a fixed order
+    order = ["1 (single function)", "2-5", "6-10", "11-50", "51-100", "101-500", "501-1000", "1000+"]
+    for key in order:
+        if key in size_distribution:
+            count = size_distribution[key]
+            percentage = count / len(groups) * 100
+            print(f"    - Size {key}: {count} groups ({percentage:.1f}%)")
+    print("====================================================")
+    
+    # Write the results
+    output_data = {
+        "metadata": {
+            "source_file": os.path.basename(input_path),
+            "total_records": len(records),
+            "total_functions_grouped": total_funcs,
+            "total_groups": len(groups),
+            "single_function_groups": single_func_groups,
+            "groups_with_relations": len(groups_with_relations),
+            "max_group_size": max(len(g) for g in groups) if groups else 0,
+            "avg_related_group_size": round(sum(len(g) for g in groups_with_relations) / len(groups_with_relations), 2) if groups_with_relations else 0,
+            "size_distribution": dict(size_distribution),
+        },
+        "groups": groups
+    }
+    
+    # dirname is '' for a bare file name, and os.makedirs('') raises, so fall back to '.'
+    os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)
+    with open(output_path, 'w', encoding='utf-8') as f:
+        json.dump(output_data, f, ensure_ascii=False, indent=2)
+    
+    print(f"\nResults saved to: {output_path}")
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Analyze call relations between functions')
+    parser.add_argument(
+        '--input', '-i',
+        type=str,
+        required=True,
+        help='Path to the input JSONL file'
+    )
+    parser.add_argument(
+        '--output', '-o',
+        type=str,
+        default=None,
+        help='Path to the output JSON file (default: output/<input file name>_grouped.json)'
+    )
+    parser.add_argument(
+        '--min-group-size', '-m',
+        type=int,
+        default=1,
+        help='Minimum group size; set to 2 to keep only groups with call relations (default: 1)'
+    )
+    parser.add_argument(
+        '--max-group-size', '-M',
+        type=int,
+        default=0,
+        help='Maximum group size; larger groups are split (0 means unlimited, default: 0)'
+    )
+    parser.add_argument(
+        '--include-common-libs',
+        action='store_true',
+        default=False,
+        help='Treat calls to common library functions as call relations (excluded by default)'
+    )
+    
+    args = parser.parse_args()
+    
+    # Set the default output path
+    if args.output is None:
+        base_name = os.path.splitext(os.path.basename(args.input))[0]
+        # Go two levels up from the script directory (the project root)
+        script_dir = os.path.dirname(os.path.abspath(__file__))
+        project_root = os.path.dirname(os.path.dirname(script_dir))
+        args.output = os.path.join(project_root, 'output', f'{base_name}_grouped.json')
+    
+    process_file(
+        args.input, 
+        args.output, 
+        args.min_group_size,
+        args.max_group_size,
+        exclude_common_libs=not args.include_common_libs
+    )
+
+
+if __name__ == '__main__':
+    main()
+
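Reviewer note: `find_related_groups` above merges functions that are linked by call relations into groups. A union-find (disjoint-set) structure is one alternative way to compute the same groups. The sketch below is illustrative only; `UnionFind` and `connected_groups` are hypothetical names that do not exist in this commit, and the sketch ignores the high-frequency filter.

```python
from typing import Dict, Hashable, List, Set


class UnionFind:
    """Disjoint-set forest with path compression (illustrative helper)."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}

    def find(self, x: Hashable) -> Hashable:
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walk directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: Hashable, b: Hashable) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb


def connected_groups(call_graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group function names that are linked by calls inside the dataset."""
    uf = UnionFind()
    for caller, callees in call_graph.items():
        uf.find(caller)  # register functions that have no in-dataset callees
        for callee in callees:
            if callee in call_graph:  # ignore callees outside the dataset
                uf.union(caller, callee)
    groups: Dict[Hashable, Set[str]] = {}
    for name in call_graph:
        groups.setdefault(uf.find(name), set()).add(name)
    return list(groups.values())


if __name__ == "__main__":
    demo = {"a": {"b"}, "b": {"c", "printf"}, "c": set(), "d": set()}
    print(sorted(sorted(g) for g in connected_groups(demo)))
```

Both approaches yield the same partition; union-find avoids building the explicit undirected adjacency map, while BFS keeps the traversal order easy to follow.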

+ 328 - 0
utils/data_process/filter_by_call_depth.py

@@ -0,0 +1,328 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Select the groups with a specific call-chain depth from a grouped JSON file.
+
+Definition of call-chain depth:
+- caller -> callee is depth 2
+- caller -> caller -> func is depth 3
+- caller -> caller -> caller -> func is depth 4
+"""
+
+import json
+import re
+import os
+import argparse
+from collections import defaultdict
+from typing import Dict, List, Set, Optional, Tuple
+
+
+def extract_function_name(func_code: str) -> Optional[str]:
+    """
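+    Extract the function name from a function's source code.
+    """
+    # Strip // comments (a trailing one may lack a final newline)
+    code = re.sub(r'//[^\n]*', '', func_code)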
+    code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
+    
+    patterns = [
+        r'(?:[\w\s\*&<>,]+?)\s+(\w+::~?\w+)\s*\([^)]*\)\s*(?:const)?\s*(?:override)?\s*(?:final)?\s*(?:\{|:)',
+        r'^[\s]*(\w+::~?\w+)\s*\([^)]*\)\s*(?:\{|:)',
+        r'(?:[\w\s\*&<>,]+?)\s+(\w+)\s*\([^)]*\)\s*\{',
+        r'^\s*(?:static\s+)?(?:inline\s+)?(?:virtual\s+)?(?:[\w\*&<>,\s]+)\s+(\w+)\s*\(',
+    ]
+    
+    for pattern in patterns:
+        match = re.search(pattern, code, re.MULTILINE)
+        if match:
+            func_name = match.group(1)
+            if '::' in func_name:
+                func_name = func_name.split('::')[-1]
+            return func_name
+    
+    return None
+
+
+def extract_function_calls(func_code: str, self_name: Optional[str] = None) -> Set[str]:
+    """
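+    Extract the names of all functions called in a function's source code.
+    """
+    # Strip literals and comments in one pass so "//" inside a string is not treated as a comment
+    token_re = r'"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\'|//[^\n]*|/\*.*?\*/'
+    def blank(m):
+        return '""' if m.group(0)[0] == '"' else "''" if m.group(0)[0] == "'" else ' '
+    code = re.sub(token_re, blank, func_code, flags=re.DOTALL)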
+    
+    keywords = {
+        'if', 'else', 'while', 'for', 'switch', 'case', 'return', 'break',
+        'continue', 'sizeof', 'typeof', 'alignof', 'decltype', 'static_cast',
+        'dynamic_cast', 'reinterpret_cast', 'const_cast', 'new', 'delete',
+        'throw', 'catch', 'try', 'namespace', 'class', 'struct', 'enum',
+        'union', 'typedef', 'using', 'template', 'typename', 'public',
+        'private', 'protected', 'virtual', 'override', 'final', 'explicit',
+        'inline', 'static', 'extern', 'const', 'volatile', 'mutable',
+        'register', 'auto', 'default', 'goto', 'asm', '__asm', '__asm__',
+    }
+    
+    pattern = r'\b([a-zA-Z_]\w*)\s*\('
+    matches = re.findall(pattern, code)
+    
+    callees = set()
+    for name in matches:
+        if name not in keywords:
+            if self_name is None or name != self_name:
+                callees.add(name)
+    
+    return callees
+
+
+def compute_call_depth(group: List[Dict]) -> Tuple[int, List[str]]:
+    """
+    Compute the maximum call-chain depth within a group.
+    
+    Returns:
+        (maximum depth, longest call-chain path)
+    """
+    if len(group) <= 1:
+        return 1, []
+    
+    # Extract each function's name and the functions it calls
+    func_names = {}  # idx -> func_name
+    func_codes = {}  # func_name -> code
+    call_graph = {}  # func_name -> set of callees
+    
+    for i, record in enumerate(group):
+        func_code = record.get('func', '')
+        func_name = extract_function_name(func_code)
+        if func_name:
+            func_names[i] = func_name
+            func_codes[func_name] = func_code
+            callees = extract_function_calls(func_code, func_name)
+            call_graph[func_name] = callees
+    
+    # Collect all function names in the group
+    group_funcs = set(func_names.values())
+    
+    # Keep only call relations between functions in the group
+    filtered_graph = {}
+    for caller, callees in call_graph.items():
+        filtered_callees = callees & group_funcs
+        filtered_graph[caller] = filtered_callees
+    
+    # Use DFS to compute the depth of the longest call chain
+    def dfs(func: str, visited: Set[str], path: List[str]) -> Tuple[int, List[str]]:
+        """
+        Find the longest call chain starting from func.
+        """
+        if func in visited:
+            return len(path), path.copy()
+        
+        visited.add(func)
+        path.append(func)
+        
+        max_depth = len(path)
+        max_path = path.copy()
+        
+        for callee in filtered_graph.get(func, []):
+            if callee not in visited:
+                depth, p = dfs(callee, visited, path)
+                if depth > max_depth:
+                    max_depth = depth
+                    max_path = p
+        
+        path.pop()
+        visited.remove(func)
+        
+        return max_depth, max_path
+    
+    # Try every function as a starting point to find the longest call chain
+    overall_max_depth = 1
+    overall_max_path = []
+    
+    for func_name in group_funcs:
+        depth, path = dfs(func_name, set(), [])
+        if depth > overall_max_depth:
+            overall_max_depth = depth
+            overall_max_path = path
+    
+    return overall_max_depth, overall_max_path
+
+
+def load_grouped_json(file_path: str) -> Dict:
+    """
+    Load a grouped JSON file.
+    """
+    with open(file_path, 'r', encoding='utf-8') as f:
+        return json.load(f)
+
+
+def filter_groups_by_depth(
+    groups: List[List[Dict]], 
+    min_depth: int = 1, 
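+    max_depth: float = float('inf')
+) -> Tuple[List[Dict], Dict[int, int]]:
+    """
+    Filter groups by call-chain depth.
+    
+    Args:
+        groups: all groups
+        min_depth: minimum depth (inclusive)
+        max_depth: maximum depth (inclusive)
+    
+    Returns:
+        (list of matching groups with depth information attached, depth distribution statistics)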
+    """
+    filtered_groups = []
+    depth_distribution = defaultdict(int)
+    
+    print("Analyzing call-chain depth...")
+    total = len(groups)
+    
+    for i, group in enumerate(groups):
+        if (i + 1) % 500 == 0:
+            print(f"  Progress: {i + 1}/{total}")
+        
+        depth, path = compute_call_depth(group)
+        depth_distribution[depth] += 1
+        
+        if min_depth <= depth <= max_depth:
+            # Attach the depth information to the group
+            group_with_info = {
+                "call_depth": depth,
+                "longest_call_chain": path,
+                "group_size": len(group),
+                "functions": group
+            }
+            filtered_groups.append(group_with_info)
+    
+    return filtered_groups, dict(depth_distribution)
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description='Filter function groups by call-chain depth',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Select groups with depth 3
+  python filter_by_call_depth.py -i output/grouped.json -d 3
+
+  # Select groups with depth between 2 and 5
+  python filter_by_call_depth.py -i output/grouped.json --min-depth 2 --max-depth 5
+
+  # Select groups with depth >= 4
+  python filter_by_call_depth.py -i output/grouped.json --min-depth 4
+        """
+    )
+    parser.add_argument(
+        '--input', '-i',
+        type=str,
+        required=True,
+        help='Path to the input grouped JSON file'
+    )
+    parser.add_argument(
+        '--output', '-o',
+        type=str,
+        default=None,
+        help='Path to the output JSON file (generated automatically by default)'
+    )
+    parser.add_argument(
+        '--depth', '-d',
+        type=int,
+        default=None,
+        help='Exact call-chain depth to match (overrides --min-depth/--max-depth when given)'
+    )
+    parser.add_argument(
+        '--min-depth',
+        type=int,
+        default=1,
+        help='Minimum call-chain depth (inclusive, default: 1)'
+    )
+    parser.add_argument(
+        '--max-depth',
+        type=int,
+        default=None,
+        help='Maximum call-chain depth (inclusive, unlimited by default)'
+    )
+    
+    args = parser.parse_args()
+    
+    # Resolve the depth arguments
+    if args.depth is not None:
+        min_depth = args.depth
+        max_depth = args.depth
+    else:
+        min_depth = args.min_depth
+        max_depth = args.max_depth if args.max_depth is not None else float('inf')
+    
+    # Set the default output path
+    if args.output is None:
+        base_name = os.path.splitext(os.path.basename(args.input))[0]
+        output_dir = os.path.dirname(args.input)
+        if max_depth == float('inf'):
+            depth_str = f"depth_{min_depth}+"
+        elif min_depth == max_depth:
+            depth_str = f"depth_{min_depth}"
+        else:
+            depth_str = f"depth_{min_depth}-{max_depth}"
+        args.output = os.path.join(output_dir, f'{base_name}_{depth_str}.json')
+    
+    # Load the data
+    print(f"Loading data: {args.input}")
+    data = load_grouped_json(args.input)
+    groups = data.get('groups', [])
+    print(f"Loaded {len(groups)} groups")
+    
+    # Filter
+    if max_depth == float('inf'):
+        print(f"\nFiltering groups with call-chain depth >= {min_depth}...")
+    elif min_depth == max_depth:
+        print(f"\nFiltering groups with call-chain depth = {min_depth}...")
+    else:
+        print(f"\nFiltering groups with call-chain depth between {min_depth} and {max_depth}...")
+    
+    filtered_groups, depth_distribution = filter_groups_by_depth(groups, min_depth, max_depth)
+    
+    # Statistics
+    print("\n==================== Statistics ====================")
+    print(f"Original groups: {len(groups)}")
+    print(f"Groups after filtering: {len(filtered_groups)}")
+    print(f"Total functions after filtering: {sum(g['group_size'] for g in filtered_groups)}")
+    
+    print("\nCall-chain depth distribution (all data):")
+    for depth in sorted(depth_distribution.keys()):
+        count = depth_distribution[depth]
+        pct = count / len(groups) * 100
+        # Comparing against float('inf') works directly, so no special case is needed
+        marker = " <--" if min_depth <= depth <= max_depth else ""
+        print(f"  Depth {depth}: {count} groups ({pct:.1f}%){marker}")
+    
+    if filtered_groups:
+        depths = [g['call_depth'] for g in filtered_groups]
+        print("\nFiltered result statistics:")
+        print(f"  Minimum depth: {min(depths)}")
+        print(f"  Maximum depth: {max(depths)}")
+        print(f"  Average depth: {sum(depths)/len(depths):.2f}")
+    print("====================================================")
+    
+    # Write the results
+    output_data = {
+        "metadata": {
+            "source_file": os.path.basename(args.input),
+            "filter_min_depth": min_depth,
+            "filter_max_depth": max_depth if max_depth != float('inf') else "unlimited",
+            "original_groups": len(groups),
+            "filtered_groups": len(filtered_groups),
+            "total_functions": sum(g['group_size'] for g in filtered_groups),
+            "depth_distribution": depth_distribution,
+        },
+        "groups": filtered_groups
+    }
+    
+    os.makedirs(os.path.dirname(args.output) if os.path.dirname(args.output) else '.', exist_ok=True)
+    with open(args.output, 'w', encoding='utf-8') as f:
+        json.dump(output_data, f, ensure_ascii=False, indent=2)
+    
+    print(f"\nResults saved to: {args.output}")
+
+
+if __name__ == '__main__':
+    main()
+
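Reviewer note: the module docstring above defines call-chain depth as the number of functions on a chain, so `caller -> callee` has depth 2. The sketch below shows that definition on a toy graph using an exhaustive DFS over simple paths. The function name `longest_chain` and the sample graph are made up for illustration; like any exhaustive path search, this can be exponential on large, dense groups.

```python
from typing import Dict, List, Set


def longest_chain(graph: Dict[str, Set[str]]) -> List[str]:
    """Return one longest simple call chain in a small call graph."""
    best: List[str] = []

    def dfs(node: str, path: List[str]) -> None:
        nonlocal best
        path.append(node)
        if len(path) > len(best):
            best = path.copy()
        # Sort callees so the result is deterministic across runs.
        for callee in sorted(graph.get(node, ())):
            # Skip functions already on the path (recursion cycles)
            # and callees that are not part of the group.
            if callee not in path and callee in graph:
                dfs(callee, path)
        path.pop()

    for start in sorted(graph):
        dfs(start, [])
    return best


if __name__ == "__main__":
    # "c" calls back into "a"; the cycle must not extend the chain.
    chain = longest_chain({"a": {"b"}, "b": {"c"}, "c": {"a"}, "d": set()})
    print(len(chain), " -> ".join(chain))  # prints: 3 a -> b -> c
```

The depth reported for a group is then simply `len(chain)`, which matches the convention used in the docstring.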

Some files are not shown because too many files changed in this diff