前言
本文是南京大学李樾、谭添老师的《软件分析》课程的笔记。内容来自课程ppt以及课程讲解。
课程主页:Static Program Analysis | Tai-e (pascal-lab.net)
授课视频:视频南京大学《软件分析》课程01(Introduction)_哔哩哔哩_bilibili
课程实验repo:pascal-lab/Tai-e-assignments: Tai-e assignments for static program analysis (github.com)
Contents
- Compilers and Static Analyzers
- AST vs. IR
- IR: Three-Address Code (3AC)
- 3AC in Real Static Analyzer
- Static Single Assignment (SSA)
- Basic Blocks (BB)
- Control Flow Graphs (CFG)
Compiler and Static Analyzers
AST vs. IR
AST:
- high-level and closed to grammar structure
- usually language dependent
- suitable for fast type checking
- lack of control flow information
IR:
- low-level and closed to machine code
- usually language independent
- compact and uniform
- contains control flow information
- usually considered as the basis for static analysis
Intermediate Representation(IR)
3-Address Code(3AC)
There is at most one operator on the right side of an instruction
1 | t = a + b + 3 |
Each 3AC contains at most 3 addresses:
- Name: a, b
- Constant: 3
- Compiler-generated temporary: t1, t2
Some Common 3AC Forms
- x, y, z: addresses
- bop: binary arithmetic or logical operation
- uop: unary operation(minus, negation, casting)
- L: a label to represent a program location
- rop: relational operator(>,<,==,>=,<=, etc.)
- goto L: unconditional jump
- if … goto L: conditional jump
1 | x = y bop z |
Soot and Its IR: Jimple
Soot: Most popular static analysis framework for Java
Soot’s IR is Jimple: typed 3-address code
1 | // Java Src |
1 | // Java Src |
1 | // Java Src |
invokespecial: call constructor, call superclass methods, call private methods
invokevirutal: instance methods call (virtual dispatch)
invokeinterface: cannot optimization,checking interface implementation
invokestatic: call static methods
Java 7: invokedynamic -> Java static typing , dynamic language runs on JVM
method signature: class name, return type, method name(parameter types)
1 | // Java Src |
Static Single Assignment(SSA)
- All assignments in SSA are to variables with distinct names
- Give each definition a fresh name
- Propagate fresh name to subsequent uses
- Every variable has exactly on definition
3AC | SSA |
---|---|
p = a + b | p1 = a + b |
q = p - c | q1 = p1 - c |
p = q * d | p2 = q1 * d |
p = e - p | p3 = e - p2 |
q = p + q | q2 = p3 + q1 |
What if a variable use is at control flow merges?
- A special merge operator,$\phi$ (called phi-function), is introduced to select the values at merge nodes
- $\phi(x0, x1)$has the value x0 if the control flow passes through the true part of the conditional and the value x1 otherwise
Why SSA?
Flow information is indirectly incorporated into the unique variable names
May help deliver some simpler analyses, e.g., flow-insensitive analysis gains partial precision of flow-sensitive analysis via SSA
Define-and-Use pairs are explicit
Enable more effective data facts storage and propagation in some on-demand tasks
Some optimization tasks perform better on SSA (e.g., conditional constant propagation, global value numbering)
Why not SSA?
- may introduce too many variables and phi-functions
- May introduce inefficiency problem when translating to machine code (due to copy operations)
Control Flow Analysis
- Usually refer to building Control Flow Graph (CFG)
- CFG serves as the basic structure for static analysis
- The node in CFG can be an individual 3-address instruction, or (usually) a Basic Block (BB)
Basic Blocks(BB)
Basic block(BB) are maximal sequences of consecutive three-address instruction with theproperties that
- It can be entered only at the beginning, i.e., the first instruction in the block
- It can be exited only at the end, i.e., the last instruction in the block
Design the algorithm to build BBs
How to build Basic Blocks?
INPUT: A sequence of three-address instructions of P
OUTPUT: A list of basic blocks of P
METHOD:
- Determine the leaders in P
- The first instruction in P is a leader
- Any target instruction of a conditional or unconditional jump is a leader
- Any instruction that immediately follows a conditional or unconditional jump is a leader
- Build BBs for P
- A BB consists of a leader and all its subsequent instructions until the next leader
Example:
Input: 3AC of P
- x = input
- y = x - 1
- z = x * y
- if z < x goto (7)
- p = x / y
- q = p + y
- a = q
- b = x + a
- c = 2a - b
- if p == q goto 12
- goto 3
- return
Output: BBs of P
- Determine the leaders of P
- (1) (first instruction)
- (3) (7) (12) (target instruction of a conditional or unconditional jump)
- (5) (11) (12) (instruction that follows a conditional or unconditional jump)
- Build BBs of P
- B1 {(1), (2)}
- B2 {(3), (4)}
- B3 {(5), (6)}
- B4 {(7), (8), (9), (10)}
- B5 {(11)}
- B6 {(12)}
Control Flow Graph(CFG)
- The nodes of CFG are basic blocks
- There is an edge from block A to block B if and only if
- There is a conditional or unconditional jump from the end of A to the beginning of B
- B immediately follows A in the original order of instructions an A does not end in an unconditional jump
- It is normal to replace the jumps to instruction lables by jumps to basic blocks
So, we can get:
And Usually we add two nodes, Entry and Exit.
- They do not correspond to executable IR
- A edge from Entry to the BB containing the first instruction of IR
- A edge to Exit from any BB containing an instruction that could be the last instruction of IR