0%

南大《软件分析》02.Intermediate Representation

前言

本文是南京大学李樾、谭添老师的《软件分析》课程的笔记。内容来自课程ppt以及课程讲解。

课程主页:Static Program Analysis | Tai-e (pascal-lab.net)

授课视频:视频南京大学《软件分析》课程01(Introduction)_哔哩哔哩_bilibili

课程实验repo:pascal-lab/Tai-e-assignments: Tai-e assignments for static program analysis (github.com)

Contents

  1. Compilers and Static Analyzers
  2. AST vs. IR
  3. IR: Three-Address Code (3AC)
  4. 3AC in Real Static Analyzer
  5. Static Single Assignment (SSA)
  6. Basic Blocks (BB)
  7. Control Flow Graphs (CFG)

Compiler and Static Analyzers

AST vs. IR

AST:

  • high-level and closed to grammar structure
  • usually language dependent
  • suitable for fast type checking
  • lack of control flow information

IR:

  • low-level and closed to machine code
  • usually language independent
  • compact and uniform
  • contains control flow information
  • usually considered as the basis for static analysis

Intermediate Representation(IR)

3-Address Code(3AC)

There is at most one operator on the right side of an instruction

1
2
3
4
t = a + b + 3

t1 = a + b
t2 = t1 + 3

Each 3AC contains at most 3 addresses:

  • Name: a, b
  • Constant: 3
  • Compiler-generated temporary: t1, t2

Some Common 3AC Forms

  • x, y, z: addresses
  • bop: binary arithmetic or logical operation
  • uop: unary operation(minus, negation, casting)
  • L: a label to represent a program location
  • rop: relational operator(>,<,==,>=,<=, etc.)
  • goto L: unconditional jump
  • if … goto L: conditional jump
1
2
3
4
5
6
x = y bop z
x = uop y
x = y
goto L
if x goto L
if x rop y goto L

Soot and Its IR: Jimple

Soot: Most popular static analysis framework for Java

https://github.com/Sable/soot/wiki/Tutorials

https://github.com/Sable/soot

Soot’s IR is Jimple: typed 3-address code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
// Java Src
public class ForLoop3AC {
public static void main(String[] args) {
int x = 0;
for(int i = 0; i<10; i++) {
x = x + 1
}
}
}

// 3AC(jimple)
public static void main(java.lang.String[])
{
java.lang.String[] r0;
int i1;

r0 := @parameter0: java.lang.String[];
i1 = 0;

label1:
if i1 >= 10 goto label2;
i1 = i1 + 1;
goto label1;

label2:
return;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// Java Src
public class DoWhileLoop3AC {
public static void main(String[] args) {
int[] arr = new int[10];
int i = 0;
do {
i = i + 1;
} while (arr[i] < 10);
}
}

// 3AC(jimple)
public static void main(java.lang.String[])
{
java.lang.String[] r0;
int[] r1;
int $i0, i1;

r0 := @parameter0: java.lang.String[];
r1 = newarray (int)[10];
i1 = 0;

label1:
i1 = i1 + 1;
$i0 = r1[i1];
if $i0 < 10 goto label1;
return;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// Java Src
public class ForLoop3AC {
String foo(String para1, String para2) {
return para1 + " " + para2;
}
public static void main(String[] args) {
MethodCall3AC mc = new MethodCall3AC();
String result = mc.foo("hello", "world");
}
}

// 3AC(jimple)
java.lang.String foo(java.lang.String, java.lang.String)
{
nju.sa.examples.MethodCal13AC r0;
java.lang.String r1, r2, $r7;
java.lang.StringBuilder $r3, $r4, $r5, $r6;
r0 := @this: nju.sa.examples.MethodCall3AC;
r1 := @parameter0: java.lang.String;
r2 := @parameter1: java.lang.String;
$r3 = new java.lang.StringBuilder;

// method signature <>
specialinvoke $r3.<java.lang.StringBuilder: void <init>()>();
$r4 = virtualinvoke $r3.<java.lang.StringBuilder: java.lang.StringBuilder append(java.lang.String)>(r1);
$r5 = virtualinoke $r4.<java.lang.StringBuilder: java.lang.StringBuilder append java.lang.String)>(" ");
$r6 = virtualinvoke $r5.<java.lang.StringBuilder: java.lang.StringBuilder append(java.lang.String)>(r2);
$r7 = virtualinvoke $r6.<java.lang.StringBuilder: java.lang.String toString()>();
return $r7;
}
public static void main(java.lang.String[])
{
java.lang.String[] r0;
nju.sa.examples.MethodCall3AC $r3;

r0 := @parameter0: java.lang.String[];
$r3 = new nju.sa.example.MethodCall3AC;

specialinvoke $r3.<nju.sa.example.MethodCall3AC: void <init>()>();
virtualinvoke $r3.<nju.sa.example.MethodCall3AC: java.lang.String foo(java.lang.String, java.lang.String)>("hello", "world");

return;
}

invokespecial: call constructor, call superclass methods, call private methods

invokevirutal: instance methods call (virtual dispatch)

invokeinterface: cannot optimization,checking interface implementation

invokestatic: call static methods

Java 7: invokedynamic -> Java static typing , dynamic language runs on JVM

method signature: class name, return type, method name(parameter types)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
// Java Src
public class Class3AC {
public static final double pi = 3.14;
public static void main(String args[]) {

}
}

// 3AC(jimple)
public class nju.sa.examples.Class3AC extends java.lang.Object
{
public static final double pi;
public void <init>()
{
nju.sa.examples.Class3AC r0;
r0 := @this: nju.sa.examples.Class3AC;
specialinvoke r0.<java.lang.0bject: void <init>()>();
return;
}
public static void main(java.lang.String[])
{
java.lang.String[] r0;
r0 := @parameter0: java.lang.String[];
return;
}
}
public static void <clinit>()
{
<nju.sa.examples.Class3AC: double pi> = 3.14;
return;
}
}

Static Single Assignment(SSA)

  • All assignments in SSA are to variables with distinct names
    • Give each definition a fresh name
    • Propagate fresh name to subsequent uses
    • Every variable has exactly on definition
3AC SSA
p = a + b p1 = a + b
q = p - c q1 = p1 - c
p = q * d p2 = q1 * d
p = e - p p3 = e - p2
q = p + q q2 = p3 + q1

What if a variable use is at control flow merges?

  • A special merge operator,$\phi$ (called phi-function), is introduced to select the values at merge nodes
  • $\phi(x0, x1)$has the value x0 if the control flow passes through the true part of the conditional and the value x1 otherwise

Why SSA?

  • Flow information is indirectly incorporated into the unique variable names

    May help deliver some simpler analyses, e.g., flow-insensitive analysis gains partial precision of flow-sensitive analysis via SSA

  • Define-and-Use pairs are explicit

    Enable more effective data facts storage and propagation in some on-demand tasks

    Some optimization tasks perform better on SSA (e.g., conditional constant propagation, global value numbering)

Why not SSA?

  • may introduce too many variables and phi-functions
  • May introduce inefficiency problem when translating to machine code (due to copy operations)

Control Flow Analysis

  • Usually refer to building Control Flow Graph (CFG)
  • CFG serves as the basic structure for static analysis
  • The node in CFG can be an individual 3-address instruction, or (usually) a Basic Block (BB)

Basic Blocks(BB)

Basic block(BB) are maximal sequences of consecutive three-address instruction with theproperties that

  • It can be entered only at the beginning, i.e., the first instruction in the block
  • It can be exited only at the end, i.e., the last instruction in the block

Design the algorithm to build BBs

How to build Basic Blocks?

INPUT: A sequence of three-address instructions of P

OUTPUT: A list of basic blocks of P

METHOD:

  1. Determine the leaders in P
    • The first instruction in P is a leader
    • Any target instruction of a conditional or unconditional jump is a leader
    • Any instruction that immediately follows a conditional or unconditional jump is a leader
  2. Build BBs for P
    • A BB consists of a leader and all its subsequent instructions until the next leader

Example:

Input: 3AC of P

  1. x = input
  2. y = x - 1
  3. z = x * y
  4. if z < x goto (7)
  5. p = x / y
  6. q = p + y
  7. a = q
  8. b = x + a
  9. c = 2a - b
  10. if p == q goto 12
  11. goto 3
  12. return

Output: BBs of P

  1. Determine the leaders of P
    • (1) (first instruction)
    • (3) (7) (12) (target instruction of a conditional or unconditional jump)
    • (5) (11) (12) (instruction that follows a conditional or unconditional jump)
  2. Build BBs of P
    • B1 {(1), (2)}
    • B2 {(3), (4)}
    • B3 {(5), (6)}
    • B4 {(7), (8), (9), (10)}
    • B5 {(11)}
    • B6 {(12)}

Control Flow Graph(CFG)

  • The nodes of CFG are basic blocks
  • There is an edge from block A to block B if and only if
    • There is a conditional or unconditional jump from the end of A to the beginning of B
    • B immediately follows A in the original order of instructions an A does not end in an unconditional jump
  • It is normal to replace the jumps to instruction lables by jumps to basic blocks

So, we can get:

And Usually we add two nodes, Entry and Exit.

  • They do not correspond to executable IR
  • A edge from Entry to the BB containing the first instruction of IR
  • A edge to Exit from any BB containing an instruction that could be the last instruction of IR