-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path4-path-sensitivity.tex
401 lines (355 loc) · 16.9 KB
/
4-path-sensitivity.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
% !TEX root = main.tex
\section{Path Sensitivity}
\subsection{Motivation}
As described previously, the dataflow library conflates all control flow paths
in a function and just reports possible flow through a function.
This is obviously an overapproximation of flow.
For example in~\autoref{lst:ex-ps-1} flow would be reported,
although it is quite obvious that no flow can happen at runtime,
because the to if conditionals evaluating \java{b} are mutually exclusive.
In one call of \java{f(b)}, only one of the blocks is executed.
Thus, no dataflow happens.
To address this issue,
it makes sense to add a notion of path-sensitivity to the dataflow library.
This path-sensitivity tracks that some path contexts make it possible to
filter out dataflow paths that will never occur at runtime,
because a path cannot visit dataflow nodes in mutually exclusive blocks.
As in~\autoref{sec:call-sens}, our goal is to not introduce new sources of
unsoundness, so a pruned path by our proposed algorithm
is never allowed to be a live path.
\begin{listing}[h]
\begin{javacode}
public void f(boolean b) {
Object o = null;
if (b) {
o = source();
}
if (!b) {
sink(o);
}
}
\end{javacode}
\caption{Simple example of path-sensitive dataflow}
\label{lst:ex-ps-1}
\end{listing}
The issue of path sensitivity is not just a theoretical issue, but of practical importance.
One analysis that runs on LGTM.com is the nullness analysis.
It warns when developers (potentially) dereference a pointer that is null.
Right now, it is built separately from the dataflow library and only works
intraprocedurally.
However, it could be implemented on top of the dataflow library.
The formulation of the nullness analysis as dataflow problem is:
Any literal \java{null} and any variable declaration of an object
without initialization is a source.
Any pointer dereference (i.e.\ field or method access on an object) is a sink.
Then any assignment to a variable with a non-null expression leads to flow to stop.
This approach would extend the nullness to be an interprocedural one, and furthermore
reduce the amount of code that needs to be maintained.
Unfortunately with the current dataflow library, the resulting analysis would
be way too imprecise to be feasible.
Often, the nullness-status of a variable is correlated with the value of a Boolean
variable, like in~\autoref{lst:ex-ps-nullness}.
% TODO bla
While having path-sensitivity as presented in this section would not be enough
to enable the precise analysis of~\autoref{lst:ex-ps-nullness},
having a concept that some dataflow nodes are only active in some program paths
is an important prerequisite to implement this analysis.
The reason that the presented analysis wouldn't work on that example is that
the proposed path-sensitivity focuses on identifying mutually exclusive
blocks like in~\autoref{lst:ex-ps-1}.
So the deduction of the fact that whenever \java{b} is \java{true},
then \java{o} is overwritten with something non-null is missing.
Adding this to the dataflow analysis would be another step needed to
implement a precise nullness analysis on top of it.
\begin{listing}[h]
\begin{javacode}
public void A(boolean b) {
SomeClass o = null;
if (b) {
o = new SomeClass();
o.method1();
}
// some more code
if (b) {
o.method2();
}
}
\end{javacode}
\caption{Nullness example that needs path-sensitivity}
\label{lst:ex-ps-nullness}
\end{listing}
The goal of this section is to describe our approach to add a notion of
path-sensitivity to the dataflow library.
Then we highlight where and why our approach introduced soundness issues
that made it unacceptable for Github to merge the code.
However, Github still benefits by learning more about the problems
concerning implementing path-sensitivity and that this approach does not work.
Furthermore, over the course of the project a lot of test cases containing
corner cases for path-sensitivity were contributed.
%TODO is there?
%At last, we propose a possible fix to our approach.
% TODO possible fixes?
%Because we ran out of time, it was not possible to implement
%and evaluate the fixed approach, though.
%\subsection{Theoretical Background}
%Is there any?
\subsection{Description of Implementation}
\subsubsection*{\texttt{Property}s and local flow filtering}
The first part of the implementation introduces the concept of a \texttt{Property}
to the local dataflow analysis.
A property is a fact that, given a Boolean, holds at one or more dataflow nodes.
Whenever a property holds with \java{true}, it is guaranteed that the dataflow nodes
at which the property holds with \java{false} will not be live.
The concept of a property abstracts away from dealing with boolean condition variables
directly. It is intended to be easily extended.
The intuition behind a property is that it abstracts that a given
dataflow node is guarded by a Boolean condition variable.
Thus, in~\autoref{lst:ex-ps-1} the property associated with the boolean variable \java{b}
holds with the value \java{true} for all dataflow nodes guarded by
\java{if (b)}, and it holds with the value \java{false} for all dataflow nodes
guarded by \java{if (!b)}.
It does not hold for other dataflow nodes in the program.
In the current implementation, the only inhabitant of the \texttt{Property}
type is the \texttt{BooleanSsaVarProperty}, that represents the concept of
a Boolean variable that guards dataflow nodes.
While there are no other subtypes of \texttt{Property} yet, an implementation
for enum constants instead of Boolean variables would be fairly easy to write.
With the property concept in place, the local flow
is filtered.
We only filter local flow to reduce the computational complexity of the problem,
as opposed to filter interprocedurally.
For any property that holds at any of the nodes of a local flow path,
the local flow path has to be compatible with that property, i.e.\ the same
property never holds with both \java{true} and \java{false} for that local flow path.
If there are no properties that hold along a local flow path, nothing is filtered.
Note that this approach to filtering fixes one property first, and then filters
for that, and then considers the next property independently of the first one.
Thus,~\autoref{lst:ps-diamond} is reported as having flow.
First, when considering the boolean \java{b1}, it is inferred that flow is only
possible if \java{b1} is \java{false}
(otherwise, the source is never part of the flow).
Then, independently of that it is inferred that flow is only possible if \java{b2}
is \java{false} (otherwise, the sink is never part of the flow).
However, the information about those properties are never combined to detect
that the assignment \java{y = x} can never happen while both reaching the
source and the sink.
\begin{listing}
\begin{javacode}
public void f(boolean b1, boolean b2) {
Object x = null;
if (!b1) {
x = source();
}
Object y = null;
if (b1) {
y = x;
}
if (b2) {
y = x;
}
if (!b2) {
sink(y);
}
}
\end{javacode}
\caption{Example of false positive because of a diamond-shaped control flow graph}
\label{lst:ps-diamond}
\end{listing}
A supposedly easy fix to correctly filter that false positive,
would be that the implementation considers a list of currently active
properties that need to be checked against at the same time.
For most programming languages, this is not a problem.
In QL however, lists are quite tricky.
They are not a first-class datatype, and because in QL all user-provided datatypes
are fully instantiated, they need to be fully constructible.
Thus, any "list" datatype would need to be fully instantiated, this means it would
need to list all possible lists ahead of time.
This makes lists very complicated to deal with, and often not very useful.
Thus, lists do not make for a quick fix for this issue.
\subsubsection*{The Boolean SSA Variable property}
Now that we have seen how properties are used, how exactly do they work internally?
Conceptually, we want to capture the fact that a boolean variable guards
control flow.
For that, we can rely on the guards library (as discussed shortly in~\autoref{sec:call-sens}).
Exposing all boolean variables as properties would incur a too high performance penalty,
though.
Because of that, we pre-filter the set of boolean variables used as guards.
We only allow Boolean variables that guard at least two different blocks,
with inverted conditions.
This means that the property is useful for pruning later.
Note that this implicitly assumes that pruning is always okay when we encounter
two blocks that cannot be live at the same time, because they read the same variable.
As we see in~\autoref{ssec:soundness}, this is unfortunately not true.
However, we still describe the proposed algorithm fully to present the insights
learned during this project.
The equality relation on Boolean variables used is equality on the
SSA form of the program - after assigning to a variable, it most likely does not
contain the same value as before the assignment, so we treat it as a different variable.
Thus, the two guards (for example, two if conditionals) need to read the same SSA
variable $b_1$ in their conditions, where one guard checks for $b_1$ to be \java{true},
and the other guard checks for $b_1$ to be \java{false}.
Then all the nodes that are guarded are candidates for dataflow pruning.
Note that this approach assumes that a SSA variable always evaluates
to the same value.
However, this is not true.
In~\autoref{lst:ps-ssa-var-different}, the variable \java{b} has one SSA definition
in line 4, but different values at run time, depending on the iteration.
Note that the SSA definition dominates all reads, thus no $\phi$-node is
needed.
We can conclude that relying that \java{b} is the same SSA variable
does not guarantee
that it always evaluates to the same value.
Without that guarantee, any pruning based on \java{b} would introduce
unsoundness into the algorithm.
\begin{listing}[h]
\begin{javacode}
public void f() {
int j = 0;
do {
boolean b = j % 2 == 0;
Object x = null;
if (b) {
x = source();
}
if (!b) {
sink(x);
}
j++;
} while (j < 100);
}
\end{javacode}
\caption{Example where the SSA variable \java{b} has different values at run time}
\label{lst:ps-ssa-var-different}
\end{listing}
An oversimplistic fix for this issue to require that the SSA definition of the
Boolean variable needs to be outside of any loops.
A loop is defined on the control flow graph ---
if the SSA definition viewed as a control flow graph node can reach itself by taking
arbitrary (function-local) steps on the control flow graph, it is contained in a loop.
This covers all cases where unsoundness occurs, but many more legitimate cases.
Furthermore, it would lead to the very strange behaviour that the dataflow analysis
does not report dataflow because of path pruning, but as soon as the code is
wrapped in a loop statement, false positives would start to appear.
This kind of behaviour is difficult to explain to the user, and also difficult
to justify theoretically.
Thus, a more involved solution is needed that caputures the
essence of the problem with higher precision.
The key insight is that every time before the SSA variable is re-defined, it
goes out of scope, so technically we have been treating different variables
as equal, just because they had the same name.
Thus, we need to ensure that whenever we prune dataflow based on Boolean variables,
all reads of the SSA variable refer to the SSA variable \textbf{in the same scope}.
Then, the assumption that the SSA variables have the same value is actually true.
As~\autoref{lst:ps-ssa-different-scopes} demonstrates,
the issue of whether the second read of \java{b} (outside of the loop)
refers always to the same definition of \java{b} that the first read of \java{b}
uses (inside the loop) is dependent on the details of the loop execution.
Thus, due to the halting problem, we cannot aim to find a precise solution to this.
However, we only need a conservative heuristic that tells us when
two reads will provably refer to the same definition of the SSA variable.
This approximation would not mark the two reads of \java{b}
in~\autoref{lst:ps-ssa-different-scopes} as referring to the same SSA definition.
We propose the following notion of that heuristic:
A SSA Boolean variable \java{b} with two reads is eligible for pruning
if all control flow paths from one read of \java{b} to the SSA definition of
\java{b} include the other read as node on that control flow path.
This ensures that everytime one read gets a value from \java{b}, the other read
also gets it.
Furthermore, it covers the case that \java{b} is defined in a loop (possibly
multiple times), but both reads happen after the loop (thus, the value of \java{b}
has "settled" before reading it).
The case that both reads of \java{b} are in the loop defining \java{b} is also
included.
The case that one read of \java{b} happens inside the loop, and one outside of
the loop (like in~\autoref{lst:ps-ssa-different-scopes}) is correctly covered -
there exists a control flow path from the first read of \java{b} to the SSA
definition of \java{b} that does \textbf{not} visit the second read of \java{b}.
Thus they might not get the same value of \java{b}, thus pruning is not safe
in this case.
\begin{listing}[h]
\begin{javacode}
public void f(int k) {
int j = 0;
boolean b;
Object x = null;
do {
b = j % 2 == 0;
if (b) {
x = source();
}
j++;
} while (j < k);
if (!b) {
sink(x);
}
}
\end{javacode}
\caption{An example where reads of
\java{b} might refer to different SSA definitions}
\label{lst:ps-ssa-different-scopes}
\end{listing}
In general, this approach would need a formal argument to support it.
This argument would show that every time this heuristic indicates that two
SSA reads read from
the same definition, they actually read the same value.
This would convince the reader that this is indeed the right approach
to tackle this problem.
However, as the whole proposed algorithm suffers soundness problems as described
in the next subsection, a formal justification is omitted.
\subsection{Soundness Problem}
\label{ssec:soundness}
Even with the quite complicated logic in place ensuring that only the same
instance of an SSA variable is read when pruning, unsoundness can occur.
In~\autoref{lst:ps-false-negative}, the reads of the SSA variable \java{b} clearly
fulfills all conditions described above, thus \java{b} and it two reads
are eligible to be used for dataflow pruning.
After pruning the example does not report dataflow.
However, in one loop iteration data flows from the source to \java{x},
and in another loop iteration the data flows from \java{x} to the sink.
Ergo pruning this dataflow is wrong.
This shows that the assumption that pruning dataflow is sound whenever two mutually
exclusive reads of a SSA variable (even when always having the same value!)
happen is wrong.
\begin{listing}[h]
\begin{javacode}
public void f() {
boolean[] boolArray = {true, false};
Object x = null;
for(boolean b: boolArray) {
if (b) {
x = source();
}
if (!b) {
sink(x);
}
}
}
\end{javacode}
\caption{Example of a false negative with the proposed algorithm}
\label{lst:ps-false-negative}
\end{listing}
An easy fix for this would be to return to the first proposed fix above and
only consider boolen variables for pruning that are defined outside of loops,
with all the associated drawbacks discussed above.
Because of those drawbacks, this is understandably
not an option that leads to an algorithm that Github wants to deploy.
However, for a more general sound algorithm, the example shows that it is not enough
to come up with the Boolean variables used for pruning dataflow without also
looking at the control flow path used by the dataflow path under consideration.
A better algorithm would need to operate on the control flow graph itself to
factor in that knowledge into the dataflow pruning.
This would allow it to see that data flows from the source to the sink in different
loop iterations.
With the current design of the dataflow algorithm, this is very difficult.
In the very first step of computing (possible) local dataflow, it already
only looks at possible reads of SSA variables, thus ignoring control flow structures
entirely.
As this is a lossy process, it is not possible to go back from an edge in the
dataflow graph and get the corresponding control flow path used by that edge.
This is intentional by design, to make the dataflow representation as sparse
as possible, and thus more performant.
Fixing this would require a major refactoring of the dataflow algorithm with
uncertain outcome.
This is, while unsatisfactory, clearly outside the allocated time budget
of a 15 ECTS project outside the course scope.
%\subsection{Improved Algorithm Proposal}