Packages

package root

The Scala compiler and reflection APIs.

Definition Classes: root

package scala

Definition Classes: root

package tools

Definition Classes: scala

package nsc

Definition Classes: tools

package backend

Definition Classes: nsc

package jvm

Definition Classes: backend

package analysis

Summary on the ASM analyzer framework

Value

Abstract, needs to be implemented for each analysis.
Represents the desired information about local variables and stack values, for example:
- Is this value known to be null / not null?
- What are the instructions that could potentially have produced this value?

Interpreter

Abstract, needs to be implemented for each analysis. Sometimes one can subclass an existing interpreter, e.g., SourceInterpreter or BasicInterpreter.
Multiple abstract methods that receive an instruction and the instruction's input values, and return a value representing the result of that instruction.
- Note: due to control flow, the interpreter can be invoked multiple times for the same instruction, until reaching a fixed point.
Abstract merge function that computes the least upper bound of two values. Used by Frame.merge (see below).
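
To make the role of the merge function concrete, here is a minimal sketch of a value lattice for nullness and its least upper bound. The names `Nullness`, `IsNull`, `NotNull` and `Unknown` are illustrative assumptions, not the compiler's own types.

```scala
// Toy nullness lattice (hypothetical names, not the compiler's value classes).
sealed trait Nullness
case object IsNull extends Nullness
case object NotNull extends Nullness
case object Unknown extends Nullness

object Nullness {
  // Least upper bound: identical facts are kept, conflicting facts widen to Unknown.
  def merge(a: Nullness, b: Nullness): Nullness = if (a == b) a else Unknown
}
```

Merging can only lose precision, never gain it, which is what lets the analysis reach a fixed point.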

Frame

Can be used directly for many analyses, no subclass required.
Every frame has an array of values: one for each local variable and for each stack slot.
- A top index stores the index of the current stack top
- NOTE: for a size-2 local variable at index i, the local variable at i+1 is set to an empty value. However, for a size-2 value at index i on the stack, the value at i+1 holds the next stack value. IMPORTANT: this is only the case in ASM's analysis framework, not in bytecode. See comment below.
Defines the execute(instruction) method.
- executing mutates the state of the frame according to the effect of the instruction
  - pop consumed values from the stack
  - pass them to the interpreter together with the instruction
  - if applicable, push the resulting value on the stack
Defines the merge(otherFrame) method
- called by the analyzer when multiple control flow paths lead to an instruction
  - the frame at the branching instruction is merged into the current frame of the instruction (held by the analyzer)
  - mutates the values of the current frame, merges all values using interpreter.merge.
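
The execute and merge steps can be sketched with a small immutable model. This is only an illustration: `ToyFrame` and its methods are assumed names, and unlike ASM's Frame it returns new frames instead of mutating in place.

```scala
// Toy frame: abstract values of type V for locals and stack (hypothetical model only).
final case class ToyFrame[V](locals: Vector[V], stack: List[V]) {

  // Execute a binary instruction: pop two inputs, let the interpreter
  // function compute the result value, push the result.
  def executeBinary(interpret: (V, V) => V): ToyFrame[V] = stack match {
    case b :: a :: rest => copy(stack = interpret(a, b) :: rest)
    case _              => throw new IllegalStateException("stack underflow")
  }

  // Merge slot by slot using the interpreter's merge function.
  // Both frames must have the same number of locals and stack values.
  def merge(other: ToyFrame[V])(mergeValue: (V, V) => V): ToyFrame[V] = {
    require(locals.size == other.locals.size && stack.size == other.stack.size)
    ToyFrame(
      locals.zip(other.locals).map(mergeValue.tupled),
      stack.zip(other.stack).map(mergeValue.tupled))
  }
}
```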

Analyzer

Stores a frame for each instruction
merge function takes an instruction and a frame, merges the existing frame for that instr (from the frames array) with the new frame passed as argument. if the frame changed, puts the instruction on the work queue (fixpoint).
initial frame: initialized for first instr by calling interpreter.new[...]Value for each slot (locals and params), stored in frames[firstInstr] by calling merge
work queue of instructions (queue array, top index for next instruction to analyze)
analyze(method): simulate control flow. while work queue non-empty:
- copy the state of frames[instr] into a local frame current
- call current.execute(instr, interpreter), mutating the current frame
- if it's a branching instruction
  - for all potential destination instructions
    - merge the destination instruction frame with the current frame (this enqueues the destination instr if its frame changed)
  - invoke newControlFlowEdge (see below)
the analyzer also tracks active exception handlers at each instruction
the empty method newControlFlowEdge can be overridden to track control flow if required
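
The analysis loop described above can be sketched as a generic worklist algorithm. This is a simplified model under stated assumptions: `Worklist.analyze`, `succ`, `transfer` and `join` are hypothetical names, `transfer` stands in for Frame.execute and `join` for Frame.merge. It uses a FIFO queue for readability, exception handlers are ignored, and instruction 0 is taken as the entry point.

```scala
import scala.collection.mutable

object Worklist {
  // Returns the frame computed for each instruction (None if unreachable).
  def analyze[F](
      numInsns: Int,
      succ: Int => Seq[Int],
      initial: F,
      transfer: (Int, F) => F,
      join: (F, F) => F): Array[Option[F]] = {
    val frames = Array.fill[Option[F]](numInsns)(None)
    val queue = mutable.Queue.empty[Int]

    // Merge `incoming` into the frame stored for `insn`; enqueue `insn` if its frame changed.
    def merge(insn: Int, incoming: F): Unit = {
      val updated = frames(insn) match {
        case None      => incoming
        case Some(old) => join(old, incoming)
      }
      if (!frames(insn).contains(updated)) {
        frames(insn) = Some(updated)
        queue.enqueue(insn)
      }
    }

    merge(0, initial)
    while (queue.nonEmpty) {
      val insn = queue.dequeue()
      // Work on a copy of the stored frame, then propagate it to all successors.
      val current = transfer(insn, frames(insn).get)
      succ(insn).foreach(merge(_, current))
    }
    frames
  }
}
```

For a diamond-shaped control flow graph (0 branches to 1 and 2, both continue at 3), the frame at instruction 3 is the join of the frames arriving on both paths.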

MaxLocals and MaxStack

At the JVM level, long and double values occupy two slots, both as local variables and on the stack, as specified in the JVM spec 2.6.2: "At any point in time, an operand stack has an associated depth, where a value of type long or double contributes two units to the depth and a value of any other type contributes one unit."

For example, a method class A { def f(a: Long, b: Long) = a + b } has MAXSTACK=4 in the classfile. This value is computed by the ClassWriter / MethodWriter when generating the classfile (we always pass COMPUTE_MAXS to the ClassWriter).

For running an ASM Analyzer, long and double values occupy two local variable slots, but only a single slot on the operand stack, as shown by the following snippet:

  import scala.tools.nsc.backend.jvm._
  import scala.tools.nsc.backend.jvm.opt.BytecodeUtils._
  import scala.collection.convert.decorateAsScala._
  import scala.tools.asm.tree.analysis._

  val cn = AsmUtils.readClass("/Users/luc/scala/scala/sandbox/A.class")
  val m = cn.methods.iterator.asScala.find(_.name == "f").head

  // the value is read from the classfile, so it's 4
  println(s"maxLocals: ${m.maxLocals}, maxStack: ${m.maxStack}") // maxLocals: 5, maxStack: 4

  // we can safely set it to 2 for running the analyzer.
  m.maxStack = 2

  val a = new Analyzer(new BasicInterpreter)
  a.analyze(cn.name, m)
  val addInsn = m.instructions.iterator.asScala.find(_.getOpcode == 97).get // LADD Opcode
  val addFrame = a.frameAt(addInsn, m)

  addFrame.getStackSize // 2: the two long values only take one slot each
  addFrame.getLocals    // 5: this takes one slot, the two long parameters take 2 slots each

While running the optimizer, we need to make sure that the maxStack value of a method is large enough for running an ASM analyzer. We don't need to worry if the value is incorrect from the JVM's perspective: the value will be re-computed and overwritten in the ClassWriter.
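
The two ways of counting slots can be compared with a small helper. This is an illustrative sketch: the object `Slots` and its methods are assumed names, and values are identified by their JVM type descriptors.

```scala
object Slots {
  // JVM size of a value by type descriptor: long (J) and double (D) take two slots.
  def jvmSize(desc: String): Int = if (desc == "J" || desc == "D") 2 else 1

  // Depth as the JVM counts it (relevant for MAXSTACK and for local variable indices).
  def jvmDepth(values: Seq[String]): Int = values.map(jvmSize).sum

  // Stack size as an ASM analyzer frame counts it: one entry per value.
  def analyzerStackSize(stack: Seq[String]): Int = stack.size
}
```

For `f(a: Long, b: Long)`, the operand stack before LADD holds two longs: depth 4 for the JVM, size 2 in the analyzer frame. The locals (`this`, `a`, `b`) take 5 slots in both views.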

Lessons learnt while benchmarking the alias tracking analysis

Profiling

Use YourKit for finding hotspots (CPU profiling). When drilling down into the details of a hotspot, don't pay too much attention to the percentages / time counts.
Should also try other profilers.
Use timers. When a method showed up as a hotspot, I added a timer around that method, and a second one within the method to measure specific parts. The timers slow things down, but the relative numbers show what parts of a method are slow.
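
A minimal sketch of such nested timers, assuming a simple accumulating helper. `Timers`, `timed` and `report` are hypothetical names, not part of the compiler.

```scala
import scala.collection.mutable

object Timers {
  // Accumulated nanoseconds per label, in insertion order.
  private val totals = mutable.LinkedHashMap.empty[String, Long]

  // Evaluate `body` and add its elapsed time to the total for `label`.
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    try body
    finally {
      totals(label) = totals.getOrElse(label, 0L) + (System.nanoTime() - start)
    }
  }

  def report: Map[String, Long] = totals.toMap
}
```

Wrapping a whole method in one timer and a part of it in a second one gives the relative share of that part, even though the absolute numbers are inflated by the measurement itself.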

ASM analyzer insights

The time for running an analysis depends on the number of locals and the number of instructions. Reducing the number of locals speeds up the analysis: there are fewer values to merge when merging two frames. See also https://github.com/scala/scala-dev/issues/47
The common hot spot of an ASM analysis is Frame.merge, for example in producers / consumers.
For nullness analysis the time is spent as follows
- 20% merging nullness values. this is as expected: for example, the same absolute amount of time is spent in merging BasicValues when running a BasicInterpreter.
- 50% merging alias sets. I tried to optimize what I could out of this.
- 20% is spent creating new frames from existing ones, see comment on AliasingFrame.init.
The implementation of Frame.merge (the main hot spot) contains a megamorphic callsite to interpreter.merge. This can be observed easily by comparing a test program that runs only a BasicValue analysis with one that first runs a nullness analysis and then a BasicValue analysis. In one example, the time for the BasicValue analysis goes from 519ms to 1963ms, a 3.8x slowdown.
I added counters to the Frame.merge methods for nullness and BasicValue analysis. In the examples I benchmarked, the number of merge invocations was always exactly the same. It would probably be possible to come up with an example where alias set merging forces additional analysis rounds until reaching the fixpoint, but I did not observe such cases.

To benchmark an analysis, instead of benchmarking analysis while it runs in the compiler backend, one can easily run it from a separate program (or the repl). The bytecode to analyze can simply be parsed from a classfile. See example at the end of this comment.

Nullness Analysis in Miguel's Optimizer

Miguel implemented alias tracking for nullness analysis differently [1]. Remember that every frame has an array of values. Miguel's idea was to represent aliasing using reference equality in the values array: if two entries in the array point to the same value object, the two entries are aliases in the frame of the given instruction.

While this idea seems elegant at first sight, Miguel's implementation does not merge frames correctly when it comes to aliasing. Assume in frame 1, values (a, b, c) are aliases, while in frame 2 (a, b) are aliases. When merging the second into the first, we have to make sure that c is removed as an alias of (a, b).

It would be possible to implement correct alias set merging in Miguel's approach. However, frame merging is the main hot spot of analysis. The computational complexity of implementing alias set merging by traversing the values array and comparing references is too high. The concrete alias set representation that is used in the current implementation (see class AliasingFrame) makes alias set merging more efficient.
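
The merge requirement from the (a, b, c) example can be illustrated as follows. This sketch is not the representation used by AliasingFrame: `AliasMerge` and its methods are assumed names, and each slot simply maps to the set of slots it aliases (including itself). Two slots stay aliases after a merge only if they are aliases in both frames.

```scala
object AliasMerge {
  type Aliases = Map[Int, Set[Int]]

  // Build the alias map for `numSlots` slots from the given alias classes;
  // slots not mentioned in any class only alias themselves.
  def fromClasses(numSlots: Int, classes: Set[Int]*): Aliases =
    (0 until numSlots).map { i =>
      i -> classes.find(_.contains(i)).getOrElse(Set(i))
    }.toMap

  // Merge by intersecting the alias class of every slot.
  def merge(a: Aliases, b: Aliases): Aliases =
    a.map { case (slot, as) => slot -> as.intersect(b.getOrElse(slot, Set(slot))) }
}
```

With slots a = 0, b = 1, c = 2, merging a frame where (a, b) are aliases into one where (a, b, c) are aliases leaves (a, b) as aliases and c on its own.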

[1] https://github.com/scala-opt/scala/blob/opt/rebase/src/compiler/scala/tools/nsc/backend/bcode/NullnessPropagator.java

Complexity and scaling of analysis

The time complexity of a data flow analysis depends on:

The size of the method. The complexity factor is linear (assuming the number of locals and branching instructions remains constant). The main analysis loop runs through all instructions of a method once. Instructions are only re-enqueued if a control flow merge changes the frame at some instruction.
The branching instructions. When a second (third, ..) control flow edge arrives at an instruction, the existing frame at the instruction is merged with the one computed on the new branch. If the merge function changes the existing frame, the instruction is enqueued for another analysis. This results in a merge operation for the successors of the instruction.
The number of local variables. The hot spot of analysis is frame merging. The merge function iterates through the values in the frame (locals and stack values) and merges them.

I measured the running time of an analysis for two examples:

Keep the number of locals and branching instructions constant, increase the number of instructions. The running time grows linearly with the method size.
Increase the size and number of locals in a method. The method size and number of locals grow at the same pace. Here, the running time increase is polynomial. It looks like the complexity is #instructions * #locals^2 (see below).

I measured nullness analysis (which tracks aliases) and a SimpleValue analysis. Nullness runs roughly 5x slower (because of alias tracking) at every problem size - this factor doesn't change.

The numbers below are for nullness. Note that the last column is constant, i.e., the running time is proportional to #ins * #loc^2. Therefore we use this factor when limiting the maximal method size for running an analysis.

#insns   #locals   time (ms)   time / (#ins * #loc^2) * 10^6
 1305      156         34      1.07
 2610      311        165      0.65
 3915      466        490      0.57
 5220      621       1200      0.59
 6525      776       2220      0.56
 7830      931       3830      0.56
 9135     1086       6570      0.60
10440     1241       9700      0.60
11745     1396      13800      0.60

As a second experiment, nullness analysis was run with varying #insns but constant #locals. The last column shows linear complexity with respect to the method size (linearOffset = 2279):

#insns   #locals   time (ms)   (time + linearOffset) / #insns
 5220      621       1090      0.645
 6224      621       1690      0.637
 7226      621       2280      0.630
 8228      621       2870      0.625
 9230      621       3530      0.629
10232      621       4130      0.626
11234      621       4770      0.627
12236      621       5520      0.637
13238      621       6170      0.638

When running a BasicValue analysis, the complexity observation is the same (time is proportional to #ins * #loc^2).
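
The normalized columns in the two tables can be recomputed from the raw measurements. The helper below is a small sketch with assumed names (`Scaling`, `quadraticFactor`, `linearFactor`). It uses Double arithmetic because #insns * #locals^2 exceeds the Int range for the larger rows.

```scala
object Scaling {
  // time / (#insns * #locals^2) * 10^6, rounded to two decimals.
  def quadraticFactor(insns: Int, locals: Int, timeMs: Double): Double = {
    val raw = timeMs / (insns.toDouble * locals * locals) * 1e6
    math.round(raw * 100) / 100.0
  }

  // (time + linearOffset) / #insns, rounded to three decimals.
  def linearFactor(insns: Int, timeMs: Double, linearOffset: Double = 2279): Double = {
    val raw = (timeMs + linearOffset) / insns
    math.round(raw * 1000) / 1000.0
  }
}
```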

Measuring analysis execution time

See code below.
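
The original measurement code is not reproduced on this page. As a stand-in, here is a generic sketch of how a computation such as an analysis run could be timed outside the compiler, with warm-up runs before measuring. `Bench`, `measure` and `median` are hypothetical names, and the analysis itself would be passed in as the `body` block.

```scala
object Bench {
  // Run `body` `warmup` times unmeasured, then `runs` times measured.
  // Returns the elapsed time of each measured run in milliseconds.
  def measure[A](warmup: Int, runs: Int)(body: => A): Seq[Double] = {
    (0 until warmup).foreach(_ => body)
    (0 until runs).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e6
    }
  }

  // Median of the measured times; less sensitive to outliers than the mean.
  def median(xs: Seq[Double]): Double = {
    val s = xs.sorted
    if (s.size % 2 == 1) s(s.size / 2)
    else (s(s.size / 2 - 1) + s(s.size / 2)) / 2
  }
}
```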

Definition Classes: jvm

package opt

Definition Classes: jvm

scala.tools.nsc.backend.jvm

opt

package opt

Type Members

abstract class BoxUnbox extends AnyRef
abstract class ByteCodeRepository extends AnyRef
The ByteCodeRepository provides utilities to read the bytecode of classfiles from the compilation classpath. Parsed classes are cached in the classes map.
abstract class CallGraph extends AnyRef
abstract class ClosureOptimizer extends AnyRef
abstract class CopyProp extends AnyRef
case class InlineInfoAttribute(inlineInfo: InlineInfo) extends Attribute with Product with Serializable
This attribute stores the InlineInfo for a ClassBType as an independent classfile attribute. The compiler does so for every class being compiled.
The reason is that a precise InlineInfo can only be obtained if the symbol for a class is available. For example, we need to know if a method is final in Scala's terms, or if it has the @inline annotation. Looking up a class symbol for a given class filename is brittle (name-mangling).
The attribute is also helpful for inlining mixin methods. The mixin phase only adds mixin method symbols to classes that are being compiled. For all other class symbols, there are no mixin members. However, the inliner requires an InlineInfo for inlining mixin members. That problem is solved by reading the InlineInfo from this attribute.
In principle we could encode the InlineInfo into a Java annotation (instead of a classfile attribute). However, an attribute allows us to save many bits. In particular, note that the strings in an InlineInfo are serialized as references to constants in the constant pool, and those strings (method names, method signatures) would exist in there anyway. So the ScalaInlineAttribute remains relatively compact.
abstract class Inliner extends AnyRef
abstract class InlinerHeuristics extends PerRunInit
case class LabelNotLive(label: LabelNode) extends RemovePairDependency with Product with Serializable
abstract class LocalOpt extends AnyRef
Optimizations within a single method. Certain optimizations enable others, for example removing unreachable code can render a try block empty and enable removeEmptyExceptionHandlers. The latter in turn enables more unreachable code to be eliminated (the catch block), so there is a cyclic dependency. Optimizations that depend on each other are therefore executed in a loop until reaching a fixpoint.
The optimizations marked UPSTREAM enable optimizations that were already executed, so they cause another iteration in the fixpoint loop.
nullness optimizations: rewrite null-checking branches to GOTO if nullness is known
+ enables downstream:
  - unreachable code (null / non-null branch becomes unreachable)
  - box-unbox elimination (may render an escaping consumer of a box unreachable)
  - stale stores (aload x is replaced by aconst_null if it's known null)
  - simplify jumps (replaces conditional jumps by goto, so may enable goto chains)

unreachable code / DCE (removes instructions of basic blocks to which there is no branch)
+ enables downstream:
  - stale stores (loads may be eliminated, removing consumers of a store)
  - empty handlers (try blocks may become empty)
  - simplify jumps (goto l; [dead code]; l: ..) => remove goto
  - stale local variable descriptors
  - (not box-unbox, which is implemented using prod-cons, so it doesn't consider dead code)

Note that eliminating empty handlers and stale local variable descriptors is required for correctness, see the comment in the body of methodOptimizations.

box-unbox elimination (eliminates box-unbox pairs within the same method)
+ enables UPSTREAM:
  - nullness optimizations (a box extraction operation (unknown nullness) may be rewritten to a read of a non-null local. example in doc comment of box-unbox implementation)
  - further box-unbox elimination (e.g. an Integer stored in a Tuple; eliminating the tuple may enable eliminating the Integer)
+ enables downstream:
  - copy propagation (new locals are introduced, may be aliases of existing)
  - stale stores (multi-value boxes where not all values are used)
  - redundant casts (("a", "b")._1: the generic _1 method returns Object, a cast to String is added. The cast is redundant after eliminating the tuple.)
  - empty local variable descriptors (local variables that were holding the box may become unused)
  - push-pop (due to artifacts of eliminating runtime type tests on primitives)

copy propagation (replaces LOAD n by LOAD m for the smallest m that is an alias of n)
+ enables downstream:
  - stale stores (a stored value may not be loaded anymore)
  - store-load pairs (a load n may now be right after a store n)
+ NOTE: copy propagation is only executed once, in the first fixpoint loop iteration. None of the other optimizations enables further copy prop. We still run it as part of the loop because it requires unreachable code to be eliminated.

stale stores (replace STORE by POP)
+ enables downstream:
  - push-pop (the new pop may be the single consumer for an instruction)

redundant casts: eliminates casts that are statically known to succeed (uses type propagation)
+ enables UPSTREAM:
  - box-unbox elimination (a removed checkcast may be a box consumer)
+ enables downstream:
  - push-pop for closure allocation elimination (every indyLambda is followed by a checkcast, see scala/bug#9540)

push-pop (when a POP is the only consumer of a value, remove the POP and its producer)
+ enables UPSTREAM:
  - stale stores (if a LOAD is removed, a corresponding STORE may become stale)
  - box-unbox elimination (push-pop may eliminate a closure allocation, rendering a captured box non-escaping)
+ enables downstream:
  - store-load pairs (a variable may become non-live)
  - stale handlers (push-pop removes code)
  - simplify jumps (push-pop removes code)

store-load pairs (remove STORE x; LOAD x if x is otherwise not used in the method)
+ enables downstream:
  - empty handlers (code is removed, a try block may become empty)
  - simplify jumps (code is removed, a goto may become redundant for example)
  - stale local variable descriptors

empty handlers (removes exception handlers whose try block is empty)
+ enables UPSTREAM:
  - unreachable code (catch block becomes unreachable)
  - box-unbox (a box may escape in an operation in a dead handler)
+ enables downstream:
  - simplify jumps

simplify jumps (various, like GOTO l; l: ..., see doc comments of individual optimizations)
+ enables UPSTREAM:
  - unreachable code (GOTO a; a: GOTO b; b: ..., the first jump is changed to GOTO b, the second becomes unreachable)
  - store-load pairs (a GOTO l; l: ... is removed between store and load)
  - push-pop (IFNULL l; l: ... is replaced by POP)

The following cleanup optimizations don't enable any upstream optimizations, so they can be executed once at the end, when the above optimizations reach a fixpoint.
empty local variable descriptors (removes unused variables from the local variable table)
empty line numbers (eliminates line number nodes that describe no executable instructions)
At this point, we used to filter out redundant label nodes (sequences of labels without any executable instructions in between). However, this operation is relatively expensive, and unnecessary: labels don't exist in the classfile, they are lowered to bytecode offsets, so redundant labels disappear by design.
Note on a method's maxLocals / maxStack: the backend only uses those values for running Analyzers. The values can be conservative approximations: if an optimization removes code and the maximal stack size is now smaller, the larger maxStack value will still work fine for running an Analyzer (just that frames allocate more space than required). The correct max values written to the bytecode are re-computed during classfile serialization. To keep things simpler, we don't update the max values in every optimization:
- we do it in removeUnreachableCodeImpl, because it's quite straightforward
- maxLocals is updated in compactLocalVariables, which runs at the end of method optimizations
Note on updating the call graph: whenever an optimization eliminates a callsite or a closure instantiation, we eliminate the corresponding entry from the call graph.
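
The fixpoint loop described for LocalOpt can be illustrated with a toy model. Everything here is an assumption made for illustration: instructions are plain strings, and `staleStores` and `pushPop` are drastically simplified stand-ins for the real optimizations. The example shows push-pop enabling the upstream stale-stores pass, which forces another loop iteration.

```scala
object FixpointLoop {
  type Code = Vector[String]

  // Stale stores: replace `STORE x` by `POP` when no `LOAD x` remains in the method.
  val staleStores: Code => Code = code => code.map {
    case insn if insn.startsWith("STORE ") && !code.contains("LOAD " + insn.drop(6)) => "POP"
    case insn => insn
  }

  // Push-pop: remove the first producer (PUSH or LOAD) that is directly followed by POP.
  val pushPop: Code => Code = code => {
    def isProducer(s: String) = s.startsWith("PUSH ") || s.startsWith("LOAD ")
    val idx = code.indices.find { i =>
      i + 1 < code.size && isProducer(code(i)) && code(i + 1) == "POP"
    }
    idx.fold(code)(i => code.patch(i, Nil, 2))
  }

  // Run all passes in order and repeat as long as any pass changed the code.
  // Returns the optimized code and the number of loop iterations.
  def optimize(code: Code, passes: Seq[Code => Code]): (Code, Int) = {
    var current = code
    var rounds = 0
    var changed = true
    while (changed) {
      val next = passes.foldLeft(current)((c, pass) => pass(c))
      changed = next != current
      current = next
      rounds += 1
    }
    (current, rounds)
  }
}
```

For the input `PUSH 1; STORE x; LOAD x; POP; RETURN`, the first iteration removes `LOAD x; POP`, the second turns the now stale `STORE x` into a `POP` and removes it together with `PUSH 1`, and the third detects that nothing changes anymore.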
case class RemovePair(store: VarInsnNode, other: AbstractInsnNode, depends: List[RemovePairDependency]) extends RemovePairDependency with Product with Serializable
trait RemovePairDependency extends AnyRef

Value Members

object BytecodeUtils
object ClosureOptimizer
object InlineInfoAttribute extends Serializable
object InlineInfoAttributePrototype extends InlineInfoAttribute
In order to instruct the ASM framework to deserialize the ScalaInlineInfo attribute, we need to pass a prototype instance when running the class reader.
object InlinerHeuristics
object LocalOptImpls
