独立的汇编语言

上面介绍的在Solidity中嵌入的内联汇编语言也可以单独使用。实际上,它是被计划用来作为编译器的一种中间语言。在这个目的下,它尝试达到下述的目标:

  1. 使用它编写的代码要可读,即使代码是从Solidity编译得到的。
  2. 从汇编语言转为字节码应该尽可能的少坑。
  3. 控制流应该容易检测来帮助进行形式验证与优化。

为了达到第一条和最后一条的目标,Solidity汇编语言提供了高层级的组件比如,for循环,switch语句和函数调用。这样的话,可以不直接使用SWAP,DUP,JUMP,JUMPI语句,因为前两个有混淆的数据流,后两个有混淆的控制流。此外,函数形式的语句如mul(add(x, y), 7)比纯的指令码的形式7 y x add num更加可读。

第二个目标是通过引入一个绝对阶段来实现,该阶段只能以非常规则的方式去除较高级别的构造,并且仍允许检查生成的低级汇编代码。Solidity汇编语言提供的非原生的操作是用户定义的标识符的命名查找(函数名,变量名等),这些都遵循简单和常规的作用域规则,会清理栈上的局部变量。

作用域:一个标识符(标签,变量,函数,汇编)在定义的地方,均只有块级作用域(作用域会延伸到,所在块所嵌套的块)。跨函数边界访问局部变量是不合法的,即使可能在作用域内(译者注:这里可能说的是,函数内定义多个函数的情况,JavaScript有这种语法)。不允许shadowing。局部变量不能在定义前被访问,但标签,函数和汇编可以。汇编是非常特殊的块结构可以用来,如,返回运行时的代码,或创建合约。外部定义的汇编变量在子汇编内不可见。

如果控制流来到了块的结束,局部变量数匹配的pop指令会插入到栈底(译者注:移除局部变量,因为局部变量失效了)。无论何时引用局部变量,代码生成器需要知道其当前在堆栈中的相对位置,因此需要跟踪当前所谓的堆栈高度。由于所有的局部变量在块结束时会被移除,因此在进入块之前和之后的栈高应该是不变的,如果不是这样的,将会抛出一个警告。

我们为什么要使用高层级的构造器,比如switchfor和函数。

使用switchfor和函数,可以在不用jumpjumpi的情况下写出来复杂的代码。这会让分析控制流更加容易,也可以进行更多的形式验证及优化。

此外,如果手动使用jumps,计算栈高是非常复杂的。栈内所有的局部变量的位置必须是已知的,否则指向本地变量的引用,或者在块结束时自动删除局部变量都不会正常工作。脱机处理机制正确的在块内不可达的地方插入合适的操作以修正栈高来避免出现jump时非连续的控制流带来的栈高计算不准确的问题。

示例:

我们从一个例子来看一下Solidity到这种中间的脱机汇编结果。我们可以一起来考虑下下述Soldity程序的字节码:

contract C {
  function f(uint x) returns (uint y) {
    y = 1;
    for (uint i = 0; i < x; i++)
      y = 2 * y;
  }
}

它将生成下述的汇编内容:

{
  mstore(0x40, 0x60) // store the "free memory pointer"
  // function dispatcher
  switch div(calldataload(0), exp(2, 226))
  case 0xb3de648b {
    let (r) = f(calldataload(4))
    let ret := $allocate(0x20)
    mstore(ret, r)
    return(ret, 0x20)
  }
  default { revert(0, 0) }
  // memory allocator
  function $allocate(size) -> pos {
    pos := mload(0x40)
    mstore(0x40, add(pos, size))
  }
  // the contract function
  function f(x) -> y {
    y := 1
    for { let i := 0 } lt(i, x) { i := add(i, 1) } {
      y := mul(2, y)
    }
  }
}

在经过脱机汇编阶段,它会编译成下述的内容:

{
  mstore(0x40, 0x60)
  {
    let $0 := div(calldataload(0), exp(2, 226))
    jumpi($case1, eq($0, 0xb3de648b))
    jump($caseDefault)
    $case1:
    {
      // the function call - we put return label and arguments on the stack
      $ret1 calldataload(4) jump(f)
      // This is unreachable code. Opcodes are added that mirror the
      // effect of the function on the stack height: Arguments are
      // removed and return values are introduced.
      pop pop
      let r := 0
      $ret1: // the actual return point
      $ret2 0x20 jump($allocate)
      pop pop let ret := 0
      $ret2:
      mstore(ret, r)
      return(ret, 0x20)
      // although it is useless, the jump is automatically inserted,
      // since the desugaring process is a purely syntactic operation that
      // does not analyze control-flow
      jump($endswitch)
    }
    $caseDefault:
    {
      revert(0, 0)
      jump($endswitch)
    }
    $endswitch:
  }
  jump($afterFunction)
  allocate:
  {
    // we jump over the unreachable code that introduces the function arguments
    jump($start)
    let $retpos := 0 let size := 0
    $start:
    // output variables live in the same scope as the arguments and is
    // actually allocated.
    let pos := 0
    {
      pos := mload(0x40)
      mstore(0x40, add(pos, size))
    }
    // This code replaces the arguments by the return values and jumps back.
    swap1 pop swap1 jump
    // Again unreachable code that corrects stack height.
    0 0
  }
  f:
  {
    jump($start)
    let $retpos := 0 let x := 0
    $start:
    let y := 0
    {
      let i := 0
      $for_begin:
      jumpi($for_end, iszero(lt(i, x)))
      {
        y := mul(2, y)
      }
      $for_continue:
      { i := add(i, 1) }
      jump($for_begin)
      $for_end:
    } // Here, a pop instruction will be inserted for i
    swap1 pop swap1 jump
    0 0
  }
  $afterFunction:
  stop
}

汇编有下面四个阶段:

  1. 解析
  2. 脱汇编(移除switch,for和函数)
  3. 生成指令流
  4. 生成字节码

我们将简单的以步骤1到3指定步骤。更加详细的步骤将在后面说明。

解析、语法

解析的任务如下:

  • 将字节流转为符号流,去掉其中的C++风格的注释(一种特殊的源代码引用的注释,这里不打算深入讨论)。
  • 将符号流转为下述定义的语法结构的AST。
  • 注册块中定义的标识符,标注从哪里开始(根据AST节点的注解),变量可以被访问。

组合词典遵循由Solidity本身定义的词组。

空格用于分隔标记,它由空格,制表符和换行符组成。 注释是常规的JavaScript / C ++注释,并以与Whitespace相同的方式进行解释。

语法:

AssemblyBlock = '{' AssemblyItem* '}'
AssemblyItem =
    Identifier |
    AssemblyBlock |
    FunctionalAssemblyExpression |
    AssemblyLocalDefinition |
    FunctionalAssemblyAssignment |
    AssemblyAssignment |
    LabelDefinition |
    AssemblySwitch |
    AssemblyFunctionDefinition |
    AssemblyFor |
    'break' | 'continue' |
    SubAssembly | 'dataSize' '(' Identifier ')' |
    LinkerSymbol |
    'errorLabel' | 'bytecodeSize' |
    NumberLiteral | StringLiteral | HexLiteral
Identifier = [a-zA-Z_$] [a-zA-Z_0-9]*
FunctionalAssemblyExpression = Identifier '(' ( AssemblyItem ( ',' AssemblyItem )* )? ')'
AssemblyLocalDefinition = 'let' IdentifierOrList ':=' FunctionalAssemblyExpression
FunctionalAssemblyAssignment = IdentifierOrList ':=' FunctionalAssemblyExpression
IdentifierOrList = Identifier | '(' IdentifierList ')'
IdentifierList = Identifier ( ',' Identifier)*
AssemblyAssignment = '=:' Identifier
LabelDefinition = Identifier ':'
AssemblySwitch = 'switch' FunctionalAssemblyExpression AssemblyCase*
    ( 'default' AssemblyBlock )?
AssemblyCase = 'case' FunctionalAssemblyExpression AssemblyBlock
AssemblyFunctionDefinition = 'function' Identifier '(' IdentifierList? ')'
    ( '->' '(' IdentifierList ')' )? AssemblyBlock
AssemblyFor = 'for' ( AssemblyBlock | FunctionalAssemblyExpression)
    FunctionalAssemblyExpression ( AssemblyBlock | FunctionalAssemblyExpression) AssemblyBlock
SubAssembly = 'assembly' Identifier AssemblyBlock
LinkerSymbol = 'linkerSymbol' '(' StringLiteral ')'
NumberLiteral = HexNumber | DecimalNumber
HexLiteral = 'hex' ('"' ([0-9a-fA-F]{2})* '"' | '\'' ([0-9a-fA-F]{2})* '\'')
StringLiteral = '"' ([^"\r\n\\] | '\\' .)* '"'
HexNumber = '0x' [0-9a-fA-F]+
DecimalNumber = [0-9]+

脱汇编

一个AST转换,移除其中的forswitch和函数构建。结果仍由同一个解析器,但它不确定使用什么构造。如果添加仅跳转到并且不继续的jumpdests,则添加有关堆栈内容的信息,除非没有局部变量访问到外部作用域或栈高度与上一条指令相同。伪代码如下:

desugar item: AST -> AST =
match item {
AssemblyFunctionDefinition('function' name '(' arg1, ..., argn ')' '->' ( '(' ret1, ..., retm ')' body) ->
  <name>:
  {
    jump($<name>_start)
    let $retPC := 0 let argn := 0 ... let arg1 := 0
    $<name>_start:
    let ret1 := 0 ... let retm := 0
    { desugar(body) }
    swap and pop items so that only ret1, ... retm, $retPC are left on the stack
    jump
    0 (1 + n times) to compensate removal of arg1, ..., argn and $retPC
  }
AssemblyFor('for' { init } condition post body) ->
  {
    init // cannot be its own block because we want variable scope to extend into the body
    // find I such that there are no labels $forI_*
    $forI_begin:
    jumpi($forI_end, iszero(condition))
    { body }
    $forI_continue:
    { post }
    jump($forI_begin)
    $forI_end:
  }
'break' ->
  {
    // find nearest enclosing scope with label $forI_end
    pop all local variables that are defined at the current point
    but not at $forI_end
    jump($forI_end)
    0 (as many as variables were removed above)
  }
'continue' ->
  {
    // find nearest enclosing scope with label $forI_continue
    pop all local variables that are defined at the current point
    but not at $forI_continue
    jump($forI_continue)
    0 (as many as variables were removed above)
  }
AssemblySwitch(switch condition cases ( default: defaultBlock )? ) ->
  {
    // find I such that there is no $switchI* label or variable
    let $switchI_value := condition
    for each of cases match {
      case val: -> jumpi($switchI_caseJ, eq($switchI_value, val))
    }
    if default block present: ->
      { defaultBlock jump($switchI_end) }
    for each of cases match {
      case val: { body } -> $switchI_caseJ: { body jump($switchI_end) }
    }
    $switchI_end:
  }
FunctionalAssemblyExpression( identifier(arg1, arg2, ..., argn) ) ->
  {
    if identifier is function <name> with n args and m ret values ->
      {
        // find I such that $funcallI_* does not exist
        $funcallI_return argn  ... arg2 arg1 jump(<name>)
        pop (n + 1 times)
        if the current context is `let (id1, ..., idm) := f(...)` ->
          let id1 := 0 ... let idm := 0
          $funcallI_return:
        else ->
          0 (m times)
          $funcallI_return:
          turn the functional expression that leads to the function call
          into a statement stream
      }
    else -> desugar(children of node)
  }
default node ->
  desugar(children of node)
}

生成操作码流

在操作码流生成期间,我们在一个计数器中跟踪当前的栈高,所以通过名称访问栈的变量是可能的。栈高在会修改栈的操作码后或每一个标签后进行栈调整。当每一个新局部变量被引入时,它都会用当前的栈高进行注册。如果要访问一个变量(或者拷贝其值,或者对其赋值),会根据当前栈高与变量引入时的当时栈高的不同来选择合适的DUPSWAP指令。

伪代码:

codegen item: AST -> opcode_stream =
match item {
AssemblyBlock({ items }) ->
  join(codegen(item) for item in items)
  if last generated opcode has continuing control flow:
    POP for all local variables registered at the block (including variables
    introduced by labels)
    warn if the stack height at this point is not the same as at the start of the block
Identifier(id) ->
  lookup id in the syntactic stack of blocks
  match type of id
    Local Variable ->
      DUPi where i = 1 + stack_height - stack_height_of_identifier(id)
    Label ->
      // reference to be resolved during bytecode generation
      PUSH<bytecode position of label>
    SubAssembly ->
      PUSH<bytecode position of subassembly data>
FunctionalAssemblyExpression(id ( arguments ) ) ->
  join(codegen(arg) for arg in arguments.reversed())
  id (which has to be an opcode, might be a function name later)
AssemblyLocalDefinition(let (id1, ..., idn) := expr) ->
  register identifiers id1, ..., idn as locals in current block at current stack height
  codegen(expr) - assert that expr returns n items to the stack
FunctionalAssemblyAssignment((id1, ..., idn) := expr) ->
  lookup id1, ..., idn in the syntactic stack of blocks, assert that they are variables
  codegen(expr)
  for j = n, ..., i:
  SWAPi where i = 1 + stack_height - stack_height_of_identifier(idj)
  POP
AssemblyAssignment(=: id) ->
  look up id in the syntactic stack of blocks, assert that it is a variable
  SWAPi where i = 1 + stack_height - stack_height_of_identifier(id)
  POP
LabelDefinition(name:) ->
  JUMPDEST
NumberLiteral(num) ->
  PUSH<num interpreted as decimal and right-aligned>
HexLiteral(lit) ->
  PUSH32<lit interpreted as hex and left-aligned>
StringLiteral(lit) ->
  PUSH32<lit utf-8 encoded and left-aligned>
SubAssembly(assembly <name> block) ->
  append codegen(block) at the end of the code
dataSize(<name>) ->
  assert that <name> is a subassembly ->
  PUSH32<size of code generated from subassembly <name>>
linkerSymbol(<lit>) ->
  PUSH32<zeros> and append position to linker table
}

处于某些特定的环境下,可以看到评论框,欢迎留言交流^_^。