    First-Class Automatic Differentiation in Swift: A Manifesto
    ===========================================================

    * Author: [Richard Wei](https://github.com/rxwei)
    * Date: October 2018

    This document is written for both the machine learning community and the Swift
    programming language design community, with a strong focus on language design.

    **Status: Outdated**

    **Please see [Swift Automatic Differentiation Design Overview](https://docs.google.com/document/d/1bPepWLfRQa6CtXqKA8CDQ87uZHixNav-TFjLSisuKag/edit?usp=sharing) instead.**

    ## Table of Contents

    - [Introduction](#introduction)
    - [What is AD](#what-is-ad)
    - [Why does Swift need AD?](#why-does-swift-need-ad)
    - [Why make AD first-class?](#why-make-ad-first-class)
    - [Vision](#vision)
    - [Part 1: Differentiable Types](#part-1-differentiable-types)
    - [Part 2: Primitive Registration](#part-2-primitive-registration)
    - [Part 3: Basic Differentiation](#part-3-basic-differentiation)
    - [Part 4: Generalized Differentiability](#part-4-generalized-differentiability)
    - [Part 5: True Differential Operators](#part-5-true-differential-operators)
    - [Part 6: Generalized Types for Differentiation](#part-6-generalized-types-for-differentiation)
    - [Part 7: Customizable Differentiation](#part-7-customizable-differentiation)
    - [Acknowledgements](#acknowledgements)

    ## Introduction

    Automatic Differentiation (AD), also known as algorithmic differentiation, is a
    family of techniques used to obtain the derivative of a function. Functions can
    be represented as a composition of elementary operators whose derivatives are
    well-known. While partial derivatives can be computed through different
    techniques, the most common is a recursive application of the chain rule in the
    reverse direction, called reverse-mode AD. Reverse-mode AD computes
    vector-Jacobian products, i.e. partial derivatives with respect to each input
    parameter, and it has become a prerequisite for implementing gradient-based
    learning methods.

    We aim to provide best-in-class AD, including the best optimizations, best error
    messages in failure cases, and the most flexibility and expressivity. To achieve
    this, we built support for AD right into the Swift compiler. This manifesto
    explains the design and vision of AD, and introduces the language
    extensions that will make Swift the world's first general-purpose differentiable
    programming language.

    ## What is AD?

    ### Basic Calculus

    In basic calculus, differentiating a function of type
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\rightarrow\mathbb{R})
    produces a function
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\rightarrow\mathbb{R}) that
    maps points onto their corresponding slopes.

    <p align="center">
    <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/9315f1516ee5847107808697e43693d91abfc6e8"
    </p>

    In the context of Swift, differentiating a function `(Float) -> Float` produces
    `(Float) -> Float`. Functions with multiple arguments, such as `(Float, Float)
    -> Float`, can be thought of as a function whose input domain is a product of
    those argument types, i.e.
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}),
    so the derivative of such a function has type `(Float, Float) -> (Float,
    Float)`. According to this typing rule, the differential operator
    ![](http://latex.codecogs.com/gif.latex?\dfrac{d}{dx}) can be declared as a
    higher-order function, overloaded for each number of arguments because a Swift
    function's argument list is not formally modeled as a tuple.

    ```swift
    func 𝒟<T: FloatingPoint>(_ f: (T) -> T) -> (T) -> T
    func 𝒟<T: FloatingPoint>(_ f: (T, T) -> T) -> (T, T) -> (T, T)
    func 𝒟<T: FloatingPoint>(_ f: (T, T, T) -> T) -> (T, T, T) -> (T, T, T)
    ...
    ```

    ```swift
    func f(_ x: Double, _ y: Double) -> Double {
        return tanh(x + y)
    }
    𝒟(f) // (Double, Double) -> (Double, Double)
    ```

    ### Vectors and Jacobians

    In numerical computing, users often write code that operates on high-dimensional
    mathematical objects. The basic typing rules that we defined on real scalars
    (![](http://latex.codecogs.com/gif.latex?\mathbb{R})) can be generalized for
    [module](https://en.wikipedia.org/wiki/Module_(mathematics))-like types such as
    vectors with extra consideration for shape. In vector calculus, the
    differentiation of a function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}:\mathbb{R}^n\rightarrow\mathbb{R}^m)
    is defined per scalar because there are multiple inputs and multiple outputs.
    Full differentiation of a vector-valued function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}) will thus result in a
    matrix, each of whose entries is a function that computes the partial derivative
    of an output scalar with respect to an input scalar. This matrix is called a
    [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant). In
    this definition, the Jacobian matrix has type
    ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:(\mathbb{R}\rightarrow\mathbb{R})^{mn}).
    For simplicity, we will model it as a function that maps vectors to real-valued
    matrices
    ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:\mathbb{R}^n\rightarrow\mathbb{R}^{mn}).

    <p align="center">
    <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/74e93aa903c2695e45770030453eb77224104ee4"
    alt="Automatic differentiation approaches."/>
    </p>

    While it is challenging to define this function with full type safety in Swift
    because shapes cannot be generic parameters yet, we can define a differential
    operator as the following, specialized on shapes.

    ```swift
    func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    where T: FloatingPoint
    ```

    Computing the Jacobian of a function is often unnecessary in gradient-based
    optimization methods. Computing a full Jacobian requires repeated evaluation of
    two cheaper primitives: vector-Jacobian products (VJPs) or Jacobian-vector
    products (JVPs), and VJPs and JVPs alone are often exactly what we need in
    practice. In these terms, "vector" refers to a vector of partial derivatives
    that is chained with the Jacobian by left-multiplication or
    right-multiplication. As we explain this chaining next, we discuss how Automatic
    Differentiation comes into the picture.

    ### Gradient and Reverse-Mode AD

    When we let a [one-hot](https://en.wikipedia.org/wiki/One-hot) row vector
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}^i:\mathbb{R}^m=\Big[0\cdots_{i-1}1\cdots0\Big])
    left-multiply a Jacobian matrix of type
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^{mn}), we are selecting one
    row in the matrix, which is exactly the
    [gradient](https://en.wikipedia.org/wiki/Gradient) of
    ![](http://latex.codecogs.com/gif.latex?f_i) evaluated at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{x}), i.e.
    ![](http://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})).

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    </p>

    When vector ![](http://latex.codecogs.com/gif.latex?\mathbf{v}) in
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}\mathbf{J_f}(\mathbf{x}))
    represents the gradient of another function
    ![](http://latex.codecogs.com/gif.latex?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}(\mathbf{x})), namely
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{f_i(\mathbf{x})}),
    then the vector-Jacobian product represents
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{\mathbf{x}}). The
    linear function that takes a vector and left-multiplies it with the Jacobian is
    also called a
    [pullback](https://en.wikipedia.org/wiki/Pullback_(differential_geometry)). We
    can define this function in Swift as a higher-order function shown below. The
    body of this function can be defined in terms of `𝒟`, the differential operator
    that returns a Jacobian.

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\dfrac{\partial&space;g(\mathbf{f}(\mathbf{x}))}{\partial&space;\mathbf{x}}=\dfrac{\partial&space;g}{\partial&space;\mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x})&space;=&space;\bigg[&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_n}&space;\bigg]" title="\dfrac{\partial g(\mathbf{f}(\mathbf{x}))}{\partial \mathbf{x}}=\dfrac{\partial g}{\partial \mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x}) = \bigg[ \dfrac{\partial g(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial g(\mathbf{x})}{\partial x_n} \bigg]" />
    </p>

    ```swift
    func pullback<T: FloatingPoint>(
        of f: (Vector2<T>) -> Vector3<T>,
        at x: Vector2<T>
    ) -> (Vector3<T>) -> Vector2<T> {
        return { adjoint in matmul(adjoint, 𝒟(f)(x)) }
    }
    ```

    However, when computing gradients or general vector-Jacobian products, we do not
    need to compute the Jacobian at all: **Automatic Differentiation is here to
    help.**

    [The chain rule of differentiation](https://en.wikipedia.org/wiki/Chain_rule)
    can be interpreted in left-associative order, i.e. accumulating each function's
    partial derivatives from the final output, eventually reaching each input.
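
    To make this concrete, below is a hand-written sketch of reverse-mode chaining
    for a composition h(x) = g(f(x)), with f(x) = x · x and g(y) = 3y. The helper
    names are illustrative only; they are not part of the proposed API, and the
    pullbacks are written by hand rather than generated by the compiler.

    ```swift
    // Each step returns its value along with a pullback that maps an adjoint at
    // the output to an adjoint at the input.
    func fWithPullback(_ x: Float) -> (value: Float, pullback: (Float) -> Float) {
        return (x * x, { v in v * 2 * x })
    }

    func gWithPullback(_ y: Float) -> (value: Float, pullback: (Float) -> Float) {
        return (3 * y, { v in v * 3 })
    }

    func gradientOfH(at x: Float) -> Float {
        let (y, pullbackF) = fWithPullback(x)
        let (_, pullbackG) = gWithPullback(y)
        // The chain rule applied in reverse: seed the output adjoint with 1,
        // then pull it back through g, then through f.
        return pullbackF(pullbackG(1))
    }

    gradientOfH(at: 4) // 24, i.e. d/dx (3x²) = 6x evaluated at x = 4
    ```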

    ### Directional Derivatives and Forward-Mode AD

    Similarly, when we let a column vector
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}:\mathbb{R}^{n1}) right-multiply a
    Jacobian value
    matrix of type ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^{mn}), the result is a vector whose elements are exactly the
    [directional derivatives](https://en.wikipedia.org/wiki/Directional_derivative)
    of each ![](http://latex.codecogs.com/gif.latex?f_i) evaluated at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{x}) in direction ![](http://latex.codecogs.com/gif.latex?\mathbf{v}).

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_1}(\mathbf{x})\&space;\cdots\&space;\nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" title="\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_1}(\mathbf{x})\ \cdots\ \nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" />
    </p>

    The linear function that takes a vector and right-multiplies the Jacobian value
    matrix is called a
    [differential](https://en.wikipedia.org/wiki/Pushforward_(differential)), and it
    can also be defined in Swift as a higher-order function in terms of `𝒟`.

    ```swift
    func differential<T: FloatingPoint>(
        of f: (Vector2<T>) -> Vector3<T>,
        at x: Vector2<T>
    ) -> (Vector2<T>) -> Vector3<T> {
        return { tangent in matmul(𝒟(f)(x), tangent) }
    }
    ```

    Just like vector-Jacobian products, Jacobian-vector products are easy to compute
    using Automatic Differentiation. By simply applying the chain rule of
    differentiation from an input, we will accumulate each function's partial
    derivatives and reach each output.
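
    As a hand-written counterpart to the reverse-mode sketch above, forward-mode
    chaining can be pictured as propagating a (value, derivative) pair from the
    input towards the output. The `Dual` type and helpers below are illustrative
    only and are not the design proposed later in this document.

    ```swift
    // A minimal dual-number sketch of forward-mode chaining.
    struct Dual {
        var value: Float
        var derivative: Float
    }

    // f(x) = x * x, propagating the tangent alongside the value.
    func square(_ x: Dual) -> Dual {
        return Dual(value: x.value * x.value, derivative: 2 * x.value * x.derivative)
    }

    // g(y) = 3 * y
    func triple(_ y: Dual) -> Dual {
        return Dual(value: 3 * y.value, derivative: 3 * y.derivative)
    }

    // d/dx 3x² at x = 4, seeded with tangent 1 at the input.
    let result = triple(square(Dual(value: 4, derivative: 1)))
    result.derivative // 24
    ```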

    AD has a rich background. For an in-depth introduction, here's some great
    documentation:
    - [Introduction to Automatic
    Differentiation](https://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation/)
    - [Automatic differentiation in machine learning: a survey](https://arxiv.org/abs/1502.05767)
    - [The simple essence of automatic
    differentiation](https://arxiv.org/abs/1804.00746)

    ## Why does Swift need AD?

    Swift is a new programming language in the machine learning space. Recently, the
    [Swift for TensorFlow](https://github.com/tensorflow/swift) project brought the
    full power of a machine learning framework into the Swift programming language.
    Numerical computing has a very different set of requirements than application
    development and systems development, and we believe that Swift needs to better
    address those requirements and improve the usability of numerical software. One
    of the most important building blocks in machine learning and numerical
    computing is the ability to differentiate math code. Automatic Differentiation
    has been implemented in many languages, but because of language constraints and
    design trade-offs, many existing AD systems have limitations. We would like to
    take this opportunity to improve Swift, and demonstrate what Swift can offer in
    all areas of numerical computing in the presence of a compiler and a static type
    system.

    ## Why make AD first-class?

    Automatic Differentiation has been a research topic in scientific computing and
    high-performance computing for nearly half a century. Traditional tools such as
    [OpenAD](http://www.mcs.anl.gov/OpenAD/),
    [TAPENADE](http://tapenade.inria.fr:8080/tapenade/index.jsp) and
    [ADIFOR](http://www.mcs.anl.gov/research/projects/adifor/) transform existing
    source code. There are many advanced techniques that have improved
    the performance of derivatives written in FORTRAN, but these tools have not
    gained wide adoption in the machine learning community. More recent AD systems
    like [Stalin∇](https://github.com/Functional-AutoDiff/STALINGRAD) (pronounced
    Stalingrad, available as a dialect of Scheme) achieved good usability by
    integrating the differential operator into the language, and are equipped with a
    complete set of AD features (such as forward/reverse, nested AD, Hessians,
    Jacobians, directional derivatives and checkpointing). Along with libraries such
    as [DiffSharp](http://diffsharp.github.io/DiffSharp/) (available in F#), and
    [ad](https://hackage.haskell.org/package/ad) (available in Haskell), they
    combine AD closely with functional programming languages.

    Researchers in the machine learning community have built many library
    implementations of AD in Python and C++, including
    [Autograd](https://github.com/HIPS/autograd),
    [TensorFlow](http://tensorflow.org/), [Pytorch](http://pytorch.org/), etc.

    While Automatic Differentiation is an integral part of any machine learning
    framework, traditional designs and implementations of AD have some limitations.
    Some of these libraries are implemented as a transformation on a standalone DSL
    (a graph) with a closed set of operators. Others are implemented using operator
    overloading directly on a subset of the source language. Although these
    libraries have gained wide adoption, the ones that leverage ahead-of-time AD do
    not expose an easy-to-use programming model, and the ones that have a friendlier
    programming model lack static analysis to perform more optimized AD.

    Recent projects such as [Tangent](https://github.com/google/tangent),
    [Myia](https://github.com/mila-udem/myia), and
    [Zygote.jl](https://github.com/FluxML/Zygote.jl) based their AD upon source code
    transformation (SCT), a technique that was common in advanced AD systems before
    the deep learning era such as
    [Stalin∇](https://github.com/Functional-AutoDiff/STALINGRAD). The first two
    libraries parse a Python subset into ASTs and transform a function to its
    derivatives either in AST or in a functional IR, and Zygote hooks into the Julia
    compiler and transforms Julia's IR directly. These tools are pushing the
    boundaries of dynamic languages.

    We would like our AD system to feel native and expressive. AD in Swift aims to
    solve real-world usability problems by providing the best generalizations, best
    error messages in failure cases, composable differential operators, and fully
    customizable types and derivatives. To achieve this, we built support for AD
    right into the Swift language. Even though AD has been incubated as part of the
    Swift for TensorFlow project, we believe its importance and impact go beyond
    machine learning, so we decided to eventually propose it through Swift Evolution
    for inclusion in the core language.

    ## Vision

    **Swift will be the world's first general-purpose differentiable programming
    language.**

    ### Ease of Use

    We expect Swift's language-integrated AD to be super easy to use in the context
    of machine learning, control in robotics, and scientific computing. AD is a
    general language feature that works seamlessly with third-party libraries such
    as [TensorFlow](https://www.tensorflow.org/swift/api_docs/).

    ```swift
    struct Parameters: Differentiable, ParameterGroup {
        var w1 = Tensor<Float>(randomNormal: [784, 30])
        var b1 = Tensor<Float>(zeros: [30])
        var w2 = Tensor<Float>(randomNormal: [30, 10])
        var b2 = Tensor<Float>(zeros: [10])
    }

    var params = Parameters()
    let minibatches = Dataset(...)
    var optimizer = StochasticGradientDescent()
    for (x, y) in minibatches {
        let grads = gradient(at: params) { params in
            let h1 = tanh(matmul(x, params.w1) + params.b1)
            let ŷ = sigmoid(matmul(h1, params.w2) + params.b2)
            let loss = (y - ŷ).squared().mean()
            print("Loss is \(loss)")
            return loss
        }
        optimizer.fit(&params, gradients: grads)
    }
    ```

    ### Full Extensibility: Custom Types and Derivatives

    We want our AD system to be fully extensible to the point where users can
    request derivatives of a function taking their own user-defined numeric types,
    and even use this feature to implement structure-dependent algorithms such as
    tree-recursive neural networks. Therefore, when performing AD, Swift makes no
    special assumptions about individual math functions or the types it should
    support. We enable library designers and developers to easily define any
    differentiable type or function, all in pure Swift code.

    Swift supports [protocol-oriented programming and first-class value
    semantics](https://developer.apple.com/videos/play/wwdc2015/408/). AD is deeply
    integrated with value types and has full extensibility via protocol
    conformances. The user can make their custom data structures differentiable
    simply by declaring a conformance to the `Differentiable` protocol:

    ```swift
    extension MyType: Differentiable {
        ...
    }
    ```

    Or make an obviously non-differentiable function differentiable by using the
    `@differentiable` attribute, specifying a "tangent" function for computing its
    Jacobian-vector products, or an "adjoint" function for computing its
    vector-Jacobian products.

    ```swift
    @differentiable(tangent: tangentFoo, adjoint: adjointFoo)
    func foo(_ x: Float) -> Float {
        return Float(Int(x)) // obviously non-differentiable
    }

    func tangentFoo(_ x: (Float, Float), originalResult: Float) -> Float {
        // Insert custom code to compute the directional derivative
    }

    func adjointFoo(_ x: Float, originalResult: Float, adjoint: Float) -> Float {
        // Insert custom code to compute the gradient
    }
    ```

    ### Composable Differential Operators

    With fully customizable data structures and derivatives, everything should feel
    native in the language. In addition, differential operators are functional and
    composable, and differentiability is naturally integrated in the type system.
    All differential operators are defined in Swift, and developers can create their
    own differential operators by composing existing ones. For example, the user can
    use the "forward-on-reverse" approach to compute [Hessian-vector
    products](https://en.wikipedia.org/wiki/Hessian_matrix), where the `hvp(at:in:)`
    operator is defined as a native Swift function. The [`@autodiff(order:
    2)`](#the-autodiff-function-type-attribute) attribute in the closure type
    signature marks the closure argument as being differentiable up to at least the
    2nd order, so that the caller of `hvp(at:in:)` will differentiate the actual
    closure argument as needed.

    ```swift
    func hvp<T: Differentiable, R: FloatingPoint>(
        at x: T, in f: @autodiff(order: 2) (T) -> R
    ) -> @autodiff(linear) (T) -> T {
        return differential(at: x, in: gradient(of: f))
    }
    ```

    ### Static Analysis and Diagnostics

    By building first-class AD into the programming language, we can provide better
    diagnostics about differentiability and numeric stability than any dynamic
    language, all at compile-time.

    ```console
    test.swift:58:10: error: function is not differentiable
    return #gradient(funcToDiff)(x)
    ^ ~~~~~~~~~~

    test.swift:54:10: note: expression is not differentiable
    return middle2(x)
    ^

    test.swift:50:10: note: when differentiating this function call
    return middle(x)
    ^

    test.swift:46:10: note: when differentiating this function call
    return nested(y)
    ^
    ```

    ### Flexible Functional-Style Differentiation

    In common AD libraries, there are two differentiation styles: functional and
    imperative.

    | | Syntax | Meaning |
    |------------|--------|-------------|
    | Functional | `let 𝝯f = gradient(of: f)`<br/>`𝝯f(x)` | Differentiating a function |
    | Imperative | `let y = f(x)`<br/>`gradient(of: y, wrt: x)` | Differentiating code traced through data flow |

    Functional-style AD is transforming one function to another, producing a
    function that takes original arguments and returns the partial derivatives
    evaluated at each argument. Imperative-style AD, on the other hand, is a
    value-value dependency analysis. Although we use both notations in mathematics,
    imperative AD comes at the cost of semantic inconsistency with the host
    language, for example:

    ```swift
    let y = f(x)
    x = 3
    gradient(of: y, wrt: x) // undefined
    ```

    Semantically, `y` is a value, but `x` is both a value and a reference to a
    memory location -- it is unclear what exactly we are differentiating with
    respect to. Though making `y` and `x` have reference types could make this
    particular example work out semantically, it would be fundamentally inconsistent
    with Swift's core design where mathematical objects have value types, and would
    also make scalar types like `Float` incompatible with automatic differentiation.

    We believe Swift's AD can achieve the same level of expressivity as imperative
    AD while preserving functional properties, and use language integration to push
    developers' productivity to the next level.


    ## Part 1: Differentiable Types

    Swift is a general-purpose programming language. Therefore, not every function
    is mathematically differentiable, and not every type represents a real vector
    space to begin with. To make our system mathematically sound, we refine the
    Swift standard library to form a basis for automatic differentiation.

    The starting point of this refinement is the fundamental numeric protocols.
    In this section, we talk about how we improve the `Numeric` protocol to support
    the addition of vector types and protocols. Then, we introduce a protocol to
    represent vector spaces, since that is a requirement for doing calculus.
    Finally, we design a protocol specific to differentiation.

    ### Revising the [`Numeric`](https://developer.apple.com/documentation/swift/numeric) protocol

    The Numeric protocol today refines
    [`ExpressibleByIntegerLiteral`](https://developer.apple.com/documentation/swift/expressiblebyintegerliteral).
    This makes sense for scalars, but is not compatible with vector data structures
    because type-checking would fail on the scalar multiplication operator.

    On the Swift forum, we have discussed the [fundamental blocker for vector types
    to conform to the existing `Numeric`
    protocol](https://forums.swift.org/t/should-numeric-not-refine-expressiblebyintegerliteral).
    The consensus was to introduce a weakening of the `Numeric` protocol to
    represent the abstractions shared between scalars and vectors: [rng (ring
    without unity)](https://en.wikipedia.org/wiki/Rng_(algebra)) (We assumed that
    vector spaces are rngs by endowing them with `*` as element-wise
    multiplication). The protocol will be called `Arithmetic`.

    ```swift
    public protocol Arithmetic: Equatable {
        static var zero: Self { get }
        prefix static func + (x: Self) -> Self
        static func + (lhs: Self, rhs: Self) -> Self
        static func += (lhs: inout Self, rhs: Self)
        static func - (lhs: Self, rhs: Self) -> Self
        static func -= (lhs: inout Self, rhs: Self)
        static func * (lhs: Self, rhs: Self) -> Self
        static func *= (lhs: inout Self, rhs: Self)
    }
    ```

    The existing `Numeric` will be changed to refine (inherit from) `Arithmetic`,
    keeping all of its existing behavior.

    ```swift
    public protocol Numeric: Arithmetic, ExpressibleByIntegerLiteral {
        associatedtype Magnitude: Comparable, Numeric
        init?<T>(exactly source: T) where T: BinaryInteger
        var magnitude: Magnitude { get }
    }
    ```

    ### The `VectorNumeric` protocol

    After we introduce the `Arithmetic` protocol, which makes the standard library
    suitable for vector APIs and beyond, we can define a protocol that generalizes
    vectors. Mathematically, a vector space is a ring without unity if we endow it
    with `*` as element-wise multiplication. We represent vector spaces through the
    `VectorNumeric` protocol as follows. `Scalar` is the type of the elements of
    this vector space -- the field which the vector space is over. `Shape` is the
    shape of this vector space, which is customizable. The initializer takes a value
    of the `Scalar` type and a `Shape` and returns a vector of the specified shape.

    ```swift
    /// A type that represents an unranked vector space. Values of this type are
    /// elements in this vector space and with a specific shape.
    public protocol VectorNumeric: Arithmetic {
        /// The type of scalars in the vector space.
        associatedtype Scalar: Numeric

        /// The type whose values specify the shape of an object in the vector
        /// space.
        associatedtype Shape

        /// Create an object in the vector space with the specified shape by
        /// repeatedly filling the object with the specified value.
        ///
        /// - Parameters:
        ///   - repeatedValue: the value to repeat for the specified shape
        ///   - shape: the shape
        init(repeating repeatedValue: Scalar, shape: Shape)

        /// The shape of this vector.
        var shape: Shape { get }

        /// Returns the scalar product of the vector.
        static func * (scale: Scalar, value: Self) -> Self
    }
    ```
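
    For concreteness, here is a sketch of a small fixed-shape vector type
    conforming to `VectorNumeric`. The `Vector2` type and its members are
    hypothetical and exist only to illustrate the requirements above.

    ```swift
    /// A hypothetical two-dimensional vector type (illustration only).
    struct Vector2: VectorNumeric {
        typealias Scalar = Float
        typealias Shape = ()   // A fixed-shape type has a trivial shape.

        var x: Float
        var y: Float

        static var zero: Vector2 { return Vector2(x: 0, y: 0) }

        init(x: Float, y: Float) { self.x = x; self.y = y }
        init(repeating repeatedValue: Float, shape: Shape) {
            self.init(x: repeatedValue, y: repeatedValue)
        }

        var shape: Shape { return () }

        static func == (lhs: Vector2, rhs: Vector2) -> Bool {
            return lhs.x == rhs.x && lhs.y == rhs.y
        }

        prefix static func + (v: Vector2) -> Vector2 { return v }
        static func + (lhs: Vector2, rhs: Vector2) -> Vector2 {
            return Vector2(x: lhs.x + rhs.x, y: lhs.y + rhs.y)
        }
        static func += (lhs: inout Vector2, rhs: Vector2) { lhs = lhs + rhs }
        static func - (lhs: Vector2, rhs: Vector2) -> Vector2 {
            return Vector2(x: lhs.x - rhs.x, y: lhs.y - rhs.y)
        }
        static func -= (lhs: inout Vector2, rhs: Vector2) { lhs = lhs - rhs }
        static func * (lhs: Vector2, rhs: Vector2) -> Vector2 {
            return Vector2(x: lhs.x * rhs.x, y: lhs.y * rhs.y)  // element-wise
        }
        static func *= (lhs: inout Vector2, rhs: Vector2) { lhs = lhs * rhs }
        static func * (scale: Float, value: Vector2) -> Vector2 {
            return Vector2(x: scale * value.x, y: scale * value.y)
        }
    }
    ```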

    ### The `Differentiable` protocol

    Now we define a protocol that "activates" a type's differentiability. At first
    glance, the conforming type must also be a `VectorNumeric` type. So we make this
    protocol refine `VectorNumeric`. Since differentiation only makes sense on real
    vectors, we add a constraint on the associated type `Scalar` such that it
    conforms to `FloatingPoint`.

    ```swift
    public protocol Differentiable: VectorNumeric where Scalar: FloatingPoint {
    }
    ```

    You may notice that `Differentiable` looks like a dummy protocol because it
    doesn't have any requirements other than the ones inherited from
    `VectorNumeric`. Although under the current assumptions we could completely omit
    the `Differentiable` protocol and just have the AD system recognize
    `VectorNumeric`-conforming types whose scalar elements conform to
    `FloatingPoint`, we actually have theoretical and practical reasons to revise
    the `Differentiable` protocol later on. So we keep `Differentiable` as a
    separate protocol for now and build towards the final design at the end of this
    document.

    ## Part 2: Primitive Registration

    We are aiming for an open and extensible system, so we made the compiler
    agnostic of the actual operations - it does not have special knowledge of
    numeric standard library functions or distinguish between primitive operators
    and other functions. We recursively determine a function's differentiability
    based on:

    - whether a function has a primitive differentiability as specified in the
    standard or user-defined library, and

    - whether a function's definition (type signature and body) is differentiable by
    applying the chain rule of differentiation.

    As such, we provide a syntactic way of specifying the differentiability of a
    function, using either the function's linearity properties or a separate
    function that provides the "tangent code", which specifies how to differentiate
    the function in forward mode, or the "adjoint code", which specifies how to
    differentiate the function in reverse mode.

    ### The `@differentiable` attribute

    We introduce a declaration attribute `@differentiable` to Swift's syntax. The
    full grammar of `@differentiable` is defined as follows:

    ```ebnf
    differentiation-mode = 'forward' | 'reverse' | 'bidirectional'
    differentiability = differentiation-mode | 'linear' | 'constant'
    differentiability-wrt-self = 'wrt' ':' 'self'
    differentiation-order = 'once'
    differentiation-tangent-specifier = 'tangent' ':' declaration-name
    differentiation-adjoint-specifier = 'adjoint' ':' declaration-name
    differentiable-attribute = '@differentiable'
        '(' differentiability
            [ ',' differentiability-wrt-self ]
            [ ',' differentiation-order ]
            [ ',' differentiation-tangent-specifier ]
            [ ',' differentiation-adjoint-specifier ]
        ')'
    declaration-attribute = differentiable-attribute
    ```

    #### First Glance

    The multiplication operator `*` is differentiable with respect to its two
    arguments. Here's how we make it differentiable in the standard library.

    ```swift
    extension FloatingPoint {
        @differentiable(bidirectional, tangent: tangentMul, adjoint: adjointMul)
        static func * (x: Self, y: Self) -> Self { ... }

        internal func tangentMul(
            x: (Self, Self), y: (Self, Self), originalResult: Self
        ) -> Self {
            return x.1 * y.0 + y.1 * x.0
        }

        internal func adjointMul(
            x: Self, y: Self, originalResult: Self, seed: Self
        ) -> (Self, Self) {
            return (seed * y, seed * x)
        }
    }
    ```

    In TensorFlow, the convolution operator is only differentiable with respect to
    a subset of arguments. Here's how we make it differentiable so that it can be
    used for back-propagation.

    ```swift
    @differentiable(reverse, adjoint: adjointConv2D)
    public func conv2d(_ input: Tensor<Float>, filter: Tensor<Float>,
                       strides: @nondiff (Int32, Int32, Int32, Int32),
                       padding: @nondiff Padding) -> Tensor<Float> {
        ...
    }

    func adjointConv2D(_ input: Tensor<Float>, filter: Tensor<Float>,
                       strides: (Int32, Int32, Int32, Int32),
                       padding: Padding) -> (Tensor<Float>, Tensor<Float>) {
        ...
    }
    ```

    #### Differentiation Parameters

    Differentiation parameters are marked inline at each argument position in the
    function declaration. By default, every argument of the function is to be
    differentiated with respect to, unless it is marked as `@nondiff`.

    When a `@differentiable` attribute is applied to a method, or to the getter of a
    computed property in a type, the implicit `self` argument often needs to be
    differentiated with respect to. In order to make a function a differentiation
    primitive with respect to `self`, one can add `wrt: self` to
    the `@differentiable` attribute.
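
    For example, a method can be registered as a reverse-mode primitive with
    respect to `self` as sketched below. The `Vector` extension, `squared()`, and
    the exact shape of its adjoint are hypothetical here and only illustrate the
    use of `wrt: self`.

    ```swift
    public extension Vector {
        @differentiable(reverse, wrt: self, adjoint: adjointSquared)
        func squared() -> Vector {
            return self * self
        }

        // Receives the original result and the back-propagated seed, and returns
        // the partial derivative with respect to `self`.
        internal func adjointSquared(originalResult: Vector, seed: Vector) -> Vector {
            return 2 * seed * self
        }
    }
    ```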

    #### Differentiability

    There are five options for differentiability:

    1. Forward: `@differentiable(forward, tangent: ...)`

       This option says that the function is forward-mode differentiable.
       Forward-mode differentiation requires the "tangent code" (or tangent
       function) of this function, so that Swift knows how to compute the
       function's directional derivatives in the direction specified by the
       tangent vector that has been forward-propagated to the tangent function.

       The compiler will expect the name of the tangent function, with an expected
       type signature, to be specified later in the `tangent:` parameter in the
       attribute.

    2. Reverse: `@differentiable(reverse, adjoint: ...)`

       This option says that the function is reverse-mode differentiable.
       Reverse-mode differentiation requires the "adjoint code" (or adjoint
       function) of this function, so that Swift knows how to compute the function's
       vector-Jacobian products, where the vector, also called the "adjoint vector",
       has been back-propagated to the adjoint function.

       The compiler will expect the identifier of the adjoint function, with an
       expected type signature, to be specified later in the `adjoint:` parameter
       in the attribute.

    3. Bidirectional: `@differentiable(bidirectional, tangent: ..., adjoint: ...)`

       This option says that the function is both forward-mode differentiable and
       reverse-mode differentiable. The compiler will expect both the tangent
       function and the adjoint function to be specified later in this attribute.

    4. Constant: `@differentiable(constant)`

       By definition, constant functions always have zero derivatives and are
       differentiable at any arbitrary order. Differentiating such a function will
       result in a zero vector (or vectors, when the function has multiple
       differentiation arguments) with the same shape as each differentiation
       argument.

    5. Linear: `@differentiable(linear)`

       By definition, a linear map is always a unary function and its Jacobian is
       the matrix associated with this linear transformation itself. In other
       words, both its differential and its pullback are the function itself.

    #### Associated Functions

    As explained, different differentiabilities have different functional
    requirements (see the sketch after this list for a concrete example).

    1. `forward` differentiability

       When the differentiability is `forward`, the compiler expects a `tangent:`
       label in the attribute followed by the name (qualified or unqualified)
       of a tangent function that is to be associated with the original function.
       If the original function declaration has type `(T0, ..., Tn) -> U`, then
       the expected type of the tangent function is `((T0, T0), ..., (Tn, Tn), U) ->
       U`. As we can see, every argument of the original function has become
       a "dual number" in the tangent function, represented as a tuple. The first
       element of such a tuple is the original argument, and the second element is
       the forward-propagated directional derivative, namely the "vector" in
       "Jacobian-vector product". The last argument to the tangent function is the
       original function's result. The result of the tangent function is the
       directional derivatives. If any of the original arguments is marked as
       `@nondiff`, it will not become a dual number in the tangent function's
       argument list but will remain as the original argument itself.

    2. `reverse` differentiability

       When the differentiability is `reverse`, the compiler expects an `adjoint:`
       label in the attribute followed by the name (qualified or unqualified)
       of an adjoint function that is to be associated with the original function.
       If the original function declaration has type `(T0, ..., Tn) -> U`, then
       the expected type of the adjoint function is `(T0, ..., Tn, U, U) -> (T0,
       ..., Tn)`. As we can see, the first arguments to the adjoint function,
       `T0, ..., Tn`, are the original arguments. The next argument is the
       original function's result. The last argument is the back-propagated
       partial derivative at the original function's result,
       namely the "vector" in "vector-Jacobian product". The result of the
       adjoint function contains partial derivatives for each argument, if the
       argument has not been marked as `@nondiff`.

    3. `bidirectional` differentiability

       When the differentiability is `bidirectional`, the compiler expects both
       `tangent:` and `adjoint:` arguments to be specified.

    4. Other differentiabilities

       Other differentiabilities such as `constant` and `linear` do not require
       any associated functions. However, users can choose to specify
       tangent/adjoint function(s) for their own purposes such as custom
       optimizations.
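
    To make these expected shapes concrete, here is a sketch for a binary
    function. `myOp` and its associated functions are hypothetical and only
    illustrate the type signatures described above.

    ```swift
    @differentiable(bidirectional, tangent: tangentMyOp, adjoint: adjointMyOp)
    func myOp(_ x: Float, _ y: Float) -> Float {
        return x * x * y
    }

    // `forward`: every argument becomes a dual number `(original, tangent)`, and
    // the last parameter is the original result.
    func tangentMyOp(_ x: (Float, Float), _ y: (Float, Float),
                     originalResult: Float) -> Float {
        return 2 * x.0 * y.0 * x.1 + x.0 * x.0 * y.1
    }

    // `reverse`: the original arguments, then the original result, then the seed;
    // the result contains a partial derivative for each argument.
    func adjointMyOp(_ x: Float, _ y: Float, originalResult: Float,
                     seed: Float) -> (Float, Float) {
        return (seed * 2 * x * y, seed * x * x)
    }
    ```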

    #### Differentiation Order

    When a function is marked as `@differentiable`, Swift assumes it to be
    higher-order differentiable, i.e. differentiable at all orders, unless `once` is
    specified in the attribute, in which case Swift will not guarantee any
    higher-order differentiability. If their associated functions (tangent or
    adjoint) are serialized, then their derivatives _may_ be differentiable via a
    separate code transformation.

    Differentiabilities `linear` and `constant` guarantee smoothness, and they do
    not have to be serialized whatsoever because their derivatives do not depend on
    any code transformation.

    `forward` and `reverse` transitively require the tangent function and the
    adjoint function, respectively, to be differentiable with respect to the
    original arguments. When compiling such declarations, Swift will verify the
    tangent/adjoint function is also differentiable by static analysis. If they are
    not differentiable, the compiler will error out, prompting the user to insert
    `once` in the `@differentiable` attribute.

    Example 1. Linear functions are differentiable at any order.

    ```swift
    public extension Tensor {
        @differentiable(linear, wrt: self)
        func transposed() -> Self {
            ...
        }
    }
    ```

    Example 2. A forward-mode primitive-differentiable function whose closed-form
    tangent is itself differentiable supports higher-order differentiation.

    ```swift
    // Okay, the tangent function is differentiable.
    @differentiable(forward, tangent: tangentFoo)
    func foo(_ x: Float) -> Vector<Float> {
        return Vector(repeating: sin(x), shape: [2, 3])
    }

    func tangentFoo(_ dualX: (Float, Float),
                    originalResult: Vector<Float>) -> Vector<Float> {
        let (x, dx) = dualX
        // Differentiable because `Vector.init(repeating:shape:)`, `*`, `sin` and
        // `cos` are all declared `@differentiable` and are differentiable.
        return Vector(repeating: cos(x) * dx, shape: [2, 3])
    }
    ```

    Example 3. A reverse-mode primitive-differentiable function is not
    differentiable at a higher order because its adjoint is not differentiable.

    ```swift
    @differentiable(reverse, adjoint: adjointBar)
    func bar(_ x: Vector<Float>) -> Float {
        return sin(x)[0]
    }

    var someGlobalVariable: Vector<Float> = [1, 1, 1]

    func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
        var ∂y∂x = Vector<Float>(repeating: 0, shape: x.shape)
        someGlobalVariable[0] = cos(x[0]) * adjoint
        ∂y∂x[0] = someGlobalVariable[0]
        return ∂y∂x
    }
    ```
    ```console
    test.swift:3:35: error: function `bar` does not support higher-order differentiation
    because its adjoint is not differentiable; would you like to add `once`?
    @differentiable(reverse, adjoint: adjointBar)
    ^~~~~~~~~~
    test.swift:8:6: note: `adjointBar` is defined here
    func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
    ^~~~~~~~~~
    test.swift:10:9: note: operation is not differentiable
    ∂y∂x[0] = cos(x[0]) * adjoint
    ^~~~~~~~~~~~~~~~~~~~~~~~~
    ```

    ## Part 3: Basic Differentiation

    Applying the chain rule of differentiation gives us vector-Jacobian
    products or Jacobian-vector products, expressed as functions. Now that we have
    defined primitive differentiable functions, Swift can recursively differentiate
    any function whose body is available to the compiler.

    ### Start Simple: Gradient and Derivatives

    We start by introducing the syntax of two raw differential operators:
    - `#gradient(f)`: Produces the gradient of `f`, where `f: ℝⁿ → ℝ`.
    - `#derivatives(f)`: Produces derivatives of `f`, where `f: ℝ → ℝᵐ`.

    The syntax of these operators looks like macros, but we will generalize them and
    make them look much nicer in the second half of this document.

    Example:

    ```swift
    func f(_ x: Vector<Float>, _ w: Vector<Float>) -> Float {
        return x • w // dot product
    }

    #gradient(f) // (Vector<Float>, Vector<Float>) -> (Vector<Float>, Vector<Float>)

    func g(_ x: Float) -> (Vector<Float>, Vector<Float>) {
        return (Vector(repeating: sin(x), shape: [3]),
                Vector(repeating: cos(x), shape: [3]))
    }

    #derivatives(g) // (Float) -> (Vector<Float>, Vector<Float>)
    ```

    The grammar of these raw differential operators is defined as follows:

    ```ebnf
    derivatives-operator = '#derivatives'
    gradient-operator = '#gradient'
    raw-differential-operator = derivatives-operator | gradient-operator
    autodiff-argument-index-specifier = '.' integer-literal
    autodiff-expression =
        raw-differential-operator '(' expression [ ',' 'wrt' ':' autodiff-argument-index-specifier ] ')'
    expression = autodiff-expression
    ```
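
    The optional `wrt:` clause selects which argument to differentiate with
    respect to, by index. For example (a sketch; the exact result type of a
    `wrt:`-restricted gradient is an assumption here):

    ```swift
    func f(_ x: Vector<Float>, _ w: Vector<Float>) -> Float { ... }

    // Differentiate only with respect to the second argument, `w`.
    #gradient(f, wrt: .1) // (Vector<Float>, Vector<Float>) -> Vector<Float>
    ```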

    ### Embrace Generality: Vector-Jacobian Products and Jacobian-Vector Products

    Gradient and derivatives are two special cases of differentiation where the
    output or the input is a scalar, respectively. When they are not scalars,
    vector-Jacobian products and Jacobian-vector products are computed with a
    vector. These cases are not obvious, but are required for modular machine
    learning APIs where each neural network layer defines a back-propagation method
    that takes a partial derivative vector back-propagated from the previous layer.
    As such, we add two extra differential operators which will be useful for
    computing these products.

    - `#differential(f)`: Produces a function that takes the original arguments and
    returns the differential of `f`.
    - `#pullback(f)`: Produces a function that takes the original arguments and
    returns the pullback of `f`.

    ```ebnf
    jvp-operator = '#differential'
    vjp-operator = '#pullback'
    raw-differential-operator = jvp-operator | vjp-operator
    ```

    Example:

    ```swift
    // A random generic function that is differentiable.
    func f<T0, T1, U>(_ x: T0, _ y: T1) -> U
        where T0: Differentiable, T1: Differentiable, U: Differentiable {
        return someDifferentiableFunction(20, x + y)
    }

    #differential(f) // (T0, T1) -> (U) -> (U, (T0, T1))
    // Description: original args -> vector -> (result, Jacobian-vector products)

    #pullback(f) // (T0, T1) -> (U, (U) -> (T0, T1))
    // Description: original args -> (result, vector -> vector-Jacobian products)
    ```

    ### How It Works

    The compiler type-checks `#gradient(f)`, as well as the other differential
    operators, by searching for the closest match given the contextual type. `f`
    must have a visible definition in order to be differentiable, and thus cannot be
    a closure whose body is opaque to the compiler; if it is, Swift reports an error.

    Later in the compilation pipeline, the compiler recursively transforms the code
    of `f` to its gradient function `∇f` (or other functions in other modes of
    differentiation), and replaces `#gradient(f)` with `∇f`. Everything composes
    together naturally. Now, differentiation works.
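
    For intuition, here is roughly what the generated code corresponds to for a
    tiny function. This is a hand-written illustration, not the actual compiler
    output, and `handWrittenGradientF` is a hypothetical name.

    ```swift
    func f(_ x: Float, _ y: Float) -> Float {
        return x * y + x
    }

    // Roughly what `#gradient(f)` computes, written out by hand:
    // ∂f/∂x = y + 1, ∂f/∂y = x.
    func handWrittenGradientF(_ x: Float, _ y: Float) -> (Float, Float) {
        // Seed the output adjoint with 1 and apply the chain rule in reverse.
        let seed: Float = 1
        return (seed * y + seed, seed * x)
    }
    ```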

    ### AD in Action

    Automatic Differentiation based on raw differential operators is already
    available and being incubated temporarily on [the "tensorflow" branch of
    Swift](https://github.com/apple/swift/tree/tensorflow). Swift for TensorFlow
    [development
    toolchains](https://github.com/tensorflow/swift/blob/master/Installation.md) and
    [tutorials](https://github.com/tensorflow/swift-tutorials/blob/master/iris/swift_tensorflow_tutorial.ipynb)
    are available for trying out this feature.

    ## Part 4: Generalized Differentiability

    Automatic differentiation relies on the definition (body) of a function to be
    able to differentiate it. Differential operators like `#gradient` trigger the
    differentiation of a function, and the differentiability of the function is
    determined as differentiation goes. This works perfectly so far, but has a
    number of problems.

    ### Issues with Definition-Based Differentiability

    #### Syntactic Weirdness

    Raw differential operators adopt the pound-keyword syntax, which has been
    previously used for accessing compiler builtins, e.g. `#file` and `#dsohandle`,
    referring to IDE-specific objects, e.g. `#colorLiteral` and `#imageLiteral`, and
    interoperating with "stringly-typed" Objective-C key paths, e.g.
    `#keyPath(...)`. The pound-keyword syntax does not have native parsing support
    for syntactic features like trailing closures, so it is hard to make the closure
    code short under differential operators like `#gradient`.

    Example:
    ```swift
    // Ideal
    let dydx = gradient { x in
        sin(x) + cos(x)
    }

    // Reality
    let dydx = #gradient({ x in
        sin(x) + cos(x)
    })
    ```

    #### A Higher-Order Function, But Not Quite

    When we introduced AD in Swift earlier in this document, we defined the
    differential operator as a higher-order function. Type checking and type
    inference were just expected to work like any other functions.

    However, since the compiler needs to reject functions that are not
    differentiable and differentiability is not part of the type system, even if we
    were to redefine `#gradient` as a higher-order function named `gradient(of:)`,
    the compiler would still have to maintain dedicated knowledge about this
    function in order to reject invalid arguments.

    #### Cross-Module Differentiability, Without Serialization

    As of now, the differentiability of a function is determined solely through
    two tests:
    - Is the function a primitive-differentiable function (`@differentiable`)?
    - Can the function's body be differentiated in the differentiation mode
    associated with the differential operator applied?

    This simple system works perfectly when differentiating concrete functions
    defined in a local module, but does not allow differentiation of opaque function
    values or methods required by protocols. While being free of serialization is
    not a strict requirement for numerical computing libraries, not supporting
    differentiation on protocol requirements fundamentally obstructs composable
    high-level APIs that rely on AD, such as machine learning model APIs.

    #### Opaque Closures are Non-Differentiable

    There is no way to define a higher-order function that differentiates its
    argument using `#gradient`. Here's an example:

    ```swift
    func foo(_ f: (Float) -> Float) -> Float {
        return #gradient(f)(0)
    }
    ```

    ```console
    test.swift:2:22: error: cannot differentiate an opaque closure
    return #gradient(f)(0)
    ~~~~~~~~~~^~
    test.swift:1:12: note: value defined here
    func foo(_ f: (Float) -> Float) -> Float {
    ^~~~~~~~~~~~~~~~~~~
    ```

    Closure arguments and dynamic dispatch are non-differentiable through direct
    source code transformation. The compiler does not statically know where `f` is
    coming from, nor can it delegate the task of differentiation of argument `f` to
    each callsite of `foo` because it cannot be expressed in the type system.

    ### Solution: Differentiability in Function Types

    As we can see, the core of the problem with definition-based differentiability
    is the opacity of function values. The restriction that the differential operator
    must see the full definition of a function makes it
    impossible to define protocol-oriented differentiable code, and is the primary
    hindrance to modular, composable differentiation APIs.

    It turns out this is not a new problem -- we can learn from how Swift deals with
    calling conventions. Functions with different calling conventions have
    different type signatures, e.g. `@convention(thick)` and `@convention(thin)`,
    and functions convert back and forth through conversion thunks implicitly.

    ```swift
    // A "thin" function that captures no variables.
    // Its representation is `@convention(thin)` by default.
    func f(x: Int) -> Int {
        return x
    }

    var globalVar = 30

    // A "thick" function that captures the value of `globalVar`.
    // Its representation is `@convention(thick)` by default.
    let g = { x in globalVar + x }

    // A higher-order function.
    // The closure argument `h`'s representation is `@convention(thick)`, because it should
    // be able to take closures that capture variables.
    func takeFunc(_ h: (Int) -> Int) { ... }

    takeFunc(f) // Implicitly converts function `f` to a `@convention(thick)` closure by
                // creating a conversion thunk.
    takeFunc(g) // `g` is thick already. No conversion needed.
    ```

    Sometimes, different conventions have different binary representations for
    storing captured variables and such, just like the example with `f` and `g`
    above. In AD, the only difference between a non-differentiable function and a
    differentiated function (say, in reverse mode) is whether the function carries a
    few other function pointers that represent the function's adjoint code, so we
    can model differentiable functions using a "thicker" function type, which
    bundles the original function representation along with pointers to the original
    function's Jacobian-vector product functions and/or vector-Jacobian product
    functions. When a normal function with a visible body gets passed as an
    `@autodiff` function, the function will be differentiated.
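
    As a rough mental model (a sketch for intuition only, not the actual
    representation or ABI), a reverse-mode `@autodiff` function value can be
    pictured as a pair of function pointers:

    ```swift
    // Hypothetical model of what a reverse-mode `@autodiff (T) -> U` value carries.
    struct AutodiffReverseFunction<T, U> {
        /// The original function.
        var original: (T) -> U
        /// The associated function computing the original result together with a
        /// pullback for vector-Jacobian products.
        var primal: (T) -> (result: U, pullback: (U) -> T)
    }
    ```

    From the user's perspective, none of this machinery is visible; passing a
    function where an `@autodiff` type is expected is all it takes: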

    ```swift
    // `f` is a normal function that has type `(Float) -> Float`.
    func f(x: Float) -> Float {
        return sin(x)
    }

    // `f` gets implicitly converted (or more accurately, differentiated).
    let g = f as @autodiff (Float) -> Float

    func takesFunc(_ someFunc: @autodiff (Float) -> Float) {
        #derivatives(someFunc)
        ...
    }

    // At the call site of `takesFunc(_:)`, `f` gets implicitly differentiated to become
    // `@autodiff (Float) -> Float`.
    takesFunc(f)
    ```

    If a normal function does not have a visible body, then it cannot be passed as
    an `@autodiff` function. Swift will show an error at compile-time.

    ```swift
    var normalFuncWithOpaqueBody: (Float) -> Float = ...

    takesFunc(normalFuncWithOpaqueBody)
    ```

    ```console
    test.swift:19:11: error: function is not differentiable, but the contextual type is
    '@autodiff (Float) -> Float'
    takesFunc(normalFuncWithOpaqueBody)
    ^~~~~~~~~~~~~~~~~~~~~~~~

    test.swift:17:4: note: value defined here
    var normalFuncWithOpaqueBody: (Float) -> Float = ...
    ^~~~~~~~~~~~~~~~~~~~~~~~
    ```

    At first glance, this could even be an addition to the existing `@convention`
    attribute, as something like `@convention(autodiff)`; however, differentiability
    does not align semantically with `@convention`. First, when a function becomes
    its differentiable (or differentiated) form, its original calling convention is
    not changed. Second, functions with any convention are technically
    differentiable, including `thin`, `thick`, `method`, etc. Third,
    differentiability is not the only information that needs to be encoded --
    there's also the order of differentiation. Therefore, we need a separate
    dimension of "thickness" in the function type: differentiability.

    We define a new formalization of differentiability in Swift's type system,
    including an `@autodiff` function type attribute, an extension to functions'
    layout, and new syntax for selecting differentiable arguments.

    #### The `@autodiff` Function Type Attribute

    The `@autodiff` attribute on a function type specifies the function's
    differentiability and differentiation order, just like `@differentiable` on
    function declarations. The biggest differences are

    - `@differentiable` contains associated functions (tangent/adjoint) statically,
      but `@autodiff` functions carry those extra function pointers in their binary
      representation as a runtime property. Any user of this function will be able
      to differentiate it, with differentiability guaranteed formally by the type
      system. With this addition to the type system, serialization/inlinability is
      no longer necessary because functions can be passed around without losing
      differentiability.

    - Differentiation order is no longer once vs. infinite. Instead, `@autodiff`
      functions can specify a maximum order at which this function can be
      differentiated, unless the function is linear or constant. This is because
      function-representation-based differentiability requires functions to be
      differentiated ahead of becoming a value and being passed around.

    The grammar for `@autodiff` is defined as follows:

    ```ebnf
    differentiation-order = 'order' ':' integer-literal
    differentiability = 'forward' | 'reverse' | 'linear' | 'constant' | 'bidirectional'
    autodiff-attr = '@autodiff' [ '(' autodiff-attr-arguments ')' ]
    autodiff-attr-arguments = differentiability [ ',' differentiation-order ] | differentiation-order
    ```

    When a differentiability is specified on a function type, the function's
    differentiation behavior is akin to what's defined for the
    `@differentiable` declaration attribute. If no differentiability is specified,
    the function is both forward-mode and reverse-mode differentiable (same as
    `bidirectional`).

    #### Creating `@autodiff` Functions

    It becomes increasingly clear that first-order differentiation will not, and
    should not, require serialization, and that only higher-order differentiation
    should, due to code size. In order to make the system consistent, we make each
    `@differentiable` function declaration result in an `@autodiff` function.

    Since we want to support differentiating opaque functions, we must support
    creating one. The fact is, the user does not even need to know about `@autodiff`
    or intentionally create differentiable functions if they are working with
    functions in the current module. Whenever a local function declaration gets used
    where the contextual type has an `@autodiff` attribute on it, Swift
    differentiates it. If differentiation fails, Swift reports an error at
    compile-time.

    For public APIs, we relax the constraint on `@differentiable` so that it can be
    applied to any function declaration without specifying a tangent or adjoint, even
    when the differentiability is forward/reverse. In this case, Swift tries to
    differentiate the function and export the derivatives as part of the public API: if
    the function gets differentiated, its default type signature has the `@autodiff`
    attribute on it; otherwise, Swift reports an error to the user showing what's
    non-differentiable.
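
    For example (a sketch; `cubed` is a hypothetical function, not part of the
    proposal), a public API can be marked `@differentiable(reverse)` with no
    adjoint, and the compiler derives and exports the derivative:

    ```swift
    @differentiable(reverse)
    public func cubed(_ x: Float) -> Float {
        return x * x * x
    }
    // Default exported type: @autodiff(reverse) (Float) -> Float
    ```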

    #### Higher-Order Differentiation of Opaque Closures

    In order for modular libraries to support opaque higher-order differentiation,
    the differentiation order must be specified in the closure type signature, so
    that the closure ABI is guaranteed to contain the higher-order derivative.

    ```swift
    @autodiff(reverse, order: 2) (T) -> U
    ```

    For example, function `g` takes a differentiable function that is differentiable
    up to at least the 3rd order, then differentiates it 3 times in the body.

    ```swift
    // In a separate module:
func g(_ h: @autodiff(reverse, order: 3) (Float) -> Float) -> Float {
    return #gradient(h)(1) +
        #gradient(#gradient(h))(1) +
        #gradient(#gradient(#gradient(h)))(1)
}
    ```

We also extend the `@differentiable` attribute so that it can force a
primitive-differentiable function to be differentiated to a specific order
ahead of time. For example, when Swift compiles function `f` below, the
function will have been differentiated 6 times, and the derivative functions
will be preserved in `f`'s ABI so that they can be called from anywhere (any
other Swift module, or even C). `f`'s default type signature is
`@autodiff(reverse, order: 6) (Float) -> Float`.

    ```swift
    @differentiable(reverse, order: 6)
    public func f(_ x: Float) -> Float {
    return pow(x, 6)
    }
    ```

    Differentiable functions with a maximum differentiation order can be implicitly
    "down-ordered", that is, differentiable functions with a higher maximum
    differentiation order can be implicitly converted to a function with a lower
    maximum differentiation order. For example, we can directly pass `f` as an
    argument to `g`.

    ```swift
    g(f) // 156
    ```

    #### Conversion Between Differentiabilities

    Because of their mathematical properties, differentiabilities can be converted
    to one another statically without runtime overhead. For example, a constant
    function is also a linear function when it's unary; a linear function is a
    bidirectional-differentiable function whose tangent and adjoint are both
    themselves; any differentiability can be completely dropped from a function
    type, forming a "normal" function. This allows us to define generic algorithms
    using differentiation, without specializing them on function types of each
    differentiability.

    The following table shows whether each differentiability (as a column label) can
    be converted to another (as a row label).

| Convertible to: | None | Linear | Constant  | Forward | Reverse | Bidirectional |
|-----------------|------|--------|-----------|---------|---------|---------------|
| None            |      | ✓      | ✓         | ✓       | ✓       | ✓             |
| Linear          |      |        | ✓ (unary) |         |         |               |
| Constant        |      |        |           |         |         |               |
| Forward         |      | ✓      | ✓         |         |         | ✓             |
| Reverse         |      | ✓      | ✓         |         |         | ✓             |
| Bidirectional   |      | ✓      | ✓         |         |         |               |

    What does differentiability conversion look like in real code? Just like
    `@convention` conversion, differentiability conversion is implicit and has
    little mental overhead to the user.

    ```swift
    let linear: @autodiff(linear) (Float) -> Float = ...
    let bidir: @autodiff (Float) -> Float = ...
    let const: @autodiff(constant) (Float) -> Float = ...

    func foo(_: @autodiff(reverse) (Float) -> Float) { ... }

    foo(linear) // Okay! Implicitly converted to `@autodiff(reverse)`.
    foo(bidir) // Okay! Implicitly converted to `@autodiff(reverse)`.
    foo(const) // Okay! Implicitly converted to `@autodiff(reverse)`.
    ...
    ```

    ## Part 5: True Differential Operators

    [Generalized Differentiability](#part-4-generalized-differentiability) enabled us
    to define custom differential operators in a functional way. Now it's time to
    define the true differential operators.

    ### Derivatives and Gradient

We start with functions that take a function and produce another function that
computes derivatives or gradients. Recall that we already have the built-in
syntax `#gradient` and `#derivatives` for computing gradients and derivatives;
here we explore more expressive APIs enabled by Generalized Differentiability,
which lets us differentiate arguments that are themselves functions.

    #### Forward Differential Operators

    We define two forward-mode differential operators for computing basic
    derivatives:
    - `derivatives(of:)` computes a derivatives function that takes a value and
    returns derivatives evaluated at the given value.
    - `derivatives(at:in:)` computes derivatives of a closure at a given value.

    ```swift
    /// Computes derivatives of `body`.
    func derivatives<T: FloatingPoint, R: Differentiable>(
    of body: @autodiff(forward) (T) throws -> R
    ) rethrows -> (T) -> R {
    return { x in #differential(body)(x)(1).1 } // seed = dx/dx = 1
    }

    /// Computes derivatives of `body` at scalar `x`.
    func derivatives<T: FloatingPoint, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
    ) rethrows -> R {
    return derivatives(of: body)(x)
    }
    ```

    #### Reverse Differential Operators

    We also define two reverse-mode differential operators for computing basic
    gradients:
    - `gradient(of:)` computes a gradient function that takes a value and returns
    the gradient evaluated at the given value.
    - `gradient(at:in:)` computes the gradient of a closure evaluated at a given
    value.

    ```swift
    /// Computes the gradient of `body`.
    func gradient<T: Differentiable, R: FloatingPoint>(
    of body: @autodiff(reverse) (T) throws -> R
    ) rethrows -> (T) -> T {
    return { x in #pullback(body)(x).1(1) } // seed = dx/dx = 1
    }

    /// Computes the gradient of `body` at `x`.
    func gradient<T: Differentiable, R: FloatingPoint>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
    ) rethrows -> T {
    return gradient(of: body)(x)
    }
    ```

Since we can now differentiate a higher-order function's argument (thanks to
Generalized Differentiability), we can define `derivatives(of:)` and
`gradient(of:)` as ordinary Swift functions in terms of the more general raw
differential operators `#differential` and `#pullback`, replacing
`#derivatives` and `#gradient`!
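
As a quick usage sketch (assuming standard-library scalars conform to the
protocols above; the values in the comments follow from basic calculus):

```swift
let dSinAtZero = derivatives(at: 0, in: sin)          // cos(0) == 1
let dSquareAtThree = gradient(at: 3) { x in x * x }   // 6
let gradCube = gradient { (x: Double) in x * x * x }  // gradient function of x³
gradCube(2)                                           // 12
```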

These differential operators work seamlessly with closure captures,
error-throwing functions, and arbitrary side-effecting code that does not
contribute to the closure result. This looks much like value-based automatic
differentiation while the math remains fully functional, and it achieves a
similar level of expressivity to imperative-style automatic differentiation
libraries: instead of writing `gradient(...)` at the bottom of a forward pass,
one writes it at the top and has a trailing closure close over the forward
pass.

    Example: Train a simple 2-layer perceptron. The snippet computes the gradient
    w.r.t. each parameter at each training step, prints a loss, and optimizes
    parameters.

    ```swift
struct Parameters: Differentiable, ParameterGroup {
    var w1 = Tensor<Float>(randomNormal: [784, 30])
    var b1 = Tensor<Float>(zeros: [30])
    var w2 = Tensor<Float>(randomNormal: [30, 10])
    var b2 = Tensor<Float>(zeros: [10])
}

var params = Parameters()
let minibatches = Dataset(...)
var optimizer = StochasticGradientDescent(learningRate: 0.1)
for (x, y) in minibatches {
    let grads = gradient(at: params) { params in
        let h1 = tanh(matmul(x, params.w1) + params.b1)
        let ŷ = sigmoid(matmul(h1, params.w2) + params.b2)
        let loss = (y - ŷ).squared().mean()
        print("Loss is \(loss)")
        return loss
    }
    optimizer.fit(&params, gradients: grads)
}
    ```


    ### Preserving Original Result

Since the forward computation is expressed as a trailing closure passed to
`gradient(at:in:)`, it is just as customizable as in operator-overloading AD
systems. Users can do whatever they want with intermediate values or the
result of the primal computation.

    That said, we would like to provide a way to have the differentiation API return
    the original result directly. Because of Generalized Differentiability, these
    APIs can be defined entirely as library functions using primitive differential
    operators.

    ```swift
    /// Computes `body(x)` and derivatives of each scalar output of `body` at `x`.
    func valueWithDerivatives<T: FloatingPoint, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
    ) rethrows -> (value: R, derivatives: R) {
    return #differential(body)(x)(1)
    }

    /// Computes `body(x)` and the gradient of `body` at `x`.
    func valueWithGradient<T: Differentiable, R: FloatingPoint>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
    ) rethrows -> (value: R, gradient: T) {
    let (y, pullback) = #pullback(body)(x)
    return (y, pullback(1))
    }
    ```
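
A usage sketch (scalar example; the numbers in the comment follow from
d(x²)/dx = 2x):

```swift
let (value, grad) = valueWithGradient(at: 3.0) { x in x * x }
// value == 9.0, grad == 6.0
```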

    ### Jacobian-Vector Products and Vector-Jacobian Products

    Jacobian-vector products (forward-mode) and vector-Jacobian products
    (reverse-mode) are extremely useful differential operators for lots of tasks in
    numerical computing.

    ```swift
/// Computes the Jacobian-vector products of `body` at `x`.
func jacobianVectorProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: T,
    in body: @autodiff(forward) (T) throws -> R
) rethrows -> R {
    return #differential(body)(x)(vector).1
}

/// Computes the vector-Jacobian products of `body` at `x`.
func vectorJacobianProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: R,
    in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T {
    return #pullback(body)(x).1(vector)
}
    ```
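
For a scalar function both products reduce to an ordinary derivative, which
makes for a compact sketch (assuming scalars conform to `Differentiable`):

```swift
// For f(x) = x², the 1×1 Jacobian at x = 3 is [6].
let jvp = jacobianVectorProducts(at: 3.0, vector: 1.0) { x in x * x } // 6.0
let vjp = vectorJacobianProducts(at: 3.0, vector: 1.0) { x in x * x } // 6.0
```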

    ### Differentials and Pullbacks

In some cases, computational tasks rely on fully extensible differential
operators as well as maximum efficiency, e.g. computing vector-Jacobian
products along with the original function's result. Luckily, the two operators
we mentioned at the very beginning when we introduced Jacobians are exactly the
ones we need: the differential and the pullback. We already have their raw
operators in the syntax, `#differential` and `#pullback`, but we can make them
nicer by redefining them as Swift functions.

    Function `differential(at:in:)` computes the differential of a closure at a
    certain point, and returns a linear map that takes a vector and returns
    Jacobian-vector products.

    ```swift
/// Computes the differential of `body` at `x`.
func differential<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T) -> R {
    return { v in #differential(body)(x)(v).1 }
}
    ```

    Function `differentialWithResult(at:in:)` computes the differential of a closure
    at a certain point, and returns a linear map that takes a vector and returns
    both the original function's result and Jacobian-vector products.

    ```swift
/// Computes the differential of `body` at `x` that also computes the value of
/// `body(x)`.
func differentialWithResult<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T) -> (originalResult: R, derivatives: R) {
    return #differential(body)(x)
}
    ```

    Function `pullback(at:in:)` computes the pullback of a closure at a certain
    point, and returns a linear map that takes a vector and returns vector-Jacobian
    products.

    ```swift
    /// Computes the pullback of `body` at `x`.
    func pullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
    ) rethrows -> @autodiff(linear) (R) -> T {
    return #pullback(body)(x).1
    }
    ```

    Function `resultWithPullback(at:in:)` computes the pullback of a closure at a
    certain point, and returns the original function's result and a linear map that
    takes a vector and returns vector-Jacobian products.

    ```swift
/// Computes the original value of `body(x)` and the pullback at `x`.
func resultWithPullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (originalResult: R, pullback: @autodiff(linear) (R) -> T) {
    return #pullback(body)(x)
}
    ```

It is remarkable that we are able to define every differential operator in
terms of other differential operators. `#differential` and `#pullback` have
become unnecessary because the functional forms are so much nicer, so we can
teach the compiler to recognize the Swift functions `differential(at:in:)` and
`pullback(at:in:)` as the built-in "canonical" differential operators, and
remove all raw differential operators that start with `#` from the language.

    Examples:

    1. Chain directional derivatives freely using differentials.

    ```swift
let x = 0.5
let df = differential(at: x) { x in
    sin(cos(x))
}
df(1)                           // df/dx
df(derivatives(of: log)(t))     // df/dt
df(derivatives(at: t, in: log)) // df/dt
    ```

    2. Chain gradients freely using pullbacks.
    ```swift
let x = 0.5
let (y, df) = resultWithPullback(at: x) { x in
    cos(sin(x))
}

df(1)                        // dy/dx
df(gradient(of: log)(t))     // dy/dt
df(gradient(at: t, in: log)) // dy/dt
    ```

    ### Hessian-Vector Products

    Second-order optimization methods in machine learning make use of
    [Hessians](https://en.wikipedia.org/wiki/Hessian_matrix) and Hessian-vector
    products, which can be hard to compute. Many AD libraries such as Autograd
    already support Hessians by supporting arbitrarily nested
    forward-mode/reverse-mode differentiation. Hessian-vector products can be
    efficiently computed by applying "forward-on-reverse", namely applying the
    composition of the forward-mode differential operator and the reverse-mode
    differential operator on a function.

    <p align="center">
    <img src="https://latex.codecogs.com/png.latex?\mathbf{H}_f(\mathbf{x})\mathbf{v}&space;=&space;\mathbf{J}_{\nabla&space;f}(\mathbf{x})\mathbf{v}" title="\mathbf{H}_f(\mathbf{x})\mathbf{v} = \mathbf{J}_{\nabla f}(\mathbf{x})\mathbf{v}" />
    </p>

    Just like other differential operators, we can define the Hessian-vector
    products operator in a simple, functional way.

    ```swift
func hvp<T: Differentiable, R: FloatingPoint>(
    at x: T, in f: @autodiff(order: 2) (T) -> R
) -> @autodiff(linear) (T) -> T {
    return differential(at: x, in: gradient(of: f))
}
    ```
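
A usage sketch: for f(x) = x⁴ the second derivative is 12x², so the
Hessian-vector product at x = 1 with vector 1 is 12 (again assuming scalars
conform to the protocols above).

```swift
let hv = hvp(at: 1.0) { x in x * x * x * x }
hv(1.0) // 12.0
```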

    Nested differentiation without a careful implementation is prone to a bug known
    as perturbation confusion
    [[1]](http://www.bcl.hamilton.ie/~qobi/nesting/papers/ifl2005.pdf)
    [[2]](https://arxiv.org/abs/1211.4892). Language-integrated AD in Swift will
    enforce tagging in compiler-generated code to guarantee the correctness of
    higher-order derivatives.

    ### Standard Library or an `AutomaticDifferentiation` Module?

    Earlier in this document, we discussed enhancements to standard library
    protocols and extensions to the standard library to model differentiable types.
    These protocols are general enough for standard library types such as floating
    point scalars (`Float`, `Double`, and `Float80`) and potentially [SIMD
    vectors](https://github.com/apple/swift-evolution/blob/master/proposals/0229-simd.md).
    However, in any general-purpose programming language, there is always a question
    of how much math the standard library should have.

    We think basic differential operators like `gradient(of:)` and
    `derivatives(of:)` should be included in the standard library, because they are
    common operators that one would find in college calculus, and they will make AD
    feel more language-integrated along with standard library protocols
    `VectorNumeric` and `Differentiable`.

    We do believe that other operators that contain terms like "Jacobian" and
    "differential" should be in a separate module, possibly called
    "AutomaticDifferentiation" that ships with the Swift language.

    ## Part 6: Generalized Types for Differentiation

    We introduced the `Differentiable` protocol that makes a type represent a vector
    space and be differentiable. However, there are a few scenarios where such a
    protocol won't work well.

    1. Customizable weight type

Orthogonal weight matrices have shown advantages in neural network training
[[1]](https://arxiv.org/abs/1702.00071)
[[2]](https://arxiv.org/abs/1709.06079). When differentiating through these
networks, gradients with respect to weights no longer stay orthogonal -
instead, they are skew-symmetric matrices. While we can represent both
orthogonal matrices and skew-symmetric matrices as values of a `Matrix` or
`Tensor` type and programmatically ensure orthogonality, some researchers
have been seeking a way to represent this natively in the type system of a
programming language and still have AD produce the correct derivative.

    2. Quantized training

    Quantization techniques store and calculate numbers in more compact formats,
    i.e. a fixed-point data type. Conceptually, a quantized tensor for a
    real-valued `Tensor` can be defined as the following struct:

    ```swift
public struct Quantized<Dequantized: Quantizable, QuantizedScalar: FixedWidthInteger> {
    var data: Dequantized
    var range: Range<Dequantized.Scalar>
    var scale: QuantizedScalar
    var zeroPoint: Int
}
    ```

We can think of a scenario where the developer defines a neural network as a
function whose parameters are of type `Quantized<Tensor<Float>, Int8>`. When
training the parameters of this neural network, gradients need to flow at a
significantly higher precision, but today's system cannot achieve that
because it assumes gradients have the same type as the original arguments.

    3. Generic optimizers

Optimization problems in machine learning can be generalized as optimization
on manifolds. Optimizers in most libraries assume the original space and the
loss space are both vector spaces, and perform an implicit conversion from
cotangent vectors to tangent vectors and another conversion from tangent
vectors to the original weight type when performing `θ -= η * ∂L/∂θ`. While
this works for most cases, it won't generalize to typed orthogonal
matrices, because orthogonal matrices are not vector spaces, and a conversion
from an orthogonal matrix to a skew-symmetric matrix cannot be implicit.

    ### Revise `Differentiable` Protocol

To address the concerns raised above, we found a more general way to model
differentiable types. Instead of requiring them to be vector spaces
(`VectorNumeric`), we model them as [differentiable
manifolds](https://en.wikipedia.org/wiki/Differentiable_manifold). Reverse-mode
differentiation of a function over manifolds produces gradient vectors in the
manifold's cotangent bundle; forward-mode differentiation produces derivatives
in its tangent bundle. Note that we cannot represent tangent/cotangent bundles
separately from the tangent/cotangent spaces inside each bundle, because Swift
does not have dependent types. By removing the restriction to `VectorNumeric`,
`Differentiable` is now fully extensible.

    ```swift
/// A type that mathematically represents a differentiable manifold whose
/// tangent spaces are finite-dimensional.
///
/// In automatic differentiation, differentiation will produce a Jacobian whose
/// elements are of `TangentVector` type.
public protocol Differentiable {
    /// The tangent vector space of this differentiable manifold.
    associatedtype TangentVector: VectorNumeric
        where TangentVector.Scalar: FloatingPoint

    /// The cotangent space of this differentiable manifold.
    associatedtype CotangentVector: VectorNumeric
        where CotangentVector.Scalar: FloatingPoint

    /// Returns `self` moved along the value space towards the given tangent
    /// vector. In Riemannian geometry (mathematics), this is usually equivalent
    /// to retraction or exponential map.
    func moved(toward direction: TangentVector) -> Self

    /// Converts a cotangent vector to its corresponding tangent vector.
    func tangentVector(from cotangent: CotangentVector) -> TangentVector
}
    ```

When the tangent vector type of a differentiable manifold is the same as its
cotangent vector type, we can provide a default implementation of
`tangentVector(from:)`, which is simply the identity function.

    ```swift
public extension Differentiable where TangentVector == CotangentVector {
    func tangentVector(from cotangent: CotangentVector) -> TangentVector {
        return cotangent
    }
}
    ```

When a differentiable manifold is a vector space, its tangent space is usually
itself. In these cases, we simply define `moved(toward:)` as vector addition.

    ```swift
public extension Differentiable
    where Self: VectorNumeric, TangentVector == Self {
    func moved(toward direction: TangentVector) -> Self {
        return self + direction
    }
}
    ```
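
As a sketch of a hand-written conformance (the type below is hypothetical, and
it assumes `Float` conforms to `VectorNumeric` as a zero-dimensional vector
space), consider a point on the unit circle, a one-dimensional manifold whose
tangent and cotangent spaces are both ℝ:

```swift
struct UnitCirclePoint: Differentiable {
    var angle: Float

    typealias TangentVector = Float
    typealias CotangentVector = Float

    // Moving along the manifold: rotate by the given tangent amount.
    func moved(toward direction: Float) -> UnitCirclePoint {
        return UnitCirclePoint(angle: angle + direction)
    }

    // `tangentVector(from:)` comes from the default implementation above,
    // since `TangentVector == CotangentVector`.
}
```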

    ### Deriving Conformances to `VectorNumeric` and `Differentiable`

    It is very common for numerical computing to deal with lots of parameters, each
    of which is a vector or a matrix. In these cases, instead of manually specifying
    each input in a differential operator's argument list, users would often like
    to differentiate through structures and obtain a structure of partial
derivatives. It is important for Swift to provide derived conformances for the
core numerical computing protocols: `Differentiable` and `VectorNumeric`.

    Mathematically, it is straightforward to represent product types. A struct or
tuple in Swift corresponds to a product of sets; an enum in Swift
corresponds to a sum of sets.

    ```swift
    struct Parameters: VectorNumeric, Differentiable {
    var a: Vector<Float>
    var b: Float
    }
    ```

Struct `Parameters` is equivalent to a product of the sets `Vector<Float>` and
`Float`, or a product of a real vector space `ℝⁿ` and a scalar field `ℝ`, namely
`ℝⁿ × ℝ`, which is also a vector space. To make `Parameters` obtain the traits
of a vector space, we extend the compiler to derive a conformance to
`VectorNumeric`, similar to how `Codable` and `Hashable` conformances are
derived. When a conformance clause is given in the current file and all
stored properties conform to `VectorNumeric` with the same `Scalar`, the
compiler synthesizes the conformance, with all protocol requirements
implemented property-wise.

    After deriving conformances to `VectorNumeric`:

    ```swift
    struct Parameters: VectorNumeric {
    var a: Vector<Float>
    var b: Float

    // derived:
    typealias Scalar = Float

    // derived:
    struct Shape {
    var a: Vector<Float>.Shape
    var b: Float.Shape
    }

    // derived:
    static func + (lhs: Parameters, rhs: Parameters) -> Parameters {
    return Parameters(a: lhs.a + rhs.a, b: lhs.b + rhs.b)
    }
    // ...
    }
    ```

In order for `Parameters` to be differentiable, it must also conform to
`Differentiable`. Deriving conformances to `Differentiable` follows the same
rules.

    ```swift
    struct MyShapes: Differentiable {
    var a: Circle // conforms to Differentiable
    var b: Square // conforms to Differentiable
    }
    ```

    After deriving conformances to `Differentiable`:

    ```swift
struct MyShapes: Differentiable {
    var a: Circle
    var b: Square

    // derived:
    struct TangentVector: VectorNumeric {
        var a: Circle.TangentVector
        var b: Square.TangentVector
    }
    // derived:
    struct CotangentVector: VectorNumeric {
        var a: Circle.CotangentVector
        var b: Square.CotangentVector
    }

    // derived:
    func moved(toward direction: TangentVector) -> MyShapes {
        return MyShapes(a: a.moved(toward: direction.a),
                        b: b.moved(toward: direction.b))
    }

    // derived:
    func tangentVector(from cotangent: CotangentVector) -> TangentVector {
        return TangentVector(a: a.tangentVector(from: cotangent.a),
                             b: b.tangentVector(from: cotangent.b))
    }
}
    ```

    With derived conformances to these protocols, the user can now write arbitrarily
    nested structs of differentiable manifolds, and make them differentiable with
    trivial effort, greatly simplifying the development.
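
For instance (a sketch reusing the hypothetical `Parameters` and `MyShapes`
types above), nesting such types composes automatically:

```swift
struct Model: Differentiable {
    var parameters: Parameters  // conforms via derived `VectorNumeric`/`Differentiable`
    var shapes: MyShapes        // conforms via derived `Differentiable`
}
// `Model.TangentVector`, `Model.CotangentVector`, `moved(toward:)`, and
// `tangentVector(from:)` are all derived property-wise as well.
```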

    ### Generalized Differential Operators

In the new `Differentiable` protocol, we added `TangentVector` and
`CotangentVector` associated types to represent the types of Jacobian-vector
products and vector-Jacobian products, respectively. We make the following
changes to the existing differential operators we introduced.
- Differential operators that return `T` as a forward-differentiated derivative
will return `T.TangentVector` instead.
- Differential operators that return `T` as a reverse-differentiated derivative
will return `T.CotangentVector` instead.
- Vectors `T` for computing Jacobian-vector products will become `T.TangentVector`.
- Vectors `T` for computing vector-Jacobian products will become `T.CotangentVector`.

    Here we list a few updated differential operators.

    ### Jacobian-Vector Products and Vector-Jacobian Products

    Jacobian-vector products (forward-mode) and vector-Jacobian products
    (reverse-mode) are extremely useful differential operators for lots of tasks in
    numerical computing.

    ```swift
/// Computes the Jacobian-vector products of `body` at `x`.
func jacobianVectorProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: T.TangentVector,
    in body: @autodiff(forward) (T) throws -> R
) rethrows -> R.TangentVector {
    return #differential(body)(x)(vector).1
}

/// Computes the vector-Jacobian products of `body` at `x`.
func vectorJacobianProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: R.CotangentVector,
    in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T.CotangentVector {
    return #pullback(body)(x).1(vector)
}
    ```

    ### Differentials and Pullbacks

    ```swift
/// Computes the differential of `body` at `x`.
func differential<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T.TangentVector) -> R.TangentVector {
    return { v in #differential(body)(x)(v).1 }
}

/// Computes the differential of `body` at `x` that also computes the value of
/// `body(x)`.
func differentialWithResult<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T.TangentVector) -> (originalResult: R, derivatives: R.TangentVector) {
    return #differential(body)(x)
}

/// Computes the pullback of `body` at `x`.
func pullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> @autodiff(linear) (R.CotangentVector) -> T.CotangentVector {
    return #pullback(body)(x).1
}

/// Computes the value of `body(x)` and the pullback at `x`.
func resultWithPullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (originalResult: R, pullback: @autodiff(linear) (R.CotangentVector) -> T.CotangentVector) {
    return #pullback(body)(x)
}
    ```

    ### Back to the Problems

    Recall that the motivation of introducing a general, future-proof
    `Differentiable` protocol is to be able to model the following use cases.

1. Neural networks with orthogonal weights can now be differentiated. We can
define a type called `OrthogonalMatrix` that conforms to `Differentiable`, and
another type `SkewSymmetricMatrix` that conforms to both `Differentiable` and
`VectorNumeric`.

    ```swift
struct SkewSymmetricMatrix: Differentiable, VectorNumeric {
    typealias Scalar = Float
    ...
}
struct OrthogonalMatrix: Differentiable {
    ...
    typealias TangentVector = SkewSymmetricMatrix
    typealias CotangentVector = SkewSymmetricMatrix
}
    ```

    When we differentiate a function `(OrthogonalMatrix) -> Float` using the
    reverse-mode differential operator, we'll get a function `(OrthogonalMatrix)
    -> SkewSymmetricMatrix`. Everything falls out, without type safety
    compromises.

    2. Differentiating a quantized network is now possible with AD.

    ```swift
// `Quantized` is a vector space when the dequantized type is one.
extension Quantized: VectorNumeric where Dequantized: VectorNumeric {
    typealias Scalar = Dequantized.Scalar
    static func + (lhs: Quantized, rhs: Quantized) -> Quantized {
        // Custom code: Dequantize, add, and requantize!
    }
    static func * (lhs: Scalar, rhs: Quantized) -> Quantized {
        // Custom code: Dequantize, multiply, and requantize!
    }
}

// `Quantized` is a differentiable manifold when the dequantized type is one.
extension Quantized: Differentiable where Dequantized: Differentiable {
    typealias TangentVector = Dequantized.TangentVector
    typealias CotangentVector = Dequantized.CotangentVector

    func moved(toward tangent: Dequantized.TangentVector) -> Quantized {
        // Custom code: Dequantize, move, and requantize!
    }
}
    ```

    With `Quantized` conforming to the new `Differentiable` protocol, when we
    differentiate a function of type `(Quantized<Tensor<Float>, Int8>) -> U`, AD
    produces a function of type `(Quantized<Tensor<Float>, Int8>) ->
Tensor<Float>`, which is essentially what we need in quantized training
of neural networks.

    3. Generic optimizers can be defined in terms of manifold optimization
    functions, without implicit casting.

    ```swift
    extension SGD {
    func fit(_ parameters: inout Parameters, gradients: Parameters) {
    parameters.update(withGradients: gradients) { θ, g in
    θ = θ.moved(toward: -θ.tangentVector(from: g) * learningRate)
    }
    }
    }
    ```

## Part 7: Customizable Differentiation

    Some machine learning models require manipulating the gradient with respect to
    certain values, e.g. gradient clipping.
    [Tangent](https://github.com/google/tangent) provides such a feature as a syntax
    extension in Python. Recurrent neural networks often suffer from the "exploding
    gradient" problem, and a typical solution is to force the gradient of an RNN to
    not exceed a certain value by performing gradient clipping.

    ```swift
func prediction(for input: Tensor<Float>) -> Tensor<Float> {
    var prediction = input
    for _ in 0...5 {
        // Clip the gradient that flows back through this point.
        prediction = prediction.withCustomizedGradient { grad in
            max(min(grad, 1), -1)
        }
        prediction = lstm.prediction(for: prediction)
    }
    return prediction
}
    ```

The APIs `withCustomizedGradient(_:)` and `withCustomizedDerivatives(_:)` look
like compiler-known functions that make Swift run custom code inside the
differentiated code. However, because of the generality of the [differential
registration](#differential-registration) mechanism, they can be defined
entirely as Swift functions with no special support from the compiler.
Here's the implementation of these APIs.

    ```swift
    public extension Differentiable {
    @differentiable(forward, wrt: self, tangent: tangentCustomizingDerivatives)
    func withCustomizedDerivatives(
    _ body: @nondiff (TangentVector) -> TangentVector
    ) -> Self {
    return self
    }

    internal func tangentCustomizingDerivatives(
    body: (TangentVector) -> TangentVector,
    originalResult: Self,
    tangent: TangentVector
    ) -> TangentVector {
    return body(tangent)
    }

    @differentiable(reverse, wrt: self, adjoint: adjointCustomizingGradient)
    func withCustomizedGradient(
    _ body: @nondiff (CotangentVector) -> CotangentVector
    ) -> Self {
    return self
    }

    internal func adjointCustomizingGradient(
    body: (CotangentVector) -> CotangentVector,
    originalResult: Self,
    adjoint: CotangentVector
    ) -> CotangentVector {
    return body(adjoint)
    }
    }
    ```
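
The forward-mode counterpart works the same way; for example (a sketch reusing
`prediction` from above), one could clamp derivatives instead of gradients:

```swift
let clamped = prediction.withCustomizedDerivatives { tangent in
    max(min(tangent, 1), -1)
}
```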

    This API supports many gradient manipulation tasks in machine learning
    optimization. For example, the user can make gradient computation trigger a
    break from the loop.

    ```swift
var prediction = input
for _ in 0...5 {
    // Stop the loop when necessary.
    var shouldStop = false
    prediction = prediction.withCustomizedGradient { grad in
        if grad < lowerBound {
            shouldStop = true
        }
        return grad
    }
    if shouldStop {
        break
    }
    prediction = lstm.prediction(for: prediction)
}
    ```

Setting a mutable flag is not the most user-friendly approach. We can create
APIs that wrap `withCustomizedDerivatives(_:)` and `withCustomizedGradient(_:)`
and return a `Bool`, so that later code can decide whether to `break` from the
loop based on that return value. Better still, if Swift ever supports non-local
control flow, i.e. branching out of nested closures, the code could be written
as a plain `break`.

    ```swift
var prediction = input
for _ in 0...5 {
    // Stop the loop when necessary.
    prediction = prediction.withCustomizedGradient { grad in
        if grad < lowerBound {
            break
        }
        return grad
    }
    prediction = lstm.prediction(for: prediction)
}
    ```

    ## Acknowledgements

    The author would like to thank Dan Zheng, Chris Lattner, Alex Wiltschko, Bart
    van Merriënboer, Gordon Plotkin, Dougal Maclaurin, Matthew Johnson, Casey Chu,
    Tim Harley, Marc Rasi, and Dmitri Gribenko for their input to the initial design
    of this powerful language feature.
  2. rxwei revised this gist Jun 17, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -9,7 +9,7 @@ programming language design community, with a strong focus on language design.

    **Status: Outdated**

    Please see [Swift Automatic Differentiation Design Overview] instead.
    **Please see [Swift Automatic Differentiation Design Overview](https://docs.google.com/document/d/1bPepWLfRQa6CtXqKA8CDQ87uZHixNav-TFjLSisuKag/edit?usp=sharing) instead.**

    ## Table of Contents

  3. rxwei revised this gist Jun 17, 2019. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -7,7 +7,9 @@ First-Class Automatic Differentiation in Swift: A Manifesto
    This document is written for both the machine learning community and the Swift
    programming language design community, with a strong focus on language design.

    **Status: Currently undergoing major revision.**
    **Status: Outdated**

    Please see [Swift Automatic Differentiation Design Overview] instead.

    ## Table of Contents

  4. rxwei revised this gist Dec 5, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -7,7 +7,7 @@ First-Class Automatic Differentiation in Swift: A Manifesto
    This document is written for both the machine learning community and the Swift
    programming language design community, with a strong focus on language design.

    Status: Currently undergoing major revision.
    **Status: Currently undergoing major revision.**

    ## Table of Contents

  5. rxwei revised this gist Nov 12, 2018. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -7,6 +7,8 @@ First-Class Automatic Differentiation in Swift: A Manifesto
    This document is written for both the machine learning community and the Swift
    programming language design community, with a strong focus on language design.

    Status: Currently undergoing major revision.

    ## Table of Contents

    - [Introduction](#introduction)
  6. rxwei revised this gist Oct 29, 2018. 2 changed files with 0 additions and 68 deletions.
    3 changes: 0 additions & 3 deletions .gitignore
    Original file line number Diff line number Diff line change
    @@ -1,3 +0,0 @@
    *.tex
    *.pdf
    auto/
    65 changes: 0 additions & 65 deletions reduced-differentiability-model.org
    Original file line number Diff line number Diff line change
    @@ -1,65 +0,0 @@
    #+TITLE: The Reduced Differentiability Model
    #+SUBTITLE: Using only Differentials and Adjoints

    * TODO Introduction



    * TODO Motivation



    * Solution

    ** Rule of First-Order Differentiability

    * A function is forward-differentiable if
    * it has a forward-differentiable body,
    * it has a differential, or
    * it has a reverse-differentiable /adjoint/.
    * A function is reverse-differentiable if
    * it has a reverse-differentiable body,
    * it has a reverse-differentiable /differential/, or
    * it has an /adjoint/.

    ** TODO Rule of Higher-Order Differentiability


    ** Simplified Differential and Adjoint Definition Syntax

    #+BEGIN_SRC swift
    extension Vector {
    @differentiable(wrt: self)
    static func * (lhs: Vector, rhs: Vector) -> Vector {
    return ... // non-differentiable

    adjoint(v: Vector) -> (Vector, Vector) {
    return (rhs * v, lhs * v)
    }
    }
    }
    #+END_SRC

    #+BEGIN_SRC swift
    @differentiable
    func cos(_ x: Vector) -> Vector {
    return ... // non-differentiable

    differential(v: Vector) -> Vector {
    return -sin(x) * v
    }
    }
    #+END_SRC

    #+BEGIN_SRC swift
    extension Tensor {
    @differentiable(wrt: self)
    func transposed() -> Tensor {
    return ... // non-differentiable

    adjoint(v: Tensor) -> Tensor {
    return v.transposed()
    }
    }
    }
    #+END_SRC
  7. rxwei revised this gist Oct 29, 2018. 2 changed files with 68 additions and 0 deletions.
    3 changes: 3 additions & 0 deletions .gitignore
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,3 @@
    *.tex
    *.pdf
    auto/
    65 changes: 65 additions & 0 deletions reduced-differentiability-model.org
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,65 @@
    #+TITLE: The Reduced Differentiability Model
    #+SUBTITLE: Using only Differentials and Adjoints

    * TODO Introduction



    * TODO Motivation



    * Solution

    ** Rule of First-Order Differentiability

    * A function is forward-differentiable if
    * it has a forward-differentiable body,
    * it has a differential, or
    * it has a reverse-differentiable /adjoint/.
    * A function is reverse-differentiable if
    * it has a reverse-differentiable body,
    * it has a reverse-differentiable /differential/, or
    * it has an /adjoint/.

    ** TODO Rule of Higher-Order Differentiability


    ** Simplified Differential and Adjoint Definition Syntax

    #+BEGIN_SRC swift
    extension Vector {
    @differentiable(wrt: self)
    static func * (lhs: Vector, rhs: Vector) -> Vector {
    return ... // non-differentiable

    adjoint(v: Vector) -> (Vector, Vector) {
    return (rhs * v, lhs * v)
    }
    }
    }
    #+END_SRC

    #+BEGIN_SRC swift
    @differentiable
    func cos(_ x: Vector) -> Vector {
    return ... // non-differentiable

    differential(v: Vector) -> Vector {
    return -sin(x) * v
    }
    }
    #+END_SRC

    #+BEGIN_SRC swift
    extension Tensor {
    @differentiable(wrt: self)
    func transposed() -> Tensor {
    return ... // non-differentiable

    adjoint(v: Tensor) -> Tensor {
    return v.transposed()
    }
    }
    }
    #+END_SRC
  8. rxwei revised this gist Oct 23, 2018. 1 changed file with 12 additions and 11 deletions.
    23 changes: 12 additions & 11 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -83,23 +83,24 @@ func f(_ x: Double, _ y: Double) -> Double {

    ### Vectors and Jacobians

    In numerical computing, users often write code that operate on high-dimensional
    In numerical computing, users often write code that operates on high-dimensional
    mathematical objects. The basic typing rules that we defined on real scalars
    (![](http://latex.codecogs.com/gif.latex?\mathbb{R})) can be generalized for
    [module](https://en.wikipedia.org/wiki/Module_(mathematics))-like types such as
    vectors with extra consideration for shape. In vector calculus, the
    differentiation of a function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}:\mathbb{R}^n\rightarrow\mathbb{R}^m) is
    defined per scalar because there are multiple inputs and multiple outputs. Full
    differentiation of vector-valued function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}) will result in a matrix,
    each of whose entries is a function that computes the partial derivative of an
    output scalar with respect to an input scalar. This matrix is called a
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}:\mathbb{R}^n\rightarrow\mathbb{R}^m)
    is defined per scalar because there are multiple inputs and multiple outputs.
    Full differentiation of a vector-valued function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}) will thus result in a
    matrix, each of whose entries is a function that computes the partial derivative
    of an output scalar with respect to an input scalar. This matrix is called a
    [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant). In
    this definition, the Jacobian matrix has type
    ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:(\mathbb{R}\rightarrow\mathbb{R})^{mn}).
    For simplicity, we will model it as a function that maps vectors
    to real-valued matrices ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:\mathbb{R}^n\rightarrow\mathbb{R}^{mn}).
    For simplicity, we will model it as a function that maps vectors to real-valued
    matrices
    ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:\mathbb{R}^n\rightarrow\mathbb{R}^{mn}).

    <p align="center">
    <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/74e93aa903c2695e45770030453eb77224104ee4"
    @@ -118,7 +119,7 @@ func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    Computing the Jacobian of a function is often unnecessary in gradient-based
    optimization methods. Computing a full Jacobian will require repeated
    evaluations of some primitives in computer code: vector-Jacobian products (VJPs)
    and Jacobian-vector products (JVPs), and VJPs and JVPs are often exactly what we
    or Jacobian-vector products (JVPs), and VJPs and JVPs are often exactly what we
    need in practice. In these terms, "vector" refers to a vector of partial
    derivatives that are to be chained with the Jacobian by left-multiplication or
    right-multiplication. As we explain chaining next, we discuss how Automatic
    @@ -1044,7 +1045,7 @@ and function convert back and forth through conversion thunks implicitly.
    ```swift
    // A "thin" function that captures no variables.
    // Its representation is `@convention(thin)` by default.
    func f(x: Int) {
    func f(x: Int) -> Int {
    return x
    }

  9. rxwei revised this gist Oct 23, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -477,7 +477,7 @@ multiplication). The protocol will be called `Arithmetic`.

    ```swift
    public protocol Arithmetic: Equatable {
    var zero: Self { get }
    static var zero: Self { get }
    prefix static func + (x: Self) -> Self
    static func + (lhs: Self, rhs: Self) -> Self
    static func += (lhs: inout Self, rhs: Self) -> Self
  10. rxwei revised this gist Oct 23, 2018. 1 changed file with 7 additions and 6 deletions.
    13 changes: 7 additions & 6 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -115,13 +115,14 @@ func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    where T: FloatingPoint
    ```

    Calculating the Jacobian of a function is often unnecessary in gradient-based
    Computing the Jacobian of a function is often unnecessary in gradient-based
    optimization methods. Computing a full Jacobian will require repeated
    evaluations of vector-Jacobian products (VJPs) and Jacobian-vector products
    (JVPs), but VJPs and JVPs are often what we need in practice. In these terms,
    "vector" refers to a vector of partial derivatives that are to be chained with
    the Jacobian by left-multiplication or right-multiplication. As we explain
    chaining next, we discuss how Automatic Differentiation comes in the picture.
    evaluations of some primitives in computer code: vector-Jacobian products (VJPs)
    and Jacobian-vector products (JVPs), and VJPs and JVPs are often exactly what we
    need in practice. In these terms, "vector" refers to a vector of partial
    derivatives that are to be chained with the Jacobian by left-multiplication or
    right-multiplication. As we explain chaining next, we discuss how Automatic
    Differentiation comes in the picture.

    ### Gradient and Reverse-Mode AD

  11. rxwei revised this gist Oct 23, 2018. 1 changed file with 16 additions and 17 deletions.
    33 changes: 16 additions & 17 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -116,12 +116,12 @@ func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    ```

    Calculating the Jacobian of a function is often unnecessary in gradient-based
    optimization methods. In practice, we care more about two byproducts of Jacobian
    calculation that are significantly easier to compute than the Jacobian itself:
    vector-Jacobian products and Jacobian-vector products. In these terms, "vector"
    refers to a vector of partial derivatives that are to be chained with the
    Jacobian by left-multiplication or right-multiplication. As we explain chaining
    next, we discuss how Automatic Differentiation comes in the picture.
    optimization methods. Computing a full Jacobian will require repeated
    evaluations of vector-Jacobian products (VJPs) and Jacobian-vector products
    (JVPs), but VJPs and JVPs are often what we need in practice. In these terms,
    "vector" refers to a vector of partial derivatives that are to be chained with
    the Jacobian by left-multiplication or right-multiplication. As we explain
    chaining next, we discuss how Automatic Differentiation comes in the picture.

    ### Gradient and Reverse-Mode AD

    @@ -469,10 +469,10 @@ On the Swift forum, we have discussed the [fundamental blocker for vector types
    to conform to the existing `Numeric`
    protocol](https://forums.swift.org/t/should-numeric-not-refine-expressiblebyintegerliteral).
    The consensus was to introduce a weakening of the `Numeric` protocol to
    represent the abstractions shared between scalars and vectors:
    [rng](https://en.wikipedia.org/wiki/Rng_(algebra)) (We assumed that vector spaces
    are rngs by endowing them with `*` as element-wise multiplication). The protocol
    will be called `Arithmetic`.
    represent the abstractions shared between scalars and vectors: [rng (ring
    without unity)](https://en.wikipedia.org/wiki/Rng_(algebra)) (We assumed that
    vector spaces are rngs by endowing them with `*` as element-wise
    multiplication). The protocol will be called `Arithmetic`.

    ```swift
    public protocol Arithmetic: Equatable {
    @@ -502,13 +502,12 @@ public protocol Numeric: Arithmetic, ExpressibleByIntegerLiteral {

    After we introduce the `Arithmetic` protocol, which makes the standard library
    suitable for vector APIs and beyond, we can define a protocol that generalizes
    vectors. Mathematically, a vector space is a rng if we endow them with `*` as
    element-wise multiplication. We represent vector spaces through the
    `VectorNumeric` protocol as follows. `Scalar` is the type of the elements
    of this vector space -- the field which the vector space is over.
    `Shape` is the shape of this vector space, which is
    customizable. The initializer takes a value of the `Scalar` type and a
    `Shape` and returns a vector of the specified shape.
    vectors. Mathematically, a vector space is a ring without unity if we endow them
    with `*` as element-wise multiplication. We represent vector spaces through the
    `VectorNumeric` protocol as follows. `Scalar` is the type of the elements of
    this vector space -- the field which the vector space is over. `Shape` is the
    shape of this vector space, which is customizable. The initializer takes a value
    of the `Scalar` type and a `Shape` and returns a vector of the specified shape.

    ```swift
    /// A type that represents an unranked vector space. Values of this type are
  12. rxwei revised this gist Oct 22, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -390,7 +390,7 @@ func hvp<T: Differentiable, R: FloatingPoint>(

    By building first-class AD into the programming language, we can provide better
    diagnostics about differentiability and numeric stability than any other dynamic
    language, all at compile-time.
    languages, all at compile-time.

    ```console
    test.swift:58:10: error: function is not differentiable
  13. rxwei revised this gist Oct 22, 2018. 1 changed file with 5 additions and 4 deletions.
    9 changes: 5 additions & 4 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -675,8 +675,6 @@ There are five options for differentiability:

    2. Reverse: `@differentiable(reverse, adjoint: ...)`

    This option says that the function is reverse-mode differentiable.

    This option says that the function is reverse-mode differentiable.
    Reverse-mode differentiation requires the "adjoint code" (or adjoint
    function) of this function, so that Swift knows how to compute the function's
    @@ -697,7 +695,7 @@ There are five options for differentiability:

    By definition, constant functions always have zero derivatives and are
    differentiable at any arbitrary order. So differentiating this function will
    result into a vector (or vectors, when the function has multiple
    result into a zero vector (or vectors, when the function has multiple
    differentiation arguments) with the same shape as each differentiation
    argument.

    @@ -885,7 +883,10 @@ expression = autodiff-expression
    Gradient and derivatives are two special cases of differentiation where the
    output or the result is a scalar, respectively. When they are not a scalar,
    vector-Jacobian products and Jacobian-vector products are being computed with a
    vector. We add two extra differential operators which will be useful for
    vector. These cases are not obvious, but are required for modular machine
    learning APIs where each neural network layer defines a back-propagation method
    that takes a partial derivative vector back-propagated from the previous layer.
    As such, we add two extra differential operators which will be useful for
    computing these products.

    - `#differential(f)`: Produces a function that takes the original arguments and
  14. rxwei revised this gist Oct 21, 2018. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -1242,8 +1242,8 @@ type, forming a "normal" function. This allows us to define generic algorithms
    using differentiation, without specializing them on function types of each
    differentiability.

    The following table shows whether each differentiability can be converted to
    another.
    The following table shows whether each differentiability (as a column label) can
    be converted to another (as a row label).

    | Convertible to: | None | Linear | Constant | Forward | Reverse | Bidirectional |
    |-----------------|------|-----------|----------|---------|---------|---------------|
  15. rxwei revised this gist Oct 21, 2018. 1 changed file with 3 additions and 2 deletions.
    5 changes: 3 additions & 2 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -882,11 +882,12 @@ expression = autodiff-expression

    ### Embrace Generality: Vector-Jacobian Products and Jacobian-Vector Products

    While gradient and derivatives are two special cases of differentiation where
    the output or the result is a scalar, respectively. When they are not a scalar,
    Gradient and derivatives are two special cases of differentiation where the
    output or the result is a scalar, respectively. When they are not a scalar,
    vector-Jacobian products and Jacobian-vector products are being computed with a
    vector. We add two extra differential operators which will be useful for
    computing these products.

    - `#differential(f)`: Produces a function that takes the original arguments and
    returns the differential of `f`.
    - `#pullback(f)`: Produces a function that takes the original arguments and
  16. rxwei revised this gist Oct 21, 2018. 1 changed file with 8 additions and 7 deletions.
    15 changes: 8 additions & 7 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -116,9 +116,8 @@ func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    ```

    Calculating the Jacobian of a function is often unnecessary in gradient-based
    optimization methods, and is often unnecessary in gradient-based optimization
    methods. In practice, we care more about two byproducts of Jacobian calculation
    that are significantly easier to compute than the Jacobian itself:
    optimization methods. In practice, we care more about two byproducts of Jacobian
    calculation that are significantly easier to compute than the Jacobian itself:
    vector-Jacobian products and Jacobian-vector products. In these terms, "vector"
    refers to a vector of partial derivatives that are to be chained with the
    Jacobian by left-multiplication or right-multiplication. As we explain chaining
    @@ -140,11 +139,13 @@ row in the matrix, which is exactly the
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    </p>

    When this vector ![](http://latex.codecogs.com/gif.latex?\mathbf{v}^i) represents the
    gradient of another function ![](http://latex.codecogs.com/gif.latex?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    When vector ![](http://latex.codecogs.com/gif.latex?\mathbf{v}) in
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}\mathbf{J_f}(\mathbf{x}))
    represents the gradient of another function
    ![](http://latex.codecogs.com/gif.latex?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}(\mathbf{x})), namely
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{f_i(\mathbf{x})}),
    then the vector-Jacobian products will represent
    then the vector-Jacobian products represents
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{\mathbf{x}}). The
    linear function that takes a vector and left-multiplies it with the Jacobian is
    also called a
    @@ -514,7 +515,7 @@ customizable. The initializer takes a value of the `Scalar` type and a
    /// elements in this vector space and with a specific shape.
    public protocol VectorNumeric: Arithmetic {
    /// The type of scalars in the vector space.
    associatedtype Scalar : Numeric
    associatedtype Scalar: Numeric

    /// The type whose values specifies the shape of an object in the vector
    /// space.
  17. rxwei revised this gist Oct 21, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    @@ -89,7 +89,7 @@ mathematical objects. The basic typing rules that we defined on real scalars
    [module](https://en.wikipedia.org/wiki/Module_(mathematics))-like types such as
    vectors with extra consideration for shape. In vector calculus, the
    differentiation of a function
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^n\rightarrow\mathbb{R}^m) is
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}:\mathbb{R}^n\rightarrow\mathbb{R}^m) is
    defined per scalar because there are multiple inputs and multiple outputs. Full
    differentiation of vector-valued function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}) will result in a matrix,
  18. rxwei revised this gist Oct 21, 2018. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions ad-manifesto.md
    @@ -2044,5 +2044,5 @@ for _ in 0...5 {

    The author would like to thank Dan Zheng, Chris Lattner, Alex Wiltschko, Bart
    van Merriënboer, Gordon Plotkin, Dougal Maclaurin, Matthew Johnson, Casey Chu,
    Tim Harley, Marc Rasi, and Dmitri Gribenko for their input to the design of this
    powerful language feature.
    Tim Harley, Marc Rasi, and Dmitri Gribenko for their input to the initial design
    of this powerful language feature.
  19. rxwei revised this gist Oct 21, 2018. 1 changed file with 33 additions and 16 deletions.
    49 changes: 33 additions & 16 deletions ad-manifesto.md
    @@ -324,10 +324,10 @@ for (x, y) in minibatches {
    We want our AD system to be fully extensible to the point where users can
    request derivatives of a function taking their own user-defined numeric types,
    and even use this feature to implement structure-dependent algorithms such as
    tree-recursive neural networks. Therefore, AD makes no assumptions about
    individual math functions or the types it should support. We enable library
    designers and developers to easily define any type or differentiable functions,
    all in pure Swift code.
    tree-recursive neural networks. Therefore, when performing AD, Swift makes no
    special assumptions about individual math functions or the types it should
    support. We enable library designers and developers to easily define any type or
    differentiable functions, all in pure Swift code.

    Swift supports [protocol-oriented programming and first-class value
    semantics](https://developer.apple.com/videos/play/wwdc2015/408/). AD is deeply
    @@ -341,8 +341,10 @@ extension MyType: Differentiable {
    }
    ```

    Or make obviously non-differentiable functions differentiable by using the
    `@differentiable` attribute:
    Or make an obviously non-differentiable function differentiable by using the
    `@differentiable` attribute, specifying a "tangent" function for computing its
    Jacobian-vector products, or an "adjoint" function for computing its
    vector-Jacobian products.

    ```swift
    @differentiable(tangent: tangentFoo, adjoint: adjointFoo)
    @@ -377,9 +379,8 @@ trigger differentiation as needed.

    ```swift
    func hvp<T: Differentiable, R: FloatingPoint>(
    at x: T, in f: @autodiff(order: 2) (T.CotangentVector) -> R
    ) -> @autodiff(linear) (T.TangentVector) -> T.CotangentVector.TangentVector
    where T.CotangentVector: Differentiable {
    at x: T, in f: @autodiff(order: 2) (T) -> R
    ) -> @autodiff(linear) (T) -> T {
    return differential(at: x, in: gradient(of: f))
    }
    ```
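
A possible use of the revised `hvp(at:in:)` above, sketched under the
assumption that `Vector2<Float>` conforms to `Differentiable` and supports the
initializer and element access shown (the names here are illustrative):

```swift
// f(x) = x₀² + 3·x₀·x₁; its Hessian at any point is [[2, 3], [3, 0]].
func f(_ x: Vector2<Float>) -> Float {
    return x[0] * x[0] + 3 * x[0] * x[1]
}

// `hvp(at:in:)` returns a linear map that multiplies the Hessian of `f` at
// the given point by a tangent vector, computed forward-over-reverse.
let hessianTimes = hvp(at: Vector2<Float>(1, 2), in: f)
let hv = hessianTimes(Vector2<Float>(1, 0))   // expected: (2, 3)
```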
    @@ -416,16 +417,31 @@ imperative.
    | | Syntax | Meaning |
    |------------|--------|-------------|
    | Functional | `let 𝝯f = gradient(of: f)`<br/>`𝝯f(x)` | Differentiating a function |
    | Imperative | `y = f(x)`<br/>`gradient(of: y, wrt: x)` | Differentiating code traced through data flow |
    | Imperative | `let y = f(x)`<br/>`gradient(of: y, wrt: x)` | Differentiating code traced through data flow |

    Functional-style AD is transforming one function to another, producing a
    function that takes original arguments and returns the partial derivatives
    evaluated at each argument. Imperative-style AD, on the other hand, is a
    value-value dependency analysis. Although we use both notations in mathematics,
    imperative AD comes at the cost of semantic inconsistency with the host
    language. In Swift's AD system, we believe we can achieve the same level of
    expressivity as imperative AD while preserving functional properties, and use
    language integration to push developers' productivity to the next level.
    language, for example:

    ```swift
    let y = f(x)
    x = 3
    gradient(of: y, wrt: x) // undefined
    ```

    Semantically, `y` is a value, but `x` is both a value and a reference to a
    memory location -- it is unclear what exactly we are differentiating with
    respect to. Though making `y` and `x` have reference types could make this
    particular example work out semantically, it would be fundamentally inconsistent
    with Swift's core design where mathematical objects have value types, and would
    also make scalar types like `Float` incompatible with automatic differentiation.

    We believe Swift's AD can achieve the same level of expressivity as imperative
    AD while preserving functional properties, and use language integration to push
    developers' productivity to the next level.
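
For contrast, the same idea in the functional style is a plain function
transformation, with no hidden dependence on mutable state (a minimal sketch
using the `gradient(of:)` operator from the table above):

```swift
func f(_ x: Float) -> Float {
    return x * x
}

let dfdx = gradient(of: f)   // (Float) -> Float
dfdx(3)                      // 6
let y = f(3)                 // evaluating f afterwards changes nothing
```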


    ## Part 1: Differentiable Types
    @@ -559,8 +575,9 @@ based on:

    As such we provide a syntactic way of specifying the differentiability of a
    function, using either the function's linearity properties or a separate
    function to specify the "tangent code" or "adjoint code" for the original
    function.
    function to specify the "tangent code", which specifies how to differentiate the
function in forward mode, or "adjoint code", which specifies how to
    differentiate the function in reverse mode.

    ### The `@differentiable` attribute

    @@ -622,7 +639,7 @@ public func conv2d(_ input: Tensor<Float>, filter: Tensor<Float>,

    func adjointConv2D(_ input: Tensor<Float>, filter: Tensor<Float>,
    strides: (Int32, Int32, Int32, Int32),
    padding: Padding) -> Tensor<Float> {
    padding: Padding) -> (Tensor<Float>, Tensor<Float>) {
    ...
    }
    ```
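
The same registration pattern on a smaller function may make the adjoint's
shape clearer. This is a sketch only; the parameter labels mirror the
`adjointBar` examples that appear in later revisions, and
`multiply`/`adjointMultiply` are made-up names:

```swift
// Reverse-mode primitive registration: the adjoint receives the original
// arguments, the original result `y`, and the incoming adjoint (seed), and
// returns one partial derivative per argument.
@differentiable(reverse, adjoint: adjointMultiply)
func multiply(_ a: Float, _ b: Float) -> Float {
    return a * b
}

func adjointMultiply(_ a: Float, _ b: Float, y: Float, adjoint: Float)
    -> (Float, Float) {
    return (adjoint * b, adjoint * a)
}
```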
  20. rxwei revised this gist Oct 21, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    @@ -141,7 +141,7 @@ row in the matrix, which is exactly the
    </p>

    When this vector ![](http://latex.codecogs.com/gif.latex?\mathbf{v}^i) represents the
    gradient of another function ![](http://latex.codecogs.com/latex.gif?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    gradient of another function ![](http://latex.codecogs.com/gif.latex?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}(\mathbf{x})), namely
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{f_i(\mathbf{x})}),
    then the vector-Jacobian products will represent
  21. rxwei revised this gist Oct 21, 2018. 1 changed file with 7 additions and 5 deletions.
    12 changes: 7 additions & 5 deletions ad-manifesto.md
    @@ -59,11 +59,13 @@ maps points onto their corresponding slopes.
    In the context of Swift, differentiating a function `(Float) -> Float` produces
    `(Float) -> Float`. Functions with multiple arguments, such as `(Float, Float)
    -> Float`, can be thought of as a function whose input domain is a product of
    those arguments types, i.e. `(ℝ ⨯ ℝ) → ℝ`, so the derivative of such a function
    has type `(Float, Float) -> (Float, Float)`. According to this typing rule, the
    differential operator ![](http://latex.codecogs.com/gif.latex?\dfrac{d}{dx}) can
    be declared as a higher-order function, overloaded for each number of arguments
    because a Swift function's argument list is not formally modeled as a tuple.
those argument types, i.e.
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}),
    so the derivative of such a function has type `(Float, Float) -> (Float,
    Float)`. According to this typing rule, the differential operator
    ![](http://latex.codecogs.com/gif.latex?\dfrac{d}{dx}) can be declared as a
    higher-order function, overloaded for each number of arguments because a Swift
    function's argument list is not formally modeled as a tuple.

    ```swift
    func 𝒟<T: FloatingPoint>(_ f: (T) -> T) -> (T) -> T
  22. rxwei revised this gist Oct 21, 2018. 1 changed file with 28 additions and 17 deletions.
    45 changes: 28 additions & 17 deletions ad-manifesto.md
    @@ -46,8 +46,11 @@ programming language.

    ### Basic Calculus

    In basic calculus, differentiating a function of type `ℝ → ℝ` produces a function
    `ℝ → ℝ` that maps points onto their corresponding slopes.
    In basic calculus, differentiating a function of type
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\rightarrow\mathbb{R})
    produces a function
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\rightarrow\mathbb{R}) that
    maps points onto their corresponding slopes.

    <p align="center">
    <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/9315f1516ee5847107808697e43693d91abfc6e8"
    @@ -80,18 +83,21 @@ func f(_ x: Double, _ y: Double) -> Double {

    In numerical computing, users often write code that operate on high-dimensional
    mathematical objects. The basic typing rules that we defined on real scalars
(`ℝ`) can be generalized for
    (![](http://latex.codecogs.com/gif.latex?\mathbb{R})) can be generalized for
    [module](https://en.wikipedia.org/wiki/Module_(mathematics))-like types such as
    vectors with extra consideration for shape. In vector calculus, the
    differentiation of a function `f: ℝⁿ → ℝᵐ` is defined per scalar because there
    are multiple inputs and multiple outputs. Full differentiation of vector
    function `f` will result in a matrix, each of whose entries is a function that
    computes the partial derivative of an output scalar with respect to an input
    scalar. This matrix is called a
    differentiation of a function
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^n\rightarrow\mathbb{R}^m) is
    defined per scalar because there are multiple inputs and multiple outputs. Full
    differentiation of vector-valued function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}) will result in a matrix,
    each of whose entries is a function that computes the partial derivative of an
    output scalar with respect to an input scalar. This matrix is called a
    [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant). In
    this definition, the Jacobian matrix has type: `J: (ℝ → ℝ)ᵐⁿ`. For simplicity,
    we will model it as a function that maps vectors to real-valued matrices `J: ℝⁿ
    → ℝᵐⁿ`.
    this definition, the Jacobian matrix has type
    ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:(\mathbb{R}\rightarrow\mathbb{R})^{mn}).
    For simplicity, we will model it as a function that maps vectors
    to real-valued matrices ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:\mathbb{R}^n\rightarrow\mathbb{R}^{mn}).

    <p align="center">
    <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/74e93aa903c2695e45770030453eb77224104ee4"
    @@ -119,9 +125,11 @@ next, we discuss how Automatic Differentiation comes in the picture.
    ### Gradient and Reverse-Mode AD

    When we let a [one-hot](https://en.wikipedia.org/wiki/One-hot) row vector
    `vⁱ: ℝᵐ = onehot(i)` left-multiply a
    Jacobian matrix of type `ℝᵐⁿ`, we are selecting one row in the matrix,
    which is exactly the [gradient](https://en.wikipedia.org/wiki/Gradient) of
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}^i:\mathbb{R}^m=\Big[0\cdots_{i-1}1\cdots0\Big])
    left-multiply a Jacobian matrix of type
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^{mn}), we are selecting one
    row in the matrix, which is exactly the
    [gradient](https://en.wikipedia.org/wiki/Gradient) of
    ![](http://latex.codecogs.com/gif.latex?f_i) evaluated at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{x}), i.e.
    ![](http://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})).
    @@ -130,7 +138,8 @@ which is exactly the [gradient](https://en.wikipedia.org/wiki/Gradient) of
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    </p>

    When this vector `vⁱ` represents the gradient of another function `g: ℝᵐ → ℝ` at
    When this vector ![](http://latex.codecogs.com/gif.latex?\mathbf{v}^i) represents the
    gradient of another function ![](http://latex.codecogs.com/latex.gif?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}(\mathbf{x})), namely
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{f_i(\mathbf{x})}),
    then the vector-Jacobian products will represent
    @@ -165,8 +174,10 @@ partial derivatives from the final output, eventiually reaching each input.

    ### Directional Derivatives and Forward-Mode AD

    Similarly, when we let a column vector `v: ℝⁿ¹` right-multiply a Jacobian value
    matrix of type `ℝᵐⁿ`, the result is a vector whose elements are exactly the
    Similarly, when we let a column vector
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}:\mathbb{R}^{n1}) right-multiply a
    Jacobian value
    matrix of type ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^{mn}), the result is a vector whose elements are exactly the
    [directional derivatives](https://en.wikipedia.org/wiki/Directional_derivative)
    of each ![](http://latex.codecogs.com/gif.latex?f_i) evaluated at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{x}) in direction ![](http://latex.codecogs.com/gif.latex?\mathbf{v}).
  23. rxwei revised this gist Oct 21, 2018. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions ad-manifesto.md
    @@ -127,7 +127,7 @@ which is exactly the [gradient](https://en.wikipedia.org/wiki/Gradient) of
    ![](http://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})).

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_0}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_0} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    </p>

    When this vector `vⁱ` represents the gradient of another function `g: ℝᵐ → ℝ` at
    @@ -143,7 +143,7 @@ body of this function can be defined in terms of `𝒟`, the differential operat
    that returns a Jacobian.

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\dfrac{\partial&space;g(\mathbf{f}(\mathbf{x}))}{\partial&space;\mathbf{x}}=\dfrac{\partial&space;g}{\partial&space;\mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x})&space;=&space;\bigg[&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_0}&space;\&space;\cdots&space;\&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_n}&space;\bigg]" title="\dfrac{\partial g(\mathbf{f}(\mathbf{x}))}{\partial \mathbf{x}}=\dfrac{\partial g}{\partial \mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x}) = \bigg[ \dfrac{\partial g(\mathbf{x})}{\partial x_0} \ \cdots \ \dfrac{\partial g(\mathbf{x})}{\partial x_n} \bigg]" />
    <img src="https://latex.codecogs.com/gif.latex?\dfrac{\partial&space;g(\mathbf{f}(\mathbf{x}))}{\partial&space;\mathbf{x}}=\dfrac{\partial&space;g}{\partial&space;\mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x})&space;=&space;\bigg[&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_n}&space;\bigg]" title="\dfrac{\partial g(\mathbf{f}(\mathbf{x}))}{\partial \mathbf{x}}=\dfrac{\partial g}{\partial \mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x}) = \bigg[ \dfrac{\partial g(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial g(\mathbf{x})}{\partial x_n} \bigg]" />
    </p>

    ```swift
    @@ -172,7 +172,7 @@ of each ![](http://latex.codecogs.com/gif.latex?f_i) evaluated at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{x}) in direction ![](http://latex.codecogs.com/gif.latex?\mathbf{v}).

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_0}(\mathbf{x})\&space;\cdots\&space;\nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" title="\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_0}(\mathbf{x})\ \cdots\ \nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" />
    <img src="https://latex.codecogs.com/gif.latex?\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_1}(\mathbf{x})\&space;\cdots\&space;\nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" title="\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_1}(\mathbf{x})\ \cdots\ \nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" />
    </p>

    The linear function that takes a vector and right-multiplies the Jacobian value
  24. rxwei revised this gist Oct 20, 2018. 1 changed file with 84 additions and 15 deletions.
    99 changes: 84 additions & 15 deletions ad-manifesto.md
    @@ -1007,20 +1007,89 @@ Turns out, this is not a new problem - we should learning from how we deal with
    calling conventions in Swift. Functions with different calling conventions have
    different type signatures, e.g. `@convention(thick)` and `@convention(thin)`,
and functions convert back and forth through conversion thunks implicitly.

    ```swift
    // A "thin" function that captures no variables.
    // Its representation is `@convention(thin)` by default.
    func f(x: Int) {
    return x
    }

    var globalVar = 30

    // A "thick" function that captures the value of `globalVar`.
    // Its representation is `@convention(thick)` by default.
    let g = { x in globalVar + x }

    // A higher-order function.
    // The closure argument `h`'s representation is `@convention(thick)`, because it should
    // be able to take closures that capture variables.
func takeFunc(_ h: (Int) -> Int) { ... }

takeFunc(f) // `f` is implicitly converted to a `@convention(thick)` closure by
    // creating a conversion thunk.
    takeFunc(g) // `g` is thick already. No conversion needed.
    ```

    Sometimes, different conventions have different binary representations for
    storing captured variables and such. In AD, the only difference between a
    non-differentiable function and a differentiated function (say, in reverse mode)
    is whether the function carries a few other function pointers that represent the
    function's adjoint code, so we can simply add a thicker representation!

    At first glance, this could even be an addition to the existing
    `@convention` attribute as something like `@convention(differentiable)`,
    however, differentiability does not align semantically with `@convention`.
    First, when a function becomes its differentiable (or differentiated) form, its
    original calling convention is not changed. Second, functions with any
    convention is technically differentiable, including `thin`, `thick`, `method`,
    etc. Therefore, we need a separate dimension of "thickness" in the function
    type: differentiability.
    storing captured variables and such, just like the example with `f` and `g`
    above. In AD, the only difference between a non-differentiable function and a
    differentiated function (say, in reverse mode) is whether the function carries a
    few other function pointers that represent the function's adjoint code, so we
    can model differentiable functions using a "thicker" function type, which
    bundles the original function representation along with pointers to the original
    function's Jacobian-vector product functions and/or vector-Jacobian product
    functions. When a normal function with a visible body gets passed as an
    `@autodiff` function, the function will be differentiated.

    ```swift
    // `f` is a normal function that has type `(Float) -> Float`.
    func f(x: Float) -> Float {
    return sin(x)
    }

// `f` gets implicitly converted (or more accurately, differentiated).
    let g = f as @autodiff (Float) -> Float

    func takesFunc(_ someFunc: @autodiff (Float) -> Float) {
    #derivatives(someFunc)
    ...
    }

// At the callsite of `takesFunc(_:)`, `f` gets implicitly differentiated to become
    // `@autodiff (Float) -> Float`.
    takesFunc(f)
    ```

    If a normal function does not have a visible body, then it cannot be passed as
    an `@autodiff` function. Swift will show an error at compile-time.

    ```swift
    var normalFuncWithOpaqueBody: (Float) -> Float = ...

    takesFunc(normalFuncWithOpaqueBody)
    ```

    ```console
    test.swift:19:11: error: function is not differentiable, but the contextual type is
    '@autodiff (Float) -> Float'
    takesFunc(normalFuncWithOpaqueBody)
    ^~~~~~~~~~~~~~~~~~~~~~~~

    test.swift:17:4: note: value defined here
    var normalFuncWithOpaqueBody: (Float) -> Float = ...
    ^~~~~~~~~~~~~~~~~~~~~~~~
    ```

    At first glance, this could even be an addition to the existing `@convention`
attribute as something like `@convention(autodiff)`; however, differentiability
does not align semantically with `@convention`. First, when a function becomes
its differentiable (or differentiated) form, its original calling convention is
not changed. Second, functions with any convention are technically
differentiable, including `thin`, `thick`, `method`, etc. Third,
    differentiability is not the only information that needs to be encoded --
    there's also the order of differentiation. Therefore, we need a separate
    dimension of "thickness" in the function type: differentiability.
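
A sketch of what that separate dimension could look like in source, reusing
the attribute spellings that appear elsewhere in this document (the exact
pairings below are illustrative assumptions):

```swift
// Differentiability is orthogonal to calling convention: thin functions,
// closures, and methods can all be given an @autodiff type.
func square(_ x: Float) -> Float { return x * x }

let f: @autodiff (Float) -> Float = square                  // first-order
let g: @autodiff(order: 2) (Float) -> Float = { $0 * $0 }   // at least 2nd order
let h: @autodiff(linear) (Float) -> Float = { 2 * $0 }      // linear map
```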

    We define a new formalization of differentiability in Swift's type system,
    including an `@autodiff` function type attribute, an extension to functions'
    @@ -1237,7 +1306,7 @@ As we can see, since we are to differentiate a higher-order function's argument
    (thanks to Generalized Differentiability), we can define `derivatives(of:)` and
    `gradient(of:)` as Swift functions in terms of more general raw differential
    operators, `#differential` and `#pullback`, to replace `#derivatives` and
    `#derivatives`!
    `#gradient`!

    These differential operators work seamlessly with closure captures,
    error-throwing functions, or arbitrary side-effecting code that do not
    @@ -1827,7 +1896,7 @@ Recall that the motivation of introducing a general, future-proof
    ```swift
    extension SGD {
    func fit(_ parameters: inout Parameters, gradients: Parameters) {
    parameters.update(with: gradients) { θ, g in
    parameters.update(withGradients: gradients) { θ, g in
    θ = θ.moved(toward: -θ.tangentVector(from: g) * learningRate)
    }
    }
  25. rxwei revised this gist Oct 20, 2018. 1 changed file with 10 additions and 5 deletions.
    15 changes: 10 additions & 5 deletions ad-manifesto.md
    @@ -100,7 +100,7 @@ we will model it as a function that maps vectors to real-valued matrices `J: ℝ

    While it is challenging to define this function with full type safety in Swift
    because shapes cannot be generic parameters yet, we can define a differential
    operator as the following, specialized on shape.
    operator as the following, specialized on shapes.

    ```swift
    func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    @@ -309,9 +309,9 @@ for (x, y) in minibatches {
    ### Full Extensibility: Custom Types and Derivatives

    We want our AD system to be fully extensible to the point where users can
    request the derivatives of a function taking their own user-defined numeric
    types, and even use this feature to implement structure-dependent algorithms
    such as tree-recursive neural networks. Therefore, AD makes no assumptions about
    request derivatives of a function taking their own user-defined numeric types,
    and even use this feature to implement structure-dependent algorithms such as
    tree-recursive neural networks. Therefore, AD makes no assumptions about
    individual math functions or the types it should support. We enable library
    designers and developers to easily define any type or differentiable functions,
    all in pure Swift code.
    @@ -355,7 +355,12 @@ All differential operators are defined in Swift, and developers can create their
    own differential operators by composing existing ones. For example, the user can
    use the "forward-on-reverse" approach to compute [Hessian-vector
    products](https://en.wikipedia.org/wiki/Hessian_matrix), where the `hvp(at:in:)`
    operator is defined as a native Swift function.
    operator is defined as a native Swift function. The [`@autodiff(order:
    2)`](#the-autodiff-function-type-attribute) attribute in the closure type
    signature marks the closure argument as being differentiable up to at least the
    2nd order, so that the caller of `hvp(at:in:)` will differentiate the actual
closure argument as needed.

    ```swift
    func hvp<T: Differentiable, R: FloatingPoint>(
  26. rxwei revised this gist Oct 20, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    @@ -829,7 +829,7 @@ func g(_ x: Float) -> (Vector<Float>, Vector<Float>) {
    return x w
    }

    #derivatives(f) // (Float) -> (Vector<Float>, Vector<Float>)
    #derivatives(g) // (Float) -> (Vector<Float>, Vector<Float>)
    ```

    The grammar of these raw differential operators is defined as follows:
  27. rxwei revised this gist Oct 20, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    @@ -829,7 +829,7 @@ func g(_ x: Float) -> (Vector<Float>, Vector<Float>) {
    return x w
    }

    #gradient(f) // (Vector<Float>, Vector<Float>) -> (Vector<Float>, Vector<Float>)
    #derivatives(f) // (Float) -> (Vector<Float>, Vector<Float>)
    ```

    The grammar of these raw differential operators is defined as follows:
  28. rxwei revised this gist Oct 20, 2018. 1 changed file with 2 additions and 3 deletions.
    5 changes: 2 additions & 3 deletions ad-manifesto.md
    @@ -788,9 +788,8 @@ func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
    }
    ```
    ```console
    test.swift:3:35: error: function `bar` does not support higher-order
    differentiation because its adjoint is not differentiable; would you like to add
    `once`?
    test.swift:3:35: error: function `bar` does not support higher-order differentiation
    because its adjoint is not differentiable; would you like to add `once`?
    @differentiable(reverse, adjoint: adjointBar)
    ^~~~~~~~~~
    test.swift:8:6: note: `adjointBar` is defined here
  29. rxwei revised this gist Oct 20, 2018. 1 changed file with 8 additions and 5 deletions.
    13 changes: 8 additions & 5 deletions ad-manifesto.md
    @@ -130,7 +130,7 @@ which is exactly the [gradient](https://en.wikipedia.org/wiki/Gradient) of
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_0}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_0} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    </p>

    When this vector represents the gradient of another function `g: ℝᵐ → ℝ` at
    When this vector `vⁱ` represents the gradient of another function `g: ℝᵐ → ℝ` at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}(\mathbf{x})), namely
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{f_i(\mathbf{x})}),
    then the vector-Jacobian products will represent
    @@ -667,7 +667,7 @@ There are five options for differentiability:

    5. Linear: `@differentiable(linear)`

    By definiton, a linear map is always a unary function and its Jacobian is
    By definition, a linear map is always a unary function and its Jacobian is
    the matrix associated with this linear transformation itself. In other
    words, both its differential and its pullback are itself.

    @@ -715,7 +715,7 @@ As explained, differentiabilities have different functional requirements.
    4. Other differentiabilities

    Other differentiabilities such as `constant` and `linear` do not require
    any associated functions. However, the users can choose to specify
    any associated functions. However, users can choose to specify
    tangent/adjoint function(s) for their own purposes such as custom
    optimizations.

    @@ -778,9 +778,12 @@ func bar(_ x: Vector<Float>) -> Float {
    return sin(x)[0]
    }

    var someGlobalVariable: Vector<Float> = [1, 1, 1]

    func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
    var yx = Vector<Float>(repeating: 0, shape: x.shape)
    yx[0] = cos(x[0]) * adjoint
    someGlobalVariable[0] = cos(x[0]) * adjoint
    yx[0] = someGlobalVariable[0]
    return yx
    }
    ```
    @@ -1062,7 +1065,7 @@ due to code size. In order to make the system consistent, we make each

    Since we want to support differentiating opaque functions, we must support
    creating one. The fact is, the user does not even need to know about `@autodiff`
    or intentially create differentiable functions if they are working with
    or intentionally create differentiable functions if they are working with
    functions in the current module. Whenever a local function declaration gets used
    where the contextual type has an `@autodiff` attribute on it, Swift
    differentiates it. If differentiation fails, Swift reports an error at
  30. rxwei revised this gist Oct 20, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    @@ -1430,7 +1430,7 @@ composition of the forward-mode differential operator and the reverse-mode
    differential operator on a function.

    <p align="center">
    <img src="https://latex.codecogs.com/png.latex?\mathbf{H_f}(\mathbf{x})\mathbf{v}&space;=&space;\mathbf{J}_{\nabla&space;\mathbf{f}}(\mathbf{x})\mathbf{v}" title="\mathbf{H_f}(\mathbf{x})\mathbf{v} = \mathbf{J}_{\nabla \mathbf{f}}(\mathbf{x})\mathbf{v}" />
    <img src="https://latex.codecogs.com/png.latex?\mathbf{H}_f(\mathbf{x})\mathbf{v}&space;=&space;\mathbf{J}_{\nabla&space;f}(\mathbf{x})\mathbf{v}" title="\mathbf{H}_f(\mathbf{x})\mathbf{v} = \mathbf{J}_{\nabla f}(\mathbf{x})\mathbf{v}" />
    </p>

    Just like other differential operators, we can define the Hessian-vector