    First-Class Automatic Differentiation in Swift: A Manifesto
    ===========================================================

    * Author: [Richard Wei](https://github.com/rxwei)
    * Date: October 2018

    This document is written for both the machine learning community and the Swift
    programming language design community, with a strong focus on language design.

    **Status: Outdated**

    **Please see [Swift Automatic Differentiation Design Overview](https://docs.google.com/document/d/1bPepWLfRQa6CtXqKA8CDQ87uZHixNav-TFjLSisuKag/edit?usp=sharing) instead.**

    ## Table of Contents

    - [Introduction](#introduction)
    - [What is AD](#what-is-ad)
    - [Why does Swift need AD?](#why-does-swift-need-ad)
    - [Why make AD first-class?](#why-make-ad-first-class)
    - [Vision](#vision)
    - [Part 1: Differentiable Types](#part-1-differentiable-types)
    - [Part 2: Primitive Registration](#part-2-primitive-registration)
    - [Part 3: Basic Differentiation](#part-3-basic-differentiation)
    - [Part 4: Generalized Differentiability](#part-4-generalized-differentiability)
    - [Part 5: True Differential Operators](#part-5-true-differential-operators)
    - [Part 6: Generalized Types for Differentiation](#part-6-generalized-types-for-differentiation)
    - [Part 7: Customizable Differentiation](#part-7-customizable-differentiation)
    - [Acknowledgements](#acknowledgements)

    ## Introduction

    Automatic Differentiation (AD), also known as algorithmic differentiation, is a
    family of techniques used to obtain the derivative of a function. Functions can
    be represented as a composition of elementary operators whose derivatives are
    well-known. While partial derivatives can be computed through different
    techniques, the most common is a recursive application of the chain rule in the
    reverse direction, called reverse-mode AD. Reverse-mode AD computes
    vector-Jacobian products, i.e. partial derivatives with respect to each input
    parameter, and it has become a prerequisite for implementing gradient-based
    learning methods.

    We aim to provide best-in-class AD, including the best optimizations, best error
    messages in failure cases, and the most flexibility and expressivity. To achieve
    this, we built support for AD right into the Swift compiler. This manifesto
    explains the design and vision of AD, and introduces the language
    extensions that will make Swift the world's first general-purpose differentiable
    programming language.

    ## What is AD?

    ### Basic Calculus

    In basic calculus, differentiating a function of type
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\rightarrow\mathbb{R})
    produces a function
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\rightarrow\mathbb{R}) that
    maps points onto their corresponding slopes.

    <p align="center">
    <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/9315f1516ee5847107808697e43693d91abfc6e8"
    </p>

    In the context of Swift, differentiating a function `(Float) -> Float` produces
    `(Float) -> Float`. Functions with multiple arguments, such as `(Float, Float)
    -> Float`, can be thought of as a function whose input domain is a product of
    those argument types, i.e.
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}),
    so the derivative of such a function has type `(Float, Float) -> (Float,
    Float)`. According to this typing rule, the differential operator
    ![](http://latex.codecogs.com/gif.latex?\dfrac{d}{dx}) can be declared as a
    higher-order function, overloaded for each number of arguments because a Swift
    function's argument list is not formally modeled as a tuple.

    ```swift
    func 𝒟<T: FloatingPoint>(_ f: (T) -> T) -> (T) -> T
    func 𝒟<T: FloatingPoint>(_ f: (T, T) -> T) -> (T, T) -> (T, T)
    func 𝒟<T: FloatingPoint>(_ f: (T, T, T) -> T) -> (T, T, T) -> (T, T, T)
    ...
    ```

    ```swift
    func f(_ x: Double, _ y: Double) -> Double {
        return tanh(x + y)
    }
    𝒟(f) // (Double, Double) -> (Double, Double)
    ```

    ### Vectors and Jacobians

    In numerical computing, users often write code that operates on high-dimensional
    mathematical objects. The basic typing rules that we defined on real scalars
    (![](http://latex.codecogs.com/gif.latex?\mathbb{R})) can be generalized for
    [module](https://en.wikipedia.org/wiki/Module_(mathematics))-like types such as
    vectors with extra consideration for shape. In vector calculus, the
    differentiation of a function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}:\mathbb{R}^n\rightarrow\mathbb{R}^m)
    is defined per scalar because there are multiple inputs and multiple outputs.
    Full differentiation of a vector-valued function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}) will thus result in a
    matrix, each of whose entries is a function that computes the partial derivative
    of an output scalar with respect to an input scalar. This matrix is called a
    [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant). In
    this definition, the Jacobian matrix has type
    ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:(\mathbb{R}\rightarrow\mathbb{R})^{mn}).
    For simplicity, we will model it as a function that maps vectors to real-valued
    matrices
    ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:\mathbb{R}^n\rightarrow\mathbb{R}^{mn}).

    <p align="center">
    <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/74e93aa903c2695e45770030453eb77224104ee4"
    alt="Automatic differentiation approaches."/>
    </p>

    While it is challenging to define this function with full type safety in Swift
    because shapes cannot be generic parameters yet, we can define a differential
    operator as the following, specialized on shapes.

    ```swift
    func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    where T: FloatingPoint
    ```

    Computing the Jacobian of a function is often unnecessary in gradient-based
    optimization methods. Computing a full Jacobian requires repeated evaluation of
    two cheaper primitives: vector-Jacobian products (VJPs) or Jacobian-vector
    products (JVPs), and VJPs and JVPs alone are often exactly what we need in
    practice. In these terms, "vector" refers to a vector of partial derivatives
    that is chained with the Jacobian by left-multiplication or
    right-multiplication. As we explain this chaining next, we discuss how Automatic
    Differentiation comes into the picture.

    ### Gradient and Reverse-Mode AD

    When we let a [one-hot](https://en.wikipedia.org/wiki/One-hot) row vector
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}^i:\mathbb{R}^m=\Big[0\cdots_{i-1}1\cdots0\Big])
    left-multiply a Jacobian matrix of type
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^{mn}), we are selecting one
    row in the matrix, which is exactly the
    [gradient](https://en.wikipedia.org/wiki/Gradient) of
    ![](http://latex.codecogs.com/gif.latex?f_i) evaluated at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{x}), i.e.
    ![](http://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})).

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    </p>

    When vector ![](http://latex.codecogs.com/gif.latex?\mathbf{v}) in
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}\mathbf{J_f}(\mathbf{x}))
    represents the gradient of another function
    ![](http://latex.codecogs.com/gif.latex?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}(\mathbf{x})), namely
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{f_i(\mathbf{x})}),
    then the vector-Jacobian product represents
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{\mathbf{x}}). The
    linear function that takes a vector and left-multiplies it with the Jacobian is
    also called a
    [pullback](https://en.wikipedia.org/wiki/Pullback_(differential_geometry)). We
    can define this function in Swift as a higher-order function shown below. The
    body of this function can be defined in terms of `𝒟`, the differential operator
    that returns a Jacobian.

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\dfrac{\partial&space;g(\mathbf{f}(\mathbf{x}))}{\partial&space;\mathbf{x}}=\dfrac{\partial&space;g}{\partial&space;\mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x})&space;=&space;\bigg[&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_n}&space;\bigg]" title="\dfrac{\partial g(\mathbf{f}(\mathbf{x}))}{\partial \mathbf{x}}=\dfrac{\partial g}{\partial \mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x}) = \bigg[ \dfrac{\partial g(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial g(\mathbf{x})}{\partial x_n} \bigg]" />
    </p>

    ```swift
    func pullback<T: FloatingPoint>(
        of f: (Vector2<T>) -> Vector3<T>,
        at x: Vector2<T>
    ) -> (Vector3<T>) -> Vector2<T> {
        return { adjoint in matmul(adjoint, 𝒟(f)(x)) }
    }
    ```

    However, when computing gradients or general vector-Jacobian products, we do not
    need to compute the Jacobian at all: **Automatic Differentiation is here to
    help.**

    [The chain rule of differentiation](https://en.wikipedia.org/wiki/Chain_rule)
    can be interpreted in left-associative order, i.e. accumulating each function's
    partial derivatives from the final output, eventually reaching each input.
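
    To make this concrete, below is a hand-written sketch of reverse-mode chaining
    for a composition h(x) = g(f(x)), with f(x) = x · x and g(y) = 3y. The helper
    names are illustrative only; they are not part of the proposed API, and the
    pullbacks are written by hand rather than generated by the compiler.

    ```swift
    // Each step returns its value along with a pullback that maps an adjoint at
    // the output to an adjoint at the input.
    func fWithPullback(_ x: Float) -> (value: Float, pullback: (Float) -> Float) {
        return (x * x, { v in v * 2 * x })
    }

    func gWithPullback(_ y: Float) -> (value: Float, pullback: (Float) -> Float) {
        return (3 * y, { v in v * 3 })
    }

    func gradientOfH(at x: Float) -> Float {
        let (y, pullbackF) = fWithPullback(x)
        let (_, pullbackG) = gWithPullback(y)
        // The chain rule applied in reverse: seed the output adjoint with 1,
        // then pull it back through g, then through f.
        return pullbackF(pullbackG(1))
    }

    gradientOfH(at: 4) // 24, i.e. d/dx (3x²) = 6x evaluated at x = 4
    ```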

    ### Directional Derivatives and Forward-Mode AD

    Similarly, when we let a column vector
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}:\mathbb{R}^{n1}) right-multiply a
    Jacobian value
    matrix of type ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^{mn}), the result is a vector whose elements are exactly the
    [directional derivatives](https://en.wikipedia.org/wiki/Directional_derivative)
    of each ![](http://latex.codecogs.com/gif.latex?f_i) evaluated at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{x}) in direction ![](http://latex.codecogs.com/gif.latex?\mathbf{v}).

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_1}(\mathbf{x})\&space;\cdots\&space;\nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" title="\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_1}(\mathbf{x})\ \cdots\ \nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" />
    </p>

    The linear function that takes a vector and right-multiplies the Jacobian value
    matrix is called a
    [differential](https://en.wikipedia.org/wiki/Pushforward_(differential)), and it
    can also be defined in Swift as a higher-order function in terms of `𝒟`.

    ```swift
    func differential<T: FloatingPoint>(
        of f: (Vector2<T>) -> Vector3<T>,
        at x: Vector2<T>
    ) -> (Vector2<T>) -> Vector3<T> {
        return { tangent in matmul(𝒟(f)(x), tangent) }
    }
    ```

    Just like vector-Jacobian products, Jacobian-vector products are easy to compute
    using Automatic Differentiation. By simply applying the chain rule of
    differentiation from an input, we will accumulate each function's partial
    derivatives and reach each output.
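
    As a hand-written counterpart to the reverse-mode sketch above, forward-mode
    chaining can be pictured as propagating a (value, derivative) pair from the
    input towards the output. The `Dual` type and helpers below are illustrative
    only and are not the design proposed later in this document.

    ```swift
    // A minimal dual-number sketch of forward-mode chaining.
    struct Dual {
        var value: Float
        var derivative: Float
    }

    // f(x) = x * x, propagating the tangent alongside the value.
    func square(_ x: Dual) -> Dual {
        return Dual(value: x.value * x.value, derivative: 2 * x.value * x.derivative)
    }

    // g(y) = 3 * y
    func triple(_ y: Dual) -> Dual {
        return Dual(value: 3 * y.value, derivative: 3 * y.derivative)
    }

    // d/dx 3x² at x = 4, seeded with tangent 1 at the input.
    let result = triple(square(Dual(value: 4, derivative: 1)))
    result.derivative // 24
    ```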

    AD has a rich background. For an in-depth introduction, here's some great
    documentation:
    - [Introduction to Automatic
    Differentiation](https://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation/)
    - [Automatic differentiation in machine learning: a survey](https://arxiv.org/abs/1502.05767)
    - [The simple essence of automatic
    differentiation](https://arxiv.org/abs/1804.00746)

    ## Why does Swift need AD?

    Swift is a new programming language in the machine learning space. Recently, the
    [Swift for TensorFlow](https://github.com/tensorflow/swift) project brought the
    full power of a machine learning framework into the Swift programming language.
    Numerical computing has a very different set of requirements than application
    development and systems development, and we believe that Swift needs to better
    address those requirements and improve the usability of numerical software. One
    of the most important building blocks in machine learning and numerical
    computing is the ability to differentiate math code. Automatic Differentiation
    has been implemented in many languages, but because of language constraints and
    design trade-offs, many existing AD systems have limitations. We would like to
    take this opportunity to improve Swift, and demonstrate what Swift can offer in
    all areas of numerical computing in the presence of a compiler and a static type
    system.

    ## Why make AD first-class?

    Automatic Differentiation has been a research topic in scientific computing and
    high-performance computing for nearly half a century. Traditional tools such as
    [OpenAD](http://www.mcs.anl.gov/OpenAD/),
    [TAPENADE](http://tapenade.inria.fr:8080/tapenade/index.jsp) and
    [ADIFOR](http://www.mcs.anl.gov/research/projects/adifor/) transform existing
    source code. There are many advanced techniques that have improved
    the performance of derivatives written in FORTRAN, but these tools have not
    gained wide adoption in the machine learning community. More recent AD systems
    like [Stalin∇](https://github.com/Functional-AutoDiff/STALINGRAD) (pronounced
    Stalingrad, available as a dialect of Scheme) achieved good usability by
    integrating the differential operator into the language, and are equipped with a
    complete set of AD features (such as forward/reverse, nested AD, Hessians,
    Jacobians, directional derivatives and checkpointing). Along with libraries such
    as [DiffSharp](http://diffsharp.github.io/DiffSharp/) (available in F#), and
    [ad](https://hackage.haskell.org/package/ad) (available in Haskell), they
    combine AD closely with functional programming languages.

    Researchers in the machine learning community have built many library
    implementations of AD in Python and C++, including
    [Autograd](https://github.com/HIPS/autograd),
    [TensorFlow](http://tensorflow.org/), [Pytorch](http://pytorch.org/), etc.

    While Automatic Differentiation is an integral part of any machine learning
    framework, traditional designs and implementations of AD have some limitations.
    Some of these libraries are implemented as a transformation on a standalone DSL
    (a graph) with a closed set of operators. Others are implemented using operator
    overloading directly on a subset of the source language. Although these
    libraries have gained wide adoption, the ones that leverage ahead-of-time AD do
    not expose an easy-to-use programming model, and the ones that have a friendlier
    programming model lack static analysis to perform more optimized AD.

    Recent projects such as [Tangent](https://github.com/google/tangent),
    [Myia](https://github.com/mila-udem/myia), and
    [Zygote.jl](https://github.com/FluxML/Zygote.jl) based their AD upon source code
    transformation (SCT), a technique that was common in advanced AD systems before
    the deep learning era such as
    [Stalin∇](https://github.com/Functional-AutoDiff/STALINGRAD). The first two
    libraries parse a Python subset into ASTs and transform a function to its
    derivatives either in AST or in a functional IR, and Zygote hooks into the Julia
    compiler and transforms Julia's IR directly. These tools are pushing the
    boundaries of dynamic languages.

    We would like our AD system to feel native and expressive. AD in Swift aims to
    solve real-world usability problems by providing the best generalizations, best
    error messages in failure cases, composable differential operators, and fully
    customizable types and derivatives. To achieve this, we built support for AD
    right into the Swift language. Even though AD has been incubated as part of the
    Swift for TensorFlow project, we believe its importance and impact go beyond
    machine learning, so we decided to eventually propose it through Swift Evolution
    for inclusion in the core language.

    ## Vision

    **Swift will be the world's first general-purpose differentiable programming
    language.**

    ### Ease of Use

    We expect Swift's language-integrated AD to be super easy to use in the context
    of machine learning, control in robotics, and scientific computing. AD is a
    general language feature that works seamlessly with third-party libraries such
    as [TensorFlow](https://www.tensorflow.org/swift/api_docs/).

    ```swift
    struct Parameters: Differentiable, ParameterGroup {
        var w1 = Tensor<Float>(randomNormal: [784, 30])
        var b1 = Tensor<Float>(zeros: [30])
        var w2 = Tensor<Float>(randomNormal: [30, 10])
        var b2 = Tensor<Float>(zeros: [10])
    }

    var params = Parameters()
    let minibatches = Dataset(...)
    var optimizer = StochasticGradientDescent()
    for (x, y) in minibatches {
        let grads = gradient(at: params) { params in
            let h1 = tanh(matmul(x, params.w1) + params.b1)
            let ŷ = sigmoid(matmul(h1, params.w2) + params.b2)
            let loss = (y - ŷ).squared().mean()
            print("Loss is \(loss)")
            return loss
        }
        optimizer.fit(&params, gradients: grads)
    }
    ```

    ### Full Extensibility: Custom Types and Derivatives

    We want our AD system to be fully extensible to the point where users can
    request derivatives of a function taking their own user-defined numeric types,
    and even use this feature to implement structure-dependent algorithms such as
    tree-recursive neural networks. Therefore, when performing AD, Swift makes no
    special assumptions about individual math functions or the types it should
    support. We enable library designers and developers to easily define any
    differentiable type or function, all in pure Swift code.

    Swift supports [protocol-oriented programming and first-class value
    semantics](https://developer.apple.com/videos/play/wwdc2015/408/). AD is deeply
    integrated with value types and has full extensibility via protocol
    conformances. The user can make their custom data structures differentiable
    simply by declaring a conformance to the `Differentiable` protocol:

    ```swift
    extension MyType: Differentiable {
        ...
    }
    ```

    Or make an obviously non-differentiable function differentiable by using the
    `@differentiable` attribute, specifying a "tangent" function for computing its
    Jacobian-vector products, or an "adjoint" function for computing its
    vector-Jacobian products.

    ```swift
    @differentiable(tangent: tangentFoo, adjoint: adjointFoo)
    func foo(_ x: Float) -> Float {
        return Float(Int(x)) // obviously non-differentiable
    }

    func tangentFoo(_ x: (Float, Float), originalResult: Float) -> Float {
        // Insert custom code to compute the directional derivative
    }

    func adjointFoo(_ x: Float, originalResult: Float, adjoint: Float) -> Float {
        // Insert custom code to compute the gradient
    }
    ```

    ### Composable Differential Operators

    With fully customizable data structures and derivatives, everything should feel
    native in the language. In addition, differential operators are functional and
    composable, and differentiability is naturally integrated in the type system.
    All differential operators are defined in Swift, and developers can create their
    own differential operators by composing existing ones. For example, the user can
    use the "forward-on-reverse" approach to compute [Hessian-vector
    products](https://en.wikipedia.org/wiki/Hessian_matrix), where the `hvp(at:in:)`
    operator is defined as a native Swift function. The [`@autodiff(order:
    2)`](#the-autodiff-function-type-attribute) attribute in the closure type
    signature marks the closure argument as being differentiable up to at least the
    2nd order, so that the caller of `hvp(at:in:)` will differentiate the actual
    closure argument as needed.

    ```swift
    func hvp<T: Differentiable, R: FloatingPoint>(
        at x: T, in f: @autodiff(order: 2) (T) -> R
    ) -> @autodiff(linear) (T) -> T {
        return differential(at: x, in: gradient(of: f))
    }
    ```

    ### Static Analysis and Diagnostics

    By building first-class AD into the programming language, we can provide better
    diagnostics about differentiability and numeric stability than any dynamic
    language, all at compile-time.

    ```console
    test.swift:58:10: error: function is not differentiable
    return #gradient(funcToDiff)(x)
    ^ ~~~~~~~~~~

    test.swift:54:10: note: expression is not differentiable
    return middle2(x)
    ^

    test.swift:50:10: note: when differentiating this function call
    return middle(x)
    ^

    test.swift:46:10: note: when differentiating this function call
    return nested(y)
    ^
    ```

    ### Flexible Functional-Style Differentiation

    In common AD libraries, there are two differentiation styles: functional and
    imperative.

    | | Syntax | Meaning |
    |------------|--------|-------------|
    | Functional | `let 𝝯f = gradient(of: f)`<br/>`𝝯f(x)` | Differentiating a function |
    | Imperative | `let y = f(x)`<br/>`gradient(of: y, wrt: x)` | Differentiating code traced through data flow |

    Functional-style AD is transforming one function to another, producing a
    function that takes original arguments and returns the partial derivatives
    evaluated at each argument. Imperative-style AD, on the other hand, is a
    value-value dependency analysis. Although we use both notations in mathematics,
    imperative AD comes at the cost of semantic inconsistency with the host
    language, for example:

    ```swift
    let y = f(x)
    x = 3
    gradient(of: y, wrt: x) // undefined
    ```

    Semantically, `y` is a value, but `x` is both a value and a reference to a
    memory location -- it is unclear what exactly we are differentiating with
    respect to. Though making `y` and `x` have reference types could make this
    particular example work out semantically, it would be fundamentally inconsistent
    with Swift's core design where mathematical objects have value types, and would
    also make scalar types like `Float` incompatible with automatic differentiation.

    We believe Swift's AD can achieve the same level of expressivity as imperative
    AD while preserving functional properties, and use language integration to push
    developers' productivity to the next level.


    ## Part 1: Differentiable Types

    Swift is a general-purpose programming language. Therefore, not every function
    is mathematically differentiable, and not every type represents a real vector
    space to begin with. To make our system mathematically sound, we refine the
    Swift standard library to form a basis for automatic differentiation.

    The starting point of this refinement is the fundamental numeric protocols.
    In this section, we talk about how we improve the `Numeric` protocol to support
    the addition of vector types and protocols. Then, we introduce a protocol to
    represent vector spaces, since that is a requirement for doing calculus.
    Finally, we design a protocol specific to differentiation.

    ### Revising the [`Numeric`](https://developer.apple.com/documentation/swift/numeric) protocol

    The Numeric protocol today refines
    [`ExpressibleByIntegerLiteral`](https://developer.apple.com/documentation/swift/expressiblebyintegerliteral).
    This makes sense for scalars, but is not compatible with vector data structures
    because type-checking would fail on the scalar multiplication operator.

    On the Swift forum, we have discussed the [fundamental blocker for vector types
    to conform to the existing `Numeric`
    protocol](https://forums.swift.org/t/should-numeric-not-refine-expressiblebyintegerliteral).
    The consensus was to introduce a weakening of the `Numeric` protocol to
    represent the abstractions shared between scalars and vectors: [rng (ring
    without unity)](https://en.wikipedia.org/wiki/Rng_(algebra)) (We assumed that
    vector spaces are rngs by endowing them with `*` as element-wise
    multiplication). The protocol will be called `Arithmetic`.

    ```swift
    public protocol Arithmetic: Equatable {
        static var zero: Self { get }
        prefix static func + (x: Self) -> Self
        static func + (lhs: Self, rhs: Self) -> Self
        static func += (lhs: inout Self, rhs: Self)
        static func - (lhs: Self, rhs: Self) -> Self
        static func -= (lhs: inout Self, rhs: Self)
        static func * (lhs: Self, rhs: Self) -> Self
        static func *= (lhs: inout Self, rhs: Self)
    }
    ```

    The existing `Numeric` will be changed to refine (inherit from) `Arithmetic`,
    keeping all of its existing behavior.

    ```swift
    public protocol Numeric: Arithmetic, ExpressibleByIntegerLiteral {
        associatedtype Magnitude: Comparable, Numeric
        init?<T>(exactly source: T) where T: BinaryInteger
        var magnitude: Magnitude { get }
    }
    ```

    ### The `VectorNumeric` protocol

    After we introduce the `Arithmetic` protocol, which makes the standard library
    suitable for vector APIs and beyond, we can define a protocol that generalizes
    vectors. Mathematically, a vector space is a ring without unity if we endow it
    with `*` as element-wise multiplication. We represent vector spaces through the
    `VectorNumeric` protocol as follows. `Scalar` is the type of the elements of
    this vector space -- the field which the vector space is over. `Shape` is the
    shape of this vector space, which is customizable. The initializer takes a value
    of the `Scalar` type and a `Shape` and returns a vector of the specified shape.

    ```swift
    /// A type that represents an unranked vector space. Values of this type are
    /// elements in this vector space and with a specific shape.
    public protocol VectorNumeric: Arithmetic {
        /// The type of scalars in the vector space.
        associatedtype Scalar: Numeric

        /// The type whose values specify the shape of an object in the vector
        /// space.
        associatedtype Shape

        /// Create an object in the vector space with the specified shape by
        /// repeatedly filling the object with the specified value.
        ///
        /// - Parameters:
        ///   - repeatedValue: the value to repeat for the specified shape
        ///   - shape: the shape
        init(repeating repeatedValue: Scalar, shape: Shape)

        /// The shape of this vector.
        var shape: Shape { get }

        /// Returns the scalar product of the vector.
        static func * (scale: Scalar, value: Self) -> Self
    }
    ```
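
    For concreteness, here is a sketch of a small fixed-shape vector type
    conforming to `VectorNumeric`. The `Vector2` type and its members are
    hypothetical and exist only to illustrate the requirements above.

    ```swift
    /// A hypothetical two-dimensional vector type (illustration only).
    struct Vector2: VectorNumeric {
        typealias Scalar = Float
        typealias Shape = ()   // A fixed-shape type has a trivial shape.

        var x: Float
        var y: Float

        static var zero: Vector2 { return Vector2(x: 0, y: 0) }

        init(x: Float, y: Float) { self.x = x; self.y = y }
        init(repeating repeatedValue: Float, shape: Shape) {
            self.init(x: repeatedValue, y: repeatedValue)
        }

        var shape: Shape { return () }

        static func == (lhs: Vector2, rhs: Vector2) -> Bool {
            return lhs.x == rhs.x && lhs.y == rhs.y
        }

        prefix static func + (v: Vector2) -> Vector2 { return v }
        static func + (lhs: Vector2, rhs: Vector2) -> Vector2 {
            return Vector2(x: lhs.x + rhs.x, y: lhs.y + rhs.y)
        }
        static func += (lhs: inout Vector2, rhs: Vector2) { lhs = lhs + rhs }
        static func - (lhs: Vector2, rhs: Vector2) -> Vector2 {
            return Vector2(x: lhs.x - rhs.x, y: lhs.y - rhs.y)
        }
        static func -= (lhs: inout Vector2, rhs: Vector2) { lhs = lhs - rhs }
        static func * (lhs: Vector2, rhs: Vector2) -> Vector2 {
            return Vector2(x: lhs.x * rhs.x, y: lhs.y * rhs.y)  // element-wise
        }
        static func *= (lhs: inout Vector2, rhs: Vector2) { lhs = lhs * rhs }
        static func * (scale: Float, value: Vector2) -> Vector2 {
            return Vector2(x: scale * value.x, y: scale * value.y)
        }
    }
    ```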

    ### The `Differentiable` protocol

    Now we define a protocol that "activates" a type's differentiability. At first
    glance, the conforming type must also be a `VectorNumeric` type. So we make this
    protocol refine `VectorNumeric`. Since differentiation only makes sense on real
    vectors, we add a constraint on the associated type `Scalar` such that it
    conforms to `FloatingPoint`.

    ```swift
    public protocol Differentiable: VectorNumeric where Scalar: FloatingPoint {
    }
    ```

    You may notice that `Differentiable` looks like a dummy protocol because it
    doesn't have any requirements other than the ones inherited from
    `VectorNumeric`. Although under the current assumptions we could completely omit
    the `Differentiable` protocol and just have the AD system recognize
    `VectorNumeric`-conforming types whose scalar elements conform to
    `FloatingPoint`, we actually have theoretical and practical reasons to revise
    the `Differentiable` protocol later on. So we keep `Differentiable` as a
    separate protocol for now and build towards the final design at the end of this
    document.

    ## Part 2: Primitive Registration

    We are aiming for an open and extensible system, so we made the compiler
    agnostic of the actual operations - it does not have special knowledge of
    numeric standard library functions or distinguish between primitive operators
    and other functions. We recursively determine a function's differentiability
    based on:

    - whether a function has a primitive differentiability as specified in the
    standard or user-defined library, and

    - whether a function's definition (type signature and body) is differentiable by
    applying the chain rule of differentiation.

    As such, we provide a syntactic way of specifying the differentiability of a
    function, using either the function's linearity properties or a separate
    function that provides the "tangent code", which specifies how to differentiate
    the function in forward mode, or the "adjoint code", which specifies how to
    differentiate the function in reverse mode.

    ### The `@differentiable` attribute

    We introduce a declaration attribute `@differentiable` to Swift's syntax. The
    full grammar of `@differentiable` is defined as follows:

    ```ebnf
    differentiation-mode = 'forward' | 'reverse' | 'bidirectional'
    differentiability = differentiation-mode | 'linear' | 'constant'
    differentiability-wrt-self = 'wrt' ':' 'self'
    differentiation-order = 'once'
    differentiation-tangent-specifier = 'tangent' ':' declaration-name
    differentiation-adjoint-specifier = 'adjoint' ':' declaration-name
    differentiable-attribute = '@differentiable'
        '(' differentiability
            [ ',' differentiability-wrt-self ]
            [ ',' differentiation-order ]
            [ ',' differentiation-tangent-specifier ]
            [ ',' differentiation-adjoint-specifier ]
        ')'
    declaration-attribute = differentiable-attribute
    ```

    #### First Glance

    The multiplication operator `*` is differentiable with respect to its two
    arguments. Here's how we make it differentiable in the standard library.

    ```swift
    extension FloatingPoint {
        @differentiable(bidirectional, tangent: tangentMul, adjoint: adjointMul)
        static func * (x: Self, y: Self) -> Self { ... }

        internal func tangentMul(
            x: (Self, Self), y: (Self, Self), originalResult: Self
        ) -> Self {
            return x.1 * y.0 + y.1 * x.0
        }

        internal func adjointMul(
            x: Self, y: Self, originalResult: Self, seed: Self
        ) -> (Self, Self) {
            return (seed * y, seed * x)
        }
    }
    ```

    In TensorFlow, the convolution operator is only differentiable with respect to
    a subset of arguments. Here's how we make it differentiable so that it can be
    used for back-propagation.

    ```swift
    @differentiable(reverse, adjoint: adjointConv2D)
    public func conv2d(_ input: Tensor<Float>, filter: Tensor<Float>,
                       strides: @nondiff (Int32, Int32, Int32, Int32),
                       padding: @nondiff Padding) -> Tensor<Float> {
        ...
    }

    func adjointConv2D(_ input: Tensor<Float>, filter: Tensor<Float>,
                       strides: (Int32, Int32, Int32, Int32),
                       padding: Padding) -> (Tensor<Float>, Tensor<Float>) {
        ...
    }
    ```

    #### Differentiation Parameters

    Differentiation parameters are marked inline at each argument position in the
    function declaration. By default, every argument of the function is to be
    differentiated with respect to, unless it is marked as `@nondiff`.

    When a `@differentiable` attribute is applied to a method, or to the getter of a
    computed property in a type, the implicit `self` argument often needs to be
    differentiated with respect to. In order to make a function a differentiation
    primitive with respect to `self`, one can add `wrt: self` to
    the `@differentiable` attribute.
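
    For example, a method can be registered as a reverse-mode primitive with
    respect to `self` as sketched below. The `Vector` extension, `squared()`, and
    the exact shape of its adjoint are hypothetical here and only illustrate the
    use of `wrt: self`.

    ```swift
    public extension Vector {
        @differentiable(reverse, wrt: self, adjoint: adjointSquared)
        func squared() -> Vector {
            return self * self
        }

        // Receives the original result and the back-propagated seed, and returns
        // the partial derivative with respect to `self`.
        internal func adjointSquared(originalResult: Vector, seed: Vector) -> Vector {
            return 2 * seed * self
        }
    }
    ```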

    #### Differentiability

    There are five options for differentiability:

    1. Forward: `@differentiable(forward, tangent: ...)`

       This option says that the function is forward-mode differentiable.
       Forward-mode differentiation requires the "tangent code" (or tangent
       function) of this function, so that Swift knows how to compute the
       function's directional derivatives in the direction specified by the
       tangent vector that has been forward-propagated to the tangent function.

       The compiler will expect the name of the tangent function, with an expected
       type signature, to be specified later in the `tangent:` parameter in the
       attribute.

    2. Reverse: `@differentiable(reverse, adjoint: ...)`

       This option says that the function is reverse-mode differentiable.
       Reverse-mode differentiation requires the "adjoint code" (or adjoint
       function) of this function, so that Swift knows how to compute the function's
       vector-Jacobian products, where the vector, also called the "adjoint vector",
       has been back-propagated to the adjoint function.

       The compiler will expect the identifier of the adjoint function, with an
       expected type signature, to be specified later in the `adjoint:` parameter
       in the attribute.

    3. Bidirectional: `@differentiable(bidirectional, tangent: ..., adjoint: ...)`

       This option says that the function is both forward-mode differentiable and
       reverse-mode differentiable. The compiler will expect both the tangent
       function and the adjoint function to be specified later in this attribute.

    4. Constant: `@differentiable(constant)`

       By definition, constant functions always have zero derivatives and are
       differentiable at any arbitrary order. Differentiating such a function will
       result in a zero vector (or vectors, when the function has multiple
       differentiation arguments) with the same shape as each differentiation
       argument.

    5. Linear: `@differentiable(linear)`

       By definition, a linear map is always a unary function and its Jacobian is
       the matrix associated with this linear transformation itself. In other
       words, both its differential and its pullback are the function itself.

    #### Associated Functions

    As explained, different differentiabilities have different functional
    requirements (see the sketch after this list for a concrete example).

    1. `forward` differentiability

       When the differentiability is `forward`, the compiler expects a `tangent:`
       label in the attribute followed by the name (qualified or unqualified)
       of a tangent function that is to be associated with the original function.
       If the original function declaration has type `(T0, ..., Tn) -> U`, then
       the expected type of the tangent function is `((T0, T0), ..., (Tn, Tn), U) ->
       U`. As we can see, every argument of the original function has become
       a "dual number" in the tangent function, represented as a tuple. The first
       element of such a tuple is the original argument, and the second element is
       the forward-propagated directional derivative, namely the "vector" in
       "Jacobian-vector product". The last argument to the tangent function is the
       original function's result. The result of the tangent function is the
       directional derivatives. If any of the original arguments is marked as
       `@nondiff`, it will not become a dual number in the tangent function's
       argument list but will remain as the original argument itself.

    2. `reverse` differentiability

       When the differentiability is `reverse`, the compiler expects an `adjoint:`
       label in the attribute followed by the name (qualified or unqualified)
       of an adjoint function that is to be associated with the original function.
       If the original function declaration has type `(T0, ..., Tn) -> U`, then
       the expected type of the adjoint function is `(T0, ..., Tn, U, U) -> (T0,
       ..., Tn)`. As we can see, the first arguments to the adjoint function,
       `T0, ..., Tn`, are the original arguments. The next argument is the
       original function's result. The last argument is the back-propagated
       partial derivative at the original function's result,
       namely the "vector" in "vector-Jacobian product". The result of the
       adjoint function contains partial derivatives for each argument, if the
       argument has not been marked as `@nondiff`.

    3. `bidirectional` differentiability

       When the differentiability is `bidirectional`, the compiler expects both
       `tangent:` and `adjoint:` arguments to be specified.

    4. Other differentiabilities

       Other differentiabilities such as `constant` and `linear` do not require
       any associated functions. However, users can choose to specify
       tangent/adjoint function(s) for their own purposes such as custom
       optimizations.
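
    To make these expected shapes concrete, here is a sketch for a binary
    function. `myOp` and its associated functions are hypothetical and only
    illustrate the type signatures described above.

    ```swift
    @differentiable(bidirectional, tangent: tangentMyOp, adjoint: adjointMyOp)
    func myOp(_ x: Float, _ y: Float) -> Float {
        return x * x * y
    }

    // `forward`: every argument becomes a dual number `(original, tangent)`, and
    // the last parameter is the original result.
    func tangentMyOp(_ x: (Float, Float), _ y: (Float, Float),
                     originalResult: Float) -> Float {
        return 2 * x.0 * y.0 * x.1 + x.0 * x.0 * y.1
    }

    // `reverse`: the original arguments, then the original result, then the seed;
    // the result contains a partial derivative for each argument.
    func adjointMyOp(_ x: Float, _ y: Float, originalResult: Float,
                     seed: Float) -> (Float, Float) {
        return (seed * 2 * x * y, seed * x * x)
    }
    ```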

    #### Differentiation Order

    When a function is marked as `@differentiable`, Swift assumes it to be
    higher-order differentiable, i.e. differentiable at all orders, unless `once` is
    specified in the attribute, in which case Swift will not guarantee any
    higher-order differentiability. If their associated functions (tangent or
    adjoint) are serialized, then their derivatives _may_ be differentiable via a
    separate code transformation.

    Differentiabilities `linear` and `constant` guarantee smoothness, and they do
    not have to be serialized whatsoever because their derivatives do not depend on
    any code transformation.

    `forward` and `reverse` transitively require the tangent function and the
    adjoint function, respectively, to be differentiable with respect to the
    original arguments. When compiling such declarations, Swift will verify the
    tangent/adjoint function is also differentiable by static analysis. If they are
    not differentiable, the compiler will error out, prompting the user to insert
    `once` in the `@differentiable` attribute.

    Example 1. Linear functions are differentiable at any order.

    ```swift
    public extension Tensor {
        @differentiable(linear, wrt: self)
        func transposed() -> Self {
            ...
        }
    }
    ```

    Example 2. A forward-mode primitive-differentiable function whose closed-form
    tangent is itself differentiable supports higher-order differentiation.

    ```swift
    // Okay, the tangent function is differentiable.
    @differentiable(forward, tangent: tangentFoo)
    func foo(_ x: Float) -> Vector<Float> {
        return Vector(repeating: sin(x), shape: [2, 3])
    }

    func tangentFoo(_ dualX: (Float, Float),
                    originalResult: Vector<Float>) -> Vector<Float> {
        let (x, dx) = dualX
        // Differentiable because `Vector.init(repeating:shape:)`, `*`, `sin` and
        // `cos` are all declared `@differentiable` and are differentiable.
        return Vector(repeating: cos(x) * dx, shape: [2, 3])
    }
    ```

    Example 3. A reverse-mode primitive-differentiable function is not
    differentiable at a higher order because its adjoint is not differentiable.

    ```swift
    @differentiable(reverse, adjoint: adjointBar)
    func bar(_ x: Vector<Float>) -> Float {
        return sin(x)[0]
    }

    var someGlobalVariable: Vector<Float> = [1, 1, 1]

    func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
        var ∂y∂x = Vector<Float>(repeating: 0, shape: x.shape)
        someGlobalVariable[0] = cos(x[0]) * adjoint
        ∂y∂x[0] = someGlobalVariable[0]
        return ∂y∂x
    }
    ```
    ```console
    test.swift:3:35: error: function `bar` does not support higher-order differentiation
    because its adjoint is not differentiable; would you like to add `once`?
    @differentiable(reverse, adjoint: adjointBar)
    ^~~~~~~~~~
    test.swift:8:6: note: `adjointBar` is defined here
    func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
    ^~~~~~~~~~
    test.swift:10:9: note: operation is not differentiable
    ∂y∂x[0] = cos(x[0]) * adjoint
    ^~~~~~~~~~~~~~~~~~~~~~~~~
    ```

    ## Part 3: Basic Differentiation

    Applying the chain rule of differentiation gives us vector-Jacobian
    products or Jacobian-vector products, expressed as functions. Now that we have
    defined primitive differentiable functions, Swift can recursively differentiate
    any function whose body is available to the compiler.

    ### Start Simple: Gradient and Derivatives

    We start by introducing the syntax of two raw differential operators:
    - `#gradient(f)`: Produces the gradient of `f`, where `f: ℝⁿ → ℝ`.
    - `#derivatives(f)`: Produces derivatives of `f`, where `f: ℝ → ℝᵐ`.

    The syntax of these operators looks like macros, but we will generalize them and
    make them look much nicer in the second half of this document.

    Example:

    ```swift
    func f(_ x: Vector<Float>, _ w: Vector<Float>) -> Float {
        return x • w // dot product
    }

    #gradient(f) // (Vector<Float>, Vector<Float>) -> (Vector<Float>, Vector<Float>)

    func g(_ x: Float) -> (Vector<Float>, Vector<Float>) {
        return (Vector(repeating: sin(x), shape: [3]),
                Vector(repeating: cos(x), shape: [3]))
    }

    #derivatives(g) // (Float) -> (Vector<Float>, Vector<Float>)
    ```

    The grammar of these raw differential operators is defined as follows:

    ```ebnf
    derivatives-operator = '#derivatives'
    gradient-operator = '#gradient'
    raw-differential-operator = derivatives-operator | gradient-operator
    autodiff-argument-index-specifier = '.' integer-literal
    autodiff-expression =
        raw-differential-operator '(' expression [ ',' 'wrt' ':' autodiff-argument-index-specifier ] ')'
    expression = autodiff-expression
    ```
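
    The optional `wrt:` clause selects which argument to differentiate with
    respect to, by index. For example (a sketch; the exact result type of a
    `wrt:`-restricted gradient is an assumption here):

    ```swift
    func f(_ x: Vector<Float>, _ w: Vector<Float>) -> Float { ... }

    // Differentiate only with respect to the second argument, `w`.
    #gradient(f, wrt: .1) // (Vector<Float>, Vector<Float>) -> Vector<Float>
    ```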

    ### Embrace Generality: Vector-Jacobian Products and Jacobian-Vector Products

    Gradient and derivatives are two special cases of differentiation where the
    output or the input is a scalar, respectively. When they are not scalars,
    vector-Jacobian products and Jacobian-vector products are computed with a
    vector. These cases are not obvious, but are required for modular machine
    learning APIs where each neural network layer defines a back-propagation method
    that takes a partial derivative vector back-propagated from the previous layer.
    As such, we add two extra differential operators which will be useful for
    computing these products.

    - `#differential(f)`: Produces a function that takes the original arguments and
    returns the differential of `f`.
    - `#pullback(f)`: Produces a function that takes the original arguments and
    returns the pullback of `f`.

    ```ebnf
    jvp-operator = '#differential'
    vjp-operator = '#pullback'
    raw-differential-operator = jvp-operator | vjp-operator
    ```

    Example:

    ```swift
    // A random generic function that is differentiable.
    func f<T0, T1, U>(_ x: T0, _ y: T1) -> U
        where T0: Differentiable, T1: Differentiable, U: Differentiable {
        return someDifferentiableFunction(20, x + y)
    }

    #differential(f) // (T0, T1) -> (U) -> (U, (T0, T1))
    // Description: original args -> vector -> (result, Jacobian-vector products)

    #pullback(f) // (T0, T1) -> (U, (U) -> (T0, T1))
    // Description: original args -> (result, vector -> vector-Jacobian products)
    ```

    ### How It Works

    The compiler type-checks `#gradient(f)`, as well as the other differential
    operators, by searching for the closest match given the contextual type. `f`
    must have a visible definition in order to be differentiable, and thus cannot be
    a closure whose body is opaque to the compiler; if it is, Swift reports an error.

    Later in the compilation pipeline, the compiler recursively transforms the code
    of `f` to its gradient function `∇f` (or other functions in other modes of
    differentiation), and replaces `#gradient(f)` with `∇f`. Everything composes
    together naturally. Now, differentiation works.
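
    For intuition, here is roughly what the generated code corresponds to for a
    tiny function. This is a hand-written illustration, not the actual compiler
    output, and `handWrittenGradientF` is a hypothetical name.

    ```swift
    func f(_ x: Float, _ y: Float) -> Float {
        return x * y + x
    }

    // Roughly what `#gradient(f)` computes, written out by hand:
    // ∂f/∂x = y + 1, ∂f/∂y = x.
    func handWrittenGradientF(_ x: Float, _ y: Float) -> (Float, Float) {
        // Seed the output adjoint with 1 and apply the chain rule in reverse.
        let seed: Float = 1
        return (seed * y + seed, seed * x)
    }
    ```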

    ### AD in Action

    Automatic Differentiation based on raw differential operators is already
    available and being incubated temporarily on [the "tensorflow" branch of
    Swift](https://github.com/apple/swift/tree/tensorflow). Swift for TensorFlow
    [development
    toolchains](https://github.com/tensorflow/swift/blob/master/Installation.md) and
    [tutorials](https://github.com/tensorflow/swift-tutorials/blob/master/iris/swift_tensorflow_tutorial.ipynb)
    are available for trying out this feature.

    ## Part 4: Generalized Differentiability

    Automatic differentiation relies on the definition (body) of a function to be
    able to differentiate it. Differential operators like `#gradient` trigger the
    differentiation of a function, and the differentiability of the function is
    determined as differentiation goes. This works perfectly so far, but has a
    number of problems.

    ### Issues with Definition-Based Differentiability

    #### Syntactic Weirdness

    Raw differential operators adopt the pound-keyword syntax, which has been
    previously used for accessing compiler builtins, e.g. `#file` and `#dsohandle`,
    referring to IDE-specific objects, e.g. `#colorLiteral` and `#imageLiteral`, and
    interoperating with "stringly-typed" Objective-C key paths, e.g.
    `#keyPath(...)`. The pound-keyword syntax does not have native parsing support
    for syntactic features like trailing closures, so it is hard to make the closure
    code short under differential operators like `#gradient`.

    Example:
    ```swift
    // Ideal
    let dydx = gradient { x in
        sin(x) + cos(x)
    }

    // Reality
    let dydx = #gradient({ x in
        sin(x) + cos(x)
    })
    ```

    #### A Higher-Order Function, But Not Quite

    When we introduced AD in Swift earlier in this document, we defined the
    differential operator as a higher-order function. Type checking and type
    inference were just expected to work like any other functions.

    However, since the compiler needs to reject functions that are not
    differentiable and differentiability is not part of the type system, even if we
    were to redefine `#gradient` as a higher-order function named `gradient(of:)`,
    the compiler would still have to maintain dedicated knowledge about this
    function in order to reject invalid arguments.

    #### Cross-Module Differentiability, Without Serialization

    As of now, the differentiability of a function is determined solely through
    two tests:
    - Is the function a primitive-differentiable function (`@differentiable`)?
    - Can the function's body be differentiated in the differentiation mode
    associated with the differential operator applied?

    This simple system works perfectly when differentiating concrete functions
    defined in a local module, but does not allow differentiation of opaque function
    values or methods required by protocols. While being free of serialization is
    not a strict requirement for numerical computing libraries, not supporting
    differentiation on protocol requirements fundamentally obstructs composable
    high-level APIs that rely on AD, such as machine learning model APIs.

    #### Opaque Closures are Non-Differentiable

    There is no way to define a higher-order function that differentiates its
    argument using `#gradient`. Here's an example:

    ```swift
    func foo(_ f: (Float) -> Float) -> Float {
        return #gradient(f)(0)
    }
    ```

    ```console
    test.swift:2:22: error: cannot differentiate an opaque closure
    return #gradient(f)(0)
    ~~~~~~~~~~^~
    test.swift:1:12: note: value defined here
    func foo(_ f: (Float) -> Float) -> Float {
    ^~~~~~~~~~~~~~~~~~~
    ```

    Closure arguments and dynamic dispatch are non-differentiable through direct
    source code transformation. The compiler does not statically know where `f` is
    coming from, nor can it delegate the task of differentiation of argument `f` to
    each callsite of `foo` because it cannot be expressed in the type system.

    ### Solution: Differentiability in Function Types

    As we can see, the core of the problem with definition-based differentiability
    is the opacity of function values. The restriction that the differential operator
    must see the full definition of a function makes it
    impossible to define protocol-oriented differentiable code, and is the primary
    hindrance to modular, composable differentiation APIs.

    It turns out this is not a new problem -- we can learn from how Swift deals with
    calling conventions. Functions with different calling conventions have
    different type signatures, e.g. `@convention(thick)` and `@convention(thin)`,
    and functions convert back and forth through conversion thunks implicitly.

    ```swift
    // A "thin" function that captures no variables.
    // Its representation is `@convention(thin)` by default.
    func f(x: Int) -> Int {
        return x
    }

    var globalVar = 30

    // A "thick" function that captures the value of `globalVar`.
    // Its representation is `@convention(thick)` by default.
    let g = { x in globalVar + x }

    // A higher-order function.
    // The closure argument `h`'s representation is `@convention(thick)`, because it should
    // be able to take closures that capture variables.
    func takeFunc(_ h: (Int) -> Int) { ... }

    takeFunc(f) // Implicitly converts function `f` to a `@convention(thick)` closure by
                // creating a conversion thunk.
    takeFunc(g) // `g` is thick already. No conversion needed.
    ```

    Sometimes, different conventions have different binary representations for
    storing captured variables and such, just like the example with `f` and `g`
    above. In AD, the only difference between a non-differentiable function and a
    differentiated function (say, in reverse mode) is whether the function carries a
    few other function pointers that represent the function's adjoint code, so we
    can model differentiable functions using a "thicker" function type, which
    bundles the original function representation along with pointers to the original
    function's Jacobian-vector product functions and/or vector-Jacobian product
    functions. When a normal function with a visible body gets passed as an
    `@autodiff` function, the function will be differentiated.
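
    As a rough mental model (a sketch for intuition only, not the actual
    representation or ABI), a reverse-mode `@autodiff` function value can be
    pictured as a pair of function pointers:

    ```swift
    // Hypothetical model of what a reverse-mode `@autodiff (T) -> U` value carries.
    struct AutodiffReverseFunction<T, U> {
        /// The original function.
        var original: (T) -> U
        /// The associated function computing the original result together with a
        /// pullback for vector-Jacobian products.
        var primal: (T) -> (result: U, pullback: (U) -> T)
    }
    ```

    From the user's perspective, none of this machinery is visible; passing a
    function where an `@autodiff` type is expected is all it takes: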

    ```swift
    // `f` is a normal function that has type `(Float) -> Float`.
    func f(x: Float) -> Float {
        return sin(x)
    }

    // `f` gets implicitly converted (or more accurately, differentiated).
    let g = f as @autodiff (Float) -> Float

    func takesFunc(_ someFunc: @autodiff (Float) -> Float) {
        #derivatives(someFunc)
        ...
    }

    // At the call site of `takesFunc(_:)`, `f` gets implicitly differentiated to become
    // `@autodiff (Float) -> Float`.
    takesFunc(f)
    ```

    If a normal function does not have a visible body, then it cannot be passed as
    an `@autodiff` function. Swift will show an error at compile-time.

    ```swift
    var normalFuncWithOpaqueBody: (Float) -> Float = ...

    takesFunc(normalFuncWithOpaqueBody)
    ```

    ```console
    test.swift:19:11: error: function is not differentiable, but the contextual type is
    '@autodiff (Float) -> Float'
    takesFunc(normalFuncWithOpaqueBody)
    ^~~~~~~~~~~~~~~~~~~~~~~~

    test.swift:17:4: note: value defined here
    var normalFuncWithOpaqueBody: (Float) -> Float = ...
    ^~~~~~~~~~~~~~~~~~~~~~~~
    ```

    At first glance, this could even be an addition to the existing `@convention`
    attribute, as something like `@convention(autodiff)`; however, differentiability
    does not align semantically with `@convention`. First, when a function becomes
    its differentiable (or differentiated) form, its original calling convention is
    not changed. Second, functions with any convention are technically
    differentiable, including `thin`, `thick`, `method`, etc. Third,
    differentiability is not the only information that needs to be encoded --
    there's also the order of differentiation. Therefore, we need a separate
    dimension of "thickness" in the function type: differentiability.

    We define a new formalization of differentiability in Swift's type system,
    including an `@autodiff` function type attribute, an extension to functions'
    layout, and new syntax for selecting differentiable arguments.

    #### The `@autodiff` Function Type Attribute

    The `@autodiff` attribute on a function type specifies the function's
    differentiability and differentiation order, just like `@differentiable` on
    function declarations. The biggest differences are

    - `@differentiable` contains associated functions (tangent/adjoint) statically,
      but `@autodiff` functions carry those extra function pointers in their binary
      representation as a runtime property. Any user of this function will be able
      to differentiate it, with differentiability guaranteed formally by the type
      system. With this addition to the type system, serialization/inlinability is
      no longer necessary because functions can be passed around without losing
      differentiability.

    - Differentiation order is no longer once vs. infinite. Instead, `@autodiff`
      functions can specify a maximum order at which this function can be
      differentiated, unless the function is linear or constant. This is because
      function-representation-based differentiability requires functions to be
      differentiated ahead of becoming a value and being passed around.

    The grammar for `@autodiff` is defined as follows:

    ```ebnf
    differentiation-order = 'order' ':' integer-literal
    differentiability = 'forward' | 'reverse' | 'linear' | 'constant' | 'bidirectional'
    autodiff-attr = '@autodiff' [ '(' autodiff-attr-arguments ')' ]
    autodiff-attr-arguments = differentiability [ ',' differentiation-order ] | differentiation-order
    ```

    When a differentiability is specified on a function type, the function's
    differentiation behavior is akin to what's defined for the
    `@differentiable` declaration attribute. If no differentiability is specified,
    the function is both forward-mode and reverse-mode differentiable (same as
    `bidirectional`).

    #### Creating `@autodiff` Functions

    It becomes increasingly clear that first-order differentiation will not, and
    should not, require serialization, and that only higher-order differentiation
    should, due to code size. In order to make the system consistent, we make each
    `@differentiable` function declaration result in an `@autodiff` function.

    Since we want to support differentiating opaque functions, we must support
    creating one. The fact is, the user does not even need to know about `@autodiff`
    or intentionally create differentiable functions if they are working with
    functions in the current module. Whenever a local function declaration gets used
    where the contextual type has an `@autodiff` attribute on it, Swift
    differentiates it. If differentiation fails, Swift reports an error at
    compile-time.

    For public APIs, we relax the constraint on `@differentiable` so that it can be
    applied to any function declaration without specifying a tangent or adjoint, even
    when the differentiability is forward/reverse. In this case, Swift tries to
    differentiate the function and export the derivatives as part of the public API: if
    the function gets differentiated, its default type signature has the `@autodiff`
    attribute on it; otherwise, Swift reports an error to the user showing what's
    non-differentiable.
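
    For example (a sketch; `cubed` is a hypothetical function, not part of the
    proposal), a public API can be marked `@differentiable(reverse)` with no
    adjoint, and the compiler derives and exports the derivative:

    ```swift
    @differentiable(reverse)
    public func cubed(_ x: Float) -> Float {
        return x * x * x
    }
    // Default exported type: @autodiff(reverse) (Float) -> Float
    ```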

    #### Higher-Order Differentiation of Opaque Closures

    In order for modular libraries to support opaque higher-order differentiation,
    the differentiation order must be specified in the closure type signature, so
    that the closure ABI is guaranteed to contain the higher-order derivative.

    ```swift
    @autodiff(reverse, order: 2) (T) -> U
    ```

    For example, function `g` takes a differentiable function that is differentiable
    up to at least the 3rd order, then differentiates it 3 times in the body.

    ```swift
    // In a separate module:
func g(_ h: @autodiff(reverse, order: 3) (Float) -> Float) -> Float {
    return #gradient(h)(1) +
        #gradient(#gradient(h))(1) +
        #gradient(#gradient(#gradient(h)))(1)
}
    ```

We also extend the `@differentiable` attribute so that it can force a
primitive-differentiable function to be differentiated to a specific order
ahead of time. For example, when Swift compiles function `f` below, the
function will have been differentiated 6 times, and the derivative functions
will be preserved in `f`'s ABI so that they can be called from anywhere (any
other Swift module, or even C). `f`'s default type signature is
`@autodiff(reverse, order: 6) (Float) -> Float`.

    ```swift
    @differentiable(reverse, order: 6)
    public func f(_ x: Float) -> Float {
    return pow(x, 6)
    }
    ```

    Differentiable functions with a maximum differentiation order can be implicitly
    "down-ordered", that is, differentiable functions with a higher maximum
    differentiation order can be implicitly converted to a function with a lower
    maximum differentiation order. For example, we can directly pass `f` as an
    argument to `g`.

    ```swift
    g(f) // 156
    ```

    #### Conversion Between Differentiabilities

    Because of their mathematical properties, differentiabilities can be converted
    to one another statically without runtime overhead. For example, a constant
    function is also a linear function when it's unary; a linear function is a
    bidirectional-differentiable function whose tangent and adjoint are both
    themselves; any differentiability can be completely dropped from a function
    type, forming a "normal" function. This allows us to define generic algorithms
    using differentiation, without specializing them on function types of each
    differentiability.

    The following table shows whether each differentiability (as a column label) can
    be converted to another (as a row label).

| Convertible to: | None | Linear | Constant  | Forward | Reverse | Bidirectional |
|-----------------|------|--------|-----------|---------|---------|---------------|
| None            |      | ✓      | ✓         | ✓       | ✓       | ✓             |
| Linear          |      |        | ✓ (unary) |         |         |               |
| Constant        |      |        |           |         |         |               |
| Forward         |      | ✓      | ✓         |         |         | ✓             |
| Reverse         |      | ✓      | ✓         |         |         | ✓             |
| Bidirectional   |      | ✓      | ✓         |         |         |               |

    What does differentiability conversion look like in real code? Just like
    `@convention` conversion, differentiability conversion is implicit and has
    little mental overhead to the user.

    ```swift
    let linear: @autodiff(linear) (Float) -> Float = ...
    let bidir: @autodiff (Float) -> Float = ...
    let const: @autodiff(constant) (Float) -> Float = ...

    func foo(_: @autodiff(reverse) (Float) -> Float) { ... }

    foo(linear) // Okay! Implicitly converted to `@autodiff(reverse)`.
    foo(bidir) // Okay! Implicitly converted to `@autodiff(reverse)`.
    foo(const) // Okay! Implicitly converted to `@autodiff(reverse)`.
    ...
    ```

    ## Part 5: True Differential Operators

    [Generalized Differentiability](#part-4-generalized-differentiability) enabled us
    to define custom differential operators in a functional way. Now it's time to
    define the true differential operators.

    ### Derivatives and Gradient

We start with functions that take a function and produce another function that
computes derivatives or gradients. Recall that we already have the built-in
syntax `#gradient` and `#derivatives` for computing gradients and derivatives;
here we explore more expressive APIs enabled by Generalized Differentiability,
which lets us differentiate arguments that are themselves functions.

    #### Forward Differential Operators

    We define two forward-mode differential operators for computing basic
    derivatives:
    - `derivatives(of:)` computes a derivatives function that takes a value and
    returns derivatives evaluated at the given value.
    - `derivatives(at:in:)` computes derivatives of a closure at a given value.

    ```swift
    /// Computes derivatives of `body`.
    func derivatives<T: FloatingPoint, R: Differentiable>(
    of body: @autodiff(forward) (T) throws -> R
    ) rethrows -> (T) -> R {
    return { x in #differential(body)(x)(1).1 } // seed = dx/dx = 1
    }

    /// Computes derivatives of `body` at scalar `x`.
    func derivatives<T: FloatingPoint, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
    ) rethrows -> R {
    return derivatives(of: body)(x)
    }
    ```

    #### Reverse Differential Operators

    We also define two reverse-mode differential operators for computing basic
    gradients:
    - `gradient(of:)` computes a gradient function that takes a value and returns
    the gradient evaluated at the given value.
    - `gradient(at:in:)` computes the gradient of a closure evaluated at a given
    value.

    ```swift
    /// Computes the gradient of `body`.
    func gradient<T: Differentiable, R: FloatingPoint>(
    of body: @autodiff(reverse) (T) throws -> R
    ) rethrows -> (T) -> T {
    return { x in #pullback(body)(x).1(1) } // seed = dx/dx = 1
    }

    /// Computes the gradient of `body` at `x`.
    func gradient<T: Differentiable, R: FloatingPoint>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
    ) rethrows -> T {
    return gradient(of: body)(x)
    }
    ```

Since we can now differentiate a higher-order function's argument (thanks to
Generalized Differentiability), we can define `derivatives(of:)` and
`gradient(of:)` as ordinary Swift functions in terms of the more general raw
differential operators `#differential` and `#pullback`, replacing
`#derivatives` and `#gradient`!
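
As a quick usage sketch (assuming standard-library scalars conform to the
protocols above; the values in the comments follow from basic calculus):

```swift
let dSinAtZero = derivatives(at: 0, in: sin)          // cos(0) == 1
let dSquareAtThree = gradient(at: 3) { x in x * x }   // 6
let gradCube = gradient { (x: Double) in x * x * x }  // gradient function of x³
gradCube(2)                                           // 12
```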

These differential operators work seamlessly with closure captures,
error-throwing functions, and arbitrary side-effecting code that does not
contribute to the closure result. This looks much like value-based automatic
differentiation while the math remains fully functional, and it achieves a
similar level of expressivity to imperative-style automatic differentiation
libraries: instead of writing `gradient(...)` at the bottom of a forward pass,
one writes it at the top and has a trailing closure close over the forward
pass.

    Example: Train a simple 2-layer perceptron. The snippet computes the gradient
    w.r.t. each parameter at each training step, prints a loss, and optimizes
    parameters.

    ```swift
struct Parameters: Differentiable, ParameterGroup {
    var w1 = Tensor<Float>(randomNormal: [784, 30])
    var b1 = Tensor<Float>(zeros: [30])
    var w2 = Tensor<Float>(randomNormal: [30, 10])
    var b2 = Tensor<Float>(zeros: [10])
}

var params = Parameters()
let minibatches = Dataset(...)
var optimizer = StochasticGradientDescent(learningRate: 0.1)
for (x, y) in minibatches {
    let grads = gradient(at: params) { params in
        let h1 = tanh(matmul(x, params.w1) + params.b1)
        let ŷ = sigmoid(matmul(h1, params.w2) + params.b2)
        let loss = (y - ŷ).squared().mean()
        print("Loss is \(loss)")
        return loss
    }
    optimizer.fit(&params, gradients: grads)
}
    ```


    ### Preserving Original Result

Since the forward computation is expressed as a trailing closure passed to
`gradient(at:in:)`, it is just as customizable as in operator-overloading AD
systems. Users can do whatever they want with intermediate values or the
result of the primal computation.

    That said, we would like to provide a way to have the differentiation API return
    the original result directly. Because of Generalized Differentiability, these
    APIs can be defined entirely as library functions using primitive differential
    operators.

    ```swift
    /// Computes `body(x)` and derivatives of each scalar output of `body` at `x`.
    func valueWithDerivatives<T: FloatingPoint, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
    ) rethrows -> (value: R, derivatives: R) {
    return #differential(body)(x)(1)
    }

    /// Computes `body(x)` and the gradient of `body` at `x`.
    func valueWithGradient<T: Differentiable, R: FloatingPoint>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
    ) rethrows -> (value: R, gradient: T) {
    let (y, pullback) = #pullback(body)(x)
    return (y, pullback(1))
    }
    ```
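
A usage sketch (scalar example; the numbers in the comment follow from
d(x²)/dx = 2x):

```swift
let (value, grad) = valueWithGradient(at: 3.0) { x in x * x }
// value == 9.0, grad == 6.0
```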

    ### Jacobian-Vector Products and Vector-Jacobian Products

    Jacobian-vector products (forward-mode) and vector-Jacobian products
    (reverse-mode) are extremely useful differential operators for lots of tasks in
    numerical computing.

    ```swift
/// Computes the Jacobian-vector products of `body` at `x`.
func jacobianVectorProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: T,
    in body: @autodiff(forward) (T) throws -> R
) rethrows -> R {
    return #differential(body)(x)(vector).1
}

/// Computes the vector-Jacobian products of `body` at `x`.
func vectorJacobianProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: R,
    in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T {
    return #pullback(body)(x).1(vector)
}
    ```
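
For a scalar function both products reduce to an ordinary derivative, which
makes for a compact sketch (assuming scalars conform to `Differentiable`):

```swift
// For f(x) = x², the 1×1 Jacobian at x = 3 is [6].
let jvp = jacobianVectorProducts(at: 3.0, vector: 1.0) { x in x * x } // 6.0
let vjp = vectorJacobianProducts(at: 3.0, vector: 1.0) { x in x * x } // 6.0
```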

    ### Differentials and Pullbacks

In some cases, computational tasks rely on fully extensible differential
operators as well as maximum efficiency, e.g. computing vector-Jacobian
products along with the original function's result. Luckily, the two operators
we mentioned at the very beginning when we introduced Jacobians are exactly the
ones we need: the differential and the pullback. We already have their raw
operators in the syntax, `#differential` and `#pullback`, but we can make them
nicer by redefining them as Swift functions.

    Function `differential(at:in:)` computes the differential of a closure at a
    certain point, and returns a linear map that takes a vector and returns
    Jacobian-vector products.

    ```swift
/// Computes the differential of `body` at `x`.
func differential<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T) -> R {
    return { v in #differential(body)(x)(v).1 }
}
    ```

    Function `differentialWithResult(at:in:)` computes the differential of a closure
    at a certain point, and returns a linear map that takes a vector and returns
    both the original function's result and Jacobian-vector products.

    ```swift
/// Computes the differential of `body` at `x` that also computes the value of
/// `body(x)`.
func differentialWithResult<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T) -> (originalResult: R, derivatives: R) {
    return #differential(body)(x)
}
    ```

    Function `pullback(at:in:)` computes the pullback of a closure at a certain
    point, and returns a linear map that takes a vector and returns vector-Jacobian
    products.

    ```swift
    /// Computes the pullback of `body` at `x`.
    func pullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
    ) rethrows -> @autodiff(linear) (R) -> T {
    return #pullback(body)(x).1
    }
    ```

    Function `resultWithPullback(at:in:)` computes the pullback of a closure at a
    certain point, and returns the original function's result and a linear map that
    takes a vector and returns vector-Jacobian products.

    ```swift
/// Computes the original value of `body(x)` and the pullback at `x`.
func resultWithPullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (originalResult: R, pullback: @autodiff(linear) (R) -> T) {
    return #pullback(body)(x)
}
    ```

It is remarkable that we are able to define every differential operator in
terms of other differential operators. `#differential` and `#pullback` have
become unnecessary because the functional forms are so much nicer, so we can
teach the compiler to recognize the Swift functions `differential(at:in:)` and
`pullback(at:in:)` as the built-in "canonical" differential operators, and
remove all raw differential operators that start with `#` from the language.

    Examples:

    1. Chain directional derivatives freely using differentials.

    ```swift
let x = 0.5
let df = differential(at: x) { x in
    sin(cos(x))
}
df(1)                           // df/dx
df(derivatives(of: log)(t))     // df/dt
df(derivatives(at: t, in: log)) // df/dt
    ```

    2. Chain gradients freely using pullbacks.
    ```swift
let x = 0.5
let (y, df) = resultWithPullback(at: x) { x in
    cos(sin(x))
}

df(1)                        // dy/dx
df(gradient(of: log)(t))     // dy/dt
df(gradient(at: t, in: log)) // dy/dt
    ```

    ### Hessian-Vector Products

    Second-order optimization methods in machine learning make use of
    [Hessians](https://en.wikipedia.org/wiki/Hessian_matrix) and Hessian-vector
    products, which can be hard to compute. Many AD libraries such as Autograd
    already support Hessians by supporting arbitrarily nested
    forward-mode/reverse-mode differentiation. Hessian-vector products can be
    efficiently computed by applying "forward-on-reverse", namely applying the
    composition of the forward-mode differential operator and the reverse-mode
    differential operator on a function.

    <p align="center">
    <img src="https://latex.codecogs.com/png.latex?\mathbf{H}_f(\mathbf{x})\mathbf{v}&space;=&space;\mathbf{J}_{\nabla&space;f}(\mathbf{x})\mathbf{v}" title="\mathbf{H}_f(\mathbf{x})\mathbf{v} = \mathbf{J}_{\nabla f}(\mathbf{x})\mathbf{v}" />
    </p>

    Just like other differential operators, we can define the Hessian-vector
    products operator in a simple, functional way.

    ```swift
func hvp<T: Differentiable, R: FloatingPoint>(
    at x: T, in f: @autodiff(order: 2) (T) -> R
) -> @autodiff(linear) (T) -> T {
    return differential(at: x, in: gradient(of: f))
}
    ```
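
A usage sketch: for f(x) = x⁴ the second derivative is 12x², so the
Hessian-vector product at x = 1 with vector 1 is 12 (again assuming scalars
conform to the protocols above).

```swift
let hv = hvp(at: 1.0) { x in x * x * x * x }
hv(1.0) // 12.0
```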

    Nested differentiation without a careful implementation is prone to a bug known
    as perturbation confusion
    [[1]](http://www.bcl.hamilton.ie/~qobi/nesting/papers/ifl2005.pdf)
    [[2]](https://arxiv.org/abs/1211.4892). Language-integrated AD in Swift will
    enforce tagging in compiler-generated code to guarantee the correctness of
    higher-order derivatives.

    ### Standard Library or an `AutomaticDifferentiation` Module?

    Earlier in this document, we discussed enhancements to standard library
    protocols and extensions to the standard library to model differentiable types.
    These protocols are general enough for standard library types such as floating
    point scalars (`Float`, `Double`, and `Float80`) and potentially [SIMD
    vectors](https://github.com/apple/swift-evolution/blob/master/proposals/0229-simd.md).
    However, in any general-purpose programming language, there is always a question
    of how much math the standard library should have.

    We think basic differential operators like `gradient(of:)` and
    `derivatives(of:)` should be included in the standard library, because they are
    common operators that one would find in college calculus, and they will make AD
    feel more language-integrated along with standard library protocols
    `VectorNumeric` and `Differentiable`.

    We do believe that other operators that contain terms like "Jacobian" and
    "differential" should be in a separate module, possibly called
    "AutomaticDifferentiation" that ships with the Swift language.

    ## Part 6: Generalized Types for Differentiation

    We introduced the `Differentiable` protocol that makes a type represent a vector
    space and be differentiable. However, there are a few scenarios where such a
    protocol won't work well.

    1. Customizable weight type

Orthogonal weight matrices have shown advantages in neural network training
[[1]](https://arxiv.org/abs/1702.00071)
[[2]](https://arxiv.org/abs/1709.06079). When differentiating through these
networks, gradients with respect to weights no longer stay orthogonal -
instead, they are skew-symmetric matrices. While we can represent both
orthogonal matrices and skew-symmetric matrices as values of a `Matrix` or
`Tensor` type and programmatically ensure orthogonality, some researchers
have been seeking a way to represent this natively in the type system of a
programming language and still have AD produce the correct derivative.

    2. Quantized training

    Quantization techniques store and calculate numbers in more compact formats,
    i.e. a fixed-point data type. Conceptually, a quantized tensor for a
    real-valued `Tensor` can be defined as the following struct:

    ```swift
public struct Quantized<Dequantized: Quantizable, QuantizedScalar: FixedWidthInteger> {
    var data: Dequantized
    var range: Range<Dequantized.Scalar>
    var scale: QuantizedScalar
    var zeroPoint: Int
}
    ```

We can think of a scenario where the developer defines a neural network as a
function whose parameters are of type `Quantized<Tensor<Float>, Int8>`. When
training the parameters of this neural network, gradients need to flow at a
significantly higher precision, but today's system cannot achieve that
because it assumes gradients have the same type as the original arguments.

    3. Generic optimizers

Optimization problems in machine learning can be generalized as optimization
on manifolds. Optimizers in most libraries assume the original space and the
loss space are both vector spaces, and perform an implicit conversion from
cotangent vectors to tangent vectors and another conversion from tangent
vectors to the original weight type when performing `θ -= η * ∂L/∂θ`. While
this works for most cases, it won't generalize to typed orthogonal
matrices, because orthogonal matrices are not vector spaces, and a conversion
from an orthogonal matrix to a skew-symmetric matrix cannot be implicit.

    ### Revise `Differentiable` Protocol

To address the concerns raised above, we found a more general way to model
differentiable types. Instead of requiring them to be vector spaces
(`VectorNumeric`), we model them as [differentiable
manifolds](https://en.wikipedia.org/wiki/Differentiable_manifold). Reverse-mode
differentiation of a function over manifolds produces gradient vectors in the
manifold's cotangent bundle; forward-mode differentiation produces derivatives
in its tangent bundle. Note that we cannot represent tangent/cotangent bundles
separately from the tangent/cotangent spaces inside each bundle, because Swift
does not have dependent types. By removing the restriction to `VectorNumeric`,
`Differentiable` is now fully extensible.

    ```swift
/// A type that mathematically represents a differentiable manifold whose
/// tangent spaces are finite-dimensional.
///
/// In automatic differentiation, differentiation will produce a Jacobian whose
/// elements are of `TangentVector` type.
public protocol Differentiable {
    /// The tangent vector space of this differentiable manifold.
    associatedtype TangentVector: VectorNumeric
        where TangentVector.Scalar: FloatingPoint

    /// The cotangent space of this differentiable manifold.
    associatedtype CotangentVector: VectorNumeric
        where CotangentVector.Scalar: FloatingPoint

    /// Returns `self` moved along the value space towards the given tangent
    /// vector. In Riemannian geometry (mathematics), this is usually equivalent
    /// to retraction or exponential map.
    func moved(toward direction: TangentVector) -> Self

    /// Converts a cotangent vector to its corresponding tangent vector.
    func tangentVector(from cotangent: CotangentVector) -> TangentVector
}
    ```

When the tangent vector type of a differentiable manifold is the same as its
cotangent vector type, we can provide a default implementation of
`tangentVector(from:)`, which is simply the identity function.

    ```swift
public extension Differentiable where TangentVector == CotangentVector {
    func tangentVector(from cotangent: CotangentVector) -> TangentVector {
        return cotangent
    }
}
    ```

When a differentiable manifold is a vector space, its tangent space is usually
itself. In these cases, we simply define `moved(toward:)` as vector addition.

    ```swift
public extension Differentiable
    where Self: VectorNumeric, TangentVector == Self {
    func moved(toward direction: TangentVector) -> Self {
        return self + direction
    }
}
    ```
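
As a sketch of a hand-written conformance (the type below is hypothetical, and
it assumes `Float` conforms to `VectorNumeric` as a zero-dimensional vector
space), consider a point on the unit circle, a one-dimensional manifold whose
tangent and cotangent spaces are both ℝ:

```swift
struct UnitCirclePoint: Differentiable {
    var angle: Float

    typealias TangentVector = Float
    typealias CotangentVector = Float

    // Moving along the manifold: rotate by the given tangent amount.
    func moved(toward direction: Float) -> UnitCirclePoint {
        return UnitCirclePoint(angle: angle + direction)
    }

    // `tangentVector(from:)` comes from the default implementation above,
    // since `TangentVector == CotangentVector`.
}
```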

    ### Deriving Conformances to `VectorNumeric` and `Differentiable`

    It is very common for numerical computing to deal with lots of parameters, each
    of which is a vector or a matrix. In these cases, instead of manually specifying
    each input in a differential operator's argument list, users would often like
    to differentiate through structures and obtain a structure of partial
derivatives. It is important for Swift to provide derived conformances for the
core numerical computing protocols: `Differentiable` and `VectorNumeric`.

    Mathematically, it is straightforward to represent product types. A struct or
tuple in Swift corresponds to a product of sets; an enum in Swift
corresponds to a sum of sets.

    ```swift
    struct Parameters: VectorNumeric, Differentiable {
    var a: Vector<Float>
    var b: Float
    }
    ```

Struct `Parameters` is equivalent to a product of the sets `Vector<Float>` and
`Float`, or a product of a real vector space `ℝⁿ` and a scalar field `ℝ`, namely
`ℝⁿ × ℝ`, which is also a vector space. To make `Parameters` obtain the traits
of a vector space, we extend the compiler to derive a conformance to
`VectorNumeric`, similar to how `Codable` and `Hashable` conformances are
derived. When a conformance clause is given in the current file and all
stored properties conform to `VectorNumeric` with the same `Scalar`, the
compiler synthesizes the conformance, with all protocol requirements
implemented property-wise.

    After deriving conformances to `VectorNumeric`:

    ```swift
    struct Parameters: VectorNumeric {
    var a: Vector<Float>
    var b: Float

    // derived:
    typealias Scalar = Float

    // derived:
    struct Shape {
    var a: Vector<Float>.Shape
    var b: Float.Shape
    }

    // derived:
    static func + (lhs: Parameters, rhs: Parameters) -> Parameters {
    return Parameters(a: lhs.a + rhs.a, b: lhs.b + rhs.b)
    }
    // ...
    }
    ```

In order for `Parameters` to be differentiable, it must also conform to
`Differentiable`. Deriving conformances to `Differentiable` follows the same
rules.

    ```swift
    struct MyShapes: Differentiable {
    var a: Circle // conforms to Differentiable
    var b: Square // conforms to Differentiable
    }
    ```

    After deriving conformances to `Differentiable`:

    ```swift
struct MyShapes: Differentiable {
    var a: Circle
    var b: Square

    // derived:
    struct TangentVector: VectorNumeric {
        var a: Circle.TangentVector
        var b: Square.TangentVector
    }
    // derived:
    struct CotangentVector: VectorNumeric {
        var a: Circle.CotangentVector
        var b: Square.CotangentVector
    }

    // derived:
    func moved(toward direction: TangentVector) -> MyShapes {
        return MyShapes(a: a.moved(toward: direction.a),
                        b: b.moved(toward: direction.b))
    }

    // derived:
    func tangentVector(from cotangent: CotangentVector) -> TangentVector {
        return TangentVector(a: a.tangentVector(from: cotangent.a),
                             b: b.tangentVector(from: cotangent.b))
    }
}
    ```

    With derived conformances to these protocols, the user can now write arbitrarily
    nested structs of differentiable manifolds, and make them differentiable with
    trivial effort, greatly simplifying the development.
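
For instance (a sketch reusing the hypothetical `Parameters` and `MyShapes`
types above), nesting such types composes automatically:

```swift
struct Model: Differentiable {
    var parameters: Parameters  // conforms via derived `VectorNumeric`/`Differentiable`
    var shapes: MyShapes        // conforms via derived `Differentiable`
}
// `Model.TangentVector`, `Model.CotangentVector`, `moved(toward:)`, and
// `tangentVector(from:)` are all derived property-wise as well.
```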

    ### Generalized Differential Operators

In the new `Differentiable` protocol, we added `TangentVector` and
`CotangentVector` associated types to represent the types of Jacobian-vector
products and vector-Jacobian products, respectively. We make the following
changes to the existing differential operators we introduced.
- Differential operators that return `T` as a forward-differentiated derivative
will return `T.TangentVector` instead.
- Differential operators that return `T` as a reverse-differentiated derivative
will return `T.CotangentVector` instead.
- Vectors `T` for computing Jacobian-vector products will become `T.TangentVector`.
- Vectors `T` for computing vector-Jacobian products will become `T.CotangentVector`.

    Here we list a few updated differential operators.

    ### Jacobian-Vector Products and Vector-Jacobian Products

    Jacobian-vector products (forward-mode) and vector-Jacobian products
    (reverse-mode) are extremely useful differential operators for lots of tasks in
    numerical computing.

    ```swift
/// Computes the Jacobian-vector products of `body` at `x`.
func jacobianVectorProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: T.TangentVector,
    in body: @autodiff(forward) (T) throws -> R
) rethrows -> R.TangentVector {
    return #differential(body)(x)(vector).1
}

/// Computes the vector-Jacobian products of `body` at `x`.
func vectorJacobianProducts<T: Differentiable, R: Differentiable>(
    at x: T, vector: R.CotangentVector,
    in body: @autodiff(reverse) (T) throws -> R
) rethrows -> T.CotangentVector {
    return #pullback(body)(x).1(vector)
}
    ```

    ### Differentials and Pullbacks

    ```swift
/// Computes the differential of `body` at `x`.
func differential<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T.TangentVector) -> R.TangentVector {
    return { v in #differential(body)(x)(v).1 }
}

/// Computes the differential of `body` at `x` that also computes the value of
/// `body(x)`.
func differentialWithResult<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(forward) (T) throws -> R
) rethrows -> @autodiff(linear) (T.TangentVector) -> (originalResult: R, derivatives: R.TangentVector) {
    return #differential(body)(x)
}

/// Computes the pullback of `body` at `x`.
func pullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> @autodiff(linear) (R.CotangentVector) -> T.CotangentVector {
    return #pullback(body)(x).1
}

/// Computes the value of `body(x)` and the pullback at `x`.
func resultWithPullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @autodiff(reverse) (T) throws -> R
) rethrows -> (originalResult: R, pullback: @autodiff(linear) (R.CotangentVector) -> T.CotangentVector) {
    return #pullback(body)(x)
}
    ```

    ### Back to the Problems

    Recall that the motivation of introducing a general, future-proof
    `Differentiable` protocol is to be able to model the following use cases.

1. Neural networks with orthogonal weights can now be differentiated. We can
define a type called `OrthogonalMatrix` that conforms to `Differentiable`, and
another type `SkewSymmetricMatrix` that conforms to both `Differentiable` and
`VectorNumeric`.

    ```swift
struct SkewSymmetricMatrix: Differentiable, VectorNumeric {
    typealias Scalar = Float
    ...
}
struct OrthogonalMatrix: Differentiable {
    ...
    typealias TangentVector = SkewSymmetricMatrix
    typealias CotangentVector = SkewSymmetricMatrix
}
    ```

    When we differentiate a function `(OrthogonalMatrix) -> Float` using the
    reverse-mode differential operator, we'll get a function `(OrthogonalMatrix)
    -> SkewSymmetricMatrix`. Everything falls out, without type safety
    compromises.

    2. Differentiating a quantized network is now possible with AD.

    ```swift
// `Quantized` is a vector space when the dequantized type is one.
extension Quantized: VectorNumeric where Dequantized: VectorNumeric {
    typealias Scalar = Dequantized.Scalar
    static func + (lhs: Quantized, rhs: Quantized) -> Quantized {
        // Custom code: Dequantize, add, and requantize!
    }
    static func * (lhs: Scalar, rhs: Quantized) -> Quantized {
        // Custom code: Dequantize, multiply, and requantize!
    }
}

// `Quantized` is a differentiable manifold when the dequantized type is one.
extension Quantized: Differentiable where Dequantized: Differentiable {
    typealias TangentVector = Dequantized.TangentVector
    typealias CotangentVector = Dequantized.CotangentVector

    func moved(toward tangent: Dequantized.TangentVector) -> Quantized {
        // Custom code: Dequantize, move, and requantize!
    }
}
    ```

    With `Quantized` conforming to the new `Differentiable` protocol, when we
    differentiate a function of type `(Quantized<Tensor<Float>, Int8>) -> U`, AD
    produces a function of type `(Quantized<Tensor<Float>, Int8>) ->
Tensor<Float>`, which is essentially what we need in quantized training
of neural networks.

    3. Generic optimizers can be defined in terms of manifold optimization
    functions, without implicit casting.

    ```swift
    extension SGD {
    func fit(_ parameters: inout Parameters, gradients: Parameters) {
    parameters.update(withGradients: gradients) { θ, g in
    θ = θ.moved(toward: -θ.tangentVector(from: g) * learningRate)
    }
    }
    }
    ```

## Part 7: Customizable Differentiation

    Some machine learning models require manipulating the gradient with respect to
    certain values, e.g. gradient clipping.
    [Tangent](https://github.com/google/tangent) provides such a feature as a syntax
    extension in Python. Recurrent neural networks often suffer from the "exploding
    gradient" problem, and a typical solution is to force the gradient of an RNN to
    not exceed a certain value by performing gradient clipping.

    ```swift
func prediction(for input: Tensor<Float>) -> Tensor<Float> {
    var prediction = input
    for _ in 0...5 {
        // Clip the gradient that flows back through this point.
        prediction = prediction.withCustomizedGradient { grad in
            max(min(grad, 1), -1)
        }
        prediction = lstm.prediction(for: prediction)
    }
    return prediction
}
    ```

The APIs `withCustomizedGradient(_:)` and `withCustomizedDerivatives(_:)` look
like compiler-known functions that make Swift run custom code inside the
differentiated code. However, because of the generality of the [differential
registration](#differential-registration) mechanism, they can be defined
entirely as Swift functions with no special support from the compiler.
Here's the implementation of these APIs.

    ```swift
    public extension Differentiable {
    @differentiable(forward, wrt: self, tangent: tangentCustomizingDerivatives)
    func withCustomizedDerivatives(
    _ body: @nondiff (TangentVector) -> TangentVector
    ) -> Self {
    return self
    }

    internal func tangentCustomizingDerivatives(
    body: (TangentVector) -> TangentVector,
    originalResult: Self,
    tangent: TangentVector
    ) -> TangentVector {
    return body(tangent)
    }

    @differentiable(reverse, wrt: self, adjoint: adjointCustomizingGradient)
    func withCustomizedGradient(
    _ body: @nondiff (CotangentVector) -> CotangentVector
    ) -> Self {
    return self
    }

    internal func adjointCustomizingGradient(
    body: (CotangentVector) -> CotangentVector,
    originalResult: Self,
    adjoint: CotangentVector
    ) -> CotangentVector {
    return body(adjoint)
    }
    }
    ```
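
The forward-mode counterpart works the same way; for example (a sketch reusing
`prediction` from above), one could clamp derivatives instead of gradients:

```swift
let clamped = prediction.withCustomizedDerivatives { tangent in
    max(min(tangent, 1), -1)
}
```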

    This API supports many gradient manipulation tasks in machine learning
    optimization. For example, the user can make gradient computation trigger a
    break from the loop.

    ```swift
var prediction = input
for _ in 0...5 {
    // Stop the loop when necessary.
    var shouldStop = false
    prediction = prediction.withCustomizedGradient { grad in
        if grad < lowerBound {
            shouldStop = true
        }
        return grad
    }
    if shouldStop {
        break
    }
    prediction = lstm.prediction(for: prediction)
}
    ```

Setting a mutable flag is not the most user-friendly approach. We can create
APIs that wrap `withCustomizedDerivatives(_:)` and `withCustomizedGradient(_:)`
and return a `Bool`, so that later code can decide whether to `break` from the
loop based on that return value. Better still, if Swift ever supports non-local
control flow, i.e. branching out of nested closures, the code could be written
as a plain `break`.

    ```swift
var prediction = input
for _ in 0...5 {
    // Stop the loop when necessary.
    prediction = prediction.withCustomizedGradient { grad in
        if grad < lowerBound {
            break
        }
        return grad
    }
    prediction = lstm.prediction(for: prediction)
}
    ```

    ## Acknowledgements

    The author would like to thank Dan Zheng, Chris Lattner, Alex Wiltschko, Bart
    van Merriënboer, Gordon Plotkin, Dougal Maclaurin, Matthew Johnson, Casey Chu,
    Tim Harley, Marc Rasi, and Dmitri Gribenko for their input to the initial design
    of this powerful language feature.
  2. rxwei revised this gist Jun 17, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -9,7 +9,7 @@ programming language design community, with a strong focus on language design.

    **Status: Outdated**

    Please see [Swift Automatic Differentiation Design Overview] instead.
    **Please see [Swift Automatic Differentiation Design Overview](https://docs.google.com/document/d/1bPepWLfRQa6CtXqKA8CDQ87uZHixNav-TFjLSisuKag/edit?usp=sharing) instead.**

    ## Table of Contents

  3. rxwei revised this gist Jun 17, 2019. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -7,7 +7,9 @@ First-Class Automatic Differentiation in Swift: A Manifesto
    This document is written for both the machine learning community and the Swift
    programming language design community, with a strong focus on language design.

    **Status: Currently undergoing major revision.**
    **Status: Outdated**

    Please see [Swift Automatic Differentiation Design Overview] instead.

    ## Table of Contents

  4. rxwei revised this gist Dec 5, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -7,7 +7,7 @@ First-Class Automatic Differentiation in Swift: A Manifesto
    This document is written for both the machine learning community and the Swift
    programming language design community, with a strong focus on language design.

    Status: Currently undergoing major revision.
    **Status: Currently undergoing major revision.**

    ## Table of Contents

  5. rxwei revised this gist Nov 12, 2018. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -7,6 +7,8 @@ First-Class Automatic Differentiation in Swift: A Manifesto
    This document is written for both the machine learning community and the Swift
    programming language design community, with a strong focus on language design.

    Status: Currently undergoing major revision.

    ## Table of Contents

    - [Introduction](#introduction)
  6. rxwei revised this gist Oct 29, 2018. 2 changed files with 0 additions and 68 deletions.
    3 changes: 0 additions & 3 deletions .gitignore
    Original file line number Diff line number Diff line change
    @@ -1,3 +0,0 @@
    *.tex
    *.pdf
    auto/
    65 changes: 0 additions & 65 deletions reduced-differentiability-model.org
    Original file line number Diff line number Diff line change
    @@ -1,65 +0,0 @@
    #+TITLE: The Reduced Differentiability Model
    #+SUBTITLE: Using only Differentials and Adjoints

    * TODO Introduction



    * TODO Motivation



    * Solution

    ** Rule of First-Order Differentiability

    * A function is forward-differentiable if
    * it has a forward-differentiable body,
    * it has a differential, or
    * it has a reverse-differentiable /adjoint/.
    * A function is reverse-differentiable if
    * it has a reverse-differentiable body,
    * it has a reverse-differentiable /differential/, or
    * it has an /adjoint/.

    ** TODO Rule of Higher-Order Differentiability


    ** Simplified Differential and Adjoint Definition Syntax

    #+BEGIN_SRC swift
    extension Vector {
    @differentiable(wrt: self)
    static func * (lhs: Vector, rhs: Vector) -> Vector {
    return ... // non-differentiable

    adjoint(v: Vector) -> (Vector, Vector) {
    return (rhs * v, lhs * v)
    }
    }
    }
    #+END_SRC

    #+BEGIN_SRC swift
    @differentiable
    func cos(_ x: Vector) -> Vector {
    return ... // non-differentiable

    differential(v: Vector) -> Vector {
    return -sin(x) * v
    }
    }
    #+END_SRC

    #+BEGIN_SRC swift
    extension Tensor {
    @differentiable(wrt: self)
    func transposed() -> Tensor {
    return ... // non-differentiable

    adjoint(v: Tensor) -> Tensor {
    return v.transposed()
    }
    }
    }
    #+END_SRC
  7. rxwei revised this gist Oct 29, 2018. 2 changed files with 68 additions and 0 deletions.
    3 changes: 3 additions & 0 deletions .gitignore
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,3 @@
    *.tex
    *.pdf
    auto/
    65 changes: 65 additions & 0 deletions reduced-differentiability-model.org
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,65 @@
    #+TITLE: The Reduced Differentiability Model
    #+SUBTITLE: Using only Differentials and Adjoints

    * TODO Introduction



    * TODO Motivation



    * Solution

    ** Rule of First-Order Differentiability

    * A function is forward-differentiable if
    * it has a forward-differentiable body,
    * it has a differential, or
    * it has a reverse-differentiable /adjoint/.
    * A function is reverse-differentiable if
    * it has a reverse-differentiable body,
    * it has a reverse-differentiable /differential/, or
    * it has an /adjoint/.

    ** TODO Rule of Higher-Order Differentiability


    ** Simplified Differential and Adjoint Definition Syntax

    #+BEGIN_SRC swift
    extension Vector {
    @differentiable(wrt: self)
    static func * (lhs: Vector, rhs: Vector) -> Vector {
    return ... // non-differentiable

    adjoint(v: Vector) -> (Vector, Vector) {
    return (rhs * v, lhs * v)
    }
    }
    }
    #+END_SRC

    #+BEGIN_SRC swift
    @differentiable
    func cos(_ x: Vector) -> Vector {
    return ... // non-differentiable

    differential(v: Vector) -> Vector {
    return -sin(x) * v
    }
    }
    #+END_SRC

    #+BEGIN_SRC swift
    extension Tensor {
    @differentiable(wrt: self)
    func transposed() -> Tensor {
    return ... // non-differentiable

    adjoint(v: Tensor) -> Tensor {
    return v.transposed()
    }
    }
    }
    #+END_SRC
  8. rxwei revised this gist Oct 23, 2018. 1 changed file with 12 additions and 11 deletions.
    23 changes: 12 additions & 11 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -83,23 +83,24 @@ func f(_ x: Double, _ y: Double) -> Double {

    ### Vectors and Jacobians

    In numerical computing, users often write code that operate on high-dimensional
    In numerical computing, users often write code that operates on high-dimensional
    mathematical objects. The basic typing rules that we defined on real scalars
    (![](http://latex.codecogs.com/gif.latex?\mathbb{R})) can be generalized for
    [module](https://en.wikipedia.org/wiki/Module_(mathematics))-like types such as
    vectors with extra consideration for shape. In vector calculus, the
    differentiation of a function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}:\mathbb{R}^n\rightarrow\mathbb{R}^m) is
    defined per scalar because there are multiple inputs and multiple outputs. Full
    differentiation of vector-valued function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}) will result in a matrix,
    each of whose entries is a function that computes the partial derivative of an
    output scalar with respect to an input scalar. This matrix is called a
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}:\mathbb{R}^n\rightarrow\mathbb{R}^m)
    is defined per scalar because there are multiple inputs and multiple outputs.
    Full differentiation of a vector-valued function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}) will thus result in a
    matrix, each of whose entries is a function that computes the partial derivative
    of an output scalar with respect to an input scalar. This matrix is called a
    [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant). In
    this definition, the Jacobian matrix has type
    ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:(\mathbb{R}\rightarrow\mathbb{R})^{mn}).
    For simplicity, we will model it as a function that maps vectors
    to real-valued matrices ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:\mathbb{R}^n\rightarrow\mathbb{R}^{mn}).
    For simplicity, we will model it as a function that maps vectors to real-valued
    matrices
    ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:\mathbb{R}^n\rightarrow\mathbb{R}^{mn}).

    <p align="center">
    <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/74e93aa903c2695e45770030453eb77224104ee4"
    @@ -118,7 +119,7 @@ func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    Computing the Jacobian of a function is often unnecessary in gradient-based
    optimization methods. Computing a full Jacobian will require repeated
    evaluations of some primitives in computer code: vector-Jacobian products (VJPs)
    and Jacobian-vector products (JVPs), and VJPs and JVPs are often exactly what we
    or Jacobian-vector products (JVPs), and VJPs and JVPs are often exactly what we
    need in practice. In these terms, "vector" refers to a vector of partial
    derivatives that are to be chained with the Jacobian by left-multiplication or
    right-multiplication. As we explain chaining next, we discuss how Automatic
    @@ -1044,7 +1045,7 @@ and function convert back and forth through conversion thunks implicitly.
    ```swift
    // A "thin" function that captures no variables.
    // Its representation is `@convention(thin)` by default.
    func f(x: Int) {
    func f(x: Int) -> Int {
    return x
    }

  9. rxwei revised this gist Oct 23, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -477,7 +477,7 @@ multiplication). The protocol will be called `Arithmetic`.

    ```swift
    public protocol Arithmetic: Equatable {
    var zero: Self { get }
    static var zero: Self { get }
    prefix static func + (x: Self) -> Self
    static func + (lhs: Self, rhs: Self) -> Self
    static func += (lhs: inout Self, rhs: Self) -> Self
  10. rxwei revised this gist Oct 23, 2018. 1 changed file with 7 additions and 6 deletions.
    13 changes: 7 additions & 6 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -115,13 +115,14 @@ func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    where T: FloatingPoint
    ```

    Calculating the Jacobian of a function is often unnecessary in gradient-based
    Computing the Jacobian of a function is often unnecessary in gradient-based
    optimization methods. Computing a full Jacobian will require repeated
    evaluations of vector-Jacobian products (VJPs) and Jacobian-vector products
    (JVPs), but VJPs and JVPs are often what we need in practice. In these terms,
    "vector" refers to a vector of partial derivatives that are to be chained with
    the Jacobian by left-multiplication or right-multiplication. As we explain
    chaining next, we discuss how Automatic Differentiation comes in the picture.
    evaluations of some primitives in computer code: vector-Jacobian products (VJPs)
    and Jacobian-vector products (JVPs), and VJPs and JVPs are often exactly what we
    need in practice. In these terms, "vector" refers to a vector of partial
    derivatives that are to be chained with the Jacobian by left-multiplication or
    right-multiplication. As we explain chaining next, we discuss how Automatic
    Differentiation comes in the picture.

    ### Gradient and Reverse-Mode AD

  11. rxwei revised this gist Oct 23, 2018. 1 changed file with 16 additions and 17 deletions.
    33 changes: 16 additions & 17 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -116,12 +116,12 @@ func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    ```

    Calculating the Jacobian of a function is often unnecessary in gradient-based
    optimization methods. In practice, we care more about two byproducts of Jacobian
    calculation that are significantly easier to compute than the Jacobian itself:
    vector-Jacobian products and Jacobian-vector products. In these terms, "vector"
    refers to a vector of partial derivatives that are to be chained with the
    Jacobian by left-multiplication or right-multiplication. As we explain chaining
    next, we discuss how Automatic Differentiation comes in the picture.
    optimization methods. Computing a full Jacobian will require repeated
    evaluations of vector-Jacobian products (VJPs) and Jacobian-vector products
    (JVPs), but VJPs and JVPs are often what we need in practice. In these terms,
    "vector" refers to a vector of partial derivatives that are to be chained with
    the Jacobian by left-multiplication or right-multiplication. As we explain
    chaining next, we discuss how Automatic Differentiation comes in the picture.

    ### Gradient and Reverse-Mode AD

    @@ -469,10 +469,10 @@ On the Swift forum, we have discussed the [fundamental blocker for vector types
    to conform to the existing `Numeric`
    protocol](https://forums.swift.org/t/should-numeric-not-refine-expressiblebyintegerliteral).
    The consensus was to introduce a weakening of the `Numeric` protocol to
    represent the abstractions shared between scalars and vectors:
    [rng](https://en.wikipedia.org/wiki/Rng_(algebra)) (We assumed that vector spaces
    are rngs by endowing them with `*` as element-wise multiplication). The protocol
    will be called `Arithmetic`.
    represent the abstractions shared between scalars and vectors: [rng (ring
    without unity)](https://en.wikipedia.org/wiki/Rng_(algebra)) (We assumed that
    vector spaces are rngs by endowing them with `*` as element-wise
    multiplication). The protocol will be called `Arithmetic`.

    ```swift
    public protocol Arithmetic: Equatable {
    @@ -502,13 +502,12 @@ public protocol Numeric: Arithmetic, ExpressibleByIntegerLiteral {

    After we introduce the `Arithmetic` protocol, which makes the standard library
    suitable for vector APIs and beyond, we can define a protocol that generalizes
    vectors. Mathematically, a vector space is a rng if we endow them with `*` as
    element-wise multiplication. We represent vector spaces through the
    `VectorNumeric` protocol as follows. `Scalar` is the type of the elements
    of this vector space -- the field which the vector space is over.
    `Shape` is the shape of this vector space, which is
    customizable. The initializer takes a value of the `Scalar` type and a
    `Shape` and returns a vector of the specified shape.
    vectors. Mathematically, a vector space is a ring without unity if we endow them
    with `*` as element-wise multiplication. We represent vector spaces through the
    `VectorNumeric` protocol as follows. `Scalar` is the type of the elements of
    this vector space -- the field which the vector space is over. `Shape` is the
    shape of this vector space, which is customizable. The initializer takes a value
    of the `Scalar` type and a `Shape` and returns a vector of the specified shape.

    ```swift
    /// A type that represents an unranked vector space. Values of this type are
  12. rxwei revised this gist Oct 22, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -390,7 +390,7 @@ func hvp<T: Differentiable, R: FloatingPoint>(

    By building first-class AD into the programming language, we can provide better
    diagnostics about differentiability and numeric stability than any other dynamic
    language, all at compile-time.
    languages, all at compile-time.

    ```console
    test.swift:58:10: error: function is not differentiable
  13. rxwei revised this gist Oct 22, 2018. 1 changed file with 5 additions and 4 deletions.
    9 changes: 5 additions & 4 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -675,8 +675,6 @@ There are five options for differentiability:

    2. Reverse: `@differentiable(reverse, adjoint: ...)`

    This option says that the function is reverse-mode differentiable.

    This option says that the function is reverse-mode differentiable.
    Reverse-mode differentiation requires the "adjoint code" (or adjoint
    function) of this function, so that Swift knows how to compute the function's
    @@ -697,7 +695,7 @@ There are five options for differentiability:

    By definition, constant functions always have zero derivatives and are
    differentiable at any arbitrary order. So differentiating this function will
    result into a vector (or vectors, when the function has multiple
    result into a zero vector (or vectors, when the function has multiple
    differentiation arguments) with the same shape as each differentiation
    argument.

    @@ -885,7 +883,10 @@ expression = autodiff-expression
    Gradient and derivatives are two special cases of differentiation where the
    output or the result is a scalar, respectively. When they are not a scalar,
    vector-Jacobian products and Jacobian-vector products are being computed with a
    vector. We add two extra differential operators which will be useful for
    vector. These cases are not obvious, but are required for modular machine
    learning APIs where each neural network layer defines a back-propagation method
    that takes a partial derivative vector back-propagated from the previous layer.
    As such, we add two extra differential operators which will be useful for
    computing these products.

    - `#differential(f)`: Produces a function that takes the original arguments and
  14. rxwei revised this gist Oct 21, 2018. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -1242,8 +1242,8 @@ type, forming a "normal" function. This allows us to define generic algorithms
    using differentiation, without specializing them on function types of each
    differentiability.

    The following table shows whether each differentiability can be converted to
    another.
    The following table shows whether each differentiability (as a column label) can
    be converted to another (as a row label).

    | Convertible to: | None | Linear | Constant | Forward | Reverse | Bidirectional |
    |-----------------|------|-----------|----------|---------|---------|---------------|
  15. rxwei revised this gist Oct 21, 2018. 1 changed file with 3 additions and 2 deletions.
    5 changes: 3 additions & 2 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -882,11 +882,12 @@ expression = autodiff-expression

    ### Embrace Generality: Vector-Jacobian Products and Jacobian-Vector Products

    While gradient and derivatives are two special cases of differentiation where
    the output or the result is a scalar, respectively. When they are not a scalar,
    Gradient and derivatives are two special cases of differentiation where the
    output or the result is a scalar, respectively. When they are not a scalar,
    vector-Jacobian products and Jacobian-vector products are being computed with a
    vector. We add two extra differential operators which will be useful for
    computing these products.

    - `#differential(f)`: Produces a function that takes the original arguments and
    returns the differential of `f`.
    - `#pullback(f)`: Produces a function that takes the original arguments and
  16. rxwei revised this gist Oct 21, 2018. 1 changed file with 8 additions and 7 deletions.
    15 changes: 8 additions & 7 deletions ad-manifesto.md
    Original file line number Diff line number Diff line change
    @@ -116,9 +116,8 @@ func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    ```

    Calculating the Jacobian of a function is often unnecessary in gradient-based
    optimization methods, and is often unnecessary in gradient-based optimization
    methods. In practice, we care more about two byproducts of Jacobian calculation
    that are significantly easier to compute than the Jacobian itself:
    optimization methods. In practice, we care more about two byproducts of Jacobian
    calculation that are significantly easier to compute than the Jacobian itself:
    vector-Jacobian products and Jacobian-vector products. In these terms, "vector"
    refers to a vector of partial derivatives that are to be chained with the
    Jacobian by left-multiplication or right-multiplication. As we explain chaining
    @@ -140,11 +139,13 @@ row in the matrix, which is exactly the
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    </p>

    When this vector ![](http://latex.codecogs.com/gif.latex?\mathbf{v}^i) represents the
    gradient of another function ![](http://latex.codecogs.com/gif.latex?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    When vector ![](http://latex.codecogs.com/gif.latex?\mathbf{v}) in
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}\mathbf{J_f}(\mathbf{x}))
    represents the gradient of another function
    ![](http://latex.codecogs.com/gif.latex?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}(\mathbf{x})), namely
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{f_i(\mathbf{x})}),
    then the vector-Jacobian products will represent
    then the vector-Jacobian products represents
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{\mathbf{x}}). The
    linear function that takes a vector and left-multiplies it with the Jacobian is
    also called a
    @@ -514,7 +515,7 @@ customizable. The initializer takes a value of the `Scalar` type and a
    /// elements in this vector space and with a specific shape.
    public protocol VectorNumeric: Arithmetic {
    /// The type of scalars in the vector space.
    associatedtype Scalar : Numeric
    associatedtype Scalar: Numeric

    /// The type whose values specifies the shape of an object in the vector
    /// space.
  17. rxwei revised this gist Oct 21, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    @@ -89,7 +89,7 @@ mathematical objects. The basic typing rules that we defined on real scalars
    [module](https://en.wikipedia.org/wiki/Module_(mathematics))-like types such as
    vectors with extra consideration for shape. In vector calculus, the
    differentiation of a function
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^n\rightarrow\mathbb{R}^m) is
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}:\mathbb{R}^n\rightarrow\mathbb{R}^m) is
    defined per scalar because there are multiple inputs and multiple outputs. Full
    differentiation of vector-valued function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}) will result in a matrix,
  18. rxwei revised this gist Oct 21, 2018. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions ad-manifesto.md
    @@ -2044,5 +2044,5 @@ for _ in 0...5 {

    The author would like to thank Dan Zheng, Chris Lattner, Alex Wiltschko, Bart
    van Merriënboer, Gordon Plotkin, Dougal Maclaurin, Matthew Johnson, Casey Chu,
    Tim Harley, Marc Rasi, and Dmitri Gribenko for their input to the design of this
    powerful language feature.
    Tim Harley, Marc Rasi, and Dmitri Gribenko for their input to the initial design
    of this powerful language feature.
  19. rxwei revised this gist Oct 21, 2018. 1 changed file with 33 additions and 16 deletions.
    49 changes: 33 additions & 16 deletions ad-manifesto.md
    @@ -324,10 +324,10 @@ for (x, y) in minibatches {
    We want our AD system to be fully extensible to the point where users can
    request derivatives of a function taking their own user-defined numeric types,
    and even use this feature to implement structure-dependent algorithms such as
    tree-recursive neural networks. Therefore, AD makes no assumptions about
    individual math functions or the types it should support. We enable library
    designers and developers to easily define any type or differentiable functions,
    all in pure Swift code.
    tree-recursive neural networks. Therefore, when performing AD, Swift makes no
    special assumptions about individual math functions or the types it should
    support. We enable library designers and developers to easily define any type or
    differentiable functions, all in pure Swift code.

    Swift supports [protocol-oriented programming and first-class value
    semantics](https://developer.apple.com/videos/play/wwdc2015/408/). AD is deeply
    @@ -341,8 +341,10 @@ extension MyType: Differentiable {
    }
    ```

    Or make obviously non-differentiable functions differentiable by using the
    `@differentiable` attribute:
    Or make an obviously non-differentiable function differentiable by using the
    `@differentiable` attribute, specifying a "tangent" function for computing its
    Jacobian-vector products, or an "adjoint" function for computing its
    vector-Jacobian products.

    ```swift
    @differentiable(tangent: tangentFoo, adjoint: adjointFoo)
    @@ -377,9 +379,8 @@ trigger differentiation as needed.

    ```swift
    func hvp<T: Differentiable, R: FloatingPoint>(
    at x: T, in f: @autodiff(order: 2) (T.CotangentVector) -> R
    ) -> @autodiff(linear) (T.TangentVector) -> T.CotangentVector.TangentVector
    where T.CotangentVector: Differentiable {
    at x: T, in f: @autodiff(order: 2) (T) -> R
    ) -> @autodiff(linear) (T) -> T {
    return differential(at: x, in: gradient(of: f))
    }
    ```
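
A possible use of the revised `hvp(at:in:)` above, sketched under the
assumption that `Vector2<Float>` conforms to `Differentiable` and supports the
initializer and element access shown (the names here are illustrative):

```swift
// f(x) = x₀² + 3·x₀·x₁; its Hessian at any point is [[2, 3], [3, 0]].
func f(_ x: Vector2<Float>) -> Float {
    return x[0] * x[0] + 3 * x[0] * x[1]
}

// `hvp(at:in:)` returns a linear map that multiplies the Hessian of `f` at
// the given point by a tangent vector, computed forward-over-reverse.
let hessianTimes = hvp(at: Vector2<Float>(1, 2), in: f)
let hv = hessianTimes(Vector2<Float>(1, 0))   // expected: (2, 3)
```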
    @@ -416,16 +417,31 @@ imperative.
    | | Syntax | Meaning |
    |------------|--------|-------------|
    | Functional | `let 𝝯f = gradient(of: f)`<br/>`𝝯f(x)` | Differentiating a function |
    | Imperative | `y = f(x)`<br/>`gradient(of: y, wrt: x)` | Differentiating code traced through data flow |
    | Imperative | `let y = f(x)`<br/>`gradient(of: y, wrt: x)` | Differentiating code traced through data flow |

    Functional-style AD is transforming one function to another, producing a
    function that takes original arguments and returns the partial derivatives
    evaluated at each argument. Imperative-style AD, on the other hand, is a
    value-value dependency analysis. Although we use both notations in mathematics,
    imperative AD comes at the cost of semantic inconsistency with the host
    language. In Swift's AD system, we believe we can achieve the same level of
    expressivity as imperative AD while preserving functional properties, and use
    language integration to push developers' productivity to the next level.
    language, for example:

    ```swift
    let y = f(x)
    x = 3
    gradient(of: y, wrt: x) // undefined
    ```

    Semantically, `y` is a value, but `x` is both a value and a reference to a
    memory location -- it is unclear what exactly we are differentiating with
    respect to. Though making `y` and `x` have reference types could make this
    particular example work out semantically, it would be fundamentally inconsistent
    with Swift's core design where mathematical objects have value types, and would
    also make scalar types like `Float` incompatible with automatic differentiation.

    We believe Swift's AD can achieve the same level of expressivity as imperative
    AD while preserving functional properties, and use language integration to push
    developers' productivity to the next level.
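
For contrast, the same idea in the functional style is a plain function
transformation, with no hidden dependence on mutable state (a minimal sketch
using the `gradient(of:)` operator from the table above):

```swift
func f(_ x: Float) -> Float {
    return x * x
}

let dfdx = gradient(of: f)   // (Float) -> Float
dfdx(3)                      // 6
let y = f(3)                 // evaluating f afterwards changes nothing
```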


    ## Part 1: Differentiable Types
    @@ -559,8 +575,9 @@ based on:

    As such we provide a syntactic way of specifying the differentiability of a
    function, using either the function's linearity properties or a separate
    function to specify the "tangent code" or "adjoint code" for the original
    function.
    function to specify the "tangent code", which specifies how to differentiate the
function in forward mode, or "adjoint code", which specifies how to
    differentiate the function in reverse mode.

    ### The `@differentiable` attribute

    @@ -622,7 +639,7 @@ public func conv2d(_ input: Tensor<Float>, filter: Tensor<Float>,

    func adjointConv2D(_ input: Tensor<Float>, filter: Tensor<Float>,
    strides: (Int32, Int32, Int32, Int32),
    padding: Padding) -> Tensor<Float> {
    padding: Padding) -> (Tensor<Float>, Tensor<Float>) {
    ...
    }
    ```
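
The same registration pattern on a smaller function may make the adjoint's
shape clearer. This is a sketch only; the parameter labels mirror the
`adjointBar` examples that appear in later revisions, and
`multiply`/`adjointMultiply` are made-up names:

```swift
// Reverse-mode primitive registration: the adjoint receives the original
// arguments, the original result `y`, and the incoming adjoint (seed), and
// returns one partial derivative per argument.
@differentiable(reverse, adjoint: adjointMultiply)
func multiply(_ a: Float, _ b: Float) -> Float {
    return a * b
}

func adjointMultiply(_ a: Float, _ b: Float, y: Float, adjoint: Float)
    -> (Float, Float) {
    return (adjoint * b, adjoint * a)
}
```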
  20. rxwei revised this gist Oct 21, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    @@ -141,7 +141,7 @@ row in the matrix, which is exactly the
    </p>

    When this vector ![](http://latex.codecogs.com/gif.latex?\mathbf{v}^i) represents the
    gradient of another function ![](http://latex.codecogs.com/latex.gif?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    gradient of another function ![](http://latex.codecogs.com/gif.latex?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}(\mathbf{x})), namely
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{f_i(\mathbf{x})}),
    then the vector-Jacobian products will represent
  21. rxwei revised this gist Oct 21, 2018. 1 changed file with 7 additions and 5 deletions.
    12 changes: 7 additions & 5 deletions ad-manifesto.md
    @@ -59,11 +59,13 @@ maps points onto their corresponding slopes.
    In the context of Swift, differentiating a function `(Float) -> Float` produces
    `(Float) -> Float`. Functions with multiple arguments, such as `(Float, Float)
    -> Float`, can be thought of as a function whose input domain is a product of
    those arguments types, i.e. `(ℝ ⨯ ℝ) → ℝ`, so the derivative of such a function
    has type `(Float, Float) -> (Float, Float)`. According to this typing rule, the
    differential operator ![](http://latex.codecogs.com/gif.latex?\dfrac{d}{dx}) can
    be declared as a higher-order function, overloaded for each number of arguments
    because a Swift function's argument list is not formally modeled as a tuple.
those argument types, i.e.
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}),
    so the derivative of such a function has type `(Float, Float) -> (Float,
    Float)`. According to this typing rule, the differential operator
    ![](http://latex.codecogs.com/gif.latex?\dfrac{d}{dx}) can be declared as a
    higher-order function, overloaded for each number of arguments because a Swift
    function's argument list is not formally modeled as a tuple.

    ```swift
    func 𝒟<T: FloatingPoint>(_ f: (T) -> T) -> (T) -> T
  22. rxwei revised this gist Oct 21, 2018. 1 changed file with 28 additions and 17 deletions.
    45 changes: 28 additions & 17 deletions ad-manifesto.md
    @@ -46,8 +46,11 @@ programming language.

    ### Basic Calculus

    In basic calculus, differentiating a function of type `ℝ → ℝ` produces a function
    `ℝ → ℝ` that maps points onto their corresponding slopes.
    In basic calculus, differentiating a function of type
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\rightarrow\mathbb{R})
    produces a function
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}\rightarrow\mathbb{R}) that
    maps points onto their corresponding slopes.

    <p align="center">
    <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/9315f1516ee5847107808697e43693d91abfc6e8"
    @@ -80,18 +83,21 @@ func f(_ x: Double, _ y: Double) -> Double {

    In numerical computing, users often write code that operate on high-dimensional
    mathematical objects. The basic typing rules that we defined on real scalars
(`ℝ`) can be generalized for
    (![](http://latex.codecogs.com/gif.latex?\mathbb{R})) can be generalized for
    [module](https://en.wikipedia.org/wiki/Module_(mathematics))-like types such as
    vectors with extra consideration for shape. In vector calculus, the
    differentiation of a function `f: ℝⁿ → ℝᵐ` is defined per scalar because there
    are multiple inputs and multiple outputs. Full differentiation of vector
    function `f` will result in a matrix, each of whose entries is a function that
    computes the partial derivative of an output scalar with respect to an input
    scalar. This matrix is called a
    differentiation of a function
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^n\rightarrow\mathbb{R}^m) is
    defined per scalar because there are multiple inputs and multiple outputs. Full
    differentiation of vector-valued function
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}) will result in a matrix,
    each of whose entries is a function that computes the partial derivative of an
    output scalar with respect to an input scalar. This matrix is called a
    [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant). In
    this definition, the Jacobian matrix has type: `J: (ℝ → ℝ)ᵐⁿ`. For simplicity,
    we will model it as a function that maps vectors to real-valued matrices `J: ℝⁿ
    → ℝᵐⁿ`.
    this definition, the Jacobian matrix has type
    ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:(\mathbb{R}\rightarrow\mathbb{R})^{mn}).
    For simplicity, we will model it as a function that maps vectors
    to real-valued matrices ![](http://latex.codecogs.com/gif.latex?\mathbf{J_f}:\mathbb{R}^n\rightarrow\mathbb{R}^{mn}).

    <p align="center">
    <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/74e93aa903c2695e45770030453eb77224104ee4"
    @@ -119,9 +125,11 @@ next, we discuss how Automatic Differentiation comes in the picture.
    ### Gradient and Reverse-Mode AD

    When we let a [one-hot](https://en.wikipedia.org/wiki/One-hot) row vector
    `vⁱ: ℝᵐ = onehot(i)` left-multiply a
    Jacobian matrix of type `ℝᵐⁿ`, we are selecting one row in the matrix,
    which is exactly the [gradient](https://en.wikipedia.org/wiki/Gradient) of
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}^i:\mathbb{R}^m=\Big[0\cdots_{i-1}1\cdots0\Big])
    left-multiply a Jacobian matrix of type
    ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^{mn}), we are selecting one
    row in the matrix, which is exactly the
    [gradient](https://en.wikipedia.org/wiki/Gradient) of
    ![](http://latex.codecogs.com/gif.latex?f_i) evaluated at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{x}), i.e.
    ![](http://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})).
    @@ -130,7 +138,8 @@ which is exactly the [gradient](https://en.wikipedia.org/wiki/Gradient) of
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    </p>

    When this vector `vⁱ` represents the gradient of another function `g: ℝᵐ → ℝ` at
    When this vector ![](http://latex.codecogs.com/gif.latex?\mathbf{v}^i) represents the
    gradient of another function ![](http://latex.codecogs.com/latex.gif?g:\mathbb{R}^m\rightarrow\mathbb{R}) at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}(\mathbf{x})), namely
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{f_i(\mathbf{x})}),
    then the vector-Jacobian products will represent
    @@ -165,8 +174,10 @@ partial derivatives from the final output, eventiually reaching each input.

    ### Directional Derivatives and Forward-Mode AD

    Similarly, when we let a column vector `v: ℝⁿ¹` right-multiply a Jacobian value
    matrix of type `ℝᵐⁿ`, the result is a vector whose elements are exactly the
    Similarly, when we let a column vector
    ![](http://latex.codecogs.com/gif.latex?\mathbf{v}:\mathbb{R}^{n1}) right-multiply a
    Jacobian value
    matrix of type ![](http://latex.codecogs.com/gif.latex?\mathbb{R}^{mn}), the result is a vector whose elements are exactly the
    [directional derivatives](https://en.wikipedia.org/wiki/Directional_derivative)
    of each ![](http://latex.codecogs.com/gif.latex?f_i) evaluated at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{x}) in direction ![](http://latex.codecogs.com/gif.latex?\mathbf{v}).
  23. rxwei revised this gist Oct 21, 2018. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions ad-manifesto.md
    @@ -127,7 +127,7 @@ which is exactly the [gradient](https://en.wikipedia.org/wiki/Gradient) of
    ![](http://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})).

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_0}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_0} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    </p>

    When this vector `vⁱ` represents the gradient of another function `g: ℝᵐ → ℝ` at
    @@ -143,7 +143,7 @@ body of this function can be defined in terms of `𝒟`, the differential operat
    that returns a Jacobian.

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\dfrac{\partial&space;g(\mathbf{f}(\mathbf{x}))}{\partial&space;\mathbf{x}}=\dfrac{\partial&space;g}{\partial&space;\mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x})&space;=&space;\bigg[&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_0}&space;\&space;\cdots&space;\&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_n}&space;\bigg]" title="\dfrac{\partial g(\mathbf{f}(\mathbf{x}))}{\partial \mathbf{x}}=\dfrac{\partial g}{\partial \mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x}) = \bigg[ \dfrac{\partial g(\mathbf{x})}{\partial x_0} \ \cdots \ \dfrac{\partial g(\mathbf{x})}{\partial x_n} \bigg]" />
    <img src="https://latex.codecogs.com/gif.latex?\dfrac{\partial&space;g(\mathbf{f}(\mathbf{x}))}{\partial&space;\mathbf{x}}=\dfrac{\partial&space;g}{\partial&space;\mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x})&space;=&space;\bigg[&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_1}&space;\&space;\cdots&space;\&space;\dfrac{\partial&space;g(\mathbf{x})}{\partial&space;x_n}&space;\bigg]" title="\dfrac{\partial g(\mathbf{f}(\mathbf{x}))}{\partial \mathbf{x}}=\dfrac{\partial g}{\partial \mathbf{f}(\mathbf{x})}\mathbf{J_f}(\mathbf{x}) = \bigg[ \dfrac{\partial g(\mathbf{x})}{\partial x_1} \ \cdots \ \dfrac{\partial g(\mathbf{x})}{\partial x_n} \bigg]" />
    </p>

    ```swift
    @@ -172,7 +172,7 @@ of each ![](http://latex.codecogs.com/gif.latex?f_i) evaluated at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{x}) in direction ![](http://latex.codecogs.com/gif.latex?\mathbf{v}).

    <p align="center">
    <img src="https://latex.codecogs.com/gif.latex?\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_0}(\mathbf{x})\&space;\cdots\&space;\nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" title="\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_0}(\mathbf{x})\ \cdots\ \nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" />
    <img src="https://latex.codecogs.com/gif.latex?\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_1}(\mathbf{x})\&space;\cdots\&space;\nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" title="\nabla_\mathbf{v}\mathbf{f}(\mathbf{x})=\mathbf{J_f}(\mathbf{x})\mathbf{v}=\bigg[\nabla_\mathbf{v}{f_1}(\mathbf{x})\ \cdots\ \nabla_\mathbf{v}{f_m}(\mathbf{x})\bigg]" />
    </p>

    The linear function that takes a vector and right-multiplies the Jacobian value
  24. rxwei revised this gist Oct 20, 2018. 1 changed file with 84 additions and 15 deletions.
    99 changes: 84 additions & 15 deletions ad-manifesto.md
    @@ -1007,20 +1007,89 @@ Turns out, this is not a new problem - we should learning from how we deal with
    calling conventions in Swift. Functions with different calling conventions have
    different type signatures, e.g. `@convention(thick)` and `@convention(thin)`,
and functions convert back and forth through conversion thunks implicitly.

    ```swift
    // A "thin" function that captures no variables.
    // Its representation is `@convention(thin)` by default.
    func f(x: Int) {
    return x
    }

    var globalVar = 30

    // A "thick" function that captures the value of `globalVar`.
    // Its representation is `@convention(thick)` by default.
    let g = { x in globalVar + x }

    // A higher-order function.
    // The closure argument `h`'s representation is `@convention(thick)`, because it should
    // be able to take closures that capture variables.
func takeFunc(_ h: (Int) -> Int) { ... }

takeFunc(f) // `f` is implicitly converted to a `@convention(thick)` closure by
    // creating a conversion thunk.
    takeFunc(g) // `g` is thick already. No conversion needed.
    ```

    Sometimes, different conventions have different binary representations for
    storing captured variables and such. In AD, the only difference between a
    non-differentiable function and a differentiated function (say, in reverse mode)
    is whether the function carries a few other function pointers that represent the
    function's adjoint code, so we can simply add a thicker representation!

    At first glance, this could even be an addition to the existing
    `@convention` attribute as something like `@convention(differentiable)`,
    however, differentiability does not align semantically with `@convention`.
    First, when a function becomes its differentiable (or differentiated) form, its
    original calling convention is not changed. Second, functions with any
    convention is technically differentiable, including `thin`, `thick`, `method`,
    etc. Therefore, we need a separate dimension of "thickness" in the function
    type: differentiability.
    storing captured variables and such, just like the example with `f` and `g`
    above. In AD, the only difference between a non-differentiable function and a
    differentiated function (say, in reverse mode) is whether the function carries a
    few other function pointers that represent the function's adjoint code, so we
    can model differentiable functions using a "thicker" function type, which
    bundles the original function representation along with pointers to the original
    function's Jacobian-vector product functions and/or vector-Jacobian product
    functions. When a normal function with a visible body gets passed as an
    `@autodiff` function, the function will be differentiated.

    ```swift
    // `f` is a normal function that has type `(Float) -> Float`.
    func f(x: Float) -> Float {
    return sin(x)
    }

// `f` gets implicitly converted (or more accurately, differentiated).
    let g = f as @autodiff (Float) -> Float

    func takesFunc(_ someFunc: @autodiff (Float) -> Float) {
    #derivatives(someFunc)
    ...
    }

// At the callsite of `takesFunc(_:)`, `f` gets implicitly differentiated to become
    // `@autodiff (Float) -> Float`.
    takesFunc(f)
    ```

    If a normal function does not have a visible body, then it cannot be passed as
    an `@autodiff` function. Swift will show an error at compile-time.

    ```swift
    var normalFuncWithOpaqueBody: (Float) -> Float = ...

    takesFunc(normalFuncWithOpaqueBody)
    ```

    ```console
    test.swift:19:11: error: function is not differentiable, but the contextual type is
    '@autodiff (Float) -> Float'
    takesFunc(normalFuncWithOpaqueBody)
    ^~~~~~~~~~~~~~~~~~~~~~~~

    test.swift:17:4: note: value defined here
    var normalFuncWithOpaqueBody: (Float) -> Float = ...
    ^~~~~~~~~~~~~~~~~~~~~~~~
    ```

    At first glance, this could even be an addition to the existing `@convention`
attribute as something like `@convention(autodiff)`; however, differentiability
does not align semantically with `@convention`. First, when a function becomes
its differentiable (or differentiated) form, its original calling convention is
not changed. Second, functions with any convention are technically
differentiable, including `thin`, `thick`, `method`, etc. Third,
    differentiability is not the only information that needs to be encoded --
    there's also the order of differentiation. Therefore, we need a separate
    dimension of "thickness" in the function type: differentiability.
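
A sketch of what that separate dimension could look like in source, reusing
the attribute spellings that appear elsewhere in this document (the exact
pairings below are illustrative assumptions):

```swift
// Differentiability is orthogonal to calling convention: thin functions,
// closures, and methods can all be given an @autodiff type.
func square(_ x: Float) -> Float { return x * x }

let f: @autodiff (Float) -> Float = square                  // first-order
let g: @autodiff(order: 2) (Float) -> Float = { $0 * $0 }   // at least 2nd order
let h: @autodiff(linear) (Float) -> Float = { 2 * $0 }      // linear map
```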

    We define a new formalization of differentiability in Swift's type system,
    including an `@autodiff` function type attribute, an extension to functions'
    @@ -1237,7 +1306,7 @@ As we can see, since we are to differentiate a higher-order function's argument
    (thanks to Generalized Differentiability), we can define `derivatives(of:)` and
    `gradient(of:)` as Swift functions in terms of more general raw differential
    operators, `#differential` and `#pullback`, to replace `#derivatives` and
    `#derivatives`!
    `#gradient`!

    These differential operators work seamlessly with closure captures,
    error-throwing functions, or arbitrary side-effecting code that do not
    @@ -1827,7 +1896,7 @@ Recall that the motivation of introducing a general, future-proof
    ```swift
    extension SGD {
    func fit(_ parameters: inout Parameters, gradients: Parameters) {
    parameters.update(with: gradients) { θ, g in
    parameters.update(withGradients: gradients) { θ, g in
    θ = θ.moved(toward: -θ.tangentVector(from: g) * learningRate)
    }
    }
  25. rxwei revised this gist Oct 20, 2018. 1 changed file with 10 additions and 5 deletions.
    15 changes: 10 additions & 5 deletions ad-manifesto.md
    @@ -100,7 +100,7 @@ we will model it as a function that maps vectors to real-valued matrices `J: ℝ

    While it is challenging to define this function with full type safety in Swift
    because shapes cannot be generic parameters yet, we can define a differential
    operator as the following, specialized on shape.
    operator as the following, specialized on shapes.

    ```swift
    func 𝒟<T>(_ f: (Vector2<T>) -> Vector3<T>) -> (Vector2<T>) -> Matrix3x2<T>
    @@ -309,9 +309,9 @@ for (x, y) in minibatches {
    ### Full Extensibility: Custom Types and Derivatives

    We want our AD system to be fully extensible to the point where users can
    request the derivatives of a function taking their own user-defined numeric
    types, and even use this feature to implement structure-dependent algorithms
    such as tree-recursive neural networks. Therefore, AD makes no assumptions about
    request derivatives of a function taking their own user-defined numeric types,
    and even use this feature to implement structure-dependent algorithms such as
    tree-recursive neural networks. Therefore, AD makes no assumptions about
    individual math functions or the types it should support. We enable library
    designers and developers to easily define any type or differentiable functions,
    all in pure Swift code.
    @@ -355,7 +355,12 @@ All differential operators are defined in Swift, and developers can create their
    own differential operators by composing existing ones. For example, the user can
    use the "forward-on-reverse" approach to compute [Hessian-vector
    products](https://en.wikipedia.org/wiki/Hessian_matrix), where the `hvp(at:in:)`
    operator is defined as a native Swift function.
    operator is defined as a native Swift function. The [`@autodiff(order:
    2)`](#the-autodiff-function-type-attribute) attribute in the closure type
    signature marks the closure argument as being differentiable up to at least the
    2nd order, so that the caller of `hvp(at:in:)` will differentiate the actual
closure argument as needed.

    ```swift
    func hvp<T: Differentiable, R: FloatingPoint>(
  26. rxwei revised this gist Oct 20, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    @@ -829,7 +829,7 @@ func g(_ x: Float) -> (Vector<Float>, Vector<Float>) {
    return x w
    }

    #derivatives(f) // (Float) -> (Vector<Float>, Vector<Float>)
    #derivatives(g) // (Float) -> (Vector<Float>, Vector<Float>)
    ```

    The grammar of these raw differential operators is defined as follows:
  27. rxwei revised this gist Oct 20, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    @@ -829,7 +829,7 @@ func g(_ x: Float) -> (Vector<Float>, Vector<Float>) {
    return x w
    }

    #gradient(f) // (Vector<Float>, Vector<Float>) -> (Vector<Float>, Vector<Float>)
    #derivatives(f) // (Float) -> (Vector<Float>, Vector<Float>)
    ```

    The grammar of these raw differential operators is defined as follows:
  28. rxwei revised this gist Oct 20, 2018. 1 changed file with 2 additions and 3 deletions.
    5 changes: 2 additions & 3 deletions ad-manifesto.md
    @@ -788,9 +788,8 @@ func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
    }
    ```
    ```console
    test.swift:3:35: error: function `bar` does not support higher-order
    differentiation because its adjoint is not differentiable; would you like to add
    `once`?
    test.swift:3:35: error: function `bar` does not support higher-order differentiation
    because its adjoint is not differentiable; would you like to add `once`?
    @differentiable(reverse, adjoint: adjointBar)
    ^~~~~~~~~~
    test.swift:8:6: note: `adjointBar` is defined here
  29. rxwei revised this gist Oct 20, 2018. 1 changed file with 8 additions and 5 deletions.
    13 changes: 8 additions & 5 deletions ad-manifesto.md
    @@ -130,7 +130,7 @@ which is exactly the [gradient](https://en.wikipedia.org/wiki/Gradient) of
    <img src="https://latex.codecogs.com/gif.latex?\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial&space;x_0}&space;\&space;\cdots&space;\&space;\dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" title="\nabla{f_i}(\mathbf{x})=\mathbf{v}^i\mathbf{J_f}(\mathbf{x})=\bigg[\dfrac{\partial{f_i}(\mathbf{x})}{\partial x_0} \ \cdots \ \dfrac{\partial{f_i}(\mathbf{x})}{\partial{x_n}}\bigg]" />
    </p>

    When this vector represents the gradient of another function `g: ℝᵐ → ℝ` at
    When this vector `vⁱ` represents the gradient of another function `g: ℝᵐ → ℝ` at
    ![](http://latex.codecogs.com/gif.latex?\mathbf{f}(\mathbf{x})), namely
    ![](http://latex.codecogs.com/gif.latex?\partial{g}/\partial{f_i(\mathbf{x})}),
    then the vector-Jacobian products will represent
    @@ -667,7 +667,7 @@ There are five options for differentiability:

    5. Linear: `@differentiable(linear)`

    By definiton, a linear map is always a unary function and its Jacobian is
    By definition, a linear map is always a unary function and its Jacobian is
    the matrix associated with this linear transformation itself. In other
    words, both its differential and its pullback are itself.

    @@ -715,7 +715,7 @@ As explained, differentiabilities have different functional requirements.
    4. Other differentiabilities

    Other differentiabilities such as `constant` and `linear` do not require
    any associated functions. However, the users can choose to specify
    any associated functions. However, users can choose to specify
    tangent/adjoint function(s) for their own purposes such as custom
    optimizations.

    @@ -778,9 +778,12 @@ func bar(_ x: Vector<Float>) -> Float {
    return sin(x)[0]
    }

    var someGlobalVariable: Vector<Float> = [1, 1, 1]

    func adjointBar(_ x: Vector<Float>, y: Float, adjoint: Float) -> Vector<Float> {
    var yx = Vector<Float>(repeating: 0, shape: x.shape)
    yx[0] = cos(x[0]) * adjoint
    someGlobalVariable[0] = cos(x[0]) * adjoint
    yx[0] = someGlobalVariable[0]
    return yx
    }
    ```
    @@ -1062,7 +1065,7 @@ due to code size. In order to make the system consistent, we make each

    Since we want to support differentiating opaque functions, we must support
    creating one. The fact is, the user does not even need to know about `@autodiff`
    or intentially create differentiable functions if they are working with
    or intentionally create differentiable functions if they are working with
    functions in the current module. Whenever a local function declaration gets used
    where the contextual type has an `@autodiff` attribute on it, Swift
    differentiates it. If differentiation fails, Swift reports an error at
  30. rxwei revised this gist Oct 20, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion ad-manifesto.md
    @@ -1430,7 +1430,7 @@ composition of the forward-mode differential operator and the reverse-mode
    differential operator on a function.

    <p align="center">
    <img src="https://latex.codecogs.com/png.latex?\mathbf{H_f}(\mathbf{x})\mathbf{v}&space;=&space;\mathbf{J}_{\nabla&space;\mathbf{f}}(\mathbf{x})\mathbf{v}" title="\mathbf{H_f}(\mathbf{x})\mathbf{v} = \mathbf{J}_{\nabla \mathbf{f}}(\mathbf{x})\mathbf{v}" />
    <img src="https://latex.codecogs.com/png.latex?\mathbf{H}_f(\mathbf{x})\mathbf{v}&space;=&space;\mathbf{J}_{\nabla&space;f}(\mathbf{x})\mathbf{v}" title="\mathbf{H}_f(\mathbf{x})\mathbf{v} = \mathbf{J}_{\nabla f}(\mathbf{x})\mathbf{v}" />
    </p>

    Just like other differential operators, we can define the Hessian-vector