tool to create derivates of a given function

Daniel Shawul · Post by **Daniel Shawul** » Tue Nov 07, 2017 12:43 pm

Actually there is no need for this if the eval is linear -- which it commonly is. Compute it through finite difference once and store the Jacobian matrix, after that one doesn't have to call back eval() again. An example implementation is here : https://github.com/dshawul/Scorpio/blob/master/eval.cpp The jacobian takes about 1.5G bytes to store for 400 or so variables. The tuning with the Jacobian is much much faster than without it as one have to just multiply coefficients to compute the eval. I guess I could use SIMD vectorization on CPUs or even GPUs to make computing the eval and its derivative much much faster. Another advantage of this is that you can still use ints in your eval, and use float/double to store the Jacobian. Moreover, the "standard" implementation of autodiff using operator overloading is pretty slow compared to the source-to-source transformation used in fortran and other languages.

Daniel

AlvaroBegue · Post by **AlvaroBegue** » Tue Nov 07, 2017 3:01 pm

Rein Halbersma wrote:[...]
It's reasonably straightforward to code this in C++. See e.g. Alvaro's RuyTunehttp://chessprogramming.wikispaces.com/RuyTune

It may require some minor rewrites in your evaluation function, mainly if you use compound assignment operators (+= etc., instead of a+=b; you need to write a = a + b; ).

Oh, that's my fault. I could have provided those operators (and I probably should have).

Rein Halbersma · Post by **Rein Halbersma** » Tue Nov 07, 2017 7:48 pm

AlvaroBegue wrote:
Rein Halbersma wrote:[...]
It's reasonably straightforward to code this in C++. See e.g. Alvaro's RuyTunehttp://chessprogramming.wikispaces.com/RuyTune

It may require some minor rewrites in your evaluation function, mainly if you use compound assignment operators (+= etc., instead of a+=b; you need to write a = a + b; ).
Oh, that's my fault. I could have provided those operators (and I probably should have).

It's not obvious to me how to implement it, and Python autograd also does not support it. You want that a = b; a += c; to be equivalent to a = b + c; I am curious to how you would implement it.

AlvaroBegue · Post by **AlvaroBegue** » Tue Nov 07, 2017 7:58 pm

Rein Halbersma wrote:
AlvaroBegue wrote:
Rein Halbersma wrote:[...]
It's reasonably straightforward to code this in C++. See e.g. Alvaro's RuyTunehttp://chessprogramming.wikispaces.com/RuyTune

It may require some minor rewrites in your evaluation function, mainly if you use compound assignment operators (+= etc., instead of a+=b; you need to write a = a + b; ).
Oh, that's my fault. I could have provided those operators (and I probably should have).
It's not obvious to me how to implement it, and Python autograd also does not support it. You want that a = b; a += c; to be equivalent to a = b + c; I am curious to how you would implement it.

Code: Select all

  inline Variable &operator+=&#40;Variable &v1, Variable v2&#41; &#123;
    return v1 = v1 + v2;
  &#125;

That seems to work in RuyTune. Through the magic of shared_ptr the old value of v1 is not lost.

Here's a sample program, using autodiff.hpp from RuyTune:

Code: Select all

#include "autodiff.hpp"

namespace autodiff &#123;
  inline Variable &operator+=&#40;Variable &v1, Variable v2&#41; &#123;
   return v1 = v1 + v2;
 &#125;
&#125;

template <typename Score>
Score three_times&#40;Score s&#41; &#123;
  s += 2.0 * s;
  return s;
&#125;

int main&#40;) &#123;
  autodiff&#58;&#58;Variable x = 2.0;
  autodiff&#58;&#58;Variable y = three_times&#40;x&#41;;
  y.p->derivative = 1.0;
  autodiff&#58;&#58;backpropagate&#40;);
  std&#58;&#58;cout << x.p->derivative << '\n'; // Prints 3
&#125;

Rein Halbersma · Post by **Rein Halbersma** » Wed Nov 08, 2017 2:58 pm

AlvaroBegue wrote:
Code: Select all
  inline Variable &operator+=&#40;Variable &v1, Variable v2&#41; &#123;
    return v1 = v1 + v2;
  &#125;
That seems to work in RuyTune. Through the magic of shared_ptr the old value of v1 is not lost.

Ah, that's a nice way to do it. One thing that bothers me, however, is that it is not the canonical way to overload operators. Usually you define op@= (with @ = *, /, +, - etc.) as a member function, returning *this, and then you define op@ as a non-member function in terms of op@=.

Your way of defining op@= in terms of op@ is non-standard. I would have to think it through very carefully if all the usual arithmetic properties are still being satisfied (likely, but I'd like to prove it to myself first

).

One other thing that I noticed in your RuyTune library is that you have a global storage of the computational graph, where you rely on the LIFO nature in which the various contributions to the loss functions are being added. The way e.g. the machine learning library PyTorch works, is to have each variable hold its child edges by shared pointer, that the user then calls backward() on any such variable (in Python Autograd library, this is done implicitly), after which you can call grad() on any of its children (the weights to be learned).

Here's a toy implementation of this approach:

Code: Select all

#include <cassert>  // assert
#include <iostream> // cout
#include <memory>   // make_shared, shared_ptr
#include <utility>  // pair
#include <vector>   // vector

// class with value semantics
template<class T>
class Var 
&#123;
    struct Node;
    using EdgeWeight = std&#58;&#58;pair<std&#58;&#58;shared_ptr<Node>, T>;

    // nested type held by shared_ptr inside Var
    struct Node 
    &#123;
        T data &#123;&#125;;
        T grad &#123;&#125;;
        std&#58;&#58;vector<EdgeWeight> edge;
       
        explicit Node&#40;T const& v&#41; noexcept
        &#58;
            data&#40;v&#41;
        &#123;&#125;
        
        Node&#40;T const& v, std&#58;&#58;vector<EdgeWeight> const& e&#41;
        &#58;
            data&#40;v&#41;, 
            edge&#40;e&#41;
        &#123;&#125;
    &#125;;

    std&#58;&#58;shared_ptr<Node> node_ptr;
    Node const* operator->() const noexcept &#123; return node_ptr.get&#40;); &#125;
    
    // recursive traversal of all child nodes
    void propagate&#40;std&#58;&#58;shared_ptr<Node> const& n&#41;
    &#123;
        for &#40;auto i = std&#58;&#58;size_t&#123;0&#125;, N = n->edge.size&#40;); i < N; ++i&#41; &#123;
            auto& c = n->edge&#91;i&#93;.first;
            c->grad += n->edge&#91;i&#93;.second * n->grad;
            propagate&#40;c&#41;;
        &#125;
    &#125;

public&#58;    
    Var&#40;T const& v&#41;
    &#58;
        node_ptr&#40;std&#58;&#58;make_shared<Node>&#40;v&#41;)
    &#123;&#125;
    
    Var&#40;T const& v, std&#58;&#58;vector<EdgeWeight> const& e&#41;
    &#58;
        node_ptr&#40;std&#58;&#58;make_shared<Node>&#40;v, e&#41;)
    &#123;&#125;

    // back propagation over all child nodes
    void backward&#40;)
    &#123;
        node_ptr->grad = T&#123;1&#125;;
        propagate&#40;node_ptr&#41;;
    &#125;

    // requires previous call to backward&#40;) 
    friend auto grad&#40;Var const& v&#41;
    &#123;
        return v->grad;
    &#125;

    friend std&#58;&#58;ostream& operator<<&#40;std&#58;&#58;ostream& ostr, Var<T> const& v&#41;
    &#123;
        return ostr << v->data;
    &#125;

    friend Var operator+&#40;Var const& L, Var const& R&#41;
    &#123;
        return 
        &#123; 
            L->data + R->data, 
            &#123;
                &#123; L.node_ptr, T&#123;1&#125; &#125;,
                &#123; R.node_ptr, T&#123;1&#125; &#125;
            &#125;
        &#125;;
    &#125;

    friend Var operator*&#40;Var const& L, Var const& R&#41;
    &#123;
        return 
        &#123; 
            L->data * R->data, 
            &#123; 
                &#123; L.node_ptr, R->data &#125;,
                &#123; R.node_ptr, L->data &#125;
            &#125;
        &#125;;
    &#125;
&#125;;

int main&#40;)
&#123;
    Var w1 = 2.0;
    Var w2 = 3.0;
    Var w3 = 5.0;
    Var loss = w1 + w2 * w3;    // forward computation of loss is implicit
    std&#58;&#58;cout << w1 << '\n';
    std&#58;&#58;cout << w2 << '\n';
    std&#58;&#58;cout << w3 << '\n';
    std&#58;&#58;cout << loss << '\n';
    loss.backward&#40;);            // backward propagation of gradient is explicit
    std&#58;&#58;cout << grad&#40;w1&#41; << '\n';
    std&#58;&#58;cout << grad&#40;w2&#41; << '\n';
    std&#58;&#58;cout << grad&#40;w3&#41; << '\n';
&#125;

Live Example

AlvaroBegue · Post by **AlvaroBegue** » Wed Nov 08, 2017 3:11 pm

Oh, that does seem better than my solution. I'll study it carefully and I will probably adopt it. Thanks!

Rein Halbersma · Post by **Rein Halbersma** » Wed Nov 08, 2017 3:22 pm

Daniel Shawul wrote:Actually there is no need for this if the eval is linear -- which it commonly is. Compute it through finite difference once and store the Jacobian matrix, after that one doesn't have to call back eval() again. An example implementation is here : https://github.com/dshawul/Scorpio/blob/master/eval.cpp The jacobian takes about 1.5G bytes to store for 400 or so variables. The tuning with the Jacobian is much much faster than without it as one have to just multiply coefficients to compute the eval. I guess I could use SIMD vectorization on CPUs or even GPUs to make computing the eval and its derivative much much faster. Another advantage of this is that you can still use ints in your eval, and use float/double to store the Jacobian. Moreover, the "standard" implementation of autodiff using operator overloading is pretty slow compared to the source-to-source transformation used in fortran and other languages.

Daniel

Finite difference computations of gradients scale linearly in the number of parameters, whereas automatic differentiation is a constant (for both methods, it is also scales as a constant times the forward computation of the loss). There's a reason that all modern machine learning libraries use automatic differentiation.

Rein Halbersma · Post by **Rein Halbersma** » Wed Nov 08, 2017 3:36 pm

AlvaroBegue wrote:Oh, that does seem better than my solution. I'll study it carefully and I will probably adopt it. Thanks!

I will open a new project Autograd on GitHub, will probably be my xmas break project.

I'm still not completely satisfied. Feature it needs IMO is that every Var should hold a std::set of all it's unique children. This enables writing loss.grad(w1) instead of the toy example's grad(w1), so that you can compute gradients of any variable with respect to any of its children. Look at Python's autograd or PyTorch for inspiration about nice syntactic sugar

Re: the op+= example that you gave. There are some corner cases like x+=a; followed by x-=a, or self-assignment x-=x; that need careful unit testing to make sure they are correct. Or you need a general proof that no cycles between shared pointers can be made through your definitions. As soon as you have shared_ptr cycles, there's a bug.

AlvaroBegue · Post by **AlvaroBegue** » Wed Nov 08, 2017 3:58 pm

Rein Halbersma wrote: Re: the op+= example that you gave. There are some corner cases like x+=a; followed by x-=a, or self-assignment x-=x; that need careful unit testing to make sure they are correct. Or you need a general proof that no cycles between shared pointers can be made through your definitions. As soon as you have shared_ptr cycles, there's a bug.

I don't get what seems so scary about the operator+= I wrote. I am as convinced that it is correct as any other part of the library. In some sense my very primitive implementation, with a global vector that stores all the relevant partial derivatives, makes this easier to think about. The time ordering of the operations is proof that there is no loop.

But if you do find a case where it fails, I am ready to learn something new.

Daniel Shawul · Post by **Daniel Shawul** » Wed Nov 08, 2017 6:54 pm

Rein Halbersma wrote:
Daniel Shawul wrote:Actually there is no need for this if the eval is linear -- which it commonly is. Compute it through finite difference once and store the Jacobian matrix, after that one doesn't have to call back eval() again. An example implementation is here : https://github.com/dshawul/Scorpio/blob/master/eval.cpp The jacobian takes about 1.5G bytes to store for 400 or so variables. The tuning with the Jacobian is much much faster than without it as one have to just multiply coefficients to compute the eval. I guess I could use SIMD vectorization on CPUs or even GPUs to make computing the eval and its derivative much much faster. Another advantage of this is that you can still use ints in your eval, and use float/double to store the Jacobian. Moreover, the "standard" implementation of autodiff using operator overloading is pretty slow compared to the source-to-source transformation used in fortran and other languages.

Daniel
Finite difference computations of gradients scale linearly in the number of parameters, whereas automatic differentiation is a constant (for both methods, it is also scales as a constant times the forward computation of the loss). There's a reason that all modern machine learning libraries use automatic differentiation.

You misunderstood, the FD gradient calculation is done only once at the the beginning and then you store the gradient for all positions. During optimization you read the eval and its gradient from the stored Jacobian matrix so this is faster than anything you can come up with that calls eval() during optimization -- which the autodiff approach does. I mentioned in the other thread how the jacobian method can be used for some non-linear eval() as well.

In any case, machine learning people are kind of newbies to autodiff. The method has been in use in computational fluid dynamics for optimization of non-linear differential equations for long. The source-to-source autodiff used in fortran is much faster than the c++ operator overloading approach because compilers are not good at all in compiling that kind of code. Even we have to do some optimization to handle a dual (mg,eg) score in an efficient way. Using atleast expression templates for lazy evaluation of expressions is necessary to get anywhere close to the source-to-source transformed code. I recall there is such library that would hopefully bring c++ to the level of fortran when it comes to autodiff. This used to be the case for matrix manupilation as well but now c++ is on the level of fortran thanks to expression templates.

Edit: Something like this library http://www.met.reading.ac.uk/clouds/adept/ for C++.

tool to create derivates of a given function

Re: tool to create derivates of a given function

Re: tool to create derivates of a given function

Re: tool to create derivates of a given function

Re: tool to create derivates of a given function

Re: tool to create derivates of a given function

Re: tool to create derivates of a given function

Re: tool to create derivates of a given function

Re: tool to create derivates of a given function

Re: tool to create derivates of a given function

Re: tool to create derivates of a given function