Wouldn't it be nice if C++ GPU

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Wouldn't it be nice if C++ GPU

Post by chrisw »

Wouldn't it be nice to have a C++ header file which supported:

model = LoadTrainedModelFromFile(filename); // model and weights, saved in some appropriate format from Python

results = model.predict(inputs); // using GPU
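Something along these lines, say (a hypothetical interface only; none of these names exist in any real library, and the bodies are stubs just so the sketch compiles):

Code: Select all

// nn.hpp -- hypothetical single-header inference API (sketch only).
#pragma once
#include <stdexcept>
#include <string>
#include <vector>

class Model {
public:
    // Load architecture and weights exported from Python in some agreed format.
    static Model LoadTrainedModelFromFile(const std::string& filename) {
        (void)filename;  // a real header would parse the file here
        throw std::runtime_error("not implemented: illustration only");
    }

    // Forward pass, ideally on the GPU: one row of floats per position in,
    // one row of outputs per position out.
    std::vector<std::vector<float>> predict(
            const std::vector<std::vector<float>>& inputs) const {
        return std::vector<std::vector<float>>(inputs.size());  // stub
    }
};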
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Wouldn't it be nice if C++ GPU

Post by Rémi Coulom »

I developed my own home-made C++ deep-learning framework just to be able to do that. I used TensorFlow for a while, but it was too painful to use from C++. What you describe can be done with TensorFlow, but last time I tried, I had to use undocumented/unsupported features of the low-level C++ TensorFlow library, and it was really unpleasant (having to compile the library from source with Bazel, ...).

Maybe other frameworks have better C++ support.

At the moment, I am using some simple C++ classes on top of cuDNN. I don't have autodiff, but manually calculating a gradient is not such a big deal in my opinion. I am thinking about compile-time autodiff with template meta-programming.
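To give an idea of what such classes wrap, here is a minimal sketch of a single convolution forward pass through cuDNN (not the actual code from the post; error checking and workspace/algorithm selection are omitted, and all tensors are assumed to be NCHW float32 already on the device):

Code: Select all

#include <cudnn.h>

// Sketch: one 3x3 convolution forward pass, stride 1, zero padding 1.
// Every cudnn* call returns a status that real code must check.
void conv3x3_forward(cudnnHandle_t h,
                     const float* x, const float* w, float* y,
                     int n, int c_in, int c_out, int height, int width)
{
    cudnnTensorDescriptor_t xd, yd;
    cudnnFilterDescriptor_t wd;
    cudnnConvolutionDescriptor_t cd;

    cudnnCreateTensorDescriptor(&xd);
    cudnnSetTensor4dDescriptor(xd, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, c_in, height, width);
    cudnnCreateFilterDescriptor(&wd);
    cudnnSetFilter4dDescriptor(wd, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               c_out, c_in, 3, 3);
    cudnnCreateConvolutionDescriptor(&cd);
    cudnnSetConvolution2dDescriptor(cd, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
    cudnnCreateTensorDescriptor(&yd);
    cudnnSetTensor4dDescriptor(yd, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, c_out, height, width);  // same H,W with pad 1

    const float alpha = 1.0f, beta = 0.0f;
    // IMPLICIT_GEMM needs no workspace; the faster algorithms usually do.
    cudnnConvolutionForward(h, &alpha, xd, x, wd, w, cd,
                            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
                            nullptr, 0, &beta, yd, y);

    cudnnDestroyTensorDescriptor(xd);
    cudnnDestroyTensorDescriptor(yd);
    cudnnDestroyFilterDescriptor(wd);
    cudnnDestroyConvolutionDescriptor(cd);
}

A handful of such calls, wrapped in small RAII classes, is essentially all a forward-only framework needs.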

If people here want help with how to use TensorFlow from C++ with an Nvidia GPU, I could explain a little how I did it. It is not very difficult to do, but it is not documented.
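In rough outline it looks like the sketch below (reconstructed from memory, so treat it as an approximation: load a frozen GraphDef, create a session, run it; the "input" and "output" node names and the 1x64 shape are placeholders for whatever your model uses):

Code: Select all

#include <memory>
#include <string>
#include <vector>

#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/public/session.h"

// Sketch: load a frozen graph and run one forward pass on it.
tensorflow::Status RunOnce(const std::string& graph_path) {
    tensorflow::GraphDef graph_def;
    TF_RETURN_IF_ERROR(tensorflow::ReadBinaryProto(
        tensorflow::Env::Default(), graph_path, &graph_def));

    std::unique_ptr<tensorflow::Session> session(
        tensorflow::NewSession(tensorflow::SessionOptions()));
    TF_RETURN_IF_ERROR(session->Create(graph_def));

    // One dummy input: a batch of 1 with 64 features.
    tensorflow::Tensor input(tensorflow::DT_FLOAT,
                             tensorflow::TensorShape({1, 64}));
    input.flat<float>().setZero();

    std::vector<tensorflow::Tensor> outputs;
    TF_RETURN_IF_ERROR(session->Run({{"input", input}},
                                    {"output"}, {}, &outputs));
    // outputs[0] now holds the prediction.
    return tensorflow::Status::OK();
}

The hard part is not this code; it is building and linking the library so that these headers and symbols are available at all.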

I applied to the TensorFlow Research Cloud and was accepted. This gives me access to 100 TPUs for one month. After days of trying to use a TPU from C++, I gave up. I am sure it is doable, but there is no documentation at all.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Wouldn't it be nice if C++ GPU

Post by jdart »

Caffe (https://github.com/BVLC/caffe) supports C++; it is apparently the primary language, and Python is a binding. I don't know whether it does quite what you need, though.
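If it does, the C++ side would look roughly like the sketch below (based on the Caffe examples; the file names and the zero-filled input are placeholders for your own model and features):

Code: Select all

#include <cstdio>
#include <caffe/caffe.hpp>

// Sketch: load a trained Caffe model and run one forward pass on the GPU.
int main() {
    caffe::Caffe::set_mode(caffe::Caffe::GPU);

    caffe::Net<float> net("deploy.prototxt", caffe::TEST);
    net.CopyTrainedLayersFrom("model.caffemodel");

    // Fill the input blob, run the network, read the output blob.
    caffe::Blob<float>* input = net.input_blobs()[0];
    float* in = input->mutable_cpu_data();
    for (int i = 0; i < input->count(); ++i) in[i] = 0.0f;

    net.Forward();

    const caffe::Blob<float>* output = net.output_blobs()[0];
    const float* out = output->cpu_data();
    std::printf("%f\n", out[0]);  // first element of the prediction
    return 0;
}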

--Jon
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Wouldn't it be nice if C++ GPU

Post by chrisw »

Daniel has some code here https://github.com/dshawul/egbbdll/blob ... val_nn.cpp

which includes what looks like a model loader for the UFF format:

Code: Select all

void TrtModel::LoadGraph(const string& uff_file_name, int dev_id, int dev_type) {
    std::string dev_name = ((dev_type == GPU) ? "/gpu:" : "/cpu:") + std::to_string(dev_id);
    printf("Loading graph on %s\n",dev_name.c_str());
    fflush(stdout);

    Model::id = dev_id;
    cudaSetDevice(Model::id);

and so on ...
and what looks like a predict function:

Code: Select all

void TrtModel::predict() {

    cudaSetDevice(Model::id);

    context->execute(BATCH_SIZE, buffers.data());

    if(nn_type == DEFAULT || nn_type == SIMPLE) {
        for(int i = 0;i < n_batch;i++) {
            float p = buffers_h[valuei][3*i+0] * 1.0 + buffers_h[valuei][3*i+1] * 0.5;
            scores[i] = logit(p);
            
and so on ...
It seems to need various supporting includes, and it is not exactly easy to work out what is going on. I would guess the CUDA support is ongoing, because getting the predictor into C++ for production apps, and losing the Python requirement at runtime, is a pretty obvious thing for many people (not just in games) to want.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Wouldn't it be nice if C++ GPU

Post by Daniel Shawul »

egbbdll is very easy to use because it was originally designed for probing endgame bitbases.
You could essentially do probe(FEN_string) and get value and policy results.
The user doesn't need to know how and where it is evaluated, but of course it can use both CPU and GPU.
Both TensorFlow and TensorRT are supported, and both can use cuDNN, so of course it can use CUDA too.
Lc0 explicitly wrote CUDA code for its backend, but I am getting equal nps using TensorRT.
Moreover, one can use INT8 and maybe INT4. So writing backend code when there is a plethora of deep-learning
libraries is a futile endeavour IMHO.
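As a caller-side illustration only (the shared-library name and the exact export signature below are assumptions made for the sketch; the real probing code follows further down):

Code: Select all

#include <dlfcn.h>
#include <cstdio>

// Hypothetical sketch of binding an egbbdll-style shared library at runtime.
// Signature assumed from the probe_egbb(player, piece, square) call below.
typedef int (*PROBE_EGBB)(int player, int* piece, int* square);

int main() {
    // Library name is an assumption; adjust for your platform.
    void* lib = dlopen("./egbbso64.so", RTLD_LAZY);
    if (!lib) { std::fprintf(stderr, "could not load egbb library\n"); return 1; }

    PROBE_EGBB probe_egbb = (PROBE_EGBB)dlsym(lib, "probe_egbb");
    if (!probe_egbb) { std::fprintf(stderr, "export not found\n"); return 1; }

    // ... fill piece[] / square[] from your board (as fill_list does below),
    // then: int score = probe_egbb(player, piece, square);

    dlclose(lib);
    return 0;
}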

This is the actual code I use for probing bitbases and the neural network. It has become a little cumbersome
after I added a policy head, but it is still easy to use. You populate your pieces, feed in history info (for lczero nets),
and just probe. The egbbdll takes care of "batching" with a multi-threaded approach, and of caching as well.

Code: Select all

/*
Probe:
Change internal Scorpio board representation to [A1 = 0 ... H8 = 63]
board representation, then probe the bitbase.
*/

void SEARCHER::fill_list(int& count, int* piece, int* square) {
    PLIST current;

#define ADD_PIECE(list,type) {                  \
       current = list;                          \
       while(current) {                         \
          piece[count] = type;                  \
          square[count] = SQ8864(current->sq);  \
          count++;                              \
          current = current->next;              \
       }                                        \
    };
    ADD_PIECE(plist[wking],_WKING);
    ADD_PIECE(plist[bking],_BKING);
    ADD_PIECE(plist[wqueen],_WQUEEN);
    ADD_PIECE(plist[bqueen],_BQUEEN);
    ADD_PIECE(plist[wrook],_WROOK);
    ADD_PIECE(plist[brook],_BROOK);
    ADD_PIECE(plist[wbishop],_WBISHOP);
    ADD_PIECE(plist[bbishop],_BBISHOP);
    ADD_PIECE(plist[wknight],_WKNIGHT);
    ADD_PIECE(plist[bknight],_BKNIGHT);
    ADD_PIECE(plist[wpawn],_WPAWN);
    ADD_PIECE(plist[bpawn],_BPAWN);
    piece[count] = _EMPTY;
    square[count] = SQ8864(epsquare);
    count++;
}

int SEARCHER::probe_bitbases(int& score) {
#ifdef EGBB
    int piece[MAX_PIECES],square[MAX_PIECES],count = 0;
    fill_list(count,piece,square);
    score = probe_egbb(player,piece,square);
    if(score != _NOTFOUND)
        return true;
#endif
    return false;
}

int SEARCHER::probe_neural(bool hard_probe) {
#ifdef EGBB
    UBMP64 hkey = ((player == white) ? hash_key : 
             (hash_key ^ UINT64(0x2bc3964f82352234)));

    int moves[3*MAX_MOVES];
    int *s = moves;
    for(int i = 0; i < pstack->count; i++) {
        MOVE& m = pstack->move_st[i];
        int from = m_from(m), to = m_to(m);
        if(is_castle(m)) {
            if(to > from) to++;
            else to -= 2;
        }
        *s++ = SQ8864(from);
        *s++ = SQ8864(to); 
        *s++ = m_promote(m);
    }
    *s++ = -1;

    nnecalls++;
    if(nn_type == 0) {
        int piece[33],square[33],isdraw[1];
        int count = 0, hist = 1;
        fill_list(count,piece,square);

        return probe_nn(player,castle,fifty,hist,isdraw,piece,square,moves,
            (float*)pstack->score_st,pstack->count,hkey,hard_probe);
    } else {

        int piece[8*33],square[8*33],isdraw[8];
        int count = 0, hist = 0, phply = hply;
        
        for(int i = 0; i < 8; i++) {
            isdraw[hist++] = draw();
            fill_list(count,piece,square);

            if(hply > 0 && hstack[hply - 1].move) 
                POP_MOVE();
            else break;
        }

        count = phply - hply;
        for(int i = 0; i < count; i++)
            PUSH_MOVE(hstack[hply].move);

        if(isdraw[0])
            hkey ^= UINT64(0xc7e9153edee38dcb);
        hkey ^= fifty_hkey[fifty];

        return probe_nn(player,castle,fifty,hist,isdraw,piece,square,moves,
            (float*)pstack->score_st,pstack->count,hkey,hard_probe);
    }
#endif
    return 0;
}

void PROCESSOR::set_num_searchers() {
#ifdef EGBB
    if(SEARCHER::use_nn && set_num_active_searchers) {
        int n_searchers = n_processors - n_idle_processors;
        set_num_active_searchers(n_searchers);
    }
#endif
}
Daniel
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Wouldn't it be nice if C++ GPU

Post by Rein Halbersma »

Rémi Coulom wrote: Thu Apr 25, 2019 1:59 pm I developed my own home-made C++ deep-learning framework just to be able to do that. I used TensorFlow for a while, but it was too painful to use from C++. What you describe can be done with TensorFlow, but last time I tried, I had to use undocumented/unsupported features of the low-level C++ TensorFlow library, and it was really unpleasant (having to compile the library from source with Bazel, ...).

Maybe other frameworks have better C++ support.

At the moment, I am using some simple C++ classes on top of cuDNN. I don't have autodiff, but manually calculating a gradient is not such a big deal in my opinion. I am thinking about compile-time autodiff with template meta-programming.

If people here want help with how to use TensorFlow from C++ with an Nvidia GPU, I could explain a little how I did it. It is not very difficult to do, but it is not documented.

I applied to the TensorFlow Research Cloud and was accepted. This gives me access to 100 TPUs for one month. After days of trying to use a TPU from C++, I gave up. I am sure it is doable, but there is no documentation at all.
LeelaChessZero uses the third-party tensorflow_cc wrapper library around the official TensorFlow C++ API to avoid the Bazel build stuff. See https://github.com/LeelaChessZero/lc0/b ... sorflow.md
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Wouldn't it be nice if C++ GPU

Post by Rémi Coulom »

Rein Halbersma wrote: Thu Apr 25, 2019 6:58 pm LeelaChessZero uses the third-party tensorflow_cc wrapper library around the official TensorFlow C++ API to avoid the Bazel build stuff. See https://github.com/LeelaChessZero/lc0/b ... sorflow.md
Thanks for the link. Bazel is still necessary to build the library itself. This is in fact what I had managed to do by myself. It is still very unpleasant to do.

By the way, has anybody here tried to code a fast convolution in CUDA directly? I will probably try soon. My impression is that the performance of cuDNN is very bad for small batches. Good performance with small batches is important for tree search.
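For concreteness, the naive direct form one would start tuning from might look like this (a baseline sketch: one thread per output element, 3x3 filter, stride 1, zero padding, batch of one; a competitive kernel would tile inputs into shared memory and batch positions together):

Code: Select all

// Naive direct 3x3 convolution over a CHW input, producing a KHW output.
// Weights are laid out [K][C][3][3].
__global__ void conv3x3(const float* x, const float* w, float* y,
                        int C, int K, int H, int W)
{
    int k   = blockIdx.z;                              // output channel
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    if (row >= H || col >= W) return;

    float sum = 0.0f;
    for (int c = 0; c < C; ++c)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int r = row + dy, s = col + dx;
                if (r < 0 || r >= H || s < 0 || s >= W) continue;  // zero pad
                sum += x[(c * H + r) * W + s]
                     * w[((k * C + c) * 3 + (dy + 1)) * 3 + (dx + 1)];
            }
    y[(k * H + row) * W + col] = sum;
}

// Launch sketch (for an 8x8 board one block covers the whole plane):
//   dim3 block(8, 8);
//   dim3 grid((W + 7) / 8, (H + 7) / 8, K);
//   conv3x3<<<grid, block>>>(d_x, d_w, d_y, C, K, H, W);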

Rémi
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Wouldn't it be nice if C++ GPU

Post by smatovic »

Rémi Coulom wrote: Thu Apr 25, 2019 7:18 pm ...
By the way, has anybody here tried to code a fast convolution in CUDA directly? I will probably try soon. My impression is that the performance of cuDNN is very bad for small batches. Good performance with small batches is important for tree search.

Rémi
https://github.com/ankan-ban/ConvTest

--
Srdja
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Wouldn't it be nice if C++ GPU

Post by Rein Halbersma »

Rémi Coulom wrote: Thu Apr 25, 2019 7:18 pm
Rein Halbersma wrote: Thu Apr 25, 2019 6:58 pm LeelaChessZero uses the third-party tensorflow_cc wrapper library around the official TensorFlow C++ API to avoid the Bazel build stuff. See https://github.com/LeelaChessZero/lc0/b ... sorflow.md
Thanks for the link. Bazel is still necessary to build the library itself. This is in fact what I had managed to do by myself. It is still very unpleasant to do.
That's not what tensorflow_cc advertises: https://github.com/FloopCZ/tensorflow_cc
This repository makes possible the usage of the TensorFlow C++ API from the outside of the TensorFlow source code folders and without the use of the Bazel build system.
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Wouldn't it be nice if C++ GPU

Post by Rémi Coulom »

Rein Halbersma wrote: Thu Apr 25, 2019 8:23 pm
Rémi Coulom wrote: Thu Apr 25, 2019 7:18 pm
Rein Halbersma wrote: Thu Apr 25, 2019 6:58 pm LeelaChessZero uses the third-party tensorflow_cc wrapper library around the official TensorFlow C++ API to avoid the Bazel build stuff. See https://github.com/LeelaChessZero/lc0/b ... sorflow.md
Thanks for the link. Bazel is still necessary to build the library itself. This is in fact what I had managed to do by myself. It is still very unpleasant to do.
That's not what tensorflow_cc advertises: https://github.com/FloopCZ/tensorflow_cc
This repository makes possible the usage of the TensorFlow C++ API from the outside of the TensorFlow source code folders and without the use of the Bazel build system.
This means that you don't need Bazel to build your own code, but you do need it to build the TensorFlow library itself:
If you require GPU support on Ubuntu, please also install Bazel
(from https://github.com/FloopCZ/tensorflow_cc)