Texel 1.07

yorkman
Posts: 105
Joined: Thu Jul 27, 2017 10:59 pm

Re: Texel 1.07

Post by yorkman »

More details (sorry for all this reading):

I wanted to see what would happen if I specified the same host twice instead of using the remote host as the second host. This time I specified only 4 cores per host, but strangely it reported 288 cores. Not sure what this means or whether it's a bug of some sort, but perhaps it may help you?

C:\temp>mpiexec -cores 4 -hosts 2 beast beast c:\temp\texel64cl.exe
info string cores:288 threads:288
mpiexec aborting job...
yorkman
Posts: 105
Joined: Thu Jul 27, 2017 10:59 pm

Re: Texel 1.07

Post by yorkman »

This is with -debug 3 added to mpiexec:

C:\temp>mpiexec -debug 3 -hosts 2 rita suzy c:\temp\texel64cl.exe
[00:9796] host tree:
[00:9796] host: rita, parent: 0, id: 1
[00:9796] host: suzy, parent: 1, id: 2
[00:9796] mpiexec started smpd manager listening on port 49705
[00:9796] using spn msmpi/rita to contact server
[00:9796] rita posting a re-connect to rita:49707 in left child context.
[00:9796] Authentication completed. Successfully obtained Context for Client.
[00:9796] Authorization completed.
[00:9796] version check complete, using PMP version 4.
[00:9796] creating connect command for left node
[00:9796] creating connect command to 'suzy'
[00:9796] posting command SMPD_CONNECT to left child, src=0, dest=1.
[00:9796] host suzy is not connected yet
[00:9796] Handling cmd=SMPD_CONNECT result
[00:9796] cmd=SMPD_CONNECT result will be handled locally
[00:9796] successful connect to suzy.
[00:9796] posting command SMPD_COLLECT to left child, src=0, dest=1.
[00:9796] posting command SMPD_COLLECT to left child, src=0, dest=2.
[00:9796] Handling cmd=SMPD_COLLECT result
[00:9796] cmd=SMPD_COLLECT result will be handled locally
[00:9796] Handling cmd=SMPD_COLLECT result
[00:9796] cmd=SMPD_COLLECT result will be handled locally
[00:9796] Finished collecting hardware summary.
[00:9796] posting command SMPD_STARTDBS to left child, src=0, dest=1.
[00:9796] Handling cmd=SMPD_STARTDBS result
[00:9796] cmd=SMPD_STARTDBS result will be handled locally
[00:9796] start_dbs succeeded, kvs_name: 'c363ba53-b4c2-4fde-a3b6-1bf9eebabff5', domain_name: '4803d140-bf34-4574-ae53-68975b75b7d6'
[00:9796] creating a process group of size 2 on node 0 called c363ba53-b4c2-4fde-a3b6-1bf9eebabff5
[00:9796] launching the processes.
[00:9796] posting command SMPD_LAUNCH to left child, src=0, dest=1.
[00:9796] posting command SMPD_LAUNCH to left child, src=0, dest=2.
[00:9796] Handling cmd=SMPD_LAUNCH result
[00:9796] cmd=SMPD_LAUNCH result will be handled locally
[00:9796] successfully launched process 0
[00:9796] root process launched, starting stdin redirection.
[00:9796] Handling cmd=SMPD_LAUNCH result
[00:9796] cmd=SMPD_LAUNCH result will be handled locally
[00:9796] successfully launched process 1
[00:9796] Authentication completed. Successfully obtained Context for Client.
[00:9796] Authorization completed.
[00:9796] handling command SMPD_INIT src=1 ctx_key=0
[00:9796] init: 0:2:c363ba53-b4c2-4fde-a3b6-1bf9eebabff5
[00:9796] handling command SMPD_INIT src=2 ctx_key=0
[00:9796] init: 1:2:c363ba53-b4c2-4fde-a3b6-1bf9eebabff5

job aborted:
[ranks] message

[0] terminated

[1] fatal error
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(buf=0x00000000002876D0, count=2, MPI_INT, dest=0, tag=0, MPI_COMM_WORLD) failed
[ch3:sock] failed to connnect to remote process c363ba53-b4c2-4fde-a3b6-1bf9eebabff5:0
unable to connect to 192.168.0.11 on port 49715, exhausted all endpoints
unable to connect to 192.168.0.11 on port 49715, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
unable to connect to 192.168.0.11 on port 49715, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)

---- error analysis -----

[1] on suzy
mpi has detected a fatal error and aborted c:\temp\texel64cl.exe

---- error analysis -----
[00:9796] last process exited, tearing down the job tree num_exited=2 num_procs=2.
[00:9796] Handling cmd=SMPD_KILL result
[00:9796] cmd=SMPD_KILL result will be handled locally
[00:9796] handling command SMPD_CLOSED src=1
[00:9796] closed command received from left child.
[00:9796] smpd manager successfully stopped listening.

Not sure what to make of this. With netstat on the remote host I can see that a connection to smpd is successfully established, so I don't know why it fails after that.
yorkman
Posts: 105
Joined: Thu Jul 27, 2017 10:59 pm

Re: Texel 1.07

Post by yorkman »

Finally I figured it out.

For some reason using hostnames didn't work until I added them to the hosts file. I had never added them because I could already ping both hosts by hostname and they resolved to the correct IPs. In any case, it's now working perfectly, albeit only from the command line and Arena.
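
For anyone else who hits this, the entries I mean are just plain hostname-to-IP lines in C:\Windows\System32\drivers\etc\hosts on each machine, something like the following (the addresses here are placeholders; use your own):

Code: Select all

192.168.0.10    rita
192.168.0.11    suzy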

Would love to figure out how to add the cluster engine to Aquarium but it doesn't have an option to add parameters like Arena does. Anyone?
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Texel 1.07

Post by petero2 »

yorkman wrote:Would love to figure out how to add the cluster engine to Aquarium but it doesn't have an option to add parameters like Arena does. Anyone?
You can use the runcmd.exe program that I just created: runcmd.exe.

1. Copy the runcmd.exe program to the directory where texel is located.

2. In the same directory create a text file called runcmd.txt that contains the command to run, in your case something like:

Code: Select all

mpiexec -hosts 2 host1 host2 texel64cl.exe
Make sure the file ends with a newline character.

3. Install runcmd.exe as a UCI engine in Aquarium.
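
A quick way to sanity-check the setup before pointing Aquarium at it (assuming texel and runcmd.exe are in c:\temp as in the commands earlier in this thread) is to start runcmd.exe from a command prompt and type uci; you should get texel's normal UCI identification lines and uciok back, just as when running texel64cl.exe directly:

Code: Select all

cd c:\temp
runcmd.exe
uci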
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Texel 1.07

Post by petero2 »

yorkman wrote:I realize your readme says to use -hosts host1,host2, but this is invalid: the /? output for mpiexec shows the hosts must be separated by spaces, not commas, and you also have to precede them with the number of hosts, so I was getting other errors when using your example (outdated, I assume).
Actually it only says that for the Linux example based on MPICH. For Windows and MS-MPI it says this:

Code: Select all

* Example using MS-MPI and windows:

If there are two computers called host1 and host2 and MS-MPI is installed on
both computers, proceed as follows:

1. On all computers, log in as the same user.
2. On all computers, add firewall exceptions to allow the programs mpiexec and
   smpd (located in C:\Program Files\Microsoft MPI\Bin) to communicate over the
   network.
3. On all computers, start a command prompt and execute:
   smpd -d 0
4. Make sure texel is installed in the same directory on all computers.
5. On the host1 computer, start a command prompt and execute:
   cd /directory/where/texel/is/installed
   mpiexec -hosts 2 host1 host2 texel64cl.exe
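
If you prefer to script step 2 instead of clicking through the firewall UI, something like this from an elevated command prompt on each computer should work (assuming the default MS-MPI install path shown above):

Code: Select all

netsh advfirewall firewall add rule name="MS-MPI mpiexec" dir=in action=allow program="C:\Program Files\Microsoft MPI\Bin\mpiexec.exe" enable=yes
netsh advfirewall firewall add rule name="MS-MPI smpd" dir=in action=allow program="C:\Program Files\Microsoft MPI\Bin\smpd.exe" enable=yes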
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Texel 1.07

Post by petero2 »

yorkman wrote:I wanted to see what would happen if I specified the same host twice instead of using the remote host as the second host. This time I specified only 4 cores per host, but strangely it reported 288 cores. Not sure what this means or whether it's a bug of some sort, but perhaps it may help you?

C:\temp>mpiexec -cores 4 -hosts 2 beast beast c:\temp\texel64cl.exe
info string cores:288 threads:288
mpiexec aborting job...
Actually, what happens when you specify "-cores 4" is that 4 MPI processes are started on each host instead of only one. Texel assumes that only one process per host will be started, and that each texel process should use all available cores on the host by starting the proper number of threads.

In your example you get 8 processes running on the same host and each process concludes that the host has 36 cores, so texel finds 36*8=288 cores, not realizing that each core is found 8 times.

The solution to that problem is to not use the "-cores" option and to not repeat the same host in the list of hosts.
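
In other words, for your two machines the plain invocation from the readme is all that is needed; texel itself decides how many threads to start on each host:

Code: Select all

mpiexec -hosts 2 rita suzy c:\temp\texel64cl.exe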
yorkman
Posts: 105
Joined: Thu Jul 27, 2017 10:59 pm

Re: Texel 1.07

Post by yorkman »

Sorry I didn't respond for so long. I wasn't expecting such support so I didn't bother checking back. I'm pleasantly surprised!

This works perfectly in Aquarium. I thank you very much for this! Will put texel to work as soon as I can free up both computers.
yorkman
Posts: 105
Joined: Thu Jul 27, 2017 10:59 pm

Re: Texel 1.07

Post by yorkman »

One question. My computers differ in available RAM. Is it possible to start texelcl with, say, 8 GB of RAM on pc1 and 32 GB on pc2 in Aquarium (using runcmd.exe)? In Aquarium I can set the hash size easily, but how would I tell pc1 to use 8 GB of RAM instead of 32 GB? Or am I forced to use only 8 GB on both?

mpiexec has a switch for setting the number of cores (-cores <num_cores_per_host>) but not for RAM.
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Texel 1.07

Post by petero2 »

yorkman wrote:One question. My computers differ in available RAM. Is it possible to start texelcl with, say, 8 GB of RAM on pc1 and 32 GB on pc2 in Aquarium (using runcmd.exe)? In Aquarium I can set the hash size easily, but how would I tell pc1 to use 8 GB of RAM instead of 32 GB? Or am I forced to use only 8 GB on both?
If texel cannot allocate the requested amount of memory for the hash table, it tries again with half the size, repeatedly until the allocation succeeds. Therefore it should work to set the hash table size to 32 GB, and if pc1 does not have that much RAM, it will settle for a smaller hash table.

I admit though that I have only tested this feature on Linux with no swap file in use. I don't know if it works on Windows. There is a risk that it allocates too large a hash table and then causes a lot of swapping and extremely low NPS.
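
In other words, the fallback is just a halving retry loop, roughly like this sketch (not texel's actual code):

Code: Select all

// Sketch of the fallback described above: halve the request until it succeeds.
#include <cstddef>
#include <new>

void* allocateHash(size_t& sizeBytes) {
    while (sizeBytes > 0) {
        void* mem = ::operator new(sizeBytes, std::nothrow); // nullptr on failure
        if (mem)
            return mem;
        sizeBytes /= 2; // retry with half the requested size
    }
    return nullptr;
}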
yorkman wrote:mpiexec has a switch for setting the number of cores (-cores <num_cores_per_host>) but not for RAM.
I want to remind you again that using "-cores" is not useful for texel, because texel assumes only one process per host is started by mpiexec. Texel will then create the correct number of threads to fully utilize all cores on each host, assuming you set the "Threads" UCI parameter to the total number of cores available in the cluster.
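
For example, if the cluster has 72 cores in total and you want a 32 GB hash, the GUI would effectively send something like this to the engine (the numbers are placeholders for your own core count and hash size):

Code: Select all

setoption name Threads value 72
setoption name Hash value 32768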
yorkman
Posts: 105
Joined: Thu Jul 27, 2017 10:59 pm

Re: Texel 1.07

Post by yorkman »

Thanks, Peter, for all your help. I'll be sure to test it more seriously when I get both my PCs fully available for a few days. Right now they are on a totally separate network, so it's not possible, but I did test it quickly with two PCs just to see if it works.