Iterative deepening and the principal variation

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Iterative deepening and the principal variation

Post by bob »

Ugh, finger error deleted some text. log.001 is normal crafty, log.002 is crafty with the hash move change...
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Iterative deepening and the principal variation

Post by hgm »

bob wrote:I worry because those last 7 plies are not free, and getting the ordering right makes a measurable difference. If you want to observe this, just turn hash move off for the last N plies in your search and measure the difference...

Here's crafty: No hash move in the last 7 plies of the PV. common middlegame position, searched to depth=17, just one cpu for consistency...

log.001: time=27.15 mat=0 n=56278033 fh=92% nps=2.1M
log.002: time=31.15 mat=0 n=59617831 fh=92% nps=1.9M

So, given some real data, would you be too concerned? I am...
This is totally different: your "real data" is taken in a setting where _all_ 7-ply searches from the approximately 50,000 nodes at 10 ply are done without using either the pre-stored hash information or the hash information they build up locally during the search itself. The PV would only help a single one of those 50,000 7-ply searches, namely the one starting from the ply=10 PV node.

Since you seem to want to make a point out of this, note that this is a good example of how an educated guess is far superior to using "real data" without understanding. With an educated guess you might be off by a factor of 10, never by a factor of 50,000...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Iterative deepening and the principal variation

Post by bob »

hgm wrote:
bob wrote:I worry because those last 7 plies are not free, and getting the ordering right makes a measurable difference. If you want to observe this, just turn hash move off for the last N plies in your search and measure the difference...

Here's crafty: No hash move in the last 7 plies of the PV. common middlegame position, searched to depth=17, just one cpu for consistency...

log.001: time=27.15 mat=0 n=56278033 fh=92% nps=2.1M
log.002: time=31.15 mat=0 n=59617831 fh=92% nps=1.9M

So, given some real data, would you be too concerned? I am...
This is totally different: your "real data" is taken in a setting where _all_ 7-ply searches from the approximately 50,000 nodes at 10 ply are done without using either the pre-stored hash information or the hash information they build up locally during the search itself. The PV would only help a single one of those 50,000 7-ply searches, namely the one starting from the ply=10 PV node.

Since you seem to want to make a point out of this, note that this is a good example of how an educated guess is far superior to using "real data" without understanding. With an educated guess you might be off by a factor of 10, never by a factor of 50,000...
Please re-read what I wrote. Let me clip the appropriate part:

"Here's crafty: No hash move in the last 7 plies of the PV."

Do you see that "no hash move in the last 7 plies OF THE PV"???

With an educated guess, you are so far out in left field, you are out of the stadium, and possibly out of the country.

Notice that we are _not_ talking about the last 7 plies of the left-most path through the tree (assuming left-to-right traversal). We are talking about PV nodes anywhere along the PV. In my case, depth 17, with just the last 7 plies not using the hash move, limited to hash moves in the first move at the root only. If you change your mind, this might be more significant...

Also, your comment about the PV is wrong. The PV is not a single branch, as it is being built. We are "finding our way". It ends up being a single branch, but at many PV nodes, we search multiple branches to find the best one, and all of those get stored as exact search results...

Now if you prefer, I could just limit this to the left-most branch, and search a position or two where the thing does not change its mind on the best move. But is there any point? Making sure the N-1 iteration PV is in the table in its entirety costs, at most, the time required to search N nodes, where N is the length of the PV. Do you _really_ want to claim that doing that is going to be as expensive as doing a 7-ply search with no hash move whatsoever???

Surely not. So the idea of making certain the hash moves are there has essentially no cost. For a depth-20 search, that is 20 nodes worth of effort, in a search that might cover 100M nodes in 20 seconds. 20/100,000,000 is pretty near zero. Even if the effect is small, the saving is more than the cost.

But we are not necessarily talking about the last 7 plies, either. That's just a random number chosen by someone (not me). The effect is more pronounced as you get closer to the root and don't find a hash move, yet the cost of making certain the moves are present remains constant at N nodes, where N = length of the PV. Failing to find a hash move at ply 2-3-4-5 is a killer, and preventing it has no measurable cost.
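
To make "stuffing" concrete, here is a minimal sketch; HashStoreMove, MakeMove, UnmakeMove and POSITION are illustrative placeholders, not Crafty's actual routines:

typedef int MOVE;
typedef struct Position POSITION;          /* opaque placeholder type        */
void HashStoreMove(POSITION *, MOVE);      /* assumed: store a best move     */
void MakeMove(POSITION *, MOVE);           /* assumed helpers                */
void UnmakeMove(POSITION *, MOVE);

/* Walk the PV kept from iteration N and make sure each of its moves is
   present in the hash table before iteration N+1 starts.  Total work is
   N store/make/unmake steps, where N is the PV length. */
void StuffPV(POSITION *pos, const MOVE *pv, int pv_length) {
  int ply;
  for (ply = 0; ply < pv_length; ply++) {
    HashStoreMove(pos, pv[ply]);           /* guarantee a hash move here */
    MakeMove(pos, pv[ply]);                /* step down the PV           */
  }
  for (ply = pv_length - 1; ply >= 0; ply--)
    UnmakeMove(pos, pv[ply]);              /* back to the root position  */
}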

hgm wrote:In fact you could wonder if the overhead of carrying around a PV (e.g. in a triangular array) during the entire search would not be more time consuming than just redoing this part of the search. Stuffing the PV back into the hash table might take 'zero' time, but making sure you have the PV available might take orders of magnitude more time. Might still be negligible, of course, but it requires extra programming work compared to an alternative that also takes negligible time, so why do it?
So the effort to do this is, in fact, negligible, and it is dwarfed by the effort to search a sub-optimal 7-ply tree (and again, 7 was your number, not mine... in a very deep search, 7 is way understated). Maintaining the PV array also has no significant cost. You only fool with it when you back up an actual score. Using PVS, this hardly ever happens. And when you do back things up, you only back up a small part of the "triangular array", further reducing the cost.

I once tried the hash-only approach and did not see any measurable difference in speed. And I did find those bogus PV moves that happen when the same position is searched to two different depths, with the latter overwriting the first and providing a move that was not on the original PV path. You follow the PV, and then try to apply Evaluate() and the score will not match. I didn't want to deal with that when debugging, and the current approach doesn't have that problem...
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Iterative deepening and the principal variation

Post by hgm »

bob wrote:Notice that we are _not_ talking about the last 7 plies of the left-most path through the tree (assuming left-to-right traversal). We are talking about PV nodes anywhere along the PV.
Not true, this is exactly what we were talking about. As you only stuff that single branch back into the hash. It is the only part of the PV that you have. So what you are doing here when obtaining the timings is totally irrelevant for the question how much time you would gain by stuffing the last 7 ply of the PV back into the hash table.

If you want to contribute something meaningful, then give us the difference between a case where you did stuff the last 7 moves of a 17-ply PV back into the hash before an 18-ply search, and one where you intentionally cleared those 7 entries of the hash table. But not one where you impair usage of the hash table all over the place!

The 7, by the way, was just an example in the context of a 20-ply search; how much of the PV you are likely to not lose depends on the size of the depth-preferred part of your table and the effective branching ratio B. With B=3 you would have on the order of 2M nodes after 13 ply, which seems to get in the right ballpark for a hash-table size. This is why I assumed you would only be able to recover the first 13 ply or so from the hash table, after the search. For a 17-ply search, you are likely to lose only the last 4 ply. There are simply not enough nodes searched to a larger depth to push the PV from the hash table. So using hash misses at ply = 4, 5, 6 as an example of how much time you gain is totally beside the point. You won't have lost those entries, and stuffing them back is going to gain you nothing, as they were already there.
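
For what it is worth, the arithmetic behind that "13 ply or so" can be written down in a couple of lines (my own back-of-the-envelope sketch, with assumed numbers):

#include <math.h>
#include <stdio.h>

int main(void) {
  double B = 3.0;         /* assumed effective branching ratio          */
  double entries = 2e6;   /* assumed size of the depth-preferred table  */
  /* the tree has about B^d nodes at depth d, so the deepest ply whose
     nodes still all fit in the table is roughly log(entries)/log(B) */
  printf("PV plies likely to survive: ~%.0f\n", log(entries) / log(B));
  return 0;               /* prints ~13 for these numbers               */
}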
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Iterative deepening and the principal variation

Post by bob »

hgm wrote:
bob wrote:Notice that we are _not_ talking about the last 7 plies of the left-most path through the tree (assuming left-to-right traversal). We are talking about PV nodes anywhere along the PV.
Not true, this is exactly what we were talking about. As you only stuff that single branch back into the hash. It is the only part of the PV that you have. So what you are doing here when obtaining the timings is totally irrelevant for the question how much time you would gain by stuffing the last 7 ply of the PV back into the hash table.

If you want to contribute something meaningful, then give us the difference between a case where you did stuff the last 7 moves of a 17-ply PV back into the hash before an 18-ply search, and one where you intentionally cleared those 7 entries of the hash table. But not one where you impair usage of the hash table all over the place!

The 7, by the way, was just an example in the context of a 20-ply search; how much of the PV you are likely to not lose depends on the size of the depth-preferred part of your table and the effective branching ratio B. With B=3 you would have on the order of 2M nodes after 13 ply, which seems to get in the right ballpark for a hash-table size. This is why I assumed you would only be able to recover the first 13 ply or so from the hash table, after the search. For a 17-ply search, you are likely to lose only the last 4 ply. There are simply not enough nodes searched to a larger depth to push the PV from the hash table. So using hash misses at ply = 4, 5, 6 as an example of how much time you gain is totally beside the point. You won't have lost those entries, and stuffing them back is going to gain you nothing, as they were already there.
OK, how about this: I ran the same position, except that I zeroed the last 7 hash moves of the PV from iteration N before starting iteration N+1. Here are the two results; the first is the original that stuffs everything, the second is the new one that stuffs everything but zeros the last 7 plies. And this is sub-optimal, because a couple of those plies are in the q-search, where I don't even probe the hash...

log.001: time=28.86 mat=0 n=56278033 fh=92% nps=2.0M
log.002: time=30.21 mat=0 n=62738153 fh=92% nps=2.1M

So in that search to depth 17, it took an extra 6 million nodes. Does stuffing the PV seem worth it now? Do my previous numbers now seem to be inaccurate? Again, notice I said I simply did not use the hash move for PV nodes. A PV node is defined as any node where alpha != beta - 1, which is just the left-hand edge of the tree for PVS, for the most part. Once any alpha or beta value was changed from its original value, I no longer considered that a PV node and still used the hash move normally.
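
Spelled out as code, that test looks roughly like this (an illustrative sketch of the rule just described, not the actual implementation):

/* A node counts as a PV node for this experiment when it was searched
   with a full-width window and neither bound has moved from its value
   on entry to the node. */
int IsPVNode(int alpha, int beta, int entry_alpha, int entry_beta) {
  return alpha != beta - 1 &&      /* not a null-window (scout) search */
         alpha == entry_alpha &&   /* alpha not yet raised here        */
         beta  == entry_beta;      /* beta not yet lowered here        */
}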

Here is another, better test... I just didn't store the old PV moves at all. If they survived from iteration N to N+1, I used them, period. But I bypassed the PV installation code.

log.001: time=28.30 mat=0 n=56278033 fh=92% nps=2.0M
log.002: time=28.96 mat=0 n=59271036 fh=92% nps=2.0M

3 million nodes. Almost 1.5 seconds out of 30 at 2M nps. Over a 3% improvement, at a .000001% cost. Do you think it is better or worse? Of course, you _could_ have run this yourself to produce this data, correct? BTW the above is a best case: huge hash table, short search. So it can't get any better than that for the non-storing code.

Any other comments or wild speculations that will also turn out to be wrong, now???

The idea _does_ make good sense. Of course many guessed that since I am not one to put pointless code in my program, there is usually a reason behind every such idea. And usually that reason is "speed/performance/accuracy"...

Jeez, do these discussions go sour quickly.
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Iterative deepening and the principal variation

Post by hgm »

Well, to address your last "meta comment" first: I think they only go sour because you choose to make them go sour. I only asked a question, out of genuine interest. Apparently you take great offense at the fact that I did not do this measurement myself, feeling the need to press the "real data" point, and to reproach me in your last post that I could have run that test myself. Well, this happens not to be the case (at least not easily), as none of my engines uses the triangular method to save the PV, and I thus also do not have anything to stuff back into the hash. So it would not be a matter of simply bypassing an existing piece of code, but would require a lot of de-novo programming. Furthermore, no one _forces_ you to produce the data; you chose to do that of your own free will. Why do it if it spoils your mood?

To come back to the issue: it seems you are declaring victory a little hastily, and rather emphatically. Note that my "wild guess" very explicitly mentioned the last 7 plies of a 20-ply search. You now present "real data" for the last 7 plies of a 17-ply search, and use the outcome as "proof" that my statement (question, actually) about the 20-ply search was wildly off? Do you really think that a 20-ply and a 17-ply search from the same position take equal time? (Or alternatively, that the uninformed 4-ply search you would have at the end of an un-stuffed 17-ply PV would take as long as the uninformed 7-ply search that you measured?)

I can also note that you continue to compare the added search time to the time it takes to stuff the PV back into the hash, while my question clearly related to the effort of maintaining the triangular array, not the stuffing. You argued that PV changes do not occur often, but my "wild guess" is that they occur orders of magnitude more often than just accessing N nodes for the stuffing. (You did not give us any quantitative data on that.) So in the end, a more accurate statement than "if I (hgm) have more wrong speculations" would be that you tried to mislead us by presenting irrelevant data, and used it explicitly to make me look bad... Can you imagine that such discussion tactics involve a high risk of making the discussion go sour? :roll:

I fully stand by my statement that having real data on an irrelevant case is worse than having a _relevant_ theoretical estimate.

As for your actual numbers, I must admit that I think they are surprisingly high: with ID (or IID) the difference between an uninformed N-ply search and an informed N-ply search should only be the duration of the (N-2)-ply uninformed search needed to inform it (assuming you deepen in steps of 2). And even with an effective branching ratio of 3 that should not amount to more than 10% overhead. And that overhead just in a 7-ply search out of a 17-ply search should not amount to much, as usually 7-ply searches take far less time than 17-ply searches (yes, indeed, again a very wild guess..). So your results remain puzzling to me. Perhaps I should make the effort of doing such a measurement myself. One day.

The most relevant measurement is the last one, where you simply do not stuff the PV back into the hash, but just use what is left in it. (This is exactly what the question was about.) This matters 0.66 sec out of 29, slightly over 2%. (We once addressed the issue of timing noise here, but I forgot how large your timing noise typically is...)

But I can easily imagine that a naive implementation of the triangular-array method would eat up that time gain. E.g. when you would copy the best continuation of every move that fails high to the next level. So you would need to limit this updating to, say, PVS re-searches to make it competitive. And it is not clear to me how you should do it in plain alpha-beta. So a caveat seems justified.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Iterative deepening and the principal variation

Post by bob »

hgm wrote:Well, to address your last "meta comment" first: I think they only go sour because you choose to make them go sour. I only asked a question, out of genuine interest. Apparently you take great offense at the fact that I did not do this measurement myself, feeling the need to press the "real data" point, and to reproach me in your last post that I could have run that test myself. Well, this happens not to be the case (at least not easily), as none of my engines uses the triangular method to save the PV, and I thus also do not have anything to stuff back into the hash. So it would not be a matter of simply bypassing an existing piece of code, but would require a lot of de-novo programming. Furthermore, no one _forces_ you to produce the data; you chose to do that of your own free will. Why do it if it spoils your mood?

To come back to the issue: it seems you are declaring victory a little hastily, and rather emphatically. Note that my "wild guess" very explicitly mentioned the last 7 plies of a 20-ply search. You now present "real data" for the last 7 plies of a 17-ply search, and use the outcome as "proof" that my statement (question, actually) about the 20-ply search was wildly off? Do you really think that a 20-ply and a 17-ply search from the same position take equal time? (Or alternatively, that the uninformed 4-ply search you would have at the end of an un-stuffed 17-ply PV would take as long as the uninformed 7-ply search that you measured?)

First, it is not my job to run all these tests to prove to you that what I am doing is the most efficient way to address this issue. If you want to measure the cost, feel free. I've already done it. And reported on it years ago. The cost for maintaining the PV is tiny. The PV only gets backed up on non-fail-high/non-fail-low nodes, which within a PVS framework is _extremely_ rare.

It is not _that_ hard to remove the PV array from crafty. And it already has code to extract the PV from the hash table anyway, since some PVs get truncated due to EXACT hash hits and I want to capture as much as possible. But I have already done this test, back when we had a discussion on r.g.c.c years ago. If you want to continue to believe that it is a waste, feel free. I am going to maintain the PV array because I want a _correct_ PV, not an "approximation" of the PV, because it simplifies debugging. So I do that regardless of whether the stuff-the-PV code is used or not. And, in fact, _most_ programs back the PV up rather than use the hash table, for the same reason. In light of that, the PV array is not going away, so as far as I am concerned, it has nothing to do with the question being asked.

I can also note that you continue to compare the added search time to the time it takes to stuff the PV back into the hash, while my question clearly related to the effort of maintaining the triangular array, not the stuffing. You argued that PV changes do not occur often, but my "wild guess" is that they occur orders of magnitude more often than just accessing N nodes for the stuffing. (You did not give us any quantitative data on that.) So in the end, a more accurate statement than "if I (hgm) have more wrong speculations" would be that you tried to mislead us by presenting irrelevant data, and used it explicitly to make me look bad... Can you imagine that such discussion tactics involve a high risk of making the discussion go sour? :roll:

Yes, particularly when you insist on having _me_ run your tests for you, and then demand that I remove bits and pieces of my search that I would keep whether I stuffed the PV into the hash or not. Write your code the way you want. But until you actually _measure_ both ways, why not be quiet about how you "think" the approach you don't use actually performs? I _have_ done it both ways. I _know_ the overhead (or lack thereof) of maintaining the PV array. And I am not going to remove it, because I would use that method of keeping the PV whether I used the stuffing algorithm or not. I want the _exact_ path of moves that leads to the position that produced the score that gets backed up. With a hash table you rarely get that.

I'm not trying to mislead _anybody_. That seems to be your area of expertise. You make statements, I produce contradictory data, you then start wanting me to remove this, add that, to turn it into an inefficient idea.

To refresh your memory, here was the _original_ question:

Or have you found that feeding the PV into the next iteration is generally no better than just feeding the root?
He asked a simple question: is stuffing the PV no better than just stuffing the first move? I answered that question quickly and accurately, and after you raised nonsensical issues, I gave data supporting my statement. You said it was inaccurate and overstated the gain. I ran the test in another way, which showed I did _not_ overstate the gain. Now you want me to remove an important part of my code because you don't have something similar. Who cares how you do it, other than yourself?

But answer the questions posed here, not the questions you want to answer, regardless of what the poster actually asked...


I fully stand by my statement that having real data on an irrelevant case is worse than having a _relevant_ theoretical estimate.
I happen to agree with you. We just don't agree on what is an irrelevant case. I turned off stuffing the PV and showed what happened. It is not major. But it is not tiny. And at times it is quite large.

As for your actual numbers, I must admit that I think they are surprisingly high: with ID (or IID) the difference between an uninformed N-ply search and an informed N-ply search should only be the duration of the (N-2)-ply uninformed search needed to inform it (assuming you deepen in steps of 2).
Correct, I go by "twosies". But you are overlooking the simple math. A depth-2 search is significant if depth is large; a 5-ply search is non-trivial. And along the PV, a depth-2 IID search is _less_ accurate than the PV that came from the prior iteration's depth-1 search. At zero cost to boot...
And even with an effective branching ratio of 3 that should not amount to more than 10% overhead. And that overhead just in a 7-ply search out of a 17-ply search should not amount to much, as usually 7-ply searches take far less time than 17-ply searches (yes, indeed, again a very wild guess..). So your results remain puzzling to me. Perhaps I should make the effort of doing such a measurement myself. One day.

It seems you are overlooking the nature of the tree. The evaluation of one node is inexpensive, but applied to many nodes it becomes a major part of the search. Those (7-2)-ply IID searches happen in a bunch of places...
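
For anyone following along, this is the kind of IID step being discussed; a generic sketch with assumed helper names (Search, HashBestMove), not any particular engine's API:

typedef struct Position POSITION;   /* opaque placeholder type */
#define NO_MOVE 0
int Search(POSITION *, int alpha, int beta, int depth, int ply);  /* assumed */
int HashBestMove(POSITION *);                                     /* assumed */

/* Internal iterative deepening: when no hash move is available at a node
   with enough remaining depth, run a reduced-depth search from the same
   position purely to obtain a move to try first. */
int OrderingMove(POSITION *pos, int alpha, int beta, int depth, int ply) {
  int move = HashBestMove(pos);
  if (move == NO_MOVE && depth >= 4) {
    Search(pos, alpha, beta, depth - 2, ply);  /* the "depth-2" pilot search  */
    move = HashBestMove(pos);                  /* its best move is now stored */
  }
  return move;                                 /* may still be NO_MOVE        */
}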

The most relevant measurement is the last one, where you simply do not stuff the PV back into the hash, but just use what is left in it. (This is exactly what the question was about.) This matters 0.66 sec out of 29, slightly over 2%. (We once addressed the issue of timing noise here, but I forgot how large your timing noise typically is...)
You are looking at the wrong data. Look at the nodes searched. I don't take the time to do a profile-guided optimization compile when I make test changes like this. So the times are not comparable. But the nodes searched are, and 3M nodes is 1.5 secs on my core-2 duo laptop, using one processor on that particular position.

But I can easily imagine that a naive implementation of the triangular-array method would eat up that time gain. E.g. when you would copy the best continuation of every move that fails high to the next level. So you would need to limit this updating to, say, PVS re-searches to make it competitive. And it is not clear to me how you should do it in plain alpha-beta. So a caveat seems justified.
Why would you copy a PV for a fail-high in the first place? The move at the current ply (where the fail-high occurs) is the only move you have. There is no best move at the next ply, so there is no PV to copy. That is why the cost of maintaining this array is so tiny: it just hardly ever gets copied...
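
As an illustration of why that copy is so rare, here is a generic sketch of the triangular-array update (illustrative code, not Crafty's):

#define MAX_PLY 64
typedef int MOVE;

MOVE pv[MAX_PLY][MAX_PLY];   /* pv[ply] = best line found from ply onward          */
int  pv_length[MAX_PLY];     /* each node sets pv_length[ply] = ply on entry (not shown) */

/* Called only when a move raises alpha without failing high, i.e. at the
   rare "exact" nodes.  A fail-high has no completed line below the cut
   move, so this is simply never called there, keeping the bookkeeping cheap. */
void UpdatePV(int ply, MOVE best) {
  int i;
  pv[ply][ply] = best;                          /* best move at this ply   */
  for (i = ply + 1; i < pv_length[ply + 1]; i++)
    pv[ply][i] = pv[ply + 1][i];                /* splice in the deeper PV */
  pv_length[ply] = pv_length[ply + 1];
}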
Cardoso
Posts: 362
Joined: Thu Mar 16, 2006 7:39 pm
Location: Portugal
Full name: Alvaro Cardoso

Re: Iterative deepening and the principal variation

Post by Cardoso »

If anyone is interested, I once removed the triangular array and used two vectors.
Let's call them temppv[] and mainpv[].
IIRC on a pv node I did:
temppv[ply]=bestmove;
for (i=ply;i<pvlength;i++) mainpv[i]=temppv[i];
or something like that.
All of this to say I got exactly the same PVs.
As I didn't see any significant speed improvements, I forgot about the whole thing and kept the triangular array.
I must admit I kept the triangular array also because at the time most programmers used it. But I surely would have switched to the two vectors if they had been faster.
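
Reconstructed from memory, the whole thing was roughly like this (just a sketch, the details may have differed):

#define MAX_PLY 64
typedef int MOVE;

MOVE temppv[MAX_PLY];   /* best move found so far at each ply        */
MOVE mainpv[MAX_PLY];   /* the line currently believed to be the PV  */
int  pvlength;          /* number of plies in the line just searched */

/* called at a pv node when a move improves alpha without failing high */
void SavePV(int ply, MOVE bestmove) {
  int i;
  temppv[ply] = bestmove;
  for (i = ply; i < pvlength; i++)
    mainpv[i] = temppv[i];        /* copy the tail into the main PV */
}
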
(Sorry for the bad English, I'm not a native English speaker.)

Best regards,
Alvaro
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Iterative deepening and the principal variation

Post by bob »

Dann Corbit wrote:
bob wrote:
wgarvin wrote:
AndrewShort wrote:Well, I do have a hash table, and I am sorting the hash move 1st. It's just that when I'm sorting the principal variation first, the hash move (if different) ends up 2nd, not 1st (by design). I only sort the principal variation once, so it affects very few sorts. The rest of the sorts use the hash best move on top.

in effect what I'm seeing is that feeding the PV into the next iteration is not much better than feeding just the root move, I guess because my sorting (with hash move, etc) is already pretty good.

is that what you meant?
He probably meant that at the end of the iteration, most of the PV nodes are still in the hash table, so the hash moves from those nodes give you almost as much info as feeding an entire PV into it does...

I think some programs don't represent the PV anywhere other than the hash table? Should a hash replacement scheme prefer not to replace PV nodes?
The problem is, you can't recognize a PV node...
There are some programs that reconstruct the pv purely from the hash (admittedly, this does not work perfectly). I have also seen programs that tag one of the flag bits if it is a pv node (but this is also problematic because pv nodes can change during the search). However, on average the best nodes from the transposition table should closely approximate the pv nodes.
If they survive, which is the problem I was addressing when I wrote the code to stuff the PV back into the transposition table between iterations.

Retrieving the PV from the hash is OK, and you are correct, in that if you get something from the hash, it will usually be the best move. But here is the problem:

(1) I want to be able (for debugging) to follow a PV to the end, and then when I call Evaluate() I get _exactly_ the same score that the PV produced during the search. Often this does not happen, because I search the same position to different depths in the tree, which can produce different scores. The PV contains the score from the position searched to a shallower depth, while the hash table contains the result produced when searching that same position to a deeper depth somewhere else. Hard to debug.

(2) Even with a depth-preferred table, you can search the same position on the PV, and then elsewhere to a different (deeper) depth, and if that position is a "fail-low" the second time around, then you do not get a best move stored and the PV move is lost.

In short, the hash table move is accurate when it is available, but it is not as useful for debugging, and it might be missing at a critical point, as the search data I posted shows.
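
For completeness, the hash-walk itself is simple enough; a rough sketch with placeholder names (HashBestMove, MakeMove, UnmakeMove are assumed helpers, not Crafty's actual routines):

typedef int MOVE;
typedef struct Position POSITION;   /* opaque placeholder type */
#define NO_MOVE 0
MOVE HashBestMove(POSITION *);
void MakeMove(POSITION *, MOVE);
void UnmakeMove(POSITION *, MOVE);

/* Rebuild a "PV" after the search by following hash-table best moves from
   the root.  The length cap matters: once entries have been overwritten,
   a hash-built PV can stop short or wander onto moves that were never on
   the real PV, which is exactly the debugging problem described above. */
int ExtractHashPV(POSITION *pos, MOVE *pv, int max_len) {
  int n = 0, i;
  while (n < max_len) {
    MOVE m = HashBestMove(pos);
    if (m == NO_MOVE) break;        /* entry lost or never stored */
    pv[n++] = m;
    MakeMove(pos, m);
  }
  for (i = n - 1; i >= 0; i--)
    UnmakeMove(pos, pv[i]);         /* restore the root position  */
  return n;
}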
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Iterative deepening and the principal variation

Post by hgm »

bob wrote:First, it is not my job to run all these tests to prove to you that what I am doing is the most efficient way to address this issue. If you want to measure the cost, feel free. I've already done it. And reported on it years ago. The cost for maintaining the PV is tiny. The PV only gets backed up on non-fail-high/non-fail-low nodes, which within a PVS framework is _extremely_ rare.

....

But answer the questions posed here, not the questions you want to answer, regardless of what the poster actually asked...
I can't understand why this got you in such a foul mood. I only asked a question, which was closely related to the original question. Now I am not supposed to do that unless I already know the answer? I would think that this is exactly what forums and questions are for: to probe for knowledge you don't have... And when I did ask it, I only asked it because I had the impression that you would be able to answer it offhand. (You know, "nothing in Crafty is ever done just for the heck of it...".) You know everything, you did it all before, but if someone asks for it you apparently feel compelled to remeasure it from scratch, and then start blaming the asker for engaging you in slave labor...
You are looking at the wrong data. Look at the nodes searched. I don't take the time to do a profile-guided optimization compile when I make test changes like this. So the times are not comparable. But the nodes searched are, and 3M nodes is 1.5 secs on my core-2 duo laptop, using one processor on that particular position.
OK, I did not realize that. I thought the extra nodes were a lot faster (e.g. because most of them are IID nodes, which all give cache hits on hash probing). So in fact stuffing the PV saves you 3M nodes out of 59M, which is 5%.

I still think this is a very surprising result, one that I would like to understand. The reasoning behind my original question seems sound and simple enough:

An alpha-beta tree of depth N with perfect move ordering and branching ratio B has B^(N/2) nodes (give or take a factor of 2), while the absolute worst case of reverse move ordering needs B^N nodes (same as the unpruned minimax tree). So at the very worst (move ordering far worse than random), the cost of the un-informed search of a 7-ply subtree at the end of the PV branch of the previous iteration will be as large as that of a 14-ply perfectly informed search. And compared to the 20-ply search, even the cost of a 14-ply search should be quite negligible.
Even with an effective branching ratio as low as 3, the 6 extra ply (20-14) would make the search 3^6 = 729 times bigger, so that the uninformed 7-ply search at the end of the first 13 ply of the previous iteration's PV could only amount to 0.13% of the search time (node count). And that is only if the move ordering in that un-informed 7-ply search was total crap; if it was merely random, you would on average find the cut-move after searching half the number of moves, saving you a factor of 2 in every all-node (so a factor of 8 in a 7-ply search). Thus 0.02% seems a reasonable _upper limit_ on what you can save compared to an un-informed depth-first search with random move ordering. And IID should already harvest most of that saving (and even if you didn't do IID, your move ordering is likely better than random...).

Do you see any obvious logical flaws in the argument above? (Note that I am not asking you to do any programming, just some thinking!) That you observe 5% where <<0.02% is expected does point to a serious lack of understanding of the underlying process. So it might be worthwhile to figure out what is causing the discrepancy.