You can believe that, but...
Apart from that, I still believe that extreme outcomes of the tuning process, like those you have shown several times, are caused by the tuning algorithm (including concepts like using higher and/or variable step sizes) and not by the training data. I always got the best results when using a fixed step size of 1, as in the original Texel tuning. Higher step sizes may lead to faster convergence to "something", but that "something" is not necessarily a stronger result in terms of playing strength, even if it comes with a smaller MSE.
A tuner has a pretty clearly defined job: it must minimize or maximize a fitness function. In our case the task is to minimize the MSE. Translating that into Elo is not part of the job.
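For reference, the fitness in question is the usual Texel objective: the mean squared error between the game results and the eval scores mapped through a sigmoid. A minimal sketch in Python; `evaluate`, `k` and the data layout are illustrative placeholders, not anyone's actual code:

```python
def texel_mse(positions, results, evaluate, k=1.0):
    """Texel tuning fitness: MSE between game results (0, 0.5, 1)
    and the eval score mapped to a win probability."""
    total = 0.0
    for pos, r in zip(positions, results):
        q = evaluate(pos)                            # score in centipawns
        p = 1.0 / (1.0 + 10.0 ** (-k * q / 400.0))   # sigmoid with scaling k
        total += (r - p) ** 2
    return total / len(positions)
```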
So when your tuning algorithm stops with a step size of 1 and another instance reaches a lower MSE with a step size of 5, the latter definitely did a better job.
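To make the step-size point concrete, here is a sketch of a Texel-style local search, assuming a fitness like `texel_mse` above; `step=1` corresponds to the original scheme, and nothing in the loop cares about anything but the MSE:

```python
def tune(params, fitness, step=1, max_passes=1000):
    """Texel-style local search: nudge one parameter at a time by
    +/- step and keep the change whenever the fitness drops."""
    best = fitness(params)
    for _ in range(max_passes):
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                params[i] += delta
                e = fitness(params)
                if e < best:
                    best = e          # keep the improvement
                    improved = True
                    break
                params[i] -= delta    # revert, try the other direction
        if not improved:
            break                     # local minimum at this step size
    return params, best
```

Note that a larger `step` searches on a coarser grid and therefore stops in a different local minimum; the loop itself only ever compares MSE values.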
If you are not happy with the result because it does not satisfy other criteria, like better gameplay or cosmetic aspects like "I have seen such values before", that is not the tuner's problem. As I mentioned in my final conclusion, that is human, just like accepting a local minimum because you like the data better. This distorts the task and devalues the optimization.
My final conclusion was that the characteristics of the data used, and how well the evaluation function can work with them, are critical.
I was able to show some effects that stem from the data (the quiet criterion) and from the difference between a more complex and a naive evaluation function.
As long as there is no bug, the tuner simply reduces the MSE for as long as it can find a better solution. This is the desired behavior, regardless of whether the result satisfies any further purpose or criterion.