Ha, finally my registration could be processed manually, as all automatic
procedures consistently failed. So this thread is now also open to me for
posting.
Let me start with some remarks on the ongoing discussion.
* I tried Reinhard's 4A vs 8N setup. In a 100-game match of 40/1' games
with Joker80, the Knights are crushed by the Archbishops 80-20. So
although in principle I agree with Reinhard that such extreme tests, with
setups that make the environment for the pieces very alien compared to
normal Chess, could be unreliable, I certainly would not take for granted
his claim that 8 Knights beat 4 Archbishops.
Possible reasons for the discrepancy could be:
1) Reinhard did not base his conclusion on enough games. In my experience
using anything less than 100 games is equivalent to making the decision by
throwing dice. It often happens that the side leading with 60% after 30
games eventually loses with a 45% score.
2) Smirf does not handle the Archbishop well, because it is programmed to
underestimate its value, and is prepared to trade it too easily for two
Knights to avoid or postpone a Pawn loss, while Joker80 just gives the
Pawn and saves its Archbishops until it can get 3 Knights for one.
3) The shorter time control restricts search depth so much that Joker80
cannot recognize some higher, unnatural strategy (one with no parallel in
normal Chess) in which all Knights can be kept defending each other
multiple times, because they all have identical moves. It therefore
judges the pieces more on the tactical merits that would be relevant in
normal Chess.
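
Point 1) can be made concrete with a rough estimate of the statistical
error on a match score. The sketch below assumes independent games scored
1, 1/2 or 0; the function name and the draw-rate parameter are
illustrative assumptions, not anything taken from Joker80 or Smirf.

```python
import math

def score_std_error(n_games, draw_rate=0.0, expected_score=0.5):
    """Standard error of the average score per game over n_games.

    Assumes independent games with outcomes 1 (win), 0.5 (draw), 0 (loss).
    """
    win = expected_score - draw_rate / 2.0
    loss = 1.0 - draw_rate - win
    if win < 0 or loss < 0:
        raise ValueError("draw_rate is inconsistent with expected_score")
    # Variance of one game: E[x^2] - E[x]^2, with x^2 = 1 for a win
    # and 0.25 for a draw.
    variance = win + 0.25 * draw_rate - expected_score ** 2
    return math.sqrt(variance / n_games)

for n in (30, 100, 400):
    print(n, round(100 * score_std_error(n), 1), "percentage points")
```

Under these assumptions, and with no draws, one standard error is about 9
percentage points after 30 games and 5 points after 100 games, so a 60%
score after 30 games is only about one standard error away from 50%.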
* The arguments Reinhard gives against more realistic 'asymmetrical
playtesting':
| Let me point to a repeatedly written detail: if a piece will be
| captured, then not only its average piece exchange value is taken
| from the material balance, but also its positional influence from
| the final detail evaluation. Thus it is impossible to create
| 'balanced' different armies by simply manipulating their pure material
| balance to become nearly equal - their positional influences probably
| would not be balanced as need be.
seem invalid. For one, all of us are good enough Chess players that we can
recognize for ourselves, in the initial setup we use for playtesting, if the
Archbishop or Knight or whatever piece is part of the imbalance is an
exceptionally strong or poor one, or just an average one. So we don't put
a white Knight on e5 defended by Pf4 when the black d- and f-pawns have
already passed it, and we don't put it on a1 with white pawns on b3, c2 and
black pawns on b4, c3. In particular, I always test from opening positions,
where none of the pieces is on a particularly good square, but they can be
easily developed, as the opponent does not interdict access to any of the
good squares either. So after a few opening moves, the pieces get to
places that, almost by definition, are the average of where you can get
them.
Secondly, when setting up the position, we get the engine's evaluation of
that position, which tells us if the engine considers one of the sides
highly favored positionally (take the difference between the engine
evaluation and the material difference according to the piece values we
know the engine is using). Although I would trust this less than my own
judgement, it can be used as additional confirmation.
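
The check described here is just a subtraction, sketched below. The piece
values in the table are made-up centipawn numbers for illustration only
(they are not the values Joker80 or any other engine uses), and the
function names are hypothetical.

```python
# Illustrative centipawn values only; NOT the values of any real engine.
EXAMPLE_VALUES_CP = {"P": 100, "N": 300, "B": 325, "R": 500,
                     "A": 850, "C": 900, "Q": 950}

def material_balance_cp(white, black, values=EXAMPLE_VALUES_CP):
    """Material difference (white minus black) in centipawns.

    white and black are iterables of piece letters, e.g. "A" or "RN".
    """
    return sum(values[p] for p in white) - sum(values[p] for p in black)

def positional_part_cp(engine_eval_cp, white, black, values=EXAMPLE_VALUES_CP):
    """Part of the engine evaluation not explained by material."""
    return engine_eval_cp - material_balance_cp(white, black, values)

# Only the pieces that differ between the two sides need to be listed.
print(positional_part_cp(65, white="A", black="RN"))
```

A result close to zero suggests the engine sees no large positional bias
in the starting position; a large value would be a reason to pick a
different setup.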
Like Derek says, averaging over many positions (like I always do: all my
matches are played starting from 432 different CRC opening positions) will
tend to have every piece, on average, in an average position. If a
certain piece, like A, would always have a +200cP 'positional'
contribution (e.g. calculated as its contribution to mobility) no matter
where you put it, then that contribution is not positional at all, but a
hidden part of the piece value. Positional contributions should average to
zero when averaged over all plausible positions. Furthermore, in Chess
positional contributions are usually small compared to material ones, if
they do not have to do with King safety or advanced passers. And neither
of the latter plays a role in the opening positions I use.
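
The point about a constant 'positional' bonus being hidden piece value can
be written out as a small decomposition. This is a sketch with invented
numbers; the function name and the sample terms are assumptions made for
illustration.

```python
from statistics import mean

def split_piece_value(base_value_cp, positional_terms_cp):
    """Fold the average positional term into the piece value.

    Returns (effective_value, residuals), where the residuals are the
    truly positional parts and average to zero by construction.
    """
    offset = mean(positional_terms_cp)
    effective_value = base_value_cp + offset
    residuals = [t - offset for t in positional_terms_cp]
    return effective_value, residuals

# Hypothetical: a piece with nominal value 650 that gets a mobility bonus
# of around +200 cP in every position sampled.
value, residuals = split_piece_value(650, [150, 200, 250])
print(value, residuals)
```

With these numbers the piece effectively plays as an 850 cP piece, and
only the remaining spread of -50 to +50 cP is really positional.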
* Symmetrical playtesting between engines with different piece-value sets
is a notoriously unreliable method. Dozens of people have
reported trying it, often with quite advanced algorithms to step through
search space (e.g. genetic algorithms, or annealing). The result was
always the same: in the end (sometimes after months of testing) they
obtained piece values that, when pitted against the original hand-tuned
values, would consistently lose.
The reason is most likely that the method works in principle, but requires
too many games in practice. Derek mentioned before that if two engines
value certain piece combinations differently, they often exchange them for
each other, creating a material imbalance, which then affects their winning
chances. Well, 'often' is not the same as 'always'. For very large
errors, like putting AR the
undervaluation of A only can lead to much more complicated bad trades, as
you have to have at least two pieces for A. The probability that this
occurs is far smaller, and only 10-20% of the games will see such a
trade.
Now the problem is that the games in which the bad trades do NOT happen
will not be affected by the wrong piece value. So this subset of games
will have a 50-50 outcome, pushing the total score average towards 50%.
If A vs R+N gives you a 60% winning chance (so 10% excess), if it is the
only bad trade that happens (because you set A slightly under 8), and if
it happens in only 20% of the games, the total score you would see (and
from which you would have to conclude that the A value is suboptimal)
would be 52%. But the 80% of games that did not contribute to learning
anything about the A value, because in the end A was traded for A, will
contribute to the statistical noise! To recognize a 2% excess score
instead of a 10% excess score you need a 5 times lower statistical error.
But statistical errors only decrease as the SQUARE ROOT of the number of
games. So to get the error down by a factor 5, you need 25 times as many
games. You could not conclude anything before you had 2500 games!
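
The arithmetic of this dilution argument can be checked with a few lines.
The sketch takes the rule of thumb from earlier in this comment (about 100
games to see a 10% excess) as its baseline; the function names are
hypothetical.

```python
def diluted_excess(excess_when_traded, trade_fraction):
    """Excess score over 50% seen in the whole match, when only a
    fraction of the games contains the bad trade."""
    return excess_when_traded * trade_fraction

def games_needed(base_games, trade_fraction):
    """Games needed to see the diluted excess as clearly as the
    undiluted one is seen in base_games.

    The signal shrinks by trade_fraction, and the statistical error
    falls only as 1/sqrt(games), so the number of games grows as
    1/trade_fraction**2.
    """
    return base_games / trade_fraction ** 2

excess = diluted_excess(0.10, 0.20)   # 10% excess, in 20% of the games
print(f"overall score: {50 + 100 * excess:.0f}%")
print(f"games needed:  {games_needed(100, 0.20):.0f}")
```

This reproduces the 52% overall score and the 2500 games quoted above.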
Symmetrical playtesting MIGHT work if you first discard all the games that
traded A for A (to eliminate the noise they produce, since they can't say
anything about the correctness of the A value), and make sure you have
about 100 games left. Otherwise, the result will be garbage.
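
A minimal sketch of that filtering step follows. The game-record format
(a dict with an "a_for_a" flag and a "score" from the point of view of
the engine under test) is an assumption made for this example; how the
flag is derived from the game moves is left open.

```python
def filtered_score(games, min_informative=100):
    """Average score over the games in which A was NOT simply traded
    for A.

    Returns None if fewer than min_informative such games remain, since
    the result would then be dominated by statistical noise.
    """
    informative = [g for g in games if not g["a_for_a"]]
    if len(informative) < min_informative:
        return None
    return sum(g["score"] for g in informative) / len(informative)

# Hypothetical match: 380 uninformative games scoring 50%, plus 120
# informative games scoring 60%.
games = [{"a_for_a": True, "score": float(i % 2)} for i in range(380)]
games += [{"a_for_a": False, "score": 1.0 if i < 72 else 0.0}
          for i in range(120)]
print(filtered_score(games))
```

In this made-up match the overall score is diluted towards 50%, while the
filtered score shows the full 60% of the informative games.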