I guess I have to get into the specifics of what I personally still don't trust about computer studies, again.
First, Kaufman's type of study: concluding that B=N based on statistics from a large number of games. (I only vaguely recall, but many of the players in his database may have been sub-grandmaster level; GMs are relative adults compared to 2300 players, who are playing wargames in a sandbox against each other.) If you want to establish the absolute truth of whether B=N, solving chess from the setup and then doing some sort of database win/loss count for [near-]'perfect' play would be best, but that is impossible on earth right now (perfect play, if it does not result in a draw, would probably favour White).
Today's best chess engines might be used to generate, say, a 3000+ vs. 3000+ engine vs. engine database, if enough games could be played over time to be statistically significant. That would arguably be second best, but even then there may be some element of doubt about whether the result is the truth, and that doubt might be hard to assign an exact probability to (maybe even a professional statistician who is also a GM would throw up his hands and say we simply cannot say). In any case, the time it would take to build such a database makes it impractical for now, yet again.
Coming to the type of study used for fairy chess piece values, I don't know how the margin(s) of error for such a study can be confidently established, for one thing. Next, and more seriously, what's on my mind is the exact setup and armies used in a given study. For Chess960, I saw somewhere online long ago that someone concluded, after their own type of study, that certain setups are roughly equal, while others favour White more than in orthodox chess, say by up to 0.4 pawns' worth over Black (you might find this somewhere on the internet, to check me). Consider also that that's just for armies that are exactly equal in strength, being identical as in chess. You may give both sides White and Black equally often, but the setup and armies vary per study, and I'd guess it's hard to always be exhaustively fair to every possible setup/army, given time constraints.
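For what it's worth, the purely statistical part of a margin of error can at least be sketched. The snippet below is only an illustration, not the method of any study mentioned here: it assumes every game is an independent result worth 1, 0.5 or 0 points, and the function names (score_margin, games_needed) are made up for this example. It says nothing about systematic error from the choice of setup, armies or engine strength, which is the part I worry about more.

```python
import math


def score_margin(wins, draws, losses, z=1.96):
    """Return (mean score per game, half-width of the ~95% interval).

    Assumption: games are independent, which a real test may violate
    (e.g. repeated openings, or the same setup played many times).
    """
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of a result that is 1, 0.5 or 0.
    var = (
        wins * (1.0 - mean) ** 2
        + draws * (0.5 - mean) ** 2
        + losses * (0.0 - mean) ** 2
    ) / n
    return mean, z * math.sqrt(var / n)


def games_needed(target_margin, draw_rate, z=1.96):
    """Rough game count for a given half-width, assuming a near-50% score."""
    # With wins and losses equally likely, per-game variance is 0.25 * (1 - draw_rate).
    var = 0.25 * (1.0 - draw_rate)
    return math.ceil(z * z * var / target_margin ** 2)


if __name__ == "__main__":
    mean, margin = score_margin(400, 200, 400)
    print(f"score {mean:.3f} +/- {margin:.3f}")
    print("games for +/- 1%:", games_needed(0.01, 0.2))
```

With 400 wins, 200 draws and 400 losses, the sampling margin comes out at a little under 3 percentage points either way, and shrinking it to about 1 point at a 20% draw rate would take several thousand games. That is only the sampling noise; any bias from the setup or the armies sits on top of it and does not shrink with more games.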
Finally, you wrote earlier that errors tend to cancel each other out with lower level play (say 2300+ vs. 2300+ engines, as opposed to 2700+ vs. 2700+). It would be very good to know how many games and studies (even roughly) you base that conclusion on, if you still recall. Also, does the cancellation ever significantly favour one side or the other in any given [sort of] study? I think the strength of the engine(s) used just might be the largest and most underestimated factor causing possible undetected error in this type of study (as sub-GM play might be within Kaufman's database study, as I alluded to above).