Robert Pope wrote:
When you talk about having the correct distribution of positions for training, is that really a necessary condition? Or is it simply to avoid wasting time learning to handle things that won't likely occur?

In machine learning in general, having the correct distribution is very important.
For example, suppose you have a system that classifies cats vs. dogs. If your training set is 98% cats and 2% dogs, the system will learn that dogs are extremely rare, so when in doubt it should classify something as a cat. In fact, if the system just classified everything as a cat, it would still have only a 2% error rate on that training set!
This would be disastrous if, in the actual application, cats are presented about as often as dogs. The system will misclassify many of the dogs as cats because of the prior probability distribution it picked up from the training set.
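To put rough numbers on the cat/dog example, here is a small Python sketch. The 100-example label lists, the "always cat" predictor, and the likelihood ratio of 3 are all made up for illustration; they are not from any real system. The second half uses Bayes' rule to show how the same piece of evidence leads to a different decision depending on the prior.

```python
def error_rate(predictions, labels):
    """Fraction of examples where the prediction differs from the true label."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


# Hypothetical label sets: skewed training data vs. balanced real-world data.
train_labels = ["cat"] * 98 + ["dog"] * 2
deploy_labels = ["cat"] * 50 + ["dog"] * 50

# A degenerate classifier that calls everything a cat.
train_error = error_rate(["cat"] * len(train_labels), train_labels)
deploy_error = error_rate(["cat"] * len(deploy_labels), deploy_labels)


def posterior_dog(prior_dog, likelihood_ratio):
    """P(dog | x) by Bayes' rule, with likelihood_ratio = P(x | dog) / P(x | cat)."""
    weighted = prior_dog * likelihood_ratio
    return weighted / (weighted + (1.0 - prior_dog))


# Assumed evidence: the image is 3 times more likely under "dog" than "cat".
skewed_posterior = posterior_dog(0.02, 3.0)    # prior taken from the 98/2 set
balanced_posterior = posterior_dog(0.50, 3.0)  # prior matching a 50/50 application

print(f"always-cat error on training mix: {train_error:.2f}")
print(f"always-cat error on balanced mix: {deploy_error:.2f}")
print(f"P(dog | x) with 2% prior:  {skewed_posterior:.3f}")
print(f"P(dog | x) with 50% prior: {balanced_posterior:.3f}")
```

With the skewed prior, evidence that favors "dog" three to one still leaves the posterior well below one half, so the system answers "cat". With the balanced prior, the same evidence gives a posterior of 0.75 and the answer flips to "dog".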
It is not just to save time. It makes the result more accurate as well.