Saturday, July 2, 2016

Random problems

As discussed in the previous post, my latest Code Jam problem pointed at approximate algorithms right from the statement: you are given a rooted forest with at most 100 vertices, and consider all valid permutations of its vertices. A permutation of vertices is valid if every non-root vertex comes after its parent. Additionally, each vertex is assigned a letter, so each permutation induces a string when we write those letters in the order of the permutation. What fraction of valid permutations induce a string that contains a given substring? Your answer doesn't need to be very precise: an error of up to 0.03 is allowed.

We need to estimate a fraction that's between 0.0 and 1.0 with an error of 0.03 allowed - that doesn't sound very precise. In fact, that's just 17 times more precise than simply always answering 0.5 :) That allows us to use the Monte-Carlo approach: pick k independently and uniformly sampled valid permutations, count the number m of them whose induced strings contain the given substring, and output m/k as the answer.
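To make this concrete, here is a minimal sketch of the estimator in Python. The names here are illustrative, not from the original solution: sample_valid_permutation is a hypothetical callable that returns one uniformly random valid permutation (the hard part, discussed below), and letters maps each vertex to its letter.

```python
def monte_carlo_estimate(sample_valid_permutation, letters, pattern, k=10000):
    """Estimate the fraction of valid permutations whose induced string
    contains pattern, using k independent uniform samples."""
    hits = 0
    for _ in range(k):
        # Assumed: sample_valid_permutation() returns one uniformly
        # random valid permutation of the forest's vertices.
        perm = sample_valid_permutation()
        induced = ''.join(letters[v] for v in perm)
        if pattern in induced:
            hits += 1
    return hits / k
```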

There are two things still unclear about this solution: how large k needs to be to make the error small enough, and how to sample valid permutations uniformly.

The sum of k independent samples from a Bernoulli distribution follows a binomial distribution, and we could use its cumulative distribution function to estimate the required k. However, there's a simpler approach that works for virtually any distribution: the Central Limit Theorem. It says that for moderately large values of k, the mean of k random values sampled independently from the same distribution is distributed almost like a normal distribution with mean equal to the mean of each random value, and standard deviation equal to the standard deviation of each random value divided by the square root of k.
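As a quick empirical illustration (a sketch using only Python's standard library): the mean of k Bernoulli(p) draws should cluster around p with standard deviation sqrt(p*(1-p)/k).

```python
import random
import statistics

# Sample the mean of k Bernoulli(p) draws many times and check that
# it clusters around p with standard deviation sqrt(p*(1-p)/k).
p, k, trials = 0.3, 10000, 1000
means = [sum(random.random() < p for _ in range(k)) / k
         for _ in range(trials)]
print(statistics.mean(means))   # close to p = 0.3
print(statistics.stdev(means))  # close to sqrt(0.3*0.7/10000) ~ 0.0046
```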

Each random value in our case is equal to 1 with some unknown probability p, and to 0 with probability 1-p. Its mean is thus p, and its standard deviation is sqrt(p*(1-p)), which is at most sqrt(0.5*0.5)=0.5. The standard deviation of the mean of k such values is thus at most 0.5/sqrt(k), and by the theorem it's distributed almost normally. For the normal distribution we know the probability of being off its mean by a given number of standard deviations. For example, it falls outside 6 standard deviations of its mean with probability only roughly 1 in 500 million. 6 standard deviations is at most 3/sqrt(k), and thus picking k=10000 makes 6 standard deviations lie within the allowed 0.03 error.
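These numbers are easy to verify with the standard library: the two-sided tail of a standard normal beyond z standard deviations is erfc(z/sqrt(2)).

```python
import math

k = 10000
sigma = 0.5 / math.sqrt(k)          # worst-case std dev of the mean
print(6 * sigma)                    # 0.03 - exactly the allowed error

tail = math.erfc(6 / math.sqrt(2))  # P(|Z| > 6) for a standard normal
print(tail)                         # ~1.97e-9, about 1 in 500 million
```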

Of course, the Central Limit Theorem can't always be blindly applied - one must understand what's going on. For example, note that the error probability of 1 in 500 million is for one number. Since the problem actually asked you to compute 500 such numbers, the probability of going outside the allowed 0.03 error in at least one of them can only be bounded by 1 in a million, which is still a pretty remote chance - but one has to keep this point in mind when using the theorem. For example, picking k=1000 in our problem means that 0.03 would correspond to roughly 2 standard deviations, so we'd give an answer within the allowed error with probability at least 95%, which sounds good enough to submit. However, since we need to compute 500 numbers and not just one, the probability that all of them will stay within 2 standard deviations is much, much lower: 0.95^500 is about 7*10^-12, so one is very likely to fail when using k=1000.
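A quick sanity check of both estimates:

```python
import math

# Union bound over 500 answers at k=10000:
per_answer = math.erfc(6 / math.sqrt(2))  # ~1.97e-9 per answer
print(500 * per_answer)                   # ~1e-6, the "1 in a million" bound

# At k=1000, the 0.03 budget is only ~2 standard deviations:
print(0.95 ** 500)                        # ~7.3e-12 chance all 500 stay within
```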

Another issue is that the terms "moderately large" and "almost" are of course informal. The real theorem gives them a precise meaning, and it turns out that in our case things work as expected if p and 1-p are both significantly greater than 1/k. To see the opposite situation, consider a value of p that's less than 1/k. It's not hard to see that the mean of k values will now be almost always 0, sometimes 1/k, and much more rarely anything else - clearly not very close to being distributed normally. This issue doesn't hurt in our problem, since for values of p that close to 0 or 1 we'll clearly give a good answer anyway.
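A tiny simulation (with illustrative numbers) shows this breakdown for p below 1/k:

```python
import random
from collections import Counter

# With p = 0.5/k, the mean of k draws is almost always 0, sometimes 1/k,
# and only rarely anything else - far from normally distributed.
k, trials = 10000, 1000
p = 0.5 / k
means = [sum(random.random() < p for _ in range(k)) / k
         for _ in range(trials)]
print(Counter(means))  # mass concentrated on 0.0 and 0.0001 (= 1/k)
```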

The most difficult part of this problem was not the Monte-Carlo idea - it was figuring out how to sample valid permutations uniformly. I won't go into details on this part, though, as it's explained very well in the contest analysis.

What I wanted to bring up instead is the philosophical question of how problems with approximate solutions should be properly stated and checked. As an example, the solution described above can give a wrong answer once in a million attempts. Is that good enough? Would it be unfair if some contestant was so unlucky that he got a wrong answer that way and had to submit again? What if this problem allowed only one submission attempt, so that getting a wrong answer would carry a much steeper penalty - which probability of mistake is acceptable then?

This problem had another peculiar property: the reference solution used a Monte-Carlo approach as well, so we could only be sure that the reference answers are correct up to some probability. In other words, one could say that with some very remote probability, submitting an exact answer could give a wrong answer verdict! We made sure this chance is really remote by using MapReduce before the contest to run many more attempts for each testcase, and thus we were extremely confident that our answers were precise. What do you think is an acceptable probability of mistake in this case?

Finally, some of my past Code Jam problems (Proper Shuffle, Good Luck) went even further: the inputs themselves were randomly generated, and one had to estimate the hidden variables used for this random generation. In that case the input file by itself does not completely formally determine which output files are correct and which are not! Many different sets of hidden variables could have generated the same input with some probability, and while the actual variables used for generation are known to the judges and will be compared against when checking your solution, you don't have a way to be 100% certain your output is correct even given infinite computational resources. You can, of course, be 99.9999999999999999% certain. How many nines are acceptable here?

I'm looking forward to hearing your opinions on this subject! I'll share my own boundaries in a following post.
