## Friday 26 April 2024

### Cumulative density of binomial distributions

Not really a math article, more of a diary entry: something to share about estimating binomial distributions.

How would you approach the CLT in a stats 101 course? The question came up when a friend of mine was prepping for the coming semester. I still prefer the probabilistic approach: from Chebyshev's inequality to the laws of large numbers (LLN), then to the central limit theorem (CLT). The counterargument is that there is a gap between the LLN and the CLT: the LLN says nothing about what the distribution looks like, while the CLT says it looks normal. You don't see the reason behind that until you start looking into the proof that uses moments...and you don't do that in a 101 course.

Instead, the other common approach is to demonstrate it via the binomial distribution, making use of its simple additivity (the other great example being the Poisson distribution, which is easily additive as well). Using Stirling's approximation you can also observe the tail shrinking like the normal density. Although Stirling's approximation could also be too hard for a 101 course?...I don't know.
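As a quick sanity check on that claim, here is a minimal sketch (plain standard-library Python) comparing the exact binomial pmf against the normal density that pops out of Stirling's approximation; the specific $n$, $p$, and the sampled values of $k$ are just illustrative choices:

```python
import math

def binom_pmf(n, k, p):
    """Exact binomial probability C(n,k) p^k (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_density(n, k, p):
    """Local normal approximation N(np, npq) evaluated at k,
    i.e. what Stirling's formula gives for the pmf near the mean."""
    mu, var = n * p, n * p * (1 - p)
    return math.exp(-(k - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

n, p = 1000, 0.5
for k in (480, 500, 520):
    exact = binom_pmf(n, k, p)
    approx = normal_density(n, k, p)
    print(k, exact, approx, approx / exact)
```

Near the mean the ratio sits very close to 1; it is in the far tail, steps of $\sqrt{npq}$ away, that the two start to disagree.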

But my encounter with binomial distributions does not stop there. I recently helped another friend of mine with a task related to the cumulative density of some binomial distributions. More precisely, I am looking at $P_k = P(X_k \geq n-k)$ where $X_k \sim B(n,f(k,p))$, and then at properties of the sequence $(P_k)$. It turns out that despite the varying probability parameter, $P_k$ looks really similar across $k$, which makes the terms hard for me to estimate.
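The exact $P_k$ are at least easy to compute directly. The function $f(k,p)$ is left unspecified in this post, so the `f` below is a purely hypothetical stand-in chosen only to make the sketch runnable; everything else is a straightforward tail sum:

```python
import math

def binom_tail(n, m, q):
    """Exact P(X >= m) for X ~ B(n, q), summed term by term."""
    return sum(math.comb(n, i) * q**i * (1 - q)**(n - i)
               for i in range(m, n + 1))

def f(k, p):
    """Hypothetical stand-in for the (unspecified) probability map f(k, p)."""
    return p ** (1 / (k + 1))

n, p = 30, 0.9
P = [binom_tail(n, n - k, f(k, p)) for k in range(5)]
print(P)
```

For real use the exact sums get expensive and numerically delicate as $n$ grows, which is exactly why one reaches for estimates in the first place.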

The first thing that came to my mind was to use the traditional bounds. But you soon realize that bounds like the Chernoff bound do not work very well away from the mean. Berry–Esseen bounds the error to the order of $O((npq)^{-1/2})$. Worse, this bound is tight: the outputs of a binomial distribution are integers, so after standardization the distribution jumps in steps of size $(npq)^{-1/2}$, and that is the best such an estimate can do.
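To see how loose the tail bounds get, here is a small sketch (my own illustrative parameters, not the post's actual problem) comparing the exact upper tail of $B(n,p)$ with the standard Chernoff bound $P(X \geq m) \leq \exp(-n\,\mathrm{KL}(m/n \,\|\, p))$:

```python
import math

def binom_tail(n, m, p):
    """Exact P(X >= m) for X ~ B(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(m, n + 1))

def chernoff_tail(n, m, p):
    """Chernoff bound exp(-n * KL(m/n || p)), valid for m/n > p."""
    x = m / n
    kl = x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))
    return math.exp(-n * kl)

n, p = 200, 0.5
for m in (120, 140, 160):
    exact = binom_tail(n, m, p)
    bound = chernoff_tail(n, m, p)
    print(m, exact, bound, bound / exact)
```

The bound captures the right exponential rate but misses the polynomial prefactor, so the ratio `bound / exact` keeps growing as you move into the tail.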

The error in the estimation can easily be seen in numerical examples. Suppose we flip 6 fair coins and want to know the chance of getting 4 or more heads. The answer is surely $22/64$, or about $0.344$, but what if we estimate it with a normal distribution? $B(6,0.5)$ looks like $X \sim N(3,1.5)$, and $P(X \geq 4) \approx P(Z \geq 0.816) \approx 0.207$, which is far from the exact answer. It turns out that you should take 3.5 instead of 4, the reasoning being that the chance of getting exactly 4 heads is about $P(3.5 < X < 4.5)$ rather than $P(4 < X < 5)$.
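The whole coin example fits in a few lines; this sketch computes the exact answer, the naive normal estimate, and the continuity-corrected one, using only the standard library's error function:

```python
import math

def norm_sf(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n, p, m = 6, 0.5, 4
exact = sum(math.comb(n, i) for i in range(m, n + 1)) / 2**n  # 22/64
mu, sd = n * p, math.sqrt(n * p * (1 - p))                    # mean 3, sd sqrt(1.5)
naive = norm_sf((m - mu) / sd)          # threshold at 4: ~0.207
corrected = norm_sf((m - 0.5 - mu) / sd)  # threshold at 3.5, continuity-corrected
print(exact, naive, corrected)
```

The corrected estimate lands within a few thousandths of $22/64$, while the naive one is off by more than $0.13$.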

Why is it 0.5 and not 0.4 or 0.6? Well, one way to see it is to assume $n$ is large so that everything works smoothly. In that case even the binomial density is roughly linear locally, so 0.5 is a fair assumption. Of course, that is before we start arguing that even locally the binomial density looks exponential rather than linear, since the ratio in $f(n,k+1,p) = f(n,k,p) \times \frac{n-k}{k+1} \cdot \frac{p}{1-p}$ is locally constant when $n$ is large. Is there a second-order correction that we can apply for a more accurate estimate? What's the error in that case?
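That pmf recurrence is easy to verify numerically; a quick sketch (with arbitrary $n$ and $p$) checks that consecutive pmf ratios match $\frac{n-k}{k+1}\cdot\frac{p}{1-p}$ and barely move as $k$ varies locally:

```python
import math

def binom_pmf(n, k, p):
    """Exact binomial probability C(n,k) p^k (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 50, 0.3
for k in range(10, 14):
    ratio = binom_pmf(n, k + 1, p) / binom_pmf(n, k, p)
    predicted = (n - k) / (k + 1) * p / (1 - p)
    print(k, ratio, predicted)
```

Since the ratio is locally near-constant, the pmf behaves like a geometric (i.e. exponential) sequence over a short window, which is what fuels the "exponential, not linear" objection.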

It took me so much time before I truly realized that the error from the normal density estimation is too big to be useful regardless. The error in a general estimation is probably of the form $O((npq)^{-d/2})$, but that is never enough when $|P_k-P_{k-1}| = \exp(-O(1)\,n^2)$...

In the end I just used the good old combinatorial approach: interpret $X_k$ as a separate flip of $kN$ coins with a suitable probability $f'(k,p)$. Then I can argue about the relation across the $P_k$ by counting.

It was an adventure, but nothing too exotic in the end. The funniest part, I believe, was discovering a 150-page PhD thesis improving the Berry–Esseen bound on the binomial distribution and detailing its uses! Not going to post the link, but it should be easy to find on the net ;).