Two Bugs

Say we have a sequence of i.i.d. random variables $ X_{1}, X_{2}, X_{3}, \dots, X_{N} $ with a common probability density function $ f(x) $. How does one estimate $ f(x) $? Let us consider the following simple problem. You are looking at the machine code of a large program, written by many programmers over a long period and containing hundreds of bugs. After identifying two memory addresses L and M that hold the starting addresses of two random bugs, where is the safest location in the address space if you expect a third bug next? I define the safest location as the one farthest from the bug-prone zone.

One obvious answer is the memory address farthest from L and M. What about the safest place between L and M? Is it the midpoint? Is it nearer to L or M? If we think both bugs have the same source with simple random deviation, then $ \frac{L+M}{2} $ is the most bug-prone area, because a third bug from that source is most probable there. On the other hand, if we think that the two bugs have independent sources, then $ \frac{L+M}{2} $ is the most bug-free spot, because the third bug most likely comes from one of those two sources and lands near L or M.

Going about this the classic kernel estimation way, we treat the two sources as independent, with a bandwidth deduced from $ |M-L| $. Without smoothing, the estimator of $ f(x) $ would be a sum of delta functions at L and M, which is unacceptable, so we spread each of them out with a Gaussian or another kernel. If the bandwidth is sufficiently large (how large depends on the shape of the kernel), $ \frac{L+M}{2} $ becomes the most unsafe location. The bandwidth shared by the two bugs therefore controls where the safest place lies in the sum of the two Gaussians. That is to say, for a sufficiently large bandwidth,

$$ P \left ( \frac{L+M}{2} \right ) \ge \max \left ( P(L), P(M) \right ) $$

With a single kernel shared by L and M, the most bug-prone location is $ \frac{L+M}{2} $.
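The effect of the bandwidth can be seen in a minimal sketch, assuming an equal-weight sum of two Gaussian kernels with one shared bandwidth. The addresses `100.0` and `300.0`, the bandwidth values, and the function names are hypothetical and chosen only for illustration.

```python
import math

def gaussian(mean, bandwidth, x):
    """Gaussian kernel centred on `mean` with standard deviation `bandwidth`."""
    return math.exp(-((x - mean) ** 2) / (2 * bandwidth ** 2)) / (
        bandwidth * math.sqrt(2 * math.pi)
    )

def two_bug_kde(x, L, M, bandwidth):
    """Equal-weight sum of two kernels centred on the bug addresses L and M."""
    return 0.5 * (gaussian(L, bandwidth, x) + gaussian(M, bandwidth, x))

def midpoint_is_most_bug_prone(L, M, bandwidth):
    """True when the estimate at the midpoint is at least as large as at L and at M."""
    mid = (L + M) / 2
    p_mid = two_bug_kde(mid, L, M, bandwidth)
    p_L = two_bug_kde(L, L, M, bandwidth)
    p_M = two_bug_kde(M, L, M, bandwidth)
    return p_mid >= max(p_L, p_M)

# Hypothetical addresses and bandwidths, for illustration only.
L, M = 100.0, 300.0
for bandwidth in (50.0, 200.0):
    print(bandwidth, midpoint_is_most_bug_prone(L, M, bandwidth))
```

With the narrow bandwidth the two kernels stay separate and the midpoint is the dip between them; with the wide bandwidth they merge into a single bump whose peak is the midpoint.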

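If one source produced both bugs and the third bug is expected to come from the same source, then the best fit is a normal distribution centred on the midpoint of L and M. Assuming addresses are rescaled so that this midpoint is 0 and $ \frac{|M-L|}{2} $ is the unit of length, this is the normal distribution with mean 0 and variance 1: $ N(0,1, x) $ gives the maximum probabilities for L and M. Otherwise, if different sources produced the two bugs, the best guess is that the third bug is normally distributed around one of those sources, described by $ F(x) $. Giving both possibilities equal weight: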

$$ f(x) = \frac{N(0,1,x)}{2} + \frac{F(x)}{2} $$

where,

$$ F(x) = \frac{F(L, x)}{2} + \frac{F(M, x)}{2} $$

And the best guess for $ F(x,y) $ is $ N \left (y, \frac{y-x}{2}, x \right ) $, because we know that the scattering for $ F(x,y) $ is contained in the distance between $ x $ and $ y $, and the scattering $ \frac{y-x}{2} $ is the best fit. The overall distribution thus becomes:

$$ f(x) = \frac{N(0, 1, x)}{2} + \frac{N \left (L, \frac{x-L}{2}, x \right )}{4} + \frac{N \left (M, \frac{x-M}{2}, x \right )}{4} $$

where,

$$ N(a,b,c) = \frac{1}{\sqrt{2\pi b^{2}}} \, e^{-\frac{(c-a)^{2}}{2 b^{2}}} $$

Between L and M, the minima of this function occur at a distance of approximately $ \frac{|L-M|}{4} $ from each of the two bugs.

Thus, taking $ L < M $, the best guess for a bug-free address is $ L + \frac{|L-M|}{4} $ or $ M - \frac{|L-M|}{4} $.
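The location of the minima can be checked numerically. The sketch below rests on several assumptions that are not spelled out in the text: addresses are rescaled so that $ L = -1 $ and $ M = 1 $ (midpoint 0, half-distance 1, which is what makes $ N(0,1,x) $ the single-source fit), the weights are 1/2, 1/4 and 1/4 as obtained by substituting $ F(x) $ into $ f(x) $, and the spreads are $ \frac{x-L}{2} $ and $ \frac{x-M}{2} $ as in the definition of $ F $. The function names are hypothetical.

```python
import math

def normal_pdf(a, b, c):
    """N(a, b, c) as defined in the text: centre a, spread b, evaluated at c."""
    return math.exp(-((c - a) ** 2) / (2 * b ** 2)) / math.sqrt(2 * math.pi * b ** 2)

# Assumption: rescaled addresses, so the two bugs sit at -1 and +1.
L, M = -1.0, 1.0

def bug_density(x):
    """Mixture f(x) with assumed weights 1/2, 1/4, 1/4 (undefined at x == L or x == M)."""
    same_source = normal_pdf(0.0, 1.0, x)
    from_L = normal_pdf(L, (x - L) / 2, x)
    from_M = normal_pdf(M, (x - M) / 2, x)
    return same_source / 2 + from_L / 4 + from_M / 4

def safest_offset_from_midpoint(steps=20000):
    """Grid search on the open interval (0, M); f is symmetric about the midpoint."""
    candidates = (M * i / steps for i in range(1, steps))
    return min(candidates, key=bug_density)

x_min = safest_offset_from_midpoint()
print("offset from midpoint:", round(x_min, 3))
print("distance from nearer bug:", round(M - x_min, 3))
```

Under these assumptions the search lands at roughly 0.47 from the midpoint, which is about 0.53 from the nearer bug and so close to the quarter-distance value of 0.5 in these units. The grid deliberately excludes L and M themselves, where the spread is zero and the density is undefined.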