Estimation of parameters is a central problem in various fields of applied science. In robotics and machine learning, literally a good solution to a problem is cast as an optimization problem of some sort. This optimization problem involves a some observations, residue function and some optimization variables (this exact process is sometimes also referred as parameter estimation). Residue functions are quite often cast as least squares. For example, suppose we have several noisy observations (x,y) :eg: { (2,3), {4.1, 5.11} , {7.9, 9.1}, …, etc} . We are interested in fitting a line equation y = ax+b. Particularly we are interested to know good a, b which fit this data.

Since you are reading this blog post, you are likely already familiar with least squares fitting. We shall review the least squares fitting for this and then move on to an alternate mindset to solve this problem.

Enjoy the blog….!

### Least Squares Solution

If one goes on to solve this problem with a naive least square, one you make a loss function as follows and try to find the minima. Particularly we are interested in solving: . This can be done (in this case) by taking derivatives. Please follow my notes below.

Here is how we would do this in practice:

```
import numpy as np
def gen_data():
""" generate x: 1000-vector and y 1000-vector. x and y are linearly related,
but the relation in unknown """
x = np.random.uniform(-10,10,100)
y = 2*x + 1 # note, this 2,1 is not known to the method
return x, y
def simple_least_squares_fit(x, y):
""" Implements my solution. At the core solves AX = b.
Input x and y are 100x1 each
"""
A_0_0 = np.sum( x**2 )
A_0_1 = np.sum( x )
A_1_0 = np.sum( x )
A_1_1 = np.size( x )
n_0 = np.sum( np.multiply( x, y ) )
n_1 = np.sum( y )
A = np.array( [ [A_0_0, A_0_1 ], [A_1_0, A_1_1] ] ) / np.size(x)
n = np.array( [[n_0], [n_1]] ) / np.size(x)
print 'A', A.shape, '\n', A
print 'n', n.shape, '\n, n
result = np.matmul( np.linalg.inv(A) , n )
print 'result', result.shape, '\n', result
if __name__ == '__main__':
x, y = gen_data()
simple_least_squares_fit( x, y )
### Result:
A (2, 2)
[[3.21400272e+01 5.18788283e-03]
[5.18788283e-03 1.00000000e+00]]
n (2, 1)
[[64.2852422 ]
[ 1.01037577]]
result [[2.]
[1.]]
```

### Maximum Likelihood Estimation (MLE)

Although the least squares gives us correct result. But this is just synthetic data and also there is no noise added to it. Another way to thing about this problem is in terms of probability distributions.

Particularly, given several random variables (think of these as random numbers) we wish to estimate the parameters of a probability distributions that will fit this data. In theory we can choose to fit any probability distribution. However, in practice Gaussian distribution is the easiest to deal with. Laplacian distribution is also in common use. A Gaussian distribution (Normal Distribution) has two parameters, which are, a) the mean (a n-vector) and b) the covariance matrix (nxn matrix) .

Now, lets turn back to our problem of line fitting, but using the MLE method. This section is borrowed from Ref [4]. More more elaboration refer to Ref[4].

If the random variable , then we model the random variable Y as . Given the dataset and assuming each of the samples are independent, we can write the likelihood function, ie. P(Y=y_i | x ; a,b, \sigma^2). Also at this stage we can assume this likelihood to be a Gaussian distribution.

Now, that we have , we need to find its maximum. We can take negative log on both sides. This will change the problem to a minimum finding problem. We can now take the derivatives of the L with respect to the optimization variables, viz, . To arrive at the maximum likelihood estimate.

Notice that the optimization problem resulting from the likelihood maximization function is essentially** ordinary least squares**. The solution of which can be derived exactly as the code before.

### Maximum a Posteriori Estimator (MAP)

Now, lets create another estimator, the Maximum a-posteriori estimator. The major point is that the MAP estimator gives a way to provide priors for our estimates. Recall the Bayes rule;

In the MAP estimator we maximize the posterior probability. In practice we minimize the log-likelihood. This would create an additional residue of the prior. The denominator here, does not matter because it is independent of the optimization variables, c, in this case.

### MLE vs MAP

MLE and MAP are infact very similar implementation wise. But the real like implications are quite different. MAP gives a way to bring in prior knowledge about the underlying optimization variables. For example, for our line fitting estimator, if we some how knew that the a and b are around 1 and 2 respectively, we can model the priors of a and b as Gaussians. We would get a much better estimate like this with the MAP rather than simply doing MLE. The basic differences are:

- With MAP you can set a prior on the optimization variables. Usually one would set priors as Gaussians.
- Under the assumption that prior is uniformly distributed, MLE and MAP will result in the same estimates.

There are several articles online talking about MLE vs MAP. See Ref[1] and [2] for more elaborate details.

### Other Advanced Estimators

MLE and MAP are the simplest probabilistic estimations methods out there. However there are some more advanced methods like the Expectation Maximization (EM), Majorization Maximization (MM) and Variational Inference (VI). I will explore more on this in a later blog post.