# Soft Indicator Function

Indicator functions that denote class membership come up very often. In their native form they are neither continuous nor differentiable. I will describe a trick to convert such an indicator function into a continuous and differentiable approximation. This blog is organized as follows:

1. Describe a computation case with indicator function
2. Trick to convert
3. More remarks on why and how it works

### Computation case with indicator function

Imagine you have several (say 250) 10-dimensional vectors $\bf{x}_i, i=1,...,250$. Also imagine you have several (say 5) cluster centers, each also 10-dimensional, $\bf{c}_k, k=1,...,5$. Then define a matrix $\bf{V}$ of size $10 \times 5$ whose columns hold, for each cluster center, the sum of residuals (the differences between the vectors $\bf{x}_i$ and that center). The $j^{th}$ column of $\bf{V}$, for $j=1,...,5$, is computed as:

$\bf{V}_j = \sum_{i=1}^{250} a_j(\bf{x}_i) ( \bf{x}_i - \bf{c}_j )$ …….(1)

Here, $a_j(\bf{x}_i)$ is an indicator function. Its value is 1 when $\bf{x}_i$ belongs to cluster $j$ (that is, the nearest center to $\bf{x}_i$ is $\bf{c}_j$) and 0 otherwise.
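To make Eq. 1 concrete, here is a minimal NumPy sketch of the hard-assignment computation. The function name `hard_vlad`, the random data, and the array layout (one vector per row) are my own assumptions for illustration, not something taken from the VLAD papers.

```python
import numpy as np

def hard_vlad(X, C):
    """Eq. 1 with a hard (0/1) indicator.

    X: (N, D) array of vectors, C: (K, D) array of cluster centers.
    Returns V of shape (D, K) and the indicator matrix a of shape (N, K).
    """
    # Residuals x_i - c_j for every pair (i, j), shape (N, K, D).
    residuals = X[:, None, :] - C[None, :, :]
    # Squared Euclidean distances, shape (N, K).
    d2 = (residuals ** 2).sum(axis=2)
    # Indicator: a[i, j] = 1 if c_j is the nearest center to x_i, else 0.
    a = np.zeros_like(d2)
    a[np.arange(X.shape[0]), d2.argmin(axis=1)] = 1.0
    # Column j of V is the sum over i of a[i, j] * (x_i - c_j).
    V = np.einsum("nk,nkd->dk", a, residuals)
    return V, a

# Illustrative random data matching the sizes used in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 10))
C = rng.normal(size=(5, 10))
V, a = hard_vlad(X, C)
print(V.shape)  # (10, 5)
```

The `argmin` step is what makes this computation non-differentiable with respect to the centers: a tiny change in a center can flip an entry of `a` from 0 to 1.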

Such computations are reasonably common. This particular computation is the Vector of Locally Aggregated Descriptors (VLAD) [1,2]. The description here is adapted from the NetVLAD paper, in particular, see vlad. Another example of a similar computation is K-means clustering.

### Trick to Convert to Continuous and Differentiable Computation

The computation at hand (Eq. 1) is neither continuous nor differentiable. This is due to the nature of the class indicator function, i.e. $a_j(\bf{x}_i)$. However, we can approximate the indicator with another function that is continuous and differentiable:

$\hat{a}_j(\mathbf{x}_i) = \frac{ e^{-\| \mathbf{x}_i - \mathbf{c}_j \|^2 } }{ \sum_q e^{-\| \mathbf{x}_i - \mathbf{c}_q \|^2 }}$ …… (2)

Now, if you use $\hat{a}$ instead of $a$ for the indicator, you will have a continuous and differentiable representation.
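As a rough sketch, Eq. 2 is a softmax over negative squared distances, so it can be written in a few lines of NumPy. The names `soft_assignment` and `soft_vlad` are hypothetical, and the `alpha` parameter is my assumption for where the sharpness scalar from the remarks would go; `alpha=1.0` reproduces Eq. 2 as written. Subtracting the row maximum before exponentiating is a standard numerical-stability step and does not change the result.

```python
import numpy as np

def soft_assignment(X, C, alpha=1.0):
    """Soft indicator from Eq. 2; alpha=1.0 matches the equation as written.

    X: (N, D) vectors, C: (K, D) centers. Returns a_hat of shape (N, K).
    """
    # Squared distances ||x_i - c_j||^2, shape (N, K).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    logits = -alpha * d2
    # Shift each row so its largest logit is 0 (avoids underflow to 0/0).
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def soft_vlad(X, C, alpha=1.0):
    """Eq. 1 with the hard indicator replaced by the soft one."""
    a_hat = soft_assignment(X, C, alpha)
    residuals = X[:, None, :] - C[None, :, :]
    return np.einsum("nk,nkd->dk", a_hat, residuals)

# Illustrative random data matching the sizes used in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 10))
C = rng.normal(size=(5, 10))
a_hat = soft_assignment(X, C)
V_soft = soft_vlad(X, C)
print(a_hat.shape, V_soft.shape)  # (250, 5) (10, 5)
```

Every operation here (subtraction, squaring, `exp`, division) is smooth, which is why the same computation can be dropped into an autodiff framework and trained with gradient descent.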

### More Remarks

The plot of $e^{-x^2}$ gives a high value (maximum 1) for very small $x$ and decays toward zero for larger $x$. A scalar $\alpha$ multiplying the exponent, as in $e^{-\alpha x^2}$, can be used to set how small is small. The denominator in Eq. 2 normalizes $\hat{a}$ so that each value lies between zero and one and the values sum to one over the clusters. This function effectively gives ~1 for the cluster center nearest to the current instance, assuming $\alpha$ is set appropriately and the cluster centers are reasonably far from each other.
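A small experiment shows the role of $\alpha$. This is a toy sketch with made-up 1-dimensional centers and a hypothetical helper `soft_indicator`; it assumes $\alpha$ scales the squared distance in the exponent. As $\alpha$ grows, the soft indicator moves from a fairly flat distribution toward a one-hot vector on the nearest center.

```python
import numpy as np

def soft_indicator(x, centers, alpha):
    """Softmax over -alpha * squared distance from x to each center."""
    d2 = ((centers - x) ** 2).sum(axis=1)
    logits = -alpha * d2
    logits = logits - logits.max()  # numerical stability only
    w = np.exp(logits)
    return w / w.sum()

# Toy setup: three 1-D centers; x is closest to the first one.
centers = np.array([[0.0], [1.0], [3.0]])
x = np.array([0.4])

for alpha in (0.1, 1.0, 10.0, 100.0):
    print(alpha, np.round(soft_indicator(x, centers, alpha), 3))
```

With a small $\alpha$ the weight is spread across several centers, while with a large $\alpha$ nearly all of it lands on the nearest one. The price of a very large $\alpha$ is that gradients for the other centers become vanishingly small.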

### References

1. Arandjelovic, Relja, and Andrew Zisserman. “All about VLAD.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013.
2. Jégou, Hervé, et al. “Aggregating local descriptors into a compact image representation.” Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.