This series is based on my attempts at exploring machine learning, and the final results of those attempts. It starts straightaway with part N because I do not remember how many times I have tried to do this before, only to leave it for the future. Just as a side note, I’m not completely illiterate at ML. However, my proficiency, as of writing this post, is limited to linear regression, which some of my friends do sometimes regard as complete illiteracy.
So the endsems are over and my unremitting self is again motivated to proclaim “I’ll learn ML in these vacations”. Having tried my hands at all sorts of online resources, this time I decide to use a physical book as my guide. So I go to the library and pick up this book titled Machine Learning: A constraint-based approach by Professor Marco Gori.
It is 2 AM, I’m on chapter one, and the current topic is why Euclidean distance is not a good metric in higher dimensions. So we’re working in the n-dimensional set \( \mathbb{X} \) and the author defines the set of neighbours of a point \(x_{0}\) as : $$ \mathcal{N}_R (x_{0}) = \{ x \in \mathbb{X}\,\, |\,\, ||x-x_{0}|| < R\}$$ where \(R\) is the degree of neighbourhood of \(x_{0}\) (i.e., the radius). We call this the orange. He then brings up the formula for the volume of an n-ball (which I’ve never seen before) : where \(n\) is the number of dimensions, \(R\) is the radius of the ball, and \(\Gamma\) is the Gamma function. Trusting the author without verifying the veracity of the formula (as I’ve always done), I move on to the definiton of the peel of this orange. We define \(\epsilon > 0\) as the thickness of the peel, which we denote as \(\mathcal{P}_{\epsilon}\). The radius of the pulp, therefore, reduces to \(R-\epsilon\). The volume of the peel is then obtained by subtracting the volume of the pulp from the volume of the orange. $$ \begin{align} vol(\mathcal{P}_{\epsilon}) &= vol(\mathcal{N}_R) - vol(\mathcal{N}_{R-\epsilon})\\\\ &= vol(\mathcal{N}_R) \left(1 - \frac{vol(\mathcal{N}_{R-\epsilon})}{vol(\mathcal{N}_R)}\right)\\\\ &= vol(\mathcal{N}_R) \left(1 - \left(\frac{R-\epsilon}{R}\right)^{n}\right) \end{align} $$ Now we let the dimensions increase. Formally, we let \(n \to \infty\). $$ \begin{align} vol(\mathcal{P}_{\epsilon}) &= vol(\mathcal{N}_R) \left(1 - \lim \limits_{n \to \infty} \left(\frac{R-\epsilon}{R}\right)^{n}\right) \end{align} $$ Since \(R-\epsilon < R\), we have \(\frac{R-\epsilon}{R}<1\) which implies \( \lim \limits_{n \to \infty} \left(\frac{R-\epsilon}{R}\right)^{n}\ = 0\). Thus, we get $$ vol(\mathcal{P}_{\epsilon}) = vol(\mathcal{N}_R) $$ and the orange becomes its peel. This is it. This blows my mind. I naively declare this as the pinnacle of mathematical beauty that has ever been unleashed to me, and go to bed, thinking about all the jokes I could make with this.
Did I even bother to check the proof of the formula for the volume of the n-ball? No. I merely verified that it holds for 1, 2 and 3 dimensions. But what if there is some catch in higher dimensions, I wonder as I lie down on my bed. I then lazily conclude that the author must have checked the proof before writing the formula in his book.
Were we discussing about how Euclidean metrics fail in higher dimensions? No. We were discussing about a greedy orange-loving alien who has invented a dimension-changinator and would be wanting to increase the number of dimensions of his orange, and how he would be disappointed to get only the peel if he increases the number of dimensions to very large values.