Today I told my class that Python was a gift to us from the C community, which in some ways is true. Think of the list type: thanks to C-slingers like Tim Peters, we get a world class class, able to sort, pop, append and insert.
But C is glorified assembler, fast because close to the metal, requiring strict discipline. Python caters to a technical mindset, but does not demand one focus on the nitty-gritty of chip registers and memory deallocation.
Leave all that to the runtime engine.
However, what I’m really here to talk about is Numpy & Pandas.
These both have a history, but let’s cut to the chase. Numpy is a canvas, a grid of cells, usually for numbers, and Pandas provides a frame around that canvas, as in “picture frame”, so yes, I’m being metaphoric.
Numpy is like some rectangle in a spreadsheet, or rather that’s the job of its star class: the ndarray type.
Pandas lets you address your ndarray using strings, intelligent labels, instead of a sequence of numbers. Coding with labels makes so much more sense, because inserting a new column shifts the numeric position of every column after it. Labels don’t get bumped. Wire up your n-dimensional canvas inside a Pandas DataFrame and you’ll have something more robust and ready for further handling.
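Here’s a minimal sketch of that point, assuming a tiny made-up canvas and hypothetical column labels ("height", "weight", "age") that I picked just for illustration. It wraps an ndarray in a DataFrame, inserts a column up front, and checks that the label still finds the same data even though its numeric position moved.

```python
import numpy as np
import pandas as pd

# A small 3x2 "canvas" of made-up numbers.
canvas = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

# The "frame": hypothetical column labels chosen for this example.
df = pd.DataFrame(canvas, columns=["height", "weight"])

# By position, "weight" currently lives in column 1.
weight_pos_before = df.columns.get_loc("weight")

# Insert a new column at the front; every later position shifts by one.
df.insert(0, "age", [7, 8, 9])
weight_pos_after = df.columns.get_loc("weight")

# The label still retrieves the same data, wherever it sits now.
weights = df["weight"].tolist()
```

Code written against position 1 would now be reading "height"; code written against the label "weight" doesn’t notice the change.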
Why do we care?
What I’m describing are the “cookie trays” we then feed to the oven, perhaps via conveyor belt. Machine Learning algorithms devour Pandas “cookie trays” by extracting the canvas (easily done, there’s like a button you press).
The canvas, a rectangle, breaks down into X (clues) and y (right answers), where our unknown is some F, some function, such that F(X.test) → correct guess about y.
Like your y labels might be types of animal: mouse, zebra, snail, fish…
Your X is a stack of samples, rows of clues.
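As a rough sketch of that split, assume a made-up cookie tray with three numeric clue columns and one answer column. The column names and values are invented for this example. The “button” here is the DataFrame’s `to_numpy()` method, which hands back the underlying canvas.

```python
import pandas as pd

# Hypothetical samples: three numeric clues plus the right answer.
tray = pd.DataFrame({
    "tail_length": [8.0, 50.0, 0.0, 3.0],
    "neck_size":   [2.0, 60.0, 0.0, 1.0],
    "legs":        [4, 4, 0, 0],
    "animal":      ["mouse", "zebra", "snail", "fish"],
})

# X: the rows of clues, extracted as a plain ndarray.
X = tray.drop(columns="animal").to_numpy()

# y: the right answers, one per row of X.
y = tray["animal"].to_numpy()
```

Each row of `X` lines up with the entry of `y` at the same index, which is the pairing the model learns from.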
Of course we immediately think of pictures, which is where convolutional networks come in, but let’s start with something less overwhelming than a million pixels. We might have only five clues per sample, on which to base a guess. Given enough samples, our guesser will have a chance to get good.
What slows people down a lot in ML is that sacred cow which insists every dimension needs an axis “at 90 degrees” to all the others.
The doctrine of mutual orthogonality works well up through a third axis, and then transitions to the hypercube or tesseract with a fourth. Yet a typical cookie tray may be thirty columns wide. A survey form. Thirty questions. Are we supposed to picture thirty axes all at right angles to every other? That might take a while to draw.
How is our iris data (simple numeric facts about some flowers, four columns worth) supposed to fit inside a hypercube?
The mind boggles.
Fortunately, a lot of YouTube animations remind us that dimensions which do not depend on one another do not require mutual orthogonality for their mental modeling.
Picture dials on a dashboard, each one free to turn independently of all the others.
In the real world, pure linear independence may not be guaranteed, as we’re not always in charge of the sample space. Various tests suggest what “dimensions” might crush together.
Losing columns, because everything we need is in the remaining columns, is no tragedy. The less redundancy the better.
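One way to spot a redundant column, sketched here on made-up data, is to look at pairwise correlation and at the rank of the canvas. I’m assuming the simplest possible case, where column "c" is just twice column "a"; real data is rarely this clean, and these two checks are examples, not the only tests available.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [4.0, 1.0, 3.0, 2.0],
})
df["c"] = 2 * df["a"]  # fully determined by "a", so it adds nothing new

# A correlation of (nearly) 1.0 flags "a" and "c" as moving in lockstep.
corr_ac = df["a"].corr(df["c"])

# Three columns, but only two linearly independent ones.
rank = np.linalg.matrix_rank(df.to_numpy())

# Dropping the redundant column leaves a canvas of full column rank.
reduced = df.drop(columns="c")
reduced_rank = np.linalg.matrix_rank(reduced.to_numpy())
```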
However, given pure independence, meaning every column’s value is free to wiggle in a significant manner, not tethered to the others, we’re ready to fit our data to a mold.
Machine Learning algorithms make castings. They tune the violin strings. They reshape themselves to become adept at predicting the right y labels from each subsequent row in X (the sample space).
You tell me tail length, neck size, number of legs, and I (the model) will tell you if it’s a gorilla or not.
Or maybe I recognize gems.
Think of anything you might want to categorize (classify, sort, label): supervised learning is about doing that job for you. Training may take time.
You’ll want Numpy and Pandas to help you snag and shape data.
Adding a frame to the canvas will help with initial processing, which may involve adding and subtracting columns, filtering out rows, smoothing over holes.
Much of data science is about turning raw data into something more refined, yet faithful to the original. Normalization, and one-hot encoding, are both names of refining techniques that Pandas can help you with.
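To make those two names concrete, here’s a small sketch on invented data. “Normalization” covers several techniques; I’m assuming min-max scaling to the range 0 to 1 as one common choice. The one-hot step uses `pd.get_dummies`, and the column names are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({
    "legs":    [4, 2, 0, 4],
    "habitat": ["land", "land", "water", "water"],
})

# Min-max normalization: rescale a numeric column into the range [0, 1].
legs = raw["legs"]
raw["legs_scaled"] = (legs - legs.min()) / (legs.max() - legs.min())

# One-hot encoding: replace a text column with one indicator column
# per category, so every cell in the canvas ends up numeric.
encoded = pd.get_dummies(raw, columns=["habitat"])
```

After this, "habitat" is gone and "habitat_land" and "habitat_water" stand in for it, one flag per row.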
Once your data is prepped, put those multi-dimensional cookie trays on some conveyor belt and let the data bake the shape of some mold, some model.
Hyperparameters need fine tuning. You’ve picked which model to use, but now comes the fun part, whereby it’s trained to fit the data (but not overfit).
Feedback is happening as the data comes in.
Is this a Neural Network of perceptrons? A Support Vector machine? A Decision Tree? A Random Forest?
Let go of the idea of just the one oven.
You’re free to bake your cookies in a wide variety of model makers, some of which are actually easy to understand.
K-nearest Neighbors, for example. KNN. That one’s not so hard to think about. Some of the others hurt my head. Perhaps Siraj will explain.
Why we think in terms of mutual orthogonals has to do with the Pythagorean Theorem and the so-called distance formula. KNN often uses that.
The difference along every dimension, such as A.x minus B.x, and A.y minus B.y, gets raised to a 2nd power, which eliminates negative numbers.
Then all of these positive 2nd powers get summed, and then a 2nd root is taken.
In everyday Euclidean XYZ space, the geometric meaning is clear. We call the result of this algorithm the measure of “distance” between any two positions in a so-called N-dimensional phase space, or extended Euclidean N-space. N is for Nerd, just kidding.
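The recipe just described (subtract, square, sum, take the root) fits in a few lines of Numpy. This is a sketch with a function name I made up, `euclidean_distance`; the same code works whether there are two dimensions or thirty.

```python
import numpy as np

def euclidean_distance(a, b):
    """Distance between two points with any number of dimensions."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Difference along every dimension, squared, summed, then rooted.
    return float(np.sqrt(np.sum((a - b) ** 2)))

# The familiar 3-4-5 right triangle in two dimensions.
d2 = euclidean_distance([0, 0], [3, 4])

# Five dimensions, no picture required: each axis differs by 1.
d5 = euclidean_distance([1, 1, 1, 1, 1], [2, 2, 2, 2, 2])
```

Here `d2` comes out to 5.0 and `d5` to the square root of 5.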
When K Nearest Neighbors (KNN), a machine learning method, gets a new data point, it computes its distance from the known data points, picks the K nearest ones of known label, and lets them vote on to whom it belongs. Two out of three say you’re Republican, based on distance, so that’s what we’ll guess. Fiddling with K may change the accuracy.
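Here’s a bare-bones sketch of that voting scheme, written from scratch so the moving parts show. The function name `knn_predict` and the training data are made up for illustration; a real project would more likely reach for a library implementation.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, new_point, k=3):
    """Label new_point by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    new_point = np.asarray(new_point, dtype=float)
    # Euclidean distance from the new point to every known sample.
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # Indices of the k closest samples.
    nearest = np.argsort(distances)[:k]
    # Let those neighbors vote with their labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two made-up clusters of labeled samples, two clues each.
X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["mouse", "mouse", "mouse", "zebra", "zebra", "zebra"]

guess_a = knn_predict(X_train, y_train, [2, 2], k=3)
guess_b = knn_predict(X_train, y_train, [7, 7], k=3)
```

The point near the first cluster gets labeled "mouse" and the one near the second gets "zebra". There’s no training step to speak of: the model is just the stored samples plus a choice of K.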
In linear algebra, we’re able to bisect something N-dimensional, with something yet lower dimensional, or call it a “hyperplane”. Again, don’t despair if you don’t think of hypercubes right away.
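No hypercube is needed to use a hyperplane, only a dot product. In this sketch I’m assuming a hyperplane in four dimensions defined by a made-up normal vector `w` and offset `b`. Which side a point falls on is just the sign of `w · x + b`.

```python
import numpy as np

# Hypothetical hyperplane in 4 dimensions: all x where w . x + b == 0.
w = np.array([1.0, -1.0, 0.5, 0.0])
b = -1.0

def side(x):
    """Return +1 or -1 for the two sides, 0 if x lies on the hyperplane."""
    return int(np.sign(np.dot(w, np.asarray(x, dtype=float)) + b))

side_a = side([3, 0, 0, 0])   # 3 - 1 = 2, positive side
side_b = side([0, 3, 0, 0])   # -3 - 1 = -4, negative side
side_c = side([1, 0, 0, 5])   # 1 - 1 = 0, on the hyperplane itself
```

A linear classifier’s job amounts to finding a `w` and `b` that put most samples of one label on one side and most of the other label on the other.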
To think metaphorically is to think multidimensionally in another sense. Think of walls confining a room, a concavity.
The machine model figures out how to route you to the right place, based on your features. It does its best. It might make mistakes.
That’s right: machine learning models are forgiven in advance for getting it wrong some of the time; they may be fooled.
In the real world, the samples may never tease apart that completely.
Even humans make mistakes, and they have a million year head start, which sounds like a lot until you figure in how quickly CPUs and GPUs make up for lost time.
Remember, glorified C isn’t slow.
Data scientists flocked to Python, to its Numpy and Pandas, to its Machine Learning ecosystem, because they didn’t have to learn a lower level system language in the process.
So why not Ruby or another agile language? The night is young. Expect many more cookie factories.
As it happened, Python was in the right place at the right time, with both Google and Facebook adopting it, for TensorFlow and PyTorch respectively. There’s nothing wrong with a little competition. The more the merrier.