Classical Machine Learning — Unsupervised Learning Edition

I mentioned unsupervised learning in the last blog, Classical Machine Learning — Supervised Edition; here's the link in case you missed it.

We said that unsupervised learning is when we provide the machine with unlabeled data and ask it to find meaningful structure on its own. In this type of learning, the machine has no teacher (no labeled data).

We don't always have the luxury of labeled data. Say we want to create a rabbit classifier — should I take a million pictures of rabbits and label them one by one? Well, thank you very much, but no.

Thanks to huge income gaps around the world, there are services like Amazon's Mechanical Turk that pay workers $0.05 per task, and that's usually how things get done here.

Or you can use unsupervised learning to try to cluster that data into something more meaningful.


Clustering is an unsupervised ML technique that divides objects into classes based on features it discovers on its own. It is most commonly used for:

  • Market segmentation (identifying types of customers)
  • Merging close points on a map (think of compressing pixels in an image → fewer colors → less memory)
  • Analyzing and labeling new data
  • Anomaly detection (spotting aberrant data points)

Well-known clustering algorithms include K-Means, DBSCAN, Mean-Shift, etc.

Clustering tries to find objects with similar features and group them into a cluster; objects that share multiple features are joined in a class. With algorithms like K-Means clustering, you can even specify the number of classes you want.
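As a quick illustration (the toy 2-D data, the choice of k=3, and scikit-learn itself are my assumptions here, not something from the article), a minimal K-Means sketch looks like this:

```python
# Minimal K-Means sketch: generate 3 blobs of 2-D points,
# then ask K-Means to recover 3 clusters from the unlabeled data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # cluster index (0, 1, or 2) per point

print(kmeans.cluster_centers_.shape)  # one 2-D center per cluster
```

Note that `n_clusters=3` is exactly the "specify the number of classes" knob mentioned above.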

For example, FiveStar Fitness (a local fitness center in Prishtina, Kosova) has multiple gyms where thousands of people go to exercise. With the members it already has, it could use clustering algorithms to do market segmentation and identify the specific types of people that attend its gyms.

Insights like the times the gyms are most frequented, the most popular locations, and underlying patterns in the background, age, and gender of the people who use them would help the company understand its clientele better. With this information it could run targeted ads for one particular group and make offers that attract a specific type of audience (obviously generating more €€€). This is one way it could extract value from current unsupervised learning algorithms.

Another example of clustering you might have seen is in Apple or Google Photos. The application tries to find all the faces of your friends in the large pool of pictures on your phone, and then clusters the faces that look similar to the algorithm. The algorithm hasn't seen those faces before, and it doesn't know what your friends look like or who they are, but it can still cluster faces with similar facial features.

Another practical use of unsupervised learning algorithms is image compression. When saving a PNG file, you can set the palette to 32 (or fewer) colors. In other words, you cluster the pixels of your image to those 32 colors; the algorithm finds, say, all the blueish colors in your image, calculates the 'average blue', and assigns it to all the blue pixels. Fewer colors → lower file sizes → more space for other stuff → more $$$.
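A rough sketch of that palette idea, assuming scikit-learn and using a random array as a stand-in for a real photo:

```python
# Palette reduction via K-Means: cluster all pixel colors,
# then replace each pixel with its cluster's average color.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3))  # fake 64x64 RGB image
pixels = image.reshape(-1, 3).astype(float)     # one row per pixel

kmeans = KMeans(n_clusters=32, n_init=4, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_               # the 32 "average" colors
quantized = palette[kmeans.labels_].reshape(image.shape)
```

The quantized image has the same shape as the original but contains at most 32 distinct colors — that restricted palette is where the file-size savings come from.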

Say we wanted to place simit kiosks in the optimal spots in the city using K-Means clustering. First, we would place them randomly in different parts of the city and watch how buyers interact with them. The next day, we would move each kiosk to the center of its buyers' interest. Watch and move a few more times, until each simit kiosk has found its optimal place.
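The kiosk story is exactly the K-Means loop: place k centers at random, assign each buyer to the nearest kiosk, move each kiosk to the mean of its buyers, and repeat. A bare NumPy sketch (the buyer locations and k=3 are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
buyers = rng.normal(size=(200, 2))   # buyer locations around the city
k = 3
# day one: drop the kiosks at random buyer locations
kiosks = buyers[rng.choice(len(buyers), k, replace=False)]

for _ in range(10):                  # "watch and move" a few more times
    # each buyer goes to the closest kiosk
    dists = np.linalg.norm(buyers[:, None] - kiosks[None, :], axis=2)
    nearest = dists.argmin(axis=1)
    # each kiosk moves to the center of its buyers (stays put if it has none)
    kiosks = np.array([buyers[nearest == j].mean(axis=0)
                       if (nearest == j).any() else kiosks[j]
                       for j in range(k)])
```

After a few iterations the kiosk positions stop moving much — that is K-Means converging.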

Unfortunately, in real life not all cluster members are grouped in a circular fashion, and sometimes we can't know beforehand how many clusters will form. That's where DBSCAN comes into play. A nice bonus of using it is that it detects outliers (data points that don't fit in any cluster) for you.
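A minimal DBSCAN sketch with scikit-learn; note that you don't pass the number of clusters at all, and points that fit nowhere get the label -1 (the `eps` and `min_samples` values are guesses for this toy data):

```python
# DBSCAN on two non-circular "moon" clusters plus one obvious outlier.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0]]])   # a point far from both moons

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(labels[-1])                  # -1: the far-away point is flagged as noise
```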

Courtesy of David Sheehan

Dimensionality Reduction

Groups specific features into more high-level ones.

Widely used for:

  • Recommender systems
  • Beautiful visualizations
  • Fake image analysis
  • Risk management

The most used algorithms for dimensionality reduction are Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA, pLSA, GLSA), and t-SNE (for visualization).

When handling a small dataset with 20 rows and 2 columns, things are easy. You can perform simple calculations, analyze the data, understand the impact one variable has on the other, and on top of everything, you can visualize it. However, what happens when you have a large dataset?

Say you have a dataset with ~200,000 rows and 1,300 features. Due to its size, calculations will take longer, and it will be pretty difficult to understand the dataset and the correlations between variables. Visualization? Good luck visualizing data in more than three dimensions (unless you have temporal, time-related data — you can visualize that in animations across time).

Courtesy of kassambara

Here's where dimensionality reduction comes into play. These algorithms take data living in an m-dimensional space (m = 1,300 in our example) and try to learn new dimensions that represent it better. From this new learned representation, it is easy to select a number of dimensions (obviously lower than m) and proceed to explore and visualize the data as desired.
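A minimal PCA sketch along these lines, assuming scikit-learn; the random 1,300-feature matrix is only a placeholder for real high-dimensional data:

```python
# Project 1,300-dimensional data down to 2 learned dimensions for plotting.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1300))   # 500 samples, m = 1,300 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)        # same rows, only 2 columns

print(X_2d.shape)                  # (500, 2)
```

Those two columns are the principal components — the directions that capture the most variance in the original data.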

One interesting application of dimensionality reduction comes from a study group at UCLA, who wanted to understand the variation of the human genome (a genome is all the genetic material of an organism) across a large geographical scale. They gathered small blood samples from 1,300 people across Europe and examined single-letter differences in DNA ("single nucleotide polymorphisms", or SNPs) at about 200,000 places in each genome. With this data, they analyzed how the genome related to a person's geographic location.

They used Principal Component Analysis (PCA) as their dimensionality reduction algorithm to look for patterns in the massive SNP data and reduce that number to two variables, known as principal components. A two-variable dataset can be easily plotted and, therefore, better understood. This is the startling result they got:

Courtesy of John Novembre

The genome information and the geographical map overlapped to an amazing degree (here’s a link if you want to read more about it — Genes Mirror Geography Within Europe).

We could try to analyze and understand the raw numbers, but with such a huge number of columns (>200,000) it would be difficult for a data scientist or machine learning engineer to get any valuable insight from the data. In such cases, it is much more convenient to use abstractions. For example, take an animal that has these features:

C = [lives in trees, is slow, no ears, ability to change color, eats insects, …]

We can classify this animal as a "chameleon". It is much simpler to group the features into this new abstraction than to keep a tediously long list of features and try to classify it into animals. The abstraction might lose some information, but it is much more convenient for naming and explaining purposes. In addition, these abstracted models learn faster and tend to overfit less.

Recommendation Systems and Collaborative Filtering

Recommendation systems are another application of dimensionality reduction, aiming to predict the preference a user would give to an item. This is basically what YouTube does in the 'Up next' list you see to your right, what Netflix shows on its homepage, and what Amazon's 'you might be interested in this as well' feature recommends. Have you ever wondered how and why these algorithms are so accurate? Well, they use your past watched videos/series/bought items to predict what you might be interested in next.

At a low level, the algorithm does not really understand what you like.

For example, it’s less like this:

Bardh likes Latin music; play Marc Anthony — Celos y Envidia next (it's a phenomenal song, by the way).

And more like this:

User Bardh Rushiti likes songs that have this set of features = [x1, x2, x3, x4, …], and he will most likely like other songs that have those features as well.

The algorithm extracts these high-level features without really understanding them.
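One way to sketch that 'set of features = [x1, x2, …]' idea is to factor a tiny user-item rating matrix into latent features with a truncated SVD (the ratings below are invented, with 0 meaning 'not rated'):

```python
# Truncated SVD on a toy rating matrix: the k latent features are the
# [x1, x2, ...] the text talks about, and the reconstruction scores
# items the user has not rated yet.
import numpy as np

ratings = np.array([
    [5, 4, 0, 0],     # user 0
    [4, 5, 1, 0],     # user 1
    [0, 1, 5, 4],     # user 2
], dtype=float)

U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2                                            # keep 2 latent "taste" features
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # predicted scores

# user 0's best unrated item: highest predicted score where ratings == 0
unrated = np.where(ratings[0] == 0)[0]
best = unrated[approx[0, unrated].argmax()]
```

The algorithm never knows these latent features mean "Latin music"; they are just directions in the data that explain the ratings well.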

Association Rule Learning

Looks for patterns in large-scale product transactions.

Courtesy of Vas3k

Mainly used for:

  • Market basket analysis
  • Analyzing web surfing patterns
  • Forecasting sales and discounts
  • Finding the optimal way to arrange products in a shop

Popular algorithms: Apriori, Eclat, FP-Growth

Association rule learning is a method that covers everything from analyzing shopping carts to marketing strategy and other event-related tasks. In other words, it finds patterns in data that comes in the form of sequences. These patterns are commonly known as rules (hence the name).

For example, a customer buys minced beef, red onions, and ketchup; this customer will most likely buy burger buns as well. The customer buys a bunch of beers → this customer will probably buy peanuts as well. Should the market place the peanuts on the way to the beers? How often are products bought together? Can rearranging the location of products in the market increase the profit?
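Those questions boil down to two basic numbers, support and confidence, which algorithms like Apriori build on. A tiny pure-Python sketch on made-up baskets:

```python
# support(beer -> peanuts): how often both appear together in all baskets.
# confidence(beer -> peanuts): how often peanuts appear given beer is there.
baskets = [
    {"beer", "peanuts", "chips"},
    {"beer", "peanuts"},
    {"beer", "bread"},
    {"minced beef", "red onions", "ketchup", "burger buns"},
]

with_beer = [b for b in baskets if "beer" in b]
both = [b for b in with_beer if "peanuts" in b]

support = len(both) / len(baskets)        # 2/4 = 0.5
confidence = len(both) / len(with_beer)   # 2/3, so peanuts near the beer aisle
```

A rule like "beer → peanuts" is kept when both numbers clear thresholds you choose; the real algorithms just count these efficiently over millions of baskets.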

Similarly, these algorithms are widely used in the e-commerce industry to predict what the user will most likely buy next.

Here's an interesting read on the discoveries from a giant local chain of stores in Kosovo (*cough* NDA *cough*):

Those who bought Nutella — also bought this other salty product!

Sum up

Hope this article helped you get a clearer idea of unsupervised learning algorithms and how they are used in different industries. I'd love to hear your ideas on what you'd like to read next — let me know down below in the comment section!

You can always connect with me via LinkedIn.

Machine Learning Engineer | Innately curious about the world.