Fundamental Components of Machine Learning

In the following weeks I will share a series of articles that explain machine learning concepts. The approach is straightforward: real-world problems, simple language, and explanations that are easy to understand. This series is aimed at anyone and everyone, regardless of background. You’ll be able to follow along whether you’re an engineer or a manager.

The main idea behind this series is to spark some curiosity about machine learning and its great potential in Kosovo’s ICT community and among other newcomers. Right now, most of this community is focused on website and mobile app development, and considering the lack of digitalization in Kosovo’s market, that focus was necessary.

Moreover, I believe that at a certain point this market will become saturated and the need for new technologies will arise: technologies that help companies provide smarter, better, and more innovative products to their clients. Following global trends, I think that technology is artificial intelligence (AI), and more specifically machine learning (ML).

The sooner we start understanding and playing with ML algorithms, the bigger the leaps are going to be. This is my contribution to getting there.

Getting started with machine learning can be an overwhelming process. To make the start as easy as possible, I have shared the most useful resources that helped me and others on this journey. You can find the link here:

“Machine Learning is like sex in high school. Everyone is talking about it, a few know what to do, and only your teacher is doing it.” — vas3k

Here is Sabri. Sabri wants to buy an apartment in Prishtina. He tries to calculate how much money he needs to save on a monthly basis to be able to afford it. He talked with many people and learned that new apartments cost around 80,000€ for 80 m², 75,000€ for 70 m², 70,000€ for 60 m², and so on.

Sabri, being a phenomenal analyst, starts to see a pattern: an apartment’s price depends on its area and drops by 5,000€ for every 10 m² less, but it doesn’t seem to go lower than 50,000€.

In machine learning terms, Sabri came up with regression: he predicted the price based on previously known data. People do it all the time, for example when estimating a reasonable price for a used iPhone, or how many drinks to have at a party.
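
Sabri’s reasoning can be sketched in a few lines of Python. This is a minimal illustration, not a real pricing model: it fits a straight line through the three prices he heard using ordinary least squares, and then applies his observation that prices don’t seem to go below 50,000€. The function names `fit_line` and `predict_price` are made up for this example.

```python
# Apartment areas (m²) and prices (€) that Sabri heard about.
areas = [80, 70, 60]
prices = [80_000, 75_000, 70_000]


def fit_line(xs, ys):
    """Ordinary least squares for one variable: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(area, slope, intercept, floor=50_000):
    """Predict a price, never going below the floor Sabri noticed."""
    return max(floor, slope * area + intercept)


slope, intercept = fit_line(areas, prices)
print(slope)  # euros per extra square meter
print(predict_price(50, slope, intercept))  # estimate for a 50 m² apartment
```

With these three data points the fitted slope works out to 500€ per m², which is the same 5,000€ per 10 m² that Sabri spotted by eye.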

It would be great to have a nice simple formula, especially for the drinks at a party, but unfortunately, it’s impossible :(

The hard thing about real estate is that one has to take many variables into account: the number of rooms, the distance from the nearest hospital (or school, gym, park), the availability of heating (Termokos in Prishtina), seasonal spikes of interest, whether it has an available parking spot, and many other factors. For Sabri, it’s hard to keep all that data in his head while calculating the price. That’s true for all of us.

Machines do a much better job of taking a large number of factors into account and keeping the data in their ‘head’. If we give them that data and ask them to find hidden patterns, they do an amazing job.

That was the birth of machine learning (ML).

The essential components of machine learning

The greater the variety of data you have, the easier it is to find underlying patterns and predict results.

In order to teach the machine, one needs three things:

Data

Want to classify images? Predict the stock market? Find user preferences? Parse activities on Facebook? Make movie recommendations? All of these start with data. As I said earlier, the more diverse the data, the easier it is to find patterns.

There are two ways to gather data (that is, to create a dataset): manually and automatically.

  • Manual data acquisition has fewer errors, and is more tailored to the specific problem; however, it takes a lot of time to amass the dataset.
  • In automatic data acquisition, you gather anything you can get your hands on. As you might expect, it takes less time and is cheaper, but the data has more errors.

Remember when reCAPTCHA asks you “Which pictures contain buses?” That is data labeling, and you are the one doing it. Free labor, baby. This way, you’re helping Google’s image recognition algorithms get labeled data for free.

It’s very hard to collect good data, and labeling it is a tedious process. Interestingly, some companies will even make their algorithms public, but never their data. It’s that precious!

Features

Features are the properties the machine looks at to make its prediction, whether that means estimating a price or classifying samples into categories. In our example, these can be the number of rooms, the availability of heating, seasonal spikes of interest, and many more.

When the data is stored in tabular form, it’s easy: the features are the columns. However, it gets a bit more complicated when the data comes as 10 GB of images of cats and dogs. Not every pixel can be taken as a feature, because that would simply be too much to handle. Finding the features is usually the hardest part of an ML engineer’s job. The reason is that we humans are biased creatures. We tend to choose only the features we find important or know best.
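
To make the contrast concrete, here is a small sketch. The apartment rows and their column names are invented for illustration, and the 224 x 224 image size is just an assumed example, not something taken from a specific dataset.

```python
# Tabular data: every column except the target is a feature.
# These rows and column names are made up for illustration.
apartments = [
    {"area_m2": 80, "rooms": 3, "has_parking": True, "price_eur": 80_000},
    {"area_m2": 70, "rooms": 2, "has_parking": False, "price_eur": 75_000},
]
target = "price_eur"
features = [column for column in apartments[0] if column != target]
print(features)

# Image data: if every pixel value were a feature, the count explodes.
# Assumed example: a 224 x 224 color image with 3 channels (red, green, blue).
width, height, channels = 224, 224, 3
pixel_features = width * height * channels
print(pixel_features)
```

Three columns are easy to reason about. A single small image treated pixel by pixel already gives more than a hundred thousand values, which is why choosing or extracting good features matters so much.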

Note: Please avoid being human when working with machines. Thank you :|

Algorithm

This is the engine you run to find patterns in your data, and the right choice depends on the problem you’re trying to solve. The accuracy, performance, and size of your model all depend on the choice of algorithm.

It is useful to keep in mind that, regardless of how powerful the learning algorithm is, it cannot make good predictions if it doesn’t have good data. The well-known phrase goes like this: “Garbage in = garbage out”. Labeled, well-structured, and representative data is the first and most important step in creating a functional, practical, and useful machine learning system.
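
A tiny experiment shows this in action. The sketch below fits the same straight line twice with ordinary least squares: once on Sabri’s three prices, and once after a hypothetical data-entry mistake adds an extra zero to one of them. The helper `fit_line` and the corrupted value are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one variable: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


areas = [80, 70, 60]
clean_prices = [80_000, 75_000, 70_000]
# Hypothetical typo: 75,000 was entered as 750,000.
dirty_prices = [80_000, 750_000, 70_000]

clean_slope, clean_intercept = fit_line(areas, clean_prices)
dirty_slope, dirty_intercept = fit_line(areas, dirty_prices)

# Ask both models the same question: what does a 65 m² apartment cost?
clean_prediction = clean_slope * 65 + clean_intercept
dirty_prediction = dirty_slope * 65 + dirty_intercept
print(clean_prediction, dirty_prediction)
```

The algorithm is identical in both runs. Only the data changed, yet the model fitted on the corrupted data predicts a price several times higher.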

Learning != Intelligence (Learning is not intelligence)

Here’s the big picture:

  • Artificial intelligence is a large field, like mathematics or physics, that enables machines to imitate human behavior. It covers the broad area of computer science concerned with machine perception, logic, and learning.

Note: Even classical programming can imitate human behavior and decisions, and can therefore be considered artificial intelligence.

  • Machine learning, a sub-field of artificial intelligence, uses statistical methods to improve with experience (new incoming data). It is an important part of AI, but definitely not the only one.
  • Neural networks, a sub-field of machine learning, are a layer-based technique for recognizing patterns. They are loosely inspired by how the human brain works.
  • Deep learning is a modern way of using neural networks. Essentially, it means building networks with many layers. Nowadays, we use the terms neural networks and deep learning interchangeably, and we even use the same libraries for both.
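
The difference between classical programming and machine learning can be sketched with a toy example. Both approaches below decide whether an apartment is “expensive”. The first uses a threshold a programmer typed in; the second derives its threshold from the prices it has seen, so its answers change as new data arrives. The names, the 70,000€ threshold, and the choice of the average price as the learned threshold are all assumptions picked for illustration.

```python
from statistics import mean


def is_expensive_rule(price):
    """Classical programming: the rule is fixed by the programmer."""
    return price > 70_000


class LearnedThreshold:
    """Toy 'learning': the threshold is the average of the prices seen so far."""

    def __init__(self):
        self.prices = []

    def observe(self, price):
        self.prices.append(price)

    def is_expensive(self, price):
        return price > mean(self.prices)


model = LearnedThreshold()
for price in [80_000, 75_000, 70_000]:
    model.observe(price)
print(model.is_expensive(76_000))  # judged against the first three prices

# New data arrives: prices in the market went up.
for price in [95_000, 100_000]:
    model.observe(price)
print(model.is_expensive(76_000))  # same question, updated threshold
```

The hand-written rule gives the same answer forever, while the learned threshold moves with the data. Real machine learning models are far more sophisticated, but the idea of improving with experience is the same.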

We will use these concepts to explain more complex ideas in the following posts.

Sum up

You can always connect with me via LinkedIn.
