Reinforcement Learning

Bardh Rushiti
4 min read · Nov 27, 2020

In the previous articles, I explained the underlying distinction between supervised and unsupervised learning: the presence or absence of labeled data (a teacher). In reinforcement learning (RL) there is no supervisor that tells the algorithm whether a decision was good or bad; instead, there is a reward signal that nudges the algorithm in the desired direction.

Courtesy of David Silver

Reinforcement learning, as a field of study, sits at the overlap of many others. As shown in the diagram, it draws on everything from computer science to economics to neuroscience, and it studies how to approach and solve reward-based problems.

This type of learning algorithm can be applied to a myriad of problems. For example:

  • Perform helicopter stunts (for reference, here’s how Red Bull Helicopter Aerobatics performs stunts)
  • Play board games like Backgammon and Go beyond human performance (there’s a whole documentary on Netflix, AlphaGo, the story of reinforcement learning beating the human world champion at the game of Go) *
  • Manage investments ($)
  • Walk like a humanoid
  • Play Atari games beyond human level
  • Play hide and seek *

* = must watch

Think about this: if we wanted to create an algorithm that performed helicopter stunts, how would we label the data? Define patterns in a set of images and ask the model to learn from those? Remember, we’re asking the RL algorithm to actually perform the stunts itself, rather than classify or detect them.

If approached with labeled data, it’s hard to even set the problem up in the first place. Instead, we build a model of the helicopter and an environment it can play in; we reward the model every time the stunt is performed correctly and punish it every time it crashes (we don’t want the helicopter to crash in real life). After a number of iterations, slowly but surely, the model learns to perform the required stunts.
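To make the reward/punishment idea concrete, here is a minimal sketch in Python of what such a reward signal could look like. The state fields and the numbers are hypothetical placeholders, not taken from any real helicopter simulator:

# A hypothetical reward signal for the helicopter example (all values are made up).
def reward(state) -> float:
    if state["crashed"]:
        return -100.0   # heavy punishment: we never want the helicopter to crash
    if state["stunt_completed"]:
        return 10.0     # reward for performing the stunt correctly
    return -0.1         # small cost per time step, nudging the model toward progress

The exact numbers matter less than their ordering: crashing must be far worse than merely wasting time, and completing the stunt must be clearly better than both.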

By definition, RL tries to select actions that maximize future rewards. In particular, the algorithm needs to learn how to plan ahead, because some actions only produce rewards later on, and it often has to sacrifice immediate short-term reward to acquire long-term reward (sound familiar?). For example (a small calculation after these examples makes this concrete):

  • Making some financial investments now (negative reward since the money is being spent), for greater returns later
  • Sacrificing helicopter flight time to refuel, so it can run longer later
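One common way to formalize this trade-off is the discounted return: the rewards an action leads to are summed up, with a discount factor gamma shrinking rewards that arrive further in the future. A small illustrative calculation (the reward numbers are made up):

# Discounted return: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# "Invest now, profit later": an immediate negative reward followed by a larger payoff.
print(discounted_return([-10, 0, 0, 50]))   # about 38.5, so the sacrifice pays off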

How was this invented?

RL was inspired by how animals on Earth learn. The agent is the intelligent entity that interacts with the environment. It gets an observation Ot from the environment, uses the information it has gathered so far to take an action At, and in return receives a reward or punishment Rt.
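The interaction loop described above (observe Ot, act At, receive Rt) can be sketched in a few lines. The toy environment and agent below are placeholders I made up; the point is the shape of the loop, which looks roughly the same in most RL setups:

import random

class ToyEnv:
    # A made-up environment: observations are random digits, and the agent is
    # rewarded when its action matches the parity of the current observation.
    def reset(self):
        self.obs = random.randint(0, 9)
        return self.obs                                  # initial observation O1
    def step(self, action):
        rew = 1.0 if action == self.obs % 2 else -1.0    # reward or punishment Rt
        self.obs = random.randint(0, 9)                  # next observation
        done = random.random() < 0.1                     # episode ends at random
        return self.obs, rew, done

class RandomAgent:
    def act(self, obs):
        return random.choice([0, 1])                     # At: decide based on what it sees
    def learn(self, obs, action, reward):
        pass                                             # a real agent would update itself here

env, agent = ToyEnv(), RandomAgent()
obs, done = env.reset(), False
while not done:
    action = agent.act(obs)                              # agent picks an action At from Ot
    obs, reward, done = env.step(action)                 # environment returns Ot+1 and Rt+1
    agent.learn(obs, action, reward)                     # the reward signal drives learning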

Courtesy of David Silver

An agent’s goals can be intermediate, final, or time-based. For intermediate and time-based goals the reward system is more straightforward, in the sense that the model receives rewards or punishments after one or a few actions. For final goals, however, the model only receives its reward or punishment at the end of its interaction with the environment, so determining which sequence of actions produced the desired output is ambiguous.

For example, an average chess game lasts approximately 40 moves (oddly, around 277 moves in online tournaments). While training an agent to play chess, the agent builds a history Ht (the sequence of observations, actions, and rewards) of 40 observations and 40 actions, and at the end it receives a single reward for the game (1 for winning, 0 for losing).

Ht = [O1, A1, O2, A2, …, O39, A39, O40, A40, R40]

This is how an RL algorithm sees the game. Sometimes the actions are not directly associated with the reward, and this ambiguity is what pushes the algorithm to develop its own ways of understanding the environment and interacting with it.
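As a rough sketch of why this is ambiguous, here is what that history and the end-of-game reward look like in code. The single final reward is all the algorithm has to judge 40 moves by (the observation/action strings are just placeholders):

# History for one chess game: 40 observations and actions, one reward at the very end.
history = [(f"O{t}", f"A{t}") for t in range(1, 41)]
final_reward = 1   # 1 for a win, 0 for a loss

# The crudest possible credit assignment: every move shares the final outcome.
# Nothing in this data says which of the 40 moves actually won (or lost) the game.
move_returns = [final_reward for _ in history]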

How is this different from supervised & unsupervised learning?

Even though they share some similarities with supervised and unsupervised learning, RL algorithms have their own intricacies.

  • Instead of labeled data (a supervisor), these algorithms learn through trial and error and improve using the received reward signal
  • Feedback can be delayed or instantaneous
  • Time matters. The data is sequential, and what happens at one time step directly affects the actions and events that follow.
  • The ‘agent’ in RL gets to interact with the environment (data)

Reinforcement learning is, I believe, the closest thing to real artificial intelligence, because the machine actually learns and creates its own systems and rules (sometimes very different from the approaches we humans would take) to solve the problem at hand. The field is advancing with incredible speed right now; let’s see how the future looks with algorithms like this.

Side note:

I found a Reinforcement Learning course from DeepMind lectured by the creator of AlphaGo (the algorithm that beat the world champion at Go). If you found this article interesting, I suggest you check it out.

Sum up

I hope this article helped you get a clear general idea of reinforcement learning algorithms and how they are used in different industries. I’d love to hear your ideas on what you’d like to read next; let me know down below in the comment section!

You can always connect with me via LinkedIn.

