Oct 18, 2022

Is human perception a reinforcement learning problem?

Why do we perceive things in the way that we do? Why are some things difficult, and some things easy? Why are we so enamored with some activities, so focused and driven that shutting out the noise around us and pressing forward relentlessly feels natural, while we are often simultaneously incapable of initiating trivially easy tasks throughout our days?

Formalizing Our Expectations

My supposition is that all of these behaviors are intimately linked to the way our brains construct the expected value of activities. Consider this formulation of the value of undertaking an activity: given complete knowledge of the environment we are in and a set of possible actions we can take, we should be informed enough to rank these actions and decide which one to take. (Of course, having truly complete knowledge of our environment isn’t feasible; more on this later.)

The environment in question could be your bedroom on a morning before work, where you could do any combination of:

A) take a shower

B) brush your teeth

C) stay in bed

D) leave for work

If it’s Monday morning, the set of actions you choose might be to shower, brush your teeth, then leave for work. On a Sunday, however, you might choose only to stay in bed. Or perhaps you feel sick on that same Monday morning, and you also choose to remain in bed. Each and every variable that comes together to construct your environment can play a key role in influencing the desirability of actions.

Now, let’s henceforth refer to our position in our environment as our state. At the beginning of our decision-making process we start in a state and must decide what action to take. This action will affect us or our environment in some way, causing a state transition. So, how desirable is it to be in the new state that we’ve transitioned to? That is the criterion we used to rank the aforementioned set of possible actions.
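As a rough sketch of that ranking step, the snippet below scores each morning action by the value of the state it leads to. The state labels and every number are made up for illustration; only the four actions come from the example above.

```python
# Hypothetical value of the state each action leads to, per starting state.
# These numbers are illustrative assumptions, not measurements.
STATE_VALUES = {
    ("monday", "healthy"): {
        "take a shower": 3.0,
        "brush your teeth": 2.0,
        "stay in bed": -5.0,
        "leave for work": 4.0,
    },
    ("sunday", "healthy"): {
        "take a shower": 1.0,
        "brush your teeth": 2.0,
        "stay in bed": 5.0,
        "leave for work": -3.0,
    },
    ("monday", "sick"): {
        "take a shower": 0.5,
        "brush your teeth": 1.0,
        "stay in bed": 6.0,
        "leave for work": -4.0,
    },
}


def rank_actions(state):
    """Return the actions for a state, most desirable first."""
    values = STATE_VALUES[state]
    return sorted(values, key=values.get, reverse=True)


for state in STATE_VALUES:
    print(state, "->", rank_actions(state))
```

Changing a single variable of the state (the day, or whether you feel sick) reorders the whole list, which is the point of the morning example.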

A "gridworld" of possible states/actions and their values

The method that we use to generate these ranking criteria, or to implicitly score the value of being in a certain state with reference to the state we were previously in, as well as all of the states we expect to follow, is known in the field of Reinforcement Learning as the Value Function.

Value Function

The above is the standard Value Function, with respect to a given policy, or ruleset by which to act, π. The formula can be understood in plain English as “If at every subsequent state (state S[i]) I take the action that the policy instructs, this is the total value of the full set of actions played out until a certain point in time (T).”

A policy that instructs us to always take the action that seems best at the current moment is known as a greedy policy. Imagine you didn’t get out of bed even though you had to go to work to earn money to live… that would be very shortsighted, or “greedy.” To keep this from always happening, the formula includes γ, the discount, which sets how much future rewards count relative to immediate ones. We’ll talk more about discount in a bit.
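As a sketch, one common finite-horizon way to write a value function consistent with this description is shown here. The per-step reward r(s_i, a_i) and the start state s_0 are notation assumed for illustration and may differ from the original figure.

```latex
V^{\pi}(s_0) = \mathbb{E}_{\pi}\left[\, \sum_{i=0}^{T} \gamma^{i}\, r(s_i, a_i) \right],
\qquad a_i = \pi(s_i)
```

Here γ is the discount discussed later, and the expectation averages over whatever randomness the environment has.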

Optimal Value Function

Now, if we were to observe all possible policies to act under, and what value they result in, choosing the policy with the maximum value would net us the optimal value function, implying we also have the optimal policy.
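To make “observe all possible policies” concrete, here is a minimal brute-force sketch on a toy version of the morning example. The states, rewards, discount, and horizon are all hypothetical, and enumerating every policy only works for problems this small.

```python
import itertools

# Hypothetical toy environment: (state, action) -> (next_state, reward).
TRANSITIONS = {
    ("in_bed", "stay"): ("in_bed", 1.0),
    ("in_bed", "get_ready"): ("ready", -1.0),
    ("ready", "stay"): ("ready", 0.0),
    ("ready", "go_to_work"): ("at_work", 10.0),
    ("at_work", "stay"): ("at_work", 0.0),
}
ACTIONS = {
    "in_bed": ["stay", "get_ready"],
    "ready": ["stay", "go_to_work"],
    "at_work": ["stay"],
}


def policy_value(policy, start="in_bed", gamma=0.9, horizon=5):
    """Discounted sum of rewards from following `policy` for `horizon` steps."""
    state, total = start, 0.0
    for step in range(horizon):
        next_state, reward = TRANSITIONS[(state, policy[state])]
        total += gamma ** step * reward
        state = next_state
    return total


def all_policies():
    """Yield every deterministic policy: one action chosen per state."""
    states = list(ACTIONS)
    for choice in itertools.product(*(ACTIONS[s] for s in states)):
        yield dict(zip(states, choice))


# The policy with the maximum value is optimal for this toy problem.
best_policy = max(all_policies(), key=policy_value)
best_value = policy_value(best_policy)
print(best_policy, best_value)
```

In this toy setup, staying in bed earns a small reward every step, yet the policy that pays the up-front cost of getting ready scores higher overall.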

Environmental Understanding as an Analogy for Perspective

Now, let’s explore the case where we do not have complete environmental understanding. Leveraging the previous example, let’s say that it’s a Monday and we’re expected to go into work. We wake up, choose the actions that seem appropriate to prepare ourselves for a workday, and head into the office. What we failed to realize is that our coworker had contracted Covid over the weekend. Had we known this, perhaps we would have:

A) encouraged our coworker to stay home

B) decided to work from home ourselves

C) reported their symptoms to our manager

All of these actions have pros and cons. The last is perhaps the most effective but the least favorable to your relationship with your coworker. The middle one is the most neutral in the short term, but unfavorable in the event that the whole office contracts Covid.

Regardless of the solution you would have chosen, you did not have this information at the outset. You are optimizing your sequence of actions in an incompletely explained environment. The information that you do have about the environment is an analog for your perspective as an individual. As new information is introduced to you, you update your model of the world. Sometimes revolutionary information falls into your lap and you can’t help but change the way you think or believe.

However, oftentimes in order to gain the most value-rich information, you must seek specific information in an intentional and pointed manner. In this way, a tradeoff forms when we are capable of taking a mixture of actions that skew toward gathering information, and actions that skew toward maximizing our value with the information we already have present.

In Reinforcement Learning, this is the Exploration vs. Exploitation Tradeoff. Humans tend to naturally modulate the emphasis they put on exploration over time. A novice craftsman might eagerly observe their mentor’s every move, expending tremendous energy to note any possibly relevant methods or details. The mentor, set in their ways, moves through a nigh predetermined set of actions honed through years of experience, unlikely to reach outside their current understanding of the best practices.

An additional benefit of this formulation is that the value of diverse perspectives in decision making becomes obvious. A group with homogeneous perspectives needs about as many exploratory actions to find the highest-value actions as a single actor does. A group built from actors with diverse perspectives naturally has more complete knowledge of the environment (more coverage of the problem space) and thus can take “exploitative” actions with much higher confidence that no undiscovered higher-value actions remain to be found.
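One simple way to model this shift from novice to mentor is an epsilon-greedy rule whose exploration rate decays with experience. This is a swapped-in textbook technique, not something the analogy depends on, and the schedule, payoffs, and function names below are hypothetical.

```python
import random


def epsilon_at(step, start=1.0, end=0.05, decay=0.99):
    """Exploration rate that shrinks with experience (hypothetical schedule)."""
    return max(end, start * decay ** step)


def choose_action(estimates, epsilon, rng):
    """Explore at random with probability epsilon, else exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda a: estimates[a])


def run_bandit(true_means, steps=2000, seed=0):
    """Learn payoff estimates for each option while exploring less over time."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for step in range(steps):
        action = choose_action(estimates, epsilon_at(step), rng)
        reward = rng.gauss(true_means[action], 1.0)  # noisy payoff
        counts[action] += 1
        # Incremental running average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_bandit([0.2, 0.5, 1.5])
print("estimated payoffs:", [round(e, 2) for e in estimates])
print("times each option was tried:", counts)
```

Early on the learner behaves like the novice and tries everything; as the exploration rate falls toward its floor it behaves like the mentor and mostly repeats what it already believes is best.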

Rewards Now or Later?

As mentioned before, always choosing the action that results in the most immediate reward is known as being “greedy.” A greedy policy will always choose the thing that looks best with no thought to the future. However, real life isn’t like that. As people grow and mature they tend to increase their consideration of long term goals and the associated rewards.

However, “a bird in the hand is worth two in the bush.” We know with near 100% certainty that going to work will get us paid for the day. We can be far less certain that, if we quit our job to focus on writing, we will actually make the effort to take some classes, write a few short stories, get them published in a small anthology, write 3 novels that all flop, almost give up, get encouraged by a friend to press on, and write a NYT Best Seller. Any perturbation of any one link in that probability chain would likely result in a wildly different outcome than the optimal one. Thus, we must adopt a way of devaluing pairs of state and action that are many steps down the road and thus far less likely to happen.

Thus we have the discount, a factor by which we devalue the expected reward of actions according to the number of steps they are away from the current state. That’s γ in the formula we looked at earlier for finding the value of a chain of state-action pairs. Discount factors very close to 1 result in a policy that greatly values long-term goals; the closer the factor is to zero, the greedier the policy becomes. Imagine your plan has 10 steps: .99¹⁰ is still about .9, or 90% of the reward you’d get if your goal were right in front of you, whereas .9¹⁰ is only about .35. The Value Function is really neat because it wraps the probability of the end goal being actualized and the reward of the goal into a single term (think expected value in statistics), which allows us to compare policies (or ways of life, in our analogies) directly to one another.
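The arithmetic for the 10-step plan can be checked in a few lines. The helper name `discounted_weight` is just an illustrative label.

```python
def discounted_weight(gamma, steps):
    """Fraction of a reward's value that survives `steps` steps of discounting."""
    return gamma ** steps


for gamma in (0.99, 0.9):
    weight = discounted_weight(gamma, 10)
    print(f"gamma={gamma}: a reward 10 steps away keeps {weight:.3f} of its value")
# gamma=0.99: a reward 10 steps away keeps 0.904 of its value
# gamma=0.9: a reward 10 steps away keeps 0.349 of its value
```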

It seems, however, that humans often operate with a discount term that is too close to zero for their own good. Any reward that is immediately available is:

A) likely to be far less valuable due to principles of scarcity. If everyone could get an amazing reward with little effort that level of reward would become the standard and thus less valuable.

B) likely to be extremely “exploitative,” in that you take what’s right in front of you and don’t “explore” possibilities. Taking the immediately rewarding option means that you will not experience new perspectives and will thus remain ignorant of opportunities outside of your cycle of reward.

Neglecting to enter states and complete actions that would in the long term benefit you greatly results in a lot of value lost. For instance, neglecting to brush your teeth every day because the immediate reward is so small results in having no teeth a few years later.
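The tooth-brushing case can be sketched with two reward streams compared under different discount factors. Every number here is an assumption chosen for illustration: a small nightly cost for brushing, and one large penalty after 1,000 nights of skipping.

```python
def discounted_value(rewards, gamma):
    """Discounted sum of a sequence of per-step rewards."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))


NIGHTS = 1000  # hypothetical: roughly "a few years later"

# Brushing costs a little effort every night and avoids the penalty.
brush = [-0.1] * (NIGHTS + 1)
# Skipping costs nothing nightly, then the teeth are gone.
skip = [0.0] * NIGHTS + [-1000.0]


def preferred(gamma):
    """Which habit looks better to a decision maker with this discount factor."""
    if discounted_value(brush, gamma) > discounted_value(skip, gamma):
        return "brush"
    return "skip"


for gamma in (0.9, 0.999):
    print(f"gamma={gamma}: prefers to {preferred(gamma)}")
```

Under these assumed numbers, the short-sighted discount makes the distant penalty nearly invisible, so skipping looks better; a discount close to 1 keeps the penalty in view and brushing wins.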

Negative Reward Accumulation

A better way to formulate these things is likely not “value lost” but, more precisely, “negative value gained.” As we neglect actions that would alleviate negative consequences in the long run, and instead pursue this greedy approach to living, we accumulate this negative value: not brushing your teeth at night because your bed looks so comfy, partying instead of studying because what’s 1 failed test, staying late at the office because the company that doesn’t even know you exist needs to make this deadline.

As we conduct our lives and this accumulation eventually moves into the horizon that we perceive (where our discount factor no longer makes the impact “not yet worth worrying about”), we begin to realize our negligence. Your teeth are falling out; it’s too late to save your grades and you fail out of college; your son is now 18 and you realize you didn’t see him grow up.

Had we tuned our discount term to consider reward outside of the immediate, perhaps we would have avoided these pitfalls. Our societal tendency toward these behaviors gives rise to common adages such as “too little too late,” and we lament “if only I had considered *insert problem here*.”

Mismatch between Logical Expected Reward and Chemical Reward Response

It’s easier than it’s ever been to receive stimulus that our brains perceive as reward. As I mentioned earlier, if all reward were easily gained it would become less valuable, and that is the current problem we experience with motivation. Why learn something new when you can watch an endless stream of funny videos on your phone, or project yourself onto your favorite influencer living a life you wish you had?

We are drinking from the firehose of reward; at every turn is an opportunity to experience a wave of dopamine. This creates a dissonance between what we know logically would make us happy and what the chemicals in our brain are signaling. Thus, we internalize an aversion to doing hard things with high reward, or easy things with middling reward, all in favor of actions that carry high chemical reward but in the longer term yield no progress toward self-actualization.

But what if we took actions more in line with our logical reward systems, more in line with our biology? What if we trained for that marathon? What if we finished writing that article? What if we rejected the endless stream of unsubstantiated chemical rewards?