Multi-Armed Bandits (and wireless channels)

The term "bandit" comes from the one-armed bandit, a nickname for slot machines. You might have seen them in casinos, where you pull a lever and hope to get a reward. They look like this:

A one-armed bandit (slot machine)
By Ronald Saunders from Warrington, UK - Own work, CC BY-SA 4.0, via Wikimedia Commons

Imagine you're at a casino with multiple slot machines. Each machine has its own hidden probability of paying out. You have limited time and money, so each pull is important. Do you:

  1. Stick with the machine that has paid out well so far, or
  2. Try the other machines, in case one of them pays out even more?

This is the essence of the exploration-exploitation dilemma. Every time you pull a lever, you're both learning about that machine's payout pattern (the underlying reward distribution, so to speak) AND trying to win money (maximize reward).

In the simplest case with a single machine, we can define our problem mathematically. At each time step $t$:

  1. We pull the lever (action $a_t$) and observe a reward $r_t$
  2. We update our estimate $\hat{q}_t(a)$ of the machine's true expected reward $q^*(a)$

Note

Since we only have one arm here, the $a$ is just a placeholder: $q^*(a)$ and $\hat{q}_t(a)$ are equivalent to $q^*$ and $\hat{q}_t$.

$q^*(a)$ is the true (but unknown) expected reward - in our slot machine example, this would be the machine's programmed payout rate. We never actually know this value; we can only estimate it through repeated plays.

Our estimate $\hat{q}_t(a)$ is calculated in real-time as we play, typically as the average of our observed rewards:

$$\hat{q}_t(a) = \frac{1}{n_t(a)} \sum_{s=1}^t r_s \cdot \mathbf{1}[a_s = a]$$

where $n_t(a)$ is the number of times we've pulled arm $a$ up to time $t$, and $\mathbf{1}[a_s = a]$ is 1 if we pulled arm $a$ at time $s$ and 0 otherwise.
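To make this concrete, here is a minimal Python sketch (my own illustration, not code from this write-up) that computes $\hat{q}_t(a)$ from a history of pulls and rewards:

```python
import numpy as np

def sample_average_estimate(actions, rewards, arm):
    """Average of the rewards observed on `arm`, i.e. the sample-mean estimate of q*(arm)."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    mask = actions == arm          # the indicator 1[a_s = arm]
    if mask.sum() == 0:
        return 0.0                 # no data yet for this arm
    return rewards[mask].sum() / mask.sum()

# Example: a single machine (arm 0) pulled five times
print(sample_average_estimate([0, 0, 0, 0, 0], [1, 0, 0, 1, 1], arm=0))  # 0.6
```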

Multi-Armed Bandits

Now, let's extend this to multiple bandits. Instead of one machine, we have $K$ different machines (or arms). At each time step $t$:

  1. We select one arm $a_t \in \{1, \dots, K\}$ to pull
  2. We receive a reward $r_t$ drawn from that arm's (unknown) reward distribution, with expected value $q^*(a_t)$

The challenge is an extension of the single-arm bandit problem, but now involves a lot more complexity: how do we balance exploring all $K$ arms to learn their true reward distributions while exploiting the arms that seem to give the best rewards so far? For this, we first need to understand what regret is.

Understanding Regret

Regret helps us measure how well our strategy performs. It's the difference between:

  1. What we could have earned by always choosing the best arm (if we knew which one it was)
  2. What we actually earned using our strategy

(The term "regret" is particularly fitting here - it represents how much we'd regret our past decisions if we later discovered there was a better arm we could have been pulling all along.)

Let's walk through it. Imagine three slot machines, A, B, and C, where machine A has the highest expected payout and machines B and C pay out less on average.

If we spend time pulling machines B and C while learning, we accumulate regret compared to always pulling machine A. At each step $t$, our instantaneous regret is $q^*(a^*) - q^*(a_t)$ - the difference between the best possible reward and what we actually got. Our total regret $R(T)$ after $T$ steps is the sum of all these missed opportunities:

$$R(T) = T \cdot q^*(a^*) - \sum_{t=1}^T q^*(a_t)$$

where $T$ is the total number of time steps, $q^*(a^*)$ is the expected reward of the optimal arm $a^*$, and $q^*(a_t)$ is the expected reward of the arm we actually chose at step $t$.

This gives us a way to evaluate different strategies. A good strategy should explore enough to identify the better arms while exploiting what it has learned, keeping cumulative regret growing as slowly as possible.
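As a quick illustration (my own sketch, with made-up true values for A, B, and C), here is how cumulative regret would be computed for a sequence of chosen arms:

```python
import numpy as np

# Hypothetical true expected rewards for machines A, B, C (not values from this article)
q_star = {"A": 0.8, "B": 0.5, "C": 0.3}
best = max(q_star.values())

# A strategy that wastes some pulls on B and C while learning
chosen = ["B", "C", "A", "B", "A", "A", "A", "C", "A", "A"]

instant_regret = [best - q_star[a] for a in chosen]   # q*(a*) - q*(a_t) at each step
cumulative_regret = np.cumsum(instant_regret)
print(cumulative_regret[-1])                          # total regret R(T) after T = 10 pulls
```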

Try out the interactive visualization below to see how regret changes with different strategies (for example: picking arms at random, always picking A, B, or C, alternating between arms, and so on).

Explore regret accumulation by pulling different arms


Algorithm of Multi-Armed Bandits

It is evident that there are certain strategies that should minimize regret in the long run. The core algorithm for multi-armed bandits is simple, but the choice of strategy is crucial. Before discussing such strategies, it's useful to write down the algorithm.

While we've used $q^*(a)$ and $\hat{q}_t(a)$ to describe the theoretical framework, implementations typically use simpler notation. Let $Q(a)$ represent our current estimate and $X_{a,t}$ be the actual reward observed when using arm $a$ at time $t$. The core algorithm:

  1. Initialize $Q(a) = 0$ and $N(a) = 0$ for all arms
  2. For each time step $t = 1, 2, \dots, T$:
    1. Select an arm $a$ based on our strategy
    2. Use the arm and observe reward $X_{a,t}$
    3. Update our counts: $N(a) = N(a) + 1$
    4. Update our estimate: $Q(a) = Q(a) + \frac{1}{N(a)}[X_{a,t} - Q(a)]$

What we call strategy here is the way we choose the arm at each time step (i.e. step 2.1), which will inevitably affect the cumulative regret.
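Here is a short Python sketch of this loop (my own illustration, not the article's code), with the strategy left as a pluggable function:

```python
import numpy as np

def run_bandit(arms, strategy, horizon=1000, rng=None):
    """Generic bandit loop: `arms` is a list of no-argument reward functions,
    `strategy` picks an arm index given (Q, N, t, rng)."""
    rng = rng or np.random.default_rng()
    k = len(arms)
    Q = np.zeros(k)   # value estimates
    N = np.zeros(k)   # pull counts
    for t in range(1, horizon + 1):
        a = strategy(Q, N, t, rng)      # step 2.1: the strategy chooses the arm
        x = arms[a]()                   # step 2.2: use the arm, observe reward
        N[a] += 1                       # step 2.3: update count
        Q[a] += (x - Q[a]) / N[a]       # step 2.4: incremental average update
    return Q, N

# Example with three Bernoulli arms (hypothetical payout rates) and a purely random strategy
arms = [lambda p=p: float(np.random.random() < p) for p in (0.8, 0.5, 0.3)]
random_strategy = lambda Q, N, t, rng: rng.integers(len(Q))
Q, N = run_bandit(arms, random_strategy)
```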

Bandits in wireless networks

Let's bring our slot machine analogy into the world of wireless networks. Instead of pulling levers, you're now a network controller trying to find the best channel for data transmission. Each wireless channel is like a slot machine, but instead of coins, we're trying to maximize data throughput. And unlike casino machines with fixed probabilities, channel quality fluctuates constantly due to interference, physical obstacles, and other users.

This scenario is perfect for MAB algorithms because wireless networks face the same exploration-exploitation dilemma: do we stay on a channel that's performing well, or check if other channels might offer better throughput? Take a common example like WiFi routers - they must constantly decide which frequency channel to use for transmission, adapting to changing conditions and competing traffic.

At each moment $t$:

  1. The controller selects a channel (arm) to transmit on
  2. It observes the achieved throughput on that channel, which serves as the reward

This setting is more challenging: channel conditions change dynamically, past performance doesn't guarantee future quality, and other users' actions influence our rewards. These characteristics make wireless networks a classic example of non-stationary bandits, where the reward distributions themselves evolve over time.

Modeling the Wireless Channel

Before diving into MAB strategies, let's understand what makes wireless channel selection challenging. Unlike our casino analogy where probabilities stay fixed, wireless channels exhibit complex, time-varying behavior:

Signal Quality and Mean SNR

The quality of a wireless channel is primarily determined by its Signal-to-Noise Ratio (SNR). In our model, each channel has a base SNR ($\text{SNR}_\text{mean}$) measured in decibels (dB), representing the average expected signal quality.

The instantaneous SNR at time $t$ is given by:

$$\text{SNR}(t) = \text{SNR}_\text{mean} + S(t) + F(t)$$

Fading Effects

Shadow Fading ($S(t)$)

Fast Fading ($F(t)$)

Note

The Rayleigh distribution is an interesting probability distribution commonly used in wireless communication to model fading effects. It can be expressed as:

$$f(x; \sigma) = \frac{x}{\sigma^2} e^{-x^2/(2\sigma^2)}$$

where $\sigma$ is the scale parameter. We will not go into the details of this distribution here, but for the interested, see the Wikipedia article on Rayleigh distribution.

Coherence Time

The coherence time represents the duration over which the channel can be considered relatively stable. In our model, shadow fading $S(t)$ is only re-drawn once per coherence period, while fast fading $F(t)$ changes at every time step.

This time-scale separation is typical in wireless channels, where large-scale effects (shadow fading) persist longer than small-scale effects (fast fading).

Channel Capacity

The actual throughput achievable on a channel follows Shannon's capacity formula:

$$C = B \log_2(1 + \text{SNR}(t))$$

where $B$ is the channel bandwidth (in Hz) and $\text{SNR}(t)$ is the instantaneous signal-to-noise ratio, converted from dB to a linear ratio before plugging into the formula.
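For example, a tiny helper (my own sketch) that does the dB-to-linear conversion and evaluates the capacity formula:

```python
import numpy as np

def capacity_bps(bandwidth_hz, snr_db):
    """Shannon capacity in bits/s, converting SNR from dB to a linear ratio first."""
    return bandwidth_hz * np.log2(1 + 10 ** (snr_db / 10))

print(capacity_bps(1e6, 20))  # 1 MHz channel at 20 dB SNR -> roughly 6.66 Mbit/s
```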

This varying nature of wireless channels makes our MAB problem particularly interesting. Based on the above model, we can create a simulated wireless channel environment in which to test our strategies.

For all the simulations from now on, we will use typical values for a standard wireless environment.

Note that a real wireless environment would have more complex dynamics, but this simplified model captures the essence of the problem.
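To make the model concrete, here is a minimal sketch of such a simulated channel in Python. The parameter values (shadow-fading standard deviation, coherence time, bandwidth, Rayleigh scale, and the mean SNRs at the end) are my own illustrative assumptions, not the exact settings used in the figures:

```python
import numpy as np

class WirelessChannel:
    """Simplified channel model: SNR(t) = SNR_mean + S(t) + F(t), all in dB."""

    def __init__(self, snr_mean_db, shadow_std_db=2.0, coherence_time=50,
                 bandwidth_hz=1e6, rng=None):
        self.snr_mean_db = snr_mean_db
        self.shadow_std_db = shadow_std_db
        self.coherence_time = coherence_time
        self.bandwidth_hz = bandwidth_hz
        self.rng = rng or np.random.default_rng()
        self.shadow_db = 0.0
        self.t = 0

    def step(self):
        """Advance one time step and return the achievable throughput (the reward)."""
        # Shadow fading: large-scale effect, re-drawn only once per coherence period
        if self.t % self.coherence_time == 0:
            self.shadow_db = self.rng.normal(0.0, self.shadow_std_db)
        # Fast fading: small-scale effect, a fresh Rayleigh amplitude every step,
        # expressed here as a dB offset (scale chosen so the average power is ~0 dB)
        fast_db = 20 * np.log10(self.rng.rayleigh(scale=1 / np.sqrt(2)))
        snr_db = self.snr_mean_db + self.shadow_db + fast_db
        # Shannon capacity needs the linear SNR, not dB
        throughput = self.bandwidth_hz * np.log2(1 + 10 ** (snr_db / 10))
        self.t += 1
        return throughput

# Three channels with different (hypothetical) mean SNRs, as in the figures below
channels = [WirelessChannel(snr) for snr in (15, 10, 5)]
```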

For visualization purposes, throughput is shown with a 20-step moving average for clarity; raw values fluctuate more rapidly due to fast fading effects. We will visualize the throughput of these channels and the cumulative regret if we naively choose the channel with the highest mean SNR, along with a hypothetical perfect strategy that always picks the best channel.

Channel throughput across 3 sample wireless channels with different mean SNR, and cumulative regret if we naively choose the high mean SNR channel

The above visualization shows some key aspects of our wireless channel environment, especially how throughput varies across channels.

This gives context for why we need MAB strategies: we want to react to dynamically changing channel conditions to maximize throughput. A good strategy should at least beat naively choosing an arm/channel and hoping for the best, and it should have some sense of when to switch (explore) and when to stick (exploit).

Strategies

Finally! There are actually a LOT of strategies, but for now we will explore a couple of popular strategies widely used for tackling the wireless channel selection problem, each with its own approach and trade-offs. Hopefully, this will be a good starting point for exploring more strategies on your own!

$\epsilon$-greedy

The $\epsilon$-greedy strategy is pretty straightforward: with probability $1-\epsilon$, select the channel that currently appears best based on our estimates $\hat{q}_t(a)$, and with probability $\epsilon$, select a random channel:

$$a_t = \begin{cases} \arg\max_a \hat{q}_t(a) & \text{with probability } 1-\epsilon \\ \text{uniform random channel} & \text{with probability } \epsilon \end{cases}$$

The parameter $\epsilon$ controls the exploration-exploitation tradeoff. We can think of it as a curiosity factor!

In the traditional bandit setting with fixed reward distributions, we would estimate $\hat{q}_t(a)$ as the average of all observed rewards:

$$\hat{q}_t(a) = \frac{1}{n_t(a)} \sum_{s=1}^t r_s \cdot \mathbf{1}[a_s = a]$$

However, in wireless networks, recent observations are more valuable than old ones since channel conditions change over time. We can modify our estimate to use exponential smoothing. This is not the traditional way to do $\epsilon$-greedy, but we will see why it's better when we do our simulations. We define our estimate as:

$$\hat{q}_t(a) = (1-\alpha)\hat{q}_{t-1}(a) + \alpha r_t$$

where $\alpha$ is the learning rate. This gives more weight to recent rewards, helping us adapt to changing channel conditions.
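Here is a minimal sketch of $\epsilon$-greedy with exponential smoothing (my own illustration; the `channels` objects, such as the WirelessChannel sketch above, and the default $\epsilon$ and $\alpha$ values are assumptions):

```python
import numpy as np

def epsilon_greedy_smoothed(channels, epsilon=0.1, alpha=0.5, horizon=1000, rng=None):
    """Epsilon-greedy channel selection with exponentially smoothed estimates."""
    rng = rng or np.random.default_rng()
    k = len(channels)
    q_hat = np.zeros(k)                # smoothed throughput estimates
    rewards = np.zeros(horizon)
    for t in range(horizon):
        if rng.random() < epsilon:
            a = rng.integers(k)        # explore: pick a random channel
        else:
            a = int(np.argmax(q_hat))  # exploit: pick the best-looking channel
        r = channels[a].step()         # transmit and observe throughput
        q_hat[a] = (1 - alpha) * q_hat[a] + alpha * r  # exponential smoothing update
        rewards[t] = r
    return rewards
```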

We will use the same conditions and visualizations as in our channel simulation above.

Cumulative regret for different epsilon values in the epsilon-greedy strategy with exponential smoothing

The plot above shows how cumulative regret changes with different $\epsilon$ values. The shaded bands represent the 95% confidence intervals across multiple simulation runs. $\alpha$ is set to 0.5 for all exponential smoothing strategies in this simulation. A few observations:

Cumulative regret for different alpha values in the epsilon-greedy strategy with exponential smoothing

Now that we know why exponential smoothing is better, we can stick with it and see how different $\alpha$ values affect cumulative regret (and, in turn, throughput).

Note

Interestingly, $\alpha = 1.0$ performs well in our wireless scenario due to rapid channel variations from fast fading effects. While $\alpha = 1.0$ means completely discarding history and using only the current reward, in our case this aggressiveness helps track fast-changing channel conditions. In environments where channel variations are dominated by slower shadow fading (changing every coherence period) rather than fast fading, lower $\alpha$ values would be more appropriate as they provide better averaging over stable periods.

There are other variants of the $\epsilon$-greedy strategy, such as decaying $\epsilon$-greedy, where $\epsilon$ decreases over time. However, it is not very useful in our wireless scenario: it doesn't adapt to changing channel conditions, especially in later stages where $\epsilon$ is very low and the strategy is thus unwilling to explore (which is crucial in non-stationary environments like ours). Nonetheless, it is just one of many variants, and there are modifications of $\epsilon$-greedy that can be used in wireless scenarios.
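For reference, a decay schedule for $\epsilon$ typically looks something like the following sketch (the exact schedule and constants here are assumptions, not from this write-up):

```python
def decayed_epsilon(t, eps_start=1.0, eps_min=0.01, decay=0.01):
    """Epsilon shrinks toward a floor over time, so late-stage exploration nearly stops."""
    return max(eps_min, eps_start / (1 + decay * t))
```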

Upper Confidence Bound (UCB)

The Upper Confidence Bound (UCB) algorithm takes a more sophisticated approach than $\epsilon$-greedy. Instead of using random exploration, it systematically explores arms based on their potential upside. The core idea is: "optimism in the face of uncertainty."

For each arm, UCB maintains two quantities: its estimated value $\hat{q}_t(a)$ and the number of times it has been selected so far, $n_t(a)$.

At each time step, UCB selects the arm that maximizes:

$$\text{UCB}_t(a) = \hat{q}_t(a) + \sqrt{\frac{2\ln t}{n_t(a)}}$$

where $\hat{q}_t(a)$ is our current estimate of arm $a$'s value (the exploitation term), $t$ is the current time step, and $n_t(a)$ is the number of times arm $a$ has been selected so far (the square-root term is the exploration bonus).

The exploration bonus has some nice properties: it shrinks as an arm is selected more often (we grow more confident in its estimate), while the $\ln t$ in the numerator slowly grows for arms we have neglected, ensuring every arm keeps getting explored occasionally.

It is interesting to see that there is no parameter like $\epsilon$ in UCB, and yet it still works. A very good intuition for this can be found at gibberblot, from which I took a lot of references for this write-up.

Like with $\epsilon$-greedy, we need to adapt UCB for wireless channels. Instead of using the average reward for $\hat{q}_t(a)$, we'll use the same exponential smoothing:

$$\hat{q}_t(a) = (1-\alpha)\hat{q}_{t-1}(a) + \alpha r_t$$
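A minimal sketch of this modified UCB (again my own illustration; pulling each arm once for initialization and the default $\alpha$ are assumptions):

```python
import numpy as np

def ucb_smoothed(channels, alpha=0.5, horizon=1000):
    """UCB channel selection with exponentially smoothed value estimates."""
    k = len(channels)
    q_hat = np.zeros(k)       # smoothed throughput estimates
    n = np.zeros(k)           # selection counts
    rewards = np.zeros(horizon)
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1                           # pull each arm once so n(a) > 0
        else:
            bonus = np.sqrt(2 * np.log(t) / n)  # exploration bonus
            a = int(np.argmax(q_hat + bonus))   # optimism in the face of uncertainty
        r = channels[a].step()
        n[a] += 1
        q_hat[a] = (1 - alpha) * q_hat[a] + alpha * r  # exponential smoothing update
        rewards[t - 1] = r
    return rewards
```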

Let's see how different values of $\alpha$ affect UCB's performance in our wireless scenario, and we will compare it with the best run of $\epsilon$-greedy.

Cumulative regret for different alpha values in UCB with exponential smoothing (along with the best run of epsilon-greedy)

From the plot, we notice a couple of interesting things:

Note

While standard UCB achieves logarithmic regret bounds for stationary bandits, these bounds don't hold in our wireless setting due to the non-stationary rewards. However, the exponential smoothing modification helps UCB track the changing channel conditions effectively.

Ending thoughts

We explored how MABs adapt to the dynamic world of wireless networks. It's clear that wireless environments demand fresh thinking, and we need to modify traditional MAB strategies to suit these conditions. Simple can be powerful: while UCB offers sophisticated exploration, a well-tuned $\epsilon$-greedy strategy can match its performance with less computational overhead. But we are simply scratching the surface here; there are many other possible strategies and modifications that can be explored. Moreover, MABs themselves are just the beginning, being a subset of the much larger realm of reinforcement learning, which has far more complicated algorithms, agents, networks and whatnot. But that's a story for another day!

The code for the simulations and visualizations is available on wireless-mab.

References

What I referred to for this write-up:

I also recommend searching for terms like "Adaptive wireless channel selection", "MABs for spectrum sharing", etc. There are a lot of research papers and articles on these topics!