Multi-Armed Bandits (and wireless channels)
The term "bandit" comes from the one-armed bandit, a nickname for slot machines. You might have seen them in casinos, where you pull a lever and hope to get a reward. They look like this:

Imagine you're at a casino with multiple slot machines. Each machine has its own hidden probability of paying out. You have limited time and money, so each pull is important. Do you:
- Stick with a machine that's given you decent payouts so far? (exploitation)
- Try other machines that might actually have better odds? (exploration)
This is the essence of the exploration-exploitation dilemma. Every time you pull a lever, you're both learning about that machine's payout pattern (its underlying reward distribution, so to speak) AND trying to win money (maximize reward).
In the simplest case with a single machine, we can define our problem mathematically. At each time step $t$:
- We pull the arm and receive a reward $R_t$
- The true expected reward is some unknown value $q_*(a)$
- We maintain an estimate $Q_t(a)$ based on our observations
Since we only have one arm here, the $a$ is just a placeholder and it is equivalent to writing $q_*$, $Q_t$.
$q_*(a)$ is the true (but unknown) expected reward - in our slot machine example, this would be the machine's programmed payout rate. We never actually know this value; we can only estimate it through repeated plays.
Our estimate $Q_t(a)$ is calculated in real-time as we play, typically as the average of our observed rewards:

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}[A_i = a]}{N_t(a)}$$

where $N_t(a) = \sum_{i=1}^{t-1} \mathbb{1}[A_i = a]$ is the number of times we've pulled arm $a$ up to time $t$, and $\mathbb{1}[A_i = a]$ is 1 if we pulled arm $a$ at time $i$ and 0 otherwise.
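For example, if our first three pulls return 4, 8, and 6 coins, then $N_4(a) = 3$ and our estimate is $Q_4(a) = (4 + 8 + 6)/3 = 6$ coins per pull.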
Multi-Armed Bandits
Now, let's extend this to multiple bandits. Instead of one machine, we have $k$ different machines (or arms). At each time step $t$:
- We choose an arm $A_t = a$
- We pull the arm and receive a reward $R_t$
- The true expected reward of arm $a$ is $q_*(a)$
- Our estimate of arm $a$'s reward is $Q_t(a)$
The challenge is an extension of the single-arm bandit problem, but now involves much more complexity: how do we balance exploring all arms to learn their true reward distributions while exploiting the arms that seem to give the best rewards so far? For this, we first need to understand regret.
Understanding Regret
Regret helps us measure how well our strategy performs. It's the difference between:
- What we could have earned by always choosing the best arm (if we knew which one it was)
- What we actually earned using our strategy
(The term "regret" is particularly fitting here - it represents how much we'd regret our past decisions if we later discovered there was a better arm we could have been pulling all along.)
Let's walk through it. Imagine three slot machines:
- Machine A: averages 10 coins per pull (but we don't know this)
- Machine B: averages 7 coins per pull
- Machine C: averages 4 coins per pull
If we spend time pulling machines B and C while learning, we accumulate regret compared to always pulling machine A. At each step $t$, our instantaneous regret is $q_*(a^*) - q_*(A_t)$ - the difference between the best possible reward and what we actually got. Our total regret after $T$ steps is the sum of all these missed opportunities:

$$\text{Regret}(T) = \sum_{t=1}^{T} \left( q_*(a^*) - q_*(A_t) \right)$$

where:
- $a^*$ is the optimal arm (Machine A in our example)
- $q_*(a^*)$ is its true expected reward (10 coins)
- $A_t$ is the arm we actually pulled at time $t$
- $q_*(A_t)$ is the true expected reward of that arm
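For instance, if our first ten pulls included machine B three times and machine C twice (with machine A the other five times), our total regret so far would be $3(10 - 7) + 2(10 - 4) = 21$ coins.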
This gives us a way to evaluate different strategies. A good strategy should:
- Keep regret as low as possible: This means we're choosing good arms more often
- Have regret grow as slowly as possible over time: This indicates we're learning the optimal strategy quickly
Try out the interactive visualization below to see how regret changes with different strategies! (Some strategies to try: picking arms at random, always picking A, B, or C, alternating between arms, and so on.)
Explore regret accumulation by pulling different arms
The Multi-Armed Bandit Algorithm
It is evident that certain strategies should minimize regret in the long run. The core algorithm for multi-armed bandits is simple, but the choice of strategy is crucial. Before discussing such strategies, it's useful to write down the algorithm.
While we've used $q_*(a)$ and $Q_t(a)$ to describe the theoretical framework, implementations typically use simpler notation. Let $Q(a)$ represent our current estimate and $R$ be the actual reward observed when using arm $a$ at time $t$. The core algorithm:
1. Initialize $Q(a) = 0$ and $N(a) = 0$ for all arms $a$
2. For each time step $t$:
   1. Select an arm $a$ based on our strategy
   2. Use the arm and observe reward $R$
   3. Update our counts: $N(a) \leftarrow N(a) + 1$
   4. Update our estimate: $Q(a) \leftarrow Q(a) + \frac{1}{N(a)} \left( R - Q(a) \right)$
What we call a strategy here is the way we choose the arm at each time step (i.e. step 2.1), which will inevitably affect the cumulative regret.
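To make this concrete, here is a minimal sketch of that loop in Python. The three-arm setup, the Gaussian reward noise, and the purely random `select_arm` placeholder are illustrative assumptions of mine; only the count and estimate updates come from the algorithm above.

```python
import random

# Hypothetical example: three arms whose true mean payouts are unknown to the agent.
true_means = [10.0, 7.0, 4.0]
n_arms = len(true_means)

Q = [0.0] * n_arms  # current reward estimates, Q(a)
N = [0] * n_arms    # pull counts, N(a)

def pull(arm):
    """Simulate a noisy reward around the arm's true mean."""
    return random.gauss(true_means[arm], 2.0)

def select_arm():
    """Placeholder strategy: pure random exploration."""
    return random.randrange(n_arms)

for t in range(1000):
    a = select_arm()           # 2.1: select an arm based on our strategy
    r = pull(a)                # 2.2: use the arm and observe reward R
    N[a] += 1                  # 2.3: update our counts
    Q[a] += (r - Q[a]) / N[a]  # 2.4: incremental update of the running average

print([round(q, 2) for q in Q])  # estimates should approach 10, 7, 4
```

Swapping out `select_arm` is exactly what the strategies later in this post change.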
Bandits in wireless networks
Let's bring our slot machine analogy into the world of wireless networks. Instead of pulling levers, you're now a network controller trying to find the best channel for data transmission. Each wireless channel is like a slot machine, but instead of coins, we're trying to maximize data throughput. And unlike casino machines with fixed probabilities, channel quality fluctuates constantly due to interference, physical obstacles, and other users.
This scenario is perfect for MAB algorithms because wireless networks face the same exploration-exploitation dilemma: do we stay on a channel that's performing well, or check if other channels might offer better throughput? Take a common example like WiFi routers - they must constantly decide which frequency channel to use for transmission, adapting to changing conditions and competing traffic.
At each moment $t$:
- We must choose one channel to transmit data on
- The channel's true quality $q_*(a)$ varies with time
- We estimate this quality through actual transmissions, giving us $Q_t(a)$
- Poor channel selection leads to higher regret
This setting is more challenging: channel conditions change dynamically, past performance doesn't guarantee future quality, and other users' actions influence our rewards. These characteristics make wireless networks a classic example of non-stationary bandits, where the reward distributions themselves evolve over time.
Modeling the Wireless Channel
Before diving into MAB strategies, let's understand what makes wireless channel selection challenging. Unlike our casino analogy where probabilities stay fixed, wireless channels exhibit complex, time-varying behavior:
Signal Quality and Mean SNR
The quality of a wireless channel is primarily determined by its Signal-to-Noise Ratio (SNR). In our model, each channel has a base SNR ($\overline{\text{SNR}}$) measured in decibels (dB), representing the average expected signal quality.
The instantaneous SNR at time $t$ is given by:

$$\text{SNR}(t) = \overline{\text{SNR}} + X_{\text{shadow}}(t) + X_{\text{fast}}(t) \quad \text{(in dB)}$$
Fading Effects
Shadow Fading ($X_{\text{shadow}}$)
- Models signal attenuation from large obstacles
- Follows a log-normal distribution: in dB, $X_{\text{shadow}}(t) \sim \mathcal{N}(0, \sigma_s^2)$
Fast Fading ($X_{\text{fast}}$)
- Models multipath effects
- Follows a Rayleigh distribution: the fading amplitude $r(t) \sim \text{Rayleigh}(\sigma)$, entering the SNR as $X_{\text{fast}}(t) = 20 \log_{10} r(t)$ dB
The Rayleigh distribution is an interesting probability distribution commonly used in wireless communication to model fading effects. It can be expressed as:

$$f(x; \sigma) = \frac{x}{\sigma^2} \, e^{-x^2 / (2\sigma^2)}, \quad x \geq 0$$

where $\sigma$ is the scale parameter. We will not go into the details of this distribution here, but for the interested, see the Wikipedia article on Rayleigh distribution.
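As a quick sanity check, one can sample from this distribution with numpy (the scale value here is an arbitrary illustrative choice):

```python
import numpy as np

sigma = 1.0  # scale parameter (arbitrary choice for illustration)
samples = np.random.default_rng(0).rayleigh(scale=sigma, size=100_000)

# The mean of a Rayleigh(sigma) variable is sigma * sqrt(pi / 2) ~= 1.2533 for sigma = 1
print(samples.mean())  # should be close to 1.2533
```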
Coherence Time
The coherence time represents the duration over which the channel can be considered relatively stable. In our model:
- Shadow fading changes every coherence period
- Fast fading varies at each time step
This time-scale separation is typical in wireless channels, where large-scale effects (shadow fading) persist longer than small-scale effects (fast fading).
Channel Capacity
The actual throughput achievable on a channel follows Shannon's capacity formula:

$$C = B \log_2(1 + \text{SNR})$$

where:
- $C$ is the channel capacity in bits per second
- $B$ is the bandwidth
- $\text{SNR}$ is the instantaneous SNR in linear scale
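As a quick sense of scale: an SNR of 15 dB corresponds to a linear ratio of $10^{1.5} \approx 31.6$, so with, say, $B = 20\,\text{MHz}$ of bandwidth (an illustrative value, not one fixed by our model) the capacity is roughly $20 \times 10^{6} \cdot \log_2(1 + 31.6) \approx 100$ Mbit/s.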
This varying nature of wireless channels makes our MAB problem particularly interesting. Based on the above model, we can simulate our wireless environment and test our strategies against it.
For all the simulations from now on, we will use typical values for a standard wireless environment:
- Shadow fading standard deviation ($\sigma_s$) = 8 dB
- Coherence time = 100 time steps
- Mean SNR values of 15 dB, 12 dB, and 8 dB for the three channels.
Note that a real wireless environment would have more complex dynamics, but this simplified model captures the essence of the problem.
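To give a feel for how such an environment could be coded up, here is a rough sketch of a channel simulator in Python. The class name, the use of numpy, and the 20 MHz bandwidth are my own illustrative choices; the mean SNR values, the 8 dB shadow-fading deviation, and the 100-step coherence time follow the values above, and the actual simulation code linked at the end may differ in its details.

```python
import numpy as np

class WirelessChannel:
    """Toy channel: mean SNR + log-normal shadow fading + Rayleigh fast fading."""

    def __init__(self, mean_snr_db, shadow_sigma_db=8.0, coherence_time=100,
                 bandwidth_hz=20e6, rng=None):
        self.mean_snr_db = mean_snr_db
        self.shadow_sigma_db = shadow_sigma_db
        self.coherence_time = coherence_time
        self.bandwidth_hz = bandwidth_hz
        self.rng = rng or np.random.default_rng()
        self.shadow_db = 0.0
        self.t = 0

    def step(self):
        """Advance one time step and return the achievable throughput (bits/s)."""
        # Shadow fading: redrawn once per coherence period (normal in dB).
        if self.t % self.coherence_time == 0:
            self.shadow_db = self.rng.normal(0.0, self.shadow_sigma_db)
        self.t += 1

        # Fast fading: Rayleigh amplitude with unit average power, applied in dB.
        amplitude = self.rng.rayleigh(scale=1.0 / np.sqrt(2))
        fast_db = 20 * np.log10(amplitude)

        snr_db = self.mean_snr_db + self.shadow_db + fast_db
        snr_linear = 10 ** (snr_db / 10)

        # Shannon capacity: C = B * log2(1 + SNR)
        return self.bandwidth_hz * np.log2(1 + snr_linear)

# Three channels with the mean SNR values used in the article
channels = [WirelessChannel(snr) for snr in (15, 12, 8)]
throughputs = [[ch.step() for _ in range(500)] for ch in channels]
```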
For visualization purposes, throughput is shown with a 20-step moving average for clarity; raw values fluctuate more rapidly due to fast fading effects. We will visualize the throughput of these channels and the cumulative regret of naively choosing the highest mean SNR channel, along with a hypothetical perfect strategy that always picks the best channel.
The above visualization shows some key aspects of our wireless channel environment, especially how throughput varies across channels.
- The "best" channel (having high average SNR) can perform worse than others, picking it naively can lead to high regret. (and even worse for other channels)
- The perfect strategy, which is basically an oracle that always picks the best channel, is essentially an envelope over the throughput, and will have zero regret. It represents the theoretical upper bound on performance (in practice, such an oracle is impossible).
This gives context for why we need MAB strategies: we want to react to dynamically changing channel conditions to maximize throughput. A good strategy should be at least better than naively choosing an arm/channel and hoping for the best, and it should have some sense of when to switch (explore) and when to stick (exploit).
Strategies
Finally! There are actually a LOT of strategies, but for now we will explore a couple of popular ones widely used for tackling the wireless channel selection problem, each with its own approach and trade-offs. Hopefully, this will be a good starting point to explore more strategies on your own!
$\epsilon$-greedy
The $\epsilon$-greedy strategy is pretty straightforward: with probability $1 - \epsilon$, select the channel that currently appears best based on our estimates $Q(a)$, and with probability $\epsilon$, select a random channel:

$$A_t = \begin{cases} \arg\max_a Q(a) & \text{with probability } 1 - \epsilon \\ \text{a random arm} & \text{with probability } \epsilon \end{cases}$$

The parameter $\epsilon$ controls the exploration-exploitation tradeoff. We can think of it as a curiosity factor!
In the traditional bandit setting with fixed reward distributions, we would estimate $Q(a)$ as the average of all observed rewards:

$$Q(a) = \frac{1}{N(a)} \sum_{i \,:\, A_i = a} R_i$$
However, in wireless networks, recent observations are more valuable than old ones since channel conditions change over time. We can modify our estimate to use exponential smoothing. This is not the traditional way to do $\epsilon$-greedy, but we will see why it's better when we run our simulations. We define our estimate as:

$$Q(a) \leftarrow Q(a) + \alpha \left( R - Q(a) \right)$$

where $\alpha$ is the learning rate. This gives more weight to recent rewards, helping us adapt to changing channel conditions.
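A minimal sketch of this strategy in Python might look as follows, assuming a list of channel objects with a `step()` method returning the observed throughput (like the channel sketch earlier); the default `epsilon` and `alpha` values are just illustrative.

```python
import random

def epsilon_greedy(channels, epsilon=0.1, alpha=0.5, T=1000):
    """epsilon-greedy channel selection with exponentially smoothed estimates."""
    n = len(channels)
    Q = [0.0] * n
    rewards = []

    for t in range(T):
        if random.random() < epsilon:
            a = random.randrange(n)                 # explore: random channel
        else:
            a = max(range(n), key=lambda i: Q[i])   # exploit: current best estimate
        r = channels[a].step()                      # transmit and observe throughput
        Q[a] += alpha * (r - Q[a])                  # exponential smoothing update
        rewards.append(r)
    return Q, rewards

# e.g. Q, rewards = epsilon_greedy(channels)  # using the WirelessChannel list above
```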
We will use conditions and visualizations similar to our earlier channel simulation.
The plot above shows how cumulative regret changes with different $\epsilon$ values. The shaded bands represent the 95% confidence intervals across multiple simulation runs. $\alpha$ is set to 0.5 for all exponential smoothing (ES) strategies in this simulation. A few observations:
- Exponential smoothing adapts to changing channel conditions far better than averaging all past rewards. Averaging is actually nearly the same as just sticking with the highest mean SNR channel, which explains why exponential smoothing is better here.
- Low $\epsilon$ values (0.1 to 0.3) balance exploiting the current best channel with exploring other channels that might have improved. $\epsilon = 1$, as expected, performs the worst, as it always explores and never exploits.
- The best performing strategy seems to be $\epsilon = 0.1$ (ES), which is in line with what we would expect.
Now that we know why exponential smoothing is better, we can stick with it and see how different $\alpha$ values affect cumulative regret (and in turn, throughput):
- In wireless channels, $\alpha$ values above 0.3 perform better because they can quickly adapt to rapid channel variations, particularly from fast fading effects that occur at a per-sample timescale.
- Lower $\alpha$ values retain too much historical information, making them slow to track the actual channel conditions. This differs from traditional bandit settings with static reward distributions, highlighting how wireless environments require strategies that can adapt to rapid temporal variations.
Interestingly, $\alpha = 1.0$ performs well in our wireless scenario due to rapid channel variations from fast fading effects. While $\alpha = 1.0$ means completely discarding history and using only the current reward, in our case this aggressiveness helps track fast-changing channel conditions. In environments where channel variations are dominated by slower shadow fading (changing every coherence period) rather than fast fading, lower $\alpha$ values would be more appropriate as they provide better averaging over stable periods.
There are other variants of the $\epsilon$-greedy strategy, such as decaying $\epsilon$-greedy, where $\epsilon$ decreases over time. However, it is not very useful in our wireless scenario, as it doesn't adapt to changing channel conditions: in later stages $\epsilon$ is very low, so the strategy becomes unwilling to explore (which is crucial in non-stationary environments like ours). Nonetheless, it is just one of many, and there are modifications of $\epsilon$-greedy that can be used in wireless scenarios.
Upper Confidence Bound (UCB)
The Upper Confidence Bound (UCB) algorithm takes a more sophisticated approach than -greedy. Instead of using random exploration, it systematically explores arms based on their potential upside. The core idea is: "optimism in the face of uncertainty."
For each arm, UCB maintains two quantities:
- The current estimate of the reward, $Q(a)$
- An "exploration bonus" that grows larger the less we use an arm
At each time step, UCB selects the arm that maximizes:

$$Q(a) + \sqrt{\frac{2 \ln t}{N(a)}}$$

where:
- $t$ is the current time step
- $N(a)$ is how many times we've used arm $a$
- The square root term is our exploration bonus
The exploration bonus has some nice properties:
- It grows logarithmically with time (the $\ln t$ term)
- It shrinks the more we use an arm (the $N(a)$ in the denominator)
It is interesting to see that there is no parameter like $\epsilon$ in UCB, and yet it still works. A very good intuition for this can be found at gibberblot, from which I took a lot of references for this write-up.
Like with $\epsilon$-greedy, we need to adapt UCB for wireless channels. Instead of using the average reward for $Q(a)$, we'll use the same exponential smoothing:

$$Q(a) \leftarrow Q(a) + \alpha \left( R - Q(a) \right)$$
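A rough Python sketch of this modified UCB, in the same spirit as the $\epsilon$-greedy sketch above (the `channels` interface and the default `alpha` are again my own illustrative choices):

```python
import math

def ucb(channels, alpha=0.5, T=1000):
    """UCB channel selection with exponentially smoothed estimates."""
    n = len(channels)
    Q = [0.0] * n
    N = [0] * n
    rewards = []

    for t in range(1, T + 1):
        if t <= n:
            a = t - 1  # play each arm once so the exploration bonus is well defined
        else:
            a = max(range(n),
                    key=lambda i: Q[i] + math.sqrt(2 * math.log(t) / N[i]))
        r = channels[a].step()      # transmit and observe throughput
        N[a] += 1
        Q[a] += alpha * (r - Q[a])  # same exponential smoothing as epsilon-greedy
        rewards.append(r)
    return Q, rewards

# e.g. Q, rewards = ucb(channels)  # using the WirelessChannel list above
```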
Let's see how different values of $\alpha$ affect UCB's performance in our wireless scenario, and compare it with the best run of $\epsilon$-greedy.
From the plot, we notice a couple interesting things:
- The most obvious one is that UCB performs just as well as the best run of $\epsilon$-greedy, assuming a reasonable value of $\alpha$.
- Put another way, UCB is not vastly superior to $\epsilon$-greedy in our wireless scenario. UCB's exploration bonus is parameter-free, but it still needs to calculate the bonus for each arm at every step, which can be computationally expensive. So if $\epsilon$-greedy is doing well, it might be a better choice due to its simplicity.
- UCB with $\alpha = 1.0$ does NOT perform well, unlike $\epsilon$-greedy, which is actually expected: UCB's exploration bonus already provides a form of adaptation to changing channel conditions, so aggressive exponential smoothing is unnecessary, if not detrimental.
While standard UCB achieves logarithmic regret bounds for stationary bandits, these bounds don't hold in our wireless setting due to the non-stationary rewards. However, the exponential smoothing modification helps UCB track the changing channel conditions effectively.
Ending thoughts
We explored how MABs adapt to the dynamic world of wireless networks. It's clear that wireless environments demand fresh thinking, and we need to modify traditional MAB strategies to suit these conditions. Simple can be powerful - while UCB offers sophisticated exploration, a well-tuned $\epsilon$-greedy strategy can match its performance with less computational overhead. But we are merely scratching the surface here; there are many other possible strategies and modifications to explore. Moreover, MABs themselves are just the beginning, being a subset of the much larger realm of reinforcement learning, which has far more complicated algorithms, agents, networks and whatnot. But that's a story for another day!
The code for the simulations and visualizations is available on wireless-mab.
References
What I referred to for this write-up:
- Bandit Algorithms - A comprehensive textbook by Tor Lattimore and Csaba Szepesvári
- Gibberblot's RL Notes
I also recommend searching for terms like "adaptive wireless channel selection", "MABs for spectrum sharing", etc. There are a lot of research papers and articles on these topics!