Lab Report: Reinforcement Learning in Deep Structured Teams


Abstract

This lab report explores the application of reinforcement learning algorithms in deep structured teams for Markov chain and linear quadratic models with discounted and time-average cost functions. Two non-classical information structures are considered: deep state sharing and no sharing (NS). Theoretical results and a numerical example are presented to demonstrate the convergence of the learned strategies to optimal solutions.

Introduction

In this report, we investigate the use of reinforcement learning algorithms in deep structured teams to optimize resource allocation in a smart grid scenario.

The primary focus is on two information structures: deep state sharing and no sharing (NS). We analyze the convergence properties of the learned strategies and provide theoretical proofs to support our findings.

Methodology

Algorithm 2: Proposed Policy Gradient Algorithm

The proposed policy gradient algorithm is outlined as follows (a code sketch is given after the listing):

    1. Initialize parameters: number of agents (n), number of trajectories (ℓ), control horizon (T), number of features (z), set of features (α·,·), feedback gains (θ1, θ̄1), and step sizes (η1, η̄1).
    2. At iteration k ∈ N, run the following steps:
      • For j = 1 to ℓ:
        • Initialize states x1 = vec(x11, . . . , xn1).

        • For any agent i ∈ Nn, use strategy (11) with perturbed feedback gains θk + ũ(i, j) and θ̄k + ū(i, j), where ũ(i, j) ∼ unif(−r, r) and ū(i, j) ∼ unif(−r̄, r̄)Iz×z.
        • Compute the cost trajectories ∆c1:T (i, j) and c̄1:T (i, j).
      • Compute the gradient estimates ∇Ĉk and ∇C̄k as

∇Ĉk = z/(n ℓ r²) ∑i=1n ∑j=1ℓ ∑t=1T β^(t−1) ∆ct(i, j) ũ(i, j),

∇C̄k = z/(n ℓ r̄²) ∑i=1n ∑j=1ℓ ∑t=1T β^(t−1) c̄t(i, j) ū(i, j).

      • Update the feedback gains: θk+1 = θk − ηk∇Ĉk and θ̄k+1 = θ̄k − η̄k∇C̄k.
    3. Set k = k + 1 and return to step 2 until termination.
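
For concreteness, the following is a minimal Python sketch of one iteration of the zeroth-order gradient estimation described above, written for a scalar linear quadratic setting. It simplifies Algorithm 2 in one respect: the per-agent perturbations ũ(i, j) and ū(i, j) are collapsed into a single perturbation per trajectory, so the constant in the gradient estimate is illustrative rather than exact. The rollout routine, the control law inside it (a stand-in for strategy (11)), the cost decomposition, and the noise model are assumptions made only for this sketch.

    import numpy as np

    rng = np.random.default_rng(0)

    def rollout(theta, theta_bar, alpha, T, Q=1.0, R=1.0, beta=1.0, sigma_w=0.1):
        """Simulate one trajectory of n agents; return the discounted deviation
        cost (Delta c) and mean-field cost (c bar), following the usual deep
        structured LQ decomposition (illustrative form, not the paper's exact cost)."""
        n = len(alpha)
        x = rng.normal(size=n)                           # x1 = vec(x1^1, ..., x1^n)
        dc, cb = 0.0, 0.0
        for t in range(T):
            xbar = np.dot(alpha, x) / n                  # weighted deep state
            u = -theta * (x - xbar) - theta_bar * xbar   # stand-in for strategy (11)
            ubar = np.dot(alpha, u) / n
            dc += beta**t * np.mean(Q * (x - xbar)**2 + R * (u - ubar)**2)
            cb += beta**t * (Q * xbar**2 + R * ubar**2)
            x = x + u + rng.normal(scale=sigma_w, size=n)
        return dc, cb

    def pg_iteration(theta, theta_bar, alpha, ell=4, T=20, r=0.2, r_bar=0.25,
                     eta=0.05, eta_bar=0.05, beta=1.0, z=1):
        """One iteration k of the zeroth-order policy gradient update."""
        grad, grad_bar = 0.0, 0.0
        for _ in range(ell):                             # j = 1, ..., ell
            du = rng.uniform(-r, r)                      # perturbation of theta
            db = rng.uniform(-r_bar, r_bar)              # perturbation of theta_bar
            dc, cb = rollout(theta + du, theta_bar + db, alpha, T, beta=beta)
            grad += z / (ell * r**2) * dc * du           # estimate of grad C_hat
            grad_bar += z / (ell * r_bar**2) * cb * db   # estimate of grad C_bar
        return theta - eta * grad, theta_bar - eta_bar * grad_bar

Starting from θ1 = θ̄1 = 0 and repeating pg_iteration for k = 1, 2, . . . reproduces the outer loop of steps 2 and 3.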

Theoretical Results

Theorem 5

For any (d, γ) ∈ En(X) × G, the Q-function Qk(d, γ) converges to Q∗(d, γ) with probability one, as k → ∞.

Theorem 6

Let gk(·, d) ∈ argminγ∈G Qk(d, γ) be a greedy strategy; then, the performance of gk converges to that of the optimal strategy g∗ given in Theorem 1, when attention is restricted to deterministic strategies.

Proof

The proof follows from the standard convergence argument for the Q-learning algorithm together with Theorem 1, and exploits the fact that the Bellman operator is a contraction mapping with respect to the infinity norm.
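
As an illustration of the recursion behind Theorems 5 and 6, the sketch below implements tabular Q-learning over indices of a finite (or quantized) deep-state set and a finite strategy set G. The transition sampler step(d, g), the uniform exploration policy, and the 1/N(d, g) step size are assumptions made only for this sketch.

    import numpy as np

    def q_learning(num_deep_states, num_rules, step, beta=0.9,
                   iterations=100_000, seed=0):
        """Tabular Q-learning over (deep state d, decision rule gamma).

        `step(d, g)` is an assumed sampler returning (cost, next deep-state index)
        for deep-state index d and decision-rule index g."""
        rng = np.random.default_rng(seed)
        Q = np.zeros((num_deep_states, num_rules))
        visits = np.zeros_like(Q)
        d = rng.integers(num_deep_states)
        for _ in range(iterations):
            g = rng.integers(num_rules)                # uniform exploration
            cost, d_next = step(d, g)
            visits[d, g] += 1
            lr = 1.0 / visits[d, g]                    # Robbins-Monro step size
            target = cost + beta * Q[d_next].min()     # Bellman backup (cost minimization)
            Q[d, g] += lr * (target - Q[d, g])
            d = d_next
        greedy = Q.argmin(axis=1)                      # greedy strategy g_k(., d) of Theorem 6
        return Q, greedy

Because the Bellman operator is a contraction in the infinity norm, Q converges to Q∗ under the usual step-size and exploration conditions, which is the content of Theorem 5.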

Similar to Theorem 5, one can use a quantized space with quantization level 1/r, r ∈ N (akin to the quantization proposed for Theorem 6), to develop an approximate Q-learning algorithm under the NS information structure. The performance of the learned strategy converges to that of Theorem 5 as the number of agents n and the quantization level r increase to infinity.
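
A minimal sketch of the quantization step is given below: the empirical distribution (deep state) is rounded to resolution 1/r before being used as the state of the approximate Q-learning algorithm. The rounding-and-renormalizing scheme is one simple choice and is not prescribed by the text.

    import numpy as np

    def quantize_deep_state(d, r):
        """Round each probability mass of the empirical distribution d to the
        nearest multiple of 1/r and renormalize (illustrative scheme)."""
        q = np.round(np.asarray(d, dtype=float) * r) / r
        s = q.sum()
        return q / s if s > 0 else q

As r grows, the quantized deep state approaches the true empirical distribution, consistent with the convergence statement above.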

Model II

For Model II, we use a model-free policy-gradient method.

Theorem 7

Let Assumption 1 hold. The performance of the learned strategy {θk, θ̄k}, given by Algorithm 2, converges to the performance of the optimal strategy {θ∗, θ̄∗} in Theorem 2 with probability one, as k → ∞.

Proof

The proof follows from Theorem 2. Analogous to Theorem 6, one can devise an approximate policy gradient algorithm under the NS information structure, where the deep state is approximated by its mean field.
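
To illustrate the approximation mentioned in the proof, the sketch below propagates a deterministic prediction of the weighted mean in place of the unobserved deep state, using the noise-free averaged dynamics of Example 1 (A = B = 1). The linear form of the mean-field control is an assumption standing in for strategy (11).

    def predict_mean_field(m0, theta_bar, T):
        """Under the NS information structure the deep state is not observed;
        each agent replaces it with the prediction m_t propagated through the
        noise-free averaged dynamics m_{t+1} = m_t + ubar_t (illustrative)."""
        m, traj = m0, [m0]
        for _ in range(T):
            ubar = -theta_bar * m      # assumed mean-field component of the strategy
            m = m + ubar
            traj.append(m)
        return traj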

Numerical Example

Example 1

Consider a smart grid with n ∈ N consumers. Let xit ∈ R denote the energy requested by consumer i ∈ Nn from an independent system operator (ISO) at time t ∈ N. Let x̄t denote the weighted average of the energy requested across consumers, i.e.,

x̄t = 1/n ∑i=1n αixit,

where αi represents the importance (priority) of consumer i. The linearized dynamics of each consumer are described by:

xit+1 = xit + uit + wit,

where wit is the uncertainty regarding the energy consumption at time t. The objective is to find a resource allocation strategy that minimizes the cost function. Suppose that the information structure is deep state sharing and all consumers commonly run Algorithm 2.
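The dynamics and the weighted deep state of Example 1 can be simulated directly; in the sketch below, the noise distribution, the initial states, and the linear control law are illustrative assumptions.

    import numpy as np

    def simulate_grid(alpha, theta, theta_bar, T=50, sigma_w=0.1, seed=0):
        """Simulate n consumers with x^i_{t+1} = x^i_t + u^i_t + w^i_t and track
        the weighted deep state xbar_t = (1/n) * sum_i alpha_i x^i_t."""
        rng = np.random.default_rng(seed)
        n = len(alpha)
        x = rng.normal(size=n)                          # initial requested energies (assumed)
        xbar_hist = []
        for _ in range(T):
            xbar = np.dot(alpha, x) / n                 # weighted average of requests
            xbar_hist.append(xbar)
            u = -theta * (x - xbar) - theta_bar * xbar  # deep-state-sharing strategy (illustrative form)
            x = x + u + rng.normal(scale=sigma_w, size=n)
        return np.array(xbar_hist)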

Numerical parameters:

Parameter   Value
n           10
A           1
B           1
            4
R           1
Q           1
            1
r           0.2
r̄           0.25
η           0.05
η̄           0.05
β           1
z           1
α1:6,1      √5
α4,1        √1.5
α5,1        1
α6,1        √2
α9,1        √2.5

It is shown that the learned strategy converges to the optimal strategy, given by the deep Riccati equation in Theorem 2.
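
For reference, the optimal gains toward which the learned ones converge can be obtained offline from a Riccati recursion. The sketch below iterates a scalar discounted Riccati recursion to its fixed point; it is a generic scalar LQR computation used as a stand-in, not the deep Riccati equation of Theorem 2 itself, and the default parameters follow the table above.

    def scalar_riccati_gain(A=1.0, B=1.0, Q=1.0, R=1.0, beta=1.0, iterations=1000):
        """Iterate the scalar (discounted) Riccati recursion to a fixed point and
        return the corresponding feedback gain K and cost-to-go coefficient P."""
        P = Q
        for _ in range(iterations):
            K = beta * A * B * P / (R + beta * B * B * P)   # optimal feedback gain
            P = Q + beta * A * A * P - beta * A * B * P * K
        return K, P

Comparing the learned gains θk, θ̄k against such a fixed point (computed separately for the deviation and weighted-mean parts of the cost) is one way to visualize the convergence claimed above.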

Conclusions

In this report, we investigated the application of reinforcement learning algorithms in deep structured teams for Markov chain and linear quadratic models with discounted and time-average cost functions. We provided theoretical proofs for the convergence of the learned strategies and demonstrated their effectiveness through a numerical example in the context of a smart grid. Our findings highlight the potential of reinforcement learning for optimizing resource allocation in complex systems.
