Lab Report: Reinforcement Learning in Deep Structured Teams


Abstract

This lab report explores the application of reinforcement learning algorithms in deep structured teams for Markov chain and linear quadratic models with discounted and time-average cost functions. Two non-classical information structures are considered: deep state sharing and no sharing (NS). Theoretical results and a numerical example are presented to demonstrate the convergence of the learned strategies to optimal solutions.

Introduction

In this report, we investigate the use of reinforcement learning algorithms in deep structured teams to optimize resource allocation in a smart grid scenario.

The primary focus is on two information structures: deep state sharing and no sharing (NS). We analyze the convergence properties of the learned strategies and provide theoretical proofs to support our findings.

Methodology

Algorithm 2: Proposed Policy Gradient Algorithm

The proposed policy gradient algorithm is outlined as follows (a code sketch is given after the listing):

    1. Initialize parameters: number of agents (n), number of trajectories (ℓ), control horizon (T), number of features (z), set of features (α·,·), feedback gains (θ1, θ̄1), and step sizes (η1, η̄1).
    2. At iteration k ∈ N, run the following steps:
      • For j = 1 to ℓ:
        • Initialize states x1 = vec(x11, . . . , xn1).

        • For any agent i ∈ Nn, use strategy (11) with perturbed feedback gains θk + ũ(i, j) and θ̄k + ū(i, j), where ũ(i, j) ∼ unif(−r, r) and ū(i, j) ∼ unif(−r̄, r̄)Iz×z.
        • Compute the cost trajectories ∆c1:T (i, j) and c̄1:T (i, j).
      • Compute the gradient estimates ∇Ĉk and ∇C̄k as

∇Ĉk = z/(n ℓ r²) ∑i=1n ∑j=1ℓ ∑t=1T β^(t−1) ∆ct(i, j) ũ(i, j),

∇C̄k = z/(n ℓ r̄²) ∑i=1n ∑j=1ℓ ∑t=1T β^(t−1) c̄t(i, j) ū(i, j).

      • Update the feedback gains: θk+1 = θk − ηk∇Ĉk and θ̄k+1 = θ̄k − η̄k∇C̄k.
    3. Set k = k + 1 and return to step 2 until termination.
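
For concreteness, the following is a minimal Python sketch of one iteration of the zeroth-order gradient estimation described above, written for a scalar linear quadratic setting. It simplifies Algorithm 2 in one respect: the per-agent perturbations ũ(i, j) and ū(i, j) are collapsed into a single perturbation per trajectory, so the constant in the gradient estimate is illustrative rather than exact. The rollout routine, the control law inside it (a stand-in for strategy (11)), the cost decomposition, and the noise model are assumptions made only for this sketch.

    import numpy as np

    rng = np.random.default_rng(0)

    def rollout(theta, theta_bar, alpha, T, Q=1.0, R=1.0, beta=1.0, sigma_w=0.1):
        """Simulate one trajectory of n agents; return the discounted deviation
        cost (Delta c) and mean-field cost (c bar), following the usual deep
        structured LQ decomposition (illustrative form, not the paper's exact cost)."""
        n = len(alpha)
        x = rng.normal(size=n)                           # x1 = vec(x1^1, ..., x1^n)
        dc, cb = 0.0, 0.0
        for t in range(T):
            xbar = np.dot(alpha, x) / n                  # weighted deep state
            u = -theta * (x - xbar) - theta_bar * xbar   # stand-in for strategy (11)
            ubar = np.dot(alpha, u) / n
            dc += beta**t * np.mean(Q * (x - xbar)**2 + R * (u - ubar)**2)
            cb += beta**t * (Q * xbar**2 + R * ubar**2)
            x = x + u + rng.normal(scale=sigma_w, size=n)
        return dc, cb

    def pg_iteration(theta, theta_bar, alpha, ell=4, T=20, r=0.2, r_bar=0.25,
                     eta=0.05, eta_bar=0.05, beta=1.0, z=1):
        """One iteration k of the zeroth-order policy gradient update."""
        grad, grad_bar = 0.0, 0.0
        for _ in range(ell):                             # j = 1, ..., ell
            du = rng.uniform(-r, r)                      # perturbation of theta
            db = rng.uniform(-r_bar, r_bar)              # perturbation of theta_bar
            dc, cb = rollout(theta + du, theta_bar + db, alpha, T, beta=beta)
            grad += z / (ell * r**2) * dc * du           # estimate of grad C_hat
            grad_bar += z / (ell * r_bar**2) * cb * db   # estimate of grad C_bar
        return theta - eta * grad, theta_bar - eta_bar * grad_bar

Starting from θ1 = θ̄1 = 0 and repeating pg_iteration for k = 1, 2, . . . reproduces the outer loop of steps 2 and 3.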

Theoretical Results

Theorem 5

For any (d, γ) ∈ En(X) × G, the Q-function Qk(d, γ) converges to Q∗(d, γ) with probability one, as k → ∞.

Theorem 6

Let gk(·, d) ∈ argminγ∈G Qk(d, γ) be a greedy strategy; then, the performance of gk converges to that of the optimal strategy g∗ given in Theorem 1, when attention is restricted to deterministic strategies.

Proof

The proof follows from the standard convergence argument for the Q-learning algorithm together with Theorem 1, and exploits the fact that the Bellman operator is a contraction mapping with respect to the infinity norm.
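
As an illustration of the recursion behind Theorems 5 and 6, the sketch below implements tabular Q-learning over indices of a finite (or quantized) deep-state set and a finite strategy set G. The transition sampler step(d, g), the uniform exploration policy, and the 1/N(d, g) step size are assumptions made only for this sketch.

    import numpy as np

    def q_learning(num_deep_states, num_rules, step, beta=0.9,
                   iterations=100_000, seed=0):
        """Tabular Q-learning over (deep state d, decision rule gamma).

        `step(d, g)` is an assumed sampler returning (cost, next deep-state index)
        for deep-state index d and decision-rule index g."""
        rng = np.random.default_rng(seed)
        Q = np.zeros((num_deep_states, num_rules))
        visits = np.zeros_like(Q)
        d = rng.integers(num_deep_states)
        for _ in range(iterations):
            g = rng.integers(num_rules)                # uniform exploration
            cost, d_next = step(d, g)
            visits[d, g] += 1
            lr = 1.0 / visits[d, g]                    # Robbins-Monro step size
            target = cost + beta * Q[d_next].min()     # Bellman backup (cost minimization)
            Q[d, g] += lr * (target - Q[d, g])
            d = d_next
        greedy = Q.argmin(axis=1)                      # greedy strategy g_k(., d) of Theorem 6
        return Q, greedy

Because the Bellman operator is a contraction in the infinity norm, Q converges to Q∗ under the usual step-size and exploration conditions, which is the content of Theorem 5.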

Similar to Theorem 5, one can use a quantized space with quantization level 1/r, r ∈ N (akin to the quantization proposed for Theorem 6), to develop an approximate Q-learning algorithm under the NS information structure. The performance of the learned strategy converges to that of Theorem 5 as the number of agents n and the quantization level r increase to infinity.
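
A minimal sketch of the quantization step is given below: the empirical distribution (deep state) is rounded to resolution 1/r before being used as the state of the approximate Q-learning algorithm. The rounding-and-renormalizing scheme is one simple choice and is not prescribed by the text.

    import numpy as np

    def quantize_deep_state(d, r):
        """Round each probability mass of the empirical distribution d to the
        nearest multiple of 1/r and renormalize (illustrative scheme)."""
        q = np.round(np.asarray(d, dtype=float) * r) / r
        s = q.sum()
        return q / s if s > 0 else q

As r grows, the quantized deep state approaches the true empirical distribution, consistent with the convergence statement above.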

Model II

For Model II, we use a model-free policy-gradient method.

Theorem 7

Let Assumption 1 hold. The performance of the learned strategy {θk, θ̄k}, given by Algorithm 2, converges to the performance of the optimal strategy {θ∗, θ̄∗} in Theorem 2 with probability one, as k → ∞.

Proof

The proof follows from Theorem 2. Analogous to Theorem 6, one can devise an approximate policy gradient algorithm under the NS information structure, where the deep state is approximated by its mean field.
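
To illustrate the approximation mentioned in the proof, the sketch below propagates a deterministic prediction of the weighted mean in place of the unobserved deep state, using the noise-free averaged dynamics of Example 1 (A = B = 1). The linear form of the mean-field control is an assumption standing in for strategy (11).

    def predict_mean_field(m0, theta_bar, T):
        """Under the NS information structure the deep state is not observed;
        each agent replaces it with the prediction m_t propagated through the
        noise-free averaged dynamics m_{t+1} = m_t + ubar_t (illustrative)."""
        m, traj = m0, [m0]
        for _ in range(T):
            ubar = -theta_bar * m      # assumed mean-field component of the strategy
            m = m + ubar
            traj.append(m)
        return traj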

Numerical Example

Example 1

Consider a smart grid with n ∈ N consumers. Let xit ∈ R denote the energy requested by consumer i ∈ Nn from an independent system operator (ISO) at time t ∈ N. Let x̄t denote the weighted average of the energy requested across consumers, i.e.,

x̄t = 1/n ∑i=1n αixit,

where αi represents the importance (priority) of consumer i. The linearized dynamics of each consumer are described by:

xit+1 = xit + uit + wit,

where wit is the uncertainty regarding the energy consumption at time t. The objective is to find a resource allocation strategy that minimizes the cost function. Suppose that the information structure is deep state sharing and all consumers commonly run Algorithm 2.
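The dynamics and the weighted deep state of Example 1 can be simulated directly; in the sketch below, the noise distribution, the initial states, and the linear control law are illustrative assumptions.

    import numpy as np

    def simulate_grid(alpha, theta, theta_bar, T=50, sigma_w=0.1, seed=0):
        """Simulate n consumers with x^i_{t+1} = x^i_t + u^i_t + w^i_t and track
        the weighted deep state xbar_t = (1/n) * sum_i alpha_i x^i_t."""
        rng = np.random.default_rng(seed)
        n = len(alpha)
        x = rng.normal(size=n)                          # initial requested energies (assumed)
        xbar_hist = []
        for _ in range(T):
            xbar = np.dot(alpha, x) / n                 # weighted average of requests
            xbar_hist.append(xbar)
            u = -theta * (x - xbar) - theta_bar * xbar  # deep-state-sharing strategy (illustrative form)
            x = x + u + rng.normal(scale=sigma_w, size=n)
        return np.array(xbar_hist)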

Numerical parameters:

Parameter   Value
n           10
A           1
B           1
            4
R           1
Q           1
            1
r           0.2
r̄           0.25
η           0.05
η̄           0.05
β           1
z           1
α1:6,1      √5
α4,1        √1.5
α5,1        1
α6,1        √2
α9,1        √2.5

It is shown that the learned strategy converges to the optimal strategy, given by the deep Riccati equation in Theorem 2.
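
For reference, the optimal gains toward which the learned ones converge can be obtained offline from a Riccati recursion. The sketch below iterates a scalar discounted Riccati recursion to its fixed point; it is a generic scalar LQR computation used as a stand-in, not the deep Riccati equation of Theorem 2 itself, and the default parameters follow the table above.

    def scalar_riccati_gain(A=1.0, B=1.0, Q=1.0, R=1.0, beta=1.0, iterations=1000):
        """Iterate the scalar (discounted) Riccati recursion to a fixed point and
        return the corresponding feedback gain K and cost-to-go coefficient P."""
        P = Q
        for _ in range(iterations):
            K = beta * A * B * P / (R + beta * B * B * P)   # optimal feedback gain
            P = Q + beta * A * A * P - beta * A * B * P * K
        return K, P

Comparing the learned gains θk, θ̄k against such a fixed point (computed separately for the deviation and weighted-mean parts of the cost) is one way to visualize the convergence claimed above.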

Conclusions

In this report, we investigated the application of reinforcement learning algorithms in deep structured teams for Markov chain and linear quadratic models with discounted and time-average cost functions. We provided theoretical proofs for the convergence of the learned strategies and demonstrated their effectiveness through a numerical example in the context of a smart grid. Our findings highlight the potential of reinforcement learning for optimizing resource allocation in complex systems.
