T |   | B |   | T
  | F |   | F |
B |   | S |   | B
  | F |   | F |
T |   | B |   | T
Note: special states are not terminal. Special states are like normal states,
except they have a different reward function.
The rules:
- The agent can move: (L)eft, (R)ight, (U)p, (D)own. Executing a move
in any direction transfers the agent to the intended state with probability
p1; with probability p2 it transfers the agent to the state to the left of
the intended direction of motion, with probability p3 to the state to the
right of the intended direction, and with probability (1-p1-p2-p3) to the
state opposite to the intended one (the examples presented in the lecture
assumed p1=0.8 and p2=p3=0.1; see the sketch below).
- When the agent's move would take her outside the grid or into a forbidden
state, then as the result of such a move she stays in place (and receives
the appropriate reward). It is as if she bounced off the walls and the
forbidden states.
- Each move causes the agent to receive a reward (if negative, it can be
considered to be the cost of a move). The value of the reward depends on
the state the agent has just entered. For normal states the default reward
value r applies. For the terminal and special states the rewards are
defined individually for each such state.
- The result of the agent's activity is the discounted sum of the rewards
from all steps, computed using the discounting factor γ.
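(That is, if the rewards received in the successive steps are r(1), r(2),
r(3), ..., then the result is U = r(1) + γ r(2) + γ^2 r(3) + ...,
following the step indexing used in the lecture.)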
The parameters of the problem are thus: N,M,p1,p2,p3,r,γ and the
individual reward values for all the terminal and special states.
In the second part there is additionally the exploration parameter ε.
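To make the motion model concrete, here is a minimal sketch of the successor
distribution of a single move, including the bounce-off behaviour. Python is
used only for illustration (the assignment's designated language applies),
and the World container with fields N, M, forbidden is a hypothetical helper,
not part of the assignment:

DIRS = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
LEFT_OF  = {'U': 'L', 'L': 'D', 'D': 'R', 'R': 'U'}   # 90 degrees counter-clockwise
RIGHT_OF = {'U': 'R', 'R': 'D', 'D': 'L', 'L': 'U'}   # 90 degrees clockwise
OPPOSITE = {'U': 'D', 'D': 'U', 'L': 'R', 'R': 'L'}

def successors(state, action, world, p1, p2, p3):
    # Return (probability, next state) pairs for one move from 'state'.
    def move(direction):
        x = state[0] + DIRS[direction][0]
        y = state[1] + DIRS[direction][1]
        # Bounce off: stay in place when leaving the grid or when
        # entering a forbidden state.
        if not (1 <= x <= world.N and 1 <= y <= world.M):
            return state
        if (x, y) in world.forbidden:
            return state
        return (x, y)
    return [(p1, move(action)),
            (p2, move(LEFT_OF[action])),
            (p3, move(RIGHT_OF[action])),
            (1.0 - p1 - p2 - p3, move(OPPOSITE[action]))]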
Part I: directly solving an MDP
Write a program solving discrete MDP problems of the type described above,
using either the value iteration or the policy iteration method.
The program should display on output the computed optimal policy and the
utility distribution, both in the tabular format conforming to the world
layout.
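Either method is acceptable. For orientation, a minimal value-iteration
sketch follows (Python; it reuses the hypothetical successors() helper from
the earlier sketch, world.reward() is likewise assumed, and the sketch
credits the reward for the state being entered, as the rules above specify;
whether terminal utilities are then reported as 0 or as their own reward is
a matter of convention, so follow the lecture's):

def value_iteration(world, gamma, eps=1e-4):
    # Bellman updates until every utility changes by less than eps per sweep.
    U = {s: 0.0 for s in world.states}
    history = []                      # per-iteration snapshots for the plot
    while True:
        U_new = dict(U)
        for s in world.states:
            if s in world.terminal or s in world.forbidden:
                continue              # trials end in terminal states
            U_new[s] = max(sum(p * (world.reward(s2) + gamma * U[s2])
                               for p, s2 in successors(s, a, world,
                                                       world.p1, world.p2,
                                                       world.p3))
                           for a in 'ULDR')
        history.append(U_new)
        if max(abs(U_new[s] - U[s]) for s in U) < eps:
            return U_new, history
        U = U_new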
By saving the utility values from the subsequent steps of the algorithm,
analyze the convergence of the algorithm by plotting a graph of the utility
values using the Gnuplot program. Include the resulting graph, as well as the
developed set of Gnuplot commands, and the data file, so the graph can be
replotted for each problem instance.
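For example, if every iteration appends one row to a whitespace-separated
data file conv.dat, with one column per state (both the file name and this
layout are just an assumed convention), a Gnuplot script along the following
lines replots the convergence graph:

# conv.gp: replot the utility convergence (column = state, row = iteration)
# 11 = number of states of the 4x3 world; adjust per problem instance
set xlabel "iteration"
set ylabel "utility"
set key outside
plot for [i=1:11] 'conv.dat' using 0:i with lines title sprintf("state %d", i)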
- Compute the solution of the 4x3 problem presented in the lecture (Fig.1),
with γ=1.0. In your report give the complete utilities with four decimal
digits, the policy, the convergence plot, and the number of iterations of
the algorithm. Present the utilities and the policy in a table corresponding
to the world geometry, with the policy printed using the characters: > < ^ v.
For the termination condition you can take the reduction of all the
differences between successive iterations below 0.0001. Compare your
results with those published in the literature and the lecture. If your
results are different, then your implementation of the algorithm must be
incorrect!
Fig.1: (figure not reproduced here; it shows the 4x3 lecture world, the same
instance as the sample data file at the end of this assignment)
- Compute the solution of the problem shown in Fig.2 (N=4, M=4),
using exactly the same motion uncertainty model as in the lecture example
(p1=0.8, p2=p3=0.1) and the discounting factor γ=0.99. The reward
is now +100 for the terminal state, -20 for the special state, and r=-1
for the normal states. Show the solution as the full policy, the state
utility values (four decimal digits), and the convergence plot.
For the termination condition please again take the reduction of all the
utility differences between successive iterations below 0.0001.
Fig.2: (figure not reproduced here; it shows the 4x4 world with one terminal
state and one special state, as described above)
- Change the reward function for the normal, special, and/or terminal
states (relative to the basic version of the 4x4 problem from Fig.2) in a way
that effects a noticeable (but possibly local) change of the optimal policy.
Give the new complete solution (utilities, policy, plot). Explain the
results.
- Change the action uncertainty model (relative to the basic version of the
4x4 problem from Fig.2) in a way that results in a change of the optimal policy.
Give the complete solution (utilities, policy, plot). Explain the results.
- Change the discounting factor (relative to the basic version of the 4x4
problem from Fig.2) in a way that results in a change of the optimal policy.
Give the complete solution (utilities, policy, plot). Explain the results.
The modifications in the last three steps should be sufficiently large to
noticeably affect the agent's policy, but at the same time so limited, that
the results can be meaningfully compared.
Part II: solving an MDP using Q-learning
Implement computing the agent's optimal policy using the Q-learning method
with exploration. Use the same basic parameters as in the first part of the
assignment. However, now the move uncertainty model and the rewards are
unknown to the agent; they are only used to generate the trials. So the
program must consist of two parts: the agent, which generates actions and
learns the Q function from the trials, and the simulator, which generates the
trials, starting from the designated starting state and generating random
reactions to the agent actions.
The agent should generate actions combining exploitation with exploration in
the following way, controlled by the exploration parameter ε. The
best move in the current state (exploitation) should be selected with
probability 1-ε, and a random move (exploration) with probability
ε.
Solve the problem instance shown in Fig.2 with the parameters: p1=0.8,
p2,p3=0.1, γ=0.99 and the learning parameter α=1/N(s,a) where
N(s,a) is the number of times the action a has been selected in state s
in previous trials. Consider two exploration strategies, with ε=0.05
and ε=0.2.
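A minimal sketch of one trial of such an agent follows (Python; Q and N are
dictionaries defaulting to zero, and world.step() stands for the simulator,
so all these names are assumptions for illustration only):

import random

def run_trial(world, Q, N, epsilon, gamma, actions='ULDR'):
    # One trial: epsilon-greedy action choice, alpha = 1/N(s,a) updates.
    s = world.start if world.start else random.choice(world.nonterminal_states)
    while s not in world.terminal:
        if random.random() < epsilon:
            a = random.choice(actions)                  # exploration
        else:
            a = max(actions, key=lambda b: Q[s, b])     # exploitation
        s2, r = world.step(s, a)   # simulator: random successor and its reward
        N[s, a] += 1
        alpha = 1.0 / N[s, a]
        best_next = 0.0 if s2 in world.terminal else max(Q[s2, b]
                                                         for b in actions)
        Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
        s = s2

After enough trials, the state utility can be read off as the maximum of
Q(s,a) over the actions, and the policy as the maximizing action.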
In this part of the assignment it will be hard to achieve an accuracy
comparable to the first part. We expect to get at least the first decimal
digit right. Experiment to determine how many trials are necessary to achieve
such a result. Begin with 10,000 trials. Estimate how accurately the Q values
have been computed, and in particular how many trials were necessary to
achieve such accuracy for different values of ε.
In your report show in the tabular form corresponding to the world geometry:
the optimal policy, the state utilities and the Q values (give one decimal
digit here), indicating the number of generated trials.
Requirements concerning the programs
The developed programs should determine the optimal policy and the
distribution of the utility function U (first part), or the distribution of
the action value function Q (second part). The programs may use any libraries
and functions generally available for the given programming language.
Each program should solve the specified MDP problem, save partial results (for
creating convergence graphs) to disk file(s), and display the final results,
including the policy, utility U, or action value Q, in a tabular format
consistent with the world geometry. Assume the first world size parameter is
the horizontal size, and the second parameter is the vertical size, with the
origin in the left lower corner.
All results should be displayed in the text format on standard output stdout.
Please use pure ASCII code (7-bit), without any national scripts or graphic
characters. For the policy please use the characters: <,>,v,^
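For instance, a policy table in this layout can be printed as in the
following sketch (Python; the assumption that policy maps (x, y) pairs to
action letters is just an illustrative representation):

ARROWS = {'U': '^', 'D': 'v', 'L': '<', 'R': '>'}

def print_policy(policy, N, M):
    # Print rows from y = M down to y = 1, so (1,1) is the lower left corner.
    for y in range(M, 0, -1):
        print(' '.join(ARROWS.get(policy.get((x, y)), ' ')
                       for x in range(1, N + 1)))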
The parameters of the MDP problem should be loaded from the given file of the
format described below. The file contains the N,M,p1,p2,p3,r parameters of
the MDP problem, and may optionally contain the γ and ε
parameters, which are the agent's parameters affecting the shape of the
solution and the way it is determined.
The program should also accept the γ and ε parameters as
command line arguments in a positional manner (first γ and second
ε). The command line arguments are optional, and if not given, the
parameters from the disk file apply. Any missing parameter, or an incorrect
parameter value, should result in stopping the program with an appropriate
message.
The data file must be opened only for a single-pass sequential reading. The
acceptable ways to specify it are: (1) the file name can be given as a
command line argument (in which case this should be the first obligatory
argument, followed optionally by γ and ε), (2) the file name can
be fixed and built into the program, or (3) the file name can be given in an
environment variable (this is a good option for the Unix/Linux family of
systems, but I do not know if it works on Windows). It is unacceptable for
the program to ask interactively for the file name, to require in general
any interaction with the user, or to read anything from the standard input
stdin.
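With option (1), the argument handling could look like the following sketch
(Python; parse_world() is the hypothetical data file reader shown further
below):

import sys

def main():
    if len(sys.argv) < 2:
        sys.exit("usage: mdp FILE [GAMMA [EPSILON]]")
    world = parse_world(sys.argv[1])        # single-pass sequential read
    try:
        gamma = float(sys.argv[2]) if len(sys.argv) > 2 else world.gamma
        epsilon = float(sys.argv[3]) if len(sys.argv) > 3 else world.epsilon
    except ValueError:
        sys.exit("gamma and epsilon arguments must be numbers")
    if gamma is None or not 0.0 < gamma <= 1.0:
        sys.exit("missing or invalid gamma parameter")
    # epsilon may remain unset in Part I; Part II should check it similarly.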
The program should be written in the given programming language in a way
independent of the operating and window system (Windows, Linux, X11, Android,
etc.), and should also work in the complete absence of a window system.
Data format specification
A data file containing a description of the world instance and experiment
parameters is a text file consisting of a series of records (lines).
Each line starts with a label (text), followed by one to three numerical
parameters. The following labels are allowed:
- W (mandatory) defines the world size: horizontal and vertical (2xINT),
- S (optional) specifies the coordinates of the start state (2xINT),
- P (mandatory) specifies the uncertainty distribution p1 p2 p3 (3xFLOAT),
- R (mandatory) specifies the default reward parameter r (1xFLOAT),
- G (optional) specifies the discounting parameter γ (1xFLOAT),
- E (optional) specifies the exploration parameter ε (1xFLOAT),
- T (multiple - must occur 1 or more times) defines a single
terminal state: two coordinates and an individual reward value (2xINT + 1xFLOAT),
- B (multiple - may occur 0 or more times) defines a single
special state: two coordinates and an individual reward value (2xINT + 1xFLOAT),
- F (multiple - may occur 0 or more times) defines a single
forbidden state: two coordinates (2xINT).
The starting state is optional. The value iteration algorithm does not use
it. In Part II of the assignment (Q-learning), if the starting state is not
given, then use a randomly selected non-terminal state to start each trial.
The γ and ε parameters are optional, as they can alternatively
be given as command line arguments to the program.
The state coordinates are counted from 1 and the origin of the coordinate
system is the bottom left corner of the displayed world representation.
Any order of records is allowed in the data file, provided each of the
mandatory records occurs exactly once, and each of the optional records at
most once. At least one terminal state must also occur.
In addition, MDP world parameters must satisfy the following conditions:
0.0 <= p1, p2, p3 <= 1.0
p1 + p2 + p3 <= 1.0
0.0 < gamma <= 1.0
all defined states (S, T, B, F) must be within the defined world dimensions
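These conditions can be checked right after parsing, e.g. as in this sketch
(Python, using the hypothetical World container from the reader sketch
below):

def validate(w):
    ok = all(0.0 <= p <= 1.0 for p in (w.p1, w.p2, w.p3))
    ok = ok and w.p1 + w.p2 + w.p3 <= 1.0
    ok = ok and (w.gamma is None or 0.0 < w.gamma <= 1.0)
    inside = lambda s: 1 <= s[0] <= w.N and 1 <= s[1] <= w.M
    states = ([w.start] if w.start else []) + list(w.terminal) \
             + list(w.special) + list(w.forbidden)
    ok = ok and all(inside(s) for s in states)
    if not ok:
        raise ValueError("invalid MDP world parameters")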
A sample data file describing the MDP problem instance from the lecture
(also the Russell and Norvig textbook):
W 4 3
S 1 1
P .8 .1 .1
R -0.04
G 1.00
T 4 3 1.
T 4 2 -1.
F 2 2
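A single-pass reader for this format might look like the following sketch
(Python; the World container and its field names are assumptions used
throughout these sketches, not part of the assignment):

class World:
    # Hypothetical container for the parsed parameters.
    def __init__(self):
        self.N = self.M = None
        self.start = self.gamma = self.epsilon = None
        self.p1 = self.p2 = self.p3 = self.r = None
        self.terminal, self.special, self.forbidden = {}, {}, set()

def parse_world(path):
    w = World()
    with open(path) as f:               # opened once, read sequentially
        for line in f:
            fields = line.split()
            if not fields:
                continue
            label, args = fields[0], fields[1:]
            if label == 'W':
                w.N, w.M = int(args[0]), int(args[1])
            elif label == 'S':
                w.start = (int(args[0]), int(args[1]))
            elif label == 'P':
                w.p1, w.p2, w.p3 = (float(a) for a in args)
            elif label == 'R':
                w.r = float(args[0])
            elif label == 'G':
                w.gamma = float(args[0])
            elif label == 'E':
                w.epsilon = float(args[0])
            elif label == 'T':
                w.terminal[(int(args[0]), int(args[1]))] = float(args[2])
            elif label == 'B':
                w.special[(int(args[0]), int(args[1]))] = float(args[2])
            elif label == 'F':
                w.forbidden.add((int(args[0]), int(args[1])))
            else:
                raise ValueError("unknown record label: " + label)
    return w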
Submitting the results
Please prepare a concise report describing the results obtained in each part
of the assignment, with necessary explanations, interpretations, and
conclusions.
In the report, please list all resources used, both literature, as well as
specific program modules: program libraries other than default libraries, and
possibly used ready-made programs or fragments of programs complete with their
sources.
Additionally, please create a development package containing all the
necessary source files and any data sets you created (please do not send the
data files included with the assignment) that are needed to run the
programs. The method of compiling and running the programs should be
described in the README file included in the package.
The package should allow the developed programs to be run on other MDP
instances compliant with the requirements of this assignment.