The goal is to compute solutions to discrete Markov decision problems (MDPs), similar to those discussed in class. The assignment consists of two parts, and a separate program must be written for each part. In the first part the program should directly solve an MDP with known parameters using the value iteration or policy iteration method. In the second part the program should solve an MDP with unknown parameters using the Q-learning method.
The agent operates in a flat grid world of size NxM, moving horizontally and vertically in single-field steps, and at each step receiving the reward specific to the state she entered. The fields of this world are called states: the unique start state (S), terminal states (T), walls and prohibited states (F), and special states (B). The agent starts in the start state (S) and stops upon entering a terminal state (T). She can never traverse a wall or step into a prohibited state (F). All unmarked states are normal states, with the default reward value. The rewards received in special (B) and terminal (T) states are defined individually for each such state.
Note: special states are not terminal. Special states are very much like normal nonterminal states, except they have a different, individualized, reward function.
Example worlds:
| T |   |   |   |
| F | T |   |   |
| S |   |   |   |
or
| B |   |   |   |
| S | F | F | T |
| B |   |   |   |
or
| T | B | T |   |   |
| F | F |   |   |   |
| B | S | B |   |   |
| F | F |   |   |   |
| T | B | T |   |   |
etc.
The rules:
The agent can move (L)eft, (R)ight, (U)p, and (D)own. Executing a move in any direction transfers the agent to the intended state with probability p1; with probability p2 it transfers the agent to the state to the left of the intended direction, with probability p3 to the state to the right of the intended direction, and with probability (1-p1-p2-p3) to the state opposite to the intended one (the examples presented in the lecture assumed p1=0.8 and p2=p3=0.1).
When the agent's move would take her outside the grid (through a wall) or into a forbidden state, then as the result of such a move she stays in place (and receives the appropriate reward). It is as if she bounced off the walls and forbidden states (a sketch of this transition model is given below, after the parameter summary).
Each move causes the agent to receive a reward (if negative, it can be considered to be the cost of a move). The value of the reward depends on the state the agent has just entered. For normal states the default reward value r applies. For the terminal and special states the rewards are defined individually for each such state.
The result of the agent's activity is the discounted sum of the rewards from all steps, computed using the discounting factor γ.
The parameters of the problem are thus: N, M, p1, p2, p3, r, γ, and the individual reward values for all the terminal and special states. In the second part there is additionally the exploration parameter ε.
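As an illustration, here is a minimal Python sketch of the transition model described above. All names are hypothetical, and the grid representation, the bounce rule, and the interpretation of the left/right deviations relative to the intended direction are assumptions to be adapted to your own implementation.

# Hypothetical sketch of the transition model: for a state and an intended
# move it returns the distribution over successor states. Moves that would
# leave the grid or enter a forbidden state "bounce" back to the origin state.

ACTIONS = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
LEFT_OF  = {"U": "L", "L": "D", "D": "R", "R": "U"}   # deviation to the left
RIGHT_OF = {"U": "R", "R": "D", "D": "L", "L": "U"}   # deviation to the right
OPPOSITE = {"U": "D", "D": "U", "L": "R", "R": "L"}   # opposite direction

def successor_distribution(state, action, world, p1, p2, p3):
    """Return a dict {next_state: probability} for the given move."""
    dist = {}
    for a, p in [(action, p1),
                 (LEFT_OF[action], p2),
                 (RIGHT_OF[action], p3),
                 (OPPOSITE[action], 1.0 - p1 - p2 - p3)]:
        dx, dy = ACTIONS[a]
        x, y = state
        nxt = (x + dx, y + dy)
        # bounce off the grid boundary (walls) and forbidden states
        if not world.inside(nxt) or world.is_forbidden(nxt):
            nxt = state
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

Here world.inside and world.is_forbidden stand for whatever representation of the grid world you choose.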
Write a program solving ARBITRARY discrete MDP problems of the type described above, using either the value iteration or the policy iteration method. The program should display on output the computed optimal policy and the utility distribution, both in the tabular format conforming to the world layout.
By saving the utility values from successive algorithm iterations, perform an analysis of the convergence of the algorithm by plotting a graph of the utility values using the Gnuplot program. Include the resulting graph, as well as the developed set of Gnuplot commands and the data file, so that the graph can be replotted for each problem instance.
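A minimal sketch of value iteration with per-iteration logging is shown below. It assumes a hypothetical world object exposing states(), reward(s), is_terminal(s), and a successor_distribution(s, a) method wrapping the transition sketch above. The reward is collected on entering the successor state, as described earlier, and terminal states keep utility 0 because their reward is already counted at the moment of entering them (one possible convention; adapt it to the formulation used in the lecture). The layout of the data file is purely illustrative.

def value_iteration(world, gamma, tol=0.0001, log_path="convergence.dat"):
    """Value iteration; after every iteration the utilities of all states are
    appended to a whitespace-separated data file for plotting with Gnuplot."""
    states = sorted(world.states())
    U = {s: 0.0 for s in states}
    iteration = 0
    with open(log_path, "w") as log:
        while True:
            iteration += 1
            delta = 0.0
            U_new = dict(U)
            for s in states:
                if world.is_terminal(s):
                    continue   # terminal reward is collected on entry (assumed convention)
                U_new[s] = max(
                    sum(p * (world.reward(s2) + gamma * U[s2])
                        for s2, p in world.successor_distribution(s, a).items())
                    for a in "ULDR")
                delta = max(delta, abs(U_new[s] - U[s]))
            U = U_new
            # one line per iteration: the iteration number, then all utilities
            log.write(" ".join([str(iteration)] +
                               ["%.6f" % U[s] for s in states]) + "\n")
            if delta < tol:    # all differences below the threshold
                return U, iteration

A file of this shape can be plotted in Gnuplot with a command of the form plot for [col=2:13] "convergence.dat" using 1:col with lines, adjusting the column range to the number of states.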
Compute the solution of the 4x3 problem presented in the lecture (Fig.1) with γ=1.0. In your report give the complete utilities with four decimal digits, the policy, the convergence plot, and the number of iterations of the algorithm. Present the utilities and the policy in a table corresponding to the world geometry, with the policy printed using the characters: > < ^ v.
For the termination condition you can take the reduction of all the differences between successive iterations below 0.0001. Compare your results with those published in the literature and the lecture. If your results are different, then your implementation of the algorithm must be incorrect!
Fig.1:
+---+---+---+---+
|   |   |   | T |
+---+---+---+---+
|   | F |   | T |
+---+---+---+---+
| S |   |   |   |
+---+---+---+---+
Compute the solution of the problem shown in Fig.2 (N=4, M=4), using exactly the same motion uncertainty model as in the lecture example (p1=0.8, p2=p3=0.1) and the discounting factor γ=0.99. The reward is now +100 for the terminal state, -20 for the special state, and r=-1 for normal states. Show the solution as the full policy, the state utility values (four decimal digits), and the convergence plot. For the termination condition please again take the reduction of all the utility differences between successive iterations below 0.0001.
Fig.2:
+---+---+---+---+
|   |   |   |   |
+---+---+---+---+
|   |   |   |   |
+---+---+---+---+
|   |   | B |   |
+---+---+---+---+
| S |   | F | T |
+---+---+---+---+
Change the reward function for the normal, special, and/or terminal states (relative to the basic version of the 4x4 problem from Fig.2) in a way to effect a noticeable (but possibly local) change of the optimal policy. Give the new complete solution (utilities, policy, plot). Explain the results.
Change the action uncertainty model (relative to the basic version of the 4x4 problem from Fig.2) in a way to result in a change of the optimal policy. Give the complete solution (utilities, policy, plot). Explain the results.
Change the discounting factor (relative to the basic version of the 4x4 problem from Fig.2) in a way to result in a change of the optimal policy. Give the complete solution (utilities, policy, plot). Explain the results.
The modifications in the last three steps should be sufficiently large to noticeably affect the agent's policy, but at the same time so limited, that the results can be meaningfully compared.
Implement computing the optimal agent's policy using the Q-learning method with exploration. Use the same basic parameters as in the first part of the assignment. However, now the move uncertainty model and the rewards are unknown to the agent; they are only used to generate the trials. The program must therefore consist of two parts: the agent, which generates actions and learns the Q function from the trials, and the simulator, which generates the trials, starting from the designated starting state and generating random reactions to the agent's actions.
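A possible shape of a single simulator step is sketched below (illustrative only; it reuses the successor_distribution sketch from Part I, which the simulator, unlike the agent, is allowed to know).

import random

def simulator_step(world, state, action, p1, p2, p3):
    """Apply the true (hidden from the agent) transition model and return
    (next_state, reward, done)."""
    dist = successor_distribution(state, action, world, p1, p2, p3)
    states, probs = zip(*dist.items())
    next_state = random.choices(states, weights=probs)[0]
    return next_state, world.reward(next_state), world.is_terminal(next_state)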
The agent should generate actions combining exploitation with exploration in the following way, controlled by the exploration parameter ε. The best move in the current state (exploitation) should be selected with probability 1-ε, and a random move (exploration) with probability ε.
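For example, an ε-greedy choice could be sketched as follows (hypothetical names; Q is a table mapping (state, action) pairs to values, e.g. a defaultdict(float), and ties between equally good actions may be broken arbitrarily or randomly):

import random

ACTION_LIST = ["U", "D", "L", "R"]

def choose_action(Q, state, epsilon):
    """epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(ACTION_LIST)                   # exploration
    return max(ACTION_LIST, key=lambda a: Q[(state, a)])    # exploitation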
Solve the problem instance shown in Fig.2 with the parameters: p1=0.8, p2=p3=0.1, γ=0.99, and the learning parameter α=1/N(s,a), where N(s,a) is the number of times the action a has been selected in state s in previous trials. Consider two exploration strategies, with ε=0.05 and ε=0.2.
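A minimal sketch of the Q-learning update with the decaying learning rate α = 1/N(s,a) (illustrative; treating the value of a terminal successor as 0 is an assumed convention that should match your Part I formulation):

from collections import defaultdict

ACTION_LIST = ["U", "D", "L", "R"]
Q = defaultdict(float)   # Q[(state, action)], initialized to 0
N = defaultdict(int)     # visit counts N[(state, action)]

def q_update(state, action, reward, next_state, done, gamma):
    """One Q-learning step with alpha = 1/N(s,a)."""
    N[(state, action)] += 1
    alpha = 1.0 / N[(state, action)]
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTION_LIST)
    target = reward + gamma * best_next            # learning target
    Q[(state, action)] += alpha * (target - Q[(state, action)])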
In this part of the assignment it will be hard to achieve accuracy comparable to the first part. We expect to get at least the first decimal digit right. Experiment to determine how many trials are necessary to achieve such a result. Begin with 10,000 trials. Estimate how accurately the Q values have been computed, and in particular how many trials were necessary to achieve such accuracy for different values of ε.
In your report show, in tabular form corresponding to the world geometry: the optimal policy, the state utilities, and the Q values (give one decimal digit here), indicating the number of generated trials.
The developed programs should determine the optimal policy and the distribution of the utility function U (first part), or the distribution of the action value function Q (second part). The programs may use any libraries and functions generally available for the given programming language.
Each program should solve the specified MDP problem, save partial results (for creating convergence graphs) to a disk file or files, and display the final results, including the policy and the utility U or the action value Q, in a tabular format consistent with the world geometry. Assume the first world size parameter is the horizontal size and the second is the vertical size, with the origin in the lower left corner.
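Printing with the origin in the lower left corner means the row with the largest y coordinate is printed first; a possible sketch (hypothetical names, pure-ASCII output):

def print_grid(values, N, M, width=8):
    """Print a dict {(x, y): value} as a table matching the world geometry:
    x runs from 1 to N left to right, y from 1 to M bottom to top."""
    for y in range(M, 0, -1):                  # topmost row printed first
        row = [str(values.get((x, y), "")).rjust(width) for x in range(1, N + 1)]
        print(" ".join(row))

The same routine can be used for the policy table (values "<", ">", "^", "v"), the utilities, and the Q values.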
All results should be displayed in text format on the standard output (stdout). Please use pure ASCII code (7-bit), without any national scripts or graphic characters. For the policy please use the characters: <, >, v, ^
The parameters of the MDP problem should be loaded from a file in the format described below. The file contains the N, M, p1, p2, p3, r parameters of the MDP problem, and may optionally contain the γ and ε parameters, which are the agent's parameters affecting the shape of the solution and the way it is determined.
The program should also accept the γ and ε parameters as positional command line arguments (first γ, then ε). The command line arguments are optional; if they are not given, the parameters from the disk file apply. Missing parameters, or incorrect parameter values, should cause the program to stop with an appropriate message.
The acceptable ways to access the data file: (0) the file may be opened only for a single-pass sequential reading; (1) the file name can be given as a command line argument (in which case this should be the first, obligatory argument, followed optionally by γ and ε); or (2) the file name can be fixed and built into the program; or (3) the file name can be given in an environment variable (a good option for the Unix/Linux family of systems, but I do not know whether it works on Windows). It is unacceptable for the program to ask interactively for the file name, to require any interaction with the user in general, or to read anything from the standard input (stdin).
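A possible sketch of command line handling following option (1) above (file name as the first obligatory argument, then optional positional γ and ε; all names are illustrative, and the accepted ranges follow the conditions listed further below):

import sys

def parse_args(argv):
    """Return (filename, gamma, epsilon); gamma/epsilon are None if not given.
    Missing or invalid values stop the program with a message."""
    if len(argv) < 2:
        sys.exit("usage: program <data-file> [gamma] [epsilon]")
    filename = argv[1]
    gamma = epsilon = None
    try:
        if len(argv) > 2:
            gamma = float(argv[2])
            if not 0.0 <= gamma <= 1.0:
                raise ValueError
        if len(argv) > 3:
            epsilon = float(argv[3])
            if not 0.0 < epsilon <= 1.0:
                raise ValueError
    except ValueError:
        sys.exit("error: gamma must be in [0,1] and epsilon in (0,1]")
    return filename, gamma, epsilon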
The program should be written in the given programming language in a way independent of the operating and window system (Windows, Linux, X11, Android, etc.), and it should also work in the complete absence of a window system.
A data file containing a description of the world instance and the experiment parameters is a text file consisting of a series of records (lines). Each line starts with a label (text), followed by one to three numerical parameters; the allowed labels are illustrated by the sample data file below.
The starting state is optional. The value iteration algorithm does not use it. In Part II of the assignment (Q-learning), if the starting state is not given, then use a randomly selected non-terminal state to start each trial.
The γ and ε parameters are optional, as they can alternatively be given as command line arguments to the program.
The state coordinates are counted from 1 and the origin of the coordinate system is the bottom left corner of the displayed world representation.
Any order of records is allowed in the data file, provided exactly one of each mandatory record occurs and each optional record occurs at most once. At least one terminal state record must also occur.
In addition, MDP world parameters must satisfy the following conditions:
0.0 <= p1, p2, p3 <= 1.0 and p1 + p2 + p3 <= 1.0
0.0 < ε <= 1.0
0.0 <= γ <= 1.0
all defined states (S, T, B, F) must be within the defined world dimensions
A sample data file describing the MDP problem instance from the lecture (also the Russell and Norvig textbook):
W 4 3
S 1 1
P .8 .1 .1
R -0.04
G 1.00
T 4 3 1.
T 4 2 -1.
F 2 2
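A minimal sketch of a parser for this record format is given below. The meanings of the labels W, S, P, R, G, T and F are taken from the sample above; the labels used here for the special-state (B) and ε (E) records are assumptions and should be replaced by whatever the actual specification defines.

def load_mdp(filename):
    """Single-pass sequential read of the data file; returns a dict of parameters."""
    mdp = {"terminals": {}, "specials": {}, "forbidden": set(),
           "start": None, "gamma": None, "epsilon": None}
    with open(filename) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            label, args = fields[0], fields[1:]
            if label == "W":        # world size: N M
                mdp["N"], mdp["M"] = int(args[0]), int(args[1])
            elif label == "S":      # start state: x y
                mdp["start"] = (int(args[0]), int(args[1]))
            elif label == "P":      # probabilities: p1 p2 p3
                mdp["p1"], mdp["p2"], mdp["p3"] = map(float, args)
            elif label == "R":      # default reward r
                mdp["r"] = float(args[0])
            elif label == "G":      # discounting factor (optional)
                mdp["gamma"] = float(args[0])
            elif label == "E":      # exploration parameter (assumed label)
                mdp["epsilon"] = float(args[0])
            elif label == "T":      # terminal state: x y reward
                mdp["terminals"][(int(args[0]), int(args[1]))] = float(args[2])
            elif label == "B":      # special state: x y reward (assumed format)
                mdp["specials"][(int(args[0]), int(args[1]))] = float(args[2])
            elif label == "F":      # forbidden state: x y
                mdp["forbidden"].add((int(args[0]), int(args[1])))
            else:
                raise ValueError("unknown record label: " + label)
    return mdp

After loading, the parameters should be validated against the conditions listed above, and γ/ε overridden by the command line arguments if those were given.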
Please prepare a concise report describing the results obtained in each part of the assignment, with the necessary explanations, interpretations, and conclusions.
Please be brief. Do not include the text of the assignment or its interpretation, or any theory or formulas that are available in the published literature (cite sources instead).
When presenting results, make all reasonable abbreviations, round numerical values to the required and appropriate precision, and summarize. Do not present multiple results without comments or discussion.
As a rule, avoid presenting results as screenshots (unless you are presenting the appearance of the program interface rather than the obtained results). Tabulated results should be presented as text, not images. If you do include images, make sure they are readable; for example, white text on a bright yellow background is typically very hard to read, as is text displayed in a very small font on a large background.
The report must list all the resources used, both literature and specific program modules: program libraries other than the default ones, and any complete programs or program fragments that were used, stating their sources.
Additionally, please supply a development package containing all the source files and created data sets (please do not send the data files included in the assignment) that are necessary to run the programs. The method of compiling and running the programs should be clearly and completely described in a README file included in the package.
The package should enable the developed programs to be tested on other MDP instances compliant with the requirements of this assignment.