In Reinforcement Learning there is an environment known as Gridworld. In this environment you have a grid and there is an agent that learns how to find the shortest path from one cell to another. The theme of reinforcement learning is that you do not want to hard-code the rules, but you want the agent to explore until it can find a set of moves that are optimal for the problem at hand. Usually you can alter the grids to make the tasks tough–set ‘traps’, add obstacles, etc. We are considering grids with obstacles, and an interesting question that came up is the following,

Given two grids of size ${N}$, say ${G, \,G'}$ which have respectively ${k,l}$ obstacles where ${k,l\in \mathbb{N},\,k,l\geq 0}$, what are reasonable ways to put an order on the ‘complexity’ of the grids?

In other words, we want to be able to say that, for instance, in ${G}$ the agent will find the optimal path more easily than in ${G'}$ given any two grids ${G,G'}$. This comparability is what we mean by order in the above statement. It is clear that putting obstacles in some grid makes some paths obsolete. It is interesting to note how this affects the time it takes to find the optimal path. A brute force approach to the problem is to consider constructing many random scenerios and use the average experience to draw conclusions, otherwise we might try to see how the connectedness of the grid-structure relates to the average time it takes to find the optimal path. This is of interest because sometimes by reframing a problem into a field that is well established one can do more exploring, which could possibly lead to discovering more interesting ties between the work a person is doing and theories that are relatively well developed (or sometimes not, but in any case it is always interesting to see how far you can push ideas and what interesting ties you can find in doing that). The purpose of this writing is to explore whether or not connectivity can be useful for our purposes. A lot of different methods were tried, but we only present two. The procedure for investigating the rest was done similarly.

The Diffusion Approach

Consider a very drunk person, X, trying to walk. Trying to predict what the next step is going to be at any point would be very difficult, as one can imagine. The steps are somewhat random. Suppose we convince X to visit our grid (certainly not kidnap him), and suppose that we could somehow restrict the moves in this environment to only up, down, right and left.
If we let X wander around the grid, so that somehow each step takes him into a different corresponding grid cell, then after $\tau$ steps ($\tau\to \infty$) he will likely visit all the states/cells. We use likely here since there is a chance of getting moves that would not permit a win, say, left-right-left-right-…..} forever. The probability for this gets small as $\tau\to \infty$, but is nonetheless a positive real number.
If ${N}$, the grid size, is large then so will ${\tau}$. If we talk arbitrarily of an agent whose moves are drawn from a set of possible moves randomly, then we say that we have a random walk. Given a finite state space it is easy to see that after a possibly long time a random walk will lead to the exploration of all states. A practical way of answering the question posed is to find the average time it takes for a random walk to hit the states of interest. This method is more robust, and because it involves computing the quantity of interest directly from the grids in the same spirit that an agent will when it is still learning in its environment.

Graph Connectivity

A graph is a pair ${(V,E)}$ where ${\emptyset \ne V}$ is the set of vertices/nodes and ${E}$ is the set of edges between the vertices. A grid can be visualised as a graph with very little work where the nodes are the states, and the edges a representative of paths between the cells. For example, consider the following,

The graph representation can be realised in the following way:
1) Start at, say, ${(n,n')}$.
2) Consider the possible states with effective moves, i.e moves that are towards the target cell.
3) Construct edges from ${(n,n')}$ to ${(a,b),\,(c,d)}$ from above, then repeat for each of these new states until you have the desired terminal state.

As an example using grid (1), from $(0,0)$, if you are moving to $(3,3)$, then effective moves will lead to $(0,1)$ and $(1,0)$. From here we observe that from $(0,1)$ we can move to $(1,1)$ or $(0,2)$ and from $(1,0)$ we can move to $(1,1)$ or $(2,0)$. Proceeding this way, we obtain a grid representation. Although we only speak of effective moves, we do not put direction to the edges. Considering all possible states, we also obtain the same representation, but it gets messy too quickly.
Starting at cell $(0,0)$ to cell $(3,3)$ produces a graph representation that is exactly like the grid itself.

We say that a graph is connected if between any two vertices there is an edge from one vertex to the other. Otherwise, the graph is disconnected. If ${G}$ is a graph ${S}$ is a set of vertices of ${G}$. If ${G}$ is connected and ${G\backslash S}$ is disconnected then ${S}$ is said to be a vertex-cut. A graph ${G}$ is said to be complete if every edge is connected to all the other edges, and incomplete otherwise. Given a connected graph ${G}$ we define the connectivity as

$\displaystyle \kappa(G)=\begin{cases} \min{|S|}: S\textit{ is a vertex-cut of G}, & \text{if G is not complete}\\ n-1, &\text{otherwise}\end{cases}$

Loosely speaking, this is a measure of how much work you have to do to disconnect a given graph structure.

A Brief Discussion

There are two observations that one can make easily. If a graph ${G}$ is more connected than ${G'}$ then there are most likely more ways of going from one vertex to another. This is a good thing if the agent would take good moves from the start, then the movement would be towards the desired cell and any path taken would lead to a win. The price to pay is that the initially, the agent is learning and this means that it does not take good moves only. The curse of having more paths to take is that there are more ways of getting lost. We want to investigate how this affects the expected time taken to hit the terminal state.

Hence, we do the investigation as follows:
1) Given a graph ${G}$. Compute the number average time it takes a random walk to hit the states of interest, ${\tau}$, and the connectivity of the graph.
2) Given a second graph ${G'}$, we do the same.
3) After generating a sufficiently big number of different graphs we investigate the relationship between the ${\tau}$‘s and the connectivities.

Implementation

As mentioned above, the grid variation procedure for our context involves putting obstacles on the cells. This is essentially taking out vertices on the graph. In this section, we generate grids whose order we compute using these two methods and we discuss whether or not the results are reasonable. We specifically choose variations that are simple, but give us an important pitfall of the connectivity approach. We had no apriori knowledge that a variation like this would fail, and it came up in the testing, but instead of presenting all the variations that work, we offer the simple case that fails.
In this variation, the first grid has an obstacle at $(1,1)$, then for the second we consider an obstacle over the cells $(1,1),\,(1,2),\,(2,1),\,(2,2)$ (i,e the square-subgrid whose diagonal elements are $(1,1),\,(2,2)$), then we proceed to the third grid whose obstacle is square-subgrid with diagonal elements $(1,1),\,(2,2),\,(3,3)$, etc.

Below are subplots obtained using a grid of size $10$ to illustrate this. Notice that Networkx in Python represents graphs arbitrarily prioritising the way in which different vertices are connected to each other. By staring at the diagrams below it should be easy to see that the nodes (red) are connected the same way cells in a grid are connected to each other. Putting an obstacle on a grid cell is equivalent to removing the node on the graph as well as the edges that are incident on the node. Using the procedure in the previous paragraph to generate variations, we obtain the following

Somewhat surprisingly, we obtain figure 2 when plotting the average time it takes to hit the terminal state against the exact same pattern. This means severing the different possible paths makes it less easy to traverse the grid. The case that would be the simplest if we only took good moves becomes the worst case. Note that the values on the ${y}$-axis are normalised so that ${\sum{y_i}=1}$, so a value close to ${0}$ indicates that the average time taken to hit the terminal state is significantly less than that whose value is close to ${1}$. In fact, during the trials the numbers for steps from the initial state to the terminal state for (a)-(g) were less than ${1000}$ while for (h) the agent needed slightly above ${1000}$ steps and for (i) the number was around ${100000}$ steps. Of course, at this point, adding a bigger block would have disconnected (i) and there would be no way to complete the tasks so the number of steps would simply diverge. The behaviour of the values excluding (i) showed a promising monotone increase with small increments, but this is smudged out by (i) since the value obtained is too big in comparison.

Computing the connectivity of the graphs using the Networkx library on Python, we obtain figure 3.This is not very surprising, but at the same time shows that although the idea seems practical and attractive, it breaks down quite easily with very simple graph structures, which in this case is the representations of grids.

It is easy to see that removing the nodes that represents the cells ${(0,1),(0,1)}$, i.e the cells right after the initial cell ${(0,0)}$ is sufficient to disconnect the graph structure. We can do this for any of the corners, and this turns out to be the best we can do, and we can’t disconnect the graph by only removing one vertex. So connectivity does not quite work, because it is not sensitive enough to the deformations we are considering. Hence, although there would have been many exciting things to consider had it worked we have to abandon the idea. There are other alternative to consider of course, for instance we could consider the different paths that emanate from the initial cell of interest to the desired terminal state. This should clearly be a more sensitive number to the variations in question than the general connectivity of the graph, but it did not improve the speed in any way, so we do not consider it.

Conclusion

The goal was to impose an order for grid complexity in the context of Reinforcement Learning. We considered two methods that both seemed reasonable–a monte-carlo approach (inspired by idea of diffusion) as well as a graph theoretic approach (based on connectivity of graphs). In the end, we did experiments to see which method works best. The monte-carlo approach works well although it is not the fastest between the two. It is more sensitive to changes in graph structure, but of course the price is computational power and speed. The connectivity approach is plausible if the structure of graphs chosen wildly varies in a way that affects how well connected the graph is. We see in the experiments that we could generate a few easy variations to which this method was not sensitive, so we choose to use the monte-carlo approach in the context of random walks on graphs for the progress of the research project. Hence, the ordering needed will be pursued using this method and all analysis, if needed, will be done on the context of random walks on graphs.