next up previous contents
Next: Experimentation Up: No Title Previous: EMuds, Aims and Methods

Subsections

Learning Classifier Systems

Originally described by Holland in [24], learning classifier systems (LCS) are learning systems, which exploit Darwinian processes of natural selection in order to explore a problem space. As such, LCS are among the few AI techniques that integrate both an internal adaptation process (reinforcement learning) and an external adaptation process (genetic algorithms) in a single decision-making algorithm. While Goldberg used LCS in his book about genetic algorithms [20] as an application of the theory presented, learning classifier systems had remained relatively unexplored until Wilson successively simplified and improved some aspects of the original systems, dubbing the successors ZCS and XCS. The main transformations are explained in detail in [69,70,30] and a summary of the state of research in XCS systems can be found in [71]. The resulting system, XCS, which I will describe in this chapter exhibits some very good learning and generalization abilities in the tested environments and has the advantage of being well adapted to symbolic discrete environment problems.

Classifier Systems

A classifier system (CS) is a rule-based system for decision making, classifier systems are a special kind of production system [12]. Each rule maps a problem state into a solution or an intermediate new state, where the system can be applied again. This sequence of deductions leads to the systems answer to the problem. Rules in classifier systems are called ``classifiers'' and are of the form if <condition> then <action>. When applied to a problem, a CS is usually presented with a message representing the current situation, the classifier system then attempts to match this message with one or more classifier conditions. If some conditions match, then an action among those advocated by the matched classifiers is selected and applied. The basic system is thus based on a set of classifiers, a matching process and an action selection process.

The learning classifier systems add adaptation to the basic CS through two components. The first is a reinforcement learning algorithm similar to Q-Learning [27] that operates on the action selection process and that I introduce in section 7.4.3. The second is a rule discovery system implemented as a genetic algorithm [23,20] that operates on the classifiers as a population to generate diversity in the classifier set, allowing exploration of the problem space. This component is introduced in section 7.4.4. The overall architecture of an LCS agent is illustrated in figure 7.1.


  
Figure 7.1: Learning classifier system architecture.
\begin{figure}\begin{center}\epsfxsize=10cm
\epsfbox{Illustrations/lcs_lcs.eps}\end{center}\end{figure}

Within an agent system context, the classifier system is the agent's control algorithm with the problem space being the environment and messages the perceived current environment conditions. A learning classifier system provides the agent with an adaptive mechanism to deal with varying environment situations and learn better action patterns through experience. Depending on the type of environment, detectors and effectors have to be customized for the agent to convert perceptions into messages and actions into effector operations. I will present the basics of reinforcement learning and genetic algorithms in the next two sections, before giving an analysis of the XCS classifier system and its operation principles.

Reinforcement Learning

Reinforcement learning (RL) is a form of learning that is well suited for unsupervised learning situations and is guided by the following principle [61]:
If an action taken by a learning system is followed by a satisfactory state of affairs, then the tendency of the system to produce that particular action is strengthened or reinforced. Otherwise, the tendency of the system to produce that action is weakened.
In reinforcement learning (associative reinforcement learning to be precise), a system attempts to learn a mapping from inputs to outputs, guided by a reinforcement signal that represents satisfaction in the sense of the preceding principle. From the agent perspective, an agent, connected to an environment by sensors and effectors, perceives states in the environment and converts them to inputs describing such currently perceived states. The agent then produces an output that is effected as an action in the environment. From the transformation operated in the environment, the agent then receives as reward a scalar reinforcement signal. The behavior of the agent is aimed at maximizing rewards received as reinforcement. The reinforcement learning model (see figure 7.2) consists in:
  
Figure 7.2: The reinforcement learning model.
\begin{figure}\begin{center}\epsfxsize=10cm
\epsfbox{Illustrations/lcs_rl.eps}\end{center}\end{figure}

The purpose of reinforcement learning theory is to find an ``optimal'' policy B for the mapping $B:I \to O$. Optimality being defined in relation to the reinforcement signals r received by the agent over time. Depending on the context, many different algorithms can be applied in order to solve reinforcement learning problems, and actually, even evolutionary techniques such as genetic algorithms could be considered as reinforcement learning within the above framework. Traditionally, though, reinforcement learning algorithms are based on statistical techniques and dynamic programming used to estimate and improve behavior-generating algorithms.

RL Methodology

The main aspects that influence the choice of method are the following: Given the problem specifics, Kaelbling et al. [27] introduce formally justified methods of reinforcement learning and ad-hoc methods. Unfortunately, most demonstrated techniques suffer from scaling problems when applied to the more complicated situations (in particular when the environment is not stable), also, algorithms used for calculation in these techniques can be relatively inefficient. The most commonly used methods in the field of Artificial Intelligence are thus the ad-hoc methods, that are more easily implemented to deal with generalization problems and delayed rewards.

A Mathematical Formulation of Optimality in RL

From the reinforcement learning model previously described and a discounted infinite horizon optimality model, the problem of reinforcement learning can be formulated mathematically in a Markovian decision process perspective as follows [27].

The optimal value of a state s is the maximum over all action selection policies $\psi$ of the expected discounted sum of rewards over all stochastic transitions (with $\gamma$ the discount factor and rt the reward at time t):

\begin{displaymath}V^{\ast}(s) \stackrel{df}{=} \max_{\psi}
[E(\sum_{t=0}^{\infty}\gamma^t r_t)]\end{displaymath}

It can be shown that this value is unique and is a solution of the equation:

\begin{displaymath}V^{\ast}(s) = \max_{a \in A}[R(s,a) + \gamma \sum_{s' \in
S}T(s,a,s') V^{\ast}(s')]\end{displaymath}

where T(s,a,s') is the probability of making a transition from state s to state s' through action a. From the previous definition and the formula for $V^{\ast}$, an optimal decision policy $\psi^{\ast}: S
\longrightarrow A$ is a policy that finds an action a such that $R(s,a) + \gamma \sum_{s' \in S}T(s,a,s') V^{\ast}(s') = V^{\ast}(s)$in every given state s. The objective of a reinforcement learning algorithm is thus to find a policy $\psi$ that minimizes the value $\sum_{s \in S}\vert\psi^{\ast}(s)
- \psi(s)\vert$.

Finding an exact solution for $\psi^{\ast}$ and $V^{\ast}$ is possible by using dynamic programming methods, when T and R are known, the problem faced by reinforcement learning methods is to find a solution when this knowledge is not directly available, but must be sought in the environment through trial and error.

Q-Learning

I present here a well-known algorithm developed by Watkins [67,68], that can be proved to converge to an optimal solution under certain conditions. This algorithm forms the basis of the reinforcement learning algorithm used in the XCS classifier system. In Q-Learning, a state value is defined for a given policy $\psi$:

\begin{displaymath}V^{\psi}(s) = R(s,\psi(s)) + \gamma \sum_{s' \in S}T(s,\psi(s),s')
V^{\psi}(s')]\end{displaymath}

and then a Q value is defined for this policy and every pair (s,a):

\begin{displaymath}Q^{\psi}(s,a) = R(s,a) + \gamma \sum_{s' \in S}T(s,a,s')
V^{\psi}(s')\end{displaymath}

indicating the expected discounted reward for executing action aat state s and pursuing the same policy thereafter. For this definition, if $\psi^{\ast}$ is an optimal policy, then $V^{\ast}(s) = \max_{a \in A}[Q^{\ast}(s,a)]$, where $ \max_{a \in
A}[Q^{\ast}(s,a)] \stackrel{_{df}}{=} \max_{a \in
A}[Q^{\psi^{\ast}}(s,a)]$ and:

\begin{displaymath}Q^{\ast}(s,a) = R(s,a) + \gamma
\sum_{s' \in S}T(s,a,s') \max_{a' \in A}[Q^{\ast}(s',a')]\end{displaymath}

If this $Q^{\ast}$ value can be learned, then an optimal policy can be built from it by finding one action that maximizes $Q^{\ast}(s,a)$ for each state s.

The Q-Learning algorithm estimates this optimal Q value by building a table of randomly initialized Q values for all state-action pairs and updating these values with a Widrow-Hoff delta learning rule. The delta rule adjusts a parameter x towards an estimate of its target value y by replacing x with $x + \beta (y - x)$, ( $0 < \beta
\leq 1$), which is simply written $x \stackrel{_\beta}{\longleftarrow}
y$. The value $\beta$ being the learning rate. It is clear that when y is stationary, this forms a sequence of x values that converge to y. In the algorithm, the delta rule is expressed as:

\begin{displaymath}Q(s,a) \stackrel{\beta}{\longleftarrow} r + \gamma \max_{a' \in
A}[Q(s',a')]\end{displaymath}

With r the actual reinforcement received for executing action a in state s. Each time a table state-action position is chosen to act, this rule is applied to the position to update its value. Over time, the table positions are thus alternatively updated and it can be shown that if all positions are updated regularly enough as time goes on and the learning rate is appropriately adapted at each step, this tabular representation of Q(s,a) will tend to $Q^{\ast}(s,a)$.

Genetic Algorithms

The second component of a learning classifier system is a rule discovery system that is implemented as a genetic algorithm (GA). A genetic algorithm is a search and optimization algorithm based on the Darwinian principles of natural selection observed in nature, but applied to a set of digital genotypes. These digital genotypes are then interpreted as the parameters that can be operated upon for solving the optimization problem at hand. Traditionally, a genetic algorithm is used as a phylogenetic adaptation mechanism that operates on a population of potential solutions to a problem and is based on a measure of the quality (called fitness) of each individual's solution. Individuals are identified by an artificial genome that describes their solution to the problem. The search for an optimal solution works by randomly selecting individuals with a bias towards the fittest and propagating their genetic material in new individuals, possibly through genetic operators such as crossover and mutation. This process usually leads to an increasing average fitness in a population since fitter individuals are more likely to be selected for reproduction.

Implementation of a GA

In many genetic algorithms, an individual is represented by a bitstring (i.e. a string of zeroes and ones) in the set Bl of bitstrings of a specified length l, this string is both the genome of the individual and a coding of the solution that it provides to the problem under study. The algorithm operates on a population $P_t =
\{s_i(t) \: \vert\: s_i(t) \in B_l,\: i \in \{1,..,n\}\}$ of such individuals at every time step t, the population having a predefined number n of individuals. A mapping $\phi: B_l \longrightarrow
\mathbbm{R}$ has to be provided that defines the fitness of individuals in order to provide a selection criterion for the algorithm. The search process then iteratively selects individuals with probability proportional to their fitness and generates new individuals from their genome, thereby forming new populations of individuals. There are a few different methods that can be used to produce population updates either as in the original genetic algorithm that was described as selecting individuals from a population Pt at time t and generating n new individuals to form the next population Pt+1, or as in the variant of steady-state populations used in the XCS where, at each step, a few (one or two) new individuals are created from the current population's fitter individuals and replace unfit individuals of the current population. The three genetic operators used to produce new individuals are: The sequence of populations generated by the algorithm explores the solution space of the problem by accumulating individuals in regions of high fitness value and converging to the (or one of the) global maximums of this fitness landscape.

The search procedure provided by a genetic algorithm is, in most cases, provably better than a random search in the solution space of a problem, although for a large search space the procedure can be slow. The convergence of the algorithm has been proved in the Schemata Theorem [20] by studying generalizations of bitstrings called schemata that represent families of individual bitstrings. Schemata are generalizations of bitstrings and are identical to the classifier conditions used by the XCS system that I introduce in the next section.

XCS Design

The XCS classifier system is an improvement on the original design of classifier systems that was presented by S. W. Wilson in his 1995 article Classifier Fitness Based on Accuracy; it promotes a different approach to the reinforcement learning/genetic algorithm relation in the system adaptation process that allows better generalization of knowledge stored in the form of classifiers. Since much of the relevant work in the field of LCS is now based on this system design, I introduce it here without presenting the original designs of classifier systems. The following subsections describe the elements used in a cycle of the classifier system life. The steps in a cycle are applied repeatedly until the system is stopped by a user and consist in:
1.
detecting an environment message and forming a set of classifiers that match the message
2.
forming action sets in the match set and selecting an action
3.
applying the reinforcement algorithm to classifiers
4.
applying the genetic algorithm to the classifier population
Along such a cycle, the learning/genetic algorithm is focussed on the evaluation and use of the three following parameters:

Conditions, Messages and the Matching Process

When dealing with classifier systems, sensory stimuli are converted to messages which are conventionally formed of a bitstring of specified length. A population $\mathcal{C}$ of classifiers forms the system, each classifier $\chi$ consisting in a condition string and an action string ($c_\chi$,$a_\chi$). The condition string is of the same length as messages and holds three different types of elements: zeroes, ones or wildcard characters. I will therefore refer to these characters as trits and the strings involved as tritstrings. When the matching process is invoked, it uses the condition strings of the classifiers in the classifier population to test whether a match has occurred by comparing successive bits in the message with trits from the classifier. A position is matched in both strings when either the message string holds a one and the condition string a one or a wildcard character, or the message string holds a zero and the condition string a zero or a wildcard character. The wildcard character thereby acts as a don't care symbol that I will hereafter note #. A classifier matches a message if all positions of the classifier condition match corresponding positions in the message. For example, the message 0010110111 is matched by the condition 0##011#11# since every specified bit in the condition is the same as that of the message and we ``don't care'' about the other positions. A match set is created from the subset of classifiers that match the current message before proceeding to the action selection stage.

Action Selection

The action string in a classifier is a bitstring that codes the associated action identifier of that classifier and has a length depending on the number of possible actions the agent may execute. Once a match set has been formed, the system decides among the possible actions which one it will execute. To this end, a prediction value is stored by each classifier, giving an estimate of the reward the classifier usually receives when it is matched and its action is chosen, along with an estimate of the accuracy of this prediction (these parameters are calculated by the reinforcement learning mechanism which we will study in the next subsection). For each action $a \in A$, the set $\mathcal{A}(a)$ of classifiers in the match set that advocate this action is formed. A prediction array p is then formed by calculating a prediction value of each of these sets. The prediction value of such a set is the average of the predictions of the classifiers in the set, weighted by their accuracy:

\begin{displaymath}p=\{p(a)\}_{a \in A} \:\mbox{, with }\: p(a)=\frac{\sum_{\chi...
...
\pi(\chi)\phi(\chi)}{\sum_{\chi \in \mathcal{A}(a)}\phi(\chi)}\end{displaymath}

One of these action sets is selected based on its prediction and named the current action set $\mathcal{A}$. For example, using deterministic action selection, the action set with highest average prediction is selected. Another way to choose the action would be by using a biased random selection mechanism over the predictions of each set.

   
Reinforcement Learning Component

The reinforcement learning algorithm used in an XCS is applied to the prediction, error and fitness parameters defined previously. The updates of these parameters are usually applied in the following order: prediction error, prediction, fitness, but other orders also work.

Two types of problems are distinguished when calculating parameter updates, single step problems and multi step problems. Single step problems are problems where reward depends only on the current state-action pair and the transition function maps all pairs to the uniform probability distribution over the state space (i.e. the state of the next step does not depend on the current state and action). A multi step problem is the more general situation, where the state transition function is not constant and where the system must also learn it. Both situations are studied in the experimental chapter.

In a single step problem, the reinforcement is applied to all classifiers of the current action set, using a reinforcement value of $\rho_t = r_t$. In a multi step problem, the reinforcement is applied to the previous step's action set, using a discounted reinforcement value $\rho_t = r_{t-1} + \gamma \max_{a \in
A}[p_t(a)]$, the t indicating to which time step the value belongs and pt(a) being the prediction value of a's action set at time t, as defined in the preceding subsection. For each classifier $\chi$ to update, the reinforcement rules are:

In these learning rules, $\beta$ controls the learning rate, $\varepsilon_0$ is the limit from which a classifier is not considered accurate anymore and $\alpha$ with $0<\alpha<1$ is the accuracy value given to a classifier of error $2\varepsilon_0$.

In practice, in XCS, the technique of the ``moyenne adaptive modifiée'' (MAM) introduced by Venturini [64] is applied for the first $1/\beta$ action cycles of the system, to speed up the initial convergence of the system.

   
Genetic Algorithm Component

The XCS system is made of a population of classifiers $\mathcal{C}_t
\subseteq \mathcal{C}$ that does not usually hold all $\chi \in
\mathcal{C}$. Since the action selection mechanism is applied to this population $\mathcal{C}_t$, it is essential that this set hold relevant classifiers for all environment states encountered by the system. The genetic algorithm's role is to discover which classifier is relevant and must be part of the classifier population and which can be safely omitted from this population. The basis on which this is accomplished is described in the analysis of the XCS section.

As was mentioned earlier, the genetic algorithm operates on the classifier population $\mathcal{C}_t$. At every step, the genetic algorithm is applied to the population with a probability $1/\theta$. If it is applied, two individuals are selected in the current action set proportionally to their fitness $\phi(\chi)$ as calculated by the reinforcement learning component. These individuals are then either reproduced with a mutation factor of $\mu$ at each of their sites or, with probability $\nu$, they are crossed over at one random position along their condition tritstring or action bitstring. The two new individuals are then inserted in the population and if this population is larger than its predefined maximum size, two unfit classifiers are deleted from the population.

An Analysis of the XCS

In this section, I give my mathematical understanding of an XCS classifier system and show that under certain restrictive assumptions, the XCS actually implements the Q-Learning algorithm. I will then discuss the full XCS implementation and the importance of its genetic accuracy based component for generalization purposes. But first I start by making a summary of the notations I will use, some of which were previously defined and used.

Notations

Mathematical notations: Reinforcement learning notations: XCS notations:

Simplified XCS is Q-Learning

In [15], Dorigo and Bersini present a comparison between the original LCS of Holland and Goldberg; they show that by restricting the LCS to a subsystem called the very simple classifier system with a modification to the learning rule, their system functions in a way that is identical to a Q-Learning algorithm. I will show here that the XCS system can also be considered as a Q-Learning algorithm given some restrictions.

For the XCS to become a Q-Learning implementation, one restriction is necessary, although it is a major one, the removal of the genetic algorithm component of the system. One assumes (enforces) that the population of classifiers present in the system at every time-step consists in only and all the specific classifiers, that is $\mathcal{C}_t = \mathcal{C}', \forall t$, so that these classifiers form a table similar to that used in tabular Q-Learning. This implies that there is no genetic algorithm component and only the prediction values of classifiers need to be learned (accuracy is not needed since action sets hold only one classifier, as we will see). The XCS algorithm then runs in three steps: acquire the environment state sand form a match set $\mathcal{M}(s)$ for this state, evaluate the prediction value of the action sets in $\mathcal{M}(s)$ and select an action, obtain reward and reinforce the selected action set.

Since the classifier population consists in only the specific classifiers, the match set will hold |A| classifiers, one for each action in A, and every action set will hold only one classifier, the classifier whose condition is exactly the current environment state. The prediction value of these action sets will thus be the prediction value $\pi(\chi)$ of their only classifier (accuracies simplify away in the weighted sum calculation) and action selection as well as reinforcement can be considered to operate on the classifiers individually. Note also that we have an isomorphism between the population of classifiers and the set of state-action pairs: $S \times
A \cong \mathcal{C}'$, so that each classifier actually represents a state-action pair $(s,a) = (c_\chi,a_\chi)$.

Remembering that in Q-Learning, the Q value of an optimal policy is estimated by the learning rule:

\begin{displaymath}Q(s,a) \stackrel{\beta}{\longleftarrow} r + \gamma \max_{a' \in
A}[Q(s',a')]\end{displaymath}

We find that in XCS, the learning rule for classifier prediction corresponds to the Q-Learning rule since it is precisely:

\begin{eqnarray*}\pi_{\chi} \stackrel{\beta}{\longleftarrow} \rho & = & r_{t-1} ...
...mma \max_{\chi' \in
\mathcal{M}(s')}[\pi((c_{\chi'},a_{\chi'}))]
\end{eqnarray*}


And so, the prediction value of classifiers learned in this form of XCS is the same as the values found in the table formed by the Q-Learning algorithm, $\pi^{\ast}(\chi) = Q^{\ast}(c_\chi,a_\chi)$. Following the hypotheses necessary for the Q-Learning convergence theorem, this simplified XCS thus learns an optimal policy as long as every classifier prediction is updated sufficiently frequently over time.

Generalization in XCS

As already mentioned for the Q-Learning algorithm, the main problem encountered in the simple XCS system is that maintaining a full population of specific classifiers is impractical for most problems due to the size of the $S \times A$ space. This is precisely the object addressed by the discovery component (genetic algorithm) of an accuracy based classifier system. To achieve a reduction of the size of the space needed to maintain classifier predictions, general classifiers are allowed in a classifier population. These general classifiers, holding wildcard symbols in their condition part, can be said to subsume a family of more specific classifiers. With subsumption defined by the relation $\succ$: Since general classifiers represent a family of classifiers (in the same way as schemata represent families of individuals in a genetic algorithm), as long as all classifiers of the family have the same predictive value, the whole family can be replaced by just one (general) classifier. Regularities in the environment are thus compressed as the information of a single classifier.

To observe what happens to the action selection mechanism when generalization is used, it is necessary to see that for a general classifier $\chi$, the prediction is the average expected prediction of the classifiers it subsumes:

\begin{eqnarray*}\pi(\chi) & = & \sum_{\omega \in \Omega}\pi(\omega)/\vert\Omega...
...re
\; \Omega=\{\omega \in \mathcal{C}' \vert \chi \succ \omega\}
\end{eqnarray*}


For specific classifiers $\chi$, the relation still holds since $\Omega=\{\chi\}$ and one can easily convince oneself of the second equality by listing the family of classifiers a general classifier subsumes and writing down their respective prediction values. Now, the action selection mechanism of the classifier system operates on action sets $\mathcal{A}(a)$ that will contain one or more classifiers, both general and specific, by evaluating the mean prediction value of classifiers in $\mathcal{A}(a)$, weighted by their accuracy and it is expected that the action finally chosen is that which would have been chosen in the simple XCS case. It could seem unnecessary to use the accuracy-based selection, thereby simplifying action selection, but by observing the following simple example, the importance of accuracy in generalization becomes clear.

Action Selection in a Sample Classifier without Accuracy

The immediate generalization of action set prediction value from the simple classifier example would be to take p(a) as the mean of the prediction values of classifiers in the set $\mathcal{A}(a)$, that is, $p(a)=\sum_{\chi \in \mathcal{A}(a)}
\pi(\chi)/\vert\mathcal{A}(a)\vert$. Unfortunately, this will usually not give satisfactory results because the regularities in the environment (reward landscape) will most often not be compatible with the generalization mechanism of the classifier system.

Suppose that the state space is $S=\{00,01,10,11\}$ and the action space $A=\{0,1\}$. The current state of the environment is detected as 00. If the current classifier population is made of all possible classifiers, match set and action sets will be given by:


\begin{displaymath}\mathcal{M}(00)=\{(00,0), (\char93 0,0), (0\char93 ,0), (\cha...
...), (00,1),
(\char93 0,1), (0\char93 ,1), (\char93 \char93 ,1)\}\end{displaymath}



\begin{eqnarray*}\mathcal{A}(0)=\{(00,0), (\char93 0,0), (0\char93 ,0), (\char93...
...)=\{(00,1), (\char93 0,1), (0\char93 ,1), (\char93 \char93 ,1)\}
\end{eqnarray*}



  
Figure: Generalization of $\pi $ values, a prediction landscape.
\begin{figure}\begin{center}\begin{tabular}{c\vert cc\vert\vert c\vert cc\vert\v...
...$3$\space & $ $\space & $ $\space & $ $\\
\end{tabular}\end{center}\end{figure}

If the prediction landscape is as illustrated on figure 7.3, we can evaluate the prediction values of both action sets.


\begin{eqnarray*}p(0)
&=&\frac{1}{4}[\pi(00,0)+\pi(0\char93 ,0)+\pi(\char93 0,0...
...\pi(00,1)+\frac{1}{2}(\pi(00,1)+\pi(01,1))+\ldots]
=\frac{9}{8}
\end{eqnarray*}


And so, even with full knowledge of the predictive values of all classifiers, the selected action is not the most beneficial one. Clearly, from the prediction values given, the action that should be selected if we were relying on specific classifiers is the action 0, but here, using deterministic action selection, the selected action will be 1 because of the high prediction value of classifier (10,1) that is reflected in the prediction value of classifier $(\char93 0,1)$ and enters the prediction value calculation of action set $\mathcal{A}(1)$. Therefore, with generalization comes the need of an accuracy criterion that allows the action selection mechanism to distinguish between accurate generalizations and inaccurate generalizations.

The Role of Accuracy in XCS

As shown in the above example, when a general classifier subsumes a family of classifiers that have contradicting prediction values, the action selection mechanism of the XCS might drift away from the mechanism applied in Q-Learning (which will intuitively lead to bad performance). The factor that influences this drift is that general classifiers will obtain various rewards in any non-trivial environment. Given that their prediction value is the expectation of reward, what becomes important here is the variance in rewards received, intuitively, by how much these rewards differ over the set of state-action pairs they match.

In the simple classifier system with only specialized classifiers, this variance will be zero for a single-step environment, where a state-action pair is always equally rewarded. This variance will remain small with delayed rewards as long as the discount factor used is small and the environment sufficiently regular. This remains true when considering general classifiers whose subsumed family of specialized classifiers has consistent predictions. In essence, there are ``good'' accurate general classifiers (marked by small predictive variance) and ``bad'' inaccurate general classifiers (characterized by a high predictive variance) and if the XCS system is to generalize efficiently, it has to be able to distinguish between these accurate and inaccurate classifiers. The role of the prediction error and fitness functions in the reinforcement learning component of the XCS is to learn this distinction and provide a criterion to both exclude some general classifiers from the population and minimize the effects of existing inaccurate classifiers on action selection. The actual search for accurate classifiers is handled by the genetic algorithm component which is applied to the classifier population.

Since the learning rule for the $\varepsilon$ function updates $\varepsilon_t(\chi)$ with $\varepsilon_{t-1}(\chi) +
\beta (\vert\rho_{t}-\pi_{t}(\chi)\vert-\varepsilon_{t-1}(\chi))$, $\varepsilon(\chi)$ is an estimate of the average difference in the prediction of $\chi$ and the rewards received when applying $\chi$. $\varepsilon$ thus has a similar role to that played by variance in statistics. If the GA was operating on a population of classifiers for which we had full information about prediction values and prediction errors, and fitness was taken as the inverse function of prediction error, the classifier population would tend to a population made of an ever greater proportion of accurate classifiers, due to the schemata theorem for genetic algorithms. In this more general situation, these values must simultaneously be learned by exploration in the environment and so, due to incomplete information, a fitness function must be estimated from the prediction error by the reinforcement learning component of the system, allowing an error tolerance to be introduced in the selection of ``good'' and ``bad'' classifiers.

Overall, the XCS system uses two cooperating algorithms to provide the action-selection mechanism with the best information acquired in the environment at the time a decision must be made. The RL component attempts to derive information about the utility of making a particular decision and the GA selects the classifiers that accurately describe the problem domain in which this decision process occurs. The most interesting result remaining to discover is now a convergence result for the joint RL and GA. It seems that although such a result is difficult to obtain, it is not impossible with the right constraining assumptions. Some typical assumptions I believe necessary would be based on: population size requirements, rate of application of the genetic algorithm, number of explorations by the reinforcement algorithm before the selection or deletion of a classifier by the GA. These parameters are all controllable in the classical XCS. There are also some problems that I have not discussed here that can have a great influence on the classifier system, such as the relation between environment states and representation of such states (input function) or the possible reliance of the environment state transition function on hidden parameters. These problems are typical of the current artificial intelligence algorithms and linked to the functional grounding problem that I introduced in the theoretical part of this thesis.

Test Case: The 6-multiplexer Problem

The multiplexer problems have often been used in conjunction with learning classifier systems to show their adaptation capabilities. These problems are considered interesting because the function that must be learned to solve the problem is irregular but does allow generalizations to be made in its standard representation and this representation is easily used as input to a classifier system. With my implementation of the XCS classifier system, I have been able to reproduce the results that were presented in [70] and [30] and I will give a short presentation of these results here. For a complete presentation, the two preceding documents can be referred to.

Multiplexer Problems

A multiplexer is a logical function m that maps a tuple of n + 2nboolean values to a boolean result. If one considers the input string of boolean values as an ordered bit string, this function m maps signals from the set $\{0,1\}^{n+2^n}$ to the set $\{0,1\}$. It uses the first n bits in the bit string as an address in the remaining 2n bits and returns the value at that address as a result, i.e. if the value of the first n bits considered as a number is x, the value of bit x in the second part of the bit string is returned. The figure 7.4 describes this mechanism.


  
Figure 7.4: Multiplexers, a) general case, b) evaluation of the function.
\begin{figure}\begin{center}\epsfxsize=10cm
\epsfbox{Illustrations/lcs_multiplexer_michele.eps}\end{center}\end{figure}

Since the number of possible addresses depends on the n chosen, there are multiplexer problems for each $n \in \mathbbm{N}$, that is, 3-multiplexers, 6-multiplexers, 11-multiplexers, etc. Results have been published on the 6, 11 and 20 multiplexer problems for the XCS system, but the tuning is usually done on the 6-multiplexer case.

The 6-multiplexer

A 6-multiplexer has an input string consisting in 2 address bits and four value bits. The m function maps each of the 64 possible input strings to a zero or one value. Given that in this representation, each input string has only 3 significant bits: the 2 address bits and the bit at that address, the function landscape has regularities that could be learned by a classifier system. For example, the strings 010010, 011010 or 011110 are all mapped on 1 because they have the same 01**1* pattern. Since the XCS generalizes by using conditions that are patterns of this form, it be able to adapt to the problem by generating classifiers with this type of general conditions.


  
Figure 7.5: Regularities of the 6-multiplexer function.
\begin{figure}\begin{center}\begin{tabular}{c\vert c\vert\vert c\vert c}
input &...
...& 10*1** & 1\\
110*** & 0 & 111*** & 1\\
\end{tabular}\end{center}\end{figure}

One can see in table 7.5 that there are 8 (23) general patterns that sum up the multiplexer function. Optimally, the classifier system should thus explore the space of classifiers until it discovers these 8 general conditions and then limit its population to the classifiers that map these conditions to the correct actions. Unfortunately, these general patterns are not the only ones, since *0*1*1 or 1*00** are other equivalent ways of expressing regularities in the function graph and this will affect an XCS classifier system.

Applying XCS to the 6-multiplexer

In order to learn the 6-multiplexer problem, a reward function must be provided to the system for its decisions. The immediate choice, that is used here, is to reward the system with a value when it correctly maps an input to an output and provide no reward for incorrect answers. With this reward scheme, the classifier must discover classifiers that accurately predict the reward, and use these classifiers take decisions about the result of an input. Since in this problem, the regularities of the reward function are symmetric with respect to the two possible action choices (given an input that is rewarded when action 1 is chosen implies no reward for the choice of action 0 in the same situation), the accurate population of classifiers includes both correct general classifiers and their incorrect counterparts. The expected optimal population for the XCS will thus include at least 16 general classifiers.


  
Figure 7.6: 6-multiplexer results over 100 experiments (10k and 30k steps).
\begin{figure}\begin{center}
\epsfxsize=5cm\epsfbox{Illustrations/lcs_m6.eps}
\epsfxsize=5cm\epsfbox{Illustrations/lcs_30k.eps}\end{center}\end{figure}

Experimenting with the classifier system that I have implemented provides the learning curves illustrated on figure 7.6. In this illustration, the curves plotted represent the averaged results of one hundred different experiments. In each experiment, every decision step was alternated with an exploration step. On exploration, an input is used by the system to test its answer. Reward is distributed to the classifier for this answer. On a decision step (exploitation), the result given by the system is used for the plot data, but no reward is distributed and no reinforcement or discovery process takes place in the system. The dashed line plot on the figure represents the percentage of correct answers returned by the system in the last fifty decision steps. The dotted line represents the overall error in prediction over the last fifty decision steps and the continuous curve is the number of different types of classifiers existing in the population (the value is divided by one thousand for scaling purposes).

The results obtained here are equivalent to those presented in [70,30]. One observes that the predictions of the system become almost perfect after 2000 exploration cycles (4000 steps), the error prediction simultaneously decreases, with a slight delay. The system is initialized without any classifiers at first and one sees that while the population has not reached its maximum number of classifiers (which happens around step 1200), the new classifiers that were generated by the genetic algorithm to fill in the population are very diverse. Maximal diversity is reached around step 1900 with about 180 different types of classifiers. This variety then decreases until it reaches the number of 40-60 different types in the process of elimination of inaccurate classifiers.

Conclusions

In order to experiment the theoretical ideas that I presented in the first part of this thesis, I have chosen the XCS algorithm. With this chapter, I have introduced the basic theory of this type of algorithm, including reinforcement learning and genetic algorithms. Based on this theory, I have explained the practical considerations necessary to implement the XCS system according to Wilson's design. In the next step, I have taken a personal stance towards XCS in order to give a better understanding the XCS adaptation principles. This understanding will then help in the next chapter when we come to studying the structural and dynamic properties of such systems. Three essential points are made here. Firstly, since the simple version of XCS is equivalent to Q-Learning, we can reasonably expect some of the good convergence properties of Q-Learning to be inherited by XCS systems. Secondly, the advantage of an XCS system with respect to Q-Learning is that it can generalise. A property that is equivalent to the ability of compressing information and allows an adaptive algorithm to acquire information for overall problem regularities. Thirdly, the accuracy parameter of the XCS system gives it the ability to distinguish appropriate/relevent generalisations of problem spaces from the inadequate/irrelevent ones.

A final experiment is led to reproduce the results of Wilson and others in the case of multiplexers, so as to show that the system I have implemented is identical to the previously implemented systems, and that results obtained here can be compared with other results obtained on XCS classifier systems.


next up previous contents
Next: Experimentation Up: EMuds Previous: EMuds, Aims and Methods
Antony Robert
2000-02-11