Next: Experimentation
Up: No Title
Previous: EMuds, Aims and Methods
Subsections
Originally described by Holland in [24], learning
classifier systems (LCS) are learning systems, which exploit Darwinian
processes of natural selection in order to explore a problem
space. As such, LCS are among the few AI techniques that integrate both
an internal adaptation process (reinforcement learning) and an
external adaptation process (genetic algorithms) in a single
decision-making algorithm. While Goldberg used LCS in his book about
genetic
algorithms [20] as an application of the theory
presented, learning classifier systems had remained relatively
unexplored until Wilson successively simplified and improved some
aspects of the original systems, dubbing the successors ZCS and
XCS. The main transformations are explained in detail in
[69,70,30] and a summary of the state of
research in XCS systems can be found in [71]. The resulting
system, XCS, which I will describe in this chapter
exhibits some very good learning and generalization abilities in the
tested environments and has the advantage of being well adapted to
symbolic discrete environment problems.
A classifier system (CS) is a rule-based system for decision making,
classifier systems are a special kind of production system
[12]. Each rule maps a problem state into a solution or an
intermediate new state, where the system can be applied again. This
sequence of deductions leads to the systems answer to the
problem. Rules in classifier systems are called ``classifiers'' and
are of the form if <condition> then <action>. When
applied to a problem, a CS is usually presented with a message representing
the current situation, the classifier system then attempts to match
this message with one or more classifier conditions. If some
conditions match, then an action among those advocated by the matched
classifiers is selected and applied. The basic system is thus based on
a set of classifiers, a matching process and an action selection
process.
The learning classifier systems add adaptation to the basic CS through
two components. The first is a reinforcement learning algorithm
similar to Q-Learning [27] that operates on the action
selection process and that I introduce in section 7.4.3. The
second is a rule discovery system implemented as a genetic algorithm
[23,20] that operates on the classifiers as a
population to generate diversity in the classifier set, allowing
exploration of the problem space. This component is introduced in
section 7.4.4. The overall architecture of an LCS agent is
illustrated in figure 7.1.
Figure 7.1:
Learning classifier system architecture.
 |
Within an agent system context, the classifier system is the agent's
control algorithm with the problem space being the environment and
messages the perceived current environment conditions. A learning
classifier system provides the agent with an adaptive mechanism to
deal with varying environment situations and learn better action
patterns through experience. Depending on the type of environment,
detectors and effectors have to be customized for the agent to convert
perceptions into messages and actions into effector operations.
I will present the basics of reinforcement learning and genetic
algorithms in the next two sections, before giving an analysis
of the XCS classifier system and its operation principles.
Reinforcement learning (RL) is a form of learning that is well suited for
unsupervised learning situations and is guided by the following
principle [61]:
If an action taken by a learning system is followed by a
satisfactory state of affairs, then the tendency of the system to
produce that particular action is strengthened or
reinforced. Otherwise, the tendency of the system to produce that
action is weakened.
In reinforcement learning (associative reinforcement learning to be
precise), a system attempts to learn a mapping from inputs to outputs,
guided by a reinforcement signal that represents satisfaction in the
sense of the preceding principle. From the agent perspective, an
agent, connected to an environment by sensors and effectors, perceives
states in the environment and converts them to inputs describing such
currently perceived states. The agent then produces an output that is
effected as an action in the environment. From the transformation
operated in the environment, the agent then receives as reward a scalar
reinforcement signal. The behavior of the agent is aimed at maximizing
rewards received as reinforcement. The reinforcement learning model
(see figure 7.2) consists in:
- environment states
;
- possible actions
;
- reinforcement signals
described by a
reward function R from
into
,
a probability
distribution of reward values;
- a state transition function
for the environment that maps state-action pairs to
probability distributions over the state space;
- agent input signals
and output signals
.
Figure 7.2:
The reinforcement learning model.
 |
The purpose of reinforcement learning theory is to find an ``optimal''
policy B for the mapping
.
Optimality being defined in
relation to the reinforcement signals r received by the agent over
time. Depending on the context, many different algorithms can be
applied in order to solve reinforcement learning problems, and
actually, even evolutionary techniques such as genetic algorithms
could be considered as reinforcement learning within the above
framework. Traditionally, though, reinforcement learning algorithms
are based on statistical techniques and dynamic programming used to
estimate and improve behavior-generating algorithms.
The main aspects that influence the choice of method are the
following:
-
Optimality criterion: defining what is an optimal behavior depends on
making the choice of an optimality criterion and is the
first step to finding a solution to a reinforcement learning
problem. There are basically three models of optimality. 1)
Finite horizon, where the agent has a predefined finite life
span and has to maximize reinforcement received during this time. 2)
Receding horizon, where the agent's life time expectation is not
known and it is asked of the agent that at each time step, he
maximizes the total expected reward for the next h steps. 3)
Infinite horizon, where the agent takes all expected future
rewards into account for the maximization. This can be achieved for
example by using discounted sums of expected rewards.
-
Environment stability: actions in the environment may or may not
influence future states of the environment, depending on this factor,
actions may change the future expected rewards and this should be
taken into account by the behavior. A similar case happens with delayed
rewards, in some problems, reinforcement cannot be given immediately
following an agent's action, it is only when certain specific
situations occur in the environment that the agent receives
reinforcement.
-
Perceptive limits: when the agent perceives the environment, a
descriptive input signal i is built by the sensors observing the
current environment state s. This input signal may not be a complete
or accurate description of the current state of the environment
and thus the agent must be able to generalize its knowledge in the
input-output mapping so that the reinforcement predictions may remain
accurate even under uncertainty.
Given the problem specifics, Kaelbling et al. [27]
introduce formally
justified methods of reinforcement learning and ad-hoc
methods. Unfortunately, most demonstrated techniques suffer from
scaling problems when applied to the more complicated situations (in
particular when the environment is not stable), also,
algorithms used for calculation in these techniques can be relatively
inefficient. The most commonly used methods in the field of
Artificial Intelligence are thus the ad-hoc methods, that are more
easily implemented to deal with generalization problems and delayed
rewards.
From the reinforcement learning model previously described and a
discounted infinite horizon optimality model, the problem of
reinforcement learning can be formulated mathematically in a Markovian
decision process perspective as follows [27].
The optimal value of a state s is the maximum over all action
selection policies
of the expected discounted sum of rewards
over all stochastic transitions
(with
the discount factor and rt the reward at time t):
It can be shown that this value is unique and is a solution of the
equation:
where T(s,a,s') is the probability of
making a transition from state s to state s' through action a.
From the previous definition and the formula for
,
an optimal
decision policy
is a policy that finds an action a such that
in every given state s.
The objective of a reinforcement learning algorithm is thus to find a
policy
that minimizes the value
.
Finding an exact solution for
and
is possible
by using dynamic programming methods, when T and R are known, the
problem faced by reinforcement learning methods is to find a solution
when this knowledge is not directly available, but must be sought in
the environment through trial and error.
I present here a well-known algorithm developed by Watkins
[67,68], that can
be proved to converge to an optimal solution under certain
conditions. This algorithm forms the basis of the reinforcement
learning algorithm used in the XCS classifier system. In Q-Learning, a
state value is defined for a given policy
:
and then a Q value is defined for this policy and
every pair (s,a):
indicating the expected discounted reward for executing action aat state s and pursuing the same policy thereafter. For this
definition, if
is an optimal policy, then
,
where
and:
If this
value can be
learned, then an optimal policy can be built from it by finding one
action that maximizes
for each state s.
The Q-Learning algorithm estimates this optimal Q value
by building a table of randomly initialized Q values for all
state-action pairs and
updating these values with a Widrow-Hoff delta learning rule. The
delta rule adjusts a parameter x towards an estimate of its target
value y by replacing x with
,
(
), which is simply written
.
The value
being the learning rate. It is clear that when
y is stationary, this forms a sequence of x values that converge
to y. In the algorithm, the delta rule is expressed as:
With r the actual reinforcement received
for executing action a in state s.
Each time a table state-action position is chosen to act, this rule is
applied to the position to update its value. Over time, the table
positions are thus alternatively updated and it can be shown that if
all positions are updated regularly enough as time goes on and the
learning rate is appropriately adapted at each step, this
tabular representation of Q(s,a) will tend to
.
The second component of a learning classifier system is a rule
discovery system that is implemented as a genetic algorithm (GA). A genetic
algorithm is a search and optimization algorithm based on the Darwinian
principles of natural selection observed in nature, but applied to a set
of digital genotypes. These digital genotypes are then interpreted as
the parameters that can be operated upon for solving the optimization
problem at hand. Traditionally, a genetic algorithm is used as a
phylogenetic adaptation mechanism that operates on a population of
potential solutions to a problem and is based on a measure of the quality
(called fitness) of each individual's solution. Individuals are
identified by an artificial genome that describes their solution to
the problem. The search for an optimal solution works by randomly
selecting individuals with a bias towards the fittest and propagating
their genetic material in new individuals, possibly
through genetic operators such as crossover and mutation. This process
usually leads to an increasing average fitness in a population since
fitter individuals are more likely to be selected for reproduction.
In many genetic algorithms, an individual is represented by a bitstring
(i.e. a string of zeroes and ones) in the set Bl of bitstrings of a
specified length l, this string is both the genome of
the individual and a coding of the solution that it provides to the
problem under study. The algorithm operates on a population
of such individuals at
every time step t, the
population having a predefined number n of
individuals. A mapping
has to be provided that defines the fitness of
individuals in order to provide a selection criterion for the
algorithm. The search process then iteratively selects individuals
with probability proportional to their fitness and generates new
individuals from their genome, thereby forming new populations of
individuals. There are a few different methods that can be used to
produce population updates either as in the original genetic algorithm
that was described as selecting individuals from a population Pt at
time t and generating n new individuals to form the next
population Pt+1, or as in the variant of steady-state populations
used in the XCS where, at each step, a few (one or two) new
individuals are created from the current population's fitter
individuals and replace unfit individuals of the current population.
The three genetic operators used to produce new individuals are:
-
simple replication: the selected individual is duplicated;
-
mutation: the various sites in a duplicated individual's code are
swapped to the opposite bit with probability
;
-
crossover: two individuals are selected and one or more random
positions in their genome are chosen randomly as crossover points. Two
new individuals are formed by alternating pieces of genetic code
from the two selected individuals, the lengths of these pieces being
delimited by the crossover points chosen.
The sequence of populations generated by the algorithm explores the
solution space of the problem by accumulating individuals in regions
of high fitness value and converging to the (or one of the) global
maximums of this fitness landscape.
The search procedure provided by a genetic algorithm is, in most
cases, provably better than a random search in the solution space of a
problem, although for a large search space the procedure can be slow.
The convergence of the algorithm has been proved in the Schemata Theorem
[20] by studying generalizations of bitstrings called
schemata that represent families of individual bitstrings. Schemata are
generalizations of bitstrings and are identical to the classifier
conditions used by the XCS system that I introduce in the next section.
The XCS classifier system is an improvement on the original design of
classifier systems that was presented by S. W. Wilson in his 1995
article Classifier Fitness Based on Accuracy; it promotes a
different approach to the reinforcement learning/genetic algorithm
relation in the system adaptation process that allows better
generalization of knowledge stored in the form of classifiers. Since
much of the relevant work in the field of LCS is now based on this system
design, I introduce it here without presenting the original designs of
classifier systems. The following subsections describe the elements
used in a cycle of the classifier system life. The steps in a cycle
are applied repeatedly until the system is stopped by a user and
consist in:
- 1.
- detecting an environment message and forming a set of
classifiers that match the message
- 2.
- forming action sets in the match set and selecting an action
- 3.
- applying the reinforcement algorithm to classifiers
- 4.
- applying the genetic algorithm to the classifier population
Along such a cycle, the learning/genetic algorithm is focussed on
the evaluation and use of the three following parameters:
-
prediction
of a classifier
:
the estimate of
the reward gained
by the system when this classifier is matched and its action is
selected. This parameter is used by the action selection mechanism;
-
predictive error
of a classifier
:
an
estimate of the
difference between prediction value and actual reward obtained by the
system when this classifier is matched and its action selected. This
parameter is used to calculate the accuracy of the classifier;
-
classifier accuracy, used to update fitness
:
a function of
classifier error that attempts to estimate how good a classifier is at
guessing the reward the system would receive if its action was
selected. This parameter is used by the genetic algorithm component to select
classifiers and by the action selection mechanism to weigh prediction values
of action sets.
When dealing with classifier systems, sensory stimuli are converted to
messages which are conventionally formed of a bitstring of specified
length. A population
of classifiers forms the system,
each classifier
consisting in a condition string and an action
string (
,
). The condition string is of the same
length as messages and
holds three different types of elements: zeroes, ones or wildcard
characters. I will therefore refer to these characters as trits and
the strings involved as tritstrings. When the matching process is
invoked, it uses the condition strings of the classifiers in the
classifier population to test whether a match has occurred by comparing
successive bits in the message with trits from the classifier. A
position is matched in both strings when either the message string
holds a one and the condition string a one or a wildcard character, or
the message string holds a zero and the condition string a zero or a
wildcard character. The wildcard character thereby acts as a don't
care symbol that I will hereafter note #. A
classifier matches a message if all positions of the classifier
condition match corresponding positions in the message. For example,
the message 0010110111 is matched by the condition 0##011#11#
since every specified bit in the condition is the same as that of the
message and we ``don't care'' about the other positions. A match set
is created from the subset of classifiers that match the current
message before proceeding to the action selection stage.
The action string in a classifier is a bitstring that codes the
associated action identifier of that classifier and has a
length depending on the number of possible actions the agent may
execute. Once a match set has been formed, the system decides
among the possible actions which one it will execute. To this end,
a prediction value is stored by each classifier,
giving an estimate of the reward the classifier usually receives when
it is matched and its action is chosen, along with an estimate of the
accuracy of this prediction (these parameters are calculated
by the reinforcement learning mechanism which we will study in the
next subsection). For each action
,
the set
of
classifiers in the match set that advocate this action is formed. A
prediction array p is then formed by calculating a prediction value
of each of these sets. The prediction value of such a set is the
average of the predictions of the classifiers in the set, weighted by
their accuracy:
One of these
action sets is selected based on its prediction and named the current
action set
.
For example, using deterministic action
selection, the action set with highest average prediction is
selected. Another way to choose the action would be by using a biased
random selection mechanism over the predictions of each set.
Reinforcement Learning Component
The reinforcement learning algorithm used in an XCS is applied to the
prediction, error and fitness parameters defined previously. The
updates of these parameters are usually applied in the following
order: prediction error, prediction, fitness, but other orders also
work.
Two types of problems are distinguished when calculating
parameter updates, single step problems and multi step
problems. Single step problems are problems where reward depends only
on the current state-action pair and the transition function maps
all pairs to the uniform probability distribution over the state
space (i.e. the state of the next step does not depend on the current
state and action). A multi step problem is the more general situation,
where the state transition function is not constant and where the
system must also learn it. Both situations are studied in the
experimental chapter.
In a single step problem, the reinforcement is applied to all
classifiers of the current action set, using a reinforcement value of
.
In a multi step problem, the reinforcement is applied
to the previous step's action set, using a discounted reinforcement
value
,
the t indicating to which time step the
value belongs and pt(a) being the prediction value of a's action
set at time t, as defined in the preceding subsection. For each
classifier
to update, the reinforcement rules are:
-
-
-
,
where
is the relative accuracy of the classifier calculated
from its accuracy
in the following way: if
is
larger than a threshold
,
then
,
otherwise,
.
The relative accuracy of
is then
.
In these learning rules,
controls the learning rate,
is the limit from which a classifier is not
considered accurate anymore and
with
is the
accuracy value given to a classifier of error
.
In practice, in XCS, the technique of the ``moyenne adaptive modifiée''
(MAM) introduced by Venturini [64] is applied for the
first
action cycles of the system, to speed up the initial
convergence of the system.
Genetic Algorithm Component
The XCS system is made of a population of classifiers
that does not usually hold all
.
Since the action selection mechanism is applied to this
population
,
it is essential that this set hold
relevant classifiers for all environment states encountered by the
system. The genetic algorithm's role is to discover which classifier
is relevant and must be part of the classifier population and which
can be safely omitted from this population. The basis on which
this is accomplished is described in the analysis of the XCS section.
As was mentioned earlier, the genetic algorithm operates on the
classifier population
.
At every step, the genetic
algorithm is applied to the population with a probability
.
If it is applied, two individuals are selected in the
current action set proportionally to their fitness
as
calculated by the reinforcement learning component. These individuals
are then either reproduced with a mutation factor of
at each of
their sites or, with probability
,
they are crossed over at one
random position along their condition tritstring or action
bitstring. The two new individuals are then inserted in the population
and if this population is larger than its predefined maximum size, two
unfit classifiers are deleted from the population.
In this section, I give my mathematical understanding of an XCS classifier
system and show that under certain restrictive assumptions, the XCS
actually implements the Q-Learning algorithm. I will then discuss the
full XCS implementation and the importance of its genetic accuracy
based component for generalization purposes. But first I start by
making a summary of the notations I will use, some of which were
previously defined and used.
Mathematical notations:
-
is the set of subsets of E
is the set of probability distributions over E
- |E| is the number of elements in the set E
Reinforcement learning notations:
denotes optimality as in an optimal policy
- S is the set of environment states
- A is the set of actions
- T is the state transition function
,
and T(s,a,s') denotes the probability of the
environment reaching state s' when action a is selected in state
s.
- R is the reward function of the environment
and rt the actual
reward received at a given time t
is a policy for choosing actions and
an
optimal policy
is the state value of s for the given policy
-
is the Q value of state-action pair (s,a) for
policy
,
are the discount factor and the learning rate used
in a learning procedure
XCS notations:
- I and O are the input and output sets of the XCS, which I will
for now consider as isomorphic to S and A respectively
-
is a classifier with condition and action
-
is the set of all possible classifiers,
the population of classifiers
available in the system at time t,
the set of specific classifiers in
,
that is,
classifiers whose condition part have no wildcards
-
is the prediction
function over the set of classifiers
-
is the
prediction error function over the set of classifiers
-
is the fitness
function over the set of classifiers
-
is the match set for state
s at time t
-
is the action set for
action a at time t
In [15], Dorigo and Bersini present a comparison between
the original LCS of Holland and Goldberg; they show that by
restricting the LCS to a subsystem called the very simple classifier
system with a modification to the learning rule, their system
functions in a way that is identical to a Q-Learning algorithm. I will
show here that the XCS system can also be considered as a Q-Learning
algorithm given some restrictions.
For the XCS to become a Q-Learning implementation, one restriction
is necessary, although it is a major one, the removal of the genetic
algorithm component of the system. One assumes (enforces) that
the population of classifiers present in the system at every time-step
consists in only and all the specific classifiers, that is
,
so that these classifiers
form a table similar to that used in tabular Q-Learning. This
implies that there is no genetic algorithm component and only the prediction
values of classifiers need to be learned (accuracy is not needed since
action sets hold only one classifier, as we will see). The XCS
algorithm then runs in three steps: acquire the environment state sand form a match set
for this state, evaluate the
prediction value of the action sets in
and select an
action, obtain reward and reinforce the selected action set.
Since the classifier population consists in only the specific
classifiers, the match set will hold |A| classifiers, one for each
action in A, and every action set will hold only one classifier, the
classifier whose condition is exactly the current environment state.
The prediction value of these action sets will thus be the prediction
value
of their only classifier (accuracies simplify away
in the weighted sum calculation) and action selection as well as
reinforcement can be considered to operate on the classifiers
individually. Note also that we have an isomorphism between the
population of classifiers and the set of state-action pairs:
,
so that each classifier actually represents a
state-action pair
.
Remembering that in Q-Learning, the Q value of an optimal policy is
estimated by the learning rule:
We find that in XCS, the learning rule for classifier prediction
corresponds to the Q-Learning rule since it is precisely:
And so, the prediction value of classifiers learned in this form of
XCS is the same as the values found in the table formed by the
Q-Learning algorithm,
.
Following the
hypotheses necessary for the Q-Learning convergence theorem, this
simplified XCS thus learns an optimal policy as long as every classifier
prediction is updated sufficiently frequently over time.
As already mentioned for the Q-Learning algorithm, the main problem
encountered in the simple XCS system is that maintaining a full
population of specific classifiers is impractical for most problems
due to the size of the
space. This is precisely the
object addressed by the discovery component (genetic algorithm) of an
accuracy based classifier system. To achieve a reduction of the size
of the space needed to maintain classifier predictions, general
classifiers are allowed in a classifier population. These general
classifiers, holding wildcard symbols in their condition part,
can be said to subsume a family of more specific classifiers. With
subsumption defined by the relation
:
-
positions in
c' are matched by the corresponding positions in c, or both
positions hold wildcards
-
and

Since general classifiers represent a family of classifiers (in
the same way as schemata represent families of individuals in a genetic
algorithm), as long as all classifiers of the family have the same
predictive value, the whole family can be replaced by just one
(general) classifier. Regularities in the environment are thus
compressed as the information of a single classifier.
To observe what happens to the action selection mechanism when
generalization is used, it is necessary to see that for a general
classifier
,
the prediction is the average expected prediction
of the classifiers it subsumes:
For specific classifiers
,
the relation still holds since
and one can easily convince oneself of the second
equality by listing the family of classifiers a general classifier
subsumes and writing down their respective prediction values. Now, the
action selection mechanism of the classifier system operates on action
sets
that will contain one or more classifiers, both
general and specific, by evaluating the mean prediction value of
classifiers in
,
weighted by their accuracy and it is
expected that the action finally chosen is that which would have been
chosen in the simple XCS case. It could seem unnecessary to use the
accuracy-based selection, thereby simplifying action selection, but by
observing the following simple example, the importance of accuracy in
generalization becomes clear.
The immediate generalization of action set prediction value from the
simple classifier example would be to take p(a) as the mean of the
prediction values of classifiers in the set
,
that is,
.
Unfortunately, this will usually not give
satisfactory results because the regularities in the environment (reward
landscape) will most often not be compatible with the generalization
mechanism of the classifier system.
Suppose that the state space is
and the action space
.
The current
state of the environment is detected as 00. If the current
classifier population is made of all possible classifiers, match
set and action sets will be given by:
Figure:
Generalization of
values, a prediction landscape.
 |
If the prediction landscape is as illustrated on figure
7.3, we can evaluate the prediction values of
both action sets.
And so, even with full knowledge of the predictive values of all
classifiers, the selected action is not the most beneficial one.
Clearly, from the prediction values given, the action that should be
selected if we were relying on specific classifiers is the action 0,
but here, using deterministic action selection, the selected action
will be 1 because of the high prediction value of classifier
(10,1) that is reflected in the prediction value of classifier
and enters the prediction value calculation of action set
.
Therefore, with generalization comes the need of an
accuracy criterion that allows the action selection mechanism to
distinguish between accurate generalizations and inaccurate
generalizations.
As shown in the above example, when a general classifier subsumes a family
of classifiers that have contradicting prediction values, the action
selection mechanism of the XCS might drift away from the mechanism
applied in Q-Learning (which will intuitively lead to bad
performance). The factor that influences this drift is that general
classifiers will obtain various rewards in any non-trivial
environment. Given that their prediction value is the expectation of
reward, what becomes important here is the variance in rewards
received, intuitively, by how much these rewards differ over the set
of state-action pairs they match.
In the simple classifier system with only specialized classifiers, this
variance will be zero for a single-step environment, where a
state-action pair is always equally rewarded. This variance will remain
small with delayed rewards as long as the discount factor used is small
and the environment sufficiently regular. This remains true when
considering general classifiers whose subsumed family of specialized
classifiers has consistent predictions. In essence, there are ``good''
accurate general classifiers (marked by small predictive variance) and
``bad'' inaccurate general classifiers (characterized by a high
predictive variance) and if the XCS system is to generalize
efficiently, it has to be able to distinguish between these accurate
and inaccurate classifiers. The role of the prediction error and
fitness functions in the reinforcement learning component of the XCS
is to learn this distinction and provide a criterion to both exclude
some general classifiers from the population and minimize the effects
of existing inaccurate classifiers on action selection. The actual
search for accurate classifiers is handled by the genetic algorithm
component which is applied to the classifier population.
Since the learning rule for the
function updates
with
,
is an estimate of the average difference in the
prediction of
and the rewards received when applying
.
thus has a similar role to that played by
variance in statistics. If the GA was operating on a population of
classifiers for which we had full information about prediction values
and prediction errors, and fitness was taken as the inverse function
of prediction error, the classifier population
would tend to a population made of an ever greater proportion of
accurate classifiers, due to the schemata theorem for genetic
algorithms. In this more general situation, these values must
simultaneously be learned by exploration in the environment and so,
due to incomplete information, a fitness function must be estimated
from the prediction error by the reinforcement learning component of
the system, allowing an error tolerance to be introduced in the
selection of ``good'' and ``bad'' classifiers.
Overall, the XCS system uses two cooperating algorithms to provide the
action-selection mechanism with the best information acquired in the
environment at the time a decision must be made. The RL component
attempts to derive information about the utility of making a particular
decision and the GA selects the classifiers that accurately describe the
problem domain in which this decision process occurs. The most
interesting result remaining to discover is now a convergence result
for the joint RL and GA. It seems that although such a result is
difficult to obtain, it is not impossible with the right constraining
assumptions. Some typical assumptions I believe necessary would be
based on: population size requirements, rate of application of the
genetic algorithm, number of explorations by the reinforcement
algorithm before the selection or deletion of a classifier by the
GA. These parameters are all controllable in the classical XCS. There
are also some problems that I have not discussed here that can have a
great influence on the classifier system, such as the relation between
environment states and representation of such states (input function)
or the possible reliance of the environment state transition function
on hidden parameters. These problems are typical of the current
artificial intelligence algorithms and linked to the functional
grounding problem that I introduced in the theoretical part of this
thesis.
The multiplexer problems have often been used in conjunction with
learning classifier systems to show their adaptation capabilities. These
problems are considered interesting because the function that must be
learned to solve the problem is irregular but does allow
generalizations to be made in its standard
representation and this representation is easily used as input to a
classifier system. With my implementation of the XCS classifier system, I
have been able to reproduce the results that were presented in
[70] and [30] and I will give a short
presentation of these results here. For a complete presentation, the
two preceding documents can be referred to.
A multiplexer is a logical function m that maps a tuple of n + 2nboolean values to a boolean result. If one considers the input string
of boolean values as an ordered bit string, this function m maps
signals from the set
to the set
.
It uses
the first n bits in the bit string as an address in the remaining
2n bits and returns the value at that address as a result, i.e. if
the value of the first n bits considered as a number is x, the
value of bit x in the second part of the bit string is returned.
The figure 7.4 describes this mechanism.
Figure 7.4:
Multiplexers, a) general case, b) evaluation of the function.
 |
Since the number of possible addresses depends on the n chosen,
there are multiplexer problems for each
,
that is,
3-multiplexers, 6-multiplexers, 11-multiplexers, etc. Results have
been published on the 6, 11 and 20 multiplexer problems for the XCS
system, but the tuning is usually done on the 6-multiplexer case.
A 6-multiplexer has an input string consisting in 2 address bits and
four value bits. The m function maps each of the 64 possible input
strings to a zero or one value. Given that in this representation,
each input string has only 3 significant bits: the 2 address bits
and the bit at that address, the function landscape has regularities
that could be learned by a classifier system. For example, the strings
010010, 011010 or 011110 are all mapped on 1 because they have the
same 01**1* pattern. Since the XCS generalizes by using conditions
that are patterns of this form, it be able to adapt to the problem by
generating classifiers with this type of general conditions.
Figure 7.5:
Regularities of the 6-multiplexer function.
 |
One can see in table 7.5 that there are 8 (23)
general patterns that sum up the multiplexer function. Optimally, the
classifier system should thus explore the space of classifiers until
it discovers these 8 general conditions and then limit its population
to the classifiers that map these conditions to the correct actions.
Unfortunately, these general patterns are not the only ones, since
*0*1*1 or 1*00** are other equivalent ways of expressing regularities
in the function graph and this will affect an XCS classifier system.
In order to learn the 6-multiplexer problem, a reward function must be
provided to the system for its decisions. The immediate choice, that
is used here, is to reward the system with a value when it correctly
maps an input to an output and provide no reward for incorrect
answers. With this reward scheme, the classifier must discover
classifiers that accurately predict the reward, and use these
classifiers take decisions about the result of an input. Since in this
problem, the regularities of the reward function are symmetric with
respect to the two possible action choices (given an input that is
rewarded when action 1 is chosen implies no reward for the choice of
action 0 in the same situation), the accurate population of
classifiers includes both correct general classifiers and their
incorrect counterparts. The expected optimal population for the XCS
will thus include at least 16 general classifiers.
Figure 7.6:
6-multiplexer results over 100 experiments (10k and 30k steps).
 |
Experimenting with the classifier system that I have implemented
provides the learning curves illustrated on figure
7.6. In this illustration, the curves plotted represent
the averaged results of one hundred different experiments. In each
experiment, every decision step was alternated with an exploration
step. On exploration, an input is used by the system to test its
answer. Reward is distributed to the classifier for this answer. On a
decision step (exploitation), the result given by the system is used
for the plot data, but no reward is distributed and no reinforcement
or discovery process takes place in the system. The dashed line plot
on the figure represents the percentage of correct answers returned by
the system in the last fifty decision steps. The dotted line
represents the overall error in prediction over the last fifty
decision steps and the continuous curve is the number of different
types of classifiers existing in the population (the value is divided
by one thousand for scaling purposes).
The results obtained here are equivalent to those presented in
[70,30]. One observes that the predictions of the
system become almost perfect after 2000 exploration cycles (4000
steps), the error prediction simultaneously decreases, with a slight
delay. The system is initialized without any classifiers at first and
one sees that while the population has not reached its maximum number
of classifiers (which happens around step 1200), the new
classifiers that were generated by the genetic algorithm to fill in
the population are very diverse. Maximal diversity is reached around
step 1900 with about 180 different types of classifiers. This variety
then decreases until it reaches the number of 40-60 different types in
the process of elimination of inaccurate classifiers.
In order to experiment the theoretical ideas that I presented in the
first part of this thesis, I have chosen the XCS algorithm. With this
chapter, I have introduced the basic theory of this type of algorithm,
including reinforcement learning and genetic algorithms. Based on this
theory, I have explained the practical considerations necessary to
implement the XCS system according to Wilson's design. In the next
step, I have taken a personal stance towards XCS in order to give a
better understanding the XCS adaptation principles. This understanding
will then help in the next chapter when we come to studying the
structural and dynamic properties of such systems. Three essential
points are made here. Firstly, since the simple version of XCS is
equivalent to Q-Learning, we can reasonably expect some of the good
convergence properties of Q-Learning to be inherited by XCS
systems. Secondly, the advantage of an XCS system with respect to
Q-Learning is that it can generalise. A property that is equivalent to
the ability of compressing information and allows an adaptive
algorithm to acquire information for overall problem
regularities. Thirdly, the accuracy parameter of the XCS system gives
it the ability to distinguish appropriate/relevent generalisations
of problem spaces from the inadequate/irrelevent ones.
A final experiment is led to reproduce the results of Wilson and
others in the case of multiplexers, so as to show that the system I
have implemented is identical to the previously implemented systems,
and that results obtained here can be compared with other results
obtained on XCS classifier systems.
Next: Experimentation
Up: EMuds
Previous: EMuds, Aims and Methods
Antony Robert
2000-02-11