CHAPTER 4
Selective Attention as an Optimal Computational Strategy
ABSTRACT
We explore selective attention as a key conceptual inspiration from
neurobiology that can motivate the design of information processing systems. In
our framework, an attentional window, the “spotlight of attention,” contains
some reduced set of data from the environment, which is then made available to
higher-order processes for planning, real-time responses, and learning. This
architecture is invaluable for systems with limited computational resources.
Our test bed for these ideas is the control of an articulated arm. We
implemented a system that learns while behaving, guided by the attention-based
content of what the higher-order logic is currently engaged in. In the early
stages of learning, the higher-order computational centers are involved in
every aspect of the arm’s motion. The attentionally assisted learning gradually
assumes responsibility for the arm’s behavior at various levels (motor control,
gestures, spatial, logical), freeing the resource-limited higher-order centers
to spend more time problem solving.

Neurobiology of Attention, Section I: Foundations. Copyright 2005, Elsevier, Inc. All rights reserved.

I. THE ATTENTION–AWARENESS MODEL: AN INTRODUCTION

Computers and software have recently joined the long line of human tools inspired by biology. In this case, it is the phenomenal capabilities of biological nervous systems that intrigue and challenge us. Our desire to mimic the brain stems from the abilities it possesses, which are in so many cases superior to those we can implement today. We here explore the extent to which attentional selection can convey functional advantages to digital machines. By attentional selection we refer to the remarkable fact, documented throughout this book, that only a very small fraction of the incoming sensory information is accessible, in a conscious or unconscious manner, to influence behavior. Many people have speculated about consciousness and its function. According to Crick and Koch (1998; Koch, 2004), the function of conscious visual awareness in biological systems is to “[p]roduce the best current interpretation of the visual scene in the light of past experience, either of ourselves or of our ancestors (embodied in our genes), and to make this interpretation directly available, for a sufficient time, to the parts of the brain that contemplate and plan voluntary motor output, of one sort or another, including speech.” This representation consists of a reductive transformation of the massive, real-time sensory input data. That is, the content of awareness corresponds to the state of a cache memory that holds a compact version of relevant sensory data as well as recalled items. This strategy can deal with more complex scenarios and generate a strategy for action (Newman et al., 1997). This flexible, but slow, aspect of the system is complemented by a set of very rapid and highly specialized sensorimotor modules (D. Psaltis, personal communication, 1995), “zombie agents” (Koch, 2004), that perform highly stereotyped actions (e.g., driving a car, moving the eyes, walking, running, grasping objects). Figure 4.1 illustrates one way in which these cognitive strategies may be mapped onto a machine architecture (Billock, 2001).

FIGURE 4.1 In this functional model of the role of attention and awareness, the pathway incorporating the attentional bottleneck operates in parallel with the faster sensorimotor agents (zombie systems), taking their cues from the “error signals” generated by the zombie systems. (Diagram components: Logic, Declarative Memory, Awareness, Attention, “Zombie”/Online Systems, Early Processing, Processing Modules, Environment, Error Signal Generation.)

The sections of the diagram toward the bottom (the motor/processing modules, early processing, and error generation) reside below the level of awareness, with fast reflexes and extensive procedural memories. Selective attention and awareness are the gateways that provide preprocessed
sensory data to the higher, more resource-constrained parts of the brain (the
logic and planning cortices and memory). Of course, in reality many additional
interconnections exist between these components. To maintain a coherent course
of action, the system must be capable of alternating between volitional,
top-down and reflex-level, bottom-up control. What we would like to abstract
from the biological functions of attention and awareness is a machine that can
aid in performing similar tasks. We explore how, for sufficiently complex
environments, using a reduced representation of the environment allows an
algorithm to perform better when under time pressure, compared with an approach
in which the entire input is represented. We would also like to understand
better what advantages implementing such a bottleneck has for memory and
machine learning.
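To make the intended division of labor concrete, the error-driven control cycle the model implies can be sketched as follows. All names here are illustrative inventions, not part of the original implementation, and the zombie module is caricatured as a simple lookup table:

```python
class Zombie:
    """Fast reflex module: a lookup table of learned responses (illustrative)."""
    def __init__(self):
        self.memory = {}

    def try_act(self, state):
        if state in self.memory:
            return self.memory[state], None    # handled reflexively, below awareness
        return None, "no_response"             # error signal -> engage attention

    def learn(self, state, action):
        self.memory[state] = action            # logic's output becomes training data


def logic_solve(state):
    """Stand-in for the slow, resource-limited logic/planning unit."""
    return ("planned", state)


def control_step(state, zombie):
    """One cycle: zombie acts by default; on error, logic takes over and trains it."""
    action, error = zombie.try_act(state)
    if error is None:
        return action, "zombie"
    action = logic_solve(state)                # attentional bottleneck -> logic
    zombie.learn(state, action)                # behaviorally guided learning
    return action, "logic"
```

The first time a state is seen, the slow logic handles it; thereafter the zombie responds reflexively, which is the gradual handover described throughout this chapter.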
II. LEARNING MOTION WITH AN ARTICULATED ARM
Machine learning is one area where we expect algorithms inspired by an attentional selection strategy to outperform conventional ones. There are several ways in which attention might facilitate learning. One arises during learning itself: if shown a single image of a car embedded in a dense background filled with other objects, the learning algorithm does not know which features belong to the object of relevance (here the car) and which ones are incidental. If attention segments the car from the rest of the scene, however, superior performance can be obtained. This is particularly relevant to one-shot learning algorithms. The same is true during the recognition phase. Detecting the same car, say, under a different viewpoint in a novel scene is much facilitated if an attentional selection strategy can segment the car from the background and forward just its associated features to the recognition module (see Rutishauser et al., 2004, for an illustration of this strategy). Of course, segmentation also helps in reducing the amount of data that must be memorized, thus improving learning speed. Picking the right information to be learned and ignoring the rest is probably one of the key functions of attentional selection. Indeed, the resultant bottleneck appears to be necessary for the utilization of some kinds of memory (Naveh-Benjamin and Guez, 2000). The test bed we use for exploring
attentional learning is the control of a segmented arm moving around in a
boxlike environment. It can pick up, move, and drop disks. At the most abstract
level, the arm is used to solve various kinds of puzzles. The problem we
explored was one of ordering various objects into target locations. This is
equivalent to the Tower
of Hanoi problem (Claus,
1884). In our version of this problem (see Fig. 4.2), we begin with an
allotment of disks of various diameters. We assume that they have holes in
their middle, that these disks are stacked in order of decreasing size (i.e., a
larger disk must always be below a smaller one), and that the segmented arm can
transfer the disks from one target stack to another one. The arm moves around
the board and physically takes the top disk from each target and moves it to
another stack, with the end goal of placing them in increasing size on a
specific goal target. Various obstacles are placed on the board through which the arm cannot pass. The arm’s segments can overlap as it moves. We assume that the end effector, when placed over a target, takes or releases a single disk automatically. Our problem, then, is to manipulate the joints of the arm to move its end effector between the appropriate targets in the correct order so as to solve the puzzle. The details of the articulated arm, the playing board, and targets are shown in Fig. 4.2. For our purposes, we give the arm segments minimal dynamics involving a maximum torque and a momentum/friction decay characteristic. These force relationships are solved by the logic subsystem using a set of torque-change equations similar to those described by Uno et al. (1989) for modeling human limb control.

FIGURE 4.2 The Tower of Hanoi problem arranged for solution by an articulated arm. The arm must move between the marked targets without colliding with the solid obstacles. Outlined squares are the positions of the targets (baskets F1–F3). Solid circles indicate obstacles around which the arm must navigate. (The plot also marks the arm’s reachable area and its three segments with joint angles q2 and q3.)

Initially, the system has not yet learned to drive its joints, and so must use its logic/planning functions to solve the control problem via explicit equations. The three-segment arm has a complicated inverse kinematics
which requires an expensive optimization process to find the best trajectory from a present position to a target position. The minimum torque-change model selects, out of the possible solutions, the configuration that requires the minimum angular change of the arm segments to achieve. This is a very costly step in terms of required computational power, and so at first the attention of the logic unit is taken up by this low-level function. As it does so, the reduced representation of the environment that the logic uses in solving the problem is presented to the “zombie” system for learning. Arm kinematics are learned by a neural
network doing a straightforward function fit to the force curve necessary to move the arm from one angle to another. We use a three-layer neural network with four units in the hidden layer. The units use a hyperbolic tangent activation function and are fully interconnected. The input parameters are the current distance from the computed goal angle and the angular velocity of the arm segment. The output is the torque to be placed on the segment joint. The motion-learning network starts its training once a sufficient number of samples (about 40 trajectories) have been collected, so that it does not fall into a shallow local minimum and fail to learn the motor curves. We then use the Levenberg–Marquardt algorithm (Marquardt, 1963), which learns rapidly over the next two dozen or so trajectories, after which it has trained enough to substantially take over arm kinematics from the logic unit.
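As a rough sketch of such a kinematics learner, the following trains a 2–4–1 tanh network (angle error and angular velocity in, joint torque out) on a toy torque curve. The network shape follows the text; the training method does not: plain gradient descent stands in for the Levenberg–Marquardt optimizer actually used, and the target torque function is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# 2 inputs (angle error, angular velocity) -> 4 tanh hidden units -> 1 torque output
W1 = rng.normal(0.0, 0.5, (4, 2)); b1 = np.zeros(4)
W2 = rng.normal(0.0, 0.5, (1, 4)); b2 = np.zeros(1)

def forward(x):
    h = np.tanh(W1 @ x + b1)       # hidden layer activations
    return W2 @ h + b2, h          # predicted torque, hidden state

def train_step(x, torque_target, lr=0.05):
    """One least-squares gradient step (the chapter used Levenberg-Marquardt)."""
    global W1, b1, W2, b2
    y, h = forward(x)
    err = y - torque_target
    # backpropagate through the two layers
    gW2 = np.outer(err, h); gb2 = err
    dh = (W2.T @ err) * (1.0 - h**2)
    gW1 = np.outer(dh, x); gb1 = dh
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
    return float(err**2)
```

On a simple invented target such as torque = 2·(angle error) − 0.5·(velocity), a few hundred sweeps over a small sample grid reduce the squared error substantially.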
During training, the control of the arm is shared by the network and the logic.
This sharing is adjusted based on the current training error level of the
network multiplied by a quality parameter that increases with the amount of training data. When the network is controlling arm motion and fails to steer the arm to the required target, an error signal triggers the attentional mechanism, and the logic takes over and computes the kinematics, thus producing more training data, which improves the network’s performance as it takes over more and more of the control of the arm. This behaviorally guided learning approach produces excellent generalization: a training error of 3.46% (relative to the force given by the inverse kinematics solver) produced a test-set error over the whole input field of 0.17% (both taken in the least-squares sense). For actual trajectories the test-set error is still very low but is higher (around that of the training-set error). This is because for actual trajectories, points tend to be more clustered in the areas of the input field where performance is critical, such as when the arm segment is close to its
target location. As basic kinematics are learned, the logic/planning system
spends more time planning movements of the arm from one joint configuration to another. These motion plans are called gestures, such as “going around an obstacle clockwise.” As a memory system for the gestural level, we use an ART-like neural network (Carpenter and Grossberg, 1987). This is an unsupervised learning model that autoclusters the trajectory data the logic presents to it during actual problem solving, and learns models for those clusters, which can then be introduced into the control loop and largely replace the logic in computing gesture trajectories. Each ART unit is associated with a
linear neural network, which it uses to model the gesture parameters of the
data with which it is associated. These linear networks have six inputs (a
present and goal angle for each of the three arm segments) and six hidden
units. Once a unit has more than three data points, it begins to train its
associated neural network to model the data. This training also uses the
Levenberg–Marquardt algorithm. If a new data point ruins the ability of the
existing network to model the data well, an error signal is generated. The data
point is then rejected and forms a new ART unit of its own (where it will
compete with the existing units). If the new data point can be learned well,
which usually happens, then it is incorporated into a
new estimate of the mean and covariance of the unit’s resonance region. The
outputs of the neural network are relative coordinates for the
segments of the arm to steer toward in completing the gesture. These
coordinates then directly drive the kinematic level for controlling the arm
joints to move the arm to that configuration. Once an ART unit has more than six data points, it is allowed to begin to respond to the environment itself; if the current arm parameters are within one standard deviation of a unit’s center point in its input space, then that unit will be chosen to control the choice of the next trajectory path. Each unit, then,
corresponds to a “gesture” that the system has learned. As the system solves
puzzles, control shifts to the ART network. When no ART unit is found for the
present environment, an error signal alerts the logic to calculate a new
gesture trajectory. Training and operation overlap: if resonance occurs with
one of the existing units, and that unit has sufficiently good performance, it
is used to construct the next trajectory. If resonance occurs, but that unit
fails to drive the arm successfully, an error signal causes the logic unit to
compute the gesture, thus training the network. If no resonance occurs (meaning
that no unit is responsible for dealing with the current state), then a new
unit is created. The consistency of the gestures permits the networks
associated with the ART units to usually achieve least-squares training-set errors below 10⁻³, and frequently to converge to the training threshold of 5 × 10⁻⁵ without significant overtraining. These errors are given in radians (target angles relative to current positions, as learned by the ART unit networks) and correspond to less than a tenth of a
degree. On the other hand, similar gestures may not be repeated by every
movement from one target to another, so it may take a while (in puzzle-solution
time) for the actual arm behavior to lead to the accumulation of enough training
examples for a particular unit. During solution of the first few puzzles, that is, sets of distributed targets, the logic spends a lot of the solution time (around 60%) planning gestures. After that, however, the zombie system begins to
learn commonly repeated gestures and takes over the gesture planning. This
reduces the total time spent in problem solution and also dramatically reduces
the amount of time the logic spends “attending to” gestures to less than 10%.
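The recruit-on-mismatch behavior of the gesture memory can be caricatured as follows. This is a deliberately simplified stand-in for the ART network (Carpenter and Grossberg, 1987): a unit “resonates” when the input lies within the unit’s spread of its center, a resonating unit absorbs the sample into its running mean, and a non-resonating sample recruits a new unit. The per-unit linear networks and the covariance-based resonance regions of the real system are omitted:

```python
import numpy as np

class GestureMemory:
    """Simplified ART-like clustering: each unit keeps a running mean and a
    fixed spread for the gesture parameters it has absorbed (illustrative)."""
    def __init__(self, default_spread=1.0):
        self.units = []                       # each unit: dict(center, spread, n)
        self.default_spread = default_spread

    def resonating_unit(self, x):
        for u in self.units:
            if np.linalg.norm(x - u["center"]) <= u["spread"]:
                return u                      # within the spread: resonance
        return None                           # no resonance: error signal

    def observe(self, x):
        u = self.resonating_unit(x)
        if u is None:                         # recruit a new unit for this sample
            u = {"center": np.array(x, float), "spread": self.default_spread, "n": 1}
            self.units.append(u)
            return u
        u["n"] += 1                           # fold sample into running mean
        u["center"] += (x - u["center"]) / u["n"]
        return u
```

Two well-separated gesture parameter vectors thus end up in two distinct units, mirroring how a failed resonance generates an error signal and a fresh unit.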
The spatial sequence to follow when moving from one target to another is
memorized using “declarative memory.” These are basically memorized series of
gestures: “to go from target 1 to target 3, first go clockwise around obstacle
2, then counterclockwise around obstacle 6, then straight on to target 3.”
These directions are learned as the logic solves the puzzle and continue to
evolve as play proceeds. If this memory control fails, an error signal causes
the logic unit to send the arm back to the originating target and
try again by generating a new sequence of gestures itself. These will then
replace the original sequence in memory. As it takes only one example for this
memory to be useful, the declarative memory comes into play quite quickly. On
the other hand, this scheme does not generalize well, except to different
arrangements of the items to be sorted on the same playing board (meaning the
targets and obstacles are in the same position, and only the initial
arrangement of the disks to be sorted has changed).
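A sketch of this declarative route memory, with the single-example storage and the replace-on-failure rule described above (the API names are invented for illustration):

```python
class DirectionsMemory:
    """Declarative memory: memorized gesture sequences between targets.
    One successful example suffices; a failure replaces the whole route."""
    def __init__(self):
        self.routes = {}                     # (from_target, to_target) -> [gestures]

    def recall(self, src, dst):
        return self.routes.get((src, dst))   # None -> logic must plan the route

    def store(self, src, dst, gestures):
        self.routes[(src, dst)] = list(gestures)

    def on_failure(self, src, dst, replan):
        """Error signal: discard the route; the logic generates a new sequence."""
        new_route = replan(src, dst)
        self.store(src, dst, new_route)
        return new_route
```

A route stored once is immediately usable, which is why this memory comes into play so quickly, but it is tied to a particular board layout, matching the limited generalization noted in the text.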
III. ROLE OF THE ATTENTION–AWARENESS MODEL IN LEARNING
We now return to a discussion of the operation of the system in terms of the
blocks of Fig. 4.1. The angular position and velocities of the arm segments and
the positions of obstacles, targets, and disks form the environment. In a real-life
situation a vision system might be used to extract these variables from the raw
sensory data. In this environment, the job of the controller is to move the arm
in the best way to solve the puzzle. The solution of the puzzle at the abstract
level (i.e., which move to perform next with a view to solving the problem)
always remains the province of the logic. The different kinds of memory in the
“zombie” system correspond to some varieties of memory humans employ to solve
different problems. The procedural memory learns from examples to reproduce
forces on the arm segments to cause desired motions. The unsupervised ART
memory learns from examples of common gestures to take over motion planning.
The declarative memory stores sequences of these gestures as spatial
“directions” of how to move from one target to another. The selective attention
mechanism facilitates the learning process. As the system becomes more trained,
the “zombie” gradually takes over control of the arm from the logic. This
happens independently at the various levels as they become “reflexive” from
the point of view of the logic, which then only “attends” to that level
following an error signal. It spends more of its time on other parts of the
problem, which then train other parts of memory. The problem of controlling an
articulated arm to solve a puzzle is one in which
neither pure logic nor traditional machine learning is very good. When the
logic has to plan out in detail all the motions of the arm, it can take a very
long time to solve the problem. The conventional learning problem is
intractable. For even quite small problems, there are dozens of dimensions in the learning problem, generating error signals is hard,
and learning is very slow, if it would work at all. This illustrates the role
of the reduced representation of the “awareness window” in interfacing to
different kinds of memory. In the arm example, the awareness window contains
the single task that the zombie system failed to execute within acceptable
error bounds and needs to be currently completed by the logic. The data in the
awareness window continuously become the source of training examples for the
zombie part of the system. In this way, the awareness mechanism splits up the
problem into manageable “chunks.” For declarative memory, the bottleneck
reduces the amount of information necessary for it to learn useful patterns.
For procedural memory, whether supervised or unsupervised, it assists by
pruning out the information in the environment that is less relevant, reducing
the dimensionality of the resulting patterns and speeding up learning. The
zombie systems greatly improve the speed of overall puzzle solution: figuring out arm ballistics and computing near-optimal gestures are hard problems, and when these responses have been learned, solution times drop by an order of magnitude and more. Figure 4.3 shows the fraction of time
the system spends in the logic/planning subsystem and the training and recall
of various memory subsystems. During solution of the first puzzle, the system spends almost 100% of its resources training the gesture and direction memories and only a few percent training the movement memory. However, as this is not so resource intensive, training is rather rapid once sufficient data are collected. As the system operates, the fraction of time it spends training the gesture and direction-sequence memories decreases, and more time is spent executing logic/planning overhead tasks (considering where to play next and so on). By the time the system is solving the fourth puzzle, resource utilization is taken up mostly by the logic/planning subsystem. Total solution time has dropped to below 20% of what was required for the first problem, and so
responding to interrupts accounts for almost all of the time spent in this
phase.
FIGURE 4.3 Fractional time spent by the system dealing with the various
subsystems (movement, gesture learning and recall, directions learning and
recall, and the overhead of operating the logic). The time shown is a running
average over 15 moves.
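For reference, the 15-move running average used to smooth the curves in Fig. 4.3 is presumably an ordinary moving average, which can be computed as:

```python
import numpy as np

def running_average(values, window=15):
    """Smooth a per-move time series with a flat averaging window."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")
```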
IV. CONCLUSIONS
Learning with the help of attentional selection, which assists a logic/planning unit in training a hierarchical memory, has several benefits. First, it dramatically reduces the dimensionality of the input space. The sorting problem as described has some 30 dimensions, with dependencies that are ill-suited to learning by a traditional neural network. By segmenting the process and learning the reduced
representations as used by the logic/planning subsystem to solve the problem at
different levels, it becomes possible to present tractable problems to the
learning algorithm. Second, the hierarchical organization of skills enables
those learned the fastest to be assumed during the learning of higher-level
skills. Third, the attentional mechanism as employed allows for cooperation
between the memory subsystems and the logic/planning subsystems. When the
faster network-based subsystems can respond, they do so. It is only when they
make errors that the logic subsystem is aroused and spends time correcting the
error. The problems for which this approach is well-suited must satisfy at
least two properties. First, they must be amenable to partial solutions. That
is, an approximate solution must be initially acceptable. They cannot be of the
sort where only a perfect solution is permissible. Second, they must exhibit
the property that as additional information is gathered about the problem, that
information becomes less and less important to the solution. The common
(although not universal) occurrence of these properties supports the argument
that there is a wide class of problems, including many found in nature, whose
solution is assisted by the kind of attentional selection architecture
described above.
References

Billock, J. G. (2001). “Attentional Control of Complex Systems.” Ph.D. thesis, California Institute of Technology.
Carpenter, G. A., and Grossberg, S. A. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision Graphics Image Process. 37, 54–115.
Claus, N. (1884). La Tour d’Hanoi: Jeu de Calcul. Sci. Nat. 1, 127–128.
Crick, F., and Koch, C. (1998). Consciousness and neuroscience. Cereb. Cortex 8, 97–107.
Garey, M. R., and Johnson, D. (1979). “Computers and Intractability: A Guide to the Theory of NP-Completeness.” Freeman, San Francisco.
Koch, C. (2004). “The Quest for Consciousness: A Neurobiological Approach.” Roberts, Denver, CO.
Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Ind. Appl. Math. 11, 431–441.
Naveh-Benjamin, M., and Guez, J. (2000). Effects of divided attention on encoding and retrieval processes: assessment of attentional costs and a componential analysis. J. Exp. Psychol. Learn. Memory Cogn. 26, 1461–1482.
Newman, J., Baars, B. J., and Cho, S.-B. (1997). A neural global workspace model for conscious attention. Neural Netw. 10, 1195–1206.
Rutishauser, U., Walther, D., Koch, C., and Perona, P. (2004). Is attention useful for object recognition? IEEE Int. Conf. Computer Vision Pattern Recog., in press.
Uno, Y., Kawato, M., and Suzuki, R. (1989). Formation and control of optimal trajectory in human multijoint arm movement: minimum torque-change model. Biol. Cybernet. 61, 89–101.