Psychonomic Bulletin & Review
2001, 8 (3), 408-438
Copyright 2001 Psychonomic Society, Inc.

The variance theory of the mirror effect in recognition memory

SVERKER SIKSTRÖM
Stockholm University, Stockholm, Sweden

The mirror effect refers to a rather general empirical finding showing that, for two classes of stimuli, the class with the higher hit rates also has a lower false alarm rate. In this article, a parsimonious theory is proposed to account for the mirror effect regarding, specifically, high- and low-frequency items and the associated receiver-operating curves. The theory is implemented in a recurrent network in which one layer represents items and the other represents contexts. It is shown that the frequency mirror effect is found in this simple network if the decision is based on counting the number of active nodes in such a way that performance is optimal or near optimal. The optimal performance requires that the number of active nodes is low, only nodes active in the encoded representation are counted, the activation threshold is set between the old and the new distributions, and normalization is based on the variance of the input. Owing to the interference caused by encoding the to-be-recognized item in several preexperimental contexts, the variance of the input to the context layer is greater for high- than for low-frequency items, which yields lower hit rates and higher false alarm rates for high- than for low-frequency items. Although initially the theory was proposed to account for the mirror effect with respect to word frequency, subsequent simulations have shown that the theory also accounts for strength-based mirror effects within a list and between lists. In this case, consistent with experimental data, the variance theory suggests that focusing attention on the more difficult class within a list affects the hit rate, but not the false alarm rate and not the standard deviations of the underlying density, leading to no mirror effect.

The mirror effect refers to a general class of empirical phenomena regarding the order found in hit and false alarm rates for two classes of studied items, when one of the two classes is easier to recognize (A) and produces a larger d′ than does the other (B). For instance, in the case of a free-choice yes–no recognition task, the mirror effect for new (N) or old (O) items is seen in the following order in terms of increasing probabilities (P) of yes responses: easy new, difficult new, difficult old, and easy old [P(AN) < P(BN) < P(BO) < P(AO)]. The phenomenon is named the mirror effect because the order for the old items is the reverse of the order of the new items.

The mirror effect has been studied extensively in different contexts and has been shown to be a rather universal, robust, and basic recognition memory phenomenon. It has been found for many classes of stimuli that differ in several stimulus dimensions (e.g., word frequency, concreteness, material, and strength; for more detailed reviews, see Glanzer & Adams, 1985, and Glanzer, Adams, & Kim, 1993). Furthermore, a few quantitative theories have also been proposed to explain the psychological mechanisms causing the mirror effect (e.g., Glanzer et al., 1993; Greene, 1996; McClelland & Chappell, 1998; Shiffrin & Steyvers, 1997). However, as will be discussed later, these current models do not predict some aspects of the mirror effect. This paper presents a new theory whose aim is to extend the previous models in order to overcome these challenges.

Different Facets of the Mirror Effect

First, I review a few major empirical findings regarding the mirror effect. For instance, Glanzer and Adams (1985) found a material-based mirror effect with respect to concrete versus abstract words and pictures versus words. In terms of the concreteness-based mirror effect, the hit rates of the concrete words are higher, but also the false alarm rates are lower. With respect to pictures versus words, the d′s and the hit rates of the pictures are larger and higher, but also the false alarm rates are lower. The frequency-based mirror effect is, however, most commonly studied. In this case, low-frequency words are easier to recognize and produce larger d′s, and the mirror effect is thus as follows: Hit rates are higher for low-frequency words than for high-frequency words, whereas the opposite is true for false alarm rates. The underlying density distributions for old and new high- and low-frequency words corresponding to this phenomenon are as shown in Figure 1.

This research was supported by Grant APA 146 from NSERC to Ben Murdock and by a postdoctoral grant from STINT to the author. I thank Ben Murdock for insightful advice and careful comments during the writing of this paper. I also thank Tim Curran, Kisok Kim, Matthew Duncan, Dave Smith, and Carola Åberg for comments on the paper. A special thanks to Shu-Chen Li for helpful comments and support during the last stage of the revising process. Correspondence concerning this article should be addressed to S. Sikström, Department of Psychology, Stockholm University, S-106 91 Stockholm, Sweden (e-mail: sverker@psych.utoronto.ca).



The mirror effect has also been studied with two-alternative forced-choice recognition tests. There are several possible two-alternative comparisons between old and new items. For instance, when high- and low-frequency (better recognized) words are used, there are four standard pairings of old and new by high- and low-frequency comparisons, plus the two null-choice conditions (Glanzer & Bowles, 1976), in which either a pair of new items or a pair of old items is compared. According to the mirror effect, the order for the four standard pairs is

P(BO, BN) < P(AO, BN), P(BO, AN) < P(AO, AN),

where P(O, N) is the probability of choosing old over new. Regarding the null-choice conditions, the mirror effect suggests that

P(BN, AN) > .50, P(AO, BO) > .50.

Another general finding of the mirror effect is that strength manipulations that affect the positions of the new distribution also affect the old distribution when the manipulations are made between conditions (Glanzer et al., 1993). Examples of frequently used experimental manipulations of strength are the duration of study time, encoding conditions, and speed versus accuracy instructions. These experimental manipulations affect both the new and the old distributions, by moving the distributions closer together (an effect known as concentering) or by moving the distributions apart (an effect known as dispersion). However, if these strength manipulations are applied differently to items within a condition or in a single list, quite different results are found. For instance, recently, Stretch and Wixted (1998a, 1998b) differentially strengthened high-frequency words by presenting them with a higher presentation frequency than that for low-frequency words. This manipulation affected the hit rate of the high-frequency words, whereas false alarm rates were largely unaffected. Thus, this manipulation did not show a standard mirror effect.

In addition to the basic data with respect to hit and false alarm rates, data on z-transformed (or normalized) receiver-operating characteristic curves (z-ROCs) represent another regularity of the distributions underlying the mirror effect. The z-ROC curves are obtained by plotting the z-transformation of hit rates against the z-transformation of false alarm rates while varying the recognition criterion. According to signal detection theory, the slope of this curve is equal to the standard deviation of the new-item distribution divided by the standard deviation of the old-item distribution. Data on the mirror effect suggest that the standard deviations of the underlying item distributions are symmetrical with respect to the classes (Glanzer & Adams, 1990; Glanzer et al., 1993). The standard deviation (σ) of the recognition strength of the old-item distribution for the difficult class is smaller than that of the old-item distribution for the easier class [i.e., σ(BO) < σ(AO)]. The standard deviation of the new-item distribution for the difficult class is smaller than the standard deviation for the easier class [i.e., σ(BN) < σ(AN)]. However, this symmetry is tilted, so that the standard deviation of the new distribution is smaller than the standard deviation of the old distribution [i.e., σ(N) < σ(O); Ratcliff, Sheu, & Gronlund, 1992].
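To make the z-transformation concrete, the sketch below estimates a z-ROC slope from hit and false alarm rates observed at several criteria. The rates are made-up illustrative values, not data from the studies cited above.

```python
# Illustrative sketch: estimating a z-ROC slope from hit/false alarm
# rates observed at five recognition criteria (hypothetical values).
from statistics import NormalDist

z = NormalDist().inv_cdf  # the z-transformation (inverse cumulative normal)

hits = [0.95, 0.88, 0.75, 0.60, 0.45]  # hypothetical hit rates
fas  = [0.60, 0.42, 0.25, 0.14, 0.08]  # hypothetical false alarm rates

zh = [z(p) for p in hits]
zf = [z(p) for p in fas]

# Least-squares slope of z(hit) on z(false alarm); under signal
# detection theory this estimates sigma(new) / sigma(old).
n = len(zh)
mf, mh = sum(zf) / n, sum(zh) / n
slope = sum((f - mf) * (h - mh) for f, h in zip(zf, zh)) / sum(
    (f - mf) ** 2 for f in zf)
print(f"z-ROC slope: {slope:.2f}")
```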

Previous Theories of the Mirror Effect

In addition to its empirical generality, the mirror effect is an important benchmark phenomenon, because it has challenged many global matching memory models, such as SAM (Gillund & Shiffrin, 1984), MINERVA (Hintzman, 1988), CHARM (Metcalfe, 1982), TODAM (Murdock, 1982), and the matrix model (Humphreys, Bain, & Pike, 1989). The main limitation of the global memory models in capturing the mirror effect arises from their lack of mechanisms for predicting differential false alarm rates for BN and AN items, given that, by definition, these are the new items and neither class of items is presented during the study phase. Confronted with the limitations of the global memory models, the mirror effect has fostered the development of several recent theories to account for the psychological mechanisms that give rise to the phenomenon itself, but, as will become clear later as these theories are presented, many controversies still remain.

Figure 1. Hypothetical underlying distributions along the decision axis for low-frequency new items (AN), high-frequency new items (BN), high-frequency old items (BO), and low-frequency old items (AO). For illustrative purposes, the variances are plotted in this graph to be equal. The horizontal axis shows the recognition strength that the decision is made on, and the vertical axis the density of this recognition strength.

Response strategy account. For instance, Greene (1996) proposed a response-strategy-based account for the mirror effect regarding order and associative information. Specifically, it was suggested that if subjects equate the number of responses made to each class of words (assuming that the numbers of lures and targets are approximately equal), this simple response strategy would produce response distributions that give rise to the mirror effect. However, Glanzer, Kim, and Adams (1998; see also Murdock, 1998) argued against this account by showing that the mirror effect is still present in the absence of a distinctive set of data and in the absence of response equalization. Furthermore, Greene's account predicts a mirror effect when subjects are made to focus their attention on the more difficult class. However, increasing the number of hits for the difficult class actually diminishes the false alarm rate. Thus, the response strategy account is at odds with the experimental data.

Attention-likelihood theory. Another attempt to account for the mirror effect is the attention-likelihood theory (Glanzer & Adams, 1990; Glanzer et al., 1993). This theory says that the difference between BO and AO occurs because subjects pay more attention to items in Class A (i.e., the better-recognized class) than to the items in Class B. However, the four distributions (i.e., AN, BN, BO, and AO) along the decision axis reflecting the mirror effect occur mainly because subjects transform the recognition strength of the items to log-likelihood ratios and use that as the basis for their decision.

More specifically, the attention-likelihood theory consists of the following assumptions: (1) Stimuli consist of N number of features. The features are either marked or unmarked; a marked feature indicates that it was in the study list. (2) Some proportion, p(new), of features are already marked in a new stimulus, which reflects the noise level, with the rationale that features in a new item (which is not studied during encoding) should be marked because of the random noise in the decision process. (3) Different classes of stimuli (i) evoke different amounts of attention [n(i)]. (4) During study, n(i) features are sampled and marked. The proportion of features sampled for a given stimulus is n(i)/N. Given that, when the noise level is not zero, some proportion of the features of new items will also be marked, the probability that a feature is marked becomes

p(i, old) = p(new) + [1 - p(new)] n(i)/N.

(5) During recognition, subjects randomly sample a set of n(i) number of features. Given that the sampling is independent of whether the features were marked at study, the number of marked features is thus binomially distributed, with parameters n(i) and p(i, old) for old stimuli and n(i) and p(new) for new stimuli. (6) During test, subjects count the number of marked features (x). They note the difference between the sample size and the number of marked features [i.e., n(i) - x]. Yes–no decisions are then based on the log-likelihood ratio given a class [λ(x | i)], and it is defined jointly by x, n(i), p(new), and p(i, old):

λ(x | i) = ln { p(i, old)^x [1 - p(i, old)]^(n(i)-x) / ( p(new)^x [1 - p(new)]^(n(i)-x) ) }.  (1)
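As a concrete illustration, Equation 1 can be transcribed directly into code. The parameter values below are hypothetical and serve only to show how x, n(i), p(new), and p(i, old) jointly determine the decision variable.

```python
# Sketch of Equation 1: the log-likelihood ratio for class i, given
# x marked features in a sample of n(i). Values are hypothetical.
from math import log

def attention_likelihood(x: int, n_i: int, p_new: float, p_i_old: float) -> float:
    """lambda(x|i) = x ln[p(i,old)/p(new)] + (n(i)-x) ln[(1-p(i,old))/(1-p(new))]."""
    return x * log(p_i_old / p_new) + (n_i - x) * log((1 - p_i_old) / (1 - p_new))

# 100 features sampled, noise level .2, and p(i, old) = .2 + .8 * (n(i)/N) = .4:
print(attention_likelihood(x=50, n_i=100, p_new=0.2, p_i_old=0.4))
```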

Therefore, in contrast to strength theories, which include most of the global memory models, the recognition decision is made along the log-likelihood dimension, rather than the dimension of item strength.

In addition to the mirror effect with respect to the ordering of response distributions of different classes of stimuli, the attention-likelihood theory predicts the following inequalities for the slopes (defined by the ratios between the standard deviations of the response distributions for different stimulus classes) of the z-ROC curves:

σ(AO)/σ(BN) < σ(AO)/σ(AN) < σ(BO)/σ(BN) < σ(BO)/σ(AN).  (2)

This prediction constrains the standard deviation of the distribution of the difficult class to be smaller than the standard deviation of the distribution of the easy class [i.e., σ(BN) < σ(AN) and σ(BO) < σ(AO)]. However, it does not constrain the standard deviation of the new distribution to be smaller than the standard deviation of the old distribution [i.e., σ(BO) may be smaller than σ(BN)]. There are 24 possible ways (i.e., 4! = 24) to order the four standard deviations. Equation 2 allows 3 out of the 24 orderings, namely, σ(BO) < σ(BN) < σ(AN) < σ(AO); σ(BN) < σ(AN) < σ(BO) < σ(AO); and σ(BN) < σ(BO) < σ(AN) < σ(AO). However, not all 3 orderings are found in the empirical data. As will be presented later, the variance theory proposed in this paper is more constraining than the attention-likelihood theory, and it permits only those 2 of the orderings that are in line with the empirical data. The attention-likelihood theory has recently inspired at least two models that also are based on log-likelihood ratios, namely, the subjective-likelihood account of recognition memory (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). Below, they are presented in turn.

The subjective-likelihood model. The subjective-likelihood model for recognition memory (McClelland & Chappell, 1998) has been applied to the mirror effect with respect to list length, list strength, z-ROC curves, and other recognition phenomena. A major difference between McClelland and Chappell's approach and the attention-likelihood theory is that the subjective-likelihood account uses one detector for each old and each new item. These detectors contrast the probability that a to-be-recognized item is an old item with the probability that it is a new item, to form the odds ratio. The model makes an old decision if at least one of the logarithms of the odds ratios is larger than a certain fixed criterion; otherwise, a new decision is made. Differing from the attention-likelihood theory, the odds ratio in the subjective-likelihood model is calculated on estimates of the probabilities, rather than on the true values, which a recognition system (or subject) might not have access to but which were nonetheless used in Glanzer et al.'s (1993) model.

The usage of a limited number of detectors, one for each item, in the subjective-likelihood theory is a central mechanism used to account for several phenomena. For example, the strength-based mirror effect is accounted for by the assumption that detectors for strong items, when strengthened, are less likely to be activated by new items during recognition, which then lowers the false alarm rate. This mechanism works when the number of targets is reasonably large in relation to the number of potential distractors, which was assumed in McClelland and Chappell's (1998) simulations. However, in reality, the number of targets in a typical recognition test (for example, 50) is negligible in comparison with the number of possible distractor words (the number of words in the subject's vocabulary, e.g., 20,000). Thus, the subjective-likelihood account for the list-strength effect may encounter a problem when the number of detectors associated to each distractor increases to a more realistic number. A similar problem may also be apparent with respect to the list-length effect, which is accounted for by the assumption that the number of detectors is proportional to the list length. Arguably, the subjective-likelihood theory would not account for the list-length effect, given a more plausible assumption that the number of detectors is related to the size of the subject's vocabulary.

REM. REM (Shiffrin & Steyvers, 1997) is another model whose aim is to account for the mirror effect, ROC curves, list strength, and other phenomena in recognition and recall. Similar to the subjective-likelihood theory, REM is based on likelihood ratios and uses noisy vector-based representations in memory. Although REM also uses likelihood ratios, REM differs in the sense that it uses true probabilities in calculating the likelihood ratios, whereas the subjective-likelihood theory uses estimates. Furthermore, in REM, the value assigned to the model's representation on a particular feature dimension is determined when the dimension is sampled the first time. In the subjective-likelihood theory, the representations of the items are refined successively over presentations.

In REM, several factors are combined to produce the frequency-based mirror effect: (1) The likelihood ratios are assumed to be smaller for old high-frequency words, because high-frequency features are less diagnostic. (2) This factor is larger than the compensating factor that high-frequency words have slightly more matching feature values, because errors in storage tend to produce common values, increasing the probability of accidentally matching a high-frequency feature value. (3) New high-frequency words have a slight increase in the number of chance matches, which increases the likelihood ratio.

Limitations of current theories. Although these theories account for several aspects of the data regarding the mirror effect, they have been subjected to a few general criticisms. Perhaps the most obvious problem with these models is that they predict that strengthening of a class yields a mirror effect. Although this prediction is supported by data in studies in which the strength manipulations were applied between conditions (Glanzer et al., 1993), it is certainly inconsistent with the data when the strength manipulations were applied within conditions (Stretch & Wixted, 1998a; see also Stretch & Wixted, 1998b).

In addition, there are a few other, more specific criticisms of these theories. Here are some problems regarding the attention-likelihood theory. First, calculating the log-likelihood ratios requires knowledge of the class (λ is dependent on i). Thus, it is necessary to classify the to-be-recognized stimuli into distinct categories (i), or at least to estimate the stimuli along some dimension, such as frequency. Glanzer et al. (1993) noted that the attention-likelihood theory predicts the mirror effect even though the estimates of p(i, old) are not accurate, for example, when this probability is set to the average of the two classes. Thus, the estimates of p(i, old) are not critical to the predictions. However, it is necessary to estimate the number of features sampled at recognition [n(i)] in Equation 1 to make the correct predictions, and this process requires knowledge of the class of stimuli. Second, knowledge of several variables is required. Depending on the classification, the attention paid to this category [n(i)] must be estimated. The probability that a feature in a new item is marked must also be estimated. Third, the attention-likelihood theory involves several steps of complex calculation. Although this may not be a reason for dismissing the theory (see Murdock, 1998, for a discussion of this topic), it would be preferable to have a simpler account. Given that the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997), like the attention-likelihood theory, are also based at their cores on variations in log-likelihood ratios, these criticisms would also apply to them.

Research Goal and Organization of the Paper

Given the limitations of current theories, the purpose of this paper is to propose a new account of the mirror effect that can avoid most of these criticisms. The theory is proposed specifically for the frequency-based mirror effect, but it also accounts for the strength-based mirror effect within a list, the strength-based mirror effect between lists, z-ROC curves associated with the mirror effect, and the list-length effect. The paper is organized as follows. First, a brief overview of the theory is presented, which is then followed by an in-depth presentation. Second, the theory is implemented in a connectionist model of memory previously developed by the author (i.e., TECO, the target, event, cue, and object theory; Sikström, 1996a, 1996b). Third, mechanisms of the theory responsible for capturing the mirror effect are presented in detail. Fourth, a section presents the theory's applications with respect to the various classical empirical regularities, such as between-list strengthening, the list-strength effect, z-ROC curves, and response criterion shifting. Fifth, predictions regarding a recently discovered lack of the mirror effect in the context of within-list strength manipulation are presented (e.g., Stretch & Wixted, 1998a, 1998b), and an experiment is carried out to evaluate the prediction. For readers who are interested in the analytic solutions of the theory, mathematical derivations of these solutions are presented in the sixth section, and an analysis of the model's optimal performance is conducted. Finally, implications for distributed models of memory and the relations between the variance theory and previous theories of the mirror effect are discussed.

THE VARIANCE THEORY OF THE MIRROR EFFECT

Overview of the Theory

In a nutshell, the variance theory proposed here is similar to previous models of the mirror effect in the sense that it also uses vector-feature representations of the items and estimates (via simulations) the response probabilities of old (target) and new (lure) items during a recognition test. However, the variance theory is driven by different conceptual and technical considerations. At the conceptual level, the variance theory sets out to capture the mirror effect mainly in terms of the relations between the study materials and the natural preexperimental context associations the items may have. This is conceptually quite different from all previous theories, which seek to explain mirror effects primarily in terms of the individual's decision process. Rather, the approach taken here considers the context in which the individual recognition decision processes take place. The natural frequencies of events occurring in the individual's day-to-day contexts may be reflected in recognition decision processes, and the individuals may or may not know (or be consciously aware of) these processes. At the technical level, the variance theory also differs from previous theories in a significant way. Instead of directly computing ratios between probabilities, a new way of computing recognition strength is proposed: normalizing the difference between the response probabilities for the target and the lure items with the standard deviations of the underlying response distributions.

Specifically, in dealing with the frequency-based mirror effect, a rather plausible key assumption of the variance theory is that high-frequency words are assumed to have appeared in more naturally occurring preexperimental contexts than have the low-frequency words. This assumption is implemented in connectionist network simulations in a rather straightforward way, by associating the simulated high-frequency items with more contexts than the low-frequency items during a simulated preexperimental phase. In implementing the theory, items and contexts are represented by two separate arrays (vectors) of binary features, with each feature being represented by a node (or element of the vector), as is shown in Figure 2. A particular item, such as CAR, activates some features at the item layer. A particular context, such as REPAIR SHOP, activates some features at the context layer. Different items and contexts may or may not activate some of the same features. The item and context features are fully interconnected with weights. When an item appears in a particular context and certain features are activated, the weights that reciprocally connect the item features to the context features are adaptively changed, according to a specific learning rule described later. Note that, in the implementation, two types of contexts, namely, the preexperimental contexts and the study context (representing internally generated time or list information associated with an item during the actual experimental episode), are represented in the network using one common context layer. But these two types of context information are differentiated by two simulation phases, namely, the preexperimental phase and the actual encoding and testing phase. As will be mathematically outlined later, the standard deviation of the input at the context layer increases when an item is associated with several contexts. Therefore, high-frequency items (associated with more preexperimental contexts) will have larger standard deviations than will low-frequency items in their activation patterns, which are subsequently propagated to the item layer. However, the expected value of the input is equal for high- and low-frequency items.

During the recognition test, an item vector is presented to reinstate the activation of the item features. The features of the study context are reinstated by presenting the network with the context pattern encoded during the study-encoding phase (but not from other preexperimental contexts that the network was trained with during the preexperimental phase). The degree of recognition strength is determined by first calculating the net inputs to the context and the item nodes. The net input is the signal a node receives from other active nodes connecting to it, and the strength of the signal determines whether the nodes will be activated or not at retrieval. The net input of a given item node is simply computed as the sum of all weights connected to active nodes. Similarly, the net input of a given context node is simply the sum of all weights connected to active nodes and that particular context node. The net inputs then denote the retrieved state of activation patterns at the item and context layers. The subset of item and context nodes that have activation levels exceeding a particular activation threshold at retrieval, and that were also active during encoding, are then used to calculate the recognition strength. Those nodes whose activation does not exceed the threshold, or that were inactive during encoding, have no influence on recognition strength. For example, assume that the activation threshold is set to 0.5, so that any node (item or context) that was active during encoding and whose retrieved activation during testing exceeded the value of 0.5 would contribute to recognition strength. Imagine that four nodes, out of a total of eight, exceed the threshold and are equal to 0.75, 1.00, 1.25, and 1.50. The recognition strength of the item is the percentage of above-threshold nodes (50%) minus the expected percentage of above-threshold nodes (e.g., 25%), divided by the standard deviation of the actually observed above-threshold nodes (i.e., by the standard deviation of 0.75, 1.00, 1.25, and 1.50).

Figure 2. The variance theory. The upper four circles represent nodes in the item layer. The lower four circles represent nodes in the context layer. The arrows between the item and the context layers represent connections.

Why is recognition strength determined by this rule, as opposed to, say, just the percentage of above-threshold nodes? As will be shown later, this way of measuring recognition strength (subtracting the expected value and dividing with the standard deviation of the net input) allows the model to perform optimally in terms of discriminability between new and old items when the response bias to the yes response is manipulated. And in this case, the model accounts for why the standard deviation of the response distribution of the easy class is larger than that of the response distribution of the difficult class. It is plausible to assume that humans have evolved to generally respond more or less optimally and that this is reflected in their performance, as well as in the implementation of the variance theory. Similarly, the activation threshold is set in the model to the value that leads to the highest d′ (i.e., to optimal performance), which occurs when the activation threshold is between the new and the old distributions. This optimal tuning of the model allows it to account for some rather dramatic results, such as concentering, showing that target and lure distributions from different item classes converge on a strength-of-evidence axis as memory conditions worsen.

Here is a brief example of how the model performs. Consider hypothetical levels of activation generated by high-frequency and low-frequency old (target) and new (lure) items. Because high-frequency words have appeared in a larger number of contexts, they have a larger variance of net input. As such, targets and lures will be relatively more confusable and will generate percentages of activated nodes that are difficult to discriminate. Assume that the standard deviation of the net input is .10 and the relevant proportions are .25 for high-frequency targets and .15 for high-frequency lures. In contrast, low-frequency words will have occurred in fewer contexts and will be less confusable. Assume that the standard deviation of net inputs is less for low-frequency words (for example, .05) and that the percentage of active nodes is .30 for low-frequency targets and .10 for low-frequency lures. Given these values, what are the recognition strengths for high-frequency and low-frequency targets and lures? If the expected proportion of above-threshold nodes is .20, they are

Low-frequency lures: (.10 - .20)/.05 = -2.0
High-frequency lures: (.15 - .20)/.10 = -0.5
High-frequency targets: (.25 - .20)/.10 = 0.5
Low-frequency targets: (.30 - .20)/.05 = 2.0
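A few lines of code reproduce this arithmetic; the four values come out ordered according to the mirror effect.

```python
# The four hypothetical classes from the example above:
# (proportion of active nodes, standard deviation of the net input).
expected = 0.20
classes = {
    "low-frequency lures":    (0.10, 0.05),
    "high-frequency lures":   (0.15, 0.10),
    "high-frequency targets": (0.25, 0.10),
    "low-frequency targets":  (0.30, 0.05),
}
for name, (p_active, sd) in classes.items():
    print(f"{name:22s} S = {(p_active - expected) / sd:+.1f}")
# Prints -2.0, -0.5, +0.5, +2.0: the mirror ordering AN < BN < BO < AO.
```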

The model accounts for the various aspects of memory phenomena by postulating a connectionist neural network model with an implementation and parameter settings that allow it to perform at optimal or near-optimal levels. When the model is optimized, it behaves similarly to how subjects behave, and when it is not optimized, it does not fit the experimental data. This is true not only for the standard mirror effect, but also for exceptions such as the absence of a mirror effect for within-list strength manipulations (something all other competing formal models fail to do). Furthermore, it predicts key features of the ROCs for new and old items, as well as for high- and low-frequency items (something any viable formal model must do).

Presentation of the Variance Theory

In this section, the details of the variance theory are presented. As will become clearer as the theory is unfolded, the theory is analytical, and its solutions can be derived in closed form (readers who are interested can find the analytical solutions in the sixth section). Although the theory itself does not require a particular computational framework, it can be more easily explained and directly implemented by using a connectionist network. Therefore, the presentation of the theory below is couched within the framework of a Hopfield neural network (Hopfield, 1982, 1984), in order to explicate the theory's underlying mechanisms that generate the mirror effect.

Network architecture. The variance theory may be implemented as a two-layer recurrent distributed neural network (see Figure 2). There are two layers in the representation: one is the item layer, and the other the context layer. Both layers consist of N number of nodes (i.e., N features), although the theory could also be implemented with an unequal number of context and item nodes. Thus, the total number of nodes in the network is 2N. The item and the context layers are fully connected, so that all the nodes in one layer are connected through weights to all the nodes in the other layer. Nodes within a layer are not connected (i.e., no lateral connections).

Stimulus and context representation. Contexts and items are represented as binary activation patterns across the nodes in the context and item layers, respectively. A node is active (or activated) if its corresponding feature is present (i.e., of value 1), and a node is inactive (or not activated) if its corresponding feature is absent (i.e., of value 0). There are several reasons for choosing binary representations. For instance, binary representation serves to deblur noisy information at retrieval (Hopfield, 1984). Binary representation also allows for a low proportion of active nodes (sparse representation), which is known to improve performance (Okada, 1996). It also introduces nonlinearity, which is necessary for solving some problems in multilayer networks (Rumelhart, Hinton, & Williams, 1986), and it is perhaps the simplest representation. Furthermore, in the present model, it is shown that it is necessary for capturing characteristics of the z-ROC curves that are associated with the mirror effect.

More specifically, the state of activation for the ith node in the item layer at encoding is denoted x_i^d, where the superscript d denotes the item layer. The state of activation for the jth node in the context layer at encoding is denoted x_j^c, where the superscript c denotes the context layer. Context patterns and item patterns are generated by randomly setting nodes to an active state (i.e., with values of 1) and otherwise to an inactive state (i.e., with values of 0). Let a be a parameter that determines the expected probability that a node is active at encoding. This parameter does not change during the simulation and is assumed to be relatively small (for purposes of sparse representation). Note that a is the expected probability of active nodes, whereas the real percentage of active nodes for specific items or contexts varies around a.
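A minimal sketch of this representation is given below; N follows the simulations reported later (30 nodes per layer), whereas the value of a is an arbitrary illustrative choice.

```python
# Sketch: binary item and context patterns in which each node is
# active with expected probability a (sparse representation).
import numpy as np

rng = np.random.default_rng(seed=1)
N = 30   # nodes per layer, as in the simulations reported later
a = 0.2  # illustrative value; the realized activity varies around a

def make_pattern(n_nodes: int, a: float) -> np.ndarray:
    """Return a 0/1 feature vector with expected proportion a of ones."""
    return (rng.random(n_nodes) < a).astype(float)

item_pattern = make_pattern(N, a)     # x^d
context_pattern = make_pattern(N, a)  # x^c
print(item_pattern.sum(), context_pattern.sum())  # counts vary around N*a
```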

The encoding-study phase. Encoding occurs by changing the connection weights between the item and the context layers. The weights (or the strengths of the connections) contain information about what has been stored in the network. The weight between item node i and context node j is denoted w_ij, and it is initialized to zero. The weight change (Δw_ij) is calculated by the learning rule suggested by Hopfield (1982; see also Hertz, Krogh, & Palmer, 1991, for additional details). This is essentially a Hebbian learning rule that increases connection weights between simultaneously activated nodes. This rule is chosen here because it is more biologically plausible than other rules, such as the delta or the gradient-descent learning rules (e.g., Rumelhart et al., 1986) used in back-propagation networks. However, the variance theory can also be implemented with other learning rules.

According to the Hopfield (1982) learning rule, the weight change is computed as the outer product between the item and the context vectors of activation patterns, with the parameter a first subtracted from both vectors. This subtraction is mathematically necessary to keep the expected value of the weights at zero. Using the notions for item and context activation defined above, the weight change between these two elements of the item and context vectors can be written as

Δw_ij = (x_i^d - a)(x_j^c - a).  (3)

Word frequency as the number of associated contexts at the preexperimental phase. An item may be encoded more or less frequently, and hence be associated with more or less different preexperimental contexts, depending on how often the item occurs in the subject's environment. In the model, at the preexperimental stage of the simulation, an item's frequency is simulated by the number of times the item is encoded in different contexts. A low-frequency item is encoded less often and is associated with a smaller number of contexts, whereas a high-frequency item is encoded more often and is associated with a larger number of different contexts. For instance, one might form a preexperimental association between SAAB and REPAIR SHOP after experiencing the rare event of a new expensive SAAB sports car breaking down halfway through a honeymoon trip to the Grand Canyon, with the SAAB having to be towed to a repair shop somewhere out in the desert. In implementing the variance theory, the relationship between word frequency and preexperimental item–context associations can be simulated straightforwardly, as sketched below. At the preexperimental stage of the simulation, a low-frequency item, SAAB, may be simulated by associating one context item, REPAIR SHOP, with it during encoding. A high-frequency word, CAR, may be simulated by associating three different contexts, REPAIR SHOP, TAXI RIDE, and DRIVING TO WORK, with it during encoding.
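Under the stated assumption that frequency is the number of preexperimental contexts, the simulation of the preexperimental phase reduces to a loop; the sketch below encodes one hypothetical low-frequency item in one random context and one high-frequency item in three.

```python
# Sketch: word frequency as the number of preexperimental contexts.
import numpy as np

N, a = 30, 0.2
rng = np.random.default_rng(seed=2)
pattern = lambda: (rng.random(N) < a).astype(float)

W = np.zeros((N, N))
low_freq_item = pattern()   # e.g., SAAB
high_freq_item = pattern()  # e.g., CAR

# Preexperimental phase: one context for the low-frequency item,
# three different contexts for the high-frequency item.
W += np.outer(low_freq_item - a, pattern() - a)
for _ in range(3):
    W += np.outer(high_freq_item - a, pattern() - a)
```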

The recognition test phase. At recognition, an item is presented to the network: the representation of this item is reinstated as a cue to the item layer, and the representation of the study context (simulating an internally generated context regarding list or time information during the recognition experiment) is reinstated as a cue to the context layer. For example, the representation of the word CAR is reinstated at the item layer. Furthermore, the representation of the study context, STUDY LIST, is reinstated at the context layer. The subjects must have this information (or cue) in order to recognize an item from the particular study context (and not recognize the item from all the other preexperimental contexts). In the actual experimental setting, this information is usually conveyed to the subjects by the explicit instruction to recognize from the study context (e.g., "Do you recognize the word CAR from the list you just studied?").

At recognition, each node receives a signal (called the net input), which is computed on the basis of the other active nodes connecting to it. Item nodes receive net inputs from active context nodes, and context nodes receive net inputs from active item nodes. The net input to a given node is simply the sum of the weights of all other active nodes connected to that node. For example, if item node 1 is connected to four context nodes (1, 2, 3, and 4), where context nodes 1 and 3 are active, the net input to item node 1 is w11 + w13. Thus, active nodes "send" information to nodes that they are connected to, whereas inactive nodes do not send any information. Put another way, nodes "receive" information, or input, from active nodes that they are connected to, but not from the inactive nodes.

Specifically, the net input to item node i is calculated by first multiplying the activity of the context node (labeled j) connected to this node by the weight of the connection between the nodes i and j, and then summing over all connected nodes. In vector formalization, the weight matrix operates on the activation vector, and the output is the net input vector. The net inputs to node i (h_i^d) at the item layer depend on the states of activation of nodes in the context layer and the weights connecting these two layers:

h_i^d = Σ_{j=1}^{N} w_ij x_j^c.  (4)

Following the same principle, a similar function is used to calculate the net inputs to the context nodes. Specifically, the net inputs to context nodes (labeled j) depend on all the states of activation of nodes in the item layer and the weights connecting the two layers:

h_j^c = Σ_{i=1}^{N} w_ij x_i^d.

By inserting Equation 3 into Equation 4 and summing over the p number of encoded patterns during list learning, it is easy to see that the net input to a node is simply the sum of Np number of terms; for example, the net input to an item node is

h_i^d = Σ_p Σ_{j=1}^{N} (x_i^d - a)(x_j^c - a) x_j^c.

VARIANCE THEORY OF MIRROR EFFECT 415

For a = .5, the net inputs are binomially distributed with a certain expected value. Given a certain criterion [i.e., Np(1 - a)a > 10], a binomial distribution can be approximated with a normal distribution (Feller, 1968). For a ≠ .5, there are actually four outcomes; however, the same normal approximation can be used. Thus, for reasonably large parameter values of Npa, the distribution of net inputs to the nodes can be approximated by a normal distribution.

If the to-be-recognized item has not been encoded with the context (i.e., a new item), the net input is simply the sum of random weights. Because the expected values of all weights are zero, the expected value of the net inputs for new items will also be zero. If the item has been encoded with the context (i.e., an old item), the net input is the sum of those weights connected to that node whose respective context nodes were active at encoding. Owing to the adaptive weight changes during encoding, these weights will have an expected value that is larger than zero if both nodes were in the active state during encoding [i.e., each weight change at encoding is computed as (1 - a)²] and less than zero if one node was inactive and the other node was active at encoding [i.e., each weight change at encoding was -a(1 - a)]. Of specific importance for the theory is that the variance of the net inputs to the context nodes (from the item nodes) increases with the number of contexts that are associated with the item. Therefore, the variance of the net input is larger for high- than for low-frequency items. Similarly, the variance of the net input to the item nodes (from the context nodes) increases with the number of items associated with one context (i.e., list length). Therefore, given that the context is constant during a list presentation, the variance of the net inputs is larger for a long than for a short list.
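This variance property is easy to demonstrate numerically. The sketch below encodes one item in either one or three preexperimental contexts plus one study context and compares the spread of the net inputs reaching the context layer; the network size and activity level are arbitrary illustrative choices.

```python
# Sketch: the net input to the context layer has a larger standard
# deviation for an item encoded in many contexts (high frequency)
# than for an item encoded in one context (low frequency).
import numpy as np

N, a = 1000, 0.2
rng = np.random.default_rng(seed=3)
pattern = lambda: (rng.random(N) < a).astype(float)

def context_net_input_std(n_preexperimental_contexts: int) -> float:
    item = pattern()
    W = np.zeros((N, N))
    for _ in range(n_preexperimental_contexts):  # preexperimental phase
        W += np.outer(item - a, pattern() - a)
    W += np.outer(item - a, pattern() - a)       # study-context encoding
    return float((W.T @ item).std())             # spread at the context layer

print("1 context (low frequency):  ", context_net_input_std(1))
print("3 contexts (high frequency):", context_net_input_std(3))
```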

Brief summary of optimal performance. Given the strong selection pressure, arguably, humans and animals have evolved to achieve good memory performance. Therefore, it is reasonable to assume that mechanisms for recognition decisions have evolved to an optimal or near-optimal performance. Following this assumption, the parameter values in the model and the implementation of the model are guided by the idea that the model should perform optimally. A detailed discussion of the issue of optimal performance, with exact derivations of what is optimal performance in the context of the present model, is presented later, in the sixth section. Here, I give a brief summary explaining the results from the analysis of optimal performance, without going into the mathematical details (see Figures 9A, 9B, 9C, and 9D).

The model's performance is optimal if the percentage of nodes active at encoding (a) is low (see Figure 9A). For a low a, it is optimal to base the recognition decision on nodes that were active at encoding and to ignore nodes that were inactive during encoding (see Figure 9A). Also, for a low a, it is optimal to place the activation threshold of the nodes between the expected values of the new and the old net inputs (see Figure 9B). Finally, it is optimal to normalize the recognition strength with the standard deviation of the net input (see Figures 9C and 9D).

For a low percentage of active nodes, it is optimal to base the recognition decision on nodes that were active at encoding (or nodes active in the cue pattern) and to ignore nodes that were inactive at encoding. At recognition, the state of activation of a node may be either active or inactive. Therefore, the nodes that are active in the cue pattern and have a net input above a certain activation threshold are activated at recognition; otherwise, the nodes are inactivated. Let z_i^d denote the state of activation at recognition for item node i. An item node is activated at recognition (z_i^d = 1) if it was active in the cue pattern (x_i^d = 1) and the net input is above the activation threshold (h_i^d > T); otherwise, it is inactivated (z_i^d = 0):

z_i^d = 1 if x_i^d = 1 and h_i^d > T; otherwise, z_i^d = 0.

Similarly, let z_j^c denote the state of activation at recognition for context node j:

z_j^c = 1 if x_j^c = 1 and h_j^c > T; otherwise, z_j^c = 0.
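In code, the rule is a conjunction of the cue state and a threshold test on the net input:

```python
# Sketch: the retrieval activation rule. A node is activated only if
# it is active in the cue pattern AND its net input exceeds T.
import numpy as np

def retrieved_state(cue: np.ndarray, net: np.ndarray, T: float) -> np.ndarray:
    """z = 1 where x = 1 and h > T; otherwise z = 0."""
    return ((cue == 1) & (net > T)).astype(float)

cue = np.array([1, 0, 1, 0])
net = np.array([0.6, 0.9, 0.1, 0.4])
print(retrieved_state(cue, net, T=0.25))  # [1. 0. 0. 0.]
```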

This way of activating nodes at retrieval differs from how nodes are activated in a standard Hopfield (1982) network, where the activation threshold is zero and a node is activated if the net input is above zero (independently of the state of activation in the cue pattern). The way of activating patterns in a Hopfield network is more likely to produce a retrieved pattern that matches the encoded pattern of activation (e.g., the expected value of active nodes at retrieval will be the same as the expected value of active nodes at encoding). However, as will be discussed later, the way suggested to activate the nodes here yields better performance in terms of discrimination between a target item and a distractor item.

As is shown in Figure 9B, performance is optimal when the activation threshold is set approximately between the new and the old net inputs. The activation threshold (T) is set to the expected values of the net inputs of nodes active during encoding (x_i^d = 1, x_j^c = 1) for old and new items. The averaging is computed over all nodes (2N) and over all new and old patterns (P) in the recognition condition. If half of the items are new and half of the items are old, the activation threshold is

T = [1/(2aNP)] Σ_{p=1}^{P} [ Σ_{i=1}^{N} h_i^d x_i^d + Σ_{j=1}^{N} h_j^c x_j^c ].

As was discussed above, the expected net input of new (lure) items is zero. Therefore, the activation threshold is simply half the expected net input for nodes encoded in the active state [T = μ_h(O)/2, where μ_h(O) is the expected value of the net input to nodes encoded in the active state].

It is easy to see that the expected percentage of old and new active nodes at recognition is one half of the percentage of active nodes at encoding (a/2). That is, the activation threshold divides the old and new distributions in the middle. Old items will have a higher expected percentage active, and new items will have a lower expected percentage active. The activation threshold is constant during recognition in one condition. However, it must vary between conditions, depending on the net inputs, to yield optimal performance.

The percentage of nodes active at recognition is counted:

P_c = [1/(2N)] [ Σ_{i=1}^{N} z_i^d + Σ_{j=1}^{N} z_j^c ].

As is shown in Figure 9C, the performance is near-optimal if the recognition decision is based on the number of nodes active at recognition, normalized by the standard deviation of the net inputs across the active features of this item (σ_h′). Thus, this standard deviation is calculated over all the nodes active at encoding (i.e., x_i^d = 1 and x_j^c = 1; nodes inactive at encoding are not used when calculating the standard deviation, because for low levels of a these nodes carry little to no information about the item). The recognition strength (S) for an item is calculated by subtracting the expected percentage of nodes active at recognition (a/2)¹ from the real percentage of nodes active at recognition (P_c) and dividing by the standard deviation of the net inputs of the item (σ_h′):

S = (P_c - a/2) / σ_h′.  (5)
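A direct transcription of Equation 5 follows. The worked example later in this section matches the sample (n - 1) standard deviation, so that convention is assumed here.

```python
# Sketch of Equation 5: S = (P_c - a/2) / sigma_h', where sigma_h' is
# the standard deviation of net inputs over nodes active at encoding.
import numpy as np

def recognition_strength(p_active: float, a: float,
                         encoded_nets: np.ndarray) -> float:
    return (p_active - a / 2) / encoded_nets.std(ddof=1)

# Example: half of the nodes active at recognition, a = .5, and the
# net inputs of the encoded-active nodes as in the later example.
print(recognition_strength(0.5, 0.5, np.array([1.0, 2.0, 0.5, 0.5])))  # ~0.35
```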

The subtraction of the expected percentage of nodes active at recognition makes the expected value of the recognition strength (S) zero. This subtraction is necessary for the normalization to work properly. The subtraction moves the recognition strength distributions symmetrically, so that the old and the new distributions move at the same rate for a given standard deviation of the net input (without the subtraction, the old recognition strength distribution would be more affected than the new distribution). Thus, the recognition strength is determined by the difference between two probabilities (the percentage of active nodes, which varies, and the expected percentage of active nodes, which is constant), divided by the standard deviation of the net input. A yes response (Y) is given if the recognition strength (S) is above the recognition criterion (C). An unbiased decision has a recognition criterion of zero:

Y = S > C.

An issue that may be raised is whether it is sensible to base recognition strength on two quite different sources, namely, the percentage of active nodes and the variability of the net input. The immediate answer is that if it is reasonable to optimize performance, it is also sensible to measure recognition strength this way. Another perspective is to note that unbiased responses can be made only on the percentage of active nodes; that is, a yes response occurs if the percentage of active nodes is larger than the expected percentage of active nodes (P_c > a/2), and the variability of the net input can be ignored. Thus, "normally," subjects base their unbiased decisions on the percentage of active nodes, and the variability of active nodes becomes relevant only when subjects are biased. From this perspective, the percentage of active nodes is used for unbiased responses, and the variability of the net input becomes relevant for confidence judgments. Therefore, by combining both the percentage of active nodes and the variability of the net input, the measure of recognition strength proposed here will also reflect the confidence judgment.

An Example With Step-by-Step Computations

To clarify the computational details involved in the variance theory, a step-by-step example is given here. For tractability, a small network is used, consisting of four item nodes and four context nodes (see Figure 2). The actual simulations, to be reported later, involved a larger network architecture, with 30 nodes at each layer. The percentage of nodes active at encoding (denoted by parameter a) is set to .50. Let item BN be represented as 1100, written as the state of activation of the four nodes x_1^d, x_2^d, x_3^d, x_4^d. Similarly, let 0011 represent item BO, 1010 represent item AO, and 0101 represent item AN. Let context CBN be represented as 1100, or the state of activation of the four nodes x_1^c, x_2^c, x_3^c, x_4^c. Similarly, 0011 represents context CBO, and 0101 represents the experimental study context CExp.

Item BN is a high-frequency new word. For simplicity, it is here encoded only once, with context CBN, in the preexperimental phase (in the simulations below, high-frequency words are preexperimentally associated with three contexts). The 16 weights between the four item nodes and the four context nodes are changed according to the learning rule, where the probability that a node is active at encoding is determined by the parameter a = .5. For example, the weight change between item node 1 and context node 1 is w11 = [x_1^d(BN) - a][x_1^c(CBN) - a] = (1 - .5)(1 - .5) = 1/4, where BN is item BN and CBN represents context CBN. Similarly, item BO is another high-frequency word that, before the experimental phase, is encoded once with context CBO. Items AO and AN are low-frequency old and new words, and they are not encoded at the preexperimental phase.

In the experimental phase, item AO is encoded with the experimental context CExp. Finally, item BO is encoded with the same experimental context CExp. For example, the weight w11 is now equal to

w11 = [x1d(BN) − a][x1c(CBN) − a] + [x1d(BO) − a][x1c(CBO) − a]
    + [x1d(BO) − a][x1c(CExp) − a] + [x1d(AO) − a][x1c(CExp) − a]
    = (1 − .5)(1 − .5) + (0 − .5)(0 − .5) + (0 − .5)(0 − .5) + (1 − .5)(0 − .5)
    = 1/4 + 1/4 + 1/4 − 1/4 = 1/2.

After encoding, the full weight matrix is (.5, 1, −1, −.5; .5, 0, 0, −.5; −.5, 0, 0, .5; −.5, −1, 1, .5), corresponding to the weights w11, w12, . . . , w44, respectively.
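To make the arithmetic above concrete, the following sketch (Python with NumPy; the variable names are mine, not the paper's) accumulates the 16 weights over the four encoding events exactly as described, and reproduces w11 = 1/2:

```python
import numpy as np

a = 0.5  # probability that a node is active at encoding

# Item and context patterns from the example (1 = active, 0 = inactive)
BN, BO, AO = [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]
CBN, CBO, CExp = [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 1]

# The four encoding events: two preexperimental, two experimental
encodings = [(BN, CBN), (BO, CBO), (BO, CExp), (AO, CExp)]

W = np.zeros((4, 4))  # weights between the four item and four context nodes
for item, context in encodings:
    d = np.array(item) - a      # item states relative to a
    c = np.array(context) - a   # context states relative to a
    W += np.outer(d, c)         # learning rule: dw = (x_d - a)(x_c - a)

print(W[0, 0])  # 0.5, matching the worked computation of w11
```

The full simulations reported later use 30 nodes per layer; this toy version only mirrors the four-node example.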

Recall that the recognition strength is computed from the percentage of nodes active at recognition (Equation 5):

S = (Pc − a/2) / σ′h,  where  Pc = (1/2N) [ Σi=1..N zid + Σj=1..N zjc ]

and zid and zjc indicate whether item node i and context node j, respectively, are active at recognition.


At recognition, the state of activation of the old low-frequency item AO is reinstated as a cue to the item layer, and the state of activation of the encoding context at the experimental phase, CExp, is reinstated as a cue to the context layer. The net inputs are then calculated. For example, the net input to context node 1 is

η1c = Σi=1..4 w1i xid = .5 + 0 − 1 − 0 = −.5.

The net inputs to the item nodes are (1, 0, 2, 1), and those to the context nodes are (−.5, .5, −.5, .5). It can be seen that the expected net input for randomly created new items in this network is 0 and that the expected net input for old items encoded in the active state is .5. Therefore, the activation threshold is set to the average of these values, namely, T = .25. Nodes that have a net input above the activation threshold and that are active in the cue pattern are activated at recognition. Thus, the state of activation at recognition for the item nodes is {1010}, and that for the context nodes is {0101} (which is identical to the cue patterns). The percentage of active nodes is counted: Pc(AO) = 4/8 = .5. For an unbiased response (C = 0), this will yield a yes response, because this percentage is larger than the expected percentage of active nodes at recognition (a/2 = .25). The standard deviation of the net inputs for nodes active at encoding is 0.71 (i.e., the standard deviation of 1, 2, .5, and .5, corresponding to nodes 1, 3, 6, and 8). The recognition strength is calculated by subtracting the expected percentage of active nodes at recognition from the percentage of active nodes for the to-be-recognized item and dividing by the standard deviation of the net inputs for active nodes [S = (Pc − a/2)/σ′h = (.5 − .25)/.71 = 0.35].

The recognition of the other three items, BO, AN, and BN, is done in the same way. The results for the four items AN, BN, BO, and AO are: the net inputs, (1, 0, 2, 1, .5, −.5, .5, −.5), where the first four numbers represent item nodes and the last four context nodes, (1, 0, 2, 1, 1.5, .5, −.5, −1.5), (1, 0, 2, 1, −1.5, −.5, .5, 1.5), and (1, 0, 2, 1, −.5, .5, −.5, .5); the states of activation at recognition, (0, 0, 0, 1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 1, 0, 0), (0, 0, 1, 1, 0, 0, 0, 1), and (1, 0, 1, 0, 0, 1, 0, 1); the numbers of nodes active, 1, 2, 3, and 4; the standard deviations of the net inputs, 0.71, 1.08, 1.08, and 0.71; the recognition strengths, −.17, .00, .11, and .35; and the unbiased responses, no, no, yes, and yes, respectively (an unbiased yes response is made when S > C, where C = 0 for unbiased responses). Thus, the subject responds correctly to all items, the recognition strengths are ordered according to the mirror effect, and the variance of the net input is larger for the high- than for the low-frequency words.
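The decision stage of this example can be written out as follows (a sketch of my own, taking the net inputs from the worked example as given rather than recomputing them from the weights). It recovers the activation states and recognition strengths above, up to rounding (e.g., −.18 here vs. −.17 in the text):

```python
import numpy as np

a, T, C = 0.5, 0.25, 0.0   # activity at encoding, activation threshold, criterion

# Cue patterns (four item nodes, then four context nodes; context cue is CExp)
cues = {"AN": [0,1,0,1, 0,1,0,1], "BN": [1,1,0,0, 0,1,0,1],
        "BO": [0,0,1,1, 0,1,0,1], "AO": [1,0,1,0, 0,1,0,1]}
# Net inputs from the worked example (item nodes first, then context nodes)
nets = {"AN": [1,0,2,1,  0.5, -0.5,  0.5, -0.5],
        "BN": [1,0,2,1,  1.5,  0.5, -0.5, -1.5],
        "BO": [1,0,2,1, -1.5, -0.5,  0.5,  1.5],
        "AO": [1,0,2,1, -0.5,  0.5, -0.5,  0.5]}

for name in ("AN", "BN", "BO", "AO"):
    cue, h = np.array(cues[name]), np.array(nets[name])
    active = (cue == 1) & (h > T)        # active in cue AND net above threshold
    Pc = active.mean()                   # percentage of nodes active at test
    sd = h[cue == 1].std(ddof=1)         # SD of net inputs for cue-active nodes
    S = (Pc - a / 2) / sd                # recognition strength (Equation 5)
    print(name, active.astype(int), round(S, 2), "yes" if S > C else "no")
```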

SIMULATIONS OF THE FREQUENCY-BASED MIRROR EFFECT

In this section, the variance theory of the mirror effect is simulated in a connectionist framework that is consistent with a general connectionist theory of memory called TECO (Sikström, 1996a, 1996b). TECO has been used to account for a variety of memory phenomena involving recognition and recall, for example, successive testing of recall and recognition (Sikström, 1996b) and forgetting curves (Sikström, 1999). An extensive description of TECO is beyond the scope of this paper; interested readers are referred to previous articles describing the model for details.

Procedure
The simulation started with initializing the weights to zero. Then 12 items were generated by randomly setting the nodes to an active state with a probability of a. A preexperimental phase then followed, to generate the frequency associated with the items. In this preexperimental phase, half of the items (i.e., 6) were encoded three times, each time in a different context. These items simulated the high-frequency words. The remaining 6 items were not encoded before the experimental phase, and they simulated the low-frequency words.

In the experimental phase, one study-encoding context was generated, using the same procedure as the generation of the contexts in the preexperimental phase. Given that in a standard recognition experiment all studied items would be encoded in the same list, in the simulations the items were encoded once with the same study context. Three of the high-frequency items were encoded, and three of the low-frequency items were encoded. The other three high- and three low-frequency items were not encoded during the experimental phase, and they simulated the new items. Each encoding was simulated by first activating the nodes in the item and context layers. The weights were then changed according to the learning rule (Equation 3).

At the recognition test, the patterns of activation of the 12 items and the study context were reinstated to the network. The net inputs were calculated for each node, and the recognition strength was calculated from all the nodes in the network. The somewhat unrealistic assumption that no learning or change of weights occurred during testing was adopted. However, this is a standard assumption often used in other simulations of recognition memory, cued recall, or categorization (e.g., Kruschke, 1992; Lewandowsky, 1991; Li & Lindenberger, 1999; Li, Lindenberger, & Frensch, 2000). One thousand five hundred simulation runs, with different random item and context patterns, were carried out. The results reported later are based on the average across these runs.
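A compact rendition of this procedure is sketched below. It is my own reconstruction, not the original simulation code: the activation threshold is simply estimated as half the mean old net input (following the optimality rule described later), degenerate all-zero patterns are regenerated, and the exact values will not match Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)
N, a, RUNS = 30, 0.2, 1500           # nodes per layer, activity level, runs

def pattern():                       # random pattern, P(active) = a
    while True:
        p = (rng.random(N) < a).astype(float)
        if p.any():                  # regenerate degenerate all-zero patterns
            return p

def strength(W, item, context, T):   # recognition strength (Equation 5)
    h = np.concatenate([W @ context, W.T @ item])  # item-node, then context-node nets
    cue = np.concatenate([item, context]) == 1
    Pc = (cue & (h > T)).mean()
    return (Pc - a / 2) / max(h[cue].std(ddof=1), 1e-9)

S = {k: [] for k in ("AN", "BN", "BO", "AO")}
for _ in range(RUNS):
    W = np.zeros((N, N))
    items = [pattern() for _ in range(12)]   # 0-5 high, 6-11 low frequency
    for i in range(6):                       # preexperimental phase:
        for _ in range(3):                   # three different contexts each
            W += np.outer(items[i] - a, pattern() - a)
    study = pattern()                        # one shared study context
    for i in (0, 1, 2, 6, 7, 8):             # old items, encoded once
        W += np.outer(items[i] - a, study - a)
    old = np.mean([(W @ study)[items[i] == 1].mean() for i in (0, 1, 2, 6, 7, 8)])
    T = old / 2                              # threshold midway between new (~0) and old
    for i, k in ((9, "AN"), (3, "BN"), (0, "BO"), (6, "AO")):
        S[k].append(strength(W, items[i], study, T))

for k in ("AN", "BN", "BO", "AO"):           # mirror ordering expected on average
    print(k, round(float(np.mean(S[k])), 3))
```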

Parameters
The number of high-frequency patterns was six (each encoded three times preexperimentally, and three of them encoded once experimentally to simulate the old items), and the number of low-frequency patterns was six (three of them encoded experimentally once to simulate the old items). The following parameter settings were used. The number of nodes in each layer (N) was 30 (the total number of nodes was 2N = 60), and the percentage of nodes active at encoding (a) was .20. Initially, all the weights were set to zero. The recognition criterion (C) was set to 0.

Results
Figure 3A shows the results in terms of the density of the net inputs to an active node. The expected values of the net inputs are approximately equal for new high- and new low-frequency items [μh(AN) = .00, μh(BN) = .00] and for old low- and old high-frequency items [μh(AO) = .38, μh(BO) = .38]. The high-frequency items have a larger variance of the net inputs [σh(BN) = .49, σh(BO) = .48] than do the low-frequency items [σh(AN) = .41, σh(AO) = .40]. The variances of the old and the new distributions are approximately equal.

Figure 3B shows the density of the recognition strength. The results show the mirror effect: The hit rate is higher for low- than for high-frequency items, and the false alarm rate is lower for low- than for high-frequency items [P(AN) = .18, P(BN) = .25, P(BO) = .64, P(AO) = .71]. The standard deviation of the recognition strength is larger for old than for new words and larger for the low-frequency words than for the high-frequency words [σs(AN) = 0.029, σs(BN) = 0.019, σs(BO) = 0.023, σs(AO) = 0.031]. These findings agree with the empirical data and the predictions of the attention-likelihood model (Glanzer et al., 1993). Thus, the simulation shows that the variance theory can account for the frequency-based mirror effect and the associated ROC curves.

EXPLICATIONS OF THE MECHANISMS

In this section, three essential mechanisms of the variance theory that account for the mirror effect are discussed in detail. The mechanisms involved are (1) the variance of the net input, (2) the expected value of the net input, and (3) the way in which the recognition strength is calculated by counting the number of nodes so that performance is optimal.

Overview
The first mechanism is that items from the difficult class (i.e., high-frequency items in this case) have a higher variance of the net inputs than do items from the easy class (i.e., low-frequency items), but the variance is independent of whether the items are old or new. The mechanism is illustrated in Figure 4A (noncumulative). It should also be underscored that the difference in variance as a function of class membership is not a primitive of the theory that the theorist manipulates. It is the natural consequence of the rather plausible assumption that high-frequency items appear more often and are associated with a greater variety of contexts than are low-frequency words. The second mechanism is that old items have a higher expected net input to nodes encoded in the active state than do new items (but the expected net input does not depend on the class of the items). This mechanism is illustrated in Figure 4B (cumulative). The third mechanism is the way recognition strength is measured, by counting the active nodes so that recognition performance (e.g., d′) is optimal or near-optimal. An extended analysis of optimality is presented at the end of the paper; however, the results in Figures 9A–9D can summarize the main points here. The results from this analysis imply that the decision must be based on the nodes active at encoding (see Figure 9A), the percentage of active nodes must be low (see Figure 9A), the activation threshold needs to be between the expected net inputs of the new and the old items (see Figure 9B), and the recognition strength is normalized by the standard deviation of the net inputs of the item (see Figures 9C and 9D). The density of the percentage of active nodes is illustrated in Figure 4C, and the normalized percentage of active nodes is illustrated in Figure 4D. Here it is shown how these mechanisms predict the mirror effect. Below, these three sets of mechanisms are explained in more detail.

Figure 3. (A) Simulations of net inputs. The vertical axis shows the simulated density of the net inputs. (B) Simulations of the mirror effect. The vertical axis shows the simulated density of the recognition strength.

Variance of the Net Input
The first, and perhaps the most important, mechanism is that the difficult class has a larger variability in the net inputs than does the easy class. As will be discussed later, this property is shared with other theories, such as the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). However, a unique aspect of the variance theory is that the larger variability is a necessary outcome of simply encoding high-frequency items more times than low-frequency items. In the subjective-likelihood account, a larger variability for high- than for low-frequency words is an assumption controlled by a free parameter (p0), reflecting the idea that high-frequency words have more definitions than do low-frequency words. The variance theory needs no such assumption and no additional free parameters to control the variability. Whereas the subjective-likelihood account tries to capture word frequency with a free parameter, in the variance theory the word frequency effect is simulated via the number of contexts associated with an item.

Owing to how the variance theory implements the relations between contexts and items, the variance of the net inputs increases with the frequency with which an item is encoded in different contexts. Intuitively, this occurs because a high-frequency item activates several different contexts, causing the representations of many competing contexts to be activated simultaneously. Low-frequency items are associated with fewer contexts than are high-frequency items, which causes the representations of fewer contexts to be activated simultaneously. Thus, low-frequency items have less variability in the net inputs than do high-frequency items.

Figure 4. (A) The probability density of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The dotted vertical line is the activation threshold. (B) The cumulative probability distributions of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the cumulative probability distributions of the net inputs. (C) The density of the percentage of nodes active at recognition in the variance theory. The horizontal axis shows the percentage of nodes active at recognition, and the vertical axis the density. (D) The density of recognition strength in the variance theory, using standard parameter values. The horizontal axis shows the recognition strength, and the vertical axis the density of the recognition strength.

At recognition, an item produces a net input in the context layer that is a mixture of the net inputs from the study context that the network is instructed to retrieve from and from several uncorrelated preexperimental contexts that were associated with the item during the preexperimental phase. The study context that the network is instructed to retrieve from contributes to the correct active state, for example, by adding +1 to the net input of a node (which is the expected net input for a node encoded in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was encoded in an active state, or it could add −1 to the net input if the node was encoded in an inactive state in that particular preexperimental context (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input; that is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and −1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1² for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or −1/4 [see Equation 3], each yielding an increase in variability of (1/4)² = 1/16. An expected value of aN = 4 nodes contribute, so the total increase in variability is 4 × 1/16 = 1/4.)
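The variance argument in this paragraph is easy to check numerically. In the N = 8, a = .5 case, each preexperimental context contributes roughly +1 or −1 to a node's net input with equal probability, so k contexts give an expected value near 0 and a variance near k (a sketch with made-up sampling, not the paper's simulation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Each preexperimental context adds +1 or -1 to the net input
# with equal probability (the N = 8, a = .5 case in the text)
for k in (1, 2, 4, 8):
    nets = rng.choice([-1.0, 1.0], size=(100_000, k)).sum(axis=1)
    print(k, round(nets.mean(), 3), round(nets.var(), 2))  # mean ~0, variance ~k
```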

A mathematical analysis of how the variability of the net inputs to the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs to context nodes increases linearly with the number of times a given item is encoded in different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.

In the same way as the variability of the context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net inputs to the item nodes increases with the number of times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input
The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase are equal for the two classes. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given exactly the same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are interested only in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables, such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to a smaller increase in the net input, than does a longer period of study time, for example, 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encodings in other contexts will not affect the expected net input, but they do affect the variability of the net input, as was demonstrated above. The item–study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, as a difficult class item) is equal to the expected net input for ARCTIC (a low-frequency word, as an easy class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and testing conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult class item is equal to the expected net input for a new easy class item.

The probability density functions of the net inputs for nodes in the active state are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted; new nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of the net inputs for the difficult class items [σh(B)] is larger than the standard deviation of the net inputs for the easy class items [σh(A)]. The second mechanism is shown in the figures in that the expected net input of an easy class new item [μh(AN)] is equal to the expected net input of a difficult class new item [μh(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: The expected net input for the old easy class items [μh(AO)] is equal to the expected net input for the old difficult class items [μh(BO)].

Recognition Strength
The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that performance is optimal or near-optimal. If the net input is above the activation threshold (T) and the node was active at encoding, the node is activated at recognition. Otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities and distributions predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of parameter values for the activation threshold: P(AN) < P(BN) < P(BO) < P(AO) for μh(AN) = μh(BN) < T < μh(AO) = μh(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy and the difficult class items, whereas the expected strengths of the net inputs are equal. The variability is lower for easy class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy than for the difficult class items when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy class items is higher than the hit rate for difficult class items when the activation threshold is less than the expected value of the old items.

The activation threshold is set between the expected values of the new and the old net inputs, so that performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for the difficult and easy class items:

T = (1/4)[μh(AN) + μh(BN) + μh(BO) + μh(AO)] = (1/4)[μh(BO) + μh(AO)] = (1/2)μh(O),

where the second step uses μh(AN) = μh(BN) = 0.

Thus, in the variance model, the activation threshold is fixed for recognition within one condition, although it may vary between different recognition conditions to optimize performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with the shift in the activation threshold necessary for keeping performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test in order to maximize performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net inputs (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [μh(AN) = μh(BN) < μh(BO) = μh(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σp(O) > σp(N)], whereas the standard deviations of the net inputs are equal for old and new items [σh(N) = σh(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions, because fewer nodes are active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability that nodes are active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero, and it increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of the proportion of active nodes is approximately proportional to that proportion [σp² = Pc(1 − Pc)/N ≈ Pc/N ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes is smaller for new than for old items [σp(AN) < σp(BN) < σp(BO) < σp(AO)], whereas the standard deviation of the net inputs shows no such ordering [σh(AN) = σh(AO) < σh(BN) = σh(BO)]. The essential mechanism that produces these changes in the means and standard deviations is the nonlinearity introduced by counting the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because this improves performance.
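The dependence of σp on Pc plotted in Figure 5 is just the binomial standard deviation of a proportion; a minimal check (using 2N = 100, as in the figure caption) is:

```python
import numpy as np

n = 100                                 # total number of nodes (2N)
Pc = np.linspace(0.0, 1.0, 11)          # probability that a node is active
sp = np.sqrt(Pc * (1 - Pc) / n)         # SD of the proportion of active nodes
print(np.round(sp, 3))                  # zero at 0 and 1, maximal at .5
```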

To optimize performance, subjects base their recognition decision on the number of active nodes, normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments when the responses are biased, but plays no role for unbiased responses.


To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, of whether the item is new or old, and of test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σ′h).

Note that the standard deviation of the net inputs of the to-be-recognized item (σ′h) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviations of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, with the criterion C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurement). A large standard deviation of the net inputs for an item (correlated with difficulty) influences the decision toward uncertainty, whereas a small standard deviation of the net inputs for an item (correlated with less difficulty) influences the decision to be more certain.

Figure 4D shows the density distributions of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult class (in Figure 4C) change when they are normalized by the variability of the net inputs (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net inputs. The difficult class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high; the easy class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of the net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here, it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution that is presented at the end of the paper, together with an analysis of optimal performance.

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.


The Material-Based Mirror Effect for
High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a), plus one parameter for each class of words [the standard deviation of the net input, σh(·)]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs for the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs for the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = −0.012, μs(BN) = −0.008, μs(BO) = 0.008, and μs(AO) = 0.012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists
The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize performance. This change in the activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold changes. There is an important difference between a change in the recognition criterion and a change in the activation threshold: The change in the activation threshold optimizes performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep, or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize performance. The distributions in Figure 6A are closer together than the distributions in Figure 4D. Thus, decreasing the net inputs, for example by diminishing study time, moves the distributions closer together, showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite direction, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list.


In that case, the activation threshold is also constant during the recognition test, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed at 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or for any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

Table 1
General Table of Results From the Experiment

Condition    AN    BN    BO    AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13   .17   .69   .82       0.60             0.86
Frequency    .20   .28   .80   .68       1.01             0.66
Time         .10   .15   .78   .76       0.89             0.81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low- (AN) and high- (BN) frequency words, the hit rates for high- (BO) and low- (AO) frequency words, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength, and the vertical axes the density.

List-Length Effect
Everything else being equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net inputs (σh) for both easy and difficult class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance: In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes; in the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves
The percentage of nodes active at recognition is smaller for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than one half. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes: If the percentage of active nodes is zero, the standard deviation obviously is zero, and the standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) than for the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition and makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are normalized by the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths become equal. Figure 6D plots the predictions of the theory when all the standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020. The ratios of these standard deviations must follow Equation 2. This is indeed the case: σs(BN)/σs(AO) = 0.61 < 0.74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = 0.78 < 0.94 = σs(AN)/σs(BO).

Changing the Recognition Criterion
The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = 0.016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy class words than for difficult class words. This prediction is supported by empirical data: Greene (1996) asked subjects to respond yes only if they were sure of their response, and, consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN); P(BO, AN) < P(AO, AN); P(BN, AN) > .50; and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.

Within-List Strength Manipulation
So far, the predictions made by the variance theory are qualitatively (though not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage to old high-frequency items when attention is focused on these items) must be larger than the effect of the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage to old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, than low-frequency items.

It is apparent from the density of the net inputs (Figure 7A) that the density of the recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75 rather than 0.50. The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are both larger for the high-frequency items than for the low-frequency items. The probability that a node is active at recognition is

Pc = a ∫ from T to ∞ of [1/(σh √(2π))] e^(−(h − μh)²/(2σh²)) dh.

Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 6B: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74].

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the high-frequency distribution (BO) is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

Furthermore, in the variance theory, the ratios of the recognition strength standard deviations for high- and low-frequency items depend mainly on the standard deviations of the net inputs. The standard deviations of the net inputs do not depend on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution; similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = 0.013, σs(BN) = 0.011, σs(BO) = 0.017, and σs(AO) = 0.019. These values are similar to the results when the same amount of attention is paid to the two classes, as in Figure 4D: σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is paid is more easily recognized; for example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect in which the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate distribution for the high-frequency words would be larger than that for the low-frequency words, and, similarly, the standard deviation of the high-frequency false alarm distribution would be larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation was also investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations was replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method
Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years (range, 18–29).

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words had an occurrence of 4–5 times per million, and the high-frequency words an occurrence of 50–55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures; there were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether or not the word had been presented in the list. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition testing was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition. Nine subjects were given the control condition followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results
The results from the experiment are presented in the first three rows of Table 1. The hit rate was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate for the low-frequency words was larger. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, the difference was significant only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], and not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory: The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words, so there was no mirror effect. The prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, because the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words, and the false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly so). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1–5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were then conducted. The slopes of the linear regression curves between the z-transformed hit rates of the low-frequency words and the z-transformed hit rates of the high-frequency words [σs(BO)/σs(AO)], and similarly the slopes for the false alarms [σs(BN)/σs(AN)], are shown in the last two columns of Table 1.
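This slope computation is a linear regression in z-space. A minimal sketch (assuming SciPy, and with invented cumulative yes-rates standing in for real confidence data) is:

```python
import numpy as np
from scipy import stats

# Hypothetical cumulative hit rates at confidence criteria 5 (strict) to 1 (lax)
hits_low  = np.array([0.38, 0.55, 0.71, 0.82, 0.90])   # low-frequency targets (AO)
hits_high = np.array([0.27, 0.41, 0.55, 0.69, 0.85])   # high-frequency targets (BO)

z = stats.norm.ppf                                     # z-transform
result = stats.linregress(z(hits_low), z(hits_high))   # regress z(BO) on z(AO)
print(round(result.slope, 2))                          # estimate of ss(BO)/ss(AO)
```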

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with the results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect even when p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear: It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items should be lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is also fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

$$\frac{1}{8}\sum_{i=1}^{8}\frac{(\mathrm{Observed}_i-\mathrm{Predicted}_i)^2}{\sigma_i^2}.$$

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at a value found to give a good fit. These parameters were the number of features (N = 1000) and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [sh(A)], and the standard deviation of the net inputs for the difficult-class words [sh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, mh(N) = 0, mh(O) = 1, and C = 0]. The empirical standard deviations (si) were not reported in Glanzer et al. (1993), so these parameters were set to one.
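The following Python sketch illustrates this fitting procedure. The yes-rates are computed from Equations 6 and 7 (presented in the Analytic Solutions section below); the observed values and starting parameters are placeholders rather than the actual Glanzer et al. (1993) data, and only the four probabilities (not the slope ratios) enter the loss, so this is a sketch of the method, not a reproduction of the reported fits:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

N, C = 50, 0.0             # nodes per layer (2N = 100) and recognition criterion
MU_NEW, MU_OLD = 0.0, 1.0  # expected net inputs, as in Figure 4D

def p_yes(mu_h, sh, a):
    T = (MU_NEW + MU_OLD) / 2                     # threshold between new and old
    pc = a * norm.sf(T, loc=mu_h, scale=sh)       # Equation 6
    mu_s = (pc - a / 2) / sh                      # expected recognition strength
    s_s = np.sqrt(pc * (1 - pc) / (2 * N)) / sh   # Equation 7
    return norm.sf(C, loc=mu_s, scale=s_s)        # probability of a yes response

def predict(params):
    a, sh_A, sh_B = params
    # Yes-rates for AN, BN, BO, and AO.
    return np.array([p_yes(MU_NEW, sh_A, a), p_yes(MU_NEW, sh_B, a),
                     p_yes(MU_OLD, sh_B, a), p_yes(MU_OLD, sh_A, a)])

observed = np.array([0.20, 0.30, 0.60, 0.75])  # placeholder mirror-effect data
sigma2 = np.ones_like(observed)                # error variances set to one

def loss(params):
    return np.mean((observed - predict(params)) ** 2 / sigma2)

fit = minimize(loss, x0=[0.10, 1.0, 1.25], method="Nelder-Mead",
               bounds=[(0.01, 0.5), (0.2, 5.0), (0.2, 5.0)])
print(fit.x)  # fitted a, sh(A), sh(B)
```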

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slope. The attention-likelihood theory accounts for 84% of the variance of the slope. Thus, the variance theory accounted for the same amount of variance for the probabilities and more variance for the slope, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of standard deviations of the net inputs [sh(B)/sh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted by a single variable, namely, the standard deviation of the net inputs for the easy class [sh(A)]. The activity level was fixed to 0.10 and the ratio of the standard deviations of the net inputs, sh(B)/sh(A), to 1.25 (these values were the average of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratio of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


Figure 8B shows the corresponding results for the slope. The accounted variance is 0.96 for the probabilities and 0.85 for the slopes. Thus, the variance theory fits the slopes, using a single parameter, equally well as the attention-likelihood theory does with three fitting parameters. The fit for the variance theory for the probabilities using one parameter is slightly less than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit for the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs from the easy class [sh(A)] and the standard deviation of the net inputs from the difficult class [sh(B)], can also be expressed as the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (mh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as well as the old net inputs [mh(AN) = mh(BN) and mh(AO) = mh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [sh(A) and sh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, sh(A), sh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (mh) and the standard deviation of the net inputs (sh). This probability is dependent on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from T to infinity over the probability density function for a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

$$P_c = a\int_T^{\infty}\frac{1}{\sqrt{2\pi\sigma_h^2}}\,e^{-(h-\mu_h)^2/(2\sigma_h^2)}\,dh. \qquad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (sh), calculates the expected recognition strength (ms):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (sh) as an approximation of the standard deviation of the item (sh′), because it simplifies the analytic solution; however, the variance theory, or the simulation, uses the standard deviation of the item. This approximation is good when there are a large number of features; however, for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (ss) is calculated from sh, Pc, and N. There are 2N nodes in the context and the item layers. The distribution of Pc is binomial, but it can, for a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/2N]^(1/2). The final result is scaled by the normalization factor 1/sh:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c\,(1-P_c)}{2N}\right]^{1/2}. \qquad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion by integrating the density of the recognition strength over a normal distribution:

$$P(\mathrm{Y}) = \int_C^{\infty}\frac{1}{\sqrt{2\pi\sigma_s^2}}\,e^{-(s-\mu_s)^2/(2\sigma_s^2)}\,ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [ms(A)] and B [ms(B)] and the standard deviations of the recognition strengths of A [ss(A)] and B [ss(B)]:

$$P(\mathrm{A,B}) = \int_0^{\infty}\frac{1}{\sqrt{2\pi\left[\sigma_s^2(\mathrm{A})+\sigma_s^2(\mathrm{B})\right]}}\,e^{-\left(s-\left[\mu_s(\mathrm{A})-\mu_s(\mathrm{B})\right]\right)^2/\left(2\left[\sigma_s^2(\mathrm{A})+\sigma_s^2(\mathrm{B})\right]\right)}\,ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
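As a complement, note that the two-alternative forced-choice probability reduces to a single normal tail probability, since the difference of two independent normal strengths is itself normal. A minimal Python sketch, with hypothetical parameter values:

```python
from math import sqrt
from scipy.stats import norm

def p_choose_a(mu_a, s_a, mu_b, s_b):
    # The difference S(A) - S(B) of two independent normal strengths is
    # normal with mean mu_a - mu_b and variance s_a^2 + s_b^2; integrating
    # its density from 0 to infinity gives the choice probability.
    return norm.sf(0.0, loc=mu_a - mu_b, scale=sqrt(s_a**2 + s_b**2))

# Hypothetical strength parameters: an old easy item (AO) against a new
# difficult item (BN).
print(p_choose_a(0.02, 0.025, -0.012, 0.015))   # well above .50
```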



Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is, in the model, calculated with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc to one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (ss² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC. The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations:

$$\frac{\sigma_s^2(\mathrm{N})}{\sigma_s^2(\mathrm{O})} \approx \frac{P_c(\mathrm{N})}{P_c(\mathrm{O})}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations and the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.8²).
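A short numeric illustration of this square-root relation (the node proportions are hypothetical):

```python
import numpy as np

# Under the approximation ss^2 proportional to Pc, the z-ROC slope is the
# square root of the ratio of active-node counts (new over old).
pc_new, pc_old = 0.031, 0.069        # hypothetical proportions of active nodes
print(np.sqrt(pc_new / pc_old))      # predicted z-ROC slope, about 0.67

# Inverting: a slope of 0.8 implies that the new representations have
# 0.8 ** 2 = 0.64 times as many active nodes as the old ones.
```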

Another approximation useful for understanding the model is that, for two classes of items, the number of active nodes in the new distributions is approximately equal and the number of active nodes in the old distributions is approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way. The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions.

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class:

$$\frac{\sigma_s(\mathrm{BO})}{\sigma_s(\mathrm{AO})} \le \frac{\sigma_h(\mathrm{A})}{\sigma_h(\mathrm{B})} \approx \frac{\sigma_s(\mathrm{BN})}{\sigma_s(\mathrm{AN})}.$$

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that the performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of nodes active, setting the activation threshold between old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that an optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network would not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

$$d' = \frac{\mu_s(\mathrm{O})-\mu_s(\mathrm{N})}{\sigma_s(\mathrm{NO})} = \frac{P_c(\mathrm{O})-P_c(\mathrm{N})}{\left[\dfrac{P_c(\mathrm{NO})\,\bigl(1-P_c(\mathrm{NO})\bigr)}{2N}\right]^{1/2}}. \qquad (8)$$

Because ss(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, ss(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc( ) was calculated with Equation 6. The expected value of the net inputs and the standard deviation of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.
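A numeric sketch of this analysis is given below. It assumes the net-input moment equations as reconstructed in the Appendix, so the exact location of the optimum should be taken as illustrative rather than as a reproduction of Figure 9A:

```python
import numpy as np
from scipy.stats import norm

N, p, f, L = 30, 100, 0, 1   # parameter values as in the text

def d_prime(a):
    # Net-input moments per the Appendix reconstructions (assumptions):
    # mu_h(O) = a(1-a)^2 N, noise variance a^3 (1-a)^2 pN, plus the
    # frequency/list-length term for old items.
    mu_old = a * (1 - a) ** 2 * N
    var_new = a ** 3 * (1 - a) ** 2 * p * N
    var_old = var_new + (f + L) * a ** 3 * (1 - a) ** 3 * N ** 2 / 2
    T = mu_old / 2                                   # midway between 0 and mu_old
    pc_new = a * norm.sf(T, 0.0, np.sqrt(var_new))   # Equation 6
    pc_old = a * norm.sf(T, mu_old, np.sqrt(var_old))
    pc_no = (pc_new + pc_old) / 2
    return (pc_old - pc_new) / np.sqrt(pc_no * (1 - pc_no) / (2 * N))  # Eq. 8

a_grid = np.linspace(0.01, 0.5, 200)
best = a_grid[np.argmax([d_prime(a) for a in a_grid])]
print(f"d-prime is maximal near a = {best:.3f}")
```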

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).



The results show that d′ is optimal for a = 0.052. The d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases. For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. Thus, for a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed. Thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40 when T = 0.81 and a = 0.052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = 0.052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = 0.5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight changes, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above 0.15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when old correct responses and new correct responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, the optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because, if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f [S(O)]/f [S(N)], and the optimal performance occurs when this ratio is equal to one (L = 1).
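A minimal sketch of this intersection rule, for hypothetical unequal-variance distributions (the means and standard deviations are illustrative):

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Hypothetical unequal-variance new and old strength distributions.
mu_n, s_n, mu_o, s_o = 0.0, 0.8, 1.0, 1.0

def log_likelihood_ratio(c):
    # log f[S(O)] - log f[S(N)]; zero where the two densities intersect.
    return norm.logpdf(c, mu_o, s_o) - norm.logpdf(c, mu_n, s_n)

# The criterion maximizing hits minus false alarms is the intersection
# point lying between the two means (likelihood ratio L = 1).
c_opt = brentq(log_likelihood_ratio, mu_n, mu_o)
hits = norm.sf(c_opt, mu_o, s_o)
fas = norm.sf(c_opt, mu_n, s_n)
print(f"criterion = {c_opt:.3f}, hits - false alarms = {hits - fas:.3f}")
```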

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [m(N)], the rightmost arrow points at the expected net input of the old items [m(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. An optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

This problem may formally be stated as follows: Given two classes (A and B), with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: The optimal performance occurs when the placements of the maximum likelihoods of the two classes are equal,

$$L(\mathrm{A}) = \frac{f[S(\mathrm{AO})]}{f[S(\mathrm{AN})]} = \frac{f[S(\mathrm{BO})]}{f[S(\mathrm{BN})]} = L(\mathrm{B}).$$

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition threshold for Class A is diminished, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the total false alarm rate must cancel: Δf [S(AN)] + Δf [S(BN)] = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = Δf [S(AO)]/Δf [S(AN)] < Δf [S(BO)]/Δf [S(BN)] = L(B), or Δf [S(AO)] + Δf [S(BO)] < 0. This shows that the change in the placement of the criteria from L(A) = L(B) results in an overall decrease in hit rate (Δf [S(AO)] + Δf [S(BO)] < 0), and performance suffers.
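The following sketch illustrates this result numerically: for two hypothetical equal-variance classes, the Class A criterion is swept while the Class B criterion is adjusted to hold the total false alarm rate fixed, and the total hit rate peaks where the two likelihood ratios agree:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical equal-variance classes: A is easy, B is difficult.
mu = {"AN": 0.0, "AO": 1.5, "BN": 0.0, "BO": 0.7}
s, target_fa = 1.0, 0.20

best = None
for c_a in np.linspace(-2.0, 4.0, 601):
    # Pick the Class B criterion that keeps the total false alarm rate fixed.
    p_b = 2 * target_fa - norm.sf(c_a, mu["AN"], s)
    if not 0.0 < p_b < 1.0:
        continue
    c_b = norm.isf(p_b, mu["BN"], s)
    hit = (norm.sf(c_a, mu["AO"], s) + norm.sf(c_b, mu["BO"], s)) / 2
    lr_a = norm.pdf(c_a, mu["AO"], s) / norm.pdf(c_a, mu["AN"], s)
    lr_b = norm.pdf(c_b, mu["BO"], s) / norm.pdf(c_b, mu["BN"], s)
    if best is None or hit > best[0]:
        best = (hit, lr_a, lr_b)

# At the maximum, the two likelihood ratios agree: L(A) = L(B).
print("max total hit rate = {:.3f}, L(A) = {:.2f}, L(B) = {:.2f}".format(*best))
```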

Note that the variance theory only has one overall recognition criterion. However, the theory, and any theory of the mirror effect, specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can indirectly be inferred from the formulation of the theory. This is done in the variance theory by the normalization factor that scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted in the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = 0.25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than 0.15 and smaller than 0.60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than 0.15 and for false alarm rates larger than 0.60. Thus, there is still some deviation from optimal performance, even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently so that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56), in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate in Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance is better or equal after normalization, as compared with performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for 0.25 < P(N) < 0.40]. For conservative recognition criteria [P(N) < 0.25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately 0.05, and it occurs when the false alarm rate is approximately 0.05. For liberal recognition criteria [P(N) > 0.40], there is also an advantage for normalized performance. The largest advantage is around 0.01, and it occurs when the false alarm rate is 0.70. The advantage for liberal criteria is smaller than the advantage for conservative criteria, because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low for producing appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult distribution. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/(2N) decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation for the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple. The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works. Therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution


than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy class distributions are predicted to be larger than the standard deviations of the difficult class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: ss(BN) < ss(BO) < ss(AN) < ss(AO) and ss(BN) < ss(AN) < ss(BO) < ss(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows ss(BO) < ss(BN) < ss(AN) < ss(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental, learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune from this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or at another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; however, for nonwords, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts


associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood theory's account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition, but it may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (sh′), so the subject does not need to know the class or the standard deviation of the class (sh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items for addressing issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network. However, see McClelland and Chappell (1998) for a brief discussion of this topic. An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kisok, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM–retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University. (ISBN 91-7191-155-3)

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [sp(N)] and old [sp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

$$\frac{a\,\sigma_p(\mathrm{N})}{\sigma_p(\mathrm{N})+\sigma_p(\mathrm{O})}.$$

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the ROC curve is 0.8 [0.8a/(0.8 + 1) ≈ 0.44a]. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input solves the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξj = 1:

$$\mu_h(\mathrm{O}) = \sum_{j=1}^{N}\Delta w_{ij}\,\xi_j = \sum_{j=1}^{N}(\xi_i-a)(\xi_j-a)\,\xi_j = a(1-a)^2 N. \qquad (\mathrm{A1})$$

The expected value of the net inputs for the new items is zero:

$$\mu_h(\mathrm{N}) = 0. \qquad (\mathrm{A2})$$

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

$$\sigma_h^2(\mathrm{N}) = E\Bigg[\bigg(\sum_{p=1}^{P}\sum_{j=1}^{N}\Delta w_{ij}\,\xi_j\bigg)^{\!2}\Bigg] = a^3(1-a)^2 PN. \qquad (\mathrm{A3})$$

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

$$\sigma_h^2(f) = f\,a^3(1-a)^3 N^2. \qquad (\mathrm{A4})$$

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

$$\sigma_h^2(L) = L\,a^3(1-a)^3 N^2.$$

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L, in a network where p patterns have been encoded, is

$$\sigma_h^2(\mathrm{O}) = \frac{(f+L)\,a^3(1-a)^3 N^2}{2} + a^3(1-a)^2 pN. \qquad (\mathrm{A5})$$
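A small numeric sketch of how, under Equation A5 as reconstructed above, the net-input variability grows with the preexperimental frequency of the item (the parameter values are illustrative):

```python
import numpy as np

N, p, a, L = 30, 100, 0.05, 1

def sigma_h_old(f):
    # Equation A5 as reconstructed above: interference from the p stored
    # patterns plus the frequency- and list-length-dependent term.
    noise = a ** 3 * (1 - a) ** 2 * p * N
    episodic = (f + L) * a ** 3 * (1 - a) ** 3 * N ** 2 / 2
    return np.sqrt(noise + episodic)

for f in (0, 5, 50):   # preexperimental frequency of the item
    print(f"f = {f:3d}: sigma_h(O) = {sigma_h_old(f):.2f}")
```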

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)



The mirror effect has also been studied with two-alternative forced-choice recognition tests. There are several possible two-alternative comparisons between old and new items. For instance, when high- and low-frequency (better recognized) words are used, there are four standard pairings of old and new by high- and low-frequency comparisons, plus the two null-choice conditions (Glanzer & Bowles, 1976), in which either a pair of new items or a pair of old items is compared. According to the mirror effect, the order for the four standard pairs is

P(BO, BN) < P(AO, BN), P(BO, AN) < P(AO, AN),

where P(O, N) is the probability of choosing old over new. Regarding the null-choice conditions, the mirror effect suggests that

P(BN, AN) > .50, P(AO, BO) > .50.

Another general finding of the mirror effect is that strength manipulations that affect the positions of the new distribution also affect the old distribution when the manipulations are made between conditions (Glanzer et al., 1993). Examples of frequently used experimental manipulations of strength are the duration of study time, encoding conditions, and speed versus accuracy instructions. These experimental manipulations affect both the new and the old distributions by moving the distributions closer together, an effect known as concentering, or by moving the distributions apart, an effect known as dispersion. However, if these strength manipulations are applied differently to items within a condition or in a single list, quite different results are found. For instance, recently Stretch and Wixted (1998a, 1998b) differentially strengthened high-frequency words by presenting them with a higher presentation frequency than that for low-frequency words. This manipulation affected the hit rate of the high-frequency words, whereas false alarm rates were largely unaffected. Thus, this manipulation did not show a standard mirror effect.

In addition to the basic data with respect to hit and false alarm rates, data on z-transformed (or normalized) receiver-operating characteristic curves (z-ROCs) represent another regularity of the distributions underlying the mirror effect. The z-ROC curves are obtained by plotting the z-transformation of hit rates against the z-transformation of false alarm rates while varying the recognition criterion. According to signal detection theory, the slope of this curve is equal to the standard deviation of the new-item distribution divided by the standard deviation of the old-item distribution. Data on the mirror effect suggest that the standard deviations of the underlying item distributions are symmetrical with respect to the classes (Glanzer & Adams, 1990; Glanzer et al., 1993). The standard deviation (s) of the recognition strength of the old-item distribution for the difficult class is smaller than that of the old-item distribution for the easier class [i.e., s(BO) < s(AO)]. The standard deviation of the new-item distribution for the difficult class is smaller than the standard deviation for the easier class [i.e., s(BN) < s(AN)]. However, this symmetry is tilted, so that the standard deviation of the new distribution is smaller than the standard deviation of the old distribution [i.e., s(N) < s(O); Ratcliff, Sheu, & Gronlund, 1992].

Previous Theories of the Mirror Effect

In addition to its empirical generality, the mirror effect is an important benchmark phenomenon because it has challenged many global matching memory models, such as SAM (Gillund & Shiffrin, 1984), MINERVA (Hintzman, 1988), CHARM (Metcalfe, 1982), TODAM (Murdock, 1982), and the matrix model (Humphreys, Bain, & Pike, 1989). The main limitation of the global memory models in capturing the mirror effect arises from their lack of mechanisms for predicting differential false alarm rates for BN and AN items, given that, by definition, these are the new items and neither class of items is presented during the study phase. Confronted with the limitations of the global memory models, the mirror effect has fostered the development of several recent theories to account for the psychological mechanisms that give rise to the phenomenon itself, but, as will become clear later as these theories are presented, many controversies still remain.

Response strategy account. For instance, Greene (1996) proposed a response-strategy-based account for the mirror effect regarding order and associative information. Specifically, it was suggested that if subjects equate the number of responses made to each class of words (assuming that the numbers of lures and targets are approximately equal), this simple response strategy would produce response distributions that give rise to the mirror effect. However, Glanzer, Kim, and Adams (1998; see also Murdock, 1998) argued against this account by showing that the mirror effect is still present in the absence of a distinctive set of data and in the absence of response equalization. Furthermore, Greene's account predicts a mirror effect when subjects are made to focus their attention on the more difficult class. However, increasing the number of hits for the difficult class actually diminishes the false alarm rate. Thus, the response strategy account is at odds with the experimental data.

Figure 1. Hypothetical underlying distributions along the decision axis for low-frequency new items (AN), high-frequency new items (BN), high-frequency old items (BO), and low-frequency old items (AO). For illustrative purposes, the variances are plotted in this graph to be equal. The horizontal axis shows the recognition strength that the decision is made on, and the vertical axis the density of this recognition strength.

Attention-likelihood theory. Another attempt to account for the mirror effect is the attention-likelihood theory (Glanzer & Adams, 1990; Glanzer et al., 1993). This theory says that the difference between BO and AO occurs because subjects pay more attention to items in Class A (i.e., the better-recognized class) than to the items in Class B. However, the four distributions (i.e., AN, BN, BO, and AO) along the decision axis reflecting the mirror effect occur mainly because subjects transform the recognition strength of the items to log-likelihood ratios and use those as the basis for their decision.

More specifically, the attention-likelihood theory consists of the following assumptions. (1) Stimuli consist of N number of features. The features are either marked or unmarked; a marked feature indicates that the item was in the study list. (2) Some proportion, p(new), of features are already marked in a new stimulus, which reflects the noise level, with the rationale that features in a new item (which is not studied during encoding) should be marked because of the random noise in the decision process. (3) Different classes of stimuli (i) evoke different amounts of attention [n(i)]. (4) During study, n(i) features are sampled and marked. The proportion of features sampled for a given stimulus is n(i)/N. Given that, when the noise level is not zero, some proportion of the new items will also be marked, the probability that a feature is marked becomes

p(i, old) = p(new) + [1 − p(new)] n(i)/N.

(5) During recognition, subjects randomly sample a set of n(i) number of features. Given that the sampling is independent of whether the features were marked at study, the number of marked features is thus binomially distributed, with parameters n(i) and p(i, old) for old stimuli and n(i) and p(new) for new stimuli. (6) During test, subjects count the number of marked features (x). They note the difference between the sample size and the number of marked features [i.e., n(i) − x]. Yes–no decisions are then based on the log-likelihood ratio given a class [λ(x | i)], and it is defined jointly by x, n(i), p(new), and p(i, old):

λ(x | i) = ln { [p(i, old)^x (1 − p(i, old))^(n(i)−x)] / [p(new)^x (1 − p(new))^(n(i)−x)] }. (1)
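As a sketch of how Equation 1 can be evaluated, the following Python function implements the binomial log-likelihood ratio; the parameter values in the example call are hypothetical, chosen only to illustrate the computation:

```python
import math

def log_likelihood_ratio(x, n_i, p_new, N):
    """Equation 1: log-likelihood ratio that x of the n(i) sampled
    features are marked, contrasting old versus new items."""
    p_old = p_new + (1 - p_new) * n_i / N  # probability a feature is marked (old)
    return (x * math.log(p_old / p_new)
            + (n_i - x) * math.log((1 - p_old) / (1 - p_new)))

# Hypothetical values: N = 1,000 features, n(i) = 200 sampled for this
# class, and a noise level of p(new) = .2
print(log_likelihood_ratio(x=60, n_i=200, p_new=0.2, N=1000))
```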

Therefore, in contrast to strength theories, which include most of the global memory models, the recognition decision is made along the log-likelihood dimension, rather than the dimension of item strength.

In addition to the mirror effect with respect to the ordering of response distributions of different classes of stimuli, the attention-likelihood theory predicts the following inequalities for the slopes (defined by the ratios between the standard deviations of the response distributions for different stimulus classes) of the z-ROC curves:

σ(BN)/σ(AO) < σ(AN)/σ(AO) < σ(BN)/σ(BO) < σ(AN)/σ(BO). (2)

This prediction constrains the standard deviation of the distribution of the difficult class to be smaller than the standard deviation of the distribution of the easy class [i.e., σ(BN) < σ(AN) and σ(BO) < σ(AO)]. However, it does not constrain the standard deviation of the new distribution to be smaller than the standard deviation of the old distribution [i.e., σ(BO) may be smaller than σ(BN)]. There are 24 possible ways (i.e., 4! = 24) to order the four standard deviations. Equation 2 allows 3 out of the 24 orderings, namely, σ(BO) < σ(BN) < σ(AN) < σ(AO); σ(BN) < σ(AN) < σ(BO) < σ(AO); and σ(BN) < σ(BO) < σ(AN) < σ(AO). However, not all 3 orderings are found in the empirical data. As will be presented later, the variance theory proposed in this paper is more constraining than the attention-likelihood theory, and it permits only those 2 of the orderings that are in line with the empirical data. The attention-likelihood theory has recently inspired at least two models that also are based on log-likelihood ratios, namely, the subjective-likelihood account of recognition memory (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). Below, they are presented in turn.

The subjective-likelihood model. The subjective-likelihood model for recognition memory (McClelland & Chappell, 1998) has been applied to the mirror effect with respect to list length, list strength, z-ROC curves, and other recognition phenomena. A major difference between McClelland and Chappell's approach and the attention-likelihood theory is that the subjective-likelihood account uses one detector for each old and each new item. These detectors contrast the probability that a to-be-recognized item is an old item with the probability that it is a new item to form the odds ratio. The model makes an old decision if at least one of the logarithms of the odds ratios is larger than a certain fixed criterion; otherwise, a new decision is made. Differing from the attention-likelihood theory, the odds ratio in the subjective-likelihood model is calculated on estimates of the probabilities, rather than on the true values, which a recognition system (or subject) might not have access to but which were nonetheless used in Glanzer et al.'s (1993) model.

The usage of a limited number of detectors, one for each item, in the subjective-likelihood theory is a central mechanism used to account for several phenomena. For example, the strength-based mirror effect is accounted for by the assumption that detectors for strong items, when strengthened, are less likely to be activated by new items during recognition, which then lowers the false alarm rate. This mechanism works when the number of targets is reasonably large in relation to the number of potential distractors, which was assumed in McClelland and Chappell's (1998) simulations. However, in reality, the number of targets in a typical recognition test (for example, 50) is negligible in comparison with the number of possible distractor words (the number of words in the subject's vocabulary; e.g., 20,000). Thus, the subjective-likelihood account for the list-strength effect may encounter a problem when the number of detectors associated with each distractor increases to a more realistic number. A similar problem may also be apparent with respect to the list-length effect, which is accounted for by the assumption that the number of detectors is proportional to the list length. Arguably, the subjective-likelihood theory would not account for the list-length effect given a more plausible assumption that the number of detectors is related to the size of the subject's vocabulary.

REM. REM (Shiffrin & Steyvers, 1997) is another model whose aim is to account for the mirror effect, ROC curves, list strength, and other phenomena in recognition and recall. Similar to the subjective-likelihood theory, REM is based on likelihood ratios and uses noisy vector-based representations in memory. Although REM also uses likelihood ratios, REM differs in the sense that it uses true probabilities in calculating the likelihood ratios, whereas the subjective-likelihood theory uses estimates. Furthermore, in REM, the value assigned to the model's representation on a particular feature dimension is determined when the dimension is sampled the first time. In the subjective-likelihood theory, the representations of the items are refined successively over presentations.

In REM, several factors are combined to produce the frequency-based mirror effect. (1) The likelihood ratios are assumed to be smaller for old high-frequency words, because high-frequency features are less diagnostic. (2) This factor is larger than the compensating factor that high-frequency words have slightly more matching feature values, because errors in storage tend to produce common values, increasing the probability of accidentally matching a high-frequency feature value. (3) New high-frequency words have a slight increase in the number of chance matches, which increases the likelihood ratio.

Limitations of current theories. Although these theories account for several aspects of the data regarding the mirror effect, they have been subjected to a few general criticisms. Perhaps the most obvious problem with these models is that they predict that strengthening of a class yields a mirror effect. Although this prediction is supported by data in studies in which the strength manipulations were applied between conditions (Glanzer et al., 1993), it is certainly inconsistent with the data when the strength manipulations were applied within conditions (Stretch & Wixted, 1998a; see also Stretch & Wixted, 1998b).

In addition, there are a few other, more specific criticisms of these theories. Here are some problems regarding the attention-likelihood theory. First, calculating the log-likelihood ratios requires knowledge of the class (λ is dependent on i). Thus, it is necessary to classify the to-be-recognized stimuli into distinct categories (i), or at least to estimate the stimuli along some dimension, such as frequency. Glanzer et al. (1993) noted that the attention-likelihood theory predicts the mirror effect even though the estimates of p(i, old) are not accurate, for example, when this probability is set to the average of the two classes. Thus, the estimates of p(i, old) are not critical to the predictions. However, it is necessary to estimate the number of features sampled at recognition [n(i)] in Equation 1 to make the correct predictions, and this process requires knowledge of the class of stimuli. Second, knowledge of several variables is required. Depending on the classification, the attention paid to this category [n(i)] must be estimated. The probability that a feature in a new item is marked must also be estimated. Third, the attention-likelihood theory involves several steps of complex calculation. Although this may not be a reason for dismissing the theory (see Murdock, 1998, for a discussion of this topic), it would be preferable to have a simpler account. Given that the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997), like the attention-likelihood theory, are also based at their cores on variations in log-likelihood ratios, these criticisms would also apply to them.

Research Goal and Organization of the Paper

Given the limitations of current theories, the purpose of this paper is to propose a new account of the mirror effect that can avoid most of these criticisms. The theory is proposed specifically for the frequency-based mirror effect, but it also accounts for the strength-based mirror effect within a list, the strength-based mirror effect between lists, z-ROC curves associated with the mirror effect, and the list-length effect. The paper is organized as follows. First, a brief overview of the theory is presented, which is then followed by an in-depth presentation. Second, the theory is implemented in a connectionist model of memory previously developed by the author (i.e., TECO, the target, event, cue, and object theory; Sikström, 1996a, 1996b). Third, mechanisms of the theory responsible for capturing the mirror effect are presented in detail. Fourth, a section presents the theory's applications with respect to the various classical empirical regularities, such as between-list strengthening, the list-strength effect, z-ROC curves, and response criterion shifting. Fifth, predictions regarding a recently discovered lack of the mirror effect in the context of within-list strength manipulation are presented (e.g., Stretch & Wixted, 1998a, 1998b), and an experiment is carried out to evaluate the prediction. For readers who are interested in the analytical solutions of the theory, mathematical derivations of these solutions are presented in the sixth section, and an analysis of the model's optimal performance is conducted. Finally, implications for distributed models of memory and the relations between the variance theory and previous theories of the mirror effect are discussed.

THE VARIANCE THEORY OF THE MIRROR EFFECT

Overview of the Theory

In a nutshell, the variance theory proposed here is similar to previous models of the mirror effect in the sense that it also uses vector-feature representations of the items and estimates (via simulations) the response probabilities of old (target) and new (lure) items during a recognition test. However, the variance theory is driven by different conceptual and technical considerations. At the conceptual level, the variance theory sets out to capture the mirror effect mainly in terms of the relations between the study materials and the natural preexperimental context associations the items may have. This is conceptually quite different from all previous theories, which seek to explain mirror effects primarily in terms of the individual's decision process. Rather, the approach taken here considers the context in which the individual recognition decision processes take place. The natural frequencies of events occurring in the individual's day-to-day contexts may be reflected in recognition decision processes, and the individuals may or may not know (or be consciously aware of) these processes. At the technical level, the variance theory also differs from previous theories in a significant way. Instead of directly computing ratios between probabilities, a new way of computing recognition strength is proposed: normalizing the difference between the response probabilities for the target and the lure items with the standard deviations of the underlying response distributions.

Specifically, in dealing with the frequency-based mirror effect, a rather plausible key assumption of the variance theory is that high-frequency words are assumed to have appeared in more naturally occurring preexperimental contexts than have the low-frequency words. This assumption is implemented in connectionist network simulations in a rather straightforward way, by associating the simulated high-frequency items with more contexts than the low-frequency items during a simulated preexperimental phase. In implementing the theory, items and contexts are represented by two separate arrays (vectors) of binary features, with each feature being represented by a node (or element of the vector), as is shown in Figure 2. A particular item, such as CAR, activates some features at the item layer. A particular context, such as REPAIR SHOP, activates some features at the context layer. Different items and contexts may or may not activate some of the same features. The item and context features are fully interconnected with weights. When an item appears in a particular context and certain features are activated, the weights that reciprocally connect the item features to the context features are adaptively changed, according to a specific learning rule described later. Note that, in the implementation, two types of contexts, namely, the preexperimental contexts and the study context (representing internally generated time or list information associated with an item during the actual experimental episode), are represented in the network using one common context layer. But these two types of context information are differentiated by two simulation phases, namely, the preexperimental phase and the actual encoding and testing phase. As will be mathematically outlined later, the standard deviation of the input at the context layer increases when an item is associated with several contexts. Therefore, high-frequency items (associated with more preexperimental contexts) will have larger standard deviations than will low-frequency items in their activation patterns, which are subsequently propagated to the item layer. However, the expected value of the input is equal for high- and low-frequency items.

During the recognition test, an item vector is presented to reinstate the activation of the item features. The features of the study context are reinstated by presenting the network with the context pattern encoded during the study-encoding phase (but not from other preexperimental contexts that the network was trained with during the preexperimental phase). The degree of recognition strength is determined by first calculating the net inputs to the context and the item nodes. The net input is the signal a node receives from other active nodes connecting to it, and the strength of the signal determines whether the nodes will be activated or not at retrieval. The net input of a given item node is simply computed as the sum of all weights connected to active nodes. Similarly, the net input of a given context node is simply the sum of all weights connected to active nodes and that particular context node. The net inputs then denote the retrieved state of activation patterns at the item and context layers. The subset of item and context nodes that have activation levels exceeding a particular activation threshold at retrieval, and that were also active during encoding, are then used to calculate the recognition strength. Those nodes whose activation does not exceed the threshold, or that were inactive during encoding, have no influence on recognition strength. For example, assume that the activation threshold is set to 0.5, so that any node (item or context) that was active during encoding and whose retrieved activation during testing exceeded the value of 0.5 would contribute to recognition strength. Imagine that four nodes out of a total of eight exceed the threshold and are equal to 0.75, 1.00, 1.25, and 1.50. The recognition strength of the item is the percentage of above-threshold nodes (50%) minus the expected percentage of above-threshold nodes (e.g., 25%), divided by the standard deviation of the actually observed above-threshold nodes (i.e., by the standard deviation of 0.75, 1.00, 1.25, and 1.50).

Figure 2. The variance theory. The upper four circles represent nodes in the item layer. The lower four circles represent nodes in the context layer. The arrows between the item and the context layers represent connections.

Why is recognition strength determined by this rule, as opposed to, say, just the percentage of above-threshold nodes? As will be shown later, this way of measuring recognition strength (subtracting the expected value and dividing by the standard deviation of the net input) allows the model to perform optimally, in terms of discriminability between new and old items, when the response bias to the yes response is manipulated. And, in this case, the model accounts for why the standard deviation of the response distribution of the easy class is larger than that of the response distribution of the difficult class. It is plausible to assume that humans have evolved to generally respond more or less optimally and that this is reflected in their performance, as well as in the implementation of the variance theory. Similarly, the activation threshold is set in the model to the value that leads to the highest d′ (i.e., to optimal performance), which occurs when the activation threshold is between the new and the old distributions. This optimal tuning of the model allows it to account for some rather dramatic results, such as concentering, showing that target and lure distributions from different item classes converge on a strength-of-evidence axis as memory conditions worsen.

Here is a brief example of how the model performs. Consider hypothetical levels of activation generated by high-frequency and low-frequency old (target) and new (lure) items. Because high-frequency words have appeared in a larger number of contexts, they have a larger variance of net input. As such, targets and lures will be relatively more confusable and will generate percentages of activated nodes that are difficult to discriminate. Assume that the standard deviation of the net input is 1.0 and that the relevant proportions are .25 for high-frequency targets and .15 for high-frequency lures. In contrast, low-frequency words will have occurred in fewer contexts and will be less confusable. Assume that the standard deviation of net inputs is less for low-frequency words, for example, 0.5, and that the percentage of active nodes is .30 for low-frequency targets and .10 for low-frequency lures. Given these values, what are the recognition strengths for high-frequency and low-frequency targets and lures? If the expected proportion of above-threshold nodes is .20, they are

Low-frequency lures: (.10 − .20) / 0.5 = −.20
High-frequency lures: (.15 − .20) / 1.0 = −.05
High-frequency targets: (.25 − .20) / 1.0 = .05
Low-frequency targets: (.30 − .20) / 0.5 = .20.
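A quick check of this arithmetic (a Python sketch using only the hypothetical numbers above) reproduces the mirror ordering:

```python
# S = (observed proportion active - expected proportion) / sd of net input
expected = 0.20
cases = [
    ("low-frequency lure", 0.10, 0.5),
    ("high-frequency lure", 0.15, 1.0),
    ("high-frequency target", 0.25, 1.0),
    ("low-frequency target", 0.30, 0.5),
]
for name, p_active, sd in cases:
    print(f"{name}: S = {(p_active - expected) / sd:+.2f}")
# Output: -0.20 < -0.05 < +0.05 < +0.20, the mirror order AN < BN < BO < AO
```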

The model accounts for the various aspects of memory phenomena by postulating a connectionist neural network model with an implementation and parameter settings that allow it to perform at optimal or near-optimal levels. When the model is optimized, it behaves similarly to how subjects behave, and when it is not optimized, it does not fit the experimental data. This is true not only for the standard mirror effect, but also for exceptions, such as the absence of a mirror effect for within-list strength manipulations (something all other competing formal models fail to capture). Furthermore, it predicts key features of the ROCs for new and old items, as well as for high- and low-frequency items (something any viable formal model must do).

Presentation of the Variance Theory

In this section, the details of the variance theory are presented. As will become clearer as the theory is unfolded, the theory is analytical, and the analytical solutions are self-contained and solvable (interested readers can find the analytical solutions in the sixth section). Although the theory itself does not require a particular computational framework, it can be more easily explained and directly implemented by using a connectionist network. Therefore, the presentation of the theory below is couched within the framework of a Hopfield neural network (Hopfield, 1982, 1984), in order to explicate the theory's underlying mechanisms that generate the mirror effect.

Network architecture. The variance theory may be implemented as a two-layer recurrent distributed neural network (see Figure 2). There are two layers in the representation: one is the item layer, and the other the context layer. Both layers consist of N number of nodes (i.e., N features), although the theory could also be implemented with an unequal number of context and item nodes. Thus, the total number of nodes in the network is 2N. The item and the context layers are fully connected, so that all the nodes in one layer are connected through weights to all the nodes in the other layer. Nodes within a layer are not connected (i.e., there are no lateral connections).

Stimulus and context representation. Contexts and items are represented as binary activation patterns across the nodes in the context and item layers, respectively. A node is active (or activated) if its corresponding feature is present (i.e., of value 1), and a node is inactive (or not activated) if its corresponding feature is absent (i.e., of value 0). There are several reasons for choosing binary representations. For instance, binary representation serves to deblur noisy information at retrieval (Hopfield, 1984). Binary representation also allows for a low proportion of active nodes (sparse representation), which is known to improve performance (Okada, 1996). It also introduces nonlinearity, which is necessary for solving some problems in multilayer networks (Rumelhart, Hinton, & Williams, 1986), and it is perhaps the simplest representation. Furthermore, in the present model, it is shown that binary representation is necessary for capturing characteristics of the z-ROC curves that are associated with the mirror effect.

More specifically, the state of activation for the ith node in the item layer at encoding is denoted x_i^d, where the superscript d denotes the item layer. The state of activation for the jth node in the context layer at encoding is denoted x_j^c, where the superscript c denotes the context layer. Context patterns and item patterns are generated by randomly setting nodes to an active state (i.e., with values of 1) and otherwise to an inactive state (i.e., with values of 0). Let a be a parameter that determines the expected probability that a node is active at encoding. This parameter does not change during the simulation and is assumed to be relatively small (for purposes of sparse representation). Note that a is the expected probability of active nodes, whereas the real percentage of active nodes for specific items or contexts varies around a.

The encoding-study phase. Encoding occurs by changing the connection weights between the item and the context layers. The weights (or the strengths of the connections) contain the information about what has been stored in the network. The weight between item node i and context node j is denoted w_ij, and it is initialized to zero. The weight change (Δw_ij) is calculated by the learning rule suggested by Hopfield (1982; see also Hertz, Krogh, & Palmer, 1991, for additional details). This is essentially a Hebbian learning rule that increases connection weights between simultaneously activated nodes. This rule is chosen here because it is more biologically plausible than other rules, such as the delta or the gradient-descent learning rules (e.g., Rumelhart et al., 1986) used in back-propagation networks. However, the variance theory can also be implemented with other learning rules.

According to the Hopfield (1982) learning rule, the weight change is computed as the outer product between the item and the context vectors of activation patterns, with the parameter a first subtracted from both vectors. This subtraction is mathematically necessary to keep the expected value of the weights at zero. Using the notation for item and context activation defined above, the weight change between these two elements of the item and context vectors can be written as

Δw_ij = (x_i^d − a)(x_j^c − a). (3)
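In NumPy, one plausible rendering of Equation 3 is a single centered outer product per studied pair (a sketch; the pattern-generation details here are assumptions made for illustration):

```python
import numpy as np

def encode(W, item, context, a):
    """Equation 3: centered Hebbian outer product; increments the
    item-by-context weight matrix W in place."""
    W += np.outer(item - a, context - a)

N = 30   # nodes per layer
a = 0.2  # expected proportion of active nodes
rng = np.random.default_rng(1)

item = (rng.random(N) < a).astype(float)     # random binary item pattern
context = (rng.random(N) < a).astype(float)  # random binary context pattern
W = np.zeros((N, N))
encode(W, item, context, a)
```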

Word frequency as the number of associated contexts at the preexperimental phase. An item may be encoded more or less frequently, and hence be associated with more or less different preexperimental contexts, depending on how often the item occurs in the subject's environment. In the model, at the preexperimental stage of the simulation, an item's frequency is simulated by the number of times the item is encoded in different contexts. A low-frequency item is encoded less often and is associated with a smaller number of contexts, whereas a high-frequency item is encoded more often and is associated with a larger number of different contexts. For instance, one might form a preexperimental association between SAAB and REPAIR SHOP after experiencing the rare event of a new, expensive SAAB sports car breaking down halfway through a honeymoon trip to the Grand Canyon, with the SAAB having to be towed to a repair shop somewhere out in the desert. In implementing the variance theory, the relationship between word frequency and preexperimental item–context associations can be simulated straightforwardly. At the preexperimental stage of the simulation, a low-frequency item, SAAB, may be simulated by associating one context item, REPAIR SHOP, with it during encoding. A high-frequency word, CAR, may be simulated by associating three different contexts, REPAIR SHOP, TAXI RIDE, and DRIVING TO WORK, with it during encoding.

The recognition test phase. At recognition, an item is presented to the network: the representation of this item is reinstated as a cue to the item layer, and the representation of the study context (simulating an internally generated context regarding list or time information during the recognition experiment) is reinstated as a cue to the context layer. For example, the representation of the word CAR is reinstated at the item layer. Furthermore, the representation of the study context STUDY LIST is reinstated at the context layer. The subjects must have this information (or cue) in order to recognize an item from the particular study context (and not recognize the item from all the other preexperimental contexts). In the actual experimental setting, this information is usually conveyed to the subjects by the explicit instruction to recognize from the study context (e.g., "Do you recognize the word CAR from the list you just studied?").

At recognition, each node receives a signal (called the net input), which is computed on the basis of the other active nodes connecting to it. Item nodes receive net inputs from active context nodes, and context nodes receive net inputs from active item nodes. The net input to a given node is simply the sum of the weights of all other active nodes connected to that node. For example, if item node 1 is connected to four context nodes (1, 2, 3, and 4), where context nodes 1 and 3 are active, the net input to item node 1 is w_11 + w_13. Thus, active nodes "send" information to the nodes that they are connected to, whereas inactive nodes do not send any information. Put another way, nodes "receive" information, or input, from the active nodes that they are connected to, but not from the inactive nodes.

Specifically, the net input to item node i is calculated by first multiplying the activity of each context node (labeled j) connected to this node by the weight of the connection between nodes i and j and then summing over all connected nodes. In vector formalization, the weight matrix operates on the activation vector, and the output is the net input vector. The net input to node i (h_i^d) at the item layer depends on the states of activation of the nodes in the context layer and the weights connecting these two layers:

h_i^d = Σ_{j=1}^{N} w_ij x_j^c. (4)

Following the same principle, a similar function is used to calculate the net inputs to the context nodes. Specifically, the net input to context node j depends on all the states of activation of the nodes in the item layer and the weights connecting the two layers:

h_j^c = Σ_{i=1}^{N} w_ji x_i^d.

By inserting Equation 3 into Equation 4 and summing over the p number of encoded patterns during list learning, it is easy to see that the net input to a node is simply the sum of Np number of terms; for example, the net input to an item node is

h_i^d = Σ_p Σ_{j=1}^{N} (x_i^d − a)(x_j^c − a) x_j^c.


For a = .5, the net inputs are binomially distributed with a certain expected value. Given a certain criterion [i.e., Np(1 − a)a > 10], a binomial distribution can be approximated with a normal distribution (Feller, 1968). For a ≠ .5, there are actually four outcomes; however, the same normal approximation can be used. Thus, for reasonably large parameter values of Npa, the distribution of net inputs to the nodes can be approximated by a normal distribution.

If the to-be-recognized item has not been encoded with the context (i.e., a new item), the net input is simply the sum of random weights. Because the expected values of all weights are zero, the expected value of the net inputs for new items will also be zero. If the item has been encoded with the context (i.e., an old item), the net input is the sum of those weights connected to that node whose respective context nodes were active at encoding. Owing to the adaptive weight changes during encoding, these weights will have an expected value that is larger than zero if both nodes were in the active state during encoding [i.e., each weight change at encoding is computed as (1 − a)²] and less than zero if one node was inactive and the other node was active at encoding [i.e., each weight change at encoding is −a(1 − a)]. Of specific importance for the theory is that the variance of the net inputs to the context nodes (from the item nodes) increases with the number of contexts that are associated with the item. Therefore, the variance of the net input is larger for high- than for low-frequency items. Similarly, the variance of the net input to the item nodes (from the context nodes) increases with the number of items associated with one context (i.e., list length). Therefore, given that the context is constant during a list presentation, the variance of the net inputs is larger for a long than for a short list.

Brief summary of optimal performance. Given the strong selection pressure, arguably, humans and animals have evolved to achieve good memory performance. Therefore, it is reasonable to assume that mechanisms for recognition decisions have evolved to optimal or near-optimal performance. Following this assumption, the parameter values in the model and the implementation of the model are guided by the idea that the model should perform optimally. A detailed discussion of the issue of optimal performance, with exact derivations of what constitutes optimal performance in the context of the present model, is presented later, in the sixth section. Here, I give a brief summary explaining the results from the analysis of optimal performance, without going into the mathematical details (see Figures 9A, 9B, 9C, and 9D).

The model's performance is optimal if the percentage of nodes active at encoding (a) is low (see Figure 9A). For a low a, it is optimal to base the recognition decision on nodes that were active at encoding and to ignore nodes that were inactive during encoding (see Figure 9A). Also for a low a, it is optimal to place the activation threshold of the nodes between the expected values of the new and the old net inputs (see Figure 9B). Finally, it is optimal to normalize the recognition strength with the standard deviation of the net input (see Figures 9C and 9D).

For a low percentage of active nodes, it is optimal to base the recognition decision on nodes that were active at encoding (or nodes active in the cue pattern) and to ignore nodes that were inactive at encoding. At recognition, the state of activation of a node may be either active or inactive. Therefore, the nodes that are active in the cue pattern and have a net input above a certain activation threshold are activated at recognition; otherwise, the nodes are inactivated. Let z_i^d denote the state of activation at recognition for item node i. An item node is activated at recognition (z_i^d = 1) if it was active in the cue pattern (x_i^d = 1) and the net input is above the activation threshold (h_i^d > T); otherwise, it is inactivated (z_i^d = 0):

z_i^d = 1 if x_i^d = 1 and h_i^d > T; otherwise, z_i^d = 0.

Similarly, let z_j^c denote the state of activation at recognition for context node j:

z_j^c = 1 if x_j^c = 1 and h_j^c > T; otherwise, z_j^c = 0.
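A sketch of this retrieval activation rule in Python (the toy cue and net input values are invented for illustration):

```python
import numpy as np

def retrieval_activation(cue, net_input, T):
    """A node is active at recognition only if it was active in the
    cue pattern and its net input exceeds the activation threshold T."""
    return ((cue == 1) & (net_input > T)).astype(float)

# Toy values: cue nodes 1 and 3 active, activation threshold T = 0.25
cue = np.array([1.0, 0.0, 1.0, 0.0])
h = np.array([1.0, 0.0, 0.1, 2.0])
print(retrieval_activation(cue, h, T=0.25))  # [1. 0. 0. 0.]
```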

This way of activating nodes at retrieval differs from how nodes are activated in a standard Hopfield (1982) network, where the activation threshold is zero and a node is activated if the net input is above zero (independently of the state of activation in the cue pattern). The way of activating patterns in a Hopfield network is more likely to produce a retrieved pattern that matches the encoded pattern of activation (e.g., the expected value of active nodes at retrieval will be the same as the expected value of active nodes at encoding). However, as will be discussed later, the way suggested to activate the nodes here yields better performance in terms of discrimination between a target item and a distractor item.

As is shown in Figure 9B, performance is optimal when the activation threshold is set approximately between the new and the old net inputs. The activation threshold (T) is set to the expected value of the net inputs of nodes active during encoding (x_i^d = 1, x_j^c = 1) for old and new items. The averaging is computed over all nodes (2N) and over all new and old patterns (p) in the recognition condition. If half of the items are new and half of the items are old, the activation threshold is

T = [1 / (2apN)] Σ_p [Σ_{i=1}^{N} h_i^d x_i^d + Σ_{j=1}^{N} h_j^c x_j^c].

As was discussed above, the expected net input of new (lure) items is zero. Therefore, the activation threshold is simply half the expected net input for nodes encoded in the active state [T = μ_h(O)/2, where μ_h(O) is the expected value of the net input to nodes encoded in the active state].

It is easy to see that the expected percentage of old and new active nodes at recognition is one half of the percentage of active nodes at encoding (a/2). That is, the activation threshold divides the old and the new distributions in the middle. Old items will have a higher expected percentage active, and new items will have a lower expected percentage active. The activation threshold is constant during recognition in one condition. However, it must vary between conditions, depending on the net inputs, to yield optimal performance.

The percentage of nodes active at recognition is counted:

P_c = [1 / (2N)] (Σ_{i=1}^{N} z_i^d + Σ_{j=1}^{N} z_j^c).

As is shown in Figure 9C, performance is near-optimal if the recognition decision is based on the number of nodes active at recognition, normalized by the standard deviation of the net inputs across the active features of this item (σ_h′). Thus, this standard deviation is calculated over all the nodes active at encoding (i.e., x_i^d = 1 and x_j^c = 1; nodes inactive at encoding are not used when calculating the standard deviation, because, for low levels of a, these nodes carry little to no information about the item). The recognition strength (S) for an item is calculated by subtracting the expected percentage of nodes active at recognition (a/2) from the real percentage of nodes active at recognition (P_c) and dividing by the standard deviation of the net inputs of the item (σ_h′):

S = (P_c − a/2) / σ_h′. (5)
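A sketch of Equation 5 in Python follows; the example call uses the values from the step-by-step example given later in this section (note that a sample standard deviation, ddof = 1, is assumed here, because it reproduces the 0.71 reported in that example):

```python
import numpy as np

def recognition_strength(z_item, z_context, h_active, a):
    """Equation 5: S = (P_c - a/2) / sigma_h', where P_c is the
    proportion of the 2N nodes active at recognition and sigma_h' is
    the standard deviation of the net inputs over the nodes that were
    active at encoding."""
    p_c = (z_item.sum() + z_context.sum()) / (z_item.size + z_context.size)
    return (p_c - a / 2) / np.std(h_active, ddof=1)

# Values from the step-by-step example: 4 of 8 nodes active at
# recognition; net inputs 1, 2, .5, .5 at the nodes active at encoding
z_d = np.array([1.0, 0.0, 1.0, 0.0])
z_c = np.array([0.0, 1.0, 0.0, 1.0])
print(recognition_strength(z_d, z_c, np.array([1, 2, .5, .5]), a=0.5))  # ~0.35
```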

The subtraction of the expected percentage of nodes active at recognition makes the expected value of the recognition strength (S) zero. This subtraction is necessary for the normalization to work properly. The subtraction moves the recognition strength distributions symmetrically, so that the old and the new distributions move at the same rate for a given standard deviation of the net input (without the subtraction, the old recognition strength distribution would be more affected than the new distribution). Thus, the recognition strength is determined by the difference between two probabilities (the percentage of active nodes, which varies, and the expected percentage of active nodes, which is constant), divided by the standard deviation of the net input. A yes response (Y) is given if the recognition strength (S) is above the recognition criterion (C); an unbiased decision has a recognition criterion of zero:

Y = S > C.

An issue that may be raised is whether it is sensible to base recognition strength on two quite different sources, namely, the percentage of active nodes and the variability of the net input. The immediate answer is that if it is reasonable to optimize performance, it is also sensible to measure recognition strength this way. Another perspective is to note that unbiased responses can be made only on the percentage of active nodes; that is, a yes response occurs if the percentage of active nodes is larger than the expected percentage of active nodes (P_c > a/2), and the variability of the net input can be ignored. Thus, "normally," subjects base their unbiased decisions on the percentage of active nodes, and the variability of active nodes becomes relevant only when subjects are biased. From this perspective, the percentage of active nodes is used for unbiased responses, and the variability of the net input becomes relevant for confidence judgments. Therefore, by combining both the percentage of active nodes and the variability of the net input, the measure of recognition strength proposed here will also reflect the confidence judgment.

An Example With Step-by-Step Computations

To clarify the computational details involved in the variance theory, a step-by-step example is given here. For tractability, a small network is used, consisting of four item nodes and four context nodes (see Figure 2). The actual simulations, to be reported later, involved a larger network architecture, with 30 nodes at each layer. The percentage of nodes active at encoding (denoted by parameter a) is set to .50. Let item BN be represented as {1, 1, 0, 0}, written as the states of activation of the four nodes x_1^d, x_2^d, x_3^d, x_4^d. Similarly, let {0, 0, 1, 1} represent item BO, {1, 0, 1, 0} represent item AO, and {0, 1, 0, 1} represent item AN. Let context C_BN be represented as {1, 1, 0, 0}, or the states of activation of the four nodes x_1^c, x_2^c, x_3^c, x_4^c. Similarly, {0, 0, 1, 1} represents context C_BO, and {0, 1, 0, 1} represents the experimental study context C_Exp.

Item BN is a high-frequency new word. For simplicity, it is here encoded only once, with context C_BN, in the preexperimental phase (in the simulations below, high-frequency words are preexperimentally associated with three contexts). The 16 weights between the four item nodes and the four context nodes are changed according to the learning rule, where the probability that a node is active at encoding is determined by the parameter a = .5. For example, the weight change between item node 1 and context node 1 is w_11 = [x_1^d(BN) − a][x_1^c(C_BN) − a] = (1 − .5)(1 − .5) = 1/4, where BN is item BN and C_BN represents context C_BN. Similarly, item BO is another high-frequency word that, before the experimental phase, is encoded once with context C_BO. Items AO and AN are low-frequency old and new words, and they are not encoded in the preexperimental phase.

In the experimental phase, item AO is encoded with the experimental context C_Exp. Finally, item BO is encoded with the same experimental context C_Exp. For example, the weight w_11 is now equal to

[x_1^d(BN) − a][x_1^c(C_BN) − a] + [x_1^d(BO) − a][x_1^c(C_BO) − a] + [x_1^d(BO) − a][x_1^c(C_Exp) − a] + [x_1^d(AO) − a][x_1^c(C_Exp) − a]
= (1 − .5)(1 − .5) + (0 − .5)(0 − .5) + (0 − .5)(0 − .5) + (1 − .5)(0 − .5)
= 1/4 + 1/4 + 1/4 − 1/4 = 1/2.

After encoding, the full weight matrix is {.5, 1, −1, −.5; .5, 0, 0, −.5; −.5, 0, 0, .5; −.5, −1, 1, .5}, corresponding to the weights w_11, w_12, . . . , w_44, respectively.
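As a cross-check, the sketch below reproduces the weight matrix and the net input to context node 1 in NumPy. The indexing convention used here (context node first, item node second) is an assumption, chosen because it reproduces the printed order of w_11 through w_44 row by row:

```python
import numpy as np

a = 0.5
# Item and context patterns from the example
BN, BO, AO = (np.array(p, float) for p in ([1,1,0,0], [0,0,1,1], [1,0,1,0]))
C_BN, C_BO, C_Exp = (np.array(p, float) for p in ([1,1,0,0], [0,0,1,1], [0,1,0,1]))

# Four encodings: (BN, C_BN) and (BO, C_BO) preexperimentally, then
# (AO, C_Exp) and (BO, C_Exp) in the experimental phase
W = np.zeros((4, 4))
for item, ctx in [(BN, C_BN), (BO, C_BO), (AO, C_Exp), (BO, C_Exp)]:
    W += np.outer(ctx - a, item - a)  # indexed [context node, item node]

print(W)          # rows: .5 1 -1 -.5 / .5 0 0 -.5 / -.5 0 0 .5 / -.5 -1 1 .5
print(W[0] @ AO)  # net input to context node 1 for cue AO: -0.5
```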


At recognition, the state of activation of the old low-frequency item AO is reinstated as a cue to the item layer, and the state of activation of the encoding context in the experimental phase, C_Exp, is reinstated as a cue to the context layer. The net inputs are calculated. For example, the net input to context node 1 is

h_1^c = Σ_{i=1}^{4} w_1i x_i^d = .5 · 1 + 1 · 0 − 1 · 1 − .5 · 0 = −.5.

The net input to the item nodes is {1, 0, 2, 1}, and that to the context nodes is {−.5, .5, −.5, .5}. It can be seen that the expected net input for randomly created new items in this network is 0 and that the expected net input for old items encoded in the active state is 0.5. Therefore, the activation threshold is set to the average of these values, namely, T = 0.25. Nodes that have a net input above the activation threshold and that are active in the cue pattern are activated at recognition. Thus, the state of activation at recognition for the item nodes is {1, 0, 1, 0}, and that for the context nodes is {0, 1, 0, 1} (which is identical to the cue patterns). The percentage of active nodes is counted: P_c(AO) = 4/8 = .5. For an unbiased response (C = 0), this will yield a yes response, because this percentage is larger than the expected percentage of active nodes at recognition (a/2 = .25). The standard deviation of the net inputs for nodes active at encoding is 0.71 (i.e., the standard deviation of 1, 2, .5, and .5, corresponding to nodes 1, 3, 6, and 8). The recognition strength is calculated by subtracting the expected percentage of active nodes at recognition from the percentage of active nodes for the to-be-recognized item and dividing by the standard deviation of the net inputs for active nodes [S = (P_c − a/2) / σ_h′ = (.5 − .25) / .71 = 0.35].

The recognition of the three items BO, AN, and BN is done in the same way. The results for the four items AN, BN, BO, and AO are the net inputs {1, 0, 2, 1, .5, −.5, .5, −.5}, {1, 0, 2, 1, 1.5, .5, −.5, −1.5}, {1, 0, 2, 1, −1.5, −.5, .5, 1.5}, and {1, 0, 2, 1, −.5, .5, −.5, .5}, where the first four numbers represent item nodes and the last four context nodes; the states of activation at recognition {0, 0, 0, 1, 0, 0, 0, 0}, {1, 0, 0, 0, 0, 1, 0, 0}, {0, 0, 1, 1, 0, 0, 0, 1}, and {1, 0, 1, 0, 0, 1, 0, 1}; the numbers of nodes active 1, 2, 3, and 4; the standard deviations of the net inputs 0.71, 1.08, 1.08, and 0.71; the recognition strengths −.17, .00, .11, and .35; and the unbiased responses no, no, yes, and yes, respectively (an unbiased yes response is made when S > C, where C = 0 for unbiased responses). Thus, the subject responds correctly to all items, the recognition strengths are ordered according to the mirror effect, and the variance of the net input is larger for the high- than for the low-frequency words.

SIMULATIONS OF THE FREQUENCY-BASED MIRROR EFFECT

In this section, the variance theory of the mirror effect is simulated in a connectionist framework that is consistent with a general connectionist theory of memory called TECO (Sikström, 1996a, 1996b). TECO has been used to account for a variety of memory phenomena involving recognition and recall, for example, successive testing of recall and recognition (Sikström, 1996b) and forgetting curves (Sikström, 1999). An extensive description of TECO is beyond the scope of this paper; interested readers are referred to previous articles describing the model for details.

Procedure

The simulation started with initializing the weights to zero. Then, 12 items were generated by randomly setting the nodes to an active state with a probability of a. A preexperimental phase then followed, to generate the frequency associated with the items. In this preexperimental phase, half of the items (i.e., 6) were encoded three times, each time in a different context. These items simulated the high-frequency words. The remaining 6 items were not encoded before the experimental phase, and they simulated the low-frequency words.

At the experimental phase, one study-encoding context was generated, using the same procedure as the generation of the contexts in the preexperimental phase. Given that, in a standard recognition experiment, all studied items would be encoded in the same list, in the simulations the items were thus encoded once with the same study context. Three of the high-frequency items were encoded, and three of the low-frequency items were encoded. The other three high- and three low-frequency items were not encoded during the experimental phase, and they simulated the new items. Each encoding was simulated by first activating the nodes in the item and context layers. The weights were then changed according to the learning rule (Equation 3).

At the recognition test, the patterns of activation of the 12 items and the study context were reinstated to the network. The net inputs were calculated for each node, and the recognition strength was calculated from all the nodes in the network. The somewhat unrealistic assumption that no learning or changes of weights occurred during testing was adopted. However, this is a standard assumption, often used in other simulations of recognition memory, cued recall, or categorization (e.g., Kruschke, 1992; Lewandowsky, 1991; Li & Lindenberger, 1999; Li, Lindenberger, & Frensch, 2000). One thousand five hundred simulation runs, with different random item and context patterns, were carried out. The results reported later are based on the average across these runs.
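A compressed sketch of this procedure in Python follows (parameter values as in the Parameters section below; a single run is noisy, whereas the reported results average 1,500 runs; the per-run threshold estimate is a simplification of the condition-wide threshold described above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, a = 30, 0.20

def pattern():
    return (rng.random(N) < a).astype(float)

items = [pattern() for _ in range(12)]  # items 0-5: HF, items 6-11: LF
W = np.zeros((N, N))

# Preexperimental phase: each HF item is encoded in 3 random contexts
for item in items[:6]:
    for _ in range(3):
        W += np.outer(item - a, pattern() - a)

# Experimental phase: HF items 0-2 and LF items 6-8 are studied once,
# all in a single shared study context
study = pattern()
for i in (0, 1, 2, 6, 7, 8):
    W += np.outer(items[i] - a, study - a)

def cue_net_inputs(item):
    """Net inputs at the nodes active in the cue patterns."""
    h_d, h_c = W @ study, W.T @ item
    return np.concatenate([h_d[item == 1], h_c[study == 1]])

# Threshold: half the mean old net input, estimated from the old items
T = np.mean([cue_net_inputs(items[i]).mean() for i in (0, 1, 2, 6, 7, 8)]) / 2
for i, name in [(0, "HF old"), (3, "HF new"), (6, "LF old"), (9, "LF new")]:
    h = cue_net_inputs(items[i])
    S = ((h > T).sum() / (2 * N) - a / 2) / h.std(ddof=1)
    print(f"{name}: S = {S:+.2f}")
```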

Parameters

The number of high-frequency patterns was six (each encoded three times preexperimentally, with three of them encoded once experimentally to simulate the old items), and the number of low-frequency patterns was six (three of them encoded experimentally once to simulate the old items). The following parameter settings were used: The number of nodes in each layer (N) was 30 (the total number of nodes was 2N = 60), and the percentage of nodes active at encoding (a) was 20%. Initially, all the weights were set to zero. The recognition criterion (C) was set to 0.

Results

Figure 3A shows the results in terms of the density of the net inputs to an active node. The expected values of the net inputs are approximately equal for new high- and new low-frequency items [μ_h(AN) = 0.0, μ_h(BN) = 0.0] and for old low- and old high-frequency items [μ_h(AO) = .38, μ_h(BO) = .38]. The high-frequency items have a larger variance of the net inputs [σ_h(BN) = .49, σ_h(BO) = .48] than do the low-frequency items [σ_h(AN) = .41, σ_h(AO) = .40]. The variances of the old and the new distributions are approximately equal.

Figure 3B shows the density of the recognition strength. The result shows the mirror effect, where the hit rate probability is larger for low- than for high-frequency items and the false alarm rate is lower for low- than for high-frequency items [P(AN) = .18, P(BN) = .25, P(BO) = .64, P(AO) = .71]. The standard deviation of the recognition strength is larger for old than for new words and larger for the low-frequency words than for the high-frequency words [σ_S(AN) = .29, σ_S(BN) = .19, σ_S(BO) = .23, σ_S(AO) = .31]. These findings agree with the empirical data and the predictions of the attention-likelihood model (Glanzer et al., 1993). Thus, the simulation shows that the variance theory can account for the frequency-based mirror effect and the associated ROC curves.

EXPLICATIONS OF THE MECHANISMS

In this section, three essential mechanisms of the variance theory that account for the mirror effect are discussed in detail. The mechanisms involved are (1) the variance of the net input, (2) the expected value of the net input, and (3) the way in which the recognition strength is calculated, by counting the number of active nodes, so that performance is optimal.

Overview

The first mechanism is that items from the difficult class (i.e., high-frequency items in this case) have a higher variance of the net inputs than do items from the easy class (i.e., low-frequency items), but the variance is independent of whether the items are old or new. This mechanism is illustrated in Figure 4A (noncumulative). It should also be underscored that the difference in variance as a function of class membership is not a primitive of the theory that the theorist manipulates. It is the natural consequence of the rather plausible assumption that high-frequency items appear more often and are associated with a greater variety of different contexts than are low-frequency words. The second mechanism is that old items have a higher expected net input to nodes encoded in the active state than do new items (but the expected net input does not depend on the class of the items). This mechanism is illustrated in Figure 4B (cumulative). The third mechanism is the way recognition strength is measured, by counting the active nodes, so that recognition performance (e.g., d′) is optimal or near-optimal. An extended analysis of optimality is presented at the end of the paper; however, the results in Figures 9A–9D can summarize the main points here. The results from this analysis imply that the decision must be based on the nodes active at encoding (see Figure 9A), the percentage of active nodes must be low (see Figure 9A), the activation threshold needs to be between the expected net inputs of the new and the old items (see Figure 9B), and the recognition strength is normalized by the standard deviation of the net inputs of the item (see Figures 9C and 9D). The density of the percentage of active nodes is illustrated in Figure 4C, and the normalized percentage of active nodes is illustrated in Figure 4D. Here, it is shown how these mechanisms predict the mirror effect. Below, these three sets of mechanisms are explained in more detail.

Figure 3. (A) Simulations of net inputs. The vertical axis shows the simulated density of the net inputs. (B) Simulations of the mirror effect. The vertical axis shows the simulated density of the recognition strength.

Variance of the Net Input

The first, and perhaps the most important, mechanism is that the difficult class has a larger variability in the net inputs than does the easy class. As will be discussed later, this is shared with other theories, such as the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). However, a unique aspect of the variance theory is that it is a necessary outcome of simply encoding high-frequency items more times than low-frequency items. In the subjective-likelihood account, a larger variability for high- than for low-frequency words is an assumption that is controlled by a free parameter (p0), to reflect that high-frequency words have more definitions than do low-frequency words. The variance theory needs no such assumptions and no additional free parameters to control the variability. Whereas the subjective-likelihood account tries to capture word frequency with a free parameter, in the variance theory the word frequency effect is simulated via the number of contexts associated with an item.

Owing to how the variance theory implements the relations between contexts and items, the variance of the net inputs increases with the frequency with which an item is encoded in different contexts. Intuitively, this occurs because a high-frequency item activates several different contexts, causing the representations of many competing contexts to be activated simultaneously. Low-frequency items are associated with fewer contexts than are high-frequency items. This causes the representations of fewer contexts to be activated simultaneously. Thus, low-frequency items have less variability in the net inputs than do high-frequency items.

Figure 4. (A) The probability density of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. (B) The cumulative probability distributions of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the cumulative probability distributions of the net inputs. (C) The density of the percentage of nodes active at recognition in the variance theory. The horizontal axis shows the percentage of nodes active at recognition, and the vertical axis the density. (D) The density of recognition strength in the variance theory. The horizontal axis shows the recognition strength, and the vertical axis the density of the recognition strength, using standard parameter values.

At recognition, an item produces a net input in the context layer that is a mixture of the net inputs from the study context that the network is instructed to retrieve from and from several uncorrelated preexperimental contexts that were associated with the item during the preexperimental phase. The study context that the network is instructed to retrieve from contributes to the correct active state, for example, by adding +1 to the net input to a node (which is the expected net input for a node encoded in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was in an active state, or it could add −1 to the net input if the particular preexperimental context was encoded in an inactive state (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input. That is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and −1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1² for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for the items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or −1/4 [see Equation 3], each yielding an increase in variability of (1/4)² = 1/16. An expected value of aN nodes contribute, so the total increase in variability is 4 · 1/16 = 1/4.)
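The linear growth of the net input variance with the number of associated contexts can be checked with a small Monte Carlo sketch (N = 8 and a = .5, as in the example above; the specific simulation details are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, a = 8, 0.5

def net_input_to_context_node(n_contexts):
    """Encode one item in n_contexts random contexts, then return the
    net input to context node 0 when the item is used as a cue."""
    item = (rng.random(N) < a).astype(float)
    w_col = np.zeros(N)  # weights from the item nodes to context node 0
    for _ in range(n_contexts):
        ctx0 = float(rng.random() < a)  # state of context node 0
        w_col += (item - a) * (ctx0 - a)
    return item @ w_col

for k in (1, 2, 4, 8):
    samples = [net_input_to_context_node(k) for _ in range(5000)]
    print(k, round(np.var(samples), 2))  # variance grows roughly linearly in k
```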

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in context nodes increases linearly with how many times a given item is encoded within different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.
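The linear relation is easy to verify numerically. The sketch below is not the paper's simulation; it is a minimal Monte Carlo illustration, assuming the ±1 context contributions of the example above, that the variance of the summed net input grows linearly with the number of associated contexts.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def net_input_variance(n_contexts, n_samples=100_000):
    # Each uncorrelated preexperimental context adds +1 or -1 to the
    # net input with equal probability; the expected sum stays at 0.
    contributions = rng.choice([-1.0, 1.0], size=(n_samples, n_contexts))
    return contributions.sum(axis=1).var()

for f in (1, 2, 4, 8):
    print(f, round(net_input_variance(f), 2))  # variance ~ f: linear growth
```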

In the same way as the variability of the context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with how many times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input

The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase of the two classes are equal. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given the exact same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are only interested in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to smaller increases in the net input, than does a longer period of study time, for example, 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encoding in other learning contexts will not affect the expected net input, but it does affect the variability of the net input, as was demonstrated above. The item–study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, as a difficult class item) is equal to the expected net input for ARCTIC (a low-frequency word, as an easy class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and the testing experimental conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult class item is equal to the expected net input for a new easy class item.

The probability density functions of the net inputs for nodes in the active states are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted. New nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of the net inputs for the easy class items [σh(A)] is smaller than the standard deviation of the net inputs for the difficult class items [σh(B)]. The second mechanism is shown in the figures in that the expected net input of an easy class new item [μh(AN)] is equal to the expected net input of a difficult class new item [μh(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: The expected net input for the old easy class items [μh(AO)] is equal to the expected net input for the old difficult class items [μh(BO)].

Recognition Strength

The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that the performance is optimal or near-optimal. If the net input is above the activation threshold (T) and the node was active at encoding, the node is activated at recognition. Otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities or distributions predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. The variance theory thus has the nice property of accounting for the mirror effect across a large range of parameter values for the activation threshold: P(AN) < P(BN) < P(BO) < P(AO) for μh(AN) = μh(BN) < T < μh(AO) = μh(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy and the difficult class items, whereas the expected strengths of the net inputs are equal. The variability is lower for the easy class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy than for the difficult class items when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy class items is higher than the hit rate for difficult class items when the activation threshold is less than the expected value of the old items.

The activation threshold is set to be between the expected values of the new and the old net inputs, so that the performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for the difficult and the easy class items, respectively:

T = 1/4 [μh(AN) + μh(BN) + μh(BO) + μh(AO)] = 1/4 [μh(BO) + μh(AO)] = 1/2 μh(O),

where the last two equalities use the fact that the expected net inputs of the new distributions are zero.
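A quick check that this placement of T produces the mirror ordering can be written directly from the normal densities in Figure 4A. The sketch below is illustrative, not the paper's code: the σ values are assumptions chosen only so that the easy class has the smaller spread, and the probabilities are per node active at encoding (the full model multiplies them by a and uses the Appendix net-input distributions, so the exact values reported later differ).

```python
from math import erfc, sqrt

def phi_sf(x):
    # Survival function of the standard normal, P(Z > x).
    return 0.5 * erfc(x / sqrt(2.0))

mu_new, mu_old = 0.0, 1.0
sigma = {"A": 1.25, "B": 1.56}     # easy class A has the smaller spread
T = (mu_new + mu_old) / 2.0        # threshold midway between the means

p = {"AN": phi_sf((T - mu_new) / sigma["A"]),
     "BN": phi_sf((T - mu_new) / sigma["B"]),
     "BO": phi_sf((T - mu_old) / sigma["B"]),
     "AO": phi_sf((T - mu_old) / sigma["A"])}
print(p)  # AN < BN < BO < AO: the mirror ordering
```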

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions to optimize the performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with a shift in the activation threshold necessary for keeping the performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test in order to maximize the performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net inputs (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [μh(AN) = μh(BN) < μh(BO) = μh(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σp(O) > σp(N)], whereas the standard deviations of the net inputs are equal for old and new items [σh(N) = σh(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions, because there are fewer nodes active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero. This standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes (σp²) is approximately proportional to the percentage of active nodes [σp² = Pc(1 − Pc)/(2N) ≈ Pc/(2N) ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σp) is smaller for new than for old items [σp(AN) < σp(BN) < σp(BO) < σp(AO)], whereas the standard deviation of the net input is not [σh(AN) = σh(AO) < σh(BN) = σh(BO)]. The essential mechanism that produces these changes in the means and standard deviations is the nonlinearity introduced by the counting of the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because it improves the performance.

To optimize the performance, subjects base their recognition decision on the number of active nodes normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments when the responses are biased, but it plays no role for unbiased responses.



To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σh′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σh′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviation of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, where C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) influences the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) influences the decision to be more certain.

Figure 4D shows the density distributions of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult class (in Figure 4C) change when they are normalized by the standard deviation of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high. The easy class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.
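In code form, the decision stage described above reduces to a few lines. This is a hedged sketch of Equation 5 and the yes/no rule, with p_active, a, and sigma_h_item supplied by the rest of the model; the function names are mine, not the paper's.

```python
def recognition_strength(p_active: float, a: float, sigma_h_item: float) -> float:
    # Equation 5: proportion of active nodes minus its expectation (a/2),
    # normalized by the item's net-input standard deviation.
    return (p_active - a / 2.0) / sigma_h_item

def respond_yes(strength: float, criterion: float = 0.0) -> bool:
    # A yes response occurs if S >= C; C = 0 is the unbiased criterion.
    return strength >= criterion
```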

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here, it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution that is presented at the end of the paper, together with an analysis of optimal performance.

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.

The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a) plus one parameter for each class of words [the standard deviation of the net input, σh]. The following parameters are used here: The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs to the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs to the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = −0.012, μs(BN) = −0.008, μs(BO) = 0.008, and μs(AO) = 0.012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists

The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize the performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.
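The dispersion prediction can be checked with the same normal-density computation as in the earlier sketch. A minimal illustration, assuming σh = 1.25 for one class and the doubling of study strength described above:

```python
from math import erfc, sqrt

def phi_sf(x):
    # Survival function of the standard normal, P(Z > x).
    return 0.5 * erfc(x / sqrt(2.0))

sigma_h = 1.25                                # assumed net-input spread
for mu_old, T in ((1.0, 0.5), (2.0, 1.0)):    # study time 1 sec vs. 2 sec
    fa = phi_sf(T / sigma_h)                  # new items (false alarms)
    hit = phi_sf((T - mu_old) / sigma_h)      # old items (hits)
    print(mu_old, round(fa, 3), round(hit, 3))
# false alarms fall while hits rise: the distributions disperse
```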

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize the performance. The distributions in Figure 6A are closer than the distributions in Figure 4D. Thus, decreasing the net inputs, for example, by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite directions, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain a near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list.


In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, and positive value yields the mirror effect for any value of the expected net input. For example,

assume that the activation threshold is fixed to 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or for any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

Table 1
General Table of Results From the Experiment

Condition   AN    BN    BO    AO    σs(BN)/σs(AN)   σs(BO)/σs(AO)
Control     .13   .17   .69   .82        0.60            0.86
Frequency   .20   .28   .80   .68        1.01            0.66
Time        .10   .15   .78   .76        0.89            0.81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low (AN) and high (BN) frequencies, the hit rates for high (BO) and low (AO) frequencies, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength, and the vertical axes the density.

List-Length Effect

Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy and difficult class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves

The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths become equal. Figure 6D plots the predictions of the theory when all the standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020. The ratios of these standard deviations must follow Equation 2. This is also the case: σs(BN)/σs(AO) = 0.61 < 0.74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = 0.78 < 0.94 = σs(AN)/σs(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = 0.016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy class words than for difficult class words. This prediction is supported by empirical data: Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN), P(BO, AN) < P(AO, AN), P(BN, AN) > .50, and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN), P(BO, AN) = .83 < .84 = P(AO, AN), P(BN, AN) = .59 > .50, and P(AO, BO) = .57 > .50.

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for the high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage to old high-frequency items when attention is focused on these items) must be larger than the effect from the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage to old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.
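The same machinery as in the earlier sketches reproduces this no-mirror pattern qualitatively. In the illustration below, only μh(BO) is raised to 2, and the threshold is re-optimized to the average of the expected new (0) and mean old ((1 + 2)/2 = 1.5) net inputs, that is, T = 0.75; both rates then come out higher for class B. The σ values are the same assumed ones as before.

```python
from math import erfc, sqrt

def phi_sf(x):
    # Survival function of the standard normal, P(Z > x).
    return 0.5 * erfc(x / sqrt(2.0))

mu = {"AN": 0.0, "BN": 0.0, "AO": 1.0, "BO": 2.0}  # extra study for class B
sigma = {"A": 1.25, "B": 1.56}
T = 0.75                                           # re-optimized threshold
p = {k: phi_sf((T - m) / sigma[k[0]]) for k, m in mu.items()}
print(p)  # both the hit rate and the false alarm rate are larger for B
```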

It is apparent from the density of the net inputs (Figure 7A) that the density of the recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2, rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75, rather than 0.50. The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 6B: P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74].

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the old high-frequency (BO) distribution is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs are not dependent on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = 0.013, σs(BN) = 0.011, σs(BO) = 0.017, and σs(AO) = 0.019. These results are similar to the results when the same amount of attention is paid to the two classes, as in Figure 4D: σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate distribution would be larger for the high-frequency words than for the low-frequency words. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated in an experiment. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the number of presentations was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years, ranging from 18 to 29 years.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4–5 times per million, and the high-frequency words an occurrence of 50–55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether the word had been presented in the list or not. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition testing was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition. Nine subjects were given the control condition followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The hit rate was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate for the low-frequency words was larger. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], and not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1–5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slope of the linear regression curve between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly the slope for the false alarms [σs(BN)/σs(AN)], are shown in the last two columns of Table 1.
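For readers who want to reproduce this analysis, the z-transformation and regression can be sketched as follows. The cumulative rates below are hypothetical placeholders, not the experiment's data, and the probit is computed by bisection to keep the sketch dependency-free.

```python
from math import erfc, sqrt
from statistics import linear_regression  # Python 3.10+

def z(p):
    # Probit (inverse standard normal CDF) by bisection.
    lo, hi = -8.0, 8.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if 0.5 * erfc(-mid / sqrt(2.0)) < p:  # Phi(mid) < p
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical cumulative yes rates at confidence cutoffs 5, 4, 3, 2.
rates_high = [0.45, 0.60, 0.70, 0.78]   # high-frequency (class B) words
rates_low = [0.55, 0.70, 0.78, 0.84]    # low-frequency (class A) words
fit = linear_regression([z(p) for p in rates_high],
                        [z(p) for p in rates_low])
print(round(fit.slope, 2))  # estimates the ratio of the underlying SDs
```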

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear: It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

Σ_(i=1)^8 (Observed_i − Predicted_i)² / σ_i².

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were N = 1000 and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy class words [σh(A)], and the standard deviation of the net inputs for the difficult class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slopes; the attention-likelihood theory accounts for 84% of the variance for the slopes. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slopes, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted with a single variable, namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed to 0.10, and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), to 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities. Figure 8B shows the corresponding results for the slopes. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes, using a single parameter, equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities using one parameter is slightly worse than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for both the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed as the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability is dependent on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

Pc = a ∫_T^∞ (2πσh²)^(−1/2) exp[−(h − μh)²/(2σh²)] dh.   (6)

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), calculates the expected recognition strength (μs):

μs = (Pc − a/2)/σh.

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because it simplifies the analytic solution; however, the variance theory, or the simulation, uses the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and the item layers. The distribution of Pc is binomial but can, given a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/(2N)]^(1/2). The final result is scaled by the normalization factor 1/σh:

σs = (1/σh) [Pc(1 − Pc)/(2N)]^(1/2).   (7)

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

P(Y) = ∫_C^∞ (2πσs²)^(−1/2) exp[−(s − μs)²/(2σs²)] ds.

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

P(A, B) = ∫_C^∞ {2π[σs²(A) + σs²(B)]}^(−1/2) exp(−{s − [μs(A) − μs(B)]}² / {2[σs²(A) + σs²(B)]}) ds.

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
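For readers without the spreadsheet, the analytic chain of Equations 6 and 7 and the yes-probability integral can be transcribed in a few lines. This is a sketch, not the author's code; it uses the class standard deviation σh in place of σh′, as in the analytic approximation above.

```python
from math import erfc, sqrt

def phi_sf(x):
    # Standard normal survival function, used for all the integrals above.
    return 0.5 * erfc(x / sqrt(2.0))

def p_yes(mu_h_new, mu_h_old, sigma_h, a, N, C=0.0):
    T = (mu_h_new + mu_h_old) / 2.0        # threshold between old and new
    out = {}
    for label, mu_h in (("new", mu_h_new), ("old", mu_h_old)):
        Pc = a * phi_sf((T - mu_h) / sigma_h)                   # Equation 6
        mu_s = (Pc - a / 2.0) / sigma_h                         # strength mean
        sigma_s = sqrt(Pc * (1.0 - Pc) / (2.0 * N)) / sigma_h   # Equation 7
        out[label] = phi_sf((C - mu_s) / sigma_s)               # P(yes)
    return out

print(p_yes(0.0, 1.0, sigma_h=1.25, a=0.1, N=50))  # 2N = 100 nodes
```

With σh = 1.25, this returns approximately .20 and .74 for the new and old easy class items, and with σh = 1.56, approximately .25 and .70, matching the values given in the Predictions section.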



Approximations of the Standard Deviation of Recognition Strength

For a particular class of items the variances of the netinputs of old and new items are equal and the varianceof the recognition strength is proportional to the numberof active nodes (s 2

s micro Pc) This approximation suggests avery simple interpretation of the slope of the z-ROC Theratio of variances between new and old items is simplythe ratio between the number of nodes active in the newitems representations and the number of nodes active inthe old items representations

Or alternatively the slope of the z-ROC curve is equalto the square root of the ratio of the number of nodes ac-tive in the new items representations and the number ofnodes active in the old items representations For exam-ple if the slope of the z-ROC curve is 08 the number ofactive nodes in the new items representations divided bythe number of nodes active in the old items representa-tions is 064 (= 0802)

Another approximation useful for understanding themodel is that for two classes of items the number of ac-tive nodes in the new distribution is approximately equaland the number of active nodes in the old distributions isapproximately equal [Pc(AN) raquo Pc(BN) and Pc(BO) raquoPc(AO)] Given these approximations and the approxi-mation above (1 Pc raquo 1) the recognition strength stan-dard deviation is inversely related to the standard devia-tion of the net inputs in the following way The ratiobetween the recognition strength standard deviations ofthe diff icult and the easy distributions is equal to theratio between the standard deviations of the net inputs ofthe easy and the difficult distributions Furthermore theratio between the recognition strength standard devia-tions of the difficult and easy new distributions is equalto the ratio between the recognition strength standard de-viations of the difficult and the easy old distributionsThe exact solution predicts a slightly smaller ratio in theold than in the new distributions

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that the performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of nodes active, setting the activation threshold between the old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network will not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

d′ = [μs(O) − μs(N)]/σs(NO) = [Pc(O) − Pc(N)] / {Pc(NO)[1 − Pc(NO)]/(2N)}^(1/2).   (8)

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strengths, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc was calculated with Equation 6. The expected values of the net inputs and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.
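The threshold sweep behind Figure 9B can be sketched as follows. This is not the paper's analysis: it uses Equation 8 with an assumed unit net-input standard deviation instead of the Appendix equations, so the exact optimum differs from the reported T = 0.81, but the qualitative point survives, namely, that the best threshold falls between the expected new and old net inputs.

```python
from math import erfc, sqrt

def phi_sf(x):
    # Survival function of the standard normal, P(Z > x).
    return 0.5 * erfc(x / sqrt(2.0))

def d_prime(T, a=0.052, mu_old=1.42, sigma_h=1.0, N=30):
    # Equation 8; sigma_h = 1 is an assumption (the Appendix scaling of
    # the net-input variance with p and a is not reproduced here).
    Pc_new = a * phi_sf(T / sigma_h)
    Pc_old = a * phi_sf((T - mu_old) / sigma_h)
    Pc_no = (Pc_new + Pc_old) / 2.0
    return (Pc_old - Pc_new) / sqrt(Pc_no * (1.0 - Pc_no) / (2.0 * N))

Ts = [t / 100.0 for t in range(5, 140)]
print(max(Ts, key=d_prime))  # optimum lies between the two expected means
```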

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a). The results show that d′ is optimal for a = .052; d′ is lower for larger and for smaller a. The lower d′ for large a occurs because the interference from other items increases. For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. For a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium-low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight changes, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved in the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when old correct responses and new correct responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, the optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true: If the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these densities is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
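The claim that hits-minus-false-alarms peaks where the two densities intersect (likelihood ratio one) is easy to check numerically. The sketch below uses arbitrary unequal-variance Gaussians; the parameter values are illustrative, not fitted values from the model.

```python
import numpy as np

mu_n, sd_n, mu_o, sd_o = 0.0, 1.0, 1.5, 1.25   # illustrative new/old distributions

c = np.linspace(-4.0, 6.0, 10001)               # candidate recognition criteria
f_n = np.exp(-0.5 * ((c - mu_n) / sd_n) ** 2) / (sd_n * np.sqrt(2 * np.pi))
f_o = np.exp(-0.5 * ((c - mu_o) / sd_o) ** 2) / (sd_o * np.sqrt(2 * np.pi))

dc = c[1] - c[0]
# P(hit) - P(FA) = F_N(c) - F_O(c); approximate the CDFs by cumulative sums
objective = np.cumsum(f_n) * dc - np.cumsum(f_o) * dc
i = int(np.argmax(objective))
print("best criterion: %.3f" % c[i])
print("likelihood ratio f[S(O)]/f[S(N)] there: %.3f" % (f_o[i] / f_n[i]))  # ~1
```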

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. An optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: Optimal performance occurs when the maximum likelihoods of the two classes are equal,

L(A) = f [S(AO)]/f [S(AN)] = f [S(BO)]/f [S(BN)] = L(B).

It is easy to see that this condition must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is lowered by some amount ΔT(A), the recognition criterion for Class B must be increased by ΔT(B) to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the two false alarm rates must cancel: f [S(AN)]ΔT(A) + f [S(BN)]ΔT(B) = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), so that f [S(AO)]ΔT(A) + f [S(BO)]ΔT(B) < 0. This shows that moving the placement of the criteria away from L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.
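The equal-likelihood solution can be verified with a small numerical sketch. The distributions and the fixed total false alarm rate below are illustrative assumptions, and SciPy is assumed to be available; the point is only that, at the hit-rate maximum, the two likelihood ratios come out approximately equal.

```python
import numpy as np
from scipy.stats import norm

# Old/new distributions for an easy class A and a difficult class B
# (illustrative parameters only).
A_new, A_old = norm(0.0, 1.0), norm(2.0, 1.2)
B_new, B_old = norm(0.0, 1.0), norm(1.0, 1.1)

total_fa = 0.30                        # fixed total (average) false alarm rate
ca = np.linspace(-2.0, 4.0, 4001)      # candidate criteria for class A
fa_b = 2.0 * total_fa - A_new.sf(ca)   # class-B false alarms that keep the total fixed
ok = (fa_b > 0.0) & (fa_b < 1.0)
ca, cb = ca[ok], B_new.isf(fa_b[ok])   # matching criteria for class B

hits = A_old.sf(ca) + B_old.sf(cb)     # total hit rate for each criterion pair
i = int(np.argmax(hits))
print("L(A) =", A_old.pdf(ca[i]) / A_new.pdf(ca[i]))
print("L(B) =", B_old.pdf(cb[i]) / B_new.pdf(cb[i]))   # ~equal at the optimum
```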

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can indirectly be inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently for the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes, minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance, even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56), in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate of Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately 0.05, and it occurs when the false alarm rate is approximately 0.05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around 0.01, and it occurs when the false alarm rate is 0.70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., P_c > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because P_c(1 − P_c)/N decreases for P_c over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.
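A two-line numeric check makes the point concrete. The P_c values below are hypothetical; the binomial standard deviation √(P_c(1 − P_c)/N) shrinks once P_c passes .50, so the old distribution would wrongly come out narrower than the new one.

```python
from math import sqrt

N = 100   # number of nodes contributing to the count (hypothetical)

for pc_new, pc_old in [(0.50, 0.65), (0.55, 0.75)]:
    sd_new = sqrt(pc_new * (1 - pc_new) / N)
    sd_old = sqrt(pc_old * (1 - pc_old) / N)
    print(f"Pc(N)={pc_new:.2f} sd={sd_new:.4f}   Pc(O)={pc_old:.2f} sd={sd_old:.4f}")
# In both cases the old sd is *smaller* than the new sd, contrary to the data.
```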

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σ_s(BN) < σ_s(BO) < σ_s(AN) < σ_s(AO) and σ_s(BN) < σ_s(AN) < σ_s(BO) < σ_s(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σ_s(BO) < σ_s(BN) < σ_s(AN) < σ_s(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. These wrongly activated nodes are more likely to represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σ′_h), so the subject does not need to know the class or the standard deviation of the class (σ_h). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution was presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network. However, see McClelland and Chappell (1998) for a brief discussion of this topic. An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå. (ISBN 91-7191-155-3)

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σ_p(N)] and old [σ_p(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

a σ_p(N) / [σ_p(N) + σ_p(O)].

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the z-ROC curve is 0.8 [i.e., a × 0.8/(1 + 0.8) = 0.44a]. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξ_i = ξ_j = 1.

μ_h(O) = Σ_j Δw_ij ξ_j = Σ_j (ξ_i − a)(ξ_j − a) ξ_j = a(1 − a)²N.  (A1)

The expected value of the net inputs for the new items is zero:

μ_h(N) = 0.  (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let p represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

σ_h²(N) = Σ_μ Σ_j (Δw_ij ξ_j)² = p aN [a(1 − a)]² = p a³(1 − a)²N.  (A3)

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

σ_h²(f) = f [aN(1 − a)]² a(1 − a) = f a³(1 − a)³N².  (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

σ_h²(L) = L a³(1 − a)³N².

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where p patterns have been encoded, is

σ_h²(O) = (f + L) a³(1 − a)³N²/2 + p a³(1 − a)²N.  (A5)

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where P_c(O) and P_c(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and by using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


predicts a mirror effect when subjects are made to focus their attention on the more difficult class. However, increasing the number of hits for the difficult class actually diminishes the false alarm rate. Thus, the response-strategy account is at odds with the experimental data.

Attention-likelihood theory. Another attempt to account for the mirror effect is the attention-likelihood theory (Glanzer & Adams, 1990; Glanzer et al., 1993). This theory says that the difference between BO and AO occurs because subjects pay more attention to items in Class A (i.e., the better-recognized class) than to the items in Class B. However, the four distributions (i.e., AN, BN, BO, and AO) along the decision axis reflecting the mirror effect occur mainly because subjects transform the recognition strength of the items to log-likelihood ratios and use that as the basis for their decision.

More specifically, the attention-likelihood theory consists of the following assumptions: (1) Stimuli consist of N features. The features are either marked or unmarked; a marked feature indicates that it was in the study list. (2) Some proportion, p(new), of features are already marked in a new stimulus, which reflects the noise level, with the rationale that features in a new item (which is not studied during encoding) should be marked because of the random noise in the decision process. (3) Different classes of stimuli (i) evoke different amounts of attention [n(i)]. (4) During study, n(i) features are sampled and marked. The proportion of features sampled for a given stimulus is n(i)/N. Given that, when the noise level is not zero, some proportion of the new items will also be marked, the probability that a feature is marked becomes

p(i, old) = p(new) + [1 − p(new)] n(i)/N.

(5) During recognition, subjects randomly sample a set of n(i) features. Given that the sampling is independent of whether the features were marked at study, the number of marked features is thus binomially distributed, with parameters n(i) and p(i, old) for old stimuli and n(i) and p(new) for new stimuli. (6) During the test, subjects count the number of marked features (x). They note the difference between the sample size and the number of marked features [i.e., n(i) − x]. Yes-no decisions are then based on the log-likelihood ratio given a class [λ(x | i)], and it is defined jointly by x, n(i), p(new), and p(i, old):

λ(x | i) = ln { [p(i, old)^x [1 − p(i, old)]^(n(i)−x)] / [p(new)^x [1 − p(new)]^(n(i)−x)] }.  (1)
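Equation 1 is a standard binomial log-likelihood ratio, and the binomial coefficients cancel between numerator and denominator. A minimal sketch, with hypothetical parameter values:

```python
from math import log

def lam(x, n_i, p_new, p_i_old):
    """Log-likelihood ratio of Equation 1 for x marked features out of n(i);
    only the odds terms remain after the binomial coefficients cancel."""
    return (x * log(p_i_old / p_new)
            + (n_i - x) * log((1 - p_i_old) / (1 - p_new)))

# Hypothetical numbers: n(i) = 20 sampled features, p(new) = .2, and
# attention n(i)/N = .4, so p(i, old) = .2 + (1 - .2) * .4 = .52.
for x in (4, 10, 16):
    print(x, round(lam(x, 20, 0.20, 0.52), 2))   # increases with marked features
```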

Therefore, in contrast to strength theories, which include most of the global memory models, the recognition decision is made along the log-likelihood dimension rather than along the dimension of item strength.

In addition to the mirror effect with respect to the ordering of response distributions of different classes of stimuli, the attention-likelihood theory predicts the following inequalities for the slopes (defined by the ratios between the standard deviations of the response distributions for different stimulus classes) of the z-ROC curves:

s(BN)/s(AO) < s(AN)/s(AO) < s(BN)/s(BO) < s(AN)/s(BO).  (2)

This prediction constrains the standard deviation of the distribution of the difficult class to be smaller than the standard deviation of the distribution of the easy class [i.e., s(BN) < s(AN) and s(BO) < s(AO)]. However, it does not constrain the standard deviation of the new distribution to be smaller than the standard deviation of the old distribution [i.e., s(BO) may be smaller than s(BN)]. There are 24 possible ways (i.e., 4! = 24) to order the four standard deviations. Equation 2 allows 3 out of the 24 orderings, namely, s(BO) < s(BN) < s(AN) < s(AO); s(BN) < s(AN) < s(BO) < s(AO); and s(BN) < s(BO) < s(AN) < s(AO). However, not all 3 orderings are found in the empirical data. As will be presented later, the variance theory proposed in this paper is more constraining than the attention-likelihood theory, and it permits only those 2 of the orderings that are in line with the empirical data. The attention-likelihood theory has recently inspired at least two models that also are based on log-likelihood ratios, namely, the subjective-likelihood account of recognition memory (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). Below, they are presented in turn.

The subjective-likelihood model. The subjective-likelihood model for recognition memory (McClelland & Chappell, 1998) has been applied to the mirror effect with respect to list length, list strength, z-ROC curves, and other recognition phenomena. A major difference between McClelland and Chappell's approach and the attention-likelihood theory is that the subjective-likelihood account uses one detector for each old and each new item. These detectors contrast the probability that a to-be-recognized item is an old item with the probability that it is a new item, to form the odds ratio. The model makes an old decision if at least one of the logarithms of the odds ratios is larger than a certain fixed criterion; otherwise, a new decision is made. Differing from the attention-likelihood theory, the odds ratio in the subjective-likelihood model is calculated on estimates of the probabilities, rather than on the true values, which a recognition system (or subject) might not have access to but which were nonetheless used in Glanzer et al.'s (1993) model.

The usage of a limited number of detectors, one for each item, in the subjective-likelihood theory is a central mechanism used to account for several phenomena. For example, the strength-based mirror effect is accounted for by the assumption that detectors for strong items, when strengthened, are less likely to be activated by new items during recognition, which then lowers the false alarm rate. This mechanism works when the number of targets is reasonably large in relation to the number of potential distractors, which was assumed in McClelland and Chappell's (1998) simulations. However, in reality, the number of targets in a typical recognition test (for example, 50) is negligible in comparison with the number of possible distractor words (the number of words in the subject's vocabulary, e.g., 20,000). Thus, the subjective-likelihood account of the list-strength effect may encounter a problem when the number of detectors associated with each distractor increases to a more realistic number. A similar problem may also be apparent with respect to the list-length effect, which is accounted for by the assumption that the number of detectors is proportional to the list length. Arguably, the subjective-likelihood theory would not account for the list-length effect, given the more plausible assumption that the number of detectors is related to the size of the subject's vocabulary.

REM. REM (Shiffrin & Steyvers, 1997) is another model whose aim is to account for the mirror effect, ROC curves, list strength, and other phenomena in recognition and recall. Similar to the subjective-likelihood theory, REM is based on likelihood ratios and uses noisy vector-based representations in memory. Although REM also uses likelihood ratios, it differs in the sense that it uses true probabilities in calculating the likelihood ratios, whereas the subjective-likelihood theory uses estimates. Furthermore, in REM, the value assigned to the model's representation on a particular feature dimension is determined when the dimension is sampled the first time. In the subjective-likelihood theory, the representations of the items are refined successively over presentations.

In REM, several factors are combined to produce the frequency-based mirror effect: (1) The likelihood ratios are assumed to be smaller for old high-frequency words, because high-frequency features are less diagnostic. (2) This factor is larger than the compensating factor that high-frequency words have slightly more matching feature values, because errors in storage tend to produce common values, increasing the probability of accidentally matching a high-frequency feature value. (3) New high-frequency words have a slight increase in the number of chance matches, which increases the likelihood ratio.

Limitations of current theories. Although these theories account for several aspects of the data regarding the mirror effect, they have been subjected to a few general criticisms. Perhaps the most obvious problem with these models is that they predict that strengthening of a class yields a mirror effect. Although this prediction is supported by data in studies in which the strength manipulations were applied between conditions (Glanzer et al., 1993), it is certainly inconsistent with the data when the strength manipulations were applied within conditions (Stretch & Wixted, 1998a; see also Stretch & Wixted, 1998b).

In addition, there are a few other, more specific criticisms of these theories. Here are some problems regarding the attention-likelihood theory. First, calculating the log-likelihood ratios requires knowledge of the class (λ depends on i). Thus, it is necessary to classify the to-be-recognized stimuli into distinct categories (i), or at least to estimate the stimuli along some dimension, such as frequency. Glanzer et al. (1993) noted that the attention-likelihood theory predicts the mirror effect even though the estimates of p(i, old) are not accurate, for example, when this probability is set to the average of the two classes. Thus, the estimates of p(i, old) are not critical to the predictions. However, it is necessary to estimate the number of features sampled at recognition [n(i)] in Equation 1 to make the correct predictions, and this process requires knowledge of the class of the stimuli. Second, knowledge of several variables is required. Depending on the classification, the attention paid to this category [n(i)] must be estimated. The probability that a feature in a new item is marked must also be estimated. Third, the attention-likelihood theory involves several steps of complex calculation. Although this may not be a reason for dismissing the theory (see Murdock, 1998, for a discussion of this topic), it would be preferable to have a simpler account. Given that the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997), like the attention-likelihood theory, are also based at their cores on variations in log-likelihood ratios, these criticisms would also apply to them.

Research Goal and Organization of the Paper

Given the limitations of current theories, the purpose of this paper is to propose a new account of the mirror effect that can avoid most of these criticisms. The theory is proposed specifically for the frequency-based mirror effect, but it also accounts for the strength-based mirror effect within a list, the strength-based mirror effect between lists, z-ROC curves associated with the mirror effect, and the list-length effect. The paper is organized as follows. First, a brief overview of the theory is presented, which is then followed by an in-depth presentation. Second, the theory is implemented in a connectionist model of memory previously developed by the author (i.e., TECO, the target, event, cue, and object theory; Sikström, 1996a, 1996b). Third, mechanisms of the theory responsible for capturing the mirror effect are presented in detail. Fourth, a section presents the theory's applications with respect to the various classical empirical regularities, such as between-list strengthening, the list-strength effect, z-ROC curves, and response-criterion shifting. Fifth, predictions regarding a recently discovered lack of the mirror effect in the context of within-list strength manipulation are presented (e.g., Stretch & Wixted, 1998a, 1998b), and an experiment is carried out to evaluate the prediction. For readers who are interested in the analytic solutions of the theory, mathematical derivations of these solutions are presented in the sixth section, and an analysis of the model's optimal performance is conducted. Finally, implications for distributed models of memory and the relations between the variance theory and previous theories of the mirror effect are discussed.

THE VARIANCE THEORY OF THE MIRROR EFFECT

Overview of the Theory

In a nutshell, the variance theory proposed here is similar to previous models of the mirror effect in the sense that it also uses vector-feature representations of the items and estimates (via simulations) the response probabilities of old (target) and new (lure) items during a recognition test. However, the variance theory is driven by different conceptual and technical considerations. At the conceptual level, the variance theory sets out to capture the mirror effect mainly in terms of the relations between the study materials and the natural preexperimental context associations the items may have. This is conceptually quite different from all previous theories, which seek to explain mirror effects primarily in terms of the individual's decision process. Rather, the approach taken here considers the context in which the individual recognition decision processes take place. The natural frequencies of events occurring in the individual's day-to-day contexts may be reflected in recognition decision processes, and the individuals may or may not know (or be consciously aware of) these processes. At the technical level, the variance theory also differs from previous theories in a significant way. Instead of directly computing ratios between probabilities, a new way of computing recognition strength is proposed: normalizing the difference between the response probabilities for the target and the lure items with the standard deviations of the underlying response distributions.

Specifically, in dealing with the frequency-based mirror effect, a rather plausible key assumption of the variance theory is that high-frequency words are assumed to have appeared in more naturally occurring preexperimental contexts than have the low-frequency words. This assumption is implemented in connectionist network simulations in a rather straightforward way, by associating the simulated high-frequency items with more contexts than the low-frequency items during a simulated preexperimental phase. In implementing the theory, items and contexts are represented by two separate arrays (vectors) of binary features, with each feature being represented by a node (or element of the vector), as is shown in Figure 2. A particular item, such as CAR, activates some features at the item layer. A particular context, such as REPAIR SHOP, activates some features at the context layer. Different items and contexts may or may not activate some of the same features. The item and context features are fully interconnected with weights. When an item appears in a particular context and certain features are activated, the weights that reciprocally connect the item features to the context features are adaptively changed according to a specific learning rule, described later. Note that, in the implementation, two types of contexts, namely, the preexperimental contexts and the study context (representing internally generated time or list information associated with an item during the actual experimental episode), are represented in the network using one common context layer. But these two types of context information are differentiated by two simulation phases, namely, the preexperimental phase and the actual encoding and testing phase. As will be mathematically outlined later, the standard deviation of the input at the context layer increases when an item is associated with several contexts. Therefore, high-frequency items (associated with more preexperimental contexts) will have larger standard deviations than will low-frequency items in their activation patterns, which are subsequently propagated to the item layer. However, the expected value of the input is equal for high- and low-frequency items.
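This core mechanism, frequency adding variance but not signal, can be demonstrated with a small simulation. The sketch below is illustrative rather than the paper's implementation: it assumes a Hebbian rule of the form (ξ_i − a)(ξ_j − a), consistent with the Appendix as reconstructed here, and tracks the net input to one node of the studied context as the item's frequency f grows.

```python
import numpy as np

rng = np.random.default_rng(1)
N, a, trials = 400, 0.1, 1000   # nodes per layer, fraction active (assumed values)

def studied_context_net_input(f):
    """Mean and sd of the net input to an active node of the *studied* context
    after the item has been associated with f contexts in total."""
    h = np.empty(trials)
    for t in range(trials):
        item = (rng.random(N) < a).astype(float)
        ctxs = (rng.random((f, N)) < a).astype(float)
        ctxs[0, 0] = 1.0                      # node 0 is active in the studied context
        w_to_node0 = (ctxs[:, 0] - a).sum()   # weights accumulated over the f encodings
        h[t] = w_to_node0 * ((item - a) * item).sum()
    return h.mean(), h.std()

for f in (1, 5, 25):
    m, s = studied_context_net_input(f)
    print(f"f = {f:2d}: mean net input = {m:6.1f}, sd = {s:6.1f}")
# The mean stays put while the sd grows with f: frequency adds variance, not signal.
```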

Figure 2. The variance theory. The upper four circles represent nodes in the item layer. The lower four circles represent nodes in the context layer. The arrows between the item and the context layers represent connections.

During the recognition test, an item vector is presented to reinstate the activation of the item features. The features of the study context are reinstated by presenting the network with the context pattern encoded during the study-encoding phase (but not with other, preexperimental contexts that the network was trained with during the preexperimental phase). The degree of recognition strength is determined by first calculating the net inputs to the context and the item nodes. The net input is the signal a node receives from other active nodes connecting to it, and the strength of the signal determines whether the node will be activated or not at retrieval. The net input of a given item node is simply computed as the sum of all weights connected to active nodes. Similarly, the net input of a given context node is simply the sum of all weights connecting active nodes to that particular context node. The net inputs then denote the retrieved state of activation patterns at the item and context layers. The subset of item and context nodes that have activation levels exceeding a particular activation threshold at retrieval, and that were also active during encoding, are then used to calculate the recognition strength. Those nodes whose activation does not exceed the threshold, or that were inactive during encoding, have no influence on recognition strength. For example, assume that the activation threshold is set to 0.5, so that any node (item or context) that was active during encoding and whose retrieved activation during testing exceeded the value of 0.5 would contribute to recognition strength. Imagine that four nodes, out of a total of eight, exceed the threshold and are equal to 0.75, 1.00, 1.25, and 1.50. The recognition strength of the item is the percentage of above-threshold nodes (50%) minus the expected percentage of above-threshold nodes (e.g., 25%), divided by the standard deviation of the actually observed above-threshold nodes (i.e., by the standard deviation of 0.75, 1.00, 1.25, and 1.50).

Why is recognition strength determined by this rule, as opposed to, say, just the percentage of above-threshold nodes? As will be shown later, this way of measuring recognition strength (subtracting the expected value and dividing by the standard deviation of the net input) allows the model to perform optimally in terms of discriminability between new and old items when the response bias to the yes response is manipulated. And in this case, the model accounts for why the standard deviation of the response distribution of the easy class is larger than that of the difficult class. It is plausible to assume that humans have evolved to generally respond more or less optimally and that this is reflected in their performance, as well as in the implementation of the variance theory. Similarly, the activation threshold is set in the model to the value that leads to the highest d′ (i.e., to optimal performance), which occurs when the activation threshold is between the new and the old distributions. This optimal tuning of the model allows it to account for some rather dramatic results, such as concentering, showing that target and lure distributions from different item classes converge on a strength-of-evidence axis as memory conditions worsen.

Here is a brief example of how the model performs. Consider hypothetical levels of activation generated by high-frequency and low-frequency old (target) and new (lure) items. Because high-frequency words have appeared in a larger number of contexts, they have a larger variance of net input. As such, targets and lures will be relatively more confusable and will generate percentages of activated nodes that are difficult to discriminate. Assume that the standard deviation of the net input is .10 and the relevant proportions are .25 for high-frequency targets and .15 for high-frequency lures. In contrast, low-frequency words will have occurred in fewer contexts and will be less confusable. Assume that the standard deviation of net inputs is less for low-frequency words, for example, .05, and that the percentage of active nodes is .30 for low-frequency targets and .10 for low-frequency lures. Given these values, what are the recognition strengths for high-frequency and low-frequency targets and lures? If the expected proportion of above-threshold nodes is .20, they are

Low-frequency lures: (.10 − .20)/.05 = −2.0
High-frequency lures: (.15 − .20)/.10 = −0.5
High-frequency targets: (.25 − .20)/.10 = 0.5
Low-frequency targets: (.30 − .20)/.05 = 2.0
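These computations are easy to verify. The following minimal Python sketch (the function name is mine; the values are the hypothetical ones assumed above) reproduces the four strengths:

def strength(p_active, sd_net, p_expected=.20):
    # (observed proportion - expected proportion) / SD of the item's net input
    return (p_active - p_expected) / sd_net

print(strength(.10, .05))   # low-frequency lure:    -2.0
print(strength(.15, .10))   # high-frequency lure:   -0.5
print(strength(.25, .10))   # high-frequency target:  0.5
print(strength(.30, .05))   # low-frequency target:   2.0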

The model accounts for the various aspects of memory phenomena by postulating a connectionist neural network model with an implementation and parameter settings that allow it to perform at optimal or near-optimal levels. When the model is optimized, it behaves similarly to how subjects behave, and when it is not optimized, it does not fit the experimental data. This is true not only for the standard mirror effect, but also for exceptions, such as the absence of a mirror effect for within-list strength manipulations (something all other competing formal models fail to do). Furthermore, it predicts key features of the ROCs for new and old items, as well as for high- and low-frequency items (something any viable formal model must do).

Presentation of the Variance Theory

In this section, the details of the variance theory are presented. As will become clearer as the theory is unfolded, the theory is analytical, and the analytical solutions are self-contained solvable (readers who are interested can find the analytical solutions in the sixth section). Although the theory itself does not require a particular computational framework, it can be more easily explained and directly implemented by using a connectionist network. Therefore, the presentation of the theory below is couched within the framework of a Hopfield neural network (Hopfield, 1982, 1984), in order to explicate the theory's underlying mechanisms that generate the mirror effect.

Network architecture. The variance theory may be implemented as a two-layer recurrent distributed neural network (see Figure 2). There are two layers in the representation: one is the item layer, and the other the context layer. Both layers consist of N number of nodes (i.e., N features), although it could also be implemented with an unequal number of context and item nodes. Thus, the total number of nodes in the network is 2N. The item and the context layers are fully connected, so that all the nodes in one layer are connected through weights to all the nodes in the other layer. Nodes within a layer are not connected (i.e., no lateral connections).

Stimulus and context representation. Contexts and items are represented as binary activation patterns across the nodes in the context and item layers, respectively. A node is active (or activated) if its corresponding feature is present (i.e., of value 1), and a node is inactive (or not activated) if its corresponding feature is absent (i.e., of value 0). There are several reasons for choosing binary representations. For instance, binary representation serves to deblur noisy information at retrieval (Hopfield, 1984). Binary representation also allows for a low proportion of active nodes (sparse representation), which is known to improve performance (Okada, 1996). It also introduces nonlinearity, which is necessary for solving some problems in multilayer networks (Rumelhart, Hinton, & Williams, 1986), and it is perhaps the simplest representation. Furthermore, in the present model, it is shown that it is necessary for capturing characteristics of the z-ROC curves that are associated with the mirror effect.

More specifically, the state of activation for the ith node in the item layer at encoding is denoted $x_i^d$, where the superscript d denotes the item layer. The state of activation for the jth node in the context layer at encoding is denoted $x_j^c$, where the superscript c denotes the context layer. Context patterns and item patterns are generated by randomly setting nodes to an active state (i.e., with values of 1) and otherwise to an inactive state (i.e., with values of 0). Let a be a parameter that determines the expected probability that a node is active at encoding. This parameter does not change during the simulation and is assumed to be relatively small (for purposes of sparse representation). Note that a is the expected probability of active nodes, whereas the real percentage of active nodes for specific items or contexts varies around a.
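As a concrete illustration, patterns of this kind can be generated as below (a minimal Python/NumPy sketch; the function name and the particular parameter values are illustrative assumptions, not taken from the original simulations):

import numpy as np

rng = np.random.default_rng(0)

def make_pattern(n, a):
    # Each of the n nodes is set active (1) with probability a, else inactive (0).
    return (rng.random(n) < a).astype(int)

N, a = 30, .2                      # layer size and expected activity level
item = make_pattern(N, a)          # item-layer pattern (e.g., CAR)
context = make_pattern(N, a)       # context-layer pattern (e.g., REPAIR SHOP)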

The encoding-study phase. Encoding occurs by changing the connection weights between the item and the context layers. The weights (or the strengths of the connections) contain information of what has been stored in the network. The weight between item node i and context node j is denoted as $w_{ij}$, and it is initialized to zero. The weight change ($\Delta w_{ij}$) is calculated by the learning rule suggested by Hopfield (1982; see also Hertz, Krogh, & Palmer, 1991, for additional details). This is essentially a Hebbian learning rule that increases connection weights between simultaneously activated nodes. This rule is chosen here because it is more biologically plausible than other rules, such as the delta or the gradient-descent learning rules (e.g., Rumelhart et al., 1986) used in back-propagation networks. However, the variance theory can also be implemented with other learning rules.

According to the Hopfield (1982) learning rule, the weight change is computed as the outer product between the item and the context vectors of activation patterns, with the parameter a first subtracted from both vectors. This subtraction is mathematically necessary to keep the expected value of the weights at zero. Using the notions for item and context activation defined above, the weight change between these two elements of the item and context vectors can be written as

$$\Delta w_{ij} = (x_i^d - a)(x_j^c - a). \qquad (3)$$
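In code, Equation 3 is a single outer product accumulated into the weight matrix. A sketch, reusing the pattern generator above:

def encode(w, item, context, a):
    # Equation 3: delta w_ij = (x_i^d - a)(x_j^c - a), applied to all i, j at once.
    return w + np.outer(item - a, context - a)

w = np.zeros((N, N))               # weights are initialized to zero
w = encode(w, item, context, a)    # one pairing of an item with a context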

Word frequency as the number of associated contexts at the preexperimental phase. An item may be encoded more or less frequently and hence be associated with more or less different preexperimental contexts, depending on how often the item occurs in the subject's environment. In the model, at the preexperimental stage of the simulation, an item's frequency is simulated by the number of times the item is encoded in different contexts. A low-frequency item is encoded less often and is associated with a smaller number of contexts, whereas a high-frequency item is encoded more often and is associated with a larger number of different contexts. For instance, one might form a preexperimental association between SAAB and REPAIR SHOP after experiencing the rare event of a new expensive SAAB sports car breaking down halfway through a honeymoon trip to the Grand Canyon, with the SAAB having to be towed to a repair shop somewhere out in the desert. In implementing the variance theory, the relationship between word frequency and preexperimental item–context associations can be simulated straightforwardly. At the preexperimental stage of the simulation, a low-frequency item, SAAB, may be simulated by associating one context item, REPAIR SHOP, with it during encoding. A high-frequency word, CAR, may be simulated by associating three different contexts, REPAIR SHOP, TAXI RIDE, and DRIVING TO WORK, with it during encoding.
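Under these assumptions, the preexperimental phase reduces to a simple loop. A sketch, continuing the code above, in which a hypothetical high-frequency item is paired with three random contexts and a low-frequency item with one:

car = make_pattern(N, a)           # high-frequency item
saab = make_pattern(N, a)          # low-frequency item

for _ in range(3):                 # CAR: three preexperimental contexts
    w = encode(w, car, make_pattern(N, a), a)
w = encode(w, saab, make_pattern(N, a), a)   # SAAB: one preexperimental context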

The recognition test phase. At recognition, an item is presented to the network: the representation of this item is reinstated as a cue to the item layer, and the representation of the study context (simulating an internally generated context regarding list or time information during the recognition experiment) is reinstated as a cue to the context layer. For example, the representation of the word CAR is reinstated at the item layer. Furthermore, the representation of the study context, STUDY LIST, is reinstated at the context layer. The subjects must have this information (or cue) in order to recognize an item from the particular study context (and not recognize the item from all the other preexperimental contexts). In the actual experimental setting, this information is usually conveyed to the subjects by the explicit instruction to recognize from the study context (e.g., "Do you recognize the word CAR from the list you just studied?").

At recognition, each node receives a signal (called the net input), which is computed on the basis of other active nodes connecting to it. Item nodes receive net inputs from active context nodes, and context nodes receive net inputs from active item nodes. The net input to a given node is simply the sum of the weights of all other active nodes connected to that node. For example, if item node 1 is connected to four context nodes (1, 2, 3, and 4), where context nodes 1 and 3 are active, the net input to item node 1 is w11 + w13. Thus, active nodes "send" information to nodes that they are connected to, whereas inactive nodes do not send any information. Put another way, nodes "receive" information, or input, from active nodes that they are connected to, but not from the inactive nodes.

Specifically, the net input to item node i is calculated by first multiplying the activity of the context node (labeled j) connected to this node by the weight of the connections between the nodes i and j and then summing over all connected nodes. In vector formalization, the weight matrix operates on the activation vector, and the output is the net input vector. The net input to node i ($h_i^d$) at the item layer depends on the states of activation of nodes in the context layer and the weights connecting these two layers:

$$h_i^d = \sum_{j=1}^{N} w_{ij}\, x_j^c. \qquad (4)$$

Following the same principle, a similar function is used to calculate the net inputs to the context nodes. Specifically, the net inputs to context nodes (labeled j) depend on all the states of activation of nodes in the item layer and the weights connecting the two layers:

$$h_j^c = \sum_{i=1}^{N} w_{ji}\, x_i^d.$$

By inserting Equation 3 into Equation 4 and summing over the p number of encoded patterns during list learning, it is easy to see that the net input to a node is simply the sum of Np number of terms; for example, the net input to an item node is

$$h_i^d = \sum^{p} \sum_{j=1}^{N} (x_i^d - a)(x_j^c - a)\, x_j^c.$$


For a = .5, the net inputs are binomially distributed with a certain expected value. Given a certain criterion [i.e., Np(1 − a)a > 10], a binomial distribution can be approximated with a normal distribution (Feller, 1968). For a ≠ .5, there are actually four outcomes; however, the same normal approximation can be used. Thus, for reasonably large parameter values of Npa, the distribution of net inputs to the nodes can be approximated by a normal distribution.

If the to-be-recognized item has not been encoded with the context (i.e., a new item), the net input is simply the sum of random weights. Because the expected values of all weights are zero, the expected value of the net inputs for new items will also be zero. If the item has been encoded with the context (i.e., an old item), the net input is the sum of those weights connected to that node whose respective context nodes were active at encoding. Owing to the adaptive weight changes during encoding, these weights will have an expected value that is larger than zero if both nodes were in the active state during encoding [i.e., each weight change at encoding is computed as (1 − a)²] and less than zero if one node was inactive and the other node was active at encoding [i.e., each weight change at encoding was −a(1 − a)]. Of specific importance for the theory is that the variance of the net inputs to the context nodes (from the item nodes) increases with the number of contexts that are associated with the item. Therefore, the variance of the net input is larger for high- than for low-frequency items. Similarly, the variance of the net input to the item nodes (from the context nodes) increases with the number of items associated with one context (i.e., list length). Therefore, given that the context is constant during a list presentation, the variance of the net inputs is larger for a long than for a short list.

Brief summary of optimal performance. Given the strong selection pressure, arguably, humans and animals have evolved to achieve good memory performance. Therefore, it is reasonable to assume that mechanisms for recognition decision have evolved to an optimal or near-optimal performance. Following this assumption, the parameter values in the model and the implementation of the model are guided by the idea that the model should perform optimally. A detailed discussion about the issue of optimal performance, with exact derivations of what is optimal performance in the context of the present model, is presented later, in the sixth section. Here, I give a brief summary explaining the results from the analysis of optimal performance without going into the mathematical details (see Figures 9A, 9B, 9C, and 9D).

The model's performance is optimal if the percentage of nodes active at encoding (a) is low (see Figure 9A). For a low a, it is optimal to base the recognition decision on nodes that were active at encoding and to ignore nodes that were inactive during encoding (see Figure 9A). Also, for a low a, it is optimal to place the activation threshold of the nodes between the expected values of the new and the old net inputs (see Figure 9B). Finally, it is optimal to normalize the recognition strength with the standard deviation of the net input (see Figures 9C and 9D).

For a low percentage of active nodes, it is optimal to base the recognition decision on nodes that were active at encoding (or nodes active in the cue pattern) and to ignore nodes that were inactive at encoding. At recognition, the state of activation of a node may be either active or inactive. Therefore, the nodes that are active in the cue pattern and have a net input above a certain activation threshold are activated at recognition; otherwise, the nodes are inactivated. Let $z_i^d$ denote the state of activation at recognition for item node i. An item node is activated at recognition ($z_i^d = 1$) if it was active in the cue pattern ($x_i^d = 1$) and the net input is above the activation threshold ($h_i^d > T$); otherwise, it is inactivated ($z_i^d = 0$):

$$z_i^d = 1 \text{ if } x_i^d = 1 \text{ and } h_i^d > T; \text{ otherwise } z_i^d = 0.$$

Similarly, let $z_j^c$ denote the state of activation at recognition for context node j:

$$z_j^c = 1 \text{ if } x_j^c = 1 \text{ and } h_j^c > T; \text{ otherwise } z_j^c = 0.$$
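The activation rule for retrieval is thus a simple conjunction, sketched below: a node must both be active in the cue pattern and have a net input above the threshold T.

def retrieval_state(cue, h, T):
    # z = 1 only where the cue is active and the net input exceeds T.
    return ((cue == 1) & (h > T)).astype(int)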

This way of activating nodes at retrieval differs from how nodes are activated in a standard Hopfield (1982) network, where the activation threshold is zero and a node is activated if the net input is above zero (independently of the state of activation in the cue pattern). The way of activating patterns in a Hopfield network is more likely to produce a retrieved pattern that matches the encoded pattern of activation (e.g., the expected value of active nodes at retrieval will be the same as the expected value of active nodes at encoding). However, as will be discussed later, the way suggested to activate the nodes here yields better performance in terms of discrimination between a target item and a distractor item.

As is shown in Figure 9B, performance is optimal when the activation threshold is set approximately between the new and the old net inputs. The activation threshold (T) is set to the expected values of the net inputs of nodes active during encoding ($x_i^d = 1$, $x_j^c = 1$) for old and new items. The averaging is computed over all nodes (2N) and over all new and old patterns (p) in the recognition condition. If half of the items are new and half of the items are old, the activation threshold is

$$T = \frac{1}{2apN} \sum^{p} \left[ \sum_{i=1}^{N} x_i^d h_i^d + \sum_{j=1}^{N} x_j^c h_j^c \right].$$

As was discussed above, the expected net input of new (lure) items is zero. Therefore, the activation threshold is simply half the expected net input for nodes encoded in the active state [T = μ_h(O)/2, where μ_h(O) is the expected value of the net input to nodes encoded in the active state].

It is easy to see that the expected percentage of old and new active nodes at recognition is one half of the percentage of active nodes at encoding (a/2). That is, the activation threshold divides the old and new distribution in the middle. Old items will have a higher expected percentage active, and new items will have a lower expected percentage active. The activation threshold is constant during recognition in one condition. However, it must vary between conditions, depending on the net inputs, to yield optimal performance.

The percentage of nodes active at recognition is counted:

$$P_c = \frac{1}{2N} \left( \sum_{i=1}^{N} z_i^d + \sum_{j=1}^{N} z_j^c \right).$$

As is shown in Figure 9C, the performance is near-optimal if the recognition decision is based on the number of nodes active at recognition, normalized by the standard deviation of the net inputs across the active features of this item (σ_h′). Thus, this standard deviation is calculated over all the nodes active at encoding (i.e., $x_i^d = 1$ and $x_j^c = 1$; nodes inactive at encoding are not used when calculating the standard deviation, because for low levels of a these nodes carry little to no information of the item). The recognition strength (S) for an item is calculated by subtracting the expected percentage of nodes active at recognition (a/2)¹ from the real percentage of nodes active at recognition (P_c) and dividing by the standard deviation of the net inputs of the item (σ_h′):

$$S = \frac{P_c - a/2}{\sigma_{h'}}. \qquad (5)$$

The subtraction of the expected percentage of nodes active at recognition makes the expected value of the recognition strength (S) zero. This subtraction is necessary for the normalization to work properly. The subtraction moves the recognition strength distributions symmetrically, so that the old and the new distributions move at the same rate for a given standard deviation of the net input (without the subtraction, the old recognition strength distribution would be more affected than the new distribution). Thus, the recognition strength is determined by the difference between two probabilities (the percentage of active nodes, which varies, and the expected percentage of active nodes, which is constant), divided by the standard deviation of the net input. A yes response (Y) is given if the recognition strength (S) is above the recognition criterion (C). An unbiased decision has a recognition criterion of zero:

$$Y = S > C.$$
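Putting these last steps together, Equation 5 and the decision rule can be sketched as follows, continuing the functions defined in the earlier sketches (the variable names are mine; σ_h′ is estimated from the net inputs of the cue-active nodes, per the text):

def recognize(item_cue, context_cue, h_item, h_context, T, a, criterion=0.0):
    z_item = retrieval_state(item_cue, h_item, T)
    z_context = retrieval_state(context_cue, h_context, T)
    p_c = (z_item.sum() + z_context.sum()) / (len(z_item) + len(z_context))
    # SD of the net inputs over the nodes that were active at encoding (the cues).
    h_active = np.concatenate([h_item[item_cue == 1], h_context[context_cue == 1]])
    s = (p_c - a / 2) / h_active.std(ddof=1)    # Equation 5
    return s, s > criterion                     # strength and yes/no response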

An issue that may be raised is whether it is sensible to base recognition strength on two quite different sources, namely, the percentage of active nodes and the variability of the net input. The immediate answer is that if it is reasonable to optimize performance, it is also sensible to measure recognition strength this way. Another perspective is to note that unbiased responses can be made only on the percentage of active nodes; that is, a yes response occurs if the percentage of active nodes is larger than the expected percentage of active nodes (P_c > a/2), and the variability of the net input can be ignored. Thus, "normally," subjects base their unbiased decisions on the percentage of active nodes, and the variability of active nodes only becomes relevant when subjects are biased. From this perspective, the percentage of active nodes is used for unbiased responses, and the variability of the net input becomes relevant for confidence judgments. Therefore, by combining both the percentage of active nodes and the variability of the net input, the measure of recognition strength proposed here will also reflect the confidence judgment.

An Example With Step-by-Step Computations

To clarify the computational details involved in the variance theory, a step-by-step example is given here. For tractability, a small network is used, consisting of four item nodes and four context nodes (see Figure 2). The actual simulations, to be reported later, involved a larger network architecture, with 30 nodes at each layer. The percentage of nodes active at encoding (denoted by parameter a) is set to .50. Let item BN be represented as {1100}, written as the state of activation of the four nodes $x_1^d, x_2^d, x_3^d, x_4^d$. Similarly, let {0011} represent item BO, {1010} represent item AO, and {0101} represent item AN. Let context CBN be represented as {1100}, or the state of activation of the four nodes $x_1^c, x_2^c, x_3^c, x_4^c$. Similarly, {0011} represents context CBO, and {0101} represents the experimental study context CExp.

Item BN is a high-frequency new word. For simplicity, it is here encoded only once, with context CBN, in the preexperimental phase (in the simulations below, high-frequency words are preexperimentally associated with three contexts). The 16 weights between the four item nodes and the four context nodes are changed according to the learning rule, where the probability that a node is active at encoding is determined by the parameter a = .5. For example, the weight change between item node 1 and context node 1 is $w_{11} = [x_1^d(BN) - a][x_1^c(CBN) - a] = (1 - .5)(1 - .5) = .25$, where BN is item BN and CBN represents context CBN. Similarly, item BO is another high-frequency word that, before the experimental phase, is encoded once with context CBO. Items AO and AN are low-frequency old and new words, and they are not encoded at the preexperimental phase.

In the experimental phase, item AO is encoded with the experimental context CExp. Finally, item BO is encoded with the same experimental context CExp. For example, the weight $w_{11}$ is now equal to

$$[x_1^d(BN) - a][x_1^c(CBN) - a] + [x_1^d(BO) - a][x_1^c(CBO) - a] + [x_1^d(BO) - a][x_1^c(CExp) - a] + [x_1^d(AO) - a][x_1^c(CExp) - a]$$
$$= (1 - .5)(1 - .5) + (0 - .5)(0 - .5) + (0 - .5)(0 - .5) + (1 - .5)(0 - .5) = .25 + .25 + .25 - .25 = .5.$$

After encoding, the full weight matrix is (.5, 1, −1, −.5, .5, 0, 0, −.5, −.5, 0, 0, .5, −.5, −1, 1, .5), corresponding to the weights $w_{11}, w_{12}, \ldots, w_{44}$, respectively.


At recognition, the state of activation of the old low-frequency item AO is reinstated as a cue to the item layer, and the state of activation of the encoding context at the experimental phase, CExp, is reinstated as a cue to the context layer. The net inputs are calculated. For example, the net input to context node 1 is

$$h_1^c = \sum_{i=1}^{4} w_{1i}\, x_i^d = .5 \cdot 1 + 1 \cdot 0 - 1 \cdot 1 - .5 \cdot 0 = -.5.$$

The net input to the item nodes is (1, 0, 2, 1), and that to the context nodes is (−.5, .5, −.5, .5). It can be seen that the expected net input for randomly created new items in this network is 0 and that the expected net input for old items encoded in the active state is .5. Therefore, the activation threshold is set to the average of these values, namely, T = .25. Nodes that have a net input above the activation threshold and that are active in the cue pattern are activated at recognition. Thus, the state of activation at recognition for the item nodes is {1010}, and that for the context nodes is {0101} (which is identical to the cue patterns). The percentage of active nodes is counted: P_c(AO) = 4/8 = .5. For an unbiased response (C = 0), this will yield a yes response, because this percentage is larger than the expected percentage of active nodes at recognition (a/2 = .25). The standard deviation of the net input for nodes active at encoding is 0.71 (i.e., the standard deviation of 1, 2, .5, and .5, corresponding to nodes 1, 3, 6, and 8). The recognition strength is calculated by subtracting the expected percentage of active nodes at recognition from the percentage of active nodes for the to-be-recognized item and dividing by the standard deviation of the net input for active nodes [S = (P_c − a/2)/σ_h′ = (.5 − .25)/.71 = .35].

The recognition of the three items BO, AN, and BN is done in the same way. The results for the four items AN, BN, BO, and AO are: the net inputs (1, 0, 2, 1, .5, −.5, .5, −.5), (1, 0, 2, 1, 1.5, .5, −.5, −1.5), (1, 0, 2, 1, −1.5, −.5, .5, 1.5), and (1, 0, 2, 1, −.5, .5, −.5, .5), where the first four numbers represent item nodes and the last four context nodes; the states of activation at recognition (0, 0, 0, 1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 1, 0, 0), (0, 0, 1, 1, 0, 0, 0, 1), and (1, 0, 1, 0, 0, 1, 0, 1); the numbers of nodes active 1, 2, 3, and 4; the standard deviations of the net inputs 0.71, 1.08, 1.08, and 0.71; the recognition strengths −.17, .00, .11, and .35; and the unbiased responses no, no, yes, and yes, respectively (an unbiased yes response is made when S > C, where C = 0 for unbiased responses). Thus, the subject responds correctly for all items, the recognition strengths are ordered according to the mirror effect, and the variance of the net input is larger for the high- than for the low-frequency words.

SIMULATIONS OF THE FREQUENCY-BASED MIRROR EFFECT

In this section, the variance theory of the mirror effect is simulated in a connectionist framework that is consistent with a general connectionist theory of memory called TECO (Sikström, 1996a, 1996b). TECO has been used to account for a variety of memory phenomena involving recognition and recall, for example, successive testing of recall and recognition (Sikström, 1996b) and forgetting curves (Sikström, 1999). An extensive description of TECO is beyond the scope of this paper. Interested readers are referred to previous articles describing the model for details.

Procedure

The simulation started with initializing the weights to zero. Then, 12 items were generated by randomly setting the nodes to an active state with a probability of a. A preexperimental phase then followed, to generate the frequency associated with the items. In this preexperimental phase, half of the items (i.e., 6) were encoded three times, each time in a different context. These items simulated the high-frequency words. The remaining 6 items were not encoded before the experimental phase, and they simulated the low-frequency words.

At the experimental phase, one study-encoding context was generated, using the same procedure as the generation of the contexts in the preexperimental phase. Given that in a standard recognition experiment all studied items would be encoded in the same list, in the simulations the items were thus encoded once with the same study context. Three of the high-frequency items were encoded, and three of the low-frequency items were encoded. The other three high- and three low-frequency items were not encoded during the experimental phase, and they simulated the new items. Each encoding was simulated by first activating the nodes in the item and context layers. The weights were then changed according to the learning rule (Equation 3).

At the recognition test, the patterns of activation of the 12 items and the study context were reinstated to the network. The net inputs were calculated for each node, and the recognition strength was calculated from all the nodes in the network. The somewhat unrealistic assumption that no learning or changes of weights occurred during testing was adopted. However, this is a standard assumption, often used in other simulations of recognition memory, cued recall, or categorization (e.g., Kruschke, 1992; Lewandowsky, 1991; Li & Lindenberger, 1999; Li, Lindenberger, & Frensch, 2000). One thousand five hundred simulation runs, with different random item and context patterns, were carried out. The results reported later are based on the average across these runs.

Parameters

The number of high-frequency patterns was six (each encoded three times preexperimentally, and three of them encoded once experimentally to simulate the old items), and the number of low-frequency patterns was six (three of them encoded experimentally once to simulate the old items). The following parameter settings were used. The number of nodes in each layer (N) was 30 (the total number of nodes was 2N = 60), and the percentage of nodes active at encoding (a) was .20. Initially, all the weights were set to zero. The recognition criterion (C) was set to 0.
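The procedure and parameters above can be assembled into a compact end-to-end sketch. The parameter values (N = 30, a = .2, 12 items, three preexperimental contexts per high-frequency item, 1,500 runs) are those given in the text; the estimation of the activation threshold from the pooled cue-active net inputs is my reading of the optimal-threshold rule described earlier, not a detail stated in the original:

import numpy as np

rng = np.random.default_rng(1)
N, a = 30, .2

def pattern():
    return (rng.random(N) < a).astype(int)

def run_once():
    w = np.zeros((N, N))
    items = [pattern() for _ in range(12)]
    high, low = items[:6], items[6:]
    for it in high:                          # preexperimental phase:
        for _ in range(3):                   # three contexts per high-frequency item
            w += np.outer(it - a, pattern() - a)
    study = pattern()                        # one shared study context
    for it in high[:3] + low[:3]:            # experimental phase: 3 + 3 old items
        w += np.outer(it - a, study - a)

    h_d = w @ study                          # item net inputs (same context cue for all)
    nets = [np.concatenate([h_d[it == 1], (w.T @ it)[study == 1]]) for it in items]
    T = np.mean([h.mean() for h in nets])    # threshold: mean cue-active net input

    strengths = []
    for it, h_act in zip(items, nets):
        h = np.concatenate([h_d, w.T @ it])
        cue = np.concatenate([it, study])
        p_c = ((cue == 1) & (h > T)).mean()  # proportion active at recognition
        strengths.append((p_c - a / 2) / h_act.std(ddof=1))   # Equation 5
    return strengths                         # order: BO x3, BN x3, AO x3, AN x3

runs = np.array([run_once() for _ in range(1500)])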

Results

Figure 3A shows the results in terms of the density of the net inputs to an active node. The expected values of the net inputs are approximately equal for new high- and new low-frequency items [μ_h(AN) = .00, μ_h(BN) = .00] and for old low- and old high-frequency items [μ_h(AO) = .38, μ_h(BO) = .38]. The high-frequency items have a larger variance of the net inputs [σ_h(BN) = .49, σ_h(BO) = .48] than do the low-frequency items [σ_h(AN) = .41, σ_h(AO) = .40]. The variances of the old and the new distributions are approximately equal.

Figure 3B shows the density of the recognition strength. The result shows the mirror effect, where the hit rate probability is larger for low- than for high-frequency items and the false alarm rate is lower for low- than for high-frequency items [P(AN) = .18, P(BN) = .25, P(BO) = .64, P(AO) = .71]. The standard deviation of the recognition strength is larger for old than for new words, and larger for the low-frequency words than for the high-frequency words [σ_s(AN) = .29, σ_s(BN) = .19, σ_s(BO) = .23, σ_s(AO) = .31]. These findings agree with the empirical data and the predictions of the attention-likelihood model (Glanzer et al., 1993). Thus, the simulation shows that the variance theory can account for the frequency-based mirror effect and the associated ROC curves.

EXPLICATIONS OF THE MECHANISMS

In this section, three essential mechanisms of the variance theory that account for the mirror effect are discussed in detail. The mechanisms involved are (1) the variance of the net input, (2) the expected value of the net input, and (3) the way in which the recognition strength is calculated, by counting the number of nodes, so that the performance is optimal.

Overview

The first mechanism is that items from the difficult class (i.e., high-frequency items in this case) have a higher variance of the net inputs than do items from the easy class (i.e., low-frequency items), but the variance is independent of whether the items are old or new. The mechanism is illustrated in Figure 4A (noncumulative). It should also be underscored that the difference in variance as a function of class membership is not a primitive of the theory that the theorist manipulates. It is the natural consequence of the rather plausible assumption that high-frequency items appear more often and are associated with a greater variety of different contexts than are low-frequency words. The second mechanism is that old items have a higher expected net input to nodes encoded in the active state than do new items (but this does not depend on the class of the items). The mechanism is illustrated in Figure 4B (cumulative). The third mechanism is the way recognition strength is measured, by counting the active nodes, so that recognition performance (e.g., d′) is optimal or near-optimal. An extended analysis of optimality is presented at the end of the paper; however, the results in Figures 9A–9D can summarize the main points here. The results from this analysis imply that the decision must be based on the nodes active at encoding (see Figure 9A), the percentage of active nodes must be low (see Figure 9A), the activation threshold needs to be between the expected net inputs of the new and old items (see Figure 9B), and the recognition strength is normalized by the standard deviation of the net inputs of the item (see Figures 9C and 9D). The density of the percentage of active nodes is illustrated in Figure 4C, and the normalized percentage of active nodes is illustrated in Figure 4D. Here, it is shown how these mechanisms predict the mirror effect. Below, these three sets of mechanisms are explained in more detail.

Figure 3. (A) Simulations of net inputs. The vertical axis shows the simulated density of the net inputs. (B) Simulations of the mirror effect. The vertical axis shows the simulated density of the recognition strength.

Variance of the Net Input

The first, and perhaps the most important, mechanism is that the difficult class has a larger variability in the net inputs than does the easy class. As will be discussed later, this is shared with other theories, such as the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). However, a unique aspect of the variance theory is that it is a necessary outcome of simply encoding high-frequency items more times than low-frequency items. In the subjective-likelihood account, a larger variability for high- than for low-frequency words is an assumption that is controlled by a free parameter (p0), to reflect that high-frequency words have more definitions than do low-frequency words. The variance theory needs no such assumptions and no additional free parameters to control the variability. Whereas the subjective-likelihood account tries to capture word frequency with a free parameter, in the variance theory the word-frequency effect is simulated via the number of contexts associated with an item.

Owing to how the variance theory implements the relations between contexts and items, the variance of the net inputs increases with the frequency with which an item is encoded in different contexts. Intuitively, this occurs because a high-frequency item activates several different contexts, causing the representations of many competing contexts to be activated simultaneously. Low-frequency items are associated with fewer contexts than are high-frequency items. This causes the representations of fewer contexts to be activated simultaneously. Thus, low-frequency items have less variability in the net inputs than do high-frequency items.

Figure 4. (A) The probability density of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The dotted vertical line is the activation threshold. (B) The cumulative probability distributions of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the cumulative probability distributions of the net inputs. (C) The density of the percentage of nodes active at recognition in the variance theory. The horizontal axis shows the percentage of nodes active at recognition, and the vertical axis the density. (D) The density of recognition strength in the variance theory. The horizontal axis shows the recognition strength, and the vertical axis the density of the recognition strength, using standard parameter values.

At recognition, an item produces a net input in the context layer that is a mixture of the net inputs from the study context that the network is instructed to retrieve from and from several uncorrelated preexperimental contexts that were associated with the item during the preexperimental phase. The study context that the network is instructed to retrieve from contributes to the correct active state, for example, by adding +1 to the net input of a node (which is the expected net input for a node encoded in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was in an active state, or it could add −1 to the net input if the particular preexperimental context was encoded in an inactive state (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input. That is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and −1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1² for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for the items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or −1/4 [see Equation 3], each yielding an increase in variability of (1/4)² = 1/16. An expected value of aN nodes contribute, so the total increase in variability is 4 · 1/16 = 1/4.)

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in context nodes increases linearly with how many times a given item is encoded within different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.

In the same way as the variability of the context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with how many times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input

The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase of the two classes are equal. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given the exact same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are only interested in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables, such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to lesser increases in the net input, than does a longer period of study time, for example, 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encodings in other learning contexts will not affect the expected net input, but they do affect the variability of the net input, as was demonstrated above. The item–study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, as a difficult-class item) is equal to the expected net input for ARCTIC (a low-frequency word, as an easy-class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and the testing experimental conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult-class item is equal to the expected net input for a new easy-class item.

The probability density functions of the net inputs for nodes in the active states are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted. New nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of net inputs for the easy-class items [σ_h(A)] is smaller than the standard deviation of net inputs for the difficult-class items [σ_h(B)]. The second mechanism is shown in the figures in that the expected net input of an easy-class new item [μ_h(AN)] is equal to the expected net input of a difficult-class new item [μ_h(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: the expected net input for the old easy-class items [μ_h(AO)] is equal to the expected net input for the old difficult-class items [μ_h(BO)].

Recognition Strength

The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that the performance is optimal or near-optimal. If the net input is above the activation threshold (T) and the node was active at encoding, the node is activated at recognition. Otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities, or distributions, predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of the parameter values for the activation threshold. Thus, P(AN) < P(BN) < P(BO) < P(AO) for μ_h(AN) = μ_h(BN) < T < μ_h(AO) = μ_h(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy- and the difficult-class items. The expected strengths of the net inputs are equal. The variability is lower for easy-class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy than for the difficult class items, when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy-class items is higher than the hit rate for difficult-class items, when the activation threshold is less than the expected value of the old items.

The activation threshold is set to be between the expected values of the new and the old net inputs, so that the performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for difficult- and easy-class items, respectively:

$$T = \tfrac{1}{4}\left[\mu_h(AN) + \mu_h(BN) + \mu_h(BO) + \mu_h(AO)\right] = \tfrac{1}{4}\left[\mu_h(BO) + \mu_h(AO)\right] = \tfrac{1}{2}\mu_h(O).$$

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions, to optimize the performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with a shift in the activation threshold, necessary for keeping the performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test, in order to maximize performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [P_c(AN) < P_c(BN) < P_c(BO) < P_c(AO)], whereas the expected values of the net inputs are not [μ_h(AN) = μ_h(BN) < μ_h(BO) = μ_h(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σ_p(O) > σ_p(N)], whereas the standard deviations of the net inputs are equal for old and new items [σ_h(N) = σ_h(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions, because there are fewer nodes active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero. This standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes (σ_p²) is approximately proportional to the percentage of active nodes [σ_p² = P_c(1 − P_c)/N ≈ P_c/N ∝ P_c]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σ_p) is smaller for new than for old items [σ_p(AN) < σ_p(BN) < σ_p(BO) < σ_p(AO)], whereas the standard deviation of the net input is not [σ_h(AN) = σ_h(AO) < σ_h(BN) = σ_h(BO)]. The essential mechanism that makes these changes in the means and standard deviations is the nonlinearity introduced by the counting of the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because it improves the performance.
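For reference, the curve in Figure 5 is just the binomial standard deviation of a proportion. A one-line computation (assuming, as the figure caption states, 2N = 100 counted nodes):

import numpy as np

p_c = np.linspace(0, 1, 101)
sd_p = np.sqrt(p_c * (1 - p_c) / 100)   # zero at p_c = 0 and 1, maximal at .5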

To optimize the performance, subjects base their recognition decision on the number of active nodes, normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments when the responses are biased, but plays no role for unbiased responses.


To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (P_c). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σ_h′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σ_h′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σ_h). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviations of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = P_c, where C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) influences the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) influences the decision to be more certain.

Figure 4D shows the density distribution of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult classes (in Figure 4C) change when they are normalized by the variance of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult-class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high. The easy-class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class, when the activation threshold is set between the new and the old distributions. Here, it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that are different from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted, to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution that is presented at the end of the paper, together with an analysis of optimal performance.

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.

The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a) plus one parameter for each class of words [the standard deviation of the net input, σ_h(·)]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs to the easy class is σ_h(AN) = σ_h(AO) = 1.25, and the standard deviation of the net inputs to the difficult class is σ_h(BN) = σ_h(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μ_h(AN) = μ_h(BN) = 0, and the expected net inputs of the old distributions, μ_h(AO) = μ_h(BO) = 1. Consequently, the activation threshold is T = .5.

These parameters yield the following probabilities that a node is active at recognition: P_c(AN) = .43a, P_c(BN) = .45a, P_c(BO) = .55a, P_c(AO) = .57a. The following expected recognition strengths are predicted: μ_s(AN) = −.0012, μ_s(BN) = −.0008, μ_s(BO) = .0008, μ_s(AO) = .0012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists

The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize the performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.
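The direction of this prediction can be checked with a deliberately simplified Gaussian version of the argument (my own illustration, not the model's node-counting rule): let the old net input mean rise from 1 to 2 while the threshold tracks half the old mean, with unit-variance distributions.

from statistics import NormalDist

for mu_old in (1.0, 2.0):
    T = mu_old / 2                           # threshold tracks half the old mean
    fa = 1 - NormalDist(0.0, 1.0).cdf(T)     # new items centered at 0
    hit = 1 - NormalDist(mu_old, 1.0).cdf(T)
    print(mu_old, round(hit, 2), round(fa, 2))
# mu_old = 1: hit .69, false alarms .31; mu_old = 2: hit .84, false alarms .16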

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep, or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 for optimizing the performance. The distributions in Figure 6A are closer than the distributions in Figure 4D. Thus, decreasing the net inputs, for example, by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite directions, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain a near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list.


In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed at 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or for any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

Table 1
General Table of Results From the Experiment

Condition    AN     BN     BO     AO     σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13    .17    .69    .82         0.60             0.86
Frequency    .20    .28    .80    .68         1.01             0.66
Time         .10    .15    .78    .76         0.89             0.81

Note: The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low- (AN) and high- (BN) frequency words, the hit rates for high- (BO) and low- (AO) frequency words, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength and the vertical axes the density.

List-Length Effect

Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy and difficult class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.
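The claim that the net-input variance grows linearly with list length can be illustrated in a few lines of code. This is a toy sketch under one simplifying assumption not spelled out here: that each studied item contributes an independent, identically distributed increment to the weights between the list context and a given item node.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 100_000  # number of simulated item nodes

for L in (10, 20, 40):  # list length
    # Net input to an item node at test = sum of L independent Hebbian
    # increments, one per item associated with the same list context.
    net = rng.normal(0.0, 1.0, size=(n_samples, L)).sum(axis=1)
    print(L, round(net.var(), 1))
# The printed variances come out near 10, 20, 40: linear in list length.
```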

ROC Curves

The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths themselves, become equal. Figure 6D plots the predictions of the theory when all the standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = .0015, σs(BN) = .0012, σs(BO) = .0015, and σs(AO) = .0020. The ratios of these standard deviations must follow Equation 2. This is also the case: σs(BN)/σs(AO) = .61 < .74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = .78 < .94 = σs(AN)/σs(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = .0016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy class words than for difficult class words. This prediction is supported by empirical data. Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN); P(BO, AN) < P(AO, AN); P(BN, AN) > .50; and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)].


This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage for old high-frequency items when attention is focused on these items) must be larger than the effect of the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage for old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of net inputs (Figure 7A) that the density of recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75 rather than 0.50. The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with the values for Figure 4D: P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74].

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs and the vertical axis the probability density of the net inputs. The expected value of the old high-frequency distribution (BO) is shifted to the right because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength and the vertical axis the probability density.

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs are not dependent on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = .0013, σs(BN) = .0011, σs(BO) = .0017, and σs(AO) = .0019. These results are similar to the results when the same amount of attention is paid to the two classes, in Figure 4D: σs(AN) = .0015, σs(BN) = .0012, σs(BO) = .0015, and σs(AO) = .0020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that, if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves. The standard deviation of the hit rate distribution for the high-frequency words would be larger than that for the low-frequency words. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated in an experiment. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method
Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years, ranging from 18 to 29 years.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4–5 times per million, and the high-frequency words an occurrence of 50–55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether or not the word was presented in the list. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition, followed by the presentation time condition. Nine subjects were given the control condition, followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results
The results from the experiment are presented in the first three rows of Table 1. The hit rate was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate was larger for the low-frequency words. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = .004, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = .004, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = .003, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = .003, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = .001, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = .002, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1–5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly for the slope of the false alarms [σs(BN)/σs(AN)], are shown in the last two columns of Table 1.
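The slope computation just described is compact enough to sketch in code. The following is a plausible reconstruction, not the author's analysis script, and the rating counts at the bottom are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def z_rates(conf_counts):
    """z-transformed cumulative yes rates from counts of ratings 1-5.

    The yes rate at criterion c is P(rating >= c); c = 1 gives a rate of
    1.0 (z undefined), so only criteria 2-5 are returned.
    """
    counts = np.asarray(conf_counts, dtype=float)
    cum = np.cumsum(counts[::-1])[::-1] / counts.sum()
    return norm.ppf(cum[1:])

def zroc_slope(counts_a, counts_b):
    """Slope of z(Class A rates) regressed on z(Class B rates).

    By the usual signal-detection algebra, this slope estimates
    sigma(B)/sigma(A) for the corresponding distributions.
    """
    z_a, z_b = z_rates(counts_a), z_rates(counts_b)
    return np.polyfit(z_b, z_a, 1)[0]

# Hypothetical rating counts (ratings 1..5) for two old-item distributions:
print(zroc_slope([10, 12, 20, 30, 28], [14, 16, 22, 26, 22]))
```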

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the new (false alarm) high-frequency distributions were smaller than the standard deviations of the new low-frequency distributions in the presentation time condition and the control condition, but were approximately equal in the presentation frequency condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is higher than that for low-frequency words at encoding but lower at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is also fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

$$\sum_{i=1}^{8}\frac{\left(\mathrm{Observed}_i-\mathrm{Predicted}_i\right)^2}{\sigma_i^2}.$$

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were the number of features (N = 1000) and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy class words [σh(A)], and the standard deviation of the net inputs for the difficult class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.
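The shape of this fitting procedure, minimizing the squared error weighted by the variance, can be sketched as follows. The forward model below is a deliberately trivial stand-in (hypothetical); in the actual fits it would be the variance theory's prediction equations, mapping [a, σh(A), σh(B)] to the four probabilities and four slope ratios.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def predict(params, x):
    # Stand-in forward model (hypothetical, for illustration only).
    slope, intercept = params
    return slope * x + intercept

x = np.linspace(0.0, 1.0, 8)
observed = predict((0.5, 0.2), x) + rng.normal(0.0, 0.02, size=8)
sigma = np.ones(8)  # empirical SDs of the observations; one when unreported

def cost(params):
    # Sum over the eight quantities of (observed - predicted)^2 / sigma^2.
    return np.sum((observed - predict(params, x)) ** 2 / sigma ** 2)

fit = minimize(cost, x0=(1.0, 0.0), method="Nelder-Mead")
print(fit.x)  # recovers roughly (0.5, 0.2) for this toy model
```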

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slope. The attention-likelihood theory accounts for 84% of the variance of the slope. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slope, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted by a single variable, namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed to a = .10, and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), to 1.25 (these values were the average of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities. Figure 8B shows the corresponding results for the slope. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes, using a single parameter, equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities, using one parameter, is slightly less than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratio of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed in terms of the frequency of the items (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability is dependent on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

$$P_c = \frac{a}{\sqrt{2\pi\sigma_h^2}}\int_T^{\infty} e^{-(h-\mu_h)^2/(2\sigma_h^2)}\,dh. \qquad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), gives the expected recognition strength (μs):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because it simplifies the analytic solution; however, the variance theory, or the simulation, uses the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and the item layers. The distribution of Pc is binomial but can, for a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/(2N)]^(1/2). The final result is scaled by the normalization factor 1/σh:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c(1-P_c)}{2N}\right]^{1/2}. \qquad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

$$P(Y) = \frac{1}{\sqrt{2\pi\sigma_s^2}}\int_C^{\infty} e^{-(s-\mu_s)^2/(2\sigma_s^2)}\,ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

$$P(A,B) = \frac{1}{\sqrt{2\pi\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}}\int_0^{\infty} \exp\!\left(-\frac{\left\{s-\left[\mu_s(A)-\mu_s(B)\right]\right\}^2}{2\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}\right)ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
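For readers who prefer code, the analytic solution is also straightforward to implement directly. The following Python sketch is an illustration based on the equations as reconstructed here, not the author's spreadsheet; with the Figure 4D parameter values it reproduces, approximately, the unbiased predictions P(AN) ≈ .20, P(BN) ≈ .25, P(BO) ≈ .70, and P(AO) ≈ .74 given earlier, although the absolute scale of μs and σs depends on normalization details derived in the Appendix.

```python
import numpy as np
from scipy.stats import norm

def p_active(a, T, mu_h, sigma_h):
    # Equation 6: probability that a node active at encoding is also
    # active at recognition (normal net input exceeding threshold T).
    return a * norm.sf(T, loc=mu_h, scale=sigma_h)

def mu_strength(p_c, a, sigma_h):
    # Expected recognition strength: (Pc - a/2) / sigma_h.
    return (p_c - a / 2.0) / sigma_h

def sd_strength(p_c, sigma_h, n):
    # Equation 7: binomial SD of the active-node proportion over 2n nodes,
    # scaled by the normalization factor 1/sigma_h.
    return np.sqrt(p_c * (1.0 - p_c) / (2.0 * n)) / sigma_h

def p_yes(mu_s, sd_s, criterion=0.0):
    # Mass of the recognition strength density above the criterion C.
    return norm.sf(criterion, loc=mu_s, scale=sd_s)

def p_forced_choice(mu_a, sd_a, mu_b, sd_b):
    # Two-alternative forced choice: P(strength of A exceeds strength of B).
    return norm.sf(0.0, loc=mu_a - mu_b, scale=np.hypot(sd_a, sd_b))

# Figure 4D-style parameters: 2N = 100 nodes, a = .10, T = 0.5, C = 0.
a, n, T = 0.10, 50, 0.5
for label, mu_h, s_h in [("AN", 0.0, 1.25), ("BN", 0.0, 1.56),
                         ("BO", 1.0, 1.56), ("AO", 1.0, 1.25)]:
    pc = p_active(a, T, mu_h, s_h)
    print(label, round(p_yes(mu_strength(pc, a, s_h),
                             sd_strength(pc, s_h, n)), 2))
# Output order AN < BN < BO < AO: the mirror pattern.
```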


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is calculated in the model with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc to one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC. The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations:

$$\frac{\sigma_s^2(N)}{\sigma_s^2(O)} \approx \frac{P_c(N)}{P_c(O)}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is .80, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is .64 (= .80²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way. The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is approximately equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions:

$$\frac{\sigma_s(BO)}{\sigma_s(AO)} \le \frac{\sigma_s(BN)}{\sigma_s(AN)} \approx \frac{\sigma_h(A)}{\sigma_h(B)}.$$

The exact solution predicts a slightly smaller ratio in the old than in the new distributions.

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that the performance at retrieval is optimal, or near optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of active nodes, setting the activation threshold between the old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network would not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

$$d' = \frac{\mu_s(O)-\mu_s(N)}{\sigma_s(NO)} = \frac{P_c(O)-P_c(N)}{\left[P_c(NO)\left(1-P_c(NO)\right)/(2N)\right]^{1/2}}. \qquad (8)$$

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strengths, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc(·) was calculated with Equation 6. The expected values of the net inputs and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.
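Equation 8 makes this analysis easy to reproduce numerically. The sketch below is a simplified illustration, not the paper's analysis code: it holds the net-input means and standard deviation fixed at hypothetical values [μh(N) = 0, μh(O) = 2, σh = 1.25] instead of deriving them from the Appendix equations, so the exact optimum differs from the values reported in the text, but the qualitative landscape, a d′ maximum at a threshold between the new and old means, is the same.

```python
import numpy as np
from scipy.stats import norm

def d_prime(T, a=0.05, mu_old=2.0, sigma=1.25, n=30):
    # Equation 6 for new and old items, with mu_h(new) = 0; the 1/sigma_h
    # normalization cancels between Equation 8's numerator and denominator.
    pc_new = a * norm.sf(T, loc=0.0, scale=sigma)
    pc_old = a * norm.sf(T, loc=mu_old, scale=sigma)
    pc_no = (pc_new + pc_old) / 2.0
    # Equation 8.
    return (pc_old - pc_new) / np.sqrt(pc_no * (1.0 - pc_no) / (2.0 * n))

thresholds = np.linspace(-1.0, 4.0, 501)
best = thresholds[np.argmax([d_prime(T) for T in thresholds])]
print(best)  # roughly 1.6 here: between the new (0) and old (2.0) means,
             # somewhat above their midpoint, as in the text's analysis
```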

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding


(a). The results show that d′ is optimal for a = .052. The d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases. For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. Thus, for a small a, errors are more likely to occur, because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity level for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed. Thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight changes, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This was done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, the optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the mean of the distribution with the larger standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because, if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
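Finding this intersection is a small exercise in normal-density algebra: taking logarithms of f [S(O)] = f [S(N)] yields a quadratic in the strength axis. A sketch with hypothetical parameter values (not from the paper):

```python
import numpy as np

def equal_likelihood_points(mu_n, sd_n, mu_o, sd_o):
    # Solve f_old(x) = f_new(x) for two normal densities; taking logs
    # gives a quadratic a*x**2 + b*x + c = 0 in the strength x.
    a = 1.0 / sd_n**2 - 1.0 / sd_o**2
    b = 2.0 * (mu_o / sd_o**2 - mu_n / sd_n**2)
    c = mu_n**2 / sd_n**2 - mu_o**2 / sd_o**2 + 2.0 * np.log(sd_n / sd_o)
    return np.roots([a, b, c])

# With sd_o > sd_n, the root between the means (about 0.75 here) is the
# optimal criterion; the other root lies far out in the tails.
print(equal_likelihood_points(mu_n=0.0, sd_n=1.0, mu_o=1.0, sd_o=1.3))
```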

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. An optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple. The optimal performance occurs when the maximum likelihoods of the two classes are equal:

L(A) = f [S(AO)]/f [S(AN)] = L(B) = f [S(BO)]/f [S(BN)].
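A brute-force way to verify this condition is to treat the split of the total false-alarm budget between the two classes as the free variable and grid-search it. The sketch below uses hypothetical parameter values, with each class's new distribution standardized to mean 0 and standard deviation 1; at the maximizing split, the two likelihood ratios come out approximately equal, as the derivation requires.

```python
import numpy as np
from scipy.stats import norm

def best_split(total_fa, mu_o_a, sd_o_a, mu_o_b, sd_o_b, grid=999):
    # Split the total false alarm rate between Class A and Class B, derive
    # each class's criterion from its false alarm share, and keep the
    # split that maximizes the summed hit rates.
    fa_a = np.linspace(1e-4, total_fa - 1e-4, grid)
    fa_b = total_fa - fa_a
    crit_a, crit_b = norm.isf(fa_a), norm.isf(fa_b)
    hits = norm.sf(crit_a, mu_o_a, sd_o_a) + norm.sf(crit_b, mu_o_b, sd_o_b)
    i = int(np.argmax(hits))
    return fa_a[i], fa_b[i]

# Easy Class A discriminates better than difficult Class B (hypothetical):
fa_a, fa_b = best_split(0.5, mu_o_a=2.0, sd_o_a=1.2, mu_o_b=1.0, sd_o_b=1.1)
print(fa_a, fa_b)
```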

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is diminished, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the two false alarm rates must cancel: ΔP(AN) + ΔP(BN) = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) < L(B) when the recognition criteria are changed as specified above. Thus, L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), and it follows that the net change in the hit rates is negative: ΔP(AO) + ΔP(BO) < 0. This shows that moving the placement of the criteria away from L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.

Note that the variance theory has only one overall recognition criterion. However, the theory, and any theory of the mirror effect, specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can indirectly be inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance, even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently so that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56), in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate in Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active for old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of active nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to the high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions because more nodes are activated in the old distributions. The standard deviations of the easy class distributions are predicted to be larger than the standard deviations of the difficult class distributions because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral connections (i.e., connections within the item layer), and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune from this problem because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item–context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions (for example, when study time is changed). Furthermore, subjects do not need to estimate the list length or the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes–no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σh′), so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.
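A plausible reading of the "two z-transformations," stated in standard signal-detection form (a sketch, assuming normally distributed recognition strengths; μS and σS denote the mean and standard deviation of S for each distribution, and Φ is the standard normal distribution function):

P(yes | old) = Φ[(μS(O) − C)/σS(O)],   P(yes | new) = Φ[(μS(N) − C)/σS(N)].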

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves; instead, the variance theory uses the number of active nodes, normalized by the standard deviation of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes that model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the subjective-likelihood model in a connectionist network (however, see McClelland & Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM: Retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

aσp(N)/[σp(N) + σp(O)].

Because the new variance is smaller than the old variance, this value will be slightly less than a/2; for example, if the ratio of the new to the old standard deviation (the slope of the z-ROC curve) is 0.8, the expression gives 0.8a/1.8 ≈ 0.44a. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so x_i = x_j = 1:

μ_h(O) = Σ_j Δw_ij x_j = Σ_j (x_i − a)(x_j − a)x_j = aN(1 − a)².   (A1)

The expected value of the net inputs for the new items is zero:

μ_h(N) = 0.   (A2)

The expected value of the weights is zero, so the variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

σ_h²(N) = Σ_P Σ_j E[(Δw_ij)²]x_j
= PaN{[(1 − a)(1 − a)]²(a · a) + [(1 − a)(−a)]²[2a(1 − a)] + [(−a)(−a)]²[(1 − a)(1 − a)]}
= PNa³(1 − a)².   (A3)

For an old item, the variance of the net input to a node in the context layer from the item layer depends on the frequency of the item (f), because all the aN weights supporting a context node contribute to the net input in the same direction. Each of the f contexts in which the item was encoded contributes one such coherent term:

σ_h²(f) = E{[Σ_i Δw_ij x_i]²} = f a(1 − a)[aN(1 − a)]² = f a³(1 − a)³N².   (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is then

σ_h²(L) = La³(1 − a)³N².

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L, in a network where P patterns have been encoded, is

σ_h²(O) = (f + L)a³(1 − a)³N²/2 + PNa³(1 − a)².   (A5)

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on the nodes active at encoding. However, if recognition strength is instead based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the proportions of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)



number of targets in a typical recognition test (for example, 50) is negligible in comparison with the number of possible distractor words (the number of words in the subject's vocabulary, e.g., 20,000). Thus, the subjective-likelihood account of the list-strength effect may encounter a problem when the number of detectors associated with each distractor increases to a more realistic number. A similar problem may also be apparent with respect to the list-length effect, which is accounted for by the assumption that the number of detectors is proportional to the list length. Arguably, the subjective-likelihood theory would not account for the list-length effect given the more plausible assumption that the number of detectors is related to the size of the subject's vocabulary.

REM. REM (Shiffrin & Steyvers, 1997) is another model whose aim is to account for the mirror effect, ROC curves, list strength, and other phenomena in recognition and recall. Similar to the subjective-likelihood theory, REM is based on likelihood ratios and uses noisy vector-based representations in memory. Although REM also uses likelihood ratios, it differs in the sense that it uses true probabilities in calculating the likelihood ratios, whereas the subjective-likelihood theory uses estimates. Furthermore, in REM, the value assigned to the model's representation on a particular feature dimension is determined when the dimension is sampled the first time. In the subjective-likelihood theory, the representations of the items are refined successively over presentations.

In REM, several factors combine to produce the frequency-based mirror effect: (1) The likelihood ratios are assumed to be smaller for old high-frequency words, because high-frequency features are less diagnostic. (2) This factor is larger than the compensating factor that high-frequency words have slightly more matching feature values, because errors in storage tend to produce common values, increasing the probability of accidentally matching a high-frequency feature value. (3) New high-frequency words have a slight increase in the number of chance matches, which increases the likelihood ratio.

Limitations of current theories. Although these theories account for several aspects of the data regarding the mirror effect, they have been subjected to a few general criticisms. Perhaps the most obvious problem with these models is that they predict that strengthening of a class yields a mirror effect. Although this prediction is supported by data in studies in which the strength manipulations were applied between conditions (Glanzer et al., 1993), it is certainly inconsistent with the data when the strength manipulations are applied within conditions (Stretch & Wixted, 1998a; see also Stretch & Wixted, 1998b).

In addition, there are a few other, more specific criticisms of these theories. Here are some problems regarding the attention-likelihood theory. First, calculating the log-likelihood ratios requires knowledge of the class (l dependent on i). Thus, it is necessary to classify the to-be-recognized stimuli into distinct categories (i), or at least to estimate the stimuli along some dimension, such as frequency. Glanzer et al. (1993) noted that the attention-likelihood theory predicts the mirror effect even though the estimates of p(i, old) are not accurate, for example, when this probability is set to the average of the two classes. Thus, the estimates of p(i, old) are not critical to the predictions. However, it is necessary to estimate the number of features sampled at recognition [n(i)] in Equation 1 to make the correct predictions, and this process requires knowledge of the class of the stimuli. Second, knowledge of several variables is required: Depending on the classification, the attention paid to this category [n(i)] must be estimated, and the probability that a feature in a new item is marked must also be estimated. Third, the attention-likelihood theory involves several steps of complex calculation. Although this may not be a reason for dismissing the theory (see Murdock, 1998, for a discussion of this topic), it would be preferable to have a simpler account. Given that the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997), like the attention-likelihood theory, are also based at their cores on variations in log-likelihood ratios, these criticisms would also apply to them.

Research Goal and Organization of the Paper
Given the limitations of current theories, the purpose of this paper is to propose a new account of the mirror effect that can avoid most of these criticisms. The theory is proposed specifically for the frequency-based mirror effect, but it also accounts for the strength-based mirror effect within a list, the strength-based mirror effect between lists, z-ROC curves associated with the mirror effect, and the list-length effect. The paper is organized as follows. First, a brief overview of the theory is presented, which is then followed by an in-depth presentation. Second, the theory is implemented in a connectionist model of memory previously developed by the author (i.e., TECO, the target, event, cue, and object theory; Sikström, 1996a, 1996b). Third, the mechanisms of the theory responsible for capturing the mirror effect are presented in detail. Fourth, a section presents the theory's applications with respect to various classical empirical regularities, such as between-list strengthening, the list-strength effect, z-ROC curves, and response criterion shifting. Fifth, predictions regarding a recently discovered lack of the mirror effect in the context of within-list strength manipulation are presented (e.g., Stretch & Wixted, 1998a, 1998b), and an experiment is carried out to evaluate the prediction. For readers who are interested in the analytic solutions of the theory, mathematical derivations of these solutions are presented in the sixth section, and an analysis of the model's optimal performance is conducted. Finally, implications for distributed models of memory and the relations between the variance theory and previous theories of the mirror effect are discussed.

THE VARIANCE THEORY OF THE MIRROR EFFECT

Overview of the Theory
In a nutshell, the variance theory proposed here is similar to previous models of the mirror effect in the sense that it also uses vector-feature representations of the items and estimates (via simulations) the response probabilities of old (target) and new (lure) items during a recognition test. However, the variance theory is driven by different conceptual and technical considerations. At the conceptual level, the variance theory sets out to capture the mirror effect mainly in terms of the relations between the study materials and the natural preexperimental context associations the items may have. This is conceptually quite different from all previous theories, which seek to explain mirror effects primarily in terms of the individual's decision process. Rather, the approach taken here considers the context in which the individual recognition decision processes take place. The natural frequencies of events occurring in the individual's day-to-day contexts may be reflected in recognition decision processes, and the individuals may or may not know (or be consciously aware of) these processes. At the technical level, the variance theory also differs from previous theories in a significant way. Instead of directly computing ratios between probabilities, a new way of computing recognition strength is proposed: normalizing the difference between the response probabilities for the target and the lure items with the standard deviations of the underlying response distributions.

Specifically, in dealing with the frequency-based mirror effect, a rather plausible key assumption of the variance theory is that high-frequency words have appeared in more naturally occurring preexperimental contexts than have low-frequency words. This assumption is implemented in the connectionist network simulations in a rather straightforward way, by associating the simulated high-frequency items with more contexts than the low-frequency items during a simulated preexperimental phase. In implementing the theory, items and contexts are represented by two separate arrays (vectors) of binary features, with each feature being represented by a node (or element of the vector), as is shown in Figure 2. A particular item, such as CAR, activates some features at the item layer. A particular context, such as REPAIR SHOP, activates some features at the context layer. Different items and contexts may or may not activate some of the same features. The item and context features are fully interconnected with weights. When an item appears in a particular context and certain features are activated, the weights that reciprocally connect the item features to the context features are adaptively changed according to a specific learning rule, described later. Note that in the implementation, two types of contexts, namely, the preexperimental contexts and the study context (representing internally generated time or list information associated with an item during the actual experimental episode), are represented in the network using one common context layer. But these two types of context information are differentiated by two simulation phases, namely, the preexperimental phase and the actual encoding and testing phase. As will be mathematically outlined later, the standard deviation of the input at the context layer increases when an item is associated with several contexts. Therefore, high-frequency items (associated with more preexperimental contexts) will have larger standard deviations than will low-frequency items in their activation patterns, which are subsequently propagated to the item layer. However, the expected value of the input is equal for high- and low-frequency items.

During the recognition test, an item vector is presented to reinstate the activation of the item features. The features of the study context are reinstated by presenting the network with the context pattern encoded during the study-encoding phase (but not with the patterns of the other, preexperimental contexts that the network was trained with during the preexperimental phase). The degree of recognition strength is determined by first calculating the net inputs to the context and the item nodes. The net input is the signal a node receives from the other active nodes connecting to it, and the strength of this signal determines whether the node will be activated at retrieval. The net input of a given item node is simply computed as the sum of all weights connected to active context nodes. Similarly, the net input of a given context node is simply the sum of all weights connecting active item nodes to that particular context node. The net inputs then denote the retrieved state of activation patterns at the item and context layers. The subset of item and context nodes that have activation levels exceeding a particular activation threshold at retrieval, and that were also active during encoding, are then used to calculate the recognition strength. Those nodes whose activation does not exceed the threshold, or that were inactive during encoding, have no influence on recognition strength. For example, assume that the activation threshold is set to 0.5, so that any node (item or context) that was active during encoding and whose retrieved activation during testing exceeds the value of 0.5 contributes to recognition strength. Imagine that four nodes out of a total of eight exceed the threshold and are equal to 0.75, 1.00, 1.25, and 1.50. The recognition strength of the item is the percentage of above-threshold nodes (50%) minus the expected percentage of above-threshold nodes (e.g., 25%), divided by the standard deviation of the actually observed above-threshold nodes (i.e., by the standard deviation of 0.75, 1.00, 1.25, and 1.50).

Figure 2. The variance theory. The upper four circles represent nodes in the item layer. The lower four circles represent nodes in the context layer. The arrows between the item and the context layers represent connections.

Why is recognition strength determined by this rule, as opposed to, say, just the percentage of above-threshold nodes? As will be shown later, this way of measuring recognition strength (subtracting the expected value and dividing by the standard deviation of the net input) allows the model to perform optimally in terms of discriminability between new and old items when the response bias to the yes response is manipulated. And in this case, the model accounts for why the standard deviation of the response distribution of the easy class is larger than that of the difficult class. It is plausible to assume that humans have evolved to generally respond more or less optimally and that this is reflected in their performance, as well as in the implementation of the variance theory. Similarly, the activation threshold is set in the model to the value that leads to the highest d′ (i.e., to optimal performance), which occurs when the activation threshold is between the new and the old distributions. This optimal tuning of the model allows it to account for some rather dramatic results, such as concentering, showing that target and lure distributions from different item classes converge on a strength-of-evidence axis as memory conditions worsen.

Here is a brief example of how the model performs. Consider hypothetical levels of activation generated by high-frequency and low-frequency old (target) and new (lure) items. Because high-frequency words have appeared in a larger number of contexts, they have a larger variance of net input. As such, targets and lures will be relatively more confusable and will generate percentages of activated nodes that are difficult to discriminate. Assume that the standard deviation of the net input is 0.10 and that the relevant proportions are .25 for high-frequency targets and .15 for high-frequency lures. In contrast, low-frequency words will have occurred in fewer contexts and will be less confusable. Assume that the standard deviation of the net inputs is less for low-frequency words, for example, 0.05, and that the proportion of active nodes is .30 for low-frequency targets and .10 for low-frequency lures. Given these values, what are the recognition strengths for high-frequency and low-frequency targets and lures? If the expected proportion of above-threshold nodes is .20, they are

Low-frequency lures: (.10 − .20)/.05 = −2.0
High-frequency lures: (.15 − .20)/.10 = −0.5
High-frequency targets: (.25 − .20)/.10 = 0.5
Low-frequency targets: (.30 − .20)/.05 = 2.0

The model accounts for the various memory phenomena by postulating a connectionist neural network with an implementation and parameter settings that allow it to perform at optimal or near-optimal levels. When the model is optimized, it behaves as subjects behave; when it is not optimized, it does not fit the experimental data. This is true not only for the standard mirror effect, but also for exceptions, such as the absence of a mirror effect for within-list strength manipulations (something all other competing formal models fail to capture). Furthermore, it predicts key features of the ROCs for new and old items, as well as for high- and low-frequency items (something any viable formal model must do).

Presentation of the Variance Theory
In this section, the details of the variance theory are presented. As will become clearer as the theory is unfolded, the theory is analytical, and the analytical solutions are self-contained and solvable (interested readers can find the analytical solutions in the sixth section). Although the theory itself does not require a particular computational framework, it can be more easily explained and directly implemented by using a connectionist network. Therefore, the presentation of the theory below is couched within the framework of a Hopfield neural network (Hopfield, 1982, 1984), in order to explicate the theory's underlying mechanisms that generate the mirror effect.

Network architecture. The variance theory may be implemented as a two-layer recurrent distributed neural network (see Figure 2). There are two layers in the representation: One is the item layer, and the other is the context layer. Both layers consist of N nodes (i.e., N features), although the theory could also be implemented with an unequal number of context and item nodes. Thus, the total number of nodes in the network is 2N. The item and the context layers are fully connected, so that all the nodes in one layer are connected through weights to all the nodes in the other layer. Nodes within a layer are not connected (i.e., there are no lateral connections).

Stimulus and context representation. Contexts and items are represented as binary activation patterns across the nodes in the context and item layers, respectively. A node is active (or activated) if its corresponding feature is present (i.e., of value 1), and a node is inactive (or not activated) if its corresponding feature is absent (i.e., of value 0). There are several reasons for choosing binary representations. For instance, binary representation serves to deblur noisy information at retrieval (Hopfield, 1984). Binary representation also allows for a low proportion of active nodes (sparse representation), which is known to improve performance (Okada, 1996). It also introduces nonlinearity, which is necessary for solving some problems in multilayer networks (Rumelhart, Hinton, & Williams, 1986), and it is perhaps the simplest representation. Furthermore, in the present model, it is shown to be necessary for capturing the characteristics of the z-ROC curves that are associated with the mirror effect.

More specifically, the state of activation for the ith node in the item layer at encoding is denoted x_i^d, where the superscript d denotes the item layer. The state of activation for the jth node in the context layer at encoding is denoted x_j^c, where the superscript c denotes the context layer. Context patterns and item patterns are generated by randomly setting nodes to an active state (i.e., with values of 1) and otherwise to an inactive state (i.e., with values of 0). Let a be a parameter that determines the expected probability that a node is active at encoding. This parameter does not change during the simulation and is assumed to be relatively small (for purposes of sparse representation). Note that a is the expected probability of active nodes, whereas the real percentage of active nodes for specific items or contexts varies around a.
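As a minimal illustration in Python (the values of N and a are assumed for this sketch, not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    N, a = 30, 0.2   # layer size and activation probability (illustrative values)

    x_item = (rng.random(N) < a).astype(int)   # x^d: item pattern at encoding
    x_ctx = (rng.random(N) < a).astype(int)    # x^c: context pattern at encoding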

The encoding-study phase. Encoding occurs by changing the connection weights between the item and the context layers. The weights (or the strengths of the connections) contain the information about what has been stored in the network. The weight between item node i and context node j is denoted w_ij, and it is initialized to zero. The weight change (Δw_ij) is calculated by the learning rule suggested by Hopfield (1982; see also Hertz, Krogh, & Palmer, 1991, for additional details). This is essentially a Hebbian learning rule that increases the connection weights between simultaneously activated nodes. This rule is chosen here because it is more biologically plausible than other rules, such as the delta or the gradient-descent learning rules (e.g., Rumelhart et al., 1986) used in back-propagation networks. However, the variance theory can also be implemented with other learning rules.

According to the Hopfield (1982) learning rule, the weight change is computed as the outer product between the item and the context vectors of activation patterns, with the parameter a first subtracted from both vectors. This subtraction is mathematically necessary to keep the expected value of the weights at zero. Using the notation for item and context activation defined above, the weight change between element i of the item vector and element j of the context vector can be written as

Δw_ij = (x_i^d − a)(x_j^c − a).   (3)

Word frequency as the number of associated contexts in the preexperimental phase. An item may be encoded more or less frequently, and hence be associated with more or fewer different preexperimental contexts, depending on how often the item occurs in the subject's environment. In the model, an item's frequency is simulated at the preexperimental stage of the simulation by the number of times the item is encoded in different contexts. A low-frequency item is encoded less often and is associated with a smaller number of contexts, whereas a high-frequency item is encoded more often and is associated with a larger number of different contexts. For instance, one might form a preexperimental association between SAAB and REPAIR SHOP after experiencing the rare event of a new, expensive SAAB sports car breaking down halfway through a honeymoon trip to the Grand Canyon, with the SAAB having to be towed to a repair shop somewhere out in the desert. In implementing the variance theory, the relationship between word frequency and preexperimental item–context associations can be simulated straightforwardly. At the preexperimental stage of the simulation, a low-frequency item, SAAB, may be simulated by associating one context, REPAIR SHOP, with it during encoding. A high-frequency word, CAR, may be simulated by associating three different contexts, REPAIR SHOP, TAXI RIDE, and DRIVING TO WORK, with it during encoding.

The recognition test phase. At recognition, an item is presented to the network: The representation of this item is reinstated as a cue to the item layer, and the representation of the study context (simulating an internally generated context regarding list or time information during the recognition experiment) is reinstated as a cue to the context layer. For example, the representation of the word CAR is reinstated at the item layer. Furthermore, the representation of the study context, STUDY LIST, is reinstated at the context layer. The subjects must have this information (or cue) in order to recognize an item from the particular study context (and not recognize the item from all the other, preexperimental contexts). In the actual experimental setting, this information is usually conveyed to the subjects by the explicit instruction to recognize from the study context (e.g., "Do you recognize the word CAR from the list you just studied?").

At recognition, each node receives a signal (called the net input), which is computed on the basis of the other active nodes connecting to it. Item nodes receive net inputs from active context nodes, and context nodes receive net inputs from active item nodes. The net input to a given node is simply the sum of the weights of all other active nodes connected to that node. For example, if item node 1 is connected to four context nodes (1, 2, 3, and 4), where context nodes 1 and 3 are active, the net input to item node 1 is w11 + w13. Thus, active nodes "send" information to the nodes that they are connected to, whereas inactive nodes do not send any information. Put another way, nodes "receive" information, or input, from the active nodes that they are connected to, but not from the inactive nodes.

Specifically, the net input to item node i is calculated by first multiplying the activity of each context node (labeled j) connected to this node by the weight of the connection between nodes i and j, and then summing over all connected nodes. In vector formalization, the weight matrix operates on the activation vector, and the output is the net input vector. The net input to node i (h_i^d) at the item layer depends on the states of activation of the nodes in the context layer and the weights connecting the two layers:

h_i^d = Σ_{j=1}^{N} w_ij x_j^c.   (4)

Following the same principle, a similar function is used to calculate the net inputs to the context nodes. Specifically, the net input to context node j depends on the states of activation of the nodes in the item layer and the weights connecting the two layers:

h_j^c = Σ_{i=1}^{N} w_ij x_i^d.

By inserting Equation 3 into Equation 4 and summing over the p patterns encoded during list learning, it is easy to see that the net input to a node is simply a sum of Np terms; for example, the net input to an item node is

h_i^d = Σ_p Σ_{j=1}^{N} (x_i^d − a)(x_j^c − a)x_j^c.

For a = .5, the net inputs are binomially distributed with a certain expected value. Given a certain criterion [i.e., Np(1 − a)a > 10], a binomial distribution can be approximated with a normal distribution (Feller, 1968). For a ≠ .5, there are actually four outcomes; however, the same normal approximation can be used. Thus, for reasonably large values of Npa, the distribution of the net inputs to the nodes can be approximated by a normal distribution.

If the to-be-recognized item has not been encoded with the context (i.e., a new item), the net input is simply a sum of random weights. Because the expected values of all weights are zero, the expected value of the net inputs for new items will also be zero. If the item has been encoded with the context (i.e., an old item), the net input is the sum of those weights connected to that node whose respective context nodes were active at encoding. Owing to the adaptive weight changes during encoding, these weights will have an expected value that is larger than zero if both nodes were in the active state during encoding [i.e., each weight change at encoding is (1 − a)²] and less than zero if one node was inactive and the other node was active at encoding [i.e., each weight change at encoding is −a(1 − a)]. Of specific importance for the theory is that the variance of the net inputs to the context nodes (from the item nodes) increases with the number of contexts that are associated with the item. Therefore, the variance of the net input is larger for high- than for low-frequency items. Similarly, the variance of the net input to the item nodes (from the context nodes) increases with the number of items associated with one context (i.e., the list length). Therefore, given that the context is constant during a list presentation, the variance of the net inputs is larger for a long than for a short list.
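A sketch of the net input computation in Python (an illustration, assuming a weight matrix W built with the learning rule above; rows index item nodes, columns index context nodes):

    import numpy as np

    def net_inputs(W, x_item, x_ctx):
        """Equation 4 and its context-layer counterpart: each node sums the
        weights coming from the active nodes of the other layer."""
        return W @ x_ctx, W.T @ x_item   # (item-layer inputs, context-layer inputs)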

Brief summary of optimal performance. Given the strong selection pressure, humans and animals have arguably evolved to achieve good memory performance. Therefore, it is reasonable to assume that the mechanisms for recognition decisions have evolved to optimal or near-optimal performance. Following this assumption, the parameter values in the model and the implementation of the model are guided by the idea that the model should perform optimally. A detailed discussion of the issue of optimal performance, with exact derivations of what constitutes optimal performance in the context of the present model, is presented later, in the sixth section. Here, I give a brief summary of the results from the analysis of optimal performance, without going into the mathematical details (see Figures 9A, 9B, 9C, and 9D).

The model's performance is optimal if the percentage of nodes active at encoding (a) is low (see Figure 9A). For a low a, it is optimal to base the recognition decision on the nodes that were active at encoding and to ignore the nodes that were inactive during encoding (see Figure 9A). Also for a low a, it is optimal to place the activation threshold of the nodes between the expected values of the new and the old net inputs (see Figure 9B). Finally, it is optimal to normalize the recognition strength with the standard deviation of the net input (see Figures 9C and 9D).

For a low percentage of active nodes, it is optimal to base the recognition decision on the nodes that were active at encoding (i.e., the nodes active in the cue pattern) and to ignore the nodes that were inactive at encoding. At recognition, the state of activation of a node may be either active or inactive: The nodes that are active in the cue pattern and have a net input above a certain activation threshold are activated at recognition; otherwise, the nodes are inactivated. Let z_i^d denote the state of activation at recognition for item node i. An item node is activated at recognition (z_i^d = 1) if it was active in the cue pattern (x_i^d = 1) and its net input is above the activation threshold (h_i^d > T); otherwise, it is inactivated (z_i^d = 0):

z_i^d = 1 if x_i^d = 1 and h_i^d > T; otherwise, z_i^d = 0.

Similarly, let z_j^c denote the state of activation at recognition for context node j:

z_j^c = 1 if x_j^c = 1 and h_j^c > T; otherwise, z_j^c = 0.

This way of activating nodes at retrieval differs from how nodes are activated in a standard Hopfield (1982) network, where the activation threshold is zero and a node is activated if its net input is above zero (independently of the state of activation in the cue pattern). The Hopfield way of activating patterns is more likely to produce a retrieved pattern that matches the encoded pattern of activation (e.g., the expected value of active nodes at retrieval will be the same as the expected value of active nodes at encoding). However, as will be discussed later, the way of activating the nodes suggested here yields better performance in terms of discrimination between a target item and a distractor item.
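The activation rule is a one-liner in Python (a sketch; `h`, `x_cue`, and `T` follow the definitions above):

    import numpy as np

    def retrieval_state(h, x_cue, T):
        """A node is active at recognition only if it is active in the cue
        pattern and its net input exceeds the activation threshold T."""
        return ((x_cue == 1) & (h > T)).astype(int)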

As is shown in Figure 9B, performance is optimal when the activation threshold is set approximately between the new and the old net inputs. The activation threshold (T) is set to the expected value of the net inputs of the nodes active during encoding (x_i^d = 1, x_j^c = 1), for old and new items. The averaging is computed over all nodes (2N) and over all new and old patterns (P) in the recognition condition. If half of the items are new and half of the items are old, the activation threshold is

T = [1/(2aNP)] Σ_P [Σ_{i=1}^{N} h_i^d x_i^d + Σ_{j=1}^{N} h_j^c x_j^c].

As was discussed above, the expected net input of new (lure) items is zero. Therefore, the activation threshold is simply half the expected net input for nodes encoded in the active state [T = μ_h(O)/2, where μ_h(O) is the expected value of the net input to nodes encoded in the active state].

It is easy to see that the expected percentage of old and new nodes active at recognition is one half of the percentage of active nodes at encoding (a/2). That is, the activation threshold divides the old and the new distributions in the middle: Old items will have a higher expected percentage active, and new items will have a lower expected percentage active. The activation threshold is constant during recognition within one condition. However, it must vary between conditions, depending on the net inputs, to yield optimal performance.
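A minimal sketch of this computation (assuming `h_all` and `x_all` are patterns × 2N arrays of net inputs and encoding states for all test items; the function name is hypothetical):

    import numpy as np

    def activation_threshold(h_all, x_all):
        """T: mean net input over all nodes active at encoding, pooled across
        all (half old, half new) test patterns; this equals mu_h(O)/2 when
        the expected new net input is zero."""
        return h_all[x_all == 1].mean()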

The percentage of nodes active at recognition is counted:

Pc = [1/(2N)] [Σ_{i=1}^{N} z_i^d + Σ_{j=1}^{N} z_j^c].

As is shown in Figure 9C, performance is near-optimal if the recognition decision is based on the number of nodes active at recognition, normalized by the standard deviation of the net inputs across the active features of the item (σh′). Thus, this standard deviation is calculated over all the nodes active at encoding (i.e., x_i^d = 1 and x_j^c = 1; nodes inactive at encoding are not used when calculating the standard deviation, because for low levels of a these nodes carry little or no information about the item). The recognition strength (S) for an item is calculated by subtracting the expected percentage of nodes active at recognition (a/2; see note 1) from the real percentage of nodes active at recognition (Pc) and dividing by the standard deviation of the net inputs of the item (σh′):

S = (Pc − a/2)/σh′.   (5)

Y 5 S gt C

An issue that may be raised is whether it is sensible tobase recognition strength on two quite different sourcesmdashnamely the percentage of active nodes and the variabil-ity of the net input The immediate answer is that if it isreasonable to optimize performance it is also sensible tomeasure recognition strength this way Another perspec-tive is to note that unbiased responses can be made onlyon the percentage of active nodesmdashthat is a yes responseoccurs if the percentage of active nodes is larger than theexpected percentage of active nodes (Pc gt a 2) and thevariability of the net input can be ignored Thus ldquonor-

mallyrdquo subjects base their unbiased decisions on the per-centage of active nodes and the variability of active nodesonly becomes relevant when subjects are biased Fromthis perspective the percentage of active nodes is usedfor unbiased responses and the variability of the netinput becomes relevant for confidence judgments There-fore by combining both the percentage of active nodesand the variability of the net input the measure of recog-nition strength proposed here will also reflect the confi-dence judgment

An Example With Step-by-Step Computations
To clarify the computational details involved in the variance theory, a step-by-step example is given here. For tractability, a small network is used, consisting of four item nodes and four context nodes (see Figure 2). The actual simulations, to be reported later, involved a larger network, with 30 nodes in each layer. The percentage of nodes active at encoding (denoted by the parameter a) is set to .50. Let item BN be represented as (1,1,0,0), written as the states of activation of the four nodes x_1^d, x_2^d, x_3^d, x_4^d. Similarly, let (0,0,1,1) represent item BO, (1,0,1,0) represent item AO, and (0,1,0,1) represent item AN. Let context CBN be represented as (1,1,0,0), the states of activation of the four nodes x_1^c, x_2^c, x_3^c, x_4^c. Similarly, (0,0,1,1) represents context CBO, and (0,1,0,1) represents the experimental study context CExp.

Item BN is a high-frequency new word. For simplicity, it is encoded here only once, with context CBN, in the preexperimental phase (in the simulations below, high-frequency words are preexperimentally associated with three contexts). The 16 weights between the four item nodes and the four context nodes are changed according to the learning rule, where the probability that a node is active at encoding is determined by the parameter a = .5. For example, the weight change between item node 1 and context node 1 is w11 = [x_1^d(BN) − a][x_1^c(CBN) − a] = (1 − .5)(1 − .5) = 1/4, where BN is item BN and CBN is context CBN. Similarly, item BO is another high-frequency word that, before the experimental phase, is encoded once with context CBO. Items AO and AN are low-frequency old and new words, and they are not encoded in the preexperimental phase.

In the experimental phase, item AO is encoded with the experimental context CExp. Finally, item BO is encoded with the same experimental context CExp. For example, the weight w11 is now equal to

[x_1^d(BN) − a][x_1^c(CBN) − a] + [x_1^d(BO) − a][x_1^c(CBO) − a] + [x_1^d(BO) − a][x_1^c(CExp) − a] + [x_1^d(AO) − a][x_1^c(CExp) − a]
= (1 − .5)(1 − .5) + (0 − .5)(0 − .5) + (0 − .5)(0 − .5) + (1 − .5)(0 − .5)
= 1/4 + 1/4 + 1/4 − 1/4 = 1/2.

After encoding, the full weight matrix is (.5, 1, −1, −.5, −.5, 0, 0, .5, .5, 0, 0, −.5, −.5, −1, 1, .5), corresponding to the weights w11, w12, . . . , w44, respectively.



At recognition, the state of activation of the old low-frequency item AO is reinstated as a cue to the item layer, and the state of activation of the encoding context of the experimental phase, CExp, is reinstated as a cue to the context layer. The net inputs are then calculated; for example, the net input to context node 1 is the sum of the two weights connecting the active item nodes 1 and 3 to context node 1, .5 − 1.0 = −.5. The net input to the item nodes is (1, 0, 2, 1), and that to the context nodes is (−.5, .5, −.5, .5). It can be seen that the expected net input for randomly created new items in this network is 0 and that the expected net input for old items encoded in the active state is .5. Therefore, the activation threshold is set to the average of these values, namely, T = .25. Nodes that have a net input above the activation threshold and that are active in the cue pattern are activated at recognition. Thus, the state of activation at recognition for the item nodes is (1,0,1,0), and that for the context nodes is (0,1,0,1) (which are identical to the cue patterns). The percentage of active nodes is counted: Pc(AO) = 4/8 = .5. For an unbiased response (C = 0), this will yield a yes response, because this percentage is larger than the expected percentage of active nodes at recognition (a/2 = .25). The standard deviation of the net input for the nodes active at encoding is 0.71 (i.e., the standard deviation of 1, 2, .5, and .5, corresponding to nodes 1, 3, 6, and 8). The recognition strength is calculated by subtracting the expected percentage of active nodes at recognition from the percentage of active nodes for the to-be-recognized item and dividing by the standard deviation of the net input for the active nodes: S = (Pc − a/2)/σh′ = (.5 − .25)/.71 = 0.35.

The recognition of the three items BO, AN, and BN is done in the same way. The results for the four items AN, BN, BO, and AO are the net inputs (1, 0, 2, 1, .5, -.5, .5, -.5), where the first four numbers represent item nodes and the last four context nodes, (1, 0, 2, 1, 1.5, .5, -.5, -1.5), (1, 0, 2, 1, -1.5, -.5, .5, 1.5), and (1, 0, 2, 1, -.5, .5, -.5, .5); the states of activation at recognition (0, 0, 0, 1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 1, 0, 0), (0, 0, 1, 1, 0, 0, 0, 1), and (1, 0, 1, 0, 0, 1, 0, 1); the numbers of nodes active, 1, 2, 3, and 4; the standard deviations of the net inputs, .71, 1.08, 1.08, and .71; the recognition strengths, -.17, 0, .11, and .35; and the unbiased responses, no, no, yes, and yes, respectively (an unbiased yes response is made when S > C, where C = 0 for unbiased responses). Thus, the subject responds correctly to all the items, the recognition strengths are ordered according to the mirror effect, and the variance of the net input is larger for the high- than for the low-frequency words.
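As a check on the final step, the strength computation for item AO can be reproduced directly from the numbers above. Note that matching the reported .71 requires the sample (n - 1) standard deviation, which is an inference from the arithmetic rather than something the text states.

```python
import numpy as np

a = 0.5
# net inputs of the four nodes active in the AO cue (item nodes 1 and 3,
# context nodes 2 and 4), taken from the worked example above
h_active = np.array([1.0, 2.0, 0.5, 0.5])

Pc = 4 / 8                 # all four cue-active nodes exceed T = .25
sd = h_active.std(ddof=1)  # sample standard deviation, about 0.71
S = (Pc - a / 2) / sd      # (.5 - .25)/.71
print(round(sd, 2), round(S, 2))  # 0.71 0.35
```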

SIMULATIONS OF THE FREQUENCY-BASED MIRROR EFFECT

In this section, the variance theory of the mirror effect is simulated in a connectionist framework that is consistent with a general connectionist theory of memory called TECO (Sikström, 1996a, 1996b). TECO has been used to account for a variety of memory phenomena involving recognition and recall, for example, successive testing of recall and recognition (Sikström, 1996b) and forgetting curves (Sikström, 1999). An extensive description of TECO is beyond the scope of this paper; interested readers are referred to previous articles describing the model for details.

Procedure
The simulation started with initializing the weights to zero. Then 12 items were generated by randomly setting the nodes to an active state with a probability of a. A preexperimental phase then followed to generate the frequency associated with the items. In this preexperimental phase, half of the items (i.e., 6) were encoded three times, each time in a different context. These items simulated the high-frequency words. The remaining 6 items were not encoded before the experimental phase, and they simulated the low-frequency words.

At the experimental phase, one study-encoding context was generated, using the same procedure as the generation of the contexts in the preexperimental phase. Given that in a standard recognition experiment all studied items would be encoded in the same list, in the simulations the items were thus encoded once with the same study context. Three of the high-frequency items were encoded, and three of the low-frequency items were encoded. The other three high- and three low-frequency items were not encoded during the experimental phase, and they simulated the new items. Each encoding was simulated by first activating the nodes in the item and context layers. The weights were then changed according to the learning rule (Equation 3).

At the recognition test, the patterns of activation of the 12 items and the study context were reinstated to the network. The net inputs were calculated for each node, and the recognition strength was calculated from all the nodes in the network. The somewhat unrealistic assumption that no learning or changes of weights occurred during testing was adopted. However, this is a standard assumption often used in other simulations of recognition memory, cued recall, or categorization (e.g., Kruschke, 1992; Lewandowsky, 1991; Li & Lindenberger, 1999; Li, Lindenberger, & Frensch, 2000). One thousand five hundred simulation runs, with different random item and context patterns, were carried out. The results reported later are based on the average across these runs.
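The following sketch (Python with NumPy) follows this procedure step by step. The function names, the convention for which cue drives which layer, and the empirical estimate of the activation threshold are my assumptions; the exact net-input normalization of the model is not restated here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, a, C = 30, 0.20, 0.0       # nodes per layer, activity level, criterion

def pattern():
    # random pattern in which each node is active with probability a
    return (rng.random(N) < a).astype(float)

def encode(W, item, ctx):
    # learning rule (Equation 3): dw_ij = (x_i^d - a)(x_j^c - a)
    W += np.outer(item - a, ctx - a)

def net_inputs(W, item, ctx):
    # net inputs to all 2N nodes when the item and context cues are given
    return np.concatenate([W @ ctx, item @ W])

def strength(W, item, ctx, T):
    h = net_inputs(W, item, ctx)
    cue = np.concatenate([item, ctx]) > 0
    Pc = np.mean(cue & (h > T))        # fraction of the 2N nodes active
    return (Pc - a / 2) / h[cue].std(ddof=1)

def run_once():
    W = np.zeros((N, N))
    items = [pattern() for _ in range(12)]
    high, low = items[:6], items[6:]
    for item in high:                  # preexperimental phase: encode each
        for _ in range(3):             # high-frequency item in three
            encode(W, item, pattern()) # different random contexts
    study = pattern()                  # experimental phase: one study
    old = high[:3] + low[:3]           # context; half of each class is old
    for item in old:
        encode(W, item, study)

    def mean_cue_net(its):             # mean net input over cue-active nodes
        return np.mean([net_inputs(W, i, study)[
            np.concatenate([i, study]) > 0].mean() for i in its])

    # activation threshold: midway between new and old expected net inputs
    T = (mean_cue_net(high[3:] + low[3:]) + mean_cue_net(old)) / 2
    return [strength(W, item, study, T) > C for item in items]  # yes/no

responses = run_once()  # averaged over 1,500 runs in the reported results
```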

Parameters
The number of high-frequency patterns was six (each encoded three times preexperimentally, and three of them encoded once experimentally to simulate the old items), and the number of low-frequency patterns was six (three of them encoded experimentally once to simulate the old items). The following parameter settings were used. The number of nodes in each layer (N) was 30 (the total number of nodes was 2N = 60), and the percentage of nodes active at encoding (a) was .20. Initially, all the weights were set to zero. The recognition criterion (C) was set to 0.

Results
Figure 3A shows the results in terms of the density of the net inputs to an active node. The expected values of the net inputs are approximately equal for new high- and new low-frequency items [μh(AN) = 0.0, μh(BN) = 0.0] and for old low- and old high-frequency items [μh(AO) = .38, μh(BO) = .38]. The high-frequency items have a larger variance of the net inputs [σh(BN) = .49, σh(BO) = .48] than do the low-frequency items [σh(AN) = .41, σh(AO) = .40]. The variances of the old and the new distributions are approximately equal.

Figure 3B shows the density of the recognition strength. The results show the mirror effect, where the hit rate probability is larger for low- than for high-frequency items and the false alarm rate is lower for low- than for high-frequency items [P(AN) = .18, P(BN) = .25, P(BO) = .64, P(AO) = .71]. The standard deviation of the recognition strength is larger for old than for new words and larger for the low-frequency words than for the high-frequency words [σs(AN) = .029, σs(BN) = .019, σs(BO) = .023, σs(AO) = .031]. These findings agree with the empirical data and the predictions of the attention-likelihood model (Glanzer et al., 1993). Thus, the simulation shows that the variance theory can account for the frequency-based mirror effect and the associated ROC curves.

EXPLICATIONS OF THE MECHANISMS

In this section, three essential mechanisms of the variance theory that account for the mirror effect are discussed in detail. The mechanisms involved are (1) the variance of the net input, (2) the expected value of the net input, and (3) the way in which the recognition strength is calculated, by counting the number of active nodes, so that the performance is optimal.

Overview
The first mechanism is that items from the difficult class (i.e., high-frequency items in this case) have a higher variance of the net inputs than do items from the easy class (i.e., low-frequency items), but the variance is independent of whether the items are old or new. The mechanism is illustrated in Figure 4A (noncumulative). It should also be underscored that the difference in variance as a function of class membership is not a primitive of the theory that the theorist manipulates. It is the natural consequence of the rather plausible assumption that high-frequency items appear more often and are associated with a greater variety of different contexts than are low-frequency words. The second mechanism is that old items have a higher expected net input to nodes encoded in the active state than do new items (but this does not depend on the class of the items). The mechanism is illustrated in Figure 4B (cumulative). The third mechanism is the way recognition strength is measured, by counting the active nodes so that recognition performance (e.g., d′) is optimal or near-optimal. An extended analysis of optimality is presented at the end of the paper; however, the results in Figures 9A-9D can summarize the main points here. The results from this analysis imply that the decision must be based on the nodes active at encoding (see Figure 9A), the percentage of active nodes must be low (see Figure 9A), the activation threshold needs to be between the expected net inputs of the new and old items (see Figure 9B), and the recognition strength is normalized by the standard deviation of the net inputs of the item (see Figures 9C and 9D). The density of the percentage of active nodes is illustrated in Figure 4C, and the normalized percentage of active nodes is illustrated in Figure 4D. Here it is shown how these mechanisms predict the mirror effect. Below, these three sets of mechanisms are explained in more detail.

Variance of the Net Input
The first, and perhaps the most important, mechanism is that the difficult class has a larger variability in the net inputs than does the easy class. As will be discussed later, this is shared with other theories, such as the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). However, a unique aspect of the variance theory is that it is a necessary outcome of simply encoding high-frequency items more times than low-frequency items. In the subjective-likelihood account, a larger variability for high- than for low-frequency words is an assumption that is controlled by a free parameter (p0), to reflect that high-frequency words have more definitions than do low-frequency words. The variance theory needs no such assumptions and no additional free parameters to control the variability. Whereas the subjective-likelihood account tries to capture word frequency with a free parameter, in the variance theory the word-frequency effect is simulated via the number of contexts associated with an item.

Figure 3. (A) Simulations of net inputs. The vertical axis shows the simulated density of the net inputs. (B) Simulations of the mirror effect. The vertical axis shows the simulated density of the recognition strength.

Owing to how the variance theory implements the relations between contexts and items, the variance of the net inputs increases with the frequency with which an item is encoded in different contexts. Intuitively, this occurs because a high-frequency item activates several different contexts, causing the representations of many competing contexts to be activated simultaneously. Low-frequency items are associated with fewer contexts than are high-frequency items. This causes the representations of fewer contexts to be activated simultaneously. Thus, low-frequency items have less variability in the net inputs than do high-frequency items.

At recognition, an item produces a net input in the context layer that is a mixture of the net inputs from the study context that the network is instructed to retrieve from and from several uncorrelated preexperimental contexts that were associated with the item during the preexperimental phase. The study context that the network is instructed to retrieve from contributes to the correct active state, for example, by adding +1 to the net input of a node (which is the expected net input for a node encoded in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was in an active state, or it could add -1 to the net input if the particular preexperimental context was encoded in an inactive state (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input. That is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and -1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1² for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for the items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or -1/4 [see Equation 3], each yielding an increase in variability of (1/4)² = 1/16. An expected value of aN nodes contribute, so the total increase in variability is 4 · 1/16 = 1/4.)

Figure 4. (A) The probability density of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The dotted vertical line is the activation threshold. (B) The cumulative probability distributions of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the cumulative probability distributions of the net inputs. (C) The density of the percentage of nodes active at recognition in the variance theory. The horizontal axis shows the percentage of nodes active at recognition, and the vertical axis the density. (D) The density of recognition strength in the variance theory. The horizontal axis shows the recognition strength, and the vertical axis the density of the recognition strength, using standard parameter values.

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in context nodes increases linearly with the number of times a given item is encoded in different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.
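This linear growth is easy to verify numerically. The sketch below (my construction, not the Appendix derivation) encodes an item once in each of k independent random contexts and measures the variance of the net inputs that the item cue produces in the context layer.

```python
import numpy as np

rng = np.random.default_rng(1)
N, a = 30, 0.2

def net_input_variance(k, runs=2000):
    # variance of context-node net inputs for an item of "frequency" k,
    # i.e., an item encoded once in each of k different contexts
    values = []
    for _ in range(runs):
        item = (rng.random(N) < a).astype(float)
        W = np.zeros((N, N))
        for _ in range(k):
            ctx = (rng.random(N) < a).astype(float)
            W += np.outer(item - a, ctx - a)
        values.extend(item @ W)   # net inputs to the context nodes
    return np.var(values)

for k in (1, 2, 4, 8):
    print(k, round(net_input_variance(k), 3))
# the printed variance grows approximately linearly with k
```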

In the same way as the variability of the context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with the number of times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input
The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase are equal for the two classes. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given exactly the same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are only interested in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables, such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to smaller increases in the net input, than does a longer period of study time, for example, 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encodings in other learning contexts will not affect the expected net input, but they do affect the variability of the net input, as was demonstrated above. The item-study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, as a difficult class item) is equal to the expected net input for ARCTIC (a low-frequency word, as an easy class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and the testing experimental conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult class item is equal to the expected net input for a new easy class item.

The probability density functions of the net inputs for nodes in the active state are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted. New nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of the net inputs for the difficult class items [σh(B)] is larger than the standard deviation of the net inputs for the easy class items [σh(A)]. The second mechanism is shown in the figures in that the expected net input of an easy class new item [μh(AN)] is equal to the expected net input of a difficult class new item [μh(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A. The expected net input for the old easy class items [μh(AO)] is equal to the expected net input for the old difficult class items [μh(BO)].

Recognition Strength
The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that the performance is optimal or near-optimal. If the net input is above the activation threshold (T) and the node was active at encoding, the node is activated at recognition. Otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities or distributions predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of the parameter values for the activation threshold. Thus, P(AN) < P(BN) < P(BO) < P(AO) for μh(AN) = μh(BN) < T < μh(AO) = μh(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy and the difficult class items. The expected strengths of the net inputs are equal. The variability is lower for easy class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy than for the difficult class items when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy class items is higher than the hit rate for difficult class items when the activation threshold is less than the expected value of the old items.

The activation threshold is set to be between the expected values of the new and the old net inputs, so that the performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for difficult and easy class items, respectively:

T = 1/4 [μh(AN) + μh(BN) + μh(BO) + μh(AO)] = 1/4 [μh(BO) + μh(AO)] = 1/2 μh(O).

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions to optimize the performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with a shift in the activation threshold necessary for keeping the performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test in order to maximize performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [μh(AN) = μh(BN) < μh(BO) = μh(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σp(O) > σp(N)], whereas the standard deviations of the net inputs are equal for old and new items [σh(N) = σh(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions, because there are fewer nodes active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero. This standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes (σp²) is approximately proportional to the percentage of active nodes [σp² = Pc(1 - Pc)/N ≈ Pc/N ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σp) is smaller for new than for old items [σp(AN) < σp(BN) < σp(BO) < σp(AO)], whereas the standard deviation of the net input is not [σh(AN) = σh(AO) < σh(BN) = σh(BO)]. The essential mechanism that produces these changes in the means and standard deviations is the nonlinearity introduced by the counting of the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.
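The relation plotted in Figure 5 is the binomial standard deviation of a proportion; a brief numerical check, assuming the count is taken over all 2N = 100 nodes, is given below.

```python
import numpy as np

n = 100                      # total number of nodes, 2N, as in Figure 5
Pc = np.linspace(0, 1, 11)   # expected probability that a node is active
sigma_p = np.sqrt(Pc * (1 - Pc) / n)
print(np.round(sigma_p, 3))  # zero at Pc = 0 and 1, maximal at Pc = .5
```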

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because it improves the performance.

To optimize the performance, subjects base their recognition decision on the number of active nodes, normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments when the responses are biased but plays no role for unbiased responses.



To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σh′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σh′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviations of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, where C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) influences the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) influences the decision to be more certain.

Figure 4D shows the density distribution of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult class (in Figure 4C) change when they are normalized by the variance of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high. The easy class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution that is presented at the end of the paper, together with an analysis of optimal performance.

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.

The Material-Based Mirror Effect for High- and Low-Frequency Items
The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a) plus one parameter for each class of words [the standard deviation of the net input, σh(·)]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs for the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs for the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = -0.012, μs(BN) = -0.008, μs(BO) = 0.008, μs(AO) = 0.012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists
The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize the performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep, or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize the performance. The distributions in Figure 6A are closer than the distributions in Figure 4D. Thus, decreasing the net inputs, for example, by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite directions, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain a near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list. In this case, the activation threshold is also constant during the recognition test, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed at .4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (.5, 1, and 2), or for any value above .4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

Table 1
General Table of Results From the Experiment

Condition    AN     BN     BO     AO     σs(BN)/σs(AN)   σs(BO)/σs(AO)
Control      .13    .17    .69    .82        0.60            0.86
Frequency    .20    .28    .80    .68        1.01            0.66
Time         .10    .15    .78    .76        0.89            0.81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low-frequency (AN) and high-frequency (BN) words, the hit rates for high-frequency (BO) and low-frequency (AO) words, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength, and the vertical axes the density.

List-Length Effect
Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy and difficult class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves
The percentage of nodes active at recognition is smaller for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the difficult class than for the easy class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths themselves, become equal. Figure 6D plots the predictions of the theory when all the standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020. The ratios of these standard deviations must follow Equation 2. This is also the case, with σs(BN)/σs(AO) = 0.61 < 0.74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = 0.78 < 0.94 = σs(AN)/σs(BO).

Changing the Recognition Criterion
The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = 0.016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy class words than for difficult class words. This prediction is supported by empirical data. Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN); P(BO, AN) < P(AO, AN); P(BN, AN) > .50; and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.

Within-List Strength Manipulation
So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage for old high-frequency items when attention is focused on these items) must be larger than the effect of the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage for old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items. However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of the net inputs (Figure 7A) that the density of the recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes .75 rather than .50. Here, the probability that a node is active at recognition follows from the distribution of the net inputs:

Pc = a ∫(T..∞) [1/(√(2π) σh)] e^(-(h - μh)²/(2σh²)) dh.

The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with the standard case in Figure 4D: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74].

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the old high-frequency distribution (BO) is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs do not depend on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = 0.013, σs(BN) = 0.011, σs(BO) = 0.017, and σs(AO) = 0.019. These results are similar to the results when the same amount of attention is paid to the two classes, in Figure 4D: σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect in which the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate distribution for the high-frequency words would be larger than that for the low-frequency words. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated in an experiment. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation time was manipulated and one in which the presentation frequency was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method
Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20, ranging from 18 to 29 years old.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remainder for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether or not the word had been presented in the list. The subjects were also required to rate their confidence in their responses, on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition. Nine subjects were given the control condition followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results
The results from the experiment are presented in the first three rows of Table 1. The probability for hit rates was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the probability for hit rates for the low-frequency words was larger. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = .004, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = .004, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = .003, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was only significantly larger in the presentation frequency condition [t(11) = 1.8, MSe = .003, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = .001, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = .002, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1-5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly the slopes for the false alarms [σs(BN)/σs(AN)], are shown in the last two rows of Table 1.
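A sketch of this slope computation is given below (Python; scipy's norm.ppf supplies the z-transform). The ratings are hypothetical placeholders, not the experiment's data.

```python
import numpy as np
from scipy.stats import norm

def z_rates(ratings, levels=(2, 3, 4, 5)):
    # z-transformed cumulative yes rates: at each confidence criterion k,
    # a yes is scored if the rating is k or above
    return np.array([norm.ppf(np.mean(ratings >= k)) for k in levels])

def zroc_slope(ratings_a, ratings_b):
    # slope of the regression of z(ratings_b) on z(ratings_a), e.g., the
    # low-frequency hit rates as a function of the high-frequency ones
    return np.polyfit(z_rates(ratings_a), z_rates(ratings_b), 1)[0]

# hypothetical confidence ratings (1-5) for old high- and low-frequency words
high_old = np.array([5, 4, 2, 3, 5, 1, 4, 3, 2, 5])
low_old = np.array([5, 5, 3, 4, 5, 1, 4, 4, 2, 5])
print(zroc_slope(high_old, low_old))  # estimates sigma_s(BO)/sigma_s(AO)
```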

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items would be higher than that for low-frequency words at encoding but lower at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is also fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory, by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

Σ(i=1..8) (Observed_i - Predicted_i)² / σi².

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were N = 1000 and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy class words [σh(A)], and the standard deviation of the net inputs for the difficult class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.
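A sketch of this fitting procedure is given below. The observed values are illustrative placeholders rather than Glanzer et al.'s (1993) data, and predict stands for a function returning the variance theory's eight predicted statistics for a candidate parameter vector.

```python
import numpy as np
from scipy.optimize import minimize

# the 8 fitted statistics: 4 yes-response probabilities and 4 z-ROC slope
# ratios; the values below are illustrative placeholders
observed = np.array([0.18, 0.25, 0.64, 0.71, 0.60, 0.74, 0.78, 0.94])
sigma = np.ones(8)   # empirical SDs were not reported, so set to one

def loss(theta, predict):
    # sum of squared errors divided by the variance, as in the text
    return np.sum((observed - predict(theta)) ** 2 / sigma ** 2)

# with predict(theta) implementing the theory's predictions for the
# parameters theta = (a, sigma_h(A), sigma_h(B)):
# fit = minimize(loss, x0=[0.1, 1.25, 1.56], args=(predict,),
#                method="Nelder-Mead")
```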

The results show that the variance theory accounts for 98% (r²) of the variance in the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance in the slopes; the attention-likelihood theory accounts for 84% of the variance in the slopes. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slopes, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σ_h(B)/σ_h(A)] may also be conceived of as constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted with a single variable, namely the standard deviation of the net inputs for the easy class [σ_h(A)]. The activity level was fixed to .10 and the ratio of the standard deviations of the net inputs, σ_h(B)/σ_h(A), to 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


Figure 8B shows the corresponding results for the slope. The accounted variance is 0.96 for the probabilities and 0.85 for the slopes. Thus, the variance theory fits the slopes with a single parameter as well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities with one parameter is slightly worse than the fit of the attention-likelihood theory with three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes fit better in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely the standard deviation of the net inputs from the easy class [σ_h(A)] and the standard deviation of the net inputs from the difficult class [σ_h(B)], can also be expressed as the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μ_h); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μ_h(AN) = μ_h(BN) and μ_h(AO) = μ_h(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σ_h(A) and σ_h(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (P_c) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely σ_h(A), σ_h(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (P_c) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μ_h) and the standard deviation of the net inputs (σ_h). This probability is dependent on the distribution of the net inputs, which can be approximated with a normal distribution. P_c is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (P_c) that a node is active at recognition is

$$P_c = a \int_T^{\infty} \frac{1}{\sqrt{2\pi\sigma_h^2}}\, e^{-(h-\mu_h)^2/(2\sigma_h^2)}\, dh. \tag{6}$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes and dividing by the standard deviation of the net inputs (σ_h) yields the expected recognition strength (μ_s):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σ_h) as an approximation of the standard deviation of the item (σ_h′), because it simplifies the analytic solution; however, the variance theory, or the simulation, uses the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σ_s) is calculated from σ_h, P_c, and N. There are 2N nodes in the context and the item layers. The distribution of P_c is binomial but can, for a certain criterion [i.e., 2N·P_c(1 − P_c) > 10], be approximated with a normal distribution with a standard deviation of [P_c(1 − P_c)/(2N)]^{1/2}. The final result is scaled by the normalization factor 1/σ_h:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c\left(1-P_c\right)}{2N}\right]^{1/2}. \tag{7}$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

$$P(\mathrm{Y}) = \int_C^{\infty} \frac{1}{\sqrt{2\pi\sigma_s^2}}\, e^{-(s-\mu_s)^2/(2\sigma_s^2)}\, ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A,B)] is calculated from the expected recognition strengths of A [μ_s(A)] and B [μ_s(B)] and the standard deviations of the recognition strengths of A [σ_s(A)] and B [σ_s(B)]:

$$P(\mathrm{A,B}) = \int_0^{\infty} \frac{1}{\sqrt{2\pi\left[\sigma_s^2(\mathrm{A})+\sigma_s^2(\mathrm{B})\right]}}\, \exp\!\left(-\frac{\left\{s-\left[\mu_s(\mathrm{A})-\mu_s(\mathrm{B})\right]\right\}^2}{2\left[\sigma_s^2(\mathrm{A})+\sigma_s^2(\mathrm{B})\right]}\right) ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).

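To make this chain of equations concrete, here is a minimal Python sketch of the analytic solution, covering Equation 6, the expected strength μ_s, Equation 7, P(Y), and the forced-choice probability. The parameter values (a, N, C, T, and the net-input means and standard deviations) are illustrative assumptions, not values taken from the paper.

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_active(a, T, mu_h, sigma_h):
    """Equation 6: probability that a node active at encoding exceeds
    the activation threshold T at recognition."""
    return a * (1.0 - Phi((T - mu_h) / sigma_h))

def strength_mean(P_c, a, sigma_h):
    """Expected recognition strength, (P_c - a/2) / sigma_h."""
    return (P_c - a / 2.0) / sigma_h

def strength_sd(P_c, sigma_h, N):
    """Equation 7: standard deviation of the recognition strength."""
    return sqrt(P_c * (1.0 - P_c) / (2.0 * N)) / sigma_h

def p_yes(mu_s, sigma_s, C):
    """Probability of a yes response, P(strength > criterion C)."""
    return 1.0 - Phi((C - mu_s) / sigma_s)

def p_forced_choice(mu_A, sd_A, mu_B, sd_B):
    """Two-alternative forced choice, P(strength(A) > strength(B))."""
    return Phi((mu_A - mu_B) / sqrt(sd_A ** 2 + sd_B ** 2))

# Illustrative parameters (assumed, not from the paper):
a, N, C, T = 0.10, 50, 0.0, 0.5
for label, mu_h in (("new", 0.0), ("old", 1.0)):
    P_c = p_active(a, T, mu_h, sigma_h=1.0)
    mu_s = strength_mean(P_c, a, sigma_h=1.0)
    sd_s = strength_sd(P_c, sigma_h=1.0, N=N)
    print("%s: P(yes) = %.3f" % (label, p_yes(mu_s, sd_s, C)))
```

With these assumed values the sketch already produces a lower yes rate for new than for old items; the point is only to show how the four quantities feed into one another.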

Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is calculated in the model with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (P_c) is assumed to be low. By approximating 1 − P_c to one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σ_s² ∝ P_c). This approximation suggests a very simple interpretation of the slope of the z-ROC. The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations:

$$\frac{\sigma_s^2(\mathrm{N})}{\sigma_s^2(\mathrm{O})} \approx \frac{P_c(\mathrm{N})}{P_c(\mathrm{O})}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is .8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is .64 (= .80²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [P_c(AN) ≈ P_c(BN) and P_c(BO) ≈ P_c(AO)]. Given these approximations and the approximation above (1 − P_c ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions:

$$\frac{\sigma_s(\mathrm{BO})}{\sigma_s(\mathrm{AO})} \le \frac{\sigma_s(\mathrm{BN})}{\sigma_s(\mathrm{AN})} \approx \frac{\sigma_h(\mathrm{A})}{\sigma_h(\mathrm{B})}.$$

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of nodes active, setting the activation threshold between old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that an optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded and the network would not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

$$d' = \frac{\mu_s(\mathrm{O}) - \mu_s(\mathrm{N})}{\sigma_s(\mathrm{NO})} = \frac{P_c(\mathrm{O}) - P_c(\mathrm{N})}{\left[P_c(\mathrm{NO})\left(1 - P_c(\mathrm{NO})\right)/(2N)\right]^{1/2}}. \tag{8}$$

Because σ_s(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σ_s(NO), in the denominator of this equation. Thus, P_c(NO) is equal to [P_c(N) + P_c(O)]/2. P_c(·) was calculated with Equation 6. The expected value of the net inputs and the standard deviation of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).

The results show that d′ is optimal for a = .052; d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases: for an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important, and errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. Surprisingly, however, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near but slightly lower than the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight changes, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when old correct responses and new correct responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true: if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers; if the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).

In the mirror effect there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. An optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: Optimal performance occurs when the maximum likelihoods of the two classes are equal,

$$L(\mathrm{A}) = \frac{f[S(\mathrm{AO})]}{f[S(\mathrm{AN})]} = L(\mathrm{B}) = \frac{f[S(\mathrm{BO})]}{f[S(\mathrm{BN})]}.$$

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition threshold for Class A is diminished, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the total false alarm rate must cancel, Δf [S(AN)] + Δf [S(BN)] = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), or Δf [S(AO)] + Δf [S(BO)] < 0. This shows that the change in the placement of the criteria from L(A) = L(B) results in an overall decrease in hit rate (Δf [S(AO)] + Δf [S(BO)] < 0), and performance suffers.

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can indirectly be inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently between the two classes of words.
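The equal-likelihood-ratio rule can be illustrated numerically. The sketch below uses generic normal strength distributions (the d′ values of 2.0 and 1.0 for the two classes are assumed for illustration, not taken from the model), splits a fixed total false alarm budget between the classes, and shows that total hits peak where L(A)/L(B) is approximately one.

```python
from math import erf, exp, sqrt, pi

Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))   # normal CDF
phi = lambda x: exp(-x * x / 2.0) / sqrt(2.0 * pi) # normal density

# Illustrative strength distributions: new ~ N(0,1) for both classes;
# old ~ N(2,1) for the easy class A and N(1,1) for the difficult class B.
d_A, d_B = 2.0, 1.0
total_fa = 0.30  # fixed total false alarm rate P(AN) + P(BN)

def inv_Phi(p, lo=-8.0, hi=8.0):
    """Inverse normal CDF by bisection (sufficient for a demo)."""
    for _ in range(80):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

best = None
for i in range(1, 100):
    fa_A = total_fa * i / 100.0            # Class A's share of the FA budget
    cA = inv_Phi(1.0 - fa_A)               # criterion for Class A
    cB = inv_Phi(1.0 - (total_fa - fa_A))  # criterion for Class B
    hits = (1.0 - Phi(cA - d_A)) + (1.0 - Phi(cB - d_B))
    LA = phi(cA - d_A) / phi(cA)           # likelihood ratio at the criterion
    LB = phi(cB - d_B) / phi(cB)
    if best is None or hits > best[0]:
        best = (hits, fa_A, LA / LB)
print("hits maximized at P(AN) = %.2f, where L(A)/L(B) = %.2f" % best[1:])
```

At the hit-maximizing split the printed ratio is close to one, which is exactly the condition derived above.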

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Classes A and B, and the total false alarm rate is equal to the average false alarm rate in Classes A and B.

For all recognition criteria, or for all false alarm rates, performance is better or equal after normalization, as compared with performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., P_c > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because P_c(1 − P_c)/N decreases for P_c over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: the only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution.


This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σ_s(BN) < σ_s(BO) < σ_s(AN) < σ_s(AO) and σ_s(BN) < σ_s(AN) < σ_s(BO) < σ_s(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σ_s(BO) < σ_s(BN) < σ_s(AN) < σ_s(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental, learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune from this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found for abstract and concrete words, where words from the concrete class are more easily discriminated than words from the abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length or the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σ_h′), so the subject does not need to know the class or the standard deviation of the class (σ_h). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution was presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses a variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network; however, see McClelland and Chappell (1998) for a brief discussion of this topic. An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.
Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.
Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.
Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Glanzer, M., & Bowles, N. (1976). Analysis of the word frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.
Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.
Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.
Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.
Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.
Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neurosciences of memory (pp. 103-146). Seattle: Hogrefe & Huber.
Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.
McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.
Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.
Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.
Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.
Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.
Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå (ISBN 91-7191-155-3).
Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.
Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.
Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.
Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σ_p(N)] and old [σ_p(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

$$\frac{a\,\sigma_p(\mathrm{N})}{\sigma_p(\mathrm{N}) + \sigma_p(\mathrm{O})}.$$

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around .44a if the slope of the ROC curve is .8. It is actually more precise to use this value. In this paper the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξ_i = ξ_j = 1:

$$\mu_h(\mathrm{O}) = \sum_{j=1}^{N} \Delta w_{ij}\,\xi_j = \sum_{j=1}^{N} (\xi_i - a)(\xi_j - a)\,\xi_j = aN(1-a)^2. \tag{A1}$$

The expected value of the net inputs for the new items is zero:

$$\mu_h(\mathrm{N}) = 0. \tag{A2}$$

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

$$\sigma_h^2(\mathrm{N}) = aNP\left\{\left[(1-a)(1-a)\right]^2 a^2 + \left[(1-a)(-a)\right]^2 2a(1-a) + \left[(-a)(-a)\right]^2 (1-a)^2\right\} = a^3(1-a)^2 PN. \tag{A3}$$

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

$$\sigma_h^2(f) = f\,a^3(1-a)^3 N^2. \tag{A4}$$

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

$$\sigma_h^2(L) = L\,a^3(1-a)^3 N^2.$$

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where p patterns have been encoded, is

$$\sigma_h^2(\mathrm{O}) = \frac{(f + L)\,a^3(1-a)^3 N^2}{2} + a^3(1-a)^2 pN. \tag{A5}$$

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where P_c(O) and P_c(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)



that it also uses vector-feature representations of the items and estimates (via simulations) the response probabilities of old (target) and new (lure) items during a recognition test. However, the variance theory is driven by different conceptual and technical considerations. At the conceptual level, the variance theory sets out to capture the mirror effect mainly in terms of the relations between the study materials and the natural preexperimental context associations the items may have. This is conceptually quite different from all previous theories, which seek to explain mirror effects primarily in terms of the individual's decision process. Rather, the approach taken here considers the context in which the individual recognition decision processes take place. The natural frequencies of events occurring in the individual's day-to-day contexts may be reflected in recognition decision processes, and the individuals may or may not know (or be consciously aware of) these processes. At the technical level, the variance theory also differs from previous theories in a significant way. Instead of directly computing ratios between probabilities, a new way of computing recognition strength is proposed: normalizing the difference between the response probabilities for the target and the lure items with the standard deviations of the underlying response distributions.

Specifically, in dealing with the frequency-based mirror effect, a rather plausible key assumption of the variance theory is that high-frequency words are assumed to have appeared in more naturally occurring preexperimental contexts than have the low-frequency words. This assumption is implemented in connectionist network simulations in a rather straightforward way, by associating the simulated high-frequency items with more contexts than the low-frequency items during a simulated preexperimental phase. In implementing the theory, items and contexts are represented by two separate arrays (vectors) of binary features, with each feature being represented by a node (or element of the vector), as is shown in Figure 2. A particular item, such as CAR, activates some features at the item layer. A particular context, such as REPAIR SHOP, activates some features at the context layer. Different items and contexts may or may not activate some of the same features. The item and context features are fully interconnected with weights. When an item appears in a particular context and certain features are activated, the weights that reciprocally connect the item features to the context features are adaptively changed according to a specific learning rule, described later. Note that, in the implementation, two types of contexts, namely the preexperimental contexts and the study context (representing internally generated time or list information associated with an item during the actual experimental episode), are represented in the network using one common context layer. But these two types of context information are differentiated by two simulation phases, namely the preexperimental phase and the actual encoding and testing phase. As will be mathematically outlined later, the standard deviation of the input at the context layer increases when an item is associated with several contexts.

Therefore, high-frequency items (associated with more preexperimental contexts) will have larger standard deviations than will low-frequency items in their activation patterns, which are subsequently propagated to the item layer. However, the expected value of the input is equal for high- and low-frequency items.
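This claim is easy to check with a toy simulation. The sketch below uses a generic Hebbian rule with mean-subtracted activations as a stand-in for the learning rule the paper specifies later, and all sizes are illustrative assumptions.

```python
import random

N, a = 200, 0.1   # nodes per layer and activity level (illustrative)

def item_pattern():
    """Random sparse binary item pattern."""
    return [1 if random.random() < a else 0 for _ in range(N)]

def net_input_stats(f, trials=500):
    """Encode one item in f random preexperimental contexts with a
    Hebbian rule dw = (context - a)(item - a); return the mean and
    variance, over simulated items, of the net input the item then
    sends to a single context node."""
    samples = []
    for _ in range(trials):
        item = item_pattern()
        w = [0.0] * N   # weights from the item nodes to one context node
        for _ in range(f):
            ctx = 1 if random.random() < a else 0  # node state in that context
            for j in range(N):
                w[j] += (ctx - a) * (item[j] - a)
        samples.append(sum(w[j] * item[j] for j in range(N)))
    m = sum(samples) / trials
    v = sum((s - m) ** 2 for s in samples) / trials
    return round(m, 2), round(v, 2)

for f in (1, 4, 16):   # item frequency = number of preexperimental contexts
    print(f, net_input_stats(f))
# The mean stays near 0 for every f; the variance grows roughly linearly with f.
```

Running this shows exactly the asymmetry the theory builds on: frequency inflates the spread of the net input while leaving its expected value unchanged.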

During the recognition test, an item vector is presented to reinstate the activation of the item features. The features of the study context are reinstated by presenting the network with the context pattern encoded during the study-encoding phase (but not other preexperimental contexts that the network was trained with during the preexperimental phase). The degree of recognition strength is determined by first calculating the net inputs to the context and the item nodes. The net input is the signal a node receives from other active nodes connecting to it, and the strength of the signal determines whether the node will be activated or not at retrieval. The net input of a given item node is simply computed as the sum of all weights connected to active nodes. Similarly, the net input of a given context node is simply the sum of all weights connecting active nodes to that particular context node. The net inputs then denote the retrieved state of activation patterns at the item and context layers. The subset of item and context nodes that have activation levels exceeding a particular activation threshold at retrieval, and that were also active during encoding, are then used to calculate the recognition strength. Those nodes whose activation does not exceed the threshold, or that were inactive during encoding, have no influence on recognition strength. For example, assume that the activation threshold is set to 0.5, so that any node (item or context) that was active during encoding and whose retrieved activation during testing exceeded the value of 0.5 would contribute to recognition strength. Imagine that four nodes out of a total of eight exceed the threshold and are equal to 0.75, 1.00, 1.25, and 1.50. The recognition strength of the item is the percentage of above-threshold nodes (50%)

Figure 2. The variance theory. The upper four circles represent nodes in the item layer. The lower four circles represent nodes in the context layer. The arrows between the item and the context layers represent connections.


minus the expected percentage of above-threshold nodes (e.g., 25%), divided by the standard deviation of the actually observed above-threshold nodes (i.e., by the standard deviation of 0.75, 1.00, 1.25, and 1.50).
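To make the arithmetic concrete, here is a minimal Python sketch of this computation, using the numbers from the example (the function name is ours, and, since the text does not specify whether the sample or the population standard deviation is meant, the sample estimate is assumed):

```python
import numpy as np

def recognition_strength(above_threshold, n_total, expected_pct):
    """Percentage of above-threshold nodes, minus the expected percentage,
    divided by the standard deviation of the above-threshold activations."""
    observed_pct = len(above_threshold) / n_total
    return (observed_pct - expected_pct) / np.std(above_threshold, ddof=1)

# Four of eight nodes exceed the 0.5 activation threshold.
print(recognition_strength([0.75, 1.00, 1.25, 1.50], 8, 0.25))  # about 0.77
```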

Why is recognition strength determined by this rule, as opposed to, say, just the percentage of above-threshold nodes? As will be shown later, this way of measuring recognition strength (subtracting the expected value and dividing by the standard deviation of the net input) allows the model to perform optimally in terms of discriminability between new and old items when the response bias to the yes response is manipulated. And, in this case, the model accounts for why the standard deviation of the response distribution of the easy class is larger than that of the difficult class. It is plausible to assume that humans have evolved to generally respond more or less optimally and that this is reflected in their performance, as well as in the implementation of the variance theory. Similarly, the activation threshold is set in the model to the value that leads to the highest d′ (i.e., to optimal performance), which occurs when the activation threshold is between the new and the old distributions. This optimal tuning of the model allows it to account for some rather dramatic results, such as concentering, showing that target and lure distributions from different item classes converge on a strength-of-evidence axis as memory conditions worsen.

Here is a brief example of how the model performs. Consider hypothetical levels of activation generated by high-frequency and low-frequency old (target) and new (lure) items. Because high-frequency words have appeared in a larger number of contexts, they have a larger variance of net input. As such, targets and lures will be relatively more confusable and will generate percentages of activated nodes that are difficult to discriminate. Assume that the standard deviation of the net input is .10 and the relevant proportions are .25 for high-frequency targets and .15 for high-frequency lures. In contrast, low-frequency words will have occurred in fewer contexts and will be less confusable. Assume that the standard deviation of net inputs is less for low-frequency words, for example, .05, and that the percentage of active nodes is .30 for low-frequency targets and .10 for low-frequency lures. Given these values, what are the recognition strengths for high-frequency and low-frequency targets and lures? If the expected proportion of above-threshold nodes is .20, they are:

Low-frequency lures: (.10 - .20)/.05 = -2.0
High-frequency lures: (.15 - .20)/.10 = -0.5
High-frequency targets: (.25 - .20)/.10 = 0.5
Low-frequency targets: (.30 - .20)/.05 = 2.0

The model accounts for the various aspects of memory phenomena by postulating a connectionist neural network model with an implementation and parameter settings that allow it to perform at optimal or near-optimal levels. When the model is optimized, it behaves similarly to how subjects behave, and when it is not optimized, it does not fit the experimental data. This is true not only for the standard mirror effect, but also for exceptions, such as the absence of a mirror effect for within-list strength manipulations (something all other competing formal models fail to do). Furthermore, it predicts key features of the ROCs for new and old items, as well as for high- and low-frequency items (something any viable formal model must do).

Presentation of the Variance Theory
In this section, the details of the variance theory are presented. As will become clearer as the theory is unfolded, the theory is analytical, and the analytical solutions are self-contained and solvable (readers who are interested can find the analytical solutions in the sixth section). Although the theory itself does not require a particular computational framework, it can be more easily explained and directly implemented by using a connectionist network. Therefore, the presentation of the theory below is couched within the framework of a Hopfield neural network (Hopfield, 1982, 1984), in order to explicate the theory's underlying mechanisms that generate the mirror effect.

Network architecture. The variance theory may be implemented as a two-layer recurrent distributed neural network (see Figure 2). There are two layers in the representation: one is the item layer, and the other the context layer. Both layers consist of N number of nodes (i.e., N features), although the theory could also be implemented with an unequal number of context and item nodes. Thus, the total number of nodes in the network is 2N. The item and the context layers are fully connected, so that all the nodes in one layer are connected through weights to all the nodes in the other layer. Nodes within a layer are not connected (i.e., no lateral connections).

Stimulus and context representation. Contexts and items are represented as binary activation patterns across the nodes in the context and item layers, respectively. A node is active (or activated) if its corresponding feature is present (i.e., of value 1), and a node is inactive (or not activated) if its corresponding feature is absent (i.e., of value 0). There are several reasons for choosing binary representations. For instance, binary representation serves to deblur noisy information at retrieval (Hopfield, 1984). Binary representation also allows for a low proportion of active nodes (sparse representation), which is known to improve performance (Okada, 1996). It also introduces nonlinearity, which is necessary for solving some problems in multilayer networks (Rumelhart, Hinton, & Williams, 1986), and it is perhaps the simplest representation. Furthermore, in the present model, it is shown that it is necessary for capturing characteristics of the z-ROC curves that are associated with the mirror effect.

More specifically, the state of activation for the ith node in the item layer at encoding is denoted x_i^d, where the superscript d denotes the item layer. The state of activation for the jth node in the context layer at encoding is denoted x_j^c, where the superscript c denotes the context layer. Context patterns and item patterns are generated by randomly setting nodes to an active state (i.e., with values of 1) and otherwise to an inactive state (i.e., with values of 0). Let a be a parameter that determines the expected probability that a node is active at encoding. This parameter does not change during the simulation and is assumed to be relatively small (for purposes of sparse representation). Note that a is the expected probability of active nodes, whereas the real percentage of active nodes for specific items or contexts varies around a.
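As a running illustration, this representation scheme can be sketched in a few lines of Python with numpy; the parameter values follow the simulations reported below (30 nodes per layer, a = .20), and all names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

N = 30   # number of nodes per layer
a = 0.2  # expected probability that a node is active at encoding

def random_pattern(n_nodes, activity):
    """Binary feature vector: each node is active (1) with probability a."""
    return (rng.random(n_nodes) < activity).astype(int)

item = random_pattern(N, a)     # e.g., the item CAR
context = random_pattern(N, a)  # e.g., the context REPAIR SHOP
```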

The encoding-study phase. Encoding occurs by changing the connection weights between the item and the context layers. The weights (or the strengths of the connections) contain information about what has been stored in the network. The weight between item node i and context node j is denoted as w_ij, and it is initialized to zero. The weight change (Δw_ij) is calculated by the learning rule suggested by Hopfield (1982; see also Hertz, Krogh, & Palmer, 1991, for additional details). This is essentially a Hebbian learning rule that increases connection weights between simultaneously activated nodes. This rule is chosen here because it is more biologically plausible than other rules, such as the delta or the gradient-descent learning rules (e.g., Rumelhart et al., 1986) used in back-propagation networks. However, the variance theory can also be implemented with other learning rules.

According to the Hopfield (1982) learning rule, the weight change is computed as the outer product between the item and the context vectors of activation patterns, with the parameter a first subtracted from both vectors. This subtraction is mathematically necessary to keep the expected value of the weights at zero. Using the notions for item and context activation defined above, the weight change between these two elements of the item and context vectors can be written as

$$\Delta w_{ij} = (x_i^d - a)(x_j^c - a). \qquad (3)$$
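In code, Equation 3 is a single outer product of the two centered patterns per studied pair; a sketch that updates the item-by-context weight matrix in place:

```python
def hebbian_update(weights, item, context, activity):
    """Equation 3: Dw_ij = (x_i^d - a)(x_j^c - a), for all i, j at once."""
    weights += np.outer(item - activity, context - activity)
```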

Word frequency as the number of associated contexts at the preexperimental phase. An item may be encoded more or less frequently, and hence be associated with more or less different preexperimental contexts, depending on how often the item occurs in the subject's environment. In the model, at the preexperimental stage of the simulation, an item's frequency is simulated by the number of times the item is encoded in different contexts. A low-frequency item is encoded less often and is associated with a smaller number of contexts, whereas a high-frequency item is encoded more often and is associated with a larger number of different contexts. For instance, one might form a preexperimental association between SAAB and repair shop after experiencing the rare event of a new expensive SAAB sports car breaking down halfway through a honeymoon trip to the Grand Canyon, with the SAAB having to be towed to a repair shop somewhere out in the desert. In implementing the variance theory, the relationship between word frequency and preexperimental item-context associations can be simulated straightforwardly. At the preexperimental stage of the simulation, a low-frequency item, SAAB, may be simulated by associating one context item, REPAIR SHOP, with it during encoding. A high-frequency word, CAR, may be simulated by associating three different contexts, REPAIR SHOP, TAXI RIDE, and DRIVING TO WORK, with it during encoding.
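Under this assumption, word frequency reduces to a loop over preexperimental encodings; a sketch in the running example (three contexts for the high-frequency item, one for the low-frequency item, as in the passage above):

```python
W = np.zeros((N, N))  # item-to-context weights, initialized to zero

# A high-frequency item (e.g., CAR) is preexperimentally encoded in three
# different random contexts; a low-frequency item (e.g., SAAB) in only one.
car = random_pattern(N, a)
for _ in range(3):
    hebbian_update(W, car, random_pattern(N, a), a)

saab = random_pattern(N, a)
hebbian_update(W, saab, random_pattern(N, a), a)
```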

The recognition test phase. At recognition, an item is presented to the network: the representation of this item is reinstated as a cue to the item layer, and the representation of the study context (simulating an internally generated context regarding list or time information during the recognition experiment) is reinstated as a cue to the context layer. For example, the representation of the word CAR is reinstated at the item layer. Furthermore, the representation of the study context, STUDY LIST, is reinstated at the context layer. The subjects must have this information (or cue) in order to recognize an item from the particular study context (and not recognize the item from all the other preexperimental contexts). In the actual experimental setting, this information is usually conveyed to the subjects by the explicit instruction to recognize from the study context (e.g., "Do you recognize the word CAR from the list you just studied?").

At recognition, each node receives a signal (called the net input), which is computed on the basis of other active nodes connecting to it. Item nodes receive net inputs from active context nodes, and context nodes receive net inputs from active item nodes. The net input to a given node is simply the sum of the weights of all other active nodes connected to that node. For example, if item node 1 is connected to four context nodes (1, 2, 3, and 4), where context nodes 1 and 3 are active, the net input to item node 1 is w_11 + w_13. Thus, active nodes "send" information to nodes that they are connected to, whereas inactive nodes do not send any information. Put another way, nodes "receive" information from active nodes that they are connected to, but not from the inactive nodes.

Specifically, the net input to item node i is calculated by first multiplying the activity of the context node (labeled j) connected to this node by the weight of the connection between the nodes i and j and then summing over all connected nodes. In vector formalization, the weight matrix operates on the activation vector, and the output is the net input vector. The net input to node i (h_i^d) at the item layer depends on the states of activation of nodes in the context layer and the weights connecting these two layers:

$$h_i^d = \sum_{j=1}^{N} w_{ij}\, x_j^c. \qquad (4)$$

Following the same principle, a similar function is used to calculate the net inputs to the context nodes. Specifically, the net input to context node j depends on all the states of activation of nodes in the item layer and the weights connecting the two layers:

$$h_j^c = \sum_{i=1}^{N} w_{ji}\, x_i^d.$$

By inserting Equation 3 into Equation 4 and summing over the p number of encoded patterns during list learning, it is easy to see that the net input to a node is simply the sum of Np number of terms; for example, the net input to an item node is

$$h_i^d = \sum^{p} \sum_{j=1}^{N} (x_i^d - a)(x_j^c - a)\, x_j^c.$$
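In matrix form, the two net-input computations are products with the weight matrix and its transpose; a sketch continuing the running example:

```python
def net_inputs(weights, item, context):
    """Each node sums the weights from active nodes in the opposite layer:
    h_i^d = sum_j w_ij x_j^c (Equation 4); h_j^c = sum_i w_ji x_i^d."""
    return weights @ context, weights.T @ item
```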


For a = .5, the net inputs are binomially distributed with a certain expected value. Given a certain criterion [i.e., Np(1 - a)a > 10], a binomial distribution can be approximated with a normal distribution (Feller, 1968). For a ≠ .5, there are actually four outcomes; however, the same normal approximation can be used. Thus, for reasonably large parameter values of Npa, the distribution of net inputs to the nodes can be approximated by a normal distribution.

If the to-be-recognized item has not been encoded with the context (i.e., a new item), the net input is simply the sum of random weights. Because the expected values of all weights are zero, the expected value of the net inputs for new items will also be zero. If the item has been encoded with the context (i.e., an old item), the net input is the sum of those weights connected to that node whose respective context nodes were active at encoding. Owing to the adaptive weight changes during encoding, these weights will have an expected value that is larger than zero if both nodes were in the active state during encoding [i.e., each weight change at encoding is computed as (1 - a)^2] and less than zero if one node was inactive and the other node was active at encoding [i.e., each weight change at encoding was -a(1 - a)]. Of specific importance for the theory is that the variance of the net inputs to the context nodes (from the item nodes) increases with the number of contexts that are associated with the item. Therefore, the variance of the net input is larger for high- than for low-frequency items. Similarly, the variance of the net input to the item nodes (from the context nodes) increases with the number of items associated with one context (i.e., list length). Therefore, given that the context is constant during a list presentation, the variance of the net inputs is larger for a long than for a short list.
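This frequency effect on the variance is easy to check numerically with the helpers sketched above (run counts and the probe context are arbitrary choices of ours): the standard deviation of the context-layer net inputs should grow with the number of preexperimental contexts, while its mean stays near zero.

```python
for n_contexts in (1, 3, 9):
    stds, means = [], []
    for _ in range(200):
        W = np.zeros((N, N))
        item = random_pattern(N, a)
        for _ in range(n_contexts):  # encode the item in n_contexts contexts
            hebbian_update(W, item, random_pattern(N, a), a)
        _, h_context = net_inputs(W, item, random_pattern(N, a))
        stds.append(h_context.std())
        means.append(h_context.mean())
    print(n_contexts, round(np.mean(means), 3), round(np.mean(stds), 3))
```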

Brief summary of optimal performance. Given the strong selection pressure, arguably, humans and animals have evolved to achieve good memory performance. Therefore, it is reasonable to assume that mechanisms for recognition decision have evolved to an optimal or near-optimal performance. Following this assumption, the parameter values in the model and the implementation of the model are guided by the idea that the model should perform optimally. A detailed discussion of the issue of optimal performance, with exact derivations of what optimal performance is in the context of the present model, is presented later, in the sixth section. Here, I give a brief summary explaining the results from the analysis of optimal performance, without going into the mathematical details (see Figures 9A, 9B, 9C, and 9D).

The model's performance is optimal if the percentage of nodes active at encoding (a) is low (see Figure 9A). For a low a, it is optimal to base the recognition decision on nodes that were active at encoding and to ignore nodes that were inactive during encoding (see Figure 9A). Also, for a low a, it is optimal to place the activation threshold of the nodes between the expected values of the new and the old net inputs (see Figure 9B). Finally, it is optimal to normalize the recognition strength with the standard deviation of the net input (see Figures 9C and 9D).

For a low percentage of active nodes, it is optimal to base the recognition decision on nodes that were active at encoding (or nodes active in the cue pattern) and to ignore nodes that were inactive at encoding. At recognition, the state of activation of a node may be either active or inactive. Therefore, the nodes that are active in the cue pattern and have a net input above a certain activation threshold are activated at recognition; otherwise, the nodes are inactivated. Let z_i^d denote the state of activation at recognition for item node i. An item node is activated at recognition (z_i^d = 1) if it was active in the cue pattern (x_i^d = 1) and the net input is above the activation threshold (h_i^d > T); otherwise, it is inactivated (z_i^d = 0):

z_i^d = 1 if x_i^d = 1 and h_i^d > T; otherwise, z_i^d = 0.

Similarly, let z_j^c denote the state of activation at recognition for context node j:

z_j^c = 1 if x_j^c = 1 and h_j^c > T; otherwise, z_j^c = 0.
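These two activation rules collapse into one vectorized predicate; a sketch:

```python
def retrieval_states(cue, h, T):
    """z = 1 only for nodes that were active in the cue pattern and whose
    net input exceeds the activation threshold T; all others are 0."""
    return ((cue == 1) & (h > T)).astype(int)
```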

This way of activating nodes at retrieval differs from how nodes are activated in a standard Hopfield (1982) network, where the activation threshold is zero and a node is activated if the net input is above zero (independently of the state of activation in the cue pattern). The way of activating patterns in a Hopfield network is more likely to produce a retrieved pattern that matches the encoded pattern of activation (e.g., the expected value of active nodes at retrieval will be the same as the expected value of active nodes at encoding). However, as will be discussed later, the way suggested to activate the nodes here yields better performance in terms of discrimination between a target item and a distractor item.

As is shown in Figure 9B, performance is optimal when the activation threshold is set approximately between the new and the old net inputs. The activation threshold (T) is set to the expected value of the net inputs of nodes active during encoding (x_i^d = 1, x_j^c = 1), for old and new items. The averaging is computed over all nodes (2N) and over all new and old patterns (p) in the recognition condition. If half of the items are new and half of the items are old, the activation threshold is

$$T = \frac{1}{2aPN} \sum^{P} \left[\sum_{i=1}^{N} h_i^d x_i^d + \sum_{j=1}^{N} h_j^c x_j^c\right].$$

As was discussed above, the expected net input of new (lure) items is zero. Therefore, the activation threshold is simply half the expected net input for nodes encoded in the active state [T = μ_h(O)/2, where μ_h(O) is the expected value of the net input to nodes encoded in the active state].

It is easy to see that the expected percentage of old and new active nodes at recognition is one half of the percentage of active nodes at encoding (a/2). That is, the activation threshold divides the old and the new distribution in the middle. Old items will have a higher expected percentage active, and new items will have a lower expected percentage active. The activation threshold is constant during recognition in one condition. However, it must vary between conditions, depending on the net inputs, to yield optimal performance.
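Because the expected new-item net input is zero, the optimal threshold in the sketches above reduces to half the mean net input of the encoded-active nodes of old items; a one-line helper (our naming):

```python
def activation_threshold(h_old_active):
    """T = mu_h(O)/2: half the mean net input, across old items, of nodes
    encoded in the active state (new items contribute zero on average)."""
    return np.mean(h_old_active) / 2
```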

The percentage of nodes active at recognition is counted:

$$P_c = \frac{1}{2N}\left(\sum_{i=1}^{N} z_i^d + \sum_{j=1}^{N} z_j^c\right).$$

As is shown in Figure 9C, the performance is near-optimal if the recognition decision is based on the number of nodes active at recognition, normalized by the standard deviation of the net inputs across the active features of this item (σ_h′). Thus, this standard deviation is calculated over all the nodes active at encoding (i.e., x_i^d = 1 and x_j^c = 1; nodes inactive at encoding are not used when calculating the standard deviation, because for low levels of a these nodes carry little to no information about the item). The recognition strength (S) for an item is calculated by subtracting the expected percentage of nodes active at recognition (a/2; see note 1) from the real percentage of nodes active at recognition (P_c) and dividing by the standard deviation of the net inputs of the item (σ_h′):

$$S = \frac{P_c - a/2}{\sigma_{h'}}. \qquad (5)$$
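Equation 5 maps directly onto the running example; a sketch (the encoded-active nodes are read off the cue patterns, and the sample standard deviation is assumed):

```python
def strength(z_item, z_context, h_item, h_context, x_item, x_context, a):
    """Equation 5: S = (P_c - a/2) / sigma_h', where sigma_h' is the
    standard deviation of the net inputs of nodes active at encoding."""
    p_c = (z_item.sum() + z_context.sum()) / (len(z_item) + len(z_context))
    active_h = np.concatenate([h_item[x_item == 1], h_context[x_context == 1]])
    return (p_c - a / 2) / np.std(active_h, ddof=1)
```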

The subtraction of the expected percentage of nodes active at recognition makes the expected value of the recognition strength (S) zero. This subtraction is necessary for the normalization to work properly. The subtraction moves the recognition strength distributions symmetrically, so that the old and the new distributions move at the same rate for a given standard deviation of the net input (without the subtraction, the old recognition strength distribution would be more affected than the new distribution). Thus, the recognition strength is determined by the difference between two probabilities (the percentage of active nodes, which varies, and the expected percentage of active nodes, which is constant), divided by the standard deviation of the net input. A yes response (Y) is given if the recognition strength (S) is above the recognition criterion (C). An unbiased decision has a recognition criterion of zero:

Y = S > C.

An issue that may be raised is whether it is sensible to base recognition strength on two quite different sources, namely, the percentage of active nodes and the variability of the net input. The immediate answer is that if it is reasonable to optimize performance, it is also sensible to measure recognition strength this way. Another perspective is to note that unbiased responses can be made only on the percentage of active nodes; that is, a yes response occurs if the percentage of active nodes is larger than the expected percentage of active nodes (P_c > a/2), and the variability of the net input can be ignored. Thus, "normally," subjects base their unbiased decisions on the percentage of active nodes, and the variability of active nodes only becomes relevant when subjects are biased. From this perspective, the percentage of active nodes is used for unbiased responses, and the variability of the net input becomes relevant for confidence judgments. Therefore, by combining both the percentage of active nodes and the variability of the net input, the measure of recognition strength proposed here will also reflect the confidence judgment.

An Example With Step-by-Step Computations
To clarify the computational details involved in the variance theory, a step-by-step example is given here. For tractability, a small network is used, consisting of four item nodes and four context nodes (see Figure 2). The actual simulations, to be reported later, involved a larger network architecture, with 30 nodes at each layer. The percentage of nodes active at encoding (denoted by parameter a) is set to .50. Let item BN be represented as 1100, written as the state of activation of the four nodes x_1^d, x_2^d, x_3^d, x_4^d. Similarly, let 0011 represent item BO, 1010 represent item AO, and 0101 represent item AN. Let context CBN be represented as 1100, or the state of activation of the four nodes x_1^c, x_2^c, x_3^c, x_4^c. Similarly, 0011 represents context CBO, and 0101 represents the experimental study context CExp.

Item BN is a high-frequency new word. For simplicity, it is here encoded only once, with context CBN, in the preexperimental phase (in the simulations below, high-frequency words are preexperimentally associated with three contexts). The 16 weights between the four item nodes and the four context nodes are changed according to the learning rule, where the probability that a node is active at encoding is determined by the parameter a = .5. For example, the weight change between item node 1 and context node 1 is w_11 = [x_1^d(BN) - a][x_1^c(CBN) - a] = (1 - .5)(1 - .5) = 1/4, where BN is item BN and CBN represents context CBN. Similarly, item BO is another high-frequency word that, before the experimental phase, is encoded once with context CBO. Items AO and AN are low-frequency old and new words, and they are not encoded at the preexperimental phase.

In the experimental phase, item AO is encoded with the experimental context CExp. Finally, item BO is encoded with the same experimental context CExp. For example, the weight w_11 is now equal to

$$[x_1^d(BN) - a][x_1^c(CBN) - a] + [x_1^d(BO) - a][x_1^c(CBO) - a] + [x_1^d(BO) - a][x_1^c(CExp) - a] + [x_1^d(AO) - a][x_1^c(CExp) - a]$$
$$= (1 - .5)(1 - .5) + (0 - .5)(0 - .5) + (0 - .5)(0 - .5) + (1 - .5)(0 - .5) = 1/4 + 1/4 + 1/4 - 1/4 = 1/2.$$

After encoding, the full weight matrix is .5, 1, -1, -.5, .5, 0, 0, -.5, -.5, 0, 0, .5, -.5, -1, 1, .5, corresponding to the weights w_11, w_12, . . . , w_44, respectively.


At recognition, the state of activation of the old low-frequency item AO is reinstated as a cue to the item layer, and the state of activation of the encoding context at the experimental phase, CExp, is reinstated as a cue to the context layer. The net inputs are calculated. For example, the net input to context node 1 is

$$h_1^c = \sum_{i=1}^{4} w_{i1}\, x_i^d = .5 \cdot 1 + 1 \cdot 0 - 1 \cdot 1 - .5 \cdot 0 = -.5.$$

The net input to the item nodes is (1, 0, 2, 1), and that to the context nodes is (-.5, .5, -.5, .5). It can be seen that the expected net input for randomly created new items in this network is 0 and that the expected net input for old items encoded in the active state is 0.5. Therefore, the activation threshold is set to the average of these values, namely, T = 0.25. Nodes that have a net input above the activation threshold and that are active in the cue pattern are activated at recognition. Thus, the state of activation at recognition for the item nodes is 1010, and that for the context nodes is 0101 (which is identical to the cue patterns). The percentage of active nodes is counted: P_c(AO) = 4/8 = .5. For an unbiased response (C = 0), this will yield a yes response, because this percentage is larger than the expected percentage of active nodes at recognition (a/2 = .25). The standard deviation of the net input for nodes active at encoding is 0.71 (i.e., the standard deviation of 1, 2, .5, and .5, corresponding to nodes 1, 3, 6, and 8). The recognition strength is calculated by subtracting the expected percentage of active nodes at recognition from the percentage of active nodes for the to-be-recognized item and dividing by the standard deviation of the net input for active nodes [S = (P_c - a/2)/σ_h′ = (.5 - .25)/.71 = 0.35].
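The numbers in this step-by-step example can be checked with the helper functions sketched earlier (the net inputs are taken from the text; rounding aside, the result reproduces S = 0.35):

```python
# Item AO: cue patterns 1010 (item) and 0101 (context); T = 0.25, a = .5.
h_item = np.array([1.0, 0.0, 2.0, 1.0])
h_context = np.array([-0.5, 0.5, -0.5, 0.5])
cue_item = np.array([1, 0, 1, 0])
cue_context = np.array([0, 1, 0, 1])

z_i = retrieval_states(cue_item, h_item, 0.25)        # -> [1, 0, 1, 0]
z_c = retrieval_states(cue_context, h_context, 0.25)  # -> [0, 1, 0, 1]
print(strength(z_i, z_c, h_item, h_context, cue_item, cue_context, a=0.5))
# (0.5 - 0.25) / 0.707... = 0.354
```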

The recognition of the three items BO, AN, and BN is done in the same way. The results for the four items AN, BN, BO, and AO are: the net inputs (1, 0, 2, 1, .5, -.5, .5, -.5), where the first four numbers represent item nodes and the last four context nodes, (1, 0, 2, 1, 1.5, .5, -.5, -1.5), (1, 0, 2, 1, -1.5, -.5, .5, 1.5), and (1, 0, 2, 1, -.5, .5, -.5, .5); the states of activation at recognition (0, 0, 0, 1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 1, 0, 0), (0, 0, 1, 1, 0, 0, 0, 1), and (1, 0, 1, 0, 0, 1, 0, 1); the numbers of nodes active, 1, 2, 3, and 4; the standard deviations of the net inputs, 0.71, 1.08, 1.08, and 0.71; the recognition strengths, -.17, .00, .11, and .35; and the unbiased responses, no, no, yes, and yes, respectively (an unbiased yes response is made when S > C, where C = 0 for unbiased responses). Thus, the subject responds correctly for all items, the recognition strengths are ordered according to the mirror effect, and the variance of the net input is larger for the high- than for the low-frequency words.

SIMULATIONS OF THE FREQUENCY-BASED MIRROR EFFECT

In this section, the variance theory of the mirror effect is simulated in a connectionist framework that is consistent with a general connectionist theory of memory called TECO (Sikström, 1996a, 1996b). TECO has been used to account for a variety of memory phenomena involving recognition and recall, for example, successive testing of recall and recognition (Sikström, 1996b) and forgetting curves (Sikström, 1999). An extensive description of TECO is beyond the scope of this paper; interested readers are referred to previous articles describing the model for details.

Procedure
The simulation started with initializing the weights to zero. Then, 12 items were generated by randomly setting the nodes to an active state, with a probability of a. A preexperimental phase then followed, to generate the frequency associated with the items. In this preexperimental phase, half of the items (i.e., 6) were encoded three times, each time in a different context. These items simulated the high-frequency words. The remaining 6 items were not encoded before the experimental phase, and they simulated the low-frequency words.

At the experimental phase, one study-encoding context was generated, using the same procedure as the generation of the contexts in the preexperimental phase. Given that, in a standard recognition experiment, all studied items would be encoded in the same list, in the simulations the items were thus encoded once with the same study context. Three of the high-frequency items were encoded, and three of the low-frequency items were encoded. The other three high- and three low-frequency items were not encoded during the experimental phase, and they simulated the new items. Each encoding was simulated by first activating the nodes in the item and context layers. The weights were then changed according to the learning rule (Equation 3).

At the recognition test, the patterns of activation of the 12 items and the study context were reinstated to the network. The net inputs were calculated for each node, and the recognition strength was calculated from all the nodes in the network. The somewhat unrealistic assumption that no learning, or changes of weights, occurred during testing was adopted. However, this is a standard assumption, often used in other simulations of recognition memory, cued recall, or categorization (e.g., Kruschke, 1992; Lewandowsky, 1991; Li & Lindenberger, 1999; Li, Lindenberger, & Frensch, 2000). One thousand five hundred simulation runs, with different random item and context patterns, were carried out. The results reported later are based on the average across these runs.

Parameters
The number of high-frequency patterns was six (each encoded three times preexperimentally, and three of them encoded once experimentally to simulate the old items), and the number of low-frequency patterns was six (three of them encoded experimentally once to simulate the old items). The following parameter settings were used. The number of nodes in each layer (N) was 30 (the total number of nodes was 2N = 60), and the percentage of nodes active at encoding (a) was .20. Initially, all the weights were set to zero. The recognition criterion (C) was set to 0.

Results
Figure 3A shows the results in terms of the density of the net inputs to an active node. The expected values of the net inputs are approximately equal for new high- and new low-frequency items [μ_h(AN) = .00, μ_h(BN) = .00] and for old low- and old high-frequency items [μ_h(AO) = .38, μ_h(BO) = .38]. The high-frequency items have a larger variance of the net inputs [σ_h(BN) = .49, σ_h(BO) = .48] than do the low-frequency items [σ_h(AN) = .41, σ_h(AO) = .40]. The variances of the old and the new distributions are approximately equal.

Figure 3B shows the density of the recognition strength. The result shows the mirror effect, where the hit rate probability is larger for low- than for high-frequency items and the false alarm rate is lower for low- than for high-frequency items [P(AN) = .18, P(BN) = .25, P(BO) = .64, P(AO) = .71]. The standard deviation of the recognition strength is larger for old than for new words and larger for the low-frequency words than for the high-frequency words [σ_s(AN) = .29, σ_s(BN) = .19, σ_s(BO) = .23, σ_s(AO) = .31]. These findings agree with the empirical data and the predictions of the attention-likelihood model (Glanzer et al., 1993). Thus, the simulation shows that the variance theory can account for the frequency-based mirror effect and the associated ROC curves.

EXPLICATIONS OF THE MECHANISMS

In this section, three essential mechanisms of the variance theory that account for the mirror effect are discussed in detail. The involved mechanisms are (1) the variance of the net input, (2) the expected value of the net input, and (3) the way in which the recognition strength is calculated, by counting the number of nodes, so that the performance is optimal.

Overview
The first mechanism is that items from the difficult class (i.e., high-frequency items in this case) have a higher variance of the net inputs than do items from the easy class (i.e., low-frequency items), but the variance is independent of whether the items are old or new. The mechanism is illustrated in Figure 4A (noncumulative). It should also be underscored that the difference in variance as a function of class membership is not a primitive of the theory that the theorist manipulates. It is the natural consequence of the rather plausible assumption that high-frequency items appear more often and are associated with a greater variety of different contexts than are low-frequency words. The second mechanism is that old items have a higher expected net input to nodes encoded in the active state than do new items (but this does not depend on the class of the items). The mechanism is illustrated in Figure 4B (cumulative). The third mechanism is the way recognition strength is measured, by counting the active nodes, so that recognition performance (e.g., d′) is optimal or near-optimal. An extended analysis of optimality is presented at the end of the paper; however, the results in Figures 9A-9D can summarize the main points here. The results from this analysis imply that the decision must be based on the nodes active at encoding (see Figure 9A), the percentage of active nodes must be low (see Figure 9A), the activation threshold needs to be between the expected net inputs of the new and old items (see Figure 9B), and the recognition strength is normalized by the standard deviation of the net inputs of the item (see Figures 9C and 9D). The density of the percentage of active nodes is illustrated in Figure 4C, and the normalized percentage of active nodes is illustrated in Figure 4D. Here, it is shown how these mechanisms predict the mirror effect. Below, these three sets of mechanisms are explained in more detail.

Variance of the Net Input
The first, and perhaps the most important, mechanism is that the difficult class has a larger variability in the net inputs than does the easy class. As will be discussed later, this is shared with other theories, such as the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). However, a unique aspect of the variance theory is that it is a necessary outcome of simply encoding high-frequency items more times than low-frequency items. In the subjective-likelihood account, a larger variability for high- than for low-frequency words is an assumption that is controlled by a free parameter (p0), to reflect that high-frequency words have more definitions than do low-frequency words. The variance theory needs no such assumptions and no additional free parameters to control the variability. Whereas the subjective-likelihood account tries to capture word frequency with a free parameter, in the variance theory the word-frequency effect is simulated via the number of contexts associated with an item.

Figure 3. (A) Simulations of net inputs. The vertical axis shows the simulated density of the net inputs. (B) Simulations of the mirror effect. The vertical axis shows the simulated density of the recognition strength.

Owing to how the variance theory implements the relations between contexts and items, the variance of the net inputs increases with the frequency with which an item is encoded in different contexts. Intuitively, this occurs because a high-frequency item activates several different contexts, causing the representations of many competing contexts to be activated simultaneously. Low-frequency items are associated with fewer contexts than are high-frequency items. This causes the representations of fewer contexts to be activated simultaneously. Thus, low-frequency items have less variability in the net inputs than do high-frequency items.

At recognition, an item produces a net input in the context layer that is a mixture of the net inputs from the study context that the network is instructed to retrieve from and from several uncorrelated preexperimental contexts that were associated with the item during the preexperimental phase. The study context that the network is instructed to retrieve from contributes to the correct active state, for example, by adding +1 to the net input to a node (which is the expected net input for a node encoded in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was in an active state, or it could add -1 to the net input if the particular preexperimental context was encoded in an inactive state (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input. That is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and -1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1^2 for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for the items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or -1/4 [see Equation 3], each yielding an increase in variability of (1/4)^2 = 1/16. An expected value of aN nodes contribute, so the total increase in variability is 4 · 1/16 = 1/4.)

Figure 4. (A) The probability density of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The dotted vertical line is the activation threshold. (B) The cumulative probability distributions of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the cumulative probability distributions of the net inputs. (C) The density of the percentage of nodes active at recognition in the variance theory. The horizontal axis shows the percentage of nodes active at recognition, and the vertical axis the density. (D) The density of recognition strength in the variance theory. The horizontal axis shows the recognition strength, and the vertical axis the density of the recognition strength, using standard parameter values.

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in context nodes increases linearly with how many times a given item is encoded within different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.

In the same way as the variability of context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with how many times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input
The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase of the two classes are equal. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given the exact same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are only interested in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables, such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to lesser increases in the net input, than does a longer period of study time, for example, 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encoding in other learning contexts will not affect the expected net input, but it does affect the variability of the net input, as was demonstrated above. The item-study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, as a difficult-class item) is equal to the expected net input for ARCTIC (a low-frequency word, as an easy-class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and the testing experimental conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult item is equal to the expected net input for a new easy-class item.

The probability density functions of the net inputs for nodes in the active states are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted. New nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of net inputs for the easy-class items [σ_h(A)] is smaller than the standard deviation of net inputs for the difficult-class items [σ_h(B)]. The second mechanism is shown in the figures, in that the expected net input of an easy-class new item [μ_h(AN)] is equal to the expected net input of a difficult-class new item [μ_h(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: the expected net input for the old easy-class items [μ_h(AO)] is equal to the expected net input for the old difficult-class items [μ_h(BO)].

Recognition Strength
The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that the performance is optimal or near-optimal. If the net input is above the activation threshold (T) and the node was active at encoding, the node is activated at recognition. Otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities, or distributions, predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of the parameter values for the activation threshold. Thus, P(AN) < P(BN) < P(BO) < P(AO) for μ_h(AN) = μ_h(BN) < T < μ_h(AO) = μ_h(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy- and the difficult-class items. The expected strengths of the net inputs are equal. The variability is lower for easy-class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy than for the difficult class items, when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy-class items is higher than the hit rate for difficult-class items, when the activation threshold is less than the expected value of the old items.

The activation threshold is set to be between the expected value of the new and the old net inputs, so that the performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for difficult- and easy-class items, respectively:

$$T = \tfrac{1}{4}[\mu_h(AN) + \mu_h(BN) + \mu_h(BO) + \mu_h(AO)] = \tfrac{1}{4}[\mu_h(BO) + \mu_h(AO)] = \tfrac{1}{2}\mu_h(O).$$

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions, to optimize the performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with a shift in the activation threshold, necessary for keeping the performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adopts the activation threshold on the basis of the overall difficulty of the test, in order to maximize the performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected value of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition, when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [P_c(AN) < P_c(BN) < P_c(BO) < P_c(AO)], whereas the expected values of the net inputs are not [μ_h(AN) = μ_h(BN) < μ_h(BO) = μ_h(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σ_p(O) > σ_p(N)], whereas the standard deviations of the net inputs are equal for old and new items [σ_h(N) = σ_h(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions, because there are fewer nodes active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero. This standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes (σ_p²) is approximately proportional to the percentage of active nodes [σ_p² = P_c(1 - P_c)/N ≈ P_c/N ∝ P_c]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σ_p) is smaller for new than for old items [σ_p(AN) < σ_p(BN) < σ_p(BO) < σ_p(AO)], whereas the standard deviation of the net input is not [σ_h(AN) = σ_h(AO) < σ_h(BN) = σ_h(BO)]. The essential mechanism that makes these changes in the means and standard deviations is the nonlinearity introduced by the counting of the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because it improves the performance.

To optimize the performance, subjects base their recognition decision on the number of active nodes, normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments, when the responses are biased, but plays no role for unbiased responses.



To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (P_c). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σ_h′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σ_h′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σ_h). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviation of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = P_c, where C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) influences the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) influences the decision to be more certain.

Figure 4D shows the density distribution of the recognition strength. Note how the standard deviation of the active nodes for the easy versus the difficult classes (in Figure 4C) changes when it is normalized by the variance of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult-class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high. The easy-class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class, when the activation threshold is set between the new and the old distributions. Here, it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that are different from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted, to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution that is presented at the end of the paper, together with an analysis of optimal performance.

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.

The Material-Based Mirror Effect for High- and Low-Frequency Items
The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a), plus one parameter for each class of words [the standard deviation of the net input, σ_h(·)]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs to the easy class is σ_h(AN) = σ_h(AO) = 1.25, and the standard deviation of the net inputs to the difficult class is σ_h(BN) = σ_h(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μ_h(AN) = μ_h(BN) = 0, and the expected net inputs of the old distributions, μ_h(AO) = μ_h(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: P_c(AN) = .43a, P_c(BN) = .45a, P_c(BO) = .55a, and P_c(AO) = .57a. The following expected recognition strengths are predicted: μ_s(AN) = -0.012, μ_s(BN) = -0.008, μ_s(BO) = 0.008, and μ_s(AO) = 0.012. Figure 4D plots the four recognition-strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists
The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize the performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that, in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 for optimizing the performance. The distributions in Figure 6A are closer than the distributions in Figure 4D. Thus, decreasing the net inputs (for example, by diminishing study time) moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite directions, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain a near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list.


In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed to 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

Table 1
General Table of Results From the Experiment

Condition    AN    BN    BO    AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13   .17   .69   .82   .60              .86
Frequency    .20   .28   .80   .68   1.01             .66
Time         .10   .15   .78   .76   .89              .81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low (AN) and high (BN) frequencies and the hit rates for high (BO) and low (AO) frequencies, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength, and the vertical axes the density.

List-Length Effect

Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy and difficult class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.
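The linear growth of this variance can be illustrated with a small simulation sketch. It assumes random ±1 patterns and Hebbian outer-product learning, which is a simplification of the encoding rule of the theory; with a shared ±1 context pattern, the net input to the item layer reduces to the studied item plus the sum of the other list items, so the interference variance grows as L − 1:

    # Sketch: interference variance in the item layer grows linearly with
    # list length L when one context is associated with all L items.
    import numpy as np

    rng = np.random.default_rng(1)
    N = 200  # number of item nodes

    def interference_variance(L, trials=500):
        variances = []
        for _ in range(trials):
            items = rng.choice([-1.0, 1.0], size=(L, N))
            # With a shared +/-1 context, retrieving with that context
            # returns the sum of all stored items (up to scaling).
            net_input = items.sum(axis=0)
            # The noise around the target item comes from the other L - 1 items.
            variances.append(np.var(net_input - items[0]))
        return np.mean(variances)

    for L in (5, 10, 20):
        print(L, round(interference_variance(L), 2))  # roughly L - 1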

ROC Curves

The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].
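This claim follows from the binomial standard deviation of a proportion. A short check, using the probabilities of active nodes quoted earlier for the example parameters (and ignoring the 1/σh scaling that enters the full expression later, so only the new-versus-old ordering is meaningful here):

    # The SD of the proportion of active nodes, sqrt(Pc*(1 - Pc)/(2N)),
    # increases with Pc for Pc < .5, so old items (larger Pc) have the
    # larger standard deviation.
    from math import sqrt

    two_N = 100  # 2N, as in the example parameters
    a = 0.1      # proportion of nodes active at encoding
    for label, coeff in [("AN", .43), ("BN", .45), ("BO", .55), ("AO", .57)]:
        Pc = coeff * a
        print(label, round(sqrt(Pc * (1 - Pc) / two_N), 4))

The printed values increase from the new to the old classes, reproducing σs(O) > σs(N).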

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are scaled by the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths themselves, become equal. Figure 6D plots the predictions of the theory when all standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = .0015, σs(BN) = .0012, σs(BO) = .0015, and σs(AO) = .0020. The ratios of these standard deviations must follow Equation 2. This is also the case here: σs(BN)/σs(AO) = .61 < .74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = .78 < .94 = σs(AN)/σs(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = .0016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy class words than for difficult class words. This prediction is supported by empirical data. Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced recognition [P(BO,BN) < P(AO,BN), P(BO,AN) < P(AO,AN), P(BN,AN) > .50, and P(AO,BO) > .50]. For the parameters above, the predictions of the theory are P(BO,BN) = .79 < .81 = P(AO,BN); P(BO,AN) = .83 < .84 = P(AO,AN); P(BN,AN) = .59 > .50; P(AO,BO) = .57 > .50.
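Under the normality assumption, each of these forced-choice probabilities is simply the probability that one recognition strength exceeds the other. A sketch using the rounded means and standard deviations quoted in the text for Figure 4D (so the outputs agree with the predictions above only up to rounding):

    # P(X,Y) = P(strength of X exceeds strength of Y) for two normal
    # recognition-strength distributions.
    from math import sqrt
    from scipy.stats import norm

    def p_choose(mu_x, sd_x, mu_y, sd_y):
        return 1 - norm.cdf(0, loc=mu_x - mu_y, scale=sqrt(sd_x**2 + sd_y**2))

    mu = {"AN": -.0012, "BN": -.0008, "BO": .0008, "AO": .0012}
    sd = {"AN": .0015, "BN": .0012, "BO": .0015, "AO": .0020}

    for x, y in [("BO", "BN"), ("AO", "BN"), ("BO", "AN"),
                 ("AO", "AN"), ("BN", "AN"), ("AO", "BO")]:
        print(f"P({x},{y}) = {p_choose(mu[x], sd[x], mu[y], sd[y]):.2f}")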

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage for old high-frequency items when attention is focused on these items) must be larger than the effect from the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage for old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items. However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of the net inputs (Figure 7A) that the density of the recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75 rather than 0.50. The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, P(AO) = .63 [which may be compared with the corresponding values for Figure 4D: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74].

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the high-frequency old distribution (BO) is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs are not dependent on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = .0013, σs(BN) = .0011, σs(BO) = .0017, and σs(AO) = .0019. These results are similar to the results when the same amount of attention is paid to the two classes, as in Figure 4D: σs(AN) = .0015, σs(BN) = .0012, σs(BO) = .0015, and σs(AO) = .0020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation it follows that, if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate distribution for high-frequency words would be larger than that for low-frequency words, and, similarly, the standard deviation of the high-frequency false alarm distribution would be larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20, ranging from 18 to 29 years old.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4–5 times per million, and the high-frequency words an occurrence of 50–55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether the word was presented in the list or not. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition. Nine subjects were given the control condition followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The probability of a hit was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the probability of a hit was larger for the low-frequency words. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = .004, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = .004, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = .003, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = .003, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = .001, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = .002, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. The prediction of the standard version of the attention-likelihood theory, however, was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1–5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly for the slope of the false alarms [σs(BN)/σs(AN)], are shown in the last two columns of Table 1.
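A sketch of this slope computation on invented confidence ratings may clarify the procedure (the arrays below are hypothetical and stand in for the subject data, which are not reproduced here):

    # Estimate a z-ROC slope by z-transforming cumulative response rates at
    # each confidence criterion and regressing one z-score on the other.
    import numpy as np
    from scipy.stats import norm, linregress

    def z_rates(ratings, n_levels=5):
        # P(rating >= k) for k = 2..n_levels, kept away from 0 and 1.
        ratings = np.asarray(ratings)
        ps = [(ratings >= k).mean() for k in range(2, n_levels + 1)]
        return norm.ppf(np.clip(ps, .01, .99))

    # Hypothetical ratings (1 = guessing ... 5 = very certain) for old items.
    z_low = z_rates([5, 5, 4, 4, 3, 3, 2, 2, 1, 5, 4, 3])   # low frequency (AO)
    z_high = z_rates([5, 4, 4, 3, 3, 2, 2, 1, 1, 4, 3, 2])  # high frequency (BO)

    # Regressing z_low on z_high estimates sigma_s(BO)/sigma_s(AO).
    print(round(linregress(z_high, z_low).slope, 2))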

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items should be higher than that for low-frequency words at encoding but lower at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

Σ_{i=1}^{8} (Observed_i − Predicted_i)² / σ_i².

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were the number of features (N = 1000) and the recognition criterion [ln(L) = 0].
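A sketch of this kind of fitting procedure, with a hypothetical observed vector and a trivial placeholder standing in for the model's prediction function (scipy's general-purpose minimizer is one reasonable choice; the optimizer actually used is not specified here):

    # Minimize the squared error divided by the variance, as in the fit
    # criterion quoted above.
    import numpy as np
    from scipy.optimize import minimize

    observed = np.array([.25, .30, .70, .74, .61, .74, .78, .94])  # hypothetical
    sigma = np.ones_like(observed)  # empirical SDs set to one, as in the text

    def fit_error(params, observed, sigma, predict):
        predicted = predict(params)
        return np.sum((observed - predicted) ** 2 / sigma ** 2)

    xs = np.linspace(0.0, 1.0, len(observed))
    def predict(params):  # placeholder model: a line in place of the theory
        slope, intercept = params
        return slope * xs + intercept

    result = minimize(fit_error, x0=[1.0, 0.0], args=(observed, sigma, predict))
    print(result.x, round(fit_error(result.x, observed, sigma, predict), 3))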

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of


nodes active at encoding (a), the standard deviation of the net inputs for the easy class words [σh(A)], and the standard deviation of the net inputs for the difficult class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.

The results show that the variance theory accounts for .98 (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for .93 of the variance for the slopes; the attention-likelihood theory accounts for .84 of the variance of the slopes. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slopes, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted by a single variable, namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed to .10, and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), to 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of observed

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


probabilities. Figure 8B shows the corresponding results for the slopes. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes with a single parameter equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities using one parameter is slightly worse than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed as the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

Pc = a ∫_T^∞ (2πσh²)^(−1/2) e^(−(h − μh)² / (2σh²)) dh.   (6)

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), yields the expected recognition strength (μs):

μs = (Pc − a/2) / σh.

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because this simplifies the analytic solution; the variance theory, or the simulation, however, uses the standard deviation of the item. This approximation is good when there is a large number of features; for a small number of features, however, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and item layers. The distribution of Pc is binomial but can, for a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/2N]^(1/2). The final result is scaled by the normalization factor 1/σh:

σs = (1/σh) [Pc(1 − Pc) / (2N)]^(1/2).   (7)

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

P(Y) = ∫_C^∞ (2πσs²)^(−1/2) e^(−(s − μs)² / (2σs²)) ds.

The probability of choosing A over B in a two-choice forced recognition test [P(A,B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

P(A,B) = ∫_0^∞ (2π[σs²(A) + σs²(B)])^(−1/2) e^(−(s − [μs(A) − μs(B)])² / (2[σs²(A) + σs²(B)])) ds.

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
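A compact Python transcription of these analytic expressions (Equation 6, the expected strength, Equation 7, and P(Y)) is given below as a sketch. The parameter values are the Figure 4D settings quoted earlier; note that the probabilities quoted in the text derive from the full parameterization in the Appendix (word frequency and list length), so this bare transcription need not reproduce those exact numbers:

    # Analytic solution sketch: Pc (Eq. 6), mu_s, sigma_s (Eq. 7), and P(Y).
    from math import sqrt
    from scipy.stats import norm

    def p_active(a, mu_h, sigma_h, T):
        # Eq. 6: probability that a node is active at recognition.
        return a * (1 - norm.cdf(T, loc=mu_h, scale=sigma_h))

    def mu_s(Pc, a, sigma_h):
        # Expected recognition strength.
        return (Pc - a / 2) / sigma_h

    def sigma_s(Pc, sigma_h, two_N):
        # Eq. 7: standard deviation of the recognition strength.
        return sqrt(Pc * (1 - Pc) / two_N) / sigma_h

    def p_yes(mu, sd, C=0.0):
        # Probability of a yes response given recognition criterion C.
        return 1 - norm.cdf(C, loc=mu, scale=sd)

    a, two_N, T = 0.1, 100, 0.5  # Figure 4D settings
    for label, mu_h, s_h in [("AN", 0, 1.25), ("BN", 0, 1.56),
                             ("BO", 1, 1.56), ("AO", 1, 1.25)]:
        Pc = p_active(a, mu_h, s_h, T)
        m, s = mu_s(Pc, a, s_h), sigma_s(Pc, s_h, two_N)
        print(label, round(Pc, 4), round(m, 5), round(s, 5), round(p_yes(m, s), 2))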



Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is calculated in the model with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc to one, the variance of the recognition strength can be simplified to

σs² = Pc / (2Nσh²).

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

σs²(N) / σs²(O) ≈ Pc(N) / Pc(O).

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is .8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is .64 (= .80²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions:

σs(BO) / σs(AO) ≤ σs(BN) / σs(AN) ≈ σh(A) / σh(B).

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of active nodes, setting the activation threshold between the old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that an optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

d′ = [μs(O) − μs(N)] / σs(NO) = [Pc(O) − Pc(N)] / [Pc(NO)(1 − Pc(NO)) / (2N)]^(1/2).   (8)

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strengths, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc( ) was calculated with Equation 6. The expected values of the net inputs and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.
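The kind of analysis reported next can be sketched by scanning d′ (Equation 8) over the activation threshold. The net-input means and standard deviations below are fixed illustrative values, whereas the paper derives them from the Appendix equations, so the optimum found here is not the paper's exact optimum:

    # Sketch: d' (Eq. 8) as a function of the activation threshold T.
    import numpy as np
    from scipy.stats import norm

    a, two_N = 0.1, 60                      # activity level; 2N with N = 30
    mu_new, mu_old, s_h = 0.0, 1.42, 1.25   # illustrative net-input statistics

    def d_prime(T):
        pc_new = a * (1 - norm.cdf(T, mu_new, s_h))
        pc_old = a * (1 - norm.cdf(T, mu_old, s_h))
        pc_no = (pc_new + pc_old) / 2       # Pc(NO) = [Pc(N) + Pc(O)]/2
        return (pc_old - pc_new) / np.sqrt(pc_no * (1 - pc_no) / two_N)

    ts = np.linspace(0.0, 2.5, 101)
    best = ts[np.argmax([d_prime(t) for t in ts])]
    print("best threshold:", round(best, 2), " d':", round(d_prime(best), 2))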

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).



The results show that d′ is optimal for a = .052; d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases: For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. Thus, for a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity level for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium-low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed. Thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = .81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is .71, which is near, but slightly lower than, the optimal value of .81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may

deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes [proportional to 1 − a] than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.
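This asymmetry is characteristic of a covariance-style Hebbian rule. The sketch below assumes weight increments of the form (x_i − a)(x_j − a) for binary activities, which is one standard way to obtain changes proportional to 1 − a for active nodes and to a for inactive nodes; it illustrates the scaling argument only, not necessarily the paper's exact learning rule:

    # Covariance-style Hebbian increments for binary (0/1) activities: an
    # active-active pair changes by (1-a)^2, an active-inactive pair by
    # -a(1-a), and an inactive-inactive pair by a^2.
    a = 0.1
    for x_i in (0, 1):
        for x_j in (0, 1):
            dw = (x_i - a) * (x_j - a)
            print(f"x_i={x_i}, x_j={x_j}: dw = {dw:+.3f}")

For a = 1/2, all four magnitudes are equal (.25), matching the statement above that active and inactive nodes then carry the same amount of information.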

The performance of the variance theory was also calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved in the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit and false alarm rates. Therefore, it is necessary to use another criterion for performance


with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, the optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f [S(O)]/f [S(N)], and the optimal performance occurs when this ratio is equal to one (L = 1).
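For normal distributions, the point where L = 1 can be found numerically. A sketch, using illustrative means and standard deviations in the spirit of the Figure 4D values (not the paper's exact computation):

    # Find the criterion where the old and new strength densities intersect
    # (likelihood ratio L = 1); for unequal-variance normals this maximizes
    # hits minus false alarms.
    import numpy as np
    from scipy.stats import norm

    mu_n, sd_n = -0.0012, 0.0015  # new distribution (illustrative)
    mu_o, sd_o = 0.0012, 0.0020   # old distribution (illustrative)

    xs = np.linspace(mu_n, mu_o, 20001)  # the relevant crossing lies between the means
    L = norm.pdf(xs, mu_o, sd_o) / norm.pdf(xs, mu_n, sd_n)
    c = xs[np.argmin(np.abs(L - 1.0))]
    hits = 1 - norm.cdf(c, mu_o, sd_o)
    fas = 1 - norm.cdf(c, mu_n, sd_n)
    print(round(c, 5), round(hits - fas, 3))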

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. An optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple. The optimal performance occurs when the placements of the maximum likelihoods of the two classes are equal:

L(A) = f [S(AO)]/f [S(AN)] = f [S(BO)]/f [S(BN)] = L(B).

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition threshold for Class A is diminished, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the two false alarm rates must cancel: Δf [S(AN)] + Δf [S(BN)] = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), or Δf [S(AO)] + Δf [S(BO)] < 0. This shows that the change in the placement of the criteria from L(A) = L(B) results in an overall decrease in the hit rate (Δf [S(AO)] + Δf [S(BO)] < 0), and performance suffers.

Note that the variance theory has only one overall recognition criterion. However, the theory, and any theory of the mirror effect, specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can indirectly be inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently so that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56), in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Classes A and B, and the total false alarm rate is equal to the average false alarm rate in Classes A and B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than, or equal to, performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data on Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active for old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple. The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distributions than for the high-frequency distributions. This was true for the new and the old distributions, both when attention was paid to the high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which to the author's knowledge has not been found in empirical data.

The variance theory predicts that strength variables such as study time, repetition, and study instructions affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rate.

The variance theory has no lateral connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition would make it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism operates for the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. These wrongly activated nodes are more likely to represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance for nonwords as compared with words. It is also a tentative account of why nonwords can be seen, in the variance theory, as a difficult class with a higher variability than that of words. However, further work is needed before any firm conclusion can be drawn regarding this aspect of the theory.

A problem similar to frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from the concrete class are more easily discriminated than words from the abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point it is not completely clear how this assumption can be motivated. One possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than that of the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σh′), so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items for addressing issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for the z-ROC curves; instead, the variance theory uses the number of active nodes, normalized by the standard deviation of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes that model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kisok, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

a σp(N) / [σp(N) + σp(O)].

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the z-ROC curve is 0.8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.
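As a quick numeric check of the expression in this note, here is a minimal sketch (the function name is illustrative; the slope argument is assumed to be the ratio σp(N)/σp(O), so that equal standard deviations give a/2):

```python
# Expected fraction of active nodes when the new and old standard
# deviations of the percentage of active nodes differ (see Note 1).
def expected_active_fraction(a, slope):
    # slope = sigma_p(N) / sigma_p(O); slope = 1 reduces to a / 2
    return a * slope / (slope + 1.0)

print(expected_active_fraction(a=0.20, slope=0.8))  # ~0.089, i.e., ~0.44a
```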

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input yields the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξi = ξj = 1:

μh(O) = Σ_{j=1}^{N} Δwij ξj = Σ_{j=1}^{N} (ξi − a)(ξj − a)ξj = a(1 − a)²N.    (A1)

The expected value of the net inputs for the new items is zero:

μh(N) = 0.    (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

σh²(N) = Σ_{p=1}^{P} Σ_{j=1}^{N} E[(Δwij ξj)²] = PNa³(1 − a)².    (A3)

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

σh²(f) = f a⁴(1 − a)²N².    (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

σh²(L) = L a⁴(1 − a)²N².

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where P patterns have been encoded, is

σh²(O) = [(f + L)/2] a⁴(1 − a)²N² + PNa³(1 − a)².    (A5)

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the percentages of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)



minus the expected percentage of above-threshold nodes (e.g., .25), divided by the standard deviation of the actually observed above-threshold proportions (i.e., by the standard deviation of 0.75, 1.00, 1.25, and 1.50).

Why is recognition strength determined by this rule, as opposed to, say, just the percentage of above-threshold nodes? As will be shown later, this way of measuring recognition strength (subtracting the expected value and dividing by the standard deviation of the net input) allows the model to perform optimally in terms of discriminability between new and old items when the response bias to the yes response is manipulated. And in this case, the model accounts for why the standard deviation of the response distribution of the easy class is larger than that of the response distribution of the difficult class. It is plausible to assume that humans have evolved to respond more or less optimally in general and that this is reflected in their performance, as well as in the implementation of the variance theory. Similarly, the activation threshold is set in the model to the value that leads to the highest d′ (i.e., to optimal performance), which occurs when the activation threshold is between the new and the old distributions. This optimal tuning of the model allows it to account for some rather dramatic results, such as concentering, showing that target and lure distributions from different item classes converge on a strength-of-evidence axis as memory conditions worsen.

Here is a brief example of how the model performs. Consider hypothetical levels of activation generated by high-frequency and low-frequency old (target) and new (lure) items. Because high-frequency words have appeared in a larger number of contexts, they have a larger variance of net input. As such, targets and lures will be relatively more confusable and will generate percentages of activated nodes that are difficult to discriminate. Assume that the standard deviation of the net input is .10 and that the relevant proportions are .25 for high-frequency targets and .15 for high-frequency lures. In contrast, low-frequency words will have occurred in fewer contexts and will be less confusable. Assume that the standard deviation of the net inputs is less for low-frequency words, for example .05, and that the percentage of active nodes is .30 for low-frequency targets and .10 for low-frequency lures. Given these values, what are the recognition strengths for high-frequency and low-frequency targets and lures? If the expected proportion of above-threshold nodes is .20, they are

Low-frequency lures: (.10 − .20)/.05 = −2.0
High-frequency lures: (.15 − .20)/.10 = −0.5
High-frequency targets: (.25 − .20)/.10 = 0.5
Low-frequency targets: (.30 − .20)/.05 = 2.0
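The arithmetic above can be verified with a few lines of code; this is a minimal sketch using the proportions and standard deviations assumed in the example:

```python
# Recognition strengths for the hypothetical example above:
# S = (proportion - expected) / sd of net input.
expected = 0.20
cases = {
    "low-frequency lures":    (0.10, 0.05),
    "high-frequency lures":   (0.15, 0.10),
    "high-frequency targets": (0.25, 0.10),
    "low-frequency targets":  (0.30, 0.05),
}
for name, (pc, sd) in cases.items():
    print(f"{name}: {(pc - expected) / sd:+.1f}")
# Prints -2.0, -0.5, +0.5, +2.0: the mirror-effect ordering.
```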

The model accounts for the various aspects of memory phenomena by postulating a connectionist neural network model with an implementation and parameter settings that allow it to perform at optimal or near-optimal levels. When the model is optimized, it behaves similarly to how subjects behave; when it is not optimized, it does not fit the experimental data. This is true not only for the standard mirror effect, but also for exceptions, such as the absence of a mirror effect for within-list strength manipulations (something all other competing formal models fail to do). Furthermore, it predicts key features of the ROCs for new and old items, as well as for high- and low-frequency items (something any viable formal model must do).

Presentation of the Variance Theory

In this section, the details of the variance theory are presented. As will become clearer as the theory is unfolded, the theory is analytical, and the analytical solutions are self-contained and solvable (interested readers can find the analytical solutions in the sixth section). Although the theory itself does not require a particular computational framework, it can be more easily explained and directly implemented by using a connectionist network. Therefore, the presentation of the theory below is couched within the framework of a Hopfield neural network (Hopfield, 1982, 1984), in order to explicate the theory's underlying mechanisms that generate the mirror effect.

Network architecture. The variance theory may be implemented as a two-layer recurrent distributed neural network (see Figure 2). There are two layers in the representation: One is the item layer, and the other is the context layer. Both layers consist of N nodes (i.e., N features), although the theory could also be implemented with an unequal number of context and item nodes. Thus, the total number of nodes in the network is 2N. The item and the context layers are fully connected, so that all the nodes in one layer are connected through weights to all the nodes in the other layer. Nodes within a layer are not connected (i.e., there are no lateral connections).

Stimulus and context representation. Contexts and items are represented as binary activation patterns across the nodes in the context and item layers, respectively. A node is active (or activated) if its corresponding feature is present (i.e., of value 1), and a node is inactive (or not activated) if its corresponding feature is absent (i.e., of value 0). There are several reasons for choosing binary representations. For instance, binary representation serves to deblur noisy information at retrieval (Hopfield, 1984). Binary representation also allows for a low proportion of active nodes (sparse representation), which is known to improve performance (Okada, 1996). It also introduces nonlinearity, which is necessary for solving some problems in multilayer networks (Rumelhart, Hinton, & Williams, 1986), and it is perhaps the simplest representation. Furthermore, in the present model it is shown to be necessary for capturing the characteristics of the z-ROC curves that are associated with the mirror effect.

More specifically, the state of activation for the ith node in the item layer at encoding is denoted ξi^d, where the superscript d denotes the item layer. The state of activation for the jth node in the context layer at encoding is denoted ξj^c, where the superscript c denotes the context layer. Context patterns and item patterns are generated by randomly setting nodes to an active state (i.e., with values of 1) and otherwise to an inactive state (i.e., with values of 0). Let a be a parameter that determines the expected probability that a node is active at encoding. This parameter does not change during the simulation and is assumed to be relatively small (for purposes of sparse representation). Note that a is the expected probability of active nodes, whereas the real percentage of active nodes for specific items or contexts varies around a.

The encoding-study phase. Encoding occurs by changing the connection weights between the item and the context layers. The weights (or the strengths of the connections) contain the information about what has been stored in the network. The weight between item node i and context node j is denoted wij, and it is initialized to zero. The weight change (Δwij) is calculated by the learning rule suggested by Hopfield (1982; see also Hertz, Krogh, & Palmer, 1991, for additional details). This is essentially a Hebbian learning rule that increases the connection weights between simultaneously activated nodes. This rule is chosen here because it is more biologically plausible than other rules, such as the delta or gradient-descent learning rules (e.g., Rumelhart et al., 1986) used in back-propagation networks. However, the variance theory can also be implemented with other learning rules.

According to the Hopfield (1982) learning rule, the weight change is computed as the outer product between the item and the context vectors of activation patterns, with the parameter a first subtracted from both vectors. This subtraction is mathematically necessary to keep the expected value of the weights at zero. Using the notation for item and context activation defined above, the weight change between these two elements of the item and context vectors can be written as

Δwij = (ξi^d − a)(ξj^c − a).    (3)

Word frequency as the number of associated contexts at the preexperimental phase. An item may be encoded more or less frequently, and hence be associated with more or fewer different preexperimental contexts, depending on how often the item occurs in the subject's environment. In the model, at the preexperimental stage of the simulation, an item's frequency is simulated by the number of times the item is encoded in different contexts. A low-frequency item is encoded less often and is associated with a smaller number of contexts, whereas a high-frequency item is encoded more often and is associated with a larger number of different contexts. For instance, one might form a preexperimental association between SAAB and REPAIR SHOP after experiencing the rare event of a new, expensive SAAB sports car breaking down halfway through a honeymoon trip to the Grand Canyon, with the SAAB having to be towed to a repair shop somewhere out in the desert. In implementing the variance theory, the relationship between word frequency and preexperimental item-context associations can be simulated straightforwardly. At the preexperimental stage of the simulation, a low-frequency item, SAAB, may be simulated by associating one context item, REPAIR SHOP, with it during encoding. A high-frequency word, CAR, may be simulated by associating three different contexts, REPAIR SHOP, TAXI RIDE, and DRIVING TO WORK, with it during encoding.

The recognition test phase. At recognition, an item is presented to the network: The representation of this item is reinstated as a cue to the item layer, and the representation of the study context (simulating an internally generated context regarding list or time information during the recognition experiment) is reinstated as a cue to the context layer. For example, the representation of the word CAR is reinstated at the item layer. Furthermore, the representation of the study context, STUDY LIST, is reinstated at the context layer. The subjects must have this information (or cue) in order to recognize an item from the particular study context (and not recognize the item from all the other preexperimental contexts). In the actual experimental setting, this information is usually conveyed to the subjects by the explicit instruction to recognize from the study context (e.g., "Do you recognize the word CAR from the list you just studied?").

At recognition, each node receives a signal (called the net input), which is computed on the basis of the other active nodes connecting to it. Item nodes receive net inputs from active context nodes, and context nodes receive net inputs from active item nodes. The net input to a given node is simply the sum of the weights of all other active nodes connected to that node. For example, if item node 1 is connected to four context nodes (1, 2, 3, and 4), where context nodes 1 and 3 are active, the net input to item node 1 is w11 + w13. Thus, active nodes "send" information to the nodes that they are connected to, whereas inactive nodes do not send any information. Put another way, nodes "receive" information, or input, from the active nodes that they are connected to, but not from the inactive nodes.

Specifically, the net input to item node i is calculated by first multiplying the activity of each context node (labeled j) connected to this node by the weight of the connection between nodes i and j, and then summing over all connected nodes. In vector formalization, the weight matrix operates on the activation vector, and the output is the net input vector. The net input to node i (hi^d) at the item layer depends on the states of activation of the nodes in the context layer and the weights connecting the two layers:

hi^d = Σ_{j=1}^{N} wij ξj^c.    (4)

Following the same principle, a similar function is used to calculate the net inputs to the context nodes. Specifically, the net input to context node j depends on the states of activation of the nodes in the item layer and the weights connecting the two layers:

hj^c = Σ_{i=1}^{N} wji ξi^d.

By inserting Equation 3 into Equation 4 and summing over the p encoded patterns during list learning, it is easy to see that the net input to a node is simply the sum of Np terms; for example, the net input to an item node is

hi^d = Σ_{p} Σ_{j=1}^{N} (ξi^d − a)(ξj^c − a) ξj^c.


For a = .5, the net inputs are binomially distributed with a certain expected value. Given a certain criterion [i.e., Np(1 − a)a > 10], a binomial distribution can be approximated with a normal distribution (Feller, 1968). For a ≠ .5, there are actually four outcomes; however, the same normal approximation can be used. Thus, for reasonably large values of Npa, the distribution of net inputs to the nodes can be approximated by a normal distribution.

If the to-be-recognized item has not been encoded with the context (i.e., a new item), the net input is simply the sum of random weights. Because the expected values of all weights are zero, the expected value of the net inputs for new items will also be zero. If the item has been encoded with the context (i.e., an old item), the net input is the sum of those weights connected to that node whose respective context nodes were active at encoding. Owing to the adaptive weight changes during encoding, these weights will have an expected value that is larger than zero if both nodes were in the active state during encoding [i.e., each weight change at encoding is computed as (1 − a)²] and less than zero if one node was inactive and the other node was active at encoding [i.e., each weight change at encoding was −a(1 − a)]. Of specific importance for the theory is that the variance of the net inputs to the context nodes (from the item nodes) increases with the number of contexts that are associated with the item. Therefore, the variance of the net input is larger for high- than for low-frequency items. Similarly, the variance of the net input to the item nodes (from the context nodes) increases with the number of items associated with one context (i.e., list length). Therefore, given that the context is constant during a list presentation, the variance of the net inputs is larger for a long than for a short list.

Brief summary of optimal performance. Given the strong selection pressure, arguably humans and animals have evolved to achieve good memory performance. Therefore, it is reasonable to assume that the mechanisms for recognition decisions have evolved to optimal or near-optimal performance. Following this assumption, the parameter values in the model and the implementation of the model are guided by the idea that the model should perform optimally. A detailed discussion of the issue of optimal performance, with exact derivations of what constitutes optimal performance in the context of the present model, is presented later, in the sixth section. Here I give a brief summary of the results from the analysis of optimal performance, without going into the mathematical details (see Figures 9A, 9B, 9C, and 9D).

The model's performance is optimal if the percentage of nodes active at encoding (a) is low (see Figure 9A). For a low a, it is optimal to base the recognition decision on nodes that were active at encoding and to ignore nodes that were inactive during encoding (see Figure 9A). Also for a low a, it is optimal to place the activation threshold of the nodes between the expected values of the new and the old net inputs (see Figure 9B). Finally, it is optimal to normalize the recognition strength with the standard deviation of the net input (see Figures 9C and 9D).

For a low percentage of active nodes, it is optimal to base the recognition decision on nodes that were active at encoding (or nodes active in the cue pattern) and to ignore nodes that were inactive at encoding. At recognition, the state of activation of a node may be either active or inactive. The nodes that are active in the cue pattern and have a net input above a certain activation threshold are activated at recognition; otherwise, the nodes are inactivated. Let zi^d denote the state of activation at recognition for item node i. An item node is activated at recognition (zi^d = 1) if it was active in the cue pattern (ξi^d = 1) and the net input is above the activation threshold (hi^d > T); otherwise, it is inactivated (zi^d = 0):

zi^d = 1 if ξi^d = 1 and hi^d > T; otherwise, zi^d = 0.

Similarly, let zj^c denote the state of activation at recognition for context node j:

zj^c = 1 if ξj^c = 1 and hj^c > T; otherwise, zj^c = 0.
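This retrieval activation rule is a simple conjunction; a minimal sketch (NumPy assumed; function name illustrative):

```python
import numpy as np

# Retrieval activation rule: a node is active at recognition only if it
# was active in the cue pattern AND its net input exceeds the threshold T.
def activate(cue, h, T):
    return ((cue == 1) & (h > T)).astype(int)   # z = 1 iff xi = 1 and h > T
```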

This way of activating nodes at retrieval differs from how nodes are activated in a standard Hopfield (1982) network, where the activation threshold is zero and a node is activated if its net input is above zero (independently of the state of activation in the cue pattern). The way of activating patterns in a Hopfield network is more likely to produce a retrieved pattern that matches the encoded pattern of activation (e.g., the expected value of active nodes at retrieval will be the same as the expected value of active nodes at encoding). However, as will be discussed later, the way of activating the nodes suggested here yields better performance in terms of discrimination between a target item and a distractor item.

As is shown in Figure 9B, performance is optimal when the activation threshold is set approximately between the new and the old net inputs. The activation threshold (T) is set to the expected value of the net inputs of nodes active during encoding (ξi^d = 1, ξj^c = 1) for old and new items. The average is computed over all nodes (2N) and over all new and old patterns (P) in the recognition condition. If half of the items are new and half of the items are old, the activation threshold is

T = [1/(2aNP)] Σ_{p=1}^{P} [Σ_{i=1}^{N} hi^d ξi^d + Σ_{j=1}^{N} hj^c ξj^c].

As was discussed above, the expected net input of new (lure) items is zero. Therefore, the activation threshold is simply half the expected net input for nodes encoded in the active state [T = μh(O)/2, where μh(O) is the expected value of the net input to nodes encoded in the active state].

It is easy to see that the expected percentage of old and new nodes active at recognition is one half of the percentage of nodes active at encoding (a/2). That is, the activation threshold divides the old and the new distributions in the middle: Old items will have a higher expected percentage active, and new items will have a lower expected percentage active. The activation threshold is constant during recognition within one condition. However, it must vary between conditions, depending on the net inputs, to yield optimal performance.

The percentage of nodes active at recognition is counted as

Pc = [1/(2N)] [Σ_{i=1}^{N} zi^d + Σ_{j=1}^{N} zj^c].

As is shown in Figure 9C, performance is near-optimal if the recognition decision is based on the number of nodes active at recognition, normalized by the standard deviation of the net inputs across the active features of the item (σh′). Thus, this standard deviation is calculated over all the nodes active at encoding (i.e., ξi^d = 1 and ξj^c = 1; nodes inactive at encoding are not used when calculating the standard deviation, because for low levels of a these nodes carry little or no information about the item). The recognition strength (S) for an item is calculated by subtracting the expected percentage of nodes active at recognition (a/2; see Note 1) from the real percentage of nodes active at recognition (Pc) and dividing by the standard deviation of the net inputs of the item (σh′):

S = (Pc − a/2) / σh′.    (5)

The subtraction of the expected percentage of nodes active at recognition makes the expected value of the recognition strength (S) zero. This subtraction is necessary for the normalization to work properly: It moves the recognition strength distributions symmetrically, so that the old and the new distributions move at the same rate for a given standard deviation of the net input (without the subtraction, the old recognition strength distribution would be more affected than the new distribution). Thus, the recognition strength is determined by the difference between two probabilities (the percentage of active nodes, which varies, and the expected percentage of active nodes, which is constant), divided by the standard deviation of the net input. A yes response (Y) is given if the recognition strength (S) is above the recognition criterion (C). An unbiased decision has a recognition criterion of zero:

Y = S > C.
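Putting Equation 5 and the decision rule together, a minimal sketch (NumPy assumed; function and argument names are illustrative):

```python
import numpy as np

# Recognition strength (Equation 5): the fraction of nodes active at
# recognition, minus its expectation a/2, divided by the standard
# deviation of the net inputs over nodes active at encoding.
def recognition_strength(z_item, z_ctx, h_item, h_ctx, cue_item, cue_ctx, a):
    N = len(z_item)
    pc = (z_item.sum() + z_ctx.sum()) / (2 * N)               # Pc
    h_active = np.concatenate([h_item[cue_item == 1], h_ctx[cue_ctx == 1]])
    return (pc - a / 2) / h_active.std()                      # S

def respond_yes(S, C=0.0):
    return S > C                                              # unbiased: C = 0
```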

An issue that may be raised is whether it is sensible to base recognition strength on two quite different sources, namely, the percentage of active nodes and the variability of the net input. The immediate answer is that if it is reasonable to optimize performance, it is also sensible to measure recognition strength this way. Another perspective is to note that unbiased responses can be made on the percentage of active nodes alone; that is, a yes response occurs if the percentage of active nodes is larger than the expected percentage of active nodes (Pc > a/2), and the variability of the net input can be ignored. Thus, "normally" subjects base their unbiased decisions on the percentage of active nodes, and the variability of the net input becomes relevant only when subjects are biased. From this perspective, the percentage of active nodes is used for unbiased responses, and the variability of the net input becomes relevant for confidence judgments. Therefore, by combining both the percentage of active nodes and the variability of the net input, the measure of recognition strength proposed here will also reflect the confidence judgment.

An Example With Step-by-Step Computations

To clarify the computational details involved in the variance theory, a step-by-step example is given here. For tractability, a small network is used, consisting of four item nodes and four context nodes (see Figure 2). The actual simulations, to be reported later, involved a larger network architecture, with 30 nodes in each layer. The percentage of nodes active at encoding (denoted by the parameter a) is set to .50. Let item BN be represented as (1, 1, 0, 0), written as the states of activation of the four nodes ξ1^d, ξ2^d, ξ3^d, ξ4^d. Similarly, let (0, 0, 1, 1) represent item BO, (1, 0, 1, 0) represent item AO, and (0, 1, 0, 1) represent item AN. Let context CBN be represented as (1, 1, 0, 0), or the states of activation of the four nodes ξ1^c, ξ2^c, ξ3^c, ξ4^c. Similarly, (0, 0, 1, 1) represents context CBO, and (0, 1, 0, 1) represents the experimental study context CExp.

c Similarly 0011 represents context CBOand 0101 represents experimental study context CExp

Item BN is a high-frequency new word For simplicityit is here encoded only once with context CBN in the pre-experimental phase (in the simulations below high-frequency words are preexperimentally associated withthree contexts) The 16 weights between the four itemnodes and the four context nodes are changed accordingto the learning rule where the probability that a node isactive at encoding is determined by the parameter a 55 For example the weight change between item node 1and context node 1 is w11 5 [x1

d(BN) a][x1d(CBN) a] 5

(1 5)(1 5) 5 14 where BN is item BN and the CBNrepresents context CBN Similarly item BO is anotherhigh-frequency word that before the experimental phaseis encoded once with context CBO Items AO and AN arelow-frequency old and new words and they are not en-coded at the preexperimental phase

In the experimental phase item AO is encoded withthe experimental context CExp Finally item BO is en-coded with the same experimental context CExp For ex-ample the weight w11 is now equal to

[x1d(BN) a) (x1

c(CBN) a] + [x1d(BO) a]

[x1c(CBO) a] + [x1

d(BO) a][x1c(CE) a]

+ [x1d(AO) a][x1

c(CE) a] 5 (1 5) (1 5)

+ (0 5)(0 5) + (0 5)(0 5)

+ (1 5) (0 5) 5 14 + 14 + 14 14 5 12

After encoding the full weight matrix is 5 1 1 55 0 0 5 5 0 0 5 5 1 1 5 corre-sponding to the weights w11 w12 w44 respectively



At recognition, the state of activation of the old low-frequency item AO is reinstated as a cue to the item layer, and the state of activation of the encoding context of the experimental phase, CExp, is reinstated as a cue to the context layer. The net inputs are then calculated. For example, the net input to context node 1 is

h1^c = Σ_{i=1}^{4} w1i ξi^d = (.5)(1) + (1)(0) + (−1)(1) + (−.5)(0) = −.5.

The net inputs to the item nodes are (1, 0, 2, 1), and those to the context nodes are (−.5, .5, −.5, .5). It can be seen that the expected net input for randomly created new items in this network is 0 and that the expected net input for old items encoded in the active state is .5. Therefore, the activation threshold is set to the average of these values, namely, T = .25. Nodes that have a net input above the activation threshold and that are active in the cue pattern are activated at recognition. Thus, the states of activation at recognition are (1, 0, 1, 0) for the item nodes and (0, 1, 0, 1) for the context nodes (which are identical to the cue patterns). The percentage of active nodes is counted: Pc(AO) = 4/8 = .5. For an unbiased response (C = 0), this will yield a yes response, because this percentage is larger than the expected percentage of nodes active at recognition (a/2 = .25). The standard deviation of the net inputs for nodes active at encoding is 0.71 (i.e., the standard deviation of 1, 2, .5, and .5, corresponding to nodes 1, 3, 6, and 8). The recognition strength is calculated by subtracting the expected percentage of nodes active at recognition from the percentage of active nodes for the to-be-recognized item and dividing by the standard deviation of the net input for active nodes [S = (Pc − a/2)/σh′ = (.5 − .25)/.71 = 0.35].

The recognition of the three items BO, AN, and BN is carried out in the same way. For the four items AN, BN, BO, and AO, respectively, the results are as follows: the net inputs are (1, 0, 2, 1, .5, −.5, .5, −.5), (1, 0, 2, 1, 1.5, .5, −.5, −1.5), (1, 0, 2, 1, −1.5, −.5, .5, 1.5), and (1, 0, 2, 1, −.5, .5, −.5, .5), where the first four numbers represent item nodes and the last four context nodes; the states of activation at recognition are (0, 0, 0, 1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 1, 0, 0), (0, 0, 1, 1, 0, 0, 0, 1), and (1, 0, 1, 0, 0, 1, 0, 1); the numbers of active nodes are 1, 2, 3, and 4; the standard deviations of the net inputs are 0.71, 1.08, 1.08, and 0.71; the recognition strengths are −0.17, 0.00, 0.11, and 0.35; and the unbiased responses are no, no, yes, and yes (an unbiased yes response is made when S > C, where C = 0 for unbiased responses). Thus, the subject responds correctly to all items, the recognition strengths are ordered according to the mirror effect, and the variance of the net input is larger for the high- than for the low-frequency words.
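Parts of this example can be reproduced directly in code. The sketch below (NumPy assumed) performs the four encodings and recovers the weight w11 = 1/2 and the context-node net inputs for the AO + CExp cue; given possible sign losses in the printed matrix, it is offered as an illustrative check rather than a full reproduction of every quoted value:

```python
import numpy as np

# The 4-node example: four encodings and the resulting weights (a = .5).
a = 0.5
BN, BO, AO, AN = (np.array(p, float) for p in
                  ([1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]))
CBN, CBO, CExp = (np.array(p, float) for p in
                  ([1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 1]))
W = np.zeros((4, 4))
for item, ctx in [(BN, CBN), (BO, CBO), (AO, CExp), (BO, CExp)]:
    W += np.outer(item - a, ctx - a)   # Equation 3
print(W[0, 0])                         # 0.5, matching the text's w11
print(W.T @ AO)                        # context net inputs: -0.5, 0.5, -0.5, 0.5
```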

SIMULATIONS OF THE FREQUENCY-BASED MIRROR EFFECT

In this section, the variance theory of the mirror effect is simulated in a connectionist framework that is consistent with a general connectionist theory of memory called TECO (Sikström, 1996a, 1996b). TECO has been used to account for a variety of memory phenomena involving recognition and recall, for example, successive testing of recall and recognition (Sikström, 1996b) and forgetting curves (Sikström, 1999). An extensive description of TECO is beyond the scope of this paper; interested readers are referred to the previous articles describing the model for details.

Procedure

The simulation started with initializing the weights to zero. Then 12 items were generated by randomly setting nodes to an active state with a probability of a. A preexperimental phase then followed, to generate the frequencies associated with the items. In this preexperimental phase, half of the items (i.e., 6) were encoded three times, each time in a different context. These items simulated the high-frequency words. The remaining 6 items were not encoded before the experimental phase; they simulated the low-frequency words.

In the experimental phase, one study-encoding context was generated, using the same procedure as the generation of the contexts in the preexperimental phase. Given that in a standard recognition experiment all studied items would be encoded in the same list, in the simulations the items were encoded once with the same study context. Three of the high-frequency items were encoded, and three of the low-frequency items were encoded. The other three high- and three low-frequency items were not encoded during the experimental phase; they simulated the new items. Each encoding was simulated by first activating the nodes in the item and context layers. The weights were then changed according to the learning rule (Equation 3).

At the recognition test, the patterns of activation of the 12 items and the study context were reinstated to the network. The net inputs were calculated for each node, and the recognition strength was calculated from all the nodes in the network. The somewhat unrealistic assumption that no learning or weight changes occurred during testing was adopted. However, this is a standard assumption often used in other simulations of recognition memory, cued recall, or categorization (e.g., Kruschke, 1992; Lewandowsky, 1991; Li & Lindenberger, 1999; Li, Lindenberger, & Frensch, 2000). One thousand five hundred simulation runs with different random item and context patterns were carried out. The results reported later are based on the average across these runs.

Parameters

The number of high-frequency patterns was six (each encoded three times preexperimentally, with three of them encoded once experimentally to simulate the old items), and the number of low-frequency patterns was six (three of them encoded experimentally once to simulate the old items). The following parameter settings were used: The number of nodes in each layer (N) was 30 (so the total number of nodes was 2N = 60), and the percentage of nodes active at encoding (a) was 20%. Initially, all the weights were set to zero. The recognition criterion (C) was set to 0.
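A compact end-to-end sketch of this procedure is given below (NumPy assumed). It is illustrative, not the original simulation code: the run count is reduced, the threshold uses the A1-based rule T = μh(O)/2, and the a/2 approximation from Note 1 is used, so the exact rates will differ somewhat from the reported values:

```python
import numpy as np

# Minimal sketch of the simulation: 12 items, 6 high frequency (3
# preexperimental contexts each), 3 old + 3 new items per class.
rng = np.random.default_rng(0)
N, a, runs = 30, 0.2, 300

def pattern():
    return (rng.random(N) < a).astype(float)

rates = np.zeros(4)                      # P(yes) for AN, BN, BO, AO
for _ in range(runs):
    items = [pattern() for _ in range(12)]   # 0-5 high freq, 6-11 low freq
    W = np.zeros((N, N))
    for i in range(6):                       # preexperimental encodings
        for _ in range(3):
            W += np.outer(items[i] - a, pattern() - a)
    study = pattern()                        # experimental study context
    old = [0, 1, 2, 6, 7, 8]                 # 3 high- and 3 low-frequency
    for i in old:
        W += np.outer(items[i] - a, study - a)
    T = a * (1 - a) ** 2 * N / 2             # half the old net input (A1)
    for i in range(12):
        h_item, h_ctx = W @ study, W.T @ items[i]
        z = np.concatenate([(items[i] == 1) & (h_item > T),
                            (study == 1) & (h_ctx > T)])
        pc = z.sum() / (2 * N)
        h_act = np.concatenate([h_item[items[i] == 1], h_ctx[study == 1]])
        S = (pc - a / 2) / h_act.std()       # Equation 5
        hf, is_old = i < 6, i in old
        idx = 3 if (is_old and not hf) else 2 if is_old else 1 if hf else 0
        rates[idx] += (S > 0) / (runs * 3.0)
print("P(AN), P(BN), P(BO), P(AO):", rates.round(2))
```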

Results

Figure 3A shows the results in terms of the density of the net inputs to an active node. The expected values of the net inputs are approximately equal for new high- and new low-frequency items [μh(AN) = .00, μh(BN) = .00] and for old low- and old high-frequency items [μh(AO) = .38, μh(BO) = .38]. The high-frequency items have a larger variance of the net inputs [σh(BN) = .49, σh(BO) = .48] than do the low-frequency items [σh(AN) = .41, σh(AO) = .40]. The variances of the old and the new distributions are approximately equal.

Figure 3B shows the density of the recognition strength. The results show the mirror effect: The hit rate is larger for low- than for high-frequency items, and the false alarm rate is lower for low- than for high-frequency items [P(AN) = .18, P(BN) = .25, P(BO) = .64, P(AO) = .71]. The standard deviation of the recognition strength is larger for old than for new words, and larger for the low-frequency words than for the high-frequency words [σs(BO) = 0.29, σs(BN) = 0.19, σs(AN) = 0.23, σs(AO) = 0.31]. These findings agree with the empirical data and with the predictions of the attention-likelihood theory (Glanzer et al., 1993). Thus, the simulation shows that the variance theory can account for the frequency-based mirror effect and the associated ROC curves.

EXPLICATIONS OF THE MECHANISMS

In this section, three essential mechanisms of the variance theory that account for the mirror effect are discussed in detail. The mechanisms involved are (1) the variance of the net input, (2) the expected value of the net input, and (3) the way in which the recognition strength is calculated by counting the number of nodes, so that performance is optimal.

Overview

The first mechanism is that items from the difficult class (i.e., high-frequency items in this case) have a higher variance of the net inputs than do items from the easy class (i.e., low-frequency items), but the variance is independent of whether the items are old or new. This mechanism is illustrated in Figure 4A (noncumulative). It should also be underscored that the difference in variance as a function of class membership is not a primitive of the theory that the theorist manipulates: It is the natural consequence of the rather plausible assumption that high-frequency items appear more often, and are associated with a greater variety of different contexts, than are low-frequency words. The second mechanism is that old items have a higher expected net input to nodes encoded in the active state than do new items (but this does not depend on the class of the items). This mechanism is illustrated in Figure 4B (cumulative). The third mechanism is the way recognition strength is measured by counting the active nodes, so that recognition performance (e.g., d′) is optimal or near-optimal. An extended analysis of optimality is presented at the end of the paper; however, the results in Figures 9A, 9B, 9C, and 9D summarize the main points here. The results from this analysis imply that the decision must be based on the nodes active at encoding (see Figure 9A), the percentage of active nodes must be low (see Figure 9A), the activation threshold needs to be between the expected net inputs of the new and old items (see Figure 9B), and the recognition strength must be normalized by the standard deviation of the net inputs of the item (see Figures 9C and 9D). The density of the percentage of active nodes is illustrated in Figure 4C, and the normalized percentage of active nodes is illustrated in Figure 4D. Here it is shown how these mechanisms predict the mirror effect. Below, these three mechanisms are explained in more detail.

Variance of the Net Input

Figure 3. (A) Simulations of net inputs. The vertical axis shows the simulated density of the net inputs. (B) Simulations of the mirror effect. The vertical axis shows the simulated density of the recognition strength.

The first, and perhaps the most important, mechanism is that the difficult class has a larger variability in the net inputs than does the easy class. As will be discussed later, this property is shared with other theories, such as the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). However, a unique aspect of the variance theory is that it is a necessary outcome of simply encoding high-frequency items more times than low-frequency items. In the subjective-likelihood account, a larger variability for high- than for low-frequency words is an assumption that is controlled by a free parameter (p0), to reflect that high-frequency words have more definitions than do low-frequency words. The variance theory needs no such assumption and no additional free parameters to control the variability. Whereas the subjective-likelihood account tries to capture word frequency with a free parameter, in the variance theory the word-frequency effect is simulated via the number of contexts associated with an item.

Owing to how the variance theory implements the relations between contexts and items, the variance of the net inputs increases with the frequency with which an item is encoded in different contexts. Intuitively, this occurs because a high-frequency item activates several different contexts, causing the representations of many competing contexts to be activated simultaneously. Low-frequency items are associated with fewer contexts than are high-frequency items, which causes the representations of fewer contexts to be activated simultaneously. Thus, low-frequency items have less variability in the net inputs than do high-frequency items.

At recognition, an item produces a net input in the context layer that is a mixture of the net input from the study context that the network is instructed to retrieve from and the net inputs from several uncorrelated preexperimental contexts that were associated with the item during the preexperimental phase. The study context that the network is instructed to retrieve from contributes to the correct active state, for example, by adding +1 to the net input of a node (which is the expected net input for a node encoded in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was in an active state, or it could add −1 to the net input if the particular preexperimental context was encoded in an inactive state (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input; that is, the expected values of the negative and positive contributions cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and −1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added; in this example, the variance is increased by 1² for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent (e.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or −1/4 [see Equation 3], each yielding an increase in variability of 1/4² = 1/16; an expected value of aN nodes contribute, so the total increase in variability is 4 × 1/16 = 1/4).

Figure 4. (A) The probability density of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The dotted vertical line is the activation threshold. (B) The cumulative probability distributions of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the cumulative probability distributions of the net inputs. (C) The density of the percentage of nodes active at recognition in the variance theory. The horizontal axis shows the percentage of nodes active at recognition, and the vertical axis the density. (D) The density of recognition strength in the variance theory. The horizontal axis shows the recognition strength, and the vertical axis the density of the recognition strength, using standard parameter values.
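The 1/4-per-pattern figure for uncorrelated items can be confirmed by simulation; here is a minimal Monte Carlo sketch (NumPy assumed; run counts illustrative):

```python
import numpy as np

# Check of the worked figures above (N = 8, a = .5): each additional
# uncorrelated stored pattern raises the net-input variance of an
# unrelated cue by about aN * (1/4)^2 = 1/4.
rng = np.random.default_rng(5)
N, a, runs = 8, 0.5, 20000
for P in (1, 2, 4):
    samples = []
    for _ in range(runs // P):
        W = np.zeros((N, N))
        for _ in range(P):
            W += np.outer((rng.random(N) < a) - a, (rng.random(N) < a) - a)
        cue = (rng.random(N) < a).astype(float)   # uncorrelated with store
        samples.extend(W @ cue)                   # net inputs to item nodes
    print(P, np.var(samples))                     # roughly 0.25 * P
```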

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in the context nodes increases linearly with how many times a given item is encoded within different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.

In the same way as the variability of the context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with how many times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.
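To make these two variance mechanisms concrete, here is a minimal simulation sketch in Python. It is illustrative only: the ±1 contribution scheme and the value a = .5 are taken from the numerical example above, not from the full model in the Appendix. The variance of the summed net input grows roughly linearly with the number of associated contexts (item frequency), while its expected value stays near zero; replacing frequency with list length illustrates the corresponding prediction for the item nodes.

    import numpy as np

    rng = np.random.default_rng(0)
    a = 0.5            # probability that a node was encoded in the active state (assumed)
    trials = 100000    # number of simulated nodes

    for f in (1, 2, 4, 8):  # item frequency: number of associated preexperimental contexts
        # Each associated context adds +1 (node encoded active) or -1 (encoded inactive).
        contributions = rng.choice([1.0, -1.0], size=(trials, f), p=[a, 1.0 - a])
        net = contributions.sum(axis=1)
        # The mean stays near 0; the variance grows about linearly with f (by 1 per context).
        print(f, round(net.mean(), 3), round(net.var(), 3))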

Expected Net Input

The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase are equal for the two classes. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given the exact same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are only interested in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to a smaller increase in the net input, than does a longer period of study time, for example, 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encoding in other learning contexts will not affect the expected net input, but it does affect the variability of the net input, as was demonstrated above. The item–study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, a difficult-class item) is equal to the expected net input for ARCTIC (a low-frequency word, an easy-class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and testing conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult-class item is equal to the expected net input for a new easy-class item.

The probability density functions of the net inputs for nodes in the active state are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted. New nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of the net inputs for the easy-class items [σh(A)] is smaller than the standard deviation of the net inputs for the difficult-class items [σh(B)]. The second mechanism is shown in the figures in that the expected net input of an easy-class new item [µh(AN)] is equal to the expected net input of a difficult-class new item [µh(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: The expected net input for the old easy-class items [µh(AO)] is equal to the expected net input for the old difficult-class items [µh(BO)].

Recognition Strength

The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that the performance is optimal or near-optimal. If the net input is above the activation threshold (T) and the node was active at encoding, the node is activated at recognition. Otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities or distributions predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. The variance theory thus has the nice property of accounting for the mirror effect across a large range of parameter values for the activation threshold: P(AN) < P(BN) < P(BO) < P(AO) for µh(AN) = µh(BN) < T < µh(AO) = µh(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy- and the difficult-class items, whereas the expected strengths of the net inputs are equal. The variability is lower for easy-class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy- than for the difficult-class items when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy-class items is higher than the hit rate for difficult-class items when the activation threshold is less than the expected value of the old items.

The activation threshold is set between the expected values of the new and the old net inputs, so that the performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for the difficult- and easy-class items:

T = 1/4 [µh(AN) + µh(BN) + µh(BO) + µh(AO)] = 1/4 [µh(BO) + µh(AO)] = 1/2 µh(O),

where the second equality uses the fact that the expected net inputs of the new distributions are zero.

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions to optimize the performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with a shift in the activation threshold necessary for keeping the performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test in order to maximize the performance.
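A minimal sketch of this threshold rule, assuming a normal approximation of the net inputs and the illustrative parameter values used for Figure 4D below (a = .1, new and old means of 0 and 1, class standard deviations of 1.25 and 1.56). The simplified tail computation here demonstrates only the ordering; it does not reproduce the exact probabilities quoted later for the full model.

    from statistics import NormalDist

    a = 0.10                        # proportion of nodes active at encoding
    mu_new, mu_old = 0.0, 1.0       # expected net inputs for new and old items
    sigma = {"A": 1.25, "B": 1.56}  # net-input SDs: easy (A) < difficult (B)

    # Activation threshold: the average of the four expected net inputs.
    T = (mu_new + mu_new + mu_old + mu_old) / 4.0

    def p_active(mu, sd):
        # Probability that a node active at encoding is also active at recognition.
        return a * (1.0 - NormalDist(mu, sd).cdf(T))

    p = {"AN": p_active(mu_new, sigma["A"]), "BN": p_active(mu_new, sigma["B"]),
         "BO": p_active(mu_old, sigma["B"]), "AO": p_active(mu_old, sigma["A"])}
    print(p["AN"] < p["BN"] < p["BO"] < p["AO"])  # mirror order holds: True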

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [µh(AN) = µh(BN) < µh(BO) = µh(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σp(O) > σp(N)], whereas the standard deviations of the net inputs are equal for old and new items [σh(N) = σh(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions because there are fewer nodes active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero. This standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes is approximately proportional to the percentage of active nodes [σp² = Pc(1 − Pc)/N ≈ Pc/N ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σp) is smaller for new than for old items [σp(AN) < σp(BN) < σp(BO) < σp(AO)], whereas the standard deviation of the net inputs is not [σh(AN) = σh(AO) < σh(BN) = σh(BO)]. The essential mechanism that produces these changes in the means and standard deviations is the nonlinearity introduced by the counting of the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because this improves the performance.

To optimize the performance, subjects base their recognition decision on the number of active nodes normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments when the responses are biased, but it plays no role for unbiased responses.



To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, of whether the item is new or old, and of test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σh′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σh′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviation of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, with the criterion C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) shifts the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) shifts the decision toward more certainty.
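In code, this decision stage reduces to a few lines. The sketch below is a restatement of the computation just described (the function names are mine; Pc, a, and the item's net-input standard deviation are assumed to be given):

    def recognition_strength(p_active, a, sigma_item):
        # Proportion of active nodes, corrected by its expected value a/2 and
        # normalized by the item's net-input standard deviation (Equation 5).
        return (p_active - a / 2.0) / sigma_item

    def respond_yes(strength, criterion=0.0):
        # With the unbiased criterion C = 0, the normalization does not change
        # the response; it matters only when C != 0 (biased responding).
        return strength >= criterion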

Figure 4D shows the density distribution of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult classes (in Figure 4C) change when they are normalized by the variance of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult-class items yield a small standard deviation of the recognition strength because the standard deviation of the net inputs is high. The easy-class items yield a large standard deviation of the recognition strength because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items because the low-frequency items have a smaller variance of net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.


that is presented at the end of the paper, together with an analysis of optimal performance.

The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a) plus one parameter for each class of words [the standard deviation of the net input, σh( )]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs for the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs for the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, µh(AN) = µh(BN) = 0, and the expected net inputs of the old distributions, µh(AO) = µh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: µs(AN) = −.0012, µs(BN) = −.0008, µs(BO) = .0008, and µs(AO) = .0012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal) using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists

The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize the performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.
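A numerical sketch of this, under the same hedged normal approximation as above, using the probability of active nodes as a proxy for the hit and false alarm rates (parameter values are illustrative):

    from statistics import NormalDist

    a, sigma = 0.10, 1.25
    for mu_old in (0.5, 1.0, 2.0):   # e.g., increasing study time
        T = mu_old / 2.0             # threshold re-optimized midway between new (0) and old
        false_alarm = a * (1.0 - NormalDist(0.0, sigma).cdf(T))
        hit = a * (1.0 - NormalDist(mu_old, sigma).cdf(T))
        print(mu_old, round(false_alarm, 4), round(hit, 4))
    # Hit rates rise while false alarm rates fall: dispersion.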

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold changes. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are µh(AO) = µh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize the performance. The distributions in Figure 6A are closer together than the distributions in Figure 4D. Thus, decreasing the net inputs, for example, by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite direction, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are µh(AO) = µh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain a near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list.


In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed at 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or for any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over

Table 1
General Table of Results From the Experiment

Condition    AN    BN    BO    AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13   .17   .69   .82        .60              .86
Frequency    .20   .28   .80   .68       1.01              .66
Time         .10   .15   .78   .76        .89              .81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low- (AN) and high- (BN) frequency words, the hit rates for high- (BO) and low- (AO) frequency words, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength and the vertical axes the density.


the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for µh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for µh(O) = 2].

List-Length Effect

Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy- and difficult-class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves

The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths for the two classes become equal. Figure 6D plots the predictions of the theory when all the standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = .0015, σs(BN) = .0012, σs(BO) = .0015, and σs(AO) = .0020. The ratios of these standard deviations must follow Equation 2. This is also the case here: σs(BN)/σs(AO) = .61 < .74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = .78 < .94 = σs(AN)/σs(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = .0016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy-class words than for difficult-class words. This prediction is supported by empirical data: Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.
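These yes-rates follow from the normal densities of the recognition strength. The sketch below uses the four means and standard deviations given earlier for Figure 4D (the negative signs on the new-item means are an inference from the mirror ordering) and reproduces the quoted values to within about .01:

    from statistics import NormalDist

    # Expected recognition strengths and SDs of the four distributions (from the text).
    dists = {"AN": (-0.0012, 0.0015), "BN": (-0.0008, 0.0012),
             "BO": ( 0.0008, 0.0015), "AO": ( 0.0012, 0.0020)}

    for C in (0.0, 0.0016):  # unbiased vs. conservative recognition criterion
        p_yes = {k: 1.0 - NormalDist(mu, sd).cdf(C) for k, (mu, sd) in dists.items()}
        print(C, {k: round(v, 2) for k, v in p_yes.items()})
    # C = 0 gives the mirror order; C = .0016 gives P(AN) > P(BN): no mirror effect.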

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN); P(BO, AN) < P(AO, AN); P(BN, AN) > .50; and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [µh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage to old high-frequency items when attention is focused on these items) must be larger than the effect of the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage to old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, than low-frequency items.
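Under the same hedged normal approximation as before, this account can be sketched in a few lines: raising the expected old net input of the high-frequency class to 2 moves the re-optimized threshold to 0.75 and leaves the high-frequency items with both the higher hit rate and the higher false alarm rate (the parameter values are illustrative; only the qualitative pattern is claimed):

    from statistics import NormalDist

    a = 0.10
    sigma = {"A": 1.25, "B": 1.56}                      # easy (A), difficult (B)
    mu = {"AN": 0.0, "BN": 0.0, "AO": 1.0, "BO": 2.0}   # extra study for class B

    T = sum(mu.values()) / 4.0   # re-optimized threshold: (0 + 0 + 1 + 2)/4 = 0.75

    p = {k: a * (1.0 - NormalDist(m, sigma[k[0]]).cdf(T)) for k, m in mu.items()}
    print(p["BN"] > p["AN"], p["BO"] > p["AO"])  # True True: no mirror effect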

It is apparent from the density of the net inputs (Figure 7A) that the density of recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [µh(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75, rather than 0.50.


Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs and the vertical axis the probability density of the net inputs. The expected value of the old high-frequency distribution (BO) is shifted to the right because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength and the vertical axis the probability density.

The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 4D: P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74].

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs do not depend on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = .0013, σs(BN) = .0011, σs(BO) = .0017, and σs(AO) = .0019. These results are similar to the results when the same amount of attention is paid to the two classes, as in Figure 4D: σs(AN) = .0015, σs(BN) = .0012, σs(BO) = .0015, and σs(AO) = .0020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation it follows that, if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit-rate distribution for the high-frequency words would be larger than that of the low-frequency distribution. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated in the experiment. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the number of presentations was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years, ranging from 18 to 29 years.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4–5 times per million and the high-frequency words an occurrence of 50–55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1 and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether the word was presented in the list or not. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of the recognition test was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition, followed by the presentation time condition. Nine subjects were given the control condition, followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The hit rate



was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate for the low-frequency words was larger. One-tailed paired t tests over the performance of each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], and not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. The prediction of the standard version of the attention-likelihood theory, however, was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1–5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly for the slopes of the false alarms [σs(BN)/σs(AN)], are shown in the last two columns of Table 1.

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but they were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is also fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

Σ_{i=1}^{8} (Observed_i − Predicted_i)² / σ_i².

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were N (= 1000) and the recognition criterion [ln(L) = 0].
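In code, that objective is a one-liner. The sketch below assumes a hypothetical predict function that maps a parameter vector to the eight predicted values, and uses scipy's general-purpose minimizer as one possible search method:

    import numpy as np
    from scipy.optimize import minimize

    def chi_square(params, observed, sigma, predict):
        # Sum over the 8 data points (4 yes-rates and 4 z-ROC slope ratios).
        predicted = predict(params)
        return float(np.sum((observed - predicted) ** 2 / sigma ** 2))

    # Hypothetical usage, with sigma set to ones as in the variance-theory fit:
    # result = minimize(chi_square, x0=initial_guess,
    #                   args=(observed, np.ones(8), predict_from_model))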

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of



nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σh(A)], and the standard deviation of the net inputs for the difficult-class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, µh(N) = 0, µh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities; the attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slopes, whereas the attention-likelihood theory accounts for 84% of the variance for the slopes. Thus, when three parameters were fitted, the variance theory accounted for the same amount of variance for the probabilities, and for more variance for the slopes, as compared with the attention-likelihood theory.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted with a single variable, namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed at 0.10 and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), at 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


probabilities. Figure 8B shows the corresponding results for the slopes. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory, using a single parameter, fits the slopes equally as well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities using one parameter is slightly lower than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed as functions of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (µh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [µh(AN) = µh(BN) and µh(AO) = µh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (µh) and the standard deviations of the net inputs (σh). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

Pc = a ∫_T^∞ [1/√(2πσh²)] exp[−(h − µh)²/(2σh²)] dh.   (6)

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), gives the expected recognition strength (µs):

µs = (Pc − a/2) / σh.

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because it simplifies the analytic solution; the variance theory, or the simulation, uses the standard deviation of the item. This approximation is good when there are a large number of features; however, for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and the item layers. The distribution of Pc is binomial, but it can, given a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/(2N)]^(1/2). The final result is scaled by the normalization factor 1/σh:

σs = (1/σh) [Pc(1 − Pc)/(2N)]^(1/2).   (7)

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

P(Y) = ∫_C^∞ [1/√(2πσs²)] exp[−(s − µs)²/(2σs²)] ds.

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [µs(A)] and B [µs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

P(A, B) = ∫_C^∞ (1/√(2π[σs²(A) + σs²(B)])) exp(−{s − [µs(A) − µs(B)]}² / {2[σs²(A) + σs²(B)]}) ds.

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
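The analytic solution is compact enough to restate as a Python sketch: the functions below transcribe Equations 6 and 7 and the two response probabilities, replacing the integrals with the normal distribution's closed-form cumulative function (the function names are mine):

    import math
    from statistics import NormalDist

    def p_c(mu_h, sigma_h, T, a):
        # Equation 6: probability that a node active at encoding is active at test.
        return a * (1.0 - NormalDist(mu_h, sigma_h).cdf(T))

    def mu_s(pc, a, sigma_h):
        # Expected recognition strength.
        return (pc - a / 2.0) / sigma_h

    def sigma_s(pc, sigma_h, N):
        # Equation 7: SD of the recognition strength (2N nodes in the two layers).
        return math.sqrt(pc * (1.0 - pc) / (2.0 * N)) / sigma_h

    def p_yes(mu, sd, C=0.0):
        # Probability that the recognition strength exceeds the criterion C.
        return 1.0 - NormalDist(mu, sd).cdf(C)

    def p_forced_choice(mu_a, sd_a, mu_b, sd_b, C=0.0):
        # Probability of choosing A over B in two-alternative forced choice.
        return 1.0 - NormalDist(mu_a - mu_b, math.hypot(sd_a, sd_b)).cdf(C)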


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is calculated in the model with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc with one, the variance of the recognition strength can be simplified to

σs² = Pc / (2Nσh²).

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

σs²(N) / σs²(O) ≈ Pc(N) / Pc(O).

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.80²).
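As a one-line check of this reading (a sketch under the small-Pc approximation just stated):

    import math

    def zroc_slope(p_active_new, p_active_old):
        # slope = sigma_s(N)/sigma_s(O), approximately sqrt(Pc(N)/Pc(O)) for small Pc
        return math.sqrt(p_active_new / p_active_old)

    print(round(zroc_slope(0.64, 1.0), 2))  # 0.8, inverting the example above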

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions:

σs(BO)/σs(AO) ≤ σh(A)/σh(B) ≈ σs(BN)/σs(AN).

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class:

σs(B)/σs(A) ≈ σh(A)/σh(B).

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of active nodes, setting the activation threshold between the old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network will not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

d′ = [µs(O) − µs(N)] / σs(NO) = [Pc(O) − Pc(N)] / {Pc(NO)[1 − Pc(NO)]/(2N)}^(1/2).   (8)

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strengths, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc( ) was calculated with Equation 6. The expected values of the net inputs and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).
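Equation 8 translates directly into code (a sketch; Pc(N) and Pc(O) would come from Equation 6):

    import math

    def d_prime(pc_new, pc_old, N):
        # Equation 8, with the pooled Pc(NO) = [Pc(N) + Pc(O)]/2 in the denominator.
        pc_no = (pc_new + pc_old) / 2.0
        return (pc_old - pc_new) / math.sqrt(pc_no * (1.0 - pc_no) / (2.0 * N))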

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding


(a). The results show that d′ is optimal for a = .052; d′ is lower for larger and for smaller a. The lower d′ for large a occurs because the interference from other items increases: For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. For very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important: Errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low: The number of possible representations increases with a. If there is only one node active in all the representations, there are N possible representations; if there are two nodes active, there are approximately N² possible representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = .81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is .71, which is near, but slightly lower than, the optimal value of .81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal value. If a is set to a nonoptimal value, the optimal value of T may

deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was also calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved in the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance


with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that if the standard deviations of the old and the new distributions are equal, optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition

criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
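As a quick numerical illustration of this intersection rule (the distribution parameters below are illustrative, not values from the text), the point where two unequal-variance normal densities cross can be found by equating their log densities, which reduces to a quadratic:

```python
import numpy as np

# Optimal criterion for unequal-variance normal strength distributions:
# the point where f_O(c) = f_N(c), i.e., likelihood ratio L = 1.
mu_n, sd_n, mu_o, sd_o = 0.0, 1.0, 1.0, 1.25   # illustrative values

# log f_O(c) = log f_N(c) reduces to A*c^2 + B*c + C = 0:
A = 1 / sd_n**2 - 1 / sd_o**2
B = 2 * (mu_o / sd_o**2 - mu_n / sd_n**2)
C = mu_n**2 / sd_n**2 - mu_o**2 / sd_o**2 - 2 * np.log(sd_o / sd_n)
roots = np.roots([A, B, C])
criterion = roots[(roots > mu_n) & (roots < mu_o)]  # the root between the means
print(criterion)
```

With the larger old-item variance there are two crossing points; the relevant criterion is the one lying between the two means.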

In the mirror effect there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. Optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: Optimal performance occurs when the likelihood ratios of the two classes are equal,

$$L(\mathrm{A}) = \frac{f[S(\mathrm{AO})]}{f[S(\mathrm{AN})]} = L(\mathrm{B}) = \frac{f[S(\mathrm{BO})]}{f[S(\mathrm{BN})]}.$$

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the two false alarm rates must cancel: f [S(AN)]ΔT(A) + f [S(BN)]ΔT(B) = 0. The likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), or f [S(AO)]ΔT(A) + f [S(BO)]ΔT(B) < 0. This shows that the change in the placement of the criteria from L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.
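A small numerical check of this argument (the class parameters are illustrative, assumed equal-variance for simplicity, and are not taken from the text): holding the total false alarm rate fixed and sweeping how it is split between the classes, the total hit rate peaks where the two likelihood ratios coincide.

```python
import numpy as np
from scipy.stats import norm

mu_a, mu_b = 1.5, 0.75          # Class A easy, Class B difficult (illustrative)
total_fa = 0.2                  # fixed average false alarm rate

fa_a = np.linspace(1e-4, 2 * total_fa - 1e-4, 4001)
fa_b = 2 * total_fa - fa_a
ca, cb = norm.isf(fa_a), norm.isf(fa_b)            # per-class criteria
hits = norm.sf(ca - mu_a) + norm.sf(cb - mu_b)     # total hits (sum of classes)

i = hits.argmax()
lr_a = norm.pdf(ca[i] - mu_a) / norm.pdf(ca[i])    # L(A) at the optimum
lr_b = norm.pdf(cb[i] - mu_b) / norm.pdf(cb[i])    # L(B) at the optimum
print(f"L(A) = {lr_a:.3f}, L(B) = {lr_b:.3f}")     # approximately equal
```

The derivative of the total hit rate with respect to the split is exactly L(A) − L(B), so the maximum sits where the two ratios match, which is what the printout shows.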

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can indirectly be inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted in

the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Classes A and B, and the total false alarm rate is equal to the average false alarm rate of Classes A and B.

For all recognition criteria, or for all false alarm rates, performance is better or equal after normalization, as compared with performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm


rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model

does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs for the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency


distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons: The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy class distributions are predicted to be larger than the standard deviations of the difficult class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition would make it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental, learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a

particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. These wrongly activated nodes are more likely to represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts


associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood theory's account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σh′), so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes, subtract the expected value, and divide by the standard deviation of the net input of the item. An explicit solution was presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The

subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the standard deviation of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.
Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.
Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.
Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.
Glanzer, M., Kisok, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.
Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.
Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.
Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.
Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neurosciences of memory (pp. 103-146). Seattle: Hogrefe & Huber.
Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.
McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.
Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.
Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.
Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.
Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM: retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.
Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University. (ISBN 91-7191-155-3)
Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.
Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.
Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.
Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

$$a\,\frac{\sigma_p(\mathrm{N})}{\sigma_p(\mathrm{N}) + \sigma_p(\mathrm{O})}.$$

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around .44a if the slope of the ROC curve is .8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input yields the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξi = ξj = 1:

$$\mu_h(\mathrm{O}) = \sum_{j=1}^{N}\Delta w_{ij}\,\xi_j = \sum_{j=1}^{N}(\xi_i - a)(\xi_j - a)\,\xi_j = a(1-a)^2 N. \tag{A1}$$

The expected value of the net inputs for the new items is zero:

$$\mu_h(\mathrm{N}) = 0. \tag{A2}$$

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

$$\sigma_h^2(\mathrm{N}) = \Bigl\langle \Bigl(\sum_{p=1}^{P}\sum_{j=1}^{N} \Delta w_{ij}\,\xi_j \Bigr)^{2} \Bigr\rangle = a^3(1-a)^2 PN. \tag{A3}$$

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item ( f ). All the aN weights supporting a context node contribute to the net input in the same direction:

$$\sigma_h^2(f) = f\,a^4(1-a)^2 N^2. \tag{A4}$$

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

$$\sigma_h^2(L) = L\,a^4(1-a)^2 N^2.$$

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where P patterns have been encoded, is

$$\sigma_h^2(\mathrm{O}) = (f + L)\,\frac{a^4(1-a)^2 N^2}{2} + a^3(1-a)^2 PN. \tag{A5}$$

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is instead based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the proportions of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)



with values of 1) and otherwise to an inactive state (i.e., with values of 0). Let a be a parameter that determines the expected probability that a node is active at encoding. This parameter does not change during the simulation and is assumed to be relatively small (for purposes of sparse representation). Note that a is the expected probability of active nodes, whereas the real percentage of active nodes for specific items or contexts varies around a.

The encoding-study phase. Encoding occurs by changing the connection weights between the item and the context layers. The weights (or the strengths of the connections) contain the information about what has been stored in the network. The weight between item node i and context node j is denoted wij, and it is initialized to zero. The weight change (Δwij) is calculated by the learning rule suggested by Hopfield (1982; see also Hertz, Krogh, & Palmer, 1991, for additional details). This is essentially a Hebbian learning rule that increases connection weights between simultaneously activated nodes. This rule is chosen here because it is more biologically plausible than other rules, such as the delta or the gradient-descent learning rules (e.g., Rumelhart et al., 1986) used in back-propagation networks. However, the variance theory can also be implemented with other learning rules.

According to the Hopfield (1982) learning rule, the weight change is computed as the outer product between the item and the context vectors of activation patterns, with the parameter a first subtracted from both vectors. This subtraction is mathematically necessary to keep the expected value of the weights at zero. Using the notation for item and context activation defined above, the weight change between these two elements of the item and context vectors can be written as

$$\Delta w_{ij} = (\xi_i^d - a)(\xi_j^c - a). \tag{3}$$
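A minimal sketch of this encoding step in code (the array shapes and parameter values here are illustrative, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N, a = 30, 0.1                                 # nodes per layer, activity level

item = (rng.random(N) < a).astype(float)       # xi^d: item activation pattern
context = (rng.random(N) < a).astype(float)    # xi^c: context activation pattern

# Equation 3: dw_ij = (xi_i^d - a)(xi_j^c - a), accumulated into the weights.
W = np.zeros((N, N))                           # w_ij: item node i, context node j
W += np.outer(item - a, context - a)
```

Repeating the `W +=` line for each studied item-context pair accumulates the full weight matrix.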

Word frequency as the number of associated contexts at the preexperimental phase. An item may be encoded more or less frequently and, hence, be associated with more or fewer different preexperimental contexts, depending on how often the item occurs in the subject's environment. In the model, at the preexperimental stage of the simulation, an item's frequency is simulated by the number of times the item is encoded in different contexts. A low-frequency item is encoded less often and is associated with a smaller number of contexts, whereas a high-frequency item is encoded more often and is associated with a larger number of different contexts. For instance, one might form a preexperimental association between SAAB and repair shop after experiencing the rare event of a new, expensive SAAB sports car breaking down halfway through a honeymoon trip to the Grand Canyon, with the SAAB having to be towed to a repair shop somewhere out in the desert. In implementing the variance theory, the relationship between word frequency and preexperimental item-context associations can be simulated straightforwardly. At the preexperimental stage of the simulation, a low-frequency item, SAAB, may be simulated by associating one context item, REPAIR SHOP, with it during encoding. A high-frequency word, CAR, may be simulated

by associating three different contexts, REPAIR SHOP, TAXI RIDE, and DRIVING TO WORK, with it during encoding.

The recognition test phase. At recognition, an item is

presented to the network; the representation of this item is reinstated as a cue to the item layer, and the representation of the study context (simulating an internally generated context regarding list or time information during the recognition experiment) is reinstated as a cue to the context layer. For example, the representation of the word CAR is reinstated at the item layer. Furthermore, the representation of the study context, STUDY LIST, is reinstated at the context layer. The subjects must have this information (or cue) in order to recognize an item from the particular study context (and not recognize the item from all the other preexperimental contexts). In the actual experimental setting, this information is usually conveyed to the subjects by the explicit instruction to recognize from the study context (e.g., "Do you recognize the word CAR from the list you just studied?").

At recognition, each node receives a signal (called the net input), which is computed on the basis of the other active nodes connected to it. Item nodes receive net inputs from active context nodes, and context nodes receive net inputs from active item nodes. The net input to a given node is simply the sum of the weights of all other active nodes connected to that node. For example, if item node 1 is connected to four context nodes (1, 2, 3, and 4), where context nodes 1 and 3 are active, the net input to item node 1 is w11 + w13. Thus, active nodes "send" information to the nodes that they are connected to, whereas inactive nodes do not send any information. Put another way, nodes "receive" information, or input, from the active nodes that they are connected to, but not from the inactive nodes.

Specifically, the net input to item node i is calculated by first multiplying the activity of the context node (labeled j) connected to this node by the weight of the connection between nodes i and j, and then summing over all connected nodes. In vector formalization, the weight matrix operates on the activation vector, and the output is the net input vector. The net input to node i (hid) at the item layer depends on the states of activation of the nodes in the context layer and the weights connecting these two layers:

$$h_i^d = \sum_{j=1}^{N} w_{ij}\,\xi_j^c. \tag{4}$$

Following the same principle, a similar function is used to calculate the net inputs to the context nodes. Specifically, the net input to context node j depends on the states of activation of the nodes in the item layer and the weights connecting the two layers:

$$h_j^c = \sum_{i=1}^{N} w_{ji}\,\xi_i^d.$$

By inserting Equation 3 into Equation 4 and summing over the p encoded patterns during list learning, it is easy to see that the net input to a node is simply the sum of Np terms; for example, the net input to an item node is

$$h_i^d = \sum_{p}\sum_{j=1}^{N} (\xi_i^d - a)(\xi_j^c - a)\,\xi_j^c.$$



For a = .5, the net inputs are binomially distributed with a certain expected value. Given a certain criterion [i.e., Np(1 − a)a > 10], a binomial distribution can be approximated with a normal distribution (Feller, 1968). For a ≠ .5 there are actually four outcomes; however, the same normal approximation can be used. Thus, for reasonably large parameter values of Npa, the distribution of net inputs to the nodes can be approximated by a normal distribution.

If the to-be-recognized item has not been encoded with the context (i.e., a new item), the net input is simply the sum of random weights. Because the expected values of all weights are zero, the expected value of the net inputs for new items will also be zero. If the item has been encoded with the context (i.e., an old item), the net input is the sum of those weights connected to that node whose respective context nodes were active at encoding. Owing to the adaptive weight changes during encoding, these weights will have an expected value that is larger than zero if both nodes were in the active state during encoding [i.e., each weight change at encoding is computed as (1 − a)²] and less than zero if one node was inactive and the other node was active at encoding [i.e., each weight change at encoding was −a(1 − a)]. Of specific importance for the theory is that the variance of the net inputs to the context nodes (from the item nodes) increases with the number of contexts that are associated with the item. Therefore, the variance of the net input is larger for high- than for low-frequency items. Similarly, the variance of the net input to the item nodes (from the context nodes) increases with the number of items associated with one context (i.e., list length). Therefore, given that the context is constant during a list presentation, the variance of the net inputs is larger for a long than for a short list.
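These claims are easy to check numerically. The following hedged sketch encodes one item in f different random contexts, plus unrelated filler pairs, and inspects the net inputs: the new-item mean is near zero, the old-item mean is near a(1 − a)²N (Equation A1), and the context-node variance grows with f. Parameter values and names are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N, a, fillers = 200, 0.1, 50                    # illustrative parameter values

def simulate(f):
    item = (rng.random(N) < a).astype(float)
    ctxs = (rng.random((f, N)) < a).astype(float)        # f study contexts
    W = sum(np.outer(item - a, c - a) for c in ctxs)     # encode item f times
    for _ in range(fillers):                             # unrelated pairs
        W += np.outer((rng.random(N) < a) - a, (rng.random(N) < a) - a)
    h_old = (W @ ctxs[0])[item == 1]    # item-node net inputs, old item cue
    h_new = W @ (rng.random(N) < a)     # item-node net inputs, new context cue
    h_ctx = W.T @ item                  # context-node net inputs (grow with f)
    return h_old.mean(), h_new.mean(), h_ctx.var()

print("predicted old mean:", a * (1 - a) ** 2 * N)       # Eq. A1: a(1-a)^2 N
for f in (1, 3, 9):
    old_mean, new_mean, ctx_var = simulate(f)
    print(f"f={f}: old mean {old_mean:.2f}, new mean {new_mean:.2f}, "
          f"context-node variance {ctx_var:.1f}")
```

The expected net inputs do not change with f, while the context-node variance climbs with it, which is the asymmetry the theory exploits.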

Brief summary of optimal performance. Given the strong selection pressure, arguably, humans and animals have evolved to achieve good memory performance. Therefore, it is reasonable to assume that the mechanisms for recognition decisions have evolved to optimal or near-optimal performance. Following this assumption, the parameter values in the model and the implementation of the model are guided by the idea that the model should perform optimally. A detailed discussion of the issue of optimal performance, with exact derivations of what constitutes optimal performance in the context of the present model, is presented later, in the sixth section. Here, I give a brief summary explaining the results from the analysis of optimal performance without going into the mathematical details (see Figures 9A, 9B, 9C, and 9D).

The model's performance is optimal if the percentage of nodes active at encoding (a) is low (see Figure 9A). For a low a, it is optimal to base the recognition decision on nodes that were active at encoding and to ignore nodes that were inactive during encoding (see Figure 9A). Also, for a low a, it is optimal to place the activation threshold

of the nodes between the expected values of the new and the old net inputs (see Figure 9B). Finally, it is optimal to normalize the recognition strength with the standard deviation of the net input (see Figures 9C and 9D).

For a low percentage of active nodes, it is optimal to base the recognition decision on the nodes that were active at encoding (or nodes active in the cue pattern) and to ignore the nodes that were inactive at encoding. At recognition, the state of activation of a node may be either active or inactive. Therefore, the nodes that are active in the cue pattern and have a net input above a certain activation threshold are activated at recognition; otherwise, the nodes are inactivated. Let zid denote the state of activation at recognition for item node i. An item node is activated at recognition (zid = 1) if it was active in the cue pattern (ξid = 1) and the net input is above the activation threshold (hid > T); otherwise, it is inactivated (zid = 0):

$$z_i^d = 1 \;\text{ if }\; \xi_i^d = 1 \;\text{ and }\; h_i^d > T; \quad \text{otherwise } z_i^d = 0.$$

Similarly, let zjc denote the state of activation at recognition for context node j:

$$z_j^c = 1 \;\text{ if }\; \xi_j^c = 1 \;\text{ and }\; h_j^c > T; \quad \text{otherwise } z_j^c = 0.$$

This way of activating nodes at retrieval differs from how nodes are activated in a standard Hopfield (1982) network, where the activation threshold is zero and a node is activated if the net input is above zero (independently of the state of activation in the cue pattern). The way of activating patterns in a Hopfield network is more likely to produce a retrieved pattern that matches the encoded pattern of activation (e.g., the expected value of active nodes at retrieval will be the same as the expected value of active nodes at encoding). However, as will be discussed later, the way suggested here for activating the nodes yields better performance in terms of discrimination between a target item and a distractor item.

As is shown in Figure 9B, performance is optimal when the activation threshold is set approximately between the new and the old net inputs. The activation threshold (T) is set to the expected value of the net inputs of nodes active during encoding (ξid = 1, ξjc = 1) for old and new items. The averaging is computed over all nodes (2N) and over all new and old patterns (P) in the recognition condition. If half of the items are new and half of the items are old, the activation threshold is

$$T = \frac{1}{2aPN}\sum_{P}\left[\,\sum_{i=1}^{N} h_i^d\,\xi_i^d + \sum_{j=1}^{N} h_j^c\,\xi_j^c\right].$$

It is easy to see that the expected percentage of old andnew active nodes at recognition is one half of the per-centage of active nodes at encoding (a 2) That is the ac-tivation threshold divides the old and new distribution in



the middle: Old items will have a higher expected percentage active, and new items will have a lower expected percentage active. The activation threshold is constant during recognition in one condition. However, it must vary between conditions, depending on the net inputs, to yield optimal performance.

The percentage of nodes active at recognition is counted:

$$P_c = \frac{1}{2N}\left(\sum_{i=1}^{N} z_i^d + \sum_{j=1}^{N} z_j^c\right).$$

As is shown in Figure 9C, the performance is near-optimal if the recognition decision is based on the number of nodes active at recognition, normalized by the standard deviation of the net inputs across the active features of the item (σh′). Thus, this standard deviation is calculated over all the nodes active at encoding (i.e., ξid = 1 and ξjc = 1; nodes inactive at encoding are not used when calculating the standard deviation, because for low levels of a these nodes carry little or no information about the item). The recognition strength (S) for an item is calculated by subtracting the expected percentage of nodes active at recognition (a/2)¹ from the real percentage of nodes active at recognition (Pc) and dividing by the standard deviation of the net inputs of the item (σh′):

nodes inactive at encoding are not used when calculatingthe standard deviation because for low levels of a theseitems carry little to no information of the item) The rec-ognition strength (S ) for an item is calculated by sub-tracting the expected percentage of nodes active at rec-ognition (a 2)1 from the real percentage of nodes activeat recognition (Pc) and dividing by the standard deviationof the net inputs of the item (shcent)

$$S = \frac{P_c - a/2}{\sigma_{h'}}. \tag{5}$$

The subtraction of the expected percentage of nodes active at recognition makes the expected value of the recognition strength (S) zero. This subtraction is necessary for the normalization to work properly. The subtraction moves the recognition strength distributions symmetrically, so that the old and the new distributions move at the same rate for a given standard deviation of the net input (without the subtraction, the old recognition strength distribution would be more affected than the new distribution). Thus, the recognition strength is determined by the difference between two probabilities (the percentage of active nodes, which varies, and the expected percentage of active nodes, which is constant), divided by the standard deviation of the net input. A yes response (Y) is given if the recognition strength (S) is above the recognition criterion (C). An unbiased decision has a recognition criterion of zero:

$$Y = S > C.$$
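Putting the retrieval steps together, here is a hedged sketch of the decision computation (the function and variable names are mine, not the paper's):

```python
import numpy as np

def recognition_strength(W, item_cue, context_cue, a, T):
    """Recognition strength S for one cue pair (Equations 4 and 5 plus the
    activation rule): only cue-active nodes whose net input clears T stay
    active, and the active-node proportion is normalized by the standard
    deviation of the net inputs over nodes active at encoding."""
    h_item = W @ context_cue                   # net inputs to item nodes
    h_context = W.T @ item_cue                 # net inputs to context nodes
    z_item = (item_cue == 1) & (h_item > T)
    z_context = (context_cue == 1) & (h_context > T)
    n = len(item_cue)
    p_c = (z_item.sum() + z_context.sum()) / (2 * n)
    active_h = np.concatenate([h_item[item_cue == 1],
                               h_context[context_cue == 1]])
    return (p_c - a / 2) / active_h.std(ddof=1)   # Equation 5

# A yes response is given when recognition_strength(...) > C (C = 0 unbiased).
```

The sample standard deviation (ddof=1) is used here because it matches the .71 value in the worked example below.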

An issue that may be raised is whether it is sensible to base recognition strength on two quite different sources, namely, the percentage of active nodes and the variability of the net input. The immediate answer is that if it is reasonable to optimize performance, it is also sensible to measure recognition strength this way. Another perspective is to note that unbiased responses can be made solely on the basis of the percentage of active nodes; that is, a yes response occurs if the percentage of active nodes is larger than the expected percentage of active nodes (Pc > a/2), and the variability of the net input can be ignored. Thus, "normally," subjects base their unbiased decisions on the percentage of active nodes, and the variability of the net input becomes relevant only when subjects are biased. From this perspective, the percentage of active nodes is used for unbiased responses, and the variability of the net input becomes relevant for confidence judgments. Therefore, by combining both the percentage of active nodes and the variability of the net input, the measure of recognition strength proposed here will also reflect the confidence judgment.

mallyrdquo subjects base their unbiased decisions on the per-centage of active nodes and the variability of active nodesonly becomes relevant when subjects are biased Fromthis perspective the percentage of active nodes is usedfor unbiased responses and the variability of the netinput becomes relevant for confidence judgments There-fore by combining both the percentage of active nodesand the variability of the net input the measure of recog-nition strength proposed here will also reflect the confi-dence judgment

An Example With Step-by-Step Computations
To clarify the computational details involved in the variance theory, a step-by-step example is given here. For tractability, a small network is used, consisting of four item nodes and four context nodes (see Figure 2). The actual simulations, to be reported later, involved a larger network architecture, with 30 nodes in each layer. The percentage of nodes active at encoding (denoted by the parameter a) is set to .50. Let item BN be represented as 1100, written as the states of activation of the four nodes ξ1d, ξ2d, ξ3d, ξ4d. Similarly, let 0011 represent item BO, 1010 represent item AO, and 0101 represent item AN. Let context CBN be represented as 1100, the states of activation of the four nodes ξ1c, ξ2c, ξ3c, ξ4c. Similarly, 0011 represents context CBO, and 0101 represents the experimental study context CExp.

Item BN is a high-frequency new word. For simplicity, it is here encoded only once, with context CBN, in the preexperimental phase (in the simulations below, high-frequency words are preexperimentally associated with three contexts). The 16 weights between the four item nodes and the four context nodes are changed according to the learning rule, where the probability that a node is active at encoding is determined by the parameter a = .5. For example, the weight change between item node 1 and context node 1 is Δw11 = [ξ1d(BN) − a][ξ1c(CBN) − a] = (1 − .5)(1 − .5) = 1/4, where BN is item BN and CBN represents context CBN. Similarly, item BO is another high-frequency word that, before the experimental phase, is encoded once with context CBO. Items AO and AN are low-frequency old and new words, and they are not encoded in the preexperimental phase.

In the experimental phase, item AO is encoded with the experimental context CExp. Finally, item BO is encoded with the same experimental context CExp. For example, the weight w11 is now equal to

w11 = [x1^d(BN) − a][x1^c(CBN) − a] + [x1^d(BO) − a][x1^c(CBO) − a]
    + [x1^d(BO) − a][x1^c(CExp) − a] + [x1^d(AO) − a][x1^c(CExp) − a]
    = (1 − .5)(1 − .5) + (0 − .5)(0 − .5) + (0 − .5)(0 − .5) + (1 − .5)(0 − .5)
    = 1/4 + 1/4 + 1/4 − 1/4 = 1/2.

After encoding, the full weight matrix is

  .5    1   −1   −.5
  .5    0    0   −.5
 −.5    0    0    .5
 −.5   −1    1    .5

corresponding to the weights w11, w12, . . . , w44, respectively.
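The encoding arithmetic above is easy to verify mechanically. The following minimal Python sketch (my own illustration, not code from the paper) applies the learning rule to the four encodings listed above; indexing the weights with context nodes as rows reproduces the matrix as printed:

```python
import numpy as np

a = 0.5  # probability that a node is active at encoding

# Item and context patterns as defined in the example.
items = {"BN": [1, 1, 0, 0], "BO": [0, 0, 1, 1], "AO": [1, 0, 1, 0], "AN": [0, 1, 0, 1]}
contexts = {"CBN": [1, 1, 0, 0], "CBO": [0, 0, 1, 1], "CExp": [0, 1, 0, 1]}

# Two preexperimental encodings followed by the two experimental encodings.
encodings = [("BN", "CBN"), ("BO", "CBO"), ("AO", "CExp"), ("BO", "CExp")]

W = np.zeros((4, 4))  # rows: context nodes; columns: item nodes
for item, context in encodings:
    xd = np.array(items[item], dtype=float)
    xc = np.array(contexts[context], dtype=float)
    W += np.outer(xc - a, xd - a)  # learning rule: w_ij += (x_i^c - a)(x_j^d - a)

print(W[0, 0])  # weight between context node 1 and item node 1: 0.5
print(W)        # the full weight matrix shown above
print(W[0] @ np.array(items["AO"], float))  # net input to context node 1 cued with AO: -0.5
```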


At recognition, the state of activation of the old low-frequency item AO is reinstated as a cue to the item layer, and the state of activation of the encoding context at the experimental phase, CExp, is reinstated as a cue to the context layer. The net inputs are calculated. For example, the net input to context node 1 is

h1^c = Σ_{i=1}^{4} w_{1i} x_i^d = .5 · 1 + 1 · 0 − 1 · 1 − .5 · 0 = −.5.

The net input to the item nodes is (1, 0, 2, 1), and that to the context nodes is (−.5, .5, −.5, .5). It can be seen that the expected net input for randomly created new items in this network is 0 and that the expected net input for old items encoded in the active state is .5. Therefore, the activation threshold is set to the average of these values, namely, T = .25. Nodes that have a net input above the activation threshold and that are active in the cue pattern are activated at recognition. Thus, the state of activation at recognition for the item nodes is (1, 0, 1, 0), and that for the context nodes is (0, 1, 0, 1) (which is identical to the cue patterns). The percentage of active nodes is counted: Pc(AO) = 4/8 = .5. For an unbiased response (C = 0), this will yield a yes response, because this percentage is larger than the expected percentage of active nodes at recognition (a/2 = .25). The standard deviation of the net input for nodes active at encoding is .71 (i.e., the standard deviation of 1, 2, .5, and .5, corresponding to nodes 1, 3, 6, and 8). The recognition strength is calculated by subtracting the expected percentage of active nodes at recognition from the percentage of active nodes for the to-be-recognized item and dividing by the standard deviation of the net input for active nodes: S = (Pc − a/2)/σh′ = (.5 − .25)/.71 = .35.

The recognition of the three items BO, AN, and BN is done in the same way. The results for the four items AN, BN, BO, and AO are: the net inputs (1, 0, 2, 1, .5, −.5, .5, −.5), (1, 0, 2, 1, 1.5, .5, −.5, −1.5), (1, 0, 2, 1, −1.5, −.5, .5, 1.5), and (1, 0, 2, 1, −.5, .5, −.5, .5), where the first four numbers represent item nodes and the last four context nodes; the states of activation at recognition (0, 0, 0, 1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 1, 0, 0), (0, 0, 1, 1, 0, 0, 0, 1), and (1, 0, 1, 0, 0, 1, 0, 1); the numbers of nodes active 1, 2, 3, and 4; the standard deviations of the net inputs .71, 1.08, 1.08, and .71; the recognition strengths −.17, .00, .11, and .35; and the unbiased responses no, no, yes, and yes, respectively (an unbiased yes response is made when S > C, where C = 0 for unbiased responses). Thus, the subject responds correctly to all items, the recognition strengths are ordered according to the mirror effect, and the variance of the net input is larger for the high- than for the low-frequency words.
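The bookkeeping for the four test items can likewise be scripted. The sketch below (again my illustration) takes the net inputs and cue patterns listed above as given and reproduces the counts of active nodes, the standard deviations, and the recognition strengths, up to rounding (e.g., it prints −.18 and .12 where the text truncates to −.17 and .11):

```python
import numpy as np

a, C, T = 0.5, 0.0, 0.25  # activity level, recognition criterion, activation threshold

# Net inputs (4 item nodes, then 4 context nodes) and item cue patterns, as listed above.
cases = {
    "AN": ([1, 0, 2, 1,  0.5, -0.5,  0.5, -0.5], [0, 1, 0, 1]),
    "BN": ([1, 0, 2, 1,  1.5,  0.5, -0.5, -1.5], [1, 1, 0, 0]),
    "BO": ([1, 0, 2, 1, -1.5, -0.5,  0.5,  1.5], [0, 0, 1, 1]),
    "AO": ([1, 0, 2, 1, -0.5,  0.5, -0.5,  0.5], [1, 0, 1, 0]),
}
context_cue = [0, 1, 0, 1]  # CExp is the context cue for every test item

for name, (h, item_cue) in cases.items():
    h = np.array(h, dtype=float)
    cue = np.array(item_cue + context_cue, dtype=bool)  # nodes active in the cue patterns
    active = cue & (h > T)                # above threshold and cued -> active at recognition
    Pc = active.sum() / h.size            # percentage of active nodes
    sigma = np.std(h[cue], ddof=1)        # SD of net inputs for the cued nodes
    S = (Pc - a / 2) / sigma              # recognition strength (Equation 5)
    print(name, int(active.sum()), round(sigma, 2), round(S, 2),
          "yes" if S > C else "no")
```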

SIMULATIONS OF THE FREQUENCY-BASED MIRROR EFFECT

In this section, the variance theory of the mirror effect is simulated in a connectionist framework that is consistent with a general connectionist theory of memory called TECO (Sikström, 1996a, 1996b). TECO has been used to account for a variety of memory phenomena involving recognition and recall, for example, successive testing of recall and recognition (Sikström, 1996b) and forgetting curves (Sikström, 1999). An extensive description of TECO is beyond the scope of this paper; interested readers are referred to previous articles describing the model for details.

Procedure

The simulation started with initializing the weights to zero. Then, 12 items were generated by randomly setting the nodes to an active state with a probability of a. A preexperimental phase then followed, to generate the frequency associated with the items. In this preexperimental phase, half of the items (i.e., 6) were encoded three times, each time in a different context. These items simulated the high-frequency words. The remaining 6 items were not encoded before the experimental phase, and they simulated the low-frequency words.

At the experimental phase, one study-encoding context was generated, using the same procedure as the generation of the contexts in the preexperimental phase. Given that in a standard recognition experiment all studied items would be encoded in the same list, in the simulations the items were thus encoded once with the same study context. Three of the high-frequency items were encoded, and three of the low-frequency items were encoded. The other three high- and three low-frequency items were not encoded during the experimental phase, and they simulated the new items. Each encoding was simulated by first activating the nodes in the item and context layers. The weights were then changed according to the learning rule (Equation 3).

At the recognition test, the patterns of activation of the 12 items and the study context were reinstated to the network. The net inputs were calculated for each node, and the recognition strength was calculated from all the nodes in the network. The somewhat unrealistic assumption that no learning or changes of weights occurred during testing was adopted. However, this is a standard assumption often used in other simulations of recognition memory, cued recall, or categorization (e.g., Kruschke, 1992; Lewandowsky, 1991; Li & Lindenberger, 1999; Li, Lindenberger, & Frensch, 2000). One thousand five hundred simulation runs, with different random item and context patterns, were carried out. The results reported later are based on the average across these runs.
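As a rough guide to the procedure, the sketch below implements one reading of it in Python. The learning rule, the net-input computation, and the placement of the activation threshold follow the equations only approximately (the exact Equations 3 and 4 are not reproduced in this excerpt), so the printed rates should match the reported results qualitatively rather than exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
N, a, C = 30, 0.20, 0.0          # nodes per layer, activity at encoding, criterion
T = 0.5 * a * N * (1 - a) ** 2   # assumed threshold: midway between the expected
                                 # new (0) and old (~aN(1 - a)^2) net inputs
S = {"AN": [], "BN": [], "BO": [], "AO": []}

def pattern():
    return (rng.random(N) < a).astype(float)

for _ in range(1500):                               # 1,500 simulation runs
    items = [pattern() for _ in range(12)]          # items 0-5: high freq; 6-11: low freq
    W = np.zeros((N, N))                            # context (rows) x item (columns) weights
    for item in items[:6]:                          # preexperimental phase:
        for _ in range(3):                          # each HF item encoded 3 times,
            W += np.outer(pattern() - a, item - a)  # each time with a new random context
    study = pattern()                               # one study context for the whole list
    for item in items[:3] + items[6:9]:             # old items encoded once with it
        W += np.outer(study - a, item - a)
    for k, item in enumerate(items):                # recognition test on all 12 items
        h = np.concatenate([W.T @ study, W @ item]) # net inputs: item layer, context layer
        cue = np.concatenate([item, study]) > 0     # nodes active in the cue patterns
        if cue.sum() < 2:
            continue                                # degenerate cue (rare); skip
        Pc = ((h > T) & cue).sum() / (2 * N)        # percentage of active nodes
        s = (Pc - a / 2) / np.std(h[cue], ddof=1)   # recognition strength (Equation 5)
        S[["BO", "BN", "AO", "AN"][k // 3]].append(s)

for cls in ("AN", "BN", "BO", "AO"):
    x = np.array(S[cls])
    print(cls, "P(yes) = %.2f" % (x > C).mean(), "SD(S) = %.3f" % x.std())
```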

Parameters

The number of high-frequency patterns was six (each encoded three times preexperimentally, and three of them encoded once experimentally to simulate the old items), and the number of low-frequency patterns was six (three of them encoded experimentally once to simulate the old items). The following parameter settings were used. The number of nodes in each layer (N) was 30 (the total number of nodes was 2N = 60), and the percentage of nodes


active at encoding (a) was .20. Initially, all the weights were set to zero. The recognition criterion (C) was set to 0.

Results

Figure 3A shows the results in terms of the density of the net inputs to an active node. The expected values of the net inputs are approximately equal for new high- and new low-frequency items [μh(AN) = .00, μh(BN) = .00] and for old low- and old high-frequency items [μh(AO) = .38, μh(BO) = .38]. The high-frequency items have a larger variance of the net inputs [σh(BN) = .49, σh(BO) = .48] than do the low-frequency items [σh(AN) = .41, σh(AO) = .40]. The variances of the old and the new distributions are approximately equal.

Figure 3B shows the density of the recognition strength. The result shows the mirror effect: The hit rate is higher for low- than for high-frequency items, and the false alarm rate is lower for low- than for high-frequency items [P(AN) = .18, P(BN) = .25, P(BO) = .64, P(AO) = .71]. The standard deviation of the recognition strength is larger for old than for new words, and larger for the low-frequency words than for the high-frequency words [σs(AN) = .029, σs(BN) = .019, σs(BO) = .023, σs(AO) = .031]. These findings agree with the empirical data and the predictions of the attention-likelihood model (Glanzer et al., 1993). Thus, the simulation shows that the variance theory can account for the frequency-based mirror effect and the associated ROC curves.

EXPLICATIONS OF THE MECHANISMS

In this section, three essential mechanisms of the variance theory that account for the mirror effect are discussed in detail. The mechanisms involved are (1) the variance of the net input, (2) the expected value of the net input, and (3) the way in which the recognition strength is calculated, by counting the number of active nodes so that performance is optimal.

Overview

The first mechanism is that items from the difficult class (i.e., high-frequency items in this case) have a higher variance of the net inputs than do items from the easy class (i.e., low-frequency items), but the variance is independent of whether the items are old or new. The mechanism is illustrated in Figure 4A (noncumulative). It should also be underscored that the difference in variance as a function of class membership is not a primitive of the theory that the theorist manipulates. It is the natural consequence of the rather plausible assumption that high-frequency items appear more often, and are associated with a greater variety of different contexts, than are low-frequency words. The second mechanism is that old items have a higher expected net input to nodes encoded in the active state than do new items (but this does not depend on the class of the items). The mechanism is illustrated in Figure 4B (cumulative). The third mechanism is the way recognition strength is measured, by counting the active nodes so that recognition performance (e.g., d′) is optimal or near-optimal. An extended analysis of optimality is presented at the end of the paper; however, the results in Figures 9A-9D can summarize the main points here. The results from this analysis imply that the decision must be based on the nodes active at encoding (see Figure 9A), the percentage of active nodes must be low (see Figure 9A), the activation threshold needs to be between the expected net inputs of the new and old items (see Figure 9B), and the recognition strength is normalized by the standard deviation of the net inputs of the item (see Figures 9C and 9D). The density of the percentage of active nodes is illustrated in Figure 4C, and the normalized percentage of active nodes is illustrated in Figure 4D. Here it is shown how these mechanisms predict the mirror effect. Below, these three sets of mechanisms are explained in more detail.

Variance of the Net Input

The first, and perhaps the most important, mechanism is that the difficult class has a larger variability in the net inputs than does the easy class.

Figure 3. (A) Simulations of net inputs. The vertical axis shows the simulated density of the net inputs. (B) Simulations of the mirror effect. The vertical axis shows the simulated density of the recognition strength.


As will be discussed later, this property is shared with other theories, such as the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). However, a unique aspect of the variance theory is that it is a necessary outcome of simply encoding high-frequency items more times than low-frequency items. In the subjective-likelihood account, a larger variability for high- than for low-frequency words is an assumption that is controlled by a free parameter (p0), to reflect that high-frequency words have more definitions than do low-frequency words. The variance theory needs no such assumptions and no additional free parameters to control the variability. Whereas the subjective-likelihood account tries to capture word frequency with a free parameter, in the variance theory the word frequency effect is simulated via the number of contexts associated with an item.

Owing to how the variance theory implements the relations between contexts and items, the variance of the net inputs increases with the frequency with which an item is encoded in different contexts. Intuitively, this occurs because a high-frequency item activates several different contexts, causing the representations of many competing contexts to be activated simultaneously. Low-frequency items are associated with fewer contexts than are high-frequency items. This causes the representations of fewer contexts to be activated simultaneously. Thus, low-frequency items have less variability in the net inputs than do high-frequency items.

At recognition, an item produces a net input in the context layer that is a mixture of the net inputs from the study context that the network is instructed to retrieve from and from several uncorrelated preexperimental contexts that were associated with the item during the preexperimental phase. The study context that the network is instructed to retrieve from contributes to the correct active state, for example, by adding +1 to the net input of a node (+1 is the expected net input for a node encoded in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input.

Figure 4. (A) The probability density of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The dotted vertical line is the activation threshold. (B) The cumulative probability distributions of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the cumulative probability distributions of the net inputs. (C) The density of the percentage of nodes active at recognition in the variance theory. The horizontal axis shows the percentage of nodes active at recognition, and the vertical axis the density. (D) The density of recognition strength in the variance theory, using standard parameter values. The horizontal axis shows the recognition strength, and the vertical axis the density of the recognition strength.


The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was encoded in an active state, or it could subtract 1 from the net input if the node was encoded in an inactive state in that particular preexperimental context (−1 is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input; that is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and −1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1² for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or −1/4 [see Equation 3], each yielding an increase in variability of (1/4)² = 1/16. An expected value of aN nodes contribute, so the total increase in variability is 4 · 1/16 = 1/4.)
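The claim that each added context leaves the mean unchanged but adds its squared contribution to the variance can be checked numerically. A small Monte Carlo sketch (illustrative assumptions only: each context contributes ±1 with equal probability, as in the N = 8, a = .5 case above):

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 200_000

# Each preexperimental context adds +1 or -1 to a node's net input
# with equal probability (the N = 8, a = .5 case discussed above).
for f in [0, 1, 2, 3]:  # number of preexperimental contexts (word frequency)
    noise = rng.choice([-1.0, 1.0], size=(trials, f)).sum(axis=1)
    print(f, "mean = %+.3f" % noise.mean(), "variance = %.3f" % noise.var())
```

The printed variance grows linearly with the number of contexts while the mean stays near zero, which is the mechanism driving the class difference.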

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in context nodes increases linearly with the number of times a given item is encoded in different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.

In the same way as the variability of the context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with the number of times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input

The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase are equal for the two classes. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given the exact same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are only interested in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables, such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to a smaller increase in the net input, than does a longer period of study time, for example, 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encodings in other learning contexts will not affect the expected net input, but they do affect the variability of the net input, as was demonstrated above. The item-study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, as a difficult class item) is equal to the expected net input for ARCTIC (a low-frequency word, as an easy class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and the testing experimental conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult class item is equal to the expected net input for a new easy class item.

The probability density functions of the net inputs for nodes in the active state are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted; new nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of the net inputs for the difficult class items [σh(B)] is larger than the standard deviation of the net inputs for the easy class items [σh(A)]. The second mechanism is shown in the figures in that the expected net input of an easy class new item [μh(AN)] is equal to the expected net input of a difficult class new item [μh(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: The expected net input for the old easy class items [μh(AO)] is equal to the expected net input for the old difficult class items [μh(BO)].

Recognition Strength

The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that performance is optimal or near-optimal. If the net input is above the activation threshold (T) and the node was active at encoding, the node is activated at recognition; otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities and distributions predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of parameter values for the activation threshold: P(AN) < P(BN) < P(BO) < P(AO) for μh(AN) = μh(BN) < T < μh(AO) = μh(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy and the difficult class items, whereas the expected strengths of the net inputs are equal. The variability is lower for easy class items, thus making the probability of false alarms (i.e., the probability of active nodes) lower for the easy than for the difficult class items when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (the probability of active nodes) for easy class items is higher than the hit rate for difficult class items when the activation threshold is less than the expected value of the old items.

The activation threshold is set to be between the expected values of the new and the old net inputs, so that performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for the difficult and easy class items, respectively:

T = (1/4)[μh(AN) + μh(BN) + μh(BO) + μh(AO)] = (1/4)[μh(BO) + μh(AO)] = (1/2)μh(O),

where the second equality holds because the expected net inputs of the new distributions are zero.

Thus, in the variance model, the activation threshold is fixed for recognition within one condition, although it may vary between different recognition conditions to optimize performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with the shift in the activation threshold necessary for keeping performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test in order to maximize performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [μh(AN) = μh(BN) < μh(BO) = μh(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σp(O) > σp(N)], whereas the standard deviations of the net inputs are equal for old and new items [σh(N) = σh(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions, because fewer nodes are active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero. This standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes (σp²) is approximately proportional to the percentage of active nodes [σp² = Pc(1 − Pc)/N ≈ Pc/N ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σp) is smaller for new than for old items [σp(AN) < σp(BN) < σp(BO) < σp(AO)], whereas the standard deviation of the net input is not [σh(AN) = σh(AO) < σh(BN) = σh(BO)]. The essential mechanism that produces these changes in the means and standard deviations is the nonlinearity introduced by counting the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because this improves performance.
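The standard deviation plotted in Figure 5 is just the binomial standard deviation of a proportion. A two-line check (assuming independent nodes, with 2N = 100 as in the figure):

```python
import numpy as np

n = 100  # total number of nodes (2N), as in Figure 5
for Pc in [0.0, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0]:
    print("Pc = %.2f  SD of proportion = %.4f" % (Pc, np.sqrt(Pc * (1 - Pc) / n)))
```

The printed values rise from zero, peak at Pc = .5, and fall back to zero at Pc = 1, reproducing the shape of Figure 5.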

To optimize performance, subjects base their recognition decision on the number of active nodes, normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments when the responses are biased, but it plays no role for unbiased responses.


To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σh′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σh′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviations of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the decision can be simplified to a comparison of the percentage of active nodes with a fixed criterion: S = Pc, with the criterion at C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) pushes the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) allows the decision to be more certain.

Figure 4D shows the density distribution of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult class (in Figure 4C) change when they are normalized by the variance of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high. The easy class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of the net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution that is presented at the end of the paper, together with an analysis of optimal performance.

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.



The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a), plus one parameter for each class of words (the standard deviation of the net input, σh). The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs for the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs for the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = .5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = −.0012, μs(BN) = −.0008, μs(BO) = .0008, and μs(AO) = .0012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists

The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold: The change in the activation threshold optimizes performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep, or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = .5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to .25 to optimize performance. The distributions in Figure 6A are closer together than the distributions in Figure 4D. Thus, decreasing the net inputs, for example, by diminishing study time, moves the distributions closer together, showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite direction, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.
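Concentering and dispersion can be illustrated with the net-input tail probabilities alone. The sketch below is an illustration, not the paper's full analytic solution (which also involves the counting and normalization stages); it shows how raising the expected old net input while re-centering the threshold drives the hit and false alarm probabilities apart:

```python
from math import erf, sqrt

def tail(mu, sigma, T):
    """P(net input > T) for a normally distributed net input."""
    return 0.5 * (1 - erf((T - mu) / (sigma * sqrt(2))))

sigma = 1.25                      # SD of the net inputs (the easy class value above)
for mu_old in [0.5, 1.0, 2.0]:    # e.g., increasing study time
    T = mu_old / 2                # threshold midway between new (0) and old means
    print("mu(O) = %.1f:  P(active | new) = %.2f,  P(active | old) = %.2f"
          % (mu_old, tail(0.0, sigma, T), tail(mu_old, sigma, T)))
```

As mu(O) grows, the old probability rises while the new probability falls, which is dispersion; running the values in the reverse order illustrates concentering.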

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list.


In this case, the activation threshold is also constant during the recognition test, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed at .4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (.5, 1, and 2), or for any value above .4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = .5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

Table 1
General Table of Results From the Experiment

Condition     AN    BN    BO    AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13   .17   .69   .82        .60              .86
Frequency    .20   .28   .80   .68       1.01              .66
Time         .10   .15   .78   .76        .89              .81

Note: The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low- (AN) and high- (BN) frequency words and the hit rates for high- (BO) and low- (AO) frequency words, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength, and the vertical axes the density.



List-Length Effect

Everything else being equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy and difficult class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves

The percentage of nodes active at recognition is smaller for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero; this standard deviation then increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) than for the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths themselves, become equal. Figure 6D plots the predictions of the theory when all the standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = .0015, σs(BN) = .0012, σs(BO) = .0015, and σs(AO) = .0020. The ratios of these standard deviations must follow Equation 2. This is also the case: σs(BN)/σs(AO) = .61 < .74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = .78 < .94 = σs(AN)/σs(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = .0016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy class words than for difficult class words. This prediction is supported by empirical data: Greene (1996) asked subjects to respond yes only if they were sure of their response, and, consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN); P(BO, AN) < P(AO, AN); P(BN, AN) > .50; and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.
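These yes probabilities follow from treating the four recognition strength distributions as normal, with the means and standard deviations given above for Figure 4D. The check below reproduces the reported values to within about .01 (small differences presumably reflect rounding of the parameters); the forced-choice probabilities are computed as P(S1 > S2) for independent normal strengths, which is my reading of the forced-choice rule:

```python
from math import erf, sqrt

mu = {"AN": -0.0012, "BN": -0.0008, "BO": 0.0008, "AO": 0.0012}
sd = {"AN": 0.0015, "BN": 0.0012, "BO": 0.0015, "AO": 0.0020}

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF

def p_yes(cls, C=0.0):
    """P(S > C) when S is normal with the class's mean and SD."""
    return 1 - Phi((C - mu[cls]) / sd[cls])

def p_forced(c1, c2):
    """P(S_c1 > S_c2) for independent normal strengths (forced choice)."""
    return Phi((mu[c1] - mu[c2]) / sqrt(sd[c1] ** 2 + sd[c2] ** 2))

print([round(p_yes(c), 2) for c in ("AN", "BN", "BO", "AO")])          # ~ .21 .25 .70 .73
print([round(p_yes(c, 0.0016), 2) for c in ("AN", "BN", "BO", "AO")])  # ~ .03 .02 .30 .42
print(round(p_forced("BO", "BN"), 2), round(p_forced("AO", "BN"), 2))  # ~ .80 .80
print(round(p_forced("BN", "AN"), 2), round(p_forced("AO", "BO"), 2))  # ~ .58 .56
```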

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations of the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)].


This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage to old high-frequency items when attention is focused on these items) must be larger than the effect from the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage to old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, than low-frequency items.

It is apparent from the density of the net inputs (Figure 7A) that the density of the recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes .75 rather than .50. (The probability that a node is active at recognition follows from the analytic solution presented at the end of the paper,

Pc = a ∫_T^∞ [1/(√(2π) σh)] e^{−(h − μh)²/(2σh²)} dh.)

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the high-frequency class (BO) is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

Figure 7B does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 6B: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74].

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs do not depend on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = .0013, σs(BN) = .0011, σs(BO) = .0017, and σs(AO) = .0019. These results are similar to the results when the same amount of attention is paid to the two classes, in Figure 4D: σs(AN) = .0015, σs(BN) = .0012, σs(BO) = .0015, and σs(AO) = .0020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect in which the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the old high-frequency distribution would be larger than that of the old low-frequency distribution. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency false alarm distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the amount of study time was manipulated and one in which the number of presentations was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20, ranging from 18 to 29 years old.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether or not the word had been presented in the list. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition testing was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition, followed by the presentation time condition. Nine subjects were given the control condition, followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The hit rate was higher for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions.


In the control condition, the hit rate for the low-frequency words was higher. One-tailed paired t tests over the performance of each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = .004, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = .004, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = .003, p = .34 > .05].

The false alarm rate was higher for the high-frequency words in all the conditions. However, it was significantly higher only in the presentation frequency condition [t(11) = 1.8, MSe = .003, p = .048 < .05], and not in the presentation time condition [t(11) = 1.5, MSe = .001, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = .002, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly higher for the high-frequency words than for the low-frequency words; thus, there was no mirror effect. The prediction of the standard version of the attention-likelihood theory, however, was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was higher than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1-5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were then conducted. The slope of the linear regression curve relating the z-transformed hit rate of the low-frequency words to the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and the corresponding slope for the false alarms [σs(BN)/σs(AN)], are shown in the last two columns of Table 1.
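For concreteness, the slope computation can be written out as follows; the cumulative hit rates in the sketch are hypothetical stand-ins (the real analysis used each subject's observed ratings):

```python
import numpy as np
from statistics import NormalDist

z = lambda p: NormalDist().inv_cdf(p)  # z-transformation (inverse normal CDF)

# Hypothetical cumulative hit rates at confidence criteria >=1 ... >=5.
low_freq  = [0.93, 0.87, 0.80, 0.67, 0.47]   # hits, low-frequency (A) words
high_freq = [0.87, 0.80, 0.67, 0.53, 0.33]   # hits, high-frequency (B) words

z_low = np.array([z(p) for p in low_freq])
z_high = np.array([z(p) for p in high_freq])

# Regressing z(low-frequency hits) on z(high-frequency hits) estimates the
# slope sigma_s(BO)/sigma_s(AO); the same is done with the false alarm
# rates for the new distributions.
slope = np.polyfit(z_high, z_low, 1)[0]
print(round(slope, 2))
```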

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items should be higher than that for low-frequency words at encoding but lower at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is also fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

$$\frac{1}{8}\sum_{i}\frac{\left(\mathrm{Observed}_i-\mathrm{Predicted}_i\right)^2}{\sigma_i^2}.$$

Three parameters were fitted: namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were N = 1000 and the recognition criterion [ln(L) = 0].
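A rough sketch of this kind of fit is shown below. It assumes scipy is available and a hypothetical predict(params) function that maps the three free parameters onto the eight predicted values; none of this is Glanzer et al.'s (1993) actual code.

```python
import numpy as np
from scipy.optimize import minimize

def fit_error(params, predict, observed, sigma):
    # Mean squared error divided by the variance, over the eight data
    # points (four yes-rates and four ROC slope ratios).
    predicted = predict(params)
    return np.mean(((observed - predicted) / sigma) ** 2)

# Example call, once predict, observed, and sigma are defined
# (x0 holds illustrative starting values for n(A), n(B), and p(new)):
# result = minimize(fit_error, x0=[40.0, 120.0, 0.5],
#                   args=(predict, observed, sigma), method="Nelder-Mead")
```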

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of


nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σh(A)], and the standard deviation of the net inputs for the difficult-class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.

The results show that the variance theory accounts for .98 (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for .93 of the variance for the slope. The attention-likelihood theory accounts for .84 of the variance of the slope. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slope, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted with a single variable, namely the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed to .10 and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), to 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of observed

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


probabilities. Figure 8B shows the corresponding results for the slope. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes, using a single parameter, as well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities, using one parameter, is slightly less than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.
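For reference, the proportion of variance accounted for (the r² values above) can be computed as follows; this is the generic definition, not code from the original study.

```python
import numpy as np

def r_squared(observed, predicted):
    # Proportion of variance in the observed values accounted for by the model.
    residual = np.sum((observed - predicted) ** 2)
    total = np.sum((observed - np.mean(observed)) ** 2)
    return 1.0 - residual / total
```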

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed as a function of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are unaffected by moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the normal probability density function of the net inputs, with mean μh, from the activation threshold (T) to infinity (∞). Thus, the probability (Pc) that a node is active at recognition is

$$P_c = a\int_{T}^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma_h}\,e^{-\left(h-\mu_h\right)^2/2\sigma_h^2}\,dh. \quad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), gives the expected recognition strength (μs):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because it simplifies the analytic solution; however, the variance theory, and the simulation, use the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of the feature strength for a single item may fluctuate, on an item-to-item basis, around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and item layers. The distribution of Pc is binomial but can, given a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/2N]^1/2. The final result is scaled by the normalization factor 1/σh:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c\left(1-P_c\right)}{2N}\right]^{1/2}. \quad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

$$P(\mathrm{Y}) = \int_{C}^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma_s}\,e^{-\left(s-\mu_s\right)^2/2\sigma_s^2}\,ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A,B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

$$P(\mathrm{A,B}) = \int_{C}^{\infty}\frac{1}{\sqrt{2\pi\left[\sigma_s^2(\mathrm{A})+\sigma_s^2(\mathrm{B})\right]}}\;e^{-\left[s-\left(\mu_s(\mathrm{A})-\mu_s(\mathrm{B})\right)\right]^2/2\left[\sigma_s^2(\mathrm{A})+\sigma_s^2(\mathrm{B})\right]}\,ds.$$

An Excel sheet for calculating the predictions of the variance theory is available on line (www.psych.utoronto.ca/~sverker/variance.html).
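The analytic solution is straightforward to evaluate numerically. The following is a minimal sketch, assuming scipy is available, of Equations 6 and 7 and the yes-response integral; it is an illustration in the spirit of the posted Excel sheet, not the sheet itself.

```python
from scipy.stats import norm

def analytic_prediction(mu_h, sigma_h, a, N, T, C):
    # Equation 6: probability that a node active at encoding has a net input
    # above the activation threshold T (mass of N(mu_h, sigma_h) above T).
    p_c = a * (1.0 - norm.cdf(T, loc=mu_h, scale=sigma_h))
    # Expected recognition strength: (Pc - a/2) / sigma_h.
    mu_s = (p_c - a / 2.0) / sigma_h
    # Equation 7: binomial SD over 2N nodes, scaled by the factor 1/sigma_h.
    sigma_s = ((p_c * (1.0 - p_c)) / (2.0 * N)) ** 0.5 / sigma_h
    # Probability of a yes response: strength mass above the criterion C.
    p_yes = 1.0 - norm.cdf(C, loc=mu_s, scale=sigma_s)
    return p_c, mu_s, sigma_s, p_yes

# For new items, mu_h = 0; for old items, mu_h is the expected old net input.
```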


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is, in the model, calculated with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc to one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

$$\frac{\sigma_s^2(\mathrm{N})}{\sigma_s^2(\mathrm{O})} \approx \frac{P_c(\mathrm{N})}{P_c(\mathrm{O})}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is .8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is .64 (= .80²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions, the exact solution predicting a slightly smaller ratio in the old than in the new distributions:

$$\frac{\sigma_s(\mathrm{BO})}{\sigma_s(\mathrm{AO})} \le \frac{\sigma_s(\mathrm{BN})}{\sigma_s(\mathrm{AN})} \approx \frac{\sigma_h(\mathrm{A})}{\sigma_h(\mathrm{B})}.$$

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Examples of assumptions that yield good performance in the model are a low percentage of nodes active, setting the activation threshold between the old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. Optimal performance in the network requires the implementation suggested by the variance theory; if the implementation is changed significantly, performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

$$d' = \frac{\mu_s(\mathrm{O})-\mu_s(\mathrm{N})}{\sigma_s(\mathrm{NO})} = \frac{P_c(\mathrm{O})-P_c(\mathrm{N})}{\left[P_c(\mathrm{NO})\left(1-P_c(\mathrm{NO})\right)/2N\right]^{1/2}}. \quad (8)$$

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc(·) was calculated with Equation 6. The expected value of the net inputs and the standard deviation of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.
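Equation 8 can be sketched in the same style. The net-input moments would come from the Appendix (Equations A1-A3), so the arguments here are placeholders rather than the paper's values.

```python
from scipy.stats import norm

def d_prime(mu_h_old, sigma_h, a, N, T):
    # Pc for old and new items (Equation 6); the new expected net input is zero.
    p_old = a * (1.0 - norm.cdf(T, loc=mu_h_old, scale=sigma_h))
    p_new = a * (1.0 - norm.cdf(T, loc=0.0, scale=sigma_h))
    p_no = (p_old + p_new) / 2.0   # Pc(NO), pooled over old and new items
    # Equation 8; the 1/sigma_h factors in mu_s and sigma_s cancel.
    return (p_old - p_new) / ((p_no * (1.0 - p_no)) / (2.0 * N)) ** 0.5
```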

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding


(a). The results show that d′ is optimal for a = .052. The d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases. For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. Thus, for a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low: The number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed. Thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and, therefore, one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold of the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = .81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is .71, which is near, but slightly lower than, the optimal value of .81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect but also optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.
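Reusing the d_prime sketch above, the threshold analysis of Figure 9B can be approximated with a simple scan; the parameter values below are illustrative (σh = 1 is an assumption, not a fitted value).

```python
import numpy as np

# Scan thresholds between the new (0) and beyond the old (1.42) expected net
# inputs, and pick the value that maximizes d-prime.
thresholds = np.linspace(0.0, 2.0, 201)
best_T = max(thresholds, key=lambda t: d_prime(1.42, 1.0, 0.052, 30, t))
```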

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved in the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance


with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that if the standard deviations of the old and the new distributions are equal, optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true because, if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f[S(O)] denotes the density of the recognition strength of the old distribution, and f[S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f[S(O)]/f[S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
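Under the normal-distribution description used throughout the paper, the intersection point can be found numerically. A minimal sketch, with invented means and standard deviations (the new distribution is centered at zero):

```python
from scipy.optimize import brentq
from scipy.stats import norm

def likelihood_ratio(c, mu_old, sd_old, sd_new):
    # L = f[S(O)] / f[S(N)] at criterion c.
    return norm.pdf(c, mu_old, sd_old) / norm.pdf(c, 0.0, sd_new)

# The optimal criterion satisfies L = 1, i.e., the two densities intersect.
c_opt = brentq(lambda c: likelihood_ratio(c, 1.0, 1.25, 1.0) - 1.0, 0.0, 1.0)
```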

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary, depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. Optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: Optimal performance occurs when the maximum likelihoods of the two classes are equal,

$$L(\mathrm{A}) = \frac{f[S(\mathrm{AO})]}{f[S(\mathrm{AN})]} = L(\mathrm{B}) = \frac{f[S(\mathrm{BO})]}{f[S(\mathrm{BN})]}.$$

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is diminished, the recognition criterion for Class B must be increased to keep the total false alarm rate constant; according to the formulation of the problem, the two changes in false alarm rate must cancel, f[S(AN)]ΔT(A) + f[S(BN)]ΔT(B) = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) decreases and L(B) increases when the recognition criteria are changed as specified above. Thus, L(A) = f[S(AO)]/f[S(AN)] < f[S(BO)]/f[S(BN)] = L(B), and the corresponding net change in the total hit rate is negative. This shows that a change in the placement of the criteria away from L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.

Note that the variance theory has only one overall recognition criterion. However, the theory, and any theory of the mirror effect, specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can be indirectly inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted in

the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance, even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently so that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56), in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate in Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance is better or equal after normalization, as compared with performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm


rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria, because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult distribution. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency


distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune from this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a

particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; however, for nonwords, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts


associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood theory's account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σh′), so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items for addressing issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.
Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.
Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.
Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Glanzer, M., & Bowles, N. (1976). Analysis of the word frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.
Glanzer, M., Kisok, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.
Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.
Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.
Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.
Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neurosciences of memory (pp. 103-146). Seattle: Hogrefe & Huber.
Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.
McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.
Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.
Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.
Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.
Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM: retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.
Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå. (ISBN 91-7191-155-3)
Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.
Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural networks model. International Journal of Psychology, 34, 460-464.
Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.
Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old items [σp(O)] are equal. However, if the standard deviations are different, the expected number of active nodes will be

$$\frac{\sigma_p(\mathrm{N})}{\sigma_p(\mathrm{N})+\sigma_p(\mathrm{O})}\,a.$$

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around .44a if the slope of the ROC curve is .8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.
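As a quick numeric check of this expression (assuming, as in the note, a z-ROC slope of .8, so that σp(N)/σp(O) = .8):

```python
# sigma_p(N) / [sigma_p(N) + sigma_p(O)] with slope = sigma_p(N)/sigma_p(O) = 0.8:
slope = 0.8
fraction = slope / (slope + 1.0)   # = 0.444..., i.e., roughly 0.44a
```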

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input solves the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so x_i = x_j = 1:

$$\mu_h(\mathrm{O}) = \sum_{j=1}^{N}\Delta w_{ij}\,x_j = \sum_{j=1}^{N}\left(x_i-a\right)\left(x_j-a\right)x_j = aN\left(1-a\right)^2. \quad (\mathrm{A1})$$

The expected value of the net inputs for the new items is zero:

$$\mu_h(\mathrm{N}) = 0. \quad (\mathrm{A2})$$

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

$$\sigma_h^2 = \sum_{p=1}^{P}\sum_{j=1}^{N}\left[\left(x_i-a\right)\left(x_j-a\right)\right]^2 x_j = a^2\left(1-a\right)^2 aPN = a^3\left(1-a\right)^2 PN. \quad (\mathrm{A3})$$

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

$$\sigma_h^2(f) = a^4\left(1-a\right)^2 f N^2. \quad (\mathrm{A4})$$

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

$$\sigma_h^2(L) = a^4\left(1-a\right)^2 L N^2.$$

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where p patterns have been encoded, is

$$\sigma_h^2(\mathrm{O}) = \frac{\left(f+L\right)a^4\left(1-a\right)^2 N^2}{2} + a^3\left(1-a\right)^2 pN. \quad (\mathrm{A5})$$

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual, with Equation 7) plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).
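A sketch of this all-nodes variant follows, using the same normal-integral form as the analytic solution; the argument values (means, σh, threshold) are the caller's assumptions, not values from the paper.

```python
from scipy.stats import norm

def p_correct_all_nodes(mu_active, mu_inactive, sigma_h, a, T):
    # Nodes active at encoding that are also active at retrieval.
    active_correct = a * (1.0 - norm.cdf(T, loc=mu_active, scale=sigma_h))
    # Nodes inactive at encoding that stay inactive at retrieval (a -> 1 - a;
    # for old items, mu_inactive would be -a**2 * (1 - a) * N).
    inactive_correct = (1.0 - a) * norm.cdf(T, loc=mu_inactive, scale=sigma_h)
    return active_correct + inactive_correct
```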

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


For a = .5, the net inputs are binomially distributed with a certain expected value. Given a certain criterion [i.e., Np(1 − a)a > 10], a binomial distribution can be approximated with a normal distribution (Feller, 1968). For a ≠ .5, there are actually four outcomes; however, the same normal approximation can be used. Thus, for reasonably large parameter values of Npa, the distribution of net inputs to the nodes can be approximated by a normal distribution.

If the to-be-recognized item has not been encoded with the context (i.e., a new item), the net input is simply the sum of random weights. Because the expected values of all weights are zero, the expected value of the net inputs for new items will also be zero. If the item has been encoded with the context (i.e., an old item), the net input is the sum of those weights connected to that node whose respective context nodes were active at encoding,

$$h_i^d = \sum_{p}\sum_{j=1}^{N}\left(x_i^d-a\right)\left(x_j^c-a\right)x_j^c.$$

Owing to the adaptive weight changes during encoding, these weights will have an expected value that is larger than zero if both nodes were in the active state during encoding [i.e., each weight change at encoding is computed as (1 − a)²] and less than zero if one node was inactive and the other node was active at encoding [i.e., each weight change at encoding was −a(1 − a)]. Of specific importance for the theory is that the variance of the net inputs to the context nodes (from the item nodes) increases with the number of contexts that are associated with the item. Therefore, the variance of the net input is larger for high- than for low-frequency items. Similarly, the variance of the net input to the item nodes (from the context nodes) increases with the number of items associated with one context (i.e., list length). Therefore, given that the context is constant during a list presentation, the variance of the net inputs is larger for a long than for a short list.

Brief summary of optimal performance. Given the strong selection pressure, arguably, humans and animals have evolved to achieve good memory performance. Therefore, it is reasonable to assume that mechanisms for recognition decisions have evolved to an optimal or near-optimal performance. Following this assumption, the parameter values in the model and the implementation of the model are guided by the idea that the model should perform optimally. A detailed discussion of the issue of optimal performance, with exact derivations of what constitutes optimal performance in the context of the present model, is presented later, in the sixth section. Here, I give a brief summary explaining the results from the analysis of optimal performance without going into the mathematical details (see Figures 9A, 9B, 9C, and 9D).

The model's performance is optimal if the percentage of nodes active at encoding (a) is low (see Figure 9A). For a low a, it is optimal to base the recognition decision on nodes that were active at encoding and to ignore nodes that were inactive during encoding (see Figure 9A). Also, for a low a, it is optimal to place the activation threshold of the nodes between the expected values of the new and the old net inputs (see Figure 9B). Finally, it is optimal to normalize the recognition strength with the standard deviation of the net input (see Figures 9C and 9D).

For a low percentage of active nodes, it is optimal to base the recognition decision on nodes that were active at encoding (or nodes active in the cue pattern) and to ignore nodes that were inactive at encoding. At recognition, the state of activation of a node may be either active or inactive. Therefore, the nodes that are active in the cue pattern and have a net input above a certain activation threshold are activated at recognition; otherwise, the nodes are inactivated. Let z_i^d denote the state of activation at recognition for item node i. An item node is activated at recognition (z_i^d = 1) if it was active in the cue pattern (x_i^d = 1) and the net input is above the activation threshold (h_i^d > T); otherwise, it is inactivated (z_i^d = 0):

z_i^d = 1 if x_i^d = 1 and h_i^d > T; otherwise, z_i^d = 0.

Similarly, let z_j^c denote the state of activation at recognition for context node j:

z_j^c = 1 if x_j^c = 1 and h_j^c > T; otherwise, z_j^c = 0.
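In code, this activation rule can be sketched as follows (a hypothetical helper; cue, h, and T follow the notation above):

import numpy as np

def activate(cue: np.ndarray, h: np.ndarray, T: float) -> np.ndarray:
    # z = 1 only for nodes that are active in the cue pattern AND whose
    # net input exceeds the activation threshold T; all other nodes are 0.
    return ((cue == 1) & (h > T)).astype(float)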

This way of activating nodes at retrieval differs from how nodes are activated in a standard Hopfield (1982) network, where the activation threshold is zero and a node is activated if the net input is above zero (independently of the state of activation in the cue pattern). The way of activating patterns in a Hopfield network is more likely to produce a retrieved pattern that matches the encoded pattern of activation (e.g., the expected value of active nodes at retrieval will be the same as the expected value of active nodes at encoding). However, as will be discussed later, the way suggested to activate the nodes here yields better performance in terms of discrimination between a target item and a distractor item.

As is shown in Figure 9B, performance is optimal when the activation threshold is set approximately between the new and the old net inputs. The activation threshold (T) is set to the expected value of the net inputs of nodes active during encoding (x_i^d = 1, x_j^c = 1) for old and new items. The averaging is computed over all nodes (2N) and over all new and old patterns (P) in the recognition condition. If half of the items are new and half of the items are old, the activation threshold is

T = 1/(2aPN) Σ_{p=1}^{P} (Σ_{i=1}^{N} h_i^d x_i^d + Σ_{j=1}^{N} h_j^c x_j^c),

where the net input is computed from the weights accumulated at encoding, h_i^d = Σ_p Σ_{j=1}^{N} (x_i^d - a)(x_j^c - a) x_j^c.

As was discussed above, the expected net input of new (lure) items is zero. Therefore, the activation threshold is simply half the expected net input for nodes encoded in the active state [T = μ_h(O)/2, where μ_h(O) is the expected value of the net input to nodes encoded in the active state].
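A sketch of this threshold rule, assuming the net inputs of cue-active nodes are pooled over the new and old test items so that their mean falls halfway between the new mean (zero) and μ_h(O):

import numpy as np

def activation_threshold(h_active: np.ndarray) -> float:
    # h_active: net inputs of cue-active nodes pooled over all (new and old)
    # test items in the condition. With half new (mean 0) and half old
    # (mean mu_h(O)), this average equals mu_h(O)/2.
    return float(h_active.mean())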

It is easy to see that the expected percentage of old and new active nodes at recognition is one half of the percentage of active nodes at encoding (a/2). That is, the activation threshold divides the old and new distributions in


the middle. Old items will have a higher expected percentage active, and new items will have a lower expected percentage active. The activation threshold is constant during recognition in one condition. However, it must vary between conditions, depending on the net inputs, to yield optimal performance.

The percentage of nodes active at recognition is counted over both layers: Pc = 1/(2N) (Σ_{i=1}^{N} z_i^d + Σ_{j=1}^{N} z_j^c).

As is shown in Figure 9C, the performance is near-optimal if the recognition decision is based on the number of nodes active at recognition, normalized by the standard deviation of the net inputs across the active features of this item (σ_h′). Thus, this standard deviation is calculated over all the nodes active at encoding (i.e., the x_i^d = 1 and x_j^c = 1 nodes; nodes inactive at encoding are not used when calculating the standard deviation, because for low levels of a these nodes carry little to no information about the item). The recognition strength (S) for an item is calculated by subtracting the expected percentage of nodes active at recognition (a/2; see note 1) from the real percentage of nodes active at recognition (Pc) and dividing by the standard deviation of the net inputs of the item (σ_h′):

S = (Pc - a/2) / σ_h′. (5)

The subtraction of the expected percentage of nodes active at recognition makes the expected value of the recognition strength (S) zero. This subtraction is necessary for the normalization to work properly. The subtraction moves the recognition strength distributions symmetrically, so that the old and the new distributions move at the same rate for a given standard deviation of the net input (without the subtraction, the old recognition strength distribution would be more affected than the new distribution). Thus, the recognition strength is determined by the difference between two probabilities (the percentage of active nodes, which varies, and the expected percentage of active nodes, which is constant), divided by the standard deviation of the net input. A yes response (Y) is given if the recognition strength (S) is above the recognition criterion (C); an unbiased decision has a recognition criterion of zero:

Y = S > C.
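Putting Equation 5 and the decision rule together, a hypothetical implementation might look like this (the sample standard deviation, ddof=1, matches the worked example given below):

import numpy as np

def recognition_strength(z: np.ndarray, h_active: np.ndarray, a: float) -> float:
    # Equation 5: S = (Pc - a/2) / sigma_h', where Pc is the proportion of
    # all 2N nodes active at recognition and sigma_h' is the standard
    # deviation of the net inputs over the nodes active in the cue pattern.
    Pc = float(z.mean())
    return (Pc - a / 2.0) / float(h_active.std(ddof=1))

def yes_response(S: float, C: float = 0.0) -> bool:
    return S > C                     # unbiased recognition criterion: C = 0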

An issue that may be raised is whether it is sensible to base recognition strength on two quite different sources, namely, the percentage of active nodes and the variability of the net input. The immediate answer is that if it is reasonable to optimize performance, it is also sensible to measure recognition strength this way. Another perspective is to note that unbiased responses can be made only on the percentage of active nodes; that is, a yes response occurs if the percentage of active nodes is larger than the expected percentage of active nodes (Pc > a/2), and the variability of the net input can be ignored. Thus, "normally," subjects base their unbiased decisions on the percentage of active nodes, and the variability of active nodes only becomes relevant when subjects are biased. From this perspective, the percentage of active nodes is used for unbiased responses, and the variability of the net input becomes relevant for confidence judgments. Therefore, by combining both the percentage of active nodes and the variability of the net input, the measure of recognition strength proposed here will also reflect the confidence judgment.

An Example With Step-by-Step Computations
To clarify the computational details involved in the variance theory, a step-by-step example is given here. For tractability, a small network is used, consisting of four item nodes and four context nodes (see Figure 2). The actual simulations to be reported later involved a larger network architecture, with 30 nodes at each layer. The percentage of nodes active at encoding (denoted by parameter a) is set to 50%. Let item BN be represented as (1,1,0,0), written as the state of activation of the four nodes x_1^d, x_2^d, x_3^d, x_4^d. Similarly, let (0,0,1,1) represent item BO, (1,0,1,0) represent item AO, and (0,1,0,1) represent item AN. Let context CBN be represented as (1,1,0,0), the state of activation of the four context nodes x_1^c, x_2^c, x_3^c, x_4^c. Similarly, (0,0,1,1) represents context CBO, and (0,1,0,1) represents the experimental study context CExp.

Item BN is a high-frequency new word. For simplicity, it is here encoded only once, with context CBN, in the preexperimental phase (in the simulations below, high-frequency words are preexperimentally associated with three contexts). The 16 weights between the four item nodes and the four context nodes are changed according to the learning rule, where the probability that a node is active at encoding is determined by the parameter a = .5. For example, the weight change between item node 1 and context node 1 is w_11 = [x_1^d(BN) - a][x_1^c(CBN) - a] = (1 - .5)(1 - .5) = 1/4, where BN is item BN and CBN represents context CBN. Similarly, item BO is another high-frequency word that, before the experimental phase, is encoded once with context CBO. Items AO and AN are low-frequency old and new words, and they are not encoded at the preexperimental phase.

In the experimental phase, item AO is encoded with the experimental context CExp. Finally, item BO is encoded with the same experimental context CExp. For example, the weight w_11 is now equal to

[x_1^d(BN) - a][x_1^c(CBN) - a] + [x_1^d(BO) - a][x_1^c(CBO) - a] + [x_1^d(BO) - a][x_1^c(CE) - a] + [x_1^d(AO) - a][x_1^c(CE) - a] = (1 - .5)(1 - .5) + (0 - .5)(0 - .5) + (0 - .5)(0 - .5) + (1 - .5)(0 - .5) = 1/4 + 1/4 + 1/4 - 1/4 = 1/2.

After encoding, the full weight matrix is (.5, 1, -1, -.5; -.5, 0, 0, .5; -.5, 0, 0, .5; .5, -1, 1, -.5), corresponding to the weights w_11, w_12, . . . , w_44, respectively.


At recognition, the state of activation of the old low-frequency item AO is reinstated as a cue to the item layer, and the state of activation of the encoding context at the experimental phase, CExp, is reinstated as a cue to the context layer. The net inputs are calculated. For example, the net input to context node 1 is

h_1^c = Σ_{i=1}^{4} w_{i1} x_i^d = .5(1) + 1(0) - 1(1) - .5(0) = -.5.

The net input to the item nodes is (1, 0, 2, 1), and that to the context nodes is (-.5, .5, -.5, .5). It can be seen that the expected net input for randomly created new items in this network is 0 and that the expected net input for old items encoded in the active state is 0.5. Therefore, the activation threshold is set to the average of these values, namely, T = 0.25. Nodes that have a net input above the activation threshold and that are active in the cue pattern are activated at recognition. Thus, the state of activation at recognition for the item nodes is (1,0,1,0), and that for the context nodes is (0,1,0,1) (which is identical to the cue patterns). The percentage of active nodes is counted: Pc(AO) = 4/8 = .5. For an unbiased response (C = 0), this will yield a yes response, because this percentage is larger than the expected percentage of active nodes at recognition (a/2 = .25). The standard deviation of the net input for nodes active at encoding is 0.71 (i.e., the standard deviation of 1, 2, .5, .5, corresponding to nodes 1, 3, 6, and 8). The recognition strength is calculated by subtracting the expected percentage of active nodes at recognition from the percentage of active nodes for the to-be-recognized item and dividing by the standard deviation of the net input for active nodes [S = (Pc - a/2)/σ_h′ = (.5 - .25)/.71 = 0.35].

The recognition of the three items BO, AN, and BN is done in the same way. Because the context cue is always CExp, the net input to the item nodes, (1, 0, 2, 1), is the same for all four test items, whereas the net inputs to the context nodes depend on the item cue. For the four items AN, BN, BO, and AO, the states of activation at recognition are (0,0,0,1; 0,0,0,0), (1,0,0,0; 0,1,0,0), (0,0,1,1; 0,0,0,1), and (1,0,1,0; 0,1,0,1), with the item nodes listed before the context nodes; the numbers of nodes active are 1, 2, 3, and 4; the standard deviations of the net inputs are 0.71, 1.08, 1.08, and 0.71; the recognition strengths are -.17, .00, .11, and .35; and the unbiased responses are no, no, yes, and yes, respectively (an unbiased yes response is made when S > C, where C = 0 for unbiased responses). Thus, the subject responds correctly for all items, the recognition strengths are ordered according to the mirror effect, and the variance of the net input is larger for the high- than for the low-frequency words.
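The arithmetic of this example can be checked directly; the following snippet reproduces the standard deviation and recognition strength reported for item AO:

import numpy as np

h_active = np.array([1.0, 2.0, 0.5, 0.5])   # net inputs of AO's cue-active nodes
sigma = h_active.std(ddof=1)                # sample s.d. = 0.71, as in the text
S = (4 / 8 - 0.5 / 2) / sigma               # (Pc - a/2) / sigma' = 0.35
print(round(float(sigma), 2), round(float(S), 2))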

SIMULATIONS OF THE FREQUENCY-BASED MIRROR EFFECT

In this section, the variance theory of the mirror effect is simulated in a connectionist framework that is consistent with a general connectionist theory of memory called TECO (Sikström, 1996a, 1996b). TECO has been used to account for a variety of memory phenomena involving recognition and recall, for example, successive testing of recall and recognition (Sikström, 1996b) and forgetting curves (Sikström, 1999). An extensive description of TECO is beyond the scope of this paper. Interested readers are referred to previous articles describing the model for details.

Procedure
The simulation started with initializing the weights to zero. Then 12 items were generated by randomly setting the nodes to an active state with a probability of a. A preexperimental phase then followed, to generate the frequency associated with the items. In this preexperimental phase, half of the items (i.e., 6) were encoded three times, each time in a different context. These items simulated the high-frequency words. The remaining 6 items were not encoded before the experimental phase, and they simulated the low-frequency words.

At the experimental phase, one study-encoding context was generated, using the same procedure as the generation of the contexts in the preexperimental phase. Given that in a standard recognition experiment all studied items would be encoded in the same list, in the simulations the items were thus encoded once with the same study context. Three of the high-frequency items were encoded, and three of the low-frequency items were encoded. The other three high- and three low-frequency items were not encoded during the experimental phase, and they simulated the new items. Each encoding was simulated by first activating the nodes in the item and context layers. The weights were then changed according to the learning rule (Equation 3).

At the recognition test, the patterns of activation of the 12 items and the study context were reinstated to the network. The net inputs were calculated for each node, and the recognition strength was calculated from all the nodes in the network. The somewhat unrealistic assumption that no learning or changes of weights occurred during testing was adopted. However, this is a standard assumption often used in other simulations of recognition memory, cued recall, or categorization (e.g., Kruschke, 1992; Lewandowsky, 1991; Li & Lindenberger, 1999; Li, Lindenberger, & Frensch, 2000). One thousand five hundred simulation runs, with different random item and context patterns, were carried out. The results reported later are based on the average across these runs.
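For concreteness, the following self-contained Python sketch mirrors this procedure with the parameter values given in the next subsection (N = 30, a = .20, C = 0, 1,500 runs). It is an illustrative re-implementation, not the author's original code; in particular, the closed-form activation threshold T = μ_h(O)/2, with μ_h(O) = a(1 - a)^2 N, is an assumption based on the expected-value expressions discussed above, so the exact rates will differ somewhat from those reported below, although the mirror ordering should emerge.

import numpy as np

rng = np.random.default_rng(0)
N, a, C, RUNS = 30, 0.20, 0.0, 1500         # parameter values given below

def pattern():
    while True:
        p = (rng.random(N) < a).astype(float)
        if p.sum() >= 2:                    # skip degenerate near-empty patterns
            return p

def run_once():
    W = np.zeros((N, N))
    items = [pattern() for _ in range(12)]  # items 0-5 high frequency, 6-11 low
    for i in range(6):                      # preexperimental phase: each high-
        for _ in range(3):                  # frequency item is encoded three
            W += np.outer(items[i] - a, pattern() - a)   # times, new context each
    study = pattern()                       # one study context for the whole list
    for i in (0, 1, 2, 6, 7, 8):            # three high- and three low-frequency
        W += np.outer(items[i] - a, study - a)           # items are studied once
    T = a * (1 - a) ** 2 * N / 2            # assumed T = mu_h(O)/2 (see above)
    yes = []
    for it in items:
        h_d, h_c = W @ study, W.T @ it      # net inputs to item and context layers
        z = np.concatenate([(it == 1) & (h_d > T), (study == 1) & (h_c > T)])
        s = np.concatenate([h_d[it == 1], h_c[study == 1]]).std(ddof=1)
        yes.append(float(z.mean() - a / 2) / s > C)      # Equation 5 and Y = S > C
    return yes

rates = np.mean([run_once() for _ in range(RUNS)], axis=0)
print("P(BO) =", rates[0:3].mean(), " P(BN) =", rates[3:6].mean())
print("P(AO) =", rates[6:9].mean(), " P(AN) =", rates[9:12].mean())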

Parameters
The number of high-frequency patterns was six (each encoded three times preexperimentally, and three of them encoded once experimentally to simulate the old items), and the number of low-frequency patterns was six (three of them encoded experimentally once to simulate the old items). The following parameter settings were used: The number of nodes in each layer (N) was 30 (the total number of nodes was 2N = 60), and the percentage of nodes


active at encoding (a) was .20. Initially, all the weights were set to zero. The recognition criterion (C) was set to 0.

Results
Figure 3A shows the results in terms of the density of the net inputs to an active node. The expected values of the net inputs are approximately equal for new high- and new low-frequency items [μ_h(AN) = .00, μ_h(BN) = .00] and for old low- and old high-frequency items [μ_h(AO) = .38, μ_h(BO) = .38]. The high-frequency items have a larger variance of the net inputs [σ_h(BN) = .49, σ_h(BO) = .48] than do the low-frequency items [σ_h(AN) = .41, σ_h(AO) = .40]. The variances of the old and the new distributions are approximately equal.

Figure 3B shows the density of the recognition strength. The results show the mirror effect, where the hit rate probability is larger for low- than for high-frequency items and the false alarm rate is lower for low- than for high-frequency items [P(AN) = .18, P(BN) = .25, P(BO) = .64, P(AO) = .71]. The standard deviation of the recognition strength is larger for old than for new words and larger for the low-frequency words than for the high-frequency words [σ_s(AN) = .029, σ_s(BN) = .019, σ_s(BO) = .023, σ_s(AO) = .031]. These findings agree with the empirical data and the predictions of the attention-likelihood model (Glanzer et al., 1993). Thus, the simulation shows that the variance theory can account for the frequency-based mirror effect and the associated ROC curves.

EXPLICATIONS OF THE MECHANISMS

In this section, three essential mechanisms of the variance theory that account for the mirror effect are discussed in detail. The mechanisms involved are (1) the variance of the net input, (2) the expected value of the net input, and (3) the way in which the recognition strength is calculated by counting the number of active nodes, so that the performance is optimal.

Overview
The first mechanism is that items from the difficult class (i.e., high-frequency items in this case) have a higher variance of the net inputs than do items from the easy class (i.e., low-frequency items), but the variance is independent of whether the items are old or new. This mechanism is illustrated in Figure 4A (noncumulative). It should also be underscored that the difference in variance as a function of class membership is not a primitive of the theory that the theorist manipulates. It is the natural consequence of the rather plausible assumption that high-frequency items appear more often and are associated with a greater variety of different contexts than are low-frequency words. The second mechanism is that old items have a higher expected net input to nodes encoded in the active state than do new items (but the expected net input does not depend on the class of the items). This mechanism is illustrated in Figure 4B (cumulative). The third mechanism is the way recognition strength is measured, by counting the active nodes, so that recognition performance (e.g., d′) is optimal or near-optimal. An extended analysis of optimality is presented at the end of the paper; however, the results in Figures 9A-9D can summarize the main points here. The results from this analysis imply that the decision must be based on the nodes active at encoding (see Figure 9A), the percentage of active nodes must be low (see Figure 9A), the activation threshold needs to be between the expected net inputs of the new and old items (see Figure 9B), and the recognition strength is normalized by the standard deviation of the net inputs of the item (see Figures 9C and 9D). The density of the percentage of active nodes is illustrated in Figure 4C, and the normalized percentage of active nodes is illustrated in Figure 4D. Here it is shown how these mechanisms predict the mirror effect. Below, these three sets of mechanisms are explained in more detail.

Variance of the Net Input
The first, and perhaps the most important, mechanism is that the difficult class has a larger variability in the net

Figure 3. (A) Simulations of net inputs. The vertical axis shows the simulated density of the net inputs. (B) Simulations of the mirror effect. The vertical axis shows the simulated density of the recognition strength.


inputs than does the easy class. As will be discussed later, this is shared with other theories, such as the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). However, a unique aspect of the variance theory is that it is a necessary outcome of simply encoding high-frequency items more times than low-frequency items. In the subjective-likelihood account, a larger variability for high- than for low-frequency words is an assumption that is controlled by a free parameter (p0), to reflect that high-frequency words have more definitions than do low-frequency words. The variance theory needs no such assumptions and no additional free parameters to control the variability. Whereas the subjective-likelihood account tries to capture word frequency with a free parameter, in the variance theory the word-frequency effect is simulated via the number of contexts associated with an item.

Owing to how the variance theory implements the relations between contexts and items, the variance of the net inputs increases with the frequency with which an item is encoded in different contexts. Intuitively, this occurs because a high-frequency item activates several different contexts, causing the representations of many competing contexts to be activated simultaneously. Low-frequency items are associated with fewer contexts than are high-frequency items. This causes the representations of fewer contexts to be activated simultaneously. Thus, low-frequency items have less variability in the net inputs than do high-frequency items.

At recognition, an item produces a net input in the context layer that is a mixture of the net inputs from the study context that the network is instructed to retrieve from and from several uncorrelated preexperimental contexts that were associated with the item during the preexperimental phase. The study context that the network is instructed to retrieve from contributes to the correct active state, for example, by adding +1 to the net input of a node (which is the expected net input for a node encoded

Figure 4. (A) The probability density of the net inputs in the variance theory. The horizontal axis shows the net inputs and the vertical axis the probability density of the net inputs. The dotted vertical line is the activation threshold. (B) The cumulative probability distributions of the net inputs in the variance theory. The horizontal axis shows the net inputs and the vertical axis the cumulative probability distributions of the net inputs. (C) The density of the percentage of nodes active at recognition in the variance theory. The horizontal axis shows the percentage of nodes active at recognition and the vertical axis the density. (D) The density of recognition strength in the variance theory, using standard parameter values. The horizontal axis shows the recognition strength and the vertical axis the density of the recognition strength.


in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was in an active state, or it could add -1 to the net input if the particular preexperimental context was encoded in an inactive state (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input. That is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and -1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1^2 for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for the items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or -1/4 [see Equation 3], each yielding an increase in variability of 1/4^2 = 1/16. An expected value of aN nodes contribute, so the total increase in variability is 4 × 1/16 = 1/4.)
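The claimed growth of variability with the number of associated contexts can also be verified numerically; in this hypothetical sketch, the same item is encoded with f different random contexts, and the variance of the net input reaching the context layer grows roughly linearly with f:

import numpy as np

rng = np.random.default_rng(2)
N, a = 8, 0.5
item = (rng.random(N) < a).astype(float)
for f in (1, 2, 4, 8):                      # encode the item with f contexts
    v = []
    for _ in range(4000):
        W = np.zeros((N, N))
        for _ in range(f):
            ctx = (rng.random(N) < a).astype(float)
            W += np.outer(item - a, ctx - a)
        v.append((W.T @ item).var())        # net input to the context layer
    print(f, round(float(np.mean(v)), 2))   # variance grows about linearly in f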

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in context nodes increases linearly with how many times a given item is encoded within different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.

In the same way as the variability of the context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with how many times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input
The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase of the two classes are equal. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given the exact same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are only interested in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables, such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to a smaller increase in the net input than does a longer period of study time, for example, 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encodings in other learning contexts will not affect the expected net input, but they do affect the variability of the net input, as was demonstrated above. The item-study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, as a difficult class item) is equal to the expected net input for ARCTIC (a low-frequency word, as an easy class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and the testing experimental conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult class item is equal to the expected net input for a new easy class item.

The probability density functions of the net inputs for nodes in the active state are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted. New nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of the net inputs for the difficult class items [σ_h(B)] is larger than the standard deviation of the net inputs for the easy class items [σ_h(A)]. The second mechanism is shown in the figures in that the expected net input of an easy class new item [μ_h(AN)] is equal to the expected net input of a difficult class new item [μ_h(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: The expected net input for the old easy class items [μ_h(AO)] is equal to the expected net input for the old difficult class items [μ_h(BO)].

Recognition Strength
The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that the performance is optimal or near-optimal. If the net input is above the activation


threshold (T) and the node was active at encoding, the node is activated at recognition. Otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities, or distributions, predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of the parameter values for the activation threshold: P(AN) < P(BN) < P(BO) < P(AO) for μ_h(AN) = μ_h(BN) < T < μ_h(AO) = μ_h(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy and the difficult class items; the expected strengths of the net inputs are equal. The variability is lower for easy class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy than for the difficult class items when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy class items is higher than the hit rate for difficult class items when the activation threshold is less than the expected value of the old items.

The activation threshold is set to be between the expected values of the new and the old net inputs, so that the performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for difficult and easy class items, respectively:

T = 1/4 [μ_h(AN) + μ_h(BN) + μ_h(BO) + μ_h(AO)] = 1/4 [μ_h(BO) + μ_h(AO)] = 1/2 μ_h(O).

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions to optimize the performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with a shift in the activation threshold necessary for keeping the performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test in order to maximize the performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [μ_h(AN) = μ_h(BN) < μ_h(BO) = μ_h(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σ_p(O) > σ_p(N)], whereas the standard deviations of the net inputs are equal for old and new items [σ_h(N) = σ_h(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions, because fewer nodes are active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero. This standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes (σ_p^2) is approximately proportional to the percentage of active nodes [σ_p^2 = Pc(1 - Pc)/N ≈ Pc/N ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σ_p) is smaller for new than for old items [σ_p(AN) < σ_p(BN) < σ_p(BO) < σ_p(AO)], whereas the standard deviation of the net input is not [σ_h(AN) = σ_h(AO) < σ_h(BN) = σ_h(BO)]. The essential mechanism that produces these changes in the means and standard deviations is the nonlinearity introduced by the counting of the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.
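A quick numerical illustration of this binomial relation, with 2N = 100 nodes as in Figure 5 (the proportions chosen are arbitrary):

import numpy as np

# The s.d. of the proportion of active nodes grows with the proportion
# itself and peaks at one half (compare Figure 5).
for Pc in (0.05, 0.10, 0.25, 0.50):
    print(Pc, round(float(np.sqrt(Pc * (1 - Pc) / 100)), 3))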

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because it improves the performance.

To optimize the performance, subjects base their recognition decision on the number of active nodes, normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments when the responses are biased but plays no role for unbiased responses.


422 SIKSTROM

To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σ_h′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σ_h′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σ_h). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviations of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, where C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) influences the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) influences the decision to be more certain.

Figure 4D shows the density distribution of the recognition strength. Note how the standard deviation of the active nodes for the easy versus the difficult class (in Figure 4C) changes when it is normalized by the standard deviation of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high. The easy class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.


that is presented at the end of the paper, together with an analysis of optimal performance.

The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a), plus one parameter for each class of words [the standard deviation of the net input, σ_h( )]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs to the easy class is σ_h(AN) = σ_h(AO) = 1.25, and the standard deviation of the net inputs to the difficult class is σ_h(BN) = σ_h(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μ_h(AN) = μ_h(BN) = 0, and the expected net inputs of the old distributions, μ_h(AO) = μ_h(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μ_s(AN) = -.0012, μ_s(BN) = -.0008, μ_s(BO) = .0008, and μ_s(AO) = .0012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists
The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize the performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.
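This dispersion mechanism can be sketched numerically. Assuming the net inputs are normal, the proportion of cue-active nodes exceeding the threshold is a[1 - Φ((T - μ_h)/σ_h)] (compare the integral expression for Pc given in the within-list section below); the parameter choices here are illustrative:

from math import erf, sqrt

def phi(x):                                  # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

a, sigma = 0.1, 1.25                         # activity level, s.d. of net inputs
for mu_old in (0.5, 1.0, 2.0):               # longer study time -> larger mu_h(O)
    T = mu_old / 2.0                         # threshold tracks the old mean
    fa = a * (1.0 - phi(T / sigma))          # new items: expected net input 0
    hit = a * (1.0 - phi((T - mu_old) / sigma))
    print(mu_old, round(fa, 3), round(hit, 3))
# as mu_old grows, the false alarm proportion falls while the hit
# proportion rises: the dispersion pattern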

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep, or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μ_h(AO) = μ_h(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize the performance. The distributions in Figure 6A are closer than the distributions in Figure 4D. Thus, decreasing the net inputs, for example by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite directions, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μ_h(AO) = μ_h(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain a near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list.


In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed to .4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (.5, 1, and 2), or any value above .4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over

Table 1
General Table of Results From the Experiment

Condition    AN    BN    BO    AO    σ_s(BN)/σ_s(AN)   σ_s(BO)/σ_s(AO)
Control      .13   .17   .69   .82         .60               .86
Frequency    .20   .28   .80   .68        1.01               .66
Time         .10   .15   .78   .76         .89               .81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low- (AN) and high- (BN) frequency words, the hit rates for high- (BO) and low- (AO) frequency words, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σ_s(BN)/σ_s(AN)], and the corresponding slope for the old distributions [σ_s(BO)/σ_s(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength and the vertical axes the density.


the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μ_h(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μ_h(O) = 2].

List-Length Effect
Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σ_h) for both easy and difficult class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves
The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than one half. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes), as compared with the old distribution [σ_s(AO) > σ_s(AN) and σ_s(BO) > σ_s(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σ_s(AO) = σ_s(AN) and σ_s(BO) = σ_s(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the easy class than for the difficult class [σ_s(AN) > σ_s(BN) and σ_s(AO) > σ_s(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths, become equal. Figure 6D plots the predictions of the theory when all the standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σ_s(AN) = .0015, σ_s(BN) = .0012, σ_s(BO) = .0015, and σ_s(AO) = .0020. The ratios of these standard deviations must follow Equation 2. This is also the case here: σ_s(BN)/σ_s(AO) = .61 < .74 = σ_s(AN)/σ_s(AO) < σ_s(BN)/σ_s(BO) = .78 < .94 = σ_s(AN)/σ_s(BO).

Changing the Recognition Criterion
The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = .0016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy class words than for difficult class words. This prediction is supported by empirical data. Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN); P(BO, AN) < P(AO, AN); P(BN, AN) > .50; and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.

Within-List Strength Manipulation
So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μ_h(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage for old high-frequency items when attention is focused on these items) must be larger than the effect from the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage for old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, than low-frequency items.

It is apparent from the density of the net inputs (Figure 7A) that the density of recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μ_h(BO)] is 2, rather than 1. Consequently, to optimize performance, the activation threshold becomes .75, rather than .50.

The percentage of active nodes then follows from the normal density of the net inputs,

Pc = a/(√(2π) σ_h) ∫_T^∞ exp[-(h - μ_h)^2 / (2σ_h^2)] dh.

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs and the vertical axis the probability density of the net inputs. The expected value of the old high-frequency distribution (BO) is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength and the vertical axis the probability density.

The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 6B: P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74].

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs are not dependent on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σ_s(AN) = .0013, σ_s(BN) = .0011, σ_s(BO) = .0017, and σ_s(AO) = .0019. These results are similar to the results when the same amount of attention is paid to the two classes, in Figure 4D: σ_s(AN) = .0015, σ_s(BN) = .0012, σ_s(BO) = .0015, and σ_s(AO) = .0020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate distribution for high-frequency words would be larger than that for low-frequency words. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the number of presentations was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method
Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20, ranging from 18 to 29 years old.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1 and the remaining for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether the word was presented in the list or not. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition. Nine subjects were given the control condition followed by another control condition. The whole experimental set-up, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results
The results from the experiment are presented in the first three rows of Table 1. The hit rate was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate for the low-frequency words was larger. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words; thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1-5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σ_s(BO)/σ_s(AO)], and similarly for the slope of the false alarms [σ_s(BN)/σ_s(AN)], are shown in the last two rows of Table 1.
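The slope computation can be sketched in a few lines of Python; the confidence-rating counts below are made up for illustration and are not the data of the experiment:

```python
import numpy as np
from scipy.stats import norm, linregress

# Hypothetical confidence-rating counts (ratings 1..5); only the
# procedure, not the data, mirrors the experiment.
hits_low  = np.array([ 5,  8, 12, 20, 55])   # low-frequency targets (A)
hits_high = np.array([ 9, 10, 15, 21, 45])   # high-frequency targets (B)

def cumulative_z(counts):
    """z-transformed P(rating >= k) for k = 5, 4, 3, 2."""
    p = counts[::-1].cumsum()[:-1] / counts.sum()  # drop the final p = 1.0
    return norm.ppf(p)

z_low, z_high = cumulative_z(hits_low), cumulative_z(hits_high)

# Regressing z(A hits) on z(B hits): the slope estimates the standard
# deviation ratio sigma_s(BO)/sigma_s(AO).
slope = linregress(z_high, z_low).slope
print(f"estimated sigma ratio: {slope:.2f}")
```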

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is also fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

$$\sum_{i=1}^{8} \frac{(\mathrm{Observed}_i - \mathrm{Predicted}_i)^2}{\sigma_i^2}.$$

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were the number of features (N = 1000) and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of


nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σ_h(A)], and the standard deviation of the net inputs for the difficult-class words [σ_h(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μ_h(N) = 0, μ_h(O) = 1, and C = 0]. The empirical standard deviations (σ_i) were not reported in Glanzer et al. (1993), so these parameters were set to one.
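A sketch of this fitting procedure in Python is given below. The predict function is a dummy stand-in (a real implementation would compute the four yes-probabilities and four slope ratios from Equations 6 and 7, presented in the next section); only the loss and the minimization mirror the procedure described here:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative observed values: four yes-probabilities and four ratios
# of z-ROC slopes for one condition (made-up numbers, not the real data).
observed = np.array([0.20, 0.25, 0.70, 0.74, 0.80, 0.83, 0.96, 1.00])
sigma = np.ones_like(observed)  # empirical SDs were unreported; set to one

def predict(params):
    """Dummy stand-in for the model's eight predictions as a function of
    (a, sigma_h_A, sigma_h_B); a real version would use Equations 6-7."""
    a, sh_a, sh_b = params
    t = np.linspace(0.0, 1.0, 8)
    return a + t * (sh_a / sh_b)  # placeholder mapping, for illustration

def loss(params):
    # Mean squared error divided by the variance, summed over the eight
    # observations (the fitting criterion used by Glanzer et al., 1993).
    return np.sum((observed - predict(params)) ** 2 / sigma ** 2)

fit = minimize(loss, x0=[0.10, 1.0, 1.25], method="Nelder-Mead")
print(fit.x)
```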

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slopes; the attention-likelihood theory accounts for 84% of the variance for the slopes. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slopes, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σ_h(B)/σ_h(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted with a single variable, namely, the standard deviation of the net inputs for the easy class [σ_h(A)]. The activity level was fixed to .10, and the ratio of the standard deviations of the net inputs, σ_h(B)/σ_h(A), to 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


probabilities. Figure 8B shows the corresponding results for the slope. The accounted variance is 0.96 for the probabilities and 0.85 for the slopes. Thus, the variance theory, using a single parameter, fits the slopes equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities using one parameter is slightly less than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs from the easy class [σ_h(A)] and the standard deviation of the net inputs from the difficult class [σ_h(B)], can also be expressed as functions of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μ_h); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μ_h(AN) = μ_h(BN) and μ_h(AO) = μ_h(BO)]. Furthermore, the predictions are unaffected by moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σ_h(A) and σ_h(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (P_c) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σ_h(A), σ_h(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (P_c) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μ_h) and the standard deviation of the net inputs (σ_h). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. P_c is solved by integrating the net inputs from T to infinity (∞) over the probability density function of a normal distribution. Thus, the probability (P_c) that a node is active at recognition is

$$P_c = a \int_T^{\infty} \frac{1}{\sqrt{2\pi\sigma_h^2}}\, e^{-(h-\mu_h)^2 / (2\sigma_h^2)}\, dh. \qquad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σ_h), gives the expected recognition strength (μ_s):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σ_h) as an approximation of the standard deviation of the item (σ_h′), because it simplifies the analytic solution; the variance theory, or the simulation, uses the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of the feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σ_s) is calculated from σ_h, P_c, and N. There are 2N nodes in the context and item layers. The distribution of P_c is binomial but can, when a criterion is met [i.e., 2N P_c(1 − P_c) > 10], be approximated with a normal distribution with a standard deviation of [P_c(1 − P_c)/(2N)]^{1/2}. The final result is scaled by the normalization factor 1/σ_h:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c(1-P_c)}{2N}\right]^{1/2}. \qquad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion by integrating the density of the recognition strength over a normal distribution:

$$P(Y) = \int_C^{\infty} \frac{1}{\sqrt{2\pi\sigma_s^2}}\, e^{-(s-\mu_s)^2 / (2\sigma_s^2)}\, ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μ_s(A)] and B [μ_s(B)] and the standard deviations of the recognition strengths of A [σ_s(A)] and B [σ_s(B)]:

$$P(A, B) = \int_0^{\infty} \frac{1}{\sqrt{2\pi[\sigma_s^2(A) + \sigma_s^2(B)]}} \exp\left(-\frac{\left\{s - [\mu_s(A) - \mu_s(B)]\right\}^2}{2[\sigma_s^2(A) + \sigma_s^2(B)]}\right) ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
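Alternatively, the analytic solution can be transcribed directly into a few lines of Python. The sketch below is only illustrative; the parameter values are in the spirit of Figure 4D:

```python
import numpy as np
from scipy.stats import norm

def p_active(a, T, mu_h, sigma_h):
    """Equation 6: probability that a node active at encoding is also
    active at recognition (its net input exceeds the threshold T)."""
    return a * norm.sf(T, loc=mu_h, scale=sigma_h)

def strength(a, T, mu_h, sigma_h, N):
    """Expected recognition strength mu_s and its standard deviation
    sigma_s (Equation 7), for 2N nodes in the two layers."""
    pc = p_active(a, T, mu_h, sigma_h)
    mu_s = (pc - a / 2) / sigma_h
    sigma_s = np.sqrt(pc * (1 - pc) / (2 * N)) / sigma_h
    return mu_s, sigma_s

def p_yes(C, mu_s, sigma_s):
    """Probability of a yes response: P(S > C) for normal S."""
    return norm.sf(C, loc=mu_s, scale=sigma_s)

# Illustrative values in the spirit of Figure 4D: mu_h(N) = 0, mu_h(O) = 1,
# 2N = 100 (so N = 50), a = 0.10, threshold midway between old and new.
a, N, T, C = 0.10, 50, 0.5, 0.0
for label, mu_h in [("new", 0.0), ("old", 1.0)]:
    mu_s, sigma_s = strength(a, T, mu_h, 1.0, N)
    print(label, round(p_yes(C, mu_s, sigma_s), 3))
```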


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is calculated in the model with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (P_c) is assumed to be low. By approximating 1 − P_c with one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σ_s² ∝ P_c). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

$$\frac{\sigma_s^2(N)}{\sigma_s^2(O)} \approx \frac{P_c(N)}{P_c(O)}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number active in the old items' representations is 0.64 (= 0.80²).
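In display form, the slope relation just described is

$$\mathrm{slope} = \frac{\sigma_s(N)}{\sigma_s(O)} \approx \left[\frac{P_c(N)}{P_c(O)}\right]^{1/2},$$

so, as in the example above, a slope of 0.8 implies an activity ratio of 0.8² = 0.64.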

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [P_c(AN) ≈ P_c(BN) and P_c(BO) ≈ P_c(AO)]. Given these approximations and the approximation above (1 − P_c ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions:

$$\frac{\sigma_s(BO)}{\sigma_s(AO)} \le \frac{\sigma_h(A)}{\sigma_h(B)} \approx \frac{\sigma_s(BN)}{\sigma_s(AN)}.$$

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of active nodes, setting the activation threshold between the old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that an optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

$$d' = \frac{\mu_s(O) - \mu_s(N)}{\sigma_s(NO)} = \frac{P_c(O) - P_c(N)}{\left[P_c(NO)\left(1 - P_c(NO)\right)/(2N)\right]^{1/2}}. \qquad (8)$$

Because σ_s(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strengths, σ_s(NO), in the denominator of this equation. Thus, P_c(NO) is equal to [P_c(N) + P_c(O)]/2. P_c was calculated with Equation 6. The expected values and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).
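Equation 8 is straightforward to evaluate numerically. In the following Python sketch, the net input means and standard deviations are supplied as placeholder arguments (in the paper they are derived from Equations A1-A3 in the Appendix):

```python
from scipy.stats import norm

def d_prime(a, T, mu_new, mu_old, sigma_new, sigma_old, N):
    """Equation 8. The net input means and standard deviations are
    placeholder arguments here; in the paper they follow from
    Equations A1-A3 in the Appendix."""
    pc_new = a * norm.sf(T, mu_new, sigma_new)  # Equation 6, new items
    pc_old = a * norm.sf(T, mu_old, sigma_old)  # Equation 6, old items
    pc_no = (pc_new + pc_old) / 2
    return (pc_old - pc_new) / ((pc_no * (1 - pc_no) / (2 * N)) ** 0.5)

# Example: N = 30 nodes, threshold midway between new (0) and old (1) means.
print(d_prime(a=0.05, T=0.5, mu_new=0.0, mu_old=1.0,
              sigma_new=1.0, sigma_old=1.0, N=30))
```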

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding


(a). The results show that d′ is optimal for a = .052. The d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases: For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. For a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor contributing to the fact that optimal performance occurs when the percentage of active nodes is medium low: The number of possible representations increases with a. If there is only one node active in all the representations, there are N possible representations; if there are two nodes active in all the representations, there are approximately N² possible representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried by nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried by the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was also calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved in the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit and false alarm rates. Therefore, it is necessary to use another criterion for performance


with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f[S(O)] denotes the density of the recognition strength of the old distribution, and f[S(N)] the density of the recognition strength of the new distribution. The ratio between these densities is called the likelihood ratio, L = f[S(O)]/f[S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
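Numerically, the intersection point is easy to find. A minimal Python sketch, with illustrative distribution parameters rather than values from the model:

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Illustrative old/new recognition strength distributions, unequal SDs.
mu_n, sd_n = 0.0, 1.0   # new
mu_o, sd_o = 1.0, 1.3   # old

# Optimal criterion: where the densities intersect, i.e., where L(s) = 1.
log_lr = lambda s: norm.logpdf(s, mu_o, sd_o) - norm.logpdf(s, mu_n, sd_n)
c_opt = brentq(log_lr, mu_n, mu_o)  # the root between the two means

print(round(c_opt, 3))
# Hits minus false alarms is maximized at this point:
# norm.sf(c, mu_o, sd_o) - norm.sf(c, mu_n, sd_n).
```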

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. An optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

This problem may be formally stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the sum of the hit rates [P(AO) + P(BO)] is maximized? The solution to this problem is surprisingly simple: Optimal performance occurs when the likelihood ratios of the two classes are equal,

L(A) = f[S(AO)] / f[S(AN)] = L(B) = f[S(BO)] / f[S(BN)].

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the change in the total false alarm rate must be zero, ΔP(AN) + ΔP(BN) = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f[S(AO)]/f[S(AN)] < f[S(BO)]/f[S(BN)] = L(B), which implies an overall decrease in hit rate, ΔP(AO) + ΔP(BO) < 0. This shows that moving the criteria away from the point where L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.
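This equal-likelihood-ratio rule can be checked numerically. The Python sketch below sweeps the split of a fixed false alarm budget between two illustrative classes (the distribution parameters are made up, not taken from the model) and confirms that the likelihood ratios are approximately equal at the hit-rate maximum:

```python
import numpy as np
from scipy.stats import norm

# Class A easy, Class B difficult; equal SDs here for simplicity.
mu = {"AN": 0.0, "AO": 1.5, "BN": 0.0, "BO": 0.7}
sd = 1.0
F = 0.4  # fixed total false alarm rate, P(AN) + P(BN)

best = None
for fa_a in np.linspace(0.001, F - 0.001, 999):
    t_a = norm.isf(fa_a, mu["AN"], sd)      # criterion giving P(AN) = fa_a
    t_b = norm.isf(F - fa_a, mu["BN"], sd)  # rest of the budget goes to B
    hits = norm.sf(t_a, mu["AO"], sd) + norm.sf(t_b, mu["BO"], sd)
    if best is None or hits > best[0]:
        best = (hits, t_a, t_b)

_, t_a, t_b = best
lr = lambda t, o, n: norm.pdf(t, mu[o], sd) / norm.pdf(t, mu[n], sd)
print(round(lr(t_a, "AO", "AN"), 2), round(lr(t_b, "BO", "BN"), 2))  # ~equal
```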

Note that the variance theory has only one overall recognition criterion. However, the theory, and any theory of the mirror effect, specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can be indirectly inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the straight dotted line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as being near optimal for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs of Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate of Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm


rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., P_c > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because P_c(1 − P_c)/N decreases for P_c over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.
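The last step follows from elementary calculus:

$$\frac{d}{dP_c}\left[\frac{P_c(1-P_c)}{N}\right] = \frac{1-2P_c}{N} < 0 \quad \text{for } P_c > \tfrac{1}{2},$$

so increasing P_c beyond .50 can only decrease the predicted variance.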

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active for old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of active nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs for the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the control condition, where the standard deviations of the new distributions were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σ_s(BN) < σ_s(BO) < σ_s(AN) < σ_s(AO) and σ_s(BN) < σ_s(AN) < σ_s(BO) < σ_s(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σ_s(BO) < σ_s(BN) < σ_s(AN) < σ_s(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition would make it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and its recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism operates for the context cue). Arguably, this deblurring mechanism works well for well-known words; however, for nonwords, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance for nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from the concrete class are more easily discriminated than words from the abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. One possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length or the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σ_h′), so the subject does not need to know the class, or the standard deviation of the class (σ_h). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution was presented that requires only calculating the probabilities of two z-transformations.
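In code, this decision variable is essentially a one-liner. The sketch below assumes a binary vector of node states at retrieval and the item's net input standard deviation (all names are illustrative):

```python
import numpy as np

def recognition_strength(active_at_retrieval, a, sigma_h_item):
    """Recognition strength as described above: the proportion of nodes
    active at recognition, minus the expected proportion (a/2),
    normalized by the item's net input standard deviation."""
    p_active = np.mean(active_at_retrieval)   # fraction of active nodes
    return (p_active - a / 2) / sigma_h_item

# Example: 100 nodes, 7 active at retrieval, encoding activity a = 0.10.
states = np.zeros(100)
states[:7] = 1
print(recognition_strength(states, a=0.10, sigma_h_item=1.0))  # yes if > C
```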

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and the z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller W (1968) An introduction to probability theory and its appli-cation New York Wiley

Gillund G amp Shiffrin R M (1984) A retrieval model for bothrecognition and recall Psychological Review 91 1-67

Glanzer M amp Adams J K (1985) The mirror effect in recognitionmemory Memory amp Cognition 13 8-20

Glanzer M amp Adams J K (1990) The mirror effect in recognitionmemory Data and theory Journal of Experimental PsychologyLearning Memory amp Cognition 16 5-16

Glanzer M Adams J K amp Kim K (1993) The regularities ofrecognition memory Psychological Review 100 546-567

Glanzer M amp Bowles N (1976) Analysis of the word frequencyeffect in recognition memory Journal of Experimental PsychologyHuman Learning amp Memory 2 21-31

Glanzer M Kisok K amp Adams J K (1998) Response distribu-tions as an explanation of the mirror effect Journal of ExperimentalPsychology Learning Memory amp Cognition 24 633-644

Greene R L (1996) Mirror effect in order and associative informa-tion Role of response strategies Journal of Experimental Psychol-ogy Learning Memory amp Cognition 22 687-695

Hertz J Krogh A amp Palmer R G (1991) Introduction to the the-ory of neural computation Reading MA Addison-Wesley

Hintzman D L (1988) Judgment of frequency and recognition memoryin a multiple trace memory model Psychological Review 95 528-551

Hopfield J J (1982) Neural networks and physical systems withemergent collective computational abilities Proceedings of the Na-tional Academy of Sciences 79 2554-2558

Hopfield J J (1984) Neurons with graded response have collectivecomputational properties like those of two-state neurons Proceed-ings of the National Academy of Sciences 81 3088-3092

Humphreys M S Bain J D amp Pike R (1989) Different way to cuea coherent memory system A theory for episodic semantic and pro-cedural tasks Psychological Review 96 208-233

Kim K amp Glanzer M (1993) Speed versus accuracy instructionsstudy time and the mirror effect Journal of Experimental Psychol-ogy Learning Memory amp Cognition 19 638-652

Kruschke J K (1992) ALCOVE An exemplar-based connectionistmodel of category learning Psychological Review 99 22-44

Ku Iumlcera H amp Francis W N (1967) Computational analysis ofpresent-day American English Providence RI Brown UniversityPress

Lewandowsky S (1991) Gradual unlearning and catastrophic inter-ference A comparison of distributed architectures In W E Hockleyamp S Lewandowsky (Eds) Relating theory and data Essays onhuman memory in honor of Bennet B Murdock (pp 445-476) Hills-dale NJ Erlbaum

Li S-C amp Lindenberger U (1999) Cross-level unification A com-putational exploration of the link between deterioration of neuro-transmitter systems and dedifferentiation of cognitive abilities in oldage In L-G Nilsson amp H J Markowitsch (Eds) Cognitive neuro-sciences of memory (pp 103-146) Seattle Hogrefe amp Huber


Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM: Retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå. (ISBN 91-7191-155-3)

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

a σp(N) / [σp(N) + σp(O)].

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the z-ROC curve is 0.8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξ_i = ξ_j = 1:

μh(O) = Σ_{j=1}^{N} Δw_ij ξ_j = Σ_{j=1}^{N} (ξ_i − a)(ξ_j − a) ξ_j = aN(1 − a)².  (A1)

The expected value of the net inputs for the new items is zero:

μh(N) = 0.  (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

σh²(N) = Var(Σ_{j=1}^{N} w_ij ξ_j) = P aN E[(ξ_i − a)²(ξ_j − a)²] = P aN [a(1 − a)]² = a³(1 − a)² PN.  (A3)

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

σh²(f) = f [aN(1 − a)]² E[(ξ_j − a)²] = f [aN(1 − a)]² a(1 − a) = a³(1 − a)³ f N².  (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

σh²(L) = a³(1 − a)³ L N².

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where P patterns have been encoded, is

σh²(O) = ½ (f + L) a³(1 − a)³ N² + a³(1 − a)² PN.  (A5)
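Equations A1 and A3 are easy to check numerically. The following is a minimal Monte Carlo sketch (Python; the parameter values and array layout are mine, chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    N, a, P, trials = 100, 0.2, 20, 500

    mu_old, var_new = [], []
    for _ in range(trials):
        items = (rng.random((P, N)) < a).astype(float)
        ctxs = (rng.random((P, N)) < a).astype(float)
        w = (items - a).T @ (ctxs - a)          # learning rule summed over P patterns
        active = ctxs[0] > 0                    # context nodes encoded in the active state
        if active.sum() > 1:
            mu_old.append((items[0] @ w)[active].mean())   # cue with a stored (old) item
            new = (rng.random(N) < a).astype(float)        # cue with a random new item
            var_new.append((new @ w)[active].var())

    print(np.mean(mu_old), a * N * (1 - a) ** 2)            # should approach Equation A1
    print(np.mean(var_new), a**3 * (1 - a)**2 * P * N)      # should approach Equation A3

The simulated mean and variance of the net input converge on the closed forms as the number of trials grows.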

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)



the middle. Old items will have a higher expected percentage active, and new items will have a lower expected percentage active. The activation threshold is constant during recognition in one condition. However, it must vary between conditions, depending on the net inputs, to yield optimal performance.

The percentage of nodes active at recognition is counted:

Pc = (1/2N) (Σ_{i=1}^{N} ζ_i^d + Σ_{i=1}^{N} ζ_i^c),

where ζ_i^d and ζ_i^c are 1 if item node i and context node i, respectively, are active at recognition, and 0 otherwise.

As is shown in Figure 9C, the performance is near-optimal if the recognition decision is based on the number of nodes active at recognition, normalized by the standard deviation of the net inputs across the active features of this item (σh′). Thus, this standard deviation is calculated over all the nodes active at encoding (i.e., x_i^d = 1 and x_i^c = 1; nodes inactive at encoding are not used when calculating the standard deviation, because for low levels of a these nodes carry little to no information about the item). The recognition strength (S) for an item is calculated by subtracting the expected percentage of nodes active at recognition (a/2)¹ from the real percentage of nodes active at recognition (Pc) and dividing by the standard deviation of the net inputs of the item (σh′):

S = (Pc − a/2) / σh′.  (5)

The subtraction of the expected percentage of nodes active at recognition makes the expected value of the recognition strength (S) zero. This subtraction is necessary for the normalization to work properly. The subtraction moves the recognition strength distributions symmetrically, so that the old and the new distributions move at the same rate for a given standard deviation of the net input (without the subtraction, the old recognition strength distribution would be more affected than the new distribution). Thus, the recognition strength is determined by the difference between two probabilities (the percentage of active nodes, which varies, and the expected percentage of active nodes, which is constant), divided by the standard deviation of the net input. A yes response (Y) is given if the recognition strength (S) is above the recognition criterion (C). An unbiased decision has a recognition criterion of zero:

Y: S > C.
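A minimal sketch of this decision stage (Python; the function and array layout are mine, and the test values are taken from the step-by-step example given below):

    import numpy as np

    def recognition_strength(h, cue_active, T, a):
        # Nodes are active at recognition if they are active in the cue
        # pattern and their net input exceeds the activation threshold T.
        active = cue_active & (h > T)
        Pc = active.sum() / h.size                 # percentage of all nodes active
        sigma = h[cue_active].std(ddof=1)          # sigma_h' over nodes active at encoding
        return (Pc - a / 2) / sigma                # Equation 5

    # Net inputs and cue for item AO in the worked example below:
    h = np.array([1.0, 0.0, 2.0, 1.0, -0.5, 0.5, -0.5, 0.5])
    cue = np.array([1, 0, 1, 0, 0, 1, 0, 1], dtype=bool)
    S = recognition_strength(h, cue, T=0.25, a=0.5)   # about 0.35
    print(S > 0)    # unbiased yes response (criterion C = 0)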

An issue that may be raised is whether it is sensible to base recognition strength on two quite different sources, namely, the percentage of active nodes and the variability of the net input. The immediate answer is that if it is reasonable to optimize performance, it is also sensible to measure recognition strength this way. Another perspective is to note that unbiased responses can be made on the percentage of active nodes alone; that is, a yes response occurs if the percentage of active nodes is larger than the expected percentage of active nodes (Pc > a/2), and the variability of the net input can be ignored. Thus, "normally," subjects base their unbiased decisions on the percentage of active nodes, and the variability of the net input only becomes relevant when subjects are biased. From this perspective, the percentage of active nodes is used for unbiased responses, and the variability of the net input becomes relevant for confidence judgments. Therefore, by combining both the percentage of active nodes and the variability of the net input, the measure of recognition strength proposed here will also reflect the confidence judgment.

An Example With Step-by-Step Computations
To clarify the computational details involved in the variance theory, a step-by-step example is given here. For tractability, a small network is used, consisting of four item nodes and four context nodes (see Figure 2). The actual simulations to be reported later involved a larger network architecture, with 30 nodes in each layer. The percentage of nodes active at encoding (denoted by parameter a) is set to .50. Let item BN be represented as (1, 1, 0, 0), written as the states of activation of the four nodes x1^d, x2^d, x3^d, x4^d. Similarly, let (0, 0, 1, 1) represent item BO, (1, 0, 1, 0) represent item AO, and (0, 1, 0, 1) represent item AN. Let context CBN be represented as (1, 1, 0, 0), the states of activation of the four context nodes x1^c, x2^c, x3^c, x4^c. Similarly, (0, 0, 1, 1) represents context CBO, and (0, 1, 0, 1) represents the experimental study context CExp.

Item BN is a high-frequency new word. For simplicity, it is here encoded only once with context CBN in the preexperimental phase (in the simulations below, high-frequency words are preexperimentally associated with three contexts). The 16 weights between the four item nodes and the four context nodes are changed according to the learning rule, where the probability that a node is active at encoding is determined by the parameter a = .5. For example, the weight change between item node 1 and context node 1 is w11 = [x1^d(BN) − a][x1^c(CBN) − a] = (1 − .5)(1 − .5) = 1/4, where BN is item BN and CBN represents context CBN. Similarly, item BO is another high-frequency word that, before the experimental phase, is encoded once with context CBO. Items AO and AN are low-frequency old and new words, and they are not encoded in the preexperimental phase.

In the experimental phase, item AO is encoded with the experimental context CExp. Finally, item BO is encoded with the same experimental context CExp. For example, the weight w11 is now equal to

[x1^d(BN) − a][x1^c(CBN) − a] + [x1^d(BO) − a][x1^c(CBO) − a] + [x1^d(BO) − a][x1^c(CExp) − a] + [x1^d(AO) − a][x1^c(CExp) − a]
= (1 − .5)(1 − .5) + (0 − .5)(0 − .5) + (0 − .5)(0 − .5) + (1 − .5)(0 − .5)
= 1/4 + 1/4 + 1/4 − 1/4 = 1/2.

After encoding, the full weight matrix is (1/2, 1, −1, −1/2; 1/2, 0, 0, −1/2; −1/2, 0, 0, 1/2; −1/2, −1, 1, 1/2), corresponding to the weights w11, w12, . . . , w44, respectively.



At recognition, the state of activation of the old low-frequency item AO is reinstated as a cue to the item layer, and the state of activation of the encoding context of the experimental phase, CExp, is reinstated as a cue to the context layer. The net inputs are then calculated. For example, the net input to context node 1 is

h1^c = Σ_{i=1}^{4} w1i x_i^d = (1/2)(1) + (1)(0) + (−1)(1) + (−1/2)(0) = −1/2.

The net inputs to the item nodes are (1, 0, 2, 1), and those to the context nodes are (−.5, .5, −.5, .5). It can be seen that the expected net input for randomly created new items in this network is 0 and that the expected net input for old items encoded in the active state is 0.5. Therefore, the activation threshold is set to the average of these values, namely, T = 0.25. Nodes that have a net input above the activation threshold and that are active in the cue pattern are activated at recognition. Thus, the state of activation at recognition for the item nodes is (1, 0, 1, 0), and that for the context nodes is (0, 1, 0, 1) (which is identical to the cue patterns). The percentage of active nodes is counted: Pc(AO) = 4/8 = .5. For an unbiased response (C = 0), this yields a yes response, because this percentage is larger than the expected percentage of active nodes at recognition (a/2 = .25). The standard deviation of the net inputs for nodes active at encoding is 0.71 (i.e., the standard deviation of 1, 2, .5, and .5, corresponding to nodes 1, 3, 6, and 8). The recognition strength is calculated by subtracting the expected percentage of active nodes at recognition from the percentage of active nodes for the to-be-recognized item and dividing by the standard deviation of the net inputs for active nodes [S = (Pc − a/2)/σh′ = (.5 − .25)/.71 = 0.35].

The recognition of the three items BO, AN, and BN is done in the same way. The results for the four items AN, BN, BO, and AO are: the net inputs (1, 0, 2, 1, .5, −.5, .5, −.5), where the first four numbers represent item nodes and the last four context nodes, (1, 0, 2, 1, 1.5, .5, −.5, −1.5), (1, 0, 2, 1, −1.5, −.5, .5, 1.5), and (1, 0, 2, 1, −.5, .5, −.5, .5); the states of activation at recognition (0, 0, 0, 1, 0, 0, 0, 0), (1, 0, 0, 0, 0, 1, 0, 0), (0, 0, 1, 1, 0, 0, 0, 1), and (1, 0, 1, 0, 0, 1, 0, 1); the numbers of nodes active, 1, 2, 3, and 4; the standard deviations of the net inputs, 0.71, 1.08, 1.08, and 0.71; the recognition strengths, −0.17, 0.00, 0.11, and 0.35; and the unbiased responses, no, no, yes, and yes, respectively (an unbiased yes response is made when S > C, where C = 0 for unbiased responses). Thus, the subject responds correctly to all items, the recognition strengths are ordered according to the mirror effect, and the variance of the net input is larger for the high- than for the low-frequency words.
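The encoding steps of this example are compact enough to verify directly. A minimal sketch (Python; the array layout is mine) that reproduces the weight w11 = 1/2 and the net input to context node 1 for cue AO, −1/2:

    import numpy as np

    a = 0.5
    BN, BO, AO, AN = (np.array(p, dtype=float) for p in
                      ([1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]))
    CBN, CBO, CExp = (np.array(p, dtype=float) for p in
                      ([1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 1]))

    # two preexperimental encodings and two experimental encodings
    encodings = [(BN, CBN), (BO, CBO), (AO, CExp), (BO, CExp)]

    # learning rule: accumulate (x_c - a)(x_d - a) outer products;
    # w[j, i] connects context node j and item node i
    w = np.zeros((4, 4))
    for item, ctx in encodings:
        w += np.outer(ctx - a, item - a)

    print(w[0, 0])    # 0.5, the weight w11 computed in the text
    print(w[0] @ AO)  # -0.5, the net input to context node 1 for cue AO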

SIMULATIONS OF THE FREQUENCY-BASED MIRROR EFFECT

In this section, the variance theory of the mirror effect is simulated in a connectionist framework that is consistent with a general connectionist theory of memory called TECO (Sikström, 1996a, 1996b). TECO has been used to account for a variety of memory phenomena involving recognition and recall, for example, successive testing of recall and recognition (Sikström, 1996b) and forgetting curves (Sikström, 1999). An extensive description of TECO is beyond the scope of this paper. Interested readers are referred to previous articles describing the model for details.

Procedure
The simulation started with initializing the weights to zero. Then 12 items were generated by randomly setting the nodes to an active state with a probability of a. A preexperimental phase then followed, to generate the frequency associated with the items. In this preexperimental phase, half of the items (i.e., 6) were encoded three times, each time in a different context. These items simulated the high-frequency words. The remaining 6 items were not encoded before the experimental phase, and they simulated the low-frequency words.

At the experimental phase, one study-encoding context was generated, using the same procedure as in the generation of the contexts in the preexperimental phase. Given that in a standard recognition experiment all studied items would be encoded in the same list, in the simulations the items were thus encoded once with the same study context. Three of the high-frequency items were encoded, and three of the low-frequency items were encoded. The other three high- and three low-frequency items were not encoded during the experimental phase, and they simulated the new items. Each encoding was simulated by first activating the nodes in the item and context layers. The weights were then changed according to the learning rule (Equation 3).

At the recognition test, the patterns of activation of the 12 items and the study context were reinstated to the network. The net inputs were calculated for each node, and the recognition strength was calculated from all the nodes in the network. The somewhat unrealistic assumption that no learning or changes of weights occurred during testing was adopted. However, this is a standard assumption often used in other simulations of recognition memory, cued recall, or categorization (e.g., Kruschke, 1992; Lewandowsky, 1991; Li & Lindenberger, 1999; Li, Lindenberger, & Frensch, 2000). One thousand five hundred simulation runs, with different random item and context patterns, were carried out. The results reported later are based on the averages across these runs.

Parameters
The number of high-frequency patterns was six (each encoded three times preexperimentally, and three of them encoded once experimentally to simulate the old items), and the number of low-frequency patterns was six (three of them encoded experimentally once to simulate the old items). The following parameter settings were used. The number of nodes in each layer (N) was 30 (the total number of nodes was 2N = 60), and the percentage of nodes



active at encoding (a) was .20. Initially, all the weights were set to zero. The recognition criterion (C) was set to 0.

Results
Figure 3A shows the results in terms of the density of the net inputs to an active node. The expected values of the net inputs are approximately equal for new high- and new low-frequency items [μh(AN) = 0.0, μh(BN) = 0.0] and for old low- and old high-frequency items [μh(AO) = 3.8, μh(BO) = 3.8]. The high-frequency items have a larger variance of the net inputs [σh(BN) = 4.9, σh(BO) = 4.8] than do the low-frequency items [σh(AN) = 4.1, σh(AO) = 4.0]. The variances of the old and the new distributions are approximately equal.

Figure 3B shows the density of the recognition strength. The result shows the mirror effect: the hit rate is larger for low- than for high-frequency items, and the false alarm rate is lower for low- than for high-frequency items [P(AN) = .18, P(BN) = .25, P(BO) = .64, P(AO) = .71]. The standard deviation of the recognition strength is larger for old than for new words, and larger for the low-frequency words than for the high-frequency words [σs(AN) = .029, σs(BN) = .019, σs(BO) = .023, σs(AO) = .031]. These findings agree with the empirical data and with the predictions of the attention-likelihood model (Glanzer et al., 1993). Thus, the simulation shows that the variance theory can account for the frequency-based mirror effect and the associated ROC curves.

EXPLICATIONS OF THE MECHANISMS

In this section, three essential mechanisms of the variance theory that account for the mirror effect are discussed in detail. The mechanisms involved are (1) the variance of the net input, (2) the expected value of the net input, and (3) the way in which the recognition strength is calculated, by counting the number of nodes, so that the performance is optimal.

Overview
The first mechanism is that items from the difficult class (i.e., high-frequency items in this case) have a higher variance of the net inputs than do items from the easy class (i.e., low-frequency items), but the variance is independent of whether the items are old or new. The mechanism is illustrated in Figure 4A (noncumulative). It should also be underscored that the difference in variance as a function of class membership is not a primitive of the theory that the theorist manipulates. It is the natural consequence of the rather plausible assumption that high-frequency items appear more often and are associated with a greater variety of contexts than are low-frequency words. The second mechanism is that old items have a higher expected net input to nodes encoded in the active state than do new items (but this expected net input does not depend on the class of the items). The mechanism is illustrated in Figure 4B (cumulative). The third mechanism is the way recognition strength is measured, by counting the active nodes so that recognition performance (e.g., d′) is optimal or near-optimal. An extended analysis of optimality is presented at the end of the paper; however, the results in Figures 9A-9D can summarize the main points here. The results from this analysis imply that the decision must be based on the nodes active at encoding (see Figure 9A), the percentage of active nodes must be low (see Figure 9A), the activation threshold needs to be between the expected net inputs of the new and old items (see Figure 9B), and the recognition strength is normalized by the standard deviation of the net inputs of the item (see Figures 9C and 9D). The density of the percentage of active nodes is illustrated in Figure 4C, and the normalized percentage of active nodes is illustrated in Figure 4D. Here it is shown how these mechanisms predict the mirror effect. Below, these three sets of mechanisms are explained in more detail.

Figure 3. (A) Simulations of net inputs. The vertical axis shows the simulated density of the net inputs. (B) Simulations of the mirror effect. The vertical axis shows the simulated density of the recognition strength.

Variance of the Net Input
The first, and perhaps the most important, mechanism is that the difficult class has a larger variability in the net inputs than does the easy class. As will be discussed later, this property is shared with other theories, such as the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). However, a unique aspect of the variance theory is that it is a necessary outcome of simply encoding high-frequency items more times than low-frequency items. In the subjective-likelihood account, a larger variability for high- than for low-frequency words is an assumption that is controlled by a free parameter (p0), to reflect that high-frequency words have more definitions than do low-frequency words. The variance theory needs no such assumptions and no additional free parameters to control the variability. Whereas the subjective-likelihood account tries to capture word frequency with a free parameter, in the variance theory the word-frequency effect is simulated via the number of contexts associated with an item.

Owing to how the variance theory implements the relations between contexts and items, the variance of the net inputs increases with the frequency with which an item is encoded in different contexts. Intuitively, this occurs because a high-frequency item activates several different contexts, causing the representations of many competing contexts to be activated simultaneously. Low-frequency items are associated with fewer contexts than are high-frequency items. This causes the representations of fewer contexts to be activated simultaneously. Thus, low-frequency items have less variability in the net inputs than do high-frequency items.

At recognition, an item produces a net input in the context layer that is a mixture of the net inputs from the study context that the network is instructed to retrieve from and from several uncorrelated preexperimental contexts that were associated with the item during the preexperimental phase. The study context that the network is instructed to retrieve from contributes to the correct active state, for example, by adding +1 to the net input to a node (which is the expected net input for a node encoded in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was in an active state, or it could add −1 to the net input if the particular preexperimental context was encoded in an inactive state (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input. That is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and −1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1² for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for the items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or −1/4 [see Equation 3], each yielding an increase in variability of (1/4)² = 1/16. An expected value of aN nodes contribute, so the total increase in variability is 4 × 1/16 = 1/4.)

Figure 4. (A) The probability density of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The dotted vertical line is the activation threshold. (B) The cumulative probability distributions of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the cumulative probability distributions of the net inputs. (C) The density of the percentage of nodes active at recognition in the variance theory. The horizontal axis shows the percentage of nodes active at recognition, and the vertical axis the density. (D) The density of recognition strength in the variance theory. The horizontal axis shows the recognition strength, and the vertical axis the density of the recognition strength, using standard parameter values.
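The arithmetic in this paragraph can be written out in the Appendix's notation (a sketch; it assumes exactly aN active item nodes per pattern):

\[
h_j^{(p)} \;=\; \sum_i (\xi_i - a)(\xi_j^{p} - a)\,\xi_i
        \;=\; aN(1-a)\,(\xi_j^{p} - a),
\]
\[
\mathrm{E}\bigl[h_j^{(p)}\bigr] = 0, \qquad
\mathrm{Var}\bigl[h_j^{(p)}\bigr] = [aN(1-a)]^2\, a(1-a) = a^3(1-a)^3 N^2 .
\]

With N = 8 and a = .5, each added context contributes ±1 with equal probability, so the variance grows by 1 for each context, exactly as in the text.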

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in context nodes increases linearly with how many times a given item is encoded within different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.

In the same way as the variability of the context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with how many times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input
The second mechanism in the variance theory is that the expected net inputs to the easy and the difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase are equal for the two classes. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given exactly the same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are only interested in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in the expected net input is influenced by strength-based variables, such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to a smaller increase in the net input, than does a longer study time of, for example, 2 sec. Because the study context is unique to the learning episode, preexperimental encodings in other contexts will not affect the expected net input, but they do affect the variability of the net input, as was demonstrated above. The item-study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, as a difficult-class item) is equal to the expected net input for ARCTIC (a low-frequency word, as an easy-class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and testing conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult-class item is equal to the expected net input for a new easy-class item.

The probability density functions of the net inputs for nodes in the active state are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted. New nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of the net inputs for the easy-class items [σh(A)] is smaller than the standard deviation of the net inputs for the difficult-class items [σh(B)]. The second mechanism is shown in the figures in that the expected net input of an easy-class new item [μh(AN)] is equal to the expected net input of a difficult-class new item [μh(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: the expected net input for the old easy-class items [μh(AO)] is equal to the expected net input for the old difficult-class items [μh(BO)].

Recognition Strength
The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that the performance is optimal or near-optimal. If the net input is above the activation


threshold (T) and the node was active at encoding, the node is activated at recognition. Otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities or distributions predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of parameter values for the activation threshold. Thus, P(AN) < P(BN) < P(BO) < P(AO) for μh(AN) = μh(BN) < T < μh(AO) = μh(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy- and the difficult-class items, whereas the expected strengths of the net inputs are equal. The variability is lower for easy-class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy- than for the difficult-class items, when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy-class items is higher than the hit rate for difficult-class items, when the activation threshold is less than the expected value of the old items.

The activation threshold is set to be between the expected values of the new and the old net inputs, so that the performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for difficult- and easy-class items:

T = (1/4)[μh(AN) + μh(BN) + μh(BO) + μh(AO)] = (1/4)[μh(BO) + μh(AO)] = (1/2)μh(O).
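As a small sketch (Python; the function name is mine), the threshold rule and its shift with overall strength:

    def activation_threshold(mu_new, mu_old):
        # average of the expected new and old net inputs (equation above);
        # with mu_new = 0 this is half the expected old net input
        return (mu_new + mu_old) / 2

    print(activation_threshold(0, 1))    # 0.50, the standard case
    print(activation_threshold(0, 0.5))  # 0.25, the concentering example below
    print(activation_threshold(0, 2))    # 1.00, the dispersion example below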

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions to optimize the performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with the shift in the activation threshold necessary for keeping the performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold to the overall difficulty of the test in order to maximize the performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [μh(AN) = μh(BN) < μh(BO) = μh(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σp(O) > σp(N)], whereas the standard deviations of the net inputs are equal for old and new items [σh(N) = σh(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions, because there are fewer nodes active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero. This standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes (σp²) is approximately proportional to the percentage of active nodes [σp² = Pc(1 − Pc)/N ≈ Pc/N ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σp) is smaller for new than for old items [σp(AN) < σp(BN) < σp(BO) < σp(AO)], whereas the standard deviation of the net input is not [σh(AN) = σh(AO) < σh(BN) = σh(BO)]. The essential mechanism that makes these changes in the means and standard deviations is the nonlinearity introduced by the counting of the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.
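This is the familiar binomial variance of a proportion; as a sketch, treating the node activations as approximately independent Bernoulli trials,

\[
\sigma_p^2 \;=\; \frac{P_c\,(1-P_c)}{N} \;\approx\; \frac{P_c}{N} \;\propto\; P_c
\qquad (P_c \ll 1).
\]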

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because this improves the performance.

To optimize the performance, subjects base their recognition decision on the number of active nodes, normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments when the responses are biased, but it plays no role for unbiased responses.



To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σh′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σh′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviations of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, where C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) influences the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) influences the decision to be more certain.

Figure 4D shows the density distribution of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult class (in Figure 4C) change when they are normalized by the variance of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult-class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high. The easy-class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of the net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class, when the activation threshold is set between the new and the old distributions. Here it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution that is presented at the end of the paper, together with an analysis of optimal performance.

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.

The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a), plus one parameter for each class of words [the standard deviation of the net input, σh]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs of the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs of the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = −0.012, μs(BN) = −0.008, μs(BO) = 0.008, and μs(AO) = 0.012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists
The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize the performance. This change in the activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep, or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize the performance. The distributions in Figure 6A are closer together than the distributions in Figure 4D. Thus, decreasing the net inputs, for example by diminishing study time, moves the distributions closer together, showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite directions, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain a near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list.


In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed at 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or for any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

Table 1
General Table of Results From the Experiment

Condition     AN    BN    BO    AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13   .17   .69   .82        0.60             0.86
Frequency    .20   .28   .80   .68        1.01             0.66
Time         .10   .15   .78   .76        0.89             0.81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low- (AN) and high- (BN) frequency words and the hit rates for high- (BO) and low- (AO) frequency words, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength, and the vertical axes the density.

List-Length Effect
Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy- and difficult-class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves
The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes), as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and hence the recognition strengths, become equal. Figure 6D plots the predictions of the theory when all the standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020. The ratios of these standard deviations must follow Equation 2. This is also the case, with σs(BN)/σs(AO) = 0.61 < 0.74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = 0.78 < 0.94 = σs(AN)/σs(BO).

Changing the Recognition Criterion
The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data on the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = 0.016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy-class words than for difficult-class words. This prediction is supported by empirical data. Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.
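These response probabilities follow from treating the four recognition strength distributions as normal, with the means and standard deviations given above; a minimal sketch (Python) recovers them up to rounding:

    from statistics import NormalDist

    # expected recognition strengths and standard deviations from the text
    mu    = {"AN": -0.012, "BN": -0.008, "BO": 0.008, "AO": 0.012}
    sigma = {"AN":  0.015, "BN":  0.012, "BO": 0.015, "AO": 0.020}

    def p_yes(cls, C):
        # P(S > C) for a normal recognition strength distribution
        return 1 - NormalDist(mu[cls], sigma[cls]).cdf(C)

    for C in (0.0, 0.016):          # unbiased and conservative criteria
        print(C, {k: round(p_yes(k, C), 2) for k in mu})
    # C = 0     -> AN .21, BN .25, BO .70, AO .73 (mirror effect)
    # C = 0.016 -> AN .03, BN .02, BO .30, AO .42 (no mirror effect)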

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN); P(BO, AN) < P(AO, AN); P(BN, AN) > .50; and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.

Within-List Strength Manipulation
So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations of the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage to old high-frequency items when attention is focused on these items) must be larger than the effect of the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage to old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of the net inputs (Figure 7A) that the density of the recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2, rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75, rather than 0.50. The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 6B: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74]. In the analytic solution, the probability that a node is active at recognition is obtained by integrating the net-input density above the activation threshold:

Pc = a ∫_T^∞ [1/(σh √(2π))] exp[−(h − μh)²/(2σh²)] dh.

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the high-frequency distribution (BO) is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs do not depend on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = 0.013, σs(BN) = 0.011, σs(BO) = 0.017, and σs(AO) = 0.019. These results are similar to the results when the same amount of attention is paid to the two classes in Figure 4D: σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that, if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect in which the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate distribution for high-frequency items would be larger than that for low-frequency items, and, similarly, the standard deviation of the high-frequency false alarm distribution would be larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method
Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years (range, 18-29).

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether or not the word had been presented in the list. The subjects were also required to rate their confidence in their responses, on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition testing was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition. Nine subjects were given the control condition followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results
The results from the experiment are presented in the first three rows of Table 1. The hit rate was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate was larger for the low-frequency words. One-tailed paired t tests on each subject's performance were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1-5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly for the slope of the false alarms [σs(BN)/σs(AN)], are shown in the last two rows of Table 1.
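As an illustration of this z-transformation procedure, the following minimal sketch (Python; the response counts are hypothetical, not the experiment's data) computes a z-ROC slope from confidence ratings. The same regression applied to two z-transformed hit-rate vectors gives the slope ratios reported in Table 1.

```python
import numpy as np
from scipy.stats import norm, linregress

# Hypothetical confidence-rating counts (ratings 1-5) for old and new items.
old_counts = np.array([5, 10, 20, 30, 35])   # old items rated 1..5
new_counts = np.array([40, 25, 15, 12, 8])   # new items rated 1..5

def cumulative_z(counts):
    """z-transformed P(rating >= c) for criteria c = 2..5."""
    p = counts[::-1].cumsum()[::-1] / counts.sum()  # P(rating >= c), c = 1..5
    return norm.ppf(p[1:])  # drop c = 1 (probability 1, z undefined)

z_hit = cumulative_z(old_counts)
z_fa = cumulative_z(new_counts)

# Slope of the z-ROC: regression of z(hit) on z(false alarm).
slope = linregress(z_fa, z_hit).slope
print(f"z-ROC slope: {slope:.2f}")
```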

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect even when p(i,old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i,old) should be equal during recognition whereas n(i) is different [p(i,old) is calculated from n(i)], or why the amount of attention for high-frequency items should be higher than that for low-frequency words at encoding but lower at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

$$\sum_{i=1}^{8}\frac{\left(\mathrm{Observed}_i-\mathrm{Predicted}_i\right)^2}{\sigma_i^2}.$$

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were N = 1,000 and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σh(A)], and the standard deviation of the net inputs for the difficult-class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.

The results show that the variance theory accounts for .98 (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for .93 of the variance for the slope, whereas the attention-likelihood theory accounts for .84 of the variance of the slope. Thus, with three fitted parameters, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slope, as compared with the attention-likelihood theory.
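For concreteness, the fitting procedure can be sketched as follows (Python, using scipy's optimizer). The observed yes rates below are placeholders rather than Glanzer et al.'s (1993) data, and the predictor is a bare-bones restatement of Equations 6 and 7 (slope ratios omitted for brevity); the sketch shows the shape of the fit, not a reproduction of the reported r² values.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Placeholder observed yes rates [P(AN), P(BN), P(BO), P(AO)].
observed = np.array([0.20, 0.25, 0.70, 0.74])
sigma2 = np.ones_like(observed)  # empirical variances set to one

def predict(params, N=50, mu_new=0.0, mu_old=1.0, C=0.0):
    """Yes rates for AN, BN, BO, AO from Equations 6 and 7."""
    a, sh_a, sh_b = params
    T = (mu_new + mu_old) / 2  # threshold midway between new and old means
    rates = []
    for sh, mu in ((sh_a, mu_new), (sh_b, mu_new), (sh_b, mu_old), (sh_a, mu_old)):
        p_c = a * norm.sf(T, loc=mu, scale=sh)          # Equation 6
        mu_s = (p_c - a / 2) / sh                       # expected strength
        s_s = np.sqrt(p_c * (1 - p_c) / (2 * N)) / sh   # Equation 7
        rates.append(norm.sf(C, loc=mu_s, scale=s_s))   # P(yes)
    return np.array(rates)

def loss(params):
    # Squared error divided by the variance, as in the text.
    return np.sum((observed - predict(params)) ** 2 / sigma2)

fit = minimize(loss, x0=[0.10, 1.0, 1.25], method="L-BFGS-B",
               bounds=[(0.01, 0.5), (0.2, 5.0), (0.2, 5.0)])
print("fitted a, sigma_h(A), sigma_h(B):", np.round(fit.x, 3))
```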

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as constant, given that the same material is used in the different conditions. Therefore, the variance theory was also fitted with a single variable, namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed at 0.10 and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), at 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities. Figure 8B shows the corresponding results for the slope. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes with a single parameter equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities with one parameter is slightly lower than the fit of the attention-likelihood theory with three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed in terms of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from T to infinity (∞) over the probability density function of a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

$$P_c = a\,\frac{1}{\sqrt{2\pi\sigma_h^2}}\int_T^{\infty} e^{-(h-\mu_h)^2/(2\sigma_h^2)}\,dh. \qquad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes and dividing by the standard deviation of the net inputs (σh) gives the expected recognition strength (μs):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because this simplifies the analytic solution; however, the variance theory, and the simulation, use the standard deviation of the item. This approximation is good when there is a large number of features; for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and item layers. The distribution of Pc is binomial but can, given a certain criterion [i.e., 2N·Pc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/2N]^(1/2). The final result is scaled by the normalization factor 1/σh:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c(1-P_c)}{2N}\right]^{1/2}. \qquad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

$$P(Y) = \frac{1}{\sqrt{2\pi\sigma_s^2}}\int_C^{\infty} e^{-(s-\mu_s)^2/(2\sigma_s^2)}\,ds.$$

The probability of choosing A over B in a two-alternative forced-choice recognition test [P(A,B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

$$P(A,B) = \frac{1}{\sqrt{2\pi\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}}\int_{0}^{\infty}\exp\!\left(-\frac{\left[s-\left(\mu_s(A)-\mu_s(B)\right)\right]^2}{2\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}\right)ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
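Because the analytic solution is so compact, it can be expressed directly in a few lines of code. The following Python sketch (numpy/scipy assumed) strings together Equation 6, the expected recognition strength, Equation 7, and P(Y). The parameter values echo Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, C = 0], whereas σh = 1.25 and the threshold placed midway between the new and old means are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def predict_yes_rate(mu_h, sigma_h, T, C, a=0.10, N=50):
    """Analytic solution of the variance theory for one distribution.

    mu_h, sigma_h: expected value and SD of the net inputs
    T: activation threshold; C: recognition criterion
    a: probability a node is active at encoding; 2N: total nodes
    """
    # Equation 6: probability that a node is active at recognition.
    p_c = a * (1 - norm.cdf(T, loc=mu_h, scale=sigma_h))
    # Expected recognition strength (normalized by sigma_h).
    mu_s = (p_c - a / 2) / sigma_h
    # Equation 7: SD of recognition strength (normal approx. to binomial).
    sigma_s = np.sqrt(p_c * (1 - p_c) / (2 * N)) / sigma_h
    # Probability of a yes response: mass of N(mu_s, sigma_s) above C.
    return 1 - norm.cdf(C, loc=mu_s, scale=sigma_s)

# Illustrative values in the spirit of Figure 4D: new items mu_h = 0,
# old items mu_h = 1, threshold midway, unbiased criterion C = 0.
hit = predict_yes_rate(mu_h=1.0, sigma_h=1.25, T=0.5, C=0.0)
fa = predict_yes_rate(mu_h=0.0, sigma_h=1.25, T=0.5, C=0.0)
print(f"hit = {hit:.2f}, false alarm = {fa:.2f}")
```

With these assumed values, the sketch returns a hit rate near .73 and a false alarm rate near .20, in the neighborhood of the unbiased predictions quoted earlier.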


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is, in the model, calculated with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc with one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC. The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations:

$$\frac{\sigma_s^2(N)}{\sigma_s^2(O)} \approx \frac{P_c(N)}{P_c(O)}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.8²).
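Written out from Equation 7 under the approximation 1 − Pc ≈ 1, the slope relation used in this example is

$$\text{slope} = \frac{\sigma_s(N)}{\sigma_s(O)} = \left[\frac{P_c(N)}{P_c(O)}\right]^{1/2},
\qquad\text{so}\qquad
\frac{P_c(N)}{P_c(O)} = \text{slope}^2 = 0.8^2 = 0.64.$$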

Another approximation useful for understanding the model is that, for two classes of items, the number of active nodes in the new distributions is approximately equal, and the number of active nodes in the old distributions is approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations, and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs, in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and easy old distributions:

$$\frac{\sigma_s(BO)}{\sigma_s(AO)} \le \frac{\sigma_s(BN)}{\sigma_s(AN)} \approx \frac{\sigma_h(A)}{\sigma_h(B)}.$$

The exact solution predicts a slightly smaller ratio in the old than in the new distributions.

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield good performance in the model are a low percentage of active nodes, setting the activation threshold between the old and the new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytic equations above, we find the following expression:

$$d' = \frac{\mu_s(O) - \mu_s(N)}{\sigma_s(NO)} = \frac{P_c(O) - P_c(N)}{\left[P_c(NO)\left(1 - P_c(NO)\right)/(2N)\right]^{1/2}}. \qquad (8)$$

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc(·) was calculated with Equation 6. The expected values and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a). The results show that d′ is optimal at a = 0.052; d′ is lower for larger and for smaller a. The lower d′ for large a occurs because the interference from other items increases: For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. Thus, for a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity level to fit some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. Surprisingly, however, the placement of this criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = 0.052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = 0.052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = 0.5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.
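The threshold analysis can be made concrete with a small numeric scan. The sketch below (Python) applies Equations 6 and 8 with the values quoted in the text for the expected net inputs [μ(N) = 0, μ(O) = 1.42] and the optimal activity level (a = 0.052); σh = 1 and N = 30 are illustrative stand-ins, since the Appendix equations for the net-input moments are not reproduced here, so the location of the maximum will differ somewhat from the reported T = 0.81.

```python
import numpy as np
from scipy.stats import norm

A, N = 0.052, 30                       # encoding activity, number of nodes
MU_N, MU_O, SIGMA_H = 0.0, 1.42, 1.0   # net-input moments (sigma_h assumed)

def d_prime(T):
    """Equation 8, with Pc(N) and Pc(O) from Equation 6 at threshold T."""
    p_new = A * norm.sf(T, loc=MU_N, scale=SIGMA_H)   # Pc(N)
    p_old = A * norm.sf(T, loc=MU_O, scale=SIGMA_H)   # Pc(O)
    p_no = (p_new + p_old) / 2                        # Pc(NO)
    return (p_old - p_new) / np.sqrt(p_no * (1 - p_no) / (2 * N))

# Scan thresholds between the new and the old distributions, as in Figure 9B.
ts = np.linspace(-0.5, 2.5, 301)
best = ts[np.argmax([d_prime(t) for t in ts])]
print(f"optimal T = {best:.2f}, midpoint of means = {(MU_N + MU_O) / 2:.2f}")
```

Even in this simplified form, the scan shows an interior maximum that lies between the two means and above their midpoint, which is the qualitative pattern described in the text.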

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried by nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried by the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (although the signs of the weight changes differ). Thus, inactive and active nodes carry the same amount of information, and performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in the inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved in the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes (shown by the solid line). The results show that the highest d′ is found when the decision is based only on active nodes and a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above 0.15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit and false alarm rates. Therefore, it is necessary to use another measurement of performance with respect to the placement of the recognition criterion. A natural choice in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance is found when the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smaller standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. This is true because, if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these densities is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
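For normal densities, this intersection condition can be written out explicitly; the following restatement uses the same Gaussian assumptions already applied to the recognition strength distributions:

$$L(S) = \frac{f[S(O)]}{f[S(N)]}
      = \frac{\sigma_s(N)}{\sigma_s(O)}\,
        \exp\!\left(\frac{[S-\mu_s(N)]^2}{2\sigma_s^2(N)}
                  - \frac{[S-\mu_s(O)]^2}{2\sigma_s^2(O)}\right) = 1.$$

Taking logarithms gives a quadratic equation in S; its root between the two means is the optimal criterion, and when σs(N) = σs(O), it reduces to the midpoint S = [μs(N) + μs(O)]/2, matching the equal-variance case described above.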

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes is optimal.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding; the dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A; optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretically optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the total hit rate [P(AO) + P(BO)] is maximized? The solution to this problem is surprisingly simple: Optimal performance occurs when the likelihood ratios of the two classes are equal,

$$L(A) = \frac{f[S(AO)]}{f[S(AN)]} = \frac{f[S(BO)]}{f[S(BN)]} = L(B).$$

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is diminished, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the false alarm rates must cancel: Δf[S(AN)] + Δf[S(BN)] = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f[S(AO)]/f[S(AN)] < f[S(BO)]/f[S(BN)] = L(B), or Δf[S(AO)] + Δf[S(BO)] < 0. This shows that moving the placement of the criteria away from L(A) = L(B) results in an overall decrease in hit rate (Δf[S(AO)] + Δf[S(BO)] < 0), and performance suffers.
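A small numeric sketch may make the equal-likelihood-ratio rule concrete. In the following Python fragment, the Gaussian parameters for the two classes are illustrative placeholders; brentq finds, for each class, the criterion at which L = 1 (the same construction works for any common target ratio, which is how a fixed total false alarm rate would be traded between the classes).

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Gaussian parameters for each class's new/old recognition-strength
# distributions (illustrative values only, not fitted to data).
classes = {
    "A": {"mu_n": 0.0, "sd_n": 1.0, "mu_o": 2.0, "sd_o": 1.2},  # easy
    "B": {"mu_n": 0.0, "sd_n": 1.0, "mu_o": 1.0, "sd_o": 1.1},  # difficult
}

def likelihood_ratio(c, p):
    """L(c) = f[S(O)] / f[S(N)] at criterion c."""
    return norm.pdf(c, p["mu_o"], p["sd_o"]) / norm.pdf(c, p["mu_n"], p["sd_n"])

def criterion_for_ratio(L, p):
    """Criterion at which the likelihood ratio equals L (bracketed root)."""
    return brentq(lambda c: likelihood_ratio(c, p) - L, -5, 5)

# Setting both classes to the same likelihood ratio (here L = 1) satisfies
# the optimality condition L(A) = L(B) described in the text.
for name, p in classes.items():
    c = criterion_for_ratio(1.0, p)
    print(f"class {name}: criterion = {c:.2f}, "
          f"hit = {1 - norm.cdf(c, p['mu_o'], p['sd_o']):.2f}, "
          f"fa = {1 - norm.cdf(c, p['mu_n'], p['sd_n']):.2f}")
```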

Note that the variance theory has only one overall recognition criterion. However, the theory, and any theory of the mirror effect, specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can be inferred indirectly from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently for the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes, minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance, even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as near optimal for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs for Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate of Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce the appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects are merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data; if the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active for old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs for the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to the high-frequency words and when attention was divided equally between the two classes (except in the control condition, where the standard deviations of the new distributions were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rate.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental, learning episode. Thus, there would be a confounding between the item's frequency and its recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism operates for the context cue). Arguably, this deblurring mechanism works well for well-known words; however, for nonwords it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the deblurring process is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance for nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that of words, in the variance theory. However, further work is needed before any firm conclusion can be drawn regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from the concrete class are more easily discriminated than words from the abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length or the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli: During the calculation of recognition strength, the standard deviation of the net input of the item (σh′) is used, so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution was presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and the z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland & Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.
Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.
Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.
Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.
Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.
Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.
Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.
Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.
Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neurosciences of memory (pp. 103-146). Seattle: Hogrefe & Huber.


Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.
McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.
Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.
Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.
Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.
Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM—retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.
Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Umeå: Umeå University (doctoral dissertation; ISBN 91-7191-155-3).
Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.
Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.
Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.
Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

aσp(N) / [σp(N) + σp(O)].

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the z-ROC curve is 0.8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input yields the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξi = ξj = 1:

μh(O) = Σj Δwij ξj = Σj (ξi − a)(ξj − a)ξj = a(1 − a)²N. (A1)

The expected value of the net inputs for the new items is zero:

μh(N) = 0. (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

σh²(N) = Σp Σj (Δwij ξj)² = aNP[a²(1 − a)⁴ + 2a³(1 − a)³ + a⁴(1 − a)²] = a³(1 − a)²PN. (A3)

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

σh²(f) = f E{[(ξi − a) Σj (ξj − a)ξj]²} = f a(1 − a)[aN(1 − a)]² = f a³(1 − a)³N². (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

σh²(L) = La³(1 − a)³N².

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where P patterns have been encoded, is

σh²(O) = (f + L)a³(1 − a)³N²/2 + a³(1 − a)²PN. (A5)
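Equations A1 and A2 can be checked with a minimal Monte Carlo sketch (Python; all names are illustrative). It assumes the Hebbian rule Δwij = (ξi − a)(ξj − a) referred to in the text as Equation 3 and patterns with exactly aN active nodes:

```python
import numpy as np

rng = np.random.default_rng(1)
N, a, runs = 50, 0.2, 2000
old_h, new_h = [], []
for _ in range(runs):
    xi = np.zeros(N)
    xi[rng.choice(N, size=int(a * N), replace=False)] = 1.0  # stored item
    w = np.outer(xi - a, xi - a)           # one Hebbian encoding (Equation 3 form)
    nu = np.zeros(N)
    nu[rng.choice(N, size=int(a * N), replace=False)] = 1.0  # unrelated new item
    i = int(np.flatnonzero(xi)[0])         # a node encoded in the active state
    old_h.append(w[i] @ xi)                # net input when the old item is the cue
    new_h.append(w[i] @ nu)                # net input when a new item is the cue
print(round(float(np.mean(old_h)), 2), a * (1 - a) ** 2 * N)  # both 6.4 (Eq. A1)
print(round(float(np.mean(new_h)), 2))                        # about 0  (Eq. A2)
```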

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that are also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)



At recognition, the state of activation of the old low-frequency item AO is reinstated as a cue to the item layer, and the state of activation of the encoding context at the experimental phase, CExp, is reinstated as a cue to the context layer. The net inputs are calculated. For example, the net input to context node 1 is

h1c = Σi w1i ξid = w11·1 + w12·0 + w13·1 + w14·0 = −.5.

The net input to the item nodes is 1, 0, 2, 1, and that to the context nodes is −.5, .5, .5, .5. It can be seen that the expected net input for randomly created new items in this network is 0 and that the expected net input for old items encoded in the active state is .5. Therefore, the activation threshold is set to the average of these values, namely, T = .25. Nodes that have a net input above the activation threshold and that are active in the cue pattern are activated at recognition. Thus, the state of activation at recognition for the item nodes is 1010, and that for the context nodes is 0101 (which is identical to the cue patterns). The percentage of active nodes is counted: Pc(AO) = 4/8 = .5. For an unbiased response (C = 0), this will yield a yes response, because this percentage is larger than the expected percentage of active nodes at recognition (a/2 = .25). The standard deviation of the net inputs for nodes active at encoding is .71 (i.e., the standard deviation of 1, 2, .5, and .5, corresponding to nodes 1, 3, 6, and 8). The recognition strength is calculated by subtracting the expected percentage of active nodes at recognition from the percentage of active nodes for the to-be-recognized item and dividing by the standard deviation of the net input for active nodes [S = (Pc − a/2)/σ′h = (.5 − .25)/.71 = .35].

The recognition of the three items BO, AN, and BN is done in the same way. For the four cues AN, BN, BO, and AO, the states of activation at recognition are (0 0 0 1; 0 0 0 0), (1 0 0 0; 0 1 0 0), (0 0 1 1; 0 0 0 1), and (1 0 1 0; 0 1 0 1), where the first four positions represent item nodes and the last four context nodes; the numbers of active nodes are 1, 2, 3, and 4; the standard deviations of the net inputs are .71, 1.08, 1.08, and .71; the recognition strengths are −.17, .00, .11, and .35; and the unbiased responses are no, no, yes, and yes, respectively (an unbiased yes response is made when S > C, where C = 0 for unbiased responses). Thus, the subject responds correctly to all items, the recognition strengths are ordered according to the mirror effect, and the variance of the net input is larger for the high- than for the low-frequency words.
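The AO calculation above can be restated as a short, self-contained sketch (Python is used for illustration only; the net inputs, cue pattern, threshold, and criterion are the values given in the text):

```python
import statistics

a, T, C = 0.5, 0.25, 0.0
h   = [1.0, 0.0, 2.0, 1.0, -0.5, 0.5, 0.5, 0.5]  # net inputs: item, then context nodes
cue = [1, 0, 1, 0, 0, 1, 0, 1]                   # AO item pattern plus study context

# A node is active at recognition if it is active in the cue and its net input
# exceeds the activation threshold T.
active = [c == 1 and hi > T for c, hi in zip(cue, h)]
p_c = sum(active) / len(h)                               # 4/8 = 0.5
sd = statistics.stdev([hi for c, hi in zip(cue, h) if c == 1])  # 0.71
S = (p_c - a / 2) / sd                                   # (0.5 - 0.25)/0.71 = 0.35
print(p_c, round(sd, 2), round(S, 2), "yes" if S > C else "no")
```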

SIMULATIONS OF THE FREQUENCY-BASED MIRROR EFFECT

In this section, the variance theory of the mirror effect is simulated in a connectionist framework that is consistent with a general connectionist theory of memory called TECO (Sikström, 1996a, 1996b). TECO has been used to account for a variety of memory phenomena involving recognition and recall, for example, successive testing of recall and recognition (Sikström, 1996b) and forgetting curves (Sikström, 1999). An extensive description of TECO is beyond the scope of this paper; interested readers are referred to previous articles describing the model for details.

Procedure
The simulation started with initializing the weights to zero. Then, 12 items were generated by randomly setting the nodes to an active state with a probability of a. A preexperimental phase then followed, to generate the frequency associated with the items. In this preexperimental phase, half of the items (i.e., 6) were encoded three times, each time in a different context. These items simulated the high-frequency words. The remaining 6 items were not encoded before the experimental phase, and they simulated the low-frequency words.

At the experimental phase, one study-encoding context was generated, using the same procedure as the generation of the contexts in the preexperimental phase. Given that in a standard recognition experiment all studied items would be encoded in the same list, in the simulations the items were thus encoded once with the same study context. Three of the high-frequency items were encoded, and three of the low-frequency items were encoded. The other three high- and three low-frequency items were not encoded during the experimental phase, and they simulated the new items. Each encoding was simulated by first activating the nodes in the item and context layers. The weights were then changed according to the learning rule (Equation 3).

At the recognition test, the patterns of activation of the 12 items and the study context were reinstated to the network. The net inputs were calculated for each node, and the recognition strength was calculated from all the nodes in the network. The somewhat unrealistic assumption that no learning or changes of weights occurred during testing was adopted. However, this is a standard assumption, often used in other simulations of recognition memory, cued recall, or categorization (e.g., Kruschke, 1992; Lewandowsky, 1991; Li & Lindenberger, 1999; Li, Lindenberger, & Frensch, 2000). One thousand five hundred simulation runs, with different random item and context patterns, were carried out. The results reported below are based on the average across these runs.
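A minimal sketch of this simulation procedure is given below. It assumes the Hebbian learning rule of Equation 3, places the activation threshold at half the expected old net input (Equation A1), and uses the normalized count of active nodes as recognition strength; the variable names and the exact way patterns are sampled are illustrative rather than taken from the original simulations:

```python
import numpy as np

rng = np.random.default_rng(2)
N, a, runs = 30, 0.2, 1500

def pattern():
    p = np.zeros(N)
    p[rng.choice(N, size=int(a * N), replace=False)] = 1.0
    return p

rates = {"AN": [], "BN": [], "BO": [], "AO": []}
for _ in range(runs):
    items = [pattern() for _ in range(12)]       # 6 high-, then 6 low-frequency
    W = np.zeros((N, N))                         # context-by-item weights
    for it in items[:6]:                         # preexperimental phase: each
        for _ in range(3):                       # high-frequency item is encoded
            W += np.outer(pattern() - a, it - a) # in three different contexts
    ctx = pattern()                              # the study context
    for it in items[0:3] + items[6:9]:           # 3 high + 3 low items studied
        W += np.outer(ctx - a, it - a)           # experimental encoding
    T = a * (1 - a) ** 2 * N / 2                 # threshold: half the expected
                                                 # old net input (Equation A1)
    def strength(it):
        h = np.concatenate([W.T @ ctx, W @ it])  # item- and context-node inputs
        cue = np.concatenate([it, ctx])
        cued = h[cue == 1]
        p_c = np.sum(cued > T) / (2 * N)
        return (p_c - a / 2) / np.std(cued, ddof=1)
    for key, it in [("BO", items[0]), ("BN", items[3]),
                    ("AO", items[6]), ("AN", items[9])]:
        rates[key].append(strength(it) > 0)      # unbiased yes response
print({k: round(float(np.mean(v)), 2) for k, v in rates.items()})
# expected ordering of yes rates: AN < BN < BO < AO (the mirror effect)
```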

Parameters
The number of high-frequency patterns was six (each encoded three times preexperimentally, and three of them encoded once experimentally to simulate the old items), and the number of low-frequency patterns was six (three of them encoded experimentally once to simulate the old items). The following parameter settings were used. The number of nodes in each layer (N) was 30 (the total number of nodes was 2N = 60), and the percentage of nodes



active at encoding (a) was .20. Initially, all the weights were set to zero. The recognition criterion (C) was set to 0.

Results
Figure 3A shows the results in terms of the density of the net inputs to an active node. The expected values of the net inputs are approximately equal for new high- and new low-frequency items [μh(AN) = .00, μh(BN) = .00] and for old low- and old high-frequency items [μh(AO) = .38, μh(BO) = .38]. The high-frequency items have a larger variance of the net inputs [σh(BN) = .49, σh(BO) = .48] than do the low-frequency items [σh(AN) = .41, σh(AO) = .40]. The variances of the old and the new distributions are approximately equal.

Figure 3B shows the density of the recognition strength. The results show the mirror effect: The hit rate is higher for low- than for high-frequency items, and the false alarm rate is lower for low- than for high-frequency items [P(AN) = .18, P(BN) = .25, P(BO) = .64, P(AO) = .71]. The standard deviation of the recognition strength is larger for old than for new words, and larger for the low-frequency words than for the high-frequency words [σs(AN) = .29, σs(BN) = .19, σs(BO) = .23, σs(AO) = .31]. These findings agree with the empirical data and the predictions of the attention-likelihood model (Glanzer et al., 1993). Thus, the simulation shows that the variance theory can account for the frequency-based mirror effect and the associated ROC curves.

EXPLICATIONS OF THE MECHANISMS

In this section, three essential mechanisms of the variance theory that account for the mirror effect are discussed in detail. The mechanisms involved are (1) the variance of the net input, (2) the expected value of the net input, and (3) the way in which the recognition strength is calculated by counting the number of nodes, so that performance is optimal.

Overview
The first mechanism is that items from the difficult class (i.e., high-frequency items in this case) have a higher variance of the net inputs than do items from the easy class (i.e., low-frequency items), but the variance is independent of whether the items are old or new. The mechanism is illustrated in Figure 4A (noncumulative). It should also be underscored that the difference in variance as a function of class membership is not a primitive of the theory that the theorist manipulates. It is the natural consequence of the rather plausible assumption that high-frequency items appear more often and are associated with a greater variety of different contexts than are low-frequency words. The second mechanism is that old items have a higher expected net input to nodes encoded in the active state than do new items (but this does not depend on the class of the items). The mechanism is illustrated in Figure 4B (cumulative). The third mechanism is the way recognition strength is measured, by counting the active nodes so that recognition performance (e.g., d′) is optimal or near-optimal. An extended analysis of optimality is presented at the end of the paper; however, the results in Figures 9A-9D can summarize the main points here. The results from this analysis imply that the decision must be based on the nodes active at encoding (see Figure 9A), the percentage of active nodes must be low (see Figure 9A), the activation threshold needs to be between the expected net inputs of the new and the old items (see Figure 9B), and the recognition strength is normalized by the standard deviation of the net inputs of the item (see Figures 9C and 9D). The density of the percentage of active nodes is illustrated in Figure 4C, and the normalized percentage of active nodes is illustrated in Figure 4D. Here it is shown how these mechanisms predict the mirror effect. Below, these three sets of mechanisms are explained in more detail.

Variance of the Net Input
The first, and perhaps the most important, mechanism is that the difficult class has a larger variability in the net

Figure 3. (A) Simulations of net inputs. The vertical axis shows the simulated density of the net inputs. (B) Simulations of the mirror effect. The vertical axis shows the simulated density of the recognition strength.


inputs than does the easy class. As will be discussed later, this property is shared with other theories, such as the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). However, a unique aspect of the variance theory is that it is a necessary outcome of simply encoding high-frequency items more times than low-frequency items. In the subjective-likelihood account, a larger variability for high- than for low-frequency words is an assumption that is controlled by a free parameter (p0), to reflect that high-frequency words have more definitions than do low-frequency words. The variance theory needs no such assumptions and no additional free parameters to control the variability. Whereas the subjective-likelihood account tries to capture word frequency with a free parameter, in the variance theory the word frequency effect is simulated via the number of contexts associated with an item.

Owing to how the variance theory implements the relations between contexts and items, the variance of the net inputs increases with the frequency with which an item is encoded in different contexts. Intuitively, this occurs because a high-frequency item activates several different contexts, causing the representations of many competing contexts to be activated simultaneously. Low-frequency items are associated with fewer contexts than are high-frequency items. This causes the representations of fewer contexts to be activated simultaneously. Thus, low-frequency items have less variability in the net inputs than do high-frequency items.

At recognition, an item produces a net input in the context layer that is a mixture of the net inputs from the study context that the network is instructed to retrieve from and from several uncorrelated preexperimental contexts that were associated with the item during the preexperimental phase. The study context that the network is instructed to retrieve from contributes to the correct active state, for example, by adding +1 to the net input to a node (which is the expected net input for a node encoded

Figure 4. (A) The probability density of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The dotted vertical line is the activation threshold. (B) The cumulative probability distributions of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the cumulative probability distributions of the net inputs. (C) The density of the percentage of nodes active at recognition in the variance theory. The horizontal axis shows the percentage of nodes active at recognition, and the vertical axis the density. (D) The density of recognition strength in the variance theory. The horizontal axis shows the recognition strength, and the vertical axis the density of the recognition strength, using standard parameter values.


in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was in an active state, or it could add −1 to the net input if the particular preexperimental context was encoded in an inactive state (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input. That is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and −1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1² for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for the items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or −1/4 [see Equation 3], each yielding an increase in variability of (1/4)² = 1/16. An expected value of aN nodes contribute, so the total increase in variability is 4 × 1/16 = 1/4.)

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in context nodes increases linearly with the number of times a given item is encoded in different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.
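The linear growth of this variance can be illustrated with a small Monte Carlo check (an illustrative sketch; the parameter values and names are assumptions, and the analytic comparison uses the reconstructed form of Equation A4):

```python
import numpy as np

rng = np.random.default_rng(3)
N, a, runs = 50, 0.2, 2000

def pattern():
    p = np.zeros(N)
    p[rng.choice(N, size=int(a * N), replace=False)] = 1.0
    return p

for f in (1, 2, 4, 8):
    h = []
    for _ in range(runs):
        item = pattern()
        W = np.zeros((N, N))
        for _ in range(f):                    # encode the item in f contexts
            W += np.outer(pattern() - a, item - a)
        h.append((W @ item)[0])               # net input to one context node
    analytic = f * a**3 * (1 - a)**3 * N**2   # Equation A4: linear in f
    print(f, round(float(np.var(h)), 1), analytic)
```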

In the same way as the variability of the context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with the number of times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input
The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase are equal for the two classes. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given exactly the same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are interested only in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables, such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to a smaller increase in the net input, than does a longer period of study time, for example, 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encodings in other learning contexts will not affect the expected net input, but they do affect the variability of the net input, as was demonstrated above. The item-study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, as a difficult-class item) is equal to the expected net input for ARCTIC (a low-frequency word, as an easy-class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and testing conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult-class item is equal to the expected net input for a new easy-class item.

The probability density functions of the net inputs for nodes in the active state are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted. New nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of the net inputs for the easy-class items [σh(A)] is smaller than the standard deviation of the net inputs for the difficult-class items [σh(B)]. The second mechanism is shown in the figures in that the expected net input of an easy-class new item [μh(AN)] is equal to the expected net input of a difficult-class new item [μh(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: The expected net input for the old easy-class items [μh(AO)] is equal to the expected net input for the old difficult-class items [μh(BO)].

Recognition Strength
The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that performance is optimal or near-optimal. If the net input is above the activation


threshold (T) and the node was active at encoding, the node is activated at recognition. Otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities and distributions predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of parameter values for the activation threshold. Thus, P(AN) < P(BN) < P(BO) < P(AO) for μh(AN) = μh(BN) < T < μh(AO) = μh(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy- and the difficult-class items; the expected strengths of the net inputs are equal. The variability is lower for easy-class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy than for the difficult class when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy-class items is higher than the hit rate for difficult-class items when the activation threshold is less than the expected value of the old items.

The activation threshold is set between the expected values of the new and the old net inputs, so that performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for the difficult- and easy-class items, respectively:

T = (1/4)[μh(AN) + μh(BN) + μh(BO) + μh(AO)] = (1/4)[μh(BO) + μh(AO)] = (1/2)μh(O).

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions to optimize performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with a shift in the activation threshold, necessary for keeping performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test in order to maximize performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [μh(AN) = μh(BN) < μh(BO) = μh(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σp(O) > σp(N)], whereas the standard deviations of the net inputs are equal for old and new items [σh(N) = σh(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions, because fewer nodes are active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero. This standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes (σp²) is approximately proportional to the percentage of active nodes [σp² = Pc(1 − Pc)/N ≈ Pc/N ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σp) is smaller for new than for old items [σp(AN) < σp(BN) < σp(BO) < σp(AO)], whereas the standard deviation of the net input is not [σh(AN) = σh(AO) < σh(BN) = σh(BO)]. The essential mechanism that produces these changes in the means and standard deviations is the nonlinearity introduced by counting the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.
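The binomial relation above is easy to evaluate numerically (a small illustrative computation; the total node count is set to 100, as in Figure 5, and the exact normalization is an assumption):

```python
from math import sqrt

total_nodes = 100
for p_c in (0.02, 0.05, 0.10, 0.25, 0.50):
    sigma_p = sqrt(p_c * (1 - p_c) / total_nodes)  # binomial SD of a proportion
    print(p_c, round(sigma_p, 3))
# sigma_p grows with p_c (up to one half), so old items, which have more
# active nodes, produce more variable proportions than new items do.
```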

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because this improves performance.

To optimize performance, subjects base their recognition decision on the number of active nodes, normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments, when the responses are biased, but plays no role for unbiased responses.



To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σ′h).

Note that the standard deviation of the net inputs of the to-be-recognized item (σ′h) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviations of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).
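The role of the normalization can be seen in a small sketch (illustrative values; the function name and parameters are assumptions): two items with the same proportion of active nodes receive the same unbiased response, but a biased criterion can separate them according to their net-input variability.

```python
def decide(p_c, sigma_item, a=0.1, C=0.0):
    """S = (p_c - a/2) / sigma_item; respond yes iff S >= C."""
    S = (p_c - a / 2) / sigma_item
    return round(S, 4), "yes" if S >= C else "no"

# Same proportion of active nodes, different net-input variability
# (an easy vs. a difficult item, in the spirit of the text).
print(decide(0.07, sigma_item=1.25))            # unbiased: yes...
print(decide(0.07, sigma_item=1.56))            # ...and yes (sigma irrelevant)
print(decide(0.07, sigma_item=1.25, C=0.014))   # biased: yes
print(decide(0.07, sigma_item=1.56, C=0.014))   # biased: no (sigma matters)
```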

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, with the criterion C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) shifts the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) makes the decision more certain.

Figure 4D shows the density distribution of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult classes (in Figure 4C) change when they are normalized by the variance of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult-class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high. The easy-class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of the net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here, it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted, to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.


that is presented at the end of the paper, together with an analysis of optimal performance.

The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a), plus one parameter for each class of words [the standard deviation of the net input, σh(·)]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs for the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs for the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = −.012, μs(BN) = −.008, μs(BO) = .008, and μs(AO) = .012. Figure 4D plots the four recognition-strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists
The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold changes. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep, or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1, as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize performance. The distributions in Figure 6A are closer together than the distributions in Figure 4D. Thus, decreasing the net inputs, for example by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite of concentering is dispersion, which means that increasing performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite direction, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain near-optimal performance. The distributions in Figure 6B are farther apart than the distributions in Figure 4D.
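The threshold argument behind concentering and dispersion can be sketched numerically (Python; the normal-tail approximation and the illustrative σ value are assumptions in the spirit of the parameter values used above):

```python
from math import erf, sqrt

def phi(z):  # standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma = 1.25                  # net-input SD (illustrative value from the text)
for mu_o in (0.5, 1.0, 2.0):  # expected old net input (e.g., more study time)
    T = mu_o / 2              # activation threshold at the midpoint
    p_new = 1 - phi(T / sigma)            # P(net input > T), new item
    p_old = 1 - phi((T - mu_o) / sigma)   # P(net input > T), old item
    print(mu_o, round(p_new, 2), round(p_old, 2))
# new-item activation falls while old-item activation rises: dispersion;
# read in the reverse direction, the distributions converge: concentering.
```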

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in the other two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list.


In this case, the activation threshold is also constant during the recognition test, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed at 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or for any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over

Table 1
General Table of Results From the Experiment

Condition   AN    BN    BO    AO    σs(BN)/σs(AN)   σs(BO)/σs(AO)
Control     .13   .17   .69   .82   0.60            0.86
Frequency   .20   .28   .80   .68   1.01            0.66
Time        .10   .15   .78   .76   0.89            0.81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low- (AN) and high- (BN) frequency words and the hit rates for high- (BO) and low- (AO) frequency words, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength, and the vertical axes the density.


the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56, for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92, for μh(O) = 2].

List-Length Effect
Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy- and difficult-class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that the context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves
The percentage of nodes active at recognition is smaller for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than one half. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the difficult class than for the easy class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths themselves, become equal. Figure 6D plots the predictions of the theory when all the standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = .015, σs(BN) = .012, σs(BO) = .015, and σs(AO) = .020. The ratios of these standard deviations must follow Equation 2. This is also the case: σs(BN)/σs(AO) = .61 < .74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = .78 < .94 = σs(AN)/σs(BO).

Changing the Recognition Criterion
The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = 0.016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy-class words than for difficult-class words. This prediction is supported by empirical data: Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN); P(BO, AN) < P(AO, AN); P(BN, AN) > .50; and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.
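These yes rates and forced-choice probabilities follow directly from normal distributions with the stated recognition-strength means and standard deviations, as the following check shows (Python; small discrepancies in the second decimal reflect rounding of the parameter values given in the text):

```python
from math import erf, sqrt

def phi(z):  # standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

mu = {"AN": -0.012, "BN": -0.008, "BO": 0.008, "AO": 0.012}
sd = {"AN": 0.015, "BN": 0.012, "BO": 0.015, "AO": 0.020}

for C in (0.0, 0.016):  # unbiased and conservative recognition criteria
    print(C, {k: round(phi((mu[k] - C) / sd[k]), 2) for k in mu})
# C = 0     -> about .21, .25, .70, .73 (the mirror effect)
# C = 0.016 -> about .03, .02, .30, .42 (no mirror effect)

def forced(x, y):  # two-alternative forced choice, independent strengths
    return phi((mu[x] - mu[y]) / sqrt(sd[x] ** 2 + sd[y] ** 2))

print(round(forced("BO", "BN"), 2), round(forced("AO", "AN"), 2),
      round(forced("BN", "AN"), 2), round(forced("AO", "BO"), 2))
```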

Within-List Strength Manipulation
So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations of the high-frequency words did not affect the false alarm rates, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer


presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage to old high-frequency items when attention is focused on these items) must be larger than the effect of the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage to old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of the net inputs (Figure 7A) that the density of the recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2, rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75, rather than 0.50.

(The proportion of nodes active at recognition follows from integrating the net-input density above the activation threshold:

Pc = a ∫T∞ [1/(√(2π)σh)] e^[−(h − μh)²/(2σh²)] dh.)

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the old high-frequency distribution (BO) is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 6B: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74].

Furthermore, in the variance theory, the ratio of the recognition-strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs do not depend on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = .013, σs(BN) = .011, σs(BO) = .017, and σs(AO) = .019. These results are similar to the results when the same amount of attention is paid to the two classes, in Figure 4D: σs(AN) = .015, σs(BN) = .012, σs(BO) = .015, and σs(AO) = .020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect in which the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit-rate distribution for the high-frequency words would be larger than that for the low-frequency words. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions, one in which the number of presentations was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years (range, 18-29 years).

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures; there were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether or not the word had been presented in the list. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of the recognition test was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition; 9 subjects were given the control condition followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The hit rate



was higher for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate for the low-frequency words was higher. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words; thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words, and the false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1-5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly for the slope of the false alarms [σs(BN)/σs(AN)], are shown in the last two rows of Table 1.
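To make this computation concrete, here is a minimal sketch in Python (not the original analysis code; the rating counts are invented for illustration). Applying the same routine to the z-transformed hit rates of the two frequency classes, instead of hits and false alarms, yields the slope ratios reported in Table 1.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical rating counts for 100 old and 100 new test items
# (confidence ratings 1..5); these numbers are invented.
old_counts = np.array([5, 10, 15, 30, 40])
new_counts = np.array([40, 30, 15, 10, 5])

def z_rates(counts):
    """z-transformed cumulative response rates at cutoffs 2..5
    (a response counts as yes at cutoff k if the rating is >= k;
    cutoff 1 always gives a rate of 1.0 and is dropped)."""
    cum = np.cumsum(counts[::-1])[::-1][1:] / counts.sum()
    return norm.ppf(cum)

z_hit, z_fa = z_rates(old_counts), z_rates(new_counts)
slope = np.polyfit(z_fa, z_hit, 1)[0]   # slope of the z-ROC
print(round(slope, 3))
```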

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear: It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory, by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

$$\sum_{i=1}^{8} \frac{\left(\mathrm{Observed}_i - \mathrm{Predicted}_i\right)^2}{\sigma_i^2}.$$

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were N (= 1,000) and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of


nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σh(A)], and the standard deviation of the net inputs for the difficult-class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.

The results show that the variance theory accounts for .98 (r²) of the variance for the probabilities; the attention-likelihood theory fits equally well. The variance theory accounts for .93 of the variance for the slope, whereas the attention-likelihood theory accounts for .84 of the variance of the slope. Thus, when three parameters were fitted, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slope, as compared with the attention-likelihood theory.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as constant, given that the same material is used in the different conditions. Therefore, the variance theory was also fitted with a single variable, namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed at .10, and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), at 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


probabilities. Figure 8B shows the corresponding results for the slope. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes with a single parameter equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities with one parameter is slightly lower than the fit of the attention-likelihood theory with three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.
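As an illustration of this fitting procedure, the sketch below implements the analytic predictions (Equations 6 and 7, presented in the next section) under the Figure 4D conventions [2N = 100, μh(N) = 0, μh(O) = 1, C = 0, threshold midway between the expected new and old net inputs] and minimizes the summed squared error. The observed vector is invented, and only two slope ratios stand in for the four fitted by Glanzer et al. (1993).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

N2, MU_NEW, MU_OLD, C = 100, 0.0, 1.0, 0.0   # fixed, as in Figure 4D
T = (MU_NEW + MU_OLD) / 2                    # activation threshold

def strength(mu_h, sigma_h, a):
    """Equations 6 and 7: mean and SD of the normalized strength."""
    pc = a * (1 - norm.cdf((T - mu_h) / sigma_h))
    return (pc - a / 2) / sigma_h, np.sqrt(pc * (1 - pc) / N2) / sigma_h

def predictions(params):
    a, sh_A, sh_B = params
    p_yes, sd = [], {}
    for cls, sh in (("A", sh_A), ("B", sh_B)):
        for cond, mu in (("N", MU_NEW), ("O", MU_OLD)):
            m, s = strength(mu, sh, a)
            p_yes.append(1 - norm.cdf((C - m) / s))
            sd[cls + cond] = s
    # Two slope ratios stand in for the four fitted by Glanzer et al.
    return np.array(p_yes + [sd["BO"] / sd["AO"], sd["BN"] / sd["AN"]])

observed = np.array([.20, .74, .25, .70, .78, .84])       # invented data
loss = lambda p: np.sum((observed - predictions(p)) ** 2)  # sigma_i = 1

fit = minimize(loss, x0=[0.10, 1.0, 1.25], method="Nelder-Mead")
print(fit.x)   # fitted a, sigma_h(A), sigma_h(B)
```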

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed as the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]; thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviations of the net inputs (σh). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function of a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

$$P_c = \frac{a}{\sqrt{2\pi}\,\sigma_h}\int_{T}^{\infty} e^{-(h-\mu_h)^2/2\sigma_h^2}\,dh. \qquad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), gives the expected recognition strength (μs):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because this simplifies the analytic solution; the variance theory, and the simulation, use the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of the feature strength for a single item may fluctuate, on an item-to-item basis, around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and the item layers. The distribution of Pc is binomial, but it can, for a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/2N]^(1/2). The final result is scaled by the normalization factor 1/σh:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c\left(1-P_c\right)}{2N}\right]^{1/2}. \qquad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

$$P(Y) = \frac{1}{\sqrt{2\pi}\,\sigma_s}\int_{C}^{\infty} e^{-(s-\mu_s)^2/2\sigma_s^2}\,ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

$$P(\mathrm{A,B}) = \frac{1}{\sqrt{2\pi\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}}\int_{C}^{\infty} e^{-\left(s-\left[\mu_s(A)-\mu_s(B)\right]\right)^2/2\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}\,ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
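For readers without the Excel sheet, a minimal Python sketch of the same closed forms follows; the parameter values in the usage example are illustrative, not values from the paper.

```python
from math import sqrt
from scipy.stats import norm

def p_c(a, mu_h, sigma_h, T):
    """Equation 6: probability that a node is active at recognition."""
    return a * (1 - norm.cdf(T, loc=mu_h, scale=sigma_h))

def strength_stats(a, mu_h, sigma_h, T, N):
    """Expected recognition strength and its SD (Equation 7)."""
    pc = p_c(a, mu_h, sigma_h, T)
    return (pc - a / 2) / sigma_h, sqrt(pc * (1 - pc) / (2 * N)) / sigma_h

def p_yes(mu_s, sigma_s, C=0.0):
    """Probability of a yes response: strength density above C."""
    return 1 - norm.cdf(C, loc=mu_s, scale=sigma_s)

def p_forced_choice(mu_A, sd_A, mu_B, sd_B, C=0.0):
    """Two-choice forced recognition: the A-minus-B strength
    difference is normal; integrate its density above C."""
    return 1 - norm.cdf(C, loc=mu_A - mu_B, scale=sqrt(sd_A**2 + sd_B**2))

# Illustrative usage: old vs. new items, a = .10, T midway between the
# expected new (0) and old (1) net inputs, sigma_h = 1, 2N = 100 nodes.
mu_o, sd_o = strength_stats(a=0.10, mu_h=1.0, sigma_h=1.0, T=0.5, N=50)
mu_n, sd_n = strength_stats(a=0.10, mu_h=0.0, sigma_h=1.0, T=0.5, N=50)
print(p_yes(mu_o, sd_o), p_yes(mu_n, sd_n))      # hit and false alarm rates
print(p_forced_choice(mu_o, sd_o, mu_n, sd_n))   # old chosen over new
```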


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is, in the model, calculated with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc with one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

$$\frac{\sigma_s^2(N)}{\sigma_s^2(O)} \approx \frac{P_c(N)}{P_c(O)}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.8²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and the easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions; the exact solution predicts a slightly smaller ratio in the old than in the new distributions:

$$\frac{\sigma_s(BO)}{\sigma_s(AO)} \le \frac{\sigma_s(BN)}{\sigma_s(AN)} \approx \frac{\sigma_h(A)}{\sigma_h(B)}.$$

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals, and it is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of active nodes, setting the activation threshold between the old and the new net inputs, measuring performance by the nodes that are active in the encoding pattern, and normalizing the recognition strength. Optimal performance in the network requires the implementation suggested by the variance theory; if the implementation is changed significantly, performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytic equations above, we find the following expression:

$$d' = \frac{\mu_s(O)-\mu_s(N)}{\sigma_s(NO)} = \frac{P_c(O)-P_c(N)}{\left[P_c(NO)\left(1-P_c(NO)\right)/2N\right]^{1/2}}. \qquad (8)$$

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc(·) was calculated with Equation 6. The expected values of the net inputs and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).
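A sketch of Equation 8 in Python follows; it assumes that Pc(N) and Pc(O) have already been computed from Equation 6, and the example values are illustrative. Note that σh cancels, because both the numerator and the denominator of Equation 8 are normalized by it.

```python
from math import sqrt

def d_prime(pc_new, pc_old, N):
    """Equation 8: discriminability from the activation probabilities;
    the pooled value Pc(NO) = [Pc(N) + Pc(O)] / 2 is used in the SD."""
    pc_no = (pc_new + pc_old) / 2
    return (pc_old - pc_new) / sqrt(pc_no * (1 - pc_no) / (2 * N))

print(d_prime(pc_new=0.031, pc_old=0.069, N=50))  # illustrative values
```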

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).


The results show that d′ is optimal for a = .052; d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases: For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important, and errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low: The number of possible representations increases with a. If there is only one node active in all the representations, there are N possible representations; if there are two nodes active in all the representations, there are approximately N² possible representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. Surprisingly, however, the placement of this criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory: Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed; therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ at different confidence levels.
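The threshold analysis can be reproduced numerically along the following lines. The sketch assumes, for illustration only, normal net inputs with μ(N) = 0, μ(O) = 1.42, and σh = 1; in the paper, these quantities are instead derived from the Appendix equations with N = 30 and p = 100.

```python
import numpy as np
from scipy.stats import norm

a, N = 0.052, 30   # optimal activity level from the text; N = 30 nodes

def d_prime_at(T):
    """d' (Equation 8) as a function of the activation threshold T."""
    pc_new = a * (1 - norm.cdf(T, loc=0.0, scale=1.0))
    pc_old = a * (1 - norm.cdf(T, loc=1.42, scale=1.0))
    pc_no = (pc_new + pc_old) / 2
    return (pc_old - pc_new) / np.sqrt(pc_no * (1 - pc_no) / (2 * N))

grid = np.linspace(0.0, 1.42, 143)
best_T = grid[np.argmax([d_prime_at(T) for T in grid])]
print(best_T)  # the maximizing threshold lies between the two means;
               # its exact location depends on sigma_h and a
```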

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (although the signs of the weight changes differ). Thus, inactive and active nodes carry the same amount of information, and performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes [proportional to (1 − a)] than for inactive nodes (proportional to a). For a sufficiently small a, the noise in the inactive nodes will overwhelm the information in their weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated when the information in all the nodes is used. This is done by counting the number of nodes that are retrieved in the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated from only the active nodes (shown by the solid line). The results show that the highest d′ is found when the decision is based only on active nodes and a is low; including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit and false alarm rates. Therefore, it is necessary to use another criterion for performance


with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance is found when the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true: If the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers; if the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these densities is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
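Numerically, the criterion at which the likelihood ratio equals one can be located as the crossing point of the two densities. The sketch below uses invented normal parameters and a standard root finder.

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Illustrative strength distributions (means and SDs are made up).
mu_n, sd_n, mu_o, sd_o = 0.0, 0.012, 0.019, 0.020

def log_likelihood_ratio(x):
    """log(f[S(O)] / f[S(N)]) at a candidate criterion x."""
    return norm.logpdf(x, mu_o, sd_o) - norm.logpdf(x, mu_n, sd_n)

# The densities cross between the two means; the root of the
# log-likelihood ratio there is the optimal recognition criterion.
c_opt = brentq(log_likelihood_ratio, mu_n, mu_o)
print(c_opt)
```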

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ when the decision is based only on nodes that were active during encoding; the dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A; optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretically optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the sum of the hit rates [P(AO) + P(BO)] is maximized? The solution to this problem is surprisingly simple: Optimal performance occurs when the likelihoods of the two classes are equal,

L(A) = f [S(AO)]/f [S(AN)] = L(B) = f [S(BO)]/f [S(BN)].

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the two false alarm rates must cancel, Δf [S(AN)] + Δf [S(BN)] = 0. The likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, ΔL(A) − ΔL(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), or Δf [S(AO)] + Δf [S(BO)] < 0. This shows that a change in the placement of the criteria away from L(A) = L(B) results in an overall decrease in hit rate {Δf [S(AO)] + Δf [S(BO)] < 0}, and performance suffers.

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion behaves when it is moved over the two classes; thus, the second criterion can be indirectly inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently for the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A as the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., counting the number of nodes above the recognition criterion) are plotted as

the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes, minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B; therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A; therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased; at this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as near optimal for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs of Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A equals the number of items in Class B. The total hit rate is equal to the average hit rate of Classes A and B, and the total false alarm rate is equal to the average false alarm rate of Classes A and B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than, or equal to, performance before normalization. For an unbiased recognition criterion, or a slightly larger one, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance; the largest advantage is approximately .05, and it occurs when the false alarm


rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance; the largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data on Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce the appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the activation threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the activation threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength were based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength would have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/2N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data; if the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations: The standard deviations were larger for the low-frequency


distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to the high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing the study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode; thus, there would be a confounding between the item's frequency and its recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism applies to the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. These wrongly activated nodes are more likely to represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance for nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength; this issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from the concrete class are more easily discriminated than words from the abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts


associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios; thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli: During the calculation of recognition strength, the standard deviation of the net input of the item (σh′) is used, so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution was presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses a variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and the z-ROC curves; instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kisok, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

aσp(N)/[σp(N) + σp(O)].

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the z-ROC curve is 0.8 [i.e., a(0.8)/(0.8 + 1) ≈ 0.44a]. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξj = ξj′ = 1:

μh(O) = Σj Δwij ξj = Σj (1 − a)(ξj − a)ξj = a(1 − a)²N.    (A1)

The expected value of the net inputs for the new items is zero:

μh(N) = 0.    (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or if a new item is presented, the variance of the net input is

σ²h(N) = Σp Σj (Δwij ξj)² = [(1 − a)²(1 − a)²(a)(a) + (−a)²(1 − a)²(1 − a)(a)]PN = a²(1 − a)³PN.    (A3)

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item ( f ). All the aN weights supporting a context node contribute to the net input in the same direction, so that

σ²h( f ) = f [aN(1 − a)]² a(1 − a) = f a³(1 − a)³N².    (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is then

σ²h(L) = L a³(1 − a)³N².

The expected variance over all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input over all nodes, given an item with a frequency of f and a list length of L, in a network where P patterns have been encoded, is

σ²h(O) = [( f + L)/2] a³(1 − a)³N² + a²(1 − a)³PN.    (A5)
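As a quick consistency check, the reconstructed forms of Equations A1, A3, and A4 above can be tested against the numerical examples used in the text (N = 8, a = .5, where encoding adds +1 to the net input, each preexperimental context adds variance 1, and each uncorrelated stored pattern adds variance 1/4). A minimal sketch in Python, assuming these reconstructed forms:

# Check the reconstructed Appendix equations against the worked
# examples in the text (N = 8, a = .5). The equation forms are
# reconstructions, so this verifies internal consistency only.
N, a = 8, 0.5

mu_old = a * (1 - a) ** 2 * N                      # Equation A1
assert mu_old == 1.0                               # text: encoding adds +1

var_new_per_pattern = a ** 2 * (1 - a) ** 3 * N    # Equation A3 with P = 1
assert var_new_per_pattern == 0.25                 # text: 4 nodes x (1/4)^2 = 1/4

var_per_context = a ** 3 * (1 - a) ** 3 * N ** 2   # Equation A4 with f = 1
assert var_per_context == 1.0                      # text: coherent +/-1 contribution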

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on the nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that also were active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that also are inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


The percentage of nodes active at encoding (a) was .20. Initially, all the weights were set to zero. The recognition criterion (C) was set to 0.

Results
Figure 3A shows the results in terms of the density of the net inputs to an active node. The expected values of the net inputs are approximately equal for new low- and new high-frequency items [μh(AN) = 0.0, μh(BN) = 0.0] and for old low- and old high-frequency items [μh(AO) = 3.8, μh(BO) = 3.8]. The high-frequency items have a larger variance of the net inputs [σh(BN) = 4.9, σh(BO) = 4.8] than do the low-frequency items [σh(AN) = 4.1, σh(AO) = 4.0]. The variances of the old and the new distributions are approximately equal.

Figure 3B shows the density of the recognition strength. The result shows the mirror effect, where the hit rate probability is larger for low- than for high-frequency items and the false alarm rate is lower for low- than for high-frequency items [P(AN) = .18, P(BN) = .25, P(BO) = .64, P(AO) = .71]. The standard deviation of the recognition strength is larger for old than for new words, and larger for the low-frequency words than for the high-frequency words [σs(AN) = .29, σs(BN) = .19, σs(BO) = .23, σs(AO) = .31]. These findings agree with the empirical data and the predictions of the attention-likelihood model (Glanzer et al., 1993). Thus, the simulation shows that the variance theory can account for the frequency-based mirror effect and the associated ROC curves.
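The simulation logic can be sketched compactly. The following Python fragment is an illustrative Monte-Carlo version, not the paper's exact Equations 5-8: node-level net inputs are drawn directly from normal distributions with the standard parameter values used later in the Predictions section (new mean 0, old mean 1, σ = 1.25 for the easy class A and 1.56 for the difficult class B), so the absolute rates differ from the simulation reported above, but the mirror ordering emerges the same way:

import numpy as np

rng = np.random.default_rng(1)
aN, T, n_items = 20, 0.5, 20_000       # nodes active at encoding, threshold

def p_yes(mu_h, sigma_h):
    # net inputs to the nodes that were active at encoding
    h = rng.normal(mu_h, sigma_h, (n_items, aN))
    p_c = (h > T).mean(axis=1)         # fraction still active at recognition
    s = (p_c - 0.5) / sigma_h          # strength; 0.5 is the expected fraction
    return (s >= 0.0).mean()           # unbiased recognition criterion, C = 0

rates = {"AN": p_yes(0.0, 1.25), "BN": p_yes(0.0, 1.56),
         "BO": p_yes(1.0, 1.56), "AO": p_yes(1.0, 1.25)}
print(rates)                           # ordering: AN < BN < BO < AO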

EXPLICATIONS OF THE MECHANISMS

In this section, three essential mechanisms of the variance theory that account for the mirror effect are discussed in detail. The mechanisms involved are (1) the variance of the net input, (2) the expected value of the net input, and (3) the way in which the recognition strength is calculated, by counting the number of nodes so that the performance is optimal.

Overview
The first mechanism is that items from the difficult class (i.e., high-frequency items in this case) have a higher variance of the net inputs than do items from the easy class (i.e., low-frequency items), but the variance is independent of whether the items are old or new. The mechanism is illustrated in Figure 4A (noncumulative). It should also be underscored that the difference in variance as a function of class membership is not a primitive of the theory that the theorist manipulates; it is the natural consequence of the rather plausible assumption that high-frequency items appear more often and are associated with a greater variety of different contexts than are low-frequency words. The second mechanism is that old items have a higher expected net input to nodes encoded in the active state than do new items (but the expected net input does not depend on the class of the items). The mechanism is illustrated in Figure 4B (cumulative). The third mechanism is the way recognition strength is measured, by counting the active nodes so that recognition performance (e.g., d′) is optimal or near-optimal. An extended analysis of optimality is presented at the end of the paper; however, the results in Figures 9A–9D can summarize the main points here. The results from this analysis imply that the decision must be based on the nodes active at encoding (see Figure 9A), the percentage of active nodes must be low (see Figure 9A), the activation threshold needs to be between the expected net inputs of the new and old items (see Figure 9B), and the recognition strength is normalized by the standard deviation of the net inputs of the item (see Figures 9C and 9D). The density of the percentage of active nodes is illustrated in Figure 4C, and the normalized percentage of active nodes is illustrated in Figure 4D. Here it is shown how these mechanisms predict the mirror effect. Below, these three sets of mechanisms are explained in more detail.

Figure 3. (A) Simulations of net inputs. The vertical axis shows the simulated density of the net inputs. (B) Simulations of the mirror effect. The vertical axis shows the simulated density of the recognition strength.

Variance of the Net Input
The first, and perhaps the most important, mechanism is that the difficult class has a larger variability in the net inputs than does the easy class. As will be discussed later, this is shared with other theories, such as the subjective-likelihood account (McClelland & Chappell, 1998) and REM (Shiffrin & Steyvers, 1997). However, a unique aspect of the variance theory is that it is a necessary outcome of simply encoding high-frequency items more times than low-frequency items. In the subjective-likelihood account, a larger variability for high- than for low-frequency words is an assumption that is controlled by a free parameter (p0), to reflect that high-frequency words have more definitions than do low-frequency words. The variance theory needs no such assumptions and no additional free parameters to control the variability. Whereas the subjective-likelihood account tries to capture word frequency with a free parameter, in the variance theory the word-frequency effect is simulated via the number of contexts associated with an item.

Owing to how the variance theory implements the relations between contexts and items, the variance of the net inputs increases with the frequency with which an item is encoded in different contexts. Intuitively, this occurs because a high-frequency item activates several different contexts, causing the representations of many competing contexts to be activated simultaneously. Low-frequency items are associated with fewer contexts than are high-frequency items. This causes the representations of fewer contexts to be activated simultaneously. Thus, low-frequency items have less variability in the net inputs than do high-frequency items.

Figure 4. (A) The probability density of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The dotted vertical line is the activation threshold. (B) The cumulative probability distributions of the net inputs in the variance theory. The horizontal axis shows the net inputs, and the vertical axis the cumulative probability distributions of the net inputs. (C) The density of the percentage of nodes active at recognition in the variance theory. The horizontal axis shows the percentage of nodes active at recognition, and the vertical axis the density. (D) The density of recognition strength in the variance theory. The horizontal axis shows the recognition strength, and the vertical axis the density of the recognition strength, using standard parameter values.

At recognition, an item produces a net input in the context layer that is a mixture of the net inputs from the study context that the network is instructed to retrieve from and from several uncorrelated preexperimental contexts that were associated with the item during the preexperimental phase. The study context that the network is instructed to retrieve from contributes to the correct active state, for example, by adding +1 to the net input to a node (which is the expected net input for a node encoded in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was in an active state, or it could add −1 to the net input if the particular preexperimental context was encoded in an inactive state (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input. That is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and −1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1² for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for the items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or −1/4 [see Equation 3], each yielding an increase in variability of 1/4² = 1/16. An expected value of aN nodes contribute, so the total increase in variability is 4 × 1/16 = 1/4.)

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in the context nodes increases linearly with the number of times a given item is encoded in different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.

In the same way as the variability of the context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with the number of times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input
The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase are equal for the two classes. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given the exact same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are interested only in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables, such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to smaller increases in the net input, than does a longer period of study time, for example, 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encoding in other learning contexts will not affect the expected net input, but it does affect the variability of the net input, as was demonstrated above. The item–study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, as a difficult-class item) is equal to the expected net input for ARCTIC (a low-frequency word, as an easy-class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and testing experimental conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult-class item is equal to the expected net input for a new easy-class item.

The probability density functions of the net inputs for nodes in the active state are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted; new nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of the net inputs for the difficult-class items [σh(B)] is larger than the standard deviation of the net inputs for the easy-class items [σh(A)]. The second mechanism is shown in the figures in that the expected net input of an easy-class new item [μh(AN)] is equal to the expected net input of a difficult-class new item [μh(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: The expected net input for the old easy-class items [μh(AO)] is equal to the expected net input for the old difficult-class items [μh(BO)].

Recognition Strength
The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that the performance is optimal or near-optimal. If the net input is above the activation threshold (T) and the node was active at encoding, the node is activated at recognition; otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities or distributions predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of values for the activation threshold: P(AN) < P(BN) < P(BO) < P(AO) for μh(AN) = μh(BN) < T < μh(AO) = μh(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy- and the difficult-class items, whereas the expected strengths of the net inputs are equal. The variability is lower for easy-class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy than for the difficult class when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy-class items is higher than the hit rate for difficult-class items when the activation threshold is less than the expected value of the old items.

The activation threshold is set to be between the expected values of the new and the old net inputs, so that the performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for the difficult- and easy-class items:

T = (1/4)[μh(AN) + μh(BN) + μh(BO) + μh(AO)] = (1/4)[μh(BO) + μh(AO)] = (1/2)μh(O).

(The second equality holds because the expected net inputs of the new distributions are zero.)

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions to optimize the performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with the shift in the activation threshold necessary for keeping the performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test in order to maximize the performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [μh(AN) = μh(BN) < μh(BO) = μh(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σp(O) > σp(N)], whereas the standard deviations of the net inputs are equal for old and new items [σh(N) = σh(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions, because fewer nodes are active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero, and this standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes (σ²p) is approximately proportional to the percentage of active nodes [σ²p = Pc(1 − Pc)/N ≈ Pc/N ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σp) is smaller for new than for old items [σp(AN) < σp(BN) < σp(BO) < σp(AO)], whereas the standard deviation of the net input is not [σh(AN) = σh(AO) < σh(BN) = σh(BO)]. The essential mechanism that produces these changes in the means and standard deviations is the nonlinearity introduced by counting the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.
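This binomial relation can be checked directly. A short sketch, assuming the activation probabilities and net-input standard deviations reported in the Predictions section (Pc = .43a to .57a with a = .1, 2N = 100 nodes, σh = 1.25 for class A and 1.56 for class B); the normalized values land close to the four strength standard deviations reported for Figure 4D (.015, .012, .015, and .020):

import numpy as np

a, nodes = 0.1, 100                     # activity level, 2N = 100 nodes
for label, frac, sigma_h in [("AN", .43, 1.25), ("BN", .45, 1.56),
                             ("BO", .55, 1.56), ("AO", .57, 1.25)]:
    p_c = frac * a                      # probability that a node is active
    sigma_p = np.sqrt(p_c * (1 - p_c) / nodes)   # binomial sd of the proportion
    print(label, round(sigma_p, 4), round(sigma_p / sigma_h, 4))
# sigma_p: AN .0203 < BN .0207 < BO .0228 < AO .0232 (old larger than new)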

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because this improves the performance.

To optimize the performance, subjects base their recognition decision on the number of active nodes normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments when the responses are biased, but it plays no role for unbiased responses.



To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σh′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σh′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviations of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, where C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net inputs for an item (correlated with difficulty) influences the decision toward uncertainty, whereas a small standard deviation of the net inputs for an item (correlated with less difficulty) influences the decision to be more certain.
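The decision rule described here can be summarized in a few lines. This is a sketch whose form (Equation 5) is assumed from the text; it shows that the item's net-input standard deviation is irrelevant under an unbiased criterion but matters as soon as the criterion is biased:

def says_yes(p_c, sigma_h_item, a=0.1, C=0.0):
    s = (p_c - a / 2) / sigma_h_item    # recognition strength (Equation 5's form)
    return s >= C

# Unbiased criterion: only the sign of p_c - a/2 matters.
print(says_yes(0.06, 1.25), says_yes(0.06, 1.56))                    # True True
# Biased criterion: identical evidence is judged more conservatively
# when the item's net inputs are more variable (difficult class).
print(says_yes(0.06, 1.25, C=0.007), says_yes(0.06, 1.56, C=0.007))  # True False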

Figure 4D shows the density distributions of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult class (in Figure 4C) change when they are normalized by the variance of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult-class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high. The easy-class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution that is presented at the end of the paper, together with an analysis of optimal performance.

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.

The Material-Based Mirror Effect for High- and Low-Frequency Items
The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a) plus one parameter for each class of words [the standard deviation of the net input, σh( )]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs for the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs for the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = −0.012, μs(BN) = −0.008, μs(BO) = 0.008, and μs(AO) = 0.012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists
The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize the performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep, or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize the performance. The distributions in Figure 6A are closer than the distributions in Figure 4D. Thus, decreasing the net inputs, for example, by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite direction, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain a near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.
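Both regularities follow from the re-optimized threshold. In the sketch below (an assumption of this illustration: node-level activation probabilities stand in for hit and false alarm rates, with σh = 1.25), raising the old net input and setting T = μh(O)/2 pushes the two rates apart, and lowering it pulls them together:

from scipy.stats import norm

sigma = 1.25
for mu_old in (0.5, 1.0, 2.0):
    T = mu_old / 2                       # threshold midway between new and old means
    fa = 1 - norm.cdf((T - 0.0) / sigma)
    hit = 1 - norm.cdf((T - mu_old) / sigma)
    print(f"mu_old={mu_old}: hit={hit:.2f}, fa={fa:.2f}")
# mu_old=0.5: hit=0.58, fa=0.42  (closer together: concentering)
# mu_old=2.0: hit=0.79, fa=0.21  (further apart: dispersion)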

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test, but it may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list. In this case, the activation threshold is also constant during the recognition test, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed at 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or for any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

Table 1
General Table of Results From the Experiment

Condition    AN    BN    BO    AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13   .17   .69   .82   0.60             0.86
Frequency    .20   .28   .80   .68   1.01             0.66
Time         .10   .15   .78   .76   0.89             0.81

Note—The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low (AN) and high (BN) frequencies, the hit rates for high (BO) and low (AO) frequencies, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength, and the vertical axes the density.

List-Length Effect
Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net inputs (σh) for both easy- and difficult-class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves
The percentage of nodes active at recognition is lower for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths become equal. Figure 6D plots the predictions of the theory when all the standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020. The ratios of these standard deviations must follow Equation 2. This is also the case here: the slope for the AO/BN pair is 0.61 < 0.74 = the slope for the AO/AN pair < the slope for the BO/BN pair = 0.78 < 0.94 = the slope for the BO/AN pair.

Changing the Recognition Criterion
The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = 0.016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy-class words than for difficult-class words. This prediction is supported by empirical data. Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.
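This reversal can be reproduced from the strength distributions of Figure 4D. A sketch, assuming normal strength distributions with the means and standard deviations reported above (the new-item means taken as negative):

from scipy.stats import norm

dists = {"AN": (-0.012, 0.015), "BN": (-0.008, 0.012),
         "BO": ( 0.008, 0.015), "AO": ( 0.012, 0.020)}
for C in (0.0, 0.016):                  # unbiased vs. conservative criterion
    p = {k: round(1 - norm.cdf((C - mu) / sd), 2) for k, (mu, sd) in dists.items()}
    print(C, p)
# C = 0.000 -> roughly .21, .25, .70, .73: mirror order, P(AN) < P(BN)
# C = 0.016 -> roughly .03, .02, .30, .42: reversed, P(AN) > P(BN)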

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN), P(BO, AN) < P(AO, AN), P(BN, AN) > .50, and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN), P(BO, AN) = .83 < .84 = P(AO, AN), P(BN, AN) = .59 > .50, and P(AO, BO) = .57 > .50.

Within-List Strength Manipulation
So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage to old high-frequency items when attention is focused on these items) must be larger than the effect of the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage to old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items. However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of the net inputs (Figure 7A) that the density of the recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2, rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75, rather than 0.50. The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with the unbiased predictions for the standard parameters of Figure 4D: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74].

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the high-frequency class (BO) is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

Furthermore, in the variance theory, the ratios of the recognition strength standard deviations for high- and low-frequency items depend mainly on the standard deviations of the net inputs. The standard deviations of the net inputs do not depend on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = 0.013, σs(BN) = 0.011, σs(BO) = 0.017, and σs(AO) = 0.019. These results are similar to the results when the same amount of attention is paid to the two classes, as in Figure 4D: σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate distribution for the high-frequency words would be larger than that for the low-frequency words. Similarly, it predicts that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated in an experiment. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method
Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20, ranging from 18 to 29 years old.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4–5 times per million, and the high-frequency words an occurrence of 50–55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether or not the word had been presented in the list. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition testing was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition, followed by the presentation time condition. Nine subjects were given the control condition, followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results
The results from the experiment are presented in the first three rows of Table 1. The probability of a hit was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the probability of a hit was larger for the low-frequency words. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1–5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rates of the low-frequency words and the z-transformed hit rates of the high-frequency words [σs(BO)/σs(AO)], and similarly the slopes for the false alarms [σs(BN)/σs(AN)], are shown in the last two columns of Table 1.
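For illustration, the slope computation can be written out as follows; the cumulative rates below are invented for the example, and the fitted slope estimates the ratio of the underlying strength standard deviations:

import numpy as np
from scipy.stats import norm

# Cumulative P(rating >= c), c = 1..5, for two old-item distributions.
hits_low  = np.array([.95, .88, .71, .55, .40])   # low-frequency (class A) old items
hits_high = np.array([.93, .82, .64, .46, .30])   # high-frequency (class B) old items
z_low, z_high = norm.ppf(hits_low), norm.ppf(hits_high)
slope = np.polyfit(z_low, z_high, 1)[0]   # regression of one z-rate on the other
print(round(slope, 2))                    # sd ratio, per the convention of Table 1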

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation time condition and the control condition, but were approximately equal in the presentation frequency condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with the results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is higher than that for low-frequency words at encoding, but lower at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory, by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

$$\sum_{i=1}^{8} \frac{(\mathrm{Observed}_i - \mathrm{Predicted}_i)^2}{\sigma_i^2}.$$

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit; these were the number of features (N = 1,000) and the recognition criterion [ln(L) = 0].
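A minimal sketch of this weighted least-squares procedure is given below. The `predict` function is a placeholder forward model, and the observed values are hypothetical; a real fit would substitute the respective theory's equations and the published statistics.

```python
# Sketch of the fitting procedure: minimize the squared error divided by
# the variance over the free parameters. The observed values and the
# forward model below are placeholders, not the theories' actual numbers.
import numpy as np
from scipy.optimize import minimize

observed = np.array([0.75, 0.61, 0.42, 0.28, 0.80, 0.85, 0.90, 0.95])  # hypothetical
sigma = np.ones_like(observed)   # set to one when empirical SDs are unreported

def predict(params):
    """Placeholder forward model mapping three parameters to eight statistics."""
    p1, p2, p3 = params
    return np.array([p1, p1 * p2, p2, p2 * p3, p3, p3 * p1, p1 * p3, p2 * p1])

def chi_square(params):
    return np.sum((observed - predict(params)) ** 2 / sigma ** 2)

fit = minimize(chi_square, x0=[0.5, 0.5, 0.5], method="Nelder-Mead")
print(fit.x, fit.fun)
```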

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of


nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σ_h(A)], and the standard deviation of the net inputs for the difficult-class words [σ_h(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μ_h(N) = 0, μ_h(O) = 1, and C = 0]. The empirical standard deviations (σ_i) were not reported in Glanzer et al. (1993), so these parameters were set to one.

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities; the attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slopes, whereas the attention-likelihood theory accounts for 84%. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slopes, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σ_h(B)/σ_h(A)] may also be conceived of as constant, given that the same material is used in the different conditions. Therefore, the variance theory was also fitted with a single variable, namely, the standard deviation of the net inputs for the easy class [σ_h(A)]. The activity level was fixed to 0.10, and the ratio of the standard deviations of the net inputs, σ_h(B)/σ_h(A), to 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


Figure 8B shows the corresponding results for the slopes. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes with a single parameter as well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities using one parameter is slightly lower than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes, and that the slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs from the easy class [σ_h(A)] and the standard deviation of the net inputs from the difficult class [σ_h(B)], can also be expressed in terms of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μ_h); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μ_h(AN) = μ_h(BN) and μ_h(AO) = μ_h(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σ_h(A) and σ_h(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (P_c) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σ_h(A), σ_h(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (P_c) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μ_h) and the standard deviation of the net inputs (σ_h). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. P_c is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function of a normal distribution. Thus, the probability (P_c) that a node is active at recognition is

$$P_c = a \int_T^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_h}\, e^{-(h-\mu_h)^2 / 2\sigma_h^2}\, dh. \qquad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes and dividing by the standard deviation of the net inputs (σ_h) gives the expected recognition strength (μ_s):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σ_h) as an approximation of the standard deviation of the item (σ_h′), because this simplifies the analytic solution; the full variance theory, and the simulation, uses the standard deviation of the item. This approximation is good when there is a large number of features; for a small number of features, the variance of feature strength for a single item may fluctuate from item to item around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σ_s) is calculated from σ_h, P_c, and N. There are 2N nodes in the context and item layers. The distribution of P_c is binomial but can, when 2N P_c(1 − P_c) > 10, be approximated with a normal distribution with a standard deviation of [P_c(1 − P_c)/2N]^{1/2}. The final result is scaled by the normalization factor 1/σ_h:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c(1-P_c)}{2N}\right]^{1/2}. \qquad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion by integrating the density of the recognition strength over a normal distribution:

$$P(Y) = \int_C^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_s}\, e^{-(s-\mu_s)^2 / 2\sigma_s^2}\, ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A,B)] is calculated from the expected recognition strengths of A [μ_s(A)] and B [μ_s(B)] and the standard deviations of the recognition strengths of A [σ_s(A)] and B [σ_s(B)]:

$$P(A,B) = \int_C^{\infty} \frac{1}{\sqrt{2\pi\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}}\, \exp\!\left(-\frac{\bigl(s-[\mu_s(A)-\mu_s(B)]\bigr)^2}{2\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}\right) ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
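The same analytic solution can also be sketched in Python. The code below follows Equations 6 and 7 and the P(Y) integral, using the normal CDF in place of the explicit integrals; the fixed parameter values follow the Figure 4D settings quoted above, and the specific μ and σ values passed in are illustrative.

```python
# Sketch of the analytic solution: Equation 6 (P_c), the normalized
# recognition strength, Equation 7 (sigma_s), and the yes-response
# probability, evaluated with the normal CDF instead of explicit integrals.
import numpy as np
from scipy.stats import norm

def analytic_model(mu_old, sigma_h, a=0.10, two_N=100, C=0.0, mu_new=0.0):
    T = (mu_new + mu_old) / 2.0        # threshold between new and old net inputs
    rates = {}
    for label, mu_h in (("new", mu_new), ("old", mu_old)):
        p_c = a * norm.sf(T, loc=mu_h, scale=sigma_h)            # Equation 6
        mu_s = (p_c - a / 2.0) / sigma_h                         # recognition strength
        sigma_s = np.sqrt(p_c * (1.0 - p_c) / two_N) / sigma_h   # Equation 7
        rates[label] = norm.sf(C, loc=mu_s, scale=sigma_s)       # P(yes)
    return rates   # {"new": false alarm rate, "old": hit rate}

print(analytic_model(mu_old=1.0, sigma_h=1.00))   # less variable (easy) class
print(analytic_model(mu_old=1.0, sigma_h=1.25))   # more variable (difficult) class
```

With equal expected net inputs, the class with the larger net-input standard deviation yields a lower hit rate and a higher false alarm rate, which is the mirror pattern described earlier.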


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is calculated in the model with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (P_c) is assumed to be low. By approximating 1 − P_c to one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σ_s² ∝ P_c). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

$$\frac{\sigma_s^2(N)}{\sigma_s^2(O)} \approx \frac{P_c(N)}{P_c(O)}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.8²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [P_c(AN) ≈ P_c(BN) and P_c(BO) ≈ P_c(AO)]. Given these approximations and the approximation above (1 − P_c ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions:

$$\frac{\sigma_s(BO)}{\sigma_s(AO)} \le \frac{\sigma_s(BN)}{\sigma_s(AN)} \approx \frac{\sigma_h(A)}{\sigma_h(B)}.$$

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items: If the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield good performance in the model are a low percentage of active nodes, setting the activation threshold between the old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. Thus, optimal performance in the network requires the implementation suggested by the variance theory; if the implementation is changed significantly, the performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

$$d' = \frac{\mu_s(O) - \mu_s(N)}{\sigma_s(NO)} = \frac{P_c(O) - P_c(N)}{\left[P_c(NO)\bigl(1 - P_c(NO)\bigr)/2N\right]^{1/2}}. \qquad (8)$$

Because σ_s(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σ_s(NO), in the denominator of this equation. Thus, P_c(NO) is equal to [P_c(N) + P_c(O)]/2. P_c(·) was calculated with Equation 6. The expected values of the net inputs and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.
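The optimality analyses that follow can be sketched as a grid search over a and T. In the sketch below, the net-input moments are taken from the Appendix equations as reconstructed in this article, so the printed numbers are illustrative of the shape of the analysis rather than a reproduction of the exact values in Figure 9.

```python
# Sketch of the d' optimization (Equation 8) over the activity level a and
# the activation threshold T, with net-input moments taken from the
# reconstructed Appendix equations (f = 0, L = 1, p = 100, N = 30).
import numpy as np
from scipy.stats import norm

N, p, f, L = 30, 100, 0, 1

def d_prime(a, T):
    mu_old = a * (1 - a) ** 2 * N                                     # Equation A1
    var_new = a ** 3 * (1 - a) ** 2 * p * N                           # Equation A3
    var_old = (f + L) * a ** 3 * (1 - a) ** 3 * N ** 2 / 2 + var_new  # Equation A5
    pc_new = a * norm.sf(T, loc=0.0, scale=np.sqrt(var_new))          # Equation 6
    pc_old = a * norm.sf(T, loc=mu_old, scale=np.sqrt(var_old))
    pc_no = (pc_new + pc_old) / 2.0
    if pc_no <= 0.0 or pc_no >= 1.0:          # guard against underflow at extremes
        return 0.0
    return (pc_old - pc_new) / np.sqrt(pc_no * (1 - pc_no) / (2 * N))

best = max((d_prime(a, T), a, T)
           for a in np.linspace(0.01, 0.5, 50)
           for T in np.linspace(0.05, 3.0, 60))
print("max d' = %.2f at a = %.3f, T = %.2f" % best)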

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).


The results show that d′ is optimal for a = .052; d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases: For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important, and errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity level for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low: The number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold of the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may

deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs, 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance


with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition

criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f[S(O)] denotes the density of the recognition strength of the old distribution, and f[S(N)] denotes the density of the recognition strength of the new distribution. The ratio between these densities is called the likelihood ratio, L = f[S(O)]/f[S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
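For normal strength distributions, this intersection point can be found numerically; a small sketch follows, with assumed means and standard deviations.

```python
# Sketch: the optimal criterion is where the old and new densities intersect
# (likelihood ratio L = 1). The means and SDs below are assumptions chosen
# for illustration, with the new distribution having the smaller SD.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

mu_new, sd_new = 0.0, 0.8
mu_old, sd_old = 1.0, 1.0

def log_likelihood_ratio(s):
    return norm.logpdf(s, mu_old, sd_old) - norm.logpdf(s, mu_new, sd_new)

# The log-likelihood ratio changes sign between the two means.
c_opt = brentq(log_likelihood_ratio, mu_new, mu_old)
print(f"optimal criterion: {c_opt:.3f}")   # shifted toward the smaller-SD side
```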

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination)

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A; optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses are made for the difficult class, and a moderate criterion for the easy class, so that some yes responses are made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes is optimal.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: Optimal performance occurs when the maximum likelihoods of the two classes are equal,

$$L(A) = \frac{f[S(AO)]}{f[S(AN)]} = L(B) = \frac{f[S(BO)]}{f[S(BN)]}.$$

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the two false alarm rates must cancel, Δf[S(AN)] + Δf[S(BN)] = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f[S(AO)]/f[S(AN)] < f[S(BO)]/f[S(BN)] = L(B), and the overall change in hit rate is negative, Δf[S(AO)] + Δf[S(BO)] < 0. This shows that moving the placement of the criteria away from L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.
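A sketch of this constrained allocation follows: sweep the Class A criterion, pick the Class B criterion that holds the total false alarm rate fixed, and check that the total hit rate peaks where the two likelihood ratios coincide. All distribution parameters below are assumptions for illustration.

```python
# Sketch: split a fixed total false alarm rate across two classes and find
# the allocation that maximizes total hits; at the optimum the likelihood
# ratios L(A) and L(B) should approximately match. Parameters are assumed.
import numpy as np
from scipy.stats import norm

A = dict(mu_new=0.0, sd_new=0.8, mu_old=1.2, sd_old=1.0)   # easy class
B = dict(mu_new=0.0, sd_new=1.0, mu_old=0.6, sd_old=1.1)   # difficult class
TOTAL_FA = 0.4   # P(AN) + P(BN), held constant

def total_hits(c_a):
    fa_a = norm.sf(c_a, A["mu_new"], A["sd_new"])
    fa_b = TOTAL_FA - fa_a
    if not 0.0 < fa_b < 1.0:
        return -np.inf, None
    c_b = norm.isf(fa_b, B["mu_new"], B["sd_new"])   # criterion yielding fa_b
    hits = (norm.sf(c_a, A["mu_old"], A["sd_old"])
            + norm.sf(c_b, B["mu_old"], B["sd_old"]))
    return hits, c_b

best_hits, c_a = max((total_hits(c)[0], c) for c in np.linspace(-2, 3, 501))
c_b = total_hits(c_a)[1]

def ratio(cls, c):   # likelihood ratio f[S(O)] / f[S(N)] at criterion c
    return norm.pdf(c, cls["mu_old"], cls["sd_old"]) / norm.pdf(c, cls["mu_new"], cls["sd_new"])

print(f"L(A) = {ratio(A, c_a):.2f}, L(B) = {ratio(B, c_b):.2f} at the optimum")
```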

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can be inferred indirectly from the formulation of the theory. This is done in the variance theory by the normalization factor that scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A as the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as

the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance, even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as near optimal for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Classes A and B, and the total false alarm rate is equal to the average false alarm rate of Classes A and B.

For all recognition criteria, or for all false alarm rates, performance is better or equal after normalization, as compared with performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm


rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria, because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., P_c > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because P_c(1 − P_c)/N decreases for P_c over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data; if the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency


distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to the high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σ_s(BN) < σ_s(BO) < σ_s(AN) < σ_s(AO) and σ_s(BN) < σ_s(AN) < σ_s(BO) < σ_s(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σ_s(BO) < σ_s(BN) < σ_s(AN) < σ_s(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode; thus, there would be a confounding between the item's frequency and its recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a

particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item–context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism applies to the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. These wrongly activated nodes are more likely to represent high-frequency features, because the deblurring process is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class with a higher variability than that of words in the variance theory. However, further work is needed before any firm conclusion can be drawn regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from the concrete class are more easily discriminated than words from the abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts


associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length or the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes–no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli: During the calculation of recognition strength, the standard deviation of the net input of the item (σ_h′) is used, so the subject does not need to know the class or the standard deviation of the class (σ_h). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes, subtract the expected value, and divide by the standard deviation of the net input of the item. An explicit solution was presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and the z-ROC curves; instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes that model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the subjective-likelihood model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic); an implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neurosciences of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representation by backpropagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå. (ISBN 91-7191-155-3)

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σ_p(N)] and old [σ_p(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

$$\frac{a\,\sigma_p(N)}{\sigma_p(N) + \sigma_p(O)}.$$

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the ROC curve is 0.8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξ_i = ξ_j = 1:

$$\mu_h(O) = \sum_{j=1}^{N} \Delta w_{ij}\,\xi_i\,\xi_j = \sum_{j=1}^{N} (\xi_i - a)(\xi_j - a) = a(1-a)^2 N. \qquad (A1)$$

The expected value of the net inputs for the new items iszero

$$\mu_h(N) = 0. \qquad (A2)$$

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

$$\sigma_h^2 = \mathrm{E}\left[\left(\sum_{j=1}^{N} w_{ij}\,\xi_j\right)^{\!2}\right] = aNP\,\mathrm{E}\bigl[(\xi_i - a)^2 (\xi_j - a)^2\bigr] = a^3 (1-a)^2 P N. \qquad (A3)$$

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

$$\sigma_h^2(f) = f\,\mathrm{E}\bigl[(\xi_i - a)^2\bigr]\bigl[aN(1-a)\bigr]^2 = f\,a^3 (1-a)^3 N^2. \qquad (A4)$$

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

$$\sigma_h^2(L) = L\,a^3 (1-a)^3 N^2.$$

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L, in a network where p patterns have been encoded, is

$$\sigma_h^2(O) = (f + L)\,\frac{a^3 (1-a)^3 N^2}{2} + a^3 (1-a)^2 p N. \qquad (A5)$$
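A quick Monte Carlo check of the frequency relation in Equation A4, as reconstructed here, can be sketched as follows: simulate the net input to a single context node for an item encoded in f random contexts and compare the empirical variance with the formula.

```python
# Monte Carlo check of Equation A4 as reconstructed above: the net input
# to a context node from an item encoded in f random contexts should have
# variance close to f * a^3 * (1 - a)^3 * N^2.
import numpy as np

rng = np.random.default_rng(0)
N, a, f, trials = 200, 0.1, 5, 20000

xi_item = np.zeros(N)
xi_item[: int(a * N)] = 1.0                 # aN active item nodes
coherent = np.sum((xi_item - a) * xi_item)  # = aN(1 - a), same sign every encoding

h = np.empty(trials)
for t in range(trials):
    xi_ctx = rng.random(f) < a              # the node's state in each of f contexts
    h[t] = np.sum((xi_ctx - a) * coherent)

print("empirical:", h.var(), " formula:", f * a**3 * (1 - a)**3 * N**2)
```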

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where P_c(O) and P_c(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


VARIANCE THEORY OF MIRROR EFFECT 419

inputs than does the easy class As will be discussed laterthis is shared with other theories such as the subjective-likelihood account (McClelland amp Chappell 1998) andREM (Shiffrin amp Steyvers 1997) However a unique as-pect of the variance theory is that it is a necessary outcomeof simply encoding high-frequency items more timesthan low-frequency items In the subjective-likelihood ac-count a larger variability for high- than for low-frequencywords is an assumption that is controlled by a free pa-rameter ( p0) to reflect that high-frequency words havemore definitions than do low-frequency words The vari-ance theory needs no such assumptions and no addi-tional free parameters to control the variability Whereasthe subjective-likelihood account tries to capture wordfrequency with a free parameter in the variance theorythe word frequency effect is simulated via the number ofcontexts associated with an item

Owing to how the variance theory implements the re-lations between contexts and items the variance of the

net inputs increases with the frequency with which an itemis encoded in different contexts Intuitively this occursbecause a high-frequency item activates several differentcontexts causing the representations of many competingcontexts to be activated simultaneously Low-frequencyitems are associated with fewer contexts than are high-frequency items This causes the representations offewer contexts to be activated simultaneously Thus low-frequency items have less variability in the net inputsthan do high-frequency items

At recognition an item produces a net input in the con-text layer that is a mixture of the net inputs from the studycontext that the network is instructed to retrieve fromand from several uncorrelated preexperimental contextsthat were associated with the item during the preexperi-mental phase The study context that the network is in-structed to retrieve from contributes to the correct activestate For example by adding +1 to the net input to anode (which is the expected net input for a node encoded

Figure 4 (A) The probability density of the net inputs in the variance theory The horizontal axis shows the net inputs and the ver-tical axis the probability density of the net inputs The dotted vertical line is the activation threshold (B) The cumulative probabilitydistributions of the net inputs in the variance theory The horizontal axis shows the net inputs and the vertical axis the cumulativeprobability distributions of the net inputs (C) The density of percentage nodes active at recognition in the variance theory The hor-izontal axis shows the percentage of nodes active at recognition and the vertical axis the density (D) The density of recognition strengthin the variance theory The horizontal axis shows the recognition strength and the vertical axis the density of the recognition strengthusing standard parameter values


For example, the study context may add +1 to the net input to a node (which is the expected net input for a node encoded in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was in an active state, or it could add −1 to the net input if the particular preexperimental context was encoded in an inactive state (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input. That is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and −1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1² for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for the items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or −1/4 [see Equation 3], each yielding an increase in variability of 1/4² = 1/16. An expected value of aN nodes contribute, so the total increase in variability is 4 × 1/16 = 1/4.)
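For illustration, this arithmetic can be checked with a minimal Monte-Carlo sketch. The code and its ±1 contribution scheme follow the example above; the function name is mine, and this is not the author's simulation code:

    import numpy as np

    rng = np.random.default_rng(0)

    def net_input_stats(n_contexts, n_samples=100_000):
        # Each preexperimental context adds +1 or -1 to a node's net
        # input with equal probability (the expected contributions for
        # N = 8 and a = .5 in the example above).
        contributions = rng.choice([-1.0, 1.0], size=(n_samples, n_contexts))
        h = contributions.sum(axis=1)
        return h.mean(), h.var()

    for k in (1, 2, 4, 8):
        mean, var = net_input_stats(k)
        print(f"{k:2d} contexts: mean = {mean:+.3f}, variance = {var:.3f}")
    # The mean stays near 0 while the variance grows by about 1 per
    # added context, as the text claims.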

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in context nodes increases linearly with how many times a given item is encoded within different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.

In the same way as the variability of context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with how many times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input

The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase of the two classes are equal. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given the exact same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are only interested in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to smaller increases in the net input, than does a longer period of study time, such as 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encoding in other learning contexts will not affect the expected net input, but it does affect the variability of the net input, as was demonstrated above. The item-study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, a difficult class item) is equal to the expected net input for ARCTIC (a low-frequency word, an easy class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and testing experimental conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult class item is equal to the expected net input for a new easy class item.

The probability density functions of the net inputs for nodes in the active states are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted. New nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of the net inputs for the difficult class items [σh(B)] is larger than the standard deviation of the net inputs for the easy class items [σh(A)]. The second mechanism is shown in the figures in that the expected net input of an easy class new item [μh(AN)] is equal to the expected net input of a difficult class new item [μh(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: The expected net input for the old easy class items [μh(AO)] is equal to the expected net input for the old difficult class items [μh(BO)].

Recognition Strength

The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that the performance is optimal or near-optimal.


If the net input is above the activation threshold (T) and the node was active at encoding, the node is activated at recognition. Otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities or distributions predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of parameter values for the activation threshold: P(AN) < P(BN) < P(BO) < P(AO) for μh(AN) = μh(BN) < T < μh(AO) = μh(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy and the difficult class items. The expected strengths of the net inputs are equal. The variability is lower for easy class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy than for the difficult class items when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy class items is higher than the hit rate for difficult class items when the activation threshold is less than the expected value of the old items.

The activation threshold is set to be between the expected values of the new and the old net inputs, so that the performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for difficult and easy class items, respectively:

T = ¼[μh(AN) + μh(BN) + μh(BO) + μh(AO)] = ¼[μh(BO) + μh(AO)] = ½μh(O).
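For instance, with the parameter values used in the simulations below (new expected net inputs of 0 and old expected net inputs of 1), this placement gives T = ¼(0 + 0 + 1 + 1) = 0.5, the value quoted later.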

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions to optimize the performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with a shift in the activation threshold necessary for keeping the performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test in order to maximize the performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [μh(AN) = μh(BN) < μh(BO) = μh(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σp(O) > σp(N)], whereas the standard deviations of the net inputs are equal for old and new items [σh(N) = σh(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions because fewer nodes are active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero. This standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes is approximately proportional to the percentage of active nodes [σp² = Pc(1 − Pc)/N ≈ Pc/N ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σp) is smaller for new than for old items [σp(AN) < σp(BN) < σp(BO) < σp(AO)], whereas the standard deviation of the net inputs is not [σh(AN) = σh(AO) < σh(BN) = σh(BO)]. The essential mechanism that produces these changes in the means and standard deviations is the nonlinearity introduced by the counting of the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.
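For illustration, the binomial relation behind Figure 5 can be sketched as follows; the example probabilities here are hypothetical, chosen only to show that old items (more active nodes) get the wider distribution:

    import numpy as np

    def sigma_p(p_c, n_nodes):
        # Standard deviation of the fraction of active nodes when each
        # node is active independently with probability p_c.
        return np.sqrt(p_c * (1.0 - p_c) / n_nodes)

    # With 100 nodes (as in Figure 5):
    print(sigma_p(0.02, 100))  # new items, fewer active nodes: ~0.014
    print(sigma_p(0.05, 100))  # old items, more active nodes: ~0.022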

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because it improves the performance.

To optimize the performance, subjects base their recognition decision on the number of active nodes, normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments when the responses are biased but plays no role for unbiased responses.



To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σh′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σh′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviation of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, with the criterion placed at a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) shifts the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) shifts the decision toward greater certainty.
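The decision stage just described can be summarized in a short illustrative sketch; the function and parameter names are mine, not the author's:

    import numpy as np

    def recognition_response(net_inputs, encoded_active, T, a,
                             sigma_h_item, C=0.0):
        # Count the nodes that were active at encoding and whose net
        # input exceeds the activation threshold T.
        active = (net_inputs > T) & encoded_active
        p_c = active.sum() / net_inputs.size
        # Recognition strength (Equation 5): subtract the expected
        # fraction of active nodes (a/2) and normalize by the item's
        # net-input standard deviation, which scales confidence when
        # the criterion is biased.
        s = (p_c - a / 2.0) / sigma_h_item
        return ("yes" if s >= C else "no"), s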

Figure 4D shows the density distributions of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult classes (in Figure 4C) change when they are normalized by the variance of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high. The easy class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution that is presented at the end of the paper, together with an analysis of optimal performance.

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.



The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a) plus one parameter for each class of words [the standard deviation of the net input, σh( )]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs to the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs to the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = −.012, μs(BN) = −.008, μs(BO) = .008, and μs(AO) = .012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists

The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize the performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize the performance. The distributions in Figure 6A are closer than the distributions in Figure 4D. Thus, decreasing the net inputs, for example by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite directions, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain a near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list.


In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed at 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates.

Table 1
General Table of Results From the Experiment

Condition    AN    BN    BO    AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control     .13   .17   .69   .82        .60              .86
Frequency   .20   .28   .80   .68       1.01              .66
Time        .10   .15   .78   .76        .89              .81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low (AN) and high (BN) frequencies, the hit rates for high (BO) and low (AO) frequencies, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength, and the vertical axes the density.


In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5 to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

List-Length Effect

Given everything else equal, recognition from a short list produces a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy and difficult class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.
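The symmetry between the frequency and list-length mechanisms can be stated compactly (a loose summary of the two linearity claims, with intercepts and constants suppressed): σh²(context nodes) ∝ f and σh²(item nodes) ∝ L, where f is the number of contexts in which an item has been encoded and L is the list length.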

ROC Curves

The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths themselves, become equal. Figure 6D plots the predictions of the theory when all standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = .015, σs(BN) = .012, σs(BO) = .015, and σs(AO) = .020. The ratios of these standard deviations must follow Equation 2. This is also the case: σs(BN)/σs(AO) = .61 < .74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = .78 < .94 = σs(AN)/σs(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = .016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy class words than for difficult class words. This prediction is supported by empirical data. Greene (1996) asked subjects to respond yes only if they were sure of their response; consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced recognition [P(BO, BN) < P(AO, BN), P(BO, AN) < P(AO, AN), P(BN, AN) > .50, and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN), P(BO, AN) = .83 < .84 = P(AO, AN), P(BN, AN) = .59 > .50, and P(AO, BO) = .57 > .50.
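These forced-choice probabilities follow from the difference between two normal strength distributions; a minimal sketch (the function name is mine):

    from math import sqrt
    from statistics import NormalDist

    def p_forced_choice(mu_first, sigma_first, mu_second, sigma_second):
        # Probability that the first item's recognition strength exceeds
        # the second's, e.g. p_forced_choice(mu_AO, s_AO, mu_BN, s_BN).
        diff_sd = sqrt(sigma_first**2 + sigma_second**2)
        return NormalDist().cdf((mu_first - mu_second) / diff_sd)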

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way.


The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage for old high-frequency items when attention is focused on these items) must be larger than the effect from the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage for old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of net inputs (Figure 7A) that the density of recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75 rather than 0.50.


Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the old high-frequency distribution (BO) is shifted to the right because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with the unbiased predictions for Figure 4D: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74].

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs do not depend on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = .013, σs(BN) = .011, σs(BO) = .017, and σs(AO) = .019. These results are similar to the results when the same amount of attention is paid to the two classes, as in Figure 4D: σs(AN) = .015, σs(BN) = .012, σs(BO) = .015, and σs(AO) = .020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate distribution for the high-frequency words would be larger than that for the low-frequency words. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20, ranging from 18 to 29 years old.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining ones for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether the word had been presented in the list or not. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition testing was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition. Nine subjects were given the control condition followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1.


The hit rate was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate was larger for the low-frequency words. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1-5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly for the false alarms [σs(BN)/σs(AN)], are shown in the last two columns of Table 1.
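This slope computation can be sketched as follows (an illustrative helper under my own naming; the z-transform and regression steps are exactly those described above):

    import numpy as np
    from scipy.stats import norm, linregress

    def z_slope(p_y, p_x):
        # Regress one vector of z-transformed cumulative response
        # proportions (e.g., hit rates of low-frequency words at
        # confidence cutoffs 5..1) on another (e.g., hit rates of
        # high-frequency words); the slope estimates a ratio of the
        # underlying standard deviations.
        z_y = norm.ppf(np.asarray(p_y))
        z_x = norm.ppf(np.asarray(p_x))
        return linregress(z_x, z_y).slope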

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is also fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

\sum_{i=1}^{8} \frac{(\mathrm{Observed}_i - \mathrm{Predicted}_i)^2}{\sigma_i^2}.

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were N (= 1,000) and the recognition criterion [ln(L) = 0].
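In outline, the same minimization can be reproduced as follows. This is a sketch only: the predict function stands for whichever theory's forward predictions are being fitted, and the optimizer choice is an assumption of mine, not taken from the original papers:

    import numpy as np
    from scipy.optimize import minimize

    def weighted_sse(params, observed, sigmas, predict):
        # The fit criterion above: squared error divided by the
        # variance, summed over the eight fitted statistics.
        return np.sum(((observed - predict(params)) / sigmas) ** 2)

    # result = minimize(weighted_sse, x0, method="Nelder-Mead",
    #                   args=(observed, np.ones(8), predict))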

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters.



The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy class words [σh(A)], and the standard deviation of the net inputs for the difficult class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.

The results show that the variance theory accounts for .98 (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for .93 of the variance for the slopes; the attention-likelihood theory accounts for .84 of the variance of the slopes. Thus, the variance theory accounted for the same amount of variance for the probabilities and more variance for the slopes, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted by a single variable, namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed to 0.10 and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), to 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


Figure 8B shows the corresponding results for the slopes. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory, using a single parameter, fits the slopes as well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities, using one parameter, is slightly less good than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed in terms of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

P_c = a \, \frac{1}{\sqrt{2\pi}\,\sigma_h} \int_T^{\infty} e^{-(h - \mu_h)^2 / (2\sigma_h^2)} \, dh. \qquad (6)

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), yields the expected recognition strength (μs):

\mu_s = \frac{P_c - a/2}{\sigma_h}.

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because it simplifies the analytic solution; the variance theory, or the simulation, uses the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and item layers. The distribution of Pc is binomial but can, for a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/2N]^(1/2). The final result is scaled by the normalization factor 1/σh:

\sigma_s = \frac{1}{\sigma_h} \left[ \frac{P_c (1 - P_c)}{2N} \right]^{1/2}. \qquad (7)

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion by integrating the density of the recognition strength over a normal distribution:

P(Y) = \frac{1}{\sqrt{2\pi}\,\sigma_s} \int_C^{\infty} e^{-(s - \mu_s)^2 / (2\sigma_s^2)} \, ds.

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

P(A, B) = \frac{1}{\sqrt{2\pi[\sigma_s^2(A) + \sigma_s^2(B)]}} \int_0^{\infty} \exp\!\left( -\frac{\{s - [\mu_s(A) - \mu_s(B)]\}^2}{2[\sigma_s^2(A) + \sigma_s^2(B)]} \right) ds.

An Excel sheet for calculating the predictions of the variance theory is available on line (www.psych.utoronto.ca/~sverker/variance.html).
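For illustration, Equations 6 and 7 and the yes-response probability translate directly into a few lines of code. The function name here is mine, and scipy's normal distribution stands in for the integrals:

    import numpy as np
    from scipy.stats import norm

    def analytic_predictions(mu_h, sigma_h, T, a, N, C=0.0):
        p_c = a * norm.sf(T, loc=mu_h, scale=sigma_h)            # Equation 6
        mu_s = (p_c - a / 2.0) / sigma_h                         # expected strength
        sigma_s = np.sqrt(p_c * (1 - p_c) / (2 * N)) / sigma_h   # Equation 7
        p_yes = norm.sf(C, loc=mu_s, scale=sigma_s)              # P(Y)
        return p_c, mu_s, sigma_s, p_yes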


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is calculated in the model with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc to one, the variance of the recognition strength can be simplified to σs² = Pc / (2Nσh²).

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC. The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations: σs²(N)/σs²(O) ≈ Pc(N)/Pc(O).

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.80²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions.

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class: σs(BO)/σs(AO) ≤ σs(BN)/σs(AN) ≈ σh(A)/σh(B).

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that the performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of nodes active, setting the activation threshold between old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that an optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network would not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

d′ = [Pc(O) − Pc(N)] / √[Pc(NO)(1 − Pc(NO))/N].  (8)

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc( ) was calculated with Equation 6. The expected values and standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).

The results show that d′ is optimal for a = .052; d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases: for an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. For very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important: errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. Surprisingly, however, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near but slightly lower than the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed; therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.
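The threshold optimization can be sketched numerically. The snippet below assumes a simplified Gaussian activation rule for the probability that an encoded node is active at recognition, standing in for Equation 6, with the expected old net input of 1.42 quoted above and a unit net-input standard deviation as an assumption; the simplification shifts the exact optimum, but it reproduces the qualitative result that the best threshold lies above the midpoint of the two means:

```python
from statistics import NormalDist

Phi = NormalDist().cdf
a, N = 0.052, 30                          # active fraction and node count from the text
mu_new, mu_old, sigma = 0.0, 1.42, 1.0    # sigma = 1.0 is an assumption

def d_prime(T):
    # Probability that a node encoded as active exceeds threshold T
    pc_new = a * (1 - Phi((T - mu_new) / sigma))
    pc_old = a * (1 - Phi((T - mu_old) / sigma))
    pc_no = (pc_new + pc_old) / 2         # pooled rate, as in Equation 8
    return (pc_old - pc_new) / ((pc_no * (1 - pc_no) / N) ** 0.5)

best_T = max((t / 100 for t in range(0, 200)), key=d_prime)
print(round(best_T, 2), round(d_prime(best_T), 2))   # optimum above the 0.71 midpoint
```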

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight changes, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (although the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes (the solid line). The results show that the highest d′ is found when the decision is based only on active nodes and a is low; including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smaller standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition

criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, let f[S(O)] denote the density of the recognition strength of the old distribution and f[S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f[S(O)]/f[S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
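The intersection rule can be illustrated numerically. The sketch below uses two arbitrary Gaussian strength distributions (the values are illustrative, not fitted to the model) and shows that hits minus false alarms peaks where the likelihood ratio equals one:

```python
from statistics import NormalDist

new, old = NormalDist(0.0, 1.0), NormalDist(1.0, 1.5)   # illustrative strengths

def hits_minus_fa(c):
    # Hit rate minus false alarm rate at criterion c
    return (1 - old.cdf(c)) - (1 - new.cdf(c))

best_c = max((c / 100 for c in range(-300, 300)), key=hits_minus_fa)
L = old.pdf(best_c) / new.pdf(best_c)       # likelihood ratio at the optimum
print(round(best_c, 2), round(L, 3))        # L is approximately 1
```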

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ when the decision is based only on nodes that are active during encoding; the dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A; optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the total hit rate [P(AO) + P(BO)] is maximized? The solution to this problem is surprisingly simple: optimal performance occurs when the maximum likelihoods of the two classes are equal,

L(A) = f[S(AO)]/f[S(AN)] = f[S(BO)]/f[S(BN)] = L(B).

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is diminished, the recognition criterion for Class B must be increased to keep the total false alarm rate constant; according to the formulation of the problem, the resulting changes in the two false alarm rates must cancel, ΔP(AN) + ΔP(BN) = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria, so after this change, L(A) = f[S(AO)]/f[S(AN)] < f[S(BO)]/f[S(BN)] = L(B), and the corresponding change in the total hit rate is negative, ΔP(AO) + ΔP(BO) < 0. Thus, any movement of the criteria away from the point where L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.
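A small sketch can likewise illustrate the two-class result: holding the total false alarm rate fixed and scanning the Class A criterion, the total hit rate peaks where the two likelihood ratios coincide. The distributions are illustrative stand-ins, not the model's:

```python
from statistics import NormalDist

newA, oldA = NormalDist(0, 1.0), NormalDist(1, 1.0)   # easy class (A)
newB, oldB = NormalDist(0, 1.6), NormalDist(1, 1.6)   # difficult class (B)

def total_hits(cA, cB):
    return (1 - oldA.cdf(cA)) + (1 - oldB.cdf(cB))

target_fa = 0.5                       # fixed total false alarm rate P(AN) + P(BN)
best = None
for i in range(-200, 300):            # scan the Class A criterion
    cA = i / 100
    faB = target_fa - (1 - newA.cdf(cA))
    if not 0 < faB < 1:
        continue
    cB = newB.inv_cdf(1 - faB)        # Class B criterion forced by the constraint
    if best is None or total_hits(cA, cB) > total_hits(*best):
        best = (cA, cB)

cA, cB = best                         # at the optimum, L(A) and L(B) nearly agree
print(round(oldA.pdf(cA) / newA.pdf(cA), 3), round(oldB.pdf(cB) / newB.pdf(cB), 3))
```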

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can indirectly be inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A as the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes, minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B; therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A; therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently so that performance may be described as near optimal for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate of Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance; the largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance; the largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data on Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50) even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data; if the model is made nonoptimal, it does not. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: the only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of active nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations: the standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy class distributions are predicted to be larger than the standard deviations of the difficult class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune from this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item–context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts, whereas nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood theory's account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes–no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σh′), so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves; instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kisok, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention–likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM–retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old items [σp(O)] are equal. However, if the standard deviations are different, the expected number of active nodes will be

aσp(N) / [σp(N) + σp(O)].

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the ROC curve is 0.8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.
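Given the formula as reconstructed above, the 0.44a value can be verified directly; the 0.8 ratio follows from an ROC slope of 0.8:

```python
# sigma_p(N)/sigma_p(O) equals the z-ROC slope; the expected active fraction,
# in units of a, is then ratio / (ratio + 1).
ratio = 0.8
coef = ratio / (ratio + 1.0)
print(round(coef, 3))        # 0.444, i.e., "around 0.44a"
```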

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξi = ξj = 1:

μh(O) = Σj Δwij ξj = Σj (ξi − a)(ξj − a)ξj = a(1 − a)²N.  (A1)

The expected value of the net inputs for the new items iszero

μh(N) = 0.  (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

σh²(N) = Σp Σj (Δwij ξj)² = [(1 − a)²a + (0 − a)²(1 − a)]² aPN = a³(1 − a)²PN.  (A3)
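The reconstructed equation can be checked by simulation. The sketch below assumes the Hebbian weight change Δwij = (ξi − a)(ξj − a) described in the text and estimates the mean and variance of a new item's net input; it is an illustrative sketch, not the simulation code used in the paper:

```python
import random

random.seed(1)
N, a, P, trials = 100, 0.1, 1, 20000

def pattern():
    # Random sparse pattern: each node active with probability a
    return [1 if random.random() < a else 0 for _ in range(N)]

nets = []
for _ in range(trials):
    cue = pattern()                    # new item, uncorrelated with storage
    h = 0.0
    for _ in range(P):
        stored = pattern()
        # Net input to node 0 (self-connection excluded)
        h += sum((stored[0] - a) * (stored[j] - a) * cue[j] for j in range(1, N))
    nets.append(h)

mean = sum(nets) / trials
var = sum((x - mean) ** 2 for x in nets) / trials
print(round(mean, 4), round(var, 4))   # ~0 (A2); ~a^3 (1-a)^2 PN = 0.081 (A3)
```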

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

σh²(f) = E{[(ξi − a) Σj (ξj − a)ξj]²} f = a(1 − a)[aN(1 − a)]² f = a³(1 − a)³f N².  (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

σh²(L) = a³(1 − a)³LN².

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where p patterns have been encoded, is

σh²(O) = (f + L)a³(1 − a)³N²/2 + a³(1 − a)²pN.  (A5)

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


in the active state when N = 8 and a = .5; see Equation A1 in the Appendix), thus increasing the expected net input. The preexperimental contexts, on the other hand, randomly add to or subtract from the net input. For example, a preexperimental context could add +1 to the net input if the node was in an active state, or it could add −1 to the net input if the particular preexperimental context was encoded in an inactive state (which is the expected net input for a node encoded in the inactive state when N = 8 and a = .5; see Equation 3, 4, or A1). Note that the net input can be negative, whereas the state of activation cannot. If the representations of these preexperimental contexts are uncorrelated, the number of associated preexperimental contexts will not influence the expected net input; that is, the expected values of the negative and positive contributions will cancel out (e.g., for N = 8 and a = .5, the contribution to the net input is +1 with a probability of .50 and −1 with a probability of .50, yielding an expected value of 0). However, the variability of the net inputs increases when more contexts are added. In this example, the variance is increased by 1² for each added context (i.e., the variance increases by each contribution raised to the power of two). Encoding an item also increases the variability of the net input for all other items encoded in the network. However, the increase in variability for the items uncorrelated with the encoded item is much smaller, because the contribution from each node is independent. (E.g., for N = 8 and a = .5, the contribution from each node is either +1/4 or −1/4 [see Equation 3], each yielding an increase in variability of 1/4² = 1/16. An expected value of aN nodes contribute, so the total increase in variability is 4 × 1/16 = 1/4.)
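This mechanism is easy to reproduce in a small simulation. The sketch below assumes the same Hebbian learning rule and the N = 8, a = .5 example from the text, with exactly aN nodes active per item; it is illustrative rather than the paper's code:

```python
import random

random.seed(2)
N, a, trials = 8, 0.5, 20000
n_active = int(a * N)

def net_input(f):
    """Net input to one context node after the item was encoded in f contexts."""
    item = [1] * n_active + [0] * (N - n_active)   # item pattern with aN active nodes
    random.shuffle(item)
    cross = sum((xj - a) * xj for xj in item)      # aN(1 - a) = 2 in this example
    h = 0.0
    for _ in range(f):                             # one term per encoding context
        ctx = 1 if random.random() < a else 0      # this node's state in that context
        h += (ctx - a) * cross                     # contributes +1 or -1 when a = .5
    return h

for f in (1, 2, 4, 8):
    xs = [net_input(f) for _ in range(trials)]
    m = sum(xs) / trials
    v = sum((x - m) ** 2 for x in xs) / trials
    print(f, round(m, 3), round(v, 3))             # mean ~ 0, variance ~ f
```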

A mathematical analysis of how the variability of the net inputs in the context nodes increases as a function of the frequency with which the item has been encoded in different contexts is presented in the Appendix. This analysis shows that the variance of the net inputs in context nodes increases linearly with the number of times a given item is encoded in different contexts. The variability of the net inputs for nonwords may be a special case, discussed at the end of this paper.

In the same way as the variability of context nodes depends on the item's frequency, the variability of the item nodes depends on the frequency of the context. That is, the variability of the net input to the item nodes increases with the number of times one context is associated with different items. Given that the context is constant during the presentation of a study list, the variability of the net inputs to the item nodes will increase with list length.

Expected Net Input

The second mechanism in the variance theory is that the expected net inputs to the easy and difficult classes of stimuli are equal, given that the encoding conditions during the experimental phase of the two classes are equal. Note that this is in stark contrast to the attention-likelihood theory, which assumes that more attention (corresponding to more net input) is given to the easy class than to the difficult class. Experimentally, the equality of the net inputs simply means that different classes of stimuli are given exactly the same conditions for encoding and retrieval in a recognition memory study. The net input to a node encoded in the active state increases during encoding, whereas the net input to a node encoded in the inactive state decreases during encoding. Only nodes encoded in the active state are used during retrieval, so here we are only interested in the increase in net input that occurs for nodes encoded in the active state. Strengthening of the weights during encoding increases the expected net input. The degree of increase in expected net input is influenced by strength-based variables, such as study time, repetition, levels of processing, and so forth. For example, the simulations can be set up so that a study time of 1 sec strengthens the weights less, leading to a smaller increase in the net input, than does a longer period of study time, for example, 2 sec of encoding time. Because the study context is unique to the learning episode, preexperimental encodings in other learning contexts will not affect the expected net input, but they do affect the variability of the net input, as was demonstrated above. The item–study-context associations are learned approximately equally well for old high- and old low-frequency items. For example, the expected net input for CAR (a high-frequency word, as a difficult-class item) is equal to the expected net input for ARCTIC (a low-frequency word, as an easy-class item). Generally, the expected net input does not depend on the class of the item, because the expected net input is influenced by the study and testing experimental conditions similarly across stimulus classes in a recognition memory experiment. Therefore, the expected net input for a new difficult-class item is equal to the expected net input for a new easy-class item.

The probability density functions of the net inputs for nodes in the active states are plotted in Figure 4A. Old nodes in the inactive state have a negative expected value of the net input and are not plotted; new nodes in the inactive state have the same density as nodes in the active state. The cumulative probability distributions of the net inputs for nodes in the active state are plotted in Figure 4B. Figure 4A shows the first mechanism, namely, that the standard deviation of net inputs for the easy class items [σh(A)] is smaller than the standard deviation of net inputs for the difficult class items [σh(B)]. The second mechanism is shown in the figures in that the expected net input of an easy class new item [μh(AN)] is equal to the expected net input of a difficult class new item [μh(BN)]. Furthermore, at encoding, the expected net inputs of activated nodes are increased equally, or approximately equally, for the easy and the difficult classes of items. This is shown in Figure 4A: the expected net input for the old easy class items [μh(AO)] is equal to the expected net input for the old difficult class items [μh(BO)].

Recognition Strength

The variance theory suggests that the recognition decision needs to be based on counting the number of active nodes in such a way that the performance is optimal or near-optimal. If the net input is above the activation threshold (T) and the node was active at encoding, the node is activated at recognition; otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities, or distributions, predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of parameter values for the activation threshold: P(AN) < P(BN) < P(BO) < P(AO) for μh(AN) = μh(BN) < T < μh(AO) = μh(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy and the difficult class items, whereas the expected strengths of the net inputs are equal. The variability is lower for easy class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy than for the difficult class items when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy class items is higher than the hit rate for difficult class items when the activation threshold is less than the expected value of the old items.
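The ordering can be verified numerically with the parameter values used in the Predictions section below. The Gaussian read-out here is a simplification of Equation 6, so the exact probabilities differ from the model's, but the mirror ordering is the same:

```python
from statistics import NormalDist

Phi = NormalDist().cdf
mu_new, mu_old, T = 0.0, 1.0, 0.5
sigma = {"A": 1.25, "B": 1.56}        # easy (A) and difficult (B) net-input sds

for cls in ("A", "B"):
    p_new = 1 - Phi((T - mu_new) / sigma[cls])   # false alarm tendency
    p_old = 1 - Phi((T - mu_old) / sigma[cls])   # hit tendency
    print(cls, round(p_new, 3), round(p_old, 3))
# Prints A: 0.345, 0.655 and B: 0.374, 0.626, so P(AN) < P(BN) < P(BO) < P(AO)
```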

The activation threshold is set to be between the expected values of the new and the old net inputs, so that the performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for difficult and easy class items, respectively:

T = ¼[μh(AN) + μh(BN) + μh(BO) + μh(AO)] = ¼[μh(BO) + μh(AO)] = ½μh(O).

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions to optimize the performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with a shift in the activation threshold, which is necessary for keeping the performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test in order to maximize the performance.

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [μh(AN) = μh(BN) < μh(BO) = μh(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σp(O) > σp(N)], whereas the standard deviations of the net inputs are equal for old and new items [σh(N) = σh(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions, because fewer nodes are active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero, and it increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes is approximately proportional to the percentage of active nodes [σp² = Pc(1 − Pc)/N ≈ Pc/N ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σp) is smaller for new than for old items [σp(AN) < σp(BN) < σp(BO) < σp(AO)], whereas the standard deviation of the net input is not [σh(AN) = σh(AO) < σh(BN) = σh(BO)]. The essential mechanism that produces these changes in the means and standard deviations is the nonlinearity introduced by counting the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because it improves the performance.
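The binomial expression behind Figure 5 can be tabulated directly (2N = 100, as in the figure):

```python
# Standard deviation of the fraction of active nodes: rises from zero,
# peaks at Pc = .5, and returns to zero at Pc = 1.
n = 100
for pc in (0.0, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0):
    print(pc, round((pc * (1 - pc) / n) ** 0.5, 4))
```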

To optimize the performance, subjects base their recognition decision on the number of active nodes, normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments when the responses are biased, but it plays no role for unbiased responses.


To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σh′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σh′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviations of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).
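As a sketch, the decision stage described here can be written as follows. The function paraphrases the text's description of Equation 5 (the exact published form is not reproduced), and the example values echo Pc(AO) = .57a and Pc(AN) = .43a with a = .1:

```python
def recognition_strength(p_active, a, sigma_item):
    # S = (Pc - a/2) / sigma_h', per the text's description of Equation 5
    return (p_active - a / 2) / sigma_item

def respond(p_active, a, sigma_item, criterion=0.0):
    # Yes response when S >= C, no response otherwise
    s = recognition_strength(p_active, a, sigma_item)
    return "yes" if s >= criterion else "no"

print(respond(p_active=0.057, a=0.1, sigma_item=1.25))   # old easy item -> yes
print(respond(p_active=0.043, a=0.1, sigma_item=1.25))   # new easy item -> no
```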

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, where C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) shifts the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) shifts the decision toward being more certain.

Figure 4D shows the density distribution of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult class (in Figure 4C) change when they are normalized by the variance of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class; thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high; the easy class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items because the low-frequency items have a smaller variance of net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment is conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution that is presented at the end of the paper, together with an analysis of optimal performance.

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.



The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a) plus one parameter for each class of words [the standard deviation of the net input, σh(·)]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs to the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs to the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = −.0012, μs(BN) = −.0008, μs(BO) = .0008, and μs(AO) = .0012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists

The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 for optimizing the performance. The distributions in Figure 6A are closer than the distributions in Figure 4D. Thus, decreasing the net inputs, for example, by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite directions, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain a near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.
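The effect of re-centering the activation threshold can be illustrated numerically. The sketch below is a hypothetical Python illustration, not the original simulation code: it evaluates the node-level activation probabilities (cf. Equation 6, presented later in the paper) for new and old items as the old net input grows, with the threshold kept midway between the new and the old expected net inputs. The widening gap corresponds to dispersion; the narrow gap at low net inputs corresponds to concentering.

```python
from scipy.stats import norm

def p_active(mu, sigma, T, a):
    # Probability that a node active at encoding is active again at test:
    # the chance that its net input exceeds the activation threshold T.
    return a * norm.sf(T, loc=mu, scale=sigma)

a, sigma = 0.10, 1.25
for mu_old in (0.5, 1.0, 2.0):        # weak, baseline, and strong encoding
    T = (0.0 + mu_old) / 2.0          # threshold midway between new and old
    print(mu_old, p_active(0.0, sigma, T, a), p_active(mu_old, sigma, T, a))
```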

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list. In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed to 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

Table 1
General Table of Results From the Experiment

Condition     AN     BN     BO     AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13    .17    .69    .82         .60              .86
Frequency    .20    .28    .80    .68        1.01              .66
Time         .10    .15    .78    .76         .89              .81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low (AN) and high (BN) frequencies and the hit rates for high (BO) and low (AO) frequencies, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength and the vertical axes the density.



List-Length Effect

Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy- and difficult-class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves

The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes), as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].
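This claim follows directly from the binomial variability of the proportion of active nodes (anticipating Equation 7); the short derivation below is added here for clarity:

$$\sigma_{\mathrm{prop}} = \sqrt{\frac{P_c\,(1-P_c)}{2N}}, \qquad \frac{d}{dP_c}\,P_c(1-P_c) = 1-2P_c > 0 \;\text{ for }\; P_c < \tfrac{1}{2},$$

so as long as the proportion of active nodes stays below 1/2, the larger Pc of old items implies a larger standard deviation than the smaller Pc of new items.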

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)] because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths themselves, become equal. Figure 6D plots the predictions of the theory when all standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = .0015, σs(BN) = .0012, σs(BO) = .0015, and σs(AO) = .0020. The ratios of these standard deviations must follow Equation 2. This is also the case: σs(BN)/σs(AO) = .61 < .74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = .78 < .94 = σs(AN)/σs(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = .016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy-class words than for difficult-class words. This prediction is supported by empirical data. Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN), P(BO, AN) < P(AO, AN), P(BN, AN) > .50, and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage for old high-frequency items when attention is focused on these items) must be larger than the effect from the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage for old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of net inputs (Figure 7A) that the density of recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75 rather than 0.50. The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 4D: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74].


Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs and the vertical axis the probability density of the net inputs. The expected value of the high-frequency class (BO) is shifted to the right because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength and the vertical axis the probability density.


Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs are not dependent on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = .0013, σs(BN) = .0011, σs(BO) = .0017, and σs(AO) = .0019. These results are similar to the results when the same amount of attention is paid to the two classes, as in Figure 4D: σs(AN) = .0015, σs(BN) = .0012, σs(BO) = .0015, and σs(AO) = .0020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit-rate distribution for the high-frequency words would be larger than that for the low-frequency words. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years (range, 18–29).

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4–5 times per million, and the high-frequency words an occurrence of 50–55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether the word was presented in the list or not. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition. Nine subjects were given the control condition followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The hit rate was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate for the low-frequency words was larger. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = .004, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = .004, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = .003, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = .003, p = .048 < .05], and not in the presentation time condition [t(11) = 1.5, MSe = .001, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = .002, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect, and the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1–5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slope of the linear regression curve between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly the slope for the false alarms [σs(BN)/σs(AN)], are shown in the last two columns of Table 1.
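The slope computation can be sketched in a few lines of Python. This illustrates the procedure just described; it is not the original analysis code, and the cumulative rates below are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def zroc_slope(rates_x, rates_y):
    # Regress the z-transformed rates of one item class on those of the
    # other across the confidence cutoffs; the slope estimates the ratio
    # of the underlying strength standard deviations.
    zx, zy = norm.ppf(rates_x), norm.ppf(rates_y)
    slope, _intercept = np.polyfit(zx, zy, 1)
    return slope

# Hypothetical cumulative hit rates at confidence cutoffs 1-5.
hits_low  = [0.95, 0.85, 0.70, 0.50, 0.30]   # low frequency (AO)
hits_high = [0.93, 0.80, 0.62, 0.42, 0.22]   # high frequency (BO)
print(zroc_slope(hits_high, hits_low))       # estimates sigma_s(BO)/sigma_s(AO)
```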

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

$$\sum_{i=1}^{8} \frac{(\mathrm{Observed}_i - \mathrm{Predicted}_i)^2}{\sigma_i^2}.$$

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were the number of features (N = 1,000) and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σh(A)], and the standard deviation of the net inputs for the difficult-class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.
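For concreteness, the quantity minimized in these fits can be written out in code. The following Python sketch is hypothetical and uses illustrative observed and predicted values; it is not the code used for the reported fits.

```python
import numpy as np

def fit_criterion(observed, predicted, sigmas):
    # Sum over the eight statistics (four yes-rates, four slope ratios)
    # of (observed - predicted)^2 / sigma^2; this is what is minimized.
    observed, predicted, sigmas = map(np.asarray, (observed, predicted, sigmas))
    return np.sum((observed - predicted) ** 2 / sigmas ** 2)

# Illustrative values only; the empirical sigmas are set to one, as in the text.
obs  = [0.20, 0.25, 0.70, 0.74, 0.61, 0.74, 0.78, 0.94]
pred = [0.21, 0.24, 0.71, 0.73, 0.60, 0.75, 0.77, 0.95]
print(fit_criterion(obs, pred, np.ones(8)))
```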

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slopes, whereas the attention-likelihood theory accounts for 84%. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slopes, as compared with the attention-likelihood theory when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted with a single variable, namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed to 0.10, and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), to 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


Figure 8B shows the corresponding results for the slopes. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, with a single parameter, the variance theory fits the slopes as well as the attention-likelihood theory does with three fitted parameters. The fit of the variance theory for the probabilities using one parameter is slightly worse than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for both the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed in terms of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability is dependent on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

$$P_c = \frac{a}{\sqrt{2\pi}\,\sigma_h} \int_T^{\infty} e^{-(h-\mu_h)^2/2\sigma_h^2}\, dh. \qquad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), gives the expected recognition strength (μs):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because this simplifies the analytic solution; the variance theory, and the simulation, use the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and the item layers. The distribution of Pc is binomial but can, for a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/2N]^(1/2). The final result is scaled by the normalization factor 1/σh:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c\,(1-P_c)}{2N}\right]^{1/2}. \qquad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion by integrating the density of the recognition strength over a normal distribution:

$$P(Y) = \frac{1}{\sqrt{2\pi}\,\sigma_s} \int_C^{\infty} e^{-(s-\mu_s)^2/2\sigma_s^2}\, ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

$$P(A, B) = \frac{1}{\sqrt{2\pi\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}} \int_C^{\infty} e^{-\left\{s-\left[\mu_s(A)-\mu_s(B)\right]\right\}^2 / 2\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}\, ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
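A rough Python counterpart of such a calculator is sketched below, chaining Equation 6, the expected strength, Equation 7, and P(Y). It is an illustrative sketch under the Figure 4D parameter assumptions, not the published calculator, and small numerical differences from the printed probabilities are to be expected.

```python
import numpy as np
from scipy.stats import norm

def predictions(mu_h, sigma_h, a=0.10, N=50, T=0.5, C=0.0):
    # Analytic predictions for one item class; 2N is the number of nodes.
    Pc = a * norm.sf(T, loc=mu_h, scale=sigma_h)           # Equation 6
    mu_s = (Pc - a / 2.0) / sigma_h                        # expected strength
    sigma_s = np.sqrt(Pc * (1 - Pc) / (2 * N)) / sigma_h   # Equation 7
    p_yes = norm.sf(C, loc=mu_s, scale=sigma_s)            # P(Y)
    return mu_s, sigma_s, p_yes

def forced_choice(mu_a, sig_a, mu_b, sig_b, C=0.0):
    # P(A, B): probability that the strength of A exceeds that of B,
    # evaluated at the recognition criterion C (C = 0 is unbiased).
    return norm.sf(C, loc=mu_a - mu_b,
                   scale=np.sqrt(sig_a ** 2 + sig_b ** 2))

for label, mu, sig in [("AN", 0, 1.25), ("BN", 0, 1.56),
                       ("BO", 1, 1.56), ("AO", 1, 1.25)]:
    print(label, predictions(mu, sig))
```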



Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is, in the model, calculated with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc to one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC. The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations:

$$\frac{\sigma_s^2(N)}{\sigma_s^2(O)} \approx \frac{P_c(N)}{P_c(O)}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.8²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way:

$$\frac{\sigma_s(BO)}{\sigma_s(AO)} \le \frac{\sigma_h(A)}{\sigma_h(B)} \approx \frac{\sigma_s(BN)}{\sigma_s(AN)}.$$

The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions.

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Examples of assumptions that yield good performance in the model are a low percentage of active nodes, setting the activation threshold between the old and the new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. Thus, optimal performance in the network requires the implementation suggested by the variance theory; if the implementation is changed significantly, performance is degraded, and the network will not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

$$d' = \frac{\mu_s(O) - \mu_s(N)}{\sigma_s(NO)} = \frac{P_c(O) - P_c(N)}{\left[P_c(NO)\left(1 - P_c(NO)\right)/2N\right]^{1/2}}. \qquad (8)$$

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strengths, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc(·) was calculated with Equation 6. The expected values of the net inputs and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).
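Equation 8 is straightforward to evaluate numerically. The sketch below uses hypothetical activation probabilities; the exact values in the text require the Appendix equations, which are not reproduced here. Note that the 1/σh factor shared by μs and σs cancels, so only the activation probabilities and N are needed.

```python
import numpy as np

def d_prime(Pc_new, Pc_old, N):
    # Equation 8: discriminability computed from the per-node activation
    # probabilities of new and old items; 2N is the total number of nodes.
    Pc_no = (Pc_new + Pc_old) / 2.0      # pooled activation probability
    return (Pc_old - Pc_new) / np.sqrt(Pc_no * (1 - Pc_no) / (2 * N))

print(d_prime(Pc_new=0.02, Pc_old=0.05, N=30))  # hypothetical values
```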

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).


The results show that d′ is optimal for a = .052; d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases: For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. For very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important: Errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity level for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible representations; if there are two nodes active in all the representations, there are approximately N² possible representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed. Thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs, 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was also calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, the optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
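Finding this intersection point is a small exercise in itself. The sketch below is an illustration with hypothetical parameter values, not anything from the paper: it solves f [S(N)] = f [S(O)] for two normal densities; with unequal variances, taking logarithms reduces the condition to a quadratic in s.

```python
import numpy as np

def optimal_criterion(mu_n, sig_n, mu_o, sig_o):
    # Criterion where the new and old strength densities intersect (L = 1).
    # Equating the two normal log-densities gives a quadratic in s.
    a = 1 / sig_n ** 2 - 1 / sig_o ** 2
    b = 2 * (mu_o / sig_o ** 2 - mu_n / sig_n ** 2)
    c = (mu_n / sig_n) ** 2 - (mu_o / sig_o) ** 2 - 2 * np.log(sig_o / sig_n)
    if abs(a) < 1e-12:                   # equal variances: the midpoint
        return (mu_n + mu_o) / 2.0
    roots = sorted(np.roots([a, b, c]).real)
    return next(r for r in roots if mu_n <= r <= mu_o)

print(optimal_criterion(0.0, 1.0, 1.5, 1.3))  # hypothetical values; ~0.87
```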

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. An optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.



This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the sum of the hit rates [P(AO) + P(BO)] is maximized? The solution to this problem is surprisingly simple. The optimal performance occurs when the maximum likelihoods of the two classes are equal:

$$L(A) = \frac{f[S(AO)]}{f[S(AN)]} = \frac{f[S(BO)]}{f[S(BN)]} = L(B).$$

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be raised to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the two false alarm rates must cancel [ΔP(AN) + ΔP(BN) = 0]. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, after such a change, L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), and the net change in hit rate is negative [ΔP(AO) + ΔP(BO) < 0]. This shows that moving the criteria away from the point where L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.

Note that the variance theory has only one overall recognition criterion. However, the theory, and any theory of the mirror effect, specifies how this criterion behaves when it is moved over the two classes. Thus, the second criterion can be inferred indirectly from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently for the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes, minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as being near optimal for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in class A is equal to the number of items in class B. The total hit rate is equal to the average hit rate of class A and class B, and the total false alarm rate is equal to the average false alarm rate of class A and class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately 0.05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around 0.01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only the nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition-strength variance should be smaller than the new recognition-strength variance, because Pc(1 − Pc)/N decreases for Pc over .50 (see the code sketch following this list). This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.
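
A quick numeric check of points 1 and 3, using the binomial expression for the standard deviation of a proportion of nodes:

```python
import numpy as np

# SD of a proportion of nodes, sqrt(Pc(1 - Pc)/N), here with N = 100.
pc = np.array([0.02, 0.05, 0.45, 0.55, 0.70])
print(np.sqrt(pc * (1 - pc) / 100))
# Below .5 the SD grows with Pc, so old items (more active nodes) spread
# more than new items, as the data require; above .5 it shrinks again,
# which is why scoring all nodes (Pc > .5) inverts the predicted ordering.
```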

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items because more nodes are active for old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple. The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new-frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on the parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables such as study time, repetition, and study instructions affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism operates for the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. These wrongly activated nodes are more likely to represent high-frequency features, because the deblurring process is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that of words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item (σ′h) is used, so the subject does not need to know the class, or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses a variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the standard deviation of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes that model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the subjective-likelihood model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be aσp(N)/[σp(N) + σp(O)]. Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around .44a if the slope of the z-ROC curve is .8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.
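
A one-line arithmetic check of the value quoted above, under this reconstruction of the expression:

```python
a = 0.1
slope = 0.8                       # sigma_p(N) / sigma_p(O), the z-ROC slope
print(a * slope / (1 + slope))    # 0.0444... ~ 0.44a, the value in the note
```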

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξi = ξj = 1:

μh(O) = Σj Δwij ξj = aN(1 − a)².   (A1)

The expected value of the net inputs for the new items is zero:

μh(N) = 0.   (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

σ²h(N) = PNa³(1 − a)².   (A3)
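
The reconstructed Equation A3 can be checked by direct simulation. The sketch below is mine, not code from the paper; it assumes the covariance (Hebbian) rule Δwij = (ξi − a)(ξj − a), summed over patterns, which the derivation above implies:

```python
import numpy as np

rng = np.random.default_rng(0)
N, a, P = 200, 0.1, 30
xi = (rng.random((P, N)) < a).astype(float)        # P stored binary patterns
w = (xi - a).T @ (xi - a)                          # assumed Hebbian weights
np.fill_diagonal(w, 0)                             # no self-connections
cues = (rng.random((2000, N)) < a).astype(float)   # new, uncorrelated items
h = cues @ w                                       # net inputs to every node
print(h.var(), P * N * a**3 * (1 - a)**2)          # the two should agree
```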

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item ( f ). All the aN weights supporting a context node contribute to the net input in the same direction:

σ²h( f ) = f²a⁴(1 − a)²N.   (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the context between different lists be uncorrelated. The variance of the net input to an item node is σ²h(L) = L²a⁴(1 − a)²N.

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where P patterns have been encoded, is

σ²h(O) = ( f² + L² )a⁴(1 − a)²N/2 + PNa³(1 − a)².   (A5)

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


If the net input is larger than the activation threshold (T ) and the node was active at encoding, the node is activated at recognition. Otherwise, the node is inactivated. The distributions of active nodes are plotted in Figure 4C.

A closer inspection of Figures 4A and 4B reveals that these densities, or distributions, predict the correct order of the mirror effect, given that the activation threshold is larger than the expected net inputs of the new items and less than the expected net inputs of the old items. Thus, the variance theory has the nice property of accounting for the mirror effect across a large range of parameter values for the activation threshold. Thus, P(AN) < P(BN) < P(BO) < P(AO) for μh(AN) = μh(BN) < T < μh(AO) = μh(BO). The variance theory predicts a material-based mirror effect because the variability of the net inputs is different for the easy- and the difficult-class items. The expected strengths of the net inputs are equal. The variability is lower for easy-class items, thus making the probability of false alarms (or the probability of active nodes) lower for the easy- than for the difficult-class items when the activation threshold is larger than the expected value of the new items. Similarly, the hit rate (or the probability of active nodes) for easy-class items is higher than the hit rate for difficult-class items when the activation threshold is less than the expected value of the old items.

The activation threshold is set to be between the expected values of the new and the old net inputs, so that performance is optimal. Therefore, the activation threshold is set to the average of the expected net inputs of the old and the new distributions for the difficult- and easy-class items, respectively:

T = (1/4)[μh(AN) + μh(BN) + μh(BO) + μh(AO)] = (1/4)[μh(BO) + μh(AO)] = (1/2)μh(O).

Thus, in the variance model, the activation threshold is fixed for recognition in one condition, although it may vary between different recognition conditions to optimize performance. The variance theory accounts for the strength-based mirror effect that occurs between lists or conditions with a shift in the activation threshold, which is necessary for keeping performance at an optimal level. As will be discussed later, this is true also when performance is measured by d′, and it is independent of the placement of the recognition criterion. Simply put, the model adapts the activation threshold on the basis of the overall difficulty of the test in order to maximize performance.
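
In code, the threshold rule is a one-liner; the function and parameter names below are illustrative, not the paper's:

```python
def activation_threshold(mu_old_A, mu_old_B, mu_new_A=0.0, mu_new_B=0.0):
    # T is the average of the four expected net inputs; with the new-item
    # means at zero, this is half the old-item mean.
    return (mu_new_A + mu_new_B + mu_old_A + mu_old_B) / 4

print(activation_threshold(1.0, 1.0))  # 0.5, the baseline condition
print(activation_threshold(2.0, 2.0))  # 1.0, e.g., after doubled study time
```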

In practice, subjects may initially make a preliminary estimate of the activation threshold, which may be adjusted as more information about the expected values of the distributions is gathered. The theory does not show a mirror effect if the activation threshold is lower than the expected value of the new items or larger than the expected value of the old items. Thus, setting the activation threshold as was suggested above is an important mechanism in the model. The activation threshold should not be confused with the subject's recognition criterion.

Figure 4C shows the density of the probability that a node is active at recognition when the activation threshold is set as defined above. Note how the means and standard deviations of the distributions of the net input (Figure 4A) change when the percentages of active nodes are calculated (Figure 4C). The expected probabilities of active nodes are arranged according to the mirror effect [Pc(AN) < Pc(BN) < Pc(BO) < Pc(AO)], whereas the expected values of the net inputs are not [μh(AN) = μh(BN) < μh(BO) = μh(AO)]. Furthermore, the standard deviation of the percentage of active nodes for old items is larger than that for new items [σp(O) > σp(N)], whereas the standard deviations of the net inputs are equal for old and new items [σh(N) = σh(O)].

The standard deviation of the recognition strength is smaller for the new distributions than for the old distributions because fewer nodes are active in the new distributions. The standard deviation of the percentage of active nodes at retrieval, as a function of the expected probability of nodes active at retrieval, is plotted in Figure 5. Obviously, the standard deviation of the percentage of active nodes is zero when the probability of active nodes is zero. This standard deviation increases as the probability of active nodes increases. For small probabilities of active nodes, the variance of active nodes is approximately proportional to the percentage of active nodes [σ²p = Pc(1 − Pc)/N ≈ Pc /N ∝ Pc]. The percentage of active nodes is, of course, larger for old than for new items. Thus, the variance theory predicts that the standard deviation of the percentage of active nodes (σp) is smaller for new than for old items [σp(AN) < σp(BN) < σp(BO) < σp(AO)], whereas the standard deviation of the net input is not [σh(AN) = σh(AO) < σh(BN) = σh(BO)]. The essential mechanism that produces these changes in the means and standard deviations is the nonlinearity introduced by counting the number of active nodes. Without this nonlinearity, these changes would not occur, and the model would not show appropriate ROC curves for old and new items.
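
Plugging the model's own activation probabilities (those derived in the Predictions section for Figure 4D) into this binomial expression shows the predicted ordering directly; a minimal sketch, assuming 2N = 100:

```python
import numpy as np

a, n = 0.1, 100   # activity level and total number of nodes (2N = 100)
pc = {"AN": 0.43 * a, "BN": 0.45 * a, "BO": 0.55 * a, "AO": 0.57 * a}
for label, p in pc.items():
    # sigma_p follows the binomial formula and orders AN < BN < BO < AO.
    print(label, np.sqrt(p * (1 - p) / n))
```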

Note that the standard deviation of active nodes decreases for probabilities larger than one half (see Figure 5; the standard deviation is, of course, zero when the probability of active nodes is one; see the mathematical analysis below). However, the probability of active nodes cannot exceed one half, because the activation threshold is set so that half of the nodes active during encoding are active at recognition. The probability of active nodes is typically set to a small value in the model, because this improves performance.

To optimize performance, subjects base their recognition decision on the number of active nodes, normalized by the standard deviation of the net inputs for the item. The normalization makes the judgments more conservative for difficult items. This plays an important role for confidence judgments, when the responses are biased, but plays no role for unbiased responses.


To calculate the recognition strength (S ) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σ′h).

Note that the standard deviation of the net inputs of the to-be-recognized item (σ′h) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviations of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C; i.e., if S ≥ C ). A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C ).
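
The decision rule can be sketched as a small Monte Carlo over node counts. This is an illustration under simplifying assumptions (normal net inputs; parameters from the Predictions section), not the paper's analytic solution, so the exact yes rates differ, although the mirror ordering P(AN) < P(BN) < P(BO) < P(AO) comes out:

```python
import numpy as np

rng = np.random.default_rng(1)
n_nodes, a, T = 100, 0.1, 0.5      # 2N nodes, activity level, threshold
mu = {"N": 0.0, "O": 1.0}          # expected net inputs, new vs. old items
sigma_h = {"A": 1.25, "B": 1.56}   # easy vs. difficult class

def strength(cls, status):
    h = rng.normal(mu[status], sigma_h[cls], int(a * n_nodes))
    p_c = (h > T).mean() * a                # fraction of all nodes active
    return (p_c - a / 2) / h.std(ddof=1)    # S = (Pc - a/2) / sigma'_h

for cond in ("AN", "BN", "BO", "AO"):
    s = np.array([strength(cond[0], cond[1]) for _ in range(20000)])
    print(cond, (s >= 0).mean())            # unbiased yes rate (C = 0)
```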

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, with the criterion C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) pushes the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) pushes the decision toward greater certainty.

Figure 4D shows the density distribution of the recognition strength. Note how the standard deviations of the active nodes for the easy versus the difficult class (in Figure 4C) change when they are normalized by the standard deviation of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult-class items yield a small standard deviation of the recognition strength because the standard deviation of their net inputs is high. The easy-class items yield a large standard deviation of the recognition strength because the standard deviation of their net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here, it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment was conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution that is presented at the end of the paper, together with an analysis of optimal performance.

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.

The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a) plus one parameter for each class of words [the standard deviation of the net input, σh]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs of the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs of the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = −0.012, μs(BN) = −0.008, μs(BO) = 0.008, and μs(AO) = 0.012. Figure 4D plots the four recognition-strength densities (the distributions are assumed to be normal) using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists

The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize performance. This change in the activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, performance in terms of percentage correct is optimal if the recognition criterion is set to its optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep, or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize performance. The distributions in Figure 6A are closer together than the distributions in Figure 4D. Thus, decreasing the net inputs, for example by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite directions, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list. In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, and positive value yields the mirror effect for any value of the expected net input. For example,

assume that the activation threshold is fixed at 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or any value above 0.4. The predictions for the new distributions do not change with these changes in the net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates.

Table 1
General Table of Results From the Experiment

Condition    AN    BN    BO    AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13   .17   .69   .82        0.60             0.86
Frequency    .20   .28   .80   .68        1.01             0.66
Time         .10   .15   .78   .76        0.89             0.81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low-frequency (AN) and high-frequency (BN) words, the hit rates for high-frequency (BO) and low-frequency (AO) words, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength, and the vertical axes the density.

In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

List-Length Effect

Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy- and difficult-class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.
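
A sketch of the direction of this effect, assuming only the linear variance growth stated above (the slope k, the threshold, and the means are illustrative values, not the paper's):

```python
import numpy as np
from scipy.stats import norm

def rates(L, k=0.05, T=0.5, mu_old=1.0, base_var=1.0):
    # Net-input variance grows linearly with list length L; the threshold
    # and the means stay fixed, so hits fall and false alarms rise.
    s = np.sqrt(base_var + k * L)
    hit = norm.sf(T, loc=mu_old, scale=s)   # P(net input > T | old)
    fa = norm.sf(T, loc=0.0, scale=s)       # P(net input > T | new)
    return hit, fa

for L in (10, 40):
    print(L, rates(L))   # the longer list has the lower hit, higher FA
```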

ROC Curves

The percentage of nodes active at recognition is lower for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths themselves, become equal. Figure 6D plots the predictions of the theory when all the standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020. The ratios of these standard deviations must follow Equation 2. This is also the case, with σs(BN)/σs(AO) = .61 < .74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = .78 < .94 = σs(AN)/σs(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C ). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C ) will not yield the mirror effect. For example, C = 0.016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy-class words than for difficult-class words. This prediction is supported by empirical data. Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN); P(BO, AN) < P(AO, AN); P(BN, AN) > .50; and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.
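
Both the criterion-shift values and the forced-choice predictions follow from the normal strength distributions alone. The sketch below reproduces the values quoted above to within rounding of the reported means and standard deviations:

```python
from math import sqrt
from scipy.stats import norm

# Recognition-strength distributions from Figure 4D (new-item means
# negative, as the mirror ordering implies).
mu  = {"AN": -0.012, "BN": -0.008, "BO": 0.008, "AO": 0.012}
sig = {"AN": 0.015,  "BN": 0.012,  "BO": 0.015, "AO": 0.020}

def p_yes(cond, C=0.0):
    return norm.sf(C, mu[cond], sig[cond])     # P(S >= C)

def forced_choice(old, new):
    d = mu[old] - mu[new]                      # P(S_old > S_new)
    return norm.cdf(d / sqrt(sig[old] ** 2 + sig[new] ** 2))

print([round(p_yes(k), 2) for k in mu])            # ~ .21 .25 .70 .73
print([round(p_yes(k, 0.016), 2) for k in mu])     # ~ .03 .02 .30 .42
print(round(forced_choice("BO", "BN"), 2),
      round(forced_choice("AO", "AN"), 2))         # ~ .80 .83
```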

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for the high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time, or a larger presentation frequency, increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage to old high-frequency items when attention is focused on these items) must be larger than the effect of the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage to old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of the net inputs (Figure 7A) that the density of the recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75 rather than 0.50.


Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the high-frequency class (BO) is shifted to the right because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 6B: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74].

Furthermore, in the variance theory, the ratios of the recognition-strength standard deviations for high- and low-frequency items depend mainly on the standard deviations of the net inputs. The standard deviations of the net inputs do not depend on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = 0.013, σs(BN) = 0.011, σs(BO) = 0.017, and σs(AO) = 0.019. These results are similar to the results when the same amount of attention is paid to the two classes, as in Figure 4D: σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that, if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect in which the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves. The standard deviation of the hit-rate distribution for the high-frequency words would be larger than that for the low-frequency words. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated in the experiment. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years, ranging from 18 to 29 years.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4–5 times per million, and the high-frequency words an occurrence of 50–55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1 and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether the word was presented in the list or not. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition was randomized for each subject.
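For concreteness, the three conditions can be summarized in a small configuration sketch in Python. The names and structure below are mine, for illustration only, and were not part of the original experimental software.

```python
# Study-list parameters per condition (a sketch; names are hypothetical).
# Low-frequency words are always shown once for 1 sec in every condition.
conditions = {
    "control":                {"hf_presentations": 1, "hf_duration_sec": 1},
    "presentation_frequency": {"hf_presentations": 2, "hf_duration_sec": 1},
    "presentation_time":      {"hf_presentations": 1, "hf_duration_sec": 3},
}
lf_parameters = {"presentations": 1, "duration_sec": 1}
```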

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition. Nine subjects were given the control condition followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results
The results from the experiment are presented in the first three rows of Table 1. The probability for hit rates was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the probability for hit rates for the low-frequency condition was larger. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].
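For readers who wish to reproduce this kind of analysis, the comparison can be sketched as a one-tailed paired t test in Python. The data below are hypothetical; scipy's ttest_rel returns a two-tailed p value, which is halved for the directional test.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-subject hit rates for the two frequency classes.
hits_hf = np.array([0.80, 0.73, 0.67, 0.71, 0.69, 0.75])
hits_lf = np.array([0.70, 0.68, 0.66, 0.65, 0.64, 0.71])

t, p_two_tailed = ttest_rel(hits_hf, hits_lf)
p_one_tailed = p_two_tailed / 2 if t > 0 else 1 - p_two_tailed / 2
print(t, p_one_tailed)
```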

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words; thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1–5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly for the slope of the false alarms [σs(BN)/σs(AN)], are shown in the last two rows of Table 1.
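The slope computation can be sketched as follows, assuming hypothetical cumulative yes-rates at each confidence criterion. The slope of the regression of one class's z-transformed rates on the other's estimates a ratio of the underlying standard deviations.

```python
import numpy as np
from scipy.stats import norm, linregress

# Cumulative yes-rates at confidence criteria 5 through 2 (hypothetical).
rates_hf = np.array([0.25, 0.45, 0.62, 0.76])   # class A (high frequency)
rates_lf = np.array([0.30, 0.52, 0.68, 0.80])   # class B (low frequency)

# Regress z-transformed low-frequency rates on z-transformed
# high-frequency rates; the slope estimates the standard deviation ratio.
slope = linregress(norm.ppf(rates_hf), norm.ppf(rates_lf)).slope
print(slope)
```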

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance—that is,

\[ \sum_{i=1}^{8} \frac{\left( \mathrm{Observed}_i - \mathrm{Predicted}_i \right)^2}{\sigma_i^2}. \]

Three parameters were fitted—namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were the number of features (N = 1000) and the recognition criterion [ln(L) = 0].
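The fitting criterion itself is easy to state in code. A minimal sketch, with hypothetical observed values and unit variances:

```python
import numpy as np

def weighted_sse(predicted, observed, sigma):
    # Squared deviations divided by the observed variances, summed over
    # the eight data points (four probabilities and four slope ratios).
    return np.sum((observed - predicted) ** 2 / sigma ** 2)

observed = np.array([0.75, 0.60, 0.25, 0.35, 0.80, 0.85, 0.95, 0.90])
predicted = observed + 0.02          # stand-in for model output
print(weighted_sse(predicted, observed, sigma=np.ones(8)))
```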

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σh(A)], and the standard deviation of the net inputs for the difficult-class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.

The results show that the variance theory accounts for .98 (r²) of the variance for the probabilities; the attention-likelihood theory fits equally well. The variance theory accounts for .93 of the variance for the slope, whereas the attention-likelihood theory accounts for .84 of the variance for the slope. Thus, the variance theory accounted for the same amount of variance for the probabilities and more variance for the slope, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was also fitted with a single variable—namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed at 0.10 and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), at 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities. Figure 8B shows the corresponding results for the slope. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes with a single parameter equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities with one parameter is slightly less than the fit of the attention-likelihood theory with three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters—namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)]—can also be expressed in terms of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions—namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

\[ P_c = \frac{a}{\sqrt{2\pi}\,\sigma_h} \int_T^{\infty} e^{-(h - \mu_h)^2 / 2\sigma_h^2}\, dh. \qquad (6) \]

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), gives the expected recognition strength (μs):

\[ \mu_s = \frac{P_c - a/2}{\sigma_h}. \]

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because this simplifies the analytic solution; the variance theory and the simulation, however, use the standard deviation of the item. This approximation is good when there is a large number of features; for a small number of features, however, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and item layers. The distribution of Pc is binomial but can, when a criterion is met [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/2N]^{1/2}. The final result is scaled by the normalization factor 1/σh:

\[ \sigma_s = \frac{1}{\sigma_h} \left[ \frac{P_c (1 - P_c)}{2N} \right]^{1/2}. \qquad (7) \]

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion by integrating the density of the recognition strength over a normal distribution:

\[ P(\mathrm{Y}) = \frac{1}{\sqrt{2\pi}\,\sigma_s} \int_C^{\infty} e^{-(s - \mu_s)^2 / 2\sigma_s^2}\, ds. \]

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

\[ P(\mathrm{A}, \mathrm{B}) = \frac{1}{\sqrt{2\pi \left[ \sigma_s^2(\mathrm{A}) + \sigma_s^2(\mathrm{B}) \right]}} \int_0^{\infty} \exp\!\left( - \frac{\left\{ s - \left[ \mu_s(\mathrm{A}) - \mu_s(\mathrm{B}) \right] \right\}^2}{2 \left[ \sigma_s^2(\mathrm{A}) + \sigma_s^2(\mathrm{B}) \right]} \right) ds. \]

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
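The analytic solution is also compact enough to state in a few lines of code. The following Python sketch implements Equations 6 and 7 and the yes-probability; the function and parameter names, and the example values [σh(A) = 1.25 and σh(B) = 1.56, in the regime of Figure 4D], are mine.

```python
import numpy as np
from scipy.stats import norm

def analytic_predictions(a, sigma_h, mu_old, N, C):
    """Sketch of the analytic solution. New items have expected net
    input 0 and old items mu_old; the activation threshold T is set
    midway between the two, as the theory prescribes."""
    T = mu_old / 2.0

    def p_active(mu_h):                                   # Equation 6
        return a * (1.0 - norm.cdf((T - mu_h) / sigma_h))

    def strength(mu_h):
        pc = p_active(mu_h)
        mu_s = (pc - a / 2.0) / sigma_h                   # expected strength
        sd_s = np.sqrt(pc * (1.0 - pc) / (2.0 * N)) / sigma_h   # Equation 7
        return mu_s, sd_s

    mu_n, sd_n = strength(0.0)
    mu_o, sd_o = strength(mu_old)
    p_fa = 1.0 - norm.cdf((C - mu_n) / sd_n)              # P(yes | new)
    p_hit = 1.0 - norm.cdf((C - mu_o) / sd_o)             # P(yes | old)
    return p_hit, p_fa

# Easy class: sigma_h = 1.25; difficult class: sigma_h = 1.56 (ratio 1.25).
print(analytic_predictions(a=0.10, sigma_h=1.25, mu_old=1.0, N=50, C=0.0))
print(analytic_predictions(a=0.10, sigma_h=1.56, mu_old=1.0, N=50, C=0.0))
```

With these illustrative values, the sketch reproduces the mirror pattern: the easy class yields a higher hit rate and a lower false alarm rate than the difficult class.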


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is calculated in the model with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc with one, the variance of the recognition strength can be simplified to

\[ \sigma_s^2 = \frac{P_c}{2N \sigma_h^2}. \]

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is then proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

\[ \frac{\sigma_s^2(\mathrm{N})}{\sigma_s^2(\mathrm{O})} \approx \frac{P_c(\mathrm{N})}{P_c(\mathrm{O})}. \]

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.80²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and the easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions:

\[ \frac{\sigma_s(\mathrm{BO})}{\sigma_s(\mathrm{AO})} \le \frac{\sigma_s(\mathrm{BN})}{\sigma_s(\mathrm{AN})} \approx \frac{\sigma_h(\mathrm{A})}{\sigma_h(\mathrm{B})}. \]

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield good performance in the model are a low percentage of active nodes, setting the activation threshold between the old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

\[ d' = \frac{\mu_s(\mathrm{O}) - \mu_s(\mathrm{N})}{\sigma_s(\mathrm{NO})} = \frac{P_c(\mathrm{O}) - P_c(\mathrm{N})}{\left[ P_c(\mathrm{NO}) \left( 1 - P_c(\mathrm{NO}) \right) / 2N \right]^{1/2}}. \qquad (8) \]

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc( ) was calculated with Equation 6. The expected values of the net inputs and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).
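In code, Equation 8 is a one-liner; note that the normalization factor 1/σh cancels between the numerator and the denominator. A sketch, with hypothetical proportions:

```python
import numpy as np

def d_prime(pc_old, pc_new, N):
    # Equation 8: the sigma_h normalization cancels, so d' depends only
    # on the proportions of active nodes and the pooled Pc(NO).
    pc_no = (pc_old + pc_new) / 2.0
    return (pc_old - pc_new) / np.sqrt(pc_no * (1.0 - pc_no) / (2.0 * N))

print(d_prime(pc_old=0.065, pc_new=0.035, N=30))
```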

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).


The results show that d′ is optimal for a = 0.52; d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases: For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important, and errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity level for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible representations; if there are two nodes active in all the representations, there are approximately N² possible representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. Surprisingly, however, the placement of this criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold of the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = 0.52. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = 0.52). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may

deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).
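The threshold optimization can be sketched numerically. The following snippet fixes a and searches for the T that maximizes the d′ of Equation 8, with Pc computed from Equation 6. Here σh is taken as a given, hypothetical value, whereas in the full model the net-input variance itself depends on a and on the stored patterns.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

a, mu_old, sigma_h, N = 0.52, 1.42, 0.5, 30   # sigma_h is hypothetical

def neg_d_prime(T):
    pc_old = a * (1.0 - norm.cdf((T - mu_old) / sigma_h))  # Equation 6
    pc_new = a * (1.0 - norm.cdf(T / sigma_h))
    pc_no = (pc_old + pc_new) / 2.0
    return -(pc_old - pc_new) / np.sqrt(pc_no * (1.0 - pc_no) / (2.0 * N))

T_opt = minimize_scalar(neg_d_prime, bounds=(0.0, mu_old), method="bounded").x
print(T_opt)   # lands slightly above the midpoint (0.71) for these values
```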

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect but also optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed; therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was also calculated with the information in all the nodes used. This is done by counting the number of nodes that are retrieved in the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated with only active nodes (shown by the solid line). The results show that the highest d′ is found when the decision is based only on active nodes and a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit and false alarm rates. Therefore, it is necessary to use another measure of performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smaller standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true: If the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers; if the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
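The intersection point can be found numerically. A minimal sketch with hypothetical distribution parameters; when the old distribution has the larger standard deviation, the relevant intersection lies between the two means.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def optimal_criterion(mu_n, sd_n, mu_o, sd_o):
    """Criterion at which the old and new densities intersect, i.e.,
    where the likelihood ratio L = f[S(O)] / f[S(N)] equals one."""
    diff = lambda c: norm.pdf(c, mu_o, sd_o) - norm.pdf(c, mu_n, sd_n)
    return brentq(diff, mu_n, mu_o)   # root between the two means

print(optimal_criterion(mu_n=0.0, sd_n=0.8, mu_o=1.0, sd_o=1.0))
```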

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. Optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: Optimal performance occurs when the maximum likelihoods of the two classes are equal,

\[ L(\mathrm{A}) = \frac{f[S(\mathrm{AO})]}{f[S(\mathrm{AN})]} = \frac{f[S(\mathrm{BO})]}{f[S(\mathrm{BN})]} = L(\mathrm{B}). \]

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be raised to keep the total false alarm rate constant. According to the formulation of the problem, the resulting changes in the two false alarm rates must cancel, so that the change in the total false alarm rate is zero. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), and the gain in hits for Class A is smaller than the loss in hits for Class B. This shows that a change in the placement of the criteria away from L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion changes when it is moved across the two classes. Thus, the second criterion can be inferred indirectly from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently for the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A as the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the straight dotted line. The results before normalization (i.e., counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B; therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A; therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance, even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as near optimal for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56), in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate in Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength were based on all the nodes (i.e., not only nodes active during encoding)—for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength would have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/(2N) decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data; if the model is made nonoptimal, it does not. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation for the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs for the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations: The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to the high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental, learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item–context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism applies to the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen, in the variance theory, as a difficult class with a higher variability than that for words. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from the concrete class are more easily discriminated than words from the abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. One possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions—for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes–no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σh′), so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses a variability that depends on frequency. However, the variance theory simulates the increase in variance at each presentation of a feature in a different context, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for the z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (but see McClelland & Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kisok, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention–likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM—retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University. (ISBN 91-7191-155-3)

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new items [σp(N)] and old items [σp(O)] are equal. However, if the standard deviations are different, the expected number of active nodes will be

aσp(N)/[σp(N) + σp(O)].

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around .44a if the slope of the ROC curve is .8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξi = ξj = 1:

μh(O) = Σj Δwij ξj = Σj (ξi − a)(ξj − a)ξj = a(1 − a)²N. (A1)

The expected value of the net inputs for the new items is zero:

μh(N) = 0. (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let p represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

σh²(N) = Σp Σj (Δwij ξj)² = paN[a(1 − a)]² = a³(1 − a)²pN. (A3)

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction, so each prior occurrence of the item adds a coherent term of magnitude aN(1 − a) whose sign depends on the stored context state:

σh²(f) = fa(1 − a)[aN(1 − a)]² = a³(1 − a)³fN². (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

σh²(L) = a³(1 − a)³LN².

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L, in a network where p patterns have been encoded, is

σh²(O) = a³(1 − a)³(f + L)N²/2 + a³(1 − a)²pN. (A5)
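The expressions above can be sanity-checked by simulation. The following is a minimal Monte Carlo sketch (mine, not the article's code), written under the assumption that the stored weight change for a pattern ξ is Δwij = (ξi − a)(ξj − a), as in Equation A1, with self-connections excluded:

import numpy as np

rng = np.random.default_rng(0)
N, a, p, runs = 200, 0.1, 20, 200                    # nodes per layer, activity, patterns, runs
h_old, h_new = [], []
for _ in range(runs):
    xi = (rng.random((p, N)) < a).astype(float)      # p uncorrelated binary patterns
    W = (xi - a).T @ (xi - a)                        # summed Hebbian weight changes
    np.fill_diagonal(W, 0.0)                         # no self-connections (an assumption)
    cue_new = (rng.random(N) < a).astype(float)      # an unstudied (new) cue
    h_old.extend((W @ xi[0])[xi[0] == 1])            # net inputs for an old cue, active nodes
    h_new.extend((W @ cue_new)[cue_new == 1])        # net inputs for a new cue, active nodes
print(np.mean(h_old), a * N * (1 - a) ** 2)          # simulated mean vs. Equation A1
print(np.var(h_new), a ** 3 * (1 - a) ** 2 * p * N)  # simulated variance vs. Equation A3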

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)



To calculate the recognition strength (S) in Equation 5, the expected percentage of active nodes is subtracted from the percentage of nodes active at recognition (Pc). This is necessary for the normalization to work properly. Owing to the placement of the activation threshold, the expected percentage of active nodes at recognition is half of the expected percentage of nodes active at encoding (a/2; see note 1). This is a constant, independent of item class, new or old item, and test difficulty. The result is divided by the standard deviation of the net inputs associated with nodes active at encoding (σh′).

Note that the standard deviation of the net inputs of the to-be-recognized item (σh′) varies on an item-to-item basis around the standard deviation of the net inputs of all items in the class (σh). This fluctuation may be so large that it is not possible to accurately sort the words into classes on the basis of the standard deviation of the items; however, there is no need for the subject to make such a classification in the variance model. The subjects do not need to know the true standard deviation of the net inputs in the class. A yes response occurs if the recognition strength is larger than or equal to the subject's recognition criterion (C), that is, if S ≥ C. A no response occurs if the recognition strength is less than the subject's recognition criterion (S < C).

The standard deviation of the net inputs does not affect the probability of a yes response when the recognition criterion is unbiased (C = 0). In this special case, the recognition strength can be simplified to S = Pc, where C = a/2. The standard deviation of the net inputs in Equation 5 affects the probability of a yes response when the recognition criterion is biased (C ≠ 0). Thus, the standard deviation of the net inputs in Equation 5 may be interpreted as a scaling factor that influences the confidence measurement (but not the unbiased recognition measurements). A large standard deviation of the net input for an item (correlated with difficulty) influences the decision toward uncertainty, whereas a small standard deviation of the net input for an item (correlated with less difficulty) influences the decision to be more certain.
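A small sketch may make the decision rule concrete (the values and function names are illustrative, not from the article):

def recognition_strength(Pc, a, sigma_h_item):
    # S = (Pc - a/2) / sigma_h', the normalized recognition strength
    return (Pc - a / 2.0) / sigma_h_item

def respond(S, C=0.0):
    # yes if the strength reaches the recognition criterion
    return "yes" if S >= C else "no"

S = recognition_strength(Pc=0.066, a=0.10, sigma_h_item=1.25)  # an old, easy-class item
print(S, respond(S))   # positive strength, so "yes" under an unbiased criterion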

Figure 4D shows the density distributions of the recognition strength. Note how the standard deviations of the active nodes for the easy class versus the difficult class (in Figure 4C) change when they are normalized by the variance of the net input (in Figure 4D). The normalization factor makes the standard deviation of the recognition strength of the difficult class smaller than that of the easy class. Thus, the standard deviation of the recognition strength is proportional to the inverse of the standard deviation of the net input. The difficult-class items yield a small standard deviation of the recognition strength, because the standard deviation of the net inputs is high. The easy-class items yield a large standard deviation of the recognition strength, because the standard deviation of the net inputs is small. The ordering of the means of the distributions is unaffected by the normalization, and the normalization does not change the fact that the old distributions have a larger standard deviation than do the new distributions.

PREDICTIONS

This section describes the predictions of the variance theory. We have just seen that the variance theory predicts a material-based mirror effect for high- and low-frequency items, because the low-frequency items have a smaller variance of net inputs. This yields lower false alarm rates and higher hit rates for the easy than for the difficult class when the activation threshold is set between the new and the old distributions. Here, it is shown how the model accounts for other effects, such as the strength-based mirror effect between lists, list-length effects, and the shift in the response criterion. Most important, the variance theory makes predictions regarding the strength-based mirror effect within lists that differ from the predictions of the attention-likelihood theory. An experiment is conducted to test these predictions. Comparative model fitting was also conducted to compare the variance theory with the attention-likelihood theory. The predictions of the theory are based on an analytic solution

Figure 5. The standard deviation of active nodes as a function of the expected probability that the nodes are active. The standard deviation is calculated with 2N = 100.


that is presented at the end of the paper, together with an analysis of optimal performance.

The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytic results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a) plus one parameter for each class of words [the standard deviation of the net input, σh()]. The following parameters are used here: The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .10. The standard deviation of the net inputs of the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs of the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, and Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = −0.012, μs(BN) = −0.008, μs(BO) = 0.008, and μs(AO) = 0.012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists

The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.
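This mechanism is easy to see numerically. The sketch below (my illustration, assuming unit-variance Gaussian net inputs; it shows per-node activation probabilities, not full hit and false alarm rates) keeps the threshold at half the old net input while study strength grows:

from scipy.stats import norm

for mu_old in (0.5, 1.0, 2.0):       # expected old net input grows with study time
    T = mu_old / 2.0                 # threshold midway between new (0) and old
    p_old = norm.sf(T, loc=mu_old)   # P(node active | old item)
    p_new = norm.sf(T, loc=0.0)      # P(node active | new item)
    print(mu_old, round(p_old, 3), round(p_new, 3))
# p_old rises while p_new falls: the distributions move apart (dispersion)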

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize performance. The distributions in Figure 6A are closer than the distributions in Figure 4D. Thus, decreasing the net inputs, for example, by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite directions, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain a near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list.


In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed at 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over

Table 1
General Table of Results From the Experiment

Condition    AN    BN    BO    AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13   .17   .69   .82        0.60             0.86
Frequency    .20   .28   .80   .68        1.01             0.66
Time         .10   .15   .78   .76        0.89             0.81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low (AN) and high (BN) frequencies and the hit rates for high (BO) and low (AO) frequencies, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength and the vertical axes the density.


the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

List-Length Effect

Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy- and difficult-class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves

The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths, become equal. Figure 6D plots the predictions of the theory when all standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020. The ratios of these standard deviations must follow Equation 2. This is also the case, with σs(BN)/σs(AO) = 0.61 < 0.74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = 0.78 < 0.94 = σs(AN)/σs(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = 0.016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy-class words than for difficult-class words. This prediction is supported by empirical data: Greene (1996) asked subjects to respond yes only if they were sure of their response, and, consistent with the prediction, no mirror effect was found.
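Both cases can be reproduced directly from the four strength distributions of Figure 4D, using the rounded means and standard deviations quoted above (a sketch; the resulting rates agree with the text to within rounding):

from scipy.stats import norm

mu = {"AN": -0.012, "BN": -0.008, "BO": 0.008, "AO": 0.012}
sd = {"AN": 0.015, "BN": 0.012, "BO": 0.015, "AO": 0.020}
for C in (0.0, 0.016):   # unbiased versus conservative recognition criterion
    p_yes = {k: round(norm.sf(C, mu[k], sd[k]), 2) for k in mu}
    print(C, p_yes)
# C = 0 gives the mirror order (about .21 < .25 < .70 < .73 with these rounded values);
# C = 0.016 reverses the false alarm order (AN > BN), so the mirror effect disappears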

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN), P(BO, AN) < P(AO, AN), P(BN, AN) > .50, and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN), P(BO, AN) = .83 < .84 = P(AO, AN), P(BN, AN) = .59 > .50, and P(AO, BO) = .57 > .50.
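Assuming independent normal strengths, each forced-choice probability is Φ of the mean difference divided by the root sum of the variances. A sketch (with the rounded values above, the results agree with the text to within about .01):

import math
from scipy.stats import norm

def p_choose(mu1, sd1, mu2, sd2):
    # probability that the strength of the first item exceeds that of the second
    return norm.cdf((mu1 - mu2) / math.hypot(sd1, sd2))

print(round(p_choose(0.008, 0.015, -0.008, 0.012), 2))   # P(BO, BN): about .80 (text: .79)
print(round(p_choose(0.012, 0.020, 0.008, 0.015), 2))    # P(AO, BO): about .56 (text: .57)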

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer


presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage for old high-frequency items when attention is focused on these items) must be larger than the effect from the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage for old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of net inputs (Figure 7A) that the density of recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes


Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the old high-frequency distribution (BO) is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

0.75, rather than 0.50. The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with the equal-attention case in Figure 4D: P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74].

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs are not dependent on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = 0.013, σs(BN) = 0.011, σs(BO) = 0.017, and σs(AO) = 0.019. These results are similar to the results when the same amount of attention is paid to the two classes, in Figure 4D: σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate distribution for the high-frequency words would be larger than that for the low-frequency words. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20, ranging from 18 to 29 years old.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether the word was presented in the list or not. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition, followed by the presentation time condition. Nine subjects were given the control condition, followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The hit rate



was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate for the low-frequency words was larger. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1-5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly the slopes for the false alarms [σs(BN)/σs(AN)], are shown in the last two columns of Table 1.
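A sketch of this slope computation (the cumulative rates below are synthetic, for illustration only; a real analysis would use the observed confidence distributions):

import numpy as np
from scipy.stats import norm

# cumulative P(confidence >= k), k = 5, 4, 3, 2, for the two old distributions
hit_low  = np.array([0.15, 0.35, 0.55, 0.75])   # low-frequency hits (made up)
hit_high = np.array([0.10, 0.28, 0.48, 0.70])   # high-frequency hits (made up)
z_low, z_high = norm.ppf(hit_low), norm.ppf(hit_high)
slope = np.polyfit(z_high, z_low, 1)[0]         # regress z(low) on z(high)
print(round(slope, 2))                          # estimates sigma_s(BO)/sigma_s(AO)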

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is also fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory, by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

Σi (Observedi − Predictedi)²/σi², i = 1, …, 8.

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were N = 1000 and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of



nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σh(A)], and the standard deviation of the net inputs for the difficult-class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.
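In outline, the fit can be sketched as follows. This is a hypothetical reimplementation, not the author's code; the observed vector contains the model's own Figure 4D values as stand-ins for the real data, and T = 0.5, 2N = 100, and C = 0 are assumed:

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def model(theta, T=0.5, twoN=100):
    a, shA, shB = theta
    stats = {}
    for name, (sh, mu) in {"AN": (shA, 0.0), "BN": (shB, 0.0),
                           "BO": (shB, 1.0), "AO": (shA, 1.0)}.items():
        Pc = a * norm.sf(T, mu, sh)                  # Equation 6
        mu_s = (Pc - a / 2.0) / sh                   # expected strength
        sd_s = np.sqrt(Pc * (1.0 - Pc) / twoN) / sh  # Equation 7
        stats[name] = (mu_s, sd_s)
    p_yes = [norm.sf(0.0, *stats[k]) for k in ("AN", "BN", "BO", "AO")]
    slopes = [stats["BN"][1] / stats["AO"][1], stats["AN"][1] / stats["AO"][1],
              stats["BN"][1] / stats["BO"][1], stats["AN"][1] / stats["BO"][1]]
    return np.array(p_yes + slopes)

# stand-in "data": the model's Figure 4D predictions, NOT Glanzer et al.'s values
observed = np.array([.20, .25, .70, .74, .61, .74, .78, .94])
loss = lambda theta: np.sum((observed - model(theta)) ** 2)   # sigma_i = 1
fit = minimize(loss, x0=[0.10, 1.25, 1.56], method="Nelder-Mead")
print(fit.x, loss(fit.x))   # starting near the Figure 4D values keeps the search valid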

The results show that the variance theory accounts for .98 (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for .93 of the variance for the slopes; the attention-likelihood theory accounts for .84 of the variance of the slopes. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slopes, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted by a single variable, namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed at .10, and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), at 1.25 (these values were the average of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of observed

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


probabilities. Figure 8B shows the corresponding results for the slopes. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes, using a single parameter, equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities, using one parameter, is slightly less than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed as functions of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability is dependent on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

Pc = a ∫T^∞ [1/(σh√(2π))] exp[−(h − μh)²/(2σh²)] dh. (6)
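Numerically, Equation 6 is a scaled normal tail probability. A sketch with the Figure 4D parameter values:

from scipy.stats import norm

def p_active(a, T, mu_h, sigma_h):
    # Equation 6: fraction of encoding-active nodes whose net input exceeds T
    return a * norm.sf(T, loc=mu_h, scale=sigma_h)

print(p_active(0.10, 0.5, 1.0, 1.25))   # old item: about .066
print(p_active(0.10, 0.5, 0.0, 1.25))   # new item: about .034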

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), gives the expected recognition strength (μs):

μs = (Pc − a/2)/σh.

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because it simplifies the analytic solution; the variance theory and the simulations use the standard deviation of the item. This approximation is good when there is a large number of features; for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and item layers. The distribution of Pc is binomial, but it can, for a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/(2N)]^(1/2). The final result is scaled by the normalization factor 1/σh:

σs = (1/σh)[Pc(1 − Pc)/(2N)]^(1/2). (7)

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

P(Y) = ∫C^∞ [1/(σs√(2π))] exp[−(s − μs)²/(2σs²)] ds.

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

P(A, B) = ∫0^∞ [1/√(2π[σs²(A) + σs²(B)])] exp(−{s − [μs(A) − μs(B)]}²/(2[σs²(A) + σs²(B)])) ds.
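Chaining these equations together gives yes rates and forced-choice rates directly from the raw parameters. A sketch (parameter values from Figure 4D; small rounding differences from the quoted values are expected):

import math
from scipy.stats import norm

def strength_stats(a, T, mu_h, sigma_h, twoN):
    Pc = a * norm.sf(T, mu_h, sigma_h)                 # Equation 6
    mu_s = (Pc - a / 2.0) / sigma_h                    # expected strength
    sd_s = math.sqrt(Pc * (1 - Pc) / twoN) / sigma_h   # Equation 7
    return mu_s, sd_s

def p_yes(mu_s, sd_s, C=0.0):
    return norm.sf(C, mu_s, sd_s)                      # P(Y)

def p_forced(muA, sdA, muB, sdB):
    return norm.cdf((muA - muB) / math.hypot(sdA, sdB))   # P(A, B)

old = strength_stats(0.10, 0.5, 1.0, 1.25, 100)   # easy-class old item
new = strength_stats(0.10, 0.5, 0.0, 1.25, 100)   # easy-class new item
print(round(p_yes(*old), 2), round(p_yes(*new), 2))   # about .73 and .20 (text: .74, .20)
print(round(p_forced(*old, *new), 2))                 # old chosen over new: about .84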

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).



Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is, in the model, calculated with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc with one, the variance of the recognition strength can be simplified to

σs² = Pc/(2Nσh²).

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC. The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations:

σs²(N)/σs²(O) ≈ Pc(N)/Pc(O).

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations, divided by the number of nodes active in the old items' representations, is 0.64 (= 0.8²).

Another approximation, useful for understanding the model, is that for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions:

σs(BO)/σs(AO) ≤ σh(A)/σh(B) ≈ σs(BN)/σs(AN).

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of active nodes, setting the activation threshold between old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. An optimal performance in the network requires the implementation suggested by the variance theory; if the implementation is changed significantly, the performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytic equations above, we find the following expression:

d′ = [μs(O) − μs(N)]/σs(NO) = [Pc(O) − Pc(N)]/[Pc(NO)(1 − Pc(NO))/(2N)]^(1/2). (8)

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strengths, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc() was calculated with Equation 6. The expected values of the net inputs and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items, with a preexperimental frequency of zero, were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).



The results show that d′ is optimal for a = .052; d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases: For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. For a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity level for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.
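The optimization itself is easy to reproduce in outline. The sketch below uses the reconstructed appendix expressions (Equations A1, A3, and A5 with f = 0 and L = 1) together with Equations 6 and 8, so the exact location of the optimum should be read from Figure 9A rather than from this code:

import numpy as np
from scipy.stats import norm

def d_prime(a, N=30, p=100):
    mu_old = a * N * (1 - a) ** 2                            # Equation A1
    var_new = a ** 3 * (1 - a) ** 2 * p * N                  # Equation A3
    var_old = var_new + a ** 3 * (1 - a) ** 3 * N ** 2 / 2   # Equation A5 (f=0, L=1)
    T = mu_old / 2.0                                         # threshold at the midpoint
    Pc_old = a * norm.sf(T, mu_old, np.sqrt(var_old))        # Equation 6
    Pc_new = a * norm.sf(T, 0.0, np.sqrt(var_new))
    Pc_no = (Pc_old + Pc_new) / 2.0
    return (Pc_old - Pc_new) / np.sqrt(Pc_no * (1 - Pc_no) / (2 * N))  # Equation 8

grid = np.linspace(0.01, 0.6, 60)
best = grid[np.argmax([d_prime(a) for a in grid])]
print(best)   # the activity level that maximizes this sketch's d';
              # the paper reports a = .052 from the exact analysis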

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed. Thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in the inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance


with respect to the placement of the recognition crite-rion A natural choice for performance in this context isthe probability of hits minus the probability of falsealarms This measurement corresponds to optimal per-formance when old correct responses and new correctnew responses are rewarded equally It is easy to see thatif the standard deviations of the old and the new distri-butions are equal the optimal performance will be foundif the recognition criterion is set exactly between the dis-tributions For unequal standard deviations the optimalrecognition criterion is shifted from the midpoint towardthe distribution with the smallest standard deviationMore exactly the optimal recognition criterion is thepoint at which the old and the new distributions inter-sect It is easy to see that this is true because if the recog-nition criterion is moved to the left of this point the rateof increase in false alarms is larger than the rate of in-crease in hits and performance suffers If the recognition

criterion is moved to the right of this point the rate of de-crease in hits is larger than the rate of decrease in falsealarms and performance also suffers (see eg Figure 4D)Formally f [S(O)] denotes the density of recognitionstrength of the old distribution and f [S(N)] the densityof the recognition strength of the new distribution Theratio between these variables is called the likelihoodratio L = f [S(O)] f [S(N)] and the optimal performanceoccurs when this ratio is equal to one (L = 1)
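The following minimal sketch (my own illustration, not the paper's simulation) locates the optimal single criterion by grid search and confirms that the likelihood ratio there is approximately one. The distribution parameters are made up.

```python
# Hits-minus-false-alarms is maximized where the old and new densities
# intersect, i.e., where the likelihood ratio L = f[S(O)]/f[S(N)] = 1.
from statistics import NormalDist

new, old = NormalDist(0.0, 1.0), NormalDist(1.0, 1.5)  # illustrative

best_c = max((x / 100 for x in range(-300, 500)),
             key=lambda c: (1 - old.cdf(c)) - (1 - new.cdf(c)))
L = old.pdf(best_c) / new.pdf(best_c)  # likelihood ratio at the optimum
print(f"optimal criterion ~ {best_c:.2f}, likelihood ratio ~ {L:.2f}")
```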

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ when the decision is based only on nodes that are active during encoding; the dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A; optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates [P(AO) + P(BO)] are maximized? The solution to this problem is surprisingly simple: Optimal performance occurs when the placements of the maximum likelihoods of the two classes are equal,

L(A) = f [S(AO)]/f [S(AN)] = f [S(BO)]/f [S(BN)] = L(B).
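A small sketch of this two-criterion problem, splitting a fixed false alarm budget between the classes by grid search; at the best split, the two likelihood ratios come out approximately equal. All parameter values here are invented for illustration.

```python
# With the total false alarm rate fixed, total hits are maximized when
# the likelihood ratios of the two classes match.
from statistics import NormalDist

A_new, A_old = NormalDist(0, 1.0), NormalDist(1.5, 1.2)  # easy class
B_new, B_old = NormalDist(0, 1.0), NormalDist(0.8, 1.1)  # difficult class
TOTAL_FA = 0.4                       # P(AN) + P(BN) held constant

def hits(fa_A):
    """Split the false alarm budget, return total hits and criteria."""
    cA = A_new.inv_cdf(1 - fa_A)
    cB = B_new.inv_cdf(1 - (TOTAL_FA - fa_A))
    return (1 - A_old.cdf(cA)) + (1 - B_old.cdf(cB)), cA, cB

best = max((f / 200 for f in range(1, int(TOTAL_FA * 200))),
           key=lambda f: hits(f)[0])
_, cA, cB = hits(best)
print("L(A) =", A_old.pdf(cA) / A_new.pdf(cA))
print("L(B) =", B_old.pdf(cB) / B_new.pdf(cB))  # ~ equal at the optimum
```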

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition threshold for Class A is diminished, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the false alarm rates must sum to zero: Δf [S(AN)] + Δf [S(BN)] = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = Δf [S(AO)]/Δf [S(AN)] < Δf [S(BO)]/Δf [S(BN)] = L(B), or Δf [S(AO)] + Δf [S(BO)] < 0. This shows that the change in the placement of the criteria from L(A) = L(B) results in an overall decrease in hit rate (Δf [S(AO)] + Δf [S(BO)] < 0), and performance suffers.

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can indirectly be inferred from the formulation of the theory. This is done in the variance theory by the normalization factor that scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihood on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted in the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.
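A sketch of the normalization step just described. The function and variable names are mine; the item's net-input standard deviation (sigma_h) is assumed to be available to the model, and the expected proportion of active nodes is approximated as a/2 (see the Note at the end of the paper).

```python
# Normalized recognition strength: proportion of active nodes minus its
# expectation, divided by the item's net-input standard deviation.
def recognition_strength(n_active: int, n_nodes: int, a: float,
                         sigma_h: float) -> float:
    p_active = n_active / n_nodes
    expected = a / 2        # roughly a/2 of the nodes remain active
    return (p_active - expected) / sigma_h

# The same raw count is weaker evidence for a high-variance (difficult)
# item than for a low-variance (easy) one:
print(recognition_strength(6, 100, 0.1, 1.25))  # easy class:      0.008
print(recognition_strength(6, 100, 0.1, 1.56))  # difficult class: 0.0064
```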

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently so that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate in Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance is better or equal after normalization, as compared with performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.
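A numeric check of the variance argument in point 3: once the proportion of correct nodes Pc exceeds .5 (as it must when inactive nodes are counted), the binomial variance Pc(1 − Pc)/N shrinks as Pc grows, which would wrongly make the old distribution tighter than the new one.

```python
# Binomial variance of the proportion correct falls for Pc > .5.
N = 100
for pc in (0.50, 0.60, 0.70, 0.80):
    print(f"Pc={pc:.2f}  var={pc * (1 - pc) / N:.4f}")
# 0.0025, 0.0024, 0.0021, 0.0016: monotonically decreasing.
```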

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple. The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy class distributions are predicted to be larger than the standard deviations of the difficult class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables such as study time, repetition, and study instructions affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune from this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item–context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; however, for nonwords, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and, therefore, associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood theory's account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes–no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σ′h), so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM: retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

aσp(N)/[σp(N) + σp(O)].

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around .44a if the slope of the ROC curve is .8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξj = ξi = 1:

μh(O) = Σj^N Δwij ξj = Σj^N (ξi − a)(ξj − a)ξj = a(1 − a)²N. (A1)

The expected value of the net inputs for the new items is zero:

μh(N) = 0. (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

σ²h(N) = Σj^N Σp^P E[(Δwij ξj)²] = {[(1 − a)²(1 − a)²]a² + [(1 − a)²a²][2a(1 − a)] + [a²a²](1 − a)²}PN = a²(1 − a)²PN. (A3)

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item ( f ). All the aN weights supporting a context node contribute to the net input in the same direction:

σ²h(O) = 2f a³N. (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

σ²h(O) = 2La³N.

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L, in a network where p patterns have been encoded, is

σ²h(O) = ( f + L)a³N + pa²(1 − a)²N. (A5)

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that also were active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that also are inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


(The mathematical details are presented in the Appendix at the end of the paper, together with an analysis of optimal performance.)

The Material-Based Mirror Effect for High- and Low-Frequency Items

The variance theory was simulated above; here, the analytical results are presented. The variance theory predicts the mirror effect for any choice of parameters when the recognition criterion is unbiased. As will be discussed later, the variance theory can be fully described by two parameters (the number of nodes, N, and the percentage of active nodes, a) plus one parameter for each class of words [the standard deviation of the net input, σh( )]. The following parameters are used here. The number of nodes is 2N = 100, and the percentage of nodes active at encoding is set to a = .1. The standard deviation of the net inputs to the easy class is σh(AN) = σh(AO) = 1.25, and the standard deviation of the net inputs to the difficult class is σh(BN) = σh(BO) = 1.56. There are other parameters, which, however, as will be discussed later, do not add any degrees of freedom to the model: the expected net inputs of the new distributions, μh(AN) = μh(BN) = 0, and the expected net inputs of the old distributions, μh(AO) = μh(BO) = 1. Consequently, the activation threshold is T = 0.5.

These parameters yield the following probabilities that a node is active at recognition: Pc(AN) = .43a, Pc(BN) = .45a, Pc(BO) = .55a, Pc(AO) = .57a. The following expected recognition strengths are predicted: μs(AN) = −.012, μs(BN) = −.008, μs(BO) = .008, μs(AO) = .012. Figure 4D plots the four recognition strength densities (the distributions are assumed to be normal), using the parameters above. The same parameter settings were used in Figures 4A, 4B, 4C, and 5.

Strength-Based Mirror Effects Between Lists

The variance theory is consistent with the strength-based mirror effects. Thus, variables that increase the hit rates also decrease the false alarm rates. This empirical finding is called dispersion, which means that the new and the old distributions move apart. The opposite phenomenon is called concentering, which means that the new and the old distributions move closer together. Examples of variables showing strength-based mirror effects are speed versus accuracy instructions, length of study time, encoding task, forgetting, repetition, and aging (Kim & Glanzer, 1993). These experimental variables can be related to a specific parameter in the variance theory, namely, the expected net input.

The variance theory predicts a strength-based mirror effect because subjects must adjust the activation threshold to optimize performance. This change in activation threshold affects the false alarm rates. For example, assume that study time is increased from 1 to 2 sec, so that the expected net input increases from 1 to 2 and the activation threshold increases from 1/2 to 1. This diminishes the false alarm rate. However, the increase in the activation threshold is smaller than the increase in the old net input, so the hit rate will increase. Thus, increasing the study time increases the hit rate but decreases the false alarm rate, which is dispersion.
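A sketch of this dispersion prediction, under the simplifying assumption of Gaussian net inputs with unit standard deviation (the values are illustrative): the threshold tracks the midpoint between the new (0) and old expected net inputs, so strengthening the old items raises hits and lowers false alarms at the same time.

```python
# Threshold midway between new (0) and old expected net inputs:
# increasing mu_old raises hits AND lowers false alarms (dispersion).
from statistics import NormalDist

z = NormalDist()
for mu_old in (1.0, 2.0):        # e.g., 1 sec vs. 2 sec of study time
    T = mu_old / 2               # activation threshold at the midpoint
    hit = 1 - z.cdf(T - mu_old)  # P(net > T | old), SD = 1
    fa = 1 - z.cdf(T)            # P(net > T | new), SD = 1
    print(f"mu_old={mu_old}: hit={hit:.2f}, fa={fa:.2f}")
# mu_old 1 -> 2: hit .69 -> .84, fa .31 -> .16.
```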

The mirror effect is accounted for in some theories by a change in the recognition criterion. Note that in the variance theory, the recognition criterion is constant, whereas the activation threshold is changed. There is an important difference between a change in the recognition criterion and a change in the activation threshold. The change in the activation threshold optimizes the performance as measured by d′, whereas a change in the recognition criterion does not influence d′. Given an optimal placement of the activation threshold, the performance in terms of percentage correct is optimal if the recognition criterion is set to an optimal value, which is zero. Thus, there is a clear difference between changing the recognition criterion and changing the activation threshold. The variance theory accounts for the strength-based mirror effect occurring between two conditions by the change in the activation threshold necessary for optimal performance, whereas the recognition criterion does not change.

Concentering occurs, for example, when subjects are instructed to emphasize speed (rather than accuracy), with superficial (rather than deep, or semantic) study instructions, with diminished study time, or with an increased retention interval (Kim & Glanzer, 1993). In the variance theory, all these manipulations are assumed to diminish the old net inputs. Figure 6A shows the predictions of the variance theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 0.5, rather than 1 as in Figure 4D. Consequently, the activation threshold must be set to 0.25 to optimize performance. The distributions in Figure 6A are closer than the distributions in Figure 4D. Thus, decreasing the net inputs, for example, by diminishing study time, moves the distributions closer together, thus showing concentering.

The opposite phenomenon to concentering is dispersion, which means that increasing the performance moves the distributions apart. Dispersion can be studied by changing the variables listed above in the opposite directions, for example, by increasing the study time. Figure 6B shows the predictions of the theory when the expected net inputs of the old distributions are μh(AO) = μh(BO) = 2. Consequently, the activation threshold must be set to 1 to maintain near-optimal performance. The distributions in Figure 6B are further apart than the distributions in Figure 4D.

These strength-based manipulations are usually applied between different lists or conditions. For example, Kim and Glanzer (1993) manipulated study time between four study lists, where the items were presented for 1 sec each in two lists and for 2 sec each in two lists. After each list, there was a recognition test. In the variance theory, the activation threshold is the same during each recognition test but may vary between two recognition tests with different levels of difficulty, for example, different study times. As will be discussed later, different empirical results are found when study time is varied within one list. In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low, and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed to .4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or any value above .4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

Table 1
General Table of Results From the Experiment

Condition    AN    BN    BO    AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13   .17   .69   .82   .60              .86
Frequency    .20   .28   .80   .68   1.01             .66
Time         .10   .15   .78   .76   .89              .81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low (AN) and high (BN) frequencies and the hit rates for high (BO) and low (AO) frequencies, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength and the vertical axes the density.

List-Length Effect

Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy and difficult class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.
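A sketch of this list-length prediction under the stated assumption that net-input variance grows linearly with list length, with the expected value and threshold unchanged. The baseline values (mu_old, T, sigma0) are invented for illustration.

```python
# Variance linear in list length L: a fixed threshold then yields
# fewer hits and more false alarms for longer lists.
from math import sqrt
from statistics import NormalDist

z = NormalDist()
mu_old, T, sigma0 = 1.0, 0.5, 0.8    # illustrative baseline (L = 10)
for L in (10, 20, 40):
    sigma = sigma0 * sqrt(L / 10)    # variance proportional to L
    hit = 1 - z.cdf((T - mu_old) / sigma)
    fa = 1 - z.cdf((T - 0.0) / sigma)
    print(f"L={L}: hit={hit:.2f}, fa={fa:.2f}")
# Hits fall and false alarms rise as L grows.
```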

ROC Curves

The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of the recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths, become equal. Figure 6D plots the predictions of the theory when all standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = .015, σs(BN) = .012, σs(BO) = .015, and σs(AO) = .020. The ratios of these standard deviations must follow Equation 2. This is also the case, with σs(BN)/σs(AO) = .61 < .74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = .78 < .94 = σs(AN)/σs(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = .016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy class words than for difficult class words. This prediction is supported by empirical data: Greene (1996) asked subjects to respond yes only if they were sure of their response, and consistent with the prediction, no mirror effect was found.
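The quoted yes rates can be checked by treating the four recognition-strength distributions as normal, with the means and standard deviations given above (Figure 4D parameters), and thresholding at the criterion C. The sketch below reproduces the printed values to within a point or two (small differences reflect rounding in the quoted parameters).

```python
# Yes rates for the four classes at an unbiased and a conservative
# criterion, from the means/SDs quoted in the text.
from statistics import NormalDist

dists = {"AN": NormalDist(-0.012, 0.015), "BN": NormalDist(-0.008, 0.012),
         "BO": NormalDist(0.008, 0.015), "AO": NormalDist(0.012, 0.020)}

for C in (0.0, 0.016):          # unbiased vs. conservative criterion
    rates = {k: round(1 - d.cdf(C), 2) for k, d in dists.items()}
    print(C, rates)
# C=0     -> AN=.21, BN=.25, BO=.70, AO=.73 (mirror order holds)
# C=0.016 -> AN=.03, BN=.02, BO=.30, AO=.42 (mirror order broken)
```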

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN); P(BO, AN) < P(AO, AN); P(BN, AN) > .50; and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.
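These forced-choice values follow from the same four distributions, assuming independent normal strengths, so that P(SX > SY) is computed from the difference distribution. The sketch lands within a point or two of the printed values.

```python
# Two-alternative forced choice: P(S_x > S_y) for independent normals.
from math import hypot
from statistics import NormalDist

z = NormalDist()
mu = {"AN": -0.012, "BN": -0.008, "BO": 0.008, "AO": 0.012}
sd = {"AN": 0.015, "BN": 0.012, "BO": 0.015, "AO": 0.020}

def choose(x: str, y: str) -> float:
    """Probability that item x yields the larger strength."""
    return 1 - z.cdf((mu[y] - mu[x]) / hypot(sd[x], sd[y]))

for pair in (("BO", "BN"), ("AO", "BN"), ("BO", "AN"),
             ("AO", "AN"), ("BN", "AN"), ("AO", "BO")):
    print(pair, round(choose(*pair), 2))
```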

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage for old high-frequency items when attention is focused on these items) must be larger than the effect from the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage for old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of net inputs (Figure 7A) that the density of recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2, rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75, rather than 0.50. The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, P(AO) = .63 [which may be compared with Figure 4D: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74].

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the high-frequency class (BO) is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs are not dependent on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = .013, σs(BN) = .011, σs(BO) = .017, and σs(AO) = .019. These results are similar to the results when the same amount of attention is paid to the two classes, in Figure 4D: σs(AN) = .015, σs(BN) = .012, σs(BO) = .015, and σs(AO) = .020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so that more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate distribution for the high-frequency class would be larger than that for the low-frequency class. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated in an experiment. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years (range, 18-29).

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether or not the word had been presented in the list. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition, followed by the presentation time condition. Nine subjects were given the control condition, followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The probability of a hit was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the probability of a hit was larger for the low-frequency words. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = .004, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = .004, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = .003, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = .003, p = .048 < .05], and not in the presentation time condition [t(11) = 1.5, MSe = .001, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = .002, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1–5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slope of the linear regression between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly the slope for the false alarms [σs(BN)/σs(AN)], are shown in the last two rows of Table 1.
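This slope computation can be made concrete with a short script. The following is a minimal sketch in Python, using hypothetical cumulative proportions rather than the experiment's data; the z transform is the inverse of the standard normal cumulative distribution function.

from statistics import NormalDist

z = NormalDist().inv_cdf  # the z (probit) transform

# Hypothetical cumulative hit rates at confidence cutoffs 5, 4, 3, 2, 1.
hits_high = [0.30, 0.45, 0.58, 0.65, 0.69]  # difficult class (B), predictor
hits_low  = [0.35, 0.55, 0.70, 0.78, 0.82]  # easy class (A), dependent

zx = [z(p) for p in hits_high]
zy = [z(p) for p in hits_low]

# Least-squares slope of zy regressed on zx: cov(zx, zy) / var(zx).
mx = sum(zx) / len(zx)
my = sum(zy) / len(zy)
slope = sum((x - mx) * (y - my) for x, y in zip(zx, zy)) / \
        sum((x - mx) ** 2 for x in zx)
print(round(slope, 2))  # estimates the ratio of standard deviations, ss(BO)/ss(AO)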

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is also fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance—that is,

$$\sum_{i=1}^{8}\frac{(\mathrm{Observed}_i-\mathrm{Predicted}_i)^2}{\sigma_i^2}.$$

Three parameters were fitted—namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at a value found to give a good fit. These parameters were N = 1000 and the recognition criterion [ln(L) = 0].
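As an illustration, the weighted error measure above is straightforward to compute. The sketch below (Python) uses placeholder arrays, and any general-purpose numerical optimizer could then minimize it over the free parameters; the function name is hypothetical.

def weighted_squared_error(observed, predicted, variances):
    # Squared deviations divided by the variance of each measurement,
    # summed over the eight data points (four yes-response probabilities
    # and four slope ratios).
    return sum((o - p) ** 2 / v
               for o, p, v in zip(observed, predicted, variances))

# observed: the eight empirical values; predicted: the model's values for a
# candidate parameter setting; variances: the empirical error variances
# (set to one when they are not reported).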

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σh(A)], and the standard deviation of the net inputs for the difficult-class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.

The results show that the variance theory accounts for .98 (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for .93 of the variance for the slope; the attention-likelihood theory accounts for .84 of the variance of the slope. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slope, as compared with the attention-likelihood theory when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted by a single variable—namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed to 0.10, and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), to 1.25 (these values were the average of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of observed probabilities. Figure 8B shows the corresponding results for the slope. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes, using a single parameter, equally well as the attention-likelihood theory does with three fitting parameters. The fit for the variance theory for the probabilities using one parameter is slightly less than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit for the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratio of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters—namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)]—can also be expressed in terms of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions—namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

$$P_c = a\int_T^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma_h}\,e^{-(h-\mu_h)^2/(2\sigma_h^2)}\,dh. \qquad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), gives the expected recognition strength (μs):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because it simplifies the analytic solution; the variance theory, and the simulation, use the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and the item layers. The distribution of Pc is binomial, but it can, for a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/2N]^{1/2}. The final result is scaled by the normalization factor 1/σh:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c(1-P_c)}{2N}\right]^{1/2}. \qquad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

$$P(\mathrm{Y}) = \int_C^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma_s}\,e^{-(s-\mu_s)^2/(2\sigma_s^2)}\,ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A,B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

$$P(\mathrm{A,B}) = \int_C^{\infty}\frac{1}{\sqrt{2\pi\left[\sigma_s^2(\mathrm{A})+\sigma_s^2(\mathrm{B})\right]}}\;\exp\!\left(-\frac{\left\{s-\left[\mu_s(\mathrm{A})-\mu_s(\mathrm{B})\right]\right\}^2}{2\left[\sigma_s^2(\mathrm{A})+\sigma_s^2(\mathrm{B})\right]}\right)ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
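The full analytic chain is short enough to sketch in a few lines of Python. The parameter values below are illustrative only (they are not fitted values), and the function name is hypothetical; Phi denotes the standard normal cumulative distribution function.

from math import erf, sqrt

def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_yes(mu_h, sigma_h, a, N, T, C):
    Pc = a * Phi((mu_h - T) / sigma_h)           # Equation 6
    mu_s = (Pc - a / 2.0) / sigma_h              # expected recognition strength
    sigma_s = sqrt(Pc * (1.0 - Pc) / (2.0 * N)) / sigma_h  # Equation 7
    return 1.0 - Phi((C - mu_s) / sigma_s)       # P(Y)

# Illustrative values: old items, threshold midway between new (0) and old (1).
print(p_yes(mu_h=1.0, sigma_h=1.0, a=0.1, N=50, T=0.5, C=0.0))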


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is, in the model, calculated with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc to one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations:

$$\frac{\sigma_s^2(\mathrm{N})}{\sigma_s^2(\mathrm{O})} \approx \frac{P_c(\mathrm{N})}{P_c(\mathrm{O})}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations and the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.8²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions:

$$\frac{\sigma_s(\mathrm{BO})}{\sigma_s(\mathrm{AO})} \le \frac{\sigma_s(\mathrm{BN})}{\sigma_s(\mathrm{AN})} \approx \frac{\sigma_h(\mathrm{A})}{\sigma_h(\mathrm{B})}.$$

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that the performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of nodes active, setting the activation threshold between old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that an optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

$$d' = \frac{\mu_s(\mathrm{O})-\mu_s(\mathrm{N})}{\sigma_s(\mathrm{NO})} = \frac{P_c(\mathrm{O})-P_c(\mathrm{N})}{\left[P_c(\mathrm{NO})\left(1-P_c(\mathrm{NO})\right)/(2N)\right]^{1/2}}. \qquad (8)$$

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc(·) was calculated with Equation 6. The expected value of the net inputs and the standard deviation of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a). The results show that d′ is optimal for a = 0.052. The d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases. For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. Thus, for a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.
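The shape of this optimum can be reproduced with a rough numerical sketch. The Python code below assumes the appendix expressions for the expected old net input (Equation A1) and the pattern-noise variance (Equation A3), ignores the frequency and list-length terms, and places the activation threshold midway between the new and old means; it is an approximation of the analysis, not the exact computation behind Figure 9A.

from math import erf, sqrt

def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def d_prime(a, N=30, p=100):
    mu_old = a * (1.0 - a) ** 2 * N                  # Equation A1
    sigma_h = sqrt(a ** 3 * (1.0 - a) ** 2 * p * N)  # Equation A3 (noise only)
    T = mu_old / 2.0                                 # threshold between the means
    Pc_old = a * Phi((mu_old - T) / sigma_h)         # Equation 6
    Pc_new = a * Phi((0.0 - T) / sigma_h)
    Pc_no = (Pc_new + Pc_old) / 2.0
    # Equation 8
    return (Pc_old - Pc_new) / sqrt(Pc_no * (1.0 - Pc_no) / (2.0 * N))

best = max((d_prime(k / 1000.0), k / 1000.0) for k in range(10, 500))
print(best)  # d' peaks at a low percentage of active nodes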

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = 0.052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = 0.052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal value. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight changes, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when old correct responses and new correct responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, the optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true: If the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers; if the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f [S(O)]/f [S(N)], and the optimal performance occurs when this ratio is equal to one (L = 1).
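This intersection argument is easy to verify numerically. The sketch below uses two illustrative normal strength distributions with unequal standard deviations (the values are not fitted), scans the criterion, and confirms that hits minus false alarms peaks where the likelihood ratio is approximately one.

from math import erf, exp, pi, sqrt

def tail(c, mu, sd):  # P(X > c) for X ~ N(mu, sd)
    return 0.5 * (1.0 - erf((c - mu) / (sd * sqrt(2.0))))

def density(x, mu, sd):
    return exp(-((x - mu) ** 2) / (2.0 * sd * sd)) / (sd * sqrt(2.0 * pi))

mu_n, sd_n = 0.0, 0.8   # new distribution (smaller SD)
mu_o, sd_o = 1.0, 1.0   # old distribution (larger SD)

# Scan criteria; keep the one maximizing hits minus false alarms.
best_c = max((tail(c / 100.0, mu_o, sd_o) - tail(c / 100.0, mu_n, sd_n), c / 100.0)
             for c in range(-100, 200))[1]
L = density(best_c, mu_o, sd_o) / density(best_c, mu_n, sd_n)
print(best_c, L)  # L is approximately 1: the optimum sits at the crossing point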

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. An optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: Optimal performance occurs when the likelihood ratios of the two classes are equal:

$$L(\mathrm{A}) = \frac{f[S(\mathrm{AO})]}{f[S(\mathrm{AN})]} = L(\mathrm{B}) = \frac{f[S(\mathrm{BO})]}{f[S(\mathrm{BN})]}.$$

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is diminished, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the resulting changes in the two false alarm rates must cancel. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), so the increase in hits for Class A is smaller than the decrease in hits for Class B. This shows that the change in the placement of the criteria from L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.

Note that the variance theory has only one overall recognition criterion. However, the theory—and any theory of the mirror effect—specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can indirectly be inferred from the formulation of the theory. This is done in the variance theory by the normalization factor that scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihood on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently so that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate in Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance is better or equal after normalization as compared with performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding)—for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50) even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral connections (i.e., there are no connections within the item layer, and there are no connections within the context layer). Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental, learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune from this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item–context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; however, for nonwords, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions—for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes–no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σh′), so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: Subjects count the number of activated nodes minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items for addressing issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.
Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.
Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.
Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.
Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.
Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.
Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.
Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.
Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.
Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.
McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.
Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.
Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.
Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.
Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM—retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.
Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University (ISBN 91-7191-155-3).
Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.
Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.
Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.
Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old items [σp(O)] are equal. However, if the standard deviations are different, the expected number of active nodes will be

$$\frac{a\,\sigma_p(\mathrm{N})}{\sigma_p(\mathrm{N})+\sigma_p(\mathrm{O})}.$$

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the ROC curve is 0.8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input solves the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξj = 1:

$$\mu_h(\mathrm{O}) = \sum_j^N \Delta w_{ij}\,\xi_j = \sum_j^N (\xi_i - a)(\xi_j - a)\,\xi_j = a(1-a)^2 N. \qquad (\mathrm{A1})$$

The expected value of the net inputs for the new items is zero:

$$\mu_h(\mathrm{N}) = 0. \qquad (\mathrm{A2})$$

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let p represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

$$\sigma_h^2(\mathrm{N}) = \left\langle\left(\sum_j^N \sum_k^p \Delta w_{ij}\,\xi_j\right)^2\right\rangle = aN\,p\left[a(1-a)\right]^2 = a^3(1-a)^2 p N. \qquad (\mathrm{A3})$$

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

$$\sigma_h^2(f) = \left[a(1-a)N\right]^2 f\,a(1-a) = a^3(1-a)^3 f N^2. \qquad (\mathrm{A4})$$

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

$$\sigma_h^2(L) = a^3(1-a)^3 L N^2.$$

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where p patterns have been encoded, is

$$\sigma_h^2(\mathrm{O}) = \frac{a^3(1-a)^3(f+L)N^2}{2} + a^3(1-a)^2 p N. \qquad (\mathrm{A5})$$

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)



In this case, the activation threshold is also constant during the recognition tests, although the study time varies within the condition.

The order of the probabilities in the mirror effect is somewhat robust against changes in the activation threshold over a large range. Setting the activation threshold to a fixed, sufficiently low and positive value yields the mirror effect for any value of the expected net input. For example, assume that the activation threshold is fixed to 0.4. Then the mirror effect is predicted for the three cases of expected old net inputs discussed above (0.5, 1, and 2), or for any value above 0.4. The predictions for the new distributions do not change with these changes in net inputs [P(AN) = .25, P(BN) = .30]; thus, a change in the activation threshold is needed to change the false alarm rates. In contrast, the advantage of the old easy class over

Table 1
General Table of Results From the Experiment

Condition    AN    BN    BO    AO    σs(BN)/σs(AN)    σs(BO)/σs(AO)
Control      .13   .17   .69   .82        .60              .86
Frequency    .20   .28   .80   .68       1.01              .66
Time         .10   .15   .78   .76        .89              .81

Note. The rows show the conditions (control, presentation frequency, and presentation time). The columns show the false alarm rates for low (AN) and high (BN) frequencies, the hit rates for high (BO) and low (AO) frequencies, the slope of the z-ROC curve for the new low-frequency distribution as a function of the new high-frequency distribution [σs(BN)/σs(AN)], and the corresponding slope for the old distributions [σs(BO)/σs(AO)].

Figure 6. The densities of recognition strength in the variance theory for different parameter settings: (A) concentering, (B) dispersion, (C) activity level set to one, and (D) equal variance. The horizontal axes show the recognition strength, and the vertical axes the density.


the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μh(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μh(O) = 2].

List-Length Effect

Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σh) for both easy- and difficult-class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list, but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance: In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes; in the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.
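A brief worked illustration, using the item-node variance as reconstructed in the Appendix, σh²(L) = a⁴(1 − a)²LN: doubling the list length doubles the variance of the net inputs to the item nodes,

    \frac{\sigma_h^2(2L)}{\sigma_h^2(L)} = \frac{a^4 (1-a)^2\, (2L)\, N}{a^4 (1-a)^2\, L\, N} = 2,

which is the linearity in list length asserted above.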

ROC Curves

The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σs(AO) > σs(AN) and σs(BO) > σs(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σs(AO) = σs(AN) and σs(BO) = σs(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of recognition strength is larger for the easy class than for the difficult class [σs(AN) > σs(BN) and σs(AO) > σs(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths themselves, become equal. Figure 6D plots the predictions of the theory when all standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020. The ratios of these standard deviations must follow Equation 2. This is also the case, with σs(BN)/σs(AO) = .61 < .74 = σs(AN)/σs(AO) < σs(BN)/σs(BO) = .78 < .94 = σs(AN)/σs(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = 0.016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy-class words than for difficult-class words. This prediction is supported by empirical data: Greene (1996) asked subjects to respond yes only if they were sure of their response, and, consistent with the prediction, no mirror effect was found.
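The criterion effect can be checked numerically. The sketch below assumes Gaussian recognition-strength distributions; the means and standard deviations are illustrative values chosen to approximately reproduce the rates quoted above, and they are not parameters taken from the paper.

    from scipy.stats import norm

    # (mu_s, sigma_s) per class; illustrative values only
    classes = {
        "AN": (-0.0125, 0.015), "BN": (-0.0080, 0.012),
        "BO": ( 0.0081, 0.015), "AO": ( 0.0129, 0.020),
    }

    for C in (0.0, 0.016):  # unbiased vs. conservative criterion
        rates = {k: norm.sf(C, mu, sd) for k, (mu, sd) in classes.items()}
        print(C, {k: round(v, 2) for k, v in rates.items()})
    # C = 0:     roughly AN=.20 < BN=.25 < BO=.70 < AO=.74 (mirror order)
    # C = 0.016: roughly AN=.03 > BN=.02 (false alarm order reverses)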

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN); P(BO, AN) < P(AO, AN); P(BN, AN) > .50; and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN); P(BO, AN) = .83 < .84 = P(AO, AN); P(BN, AN) = .59 > .50; and P(AO, BO) = .57 > .50.

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer


presentation time or a larger presentation frequency increases the net inputs of the old items [μh(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage for old high-frequency items when attention is focused on these items) must be larger than the effect from the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage for old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of net inputs (Figure 7A) that the density of recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μh(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75 rather than 0.50.


Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the high-frequency class (BO) is shifted to the right because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 6B: P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74].

Furthermore, in the variance theory, the ratio of the recognition-strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs are not dependent on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = 0.013, σs(BN) = 0.011, σs(BO) = 0.017, and σs(AO) = 0.019. These results are similar to the results when the same amount of attention is paid to the two classes, in Figure 4D: σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that, if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect in which the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate for the high-frequency distribution would be larger than that for the low-frequency distribution. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated in an experiment. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years, ranging from 18 to 29 years old.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether or not the word had been presented in the list. The subjects were also required to rate their confidence in their responses, on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition testing was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition, followed by the presentation time condition. Nine subjects were given the control condition, followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The hit rate


was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate for the low-frequency words was larger. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words; thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1-5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slopes of the linear regression curves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly for the slope of the false alarms [σs(BN)/σs(AN)], are shown in the last two columns of Table 1.
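A sketch of this slope computation, using hypothetical rating data. The scoring rule (a response counts as a yes at criterion c if its confidence rating is c or above) follows the description above; the rating arrays below are invented for illustration.

    import numpy as np
    from scipy.stats import norm, linregress

    def cumulative_z(ratings, n_levels=5):
        """z-transformed cumulative yes rates at criteria 2..n_levels."""
        r = np.asarray(ratings)
        rates = [(r >= c).mean() for c in range(2, n_levels + 1)]
        return norm.ppf(rates)

    # hypothetical confidence ratings for old low- and high-frequency words
    z_low = cumulative_z([5, 4, 4, 3, 5, 1, 4, 5, 3, 4])
    z_high = cumulative_z([4, 3, 4, 2, 5, 1, 3, 4, 2, 3])

    # slope of z(low-frequency hits) regressed on z(high-frequency hits),
    # an estimate of sigma_s(BO)/sigma_s(AO)
    slope = linregress(z_high, z_low).slope
    print(round(slope, 2))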

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation time condition and the control condition, but were approximately equal in the presentation frequency condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear: It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is also fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

    \sum_{i=1}^{8} \frac{\left( \mathrm{Observed}_i - \mathrm{Predicted}_i \right)^2}{\sigma_i^2}.

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were N = 1,000 and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of



nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σh(A)], and the standard deviation of the net inputs for the difficult-class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (σi) were not reported in Glanzer et al. (1993), so these parameters were set to one.
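A sketch of this fitting procedure follows. The predict() function is a placeholder for the model's eight predicted quantities (four yes-response probabilities and four slope ratios); the observed values are hypothetical, and the starting parameters echo the values reported in this section.

    import numpy as np
    from scipy.optimize import minimize

    observed = np.array([.13, .17, .69, .82, .60, .86, .75, .80])  # hypothetical
    sigma = np.ones(8)  # empirical SDs were not reported, so set to one

    def predict(params):
        a, sh_A, sh_B = params
        # placeholder: the variance theory's predictions for the four
        # probabilities and four slope ratios would be computed here
        return np.zeros(8)

    def error(params):
        return np.sum((observed - predict(params)) ** 2 / sigma ** 2)

    fit = minimize(error, x0=[0.10, 1.25, 1.56], method="Nelder-Mead")
    print(fit.x)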

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slopes; the attention-likelihood theory accounts for 84% of the variance for the slopes. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slopes, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted with a single variable, namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed to 0.10, and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), to 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


probabilities. Figure 8B shows the corresponding results for the slopes. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes, using a single parameter, equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities, using one parameter, is slightly less than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed in terms of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability is dependent on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

    P_c = \frac{a}{\sqrt{2\pi}\,\sigma_h} \int_T^{\infty} e^{-(h - \mu_h)^2 / (2\sigma_h^2)}\, dh.    (6)

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), yields the expected recognition strength (μs):

    \mu_s = \frac{P_c - a/2}{\sigma_h}.

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because it simplifies the analytic solution; the simulation of the variance theory, however, uses the standard deviation of the item. This approximation is good when there is a large number of features; for a small number of features, however, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and item layers. The distribution of Pc is binomial, but it can, for a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/2N]^(1/2). The final result is scaled by the normalization factor 1/σh:

    \sigma_s = \frac{1}{\sigma_h} \left[ \frac{P_c \left( 1 - P_c \right)}{2N} \right]^{1/2}.    (7)

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

    P(\mathrm{Y}) = \frac{1}{\sqrt{2\pi}\,\sigma_s} \int_C^{\infty} e^{-(s - \mu_s)^2 / (2\sigma_s^2)}\, ds.

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

    P(\mathrm{A,B}) = \frac{1}{\sqrt{2\pi \left[ \sigma_s^2(\mathrm{A}) + \sigma_s^2(\mathrm{B}) \right]}} \int_0^{\infty} \exp\left( - \frac{\left\{ s - \left[ \mu_s(\mathrm{A}) - \mu_s(\mathrm{B}) \right] \right\}^2}{2 \left[ \sigma_s^2(\mathrm{A}) + \sigma_s^2(\mathrm{B}) \right]} \right) ds.

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
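A minimal end-to-end sketch of the analytic solution, assuming the forms of Equations 6-8 as reconstructed above. The example parameter values are chosen in the spirit of Figure 4D (2N = 100, μh(O) = 1, σh = 1.25, with T midway between the new and old means); they are illustrative rather than definitive.

    from math import sqrt
    from scipy.stats import norm

    def predictions(a, N, mu_old, sigma_h, T, C):
        def pc(mu):                      # Equation 6: a * P(net input > T)
            return a * norm.sf(T, loc=mu, scale=sigma_h)
        def mu_s(p):                     # expected recognition strength
            return (p - a / 2) / sigma_h
        def sigma_s(p):                  # Equation 7
            return sqrt(p * (1 - p) / (2 * N)) / sigma_h
        p_old, p_new = pc(mu_old), pc(0.0)
        hit = norm.sf(C, mu_s(p_old), sigma_s(p_old))   # P(Y) for old items
        fa = norm.sf(C, mu_s(p_new), sigma_s(p_new))    # P(Y) for new items
        p_no = (p_old + p_new) / 2                      # Pc(NO)
        d = (p_old - p_new) / sqrt(p_no * (1 - p_no) / (2 * N))  # Equation 8
        return hit, fa, d

    print(predictions(a=0.10, N=50, mu_old=1.0, sigma_h=1.25, T=0.5, C=0.0))

With these illustrative values, the yes rates come out near a .20 false alarm rate and a low-.70s hit rate, which is at least in the regime of the easy-class rates quoted earlier in the paper.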



Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is, in the model, calculated with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc to one, the variance of the recognition strength can be simplified to

    \sigma_s^2 = \frac{P_c}{2N \sigma_h^2}.

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

    \frac{\sigma_s^2(\mathrm{N})}{\sigma_s^2(\mathrm{O})} \approx \frac{P_c(\mathrm{N})}{P_c(\mathrm{O})}.

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.8²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition-strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition-strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition-strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition-strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions:

    \frac{\sigma_s(\mathrm{BO})}{\sigma_s(\mathrm{AO})} \le \frac{\sigma_h(\mathrm{A})}{\sigma_h(\mathrm{B})} \approx \frac{\sigma_s(\mathrm{BN})}{\sigma_s(\mathrm{AN})}.

This suggests that the ratio between the recognition-strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that the performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of nodes active, setting the activation threshold between old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytic equations above, we find the following expression:

    d' = \frac{\mu_s(\mathrm{O}) - \mu_s(\mathrm{N})}{\sigma_s(\mathrm{NO})}
       = \frac{P_c(\mathrm{O}) - P_c(\mathrm{N})}{\left[ P_c(\mathrm{NO}) \left( 1 - P_c(\mathrm{NO}) \right) / (2N) \right]^{1/2}}.    (8)

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc(·) was calculated with Equation 6. The expected value of the net inputs and the standard deviation of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).



The results show that d′ is optimal for a = 0.52; d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases. For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.
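A sketch of this sweep over a, combining Equation 8 with the Appendix formulas as reconstructed in this document, and with the threshold fixed midway between the expected new and old net inputs. Because the paper jointly optimized T as well, and because the exact normalization of the net input is not recoverable from this text, the optimum found here need not match a = 0.52.

    import numpy as np
    from scipy.stats import norm

    def d_prime(a, N=30, p=100, f=0, L=1):
        mu_old = a * N * (1 - a) ** 2                 # Equation A1
        var_new = a ** 3 * (1 - a) ** 2 * p * N       # Equation A3
        var_old = (f + L) * a ** 4 * (1 - a) ** 2 * N / 2 + var_new  # A5
        T = mu_old / 2                                # threshold midway
        pc_old = a * norm.sf(T, mu_old, np.sqrt(var_old))
        pc_new = a * norm.sf(T, 0.0, np.sqrt(var_new))
        pc_no = (pc_old + pc_new) / 2
        return (pc_old - pc_new) / np.sqrt(pc_no * (1 - pc_no) / (2 * N))

    grid = np.linspace(0.05, 0.95, 91)
    print(max(grid, key=d_prime))    # activity level with the best d'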

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. Thus, for a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity level for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium-low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed. Thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = 0.52. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = 0.52). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may

deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance


with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when old correct responses and new correct responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, the optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition

criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f[S(O)] denotes the density of the recognition strength of the old distribution, and f[S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f[S(O)]/f[S(N)], and the optimal performance occurs when this ratio is equal to one (L = 1).
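This likelihood-ratio claim is easy to verify numerically. The sketch below uses illustrative Gaussian parameters: the criterion that maximizes hits minus false alarms is the point where the two densities intersect, so the likelihood ratio there should come out approximately one.

    import numpy as np
    from scipy.stats import norm

    mu_n, sd_n, mu_o, sd_o = 0.0, 1.0, 1.0, 1.25   # illustrative values

    def hits_minus_fa(c):
        return norm.sf(c, mu_o, sd_o) - norm.sf(c, mu_n, sd_n)

    grid = np.linspace(-3, 4, 10001)
    c_best = grid[np.argmax(hits_minus_fa(grid))]
    lr = norm.pdf(c_best, mu_o, sd_o) / norm.pdf(c_best, mu_n, sd_n)
    print(c_best, lr)   # the likelihood ratio is ~1 at the best criterion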

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. An optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

This problem may formally be stated as follows: Given two classes (A and B), with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: The optimal performance occurs when the placements of the maximum likelihoods of the two classes are equal,

    L(A) = f[S(AO)]/f[S(AN)] = L(B) = f[S(BO)]/f[S(BN)].

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is diminished, the recognition criterion for Class B must be increased to keep the total false alarm rate constant; according to the formulation of the problem, the changes in the two classes' false alarm rates must cancel, f[S(AN)]ΔT(A) + f[S(BN)]ΔT(B) = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f[S(AO)]/f[S(AN)] < f[S(BO)]/f[S(BN)] = L(B), or f[S(AO)]ΔT(A) + f[S(BO)]ΔT(B) < 0. This shows that the change in the placement of the criteria from L(A) = L(B) results in an overall decrease in hit rate, f[S(AO)]ΔT(A) + f[S(BO)]ΔT(B) < 0, and performance suffers.

Note that the variance theory has only one overall recognition criterion. However, the theory, and any theory of the mirror effect, specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can be indirectly inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes, minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance, even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization, minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56), in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Classes A and B, and the total false alarm rate is equal to the average false alarm rate in Classes A and B.

For all recognition criteria, or for all false alarm rates, performance is better or equal after normalization, as compared with performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately 0.05, and it occurs when the false alarm


rate is approximately 0.05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around 0.01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria, because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition-strength variance should be smaller than the new recognition-strength variance, because Pc(1 − Pc)/N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency

436 SIKSTROM

bution than for the high-frequency distribution This wastrue for the new and the old distributions both when at-tention was paid to high-frequency words and when at-tention was divided equally between the two classes (ex-cept in the new frequency control condition where thestandard deviations were equal)

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σ_s(BN) < σ_s(BO) < σ_s(AN) < σ_s(AO) and σ_s(BN) < σ_s(AN) < σ_s(BO) < σ_s(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σ_s(BO) < σ_s(BN) < σ_s(AN) < σ_s(AO), according to Equation 2, which to the author's knowledge has not been found in empirical data.

The variance theory predicts that strength variables such as study time, repetition, and study instructions affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral (i.e., intraitem) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune from this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item–context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; however, for nonwords it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the deblurring process is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood theory's account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes–no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σ_h′), so the subject does not need to know the class or the standard deviation of the class (σ_h). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: Subjects count the number of activated nodes minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items for addressing issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network. However, see McClelland and Chappell (1998) for a brief discussion of this topic. An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σ_p(N)] and old items [σ_p(O)] are equal. However, if the standard deviations are different, the expected number of active nodes will be

$$\frac{a\,\sigma_p(N)}{\sigma_p(N) + \sigma_p(O)}.$$

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the ROC curve is 0.8 [0.8/(1 + 0.8) ≈ 0.44]. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input solves the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξ_i = ξ_j = 1:

$$\mu_h(O) = \sum_{j=1}^{N} \Delta w_{ij}\,\xi_j = \sum_{j=1}^{N} (\xi_i - a)(\xi_j - a)\,\xi_j = a(1-a)^2 N. \qquad (A1)$$

The expected value of the net inputs for the new items iszero

$$\mu_h(N) = 0. \qquad (A2)$$

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

$$\sigma_h^2(N) = \sum_{j=1}^{N} \sum_{p=1}^{P} E\!\left[(\xi_i^p - a)^2 (\xi_j^p - a)^2\, \xi_j\right] = a^3 (1-a)^2 P N. \qquad (A3)$$

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

$$\sigma_h^2(f) = f\, a^3 (1-a)^3 N^2. \qquad (A4)$$

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

$$\sigma_h^2(L) = L\, a^3 (1-a)^3 N^2.$$

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where p patterns have been encoded, is

$$\sigma_h^2(O) = \frac{f + L}{2}\, a^3 (1-a)^3 N^2 + a^3 (1-a)^2 p N. \qquad (A5)$$
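The approximately linear growth of this variance with frequency can be checked with a small Monte Carlo sketch. The learning rule and its scaling are the same assumptions as in the reconstruction above, so only the linearity in f, not the exact constants, should be taken from it; by symmetry, cuing item nodes with a shared context shows the corresponding linearity in L.

```python
import numpy as np

rng = np.random.default_rng(1)
N, a = 200, 0.1

def net_input_variance(f, runs=2000):
    """Variance, across runs, of the net input to a single context node
    when the same item has been encoded together with f random contexts."""
    item = (rng.random(N) < a).astype(float)
    h = np.empty(runs)
    for r in range(runs):
        w = np.zeros(N)                      # weights into one context node
        for _ in range(f):
            c_i = float(rng.random() < a)    # the node's state in one context
            w += (c_i - a) * (item - a)      # covariance learning
        h[r] = w @ item                      # net input when cued with the item
    return h.var()

for f in (1, 2, 4, 8):
    print(f, round(net_input_variance(f), 2))   # grows roughly linearly with f
```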

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where P_c(O) and P_c(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that also were active at encoding (i.e., calculated as usual with Equation 7) plus the number of nodes inactive at encoding that also are inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


the old difficult class increases with the expected net input [from P(BO) = .55 and P(AO) = .56 for μ_h(O) = 0.5, to P(BO) = .89 and P(AO) = .92 for μ_h(O) = 2].

List-Length Effect

Given everything else equal, recognition from a short list has a higher hit rate and a lower false alarm rate than recognition from a long list. In the variance theory, list length is predicted to affect the standard deviation of the net input (σ_h) for both easy- and difficult-class items, so that longer lists have a larger variance than do shorter lists. The expected value of the net input is not affected by list length.

Assume that context does not change within a list but is uncorrelated between different lists. The context for a list is thus associated with as many items as there are items in the list. The variance of the net inputs to the item nodes increases when the list length is increased. The reason for this increase in variance is essentially the same as the reason that word frequency affects the variance. In the word-frequency case, the same item is associated with several contexts, and this increases the variance in the context nodes. In the list-length case, the same context is associated with several items, and this increases the variance in the item nodes. Thus, the variance of the net inputs in the item nodes will be a linear function of list length. Therefore, a long list will have a lower hit rate and a larger false alarm rate than will a short list.

ROC Curves

The percentage of nodes active at recognition is less for new items than for old items. Owing to the placement of the activation threshold, this proportion is always less than 1/2. The standard deviation of the percentage of active nodes increases as a function of the percentage of active nodes. If the percentage of active nodes is zero, the standard deviation obviously is zero. However, this standard deviation increases as the percentage of active nodes increases. This yields a smaller standard deviation for the new distribution (which is associated with a lower percentage of active nodes) as compared with the old distribution [σ_s(AO) > σ_s(AN) and σ_s(BO) > σ_s(BN)].

For the sake of understanding the model, the proportion of nodes active at encoding can be set unrealistically high, namely, to a = 1. This setting yields around 50% of these nodes being active at recognition. This parameter setting makes the standard deviations of the new and the old distributions equal [σ_s(AO) = σ_s(AN) and σ_s(BO) = σ_s(BN)]. Figure 6C shows the prediction for a = 1 (all the other parameters are identical to those in Figure 4D).

The standard deviation of recognition strength is larger for the easy class than for the difficult class [σ_s(AN) > σ_s(BN) and σ_s(AO) > σ_s(BO)], because the recognition strengths are calculated from the inverse of the standard deviation of the net inputs. Thus, when the standard deviations of the net inputs are set equal, the standard deviations of the recognition strengths, and the recognition strengths themselves, become equal. Figure 6D plots the predictions of the theory when all standard deviations of the net inputs are 1.25. The other parameters are the same as those in Figure 4D.

In Figure 4D, the four standard deviations of the recognition strengths are σ_s(AN) = 0.015, σ_s(BN) = 0.012, σ_s(BO) = 0.015, and σ_s(AO) = 0.020. The ratios of these standard deviations must follow Equation 2. This is also the case here: σ_s(BN)/σ_s(AO) = 0.61 < 0.74 = σ_s(AN)/σ_s(AO) < σ_s(BN)/σ_s(BO) = 0.78 < 0.94 = σ_s(AN)/σ_s(BO).

Changing the Recognition Criterion

The probability of a yes response (P) for the four classes depends on the recognition criterion (C). Setting C to an unbiased value of 0 yields P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74. These predicted data are prototypical of experimental data for the mirror effect.

A conservative value of the recognition criterion (C) will not yield the mirror effect. For example, C = 0.016 yields P(AN) = .03, P(BN) = .02, P(BO) = .30, and P(AO) = .43. Thus, the variance theory predicts that a conservative recognition criterion yields a higher false alarm rate for easy-class words than for difficult-class words. This prediction is supported by empirical data. Greene (1996) asked subjects to respond yes only if they were sure of their response. Consistent with the prediction, no mirror effect was found.

It follows from the ordering of the distributions in Figure 4D that the theory also predicts the experimental findings in forced-choice recognition [P(BO, BN) < P(AO, BN), P(BO, AN) < P(AO, AN), P(BN, AN) > .50, and P(AO, BO) > .50]. For the parameters above, the predictions of the theory are P(BO, BN) = .79 < .81 = P(AO, BN), P(BO, AN) = .83 < .84 = P(AO, AN), P(BN, AN) = .59 > .50, and P(AO, BO) = .57 > .50.
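These forced-choice values follow from the underlying normal distributions alone. As a check, the sketch below recovers the strength means implied by the yes rates and standard deviations quoted above (assuming only independent normal strength distributions) and evaluates the choice probabilities with SciPy's normal distribution functions.

```python
from scipy.stats import norm

# Yes rates at C = 0 and recognition-strength SDs quoted for Figure 4D.
p_yes = {"AN": .20, "BN": .25, "BO": .70, "AO": .74}
sd    = {"AN": .015, "BN": .012, "BO": .015, "AO": .020}

# Means implied by P(yes) = 1 - Phi((C - mu)/sd) with C = 0.
mu = {k: -sd[k] * norm.ppf(1 - p) for k, p in p_yes.items()}

def forced(x, y):
    """Probability of choosing x over y for independent normal strengths."""
    return norm.cdf((mu[x] - mu[y]) / (sd[x] ** 2 + sd[y] ** 2) ** 0.5)

for pair in (("BO", "BN"), ("AO", "BN"), ("BO", "AN"), ("AO", "AN"),
             ("BN", "AN"), ("AO", "BO")):
    print(pair, round(forced(*pair), 2))   # approximately .79, .81, .83, .84, .59, .57
```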

Within-List Strength Manipulation

So far, the predictions made by the variance theory are qualitatively (but not quantitatively) equal to those of the attention-likelihood theory. However, there is an exception that differentiates the variance theory from the attention-likelihood theory. The mirror effect is normally studied under experimental conditions in which the difficult and the easy classes are given the same amount of attention, for example, under conditions in which the number of presentations, the study time, and the study instructions are equal for the two classes of words. However, if the number of presentations is larger for the difficult class than for the easy class, different results emerge. Stretch and Wixted (1998b) conducted five experiments in which the basic manipulation was to present high-frequency words five times, whereas the low-frequency words were presented once. The results did not show a mirror effect, because the hit rates for the high-frequency words were higher than those for the low-frequency words. However, increasing the number of presentations for the high-frequency words did not affect the false alarm rate, so both the false alarm rate and the hit rate were larger for high-frequency words.

The variance theory accounts for this new finding in the following way. The theory assumes that a longer presentation time or a larger presentation frequency increases the net inputs of the old items [μ_h(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage to old high-frequency items when attention is focused on these items) must be larger than the effect of the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage to old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, as compared with low-frequency items.

It is apparent from the density of net inputs (Figure 7A) that the density of recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μ_h(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes 0.75 rather than 0.50. The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 6B: P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74].

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the high-frequency class (BO) is shifted to the right because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.
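The sketch below reproduces this pattern from Equations 6 and 7 (introduced in the Analytic Solutions section). The net-input standard deviations are assumptions: the text fixes the ratio σ_h(B)/σ_h(A) at 1.25, and σ_h(A) = 1.25 is inferred here from the reported probabilities, so the values should be read as illustrative.

```python
from scipy.stats import norm

# Figure 4D parameters, except that mu_h(BO) = 2 (attention on class B).
# The sigma_h values are inferred assumptions (ratio 1.25, sigma_h(A) = 1.25).
a, n2, C = 0.10, 100, 0.0
mu_h    = {"AN": 0.0, "BN": 0.0, "AO": 1.0, "BO": 2.0}
sigma_h = {"AN": 1.25, "BN": 1.5625, "AO": 1.25, "BO": 1.5625}
T = sum(mu_h.values()) / 4      # average of expected old and new net inputs: 0.75

def p_yes(k):
    pc = a * norm.sf((T - mu_h[k]) / sigma_h[k])          # Equation 6
    mu_s = (pc - a / 2) / sigma_h[k]
    sd_s = (pc * (1 - pc) / n2) ** 0.5 / sigma_h[k]       # Equation 7
    return norm.sf((C - mu_s) / sd_s)

for k in ("AN", "BN", "BO", "AO"):
    print(k, round(p_yes(k), 2))   # close to the quoted .08, .14, .86, .63
```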

Furthermore, in the variance theory, the ratio of the recognition-strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs do not depend on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σ_s(AN) = 0.013, σ_s(BN) = 0.011, σ_s(BO) = 0.017, and σ_s(AO) = 0.019. These results are similar to the results when the same amount of attention is paid to the two classes, in Figure 4D: σ_s(AN) = 0.015, σ_s(BN) = 0.012, σ_s(BO) = 0.015, and σ_s(AO) = 0.020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that, if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves. The standard deviation of the hit-rate distribution for the high-frequency words would be larger than that for the low-frequency words. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated in an experiment. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the number of presentations was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years, ranging from 18 to 29 years old.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether the word was presented in the list or not. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition, followed by the presentation time condition. Nine subjects were given the control condition, followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The hit rate was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate for the low-frequency words was larger. One-tailed paired t tests over the performance of each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1-5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slope of the linear regression curve between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σ_s(BO)/σ_s(AO)], and similarly the slope for the false alarms [σ_s(BN)/σ_s(AN)], are shown in the last two rows of Table 1.
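A sketch of this slope computation follows. The means and standard deviations are hypothetical values used only to generate smooth cumulative rates; with real data, the observed rates at each confidence criterion would be z-transformed instead.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical old-item strength distributions for the low-frequency (A)
# and high-frequency (B) classes, and five confidence criteria.
mu_a, sd_a = 0.75, 1.00
mu_b, sd_b = 0.55, 0.80
criteria = np.linspace(-0.8, 0.8, 5)

hit_a = norm.sf((criteria - mu_a) / sd_a)   # cumulative hit rates, class A
hit_b = norm.sf((criteria - mu_b) / sd_b)   # cumulative hit rates, class B

# Regressing one z-transformed hit rate on the other recovers the ratio of
# the underlying standard deviations, here ss(BO)/ss(AO) = 0.80.
slope, _ = np.polyfit(norm.ppf(hit_b), norm.ppf(hit_a), 1)
print(round(slope, 2))
```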

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear. It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items should be higher at encoding but lower at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is also fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory, by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

$$\sum_{i=1}^{8} \frac{(\mathrm{Observed}_i - \mathrm{Predicted}_i)^2}{\sigma_i^2}.$$

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were N = 1000 and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σ_h(A)], and the standard deviation of the net inputs for the difficult-class words [σ_h(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μ_h(N) = 0, μ_h(O) = 1, and C = 0]. The empirical standard deviations (σ_i) were not reported in Glanzer et al. (1993), so these parameters were set to one.
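A sketch of this fitting procedure, restricted to the four yes rates for brevity, is given below. The starting values and the choice of SciPy's Nelder–Mead minimizer are assumptions; the optimizer used in the original fits is not specified.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Illustrative observed yes rates [P(AN), P(BN), P(BO), P(AO)]. The paper
# also fits the four slope ratios; they are omitted here for brevity, and
# the empirical variances are set to one, as in the text.
obs = np.array([.20, .25, .70, .74])

def yes_rate(mu, sh, a, n2=100, T=0.5, C=0.0):
    pc = a * norm.sf((T - mu) / sh)              # Equation 6
    mu_s = (pc - a / 2) / sh
    sd_s = (pc * (1 - pc) / n2) ** 0.5 / sh      # Equation 7
    return norm.sf((C - mu_s) / sd_s)

def loss(params):
    a, sh_a, sh_b = params
    if a <= 0 or a >= 1 or sh_a <= 0 or sh_b <= 0:
        return 1e6                               # keep the search in range
    pred = np.array([yes_rate(0, sh_a, a), yes_rate(0, sh_b, a),
                     yes_rate(1, sh_b, a), yes_rate(1, sh_a, a)])
    return np.sum((obs - pred) ** 2)             # squared error with sigma_i = 1

fit = minimize(loss, x0=[0.10, 1.2, 1.5], method="Nelder-Mead")
print(fit.x)                                     # fitted a, sigma_h(A), sigma_h(B)
```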

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slope. The attention-likelihood theory accounts for 84% of the variance of the slope. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slope, as compared with the attention-likelihood theory, when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σ_h(B)/σ_h(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted with a single variable, namely, the standard deviation of the net inputs for the easy class [σ_h(A)]. The activity level was fixed to 0.10, and the ratio of the standard deviations of the net inputs, σ_h(B)/σ_h(A), to 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities. Figure 8B shows the corresponding results for the slope. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes, using a single parameter, equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities, using one parameter, is slightly less than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σ_h(A)] and the standard deviation of the net inputs for the difficult class [σ_h(B)], can also be expressed as functions of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μ_h); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μ_h(AN) = μ_h(BN) and μ_h(AO) = μ_h(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σ_h(A) and σ_h(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (P_c) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σ_h(A), σ_h(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (P_c) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μ_h) and the standard deviation of the net inputs (σ_h). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. P_c is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (P_c) that a node is active at recognition is

$$P_c = a \int_T^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_h}\, e^{-(h-\mu_h)^2 / 2\sigma_h^2}\, dh. \qquad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σ_h), calculates the expected recognition strength (μ_s):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σ_h) as an approximation of the standard deviation of the item (σ_h′), because this simplifies the analytic solution; however, the variance theory, or the simulation, uses the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σ_s) is calculated from σ_h, P_c, and N. There are 2N nodes in the context and the item layers. The distribution of P_c is binomial but can, for a certain criterion [i.e., 2N P_c(1 − P_c) > 10], be approximated with a normal distribution with a standard deviation of [P_c(1 − P_c)/2N]^{1/2}. The final result is scaled by the normalization factor 1/σ_h:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c (1 - P_c)}{2N}\right]^{1/2}. \qquad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion, by integrating the density of the recognition strength over a normal distribution:

$$P(Y) = \int_C^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_s}\, e^{-(s-\mu_s)^2 / 2\sigma_s^2}\, ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μ_s(A)] and B [μ_s(B)] and the standard deviations of the recognition strengths of A [σ_s(A)] and B [σ_s(B)]:

$$P(A, B) = \frac{1}{\sqrt{2\pi}} \int_{-[\mu_s(A)-\mu_s(B)]/[\sigma_s^2(A)+\sigma_s^2(B)]^{1/2}}^{\infty} e^{-s^2/2}\, ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
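In the same spirit as that spreadsheet, the sketch below evaluates Equations 6 and 7 and the two integrals above for the Figure 4D parameters. The net-input standard deviations σ_h(A) = 1.25 and σ_h(B) = 1.5625 are inferred assumptions (only their ratio, 1.25, is given in the text), so the printed values are approximate.

```python
from scipy.stats import norm

A, N2, C = 0.10, 100, 0.0                # activity, node count 2N, criterion
MU_H    = {"AN": 0.0, "BN": 0.0, "AO": 1.0, "BO": 1.0}
SIGMA_H = {"AN": 1.25, "BN": 1.5625, "AO": 1.25, "BO": 1.5625}  # assumed
T = 0.5                                  # midpoint of new (0) and old (1) means

def moments(k):
    pc = A * norm.sf((T - MU_H[k]) / SIGMA_H[k])             # Equation 6
    mu_s = (pc - A / 2) / SIGMA_H[k]
    sd_s = (pc * (1 - pc) / N2) ** 0.5 / SIGMA_H[k]          # Equation 7
    return mu_s, sd_s

def p_yes(k):
    mu_s, sd_s = moments(k)
    return norm.sf((C - mu_s) / sd_s)                        # P(Y)

def p_forced(x, y):
    (mx, sx), (my, sy) = moments(x), moments(y)
    return norm.cdf((mx - my) / (sx ** 2 + sy ** 2) ** 0.5)  # P(x, y)

for k in ("AN", "BN", "BO", "AO"):
    print(k, round(p_yes(k), 2))        # close to .20, .25, .70, .74
print(round(p_forced("AO", "BO"), 2))   # close to .57
```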



Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is, in the model, calculated with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (P_c) is assumed to be low. By approximating 1 − P_c to one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σ_s² ∝ P_c). This approximation suggests a very simple interpretation of the slope of the z-ROC. The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations:

$$\frac{\sigma_s^2(N)}{\sigma_s^2(O)} \approx \frac{P_c(N)}{P_c(O)}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.8²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal and the numbers of active nodes in the old distributions are approximately equal [P_c(AN) ≈ P_c(BN) and P_c(BO) ≈ P_c(AO)]. Given these approximations and the approximation above (1 − P_c ≈ 1), the recognition-strength standard deviation is inversely related to the standard deviation of the net inputs in the following way. The ratio between the recognition-strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition-strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition-strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions:

$$\frac{\sigma_s(BO)}{\sigma_s(AO)} \le \frac{\sigma_h(A)}{\sigma_h(B)} \le \frac{\sigma_s(BN)}{\sigma_s(AN)}.$$

This suggests that the ratio between the recognition-strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that the performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of active nodes, setting the activation threshold between the old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that an optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytical equations above, we find the following expression:

$$d' = \frac{\mu_s(O) - \mu_s(N)}{\sigma_s(NO)} = \frac{P_c(O) - P_c(N)}{\left[P_c(NO)\,(1 - P_c(NO))/2N\right]^{1/2}}. \qquad (8)$$

Because σ_s(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strengths, σ_s(NO), in the denominator of this equation. Thus, P_c(NO) is equal to [P_c(N) + P_c(O)]/2. P_c was calculated with Equation 6. The expected value of the net inputs and the standard deviation of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.
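As a numerical illustration of Equation 8, the sketch below evaluates d′ as a function of the activation threshold T, anticipating the threshold analysis below. The expected old net input of 1.42 is the value quoted below, but the net-input standard deviations and the activity level are assumptions, so the location of the optimum is only qualitative.

```python
import numpy as np
from scipy.stats import norm

# Illustrative values: mu_h(O) = 1.42 is quoted below; the net-input SDs
# and the activity level are assumptions made for this sketch.
a, n2 = 0.052, 60                      # activity and total node count (2N, N = 30)
mu_n, mu_o, sd_n, sd_o = 0.0, 1.42, 0.62, 0.66

def d_prime(T):
    pc_n = a * norm.sf((T - mu_n) / sd_n)      # Equation 6, new items
    pc_o = a * norm.sf((T - mu_o) / sd_o)      # Equation 6, old items
    pc = (pc_n + pc_o) / 2                     # pooled P_c(NO), as in Equation 8
    return (pc_o - pc_n) / (pc * (1 - pc) / n2) ** 0.5

ts = np.linspace(0.0, mu_o, 500)
print(ts[np.argmax([d_prime(t) for t in ts])])  # interior optimum, above the midpoint
```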

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding


(a). The results show that d′ is optimal for a = .052. The d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases. For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. Thus, for a small a, errors are more likely to occur, because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed. Thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to their optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was also calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that if the standard deviations of the old and the new distributions are equal, optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smaller standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these densities is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
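A quick numerical check of this claim, with illustrative unequal-variance normal distributions (the parameter values below are not from the paper):

```python
import numpy as np
from scipy.stats import norm

# Illustrative unequal-variance old/new strength distributions.
mu_n, sd_n, mu_o, sd_o = 0.0, 1.0, 1.0, 1.3

s = np.linspace(-4.0, 6.0, 100_001)
hits_minus_fa = norm.sf(s, mu_o, sd_o) - norm.sf(s, mu_n, sd_n)
c_star = s[np.argmax(hits_minus_fa)]  # criterion maximizing hits - FAs

# The likelihood ratio at that criterion is (numerically) one,
# i.e., the optimum sits where the two densities intersect.
print(c_star, norm.pdf(c_star, mu_o, sd_o) / norm.pdf(c_star, mu_n, sd_n))
```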

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes is optimal.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding; the dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ_h(N)], the rightmost arrow points at the expected net input of the old items [μ_h(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A; optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretically optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.

This problem may be stated formally as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the total hit rate [P(AO) + P(BO)] is maximized? The solution to this problem is surprisingly simple: Optimal performance occurs when the maximum-likelihood ratios of the two classes are equal,

L(A) = f [S(AO)]/f [S(AN)] = f [S(BO)]/f [S(BN)] = L(B).

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be increased to keep the total false alarm rate constant; according to the formulation of the problem, the changes in the two false alarm rates must cancel, ΔP(AN) + ΔP(BN) = 0. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, ΔL(A) < 0 < ΔL(B) when the recognition criteria are changed as specified above. Because the criteria started at the point where L(A) = L(B), the hits gained for Class A are bought at a lower likelihood ratio than the hits lost for Class B, so the change results in an overall decrease in the total hit rate, ΔP(AO) + ΔP(BO) < 0, and performance suffers.
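The same condition follows in one line from a Lagrange multiplier. Writing the hit and false alarm rates as upper-tail integrals of the strength densities, a sketch of the constrained maximization is:

```latex
\Lambda = P(AO) + P(BO) - \lambda\,\bigl[P(AN) + P(BN)\bigr],
\qquad P(XO) = \int_{T(X)}^{\infty} f[S(XO)]\,ds,

\frac{\partial \Lambda}{\partial T(A)} = -f[S(AO)] + \lambda f[S(AN)] = 0,
\qquad
\frac{\partial \Lambda}{\partial T(B)} = -f[S(BO)] + \lambda f[S(BN)] = 0,
```

so L(A) = L(B) = λ at the optimum: both classes operate at the same likelihood ratio.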

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can be inferred indirectly from the formulation of the theory. In the variance theory, this is done by the normalization factor, which scales the recognition strength differently for the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A as the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the straight line. The results before normalization (i.e., counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes, minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which performance is optimal is when the recognition criterion is unbiased; at this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance, even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as near optimal for a wide range of recognition criteria.

Overall performance after normalization can be compared directly with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56), in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A equals the number of items in Class B. The total hit rate is the average hit rate of Class A and Class B, and the total false alarm rate is the average false alarm rate of Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage there is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data on Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce the appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution approaches the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes, rather than only the nodes active during encoding, for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., P_c > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because P_c(1 − P_c)/N decreases as P_c increases beyond .50 (see the expression following this list). This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.
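The variance claim in point 3 is just the downslope of the binomial variance function:

```latex
\operatorname{Var}(P_c) = \frac{P_c\,(1 - P_c)}{N},
\qquad
\frac{d}{dP_c}\,\bigl[P_c(1 - P_c)\bigr] = 1 - 2P_c < 0
\quad \text{for } P_c > \tfrac{1}{2},
```

so if guessing already puts P_c above one half, the old (higher-P_c) distribution would have the smaller variance, contrary to the data.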

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data; if the model is made nonoptimal, it does not. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active for old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated for the old distributions. The standard deviations of the easy class distributions are predicted to be larger than those of the difficult class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σ_s(BN) < σ_s(BO) < σ_s(AN) < σ_s(AO) and σ_s(BN) < σ_s(AN) < σ_s(BO) < σ_s(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σ_s(BO) < σ_s(BN) < σ_s(AN) < σ_s(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism operates on the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the deblurring process is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance for nonwords as compared with words. It is also a tentative account of why nonwords can be treated, within the variance theory, as a difficult class with a higher variability than words. However, further work is needed before any firm conclusion can be drawn regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from the concrete class are more easily discriminated than words from the abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. One possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, the features of abstract words may be of a higher frequency than the features of concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition, but it may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length or the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σ_h′), so the subject does not need to know the class or the standard deviation of the class (σ_h). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves; instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland & Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new items [σ_p(N)] and old items [σ_p(O)] are equal. However, if the standard deviations are different, the expected number of active nodes will be

a \, \frac{\sigma_p(N)}{\sigma_p(N) + \sigma_p(O)} .

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around .44a if the slope of the z-ROC curve is .8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξ_i = ξ_j = 1:

\mu_h(O) = \sum_{j=1}^{N} \Delta w_{ij}\,\xi_j = \sum_j (\xi_i - a)(\xi_j - a) = aN(1-a)^2 . \qquad (A1)

The expected value of the net input for new items is zero:

\mu_h(N) = 0 . \qquad (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

\sigma_h^2(N) = \sum_{p=1}^{P} \sum_{j=1}^{N} E\bigl[(\xi_i^p - a)^2 (\xi_j^p - a)^2\bigr]\,\xi_j = PaN\,[a(1-a)]^2 = PNa^3(1-a)^2 , \qquad (A3)

where E[(\xi - a)^2] = a(1-a)^2 + (1-a)a^2 = a(1-a).

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item ( f ). All the aN weights supporting a context node contribute to the net input in the same direction:

\sigma_h^2(f) = f a^4 (1-a)^2 N^2 . \qquad (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

\sigma_h^2(L) = L a^4 (1-a)^2 N^2 .

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L, in a network where P patterns have been encoded, is

\sigma_h^2(O) = \tfrac{1}{2}(f + L)\,a^4 (1-a)^2 N^2 + PNa^3(1-a)^2 . \qquad (A5)
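A small Monte Carlo check of Equations A1 and A3 is sketched below. This is an illustration, not the paper's simulation code: it assumes the Hebbian covariance rule Δw_ij = (ξ_i − a)(ξ_j − a) with self-connections excluded, and it compares empirical statistics of the net inputs with the analytic expressions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, a, P = 200, 0.1, 50  # nodes per layer, activity level, stored patterns

# Store P random sparse patterns with the assumed Hebbian covariance rule.
patterns = (rng.random((P, N)) < a).astype(float)
W = np.zeros((N, N))
for xi in patterns:
    W += np.outer(xi - a, xi - a)
np.fill_diagonal(W, 0.0)  # no self-connections

old = patterns[0]                        # a stored (old) pattern as cue
new = (rng.random(N) < a).astype(float)  # a random (new) pattern as cue

h_old = W @ old
h_new = W @ new

print("old net input, active nodes:", h_old[old == 1].mean(),
      "~ aN(1-a)^2 =", a * N * (1 - a) ** 2)       # compare Eq. A1
print("new net input variance:", h_new.var(),
      "~ PNa^3(1-a)^2 =", P * N * a ** 3 * (1 - a) ** 2)  # compare Eq. A3
```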

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on the nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the proportions of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)



A longer presentation time or a larger presentation frequency increases the net inputs of the old items [μ_h(O)]. This is illustrated in Figure 7A (compare with Figure 4A, where the same amount of attention is paid to the two classes). If the net inputs for old high-frequency items are increased sufficiently, the percentage of active nodes will be larger than that for old low-frequency items. For this to occur, the effect of the increase in net input (which gives the advantage to old high-frequency items when attention is focused on these items) must be larger than the effect of the larger standard deviation of the net inputs for old high-frequency items (which gives the advantage to old low-frequency items when the same attention is paid to the two classes). This increase in the percentage of active nodes yields a higher hit rate for high-frequency items than for low-frequency items.

However, it will not significantly change the false alarm rates, which are larger for high-frequency items than for low-frequency items. Therefore, the variance theory predicts no mirror effect when high-frequency items are presented sufficiently more often, or with a sufficiently longer presentation time, than low-frequency items.

It is apparent from the density of the net inputs (Figure 7A) that the density of the recognition strengths (Figure 7B) will not show a mirror effect (i.e., because the percentage of active nodes is larger for high- than for low-frequency old items). The parameters used in these figures are identical to the parameters used for the standard mirror effect in Figures 4A and 4D, with the exception that the expected net input of the old high-frequency items [μ_h(BO)] is 2 rather than 1. Consequently, to optimize performance, the activation threshold becomes .75 rather than .50.

Figure 7. (A) The probability density of the net inputs in the variance theory when attention is focused on the high-frequency class. The horizontal axis shows the net inputs, and the vertical axis the probability density of the net inputs. The expected value of the high-frequency class (BO) is shifted to the right, because attention is focused on this class. The dotted vertical line is the activation threshold. (B) The predictions of the variance theory when subjects focus their attention on high-frequency words. The horizontal axis shows the recognition strength, and the vertical axis the probability density.

The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 6B: P(AN) = .20, P(BN) = .25, P(BO) = .70, and P(AO) = .74].

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs do not depend on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σ_s(AN) = 0.013, σ_s(BN) = 0.011, σ_s(BO) = 0.017, and σ_s(AO) = 0.019. These results are similar to the results when the same amount of attention is paid to the two classes, as in Figure 4D: σ_s(AN) = 0.015, σ_s(BN) = 0.012, σ_s(BO) = 0.015, and σ_s(AO) = 0.020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect in which the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit-rate distribution for the high-frequency words would be larger than that for the low-frequency words. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the presentation frequency was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years (range, 18-29).

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether or not the word had been presented in the list. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of the recognition test was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list, and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition. Nine subjects were given the control condition followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The hit rate was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the hit rate for the low-frequency words was larger. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = .004, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = .004, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = .003, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = .003, p = .048 < .05], and not in the presentation time condition [t(11) = 1.5, MSe = .001, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = .002, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words; thus, there was no mirror effect. The prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows. The hit and false alarm rates for confidence ratings 1-5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were then conducted. The slopes of the linear regression curves relating the z-transformed hit rate of the low-frequency words to the z-transformed hit rate of the high-frequency words [σ_s(BO)/σ_s(AO)], and similarly the slopes for the false alarms [σ_s(BN)/σ_s(AN)], are shown in the last two rows of Table 1.
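A minimal sketch of this computation; the rating vectors below are hypothetical, and the sketch assumes that every cumulative rate is strictly between 0 and 1 (degenerate rates would make the z-transform undefined).

```python
import numpy as np
from scipy.stats import norm

def z_rates(conf, levels=(2, 3, 4, 5)):
    """z-transformed cumulative yes rates; a rating >= k counts as a yes."""
    conf = np.asarray(conf)
    return norm.ppf([(conf >= k).mean() for k in levels])

def sd_ratio(conf_low, conf_high):
    """Slope of z(low-frequency rate) regressed on z(high-frequency rate);
    estimates a standard-deviation ratio such as ss(BO)/ss(AO)."""
    slope, _ = np.polyfit(z_rates(conf_high), z_rates(conf_low), 1)
    return slope

# Hypothetical confidence ratings for old low- and high-frequency words.
rng = np.random.default_rng(1)
low_old = rng.integers(1, 6, 200)
high_old = rng.integers(1, 6, 200)
print(sd_ratio(low_old, high_old))
```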

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but they were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with the results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect even when p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear: It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items should be lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is fitted to the same set of data as that used by Glanzer et al. (1993). This allows a direct evaluation of the variance theory, by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

\sum_{i=1}^{8} \frac{(\mathrm{Observed}_i - \mathrm{Predicted}_i)^2}{\sigma_i^2} .

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit: N = 1000 and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σ_h(A)], and the standard deviation of the net inputs for the difficult-class words [σ_h(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μ_h(N) = 0, μ_h(O) = 1, and C = 0]. The empirical standard deviations (σ_i) were not reported in Glanzer et al. (1993), so these parameters were set to one.
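A sketch of this kind of fit, assuming the analytic yes probability from Equations 6 and 7 (presented in the next section), the Figure 4D constants [2N = 100, μ_h(N) = 0, μ_h(O) = 1, C = 0, a = .10, σ_h(B)/σ_h(A) = 1.25], observed values borrowed from the Figure 6B example, and σ_i = 1; only σ_h(A) is free here, so this is an illustration rather than a reproduction of the nine-condition fit:

```python
import math
import numpy as np
from scipy.optimize import minimize_scalar

def p_yes(mu_h, sigma_h, a=0.10, N=50, T=0.5, C=0.0):
    # Analytic yes probability: Eq. 6 gives Pc, Eq. 7 the strength moments.
    pc = a * 0.5 * math.erfc((T - mu_h) / (sigma_h * math.sqrt(2)))
    mu_s = (pc - a / 2) / sigma_h
    sigma_s = math.sqrt(pc * (1 - pc) / (2 * N)) / sigma_h
    return 0.5 * math.erfc((C - mu_s) / (sigma_s * math.sqrt(2)))

observed = np.array([0.20, 0.25, 0.70, 0.74])  # P(AN), P(BN), P(BO), P(AO)

def loss(sig_a):
    pred = [p_yes(0.0, sig_a), p_yes(0.0, 1.25 * sig_a),
            p_yes(1.0, 1.25 * sig_a), p_yes(1.0, sig_a)]
    return float(np.sum((observed - np.array(pred)) ** 2))  # sigma_i = 1

fit = minimize_scalar(loss, bounds=(0.5, 3.0), method="bounded")
print("fitted sigma_h(A):", round(fit.x, 3), " loss:", round(fit.fun, 4))
```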

The results show that the variance theory accounts for 98% (r²) of the variance of the probabilities; the attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance of the slopes, whereas the attention-likelihood theory accounts for 84%. Thus, with three fitted parameters, the variance theory accounted for the same amount of variance for the probabilities, and for more variance for the slopes, as compared with the attention-likelihood theory.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σ_h(B)/σ_h(A)] may also be conceived of as constant, given that the same material is used in the different conditions. Therefore, the variance theory was also fitted with a single variable, namely, the standard deviation of the net inputs for the easy class [σ_h(A)]. The activity level was fixed at .10, and the ratio of the standard deviations of the net inputs, σ_h(B)/σ_h(A), at 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities. Figure 8B shows the corresponding results for the slopes. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes with a single parameter as well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities, using one parameter, is slightly below the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σ_h(A)] and the standard deviation of the net inputs for the difficult class [σ_h(B)], can also be expressed in terms of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μ_h); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μ_h(AN) = μ_h(BN) and μ_h(AO) = μ_h(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σ_h(A) and σ_h(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (P_c) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σ_h(A), σ_h(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the ac-tivation threshold (T ) for nodes active during encodingcan be explicitly solved from the expected net inputs(mh) and the standard deviation of the net inputs (sh)This probability is dependent on the distribution of thenet inputs which can be approximated with a normaldistribution Pc is solved by integrating the net inputsfrom mh T to infinity (yen) over the probability densityfunction for a normal distribution Thus the probability(Pc) that a node is active at recognition is

(6)

Subtracting the expected percentage of active nodes atrecognition (a2 see note 1) from the percentage of ac-tive nodes and dividing by the standard deviation of thenet inputs (sh) calculates the expected recognitionstrength (ms)

Note that the analytic solution uses the standard devi-ation of the class (sh) as an approximation of the stan-dard deviation of the item (shcent ) because it simplifies theanalytic solution however the variance theory or thesimulation uses the standard deviation of the item Thisapproximation is good when there are a large number offeatures however for a small number of features thevariance of feature strength for a single item may fluctu-ate on an item-to-item basis around the variance of thenet inputs for all the items

The standard deviation of the recognition strength (ss)is calculated from sh Pc and N There is 2N number ofnodes in the context and the item layers The distributionof Pc is binomial but can for a certain criterion [ie 2NPc(1 Pc) gt 10] be approximated with a normal distri-bution with a standard deviation of [Pc(1 Pc) 2N]12The final result is scaled by the normalization factor 1sh

(7)

A yes response occurs if the recognition strength isabove the recognition criterion (C) The probability of ayes response [P(Y)] is calculated from the expected recog-nition strength the variance of the recognition strengthand the recognition criterion by integrating the density ofthe recognition strength over a normal distribution

The probability of choosing A over B in a two-choiceforced recognition test [P(A B)] is calculated from theexpected recognition strength of A [ms(A)] and B [ms(B)]and the standard deviations of the recognition strengthof A [ss(A)] and B [ss(B)]

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
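The same computations are easy to reproduce outside a spreadsheet. The following minimal Python sketch implements Equations 6 and 7 and P(Y) as reconstructed above; the default parameter values are illustrative (they mirror the Figure 4D settings quoted later in the text) rather than part of the original article.

```python
# Sketch of the analytic solution: Equation 6 (Pc), the expected strength
# (Pc - a/2)/sigma_h, Equation 7 (sigma_s), and P(Y).
from scipy.stats import norm

def p_yes(mu_h, sigma_h, N=50, a=0.10, T=0.5, C=0.0):
    """Probability of a yes response for one class of items."""
    p_c = a * norm.sf(T, loc=mu_h, scale=sigma_h)               # Equation 6
    mu_s = (p_c - a / 2.0) / sigma_h                            # expected strength
    sigma_s = (p_c * (1.0 - p_c) / (2.0 * N)) ** 0.5 / sigma_h  # Equation 7
    return norm.sf(C, loc=mu_s, scale=sigma_s)                  # P(Y)

# Example: with mu_h(O) = 1, sigma_h(A) = 1.25, and sigma_h(B) = 1.56,
# the four yes-rates come out near .20, .25, .70, and .74.
for label, mu, sh in [("AN", 0.0, 1.25), ("BN", 0.0, 1.56),
                      ("BO", 1.0, 1.56), ("AO", 1.0, 1.25)]:
    print(label, round(p_yes(mu, sh), 2))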


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is, in the model, calculated with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc to one, the variance of the recognition strength can be simplified to

\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

\frac{\sigma_s^2(N)}{\sigma_s^2(O)} \approx \frac{P_c(N)}{P_c(O)}.

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is .80, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is .64 (= .80²).

Another approximation useful for understanding the model is that, for two classes of items, the number of active nodes in the new distributions is approximately equal and the number of active nodes in the old distributions is approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions.

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class:

\frac{\sigma_s(BO)}{\sigma_s(AO)} \le \frac{\sigma_h(A)}{\sigma_h(B)} \approx \frac{\sigma_s(BN)}{\sigma_s(AN)}.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield a good performance in the model are a low percentage of nodes active, setting the activation threshold between old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that an optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytic equations above, we find the following expression:

d' = \frac{\mu_s(O) - \mu_s(N)}{\sigma_s(NO)} = \left[P_c(O) - P_c(N)\right] \left[\frac{P_c(NO)\,(1 - P_c(NO))}{2N}\right]^{-1/2}.  (8)

Because σs(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2. Pc(·) was calculated with Equation 6. The expected value of the net inputs and the standard deviation of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.
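The scan over a that produces the solid line in Figure 9A can be approximated with the analytic equations. In the sketch below, the net-input moments use the Appendix forms as reconstructed in this version (Equations A1 and A3, with f = 0 and L = 1), so the resulting numbers are an approximation, not the article's exact curve.

```python
# Scan the encoding activity a and evaluate d' (Equation 8), with the
# activation threshold midway between the expected new and old net inputs.
import numpy as np
from scipy.stats import norm

N, p = 30, 100  # nodes per layer and stored patterns, as in the text

def d_prime(a):
    mu_old = a * (1.0 - a) ** 2 * N                     # Equation A1 (reconstructed)
    sigma_h = (p * N * a ** 3 * (1.0 - a) ** 2) ** 0.5  # Equation A3 (reconstructed)
    T = mu_old / 2.0                                    # midpoint threshold
    pc_new = a * norm.sf(T, loc=0.0, scale=sigma_h)     # Equation 6
    pc_old = a * norm.sf(T, loc=mu_old, scale=sigma_h)
    pc_no = (pc_new + pc_old) / 2.0                     # Pc(NO)
    return (pc_old - pc_new) / (pc_no * (1.0 - pc_no) / (2 * N)) ** 0.5

a_grid = np.linspace(0.01, 0.50, 100)
a_best = a_grid[np.argmax([d_prime(a) for a in a_grid])]  # optimum at a low activity
```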

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding


(a). The results show that d′ is optimal for a = .052. The d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases. For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. Thus, for a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium-low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed. Thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may

deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and the performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.
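With the Hebbian weight change used in the Appendix, Δwij ∝ (ξi − a)(ξj − a) (as reconstructed there), the three node-pair cases make this argument concrete:

\[
\Delta w_{ij} \propto
\begin{cases}
(1-a)^2 & \text{both nodes active},\\
-a(1-a) & \text{one node active, one inactive},\\
a^2 & \text{both nodes inactive}.
\end{cases}
\]

For a = 1/2 all three magnitudes equal 1/4, whereas for small a the changes on active nodes dominate those on inactive nodes by roughly a factor of (1 − a)/a.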

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, the optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because, if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition

criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f[S(O)] denotes the density of the recognition strength of the old distribution, and f[S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f[S(O)]/f[S(N)], and the optimal performance occurs when this ratio is equal to one (L = 1).
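For normal strength distributions, this intersection point can be written out explicitly. The short derivation below is an added illustration in the notation above, not taken from the original article: setting f[S(O)] = f[S(N)] and taking logarithms gives

\[
\frac{[C-\mu_s(N)]^2}{\sigma_s^2(N)} - \frac{[C-\mu_s(O)]^2}{\sigma_s^2(O)} = 2\ln\frac{\sigma_s(O)}{\sigma_s(N)},
\]

a quadratic whose root between the two means is the optimal criterion; for equal standard deviations it reduces to the midpoint C = [μs(N) + μs(O)]/2.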

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. An optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretically optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: Optimal performance occurs when the maximum likelihoods of the two classes are equal,

L(A) = \frac{f[S(AO)]}{f[S(AN)]} = \frac{f[S(BO)]}{f[S(BN)]} = L(B).

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be increased to keep the total false alarm rate constant. According to the formulation of the problem, the changes in the two false alarm rates must cancel [ΔP(AN) + ΔP(BN) = 0]. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, after such a change, L(A) = f[S(AO)]/f[S(AN)] < f[S(BO)]/f[S(BN)] = L(B), and the change in the hit rates is negative [ΔP(AO) + ΔP(BO) < 0]. This shows that moving the placement of the criteria away from L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.
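The rule is easy to verify numerically. The sketch below, with illustrative (not fitted) distribution parameters, picks per-class criteria with equal likelihood ratios under a fixed total false alarm rate; shifting either criterion away from this solution only lowers the summed hit rate.

```python
# Numeric check of the equal-likelihood-ratio rule: choose per-class criteria
# c_A, c_B with L_A(c_A) = L_B(c_B) while the total false alarm rate is fixed.
# The (mu_old, sd_new, sd_old) values are illustrative placeholders.
from scipy.stats import norm
from scipy.optimize import brentq

classes = {"A": (0.74, 1.0, 1.2), "B": (0.45, 1.0, 1.1)}

def criterion(lam, mu_o, s_n, s_o):
    """Criterion c where f[S(O)](c) / f[S(N)](c) equals lam."""
    c0 = -mu_o * s_n ** 2 / (s_o ** 2 - s_n ** 2)  # minimum of the ratio
    g = lambda c: norm.pdf(c, mu_o, s_o) / norm.pdf(c, 0.0, s_n) - lam
    return brentq(g, c0, 20.0)                     # rising branch of the ratio

def total_fa(lam):
    return sum(norm.sf(criterion(lam, *prm), 0.0, prm[1])
               for prm in classes.values()) / 2.0

lam_star = brentq(lambda l: total_fa(l) - 0.25, 0.6, 5.0)  # total FA fixed at .25
```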

Note that the variance theory has only one overall recognition criterion. However, the theory, and any theory of the mirror effect, specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can indirectly be inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes, minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance, even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization, minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate in Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than, or equal to, performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables such as study time, repetition, and study instructions affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune from this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; however, for nonwords, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of the recognition strength, the standard deviation of the net input of the item is used (σh′), so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network. However, see McClelland and Chappell (1998) for a brief discussion of this topic. An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.
Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.
Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.
Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.
Glanzer, M., Kisok, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.
Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.
Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.
Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.
Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.
Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.
McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.
Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.
Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.
Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.
Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.
Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå (ISBN 91-7191-155-3).
Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.
Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.
Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.
Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

\frac{a\,\sigma_p(N)}{\sigma_p(N) + \sigma_p(O)}.

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the ROC curve is .8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input solves the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξi = ξj = 1.

\mu_h(O) = \sum_j \Delta w_{ij}\,\xi_j = \sum_j (\xi_i - a)(\xi_j - a)\,\xi_j = a(1-a)^2 N.  (A1)

The expected value of the net inputs for the new items is zero:

\mu_h(N) = 0.  (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let p represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

\sigma_h^2(N) = \sum_j \Bigl(\sum_i \Delta w_{ij}\,\xi_i\Bigr)^2 = p\,aN\,[a(1-a)]^2 = pNa^3(1-a)^2.  (A3)

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

\sigma_h^2(f) = \sum_j \Bigl(\sum_i \Delta w_{ij}\,\xi_i\Bigr)^2 = f\,[aN(1-a)]^2\,a(1-a) = f\,a^3(1-a)^3 N^2.  (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

\sigma_h^2(L) = L\,a^3(1-a)^3 N^2.

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L, in a network where p patterns have been encoded, is

\sigma_h^2(O) = \frac{(f + L)\,a^3(1-a)^3 N^2}{2} + pNa^3(1-a)^2.  (A5)

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the proportions of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).
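A minimal sketch of this all-nodes count, under the moment forms reconstructed above (the inactive-node mean −a²(1 − a)N is part of that reconstruction):

```python
# Proportion of nodes retrieved into their encoding state: active nodes must
# end above the threshold, inactive nodes below it.
from scipy.stats import norm

def p_correct_all(mu_active, sigma_h, T, a, N):
    mu_inactive = -a ** 2 * (1.0 - a) * N  # cf. Equation A1 with the node inactive
    p_act = a * norm.sf(T, loc=mu_active, scale=sigma_h)               # active, above T
    p_inact = (1.0 - a) * norm.cdf(T, loc=mu_inactive, scale=sigma_h)  # inactive, below T
    return p_act + p_inact
```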

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


.75 rather than .50. The figure does not show a mirror effect, because the expected hit rate and the expected false alarm rate are larger for the high-frequency items than for the low-frequency items. Setting C to an unbiased value of 0 yields P(AN) = .08, P(BN) = .14, P(BO) = .86, and P(AO) = .63 [which may be compared with Figure 6B: P(AN) = .20, P(BN) = .25, P(BO) = .70, P(AO) = .74].

Furthermore, in the variance theory, the ratio of the recognition strength standard deviations for high- and low-frequency items depends mainly on the standard deviations of the net inputs. The standard deviations of the net inputs are not dependent on the attention paid to the stimuli. Therefore, the variance theory predicts no change in the standard deviations when the amount of attention is manipulated. The standard deviation of the old low-frequency distribution is predicted to be larger than the standard deviation of the old high-frequency distribution. Similarly, the standard deviation of the new low-frequency distribution is predicted to be larger than that of the new high-frequency distribution. The standard deviations in Figure 7B are σs(AN) = 0.013, σs(BN) = 0.011, σs(BO) = 0.017, and σs(AO) = 0.019. These results are similar to the results when the same amount of attention is paid to the two classes, as in Figure 4D: σs(AN) = 0.015, σs(BN) = 0.012, σs(BO) = 0.015, and σs(AO) = 0.020.

The standard version of the attention-likelihood theory has problems accounting for the lack of a mirror effect when more study time is given to the difficult class than to the easy class. This theory suggests that the class of items to which more attention is being paid is more easily recognized. For example, low-frequency items are better recognized than high-frequency items because subjects pay more attention to them. The amount of attention is assumed to influence the number of sampled features [n(i)], so more features are sampled for low- than for high-frequency items (Kim & Glanzer, 1993). This is the only parameter that differs between high- and low-frequency items. From this explanation, it follows that, if the experimental conditions are manipulated so that subjects pay more attention to the high-frequency items, the standard version of the attention-likelihood theory will predict a mirror effect where the high-frequency items are the easier class (A) and the low-frequency items are the more difficult class (B). The difference from the normal mirror effect is a larger hit rate and a smaller false alarm rate for high- than for low-frequency items. Furthermore, the attention-likelihood theory makes predictions about the order of the slopes of the ROC curves: The standard deviation of the hit rate for the high-frequency distribution would be larger than that for the low-frequency distribution. Similarly, it is predicted that the standard deviation of the high-frequency false alarm distribution is larger than that of the low-frequency distribution.

EXPERIMENT

An experiment was conducted to test the predictions regarding the within-list strength manipulation. The number of presentations and the study time of the high-frequency words were manipulated in an experiment. The original rationale for the experiment was to compare the results with the predictions of the variance theory and the attention-likelihood theory, because the experiment was conducted before the publication of Stretch and Wixted's (1998b) study, which manipulated attention by the number of presentations. In this experiment, a new manipulation is investigated, namely, how the amount of study time per item for each class influences the mirror effect. Furthermore, the manipulation of the number of presentations is replicated. Thus, there were two experimental conditions: one in which the number of presentations was manipulated and one in which the presentation time was manipulated. There was also one control condition, in which high- and low-frequency words were given the same amount of attention.

Method

Subjects. Twenty-one students taking the introductory psychology course at the University of Toronto volunteered to participate in a memory experiment for course credit. There were 14 female and 7 male subjects, with a mean age of 20 years, ranging from 18 to 29 years old.

Material. Sixty low-frequency words and 60 high-frequency words were selected from Kučera and Francis (1967). The low-frequency words have an occurrence of 4-5 times per million, and the high-frequency words an occurrence of 50-55 times per million. Thirty low- and 30 high-frequency words were randomly chosen for List 1, and the remaining words for List 2.

Procedure. The subjects were instructed to study a list of words so that they would be able to recognize the words after study. Fifteen low-frequency words and 15 high-frequency words were randomly chosen as study words for each subject.

Design. There were three conditions. In all the conditions, the low-frequency words were presented once, with a presentation time of 1 sec. In the control condition, the high-frequency words were also presented once, with a presentation time of 1 sec. In the presentation frequency condition, the high-frequency words were presented twice, for 1 sec each time. In the presentation time condition, the high-frequency words were presented once, for 3 sec. The presentation order was randomized. All the words were presented in uppercase on a blank computer screen. Immediately following the study list, there was a recognition test. The subjects were presented with either one of the studied words or one of the lures. There were 15 low-frequency lures and 15 high-frequency lures in each condition. The subjects were asked to judge whether the word was presented in the list or not. The subjects were also required to rate their confidence in their responses on a scale ranging from 1 (guessing) to 5 (very certain). The order of recognition was randomized for each subject.

Each subject participated in two conditions. List 1 was always given as the first list and List 2 as the second list. Twelve subjects were randomly chosen for the presentation frequency condition followed by the presentation time condition. Nine subjects were given the control condition followed by another control condition. The whole experimental setup, including instructions, presentation of words, and the recognition test, was automated on a computer. Each subject was tested individually.

Results

The results from the experiment are presented in the first three rows of Table 1. The probability for hit rates

was larger for the high-frequency words than for the low-frequency words in the presentation frequency and the presentation time conditions. In the control condition, the probability for hit rates for the low-frequency condition was larger. One-tailed paired t tests over the performance for each subject were carried out to test the differences between the high and the low frequencies. The effects were significant in the presentation frequency condition [t(11) = 2.2, MSe = 0.04, p = .02 < .05] and in the control condition [t(16) = 3.3, MSe = 0.04, p = .00 < .05], but not in the presentation time condition [t(11) = 0.41, MSe = 0.03, p = .34 > .05].

The false alarm rate was larger for the high-frequency words in all the conditions. However, it was significantly larger only in the presentation frequency condition [t(11) = 1.8, MSe = 0.03, p = .048 < .05], but not in the presentation time condition [t(11) = 1.5, MSe = 0.01, p = .07 > .05] or the control condition [t(16) = 1.4, MSe = 0.02, p = .09 > .05].

The results in the presentation frequency condition support the variance theory. The hit and the false alarm rates were significantly larger for the high-frequency words than for the low-frequency words. Thus, there was no mirror effect. However, the prediction of the standard version of the attention-likelihood theory was not supported.

The results in the presentation time condition were in the same direction as those in the presentation frequency condition, although the difference between the high and the low frequencies was not significant. This condition is consistent with the variance theory, although the standard version of the attention-likelihood theory could not be dismissed in this condition, since the results were nonsignificant.

Finally, the control condition yielded results consistent with previous studies showing a mirror effect. The hit rate for the high-frequency words was significantly lower than the hit rate for the low-frequency words. The false alarm rate for the high-frequency words was larger than that for the low-frequency words (although not significantly). Thus, the control condition is, as was expected, consistent with both the variance theory and the standard version of the attention-likelihood theory.

The slopes of the ROC curves were calculated as follows: The hit and false alarm rates for confidence ratings 1-5 were z-transformed (e.g., for confidence rating 4, a hit response was scored if the confidence rating was 4 or above). Linear regressions of one z-transformed measurement as a function of another z-transformed measurement were conducted. The slope of the linear regression curve between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [σs(BO)/σs(AO)], and similarly for the slope of the false alarms [σs(BN)/σs(AN)], are shown in the last two rows of Table 1.
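This computation is straightforward to reproduce; the sketch below uses illustrative cumulative rates (not the data of Table 1) to show the mechanics.

```python
# z-ROC slope: z-transform cumulative rates at each confidence cutoff, then
# regress one set of z-scores on the other.
import numpy as np
from scipy.stats import norm, linregress

# Cumulative hit rates at confidence cutoffs 5, 4, 3, 2, 1 (placeholders).
hit_low = np.array([0.30, 0.45, 0.60, 0.70, 0.82])   # low-frequency old items
hit_high = np.array([0.22, 0.38, 0.52, 0.63, 0.75])  # high-frequency old items

z_low, z_high = norm.ppf(hit_low), norm.ppf(hit_high)
slope_old = linregress(z_low, z_high).slope          # the ratio entered in Table 1
```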

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect although p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear: It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance; that is,

$$\sum_{i=1}^{8}\frac{\left(\mathrm{Observed}_i-\mathrm{Predicted}_i\right)^2}{\sigma_i^2}.$$

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit. These parameters were the number of features (N = 1000) and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σ_h(A)], and the standard deviation of the net inputs for the difficult-class words [σ_h(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μ_h(N) = 0, μ_h(O) = 1, and C = 0]. The empirical standard deviations (σ_i) were not reported in Glanzer et al. (1993), so these parameters were set to one.
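For illustration, the minimization itself can be sketched as follows. Only the weighted squared error criterion is taken from the text; the observed values are hypothetical, and model() is a stand-in (simple equal-variance signal detection predictions), not the variance theory's own equations.

# A hedged sketch of the fitting procedure with a stand-in model.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

observed = np.array([0.25, 0.35, 0.60, 0.75])  # hypothetical P(AN), P(BN), P(BO), P(AO)
sigma = np.ones_like(observed)                 # empirical SDs unavailable; set to one

def model(params):
    # Stand-in predictions: yes probabilities from two equal-variance
    # signal detection pairs with criterion c and sensitivities dA > dB.
    c, dA, dB = params
    return np.array([norm.cdf(-c - dA / 2), norm.cdf(-c - dB / 2),
                     norm.cdf(dB / 2 - c), norm.cdf(dA / 2 - c)])

def loss(params):
    return np.sum(((observed - model(params)) / sigma) ** 2)

fit = minimize(loss, x0=[0.0, 1.0, 0.5], method="Nelder-Mead")
print(fit.x, loss(fit.x))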

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slopes; the attention-likelihood theory accounts for 84%. Thus, the variance theory accounted for the same amount of variance for the probabilities, and for more variance for the slopes, as compared with the attention-likelihood theory when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of the standard deviations of the net inputs [σ_h(B)/σ_h(A)] may also be conceived of as constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted with a single variable, namely, the standard deviation of the net inputs for the easy class [σ_h(A)]. The activity level was fixed at .10, and the ratio of the standard deviations of the net inputs, σ_h(B)/σ_h(A), at 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed probabilities. Figure 8B shows the corresponding results for the slope. The variance accounted for is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes, using a single parameter, equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities, using one parameter, is slightly worse than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs from the easy class [σ_h(A)] and the standard deviation of the net inputs from the difficult class [σ_h(B)], can also be expressed as functions of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μ_h); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μ_h(AN) = μ_h(BN) and μ_h(AO) = μ_h(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σ_h(A) and σ_h(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (P_c) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σ_h(A), σ_h(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (P_c) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μ_h) and the standard deviation of the net inputs (σ_h). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. P_c is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (P_c) that a node is active at recognition is

$$P_c = a \int_T^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_h}\, e^{-(h-\mu_h)^2 / (2\sigma_h^2)}\, dh. \qquad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σ_h), gives the expected recognition strength (μ_s):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σ_h) as an approximation of the standard deviation of the item (σ_h′), because this simplifies the analytic solution; the variance theory, as simulated, uses the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σ_s) is calculated from σ_h, P_c, and N. There are 2N nodes in the context and the item layers. The distribution of P_c is binomial but can, for a certain criterion [i.e., 2NP_c(1 − P_c) > 10], be approximated with a normal distribution with a standard deviation of [P_c(1 − P_c)/(2N)]^(1/2). The final result is scaled by the normalization factor 1/σ_h:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c\left(1-P_c\right)}{2N}\right]^{1/2}. \qquad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion by integrating the density of the recognition strength over a normal distribution:

$$P(Y) = \int_C^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_s}\, e^{-(s-\mu_s)^2 / (2\sigma_s^2)}\, ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μ_s(A)] and B [μ_s(B)] and the standard deviations of the recognition strength of A [σ_s(A)] and B [σ_s(B)]:

$$P(A, B) = \int_0^{\infty} \frac{1}{\sqrt{2\pi\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}}\, e^{-\left(s-\left[\mu_s(A)-\mu_s(B)\right]\right)^2 / \left(2\left[\sigma_s^2(A)+\sigma_s^2(B)\right]\right)}\, ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
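The analytic solution is equally straightforward in code. The sketch below implements Equation 6, the expected recognition strength, Equation 7, and P(Y); the parameter values (μ_h, σ_h, T, a, N, and C) are illustrative assumptions, not fitted values.

# A sketch of the analytic solution; norm.sf is the normal survival function.
import numpy as np
from scipy.stats import norm

def predictions(mu_h, sigma_h, T, a=0.10, N=50, C=0.0):
    Pc = a * norm.sf(T, loc=mu_h, scale=sigma_h)          # Equation 6
    mu_s = (Pc - a / 2.0) / sigma_h                       # expected recognition strength
    sigma_s = np.sqrt(Pc * (1 - Pc) / (2 * N)) / sigma_h  # Equation 7
    P_yes = norm.sf(C, loc=mu_s, scale=sigma_s)           # P(Y)
    return Pc, mu_s, sigma_s, P_yes

# Threshold midway between the expected new (0) and old (1) net inputs
print(predictions(mu_h=1.0, sigma_h=1.0, T=0.5))  # old items
print(predictions(mu_h=0.0, sigma_h=1.0, T=0.5))  # new items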



Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is, in the model, calculated with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (P_c) is assumed to be low. By approximating 1 − P_c to one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σ_s² ∝ P_c). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

$$\frac{\sigma_s^2(N)}{\sigma_s^2(O)} \approx \frac{P_c(N)}{P_c(O)}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is .8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is .64 (= .8²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal, and the numbers of active nodes in the old distributions are approximately equal [P_c(AN) ≈ P_c(BN) and P_c(BO) ≈ P_c(AO)]. Given these approximations and the approximation above (1 − P_c ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions,

$$\frac{\sigma_s(BO)}{\sigma_s(AO)} \leq \frac{\sigma_h(A)}{\sigma_h(B)} \approx \frac{\sigma_s(BN)}{\sigma_s(AN)}.$$

The exact solution predicts a slightly smaller ratio in the old than in the new distributions.

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield good performance in the model are a low percentage of nodes active, setting the activation threshold between the old and the new net inputs, measuring performance by the nodes that are active in the encoding pattern, and normalizing the recognition strength. It is shown that optimal performance in the network requires the implementation suggested by the variance theory. If the implementation of the variance theory is changed significantly, the performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytic equations above, we find the following expression:

$$d' = \frac{\mu_s(O)-\mu_s(N)}{\sigma_s(NO)} = \frac{P_c(O)-P_c(N)}{\left[P_c(NO)\left(1-P_c(NO)\right)/(2N)\right]^{1/2}}. \qquad (8)$$

Because σ_s(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σ_s(NO), in the denominator of this equation. Thus, P_c(NO) is equal to [P_c(N) + P_c(O)]/2. P_c(·) was calculated with Equation 6. The expected values of the net inputs and the standard deviations of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.
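As a worked illustration, Equation 8 can be evaluated directly once the net-input parameters are chosen. In this sketch the expected old net input and the net-input standard deviation are simply assumed (in the text they follow from Equations A1-A3, with f = 0 and L = 1), so the resulting number is illustrative only.

# A sketch of the d' computation in Equation 8 under assumed net-input values.
import numpy as np
from scipy.stats import norm

a, N = 0.05, 30              # activity level and number of nodes
mu_old, sigma_h = 1.0, 1.0   # assumed expected old net input and net-input SD
T = mu_old / 2.0             # activation threshold between the new and old means

Pc_new = a * norm.sf(T, loc=0.0, scale=sigma_h)     # Equation 6, new items
Pc_old = a * norm.sf(T, loc=mu_old, scale=sigma_h)  # Equation 6, old items
Pc_no = (Pc_new + Pc_old) / 2.0                     # pooled Pc(NO)

d_prime = (Pc_old - Pc_new) / np.sqrt(Pc_no * (1 - Pc_no) / (2 * N))
print(f"d' = {d_prime:.2f}")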

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).



The results show that d′ is optimal for a = .052. The d′ is lower for larger and for smaller a. The lower d′ for large a occurs because the interference from other items increases. For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. For very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important; errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity level for fitting some of the empirical data (see below).

There is another factor contributing to the fact that optimal performance occurs when the percentage of active nodes is medium low: The number of possible representations increases with a. If there is only one node active in all the representations, there are N possible representations; if there are two nodes active in all the representations, there are approximately N² possible representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; thus, any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. However, surprisingly, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40 when T = .81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is .71, which is near, but slightly lower than, the optimal value of .81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect but also optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried by nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried by the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes are different). Thus, inactive and active nodes carry the same amount of information, and performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in the inactive nodes will overwhelm the information in their weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was also calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes (shown by the solid line). The results show that the highest d′ is found when the decision is based only on active nodes and a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true: If the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
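The intersection rule is easy to verify numerically. The sketch below uses illustrative Gaussian strength distributions with unequal standard deviations; the criterion that maximizes hits minus false alarms is the point where the likelihood ratio is one.

# A numerical check of the intersection-point rule with illustrative values.
import numpy as np
from scipy.stats import norm

mu_n, sd_n = 0.0, 1.0   # new-item strength distribution
mu_o, sd_o = 1.0, 1.3   # old-item distribution with a larger SD

def hits_minus_fas(c):
    return norm.sf(c, mu_o, sd_o) - norm.sf(c, mu_n, sd_n)

grid = np.linspace(-2, 3, 10001)
c_best = grid[np.argmax(hits_minus_fas(grid))]

L = norm.pdf(c_best, mu_o, sd_o) / norm.pdf(c_best, mu_n, sd_n)
print(f"best criterion = {c_best:.3f}, likelihood ratio = {L:.3f}")  # L is ~1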

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. Optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: Optimal performance occurs when the maximum likelihoods of the two classes are equal:

L(A) = f [S(AO)]/f [S(AN)] = f [S(BO)]/f [S(BN)] = L(B).

It is easy to see that this criterion must be satisfied for optimal performance, because any shift away from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be raised to keep the total false alarm rate constant; according to the formulation of the problem, the changes in the two false alarm rates must cancel. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, such a change makes L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), and the hits gained for Class A are fewer than the hits lost for Class B. This shows that moving the criteria away from L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.
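This equal-likelihood-ratio condition can also be checked numerically. The sketch below assumes illustrative Gaussian distributions for the two classes (standard normal lures for both), sweeps over how a fixed total false alarm rate is split between the classes, and reports the likelihood ratios at the hit-maximizing split; they come out approximately equal.

# A numerical sketch of the two-class criterion problem with assumed values.
import numpy as np
from scipy.stats import norm

mu_oA, sd_A = 1.5, 1.0  # easy class A: old-item distribution
mu_oB, sd_B = 0.8, 1.0  # difficult class B: old-item distribution
total_fa = 0.30         # fixed total (average) false alarm rate

best = None
for fa_A in np.linspace(0.01, 2 * total_fa - 0.01, 999):
    fa_B = 2 * total_fa - fa_A
    cA, cB = norm.isf(fa_A), norm.isf(fa_B)  # class criteria from false alarm rates
    hits = norm.sf(cA, mu_oA, sd_A) + norm.sf(cB, mu_oB, sd_B)
    if best is None or hits > best[0]:
        best = (hits, cA, cB)

hits, cA, cB = best
LA = norm.pdf(cA, mu_oA, sd_A) / norm.pdf(cA)  # likelihood ratio for Class A
LB = norm.pdf(cB, mu_oB, sd_B) / norm.pdf(cB)  # likelihood ratio for Class B
print(f"L(A) = {LA:.3f}, L(B) = {LB:.3f}")     # approximately equal at the optimum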

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can be indirectly inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently for the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A as the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as near optimal for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Classes A and B, and the total false alarm rate is equal to the average false alarm rate in Classes A and B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only the nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., P_c > .50) even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because P_c(1 − P_c)/N decreases for P_c over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active for old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple. The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σ_s(BN) < σ_s(BO) < σ_s(AN) < σ_s(AO) and σ_s(BN) < σ_s(AN) < σ_s(BO) < σ_s(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σ_s(BO) < σ_s(BN) < σ_s(AN) < σ_s(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and its recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency-recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency-recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the deblurring is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen, in the variance theory, as a difficult class with a higher variability than that for words. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency-recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length-recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item (σ_h′) is used, so the subject does not need to know the class or the standard deviation of the class (σ_h). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes, subtract the expected value, and divide by the standard deviation of the net input of the item. An explicit solution was presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network; however, see McClelland and Chappell (1998) for a brief discussion of this topic. An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σ_p(N)] and old items [σ_p(O)] are equal. However, if the standard deviations are different, the expected number of active nodes will be

$$a\,\frac{\sigma_p(N)}{\sigma_p(N)+\sigma_p(O)}.$$

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around .44a if the slope of the z-ROC curve is .8. It is actually more precise to use this value. In this paper the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξ_i = ξ_j = 1:

$$\mu_h(O) = \sum_j^N \Delta w_{ij}\,\xi_j = \sum_j^N \left(\xi_i - a\right)\left(\xi_j - a\right)\xi_j = a(1-a)^2 N. \qquad (A1)$$

The expected value of the net inputs for the new items is zero:

$$\mu_h(N) = 0. \qquad (A2)$$

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

$$\sigma_h^2(N) = \sum_i^P \sum_j^N \mathrm{E}\!\left[\left(\Delta w_{ij}\,\xi_j\right)^2\right] = a^3(1-a)^2 PN. \qquad (A3)$$
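Equations A1 and A3 can be checked by simulation. The sketch below assumes the covariance learning rule Δw_ij = (ξ_i − a)(ξ_j − a) described in the theory (as a generic one-layer stand-in for the context-item weights), stores P random patterns, and estimates the mean old-item net input and the variance of the new-item net input for a node that is active in the encoded pattern.

# A simulation sketch of the net-input statistics for a node active at encoding.
import numpy as np

rng = np.random.default_rng(0)
N, P, a, runs = 200, 50, 0.1, 200
mu_old, h_new = [], []

for _ in range(runs):
    pats = (rng.random((P, N)) < a).astype(float)  # P stored binary patterns
    W = (pats - a).T @ (pats - a)                  # accumulated weight changes
    np.fill_diagonal(W, 0.0)
    old = pats[0]
    new = (rng.random(N) < a).astype(float)        # an unstudied pattern
    i = np.flatnonzero(old)[0]                     # a node active in the old pattern
    mu_old.append(W[i] @ old)                      # old-item net input
    h_new.append(W[i] @ new)                       # new-item net input

print(np.mean(mu_old), a * (1 - a) ** 2 * N)         # compare with Equation A1
print(np.var(h_new), a ** 3 * (1 - a) ** 2 * P * N)  # compare with Equation A3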

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

$$\sigma_h^2(f) = a^4(1-a)^2 f^2 N. \qquad (A4)$$

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

$$\sigma_h^2(L) = a^4(1-a)^2 L^2 N.$$

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where p patterns have been encoded, is

$$\sigma_h^2(O) = \frac{a^4(1-a)^2\left(f^2 + L^2\right)N}{2} + a^3(1-a)^2 pN. \qquad (A5)$$

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where P_c(O) and P_c(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


428 SIKSTROM

was larger for the high-frequency words than for the low-frequency words in the presentation frequency and thepresentation time conditions In the control conditionthe probability for hit rates for the low-frequency condi-tion was larger One-tailed paired t tests over the perfor-mance for each subject were carried out to test the dif-ferences between the high and the low frequencies Theeffects were significant in the presentation frequencycondition [t(11) = 22 MSe = 004 p = 02 lt 05] and inthe control condition [t(16) = 33 MSe = 004 p = 00lt 05] but not in the presentation time condition [t(11) =041 MSe = 003 p = 34 lt 05]

The false alarm rate was larger for the high-frequencywords in all the conditions However it was only signif-icantly larger in the presentation frequency condition[t(11) = 18 MSe = 003 p = 048 lt 05] but not in thepresentation time condition [t (11) = 15 MSe = 001 p =07 gt 05] and the control condition [t(16) = 14 MSe =002 p = 09 gt 05]

The results in the presentation frequency conditionsupport the variance theory The hit and the false alarmrates were significantly larger for the high-frequencywords than for the low-frequency words Thus there wasno mirror effect However the prediction of the standardversion of the attention-likelihood theory was not sup-ported

The results in the presentation time condition were inthe same direction as those in the presentation frequencycondition although the difference between the high andthe low frequencies was not significant This conditionis consistent with the variance theory although the stan-dard version of the attention-likelihood theory could notbe dismissed in this condition since the results werenonsignificant

Finally the control condition yielded results consis-tent with previous studies showing a mirror effect Thehit rate for the high-frequency words was significantlylower than the hit rate for the low-frequency words Thefalse alarm rate for the high-frequency words was largerthan that for the low-frequency words (although not sig-nificantly) Thus the control condition is as was ex-pected consistent with both the variance theory and thestandard version of the attention-likelihood theory

The slopes of the ROC curves were calculated as fol-lows The hit and false alarm rates for confidence ratings1ndash5 were z-transformed (eg for confidence rating 4 a hitresponse was scored if the confidence rating was 4 orabove) Linear regressions of one z-transformed mea-surement as a function of another z-transformed measure-ment were conducted The slope of the linear regressioncurves between the z-transformed hit rate of the low-frequency words and the z-transformed hit rate of the high-frequency words [ss(BO)ss(AO)] and similarly for theslope of the false alarms [ss(BN)ss(AN)] are shown inthe last two rows of Table 1

The results show that the standard deviations of the old high-frequency distributions were smaller than the standard deviations of the old low-frequency distributions in all the conditions. The standard deviations of the false alarm high-frequency distributions were smaller than the standard deviations of the false alarm low-frequency distributions in the presentation frequency condition and the control condition, but were approximately equal in the presentation time condition.

To summarize, the results in the presentation frequency condition are consistent with the variance theory and inconsistent with the standard version of the attention-likelihood theory. The control condition is consistent with both the variance theory and the standard version of the attention-likelihood theory. These data are also consistent with results presented by Stretch and Wixted (1998b). However, Stretch and Wixted (1998b) suggested one possible way to modify the standard version of the attention-likelihood theory to bring it in line with the data presented here. They noted that Glanzer et al. (1993) had shown that the attention-likelihood theory predicts the mirror effect even when p(i, old) is set to the average value of the two classes. This modified version can predict the pattern of data presented here, given that the attention paid to the high-frequency class was high during encoding [n(B) = 120] and low during recognition [n(B) = 40]. This formulation of the attention-likelihood theory seems somewhat unclear: It is not well motivated why p(i, old) should be equal during recognition whereas n(i) is different [p(i, old) is calculated from n(i)], or why the amount of attention for high-frequency items is lower than that for low-frequency words at encoding but higher at recognition.

COMPARATIVE DATA FITTING

Glanzer et al. (1993) fitted the attention-likelihood theory to experimental data in nine conditions. The easy class (A) consisted of either low-frequency words or concrete words, and the difficult class (B) consisted of high-frequency words or abstract words. Here, the variance theory is fitted to the same set of data as that used in Glanzer et al. (1993). This allows a direct evaluation of the variance theory by comparing its fits with those of the attention-likelihood theory.

Glanzer et al. (1993) fitted the attention-likelihood theory to the four probabilities of yes responses and the four ratios of slopes of the ROCs. The fitting was conducted by minimizing the mean squared error divided by the variance, that is,

\sum_{i=1}^{8}\frac{(\mathrm{Observed}_i-\mathrm{Predicted}_i)^2}{s_i^2} .

Three parameters were fitted, namely, the attention paid to each of the classes, n(A) and n(B), and the probability that a feature was activated, p(new). The other parameters were held constant at values found to give a good fit; these were N = 1000 and the recognition criterion [ln(L) = 0].

The variance theory was fitted to the same set of data, using the same technique and the same number of free parameters. The fitted parameters were the percentage of


nodes active at encoding (a), the standard deviation of the net inputs for the easy-class words [σh(A)], and the standard deviation of the net inputs for the difficult-class words [σh(B)]. The other parameters were held constant and were set to the same values as those in Figure 4D [2N = 100, μh(N) = 0, μh(O) = 1, and C = 0]. The empirical standard deviations (si) were not reported in Glanzer et al. (1993), so these parameters were set to one.
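A sketch of this fitting procedure appears below. The observed values are hypothetical placeholders (the actual data appear in Glanzer et al., 1993), and a stub stands in for the model's prediction equations.

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder observations: four yes-rate probabilities and four
# ratios of z-ROC slopes (not the published values).
observed = np.array([.25, .35, .55, .70, 1.30, 1.15, .85, .95])
s = np.ones(8)  # empirical standard deviations unavailable; set to 1

def predicted(params):
    """Stub: a full version would compute the eight predicted values
    from a, sigma_h(A), and sigma_h(B) via the equations given in the
    Analytic Solutions section."""
    a, sh_a, sh_b = params
    return np.zeros(8)

# Minimize the mean squared error divided by the variance, as in the text.
fit = minimize(lambda q: np.sum((observed - predicted(q)) ** 2 / s ** 2),
               x0=[0.10, 1.0, 1.25], method="Nelder-Mead")
```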

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities; the attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slope, whereas the attention-likelihood theory accounts for 84% of the variance of the slope. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slope, as compared with the attention-likelihood theory when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of standard deviations of the net inputs [σh(B)/σh(A)] may also be conceived of as being constant, given that the same material is used in the different conditions. Therefore, the variance theory was fitted by a single variable, namely, the standard deviation of the net inputs for the easy class [σh(A)]. The activity level was fixed to 0.10 and the ratio of the standard deviations of the net inputs, σh(B)/σh(A), to 1.25 (these values were the average of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratios of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


probabilities. Figure 8B shows the corresponding results for the slope. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes using a single parameter equally as well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities using one parameter is slightly less good than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely, the standard deviation of the net inputs for the easy class [σh(A)] and the standard deviation of the net inputs for the difficult class [σh(B)], can also be expressed in terms of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μh); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μh(AN) = μh(BN) and μh(AO) = μh(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σh(A) and σh(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (Pc) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely, σh(A), σh(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (Pc) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μh) and the standard deviation of the net inputs (σh). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. Pc is solved by integrating the net inputs from the activation threshold (T) to infinity (∞) over the probability density function of a normal distribution. Thus, the probability (Pc) that a node is active at recognition is

P_c = \frac{a}{\sigma_h\sqrt{2\pi}}\int_T^{\infty} e^{-(h-\mu_h)^2/(2\sigma_h^2)}\,dh .  (6)

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes, and dividing by the standard deviation of the net inputs (σh), yields the expected recognition strength (μs):

\mu_s = \frac{P_c - a/2}{\sigma_h} .

Note that the analytic solution uses the standard deviation of the class (σh) as an approximation of the standard deviation of the item (σh′), because this simplifies the analytic solution; the variance theory, and the simulation, use the standard deviation of the item. This approximation is good when there is a large number of features; for a small number of features, the variance of feature strength for a single item may fluctuate from item to item around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σs) is calculated from σh, Pc, and N. There are 2N nodes in the context and item layers. The distribution of Pc is binomial but can, for a certain criterion [i.e., 2NPc(1 − Pc) > 10], be approximated with a normal distribution with a standard deviation of [Pc(1 − Pc)/2N]^{1/2}. The final result is scaled by the normalization factor 1/σh:

\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c(1-P_c)}{2N}\right]^{1/2} .  (7)

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion by integrating the density of the recognition strength over a normal distribution:

P(Y) = \frac{1}{\sigma_s\sqrt{2\pi}}\int_C^{\infty} e^{-(s-\mu_s)^2/(2\sigma_s^2)}\,ds .

The probability of choosing A over B in a two-choice forced recognition test [P(A, B)] is calculated from the expected recognition strengths of A [μs(A)] and B [μs(B)] and from the standard deviations of the recognition strengths of A [σs(A)] and B [σs(B)]:

P(A, B) = \frac{1}{\sqrt{2\pi[\sigma_s^2(A)+\sigma_s^2(B)]}}\int_0^{\infty} e^{-\{s-[\mu_s(A)-\mu_s(B)]\}^2/\{2[\sigma_s^2(A)+\sigma_s^2(B)]\}}\,ds .

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
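These equations are easy to evaluate numerically. The sketch below assumes the parameter conventions used in the fits above [μh(N) = 0, μh(O) = 1, activation threshold T halfway between them]; the function name and default values are illustrative, not part of the theory.

```python
import math

def p_yes(mu_h, sigma_h, a=0.10, N=50, C=0.0):
    """P(yes) for one item class under the variance theory:
    Equation 6 for Pc, the normalized strength mu_s, Equation 7 for
    sigma_s, and a normal integral above the criterion C."""
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    T = 0.5  # threshold midway between mu_h(N) = 0 and mu_h(O) = 1
    p_c = a * (1.0 - phi((T - mu_h) / sigma_h))                   # Equation 6
    mu_s = (p_c - a / 2.0) / sigma_h                              # expected strength
    sigma_s = math.sqrt(p_c * (1.0 - p_c) / (2.0 * N)) / sigma_h  # Equation 7
    return 1.0 - phi((C - mu_s) / sigma_s)                        # P(Y)

hit = p_yes(mu_h=1.0, sigma_h=1.0)   # old items
fa  = p_yes(mu_h=0.0, sigma_h=1.0)   # new items
```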


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is calculated in the model with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (Pc) is assumed to be low. By approximating 1 − Pc as one, the variance of the recognition strength can be simplified to

\sigma_s^2 = \frac{P_c}{2N\sigma_h^2} .

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is proportional to the number of active nodes (σs² ∝ Pc). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

\frac{\sigma_s^2(\mathrm{N})}{\sigma_s^2(\mathrm{O})} \approx \frac{P_c(\mathrm{N})}{P_c(\mathrm{O})} .

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is .8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is .64 (= .8²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the new distributions are approximately equal and the numbers of active nodes in the old distributions are approximately equal [Pc(AN) ≈ Pc(BN) and Pc(BO) ≈ Pc(AO)]. Given these approximations and the approximation above (1 − Pc ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions:

\frac{\sigma_s(\mathrm{BO})}{\sigma_s(\mathrm{AO})} \le \frac{\sigma_s(\mathrm{BN})}{\sigma_s(\mathrm{AN})} \approx \frac{\sigma_h(\mathrm{A})}{\sigma_h(\mathrm{B})} .

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items. Thus, if the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield good performance in the model are a low percentage of active nodes, setting the activation threshold between old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. An optimal performance in the network requires the implementation suggested by the variance theory; if the implementation is changed significantly, the performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytic equations above, we find the following expression:

d' = \frac{\mu_s(\mathrm{O})-\mu_s(\mathrm{N})}{\sigma_s(\mathrm{NO})} = \frac{P_c(\mathrm{O})-P_c(\mathrm{N})}{[P_c(\mathrm{NO})(1-P_c(\mathrm{NO}))/(2N)]^{1/2}} .  (8)

Because σs(N) is sometimes near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strengths, σs(NO), in the denominator of this equation. Thus, Pc(NO) is equal to [Pc(N) + Pc(O)]/2, and Pc(·) was calculated with Equation 6. The expected value and the standard deviation of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).
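Equation 8 itself can be evaluated directly, as sketched below under the same illustrative conventions as the p_yes sketch above; note that the 1/σh normalization cancels between numerator and denominator, and that reproducing Figure 9A would additionally require expressing μh and σh as functions of a via the Appendix equations.

```python
import math

def d_prime(mu_h_old, sigma_h, a, N):
    """Theoretical d' from Equation 8, with Pc(NO) = [Pc(N) + Pc(O)]/2
    and the activation threshold T midway between the new (0) and old
    expected net inputs, as the theory prescribes."""
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    T = mu_h_old / 2.0
    p_c = lambda mu: a * (1.0 - phi((T - mu) / sigma_h))   # Equation 6
    p_no = (p_c(0.0) + p_c(mu_h_old)) / 2.0
    return (p_c(mu_h_old) - p_c(0.0)) / math.sqrt(p_no * (1.0 - p_no) / (2.0 * N))
```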


The results show that d′ is optimal for a = 0.52; d′ is lower for larger and for smaller a. The lower d′ for large a occurs because the interference from other items increases: For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. Thus, for very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important, and errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible combinations of representations; if there are two nodes active in all the representations, there are approximately N² possible combinations of representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. Surprisingly, however, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = 0.52. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = 0.52). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values; if a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed; therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (although the signs of the weight changes differ). Thus, inactive and active nodes carry the same amount of information, and performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit and false alarm rates. Therefore, it is necessary to use another criterion for performance


with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true: If the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers; if the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these variables is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
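The intersection point can be found by equating the two normal densities and solving the resulting quadratic. The sketch below assumes unequal standard deviations and uses illustrative parameter values.

```python
import numpy as np

def optimal_criterion(mu_n, sd_n, mu_o, sd_o):
    """Recognition strength s at which f[S(O)] = f[S(N)] (i.e., L = 1),
    for normal densities with unequal standard deviations."""
    a = 1.0 / sd_o**2 - 1.0 / sd_n**2
    b = -2.0 * (mu_o / sd_o**2 - mu_n / sd_n**2)
    c = (mu_o**2 / sd_o**2 - mu_n**2 / sd_n**2
         + 2.0 * np.log(sd_o / sd_n))
    roots = np.roots([a, b, c]).real
    # The root lying between the two means is the optimal criterion.
    return [r for r in roots if min(mu_n, mu_o) < r < max(mu_n, mu_o)][0]

criterion = optimal_criterion(mu_n=0.0, sd_n=1.0, mu_o=1.0, sd_o=1.25)
```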

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ as a function of the percentage of nodes active at encoding when the decision is based only on nodes that are active during encoding; the dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A; an optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the hit rates are maximized [P(AO) + P(BO)]? The solution to this problem is surprisingly simple: Optimal performance occurs when the maximum likelihoods of the two classes are equal,

L(A) = f [S(AO)]/f [S(AN)] = f [S(BO)]/f [S(BN)] = L(B).

It is easy to see that this criterion must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition threshold for Class A is diminished, the recognition criterion for Class B must be increased to keep the total false alarm rate constant; according to the formulation of the problem, the changes in the two false alarm rates must cancel. The maximum-likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), and the gain in hits for Class A is smaller than the loss in hits for Class B. This shows that changing the placement of the criteria from L(A) = L(B) results in an overall decrease in hit rate, and performance suffers.

Note that the variance theory has only one overall recognition criterion. However, the theory (as must any theory of the mirror effect) specifies how this criterion behaves when it is moved over the two classes. Thus, the second criterion can be inferred indirectly from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently for the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B; therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A; therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently so that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient; all other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate in Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance; the largest advantage is approximately .05, and it occurs when the false alarm


rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance; the largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50) even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)/N decreases for Pc over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data; if the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations: The standard deviations were larger for the low-frequency


distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables such as study time, repetition, and study instructions affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode; thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency-recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a

particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency-recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to frequency-recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length-recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts


associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes–no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item (σh′) is used, so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution was presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses a variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves; instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM–retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå. (ISBN 91-7191-155-3)

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

a\,\frac{\sigma_p(N)}{\sigma_p(N)+\sigma_p(O)} .

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the ROC curve is .8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input solves the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input (for these, ξj = 1):

\mu_h(O) = \sum_j^N \Delta\omega_{ij}\,\xi_j = \sum_j^{aN} (\xi_i - a)(1 - a) = a(1-a)^2 N .  (A1)

The expected value of the net inputs for the new items is zero:

\mu_h(N) = 0 .  (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

\sigma_h^2(N) = \sum_j^N \sum_p^P (\Delta\omega_{ij}\,\xi_j)^2 = \{[(1-a)(1-a)]^2 a^2 + [(1-a)(0-a)]^2\,2a(1-a) + [(0-a)(0-a)]^2 (1-a)^2\}\,PN = a^2(1-a)^2 PN .  (A3)

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

\sigma_h^2(f) = \Big(\sum_j^N \Delta\omega_{ij}\,\xi_j\Big)^2 = f\,[(0-a)(1-a)\,aN]^2 = a^4(1-a)^2 f N^2 .  (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

\sigma_h^2(L) = a^4(1-a)^2 L N^2 .

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L, in a network where p patterns have been encoded, is

\sigma_h^2(O) = (f + L)\,a^4(1-a)^2 N^2/2 + a^2(1-a)^2 pN .  (A5)

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is instead based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7) plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


VARIANCE THEORY OF MIRROR EFFECT 429

nodes active at encoding (a) the standard deviation ofthe net inputs for the easy class words [sh(A)] and thestandard deviation of the net inputs for the difficult classwords [sh(B)] The other parameters were held constantand were set to the same values as those in Figure 4D[2N = 100 mh(N ) = 0 mh(O) = 1 and C = 0] The empir-ical standard deviations (si) were not reported inGlanzer et al (1993) so these parameters were set toone

The results show that the variance theory accounts for 98% (r²) of the variance for the probabilities. The attention-likelihood theory fits equally well. The variance theory accounts for 93% of the variance for the slope; the attention-likelihood theory accounts for 84% of the variance for the slope. Thus, the variance theory accounted for the same amount of variance for the probabilities, and more variance for the slope, as compared with the attention-likelihood theory when three parameters were fitted.

It is reasonable to assume that the percentage of active nodes at encoding (a) does not vary across conditions. The ratio of standard deviations of the net inputs [σ_h(B)/σ_h(A)] may also be conceived of as constant, given that the same material is used in the different conditions. Therefore, the variance theory was also fitted with a single variable, namely the standard deviation of the net inputs for the easy class [σ_h(A)]. The activity level was fixed to 0.10, and the ratio of the standard deviations of the net inputs, σ_h(B)/σ_h(A), to 1.25 (these values were the averages of the fitted parameters in the previous fit; these parameters were also used in Figure 4D). The results are presented in Figure 8A, where the predicted probabilities are plotted as a function of the observed

Figure 8. (A) Fitting the variance theory to Glanzer, Adams, and Kim's (1993) probability data. The figure shows the predicted probabilities (on the vertical axis) as a function of the observed probabilities (on the horizontal axis). (B) Fitting the variance theory to Glanzer et al.'s (1993) standard deviation slope data. The figure shows the predicted ratio of slopes of the receiver-operating characteristic curves (on the vertical axis) as a function of the observed ratios (on the horizontal axis).


probabilities. Figure 8B shows the corresponding results for the slope. The accounted variance is .96 for the probabilities and .85 for the slopes. Thus, the variance theory fits the slopes using a single parameter equally well as the attention-likelihood theory does with three fitting parameters. The fit of the variance theory for the probabilities using one parameter is slightly less than the fit of the attention-likelihood theory using three fitting parameters. It may be concluded that the fit of the variance theory is reasonably good for the probabilities and the slopes. The slopes have a better fit in the variance theory than in the attention-likelihood theory when three variables are used.

ANALYTIC SOLUTIONS

In this section, analytic solutions of the variance theory, an approximation of the standard deviation of the recognition strength, and analyses of optimal performance are presented. The variance theory has a simple analytic solution and can be fully described by four parameters. Two of these parameters, namely the standard deviation of the net inputs for the easy class [σ_h(A)] and the standard deviation of the net inputs for the difficult class [σ_h(B)], can also be expressed as functions of the frequency of the item (see the Appendix). The other two parameters are the number of nodes (N) and the expected probability that the nodes are active at encoding (a).

There are other variables in the theory; however, they do not increase the degrees of freedom. There are four expected net inputs (μ_h); however, two degrees of freedom disappear because the new net inputs are constrained to be equal, as are the old net inputs [μ_h(AN) = μ_h(BN) and μ_h(AO) = μ_h(BO)]. Furthermore, the predictions are independent of moving the old and new distributions in parallel, thus removing another degree of freedom. Finally, changing the difference between the expected old and new net inputs is mathematically equivalent to changing the standard deviations [σ_h(A) and σ_h(B)]. Thus, the degrees of freedom in the net inputs can be captured by the degrees of freedom in the standard deviations. The activation threshold (T) and the probability that nodes are active (P_c) are simply functions of other variables and therefore do not increase the degrees of freedom. Thus, there are four degrees of freedom for the distributions, namely σ_h(A), σ_h(B), N, and a. An additional degree of freedom is introduced when placing the recognition criterion (C).

The probability (P_c) that the net inputs exceed the activation threshold (T) for nodes active during encoding can be explicitly solved from the expected net inputs (μ_h) and the standard deviation of the net inputs (σ_h). This probability depends on the distribution of the net inputs, which can be approximated with a normal distribution. P_c is solved by integrating the net-input density from the activation threshold (T) to infinity (∞) over the probability density function for a normal distribution. Thus, the probability (P_c) that a node is active at recognition is

$$P_c = a \int_{T}^{\infty} \frac{1}{\sqrt{2\pi\sigma_h^2}}\, e^{-(h-\mu_h)^2/(2\sigma_h^2)}\, dh. \quad (6)$$

Subtracting the expected percentage of active nodes at recognition (a/2; see note 1) from the percentage of active nodes and dividing by the standard deviation of the net inputs (σ_h) gives the expected recognition strength (μ_s):

$$\mu_s = \frac{P_c - a/2}{\sigma_h}.$$

Note that the analytic solution uses the standard deviation of the class (σ_h) as an approximation of the standard deviation of the item (σ_h′), because it simplifies the analytic solution; however, the variance theory, or the simulation, uses the standard deviation of the item. This approximation is good when there is a large number of features; however, for a small number of features, the variance of feature strength for a single item may fluctuate on an item-to-item basis around the variance of the net inputs for all the items.

The standard deviation of the recognition strength (σ_s) is calculated from σ_h, P_c, and N. There are 2N nodes in the context and item layers together. The distribution of P_c is binomial but can, when 2NP_c(1 − P_c) > 10, be approximated with a normal distribution with a standard deviation of [P_c(1 − P_c)/(2N)]^{1/2}. The final result is scaled by the normalization factor 1/σ_h:

$$\sigma_s = \frac{1}{\sigma_h}\left[\frac{P_c(1-P_c)}{2N}\right]^{1/2}. \quad (7)$$

A yes response occurs if the recognition strength is above the recognition criterion (C). The probability of a yes response [P(Y)] is calculated from the expected recognition strength, the variance of the recognition strength, and the recognition criterion by integrating the density of the recognition strength over a normal distribution:

$$P(Y) = \int_{C}^{\infty} \frac{1}{\sqrt{2\pi\sigma_s^2}}\, e^{-(s-\mu_s)^2/(2\sigma_s^2)}\, ds.$$

The probability of choosing A over B in a two-choice forced recognition test [P(A,B)] is calculated from the expected recognition strengths of A [μ_s(A)] and B [μ_s(B)] and the standard deviations of the recognition strengths of A [σ_s(A)] and B [σ_s(B)]:

$$P(A,B) = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}}\, \exp\!\left(-\frac{\left\{s-\left[\mu_s(A)-\mu_s(B)\right]\right\}^2}{2\left[\sigma_s^2(A)+\sigma_s^2(B)\right]}\right) ds.$$

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
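To make the analytic solution concrete, the following is a minimal sketch of Equations 6 and 7 and the yes-response probability in Python. The function names and the example parameter values (chosen in the spirit of the Figure 4D fit described above) are illustrative assumptions, not part of the original model code.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the z-integrals

def p_c(mu_h, sigma_h, T, a):
    """Equation 6: probability that a node active at encoding is active
    at recognition; integrates the net-input density from T to infinity."""
    return a * (1.0 - STD_NORMAL.cdf((T - mu_h) / sigma_h))

def mu_s(pc, sigma_h, a):
    """Expected recognition strength: active-node proportion minus its
    expectation (a/2), normalized by the net-input standard deviation."""
    return (pc - a / 2.0) / sigma_h

def sigma_s(pc, sigma_h, n):
    """Equation 7: standard deviation of recognition strength (2N nodes)."""
    return ((pc * (1.0 - pc)) / (2.0 * n)) ** 0.5 / sigma_h

def p_yes(mu, sigma, c):
    """Probability of a yes response: strength density above criterion C."""
    return 1.0 - STD_NORMAL.cdf((c - mu) / sigma)

# Illustrative parameters echoing the Figure 4D fit:
# 2N = 100 nodes, mu_h(N) = 0, mu_h(O) = 1, criterion C = 0, a = 0.10.
N, a, C = 50, 0.10, 0.0
mu_new, mu_old = 0.0, 1.0
T = (mu_new + mu_old) / 2.0  # threshold between the new and old means
for label, mu_h, sigma_h in [("new", mu_new, 1.0), ("old", mu_old, 1.0)]:
    pc = p_c(mu_h, sigma_h, T, a)
    print(label, round(p_yes(mu_s(pc, sigma_h, a), sigma_s(pc, sigma_h, N), C), 3))
```

Run as is, the sketch prints a false alarm rate for the new distribution and a hit rate for the old distribution under one shared criterion; the net-input standard deviations of one are placeholders.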



Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is calculated in the model with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (P_c) is assumed to be low. By approximating 1 − P_c with one, the variance of the recognition strength can be simplified to

$$\sigma_s^2 = \frac{P_c}{2N\sigma_h^2}.$$

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is then proportional to the number of active nodes (σ_s² ∝ P_c). This approximation suggests a very simple interpretation of the slope of the z-ROC. The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations:

$$\frac{\sigma_s^2(N)}{\sigma_s^2(O)} \approx \frac{P_c(N)}{P_c(O)}.$$

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.8, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.8²).

Another approximation useful for understanding the model is that, for two classes of items, the numbers of active nodes in the two new distributions are approximately equal, and the numbers of active nodes in the two old distributions are approximately equal [P_c(AN) ≈ P_c(BN) and P_c(BO) ≈ P_c(AO)]. Given these approximations and the approximation above (1 − P_c ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way:

$$\frac{\sigma_s(BO)}{\sigma_s(AO)} \le \frac{\sigma_s(BN)}{\sigma_s(AN)} \approx \frac{\sigma_h(A)}{\sigma_h(B)}.$$

The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions. The exact solution predicts a slightly smaller ratio in the old than in the new distributions. This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals. It is reasonable to assume that the brain has evolved so that performance at retrieval is optimal or near-optimal. Here it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance in the form of discriminability between new and old items. Examples of assumptions that yield good performance in the model are a low percentage of active nodes, setting the activation threshold between old and new net inputs, measuring performance by nodes that are active in the encoding pattern, and normalizing the recognition strength. Optimal performance in the network thus requires the implementation suggested by the variance theory: if the implementation is changed significantly, performance is degraded and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytic equations above, we find the following expression:

$$d' = \frac{\mu_s(O) - \mu_s(N)}{\sigma_s(NO)} = \frac{P_c(O) - P_c(N)}{\left[P_c(NO)\left(1 - P_c(NO)\right)/(2N)\right]^{1/2}}. \quad (8)$$

Because σ_s(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σ_s(NO), in the denominator of this equation. Thus, P_c(NO) is equal to [P_c(N) + P_c(O)]/2. P_c(·) was calculated with Equation 6. The expected value of the net inputs and the standard deviation of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding



(a). The results show that d′ is optimal for a = 0.052. The d′ is lower for larger and smaller a. The lower d′ for large a occurs because the interference from other items increases: for an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. For very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. Thus, for a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance. This is consistent with the variance theory, which requires a low activity level for fitting some of the empirical data (see below).

There is another factor that contributes to the fact that optimal performance occurs when the percentage of active nodes is medium low, which is that the number of possible representations increases with a. If there is only one node active in all the representations, there are N possible representations; if there are two nodes active in all the representations, there are approximately N² possible representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. Surprisingly, however, the placement of the criterion becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = 0.052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = 0.052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = 0.5, the optimal value of T is much larger than the expected value of old net inputs of 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.
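The dependence of d′ on the threshold can be reproduced numerically. The sketch below scans T and evaluates d′ with Equations 6 and 8 at a fixed activity level; the old net-input mean of 1.42 echoes the value quoted above, but the two net-input standard deviations are assumed stand-ins rather than the exact values behind Figure 9B.

```python
from statistics import NormalDist

Z = NormalDist()

def d_prime(T, a, n, mu_new, mu_old, sd_new, sd_old):
    """d' of Equation 8, with P_c from Equation 6 for the new and old
    net-input distributions and the pooled P_c(NO) in the denominator."""
    pc_new = a * (1 - Z.cdf((T - mu_new) / sd_new))
    pc_old = a * (1 - Z.cdf((T - mu_old) / sd_old))
    pc_no = (pc_new + pc_old) / 2
    return (pc_old - pc_new) / ((pc_no * (1 - pc_no) / (2 * n)) ** 0.5)

# Illustrative values: N = 30 nodes, a = 0.052, new net inputs centered
# at 0; the old standard deviation is assumed slightly larger than the new.
a, n, mu_new, mu_old, sd_new, sd_old = 0.052, 30, 0.0, 1.42, 1.0, 1.2
best_T = max((t / 100 for t in range(-100, 250)),
             key=lambda t: d_prime(t, a, n, mu_new, mu_old, sd_new, sd_old))
print("optimal T:", best_T, " midpoint of means:", (mu_new + mu_old) / 2)
```

With an old distribution that is wider than the new one, the scan places the best threshold slightly above the midpoint of the two means, matching the pattern reported for Figure 9B (0.81 versus 0.71).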

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the amount of weight change, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (although the signs of the weight changes differ). Thus, inactive and active nodes carry the same amount of information, and performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in the weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved in the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above 0.15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when old correct responses and new correct responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance is found when the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smaller standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. This is true because, if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f[S(O)] denotes the density of recognition strength of the old distribution, and f[S(N)] the density of the recognition strength of the new distribution. The ratio between these densities is called the likelihood ratio, L = f[S(O)]/f[S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).

In the mirror effect, there are two classes of items, each having a new and an old distribution, with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ when the decision is based only on nodes that are active during encoding. The dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A. Optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.


discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses are made for the difficult class, and a moderate criterion for the easy class, so that some yes responses are made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes is optimal.

This problem may be stated formally as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the total hit rate [P(AO) + P(BO)] is maximized? The solution to this problem is surprisingly simple. Optimal performance occurs when the likelihood ratios of the two classes are equal:

$$L(A) = \frac{f[S(AO)]}{f[S(AN)]} = \frac{f[S(BO)]}{f[S(BN)]} = L(B).$$

It is easy to see that this criterion must be satisfied for optimal performance, because any shift away from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be raised to keep the total false alarm rate constant, so the changes in the two false alarm rates must cancel [ΔP(AN) + ΔP(BN) = 0]. Because a small change in a hit rate equals the likelihood ratio times the corresponding change in the false alarm rate, the change in the total hit rate is ΔP(AO) + ΔP(BO) = [L(A) − L(B)]ΔP(AN). The likelihood ratios are monotonically increasing functions of the recognition criteria, so after such a shift L(A) = f[S(AO)]/f[S(AN)] < f[S(BO)]/f[S(BN)] = L(B). Thus, any change in the placement of the criteria away from L(A) = L(B) results in an overall decrease in hit rate [ΔP(AO) + ΔP(BO) < 0], and performance suffers.
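The equal-likelihood-ratio condition can be checked numerically. In the sketch below, the two classes are given Gaussian new and old strength distributions (the means and standard deviations are illustrative assumptions); for a fixed total false alarm rate, the split that maximizes the total hit rate is located, and the likelihood ratios at the corresponding criteria are compared.

```python
from statistics import NormalDist

# Illustrative class distributions: A is the easy class.
new_A, old_A = NormalDist(0, 1.0), NormalDist(2.0, 1.2)
new_B, old_B = NormalDist(0, 1.0), NormalDist(1.0, 1.1)

TOTAL_FA = 0.2  # fixed total false alarm rate P(AN) + P(BN)

def hits_and_ratios(fa_A):
    """Split the total false alarm rate as (fa_A, TOTAL_FA - fa_A),
    derive the two criteria, and return total hits plus both L values."""
    fa_B = TOTAL_FA - fa_A
    cA = new_A.inv_cdf(1 - fa_A)        # criterion giving P(AN) = fa_A
    cB = new_B.inv_cdf(1 - fa_B)
    hits = (1 - old_A.cdf(cA)) + (1 - old_B.cdf(cB))
    LA = old_A.pdf(cA) / new_A.pdf(cA)  # likelihood ratio at criterion
    LB = old_B.pdf(cB) / new_B.pdf(cB)
    return hits, LA, LB

best = max((fa / 1000 for fa in range(1, int(TOTAL_FA * 1000))),
           key=lambda fa: hits_and_ratios(fa)[0])
hits, LA, LB = hits_and_ratios(best)
print(f"best split P(AN) = {best:.3f}: L(A) = {LA:.3f}, L(B) = {LB:.3f}")
# At the hit-maximizing split, the two likelihood ratios (nearly) agree.
```

At the maximizing split, L(A) and L(B) should coincide up to the grid resolution, which is exactly the condition derived above.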

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion behaves when it is moved over the two classes. Thus, the second criterion can be indirectly inferred from the formulation of the theory. This is done in the variance theory by the normalization factor, which scales the recognition strength differently for the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A as the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = 0.25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than 0.15 and smaller than 0.60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than 0.15 and for false alarm rates larger than 0.60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that it may be described as near optimal for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Classes A and B, and the total false alarm rate is equal to the average false alarm rate of Classes A and B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger one, performance is approximately equal before and after normalization [i.e., for 0.25 < P(N) < 0.40]. For conservative recognition criteria [P(N) < 0.25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately 0.05, and it occurs when the false alarm


rate is approximately 0.05. For liberal recognition criteria [P(N) > 0.40], there is also an advantage for normalized performance. The largest advantage is around 0.01, and it occurs when the false alarm rate is 0.70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.
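As a rough numerical illustration of this comparison, the sketch below builds both recognition-strength distributions (unnormalized node-count proportions versus counts normalized by the net-input standard deviation, following Equations 6 and 7), sweeps one shared criterion, and compares total hit rates at matched total false alarm rates. Apart from σ_h(B) = 1.56, the Figure 4D value mentioned above, the numeric values are illustrative assumptions, and the unnormalized variant is a simplified stand-in for the node-count measure.

```python
from statistics import NormalDist

Z = NormalDist()
a, N, T, MU_OLD = 0.10, 50, 0.5, 1.0   # illustrative parameter values
SD_H = {"A": 1.0, "B": 1.56}           # net-input SDs; 1.56 from Figure 4D

def strength_dist(cls, old, normalized):
    """Recognition-strength distribution for one class and item status,
    before or after normalization by the net-input SD (Equations 6, 7)."""
    sd_h = SD_H[cls]
    mu_h = MU_OLD if old else 0.0
    pc = a * (1 - Z.cdf((T - mu_h) / sd_h))
    mu, sd = pc, (pc * (1 - pc) / (2 * N)) ** 0.5
    if normalized:
        mu, sd = (mu - a / 2) / sd_h, sd / sd_h
    return NormalDist(mu, sd)

def roc(normalized):
    """Total (false alarm, hit) pairs over a sweep of one shared criterion."""
    dists = {(c, o): strength_dist(c, o, normalized)
             for c in "AB" for o in (False, True)}
    lo = min(d.mean - 4 * d.stdev for d in dists.values())
    hi = max(d.mean + 4 * d.stdev for d in dists.values())
    for i in range(400):
        crit = lo + (hi - lo) * i / 399
        fa = sum(1 - dists[(c, False)].cdf(crit) for c in "AB") / 2
        hit = sum(1 - dists[(c, True)].cdf(crit) for c in "AB") / 2
        yield fa, hit

def hit_at(fa_target, curve):
    """Hit rate at the criterion whose total false alarm rate is closest."""
    return min(curve, key=lambda p: abs(p[0] - fa_target))[1]

raw, norm = list(roc(False)), list(roc(True))
for fa in (0.05, 0.25, 0.70):
    print(f"FA={fa:.2f}: hit-rate gain from normalization "
          f"{hit_at(fa, norm) - hit_at(fa, raw):+.3f}")
```

Under these stand-in parameters, the gain should be largest for conservative criteria and shrink toward liberal criteria, qualitatively echoing Figure 9D.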

A Nonoptimal Network Is Inconsistent With Empirical Data on Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce the appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution approaches the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., P_c > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because P_c(1 − P_c)/N decreases for P_c over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: the only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σ_s(BN) < σ_s(BO) < σ_s(AN) < σ_s(AO) and σ_s(BN) < σ_s(AN) < σ_s(BO) < σ_s(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σ_s(BO) < σ_s(BN) < σ_s(AN) < σ_s(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no lateral connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental, learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item–context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; however, for nonwords, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be drawn regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length or the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes–no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item (σ_h′) is used, so the subject does not need to know the class or the standard deviation of the class (σ_h). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution was presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have higher noise, or variability, than do low-frequency words. The variance theory also uses a variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes that model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σ_p(N)] and old [σ_p(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

$$\frac{a\,\sigma_p(N)}{\sigma_p(N) + \sigma_p(O)}.$$

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the ROC curve is 0.8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξ_j ξ_j = 1:

$$\mu_h(O) = \sum_{j}^{N} \Delta w_{ij}\,\xi_j = \sum_{j}^{N} (\xi_i - a)(\xi_j - a)\,\xi_j = aN(1 - a)^2. \quad (A1)$$

The expected value of the net inputs for the new items is zero:

$$\mu_h(N) = 0. \quad (A2)$$

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

$$\sigma_h^2(N) = E\Bigg[\bigg(\sum_{p}^{P}\sum_{j}^{N} \Delta w_{ij}\,\xi_j\bigg)^{\!2}\Bigg] = PN\,E\big[(\xi_i - a)^2(\xi_j - a)^2\,\xi_j\big] = a^2(1 - a)^3 PN. \quad (A3)$$

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

$$\sigma_h^2(f) = E\Bigg[\bigg(\sum_{j}^{N} w_{ij}\,\xi_j\bigg)^{\!2}\Bigg] = f\,(aN)^2\,E\big[(\xi_i - a)^2(\xi_j - a)^2\big] = f a^4 (1 - a)^2 N^2. \quad (A4)$$

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

$$\sigma_h^2(L) = a^4 (1 - a)^2 L N^2.$$

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L, in a network where p patterns have been encoded, is

$$\sigma_h^2(O) = \tfrac{1}{2}\big[\sigma_h^2(f) + \sigma_h^2(L)\big] + a^2(1 - a)^3 pN = \tfrac{1}{2}(f + L)\,a^4 (1 - a)^2 N^2 + a^2(1 - a)^3 pN. \quad (A5)$$

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where P_c(O) and P_c(N) are the proportions of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).
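The net-input moments above can also be checked by direct simulation. The sketch below assumes the Hebbian-style weight change Δw_ij = (ξ_i − a)(ξ_j − a) implied by the reconstruction of Equation A1; the pattern counts and sizes are arbitrary illustrative choices, and the printed estimates are meant only as a rough check that new items give a near-zero mean net input while nodes active in an old item's encoding show a positive mean.

```python
import random

random.seed(1)
N, P, a = 200, 50, 0.1  # nodes, stored patterns, activity level (illustrative)

def pattern():
    """A sparse binary pattern with activity probability a."""
    return [1 if random.random() < a else 0 for _ in range(N)]

# Hebbian-style storage: w_ij accumulates (xi_i - a)(xi_j - a) per pattern
# (an assumption of this sketch, matching the reconstructed Equation A1).
patterns = [pattern() for _ in range(P)]
w = [[sum((p[i] - a) * (p[j] - a) for p in patterns) for j in range(N)]
     for i in range(N)]

def net_inputs(cue):
    """Net input to each node given a retrieval cue (active nodes drive it)."""
    return [sum(w[i][j] * cue[j] for j in range(N)) for i in range(N)]

# New item: the mean net input should be near zero (Equation A2).
h_new = net_inputs(pattern())
print("mean new net input:", sum(h_new) / N)

# Old item: nodes active at encoding should show a positive mean net input.
old = patterns[0]
h_old = net_inputs(old)
active = [h for h, x in zip(h_old, old) if x == 1]
print("mean old net input (active nodes):", sum(active) / max(len(active), 1))
```

Averaging over many runs, the new-item mean hovers around zero while the old-item mean for active nodes stays positive, which is the asymmetry that the recognition-strength count exploits.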

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


430 SIKSTROM

probabilities Figure 8B shows the corresponding resultsfor the slope The accounted variance is 096 for the prob-abilities and 085 for the slopes Thus the variance theoryfits the slopes using a single parameter equally well asthe attention-likelihood theory does with three fitting pa-rameters The fit for the variance theory for the probabil-ities using one parameter is slightly less than the fit of theattention-likelihood theory using three fitting parametersIt may be concluded that the fit for the variance theory isreasonably good for the probabilities and the slopes Theslopes have a better fit in the variance theory than in theattention-likelihood theory when three variables are used

ANALYTIC SOLUTIONS

In this section analytic solutions of the variance the-ory an approximation of the standard deviation of therecognition strength and analyses of optimal perfor-mance are presented The variance theory has a simpleanalytic solution and can be fully described by four pa-rameters Two of these parametersmdashnamely the stan-dard deviation of the net inputs from the easy class[sh(A)] and the standard deviation of the net inputs fromthe difficult class [sh(B)]mdashcan also be expressed as thefrequency of the item (see the Appendix) The other twoparameters are the number of nodes (N ) and the expectedprobability that the nodes are active at encoding (a)

There are other variables in the theory however theydo not increase the degrees of freedom There are fourexpected net inputs (mh) however two degrees of free-dom disappear because the new net inputs are constrainedto be equal as well as the old net inputs [mh(AN) =mh(BN) and mh(AO) = mh(BO)] Furthermore the predic-tions are independent of moving the old and new dis-tributions in parallel thus removing another degree offreedom Finally changing the difference between the ex-pected old and new net inputs is mathematically equiva-lent to changing the standard deviations [sh(A) andsh(B)] Thus the degrees of freedom in the net inputscan be captured by the degrees of freedom in the stan-dard deviations The activation threshold (T ) and theprobability that nodes are active (Pc) are simply func-tions of other variables and therefore do not increase thedegrees of freedom Thus there are four degrees of free-dom for the distributionsmdashnamely sh(A) sh(B) N anda An additional degree of freedom is introduced whenplacing the recognition criterion (C)

The probability (Pc) that the net inputs exceed the ac-tivation threshold (T ) for nodes active during encodingcan be explicitly solved from the expected net inputs(mh) and the standard deviation of the net inputs (sh)This probability is dependent on the distribution of thenet inputs which can be approximated with a normaldistribution Pc is solved by integrating the net inputsfrom mh T to infinity (yen) over the probability densityfunction for a normal distribution Thus the probability(Pc) that a node is active at recognition is

(6)

Subtracting the expected percentage of active nodes atrecognition (a2 see note 1) from the percentage of ac-tive nodes and dividing by the standard deviation of thenet inputs (sh) calculates the expected recognitionstrength (ms)

Note that the analytic solution uses the standard devi-ation of the class (sh) as an approximation of the stan-dard deviation of the item (shcent ) because it simplifies theanalytic solution however the variance theory or thesimulation uses the standard deviation of the item Thisapproximation is good when there are a large number offeatures however for a small number of features thevariance of feature strength for a single item may fluctu-ate on an item-to-item basis around the variance of thenet inputs for all the items

The standard deviation of the recognition strength (ss)is calculated from sh Pc and N There is 2N number ofnodes in the context and the item layers The distributionof Pc is binomial but can for a certain criterion [ie 2NPc(1 Pc) gt 10] be approximated with a normal distri-bution with a standard deviation of [Pc(1 Pc) 2N]12The final result is scaled by the normalization factor 1sh

(7)

A yes response occurs if the recognition strength isabove the recognition criterion (C) The probability of ayes response [P(Y)] is calculated from the expected recog-nition strength the variance of the recognition strengthand the recognition criterion by integrating the density ofthe recognition strength over a normal distribution

The probability of choosing A over B in a two-choiceforced recognition test [P(A B)] is calculated from theexpected recognition strength of A [ms(A)] and B [ms(B)]and the standard deviations of the recognition strengthof A [ss(A)] and B [ss(B)]

An Excel sheet for calculating the predictions of the variance theory is available online (www.psych.utoronto.ca/~sverker/variance.html).
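For readers who prefer a script to the spreadsheet, the following Python sketch evaluates the analytic solution end to end (Equation 6, the expected strength, Equation 7, and the yes probability). The specific parameter values are illustrative assumptions: μ_h(O) = 1.42, T = 0.71, a = .052, and N = 30 are taken from the optimization analysis below, and σ_h = 1 versus 1.56 stand in for an easy and a difficult class.

```python
# A minimal sketch of the analytic solution, under assumed parameter values.
from math import sqrt, erfc

def norm_sf(x, mu, sigma):
    """P(X > x) for X ~ Normal(mu, sigma)."""
    return 0.5 * erfc((x - mu) / (sigma * sqrt(2.0)))

def p_yes(mu_h, sigma_h, T=0.71, a=0.052, N=30, C=0.0):
    p_c = a * norm_sf(T, mu_h, sigma_h)                       # Equation 6
    mu_s = (p_c - a / 2.0) / sigma_h                          # expected strength
    sigma_s = sqrt(p_c * (1.0 - p_c) / (2.0 * N)) / sigma_h   # Equation 7
    return norm_sf(C, mu_s, sigma_s)                          # P(Y)

# Easy class A (sigma_h = 1) and difficult class B (sigma_h = 1.56); new items
# have expected net input 0, old items 1.42.
for label, mu_h, sigma_h in [("AN", 0.0, 1.0), ("BN", 0.0, 1.56),
                             ("BO", 1.42, 1.56), ("AO", 1.42, 1.0)]:
    print(label, round(p_yes(mu_h, sigma_h), 3))
```

With these values, the four yes rates come out in the mirror order P(AN) < P(BN) < P(BO) < P(AO), a useful sanity check on the equations as reconstructed here.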


Approximations of the Standard Deviation of Recognition Strength

The standard deviation of the recognition strength is calculated in the model with Equation 7. However, to facilitate the understanding of this equation, it is useful to make some approximations. First, note that the probability that a node is active (P_c) is assumed to be low. By approximating 1 − P_c with one, the variance of the recognition strength can be simplified to

\sigma_s^2 = \frac{P_c}{2 \sigma_h^2 N}.

For a particular class of items, the variances of the net inputs of old and new items are equal, and the variance of the recognition strength is then proportional to the number of active nodes (σ_s² ∝ P_c). This approximation suggests a very simple interpretation of the slope of the z-ROC: The ratio of variances between new and old items is simply the ratio between the number of nodes active in the new items' representations and the number of nodes active in the old items' representations,

\frac{\sigma_s^2(N)}{\sigma_s^2(O)} \approx \frac{P_c(N)}{P_c(O)}.

Or, alternatively, the slope of the z-ROC curve is equal to the square root of the ratio of the number of nodes active in the new items' representations to the number of nodes active in the old items' representations. For example, if the slope of the z-ROC curve is 0.80, the number of active nodes in the new items' representations divided by the number of nodes active in the old items' representations is 0.64 (= 0.80²).
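As a quick numerical illustration of this reading of the slope (the P_c values below are made-up stand-ins for the output of Equation 6):

```python
# Slope of the z-ROC as sqrt(P_c(N)/P_c(O)); the P_c values are illustrative.
from math import sqrt

p_c_new, p_c_old = 0.012, 0.040
slope_approx = sqrt(p_c_new / p_c_old)            # uses 1 - P_c ~ 1
slope_exact = sqrt(p_c_new * (1 - p_c_new) / (p_c_old * (1 - p_c_old)))
print(round(slope_approx, 3), round(slope_exact, 3))  # nearly identical at low P_c
```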

Another approximation useful for understanding the model is that, for the two classes of items, the number of active nodes in the new distributions is approximately equal and the number of active nodes in the old distributions is approximately equal [P_c(AN) ≈ P_c(BN) and P_c(BO) ≈ P_c(AO)]. Given these approximations and the approximation above (1 − P_c ≈ 1), the recognition strength standard deviation is inversely related to the standard deviation of the net inputs in the following way: The ratio between the recognition strength standard deviations of the difficult and the easy distributions is equal to the ratio between the standard deviations of the net inputs of the easy and the difficult distributions. Furthermore, the ratio between the recognition strength standard deviations of the difficult and easy new distributions is equal to the ratio between the recognition strength standard deviations of the difficult and the easy old distributions; the exact solution predicts a slightly smaller ratio in the old than in the new distributions:

\frac{\sigma_s(BO)}{\sigma_s(AO)} \le \frac{\sigma_s(BN)}{\sigma_s(AN)} \approx \frac{\sigma_h(A)}{\sigma_h(B)}.

This suggests that the ratio between the recognition strength standard deviations of the difficult class and the easy class can be interpreted as the ratio between the standard deviations of the net inputs of the easy class and the difficult class.

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably, good memory performance is an important factor for selection in the evolutionary process of humans and animals, so it is reasonable to assume that the brain has evolved such that performance at retrieval is optimal or near-optimal. Here, it is investigated how several assumptions of the variance theory influence performance. It is shown that several assumptions of the model fall out as a consequence of optimizing performance, in the form of discriminability between new and old items: If the model is implemented in a different way, performance is degraded, and the model does not account for the experimental data. Examples of assumptions that yield good performance in the model are a low percentage of active nodes, setting the activation threshold between the old and the new net inputs, measuring performance by the nodes that are active in the encoding pattern, and normalizing the recognition strength. Thus, optimal performance in the network requires the implementation suggested by the variance theory; if that implementation is changed significantly, performance is degraded, and the network does not produce the empirically found memory phenomena.

To conduct this analysis, it is necessary to define a measurement of performance. A natural choice is to use d′. By using the analytic equations above, we find the following expression:

d' = \frac{\mu_s(O) - \mu_s(N)}{\sigma_s(NO)} = \frac{P_c(O) - P_c(N)}{\left\{ P_c(NO)\left[1 - P_c(NO)\right] / (2N) \right\}^{1/2}}. \qquad (8)

Because σ_s(N) sometimes is near zero, it was found useful to use the standard deviation of both the old and the new items' recognition strength, σ_s(NO), in the denominator of this equation; P_c(NO) is equal to [P_c(N) + P_c(O)]/2. P_c was calculated with Equation 6. The expected value of the net inputs and the standard deviation of the net inputs for new and old items were calculated with the equations derived in the Appendix (Equations A1, A2, and A3). Low-frequency items with a preexperimental frequency of zero were used (f = 0), and the list length was set to one (L = 1).

The performance can be expressed by the parameters a, N, and p. Increasing the number of nodes (N) monotonically increases d′, and increasing the number of stored patterns (p) monotonically decreases d′. The impact of these two parameters on d′ was of less importance here, and they were set to N = 30 and p = 100.
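A sketch of this optimization in Python is given below. The net-input moments follow the Appendix equations as reconstructed in this edition [μ_h(O) = a(1 − a)²N and Equation A3 for the new-item variance], so the exact algebraic forms should be treated as assumptions; the threshold is placed midway between the new and old expected net inputs, as the theory prescribes.

```python
# Sketch of d' as a function of a (cf. Figure 9A), under reconstructed moments.
from math import sqrt, erfc

def norm_sf(x, mu, sigma):
    return 0.5 * erfc((x - mu) / (sigma * sqrt(2.0)))

def d_prime(a, N=30, p=100):
    mu_old = a * (1.0 - a) ** 2 * N                     # Equation A1 (reconstructed)
    sigma_h = sqrt(a ** 3 * (1.0 - a) ** 2 * p * N)     # Equation A3 (reconstructed)
    T = mu_old / 2.0                                    # threshold between new and old
    p_c_new = a * norm_sf(T, 0.0, sigma_h)              # Equation 6, new items
    p_c_old = a * norm_sf(T, mu_old, sigma_h)           # Equation 6, old items
    p_c_no = (p_c_new + p_c_old) / 2.0
    return (p_c_old - p_c_new) / sqrt(p_c_no * (1.0 - p_c_no) / (2.0 * N))  # Eq. 8

best_a = max(range(1, 500), key=lambda i: d_prime(i / 1000.0)) / 1000.0
print(best_a, round(d_prime(best_a), 2))
```

The sketch reproduces the qualitative result, an interior optimum at a low activity level, although the exact optimum depends on the reconstructed moment formulas and on whether T is optimized as well.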

Optimal percentage of nodes active at encoding. The solid line in Figure 9A shows the theoretical d′ as a function of the percentage of nodes active at encoding (a).


The results show that d′ is optimal for a = .052; d′ is lower for larger and for smaller a. The lower d′ for large a occurs because the interference from other items increases: For an a larger than the optimal value, the weight changes are distributed over a larger number of nodes, and the recognition tests therefore include more noise.

The lower d′ for an a less than the optimal value occurs because there is variability in the number of active nodes at encoding. For very small values of a, the fluctuation in the number of nodes active in the encoded representation becomes increasingly important. Thus, for a small a, errors are more likely to occur because old items have few active nodes at encoding, making it less likely that many nodes will be active at retrieval (independently of how well they are encoded). This analysis suggests that a medium-low percentage of active nodes at encoding yields optimal performance, which is consistent with the variance theory: The theory requires a low activity for fitting some of the empirical data (see below).

There is another factor that contributes to optimal performance occurring when the percentage of active nodes is medium low: The number of possible representations increases with a. If there is only one node active in each representation, there are N possible representations; if there are two nodes active in each representation, there are approximately N² possible representations; and so forth. This factor is not included in the analyses.

Optimal placement of the activation threshold. An important property of d′ is that it is insensitive to where the criterion is placed; any criterion yields the same performance. The activation threshold (T) may be seen as the criterion for a single node, and therefore one might intuitively think that the placement of the threshold is unimportant for d′. Surprisingly, however, the placement of this threshold becomes important in the variance theory, because there is a nonlinear transformation when the nodes are activated. This nonlinearity makes d′ dependent on the activation threshold in the nodes.

The d′ was maximized by changing the activation threshold (T) and the percentage of nodes active at encoding (a). The maximum d′ was 2.40, when T = 0.81 and a = .052. Figure 9B plots d′ as a function of the activation threshold (T) when the percentage of nodes active at encoding was fixed at the optimal value (a = .052). The results show that d′ has an optimal value when the activation threshold is set between the old and the new distributions. The variance theory suggests that the threshold should be set to the average of the expected old and expected new net inputs. For the parameter values used here, this value is 0.71, which is near, but slightly lower than, the optimal value of 0.81 (the expected value of the new net input is 0, and the expected value of the old net input is 1.42). Note that this result applies when both a and T are set to the optimal values. If a is set to a nonoptimal value, the optimal value of T may deviate significantly from the one proposed by the theory (e.g., if a = .5, the optimal value of T is much larger than the expected value of the old net inputs, 1.88).

This analysis emphasizes the importance of setting the activation threshold as suggested by the variance theory. Setting the activation threshold between the old and the new expected net inputs yields not only the mirror effect, but also an optimal performance in the network. Notice that the activation threshold (T) is constant even if the subjects' decision criterion (C) is changed. Therefore, d′ will not change when the decision criterion changes. By changing the decision criterion (rather than the recognition threshold), subjects can maintain an optimal d′ for different confidence levels.

Optimal usage of the state of activation in the cue pattern. An interesting question is how much information is carried in nodes that are active in the encoded pattern, as compared with inactive nodes. If both active and inactive nodes carry a similar amount of information, it is useful to use all the nodes at retrieval. However, if inactive nodes carry little information in relation to the noise, performance can be improved by using only the information in the active nodes.

The information carried in the nodes depends on the size of the weight changes, which in turn depends on the percentage of active nodes at encoding (a). For a = 1/2, the absolute values of the weight changes are the same for active and inactive nodes (however, the signs of the weight changes differ). Thus, inactive and active nodes carry the same amount of information, and performance is optimal when information in both active and inactive nodes is used. For a small a, the weight changes are larger for active nodes (proportional to 1 − a) than for inactive nodes (proportional to a). For a sufficiently small a, the noise in inactive nodes will overwhelm the information in their weight changes, so that, if the information is combined, the inactive nodes will harm the information in the active nodes, and performance is optimal if only information from active nodes is used.

The performance of the variance theory was also calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes (the solid line). The results show that the highest d′ is found when the decision is based only on active nodes and a is low; including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance will be found if the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true, because if the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, f [S(O)] denotes the density of the recognition strength of the old distribution, and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these densities is called the likelihood ratio, L = f [S(O)]/f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
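The intersection point can be computed directly: Setting the two normal densities equal gives a quadratic in the criterion. A sketch, with illustrative distribution parameters (means 0 and 2.4, standard deviations 1 and 1.25; these are assumptions, not fitted values):

```python
# The optimal criterion for hits minus false alarms: where f_old(x) = f_new(x),
# i.e., likelihood ratio L = 1. Assumes unequal standard deviations.
from math import log, sqrt

def intersection(mu_n, sd_n, mu_o, sd_o):
    """Root of log f_o(x) - log f_n(x) = 0, a quadratic; returns the root near the midpoint."""
    a = 1.0 / (2 * sd_n ** 2) - 1.0 / (2 * sd_o ** 2)
    b = mu_o / sd_o ** 2 - mu_n / sd_n ** 2
    c = mu_n ** 2 / (2 * sd_n ** 2) - mu_o ** 2 / (2 * sd_o ** 2) - log(sd_o / sd_n)
    disc = sqrt(b * b - 4 * a * c)
    midpoint = (mu_n + mu_o) / 2.0
    return min(((-b + disc) / (2 * a), (-b - disc) / (2 * a)),
               key=lambda x: abs(x - midpoint))

print(intersection(mu_n=0.0, sd_n=1.0, mu_o=2.4, sd_o=1.25))
# ~1.18: slightly below the midpoint of 1.2, i.e., shifted toward the
# smaller-sd (new) distribution, as stated in the text for this regime.
```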

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ when the decision is based only on nodes that are active during encoding; the dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A; optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.

This problem may formally be stated as follows: Given two classes (A and B) with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the total hit rate [P(AO) + P(BO)] is maximized? The solution to this problem is surprisingly simple: Optimal performance occurs when the likelihood ratios of the two classes are equal,

L(A) = f [S(AO)]/f [S(AN)] = L(B) = f [S(BO)]/f [S(BN)].

It is easy to see that this condition must be satisfied for optimal performance, because any shift from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be increased to keep the total false alarm rate constant; according to the formulation of the problem, the changes in the two false alarm rates must cancel [ΔP(AN) + ΔP(BN) = 0]. The likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, L(A) − L(B) < 0 when the recognition criteria are changed as specified above. Thus, L(A) = f [S(AO)]/f [S(AN)] < f [S(BO)]/f [S(BN)] = L(B), and the change in the placement of the criteria away from L(A) = L(B) results in an overall decrease in the hit rate [ΔP(AO) + ΔP(BO) < 0], so performance suffers.
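A numerical sanity check of this condition is straightforward: Sweep the split of a fixed total false alarm rate across the two classes, take the best split, and inspect the likelihood ratios at the two criteria. The class parameters below (equal-variance normals with d′ of 2 and 1) are illustrative assumptions:

```python
# Grid-search check that, at the optimal pair of criteria under a fixed total
# false alarm rate, the likelihood ratios of the two classes agree.
from math import sqrt, erfc, exp, pi

def sf(x, mu, sd):                       # P(X > x)
    return 0.5 * erfc((x - mu) / (sd * sqrt(2.0)))

def pdf(x, mu, sd):
    return exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * sqrt(2 * pi))

def inv_sf(p, mu, sd):                   # criterion c with P(X > c) = p, by bisection
    lo, hi = mu - 10 * sd, mu + 10 * sd
    for _ in range(100):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if sf(mid, mu, sd) > p else (lo, mid)
    return (lo + hi) / 2.0

FA_SUM = 0.6                             # fixed P(AN) + P(BN)
best = None
for i in range(1, 600):
    fa_a = FA_SUM * i / 600.0
    fa_b = FA_SUM - fa_a
    c_a, c_b = inv_sf(fa_a, 0.0, 1.0), inv_sf(fa_b, 0.0, 1.0)
    hits = sf(c_a, 2.0, 1.0) + sf(c_b, 1.0, 1.0)   # Class A: d' = 2; Class B: d' = 1
    if best is None or hits > best[0]:
        best = (hits, c_a, c_b)
_, c_a, c_b = best
print(pdf(c_a, 2.0, 1.0) / pdf(c_a, 0.0, 1.0),     # L(A)
      pdf(c_b, 1.0, 1.0) / pdf(c_b, 0.0, 1.0))     # L(B): approximately equal
```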

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion behaves when it is moved over the two classes. Thus, the second criterion can be inferred indirectly from the formulation of the theory. In the variance theory, this is done by the normalization factor, which scales the recognition strength differently for the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the straight line. The results before normalization (i.e., by counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes, minus the expected number of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B; therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A; therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria: For false alarm rates larger than .15 and smaller than .60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than .15 and for false alarm rates larger than .60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as near optimal for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56), in order to make the difference between performance before and after normalization more salient; all other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate of Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than or equal to performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance; the largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance; the largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the activation threshold were larger than the expected value of the net input of the old distribution, the hit rate of the easy class would be less than the hit rate of the difficult class. Similarly, if the activation threshold were lower than the expected new net input, the false alarm rate of the easy class would be larger than the false alarm rate of the difficult class. Both patterns are inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength were based on all the nodes (i.e., not only the nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength would have at least 50% of the nodes in the correct state (i.e., P_c > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because P_c(1 − P_c)/N decreases for P_c over .50. This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, it reproduces the empirical data; if the model is made nonoptimal, it does not. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported this prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations: The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in the old distributions. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σ_s(BN) < σ_s(BO) < σ_s(AN) < σ_s(AO) and σ_s(BN) < σ_s(AN) < σ_s(BO) < σ_s(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σ_s(BO) < σ_s(BN) < σ_s(AN) < σ_s(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition would make it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode; thus, there would be a confounding between the item's frequency and its recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This assumption is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength, an issue called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from the concrete class are more easily discriminated than words from the abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood theory's account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition, but it may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length or the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli: During the calculation of recognition strength, the standard deviation of the net input of the item (σ_h′) is used, so the subject does not need to know the class or the standard deviation of the class (σ_h). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution was presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses a variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves; instead, the variance theory uses the number of active nodes, normalized by the standard deviation of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes that model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network; see McClelland and Chappell (1998) for a brief discussion of this topic. An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM-retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå. (ISBN 91-7191-155-3)

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σ_p(N)] and old [σ_p(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

a\,\frac{\sigma_p(N)}{\sigma_p(N) + \sigma_p(O)}.

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the z-ROC curve is 0.8. It is actually more precise to use this value; in this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.
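As a check on the stated value, assuming a z-ROC slope of σ_p(N)/σ_p(O) = 0.8:

```latex
a\,\frac{\sigma_p(N)}{\sigma_p(N)+\sigma_p(O)}
  = a\,\frac{0.8\,\sigma_p(O)}{0.8\,\sigma_p(O)+\sigma_p(O)}
  = \frac{0.8}{1.8}\,a \approx 0.44a .
```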

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξ_j = ξ′_j = 1:

\mu_h(O) = \sum_j \Delta w_{ij}\,\xi_j = \sum_j (\xi_i - a)(\xi_j - a)\,\xi_j = a(1-a)^2 N. \qquad (A1)

The expected value of the net inputs for the new items is zero:

\mu_h(N) = 0. \qquad (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let p represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

\sigma_h^2(N) = \mathrm{Var}\!\left[\sum_j w_{ij}\,\xi_j\right] = aN\,p\,[a(1-a)]^2 = a^3(1-a)^2 pN. \qquad (A3)

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f): All the aN weights supporting a context node contribute to the net input in the same direction,

\sigma_h^2(f) = a^4 (1-a)^2 f N^2. \qquad (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is then

\sigma_h^2(L) = a^4 (1-a)^2 L N^2.

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where p patterns have been encoded, is

\sigma_h^2(O) = \frac{f + L}{2}\, a^4 (1-a)^2 N^2 + a^3 (1-a)^2 pN. \qquad (A5)
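A small helper collecting these moments (the algebraic forms are the reconstructions given above, so treat the exact exponents as assumptions of this sketch):

```python
# Net-input moments per the reconstructed Appendix equations (A1-A3, A5);
# f = preexperimental frequency, L = list length, p = stored patterns.
def net_input_moments(f, L, a, N, p):
    mu_old = a * (1 - a) ** 2 * N                                    # Equation A1
    mu_new = 0.0                                                     # Equation A2
    var_new = a ** 3 * (1 - a) ** 2 * p * N                          # Equation A3
    var_old = (f + L) / 2 * a ** 4 * (1 - a) ** 2 * N ** 2 + var_new # Equation A5
    return mu_old, mu_new, var_old, var_new

# Higher frequency f inflates the old-item variance but leaves the means alone,
# which is the source of the frequency-based mirror effect in the theory.
print(net_input_moments(f=0, L=1, a=0.052, N=30, p=100))
print(net_input_moments(f=10, L=1, a=0.052, N=30, p=100))
```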

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on the nodes active at encoding. However, if recognition strength is instead based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where P_c(O) and P_c(N) are the proportions of correct nodes. The proportion of correct nodes is the proportion of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 6), plus the proportion of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 6 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


VARIANCE THEORY OF MIRROR EFFECT 431

Approximations of the Standard Deviation ofRecognition Strength

The standard deviation of the recognition strength isin the model calculated with Equation 7 However to fa-cilitate the understanding of this equation it is useful tomake some approximations First note that the proba-bility that a node is active (Pc) is assumed to be low Byapproximating 1 Pc to one the variance of the recog-nition strength can be simplified to

For a particular class of items the variances of the netinputs of old and new items are equal and the varianceof the recognition strength is proportional to the numberof active nodes (s 2

s micro Pc) This approximation suggests avery simple interpretation of the slope of the z-ROC Theratio of variances between new and old items is simplythe ratio between the number of nodes active in the newitems representations and the number of nodes active inthe old items representations

Or alternatively the slope of the z-ROC curve is equalto the square root of the ratio of the number of nodes ac-tive in the new items representations and the number ofnodes active in the old items representations For exam-ple if the slope of the z-ROC curve is 08 the number ofactive nodes in the new items representations divided bythe number of nodes active in the old items representa-tions is 064 (= 0802)

Another approximation useful for understanding themodel is that for two classes of items the number of ac-tive nodes in the new distribution is approximately equaland the number of active nodes in the old distributions isapproximately equal [Pc(AN) raquo Pc(BN) and Pc(BO) raquoPc(AO)] Given these approximations and the approxi-mation above (1 Pc raquo 1) the recognition strength stan-dard deviation is inversely related to the standard devia-tion of the net inputs in the following way The ratiobetween the recognition strength standard deviations ofthe diff icult and the easy distributions is equal to theratio between the standard deviations of the net inputs ofthe easy and the difficult distributions Furthermore theratio between the recognition strength standard devia-tions of the difficult and easy new distributions is equalto the ratio between the recognition strength standard de-viations of the difficult and the easy old distributionsThe exact solution predicts a slightly smaller ratio in theold than in the new distributions

This suggests that the ratio between the recognitionstrength standard deviations of the difficult class and theeasy class can be interpreted as the ratio between the

standard deviations of the net inputs of the easy class andthe difficult class

Optimizing Performance Derives the Assumptions of the Variance Theory

Arguably good memory performance is an importantfactor for selection in the evolutionary process of hu-mans and animals It is reasonable to assume that thebrain has evolved so that the performance at retrieval isoptimal or near-optimal Here it is investigated how sev-eral assumptions of the variance theory influence per-formance It is shown that several assumptions of themodel fall out as a consequence of optimizing perfor-mance in the form of discriminability between new andold items Thus if the model is implemented in a differ-ent way performance is degraded and the model doesnot account for the experimental data Examples of as-sumptions that yield a good performance in the modelare a low percentage of nodes active setting the activa-tion threshold between old and new net inputs measur-ing performance by nodes that are active in the encodingpattern and normalizing the recognition strength It isshown that an optimal performance in the network requiresthe implementation suggested by the variance theory Ifthe implementation of the variance theory is changedsignificantly the performance is degraded and the net-work would not produce the empirically found memoryphenomena

To conduct this analysis it is necessary to define ameasurement of performance A natural choice is to used cent By using the analytical equations above we find thefollowing expression

(8)

Because ss(N) sometimes is near zero it was founduseful to use the standard deviation of both the old andthe new items recognition strength ss(NO) in the de-nominator of this equation Thus Pc(NO) is equal to[Pc(N) 1 Pc(O)] 2 Pc( ) was calculated with Equation 6The expected value of the net inputs and the standard de-viation of the net inputs for new and old items were cal-culated with the equations derived in the Appendix(Equations A1 A2 and A3) Low-frequency items witha preexperimental frequency of zero were used ( f = 0)and the list length was set to one (L = 1)

The performance can be expressed by the parametersa N and p Increasing the number of nodes (N) monot-onically increases d cent and increasing the number ofstored patterns (p) monotonically decreases d cent The im-pact of these two parameters on d cent was of less impor-tance here and they were set to N = 30 and p = 100

Optimal percentage of nodes active at encodingThe solid line in Figure 9A shows the theoretical d cent as afunction of the percentage of nodes active at encoding

cent = - =-

-eacuteeumlecirc

dP P

P PN

s s

s

c c

c c

m ms( ) ( )

(

O N

(NO)

(O) (N)

(NO) (NO) 12

112

ss

ss

ss

ss

s

s

s

s

h

h

s

s

(BO)

(AO)

(B)

(A)

(A)

(B)

(BN)

(AN)pound raquo pound

s

ss

s

c

c

P

P

2

2

(N)

(O)

(N)

(O)raquo

ss

sc

h

P

N

2

2 2=

432 SIKSTROM

(a) The results show that d cent is optimal for a = 052 Thed cent is lower for larger and smaller a The lower d cent for largea occurs because the interference from other items in-crease For an a larger than the optimal value the weightchanges are distributed over a larger number of nodesand the recognition tests therefore include more noise

The lower d cent for an a less than the optimal value oc-curs because there is variability in the number of activenodes at encoding Thus for very small values of a thefluctuation between the number of nodes active in theencoded representation becomes increasingly importantThus for a small a errors are more likely to occur be-cause old items have few active nodes at encoding mak-ing it less likely that many nodes will be active at re-trieval (independently of how well they are encoded)This analysis suggests that a medium low percentage ofactive nodes at encoding yields optimal performanceThis is consistent with variance theory which requires alow activity for fitting some of the empirical data (seebelow)

There is another factor that contributes to the fact thatoptimal performance occurs when the percentage of ac-tive nodes is medium low which is that the number ofpossible representations increases with a If there is onlyone node active in all the representations there are Npossible combinations of representations if there aretwo nodes active in all the representations there are ap-proximately N 2 possible combinations of representa-tions and so forth This factor is not included in theanalyses

Optimal placement of the activation threshold Animportant property of d cent is that it is insensitive to wherethe criterion is placed Thus any criterion yields thesame performance The activation threshold (T ) may beseen as the criterion for a single node and therefore onemight intuitively think that the placement of the thresh-old is unimportant for d cent However surprisingly theplacement of the criterion becomes important in the vari-ance theory because there is a nonlinear transformationwhen the nodes are activated This nonlinearity makes d centdependent on the activation threshold in the nodes

The d cent was maximized by changing the activationthreshold (T ) and the percentage of nodes active at en-coding (a) The maximum d cent was 240 when T = 081and a = 052 Figure 9B plots d cent as a function of the ac-tivation threshold (T ) when the percentage of nodes ac-tive at encoding was f ixed at the optimal value (a =052) The results show that d cent has an optimal valuewhen the activation threshold is set between the old andthe new distributions The variance theory suggests thatthe threshold should be set to the average of the expectedold and expected new net inputs For the parameter val-ues used here this value is 071 which is near butslightly lower than the optimal value of 081 (the ex-pected value of the new net input is 0 and the expectedvalue of the old net input is 142) Note that this resultapplies when both a and T are set to the optimal value Ifa is set to a nonoptimal value the optimal value of T may

deviate significantly from the one proposed by the the-ory (eg if a = 5 the optimal value of T is much largerthan the expected value of old net inputs of 188)

This analysis emphasizes the importance of setting theactivation threshold as suggested by the variance theorySetting the activation threshold between the old and thenew expected net inputs yields not only the mirror effectbut also an optimal performance in the network Noticethat the activation threshold (T ) is constant even if thesubjectsrsquo decision criterion (C) is changed Therefore d centwill not change when the decision criterion changes Bychanging the decision criterion (rather than the recogni-tion threshold) subjects can maintain an optimal d cent fordifferent confidence levels

Optimal usage of the state of activation in the cuepattern An interesting question is how much informa-tion is carried in nodes that are active in the encoded pat-tern as compared with inactive nodes If both active andinactive nodes carry a similar amount of information itis useful to use all the nodes at retrieval However if in-active nodes carry little information in relation to thenoise performance can be improved by using only theinformation in the active nodes

The information carried in the nodes depends on theamount of weight changes which in turn depends on thepercentage of active nodes at encoding (a) For a = 12the absolute values of the weight changes are the samefor active and inactive nodes (however the signs of theweight changes are different) Thus inactive and activenodes carry the same amount of information and theperformance is optimal when information in both activeand inactive nodes is used For a small a the weightchanges are larger for active nodes (proportional to 1 a)than for inactive nodes (proportional to a) For a suffi-ciently small a the noise in inactive nodes will over-whelm the information in the weight changes so that ifthe information is combined the inactive nodes will harmthe information in the active nodes and performance isoptimal if only information from active nodes is used

The performance of the variance theory was also calculated by using the information in all the nodes. This is done by counting the number of nodes that are retrieved to the correct state of activation (i.e., the same state as that during encoding). The mathematical details of this calculation are described at the end of the Appendix. The results are shown by the dotted line in Figure 9A, using the same set of parameters as when d′ was calculated by using only active nodes, shown by the solid line. The results show that the highest d′ is found when the decision is based only on active nodes and when a is low. Including inactive nodes in the decision lowers d′. However, for a larger a (above .15 for the parameters used here), it is beneficial to base the decision on all the nodes.
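
The qualitative crossover is easy to demonstrate with a toy abstraction (this is not the paper's simulation: each node's retrieval input is modeled as its encoding signal plus Gaussian noise, and all parameter values are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def dprime(a, use_all_nodes, n_nodes=400, n_trials=2000, noise=1.0):
    n_act = int(a * n_nodes)
    enc = np.zeros(n_nodes)
    enc[:n_act] = 1.0
    # Encoding signal: proportional to 1 - a for active nodes, -a for inactive
    signal = np.where(enc == 1, 1 - a, -a)
    T = (1 - a) / 2          # threshold halfway between old/new active means
    s_old = np.empty(n_trials)
    s_new = np.empty(n_trials)
    for t in range(n_trials):
        act_old = signal + noise * rng.standard_normal(n_nodes) > T
        act_new = noise * rng.standard_normal(n_nodes) > T
        if use_all_nodes:    # count every node retrieved to its encoded state
            s_old[t] = np.sum(act_old == enc)
            s_new[t] = np.sum(act_new == enc)
        else:                # count only nodes that were active at encoding
            s_old[t] = np.sum(act_old[enc == 1])
            s_new[t] = np.sum(act_new[enc == 1])
    pooled = np.sqrt((s_old.var() + s_new.var()) / 2)
    return (s_old.mean() - s_new.mean()) / pooled

for a in (0.05, 0.5):
    print(f"a = {a}: active-only d' = {dprime(a, False):.2f}, "
          f"all-node d' = {dprime(a, True):.2f}")
```

With these assumptions, the active-only count wins at a = 0.05, and counting all nodes wins at a = 0.5, mirroring the crossover described above.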

Optimal placement of the recognition criterion for the two classes of items. The recognition criterion (C) does not affect d′, but it influences performance as measured by the hit rates and false alarm rates. Therefore, it is necessary to use another criterion for performance with respect to the placement of the recognition criterion. A natural choice for performance in this context is the probability of hits minus the probability of false alarms. This measurement corresponds to optimal performance when correct old responses and correct new responses are rewarded equally. It is easy to see that, if the standard deviations of the old and the new distributions are equal, optimal performance is found when the recognition criterion is set exactly between the distributions. For unequal standard deviations, the optimal recognition criterion is shifted from the midpoint toward the distribution with the smallest standard deviation. More exactly, the optimal recognition criterion is the point at which the old and the new distributions intersect. It is easy to see that this is true: If the recognition criterion is moved to the left of this point, the rate of increase in false alarms is larger than the rate of increase in hits, and performance suffers. If the recognition criterion is moved to the right of this point, the rate of decrease in hits is larger than the rate of decrease in false alarms, and performance also suffers (see, e.g., Figure 4D). Formally, let f [S(O)] denote the density of the recognition strength of the old distribution and f [S(N)] the density of the recognition strength of the new distribution. The ratio between these densities is called the likelihood ratio, L = f [S(O)] / f [S(N)], and optimal performance occurs when this ratio is equal to one (L = 1).
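
This intersection rule can be checked numerically with illustrative Gaussian strength distributions (the values below are assumptions, not the model's actual distributions):

```python
from scipy.stats import norm
from scipy.optimize import brentq, minimize_scalar

mu_N, sd_N = 0.0, 1.0
mu_O, sd_O = 1.0, 1.25   # old distribution wider, as in the theory

def perf(c):             # P(hit) - P(false alarm) at criterion c
    return norm.sf(c, mu_O, sd_O) - norm.sf(c, mu_N, sd_N)

# Criterion maximizing performance, found numerically
c_best = minimize_scalar(lambda c: -perf(c), bounds=(-3, 4), method="bounded").x
# Point between the means where the two densities intersect (L = 1)
c_intersect = brentq(lambda c: norm.pdf(c, mu_O, sd_O) - norm.pdf(c, mu_N, sd_N),
                     mu_N, mu_O + 0.5)
print(f"argmax = {c_best:.3f}, intersection = {c_intersect:.3f}")  # they agree
```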

In the mirror effect, there are two classes of items, each having a new and an old distribution with different standard deviations. The question of optimal performance is complicated by the possibility of using different criteria for the two classes. The performance may then vary depending on the choice of the two criteria and on additional restrictions on the overall level of confidence. For example, if one class is very easy (i.e., perfect discrimination) and one class is very difficult (i.e., no discrimination), and subjects are instructed to respond yes only when they are absolutely certain that they are correct, it may be optimal to set a very high criterion for the difficult class, so that no yes responses will be made for the difficult class, and a moderate criterion for the easy class, so that some yes responses will be made for the easy class. Therefore, any model that optimizes performance for the two classes must combine the criteria for each class so that the performance for the sum of the classes will be optimal.

Figure 9. (A) Theoretical d′ as a function of the percentage of nodes active at encoding. The solid line shows d′ when the decision is based only on nodes that are active during encoding; the dotted line shows d′ when the decision is based on all the nodes. (B) Theoretical d′ as a function of the activation threshold. The leftmost arrow points at the expected net input of the new items [μ(N)], the rightmost arrow points at the expected net input of the old items [μ(O)], and the middle arrow points at the placement of the activation threshold of the nodes. Note that the activation threshold is slightly lower than the optimal point. (C) Optimal placement of the recognition criterion for the two classes. The y-axis shows the maximum likelihood for Class B divided by the maximum likelihood for Class A; an optimal performance is found when this ratio is one. The x-axis shows the false alarm rate for Class A. The straight line shows the ratio for theoretical optimal performance, the dotted line the ratio before normalization, and the solid curved line the ratio after normalization. See the text for details. (D) The advantage of normalization for different recognition criteria. The y-axis shows the total hit rate after normalization minus the total hit rate before normalization, as a function of the total false alarm rate on the x-axis. See the text for details.

This problem may be stated formally as follows: Given two classes (A and B), with a fixed total false alarm rate [P(AN) + P(BN)], how should the recognition criteria for the two classes [T(A) and T(B)] be chosen so that the total hit rate [P(AO) + P(BO)] is maximized? The solution to this problem is surprisingly simple. Optimal performance occurs when the likelihood ratios of the two classes are equal:

L(A) = f [S(AO)] / f [S(AN)] = f [S(BO)] / f [S(BN)] = L(B).

It is easy to see that this criterion must be satisfied for optimal performance, because any shift away from this point diminishes performance. For example, if the recognition criterion for Class A is lowered, the recognition criterion for Class B must be raised to keep the total false alarm rate constant; according to the formulation of the problem, the changes in the two false alarm rates must cancel, ΔP(AN) + ΔP(BN) = 0. The likelihood ratios are monotonically increasing functions of the recognition criteria; therefore, this change produces L(A) = f [S(AO)] / f [S(AN)] < f [S(BO)] / f [S(BN)] = L(B). Because a small shift of a criterion changes that class's hit rate by L times the change in its false alarm rate, the net change in the total hit rate is ΔP(AO) + ΔP(BO) = [L(A) − L(B)] ΔP(AN) < 0. Thus, any movement of the criteria away from L(A) = L(B) results in an overall decrease in the hit rate, and performance suffers.
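
A brute-force check of this equal-likelihood-ratio condition, with illustrative Gaussian classes (all values assumed): holding the total false alarm rate fixed, the criterion pair that maximizes total hits ends up with matching likelihood ratios.

```python
import numpy as np
from scipy.stats import norm

muA, sdA = 1.5, 1.3    # easy class: old distribution; new is N(0, 1)
muB, sdB = 0.8, 1.15   # difficult class

target_fa = 0.3        # fixed total false alarm rate, P(AN) + P(BN)
best = None
for faA in np.linspace(0.01, target_fa - 0.01, 500):
    faB = target_fa - faA
    cA, cB = norm.isf(faA), norm.isf(faB)   # criteria implied by the two FAs
    hits = norm.sf(cA, muA, sdA) + norm.sf(cB, muB, sdB)
    if best is None or hits > best[0]:
        best = (hits, cA, cB)

_, cA, cB = best
LA = norm.pdf(cA, muA, sdA) / norm.pdf(cA)
LB = norm.pdf(cB, muB, sdB) / norm.pdf(cB)
print(f"L(A) = {LA:.2f}, L(B) = {LB:.2f}")   # approximately equal at the optimum
```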

Note that the variance theory has only one overall recognition criterion. However, the theory (and any theory of the mirror effect) specifies how this criterion changes when it is moved over the two classes. Thus, the second criterion can indirectly be inferred from the formulation of the theory. In the variance theory, this is done by the normalization factor, which scales the recognition strength differently between the two classes of words.

The intriguing question here is whether the variance theory is optimal, or almost optimal, in terms of the placement of the recognition criterion for the two classes. Figure 9C plots the maximum likelihood of Class B divided by the maximum likelihood of Class A [L(B)/L(A)] on the y-axis. The x-axis shows the probability of false alarms for Class A when the recognition criteria are changed. The optimal ratio of the maximum likelihoods on the y-axis is one, and it is plotted as the dotted straight line. The results before normalization (i.e., counting the number of nodes above the recognition criterion) are plotted as the dotted, monotonically increasing line. The results after normalization (i.e., the percentage of active nodes, minus the expected percentage of active nodes, divided by the standard deviation of the net input) are plotted as the solid line.
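
In symbols, with p denoting the proportion of active nodes and σh′ the standard deviation of the item's net input, the normalized recognition strength just described reads (this is simply a rendering of the parenthetical description above):

$$S \;=\; \frac{p - \mathbb{E}[p]}{\sigma_{h'}}.$$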

The figure clearly shows that performance before normalization is far from optimal. For a conservative recognition criterion, or low false alarm rates, the maximum likelihood for Class A is larger than the maximum likelihood for Class B. Therefore, a more liberal criterion for Class B and a stricter criterion for Class A would be more advantageous. For a liberal recognition criterion, or a high false alarm rate, the maximum likelihood for Class B is larger than the maximum likelihood for Class A. Therefore, a more liberal criterion for Class A and a stricter criterion for Class B would be beneficial. The only point at which the performance is optimal is when the recognition criterion is unbiased. At this point [around P(AN) = .25], the maximum-likelihood ratio is one.

Normalization improves performance significantly, so that the maximum-likelihood ratio is one, or almost one, for a range of criteria. For false alarm rates larger than 0.15 and smaller than 0.60, the ratio is within two percentage points of one. The maximum likelihood for Class A is larger than that for Class B for false alarm rates less than 0.15 and for false alarm rates larger than 0.60. Thus, there is still some deviation from optimal performance even after normalization. However, the maximum-likelihood ratio is closer to the optimal value for all false alarm rates after normalization than before normalization. Arguably, normalization improves performance sufficiently that performance may be described as being near an optimal value for a wide range of recognition criteria.

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to Class B was set to 3 (rather than 1.56), in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in Class A is equal to the number of items in Class B. The total hit rate is equal to the average hit rate of Class A and Class B, and the total false alarm rate is equal to the average false alarm rate in Class A and Class B.

For all recognition criteria, or for all false alarm rates, performance after normalization is better than, or equal to, performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately 0.05, and it occurs when the false alarm rate is approximately 0.05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around 0.01, and it occurs when the false alarm rate is 0.70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all of these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low for producing appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition-strength variance should be smaller than the new recognition-strength variance, because the binomial variance Pc(1 − Pc)N decreases as Pc grows beyond .50 (see the short check following this list). This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.
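
The variance claim in point 3 reduces to the shape of the binomial variance (a quick check; N arbitrary):

```python
# N*Pc*(1 - Pc) peaks at Pc = .5 and falls beyond it, so Pc(O) > Pc(N) > .5
# would force Var(old) < Var(new), the reverse of the empirical ordering.
N = 100
for pc in (0.55, 0.65, 0.80):
    print(f"Pc = {pc:.2f}  N*Pc*(1-Pc) = {N * pc * (1 - pc):.1f}")
```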

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items, because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class, because the recognition strength is normalized with the standard deviation of the net input.

There are several reasons why the variance theory is interesting. First, the theory is extremely simple: The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions, because more nodes are activated in old items. The standard deviations of the easy-class distributions are predicted to be larger than the standard deviations of the difficult-class distributions, because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σS(BN) < σS(BO) < σS(AN) < σS(AO) and σS(BN) < σS(AN) < σS(BO) < σS(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σS(BO) < σS(BN) < σS(AN) < σS(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune from this problem, because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item–context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or at another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; for nonwords, however, it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition, but it may change between conditions (for example, when study time is changed). Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes–no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σh′), so the subject does not need to know the class, or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.

The subjective-likelihood theory uses the feature content of the items for addressing issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses a variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland and Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its application. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neurosciences of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention–likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representation by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM–retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural networks model. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

$$\frac{a\,\sigma_p(N)}{\sigma_p(N) + \sigma_p(O)}.$$

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the z-ROC curve is 0.8 (since 0.8/1.8 ≈ 0.44). It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξi = ξj = 1:

$$\mu_h(O) = \sum_j \Delta w_{ij}\,\xi_j = \sum_j (\xi_i - a)(\xi_j - a)\,\xi_j = aN(1-a)^2. \tag{A1}$$

The expected value of the net input for new items is zero:

$$\mu_h(N) = 0. \tag{A2}$$

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or if a new item is presented, the variance of the net input is

$$\sigma_h^2(N) = \operatorname{Var}\Bigl(\sum_j w_{ij}\,\xi_j\Bigr) = aN \cdot P\,[a(1-a)]^2 = PNa^3(1-a)^2. \tag{A3}$$

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item ( f ). All the aN weights supporting a context node contribute to the net input in the same direction:

$$\sigma_h^2(f) = f\,[aN(1-a)]^2\,a(1-a) = f\,a^3(1-a)^3N^2. \tag{A4}$$

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list, and let the contexts of different lists be uncorrelated. The variance of the net input to an item node is then

$$\sigma_h^2(L) = L\,a^3(1-a)^3N^2.$$

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where P patterns have been encoded, is

$$\sigma_h^2(O) = \tfrac{1}{2}(f + L)\,a^3(1-a)^3N^2 + PNa^3(1-a)^2. \tag{A5}$$
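
Because the equation bodies above are reconstructions from a damaged source, a direct transcription may help in sanity-checking them against simulation; treat the exact exponents as provisional.

```python
# Transcription of Equations A1-A5 as reconstructed above (provisional forms).
def mu_old(a, N):            # (A1) expected old net input, active node
    return a * N * (1 - a) ** 2

def var_new(a, N, P):        # (A3) new item / uncorrelated patterns
    return P * N * a ** 3 * (1 - a) ** 2

def var_freq(a, N, f):       # (A4) item encoded with frequency f
    return f * a ** 3 * (1 - a) ** 3 * N ** 2

def var_list(a, N, L):       # context associated with L items (unnumbered eq.)
    return L * a ** 3 * (1 - a) ** 3 * N ** 2

def var_old(a, N, P, f, L):  # (A5) half context nodes, half item nodes
    return 0.5 * (var_freq(a, N, f) + var_list(a, N, L)) + var_new(a, N, P)

print(mu_old(0.52, 12))      # ~1.4 with an assumed N = 12; the text quotes 1.42
```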

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is instead based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are based on the number of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual, with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of the old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)

s h f L a N p a a N2 3 22 1O( ) = +( ) + -( )

s h f a a LN2 4 2 21( ) = -( )

s x x x xh ij jj

N

i j ji

Nf f f a a

a a f N

22 2

4 2 21

( ) = waring aringaelig

egraveccedilouml

oslashdivide= -( ) -( )eacute

eumlecirc

= -( )

s x x x xh ij jj

N

i j j

P

i

NN a a

a a a a a a a

a a a a PN a a PN

2 2 2

2

3

1 1 2 0 1

0 0 1 1

( ) = w

= [( ) ( )] + [( )(1- )] ( )

+ [( )( )] ( ) = ( )

2 2

2 2 2

( ) = -( ) -( ) - - - -

- - - -

aring aringaring

m x x x xh ijj

N

j i j jj

Na a a a N(O) = wDaring aring= -( ) -( ) = -( )1

2

ss s

p

p p

a(N)

( )N (O)+

432 SIKSTROM

(a) The results show that d cent is optimal for a = 052 Thed cent is lower for larger and smaller a The lower d cent for largea occurs because the interference from other items in-crease For an a larger than the optimal value the weightchanges are distributed over a larger number of nodesand the recognition tests therefore include more noise

The lower d cent for an a less than the optimal value oc-curs because there is variability in the number of activenodes at encoding Thus for very small values of a thefluctuation between the number of nodes active in theencoded representation becomes increasingly importantThus for a small a errors are more likely to occur be-cause old items have few active nodes at encoding mak-ing it less likely that many nodes will be active at re-trieval (independently of how well they are encoded)This analysis suggests that a medium low percentage ofactive nodes at encoding yields optimal performanceThis is consistent with variance theory which requires alow activity for fitting some of the empirical data (seebelow)

There is another factor that contributes to the fact thatoptimal performance occurs when the percentage of ac-tive nodes is medium low which is that the number ofpossible representations increases with a If there is onlyone node active in all the representations there are Npossible combinations of representations if there aretwo nodes active in all the representations there are ap-proximately N 2 possible combinations of representa-tions and so forth This factor is not included in theanalyses

Optimal placement of the activation threshold Animportant property of d cent is that it is insensitive to wherethe criterion is placed Thus any criterion yields thesame performance The activation threshold (T ) may beseen as the criterion for a single node and therefore onemight intuitively think that the placement of the thresh-old is unimportant for d cent However surprisingly theplacement of the criterion becomes important in the vari-ance theory because there is a nonlinear transformationwhen the nodes are activated This nonlinearity makes d centdependent on the activation threshold in the nodes

The d cent was maximized by changing the activationthreshold (T ) and the percentage of nodes active at en-coding (a) The maximum d cent was 240 when T = 081and a = 052 Figure 9B plots d cent as a function of the ac-tivation threshold (T ) when the percentage of nodes ac-tive at encoding was f ixed at the optimal value (a =052) The results show that d cent has an optimal valuewhen the activation threshold is set between the old andthe new distributions The variance theory suggests thatthe threshold should be set to the average of the expectedold and expected new net inputs For the parameter val-ues used here this value is 071 which is near butslightly lower than the optimal value of 081 (the ex-pected value of the new net input is 0 and the expectedvalue of the old net input is 142) Note that this resultapplies when both a and T are set to the optimal value Ifa is set to a nonoptimal value the optimal value of T may

deviate significantly from the one proposed by the the-ory (eg if a = 5 the optimal value of T is much largerthan the expected value of old net inputs of 188)

This analysis emphasizes the importance of setting theactivation threshold as suggested by the variance theorySetting the activation threshold between the old and thenew expected net inputs yields not only the mirror effectbut also an optimal performance in the network Noticethat the activation threshold (T ) is constant even if thesubjectsrsquo decision criterion (C) is changed Therefore d centwill not change when the decision criterion changes Bychanging the decision criterion (rather than the recogni-tion threshold) subjects can maintain an optimal d cent fordifferent confidence levels

Optimal usage of the state of activation in the cuepattern An interesting question is how much informa-tion is carried in nodes that are active in the encoded pat-tern as compared with inactive nodes If both active andinactive nodes carry a similar amount of information itis useful to use all the nodes at retrieval However if in-active nodes carry little information in relation to thenoise performance can be improved by using only theinformation in the active nodes

The information carried in the nodes depends on theamount of weight changes which in turn depends on thepercentage of active nodes at encoding (a) For a = 12the absolute values of the weight changes are the samefor active and inactive nodes (however the signs of theweight changes are different) Thus inactive and activenodes carry the same amount of information and theperformance is optimal when information in both activeand inactive nodes is used For a small a the weightchanges are larger for active nodes (proportional to 1 a)than for inactive nodes (proportional to a) For a suffi-ciently small a the noise in inactive nodes will over-whelm the information in the weight changes so that ifthe information is combined the inactive nodes will harmthe information in the active nodes and performance isoptimal if only information from active nodes is used

The performance of the variance theory was calcu-lated by using the information in all the nodes This isdone by counting the number of nodes that are retrievedto the correct state of activation (ie the same state asthat during encoding) The mathematical details of thiscalculation are described at the end of the Appendix Theresults are shown by the dotted line in Figure 9A usingthe same set of parameters as when d cent was calculated byusing only active nodes shown by the solid line The re-sults show that the highest d cent is found when the decisionis based only on active nodes and when a is low Includ-ing inactive nodes in decision lowers d cent However for alarger a (above 15 for the parameters used here) it isbeneficial to base the decision on all the nodes

Optimal placement of the recognition criterion forthe two classes of items The recognition criterion (C)does not affect d cent but it influences performance as mea-sured by the hit rates and false alarm rates Therefore itis necessary to use another criterion for performance

VARIANCE THEORY OF MIRROR EFFECT 433

with respect to the placement of the recognition crite-rion A natural choice for performance in this context isthe probability of hits minus the probability of falsealarms This measurement corresponds to optimal per-formance when old correct responses and new correctnew responses are rewarded equally It is easy to see thatif the standard deviations of the old and the new distri-butions are equal the optimal performance will be foundif the recognition criterion is set exactly between the dis-tributions For unequal standard deviations the optimalrecognition criterion is shifted from the midpoint towardthe distribution with the smallest standard deviationMore exactly the optimal recognition criterion is thepoint at which the old and the new distributions inter-sect It is easy to see that this is true because if the recog-nition criterion is moved to the left of this point the rateof increase in false alarms is larger than the rate of in-crease in hits and performance suffers If the recognition

criterion is moved to the right of this point the rate of de-crease in hits is larger than the rate of decrease in falsealarms and performance also suffers (see eg Figure 4D)Formally f [S(O)] denotes the density of recognitionstrength of the old distribution and f [S(N)] the densityof the recognition strength of the new distribution Theratio between these variables is called the likelihoodratio L = f [S(O)] f [S(N)] and the optimal performanceoccurs when this ratio is equal to one (L = 1)

In the mirror effect there are two classes of itemseach having a new and an old distribution with differentstandard deviations The question of optimal perfor-mance is complicated by the possibility of using differ-ent criteria for the two classes The performance maythen vary depending on the choice of the two criteria andon additional restrictions on the overall level of confi-dence For example if one class is very easy (ie perfectdiscrimination) and one class is very difficult (ie no

Figure 9 (A) Theoretical d cent as a function of percentage of nodes active at encoding The solid line shows the d cent as a function of per-centage of nodes active at encoding when the decision is based only on nodes that are active during encoding The dotted line showsd cent when the decision is based on all the nodes (B) Theoretical d cent as a function of the activation threshold The leftmost arrow pointsat the expected net input of the new items [m(N)] the rightmost arrow points at the expected net input of the old items [m(O)] and themiddle arrow points at the point at the placement of the activation threshold of the nodes Note that the activation threshold is slightlylower than the optimal point (C) Optimal placement of the recognition criterion for the two classes The y-axis shows the maximumlikelihood for Class B divided by the maximum likelihood for Class A An optimal performance is found when this ratio is one Thex-axis shows the false alarm rate for Class A The straight line shows the ratio for theoretical optimal performance the dotted line theratio before normalization and the solid curved line the ratio after normalization See the text for details (D) The advantage of nor-malization for different recognition criteria The y-axis shows the total hit rate after normalization minus the total hit rate before nor-malization as a function of the total false alarm rate on the x-axis See the text for details

434 SIKSTROM

discrimination) and subjects are instructed to respondyes only when they are absolutely certain that they arecorrect it may be optimal to set a very high criterion forthe difficult class so that no yes responses will be madefor the difficult class and a moderate criterion for theeasy class so that some yes responses will be made forthe easy class Therefore any model that optimizes per-formance for the two classes must combine the criteriafor each class so that the performance for the sum of theclasses will be optimal

This problem may formally be stated as follows Giventwo classes (A and B) with a fixed total false alarm rate[P(AN) + P(BN)] how should the recognition criteriafor the two classes [T(A) and T(B)] be chosen so that thehit rates are maximized [P(AO) + P(BO)] The solutionto this problem is surprisingly simple The optimal perfor-mance occurs when the placements of the maximum like-lihoods of the two classes are equal

L(A) = f [S(AO)] f [S(AN)] = L(B)

= f [S(BO)] f [S(BN)]

It is easy to see that this criterion must be satisfied foroptimal performance because any shift from this pointdiminishes performance For example if the recognitionthreshold for class A is diminished the recognition cri-terion for class B must be increased to keep the totalfalse alarm rate constant According to the formulationof the problem the change in total false alarm rates mustbe equal f f [S(BN) = 0] The maximum-likelihood ra-tios are monotonically increasing functions of the recog-nition criteria therefore L(A) L(B) lt 0 when the recog-nition criteria are changed as specified above ThusL(A) = f [S(AO)] f [S(AN)] lt f [S(BO)] f [S(BN)] =L(B) or f [S(AO)] f [S(BO)] lt 0 This shows that thechange in the placement of the criteria from L(A) = L(B)results in an overall decrease in hit ratemdash( f [S(AO)] f [S(BO)] lt 0)mdashand performance suffers

Note that the variance theory only has one overallrecognition criterion However the theory and any the-ory of the mirror effect specifies how this criterionchanges when it is moved over the two classes Thus thesecond criterion can indirectly be inferred from the for-mulation of the theory This is done in the variance the-ory by the normalization factor that scales the recogni-tion differently between the two classes of words

The intriguing question here is whether the variancetheory is optimal or almost optimal in terms of place-ment of the recognition criterion for the two classes Fig-ure 9C plots the maximum likelihood of class B dividedby the maximum likelihood of class A [L(B)L(A)] onthe y-axis The x-axis shows the probability of false alarmsfor class A when the recognition criteria are changedThe optimal ratio of the maximum likelihood on the y-axisis one and it is plotted as the dotted straight line The re-sults before normalization (ie by counting the numberof nodes above the recognition criterion) are plotted in

the dotted monotonically increasing line The resultsafter normalization (ie the percentage of active nodesminus the expected number of active nodes divided bythe standard deviation of the net input) are plotted as thesolid line

The figure clearly shows that performance before nor-malization is far from optimal For a conservative recog-nition criterion or low false alarm rates the maximumlikelihood for class A is larger than the maximum likeli-hood for class B Therefore a more liberal criterion forclass B and a more strict criterion for class A would bemore advantageous For a liberal recognition criterionor a high false alarm rate the maximum likelihood forclass B is larger than the maximum likelihood for classA Therefore a more liberal criterion for Class A and astricter criterion for Class B would be beneficial Theonly point at which the performance is optimal is whenthe recognition criterion is unbiased At this point [aroundP(AN) = 25] the maximum-likelihood ratio is one

Normalization improves performance significantly sothe maximum-likelihood ratio is one or almost one fora range of criteria For false alarm rates larger than 015and smaller than 060 the ratio is within two percentagepoints of one The maximum likelihood for Class A islarger than that for Class B for false alarm rates less than015 and for false alarm rates larger than 060 Thus thereis still some deviation from optimal performance evenafter normalization However the maximum-likelihoodratio is closer to the optimal value for all false alarmrates after normalization than before normalization Ar-guably normalization improves performance sufficientlyso that performance may be described as being near anoptimal value for a wide range of recognition criteria

Overall performance after normalization can be di-rectly compared with performance before normalizationFigure 9D plots the total hit rate after normalizationminus the total hit rate before normalization for differenttotal false alarm rates In this figure the standard devia-tion of the net inputs to Class B was set to a 3 (rather than156) in order to make the difference between perfor-mance before and after normalization more salient Allother parameters were identical to those in Figure 4DFurthermore it is assumed that the number of items inClass A is equal to the number of items in Class B Thetotal hit rate is equal to the average hit rate of Class Aand Class B and the total false alarm rate is equal to theaverage false alarm rate in Class A and Class B

For all recognition criteria or for all false alarm ratesperformance is better or equal after normalization ascompared with performance before normalization Foran unbiased recognition criterion or a slightly largerrecognition criterion performance is approximatelyequal before and after normalization [ie for 25 lt P(N) lt40] For conservative recognition criteria [P(N) lt 25]there is a large advantage for normalized performanceover nonnormalized performance The largest advantageis approximately 005 and it occurs when the false alarm

VARIANCE THEORY OF MIRROR EFFECT 435

rate is approximately 005 For liberal recognition crite-ria [P(N) gt 40] there is also an advantage for normal-ized performance The largest advantage is around 001and it occurs when the false alarm rate is 070 The ad-vantage for liberal criteria is smaller than the advantagefor conservative criteria because of a ceiling effect forlarge false alarm rates and large hit rates

A Nonoptimal Network is Inconsistent With Empirical Data of Recognition Memory

To summarize it has been shown that performance isoptimal when (1) the percentage of nodes active at en-coding is low (2) the activation threshold is set betweenthe new and the old distributions (3) only nodes activeat encoding are used for measuring the recognitionstrength and (4) the recognition strength is normalizedwith the standard deviation of the net input It is inter-esting to note that all these conditions are necessary forproducing the empirically found memory phenomena

1 The percentage of active nodes has to be low forproducing appropriate standard deviations for the oldand the new recognition distributions If the percentageof active nodes is too high the standard deviation of theold distribution will approach the standard deviation ofthe new distribution

2 The model predicts the mirror effect only if the ac-tivation threshold is set between the old and the new dis-tributions If the recognition threshold is larger than theexpected value of the net input of the old distributionthe hit rate of the easy class will be less than the hit rateof the difficult distribution Similarly if the recognitionthreshold is lower than the expected new net input thefalse alarm of the easy class will be larger than the falsealarm rate of the difficult class This is inconsistent withthe empirical data for unbiased responses

3 Assume that the recognition strength is based on allthe nodes (ie not only nodes inactive during encod-ing)mdashfor example by counting the number of nodes inthe correct state of activation This measurement ofrecognition strength will have at least 50 of the nodesin the correct state (ie Pc gt 50) even if the subjectswere merely guessing on new items This would lead tothe incorrect prediction that the old recognition strengthvariance should be smaller than the new recognitionstrength variance because Pc(1 Pc)N decreases for Pcover 50 This is inconsistent with the finding that thevariance of the old distribution is larger than the varianceof the new distribution

4 If the recognition strength is not normalized withthe net input the variance of the recognition strength ofthe new easy class (AN) will be smaller than the varianceof the recognition strength of the new difficult class (BNsee Figure 4C) This is inconsistent with the empirical data

This analysis indicates that several recognition mem-ory phenomena fall out as a consequence of optimizingperformance in the network If the model is optimized interms of performance the model reproduces the empir-ical data If the model is made nonoptimal the model

does not reproduce the empirical data Arguably thehuman brain has evolved to optimize storage capacityand therefore these memory phenomena occur

GENERAL DISCUSSION

This paper has suggested a variance theory for themirror effect that also applies to the ROC curves Themodel is remarkably simple It has been shown that atwo-layer recurrent network where one layer representscontext and one layer represents items shows these phe-nomena if performance is measured by counting thenumber of nodes active at recognition in a way that opti-mizes performance The structure of the theory guaran-tees that high-frequency items have a larger variance inthe net inputs than do low-frequency items because en-coding the same item in different contexts increases thevariance whereas the expected net inputs are the sameThe theory predicts the mirror effect when the sameamount of attention is paid to both classes of stimuli Thestandard deviation of the recognition strength is largerfor old than for new items because more nodes are activein old items The standard deviation for the easy class islarger than the standard deviation for the difficult classbecause the recognition strength is normalized with thestandard deviation for the net input

There are several reasons why the variance theory isinteresting First the theory is extremely simple Theonly necessary assumptions are that recognition is basedon recurrent associations between contexts and itemsand performance is measured by counting the number ofnodes in an optimal way Second these assumptions areconsistent with what is known about how the brain worksTherefore the model is biologically plausible Third themodel accounts for a large amount of data including themirror effect exceptions from the mirror effect ROCcurves list-length effects and so on Fourth the modelfits the empirical data well Fifth it is easy to implementthe model in a connectionist network

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and the standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions because more nodes are activated in the old distributions. The standard deviations of the easy class distributions are predicted to be larger than the standard deviations of the difficult class distributions because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.
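This prediction can be illustrated with hypothetical net-input values (the numbers below are invented for the example): raising the expected old net input, as longer study time is assumed to do, drags the activation threshold up with it, so the probability that a node survives the threshold stays constant for old items but drops for new items.

```python
from scipy.stats import norm

sigma = 9.0                        # hypothetical net-input standard deviation
for mu_old in (9.0, 13.5, 18.0):   # expected old net input growing with study time
    theta = mu_old                 # the activation threshold tracks the old mean
    p_old = norm.sf((theta - mu_old) / sigma)  # stays at .50: hits preserved
    p_new = norm.sf((theta - 0.0) / sigma)     # falls: fewer false alarms
    print(f"mu_old = {mu_old:5.1f}  p_old = {p_old:.2f}  p_new = {p_new:.3f}")
# p_new drops from .159 to .067 to .023 while p_old stays at .50
```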

The variance theory has no lateral connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental, learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune to this problem because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item-context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; however, for nonwords it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. These wrongly activated nodes are more likely to represent high-frequency features, because the deblurring process is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class with a higher variability than that for words in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point, it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood theory's account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length or the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σh′), so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.
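One way to spell out the "two z-transformations" is sketched below; this is a hedged reconstruction consistent with the description above, not the paper's exact equations, and the numeric values are hypothetical. The first z-transformation gives the probability q that a counted node exceeds the activation threshold; the second treats the count of such nodes as approximately binomial over the aN counted nodes, hence approximately normal, and gives the probability that the resulting strength exceeds the recognition criterion.

```python
import numpy as np
from scipy.stats import norm

a, N = 0.05, 2000            # activity level and node count (assumed)
sigma_h = 9.0                # standard deviation of the net input (hypothetical)
mu_old, mu_new = 4.5, 0.0    # expected old and new net inputs (hypothetical)
theta = mu_old               # activation threshold at the expected old net input

def p_yes(mu, criterion=0.4):
    q = norm.sf((theta - mu) / sigma_h)   # first z-transformation: node level
    se = np.sqrt(q * (1 - q) / (a * N))   # s.e. of the proportion over aN nodes
    return norm.sf((criterion - q) / se)  # second z-transformation: strength level

print(f"hit rate ~ {p_yes(mu_old):.3f}, false alarm rate ~ {p_yes(mu_new):.3f}")
```

With these illustrative numbers, the hit rate comes out near .98 and the false alarm rate near .02; only the two normal probabilities are ever computed.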

The subjective-likelihood theory uses the feature content of the items for addressing issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the standard deviation of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes that model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the subjective-likelihood model in a connectionist network (however, see McClelland & Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University (ISBN 91-7191-155-3).

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new items [σp(N)] and old items [σp(O)] are equal. However, if the standard deviations are different, the expected number of active nodes will be

aσp(N) / [σp(N) + σp(O)].

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the z-ROC curve is 0.8 [i.e., 0.8a/(1 + 0.8) ≈ 0.44a]. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input solves the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξi = ξj = 1:

μh(O) = Σj Δwij ξj = Σj (ξi − a)(ξj − a)ξj = a(1 − a)²N. (A1)

The expected value of the net inputs for the new items is zero:

μh(N) = 0. (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

σh²(N) = Σj Σμ ⟨(Δwij ξj)²⟩ = aPN [a(1 − a)]² = a³(1 − a)²PN, (A3)

since the sum runs over the aN active cue nodes and the P stored patterns, and ⟨(ξi − a)²(ξj − a)²⟩ = [a(1 − a)]² for uncorrelated patterns.

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item ( f ). All the aN weights supporting a context node contribute to the net input in the same direction, so the sum over the item layer is coherent, and the f encodings in different contexts add independent variance:

σh²( f ) = ⟨[aN(1 − a) Σμ (ξi − a)]²⟩ = f a³(1 − a)³N². (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

σh²(L) = L a³(1 − a)³N².

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where p patterns have been encoded, is

σh²(O) = (f + L) a³(1 − a)³N²/2 + a³(1 − a)²pN. (A5)

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on the nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).
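The closed forms above can be spot-checked numerically. The following sketch uses illustrative parameters and the same assumptions as the rest of this Appendix (uncorrelated binary patterns with activity level a, covariance learning rule); it compares the simulated mean and variance of the net input with Equations A1-A3.

```python
import numpy as np

rng = np.random.default_rng(1)
N, a, P = 4000, 0.05, 10   # nodes, activity level, stored patterns (illustrative)

xi = (rng.random((P, N)) < a).astype(float)   # P uncorrelated binary patterns
W = (xi - a).T @ (xi - a)                     # covariance weight matrix
np.fill_diagonal(W, 0.0)

new = (rng.random(N) < a).astype(float)       # an unstudied pattern
h_old = (W @ xi[0])[xi[0] > 0]                # net inputs to nodes active in an old item
h_new = (W @ new)[new > 0]                    # net inputs to nodes active in a new item

print("A1:", h_old.mean(), "theory:", a * (1 - a) ** 2 * N)
print("A2:", h_new.mean(), "theory:", 0.0)
print("A3:", h_new.var(), "theory:", a ** 3 * (1 - a) ** 2 * P * N)
```

The simulated values agree with the theoretical ones up to sampling error in the number of active nodes.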

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


VARIANCE THEORY OF MIRROR EFFECT 433

with respect to the placement of the recognition crite-rion A natural choice for performance in this context isthe probability of hits minus the probability of falsealarms This measurement corresponds to optimal per-formance when old correct responses and new correctnew responses are rewarded equally It is easy to see thatif the standard deviations of the old and the new distri-butions are equal the optimal performance will be foundif the recognition criterion is set exactly between the dis-tributions For unequal standard deviations the optimalrecognition criterion is shifted from the midpoint towardthe distribution with the smallest standard deviationMore exactly the optimal recognition criterion is thepoint at which the old and the new distributions inter-sect It is easy to see that this is true because if the recog-nition criterion is moved to the left of this point the rateof increase in false alarms is larger than the rate of in-crease in hits and performance suffers If the recognition

criterion is moved to the right of this point the rate of de-crease in hits is larger than the rate of decrease in falsealarms and performance also suffers (see eg Figure 4D)Formally f [S(O)] denotes the density of recognitionstrength of the old distribution and f [S(N)] the densityof the recognition strength of the new distribution Theratio between these variables is called the likelihoodratio L = f [S(O)] f [S(N)] and the optimal performanceoccurs when this ratio is equal to one (L = 1)

In the mirror effect there are two classes of itemseach having a new and an old distribution with differentstandard deviations The question of optimal perfor-mance is complicated by the possibility of using differ-ent criteria for the two classes The performance maythen vary depending on the choice of the two criteria andon additional restrictions on the overall level of confi-dence For example if one class is very easy (ie perfectdiscrimination) and one class is very difficult (ie no

Figure 9 (A) Theoretical d cent as a function of percentage of nodes active at encoding The solid line shows the d cent as a function of per-centage of nodes active at encoding when the decision is based only on nodes that are active during encoding The dotted line showsd cent when the decision is based on all the nodes (B) Theoretical d cent as a function of the activation threshold The leftmost arrow pointsat the expected net input of the new items [m(N)] the rightmost arrow points at the expected net input of the old items [m(O)] and themiddle arrow points at the point at the placement of the activation threshold of the nodes Note that the activation threshold is slightlylower than the optimal point (C) Optimal placement of the recognition criterion for the two classes The y-axis shows the maximumlikelihood for Class B divided by the maximum likelihood for Class A An optimal performance is found when this ratio is one Thex-axis shows the false alarm rate for Class A The straight line shows the ratio for theoretical optimal performance the dotted line theratio before normalization and the solid curved line the ratio after normalization See the text for details (D) The advantage of nor-malization for different recognition criteria The y-axis shows the total hit rate after normalization minus the total hit rate before nor-malization as a function of the total false alarm rate on the x-axis See the text for details

434 SIKSTROM

discrimination) and subjects are instructed to respondyes only when they are absolutely certain that they arecorrect it may be optimal to set a very high criterion forthe difficult class so that no yes responses will be madefor the difficult class and a moderate criterion for theeasy class so that some yes responses will be made forthe easy class Therefore any model that optimizes per-formance for the two classes must combine the criteriafor each class so that the performance for the sum of theclasses will be optimal

This problem may formally be stated as follows Giventwo classes (A and B) with a fixed total false alarm rate[P(AN) + P(BN)] how should the recognition criteriafor the two classes [T(A) and T(B)] be chosen so that thehit rates are maximized [P(AO) + P(BO)] The solutionto this problem is surprisingly simple The optimal perfor-mance occurs when the placements of the maximum like-lihoods of the two classes are equal

L(A) = f [S(AO)] f [S(AN)] = L(B)

= f [S(BO)] f [S(BN)]

It is easy to see that this criterion must be satisfied foroptimal performance because any shift from this pointdiminishes performance For example if the recognitionthreshold for class A is diminished the recognition cri-terion for class B must be increased to keep the totalfalse alarm rate constant According to the formulationof the problem the change in total false alarm rates mustbe equal f f [S(BN) = 0] The maximum-likelihood ra-tios are monotonically increasing functions of the recog-nition criteria therefore L(A) L(B) lt 0 when the recog-nition criteria are changed as specified above ThusL(A) = f [S(AO)] f [S(AN)] lt f [S(BO)] f [S(BN)] =L(B) or f [S(AO)] f [S(BO)] lt 0 This shows that thechange in the placement of the criteria from L(A) = L(B)results in an overall decrease in hit ratemdash( f [S(AO)] f [S(BO)] lt 0)mdashand performance suffers

Note that the variance theory only has one overallrecognition criterion However the theory and any the-ory of the mirror effect specifies how this criterionchanges when it is moved over the two classes Thus thesecond criterion can indirectly be inferred from the for-mulation of the theory This is done in the variance the-ory by the normalization factor that scales the recogni-tion differently between the two classes of words

The intriguing question here is whether the variancetheory is optimal or almost optimal in terms of place-ment of the recognition criterion for the two classes Fig-ure 9C plots the maximum likelihood of class B dividedby the maximum likelihood of class A [L(B)L(A)] onthe y-axis The x-axis shows the probability of false alarmsfor class A when the recognition criteria are changedThe optimal ratio of the maximum likelihood on the y-axisis one and it is plotted as the dotted straight line The re-sults before normalization (ie by counting the numberof nodes above the recognition criterion) are plotted in

the dotted monotonically increasing line The resultsafter normalization (ie the percentage of active nodesminus the expected number of active nodes divided bythe standard deviation of the net input) are plotted as thesolid line

The figure clearly shows that performance before nor-malization is far from optimal For a conservative recog-nition criterion or low false alarm rates the maximumlikelihood for class A is larger than the maximum likeli-hood for class B Therefore a more liberal criterion forclass B and a more strict criterion for class A would bemore advantageous For a liberal recognition criterionor a high false alarm rate the maximum likelihood forclass B is larger than the maximum likelihood for classA Therefore a more liberal criterion for Class A and astricter criterion for Class B would be beneficial Theonly point at which the performance is optimal is whenthe recognition criterion is unbiased At this point [aroundP(AN) = 25] the maximum-likelihood ratio is one

Normalization improves performance significantly sothe maximum-likelihood ratio is one or almost one fora range of criteria For false alarm rates larger than 015and smaller than 060 the ratio is within two percentagepoints of one The maximum likelihood for Class A islarger than that for Class B for false alarm rates less than015 and for false alarm rates larger than 060 Thus thereis still some deviation from optimal performance evenafter normalization However the maximum-likelihoodratio is closer to the optimal value for all false alarmrates after normalization than before normalization Ar-guably normalization improves performance sufficientlyso that performance may be described as being near anoptimal value for a wide range of recognition criteria

Overall performance after normalization can be di-rectly compared with performance before normalizationFigure 9D plots the total hit rate after normalizationminus the total hit rate before normalization for differenttotal false alarm rates In this figure the standard devia-tion of the net inputs to Class B was set to a 3 (rather than156) in order to make the difference between perfor-mance before and after normalization more salient Allother parameters were identical to those in Figure 4DFurthermore it is assumed that the number of items inClass A is equal to the number of items in Class B Thetotal hit rate is equal to the average hit rate of Class Aand Class B and the total false alarm rate is equal to theaverage false alarm rate in Class A and Class B

For all recognition criteria or for all false alarm ratesperformance is better or equal after normalization ascompared with performance before normalization Foran unbiased recognition criterion or a slightly largerrecognition criterion performance is approximatelyequal before and after normalization [ie for 25 lt P(N) lt40] For conservative recognition criteria [P(N) lt 25]there is a large advantage for normalized performanceover nonnormalized performance The largest advantageis approximately 005 and it occurs when the false alarm

VARIANCE THEORY OF MIRROR EFFECT 435

rate is approximately 005 For liberal recognition crite-ria [P(N) gt 40] there is also an advantage for normal-ized performance The largest advantage is around 001and it occurs when the false alarm rate is 070 The ad-vantage for liberal criteria is smaller than the advantagefor conservative criteria because of a ceiling effect forlarge false alarm rates and large hit rates

A Nonoptimal Network is Inconsistent With Empirical Data of Recognition Memory

To summarize it has been shown that performance isoptimal when (1) the percentage of nodes active at en-coding is low (2) the activation threshold is set betweenthe new and the old distributions (3) only nodes activeat encoding are used for measuring the recognitionstrength and (4) the recognition strength is normalizedwith the standard deviation of the net input It is inter-esting to note that all these conditions are necessary forproducing the empirically found memory phenomena

1 The percentage of active nodes has to be low forproducing appropriate standard deviations for the oldand the new recognition distributions If the percentageof active nodes is too high the standard deviation of theold distribution will approach the standard deviation ofthe new distribution

2 The model predicts the mirror effect only if the ac-tivation threshold is set between the old and the new dis-tributions If the recognition threshold is larger than theexpected value of the net input of the old distributionthe hit rate of the easy class will be less than the hit rateof the difficult distribution Similarly if the recognitionthreshold is lower than the expected new net input thefalse alarm of the easy class will be larger than the falsealarm rate of the difficult class This is inconsistent withthe empirical data for unbiased responses

3 Assume that the recognition strength is based on allthe nodes (ie not only nodes inactive during encod-ing)mdashfor example by counting the number of nodes inthe correct state of activation This measurement ofrecognition strength will have at least 50 of the nodesin the correct state (ie Pc gt 50) even if the subjectswere merely guessing on new items This would lead tothe incorrect prediction that the old recognition strengthvariance should be smaller than the new recognitionstrength variance because Pc(1 Pc)N decreases for Pcover 50 This is inconsistent with the finding that thevariance of the old distribution is larger than the varianceof the new distribution

4 If the recognition strength is not normalized withthe net input the variance of the recognition strength ofthe new easy class (AN) will be smaller than the varianceof the recognition strength of the new difficult class (BNsee Figure 4C) This is inconsistent with the empirical data

This analysis indicates that several recognition mem-ory phenomena fall out as a consequence of optimizingperformance in the network If the model is optimized interms of performance the model reproduces the empir-ical data If the model is made nonoptimal the model

does not reproduce the empirical data Arguably thehuman brain has evolved to optimize storage capacityand therefore these memory phenomena occur

GENERAL DISCUSSION

This paper has suggested a variance theory for themirror effect that also applies to the ROC curves Themodel is remarkably simple It has been shown that atwo-layer recurrent network where one layer representscontext and one layer represents items shows these phe-nomena if performance is measured by counting thenumber of nodes active at recognition in a way that opti-mizes performance The structure of the theory guaran-tees that high-frequency items have a larger variance inthe net inputs than do low-frequency items because en-coding the same item in different contexts increases thevariance whereas the expected net inputs are the sameThe theory predicts the mirror effect when the sameamount of attention is paid to both classes of stimuli Thestandard deviation of the recognition strength is largerfor old than for new items because more nodes are activein old items The standard deviation for the easy class islarger than the standard deviation for the difficult classbecause the recognition strength is normalized with thestandard deviation for the net input

There are several reasons why the variance theory isinteresting First the theory is extremely simple Theonly necessary assumptions are that recognition is basedon recurrent associations between contexts and itemsand performance is measured by counting the number ofnodes in an optimal way Second these assumptions areconsistent with what is known about how the brain worksTherefore the model is biologically plausible Third themodel accounts for a large amount of data including themirror effect exceptions from the mirror effect ROCcurves list-length effects and so on Fourth the modelfits the empirical data well Fifth it is easy to implementthe model in a connectionist network

Paying more attention to one of the classes violates theassumption of equal expected net inputs to the two classesThe variance theory predicts that attention to the moredifficult class primarily affects the hit rates whereas thefalse alarm rates and standard deviations of the underly-ing distributions are less affected An experiment sup-ported the prediction A standard mirror effect was foundwhen attention was divided equally between high- andlow-frequency words However focusing the subjectsrsquoattention on the high-frequency words either by the pre-sentation frequency or the presentation time made thehit rate larger for the high-frequency words than for thelow-frequency words This manipulation of attention didnot influence the false alarm rates which were higher forthe high-frequency words in all the conditions Thus nomirror effect was found when attention was paid to thehigh-frequency words Nor did the focusing of attentioninfluence the order of the standard deviations The stan-dard deviations were larger for the low-frequency distri-

436 SIKSTROM

bution than for the high-frequency distribution This wastrue for the new and the old distributions both when at-tention was paid to high-frequency words and when at-tention was divided equally between the two classes (ex-cept in the new frequency control condition where thestandard deviations were equal)

The variance theory predicts the order of the standarddeviations of the underlying distributions for the follow-ing reasons The standard deviations of the old distribu-tions are predicted to be larger than those of the new dis-tributions because more nodes are activated in the olddistributions The standard deviations of the easy classdistributions are predicted to be larger than the standarddeviations of the difficult class distributions because therecognition strength is normalized by the itemrsquos diffi-culty estimated from the standard deviation of the net in-puts This is consistent with the empirical data

In contrast the attention-likelihood theory does notconstrain the old distribution to be larger than the newdistribution for the difficult class (it can be larger orsmaller depending on parameter settings) The variancetheory allows the following two orders of the standarddeviations ss(BN) lt ss(BO) lt ss(AN) lt ss(AO) andss(BN) lt ss(AN) lt ss(BO) lt ss(AO) The first order isthe more common although the second order occurs oc-casionally (see eg Glanzer et al 1993 Experiment 1)In addition the attention-likelihood theory allowsss(BO) lt ss(BN) lt ss(AN) lt ss(AO) according to Equa-tion 2 which to the authorrsquos knowledge has not beenfound in empirical data

The variance theory predicts that strength variablessuch as study time repetition and study instructions af-fect the expected net input For example increasing studytime will increase the net input that improves the hit rateIncreasing the net inputs also causes an increase in theactivation threshold that diminishes the false alarm rates

The variance theory has no (ie lateral) connectionswithin the item layer and there are no connections with-in the context layer Including intraitem connections inrecognition makes it impossible to tell whether therecognition strength comes from encoding during thelearning episode or from another preexperimental learn-ing episode Thus there would be a confounding be-tween the itemrsquos frequency and recognition strength Forexample if the recognition strength in the variance the-ory included intraitem connections and used a constantrecognition criterion it would predict a higher hit rateand a higher false alarm rate for high-frequency itemsthan for low-frequency items Thus the hit rate for high-frequency words would be larger than that for low-frequency words which is contrary to the data on the mir-ror effect This issue is called the frequencyrecognitionndashstrength confounding problem Other models may bevulnerable to this problem depending on their specificassumptions The variance theory is immune from thisproblem because recognition strength is based on the as-sociation between the context and the item that yields apure measurement of the strength of the target in a par-

ticular episode Net inputs within the item population arenot used because these connections are highly corre-lated with the frequency of the item

This frequencyrecognition-strength confounding prob-lem may be relevant to several distributed models thatassume that recognition strength increases with frequencythus making the false alarm rate higher for high- than forlow-frequency stimuli This is often implemented in dis-tributed models by including intraitem associations inthe measurement of recognition strength Thus whenintraitem and itemndashcontext associations are added it isnot possible to know whether the intraitem strength oc-curs because an item has been encoded in the to-be-retrieved-from list or at another episode

Although the intraitem associations are not used tomeasure recognition strength they may play an impor-tant role in recognition In the first step of recognitionthese associations may be used for deblurring unclear in-formation in the item cue (a similar mechanism occursfor the context cue) Arguably this deblurring mecha-nism works well for well-known words however fornonwords it is much more likely to fail Such failure willactivate features that were not active in the encoded rep-resentation It is more likely that these wrongly activatednodes represent high-frequency features because it ismore likely to converge on high-frequency features Thereare two interesting implications of this perspective Firstthe wrongly activated nodes will use the wrong connec-tions between the context and the item Second becausethe wrongly activated nodes represent high-frequencyfeatures the average variability will be larger for non-words than for words This is a plausible account of thepoor recognition performance with nonwords as com-pared with words It is also a tentative account of why non-words can be seen as a difficult class with a higher vari-ability than that for words in the variance theory Howeverfurther work is needed before any firm conclusion can bemade regarding this aspect of the theory

A problem similar to frequencyrecognition-strengthconfounding occurs if within-context connections areused at recognition If the context is temporally corre-lated the within-context connections are influenced bylist length This causes a confounding between list lengthand recognition strength This issue is called the list-lengthrecognition-strength confounding problem Othermodels may be vulnerable to this problem depending ontheir specific assumptions

Another issue is whether the variance theory can ac-count for the mirror effect found in abstract and concretewords where words from a concrete class are more eas-ily discriminated than words from an abstract class Thevariance theory can account for this given the assump-tion that the variance of the net input is larger for abstractthan for concrete words However at this point it is notcompletely clear how this assumption can be motivatedA possibility is that although these two classes areequated for word frequency the contexts associated withan abstract word are more variable than the contexts as-

VARIANCE THEORY OF MIRROR EFFECT 437

sociated with a concrete word This larger variability incontext for abstract words may lead to a larger variabil-ity in the net input Another possibility is that the activefeatures in abstract words are more general and there-fore associated with more contexts Nodes active in con-crete words may represent more specific features acti-vated with a lower frequency and therefore associatedwith fewer contexts Thus features in abstract wordsmay be of a higher frequency than features in concretewords although the frequencies of the items are thesame This would lead to a mirror effect in the modelHowever at this point no claim is made that variancetheory can handle this phenomenon

The variance theoryrsquos account of the list-length andlist-strength effects is arguably much simpler than thesubjective-likelihoodrsquos account The activation thresholdis set so that on average half of the nodes active duringencoding are active during recognition The activationthreshold is constant within one condition but may changebetween conditionsmdashfor example when study time ischanged Furthermore subjects do not need to estimatethe list length and the probability that a particular itemis drawn from the study list

The variance theory has advantages over the attention-likelihood theory As was discussed in more detail abovethe attention-likelihood theory requires subjects to clas-sify the stimulus Depending on this classification thesubjects must know (consciously or unconsciously) howmuch attention is paid to the stimuli in order to calculatethe log-likelihood ratios Thus the yesndashno decision isbased on the subjectsrsquo classification of the stimuli Thevariance theory does not require a classification of thestimuli During the calculation of recognition strengththe standard deviation of the net input of the item is used(shcent ) so the subject does not need to know the class or thestandard deviation of the class (sh) The increase in thehit rate and decrease in the false alarm rate for the easierclass occurs according to the theory because the vari-ance is smaller for the easier class The variance theoryhas a simple mathematical base subjects count the num-ber of activated nodes minus the expected value dividedby the standard deviation of the net input of the item Anexplicit solution is presented that requires only calculat-ing the probabilities of two z-transformations

The subjective-likelihood theory uses feature contentof the items for addressing issues regarding item similar-ity and word frequency In particular high-frequency wordsare assumed to have a higher noise or variability than dolow-frequency words The variance theory also usesvariability that depends on frequency However the vari-ance theory simulates the increase in variance duringeach presentation of a feature in different contexts thusmaking it an unavoidable phenomenon for high-frequencyfeatures In the subjective-likelihood theory the featurevariance is introduced as an assumption or a constantand it is not explicitly simulated

There are several other differences between the vari-ance theory and the subjective-likelihood theory The

subjective-likelihood theory is based on log-likelihoodratios In the variance theory log-likelihood ratios arenot necessary to account of the mirror effect and for z-ROC curves Instead the variance theory uses the num-ber of active nodes normalized by the variance of the netinput as the measurement of recognition strength

Another difference is the use of one detector for eachitem in the subjective-likelihood theory This makes themodel essentially local whereas the variance theory isdistributed This property may cause difficulties in im-plementing the model in a connectionist network How-ever see McClelland and Chappell (1998) for a brief dis-cussion of this topic An implementation of the variancetheory is straightforward

REFERENCES

Feller W (1968) An introduction to probability theory and its appli-cation New York Wiley

Gillund G amp Shiffrin R M (1984) A retrieval model for bothrecognition and recall Psychological Review 91 1-67

Glanzer M amp Adams J K (1985) The mirror effect in recognitionmemory Memory amp Cognition 13 8-20

Glanzer M amp Adams J K (1990) The mirror effect in recognitionmemory Data and theory Journal of Experimental PsychologyLearning Memory amp Cognition 16 5-16

Glanzer M Adams J K amp Kim K (1993) The regularities ofrecognition memory Psychological Review 100 546-567

Glanzer M amp Bowles N (1976) Analysis of the word frequencyeffect in recognition memory Journal of Experimental PsychologyHuman Learning amp Memory 2 21-31

Glanzer M Kisok K amp Adams J K (1998) Response distribu-tions as an explanation of the mirror effect Journal of ExperimentalPsychology Learning Memory amp Cognition 24 633-644

Greene R L (1996) Mirror effect in order and associative informa-tion Role of response strategies Journal of Experimental Psychol-ogy Learning Memory amp Cognition 22 687-695

Hertz J Krogh A amp Palmer R G (1991) Introduction to the the-ory of neural computation Reading MA Addison-Wesley

Hintzman D L (1988) Judgment of frequency and recognition memoryin a multiple trace memory model Psychological Review 95 528-551

Hopfield J J (1982) Neural networks and physical systems withemergent collective computational abilities Proceedings of the Na-tional Academy of Sciences 79 2554-2558

Hopfield J J (1984) Neurons with graded response have collectivecomputational properties like those of two-state neurons Proceed-ings of the National Academy of Sciences 81 3088-3092

Humphreys M S Bain J D amp Pike R (1989) Different way to cuea coherent memory system A theory for episodic semantic and pro-cedural tasks Psychological Review 96 208-233

Kim K amp Glanzer M (1993) Speed versus accuracy instructionsstudy time and the mirror effect Journal of Experimental Psychol-ogy Learning Memory amp Cognition 19 638-652

Kruschke J K (1992) ALCOVE An exemplar-based connectionistmodel of category learning Psychological Review 99 22-44

Ku Iumlcera H amp Francis W N (1967) Computational analysis ofpresent-day American English Providence RI Brown UniversityPress

Lewandowsky S (1991) Gradual unlearning and catastrophic inter-ference A comparison of distributed architectures In W E Hockleyamp S Lewandowsky (Eds) Relating theory and data Essays onhuman memory in honor of Bennet B Murdock (pp 445-476) Hills-dale NJ Erlbaum

Li S-C amp Lindenberger U (1999) Cross-level unification A com-putational exploration of the link between deterioration of neuro-transmitter systems and dedifferentiation of cognitive abilities in oldage In L-G Nilsson amp H J Markowitsch (Eds) Cognitive neuro-sciences of memory (pp 103-146) Seattle Hogrefe amp Huber

438 SIKSTROM

Li S-C Lindenberger U amp Frensch P A (2000) Unifying cog-nitive aging From neuromodulation to representation to cognitionNeurocomputing 32-33 879-890

McClelland J L amp Chappell M (1998) Familiarity breeds dif-ferentiation A subjective-likelihood approach to the effects of expe-rience in recognition memory Psychological Review 105 724-760

Metcalfe J (1982) A composite holographic associative recallmodel Psychological Review 89 627-658

Murdock B B (1982) A theory for the storage and retrieval of item andassociative information Psychological Review 89 609-626

Murdock B B (1998) The mirror effect and attentionndashlikelihoodtheory A reflective analysis Journal of Experimental PsychologyLearning Memory amp Cognition 24 524-534

Okada M (1996) Notions of associative memory and sparse codingNeural Networks 9 1429-1458

Ratcliff R Sheu C-F amp Gronlund S D (1992) Testing globalmemory models using ROC curves PsychologicalReview 99 518-535

Rumelhart D E Hinton G E amp Williams R J (1986) Learn-ing representation by backpropagating errors Nature 323 533-536

Shiffrin R M amp Steyvers M (1997) A model for recognitionmemory REMndashretrieving effectively from memory PsychonomicBulletin amp Review 4 145-166

Sikstroumlm S (1996a) TECO A connectionist theory of successiveepisodic tests Umearing Doctoral dissertation Umearing University (ISBN91-7191-155-3)

Sikstroumlm S (1996b) The TECO connectionist theory of recognitionfailure European Journal of Cognitive Psychology 8 341-380

Sikstroumlm S (1999) Power function forgetting curves as an emergentproperty of biologically plausible neural networks model Interna-tional Journal of Psychology 34 460-464

Stretch V amp Wixted J T (1998a) Decision rules for recognitionmemory confidence judgments Journal of Experimental Psychol-ogy Learning Memory amp Cognition 24 1397-1410

Stretch V amp Wixted J T (1998b) On the difference betweenstrength-based and frequency-based mirror effects in recognitionmemory Journal of Experimental Psychology Learning Memoryamp Cognition 24 1379-1396

NOTE

1 The expected number of nodes active at recognition is a2 giventhat the standard deviations of the percentages of active nodes for new[sp(N)] and old items [sp(O)] are equal However if the standard devi-ations are different the expected number of active nodes will be

Because the new variance is smaller than the old variance this valuewill be slightly less than a2 typically around 044a if the ROC curveis 08 It is actually more precise to use this value In this paper the ap-proximation a2 is used except that in the simulations where the ex-pression above is used

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input solves the ex-pected value of the net input for old items In this Appendix thesuperscripts representing the item layer (d) and the contextlayer (c) are dropped Only the active states of activation con-tribute to the net input so jj 5 jj = 1

(A1)

The expected value of the net inputs for the new items iszero

mh(N) 5 0 (A2)

The expected value of the weights is zero The variance ofthe net input is calculated by squaring each term in the netinput Let P represent the number of patterns stored in the net-work If the patterns are uncorrelated or a new item is pre-sented the variance of the net input is

(A3)

For an old item the variance of the net input to the contextlayer from the item layer depends on the frequency of the item( f ) All the aN weights supporting a context node contribute tothe net input in the same direction

(A4)

The same function can be used for calculating the varianceof the net input to a node in the item layer when the same con-text is associated with several items Let the same context be as-sociated with L items in a list Furthermore let the context be-tween different lists be uncorrelated The variance of the netinput to an item node is

The expected variance for all nodes is the weighted sum ofthese two terms Half of the nodes are context nodes and halfof the nodes are item nodes Therefore the expected varianceof the net input to all nodes given an item with a frequency off and a list length of L in a network where p patterns have beenencoded is

(A5)

Performance in the Variance Model When AllNodes Are Used

In the variance model recognition strength is based on nodesactive at encoding However if recognition strength is based onall the nodes performance can be calculated as follows The d centis calculated by using Equation 8 where Pc(O) and Pc(N) are thenumber of correct nodes The number of correct nodes is thenumber of nodes active at retrieval that also is active at encod-ing (ie calculated as usual with Equation 7) plus the numberof inactive nodes at encoding that also are inactive at retrievalThe latter value can be calculated by replacing a with 1 a inEquation 7 and using the expected value of old net inputs for in-active nodes a2 (1 a) N (compare with Equation A1)

(Manuscript received February 9 1999revision accepted for publication October 30 2000)

s h f L a N p a a N2 3 22 1O( ) = +( ) + -( )

s h f a a LN2 4 2 21( ) = -( )

s x x x xh ij jj

N

i j ji

Nf f f a a

a a f N

22 2

4 2 21

( ) = waring aringaelig

egraveccedilouml

oslashdivide= -( ) -( )eacute

eumlecirc

= -( )

s x x x xh ij jj

N

i j j

P

i

NN a a

a a a a a a a

a a a a PN a a PN

2 2 2

2

3

1 1 2 0 1

0 0 1 1

( ) = w

= [( ) ( )] + [( )(1- )] ( )

+ [( )( )] ( ) = ( )

2 2

2 2 2

( ) = -( ) -( ) - - - -

- - - -

aring aringaring

m x x x xh ijj

N

j i j jj

Na a a a N(O) = wDaring aring= -( ) -( ) = -( )1

2

ss s

p

p p

a(N)

( )N (O)+

434 SIKSTROM

discrimination) and subjects are instructed to respondyes only when they are absolutely certain that they arecorrect it may be optimal to set a very high criterion forthe difficult class so that no yes responses will be madefor the difficult class and a moderate criterion for theeasy class so that some yes responses will be made forthe easy class Therefore any model that optimizes per-formance for the two classes must combine the criteriafor each class so that the performance for the sum of theclasses will be optimal

This problem may formally be stated as follows Giventwo classes (A and B) with a fixed total false alarm rate[P(AN) + P(BN)] how should the recognition criteriafor the two classes [T(A) and T(B)] be chosen so that thehit rates are maximized [P(AO) + P(BO)] The solutionto this problem is surprisingly simple The optimal perfor-mance occurs when the placements of the maximum like-lihoods of the two classes are equal

L(A) = f [S(AO)] f [S(AN)] = L(B)

= f [S(BO)] f [S(BN)]

It is easy to see that this criterion must be satisfied foroptimal performance because any shift from this pointdiminishes performance For example if the recognitionthreshold for class A is diminished the recognition cri-terion for class B must be increased to keep the totalfalse alarm rate constant According to the formulationof the problem the change in total false alarm rates mustbe equal f f [S(BN) = 0] The maximum-likelihood ra-tios are monotonically increasing functions of the recog-nition criteria therefore L(A) L(B) lt 0 when the recog-nition criteria are changed as specified above ThusL(A) = f [S(AO)] f [S(AN)] lt f [S(BO)] f [S(BN)] =L(B) or f [S(AO)] f [S(BO)] lt 0 This shows that thechange in the placement of the criteria from L(A) = L(B)results in an overall decrease in hit ratemdash( f [S(AO)] f [S(BO)] lt 0)mdashand performance suffers

Note that the variance theory only has one overallrecognition criterion However the theory and any the-ory of the mirror effect specifies how this criterionchanges when it is moved over the two classes Thus thesecond criterion can indirectly be inferred from the for-mulation of the theory This is done in the variance the-ory by the normalization factor that scales the recogni-tion differently between the two classes of words

The intriguing question here is whether the variancetheory is optimal or almost optimal in terms of place-ment of the recognition criterion for the two classes Fig-ure 9C plots the maximum likelihood of class B dividedby the maximum likelihood of class A [L(B)L(A)] onthe y-axis The x-axis shows the probability of false alarmsfor class A when the recognition criteria are changedThe optimal ratio of the maximum likelihood on the y-axisis one and it is plotted as the dotted straight line The re-sults before normalization (ie by counting the numberof nodes above the recognition criterion) are plotted in

the dotted monotonically increasing line The resultsafter normalization (ie the percentage of active nodesminus the expected number of active nodes divided bythe standard deviation of the net input) are plotted as thesolid line

The figure clearly shows that performance before nor-malization is far from optimal For a conservative recog-nition criterion or low false alarm rates the maximumlikelihood for class A is larger than the maximum likeli-hood for class B Therefore a more liberal criterion forclass B and a more strict criterion for class A would bemore advantageous For a liberal recognition criterionor a high false alarm rate the maximum likelihood forclass B is larger than the maximum likelihood for classA Therefore a more liberal criterion for Class A and astricter criterion for Class B would be beneficial Theonly point at which the performance is optimal is whenthe recognition criterion is unbiased At this point [aroundP(AN) = 25] the maximum-likelihood ratio is one

Normalization improves performance significantly sothe maximum-likelihood ratio is one or almost one fora range of criteria For false alarm rates larger than 015and smaller than 060 the ratio is within two percentagepoints of one The maximum likelihood for Class A islarger than that for Class B for false alarm rates less than015 and for false alarm rates larger than 060 Thus thereis still some deviation from optimal performance evenafter normalization However the maximum-likelihoodratio is closer to the optimal value for all false alarmrates after normalization than before normalization Ar-guably normalization improves performance sufficientlyso that performance may be described as being near anoptimal value for a wide range of recognition criteria

Overall performance after normalization can be directly compared with performance before normalization. Figure 9D plots the total hit rate after normalization minus the total hit rate before normalization, for different total false alarm rates. In this figure, the standard deviation of the net inputs to class B was set to 3 (rather than 1.56) in order to make the difference between performance before and after normalization more salient. All other parameters were identical to those in Figure 4D. Furthermore, it is assumed that the number of items in class A is equal to the number of items in class B. The total hit rate is equal to the average hit rate of class A and class B, and the total false alarm rate is equal to the average false alarm rate of class A and class B.

For all recognition criteria, or for all false alarm rates, performance is better or equal after normalization, as compared with performance before normalization. For an unbiased recognition criterion, or a slightly larger recognition criterion, performance is approximately equal before and after normalization [i.e., for .25 < P(N) < .40]. For conservative recognition criteria [P(N) < .25], there is a large advantage for normalized performance over nonnormalized performance. The largest advantage is approximately .05, and it occurs when the false alarm rate is approximately .05. For liberal recognition criteria [P(N) > .40], there is also an advantage for normalized performance. The largest advantage is around .01, and it occurs when the false alarm rate is .70. The advantage for liberal criteria is smaller than the advantage for conservative criteria because of a ceiling effect for large false alarm rates and large hit rates.
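The shape of this comparison can be reproduced at the level of the underlying distributions. The sketch below is a simplified stand-in for the network simulation, not the model itself: raw strengths for each class are approximated as Gaussians whose standard deviations differ between classes (1 and 3, echoing the class B value above; all values are hypothetical), old and new distributions are given the same class variance for simplicity, and normalization is modeled as division by the class standard deviation. Under these toy assumptions, the gain from normalization is largest for conservative criteria, as in Figure 9D.

```python
import numpy as np
from scipy.stats import norm

# Raw strength for class X: new ~ N(0, sd_X), old ~ N(mu, sd_X) (hypothetical).
mu = 1.0
sds = np.array([1.0, 3.0])       # class A (easy), class B (difficult)

def hit_before(target_fa):
    # Before normalization: one raw criterion applied to both classes.
    cs = np.linspace(-12.0, 12.0, 40001)
    fa = norm.sf(cs[:, None], 0.0, sds).mean(axis=1)
    c = cs[np.argmin(np.abs(fa - target_fa))]
    return norm.sf(c, mu, sds).mean()

def hit_after(target_fa):
    # After normalization (division by sd_X), both new distributions are N(0, 1),
    # so one normalized criterion z gives the same false alarm rate for each class.
    z = norm.isf(target_fa)
    return norm.sf(z - mu / sds).mean()

for fa in (0.05, 0.25, 0.70):
    gain = hit_after(fa) - hit_before(fa)
    print(f"total FA = {fa:.2f}: hit-rate gain from normalization = {gain:+.3f}")
```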

A Nonoptimal Network Is Inconsistent With Empirical Data of Recognition Memory

To summarize, it has been shown that performance is optimal when (1) the percentage of nodes active at encoding is low, (2) the activation threshold is set between the new and the old distributions, (3) only nodes active at encoding are used for measuring the recognition strength, and (4) the recognition strength is normalized with the standard deviation of the net input. It is interesting to note that all these conditions are necessary for producing the empirically found memory phenomena.

1. The percentage of active nodes has to be low to produce appropriate standard deviations for the old and the new recognition distributions. If the percentage of active nodes is too high, the standard deviation of the old distribution will approach the standard deviation of the new distribution.

2. The model predicts the mirror effect only if the activation threshold is set between the old and the new distributions. If the recognition threshold is larger than the expected value of the net input of the old distribution, the hit rate of the easy class will be less than the hit rate of the difficult class. Similarly, if the recognition threshold is lower than the expected new net input, the false alarm rate of the easy class will be larger than the false alarm rate of the difficult class. This is inconsistent with the empirical data for unbiased responses.

3. Assume that the recognition strength is based on all the nodes (i.e., not only the nodes active during encoding), for example, by counting the number of nodes in the correct state of activation. This measurement of recognition strength will have at least 50% of the nodes in the correct state (i.e., Pc > .50), even if the subjects were merely guessing on new items. This would lead to the incorrect prediction that the old recognition strength variance should be smaller than the new recognition strength variance, because Pc(1 − Pc)N decreases for Pc over .50 (see the sketch after this list). This is inconsistent with the finding that the variance of the old distribution is larger than the variance of the new distribution.

4. If the recognition strength is not normalized with the standard deviation of the net input, the variance of the recognition strength of the new easy class (AN) will be smaller than the variance of the recognition strength of the new difficult class (BN; see Figure 4C). This is inconsistent with the empirical data.
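The binomial argument in point 3 can be made concrete with a minimal sketch (the number of nodes N is hypothetical): the variance of a correct-state count, Pc(1 − Pc)N, peaks at Pc = .5 and falls as learning pushes Pc above .5, so counting all correct states would wrongly make the old distribution the less variable one.

```python
# Variance of a correct-state count: Var = Pc * (1 - Pc) * N (binomial).
# Guessing on new items gives Pc around .5; learning pushes old items above .5,
# which lowers this variance, the opposite of the empirical old > new order.
N = 100   # hypothetical number of nodes

def count_variance(pc, n=N):
    return pc * (1.0 - pc) * n

for pc in (0.50, 0.60, 0.75, 0.90):
    print(f"Pc = {pc:.2f}: count variance = {count_variance(pc):.1f}")
```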

This analysis indicates that several recognition memory phenomena fall out as a consequence of optimizing performance in the network. If the model is optimized in terms of performance, the model reproduces the empirical data. If the model is made nonoptimal, the model does not reproduce the empirical data. Arguably, the human brain has evolved to optimize storage capacity, and therefore these memory phenomena occur.

GENERAL DISCUSSION

This paper has suggested a variance theory for the mirror effect that also applies to the ROC curves. The model is remarkably simple. It has been shown that a two-layer recurrent network, where one layer represents context and one layer represents items, shows these phenomena if performance is measured by counting the number of nodes active at recognition in a way that optimizes performance. The structure of the theory guarantees that high-frequency items have a larger variance in the net inputs than do low-frequency items, because encoding the same item in different contexts increases the variance, whereas the expected net inputs are the same. The theory predicts the mirror effect when the same amount of attention is paid to both classes of stimuli. The standard deviation of the recognition strength is larger for old than for new items because more nodes are active in old items. The standard deviation for the easy class is larger than the standard deviation for the difficult class because the recognition strength is normalized with the standard deviation for the net input.
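The variance-growth mechanism can be illustrated with a toy Hebbian simulation (layer size, activity level, and the frequencies are hypothetical, and this is a sketch of the mechanism rather than the full model): one item pattern is encoded with a fresh random context on each of f presentations, and the net input to the context layer is examined at test. The mean stays near zero while the variance grows roughly linearly with f.

```python
import numpy as np

rng = np.random.default_rng(0)
N, a = 1000, 0.1                 # nodes per layer and activity level (hypothetical)

def sparse_pattern():
    x = np.zeros(N)
    x[rng.choice(N, int(a * N), replace=False)] = 1.0
    return x

item = sparse_pattern()                     # one fixed item pattern
for f in (1, 2, 4, 8):                      # item frequency
    w = np.zeros((N, N))                    # context <- item weights
    for _ in range(f):                      # each encoding uses a new random context
        ctx = sparse_pattern()
        w += np.outer(ctx - a, item - a)    # Hebbian weight increment
    h = w @ item                            # net input to the context layer at test
    print(f"f = {f}: mean = {h.mean():+7.3f}, variance = {h.var():9.1f}")
```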

There are several reasons why the variance theory is interesting. First, the theory is extremely simple. The only necessary assumptions are that recognition is based on recurrent associations between contexts and items and that performance is measured by counting the number of nodes in an optimal way. Second, these assumptions are consistent with what is known about how the brain works; therefore, the model is biologically plausible. Third, the model accounts for a large amount of data, including the mirror effect, exceptions from the mirror effect, ROC curves, list-length effects, and so on. Fourth, the model fits the empirical data well. Fifth, it is easy to implement the model in a connectionist network.

Paying more attention to one of the classes violates the assumption of equal expected net inputs to the two classes. The variance theory predicts that attention to the more difficult class primarily affects the hit rates, whereas the false alarm rates and standard deviations of the underlying distributions are less affected. An experiment supported the prediction. A standard mirror effect was found when attention was divided equally between high- and low-frequency words. However, focusing the subjects' attention on the high-frequency words, either by the presentation frequency or by the presentation time, made the hit rate larger for the high-frequency words than for the low-frequency words. This manipulation of attention did not influence the false alarm rates, which were higher for the high-frequency words in all the conditions. Thus, no mirror effect was found when attention was paid to the high-frequency words. Nor did the focusing of attention influence the order of the standard deviations. The standard deviations were larger for the low-frequency distribution than for the high-frequency distribution. This was true for the new and the old distributions, both when attention was paid to high-frequency words and when attention was divided equally between the two classes (except in the new frequency control condition, where the standard deviations were equal).

The variance theory predicts the order of the standard deviations of the underlying distributions for the following reasons. The standard deviations of the old distributions are predicted to be larger than those of the new distributions because more nodes are activated in the old distributions. The standard deviations of the easy class distributions are predicted to be larger than the standard deviations of the difficult class distributions because the recognition strength is normalized by the item's difficulty, estimated from the standard deviation of the net inputs. This is consistent with the empirical data.

In contrast, the attention-likelihood theory does not constrain the old distribution to be larger than the new distribution for the difficult class (it can be larger or smaller, depending on parameter settings). The variance theory allows the following two orders of the standard deviations: σs(BN) < σs(BO) < σs(AN) < σs(AO) and σs(BN) < σs(AN) < σs(BO) < σs(AO). The first order is the more common, although the second order occurs occasionally (see, e.g., Glanzer et al., 1993, Experiment 1). In addition, the attention-likelihood theory allows σs(BO) < σs(BN) < σs(AN) < σs(AO), according to Equation 2, which, to the author's knowledge, has not been found in empirical data.

The variance theory predicts that strength variables, such as study time, repetition, and study instructions, affect the expected net input. For example, increasing study time will increase the net input, which improves the hit rate. Increasing the net inputs also causes an increase in the activation threshold, which diminishes the false alarm rates.

The variance theory has no (i.e., lateral) connections within the item layer, and there are no connections within the context layer. Including intraitem connections in recognition makes it impossible to tell whether the recognition strength comes from encoding during the learning episode or from another, preexperimental learning episode. Thus, there would be a confounding between the item's frequency and recognition strength. For example, if the recognition strength in the variance theory included intraitem connections and used a constant recognition criterion, it would predict a higher hit rate and a higher false alarm rate for high-frequency items than for low-frequency items. Thus, the hit rate for high-frequency words would be larger than that for low-frequency words, which is contrary to the data on the mirror effect. This issue is called the frequency/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions. The variance theory is immune from this problem because recognition strength is based on the association between the context and the item, which yields a pure measurement of the strength of the target in a particular episode. Net inputs within the item population are not used, because these connections are highly correlated with the frequency of the item.

This frequency/recognition-strength confounding problem may be relevant to several distributed models that assume that recognition strength increases with frequency, thus making the false alarm rate higher for high- than for low-frequency stimuli. This is often implemented in distributed models by including intraitem associations in the measurement of recognition strength. Thus, when intraitem and item–context associations are added, it is not possible to know whether the intraitem strength occurs because an item has been encoded in the to-be-retrieved-from list or in another episode.

Although the intraitem associations are not used to measure recognition strength, they may play an important role in recognition. In the first step of recognition, these associations may be used for deblurring unclear information in the item cue (a similar mechanism occurs for the context cue). Arguably, this deblurring mechanism works well for well-known words; however, for nonwords it is much more likely to fail. Such failure will activate features that were not active in the encoded representation. It is more likely that these wrongly activated nodes represent high-frequency features, because the network is more likely to converge on high-frequency features. There are two interesting implications of this perspective. First, the wrongly activated nodes will use the wrong connections between the context and the item. Second, because the wrongly activated nodes represent high-frequency features, the average variability will be larger for nonwords than for words. This is a plausible account of the poor recognition performance with nonwords as compared with words. It is also a tentative account of why nonwords can be seen as a difficult class, with a higher variability than that for words, in the variance theory. However, further work is needed before any firm conclusion can be made regarding this aspect of the theory.

A problem similar to the frequency/recognition-strength confounding occurs if within-context connections are used at recognition. If the context is temporally correlated, the within-context connections are influenced by list length. This causes a confounding between list length and recognition strength. This issue is called the list-length/recognition-strength confounding problem. Other models may be vulnerable to this problem, depending on their specific assumptions.

Another issue is whether the variance theory can account for the mirror effect found in abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this, given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point it is not completely clear how this assumption can be motivated. A possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions, for example, when study time is changed. Furthermore, subjects do not need to estimate the list length and the probability that a particular item is drawn from the study list.

The variance theory has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes–no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item is used (σ′h), so the subject does not need to know the class or the standard deviation of the class (σh). The increase in the hit rate and the decrease in the false alarm rate for the easier class occur, according to the theory, because the variance is smaller for the easier class. The variance theory has a simple mathematical base: Subjects count the number of activated nodes, minus the expected value, divided by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations.
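A minimal sketch of that explicit solution follows (the normalized strength means and the criterion are illustrative numbers, not fitted values): once recognition strength is expressed in normalized units, the hit and false alarm rates are the survival probabilities of two z-transformed quantities.

```python
from math import erf, sqrt

def phi_bar(z):
    # Survival function of the standard normal distribution.
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Illustrative values on the normalized strength axis (unit variance after
# dividing by the item's net-input standard deviation); not fitted values.
mu_old, mu_new, criterion = 1.2, 0.0, 0.6

hit = phi_bar(criterion - mu_old)   # P(yes | old): first z-transformation
fa = phi_bar(criterion - mu_new)    # P(yes | new): second z-transformation
print(f"hit rate = {hit:.3f}, false alarm rate = {fa:.3f}")
```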

The subjective-likelihood theory uses the feature content of the items for addressing issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have a higher noise, or variability, than do low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance during each presentation of a feature in different contexts, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect and for z-ROC curves. Instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the model in a connectionist network (however, see McClelland & Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kisok, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgment of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention–likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM–retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University, Umeå. (ISBN 91-7191-155-3)

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σp(N)] and old [σp(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

$$\frac{a\,\sigma_p(\mathrm{N})}{\sigma_p(\mathrm{N}) + \sigma_p(\mathrm{O})}.$$

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the z-ROC curve is .8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input solves the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so ξi = ξj = 1:

$$\mu_h(\mathrm{O}) = \sum_{j=1}^{N} \Delta w_{ij}\,\xi_j = \sum_{j=1}^{N} (\xi_i - a)(\xi_j - a)\,\xi_j = a(1-a)^2 N. \quad (\mathrm{A1})$$

The expected value of the net inputs for the new items is zero:

$$\mu_h(\mathrm{N}) = 0. \quad (\mathrm{A2})$$

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

$$\sigma_h^2(\mathrm{N}) = \sum_{j=1}^{N}\sum_{\mu=1}^{P}\bigl(\Delta w_{ij}^{\mu}\,\xi_j\bigr)^2 = \Bigl\{[(1-a)(1-a)]^2 a^2 + [(1-a)(0-a)]^2\,2a(1-a) + [(0-a)(0-a)]^2 (1-a)^2\Bigr\}\,aPN = a^3(1-a)^2 PN. \quad (\mathrm{A3})$$

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction:

$$\sigma_h^2(f) = \sum_{i=1}^{N}\bigl[f\,(\xi_i - a)(\xi_j - a)\bigr]^2 = a^4(1-a)^2 f^2 N. \quad (\mathrm{A4})$$

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

$$\sigma_h^2(L) = a^4(1-a)^2 L^2 N.$$

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L, in a network where p patterns have been encoded, is

$$\sigma_h^2(\mathrm{O}) = \frac{\bigl(f^2 + L^2\bigr)\,a^4(1-a)^2 N}{2} + p\,a^3(1-a)^2 N. \quad (\mathrm{A5})$$
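The first two moments can be checked by a small Monte Carlo sketch under the Hebbian increment Δwij = (ξi − a)(ξj − a) used in the theory (network size, activity level, and number of patterns are hypothetical). The empirical mean of the old net input over nodes active at encoding, and the empirical variance of the new net input, can be compared with Equations A1 and A3 as reconstructed above.

```python
import numpy as np

rng = np.random.default_rng(1)
N, a, P = 2000, 0.1, 50     # nodes, activity level, stored patterns (hypothetical)

def sparse_pattern():
    x = np.zeros(N)
    x[rng.choice(N, int(a * N), replace=False)] = 1.0
    return x

patterns = [sparse_pattern() for _ in range(P)]
w = sum(np.outer(p - a, p - a) for p in patterns)   # Hebbian weights

old, new = patterns[0], sparse_pattern()
h_old = w @ old             # net input when an encoded pattern is the cue
h_new = w @ new             # net input for a random new cue

active = old == 1.0         # nodes active at the encoding of the old item
print(f"mean old net input (active nodes): {h_old[active].mean():8.1f}   "
      f"A1: {a * (1 - a) ** 2 * N:8.1f}")
print(f"variance of new net input:         {h_new.var():8.1f}   "
      f"A3: {a ** 3 * (1 - a) ** 2 * P * N:8.1f}")
```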

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on nodes active at encoding. However, if recognition strength is based on all the nodes, performance can be calculated as follows. The d′ is calculated by using Equation 8, where Pc(O) and Pc(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 − a in Equation 7 and using the expected value of old net inputs for inactive nodes, −a²(1 − a)N (compare with Equation A1).

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)


VARIANCE THEORY OF MIRROR EFFECT 435

rate is approximately 005 For liberal recognition crite-ria [P(N) gt 40] there is also an advantage for normal-ized performance The largest advantage is around 001and it occurs when the false alarm rate is 070 The ad-vantage for liberal criteria is smaller than the advantagefor conservative criteria because of a ceiling effect forlarge false alarm rates and large hit rates

A Nonoptimal Network is Inconsistent With Empirical Data of Recognition Memory

To summarize it has been shown that performance isoptimal when (1) the percentage of nodes active at en-coding is low (2) the activation threshold is set betweenthe new and the old distributions (3) only nodes activeat encoding are used for measuring the recognitionstrength and (4) the recognition strength is normalizedwith the standard deviation of the net input It is inter-esting to note that all these conditions are necessary forproducing the empirically found memory phenomena

1 The percentage of active nodes has to be low forproducing appropriate standard deviations for the oldand the new recognition distributions If the percentageof active nodes is too high the standard deviation of theold distribution will approach the standard deviation ofthe new distribution

2 The model predicts the mirror effect only if the ac-tivation threshold is set between the old and the new dis-tributions If the recognition threshold is larger than theexpected value of the net input of the old distributionthe hit rate of the easy class will be less than the hit rateof the difficult distribution Similarly if the recognitionthreshold is lower than the expected new net input thefalse alarm of the easy class will be larger than the falsealarm rate of the difficult class This is inconsistent withthe empirical data for unbiased responses

3 Assume that the recognition strength is based on allthe nodes (ie not only nodes inactive during encod-ing)mdashfor example by counting the number of nodes inthe correct state of activation This measurement ofrecognition strength will have at least 50 of the nodesin the correct state (ie Pc gt 50) even if the subjectswere merely guessing on new items This would lead tothe incorrect prediction that the old recognition strengthvariance should be smaller than the new recognitionstrength variance because Pc(1 Pc)N decreases for Pcover 50 This is inconsistent with the finding that thevariance of the old distribution is larger than the varianceof the new distribution

4 If the recognition strength is not normalized withthe net input the variance of the recognition strength ofthe new easy class (AN) will be smaller than the varianceof the recognition strength of the new difficult class (BNsee Figure 4C) This is inconsistent with the empirical data

This analysis indicates that several recognition mem-ory phenomena fall out as a consequence of optimizingperformance in the network If the model is optimized interms of performance the model reproduces the empir-ical data If the model is made nonoptimal the model

does not reproduce the empirical data Arguably thehuman brain has evolved to optimize storage capacityand therefore these memory phenomena occur

GENERAL DISCUSSION

This paper has suggested a variance theory for themirror effect that also applies to the ROC curves Themodel is remarkably simple It has been shown that atwo-layer recurrent network where one layer representscontext and one layer represents items shows these phe-nomena if performance is measured by counting thenumber of nodes active at recognition in a way that opti-mizes performance The structure of the theory guaran-tees that high-frequency items have a larger variance inthe net inputs than do low-frequency items because en-coding the same item in different contexts increases thevariance whereas the expected net inputs are the sameThe theory predicts the mirror effect when the sameamount of attention is paid to both classes of stimuli Thestandard deviation of the recognition strength is largerfor old than for new items because more nodes are activein old items The standard deviation for the easy class islarger than the standard deviation for the difficult classbecause the recognition strength is normalized with thestandard deviation for the net input

There are several reasons why the variance theory isinteresting First the theory is extremely simple Theonly necessary assumptions are that recognition is basedon recurrent associations between contexts and itemsand performance is measured by counting the number ofnodes in an optimal way Second these assumptions areconsistent with what is known about how the brain worksTherefore the model is biologically plausible Third themodel accounts for a large amount of data including themirror effect exceptions from the mirror effect ROCcurves list-length effects and so on Fourth the modelfits the empirical data well Fifth it is easy to implementthe model in a connectionist network

Paying more attention to one of the classes violates theassumption of equal expected net inputs to the two classesThe variance theory predicts that attention to the moredifficult class primarily affects the hit rates whereas thefalse alarm rates and standard deviations of the underly-ing distributions are less affected An experiment sup-ported the prediction A standard mirror effect was foundwhen attention was divided equally between high- andlow-frequency words However focusing the subjectsrsquoattention on the high-frequency words either by the pre-sentation frequency or the presentation time made thehit rate larger for the high-frequency words than for thelow-frequency words This manipulation of attention didnot influence the false alarm rates which were higher forthe high-frequency words in all the conditions Thus nomirror effect was found when attention was paid to thehigh-frequency words Nor did the focusing of attentioninfluence the order of the standard deviations The stan-dard deviations were larger for the low-frequency distri-

436 SIKSTROM

bution than for the high-frequency distribution This wastrue for the new and the old distributions both when at-tention was paid to high-frequency words and when at-tention was divided equally between the two classes (ex-cept in the new frequency control condition where thestandard deviations were equal)

The variance theory predicts the order of the standarddeviations of the underlying distributions for the follow-ing reasons The standard deviations of the old distribu-tions are predicted to be larger than those of the new dis-tributions because more nodes are activated in the olddistributions The standard deviations of the easy classdistributions are predicted to be larger than the standarddeviations of the difficult class distributions because therecognition strength is normalized by the itemrsquos diffi-culty estimated from the standard deviation of the net in-puts This is consistent with the empirical data

In contrast the attention-likelihood theory does notconstrain the old distribution to be larger than the newdistribution for the difficult class (it can be larger orsmaller depending on parameter settings) The variancetheory allows the following two orders of the standarddeviations ss(BN) lt ss(BO) lt ss(AN) lt ss(AO) andss(BN) lt ss(AN) lt ss(BO) lt ss(AO) The first order isthe more common although the second order occurs oc-casionally (see eg Glanzer et al 1993 Experiment 1)In addition the attention-likelihood theory allowsss(BO) lt ss(BN) lt ss(AN) lt ss(AO) according to Equa-tion 2 which to the authorrsquos knowledge has not beenfound in empirical data

The variance theory predicts that strength variablessuch as study time repetition and study instructions af-fect the expected net input For example increasing studytime will increase the net input that improves the hit rateIncreasing the net inputs also causes an increase in theactivation threshold that diminishes the false alarm rates

The variance theory has no (ie lateral) connectionswithin the item layer and there are no connections with-in the context layer Including intraitem connections inrecognition makes it impossible to tell whether therecognition strength comes from encoding during thelearning episode or from another preexperimental learn-ing episode Thus there would be a confounding be-tween the itemrsquos frequency and recognition strength Forexample if the recognition strength in the variance the-ory included intraitem connections and used a constantrecognition criterion it would predict a higher hit rateand a higher false alarm rate for high-frequency itemsthan for low-frequency items Thus the hit rate for high-frequency words would be larger than that for low-frequency words which is contrary to the data on the mir-ror effect This issue is called the frequencyrecognitionndashstrength confounding problem Other models may bevulnerable to this problem depending on their specificassumptions The variance theory is immune from thisproblem because recognition strength is based on the as-sociation between the context and the item that yields apure measurement of the strength of the target in a par-

ticular episode Net inputs within the item population arenot used because these connections are highly corre-lated with the frequency of the item

This frequencyrecognition-strength confounding prob-lem may be relevant to several distributed models thatassume that recognition strength increases with frequencythus making the false alarm rate higher for high- than forlow-frequency stimuli This is often implemented in dis-tributed models by including intraitem associations inthe measurement of recognition strength Thus whenintraitem and itemndashcontext associations are added it isnot possible to know whether the intraitem strength oc-curs because an item has been encoded in the to-be-retrieved-from list or at another episode

Although the intraitem associations are not used tomeasure recognition strength they may play an impor-tant role in recognition In the first step of recognitionthese associations may be used for deblurring unclear in-formation in the item cue (a similar mechanism occursfor the context cue) Arguably this deblurring mecha-nism works well for well-known words however fornonwords it is much more likely to fail Such failure willactivate features that were not active in the encoded rep-resentation It is more likely that these wrongly activatednodes represent high-frequency features because it ismore likely to converge on high-frequency features Thereare two interesting implications of this perspective Firstthe wrongly activated nodes will use the wrong connec-tions between the context and the item Second becausethe wrongly activated nodes represent high-frequencyfeatures the average variability will be larger for non-words than for words This is a plausible account of thepoor recognition performance with nonwords as com-pared with words It is also a tentative account of why non-words can be seen as a difficult class with a higher vari-ability than that for words in the variance theory Howeverfurther work is needed before any firm conclusion can bemade regarding this aspect of the theory

A problem similar to frequencyrecognition-strengthconfounding occurs if within-context connections areused at recognition If the context is temporally corre-lated the within-context connections are influenced bylist length This causes a confounding between list lengthand recognition strength This issue is called the list-lengthrecognition-strength confounding problem Othermodels may be vulnerable to this problem depending ontheir specific assumptions

Another issue is whether the variance theory can ac-count for the mirror effect found in abstract and concretewords where words from a concrete class are more eas-ily discriminated than words from an abstract class Thevariance theory can account for this given the assump-tion that the variance of the net input is larger for abstractthan for concrete words However at this point it is notcompletely clear how this assumption can be motivatedA possibility is that although these two classes areequated for word frequency the contexts associated withan abstract word are more variable than the contexts as-

VARIANCE THEORY OF MIRROR EFFECT 437

sociated with a concrete word This larger variability incontext for abstract words may lead to a larger variabil-ity in the net input Another possibility is that the activefeatures in abstract words are more general and there-fore associated with more contexts Nodes active in con-crete words may represent more specific features acti-vated with a lower frequency and therefore associatedwith fewer contexts Thus features in abstract wordsmay be of a higher frequency than features in concretewords although the frequencies of the items are thesame This would lead to a mirror effect in the modelHowever at this point no claim is made that variancetheory can handle this phenomenon

The variance theoryrsquos account of the list-length andlist-strength effects is arguably much simpler than thesubjective-likelihoodrsquos account The activation thresholdis set so that on average half of the nodes active duringencoding are active during recognition The activationthreshold is constant within one condition but may changebetween conditionsmdashfor example when study time ischanged Furthermore subjects do not need to estimatethe list length and the probability that a particular itemis drawn from the study list

The variance theory has advantages over the attention-likelihood theory As was discussed in more detail abovethe attention-likelihood theory requires subjects to clas-sify the stimulus Depending on this classification thesubjects must know (consciously or unconsciously) howmuch attention is paid to the stimuli in order to calculatethe log-likelihood ratios Thus the yesndashno decision isbased on the subjectsrsquo classification of the stimuli Thevariance theory does not require a classification of thestimuli During the calculation of recognition strengththe standard deviation of the net input of the item is used(shcent ) so the subject does not need to know the class or thestandard deviation of the class (sh) The increase in thehit rate and decrease in the false alarm rate for the easierclass occurs according to the theory because the vari-ance is smaller for the easier class The variance theoryhas a simple mathematical base subjects count the num-ber of activated nodes minus the expected value dividedby the standard deviation of the net input of the item Anexplicit solution is presented that requires only calculat-ing the probabilities of two z-transformations

The subjective-likelihood theory uses feature contentof the items for addressing issues regarding item similar-ity and word frequency In particular high-frequency wordsare assumed to have a higher noise or variability than dolow-frequency words The variance theory also usesvariability that depends on frequency However the vari-ance theory simulates the increase in variance duringeach presentation of a feature in different contexts thusmaking it an unavoidable phenomenon for high-frequencyfeatures In the subjective-likelihood theory the featurevariance is introduced as an assumption or a constantand it is not explicitly simulated

There are several other differences between the vari-ance theory and the subjective-likelihood theory The

subjective-likelihood theory is based on log-likelihoodratios In the variance theory log-likelihood ratios arenot necessary to account of the mirror effect and for z-ROC curves Instead the variance theory uses the num-ber of active nodes normalized by the variance of the netinput as the measurement of recognition strength

Another difference is the use of one detector for eachitem in the subjective-likelihood theory This makes themodel essentially local whereas the variance theory isdistributed This property may cause difficulties in im-plementing the model in a connectionist network How-ever see McClelland and Chappell (1998) for a brief dis-cussion of this topic An implementation of the variancetheory is straightforward

REFERENCES

Feller W (1968) An introduction to probability theory and its appli-cation New York Wiley

Gillund G amp Shiffrin R M (1984) A retrieval model for bothrecognition and recall Psychological Review 91 1-67

Glanzer M amp Adams J K (1985) The mirror effect in recognitionmemory Memory amp Cognition 13 8-20

Glanzer M amp Adams J K (1990) The mirror effect in recognitionmemory Data and theory Journal of Experimental PsychologyLearning Memory amp Cognition 16 5-16

Glanzer M Adams J K amp Kim K (1993) The regularities ofrecognition memory Psychological Review 100 546-567

Glanzer M amp Bowles N (1976) Analysis of the word frequencyeffect in recognition memory Journal of Experimental PsychologyHuman Learning amp Memory 2 21-31

Glanzer M Kisok K amp Adams J K (1998) Response distribu-tions as an explanation of the mirror effect Journal of ExperimentalPsychology Learning Memory amp Cognition 24 633-644

Greene R L (1996) Mirror effect in order and associative informa-tion Role of response strategies Journal of Experimental Psychol-ogy Learning Memory amp Cognition 22 687-695

Hertz J Krogh A amp Palmer R G (1991) Introduction to the the-ory of neural computation Reading MA Addison-Wesley

Hintzman D L (1988) Judgment of frequency and recognition memoryin a multiple trace memory model Psychological Review 95 528-551

Hopfield J J (1982) Neural networks and physical systems withemergent collective computational abilities Proceedings of the Na-tional Academy of Sciences 79 2554-2558

Hopfield J J (1984) Neurons with graded response have collectivecomputational properties like those of two-state neurons Proceed-ings of the National Academy of Sciences 81 3088-3092

Humphreys M S Bain J D amp Pike R (1989) Different way to cuea coherent memory system A theory for episodic semantic and pro-cedural tasks Psychological Review 96 208-233

Kim K amp Glanzer M (1993) Speed versus accuracy instructionsstudy time and the mirror effect Journal of Experimental Psychol-ogy Learning Memory amp Cognition 19 638-652

Kruschke J K (1992) ALCOVE An exemplar-based connectionistmodel of category learning Psychological Review 99 22-44

Ku Iumlcera H amp Francis W N (1967) Computational analysis ofpresent-day American English Providence RI Brown UniversityPress

Lewandowsky S (1991) Gradual unlearning and catastrophic inter-ference A comparison of distributed architectures In W E Hockleyamp S Lewandowsky (Eds) Relating theory and data Essays onhuman memory in honor of Bennet B Murdock (pp 445-476) Hills-dale NJ Erlbaum

Li S-C amp Lindenberger U (1999) Cross-level unification A com-putational exploration of the link between deterioration of neuro-transmitter systems and dedifferentiation of cognitive abilities in oldage In L-G Nilsson amp H J Markowitsch (Eds) Cognitive neuro-sciences of memory (pp 103-146) Seattle Hogrefe amp Huber

438 SIKSTROM

Li S-C Lindenberger U amp Frensch P A (2000) Unifying cog-nitive aging From neuromodulation to representation to cognitionNeurocomputing 32-33 879-890

McClelland J L amp Chappell M (1998) Familiarity breeds dif-ferentiation A subjective-likelihood approach to the effects of expe-rience in recognition memory Psychological Review 105 724-760

Metcalfe J (1982) A composite holographic associative recallmodel Psychological Review 89 627-658

Murdock B B (1982) A theory for the storage and retrieval of item andassociative information Psychological Review 89 609-626

Murdock B B (1998) The mirror effect and attentionndashlikelihoodtheory A reflective analysis Journal of Experimental PsychologyLearning Memory amp Cognition 24 524-534

Okada M (1996) Notions of associative memory and sparse codingNeural Networks 9 1429-1458

Ratcliff R Sheu C-F amp Gronlund S D (1992) Testing globalmemory models using ROC curves PsychologicalReview 99 518-535

Rumelhart D E Hinton G E amp Williams R J (1986) Learn-ing representation by backpropagating errors Nature 323 533-536

Shiffrin R M amp Steyvers M (1997) A model for recognitionmemory REMndashretrieving effectively from memory PsychonomicBulletin amp Review 4 145-166

Sikstroumlm S (1996a) TECO A connectionist theory of successiveepisodic tests Umearing Doctoral dissertation Umearing University (ISBN91-7191-155-3)

Sikstroumlm S (1996b) The TECO connectionist theory of recognitionfailure European Journal of Cognitive Psychology 8 341-380

Sikstroumlm S (1999) Power function forgetting curves as an emergentproperty of biologically plausible neural networks model Interna-tional Journal of Psychology 34 460-464

Stretch V amp Wixted J T (1998a) Decision rules for recognitionmemory confidence judgments Journal of Experimental Psychol-ogy Learning Memory amp Cognition 24 1397-1410

Stretch V amp Wixted J T (1998b) On the difference betweenstrength-based and frequency-based mirror effects in recognitionmemory Journal of Experimental Psychology Learning Memoryamp Cognition 24 1379-1396

NOTE

1 The expected number of nodes active at recognition is a2 giventhat the standard deviations of the percentages of active nodes for new[sp(N)] and old items [sp(O)] are equal However if the standard devi-ations are different the expected number of active nodes will be

Because the new variance is smaller than the old variance this valuewill be slightly less than a2 typically around 044a if the ROC curveis 08 It is actually more precise to use this value In this paper the ap-proximation a2 is used except that in the simulations where the ex-pression above is used

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input solves the ex-pected value of the net input for old items In this Appendix thesuperscripts representing the item layer (d) and the contextlayer (c) are dropped Only the active states of activation con-tribute to the net input so jj 5 jj = 1

(A1)

The expected value of the net inputs for the new items iszero

mh(N) 5 0 (A2)

The expected value of the weights is zero The variance ofthe net input is calculated by squaring each term in the netinput Let P represent the number of patterns stored in the net-work If the patterns are uncorrelated or a new item is pre-sented the variance of the net input is

(A3)

For an old item the variance of the net input to the contextlayer from the item layer depends on the frequency of the item( f ) All the aN weights supporting a context node contribute tothe net input in the same direction

(A4)

The same function can be used for calculating the varianceof the net input to a node in the item layer when the same con-text is associated with several items Let the same context be as-sociated with L items in a list Furthermore let the context be-tween different lists be uncorrelated The variance of the netinput to an item node is

The expected variance for all nodes is the weighted sum ofthese two terms Half of the nodes are context nodes and halfof the nodes are item nodes Therefore the expected varianceof the net input to all nodes given an item with a frequency off and a list length of L in a network where p patterns have beenencoded is

(A5)

Performance in the Variance Model When AllNodes Are Used

In the variance model recognition strength is based on nodesactive at encoding However if recognition strength is based onall the nodes performance can be calculated as follows The d centis calculated by using Equation 8 where Pc(O) and Pc(N) are thenumber of correct nodes The number of correct nodes is thenumber of nodes active at retrieval that also is active at encod-ing (ie calculated as usual with Equation 7) plus the numberof inactive nodes at encoding that also are inactive at retrievalThe latter value can be calculated by replacing a with 1 a inEquation 7 and using the expected value of old net inputs for in-active nodes a2 (1 a) N (compare with Equation A1)

(Manuscript received February 9 1999revision accepted for publication October 30 2000)

s h f L a N p a a N2 3 22 1O( ) = +( ) + -( )

s h f a a LN2 4 2 21( ) = -( )

s x x x xh ij jj

N

i j ji

Nf f f a a

a a f N

22 2

4 2 21

( ) = waring aringaelig

egraveccedilouml

oslashdivide= -( ) -( )eacute

eumlecirc

= -( )

s x x x xh ij jj

N

i j j

P

i

NN a a

a a a a a a a

a a a a PN a a PN

2 2 2

2

3

1 1 2 0 1

0 0 1 1

( ) = w

= [( ) ( )] + [( )(1- )] ( )

+ [( )( )] ( ) = ( )

2 2

2 2 2

( ) = -( ) -( ) - - - -

- - - -

aring aringaring

m x x x xh ijj

N

j i j jj

Na a a a N(O) = wDaring aring= -( ) -( ) = -( )1

2

ss s

p

p p

a(N)

( )N (O)+

436 SIKSTROM

bution than for the high-frequency distribution This wastrue for the new and the old distributions both when at-tention was paid to high-frequency words and when at-tention was divided equally between the two classes (ex-cept in the new frequency control condition where thestandard deviations were equal)

The variance theory predicts the order of the standarddeviations of the underlying distributions for the follow-ing reasons The standard deviations of the old distribu-tions are predicted to be larger than those of the new dis-tributions because more nodes are activated in the olddistributions The standard deviations of the easy classdistributions are predicted to be larger than the standarddeviations of the difficult class distributions because therecognition strength is normalized by the itemrsquos diffi-culty estimated from the standard deviation of the net in-puts This is consistent with the empirical data

In contrast the attention-likelihood theory does notconstrain the old distribution to be larger than the newdistribution for the difficult class (it can be larger orsmaller depending on parameter settings) The variancetheory allows the following two orders of the standarddeviations ss(BN) lt ss(BO) lt ss(AN) lt ss(AO) andss(BN) lt ss(AN) lt ss(BO) lt ss(AO) The first order isthe more common although the second order occurs oc-casionally (see eg Glanzer et al 1993 Experiment 1)In addition the attention-likelihood theory allowsss(BO) lt ss(BN) lt ss(AN) lt ss(AO) according to Equa-tion 2 which to the authorrsquos knowledge has not beenfound in empirical data

The variance theory predicts that strength variablessuch as study time repetition and study instructions af-fect the expected net input For example increasing studytime will increase the net input that improves the hit rateIncreasing the net inputs also causes an increase in theactivation threshold that diminishes the false alarm rates

The variance theory has no (ie lateral) connectionswithin the item layer and there are no connections with-in the context layer Including intraitem connections inrecognition makes it impossible to tell whether therecognition strength comes from encoding during thelearning episode or from another preexperimental learn-ing episode Thus there would be a confounding be-tween the itemrsquos frequency and recognition strength Forexample if the recognition strength in the variance the-ory included intraitem connections and used a constantrecognition criterion it would predict a higher hit rateand a higher false alarm rate for high-frequency itemsthan for low-frequency items Thus the hit rate for high-frequency words would be larger than that for low-frequency words which is contrary to the data on the mir-ror effect This issue is called the frequencyrecognitionndashstrength confounding problem Other models may bevulnerable to this problem depending on their specificassumptions The variance theory is immune from thisproblem because recognition strength is based on the as-sociation between the context and the item that yields apure measurement of the strength of the target in a par-

ticular episode Net inputs within the item population arenot used because these connections are highly corre-lated with the frequency of the item

This frequencyrecognition-strength confounding prob-lem may be relevant to several distributed models thatassume that recognition strength increases with frequencythus making the false alarm rate higher for high- than forlow-frequency stimuli This is often implemented in dis-tributed models by including intraitem associations inthe measurement of recognition strength Thus whenintraitem and itemndashcontext associations are added it isnot possible to know whether the intraitem strength oc-curs because an item has been encoded in the to-be-retrieved-from list or at another episode

Although the intraitem associations are not used tomeasure recognition strength they may play an impor-tant role in recognition In the first step of recognitionthese associations may be used for deblurring unclear in-formation in the item cue (a similar mechanism occursfor the context cue) Arguably this deblurring mecha-nism works well for well-known words however fornonwords it is much more likely to fail Such failure willactivate features that were not active in the encoded rep-resentation It is more likely that these wrongly activatednodes represent high-frequency features because it ismore likely to converge on high-frequency features Thereare two interesting implications of this perspective Firstthe wrongly activated nodes will use the wrong connec-tions between the context and the item Second becausethe wrongly activated nodes represent high-frequencyfeatures the average variability will be larger for non-words than for words This is a plausible account of thepoor recognition performance with nonwords as com-pared with words It is also a tentative account of why non-words can be seen as a difficult class with a higher vari-ability than that for words in the variance theory Howeverfurther work is needed before any firm conclusion can bemade regarding this aspect of the theory

A problem similar to frequencyrecognition-strengthconfounding occurs if within-context connections areused at recognition If the context is temporally corre-lated the within-context connections are influenced bylist length This causes a confounding between list lengthand recognition strength This issue is called the list-lengthrecognition-strength confounding problem Othermodels may be vulnerable to this problem depending ontheir specific assumptions

Another issue is whether the variance theory can account for the mirror effect found with abstract and concrete words, where words from a concrete class are more easily discriminated than words from an abstract class. The variance theory can account for this given the assumption that the variance of the net input is larger for abstract than for concrete words. However, at this point it is not completely clear how this assumption can be motivated. One possibility is that, although these two classes are equated for word frequency, the contexts associated with an abstract word are more variable than the contexts associated with a concrete word. This larger variability in context for abstract words may lead to a larger variability in the net input. Another possibility is that the active features in abstract words are more general and are therefore associated with more contexts. Nodes active in concrete words may represent more specific features, activated with a lower frequency and therefore associated with fewer contexts. Thus, features in abstract words may be of a higher frequency than features in concrete words, although the frequencies of the items are the same. This would lead to a mirror effect in the model. However, at this point, no claim is made that the variance theory can handle this phenomenon.

The variance theory's account of the list-length and list-strength effects is arguably much simpler than the subjective-likelihood account. The activation threshold is set so that, on average, half of the nodes active during encoding are active during recognition. The activation threshold is constant within one condition but may change between conditions (e.g., when study time is changed). Furthermore, subjects do not need to estimate the list length or the probability that a particular item is drawn from the study list. The threshold rule is sketched below.
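A sketch of this calibration rule; the net-input distribution used here is an arbitrary stand-in, chosen only to show that the rule needs no knowledge of list composition.

    import numpy as np

    rng = np.random.default_rng(2)
    h_encoded = rng.normal(0.1, 0.05, size=1000)  # net inputs of nodes active at encoding
    threshold = np.median(h_encoded)              # set so that half of them remain active
    print((h_encoded > threshold).mean())         # ~0.5 by construction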

The variance theory also has advantages over the attention-likelihood theory. As was discussed in more detail above, the attention-likelihood theory requires subjects to classify the stimulus. Depending on this classification, the subjects must know (consciously or unconsciously) how much attention is paid to the stimuli in order to calculate the log-likelihood ratios. Thus, the yes-no decision is based on the subjects' classification of the stimuli. The variance theory does not require a classification of the stimuli. During the calculation of recognition strength, the standard deviation of the net input of the item (σ'_h) is used, so the subject does not need to know the class or the standard deviation of the class (σ_h). According to the theory, the increase in the hit rate and the decrease in the false alarm rate for the easier class occur because the variance is smaller for the easier class. The variance theory has a simple mathematical basis: Subjects count the number of activated nodes, subtract the expected value, and divide by the standard deviation of the net input of the item. An explicit solution is presented that requires only calculating the probabilities of two z-transformations. The sketch below illustrates how class-dependent variance alone produces the mirror ordering.
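The following schematic uses assumed values (the criterion, the old-item strength shift, and the class standard deviations are arbitrary), and it simplifies the theory by giving old and new items the same spread within a class. With a single fixed criterion, the class with the smaller variance yields both a higher hit rate and a lower false alarm rate, the mirror ordering P(AN) < P(BN) < P(BO) < P(AO).

    from math import erfc, sqrt

    def upper_tail(z):
        # P(Z > z) for a standard normal variable, i.e., 1 - Phi(z)
        return 0.5 * erfc(z / sqrt(2.0))

    mu_old, criterion = 1.0, 0.5    # old-item strength shift and fixed yes-no criterion
    sigma = {"A (easy)": 0.8, "B (difficult)": 1.2}  # class standard deviations (assumed)

    for name, s in sigma.items():
        p_new = upper_tail(criterion / s)             # false alarm rate: strength ~ N(0, s)
        p_old = upper_tail((criterion - mu_old) / s)  # hit rate: strength ~ N(mu_old, s)
        print(f"{name}:  P(yes | new) = {p_new:.3f}   P(yes | old) = {p_old:.3f}")

Running this prints approximately .27/.73 for the easy class and .34/.66 for the difficult class: each yes-no probability is a single z-transformation, as described above.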

The subjective-likelihood theory uses the feature content of the items to address issues regarding item similarity and word frequency. In particular, high-frequency words are assumed to have higher noise, or variability, than low-frequency words. The variance theory also uses variability that depends on frequency. However, the variance theory simulates the increase in variance that occurs with each presentation of a feature in a different context, thus making it an unavoidable phenomenon for high-frequency features. In the subjective-likelihood theory, the feature variance is introduced as an assumption, or a constant, and it is not explicitly simulated.

There are several other differences between the variance theory and the subjective-likelihood theory. The subjective-likelihood theory is based on log-likelihood ratios. In the variance theory, log-likelihood ratios are not necessary to account for the mirror effect or for z-ROC curves; instead, the variance theory uses the number of active nodes, normalized by the variance of the net input, as the measurement of recognition strength.

Another difference is the use of one detector for each item in the subjective-likelihood theory. This makes the model essentially local, whereas the variance theory is distributed. This property may cause difficulties in implementing the subjective-likelihood model in a connectionist network (see McClelland & Chappell, 1998, for a brief discussion of this topic). An implementation of the variance theory is straightforward.

REFERENCES

Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.

Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13, 8-20.

Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 16, 5-16.

Glanzer, M., Adams, J. K., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Glanzer, M., & Bowles, N. (1976). Analysis of the word-frequency effect in recognition memory. Journal of Experimental Psychology: Human Learning & Memory, 2, 21-31.

Glanzer, M., Kim, K., & Adams, J. K. (1998). Response distributions as an explanation of the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 633-644.

Greene, R. L. (1996). Mirror effect in order and associative information: Role of response strategies. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 687-695.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528-551.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208-233.

Kim, K., & Glanzer, M. (1993). Speed versus accuracy instructions, study time, and the mirror effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 638-652.

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.

Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 445-476). Hillsdale, NJ: Erlbaum.

Li, S.-C., & Lindenberger, U. (1999). Cross-level unification: A computational exploration of the link between deterioration of neurotransmitter systems and dedifferentiation of cognitive abilities in old age. In L.-G. Nilsson & H. J. Markowitsch (Eds.), Cognitive neuroscience of memory (pp. 103-146). Seattle: Hogrefe & Huber.

Li, S.-C., Lindenberger, U., & Frensch, P. A. (2000). Unifying cognitive aging: From neuromodulation to representation to cognition. Neurocomputing, 32-33, 879-890.

McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105, 724-760.

Metcalfe, J. (1982). A composite holographic associative recall model. Psychological Review, 89, 627-658.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609-626.

Murdock, B. B. (1998). The mirror effect and attention-likelihood theory: A reflective analysis. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 524-534.

Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429-1458.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM, retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145-166.

Sikström, S. (1996a). TECO: A connectionist theory of successive episodic tests. Doctoral dissertation, Umeå University. (ISBN 91-7191-155-3)

Sikström, S. (1996b). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341-380.

Sikström, S. (1999). Power function forgetting curves as an emergent property of biologically plausible neural network models. International Journal of Psychology, 34, 460-464.

Stretch, V., & Wixted, J. T. (1998a). Decision rules for recognition memory confidence judgments. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1397-1410.

Stretch, V., & Wixted, J. T. (1998b). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 24, 1379-1396.

NOTE

1. The expected number of nodes active at recognition is a/2, given that the standard deviations of the percentages of active nodes for new [σ_p(N)] and old [σ_p(O)] items are equal. However, if the standard deviations are different, the expected number of active nodes will be

a\,\sigma_p(\mathrm{N}) \big/ \left[\sigma_p(\mathrm{N}) + \sigma_p(\mathrm{O})\right].

Because the new variance is smaller than the old variance, this value will be slightly less than a/2, typically around 0.44a if the slope of the z-ROC curve is 0.8. It is actually more precise to use this value. In this paper, the approximation a/2 is used, except in the simulations, where the expression above is used.
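As a worked check of the quoted value, with a z-ROC slope of \sigma_p(\mathrm{N})/\sigma_p(\mathrm{O}) = 0.8, the expression gives

a\,\sigma_p(\mathrm{N}) \big/ \left[\sigma_p(\mathrm{N}) + \sigma_p(\mathrm{O})\right] = 0.8a/(0.8 + 1) \approx 0.44a.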

APPENDIX

The Expected Value of the Net Input and the Variance of the Net Input

Inserting the weight changes into the net input gives the expected value of the net input for old items. In this Appendix, the superscripts representing the item layer (d) and the context layer (c) are dropped. Only the active states of activation contribute to the net input, so \xi_i = \xi_j = 1:

\mu_h(\mathrm{O}) = \sum_{j}^{N} \Delta w_{ij}\,\xi_i \xi_j = \sum_{j}^{N} (\xi_i - a)(\xi_j - a) = a(1-a)^2 N. \qquad (A1)

The expected value of the net inputs for the new items is zero:

\mu_h(\mathrm{N}) = 0. \qquad (A2)

The expected value of the weights is zero. The variance of the net input is calculated by squaring each term in the net input. Let P represent the number of patterns stored in the network. If the patterns are uncorrelated, or a new item is presented, the variance of the net input is

\sigma_h^2(\mathrm{N}) = \sum_{p}^{P} \sum_{j}^{N} E\!\left[w_{ij}^2\right] \xi_j = \big\{ [(1-a)(1-a)]^2 a^2 + [(1-a)(0-a)]^2\, 2a(1-a) + [(0-a)(0-a)]^2 (1-a)^2 \big\}\, aPN = a^3 (1-a)^2 P N. \qquad (A3)

For an old item, the variance of the net input to the context layer from the item layer depends on the frequency of the item (f). All the aN weights supporting a context node contribute to the net input in the same direction, so each of the f presentations contributes the single-pattern variance multiplied by the aN coherently acting weights:

\sigma_h^2(f) = f\, a^4 (1-a)^2 N^2. \qquad (A4)

The same function can be used for calculating the variance of the net input to a node in the item layer when the same context is associated with several items. Let the same context be associated with L items in a list. Furthermore, let the contexts of different lists be uncorrelated. The variance of the net input to an item node is

\sigma_h^2(L) = L\, a^4 (1-a)^2 N^2.

The expected variance for all nodes is the weighted sum of these two terms. Half of the nodes are context nodes, and half of the nodes are item nodes. Therefore, the expected variance of the net input to all nodes, given an item with a frequency of f and a list length of L in a network where P patterns have been encoded, is

\sigma_h^2(\mathrm{O}) = \frac{f + L}{2}\, a^4 (1-a)^2 N^2 + a^3 (1-a)^2 P N. \qquad (A5)

Performance in the Variance Model When All Nodes Are Used

In the variance model, recognition strength is based on the nodes that are active at encoding. However, if recognition strength is instead based on all the nodes, performance can be calculated as follows. The d' is calculated by using Equation 8, where P_c(O) and P_c(N) are the numbers of correct nodes. The number of correct nodes is the number of nodes active at retrieval that were also active at encoding (i.e., calculated as usual with Equation 7), plus the number of nodes inactive at encoding that are also inactive at retrieval. The latter value can be calculated by replacing a with 1 - a in Equation 7 and using the expected value of the old net inputs for inactive nodes, -a^2(1-a)N (compare with Equation A1). A small Monte Carlo check of Equations A1 and A3 follows.
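The sketch below is a numerical check, not the paper's simulation; the layer size, activity level, and number of stored pairs are assumed values, and the weights follow the Hebbian item-context rule used throughout. The simulated statistics should approximate the values predicted by Equations A1 and A3 as reconstructed above, up to sampling error.

    import numpy as np

    rng = np.random.default_rng(3)
    N, a, P = 400, 0.1, 50                      # layer size, activity level, stored pairs
    items    = (rng.random((P, N)) < a).astype(float)
    contexts = (rng.random((P, N)) < a).astype(float)
    # Hebbian item -> context weights, one increment per stored (item, context) pair
    W = sum(np.outer(c - a, x - a) for c, x in zip(contexts, items))

    new_item = (rng.random(N) < a).astype(float)           # an unstudied item
    print("sigma_h^2(N):", round(float((W @ new_item).var()), 1),
          " predicted:", a**3 * (1 - a)**2 * P * N)        # Equation A3

    # mean net input on the context nodes that were active when each old item was encoded
    mu_sim = np.mean([(W @ items[p])[contexts[p] == 1].mean() for p in range(P)])
    print("mu_h(O):", round(float(mu_sim), 1), " predicted:", a * (1 - a)**2 * N)   # Equation A1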

(Manuscript received February 9, 1999; revision accepted for publication October 30, 2000.)

