Averaged prediction errors in a probabilistic reward task. (a) DA response in trials with different reward probabilities. Population peri-stimulus time histograms (PSTHs) show the summed spiking activity of several DA neurons over many trials for each p_r, pooled over rewarded and unrewarded trials at the intermediate probabilities. (b) TD prediction error with asymmetric scaling. In the simulated task, one of five stimuli was chosen at random in each trial and displayed at time t = 5. The stimulus was turned off at t = 25, at which time a reward was given with the probability p_r specified by the stimulus. We used a tapped delay-line representation of the stimuli (see text), with each stimulus represented by a different set of units ('neurons'). The TD error was δ(t) = r(t) + w(t - 1)·x(t) - w(t - 1)·x(t - 1), with r(t) the reward at time t, and x(t) and w(t) the state and weight vectors for the unit. A standard online TD learning rule was used with a fixed learning rate α, w(t) = w(t - 1) + αδ(t)x(t - 1), so each weight came to represent an expected future reward value. As in Fiorillo et al., we depict the prediction error δ(t) averaged over many trials, after the task had been learned. The representational asymmetry arises because negative values of δ(t) were scaled by d = 1/6 prior to summation into the simulated PSTH, although learning itself proceeded according to the unscaled errors. Finally, to account for the small positive responses seen in (a) at the time of the stimulus for p_r = 0 and at the time of the (predicted) reward for p_r = 1, we assumed a small (8%) chance that a predictive stimulus is misidentified. (c) DA response in p_r = 0.5 trials, separated into rewarded (left) and unrewarded (right) trials. (d) TD model of (c). (a,c) Reprinted with permission, ©2003 AAAS. Permission from AAAS is required for all other uses.
Niv et al. Behavioral and Brain Functions 2005 1:6 doi:10.1186/1744-9081-1-6
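The simulation described in panel (b) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: trial length (T = 30), number of training trials, and the learning rate value are arbitrary choices, and the 8% stimulus-misidentification noise mentioned in the caption is omitted for brevity. The tapped delay line is implemented as one one-hot unit per time step since stimulus onset, with a separate weight vector per stimulus; learning uses the unscaled TD error, and only the displayed average applies the asymmetric scaling d = 1/6 to negative errors.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 30                    # time steps per trial (assumed)
T_STIM, T_REW = 5, 25     # stimulus onset and (probabilistic) reward time
P_R = [0.0, 0.25, 0.5, 0.75, 1.0]  # reward probability cued by each stimulus
ALPHA = 0.1               # fixed learning rate (assumed value)
D = 1.0 / 6.0             # asymmetric scaling of negative errors (display only)

# tapped delay line: one weight per (stimulus, time-since-onset) unit
w = np.zeros((len(P_R), T))

def run_trial(s, learn=True):
    """Run one trial with stimulus s; return the per-timestep TD error."""
    delta = np.zeros(T)
    x_prev = np.zeros(T)
    for t in range(T):
        x = np.zeros(T)
        if t >= T_STIM:                  # delay-line unit t - T_STIM is active
            x[t - T_STIM] = 1.0
        r = 1.0 if (t == T_REW and rng.random() < P_R[s]) else 0.0
        # delta(t) = r(t) + w·x(t) - w·x(t-1), as in the caption
        delta[t] = r + w[s] @ x - w[s] @ x_prev
        if learn:                        # learning uses the UNscaled error
            w[s] += ALPHA * delta[t] * x_prev
        x_prev = x
    return delta

# learn the task: a randomly chosen stimulus on each trial
for _ in range(5000):
    run_trial(int(rng.integers(len(P_R))))

def display_psth(s, n=2000):
    """Average delta over trials, scaling negative values by D as in panel (b)."""
    acc = np.zeros(T)
    for _ in range(n):
        d = run_trial(s, learn=False)
        acc += np.where(d < 0.0, D * d, d)
    return acc / n
```

After training, the learned onset response w[s][0] (which equals δ at t = 5) increases with p_r, and because negative errors are shrunk by d = 1/6 before averaging, the mean trace at the reward time is net positive for intermediate probabilities such as p_r = 0.5, even though the signed errors average to zero.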