A few years back (probably a decade back, by now?) I wrote some software to help me memorize Japanese vocabulary, by doing time-spaced repetition. And it was also an excuse to play around with Ruby and with Rails.

I’ve been using that software ever since: sometimes a little more diligently and sometimes a little less diligently, but always well enough to let me more-or-less keep up with things. But, in recent years, it had gotten to be a bit much, and I’d started falling behind in my reviews; some of that was because I’m not spending as much time on Japanese as I once was, but that didn’t feel like the whole reason to me. I might not have been reviewing quite as frequently as I had been, but I also wasn’t adding new words nearly as frequently as I had been, so if anything I should have been asked to review fewer words each day. But there were just some words that kept on coming back over and over, and doing so more often than felt necessary to me. So I finally decided that I should do something about that.

 

The basic assumption that I’d made was that I should space the review of each item along an exponential curve, but that different items needed different exponential curves. I’d start them off at a factor of 2.5, but I’d automatically adjust them if I got them wrong too often, with the most gradual increase using a factor of 1.3. And my goal was to get each item right 90% of the time; I implemented that by increasing the factor by 0.1 on a streak of 10 and decreasing it by 0.1 (unless I was already at the 1.3 limit) when I got it wrong.
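To make that rule concrete, here’s a minimal Ruby sketch of the scheme as described above. It isn’t the actual code: the class and method names are made up for illustration, the one-day base interval is an assumption, and so is the reading that the bonus applies at every multiple of 10 rather than only at the first streak of 10.

```ruby
# Hypothetical sketch of the original scheduling rule (not the real code).
class OriginalItem
  INITIAL_FACTOR = 2.5
  MIN_FACTOR = 1.3
  STEP = 0.1
  BONUS_EVERY = 10 # assumption: bonus at streaks of 10, 20, 30, ...

  attr_reader :factor, :streak

  def initialize
    @factor = INITIAL_FACTOR
    @streak = 0
  end

  # Days until the next review, growing exponentially with the streak.
  # Assumes a base interval of one day.
  def interval_days
    @factor**@streak
  end

  def answer(correct)
    if correct
      @streak += 1
      # Original behavior: bump the factor as soon as the streak is reached.
      @factor = (@factor + STEP).round(1) if (@streak % BONUS_EVERY).zero?
    else
      @streak = 0
      # Never go below the most gradual curve.
      @factor = [(@factor - STEP).round(1), MIN_FACTOR].max
    end
  end
end
```

With this shape, a word that keeps getting missed walks down from 2.5 to 1.3 in twelve wrong answers and then stays there, which is the "stuck at 1.3" behavior.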

The core idea still felt right, but it also seemed like the details of the automatic adjustment weren’t correct; in particular, too many words were getting stuck at the 1.3 factor.

 

Not sure if I’ve got the chronology right here, but the first thing I wanted to tackle was new words that quickly hit the 1.3 limit: it seemed like that was happening too often. I felt like part of the issue there was that maybe it took a little while for a word to get in my brain the first time; and part of the issue there was that I didn’t reliably make it through the whole list of items to review every day, so it might take multiple days for me to get back to a word; that’s a problem if the algorithm says I should review it every day!

I ultimately made two tweaks there. One is that I simply wouldn’t count wrong answers if they were on a streak of 0. I think I was already doing that some of the time, but not for brand new vocabulary words? So this actually simplified my code, which is nice. (And it helped for unrelated reasons: idempotency helps a lot with error conditions.)

But also, I decided that, if I got a word right and then got it wrong the next time I saw the word, then I wouldn’t decrease the factor: in practice, that “correct” answer often meant that I more or less had the right idea but it wasn’t really in my short-term memory? And, again, the short streak repeats are vulnerable to problems if I’m not clearing things out every day.
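Put together, the two tweaks amount to something like the following Ruby sketch. The method name and the way state is passed around are assumptions for illustration, not the real code; only the 1.3 floor and the 0.1 step come from the description above.

```ruby
MIN_FACTOR = 1.3
STEP = 0.1

# Returns the [factor, streak] pair after a wrong answer.
# Hypothetical helper: the real code may be structured quite differently.
def after_wrong_answer(factor, streak)
  # Tweak 1: a wrong answer on a streak of 0 doesn't count at all.
  return [factor, 0] if streak.zero?

  # Tweak 2: right once and then wrong resets the streak,
  # but leaves the factor alone.
  return [factor, 0] if streak == 1

  # Otherwise, decrease the factor as before, respecting the floor.
  [[(factor - STEP).round(1), MIN_FACTOR].max, 0]
end
```

The first case is also what makes the operation idempotent: applying it twice to a word at streak 0 gives the same result as applying it once.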

With these two together, it felt like it took significantly longer for new items to get down to 1.3: it still sometimes happens, and it should, but before it seemed like words were either put into a bucket of “words made out of kanji that I know well that fit together in an obvious way”, which would stay in the 2.2 – 2.5 range, or a bucket of “words with something a little more unusual going on”, which would all crash down to 1.3. And now it seems like I’m getting more differentiation in that latter set, so I’m using more of the range between 1.3 and 2.2.

 

That helped with new words. But I felt I also had problems with words that had been around for a while: even once I had them basically calibrated, they’d mostly stay at the same difficulty rating but sometimes the multiplier would decrease, while the multiplier would never actually get higher.

Thinking about it some, I decided that I was applying the streak increase at the wrong time: I was applying it as soon as I hit a streak of 10. And, actually, it was worse than that: if the multiplier was changing from, say, 1.7 to 1.8, I treated the next gap as 1.8^11 instead of 1.7^10 * 1.8. So it was a big discontinuity in the review spacing.
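To see how big that jump is, here’s a quick Ruby calculation. It assumes the gap after a streak of n is simply the multiplier raised to the nth power, in days; the method name is made up for illustration.

```ruby
# Compares the gap the old code produced after a multiplier bump with the
# gap a smooth continuation would have produced.
# Assumes gap (in days) = multiplier ** streak.
def gaps_at_bump(old_factor, new_factor, streak)
  previous_gap = old_factor**streak
  {
    previous: previous_gap,                  # gap just before the bump
    smooth: previous_gap * new_factor,       # one more step at the new rate
    recomputed: new_factor**(streak + 1)     # whole history redone at new rate
  }
end

gaps = gaps_at_bump(1.7, 1.8, 10)
gaps.each { |name, days| puts format("%-10s %.0f days", name, days) }
```

For the 1.7 to 1.8 case, the recomputed gap comes out well over one and a half times the smooth one, so a word I was just barely remembering suddenly gets pushed out a lot further than it should.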

I could imagine a few different ways to fix that, but I went with the easiest one: keep the multiplier the same until I get it wrong, only applying the streak bonus when I get a question wrong.
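In code, that deferred version might look roughly like the Ruby sketch below. The names are hypothetical, and I’m assuming one 0.1 bonus for reaching the threshold plus another for every further 10 correct answers; the threshold is a parameter since it’s the sort of thing that gets tuned.

```ruby
MIN_FACTOR = 1.3
STEP = 0.1

# How many 0.1 bonuses a streak has earned by the time it ends.
# Assumption: one at bonus_at, then one more per bonus_every after that.
def bonus_steps(streak, bonus_at: 10, bonus_every: 10)
  return 0 if streak < bonus_at

  1 + (streak - bonus_at) / bonus_every
end

# The factor only changes when an answer is wrong: the usual 0.1 penalty,
# offset by whatever bonus the streak earned.
def factor_after_miss(factor, streak, bonus_at: 10)
  change = STEP * (bonus_steps(streak, bonus_at: bonus_at) - 1)
  [(factor + change).round(1), MIN_FACTOR].max
end
```

So a miss after a short streak lowers the factor, a miss after reaching the threshold leaves it where it was, and a miss after a much longer streak raises it. Because the multiplier never changes mid-streak, the spacing within a streak stays on a single exponential curve.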

But, even with that, I felt like the multiplier was decreasing significantly more often than it was increasing, even for words that had been around long enough that I felt they should have stabilized. And, looking at the probability, I think I just got it wrong: there’s something intuitive in saying that, if your target is to get it wrong 10% of the time, then you should do something different on a streak of 10. But, the thing is, if a question is actually calibrated accurately, and if I have a 90% chance of getting it right every time I ask it, then I have a 65% chance (1 - 0.9^10) of having my streak end before I hit 10, so the most common case is actually for my multiplier to decrease. And I only have a 12% chance (0.9^20) to hit a streak of 20, which doesn’t come close to balancing that out.

For now, I’ve changed things so the streak bonus kicks in at 8; that way I have a 57% chance (1 - 0.9^8) of having the multiplier decrease. Which, as I type it out, is still wrong? And I’m using 18 for when the multiplier increases, which is a 15% chance. So 57% chance of it decreasing by .1, 28% chance of it staying the same, 15% chance of it increasing; that’s not right. (Some portion of the increase is by more than .1, so it’s not quite as bad as that seems, but I think that’s negligible. Also, the math above doesn’t take into account that getting it wrong right at the start is a no-op; that actually makes the 8 part seem pretty reasonable.) So: probably more tweaking to come.
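These numbers are easy to check with a few lines of Ruby. The sketch below assumes every review is an independent coin flip with a fixed chance of success, and it ignores the rule that a wrong answer at the very start is a no-op, so treat the output as a rough guide; the method name is made up.

```ruby
# Chance that a streak ends in each outcome, given a per-review success
# probability and the streak lengths where the factor stops decreasing
# (same_at) and starts increasing (up_at).
def outcome_probabilities(p_correct, same_at, up_at)
  p_reach_same = p_correct**same_at # streak survives to the first threshold
  p_reach_up = p_correct**up_at     # streak survives to the second threshold
  {
    decrease: 1 - p_reach_same,
    same: p_reach_same - p_reach_up,
    increase: p_reach_up
  }
end

[[10, 20], [8, 18]].each do |same_at, up_at|
  probs = outcome_probabilities(0.9, same_at, up_at)
  puts format("thresholds %d/%d: decrease %.0f%%, same %.0f%%, increase %.0f%%",
              same_at, up_at, *probs.values.map { |v| v * 100 })
end
```

Playing with the thresholds this way is a cheap way to hunt for a pair where decreases and increases roughly balance, at least under this simple model.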

I think the probability problem is more subtle than that, though: I don’t actually know what the correct multiplier is for any given vocabulary term. So what I’m really trying to do isn’t just to have it stay more or less stable when I have the multiplier right; what I’m instead trying to do is update my best guess based on priors and new observations. And I don’t really understand the best way to go about that; makes me wish that I’d actually studied probability some, that I actually understood what the word “Bayesian” meant…

 

Anyways, things are getting better? I am sometimes seeing items hit a streak of 18, so items are slowly starting to move back up from a multiplier of 1.3. Which also points at another aspect of the probability question: maybe the important question isn’t whether, if we’re at the correct multiplier, we stay exactly on that multiplier: instead, the important question is more whether, if we get that multiplier wrong, we’ll course-correct and get it back to where it should be? Which, of course, isn’t just a probability question: it involves having a model that lets us predict our chances of getting an item wrong if we have our multiplier wrong. And I don’t actually have any idea what the answer is to that one? (Heck, I’m not even sure that an exponential spacing approach is correct in the first place…)

And this whole thing brings me to the last thing that I improved: more visibility into the underlying factors here, your current streak length and multiplier. That information was always accessible, but not easily so: I had an attitude in the back of my head that it was wrong to pay attention to it, that I should just answer the questions the algorithm throws at me without worrying about streaks and the like. So I didn’t make it easy to get at the streak numbers; but I’d periodically check on them anyways: in the past, just because I was curious, but now, because I was actively tuning the algorithm.

So I finally gave in and realized that, while I still think I don’t want the numbers to be in your face, I also don’t want them to be hard to find. The previous way of doing that was to hit the back button in my browser and then click on the “show current item” link; but something in iOS’s behavior recently had made that very unreliable, where when I hit back it would show me the new item instead of the previous item. (It’s the same URL for both, representing “show me the current item to be quizzed on”.)

I ended up adding a link to the current page saying “show me the details for the last item I was quizzed on”. So the information isn’t in my face, but it’s just a click away when I want it. And it definitely helps; I’m not checking on it all the time or anything, but when I do want to check on that info, it’s nice to have it easily accessible.

 

Arguably the most interesting part of this process, actually, was getting some experience with a side of agile software development that I don’t normally see: acting as the Product Owner. (Or, to use the XP term, the Onsite Customer.) I mean, I was implementing the changes too, but that side of things was easy; each change probably only took about half an hour of work, so the above is maybe three hours of programming total?

But it’s three hours of programming that really made a difference in my experience using the program. I’m not going to say that that kind of extreme is a normal part of programming: most changes to software do take more work than that. But I also kind of feel like it’s the case that there’s not necessarily that much connection between effort and business value, and that we undervalue changes that take less programming time and that make a quality of life difference for users.

Though there is a part of these changes that really did take time: I’ve been living with this software for years, so it’s taken a while to understand the consequences of the choices in its algorithms. That’s certainly more the case with this software than with a lot of kinds of software: the algorithm feeds me tasks at a days-to-weeks-to-months rhythm, so I’m just not going to be able to make a change, play around with it for a few minutes, and have an idea of how that change has played out and what to do next. But still, I am getting the sense that the rhythm of living with software and the rhythm of developing software are different, that the Product Owner side of things is informed by the former, and that it’s important to give the former time to breathe instead of prioritizing a constant implementation grind.
