Last year Burr Settles (from Duolingo) and Brendan Meeder proposed a model that predicts the probability of recall during practice with Duolingo, based on the lag time since a learner last practiced a particular item. The original paper is called A Trainable Spaced Repetition Model for Language Learning (video) and the idea is also described in this blog post.

Since I am a Ph.D. candidate in the Adaptive Learning Group and my research focuses on computerized adaptive practice of factual knowledge and on issues related to its evaluation, I am very interested in papers produced by Duolingo employees (e.g., I recommend the paper Mixture Modeling of Individual Learning Curves). I presented the paper at our seminar on March 22. Because the authors published not only the paper but also the source code of the model and the data used for its evaluation, I was able to analyze the behavior of the model and its evaluation more closely.

To understand the analysis that follows, please read the original paper or the blog post. The full code of this analysis is available on GitHub; it is based on a fork of the original repository to minimize the chance of errors on my part.

If you want to run this Jupyter notebook on your own, please download the original data set to the **data** directory and run the experiment:

`pypy experiment.py -m hlr ./data/settles.acl16.learning_traces.13m.csv.gz`

After you run the command above, you should see something like this in your console:

```
method = "hlr"
reading data...0...1000000...2000000...3000000...4000000...5000000...6000000...
7000000...8000000...9000000...10000000...11000000...12000000...done!
|train| = 11568803
|test| = 1285423
test 191801294.6 (p=109014.3, h=191692154.1, l2=126.2) mae(p)=0.129
cor(p)=0.038 mae(h)=117.454 cor(h)=0.203
```

```
%matplotlib inline
%load_ext autoreload
%autoreload 2
```

```
from evaluation import plot_model_stats
from models import train_test_set, ItemAverage
import matplotlib.pylab as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style="white")
DAY_SECONDS = 60 * 60 * 24
```

## Analysis of the Input Traces

Since the paper is about finding the optimal learner model that handles forgetting, we start with an analysis of the available data. The data usually matters a lot; see Impact of Data Collection on Interpretation and Evaluation of Student Models.

```
traces = pd.read_csv('./data/settles.acl16.learning_traces.13m.csv.gz')
```

```
# Convert the lag time to days (vectorized division instead of a per-row lambda).
traces['delta_days'] = traces['delta'] / DAY_SECONDS
```

```
traces.head()
```

### Observed Recall Rate

```
plt.hist(traces['p_recall'], bins=50)
plt.ylabel('Number of interactions')
plt.xlabel('Probability of recall')
plt.show()
```

Most interactions have a high probability of recall. There is a danger that the proposed model will ignore the less frequent interactions with low recall in order to achieve higher accuracy.

### Distribution of Lag Times

```
plt.hist([x for x in traces['delta_days'] if x < 365], bins=20)
plt.xlabel('Days')
plt.ylabel('Number of answers')
plt.yscale('log')
plt.show()
```

You can see that most interactions have only a small lag time, so the fitted model will probably focus mainly on interactions with a lag time lower than 10 days.

### Length of Traces

```
plt.hist([
    x for x in traces.sort_values(by=['timestamp']).drop_duplicates(['user_id', 'lexeme_id'])['history_seen']
    if x <= 20
], bins=20)
plt.xlim(1, 20)
plt.xlabel('Length')
plt.ylabel('Number of series')
plt.show()
```

Most series (a learner's interactions with one item) have a length of only two or three.

### Forgetting

```
def _compute_bin_p_recall(group):
    # Mean observed recall within one lag-time bin.
    return pd.DataFrame([
        {'p_recall': group['p_recall'].mean()}
    ])
global_forgetting = traces.groupby(
pd.qcut(traces['delta_days'], 20)
).apply(_compute_bin_p_recall).reset_index().drop(['level_1'], axis=1)
sns.barplot(x='delta_days', y='p_recall', data=global_forgetting, color=sns.color_palette()[0])
plt.xticks(rotation=90)
plt.ylim(0.85, 0.95)
plt.show()
```

Surprisingly, the probability of recall does not decrease monotonically with increasing lag time. The average probability of recall is lowest for very short lag times, probably because the lexemes are not fully learned yet at that point. Unfortunately, the proposed model cannot capture this phenomenon by design.
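
The limitation follows from the functional form used in the paper, where the predicted recall is p = 2^(-Δ/h) for lag time Δ and estimated half-life h. For any fixed half-life the prediction can only fall as the lag grows. A minimal sketch of that form, where the helper name `hlr_recall`, the lag values, and the half-life of 30 days are assumptions chosen purely for illustration (they are not fitted values):

```python
import numpy as np


def hlr_recall(delta_days, half_life_days):
    """Recall predicted by the half-life form p = 2^(-delta / h)."""
    return 2.0 ** (-np.asarray(delta_days, dtype=float) / half_life_days)


# Illustrative lag times in days and an arbitrary half-life (assumptions).
lags = np.array([0.001, 0.01, 0.1, 1.0, 10.0, 100.0])
predictions = hlr_recall(lags, half_life_days=30.0)

# The prediction is strictly decreasing in the lag time, so this form can
# never assign the lowest recall to the shortest lags.
is_decreasing = bool(np.all(np.diff(predictions) < 0))
print(predictions.round(3), is_decreasing)
```

A dip at very short lags, as seen in the bar plot, would need either an extra feature or a different functional form.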

## Analysis of the Results

The paper presents the following features used in half-life regression:

- interaction features
  - number of all correct answers (square root)
  - number of all wrong answers (square root)
- lexeme tag features
  - bias dependent on a lexeme

I really do not understand why there are no learner features (e.g., UI language or estimated prior skill).
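
To make the role of these features concrete, here is a minimal sketch of how they turn into a half-life, using the paper's formulation h = 2^(θ·x). The function names, the lexeme tag `example.N.SG`, and all weights are hypothetical and made up for illustration; they are neither the repository's code nor the fitted values.

```python
import math


def features_for(n_correct, n_wrong, lexeme_tag):
    """Build the feature vector listed above for one learner-lexeme pair."""
    return {
        'correct': math.sqrt(n_correct),
        'wrong': math.sqrt(n_wrong),
        'lexeme:' + lexeme_tag: 1.0,  # indicator feature acting as a per-lexeme bias
    }


def half_life(features, weights):
    """Half-life in days: h = 2^(theta . x)."""
    dot = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 2.0 ** dot


# Hypothetical weights (assumptions for illustration only).
weights = {'correct': 0.5, 'wrong': -0.5, 'lexeme:example.N.SG': 1.0}

h_mostly_correct = half_life(features_for(9, 1, 'example.N.SG'), weights)
h_mostly_wrong = half_life(features_for(1, 9, 'example.N.SG'), weights)
print(h_mostly_correct, h_mostly_wrong)
```

Nothing in this feature vector identifies the learner beyond the two counts, so two learners with the same counts on the same lexeme get the same half-life.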

```
results = pd.read_csv('./results/hlr.settles.acl16.learning_traces.13m.preds', delimiter='\t')
```

```
results.head()
```

### Calibration

```
plot_model_stats(results['pp'], results['p'], bins=50)
```

Using MAE for the evaluation of a learner model is questionable; see the paper Metrics for Evaluation of Student Models for more information. The predictive accuracy of the final model is really bad (ideally, the blue line should be aligned with the green one). It seems that the presented model has almost no predictive power. The diversity of predictions is also really low: almost all predictions are close to one.
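
One way to see the problem with MAE is a small experiment on synthetic data. The sketch below assumes binary outcomes with a true recall probability of 0.9; both the probability and the sample size are assumptions for illustration, and the real data set stores a recall rate per session, not a single binary answer. Under these assumptions, a model that always predicts perfect recall gets a lower MAE than a model that predicts the true probability, while RMSE ranks them the other way round.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary outcomes: each answer is correct with probability 0.9.
true_p = 0.9
outcomes = (rng.random(100_000) < true_p).astype(float)


def mae(predictions, observed):
    return float(np.mean(np.abs(predictions - observed)))


def rmse(predictions, observed):
    return float(np.sqrt(np.mean((predictions - observed) ** 2)))


calibrated = np.full_like(outcomes, true_p)  # predicts the true probability
overconfident = np.ones_like(outcomes)       # always predicts perfect recall

print('MAE :', mae(calibrated, outcomes), mae(overconfident, outcomes))
print('RMSE:', rmse(calibrated, outcomes), rmse(overconfident, outcomes))
```

MAE therefore rewards pushing predictions toward the majority outcome, which is consistent with predictions that cluster near one.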

### Comparison to Item Average

Imagine a learner model which ignores learning and forgetting. This model just computes an average probability of recall per item and uses it as a prediction for the future interactions.
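
A minimal sketch of such a baseline is shown here. The class `SimpleItemAverage` is a hypothetical stand-in written for this illustration, not the `ItemAverage` class from the analysis repository, and the fallback to the global mean for unseen lexemes is my assumption. The column names `lexeme_id` and `p_recall` are the ones used in the data set.

```python
import pandas as pd


class SimpleItemAverage:
    """Hypothetical item-average baseline (not the repository's ItemAverage)."""

    def train(self, trainset):
        # Mean observed recall per lexeme, ignoring learner and time.
        self._item_means = trainset.groupby('lexeme_id')['p_recall'].mean().to_dict()
        # Assumed fallback for lexemes that never appear in the training set.
        self._global_mean = trainset['p_recall'].mean()

    def predict(self, lexeme_id):
        return self._item_means.get(lexeme_id, self._global_mean)


# Toy data, made up for illustration.
toy_train = pd.DataFrame({
    'lexeme_id': ['a', 'a', 'b', 'b'],
    'p_recall': [1.0, 0.5, 1.0, 1.0],
})
baseline = SimpleItemAverage()
baseline.train(toy_train)
print(baseline.predict('a'), baseline.predict('b'), baseline.predict('unseen'))
```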

It is worth noting that the train/test split is not the same as the one used to fit and evaluate the half-life regression model. However, I assume it does not affect the general message of the analysis.

```
trainset, testset = train_test_set(traces)
```

```
item_average = ItemAverage()
item_average.train(trainset)
```

```
predicted = np.zeros(len(testset))
for i, lexeme_id in enumerate(testset['lexeme_id'].values):
    predicted[i] = item_average.predict(lexeme_id)
plot_model_stats(predicted, testset['p_recall'], 50)
```

You can see that even this simple model is much better than the one presented in the original paper. I understand that the goal of the paper is to find a parameter controlling the degradation of skill meters and that good predictive accuracy is a bonus. However, I assume that if a model ignoring learning and forgetting is better than the proposed one, there is also a much better model that takes learning and forgetting into account.