Imagine you wish to know how many fish are in a rather large pond… The water is murky, so counting them directly is out of the question. In ecology this is a common problem: the population size of a species often needs to be known, but it is either impractical or simply impossible to count every individual member of that population. Fortunately, there is a relatively easy way to get an estimate!

To estimate the size of the population you have to visit the site twice. On the first visit a number of animals are caught, marked and released. It is important to mark them in a way that doesn’t harm or disadvantage them; it is equally important to use marks that don’t easily rub off. You keep count of how many members of the population were caught and marked. A day or so later, the site is visited again and a number of animals are caught. This time the number of marked animals that were recaptured is what matters: the fraction of marked individuals among the animals captured on the second day is essentially the same as the fraction of the population that was marked on the first day. This method goes by a few names, such as Capture-Mark-Recapture, Mark and Recapture, or other slight variations.

As an example, imagine you catch and tag 50 individuals on day one. The next day, 100 animals are caught and 10 of those are marked. So roughly 10% of all individuals were tagged, and since we know there are 50 tagged individuals in total, the total population size should be about 500.

\[\begin{aligned}
\frac{\text{n\_marked}}{\text{population\_size}} &= \frac{\text{n\_recaptured}}{\text{captured\_round\_2}}
\end{aligned}\]

This formula can be re-written as:

\[\begin{aligned}
\text{population\_size} &= \frac{\text{n\_marked} \times \text{captured\_round\_2}}{\text{n\_recaptured}}
\end{aligned}\]
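
As a quick sanity check, the point estimate for the example above can be computed directly; a plain Python sketch (the variable names match the ones used in the model further down):

n_marked = 50           # tagged on the first visit
captured_round_2 = 100  # caught on the second visit
n_recaptured = 10       # of those, already carrying a tag

# point estimate from the rearranged formula
population_size = n_marked * captured_round_2 / n_recaptured
print(population_size)  # 500.0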

While the latter is really simple, and similar exercises are already seen in primary school (known as the rule of three), calculating the confidence interval on this estimate (the range that contains the actual population size with a certain level of confidence, usually 95%) is far from trivial (check the link, it is a very complex formula). However, when using a Bayesian approach, we get the uncertainty on the population size automatically, without any extra effort. So let’s implement this model in PyMC3 and see what we get.

Creating a PyMC3 model

The data from the observation is stored in a few variables: n_marked (the number of individuals marked on the first visit), captured_round_2 (the total number of individuals caught on the second visit) and n_recaptured (the number of marked individuals caught on the second visit). The big unknown we need to infer is the population_size. We don’t really have any clue about it, except that it has to be equal to or higher than the total number of individuals caught during both visits, so we can set the number of unique animals seen as the lower limit.

The probability of capturing an already marked animal on the second visit, p_marked, is the fraction of marked animals within the full population. This is a pm.Deterministic variable, as it is fully defined by the number of marked animals (which is known) and the population size (which we try to infer). Finally, a likelihood recapture_obs is required; this is a pm.Binomial with the total number of animals caught the second day as the number of draws, the number of marked individuals recaptured as the observation, and a probability of p_marked.

%load_ext nb_black
import pymc3 as pm

# Observed data
n_marked = 50           # marked and released on the first visit
captured_round_2 = 100  # caught on the second visit
n_recaptured = 10       # of those, already marked

with pm.Model() as model:
    # Flat prior, bounded below by the number of unique animals seen
    population_size = pm.Bound(
        pm.Flat, lower=n_marked + captured_round_2 - n_recaptured
    )("population_size")
    # Fraction of the population that carries a mark
    p_marked = pm.Deterministic("p_marked", n_marked / population_size)

    # Likelihood: number of marked animals among the second-day catch
    recapture_obs = pm.Binomial(
        "recapture_obs", captured_round_2, p_marked, observed=n_recaptured
    )

    trace = pm.sample(4000, tune=1000, return_inferencedata=False)

After sampling the model, we get our estimate for the population size. In this case the population size is estimated to be between 294 and 1034 individuals (hdi_3% and hdi_97%: PyMC3 by default reports the smallest range that contains 94% of the inferred values), with a mean value of 620. While the mean isn’t crazy far off from the population size in our example, the uncertainty is rather large. It shows that not nearly enough animals were marked or recaptured in our thought experiment.

                 mean     sd       hdi_3%   hdi_97%
population_size  620.067  220.926  293.750  1034.423
p_marked         0.090    0.028    0.039    0.142
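
The table above is the sort of summary ArviZ produces; a minimal sketch of how to get it from the trace (assuming ArviZ is installed alongside PyMC3):

import arviz as az

with model:
    # Summarise the posterior draws; the 94% HDI (hdi_3% / hdi_97% columns)
    # is the ArviZ default
    print(az.summary(trace, var_names=["population_size", "p_marked"]))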

Update 30/08/2022 – HyperGeometric likelihood

As draws from the population in this case are made without replacement, and thus aren’t independent, a HyperGeometric distribution is a better fit for this model. The last two parts of the code can be swapped for the lines below to implement this. While this wouldn’t make a big difference if the number of marked individuals were large enough, here it does affect the mean estimate and the 94% HDI.

    # HyperGeometric likelihood: recaptures are drawn without replacement
    recapture_obs = pm.HyperGeometric(
        "recapture_obs", N=population_size, k=n_marked, n=captured_round_2, observed=n_recaptured
    )

    trace = pm.sample(4000, tune=1000, return_inferencedata=False, target_accept=0.9)

                 mean     sd       hdi_3%   hdi_97%
population_size  748.737  372.363  273.785  1396.472
p_marked         0.040    0.015    0.012    0.067

Can we improve without catching more animals?

So without catching more animals overall, is it better to mark more animals and catch fewer on the next visit? Or to capture more animals on the second visit, while marking fewer the first time around? With this model we can easily test this out, as shown in the sketch below.
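
To make these what-if runs easier, the model can be wrapped in a small function. This is a minimal sketch under the same assumptions as the model above; the helper name estimate_population and the use of ArviZ for the summary are my own choices, not from the original post:

import arviz as az
import pymc3 as pm


def estimate_population(n_marked, captured_round_2, n_recaptured, draws=4000):
    """Rebuild the capture-mark-recapture model for a given set of counts
    and return the posterior summary (hypothetical helper)."""
    with pm.Model():
        # The population can't be smaller than the number of unique animals seen
        population_size = pm.Bound(
            pm.Flat, lower=n_marked + captured_round_2 - n_recaptured
        )("population_size")
        # Fraction of the population that carries a mark
        p_marked = pm.Deterministic("p_marked", n_marked / population_size)
        # Likelihood: marked animals among the second-day catch
        pm.Binomial(
            "recapture_obs", captured_round_2, p_marked, observed=n_recaptured
        )
        trace = pm.sample(draws, tune=1000, return_inferencedata=False)
        return az.summary(trace, var_names=["population_size", "p_marked"])


# The scenario discussed next: 100 marked on day one, 50 caught on day two,
# about 10 of them recaptured
print(estimate_population(n_marked=100, captured_round_2=50, n_recaptured=10))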

Let’s mark 100 animals on day one and catch 50 on day two. Given a population size of 500, we’d expect about 10 of those to be recaptured. If we feed those numbers into the model we get the results below.

                 mean     sd       hdi_3%   hdi_97%
population_size  610.585  212.195  310.344  1008.055
p_marked         0.180    0.054    0.082    0.279

While there is still a lot of uncertainty, it does drop when catching more animals on the first day and fewer on the second day. (I’ll leave confirming that it gets worse when you do things the other way around to the reader.) However, once you drop below the point where you are barely recapturing any animals on the second day, this will also negatively affect the estimate.

Catching twice as many animals

This model really works well if enough individuals can be tagged and recaptured. So if we increase the number of marks to 200 and capture 100 on the second day, we would expect about 40 of them to be tagged. Punching those numbers into the model, we see that the mean is now very close to what it should be in our imaginary population and the uncertainty is far smaller.

                 mean     sd      hdi_3%   hdi_97%
population_size  519.579  66.165  402.473  643.741
p_marked         0.391    0.048   0.302    0.483
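
For reference, with the hypothetical helper sketched earlier, this last scenario would be run as:

# 200 marked on day one, 100 caught on day two, about 40 of them recaptured
print(estimate_population(n_marked=200, captured_round_2=100, n_recaptured=40))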

Conclusion

While the formula for this model is simple, calculating the confidence interval isn’t, even though knowing how much uncertainty you have can be essential. In our first example, you might allow fishers to capture a certain number of fish depending on how many there are in the pond, so whether there are 293, 500, 610 or 1034 fish could influence how many people are allowed to fish in that pond.

By using a Bayesian approach, the problem is solved in a few simple lines of code, and we get the credible interval in one go as well. Easy! This is exactly the strength of Bayesian statistics. The advantages are further highlighted by how easy it is to play with the numbers and see how the uncertainty can be decreased, so the effort of catching and marking individuals can be spent as efficiently as possible.

To see what I did with this formula, check out the next post where this is used to estimate how many KeyForge decks
were printed.

Acknowledgements

Header image by Sebastian Pena Lambarri on Unsplash
