|
|
|
#+title: Example: Constant Parameters
|
|
|
|
|
|
|
|
Created by Alexander Zarebski
|
|
|
|
|
|
|
|
Edited by [[https://github.com/BernardoGG][Bernardo Gutierrez]] on 12/08/2022
|
|
|
|
|
|
|
|
Last edited using version =v0.3.0=.
|
|
|
|
|
|
|
|
This tutorial demonstrates how to use TimTam to estimate the reproduction number
|
|
|
|
and prevalence of infection using a sequence alignment and a time series of
|
|
|
|
confirmed cases. Often in genomic epidemiology, the available set of sequences is
|
|
|
|
not representative of the total number of recorded cases due to limitations in
|
|
|
|
sequencing capacity, accessibility to samples or other reasons. TimTam facilitates
|
|
|
|
the inclusion of case counts combined with available genomic data to estimate
|
|
|
|
epidemiological parameters from outbreaks.
|
|
|
|
|
|
|
|
* Data and model
|
|
|
|
|
|
|
|
In this tutorial, we will work with data obtained from a simulated epidemic. The
|
|
|
|
data consists of a sequence alignment, [[https://github.com/aezarebski/timtam2/wiki/tutorial-1/sample-sequences.fasta][sample-sequences.fasta]] and a time series
|
|
|
|
of daily case counts, [[https://github.com/aezarebski/timtam2/wiki/tutorial-1/aggregated-occurrences.csv][aggregated-occurrences.csv]]. These data were simulated from
|
|
|
|
a birth-death process in which individuals infect (therefore 'give birth to')
|
|
|
|
new individuals and cease to be infectious (therefore 'die') after a given time.
|
|
|
|
From the epidemic process, individuals that cease to be infectious as they
|
|
|
|
recover or are removed from the epidemic due to a fatal outcome of the
|
|
|
|
infection. However, from our data collection perspective, these 'removed'
|
|
|
|
individuals correspond to one of three possible scenarios: i) they recover from
|
|
|
|
the infection without ever being diagnosed (in which case we don't observe their
|
|
|
|
infection at all), ii) they can be diagnosed and have a genetic sequence
|
|
|
|
generated (in which case we get a time stamped sequence for that case) or they
|
|
|
|
can be diagnosed but generate an unsequenced sample (in which case they appear
|
|
|
|
in the dataset as a confirmed case but there is no sequence data attached to the
|
|
|
|
case). For the unsequenced samples, they are aggregated into daily counts giving
|
|
|
|
rise to the time series data.
|
|
|
|
|
|
|
|
The following figure shows the number of infectious individuals ("infectious")
|
|
|
|
through time along with the cumulative number of people who were previously
|
|
|
|
infectious, for example, who have recovered or been sampled. Note from the
|
|
|
|
subtitle that there are 2222 infectious individuals at the present. The number
|
|
|
|
of infectious individuals is growing exponentially, as indicated by the linear
|
|
|
|
growth on a logarithmic /y/-scale.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/aezarebski/timtam2/images/tutorial-1-simulated-trajectories.png]]
|
|
|
|
|
|
|
|
The following figure shows the daily number of confirmed cases: infectious that
|
|
|
|
have been observed but for which a sequence was not collected. These data form
|
|
|
|
the time series that we will use.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/aezarebski/timtam2/images/tutorial-1-occurrence-time-series.png]]
|
|
|
|
|
|
|
|
Note that at the end of the simulated epidemic there are approximately 2250
|
|
|
|
infectious individuals. The parameters used to simulate this epidemic are set to
|
|
|
|
represent a pathogen with a basic reproduction number of \(1.85\). To simplify
|
|
|
|
this tutorial we will assume a known death rate of (\(0.046\)), and that the
|
|
|
|
pathogen evolves with a molecular clock rate of (\(10^{-3}\))
|
|
|
|
substitutions/site/day. Both values fall within the ballpark of what might be
|
|
|
|
expected for an RNA virus. The origin time for our simulated epidemic took place
|
|
|
|
(\(70\)) days before the end of the simulation.
|
|
|
|
|
|
|
|
* Analysis
|
|
|
|
|
|
|
|
1. Create a work folder in your computer and download the [[https://github.com/aezarebski/timtam2/wiki/tutorial-0/sample-sequences.fasta][sample-sequences.fasta]] and
|
|
|
|
[[https://github.com/aezarebski/timtam2/wiki/tutorial-0/aggregated-occurrences.csv][aggregated-occurrences.csv]] files there.
|
|
|
|
2. Create an XML file using BEAUti.
|
|
|
|
1. Load the sequence data into BEAUti
|
|
|
|
2. Parse the tip-dates using the default auto-configured values function.
|
|
|
|
3. Leave the site model tab with the default values.
|
|
|
|
4. Set the clock rate to =0.001= in the clock model tab.
|
|
|
|
5. Select the /Tim Tam Model/ on the tree prior tab.
|
|
|
|
6. (Optional) set the chain length to a smaller value if you are in a hurry;
|
|
|
|
1,000,000 MCMC steps should do.
|
|
|
|
7. Save the resulting file as an XML.
|
|
|
|
3. We now need to estimate some parameter values to add to the XML. to do this,
|
|
|
|
download and run the R script [[https://github.com/aezarebski/timtam2/wiki/tutorial-1/print-disaster-data.R][print-disaster-data.R]] and keep a copy of the
|
|
|
|
data that gets printed out on the Console or Terminal. There should be two
|
|
|
|
sets of values, these are the representation of the time series data that
|
|
|
|
TimTam expects. We do this because the times in the simulation are relative
|
|
|
|
to date of the last sequence we have; we therefore need to shift the times in
|
|
|
|
the XML so they agree with the ones we just computed.
|
|
|
|
4. Now it's time to tweak the XML by opening it in a text editor and making the
|
|
|
|
following edits:
|
|
|
|
- Remove operators relating to the clock rate, as we are working with a known
|
|
|
|
(from the simulation parameters) evolutionary rate. There should be two of
|
|
|
|
these clock operators, you can identify them because they begin with
|
|
|
|
~<operator id="clock.rate"~
|
|
|
|
- Remove all references to =historySizes= and =historyChecks=, these are used
|
|
|
|
to estimate historical prevalence and are not needed for this example.
|
|
|
|
- Copy and paste the two number series from the R script output as the values
|
|
|
|
of =disasterTimes= and =disasterSizes= respectively, in the TimTam
|
|
|
|
distribution node. You should recognise it by the =id=
|
|
|
|
=TimTam.t:sample-sequences=. This will format the time series of the
|
|
|
|
sampling times and sizes so that they are included properly.
|
|
|
|
- The final XML should look like [[https://github.com/aezarebski/timtam2/wiki/tutorial-1/timtam.xml][timtam.xml]].
|
|
|
|
5. Run MCMC by executing the XML file on BEAST (you may want to grab a coffee
|
|
|
|
while this is running.)
|
|
|
|
6. Plot the results using [[https://github.com/aezarebski/timtam2/wiki/tutorial-1/post-processing.R][post-processing.R]]. This will produce two figures
|
|
|
|
similar to the ones shown below, summarising some of the epidemiological
|
|
|
|
estimates from your analysis.
|
|
|
|
|
|
|
|
* Results
|
|
|
|
|
|
|
|
By following this tutorial, we have used a multiple sequence alignment and a
|
|
|
|
time series of observed (but not sequenced) case data to estimate the basic
|
|
|
|
reproduction number of a pathogen and the prevalence of the disease at the time
|
|
|
|
of the last observation in our simulated epidemic.
|
|
|
|
|
|
|
|
The first figure shows the posterior distribution of the (basic) reproduction
|
|
|
|
number as inferred during your BEAST run. The true value for the reproduction
|
|
|
|
number (defined as a known value in the simulation) is shown as a vertical line.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/aezarebski/timtam2/images/tutorial-1-marginals.png]]
|
|
|
|
|
|
|
|
Note how in the plot, the 95% HPD includes the real value of the basic reproduction
|
|
|
|
number. We would expect to recover the true value of an estimated parameter credible
|
|
|
|
interval when working with real-world data.
|
|
|
|
|
|
|
|
The second figure shows the posterior distribution of the prevalence at the time
|
|
|
|
of the last observation, i.e. the number of infectious people at the present time.
|
|
|
|
Recall from the observation we made above that this number should be approximately
|
|
|
|
2250 infectious cases at that time. How does the estimate from the figure below and
|
|
|
|
the estimate from your own analysis compare to the true number of infectious cases
|
|
|
|
at this last time point?
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/aezarebski/timtam2/images/tutorial-1-prevalence.png]]
|
|
|
|
|
|
|
|
The plots produced by the provided R script from your analysis should look very
|
|
|
|
similar to the ones shown above but not identical. Remember that the MCMC
|
|
|
|
samples across parameter space randomly, but while the exact values vary from
|
|
|
|
one run to the next, the distribution of these values should be mostly the same. |