Changes

commit 25002ed54f9d0137117674fdc5d7d3c07ab0edbb Author: Alex Zarebski <aezarebski@gmail.com> Date: Thu Oct 13 15:05:17 2022 +0100 update documentation commit f8be09217ee8effc3f40f653583463e757b6c359 Author: Alex Zarebski <aezarebski@gmail.com> Date: Thu Oct 13 15:04:40 2022 +0100 update figure files commit 159641dcd5fefca32b0e9be83d4303b82b571066 Author: Alex Zarebski <aezarebski@gmail.com> Date: Thu Oct 13 15:04:01 2022 +0100 tweak scripts to work with new version commit 4e6830096a6e17dc0170a20cb2616fa32e99374f Author: Alex Zarebski <aezarebski@gmail.com> Date: Thu Oct 13 14:15:41 2022 +0100 tweak some of the data and model section
Alex Zarebski · 3707c26b
--- a/Constant-Parameters-History.org
+++ b/Constant-Parameters-History.org
-#+title: Example: Constant Parameters in terms of R0
+#+title: Example: Constant parameters with historical prevalence
 Created by Alexander Zarebski
-Last edited using version =v0.3.0=.
+Last edited using =BEAST v2.6.7= and =timtam v0.3.1=.
 This tutorial demonstrates how to use TimTam to estimate the reproduction number
 and prevalence of infection at two points in time using a sequence alignment and
@@ -14,72 +14,78 @@ genomic data to estimate epidemiological parameters from outbreaks.
 * Data and model
-In this tutorial, we will work with data obtained from a simulated epidemic. The
+In this tutorial, we will estimate the reproduction number and prevalence of
-data consists of a sequence alignment, [[https://github.com/aezarebski/timtam2/wiki/tutorial-7/sample-sequences.fasta][sample-sequences.fasta]] and a time series
+infection in a simulated epidemic. The data consists of a sequence alignment,
-of daily case counts, [[https://github.com/aezarebski/timtam2/wiki/tutorial-7/aggregated-occurrences.csv][aggregated-occurrences.csv]]. These data were simulated from
+[[https://github.com/aezarebski/timtam2/wiki/tutorial-7/sample-sequences.fasta][sample-sequences.fasta]], and a time series of daily case counts,
-a birth-death process in which individuals infect (therefore 'give birth to')
+[[https://github.com/aezarebski/timtam2/wiki/tutorial-7/aggregated-occurrences.csv][aggregated-occurrences.csv]]. These data were simulated from a (constante rate)
-new individuals and cease to be infectious (therefore 'die') after a given time.
+birth-death process in which individuals infect (therefore 'give birth to') new
-From the epidemic process, individuals that cease to be infectious as they
+individuals and cease to be infectious (therefore 'die') after an exponentially
-recover or are removed from the epidemic due to a fatal outcome of the
+distributed amount of time. From the epidemic process, individuals that cease to
-infection. However, from our data collection perspective, these 'removed'
+be infectious as they recover or are removed from the epidemic due to a fatal
-individuals correspond to one of three possible scenarios: i) they recover from
+outcome of the infection. However, from our data collection perspective, these
-the infection without ever being diagnosed (in which case we don't observe their
+'removed' individuals correspond to one of three possible scenarios: i) they
-infection at all), ii) they can be diagnosed and have a genetic sequence
+recover from the infection without ever being diagnosed (in which case we don't
-generated (in which case we get a time stamped sequence for that case) or they
+observe their infection at all), ii) they can be diagnosed and have a genetic
-can be diagnosed but generate an unsequenced sample (in which case they appear
+sequence generated (in which case we get a time stamped sequence for that case)
-in the dataset as a confirmed case but there is no sequence data attached to the
+or they can be diagnosed but generate an unsequenced sample (in which case they
-case). For the unsequenced samples, they are aggregated into daily counts giving
+appear in the dataset as a confirmed case but there is no sequence data attached
-rise to the time series data.
+to the case). For the unsequenced samples, they are aggregated into daily counts
+giving rise to the time series data.
 The following figure shows the number of infectious individuals ("infectious")
 through time along with the cumulative number of people who were previously
-infectious, for example, who have recovered or been sampled. Note from the
+infectious but no longer are, e.g. because they have recovered or been sampled.
-subtitle that there are 2222 infectious individuals at the present. The number
+Note from the subtitle that there are 2222 infectious individuals at the
-of infectious individuals is growing exponentially, as indicated by the linear
+present. The number of infectious individuals is growing exponentially, as
-growth on a logarithmic /y/-scale.
+indicated by the following figure.
 [[https://raw.githubusercontent.com/wiki/aezarebski/timtam2/images/tutorial-7-simulation-trajectory.png]]
-The following figure shows the daily number of confirmed cases: infectious that
+The next figure shows the daily number of confirmed cases, i.e. the number of
-have been observed but for which a sequence was not collected. These data form
+people that have tested positive on that day (and are infectious) but for which
-the time series that we will use.
+a viral genome sequence was not recorded. These data, the daily counts, form the
+time series that we will use.
 [[https://raw.githubusercontent.com/wiki/aezarebski/timtam2/images/tutorial-7-occurrence-time-series.png]]
-Note that at the end of the simulated epidemic there are 2222 infectious
+These data were simulated with [[http://tgvaughan.github.io/MASTER][MASTER]], with parameters chosen so the epidemic
-individuals. The parameters used to simulate this epidemic are set to represent
+has a basic reproduction number of \(1.85\). To simplify this tutorial we will
-a pathogen with a basic reproduction number of \(1.85\). To simplify this
+assume a known net removal rate, denoted by [[https://github.com/aezarebski/timtam2/wiki/Glossary#sigma][sigma]], of (\(0.054 = 0.046 +
-tutorial we will assume a known net removal rate, denoted by [[https://github.com/aezarebski/timtam2/wiki/Glossary#sigma][sigma]], of
+0.008\)), and the pathogen evolves with a substitution rate of (\(10^{-3}\))
-(\(0.046 + 0.008\)), and that the pathogen evolves with a molecular clock rate
+substitutions/site/day constantly across the tree. Both values fall within the
-of (\(10^{-3}\)) substitutions/site/day. Both values fall within the ballpark of
+ballpark of what might be expected for an RNA virus. The origin time for our
-what might be expected for an RNA virus. The origin time for our simulated
+simulated epidemic took place (\(70\)) days before the end of the simulation.
-epidemic took place (\(70\)) days before the end of the simulation.
 * Analysis
-** Setting up a scaffold with BEAUti
+We will start by using BEAUti to set up an XML file which handles most of the
+analysis. We will then tweak this so that the historical prevalence is
+estimated.
+** Setting up with BEAUti
 1. Create a work folder in your computer and download the [[https://github.com/aezarebski/timtam2/wiki/tutorial-7/sample-sequences.fasta][sample-sequences.fasta]]
   and [[https://github.com/aezarebski/timtam2/wiki/tutorial-7/aggregated-occurrences.csv][aggregated-occurrences.csv]] files there.
 2. Create an XML file using BEAUti.
-   1. Load the sequence data into BEAUti
+   1. Load the sequence FASTA file into BEAUti
-   2. Parse the tip-dates using the default auto-configured values function.
+   2. In the *Tip Dates* tab, parse the tip-dates using the default
+      auto-configured values function.
   3. Leave the site model tab with the default values.
-   4. Set the clock rate to =0.001= in the clock model tab.
+   4. In the *Clock Model* tab, set the clock rate to =0.001=.
-   5. Select the /Tim Tam Time Series Model/ on the tree prior tab.
+   5. In the *Priors* tab, select the /Tim Tam Time Series Model/.
   6. (Optional) set the chain length to a smaller value if you are in a hurry;
      1,000,000 MCMC steps should do.
-   7. Save the resulting file as an XML, eg =timtam.xml=.
+   7. Save the resulting file as an XML, e.g. =timtam.xml=.
 ** Adjusting the XML for our model
-1. We now need to estimate some parameter values to add to the XML. To do this,
+1. We now need to compute some values to add to the XML. To do this, download
-   download and run the R script [[https://github.com/aezarebski/timtam2/wiki/tutorial-7/print-disaster-data.R][print-disaster-data.R]] and keep a copy of the
+   and run the R script [[https://github.com/aezarebski/timtam2/wiki/tutorial-7/print-disaster-data.R][print-disaster-data.R]] and keep a copy of the data that
-   data that gets printed out on the Console or Terminal. There should be two
+   gets printed out on the Console or Terminal. There should be two sets of
-   sets of values, these are the representation of the time series data that
+   values, these are the representation of the time series data that TimTam
-   TimTam expects. We do this because the times in the simulation are relative
+   expects. We do this because the times in the simulation are relative to date
-   to date of the last sequence we have; we therefore need to shift the times in
+   of the last sequence we have; we therefore need to shift the times in the XML
-   the XML so they agree with the ones we just computed.
+   so they agree with the ones we just computed.
 #+begin_src sh
  Rscript print-disaster-data.R
@@ -89,13 +95,15 @@ epidemic took place (\(70\)) days before the end of the simulation.
   following edits:
   - Remove operators relating to the clock rate, as we are working with a known
     (from the simulation parameters) evolutionary rate. There should be two of
-     these clock operators, you can identify them because they begin with
+     these clock operators, you can identify them because they have the clock
-     ~<operator id="clock.rate"~
+     rate among the parameters that they act on. Search =clockRate= to find
-   - We need to tweak =historySizes= and =historyChecks=: these are used to
+     these in the file.
-     estimate historical prevalence. There are several things that will need to
+   - We need to tweak =historySizes= and =historyChecks= which are the
-     be removed:
+     parameters used to estimate the historical prevalence. There are several
-     + the =stateNode= for =historySizes=, should have its value set to
+     things that will need to be edited:
-       something similar to =2000 3000= as a starting point,
+     + the =stateNode= for =TTHistorySizes.t=, should have its value set to
+       something similar to =2000 3000= as a starting point, (*n.b.* you may
+       need to adjust this to different values for your own dataset.)
     + the prior distribution on the history sizes can be left as it is,
     + the =operator= on the history sizes can be left as it is,
     + the =log= for the history sizes can be left as it is,
@@ -107,10 +115,10 @@ epidemic took place (\(70\)) days before the end of the simulation.
     distribution node. You should recognise it by the =id= value of
     =TimTam.t:sample-sequences=. This will format the time series of the
     sampling times and sizes so that they are included properly. Note that
-     BEAUti may have added ~dimension='3'~ to these nodes, in which case delete
+     BEAUti may have added ~dimension='3'~ to these nodes, in which case, delete
     these attributes.
-   - Double check that =sigma= is set to the known value of \(0.1\) in
+   - Double check that =sigma= (the net removal rate) is set to the known value
-     =TimTam.d:sample-sequences=.
+     of \(0.1\) in =TimTam.d:sample-sequences=.
   - Double check that the value of =originTime= is set to \(70\).
   - Save the resulting XML which should look like [[https://github.com/aezarebski/timtam2/wiki/tutorial-7/timtam.xml][timtam.xml]].
@@ -119,7 +127,8 @@ epidemic took place (\(70\)) days before the end of the simulation.
 1. Run MCMC by executing the XML file on BEAST (you may want to grab a coffee
   while this is running.)
 2. Download the simulation data (which is needed for subsequent visualisations)
-   in [[https://github.com/aezarebski/timtam2/wiki/tutorial-7/tutorial-7-simulation-output.json][tutorial-7-simulation-output.json]].
+   in [[https://github.com/aezarebski/timtam2/wiki/tutorial-7/tutorial-7-simulation-output.json][tutorial-7-simulation-output.json]] and make sure it is in your working
+   directory.
 3. Plot the results using [[https://github.com/aezarebski/timtam2/wiki/tutorial-7/post-processing.R][post-processing.R]]. This will produce two figures
   similar to the ones shown below, summarising some of the epidemiological
   estimates from your analysis.