.. _input_data_format:

|popy| Data Format
####################

The |popy| data file records |obs| and dosing regimens for each individual in a study.

The columns or fields in the data file are split into four main types, listed in :numref:`table_field_types`:-

.. _table_field_types:

.. list-table:: |popy| data fields
    :header-rows: 1

    * - Field
      - Comment
      
    * - :ref:`required_fields`
      - TYPE/ID/TIME
      
    * - :ref:`dosing_fields`
      - dosing regimen data
      
    * - :ref:`obs_fields`
      - observed measurements
        
    * - :ref:`extra_fields` 
      - extra covariate information
      
The data file values for each field can be accessed using the |cx| notation in the |popy| |script_file|.

.. _required_fields:

Required Fields
================

A |popy| data set requires the following fields:-

* :ref:`TYPE` - type of row
* :ref:`ID` - identity 
* :ref:`TIME` - time field 

Note the names ``TYPE``, ``ID`` and ``TIME`` are the default names of these three required fields. 
You can use other field names if you choose to redefine them in the |script_file| |data_fields| section.

.. _type:

TYPE
------

The ``TYPE`` field specifies the event that is happening in each row of the data file. The different types of row are as follows:-

* obs - Measurements that contribute to the log likelihood as defined in the |predictions| section.
* dose - Creates a dose according to the dosing functions in the |derivatives| section.
* pred - Extra prediction data points. |popy| will output extra |px| data at these time points, but they do |not| contribute to the likelihood.
* reset - Sets the |sx| compartment states back to their initial values (usually zero).
* reset+dose - A 'reset' combined with a 'dose' event.

Typically a drug trial data set consists mainly of 'obs' and 'dose' rows, with a few 'reset' rows, per subject.
 
.. _id:

ID
------

The ``ID`` field value defines the individual for a given row. As |popy| is a |poppkpd| system, the 'ID' field is required because the data is split over multiple individuals to form a population.

Note that non-population analysis can be performed in |popy| by assigning all rows the same 'ID' value. 

.. _time:

TIME
------

The ``TIME`` field defines the time stamp for each row. 

The time field is required to be monotonically non-decreasing (equal time stamps in consecutive rows are allowed), unless a |TYPE| = 'reset' or 'reset+dose' row is reached. Note that when the :ref:`ID` identifier changes between rows, an implicit 'reset' occurs.

For an example of a valid combination of TYPE/ID/TIME data see :numref:`table_popy_time`.

.. _table_popy_time:

.. list-table:: |popy| time reset example 
    :header-rows: 1

    * - |type|
      - |id|
      - |time|
      - comment
      
    * - obs
      - Bob
      - 0.0
      - observation at time zero
    
    * - dose
      - Bob
      - 4.0
      - dose for Bob at time 4.0
      
    * - obs
      - Bob
      - 4.0
      - observation for Bob at time 4.0
         
    * - obs
      - Bob
      - 8.0
      - later observation
          
    * - obs
      - Ruth
      - 0.0
      - time goes back, allowed because the ID is new

    * - dose
      - Ruth
      - 10.0
      - dose for Ruth at time 10.0
      
    * - obs
      - Ruth
      - 20.0
      - later observation
      
    * - reset
      - Ruth
      - 30.0
      - |sx| reset at time 30.0
      
    * - obs
      - Ruth
      - 1.0
      - observation following reset

In :numref:`table_popy_time` the time increases or stays the same in consecutive rows, except where it is allowed to go backwards: after a new ID or after a reset.
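
The ordering rule can be checked before a data file is loaded. The following is a minimal Python sketch and is not part of |popy|; the function name ``find_time_violations`` and the tuple layout are hypothetical. It assumes that both a 'reset' (or 'reset+dose') row and the row immediately following it are exempt from the ordering check, which is one reading of the rule that is consistent with :numref:`table_popy_time`.

.. code-block:: python

    RESET_TYPES = ("reset", "reset+dose")


    def find_time_violations(rows):
        """Return the indices of rows whose TIME goes backwards illegally.

        rows is a list of (type, id, time) tuples in data file order.
        """
        violations = []
        prev_id = None
        prev_time = None
        prev_was_reset = False
        for index, (row_type, row_id, time) in enumerate(rows):
            is_reset = row_type in RESET_TYPES
            # Assumption: a new ID, a reset row, or the row after a reset
            # may restart the clock.
            exempt = (
                index == 0 or row_id != prev_id or is_reset or prev_was_reset
            )
            if not exempt and time < prev_time:
                violations.append(index)
            prev_id, prev_time, prev_was_reset = row_id, time, is_reset
        return violations


    # The rows of the time reset example table above.
    table_rows = [
        ("obs", "Bob", 0.0),
        ("dose", "Bob", 4.0),
        ("obs", "Bob", 4.0),
        ("obs", "Bob", 8.0),
        ("obs", "Ruth", 0.0),
        ("dose", "Ruth", 10.0),
        ("obs", "Ruth", 20.0),
        ("reset", "Ruth", 30.0),
        ("obs", "Ruth", 1.0),
    ]
    print(find_time_violations(table_rows))  # prints []

Under these assumptions the example table yields no violations, whereas two consecutive rows for the same ID with times 5.0 then 2.0 would be reported.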


.. _dosing_fields:

Dosing Fields
===============

Dosing events are created in the data file using 'dose' values in the |type| field.

There are two methods of associating data dose rows with the |derivatives| section in the |popy| |script_file|, as follows:-

* :ref:`single_dose_type`
* :ref:`multi_dose_types`

The first method uses just the 'dose' value; the second involves defining dose type names.

The amount of each dose is usually specified in an |amt| field.

Note that in |popy| AMT is |not| a keyword. It is just the conventional name for the dose amount field used in this documentation.

.. _single_dose_type:

Single Dose Type
-------------------

The simplest way to create doses at a set of fixed times is shown in :numref:`table_popy_single_doses`.

.. _table_popy_single_doses:

.. list-table:: |popy| single dose type example 
    :header-rows: 1

    * - |type|
      - |time|
      - |amt|
      - comment
      
    * - dose
      - 1.0
      - 100
      - dose of 100 at time 1.0
      
    * - dose
      - 2.0
      - 200
      - dose of 200 at time 2.0
      
    * - dose
      - 3.0
      - 100
      - dose of 100 at time 3.0

Note that this creates 3 doses at times [1.0, 2.0, 3.0]. The script file loading this data set should have a |derivatives| section something like:-

.. code-block:: pyml

    DERIVATIVES: |
        d[DEPOT] = @bolus{amt: c[AMT]} - m[KE] * s[DEPOT]

Note that the :ref:`@bolus` dose has no name associated with it.

.. _multi_dose_types:

Multiple Dose Types
---------------------

If you have multiple types of dose in your analysis, |eg| two different drugs being prescribed, then you need to give each dose type a name, as shown in :numref:`table_popy_multi_doses`.

.. _table_popy_multi_doses:

.. list-table:: |popy| multi dose type example 
    :header-rows: 1

    * - |type|
      - |time|
      - AMT_DRUG1
      - AMT_DRUG2
      - comment
      
    * - dose:drug1
      - 1.0
      - 100
      - 0
      - 100 units of drug1
      
    * - dose:drug2
      - 2.0
      - 0
      - 200
      - 200 units of drug2
      
    * - dose:drug1
      - 3.0
      - 50
      - 0
      - 50 units of drug1

The data file above creates 2 doses of drug1 and 1 dose of drug2. The script file loading this data set should have a |derivatives| section something like:-

.. code-block:: pyml

    DERIVATIVES: |
        dose[drug1] = @bolus{amt: c[AMT_DRUG1]}
        dose[drug2] = @bolus{amt: c[AMT_DRUG2]}
        d[DEPOT1] = dose[drug1] - m[KE1] * s[DEPOT1]
        d[DEPOT2] = dose[drug2] - m[KE2] * s[DEPOT2]

The important aspect here is that the :ref:`@bolus` doses are defined with names 'drug1' and 'drug2'. These names also appear in the |type| field in the data set as 'dose:drug1' and 'dose:drug2'.

An alternative naming syntax is as follows:-

.. code-block:: pyml

    DERIVATIVES: |
        d[DEPOT1] = @bolus{amt: c[AMT_DRUG1], name: 'drug1'} - m[KE1] * s[DEPOT1]
        d[DEPOT2] = @bolus{amt: c[AMT_DRUG2], name: 'drug2'} - m[KE2] * s[DEPOT2]

Note that when creating a |popy| data set, you only need to specify a name for each type of dose. You can leave the modelling decision of where each dose appears in the compartment model to a later time.
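
When preparing a data set it can help to tally how many doses of each type the |type| column defines. The following is a minimal Python sketch and is not part of |popy|; the helper name ``split_type`` is hypothetical, and the only assumption about the format is the 'dose:<name>' convention shown in :numref:`table_popy_multi_doses`.

.. code-block:: python

    from collections import Counter


    def split_type(value):
        """Split a TYPE value into (event, dose_name).

        dose_name is None when the value carries no ':<name>' suffix.
        """
        event, _, name = value.partition(":")
        return event, (name or None)


    # The TYPE column of the multi dose type example table above.
    type_column = ["dose:drug1", "dose:drug2", "dose:drug1"]

    # Count the dose rows per dose type name.
    dose_counts = Counter(split_type(value)[1] for value in type_column)
    print(dose_counts["drug1"], dose_counts["drug2"])  # prints 2 1

The tally agrees with the description above: 2 doses of drug1 and 1 dose of drug2.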
        
.. _obs_fields:

Observation Fields
=====================

Another important set of fields in the data file are the columns that define observed measurements. Observation rows are defined by setting |type| = 'obs'.

This section shows examples of the following:-

* :ref:`single_obs_field`
* :ref:`single_obs_field_missing`
* :ref:`multiple_obs_fields`

Note in each case the |predictions| section of the |popy| |script_file| is associated with observation fields in the data file in order to compute the likelihood correctly.

.. _single_obs_field:

Single Observed Field
----------------------

An example of a single observed field is shown in :numref:`table_single_obs`.

.. _table_single_obs:

.. list-table:: |popy| single observed field example 
    :header-rows: 1

    * - |type|
      - DRUG_CONC
      
    * - obs
      - 10.5
      
    * - obs
      - 15.5
      
    * - obs
      - 2.0

In this simple case the |predictions| section may look something like:-

.. code-block:: pyml

    PREDICTIONS: |
        p[DRUG_CONC] = s[CEN]/m[V]
        c[DRUG_CONC] ~ norm(p[DRUG_CONC], m[ANOISE_var])
  
Note that :pyml:`c[DRUG_CONC]` references the 'DRUG_CONC' field of the data set. Here the likelihood is computed by comparing the model prediction :pyml:`p[DRUG_CONC]` and the data file observation :pyml:`c[DRUG_CONC]` for **all** rows of the data set where |type| = 'obs'. 

Therefore all values of the data column 'DRUG_CONC' have to be valid observations. If you have missing values then you need to use the data structure in :ref:`single_obs_field_missing`.
    
.. _single_obs_field_missing:

Observed Field with Missing Data
-----------------------------------

An example of a single observed field, with some **missing** data is shown in :numref:`table_single_obs_missing`.
 
.. _table_single_obs_missing:

.. list-table:: |popy| single observed field missing data example 
    :header-rows: 1

    * - |type|
      - DRUG_CONC
      - DRUG_CONC_FLAG
      - comment
      
    * - obs
      - 10.5
      - 1
      - DRUG_CONC valid
      
    * - obs
      - 0.0
      - 0
      - DRUG_CONC invalid
      
    * - obs
      - -5.0
      - 0
      - DRUG_CONC invalid
      
    * - obs
      - 2.0
      - 1
      - DRUG_CONC valid

In this case the |predictions| section may still look something like:-

.. code-block:: pyml

    PREDICTIONS: |
        p[DRUG_CONC] = s[CEN]/m[V]
        c[DRUG_CONC] ~ norm(p[DRUG_CONC], m[ANOISE_var])
  
However, not all the |type| = 'obs' rows contribute to the likelihood in this case. Only the rows that have |type| = 'obs' **and** DRUG_CONC_FLAG = 1 contribute.

It is similar to having the following 'if' statement in your |predictions| section:-

.. code-block:: pyml

    PREDICTIONS: |
        p[DRUG_CONC] = s[CEN]/m[V]
        if c[DRUG_CONC_FLAG] > 0.5:
            c[DRUG_CONC] ~ norm(p[DRUG_CONC], m[ANOISE_var])

You can include the 'if' statement in your |predictions| section if you like, but it is not required (or encouraged).

Note also that omitting the 'DRUG_CONC_FLAG' field from your data set has a similar effect to creating a 'DRUG_CONC_FLAG' field and setting all the values to 1, |ie| flags default to 1 in |popy|.
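
The selection rule, including the default, can be summarised in a few lines of Python. This is a minimal sketch for illustration and is not how |popy| is implemented; the function name ``contributes`` and the dictionary representation of a row are hypothetical. It assumes the flag field is named ``<FIELD>_FLAG``, as in the example above, and reuses the ``> 0.5`` test from the 'if' statement.

.. code-block:: python

    def contributes(row, field):
        """Return True if this row's value of `field` enters the likelihood."""
        if row["TYPE"] != "obs":
            return False
        # A missing flag field behaves like a flag of 1.
        flag = float(row.get(field + "_FLAG", 1))
        return flag > 0.5


    # The rows of the missing data example table above.
    table_rows = [
        {"TYPE": "obs", "DRUG_CONC": 10.5, "DRUG_CONC_FLAG": 1},
        {"TYPE": "obs", "DRUG_CONC": 0.0, "DRUG_CONC_FLAG": 0},
        {"TYPE": "obs", "DRUG_CONC": -5.0, "DRUG_CONC_FLAG": 0},
        {"TYPE": "obs", "DRUG_CONC": 2.0, "DRUG_CONC_FLAG": 1},
    ]
    selected = [contributes(row, "DRUG_CONC") for row in table_rows]
    print(selected)  # prints [True, False, False, True]

A row without any 'DRUG_CONC_FLAG' entry, such as ``{"TYPE": "obs", "DRUG_CONC": 10.5}``, is treated as valid by this sketch.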
            
If you have multiple observation types in your data set then flag fields become more important, see the example data structure in :ref:`multiple_obs_fields`.

.. _multiple_obs_fields:

Multiple Observed Fields
---------------------------

An example of multiple observed fields is shown in :numref:`table_multiple_obs`.
 
.. _table_multiple_obs:

.. list-table:: |popy| multiple observed fields
    :header-rows: 1

    * - |type|
      - DRUG1
      - DRUG1_FLAG
      - DRUG2
      - DRUG2_FLAG
      - comment
      
    * - obs
      - 10.5
      - 1
      - 0.2
      - 1
      - Both drugs valid
      
    * - obs
      - 10.5
      - 1
      - 0.0
      - 0
      - only drug1 valid
      
    * - obs
      - -4.1
      - 0
      - 0.0
      - 0
      - both drugs invalid
      
    * - obs
      - -4.1
      - 0
      - 0.5
      - 1
      - only drug2 valid

In this case the |predictions| section may look something like:-

.. code-block:: pyml

    PREDICTIONS: |
        p[DRUG1] = s[CEN1]/m[V1]
        c[DRUG1] ~ norm(p[DRUG1], m[ANOISE_var1])
        p[DRUG2] = s[CEN2]/m[V2]
        c[DRUG2] ~ norm(p[DRUG2], m[ANOISE_var2])
        
Here |popy| uses the 'DRUG1_FLAG' and 'DRUG2_FLAG' fields from the data set to only compute the likelihood from valid observations. You don't have to use 'if' statements in the |predictions| section to achieve this.
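
To see which values of each field are treated as valid, the table can be filtered one field at a time. The following is a minimal Python sketch for illustration only and is not part of |popy|; the function name ``valid_observations`` and the dictionary representation of a row are hypothetical.

.. code-block:: python

    def valid_observations(rows, field):
        """Return the values of `field` whose '<field>_FLAG' marks them valid."""
        return [
            row[field]
            for row in rows
            if float(row.get(field + "_FLAG", 1)) > 0.5
        ]


    # The rows of the multiple observed fields table above.
    table_rows = [
        {"DRUG1": 10.5, "DRUG1_FLAG": 1, "DRUG2": 0.2, "DRUG2_FLAG": 1},
        {"DRUG1": 10.5, "DRUG1_FLAG": 1, "DRUG2": 0.0, "DRUG2_FLAG": 0},
        {"DRUG1": -4.1, "DRUG1_FLAG": 0, "DRUG2": 0.0, "DRUG2_FLAG": 0},
        {"DRUG1": -4.1, "DRUG1_FLAG": 0, "DRUG2": 0.5, "DRUG2_FLAG": 1},
    ]
    print(valid_observations(table_rows, "DRUG1"))  # prints [10.5, 10.5]
    print(valid_observations(table_rows, "DRUG2"))  # prints [0.2, 0.5]

Each field is filtered by its own flag, so a single row can contribute a valid 'DRUG1' value while its 'DRUG2' value is ignored.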

.. _extra_fields:

Extra Fields
=====================

The other columns of the |popy| data file are available to use in the following :term:`verbatim` sections:-

* |model_params|
* |states|
* |derivatives|
* |predictions|

See below for a simple example of :ref:`covariate modelling <covariates>` using the |model_params|:-

.. code-block:: pyml

    MODEL_PARAMS: |
        m[X] = f[X] + f[X_Y_EFFECT]*c[Y]

Here the |mx| parameter is modelled as having a linear relationship with the :pyml:`c[Y]` covariate from the data file.
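
The arithmetic of this linear relationship is sketched in Python below. The function name ``model_param_x`` and the numerical values are hypothetical and chosen only for illustration; they are not estimates from any data set.

.. code-block:: python

    def model_param_x(f_x, f_x_y_effect, c_y):
        """Mirror m[X] = f[X] + f[X_Y_EFFECT]*c[Y] for one data row."""
        return f_x + f_x_y_effect * c_y


    # Hypothetical values: baseline f[X] = 2.0, covariate effect = 0.5.
    covariate_values = [0.0, 1.0, 4.0]
    m_x_values = [model_param_x(2.0, 0.5, c_y) for c_y in covariate_values]
    print(m_x_values)  # prints [2.0, 2.5, 4.0]

Each unit increase in the covariate shifts the parameter by the size of the effect term, and a covariate value of zero leaves the baseline unchanged.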

It is also possible to use |cx| variables in the other sections. One use case is when you already have |pk| parameters estimated (from a previous study) and wish to use these |cx| variables in the |derivatives| section, instead of estimating |mx| parameters for each individual.

..  comment
    We don't have any PD examples yet, so add this later.
    Maybe an example of loading in previous |pk| results in a |pd| example???

.. only:: browser

    .. _next_steps_data_format:

    Next Steps
    ================

    You can use the information above to construct your own |popy| data sets from real data. If you have a previously constructed |nonmem| data set then see :ref:`nonmem_dat_to_popy_dat` for guidance on how to convert such a data set to |popy| format.

    See :ref:`simple_tut_example` for an example of creating a synthetic |popy| data file from a single script. It is also possible to create multiple data sets; see :ref:`simple_mtut_example`.
