Learn how to combine implied and derived attributes in pepr

This vignette shows how and why to use the derived attributes and implied attributes functionalities of the pepr package together.

Problem/Goal

Either derived attributes or implied attributes alone is often sufficient to describe your samples efficiently in a PEP. The example below shows how to use derived attributes to simplify and unclutter the columns of the sample_table.csv file after implying attributes for samples that follow certain patterns. Combined, the two functionalities let you build complex yet flexible sample annotation tables with little effort. Note that attribute implication is always performed first, before attributes are derived. This means that the newly created (implied) attributes can be used to construct other attributes during derivation. Consider the following sample table:

sample_name organism time file_path
pig_0h pig 0 /data/lab/project/pig_susScr11_untreated.fastq
pig_1h pig 1 /data/lab/project/pig_susScr11_treated.fastq
frog_0h frog 0 /data/lab/project/frog_xenTro9_untreated.fastq
frog_1h frog 1 /data/lab/project/frog_xenTro9_treated.fastq

Solution

Specifying detailed file paths and names, as above, is cumbersome. To make your life easier, find the patterns that the file names in the file_path column of sample_table.csv follow, imply the attributes those patterns need, and derive the file names from them. This multi-step process is orchestrated by the project_config.yaml file via its sample_modifiers.imply and sample_modifiers.derive sections:

   pep_version: 2.0.0
   sample_table: sample_table.csv
   output_dir: $HOME/hello_looper_results
   sample_modifiers:
      derive:
          attributes: [file_path]
          sources:
              source1: /data/lab/project/{organism}_{genome}_{condition}.fastq
      imply:
          - if:
                organism: pig
            then:
                genome: susScr11
          - if:
                organism: frog
            then:
                genome: xenTro9
          - if:
                time: 0
            then:
                condition: untreated
          - if:
                time: 1
            then:
                condition: treated

The *_untreated files clearly belong to the samples labeled with time 0, so the condition value untreated is implied for samples that have 0 in the time column, and treated for those that have 1. Similarly, the codes susScr11 and xenTro9 are associated with the values in the organism column, so the genome column, which holds those two codes, is implied from the organism column according to project_config.yaml.
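To make the implication step concrete, here is a minimal base R sketch of the idea. It is not pepr's implementation: the helper apply_imply and the rules list are hypothetical names introduced only for illustration, and the table is typed in by hand rather than read by pepr.

```r
# Toy sample table mirroring the example (typed by hand, not loaded via pepr).
samples <- data.frame(
  sample_name = c("pig_0h", "pig_1h", "frog_0h", "frog_1h"),
  organism = c("pig", "pig", "frog", "frog"),
  time = c("0", "1", "0", "1"),
  stringsAsFactors = FALSE
)

# Each rule pairs an `if` condition with the `then` attributes to set.
rules <- list(
  list(`if` = list(organism = "pig"), then = list(genome = "susScr11")),
  list(`if` = list(organism = "frog"), then = list(genome = "xenTro9")),
  list(`if` = list(time = "0"), then = list(condition = "untreated")),
  list(`if` = list(time = "1"), then = list(condition = "treated"))
)

# Hypothetical helper: apply every rule to the rows that satisfy its condition.
apply_imply <- function(tbl, rules) {
  for (rule in rules) {
    # Rows where every `if` attribute equals the required value.
    hit <- rep(TRUE, nrow(tbl))
    for (key in names(rule$`if`)) {
      hit <- hit & tbl[[key]] == rule$`if`[[key]]
    }
    for (key in names(rule$then)) {
      # Create the implied column on first use, then fill the matching rows.
      if (is.null(tbl[[key]])) tbl[[key]] <- NA_character_
      tbl[[key]][hit] <- rule$then[[key]]
    }
  }
  tbl
}

implied <- apply_imply(samples, rules)
print(implied[, c("sample_name", "genome", "condition")])
```

After this step every sample carries genome and condition values, which is what makes them available to the derivation that follows.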

Now modify the original sample_table.csv so that genome and condition are implied, and replace each value in the derived column, file_path, with the key of the matching data source from project_config.yaml (here, source1):

sample_name organism time file_path
pig_0h pig 0 source1
pig_1h pig 1 source1
frog_0h frog 0 source1
frog_1h frog 1 source1
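The derivation step can be sketched in base R as well. This is only an illustration under stated assumptions, not how pepr does it internally: derive_attribute is a hypothetical helper, and the table is written out by hand as it would look once genome and condition have been implied.

```r
# Sample table as it would look after implication (typed by hand).
implied <- data.frame(
  sample_name = c("pig_0h", "pig_1h", "frog_0h", "frog_1h"),
  organism = c("pig", "pig", "frog", "frog"),
  time = c("0", "1", "0", "1"),
  file_path = rep("source1", 4),
  genome = c("susScr11", "susScr11", "xenTro9", "xenTro9"),
  condition = c("untreated", "treated", "untreated", "treated"),
  stringsAsFactors = FALSE
)

# Source templates, keyed the same way as in project_config.yaml.
sources <- list(
  source1 = "/data/lab/project/{organism}_{genome}_{condition}.fastq"
)

# Hypothetical helper: replace each source key in `attribute` with its
# template, filling every {column} placeholder from the same row.
derive_attribute <- function(tbl, attribute, sources) {
  for (i in seq_len(nrow(tbl))) {
    template <- sources[[tbl[[attribute]][i]]]
    if (is.null(template)) next  # not a known source key, leave unchanged
    for (col in names(tbl)) {
      placeholder <- paste0("{", col, "}")
      template <- gsub(placeholder, tbl[[col]][i], template, fixed = TRUE)
    }
    tbl[[attribute]][i] <- template
  }
  tbl
}

derived <- derive_attribute(implied, "file_path", sources)
print(derived$file_path)
```

Because the template refers to genome and condition, the sketch only works when those columns already exist, which is why implication has to run before derivation.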

Code

Load pepr and read in the project metadata by specifying the path to the project_config.yaml:

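library(pepr)
# Branch of the example PEPs bundled with pepr; "master" matches the path printed below
branch = "master"
# Locate the example project_config.yaml shipped with the package
projectConfig = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_derive_imply",
  "project_config.yaml",
  package = "pepr"
)
# Create the Project object from the config file
p = Project(projectConfig)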
## Loading config file: /Users/runner/work/_temp/Library/pepr/extdata/example_peps-master/example_derive_imply/project_config.yaml

Then inspect the resulting sample table:

sampleTable(p)
##    sample_name organism time                                      file_path
## 1:      pig_0h      pig    0 /data/lab/project/pig_susScr11_untreated.fastq
## 2:      pig_1h      pig    1   /data/lab/project/pig_susScr11_treated.fastq
## 3:     frog_0h     frog    0 /data/lab/project/frog_xenTro9_untreated.fastq
## 4:     frog_1h     frog    1   /data/lab/project/frog_xenTro9_treated.fastq
##      genome condition
## 1: susScr11 untreated
## 2: susScr11   treated
## 3:  xenTro9 untreated
## 4:  xenTro9   treated

As you can see, the resulting samples are annotated as if they had been read from the original, unwieldy annotation file, and are additionally enriched with the implied genome and condition attributes.