vignettes/feature4_derivedImpliedAttributes.Rmd
This vignette shows how and why to use the derived attributes and implied attributes functionalities of the `pepr` package together.
For basic information about the PEP concept, see the project website.
Make sure to study the dedicated derived attributes and implied attributes vignettes before reading this one.
While either derived attributes or implied attributes alone are often sufficient to describe your samples efficiently in a PEP, the example below demonstrates how to use derived attributes to simplify and unclutter the columns of the `sample_table.csv` file after implying attributes for samples that follow certain patterns. Combined, the two functionalities let you build complex yet flexible sample annotation tables with little effort. Note that attribute implication is always performed first, before the attributes are derived. This means that the newly created (implied) attributes can be used to construct values during column derivation. Consider the example below for reference:
sample_name | organism | time | file_path |
---|---|---|---|
pig_0h | pig | 0 | /data/lab/project/pig_susScr11_untreated.fastq |
pig_1h | pig | 1 | /data/lab/project/pig_susScr11_treated.fastq |
frog_0h | frog | 0 | /data/lab/project/frog_xenTro9_untreated.fastq |
frog_1h | frog | 1 | /data/lab/project/frog_xenTro9_treated.fastq |
Specifying detailed file paths and names, as above, is cumbersome. To make your life easier, find the patterns that the file names in the `file_path` column of `sample_table.csv` follow, imply the needed attributes, and derive the file names. This multistep process is orchestrated by the `project_config.yaml` file via the `sample_modifiers.derive` and `sample_modifiers.imply` sections:
pep_version: 2.0.0
sample_table: sample_table.csv
output_dir: $HOME/hello_looper_results
sample_modifiers:
  derive:
    attributes: [file_path]
    sources:
      source1: /data/lab/project/{organism}_{genome}_{condition}.fastq
  imply:
    - if:
        organism: pig
      then:
        genome: susScr11
    - if:
        organism: frog
      then:
        genome: xenTro9
    - if:
        time: 0
      then:
        condition: untreated
    - if:
        time: 1
      then:
        condition: treated
The `*_untreated` files are clearly associated with the samples labeled with time `0`. Therefore, the value `untreated` of the `condition` attribute is implied for the samples that have `0` in the `time` column, and `treated` for those that have `1`. Similarly, the codes `susScr11` and `xenTro9` are associated with the values in the `organism` column. Therefore, the `genome` column, which consists of those two codes, is implied from the values in the `organism` column according to `project_config.yaml`.
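The implication rules described above can be sketched in base R. This is only an illustration of the if/then logic and not how `pepr` implements it: the `samples` data frame is a hand-typed stand-in for `sample_table.csv`, and `rules` and `apply_imply()` are hypothetical names introduced for this example.

```r
# Hand-typed stand-in for sample_table.csv (assumption: not read from disk)
samples <- data.frame(
  sample_name = c("pig_0h", "pig_1h", "frog_0h", "frog_1h"),
  organism    = c("pig", "pig", "frog", "frog"),
  time        = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Each rule mirrors one if/then pair from sample_modifiers.imply
rules <- list(
  list(if_ = list(organism = "pig"),  then = list(genome = "susScr11")),
  list(if_ = list(organism = "frog"), then = list(genome = "xenTro9")),
  list(if_ = list(time = 0),          then = list(condition = "untreated")),
  list(if_ = list(time = 1),          then = list(condition = "treated"))
)

# apply_imply() is an illustrative helper, not a pepr function
apply_imply <- function(tab, rules) {
  for (rule in rules) {
    # Select rows where every attribute in the "if" part matches
    hit <- rep(TRUE, nrow(tab))
    for (key in names(rule$if_)) {
      hit <- hit & tab[[key]] == rule$if_[[key]]
    }
    # Set every attribute in the "then" part for the matching rows
    for (key in names(rule$then)) {
      if (is.null(tab[[key]])) tab[[key]] <- NA_character_
      tab[[key]][hit] <- rule$then[[key]]
    }
  }
  tab
}

implied <- apply_imply(samples, rules)
print(implied[, c("sample_name", "genome", "condition")])
```

Keeping the rules as data rather than hard-coded branches mirrors the declarative style of the configuration file: adding a new organism only requires one more rule.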
Let’s modify the original `sample_table.csv` file so that the attributes `genome` and `condition` are implied and the derived column, `file_path`, refers to the appropriate data source key from `project_config.yaml`:
sample_name | organism | time | file_path |
---|---|---|---|
pig_0h | pig | 0 | source1 |
pig_1h | pig | 1 | source1 |
frog_0h | frog | 0 | source1 |
frog_1h | frog | 1 | source1 |
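To make the derivation step concrete, here is a minimal base R sketch of how a source key such as `source1` could be expanded into a path. It assumes the `genome` and `condition` attributes have already been implied, so they are typed in by hand here. The helper `expand_template()` and the `sources` list are hypothetical names for illustration; this is not the `pepr` implementation.

```r
# Sample table after implication (assumption: genome and condition are
# filled in by hand for this sketch)
samples <- data.frame(
  sample_name = c("pig_0h", "pig_1h", "frog_0h", "frog_1h"),
  organism    = c("pig", "pig", "frog", "frog"),
  time        = c(0, 1, 0, 1),
  genome      = c("susScr11", "susScr11", "xenTro9", "xenTro9"),
  condition   = c("untreated", "treated", "untreated", "treated"),
  file_path   = "source1",
  stringsAsFactors = FALSE
)

# Mirrors sample_modifiers.derive.sources from project_config.yaml
sources <- list(
  source1 = "/data/lab/project/{organism}_{genome}_{condition}.fastq"
)

# expand_template() is an illustrative helper: it replaces each
# {attribute} placeholder with the value from the given one-row table
expand_template <- function(template, row) {
  out <- template
  for (key in names(row)) {
    placeholder <- paste0("{", key, "}")
    out <- gsub(placeholder, as.character(row[[key]]), out, fixed = TRUE)
  }
  out
}

derived <- samples
for (i in seq_len(nrow(derived))) {
  key <- derived$file_path[i]
  # Only values that name a known source are expanded
  if (key %in% names(sources)) {
    derived$file_path[i] <- expand_template(sources[[key]], derived[i, ])
  }
}

print(derived$file_path)
```

Because the template refers to `genome` and `condition`, this step only works once those columns exist, which is why implication has to run before derivation.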
Load `pepr` and read in the project metadata by specifying the path to `project_config.yaml`:
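library(pepr)
# Branch of the bundled example_peps directory;
# "master" matches the path printed in the output below
branch = "master"
# Locate the example project configuration file shipped with the package
projectConfig = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_derive_imply",
  "project_config.yaml",
  package = "pepr"
)
# Create the Project object from the configuration file
p = Project(projectConfig)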
## Loading config file: /home/runner/work/_temp/Library/pepr/extdata/example_peps-master/example_derive_imply/project_config.yaml
And inspect it:
sampleTable(p)
##    sample_name organism time                                      file_path
## 1:      pig_0h      pig    0 /data/lab/project/pig_susScr11_untreated.fastq
## 2:      pig_1h      pig    1   /data/lab/project/pig_susScr11_treated.fastq
## 3:     frog_0h     frog    0 /data/lab/project/frog_xenTro9_untreated.fastq
## 4:     frog_1h     frog    1   /data/lab/project/frog_xenTro9_treated.fastq
##      genome condition
## 1: susScr11 untreated
## 2: susScr11   treated
## 3:  xenTro9 untreated
## 4:  xenTro9   treated
As you can see, the resulting samples are annotated the same way as if they were read from the original, unwieldy annotation file, enriched with the implied `genome` and `condition` attributes.