vignettes/feature5_sampleSubtable.Rmd
This vignette shows how and why to use the subsample table
functionality of the pepr package.
For basic information about the PEP concept, visit the project website.
For a broader theoretical description, see the subsample table documentation section.
The examples below demonstrate how and why to use the sample subannotation functionality to provide multiple input files of the same type for a single sample.
This example demonstrates how the sample subannotation functionality
is used. In this example, two samples (frog_1 and frog_2) have multiple
input files that need merging, while one sample (frog_3) does not.
Therefore, frog_3 specifies its file in the sample_table.csv file, while
the others leave that field blank and instead specify several files in
the subsample_table.csv file.
This example is made up of these components:
pep_version: 2.0.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv
looper:
  output_dir: $HOME/example_results
sample_name | protocol | file |
---|---|---|
frog_1 | anySampleType | multi |
frog_2 | anySampleType | multi |
frog_3 | anySampleType | multi |
sample_name | subsample_name | file |
---|---|---|
frog_1 | sub_a | data/frog1a_data.txt |
frog_1 | sub_b | data/frog1b_data.txt |
frog_1 | sub_c | data/frog1c_data.txt |
frog_2 | sub_a | data/frog2a_data.txt |
frog_2 | sub_b | data/frog2b_data.txt |
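The merge that these two tables describe can be sketched in plain base R. This is only an illustration of the idea, not pepr's actual implementation: the data frames are typed in by hand from the tables above, and the names files_by_sample and merged_file are made up for this sketch.

```r
# Sample table: one row per sample (values copied from the table above)
sample_table <- data.frame(
  sample_name = c("frog_1", "frog_2", "frog_3"),
  protocol = "anySampleType",
  file = "multi",
  stringsAsFactors = FALSE
)

# Subsample table: one row per input file
subsample_table <- data.frame(
  sample_name = c("frog_1", "frog_1", "frog_1", "frog_2", "frog_2"),
  subsample_name = c("sub_a", "sub_b", "sub_c", "sub_a", "sub_b"),
  file = paste0("data/frog", c("1a", "1b", "1c", "2a", "2b"), "_data.txt"),
  stringsAsFactors = FALSE
)

# Group the subsample files by sample name
files_by_sample <- split(subsample_table$file, subsample_table$sample_name)

# Use the grouped files where subsample rows exist; otherwise keep the
# value already present in the sample table
merged_file <- lapply(seq_len(nrow(sample_table)), function(i) {
  name <- sample_table$sample_name[i]
  if (name %in% names(files_by_sample)) {
    files_by_sample[[name]]
  } else {
    sample_table$file[i]
  }
})
names(merged_file) <- sample_table$sample_name
merged_file
```

Samples with rows in the subsample table end up with a character vector of several files, while frog_3 keeps its single sample-level value.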
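Let’s create the Project object and see if multiple files are present:
library(pepr)
# Assumption: `branch` names the bundled example_peps directory;
# "master" matches the path printed in the output below
branch = "master"
projectConfig1 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable1",
  "project_config.yaml",
  package = "pepr"
)
p1 = Project(projectConfig1)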
#> Loading config file: /home/runner/work/_temp/Library/pepr/extdata/example_peps-master/example_subtable1/project_config.yaml
# Check the files
p1Samples = sampleTable(p1)
p1Samples$file
#> [[1]]
#> [1] "data/frog1a_data.txt" "data/frog1b_data.txt" "data/frog1c_data.txt"
#>
#> [[2]]
#> [1] "data/frog2a_data.txt" "data/frog2b_data.txt"
#>
#> [[3]]
#> [1] "multi"
# Check the subsample names
p1Samples$subsample_name
#> [[1]]
#> [1] "sub_a" "sub_b" "sub_c"
#>
#> [[2]]
#> [1] "sub_a" "sub_b"
#>
#> [[3]]
#> NULL
And inspect the whole table in the p1@samples slot:
sample_name | protocol | file | subsample_name |
---|---|---|---|
frog_1 | anySampleType | data/frog1a_data.txt, data/frog1b_data.txt, data/frog1c_data.txt | sub_a, sub_b, sub_c |
frog_2 | anySampleType | data/frog2a_data.txt, data/frog2b_data.txt | sub_a, sub_b |
frog_3 | anySampleType | multi | NULL |
You can also access a single subsample by calling the getSubsample
method with the appropriate sample_name and subsample_name attribute
combination. Note that this is only possible if the subsample_name
column is defined in the subsample_table.csv file.
sampleName = "frog_1"
subsampleName = "sub_a"
getSubsample(p1, sampleName, subsampleName)
#> sample_name protocol file subsample_name
#> 1: frog_1 anySampleType data/frog1a_data.txt sub_a
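To make the lookup concrete without depending on pepr, the same selection can be sketched on a hand-built copy of the subsample table. The helper pick_subsample is hypothetical and written only for this illustration; it is not part of the pepr API and makes no claim about how getSubsample works internally.

```r
# Subsample table copied by hand from the first example
subsample_table <- data.frame(
  sample_name = c("frog_1", "frog_1", "frog_1", "frog_2", "frog_2"),
  subsample_name = c("sub_a", "sub_b", "sub_c", "sub_a", "sub_b"),
  file = paste0("data/frog", c("1a", "1b", "1c", "2a", "2b"), "_data.txt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the row matching one sample/subsample pair
pick_subsample <- function(tbl, sample, subsample) {
  # The lookup is only meaningful when subsamples are named
  if (!"subsample_name" %in% names(tbl)) {
    stop("the table has no subsample_name column")
  }
  hit <- tbl[tbl$sample_name == sample & tbl$subsample_name == subsample, ,
             drop = FALSE]
  if (nrow(hit) == 0) {
    stop("no such sample_name/subsample_name combination")
  }
  hit
}

one <- pick_subsample(subsample_table, "frog_1", "sub_a")
one$file
```

The explicit check for the subsample_name column mirrors the requirement stated above: without that column there is nothing to select on.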
This example uses a subsample_table.csv file and derived attributes to
point to files. This is a rather complex example. Notice we must include
the file_id column in the sample_table.csv file and leave it blank; it
is then populated for just some of the samples (frog_1 and frog_2) in
the subsample_table.csv file, but is left empty for the samples that are
not merged.
This example is made up of these components:
pep_version: 2.0.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv
looper:
  output_dir: $HOME/hello_looper_results
  pipeline_interfaces: ../pipeline/pipeline_interface.yaml
sample_modifiers:
  derive:
    attributes: file
    sources:
      local_files: ../data/{identifier}{file_id}_data.txt
      local_files_unmerged: ../data/{identifier}_data.txt
sample_name | protocol | identifier | file |
---|---|---|---|
frog_1 | anySampleType | frog1 | local_files |
frog_2 | anySampleType | frog2 | local_files |
frog_3 | anySampleType | frog3 | local_files_unmerged |
frog_4 | anySampleType | frog4 | local_files_unmerged |
sample_name | file_id | subsample_name |
---|---|---|
frog_1 | a | a |
frog_1 | b | b |
frog_1 | c | c |
frog_2 | a | a |
frog_2 | b | b |
Let’s load the project config, create the Project object and see if multiple files are present:
projectConfig2 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable2",
  "project_config.yaml",
  package = "pepr"
)
p2 = Project(projectConfig2)
#> Loading config file: /home/runner/work/_temp/Library/pepr/extdata/example_peps-master/example_subtable2/project_config.yaml
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 3 rows
#> to replace 1 rows
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 2 rows
#> to replace 1 rows
# Check the files
p2Samples = sampleTable(p2)
p2Samples$file
#> [[1]]
#> [1] "../data/frog1a_data.txt"
#>
#> [[2]]
#> [1] "../data/frog2a_data.txt"
#>
#> [[3]]
#> [1] "../data/frog3_data.txt"
#>
#> [[4]]
#> [1] "../data/frog4_data.txt"
And inspect the whole table in the p2@samples slot:
sample_name | protocol | identifier | file | file_id | subsample_name |
---|---|---|---|---|---|
frog_1 | anySampleType | frog1 | ../data/frog1a_data.txt | a, b, c | a, b, c |
frog_2 | anySampleType | frog2 | ../data/frog2a_data.txt | a, b | a, b |
frog_3 | anySampleType | frog3 | ../data/frog3_data.txt | NULL | NULL |
frog_4 | anySampleType | frog4 | ../data/frog4_data.txt | NULL | NULL |
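The local_files source in the config is a template whose {identifier} and {file_id} placeholders are filled from sample attributes. As a rough sketch of that substitution (plain base R, not pepr's own code; expand_template is a hypothetical helper), the following expands the template once for each file_id value listed for frog_1:

```r
# Hypothetical helper: fill {key} placeholders in a template string
expand_template <- function(template, values) {
  out <- template
  for (key in names(values)) {
    # fixed = TRUE treats the braces literally rather than as a regex
    out <- gsub(paste0("{", key, "}"), values[[key]], out, fixed = TRUE)
  }
  out
}

template <- "../data/{identifier}{file_id}_data.txt"
file_ids <- c("a", "b", "c")  # file_id values for frog_1 in the subsample table

# One expansion per subsample row
paths <- vapply(
  file_ids,
  function(id) expand_template(template, list(identifier = "frog1", file_id = id)),
  character(1),
  USE.NAMES = FALSE
)
paths
```

Expanding per subsample row gives one path for each file_id value, which is a useful reference point when comparing against the file column in the table above.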
This example gives the exact same results as Example 2, but in this
case uses a wildcard for frog_2 instead of including it in the
subsample_table.csv file. Since we can’t use a wildcard and a
subannotation for the same sample, this necessitates specifying a
second data source class (local_files_unmerged) that uses an asterisk
(*). The outcome is the same.
This example is made up of these components:
pep_version: 2.0.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv
looper:
  output_dir: $HOME/hello_looper_results
  pipeline_interfaces: ../pipeline/pipeline_interface.yaml
sample_modifiers:
  derive:
    attributes: file
    sources:
      local_files: ../data/{identifier}{file_id}_data.txt
      local_files_unmerged: ../data/{identifier}*_data.txt
sample_name | protocol | identifier | file | file_id |
---|---|---|---|---|
frog_1 | anySampleType | frog1 | local_files | NA |
frog_2 | anySampleType | frog2 | local_files_unmerged | NA |
frog_3 | anySampleType | frog3 | local_files_unmerged | NA |
frog_4 | anySampleType | frog4 | local_files_unmerged | NA |
sample_name | file_id |
---|---|
frog_1 | a |
frog_1 | b |
frog_1 | c |
Let’s load the project config, create the Project object and see if multiple files are present:
projectConfig3 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable3",
  "project_config.yaml",
  package = "pepr"
)
p3 = Project(projectConfig3)
#> Loading config file: /home/runner/work/_temp/Library/pepr/extdata/example_peps-master/example_subtable3/project_config.yaml
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 3 rows
#> to replace 1 rows
# Check the files
p3Samples = sampleTable(p3)
p3Samples$file
#> [[1]]
#> [1] "../data/frog1a_data.txt"
#>
#> [[2]]
#> [1] "../data/frog2*_data.txt"
#>
#> [[3]]
#> [1] "../data/frog3*_data.txt"
#>
#> [[4]]
#> [1] "../data/frog4*_data.txt"
And inspect the whole table in the p3@samples slot:
sample_name | protocol | identifier | file | file_id |
---|---|---|---|---|
frog_1 | anySampleType | frog1 | ../data/frog1a_data.txt | a, b, c |
frog_2 | anySampleType | frog2 | ../data/frog2*_data.txt | |
frog_3 | anySampleType | frog3 | ../data/frog3*_data.txt | |
frog_4 | anySampleType | frog4 | ../data/frog4*_data.txt | |
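In the table above the asterisk stays in the derived path as a literal character, so whatever consumes the path has to expand it. A minimal sketch of such an expansion with base R's Sys.glob follows. The files are created in a temporary directory purely for illustration; their names are assumed to follow the pattern in the config, and nothing here is part of pepr.

```r
# Create a scratch directory with files named like the example data
data_dir <- file.path(tempdir(), "pepr_glob_demo")
dir.create(data_dir, showWarnings = FALSE)
demo_files <- file.path(
  data_dir,
  c("frog2a_data.txt", "frog2b_data.txt", "frog3_data.txt")
)
invisible(file.create(demo_files))

# Expand the wildcard the same way a shell would
pattern <- file.path(data_dir, "frog2*_data.txt")
matched <- sort(basename(Sys.glob(pattern)))
matched

# An asterisk also matches the empty string, so the same source class
# works for a sample with a single, unsuffixed file
single <- basename(Sys.glob(file.path(data_dir, "frog3*_data.txt")))
single
```

This is why one wildcard source class can serve both the multi-file sample (frog_2) and the single-file samples (frog_3 and frog_4).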
Merging applies to inputs of the same class (for example, multiple files for read1). Inputs of different classes (for example, read1 versus read2) are handled by separate attributes (columns). This example shows how to handle paired-end data while also merging files within each read.
This example is made up of these components:
pep_version: 2.0.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv
looper:
  output_dir: $HOME/hello_looper_results
  pipeline_interfaces: ../pipeline/pipeline_interface.yaml
sample_name | protocol |
---|---|
frog_1 | anySampleType |
frog_2 | anySampleType |
frog_3 | anySampleType |
frog_4 | anySampleType |
sample_name | read1 | read2 |
---|---|---|
frog_1 | frog1a_data.txt | frog1a_data2.txt |
frog_1 | frog1b_data.txt | frog1b_data2.txt |
frog_1 | frog1c_data.txt | frog1b_data2.txt |
Let’s load the project config, create the Project object and see if multiple files are present:
projectConfig4 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable4",
  "project_config.yaml",
  package = "pepr"
)
p4 = Project(projectConfig4)
#> Loading config file: /home/runner/work/_temp/Library/pepr/extdata/example_peps-master/example_subtable4/project_config.yaml
# Check the read1 and read2 columns
p4Samples = sampleTable(p4)
p4Samples$read1
#> [[1]]
#> [1] "frog1a_data.txt" "frog1b_data.txt" "frog1c_data.txt"
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> NULL
#>
#> [[4]]
#> NULL
p4Samples$read2
#> [[1]]
#> [1] "frog1a_data2.txt" "frog1b_data2.txt" "frog1b_data2.txt"
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> NULL
#>
#> [[4]]
#> NULL
And inspect the whole table in the p4@samples slot:
sample_name | protocol | read1 | read2 |
---|---|---|---|
frog_1 | anySampleType | frog1a_data.txt, frog1b_data.txt, frog1c_data.txt | frog1a_data2.txt, frog1b_data2.txt, frog1b_data2.txt |
frog_2 | anySampleType | NULL | NULL |
frog_3 | anySampleType | NULL | NULL |
frog_4 | anySampleType | NULL | NULL |
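Because read1 and read2 are merged independently, a pipeline would typically want to confirm that the two list-columns line up before pairing files. The following base R sketch shows one way to do that. The lists are typed in by hand from the output above (values copied as-is), only two of the four samples are included for brevity, and the variable names are illustrative rather than anything pepr provides.

```r
# List-columns as shown in the output above; samples without
# subsample rows hold NULL
read1 <- list(
  frog_1 = c("frog1a_data.txt", "frog1b_data.txt", "frog1c_data.txt"),
  frog_2 = NULL
)
read2 <- list(
  frog_1 = c("frog1a_data2.txt", "frog1b_data2.txt", "frog1b_data2.txt"),
  frog_2 = NULL
)

# Each sample must have the same number of read1 and read2 files
same_length <- lengths(read1) == lengths(read2)
stopifnot(all(same_length))

# Pair the files positionally for one sample
pairs <- data.frame(
  read1 = read1[["frog_1"]],
  read2 = read2[["frog_1"]],
  stringsAsFactors = FALSE
)
pairs
```

Pairing is positional: the first read1 file goes with the first read2 file, and so on, so the row order in the subsample table determines the pairs.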