Skip to contents

Creates a small, realistic VDJ contig annotation data frame mimicking 10x Genomics format. Designed for unit testing with controllable edge cases including dropout, heterozygous chains, and invalid VDJ sequences.

Usage

generate_test_VDJ(
  n_cells = 100,
  n_clones = 10,
  samples = c("SampleA", "SampleB"),
  dropout_rate = 0.2,
  heterozygous_rate = 0.3,
  invalid_rate = 0.05,
  seed = 42
)

Arguments

n_cells

Number of cells to generate. Default is 50.

n_clones

Number of distinct clonotypes. Default is 10.

samples

Character vector of sample names. Default is c("SampleA", "SampleB").

dropout_rate

Proportion of cells with missing chain data (0-1). Default is 0.2.

heterozygous_rate

Proportion of cells with heterozygous (dual allele) chains (0-1). Default is 0.3.

invalid_rate

Proportion of contigs with invalid VDJ (missing V or J gene) (0-1). Default is 0.05.

seed

Random seed for reproducibility. Default is 42.

Value

A data frame in 10x Genomics VDJ contig annotation format with columns: barcode, is_cell, contig_id, high_confidence, length, chain, v_gene, d_gene, j_gene, c_gene, full_length, productive, fwr1, fwr1_nt, cdr1, cdr1_nt, fwr2, fwr2_nt, cdr2, cdr2_nt, fwr3, fwr3_nt, cdr3, cdr3_nt, fwr4, fwr4_nt, reads, umis, raw_clonotype_id, raw_consensus_id, exact_subclonotype_id, origin.

Examples

test_data <- generate_test_VDJ(n_cells = 20, n_clones = 5, seed = 123)
head(test_data)
#>      barcode is_cell           contig_id high_confidence length chain v_gene
#> 1 CELL0001-1    true CELL0001-1_contig_1            true    514   TRB TRBV21
#> 2 CELL0002-1    true CELL0002-1_contig_2            true    604   TRA  TRAV3
#> 3 CELL0002-1    true CELL0002-1_contig_3            true    417   TRB  TRBV9
#> 4 CELL0003-1    true CELL0003-1_contig_4            true    507   TRA  TRAV3
#> 5 CELL0003-1    true CELL0003-1_contig_5            true    698   TRB  TRBV9
#> 6 CELL0004-1    true CELL0004-1_contig_6            true    564   TRA TRAV14
#>   d_gene j_gene c_gene full_length productive                        fwr1
#> 1  TRBD2  TRBJ1  TRBC2        true       true   CMAWVSDACCIYQPHIDSANVQIFF
#> 2   <NA>  TRAJ6   TRAC        true      false   CFPNIGSMNSRANLLEKYFKWDCNF
#> 3  TRBD1  TRBJ6  TRBC1       false       true      CVHHNKSDDQCEALGDNANPVF
#> 4   <NA>  TRAJ6   TRAC        true       true CNPGLILGWTCWWYPAGFVHTRFFPIF
#> 5  TRBD2  TRBJ6  TRBC2        true       true      CDQLYRHSNYKDWPITELTLFF
#> 6   <NA> TRAJ14   TRAC        true       true   CGDYQKYEHWIGCVHDDPIKNWCNF
#>                                                        fwr1_nt       cdr1
#> 1 CATGAAGGTCACGGGCAGGATATCTAGATTATACACTTCTCTCGACAACAGCTAAGGAAC   CSHEPNPF
#> 2 GAGAGAATTAAAGAATAAGTCCGTCGACTCTTTTCAGTGTAGTCGTGGTCTCCTGACAAC  CPVRPQVNF
#> 3 CACAGCTCGCGGTAGTAGTGGCGGCCTCGCGGTAACCTTGAAACAAGCTGTAACTAAGTC    CCLCGPF
#> 4 GACCTCCCCATGAGCTTCGCGCTGTCAGTTTCTTGCTGACCCCAAGGGCTTCCCCCCCGT    CTSFDGF
#> 5 GACACTCCTCAGCGGAGCATATACGTTCTGGCCTGTGAGAATAATCAGCTTGTAGGACCA  CPKTRQTTF
#> 6 GATTGTCGGGTCGTCGGCACACTTGCCTCCCTTTTTGGACAAGAGAAAATCTTTCAGGGT CEQITCFDFF
#>                    cdr1_nt                  fwr2
#> 1 CTTGACCGAGGAGGGGGTGTGCTA CDNQFQGKAYQWFCAYEMFEF
#> 2 TGTCCCAGAGGTCGGTCGATTAAG   CDRDCFFHCSIVWKWQWFF
#> 3 CGCGATGTGCACCGCTGGGAATCG CTGFTSKMFQFMPYEEMYGLF
#> 4 CAGGGCATCGGAGGCACAACTAGC     CEWKCTNQLHSCLWCIF
#> 5 GAACACCCACATCGGACCAGTCAG    CYKKTKLFMSQQPRQHLF
#> 6 CAGCTCACGGCTCTGCCGATAAGC     CWHISNLIYVYGHCTNF
#>                                              fwr2_nt       cdr2
#> 1 GGAGAAGAAGCAAACCCTTTCTGGTCACGAGGGACTGGCCAAACGGCCCA    CGEDMLF
#> 2 ACGCCCGTCGGGACAGAAGTTATCGAAACACAATTTCGTCTTATCGCAGG  CFIRQSAQF
#> 3 CGAGCAACTGGATCTGCCCTCCTAAAAACAGAGAAAGTAGGAAGGTTAAC  CLSAGRIAF
#> 4 AAGCTCGCCGTTTCCAGGCGGCTTACTGAGTGTAAATATAGAGCGCTTTC  CKREKFQTF
#> 5 ACAATCATATCGGGCAATATAAATTCCCGCGCGAATACAACTAAAAGTTA CGELRCHVKF
#> 6 ACTCTCGTGATCCTCAGTGGCACTTAATCGGGCTTGCAGAGGGCCTCACA CRKQIQTTHF
#>                    cdr2_nt                                      fwr3
#> 1 CAGACCAGTGGCGTAGCCGTTCCC CHHGTYAFLPLCRRIDCGQRWKRSIKSKCKMMEHVMAMGLF
#> 2 GGGAACGGAGCCTGCGAAAATTGG     CASIFLKFQWASGYIQQIMYFQAAFDEDQVLNSDNFF
#> 3 TATTATAGTGACAGTTTGAGATTT   CETYGHHMAWITQHDARQNTIACFPWYKSATVLKMACMF
#> 4 TGACCGAGGTAGGCATTCTCGCGC    CPSCPPGWMNMAIAINISLRLIICQVMCWSCKFGVTNF
#> 5 AGCTTCCGGCGTCATAGCGTACAC  CDCDAIDGIRTYWVDSMFNDGELYKRLVMVCFFYWRCLYF
#> 6 AAACCACAAGATCGGGCTCTAGTA  CDEKMWKLSNPPREILFVQGWWVGIHNENTLILRLMVYNF
#>                                                                                                fwr3_nt
#> 1 ATGGTTCCACTGTGACTGACTGTAGGCCGACGTTTGAATTCGTTAGAGGGGTTCCCAACCCGTATTCTCAGGTTTAGATGCAATCGATCAAGCGGGGTGT
#> 2 ACAAGTCTAGACGTTTGAAATCCACTTTCGATCCCGAGAGTCATTATCATGTGGTTTAGAGTTCTTGTGGCGAATTAAGGATAATCTCTATTACGTATGC
#> 3 AATTGTGTCCCAAATACGTCGCGTGCGGGTTGCATTTCAAAATTCTTGTTAGGCATGTAAGAATTTGGTGCAAGTGTACATGGTCAAGTATTGTTTGCTC
#> 4 GTTTCGATGACGGTGCGTCAACGAGGTGCAAAGATATCACCTTTCGGTCTTCGAGCATGTTATCCGCTGGCTAATGTAGAAGACTCGCGGGGTAGTAAGT
#> 5 TGATGAGGGAAGTCGAACCCAAACGGCCGCGTGGAACTAGTAGAAGCCGTCACGGTCGCCGGATGCGTGAACGTTAAACCGCAAGTGTGGGACCTGAGGG
#> 6 AATTAAGGATGTGACTGGCTGTGTCCAAGGCTTTCTTACAGCGTTCGAGTCCAGTTGCAGATAACCACCAATCCCGTTCCTAACGTCGGTAATTGATACC
#>                  cdr3                                                   cdr3_nt
#> 1 CTDVCFNPLGLLGSSDEDF GTGTTGCTAGAGTCAAGATGCCCGCTTTGCGGTCAGGGGGGCCAACAAATATTCCCC
#> 2        CCEPFWYQDISF                      ATCGTTGATAGCCACCCGAACTGCCCGGGGCTGGTG
#> 3  CDSCRDKHKECNKMHPFF    CCTATGGGCAGGGGTGCAGTCAGACGTACTCGCGTTCCAGTGGAGGTGCGCGGA
#> 4        CCEPFWYQDISF                      TCTTTACCGGGTGTAACGGCCGTAAAAACCTATTGA
#> 5  CDSCRDKHKECNKMHPFF    ACGAGCGCGAAGATGAACACGGGCCTCATCTGGTGACCGACTCCATGAAAAGAT
#> 6 CNPVAGRKRSYGMIHSTVF TGCAGCTCTCCCCCGACTATCTGTGCGCTGCAACTTCGGGTCTATTATTCGAAACTG
#>             fwr4                        fwr4_nt reads umis raw_clonotype_id
#> 1     CMDTYDQFVF CTCCCTTTAAGAAGACCTGGTTACGGACGA   911   40       clonotype3
#> 2  CGHESEKFSMSKF GCGCGAAAAAGCTGCGGGATTCGGTAAAGC  4212   25       clonotype4
#> 3     CPRNKWWGPF AGATGCTAGTTCAACGCTACCACGGCACTA  4272   24       clonotype4
#> 4 CRGRCTDVLNIIKF CCGAGTGGGACACACATTACACTAGTAATA   450   43       clonotype4
#> 5 CMFACEGFWNALEF TGCCTCTTGCCTTCGCGGCTGATTTCGACA  3279    9       clonotype4
#> 6 CVIYYEKMSLEAMF TCTGATTAGGCCAGCAGGGTCCTCAGTGGC   386   15       clonotype3
#>         raw_consensus_id exact_subclonotype_id  origin
#> 1 clonotype3_consensus_1                 sub_3 SampleB
#> 2 clonotype4_consensus_1                 sub_4 SampleB
#> 3 clonotype4_consensus_1                 sub_4 SampleB
#> 4 clonotype4_consensus_1                 sub_4 SampleA
#> 5 clonotype4_consensus_1                 sub_4 SampleA
#> 6 clonotype3_consensus_1                 sub_3 SampleA