Deep Mutational Scanning: Protein GB1
This tutorial builds a deep mutational scanning (DMS) library for the IgG-binding domain of protein G (GB1), a 56-amino-acid protein domain. This library extends the GB1 DMS study by Olson et al. (Current Biology, 2014) by considering:
- All single amino acid substitutions
- All pairwise amino acid substitutions
- 10,000 random higher-order mutants
- 100 wild-type replicates
import poolparty as pp
pp.init()
Define the wild-type ORF
The GB1 coding sequence is 168 bp (56 codons). We load it as a single-sequence pool with from_seq and target codons 1 through 55 for mutagenesis, skipping the start codon at position 0.
GB1_ORF = (
"ATGCAGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAG"
"ACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATG"
"GACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA"
)
orf_pool = pp.from_seq(GB1_ORF).named("orf_pool")
pos = slice(1, 56)
Passing pos as codon_positions in the calls below targets all 55
non-start codons for mutagenesis: slice(1, 56) covers codons 1 through
55 because the stop index is exclusive.
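As a sanity check on the numbers above, the following sketch verifies the ORF length and codon count in plain Python, independent of poolparty. The codon table is built from the standard genetic code; the helper name `translate` here is a local illustration and is not the library's `translate` operation.

```python
from itertools import product

# Wild-type GB1 ORF, copied from the tutorial.
GB1_ORF = (
    "ATGCAGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAG"
    "ACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATG"
    "GACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA"
)

# Standard genetic code in the conventional TCAG ordering
# (third base varies fastest, then second, then first).
BASES = "TCAG"
AA_BY_CODON = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AA_BY_CODON)
}


def translate(orf):
    """Translate an in-frame DNA string to a one-letter protein string."""
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf), 3))


# Split the ORF into codons and pick the ones targeted by slice(1, 56).
codons = [GB1_ORF[i:i + 3] for i in range(0, len(GB1_ORF), 3)]
targeted = codons[slice(1, 56)]
protein = translate(GB1_ORF)

print(len(GB1_ORF), len(codons), len(targeted))
print(protein)
```

The start codon stays at index 0 and is excluded from `targeted`, which matches the choice of `slice(1, 56)` in the tutorial.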
Single amino acid substitutions
mutagenize_orf in
sequential mode with num_mutations=1
generates every possible single amino acid
substitution. Each codon position has 19 possible missense changes (one
per alternative amino acid). Because most amino acids are encoded by
multiple codons, missense_only_first selects a single codon for each
target amino acid (the first listed in the codon table), avoiding
redundant synonymous alternatives. The style="red" parameter
highlights mutated codons in the output (see Styling).
single_pool = orf_pool.mutagenize_orf(
num_mutations=1,
mutation_type="missense_only_first",
codon_positions=pos,
prefix="single",
style="red",
mode="sequential",
cards={"codon_positions": "position", "wt_aas": "wt_aa", "mut_aas": "mut_aa"},
).named("single_pool")
single_pool.print_library(num_seqs=5, show_name=True)
| name | seq |
|---|---|
| single_0000 | ATGTTCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0001 | ATGCTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0002 | ATGATCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0003 | ATGATGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0004 | ATGGTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
This yields 1,045 variants: 55 positions times 19 alternative amino acids at each position. The first five variants shown above all mutate codon 1 (Gln in the wild type) to different amino acids, with the mutated codon highlighted in red.
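The count of 1,045 can be reproduced by brute-force enumeration at the protein level. This is a minimal sketch that does not use poolparty; the wild-type protein string is the translation of the ORF, and the alphabetical ordering used here is an arbitrary choice that does not reproduce the ordering of the library itself.

```python
# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Wild-type GB1 protein (translation of the tutorial's ORF).
WT_PROTEIN = "MQYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE"

# Enumerate (codon index, wild-type residue, substituted residue)
# for every non-start position and every alternative amino acid.
singles = [
    (pos, WT_PROTEIN[pos], aa)
    for pos in range(1, 56)
    for aa in AMINO_ACIDS
    if aa != WT_PROTEIN[pos]
]

print(len(singles))  # 55 positions x 19 alternatives
```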
Pairwise amino acid substitutions
The same operation with num_mutations=2 enumerates every combination
of two amino acid substitutions at two distinct codon positions.
double_pool = orf_pool.mutagenize_orf(
num_mutations=2,
mutation_type="missense_only_first",
codon_positions=pos,
prefix="double",
style="red",
mode="sequential",
).named("double_pool")
print(double_pool.num_states) # 536085
With 55 positions and 19 alternative amino acids at each, the number of pairwise combinations is C(55, 2) x 19^2 = 1,485 x 361 = 536,085.
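The same counting argument generalizes to any mutation order: choose k of the 55 positions, then one of 19 alternatives at each. The sketch below uses only the standard library; `num_variants` is a hypothetical helper written for this illustration, not a poolparty function.

```python
from math import comb

N_POSITIONS = 55   # non-start codons targeted for mutagenesis
N_ALTERNATIVES = 19  # alternative amino acids per position


def num_variants(k, n=N_POSITIONS, a=N_ALTERNATIVES):
    """Number of distinct variants carrying exactly k substitutions."""
    return comb(n, k) * a**k


for k in (1, 2, 3):
    print(k, f"{num_variants(k):,}")
```

Under this count the triple-mutant space is already in the hundreds of millions, which illustrates why orders beyond two are usually sampled rather than enumerated.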
Random higher-order mutants
For variants with three or more mutations, exhaustive enumeration is
impractical. Random mode samples from this
space instead. Unlike num_mutations, which fixes the exact number
of mutations per sequence, mutation_rate specifies a per-codon
probability, so each sequence receives a variable number of changes.
Here mutation_rate=0.1 mutates each codon independently with 10%
probability, and num_states=10000 controls how many random draws
to take.
random_pool = orf_pool.mutagenize_orf(
mutation_rate=0.1,
mutation_type="missense_only_first",
codon_positions=pos,
prefix="random",
style="red",
mode="random",
num_states=10000,
).named("random_pool")
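To get a feel for what mutation_rate=0.1 implies, the number of mutated codons per sequence can be modeled as a binomial variable. This sketch assumes each of the 55 targeted codons is mutated independently with probability exactly 0.1, as described above; it does not call poolparty and does not account for any additional behavior the library may implement.

```python
from math import comb

N_CODONS = 55  # targeted codons
RATE = 0.1     # per-codon mutation probability


def prob_k_mutations(k, n=N_CODONS, p=RATE):
    """Binomial probability of exactly k mutated codons out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


expected = N_CODONS * RATE
p_fewer_than_three = sum(prob_k_mutations(k) for k in range(3))

print(f"expected mutations per sequence: {expected:.1f}")
print(f"P(fewer than 3 mutations): {p_fewer_than_three:.3f}")
```

Under this model the mean is 5.5 mutations per sequence, and a minority of draws carry fewer than three, so some random variants may coincide with wild type or with members of the single and double sets.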
Wild-type replicates
Including multiple copies of the wild-type sequence provides internal controls for experimental normalization. repeat simply duplicates the input a given number of times.
wt_pool = orf_pool.repeat(times=100, prefix="wt").named("wt_pool")
Combine into a final library
stack merges the four sub-libraries into a single pool. Each component retains its own naming prefix, so variants can be traced back to their source.
dms_pool = pp.stack([single_pool, double_pool, random_pool, wt_pool])
dms_pool.print_library(num_seqs=10, seed=42, show_name=True)
| name | seq |
|---|---|
| single_0000 | ATGTTCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0001 | ATGCTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0002 | ATGATCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0003 | ATGATGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0004 | ATGGTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0005 | ATGAGCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0006 | ATGCCCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0007 | ATGACCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0008 | ATGGCCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0009 | ATGTACTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
Because stack places components in the order they are listed, the
first 1,045 states are all single mutants. The 10 variants shown here
are therefore all single amino acid substitutions at codon position 1
(Gln in the wild type). The mutated codon is highlighted in red, making
it easy to spot changes at a glance.
Translating to amino acid sequences
translate converts the coding sequence to its amino acid
representation. When preserve_codon_styles=True (the default), the
red highlighting carries over from the mutated codon to the
corresponding amino acid.
translated = dms_pool.translate()
translated.print_library(num_seqs=5, show_name=True)
| name | seq |
|---|---|
| single_0000 | MFYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE |
| single_0001 | MLYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE |
| single_0002 | MIYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE |
| single_0003 | MMYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE |
| single_0004 | MVYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE |
Design cards
The cards parameter on mutagenize_orf records each mutation as
structured design card columns, so every
variant carries a record of what was changed:
df = single_pool.generate_library()
| name | seq | position | wt_aa | mut_aa |
|---|---|---|---|---|
| single_0000 | ATGTTCTAC...GAA | (1,) | (Q,) | (F,) |
| single_0001 | ATGCTGTAC...GAA | (1,) | (Q,) | (L,) |
| single_0002 | ATGATCTAC...GAA | (1,) | (Q,) | (I,) |
| single_0003 | ATGATGTAC...GAA | (1,) | (Q,) | (M,) |
| single_0004 | ATGGTGTAC...GAA | (1,) | (Q,) | (V,) |
| ... | ... | ... | ... | ... |
Each row records the codon position, wild-type amino acid, and substituted amino acid. These columns are ready for downstream filtering and analysis without parsing the sequences themselves.
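As an illustration of downstream use, the sketch below builds compact mutation labels and filters variants from the card columns. The rows are hand-copied from the table above as plain dictionaries, on the assumption that each card cell is a tuple with one entry per mutation; the helper `mutation_label` is hypothetical and not part of poolparty.

```python
# Rows copied by hand from the design card table (sequence column omitted).
rows = [
    {"name": "single_0000", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("F",)},
    {"name": "single_0001", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("L",)},
    {"name": "single_0002", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("I",)},
    {"name": "single_0003", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("M",)},
    {"name": "single_0004", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("V",)},
]


def mutation_label(row):
    """Join each mutation as <wt><codon index><mut>, e.g. 'Q1F'.

    The index is the 0-based codon position recorded in the card,
    so codon 1 corresponds to residue 2 in 1-based numbering.
    """
    parts = zip(row["wt_aa"], row["position"], row["mut_aa"])
    return "+".join(f"{wt}{pos}{mut}" for wt, pos, mut in parts)


labels = [mutation_label(row) for row in rows]

# Example filter: variants whose substitution is a branched-chain residue.
branched = {"I", "L", "V"}
selected = [row["name"] for row in rows if set(row["mut_aa"]) <= branched]

print(labels)
print(selected)
```

Because the cells are tuples, the same label function would also handle rows with more than one mutation, joining them with "+".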
Library composition
| Component | Mode | States |
|---|---|---|
| Single mutants | sequential | 1,045 |
| Double mutants | sequential | 536,085 |
| Random mutants | random | 10,000 |
| Wild-type replicates | n/a | 100 |
| Total | | 547,230 |
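The total and the index range occupied by each component can be checked with a few lines of arithmetic. This sketch assumes, as stated earlier, that stack places components in the order they are listed; the dictionary `ranges` is an illustrative construct, not something poolparty exposes.

```python
from itertools import accumulate

# (prefix, number of states) for each sub-library, in stacking order.
components = [
    ("single", 1_045),
    ("double", 536_085),
    ("random", 10_000),
    ("wt", 100),
]

sizes = [n for _, n in components]
total = sum(sizes)

# Cumulative end offsets give the half-open index range of each component.
ends = list(accumulate(sizes))
starts = [0] + ends[:-1]
ranges = {
    name: range(start, end)
    for (name, _), start, end in zip(components, starts, ends)
}

print(f"total states: {total:,}")
for name, r in ranges.items():
    print(f"{name}: indices {r.start} to {r.stop - 1}")
```

The double mutants account for the large majority of the library, which is worth keeping in mind when budgeting synthesis and sequencing depth.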
See mutagenize_orf, translate, repeat, stack, and library size for full
parameter details and how operation counts compose. To export the
library as a DataFrame or file, see to_df and to_file in Pools.