Deep Mutational Scanning: Protein GB1

This tutorial builds a deep mutational scanning (DMS) library for the IgG-binding domain of protein G (GB1), a 56-amino-acid protein domain. The library extends the GB1 DMS study by Olson et al. (Current Biology, 2014) and contains four components:

  • All single amino acid substitutions

  • All pairwise amino acid substitutions

  • 10,000 random higher-order mutants

  • 100 wild-type replicates

Figure: DMS library design DAG showing the pipeline from wild-type ORF through single, pairwise, and higher-order mutagenesis to the final stacked library.

import poolparty as pp
pp.init()

Define the wild-type ORF

The GB1 coding sequence is 168 bp (56 codons). We load it as a single-sequence pool with from_seq and target codons 1 through 55 for mutagenesis, skipping the start codon at position 0.

GB1_ORF = (
    "ATGCAGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAG"
    "ACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATG"
    "GACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA"
)

orf_pool = pp.from_seq(GB1_ORF).named("orf_pool")
pos = slice(1, 56)

The slice pos = slice(1, 56) is passed as codon_positions in the calls below. Because Python slices exclude the stop index, it covers codons 1 through 55, that is, all 55 non-start codons.
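As a sanity check that does not depend on poolparty, the sketch below translates GB1_ORF with the standard genetic code using only the Python standard library, and confirms which codon indices slice(1, 56) selects. The helper translate_dna and the CODON_TABLE construction are illustrative names introduced here; they are not part of the poolparty API.

```python
from itertools import product

GB1_ORF = (
    "ATGCAGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAG"
    "ACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATG"
    "GACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA"
)

# Standard genetic code (NCBI translation table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(seq):
    """Translate an in-frame DNA sequence codon by codon (illustrative helper)."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


protein = translate_dna(GB1_ORF)
num_codons = len(GB1_ORF) // 3

# Apply the same slice to the list of codon indices to see what it selects.
targeted = list(range(num_codons))[slice(1, 56)]

print(len(GB1_ORF), num_codons)                  # 168 56
print(protein)
print(targeted[0], targeted[-1], len(targeted))  # 1 55 55
```

The printed protein starts with M at codon 0 and Q at codon 1, which matches the wild-type residue reported for codon 1 later in this tutorial.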

Single amino acid substitutions

mutagenize_orf in sequential mode with num_mutations=1 generates every possible single amino acid substitution. Each codon position has 19 possible missense changes (one per alternative amino acid). Because most amino acids are encoded by multiple codons, missense_only_first selects a single codon for each target amino acid (the first listed in the codon table), avoiding redundant synonymous alternatives. The style="red" parameter highlights mutated codons in the output (see Styling).

single_pool = orf_pool.mutagenize_orf(
    num_mutations=1,
    mutation_type="missense_only_first",
    codon_positions=pos,
    prefix="single",
    style="red",
    mode="sequential",
    cards={"codon_positions": "position", "wt_aas": "wt_aa", "mut_aas": "mut_aa"},
).named("single_pool")

single_pool.print_library(num_seqs=5, show_name=True)
single_pool: seq_length=168, num_states=1045
name         seq
single_0000  ATGTTCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0001  ATGCTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0002  ATGATCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0003  ATGATGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0004  ATGGTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
... (1,045 total)

This yields 1,045 variants: 55 positions times 19 alternative amino acids at each position. The first five variants shown above all mutate codon 1 (Gln in the wild type) to different amino acids, with the mutated codon highlighted in red.
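The count can be reproduced at the protein level without poolparty. The following is a minimal sketch that enumerates every (position, wild-type residue, substituted residue) triple for codons 1 through 55. The names WT_PROTEIN and single_mutants are illustrative, and the alphabetical ordering of substitutions used here is an assumption of this sketch; it differs from the order in which poolparty lists variants.

```python
# Wild-type GB1 protein sequence (translation of GB1_ORF, start Met at index 0).
WT_PROTEIN = "MQYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

single_mutants = [
    (pos, wt_aa, mut_aa)
    for pos, wt_aa in enumerate(WT_PROTEIN)
    if pos >= 1                   # skip the start codon at position 0
    for mut_aa in AMINO_ACIDS
    if mut_aa != wt_aa            # missense only: exclude the wild-type residue
]

print(len(single_mutants))  # 1045
print(single_mutants[0])    # (1, 'Q', 'A')
```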

Pairwise amino acid substitutions

The same operation with num_mutations=2 enumerates every possible pair of single amino acid changes.

double_pool = orf_pool.mutagenize_orf(
    num_mutations=2,
    mutation_type="missense_only_first",
    codon_positions=pos,
    prefix="double",
    style="red",
    mode="sequential",
).named("double_pool")

print(double_pool.num_states)  # 536085

With 55 positions and 19 amino acids each, the number of pairwise combinations is C(55, 2) × 19^2 = 536,085.
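The arithmetic behind this count can be checked with the standard library. This short sketch only reproduces the combinatorics; the variable names are illustrative.

```python
from math import comb

num_positions = 55      # codons 1 through 55
num_alternatives = 19   # alternative amino acids per position

position_pairs = comb(num_positions, 2)       # unordered pairs of distinct positions
aa_choices_per_pair = num_alternatives ** 2   # independent choice at each of the two positions
num_doubles = position_pairs * aa_choices_per_pair

print(position_pairs, aa_choices_per_pair, num_doubles)  # 1485 361 536085
```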

Random higher-order mutants

For variants with three or more mutations, exhaustive enumeration is impractical. Random mode samples from this space instead. Unlike num_mutations, which fixes the exact number of mutations per sequence, mutation_rate specifies a per-codon probability, so each sequence receives a variable number of changes. Here mutation_rate=0.1 mutates each codon independently with 10% probability, and num_states=10000 controls how many random draws to take.

random_pool = orf_pool.mutagenize_orf(
    mutation_rate=0.1,
    mutation_type="missense_only_first",
    codon_positions=pos,
    prefix="random",
    style="red",
    mode="random",
    num_states=10000,
).named("random_pool")
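To get a feel for what mutation_rate=0.1 implies, the sketch below treats the number of mutated codons per sequence as a binomial draw over the 55 targeted codons. This follows the per-codon independence described above but is an assumption about the sampling model, not something verified against the poolparty implementation. It also computes the size of the exhaustive triple-mutant space for comparison.

```python
from math import comb

n_codons = 55   # targeted codons
rate = 0.1      # per-codon mutation probability


def binom_pmf(k):
    """Probability that exactly k of the n_codons codons are mutated."""
    return comb(n_codons, k) * rate**k * (1 - rate) ** (n_codons - k)


mean_mutations = n_codons * rate
p_fewer_than_3 = sum(binom_pmf(k) for k in range(3))

# Size of the exhaustive triple-mutant space: choose 3 positions, 19 options each.
num_triples = comb(n_codons, 3) * 19**3

print(f"expected mutations per sequence: {mean_mutations:.1f}")  # 5.5
print(f"P(fewer than 3 mutations): {p_fewer_than_3:.3f}")
print(f"exhaustive triple mutants: {num_triples:,}")             # 179,945,865
```

Under this model a small fraction of the random draws carry fewer than three mutations, so the random component is best read as a sample centered on higher-order mutants rather than one restricted to them.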

Wild-type replicates

Including multiple copies of the wild-type sequence provides internal controls for experimental normalization. repeat simply duplicates the input a given number of times.

wt_pool = orf_pool.repeat(times=100, prefix="wt").named("wt_pool")

Combine into a final library

stack merges the four sub-libraries into a single pool. Each component retains its own naming prefix, so variants can be traced back to their source.

dms_pool = pp.stack([single_pool, double_pool, random_pool, wt_pool])

dms_pool.print_library(num_seqs=10, seed=42, show_name=True)
dms_pool: seq_length=168, num_states=547230
name         seq
single_0000  ATGTTCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0001  ATGCTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0002  ATGATCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0003  ATGATGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0004  ATGGTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0005  ATGAGCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0006  ATGCCCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0007  ATGACCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0008  ATGGCCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
single_0009  ATGTACTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA
... (547,230 total)

Because stack places components in the order they are listed, the first 1,045 states are all single mutants. The 10 variants shown here are therefore all single amino acid substitutions at codon position 1 (Gln in the wild type). The mutated codon is highlighted in red, making it easy to spot changes at a glance.
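The same ordering argument gives the index range of every component. The sketch below assumes that stack simply concatenates its inputs in list order and that states are indexed from zero, which is consistent with the description above but is not taken from the poolparty documentation. The component sizes are the state counts reported in this tutorial.

```python
from itertools import accumulate

# (prefix, number of states) in the order passed to stack
components = [("single", 1045), ("double", 536085), ("random", 10000), ("wt", 100)]

sizes = [size for _, size in components]
ends = list(accumulate(sizes))   # exclusive end index of each component
starts = [0] + ends[:-1]         # inclusive start index of each component

index_ranges = {
    name: (start, end - 1)
    for (name, _), start, end in zip(components, starts, ends)
}

for name, (first, last) in index_ranges.items():
    print(f"{name:>6}: states {first:>7,} to {last:>7,}")

total_states = ends[-1]
print(total_states)  # 547230
```

Under these assumptions the wild-type replicates occupy the last 100 states of the stacked library.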

Translating to amino acid sequences

translate converts the coding sequence to its amino acid representation. When preserve_codon_styles=True (the default), the red highlighting carries over from the mutated codon to the corresponding amino acid.

translated = dms_pool.translate()
translated.print_library(num_seqs=5, show_name=True)
translated: seq_length=56, num_states=547230
name         seq
single_0000  MFYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE
single_0001  MLYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE
single_0002  MIYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE
single_0003  MMYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE
single_0004  MVYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE
... (547,230 total)

Design cards

The cards parameter on mutagenize_orf records each mutation as structured design card columns, so every variant carries a record of what was changed:

df = single_pool.generate_library()
df: 1,045 rows × 5 columns
name         seq              position  wt_aa  mut_aa
single_0000  ATGTTCTAC...GAA  (1,)      (Q,)   (F,)
single_0001  ATGCTGTAC...GAA  (1,)      (Q,)   (L,)
single_0002  ATGATCTAC...GAA  (1,)      (Q,)   (I,)
single_0003  ATGATGTAC...GAA  (1,)      (Q,)   (M,)
single_0004  ATGGTGTAC...GAA  (1,)      (Q,)   (V,)
...          ...              ...       ...    ...

Each row records the codon position, wild-type amino acid, and substituted amino acid. These columns are ready for downstream filtering and analysis without parsing the sequences themselves.
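As one example of such downstream use, the sketch below builds compact mutation labels (such as Q1F) from the card columns. The rows are hand-copied from the table above as plain dictionaries so that the example runs without poolparty or pandas, and it assumes each card cell is a tuple with one entry per mutation, as the display suggests. The helper mutation_label is hypothetical. With a real pandas DataFrame, df.to_dict("records") would produce rows of this shape.

```python
# First five rows of the single-mutant table, reduced to the card columns.
rows = [
    {"name": "single_0000", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("F",)},
    {"name": "single_0001", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("L",)},
    {"name": "single_0002", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("I",)},
    {"name": "single_0003", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("M",)},
    {"name": "single_0004", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("V",)},
]


def mutation_label(row):
    """Join one label per mutation, e.g. 'Q1F'; multiple mutations are ':'-separated."""
    return ":".join(
        f"{wt}{pos}{mut}"
        for pos, wt, mut in zip(row["position"], row["wt_aa"], row["mut_aa"])
    )


labels = {row["name"]: mutation_label(row) for row in rows}

# Example filter: variants whose substitution introduces a branched-chain residue.
branched = [row["name"] for row in rows if set(row["mut_aa"]) & set("ILV")]

print(labels["single_0000"])  # Q1F
print(branched)               # ['single_0001', 'single_0002', 'single_0004']
```

The labels use the zero-based codon index recorded in the position column, so position 1 is the first residue after the start codon.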

Library composition

Component             Mode        States
Single mutants        sequential  1,045
Double mutants        sequential  536,085
Random mutants        random      10,000
Wild-type replicates              100
Total                             547,230

See mutagenize_orf, translate, repeat, stack, and library size for full parameter details and how operation counts compose. To export the library as a DataFrame or file, see to_df and to_file in Pools.