Deep Mutational Scanning: Protein GB1
======================================
This tutorial builds a deep mutational scanning (DMS) library for the
IgG-binding domain of protein G (GB1), a 56-amino-acid protein domain.
The library extends the GB1 DMS study by Olson et al. (*Current Biology*, 2014)
and combines four components:

- All single amino acid substitutions
- All pairwise amino acid substitutions
- 10,000 random higher-order mutants
- 100 wild-type replicates
.. image:: /_static/images/figure2a.drawio.svg
    :width: 80%
    :align: center
    :alt: DMS library design DAG showing the pipeline from wild-type ORF through single, pairwise, and higher-order mutagenesis to the final stacked library.

.. code-block:: python

    import poolparty as pp

    pp.init()
----
Define the wild-type ORF
------------------------
The GB1 coding sequence is 168 bp (56 codons). We load it as a
single-sequence pool with :doc:`from_seq ` and target
codons 1 through 55 for mutagenesis, skipping the start codon at
position 0.
.. code-block:: python

    # GB1 coding sequence: 168 bp, 56 codons, starting with ATG.
    GB1_ORF = (
        "ATGCAGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAG"
        "ACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATG"
        "GACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA"
    )
    orf_pool = pp.from_seq(GB1_ORF).named("orf_pool")

    # Codon indices 1..55; index 0 (the start codon) is left untouched.
    pos = slice(1, 56)
Passing ``pos`` as ``codon_positions`` in the calls below restricts
mutagenesis to the 55 non-start codons: zero-based indices 1 through 55,
since the slice end (56) is exclusive.
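To make the indexing concrete, the following plain-Python sketch (independent
of poolparty) splits the ORF into codons and applies the same slice. The names
``codons`` and ``targeted`` are illustrative only and are not part of the
poolparty API.

.. code-block:: python

    GB1_ORF = (
        "ATGCAGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAG"
        "ACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATG"
        "GACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA"
    )

    # Split the 168-bp ORF into 56 consecutive three-base codons.
    codons = [GB1_ORF[i:i + 3] for i in range(0, len(GB1_ORF), 3)]

    # slice(1, 56) keeps zero-based codon indices 1..55 and drops index 0 (ATG).
    pos = slice(1, 56)
    targeted = codons[pos]

    print(len(GB1_ORF), len(codons), len(targeted))  # 168 56 55
    print(codons[0], targeted[0], targeted[-1])      # ATG CAG GAA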
Single amino acid substitutions
-------------------------------
:doc:`mutagenize_orf ` in
:doc:`sequential mode ` with ``num_mutations=1``
generates every possible single amino acid
substitution. Each codon position has 19 possible missense changes (one
per alternative amino acid). Because most amino acids are encoded by
multiple codons, ``missense_only_first`` selects a single codon for each
target amino acid (the first listed in the codon table), avoiding
redundant synonymous alternatives. The ``style="red"`` parameter
highlights mutated codons in the output (see :doc:`/metadata/styling`).
.. code-block:: python

    single_pool = orf_pool.mutagenize_orf(
        num_mutations=1,
        mutation_type="missense_only_first",
        codon_positions=pos,
        prefix="single",
        style="red",
        mode="sequential",
        # Record each mutation as design-card columns (see "Design cards").
        cards={"codon_positions": "position", "wt_aas": "wt_aa", "mut_aas": "mut_aa"},
    ).named("single_pool")

    single_pool.print_library(num_seqs=5, show_name=True)
.. raw:: html
| name | seq |
| single_0000 | ATGTTCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0001 | ATGCTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0002 | ATGATCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0003 | ATGATGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0004 | ATGGTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
... (1,045 total)
This yields 1,045 variants: 55 positions times 19 alternative amino
acids at each position. The first five variants shown above all mutate
codon 1 (Gln in the wild type) to different amino acids, with the
mutated codon highlighted in red.
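The count can be reproduced at the amino acid level without poolparty. The
sketch below is only an illustration: it enumerates substitutions in
alphabetical order, which is an assumption made here and not necessarily the
order in which ``mutagenize_orf`` emits them, and the names ``WT_PROTEIN`` and
``singles`` are hypothetical.

.. code-block:: python

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
    WT_PROTEIN = "MQYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE"

    # One (position, wild-type aa, mutant aa) record per substitution,
    # skipping the start codon at position 0.
    singles = [
        (position, wt_aa, mut_aa)
        for position, wt_aa in enumerate(WT_PROTEIN)
        if position >= 1
        for mut_aa in AMINO_ACIDS
        if mut_aa != wt_aa
    ]

    print(len(singles))  # 1045 = 55 positions x 19 alternatives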
Pairwise amino acid substitutions
---------------------------------
The same operation with ``num_mutations=2`` enumerates every combination
of two amino acid substitutions at distinct codon positions.
.. code-block:: python

    double_pool = orf_pool.mutagenize_orf(
        num_mutations=2,
        mutation_type="missense_only_first",
        codon_positions=pos,
        prefix="double",
        style="red",
        mode="sequential",
    ).named("double_pool")

    print(double_pool.num_states)  # 536085
With 55 positions and 19 alternative amino acids at each, the number of
double mutants is C(55, 2) x 19\ :sup:`2` = 1,485 x 361 = 536,085.
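The arithmetic can be checked with the standard library alone. This is a
minimal sketch; the variable names are illustrative.

.. code-block:: python

    from itertools import combinations
    from math import comb

    num_positions = 55      # codons 1..55
    num_alternatives = 19   # amino acids other than the wild type

    # Unordered pairs of distinct positions, times 19 choices at each position.
    position_pairs = comb(num_positions, 2)
    num_doubles = position_pairs * num_alternatives ** 2
    print(position_pairs, num_doubles)  # 1485 536085

    # Cross-check by explicitly enumerating the position pairs.
    enumerated_pairs = sum(1 for _ in combinations(range(1, 56), 2))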
Random higher-order mutants
---------------------------
For variants with three or more mutations, exhaustive enumeration is
impractical. :doc:`Random mode ` samples from this
space instead. Unlike ``num_mutations``, which fixes the exact number
of mutations per sequence, ``mutation_rate`` specifies a per-codon
probability, so each sequence receives a variable number of changes.
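Here ``mutation_rate=0.1`` mutates each of the 55 targeted codons
independently with 10% probability, which gives 5.5 mutated codons per
sequence on average, and ``num_states=10000`` sets how many random draws
to take. Because the count varies from draw to draw, some sequences can
carry fewer than three mutations.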
.. code-block:: python

    random_pool = orf_pool.mutagenize_orf(
        mutation_rate=0.1,
        mutation_type="missense_only_first",
        codon_positions=pos,
        prefix="random",
        style="red",
        mode="random",
        num_states=10000,
    ).named("random_pool")
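If each codon is mutated independently, as described above, the number of
mutations per sequence follows a binomial distribution. The sketch below
computes its mean and the chance of drawing fewer than three mutations. It
assumes the per-codon model holds exactly and that no draws are rejected or
resampled, which this tutorial does not confirm for poolparty; the helper
``prob_k_mutations`` is hypothetical.

.. code-block:: python

    from math import comb

    n, p = 55, 0.1  # targeted codons, per-codon mutation probability

    def prob_k_mutations(k: int) -> float:
        """Binomial probability that exactly k of the n codons are mutated."""
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    expected = n * p
    prob_fewer_than_three = sum(prob_k_mutations(k) for k in range(3))

    print(f"expected mutations per sequence: {expected:.1f}")
    print(f"P(fewer than 3 mutations): {prob_fewer_than_three:.3f}")

Raising ``mutation_rate`` shifts the distribution toward more heavily mutated
sequences; lowering it increases the share of near-wild-type draws.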
Wild-type replicates
--------------------
Including multiple copies of the wild-type sequence provides internal
controls for experimental normalization.
:doc:`repeat ` simply duplicates the input a given
number of times.
.. code-block:: python

    wt_pool = orf_pool.repeat(times=100, prefix="wt").named("wt_pool")
Combine into a final library
-----------------------------
:doc:`stack ` merges the four sub-libraries into a single pool. Each
component retains its own naming prefix, so variants can be traced
back to their source.
.. code-block:: python

    dms_pool = pp.stack([single_pool, double_pool, random_pool, wt_pool])

    dms_pool.print_library(num_seqs=10, seed=42, show_name=True)
.. raw:: html
| name | seq |
| single_0000 | ATGTTCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0001 | ATGCTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0002 | ATGATCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0003 | ATGATGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0004 | ATGGTGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0005 | ATGAGCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0006 | ATGCCCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0007 | ATGACCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0008 | ATGGCCTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
| single_0009 | ATGTACTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAGACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATGGACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA |
... (547,230 total)
Because ``stack`` places components in the order they are listed, the
first 1,045 states are all single mutants. The 10 variants shown here
are therefore all single amino acid substitutions at codon position 1
(Gln in the wild type). The mutated codon is highlighted in red, making
it easy to spot changes at a glance.
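The ordering argument can be sketched as a lookup from a state index to its
source component. This assumes that stacking is a plain concatenation with
zero-based state indices; the helper ``component_of`` is hypothetical and not
part of poolparty.

.. code-block:: python

    from bisect import bisect_right
    from itertools import accumulate

    # Component sizes in the order they were passed to stack.
    sizes = {"single": 1_045, "double": 536_085, "random": 10_000, "wt": 100}

    # Exclusive end index of each component within the stacked pool.
    ends = list(accumulate(sizes.values()))
    names = list(sizes)

    def component_of(state_index: int) -> str:
        """Return the component that owns a zero-based state index."""
        if not 0 <= state_index < ends[-1]:
            raise IndexError(state_index)
        return names[bisect_right(ends, state_index)]

    print(ends[-1])             # 547230
    print(component_of(9))      # single
    print(component_of(1_045))  # double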
Translating to amino acid sequences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
:doc:`translate ` converts the coding sequence to its amino acid
representation. When ``preserve_codon_styles=True`` (the default), the
red highlighting carries over from the mutated codon to the
corresponding amino acid.
.. code-block:: python

    translated = dms_pool.translate()

    translated.print_library(num_seqs=5, show_name=True)
.. raw:: html
| name | seq |
| single_0000 | MFYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE |
| single_0001 | MLYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE |
| single_0002 | MIYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE |
| single_0003 | MMYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE |
| single_0004 | MVYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE |
... (547,230 total)
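For readers who want to check the translation independently, the sketch below
applies the standard genetic code in plain Python. It illustrates the concept
only and is not poolparty's implementation; ``CODON_TABLE`` and this
``translate`` function are defined here for the example.

.. code-block:: python

    from itertools import product

    BASES = "TCAG"
    AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    # Standard genetic code: codons enumerated in TCAG order map onto AMINO_ACIDS.
    CODON_TABLE = {
        "".join(codon): aa
        for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
    }

    def translate(dna: str) -> str:
        """Translate an in-frame coding sequence codon by codon."""
        return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

    wild_type = (
        "ATGCAGTACAAGCTGATCCTGAACGGTAAGACGCTGAAAGGTGAGACGACCACCGAAGCTGTAG"
        "ACGCTGCTACTGCAGAGAAGGTGTTCAAGCAGTACGCTAACGACAACGGCGTCGACGGTGAATG"
        "GACCTACGACGACGCTACCAAAACCTTCACGGTTACCGAA"
    )
    # single_0000 from the output above: codon 1 changed from CAG to TTC.
    single_0000 = "ATGTTC" + wild_type[6:]

    wt_protein = translate(wild_type)
    mut_protein = translate(single_0000)
    diffs = [
        (i, wt_aa, mut_aa)
        for i, (wt_aa, mut_aa) in enumerate(zip(wt_protein, mut_protein))
        if wt_aa != mut_aa
    ]

    print(mut_protein)
    print(diffs)  # [(1, 'Q', 'F')]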
Design cards
~~~~~~~~~~~~
The ``cards`` parameter on ``mutagenize_orf`` records each mutation as
structured :doc:`design card ` columns, so every
variant carries a record of what was changed:
.. code-block:: python

    df = single_pool.generate_library()
.. raw:: html
| name | seq | position | wt_aa | mut_aa |
| single_0000 | ATGTTCTAC...GAA | (1,) | (Q,) | (F,) |
| single_0001 | ATGCTGTAC...GAA | (1,) | (Q,) | (L,) |
| single_0002 | ATGATCTAC...GAA | (1,) | (Q,) | (I,) |
| single_0003 | ATGATGTAC...GAA | (1,) | (Q,) | (M,) |
| single_0004 | ATGGTGTAC...GAA | (1,) | (Q,) | (V,) |
| ... | ... | ... | ... | ... |
Each row records the codon position, wild-type amino acid, and
substituted amino acid. These columns are ready for downstream filtering
and analysis without parsing the sequences themselves.
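As an illustration of such downstream use, the sketch below works on the five
rows shown above, re-typed as plain Python dictionaries. It assumes the card
values are tuples with one entry per mutation, as the table suggests; the
helper ``mutation_label`` and the 1-based residue numbering in its output are
choices made for this example only.

.. code-block:: python

    rows = [
        {"name": "single_0000", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("F",)},
        {"name": "single_0001", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("L",)},
        {"name": "single_0002", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("I",)},
        {"name": "single_0003", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("M",)},
        {"name": "single_0004", "position": (1,), "wt_aa": ("Q",), "mut_aa": ("V",)},
    ]

    def mutation_label(row: dict) -> str:
        """Join per-mutation fields into labels such as 'Q2F' (1-based residues)."""
        return ":".join(
            f"{wt}{pos + 1}{mut}"
            for pos, wt, mut in zip(row["position"], row["wt_aa"], row["mut_aa"])
        )

    labels = {row["name"]: mutation_label(row) for row in rows}

    # Keep variants whose substitutions are all branched-chain residues.
    branched = [row["name"] for row in rows if set(row["mut_aa"]) <= set("ILV")]

    print(labels["single_0000"])  # Q2F
    print(branched)               # ['single_0001', 'single_0002', 'single_0004']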
Library composition
-------------------
.. list-table::
    :header-rows: 1
    :widths: 30 20 20

    * - Component
      - Mode
      - States
    * - Single mutants
      - sequential
      - 1,045
    * - Double mutants
      - sequential
      - 536,085
    * - Random mutants
      - random
      - 10,000
    * - Wild-type replicates
      - n/a
      - 100
    * - **Total**
      -
      - **547,230**
See :doc:`mutagenize_orf `,
:doc:`translate `,
:doc:`repeat `,
:doc:`stack `, and
:doc:`library size ` for full parameter
details and how operation counts compose. To export the library as a
DataFrame or file, see ``to_df`` and ``to_file`` in :doc:`/pool`.