Quickstart Guide
================
The examples below walk through the main ideas behind PoolParty, from
creating a single pool to composing a complete combinatorial library.
Installation
------------
.. code-block:: bash
pip install poolparty
Getting started
---------------
.. code-block:: python
import poolparty as pp
pp.init()
``pp.init()`` initializes a fresh design context. Call it before each
independent library design. For scoped contexts (e.g., inside a function
that builds one library), use ``with pp.Party()`` instead.
See :doc:`pool` for details.
----
Core concepts
-------------
**Pools.** A Pool represents a designed collection of DNA sequences. Pools
are *lazy*: they record the rules for generating sequences but delay actual
generation until the user requests it. Pools are also *immutable*: every
operation returns a new Pool, leaving the original unchanged. This means you
can branch a pipeline at any point without interference.
**Operations.** An Operation takes one or more Pools as input and produces a
new Pool as output. By chaining operations, you build a directed acyclic
graph (DAG) that specifies your library design. PoolParty provides over 20
built-in operations in four categories (source, transformation, composition,
and state). See :doc:`operations/index` for the full catalog.
**Modes.** Most operations accept a ``mode`` parameter that controls how
outputs are produced:
- ``sequential`` -- enumerate every possibility deterministically
- ``random`` -- sample from the design space
- ``fixed`` -- output is uniquely determined by the input (no variation)
See :doc:`operations/modes` for details.
These three ideas underlie every PoolParty pipeline. The sections below
show how they work in practice.
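To make the three modes concrete, here is a minimal plain-Python sketch that does not use PoolParty at all. The ``design_space`` list and the helper names ``sequential``, ``sample``, and ``fixed`` are illustrative assumptions, as is sampling with replacement; the sketch only contrasts enumerating, sampling, and passing through.

.. code-block:: python

    import random

    # Hypothetical design space: every variant an operation could emit.
    design_space = ["ATCA", "ATCC", "ATCG", "ATCT"]

    def sequential(space):
        """Enumerate every possibility once, in a fixed order."""
        return list(space)

    def sample(space, n, seed=None):
        """Draw n designs at random (with replacement); a seed makes the draw repeatable."""
        rng = random.Random(seed)
        return [rng.choice(space) for _ in range(n)]

    def fixed(seq):
        """Output fully determined by the input: one state, no variation."""
        return [seq[::-1]]

    print(sequential(design_space))  # all four variants, same order every run
    print(sample(design_space, 2, seed=42) == sample(design_space, 2, seed=42))  # True
    print(fixed("ATCG"))  # ['GCTA']

Seeding the generator is what makes the random draw repeatable from run to run.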
----
Creating a pool
---------------
All pools originate from a *source operation*. Source operations do not
require an existing pool as input. The simplest is ``from_seq``, which wraps
a single DNA sequence:
.. code-block:: python
wt = pp.from_seq("ATCGATCG")
wt.print_library()
.. code-block:: text
ATCGATCG
``num_states`` is the number of distinct sequences a pool can produce; this
pool has ``num_states=1`` because it contains exactly one sequence.
``seq_length`` reports the length shared by every sequence in the pool.
Other source operations include ``from_seqs`` (multiple sequences),
``from_iupac`` (degenerate IUPAC codes), ``from_fasta`` (FASTA files),
``get_kmers`` (all k-mers of a given length), and ``get_barcodes``
(constrained barcodes). See :doc:`operations/source_operations`.
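As a rough illustration of what degenerate-code and k-mer sources enumerate, the following plain-Python sketch expands an IUPAC pattern and lists all k-mers. It is independent of PoolParty: ``expand_iupac`` and ``all_kmers`` are hypothetical helper names, and the ordering shown is simply what ``itertools.product`` yields, not necessarily the order ``from_iupac`` or ``get_kmers`` use.

.. code-block:: python

    from itertools import product

    # Standard IUPAC nucleotide codes and the bases each one stands for.
    IUPAC = {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
        "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    }

    def expand_iupac(pattern):
        """Return every concrete sequence matched by a degenerate pattern."""
        choices = [IUPAC[code] for code in pattern]
        return ["".join(bases) for bases in product(*choices)]

    def all_kmers(k, alphabet="ACGT"):
        """Return all len(alphabet)**k k-mers over the alphabet."""
        return ["".join(kmer) for kmer in product(alphabet, repeat=k)]

    print(expand_iupac("ARY"))  # ['AAC', 'AAT', 'AGC', 'AGT']
    print(len(all_kmers(3)))    # 64

The number of sequences is the product of the per-position choices, which is why degenerate patterns grow quickly with each added ``N``.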
----
Applying an operation
---------------------
Operations transform pools. Each operation returns a new pool; the original
is unchanged. Here, ``mutagenize`` in sequential mode generates every
single-nucleotide substitution:
.. code-block:: python
mutants = wt.mutagenize(num_mutations=1, mode="sequential")
mutants.print_library(num_seqs=6)
.. code-block:: text
CTCGATCG
GTCGATCG
TTCGATCG
AACGATCG
ACCGATCG
AGCGATCG
... (24 total)
``mode="sequential"`` enumerates all 24 single-nucleotide substitutions:
8 positions times 3 alternative bases. The original ``wt`` pool is still a
single-sequence pool -- ``mutagenize`` returned a new pool.
With ``mode="random"``, the operation would draw a single random mutant
instead; passing ``num_states=N`` in random mode draws N random designs.
See :doc:`operations/modes`.
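The count of 24 can be checked with a few lines of plain Python. This is a sketch of the enumeration idea only, not PoolParty's implementation; ``single_substitutions`` is a hypothetical helper, and the position-major, alphabetical ordering is an assumption that happens to match the preview above.

.. code-block:: python

    def single_substitutions(seq, alphabet="ACGT"):
        """Yield every sequence that differs from seq at exactly one position."""
        for pos, wt_base in enumerate(seq):
            for base in alphabet:
                if base != wt_base:
                    yield seq[:pos] + base + seq[pos + 1:]

    mutants = list(single_substitutions("ATCGATCG"))
    print(len(mutants))  # 24 = 8 positions x 3 alternative bases
    print(mutants[:3])   # ['CTCGATCG', 'GTCGATCG', 'TTCGATCG']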
Operations can be called as standalone functions or as methods on a Pool:
``wt.mutagenize(...)`` and ``pp.mutagenize(wt, ...)`` are equivalent.
See :doc:`operations/index` for the full catalog.
----
Working with pools
------------------
A few API patterns make PoolParty pipelines easier to write and debug.
**Method chaining.** Since every operation returns a new Pool, calls can be
chained left-to-right into a pipeline:
.. code-block:: python
library = (
pp.from_seq("ATCGATCG")
.mutagenize(num_mutations=1, mode="sequential")
.named("mutants")
.print_library(num_seqs=3)
.repeat(times=2)
)
print(library.num_states) # 48
.. code-block:: text
CTCGATCG
GTCGATCG
TTCGATCG
... (24 total)
- ``.named("mutants")`` labels the pool for display (appears in
``print_library`` headers and ``print_dag`` output). This is distinct from
``prefix``, which labels *sequence names* in the output DataFrame
(see :doc:`metadata/naming`).
- ``.print_library(num_seqs=3)`` previews 3 sequences mid-chain, then
returns the pool unchanged so the chain continues.
- The final pool has ``num_states=48``: 24 mutants times 2 repeats.
**Branching.** Because pools are immutable, you can apply different
operations to the same input without interference:
.. code-block:: python
branch_a = wt.mutagenize(num_mutations=1, mode="sequential")
branch_b = wt.deletion_scan(deletion_length=2, mode="sequential")
# wt is unchanged; branch_a and branch_b are independent
This branching pattern is how you build multi-component libraries (as in
the capstone example below).
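The chaining and branching patterns both rest on operations returning new objects. The toy class below sketches that idea in plain Python under stated assumptions: ``ToyPool`` and its methods are hypothetical stand-ins, and unlike a real Pool it stores its sequences eagerly rather than lazily.

.. code-block:: python

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ToyPool:
        """Hypothetical stand-in for a pool: an immutable tuple of sequences."""
        seqs: tuple

        def upper(self):
            # Build and return a new object; self is never modified.
            return ToyPool(tuple(s.upper() for s in self.seqs))

        def repeat(self, times):
            # Each sequence appears `times` times in the new object.
            return ToyPool(tuple(s for s in self.seqs for _ in range(times)))

        def preview(self, n):
            # Side effect only: print, then return the same object so a chain continues.
            print(*self.seqs[:n], sep="\n")
            return self

    base = ToyPool(("atcg", "ggcc"))
    branch_a = base.upper()                     # first branch
    branch_b = base.preview(1).repeat(times=2)  # second branch, with a mid-chain preview
    print(base.seqs)                            # ('atcg', 'ggcc'), still unchanged

Because the dataclass is frozen, neither branch can alter ``base``, which is the property that makes branching safe.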
**Inspecting a pool.** At any point you can check ``pool.num_states``,
``pool.seq_length``, and ``pool.regions``. Call ``pool.print_dag()`` to
visualize the full pipeline structure (demonstrated in the capstone).
**Reproducibility.** Pass ``seed=42`` to ``print_library``,
``generate_library``, or ``to_df`` for reproducible output across runs.
----
Sequence regions
----------------
You often want to perform different operations on different parts of a
sequence. Regions let you mark specific segments with XML-style tags so
that operations can target them by name:
.. code-block:: python
template = pp.from_seq("AAAA<cre>ATCGATCG</cre>TTTT")
cre_mutants = template.mutagenize(
num_mutations=1, region="cre", mode="sequential"
).named("cre_mutants")
cre_mutants.print_library(num_seqs=4)
.. code-block:: text
AAAA<cre>CTCGATCG</cre>TTTT
AAAA<cre>GTCGATCG</cre>TTTT
AAAA<cre>TTCGATCG</cre>TTTT
AAAA<cre>AACGATCG</cre>TTTT
... (24 total)
Only the 8 bases inside ``<cre>`` are mutated; the flanking ``AAAA`` and
``TTTT`` remain unchanged. Tags persist through the DAG, so multiple
operations can target the same region in series. PoolParty also supports
self-closing tags (XML style, ``<name/>``, with ``name`` as a placeholder) for zero-length insertion points.
See :doc:`regions` for full tag syntax, persistence rules, and
programmatic tag insertion.
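The idea of confining an operation to a tagged region can be sketched in plain Python with a regular expression. This is an illustration under stated assumptions, not how PoolParty parses tags: ``mutate_region`` is a hypothetical helper, and it handles only a single, non-nested ``<name>...</name>`` pair.

.. code-block:: python

    import re

    def mutate_region(seq, region, alphabet="ACGT"):
        """Yield single-base substitutions confined to <region>...</region>."""
        name = re.escape(region)
        match = re.search(rf"<{name}>(.*?)</{name}>", seq)
        if match is None:
            raise ValueError(f"region {region!r} not found")
        start, end = match.span(1)  # indices of the tagged content only
        for pos in range(start, end):
            for base in alphabet:
                if base != seq[pos]:
                    yield seq[:pos] + base + seq[pos + 1:]

    variants = list(mutate_region("AAAA<cre>ATCGATCG</cre>TTTT", "cre"))
    print(len(variants))  # 24
    print(variants[0])    # AAAA<cre>CTCGATCG</cre>TTTT

The tags and flanking bases are copied through untouched, so every variant still carries the region marker for later steps.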
----
Scanning operations
-------------------
Scanning operations systematically tile a window across a sequence (or a
region), producing one variant per position. They are the workhorse for
saturation-style screens:
.. code-block:: python
dels = template.deletion_scan(
deletion_length=3, region="cre", mode="sequential"
).named("dels")
dels.print_library(num_seqs=4)
.. code-block:: text
AAAA<cre>---GATCG</cre>TTTT
AAAA<cre>A---ATCG</cre>TTTT
AAAA<cre>AT---TCG</cre>TTTT
AAAA<cre>ATC---CG</cre>TTTT
... (6 total)
``deletion_scan`` slides a 3-bp window across the 8-bp region, yielding
8 - 3 + 1 = 6 variants (one per valid window position). The scan is
restricted to ``<cre>``, so flanking sequences remain intact.
Other scanning operations include ``insertion_scan``, ``replacement_scan``,
``mutagenize_scan``, and their multi-window variants
(``insertion_multiscan``, etc.). See :doc:`operations/scanning`.
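The window arithmetic behind a scan is easy to verify by hand. The following plain-Python sketch slides a deletion window across a bare sequence; ``toy_deletion_scan`` is a hypothetical helper written for illustration and makes no claim about PoolParty's internals.

.. code-block:: python

    def toy_deletion_scan(seq, deletion_length, gap="-"):
        """Yield seq with each window of deletion_length bases replaced by gap characters."""
        n_windows = len(seq) - deletion_length + 1  # valid start positions
        for start in range(n_windows):
            yield seq[:start] + gap * deletion_length + seq[start + deletion_length:]

    variants = list(toy_deletion_scan("ATCGATCG", 3))
    print(len(variants))  # 6 = 8 - 3 + 1
    print(variants[0])    # ---GATCG
    print(variants[-1])   # ATCGA---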
----
Combining pools
---------------
Composition operations combine sequences from multiple pools. The two
primary ones are ``stack``, which merges state spaces, and ``join``, which
concatenates sequences end-to-end. A third, ``repeat``, replicates the
sequences of a single pool.
``stack`` merges the mutagenesis and deletion pools:
.. code-block:: python
combined = pp.stack([cre_mutants, dels])
print(combined.num_states) # 30 (24 + 6)
``repeat`` duplicates a pool's sequences for replication:
.. code-block:: python
wt_copies = template.repeat(times=5)
print(wt_copies.num_states) # 5
``stack`` produces a pool whose state space is the union of all inputs, so
state counts add (24 + 6 = 30). ``repeat`` produces N copies of each input
sequence, so the state count is multiplied by N. ``join`` concatenates
sequences from different pools end-to-end; its state space is the Cartesian
product of the input state spaces, so state counts multiply.
See :doc:`operations/composition_operations`. For how state counts compose
across different operation types, see :doc:`operations/library_size`.
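The counting rules can be checked with ordinary lists. In this plain-Python sketch the list contents are placeholder labels rather than real sequences, and list concatenation, duplication, and ``itertools.product`` stand in for ``stack``, ``repeat``, and ``join``; it illustrates the arithmetic only.

.. code-block:: python

    from itertools import product

    muts = [f"mut_{i}" for i in range(24)]  # placeholder labels for 24 states
    dels = [f"del_{i}" for i in range(6)]   # placeholder labels for 6 states

    stacked = muts + dels                             # union: counts add
    repeated = [d for d in dels for _ in range(2)]    # copies: count times 2
    joined = [a + b for a, b in product(muts, dels)]  # Cartesian product: counts multiply

    print(len(stacked))   # 30
    print(len(repeated))  # 12
    print(len(joined))    # 144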
----
Sequence metadata
-----------------
PoolParty automatically tracks how each sequence was constructed through
three complementary mechanisms:
- **Names.** Each sequence receives a dot-separated name summarizing its
construction history (e.g., ``mut_03.rep_1``). Users assign labels via the
``prefix`` parameter on operations.
- **Design cards.** Structured DataFrame columns that record every design
choice -- mutation positions, substituted characters, scores, orientations
-- ready for filtering, grouping, and statistical modeling.
- **Styling.** Per-character color and formatting annotations that highlight
mutations, deletions, and regions in ``print_library`` output for quick
visual auditing.
The capstone example below demonstrates all three: names via ``prefix``,
styling via ``style``, and design cards via ``cards``.
See :doc:`metadata/naming`, :doc:`metadata/design_cards`, and
:doc:`metadata/styling`.
----
Generating libraries
--------------------
``print_library()`` previews sequences in the terminal. To produce a
``pandas.DataFrame``, use ``generate_library()``:
.. code-block:: python
df = combined.generate_library()
The DataFrame contains a ``name`` column, a ``seq`` column, and any design
card columns requested via the ``cards`` parameter. For larger libraries
(above ~10k sequences), use ``to_df`` (chunked streaming) or ``to_file``
(stream directly to CSV, FASTA, or JSONL). See :doc:`pool` for full export
options.
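To illustrate why streaming helps with large libraries, here is a generic chunked CSV writer in plain Python. It is a sketch of the general pattern only: ``chunked`` and ``write_csv`` are hypothetical helpers, the chunk size is arbitrary, and nothing here reflects how ``to_df`` or ``to_file`` are implemented.

.. code-block:: python

    import csv
    import io
    from itertools import islice

    def chunked(iterable, size):
        """Yield successive lists of at most `size` items from any iterable."""
        iterator = iter(iterable)
        while chunk := list(islice(iterator, size)):
            yield chunk

    def write_csv(rows, handle, chunk_size=1000):
        """Stream (name, seq) rows to a CSV handle one chunk at a time."""
        writer = csv.writer(handle)
        writer.writerow(["name", "seq"])
        n_written = 0
        for chunk in chunked(rows, chunk_size):
            writer.writerows(chunk)
            n_written += len(chunk)
        return n_written

    # A generator produces rows on demand, so only one chunk is held in memory.
    rows = ((f"seq_{i}", "ATCG") for i in range(2500))
    buffer = io.StringIO()
    print(write_csv(rows, buffer))  # 2500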
----
Putting it all together
-----------------------
The following example combines every concept from the preceding sections
into a complete pipeline. A template
sequence contains a ``<tag>`` region targeted for both mutagenesis and
deletion scanning. We start a fresh session to build the example from
scratch.
.. image:: /_static/images/figure1c.drawio.svg
:width: 100%
:alt: Example PoolParty workflow combining mutagenesis and deletion scanning into a single library.
.. code-block:: python
pp.init()
template = pp.from_seq("TCCGACT<tag>GCA</tag>ATTCGGA").named("template")
mut_pool = template.mutagenize(
num_mutations=1,
region="tag",
style="red bold",
prefix="mut",
mode="sequential",
cards={"positions": "mut_pos", "wt_chars": "wt", "mut_chars": "mut"},
).named("mut_pool")
del_pool = template.deletion_scan(
deletion_length=1,
region="tag",
style="green bold",
prefix="del",
mode="sequential",
cards={"start": "del_start"},
).repeat(times=2, prefix="rep",
cards={"repeat_index": "rep_idx"},
).named("del_pool")
pool_final = pp.stack([mut_pool, del_pool]).named("pool_final")
pool_final.print_library(show_name=True)
.. code-block:: text

    | name        | seq                          |
    | mut_0       | TCCGACT<tag>ACA</tag>ATTCGGA |
    | mut_1       | TCCGACT<tag>CCA</tag>ATTCGGA |
    | mut_2       | TCCGACT<tag>TCA</tag>ATTCGGA |
    | mut_3       | TCCGACT<tag>GAA</tag>ATTCGGA |
    | mut_4       | TCCGACT<tag>GGA</tag>ATTCGGA |
    | mut_5       | TCCGACT<tag>GTA</tag>ATTCGGA |
    | mut_6       | TCCGACT<tag>GCC</tag>ATTCGGA |
    | mut_7       | TCCGACT<tag>GCG</tag>ATTCGGA |
    | mut_8       | TCCGACT<tag>GCT</tag>ATTCGGA |
    | del_0.rep_0 | TCCGACT<tag>-CA</tag>ATTCGGA |
    | del_0.rep_1 | TCCGACT<tag>-CA</tag>ATTCGGA |
    | del_1.rep_0 | TCCGACT<tag>G-A</tag>ATTCGGA |
    | del_1.rep_1 | TCCGACT<tag>G-A</tag>ATTCGGA |
    | del_2.rep_0 | TCCGACT<tag>GC-</tag>ATTCGGA |
    | del_2.rep_1 | TCCGACT<tag>GC-</tag>ATTCGGA |
The pipeline works as follows: a source operation creates the template, two
transformation operations (``mutagenize`` and ``deletion_scan``) branch from
it and target the same ``<tag>`` region, ``repeat`` replicates the deletion
branch, and ``stack`` merges both branches into the final library. The
``prefix`` parameter labels each variant type so names are self-documenting
(``mut_0``, ``del_0.rep_0``, etc.).
The ``style`` parameter applies color annotations visible in the output:
mutations in red, deletions in green.
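As a sanity check on the 15 rows above, the same names and sequences can be reproduced in plain Python. This sketch is independent of PoolParty; the variable names are illustrative, and the naming scheme is inferred from the printed output rather than from the library's code.

.. code-block:: python

    wt_region = "GCA"
    left, right = "TCCGACT<tag>", "</tag>ATTCGGA"

    # Branch 1: every single-base substitution inside the region.
    mut_regions = [
        wt_region[:pos] + base + wt_region[pos + 1:]
        for pos, wt_base in enumerate(wt_region)
        for base in "ACGT"
        if base != wt_base
    ]
    rows = [(f"mut_{i}", left + region + right) for i, region in enumerate(mut_regions)]

    # Branch 2: every 1-bp deletion inside the region, each repeated twice.
    for start in range(len(wt_region)):
        region = wt_region[:start] + "-" + wt_region[start + 1:]
        for rep in range(2):
            rows.append((f"del_{start}.rep_{rep}", left + region + right))

    print(len(rows))  # 15 = 9 substitutions + 3 deletions x 2 repeats
    print(rows[0])    # ('mut_0', 'TCCGACT<tag>ACA</tag>ATTCGGA')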
The DAG view confirms the pipeline structure:
.. code-block:: python
pool_final.print_dag()
.. code-block:: text

    pool_final (pool, n=15)
    └── op[6]:stack [mode=sequential, n=2]
        ├── mut_pool (pool, n=9)
        │   └── op[1]:mutagenize [mode=sequential, n=9]
        │       └── template (pool, n=1)
        │           └── op[0]:from_seq [mode=fixed, n=1]
        └── del_pool (pool, n=6)
            └── op[5]:repeat [mode=sequential, n=2]
                └── pool[4] (pool, n=3)
                    └── op[4]:deletion_scan(replace_region) [mode=fixed, n=1]
                        ├── pool[2] (pool, n=3)
                        │   └── op[2]:deletion_scan(region_scan) [mode=sequential, n=3]
                        │       └── template (pool, n=1)
                        │           └── op[0]:from_seq [mode=fixed, n=1]
                        └── pool[3] (pool, n=1)
                            └── op[3]:deletion_scan(from_seq) [mode=fixed, n=1]
Each node shows its mode and internal state count. Named pools
(``template``, ``mut_pool``, ``del_pool``, ``pool_final``) appear with
their labels; unnamed intermediate pools use default identifiers.
Because each operation was called with ``cards=``, the exported DataFrame
includes design card columns alongside ``name`` and ``seq``:
.. code-block:: python
df = pool_final.generate_library()
.. code-block:: text

    | name        | mut_pos | wt   | mut  | del_start | rep_idx |
    | mut_0       | (0,)    | (G,) | (A,) | None      | None    |
    | mut_1       | (0,)    | (G,) | (C,) | None      | None    |
    | mut_2       | (0,)    | (G,) | (T,) | None      | None    |
    | ...         | ...     | ...  | ...  | ...       | ...     |
    | del_0.rep_0 | None    | None | None | 0         | 0       |
    | del_0.rep_1 | None    | None | None | 0         | 1       |
    | ...         | ...     | ...  | ...  | ...       | ...     |
Each operation contributes its own card columns: ``mutagenize`` records
the mutation position, wild-type base, and substituted base;
``deletion_scan`` records the deletion start position; and ``repeat`` records
the copy index. Sequences that did not pass through a given operation
have ``None`` in its columns.
Each operation defines its own set of available card keys.
See :doc:`metadata/design_cards`.
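Because absent cards show up as ``None``, card columns double as filters. The sketch below uses plain dictionaries that copy four rows from the table above; it assumes nothing about the real DataFrame beyond those column names and values, and the same filters could be expressed with pandas on the exported frame.

.. code-block:: python

    # Four rows copied from the design card table, as plain dictionaries.
    rows = [
        {"name": "mut_0", "mut_pos": (0,), "wt": ("G",), "mut": ("A",),
         "del_start": None, "rep_idx": None},
        {"name": "mut_1", "mut_pos": (0,), "wt": ("G",), "mut": ("C",),
         "del_start": None, "rep_idx": None},
        {"name": "del_0.rep_0", "mut_pos": None, "wt": None, "mut": None,
         "del_start": 0, "rep_idx": 0},
        {"name": "del_0.rep_1", "mut_pos": None, "wt": None, "mut": None,
         "del_start": 0, "rep_idx": 1},
    ]

    # Rows that passed through deletion_scan have a non-None del_start.
    deletions = [r["name"] for r in rows if r["del_start"] is not None]

    # Keep one copy of each design: unrepeated rows plus the first replicate.
    unique_designs = [r["name"] for r in rows if r["rep_idx"] in (None, 0)]

    print(deletions)       # ['del_0.rep_0', 'del_0.rep_1']
    print(unique_designs)  # ['mut_0', 'mut_1', 'del_0.rep_0']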
----
Next steps
----------
- Walk through complete real-world library designs in the
:doc:`tutorials/index` (deep mutational scanning, MPRA)
- Browse the :doc:`operations/index` for the full operation catalog
- See :doc:`pool` for Pool properties, export methods (``to_df``,
``to_file``), and context management