Pools
A Pool represents a designed collection of DNA sequences. A Pool can represent the final library you wish to generate, or an intermediate set of sequences used to construct it. Pools are lazy: they record which operations to apply and to what inputs, forming a directed acyclic graph (DAG), but no sequences are generated until you explicitly request them. This means you can explore and test multiple design options without triggering expensive computations.
Pools are also immutable: every operation returns a new Pool, leaving the original unchanged. You can branch a pipeline at any point and apply different operations to each branch without interference.
The final Pool in a pipeline – the one from which you generate sequences – is called the root Pool. The DAG rooted at this Pool describes the high-level logic used to generate your library; PoolParty handles the procedural details and bookkeeping internally.
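The two properties above, laziness and immutability, can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not PoolParty's implementation: `ToyPool`, `from_seqs`, `append_suffix`, and the `calls` log are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional, Tuple


# Hypothetical toy class for illustration only; not PoolParty's implementation.
@dataclass(frozen=True)
class ToyPool:
    op: str
    parents: Tuple["ToyPool", ...] = ()
    fn: Optional[Callable[..., Iterator[str]]] = None

    def generate(self) -> Iterator[str]:
        # Work happens only here, by walking the DAG upward from this node.
        return self.fn(*(p.generate() for p in self.parents))


calls = []  # records when sequence generation actually runs


def from_seqs(seqs):
    def produce():
        calls.append("from_seqs")
        yield from seqs
    return ToyPool("from_seqs", (), produce)


def append_suffix(pool, suffix):
    def produce(upstream):
        calls.append("append_suffix")
        for s in upstream:
            yield s + suffix
    # Returns a new node; `pool` itself is never modified.
    return ToyPool("append_suffix", (pool,), produce)


base = from_seqs(["AAA", "CCC"])
branch_a = append_suffix(base, "G")
branch_b = append_suffix(base, "T")
assert calls == []  # building the graph generated nothing

seqs_a = list(branch_a.generate())
seqs_b = list(branch_b.generate())
print(seqs_a, seqs_b)
```

Both branches read from the same `base` node, and neither affects the other, because each operation only adds a new node to the graph.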
Context management
Pools must be created inside an active context. Call pp.init() before
each independent library design to initialize a fresh context:
import poolparty as pp
pp.init()
If you design multiple libraries in one script, call pp.init() again
before starting each new design. For scoped contexts (e.g., inside a
reusable function), use with pp.Party(): instead – the context is
automatically cleaned up when the block exits:
with pp.Party():
    pool = pp.from_seq("ACGT")
    # ... build and export library ...
# context is released here
All remaining examples on this page assume the import and
pp.init() calls above have been run.
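For readers unfamiliar with scoped contexts, the general pattern can be sketched in plain Python. This is an illustration of the idea only, assuming a simple stack of active contexts; `party`, `new_pool`, and `_active` are hypothetical names, and how PoolParty tracks its context internally is not described on this page.

```python
from contextlib import contextmanager

_active = []  # hypothetical stack of active design contexts


@contextmanager
def party():
    ctx = {"pools": []}
    _active.append(ctx)
    try:
        yield ctx
    finally:
        # Runs even if the block raises, so the context is always released.
        _active.pop()


def new_pool(seq):
    # Creating a pool requires an active context.
    if not _active:
        raise RuntimeError("no active context")
    _active[-1]["pools"].append(seq)
    return seq


with party() as ctx:
    new_pool("ACGT")
    inside = len(_active)
outside = len(_active)
print(inside, outside)
```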
Properties
| Attribute | Type | Description |
|---|---|---|
| name | | Human-readable name for this pool. Settable. Defaults to |
| num_states | | Number of distinct sequences this pool produces (the total across the entire pipeline). |
| | | Fixed sequence length, or |
| | | Set of |
| | | Input pools that this pool’s operation reads from. |
| operation | | The operation that created this pool. Exposes |
Internally, each sequence is identified by a state – an integer that,
together with a random seed, uniquely determines the sequence content.
pool.num_states is the total number of distinct states (and therefore
distinct sequences) the pool can produce.
Note that pool.num_states and pool.operation.num_states are different
values. The pool’s num_states is the total across the entire pipeline,
while the operation’s num_states is just that operation’s contribution
(see Operation Modes and Library Size):
seqs = pp.from_seqs(["AAA", "CCC", "GGG"], mode="sequential")
mut = seqs.mutagenize(num_mutations=1, mode="sequential")
mut.num_states # 27 (3 inputs × 9 mutants)
mut.operation.num_states # 9 (mutagenize alone)
mut.operation.natural_num_states # 9 (before any num_states override)
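The counts in the example above can be checked by hand. The sketch below enumerates single-base substitutions in plain Python, independently of PoolParty; `single_mutants` is a hypothetical helper written for this check.

```python
ALPHABET = "ACGT"


def single_mutants(seq):
    """Return every sequence that differs from `seq` at exactly one position."""
    out = []
    for i, ref in enumerate(seq):
        for alt in ALPHABET:
            if alt != ref:
                out.append(seq[:i] + alt + seq[i + 1:])
    return out


inputs = ["AAA", "CCC", "GGG"]
# 3 positions x 3 alternative bases per position
per_input = len(single_mutants(inputs[0]))
# every input contributes the same number of mutants
total = sum(len(single_mutants(s)) for s in inputs)
print(per_input, total)  # 9 27
```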
Naming pools
named(name)
Set the pool’s name and return self, allowing in-line renaming without
breaking a chain:
wt = pp.from_seq("ACGT").named("wildtype")
# wt.name == "wildtype"
single_mut = (
    pp.from_iupac("NNNN", mode="sequential")
    .mutagenize(num_mutations=1)
    .named("single_mut")
)
Pool names appear in print_library headers and print_dag output.
This is distinct from prefix, which labels individual sequence names
in the output DataFrame (see Sequence Names).
Previewing sequences — print_library(...)
Print a formatted preview of the pool’s sequences to stdout. Returns self
so it can be used mid-pipeline.
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_seqs | | | Number of sequences to show. |
| | | | Print a summary header line before the sequences. |
| | | | Include the sequence name column. |
| | | | Include the state index column. |
| | | | Random seed for reproducible previews. |
See Pool in the API Reference for the full parameter list.
pp.from_iupac("NNNNN", mode="sequential").print_library(num_seqs=6)
pool[0].0 AAAAA
pool[0].1 AAAAC
pool[0].2 AAAAG
pool[0].3 AAAAT
pool[0].4 AAACA
pool[0].5 AAACC
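The ordering in the preview rows above is consistent with counting in base 4 over the alphabet A, C, G, T. The sketch below reproduces that mapping from a state index to a sequence. It assumes this ordering; `state_to_seq` is a hypothetical helper, not a PoolParty function.

```python
ALPHABET = "ACGT"


def state_to_seq(state, length):
    """Decode an integer state as a base-4 number, most significant base first."""
    digits = []
    for _ in range(length):
        state, remainder = divmod(state, 4)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


for state in range(6):
    print(state, state_to_seq(state, 5))
```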
Generating libraries — generate_library(...)
Generate all sequences from this pool and return them as a
pandas.DataFrame. Best for small to medium pools; for libraries above ~10k
sequences, use to_df, which streams in chunks.
pool = pp.from_iupac("NNNN", mode="sequential")
df = pool.generate_library()
# df has columns: name, seq (plus any design card columns)
See generate_library for full documentation.
Exporting to a DataFrame — to_df(...)
Generate sequences and collect them into a pandas.DataFrame using
chunked streaming. Prefer to_df over generate_library for large
libraries (above ~10k sequences). It processes sequences in batches, keeping
peak memory proportional to chunk_size rather than the full library.
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_seqs | | | Total sequences to generate. Required when |
| num_cycles | | | Number of complete passes through the pool’s |
| chunk_size | | | Sequences generated per internal batch. Larger values may be faster but use more memory. |
| | | | If |
| seed | | | Random seed for reproducibility. |
| | | | Skip sequences removed by a |
| columns | | | Columns to keep. Defaults to all columns ( |
| | | | Display a |
See Pool in the API Reference for the full parameter list.
Basic usage
pool = pp.from_iupac("NNNNNNNN", mode="sequential")
df = pool.to_df(num_cycles=1)
# 65536 rows, columns: name, seq
Large library with chunked streaming
pool = pp.from_iupac("NNNNNNNNNN")
# Random sample of 500k sequences from ~1M possible sequences (4^10 = 1,048,576)
df = pool.to_df(num_seqs=500_000, chunk_size=10_000, seed=42)
Keep only name and seq (drop design cards)
scored = pool.score(pp.calc_gc, card_key="gc", cards={"gc": "gc"})
df = scored.to_df(num_cycles=1, columns=["name", "seq"])
# "gc" column is excluded
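The idea behind chunked generation can be sketched with the standard library alone. This is an illustration of batching a lazy stream, not PoolParty's internals; `all_seqs` and `in_chunks` are hypothetical helpers, and only one batch is held in memory at a time.

```python
from itertools import islice, product


def all_seqs(length):
    # Lazily enumerate every DNA sequence of the given length.
    for letters in product("ACGT", repeat=length):
        yield "".join(letters)


def in_chunks(iterable, chunk_size):
    # Yield successive lists of at most `chunk_size` items.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


num_rows = 0
largest_chunk = 0
for chunk in in_chunks(all_seqs(8), chunk_size=10_000):
    num_rows += len(chunk)
    largest_chunk = max(largest_chunk, len(chunk))
print(num_rows, largest_chunk)  # 65536 10000
```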
Exporting to file — to_file(...)
Stream sequences directly to disk without ever holding the full library in memory. Supports CSV, TSV, FASTA, and JSONL formats, including gzip compression.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | (required) | Output file path. Use a |
| | | | |
| num_seqs | | | Total sequences to write. |
| | | | Number of complete passes through the pool’s |
| chunk_size | | | Sequences written per internal batch. |
| | | | Include region tags in output sequences. |
| | | | Random seed for reproducibility. |
| | | | Skip sequences removed by a |
| | | | Columns to write (CSV/TSV only). |
| | | | FASTA only: wrap sequence lines at this width. |
| description | | | FASTA only: additional description text after the sequence name. A string is treated as a format template (e.g. |
| | | | Show a |
Returns the number of sequences written. See Pool in the
API Reference for the full parameter list.
Export to CSV
pool = pp.from_iupac("NNNNNNNN")
n = pool.to_file("library.csv", num_seqs=100_000)
# n == 100000
name,seq
pool[0].0,AAAAAAAA
pool[0].1,AAAAAAAC
pool[0].2,AAAAAAAG
pool[0].3,AAAAAAAT
pool[0].4,AAAAAACA
...
Export to gzip-compressed CSV
n = pool.to_file("library.csv.gz", num_seqs=1_000_000, chunk_size=50_000)
Export to FASTA
n = pool.to_file("library.fasta", num_seqs=10_000)
>pool[0].0
AAAAAAAA
>pool[0].1
AAAAAAAC
>pool[0].2
AAAAAAAG
...
FASTA with a custom description line
scored = pool.score(pp.calc_gc, card_key="gc", cards={"gc": "gc"})
n = scored.to_file(
    "library.fasta",
    num_seqs=1000,
    description=lambda row: f"GC={row['gc']:.3f}",
)
>pool[0].0 GC=0.000
AAAAAAAA
>pool[0].1 GC=0.125
AAAAAAAC
>pool[0].2 GC=0.125
AAAAAAAG
...
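To make the streaming-to-disk pattern concrete, here is a minimal standard-library sketch of a FASTA writer that chooses gzip from the file extension and accepts a callable for the description text. It is an illustration only: `write_fasta`, `gc_fraction`, and the `line_width` argument are hypothetical names, and the callable here receives `(name, seq)` rather than the row object used in the example above.

```python
import gzip
import os
import tempfile


def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def write_fasta(path, records, description=None, line_width=None):
    """Stream (name, seq) records to `path`; gzip if the path ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    n = 0
    with opener(path, "wt") as fh:
        for name, seq in records:
            header = f">{name}"
            if description is not None:
                header += " " + description(name, seq)
            fh.write(header + "\n")
            # Wrap the sequence only when a line width is given.
            width = line_width or len(seq) or 1
            for i in range(0, len(seq), width):
                fh.write(seq[i:i + width] + "\n")
            n += 1
    return n


# A generator, so records are produced one at a time while writing.
records = (
    (f"pool[0].{i}", s)
    for i, s in enumerate(["AAAAAAAA", "AAAAAAAC", "AAAAAAAG"])
)
path = os.path.join(tempfile.mkdtemp(), "library.fasta.gz")
n = write_fasta(
    path, records, description=lambda name, seq: f"GC={gc_fraction(seq):.3f}"
)
with gzip.open(path, "rt") as fh:
    text = fh.read()
print(n)
print(text)
```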
Visualising the DAG — print_dag(...)
Print an ASCII tree of the computation graph rooted at this pool. Returns
self so it can be used mid-pipeline.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
|
Tree drawing style. |
|
|
|
Show pool nodes in addition to operation nodes. |
wt = pp.from_seq("ACG")
mut = wt.mutagenize(num_mutations=1, mode="sequential")
repeated = mut * 2
repeated.print_dag()
pool[2] (pool, n=18)
└── op[2]:repeat [mode=sequential, n=2]
    └── pool[1] (pool, n=9)
        └── op[1]:mutagenize [mode=sequential, n=9]
            └── pool[0] (pool, n=1)
                └── op[0]:from_seq [mode=fixed, n=1]
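A tree like the one above can be produced by a short recursive walk over the graph. The sketch below shows one way to do it for nodes written as `(label, children)` tuples. It is a hypothetical illustration, not the code behind print_dag, and the labels are copied from the example output.

```python
def render(node, prefix="", connector=""):
    """Return the lines of an ASCII tree for node = (label, [children])."""
    label, children = node
    lines = [prefix + connector + label]
    # Indent children under this node; keep a vertical bar only when this
    # node has later siblings that still need to be connected.
    if connector == "":
        child_prefix = prefix
    elif connector == "├── ":
        child_prefix = prefix + "│   "
    else:
        child_prefix = prefix + "    "
    for i, child in enumerate(children):
        is_last = i == len(children) - 1
        lines += render(child, child_prefix, "└── " if is_last else "├── ")
    return lines


dag = (
    "pool[1] (pool, n=9)",
    [("op[1]:mutagenize [mode=sequential, n=9]",
      [("pool[0] (pool, n=1)",
        [("op[0]:from_seq [mode=fixed, n=1]", [])])])],
)
tree_lines = render(dag)
print("\n".join(tree_lines))
```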
Advanced
Operator shortcuts
Pools support three Python operators as shorthand for common operations:
pool_a + pool_b: Equivalent to pp.stack([pool_a, pool_b]). See stack.
pool * N: Equivalent to pp.repeat(pool, times=N). See repeat.
pool[start:stop]: Equivalent to pp.slice_states(pool, start=start, stop=stop). See slice_states.
a = pp.from_seqs(["AAA", "CCC"], mode="sequential")
b = pp.from_seqs(["GGG", "TTT"], mode="sequential")
combined = a + b # 4 states (2 + 2)
repeated = a * 3 # 6 states (2 × 3)
sliced = combined[:3] # 3 states (first 3 of 4)
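The state counts in the comments above follow from simple arithmetic, which the sketch below reproduces with Python's operator hooks. `StateCounter` is a hypothetical stand-in that tracks only a state count; it is not how PoolParty implements these operators.

```python
class StateCounter:
    """Hypothetical stand-in that tracks only how many states a pool has."""

    def __init__(self, n):
        self.n = n

    def __add__(self, other):
        # Stacking: the states of both inputs are concatenated.
        return StateCounter(self.n + other.n)

    def __mul__(self, times):
        # Repeating: every state appears `times` times.
        return StateCounter(self.n * times)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("only slices are supported in this sketch")
        # slice.indices clamps start/stop to the available range.
        return StateCounter(len(range(*key.indices(self.n))))


a = StateCounter(2)
b = StateCounter(2)
combined = a + b
repeated = a * 3
sliced = combined[:3]
print(combined.n, repeated.n, sliced.n)  # 4 6 3
```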
copy() and deepcopy()
copy() creates a new pool that shares the same input pools – useful for
branching a design at a specific point without re-running earlier operations.
deepcopy() creates a fully independent copy of the entire upstream DAG
– nothing is shared with the original. In most cases copy() is sufficient.
Use deepcopy() when the two branches must be fully independent and share
no input pools.
base = pp.from_iupac("NNNN", mode="sequential")
branch_a = base.mutagenize(num_mutations=1).named("branch_a")
branch_b = base.copy().mutagenize(num_mutations=2).named("branch_b")
# branch_a and branch_b share the same "base" input pool
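The difference between the two kinds of copy mirrors the shallow/deep distinction in Python's standard copy module. The sketch below uses that module on a hypothetical `Node` class as an analogy; whether PoolParty's copy() and deepcopy() are built on it is not stated here.

```python
import copy


class Node:
    """Hypothetical graph node with a name and a list of parent nodes."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)


base = Node("base")
child = Node("child", [base])

shallow = copy.copy(child)   # new node, same parent objects
deep = copy.deepcopy(child)  # new node, independently copied parents

print(shallow.parents[0] is base)  # True: the upstream node is shared
print(deep.parents[0] is base)     # False: the upstream node was duplicated
print(deep.parents[0].name)        # base
```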