Pools

A Pool represents a designed collection of DNA sequences. A Pool can represent the final library you wish to generate, or an intermediate set of sequences used to construct it. Pools are lazy: they record which operations to apply and to what inputs, forming a directed acyclic graph (DAG), but no sequences are generated until you explicitly request them. This means you can explore and test multiple design options without triggering expensive computations.

Pools are also immutable: every operation returns a new Pool, leaving the original unchanged. You can branch a pipeline at any point and apply different operations to each branch without interference.
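
The two properties above can be illustrated with a small stand-alone sketch. ToyPool, its map method, and the source function are hypothetical names invented for this illustration; they are not part of PoolParty and say nothing about how PoolParty is implemented internally.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Tuple


@dataclass(frozen=True)
class ToyPool:
    """Hypothetical stand-in for a lazy, immutable pool (not PoolParty code)."""
    label: str
    produce: Callable[[], Iterator[str]]   # deferred recipe; nothing runs until called
    parents: Tuple["ToyPool", ...] = ()

    def map(self, label: str, fn: Callable[[str], str]) -> "ToyPool":
        # Returns a NEW node that points back at self; self is never modified.
        return ToyPool(label, lambda: (fn(s) for s in self.produce()), (self,))

    def generate(self) -> list:
        # Sequences are only built when this method is called.
        return list(self.produce())


calls = []

def source():
    calls.append("ran")            # records each time generation really happens
    yield from ["ACGT", "TTGA"]

base = ToyPool("base", source)
lower = base.map("lower", str.lower)            # branch 1
rev = base.map("reversed", lambda s: s[::-1])   # branch 2, independent of branch 1

assert calls == []                 # building the graph generated nothing
lower_seqs = lower.generate()      # ['acgt', 'ttga']
rev_seqs = rev.generate()          # ['TGCA', 'AGTT']
```

Both branches read from the same base node, yet neither changes it, which is the behaviour the paragraph above describes.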

The final Pool in a pipeline – the one from which you generate sequences – is called the root Pool. The DAG rooted at this Pool describes the high-level logic used to generate your library; PoolParty handles the procedural details and bookkeeping internally.


Context management

Pools must be created inside an active context. Call pp.init() before each independent library design to initialize a fresh context:

import poolparty as pp
pp.init()

If you design multiple libraries in one script, call pp.init() again before starting each new design. For scoped contexts (e.g., inside a reusable function), use with pp.Party(): instead – the context is automatically cleaned up when the block exits:

with pp.Party():
    pool = pp.from_seq("ACGT")
    # ... build and export library ...
# context is released here

All remaining examples on this page assume the import and pp.init() calls above have been run.


Properties

Attribute   Type         Description
----------  -----------  -----------
name        str          Human-readable name for this pool. Settable. Defaults to "pool[N]".
num_states  int          Number of distinct sequences this pool produces (the total across the entire pipeline).
seq_length  int | None   Fixed sequence length, or None for variable-length pools.
regions     set[Region]  Set of Region objects present in this pool’s sequences. See Sequence Regions for details.
parents     list[Pool]   Input pools that this pool’s operation reads from.
operation   Operation    The operation that created this pool. Exposes operation.mode, operation.num_states, and operation.natural_num_states.

Internally, each sequence is identified by a state – an integer that, together with a random seed, uniquely determines the sequence content. pool.num_states is the total number of distinct states (and therefore distinct sequences) the pool can produce.
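
As a rough illustration of what "a state plus a seed determines the sequence" means, the sketch below derives a random generator from the pair. The function toy_sequence and the way it combines seed and state are assumptions made for this example; PoolParty's actual derivation is not described here.

```python
import random


def toy_sequence(state: int, seed: int, length: int = 8) -> str:
    # Assumption: one RNG per (seed, state) pair, seeded from a string key.
    rng = random.Random(f"{seed}:{state}")
    return "".join(rng.choice("ACGT") for _ in range(length))


a = toy_sequence(state=3, seed=42)
b = toy_sequence(state=3, seed=42)   # same state and seed as a
c = toy_sequence(state=4, seed=42)   # different state, drawn from its own RNG

assert a == b                        # identical inputs reproduce the same sequence
```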

Note that pool.num_states and pool.operation.num_states are different values. The pool’s num_states is the total across the entire pipeline, while the operation’s num_states is just that operation’s contribution (see Operation Modes and Library Size):

seqs = pp.from_seqs(["AAA", "CCC", "GGG"], mode="sequential")
mut  = seqs.mutagenize(num_mutations=1, mode="sequential")

mut.num_states                    # 27 (3 inputs × 9 mutants)
mut.operation.num_states          # 9  (mutagenize alone)
mut.operation.natural_num_states  # 9  (before any num_states override)
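
The counts in the comments above can be checked by hand. The sketch below uses plain Python only; single_mutants is a hypothetical helper written for this check, not a PoolParty function.

```python
def single_mutants(seq: str, alphabet: str = "ACGT") -> list:
    """All sequences that differ from seq at exactly one position."""
    out = []
    for i, ref in enumerate(seq):
        for alt in alphabet:
            if alt != ref:
                out.append(seq[:i] + alt + seq[i + 1:])
    return out


inputs = ["AAA", "CCC", "GGG"]
per_input = [len(single_mutants(s)) for s in inputs]       # 3 positions x 3 alternatives each
total = sum(per_input)                                     # size of the whole pipeline
library = [m for s in inputs for m in single_mutants(s)]

assert per_input == [9, 9, 9]
assert total == 27
```

Each 3-mer has 3 positions with 3 alternative bases, giving 9 mutants per input; three inputs give 27 in total.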

Naming pools

named(name)

Set the pool’s name and return self, allowing in-line renaming without breaking a chain:

wt = pp.from_seq("ACGT").named("wildtype")
# wt.name == "wildtype"

single_mut = (
    pp.from_iupac("NNNN", mode="sequential")
      .mutagenize(num_mutations=1)
      .named("single_mut")
)

Pool names appear in print_library headers and print_dag output. This is distinct from prefix, which labels individual sequence names in the output DataFrame (see Sequence Names).


Previewing sequences — print_library(...)

Print a formatted preview of the pool’s sequences to stdout. Returns self so it can be used mid-pipeline.

Parameter    Type        Default  Description
-----------  ----------  -------  -----------
num_seqs     int | None  None     Number of sequences to show.
show_header  bool        True     Print a summary header line before the sequences.
show_name    bool        True     Include the sequence name column.
show_state   bool        False    Include the state index column.
seed         int | None  None     Random seed for reproducible previews.

See Pool in the API Reference for the full parameter list.

pp.from_iupac("NNNNN", mode="sequential").print_library(num_seqs=6)
pool[0]: seq_length=5, num_states=1024
pool[0].0 AAAAA
pool[0].1 AAAAC
pool[0].2 AAAAG
pool[0].3 AAAAT
pool[0].4 AAACA
pool[0].5 AAACC
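
The order shown in this preview matches a plain lexicographic count over A, C, G, T with the last position changing fastest. A minimal sketch that reproduces the same six sequences with the standard library (enumerate_n is a hypothetical helper, and this is not a claim about how PoolParty enumerates states internally):

```python
from itertools import islice, product


def enumerate_n(length: int):
    """Yield every A/C/G/T sequence of the given length in lexicographic order."""
    for bases in product("ACGT", repeat=length):
        yield "".join(bases)


preview = list(islice(enumerate_n(5), 6))   # only the first six are ever built
total = 4 ** 5                              # number of possible 5-mers

assert preview == ["AAAAA", "AAAAC", "AAAAG", "AAAAT", "AAACA", "AAACC"]
assert total == 1024
```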

Generating libraries — generate_library(...)

Generate all sequences from this pool and return them as a pandas.DataFrame. Best for small to medium pools; for libraries above ~10k sequences, use to_df, which streams in chunks.

pool = pp.from_iupac("NNNN", mode="sequential")
df   = pool.generate_library()
# df has columns: name, seq  (plus any design card columns)

See generate_library for full documentation.


Exporting to a DataFrame — to_df(...)

Generate sequences and collect them into a pandas.DataFrame using chunked streaming. Prefer to_df over generate_library for large libraries (above ~10k sequences). It processes sequences in batches, keeping peak memory proportional to chunk_size rather than the full library.

Parameter          Type              Default  Description
-----------------  ----------------  -------  -----------
num_seqs           int | None        None     Total sequences to generate. Required when num_cycles is not given.
num_cycles         int | None        None     Number of complete passes through the pool’s num_states sequences. One cycle produces num_states sequences.
chunk_size         int               1000     Sequences generated per internal batch. Larger values may be faster but use more memory.
write_tags         bool              False    If True, include region tags (e.g. <region>…</region>) in the seq column.
seed               int | None        None     Random seed for reproducibility.
discard_null_seqs  bool              True     Skip sequences removed by a filter operation (NullSeq).
columns            list[str] | None  None     Columns to keep. Defaults to all columns (name, seq, plus any design card columns). Pass ["name", "seq"] to drop cards.
show_progress      bool              True     Display a tqdm progress bar during generation.

See Pool in the API Reference for the full parameter list.

Basic usage

pool = pp.from_iupac("NNNNNNNN", mode="sequential")
df   = pool.to_df(num_cycles=1)
# 65536 rows, columns: name, seq

Large library with chunked streaming

pool = pp.from_iupac("NNNNNNNNNN")
# Random sample of 500k sequences from ~1M possible sequences
df   = pool.to_df(num_seqs=500_000, chunk_size=10_000, seed=42)
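
To make the memory argument concrete, here is a stand-alone sketch of chunked consumption of a lazy sequence stream. iter_chunks is a hypothetical helper; it only illustrates the general batching idea and is not PoolParty's implementation of to_df.

```python
from itertools import islice, product


def iter_chunks(items, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


num_states = 4 ** 8                                          # one cycle of an 8-mer pool
all_seqs = ("".join(p) for p in product("ACGT", repeat=8))   # lazy; never a full list

# Only one chunk is held in memory at a time; here we just record its size.
chunk_sizes = [len(chunk) for chunk in iter_chunks(all_seqs, 10_000)]

assert sum(chunk_sizes) == num_states
```

With 65,536 states and a chunk size of 10,000, the loop sees six full chunks and a final chunk of 5,536 sequences.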

Keep only name and seq (drop design cards)

scored = pool.score(pp.calc_gc, card_key="gc", cards={"gc": "gc"})
df     = scored.to_df(num_cycles=1, columns=["name", "seq"])
# "gc" column is excluded

Exporting to file — to_file(...)

Stream sequences directly to disk without ever holding the full library in memory. Supports CSV, TSV, FASTA, and JSONL formats, including gzip compression.

Parameter          Type                   Default     Description
-----------------  ---------------------  ----------  -----------
path               str | Path             (required)  Output file path. Use a .gz suffix for transparent gzip compression (e.g. library.csv.gz).
file_type          str | None             None        "csv", "tsv", "fasta", or "jsonl". Auto-detected from the file extension when None.
num_seqs           int | None             None        Total sequences to write.
num_cycles         int | None             None        Number of complete passes through the pool’s num_states sequences. One cycle produces num_states sequences.
chunk_size         int                    1000        Sequences written per internal batch.
write_tags         bool                   False       Include region tags in output sequences.
seed               int | None             None        Random seed for reproducibility.
discard_null_seqs  bool                   True        Skip sequences removed by a filter operation (NullSeq).
columns            list[str] | None       None        Columns to write (CSV/TSV only).
line_width         int | None             60          FASTA only: wrap sequence lines at this width. None for no wrapping.
description        str | callable | None  None        FASTA only: additional description text after the sequence name. A string is treated as a format template (e.g. "GC={gc:.2f}"); a callable receives the row dict and should return a string.
show_progress      bool                   True        Show a tqdm progress bar.

Returns the number of sequences written. See Pool in the API Reference for the full parameter list.

Export to CSV

pool = pp.from_iupac("NNNNNNNN")
n    = pool.to_file("library.csv", num_seqs=100_000)
# n == 100000
name,seq
pool[0].0,AAAAAAAA
pool[0].1,AAAAAAAC
pool[0].2,AAAAAAAG
pool[0].3,AAAAAAAT
pool[0].4,AAAAAACA
...

Export to gzip-compressed CSV

n = pool.to_file("library.csv.gz", num_seqs=1_000_000, chunk_size=50_000)
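
The general pattern behind streaming to a (possibly gzip-compressed) file can be sketched with the standard library alone. open_text and write_csv are hypothetical helpers, and choosing compression purely from the ".gz" suffix is an assumption of this sketch; it is not PoolParty's code.

```python
import csv
import gzip
import os
import tempfile
from itertools import islice, product


def open_text(path):
    # Assumption: compression is selected only by the ".gz" suffix.
    if str(path).endswith(".gz"):
        return gzip.open(path, "wt", newline="")
    return open(path, "w", newline="")


def write_csv(rows, path, chunk_size=1000):
    """Write (name, seq) rows in batches; return the number of rows written."""
    written = 0
    it = iter(rows)
    with open_text(path) as fh:
        writer = csv.writer(fh)
        writer.writerow(["name", "seq"])
        while True:
            chunk = list(islice(it, chunk_size))   # at most chunk_size rows in memory
            if not chunk:
                break
            writer.writerows(chunk)
            written += len(chunk)
    return written


# All 256 possible 4-mers, named in the style of the examples on this page.
rows = ((f"pool[0].{i}", "".join(p)) for i, p in enumerate(product("ACGT", repeat=4)))
out_path = os.path.join(tempfile.mkdtemp(), "library.csv.gz")
n = write_csv(rows, out_path, chunk_size=100)

# Read back the header and the first two records to check the round trip.
with gzip.open(out_path, "rt", newline="") as fh:
    first_rows = list(islice(csv.reader(fh), 3))
```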

Export to FASTA

n = pool.to_file("library.fasta", num_seqs=10_000)
>pool[0].0
AAAAAAAA
>pool[0].1
AAAAAAAC
>pool[0].2
AAAAAAAG
...

FASTA with a custom description line

scored = pool.score(pp.calc_gc, card_key="gc", cards={"gc": "gc"})
n = scored.to_file(
    "library.fasta",
    num_seqs=1000,
    description=lambda row: f"GC={row['gc']:.3f}",
)
>pool[0].0 GC=0.000
AAAAAAAA
>pool[0].1 GC=0.125
AAAAAAAC
>pool[0].2 GC=0.125
AAAAAAAG
...
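
The line_width and description options can be pictured with a small formatter. fasta_record is a hypothetical function written for this illustration, and filling string templates with str.format(**row) is an assumption about how a "format template" might be applied.

```python
def fasta_record(row, line_width=60, description=None):
    """Format one row dict (with 'name' and 'seq' keys) as a FASTA record."""
    header = ">" + row["name"]
    if callable(description):
        header += " " + description(row)
    elif description is not None:
        # Assumption: string templates are filled from the row's keys.
        header += " " + description.format(**row)
    seq = row["seq"]
    if line_width is None:
        lines = [seq]                                   # no wrapping
    else:
        lines = [seq[i:i + line_width] for i in range(0, len(seq), line_width)]
    return "\n".join([header, *lines])


row = {"name": "pool[0].1", "seq": "AAAAAAAC", "gc": 0.125}
plain = fasta_record(row)
templated = fasta_record(row, description="GC={gc:.2f}")
wrapped = fasta_record(row, line_width=5, description=lambda r: f"GC={r['gc']:.3f}")

assert plain == ">pool[0].1\nAAAAAAAC"
assert wrapped == ">pool[0].1 GC=0.125\nAAAAA\nAAC"
```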

Visualising the DAG — print_dag(...)

Print a text tree of the computation graph rooted at this pool. Returns self so it can be used mid-pipeline.

Parameter   Type  Default  Description
----------  ----  -------  -----------
style       str   "clean"  Tree drawing style. "clean" uses Unicode box-drawing characters; "ascii" uses only ASCII.
show_pools  bool  True     Show pool nodes in addition to operation nodes.

wt       = pp.from_seq("ACG")
mut      = wt.mutagenize(num_mutations=1, mode="sequential")
repeated = mut * 2
repeated.print_dag()
pool[2] (pool, n=18)
└── op[2]:repeat [mode=sequential, n=2]
    └── pool[1] (pool, n=9)
        └── op[1]:mutagenize [mode=sequential, n=9]
            └── pool[0] (pool, n=1)
                └── op[0]:from_seq [mode=fixed, n=1]
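
A tree like the one above can be drawn with a short recursive routine. The sketch below works on nested (label, children) tuples; render_tree is a hypothetical helper, and the ASCII glyphs it uses are a guess, since the exact characters of PoolParty's "ascii" style are not shown on this page.

```python
def render_tree(node, ascii_only=False):
    """Return text lines for a nested (label, children) tuple."""
    if ascii_only:
        last, mid, bar = "`-- ", "|-- ", "|   "    # assumed ASCII glyphs
    else:
        last, mid, bar = "└── ", "├── ", "│   "
    lines = [node[0]]

    def walk(children, prefix):
        for i, (label, grandchildren) in enumerate(children):
            is_last = i == len(children) - 1
            lines.append(prefix + (last if is_last else mid) + label)
            # Below a last child the prefix is blank; otherwise keep a vertical bar.
            walk(grandchildren, prefix + ("    " if is_last else bar))

    walk(node[1], "")
    return lines


dag = ("pool[2]", [("op[2]:repeat", [("pool[1]", [("op[1]:mutagenize", [("pool[0]", [])])])])])
unicode_lines = render_tree(dag)
ascii_lines = render_tree(dag, ascii_only=True)

assert unicode_lines[1] == "└── op[2]:repeat"
assert unicode_lines[2] == "    └── pool[1]"
```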

Advanced

Operator shortcuts

Pools support three Python operators as shorthand for common operations:

pool_a + pool_b

Equivalent to pp.stack([pool_a, pool_b]). See stack.

pool * N

Equivalent to pp.repeat(pool, times=N). See repeat.

pool[start:stop]

Equivalent to pp.slice_states(pool, start=start, stop=stop). See slice_states.

a = pp.from_seqs(["AAA", "CCC"], mode="sequential")
b = pp.from_seqs(["GGG", "TTT"], mode="sequential")

combined = a + b          # 4 states (2 + 2)
repeated = a * 3          # 6 states (2 × 3)
sliced   = combined[:3]   # 3 states (first 3 of 4)
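
For readers curious how such shortcuts are wired up in Python, the sketch below implements the same three operators on an explicit list of states. ToyStates is a hypothetical class; in particular, modelling repetition as replaying the same states N times is an assumption chosen only so that the counts match the comments above.

```python
class ToyStates:
    """Hypothetical container mimicking +, * and slicing on an explicit state list."""

    def __init__(self, seqs):
        self.seqs = tuple(seqs)

    def __add__(self, other):
        # Stacking: the states of both operands, one after the other.
        return ToyStates(self.seqs + other.seqs)

    def __mul__(self, times):
        # Assumption: repeating replays the same states `times` times.
        return ToyStates(self.seqs * times)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("only slices are supported in this sketch")
        return ToyStates(self.seqs[key])

    def __len__(self):
        return len(self.seqs)


a = ToyStates(["AAA", "CCC"])
b = ToyStates(["GGG", "TTT"])

combined = a + b          # 4 states
repeated = a * 3          # 6 states
sliced = combined[:3]     # first 3 of 4 states

assert (len(combined), len(repeated), len(sliced)) == (4, 6, 3)
```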

copy() and deepcopy()

copy() creates a new pool that shares the same input pools – useful for branching a design at a specific point without re-running earlier operations.

deepcopy() creates a fully independent copy of the entire upstream DAG – nothing is shared with the original. In most cases copy() is sufficient. Use deepcopy() when the two branches must be fully independent and share no input pools.

base = pp.from_iupac("NNNN", mode="sequential")
branch_a = base.mutagenize(num_mutations=1).named("branch_a")
branch_b = base.copy().mutagenize(num_mutations=2).named("branch_b")
# branch_a and branch_b share the same "base" input pool
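
The distinction mirrors shallow versus deep copying in Python's standard copy module. The sketch below shows that analogy on a hypothetical Node class; whether PoolParty's copy() and deepcopy() are built on the copy module is not stated here.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical graph node with a label and a list of parent nodes."""
    label: str
    parents: list = field(default_factory=list)


root = Node("base")
child = Node("mutagenize", [root])

shallow = copy.copy(child)       # new node, same parent objects
deep = copy.deepcopy(child)      # new node, with its own copy of every ancestor

shares_parent = shallow.parents[0] is root
independent = deep.parents[0] is not root

assert shares_parent and independent
```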