Pools
=====

Every PoolParty operation returns a **Pool**. A Pool represents a designed sequence library: it records which operation was applied and to what inputs, forming a directed acyclic graph (DAG) of operations. PoolParty walks this graph to generate sequences on demand when you call ``generate_library()``, ``print_library()``, ``to_df()``, or ``to_file()``.

Every operation returns a **new** Pool; the original is never modified. This means you can branch a pipeline at any point and apply different operations to each branch without interference.

Each pool carries a reference to the operation that created it. You can inspect it via ``pool.operation`` to check settings like ``operation.mode`` and ``operation.num_states`` at any point in a pipeline. See :doc:`operations/modes` for details.

Pools must be created inside an active context. Call ``pp.init()`` once at the top of a notebook, or use ``with pp.Party():`` for automatic cleanup when the block exits. See :doc:`quickstart` for details.

All examples assume:

.. code-block:: python

   import poolparty as pp
   pp.init()

----

Properties
----------

.. list-table::
   :widths: 25 15 60
   :header-rows: 1

   * - Attribute
     - Type
     - Description
   * - ``name``
     - ``str``
     - Human-readable name for this pool. Settable. Defaults to ``"pool[N]"``.
   * - ``num_states``
     - ``int``
     - Number of distinct sequences this pool produces.
   * - ``seq_length``
     - ``int | None``
     - Fixed sequence length, or ``None`` for variable-length pools.
   * - ``iter_order``
     - ``float``
     - Iteration priority. Controls which pool's sequences change most rapidly when generating combinations in a joined or stacked pool.
   * - ``regions``
     - ``set[Region]``
     - Set of :class:`~poolparty.Region` objects present in this pool's sequences. See :doc:`regions` for details.
   * - ``parents``
     - ``list[Pool]``
     - Input pools that this pool's operation reads from.
   * - ``operation``
     - ``Operation``
     - The operation that created this pool. Exposes ``operation.mode``, ``operation.num_states``, and ``operation.natural_num_states``.

Note that ``pool.num_states`` and ``pool.operation.num_states`` are different values. The pool's ``num_states`` is the total across the entire pipeline, while the operation's ``num_states`` is just that operation's contribution (see :doc:`operations/modes` and :doc:`operations/library_size`):

.. code-block:: python

   seqs = pp.from_seqs(["AAA", "CCC", "GGG"], mode="sequential")
   mut = seqs.mutagenize(num_mutations=1, mode="sequential")

   mut.num_states                    # 27 (3 inputs × 9 mutants)
   mut.operation.num_states          # 9 (mutagenize alone)
   mut.operation.natural_num_states  # 9 (before any num_states override)

----

Naming and copying
------------------

``named(name)``
~~~~~~~~~~~~~~~

Set the pool's name and return ``self``, allowing in-line renaming without breaking a chain.

.. code-block:: python

   wt = pp.from_seq("ACGT").named("wildtype")
   # wt.name == "wildtype"

   single_mut = (
       pp.from_iupac("NNNN", mode="sequential")
       .mutagenize(num_mutations=1)
       .named("single_mut")
   )

``copy()`` and ``deepcopy()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``copy()`` creates a new pool that shares the same input pools, which is useful for branching a design at a specific point without re-running earlier operations.

``deepcopy()`` creates a fully independent copy of the entire upstream DAG; nothing is shared with the original.

In most cases ``copy()`` is sufficient. Use ``deepcopy()`` when the two branches must be fully independent and share no input pools.

.. code-block:: python

   base = pp.from_iupac("NNNN", mode="sequential")

   branch_a = base.mutagenize(num_mutations=1).named("branch_a")
   branch_b = base.copy().mutagenize(num_mutations=2).named("branch_b")
   # branch_a and branch_b share the same "base" input pool

----

Operator shortcuts
------------------

Pools support three Python operators as shorthand for common operations:

``pool_a + pool_b``
    Equivalent to ``pp.stack([pool_a, pool_b])``. See :doc:`operations/stack`.

``pool * N``
    Equivalent to ``pp.repeat(pool, times=N)``. See :doc:`operations/repeat`.

``pool[start:stop]``
    Equivalent to ``pp.slice_states(pool, start=start, stop=stop)``. See :doc:`operations/slice_states`.

.. code-block:: python

   a = pp.from_seqs(["AAA", "CCC"], mode="sequential")
   b = pp.from_seqs(["GGG", "TTT"], mode="sequential")

   combined = a + b        # 4 states (2 + 2)
   repeated = a * 3        # 6 states (2 × 3)
   sliced = combined[:3]   # 3 states (first 3 of 4)

----

Generating sequences
--------------------

``generate_library(...)``
~~~~~~~~~~~~~~~~~~~~~~~~~

Generate all sequences from this pool and return them as a :class:`pandas.DataFrame`. Best for small to medium pools; for libraries above ~10k sequences, use ``to_df``, which streams in chunks. See :doc:`operations/generate_library` for full documentation.

.. code-block:: python

   pool = pp.from_iupac("NNNN", mode="sequential")
   df = pool.generate_library()
   # df has columns: name, seq (plus any design card columns)

``print_library(...)``
~~~~~~~~~~~~~~~~~~~~~~

Print a formatted preview of the pool's sequences to stdout. Returns ``self`` so it can be used mid-pipeline.

.. list-table::
   :widths: 25 15 15 45
   :header-rows: 1

   * - Parameter
     - Type
     - Default
     - Description
   * - ``num_seqs``
     - ``int | None``
     - ``None``
     - Number of sequences to show.
   * - ``num_cycles``
     - ``int | None``
     - ``1``
     - Number of complete passes through the pool's ``num_states`` sequences (used when ``num_seqs`` is not given). One cycle produces ``num_states`` sequences.
   * - ``show_header``
     - ``bool``
     - ``True``
     - Print a summary header line before the sequences.
   * - ``show_name``
     - ``bool``
     - ``True``
     - Include the sequence name column.
   * - ``show_seq``
     - ``bool``
     - ``True``
     - Include the sequence column.
   * - ``show_state``
     - ``bool``
     - ``False``
     - Include the state index column.
   * - ``pad_names``
     - ``bool``
     - ``True``
     - Align sequences by padding names to the same width.
   * - ``seed``
     - ``int | None``
     - ``None``
     - Random seed for reproducible previews.
   * - ``discard_null_seqs``
     - ``bool``
     - ``False``
     - Skip sequences removed by a ``filter`` operation (``NullSeq``).

See :class:`~poolparty.Pool` in the :doc:`api` for the full parameter list.

.. code-block:: python

   pp.from_iupac("NNNNN", mode="sequential").print_library(num_seqs=6)

.. code-block:: text

   pool[0]: seq_length=5, num_states=1024
   pool[0].0 AAAAA
   pool[0].1 AAAAC
   pool[0].2 AAAAG
   pool[0].3 AAAAT
   pool[0].4 AAACA
   pool[0].5 AAACC

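
The ordering in the preview above (last position changing fastest, 1024 states for five ``N`` positions) can be reproduced with a few lines of plain Python. This is only an illustrative sketch of the enumeration order, not PoolParty's implementation; ``enumerate_n_pool`` is a hypothetical helper, and it assumes ``N`` expands to ``A``, ``C``, ``G``, ``T`` in that order.

.. code-block:: python

   import itertools

   ALPHABET = "ACGT"  # assumed expansion order of the IUPAC code N


   def enumerate_n_pool(length):
       """Yield every DNA sequence of ``length``, last position varying fastest."""
       for bases in itertools.product(ALPHABET, repeat=length):
           yield "".join(bases)


   all_seqs = list(enumerate_n_pool(5))
   first_six = all_seqs[:6]

   print(len(all_seqs))  # 1024, i.e. 4 ** 5
   print(first_six)      # ['AAAAA', 'AAAAC', 'AAAAG', 'AAAAT', 'AAACA', 'AAACC']

The count grows as ``4 ** length``, which is why the streaming exporters described next become preferable for longer degenerate regions.
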
----

Exporting to a DataFrame: ``to_df(...)``
-----------------------------------------

Generate sequences and collect them into a :class:`pandas.DataFrame` using chunked streaming.

Prefer ``to_df`` over ``generate_library`` for large libraries (above ~10k sequences). It processes sequences in batches, keeping peak memory proportional to ``chunk_size`` rather than the full library.

.. list-table::
   :widths: 25 15 15 45
   :header-rows: 1

   * - Parameter
     - Type
     - Default
     - Description
   * - ``num_seqs``
     - ``int | None``
     - ``None``
     - Total sequences to generate. Required when ``num_cycles`` is not given.
   * - ``num_cycles``
     - ``int | None``
     - ``None``
     - Number of complete passes through the pool's ``num_states`` sequences. One cycle produces ``num_states`` sequences.
   * - ``chunk_size``
     - ``int``
     - ``1000``
     - Sequences generated per internal batch. Larger values may be faster but use more memory.
   * - ``write_tags``
     - ``bool``
     - ``False``
     - If ``True``, include region tags in the ``seq`` column.
   * - ``seed``
     - ``int | None``
     - ``None``
     - Random seed for reproducibility.
   * - ``discard_null_seqs``
     - ``bool``
     - ``True``
     - Skip sequences removed by a ``filter`` operation (``NullSeq``).
   * - ``columns``
     - ``list[str] | None``
     - ``None``
     - Columns to keep. Defaults to all columns (``name``, ``seq``, plus any design card columns). Pass ``["name", "seq"]`` to drop cards.
   * - ``show_progress``
     - ``bool``
     - ``True``
     - Display a ``tqdm`` progress bar during generation.

See :class:`~poolparty.Pool` in the :doc:`api` for the full parameter list.

.. rubric:: Basic usage

.. code-block:: python

   pool = pp.from_iupac("NNNNNNNN", mode="sequential")
   df = pool.to_df(num_cycles=1)
   # 65536 rows, columns: name, seq

.. rubric:: Large library with chunked streaming

.. code-block:: python

   pool = pp.from_iupac("NNNNNNNNNN")
   df = pool.to_df(num_seqs=500_000, chunk_size=10_000, seed=42)
   # Random sample of 500k sequences from ~1M possible sequences

.. rubric:: Keep only name and seq (drop design cards)

.. code-block:: python

   scored = pool.score(pp.calc_gc, card_key="gc", cards={"gc": "gc"})
   df = scored.to_df(num_cycles=1, columns=["name", "seq"])
   # "gc" column is excluded

----

Exporting to file: ``to_file(...)``
-------------------------------------

Stream sequences directly to disk without ever holding the full library in memory. Supports CSV, TSV, FASTA, and JSONL formats, including gzip compression.

.. list-table::
   :widths: 25 18 15 42
   :header-rows: 1

   * - Parameter
     - Type
     - Default
     - Description
   * - ``path``
     - ``str | Path``
     - *(required)*
     - Output file path. Use a ``.gz`` suffix for transparent gzip compression (e.g. ``library.csv.gz``).
   * - ``file_type``
     - ``str | None``
     - ``None``
     - ``"csv"``, ``"tsv"``, ``"fasta"``, or ``"jsonl"``. Auto-detected from the file extension when ``None``.
   * - ``num_seqs``
     - ``int | None``
     - ``None``
     - Total sequences to write.
   * - ``num_cycles``
     - ``int | None``
     - ``None``
     - Number of complete passes through the pool's ``num_states`` sequences. One cycle produces ``num_states`` sequences.
   * - ``chunk_size``
     - ``int``
     - ``1000``
     - Sequences written per internal batch.
   * - ``write_tags``
     - ``bool``
     - ``False``
     - Include region tags in output sequences.
   * - ``seed``
     - ``int | None``
     - ``None``
     - Random seed for reproducibility.
   * - ``discard_null_seqs``
     - ``bool``
     - ``True``
     - Skip sequences removed by a ``filter`` operation (``NullSeq``).
   * - ``columns``
     - ``list[str] | None``
     - ``None``
     - Columns to write (CSV/TSV only).
   * - ``line_width``
     - ``int | None``
     - ``60``
     - FASTA only: wrap sequence lines at this width. ``None`` for no wrapping.
   * - ``description``
     - ``str | callable | None``
     - ``None``
     - FASTA only: additional description text after the sequence name. A string is treated as a format template (e.g. ``"GC={gc:.2f}"``); a callable receives the row dict and should return a string.
   * - ``show_progress``
     - ``bool``
     - ``True``
     - Show a ``tqdm`` progress bar.

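
To picture what ``chunk_size`` and a ``.gz`` path imply, the following standard-library sketch writes rows in fixed-size batches so that only one batch is held in memory at a time. It is not PoolParty's implementation: ``stream_to_csv`` and ``demo_rows`` are hypothetical names, and the suffix check is an assumption about how compression might be selected.

.. code-block:: python

   import csv
   import gzip
   import itertools
   import tempfile
   from pathlib import Path


   def stream_to_csv(rows, path, chunk_size=1000):
       """Write (name, seq) rows to ``path`` in batches and return the row count."""
       path = Path(path)
       # Assumption: a ".gz" suffix selects gzip, otherwise a plain text file.
       opener = gzip.open if path.suffix == ".gz" else open
       rows = iter(rows)
       written = 0
       with opener(path, "wt", newline="") as handle:
           writer = csv.writer(handle)
           writer.writerow(["name", "seq"])
           while True:
               # Pull at most chunk_size rows; an empty batch means we are done.
               chunk = list(itertools.islice(rows, chunk_size))
               if not chunk:
                   break
               writer.writerows(chunk)
               written += len(chunk)
       return written


   def demo_rows():
       """Hypothetical row source: all 4-mers, named demo.0, demo.1, ..."""
       for i, bases in enumerate(itertools.product("ACGT", repeat=4)):
           yield f"demo.{i}", "".join(bases)


   out_path = Path(tempfile.mkdtemp()) / "library.csv.gz"
   n_written = stream_to_csv(demo_rows(), out_path, chunk_size=50)

   # Read back the header and the first two records to check the file.
   with gzip.open(out_path, "rt", newline="") as handle:
       first_rows = list(itertools.islice(csv.reader(handle), 3))

Because the row source is a generator, peak memory in this sketch depends on ``chunk_size`` rather than on the total number of rows.
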
``to_file`` returns the number of sequences written.

See :class:`~poolparty.Pool` in the :doc:`api` for the full parameter list.

.. rubric:: Export to CSV

.. code-block:: python

   # mode="sequential" gives the ordered output shown below
   pool = pp.from_iupac("NNNNNNNN", mode="sequential")
   n = pool.to_file("library.csv", num_seqs=100_000)
   # n == 100000

.. code-block:: text

   name,seq
   pool[0].0,AAAAAAAA
   pool[0].1,AAAAAAAC
   pool[0].2,AAAAAAAG
   pool[0].3,AAAAAAAT
   pool[0].4,AAAAAACA
   ...

.. rubric:: Export to gzip-compressed CSV

.. code-block:: python

   n = pool.to_file("library.csv.gz", num_seqs=1_000_000, chunk_size=50_000)

.. rubric:: Export to FASTA

.. code-block:: python

   n = pool.to_file("library.fasta", num_seqs=10_000)

.. code-block:: text

   >pool[0].0
   AAAAAAAA
   >pool[0].1
   AAAAAAAC
   >pool[0].2
   AAAAAAAG
   ...

.. rubric:: FASTA with a custom description line

.. code-block:: python

   scored = pool.score(pp.calc_gc, card_key="gc", cards={"gc": "gc"})
   n = scored.to_file(
       "library.fasta",
       num_seqs=1000,
       description=lambda row: f"GC={row['gc']:.3f}",
   )

.. code-block:: text

   >pool[0].0 GC=0.000
   AAAAAAAA
   >pool[0].1 GC=0.125
   AAAAAAAC
   >pool[0].2 GC=0.125
   AAAAAAAG
   ...

----

Visualising the DAG: ``print_dag(...)``
-----------------------------------------

Print a text tree of the computation graph rooted at this pool. Returns ``self`` so it can be used mid-pipeline.

.. list-table::
   :widths: 25 15 15 45
   :header-rows: 1

   * - Parameter
     - Type
     - Default
     - Description
   * - ``style``
     - ``str``
     - ``"clean"``
     - Tree drawing style. ``"clean"`` uses Unicode box-drawing characters; ``"ascii"`` uses only ASCII.
   * - ``show_pools``
     - ``bool``
     - ``True``
     - Show pool nodes in addition to operation nodes.

.. code-block:: python

   wt = pp.from_seq("ACG")
   mut = wt.mutagenize(num_mutations=1, mode="sequential")
   scored = mut.score(pp.calc_gc, card_key="gc", cards={"gc": "gc"})
   scored.print_dag()

.. code-block:: text

   pool[2] (pool, n=9)
   └── op[2]:score [mode=fixed, n=1]
       └── pool[1] (pool, n=9)
           └── op[1]:mutagenize [mode=sequential, n=9]
               └── pool[0] (pool, n=1)
                   └── op[0]:from_seq [mode=fixed, n=1]
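
The relationship between the pool-level and operation-level ``n`` values in the listing above can be pictured with a small stand-alone model. This is a hedged sketch, not PoolParty's data structures: ``Node`` and ``render`` are hypothetical, and the rule that states multiply along a chain is an assumption that holds for this linear pipeline but not for operations such as stacking, where states add.

.. code-block:: python

   from dataclasses import dataclass, field


   @dataclass
   class Node:
       """Hypothetical stand-in for a pool plus the operation that created it."""

       pool_label: str
       op_label: str
       mode: str
       op_states: int
       parents: list = field(default_factory=list)

       @property
       def num_states(self):
           # Assumption: the pool total is this operation's contribution
           # multiplied by the totals of its input pools.
           total = self.op_states
           for parent in self.parents:
               total *= parent.num_states
           return total


   def render(node, depth=0):
       """Return tree lines: one pool line, then its operation line, then inputs."""

       def prefix(level):
           return "" if level == 0 else "    " * (level - 1) + "└── "

       lines = [
           f"{prefix(depth)}{node.pool_label} (pool, n={node.num_states})",
           f"{prefix(depth + 1)}{node.op_label} [mode={node.mode}, n={node.op_states}]",
       ]
       # Every input is drawn with the same connector, which is enough for a chain.
       for parent in node.parents:
           lines.extend(render(parent, depth + 2))
       return lines


   wt = Node("pool[0]", "op[0]:from_seq", "fixed", 1)
   mut = Node("pool[1]", "op[1]:mutagenize", "sequential", 9, [wt])
   scored = Node("pool[2]", "op[2]:score", "fixed", 1, [mut])

   dag_lines = render(scored)
   print("\n".join(dag_lines))

Here ``score`` contributes a single state of its own, so the final pool keeps the nine states produced by ``mutagenize``.
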