filter

Retain only the sequences for which a predicate function returns True; all other sequences are replaced with a NullSeq sentinel.

import poolparty as pp
pp.init()

Note

Rejected sequences are not removed from the state space — they become NullSeq values that propagate silently through every downstream operation. By default generate_library still includes NullSeq rows (as empty values). Pass discard_null_seqs=True to exclude them from the output.

The predicate receives the tag-free sequence string (region tags are stripped before evaluation).
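Because the predicate is an ordinary callable from str to bool, it can be written and tested without any pool at all. The sketch below is one way to do that; the helper name min_gc_fraction is hypothetical and not part of poolparty.

```python
from typing import Callable


def min_gc_fraction(threshold: float) -> Callable[[str], bool]:
    """Build a predicate that keeps sequences whose GC fraction is >= threshold.

    Hypothetical helper for illustration; not part of poolparty.
    """
    def predicate(seq: str) -> bool:
        if not seq:  # an empty string has no GC fraction; reject it
            return False
        gc = sum(base in "GC" for base in seq.upper())
        return gc / len(seq) >= threshold

    return predicate


keep_high_gc = min_gc_fraction(0.5)
print(keep_high_gc("GCGCGC"))  # True
print(keep_high_gc("AAACCC"))  # True
print(keep_high_gc("AAAAAA"))  # False
```

A predicate built this way should be usable wherever a lambda is shown on this page, for example pp.filter(seqs, keep_high_gc).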


Parameters

Parameter   Type                                Default      Description
---------   ---------------------------------   ----------   -----------
pool        Pool | DnaPool | ProteinPool        (required)   Input pool to filter.
predicate   Callable[[str], bool]               (required)   Function taking the clean (tag-free) sequence string; return True to keep the sequence.
name        str | None                          None         Optional name for the filter operation.
prefix      str | None                          None         Prefix for sequence names in the resulting pool.
cards       list[str] | dict[str, str] | None   None         Design card keys to include. Available keys: 'passed'.


Note

Only the most commonly used parameters are shown above. For the full parameter list, see filter() in the API Reference.

Examples

Filter by GC content

Keep only sequences with at least 3 G or C bases out of 6 (GC content ≥ 50 %). Sequences that fail the predicate are replaced with a NullSeq sentinel, which print_library shows as None; pass discard_null_seqs=True to generate_library to exclude them from the final DataFrame.

seqs    = pp.from_seqs(
    ["AAAAAA", "GCGCGC", "AAACCC", "TTTTTT", "GGCCAA"],
    mode="sequential",
)
high_gc = pp.filter(seqs, lambda s: s.count("G") + s.count("C") >= 3)
high_gc.print_library()

high_gc: seq_length=6, num_states=5
None
GCGCGC
AAACCC
None
GGCCAA
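The output above still has five rows because rejected sequences keep their slot. A plain-Python model of that behaviour is sketched here; it is not poolparty's implementation, and model_filter and drop_nulls are hypothetical names, with None standing in for NullSeq.

```python
from typing import Callable, List, Optional


def model_filter(seqs: List[str], predicate: Callable[[str], bool]) -> List[Optional[str]]:
    """Illustrative model only: rejected entries keep their slot but become None."""
    return [s if predicate(s) else None for s in seqs]


def drop_nulls(rows: List[Optional[str]]) -> List[str]:
    """Model of discard_null_seqs=True: remove the None placeholders."""
    return [s for s in rows if s is not None]


seqs = ["AAAAAA", "GCGCGC", "AAACCC", "TTTTTT", "GGCCAA"]
rows = model_filter(seqs, lambda s: s.count("G") + s.count("C") >= 3)
print(rows)              # [None, 'GCGCGC', 'AAACCC', None, 'GGCCAA']
print(drop_nulls(rows))  # ['GCGCGC', 'AAACCC', 'GGCCAA']
```

The number of states stays at five after filtering; only the discard step shortens the result.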

Filter by sequence length

When a pool may contain sequences of varying length, keep only those that are exactly 8 bases long.

seqs    = pp.from_seqs(
    ["ATCG", "ATCGATCG", "GGCC", "TTTTAAAA", "ACG"],
    mode="sequential",
)
trimmed = pp.filter(seqs, lambda s: len(s) == 8)
trimmed.print_library()
df      = pp.generate_library(trimmed, discard_null_seqs=True)

trimmed: seq_length=None, num_states=5
None
ATCGATCG
None
TTTTAAAA
None

Exclude sequences containing a restriction site

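Reject any 8-mer that contains the EcoRI recognition site GAATTC; rejected 8-mers become NullSeq and are dropped only when discard_null_seqs=True is passed to generate_library. (Here get_kmers uses mode="sequential" with length 8; length 12 is too large for sequential enumeration under the default state limit.)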

pool     = pp.get_kmers(8, mode="sequential")
no_ecori = pp.filter(pool, lambda s: "GAATTC" not in s)
no_ecori.print_library()
df       = pp.generate_library(no_ecori, num_seqs=6, discard_null_seqs=True)

no_ecori: seq_length=8, num_states=65536
AAAAAAAA
AAAAAAAC
AAAAAAAG
AAAAAAAT
AAAAAACA
... (65536 total)
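EcoRI's site is palindromic, so checking one strand is enough. For non-palindromic sites such as BsaI (GGTCTC), a predicate may also need to check the reverse complement. The sketch below shows one way to build such a predicate; excludes_sites and reverse_complement are hypothetical helpers written for this example, not poolparty functions.

```python
from typing import Callable, Iterable

# Translation table for complementing uppercase DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def excludes_sites(sites: Iterable[str]) -> Callable[[str], bool]:
    """Build a predicate that is True when no site occurs on either strand."""
    patterns = set()
    for site in sites:
        site = site.upper()
        patterns.add(site)
        patterns.add(reverse_complement(site))

    def predicate(seq: str) -> bool:
        seq = seq.upper()
        return not any(p in seq for p in patterns)

    return predicate


no_sites = excludes_sites(["GAATTC", "GGTCTC"])  # EcoRI, BsaI
print(reverse_complement("GGTCTC"))  # GAGACC
print(no_sites("AAGAATTC"))          # False (EcoRI site)
print(no_sites("AAGAGACC"))          # False (BsaI site on the opposite strand)
print(no_sites("AAAAAAAA"))          # True
```

Used in place of the lambda, this would read pp.filter(pool, no_sites).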

Chain: mutagenize then filter by GC content

Generate all single-nucleotide mutants, then keep only those whose GC content is at least 5 out of 8 bases.

wt      = pp.from_seq("ATCGATCG")
mutants = pp.mutagenize(wt, num_mutations=1, mode="sequential")
high_gc = pp.filter(mutants, lambda s: s.count("G") + s.count("C") >= 5)
high_gc.print_library(num_seqs=8)

high_gc: seq_length=8, num_states=24
CTCGATCG
GTCGATCG
None
None
ACCGATCG
AGCGATCG
None
None
... (24 total)
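The pattern of None rows above can be checked by hand. The wild type ATCGATCG has 4 G/C bases, so a mutant passes only when an A or T is replaced by G or C. The plain-Python sketch below enumerates the 24 single mutants without poolparty; the enumeration order (position by position, alternative bases alphabetically) is an assumption chosen to match the printed output, not a documented guarantee.

```python
wt = "ATCGATCG"

# All single-nucleotide substitutions: 8 positions x 3 alternative bases.
mutants = [
    wt[:i] + alt + wt[i + 1:]
    for i, ref in enumerate(wt)
    for alt in "ACGT"
    if alt != ref
]

# Same predicate as in the example: at least 5 G or C bases.
passed = [m for m in mutants if m.count("G") + m.count("C") >= 5]

print(len(mutants), len(passed))  # 24 8
print([m if m in passed else None for m in mutants[:8]])
# ['CTCGATCG', 'GTCGATCG', None, None, 'ACCGATCG', 'AGCGATCG', None, None]
```

Only 8 of the 24 states survive the filter, so passing discard_null_seqs=True to generate_library matters for a pool like this one.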

See filter().