MPRA Library for Regulatory Grammar
====================================

This tutorial designs a massively parallel reporter assay (MPRA) library for probing transcriptional regulatory grammar. The library places three liver-enriched transcription factor binding sites (TFBSs) at random positions and orientations within a 100 bp candidate regulatory element (CRE). Each unique CRE arrangement is paired with three distinct barcodes for technical replication, yielding 24,000 barcoded sequences that can be used to test how binding site configuration affects gene expression.

The TFBS sequences (HNF4A, PPARA, XBP1) come from Georgakopoulos-Soares et al. (*Nature Communications*, 2023), and the oligo construct layout follows Melnikov et al. (*Nature Biotechnology*, 2012).

.. image:: /_static/images/figure3a.drawio.svg
   :width: 80%
   :align: center
   :alt: MPRA library design schematic showing the pipeline from template construct through TFBS insertion, shuffling, and barcode attachment.

.. code-block:: python

   import poolparty as pp
   pp.init()

----

Reference sequences
--------------------

The construct follows the Melnikov et al. oligo layout: a 5' adapter, a 100 bp CRE region containing putatively inert background sequence, a KpnI/XbaI restriction junction, an 8 bp barcode, and a 3' sequencing adapter. The 100 bp background is drawn from a confirmed-negative genomic region (Georgakopoulos-Soares et al., Supplementary Table 2).

.. code-block:: python

   # 100 bp background sequence (54 + 46 bases)
   BG1_100 = (
       "GCAAGTCTGCCATCGTGTTCAGAAGGGCCAGAAATGCCAAGGACTCAGGGGAGG"
       "AGAATTAAGTCAGAGAGTTTCATTACTGAGTGTTGTTTGACTTTGT"
   )
   MELNIKOV_5P = "ACTGGCCGCTTCACTG"        # 5' adapter
   MELNIKOV_3P = "AGATCGGAAGAGCGTCG"       # 3' sequencing adapter
   MELNIKOV_JUNCTION = "GGTACCTCTAGA"      # KpnI + XbaI

Build the template
------------------

The template contains two :doc:`tagged regions `: ``<cre>`` marks the 100 bp element where TFBSs will be placed, and ``<bc>`` marks the barcode placeholder (initially filled with ``N`` characters).

.. code-block:: python

   MPRA_TEMPLATE = (
       MELNIKOV_5P
       + "<cre>" + BG1_100 + "</cre>"
       + MELNIKOV_JUNCTION
       + "<bc>" + "N" * 8 + "</bc>"
       + MELNIKOV_3P
   )
   template = pp.from_seq(MPRA_TEMPLATE)

Create TFBS pools
-----------------

Each TFBS is created as a single-sequence pool, then passed through :doc:`flip ` to include both forward and reverse-complement orientations. Color :doc:`styles ` make TFBSs visually distinguishable in the output: HNF4A in blue, PPARA in purple, XBP1 in orange.

.. code-block:: python

   hnf4a = pp.from_seq("GGGGCAAAGGTCA", style="blue").flip(
       mode="sequential", cards={"flip": "hnf4a_strand"})
   ppara = pp.from_seq("CCGGGTCATTGGGGTCAGG", style="purple").flip(
       mode="sequential", cards={"flip": "ppara_strand"})
   xbp1 = pp.from_seq("GTGATGACGTGTCCCAT", style="orange").flip(
       mode="sequential", cards={"flip": "xbp1_strand"})

Each TFBS pool now contains two states (forward and reverse complement).

Insert TFBSs into the CRE region
---------------------------------

:doc:`insertion_multiscan ` places three TFBSs at random positions within the ``<cre>`` region. The ``replace=True`` flag overwrites the underlying background bases, so the total sequence length stays constant. ``insertion_mode="unordered"`` means the three sites can appear in any order, and ``min_spacing=0`` allows binding sites to sit immediately adjacent to each other.

.. code-block:: python

   cre_pool = template.insertion_multiscan(
       region="cre",
       insertion_pools=[hnf4a, ppara, xbp1],
       insertion_mode="unordered",
       replace=True,
       min_spacing=0,
       num_insertions=3,
       mode="random",
       num_states=1000,
       names=["hnf4a", "ppara", "xbp1"],
       cards={"starts": "positions", "names": "tfbs"},
   ).repeat(times=3)

The ``num_states=1000`` parameter draws 1,000 random position configurations. Because ``flip`` uses :doc:`sequential mode `, it exhaustively enumerates both orientations for each TFBS rather than sampling. With three TFBSs, this gives 2\ :sup:`3` = 8 orientation combinations per position configuration, yielding 8,000 unique CRE variants.
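
The count of unique variants can be checked with plain Python, independent of poolparty. The sketch below is only an illustration of the arithmetic: the variable names are hypothetical, and the ``forward``/``rc`` labels are borrowed from the strand columns shown later in this tutorial.

.. code-block:: python

   from itertools import product

   tfbs_names = ["hnf4a", "ppara", "xbp1"]
   strands = ["forward", "rc"]

   # Every assignment of one strand per TFBS: 2 ** 3 combinations.
   orientation_combos = list(product(strands, repeat=len(tfbs_names)))

   # Each random position configuration is crossed with every combination.
   num_position_configs = 1000
   num_unique_variants = num_position_configs * len(orientation_combos)

   print(len(orientation_combos), num_unique_variants)
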
:doc:`repeat ` then creates three copies of each variant (24,000 total), so that each unique CRE arrangement will receive three distinct barcodes for technical replication.

Generate and attach barcodes
----------------------------

Each CRE variant receives a unique 8 bp barcode. :doc:`get_barcodes ` generates barcodes with controlled GC content and minimum edit distance to ensure they are distinguishable by sequencing.

.. code-block:: python

   barcode_pool = pp.get_barcodes(
       num_barcodes=cre_pool.num_states,
       length=8,
       gc_range=(0.3, 0.6),
       min_edit_distance=1,
       style="bold",
       seed=42,
   )
   mpra_pool = cre_pool.replace_region(
       region_name="bc",
       content_pool=barcode_pool,
   )

:doc:`replace_region ` with the default ``sync=True`` pairs each of the 24,000 CRE variants with exactly one barcode. Because every unique CRE arrangement appears three times (from ``repeat``), each arrangement receives three distinct barcodes for technical replication.

Inspect the library
-------------------

.. code-block:: python

   mpra_pool.print_library(num_seqs=12, seed=42)

.. code-block:: text

   mpra_pool: seq_length=153, num_states=24000
   ACTGGCCGCTTCACTG<cre>GCGTGATGACGTGTCCCATCAGAAGGGCCAGAAATGCCAACCGGGTCATTGGGGTCAGGTAAGTCAGAGAGTTTCATTACTGAGTGGGGGCAAAGGTCAT</cre>GGTACCTCTAGA<bc>TGGAGAAA</bc>AGATCGGAAGAGCGTCG
   ACTGGCCGCTTCACTG<cre>GCGTGATGACGTGTCCCATCAGAAGGGCCAGAAATGCCAACCGGGTCATTGGGGTCAGGTAAGTCAGAGAGTTTCATTACTGAGTGGGGGCAAAGGTCAT</cre>GGTACCTCTAGA<bc>GCTGTCTT</bc>AGATCGGAAGAGCGTCG
   ACTGGCCGCTTCACTG<cre>GCGTGATGACGTGTCCCATCAGAAGGGCCAGAAATGCCAACCGGGTCATTGGGGTCAGGTAAGTCAGAGAGTTTCATTACTGAGTGGGGGCAAAGGTCAT</cre>GGTACCTCTAGA<bc>CCCGAATT</bc>AGATCGGAAGAGCGTCG
   ACTGGCCGCTTCACTG<cre>GCAAGTCTGCCATCGTGTTCAGAGGGGCAAAGGTCACCAACCGGGTCATTGGGGTCAGGTAAGTCAGAGAGTGATGACGTGTCCCATTGTTTGACTTTGT</cre>GGTACCTCTAGA<bc>AAAGGGTC</bc>AGATCGGAAGAGCGTCG
   ACTGGCCGCTTCACTG<cre>GCAAGTCTGCCATCGTGTTCAGAGGGGCAAAGGTCACCAACCGGGTCATTGGGGTCAGGTAAGTCAGAGAGTGATGACGTGTCCCATTGTTTGACTTTGT</cre>GGTACCTCTAGA<bc>ACCCACAA</bc>AGATCGGAAGAGCGTCG
   ACTGGCCGCTTCACTG<cre>GCAAGTCTGCCATCGTGTTCAGAGGGGCAAAGGTCACCAACCGGGTCATTGGGGTCAGGTAAGTCAGAGAGTGATGACGTGTCCCATTGTTTGACTTTGT</cre>GGTACCTCTAGA<bc>AAGATCTG</bc>AGATCGGAAGAGCGTCG
   ACTGGCCGCTTCACTG<cre>GCAAGTCTGCCACCGGGTCATTGGGGTCAGGAAATGCCAAGGACTCAGGTGATGACGTGTCCCATAGAGAGTTTCATTACTGGGGCAAAGGTCACTTTGT</cre>GGTACCTCTAGA<bc>CTGTTGTT</bc>AGATCGGAAGAGCGTCG
   ACTGGCCGCTTCACTG<cre>GCAAGTCTGCCACCGGGTCATTGGGGTCAGGAAATGCCAAGGACTCAGGTGATGACGTGTCCCATAGAGAGTTTCATTACTGGGGCAAAGGTCACTTTGT</cre>GGTACCTCTAGA<bc>AGTCATGG</bc>AGATCGGAAGAGCGTCG
   ACTGGCCGCTTCACTG<cre>GCAAGTCTGCCACCGGGTCATTGGGGTCAGGAAATGCCAAGGACTCAGGTGATGACGTGTCCCATAGAGAGTTTCATTACTGGGGCAAAGGTCACTTTGT</cre>GGTACCTCTAGA<bc>AGACTGGT</bc>AGATCGGAAGAGCGTCG
   ACTGGCCGCTTCACTG<cre>GCAAGGGGCAAAGGTCATTCAGAAGGGCCAGAAATGCCAAGGACTCCGGGTCATTGGGGTCAGGGTGATGACGTGTCCCATGAGTGTTGTTTGACTTTGT</cre>GGTACCTCTAGA<bc>GAGGAACT</bc>AGATCGGAAGAGCGTCG
   ACTGGCCGCTTCACTG<cre>GCAAGGGGCAAAGGTCATTCAGAAGGGCCAGAAATGCCAAGGACTCCGGGTCATTGGGGTCAGGGTGATGACGTGTCCCATGAGTGTTGTTTGACTTTGT</cre>GGTACCTCTAGA<bc>ATACAACC</bc>AGATCGGAAGAGCGTCG
   ACTGGCCGCTTCACTG<cre>GCAAGGGGCAAAGGTCATTCAGAAGGGCCAGAAATGCCAAGGACTCCGGGTCATTGGGGTCAGGGTGATGACGTGTCCCATGAGTGTTGTTTGACTTTGT</cre>GGTACCTCTAGA<bc>ACCCAGAA</bc>AGATCGGAAGAGCGTCG

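
The reported ``seq_length=153`` can be cross-checked against the construct layout by taking one printed line apart on its region tags. The following is a minimal sketch using only Python's ``re`` module; the group names (``five``, ``junction``, ``three``) are illustrative labels, not poolparty identifiers.

.. code-block:: python

   import re

   # First sequence from the listing above, copied verbatim.
   line = (
       "ACTGGCCGCTTCACTG<cre>GCGTGATGACGTGTCCCATCAGAAGGGCCAGAAATGCCAA"
       "CCGGGTCATTGGGGTCAGGTAAGTCAGAGAGTTTCATTACTGAGTGGGGGCAAAGGTCAT</cre>"
       "GGTACCTCTAGA<bc>TGGAGAAA</bc>AGATCGGAAGAGCGTCG"
   )

   # Split the line into its five components using the region tags.
   match = re.fullmatch(
       r"(?P<five>[ACGT]+)<cre>(?P<cre>[ACGT]+)</cre>"
       r"(?P<junction>[ACGT]+)<bc>(?P<bc>[ACGT]+)</bc>(?P<three>[ACGT]+)",
       line,
   )
   parts = match.groupdict()

   # Component lengths, with tags excluded, and their sum.
   lengths = {name: len(seq) for name, seq in parts.items()}
   total_length = sum(lengths.values())
   print(lengths, total_length)

The five component lengths (16, 100, 12, 8, and 17) add up to 153, matching the header line of the listing.
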

Each sequence shows the positions and orientations of the three TFBSs and the barcode; in styled output, HNF4A appears in blue, PPARA in purple, XBP1 in orange, and the barcode in bold. The ``<cre>`` and ``<bc>`` region tags are preserved so downstream operations can continue to reference those regions. Notice that the first three sequences share the same TFBS positions and orientations but carry different barcodes, reflecting the three technical replicates produced by ``repeat(times=3)``.

Design cards
~~~~~~~~~~~~

The ``cards`` parameters on ``flip`` and ``insertion_multiscan`` record each variant's TFBS positions, spatial ordering, and strand orientations as :doc:`design card ` columns:

.. code-block:: python

   df = mpra_pool.sample(num_seqs=6, seed=42).generate_library()
   df[["positions", "tfbs", "hnf4a_strand", "ppara_strand", "xbp1_strand"]]

.. table:: df: 6 rows × 7 columns (card columns shown)

   ============  ====================  ============  ============  ===========
   positions     tfbs                  hnf4a_strand  ppara_strand  xbp1_strand
   ============  ====================  ============  ============  ===========
   [5, 37, 87]   [xbp1, ppara, hnf4a]  rc            forward       rc
   [7, 43, 65]   [xbp1, ppara, hnf4a]  forward       forward       forward
   [7, 43, 65]   [xbp1, ppara, hnf4a]  forward       rc            forward
   [18, 47, 80]  [ppara, hnf4a, xbp1]  forward       forward       rc
   [9, 37, 59]   [ppara, xbp1, hnf4a]  rc            forward       rc
   [10, 31, 71]  [xbp1, ppara, hnf4a]  rc            forward       rc
   ============  ====================  ============  ============  ===========

The ``positions`` column records the start position of each TFBS within the CRE, ``tfbs`` records their spatial order (left to right along the sequence), and the strand columns record each site's orientation. Notice that position configurations, orderings, and strand combinations all vary independently across the library.

See :doc:`insertion_multiscan `, :doc:`flip `, :doc:`get_barcodes `, and :doc:`replace_region ` for full parameter details. To export the library as a DataFrame or file, see ``to_df`` and ``to_file`` in :doc:`/pool`.
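
To make the card columns concrete, the sketch below rebuilds a CRE from one table row (``[18, 47, 80]``, ``[ppara, hnf4a, xbp1]``, with XBP1 on the reverse strand) in plain Python. It is a stand-in for illustration, not poolparty's implementation: the helper names are hypothetical, and it assumes that ``positions`` are 0-based offsets within the CRE listed in the same left-to-right order as ``tfbs``.

.. code-block:: python

   BG1_100 = (
       "GCAAGTCTGCCATCGTGTTCAGAAGGGCCAGAAATGCCAAGGACTCAGGGGAGG"
       "AGAATTAAGTCAGAGAGTTTCATTACTGAGTGTTGTTTGACTTTGT"
   )
   SITES = {
       "hnf4a": "GGGGCAAAGGTCA",
       "ppara": "CCGGGTCATTGGGGTCAGG",
       "xbp1": "GTGATGACGTGTCCCAT",
   }

   def reverse_complement(seq):
       """Return the reverse complement of an uppercase DNA string."""
       return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

   def overwrite(background, start, site):
       """Replace len(site) bases beginning at start; length is unchanged."""
       return background[:start] + site + background[start + len(site):]

   # One row of the card table (0-based starts are an assumption).
   starts = [18, 47, 80]
   names = ["ppara", "hnf4a", "xbp1"]
   strands = {"hnf4a": "forward", "ppara": "forward", "xbp1": "rc"}

   cre = BG1_100
   for start, name in zip(starts, names):
       site = SITES[name]
       if strands[name] == "rc":
           site = reverse_complement(site)
       cre = overwrite(cre, start, site)

   print(len(cre))
   print(cre)

Because each site overwrites background bases rather than being spliced in, the rebuilt CRE stays at 100 bp, which mirrors the behavior the tutorial describes for ``replace=True``.
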