
Hi-C data and processing

We used seven human and mouse Hi-C profiles in this study: IMR-90, GM12878, H1-hESC, K562, CUTLL1, T cell and mouse Patski (Supplementary Table 1). All of the data are available on GEO (www.ncbi.nlm.nih.gov/geo) and/or the 4D Nucleome Data Portal (https://data.4dnucleome.org). To minimize bias in Hi-C data preprocessing, we obtained the count data in raw fastq format. Reads from human cell lines were aligned to the GRCh38 human reference genome, and reads from mouse cells were aligned to the mm10 mouse genome. The alignments were filtered at 10-kb resolution and iteratively corrected with HiC-bench56. To ensure that the prediction results remain compatible with downstream analytical tools, we processed the Hi-C prediction targets with only a reversible natural-log transform. C.Origami predictions, after exponential transformation, can therefore be used directly as Hi-C chromatin contact matrix data in any downstream analysis.

CTCF ChIP–seq and ATAC–seq data

CTCF ChIP–seq and ATAC–seq data for most cell types are publicly available online from GEO (www.ncbi.nlm.nih.gov/geo) and the ENCODE data portal (www.encodeproject.org/). CUTLL1 ATAC–seq was sequenced according to a standard method57. ATAC–seq libraries were generated from 0.5 × 106 CUTLL1 cells and sequenced on an Illumina NovaSeq with 100-bp paired-end reads. Details on accession numbers are listed in Supplementary Table 2. To maintain signal consistency across different cell lines, we aggregated fastq data from different replicates and subsampled them down to 40 million reads. The reads were processed with Seq-N-Slide to generate bigWig files (https://doi.org/10.5281/zenodo.6308846). The bigWig files were used as regular, dense inputs to our model. To prepare an alternative sparse input format, we used MACS2 to perform peak calling on the intermediate bam files to obtain sparse peaks for CTCF and ATAC–seq58. The sparse narrowPeak files were converted back to bigWig with ucscutils. We performed a log(x + 1) transformation on both dense and sparse bigWig files and used them as inputs to the model.

DNA sequence

We used the reference genome sequences (hg38 and mm10) from the UCSC Genome Browser database. The original fasta file includes the four nucleotide types and ‘n’ for unknown bases. We retained the ‘n’ category and encoded it as an unknown fifth ‘nucleotide’. After encoding, each nucleotide is a five-channel one-hot vector over ‘ATCGN’. The same reference genome sequence was used for all cell types.
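The five-channel encoding can be sketched in Python as follows; the function name and mapping constant are illustrative, not the authors' exact implementation:

```python
# Minimal sketch of the five-channel one-hot encoding over 'ATCGN'.
import numpy as np

NUC_TO_IDX = {'A': 0, 'T': 1, 'C': 2, 'G': 3, 'N': 4}

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA string into a (len, 5) one-hot matrix over 'ATCGN'."""
    seq = seq.upper()
    # Any character outside ATCG falls into the unknown 'N' channel.
    idx = np.array([NUC_TO_IDX.get(base, 4) for base in seq])
    onehot = np.zeros((len(seq), 5), dtype=np.float32)
    onehot[np.arange(len(seq)), idx] = 1.0
    return onehot

encoded = one_hot_encode("ACGTN")
```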

Training data

The training data consist of DNA sequence, CTCF-binding signal, ATAC–seq signal and the Hi-C matrix from the IMR-90 cell line. The input data to the model comprise DNA sequence, CTCF ChIP–seq signal and ATAC–seq signal over a 2,097,152-bp region. The output target is the Hi-C matrix at the corresponding region. The Hi-C matrix was originally called at 10-kb resolution and downscaled to 8,192 bp to match the model output resolution. To generate batches of training data, we defined 2-Mb sliding windows across the genome with 40-kb steps. Windows overlapping telomere or centromere regions were removed. We randomly split the genome into training, validation and test chromosomes: chromosomes 10 and 15 were used as the validation set and the test set, respectively, and the rest of the chromosomes were used as the training set.
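Window generation can be sketched as follows, assuming a simple list of excluded intervals (for example, centromeres/telomeres); this is illustrative, not the authors' pipeline code:

```python
# Generate 2-Mb training windows with 40-kb steps, skipping excluded intervals.
WINDOW = 2_097_152   # 2-Mb model input length, as stated in the text
STEP = 40_000        # 40-kb sliding step

def make_windows(chrom_length, excluded=()):
    """Return (start, end) tuples; `excluded` is a list of (start, end) intervals."""
    windows = []
    for start in range(0, chrom_length - WINDOW + 1, STEP):
        end = start + WINDOW
        # Drop any window that overlaps an excluded interval.
        if any(start < ex_end and end > ex_start for ex_start, ex_end in excluded):
            continue
        windows.append((start, end))
    return windows

wins = make_windows(10_000_000, excluded=[(0, 100_000)])
```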

Model architecture

C.Origami is implemented in the PyTorch framework. The model consists of two 1D convolutional encoders, a transformer module and a task-specific 2D convolutional decoder. The sequence encoder and the genomic feature encoder have five and two input channels, respectively. To reduce memory consumption, each encoder starts with a 1D convolution header with stride 2. To reduce the input length from 2 Mb down to 256 bins, we deployed 12 convolution modules, each of which consists of a residual block and a scaling block. The residual block has two sets of convolution layers with kernel width 5 and the same padding. Batch normalization and ReLU nonlinearity follow each convolutional layer, and the start and end positions of the residual block are connected by a residual connection. The residual blocks do not alter the dimensions of their inputs; their residual connections help promote information propagation. The scaling block consists of a 1D convolutional layer with kernel size 5 and stride 2 followed by batch normalization and ReLU activation. The scaling block reduces the input length by a factor of 2 and increases the number of hidden channels. We increase the hidden size according to this configuration: 32, 32, 32, 32, 64, 64, 128, 128, 128, 128, 256, 256. The output from the last scaling module has a length of 256 with 256 channels.

The transformer module is built with eight customized attention layers similar to a BERT model59. Specifically, we set the number of hidden layers to 256 and ReLU as the activation function, and used eight attention heads. We used relative key query as positional embedding and set the maximum length to be 256. After the transformer module, the model concatenates each position in the 256 bins to every other position to form a 256-by-256 interaction map. The concatenation function takes the 256-bin sequence from the feature extraction module and outputs a 256-by-256 grid where location (i, j) is a concatenation of the features at i and j positions. Since each bin has 256 channels, the concatenation produces a 512-channel 256-by-256 3D tensor.
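The pairwise concatenation step can be sketched with plain NumPy broadcasting (function name illustrative; the actual model operates on PyTorch tensors):

```python
# Pairwise concatenation: (channels, bins) -> (2*channels, bins, bins),
# where entry (:, i, j) stacks the feature vectors at positions i and j.
import numpy as np

def pairwise_concat(features: np.ndarray) -> np.ndarray:
    """features: (C, L) -> (2C, L, L)."""
    C, L = features.shape
    rows = np.broadcast_to(features[:, :, None], (C, L, L))  # feature at position i
    cols = np.broadcast_to(features[:, None, :], (C, L, L))  # feature at position j
    return np.concatenate([rows, cols], axis=0)

x = np.random.rand(256, 256).astype(np.float32)  # 256 channels, 256 bins
grid = pairwise_concat(x)                        # 512-channel 256-by-256 tensor
```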

The decoder consists of five dilated residual networks. We designed the dilations at the corresponding layers to be 2, 4, 8, 16, 32 so that the receptive field of each pixel at the last layer covers the input space, reinforcing interactions between different elements. At the end of the decoder, we use a Conv2D layer with a 1 × 1 kernel to combine the 256 channels down to one channel, so the output is a 256 × 256 matrix with one channel. The 256 × 256 output from the model was compared with the experimental Hi-C map (ground truth) via an MSE loss. The loss was backpropagated through the whole network for gradient updates.

Model training and prediction

To train the model, we used a training batch size of 8 and the Adam optimizer with a learning rate of 0.002. A cosine learning rate scheduler with a 200-epoch period was used to stabilize training. We used three types of data augmentation. First, we selected the 2-Mb window with random shifts within a 0.36-Mb range. Second, we reverse-complemented the sequence and flipped the target Hi-C matrix with probability 0.5. Third, we added Gaussian noise with zero mean and 0.1 standard deviation to all input signals. The model achieved minimal validation loss when trained for 54 epochs. The model training time was 18 h on a GPU cluster with quad NVIDIA Tesla V100 GPUs, 320 GB of RAM and 10 CPU cores. Model inference takes under 1 s on a mobile NVIDIA RTX 2060 GPU and 3 s on a mobile Intel i7-8750H CPU. To run prediction in IMR-90, the reference DNA sequence, CTCF ChIP–seq and ATAC–seq from IMR-90 in a 2-Mb region are taken as input. For de novo prediction in a target cell type, we replaced the IMR-90 CTCF ChIP–seq and ATAC–seq with the corresponding CTCF and ATAC–seq profiles from the specific target while keeping the same reference sequence.
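The reverse-complement and noise augmentations can be sketched as follows (the random-shift augmentation is a window-selection step omitted here; function names are illustrative). On a one-hot sequence over 'ATCGN', reverse-complementing means reversing the positions and swapping the A/T and C/G channels, while the target Hi-C matrix is flipped along both axes:

```python
# Sketch of reverse-complement and Gaussian-noise augmentation.
import numpy as np

rng = np.random.default_rng(0)

def reverse_complement(onehot: np.ndarray) -> np.ndarray:
    """Reverse the sequence and swap A<->T, C<->G channels ('ATCGN' order)."""
    flipped = onehot[::-1]
    return flipped[:, [1, 0, 3, 2, 4]]

def augment(onehot, signal, hic, p_flip=0.5, noise_sd=0.1):
    # Gaussian noise (zero mean, 0.1 s.d.) on the 1D input signal.
    signal = signal + rng.normal(0.0, noise_sd, size=signal.shape)
    # Reverse-complement the sequence and flip the Hi-C target with p = 0.5.
    if rng.random() < p_flip:
        onehot = reverse_complement(onehot)
        signal = signal[::-1]
        hic = hic[::-1, ::-1]
    return onehot, signal, hic

a = np.zeros((3, 5), dtype=np.float32)
a[0, 0] = 1.0  # A
a[1, 2] = 1.0  # C
a[2, 4] = 1.0  # N
rc = reverse_complement(a)
```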

Insulation score

The insulation score is implemented as the ratio between the maximum of the left- and right-region average intensities and the center-region average intensity56. We also added a pseudocount calculated from the chromosome-wide average intensity to prevent division by zero in unmappable regions. Given that each region contains n interactions, the insulation score can be formulated as follows:

$$\mathrm{Insulation} = \frac{\max\left(\frac{1}{n}\sum_{n}\left(\mathrm{Left\;Intensity}\right),\ \frac{1}{n}\sum_{n}\left(\mathrm{Right\;Intensity}\right)\right) + \mathrm{pseudocount}}{\frac{1}{n}\sum_{n}\left(\mathrm{Center\;Intensity}\right) + \mathrm{pseudocount}}$$

where pseudocount is set to the average intensity of one chromosome within 2 Mb.
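As a sketch, the insulation ratio above can be computed from a contact matrix as follows (window size and names are illustrative):

```python
# Insulation score at a candidate boundary: larger flanking-square mean over
# the cross (center) square mean, with a pseudocount guarding unmappable bins.
import numpy as np

def insulation_score(hic: np.ndarray, pos: int, w: int, pseudocount: float) -> float:
    """hic: symmetric contact matrix; pos: bin index; w: window size in bins."""
    left = hic[pos - w:pos, pos - w:pos].mean()    # upstream square
    right = hic[pos:pos + w, pos:pos + w].mean()   # downstream square
    center = hic[pos - w:pos, pos:pos + w].mean()  # cross-boundary square
    return (max(left, right) + pseudocount) / (center + pseudocount)

# Two dense blocks with no cross-boundary contacts -> a strong boundary score.
hic = np.zeros((8, 8))
hic[:4, :4] = 1.0
hic[4:, 4:] = 1.0
boundary = insulation_score(hic, 4, 4, pseudocount=0.1)
```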

Loop calling

We used the Hi-C valid pairs with the FitHiC software60,61 to identify significant interactions. We used a resolution of 10 kb, and minimum and maximum distances of 30 kb and 1 Mb. For loop calling on predicted matrices, we converted the predicted matrix back to valid pairs by merging predictions to chromosomes and counting the discretized intensity values. FitHiC generated a list of significant interactions with corresponding false discovery rate (FDR)-corrected Q values using the global background as reference. For loop analysis on IMR-90, we computed the AUROC and the overlap between loops called from experimental Hi-C and loops called from predicted Hi-C. To calculate the AUROC, we used predicted loops as the target. Q value cutoffs ranging from 1 × 10−5 to 1 × 10−13 were selected to filter significant loops called from the predicted Hi-C. Then, the Q values from loops called from experimental Hi-C were compared with significant loops called from prediction to calculate an AUROC. For overlap analysis, we chose a fixed 1 × 10−5 cutoff for loops called from predicted and experimental Hi-C and compared the overlap of significant loops. For loop analysis on specific types of interaction, we overlapped the two anchors of each loop and obtained the categories for each called loop. The loops were then filtered by category, and the same AUROC and overlap analyses were performed on each category of loops.

For cell-type-specific loop analysis between IMR-90 and GM12878, we first used a more stringent cutoff of 1 × 10−7 as a threshold for significant loops. Then we further categorized specific loops into IMR-90 specific or GM12878 specific according to the log2fc of loop interaction counts. To calculate AUROC, we used log2fc in place of the Q value cutoff from previous analysis. We compared two log2fc values. The first log2fc is between predicted loops in cell type 1 and predicted loops in cell type 2 (for example, IMR-90 predicted loop/GM12878 predicted loop). The second log2fc is between experimental loops in cell type 1 and predicted loops in cell type 2 (for example, IMR-90 experimental loop/GM12878 predicted loop). Then the same AUROC and overlap analysis was performed for each of the two cell-type-specific groups. For loop analysis on a specific type of interaction in a cell-type-specific way, the same anchor overlap was performed with corresponding AUROC and overlap analysis.

Chromosome-scale Hi-C contact matrix prediction

To bridge adjacent 2-Mb-window predictions into chromosome-wide Hi-C contact matrices, we ran the prediction in a sliding window with a 262,144-bp step size, which is 1/8 of the 2-Mb prediction window. All predictions were in-painted to their corresponding locations on the contact map with multiple overlaps. To correct for different levels of overlap, we counted the number of overlapping predictions at every pixel and divided the accumulated intensity by that count. The resulting chromosome-wide prediction can be used directly for downstream analysis such as TAD calling, loop calling and insulation score calculation.
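The overlap-averaged stitching can be sketched as accumulating each window prediction together with a per-pixel coverage count (names and bin units are illustrative):

```python
# Merge overlapping window predictions into one chromosome-wide matrix,
# dividing each pixel by the number of predictions that covered it.
import numpy as np

def stitch(predictions, starts, chrom_bins, window_bins):
    """predictions: (window_bins, window_bins) matrices placed at bin offsets `starts`."""
    total = np.zeros((chrom_bins, chrom_bins))
    count = np.zeros((chrom_bins, chrom_bins))
    for pred, s in zip(predictions, starts):
        total[s:s + window_bins, s:s + window_bins] += pred
        count[s:s + window_bins, s:s + window_bins] += 1
    # Average where covered; leave uncovered pixels at zero.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

merged = stitch([np.ones((4, 4)), np.ones((4, 4))], [0, 2], 6, 4)
```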

Distance-stratified intensity and correlation

Distance-stratified intensity and correlation calculations were based on fused chromosome prediction. Stratified intensity at distance i was calculated by aggregating the line that is parallel to the Hi-C diagonal with offset of i. Stratified correlation was calculated as Pearson’s r between the shifted diagonal line of prediction and ground truth.
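Both quantities can be sketched with `np.diagonal` at offset i (a minimal illustration, not the authors' code):

```python
# Distance-stratified mean intensity and Pearson correlation per diagonal offset.
import numpy as np

def stratified_stats(pred: np.ndarray, truth: np.ndarray, max_offset: int):
    """Return per-distance mean intensity of `pred` and Pearson r vs `truth`."""
    intensities, correlations = [], []
    for i in range(1, max_offset):
        p = np.diagonal(pred, offset=i)   # line parallel to the diagonal
        t = np.diagonal(truth, offset=i)
        intensities.append(p.mean())
        correlations.append(np.corrcoef(p, t)[0, 1])
    return np.array(intensities), np.array(correlations)

rng = np.random.default_rng(1)
m = rng.random((10, 10))
intens, corr = stratified_stats(m, m, 5)
```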

Performance comparison with previous methods

We compared the performance of C.Origami against three previously published methods: Akita18, DeepC19 and Orca20. We compared the performance using four metrics: insulation score correlation, observed versus expected Hi-C matrices correlation, MSE and distance-stratified correlation. We calculated the four metrics separately for the four models by comparing their predictions to the experimental Hi-C data. The comparison was carried out in two different cell types: (1) the training cell type, IMR-90 cells, which most models were trained on and (2) a new cell type, GM12878 cells, aiming to quantify the performance of de novo prediction of chromatin organization by the four models.

We generated a set of sliding windows that covers the whole genome and can be predicted by each model. Since Akita and DeepC are only able to predict interaction within a 1-Mb window, we restricted the test regions to 1-Mb blocks. To generate a genome-wide testing dataset, we selected all 1-Mb regions in a sliding window with 0.5-Mb overlap between neighboring regions. To ensure compatibility with all models’ prediction windows, the sliding window starts and ends 1.5 Mb after chromosome starting location and before ending location to create buffer regions for models requiring 2-Mb windows as inputs. In total, 5,935 regions were generated genome-wide. We used all four models to predict the interaction for the corresponding regions.

The most relevant versions of the previous models were selected for comparison. For Akita, the IMR-90 output channel was selected. For DeepC, we used their model trained with IMR-90 data. Orca was only trained on two cell types, human foreskin fibroblasts (HFFs) and H1-hESCs. We used the HFF model because HFF is also a fibroblast cell line similar to IMR-90. The comparison turned out to be valid because even though Orca was trained on HFF, it outperformed both Akita and DeepC on IMR-90 in many benchmarks. For C.Origami, we used the IMR-90-trained model.

It is necessary to scale and normalize each model's outputs because each model customizes its prediction target differently. Akita predicts a 1,048,576-bp region with 512 bins. We removed the extra 48,576 bp on the sides to make the prediction 1 Mb, followed by rescaling into 128 bins. Orca can predict interactions at multiple scales. Since C.Origami uses a 2-Mb window as its prediction target, we selected the 2-Mb window in Orca for consistency. The prediction was then cropped to 1 Mb and rescaled to 128-by-128. For C.Origami, the prediction is a 2,097,152-bp window. We cropped the prediction to leave the center 1-Mb region and rescaled it to 128 bins.

DeepC’s prediction target differs from those of the other models: it predicts a 45-degree-rotated map. DeepC also produces predicted Hi-C maps at scales different from the other methods. Thus, we performed a series of transformations (Supplementary Fig. 11), including mirroring, rotating and cropping, to make a contact matrix comparable to the outputs produced by the other models. We used a 1-Mb prediction window for DeepC and rescaled the output to 128-by-128.

The first step in making the models comparable is selecting a common ground truth Hi-C as the evaluation target. Since each model used a different ground truth with different transformations (for example, observed/expected, log, Gaussian smoothing), they cannot be compared directly. We defined the evaluation target as logged Hi-C intensity with iterative correction and eigenvector decomposition (ICE): log(ICE-normalized counts + 1). Logged intensity has a few advantages over the observed versus expected map. First, it allows for computing insulation scores. Second, it can be converted to observed versus expected, while the reverse is not straightforward; it can also be converted to raw counts by taking the exponent. Third, it is the default Hi-C format for most downstream analysis pipelines such as loop calling and visualization.

The second step to make the models comparable is to normalize model outputs to the evaluation Hi-C target. Since each model used a different original prediction target, the intensities of prediction and evaluation target show a large discrepancy depending on the model. Specifically, DeepC results stood out with a unique pattern which might be a result of their custom stratified binning method (Supplementary Fig. 12). We also observed that the raw predicted matrix intensities were too different to compare (Supplementary Fig. 12).

We performed distance-stratified normalization to align all predictions to the target (Supplementary Fig. 13). We computed the mean and s.d. for each diagonal and then normalized the prediction to target experimental Hi-C. Formally, let \(\hat T\) be the normalized matrix, T be the target ground truth matrix and M be the unnormalized matrix. Let \(m_{d,i}\) be the corresponding element in M, and μ and σ denote the mean and s.d. at diagonal d in matrix T and M. Then, every ith entry on dth diagonal, \(t_{d,i}\) can be normalized as follows:

$$\forall t_{d,i} \in \hat{T},\quad t_{d,i} = \frac{\sigma_d^T}{\sigma_d^M}\left(m_{d,i} - \mu_d^M\right) + \mu_d^T$$
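The per-diagonal normalization described above can be sketched as follows (illustrative; `np.diagonal`'s offset convention handles both signs of d):

```python
# Distance-stratified normalization: rescale every diagonal of the prediction M
# to the mean and s.d. of the matching diagonal in the target T.
import numpy as np

def stratified_normalize(M: np.ndarray, T: np.ndarray) -> np.ndarray:
    out = np.zeros_like(M, dtype=float)
    n = M.shape[0]
    for d in range(-(n - 1), n):
        m_diag = np.diagonal(M, offset=d)
        t_diag = np.diagonal(T, offset=d)
        sd_m = m_diag.std()
        scale = t_diag.std() / sd_m if sd_m > 0 else 0.0
        norm = scale * (m_diag - m_diag.mean()) + t_diag.mean()
        idx = np.arange(len(m_diag))
        rows = idx if d >= 0 else idx - d   # offset d picks elements (i, i+d)
        cols = idx + d if d >= 0 else idx
        out[rows, cols] = norm
    return out

rng = np.random.default_rng(2)
M = rng.random((20, 20))
T = rng.random((20, 20)) * 3 + 1
N = stratified_normalize(M, T)
```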

The normalized predictions were compared with the target Hi-C using the four metrics. Each metric was calculated per chromosome for every tested model using their corresponding prediction and the experimental data as ground truth.

We also performed GM12878 de novo prediction comparison. For C.Origami, we used the same IMR-90-trained model but GM12878 CTCF ChIP–seq and ATAC–seq profiles as inputs to predict Hi-C. For sequence-only models, we used the same DNA sequence setup because they could not provide cell-type-specific de novo prediction. Though ideally input DNA sequence should be cell-type-specific, such a procedure is not realistic for general applications.

De novo prediction evaluation

Regions with normal intensity (>10% intensity quantile) and low similarity (>20% insulation difference) between the experimental Hi-C matrices of the two analyzed cell types were selected as structurally different genomic regions. In total, ~15% of the entire genome (~450 Mb) was included for evaluating the performance of cell-type-specific Hi-C prediction in each pair of cell types. In comparison, structurally conserved genomic regions were characterized by normal intensity (>10% intensity quantile) and high similarity (<20% insulation difference). These regions were used for control analysis in parallel with the aforementioned evaluation in the structurally different genomic regions.

C.Origami prediction at the CUTLL1 t(7;9) translocation site

To generate experimental Hi-C data, we defined a custom chromosome in HiC-bench56. The custom genome in HiC-bench is defined at the matrix-filtering step, where the pipeline assigns reads to chromosomes. For the CUTLL1 experiment, we defined a custom chromosome chr7chr9, with chr7:0-142800000 as the starting segment and chr9:136500000-138394717 as the ending segment. The CUTLL1 t(7;9) translocation is heterozygous, adding allele-specific complexity to its Hi-C matrix. Since only one allele is translocated, experimental Hi-C data mapped to either the normal reference genome or the t(7;9) translocated reference genome is a mixture of chromatin interactions from both the translocated and the normal chromosomes. To match this hybrid effect in the Hi-C contact map, we first separately predicted three sets of Hi-C maps: the t(7;9) translocated chromosome, normal chromosome 7 and normal chromosome 9. The predicted Hi-C matrix at the t(7;9) locus is an average of the predicted Hi-C map of the t(7;9) translocated chromosome and a fused prediction map ranging from normal chr7 to the breakpoint chr7:142,797,952 and extending from chr9:136,502,817 to the rest of normal chr9. We did not count the interchromosomal interactions at these loci because of their much weaker intensity compared with the intrachromosomal interactions at the translocation site.

Mouse prediction

For the mouse Patski cell-type prediction42, the CTCF ChIP–seq and ATAC–seq inputs were processed using the same pipeline with mm10 as the assembly. The original C.Origami model trained with IMR-90 dense input features was used for prediction. For genome-wide evaluation of predicting mouse chromatin organization, we adopted the same procedure from the ‘Performance comparison with previous methods’ section.

CTCF depletion prediction in mouse embryonic stem cells (mESCs)

We preprocessed CTCF ChIP–seq and Hi-C data on mESCs from Nora et al.32, following the same ChIP–seq and Hi-C pipelines. In total, three sets of data were processed, corresponding to three conditions: untreated, auxin-induced CTCF depletion and auxin wash-off. Since this study did not measure ATAC–seq, the C.Origami model was re-trained using only DNA sequence and CTCF ChIP–seq on the untreated condition. The re-trained model was then used to predict chromatin organization in the CTCF depletion (auxin treatment) and restoration (auxin wash-off) conditions. The genome-wide performance benchmark followed the same procedure from the ‘Performance comparison with previous methods’ section.

GRAM

The GRAM scoring system is a generalized version of Grad-CAM on 2D outputs62. Instead of taking a single output, GRAM operates on a region r in the output space y and runs backpropagation on all pixels within r. GRAM on region r in network layer m is defined as follows:

$$\mathrm{GRAM}_m(r) = \sum_{k} \mathrm{ReLU}\left(\alpha_k^r\right) \cdot \mathrm{ReLU}\left(A_k^m\right)$$

where \(\alpha _k^r\) is the activation weight for channel k and region r. Formally, \(\alpha _k^r\) is defined as:

$$\alpha_k^r = \frac{1}{Z}\sum_{i}\sum_{j} \frac{\partial r}{\partial A_{k_{i,j}}^m}$$

where Z is the number of activations in the layer and the quotient is the gradient at position i, j in activation layer m with respect to output r. \(\alpha _k^r\) can be interpreted as the average gradient across the width and height dimensions at layer m. \(A_k^m\) is the activation in channel k at layer m. In this study, we chose r to be the full output space. During forward propagation, the activation (\(A^m\)) at the target layer m is recorded. This activation map is a 3D tensor, or an image with k channels. Then, the r region of the output is selected for backpropagation and gradients are calculated for every layer. The gradients (used for calculating the weights \(\alpha _k^r\)) at the target layer m are collected. The set of collected gradients is also an image-like 3D tensor with k channels. To obtain \(\alpha _k^r\), we averaged the gradients across the width and height dimensions, resulting in a k-dimensional array. The goal of GRAM is to visualize a gradient-weighted activation map that maximizes the output signal. To obtain this weighted activation, \(\alpha _k^r\) is used as weights to average the k-channel activation image (\(A^m\)). The final averaged activation is defined as the GRAM output.
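The weighting step can be sketched as follows, assuming the activation map and its gradients at layer m have already been collected (for example, via autograd hooks); this isolates only the GRAM arithmetic, not the hook machinery:

```python
# GRAM weighting: channel-averaged gradients become channel weights for the
# activation map; both pass through ReLU before the weighted sum.
import numpy as np

def gram(A: np.ndarray, G: np.ndarray) -> np.ndarray:
    """A, G: (k, h, w) activations and gradients; return the (h, w) GRAM map."""
    alpha = G.mean(axis=(1, 2))        # average gradient per channel (alpha_k)
    relu = lambda x: np.maximum(x, 0.0)
    # sum_k ReLU(alpha_k) * ReLU(A_k)
    return np.einsum("k,khw->hw", relu(alpha), relu(A))

A = np.ones((2, 3, 3))
G = np.stack([np.ones((3, 3)), -np.ones((3, 3))])  # channel 1 has negative gradient
out = gram(A, G)
```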

Attention score

In the transformer module, we implemented the vanilla multi-head attention27:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{\mathrm{O}}$$

where Q, K, V are the query, key and values. \(W^{\mathrm{O}}\) is the output projection of dimension (number of heads h times value dimension \(d_{\mathrm{v}}\), by model dimension \(d_{\mathrm{m}}\)). In our implementation, \(d_{\mathrm{v}}\) and \(d_{\mathrm{m}}\) are set to 128. \(\mathrm{head}_i\) is a single attention head and is calculated by:

$$\mathrm{head}_i(Q, K, V) = \mathrm{softmax}\left(\frac{\left(QW_i^{\mathrm{Q}}\right)\left(KW_i^{\mathrm{K}}\right)^T}{\sqrt{d_{\mathrm{k}}}}\right)VW_i^{\mathrm{V}}$$

where \(W^{\mathrm{Q}}\), \(W^{\mathrm{K}}\), \(W^{\mathrm{V}}\) are the projection weights for query, key and value, and \(d_{\mathrm{k}}\) is the embedding dimension of the key, also implemented as 128. During forward propagation, we extract the attention weights for head i, defined as the alignment between query and key:

$$\mathrm{weights}_i(Q, K) = \mathrm{softmax}\left(\frac{\left(QW_i^{\mathrm{Q}}\right)\left(KW_i^{\mathrm{K}}\right)^T}{\sqrt{d_{\mathrm{k}}}}\right)$$

The attention score can be calculated by averaging attention weights across different heads:

$$\mathrm{Attention\;score}(Q, K) = \frac{1}{N}\sum_{i} \mathrm{weights}_i$$

where N = 8 because each layer has eight attention heads. Since the transformer module consists of eight attention layers, for each prediction, we obtained a set of eight attention scores. The attention score is visualized with the BertViz package63.
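The weight extraction and head averaging can be sketched with already-projected queries and keys (shapes and names are illustrative; the model applies its learned projections first):

```python
# Head-averaged attention score from projected queries and keys.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention_score(Q, K, d_k):
    """Q, K: (heads, L, d_k). Return the head-averaged (L, L) attention map."""
    logits = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (heads, L, L)
    weights = softmax(logits, axis=-1)                # rows sum to 1 per head
    return weights.mean(axis=0)                       # average across heads

rng = np.random.default_rng(3)
Q = rng.standard_normal((8, 4, 16))
K = rng.standard_normal((8, 4, 16))
score = attention_score(Q, K, 16)
```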

Impact score

The impact score in the screening experiment is defined as the pixel-wise mean absolute difference between two predictions. Formally, given a prediction S, a 2D contact matrix from the original input, and S′, the prediction from the input perturbed at location x, let \(s_{i,j}\) be the individual pixel in S at position i, j and n be the width/height of S. The impact score of location x is defined as:

$$\mathrm{Impact\;score}(x) = \sum_{i}^{n}\sum_{j}^{n} \frac{\left|s'_{i,j} - s_{i,j}\right|}{n^2}$$
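This is simply the mean absolute difference over all n² pixels:

```python
# Impact score: pixel-wise mean absolute difference between the original
# prediction S and the prediction S' from the perturbed input.
import numpy as np

def impact_score(S: np.ndarray, S_perturbed: np.ndarray) -> float:
    return float(np.abs(S_perturbed - S).mean())

val = impact_score(np.zeros((4, 4)), np.ones((4, 4)))
```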

In silico genetic screen

Typical ChIP–seq profiles have peak widths ranging from a few hundred base pairs to 1 kb. To capture fine-scale regulatory elements, we performed genome-wide ISGS at 1-kb resolution. The screening starts from individual chromosomes with a window size of 2 Mb. Inside this window, a 1-kb perturbation region centered in the 2-Mb window was deleted, the input was padded at the end, and C.Origami prediction was run. For each window, the predictions from the original input and the perturbed input were collected. Once output acquisition was completed for the window, the screening moved to a downstream overlapping window offset by 1 kb from the current window. Since the in silico screening offset equals the perturbation size, this procedure produces a continuous impact score that covers all genomic regions at a resolution of 1 kb. It is worth noting that screening at 1-kb resolution can be computationally intensive. To reduce the computational load, we randomly sampled 10 chromosomes (chr 5, 7, 8, 11, 12, 14, 15, 19, 20, 22) to represent the whole genome and performed 1-kb-resolution screening on the selected chromosomes.

To obtain the most impactful elements from the screening result, we designed a custom peak calling algorithm. We defined the peak score p of a locus as the difference between the maximum and minimum signal within a range of three bins including the locus. We then selected the top 1% of the total screened regions, ranked by peak score, as the cutoff for impactful elements.
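The peak score and top-1% cutoff can be sketched as follows (function names and the edge-padding choice are illustrative, not the authors' exact algorithm):

```python
# Peak score: per-locus (max - min) over the three bins centered at the locus,
# then keep the top 1% of loci by score.
import numpy as np

def peak_scores(signal: np.ndarray) -> np.ndarray:
    padded = np.pad(signal, 1, mode="edge")           # handle chromosome ends
    windows = np.stack([padded[:-2], padded[1:-1], padded[2:]])  # 3-bin windows
    return windows.max(axis=0) - windows.min(axis=0)

def impactful_elements(signal: np.ndarray, top_frac: float = 0.01) -> np.ndarray:
    scores = peak_scores(signal)
    cutoff = np.quantile(scores, 1 - top_frac)        # top-1% threshold
    return np.where(scores >= cutoff)[0]

signal = np.zeros(100)
signal[50] = 5.0
hits = impactful_elements(signal)
```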

To annotate the impactful cis-elements identified by the in silico genetic screen, we compiled a set of genomic annotations including TAD boundary regions, enhancers, promoters, intragenic regions and intergenic regions. The boundary regions were generated by calling TAD boundaries at 10-kb resolution with HiC-bench56, using its TopDom module and connecting adjacent TADs. To increase the robustness of TAD boundary calling, we expanded the boundary width to 50 kb. The promoter region was defined as a 5.5-kb window spanning 5 kb upstream and 500 bp downstream of the gene transcription start site. Enhancers were defined by the H3K4me1 modification, which marks both active and inactive enhancers64. The H3K4me1 ChIP–seq peaks for IMR-90 were downloaded from ENCODE with accession number ENCFF611UWF (https://www.encodeproject.org/files/ENCFF611UWF). To increase robustness, we expanded peaks to have at least 1-kb width.

In silico genetic screen at 2-Mb windows

We conducted an in silico genetic screen at a fixed 2-Mb window without centering the deletion element. We systematically removed segments of 8,192 bp, or 1 bin, from model inputs. To scan through the entire 2-Mb region, we performed 256 deletion experiments at each bin and calculated the prediction difference map before and after deletion. To maintain input shape, we appended 8,192 bp of empty input features to the end.

CRISPR screening for chromatin remodeling genes in T-ALL cell lines

Pooled CRISPR screens across 313 chromatin remodeling genes in CUTLL1 and Jurkat cells were carried out in parallel with our previous pooled screen of RNA-binding proteins in T-ALL cells65. Briefly, for each chromatin remodeling gene, we designed on average 6–8 single guide RNAs (sgRNAs), for a total of 2,500 sgRNAs. The sgRNA sequences were synthesized by Twist Bioscience and cloned into a lentivirus-based sgRNA vector tagged with GFP (Addgene plasmid no. 65656). Cas9-expressing T-ALL cell lines were transduced with the sgRNA library virus at a low multiplicity of infection (MOI, 0.3), followed by infection efficiency assessment through the GFP percentage on day 4 post-transduction. The remaining cells were placed back into culture until day 20 post-transduction.

Cell proliferation was measured by comparing the sgRNA frequencies between day 4 and day 20 cells. Genomic DNA was collected from day 4 and day 20 cells using the Qiagen DNA Purification kit according to the manufacturer’s protocol. The gRNA frequencies in the genomic DNA were amplified and quantified following our previous procedure65. For pooled CRISPR screening analysis, samples at each time point were normalized as sgRNA read count/total read count × 100,000. Normalized reads were then used to calculate the log2fc (as normalized read count day 4/normalized read count day 20) for each gRNA. The fold changes between day 4 and day 20 for each gene were averaged from all CRISPR gRNA targets. P values were calculated via a two-sided t-test comparing the fold changes of all gRNA targets of the same gene with a fold change of 1.

Virtual 4C

The HiC-bench ‘virtual4C’ pipeline56 was used to compute the interactions of each selected viewpoint in a rolling-window fashion. We summed the valid read pairs in a 5-kb area centered at 100-bp bins covering ±2.5 Mb from the viewpoint (50,000 bins per viewpoint). The interactions were normalized by the total number of valid pairs of the sample.

Trans-acting regulator identification in T-ALL cell lines

To connect the differential patterns of cis-elements with trans-acting regulators, we compared the selected cell-type-specific impactful regions by a custom peak calling method, followed by a transcription factor enrichment test for identifying potential trans-acting regulators. We used the transcription factor database from ReMap2022 (ref. 50). To reduce low-quality signals from the ReMap database, we filtered out transcription factor profiles that had fewer than 7,000 hits or that comprised only a single experiment. In total, we collected 612 transcription factor binding profiles for downstream analysis. We used Fisher’s exact test to evaluate the overlap between impactful cis-elements from ISGS and each transcription factor from the database. The test was conducted using the LOLA (Locus Overlap Analysis) package66. For common transcription factors with hit counts larger than 20,000, we downsampled the profiles to 20,000. We calculated the Q value with FDR correction based on the 612 transcription factor profiles tested and used the odds ratio as the main metric to determine the enrichment of each factor in impactful cis-elements.

To compare the contributing trans-acting regulator profiles between different cell types, we first normalized the odds ratios within each cell type. We then performed k-means clustering of transcription factors based on their normalized odds ratios in CUTLL1, Jurkat and T cells, using standard Euclidean distance and six centroids. The clusters were further grouped and visualized using a heatmap.

Intra-TAD activity analysis

Iteratively corrected matrices were re-normalized by dividing each bin value by the sum of all values in the same distance bin in the same chromosome (distance normalization). All TADs identified in the control sample were used as the reference TADs to compute intra-TAD activity changes. The set of reference TADs between the two samples, S1 (control) and S2 (treatment), was denoted as set T. A paired two-sided t-test was performed on each single interaction bin within each reference TAD between the two samples. We also calculated the difference between the average scores of all interaction intensities within such TADs and the TAD interaction log fold change. Finally, a multiple testing correction was performed by calculating the FDR on the total number of TAD pairs tested. The TAD interaction change for each t in T is calculated as follows:

$$\mathrm{TAD\;change}(t) = \frac{\sum_{i}^{I_t} S_{2i}}{\left|I_t\right|} - \frac{\sum_{i}^{I_t} S_{1i}}{\left|I_t\right|}$$

We classified the reference TADs in terms of Loss, Gain or Stable intra-TAD changes by using the following thresholds: FDR < 0.01, absolute TAD interaction log fold change > 0.25 and absolute TAD interaction change > 0.1.
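The per-TAD change reduces to a difference of mean interaction intensities within the reference TAD; a minimal sketch, assuming TADs are given as bin-coordinate intervals:

```python
# Intra-TAD activity change: mean interaction intensity within a reference TAD
# in the treatment matrix S2 minus the same mean in the control matrix S1.
import numpy as np

def tad_change(S1: np.ndarray, S2: np.ndarray, tad) -> float:
    """tad: (start_bin, end_bin) interval; S1, S2: distance-normalized matrices."""
    s, e = tad
    return float(S2[s:e, s:e].mean() - S1[s:e, s:e].mean())

delta = tad_change(np.zeros((10, 10)), np.ones((10, 10)), (2, 6))
```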

Additional software

Additional software used included the following: software: HiC-bench, Seq-N-Slide, MACS2 (v.2.1.1), FitHiC (v.2.0.7); Python packages: PyTorch (v.1.9.0), NumPy (v.1.22.3), pyBigWig (v.0.3.18), scikit-image (v.0.19.3); R packages: EnhancedVolcano (v.3.16).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

