Cell culture and treatments
Normal murine mammary gland NMuMG cells (provided by José Antonio Pintor-Toro, CABIMER, in 2014) were cultured in Dulbecco’s modified Eagle’s medium containing 10% fetal bovine serum (FBS) and 10 μg/ml insulin (complete medium), until reaching 70–80% confluency. HEK293T cells were grown in Dulbecco’s modified Eagle medium (DMEM) with 10% fetal bovine serum (FBS). Human epithelial MCF7 cells were cultured in RPMI medium containing 10% FBS and supplemented with L-glutamine 2 mM. RPE1 cells were grown in DMEM-F12 medium containing 10% FBS and supplemented with L-glutamine 2 mM. All the cell culture media included penicillin (100 U/ml) and streptomycin (100 μg/ml). TGFβ treatments were performed after 6 h of serum starvation, except when indicated. All serum starvation treatments of the NMuMG cells also include deprivation of insulin. In some experiments, TGFβ treatments were performed in normal complete medium (Supplementary Fig. 2b–e) or after 24 or 48 h of serum and insulin starvation (Supplementary Fig. 2f). For TGFβ treatments, 5 ng/ml TGFβ1 diluted in 4 mM HCl, 1 mg/ml BSA (240-B, R&D Systems) or 4 mM HCl 1 mg/ml BSA (vehicle, as control), was added to the medium for the indicated time.
ATAC-sequencing
The ATAC experiments were performed according to ref. 28. NMuMG cells were treated with either vehicle or TGFβ for 2 or 12 h (after 6 h of serum and insulin starvation), vehicle or TGFβ-treated for 10 min (after 6 h of serum and insulin starvation) or vehicle or TGFβ-treated for 2 h (in complete medium); and then collected and treated with transposase Tn5 (Nextera DNA Library Preparation Kit, Illumina). DNA was purified using MinElute PCR Purification Kit (Qiagen). All samples were then amplified by PCR using NEBNextHigh-Fidelity 2× PCR Master Mix (New Englands Labs) with primers containing a barcode to generate libraries. DNA was again purified using MinElute PCR Purification kit and samples were sequenced using Illumina NextSeq 500 system (Illumina) with 75 bp paired-end reads at the Genomic Unit of CABIMER (Sevilla, Spain)
ATAC-see experiments
For ATAC-see experiments, cells were grown on coverslips and processed according to ref. 29. For kinetic analyses, NMuMG cells were changed to serum- and insulin-free medium for 6 h and then were treated with vehicle or TGFβ for 5, 10 min, 1, 2, and 12 h before fixation with 4% paraformaldehyde in phosphate-buffered saline (PBS) for 10 min. For the study of the canonical pathway, cells were transfected with siCT or DsiSMAD4 (IDT, Leuven, Belgium) (previously used in ref. 97), using Lipofectamine RNAiMAX (#13778-150, Thermo Fisher) according to manufacturer instructions. After 48 h, cells were serum- and insulin-depleted for 6 h and treated with vehicle or TGFβ for 10 min before fixation. For the study of the non-canonical pathway, cells were first serum- and insulin-depleted for 4 h. Then, cells were treated with DMSO or SB431542 (S4317, Sigma-Aldrich), U0126 (S1102, Selleckchem) and TAKINIB (SML2216, Sigma-Aldrich) inhibitors at 10 μM for 2 additional hours and finally cells were treated with vehicle or TGFβ for 10 min. Other ATAC-see experiments were performed under the conditions detailed in figure legends. After fixation, cells were permeabilized with lysis buffer (10 mM Tris–HCl, pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.01% Igepal CA-630) for 10 min at room temperature. Next, coverslips were rinsed in PBS twice and put in a humidity chamber box at 37 °C. The transposase mixture solution (25 μl 2× TD buffer, final concentration of 100 nM Tn5-ATTO-59ON, adding dH2O up to 50 μl) was added on the coverslips and incubated for 30 min at 37 °C. As a negative control 50 mM EDTA was added to the Tn5-ATTO590N reaction buffer. After the transposase reaction, coverslips were washed with PBS containing 0.01% sodium dodecyl sulfate (SDS) and 50 mM EDTA for 15 min three times at 55 °C. Then, slides were stained with DAPI for 5 min and mounted using Vectashield for imaging.
DNAseI-seq
DNase I hypersensitivity mapping was performed according to ref. 98 with brief modifications. NMuMG cells were either vehicle or TGFβ-treated for 2 h. The cells were trypsinized and pelleted before washing and resuspension in buffer A (15 mM Tris-Cl (pH 8.0), 15 mM NaCl, 60 mM KCl, 1 mM EDTA (pH 8.0), 0.5 mM EGTA (pH 8.0), 0.5 mM spermidine, 0.15 mM spermine). Nuclei were extracted by adding buffer A containing NP-40. The nuclei were washed with buffer A and resuspended in pre-warmed lysis buffer at a concentration of 5 × 106/ml and then digested with 75 units of DNase I (Roche) for 5 min at 37 °C. The reactions were terminated by the addition of an equal volume of stop buffer and incubated at 55 °C. After 15 min, proteinase K (final concentration of 20 μg/ml) was added to each digestion reaction and incubated for 16 h at 55 °C. DNA was extracted by careful phenol-chloroform purification using phase-lock gel. DNA fragments of 50–300 bp were selected using low-melting agarose gel and was purified using GenElute Gel Extraction Kit (Sigma-Aldrich). DNA was again purified using MinElute PCR Purification kit and samples were sequenced using Illumina NextSeq 500 system (Illumina) with 75 bp paired-end reads at the Genomic Unit of CABIMER (Sevilla, Spain).
RNA sequencing
Total RNA from NMuMG cells, treated with either vehicle or TGFβ for 2 or 12 h, was extracted using the RNeasy kit (74106, QIAGEN). Then, libraries were prepared with the TruSeq Stranded TOTAL RNA kit (Illumina). The sequencing was performed with a HiSeq 2000 system (Illumina) with 50 bp single-end reads at the Genomics Core Facility (EMBL Heidelberg).
ChromRNA sequencing
Cellular fractionation was carried out according to ref. 99, based in the protocol described in ref. 100. Cell pellets were resuspended in 400 μl cold cytoplasmic lysis buffer (0.15% NP-40, 10 mM Tris pH 7.5, 150 mM NaCl) and incubated on ice for 5 min. The lysates were layered onto 1 ml cold sucrose buffer (10 mM Tris pH 7.5, 150 mM NaCl, 24% sucrose w/v), and centrifuged in microfuge tubes at 3500 g for 10 min. The nuclear pellets were gently resuspended into 250 μl cold glycerol buffer (20 mM Tris pH 7.9, 75 mM NaCl, 0.5 mM EDTA, 50% glycerol). An additional 250 μl of cold nuclei lysis buffer (20 mM HEPES pH 7.6, 7.5 mM MgCl2, 0.2 mM EDTA, 0.3 M NaCl, 1 M urea, 1% NP-40, 1 mM DTT) was added to the samples, followed by a pulsed vortexing and incubation on ice for 2 min. Samples were then spun in microfuge tubes for 2 min at 13,000 × g. Fifty microliters of cold PBS was added to the remaining chromatin pellet, and gently pipetted up and down over the pellet, followed by a brief vortex. Then RNA was extracted using TRIZOL and treated with TURBO DNA-free kit (AM1907, Invitrogen) for DNA removal according to manufacturer’s instructions. Libraries preparation and sequencing was as described in the RNA sequencing section.
ChIP sequencing
ChIP assays were performed according to ref. 101 with some modifications. Briefly, NMuMG cells treated with either vehicle for 2 h or TGFβ for 2 or 12 h were crosslinked in 1% formaldehyde for 10 min at room temperature followed by addition of glycine (125 mM final concentration) for 5 min. Nuclei were isolated using lysis buffer 1 (5 mM Pipes pH 8, 85 mM KCl, 0.5% NP-40, and complete protease inhibitor cocktail (Roche)) and were lysed using lysis buffer 2 (1% SDS, 10 mM EDTA, 50 mM Tris–HCl pH 8.1, and complete protease inhibitor cocktail (Roche)). Chromatin was sheared into an average size of 500 bp by eight pulses of 30 s (30 s pause between pulses) at 4 °C in the water bath sonicator Bioruptor (Diagenode, Liège, Belgium). Thirty micrograms of chromatin were diluted 1:10 in IP buffer (0.01% SDS, 1.1% Triton X-100, 1.2 mM EDTA, 16.7 mM Tris–HCl pH 8.1, 167 mM NaCl, 1% sodium deoxycholate) and incubated overnight at 4 °C in rotation with 3 µg of the respective antibodies: anti-H3K27ac (Ab4729, Abcam), anti-H3K4me1 (Ab8895, Abcam) and anti-K3m4me3 (Ab8580, Abcam). Immunoprecipitations were incubated for 2 h at 4 °C in rotation with protein A Dynabeads (Invitrogen) and then washed with wash buffer 1 (0.1% SDS, 1% Triton X-100, 2 mM EDTA, 20 mM Tris–HCl pH 8.1, and 150 mM NaCl, 1% sodium deoxycholate), wash buffer 2 (0.1% SDS, 1% Triton X-100, 2 mM EDTA, 20 mM Tris–HCl pH 8.1, 500 mM NaCl, 1% sodium deoxycholate), wash buffer 3 (0.25 M LiCl, 1% NP-40, 1% sodium deoxycholate, 1 mM EDTA, 10 mM Tris–HCl pH 8.1), and twice with TE1x buffer (10 mM Tris–HCl pH 8.0, 1 mM EDTA). The complexes were eluted twice from the beads with elution buffer (1% SDS in TE1x buffer) by incubating 10 min at 65 °C. The eluates and the inputs were incubated overnight at 65 °C for de-crosslinking and treated with Proteinase K for 1 h at 37 °C. Chipped DNA was purified with phenol:chloroform extraction followed by ethanol precipitation and resuspended in miliQ water. The sequencing was performed with the HiSeq 2000 system (Illumina) with 50 bp single-end reads at the Genomics Core Facility (EMBL Heidelberg).
ChIP qPCR
ChIP assays were performed as described above for ChIP sequencing, using anti-H3K27ac (Ab4729, Abcam) and anti-CRIPSR-Cas9 (C15310258-100, Diagenode). Rabbit IgG (Sigma) was used as a control for non-specific interactions. Chipped DNA was purified using ChIP DNA Clean & Concentrator kit (Zymo Research), and eluted in 50 µl of elution buffer. Input was prepared with 10% of the chromatin material used for immunoprecipation. Input material was diluted 1:5 before PCR amplification. Quantification of immunoprecipitated DNA was performed by real-time PCR (qPCR) with the Applied Biosystems 7500 FAST real-time PCR system, using Applied Biosystems Power SYBR green master mix. Sequences of all oligonucleotides are listed on Supplementary Data 4. Data are the average of two or three biological independent replicates and three technical replicates of each biological replicate.
Western blotting
Western blotting were performed according to ref. 102, using the following antibodies: anti-cFos (2250, Cell Signaling. Dilution 1/1000), anti-cJun (60A8; 9165, Cell Signaling. Dilution 1/1000), anti-alfa-Tubulin (DM1A; T9026, Sigma. Dilution 1/1000), anti-Erk1/2 (P28482, Millipore. Dilution 1/1000), and anti-Phospho-Erk1/2 (thr202/Tyr204) (9101, Cell Signaling. Dilution 1/1000), anti-SMAD4 (SC-7996, Santa Cruz. Dilution 1/250). As secondary antibodies anti-mouse-HRP and anti-rabbit-HRP (Sigma-Aldrich. Dilution 1/1000) were used.
CRISPR interference
For enhancer inactivation we used CRISPRi technology described in ref. 47. First the NMuMG-dCAS9 cell line was established as follows. For lentiviral particles preparation 2 × 106 HEK293T cells were transfected with Fugene 6 (Promega) using 7.5 μg of the transfer vector pHR-SFFV-dCas9-BFP-KRAB (this plasmid was a gift from S. Qi & J. Weissman (Addgene plasmid # 46911; http://n2t.net/addgene:46911; RRID:Addgene_46911)47 with 5 and 2.5 μg of the packaging plasmids pCMVDR8.91 and pVSVG, respectively. Lentiviruses were harvested 72 h post-transfection, passed through a 0.45-μm filter and concentrated by ultracentrifugation at 100,000 × g for 90 min. Virus particles were resuspended in DMEM, snap frozen in liquid nitrogen and stored at −80 °C. After tittering, NMuMG cells were infected with pHR-SFFV-dCas9-BFP-KRAB lentiviruses using polybrene 8 μg/ml. Single BFP-positive cells were sorted 72 h after infection into individual wells of a 96-well plate, containing a mixture of fresh and conditioned medium (1:1), and then one clone was chosen after BFP signal and EMT functional validation. Then, NMuMG cell pools expressing specific sgRNAs were established as follows. To design sgRNAs we selected neighboring region for ATAC-seq peak submit and use IDT (https://eu.idtdna.com/site/order/designtool/index/CRISPR_CUSTOM) and CRISPOR (http://crispor.tefor.net/) tools. Designed sgRNA for selected enhancers (Supplementary Data 4) were inserted into sgRNA expressing lentiviral vectors pU6-sgGAL4-1 (this plasmid was a gift from S. Qi & J. Weissman (Addgene plasmid # 46915; http://n2t.net/addgene:46915; RRID:Addgene_46915)47. Then lentiviral particles were produced and purified as above. After tittering, NMuMG-dCAS9 cells were infected with sgRNA expressing lentiviruses and selected with puromycin 1 μg/mL for one week before working with the selected pool.
Determination of mRNA by quantitative reverse transcription PCR (RT-qPCR)
Total RNA was prepared by using the RNeasy Kit (Qiagen), as described in the manufacturer’s instructions, including DNase I digestion to avoid potential DNA contamination. Complementary DNA (cDNA) was generated from 3 μg of total RNA using SuperScript First Strand Synthesis System (Invitrogen). Two 2 µl of generated cDNA solution was used as a template for real-time PCR (qPCR). Gene products were quantified by qPCR with the Applied Biosystems 7500 FAST Real-Time PCR System, using Applied Biosystems Power SYBR Green Master Mix. Values were normalized to the expression of the Gapdh housekeeping gene. At least three biological replicates and three technical determinations were performed in each case. All oligonucleotide sequences used are listed in Supplementary Data 4.
Computational methods and statistical analysis
Most of analyses were performed using R (v3.4.4), Rstudio (v0.99.879) and Bioconductor (v2.38).
Statement on the use of the mm9 assembly
We used the GRCm37 (mm9) assembly to map all sequencing reads from mouse origin in this study (ATAC-seq, RNA-seq, ChromRNA-seq, DNaseI-seq and ChIP-seq) because many programs did not support yet the mm10 build when we started our study. As our study compares samples across different treatments and does not perform absolute analyses, re-aligning the reads to GRCm38 (mm10) should not significantly affect our conclusions.
Processing of ATAC-seq data
ATAC-seq data were processed (trimmed, aligned, filtered, and quality controlled) for each condition independently, using the ATAC-seq pipeline from the Kundaje lab (https://github.com/ENCODE-DCC/atac-seq-pipeline). Information about read depth can be found in the reporting summary. The model-based analysis of ChIP-seq macs2 (v2.1.2) was used to identify the peak regions with options -B, -q 0.01–nomodel, -f BAM, -g mm9. The Irreproducible Discovery Rate (IDR) method was used to identify reproducible peaks between two biological replicates. Only peaks reproducible between the two biological replicates (with IDR ≤ 0.05) were retained for downstream analyses. Peaks for all conditions of each experiment were then merged together into a standard peak list. For the experiment described in Figs. 1a and 2a, 52,626, 100,459, and 105,334 peaks were identified for vehicle, TGFβ2h and TGFβ12h conditions, respectively. Then sets of peaks for each condition were refined. All peaks smaller than 150 bp were discarded, and all peaks closer than 1000 bp were merged together using csaw (v1.12.0)103, obtaining 16,769, 32,545, and 33,077 peaks for each condition. Then, peaks for the three different conditions closer than 1000 bp were combined in a single set of peaks obtaining 39,432 unique regions. Similar procedures were used for the experiment described in Fig. 2b and Supplementary Fig. 2c–e. In all the cases summits were re-calculated using MACS2 refinepeak with default options. Peaks coordinates for downstream analysis were calculated as summit ± 500 bp. For enhancer assignment, those peaks that overlap with TSS ± 1000 bp of any transcript from mm9 UCSC KnownGene, were discarded using bedtools (v.2.27.1) subtract with -A options, obtaining 29,071 peaks, which were considered as putative enhancers. Correlation between replicates were calculated using multiBamSummary and plotCorrelation from deeptools (v3.1.3) using default options. Shushi package (v.1.16.0) was used for data visualization.
Processing of DNaseI-seq data
DNaseI-seq data was processed following the same pipeline as ATAC-seq data but using specific options for DnaseI-seq. Information about read depth can be found in the reporting summary.
Classification of enhancers according to increase of accessibility
To classify enhancers according to the increase of accessibility, reads were counted in full-length of ATAC-seq peaks using regionCounts() from csaw with minq = 40 option. To normalize data, first we count reads in full genome using windowCounts() with bin = TRUE, width = 2000 and minq = 40 options; then normalization factors were calculated using normOffsets(). After that, estimateDisp(), glmQLFit() and glmLRT() from edgeR package (v3.20.9) was used for calculating log2FC and statistics. Enhancers were then classified into four groups: FC < 1.5, 1.5≤FC < 2, 2≤FC < 4 and FC ≥ 4, obtaining 4810, 8926, 14,222, and 1113 for TGFβ2h vs veh comparison; and 3988, 7010, 15,362, and 2771 for TGFβ12h vs. veh. EnrichedHeatmap (v1.12.0) package was used for drawing heatmaps. In Fig. 2a rows were ordered according to decreasing ATAC-seq cpm for TGFβ2h or TGFβ12h (merged replicates). Differential ATAC-seq signal analysis (Supplementary Fig. 1b) was also performed in full-length ATAC peaks overlapping TSS (promoters), putative enhancers and the whole genome outside ATAC-seq peaks (rest) using regionCounts() with same parameters as above.
Analysis of ATAC-see using TANGO
Confocal microscopy images were analyzed using TANGO30 plugin from imageJ104. To determine nuclei area we used standard protocol for segmentation and post-filters obtaining 203 nuclei for vehicle condition and 194 for TGFβ2h from two independent experiments. To determine open-chromatin domains we used as pre-filters standard fast filters 3D and LoG 3D(BIG) filter. Then 3D spot detector was used for segmentation and region adjustment and merged regions were finally used as post-filters. Standard measure geometrical simple and signal quantification were performed to analyze open-chromatin domains features and signal distribution.
ATAC-see signal quantification
Fluorescence microscopy images were analyzed using an imageJ104 macro as follow: first images were splitted by color. Then nuclei were segmented on DAPI max projection previous preprocessing and size-selection. ATTO590 signal was quantified in max projection images in nucleus-exclusive area after background subtraction. All experiments were median-normalized. Three independent experiments were performed. Number of nuclei quantified (n) per condition including the three replicates are provided in Supplementary Data 5.
Motif-binding analysis with CentriMo
Motif-binding analyses was performed with CentriMo (v5.0.2)105 from MEME suite against all 579 binding motifs from JASPAR vertebrates106 with default options. CentriMo perform Central Motif Enrichment Analysis (CMEA), which better identified binding motifs involved in the regulation of Chip-seq or ATAC-seq peaks. Position Weigh Matrixes (PWM) for SNAI1 and SMAD 5GC107 binding motifs were also constructed and included in the analysis. For each category, enhancer sequences were shuffled five times for obtaining 5x random sequences with the same %CG content using scrambleFasta.pl from HOMER (v4.10.3) suite, and use that set of sequences as a background.
Footprinting analysis
To identify footprints in enhancer sequences wellington_footprints.py script from pyDNase suite (v0.2.4)108 was used with -A option. Footprints were detected in TGFβ2h and TGFβ12h ATAC-seq data. Both ATAC-seq replicates for each condition were merged together for increasing the strength of footprint identification. For footprints enrichment analyses in different categories of chromatin accessibility first scrambleFasta.pl was used for obtaining 5x background sequences for each category. Then we look for motif-binding sites in real footprints and background sequences with FIMO (v5.0.0)109 using default option and p-value < 0.0001 as cutoff. Enrichment was calculated as:
$${\mathrm{Enr}} = \left( {{\mathrm{footprints}}\left[ {{\mathrm{motif}}} \right]/{\mathrm{footprints}}\left[ {{\mathrm{total}}} \right]} \right)/\left( {{\mathrm{footprints}}\_{\mathrm{bg}}\left[ {{\mathrm{motif}}} \right]/{\mathrm{footprints}}\_{\mathrm{bg}}\left[ {{\mathrm{total}}} \right]} \right)$$
(1)
where footprints[motif] is the number of footprints with a specific motif-binding sequence; footprints[total] is the number of total footprints; footprints_bg[motif] is the number of footprints with a specific motif-binding sequence in the background set of sequences; and footprints_bg[total] is the number of total footprints in the background set of sequences. p-values were calculated using fisher.test() from R stats package. Enrichment heatmap was plotted using gitools (v2.3.1) including most representatives motives.
Measurement of Tn5 integrations was calculated in Bioconductor using GenomicAlignments (v.1.20.1), Biostrings (v2.52.0) and soGGi (v.1.16.0) packages. Briefly, all ATAC-seq fragment smaller than 100 bp (nucleosome-free regions) were extracted from TGFβ2h BAM file. Then we resized our reads to 1 bp and make the shift of 4 or −5 bp depending on strand to adjust for expected shift from insertion of Tn5 transposase to produce cut-sites.
Footprinting analysis in regulated enhancers was performed as above but using as a category background all enhancers sequences that are not included in that category. TGFβ2h ATAC-seq data were used for footprints identification in early categories of regulated enhancers, and TGFβ12h for late categories.
SMAD2/3 ChIP-seq analysis
ChIP-seq data for SMAD2/3 in NMuMG cells were downloaded from GSE12125436, and aligned against mm9 mouse reference genome using align() function from Rsubread package (v1.28.1) with type = 1, TH1 = 2 and unique = TRUE parameters. Peak calling was performed using macs2 callpeak using input as control file and default options. Then, peaks were filtered using −log10(FDR) ≥ 5 criteria and those peaks that overlap with blacklisted region of the genome were removed using bedtools subtract with -A options resulting in 8401 peaks. Peaks that overlap with the TSS ± 1000 bp of any transcript from mm9 UCSC KnownGene were discarded obtaining 7503 peaks, of which 5906 overlap with our enhancers. Overlapping analysis between SMAD2/3 bound enhancers and enhancers classified according to its increase in chromatin accessibility at TGFβ2h and/or AP-1 footprints were performed using vecsets (v1.2.1) package. p-values were calculated using dhyper() function from stats R package. ATAC-seq signal density graphs (Supplementary Fig. 3f) were plotted for TGFβ2h using computeMatrix reference-point from deeptools with -a 2000 -b 2000 options in different sets of regions: AP-1 footprinted enhancers, SMAD2/3 bound enhancers and both (enhancers with AP-1 footprints and SMAD2/3 binding).
RNA-seq analysis
Two independent biological replicates for each condition of RNA-seq data were primarily filtered using the FASTQ Toolkit v1.0.0 program. Data were aligned using subjunc function from Rsubread package, to map reads to the mm9 mouse reference genome using TH1 = 2 and unique=TRUE parameters. The downstream analysis was performed on bamfiles with duplicates removed using the samtools (v0.1.19) rmdup command. FeatureCounts() function from Rsubread package was used to assign reads to UCSC mm9 KnownGenes (miRNAs were discarded from the analysis) using GTF.featureType = “exon”, GTF.attrType = “gene_id” and strandSpecific = 2 parameters on duplicate removed bamfiles. Then differential gene expression analysis was performed using the voom/limma (v.3.34.9) and edgeR (v.3.20.9) Bioconductor packages110. Genes that were expressed at ≥2 counts per million counted reads in ≥2 replicates were analyzed, resulting in 11081 genes out of 21240 UCSC KnownGene. CalcNormFactors() function using TMM method was used to normalize samples. Differentially expressed genes with p-value<0.05 (empirical BAYES moderated t-statistics test with adjustments for multiple comparison.) were classified according to:
Early-downregulated: log2FC(TGFβ2h_vs_veh) < −1 and not log2FC(TGFβ12h_vs_TGFβ2h) > 1
Early-upregulated: log2FC(TGFβ2h_vs_veh) > 1 and not log2FC(TGFβ12h_vs_TGFβ2h) < −1
Late-downregulated: log2FC(TGFβ12h_vs_veh) < −1 and not log2FC(TGFβ2h_vs_veh) < −1
Late-upregulated: log2FC(TGFβ12h_vs_veh) > 1 and not log2FC(TGFβ2h_vs_veh) > 1
Transient-downregulated: log2FC(TGFβ2h_vs_veh) < −1 and log2FC(TGFβ12h_vs_TGFβ2h) < −1
Transient-upregulated: log2FC(TGFβ2h_vs_veh) > 1 and log2FC(TGFβ12h_vs_TGFβ2h) > 1
TGFβ-independent: expressed genes that are not included in any of the categories above.
Because of low number of genes in transient categories, they were not taken into account for downstream analyses. For MA plots (Supplementary Fig. 4a), log2FC were plotted against average expression (log2CPM) for all samples for each comparison. For boxplots of Supplementary Fig. 4b, distribution of log2CPM of the two independent replicates were plotted. For expression heatmap of Fig. 2h log2FC were plotted for selected genes using gitools.
Histone ChIP-seq analysis
Two biological replicates of the ChIP-seq data for each condition and for each histone modification (H3K27ac, H3K4me1, and H3K4me3) were primarily filtered using the FASTQ Toolkit program. Data were aligned using align function from Rsubread package, to map reads to the mm9 mouse reference genome using type = 1, TH1 = 2 and unique = TRUE parameters. The downstream analysis was performed on bamfiles with duplicates removed using the samtools rmdup command. Information about read depth can be found in the reporting summary. Shushi package (v.1.16.0) was used for data visualization.
Enhancer classification
For creating the NMuMG cells enhancer repertoire first H3K27ac and H3K4me1 signal was measured by counting reads in full-length ATAC-seq peaks (29071 non-TSS peaks) using regionCounts() from csaw package. For H3K4me1, coordinates were increased ±500 bp because H3K4me1 showed a broad distribution. Then, CPM and RPKM were calculated using all counted reads in ATAC-seq peaks as library size and combining replicates. Peaks whose RPKM were larger than percentile 25 of all RPKM for H3K27ac or 10 for H4K4me1 were considered as having the mark. Enhancers without H3K4me1 nor H3K27ac in vehicle condition but that acquire these marks with TGFβ treatment were classified as latent. Enhancers with H3K4me1 but no H3K27ac were classified as primed and those with both marks, as active. To classify enhancers according to its regulation by TGFβ, we analyze H3K27ac dynamics in vehicle, 2 h and 12 h TGFβ-treated ChIP-seq data, considering both replicates independently and using a pipeline, including csaw and edgeR (v3.20.9) packages. Reads were counted in full-length ATAC-seq peaks as above and normalization factors were calculated as normOffset() function with option type = “loess” parameter. For differential binding analysis scaleOffset(), estimateDisp(), glmQLFit() and glmLRT() functions was used. Then enhancers were classified in early-activated, early-repressed, late-activated and late-repressed, following same criteria as differentially expressed genes. For each comparison we also consider a p-value < 0.05 and average log2CPM > 0. Because of low number of enhancers in transient categories, they weren’t taking into account for downstream analyses.
EnrichedHeatmap package was used for drawing heatmaps. For Fig. 3a, rows were ordered according to decreasing H3K27ac CPM for TGFβ2h and all enhancers were plotted as summit ±500 bp increasing ±1500 bp from start and end. Boxplots of Supplementary Fig. 5, represents average CPM of the two replicates for each condition. Profiles screen-shot was constructed using sushi package (v1.16.0) on bedgraph constructed by bamCoverage from deeptools previous merging of replicates with samtools merge.
ChromRNA-seq
Two biological replicates for each condition of ChromRNA-seq data were primarily filtered using the FASTQ Toolkit program. Data were aligned using subjunc function from Rsubread package, to map reads to the mm9 mouse reference genome using TH1 = 2 and unique = TRUE parameters. The downstream analysis was performed on bamfiles with duplicates removed using the samtools rmdup command. As short size of eRNAs, ChromRNA-seq reads were counted in summit ±200 bp for intergenic enhancers using regionCounts(). From 29,071 enhancers, those which overlap with any UCSC KnownGene transcript bodies (5 kb extended from the Transcription termination site (TTS) to avoid termination read-through reads) were discarded, resulting in 12857 intergenic enhancers. After read counting, we calculated CPM of eRNA for each eRNA-expressing enhancer. Correlation between H3K27ac levels and eRNAs levels shown in Supplementary Fig. 6b was calculated using H3K27ac levels of the 12857 enhancers where eRNAs were counted. For correlation between increase in H3K27ac and increase in eRNA shown in Supplementary Fig. 6c, d, first we filtered by expression, obtaining 3636 eRNA-expressing intergenic enhancers. Then, differential expression analysis using csaw and edgeR was computed.
Analysis of the location of TGFβ-regulated genes with respect to TGFβ-regulated enhancers
To study the regulation of genes in the neighborhood of TGFβ-regulated enhancers, first we plotted log2FC distribution of genes adjacent to regulated enhancers. To determine position of genes with respect to enhancers, distances between gene TSSs and enhancers ATAC-seq peaks summit were calculated. Then genes were grouped according to: gene position with respect to the enhancers (upstream (-) or downstream(+)) and gene order with respect to the enhancer (1st to 4th). According to this classification, genes were called: enhancer-gene ± n, Eg ± n, with n = 1, 2, 3, and 4. Non-expressed genes were not taken into account for analysis. Randomization of enhancers was performed by shifting or by random sampling. For shifting randomization, lists of numbered enhancers for each recircularized chromosome were constructed and random enhancers were obtained by shifting enhancers 50 positions. So, if the regulated enhancer in at position n, the random enhancer is at position n + 50. For sampling randomization, all enhancers of each chromosome were randomly sampled. We obtained similar results with both randomization methods, however, we choose shifting method for considering it more conservative with genome structure. To study the dependency on enhancer-gene distance, we grouped genes located in 50 kb non-overlapping distance intervals (i.e., from summit to 50 kb, from 50 kb to 100 kb… upstream and downstream until 250 kb) from TGFβ-regulated enhancers of each enhancer category. The same was done for randomized enhancers. To study the correlation between change of activity of enhancers (measured as log2FC(H3K27ac)) and differential expression of the closest genes (log2FC), pairs of enhancer-closest gene were ordered according to increasing log2FC(H3K27ac) and grouped into 20 bins. Then, least squares fit analysis were performed to obtain r and slope. To study if enhancers operate in a continuous manner, first we subset those enhancers with one of their neighborhood genes (for example, Eg+2) robustly regulated (|log2FC | > 1 and adjusted p-value < 0.05). Then, log2FC distribution of the rest of genes in the neighborhood region (Eg-2, Eg-1, and Eg+1) were plotted. We only tested until Eg-2 and Eg+2 because of the small number of enhancers whose 3rd or 4th closest upstream or downstream gene is robustly regulated.
Identification of clusters of co-regulated genes
To identify clusters of co-regulated genes, first we sorted all genes according to 5ʹ–>3ʹ chromosomal order using gene TSS coordinates and we looked for robustly regulated pairs of genes from the same category (early-upregulated, early-downregulated, late-upregulated, or late-downregulated) that are strictly contiguous. To determine if the number of co-regulated pairs of genes was higher than expected by chance, we computed a random distribution of this score by randomly sampling a number of genes identical to the number of robustly regulated genes for each category and looking for pairs of contiguous genes. This process was iterated 5000 times, obtaining a distribution of random pairs of co-regulated genes. Probability of obtaining the real number of pairs of co-regulated genes given the random distributions was obtained using (https://keisan.casio.com/exec/system/1180573188).
Analysis of gene co-regulation
Similar to previous analysis, we numbered genes according to 5ʹ–>3ʹ chromosomal order using gene TSS coordinates (1, 2, …, n). Then, for every robustly TGFβ-regulated gene (|log2FC | > 1 and adjusted p-value < 0.05), log2FC of genes n ± k with k = 1, 2, 3, or 4, were plotted. Non-expressed genes were not taken into account for analysis. Random genes lists, per chromosome, were obtained by random sampling or by shifting 50 positions, as described above, with similar results. To study the effect of distance on co-regulation, we grouped genes into non-overlapping 50 kb distance intervals from robustly regulated gene TSSs (i.e., from TSS to 50 kb, from 50 kb to 100 kb… until 250 kb upstream and downstream) for each category and log2FC distributions were plotted. Similar analysis was done for genes randomized by shifting or by sampling with similar results.
Study of the influence of co-regulation in intergenic transcription
To study the influence of co-regulation in intergenic transcription we determined the ChromRNA-seq signal in the neighborhood of TGFβ-regulated genes, previously all UCSC KnownGene transcript bodies (2 kb extended from the TSS and TTS to avoid termination read-through reads and divergent promoter transcription) were removed. Then, we counted reads of 250 kb (divided into 10 kb non-overlapping bins) upstream and downstream from a TGFβ robustly regulated gene TSS. For that, we used region.counts(). windowCounts() with bin = TRUE and width = 10,000 options. NormOffsets() was used to calculate normalization factors in libraries. Then we performed differential analysis bin to bin using estimateDisp() and glmQLFit() functions to obtain log2FC. Similar analysis was performed using random genes (as described above) instead of TGFβ-regulated genes. Finally, log2FC of random genes was subtracted from log2FC of TGFβ-regulated genes, bin to bin, to obtain log2FC over random values that were finally plotted.
Study of the influence of co-regulation in histone modifications
log2FC of H3K27ac, H3K4me3, and H3K4me1 ChIP-seq signal in the neighborhood of TGFβ-regulated genes was performed as in the previous section. In this case 5 kb upstream and and downstream of the regulated gene TSS (or random gene) were removed, in order to eliminate promoter signal of the regulated gene. UCSC KnownGene transcript bodies coordinates were not removed from this analysis. Differential analysis was performed as in the previous section.
Identification of TGFβ-regulatory domains
To identify TGFβ-regulatory domains (TRD) we identified clusters of co-regulated genes, as above, using a different threshold of differentially expressed genes (|log2FC | > 0.5 and p-value < 0.05) to include genes that are significantly regulated with TGFβ, although not robustly affected. Metaplots were created combining late and early TRDs for upregulation and downregulation and using computeMatrix scale-regions with -m 200,000 -bs 4000 -a 10,000 and -b 100,000 options. -bl option was used for calculating intergenic transcription (ChromRNA-seq) to avoid counting in all UCSC knownGene transcript bodies.
Analysis of the relationship between TRDs and chromatin 3D organization
We used TADs coordinates from NMuMG cells after 8 h of TGFβ treatment, published in ref. 49 after conversion to mm9 mouse gene assembly using UCSC liftover tool. Random TADs lists for each chromosome were obtained by recirculating the chromosome and shifting TAD borders by 10 Mb. To determine within TAD co-regulation, first TADs with less than three expressed genes were removed. Then, TADs were classified according to its percentage of expressed upregulated/downregulated genes (log2FC > 0 or log2FC < 0). For randomization, we entirely randomized positions of genes within TADs, conserving the same number of genes per TAD. Five hundred iterations were performed to obtain random distributions, for statistical analysis. Observed/expected plot was created using the ratio real over the mean of the random distribution.
Analysis of the role of TAD boundaries on gene co-regulation and enhancer-gene regulation
To determine the expression correlation between two strictly contiguous genes (x, y), we sorted genes according to their 5ʹ–>3ʹ chromosomal order using gene TSSs and then we obtain all expression log2FC pairs (log2FC(x), log2FC(y)) for all expressed genes, obtaining 6265 pairs. Then genes were sorted according to log2FC(x) and grouped into 20 bins. For random plots we randomized position of all expressed genes and proceed in a similar way. To analyze the effect of TAD borders we extracted pairs (log2FC(x), log2FC(y)), where x and y genes were separated by a TAD border, obtaining 393 pairs. Then genes were sorted according to log2FC(x) and grouped into 20 bins.
To study the effect of TAD borders in enhancer-gene regulation, enhancer-closest gene pairs separated by a TAD border were selected (1230 pairs). Then we extracted log2FC(H3K27ac) of the enhancer and log2FC(expression) of its closest gene that is separated by a TAD border, sorted by log2FC(expression) and grouped into 20 bins. As random borders, we used the borders of random TADs that we describe in the previous section.
Processing Hi-C data
NMuMG TGFβ-treated for 8 h HiC data was obtained from GSE96033. Paired-end reads were aligned to the mm9 using bowtie2 (global parameters: –very-sensitive –L 30 –score-min L,−0.6,−0.2 –end-to-end–reorder; local parameters:–very-sensitive –L 20 –scoremin L,−0.6,−0.2 –end-to-end–reorder) through HiC-Pro software (v2.11.1)111. Unmapped reads, non-uniquely mapped reads and PCR duplicates were filtered and uniquely aligned reads were paired. Genome fragment distribution was obtained for DpnII restriction enzyme using as ligation sequence GATCGATC. Raw and ICE-normalized cis-contact matrices were assembled by binning paired reads into uniform 10-kb bins. HiTC Bioconductor package (v1.30.0) was used to analyze HiC interactions matrices. To determine intra-TRD interactions, interactions from raw and ICE-normalized matrices for TRD or random TRD coordinates were obtained using extractRegion() function. Random TRD coordinates were calculated inverting the start coordinate for each TRD and keeping the same length.
To analyze enhancer-promoter interactions first, we retrieve bins corresponding to TSS and enhancer midpoint coordinates. Then we obtained the interaction for each enhancer-promoter pair for raw matrix and grouped then according to: if the interaction occurs within the TRD (enhancer and promoter are located in the same TRD) or outside the TRD at different windows (enhancer is located within the TRD and genes are located at 0–100 kb, 100–250 kb, and 250–500 kb from the TRD upstream and downstream boundaries).
Directionality Index (DI) was calculated for each bin according to ref. 50:
$${\mathrm{DI}} = \left( {\frac{{B - A}}{{\left| {B - A} \right|}}} \right) \cdot \left( {\frac{{\left( {A - E} \right)^2}}{E}} \right. + \left. {\frac{{(B - E)^2}}{E}} \right),$$
(2)
where A is the number of reads that map from a given 10 kb bin to the upstream 250 kb, B is the number of reads that map from the same 10 kb bin to the downstream 250 kb, and E, the expected number of reads under the null hypothesis, is equal to (A + B)/2. Then, directionality index metaplots were constructed for TADs and TRDs using proportional number of bin for each structure: 100 bins for TADs (as average size of TAD is 1 Mb) and 10 bins for TRDs. DI in boundaries was also plotted for TADs and TRDs using surrounding 100 bins (50 upstream and 50 downstream) and 30 bins respectively.
CTCF metaplot analysis
First, CTCF ChIP-seq data for mouse mammary gland tissue was downloaded from GSE74826 and was processed as described above for SMAD2/3 ChIP-seq data, finding 35360 CTCF peaks. Then we looked for motives in those peaks using fimo what allowed us to divide CTCF peaks according to its strand orientation: 17,800 in the forward and 17,850 in the reverse orientation. To identify CTCF footprints across the genome using our ATAC-seq data PIQ software112 was used. CTCF density metaplots was constructed using ChIP-seq and footprints information in TADs and TRDs using proportional number of bin for each structure: 100 bins for TADs (as average size of TAD is 1 Mb) and 10 bins for TRDs. CTCF density in boundaries was also plotted for TADs and TRDs using surrounding 50 bins (25 upstream and 25 downstream).
Statistical analysis
Statistical and graphical data analyses were performed using either Prism 5 (Graphpad) software or R package. To determine the significance between two groups, comparisons were made using two-tailed Mann–Whitney non-parametric test by using wilcox.text() function form Stats R package. For correlation, the non-parametric Spearman coefficient was calculated and significance was calculated using Fisher exact test. Significance of enrichments were calculated with the Fisher exact test using fisher.test() function from R. Probabilities of overlapping were calculated using the hypergeometric distribution using dhyper() function from the stats R package. Other statistical methods are described above. The horizontal black line of the boxplot represents the median value, the box spans the 25th and 75th percentiles, and whiskers indicate 5th and 95th percentiles.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Comments
Something to say?
Log in or Sign up for free