MUNIn (Multiple tissue UNifying long-range chromatin Interaction detector): a statistical framework for identifying long-range chromatin interactions from multiple tissues
What You Need
Installation
After downloading MUNIn_1.0.tar.gz into a chosen local folder "local_path",
1. Unzip the file MUNIn_1.0.tar.gz. You will get a C++ executable program MUNIn, a folder Example containing toy data, and a folder MUNIn_outputs containing the output results of MUNIn.
2. Copy the C++ executable program MUNIn to any folder and it is ready to use.
How to Run
In this example, we use the TAD on chromosome 1 from 6595000 bp to 7965000 bp (denoted as "TAD_6595000_7965000") from two tissues, the cortical and subcortical plate (CP) and the germinal zone (GZ), at 10 kb resolution (Won et al., Nature, 2016).
1. For each tissue, we start from the HiC contact matrix and calculate the expected frequency using a modified version of Fit-Hi-C, which can be downloaded from here. The command interface of our utility software is exactly the same as that of Fit-Hi-C. Please refer to Fit-Hi-C for more details at https://noble.gs.washington.edu/proj/fit-hi-c/.
2. We first perform peak calling in each tissue. To conduct peak calling, users need to prepare the HiC data file for HiC_HMRF_Bayes_Files to load, which is a tab-delimited text file with 5 columns, respectively as middle point of fragment 1, middle point of fragment 2, observed frequency, expected frequency and p-value estimated by Fit-Hi-C (see "Input formats for HMRF").
The 8 required command parameters are:
- -I, HiC input data file, which is a text file with 5 columns respectively as middle point of fragment 1, middle point of fragment 2, observed frequency, expected frequency and p-value estimated by Fit-Hi-C. The example file for CP is CP_1_6595000_7965000.txt.
- -NP, size of HiC contact matrix.
- -Tune, .
- -NG, number of Gibbs samples.
- -Bininitial, the middle point of the first fragment (the smallest value of fragment 1).
- -Binsize, fragment length.
- -SEED, seed of the random number generator. Setting the seed to a fixed value can make the results reproducible.
- -O, output folder, which contains the output files of inferred peak status and parameters in the HMRF peak calling model. The example folder is CP_output.
To run HMRF tissue by tissue, use
./HMRF -I Example/CP_1_6595000_7965000.txt -NP 138 -Tune 100 -NG 10000 -Bininitial 6595000 -Binsize 10000 -SEED 123 -O Example/CP_output/
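The value passed to -NP in the command above can be checked against the TAD boundaries. The following is a minimal sketch, assuming that the size of the HiC contact matrix is the number of fragments between the first and last middle points, counting both ends; this assumption reproduces the value 138 used in the example, but it is not stated explicitly by MUNIn. The function name contact_matrix_size is hypothetical.

```python
def contact_matrix_size(tad_start: int, tad_end: int, bin_size: int) -> int:
    """Number of fragments from tad_start to tad_end, counting both ends.

    Assumption (not confirmed by the MUNIn documentation): -NP equals
    (tad_end - tad_start) / bin_size + 1.
    """
    span = tad_end - tad_start
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    if span < 0 or span % bin_size != 0:
        raise ValueError("TAD span must be a non-negative multiple of bin_size")
    return span // bin_size + 1


# Example TAD from this document: 6595000 bp to 7965000 bp at 10 kb resolution.
print(contact_matrix_size(6595000, 7965000, 10000))  # 138
```

With the example TAD, the span is 1370000 bp, which gives 137 steps of 10000 bp and therefore 138 fragments.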
3. With the peak calling results from each tissue, we label the tissues with different indices, i.e. 0, 1, 2, ..., and concatenate the long format output files together as the input file for MUNIn, which contains 6 columns respectively as tissue index, middle point of fragment 1, middle point of fragment 2, observed frequency, expected frequency and peak status. Users also need to prepare four files respectively containing the estimated parameters of theta, phi, gamma and psi of each tissue.
The 13 required command parameters are
- -I, input data file for MUNIn, which is a text file with 6 columns respectively as tissue index, middle point of fragment 1, middle point of fragment 2, observed frequency, expected frequency and peak status. The example file is CP_GZ_Record_long_format.txt.
- -NP, size of HiC contact matrix.
- -NT, number of tissues
- -NG, number of Gibbs samples.
- -Bininitial, the middle point of the first fragment (the smallest value of fragment 1).
- -Binsize, fragment length.
- -Theta, theta input file, which is a text file of one column listing the estimated theta from each tissue.
- -Phi, phi input file, which is a text file of one column listing the estimated phi from each tissue.
- -Gamma, gamma input file, which is a text file of one column listing the estimated gamma from each tissue.
- -Psi, psi input file, which is a text file of one column listing the estimated psi from each tissue.
- -Alpha, tissue dependency input file. When there are two tissues, it contains 5 columns respectively as order index, peak status in tissue 1, peak status in tissue 2, heterogeneity of peak status in the two tissues (0, shared background; 1, tissue-specific peak; 2, shared peak) and proportion of each status in all the fragment pairs. The example file is alpha_CP_GZ_1_6595000_7965000.txt.
- -SEED, seed of the random number generator. Setting the seed to a fixed value can make the results reproducible.
- -O, output folder, which contains the output files of inferred peak status and parameters in the HMRF peak calling model. The example folder is MUNIn_outputs.
To run MUNIn, use
./MUNIn -I Example/CP_GZ_Record_long_format.txt -NP 138 -NT 2 -NG 10000 -Bininitial 6595000 -Binsize 10000 -Theta Example/theta.txt -Phi Example/phi.txt -Gamma Example/gamma.txt -Psi Example/psi.txt -Alpha Example/alpha_CP_GZ_1_6595000_7965000.txt -SEED 1 -O MUNIn_output/
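The concatenation described in step 3 can be sketched as follows. This is an illustration only, assuming that each per-tissue peak calling result is already available as rows of 6 whitespace-separated columns in the long format (tissue index, middle point of fragment 1, middle point of fragment 2, observed frequency, expected frequency, peak status), possibly with a header line. The function name concatenate_tissues and the choice of output separator are hypothetical and not part of MUNIn.

```python
from typing import Iterable, List

COLUMNS = ["tissue_index", "frag1", "frag2", "Oij", "Eij", "peak_status"]


def concatenate_tissues(per_tissue_lines: List[Iterable[str]], sep: str = "\t") -> List[str]:
    """Relabel each tissue's rows with its position in the list (0, 1, 2, ...)
    and stack them under a single header line."""
    out = [sep.join(COLUMNS)]
    for tissue_index, lines in enumerate(per_tissue_lines):
        for line in lines:
            fields = line.split()
            if not fields or fields[0] == COLUMNS[0]:
                continue  # skip blank lines and per-file header lines
            if len(fields) != len(COLUMNS):
                raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
            fields[0] = str(tissue_index)  # overwrite the tissue index
            out.append(sep.join(fields))
    return out


# Two rows taken from the example data in this document (CP first, then GZ).
cp_rows = ["tissue_index frag1 frag2 Oij Eij peak_status",
           "0 6595000 6685000 36 11.19197100 1"]
gz_rows = ["0 6595000 6685000 39 11.44867100 1"]
for row in concatenate_tissues([cp_rows, gz_rows], sep=" "):
    print(row)
```

The order of the list determines the tissue indices, so the theta, phi, gamma and psi files should list the tissues in the same order.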
Input formats for HMRF
- The input data file for HMRF is a text file, with 5 columns respectively as middle point of fragment 1, middle point of fragment 2, observed frequency, expected frequency and p-value estimated by Fit-Hi-C. For example, the first several lines of CP_1_6595000_7965000.txt are
6595000 6605000 0 22.391277 1.000000e+00
6595000 6615000 36 22.391277 4.900238e-03
6595000 6625000 0 22.391277 1.000000e+00
6595000 6635000 32 20.537476 1.144471e-02
6595000 6645000 26 16.747731 2.158414e-02
6595000 6655000 11 14.374905 8.478567e-01
6595000 6665000 25 13.049002 2.093234e-03
6595000 6675000 10 12.043115 7.613544e-01
6595000 6685000 36 11.191971 3.047137e-09
6595000 6695000 12 10.470704 3.578206e-01
...
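Before running HMRF, it can be useful to check that the input file matches the values given to -NP, -Bininitial and -Binsize. The sketch below parses rows in the 5-column format shown above and verifies that every fragment middle point lies on the fragment grid. The names HmrfRow and parse_hmrf_rows are hypothetical, and the grid check is an assumption about what a consistent input looks like, not a check performed by MUNIn itself.

```python
from typing import Iterable, List, NamedTuple


class HmrfRow(NamedTuple):
    frag1: int        # middle point of fragment 1
    frag2: int        # middle point of fragment 2
    observed: int     # observed frequency
    expected: float   # expected frequency
    p_value: float    # p-value estimated by Fit-Hi-C


def parse_hmrf_rows(lines: Iterable[str], bin_initial: int,
                    bin_size: int, n_bins: int) -> List[HmrfRow]:
    """Parse whitespace-separated rows and check them against the grid."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        row = HmrfRow(int(fields[0]), int(fields[1]), int(fields[2]),
                      float(fields[3]), float(fields[4]))
        for position in (row.frag1, row.frag2):
            offset = position - bin_initial
            if offset % bin_size != 0 or not 0 <= offset // bin_size < n_bins:
                raise ValueError(f"middle point {position} is off the fragment grid")
        rows.append(row)
    return rows


example = ["6595000 6615000 36 22.391277 4.900238e-03",
           "6595000 6685000 36 11.191971 3.047137e-09"]
parsed = parse_hmrf_rows(example, bin_initial=6595000, bin_size=10000, n_bins=138)
print(len(parsed), parsed[1].p_value)
```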
Input formats for MUNIn
- The input data file for MUNIn is a text file, with 6 columns respectively as tissue index, middle point of fragment 1, middle point of fragment 2, observed frequency, expected frequency and peak status. For example, the first several lines of CP_GZ_Record_long_format.txt are
tissue_index frag1 frag2 Oij Eij peak_status
0 6595000 6605000 0 22.39127700 -1
0 6595000 6615000 36 22.39127700 -1
0 6595000 6625000 0 22.39127700 -1
0 6595000 6635000 32 20.53747600 -1
0 6595000 6645000 26 16.74773100 -1
0 6595000 6655000 11 14.37490500 -1
0 6595000 6665000 25 13.04900200 -1
0 6595000 6675000 10 12.04311500 -1
0 6595000 6685000 36 11.19197100 1
0 6595000 6695000 12 10.47070400 -1
...
1 6595000 6605000 0 22.42577200 -1
1 6595000 6615000 39 22.42577200 -1
1 6595000 6625000 0 22.42577200 -1
1 6595000 6635000 34 20.91651100 -1
1 6595000 6645000 31 17.25057900 -1
1 6595000 6655000 16 14.82693000 -1
1 6595000 6665000 39 13.39173400 1
1 6595000 6675000 12 12.31906000 -1
1 6595000 6685000 39 11.44867100 1
1 6595000 6695000 11 10.73675700 -1
...
- The alpha file is a text file with 5 columns, when there are two tissues, respectively as order index, peak status in tissue 1, peak status in tissue 2, heterogeneity of peak status in the two tissues (0, shared background; 1, tissue-specific peak; 2, shared peak) and proportion of each status in all the fragment pairs. Here is an example for alpha_CP_GZ_1_6595000_7965000.txt
order_index peak_status_tissue 1 peak_status_tissue 2 proportion
0 0 0 0 0.77234740
1 1 0 1 0.05585528
2 0 1 1 0.04178568
3 1 1 2 0.13001164
- The parameter file is a text file of one column listing the estimated parameter, for example theta, from each tissue. Here is an example for theta.txt
1.3131
1.3013
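To make the layout of the alpha file concrete, the sketch below builds the four rows for two tissues from long format peak calls. It assumes that the proportions are the empirical frequencies of each combination of peak status over all fragment pairs called in both tissues, with peak status -1 mapped to 0. The documentation does not state how the example alpha file was produced, so this is one plausible way to obtain such a table, not the authors' procedure. The function name alpha_table is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, int, int, int]  # (tissue_index, frag1, frag2, peak_status)


def alpha_table(records: Iterable[Record]) -> List[Tuple[int, int, int, int, float]]:
    """Return rows (order index, status tissue 1, status tissue 2,
    heterogeneity, proportion) for exactly two tissues indexed 0 and 1."""
    status: Dict[Tuple[int, int], Dict[int, int]] = {}
    for tissue, frag1, frag2, peak in records:
        # Long format uses -1/1 for peak status; the alpha file uses 0/1.
        status.setdefault((frag1, frag2), {})[tissue] = 1 if peak == 1 else 0
    # Keep only fragment pairs observed in both tissues.
    complete = [s for s in status.values() if 0 in s and 1 in s]
    if not complete:
        raise ValueError("no fragment pair is present in both tissues")
    counts = Counter((s[0], s[1]) for s in complete)
    rows = []
    # Same row order as in the example alpha file above.
    for order, (s1, s2) in enumerate([(0, 0), (1, 0), (0, 1), (1, 1)]):
        # Heterogeneity: 0 shared background, 1 tissue-specific, 2 shared peak.
        rows.append((order, s1, s2, s1 + s2, counts[(s1, s2)] / len(complete)))
    return rows


toy = [(0, 6595000, 6605000, -1), (1, 6595000, 6605000, -1),
       (0, 6595000, 6615000, -1), (1, 6595000, 6615000, -1),
       (0, 6595000, 6665000, -1), (1, 6595000, 6665000, 1),
       (0, 6595000, 6685000, 1), (1, 6595000, 6685000, 1)]
for row in alpha_table(toy):
    print(*row)
```

In this toy input, two of the four fragment pairs are shared background, one is a peak only in the second tissue, and one is a shared peak, so the proportions are 0.5, 0.0, 0.25 and 0.25.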
Output formats of MUNIn
MUNIn outputs multiple files, the majority of which are files recording peak status and parameters for each tissue.
- The output Hi-C peak record file is a text file, with 6 columns respectively as tissue index, middle point of fragment 1, middle point of fragment 2, observed frequency, expected frequency and peak status. For example, the first several lines of Record_long_format.txt are
tissue_index frag1 frag2 Oij Eij peak_status
0 6595000 6605000 0 22.39127700 -1
0 6595000 6615000 36 22.39127700 -1
0 6595000 6625000 0 22.39127700 -1
0 6595000 6635000 32 20.53747600 -1
0 6595000 6645000 26 16.74773100 -1
0 6595000 6655000 11 14.37490500 -1
0 6595000 6665000 25 13.04900200 -1
0 6595000 6675000 10 12.04311500 -1
0 6595000 6685000 36 11.19197100 1
0 6595000 6695000 12 10.47070400 -1
...
1 6595000 6605000 0 22.42577200 -1
1 6595000 6615000 39 22.42577200 -1
1 6595000 6625000 0 22.42577200 -1
1 6595000 6635000 34 20.91651100 -1
1 6595000 6645000 31 17.25057900 -1
1 6595000 6655000 16 14.82693000 -1
1 6595000 6665000 39 13.39173400 -1
1 6595000 6675000 12 12.31906000 -1
1 6595000 6685000 39 11.44867100 1
1 6595000 6695000 11 10.73675700 -1
...
- The output parameter record file is a text file containing the maximum log-likelihood value, the estimated parameters theta, phi, gamma and psi for each tissue, and the number of Gibbs samples and the seed used by MUNIn. For example, the contents of the file Record_Para.txt are
Best LogLike = 215124.8443
Tissue = 0
Best Theta = 1.4969
Best Phi = 5.6009
Best Gamma = -0.0073
Best Psi = 0.3947
Tissue = 1
Best Theta = 1.4575
Best Phi = 5.6002
Best Gamma = -0.0253
Best Psi = 0.3986
NumGibbs = 10000
SEED = 1
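For downstream scripting, the "key = value" layout of Record_Para.txt shown above can be read into a nested dictionary. The parser below is a minimal sketch based only on the example contents in this document; the function name parse_record_para is hypothetical, and other MUNIn versions may write additional lines that this sketch ignores.

```python
import re
from typing import Any, Dict

SAMPLE = """Best LogLike = 215124.8443
Tissue = 0
Best Theta = 1.4969
Best Phi = 5.6009
Best Gamma = -0.0073
Best Psi = 0.3947
Tissue = 1
Best Theta = 1.4575
Best Phi = 5.6002
Best Gamma = -0.0253
Best Psi = 0.3986
NumGibbs = 10000
SEED = 1
"""


def parse_record_para(text: str) -> Dict[str, Any]:
    """Parse 'key = value' lines; parameters after 'Tissue = k' belong to tissue k."""
    result: Dict[str, Any] = {"tissues": {}}
    current = None
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s*=\s*(\S+)\s*$", line)
        if not match:
            continue  # ignore lines that are not 'key = value'
        key, value = match.group(1), match.group(2)
        if key == "Best LogLike":
            result["LogLike"] = float(value)
        elif key == "Tissue":
            current = int(value)
            result["tissues"][current] = {}
        elif key.startswith("Best "):
            if current is None:
                raise ValueError(f"parameter before any 'Tissue' line: {line!r}")
            result["tissues"][current][key[len("Best "):]] = float(value)
        elif key in ("NumGibbs", "SEED"):
            result[key] = int(value)
    return result


para = parse_record_para(SAMPLE)
# One value per line, in tissue order, as in the one-column parameter files.
print("\n".join(str(para["tissues"][t]["Theta"]) for t in sorted(para["tissues"])))
```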