sep241deconvolve

Deconvolve CUT&Tag 2for1 data using the output of the sep241prep step.

This step computes, for every cut (both ends of each fragment), the likelihood that it originates from each of two channels. By default, channel 1 is expected to contain smaller fragments (e.g., Pol2S5p-induced) than channel 2 (e.g., H3K27me3-induced). This is controlled by --c1-dirichlet-prior and --c2-dirichlet-prior. Each Dirichlet prior specifies the weights of the log-normal distributions that make up a mixture describing the prior fragment-length distribution, which typically exhibits a ladder of distinct modes.
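The length prior described above can be sketched with NumPy. This is a minimal illustration, not the tool's implementation: it uses the default values listed for --length-dist-modes, --length-dist-mode-sds and the two Dirichlet priors, takes the mean of each Dirichlet prior as the mixture weights, and assumes that each standard deviation is the log-scale sigma of a log-normal whose mode sits at the given length. The function names are hypothetical.

```python
import numpy as np

# Defaults quoted from the option list of this page.
MODES = np.array([70.0, 200.0, 400.0, 600.0])    # --length-dist-modes
SDS = np.array([0.29, 0.18, 0.15, 0.085])        # --length-dist-mode-sds
C1_PRIOR = np.array([450.0, 100.0, 10.0, 1.0])   # --c1-dirichlet-prior
C2_PRIOR = np.array([150.0, 300.0, 50.0, 10.0])  # --c2-dirichlet-prior


def expected_weights(dirichlet_prior):
    """Mean of a Dirichlet distribution: concentrations normalized to sum to 1."""
    return dirichlet_prior / dirichlet_prior.sum()


def mixture_density(lengths, weights, modes=MODES, sds=SDS):
    """Density of a log-normal mixture evaluated at the given fragment lengths.

    Assumption: each sd is the log-scale sigma and each mode is the mode of
    its log-normal component, so mu = log(mode) + sigma**2.
    """
    x = np.asarray(lengths, dtype=float)[:, None]
    mu = np.log(modes) + sds**2
    pdf = np.exp(-((np.log(x) - mu) ** 2) / (2 * sds**2)) / (x * sds * np.sqrt(2 * np.pi))
    return pdf @ weights


lengths = np.arange(1, 5001)
c1_density = mixture_density(lengths, expected_weights(C1_PRIOR))
c2_density = mixture_density(lengths, expected_weights(C2_PRIOR))

# Share of prior mass on fragments shorter than 120 bp in each channel
# (sum over a 1 bp grid as a simple approximation of the integral).
short = lengths < 120
print("channel 1:", c1_density[short].sum().round(3))
print("channel 2:", c2_density[short].sum().round(3))
```

Under these assumptions, the channel 1 prior places most of its mass on the shortest mode, while the channel 2 prior favors the 200 bp mode.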

The mode positions and widths are shared between both channels and controlled by --length-dist-modes and --length-dist-mode-sds. The default settings further assume that cuts in channel 1 co-occur in tighter clusters than cuts in channel 2, which are assumed to be more spread out. This is controlled through the covariance functions of the respective position marginal log-likelihood functions, --c1-cov and --c2-cov.
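To make the effect of the two default covariance functions concrete, the following sketch evaluates the standard Matern 3/2 kernel with NumPy for the default length scales 500 and 2000. It does not call PyMC or the tool itself. The cut positions are made up, the length scale is assumed to be in base pairs, and `matern32` is a hypothetical helper. The last lines apply the default --sparsity-threshold to show how many matrix entries would remain.

```python
import numpy as np


def matern32(positions, length_scale):
    """Matern 3/2 covariance matrix for 1-D positions (unit variance)."""
    pos = np.asarray(positions, dtype=float)
    r = np.abs(pos[:, None] - pos[None, :])
    scaled = np.sqrt(3.0) * r / length_scale
    return (1.0 + scaled) * np.exp(-scaled)


# Hypothetical cut positions in base pairs, one every 500 bp.
positions = np.arange(0, 100_000, 500)

cov_c1 = matern32(positions, 500)    # default --c1-cov: Matern32(500)
cov_c2 = matern32(positions, 2000)   # default --c2-cov: Matern32(2000)

# Covariance between two positions 1000 bp apart (indices 0 and 2).
print("c1 at 1000 bp:", cov_c1[0, 2])
print("c2 at 1000 bp:", cov_c2[0, 2])

# Fraction of matrix entries above the default --sparsity-threshold.
threshold = 1e-8
print("c1 kept:", np.mean(cov_c1 > threshold))
print("c2 kept:", np.mean(cov_c2 > threshold))
```

The shorter length scale lets the covariance decay faster with distance, so the channel 1 matrix keeps fewer entries above the threshold than the channel 2 matrix.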

A single instance of the deconvolution process deconvolves one workchunk, specified by --workchunk. Either of the following code snippets runs all the deconvolutions, where each bracketed placeholder stands for an argument to the deconvolution step and [N] is the highest workchunk id.

Run deconvolution using slurm:

sbatch --array=0-[N] --mem=[memory target] sep241deconvolve [jobdata pkl file]

Run in bash or zsh:

for wc in $(seq 0 [N]); do
    sep241deconvolve [jobdata pkl file] --workchunk $wc
done
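The same loop can be driven from Python. The sketch below only builds the command lines, using nothing beyond the positional jobdata file and the --workchunk option shown above. `deconvolve_commands`, the file name `jobdata.pkl` and the value 3 for [N] are placeholders for illustration.

```python
import shlex


def deconvolve_commands(jobdata, highest_workchunk):
    """Build one sep241deconvolve command per workchunk id, 0 through N."""
    return [
        ["sep241deconvolve", jobdata, "--workchunk", str(wc)]
        for wc in range(highest_workchunk + 1)
    ]


# "jobdata.pkl" and N=3 are placeholder values for illustration.
commands = deconvolve_commands("jobdata.pkl", 3)
for cmd in commands:
    print(shlex.join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would run the workchunks one after another and stop at the first failure.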

The preparation step tries to create workchunks that fit within the target memory, but this is only an estimate. If the memory demand of the deconvolution step is prohibitively high, try rerunning the preparation step with a lower memory target.

usage: sep241deconvolve [-h] [-l LEVEL] [--logfile logfile] [--workchunk int]
                        [--c1-cov 'a*CovFunc(args) + ...']
                        [--c2-cov 'a*CovFunc(args) + ...']
                        [--sparsity-threshold float]
                        [--length-dist-modes floats [floats ...]]
                        [--length-dist-mode-sds floats [floats ...]]
                        [--c1-dirichlet-prior floats [floats ...]]
                        [--c2-dirichlet-prior floats [floats ...]]
                        [--constrain] [--cores int] [--compiledir dir]
                        [jobdata-file]

jobdata-file: Jobdata with cuts per interval and workchunk ids.

-h, --help: show this help message and exit

-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}: Set the logging level (default=INFO)

--logfile <logfile>: Write detailed log to this file.

--workchunk <int>: Work chunk number of jobdata to handle. Defaults to the environment variable SLURM_ARRAY_TASK_ID if set.

--c1-cov <'a*covfunc(args) + ...'>: Covariance function of component 1, written with PyMC covariance functions but omitting the input_dim argument: https://docs.pymc.io/api/gp/cov.html (default=Matern32(500))

--c2-cov <'a*covfunc(args) + ...'>: Covariance function of component 2, written with PyMC covariance functions but omitting the input_dim argument: https://docs.pymc.io/api/gp/cov.html (default=Matern32(2000))

--sparsity-threshold <float>: The Gaussian processes modeling the location distribution use a custom covariance matrix that saves time and memory by representing only covariances above this threshold. (default=1e-8)

--length-dist-modes <floats>: Modes of the log-normal distributions used in the mixture model for the fragment length distribution. (default=70 200 400 600)

--length-dist-mode-sds <floats>: Standard deviations of the log-normal modes of the mixture model for the fragment length distribution. (default=0.29 0.18 0.15 0.085)

--c1-dirichlet-prior <floats>: Dirichlet prior for the ratio of modes in the length distribution of component 1. (default=450 100 10 1)

--c2-dirichlet-prior <floats>: Dirichlet prior for the ratio of modes in the length distribution of component 2. (default=150 300 50 10)

--constrain: Use this flag to constrain average fragment length to be larger in one component. (default=False)

--cores <int>: Number of CPUs to use for inference. Defaults to the environment variable SLURM_CPUS_PER_TASK if set.

--compiledir <dir>: Directory to compile code in. Should be different for each parallel run instance and can be deleted after the run. Defaults to sep241tmp/{args.jobdata}/deconvolve/{args.workchunk} or, if the environment variable TMPDIR is set, to TMPDIR/sep241tmp/{args.jobdata}/deconvolve/{args.workchunk}.
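The environment-dependent defaults of --workchunk, --cores and --compiledir can be summarized in a short sketch. It only restates the rules documented above and is not the tool's own code. `resolve_defaults` is a hypothetical helper, and the page does not say what happens when neither the option nor the environment variable is set, so the sketch leaves such values as None.

```python
import os


def resolve_defaults(jobdata, workchunk=None, cores=None, compiledir=None, env=None):
    """Fill unset options from the environment, following the documented defaults."""
    env = os.environ if env is None else env
    if workchunk is None and "SLURM_ARRAY_TASK_ID" in env:
        workchunk = int(env["SLURM_ARRAY_TASK_ID"])
    if cores is None and "SLURM_CPUS_PER_TASK" in env:
        cores = int(env["SLURM_CPUS_PER_TASK"])
    if compiledir is None:
        # Documented pattern: sep241tmp/{jobdata}/deconvolve/{workchunk},
        # placed under TMPDIR when that variable is set.
        compiledir = os.path.join("sep241tmp", str(jobdata), "deconvolve", str(workchunk))
        if "TMPDIR" in env:
            compiledir = os.path.join(env["TMPDIR"], compiledir)
    return {"workchunk": workchunk, "cores": cores, "compiledir": compiledir}


# Simulated environment of Slurm array task 7 with 4 CPUs per task.
fake_env = {"SLURM_ARRAY_TASK_ID": "7", "SLURM_CPUS_PER_TASK": "4", "TMPDIR": "/scratch"}
print(resolve_defaults("jobdata.pkl", env=fake_env))
```

Because the workchunk id is part of the path, each task of a Slurm array gets its own compile directory, which matches the advice to keep the directory different between parallel run instances.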