sep241prep#

Prepare data for CUT&Tag 2for1 deconvolution. The input sequencing reads should be supplied in bed format.

We exclude genomic regions with low event density based on a Gaussian kernel density estimate (KDE). This can be controlled with the --kde-bw and --kde-threshold arguments.
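
To illustrate the selection idea, the following minimal sketch sums Gaussian kernels centered on cut positions and keeps the grid positions where that sum reaches a threshold. The function names, the grid step, the example cut sites, and the use of an unnormalized kernel sum are assumptions made for illustration; the exact scaling that sep241prep applies to its KDE is not specified here.

```python
import math

def gaussian_kde_sum(cuts, position, bandwidth):
    """Unnormalized Gaussian kernel sum at `position` (assumed scaling)."""
    return sum(math.exp(-0.5 * ((position - c) / bandwidth) ** 2) for c in cuts)

def select_regions(cuts, bandwidth=200.0, threshold=2.0, step=50):
    """Return (start, end) spans on a grid where the kernel sum >= threshold."""
    if not cuts:
        return []
    lo = int(min(cuts) - 3 * bandwidth)
    hi = int(max(cuts) + 3 * bandwidth)
    regions, start, prev = [], None, None
    for pos in range(lo, hi + 1, step):
        dense = gaussian_kde_sum(cuts, pos, bandwidth) >= threshold
        if dense and start is None:
            start = pos                      # a dense span begins here
        elif not dense and start is not None:
            regions.append((start, prev))    # span ended at the previous grid point
            start = None
        prev = pos
    if start is not None:
        regions.append((start, prev))
    return regions

# Hypothetical cut sites: a dense cluster near 10 kb and one isolated cut.
cuts = [10_000, 10_050, 10_100, 10_150, 10_200, 50_000]
regions = select_regions(cuts)
print(regions)
```

In this toy input only the cluster is selected, because a single isolated cut contributes a kernel sum of about 1, which stays below the threshold of 2.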

To stay within memory limits and keep the computation numerically feasible, the data is split into small intervals, which are grouped into subsets (workchunks) that are deconvolved separately. The resulting data is written to the pickle file specified with --out, which serves as the input of the deconvolution.

Workchunks are each processed by a different instance of the deconvolution step. The total memory cost of each workchunk is controlled by targeting the memory demand specified with --memory-target. Note that this script only estimates the memory, and the actual memory demand may differ significantly from the target. In that case, consider adjusting --memory-target and running the preparation again.
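
The grouping of intervals under a memory target can be pictured with the following minimal sketch. It assumes that a per-interval memory estimate already exists and that estimates add up within a workchunk; `pack_intervals` and the example estimates are hypothetical, and this is not the algorithm sep241prep itself uses.

```python
import random

def pack_intervals(memory_gb, target_gb=20.0, seed=242567):
    """Greedily group interval indices so that each chunk's summed
    estimate stays at or below target_gb. An interval that alone
    exceeds the target still gets a chunk of its own."""
    order = list(range(len(memory_gb)))
    random.Random(seed).shuffle(order)  # reproducible random membership
    chunks, current, used = [], [], 0.0
    for idx in order:
        need = memory_gb[idx]
        if current and used + need > target_gb:
            chunks.append(current)       # close the full chunk
            current, used = [], 0.0
        current.append(idx)
        used += need
    if current:
        chunks.append(current)
    return chunks

# Hypothetical per-interval estimates in GB.
estimates = [0.5] * 100
chunks = pack_intervals(estimates, target_gb=20.0)
print([len(chunk) for chunk in chunks])
```

Shuffling before packing is one simple way to give every chunk a mixed selection of intervals rather than a run of genomic neighbors.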

Note that this script uses covariance functions to better estimate the memory demand of the deconvolution step. If you want to use a covariance function or covariance function parameters other than the default, and the accuracy of the memory estimation is important, it is advisable to specify the covariance function and parameters in the preparation step with the --c1-cov-for-memory-estimation, --c2-cov-for-memory-estimation, and --sparsity-threshold-for-memory-estimation parameters.

Intervals within different workchunks are deconvolved independently. It is therefore helpful if each workchunk contains a representative subset of intervals, so that the correct global fragment length distributions can be inferred. If the inference fails and a workchunk produces fragment length distributions that differ significantly from the global average, 2for1separator will request to rerun it with an updated prior. To avoid excessive reruns, we advise that each workchunk contain at least 20 intervals. Generally, this should occur by default, but if workchunks have too few intervals, consider decreasing --max-locs and --max-cuts, which leads to finer subdivisions of the used intervals.

usage: sep241prep [-h] [-l LEVEL] [--logfile logfile] [-o dir]
                  [--barcode pattern] [--bc-from-file]
                  [--omit-seqname-postfix] [--keep-duplicates] [--seed int]
                  [--kde-bw float] [--kde-threshold float]
                  [--selection-padding int] [--selection-bed str]
                  [--region-padding int] [--interval-overlap int]
                  [--max-locs int] [--max-cuts int] [--memory-target float]
                  [--c1-cov 'a*CovFunc(args) + ...']
                  [--c2-cov 'a*CovFunc(args) + ...']
                  [--sparsity-threshold float] [--blacklist file.bed.gz2]
                  [--blacklisted-seqs chrN [chrN ...]] [--no-progress]
                  [--cores int] [--compiledir dir]
                  fragment-file.bed [fragment-file.bed ...]
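
As a sketch of how an invocation might be assembled, the snippet below builds an argument list from flags listed in the usage text and prints the resulting command line. The input file names are hypothetical, and the values shown are the documented defaults written out explicitly.

```python
import shlex

# Hypothetical input files; every flag below appears in the usage text.
cmd = [
    "sep241prep",
    "--out", "work_chunks.pandas_pkl",
    "--memory-target", "20",
    "--kde-bw", "200",
    "--kde-threshold", "2",
    "sample1.bed.gz",
    "sample2.bed.gz",
]
command_line = shlex.join(cmd)
print(command_line)
# To launch it from Python: subprocess.run(cmd, check=True)
```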

fragment-file.bed#: Input bed files. If --bc-from-file is not set, the fourth column should contain the cell barcodes. Bed files can be compressed using gzip (*.gz) or bzip2 (*.bz2).

-h, --help#: show this help message and exit

-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}#: Set the logging level (default=INFO).

--logfile <logfile>#: Write detailed log to this file.

-o <dir>, --out <dir>#: Output file path (default=work_chunks.pandas_pkl).

--barcode <pattern>#: Specify a regular expression to extract cell barcodes. E.g. ‘file_([ACGT]*)’ will extract ‘ACCGGC’ from ‘something_file_ACCGGC_other’. If set to empty ‘’, no barcode is saved. (default=’(.*)’)
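
The pattern example above can be reproduced with Python's `re` module. This is a minimal sketch that assumes the barcode is the first capture group of the first match found anywhere in the name; `extract_barcode` is a hypothetical helper, not a function of the tool.

```python
import re

def extract_barcode(pattern, name):
    """Return the first capture group of `pattern` found in `name`,
    or None if the pattern is empty or does not match."""
    if not pattern:
        return None
    match = re.search(pattern, name)
    return match.group(1) if match else None

barcode = extract_barcode(r"file_([ACGT]*)", "something_file_ACCGGC_other")
print(barcode)  # ACCGGC
```

With the default pattern `(.*)`, the same helper returns the whole name.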

--bc-from-file#: Extract barcode from file name instead of using the fourth column of the bed file(s).

--omit-seqname-postfix#: For the sequence name, omit everything after the first ‘_’.
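
As a small illustration of that rule, the following sketch truncates a sequence name at its first underscore. The helper name and the example sequence names are assumptions chosen for illustration.

```python
def strip_seqname_postfix(seqname):
    """Keep only the part of a sequence name before the first underscore."""
    return seqname.split("_", 1)[0]

print(strip_seqname_postfix("chr1_KI270706v1_random"))  # chr1
print(strip_seqname_postfix("chr2"))                    # chr2
```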

--keep-duplicates#: Do not remove identical reads (PCR duplicates).

--seed <int>#: Random state seed to assign intervals to workchunks (default=242567).

--kde-bw <float>#: Bandwidth (sigma) for kernel density estimate (KDE) used for interval selection (default=200).

--kde-threshold <float>#: Minimum KDE value to consider genomic section for deconvolution (default=2).

--selection-padding <int>#: Additional padding around genomic section selected based on KDE (default=10,000).

--selection-bed <str>#: As an alternative to KDE selection, specify a bed file of regions to deconvolve.

--region-padding <int>#: Additional padding around intervals even if specified through bed file (default=5,000).

--interval-overlap <int>#: Overlap with neighboring subdivisions and minimum exclusive section size (default=10,000).

--max-locs <int>#: Maximum number of unique cuts per interval. Memory demand may increase proportionally (default=10,000,000).

--max-cuts <int>#: Maximum number of cuts per interval. Memory demand may increase proportionally (default=15,000,000).

--memory-target <float>#: Memory demand target in GB for individual workchunks (default=20).

--c1-cov <'a*covfunc(args) + ...'>, --c1-cov-for-memory-estimation <'a*covfunc(args) + ...'>#: Covariance function of component 1 for the deconvolution step using pymc covariance functions without the input_dim argument: https://docs.pymc.io/api/gp/cov.html In the preparation step, this is only used for a more accurate estimation of the memory demand of the deconvolution step and the assignment of intervals into workchunks based on the memory estimations. (default=Matern32(500))

--c2-cov <'a*covfunc(args) + ...'>, --c2-cov-for-memory-estimation <'a*covfunc(args) + ...'>#: Covariance function of component 2 for the deconvolution step using pymc covariance functions without the input_dim argument: https://docs.pymc.io/api/gp/cov.html In the preparation step, this is only used for a more accurate estimation of the memory demand of the deconvolution step and the assignment of intervals into workchunks based on the memory estimations. (default=Matern32(2000))

--sparsity-threshold <float>, --sparsity-threshold-for-memory-estimation <float>#: In the deconvolution step, do not calculate covariances that will be below this threshold. In the preparation step, this is only used for a more accurate estimation of the memory demand of the deconvolution step and the assignment of intervals into workchunks based on the memory estimations. (default=1e-8)
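
To give a feel for how the covariance function and the sparsity threshold interact, the sketch below computes the distance beyond which a unit-amplitude Matern 3/2 covariance falls below a threshold. It assumes that the argument in `Matern32(500)` is the length scale, and it implements the standard Matern 3/2 formula directly rather than calling pymc. How the tool turns such a cutoff into a memory estimate is not described here, so treat this only as intuition for why a longer length scale or a smaller threshold leaves more non-negligible covariances.

```python
import math

def matern32(distance, length_scale):
    """Matern 3/2 covariance with unit amplitude."""
    scaled = math.sqrt(3.0) * distance / length_scale
    return (1.0 + scaled) * math.exp(-scaled)

def cutoff_distance(length_scale, threshold=1e-8):
    """Distance at which the covariance drops to the threshold (bisection)."""
    lo, hi = 0.0, float(length_scale)
    while matern32(hi, length_scale) > threshold:
        hi *= 2.0                      # expand until the bracket contains the cutoff
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if matern32(mid, length_scale) > threshold:
            lo = mid
        else:
            hi = mid
    return hi

cutoffs = {ls: cutoff_distance(ls) for ls in (500, 2000)}
for ls, cutoff in cutoffs.items():
    print(f"length scale {ls}: cutoff distance of about {cutoff:.0f}")
```

Because the covariance depends only on the ratio of distance to length scale, the cutoff grows linearly with the length scale.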

--blacklist <file.bed.gz2>#: Bed file of genomic regions to exclude from the deconvolution.

--blacklisted-seqs <chrn>#: Sequences to exclude from the deconvolution (default=chrM chrUn chrEBV).

--no-progress#: Do not show progress.

--cores <int>#: Number of CPUs to use for the preparation.

--compiledir <dir>#: Directory to compile code in. Can be deleted after run. Enter a directory path, e.g. ‘./sep241tmp/{args.out}/prep’ to keep the compiled code saved in that directory. By default, creates a temporary directory.