tacular-omics/tdfextractor

tdfextractor

A Python package to extract MS/MS spectra from Bruker TimsTOF .D folders and convert them to standard formats (MS2, MGF, and mzML).

Installation

pip install tdfextractor

Usage

tdfextractor provides three command-line tools for extracting spectra, one per output format:

MS2 Extraction

Extract MS2 format files (compatible with MS-GF+, Comet, etc.):

ms2-extractor /path/to/sample.d

# shorthand
ms2-ex /path/to/sample.d
ms2-ex /path/to/sample.d --output custom_output.ms2 --min-spectra-intensity 100 --min-precursor-charge 2
ms2-ex /path/to/directory_with_multiple_d_folders --output /path/to/output_directory

MGF Extraction

Extract MGF format files:

mgf-extractor /path/to/sample.d

# shorthand
mgf-ex /path/to/sample.d
mgf-ex /path/to/sample.d --casanovo  # Optimized for Casanovo de novo sequencing
mgf-ex /path/to/directory_with_multiple_d_folders --output /path/to/output_directory

mzML Extraction

Extract mzML format files (includes both MS1 and MS2 PASEF spectra):

mzml-extractor /path/to/sample.d

# shorthand
mzml-ex /path/to/sample.d
mzml-ex /path/to/sample.d --no-ms1  # MS2 spectra only
mzml-ex /path/to/sample.d --mz-compression zstd --intensity-encoding 32
mzml-ex /path/to/directory_with_multiple_d_folders --output /path/to/output_directory
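
The encoding and compression options in the commands above control how each binary array is stored in the mzML file. As a rough illustration of the general mzML scheme (little-endian floats, optional compression, then base64), here is a minimal sketch. It uses zlib because zstd is not in the Python standard library, and `encode_array` / `decode_array` are hypothetical helper names, not part of tdfextractor.

```python
import base64
import struct
import zlib


def encode_array(values, bits=64, compress=True):
    """Pack floats little-endian, optionally zlib-compress, then base64-encode."""
    code = "d" if bits == 64 else "f"  # 64-bit double or 32-bit float
    raw = struct.pack(f"<{len(values)}{code}", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, bits=64, compressed=True):
    """Reverse encode_array: base64-decode, decompress, unpack."""
    code = "d" if bits == 64 else "f"
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // struct.calcsize("<" + code)
    return list(struct.unpack(f"<{count}{code}", raw))


mz = [100.00123, 250.54321, 1200.98765]
exact = decode_array(encode_array(mz, bits=64), bits=64)   # lossless round trip
approx = decode_array(encode_array(mz, bits=32), bits=32)  # small rounding error
print(exact)
print(approx)
```

A 32-bit encoding halves the uncompressed array size at the cost of precision, which is usually acceptable for intensities but less so for m/z values.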

Output Options

All three extractors support the same output options:

  1. No output specified: Files are created inside each .D folder with auto-generated names
  2. Specific file path: Use -o with a file name (for example, -o filename.ms2 or -o filename.mgf) when processing a single .D folder
  3. Output directory: Use -o /path/to/output_dir when batch processing multiple .D folders
  4. Overwrite protection: Existing output files are left untouched unless --overwrite is passed
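
The first three options can be pictured as one small path-resolution step. The sketch below is illustrative only: `resolve_output` is a hypothetical function, and it assumes that the auto-generated name is the .D folder name without its suffix and that any -o value with a file extension is treated as a file rather than a directory. The real tool may decide differently.

```python
from pathlib import Path


def resolve_output(analysis_dir, output=None, ext="ms2"):
    """Return the output file path for a single .D folder (illustrative only)."""
    analysis_dir = Path(analysis_dir)
    default_name = f"{analysis_dir.stem}.{ext}"  # assumption: sample.d -> sample.ms2
    if output is None:
        # Option 1: no -o given, write inside the .D folder itself
        return analysis_dir / default_name
    output = Path(output)
    if output.suffix:
        # Option 2 (assumption): a value with an extension is a file path
        return output
    # Option 3: otherwise treat the value as an output directory
    return output / default_name


print(resolve_output("/path/to/sample.d"))
print(resolve_output("/path/to/sample.d", "custom_output.ms2"))
print(resolve_output("/path/to/sample.d", "/path/to/output_directory", ext="mgf"))
```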

Batch Processing

When processing multiple .D folders, the extractors will:

  • Automatically find all .D folders in the specified directory
  • Create output files with names matching the .D folder names
  • Skip existing files unless --overwrite is specified
  • Create the output directory if it doesn't exist
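
The batch behavior listed above can be sketched in a few lines. This is a sketch under assumptions, not tdfextractor's code: `plan_batch` is a hypothetical name, only the top level of the input directory is searched, and the folder suffix is matched case-insensitively.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir, ext="mgf", overwrite=False):
    """Return (d_folder, output_file) pairs that still need processing."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    # Create the output directory if it doesn't exist
    output_dir.mkdir(parents=True, exist_ok=True)
    jobs = []
    for entry in sorted(input_dir.iterdir()):
        # Find .D folders (assumption: top level only, any letter case)
        if not (entry.is_dir() and entry.suffix.lower() == ".d"):
            continue
        # Output name matches the .D folder name
        target = output_dir / f"{entry.stem}.{ext}"
        # Skip existing files unless overwrite is requested
        if target.exists() and not overwrite:
            continue
        jobs.append((entry, target))
    return jobs
```

Calling `plan_batch("/path/to/directory_with_multiple_d_folders", "/path/to/output_directory")` would return one pair per .D folder whose output file does not exist yet.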

Command Line Arguments

The MS2 and MGF extractors share the arguments below; the few format-specific options are listed under Format-Specific Arguments.

| Argument | Type | Default | Description |
|---|---|---|---|
| analysis_dir | str | - | Path to the .D analysis directory or directory containing .D folders |
| -o, --output | str | <analysis_dir_name>.<ext> | Output file path or directory |
| --remove-precursor | flag | False | Remove precursor peaks from MS/MS spectra |
| --precursor-peak-width | float | 2.0 | Width around precursor m/z to remove (Da) |
| --batch-size | int | 100 | Batch size for processing spectra |
| --top-n-peaks | int | None | Keep only top N most intense peaks per spectrum |
| --min-spectra-intensity | float | None | Minimum intensity threshold for MS/MS peaks (absolute or 0.0-1.0 for percentage) |
| --max-spectra-intensity | float | None | Maximum intensity threshold for MS/MS peaks (absolute or 0.0-1.0 for percentage) |
| --min-spectra-mz | float | None | Minimum m/z filter for MS/MS peaks |
| --max-spectra-mz | float | None | Maximum m/z filter for MS/MS peaks |
| --min-precursor-intensity | float | None | Minimum precursor intensity filter |
| --max-precursor-intensity | float | None | Maximum precursor intensity filter |
| --min-precursor-charge | int | None | Minimum precursor charge state filter |
| --max-precursor-charge | int | None | Maximum precursor charge state filter |
| --min-precursor-mz | float | None | Minimum precursor m/z filter |
| --max-precursor-mz | float | None | Maximum precursor m/z filter |
| --min-precursor-rt | float | None | Minimum precursor retention time filter (seconds) |
| --max-precursor-rt | float | None | Maximum precursor retention time filter (seconds) |
| --min-precursor-ccs | float | None | Minimum precursor CCS filter |
| --max-precursor-ccs | float | None | Maximum precursor CCS filter |
| --min-precursor-neutral-mass | float | None | Minimum precursor neutral mass filter |
| --max-precursor-neutral-mass | float | None | Maximum precursor neutral mass filter |
| --mz-precision | int | 5 | Number of decimal places for m/z values |
| --intensity-precision | int | 0 | Number of decimal places for intensity values |
| --keep-empty-spectra | flag | False | Write empty spectra to output file |
| --overwrite | flag | False | Overwrite existing output files |
| --workers | int | 1 | Number of worker threads for processing multiple .d folders |
| -v, --verbose | flag | False | Enable verbose logging |
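
To make the peak-level filters concrete, here is a minimal sketch of how --remove-precursor, --min-spectra-intensity, and --top-n-peaks could act on one spectrum. It is one plausible reading of the descriptions, not the package's implementation: `filter_peaks` is a hypothetical function, the precursor width is assumed to be the total window (half on each side), values from 0.0 to 1.0 are assumed to be fractions of the most intense remaining peak, and the order of the steps is an assumption.

```python
def filter_peaks(peaks, precursor_mz=None, precursor_peak_width=2.0,
                 min_intensity=None, top_n=None):
    """Filter a list of (mz, intensity) tuples; returns peaks sorted by m/z."""
    if precursor_mz is not None:
        # Assumption: the width is the full window, so remove +/- width / 2
        half = precursor_peak_width / 2
        peaks = [(mz, i) for mz, i in peaks if abs(mz - precursor_mz) > half]
    if min_intensity is not None and peaks:
        if 0.0 <= min_intensity <= 1.0:
            # Assumption: relative thresholds are fractions of the base peak
            threshold = min_intensity * max(i for _, i in peaks)
        else:
            threshold = min_intensity  # absolute intensity
        peaks = [(mz, i) for mz, i in peaks if i >= threshold]
    if top_n is not None:
        # Keep the N most intense peaks
        peaks = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    return sorted(peaks)


spectrum = [(100.0, 10.0), (200.0, 500.0), (300.0, 50.0),
            (500.2, 1000.0), (600.0, 5.0)]
print(filter_peaks(spectrum, precursor_mz=500.0, min_intensity=0.05, top_n=2))
# [(200.0, 500.0), (300.0, 50.0)]
```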

Format-Specific Arguments

MS2 Extractor Only:

  • --ip2: Use IP2 preset settings (sets min charge to 2, top 500 peaks)

MGF Extractor Only:

  • --casanovo: Use Casanovo preset settings (enables precursor removal, top-150 peaks, min intensity 0.01, m/z range 50-2500, min charge 2)
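
Reading the preset descriptions against the shared argument table, each preset appears to correspond to a set of explicit flags. The mapping below is inferred from those descriptions and is an assumption (in particular, "min intensity" is taken to mean --min-spectra-intensity); `PRESET_FLAGS` and `explicit_command` are hypothetical names used only to print the equivalent command line.

```python
import shlex

# Assumed flag equivalents, inferred from the preset descriptions
PRESET_FLAGS = {
    "ip2": ["--min-precursor-charge", "2", "--top-n-peaks", "500"],
    "casanovo": [
        "--remove-precursor",
        "--top-n-peaks", "150",
        "--min-spectra-intensity", "0.01",
        "--min-spectra-mz", "50",
        "--max-spectra-mz", "2500",
        "--min-precursor-charge", "2",
    ],
}


def explicit_command(tool, analysis_dir, preset):
    """Build the command line a preset is assumed to stand for."""
    return shlex.join([tool, analysis_dir, *PRESET_FLAGS[preset]])


print(explicit_command("ms2-ex", "/path/to/sample.d", "ip2"))
print(explicit_command("mgf-ex", "/path/to/sample.d", "casanovo"))
```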

mzML Extractor Only:

| Argument | Type | Default | Description |
|---|---|---|---|
| --no-ms1 | flag | False | Skip MS1 spectra; write only MS2 PASEF spectra |
| --mz-compression | str | zlib | Compression for m/z arrays (none, zlib, zstd, numpress-linear, numpress-slof, numpress-pic) |
| --intensity-compression | str | zlib | Compression for intensity arrays |
| --mobility-compression | str | zlib | Compression for per-peak ion mobility arrays (MS1) |
| --mz-encoding | int | 64 | Bit width for m/z values (32 or 64) |
| --intensity-encoding | int | 32 | Bit width for intensity values (32 or 64) |
| --centroid-noise-filter | str | none | Noise filter before centroiding (none, mad, percentile, histogram, baseline, iterative_median) |
| --centroid-mz-tolerance | float | 8.0 | m/z tolerance for centroiding |
| --centroid-mz-tolerance-type | str | ppm | Unit for m/z tolerance (ppm or da) |
| --centroid-im-tolerance | float | 0.05 | Ion mobility tolerance for centroiding |
| --centroid-im-tolerance-type | str | relative | Unit for ion mobility tolerance (relative or absolute) |
| --centroid-min-peaks | int | 5 | Minimum raw peaks required to form a centroided peak |
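
The difference between the two m/z tolerance units is easiest to see numerically: a ppm tolerance scales with m/z, while a Da tolerance is constant. The helper below is a hypothetical illustration of that standard conversion, not a function exposed by tdfextractor.

```python
def mz_tolerance_da(mz, tolerance=8.0, tolerance_type="ppm"):
    """Convert a centroiding tolerance to an absolute window in Da at a given m/z."""
    if tolerance_type == "ppm":
        return mz * tolerance / 1e6  # parts per million of the m/z value
    if tolerance_type == "da":
        return tolerance             # already absolute
    raise ValueError(f"unknown tolerance type: {tolerance_type!r}")


for mz in (200.0, 1000.0, 1700.0):
    print(mz, mz_tolerance_da(mz))
```

With the default of 8 ppm, the window is roughly 0.0016 Da at m/z 200 and 0.008 Da at m/z 1000.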

Performance Options

The --workers argument allows parallel processing of multiple .d folders:

# Process multiple .d folders with 4 worker threads
mgf-ex /path/to/directory_with_multiple_d_folders --workers 4

Note: --workers only has an effect when more than one .d folder is being processed, because each worker handles one complete .d folder independently.
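
The one-folder-per-worker model can be sketched with a thread pool. This is a simplified picture under assumptions: `process_folder` is a hypothetical stand-in for the real extraction of one .d folder, and the tool's actual scheduling may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def process_folder(folder):
    """Hypothetical stand-in for extracting one complete .d folder."""
    return f"{folder} done"


def process_all(folders, workers=1):
    """Process folders in parallel, one folder per worker at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each folder to a free worker and keeps input order
        return list(pool.map(process_folder, folders))


print(process_all(["a.d", "b.d", "c.d"], workers=4))
```

With a single folder, extra workers simply sit idle, which is why the option has no effect in that case.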
