Aggregating analysis results

Overview of software:

  1. Analyzing component:

    1. Reads in syn.txt files & calculated the best scoring peptides for each scan.

    2. Even though, MSGF+ estimates False discovery rates(FDRs) in some datasets MSGFplus tool when dealing with SPLIT FASTAs(multiple FASTA for the same sample) doesn’t actually account all of them due to which the QValue and PepQValue value aren’t based on the entire FASTA file for that dataset. Therefore, we’re recomputing QValue and PepQValue to improve the FDR.

    3. merges the outputs from MSGF+ and MASIC, and applies to filter to control the false discovery rate. The output is a crosstab format table with rows containing protein sequence information, and columns with relative abundance measurements for proteins identified in each sample analyzed.

Merging Algorithm:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
- Given an User-input:
     - Generate
        - start_file.xlsx
        - job_info_query.xlsx
     - For each dataset:
         **Algorithm for merging MSGF+ and MASIC analysis job results**
         - Combine multiple `MSGFplusJob results`_:
              1. Merge "*_msgfplus_syn.txt" files based on scan_id.
              2. keep the best scoring peptide(minimum MSGFDB_SpecEValue candidate) for each scan.
              3. Generate `consolidate_syn`( big peptide file).
              4. Generate,`recomupted_consolidate_syn`
                [improving the "False Discovery Rate" algorithm on `Consolidate_syn`.](###Algorithm-to-Recompute-the-qvalues)
              5. Obtain protein information from below files & merge it with `recomupted_consolidate_syn`.
                      `*_msgfplus_syn_SeqToProteinMap.txt`(protein file) and
                      `*_msgfplus_syn_ResultToSeqMap.txt` {mapper file}
              6. Generate `MSGFjobs_Merged`.
         - Combine `MSGFjobs_Merged` with MASIC results
              1. Merge on the basis of `Scan_id` in `_SICstats.txt`.
              2. Generate `MSGFjobs_MASIC_resultant`.

Recompute the qvalues Algorithm:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
  In some datasets MSGFplus tool dealing with SPLIT FASTAs due to which the `QValue` and `PepQValue` value aren't
  based on the entire FASTA file for that dataset. Therefore, we're recomputing these 2 columns after we get
  `Consolidate_syn` (by merging all the MSGFplusjobs ran on multiple FASTAs)
   1) Using the Consolidate_syn object
   3) For each scan, select the peptide with the lowest Spec EValue
       - If there is more than one peptide with a tied spec evalue, select both of them
   4) Examine the proteins for the selected peptide (or peptides) in the given scan
        - If all of the proteins start with XXX_, this is a Decoy (aka reverse) PSM
        - Otherwise, if any of the proteins does not start with XXX, this is a forward PSM
   5) Remove the prefix and suffix letters from the peptide
       - For example, given peptide K.SPVGKS*PPSTGSTYGSSQKEESAASGGAAYTKR.Y, remove K. and .Y to get SPVGKS*PPSTGSTYGSSQKEESAASGGAAYTKR
       - Call this the base peptide
   6) Look for that base peptide in a new in-memory table called the UniquePeptide table
       - If the table does not have the peptide, add it, storing BasePeptide, SpecEValue, and IsDecoy=True or false
       - If the table does have the peptide, and if the SpecEValue for the base peptide in this scan is lower, update the update the entry for the peptide to have the lower SpecEValue
   7) Once all scans are processed, your new in-memory table will have one row per base peptide
       - If you had encountered these two peptides, you would have two rows; that's OK (and appropriate, since the * symbol is in different locations)
         SPVGKS*PPSTGSTYGSSQKEESAASGGAAYTKR
      SPVGKSPPSTGSTYGSS*QKEESAASGGAAYTKR
   8) Sort this unique peptide table on SpecEValue, then IsDecoy, then peptide (thus, for entries with the same SpecEValue, the IsDecoy=false ones should be first)
   9) Step through that table and count the number of Forward hits and the number of Decoy hits, up to the current row
   10) Compute DecoyCount / ForwardCount; this is the PepQValue
   11) When the DecoyCount value increments by one, assign the DecoyCount / ForwardCount value for the previous row to all previous rows with the same DecoyCount value
   12) Once all rows have been processed, go back to the original table and assign an updated PepQValue by linking the two tables on Peptide (linking on Base Peptide)
   13) Report both the original PepQValue and your re-computed PepQValue in the output table so that we can assure that they're similar.

Data flow diagram:

Input:

Output:

  • start_file.csv

Dataset_ID

MSGFPlusJob

MSGFplus_loc

NewestMasicJob

MASIC_loc

  • job_info_query.csv

Job

Dataset

Experiment

OrganismDBName

ProteinCollectionList

ParameterFileName

  • consolidate_syn.csv

ResultID

Scan

FragMethod

SpecIndex

Charge

PrecursorMZ

DelM

DelM_PPM

MH

Peptide

Protein

NTT

DeNovoScore

MSGFScore

MSGFDB_SpecEValue

Rank_MSGFDB_SpecEValue

EValue

QValue

PepQValue

IsotopeError

JobNum

Dataset

  • recomupted_consolidate_syn.csv

ResultID

Scan

FragMethod

SpecIndex

Charge

PrecursorMZ

DelM

DelM_PPM

MH

Peptide

Protein

NTT

DeNovoScore

MSGFScore

MSGFDB_SpecEValue

Rank_MSGFDB_SpecEValue

EValue

QValue

PepQValue

IsotopeError

JobNum

Dataset

recomputedQValue

recomputedPepQValue

  • MSGFjobs_Merged.csv

ResultID

Scan

FragMethod

SpecIndex

Charge

PrecursorMZ

DelM

DelM_PPM

MH

Peptide

Protein

NTT

DeNovoScore

MSGFScore

MSGFDB_SpecEValue

Rank_MSGFDB_SpecEValue

EValue

QValue

PepQValue

IsotopeError

JobNum

Dataset

Unique_Seq_ID

Cleavage_State

Terminus_State

Protein_Expectation_Value_Log(e)

Protein_Intensity_Log(I)

  • MSGFjobs_MASIC_resultant.csv and resultants_df.csv

ResultID

Scan

FragMethod

SpecIndex

Charge

PrecursorMZ

DelM

DelM_PPM

MH

Peptide

Protein

NTT

DeNovoScore

MSGFScore

MSGFDB_SpecEValue

Rank_MSGFDB_SpecEValue

EValue

QValue

PepQValue

IsotopeError

JobNum

Dataset_x

Unique_Seq_ID

Cleavage_State

Terminus_State

Protein_Expectation_Value_Log(e)

Protein_Intensity_Log(I)

Dataset_y

ParentIonIndex

MZ

SurveyScanNumber

OptimalPeakApexScanNumber

PeakApexOverrideParentIonIndex

CustomSICPeak

PeakScanStart

PeakScanEnd

PeakScanMaxIntensity

PeakMaxIntensity

PeakSignalToNoiseRatio

FWHMInScans

PeakArea

ParentIonIntensity

PeakBaselineNoiseLevel

PeakBaselineNoiseStDev

PeakBaselinePointsUsed

StatMomentsArea

CenterOfMassScan

PeakStDev

PeakSkew

PeakKSStat

StatMomentsDataCountUsed

Moving on

Now it is time to move on to Report generation.