Dataset
The PsmDataset class is used to define a collection of peptide-spectrum matches.
- class crema.dataset.PsmDataset(psms, target_column, spectrum_columns, score_columns, peptide_column, protein_column, protein_delim, peptide_pairing=None, copy_data=True)
Store a collection of peptide-spectrum matches (PSMs).
- Parameters:
- psms : pandas.DataFrame
A pandas.DataFrame of PSMs.
- target_column : str
The column that indicates whether a PSM is a target or a decoy. This column should be boolean, where True indicates a target and False indicates a decoy.
- spectrum_columns : str or tuple of str
One or more columns that together define a unique mass spectrum.
- score_columns : str or tuple of str, optional
One or more columns that indicate scores by which crema can rank PSMs.
- peptide_column : str
The column that defines a unique peptide. Modifications should be indicated either in square brackets [] or parentheses (). The exact modification format within these delimiters does not matter, so long as it is consistent.
- protein_column : str
The column that defines a unique protein.
- protein_delim : str
The string delimiter that separates multiple proteins found in the protein column.
- peptide_pairing : dict, optional
A map of target and decoy peptide sequence pairings to be used for target-decoy competition (TDC). This should be in the form {target_sequence: decoy_sequence}, where decoy sequences are shuffled versions of target sequences.
- copy_data : bool, optional
If True, a deep copy of the data is created. This uses more memory, but is safer because it prevents accidental modification of the underlying data. This argument only has an effect when psms is a pandas.DataFrame.
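The parameters above can be sketched as a small input-preparation example. The records, the column names (scan, peptide, protein, score, target), and the helpers strip_modifications and make_dataset are hypothetical and chosen only for illustration. The constructor call follows the signature documented above and is wrapped in a function so that the rest of the sketch runs without pandas or crema installed.

```python
import re

# Hypothetical PSM records; the column names are illustrative only.
RECORDS = [
    {"scan": 1, "peptide": "PEPT[+79.97]IDEK", "protein": "P1,P2", "score": 3.2, "target": True},
    {"scan": 1, "peptide": "KEDITPEP", "protein": "decoy_P1", "score": 1.1, "target": False},
    {"scan": 2, "peptide": "M(ox)SAMPLER", "protein": "P3", "score": 2.7, "target": True},
]

# Column arguments matching the parameters documented above.
COLUMN_ARGS = dict(
    target_column="target",
    spectrum_columns="scan",
    score_columns="score",
    peptide_column="peptide",
    protein_column="protein",
    protein_delim=",",
)


def strip_modifications(peptide):
    """Drop anything inside [] or () to recover the bare sequence."""
    return re.sub(r"\[[^\]]*\]|\([^)]*\)", "", peptide)


def make_dataset(records):
    """Build a PsmDataset; assumes pandas and crema are installed."""
    import pandas as pd
    from crema.dataset import PsmDataset

    return PsmDataset(pd.DataFrame(records), **COLUMN_ARGS)


# Sanity checks on the inputs before handing them to crema.
assert all(isinstance(r["target"], bool) for r in RECORDS)
bare = [strip_modifications(r["peptide"]) for r in RECORDS]
proteins = [r["protein"].split(COLUMN_ARGS["protein_delim"]) for r in RECORDS]
```

With pandas and crema available, make_dataset(RECORDS) would return the dataset object. The bare sequences and split protein lists are only there to show how the bracket and delimiter conventions are meant to be read.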
- Attributes:
- data : pandas.DataFrame
The collection of PSMs as a pandas.DataFrame.
- spectrum_columns : list of str
- score_columns : list of str
- target_column : str
- peptide_column : str
- protein_column : str
- protein_delim : str
The delimiter to split protein IDs as a string.
- methods : dict
- peptide_pairing : dict
A dictionary containing target/decoy peptide pairs.
Methods
- assign_confidence([score_column, desc, ...])
Assign confidence estimates to this collection of peptide-spectrum matches.
- find_best_score([eval_fdr])
Find the best score for this collection of PSMs.
- set_peptide_column(new_peptide_column)
Replaces current peptide column with input peptide column.
- set_protein_column(new_protein_column)
Replaces current protein column with input protein column.
- assign_confidence(score_column=None, desc=None, eval_fdr=0.01, method='tdc', pep_fdr_type='psm-peptide', prot_fdr_type='best', threshold=0.01)
Assign confidence estimates to this collection of peptide-spectrum matches.
- Parameters:
- score_column : str, optional
The score by which to rank the PSMs for confidence estimation. If None, the score that yields the most PSMs at the specified false discovery rate threshold (eval_fdr) will be used.
- desc : bool, optional
True if higher scores are better, False if lower scores are better. If None, crema will try both and use the choice that yields the most PSMs at the specified false discovery rate threshold (eval_fdr). If score_column is None, this parameter is ignored.
- eval_fdr : float, optional
The false discovery rate threshold used to evaluate the best score_column and desc to choose. This should range from 0 to 1.
- method : {"tdc"}, optional
The method for crema to use when calculating the confidence estimates.
- pep_fdr_type : {"psm-only", "peptide-only", "psm-peptide"}, optional
The method for crema to use when calculating peptide-level confidence estimates. Default is "psm-peptide".
- prot_fdr_type : {"best", "combine"}, optional
The method for crema to use when calculating protein-level confidence estimates. Default is "best".
- threshold : float or "q-value", optional
The FDR threshold for accepting discoveries. Default is 0.01. If "q-value" is chosen, then the "accept" column is replaced with "crema q-value".
- Returns:
- Confidence object
The confidence estimates for this PsmDataset.
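As a rough illustration of what method="tdc" and threshold refer to, the sketch below computes q-values with a common target-decoy competition estimator, FDR = (decoys + 1) / targets. This is a generic textbook form and an assumption on our part. It is not taken from crema's source, it ignores tied scores, and the function name tdc_qvalues is hypothetical.

```python
def tdc_qvalues(scores, targets, desc=True):
    """Estimate q-values from scores and target/decoy labels.

    Uses FDR = (decoys + 1) / targets at each score threshold,
    then takes a running minimum so q-values never decrease
    as the ranking gets worse.
    """
    # Rank PSMs from best to worst score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=desc)

    fdrs = []
    n_targets = 0
    n_decoys = 0
    for i in order:
        if targets[i]:
            n_targets += 1
        else:
            n_decoys += 1
        fdrs.append(min(1.0, (n_decoys + 1) / max(n_targets, 1)))

    # q-value: the smallest FDR at this threshold or any more permissive one.
    qvalues = [0.0] * len(scores)
    running_min = 1.0
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        qvalues[order[rank]] = running_min
    return qvalues


def accept(qvalues, threshold=0.01):
    """Flag discoveries whose q-value is at or below the threshold."""
    return [q <= threshold for q in qvalues]
```

For example, tdc_qvalues([10, 9, 8, 7, 6, 5], [True, True, True, False, True, False]) ranks the six PSMs by score and returns one q-value per PSM in the original order.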
- find_best_score(eval_fdr=0.01)
Find the best score for this collection of PSMs.
Try each of the available score columns, determining how many PSMs are detected below the provided false discovery rate threshold. The best score is the one that returns the most.
- Parameters:
- eval_fdr : float
The false discovery rate threshold used to find the best score.
- Returns:
- best_score : str
The best score.
- n_passing : int
The number of PSMs that meet the specified FDR threshold.
- desc : bool
True if higher scores are better, False if lower scores are better.
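The selection logic described for find_best_score can be sketched in plain Python. The estimator FDR = (decoys + 1) / targets, the helper count_passing, and the example score names are assumptions made for illustration. crema's own implementation may count PSMs or break ties differently.

```python
def count_passing(scores, targets, desc, eval_fdr):
    """Count targets accepted at eval_fdr for one score and direction."""
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=desc)
    n_targets = 0
    n_decoys = 0
    best = 0
    for _, is_target in ranked:
        if is_target:
            n_targets += 1
        else:
            n_decoys += 1
        # Keep the largest prefix whose estimated FDR is within the limit.
        if n_targets and (n_decoys + 1) / n_targets <= eval_fdr:
            best = n_targets
    return best


def find_best_score(score_table, targets, eval_fdr=0.01):
    """Try every score column in both directions; keep the one passing the most."""
    best = (None, -1, None)
    for name, scores in score_table.items():
        for desc in (True, False):
            n_passing = count_passing(scores, targets, desc, eval_fdr)
            if n_passing > best[1]:
                best = (name, n_passing, desc)
    return best


# Hypothetical scores for six PSMs: four targets followed by two decoys.
labels = [True, True, True, True, False, False]
table = {
    "xcorr": [5, 4, 3, 2, 1, 0],
    "evalue": [0.5, 0.01, 0.9, 0.02, 0.8, 0.03],
}
result = find_best_score(table, labels, eval_fdr=0.25)
```

The returned tuple mirrors the documented return values: the score name, the number passing, and the direction. A loose eval_fdr of 0.25 is used here only because the toy dataset is too small to pass at 0.01.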
- set_peptide_column(new_peptide_column)
Replaces current peptide column with input peptide column.
- Parameters:
- new_peptide_column : pandas.Series
- Returns:
- set_protein_column(new_protein_column)
Replaces current protein column with input protein column.
- Parameters:
- new_protein_column : pandas.Series
- Returns:
- property columns
The columns of the PSM pandas.DataFrame.
- property data
The collection of PSMs as a pandas.DataFrame.
- property peptide_pairing
A dictionary containing target/decoy peptide pairs.
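A peptide pairing of the kind described here maps each target sequence to its shuffled decoy. The sketch below builds such a dictionary under two assumptions that are not stated in this reference: the C-terminal residue is kept in place, which is a common decoy convention, and the sequences carry no modification annotations. The helper shuffle_peptide and the example sequences are hypothetical.

```python
import random


def shuffle_peptide(sequence, rng):
    """Shuffle every residue except the last, which stays in place."""
    residues = list(sequence[:-1])
    rng.shuffle(residues)
    return "".join(residues) + sequence[-1]


# A seeded generator keeps the pairing reproducible between runs.
rng = random.Random(0)
target_peptides = ["PEPTIDEK", "SAMPLER", "ELVISLIVESK"]

# Keys are target sequences, values are their shuffled decoys.
peptide_pairing = {t: shuffle_peptide(t, rng) for t in target_peptides}
```

In practice the pairing should come from whatever tool generated the decoy database, so that the decoys in the dictionary match the ones actually searched.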
- property peptides
The peptides as a pandas.Series.
- property protein_delim
The delimiter to split protein IDs as a string.
- property proteins
The proteins as a pandas.Series.
- property scores
The scores for each PSM as a pandas.DataFrame.
- property spectra
The mass spectrum identifiers as a pandas.DataFrame.
- property targets
An array indicating whether each PSM is a target.