Dataset

The PsmDataset class is used to define a collection of peptide-spectrum matches.

class crema.dataset.PsmDataset(psms, target_column, spectrum_columns, score_columns, peptide_column, protein_column, protein_delim, peptide_pairing=None, copy_data=True)

Store a collection of peptide-spectrum matches (PSMs).

Parameters:

psms (pandas.DataFrame): A pandas.DataFrame of PSMs.
target_column (str): The column that indicates whether a PSM is a target or a decoy. This column should be boolean, where True indicates a target and False indicates a decoy.
spectrum_columns (str or tuple of str): One or more columns that together define a unique mass spectrum.
score_columns (str or tuple of str, optional): One or more columns that contain scores by which crema can rank PSMs.
peptide_column (str): The column that defines a unique peptide. Modifications should be indicated either in square brackets [] or parentheses (). The exact modification format within these delimiters does not matter, so long as it is consistent.
protein_column (str): The column that defines a unique protein.
protein_delim (str): The delimiter string that separates multiple proteins listed in the protein column.
peptide_pairing (dict, optional): A map of target and decoy peptide sequence pairings to be used for target-decoy competition (TDC). This should be in the form {target_sequence: decoy_sequence}, where decoy sequences are shuffled versions of target sequences.
copy_data (bool, optional): If True, a deep copy of the data is created. This uses more memory, but is safer because it prevents accidental modification of the underlying data. This argument only has an effect when psms is a pandas.DataFrame.
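The string-valued inputs described above can be illustrated with a small standard-library sketch. It shows one way bracketed or parenthesized modifications could be stripped from a peptide, how a protein field splits on protein_delim, and what a {target_sequence: decoy_sequence} map looks like. The helper names (strip_modifications, split_proteins, shuffle_pairing) and the example sequences are hypothetical and are not part of crema; crema's own handling may differ.

```python
import random
import re


def strip_modifications(peptide):
    """Remove modifications written in [] or () from a peptide string.

    Hypothetical helper for illustration; not a crema function.
    """
    return re.sub(r"\[[^\]]*\]|\([^)]*\)", "", peptide)


def split_proteins(protein_field, protein_delim):
    """Split a protein column entry into individual protein IDs."""
    return [p for p in protein_field.split(protein_delim) if p]


def shuffle_pairing(target_sequences, seed=0):
    """Build a {target_sequence: decoy_sequence} map by shuffling residues.

    Assumption: a plain shuffle of all residues; real decoy generators
    may use a different scheme.
    """
    rng = random.Random(seed)
    pairing = {}
    for target in target_sequences:
        residues = list(target)
        rng.shuffle(residues)
        pairing[target] = "".join(residues)
    return pairing


bare = strip_modifications("PEPT[+79.97]IDEK")
proteins = split_proteins("sp|P1|A,sp|P2|B", ",")
pairing = shuffle_pairing(["PEPTIDEK", "LESLIEK"])
```

A dictionary shaped like pairing is the kind of value the peptide_pairing parameter describes: each key is a target sequence and each value is a decoy with the same residue composition.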

Attributes:

data (pandas.DataFrame): The collection of PSMs as a pandas.DataFrame.
spectrum_columns (list of str)
score_columns (list of str)
target_column (str)
peptide_column (str)
protein_column (str)
protein_delim (str): The delimiter to split protein IDs as a string.
methods (dict)
peptide_pairing (dict): A dictionary containing target/decoy peptide pairs.

Methods

`assign_confidence`([score_column, desc, ...])	Assign confidence estimates to this collection of peptide-spectrum matches.
`find_best_score`([eval_fdr])	Find the best score for this collection of PSMs.
`set_peptide_column`(new_peptide_column)	Replace the current peptide column with the input peptide column.
`set_protein_column`(new_protein_column)	Replace the current protein column with the input protein column.

assign_confidence(score_column=None, desc=None, eval_fdr=0.01, method='tdc', pep_fdr_type='psm-peptide', prot_fdr_type='best', threshold=0.01)

Assign confidence estimates to this collection of peptide-spectrum matches.

Parameters:

score_column (str, optional): The score by which to rank the PSMs for confidence estimation. If None, the score that yields the most PSMs at the specified false discovery rate threshold (eval_fdr) will be used.
desc (bool, optional): True if higher scores are better, False if lower scores are better. If None, crema will try both and use the choice that yields the most PSMs at the specified false discovery rate threshold (eval_fdr). If score_column is None, this parameter is ignored.
eval_fdr (float, optional): The false discovery rate threshold used to evaluate which score_column and desc to choose. This should range from 0 to 1.
method ({“tdc”}, optional): The method for crema to use when calculating the confidence estimates.
pep_fdr_type ({“psm-only”, “peptide-only”, “psm-peptide”}, optional): The method for crema to use when calculating peptide-level confidence estimates. Default is “psm-peptide”.
prot_fdr_type ({“best”, “combine”}, optional): The method for crema to use when calculating protein-level confidence estimates. Default is “best”.
threshold (float or “q-value”, optional): The FDR threshold for accepting discoveries. Default is 0.01. If “q-value” is chosen, the “accept” column is replaced with “crema q-value”.

Returns:

Confidence object: The confidence estimates for this PsmDataset.
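To make the roles of desc and threshold concrete, here is a minimal standard-library sketch of target-decoy q-value estimation. It is not crema's implementation: the function name tdc_qvalues, the +1 correction in the FDR estimate, and the handling of ties are assumptions chosen for illustration.

```python
def tdc_qvalues(scores, targets, desc=True):
    """Estimate q-values from target/decoy labels (illustrative sketch).

    scores  : list of float, one score per PSM
    targets : list of bool, True for a target and False for a decoy
    desc    : True if higher scores are better
    """
    # Rank PSMs from best to worst score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=desc)

    # Estimated FDR at each rank: (decoys + 1) / targets, capped at 1.
    # The +1 correction is an assumption of this sketch.
    n_targets = n_decoys = 0
    fdr = []
    for i in order:
        if targets[i]:
            n_targets += 1
        else:
            n_decoys += 1
        fdr.append(min(1.0, (n_decoys + 1) / max(n_targets, 1)))

    # q-value: the smallest FDR at this rank or any worse rank.
    qvals = [1.0] * len(scores)
    running_min = 1.0
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr[rank])
        qvals[order[rank]] = running_min
    return qvals


scores = [10, 9, 8, 7, 6, 5]
targets = [True, True, True, False, True, False]
qvals = tdc_qvalues(scores, targets, desc=True)

# Accept discoveries at a (deliberately loose) threshold of 0.5.
accept = [q <= 0.5 for q in qvals]
```

With only six PSMs, no q-value can reach the default threshold of 0.01, which is why the usage lines above use 0.5. Real datasets contain enough PSMs for a 1% threshold to be meaningful.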

find_best_score(eval_fdr=0.01)

Find the best score for this collection of PSMs.

Try each of the available score columns, determining how many PSMs are detected below the provided false discovery rate threshold. The best score is the one that returns the most.

Parameters:

eval_fdr (float): The false discovery rate threshold used to find the best score.

Returns:

best_score (str): The best score.
n_passing (int): The number of PSMs that meet the specified FDR threshold.
desc (bool): True if higher scores are better, False if lower scores are better.
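The selection rule described above (try every score column, count the PSMs passing the FDR threshold, keep the one with the most) can be sketched in plain Python. This is an illustration under stated assumptions, not crema's code: tdc_qvalues and find_best_score_sketch are hypothetical helpers, the q-value estimate uses a +1 correction, only target PSMs are counted as passing, and both score directions are tried for every column.

```python
def tdc_qvalues(scores, targets, desc=True):
    """Illustrative target-decoy q-values (assumed +1 correction)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=desc)
    n_targets = n_decoys = 0
    fdr = []
    for i in order:
        n_targets += targets[i]
        n_decoys += not targets[i]
        fdr.append(min(1.0, (n_decoys + 1) / max(n_targets, 1)))
    qvals = [1.0] * len(scores)
    running_min = 1.0
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr[rank])
        qvals[order[rank]] = running_min
    return qvals


def find_best_score_sketch(score_table, targets, eval_fdr=0.01):
    """Return (best_score, n_passing, desc) for a dict of score columns."""
    best_score, n_passing, best_desc = None, -1, None
    for name, scores in score_table.items():
        for desc in (True, False):
            qvals = tdc_qvalues(scores, targets, desc=desc)
            # Count target PSMs whose q-value meets the threshold.
            n = sum(1 for q, t in zip(qvals, targets) if t and q <= eval_fdr)
            if n > n_passing:
                best_score, n_passing, best_desc = name, n, desc
    return best_score, n_passing, best_desc


targets = [True, True, True, True, False, False]
score_table = {
    "good": [6, 5, 4, 3, 2, 1],   # separates targets from decoys
    "noisy": [1, 6, 2, 5, 3, 4],  # mixes targets and decoys
}
result = find_best_score_sketch(score_table, targets, eval_fdr=0.3)
```

The returned tuple mirrors the three documented return values: the name of the winning score column, the number of passing PSMs, and the direction in which that score was ranked.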

set_peptide_column(new_peptide_column)

Replace the current peptide column with the input peptide column.

Parameters:

new_peptide_column (pandas.Series)

Returns:

set_protein_column(new_protein_column)

Replace the current protein column with the input protein column.

Parameters:

new_protein_column (pandas.Series)

Returns:

property columns: The columns of the PSM pandas.DataFrame.

property data: The collection of PSMs as a pandas.DataFrame.

property peptide_pairing: A dictionary containing target/decoy peptide pairs.

property peptides: The peptides as a pandas.Series.

property protein_delim: The delimiter to split protein IDs as a string.

property proteins: The proteins as a pandas.Series.

property scores: The scores for each PSM as a pandas.DataFrame.

property spectra: The mass spectrum identifiers as a pandas.DataFrame.

property targets: An array indicating whether each PSM is a target.