Dataset

The PsmDataset class is used to define a collection of peptide-spectrum matches.

class crema.dataset.PsmDataset(psms, target_column, spectrum_columns, score_columns, peptide_column, protein_column, protein_delim, peptide_pairing=None, copy_data=True)

Store a collection of peptide-spectrum matches (PSMs).

Parameters:
psms : pandas.DataFrame

A pandas.DataFrame of PSMs.

target_column : str

The column that indicates whether a PSM is a target or a decoy. This column should be boolean, where True indicates a target and False indicates a decoy.

spectrum_columns : str or tuple of str

One or more columns that together define a unique mass spectrum.

score_columns : str or tuple of str, optional

One or more columns that indicate scores by which crema can rank PSMs.

peptide_column : str

The column that defines a unique peptide. Modifications should be indicated either in square brackets [] or parentheses (). The exact modification format within these entities does not matter, so long as it is consistent.

protein_column : str

The column that defines a unique protein.

protein_delim : str

The delimiter string that separates multiple proteins listed in the protein column.

peptide_pairing : dict, optional

A mapping of target peptide sequences to their paired decoy sequences, used for target-decoy competition (TDC). It should have the form {target_sequence: decoy_sequence}, where each decoy sequence is a shuffled version of its target sequence.

copy_data : bool, optional

If True, a deep copy of the data is created. This uses more memory, but is safer because it prevents accidental modification of the underlying data. This argument only has an effect when psms is a pandas.DataFrame.
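The conventions described for peptide_column, protein_delim, and peptide_pairing can be illustrated with a small standalone sketch. It does not call crema and is not crema's implementation: the helper names (strip_modifications, split_proteins, shuffle_pairing), the example peptides, and the comma delimiter are all hypothetical, chosen only to show what the documented formats look like in practice.

```python
import random
import re

# Matches one modification written in square brackets or parentheses.
MOD_PATTERN = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def strip_modifications(peptide):
    """Return the bare sequence of a modified peptide (hypothetical helper)."""
    return MOD_PATTERN.sub("", peptide)


def split_proteins(value, delim):
    """Split one protein-column entry into protein IDs (hypothetical helper)."""
    return [p.strip() for p in value.split(delim) if p.strip()]


def shuffle_pairing(targets, seed=0):
    """Map each target sequence to a shuffled decoy (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the pairing is reproducible
    pairing = {}
    for seq in targets:
        residues = list(seq)
        rng.shuffle(residues)
        pairing[seq] = "".join(residues)
    return pairing


# Modifications in either bracket style are removed the same way.
bare = strip_modifications("PEPT[+79.97]IDEM(ox)K")

# Assumed delimiter: a comma between protein IDs.
proteins = split_proteins("sp|P1|A,sp|P2|B", ",")

# A dict of the form {target_sequence: decoy_sequence}.
pairing = shuffle_pairing(["PEPTIDEK", "LESLIEK"])
```

A plain shuffle can occasionally return the original sequence, and real decoy generators often keep the C-terminal residue fixed; neither detail is handled here.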

Attributes:
data : pandas.DataFrame

The collection of PSMs as a pandas.DataFrame.

spectrum_columns : list of str
score_columns : list of str
target_column : str
peptide_column : str
protein_column : str
protein_delim : str

The delimiter to split protein IDs as a string.

methods : dict
peptide_pairing : dict

A dictionary containing target/decoy peptide pairs.

Methods

assign_confidence([score_column, desc, ...])

Assign confidence estimates to this collection of peptide-spectrum matches.

find_best_score([eval_fdr])

Find the best score for this collection of PSMs.

set_peptide_column(new_peptide_column)

Replaces the current peptide column with the input peptide column.

set_protein_column(new_protein_column)

Replaces the current protein column with the input protein column.

assign_confidence(score_column=None, desc=None, eval_fdr=0.01, method='tdc', pep_fdr_type='psm-peptide', prot_fdr_type='best', threshold=0.01)

Assign confidence estimates to this collection of peptide-spectrum matches.

Parameters:
score_column : str, optional

The score by which to rank the PSMs for confidence estimation. If None, the score that yields the most PSMs at the specified false discovery rate threshold (eval_fdr) will be used.

desc : bool, optional

True if higher scores are better, False if lower scores are better. If None, crema will try both and use the choice that yields the most PSMs at the specified false discovery rate threshold (eval_fdr). If score_column is None, this parameter is ignored.

eval_fdr : float, optional

The false discovery rate threshold used to evaluate the best score_column and desc to choose. This should range from 0 to 1.

method : {“tdc”}, optional

The method for crema to use when calculating the confidence estimates.

pep_fdr_type : {“psm-only”, “peptide-only”, “psm-peptide”}, optional

The method for crema to use when calculating peptide-level confidence estimates. Default is “psm-peptide”.

prot_fdr_type : {“best”, “combine”}, optional

The method for crema to use when calculating protein-level confidence estimates. Default is “best”.

threshold : float or “q-value”, optional

The FDR threshold for accepting discoveries. Default is 0.01. If “q-value” is chosen, the “accept” column is replaced with a “crema q-value” column.

Returns:
Confidence object

The confidence estimates for this PsmDataset.
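To make the role of desc and threshold concrete, here is a minimal standalone sketch of q-value estimation by target-decoy competition. It is not crema's code and does not call crema. It assumes the common estimate FDR = (decoys + 1) / targets at each rank, ignores tied scores, and works at the PSM level only; crema's actual handling of ties, the +1 correction, and the peptide and protein levels may differ. The function name tdc_qvalues and the toy scores are hypothetical.

```python
def tdc_qvalues(scores, targets, desc=True):
    """Estimate a q-value for every PSM from scores and target labels.

    scores  : list of float, one score per PSM
    targets : list of bool, True for a target and False for a decoy
    desc    : True if higher scores are better
    """
    # Rank PSMs from best to worst score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=desc)

    # Estimated FDR among everything ranked at or above each position.
    fdr = []
    n_targets = n_decoys = 0
    for i in order:
        if targets[i]:
            n_targets += 1
        else:
            n_decoys += 1
        fdr.append(min(1.0, (n_decoys + 1) / max(n_targets, 1)))

    # The q-value is the smallest FDR at this rank or any worse rank.
    qvalues = [0.0] * len(scores)
    running_min = 1.0
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr[rank])
        qvalues[order[rank]] = running_min
    return qvalues


# Toy data: five PSMs, the fourth is a decoy.
qvalues = tdc_qvalues([10, 9, 8, 7, 6], [True, True, True, False, True])

# A float threshold turns q-values into accept/reject decisions.
# 0.4 is used here only because the toy data set is tiny.
accepted = [q <= 0.4 for q in qvalues]
```

With so few PSMs the default threshold of 0.01 would accept nothing, which is why the toy example uses a much looser cutoff.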

find_best_score(eval_fdr=0.01)

Find the best score for this collection of PSMs.

Try each of the available score columns, determining how many PSMs are detected below the provided false discovery rate threshold. The best score is the one that yields the most PSMs.

Parameters:
eval_fdr : float

The false discovery rate threshold used to find the best score.

Returns:
best_score : str

The best score.

n_passing : int

The number of PSMs that meet the specified FDR threshold.

desc : bool

True if higher scores are better, False if lower scores are better.
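The selection procedure can be sketched in a few lines of standalone Python. This is an illustration only, not crema's implementation: it assumes that "detected" means target PSMs whose estimated q-value is at or below eval_fdr, uses the (decoys + 1) / targets FDR estimate, and ignores ties. The names count_passing and best_score_sketch, the column names, and the toy scores are hypothetical.

```python
def count_passing(scores, targets, eval_fdr, desc):
    """Count target PSMs accepted at eval_fdr for one score and direction."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=desc)
    n_targets = n_decoys = n_passing = 0
    for i in order:
        if targets[i]:
            n_targets += 1
        else:
            n_decoys += 1
        # Every target ranked at or above the last position where the
        # estimated FDR is within the threshold has q-value <= eval_fdr.
        if n_targets and (n_decoys + 1) / n_targets <= eval_fdr:
            n_passing = n_targets
    return n_passing


def best_score_sketch(score_table, targets, eval_fdr=0.01):
    """Return (best_score, n_passing, desc) over all columns and directions."""
    best = (None, -1, None)
    for name, scores in score_table.items():
        for desc in (True, False):
            n = count_passing(scores, targets, eval_fdr, desc)
            if n > best[1]:
                best = (name, n, desc)
    return best


# Toy data: four targets followed by two decoys.
targets = [True, True, True, True, False, False]
score_table = {
    "xcorr": [5, 4, 3, 2, 1, 0.5],
    "evalue": [0.001, 0.01, 0.5, 0.6, 0.4, 0.45],
}

# A loose threshold is used only because the toy data set is tiny.
result = best_score_sketch(score_table, targets, eval_fdr=0.5)
```

Here the hypothetical "xcorr" column separates targets from decoys cleanly when higher scores are treated as better, so it is selected with desc=True.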

set_peptide_column(new_peptide_column)

Replaces the current peptide column with the input peptide column.

Parameters:
new_peptide_column : pandas.Series
Returns:
set_protein_column(new_protein_column)

Replaces the current protein column with the input protein column.

Parameters:
new_protein_column : pandas.Series
Returns:
property columns

The columns of the PSM pandas.DataFrame.

property data

The collection of PSMs as a pandas.DataFrame.

property peptide_pairing

A dictionary containing target/decoy peptide pairs.

property peptides

The peptides as a pandas.Series.

property protein_delim

The delimiter to split protein IDs as a string.

property proteins

The proteins as a pandas.Series.

property scores

The scores for each PSM as a pandas.DataFrame.

property spectra

The mass spectrum identifiers as a pandas.DataFrame.

property targets

An array indicating whether each PSM is a target.