Inspect & map identifiers

To make data queryable by an entity identifier, the identifiers need to comply with a chosen standard. Bionty enables this by mapping metadata against versioned ontologies using inspect().

from bionty import Gene, CellMarker
import pandas as pd

Genes

To illustrate this, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.

data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "hgnc id": ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "corrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")

df_orig

	gene symbol	hgnc id
ensembl_gene_id
ENSG00000148584	A1CF	HGNC:24086
ENSG00000121410	A1BG	HGNC:5
ENSG00000188389	FANCD1	HGNC:1101
corrupted	corrupted	corrupted

We require a reference identifier (specified as the reference_id parameter for curate; in the inspect() calls below it is passed as the second argument). The list of available identifiers can be looked up using lookup(). Examples are “ontology_id”, which corresponds to the IDs of the ontology terms (e.g. ‘ENSG00000148584’), and “name”, which corresponds to the ontology term names (e.g. ‘A1CF’).

To curate the DataFrame into queryable form, we create an index that corresponds to a default identifier. By default we use ensembl_gene_id. The default behavior is to curate the index if a column name is not provided.

First we create a Gene() instance using the default source database and version.

gene_bionty = Gene()

Next, we check whether our values can be mapped against the ontology.

gene_bionty.inspect(df_orig.index, gene_bionty.ensembl_gene_id)

✅ 3 terms (75.0%) are mapped.

🔶 1 terms (25.0%) are not mapped.

{'mapped': ['ENSG00000148584', 'ENSG00000121410', 'ENSG00000188389'],
 'not_mapped': ['corrupted']}
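Conceptually, the result above is a membership check of each identifier against a reference table. The following is a minimal sketch of that idea in plain Python, not Bionty's actual implementation: the function name inspect_ids and the three-entry reference set are hypothetical and only mirror the values shown in the output above.

```python
def inspect_ids(identifiers, reference):
    """Toy stand-in for an inspection step: split identifiers by reference membership."""
    reference = set(reference)
    mapped = [i for i in identifiers if i in reference]
    not_mapped = [i for i in identifiers if i not in reference]
    return {"mapped": mapped, "not_mapped": not_mapped}


# Hypothetical reference: a tiny hand-picked subset, not the real Ensembl table.
reference_ids = {"ENSG00000148584", "ENSG00000121410", "ENSG00000188389"}
ids = ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389", "corrupted"]

result = inspect_ids(ids, reference_ids)
pct_mapped = 100 * len(result["mapped"]) / len(ids)
print(f"{len(result['mapped'])} terms ({pct_mapped:.1f}%) are mapped.")
print(result["not_mapped"])
```

The order of the input is preserved in both lists, which makes it easy to trace an unmapped value back to its row.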

The same procedure is available for gene symbols. First, we inspect which symbols are mappable against the ontology.

gene_bionty.inspect(df_orig["gene symbol"], gene_bionty.symbol)

🔶 The identifiers contain synonyms!

💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'

✅ 2 terms (50.0%) are mapped.

🔶 2 terms (50.0%) are not mapped.

{'mapped': ['A1CF', 'A1BG'], 'not_mapped': ['FANCD1', 'corrupted']}

Two of the four gene symbols are mapped directly. Bionty further warns us that some of our symbols are synonyms that can be converted into standardized symbols.

mapped_symbol_synonyms = gene_bionty.map_synonyms(
    df_orig["gene symbol"], gene_bionty.symbol
)
mapped_symbol_synonyms

['A1CF', 'A1BG', 'BRCA2', 'corrupted']
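The effect of the synonym mapping can be pictured as a dictionary lookup that replaces known synonyms and leaves every other value untouched. The sketch below is an illustration under that assumption only: map_synonyms_sketch and the one-entry synonym table are hypothetical, and the FANCD1 to BRCA2 pair is the only mapping taken from the output above.

```python
# Hypothetical synonym table: maps a synonym to its standardized symbol.
# Only the FANCD1 -> BRCA2 pair comes from the tutorial output.
synonym_to_symbol = {"FANCD1": "BRCA2"}


def map_synonyms_sketch(symbols, synonym_table):
    """Replace known synonyms; leave all other values unchanged."""
    return [synonym_table.get(symbol, symbol) for symbol in symbols]


symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]
standardized = map_synonyms_sketch(symbols, synonym_to_symbol)
print(standardized)
```

Values that are neither standardized symbols nor known synonyms, such as "corrupted", pass through unchanged and still need manual attention.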

We can store them in our DataFrame for further use.

df_orig["non-synonymous gene symbol"] = mapped_symbol_synonyms
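Once an inspection result is available, it could be used to separate the rows that map from those that need attention. The sketch below assumes the result is the dictionary with 'mapped' and 'not_mapped' keys shown in the earlier output. It is copied in by hand here so that the example runs without Bionty, and the names df_mapped and df_unmapped are illustrative.

```python
import pandas as pd

# Same toy table as in the tutorial, indexed by Ensembl gene ID.
df = pd.DataFrame(
    {
        "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
        "ensembl_gene_id": [
            "ENSG00000148584",
            "ENSG00000121410",
            "ENSG00000188389",
            "corrupted",
        ],
    }
).set_index("ensembl_gene_id")

# Inspection result copied by hand from the output shown earlier.
result = {
    "mapped": ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389"],
    "not_mapped": ["corrupted"],
}

# Boolean masks on the index keep the original row order.
df_mapped = df.loc[df.index.isin(result["mapped"])]
df_unmapped = df.loc[df.index.isin(result["not_mapped"])]

print(df_mapped.index.tolist())
print(df_unmapped.index.tolist())
```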

CellMarker

Depending on how the data was collected and which terminology was used, it is not always possible to curate values. Some values might have used a different standard or be corrupted.

This section will demonstrate how to look up unmatched terms and curate them using CellMarker.

First, we create an exemplary DataFrame containing valid and invalid cell markers (antibody targets) and an additional feature (time) from a flow cytometry dataset.

markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7x",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)

Let’s instantiate the CellMarker ontology with the default database and version.

cell_marker_bionty = CellMarker()

First, we can have a look at the cell marker table that we just loaded.

df = cell_marker_bionty.df()

df.head()

	id	name	ncbi_gene_id	gene_symbol	gene_name	uniprotkb_id	synonyms
0	CM_MERTK	MERTK	10461	MERTK	MER proto-oncogene, tyrosine kinase	Q12866	None
1	CM_CD16	CD16	2215	FCGR3A	Fc fragment of IgG receptor IIIb	O75015	None
2	CM_CD206	CD206	4360	MRC1	mannose receptor C-type 1	P22897	None
3	CM_CRIg	CRIg	11326	VSIG4	V-set and immunoglobulin domain containing 4	Q9Y279	None
4	CM_CD163	CD163	9332	CD163	CD163 molecule	Q86VB7	None

Now let’s check which cell markers from our DataFrame can be found in the reference:

cell_marker_bionty.inspect(markers.index, cell_marker_bionty.name)

🔶 The identifiers contain synonyms!

💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'

✅ 7 terms (50.0%) are mapped.

🔶 7 terms (50.0%) are not mapped.

{'mapped': ['CD14', 'CD8', 'CD45RA', 'CD4', 'CD3', 'CD127', 'CD66b'],
 'not_mapped': ['KI67',
  'CCR7x',
  'PD1',
  'Invalid-1',
  'Invalid-2',
  'Siglec8',
  'Time']}

From the logging, it can be seen that 7 terms were not found in the reference!

Among them, Time, Invalid-1 and Invalid-2 are non-marker channels, which cannot be mapped to a cell marker.
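Since non-marker channels can never map, one option is to set them aside before inspection so that the report only covers real marker candidates. This is a sketch of that idea in plain Python. It assumes the set of non-marker channels is maintained by hand for this dataset, and the variable names are illustrative.

```python
channels = [
    "KI67", "CCR7x", "CD14", "CD8", "CD45RA", "CD4", "CD3",
    "CD127", "PD1", "Invalid-1", "Invalid-2", "CD66b", "Siglec8", "Time",
]

# Assumption: this set is curated manually for the dataset at hand.
non_marker_channels = {"Time", "Invalid-1", "Invalid-2"}

# Keep the original order of the remaining marker candidates.
marker_candidates = [c for c in channels if c not in non_marker_channels]
print(len(marker_candidates), marker_candidates)
```

Only the entries in marker_candidates would then be passed on for inspection.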

Note that certain markers are synonyms of standardized names and are only matched after conversion, e.g. PD1 -> PD-1.

CCR7x is not found either. Let’s check the lookup with auto-completion:

cell_marker_bionty_lookup = cell_marker_bionty.lookup()

https://d33wubrfki0l68.cloudfront.net/eee08aab484a13dbaefc78633d1805ee61cd933c/8d864/_images/lookup_ccr7.png

cell_marker_bionty_lookup.CCR7

cell_marker(index=163, id='CM_CCR7', name='CCR7', ncbi_gene_id='1236', gene_symbol='CCR7', gene_name='C-C motif chemokine receptor 7', uniprotkb_id='P32248', synonyms=None)

Indeed, it should be CCR7: CCR7x was a typo.
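When many terms are unmapped, browsing the lookup by hand becomes tedious. As a complementary technique that is not part of Bionty, the standard library's difflib can propose close matches for likely typos. The reference list below is a hand-picked subset of marker names from the outputs above, and the cutoff of 0.8 is an arbitrary choice for this sketch.

```python
from difflib import get_close_matches

# Hand-picked subset of reference names, for illustration only.
reference_names = ["CCR7", "CD14", "CD8", "CD45RA", "CD4", "CD3", "CD127", "CD66b"]
unmapped = ["CCR7x", "Invalid-1", "Time"]

# For each unmapped term, propose at most one sufficiently similar reference name.
suggestions = {
    term: get_close_matches(term, reference_names, n=1, cutoff=0.8)
    for term in unmapped
}

for term, candidates in suggestions.items():
    print(term, "->", candidates or "no suggestion")
```

Suggestions of this kind are only hints. Each one should be confirmed against the lookup before renaming anything.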

Now let’s fix the markers so all of them can be linked:

Tip

Using .lookup() instead of passing a string helps eliminate possible typos!

curated_df = markers.rename(index={"CCR7x": cell_marker_bionty_lookup.CCR7.name})

Now we can run inspect again, and CCR7 is mapped. The terms that remain unmapped include the non-marker channels and markers such as PD1 that are listed under a synonym.

cell_marker_bionty.inspect(curated_df.index, cell_marker_bionty.name)

🔶 The identifiers contain synonyms!

💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'

✅ 8 terms (57.1%) are mapped.

🔶 6 terms (42.9%) are not mapped.

{'mapped': ['CCR7', 'CD14', 'CD8', 'CD45RA', 'CD4', 'CD3', 'CD127', 'CD66b'],
 'not_mapped': ['KI67', 'PD1', 'Invalid-1', 'Invalid-2', 'Siglec8', 'Time']}