lamindb.Feature¶

Bases: SQLRecord, HasType, CanCurate, TracksRun, TracksUpdates

Dimensions of measurement such as dataframe columns or dictionary keys.

Features represent what is measured in a dataset—the variables or dimensions along which data is organized. They enable you to query datasets based on their structure and corresponding label annotations.

Parameters:

name – str Name of the feature, typically a column name.
dtype – Dtype | Registry | list[Registry] | FieldAttr See Dtype. For categorical types, you can define to which registry values are restricted, e.g., ULabel or [ULabel, bionty.CellType].
unit – str | None = None Unit of measure, ideally SI ("m", "s", "kg", etc.) or "normalized" etc.
description – str | None = None A description.
synonyms – str | None = None Bar-separated synonyms.
nullable – bool = True Whether the feature can have null-like values (None, pd.NA, NaN, etc.), see nullable.
default_value – Any | None = None Default value for the feature.
coerce_dtype – bool = False When True, attempts to coerce values to the specified dtype during validation, see coerce_dtype.
cat_filters – dict[str, str] | None = None Subset a registry by additional filters to define valid categories.

Note

For more control, you can use bionty registries to manage simple biological entities like genes, proteins & cell markers, or define custom registries to manage high-level derived features like gene sets.

See also

from_dataframe(): Create feature records from DataFrame.
features: Feature manager of an artifact or collection.
ULabel: Universal labels.
Schema: Sets of features.

Example

Features with simple data types:

ln.Feature(name="sample_note", dtype=str).save()
ln.Feature(name="temperature_in_celsius", dtype=float).save()
ln.Feature(name="read_count", dtype=int).save()

A categorical feature measuring labels managed in the Record registry:

ln.Feature(name="sample", dtype=ln.Record).save()

The same for the bt.CellType registry:

ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save()  # expert annotation
ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save()   # model annotation

Scope a feature with a feature type to distinguish the same feature name across different contexts:

abc_feature_type = ln.Feature(name="ABC", is_type=True).save()  # ABC could reference a schema, a project, a team, etc.
ln.Feature(name="concentration_nM", dtype=float, type=abc_feature_type).save()

xyz_feature_type = ln.Feature(name="XYZ", is_type=True).save()  # XYZ could reference a schema, a project, a team, etc.
ln.Feature(name="concentration_nM", dtype=float, type=xyz_feature_type).save()

# calling .save() again with the same name and type returns the existing feature
ln.Feature(name="concentration_nM", dtype=float, type=xyz_feature_type).save()

Annotate an artifact with features (works identically for records and runs):

artifact.features.add_values({
    "temperature_in_celsius": 37.5,
    "sample_note": "Control sample",
})

Query artifacts/records/runs by features:

ln.Artifact.filter(features__name="temperature_in_celsius")  # artifacts with this feature
ln.Artifact.filter(temperature_in_celsius__gt=37)            # artifacts where temperature > 37

A list dtype:

ln.Feature(
    name="cell_types",
    dtype=list[bt.CellType],  # or list[str] for a list of strings
).save()

A path feature:

ln.Feature(
    name="image_path",
    dtype="path",   # will be validated as `str`
).save()

Note

Features and labels denote two ways of using entities to organize data:

A feature qualifies what is measured, i.e., a numerical or categorical random variable
A label is a measured value, i.e., a category

Example: When annotating a dataset that measured expression of 30k genes, those genes serve as feature identifiers. When annotating a dataset whose experiment knocked out 3 specific genes, those genes serve as labels.

Re-shaping data can introduce ambiguity between features & labels. If this happens, ask yourself what the joint measurement was: a feature qualifies variables in a joint measurement. The canonical data matrix lists jointly measured variables in the columns.

Attributes¶

property coerce_dtype: bool¶

Whether dtypes should be coerced during validation.

For example, an object-dtyped pandas column can be coerced to categorical and would pass validation if this is True.
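The effect of this flag can be pictured with a small pure-Python sketch. It is not how LaminDB implements validation: `validate_values` is a hypothetical helper, and plain Python types stand in for pandas dtypes.

```python
def validate_values(values, dtype, coerce_dtype=False):
    """Return values if they conform to dtype, optionally coercing first.

    Hypothetical helper for illustration only, not part of lamindb.
    """
    if coerce_dtype:
        # attempt the conversion; a failed conversion raises ValueError
        values = [dtype(v) for v in values]
    mismatched = [v for v in values if not isinstance(v, dtype)]
    if mismatched:
        raise TypeError(f"values {mismatched!r} are not of type {dtype.__name__}")
    return values


raw = ["1", "2", "3"]  # strings that look like integers

# with coercion, the strings are converted and validation passes
coerced = validate_values(raw, int, coerce_dtype=True)

# without coercion, the same input is rejected
try:
    validate_values(raw, int, coerce_dtype=False)
    strict_passed = True
except TypeError:
    strict_passed = False
```

With coercion enabled the input is converted before the type check, so data that is merely stored in a looser type can still validate.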

property default_value: Any¶

A default value that overwrites missing values (default None).

This takes effect when you call Curator.standardize().

If default_value = None, missing values like pd.NA or np.nan are kept.
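As a rough illustration of this rule, the sketch below fills missing entries in a plain Python list. `is_missing` and `fill_missing` are hypothetical helpers, and `None` and float NaN stand in for the pandas and numpy missing-value markers; the actual behavior of Curator.standardize() may differ in detail.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (a simplification of pd.NA / np.nan)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(values, default_value=None):
    """Hypothetical helper: replace missing entries unless default_value is None."""
    if default_value is None:
        return list(values)  # missing values are kept as they are
    return [default_value if is_missing(v) else v for v in values]


column = [36.6, None, float("nan"), 37.5]

filled = fill_missing(column, default_value=37.0)  # missing entries replaced
kept = fill_missing(column)                        # missing entries untouched
```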

property nullable: bool¶

Indicates whether the feature can have nullable values (default True).

Example:

import lamindb as ln
import pandas as pd

disease = ln.Feature(name="disease", dtype=ln.ULabel, nullable=False).save()
schema = ln.Schema(features=[disease]).save()
dataset = {"disease": pd.Categorical([pd.NA, "asthma"])}
df = pd.DataFrame(dataset)
curator = ln.curators.DataFrameCurator(df, schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith("non-nullable series 'disease' contains null values")
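The check itself can be sketched without a database. The helper below is hypothetical and only mirrors the idea and the error message from the example above, using a plain list instead of a pandas Series.

```python
import math


def check_nullable(name, values, nullable=True):
    """Hypothetical check: raise if a non-nullable column contains null-like values."""
    has_null = any(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in values
    )
    if has_null and not nullable:
        raise ValueError(f"non-nullable series '{name}' contains null values")
    return True


# a nullable feature accepts the missing entry
nullable_ok = check_nullable("disease", [None, "asthma"], nullable=True)

# a non-nullable feature rejects it
try:
    check_nullable("disease", [None, "asthma"], nullable=False)
    message = ""
except ValueError as error:
    message = str(error)
```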

Simple fields¶

uid: str¶: Universal id, valid across DB instances.

name: str¶: Name of feature.

dtype: Dtype¶: Data type (Dtype).

is_type: bool¶: Distinguish types from instances of the type.

unit: str | None¶: Unit of measure, ideally SI (m, s, kg, etc.) or ‘normalized’ etc. (optional).

description: str | None¶: A description.

array_rank: int¶

Rank of feature.

Number of indices of the array: 0 for scalar, 1 for vector, 2 for matrix.

Is called .ndim in numpy and pytorch but shouldn’t be confused with the dimension of the feature space.

array_size: int¶

Number of elements of the feature.

Total number of elements (product of shape components) of the array.

A number or string (a scalar): 1
A 50-dimensional embedding: 50
A 25 x 25 image: 625

array_shape: list[int] | None¶

Shape of the feature.

A number or string (a scalar): [1]
A 50-dimensional embedding: [50]
A 25 x 25 image: [25, 25]

Is stored as a list rather than a tuple because it’s serialized as JSON.
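The relationship between array_shape and array_size, and the reason for storing a list, can be checked with the standard library alone. This is a sketch using the example shapes listed above, not code from lamindb.

```python
import json
import math

# array_shape values taken from the examples above
shapes = {
    "scalar": [1],
    "embedding": [50],
    "image": [25, 25],
}

# array_size is the product of the shape components
sizes = {name: math.prod(shape) for name, shape in shapes.items()}

# JSON has no tuple type: a tuple is written as an array and read back as a list
round_tripped = json.loads(json.dumps((25, 25)))
```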

proxy_dtype: Dtype | None¶

Proxy data type.

If the feature is an image it’s often stored via a path to the image file. Hence, while the dtype might be image with a certain shape, the proxy dtype would be str.

synonyms: str | None¶: Bar-separated (|) synonyms (optional).

is_locked: bool¶: Whether the record is locked for edits.

created_at: datetime¶: Time of creation of record.

updated_at: datetime¶: Time of last update to record.

Relational fields¶

branch: Branch¶

Life cycle state of record.

branch.name can be “main” (default branch), “trash” (trash), “archive” (archived), or the name of any other user-created branch, typically planned for merging onto main after review.

space: Space¶: The space in which the record lives.

created_by: User¶: Creator of record.

run: Run | None¶: Run that created record.

type: Feature | None¶

Type of feature (e.g., ‘Readout’, ‘Metric’, ‘Metadata’, ‘ExpertAnnotation’, ‘ModelPrediction’).

Allows grouping features by type, e.g., all readouts, all metrics, etc.

schemas: Schema¶: Schemas linked to this feature.

features: Feature¶: Features of this type (can only be non-empty if is_type is True).

values: FeatureValue¶: Values for this feature.

projects: Project¶: Annotating projects.

blocks: FeatureBlock¶: Blocks that annotate this feature.

Class methods¶

classmethod from_dataframe(df, field=None, *, mute=False)¶

Create Feature records for dataframe columns.

Parameters:

df (DataFrame) – Source DataFrame to extract column information from
field (DeferredAttribute | None, default: None) – FieldAttr for Feature model validation, defaults to Feature.name
mute (bool, default: False) – Whether to mute warnings about similar names found during Feature creation

Return type:

SQLRecordList

classmethod from_dict(dictionary, field=None, *, str_as_cat=None, type=None, mute=False)¶

Create Feature records for dictionary keys.

Parameters:

dictionary (dict[str, Any]) – Source dictionary to extract key information from
field (DeferredAttribute | None, default: None) – FieldAttr for Feature model validation, defaults to Feature.name
str_as_cat (bool | None, default: None) – Deprecated. Will be removed in LaminDB 2.0.0. Create features explicitly with dtype=’cat’ for categorical values.
type (Feature | None, default: None) – Feature type of all created features
mute (bool, default: False) – Whether to mute dtype inference and feature creation warnings

Return type:

SQLRecordList

classmethod filter(*queries, **expressions)¶

Query records.

Parameters:

queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

See also

Guide: Query & search registries
Django documentation: Queries

Examples

>>> ln.Project(name="my label").save()
>>> ln.Project.filter(name__startswith="my").to_dataframe()
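To make the `field__lookup=value` convention concrete, here is a toy in-memory filter over dictionaries. It is only a sketch of how the keyword names are read, not Django's or LaminDB's implementation: `LOOKUPS` and `matches` are hypothetical names, only a handful of lookups are covered, and traversing relations (for example `created_by__name`) is not handled.

```python
import operator

# a few Django-style lookup suffixes and the comparison each stands for
LOOKUPS = {
    "exact": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
    "startswith": lambda value, prefix: str(value).startswith(prefix),
}


def matches(row, **expressions):
    """Return True if a dict row satisfies all field__lookup=value expressions."""
    for key, expected in expressions.items():
        field, _, lookup = key.partition("__")
        compare = LOOKUPS[lookup or "exact"]  # a bare field name means exact match
        if not compare(row[field], expected):
            return False
    return True


rows = [
    {"name": "my label", "temperature_in_celsius": 37.5},
    {"name": "other", "temperature_in_celsius": 36.0},
]

by_prefix = [r["name"] for r in rows if matches(r, name__startswith="my")]
warm = [r["name"] for r in rows if matches(r, temperature_in_celsius__gt=37)]
```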

classmethod get(idlike=None, **expressions)¶

Get a single record.

Parameters:

idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.

Raises:

lamindb.errors.DoesNotExist – In case no matching record is found.

Return type:

SQLRecord

See also

Guide: Query & search registries
Django documentation: Queries

Examples

record = ln.Record.get("FvtpPJLJ")
record = ln.Record.get(name="my-label")

classmethod to_dataframe(include=None, features=False, limit=100)¶

Evaluate and convert to pd.DataFrame.

By default, maps simple fields and foreign keys onto DataFrame columns.

Guide: Query & search registries

Parameters:

include (str | list[str] | None, default: None) – Related data to include as columns. Takes strings of form "records__name", "cell_types__name", etc. or a list of such strings. For Artifact, Record, and Run, can also pass "features" to include features with data types pointing to entities in the core schema. If "privates", includes private fields (fields starting with _).
features (bool | list[str], default: False) – Configure the features to include. Can be a feature name or a list of such names. If "queryset", infers the features used within the current queryset. Only available for Artifact, Record, and Run.
limit (int, default: 100) – Maximum number of rows to display. If None, includes all results.
order_by – Field name to order the records by. Prefix with ‘-’ for descending order. Defaults to ‘-id’ to get the most recent records. This argument is ignored if the queryset is already ordered or if the specified field does not exist.

Return type:

DataFrame

Examples

Include the name of the creator:

ln.Record.to_dataframe(include="created_by__name")

Include features:

ln.Artifact.to_dataframe(include="features")

Include selected features:

ln.Artifact.to_dataframe(features=["cell_type_by_expert", "cell_type_by_model"])

classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶

Search.

Parameters:

string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum amount of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A sorted DataFrame of search results with a score in column score. If return_queryset is True, a QuerySet.

See also

filter()
lookup()

Examples

records = ln.Record.from_values(["Label1", "Label2", "Label3"], field="name").save()
ln.Record.search("Label2")

classmethod lookup(field=None, return_field=None)¶

Return an auto-complete object for a field.

Parameters:

field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
keep – When multiple records are found for a lookup, how to return the records: "first" returns the first record, "last" returns the last record, and False returns all records.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.

See also

search()

Examples

Look up via auto-complete on .:

import bionty as bt
bt.Gene.from_source(symbol="ADGB-DT").save()
lookup = bt.Gene.lookup()
lookup.adgb_dt

Look up via auto-complete in dictionary:

lookup_dict = lookup.dict()
lookup_dict['ADGB-DT']

Look up via a specific field:

lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
lookup_by_ensembl_id.ensg00000002745

Return a specific field value instead of the full record:

lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")

classmethod connect(instance)¶

Query a non-default LaminDB instance.

Parameters:

instance (str | None) – An instance identifier of form “account_handle/instance_name”.

Return type:

QuerySet

Examples

ln.Record.connect("account_handle/instance_name").search("label7", field="name")

classmethod inspect(values, field=None, *, mute=False, organism=None, source=None, from_source=True, strict_source=False)¶

Inspect if values are mappable to a field.

Being mappable means that an exact match exists.

Parameters:

values (list[str] | Series | array) – Values that will be checked against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to inspect against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry. If False, validation includes all records in the registry, ignoring the specified source. If True, validation only includes records in the registry that are linked to the specified source. Note: this parameter does not affect validation against public sources.

Return type:

bionty.base.dev.InspectResult

See also

validate()

Example:

import bionty as bt

# save some gene records
bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

# inspect gene symbols
gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
result = bt.Gene.inspect(gene_symbols, field=bt.Gene.symbol, organism="human")
assert result.validated == ["A1CF", "A1BG"]
assert result.non_validated == ["FANCD1", "FANCD20"]

classmethod validate(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶

Validate values against existing values of a string field.

Note that this is strict validation: it only asserts exact matches.

Parameters:

values (list[str] | Series | array) – Values that will be validated against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry. If False, validation includes all records in the registry, ignoring the specified source. If True, validation only includes records in the registry that are linked to the specified source. Note: this parameter does not affect validation against public sources.

Return type:

ndarray

Returns:

A vector of booleans indicating if an element is validated.

See also

inspect()

Example:

import bionty as bt

bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
bt.Gene.validate(gene_symbols, field=bt.Gene.symbol, organism="human")
#> array([ True,  True, False, False])
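The exact-match semantics can be reproduced with plain Python. In this sketch a list of symbols stands in for the saved records, and `validate` is a hypothetical function, not the lamindb method; it also shows how the validated and non-validated lists reported by inspect() relate to the boolean vector.

```python
def validate(values, known):
    """Hypothetical exact-match validation: one boolean per input value."""
    known = set(known)
    return [value in known for value in values]


registry_symbols = ["A1CF", "A1BG", "BRCA2"]  # toy stand-in for saved records
gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]

flags = validate(gene_symbols, registry_symbols)

# splitting the inputs by flag gives the two lists that inspect() reports
validated = [v for v, ok in zip(gene_symbols, flags) if ok]
non_validated = [v for v, ok in zip(gene_symbols, flags) if not ok]
```

Note that "FANCD1" fails here even though it is a synonym of BRCA2: exact matching does not consult synonyms, which is what standardize() is for.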

classmethod from_values(values, field=None, create=False, organism=None, source=None, mute=False)¶

Bulk create validated records by parsing values for an identifier such as a name or an id.

Parameters:

values (list[str] | Series | array) – A list of values for an identifier, e.g. ["name1", "name2"].
field (str | DeferredAttribute | None, default: None) – A SQLRecord field to look up, e.g., bt.CellMarker.name.
create (bool, default: False) – Whether to create records if they don’t exist.
organism (SQLRecord | str | None, default: None) – A bionty.Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record to validate against when creating records.
mute (bool, default: False) – Whether to mute logging.

Return type:

SQLRecordList

Returns:

A list of validated records. For bionty registries, also returns knowledge-coupled records.

Notes

For more info, see tutorial: Manage biological ontologies.

Example:

import bionty as bt
import lamindb as ln

# Bulk creating from non-validated values logs warnings and returns an empty list
ulabels = ln.ULabel.from_values(["benchmark", "prediction", "test"])
assert len(ulabels) == 0

# With create=True, records are created for values that don't exist yet
ulabels = ln.ULabel.from_values(["benchmark", "prediction", "test"], create=True).save()
assert len(ulabels) == 3

# Bulk create records from public reference
bt.CellType.from_values(["T cell", "B cell"]).save()

classmethod standardize(values, field=None, *, return_field=None, return_mapper=False, case_sensitive=False, mute=False, source_aware=True, keep='first', synonyms_field='synonyms', organism=None, source=None, strict_source=False)¶

Maps input synonyms to standardized names.

Parameters:

values (Iterable) – Identifiers that will be standardized.
field (str | DeferredAttribute | None, default: None) – The field representing the standardized names.
return_field (str | DeferredAttribute | None, default: None) – The field to return. Defaults to field.
return_mapper (bool, default: False) – If True, returns {input_value: standardized_name}.
case_sensitive (bool, default: False) – Whether the mapping is case sensitive.
mute (bool, default: False) – Whether to mute logging.
source_aware (bool, default: True) – Whether to standardize from public source. Defaults to True for BioRecord registries.
keep (Literal['first', 'last', False], default: 'first') –
When a synonym maps to multiple names, determines which duplicates to keep, with the same options as pd.DataFrame.duplicated: "first" returns the first mapped standardized name, "last" returns the last mapped standardized name, and False returns all mapped standardized names.

When keep is False, the returned list of standardized names will contain nested lists in case of duplicates.

When a field is converted into return_field, keep marks which matches to keep when multiple return_field values map to the same field value.
synonyms_field (str, default: 'synonyms') – A field containing the concatenated synonyms.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry. If False, validation includes all records in the registry, ignoring the specified source. If True, validation only includes records in the registry that are linked to the specified source. Note: this parameter does not affect validation against public sources.

Return type:

list[str] | dict[str, str]

Returns:

If return_mapper is False – a list of standardized names. Otherwise, a dictionary of mapped values with mappable synonyms as keys and standardized names as values.

See also

add_synonym(): Add synonyms.
remove_synonym(): Remove synonyms.

Example:

import bionty as bt

# save some gene records
bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

# standardize gene synonyms
gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
bt.Gene.standardize(gene_synonyms)
#> ['A1CF', 'A1BG', 'BRCA2', 'FANCD20']
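The mapping idea behind this example can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: `build_mapper` and `standardize` are hypothetical functions, the `records` dictionary is toy data standing in for a registry, and options such as keep, source or organism are not modelled.

```python
def build_mapper(records, case_sensitive=False):
    """Map each name and each bar-separated synonym to its standardized name."""
    mapper = {}
    for name, synonyms in records.items():
        terms = [name] + [s for s in synonyms.split("|") if s]
        for term in terms:
            key = term if case_sensitive else term.lower()
            mapper.setdefault(key, name)  # keep the first name on conflicts
    return mapper


def standardize(values, records, return_mapper=False, case_sensitive=False):
    """Hypothetical re-implementation of the mapping idea, not lamindb code."""
    mapper = build_mapper(records, case_sensitive)

    def lookup(value):
        key = value if case_sensitive else value.lower()
        return mapper.get(key, value)  # unmapped values pass through unchanged

    if return_mapper:
        # only report inputs that actually change
        return {v: lookup(v) for v in values if lookup(v) != v}
    return [lookup(v) for v in values]


# toy records: standardized name -> bar-separated synonyms (illustrative data)
records = {"A1CF": "", "A1BG": "", "BRCA2": "FANCD1"}
gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]

names = standardize(gene_synonyms, records)
mapping = standardize(gene_synonyms, records, return_mapper=True)
```

Values without a match, such as "FANCD20", are returned unchanged, which matches the output shown in the example above.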

Methods¶

query_features()¶

Query features of sub types.

While .features retrieves the features with the current type, this method also retrieves sub types and the features with sub types of the current type.

Return type:

QuerySet

save(*args, **kwargs)¶

Save the feature to the instance.

Return type:

Feature

with_config(optional=None)¶

Pass additional configurations to the schema.

Return type:

tuple[Feature, dict]

restore()¶

Restore from trash onto the main branch.

Does not restore descendant records if the record is HasType with is_type = True.

Return type:

None

delete(permanent=None, **kwargs)¶

Delete record.

If record is HasType with is_type = True, deletes all descendant records, too.

Parameters:

permanent (bool | None, default: None) – Whether to permanently delete the record (skips trash). If None, performs soft delete if the record is not already in the trash.

Return type:

None

Examples

For any SQLRecord object record, call:

>>> record.delete()

query_types()¶

Query types of a record recursively.

While .type retrieves the type, this method retrieves all super types of that type:

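# model_class is a placeholder for any registry whose records can have a type;
# model_name is a placeholder string used to name the record

# Create type hierarchy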
type1 = model_class(name="Type1", is_type=True).save()
type2 = model_class(name="Type2", is_type=True, type=type1).save()
type3 = model_class(name="Type3", is_type=True, type=type2).save()

# Create a record with type3
record = model_class(name=f"{model_name}3", type=type3).save()

# Query super types
super_types = record.query_types()
assert super_types[0] == type3
assert super_types[1] == type2
assert super_types[2] == type1

Return type:

SQLRecordList
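The recursion described for query_types() amounts to following the type reference until it runs out. The sketch below uses a toy `Node` dataclass in place of a database record; both `Node` and this standalone `query_types` function are hypothetical and only illustrate the ordering of the result.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Toy stand-in for a record that has a `type` pointing to its parent type."""

    name: str
    type: Optional["Node"] = None


def query_types(record):
    """Collect the type of a record, then the type of that type, and so on."""
    super_types = []
    current = record.type
    while current is not None:
        super_types.append(current)
        current = current.type
    return super_types


type1 = Node("Type1")
type2 = Node("Type2", type=type1)
type3 = Node("Type3", type=type2)
record = Node("Record3", type=type3)

names = [t.name for t in query_types(record)]
```

The closest type comes first and the root of the hierarchy last, matching the assertions in the example above.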

add_synonym(synonym, force=False, save=None)¶

Add synonyms to a record.

Parameters:

synonym (str | list[str] | Series | array) – The synonyms to add to the record.
force (bool, default: False) – Whether to add synonyms even if they are already synonyms of other records.
save (bool | None, default: None) – Whether to save the record to the database.

See also

remove_synonym(): Remove synonyms.

Example:

import bionty as bt

# save "T cell" record
record = bt.CellType.from_source(name="T cell").save()
record.synonyms
#> "T-cell|T lymphocyte|T-lymphocyte"

# add a synonym
record.add_synonym("T cells")
record.synonyms
#> "T cells|T-cell|T-lymphocyte|T lymphocyte"

remove_synonym(synonym)¶

Remove synonyms from a record.

Parameters:

synonym (str | list[str] | Series | array) – The synonym values to remove.

See also

add_synonym(): Add synonyms

Example:

import bionty as bt

# save "T cell" record
record = bt.CellType.from_source(name="T cell").save()
record.synonyms
#> "T-cell|T lymphocyte|T-lymphocyte"

# remove a synonym
record.remove_synonym("T-cell")
record.synonyms
#> "T lymphocyte|T-lymphocyte"
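Because synonyms are stored as a single bar-separated string, adding and removing them comes down to splitting and re-joining on "|". The helpers below are hypothetical stand-ins that work on plain strings; they do not touch the database, and the order in which lamindb stores synonyms may differ from the order produced here.

```python
def add_synonym(synonyms, new):
    """Hypothetical helper: append a synonym unless it is already present."""
    terms = [s for s in (synonyms or "").split("|") if s]
    if new not in terms:
        terms.append(new)
    return "|".join(terms)


def remove_synonym(synonyms, old):
    """Hypothetical helper: drop a synonym and re-join the remaining ones."""
    terms = [s for s in (synonyms or "").split("|") if s and s != old]
    return "|".join(terms)


synonyms = "T-cell|T lymphocyte|T-lymphocyte"

after_add = add_synonym(synonyms, "T cells")
after_remove = remove_synonym(synonyms, "T-cell")
```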

set_abbr(value)¶

Set value for abbr field and add to synonyms.

Parameters:

value (str) – A value for an abbreviation.

See also

add_synonym()

Example:

import bionty as bt

# save an experimental factor record
scrna = bt.ExperimentalFactor.from_source(name="single-cell RNA sequencing").save()
assert scrna.abbr is None
assert scrna.synonyms == "single-cell RNA-seq|single-cell transcriptome sequencing|scRNA-seq|single cell RNA sequencing"

# set abbreviation
scrna.set_abbr("scRNA")
assert scrna.abbr == "scRNA"
# synonyms are updated
assert scrna.synonyms == "scRNA|single-cell RNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing|scRNA-seq"

refresh_from_db(using=None, fields=None, from_queryset=None)¶

Reload field values from the database.

By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.

Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.

When accessing deferred fields of an instance, the deferred loading of the field will call this method.

async arefresh_from_db(using=None, fields=None, from_queryset=None)¶