lamindb.Schema

class lamindb.Schema(features: list[SQLRecord] | SQLRecordList | list[tuple[Feature, dict]] | None = None, *, slots: dict[str, Schema] | None = None, name: str | None = None, description: str | None = None, itype: str | Registry | FieldAttr | None = None, type: Schema | None = None, is_type: bool = False, index: Feature | None = None, flexible: bool | None = None, otype: str | None = None, dtype: str | Type[int | float | str] | None = None, minimal_set: bool = True, maximal_set: bool = False, ordered_set: bool = False, coerce_dtype: bool = False, n: int | None = None)

Bases: SQLRecord, HasType, CanCurate, TracksRun

Schemas of datasets such as column sets of dataframes.

Note

To create a schema, at least one of the following parameters must be passed:

  • features - a list of Feature objects

  • itype - the identifier type, e.g., Feature or bt.Gene.ensembl_gene_id

  • slots - a dictionary mapping slots to Schema objects, e.g., for an AnnData, {"obs": Schema(...), "var.T": Schema(...)}

  • is_type=True - a schema type to group schemas, e.g., “ProteinPanel”

Parameters:
  • features (list[SQLRecord] | SQLRecordList | list[tuple[Feature, dict]] | None, default: None) – Feature records, e.g., [Feature(...), Feature(...)], or features with their config, e.g., [Feature(...).with_config(optional=True)].

  • slots (dict[str, Schema] | None, default: None) – A dictionary mapping slot names to Schema objects to create a _composite_ schema.

  • name (str | None, default: None) – Name of the schema.

  • description (str | None, default: None) – Description of the schema.

  • itype (str | Registry | FieldAttr | None, default: None) – Feature identifier type to validate against, e.g., ln.Feature or bt.Gene.ensembl_gene_id. Is automatically set to the type of the passed features.

  • type (Schema | None, default: None) – Define schema types like ln.Schema(name="ProteinPanel", is_type=True).

  • is_type (bool, default: False) – Whether the schema is a type.

  • index (Feature | None, default: None) – A Feature record to validate the index of a DataFrame and therefore also, e.g., AnnData obs and var indices.

  • flexible (bool | None, default: None) – Whether to include any feature of the same itype during validation & annotation. If features is passed, defaults to False so that, e.g., additional columns of a DataFrame encountered during validation are disregarded. If features is not passed, defaults to True.

  • otype (str | None, default: None) – An object type to define the structure of a composite schema, e.g., "DataFrame", "AnnData".

  • dtype (str | Type[int | float | str] | None, default: None) – A dtype to assume for all features in the schema (e.g., "num", float, int). Defaults to None if itype is Feature, otherwise to "num", e.g., if itype is bt.Gene.ensembl_gene_id.

  • minimal_set (bool, default: True) – Whether all passed features are required by default. See optionals for more fine-grained control.

  • maximal_set (bool, default: False) – Whether additional features are forbidden. If False, features not in the schema may be present in the dataset.

  • ordered_set (bool, default: False) – Whether features are required to be ordered.

  • coerce_dtype (bool, default: False) – When True, attempts to coerce values to the specified dtype during validation; see coerce_dtype.

  • n (int | None, default: None) – A manual way of specifying the number of features in the schema. Is inferred from features if those are passed.
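Taken together, minimal_set and maximal_set behave like set containment between schema features and dataset columns. A conceptual sketch in plain Python (not lamindb internals):

```python
# conceptual sketch of minimal_set / maximal_set semantics (not lamindb internals)
schema_features = {"sample_id", "cell_type"}
dataset_columns = {"sample_id", "cell_type", "batch"}

# minimal_set=True: every schema feature must be present in the dataset
minimal_ok = schema_features <= dataset_columns  # True

# maximal_set=False (default): extra columns like "batch" are tolerated;
# with maximal_set=True, any extra column would fail validation
extra_columns = dataset_columns - schema_features  # {"batch"}
```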

See also

from_dataframe()

Validate & annotate a DataFrame with a schema.

from_anndata()

Validate & annotate an AnnData with a schema.

from_mudata()

Validate & annotate a MuData with a schema.

from_spatialdata()

Validate & annotate a SpatialData with a schema.

Examples

A schema with a single required feature:

import lamindb as ln

schema = ln.Schema([ln.Feature(name="required_feature", dtype=str).save()]).save()

A schema that constrains feature identifiers to be valid feature names:

schema = ln.Schema(itype=ln.Feature)  # uses Feature.name as identifier type

Or valid Ensembl gene ids:

import bionty as bt

schema = ln.Schema(itype=bt.Gene.ensembl_gene_id)

A flexible schema that requires a single feature but also validates & annotates additional features with registered feature identifiers:

schema = ln.Schema(
    [ln.Feature(name="required_feature", dtype=str).save()],
    itype=ln.Feature,
    flexible=True,
).save()

Create a schema type to group schemas:

protein_panel = ln.Schema(name="ProteinPanel", is_type=True).save()
schema = ln.Schema(itype=bt.CellMarker, type=protein_panel).save()

Validate the index of a DataFrame:

schema = ln.Schema(
    [ln.Feature(name="required_feature", dtype=str).save()],
    index=ln.Feature(name="sample", dtype=ln.ULabel).save(),
).save()

Mark a feature as optional:

schema = ln.Schema([
    ln.Feature(name="required_feature", dtype=str).save(),
    ln.Feature(name="feature2", dtype=int).save().with_config(optional=True),
]).save()

Parse & validate feature identifier values:

schema = ln.Schema.from_values(
    adata.var["ensemble_id"],
    field=bt.Gene.ensembl_gene_id,
    organism="mouse",
).save()

Create a schema from a DataFrame:

import pandas as pd

df = pd.DataFrame({"feat1": [1, 2], "feat2": [3.1, 4.2], "feat3": ["cond1", "cond2"]})
schema = ln.Schema.from_dataframe(df)

Attributes

property coerce_dtype: bool

Whether dtypes should be coerced during validation.

For example, an object-dtyped pandas column can be coerced to categorical and would then pass validation if this is True.
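The cast itself is equivalent to a plain pandas astype; a minimal sketch of what a successful coercion amounts to (the actual validation logic lives in lamindb's curators):

```python
import pandas as pd

# an object-dtyped column, as typically produced by reading a CSV
df = pd.DataFrame({"cell_type": ["T cell", "B cell", "T cell"]})
assert df["cell_type"].dtype == object

# with coerce_dtype=True, validation coerces instead of rejecting;
# the underlying cast is an ordinary dtype conversion
df["cell_type"] = df["cell_type"].astype("category")
```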

property flexible: bool

Indicates how to handle validation and annotation in case features are not defined.

Examples

Make a rigid schema flexible:

schema = ln.Schema.get(name="my_schema")
schema.flexible = True
schema.save()

During schema creation:

# if you're not passing features but just defining the itype, defaults to flexible = True
schema = ln.Schema(itype=ln.Feature).save()
# schema.flexible is True

# if you're passing features, defaults to flexible = False
schema = ln.Schema(
    features=[ln.Feature(name="my_required_feature", dtype=int).save()],
)
# schema.flexible is False

# you can also validate & annotate features in addition to those that you're explicitly defining:
schema = ln.Schema(
    features=[ln.Feature(name="my_required_feature", dtype=int).save()],
    flexible=True,
)
# schema.flexible is True
property index: None | Feature

The feature configured to act as index.

To unset it, set schema.index to None.

property members: QuerySet

A queryset for the individual records in the feature set underlying the schema.

Unlike schema.features, schema.genes, schema.proteins, etc., this queryset is ordered and doesn’t require knowledge of the entity.

property optionals: SchemaOptionals

Manage optional features.

Example

# a schema with optional "sample_name"
schema_optional_sample_name = ln.Schema(
    features=[
        ln.Feature(name="sample_id", dtype=str).save(),  # required
        ln.Feature(name="sample_name", dtype=str).save().with_config(optional=True),  # optional
    ],
).save()

# raises ValidationError since `sample_id` is required
ln.curators.DataFrameCurator(
    pd.DataFrame(
        {
            "sample_name": ["Sample 1", "Sample 2"],
        }
    ),
    schema=schema_optional_sample_name,
).validate()

# passes because only an optional column is missing
ln.curators.DataFrameCurator(
    pd.DataFrame(
        {
            "sample_id": ["sample1", "sample2"],
        }
    ),
    schema=schema_optional_sample_name,
).validate()

property slots: dict[str, Schema]

Slots.

Examples

# define composite schema
anndata_schema = ln.Schema(
    name="mini_immuno_anndata_schema",
    otype="AnnData",
    slots={"obs": obs_schema, "var": var_schema},
).save()

# access slots
anndata_schema.slots
#> {'obs': <Schema: obs_schema>, 'var': <Schema: var_schema>}

Simple fields

uid: str

A universal id.

name: str | None

A name.

description: str | None

A description.

n: int

Number of features in the schema.

is_type: bool

Distinguish types from instances of the type.

itype: str | None

A field of a registry that stores feature identifier types, e.g., 'Feature.name' or 'bionty.Gene.ensembl_gene_id'. Defaults to the default name field if a registry is passed (passing Feature would result in Feature.name).

Depending on itype, .members stores, e.g., Feature or bionty.Gene records.

otype: str | None

Default Python object type, e.g., DataFrame, AnnData.

dtype: str | None

Data type, e.g., “num”, “float”, “int”. Is None for Feature.

For Feature, types are expected to be heterogeneous and defined on a per-feature level.

hash: str | None

A hash of the set of feature identifiers.

For a composite schema, the hash of hashes.
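The idea of an order-independent set hash, and a hash of hashes for composites, can be sketched as follows (the digest function here is hypothetical; lamindb's actual hashing scheme may differ):

```python
import hashlib

def id_set_hash(identifiers: list[str]) -> str:
    # order-independent digest of a set of feature identifiers (illustrative only)
    return hashlib.md5("".join(sorted(identifiers)).encode()).hexdigest()

var_hash = id_set_hash(["ENSG00000139618", "ENSG00000198786"])
obs_hash = id_set_hash(["sample_id", "cell_type"])
# a composite schema hashes the hashes of its slots
composite_hash = id_set_hash([var_hash, obs_hash])
```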

minimal_set: bool

Whether all passed features are to be considered required by default (default True).

Note that features that are explicitly marked as optional via feature.with_config(optional=True) are not required even if minimal_set is True.

ordered_set: bool

Whether features are required to be ordered (default False).
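What ordered_set adds on top of membership can be pictured with plain lists (a conceptual sketch, not lamindb internals):

```python
schema_order = ["sample_id", "cell_type"]
cols_same = ["sample_id", "cell_type"]
cols_shuffled = ["cell_type", "sample_id"]

same_members = set(cols_shuffled) == set(schema_order)  # membership alone passes
same_order = cols_same == schema_order                  # the ordered check also passes
shuffled_order = cols_shuffled == schema_order          # but a shuffled order fails it
```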

maximal_set: bool

Whether all features present in the dataset must be in the schema (default False).

If False, additional features are allowed to be present in the dataset.

If True, no additional features are allowed to be present in the dataset.

is_locked: bool

Whether the record is locked for edits.

created_at: datetime

Time of creation of record.

Relational fields

branch: Branch

Life cycle state of record.

branch.name can be “main” (default branch), “trash” (trash), branch.name = "archive" (archived), or any other user-created branch typically planned for merging onto main after review.

space: Space

The space in which the record lives.

created_by: User

Creator of record.

run: Run | None

Run that created record.

type: Schema | None

Type of schema.

Allows grouping schemas by type, e.g., all measurements of gene expression vs. protein expression vs. multimodal measurements.

You can define types via ln.Schema(name="ProteinPanel", is_type=True).

Here are a few more examples for type names: 'ExpressionPanel', 'ProteinPanel', 'Multimodal', 'Metadata', 'Embedding'.

components: Schema

Components of this schema.

features: Feature

The features contained in the schema.

schemas: Schema

Schemas for this type.

composites: Schema

The composite schemas that contain this schema as a component.

For example, an AnnData composes multiple schemas: var[DataFrameT], obs[DataFrame], obsm[Array], uns[dict], etc.

validated_artifacts: Artifact

The artifacts that were validated against this schema with a Curator.

artifacts: Artifact

The artifacts that measure a feature set that matches this schema.

records: Record

Records that were annotated with this schema.

projects: Project

Linked projects.

blocks

Accessor to the related objects manager on the reverse side of a many-to-one relation.

In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

Parent.children is a ReverseManyToOneDescriptor instance.

Most of the implementation is delegated to a dynamically defined manager class.

Class methods

classmethod from_values(values, field=FieldAttr(Feature.name), dtype=None, name=None, mute=False, organism=None, source=None, raise_validation_error=True)

Create feature set for validated features.

Parameters:
  • values (list[str] | Series | array) – A list of values, like feature names or ids.

  • field (DeferredAttribute, default: FieldAttr(Feature.name)) – The field of a reference registry to map values.

  • dtype (str | None, default: None) – The simple dtype. Defaults to None if reference registry is Feature, defaults to "float" otherwise.

  • name (str | None, default: None) – A name.

  • organism (SQLRecord | str | None, default: None) – An organism to resolve gene mapping.

  • source (SQLRecord | None, default: None) – A public ontology to resolve feature identifier mapping.

  • raise_validation_error (bool, default: True) – Whether to raise a validation error if some values are not valid.

Raises:

ValidationError – If some values are not valid.

Return type:

Schema

Example

import lamindb as ln
import bionty as bt

features = [ln.Feature(name=feat, dtype="str").save() for feat in ["feat11", "feat21"]]
schema = ln.Schema.from_values(features)

genes = ["ENSG00000139618", "ENSG00000198786"]
schema = ln.Schema.from_values(genes, bt.Gene.ensembl_gene_id, "float")

classmethod from_dataframe(df, field=FieldAttr(Feature.name), name=None, mute=False, organism=None, source=None)

Create schema for valid columns.

Return type:

Schema | None

classmethod filter(*queries, **expressions)

Query records.

Parameters:
  • queries – One or multiple Q objects.

  • expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

See also

Examples

>>> ln.Project(name="my label").save()
>>> ln.Project.filter(name__startswith="my").to_dataframe()
classmethod get(idlike=None, **expressions)

Get a single record.

Parameters:
  • idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.

  • expressions – Fields and values passed as Django query expressions.

Raises:

lamindb.errors.DoesNotExist – In case no matching record is found.

Return type:

SQLRecord

See also

Examples

record = ln.Record.get("FvtpPJLJ")
record = ln.Record.get(name="my-label")
classmethod to_dataframe(include=None, features=False, limit=100)

Evaluate and convert to pd.DataFrame.

By default, maps simple fields and foreign keys onto DataFrame columns.

Guide: Query & search registries

Parameters:
  • include (str | list[str] | None, default: None) – Related data to include as columns. Takes strings of form "records__name", "cell_types__name", etc. or a list of such strings. For Artifact, Record, and Run, can also pass "features" to include features with data types pointing to entities in the core schema. If "privates", includes private fields (fields starting with _).

  • features (bool | list[str], default: False) – Configure the features to include. Can be a feature name or a list of such names. If "queryset", infers the features used within the current queryset. Only available for Artifact, Record, and Run.

  • limit (int, default: 100) – Maximum number of rows to display. If None, includes all results.

  • order_by – Field name to order the records by. Prefix with ‘-’ for descending order. Defaults to ‘-id’ to get the most recent records. This argument is ignored if the queryset is already ordered or if the specified field does not exist.

Return type:

DataFrame

Examples

Include the name of the creator:

ln.Record.to_dataframe(include="created_by__name")

Include features:

ln.Artifact.to_dataframe(include="features")

Include selected features:

ln.Artifact.to_dataframe(features=["cell_type_by_expert", "cell_type_by_model"])
classmethod search(string, *, field=None, limit=20, case_sensitive=False)

Search.

Parameters:
  • string (str) – The input string to match against the field ontology values.

  • field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.

  • limit (int | None, default: 20) – Maximum amount of top results to return.

  • case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A QuerySet of search results, sorted by a relevance score stored in the field score.

See also

filter() lookup()

Examples

records = ln.Record.from_values(["Label1", "Label2", "Label3"], field="name").save()
ln.Record.search("Label2")
classmethod lookup(field=None, return_field=None)

Return an auto-complete object for a field.

Parameters:
  • field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.

  • return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.

  • keep – When multiple records are found for a lookup, how to return the records. - "first": return the first record. - "last": return the last record. - False: return all records.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.

See also

search()

Examples

Look up via auto-complete:

import bionty as bt
bt.Gene.from_source(symbol="ADGB-DT").save()
lookup = bt.Gene.lookup()
lookup.adgb_dt

Look up via auto-complete in dictionary:

lookup_dict = lookup.dict()
lookup_dict['ADGB-DT']

Look up via a specific field:

lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
lookup_by_ensembl_id.ensg00000002745

Return a specific field value instead of the full record:

lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
classmethod connect(instance)

Query a non-default LaminDB instance.

Parameters:

instance (str | None) – An instance identifier of form “account_handle/instance_name”.

Return type:

QuerySet

Examples

ln.Record.connect("account_handle/instance_name").search("label7", field="name")
classmethod inspect(values, field=None, *, mute=False, organism=None, source=None, from_source=True, strict_source=False)

Inspect if values are mappable to a field.

Being mappable means that an exact match exists.

Parameters:
  • values (list[str] | Series | array) – Values that will be checked against the field.

  • field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontology names.

  • mute (bool, default: False) – Whether to mute logging.

  • organism (str | SQLRecord | None, default: None) – An Organism name or record.

  • source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to inspect against.

  • strict_source (bool, default: False) – Determines the validation behavior against records in the registry. - If False, validation will include all records in the registry, ignoring the specified source. - If True, validation will only include records in the registry that are linked to the specified source. Note: this parameter won’t affect validation against public sources.

Return type:

bionty.base.dev.InspectResult

See also

validate()

Example:

import bionty as bt

# save some gene records
bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

# inspect gene symbols
gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
result = bt.Gene.inspect(gene_symbols, field=bt.Gene.symbol, organism="human")
assert result.validated == ["A1CF", "A1BG"]
assert result.non_validated == ["FANCD1", "FANCD20"]
classmethod validate(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)

Validate values against existing values of a string field.

Note this is strict validation: it only asserts exact matches.

Parameters:
  • values (list[str] | Series | array) – Values that will be validated against the field.

  • field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontology names.

  • mute (bool, default: False) – Whether to mute logging.

  • organism (str | SQLRecord | None, default: None) – An Organism name or record.

  • source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to validate against.

  • strict_source (bool, default: False) – Determines the validation behavior against records in the registry. - If False, validation will include all records in the registry, ignoring the specified source. - If True, validation will only include records in the registry that are linked to the specified source. Note: this parameter won’t affect validation against public sources.

Return type:

ndarray

Returns:

A vector of booleans indicating if an element is validated.

See also

inspect()

Example:

import bionty as bt

bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
bt.Gene.validate(gene_symbols, field=bt.Gene.symbol, organism="human")
#> array([ True,  True, False, False])
classmethod standardize(values, field=None, *, return_field=None, return_mapper=False, case_sensitive=False, mute=False, source_aware=True, keep='first', synonyms_field='synonyms', organism=None, source=None, strict_source=False)

Maps input synonyms to standardized names.

Parameters:
  • values (Iterable) – Identifiers that will be standardized.

  • field (str | DeferredAttribute | None, default: None) – The field representing the standardized names.

  • return_field (str | DeferredAttribute | None, default: None) – The field to return. Defaults to field.

  • return_mapper (bool, default: False) – If True, returns {input_value: standardized_name}.

  • case_sensitive (bool, default: False) – Whether the mapping is case sensitive.

  • mute (bool, default: False) – Whether to mute logging.

  • source_aware (bool, default: True) – Whether to standardize from public source. Defaults to True for BioRecord registries.

  • keep (Literal['first', 'last', False], default: 'first') –

    When a synonym maps to multiple names, determines which duplicates to keep, analogous to pd.DataFrame.duplicated: - "first": returns the first mapped standardized name - "last": returns the last mapped standardized name - False: returns all mapped standardized names.

    When keep is False, the returned list of standardized names will contain nested lists in case of duplicates.

    When a field is converted into return_field, keep marks which matches to keep when multiple return_field values map to the same field value.

  • synonyms_field (str, default: 'synonyms') – A field containing the concatenated synonyms.

  • organism (str | SQLRecord | None, default: None) – An Organism name or record.

  • source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to validate against.

  • strict_source (bool, default: False) – Determines the validation behavior against records in the registry. - If False, validation will include all records in the registry, ignoring the specified source. - If True, validation will only include records in the registry that are linked to the specified source. Note: this parameter won’t affect validation against public sources.

Return type:

list[str] | dict[str, str]

Returns:

If return_mapper is False – a list of standardized names. Otherwise, a dictionary of mapped values with mappable synonyms as keys and standardized names as values.

See also

add_synonym()

Add synonyms.

remove_synonym()

Remove synonyms.

Example:

import bionty as bt

# save some gene records
bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

# standardize gene synonyms
gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
bt.Gene.standardize(gene_synonyms)
#> ['A1CF', 'A1BG', 'BRCA2', 'FANCD20']
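The keep parameter follows the same semantics as pandas duplicate handling; a minimal illustration of that behavior:

```python
import pandas as pd

# two synonyms that map to the same standardized name
mapped = pd.Series(["BRCA2", "BRCA2", "A1CF"])

first = mapped.duplicated(keep="first").tolist()  # [False, True, False]
last = mapped.duplicated(keep="last").tolist()    # [True, False, False]
```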

Methods

query_schemas()

Query schemas of sub types.

While .schemas retrieves schemas of the current type, this method also retrieves schemas of all sub types of the current type.

Return type:

QuerySet

save(*args, **kwargs)

Save schema.

Return type:

Schema

add_optional_features(features)

Add optional features to the schema.

Return type:

None

remove_optional_features(features)

Remove optional features from the schema.

Return type:

None

describe(return_str=False)

Describe schema.

Return type:

None | str

restore()

Restore from trash onto the main branch.

Does not restore descendant records if the record is HasType with is_type = True.

Return type:

None

delete(permanent=None, **kwargs)

Delete record.

If record is HasType with is_type = True, deletes all descendant records, too.

Parameters:

permanent (bool | None, default: None) – Whether to permanently delete the record (skips trash). If None, performs soft delete if the record is not already in the trash.

Return type:

None

Examples

For any SQLRecord object record, call:

>>> record.delete()
query_types()

Query types of a record recursively.

While .type retrieves the type, this method retrieves all super types of that type:

# Create type hierarchy
type1 = model_class(name="Type1", is_type=True).save()
type2 = model_class(name="Type2", is_type=True, type=type1).save()
type3 = model_class(name="Type3", is_type=True, type=type2).save()

# Create a record with type3
record = model_class(name=f"{model_name}3", type=type3).save()

# Query super types
super_types = record.query_types()
assert super_types[0] == type3
assert super_types[1] == type2
assert super_types[2] == type1
Return type:

SQLRecordList

add_synonym(synonym, force=False, save=None)

Add synonyms to a record.

Parameters:
  • synonym (str | list[str] | Series | array) – The synonyms to add to the record.

  • force (bool, default: False) – Whether to add synonyms even if they are already synonyms of other records.

  • save (bool | None, default: None) – Whether to save the record to the database.

See also

remove_synonym()

Remove synonyms.

Example:

import bionty as bt

# save "T cell" record
record = bt.CellType.from_source(name="T cell").save()
record.synonyms
#> "T-cell|T lymphocyte|T-lymphocyte"

# add a synonym
record.add_synonym("T cells")
record.synonyms
#> "T cells|T-cell|T-lymphocyte|T lymphocyte"
remove_synonym(synonym)

Remove synonyms from a record.

Parameters:

synonym (str | list[str] | Series | array) – The synonym values to remove.

See also

add_synonym()

Add synonyms

Example:

import bionty as bt

# save "T cell" record
record = bt.CellType.from_source(name="T cell").save()
record.synonyms
#> "T-cell|T lymphocyte|T-lymphocyte"

# remove a synonym
record.remove_synonym("T-cell")
record.synonyms
#> "T lymphocyte|T-lymphocyte"
set_abbr(value)

Set value for abbr field and add to synonyms.

Parameters:

value (str) – A value for an abbreviation.

See also

add_synonym()

Example:

import bionty as bt

# save an experimental factor record
scrna = bt.ExperimentalFactor.from_source(name="single-cell RNA sequencing").save()
assert scrna.abbr is None
assert scrna.synonyms == "single-cell RNA-seq|single-cell transcriptome sequencing|scRNA-seq|single cell RNA sequencing"

# set abbreviation
scrna.set_abbr("scRNA")
assert scrna.abbr == "scRNA"
# synonyms are updated
assert scrna.synonyms == "scRNA|single-cell RNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing|scRNA-seq"
refresh_from_db(using=None, fields=None, from_queryset=None)

Reload field values from the database.

By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.

Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.

When accessing deferred fields of an instance, the deferred loading of the field will call this method.

async arefresh_from_db(using=None, fields=None, from_queryset=None)

Asynchronous version of refresh_from_db().