lamindb.Schema¶
- class lamindb.Schema(features: list[SQLRecord] | SQLRecordList | list[tuple[Feature, dict]] | None = None, *, slots: dict[str, Schema] | None = None, name: str | None = None, description: str | None = None, itype: str | Registry | FieldAttr | None = None, type: Schema | None = None, is_type: bool = False, index: Feature | None = None, flexible: bool | None = None, otype: str | None = None, dtype: str | Type[int | float | str] | None = None, minimal_set: bool = True, maximal_set: bool = False, ordered_set: bool = False, coerce_dtype: bool = False, n: int | None = None)¶
Bases: SQLRecord, HasType, CanCurate, TracksRun
Schemas of datasets such as column sets of dataframes.
Note
To create a schema, at least one of the following parameters must be passed:
- features - a list of Feature objects
- itype - the identifier type, e.g., Feature or bt.Gene.ensembl_gene_id
- slots - a dictionary mapping slots to Schema objects, e.g., for an AnnData, {"obs": Schema(...), "var.T": Schema(...)}
- is_type=True - a schema type to group schemas, e.g., “ProteinPanel”
- Parameters:
features (list[SQLRecord] | list[tuple[Feature, dict]] | None, default: None) – Feature records, e.g., [Feature(...), Feature(...)] or features with their config, e.g., [Feature(...).with_config(optional=True)].
slots (dict[str, Schema] | None, default: None) – A dictionary mapping slot names to Schema objects to create a composite schema.
name (str | None, default: None) – Name of the schema.
description (str | None, default: None) – Description of the schema.
itype (str | None, default: None) – Feature identifier type to validate against, e.g., ln.Feature or bt.Gene.ensembl_gene_id. Is automatically set to the type of the passed features.
type (Schema | None, default: None) – Define schema types like ln.Schema(name="ProteinPanel", is_type=True).
is_type (bool, default: False) – Whether the schema is a type.
index (Feature | None, default: None) – A Feature record to validate an index of a DataFrame and therefore also, e.g., AnnData obs and var indices.
flexible (bool | None, default: None) – Whether to include any feature of the same itype during validation & annotation. If features is passed, defaults to False so that, e.g., additional columns of a DataFrame encountered during validation are disregarded. If features is not passed, defaults to True.
otype (str | None, default: None) – An object type to define the structure of a composite schema, e.g., "DataFrame", "AnnData".
dtype (str | None, default: None) – A dtype to assume for all features in the schema (e.g., “num”, float, int). Defaults to None if itype is Feature. Otherwise to "num", e.g., if itype is bt.Gene.ensembl_gene_id.
minimal_set (bool, default: True) – Whether all passed features are required by default. See optionals for more fine-grained control.
maximal_set (bool, default: False) – Whether additional features are allowed.
ordered_set (bool, default: False) – Whether features are required to be ordered.
coerce_dtype (bool, default: False) – When True, attempts to coerce values to the specified dtype during validation, see coerce_dtype.
n (int | None, default: None) – A manual way of specifying the number of features in the schema. Is inferred from features if passed.
See also
from_dataframe() – Validate & annotate a DataFrame with a schema.
from_anndata() – Validate & annotate an AnnData with a schema.
from_mudata() – Validate & annotate a MuData with a schema.
from_spatialdata() – Validate & annotate a SpatialData with a schema.
Examples
A schema with a single required feature:
import lamindb as ln
schema = ln.Schema([ln.Feature(name="required_feature", dtype=str).save()]).save()
A schema that constrains feature identifiers to be valid feature names:
schema = ln.Schema(itype=ln.Feature)  # uses Feature.name as identifier type
Or valid Ensembl gene ids:
import bionty as bt
schema = ln.Schema(itype=bt.Gene.ensembl_gene_id)
A flexible schema that requires a single feature but also validates & annotates additional features with registered feature identifiers:
schema = ln.Schema(
    [ln.Feature(name="required_feature", dtype=str).save()],
    itype=ln.Feature,
    flexible=True,
).save()
Create a schema type to group schemas:
protein_panel = ln.Schema(name="ProteinPanel", is_type=True).save()
schema = ln.Schema(itype=bt.CellMarker, type=protein_panel).save()
Validate the index of a DataFrame:
schema = ln.Schema(
    [ln.Feature(name="required_feature", dtype=str).save()],
    index=ln.Feature(name="sample", dtype=ln.ULabel).save(),
).save()
Mark a feature as optional:
schema = ln.Schema([
    ln.Feature(name="required_feature", dtype=str).save(),
    ln.Feature(name="feature2", dtype=int).save().with_config(optional=True),
]).save()
Parse & validate feature identifier values:
schema = ln.Schema.from_values(
    adata.var["ensemble_id"],
    field=bt.Gene.ensembl_gene_id,
    organism="mouse",
).save()
Create a schema from a DataFrame:
import pandas as pd
df = pd.DataFrame({"feat1": [1, 2], "feat2": [3.1, 4.2], "feat3": ["cond1", "cond2"]})
schema = ln.Schema.from_dataframe(df)
Attributes¶
- property coerce_dtype: bool¶
Whether dtypes should be coerced during validation.
For example, an object-dtyped pandas column can be coerced to categorical and would pass validation if this is true.
- property flexible: bool¶
Indicates how to handle validation and annotation in case features are not defined.
Examples
Make a rigid schema flexible:
schema = ln.Schema.get(name="my_schema")
schema.flexible = True
schema.save()
During schema creation:
# if you're not passing features but just defining the itype, defaults to flexible = True
schema = ln.Schema(itype=ln.Feature).save()
# schema.flexible is True
# if you're passing features, defaults to flexible = False
schema = ln.Schema(
    features=[ln.Feature(name="my_required_feature", dtype=int).save()],
)
# schema.flexible is False
# you can also validate & annotate features in addition to those that you're explicitly defining:
schema = ln.Schema(
    features=[ln.Feature(name="my_required_feature", dtype=int).save()],
    flexible=True,
)
# schema.flexible is True
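The defaulting rule described for flexible can be written down as a small plain-Python sketch. The helper resolve_flexible is hypothetical and not part of lamindb; it only restates the documented rule, and treating an empty feature list like "no features passed" is an assumption of this sketch.

```python
from __future__ import annotations

from typing import Sequence


def resolve_flexible(features: Sequence[object] | None, flexible: bool | None) -> bool:
    """Hypothetical helper restating the documented default for `flexible`."""
    # An explicit value always wins.
    if flexible is not None:
        return flexible
    # No features passed -> flexible; features passed -> rigid.
    # Assumption: an empty list counts as "no features passed".
    return not features


print(resolve_flexible(None, None))
print(resolve_flexible(["my_required_feature"], None))
print(resolve_flexible(["my_required_feature"], True))
```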
- property index: None | Feature¶
The feature configured to act as index.
To unset it, set schema.index to None.
- property members: QuerySet¶
A queryset for the individual records in the feature set underlying the schema.
Unlike schema.features, schema.genes, schema.proteins, etc., this queryset is ordered and doesn’t require knowledge of the entity.
- property optionals: SchemaOptionals¶
Manage optional features.
Example
import pandas as pd
# a schema with optional "sample_name"
schema_optional_sample_name = ln.Schema(
    features=[
        ln.Feature(name="sample_id", dtype=str).save(),  # required
        ln.Feature(name="sample_name", dtype=str).save().with_config(optional=True),  # optional
    ],
).save()
# raises ValidationError because the required `sample_id` column is missing
ln.curators.DataFrameCurator(
    pd.DataFrame({"sample_name": ["Sample 1", "Sample 2"]}),
    schema=schema_optional_sample_name,
).validate()
# passes because only the optional `sample_name` column is missing
ln.curators.DataFrameCurator(
    pd.DataFrame({"sample_id": ["sample1", "sample2"]}),
    schema=schema_optional_sample_name,
).validate()
- property slots: dict[str, Schema]¶
Slots.
Examples
# define composite schema
anndata_schema = ln.Schema(
    name="mini_immuno_anndata_schema",
    otype="AnnData",
    slots={"obs": obs_schema, "var": var_schema},
).save()
# access slots
anndata_schema.slots
#> {'obs': <Schema: obs_schema>, 'var': <Schema: var_schema>}
Simple fields¶
- uid: str¶
A universal id.
- name: str | None¶
A name.
- description: str | None¶
A description.
- n: int¶
Number of features in the schema.
- is_type: bool¶
Distinguish types from instances of the type.
- itype: str | None¶
A field of a registry that stores feature identifier types, e.g., 'Feature.name' or 'bionty.Gene.ensembl_gene_id'. Defaults to the default name field if a registry is passed (passing Feature would result in Feature.name).
Depending on itype, .members stores, e.g., Feature or bionty.Gene records.
- otype: str | None¶
Default Python object type, e.g., DataFrame, AnnData.
- dtype: str | None¶
Data type, e.g., “num”, “float”, “int”. Is None for Feature.
For Feature, types are expected to be heterogeneous and defined on a per-feature level.
- hash: str | None¶
A hash of the set of feature identifiers.
For a composite schema, the hash of hashes.
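As an illustration of what "a hash of the set of feature identifiers" and "the hash of hashes" can mean, here is a minimal sketch using hashlib. The digest algorithm, the separator, and the inclusion of slot names are assumptions of this sketch; the actual hashing scheme used by lamindb is not specified here and may differ.

```python
import hashlib


def hash_identifiers(identifiers):
    """Order-independent hash of a set of feature identifiers (illustrative only)."""
    # Deduplicate and sort so the same set always yields the same digest.
    joined = "\n".join(sorted(set(identifiers)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def hash_composite(slot_hashes):
    """Hash of the per-slot hashes of a composite schema (illustrative only)."""
    # Assumption: slot names take part in the hash; sort by slot for determinism.
    joined = "\n".join(f"{slot}={h}" for slot, h in sorted(slot_hashes.items()))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


obs_hash = hash_identifiers(["cell_type", "donor"])
var_hash = hash_identifiers(["ENSG00000139618", "ENSG00000198786"])
composite_hash = hash_composite({"obs": obs_hash, "var": var_hash})
```

The point of the sketch is that two schemas with the same identifiers, in any order, map to the same hash and can therefore be recognized as identical.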
- minimal_set: bool¶
Whether all passed features are to be considered required by default (default True).
Note that features that are explicitly marked as optional via feature.with_config(optional=True) are not required even if minimal_set is true.
- ordered_set: bool¶
Whether features are required to be ordered (default False).
- maximal_set: bool¶
Whether all features present in the dataset must be in the schema (default False).
If False, additional features are allowed to be present in the dataset.
If True, no additional features are allowed to be present in the dataset.
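How minimal_set, ordered_set, and maximal_set interact can be illustrated with a plain-Python sketch. The function check_columns and its argument names are hypothetical and not part of lamindb; the sketch only mirrors the field descriptions above and makes no claim about how lamindb implements validation.

```python
from __future__ import annotations

from typing import Sequence


def check_columns(
    schema_features: Sequence[str],
    columns: Sequence[str],
    *,
    optional: frozenset = frozenset(),
    minimal_set: bool = True,
    maximal_set: bool = False,
    ordered_set: bool = False,
) -> list[str]:
    """Return a list of problems; an empty list means the columns pass."""
    problems = []
    # minimal_set: every schema feature that is not optional must be present.
    if minimal_set:
        missing = [f for f in schema_features if f not in optional and f not in columns]
        if missing:
            problems.append(f"missing required: {missing}")
    # maximal_set: no column outside the schema is allowed.
    if maximal_set:
        extra = [c for c in columns if c not in schema_features]
        if extra:
            problems.append(f"unexpected: {extra}")
    # ordered_set: features shared with the schema must follow the schema order.
    if ordered_set:
        present = [c for c in columns if c in schema_features]
        expected = [f for f in schema_features if f in columns]
        if present != expected:
            problems.append("order differs from schema")
    return problems


print(check_columns(["sample_id", "sample_name"], ["sample_id", "extra"],
                    optional=frozenset({"sample_name"})))
```

With the defaults (minimal_set=True, maximal_set=False), the extra column is tolerated and the missing optional feature is not reported.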
- is_locked: bool¶
Whether the record is locked for edits.
- created_at: datetime¶
Time of creation of record.
Relational fields¶
- branch: Branch¶
Life cycle state of record.
branch.name can be “main” (default branch), “trash” (trash), “archive” (archived), or any other user-created branch, typically planned for merging onto main after review.
- type: Schema | None¶
Type of schema.
Allows grouping schemas by type, e.g., all measurements evaluating gene expression vs. protein expression vs. multi-modal.
You can define types via ln.Schema(name="ProteinPanel", is_type=True).
Here are a few more examples for type names: 'ExpressionPanel', 'ProteinPanel', 'Multimodal', 'Metadata', 'Embedding'.
- composites: Schema¶
The composite schemas that contain this schema as a component.
For example, an AnnData composes multiple schemas: var[DataFrameT], obs[DataFrame], obsm[Array], uns[dict], etc.
- validated_artifacts: Artifact¶
The artifacts that were validated against this schema with a Curator.
- blocks¶
Accessor to the related objects manager on the reverse side of a many-to-one relation.
In the example:
class Child(Model):
    parent = ForeignKey(Parent, related_name='children')
Parent.children is a ReverseManyToOneDescriptor instance.
Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.
Class methods¶
- classmethod from_values(values, field=FieldAttr(Feature.name), dtype=None, name=None, mute=False, organism=None, source=None, raise_validation_error=True)¶
Create feature set for validated features.
- Parameters:
values (list[str] | Series | array) – A list of values, like feature names or ids.
field (DeferredAttribute, default: FieldAttr(Feature.name)) – The field of a reference registry to map values.
dtype (str | None, default: None) – The simple dtype. Defaults to None if reference registry is Feature, defaults to "float" otherwise.
name (str | None, default: None) – A name.
organism (SQLRecord | str | None, default: None) – An organism to resolve gene mapping.
source (SQLRecord | None, default: None) – A public ontology to resolve feature identifier mapping.
raise_validation_error (bool, default: True) – Whether to raise a validation error if some values are not valid.
- Raises:
ValidationError – If some values are not valid.
- Return type:
Example
import lamindb as ln
import bionty as bt
# a schema from feature records
features = [ln.Feature(name=feat, dtype="str").save() for feat in ["feat11", "feat21"]]
schema = ln.Schema.from_values(features)
# a schema from Ensembl gene ids
genes = ["ENSG00000139618", "ENSG00000198786"]
schema = ln.Schema.from_values(genes, bt.Gene.ensembl_gene_id, "float")
- classmethod from_dataframe(df, field=FieldAttr(Feature.name), name=None, mute=False, organism=None, source=None)¶
Create schema for valid columns.
- Return type:
Schema|None
- classmethod filter(*queries, **expressions)¶
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ln.Project(name="my label").save()
>>> ln.Project.filter(name__startswith="my").to_dataframe()
- classmethod get(idlike=None, **expressions)¶
Get a single record.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.
- Raises:
lamindb.errors.DoesNotExist – In case no matching record is found.
- Return type:
See also
Guide: Query & search registries
Django documentation: Queries
Examples
record = ln.Record.get("FvtpPJLJ")
record = ln.Record.get(name="my-label")
- classmethod to_dataframe(include=None, features=False, limit=100)¶
Evaluate and convert to pd.DataFrame.
By default, maps simple fields and foreign keys onto DataFrame columns.
Guide: Query & search registries
- Parameters:
include (str | list[str] | None, default: None) – Related data to include as columns. Takes strings of form "records__name", "cell_types__name", etc. or a list of such strings. For Artifact, Record, and Run, can also pass "features" to include features with data types pointing to entities in the core schema. If "privates", includes private fields (fields starting with _).
features (bool | list[str], default: False) – Configure the features to include. Can be a feature name or a list of such names. If "queryset", infers the features used within the current queryset. Only available for Artifact, Record, and Run.
limit (int, default: 100) – Maximum number of rows to display. If None, includes all results.
order_by – Field name to order the records by. Prefix with ‘-’ for descending order. Defaults to ‘-id’ to get the most recent records. This argument is ignored if the queryset is already ordered or if the specified field does not exist.
- Return type:
DataFrame
Examples
Include the name of the creator:
ln.Record.to_dataframe(include="created_by__name")
Include features:
ln.Artifact.to_dataframe(include="features")
Include selected features:
ln.Artifact.to_dataframe(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum amount of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
- Returns:
A sorted DataFrame of search results with a score in column score. If return_queryset is True, a QuerySet.
Examples
records = ln.Record.from_values(["Label1", "Label2", "Label3"], field="name").save()
ln.Record.search("Label2")
- classmethod lookup(field=None, return_field=None)¶
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
keep – When multiple records are found for a lookup, how to return the records.
- "first": return the first record.
- "last": return the last record.
- False: return all records.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
See also
Examples
Look up via auto-complete on `.`:
import bionty as bt
bt.Gene.from_source(symbol="ADGB-DT").save()
lookup = bt.Gene.lookup()
lookup.adgb_dt
Look up via auto-complete in dictionary:
lookup_dict = lookup.dict()
lookup_dict['ADGB-DT']
Look up via a specific field:
lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
lookup_by_ensembl_id.ensg00000002745
Return a specific field value instead of the full record:
lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
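The idea behind an auto-complete lookup object can be sketched in plain Python with collections.namedtuple. The helpers to_attr and build_lookup are hypothetical and not part of lamindb or bionty, the records are toy dictionaries rather than real gene annotations, and the normalization rule (lowercase, non-word characters replaced by underscores) is an assumption inferred from the adgb_dt example above.

```python
import re
from collections import namedtuple


def to_attr(value):
    """Turn a field value into an attribute-friendly name, e.g. 'GENE-A' -> 'gene_a'."""
    return re.sub(r"\W", "_", value).lower()


def build_lookup(records, field, return_field=None):
    """Return (lookup, lookup_dict) for a list of dict records (illustrative only)."""
    by_value = {
        rec[field]: (rec if return_field is None else rec[return_field])
        for rec in records
    }
    # namedtuple raises ValueError if two values normalize to the same attribute
    # name or to an invalid identifier; a real implementation must handle that.
    Lookup = namedtuple("Lookup", [to_attr(v) for v in by_value])
    return Lookup(*by_value.values()), by_value


# Toy records with made-up identifiers.
records = [
    {"symbol": "GENE-A", "gene_id": "ID0001"},
    {"symbol": "GENE-B", "gene_id": "ID0002"},
]
lookup, lookup_dict = build_lookup(records, "symbol")
print(lookup.gene_a)            # attribute access, suitable for tab completion
print(lookup_dict["GENE-B"])    # dictionary access with the original value
```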
- classmethod connect(instance)¶
Query a non-default LaminDB instance.
- Parameters:
instance (str | None) – An instance identifier of form “account_handle/instance_name”.
- Return type:
Examples
ln.Record.connect("account_handle/instance_name").search("label7", field="name")
- classmethod inspect(values, field=None, *, mute=False, organism=None, source=None, from_source=True, strict_source=False)¶
Inspect if values are mappable to a field.
Being mappable means that an exact match exists.
- Parameters:
values (list[str] | Series | array) – Values that will be checked against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to inspect against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against public sources.
- Return type:
bionty.base.dev.InspectResult
See also
Example:
import bionty as bt
# save some gene records
bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()
# inspect gene symbols
gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
result = bt.Gene.inspect(gene_symbols, field=bt.Gene.symbol, organism="human")
assert result.validated == ["A1CF", "A1BG"]
assert result.non_validated == ["FANCD1", "FANCD20"]
- classmethod validate(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶
Validate values against existing values of a string field.
Note this is strict validation; it only asserts exact matches.
- Parameters:
values (list[str] | Series | array) – Values that will be validated against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against public sources.
- Return type:
ndarray
- Returns:
A vector of booleans indicating if an element is validated.
See also
Example:
import bionty as bt
bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()
gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
bt.Gene.validate(gene_symbols, field=bt.Gene.symbol, organism="human")
#> array([ True, True, False, False])
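The exact-match semantics shared by inspect() and validate() can be sketched without a database. The names SketchInspectResult, inspect_values, and validate_values are hypothetical stand-ins, not the bionty or lamindb API, and the sketch returns a plain list of booleans where validate() returns an ndarray.

```python
from typing import NamedTuple


class SketchInspectResult(NamedTuple):
    """Hypothetical stand-in for the result object returned by inspect()."""
    validated: list
    non_validated: list


def validate_values(values, known):
    """One boolean per input value: True only for an exact match."""
    known_set = set(known)
    return [v in known_set for v in values]


def inspect_values(values, known):
    """Split values into exact matches and everything else, preserving order."""
    flags = validate_values(values, known)
    return SketchInspectResult(
        validated=[v for v, ok in zip(values, flags) if ok],
        non_validated=[v for v, ok in zip(values, flags) if not ok],
    )


saved_symbols = ["A1CF", "A1BG", "BRCA2"]
gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
print(validate_values(gene_symbols, saved_symbols))
print(inspect_values(gene_symbols, saved_symbols))
```

Note that "FANCD1" is not validated even though it is a synonym of BRCA2: only exact matches count, and mapping synonyms is the job of standardize().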
- classmethod standardize(values, field=None, *, return_field=None, return_mapper=False, case_sensitive=False, mute=False, source_aware=True, keep='first', synonyms_field='synonyms', organism=None, source=None, strict_source=False)¶
Maps input synonyms to standardized names.
- Parameters:
values (Iterable) – Identifiers that will be standardized.
field (str | DeferredAttribute | None, default: None) – The field representing the standardized names.
return_field (str | DeferredAttribute | None, default: None) – The field to return. Defaults to field.
return_mapper (bool, default: False) – If True, returns {input_value: standardized_name}.
case_sensitive (bool, default: False) – Whether the mapping is case sensitive.
mute (bool, default: False) – Whether to mute logging.
source_aware (bool, default: True) – Whether to standardize from public source. Defaults to True for BioRecord registries.
keep (Literal['first', 'last', False], default: 'first') – When a synonym maps to multiple names, determines which duplicates to mark as pd.DataFrame.duplicated:
- "first": returns the first mapped standardized name
- "last": returns the last mapped standardized name
- False: returns all mapped standardized names
When keep is False, the returned list of standardized names will contain nested lists in case of duplicates.
When a field is converted into return_field, keep marks which matches to keep when multiple return_field values map to the same field value.
synonyms_field (str, default: 'synonyms') – A field containing the concatenated synonyms.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against public sources.
- Return type:
list[str] | dict[str, str]
- Returns:
If return_mapper is False, a list of standardized names. Otherwise, a dictionary of mapped values with mappable synonyms as keys and standardized names as values.
See also
add_synonym() – Add synonyms.
remove_synonym() – Remove synonyms.
Example:
import bionty as bt
# save some gene records
bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()
# standardize gene synonyms
gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
bt.Gene.standardize(gene_synonyms)
#> ['A1CF', 'A1BG', 'BRCA2', 'FANCD20']
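The roles of keep and return_mapper can be illustrated with a small in-memory sketch. The function standardize_values and the synonyms_by_name mapping are hypothetical and not the bionty implementation; the sketch is case sensitive and ignores sources and organisms.

```python
def standardize_values(values, synonyms_by_name, *, keep="first", return_mapper=False):
    """Map synonyms to standardized names (illustrative only).

    synonyms_by_name maps each standardized name to a list of its synonyms.
    """
    # Build a reverse index: synonym -> standardized names, in registry order.
    reverse = {}
    for name, synonyms in synonyms_by_name.items():
        for synonym in synonyms:
            reverse.setdefault(synonym, []).append(name)

    def pick(names):
        if keep == "first":
            return names[0]
        if keep == "last":
            return names[-1]
        # keep=False: return all names as a nested list when ambiguous
        return names if len(names) > 1 else names[0]

    # Only values that are not already standardized names get mapped.
    mapper = {
        v: pick(reverse[v])
        for v in values
        if v not in synonyms_by_name and v in reverse
    }
    if return_mapper:
        return mapper
    # Unmappable values are passed through unchanged.
    return [mapper.get(v, v) for v in values]


registry = {"A1CF": [], "A1BG": [], "BRCA2": ["FANCD1"]}
print(standardize_values(["A1CF", "A1BG", "FANCD1", "FANCD20"], registry))
```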
Methods¶
- query_schemas()¶
Query schemas of sub types.
While .schemas retrieves the schemas with the current type, this method also retrieves sub types and the schemas with sub types of the current type.
- Return type:
- add_optional_features(features)¶
Add optional features to the schema.
- Return type:
None
- remove_optional_features(features)¶
Remove optional features from the schema.
- Return type:
None
- describe(return_str=False)¶
Describe schema.
- Return type:
None|str
- restore()¶
Restore from trash onto the main branch.
Does not restore descendant records if the record is HasType with is_type = True.
- Return type:
None
- delete(permanent=None, **kwargs)¶
Delete record.
If record is HasType with is_type = True, deletes all descendant records, too.
- Parameters:
permanent (bool | None, default: None) – Whether to permanently delete the record (skips trash). If None, performs soft delete if the record is not already in the trash.
- Return type:
None
Examples
For any SQLRecord object record, call:
>>> record.delete()
- query_types()¶
Query types of a record recursively.
While .type retrieves the type, this method retrieves all super types of that type:
# model_class is a placeholder for a registry class and model_name for its name
# Create type hierarchy
type1 = model_class(name="Type1", is_type=True).save()
type2 = model_class(name="Type2", is_type=True, type=type1).save()
type3 = model_class(name="Type3", is_type=True, type=type2).save()
# Create a record with type3
record = model_class(name=f"{model_name}3", type=type3).save()
# Query super types
super_types = record.query_types()
assert super_types[0] == type3
assert super_types[1] == type2
assert super_types[2] == type1
- Return type:
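The recursive walk performed by query_types() can be pictured with an in-memory sketch. The TypedRecord dataclass and the free function query_types below are hypothetical stand-ins and not the lamindb implementation, which runs against the database; the sketch only reproduces the ordering shown in the example above (nearest type first).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TypedRecord:
    """Hypothetical in-memory stand-in for a record that can have a type."""
    name: str
    is_type: bool = False
    type: TypedRecord | None = None


def query_types(record: TypedRecord) -> list[TypedRecord]:
    """Collect the record's type and all of its super types, nearest first."""
    chain = []
    current = record.type
    while current is not None:
        chain.append(current)
        current = current.type
    return chain


type1 = TypedRecord("Type1", is_type=True)
type2 = TypedRecord("Type2", is_type=True, type=type1)
type3 = TypedRecord("Type3", is_type=True, type=type2)
record = TypedRecord("Record3", type=type3)
print([t.name for t in query_types(record)])
```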
- add_synonym(synonym, force=False, save=None)¶
Add synonyms to a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonyms to add to the record.
force (bool, default: False) – Whether to add synonyms even if they are already synonyms of other records.
save (bool | None, default: None) – Whether to save the record to the database.
See also
remove_synonym() – Remove synonyms.
Example:
import bionty as bt
# save "T cell" record
record = bt.CellType.from_source(name="T cell").save()
record.synonyms
#> "T-cell|T lymphocyte|T-lymphocyte"
# add a synonym
record.add_synonym("T cells")
record.synonyms
#> "T cells|T-cell|T-lymphocyte|T lymphocyte"
- remove_synonym(synonym)¶
Remove synonyms from a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonym values to remove.
See also
add_synonym() – Add synonyms.
Example:
import bionty as bt
# save "T cell" record
record = bt.CellType.from_source(name="T cell").save()
record.synonyms
#> "T-cell|T lymphocyte|T-lymphocyte"
# remove a synonym
record.remove_synonym("T-cell")
record.synonyms
#> "T lymphocyte|T-lymphocyte"
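The examples for add_synonym() and remove_synonym() show synonyms stored as a single "|"-separated string. A minimal sketch of that representation follows; the helper names add_to_synonyms and remove_from_synonyms are hypothetical, and the ordering of the result is an assumption of this sketch (the add_synonym() example above shows a different order, so the real ordering should not be relied upon).

```python
from __future__ import annotations


def _split(synonyms: str | None) -> list[str]:
    """Split a '|'-separated synonyms string into a list, tolerating None."""
    return [s for s in (synonyms or "").split("|") if s]


def add_to_synonyms(synonyms: str | None, new: str | list[str]) -> str:
    """Append synonyms that are not present yet (illustrative only)."""
    items = _split(synonyms)
    for synonym in [new] if isinstance(new, str) else list(new):
        if "|" in synonym:
            raise ValueError("a synonym must not contain the '|' separator")
        if synonym not in items:
            items.append(synonym)
    return "|".join(items)


def remove_from_synonyms(synonyms: str | None, old: str | list[str]) -> str:
    """Drop the given synonyms and keep the rest in order (illustrative only)."""
    drop = {old} if isinstance(old, str) else set(old)
    return "|".join(s for s in _split(synonyms) if s not in drop)


current = "T-cell|T lymphocyte|T-lymphocyte"
print(add_to_synonyms(current, "T cells"))
print(remove_from_synonyms(current, "T-cell"))
```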
- set_abbr(value)¶
Set value for abbr field and add to synonyms.
- Parameters:
value (str) – A value for an abbreviation.
See also
Example:
import bionty as bt
# save an experimental factor record
scrna = bt.ExperimentalFactor.from_source(name="single-cell RNA sequencing").save()
assert scrna.abbr is None
assert scrna.synonyms == "single-cell RNA-seq|single-cell transcriptome sequencing|scRNA-seq|single cell RNA sequencing"
# set abbreviation
scrna.set_abbr("scRNA")
assert scrna.abbr == "scRNA"
# synonyms are updated
assert scrna.synonyms == "scRNA|single-cell RNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing|scRNA-seq"
- refresh_from_db(using=None, fields=None, from_queryset=None)¶
Reload field values from the database.
By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.
Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.
When accessing deferred fields of an instance, the deferred loading of the field will call this method.
- async arefresh_from_db(using=None, fields=None, from_queryset=None)¶