Freezable

How minus80 works behind the scenes

As detailed in the overview, minus80 comes with two objects that make it easy to store and access data about experimental samples or, more broadly, experimental accessions. Accession objects can be created, but they do not persist across Python sessions unless they are stored within a Cohort.

Also, changing an Accession object does not change the underlying data; the stored data changes only when accessions are modified through the Cohort methods. Think of taking DNA samples out of the freezer and using them in an assay. A small aliquot is taken from the frozen sample for the analysis, and changes to this aliquot (the non-persistent accession here) do not change the underlying DNA stored in the freezer (the cohort).

In the same vein, duplicate Accessions can be stored (sometimes with different metadata) in multiple Cohorts in minus80. It is the context of the Cohort name that differentiates what data goes along with each accession contained within it. Think of 10 individuals. You could have a cohort called “genomicDNA” that contains data for all 10 samples. You could also have the same 10 accessions under a different context, perhaps “liverSamples”.
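
The freezer analogy can be sketched in a few lines of plain Python. This is not minus80 code: the `freezer` dict, the `take_aliquot` helper, and the metadata keys are hypothetical stand-ins. They only illustrate that edits to a retrieved copy leave the stored record untouched, and that the same accession name can carry different metadata under different cohort names.

```python
import copy

# Hypothetical stand-in for the freezer: cohort name -> accession name -> metadata.
freezer = {
    "genomicDNA": {"individual1": {"tissue": "blood"}},
    "liverSamples": {"individual1": {"tissue": "liver"}},
}


def take_aliquot(cohort, accession):
    """Return an independent copy of the stored metadata (the 'aliquot')."""
    return copy.deepcopy(freezer[cohort][accession])


aliquot = take_aliquot("genomicDNA", "individual1")
aliquot["tissue"] = "contaminated"  # changes only the in-memory copy

# The stored record is unchanged, and the other cohort keeps its own context.
stored_tissue = freezer["genomicDNA"]["individual1"]["tissue"]
liver_tissue = freezer["liverSamples"]["individual1"]["tissue"]
```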

Freezable data structures

Cohorts persist across Python sessions because the Cohort class inherits from the Freezable class. Accessions do not inherit these properties. The Freezable class is an abstract one, meaning that you would most likely not create a Freezable instance on its own. This is much like how lists relate to Python's abstract Iterable class: you would never create just an iterable, but rather objects that are iterable. In the same way, you can create objects that are freezable.
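
The comparison with iterables can be checked directly with the standard library. The sketch below also defines a small abstract base class of its own; `Storable` and `Box` are hypothetical names for illustration, and whether Freezable itself is built on the `abc` module is not stated here.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable

# A list counts as an Iterable, but Iterable itself cannot be instantiated.
list_is_iterable = isinstance([1, 2, 3], Iterable)
try:
    Iterable()
    iterable_instantiated = True
except TypeError:
    iterable_instantiated = False


class Storable(ABC):
    """Hypothetical abstract base, standing in for the role Freezable plays."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def describe(self):
        """Concrete subclasses must implement this."""


class Box(Storable):
    def describe(self):
        return f"{type(self).__name__}:{self.name}"


box = Box("Sample1Liver")
description = box.describe()
```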

Here is the signature for the Freezable class:

class minus80.Freezable(name, parent=None, basedir=None)

Freezable is an abstract class. Things that inherit from Freezable can be loaded and unloaded from the Minus80.

A freezable object is a persistent object that lives in a known directory. It is aimed at making expensive-to-build objects and databases loadable from new runtimes.

The four main things that a Freezable object supplies are:

* access to a sqlite database (relational records)
* access to a bcolz database (columnar/table data)
* access to a persistent key/val store
* access to named temp files

A Freezable object needs a name attribute and a discernible dtype in order to be frozen. For instance, since Cohorts are freezable, assume the following:

>>> import minus80 as m80
>>> x = m80.Cohort('Sample1Liver')

Here the name is Sample1Liver and the dtype is Cohort. Since Cohorts are freezable, and the Freezable __init__ function is called when the object is made, the object inherits some attributes. Some of these attributes are links to several centralized databases. These databases are stored in the directory given by the basedir option in the ~/.minus80.conf file.
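
As a rough illustration of how a base-class `__init__` can hand a name, a dtype, and a storage location to every subclass, consider the sketch below. `FreezableSketch`, `CohortSketch`, and the `dtype.name` directory layout are assumptions made for this example; they are not the real minus80 classes or its on-disk layout.

```python
import tempfile
from pathlib import Path


class FreezableSketch:
    """Hypothetical base class; not the real minus80.Freezable."""

    def __init__(self, name, basedir):
        self._name = name
        # The dtype is taken from the name of the concrete subclass.
        self._dtype = type(self).__name__
        # Assumed layout: one directory per (dtype, name) pair under basedir.
        self._path = Path(basedir) / f"{self._dtype}.{self._name}"
        self._path.mkdir(parents=True, exist_ok=True)


class CohortSketch(FreezableSketch):
    def __init__(self, name, basedir):
        super().__init__(name, basedir)  # runs the base-class set-up


with tempfile.TemporaryDirectory() as basedir:
    x = CohortSketch("Sample1Liver", basedir)
    dtype = x._dtype
    dirname = x._path.name
    existed = x._path.is_dir()
```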

Three different databases are supported, each serving a slightly different purpose. You can read the full details in the API Reference, or read a summary here.

Relational Data

A link to a sqlite database is provided using the apsw package internally. When the object is created, a database file linked to its name and dtype is created.

Also, upon creation, an open connection to the database is made. This can be accessed as an internal attribute:

>>> from minus80 import Cohort
>>> x = Cohort('Sample1Liver')
# Get the internal sqlite connection
>>> con = x._db
# Get a cursor
>>> cur = x._db.cursor()
# Execute some SQL
>>> cur.execute('SELECT * FROM ...')
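
For readers unfamiliar with the connection/cursor pattern, here is a self-contained round trip. It uses the standard-library `sqlite3` module against an in-memory database rather than apsw and a minus80 object, and the `samples` table and its columns are made up for illustration.

```python
import sqlite3

# An in-memory database stands in for the object's database file.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Create a hypothetical table and insert two rows with bound parameters.
cur.execute("CREATE TABLE samples (name TEXT PRIMARY KEY, tissue TEXT)")
cur.executemany(
    "INSERT INTO samples (name, tissue) VALUES (?, ?)",
    [("individual1", "liver"), ("individual2", "liver")],
)
con.commit()

# Query the rows back in a stable order.
rows = cur.execute("SELECT name, tissue FROM samples ORDER BY name").fetchall()
con.close()
```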

Additionally, tables from other minus80 databases can be cross accessed.
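
In SQLite generally, reading tables across database files is done with `ATTACH DATABASE`. Whether minus80 uses exactly this mechanism is an assumption. The sketch below uses the standard-library `sqlite3` module, and the file names, the `other` alias, and the `samples` table are hypothetical.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    first = os.path.join(tmpdir, "first.sqlite")
    second = os.path.join(tmpdir, "second.sqlite")

    # Populate the second database with one table.
    con2 = sqlite3.connect(second)
    con2.execute("CREATE TABLE samples (name TEXT)")
    con2.execute("INSERT INTO samples VALUES ('individual1')")
    con2.commit()
    con2.close()

    # Open the first database and attach the second under an alias.
    con1 = sqlite3.connect(first)
    con1.execute("ATTACH DATABASE ? AS other", (second,))
    names = [row[0] for row in con1.execute("SELECT name FROM other.samples")]
    con1.close()
```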

The path to the object's database file can also be accessed.

Columnar Data

Columnar data structures, such as numpy arrays or pandas dataframes, are not well suited to fast storage and access in an SQL database. Minus80 stores columnar data internally on disk as bcolz tables. These are very fast to load from disk and can even be accessed out-of-core (without loading everything into memory) using tools like blaze.
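
To make the idea of column-oriented storage concrete, the sketch below writes each column of a small table to its own binary file and then reads back a single column without touching the other. It uses only the standard-library `array` module; it is not bcolz and makes no claim about bcolz internals, and the column names are hypothetical.

```python
import os
import tempfile
from array import array

# A small table held column by column (hypothetical column names).
table = {
    "sample_id": array("q", [1, 2, 3]),          # signed 64-bit integers
    "expression": array("d", [0.5, 1.25, 2.0]),  # 64-bit floats
}

with tempfile.TemporaryDirectory() as tmpdir:
    # Write each column to its own binary file.
    for column, values in table.items():
        with open(os.path.join(tmpdir, column + ".bin"), "wb") as handle:
            values.tofile(handle)

    # Read back one column; the sample_id file is never opened.
    expression = array("d")
    with open(os.path.join(tmpdir, "expression.bin"), "rb") as handle:
        expression.fromfile(handle, len(table["expression"]))

loaded = expression.tolist()
```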

If you want to store a pandas dataframe, use the _bcolz method:

>>> import pandas
>>> from minus80 import Cohort
>>> x = Cohort('Sample2Liver')
# Create a dataframe
>>> data_frame = pandas.DataFrame(...)
# Store the dataframe in the Cohort
>>> x._bcolz('tblData',df=data_frame)
# Retrieve the dataframe from the database
>>> data_frame2 = x._bcolz('tblData')
# Note, to retrieve, provide a name but no df option

If you want to store a numpy array instead of an entire dataframe, you can do that with the similar _bcolz_array method:

>>> import numpy as np
>>> from minus80 import Cohort
>>> x = Cohort('Sample3Liver')
# Say, you have an array
>>> data_array = np.array([0,1,2,3,4,5])
# Store the array as part of the Cohort object
# Here we give it the name 'data'
>>> x._bcolz_array('data', array=data_array)
# To retrieve the array from the database, use the
# same method, but do not provide the array option
>>> data_array2 = x._bcolz_array('data')

Key/Value Store

A simple key/value store is included with each Freezable object for things like object attributes. It is backed internally by sqlite3 and is intended for a small number of values. The internal method detects three types of values: int, float, and str. More complex values are not supported by this method.

Note

This is not optimized for massive datasets and does not compete with specialized key/value stores.

The key/value store can be accessed using the internal _dict method:

>>> from minus80 import Cohort
>>> x = Cohort('Sample4Liver')
# Store a value in the dict
>>> x._dict('foo',val='bar')
# Retrieve the value
>>> val = x._dict('foo')
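
One way such a typed key/value store could be built on top of SQLite is sketched below. This is an assumption about the general approach, not the actual minus80 implementation: `KeyValueSketch`, the `kv` table, and its columns are hypothetical, and the standard-library `sqlite3` module is used with an in-memory database.

```python
import sqlite3


class KeyValueSketch:
    """Hypothetical key/value helper; not the real Freezable._dict."""

    _casts = {"int": int, "float": float, "str": str}

    def __init__(self, con):
        self._con = con
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS kv "
            "(key TEXT PRIMARY KEY, val TEXT, type TEXT)"
        )

    def _dict(self, key, val=None):
        if val is not None:
            # Store: keep the value as text along with its type name.
            if isinstance(val, bool) or not isinstance(val, (int, float, str)):
                raise TypeError("only int, float and str values are supported")
            self._con.execute(
                "INSERT OR REPLACE INTO kv VALUES (?, ?, ?)",
                (key, str(val), type(val).__name__),
            )
            self._con.commit()
            return None
        # Retrieve: look up the row and convert back to the recorded type.
        row = self._con.execute(
            "SELECT val, type FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        text, type_name = row
        return self._casts[type_name](text)


store = KeyValueSketch(sqlite3.connect(":memory:"))
store._dict("foo", val="bar")
store._dict("n_samples", val=10)
store._dict("threshold", val=0.05)
```

As in the example above, passing `val` stores a value and omitting it retrieves one; the recorded type name is what lets the sketch return an int or float rather than a string.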

Temporary Files

Temporary files can be accessed using the _tmpfile method. This simple method just wraps the tempfile module and creates the temporary file in the minus80 basedir so that everything is kept in one place.
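
The underlying idea, pointing the tempfile module at a chosen directory, looks roughly like this. The sketch does not call `_tmpfile`; the throwaway `basedir` and the `tmp` sub-directory name are assumptions for illustration.

```python
import os
import tempfile

# A throwaway directory stands in for the minus80 basedir.
with tempfile.TemporaryDirectory() as basedir:
    tmpdir = os.path.join(basedir, "tmp")  # hypothetical sub-directory name
    os.makedirs(tmpdir, exist_ok=True)

    # dir= places the temporary file under the chosen directory.
    with tempfile.NamedTemporaryFile("w+", dir=tmpdir, suffix=".txt") as tmp:
        tmp.write("scratch data")
        tmp.seek(0)
        contents = tmp.read()
        created_in_basedir = os.path.samefile(os.path.dirname(tmp.name), tmpdir)
```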