Freezable
How minus80 works behind the scenes
As detailed in the overview, minus80 comes with two objects, Accession and Cohort, that make it easy to store and access data about experimental samples or, more broadly, experimental accessions. Accession objects can be created, but they are not persistent across Python sessions unless they are stored within a Cohort.
Also, changing an Accession object does not change the underlying data; the stored data changes only when it is modified through the Cohort methods. Think of taking DNA samples out of the freezer and using them in an assay. A small aliquot is taken from the frozen sample to perform the analysis on. Changes to this aliquot (the non-persistent accession here) do not change the underlying DNA stored in the freezer (the cohort).
In the same vein, duplicate Accessions can be stored (sometimes with different metadata, etc.) in multiple Cohorts in minus80. It is the context of the Cohort name that differentiates what data goes along with each accession contained within it. Think of 10 individuals. You could have a cohort called “genomicDNA” that contains the data for all 10 samples. You could have the same 10 accessions under a different context, perhaps “liverSamples”.
Freezable datastructures
Cohorts persist across Python sessions because the Cohort class inherits from the Freezable class. Accessions do not inherit these properties. Freezable is an abstract class, meaning that you would most likely not create a Freezable instance on its own. This is much like how lists relate to the abstract Iterable class: you would never create a bare iterable, but rather objects that are iterable. In the same way, you create objects that are freezable.
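The abstract-class relationship described above can be sketched with Python's standard abc module. This is only an illustration: FreezableSketch and CohortSketch are hypothetical names, and minus80's real Freezable class may be implemented without abc.

```python
from abc import ABC, abstractmethod


class FreezableSketch(ABC):
    """Hypothetical stand-in for an abstract, persistent base class."""

    def __init__(self, name):
        # Shared setup that every subclass inherits
        self.name = name

    @abstractmethod
    def summary(self):
        """Subclasses must describe themselves."""


class CohortSketch(FreezableSketch):
    """Hypothetical concrete subclass; it reuses the inherited __init__."""

    def summary(self):
        return f"{type(self).__name__}:{self.name}"


# Concrete subclasses can be created; the abstract base cannot
liver = CohortSketch("Sample1Liver")
print(liver.summary())
```

Calling FreezableSketch("x") directly raises a TypeError, which mirrors the point that you work with freezable objects rather than with a bare Freezable.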
Here is the signature for the Freezable class:
class minus80.Freezable(name, parent=None, basedir=None)
Freezable is an abstract class. Things that inherit from Freezable can be loaded and unloaded from the Minus80.
A freezable object is a persistent object that lives in a known directory. It is aimed at making objects and databases that are expensive to build loadable from new runtimes.
The four main things that a Freezable object supplies are:
* access to a sqlite database (relational records)
* access to a bcolz database (columnar/table data)
* access to a persistent key/val store
* access to named temp files
A Freezable object needs a name attribute and a discernible dtype in order to be frozen.
For instance, since Cohorts are freezable, assume the following:
>>> import minus80 as m80
>>> x = m80.Cohort('Sample1Liver')
Here the name is Sample1Liver and the dtype is Cohort. Since Cohorts are freezable, and the Freezable __init__ function is called when the object is made, the object inherits some attributes.
Some of these attributes are links to several centralized databases. These databases are stored in the directory dictated by the basedir option in the ~/.minus80.conf file.
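To make the idea of one database file per name and dtype concrete, here is a minimal sketch of how such a path could be derived from basedir. The database_path helper, the "databases" subdirectory, and the "dtype.name.db" file name are all hypothetical; the layout minus80 actually uses may differ.

```python
from pathlib import Path
import tempfile


def database_path(basedir, dtype, name):
    """Build a per-object database path (hypothetical layout)."""
    return Path(basedir) / "databases" / f"{dtype}.{name}.db"


# A throwaway directory stands in for the configured basedir
basedir = tempfile.mkdtemp()
liver_db = database_path(basedir, "Cohort", "Sample1Liver")
dna_db = database_path(basedir, "Cohort", "genomicDNA")

# Same directory, different files: the name keeps the objects apart
print(liver_db.name)
print(dna_db.name)
```

Keying the file on both dtype and name means two objects of different types can share a name without colliding.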
Three different databases are supported, each serving a slightly different purpose. You can read the full details in the API Reference, or read a summary here.
Relational Data
A link to a sqlite database is provided using the apsw package internally. When the object is created, a database file linked to its name and dtype is created.
Also, upon creation, an open connection to the database is made. This can be accessed as an internal attribute:
>>> from minus80 import Cohort
>>> x = Cohort('Sample1Liver')
# Get the internal sqlite connection
>>> con = x._db
# Get a cursor from that connection
>>> cur = con.cursor()
# Execute some SQL
>>> cur.execute('SELECT * FROM ...')
Additionally, tables from other minus80 databases can be cross-accessed, and the path to the object's database file can also be retrieved.
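The connection, cursor, and cross-access pattern can be tried outside minus80 with the standard library. The sketch below uses sqlite3 instead of apsw, invents two database files and an accessions table for illustration, and assumes cross-access works through SQLite's ATTACH DATABASE statement, which minus80 may or may not use.

```python
import os
import sqlite3
import tempfile

# Two throwaway database files stand in for two frozen objects
workdir = tempfile.mkdtemp()
liver_path = os.path.join(workdir, "Cohort.Sample1Liver.db")
dna_path = os.path.join(workdir, "Cohort.genomicDNA.db")

# Populate the second database, then close it
other = sqlite3.connect(dna_path)
other.execute("CREATE TABLE accessions (name TEXT)")
other.execute("INSERT INTO accessions VALUES ('A1')")
other.commit()
other.close()

# Open the first database and get a cursor
con = sqlite3.connect(liver_path)
cur = con.cursor()
cur.execute("CREATE TABLE accessions (name TEXT)")
cur.execute("INSERT INTO accessions VALUES ('A2')")
con.commit()  # finish the open transaction before attaching

# Attach the second file so its tables are reachable by schema prefix
cur.execute("ATTACH DATABASE ? AS dna", (dna_path,))
rows = cur.execute(
    "SELECT name FROM main.accessions UNION ALL "
    "SELECT name FROM dna.accessions ORDER BY name"
).fetchall()
con.close()
```

After attaching, a single query can read from both files, which is the kind of cross-access the text describes.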
Columnar Data
Columnar data structures, such as numpy arrays or pandas dataframes, are not well suited to being stored and accessed quickly in an SQL database. Minus80 stores columnar data on disk as bcolz tables. These are very fast to load from disk and can even be accessed out-of-core (without loading everything into memory) using tools such as blaze.
If you want to store a pandas dataframe, use the _bcolz method:
>>> import pandas
>>> from minus80 import Cohort
>>> x = Cohort('Sample2Liver')
# Create a dataframe
>>> data_frame = pandas.DataFrame(...)
# Store the dataframe in the Cohort
>>> x._bcolz('tblData', df=data_frame)
# Retrieve the dataframe from the database
# Note: to retrieve, provide a name but no df option
>>> data_frame2 = x._bcolz('tblData')
If you want to store a numpy array instead of an entire dataframe, you can do that with the similar _bcolz_array method:
>>> import numpy as np
>>> from minus80 import Cohort
>>> x = Cohort('Sample3Liver')
# Say, you have an array
>>> data_array = np.array([0,1,2,3,4,5])
# Store the array as part of the Cohort object
# Here we give it the name 'data'
>>> x._bcolz_array('data', array=data_array)
# To retrieve the array from the database, use the
# same method, but do not provide the array option
>>> data_array2 = x._bcolz_array('data')
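To see why a column-oriented layout suits this kind of data, here is a toy sketch that writes each column to its own file using only the standard library. It is not bcolz and does not reproduce bcolz's on-disk format or compression; the column names and values are made up for illustration.

```python
import os
import tempfile
from array import array

# Row-oriented records, as they might come out of a relational table
rows = [(1, 0.5), (2, 1.5), (3, 2.5)]

# Transpose into one typed, contiguous array per column
columns = {
    "sample_id": array("q", (r[0] for r in rows)),
    "expression": array("d", (r[1] for r in rows)),
}

# Write each column to its own file inside a table directory
table_dir = tempfile.mkdtemp()
for col_name, values in columns.items():
    with open(os.path.join(table_dir, col_name), "wb") as handle:
        values.tofile(handle)

# Read back a single column without touching the others
expression = array("d")
with open(os.path.join(table_dir, "expression"), "rb") as handle:
    expression.fromfile(handle, len(rows))
```

Because each column lives in its own contiguous block, reading one column does not require scanning the rest of the table.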
Key/Value Store
A simple key/value store is included with every Freezable object for things like object attributes. It is backed internally by sqlite3 and is intended for a small number of object attributes or similar values.
The internal method detects three different types of values: int, float, and str. More complex values are not supported by this method.
Note
This is not optimized for massive datasets and does not compete with specialized key/value stores.
The key/value store can be accessed using the internal _dict method:
>>> from minus80 import Cohort
>>> x = Cohort('Sample4Liver')
# Store a value in the dict
>>> x._dict('foo',val='bar')
# Retrieve the value
>>> val = x._dict('foo')
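A sqlite-backed key/value store with type detection can be sketched in a few lines. The KeyValueSketch class, its globals table, and the approach of storing the type name next to the value are assumptions made for illustration, not minus80's actual implementation.

```python
import sqlite3


class KeyValueSketch:
    """Hypothetical sqlite-backed store for int, float and str values."""

    _casts = {"int": int, "float": float, "str": str}

    def __init__(self, path=":memory:"):
        self._con = sqlite3.connect(path)
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS globals "
            "(key TEXT PRIMARY KEY, val TEXT, type TEXT)"
        )

    def __call__(self, key, val=None):
        if val is not None:
            # Store: record the type name next to the stringified value
            type_name = type(val).__name__
            if type_name not in self._casts:
                raise TypeError(f"unsupported value type: {type_name}")
            self._con.execute(
                "INSERT OR REPLACE INTO globals VALUES (?, ?, ?)",
                (key, str(val), type_name),
            )
            self._con.commit()
            return None
        # Retrieve: cast the stored text back to its original type
        row = self._con.execute(
            "SELECT val, type FROM globals WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        return self._casts[row[1]](row[0])


store = KeyValueSketch()
store("foo", val="bar")
store("n_samples", val=10)
store("threshold", val=0.05)
```

As with _dict, passing val stores a value and omitting it retrieves one. Values such as lists or dicts raise a TypeError here, in line with the restriction to int, float, and str.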
Temporary Files
Temporary files can be accessed using the _tmpfile method. This simple method just wraps the tempfile module and creates the temporary file in the minus80 basedir so that everything is consolidated in one place.
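The same effect can be sketched directly with the tempfile module by passing a dir argument. The throwaway basedir and the "tmp" subdirectory below are hypothetical; how _tmpfile arranges its files inside the real basedir is not documented here.

```python
import os
import tempfile

# A throwaway directory stands in for the minus80 basedir
basedir = tempfile.mkdtemp()
tmp_dir = os.path.join(basedir, "tmp")
os.makedirs(tmp_dir, exist_ok=True)

# dir= places the temporary file under basedir instead of the system default
with tempfile.NamedTemporaryFile(mode="w+", dir=tmp_dir, suffix=".txt") as handle:
    handle.write("scratch data")
    handle.flush()
    tmp_path = handle.name
    existed_while_open = os.path.exists(tmp_path)

# The file is deleted automatically when the handle is closed
exists_after_close = os.path.exists(tmp_path)
```

Keeping temporary files under basedir means scratch data lands on the same storage as the databases rather than in the system temp directory.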