Caching & Views#

All datasets have a local cache. This cache stores entries, specifications, and records that have been retrieved from the server. By default, the cache is stored in memory, which allows for fast access but is not permanent. That is, restarting a script or Jupyter notebook will require downloading the data again.
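For example, constructing a client without a cache directory results in a purely in-memory cache:

>>> from qcportal import PortalClient
>>> # No cache_dir given - retrieved entries and records are cached in memory
>>> # and are lost when the process exits
>>> client = PortalClient("https://ml.qcarchive.molssi.org")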

It is possible (and often recommended) to use a disk-based cache instead. To enable this, pass the path to a directory to the PortalClient constructor, or, if using qcportal configuration files, set the cache_dir variable to the path. If the path does not exist, it will be created automatically.

This path can be shared among different instances - a subdirectory will be created for the particular instance you are using.

>>> from qcportal import PortalClient
>>> client = PortalClient("https://ml.qcarchive.molssi.org", cache_dir='/path/to/dir')
>>> print(client.cache.cache_dir)
/path/to/dir/ml.qcarchive.molssi.org_None
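The same option can be set in a qcportal configuration file. A hypothetical sketch, assuming a YAML-style file (the exact file location and key names are described in the qcportal documentation; the address key shown here is illustrative):

    # In a qcportal configuration file (YAML) - keys shown are illustrative
    address: https://ml.qcarchive.molssi.org
    cache_dir: /path/to/dir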

Dataset Caches#

Datasets obtained using a client with a cache directory set will store information in an SQLite file within that cache directory. Re-running a script, or otherwise re-connecting to the same server and obtaining the same dataset, means that previously-downloaded information will not need to be fetched again.

For the most part, this should be transparent - the cache is meant to be intelligent about checking the server for updated records when needed. However, various methods of the dataset classes have a force_refetch parameter that overrides this behavior and causes data to be fetched again, even if it already exists locally.

>>> ds = client.get_dataset_by_id(377)
>>> for entry in ds.iterate_entries(force_refetch=True):
...     print(entry.initial_molecule)
Molecule(name='000280960', formula='C5H10N2O2', hash='e883492')
Molecule(name='000524682', formula='C7H12', hash='0416c77')
Molecule(name='010464300', formula='C5H8O', hash='2f5a6ea')
...

>>> for ename, sname, record in ds.iterate_records(force_refetch=True):
...     print(ename, sname, record.status)
000280960 default RecordStatusEnum.complete
000524682 default RecordStatusEnum.complete
010464300 default RecordStatusEnum.complete
012917294 default RecordStatusEnum.complete

Dataset Views#

A dataset view is a standalone file that contains data from a dataset. A view is entirely local to your computer and can be loaded and used as a dataset without any server connection.

If you have a view file, you can load it with the qcportal.dataset_models.load_dataset_view() function.

>>> from qcportal import load_dataset_view
>>> dsv = load_dataset_view('/path/to/view.sqlite')

>>> for ename, sname, record in dsv.iterate_records():
...     print(ename, sname, record.status)
000280960 default RecordStatusEnum.complete
000524682 default RecordStatusEnum.complete
010464300 default RecordStatusEnum.complete
012917294 default RecordStatusEnum.complete
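
Other dataset methods work the same way on a view. For example, entries can be iterated as well (assuming a singlepoint-style dataset whose entries carry an initial_molecule, as in the examples above):

>>> for entry in dsv.iterate_entries():
...     print(entry.initial_molecule)
Molecule(name='000280960', formula='C5H10N2O2', hash='e883492')
...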

Fetching Views#

Datasets can have views as attachments. For convenience, attachments that are views can be listed with list_views(), which returns a list of DatasetAttachment objects containing the metadata for this dataset's attachments.

Given the id of the attachment, it can then be downloaded with download_view().

By default, download_view() will download the most recent view file. The destination path can also be overridden; by default, the file is downloaded to the current working directory using the filename generated by the server.

>>> ds = client.get_dataset_by_id(377)
>>> ds.list_views()
[DatasetAttachment(id=6, file_type=<ExternalFileTypeEnum.dataset_attachment: 'dataset_attachment'>,...

>>> ds.download_view(6, '/path/to/file.sqlite')
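
Following the defaults described above, both arguments can be omitted, in which case the most recent view is saved to the current working directory under the server-generated filename:

>>> # Download the latest view, using the filename generated by the server
>>> ds.download_view()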

Creating Views#

Views can be created on the server with create_view(). This starts a background internal job that creates the view, and returns an InternalJob object that can be used to query the job's progress.

>>> ds = client.get_dataset_by_id(377)
>>> ij = ds.create_view("A test view", {})

>>> # Then go on and do other things
>>> # If you want to watch the progress with a progress bar
>>> ij.watch()

Warning

Views for large datasets may take a long time to create, and the resulting files may be very large. Because view creation can be resource intensive, the server is limited to creating one view file at a time.

The DatasetAttachment object has a file_size property that gives the size of the view file in bytes.

>>> ds = client.get_dataset_by_id(377)
>>> v = ds.list_views()
>>> v[0].file_size
737280
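
Since file_size is a plain byte count, it can be converted for display; for example, in mebibytes:

>>> v[0].file_size / (1024 * 1024)  # size in MiB
0.703125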

Using views as a starting cache#

Views are generally good for “finished” datasets. However, they can also be useful for datasets that are still running, since a view can serve as a starting point for the dataset cache. That is, you can bootstrap the cache using a view.

The two useful functions are:

  • use_view_cache() - uses an existing (already-downloaded) view file as a cache. Modifications to the dataset will be reflected in this file. You do not need to have caching enabled (i.e., you do not need to set cache_dir in the client).

  • preload_cache() - downloads the view file specified by id (by default, the most recent view) to the current cache directory and uses it as the cache file. Caching must be enabled in the client.

    >>> ds = client.get_dataset_by_id(377)
    
    >>> # Use the latest view on the server as a starting cache file
    >>> ds.preload_cache()
    
    >>> # Manually use a downloaded file as a cache file
    >>> ds.use_view_cache('./dataset_377_view.sqlite')
    

Note

There is some protection against using a view file from another server or dataset, although it is probably not perfect.

>>> ds = client.get_dataset_by_id(382)

>>> # Manually use a downloaded file as a cache file
>>> ds.use_view_cache('./dataset_377_view.sqlite')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

...

ValueError: Info in view file does not match this dataset. ID in the file 377, ID of this dataset 382