Exploring the Datasets

The Dataset collection represents a table whose rows correspond to Molecules and whose columns correspond to properties. Columns may either result from QCFractal-based calculations or be contributed from outside sources. For example, the QM9 dataset in QCArchive contains small organic molecules with up to 9 heavy atoms, and includes the original reported PBE0 energies, as well as energies calculated with a variety of other density functionals and basis sets.

The existing Datasets can be listed with FractalClient.list_collections("Dataset") and obtained using FractalClient.get_collection("Dataset", name).
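As a minimal sketch of the two calls above, the helper below lists the Datasets on a server and then fetches one by name. The function name `open_dataset` is hypothetical, and `client` is assumed to be an already constructed FractalClient; only `list_collections` and `get_collection` come from the text.

```python
def open_dataset(client, name):
    """Return (listing, dataset) for the Dataset called `name`.

    `client` is assumed to be a connected FractalClient, or any object
    exposing the same two methods.
    """
    # Listing of every collection of type "Dataset" known to the server.
    listing = client.list_collections("Dataset")
    # Fetch the requested Dataset collection itself.
    dataset = client.get_collection("Dataset", name)
    return listing, dataset
```

A call might look like `listing, ds = open_dataset(ptl.FractalClient(), "QM9")`, assuming the server registers the dataset under that name.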

Querying the Data

Available result specifications (method, basis set, program, keyword, driver combinations) in a Dataset may be listed with the list_values method. Values are queried with the get_values method. For results computed using QCFractal, the underlying Records are retrieved with get_records.

For further details about how to query Datasets see the QCArchive examples.
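The three query methods named in this section can be combined as in the sketch below. The helper name `query_dataset` is hypothetical and `ds` is assumed to be a Dataset obtained from a server; the keyword arguments follow the signatures listed in the API section of this page.

```python
def query_dataset(ds, method, basis, program=None):
    """Return (available, values, records) for one method/basis pair."""
    # Specifications present in the Dataset that match the selectors.
    available = ds.list_values(method=method, basis=basis, program=program)
    # Values for every entry that match the same selectors.
    values = ds.get_values(method=method, basis=basis, program=program)
    # Full Records exist only for results computed through QCFractal.
    records = ds.get_records(method, basis, program=program)
    return available, values, records
```

For example, `query_dataset(ds, "b3lyp", "def2-svp", program="psi4")` would gather everything stored for that level of theory, provided such results exist in the Dataset.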

Statistics and Visualization

Statistical operations on Datasets may be performed using the statistics command, and the results plotted using the visualize command.

For examples of visualizing Datasets, see the QCArchive examples.
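A hedged sketch of calling both commands together is shown below. The helper name `benchmark_report` is hypothetical, and the caller must supply a statistic type (`stype`) that the installed version accepts; the argument names mirror the statistics and visualize signatures in the API section.

```python
def benchmark_report(ds, stype, value, bench, method=None, basis=None):
    """Compute a statistic against a benchmark and build a bar chart."""
    # Compare the `value` column against the `bench` column.
    stats = ds.statistics(stype, value, bench=bench)
    # Plot unsigned errors relative to the same benchmark and ask for
    # the raw figure instead of an inline notebook display.
    figure = ds.visualize(
        method=method,
        basis=basis,
        bench=bench,
        metric="UE",
        kind="bar",
        return_figure=True,
    )
    return stats, figure
```

Passing `return_figure=True` keeps the behaviour the same inside and outside a Jupyter notebook, which is convenient in scripts.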

Creating the Datasets

Construct an empty Dataset:

>>> import qcportal as ptl
>>> client = ptl.FractalClient()  # add server and login information as needed
>>> ds = ptl.collections.Dataset("name", client=client)

The primary index of a Dataset is a list of Molecules. Molecules can be added to a Dataset with add_entry:

>>> ds.add_entry(name, molecule)

Once all Molecules are added, the changes can be committed to the server with the save method. Note that this requires write permissions.

>>> ds.save()
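The add-then-save workflow above can be wrapped in a small helper, sketched here under the assumption that the molecules are held in a dictionary keyed by entry name. The name `populate_dataset` is hypothetical; only `add_entry` and `save` come from the text.

```python
def populate_dataset(ds, molecules):
    """Add every (name, molecule) pair to `ds`, then save once.

    `molecules` is assumed to map entry names to Molecule objects.
    Returns the number of entries added.
    """
    for name, molecule in molecules.items():
        ds.add_entry(name, molecule)
    # A single save at the end commits all new entries to the server.
    ds.save()
    return len(molecules)
```

Saving once after the loop, rather than after every entry, matches the order described above and avoids repeated round trips to the server.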

Computational Tasks

Computations on the molecules within the Datasets can be performed using the compute command. If the results of the requested computation already exist in the Dataset, they will be reused to avoid recomputation. Note that performing computations requires compute permissions.

>>> models = {('b3lyp', 'def2-svp'), ('mp2', 'cc-pVDZ')}

>>> for method, basis in models:
...     print(method, basis)
...     spec = {"program": "psi4",
...             "method": method,
...             "basis": basis,
...             "keywords": "my_keywords",
...             "tag": "mgwtfm"}
...     ds.compute(**spec)

Note

A default quantum chemical program and a set of computational keywords can be specified for a Dataset. These default values will be used in the compute, get_values, and get_records methods.

>>> ds.set_default_program("psi4")

>>> keywords = ptl.models.KeywordSet(values={'maxiter': 1000,
...                                          'e_convergence': 8,
...                                          'guess': 'sad',
...                                          'scf_type': 'df'})

>>> ds.add_keywords("my_keywords", "psi4", keywords, default=True)

>>> ds.save()
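Once defaults are stored, later compute calls can omit the program and keywords. The sketch below strings the steps together; the helper name `compute_with_defaults` is hypothetical, `keyword_set` is assumed to be a KeywordSet like the one built above, and `models` is assumed to be an iterable of (method, basis) pairs.

```python
def compute_with_defaults(ds, program, alias, keyword_set, models):
    """Register defaults on `ds`, save, then submit each model."""
    ds.set_default_program(program)
    ds.add_keywords(alias, program, keyword_set, default=True)
    # Keywords are not present on the server until save() completes.
    ds.save()
    responses = []
    for method, basis in models:
        # Program and keywords fall back to the defaults set above.
        responses.append(ds.compute(method, basis))
    return responses
```

Calling `save` before `compute` matters here, since the keyword alias must exist on the server before any task refers to it.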

API

class qcportal.collections.Dataset(name: str, client: Optional[FractalClient] = None, **kwargs: Any)

The Dataset class for homogeneous computations on many molecules.

Variables

client (client.FractalClient) – A FractalClient connected to a server
data (dict) – JSON representation of the database backbone
df (pd.DataFrame) – The underlying dataframe for the Dataset object

class DataModel(*, id: str = 'local', name: str, collection: str, provenance: Dict[str, str] = {}, tags: List[str] = [], tagline: str = None, description: str = None, group: str = 'default', visibility: bool = True, view_url_hdf5: str = None, view_url_plaintext: str = None, view_metadata: Dict[str, str] = None, view_available: bool = False, metadata: Dict[str, Any] = {}, default_program: str = None, default_keywords: Dict[str, str] = {}, default_driver: str = 'energy', default_units: str = 'kcal / mol', default_benchmark: str = None, alias_keywords: Dict[str, Dict[str, str]] = {}, records: List[qcportal.collections.dataset.MoleculeEntry] = None, contributed_values: Dict[str, qcportal.collections.dataset.ContributedValues] = None, history: Set[Tuple[str, str, str, Optional[str], Optional[str]]] = {}, history_keys: Tuple[str, str, str, str, str] = ('driver', 'program', 'method', 'basis', 'keywords'))

Parameters

id (str, Default: local)
name (str)
collection (str)
provenance (Dict[str, str], Default: {})
tags (List[str], Default: [])
tagline (str, Optional)
description (str, Optional)
group (str, Default: default)
visibility (bool, Default: True)
view_url_hdf5 (str, Optional)
view_url_plaintext (str, Optional)
view_metadata (Dict[str, str], Optional)
view_available (bool, Default: False)
metadata (Dict[str, Any], Default: {})
default_program (str, Optional)
default_keywords (Dict[str, str], Default: {})
default_driver (str, Default: energy)
default_units (str, Default: kcal / mol)
default_benchmark (str, Optional)
alias_keywords (Dict[str, Dict[str, str]], Default: {})
records (List[MoleculeEntry], Optional)
contributed_values (Dict[str, ContributedValues], Optional)
history (Set[Tuple[str, str, str, str, str]], Default: set())
history_keys (Tuple[str, str, str, str, str], Default: (‘driver’, ‘program’, ‘method’, ‘basis’, ‘keywords’))

add_contributed_values(contrib: qcportal.collections.dataset.ContributedValues, overwrite: bool = False) → None

Adds a ContributedValues to the database. Be sure to call save() to commit changes to the server.

Parameters

contrib (ContributedValues) – The ContributedValues to add.
overwrite (bool, optional) – Overwrites pre-existing values

add_entry(name: str, molecule: Molecule, **kwargs: Dict[str, Any]) → None

Adds a new entry to the Dataset

Parameters

name (str) – The name of the record
molecule (Molecule) – The Molecule associated with this record
**kwargs (Dict[str, Any]) – Additional arguments to pass to the record

add_keywords(alias: str, program: str, keyword: KeywordSet, default: bool = False) → bool

Adds an option alias to the dataset. Note that keywords are not present until a save call has been completed.

Parameters

alias (str) – The alias of the option
program (str) – The compute program the alias is for
keyword (KeywordSet) – The Keywords object to use.
default (bool, optional) – Sets this option as the default for the program

compute(method: str, basis: Optional[str] = None, *, keywords: Optional[str] = None, program: Optional[str] = None, tag: Optional[str] = None, priority: Optional[str] = None, protocols: Optional[Dict[str, Any]] = None) → qcportal.models.rest_models.ComputeResponse

Executes a computational method for all entries in the Dataset. Previously completed computations are not repeated.

Parameters

method (str) – The computational method to compute (B3LYP)
basis (Optional[str], optional) – The computational basis to compute (6-31G)
keywords (Optional[str], optional) – The keyword alias for the requested compute
program (Optional[str], optional) – The underlying QC program
tag (Optional[str], optional) – The queue tag to use when submitting compute requests.
priority (Optional[str], optional) – The priority of the jobs: low, medium, or high.
protocols (Optional[Dict[str, Any]], optional) – Protocols for storing more or less data per field. Current valid protocols: {‘wavefunction’}

Returns

An object that contains the submitted ObjectIds of the new compute. This object has the following fields:

ids: The ObjectIds of the tasks, in the order of the input molecules
submitted: A list of ObjectIds that were submitted to the compute queue
existing: A list of ObjectIds of tasks already in the database

Return type

ComputeResponse
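Because previously completed computations are reused, it can be useful to check how many tasks were actually queued. The sketch below tallies the three fields described above; the helper name `summarize_compute` is hypothetical, and `response` is assumed to expose `ids`, `submitted`, and `existing` as sized sequences.

```python
def summarize_compute(response):
    """Count the tasks in a ComputeResponse-like object by status."""
    return {
        # One id per input molecule, new or not.
        "total": len(response.ids),
        # Tasks newly placed on the compute queue.
        "submitted": len(response.submitted),
        # Tasks that were already in the database and are reused.
        "existing": len(response.existing),
    }
```

A result with `submitted` equal to zero indicates that every requested computation was already present in the database.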

download(local_path: Optional[Union[str, pathlib.Path]] = None, verify: bool = True, progress_bar: bool = True) → None

Download a remote view if available. The dataset will use this view to avoid server queries for calls to get_entries, get_molecules, get_values, and list_values.

Parameters

local_path (Optional[Union[str, Path]], optional) – Local path at which to store the downloaded view. If None, the view will be stored in a temporary file and deleted on exit.
verify (bool, optional) – Verify download checksum. Default: True.
progress_bar (bool, optional) – Display a download progress bar. Default: True
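A minimal sketch of downloading a view before querying is given below. The helper name `values_from_view` is hypothetical, and it assumes a remote view is available for the dataset; the arguments to download are the ones documented above.

```python
def values_from_view(ds, local_path=None, **selectors):
    """Download the dataset's view, then query values through it."""
    # With local_path=None the view goes to a temporary file that is
    # removed on exit; the progress bar is disabled for quiet scripts.
    ds.download(local_path=local_path, verify=True, progress_bar=False)
    # Subsequent get_values calls can be served from the view.
    return ds.get_values(**selectors)
```

For instance, `values_from_view(ds, method="b3lyp")` would fetch the view once and then read the matching columns from it.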

get_entries(subset: Optional[List[str]] = None, force: bool = False) → pandas.core.frame.DataFrame

Provides a list of entries for the dataset

Parameters

subset (Optional[List[str]], optional) – The indices of the desired subset. Return all indices if subset is None.
force (bool, optional) – skip cache

Returns

A dataframe containing entry names and specifications. For Dataset, specifications are molecule ids. For ReactionDataset, specifications describe reaction stoichiometry.

Return type

pd.DataFrame

get_index(subset: Optional[List[str]] = None, force: bool = False) → List[str]

Returns the current index of the database.

Returns: The names of all entries in the database
Return type: List[str]

get_keywords(alias: str, program: str, return_id: bool = False) → Union[KeywordSet, str]

Pulls the keywords alias from the server for inspection.

Parameters

alias (str) – The keywords alias.
program (str) – The program the keywords correspond to.
return_id (bool, optional) – If True, returns the id rather than the KeywordSet object.

Returns

The requested KeywordSet or KeywordSet id.

Return type

Union[‘KeywordSet’, str]

get_molecules(subset: Optional[Union[str, Set[str]]] = None, force: bool = False) → Union[pandas.core.frame.DataFrame, Molecule]

Queries full Molecules from the database.

Parameters

subset (Optional[Union[str, Set[str]]], optional) – The index subset to query on
force (bool, optional) – Force pull of molecules from server

Returns

Either a DataFrame of indexed Molecules or a single Molecule if a single subset string was provided.

Return type

Union[pd.DataFrame, ‘Molecule’]

get_records(method: str, basis: Optional[str] = None, *, keywords: Optional[str] = None, program: Optional[str] = None, include: Optional[List[str]] = None, subset: Optional[Union[str, Set[str]]] = None, merge: bool = False) → Union[pandas.core.frame.DataFrame, ResultRecord]

Queries full ResultRecord objects from the database.

Parameters

method (str) – The computational method to query on (B3LYP)
basis (Optional[str], optional) – The computational basis to query on (6-31G)
keywords (Optional[str], optional) – The option token desired
program (Optional[str], optional) – The program to query on
include (Optional[List[str]], optional) – The attributes to return. Otherwise returns ResultRecord objects.
subset (Optional[Union[str, Set[str]]], optional) – The index subset to query on
merge (bool) – Merge multiple results into one (as in the case of DFT-D3). This only works when include=[‘return_results’], as in get_values.

Returns

Either a DataFrame of indexed ResultRecords or a single ResultRecord if a single subset string was provided.

Return type

Union[pd.DataFrame, ‘ResultRecord’]

get_values(method: Optional[Union[List[str], str]] = None, basis: Optional[Union[List[str], str]] = None, keywords: Optional[str] = None, program: Optional[str] = None, driver: Optional[str] = None, name: Optional[Union[List[str], str]] = None, native: Optional[bool] = None, subset: Optional[Union[List[str], str]] = None, force: bool = False) → pandas.core.frame.DataFrame

Obtains values matching the search parameters provided for the expected return_result values. Defaults to the standard programs and keywords if not provided.

Note that unlike get_records, get_values will automatically expand searches and return multiple method and basis combinations simultaneously.

None is a wildcard selector. To search for None, use “None”.

Parameters

method (Optional[Union[str, List[str]]], optional) – The computational method (B3LYP)
basis (Optional[Union[str, List[str]]], optional) – The computational basis (6-31G)
keywords (Optional[str], optional) – The keyword alias
program (Optional[str], optional) – The underlying QC program
driver (Optional[str], optional) – The type of calculation (e.g. energy, gradient, hessian, dipole…)
name (Optional[Union[str, List[str]]], optional) – Canonical name of the record. Overrides the above selectors.
native (Optional[bool], optional) – True: only include data computed with QCFractal; False: only include data contributed from outside sources; None: include both.
subset (Optional[List[str]], optional) – The indices of the desired subset. Return all indices if subset is None.
force (bool, optional) – Data is typically cached, forces a new query if True

Returns

A DataFrame of values with columns corresponding to methods and rows corresponding to molecule entries.

Return type

DataFrame
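The selector rules for get_values (None as a wildcard, the string "None" to match a missing field, and a list to match several values at once) can be illustrated with a small stand-alone sketch. This is not the library's implementation: the function `matches` and the example specification dictionaries are hypothetical, and exact, case-sensitive comparison is an assumption made to keep the example short.

```python
def matches(spec, **query):
    """Return True if `spec` satisfies every selector in `query`.

    A selector of None matches anything. The string "None" matches a
    field whose stored value is missing (Python None). A list or tuple
    matches if any of its items matches.
    """
    for key, wanted in query.items():
        if wanted is None:
            # Wildcard: this field places no restriction.
            continue
        options = wanted if isinstance(wanted, (list, tuple)) else [wanted]
        stored = spec.get(key)
        # Missing values are addressed through the literal string "None".
        label = "None" if stored is None else stored
        if label not in options:
            return False
    return True


specs = [
    {"method": "b3lyp", "basis": "def2-svp"},
    {"method": "mp2", "basis": "cc-pVDZ"},
    {"method": "hf-3c", "basis": None},
]

# Select specifications that have no basis set at all.
no_basis = [s for s in specs if matches(s, basis="None")]
# Select two methods at once, with any basis.
two_methods = [s for s in specs if matches(s, method=["b3lyp", "mp2"], basis=None)]
```

Here `no_basis` keeps only the last specification, while `two_methods` keeps the first two.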

list_keywords() → pandas.core.frame.DataFrame

Lists keyword aliases for each program in the dataset.

Returns: A dataframe containing programs, keyword aliases, KeywordSet ids, and whether those keywords are the default for a program. Indexed on program.
Return type: pd.DataFrame

list_records(dftd3: bool = False, pretty: bool = True, **search: Optional[Union[List[str], str]]) → pandas.core.frame.DataFrame

Lists specifications of available records, i.e. method, program, basis set, keyword set, driver combinations. None is a wildcard selector. To search for None, use “None”.

Parameters

pretty (bool) – Replace NaN with “None” in returned DataFrame
**search (Dict[str, Optional[str]]) – Allows searching to narrow down return.

Returns

Record specifications matching **search.

Return type

DataFrame

list_values(method: Optional[Union[List[str], str]] = None, basis: Optional[Union[List[str], str]] = None, keywords: Optional[str] = None, program: Optional[str] = None, driver: Optional[str] = None, name: Optional[Union[List[str], str]] = None, native: Optional[bool] = None, force: bool = False) → pandas.core.frame.DataFrame

Lists available data that may be queried with get_values. Results may be narrowed by providing search keys. None is a wildcard selector. To search for None, use “None”.

Parameters

method (Optional[Union[str, List[str]]], optional) – The computational method (B3LYP)
basis (Optional[Union[str, List[str]]], optional) – The computational basis (6-31G)
keywords (Optional[str], optional) – The keyword alias
program (Optional[str], optional) – The underlying QC program
driver (Optional[str], optional) – The type of calculation (e.g. energy, gradient, hessian, dipole…)
name (Optional[Union[str, List[str]]], optional) – The canonical name of the data column
native (Optional[bool], optional) – True: only include data computed with QCFractal; False: only include data contributed from outside sources; None: include both.
force (bool, optional) – Data is typically cached, forces a new query if True

Returns

A DataFrame of the matching data specifications

Return type

DataFrame

set_default_benchmark(benchmark: str) → bool

Sets the default benchmark value.

Parameters: benchmark (str) – The benchmark to default to.

set_default_program(program: str) → bool

Sets the default program.

Parameters: program (str) – The program to default to.

set_view(path: Union[str, pathlib.Path]) → None

Set a dataset to use a local view.

Parameters: path (Union[str, Path]) – path to an hdf5 file representing a view for this dataset

statistics(stype: str, value: str, bench: Optional[str] = None, **kwargs: Dict[str, Any]) → Union[numpy.ndarray, pandas.core.series.Series, numpy.float64]

Provides statistics for various columns in the underlying dataframe.

Parameters

stype (str) – The type of statistic in question
value (str) – The method string to compare
bench (str, optional) – The benchmark method for the comparison, defaults to default_benchmark.
kwargs (Dict[str, Any]) – Additional kwargs to pass to the statistics functions

Returns

Returns an ndarray, Series, or float with the requested statistics depending on input.

Return type

np.ndarray, pd.Series, float

to_file(path: Union[str, pathlib.Path], encoding: str) → None

Writes a view of the dataset to a file

Parameters

path (Union[str, Path]) – Where to write the file
encoding (str) – Options: plaintext, hdf5
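Writing a view with to_file pairs naturally with set_view, which reads an hdf5 view back in. The sketch below chains the two; the helper name `export_hdf5_view` is hypothetical, and it always writes hdf5 because that is the format set_view is documented to accept.

```python
def export_hdf5_view(ds, path):
    """Write an hdf5 view of `ds` to `path` and attach it as a local view."""
    # Serialize the dataset's view to disk in hdf5 format.
    ds.to_file(path, "hdf5")
    # Point the dataset at the file that was just written.
    ds.set_view(path)
    return path
```

The plaintext encoding is left out on purpose, since nothing on this page indicates that set_view can read it.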

visualize(method: Optional[str] = None, basis: Optional[str] = None, keywords: Optional[str] = None, program: Optional[str] = None, groupby: Optional[str] = None, metric: str = 'UE', bench: Optional[str] = None, kind: str = 'bar', return_figure: Optional[bool] = None, show_incomplete: bool = False) → plotly.Figure

Parameters

method (Optional[str], optional) – Methods to query
basis (Optional[str], optional) – Bases to query
keywords (Optional[str], optional) – Keyword aliases to query
program (Optional[str], optional) – Programs to query
groupby (Optional[str], optional) – Groups the plot by this index.
metric (str, optional) – The metric to use, either UE (unsigned error) or URE (unsigned relative error)
bench (Optional[str], optional) – The benchmark level of theory to use
kind (str, optional) – The kind of chart to produce, either ‘bar’ or ‘violin’
return_figure (Optional[bool], optional) – If True, return the raw plotly figure. If False, returns a hosted iPlot. If None, return an iPlot display in a Jupyter notebook and a raw plotly figure in all other circumstances.
show_incomplete (bool, optional) – Display statistics for method/basis set combinations where results are incomplete

Returns

The requested figure.

Return type

plotly.Figure
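To make the two metrics concrete, the sketch below computes unsigned and unsigned relative errors for plain lists of numbers. It uses the usual textbook definitions and is only an illustration: the function name `unsigned_errors` is hypothetical, and the library's own implementation may differ in details such as scaling or units.

```python
def unsigned_errors(values, bench, relative=False):
    """Return per-entry unsigned errors of `values` against `bench`.

    With relative=True each error is divided by the magnitude of the
    benchmark value, giving an unsigned relative error.
    """
    errors = []
    for value, reference in zip(values, bench):
        error = abs(value - reference)
        if relative:
            error = error / abs(reference)
        errors.append(error)
    return errors


# Two entries, one above and one below the benchmark value of 2.0.
ue = unsigned_errors([1.0, 3.0], [2.0, 2.0])
ure = unsigned_errors([1.0, 3.0], [2.0, 2.0], relative=True)
```

Both entries differ from the benchmark by 1.0, so `ue` is `[1.0, 1.0]` and `ure` is `[0.5, 0.5]`.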