Exploring the Datasets

The Dataset collection represents a table whose rows correspond to Molecules and whose columns correspond to properties. Columns may either result from QCFractal-based calculations or be contributed from outside sources. For example, the QM9 dataset in QCArchive contains small organic molecules with up to 9 heavy atoms, and includes the original reported PBE0 energies, as well as energies calculated with a variety of other density functionals and basis sets.

The existing Datasets can be listed with FractalClient.list_collections("Dataset") and obtained using FractalClient.get_collection("Dataset", name).

Querying the Data

Available result specifications (method, basis set, program, keyword, driver combinations) in a Dataset may be listed with the list_values method. Values are queried with the get_values method. For results computed using QCFractal, the underlying Records are retrieved with get_records.
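The two-step query described above (list what is available, then fetch the values) can be sketched as a small helper. This is a minimal sketch rather than part of QCPortal: query_values is a hypothetical name, and ds is assumed to be any Dataset-like object exposing the documented list_values and get_values methods.

```python
def query_values(ds, method, basis=None, program=None):
    """List the matching specifications, then fetch their values.

    `ds` is assumed to expose the documented list_values and get_values
    methods; this helper itself is hypothetical and not part of QCPortal.
    """
    # Which method/basis/program combinations match the selectors?
    available = ds.list_values(method=method, basis=basis, program=program)
    # Fetch the corresponding values (rows are entries, columns are specifications).
    values = ds.get_values(method=method, basis=basis, program=program)
    return available, values
```

Inspecting the result of list_values first helps to confirm that the requested combination exists before values are pulled from the server.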

For further details about how to query Datasets see the QCArchive examples.

Statistics and Visualization

Statistical operations on Datasets may be performed using the statistics command, and the results may be plotted using the visualize command.

For examples of visualizing Datasets, see the QCArchive examples.

Creating the Datasets

Construct an empty Dataset:

>>> import qcportal as ptl
>>> client = ptl.FractalClient()  # add server and login information as needed
>>> ds = ptl.collections.Dataset("name", client=client)

The primary index of a Dataset is a list of Molecules. Molecules can be added to a Dataset with add_entry:

>>> ds.add_entry(name, molecule)

Once all Molecules are added, the changes can be committed to the server with the save method. Note that this requires write permissions.

>>> ds.save()
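When many Molecules need to be added, the add_entry and save calls above can be wrapped in a loop that commits once at the end. The helper below is a minimal sketch: add_molecules is a hypothetical name, and ds is assumed to be a Dataset-like object with the documented add_entry and save methods.

```python
def add_molecules(ds, molecules):
    """Add each (name, molecule) pair as an entry, then commit once.

    `molecules` maps entry names to Molecule objects. This helper is
    hypothetical; it only calls the documented add_entry and save methods.
    """
    for name, molecule in molecules.items():
        ds.add_entry(name, molecule)
    ds.save()  # requires write permissions on the server
    return len(molecules)
```

Saving once after the loop, rather than after every entry, keeps the number of server round trips small.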

Computational Tasks

Computations on the molecules within a Dataset can be performed using the compute command. If the results of the requested computation already exist in the Dataset, they will be reused to avoid recomputation. Note that performing computations requires compute permissions.

>>> models = [('b3lyp', 'def2-svp'), ('mp2', 'cc-pVDZ')]

>>> for method, basis in models:
...     print(method, basis)
...     spec = {"program": "psi4",
...             "method": method,
...             "basis": basis,
...             "keywords": "my_keywords",
...             "tag": "mgwtfm"}
...     ds.compute(**spec)

Note

A default quantum chemical program and a set of computational keywords can be specified for a Dataset. These default values will be used in the compute, get_values, and get_records methods.

>>> ds.set_default_program("psi4")

>>> keywords = ptl.models.KeywordSet(values={'maxiter': 1000,
...                                          'e_convergence': 8,
...                                          'guess': 'sad',
...                                          'scf_type': 'df'})

>>> ds.add_keywords("my_keywords", "psi4", keywords, default=True)

>>> ds.save()

API

class qcportal.collections.Dataset(name: str, client: Optional[FractalClient] = None, **kwargs: Any)

The Dataset class for homogeneous computations on many molecules.

Variables
  • client (client.FractalClient) – A FractalClient connected to a server

  • data (dict) – JSON representation of the database backbone

  • df (pd.DataFrame) – The underlying dataframe for the Dataset object

class DataModel(*, id: str = 'local', name: str, collection: str, provenance: Dict[str, str] = {}, tags: List[str] = [], tagline: str = None, description: str = None, group: str = 'default', visibility: bool = True, view_url_hdf5: str = None, view_url_plaintext: str = None, view_metadata: Dict[str, str] = None, view_available: bool = False, metadata: Dict[str, Any] = {}, default_program: str = None, default_keywords: Dict[str, str] = {}, default_driver: str = 'energy', default_units: str = 'kcal / mol', default_benchmark: str = None, alias_keywords: Dict[str, Dict[str, str]] = {}, records: List[qcportal.collections.dataset.MoleculeEntry] = None, contributed_values: Dict[str, qcportal.collections.dataset.ContributedValues] = None, history: Set[Tuple[str, str, str, Optional[str], Optional[str]]] = {}, history_keys: Tuple[str, str, str, str, str] = ('driver', 'program', 'method', 'basis', 'keywords'))
Parameters
  • id (str, Default: local)

  • name (str)

  • collection (str)

  • provenance (Dict[str, str], Default: {})

  • tags (List[str], Default: [])

  • tagline (str, Optional)

  • description (str, Optional)

  • group (str, Default: default)

  • visibility (bool, Default: True)

  • view_url_hdf5 (str, Optional)

  • view_url_plaintext (str, Optional)

  • view_metadata (Dict[str, str], Optional)

  • view_available (bool, Default: False)

  • metadata (Dict[str, Any], Default: {})

  • default_program (str, Optional)

  • default_keywords (Dict[str, str], Default: {})

  • default_driver (str, Default: energy)

  • default_units (str, Default: kcal / mol)

  • default_benchmark (str, Optional)

  • alias_keywords (Dict[str, Dict[str, str]], Default: {})

  • records (List[MoleculeEntry], Optional)

  • contributed_values (Dict[str, ContributedValues], Optional)

  • history (Set[Tuple[str, str, str, str, str]], Default: set())

  • history_keys (Tuple[str, str, str, str, str], Default: (‘driver’, ‘program’, ‘method’, ‘basis’, ‘keywords’))
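The history and history_keys fields can be read together: each history entry is a five-element tuple, and history_keys names its positions. The sketch below assumes that the tuple positions follow the order of history_keys, which the matching lengths suggest but this section does not state explicitly. label_history is a hypothetical helper, not part of QCPortal.

```python
HISTORY_KEYS = ("driver", "program", "method", "basis", "keywords")


def label_history(history, keys=HISTORY_KEYS):
    """Turn each history tuple into a dict keyed by the history_keys names.

    Assumes tuple positions follow the order of `keys` (an assumption, see
    the lead-in). Entries are sorted so the output order is deterministic.
    """
    ordered = sorted(history, key=lambda entry: tuple(str(v) for v in entry))
    return [dict(zip(keys, entry)) for entry in ordered]


# Illustrative history set; the values are examples, not real server data.
history = {
    ("energy", "psi4", "mp2", "cc-pvdz", None),
    ("energy", "psi4", "b3lyp", "def2-svp", "my_keywords"),
}
for spec in label_history(history):
    print(spec["method"], spec["basis"], spec["keywords"])
```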

add_contributed_values(contrib: qcportal.collections.dataset.ContributedValues, overwrite: bool = False) -> None

Adds a ContributedValues to the database. Be sure to call save() to commit changes to the server.

Parameters
  • contrib (ContributedValues) – The ContributedValues to add.

  • overwrite (bool, optional) – Overwrites pre-existing values

add_entry(name: str, molecule: Molecule, **kwargs: Dict[str, Any]) -> None

Adds a new entry to the Dataset

Parameters
  • name (str) – The name of the record

  • molecule (Molecule) – The Molecule associated with this record

  • **kwargs (Dict[str, Any]) – Additional arguments to pass to the record

add_keywords(alias: str, program: str, keyword: KeywordSet, default: bool = False) -> bool

Adds an option alias to the dataset. Note that keywords are not present until a save call has been completed.

Parameters
  • alias (str) – The alias of the option

  • program (str) – The compute program the alias is for

  • keyword (KeywordSet) – The Keywords object to use.

  • default (bool, optional) – Sets this option as the default for the program

compute(method: str, basis: Optional[str] = None, *, keywords: Optional[str] = None, program: Optional[str] = None, tag: Optional[str] = None, priority: Optional[str] = None, protocols: Optional[Dict[str, Any]] = None) -> qcportal.models.rest_models.ComputeResponse

Executes a computational method for all entries in the Dataset. Previously completed computations are not repeated.

Parameters
  • method (str) – The computational method to compute (B3LYP)

  • basis (Optional[str], optional) – The computational basis to compute (6-31G)

  • keywords (Optional[str], optional) – The keyword alias for the requested compute

  • program (Optional[str], optional) – The underlying QC program

  • tag (Optional[str], optional) – The queue tag to use when submitting compute requests.

  • priority (Optional[str], optional) – The priority of the jobs: low, medium, or high.

  • protocols (Optional[Dict[str, Any]], optional) – Protocols for storing more or less data per field. Current valid protocols: {‘wavefunction’}

Returns

An object that contains the submitted ObjectIds of the new compute. This object has the following fields:
  • ids: The ObjectId’s of the task in the order of input molecules

  • submitted: A list of ObjectId’s that were submitted to the compute queue

  • existing: A list of ObjectId’s of tasks already in the database

Return type

ComputeResponse
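The three fields of the returned object make it easy to see how much new work a compute call triggered. The sketch below uses a SimpleNamespace as a stand-in for ComputeResponse, so that it runs without a server. Only the field names ids, submitted, and existing come from the description above; summarize_compute is a hypothetical helper.

```python
from types import SimpleNamespace


def summarize_compute(response):
    """Count the tasks in a ComputeResponse-like object.

    Only the documented fields ids, submitted, and existing are read.
    """
    return {
        "total": len(response.ids),
        "submitted": len(response.submitted),
        "existing": len(response.existing),
    }


# Stand-in for the object returned by ds.compute(...); the ids are made up.
response = SimpleNamespace(ids=["1", "2", "3"], submitted=["3"], existing=["1", "2"])
print(summarize_compute(response))
```

An empty submitted list means every requested result was already in the database, so no new tasks were queued.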

download(local_path: Optional[Union[str, pathlib.Path]] = None, verify: bool = True, progress_bar: bool = True) -> None

Download a remote view if available. The dataset will use this view to avoid server queries for calls to:
  • get_entries

  • get_molecules

  • get_values

  • list_values

Parameters
  • local_path (Optional[Union[str, Path]], optional) – Local path at which to store the downloaded view. If None, the view will be stored in a temporary file and deleted on exit.

  • verify (bool, optional) – Verify download checksum. Default: True.

  • progress_bar (bool, optional) – Display a download progress bar. Default: True

get_entries(subset: Optional[List[str]] = None, force: bool = False) -> pandas.core.frame.DataFrame

Provides a list of entries for the dataset

Parameters
  • subset (Optional[List[str]], optional) – The indices of the desired subset. Return all indices if subset is None.

  • force (bool, optional) – Skip the cache and query the server

Returns

A dataframe containing entry names and specifications. For Dataset, specifications are molecule ids. For ReactionDataset, specifications describe reaction stoichiometry.

Return type

pd.DataFrame

get_index(subset: Optional[List[str]] = None, force: bool = False) -> List[str]

Returns the current index of the database.

Returns

ret – The names of all entries in the database

Return type

List[str]

get_keywords(alias: str, program: str, return_id: bool = False) -> Union[KeywordSet, str]

Pulls the keywords alias from the server for inspection.

Parameters
  • alias (str) – The keywords alias.

  • program (str) – The program the keywords correspond to.

  • return_id (bool, optional) – If True, returns the id rather than the KeywordSet object.

Returns

The requested KeywordSet or KeywordSet id.

Return type

Union[‘KeywordSet’, str]

get_molecules(subset: Optional[Union[str, Set[str]]] = None, force: bool = False) -> Union[pandas.core.frame.DataFrame, Molecule]

Queries full Molecules from the database.

Parameters
  • subset (Optional[Union[str, Set[str]]], optional) – The index subset to query on

  • force (bool, optional) – Force pull of molecules from server

Returns

Either a DataFrame of indexed Molecules or a single Molecule if a single subset string was provided.

Return type

Union[pd.DataFrame, ‘Molecule’]

get_records(method: str, basis: Optional[str] = None, *, keywords: Optional[str] = None, program: Optional[str] = None, include: Optional[List[str]] = None, subset: Optional[Union[str, Set[str]]] = None, merge: bool = False) -> Union[pandas.core.frame.DataFrame, ResultRecord]

Queries full ResultRecord objects from the database.

Parameters
  • method (str) – The computational method to query on (B3LYP)

  • basis (Optional[str], optional) – The computational basis to query on (6-31G)

  • keywords (Optional[str], optional) – The option token desired

  • program (Optional[str], optional) – The program to query on

  • include (Optional[List[str]], optional) – The attributes to return. Otherwise returns ResultRecord objects.

  • subset (Optional[Union[str, Set[str]]], optional) – The index subset to query on

  • merge (bool) – Merge multiple results into one (as in the case of DFT-D3). This only works when include=[‘return_results’], as in get_values.

Returns

Either a DataFrame of indexed ResultRecords or a single ResultRecord if a single subset string was provided.

Return type

Union[pd.DataFrame, ‘ResultRecord’]

get_values(method: Optional[Union[List[str], str]] = None, basis: Optional[Union[List[str], str]] = None, keywords: Optional[str] = None, program: Optional[str] = None, driver: Optional[str] = None, name: Optional[Union[List[str], str]] = None, native: Optional[bool] = None, subset: Optional[Union[List[str], str]] = None, force: bool = False) -> pandas.core.frame.DataFrame

Obtains values matching the search parameters provided for the expected return_result values. Defaults to the standard programs and keywords if not provided.

Note that unlike get_records, get_values will automatically expand searches and return multiple method and basis combinations simultaneously.

None is a wildcard selector. To search for None, use “None”.

Parameters
  • method (Optional[Union[str, List[str]]], optional) – The computational method (B3LYP)

  • basis (Optional[Union[str, List[str]]], optional) – The computational basis (6-31G)

  • keywords (Optional[str], optional) – The keyword alias

  • program (Optional[str], optional) – The underlying QC program

  • driver (Optional[str], optional) – The type of calculation (e.g. energy, gradient, hessian, dipole…)

  • name (Optional[Union[str, List[str]]], optional) – Canonical name of the record. Overrides the above selectors.

  • native (Optional[bool], optional) – True: only include data computed with QCFractal. False: only include data contributed from outside sources. None: include both.

  • subset (Optional[List[str]], optional) – The indices of the desired subset. Return all indices if subset is None.

  • force (bool, optional) – Data is typically cached; if True, forces a new query

Returns

A DataFrame of values with columns corresponding to methods and rows corresponding to molecule entries.

Return type

DataFrame
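The wildcard rule stated for get_values (None matches everything, while the string "None" matches a missing value) can be illustrated without a server. The code below is a minimal pure-Python sketch of that rule and is not QCPortal's implementation. matches and select are hypothetical names, and the specification dictionaries are made-up examples.

```python
def matches(selector, value):
    """Apply the wildcard rule to a single field.

    None selects everything; the string "None" selects a missing value.
    A selector may be a single string or a list of strings.
    """
    if selector is None:
        return True
    if isinstance(selector, str):
        selector = [selector]
    wanted = {None if item == "None" else item for item in selector}
    return value in wanted


def select(specs, **selectors):
    """Keep the specifications that satisfy every selector."""
    return [
        spec for spec in specs
        if all(matches(sel, spec.get(key)) for key, sel in selectors.items())
    ]


# Made-up specifications; the second one has no basis set.
specs = [
    {"method": "b3lyp", "basis": "def2-svp"},
    {"method": "hf-3c", "basis": None},
    {"method": "mp2", "basis": "cc-pvdz"},
]
print(len(select(specs, basis=None)))       # wildcard: every specification
print(select(specs, basis="None"))          # only the basis-free method
print(select(specs, method=["b3lyp", "mp2"]))
```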

list_keywords() -> pandas.core.frame.DataFrame

Lists keyword aliases for each program in the dataset.

Returns

A dataframe containing programs, keyword aliases, KeywordSet ids, and whether those keywords are the default for a program. Indexed on program.

Return type

pd.DataFrame

list_records(dftd3: bool = False, pretty: bool = True, **search: Optional[Union[List[str], str]]) -> pandas.core.frame.DataFrame

Lists specifications of available records, i.e. method, program, basis set, keyword set, and driver combinations. None is a wildcard selector. To search for None, use “None”.

Parameters
  • pretty (bool) – Replace NaN with “None” in returned DataFrame

  • **search (Dict[str, Optional[str]]) – Allows searching to narrow down return.

Returns

Record specifications matching **search.

Return type

DataFrame

list_values(method: Optional[Union[List[str], str]] = None, basis: Optional[Union[List[str], str]] = None, keywords: Optional[str] = None, program: Optional[str] = None, driver: Optional[str] = None, name: Optional[Union[List[str], str]] = None, native: Optional[bool] = None, force: bool = False) -> pandas.core.frame.DataFrame

Lists available data that may be queried with get_values. Results may be narrowed by providing search keys. None is a wildcard selector. To search for None, use “None”.

Parameters
  • method (Optional[Union[str, List[str]]], optional) – The computational method (B3LYP)

  • basis (Optional[Union[str, List[str]]], optional) – The computational basis (6-31G)

  • keywords (Optional[str], optional) – The keyword alias

  • program (Optional[str], optional) – The underlying QC program

  • driver (Optional[str], optional) – The type of calculation (e.g. energy, gradient, hessian, dipole…)

  • name (Optional[Union[str, List[str]]], optional) – The canonical name of the data column

  • native (Optional[bool], optional) – True: only include data computed with QCFractal. False: only include data contributed from outside sources. None: include both.

  • force (bool, optional) – Data is typically cached; if True, forces a new query

Returns

A DataFrame of the matching data specifications

Return type

DataFrame

set_default_benchmark(benchmark: str) -> bool

Sets the default benchmark value.

Parameters

benchmark (str) – The benchmark to default to.

set_default_program(program: str) -> bool

Sets the default program.

Parameters

program (str) – The program to default to.

set_view(path: Union[str, pathlib.Path]) -> None

Set a dataset to use a local view.

Parameters

path (Union[str, Path]) – path to an hdf5 file representing a view for this dataset

statistics(stype: str, value: str, bench: Optional[str] = None, **kwargs: Dict[str, Any]) -> Union[numpy.ndarray, pandas.core.series.Series, numpy.float64]

Provides statistics for various columns in the underlying dataframe.

Parameters
  • stype (str) – The type of statistic in question

  • value (str) – The method string to compare

  • bench (str, optional) – The benchmark method for the comparison, defaults to default_benchmark.

  • kwargs (Dict[str, Any]) – Additional kwargs to pass to the statistics functions

Returns

Returns an ndarray, Series, or float with the requested statistics depending on input.

Return type

np.ndarray, pd.Series, float

to_file(path: Union[str, pathlib.Path], encoding: str) -> None

Writes a view of the dataset to a file

Parameters
  • path (Union[str, Path]) – Where to write the file

  • encoding (str) – Options: plaintext, hdf5
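Taken together, to_file and set_view suggest a simple workflow: write a view to disk once, then point the Dataset at the local file. The helper below is a minimal sketch of that idea. export_and_use_view is a hypothetical name, ds is assumed to be a Dataset-like object with the documented to_file and set_view methods, and the "hdf5" encoding is chosen because set_view expects an hdf5 file.

```python
from pathlib import Path


def export_and_use_view(ds, path):
    """Write an hdf5 view of the dataset, then use it as the local view.

    Hypothetical helper: it only calls the documented to_file and set_view
    methods on the Dataset-like object `ds`.
    """
    path = Path(path)
    ds.to_file(path, "hdf5")  # encoding options per the docs: plaintext, hdf5
    ds.set_view(path)         # set_view expects an hdf5 file
    return path
```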

visualize(method: Optional[str] = None, basis: Optional[str] = None, keywords: Optional[str] = None, program: Optional[str] = None, groupby: Optional[str] = None, metric: str = 'UE', bench: Optional[str] = None, kind: str = 'bar', return_figure: Optional[bool] = None, show_incomplete: bool = False) -> plotly.Figure
Parameters
  • method (Optional[str], optional) – Methods to query

  • basis (Optional[str], optional) – Bases to query

  • keywords (Optional[str], optional) – Keyword aliases to query

  • program (Optional[str], optional) – Programs to query

  • groupby (Optional[str], optional) – Groups the plot by this index.

  • metric (str, optional) – The metric to use: either UE (unsigned error) or URE (unsigned relative error)

  • bench (Optional[str], optional) – The benchmark level of theory to use

  • kind (str, optional) – The kind of chart to produce, either ‘bar’ or ‘violin’

  • return_figure (Optional[bool], optional) – If True, returns the raw plotly figure. If False, returns a hosted iPlot. If None, returns an iPlot display in a Jupyter notebook and a raw plotly figure in all other circumstances.

  • show_incomplete (bool, optional) – Display statistics for method/basis set combinations where results are incomplete

Returns

The requested figure.

Return type

plotly.Figure
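The two metric names accepted by visualize can be made concrete with a small calculation. The sketch below assumes the usual definitions, UE = |value - bench| and URE = |value - bench| / |bench|. The exact formulas and scaling used internally (for example, whether URE is reported as a percentage) are not given in this section, so this is an illustration rather than QCPortal's implementation. All function names and numbers here are hypothetical.

```python
def unsigned_error(value, bench):
    """UE: absolute deviation from the benchmark (assumed definition)."""
    return abs(value - bench)


def unsigned_relative_error(value, bench):
    """URE: absolute deviation relative to the benchmark (assumed definition)."""
    return abs(value - bench) / abs(bench)


def mean_metric(values, benchmarks, metric="UE"):
    """Average the chosen metric over entries present in both mappings."""
    metric_fn = {"UE": unsigned_error, "URE": unsigned_relative_error}[metric]
    errors = [
        metric_fn(values[name], benchmarks[name])
        for name in benchmarks
        if name in values
    ]
    return sum(errors) / len(errors)


# Made-up values for two entries, compared against a made-up benchmark.
values = {"mol_a": 1.0, "mol_b": 3.0}
benchmarks = {"mol_a": 2.0, "mol_b": 2.0}
print(mean_metric(values, benchmarks, metric="UE"))
print(mean_metric(values, benchmarks, metric="URE"))
```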