Creating datasets#
Adding Datasets#
Datasets can be created on a server with the add_dataset() function of the PortalClient. This function returns the dataset object (such as an OptimizationDataset).
>>> ds = client.add_dataset("optimization", "Optimization of important molecules")
>>> print(ds.id)
27
The add_dataset() function takes several optional arguments, including a description of the dataset as well as a default tag and priority.
>>> ds = client.add_dataset("optimization", "Optimization of large molecules",
.. default_tag="large_mem", default_priority="low")
>>> print(ds.id)
28
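If needed later, a dataset can be retrieved from the server by its type and name with get_dataset() (a brief sketch; the id shown assumes the dataset created above):
>>> # Retrieve the dataset created above by type and name
>>> ds = client.get_dataset("optimization", "Optimization of large molecules")
>>> print(ds.id)
28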
Adding Entries and Specifications#
Entries and specifications can be added with add_entries(), add_entry(), and add_specification().
The details of these functions depend on the type of dataset - see Record & Computation Types. The following examples
are for an optimization dataset.
Both entries and specifications are given descriptive names, which can be used later in other functions (like get_record()).
When entries or specifications are added, the changes are reflected immediately on the server.
First, we add some entries to this dataset. For an optimization dataset, an entry corresponds to an unoptimized ‘initial’ molecule. Adding entries returns metadata.
>>> from qcportal.molecules import Molecule
>>> from qcportal.optimization import OptimizationDatasetEntry
>>> mol = Molecule(symbols=['C', 'O'], geometry=[0.0, 0.0, 0.0, 0.0, 0.0, 2.0])
>>> meta = ds.add_entry("carbon monoxide", mol)
>>> print(meta)
InsertMetadata(error_description=None, errors=[], inserted_idx=[0], existing_idx=[])
>>> # Can also create lots of entries and add them at once
>>> mol2 = Molecule(symbols=['F', 'F'], geometry=[0.0, 0.0, 0.0, 0.0, 0.0, 2.0])
>>> mol3 = Molecule(symbols=['Br', 'Br'], geometry=[0.0, 0.0, 0.0, 0.0, 0.0, 2.0])
>>> entry2 = OptimizationDatasetEntry(name='difluorine', initial_molecule=mol2)
>>> entry3 = OptimizationDatasetEntry(name='dibromine', initial_molecule=mol3)
>>> meta = ds.add_entries([entry2, entry3])
>>> print(meta)
InsertMetadata(error_description=None, errors=[], inserted_idx=[0, 1], existing_idx=[])
Now our dataset has three entries:
>>> print(ds.entry_names)
['carbon monoxide', 'difluorine', 'dibromine']
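Entries can also be fetched back by name, for example with get_entry() (a short sketch; the printed formula assumes the difluorine entry added above):
>>> # Look up a single entry by its name
>>> ent = ds.get_entry('difluorine')
>>> print(ent.initial_molecule.get_molecular_formula())
F2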
Next, we will add some specifications. For an optimization dataset, this is an OptimizationSpecification.
>>> from qcportal.singlepoint import QCSpecification
>>> from qcportal.optimization import OptimizationSpecification
>>> # Use geometric, compute gradients with psi4. Optimize with b3lyp/def2-tzvp
>>> spec = OptimizationSpecification(
... program='geometric',
... qc_specification=QCSpecification(
... program='psi4',
... driver='deferred',
... method='b3lyp',
... basis='def2-tzvp',
... )
... )
>>> meta = ds.add_specification(name='psi4/b3lyp/def2-tzvp', specification=spec)
>>> print(meta)
InsertMetadata(error_description=None, errors=[], inserted_idx=[0], existing_idx=[])
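As with entries, the newly added specification is now listed by its name (output assumes only the specification added above):
>>> print(ds.specification_names)
['psi4/b3lyp/def2-tzvp']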
Submitting Computations#
Adding entries and specifications does not immediately create the underlying records. To do that, we use submit().
The process of submitting is generally as follows:

1. Loop over the given entries and specifications
2. If a record is already attached to the dataset for that entry/specification pair, nothing is done
3. If the find_existing parameter is false, then a new record is created and attached to the dataset
4. If the find_existing parameter is true (the default), then the database is searched for an existing record matching that entry and specification
   - If a record is found, then that record is attached to the dataset
   - If a record is not found, then a new record is created and attached to the dataset
With no arguments, this will find/create and attach missing records for all entries and specifications, using the default tag and priority of the dataset (set when creating the dataset, or modified after).
You may also submit only certain entries and specifications, or change the tag and priority of any newly-created records.
>>> ds.submit() # Create everything
>>> # Submit missing difluorine computations with a special tag
>>> ds.submit(['difluorine'], tag='special_tag')
>>> # Submit dibromine psi4/b3lyp/def2-tzvp computation at a high priority
>>> ds.submit(['dibromine'], ['psi4/b3lyp/def2-tzvp'], priority='high')
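As described above, setting find_existing to False forces new records to be created for any entry/specification pairs that do not yet have an attached record, skipping the search for matching existing records (a sketch; this can duplicate computations already on the server):
>>> # Create fresh records rather than reusing matching existing ones
>>> ds.submit(['difluorine'], find_existing=False)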
If there are a lot of records to be created, you may instead submit them using background_submit(). This will create a background internal job and return an InternalJob object that you can use to monitor the progress if desired (see watch()). background_submit() takes the same arguments as submit().
>>> ij = ds.background_submit()
>>> print(ij.progress)
50
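Once records are attached, the entry and specification names given earlier can be used to retrieve individual records with get_record() (a sketch; the status shown is illustrative and depends on the state of the computation):
>>> # Fetch the record for a particular entry/specification pair
>>> rec = ds.get_record('carbon monoxide', 'psi4/b3lyp/def2-tzvp')
>>> print(rec.status)
RecordStatusEnum.waiting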
Cloning a dataset#
An entire dataset can be cloned using PortalClient.clone_dataset(). This will clone all entries, specifications, and records. Records themselves are not duplicated, but will be linked to by both datasets.
>>> cloned_ds = client.clone_dataset(432, "New dataset name")
>>> print(cloned_ds.name)
New dataset name
>>> print(cloned_ds.id)
542
>>> cloned_ds.print_status()
specification      complete    error    invalid
---------------  ----------  -------  ---------
spec_2                13569      401
spec_6                13926                  44
Copying from another dataset#
Entries, specifications, and records can be copied from another dataset of the same type using the following functions: copy_entries_from(), copy_specifications_from(), and copy_records_from(). These functions take a dataset id, and optionally lists of entry and specification names. copy_records_from() will copy entries and specifications before copying the records. The records themselves are not duplicated - instead, the records will be referenced by both datasets.
The example below copies some entries and specifications from another dataset:
>>> ds = client.add_dataset("singlepoint", "New combined dataset")
>>> # If no entry_names are given, all entries will be copied
>>> ds.copy_entries_from(432, entry_names=['010 ALA-0', '010 ALA-1', '010 ALA-2'])
>>> print(ds.entry_names)
['010 ALA-0', '010 ALA-1', '010 ALA-2']
>>> # If no specification_names are given, all specifications will be copied
>>> ds.copy_specifications_from(432, specification_names=['wb97m-d3bj/def2-tzvppd'])
>>> print(ds.specification_names)
['wb97m-d3bj/def2-tzvppd']
The example below copies entries, specifications, and the actual records from the other dataset:
>>> ds = client.add_dataset("singlepoint", "New combined dataset")
>>> # If no entry_names or specification_names are given, all will be copied from the source
>>> ds.copy_records_from(432, entry_names=['010 ALA-0', '010 ALA-1', '010 ALA-2'], specification_names=['wb97m-d3bj/def2-tzvppd'])
>>> print(ds.entry_names)
['010 ALA-0', '010 ALA-1', '010 ALA-2']
>>> print(ds.specification_names)
['wb97m-d3bj/def2-tzvppd']
>>> ds.print_status()
specification             complete
----------------------  ----------
wb97m-d3bj/def2-tzvppd           3
Note: Singlepoint datasets have the ability to add entries from other types of datasets - see the singlepoint documentation.