Biography
As the data curator for Constellation, OLCF’s Open Data Portal, I help manage approximately 3.5 petabytes of data across 674,083 files. The largest dataset is approximately 2 petabytes, and the largest single file is 17 terabytes. Staff in the OLCF Data Lifecycle and Scalable Workflows group have been meeting the challenges of large, unstructured datasets, whose management often requires creative and collaborative solutions. Constellation directly improves both the sharing of scientific results and their quality. For example, this year a Principal Investigator (PI) deposited 10 million directories corresponding to molecules into the repository. During curation, one of the directories was found to be empty; further discussion with the PI established that this was a mistake and that the experiment would need to be re-run. As a result, the number of unprocessed molecules was reduced from 41 to 13. That is, curation actively contributed to scientific achievement at OLCF. I am interested in ensuring that we can continue to do this methodically, following emerging best practices, as datasets grow ever larger in the exascale era, while remaining aware of the ethics and pitfalls of publishing and archiving datasets this large.