Personal information

big data, repositories, data curation

Biography

As the data curator for Constellation, OLCF’s Open Data Portal, I help manage approximately 3.5 petabytes of data across 674,083 files. The largest dataset is approximately 2 petabytes, and the largest single file is 17 terabytes. Staff in OLCF’s Data Lifecycle and Scalable Workflows group have been meeting the challenges of large, unstructured datasets, whose management often requires creative and collaborative solutions. Constellation directly improves both the sharing of scientific results and their quality. For example, this year a Principal Investigator (PI) deposited 10 million directories, each corresponding to a molecule, into the repository. During curation, one of the directories was found to be empty; further discussion with the PI determined that this was a mistake and that the experiment would need to be re-run. As a result, the number of unprocessed molecules was reduced from 41 to 13. That is, curation actively contributed to scientific achievement at OLCF. I am interested in ensuring that we can continue to do this in a methodical manner, according to emerging best practices, as datasets grow ever larger during the exascale era, while remaining aware of the ethics and pitfalls of publishing and archiving datasets this large.

Activities

Employment (1)

Oak Ridge National Laboratory: Oak Ridge, TN, US

2021-08-09 to present | Data Services Engineer (Data Life Cycles and Scalable Workflows)
Employment
Source: Self-asserted source
Alexander May

Professional activities (2)

DOE Data Curation Working Group: Oak Ridge, Tennessee, US

2023-01 to present | Co-Lead
Membership
Source: Self-asserted source
Alexander May

Oak Ridge National Laboratory: Oak Ridge, Tennessee, US

2022-12-12 | SPA awarded in the category of Paramount Accomplishment, with the reasoning stated as: “ORNL lab-wide data strategy (section 3.5 a – c) was in the Operational Excellence of the FY22 Lab Agenda and contained 3 milestones: develop a lab-wide data management strategy framework; deliver a 3-year implementation plan; and deliver two target demonstrations such as data management plans (DMPs) for one INTERSECT selected workflow and ERP upgrade to S4.” (Data Life Cycles and Scalable Workflows)
Distinction
Source: Self-asserted source
Alexander May

Works (5)

Pseudonymization at Scale: OLCF’s Summit Usage Data Case Study

2022 IEEE International Conference on Big Data (Big Data)
2022-12-17 | Journal article
Contributors: Ketan Maheshwari; Sean R. Wilkinson; Alex May; Tyler Skluzacek; Olga A. Kuchar; Rafael Ferreira da Silva
Source: Self-asserted source
Alexander May

Working Group on NIH DMSP Guidance

Open Science Framework
2022-11-21 | Journal article
Contributors: Hao Ye; Marla Hertz; Kelsey Badger; Lena Bohman; Collin Schwantes; Lauren Phegley; Katherine Smith; Nina Exner; Jennifer Muilenburg; Reid Otsuji et al.
Source: Self-asserted source
Alexander May

Who is Afraid of a Petabyte Dataset? Rethinking Repository Infrastructures and Curation Workflows for the Scale and Type of Next Generation Data

Open Repositories 2022
2022-06-08 | Conference paper
Source: Self-asserted source
Alexander May

Towards a Big-Data Toolkit: Ensuring Data Governance & Ethical Considerations Are Applied to Large Datasets

DOE Data Days
2022-05 | Conference abstract
Contributors: Alexander May
Source: Self-asserted source
Alexander May

Towards the DOE Data Catalog: Ensuring Access, Sharing and Protection

DOE Data Days
2022-05 | Conference poster
Contributors: Alexander May
Source: Self-asserted source
Alexander May