As scientific data grow beyond available bandwidth capacity, they become difficult to transport for processing or sharing. Scientists who require hundreds of terabytes of data for large simulations, such as those in cosmology and turbulence, need large storage spaces and quick processing times to do their science. Cloud storage and high-performance computing services enable these scientific communities to conduct research, but may constrain access to results. Datasets become scattered across locations and are often described by competing metadata schemas, which limits their discoverability and retrieval by other scientists. We report preliminary findings from a case study of an infrastructure being designed for use by multiple scientific disciplines. The infrastructure is intended to store original datasets, the code used to conduct analysis, and the resulting datasets in a common area accessible via a web browser. Researchers will be able to share these components of their workflows by granting access to a virtual notebook. Although the notebook is not necessarily a permanent record of the research, it can be exported in many formats, can be referenced, and can support multiple simulations with multiple runs of parameters, all within a browser. The Jupyter (formerly IPython) notebooks will be configured and load tested to scale with the system in a special environment that will be maintained on the server. We are studying how these innovative tools and infrastructures are applied and adopted across disciplines. These new workflows, which address the data size problem and consolidate a researcher’s work into one virtual area, may enable new forms of scientific collaboration. The benefits, costs, and tradeoffs of these workflows and tools will inform scientific practice and policy.
Sands, A. E. (April 2016). Data Infrastructures in the Sloan Digital Sky Survey and Large Synoptic Survey Telescope Projects. Poster presented at the FORCE 2016 Conference, Portland, OR.