Cloud Access
This article describes how to access Roman data stored in the AWS cloud in the form of S3 buckets. The Python library s3fs is used to access the data, which can either be streamed directly into memory or downloaded for long-term use.
Cloud Data Storage, S3 Buckets, and URIs
Data is stored on the AWS cloud using the S3 object storage service. Amazon S3 organizes data as objects within buckets, where an object represents a file and its associated metadata, and a bucket serves as a container for these objects. For instance, the Barbara A. Mikulski Archive for Space Telescopes (MAST) provides cloud access to data sets from various missions, including the Hubble Space Telescope (HST) and the Transiting Exoplanet Survey Satellite (TESS). MAST data is stored in a bucket named stpubdata, and specific files within the bucket can be accessed via a uniform resource identifier (URI) in the format s3://stpubdata/hst/public/some_folder/some_file. A URI is a general concept that uniquely identifies an abstract or physical resource, which may or may not be connected to the internet. In contrast, a uniform resource locator (URL) specifically locates resources on a network, such as the internet.
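As an illustration of this structure, the bucket name and object key can be split out of an S3 URI with the Python standard library (a minimal sketch; none of the access methods below require this step):

from urllib.parse import urlparse

# Decompose an S3 URI into its bucket and key components
uri = 's3://stpubdata/hst/public/some_folder/some_file'
parsed = urlparse(uri)
bucket = parsed.netloc         # 'stpubdata'
key = parsed.path.lstrip('/')  # 'hst/public/some_folder/some_file'
print(bucket, key)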
Roman data will be available in the AWS US East-1 (Northern Virginia) data center.
Querying the MAST Archive and Fetching a URI
The URI of a data product can be retrieved from the MAST archive using the astroquery.mast.Observations module (see the astroquery API documentation for more information). In the example below, we begin by enabling cloud data access, which is necessary to obtain the URIs. Next, we query the archive to get the observations as an astropy.table.Table object. Applying the get_product_list method to this table gives a table of products. A single data product is then selected from that table with the filter_products method, based on its observation ID and product subgroup description. Finally, applying the get_cloud_uris method returns a list containing the URI of the selected file.
from astroquery.mast import Observations

Observations.enable_cloud_dataset()

obs = Observations.query_criteria(obs_collection='JWST',
                                  filters='F444W',
                                  instrument_name='NIRCAM/IMAGE',
                                  proposal_id=['1783'],
                                  dataRights='PUBLIC')
products = Observations.get_product_list(obs)
single = Observations.filter_products(products,
                                      productSubGroupDescription='RATE',
                                      obsID='87766440')

print('Single data product:\n', single, '\n')
print(Observations.get_cloud_uris(single)[0])
The code above prints the data product table:
Single data product:
  obsID  obs_collection dataproduct_type ... dataRights calib_level filters
-------- -------------- ---------------- ... ---------- ----------- -------
87766440           JWST            image ...     PUBLIC           2   F444W
The table is quite wide, so it has been abbreviated and does not show all columns. The method get_cloud_uris returns the following data product URI:
's3://stpubdata/jwst/public/jw01783/jw01783004007/jw01783004007_02101_00004_nrcalong_rate.fits'
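Since the printed table is abbreviated, the full set of available columns can be listed from the table itself (a one-line sketch using the single table from the example above):

# Show every column name in the product table
print(single.colnames)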
Accessing Data from Python
The data stored in the S3 buckets can be accessed in Python using the s3fs package (see the s3fs documentation for more information). The top-level S3FileSystem class holds connection information and allows typical file-system style operations such as cp, mv, ls, du, and glob, as well as put/get of local files to/from S3.
The connection can be anonymous (the anon=True option), in which case only publicly available, read-only buckets are accessible, or it can use credentials, which can be supplied explicitly or in configuration files. Setting anon=False, which is the default option, enables the use of the default credentials.
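For example, a credentialed connection might be set up as in the sketch below; the key, secret, and profile values are placeholders, and the appropriate credentials depend on your AWS account configuration:

import s3fs

# Anonymous connection: public, read-only buckets only
fs_public = s3fs.S3FileSystem(anon=True)

# Explicit credentials (placeholder values)
fs_creds = s3fs.S3FileSystem(key='YOUR_ACCESS_KEY_ID',
                             secret='YOUR_SECRET_ACCESS_KEY')

# Or a named profile from your AWS configuration files
fs_profile = s3fs.S3FileSystem(profile='default')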
import s3fs

fs = s3fs.S3FileSystem(anon=True)
fs.ls('s3://stpubdata/jwst/public/jw01783/')
Running the above code will produce the following result (truncated for display):
['stpubdata/jwst/public/jw01783/L3',
 'stpubdata/jwst/public/jw01783/asn',
 'stpubdata/jwst/public/jw01783/jw01783001001',
 'stpubdata/jwst/public/jw01783/jw01783002001',
 ...]
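Other file-system style operations accept URIs in the same way; the sketch below shows a few of them (the specific paths and patterns are illustrative):

# Find FITS files matching a pattern under a prefix
fs.glob('s3://stpubdata/jwst/public/jw01783/jw01783004007/*_rate.fits')

# Total size in bytes of everything under a prefix
fs.du('s3://stpubdata/jwst/public/jw01783/jw01783004007/')

# Metadata (size, type, etc.) for a single object
fs.info('s3://stpubdata/jwst/public/jw01783/jw01783004007/'
        'jw01783004007_02101_00004_nrcalong_rate.fits')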
There are two methods for accessing data on the cloud from a machine, as described below:
- The data can be loaded directly into the machine's memory.
- The data can be downloaded to storage attached to the machine and then accessed from there.
Load Data Directly into Memory (Streaming Access)
S3 data can be loaded directly into the memory of the working machine by calling open on a URI with an S3FileSystem object. This returns an S3File object that provides read and write access and emulates the standard Python file object, supporting methods such as read, write, tell, and seek. Note that only binary read and write modes are implemented.
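To illustrate the file-like interface, the sketch below reads the first bytes of the example file from the previous section and moves around in it with seek and tell:

import s3fs

fs = s3fs.S3FileSystem(anon=True)
uri = 's3://stpubdata/jwst/public/jw01783/jw01783004007/jw01783004007_02101_00004_nrcalong_rate.fits'

with fs.open(uri, 'rb') as f:  # returns an S3File in binary mode
    card = f.read(80)          # first 80-byte FITS header card
    print(card)
    print(f.tell())            # current position in the file: 80
    f.seek(0)                  # rewind to the beginning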
The example code below demonstrates how to load a FITS file from a URI; a sketch for ASDF files follows it. For FITS files, the S3File object is passed to the fits.open function, while ASDF files are opened using the asdf.open function. For Roman ASDF files, it is recommended to use the open function from the roman_datamodels module. Refer to this notebook for further details on working with Roman ASDF files.
Streaming files into memory is the recommended method for accessing most data files in the cloud.
import s3fs
from astropy.io import fits

# Reading a FITS file
# replace with appropriate path
fits_file_uri = 's3://stpubdata/jwst/public/jw01783/jw01783004007/jw01783004007_02101_00004_nrcalong_rate.fits'

fs = s3fs.S3FileSystem(anon=True)
with fs.open(fits_file_uri, 'rb') as f:
    with fits.open(f, 'readonly') as HDUlist:
        HDUlist.info()
        sci = HDUlist[1].data

print(type(sci))
Running the above code should return the following result:
Filename: <class 's3fs.core.S3File'>
No.    Name         Ver    Type         Cards   Dimensions    Format
  0  PRIMARY          1  PrimaryHDU      254   ()
  1  SCI              1  ImageHDU         75   (2048, 2048)   float32
  2  ERR              1  ImageHDU         10   (2048, 2048)   float32
  3  DQ               1  ImageHDU         11   (2048, 2048)   int32 (rescales to uint32)
  4  VAR_POISSON      1  ImageHDU          9   (2048, 2048)   float32
  5  VAR_RNOISE       1  ImageHDU          9   (2048, 2048)   float32
  6  ASDF             1  BinTableHDU      11   1R x 1C        [7867B]
<class 'numpy.ndarray'>
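The same pattern applies to ASDF files. The sketch below uses a hypothetical URI as a placeholder, since no example Roman file path is given above; replace it with the URI of an actual product:

import asdf
import s3fs
import roman_datamodels as rdm

# Placeholder URI; replace with the URI of a real Roman ASDF file
asdf_file_uri = 's3://some-bucket/path/to/some_wfi_file_cal.asdf'

fs = s3fs.S3FileSystem(anon=True)

# Generic ASDF access with asdf.open
with fs.open(asdf_file_uri, 'rb') as f:
    with asdf.open(f) as af:
        af.info()

# Recommended for Roman files: roman_datamodels.open
with fs.open(asdf_file_uri, 'rb') as f:
    dm = rdm.open(f)
    print(type(dm))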
Download Files
WFI data products can have large data volumes, so downloading them is not recommended and it is generally better to work directly in the cloud. The method is nevertheless shown below for completeness.
Files in an S3 bucket can be downloaded using the get method of an S3FileSystem object, as shown below. The get method takes two input parameters: the URI from which to download the data and the local file path where the data will be stored.
from pathlib import Path
import s3fs

fs = s3fs.S3FileSystem(anon=True)

# Create a local directory to hold the downloaded file
local_file_path = Path('data/')
local_file_path.mkdir(parents=True, exist_ok=True)

uri = 's3://stpubdata/jwst/public/jw01783/jw01783004007/jw01783004007_02101_00004_nrcalong_rate.fits'
fs.get(uri, str(local_file_path))
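Assuming get places the file inside the data/ directory under its original name (the exact behavior can vary between s3fs versions), the downloaded copy can then be opened like any local file:

from astropy.io import fits

# Open the local copy downloaded in the previous step
local_file = local_file_path / 'jw01783004007_02101_00004_nrcalong_rate.fits'
with fits.open(local_file) as hdulist:
    hdulist.info()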
For additional questions not answered in this article, please contact the Roman Help Desk at STScI.