Cloud Access

This article describes how to access Roman data stored in S3 buckets in the AWS cloud. The Python library s3fs is used to access the data, which can either be streamed directly into memory or downloaded for long-term use.

Cloud Data Storage, S3 Buckets, and URIs

Data is stored in the AWS cloud using the S3 object storage service. Amazon S3 organizes data as objects within buckets, where an object represents a file and its associated metadata, and a bucket serves as a container for these objects. For instance, the Barbara A. Mikulski Archive for Space Telescopes (MAST) provides cloud access to data sets from various missions, including the Hubble Space Telescope (HST) and the Transiting Exoplanet Survey Satellite (TESS). MAST data is stored in a bucket named stpubdata, and specific files within the bucket can be accessed via a uniform resource identifier (URI) in the format: s3://stpubdata/hst/public/some_folder/some_file. A URI is a general concept that uniquely identifies an abstract or physical resource, which may or may not be connected to the internet. In contrast, a uniform resource locator (URL) specifically locates resources on a network, such as the internet.
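
For illustration, such a URI can be split into its bucket and key components with Python's standard urllib module (a minimal sketch; the folder and file names are placeholders):

Parse an S3 URI
from urllib.parse import urlparse

uri = 's3://stpubdata/hst/public/some_folder/some_file'
parsed = urlparse(uri)
bucket = parsed.netloc          # 'stpubdata'
key = parsed.path.lstrip('/')   # 'hst/public/some_folder/some_file'
print(bucket, key)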

Roman data will be available in the AWS US East-1 (Northern Virginia) data center.


Querying the MAST Archive and Fetching a URI

The URI of a data product can be retrieved from the MAST archive using the astroquery.mast.Observations module (see the astroquery API documentation for more information). In the example below, we begin by enabling cloud data access, which is necessary to obtain the URIs. Next, we query the archive to get the observations as an astropy.table.Table object. Applying the get_product_list method to this table gives a table of products. Finally, a single data product is selected from the table based on its ID and product description, and applying the get_cloud_uris method returns a list containing the URI of the selected file.

MAST query
from astroquery.mast import Observations

# Enable cloud data access; this is required to retrieve S3 URIs
Observations.enable_cloud_dataset()

# Query the archive for public JWST NIRCam F444W imaging from program 1783
obs = Observations.query_criteria(obs_collection='JWST',
                                  filters='F444W',
                                  instrument_name='NIRCAM/IMAGE',
                                  proposal_id=['1783'],
                                  dataRights='PUBLIC')

# Retrieve the table of data products associated with the observations
products = Observations.get_product_list(obs)

# Select a single RATE product by its observation ID
single = Observations.filter_products(products, productSubGroupDescription='RATE', obsID='87766440')
print('Single data product:\n', single, '\n')

# Print the S3 URI of the selected product
print(Observations.get_cloud_uris(single)[0])

The code above prints the data product table:

MAST query result
Single data product:
  obsID   obs_collection dataproduct_type ... dataRights calib_level filters
-------- -------------- ---------------- ... ---------- ----------- -------
87766440           JWST            image ...     PUBLIC           2   F444W

The table is quite wide, so it has been abbreviated and does not show all columns. The get_cloud_uris method returns the following data product URI:

s3://stpubdata/jwst/public/jw01783/jw01783004007/jw01783004007_02101_00004_nrcalong_rate.fits


Accessing Data from Python

The data stored in the S3 buckets can be accessed in Python using the s3fs package (see the s3fs documentation for more information). The top-level S3FileSystem class holds connection information and allows typical file-system style operations such as cp, mv, ls, du, and glob, as well as put/get of local files to/from S3.

The connection can be anonymous (the anon=True option), in which case only publicly available, read-only buckets are accessible, or credentialed, with credentials supplied explicitly or in configuration files. Setting anon=False, which is the default, enables the use of the default credentials.
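
The sketch below contrasts these connection modes; the explicit key and secret values are placeholders, not real credentials:

S3 Connection Options
import s3fs

# Anonymous connection: only public, read-only buckets are accessible
fs_public = s3fs.S3FileSystem(anon=True)

# Default credentials (e.g., from environment variables or ~/.aws/credentials)
fs_default = s3fs.S3FileSystem(anon=False)

# Explicitly supplied credentials (placeholder values)
fs_explicit = s3fs.S3FileSystem(key='YOUR_ACCESS_KEY', secret='YOUR_SECRET_KEY')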

List Files on S3
import s3fs

# Open an anonymous connection and list the contents of a public bucket path
fs = s3fs.S3FileSystem(anon=True)
fs.ls('s3://stpubdata/jwst/public/jw01783/')

Running the above code will produce the following result (truncated for display):

List Files on S3 Result
['stpubdata/jwst/public/jw01783/L3',
 'stpubdata/jwst/public/jw01783/asn',
 'stpubdata/jwst/public/jw01783/jw01783001001',
 'stpubdata/jwst/public/jw01783/jw01783002001',
...]
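
The other file-system style operations mentioned above work in the same way. The sketch below shows glob and du; the wildcard pattern is illustrative:

Glob and Disk Usage on S3
import s3fs
fs = s3fs.S3FileSystem(anon=True)

# Match rate files under a visit directory with a wildcard pattern
matches = fs.glob('s3://stpubdata/jwst/public/jw01783/jw01783004007/*_rate.fits')
print(matches[:3])

# Total size, in bytes, of all objects under a prefix
print(fs.du('s3://stpubdata/jwst/public/jw01783/jw01783004007/'))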


There are two methods for accessing cloud-hosted data from a machine, as described below:

  1. The data can be loaded directly into the machine's memory.
  2. The data can be downloaded to storage attached to the machine and then accessed from there.

Load Data Directly into Memory (Streaming Access)

S3 data can be loaded directly into the memory of the working machine by calling open on a URI with an S3FileSystem object. This returns an S3File object that provides read and write access and emulates the standard Python file object, supporting methods such as read, write, tell, and seek. Note that only binary read and write modes are implemented.
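
As a minimal illustration of this file-like interface, the sketch below reads, inspects, and rewinds the same public JWST file used in the streaming example that follows:

S3File Interface
import s3fs

fs = s3fs.S3FileSystem(anon=True)
uri = 's3://stpubdata/jwst/public/jw01783/jw01783004007/jw01783004007_02101_00004_nrcalong_rate.fits'
with fs.open(uri, 'rb') as f:
    card = f.read(80)   # read the first 80-byte FITS header card
    print(f.tell())     # current position in the file: 80
    f.seek(0)           # rewind to the beginning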

The example code below demonstrates how to load a FITS file from a URI; the S3File object is simply passed to the fits.open function. ASDF files are opened the same way using the asdf.open function (see the sketch after the FITS example below). For Roman ASDF files, it is recommended to use the open function from the roman_datamodels module. Refer to this notebook for further details on working with Roman ASDF files.

Streaming files into memory is the recommended method for accessing most data files in the cloud. 


Stream File
import s3fs
from astropy.io import fits

# URI of a public JWST rate file; replace with an appropriate path
fits_file_uri = 's3://stpubdata/jwst/public/jw01783/jw01783004007/jw01783004007_02101_00004_nrcalong_rate.fits'

# Open the file anonymously and pass the file object to astropy.io.fits
fs = s3fs.S3FileSystem(anon=True)
with fs.open(fits_file_uri, 'rb') as f:
    with fits.open(f, 'readonly') as hdulist:
        hdulist.info()
        sci = hdulist[1].data  # science array from the SCI extension
print(type(sci))

Running the above code should return the following result:

Stream File Result
Filename: <class 's3fs.core.S3File'>
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU     254   ()
  1  SCI           1 ImageHDU        75   (2048, 2048)   float32
  2  ERR           1 ImageHDU        10   (2048, 2048)   float32
  3  DQ            1 ImageHDU        11   (2048, 2048)   int32 (rescales to uint32)
  4  VAR_POISSON    1 ImageHDU         9   (2048, 2048)   float32
  5  VAR_RNOISE    1 ImageHDU         9   (2048, 2048)   float32
  6  ASDF          1 BinTableHDU     11   1R x 1C   [7867B]
<class 'numpy.ndarray'>
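
ASDF files can be streamed in the same way. The sketch below uses the generic asdf.open function; the URI is a placeholder, not a real file. For Roman ASDF files, the same pattern applies with the open function from roman_datamodels, as recommended above:

Stream ASDF File
import s3fs
import asdf

# Placeholder URI; replace with the URI of an actual ASDF product
asdf_file_uri = 's3://some_bucket/some_folder/some_file.asdf'

fs = s3fs.S3FileSystem(anon=True)
with fs.open(asdf_file_uri, 'rb') as f:
    with asdf.open(f) as af:
        af.info()  # summarize the contents of the ASDF tree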



Download Files

This method is not recommended, though it is shown below for completeness. WFI data products can have large data volumes; therefore, it is recommended to work in the cloud.


Files in an S3 bucket can be downloaded using the get method of the S3FileSystem object, as shown below. The get method takes two input parameters: the URI from which to download the data and the local path where the data will be stored.

Download File
from pathlib import Path
import s3fs

# Open an anonymous connection
fs = s3fs.S3FileSystem(anon=True)

# Create a local directory to hold the downloaded file
local_file_path = Path('data/')
local_file_path.mkdir(parents=True, exist_ok=True)

# Download the file into the local directory
uri = 's3://stpubdata/jwst/public/jw01783/jw01783004007/jw01783004007_02101_00004_nrcalong_rate.fits'
fs.get(uri, str(local_file_path))
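
Once the download completes, the file can be opened from local storage as usual; the path below follows from the example above:

Open Downloaded File
from astropy.io import fits

with fits.open('data/jw01783004007_02101_00004_nrcalong_rate.fits') as hdulist:
    hdulist.info()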




For additional questions not answered in this article, please contact the Roman Help Desk at STScI.







Latest Update

Publication

Initial publication of the article.