The Analytics Accelerator Library for Amazon S3 helps you accelerate access to Amazon S3 data from your applications. This open-source solution reduces processing times and compute costs for your data analytics workloads.
With this library, you can:
- Lower processing times and compute costs for data analytics workloads.
- Implement S3 best practices for performance.
- Utilize optimizations specific to Apache Parquet files, such as prefetching metadata located in the footer of the object and predictive column prefetching.
- Improve the price performance for your data analytics applications, such as workloads based on Apache Spark.
The Analytics Accelerator Library for Amazon S3 has been tested and integrated with the Apache Hadoop S3A client, and is currently in the process of being integrated into the Iceberg S3FileIO client. It is also tested for datasets stored in S3 Table Buckets.
We're constantly working on improving Analytics Accelerator Library for Amazon S3 and are interested in hearing your feedback on features, performance, and compatibility. Please send feedback by opening a GitHub issue.
You can use Analytics Accelerator Library for Amazon S3 either as a standalone library or with the Hadoop library and the Spark engine. On Spark, it also supports the Iceberg open table format as well as S3 Table Buckets. Integrations with Hadoop and Iceberg are turned off by default; we describe how to enable them below.
To get started, import the library dependency from Maven into your project:
```xml
<dependency>
  <groupId>software.amazon.s3.analyticsaccelerator</groupId>
  <artifactId>analyticsaccelerator-s3</artifactId>
  <version>1.0.0</version>
  <scope>compile</scope>
</dependency>
```
Then, initialize the library's `S3SeekableInputStreamFactory`:
```java
// Create an S3 CRT client with a high concurrency limit.
S3AsyncClient crtClient = S3CrtAsyncClient.builder().maxConcurrency(600).build();

S3SeekableInputStreamFactory s3SeekableInputStreamFactory = new S3SeekableInputStreamFactory(
    new S3SdkObjectClient(crtClient), S3SeekableInputStreamConfiguration.DEFAULT);
```
Note: The `S3SeekableInputStreamFactory` can be initialized with either the Java `S3AsyncClient` or the S3 CRT client. We recommend that you use the S3 CRT client due to its enhanced connection pool management and higher throughput on downloads.
For either client, we recommend you initialize with a higher concurrency value to fully benefit from the library's optimizations.
This is because the library makes multiple parallel requests to S3 to prefetch data asynchronously. For the Java S3AsyncClient, you can increase the maximum connections by doing the following:
```java
NettyNioAsyncHttpClient.Builder httpClientBuilder =
    NettyNioAsyncHttpClient.builder()
        .maxConcurrency(600);

S3AsyncClient s3AsyncClient = S3AsyncClient.builder().httpClientBuilder(httpClientBuilder).build();
```
To open a stream:
```java
S3SeekableInputStream s3SeekableInputStream = s3SeekableInputStreamFactory.createStream(S3URI.of(bucket, key));
```
For more details on the usage of this stream, refer to the SeekableInputStream interface.
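As a rough illustration, the sketch below shows typical stream usage: a sequential read from the start of the object, then a seek to a later offset. It assumes only standard `InputStream` read semantics plus a `seek(long)` method, and uses a hypothetical 1 MiB offset:

```java
byte[] buffer = new byte[4096];

// Sequential read from the start of the object.
int bytesRead = s3SeekableInputStream.read(buffer, 0, buffer.length);

// Jump to a later offset (hypothetical 1 MiB position) and continue reading.
s3SeekableInputStream.seek(1_048_576L);
int bytesAfterSeek = s3SeekableInputStream.read(buffer, 0, buffer.length);

s3SeekableInputStream.close();
```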
When the `S3SeekableInputStreamFactory` is no longer required to create new streams, close it to free resources (e.g., caches for prefetched data) held by the factory.
```java
s3SeekableInputStreamFactory.close();
```
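If the factory's lifetime matches a unit of work, a try-with-resources block is a convenient way to guarantee cleanup. A minimal sketch, assuming the factory implements `AutoCloseable` (which its `close()` method suggests):

```java
try (S3SeekableInputStreamFactory factory = new S3SeekableInputStreamFactory(
        new S3SdkObjectClient(crtClient), S3SeekableInputStreamConfiguration.DEFAULT)) {
    S3SeekableInputStream stream = factory.createStream(S3URI.of(bucket, key));
    // ... read from the stream ...
} // Prefetch caches are released here, even if an exception is thrown.
```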
If you are using Analytics Accelerator Library for Amazon S3 with Hadoop, you need to set the stream type to `analytics` in the Hadoop configuration. An example configuration is as follows:
```xml
<property>
  <name>fs.s3a.input.stream.type</name>
  <value>analytics</value>
</property>
```
For more information, see the Hadoop documentation on the Analytics stream type.
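If you configure Hadoop programmatically rather than through XML, the same property can be set on a `Configuration` object. The sketch below uses a hypothetical bucket and key; any `s3a://` path you can already read works the same way:

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
conf.set("fs.s3a.input.stream.type", "analytics");

// Hypothetical bucket and key, for illustration only.
FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
try (InputStream in = fs.open(new Path("s3a://my-bucket/data/part-00000.parquet"))) {
    // Reads on this stream now go through the analytics stream type.
}
```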
Using the Analytics Accelerator Library for Amazon S3 with Spark is similar to the Hadoop usage. The only difference is that the property names are prefixed with `spark.hadoop.`. For example, to enable Analytics Accelerator Library for Amazon S3 on the Spark engine, set the following property in your Spark configuration:
```xml
<property>
  <name>spark.hadoop.fs.s3a.input.stream.type</name>
  <value>analytics</value>
</property>
```
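Equivalently, you can set the property when constructing the Spark session. A minimal sketch, with a hypothetical application name:

```java
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("aal-s3-example") // hypothetical name
    .config("spark.hadoop.fs.s3a.input.stream.type", "analytics")
    .getOrCreate();
```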
When your data in S3 is organized in the Iceberg open table format, it is retrieved using the Iceberg S3FileIO client instead of the Hadoop S3A client. Analytics Accelerator Library for Amazon S3 is currently being integrated into the Iceberg S3FileIO client. To enable it in the Spark engine, set the following Spark property:
```xml
<property>
  <name>spark.sql.catalog.<CATALOG_NAME>.s3.analytics-accelerator.enabled</name>
  <value>true</value>
</property>
```
For S3 General Purpose Buckets and S3 Directory Buckets, set the `<CATALOG_NAME>` to `spark_catalog`, the default catalog.
S3 Table Buckets require you to set a custom catalog name, as outlined here.
Once you set the catalog, you can replace the `<CATALOG_NAME>` parameter with your chosen name.
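As an illustration, with the default `spark_catalog` (a general purpose or directory bucket), the session-level setting might look like the following sketch:

```java
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .config("spark.sql.catalog.spark_catalog.s3.analytics-accelerator.enabled", "true")
    .getOrCreate();
```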
To learn more about how to set the rest of the configurations, read our configuration documents.
Analytics Accelerator Library for Amazon S3 accelerates read performance of objects stored in Amazon S3 by integrating the AWS Common Runtime (CRT) libraries and implementing optimizations specific to Apache Parquet files. The AWS CRT is a software library built for interacting with AWS services that implements best-practice performance design patterns, including timeouts, retries, and automatic request parallelization for high throughput.
You can use the `S3SeekableInputStreamFactory` to initialize streams for all file types and benefit from read optimizations on top of the benefits coming from the CRT. These optimizations are:
- Sequential prefetching - The library detects sequential read patterns to prefetch data and reduce latency, and reads the full object when the object is small to minimize the number of read operations.
- Small object prefetching - The library will prefetch the object if the object size is less than 3MB.
When the object key ends with the file extension `.parquet` or `.par`, the library uses the following Apache Parquet-specific optimizations:
- Parquet footer caching - The library reads the tail of the object with a configurable size (1 MB by default) as soon as a stream to a Parquet object is opened and caches it in memory. This prevents the multiple small GET requests that otherwise occur at the tail of the file for the Parquet metadata, `pageIndex`, and bloom filter structures.
- Predictive column prefetching - The library tracks recently read columns using the Parquet metadata. When subsequent Parquet files containing these columns are opened, the library prefetches them. For example, if columns `x` and `y` are read from `A.parquet`, and `B.parquet` is then opened and also contains columns named `x` and `y`, the library will prefetch them asynchronously.
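Since the column-access history appears to be held by the factory (closing it frees prefetch caches), predictive prefetching can only carry over between files opened through the same factory instance. A minimal sketch, with hypothetical bucket and key names:

```java
// Reuse a single factory so column history from A.parquet carries over.
try (S3SeekableInputStream a = s3SeekableInputStreamFactory.createStream(
        S3URI.of("my-bucket", "data/A.parquet"))) {
    // A Parquet reader consumes columns x and y here...
}

try (S3SeekableInputStream b = s3SeekableInputStreamFactory.createStream(
        S3URI.of("my-bucket", "data/B.parquet"))) {
    // ...so when B.parquet opens, columns x and y are prefetched asynchronously.
}
```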
The current benchmarking results are provided for reference only. It is important to note that the performance of these queries can be affected by a variety of factors, including compute and storage variability, cluster configuration, and compute choice. All of the results presented have a margin of error of up to 3%.
To establish the performance impact of changes, we rely on a benchmark derived from the industry-standard TPC-DS benchmark at a 3 TB scale. It is important to note that our TPC-DS derived benchmark results are not directly comparable with official TPC-DS benchmark results. We also found that the sizing of Apache Parquet files and the partitioning of the dataset have a substantive impact on workload performance. As a result, we have created several versions of the test dataset with a focus on different object sizes, ranging from single MiBs to tens of GiBs, as well as various partitioning approaches.
On S3A, we have observed a total suite execution acceleration between 10% and 27%, with some queries showing a speed-up of up to 40%.
Known issue: We are currently observing a regression of up to 8% on queries similar to Q44, described in issue 173. We have determined the root cause of this issue to be a data over-read due to overly eager columnar prefetching when all of the following are true:
- The query is filtering on dictionary encoded columns.
- The query is selective, and most objects do not contain the required data.
- The query operates on a multi-GB dataset.
We are actively working on this issue. You can track the progress on the issue 173 page.
The remaining TPC-DS queries show no regressions within the specified margin of error.
We welcome contributions to Analytics Accelerator Library for Amazon S3! See the contributing guidelines for more information on how to report bugs, build from source code, or submit pull requests.
If you discover a potential security issue in this project we ask that you notify Amazon Web Services (AWS) Security via our vulnerability reporting page. Do not create a public GitHub issue.
Analytics Accelerator Library for Amazon S3 is licensed under the Apache-2.0 license. The pull request template will ask you to confirm the licensing of your contribution and to agree to the Developer Certificate of Origin (DCO).