
Analytics Accelerator Library for Amazon S3

The Analytics Accelerator Library for Amazon S3 helps you accelerate access to Amazon S3 data from your applications. This open-source solution reduces processing times and compute costs for your data analytics workloads.

With this library, you can:

  • Lower processing times and compute costs for data analytics workloads.
  • Implement S3 best practices for performance.
  • Utilize optimizations specific to Apache Parquet files, such as prefetching metadata located in the footer of the object and predictive column prefetching.
  • Improve the price performance for your data analytics applications, such as workloads based on Apache Spark.

Current Status

The Analytics Accelerator Library for Amazon S3 has been tested and integrated with the Apache Hadoop S3A client, and is currently in the process of being integrated into the Iceberg S3FileIO client. It is also tested with datasets stored in S3 Table Buckets.

We're constantly working on improving Analytics Accelerator Library for Amazon S3 and are interested in hearing your feedback on features, performance, and compatibility. Please send feedback by opening a GitHub issue.

Getting Started

You can use Analytics Accelerator Library for Amazon S3 either as a standalone library or with the Hadoop library and the Spark engine. On Spark, it also supports the Iceberg open table format and S3 Table Buckets. Integrations with Hadoop and Iceberg are turned off by default; we describe how to enable them below.

Standalone Usage

To get started, import the library dependency from Maven into your project:

    <dependency>
      <groupId>software.amazon.s3.analyticsaccelerator</groupId>
      <artifactId>analyticsaccelerator-s3</artifactId>
      <version>1.0.0</version>
      <scope>compile</scope>
    </dependency>

Then, initialize the library's S3SeekableInputStreamFactory:

// Use a high concurrency limit so the library can issue parallel prefetch requests.
S3AsyncClient crtClient = S3CrtAsyncClient.builder().maxConcurrency(600).build();
S3SeekableInputStreamFactory s3SeekableInputStreamFactory = new S3SeekableInputStreamFactory(
        new S3SdkObjectClient(crtClient), S3SeekableInputStreamConfiguration.DEFAULT);

Note: The S3SeekableInputStreamFactory can be initialized with either the S3AsyncClient or the S3 CRT client. We recommend the S3 CRT client because of its enhanced connection pool management and higher download throughput. For either client, we recommend initializing with a higher concurrency value to fully benefit from the library's optimizations, because the library makes multiple parallel requests to S3 to prefetch data asynchronously. For the Java S3AsyncClient, you can increase the maximum connections as follows:

NettyNioAsyncHttpClient.Builder httpClientBuilder =
        NettyNioAsyncHttpClient.builder()
                .maxConcurrency(600);

S3AsyncClient s3AsyncClient = S3AsyncClient.builder().httpClientBuilder(httpClientBuilder).build();

To open a stream:

S3SeekableInputStream s3SeekableInputStream = s3SeekableInputStreamFactory.createStream(S3URI.of(bucket, key));

For more details on the usage of this stream, refer to the SeekableInputStream interface.
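
As a quick illustration, here is a minimal sketch of seeking and reading; it assumes the stream follows the standard Java InputStream read contract plus a seek(long) method, and omits error handling:

// Minimal usage sketch; assumes the standard InputStream read contract
// plus a seek(long) method, with error handling omitted.
byte[] buffer = new byte[4096];
s3SeekableInputStream.seek(1024);
int bytesRead = s3SeekableInputStream.read(buffer, 0, buffer.length);
s3SeekableInputStream.close();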

When the S3SeekableInputStreamFactory is no longer required to create new streams, close it to free the resources (for example, caches for prefetched data) held by the factory.

s3SeekableInputStreamFactory.close();
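
Because the factory exposes close(), you can also scope its lifetime with try-with-resources. The sketch below assumes S3SeekableInputStreamFactory implements AutoCloseable:

// Sketch only; assumes S3SeekableInputStreamFactory implements AutoCloseable.
try (S3SeekableInputStreamFactory factory = new S3SeekableInputStreamFactory(
        new S3SdkObjectClient(crtClient), S3SeekableInputStreamConfiguration.DEFAULT)) {
    S3SeekableInputStream stream = factory.createStream(S3URI.of(bucket, key));
    // ... read from the stream ...
}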

Using with Hadoop

If you are using Analytics Accelerator Library for Amazon S3 with Hadoop, you need to set the stream type to analytics in the Hadoop configuration. An example configuration is as follows:

<property>
  <name>fs.s3a.input.stream.type</name>
  <value>analytics</value>
</property>

For more information, see the Hadoop documentation on the Analytics stream type.
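
If you build the Hadoop configuration programmatically rather than through XML, the equivalent is a single entry set through the standard org.apache.hadoop.conf.Configuration API; a minimal sketch:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Route S3A reads through the analytics stream type.
conf.set("fs.s3a.input.stream.type", "analytics");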

Using with Spark (without Iceberg)

Using the Analytics Accelerator Library for Amazon S3 with Spark is similar to using it with Hadoop. The only difference is that you prepend the property names with spark.hadoop. For example, to enable the library on the Spark engine, set the following property in your Spark configuration.

<property>
  <name>spark.hadoop.fs.s3a.input.stream.type</name>
  <value>analytics</value>
</property>
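
The same property can be set programmatically when you construct the Spark session; a minimal sketch using Spark's SparkConf API:

import org.apache.spark.SparkConf;

SparkConf sparkConf = new SparkConf()
        // The spark.hadoop. prefix forwards the key to the Hadoop configuration.
        .set("spark.hadoop.fs.s3a.input.stream.type", "analytics");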

Using with Spark (with Iceberg)

When your data in S3 is organized in the Iceberg open table format, it is retrieved through the Iceberg S3FileIO client instead of the Hadoop S3A client. Analytics Accelerator Library for Amazon S3 is currently being integrated into the Iceberg S3FileIO client. To enable it in the Spark engine, set the following Spark property.

<property>
  <name>spark.sql.catalog.<CATALOG_NAME>.s3.analytics-accelerator.enabled</name>
  <value>true</value>
</property>

For S3 General Purpose Buckets and S3 Directory Buckets, set <CATALOG_NAME> to spark_catalog, the default catalog.

S3 Table Buckets require you to set a custom catalog name, as outlined here. Once you set the catalog, you can replace the <CATALOG_NAME> parameter with your chosen name.
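
As with the S3A property, this catalog setting can also be applied programmatically. A minimal sketch using the default spark_catalog; substitute your own catalog name for S3 Table Buckets:

import org.apache.spark.SparkConf;

SparkConf icebergConf = new SparkConf()
        // Enable the library for the Iceberg S3FileIO client on the default catalog.
        .set("spark.sql.catalog.spark_catalog.s3.analytics-accelerator.enabled", "true");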

To learn more about how to set the rest of the configuration options, read our configuration documents.

Summary of Optimizations

Analytics Accelerator Library for Amazon S3 accelerates read performance for objects stored in Amazon S3 by integrating the AWS Common Runtime (CRT) libraries and implementing optimizations specific to Apache Parquet files. The AWS CRT is a software library built for interacting with AWS services that implements best-practice performance design patterns, including timeouts, retries, and automatic request parallelization for high throughput. You can use the S3SeekableInputStreamFactory to initialize streams for all file types and benefit from read optimizations on top of the benefits the CRT provides.

These optimizations are:

  • Sequential prefetching - The library detects sequential read patterns to prefetch data and reduce latency, and reads the full object when the object is small to minimize the number of read operations.
  • Small object prefetching - The library will prefetch the object if the object size is less than 3MB.

When the object key ends with the file extension .parquet or .par, we apply the following Apache Parquet-specific optimizations:

  • Parquet footer caching - As soon as a stream to a Parquet object is opened, the library reads a configurable amount of data from the tail of the object (1 MB by default) and caches it in memory. This prevents the multiple small GET requests that would otherwise be made at the tail of the file for the Parquet metadata, pageIndex, and bloom filter structures.
  • Predictive column prefetching - The library uses Parquet metadata to track which columns have been read recently. When a subsequent Parquet file containing those columns is opened, the library prefetches them. For example, if columns x and y are read from A.parquet, and B.parquet is then opened and also contains columns named x and y, the library will prefetch them asynchronously.

Benchmark Results

Benchmarking Results -- November 25, 2024

The current benchmarking results are provided for reference only. It is important to note that the performance of these queries can be affected by a variety of factors, including compute and storage variability, cluster configuration, and compute choice. All of the results presented have a margin of error of up to 3%.

To establish the performance impact of changes, we rely on a benchmark derived from the industry-standard TPC-DS benchmark at a 3 TB scale. It is important to note that our TPC-DS-derived benchmark results are not directly comparable with official TPC-DS benchmark results. We also found that the sizing of Apache Parquet files and the partitioning of the dataset have a substantive impact on workload performance. As a result, we have created several versions of the test dataset with a focus on different object sizes, ranging from single MiBs to tens of GiBs, as well as various partitioning approaches.

On S3A, we have observed a total suite execution acceleration between 10% and 27%, with some queries showing a speed-up of up to 40%.

Known issue: We are currently observing a regression of up to 8% on queries similar to Q44 in issue 173. We have determined the root cause to be a data over-read caused by overly eager columnar prefetching when all of the following are true:

  1. The query is filtering on dictionary encoded columns.
  2. The query is selective, and most objects do not contain the required data.
  3. The query operates on a multi-GB dataset.

We are actively working on this issue. You can track the progress in the issue 173 page.

The remaining TPC-DS queries show no regressions within the specified margin of error.

Contributions

We welcome contributions to Analytics Accelerator Library for Amazon S3! See the contributing guidelines for more information on how to report bugs, build from source code, or submit pull requests.

Security

If you discover a potential security issue in this project we ask that you notify Amazon Web Services (AWS) Security via our vulnerability reporting page. Do not create a public GitHub issue.

License

Analytics Accelerator Library for Amazon S3 is licensed under the Apache-2.0 license. The pull request template will ask you to confirm the licensing of your contribution and to agree to the Developer Certificate of Origin (DCO).
