S3 Table Buckets, One Month Later




[This article was co-written with Nicola Corda and first appeared on the author’s Substack, select * from]

At the last re:Invent conference, AWS announced the new S3 Table Buckets, powered by Apache Iceberg (or just Iceberg). The promises were impressive (3x performance compared to self-managed Iceberg tables, no maintenance, world peace) and many people started talking about them.

Now, a month after the announcement, with the marketing dust settled, we think we can properly take a look at what S3 Table Buckets are, what they deliver, and what the proper use cases are. Also, was this just a rushed move to show something after Databricks’ acquisition of Tabular?

S3 Tables: what are they and what do they do?

Let’s start with what an S3 Table Bucket is:

  • A special type of S3 Bucket, where you store tables and not files. There are files, but you don’t access them directly.
  • The tables are Iceberg tables associated with an AWS-maintained Iceberg catalog.
  • These Iceberg tables in the table buckets are also automatically maintained by AWS (at a cost).
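
To make the "tables, not files" point concrete, here is a minimal sketch of creating a table bucket, a namespace and an Iceberg table with boto3. Treat it as a sketch: the method and parameter names reflect our reading of the s3tables API, and the bucket, namespace and table names are made up.

```python
# Minimal sketch: create a table bucket, a namespace and an Iceberg table.
# Method and parameter names follow our reading of the s3tables API;
# double-check them against the boto3 reference before relying on this.
import boto3

s3tables = boto3.client("s3tables", region_name="eu-west-1")

# The bucket itself: you get back an ARN, not a bucket you browse file by file.
bucket = s3tables.create_table_bucket(name="analytics-tables")
bucket_arn = bucket["arn"]

# A namespace groups tables inside the bucket (roughly a database/schema).
s3tables.create_namespace(tableBucketARN=bucket_arn, namespace=["sales"])

# The table is Iceberg by definition; the managed catalog keeps track of its metadata.
s3tables.create_table(
    tableBucketARN=bucket_arn,
    namespace="sales",
    name="orders",
    format="ICEBERG",
)
```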

What else AWS promised:

  • Performance improvement compared to self-managed Iceberg tables
  • Less maintenance, thanks to auto-compaction
  • Seamless integration (at least in the AWS world)
  • Security built-in

Performance and Compaction

From what we could see in this AWS blog post, the performance gains promised by the AWS marketing team were just the result of data compaction.

Everybody using Iceberg is (or should be) familiar with the optimize and vacuum operations, and has had to figure out how and when to compact new data in their data lake. AWS S3 Tables brings the convenience of not worrying about this, but at some additional cost.

In case you want to keep some flexibility over how compaction runs, here are a few things you should know:

  • Compaction (optimize) and Unreferenced file removal (vacuum) are enabled by default
  • Compaction (optimize) is configured at table level (not bucket).
  • Unreferenced file removal (vacuum) is configured at bucket level (not table).
  • Compaction runs when there are too many small files. Reducing the target file size (configurable between 64 and 512 MB) will make compaction run more often.
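
If you do want to keep a hand on these knobs, the settings are exposed through the S3 Tables API. Below is a hedged boto3 sketch: compaction and its target file size are set per table, unreferenced file removal per bucket. The type strings and settings keys are our best reading of the API at the time of writing, so verify them against the documentation before using this.

```python
# Sketch: adjusting the managed maintenance jobs via boto3.
# The type strings and settings keys reflect our reading of the S3 Tables API
# and may need adjusting; ARN, namespace and table names are placeholders.
import boto3

s3tables = boto3.client("s3tables")
bucket_arn = "arn:aws:s3tables:eu-west-1:123456789012:bucket/analytics-tables"

# Compaction (optimize) is configured per table; a smaller target file size
# means compaction kicks in more often.
s3tables.put_table_maintenance_configuration(
    tableBucketARN=bucket_arn,
    namespace="sales",
    name="orders",
    type="icebergCompaction",
    value={
        "status": "enabled",
        "settings": {"icebergCompaction": {"targetFileSizeMB": 128}},
    },
)

# Unreferenced file removal (vacuum) is configured per bucket.
s3tables.put_table_bucket_maintenance_configuration(
    tableBucketARN=bucket_arn,
    type="icebergUnreferencedFileRemoval",
    value={
        "status": "enabled",
        "settings": {
            "icebergUnreferencedFileRemoval": {"unreferencedDays": 3, "nonCurrentDays": 10}
        },
    },
)
```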

Integrations

Our expectations

Reading the announcement, we expected S3 Table Buckets to work like this:

  1. We create a new S3 Table Bucket.
  2. Everything we create in this special bucket is an Iceberg table, but we don’t need to worry about any Iceberg Catalog, a managed catalog will take care of it.
  3. We can write and query the new table in the table bucket using any AWS service, like Redshift, Athena, EMR and so on.

Integration hiccups with non-AWS services (Snowflake, Databricks, Trino) were to be expected.

How it actually works

Here is the reality of what you can actually do with S3 Table Buckets:

  • From what we gathered from the documentation, only Spark and Firehose can create and write tables in S3 Table Buckets. The documentation is not clear, but by passing the Spark configuration to a Glue job it should be possible to use AWS Glue to manage and use these tables (a PySpark configuration sketch follows this list). This means that you can probably also use Databricks for this (but there you have other options to write an Iceberg table).
  • Services like AWS Athena or Redshift cannot create Iceberg tables in those S3 Table Buckets.
  • Reading from those tables via Athena requires registering the managed Iceberg catalog behind an S3 Table Bucket in our Athena configuration. Athena can read from and write into existing tables, but it cannot create new tables or alter existing ones.
  • The Redshift integration requires the creation of a resource link; after that, the resource link name will appear as an additional database that we can use to read (and only read) data from the S3 Table Buckets.
  • Most of the other query engines that support Iceberg, like Trino, or libraries like PyIceberg, cannot read S3 Tables. Or at least not yet.
  • If you take a peek at the magic behind an S3 Table Bucket, you can find the actual location of the Iceberg metadata.json file and use DuckDB to read from an S3 Table (credits to Damon Cortesi for unveiling the magic). A rough DuckDB sketch follows below.
  • The AWS Console has very limited support for S3 Tables: the console can only list the created tables, but it cannot inspect their structure (like columns or partitions). Everything else must be done via the AWS CLI or SDK: e.g. creating a namespace, creating a table, and so on.
  • Finally, you can use the AWS CLI to fully operate the S3 Table Buckets.
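
As promised above, this is roughly what "passing the Spark configuration" looks like from PySpark. It follows the configuration shown in AWS's own examples, but treat the package versions, the catalog name (s3tablesbucket) and the ARN as placeholders to adapt, not as gospel.

```python
# Sketch: registering an S3 Table Bucket as an Iceberg catalog in Spark.
# Package versions, catalog name and ARN are placeholders taken from AWS examples.
from pyspark.sql import SparkSession

table_bucket_arn = "arn:aws:s3tables:eu-west-1:123456789012:bucket/analytics-tables"

spark = (
    SparkSession.builder.appName("s3-tables-demo")
    # The S3 Tables catalog implementation ships as a separate runtime jar.
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,"
        "software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3",
    )
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.s3tablesbucket", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tablesbucket.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tablesbucket.warehouse", table_bucket_arn)
    .getOrCreate()
)

# Once the catalog is registered, the table bucket behaves like any other Iceberg catalog.
spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS s3tablesbucket.sales.orders
    (order_id BIGINT, amount DOUBLE) USING iceberg
""")
spark.sql("SELECT COUNT(*) FROM s3tablesbucket.sales.orders").show()
```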

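And here is a rough sketch of the DuckDB trick mentioned in the list above. It assumes you have already dug out the S3 path of the table's current metadata.json (the path below is a placeholder, not a real layout), and it relies on DuckDB's iceberg and httpfs extensions; whether AWS keeps this door open is anyone's guess.

```python
# Sketch: reading an S3 Table from DuckDB by pointing iceberg_scan at the
# table's current metadata.json. The S3 path is a placeholder; the real one is
# internal to the table bucket and you have to locate it yourself.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("INSTALL aws; LOAD aws;")
# Reuse the ambient AWS credentials (env vars, profile, instance role, ...).
con.execute("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")

metadata_json = "s3://<internal-table-bucket>/<table-id>/metadata/00001-xxxx.metadata.json"
print(con.execute(f"SELECT * FROM iceberg_scan('{metadata_json}') LIMIT 10").fetchall())
```
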
The question for the answer “it depends”

It’s been more than a month since AWS unveiled the S3 Table Buckets: should we actually use them? And if they are not for us, who are they best suited for?

At this point, it is quite hard to say who should use the S3 Table Buckets.

Let’s break down a few scenarios:

  1. Teams With Limited Engineering Resources
    If you’ve outgrown your SQL database but lack deep data-engineering expertise, you might see S3 Tables as a simplified version of Iceberg. However, keep in mind that beyond EMR and Spark, engine support is still limited, which could hinder more complex use cases.
  2. Mature Data Engineering Teams
    If you already have robust Iceberg best practices in place, you’ll probably find that S3 Tables don’t offer enough flexibility. The extra cost and limited integration may not justify the switch—especially if you’re perfectly happy managing your own Iceberg catalog and query engines.
  3. AWS-First Data Teams, Ready for a Data Lake
    If you’re transitioning from a data warehouse to a data lake on AWS and you’re not well-versed in Iceberg, S3 Tables could be an easier on-ramp. In time, AWS will likely expand engine compatibility (beyond Athena and Redshift), which might make this option more appealing. Organizations already invested in alternatives like Snowflake may find it simpler to stick with what they know rather than adopting S3 Tables.
  4. Quick Proof-of-Concept Projects
    External consultants looking to showcase Iceberg to clients – or teams that just want to do a rapid prototype – might see S3 Tables as a handy shortcut. You can spin up a data lake in EMR, then query it via Athena or Redshift. Will it work? Probably. Is it a long-term solution? Maybe not.

Conclusion

A month ago, the announcement of the S3 Table Buckets generated a lot of buzz in the data community. After the acquisition of Tabular by Databricks, Iceberg became a hot topic (😅), and it feels like AWS needed to do something to steer the conversation back into their own court.

Unfortunately, AWS’s implementation almost completely removed the openness that allowed Iceberg to win out over Databricks’ Delta. Even within the AWS ecosystem, the integrations seem pretty limited.

We are sure that will change over time but, speaking of time, wouldn’t AWS have been better off keeping a closer eye on the table format battle and introducing S3 Tables three or four years ago?
