Amazon S3 is a popular object storage service provided by Amazon Web Services (AWS). It can store a vast amount of data. In one of my real-world projects, I needed to list a large number of objects (millions) in an S3 bucket. This can be slow because the S3 API returns at most 1,000 object entries per request, so an application has to make multiple requests to get all the objects.
Issue ¶
To demonstrate the problem, I wrote a simple Go program that creates 25,000 empty objects in an S3 bucket. The name of each object is a random UUIDv4. You can see the code here.
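The gist of that program looks roughly like this (a minimal sketch; it assumes the github.com/google/uuid package for key generation, the linked code is the reference):
// Sketch: create 25,000 empty objects whose keys are random UUIDv4s.
// Assumes the github.com/google/uuid package.
for i := 0; i < 25000; i++ {
    _, err := s3Client.PutObject(context.Background(), &s3.PutObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(uuid.NewString()), // no body -> empty object
    })
    if err != nil {
        log.Fatalf("failed to put object: %v", err)
    }
}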
To list objects in a bucket, the S3 API provides the ListObjectsV2 endpoint. In the AWS SDK for Go v2, the ListObjectsV2 operation calls this endpoint.
As mentioned before, this endpoint returns a maximum of 1,000 object entries per request. If there are more objects to list, the response contains a continuation token that can be used to retrieve the next set of objects. So an application has to keep sending requests until the response does not contain a continuation token anymore. Pagination is a common pattern in the AWS API, and a lot of services use it to efficiently handle large sets of data.
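Handled by hand, such a pagination loop looks roughly like this (a sketch; aws.ToBool safely dereferences the IsTruncated pointer):
// Sketch: manual pagination with the continuation token.
var continuationToken *string
for {
    output, err := s3Client.ListObjectsV2(context.Background(), &s3.ListObjectsV2Input{
        Bucket:            aws.String(bucket),
        ContinuationToken: continuationToken,
    })
    if err != nil {
        log.Fatalf("failed to list objects: %v", err)
    }
    // process output.Contents ...
    if !aws.ToBool(output.IsTruncated) {
        break // no more pages
    }
    continuationToken = output.NextContinuationToken
}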
The SDK for Go provides a convenient way to handle the pagination: the ListObjectsV2Paginator. This paginator manages the continuation token automatically; an application only needs to iterate over the results with the HasMorePages and NextPage methods. Under the hood, the paginator keeps sending requests until all objects are listed.
s3Client := s3.NewFromConfig(cfg)
bucket := "rasc-test-list"

input := &s3.ListObjectsV2Input{
    Bucket: aws.String(bucket),
}
paginator := s3.NewListObjectsV2Paginator(s3Client, input)

startTime := time.Now()
var objectCount int
for paginator.HasMorePages() {
    page, err := paginator.NextPage(context.Background())
    if err != nil {
        log.Fatalf("failed to get page: %v", err)
    }
    objectCount += len(page.Contents)
}

fmt.Printf("Total objects found: %d\n", objectCount)
fmt.Printf("List operation completed in: %v\n", time.Since(startTime))
If I run this code on my machine, it takes about 11 seconds to list all the objects. If you have millions of objects, even with a fast internet connection, listing them all can take a long time.
Faster ¶
The question is: how can we make this faster? As mentioned above, the object names are random UUIDv4s, which means each name starts with a hexadecimal character, a digit from 0 to 9 or a letter from a to f. It turns out that the ListObjectsV2 endpoint supports a Prefix parameter that filters objects by key. If you specify a prefix, the list operation only returns objects whose keys start with that prefix.
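For example, an input that matches only the objects whose keys start with "a" looks like this:
input := &s3.ListObjectsV2Input{
    Bucket: aws.String(bucket),
    Prefix: aws.String("a"), // only keys starting with "a" are returned
}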
We can use this to our advantage: start 16 goroutines, one per prefix, and let them run in parallel. Each goroutine lists the objects that start with its prefix and sends the result to a channel. Once all goroutines have finished, the application processes the results. This example program is only interested in the number of objects found, so it simply counts them, but you can easily modify it to return the object entries as well.
type result struct {
    prefix string
    count  int
    error  error
}

func main() {
    cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithSharedConfigProfile("home"))
    if err != nil {
        log.Fatalf("failed to load config: %v", err)
    }
    s3Client := s3.NewFromConfig(cfg)
    bucket := "rasc-test-list"

    // One single-character prefix for each hexadecimal character of a UUIDv4
    prefixes := []string{"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f"}

    startTime := time.Now()
    resultChan := make(chan result, len(prefixes))
    var wg sync.WaitGroup

    // One goroutine per prefix; each sends its result to the channel
    for _, prefix := range prefixes {
        wg.Add(1)
        go func(prefix string) {
            defer wg.Done()
            count, err := listObjectsWithPrefix(context.Background(), s3Client, bucket, prefix)
            resultChan <- result{prefix: prefix, count: count, error: err}
        }(prefix)
    }

    wg.Wait()
    close(resultChan)

    totalObjects := 0
    for result := range resultChan {
        if result.error == nil {
            totalObjects += result.count
        } else {
            fmt.Printf("prefix %s: ERROR - %v\n", result.prefix, result.error)
        }
    }

    elapsed := time.Since(startTime)
    fmt.Printf("Total objects found: %d\n", totalObjects)
    fmt.Printf("List operation completed in: %v\n", elapsed)
}
func listObjectsWithPrefix(ctx context.Context, s3Client *s3.Client, bucket, prefix string) (int, error) {
    input := &s3.ListObjectsV2Input{
        Bucket: aws.String(bucket),
        Prefix: aws.String(prefix),
    }
    paginator := s3.NewListObjectsV2Paginator(s3Client, input)

    var objectCount int
    for paginator.HasMorePages() {
        page, err := paginator.NextPage(ctx)
        if err != nil {
            return 0, err
        }
        objectCount += len(page.Contents)
    }
    return objectCount, nil
}
This code is now significantly faster. On my machine, it takes about 3 to 4 seconds to list all the objects.
Splitting the list requests by prefix obviously works only if the prefixes are known in advance and the object names are not clustered around a small set of prefixes. If 90% of the objects start with the same prefix, this approach will not help much. In that case, you could try longer prefixes; a prefix does not have to be a single character, it can be a string of any length.
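For example, a hypothetical helper that generates all 256 two-character hexadecimal prefixes could look like this:
// Hypothetical helper: generate all two-character hexadecimal
// prefixes ("00" .. "ff"), 256 in total.
func twoCharHexPrefixes() []string {
    const hexChars = "0123456789abcdef"
    prefixes := make([]string, 0, len(hexChars)*len(hexChars))
    for _, a := range hexChars {
        for _, b := range hexChars {
            prefixes = append(prefixes, string(a)+string(b))
        }
    }
    return prefixes
}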
Fortunately, in my real-world project and in this example project, the names are random UUIDv4s, so we can assume that the prefixes are evenly distributed.
Generic ¶
The following code is an attempt to make this solution more generic. The fastListS3Objects function takes a list of prefixes and the number of goroutines to use. It starts as many goroutines as there are prefixes or the specified number, whichever is smaller. Like the previous example, this code is only interested in the number of objects.
type result struct {
    prefix string
    count  int
    error  error
}

func fastListS3Objects(ctx context.Context, s3Client *s3.Client, bucket string, prefixes []string, numGoroutines int) (int, time.Duration, error) {
    startTime := time.Now()
    maxWorkers := min(len(prefixes), numGoroutines)

    resultChan := make(chan result, len(prefixes))
    workChan := make(chan string, len(prefixes))
    var wg sync.WaitGroup

    // Fill the work queue with all prefixes up front
    for _, prefix := range prefixes {
        workChan <- prefix
    }
    close(workChan)

    // Start the workers; each one takes prefixes from the queue until it is empty
    for range maxWorkers {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for prefix := range workChan {
                count, err := listObjectsWithPrefix(ctx, s3Client, bucket, prefix)
                resultChan <- result{prefix: prefix, count: count, error: err}
            }
        }()
    }

    wg.Wait()
    close(resultChan)

    totalObjects := 0
    var firstError error
    for result := range resultChan {
        if result.error == nil {
            totalObjects += result.count
        } else {
            fmt.Printf("prefix %s: ERROR - %v\n", result.prefix, result.error)
            if firstError == nil {
                firstError = result.error
            }
        }
    }

    elapsed := time.Since(startTime)
    return totalObjects, elapsed, firstError
}
The listObjectsWithPrefix function is unchanged from the previous example.
This function can be used like this. This example uses the same prefixes as before but starts only 8 goroutines.
s3Client := s3.NewFromConfig(cfg)
bucket := "rasc-test-list"

prefixes := []string{"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f"}
numGoroutines := 8

totalObjects, elapsed, err := fastListS3Objects(context.Background(), s3Client, bucket, prefixes, numGoroutines)
if err != nil {
    log.Fatalf("failed to list objects: %v", err)
}
fmt.Printf("Total objects found: %d\n", totalObjects)
fmt.Printf("List operation completed in: %v\n", elapsed)
In my tests, this was surprisingly almost as fast as the previous example with 16 goroutines. I am not sure why; my internet connection is not that fast, and perhaps sending 16 requests at the same time is more than it can handle efficiently. It is definitely worth experimenting with the number of goroutines to find the optimal value for your use case.
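A quick way to experiment is to run the same listing with different worker counts and compare the timings. A sketch, reusing the fastListS3Objects function from above:
// Try different numbers of goroutines and compare the timings
for _, n := range []int{2, 4, 8, 16} {
    totalObjects, elapsed, err := fastListS3Objects(context.Background(), s3Client, bucket, prefixes, n)
    if err != nil {
        log.Fatalf("failed to list objects: %v", err)
    }
    fmt.Printf("%2d goroutines: %d objects in %v\n", n, totalObjects, elapsed)
}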
Conclusion ¶
This article showed how to list a large number of objects in an S3 bucket quickly. The Prefix parameter of the list request allows an application to start multiple goroutines and call the ListObjectsV2 endpoint in parallel, which can significantly reduce the time it takes to list all objects in a bucket.
As already mentioned, this approach only works if the prefixes are known in advance and the object names can be split into multiple prefixes. If most object names start with the same prefix, it will not help much. On the other hand, if the prefixes of the object names are evenly distributed, it can be very effective.