lakeFS Spark Client Enterprise v0.16.0¶

Changelog¶

Improved credentials handling for compatibility with Hadoop 3.2.x/3.3.x

v0.15.0 - 2025-07-21¶

GC: Add support for GCP

v0.14.3 - 2025-05-28¶

Metaclient: Skip unknown files when scanning metadata prefix

v0.14.2 - 2025-04-30¶

Spark Client StorageID support

v0.14.1 - 2024-07-04¶

When scanning across all commits, correctly handle lakeFS storage namespaces that do not end in /

v0.14.0 - 2024-06-20¶

Expose user metadata for commit

v0.13.0 - 2024-01-21¶

Shade scalapb for use in current DataBricks LTS releases.

This is a minor release: while it adds no new functionality, it changes how users can use the Spark metadata client with DataBricks (and possibly other managed Spark providers).

v0.12.0 - 2023-12-21¶

Metaclient: Skip dummy file when scanning metadata prefix

v0.11.0 - 2023-10-23¶

Bump lakeFS SDK to v1.0
Breaking: drop support for lakeFS server < v0.108.0

v0.10.0 - 2023-08-15¶

Bug fixes and improvements.
Seperated flows of committs and uncommitted GC are no longer supported. Use the new GC job.

v0.9.1 - 2023-07-31¶

Fix bug when GC runs into the initial commit repository.
Fix bug when GC runs on a bucket outside the provided region.

v0.9.0 - 2023-07-23¶

A new an improved Garbage Collection job. See the docs for more information.

v0.8.1 - 2023-05-22¶

Bug fixes: * Read previous run ID in case of a sweep-only run * Perform run ID logging only after sweep completed successfully

v0.8.0 - 2023-05-18¶

What's new: * Incremental committed GC implementation - supported starting from lakeFS 0.100.0

v0.7.3 - 2023-05-07¶

Bug fixes: * Fix uncommitted gc to handle no uncommitted location

v0.7.2 - 2023-05-02¶

Bug fixes: * Fix uncommitted garbage collection bulk remove out of disk space

v0.7.1 - 2023-04-24¶

=== Performance improvements===

No user-visible parts inside, but some parameters...

Deletion now retries S3 deleteObjects, a lot. Parameters lakefs.gc.s3.min_backoff_secs, lakefs.gc.s3.max_backoff_secs control how long it will try.

v0.7.0 - 2023-03-13¶

=== Performance improvements===

No user-visible parts inside, but some parameters...

Write expired addresses to fewer locations
Only write text-format expired addresses if new Hadoop config option lakefs.gc.address.write_as_text is set.
- Remove one (unused) expired address location.
Handle huge metaranges: New Hadoop config option lakefs.gc.address.approx_num_ranges_to_spread_per_partition can be set when there are very many ranges. Values 20..100 are probably best.
For debugging performance

Guaranteed to produce incorrect results! You probably never want to set this in production or on any repository you care about. * New Hadoop config option lakefs.debug.gc.addresses_sample_fraction can be set below 1.0 to debug performance only.

v0.6.5 - 2023-03-14¶

Bug fix: * UGC fix uncommitted exists change * UGC fix list mark addresses

v0.6.4 - 2023-03-13¶

Bug fix: * UGC handle no uncommitted data on repository

v0.6.3 - 2023-03-09¶

Bug fix: * UGC repartition by addresses to handle large repositories * UGC use task context to delete temporary files * UGC copy metadata to local without crc files

v0.6.2 - 2023-02-23¶

Bug fix: * Add exponential backoff retry to the S3 client

v0.6.1 - 2023-01-30¶

What's new: * Upgrade lakeFS client to v0.91.0 * Add UGC cutoff time to the report

v0.6.0 - 2022-12-14¶

What's new: * Beta feature: Uncommitted garbage collection. Documentation

Bug fixes: * Correctly write GC JSON summary object using UTF-8

v0.5.2 - 2022-11-29¶

Bug fixes: * Identify the region of the S3 bucket if it's not reachable and update the initialized S3 client using that region.

v0.5.1 - 2022-10-20¶

Bug fixes: * Make GC backup and restore support expired addresses list including object not in the underlying object store * Don't package with hadoop-aws. This removes many dependency failures and simplifies configuration. But it also means that for plain Spark distributions such as that provided when downloading from the Apache Spark homepage you will need to add --packages org.apache.hadoop:hadoop-aws:2.7.7 or --packages org.apache.hadoop:hadoop-aws:3.2.1 or similar, to add in this package.

v0.5.0 - 2022-10-06¶

What's new: * A utility for GC backup and restore. It allows users to copy objects that GC plans to delete or restore objects from a previously created backup

v0.4.0 - 2022-09-30¶

What's new: * Separate GC into a mark and sweep parts and add configuration parameters to control what phases to run

Bug fixes: * Fix the failure to write an empty dataframe into GC reports when running in mark-only mode * Only clean up relative path names

v0.3.0 - 2022-09-21¶

What's new: - Add retries mechanism - Improve performance

v0.2.3 - 2022-09-11¶

Performance improvements
Fix bug: parsing problems in Azure

v0.2.2 - 2022-08-29¶

What's new: - Added custom lakeFS client read timeout configuration - Rename custom lakeFS client timeout configuration keys

Bug fixes: - [GC] re-use http clients to limit the number of open connections and fix a resource leak

v0.2.1 - 2022-08-18¶

Bug fixes: - [GC] Added configuration flag of lakeFS client timeout to spark-submit command

v0.2.0 - 2022-08-01¶

What's new: - Garbage Collection on Azure

Bug fixes: - [GC] Respect Hadoop AWS access key configuration in S3Client - exit GC in case that no GC rules are configured for a repo