Docs: Fix missing options for remove_orphan_files procedure (#11080)
manuzhang committed Sep 14, 2024
1 parent 5582b0c commit 2e4d5b5
Showing 1 changed file with 38 additions and 0 deletions.
38 changes: 38 additions & 0 deletions docs/docs/spark-procedures.md
@@ -312,6 +312,10 @@ Used to remove files which are not referenced in any metadata files of an Iceber
| `location` | | string | Directory to look for files in (defaults to the table's location) |
| `dry_run` | | boolean | When true, don't actually remove files (defaults to false) |
| `max_concurrent_deletes` | | int | Size of the thread pool used for delete file actions (by default, no thread pool is used) |
| `file_list_view` | | string | Dataset to look for files in (skipping the directory listing) |
| `equal_schemes` | | map<string, string> | Mapping of file system schemes to be considered equal. Key is a comma-separated list of schemes and value is a scheme (defaults to `map('s3a,s3n','s3')`). |
| `equal_authorities` | | map<string, string> | Mapping of file system authorities to be considered equal. Key is a comma-separated list of authorities and value is an authority. |
| `prefix_mismatch_mode` | | string | Action behavior when location prefixes (schemes/authorities) mismatch: <ul><li>ERROR - throw an exception. (default) </li><li>IGNORE - no action.</li><li>DELETE - delete files.</li></ul> |

#### Output

@@ -331,6 +335,40 @@ Remove any files in the `tablelocation/data` folder which are not known to the t
CALL catalog_name.system.remove_orphan_files(table => 'db.sample', location => 'tablelocation/data');
```

Remove any files in the `files_view` view which are not known to the table `db.sample`.
```java
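// `allFiles` is assumed to be a List<FilePathLastModifiedRecord> built by the caller.
// The bean fields are renamed to the `file_path` and `last_modified` columns of the file list.
Dataset<Row> compareToFileList =
    spark
        .createDataFrame(allFiles, FilePathLastModifiedRecord.class)
        .withColumnRenamed("filePath", "file_path")
        .withColumnRenamed("lastModified", "last_modified");
String fileListViewName = "files_view";
// Register the dataset as a temp view so the procedure can reference it by name.
compareToFileList.createOrReplaceTempView(fileListViewName);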
```
```sql
CALL catalog_name.system.remove_orphan_files(table => 'db.sample', file_list_view => 'files_view');
```

When a file matches a reference in the metadata files except for its location prefix (scheme/authority), an error is thrown by default.
Setting `prefix_mismatch_mode` to `IGNORE` suppresses the error and skips the file.
```sql
CALL catalog_name.system.remove_orphan_files(table => 'db.sample', prefix_mismatch_mode => 'IGNORE');
```

The file can still be deleted by setting `prefix_mismatch_mode` to `DELETE`.
```sql
CALL catalog_name.system.remove_orphan_files(table => 'db.sample', prefix_mismatch_mode => 'DELETE');
```
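
The three modes can be pictured as a small decision over one referenced location and one listed location. The following is a minimal sketch of that decision only, not Iceberg's implementation: the class `PrefixMismatchSketch`, the method `isOrphan`, and the example paths are all hypothetical names chosen for illustration.

```java
import java.net.URI;
import java.util.Objects;

/** Illustrative sketch only; names and structure are assumptions, not Iceberg code. */
final class PrefixMismatchSketch {

  /** Mirrors the three documented values of prefix_mismatch_mode. */
  enum Mode { ERROR, IGNORE, DELETE }

  /** Returns true if the listed file would be treated as an orphan relative to one reference. */
  static boolean isOrphan(String referenced, String listed, Mode mode) {
    URI ref = URI.create(referenced);
    URI actual = URI.create(listed);

    // Different paths: this reference does not cover the listed file at all.
    if (!Objects.equals(ref.getPath(), actual.getPath())) {
      return true;
    }

    boolean samePrefix =
        Objects.equals(ref.getScheme(), actual.getScheme())
            && Objects.equals(ref.getAuthority(), actual.getAuthority());
    if (samePrefix) {
      return false; // exact match, so the file is referenced
    }

    // Same path but a different scheme or authority: the mode decides.
    switch (mode) {
      case IGNORE:
        return false; // skip the file
      case DELETE:
        return true; // treat the file as an orphan
      default:
        throw new IllegalStateException(
            "Prefix mismatch between " + referenced + " and " + listed);
    }
  }

  public static void main(String[] args) {
    String referenced = "hdfs://ns1/warehouse/db/sample/data/f.parquet";
    String listed = "hdfs://ns2/warehouse/db/sample/data/f.parquet";
    System.out.println("IGNORE -> orphan? " + isOrphan(referenced, listed, Mode.IGNORE));
    System.out.println("DELETE -> orphan? " + isOrphan(referenced, listed, Mode.DELETE));
  }
}
```

The sketch compares a single pair of locations; the procedure itself works over the whole set of referenced and listed files.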

The mismatch can also be resolved by treating the mismatched prefixes as equal.
```sql
CALL catalog_name.system.remove_orphan_files(table => 'db.sample', equal_schemes => map('file', 'file1'));
```

```sql
CALL catalog_name.system.remove_orphan_files(table => 'db.sample', equal_authorities => map('ns1', 'ns2'));
```
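
The shape of the `equal_schemes` map (a comma-separated list of schemes as the key, one scheme as the value) can be read as a lookup from each listed scheme to its canonical form. The sketch below shows one way to expand and apply such a map, under that assumption. `EqualSchemesSketch`, `flatten`, `canonicalScheme`, and `sameScheme` are hypothetical names for illustration and are not taken from Iceberg.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; names are assumptions, not Iceberg code. */
final class EqualSchemesSketch {

  /** Expands {"s3a,s3n" -> "s3"} into {"s3a" -> "s3", "s3n" -> "s3"}. */
  static Map<String, String> flatten(Map<String, String> equalSchemes) {
    Map<String, String> flat = new HashMap<>();
    for (Map.Entry<String, String> entry : equalSchemes.entrySet()) {
      for (String scheme : entry.getKey().split(",")) {
        flat.put(scheme.trim(), entry.getValue().trim());
      }
    }
    return flat;
  }

  /** Maps a scheme to its canonical form, or returns it unchanged if unmapped. */
  static String canonicalScheme(String scheme, Map<String, String> flat) {
    return flat.getOrDefault(scheme, scheme);
  }

  /** Two schemes are considered equal when their canonical forms are equal. */
  static boolean sameScheme(String a, String b, Map<String, String> flat) {
    return canonicalScheme(a, flat).equals(canonicalScheme(b, flat));
  }

  public static void main(String[] args) {
    // Same shape as the documented default, map('s3a,s3n','s3').
    Map<String, String> flat = flatten(Collections.singletonMap("s3a,s3n", "s3"));
    System.out.println(sameScheme("s3a", "s3n", flat));
    System.out.println(sameScheme("s3", "hdfs", flat));
  }
}
```

The same expansion would apply to `equal_authorities`, with authorities in place of schemes.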

### `rewrite_data_files`

Iceberg tracks each data file in a table. More data files lead to more metadata stored in manifest files, and small data files cause an unnecessary amount of metadata and less efficient queries from file open costs.