Lakehouse table utilities
A few tools created as a side-effect of the bifabrik
library for manipulating delta tables.
from bifabrik.utils import tableUtils as tu
help(tu)
Restore tables to to earlier version
These tools make use of time travel in delta tables and the RESTORE command, making it easier to perform batch restore operations.
List lakehouse tables
import bifabrik.utils.tableUtils as tu
tableList = tu.listTables()
# > ['dimension_date', 'dimension_employee', 'dimension_stock_item', 'fact_sale']
listTables()
simply lists the names of the tables in the default lakehouse for use by subsequent functions
Show tables history
As a next step, let’s have a look at the history of our tables using tablesHistory({list of table names})
import bifabrik.utils.tableUtils as tu
tableList = tu.listTables()
history = tu.tablesHistory(tableList)
display(history)
The result can look like this
This is basically the union of DESCRIBE HISTORY executed for each of the tables
Restore tables to a point in time
Finally, restore the tables to the desired point in time. For each table, their latest version before the given time is taken. If the table has no history before that point, it is skipped.
tableList = tu.listTables()
# filter a subset of the tables as needed
tu.restoreToPIT(tableList, '2024-05-08 14:00:00')
# All restores or table skips are printed out like this:
#
# > Restoring `dimension_employee` to version 1 (2024-05-08 12:32:51.325000)
# > Restoring `dimension_stock_item` to version 1 (2024-05-08 12:34:15.257000)
# > Restoring `fact_sale` to version 3 (2024-05-08 13:42:00.739000)
# > ...
Add new column
There are two options when adding new columns to a table
New column from value
addTableColumnFromValue
takes the target database name (lakehouse), table name, new column name and the value the column will have. The data type is infered from the value provided.
from bifabrik.utils import tableUtils as tu
tu.addTableColumnFromValue('Lakehouse1', 'DimBranch', 'NewCol1', 123.45)
New column from type
This adds an empty column with the provided Spark data type
tu.addTableColumnFromType('LH1', 'Table1', 'Column1', 'string')
Rename column
from bifabrik.utils import tableUtils as tu
tu.renameTableColumn('LH1', 'FactLedger', 'OriginalColumnName', 'NewColumnName')
This uses SQL to alter the column name. In order to do that, it also needs to upgrade the table like this (if it already isn’t “up to scratch”)
ALTER TABLE `LH1`.`FactLedger` SET TBLPROPERTIES (
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5',
'delta.columnMapping.mode' = 'name'
)
Drop column
from bifabrik.utils import tableUtils as tu
tu.dropTableColumn('LH1', 'FactLedger', 'ColumnToBeRemoved')
Similarly to renaming, the delta properties minWriterVersion
and columnMapping.mode
will be upgraded as needed.