
bifabrik

Microsoft Fabric ETL toolbox

Archiving processed source files

When loading data from files, such as CSV or JSON, you may want to move the loaded files to an archive folder so that you don’t reprocess them the next time you run the pipeline.

A sample configuration can look like this:

import bifabrik as bif

# move source files to an archive after a successful load
bif.config.fileSource.moveFilesToArchive = True
# archive location within the source lakehouse
bif.config.fileSource.archiveFolder = 'Files/CsvArchive'
# naming pattern for the archived files
bif.config.fileSource.archiveFilePattern = 'processed_{filename}_{timestamp}{extension}'

Then you can load the file as usual:

bif.fromCsv('CsvFiles/dimBranch.csv').delimiter(';').toTable('DimBranchArch').run()

With the archiving feature configured, the pipeline keeps track of which files were loaded. After the pipeline finishes (by saving data to the destination table), the source files it used are moved to the folder specified, here Files/CsvArchive. Based on the archive file pattern, the resulting file path can look like this: Files/CsvArchive/processed_dimBranch_2024_03_02_19_09_44_722291.csv. This is a location within the source lakehouse. You can use these placeholders in the file name pattern (expanded as illustrated in the sketch below the list):

{filename}  : source file name without suffix
{extension} : source file name extension, including the '.'
{timestamp} : current timestamp as '%Y_%m_%d_%H_%M_%S_%f'
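
To see what a given pattern produces, here is a minimal sketch in plain Python (an illustration, not bifabrik’s internal implementation) that expands the placeholders the same way:

import os
from datetime import datetime

pattern = 'processed_{filename}_{timestamp}{extension}'
filename, extension = os.path.splitext('dimBranch.csv')
timestamp = datetime.now().strftime('%Y_%m_%d_%H_%M_%S_%f')

print(pattern.format(filename=filename, timestamp=timestamp, extension=extension))
# e.g. processed_dimBranch_2024_03_02_19_09_44_722291.csv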

The files will only be moved if the pipeline succeeds. Should it fail, your files will remain in the source folder.
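
Because the move only happens on success, you can simply re-run the pipeline after fixing the problem and the untouched source files will be picked up again. A minimal sketch (the try/except is plain Python, not a bifabrik API):

import bifabrik as bif

try:
    bif.fromCsv('CsvFiles/dimBranch.csv').delimiter(';').toTable('DimBranchArch').run()
except Exception as e:
    # nothing was archived, so the next run will pick up the same files again
    print(f'Load failed, source files left in place: {e}')
    raise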

For info on all the general file loading config options, see:

help(bif.config.fileSource)
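
The options set above are plain attributes, so (assuming they read back the same way they are assigned) you can also check their current values directly:

print(bif.config.fileSource.moveFilesToArchive)    # True
print(bif.config.fileSource.archiveFolder)         # Files/CsvArchive
print(bif.config.fileSource.archiveFilePattern)    # processed_{filename}_{timestamp}{extension}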

You may also want to configure logging like this:

import bifabrik as bif
from bifabrik.utils import log
from bifabrik.cfg.specific.LogConfiguration import LogConfiguration

cfg = LogConfiguration()

# log messages at INFO level and above to a CSV file
cfg.logPath = '/log/archive_test_log.csv'
cfg.loggingLevel = 'INFO'
logger = log.configureLogger(cfg)

bif.config.fileSource.moveFilesToArchive = True
bif.config.fileSource.archiveFolder = 'Files/CsvArchive'
bif.config.fileSource.archiveFilePattern = 'processed_{filename}_{timestamp}{extension}'
bif.fromCsv('CsvFiles/dimBranch.csv').delimiter(';').toTable('DimBranchArch').run()

With this setup in place, bifabrik logs a nice summary of what the archiving has done:

2024-03-02 19:26:02,437	INFO	Executing CSV source: CsvFiles/dimBranch.csv
2024-03-02 19:26:02,437	INFO	Searching location Files
2024-03-02 19:26:02,528	INFO	Searching location Files/CsvFiles
2024-03-02 19:26:02,562	INFO	Loading CSV files: [/lakehouse/default/Files/CsvFiles/dimBranch.csv]
2024-03-02 19:26:02,595	INFO	Executing Table destination: DimBranchArch
2024-03-02 19:26:05,024	INFO	Archiving abfss://.../Files/CsvFiles/dimBranch.csv -> abfss://.../Files/CsvArchive/processed_dimBranch_2024_03_02_19_26_05_024146.csv

And, if you don’t feel like repeating these configuration steps each time, save them to a file:

bif.config.saveToFile('Files/cfg/fileSrcSettings.json')

# and then load it...

bif.config.loadFromFile('Files/cfg/fileSrcSettings.json')
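
One convenient pattern (an assumption about your notebook setup, not a bifabrik requirement) is to bootstrap the settings file on first run. In a Fabric notebook, the attached default lakehouse is mounted under /lakehouse/default, so you can check whether the file exists there:

import os
import bifabrik as bif

cfgFile = 'Files/cfg/fileSrcSettings.json'
if os.path.exists(f'/lakehouse/default/{cfgFile}'):
    # reuse the saved settings
    bif.config.loadFromFile(cfgFile)
else:
    # first run: configure and save for next time
    bif.config.fileSource.moveFilesToArchive = True
    bif.config.fileSource.archiveFolder = 'Files/CsvArchive'
    bif.config.fileSource.archiveFilePattern = 'processed_{filename}_{timestamp}{extension}'
    bif.config.saveToFile(cfgFile)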

See more in Configuration
