Low maintenance data integration (ETL)

By mtm from London.pm
Date: Wednesday August 13, 2008 10:40
Duration: 30 minutes
Tags: dataprocessing etl sjerek

You can find more information on the speaker's site:


This is a tech talk about an existing ETL system used at Nestoria.co.uk (a vertical search engine, 4 countries). The system is the processing stage between the arrival of data and its insertion into the database.
http://en.wikipedia.org/wiki/Extract%2C_transform%2C_load

Lots of Perl folks have written ETL systems in the past, and lots more will have to write one in the future. There is often no way around a custom solution.

We will look at some best practices around 24/7 availability, monitoring, data cleansing, data quality, i18n, scaling, dealing with failures and changes ... and of course CPAN modules.

Nestoria had to integrate dozens of different formats (flat files, database dumps, XML, custom), delivery methods (fetch, crawl, FTP) and update methods (complete, incremental, partial, custom). We thought we were prepared for everything, but over the years we learned some valuable lessons about corrupt files, failing servers, data quality, i18n issues and performance.

Attended by: