3/18/13

Pentaho Kettle ETL regression with Spring

The real-life scenario is a Java application with an ETL process that loads some data - for example, updating the list of names in a phone directory. With a pretty naive approach, automated regression tests can be introduced quite easily. Here's a demo.

Overview

The application itself makes use of:
  • Spring Framework 3.0,
  • Hibernate 3.6,
  • Pentaho Kettle ETL 4.4.

The model consists of:

  • two entities: Person and PhoneNumber (a sketch follows this list),
  • two DAOs: PersonDao and PhoneNumberDao,
  • an ETL transformation that loads a CSV file and updates the PERSON table,
  • a wrapper around the Kettle Transformation/Job API.
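
For orientation, here's a minimal sketch of the Person entity as it can be inferred from the tests further down - an assumption on my part, since the post only references the mapping file (Person.hbm.xml):

Person.java (hypothetical sketch)

import java.util.HashSet;
import java.util.Set;

public class Person {

    private Integer id;
    private String firstName;
    private String lastName;
    private Set<PhoneNumber> phoneNumbers = new HashSet<PhoneNumber>();

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }

    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }

    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }

    public Set<PhoneNumber> getPhoneNumbers() { return phoneNumbers; }
    public void setPhoneNumbers(Set<PhoneNumber> phoneNumbers) { this.phoneNumbers = phoneNumbers; }
}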

Testing dependencies

  • Spring-test 3.0,
  • DbUnit 2.2,
  • JUnit 4.4,
  • HSQLDB 2.2,
  • Unitils 2.4.

General ideas

  • the underlying database is HSQLDB instead of an actual Oracle instance,
    • for the Kettle ETL tests, the database is temporarily switched to Oracle compatibility mode (MySQL, PostgreSQL and other compatibility modes are available too),
    • we could also test against a real Oracle DB - in that case, any Oracle-specific functionality could be used. I prefer the compromise with HSQLDB.
  • db schema is created with hbm2ddl,
  • db test data is loaded with DbUnit XML files (tricky!),
  • the actual testing happens inside a simple JUnit4 / Spring Integration test case:
    • first, we execute the ETL transformation,
    • next, we test the results with our Hibernate-based API.

The starting point

We already have a piece of code which allows us to execute the ETL process programmatically:

KettleAdapterImpl.java

public class KettleAdapterImpl implements KettleAdapter {

    private DataSource dataSource;
    private String repositoryPath;

    public KettleAdapterImpl() throws KettleException {
        KettleEnvironment.init(false);
    }

    @Override
    public int executeTransformation(String s, Map<String, String> parameters) {
        try {
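            // Make Kettle resolve every named (JNDI) datasource to the
            // Spring-injected DataSource, regardless of the requested name.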
            DataSourceProviderFactory.setDataSourceProviderInterface(new DataSourceProviderInterface() {
                @Override
                public DataSource getNamedDataSource(String s) throws DataSourceNamingException {
                    return dataSource;
                }
            });

            KettleFileRepositoryMeta repositoryMeta = new KettleFileRepositoryMeta("id", "name", "repo", ResourceUtils.getFile(repositoryPath).getAbsolutePath());
            KettleFileRepository repository = new KettleFileRepository();
            repository.setRepositoryMeta(repositoryMeta);
            TransMeta transMeta = new TransMeta(ResourceUtils.getFile(repositoryPath + "/" + s).getAbsolutePath(), repository);

            for (Map.Entry<String, String> entry : parameters.entrySet()) {
                transMeta.setParameterValue(entry.getKey(), entry.getValue());
            }

            // Run the transformation synchronously and return the error count.
            Trans trans = new Trans(transMeta);
            trans.execute(new String[]{});
            trans.waitUntilFinished();
            return trans.getErrors();
        } catch (KettleException e) {
            throw new RuntimeException(e);
        } catch (FileNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public void setDataSource(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void setRepositoryPath(String repositoryPath) {
        this.repositoryPath = repositoryPath;
    }
}
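
The KettleAdapter interface itself isn't shown here. A minimal sketch, assuming only the single method used above:

KettleAdapter.java (sketch)

import java.util.Map;

public interface KettleAdapter {

    /**
     * Executes the named transformation with the given parameters and
     * returns the number of errors (0 means success).
     */
    int executeTransformation(String transformationName, Map<String, String> parameters);
}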

Test configuration, step by step

And now to the configuration details.

Spring integration tests

We'll use some neat features of Unitils to speed up test development and maintenance. First, the PersonDao implementation and a test for it:

PersonDaoImpl.java

public class PersonDaoImpl extends HibernateDaoSupport implements PersonDao {
    @Override
    public Integer save(Person person) {
        Serializable id = getHibernateTemplate().save(person);
        getHibernateTemplate().flush();
        return (Integer) id;
    }

    @Override
    public Person get(Integer id) {
        return getHibernateTemplate().get(Person.class, id);
    }

    @Override
    public List<Person> getAll() {
        return getHibernateTemplate().findByNamedQuery("person.loadAll");
    }
}

PersonDaoImplTest.java

@SpringApplicationContext({"classpath:/application-dao.xml", "classpath:/application-test-datasource.xml"})
@DataSet("PersonDaoImplTest.xml")
public class PersonDaoImplTest extends UnitilsJUnit4 {

    @SpringBeanByName
    private PersonDao personDao;

    @SpringBeanByName
    private PhoneNumberDao phoneNumberDao;

    @Test
    public void testPersonDaoImpl() throws Exception {
        Person person = new Person();
        person.setFirstName("Steve");
        PhoneNumber phoneNumber = new PhoneNumber();
        phoneNumber.setNumber("01234");
        person.setPhoneNumbers(Sets.newHashSet(phoneNumber));
        phoneNumberDao.save(phoneNumber);
        personDao.save(person);

        Assert.assertTrue("new id above 1000", person.getId().intValue() >= 1000);
        Assert.assertEquals("all 3", 3, personDao.getAll().size());
    }

}
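
The PersonDaoImplTest.xml dataset isn't shown in the original post. Given the assertions above ("all 3" after a single save, new IDs of 1000 or higher), it presumably pre-loads two PERSON rows with low IDs. A hypothetical example, modeled on the KettleAdapterTest.xml dataset shown further down:

PersonDaoImplTest.xml (sketch)

<dataset>
    <PERSON PER_ID="1" PER_FIRST_NAME="John" PER_LAST_NAME="Smith" />
    <PERSON PER_ID="2" PER_FIRST_NAME="Bill" PER_LAST_NAME="Gates" />
</dataset>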

Testing with HSQLDB

I think it's easier to maintain tests with minimal external dependencies - hence replacing Oracle with an in-memory HSQLDB. At some point you may end up needing Oracle anyway, but as long as you don't have to, HSQLDB is easier.

Maven

<dependency>
    <groupId>org.hsqldb</groupId>
    <artifactId>hsqldb</artifactId>
    <version>2.2.9</version>
    <scope>test</scope>
</dependency>
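
The test datasource below uses Commons DBCP's BasicDataSource, so that library needs to be on the classpath as well - a sketch, with the version being my assumption:

<dependency>
    <groupId>commons-dbcp</groupId>
    <artifactId>commons-dbcp</artifactId>
    <version>1.4</version>
    <scope>test</scope>
</dependency>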

Spring configuration

A separate file - test datasource configuration:

application-test-datasource.xml

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
    http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">

    <bean id="myDataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
        <property name="driverClassName" value="org.hsqldb.jdbcDriver"/>
        <property name="url" value="jdbc:hsqldb:mem:aname;hsqldb.tx=mvcc"/>
        <property name="username" value="sa"/>
        <property name="password" value=""/>
    </bean>

</beans> 

Hack #1

Kettle requires the MVCC parameter - without it, the tests tend to deadlock.

Unitils

unitils.properties

database.driverClassName=org.hsqldb.jdbcDriver
database.url=jdbc:hsqldb:mem:aname;hsqldb.tx=mvcc
database.userName=sa
database.password=
database.dialect=hsqldb
database.schemaNames=PUBLIC
DbUnitModule.DataSet.loadStrategy.default=pl.erizo.random.kettle.regression.util.CleanAllInsertLoadStrategy 

DbUnit

DbUnit is used to pre-load test data. Easy and fast, but not problem-free:

Hack #2

DbUnit 2.2 has issues with deleting data from multiple tables - by default, it tries to clean only the tables mentioned in the given dataset. So we're going to always clear all tables:

CleanAllInsertLoadStrategy.java

public class CleanAllInsertLoadStrategy extends BaseDataSetLoadStrategy {

    @Override
    public void doExecute(DbUnitDatabaseConnection dbUnitDatabaseConnection, IDataSet dataSet) throws DatabaseUnitException, SQLException {
        try {
            // Clear every table listed in AllTables.xml, then insert this test's dataset.
            InputStream xmlStream = ResourceUtils.getURL("classpath:AllTables.xml").openStream();
            FlatXmlDataSet allTablesDataSet = new FlatXmlDataSet(xmlStream);
            DatabaseOperation.DELETE_ALL.execute(dbUnitDatabaseConnection, allTablesDataSet);
            DatabaseOperation.INSERT.execute(dbUnitDatabaseConnection, dataSet);
            dbUnitDatabaseConnection.getConnection().commit();
        } catch (IOException e) {
            throw new DatabaseUnitException("Could not read AllTables.xml", e);
        }
    }
}

And the list of tables (DbUnit executes DELETE_ALL in reverse table order, so tables referenced by foreign keys should be listed first):

AllTables.xml

<dataset>
    <PERSON />
    <PHONE_NUMBER />
</dataset> 

One last thing - Hack #3

Mixing DbUnit datasets, Hibernate, ETL and database sequences can cause trouble: DbUnit does not increment sequences while loading datasets. Luckily, this can be fixed in the Hibernate SessionFactory configuration:

application-dao.xml

    <bean id="mySessionFactory" class="org.springframework.orm.hibernate3.LocalSessionFactoryBean">
        <property name="dataSource" ref="myDataSource"/>
        <property name="mappingResources">
            <list>
                <value>pl/erizo/random/kettle/regression/model/Person.hbm.xml</value>
                <value>pl/erizo/random/kettle/regression/model/PhoneNumber.hbm.xml</value>
            </list>
        </property>
        <property name="hibernateProperties">
            <value>
                hibernate.dialect=org.hibernate.dialect.HSQLDialect
                hibernate.hbm2ddl.auto=create
                hibernate.show_sql=true
                hibernate.hbm2ddl.import_files=/sequences.sql
            </value>
        </property>
    </bean>

And, situated in src/test/resources:

sequences.sql

-- start IDs at 1000, above anything used in the DbUnit datasets
alter sequence PER_ID_SEQ restart with 1000;

This keeps IDs generated via the sequence clear of the rows pre-loaded by DbUnit - exactly what the "new id" assertion in PersonDaoImplTest checks.

Let's test

KettleAdapterTest.java

@SpringApplicationContext({"classpath:/application-dao.xml", "classpath:/application-test-datasource.xml"})
@DataSet("KettleAdapterTest.xml")
public class KettleAdapterTest extends UnitilsJUnit4 {

    @SpringBeanByName
    private KettleAdapter kettleAdapter;

    @SpringBeanByName
    private PersonDao personDao;

    @Before
    public void setUp() throws Exception {
        SQLUnitils.executeUpdate("SET DATABASE SQL SYNTAX ORA TRUE", DatabaseUnitils.getDataSource());
    }

    @Test
    public void testExecute() throws Exception {
        List<Person> everybody = personDao.getAll();
        Assert.assertEquals("2 results", 2, everybody.size());
        HashMap<String, String> map = Maps.newHashMap();
        map.put("input.file", ResourceUtils.getFile("classpath:pl/erizo/random/kettle/regression/phonenumbers.txt").getAbsolutePath());
        int errors = kettleAdapter.executeTransformation("LoadNumbers.ktr", map);
        Assert.assertEquals("no errors", 0, errors);
        everybody = personDao.getAll();
        ReflectionAssert.assertPropertyLenientEquals("last names ordered", "lastName",
                Lists.newArrayList("Gates", "Newman", "Smith"), everybody);
        Assert.assertEquals("3 results", 3, everybody.size());
    }

    @After
    public void tearDown() throws Exception {
        SQLUnitils.executeUpdate("SET DATABASE SQL SYNTAX ORA FALSE", DatabaseUnitils.getDataSource());
    }
}

KettleAdapterTest.xml

<dataset>
    <PERSON PER_ID="1" PER_FIRST_NAME="John" PER_LAST_NAME="Smith" />
    <PERSON PER_ID="2" PER_FIRST_NAME="Bill" PER_LAST_NAME="Gates" />
</dataset>

phonenumbers.txt

John,Smith,001
Paul,Newman,002

Comments

  1. Thanks for outlining how to use JUnit and Hibernate to test ETL transformations. It seems a bit cumbersome and might be hard for folks who aren't Java devs, but if you already have the infrastructure and the knowledge, it's hard to beat the granularity.

    How fast do you find this solution to be?

    I've written a series of blog posts on how to test Kettle transformations without writing Java code - you build a parallel Kettle job that exercises your logic and compares it to a golden set of values. (My posts focus on file-based golden data, but the approach can easily be extended to database tables.) The most recent post is here: http://www.mooreds.com/wordpress/archives/1061

    Reply: In terms of performance - I think the transformation itself is not much affected by the Spring/Hibernate wrapping.

      There is some overhead to start up the application context and the Hibernate session factory, and to load the data, but this cost is pretty constant.

      A note for non-Java folks: this approach may be a good fit if you already have a Java application that uses Kettle ETL; however, if you're an ETL developer, it's probably not going to be your choice.

  2. What does your ktr file look like? I'm trying to reproduce this, but the datasource isn't being set correctly in my ktr file. I've also parameterized the database connection values so it can point at other sources dynamically.

    Reply: Hi Steven,
      Have a look here: http://pastebin.com/WpcdeMby

      The datasource name is set as a JNDI name (KettleDS), but it is effectively ignored - the DataSourceProviderFactory override in the first lines of executeTransformation in KettleAdapterImpl returns the injected DataSource for any name.
