Saturday, February 22, 2014

Spring Batch - special aspects of batch processing

Spring Batch - what for?

Spring Batch is a framework specifically designed for batch processing.
It is intended for processing e.g. files with large amounts of data, and it provides a clear DSL defined in XML.
The framework comes with abstractions and defaults that offer "extension points" where business or processing logic can be placed.

Why should I use spring batch?

It is regarded as the standard framework for batch processing, and a lot of developers know how to use it.
It provides useful abstractions and can be configured in many respects to support more advanced requirements:
  • transaction support
  • retry support
  • skip functionality
  • seamless integration into the Spring world (DI etc.)
  • strong layered architecture
  • very scalable thanks to support for step partitioning, multi-threaded steps, …

Basic concept

In Spring Batch, processing starts with a job, which consists of steps.
Steps can be chunk-oriented or so-called TaskletSteps; the latter exist mainly to support legacy code.
The main components of Spring Batch are (a minimal sketch follows after the listener list below):
  • ItemReader - reads one item
  • ItemProcessor - processes one item (optional)
  • ItemWriter - writes a list of items
Besides that there are different types of listeners for placing business logic:
  • job/step execution listener
  • chunk listener
  • ItemReadListener / ItemProcessListener / ItemWriteListener
  • SkipListener
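
To make the three core contracts concrete, here is a minimal, hypothetical sketch (class names and data are made up, not from any real job):

    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;

    // ItemReader: delivers one item per call; null signals "no more input".
    class InMemoryReader implements ItemReader<String> {
        private final Iterator<String> items = Arrays.asList("a", "b", "c").iterator();

        public String read() {
            return items.hasNext() ? items.next() : null;
        }
    }

    // ItemProcessor: transforms one item; returning null would filter it out.
    class UpperCaseProcessor implements ItemProcessor<String, String> {
        public String process(String item) {
            return item.toUpperCase();
        }
    }

    // ItemWriter: receives a whole chunk at once, inside the chunk transaction.
    class LoggingWriter implements ItemWriter<String> {
        public void write(List<? extends String> chunk) {
            System.out.println("writing chunk: " + chunk);
        }
    }
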
The transaction boundary is never placed around a whole step or the complete job.
Metadata like execution start/end time, number of commits/rollbacks, step status etc. is saved at several points:
  • step execution context - a map that is used for serializing data (see the sketch below)
  • chunk execution context - used inside a chunk transaction to keep track of the current item in process
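
As an illustration of the step execution context, here is a hedged sketch of a StepExecutionListener that stores custom state in that map (the key name is made up); whatever is put there is persisted together with the step metadata:

    import org.springframework.batch.core.ExitStatus;
    import org.springframework.batch.core.StepExecution;
    import org.springframework.batch.core.StepExecutionListener;

    class StateSavingListener implements StepExecutionListener {

        public void beforeStep(StepExecution stepExecution) {
            // Persisted with the step metadata; available again on a restart.
            stepExecution.getExecutionContext().putLong("items.seen", 0L);
        }

        public ExitStatus afterStep(StepExecution stepExecution) {
            long seen = stepExecution.getExecutionContext().getLong("items.seen");
            System.out.println("items seen so far: " + seen);
            return stepExecution.getExitStatus();
        }
    }
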
The default rollback behavior: if an uncaught exception occurs while a chunk is being processed, the current chunk is rolled back and the step fails.
All chunks committed up to that point stay committed, but the complete job fails.
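
The skip and retry support mentioned earlier relaxes this default behavior. A hedged sketch of a fault-tolerant step in Java config (exception types, limits and names are only examples):

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.dao.DeadlockLoserDataAccessException;

    class FaultToleranceConfig {

        // 'steps' is assumed to be an injected StepBuilderFactory.
        Step faultTolerantStep(StepBuilderFactory steps,
                               ItemReader<String> reader, ItemWriter<String> writer) {
            return steps.get("faultTolerantStep")
                    .<String, String>chunk(10)
                    .reader(reader)
                    .writer(writer)
                    .faultTolerant()
                    .skip(IllegalArgumentException.class).skipLimit(5)            // tolerate up to 5 bad items
                    .retry(DeadlockLoserDataAccessException.class).retryLimit(3)  // retry transient failures
                    .build();
        }
    }
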

The metadata of a step is initialized at the beginning of the step and updated at its end. This happens in separate transactions: the step status must be updated in a transaction of its own, because the processing of the step itself can fail and must then be rolled back without losing the status update.

A Spring Batch job consists of steps, as we know by now, and these steps consist of chunks. Each chunk is executed in its own transaction.
How does Spring Batch know how much data has to be read into a chunk?
This is specified by a policy: the CompletionPolicy.
Specifying the commit-interval on the chunk tag leads to a SimpleCompletionPolicy.
As soon as enough items have been read to satisfy the CompletionPolicy, the items read and processed are passed to the ItemWriter.
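
In Java config the same can be expressed explicitly; chunk(10) is shorthand for a SimpleCompletionPolicy of size 10 (a sketch, step and bean names invented):

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.repeat.policy.SimpleCompletionPolicy;

    class ChunkConfig {

        // 'steps' is assumed to be an injected StepBuilderFactory.
        Step importStep(StepBuilderFactory steps,
                        ItemReader<String> reader, ItemWriter<String> writer) {
            // Equivalent to .chunk(10): commit after every 10 read items.
            return steps.get("importStep")
                    .<String, String>chunk(new SimpleCompletionPolicy(10))
                    .reader(reader)
                    .writer(writer)
                    .build();
        }
    }
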

Restart of a job

How is a restart of a job done?
Well, when the data is not read from a file but retrieved from database table(s) or a messaging system, the data retrieval is done via a separately declared non-transactional datasource, because on a processing error a rollback would otherwise close the data retrieval channel as well.
No other job, component or module should operate on this non-transactional datasource. The data is normally read using a database cursor in order to avoid memory issues; Spring Batch provides the JdbcCursorItemReader for this (sketch below).
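
A hedged sketch of such a cursor-based reader; table, column and bean names are made up, and 'cursorDataSource' is assumed to be the separately declared non-transactional datasource:

    import javax.sql.DataSource;

    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.jdbc.core.SingleColumnRowMapper;

    class ReaderFactory {

        // A separate, non-transactional DataSource keeps the open cursor
        // alive when a chunk transaction is rolled back.
        static JdbcCursorItemReader<String> orderReader(DataSource cursorDataSource) {
            JdbcCursorItemReader<String> reader = new JdbcCursorItemReader<String>();
            reader.setDataSource(cursorDataSource);
            reader.setSql("SELECT payload FROM pending_orders ORDER BY id");
            reader.setRowMapper(new SingleColumnRowMapper<String>(String.class));
            reader.setName("orderReader"); // key under which the state is stored
            reader.setSaveState(true);     // persist the read-item count for restartability
            return reader;
        }
    }
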
Whether a launch is a restart is determined by Spring Batch itself: if a job is called with the same job parameters and the previous execution ended with a failure, Spring Batch treats this as a restart of that job.
To accomplish this, the state is saved in the execution context of the chunk.
Every reader persists the counter of read items inside the chunk's transaction, and every chunk commits its work at the end inside its own transaction. So if the CompletionPolicy is set to 10, the chunk tries to commit its work after the 10th item. On a job restart, the execution point of the last successful chunk is taken, and only the items that were not yet committed are processed.
The execution context is visible to all readers - that means the counter state can be modified by different readers. This setup is not thread-safe!
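
A sketch of what this looks like from the launching side (bean names and the parameter are invented):

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.JobParameters;
    import org.springframework.batch.core.JobParametersBuilder;
    import org.springframework.batch.core.launch.JobLauncher;

    class RestartDemo {

        void runAndMaybeRestart(JobLauncher jobLauncher, Job importJob) throws Exception {
            JobParameters params = new JobParametersBuilder()
                    .addString("input.date", "2014-02-22")
                    .toJobParameters();

            // First launch: assume it fails after some chunks were committed.
            jobLauncher.run(importJob, params);

            // Launching again with the SAME parameters counts as a restart:
            // processing resumes after the last successfully committed chunk.
            jobLauncher.run(importJob, params);
        }
    }
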

Ordering

For the restartability of a job, the ordering of the read data must be well-defined.
Whether the data comes from a database or from a file, the ordering of the data retrieval must be set explicitly so that a restart of the job sees the items that have to be reprocessed in the same order as on the first run (see the snippet below).
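
Reusing the cursor reader from the sketch above, this just means making the order explicit in the query (hypothetical table again):

    // Without the ORDER BY, a restart could see the rows in a different
    // order and skip or re-process the wrong items.
    reader.setSql("SELECT payload FROM pending_orders ORDER BY id ASC");
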
