Data processing is the conversion of raw data into meaningful information. There are two general ways to process data:
- Batch processing, in which multiple data records are collected and stored before being processed together in a single operation.
- Stream processing, in which a source of data is constantly monitored and processed in real time as new data events occur.
Batch processing
In batch processing, newly arriving data elements are collected and stored, and the whole group is processed together as a batch. Processing can run on a scheduled time interval (for example, every hour), or it can be triggered when a certain amount of data has arrived or as the result of some other event.
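The two triggers described above (a time interval and a record count) can be sketched in a few lines of Python. This is a minimal illustration, not a production scheduler: the `BatchCollector` class, its thresholds, and the `process` callback are all hypothetical names chosen for the example, and the caller passes the current time explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class BatchCollector:
    """Buffers records and hands them to `process` together when a trigger fires."""
    process: Callable[[list], None]      # called once per completed batch
    max_records: int = 3                 # size trigger (hypothetical threshold)
    max_age_seconds: float = 3600.0      # time trigger (hypothetical: one hour)
    _buffer: list = field(default_factory=list)
    _batch_started: Optional[float] = None

    def add(self, record, now: float) -> None:
        # The first record of a new batch starts the clock for the time trigger.
        if not self._buffer:
            self._batch_started = now
        self._buffer.append(record)
        too_many = len(self._buffer) >= self.max_records
        too_old = now - self._batch_started >= self.max_age_seconds
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        # Process whatever has been collected so far as a single batch.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self.process(batch)


batches = []
collector = BatchCollector(process=batches.append, max_records=3)
for t, reading in enumerate([10, 20, 30, 40]):
    collector.add(reading, now=float(t))
collector.flush()  # an explicit flush stands in for "some other event"
print(batches)
```

Nothing is processed until a trigger fires, which is the source of the delay between ingestion and results that batch processing trades for throughput.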
Advantages of batch processing include:
- Large volumes of data can be processed at a convenient time.
- It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight, or during off-peak hours.
Disadvantages of batch processing include:
- The time delay between ingesting the data and getting the results.
- All of a batch job's input data must be ready before the batch can be processed, so the data must be carefully checked beforehand. Data problems, errors, and program crashes that occur during a batch job bring the whole process to a halt, and the input data must be checked again before the job can be rerun. Even minor data errors can prevent a batch job from running.
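The last disadvantage can be made concrete with a small sketch of a job that checks every input record before doing any work, and rejects the whole batch if even one record is invalid. The record layout (an integer `id` and a numeric `amount`) and the function names are assumptions made for illustration only.

```python
def validate(record: dict) -> list:
    """Return a list of problems found in one record (empty if the record is valid)."""
    problems = []
    if not isinstance(record.get("id"), int):
        problems.append("missing or non-integer 'id'")
    if not isinstance(record.get("amount"), (int, float)):
        problems.append("missing or non-numeric 'amount'")
    return problems


def run_batch(records: list) -> float:
    """Check every record first; process only if the whole batch is clean."""
    errors = {}
    for index, record in enumerate(records):
        problems = validate(record)
        if problems:
            errors[index] = problems
    if errors:
        # A single bad record stops the entire job.
        raise ValueError(f"batch rejected, invalid records: {errors}")
    return sum(record["amount"] for record in records)


good = [{"id": 1, "amount": 2.5}, {"id": 2, "amount": 4}]
print(run_batch(good))

bad = good + [{"id": 3, "amount": "n/a"}]
try:
    run_batch(bad)
except ValueError as exc:
    print(exc)
```

Validating up front avoids a crash partway through the job, but it does not remove the underlying cost: the two valid records in `bad` still wait until the third one is fixed.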
Stream processing
In stream processing, each new piece of data is processed when it arrives. Stream data processing is beneficial in scenarios where new, dynamic data is generated on a continual basis.
Real-world examples of streaming data include:
- A financial institution tracks changes in the stock market in real time, computes value-at-risk, and automatically rebalances portfolios based on stock price movements.
- An online gaming company collects real-time data about player-game interactions and feeds the data into its gaming platform. It then analyzes the data in real time and offers incentives and dynamic experiences to engage its players.
- A real-estate website tracks a subset of data from mobile devices and makes real-time recommendations of properties to visit based on the devices' geolocation.
Stream processing is ideal for time-critical operations that require an immediate response.
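As a rough illustration of processing each event on arrival, the following Python sketch is loosely modeled on the stock-market example above. The `price_alerts` function, the 5% threshold, and the ticker symbols are hypothetical; a real system would read from a live feed rather than a list.

```python
from typing import Dict, Iterable, Iterator, Tuple


def price_alerts(ticks: Iterable[Tuple[str, float]],
                 threshold: float = 0.05) -> Iterator[str]:
    """Yield an alert as soon as a symbol moves more than `threshold`
    (as a fraction) from its previously seen price."""
    last_price: Dict[str, float] = {}
    for symbol, price in ticks:
        previous = last_price.get(symbol)
        if previous is not None and abs(price - previous) / previous > threshold:
            # The alert is produced immediately, before later ticks are read.
            yield f"{symbol}: {previous} -> {price}"
        last_price[symbol] = price


# A list stands in for an unbounded feed of (symbol, price) events.
feed = [("ABC", 100.0), ("XYZ", 50.0), ("ABC", 101.0), ("ABC", 110.0)]
for alert in price_alerts(feed):
    print(alert)
```

Because the function is a generator, it holds only a small amount of state (the last price per symbol) and never needs the full data set, which is what lets the same logic run over a stream that has no end.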
Combine batch and stream processing
Many large-scale analytics solutions include a mix of batch and stream processing, enabling both historical and real-time data analysis.
The following diagram shows some ways in which batch and stream processing can be combined in a large-scale data analytics architecture.
- Data events from a streaming data source are captured in real time.
- Data from other sources is ingested into a data store (often a data lake) for batch processing.
- If real-time analytics is not required, the captured streaming data is written to the data store for subsequent batch processing.
- When real-time analytics is required, a stream processing technology is used to prepare the streaming data for real-time analysis or visualization, often by filtering or aggregating the data over temporal windows.
- The non-streaming data is periodically batch processed to prepare it for analysis, and the results are persisted in an analytical data store (often referred to as a data warehouse) for historical analysis.
- The results of stream processing may also be persisted in the analytical data store to support historical analysis.
- Analytical and visualization tools are used to present and explore the real-time and historical data.
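The routing decision and the windowed aggregation described in the steps above can be sketched as follows. This is a simplified model under stated assumptions: plain Python lists and dictionaries stand in for the data store and the analytical data store, events are hypothetical `(timestamp_seconds, key)` pairs, and the fixed, non-overlapping ("tumbling") window is only one of several possible temporal windows. A real stream processor would emit each window as it closes rather than after reading all events.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window start, key) over fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


def handle_captured_events(events, realtime_required, data_store, analytical_store):
    """Route captured streaming events along one of the two paths."""
    if not realtime_required:
        # Keep the raw events for subsequent batch processing.
        data_store.extend(events)
        return None
    # Aggregate over temporal windows for real-time analysis...
    results = tumbling_window_counts(events)
    # ...and also persist the results to support historical analysis.
    analytical_store.update(results)
    return results


events = [(0, "click"), (30, "click"), (59, "view"), (60, "click"), (125, "view")]
data_store, analytical_store = [], {}
print(handle_captured_events(events, True, data_store, analytical_store))
```

With `realtime_required=False`, the same call leaves `analytical_store` untouched and appends the raw events to `data_store`, where a periodic batch job would pick them up.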