Background Task Processing

For a web crawler, task processing is not a background activity; it is the crawler's main job. But the Crawler-Lib Engine has been generalized and has gained workflow capabilities, so it can now be used for any kind of background processing task. We therefore no longer call it a crawler engine or a data mining and information retrieval engine. It is now a workflow-enabled background task processor.

Performance and Throughput

When we talk about performance with developers who have never had to deal with massively parallel requests, we always hear the same stuff about threads, the TPL, and so on. Many developers think they can achieve performance by starting thousands of threads. If you do that, you get a huge memory footprint and a lot of overhead that burns CPU cycles. The better approach is the async programming model, which can handle tens of thousands of requests concurrently. But working with the async pattern directly is annoying and complicated.

Some would say: no problem, we have the Parallel.ForEach method (or something else in the TPL), which does all the async stuff for us. This is completely wrong. Even if you have 50,000 requests, Parallel.ForEach will run fewer than one hundred of them in parallel, because it is not designed to serve thousands of blocking operations.

The Crawler-Lib Engine comes to the rescue. In its default configuration it uses only a few worker threads (the default is the number of CPU cores) and performs the requests asynchronously. The workflow processor is decoupled from the async operation processor, so the two can interact cleanly.
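The general idea of serving many requests without one thread per request can be sketched in a few lines. This is not the Crawler-Lib Engine API; it is an illustration in Python's asyncio, and fake_request is a hypothetical stand-in for a real non-blocking network call.

```python
import asyncio


async def fake_request(i: int) -> int:
    # Hypothetical stand-in for a non-blocking network call.
    # The sleep hands control back to the event loop instead of blocking a thread.
    await asyncio.sleep(0.01)
    return i


async def run_all(count: int) -> list:
    # All requests are scheduled on one event loop; no thread is created per request.
    return await asyncio.gather(*(fake_request(i) for i in range(count)))


results = asyncio.run(run_all(10_000))
```

Here ten thousand simulated requests are in flight at once on a single thread, which is the property the thread-per-request approach cannot offer cheaply.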

Workflow

Why does workflow matter? Social media APIs have complex requirements that must be met to get the results we want. Sometimes the task workflow must authenticate before other operations can be performed. Big result sets can't be retrieved in one operation, so subsequent requests are needed to gather the complete result.
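A workflow of this shape (authenticate first, then follow a cursor until the result set is exhausted) might look like the sketch below. It is written in Python's asyncio for illustration only and does not use the Crawler-Lib Engine API; authenticate, fetch_page, the token value, and the PAGES table are all hypothetical stand-ins for a real service.

```python
import asyncio

# Hypothetical paged API: cursor -> (items on this page, next cursor or None).
PAGES = {
    None: (["a", "b"], "p2"),
    "p2": (["c", "d"], "p3"),
    "p3": (["e"], None),
}


async def authenticate() -> str:
    # Stand-in for a login request that returns an access token.
    await asyncio.sleep(0)
    return "token-123"


async def fetch_page(token: str, cursor):
    # Stand-in for one paged request; rejects calls without the expected token.
    if token != "token-123":
        raise PermissionError("not authenticated")
    await asyncio.sleep(0)
    return PAGES[cursor]


async def fetch_all() -> list:
    # Step 1: authenticate. Step 2: request pages until no cursor is returned.
    token = await authenticate()
    items, cursor = [], None
    while True:
        page_items, cursor = await fetch_page(token, cursor)
        items.extend(page_items)
        if cursor is None:
            return items
```

The point is that the steps depend on each other: no page can be fetched before the token exists, and each page request needs the cursor from the previous one.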

Servers and APIs have limits, so our task workflow has to ensure that we don't overwhelm these services with requests. We need to control both the parallelism and the throughput of our operations.
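One common way to cap parallelism is a semaphore around each request. The following is a minimal sketch in Python's asyncio, not the Crawler-Lib Engine's own mechanism; limited_fetch, crawl, and the stats dictionary are hypothetical names, and the sleep stands in for the real request.

```python
import asyncio


async def limited_fetch(i: int, limit: asyncio.Semaphore, stats: dict) -> int:
    # At most max_parallel coroutines can be inside this block at the same time.
    async with limit:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.001)  # stand-in for the real request
        stats["in_flight"] -= 1
        return i


async def crawl(count: int, max_parallel: int):
    limit = asyncio.Semaphore(max_parallel)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(limited_fetch(i, limit, stats) for i in range(count))
    )
    # Return the results plus the highest number of simultaneous requests observed.
    return results, stats["peak"]
```

Calling asyncio.run(crawl(50, 5)) processes all 50 items while never having more than 5 in flight. This only bounds parallelism; limiting throughput (requests per second) would need an additional delay or token-bucket step, which is left out here.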

Storage system performance benefits from assembled results instead of a vast number of small updates. So the task workflow can query a range of information sources for an item, assemble the responses into one big result, and deliver it back.
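The assemble-then-store pattern could be sketched as follows, again in Python's asyncio and not with the Crawler-Lib Engine API. query_source, Store, process_item, and the score fields are hypothetical and exist only to make the example runnable.

```python
import asyncio


async def query_source(source: str, item_id: str) -> dict:
    # Stand-in for querying one information source about an item.
    await asyncio.sleep(0)
    return {f"{source}_score": len(source) + len(item_id)}


class Store:
    """Hypothetical storage backend that records every write it receives."""

    def __init__(self):
        self.writes = []

    def save(self, record: dict) -> None:
        self.writes.append(record)


async def process_item(item_id: str, sources: list, store: Store) -> dict:
    # Query all sources concurrently, merge the partial results, write once.
    partials = await asyncio.gather(*(query_source(s, item_id) for s in sources))
    record = {"id": item_id}
    for part in partials:
        record.update(part)
    store.save(record)
    return record
```

However many sources are queried, the store sees exactly one write per item rather than one small update per source.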

These are a few examples of why we developed the workflow capabilities of the Crawler-Lib Engine.