In this C# sample we will develop a task that receives the URL of a website in its request and delivers a list of links from that website in its result. First of all, here are the classes for the request and the result:
```csharp
[DataContract]
public class SimpleTaskRequest : TaskRequestBase
{
    [DataMember]
    public Uri Url { get; set; }

    public override TaskBase CreateTask()
    {
        return new SimpleTask();
    }
}

[DataContract]
public class SimpleTaskResult : TaskResultBase
{
    [DataMember]
    public List<string> Links { get; set; }
}
```
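The [DataContract] and [DataMember] attributes make requests and results serializable, which is what allows them to be queued or transferred between processes. As a minimal sketch, assuming TaskRequestBase itself is data contract serializable, a request could be round-tripped with the standard DataContractSerializer like this:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Text;

class SerializationDemo
{
    static void Main()
    {
        var request = new SimpleTaskRequest { Url = new Uri("http://example.com") };

        // Serialize the request to XML; only [DataMember] properties are written.
        var serializer = new DataContractSerializer(typeof(SimpleTaskRequest));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, request);
            Console.WriteLine(Encoding.UTF8.GetString(stream.ToArray()));
        }
    }
}
```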
As we can see, the task request contains the CreateTask() method, which is a factory method for the task; this is how the real task instance is created. Here is the task itself:
```csharp
public class SimpleTask : TaskBase
{
    public new SimpleTaskRequest TaskRequest
    {
        get { return (SimpleTaskRequest)base.TaskRequest; }
    }

    public new SimpleTaskResult TaskResult
    {
        get { return (SimpleTaskResult)base.TaskResult; }
    }

    public override async void StartWork()
    {
        // Assign the result first, so that exceptions have a place to land.
        base.TaskResult = new SimpleTaskResult();

        var request = await new HttpRequest(TaskRequest.Url, new HttpRequestQuota
        {
            MaxDownloadSize = 100000,
            OperationTimeoutMilliseconds = 10000,
            ResponseTimeoutMilliseconds = 5000
        });

        this.TaskResult.Links = new List<string>();

        // SelectNodes() returns null when the page contains no matching links.
        HtmlNodeCollection nodes = request.Html.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null) return;

        foreach (var node in nodes)
        {
            string href = node.Attributes["href"].Value;
            this.TaskResult.Links.Add(href);
        }
    }
}
```
The workflow begins in the StartWork() method of the task. It is important to set the TaskResult property before any other code is executed: exceptions thrown in StartWork() are delivered in the FatalException property of the task result, so there must already be an instance to store them. This workflow uses the async/await pattern to specify the success handler for the request as a continuation.
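The await on new HttpRequest(...) works because C# can await any type that exposes a suitable GetAwaiter() method, not just Task. The following sketch shows how such an awaitable request type can be built in principle; the names AwaitableRequest, RequestAwaiter and RequestResult are illustrative placeholders, not the actual framework implementation:

```csharp
using System;
using System.Runtime.CompilerServices;

// Illustrative only: a type becomes awaitable by exposing GetAwaiter().
public class AwaitableRequest
{
    public RequestAwaiter GetAwaiter() { return new RequestAwaiter(); }
}

public class RequestAwaiter : INotifyCompletion
{
    // When IsCompleted is true the compiler skips scheduling a continuation.
    public bool IsCompleted { get { return false; } }

    // The compiler packs the code after the await into 'continuation';
    // the request invokes it once the response has arrived.
    public void OnCompleted(Action continuation)
    {
        // A real implementation would start the I/O here and invoke
        // the continuation from its completion callback.
        continuation();
    }

    // Called when execution resumes; returns the response or rethrows
    // a captured exception.
    public RequestResult GetResult() { return new RequestResult(); }
}

public class RequestResult { /* response data, e.g. the parsed HTML */ }
```

This is why the success handler can be written as straight-line code after the await: everything following it is compiled into the continuation that runs when the response is available.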
To summarize, writing your own task consists of the following steps:

- Derive your task class from TaskBase.
- Declare a request class, attribute it with [DataContract] and derive it from TaskRequestBase.
- Declare a result class, attribute it with [DataContract] and derive it from TaskResultBase.
- Override the CreateTask() method and return a task instance.
- Override the StartWork() method and assign a task result instance to the TaskResult property.

For convenience you can redefine the TaskRequest and TaskResult properties with the correct type, as done in the sample above. After that you can start to code the business logic of your task.
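Put together, every task follows the same skeleton. The sketch below uses placeholder names (MyTask, MyTaskRequest, MyTaskResult) and omits the business logic:

```csharp
[DataContract]
public class MyTaskRequest : TaskRequestBase
{
    // Input parameters of the task go here as [DataMember] properties.

    public override TaskBase CreateTask() { return new MyTask(); }
}

[DataContract]
public class MyTaskResult : TaskResultBase
{
    // Output of the task goes here as [DataMember] properties.
}

public class MyTask : TaskBase
{
    // Optional: redefine the properties with the concrete types.
    public new MyTaskRequest TaskRequest { get { return (MyTaskRequest)base.TaskRequest; } }
    public new MyTaskResult TaskResult { get { return (MyTaskResult)base.TaskResult; } }

    public override async void StartWork()
    {
        // Assign the result first, so exceptions have a place to land.
        base.TaskResult = new MyTaskResult();

        // ... awaits of requests and the business logic of the task ...
    }
}
```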
Somebody may ask why such a seemingly bloated concept of requests and results is introduced in a crawler. In fact, it wasn't, until the crawler engine was generalized into a task processor. Because nobody can say in advance what a general task needs to start and what it delivers, there must be a mechanism to provide parameters to the task and to deliver its results. We have decided to use classes for this that must derive from TaskRequestBase and TaskResultBase, not least because of the memory footprint: memory pressure is one of the main reasons why crawlers lose throughput. So, after all, we can see that the concept of task requests and results is lean rather than bloated, and it is needed especially when we look at memory consumption.