Advice

Posted: 2 days ago Quote #69
Prior to testing your library I was using HtmlAgilityPack and drilling down through a store site to get item title, UPC, and prices. It was a recursive routine that called itself whenever category links were present, and when it reached a plain item link it parsed the necessary data. My throughput was about 100 items a minute, including updating a SQL database after each pageful of links. Because of the number of items on the site, a full parse was taking 3.5 to 5 hours. I wanted to cut that down at least fourfold.
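For context, the serial version looked roughly like this. This is a simplified sketch: the XPath expressions, class names, and field handling are placeholders for whatever the store site actually uses, not the real selectors.

using System.Collections.Generic;
using HtmlAgilityPack;

class SerialScraper
{
    static readonly HtmlWeb Web = new HtmlWeb();

    // Recursively drill down through category pages; parse when we hit an item page.
    static void Crawl(string url, List<(string Title, string Upc, string Price)> results)
    {
        HtmlDocument doc = Web.Load(url);

        // Placeholder XPath: the real selector depends on the site's markup.
        var categoryLinks = doc.DocumentNode.SelectNodes("//a[@class='category']");
        if (categoryLinks != null)
        {
            foreach (var link in categoryLinks)
                Crawl(link.GetAttributeValue("href", ""), results); // relative-URL resolution omitted
            return;
        }

        // No category links, so treat this as an item page and pull the fields.
        var title = doc.DocumentNode.SelectSingleNode("//h1[@class='title']")?.InnerText;
        var upc   = doc.DocumentNode.SelectSingleNode("//span[@class='upc']")?.InnerText;
        var price = doc.DocumentNode.SelectSingleNode("//span[@class='price']")?.InnerText;
        if (title != null && upc != null && price != null)
            results.Add((title, upc, price)); // SQL update after each pageful omitted
    }
}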
I revamped the program to pull the item links for one category on the site into a queue, which I referenced in a TaskRequestBase and ran with a TaskBase. I experimented with parallel processing, limiting it to 10 links at a time and waiting a second or so before repeating, and I tried all sorts of batch sizes and delays in between. I was getting results, but quite a bit of data was missing here and there. I then decided to give your Retry a test. That actually seemed worse, and I didn't see any Retry data coming back at all.
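Since I can't paste the whole task here, this is the throttling pattern I was aiming for, expressed with plain .NET primitives rather than the Crawler-Lib task classes (all names are placeholders; a SemaphoreSlim caps in-flight requests at 10):

using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ThrottledFetcher
{
    static readonly HttpClient Client = new HttpClient();
    // At most 10 requests in flight at once, matching the batch size I tried.
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(10);

    static async Task<string[]> FetchAllAsync(IEnumerable<string> itemLinks)
    {
        var tasks = itemLinks.Select(async url =>
        {
            await Gate.WaitAsync();
            try
            {
                return await Client.GetStringAsync(url); // download one item page
            }
            finally
            {
                Gate.Release(); // free a slot for the next link in the queue
            }
        });
        return await Task.WhenAll(tasks);
    }
}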
Now, I'm sure it may be that I'm not adept at how the parallel processing works and that my approach is at fault. Or maybe it's the limitations of the community edition and I'm just not addressing them correctly in code; I did see where someone wrote about the 600 requests per minute and 2 threads. I did, however, trim down to only a couple of requests every few seconds and still hit bottlenecks. I would have expected at least results comparable to serial processing with plain HtmlAgilityPack. Do you have any other samples of how you would go about this type of process?
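For what it's worth, if the 600-requests-per-minute cap is what applies here, that works out to one request every 100 ms on average, so pacing like the following (again a simplified sketch with placeholder names, not the Crawler-Lib API) should stay well inside it:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class PacedFetcher
{
    static readonly HttpClient Client = new HttpClient();
    // 600 requests per minute = one request every 100 ms on average.
    static readonly TimeSpan MinInterval = TimeSpan.FromMilliseconds(100);

    static async Task FetchPacedAsync(string[] urls)
    {
        foreach (var url in urls)
        {
            var started = DateTime.UtcNow;
            var html = await Client.GetStringAsync(url);
            // ... hand html off to the parsing routine here ...
            var elapsed = DateTime.UtcNow - started;
            if (elapsed < MinInterval)
                await Task.Delay(MinInterval - elapsed); // pad out to the 100 ms budget
        }
    }
}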
The site can take the throughput, or I wouldn't be able to achieve 100 items per minute with serial methods. They don't check IPs or block anything. They're probably delirious with pleasure reporting to their sponsors how many hits they're receiving.
I have my parsing routines inside the TaskBase. Do you think that makes it too weighty and gives it too high a memory footprint? I'll look into rearranging that to see if it helps, but any suggestions you have would be of great value to me.
I'm sure this is an awesome library; it's just Neanderthals like me going about it the wrong way. Sorry for the trouble, but I did spend about 10 hours on it today.
Thanks
Posted: 2 days ago Quote #70
I will help you design the task. Parsing inside the task is absolutely OK. I assume you have a test project for the task design; please send a zipped version of it to [email protected] and you will get back a working sample with the best performance. I will treat your code as confidential.

Kind regards,
Tom

P.S.
Thank you for your response. I know we lack samples at the moment, and the Crawler-Lib Framework is too complex to be understood out of the box. But major parts (like the storage operation processor) are not released yet, so we decided to release all the parts first and come up with the samples later. This kind of feedback is very important to us, and we encourage everybody to share their use cases with us. We will provide specific samples and solutions for your problems.
Crawler-Lib Developer