Crawler-Lib - Correct setup for data miner

kev

Total Posts: 3

Posted: 3 days ago Quote #217

Hi

I've been building some simple applications to test performance before I integrate Crawler-Lib into one of my main applications. One of them visits a list of URLs and extracts basic information; at the moment this is done with a multithreaded process.

As I don't need to crawl the site, just visit each URL, I took the SimpleTask example for my test and modified it slightly so that it loops through a set of URLs and does nothing else. This was mainly to get a speed figure to compare against.

I'm getting around 15-20 URLs a second on average, which is actually slower than my current multithreaded model (which also processes the data) when I run a comparison.

I've checked the license; it is correct and allows the task number etc. to be adjusted (I have a full license).
I've varied the max tasks, which makes little difference.
I've also tested against several URL lists of varying sizes (1,000-100,000).
I've tried on my dev machine and also on a server which has a 500mb+ connection.

Any suggestions? The only alterations made to the example are below.

thanks
kev

SimpleTaskSample() alterations


    License.Lock();
    var engineConfig = new CrawlerEngineConfig();
    engineConfig.MaxWorkingTasks = 5000;

    // engineConfig.MaxTasksPerMinute = 900000000;
    // engineConfig.MaxFinishedTasks = 100000000;

    var engine = new CrawlerEngine(engineConfig);

    Console.WriteLine();
    Console.WriteLine("Start Task");

    var ofd = new OpenFileDialog();
    if (ofd.ShowDialog() != DialogResult.OK) return;
    string filepath = ofd.FileName;
    List<string> list = new List<string>();
    using (var sr = new StreamReader(File.Open(filepath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite)))
    {
        while (sr.Peek() >= 0)
        {
            string input = sr.ReadLine();
            if (!string.IsNullOrEmpty(input))
            {
                list.Add(input);
            }
        }
    }
    list = list.Distinct().ToList();
    engine.Start();
    foreach (string stin in list)
    {
        engine.AddTask(new SimpleTaskRequest { Url = new Uri(stin) });
    }

StartWork() alterations


    base.TaskResult = new SimpleTaskResult();

    var request = await new HttpRequest(new HttpRequestConfig
    {
        AwaitProcessing = AwaitProcessingEnum.Success,
        Url = TaskRequest.Url,
        UserAgentHeader = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0",
        Quota = new HttpRequestQuota { MaxDownloadSize = 5000000, OperationTimeoutMilliseconds = 60000, ResponseTimeoutMilliseconds = 15000 }
    });

    this.TaskResult.Links = new List<string>();

Tom

Status: Moderator
Total Posts: 102

Posted: 21 hours ago Quote #218

If you crawl URLs on many different hosts, the main bottleneck is name resolution (DNS). These days many major ISPs throttle their DNS servers to prevent massive crawling. This throttling hits harder the more requests you issue in parallel, so a multithreaded solution with fewer threads can be faster at name resolution if the ISP throttles DNS requests.

If this is true for your ISPs, you must split the name resolution from the crawling to gain maximum performance. We are working on a multi-DNS-server resolver that can be configured to prevent this. This component will use a (possibly large) list of DNS servers, but it is still a work in progress.

Crawler-Lib Developer
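
A minimal sketch of splitting name resolution from crawling, as suggested above: collect the distinct hosts from the URL list and resolve each one once, with a cap on parallel lookups so that a throttled DNS server is not flooded. It uses only the .NET base class library (`Dns.GetHostAddressesAsync`, `SemaphoreSlim`). The `DnsPreResolver` class, its method names, and the `maxParallelLookups` parameter are hypothetical and are not part of Crawler-Lib.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of Crawler-Lib.
public static class DnsPreResolver
{
    // Returns the distinct host names in a URL list; entries that are not
    // absolute URLs or have no host part are skipped.
    public static List<string> DistinctHosts(IEnumerable<string> urls)
    {
        var hosts = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (var url in urls)
        {
            if (Uri.TryCreate(url, UriKind.Absolute, out var uri) && !string.IsNullOrEmpty(uri.Host))
            {
                hosts.Add(uri.Host);
            }
        }
        return hosts.ToList();
    }

    // Resolves every host once, with at most maxParallelLookups lookups in flight.
    // Hosts that fail to resolve are stored with an empty address array.
    public static async Task<ConcurrentDictionary<string, IPAddress[]>> ResolveAsync(
        IEnumerable<string> hosts, int maxParallelLookups)
    {
        var results = new ConcurrentDictionary<string, IPAddress[]>(StringComparer.OrdinalIgnoreCase);
        using (var gate = new SemaphoreSlim(maxParallelLookups))
        {
            var lookups = hosts.Select(async host =>
            {
                await gate.WaitAsync();
                try
                {
                    results[host] = await Dns.GetHostAddressesAsync(host);
                }
                catch (Exception ex) when (ex is SocketException || ex is ArgumentException)
                {
                    results[host] = Array.Empty<IPAddress>();
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            await Task.WhenAll(lookups);
        }
        return results;
    }
}
```

For example, `await DnsPreResolver.ResolveAsync(DnsPreResolver.DistinctHosts(list), 20)` would resolve the hosts of a URL list with at most 20 lookups in flight; 20 is an arbitrary starting value to tune against whatever the DNS server tolerates.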

kev

Total Posts: 3

Posted: 6 hours ago Quote #219

Thanks for your reply, Tom. I saw one of the Crawler-Lib posts on Stack Overflow where it mentioned high throughput, so I assumed something like what you mentioned was already implemented.

I'm able to do what you mentioned about separating the DNS lookups from the crawler.

How would I then use this with the crawler? The only way I can think of is to replace the website name with the IP address and set the Host header. Doesn't this cause issues with HTTPS sites and their certificates? And don't some sites reject the http://IPADDRESS/webpage format?

Any feedback would be appreciated; mass crawling of different hosts was the main reason I purchased Crawler-Lib.

thanks
kev
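
One way to avoid rewriting URLs to IP addresses is to leave the URL untouched and redirect only the socket connection to a pre-resolved address. The sketch below does this with plain .NET `HttpClient`, assuming .NET 5 or later (`SocketsHttpHandler.ConnectCallback`); it is not Crawler-Lib code, and whether Crawler-Lib offers a comparable hook is not stated in this thread. Because the request URL keeps its host name, the Host header and the HTTPS certificate check still use that name. `PreResolvedHttp`, `CreateHandler`, and the `cache` parameter are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper, not part of Crawler-Lib. Requires .NET 5 or later.
public static class PreResolvedHttp
{
    // Builds a handler that connects to cached addresses when the host is known
    // and falls back to a normal DNS lookup when it is not.
    public static SocketsHttpHandler CreateHandler(IReadOnlyDictionary<string, IPAddress[]> cache)
    {
        return new SocketsHttpHandler
        {
            ConnectCallback = async (context, cancellationToken) =>
            {
                var host = context.DnsEndPoint.Host;
                var port = context.DnsEndPoint.Port;
                var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
                try
                {
                    if (cache.TryGetValue(host, out var addresses) && addresses.Length > 0)
                    {
                        // Skip DNS: connect directly to the pre-resolved addresses.
                        await socket.ConnectAsync(addresses, port, cancellationToken);
                    }
                    else
                    {
                        // Unknown host: let the socket resolve the name itself.
                        await socket.ConnectAsync(context.DnsEndPoint, cancellationToken);
                    }
                    return (Stream)new NetworkStream(socket, ownsSocket: true);
                }
                catch
                {
                    socket.Dispose();
                    throw;
                }
            }
        };
    }
}
```

A client built with `new HttpClient(PreResolvedHttp.CreateHandler(cache))` is then used with the original URLs; only the lookup step is bypassed for hosts present in the cache.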
