I'm not sure whether this has to do with HtmlAgilityPack settings within the Crawler-Lib Engine, but it may. In the rough code sample below, the HttpRequest completes without errors but returns an empty HtmlAgilityPack.HtmlDocument, and I'm not sure why. Immediately below that request I create an HtmlDocument with HtmlWeb, retrieve the same page, and get a valid document back. The MaxDownloadSize quota is not being exceeded (the HTML is under 300 KB), no exceptions are caught, no retry is triggered, and it seems to happen only on the URL in the sample and a few others like it. The OuterHtml returned by the second (HtmlWeb) request looks normal to me; you can find it here if you want a look: https://www.dropbox.com/s/hmpydtjbzj8msrg/test2.html?dl=0
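To be concrete about what "empty" means here, this is roughly the check I'm using (just a sketch; request.Html is the HtmlAgilityPack.HtmlDocument the HttpRequest exposes, which I also reference below):

// Sketch of the "empty document" check; request.Html is assumed to be the
// HtmlAgilityPack.HtmlDocument populated by the Crawler-Lib HttpRequest.
string outerHtml = request.Html.DocumentNode.OuterHtml;
if (string.IsNullOrWhiteSpace(outerHtml) || !request.Html.DocumentNode.HasChildNodes)
{
    // This is the branch I keep hitting for the URL in the sample below.
    Console.WriteLine("HttpRequest returned an empty HtmlDocument");
}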
I had a look at request.Html.DocumentNode and saw no settings that should have an influence on it.
Is there something I'm not seeing?
If this is not related to Crawler-Lib then accept my apologies and I will look elsewhere for the answer. Thanks.
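In case it helps, this is roughly how I compared the two documents. The HtmlAgilityPack parser options (OptionFixNestedTags and friends) live on the HtmlDocument itself rather than on DocumentNode, so the sketch below dumps those plus the parse errors; it assumes request.Html is the underlying HtmlAgilityPack.HtmlDocument and needs using System.Linq for Count():

// Sketch: dump a few HtmlAgilityPack settings and parse results from a document
// so the Crawler-Lib one and the HtmlWeb one can be compared side by side.
static void DumpDocumentInfo(string label, HtmlAgilityPack.HtmlDocument doc)
{
    Console.WriteLine("{0}: FixNestedTags={1}, AutoCloseOnEnd={2}, CheckSyntax={3}",
        label, doc.OptionFixNestedTags, doc.OptionAutoCloseOnEnd, doc.OptionCheckSyntax);
    Console.WriteLine("{0}: DeclaredEncoding={1}, ParseErrors={2}, OuterHtml length={3}",
        label,
        doc.DeclaredEncoding == null ? "(none)" : doc.DeclaredEncoding.WebName,
        doc.ParseErrors.Count(),
        doc.DocumentNode.OuterHtml.Length);
}

// Usage inside the task shown below:
// DumpDocumentInfo("Crawler-Lib", request.Html);   // comes back empty for this URL
// DumpDocumentInfo("HtmlWeb", doc);                // valid document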
private static readonly HttpRequestQuota requestQuota = new HttpRequestQuota
{
    MaxDownloadSize = 1000000,
    OperationTimeoutMilliseconds = 30000,
    ResponseTimeoutMilliseconds = 30000
};

await new Retry(
    5,
    async @retry =>
    {
        // Back off a little longer on each retry.
        if (@retry.ProcessingInfo.RetryCount > 0)
        {
            await new Delay(500 * @retry.ProcessingInfo.RetryCount);
        }

        try
        {
            string url = "http://www.walmart.com/ip/Paw-Patrol-Skyes-High-Flyin-Copter/36774670";

            // Completes without errors, but request.Html comes back as an empty HtmlDocument.
            var request = await new HttpRequest(new Uri(url), requestQuota);

            // Retrieve the same page with HtmlWeb (this.TaskRequest.Url is the URL above
            // in the actual task); this returns a valid, fully populated document.
            HtmlWeb webGet = new HtmlWeb();
            HtmlAgilityPack.HtmlDocument doc = webGet.Load(this.TaskRequest.Url);
        }
        catch (Exception ex)
        {
            lastError = ex.Message;
            this.TaskResult.Errors++;
        }
    });
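For completeness, here is a minimal standalone version of the second (working) request, with no Crawler-Lib involved; it's what produced the OuterHtml in the Dropbox link above. If I'm reading the HtmlWeb API right, StatusCode reports the last response status, so I print it just to confirm the page comes back OK:

using System;
using HtmlAgilityPack;

class Repro
{
    static void Main()
    {
        string url = "http://www.walmart.com/ip/Paw-Patrol-Skyes-High-Flyin-Copter/36774670";

        // Plain HtmlWeb load: this returns a valid, fully populated document.
        var webGet = new HtmlWeb();
        HtmlDocument doc = webGet.Load(url);

        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        Console.WriteLine("Status: {0}", webGet.StatusCode);
        Console.WriteLine("Length: {0}", doc.DocumentNode.OuterHtml.Length);
        Console.WriteLine("Title:  {0}", title == null ? "(none)" : title.InnerText.Trim());
    }
}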