HttpRequest returns empty HtmlAgilityPack.HtmlDocument for valid URL – response redirects to different URL

Posted: 5 days ago Quote #91
I'm not sure whether this has to do with the HtmlAgilityPack settings within the Crawler-Lib Engine, but it may. In the rough code sample below, the HttpRequest returns with no errors but with an empty HtmlAgilityPack.HtmlDocument, and I'm not sure why. Below the request I create an HtmlDocument and use HtmlWeb to retrieve the same page, and I get a valid document back. The MaxDownloadSize is not being exceeded (the HTML is < 300K), no exceptions are caught, no retry is triggered, and it seems to occur only on the URL specified and a few others like it. I had a good look at the OuterHtml returned by the second request and it looks normal to me. You can find it here if you want a look: https://www.dropbox.com/s/hmpydtjbzj8msrg/test2.html?dl=0
I also had a look at request.Html.DocumentNode and saw no settings that should influence it.
Is there something I'm not seeing?
If this is not related to Crawler-Lib then accept my apologies and I will look elsewhere for the answer. Thanks.


private static readonly HttpRequestQuota requestQuota = new HttpRequestQuota
{
    MaxDownloadSize = 1000000,
    OperationTimeoutMilliseconds = 30000,
    ResponseTimeoutMilliseconds = 30000
};

await new Retry(
    5,
    async retry =>
    {
        // Back off a little longer on each retry attempt.
        if (retry.ProcessingInfo.RetryCount > 0)
        {
            await new Delay(500 * retry.ProcessingInfo.RetryCount);
        }

        try
        {
            string url = "http://www.walmart.com/ip/Paw-Patrol-Skyes-High-Flyin-Copter/36774670";

            // Crawler-Lib request: completes with no errors, but request.Html is empty.
            var request = await new HttpRequest(new Uri(url), requestQuota);

            // Control check: fetch the same page directly with HtmlAgilityPack.
            HtmlWeb webGet = new HtmlWeb();
            HtmlAgilityPack.HtmlDocument doc = webGet.Load(this.TaskRequest.Url);
        }
        catch (Exception ex)
        {
            lastError = ex.Message;
            this.TaskResult.Errors++;
        }
    }
);
Posted: 5 days ago Quote #92
In the code above, the line:
HtmlAgilityPack.HtmlDocument doc = webGet.Load(this.TaskRequest.Url);
should actually read:
HtmlAgilityPack.HtmlDocument doc = webGet.Load(url);

They were both loaded from the same source.
Posted: 5 days ago Quote #93
Hello Peter, I'm analyzing this issue right now. First tests show that your submitted code runs without any issue in the unlocked version, and other tests have shown that the licensing/obfuscation system caused problems in the locked versions (the IL code is heavily modified by this tool). So I assume your problem is caused by the licensing/obfuscation system.

I had already started to replace it after the issues with the ClickOnce installation (we did a lot of internal testing afterwards and found problems that occur only in the obfuscated version). Our own implementation of the licensing is already programmed and integrated internally. I still have to deploy the license generator to our website and release the Crawler-Lib Engine with the new licensing system. This should be done by tomorrow. Sorry about this.

After that, you must update the NuGet package and generate a new license on our website.

Best regards

Tom
Crawler-Lib Developer
Posted: 5 days ago Quote #94
Sorry, I'm so focused on the licensing thing that I didn't see the real problem.

The URL you want to crawl (http://www.walmart.com/ip/Paw-Patrol-Skyes-High-Flyin-Copter/36774670) redirects to another page.

request.Response.StatusCode is Found (HTTP 302) and request.Response.RedirectLocation points to the redirect target.
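
You can see this directly if you inspect the response. A minimal sketch (HttpStatusCode is the standard System.Net enum; the properties are the ones named above):

var request = await new HttpRequest(new Uri(url), requestQuota);

// A 302 Found response carries no page to parse, which is why
// request.Html comes back empty here.
if (request.Response.StatusCode == HttpStatusCode.Found)
{
    Console.WriteLine("Redirected to: " + request.Response.RedirectLocation);
}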

By default the HttpRequest doesn't follow redirects, because when you do website analysis you want to know about the redirects. Such tools have a workflow that follows and records each redirect (in fact a loop; see the sketch after the example below). But it is possible to start an HttpRequest that follows the redirects automatically:


var request = await new HttpRequest(new HttpRequestConfig
{
    Url = new Uri(url),
    Quota = requestQuota,
    AutoPerformRedirects = true
});
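
If you want to follow and record each hop yourself instead, the workflow loop could look roughly like this. This is only a sketch: it assumes RedirectLocation holds the Location header value (possibly relative, so it is resolved against the current URL), and the names hops and maxHops are illustrative, not part of the API:

// Manual redirect workflow (sketch): follow and record every hop.
// HttpStatusCode is the standard System.Net enum.
var hops = new List<Uri>();
var current = new Uri(url);
const int maxHops = 10; // guard against redirect cycles

for (int i = 0; i < maxHops; i++)
{
    var request = await new HttpRequest(current, requestQuota);
    if (request.Response.StatusCode != HttpStatusCode.Found &&
        request.Response.StatusCode != HttpStatusCode.MovedPermanently)
    {
        break; // final page reached; request.Html is now populated
    }

    // Resolve the redirect target against the current URL and record it.
    current = new Uri(current, request.Response.RedirectLocation.ToString());
    hops.Add(current);
}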


Hope that helps.
Best regards, Tom
Crawler-Lib Developer
Posted: 5 days ago Quote #95
That helps. Thanks for your timely response. The info I was gathering was actually on the original page that issues the redirect, but it is on the redirect target as well. I was unaware of the AutoPerformRedirects = true flag. Guess I was looking in the wrong place as well. 🙂
Thanks
Posted: 4 days ago Quote #96
You're welcome. Please feel free to ask sooner rather than spending hours trying to find solutions on your own. It is very interesting for us to see at which points our products can't be used straight away; this gives us important information about where further enhancements and documentation are needed. Best regards, Tom.

BTW:
The new NuGet packages will contain IntelliSense (documentation XML), and the downloadable zip package will contain Html Help 1 (.chm) and Microsoft Help Viewer help (.mshc). The latter can be integrated into Visual Studio so that you can access it via F1.
Crawler-Lib Developer