Crawler-Lib - HttpRequest returns empty HtmlAgilityPack.HtmlDocument for valid URL

Peter

Total Posts: 20

PM

Posted: 4 years ago Quote #91

I'm not sure whether this indeed has to do with HtmlAgilityPack settings within the Crawler-Lib Engine, but it may. In the following rough code sample the HttpRequest returns with no errors and with an empty HtmlAgilityPack.HtmlDocument. I'm not sure why this is occurring. Below the request I created an HtmlDocument and use HtmlWeb to retrieve the same page and I get a valid document back. The MaxDownLoad size is not being exceeded (the html is < 300K), there are no catch errors trapped, it doesn't trigger a retry and it seems to only occur on the URL specified and a few others like it. I had a good look at the OuterHtml that was being returned by the second request and it looks normal to me. You can find it here if you want a look: https://www.dropbox.com/s/hmpydtjbzj8msrg/test2.html?dl=0
I had a look at request.Html.DocumentNode and saw no settings that should have an influence on it.
Is there something I'm not seeing?
If this is not related to Crawler-Lib then accept my apologies and I will look elsewhere for the answer. Thanks.


private static readonly HttpRequestQuota requestQuota = new HttpRequestQuota { 
  MaxDownloadSize = 1000000, 
  OperationTimeoutMilliseconds = 30000, 
  ResponseTimeoutMilliseconds = 30000 
};
await new Retry(
    5,
    async @retry =>
    {
        if (@retry.ProcessingInfo.RetryCount > 0)
        {
            await new Delay(500 * @retry.ProcessingInfo.RetryCount);
        }

        try {
      string url = "http://www.walmart.com/ip/Paw-Patrol-Skyes-High-Flyin-Copter/36774670";
      var request = await new HttpRequest(new Uri(url), requestQuota);
      HtmlWeb webGet = new HtmlWeb();
      HtmlAgilityPack.HtmlDocument doc = webGet.Load(this.TaskRequest.Url);
    }
        catch (Exception ex)
        {
            lastError = ex.Message;
            this.TaskResult.Errors++;
        }
  }
);

Peter

Total Posts: 20

PM

Posted: 4 years ago Quote #92

In the code:
HtmlAgilityPack.HtmlDocument doc = webGet.Load(this.TaskRequest.Url);
should actually read:
HtmlAgilityPack.HtmlDocument doc = webGet.Load(url);

They were both loaded from the same source.

Tom

Status: Moderator
Total Posts: 103

PM

Posted: 4 years ago Quote #93

Hello Peter, I'm analyzing this issue right now. First tests show that your submitted code run without any issue in the unlocked version. Some tests have shown that the licensing/obfuscation system caused some problems with the locked versions (The IL code is heavy modified by this tool). So I assume, your problem is caused by the licensing/obfuscation system. I've already started to replace this after the issues with the ClickOnce installation (We have done a lot of internal testing after and found problems that occur only in the obfuscated version). Our own implementation of the licensing is already programmed and integrated internally. I have to deploy the license generator to our website and to release the Crawler-Lib engine with the new licensing system. This should be done by tomorrow. Sorry about this.
After that, you must update the NuGet package and generate a new license on our website.

Best regards

Tom

Crawler-Lib Developer

Tom

Status: Moderator
Total Posts: 103

PM

Posted: 4 years ago Quote #94

Sorry I'm so focused on the licensing thing, that I don't see the real problems.

The Url you want to crawl redirects to another page. ({http://www.walmart.com/ip/Paw-Patrol-Skyes-High-Flyin-Copter/36774670).

request.Response.StatusCode is Found and
request.Response.RedirectLocation is pointing to the redirected Url.

By default the HttpRequest doesn't follow redirects, because when you do website analysis you want to know the redirects. Such tools have a workflow that follows and records each redirect (In fact a loop ). But it is possible to start a HttpRequest that follows the redirects automatically:


var request = await new HttpRequest(new HttpRequestConfig{Url= new Uri(url), Quota = requestQuota, AutoPerformRedirects = true});

Hope that helps.
Best regards, Tom

Crawler-Lib Developer

Peter

Total Posts: 20

PM

Posted: 4 years ago Quote #95

That helps. Thanks for your timely response. The info I was gathering was actually on the page that gets redirected. But it is on the redirected page as well. I was unaware of the AutoPerformRedirects = true flag. Guess I was looking in the wrong place as well. 🙂
Thanks

Tom

Status: Moderator
Total Posts: 103

PM

Posted: 4 years ago Quote #96

You're welcome. Please feel free to ask soon and don't spend hours trying to find solutions. It is very interesting for us to see on which points our products can't be used on the fly. This provides us important information where further enhancements and documentation is needed. Best regards Tom.

BTW:
The new NuGet packages will contain IntelliSense (documentation XML) and the downloadable zip package will contain Html Help 1 (.chm) and Microsoft Help Viewer Help (.mshc). The later can be integrated into Visual Studio so that you can access it via F1.

Crawler-Lib Developer

wlp

Total Posts: 66

PM

Posted: 2 years ago Quote #268

nfl vikings jerseys, lunette ray ban pas cher, abercrombie fitch, hollister clothing store, adidas shoes, coach factory, juicy couture clothings, michael kors, oakley sunglasses cheap, new balance outlet, nfl lions jerseys, ralph lauren, rolex watches, mcm handbags, nfl broncos jerseys, kate spade outlet online, ralph lauren, nfl bears jerseys, barbour jackets, tory burch sale, burberry outlet online, replica handbags, ray bans, longchamp handbags, ray bans, mlb jerseys, burberry outlet online, replica watches, nike, air jordans, givenchy outlet online, ugg, montre femme, guess, michael kors, nhl jerseys, hogan, uggs on sale, washington wizards jerseys, dsquared2, babyliss, nfl browns jerseys, huarache, ray ban, burberry online shop, cheap michael kors, dre beats, michael kors, nike, mlb jerseys, celine bags, adidas schuhe, asics gel, longchamp, new balance, true religion jeans women, ugg australia, jimmy choo, memphis grizzlies jerseys, barbour jacket outlet, cheap oakley sunglasses, replica watches, ralph lauren outlet, uggs boots, coach factory, ray ban outlet, burberry outlet, [url=http://www.pandora.

HttpRequest returns empty HtmlAgilityPack.HtmlDocument for valid URL – response redirects du different URL

Information

User service

Follow us