Single Site Crawls

PBN Lab now gives you the ability to crawl a single web site in its entirety with just a couple of clicks.

Setting up the job is as easy as entering a single URL, whether it’s the home page or any other page of the site. The crawler will then work its way around the entire site, revealing every expired domain it links out to.

Finding 600 niche-specific domains in 1 targeted crawl:

Watch this 5-minute video now, where I show you how I crawled Pat Flynn’s Smart Passive Income website in about 15 minutes flat, revealing the 300 domains he links out to.

Better still, check out how I found more than 600 niche-specific domains in the airline industry by crawling just one web directory!


How the Single Site Crawl works:

The crawler begins with the URL you provide and keeps crawling and indexing every link it finds, but it will only fetch pages whose “hostname” matches that of the original URL you provided.

For instance, whether you specify http://www.42seconds.com or http://www.42seconds.com/category/sub-category, the hostname is www.42seconds.com in either case.

Both HTTP and HTTPS pages will be crawled, regardless of which protocol your starting URL uses. It’s best to copy and paste the site’s actual URL, but if you’re unsure, go with HTTP.

For example, here’s a list of pages that will be crawled based on the hostname www.42seconds.com:

  • http://www.42seconds.com
  • http://www.42seconds.com/about
  • https://www.42seconds.com/category/post-name
  • https://www.42seconds.com/category/another-post
  • http://www.42seconds.com/page.html
  • https://www.42seconds.com/contact-us.php
  • …everything and anything on www.42seconds.com

Sites that would NOT be crawled:

  • http://blog.42seconds.com – because the hostname is blog.42seconds.com
  • http://blog.42seconds.com/category/post-name – because the hostname is blog.42seconds.com
  • http://support.42seconds.com – because the hostname is support.42seconds.com
  • http://wiki.42seconds.com – because the hostname is wiki.42seconds.com
  • …anything that is not strictly www.42seconds.com!
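To make the rule concrete, here’s a minimal Python sketch of the same-hostname test. This is illustrative only, not PBN Lab’s actual code; the same_host helper and the starting URL are made up for the example. The scheme is deliberately ignored, so HTTP and HTTPS pages on the same hostname both pass, just as described above.

```python
from urllib.parse import urlparse

START_URL = "http://www.42seconds.com"  # illustrative starting URL

def same_host(url: str, start_url: str = START_URL) -> bool:
    """True if `url` shares the starting URL's hostname.

    The scheme (http vs. https) is ignored, so both protocols
    on the same hostname are treated as crawlable.
    """
    return urlparse(url).hostname == urlparse(start_url).hostname

# A few of the examples from the lists above:
print(same_host("https://www.42seconds.com/contact-us.php"))      # True  -> crawled
print(same_host("http://blog.42seconds.com/category/post-name"))  # False -> skipped
print(same_host("http://wiki.42seconds.com"))                     # False -> skipped
```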

Why not all sub-domains at once?

The reason it works this way is that, technically, the hosts “www” and “blog” on the domain 42seconds.com are completely separate properties. They could be different sites, on different servers, even in different countries!

In some cases, they could be entirely different and unrelated sites (think of web 2.0 properties like wordpress.com, where each subdomain is a different user’s blog).

Or they could simply be massive authority sites with siloed content, such as finance.42seconds.com and sports.42seconds.com. Or, like Wikipedia, they might use subdomains to serve the site in different languages, e.g. en.wikipedia.org vs. fr.wikipedia.org.

We have to assume they’re different properties; otherwise, you’d lose the ability to restrict a crawl to a single web property, or to one specific silo.

It’s worth noting that all of the outbound URLs parsed during the crawl, including those on other subdomains, will still be indexed and assessed as potentially expired domains.
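To illustrate that behaviour, here’s a hedged Python sketch of such a crawl loop. The function name, the use of the requests library, and the regex-based link extraction are my own assumptions for the example, not PBN Lab’s implementation. The loop only follows same-hostname links, but it records every external hostname it encounters so those can later be checked for expiry:

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

import requests

def crawl_single_site(start_url: str, max_pages: int = 100_000) -> set:
    """Breadth-first crawl restricted to the start URL's hostname.

    Pages on other hostnames are never fetched, but every external
    hostname found in a link is recorded so it can later be assessed
    as a potentially expired domain.
    """
    start_host = urlparse(start_url).hostname
    queue, seen = deque([start_url]), {start_url}
    external_hosts = set()

    while queue:
        page = queue.popleft()
        try:
            html = requests.get(page, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load
        # Naive href extraction; a real crawler would use an HTML parser.
        for href in re.findall(r'href="([^"]+)"', html):
            url = urljoin(page, href)
            host = urlparse(url).hostname
            if host is None:
                continue
            if host == start_host:
                if url not in seen and len(seen) < max_pages:
                    seen.add(url)
                    queue.append(url)
            else:
                external_hosts.add(host)  # indexed, never fetched

    return external_hosts
```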

Current limitations of the single site crawl:

As of 18th August 2016:

  • The maximum number of URLs that will be crawled in a single job is currently 100,000 pages. This limit will be raised or removed in the near future, but for the moment it ensures each crawl engine doesn’t run into memory issues.
  • The number of crawl bots used is currently limited to 30 for the single-site crawl, regardless of which plan you’re on. This is to prevent the private proxy IPs from being banned (temporarily or permanently) by the target web server, since we’re crawling it end-to-end in one go.
  • You must be on a Tera or Exa plan to have access to this feature. It is NOT available on the Byte or Mega plan.