How to Find All Existing and Archived URLs on a Website
There are several good reasons you might want to find all of the URLs on a website, and your specific goal will determine what you're looking for. For example, you might want to:
Identify every indexed URL to analyze issues such as cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.
In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.
Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often give you what you need. But if you're reading this, you probably didn't get that lucky.
Archive.org
Archive.org is a valuable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
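If the web interface's limits or the missing export button get in the way, the Wayback Machine's CDX API is another route. Here's a minimal sketch, assuming the domain and output filename are placeholders you'd swap for your own:

```python
# Minimal sketch: pull archived URLs for a domain from the Wayback Machine CDX API.
# The domain and output filename are assumptions; adjust the limit and filters for your site.
import requests

DOMAIN = "example.com"  # assumption: replace with your domain

params = {
    "url": f"{DOMAIN}/*",   # match everything under the domain
    "output": "json",
    "fl": "original",       # return only the original URL field
    "collapse": "urlkey",   # deduplicate repeated captures of the same URL
    "limit": 10000,
}
resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params, timeout=60)
resp.raise_for_status()

rows = resp.json()
urls = [row[0] for row in rows[1:]]  # the first row is the header

with open("archive_org_urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Saved {len(urls)} URLs")
```

The collapse parameter keeps each URL to a single row even when it was captured many times, which keeps the output list clean.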
Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.
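If the export is too large to work through by hand, a few lines of pandas can reduce it to unique target URLs. This is a rough sketch, assuming the export contains a column named "Target URL" and a placeholder filename; check your CSV's actual header and adjust:

```python
# Rough sketch: extract unique target URLs from a Moz Pro inbound-links CSV export.
# Assumption: the export has a "Target URL" column; rename to match your file.
import pandas as pd

df = pd.read_csv("moz_inbound_links.csv")  # assumed filename

target_urls = (
    df["Target URL"]        # assumed column name
    .dropna()
    .str.strip()
    .drop_duplicates()
    .sort_values()
)

target_urls.to_csv("moz_target_urls.txt", index=False, header=False)
print(f"{len(target_urls)} unique target URLs")
```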
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
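For sites that need more rows than the UI export allows, the Search Analytics query endpoint can be paged through programmatically. Here's a minimal sketch, assuming a service-account JSON file that has already been granted access to the property; the property URL, dates, and filename are placeholders:

```python
# Minimal sketch: page through the Search Console API for every page with impressions.
# Assumes service-account credentials with access to the property; URL and dates are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SITE_URL = "https://example.com/"  # assumption: your verified property

credentials = service_account.Credentials.from_service_account_file(
    "service_account.json",  # assumed credentials file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=credentials)

def fetch_all_pages(site_url, start_date, end_date):
    """Keep requesting 25,000-row batches until no more rows come back."""
    urls, start_row = [], 0
    while True:
        body = {
            "startDate": start_date,
            "endDate": end_date,
            "dimensions": ["page"],
            "rowLimit": 25000,  # maximum rows per request
            "startRow": start_row,
        }
        resp = service.searchanalytics().query(siteUrl=site_url, body=body).execute()
        rows = resp.get("rows", [])
        if not rows:
            break
        urls.extend(row["keys"][0] for row in rows)
        start_row += len(rows)
    return urls

urls = fetch_all_pages(SITE_URL, "2024-01-01", "2024-03-31")
print(f"{len(urls)} pages with impressions")
```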
Indexing → Pages report:
This section offers exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create distinct URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report.
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they provide valuable insights.
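If you'd rather pull the same list outside the GA4 interface, the GA4 Data API can return page paths directly. The sketch below assumes the google-analytics-data Python package, application-default credentials with access to the property, and a placeholder property ID:

```python
# Minimal sketch: export page paths from GA4 via the Data API.
# Assumes credentials via GOOGLE_APPLICATION_CREDENTIALS; property ID and dates are placeholders.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest
)

PROPERTY_ID = "123456789"  # assumption: replace with your GA4 property ID

client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property=f"properties/{PROPERTY_ID}",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    limit=100000,
)
response = client.run_report(request)

paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths")
```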
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path requested by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be enormous, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process.
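If you'd rather not reach for a dedicated log analyzer, even a short script can pull the unique paths out of a raw access log. Here's a rough sketch, assuming a log in the common/combined format and a placeholder filename; adapt the regex to whatever your server or CDN actually writes:

```python
# Rough sketch: collect unique URL paths from an access log in common/combined format.
# The filename and request pattern are assumptions; adjust for your server's log format.
import re

REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as log:  # assumed filename
    for line in log:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1).split("?")[0])  # drop query strings

with open("log_paths.txt", "w") as out:
    out.write("\n".join(sorted(paths)))

print(f"{len(paths)} unique paths")
```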
Combine, and good luck
Once you've gathered URLs from all of these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
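For the Jupyter Notebook route, here's a minimal sketch of the combine-and-deduplicate step. The filenames and domain are assumptions carried over from the earlier sketches, and the normalization rules (lowercase scheme and host, no trailing slash, no fragments) are one reasonable choice rather than the only one:

```python
# Minimal sketch: combine, normalize, and deduplicate URL lists from several sources.
# Filenames and the domain are assumptions; each input file holds one URL or path per line.
from urllib.parse import urlsplit, urlunsplit
import pandas as pd

FILES = ["archive_org_urls.txt", "moz_target_urls.txt", "log_paths.txt"]  # assumed filenames
DOMAIN = "https://example.com"  # assumption: used to expand bare paths from log files

def normalize(url: str) -> str:
    url = url.strip()
    if url.startswith("/"):  # log files give paths, not full URLs
        url = DOMAIN + url
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"  # treat /page and /page/ as the same URL
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

frames = [pd.read_csv(f, header=None, names=["url"]) for f in FILES]
combined = pd.concat(frames, ignore_index=True)

combined["url"] = combined["url"].astype(str).map(normalize)
combined = combined.drop_duplicates().sort_values("url")

combined.to_csv("all_urls_deduplicated.csv", index=False)
print(f"{len(combined)} unique URLs")
```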
And voilà, you now have a comprehensive list of current, old, and archived URLs. Good luck!