A few weeks ago we were contacted by a client whose e-commerce website was having crawling and indexation issues – the crawl rate of the website was declining, lots of unnecessary pages were being indexed and, most importantly, they were unable to analyze their website with a crawling tool (Screaming Frog) to identify and resolve those issues.
After getting the overall description of the situation, we decided to start planning the crawling of the website and asked for log files for later analysis.
We realized that the bot was discovering millions of pages while the client’s website only had around 5,000 product pages. Since it was an e-commerce website, we naturally suspected that URL parameters were at fault.
First, let’s talk about URL parameters. According to Google, a URL parameter is “a way to pass information about a click through its URL”. In the example below, we know that we are looking at T-shirts that are blue and short-sleeved.
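A filtered URL of this kind typically looks something like this (the domain and parameter names here are hypothetical):

```
https://www.example.com/t-shirts?color=blue&sleeve=short
```

Everything after the `?` is the query string; each `key=value` pair is a parameter.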
URL parameters are especially useful for tracking and for enhancing the user experience on the website by providing filtered and organized content. But as you may have already guessed, there is a catch!
While certainly handy, URL parameters can also become a nightmare for SEOs. They can create duplicate content, waste the crawl budget, and even mess up the ranking signals.
So naturally, there are several SEO best practices website owners should follow to avoid potential URL parameter issues:
- Remove URLs with blank values
- Avoid ID duplication in the URL
- Keep the same order of keys in the URL
- Add canonical tags when applicable
- Add noindex tags when applicable
- Disallow URLs with unimportant parameters in the robots.txt file
- Add rules for URL parameters in Google Search Console
These will ensure that you are using URL parameters in accordance with your business goals instead of wasting that oh-so-essential crawl budget.
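As a rough illustration of the first three points, here is a minimal Python sketch that normalizes a URL’s query string: it drops blank values, deduplicates keys, and sorts keys into a stable order (the URL and parameter names are hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_query(url: str) -> str:
    """Drop blank values, deduplicate keys, and sort keys alphabetically."""
    parts = urlsplit(url)
    seen = {}
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if value == "":                  # remove parameters with blank values
            continue
        seen.setdefault(key, value)      # keep the first value, avoid key duplication
    query = urlencode(sorted(seen.items()))  # stable key order
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

print(normalize_query("https://www.example.com/t-shirts?sleeve=short&color=blue&color=blue&utm="))
# https://www.example.com/t-shirts?color=blue&sleeve=short
```

In practice this kind of normalization would live in the site’s link-generation layer, so that every internal link points at one canonical parameter order.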
Okay, enough with the introduction and back to the matter at hand – how to fix a 5,000-page website that returns millions of pages.
After identifying that the issues were indeed caused by the URL parameters, we quickly analyzed the syntax of the URLs and came up with a list of 54 parameters that should be excluded from the crawl. We used Screaming Frog for this.
We did this in Screaming Frog by going to Configuration > Exclude and adding an exclusion rule in the following format for each parameter:
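The Exclude field takes regular expressions matched against the full URL. Here is a sketch of what one rule per parameter looks like, using `sort` and `sessionid` as hypothetical stand-ins for the real parameter names:

```python
import re

# Hypothetical parameter names; the real list in our case had 54 entries.
EXCLUDED_PARAMS = ["sort", "sessionid"]

# One Screaming Frog-style exclude regex per parameter, matching the parameter
# whether it appears first (?param=) or later (&param=) in the query string.
patterns = [re.compile(rf".*[?&]{p}=.*") for p in EXCLUDED_PARAMS]

def is_excluded(url: str) -> bool:
    """Return True if the URL would be caught by any exclusion rule."""
    return any(p.fullmatch(url) for p in patterns)

print(is_excluded("https://www.example.com/t-shirts?sort=price"))   # True
print(is_excluded("https://www.example.com/t-shirts?color=blue"))   # False
```

In Screaming Frog itself you would paste the raw patterns (e.g. `.*[?&]sort=.*`), one per line, into the Exclude dialog.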
After running a new crawl (with the limit set to 5,000,000 URLs), even with the excluded parameters, we still ended up with 5,000,000 URLs. (I know, I know)
But! Luckily, the exclusion enabled us to see yet another issue with the URLs. Besides having parameters in the URL, static URLs were created whenever the user chose a brand on the shop page. Moreover, if several brands were chosen at the same time, a static URL including all of the brands was generated, and the order in which the brands were listed in the URL depended on the user’s selection. Here is what the URL looked like: https://www.example.com/t-shirts/brand_1-brand_2-brand_3
To confirm our suspicion, we went on and updated the exclude file with the brand links. You can exclude the brand URLs from your SF crawl by writing the following rule:
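The exact rule depends on your path structure, but for the anonymized URL above, a Screaming Frog exclude regex along these lines would catch the brand pages (the brand slugs are hypothetical placeholders):

```
https://www\.example\.com/t-shirts/brand_.*
```

This matches both single-brand URLs and the hyphen-joined multi-brand combinations.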
We were finally able to make Screaming Frog crawl the website as it was intended! *fireworks*
But it was too soon to celebrate: this is where we discovered some other issues with the URL structure, the most worrying of which was that the URLs contained lots of repeated words.
Getting the URL structure right is essential both for SEO and for a pleasant user experience. After all, the page URL is one of the first things a user sees, alongside the page title and meta description, on the search engine results page.
While there are no fixed rules on how to structure your URLs, there are still some best practices worth following:
- Use short keywords as topics
- Avoid unnecessary repetition of words
- Build hierarchy and structure and stick to it:
T-shirt > brand > v-neck, crewneck > color
- Avoid caps, and pay attention to hash usage and URL length
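Putting the practices above together, here is a minimal sketch of a slug builder that enforces lowercase keywords, avoids repeated words, and sticks to a fixed hierarchy (the category and brand names are hypothetical):

```python
import re

def slugify(text: str) -> str:
    """Lowercase the text and replace any non-alphanumeric run with a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def product_url(category: str, brand: str, style: str, color: str) -> str:
    # Fixed hierarchy: category > brand > style > color
    parts = [slugify(p) for p in (category, brand, style, color)]
    # Avoid repeating a segment that already appears earlier in the path
    deduped = []
    for part in parts:
        if part not in deduped:
            deduped.append(part)
    return "https://www.example.com/" + "/".join(deduped)

print(product_url("T-Shirts", "Brand 1", "V-Neck", "Blue"))
# https://www.example.com/t-shirts/brand-1/v-neck/blue
```

The key point is that the hierarchy is decided once, in code, so user selections can never reorder or duplicate path segments.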
So, coming back to our website: after identifying the issues with the structure and parameters, it was time to identify the priority URLs that needed to be fixed. This is where the log files came into play! We got the most important URLs from the log file analysis and started with the optimizations. *more fireworks*
Since Google uses mobile-first indexing, we first prioritized the URLs according to the number of crawl events recorded for Googlebot Smartphone. The brand URLs had the highest number of events.
When it came to desktop Googlebot and Bingbot, URL parameters were consuming most of the crawl budget.
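Log formats vary by server, but the prioritization step boils down to counting requests per URL per bot. Here is a minimal sketch against a combined-log-style format (the log layout and sample lines are assumptions, not the client’s actual logs):

```python
import re
from collections import Counter

# Matches a typical Apache/Nginx combined log line; we only need the
# request path (group 1) and the user-agent string (group 2).
LINE_RE = re.compile(r'"(?:GET|POST) (\S+) [^"]*".*"([^"]*)"$')

def count_bot_hits(lines, bot_token="Googlebot"):
    """Count requests per URL path for lines whose user agent mentions bot_token."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and bot_token in m.group(2):
            hits[m.group(1)] += 1
    return hits

sample = [
    '1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /t-shirts/brand_1-brand_2 HTTP/1.1" 200 123 "-" "Mozilla/5.0 (Linux; Android) Googlebot/2.1"',
    '1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET /t-shirts?sort=price HTTP/1.1" 200 123 "-" "Mozilla/5.0 bingbot/2.0"',
]
print(count_bot_hits(sample).most_common())
```

Sorting the resulting counter per bot gives exactly the kind of priority list we used: the most-crawled URL patterns get fixed first.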
Here is a summary of how we fixed the URL parameter and URL structure issues.
TL;DR Solution Summary
- We restructured the static brand URLs to use parameterized syntax
- We added rules in the Search Console for all existing URL parameters.
- We added a noindex tag on the pages that are already indexed but shouldn’t be.
- Some of the parameters were disallowed through the robots.txt file.
- We fixed the parameter order and structure.
- New hierarchy for the URL structure was developed and implemented where necessary.
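For reference, the robots.txt part of the fix uses standard wildcard syntax supported by Googlebot; the parameter names below are hypothetical stand-ins:

```
User-agent: *
# Block crawling of URLs carrying unimportant parameters
Disallow: /*?*sort=
Disallow: /*?*sessionid=
```

The noindex part is the standard robots meta tag, `<meta name="robots" content="noindex">`, placed in the head of each page that should drop out of the index. Note that a page must remain crawlable for the noindex tag to be seen, so the same URL shouldn’t be both noindexed and disallowed in robots.txt.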
Bots discover your website through links and URLs, so their structure can easily make or break a website. The best way to approach URL structure is to develop a strategy for it even before developing the website. Nonetheless, even if you don’t have a strategy yet, it is never too late to start. Although time-consuming, it is always possible to track and fix URL parameter and structure issues.