One of the essential tools an online marketer needs in his or her tool chest is a web crawler, also known as a spider: a program that crawls through your website, following all the links and discovering what’s broken, what’s missing, what’s too big, too small, or otherwise problematic. You’re about to learn what you should be doing with a crawler so that the most important crawler of all, Googlebot, is able to crawl your site efficiently and completely.
My guest for this episode, Rebecca Berbel, loves to geek out on all this crawling tech like I do. Rebecca is the Content Manager at OnCrawl. She’s never at a loss for technical SEO subjects to get excited about. She believes in evangelizing technology and using data to understand website performance on search engines.
In today’s episode, we talk about how crawling your website gives you information about things like your title tags, your meta descriptions, the length of these, and whether you have overly large images: all sorts of information that can help to improve your SEO, your page speed, and your user experience. We talk about the difference between duplicate pages and duplicate content. We talk about keyword cannibalization. If you don’t know what log files, link equity, or site trees are, you’re going to learn a lot.
Rebecca also has a great offer for you in terms of an extended free trial of OnCrawl. If you care about optimizing your site for search engines and for user experience, stay tuned for some super valuable knowledge.
In this Episode
- [00:30] – Stephan introduces Rebecca Berbel, the Content Manager at OnCrawl. She believes in evangelizing tech and using data to understand website performance on search engines.
- [06:18] – Rebecca talks about OnCrawl’s Inrank metric that offers a view of PageRank within your website.
- [11:40] – How to identify & fix keyword cannibalization?
- [19:32] – Stephan clears up the misconception that duplicate content is a penalty that Google issues against a domain.
- [26:04] – How often do you need to do an error check for your website?
- [30:51] – Stephan and Rebecca discuss the importance of having access to Google Search Console for website owners.
- [36:24] – Rebecca explains how OnCrawl can give you the data on how your website is performing based on what you should do.
- [42:55] – What is a log file analysis, and how can it be used for SEO?
- [47:58] – Rebecca shares some actionable insights you can get from log file analysis and OnCrawl crawler.
- [52:48] – Visit OnCrawl’s website at OnCrawl.com to learn more about making smart SEO decisions and to take advantage of their one-month free trial.
Rebecca, it’s so great to have you on the show.
Yeah. Thanks for having me, Stephan.
First of all, let’s talk about crawling, and we’ll get into some of the cool tech at OnCrawl, but I’d love to give some background for our listeners who aren’t SEO specialists. They want to understand why they need a crawler. Why can’t they just rely on the data coming out of Google Search Console? There are some nice little graphs and charts and things. Why do you need a third-party tool like OnCrawl or something similar?
That is a great question. Let’s just step back a minute and talk about what a crawler is. A crawler is a program that asks a website to provide information; it exists to collect information about one or more web pages. Lots of people use a crawler for different reasons. For example, search engines use a crawler to build their library, their index of websites that they can use to provide answers to a query. You have third-party tools that crawl all sorts of sites because they also need to create a database of information that they can provide to you as a user. For example, a keyword tool or a ranking tool needs to collect that information to be representative of the web so that when you compare your site to other sites, you have some real information. You can have less ethical uses of a crawler, such as crawling a website to maliciously copy its content, because the crawler can copy everything that it sees on the webpage. And you have site owners, who would use a third-party tool like OnCrawl to collect information about their own website. The types of information you might want to collect can be used for SEO. For example, you can collect all of the information for each page, such as titles, metadata, how many words are on the page, what types of words are on the page, and how long it takes to load the page. This is the type of information that you can collect on your pages, which might not be present in other tools. You’re using a crawl to create a database of information about your site that you can use to optimize it.
Got it. If you’re crawling your website, you’re getting information about things like what your title tags are, meta descriptions, the length of these, whether you have overly large images, all sorts of information that can help to improve your SEO, improve your page speed and user experience. I mean, if you have really big, unwieldy images, for example, that take forever to load, it’s bad for SEO and user experience.
Exactly. You can even go beyond that and look at the site as a whole. For example, you can look at how your site is structured using the internal links from one page to another, which can give you a sense of how deep your site is. How far down a sort of rabbit hole do users have to go to find your page? That’s a user experience question. But it’s also a search optimization question because we hear more and more today about how Google will look at your whole site. It will determine the importance of each page based on how you link to it. Pages that aren’t linked to often are not important on your site. Your site doesn’t promote them to users, so Google also doesn’t want to promote them to searchers. Things like that, which you can find in a crawl by collecting all of the links from all of the pages to see how many links go to one page and how many go to another, can help you find areas to improve on your site. That information can be difficult to find elsewhere unless you want to collect it by hand. But once you have more than 50 or 100 pages, that can be extremely laborious and time-consuming.
Right. Also, you get to see how deep a page is in the site tree, how many levels from the homepage a page is. That’s important in terms of how much link equity it’s going to inherit. If it’s very deep down in the site tree, it’s not likely to rank as well as a page that’s one click away from a secondary level page.
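To make the idea of depth in the site tree concrete, here is a minimal Python sketch that computes how many clicks each page is from the homepage using a breadth-first search. The link graph and the `click_depths` name are hypothetical illustrations, not how OnCrawl or Google actually measure depth.

```python
from collections import deque

def click_depths(links, home):
    """Return the minimum number of clicks from `home` to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical internal link graph: page -> pages it links to.
site = {
    "/": ["/shop", "/blog"],
    "/shop": ["/shop/cheese"],
    "/shop/cheese": ["/shop/cheese/brie"],
    "/blog": [],
}
depths = click_depths(site, "/")
print(depths)
```

Pages that never show up in the result are unreachable by following links from the homepage, which is one sign of an orphan page.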
Exactly. And link equity is also something that you can see in many crawling tools. For example, OnCrawl has the Inrank value, which is a link equity measure based on PageRank. We started with the original PageRank algorithm, then used the millions of pages that we’ve observed, and how Google treats them, to adapt it the way Google has adapted its own PageRank algorithm, to give you a sense of how important your page is. It takes into account links, it takes into account depth, and it helps you figure out which pages on your site are important and how you can modify the flow of link equity to push important pages in SEO. For me, that’s one of the most important things that you can do with a crawler. You can look and find information, but it’s not just surface information. It’s information that you can then process to get a more in-depth look at your site and what you can do to improve your SEO.
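For readers who want a feel for how link equity flows, below is a minimal sketch of the classic PageRank iteration applied to internal links only. It is not OnCrawl’s Inrank, whose exact formula is not given here; the damping factor of 0.85, the function name, and the tiny link graph are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration over an internal link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page in pages:
            # A page with no outgoing links spreads its rank over every page.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: "/b" receives links from two pages, "/a" from one.
ranks = pagerank({"/": ["/a", "/b"], "/a": ["/b"], "/b": ["/"]})
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the page with more internal links pointing at it ends up with a higher score, which is the behavior the conversation describes.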
Right. So if you’re looking at a page that you want to rank better, and it’s got a very low Inrank value, maybe you don’t even link to it yourself. I’ve seen that before: orphan pages that a client would like to have rank, and they don’t link to them themselves.
That’s one of the things that I’d like to talk about today. Orphan pages are an interesting case because they aren’t linked to from anywhere on your site, and the way most crawlers work is to look through a page, collect all of the links on it, and add those links to the list of pages to crawl next. So if there’s no link to your page, a crawler, whether it’s Google’s crawler, OnCrawl, or another third-party tool that you’re using to examine your site, won’t be able to find that page either. The question is, how do you find orphan pages? And that’s one of the things that OnCrawl does fairly well. We use all sorts of other sources of data to find pages that might be known or might exist. You can provide us a list of your paid campaign landing pages, and we’ll compare that to what our crawler finds. You can use your sitemap: does your sitemap contain pages that aren’t linked to from the site? That’s a really important signal because sitemaps tend to show pages that you want to rank in Google. But if a page isn’t linked into your site architecture, Google doesn’t know what to do with it, and that can severely devalue the page. We include data from other sources like Google Search Console and Google Analytics. Google Search Console shows pages that Google knows about from other sources; for example, there might be an external backlink to a page that is an orphan page. Google Analytics measures user behavior, so users might know about a page because it’s been linked from social media or an email, even though it’s not in your crawl. We also use log files. Do you think your listeners today are aware of what log files are?
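The comparison described here boils down to a set difference: URLs known from other sources minus URLs the crawl reached. The sketch below is a simplified illustration with made-up URLs and a hypothetical `find_orphans` helper; it does not represent OnCrawl’s implementation.

```python
def find_orphans(crawled, *known_sources):
    """URLs that appear in some other source but were never reached by the crawl."""
    known = set().union(*known_sources)
    return sorted(known - set(crawled))

# Hypothetical data: what the crawler reached by following links...
crawled = {"https://example.com/", "https://example.com/shop"}
# ...versus URLs listed in the sitemap or seen in server logs.
sitemap = {
    "https://example.com/",
    "https://example.com/shop",
    "https://example.com/landing/spring-sale",
}
log_urls = {"https://example.com/old-guide"}

orphans = find_orphans(crawled, sitemap, log_urls)
print(orphans)
```

Any number of extra sources (paid landing page lists, analytics exports) can be passed in the same way.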
They could be. Let’s actually define it for them, and then we’ll talk more about log file analysis and why you want to do that. Let’s just broad-brush cover what a log file is.
The crawl gives you information about your site that you can tell from looking at the site itself. Log files give you information about how visitors, whether people or other programs, interact with your site, because every web server records every single interaction, every single request that it receives to provide information. It records when the request happened, who requested it, what they requested, and what it returned: the status code and whether there was data to transfer so that the person could receive a webpage. These logs are essentially all of the activity on your site. If you look at the webpages that were requested, sometimes you see webpages that aren’t linked from your site. That gives you another source of orphan pages that you can compare to the crawl to find pages that haven’t been included.
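As an illustration of what a server records for each request, here is a small Python sketch that parses one line in the widely used "combined" access log format. The sample line, its IP address, and the `parse_line` helper are invented for the example, and real logs vary by server configuration.

```python
import re

# Fields of the combined log format: client, timestamp, request, status, size,
# then optional referrer and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Invented sample line for illustration only.
sample = (
    '66.249.66.1 - - [10/Oct/2020:13:55:36 +0000] '
    '"GET /old-guide HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
hit = parse_line(sample)
print(hit["time"], hit["path"], hit["status"], hit["agent"])
```

Note that a user-agent string can be spoofed, so a line that claims to be Googlebot is only a starting point for analysis.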
That’s so great. That’s really powerful tech that you guys have there. Let’s talk a bit more about sitemaps. XML sitemaps are kind of table stakes. They’re essential these days to ensure that you’re giving Google enough information about what’s canonical and what’s not, and they’re a signal for canonicalization. Let’s define canonicalization for our listeners. Why is it important to tell Google what is canonical and what isn’t? And how can people generate XML sitemaps?
This is a great question. This is essentially the core of SEO. Today’s websites provide content at a URL. But usually, they compose each URL that’s requested based on how you’ve defined the content that should go into it. Let’s take a typical WordPress site, for example. You’ll have a slug and your page content, but you also have author information and category information. And just as these sit in little tabs in different sections, your database stores this information separately. When somebody asks for a URL, your database has to compose that page. This means that the content and the URL are often separate, which consequently means that you can have multiple URLs that serve the same content. There are typical reasons to have a different URL: you might have a CMS that automatically creates a URL, but you’ve also specified a category mask or a product mask for this item. You might have a mobile page and a desktop page, so it’s the same content, but it’s served on a different URL. And when you have multiple URLs with the same content, that’s called duplicate content.
The problem for Google is to know which of these pages you want people to see. And the problem for us is that when you have multiple pages, multiple URLs, that show up for the same query, it dilutes the total amount of traffic that you could have for these pages. This is essentially keyword cannibalization, where multiple pages are sort of eating away at the traffic that one of them could be absorbing. The way that SEO treats these pages is to tell Google which of the pages should rank and to suppress the other pages from appearing in that query result. How do we do that? We have lots of different ways to signal to Google that one page is more important than another, but the most important is perhaps canonicalization. You insert an indicator at the top of the page that says, “The official URL for this page is x.” And you can do that even on page x and say, “This is the official page. Here are some of the other pages, but this one is the official version.” And when you indicate the official version, Google uses that as a signal to say, “Okay, that’s the version we want.” So if I’m looking at a desktop page, and the canonical version is the mobile page, Google will then go look at the mobile page and index that instead. One thing OnCrawl can do is look over that.
Just to clarify for our listeners, that’s what many SEOs call a canonical tag, although it’s more technically accurate to call it a canonical link element. It’s the rel=canonical in the head portion of the HTML code. Most CMSs support setting a canonical link element. It was presented by Google and the other search engines many years ago as a standard that they wanted webmasters, CMSs, and e-commerce platforms to all adopt, and everybody has.
They’ve adopted it so much that sometimes the signals conflict. For example, we mentioned earlier how you might put something in your sitemap, but then say that the page is not canonical, that a different URL should be used for that page. It’s a conflicting signal because when you send something in a sitemap, Google thinks that that’s what you want to be indexed. And when you say something is not the canonical URL, that’s not what should be indexed. So Google has started saying that the canonical is just a signal rather than something that they will officially obey. The important thing to understand here is that you want all of the signals about which version of your URL should be indexed to align. You don’t want to put something that isn’t canonical in a sitemap. You don’t want to put something that’s been redirected into a sitemap, and you don’t want to put something that isn’t linked to in an XML sitemap.
A great example of this kind of mistake, which happens a lot, is a site where the HTTP version isn’t canonical and the HTTPS version is, but both have self-referencing canonical link elements. So when you view the HTML source of the HTTP version of the page, it’s got HTTP in the canonical URL, and when you go to the HTTPS version, you see the HTTPS URL there. Both of those pages should have the HTTPS version in the canonical so that Google aggregates the link equity from any links pointing to either version into the HTTPS one. In actuality, the best approach, rather than relying on this canonical hint, a signal that may or may not be obeyed, is to 301 redirect the HTTP version to the HTTPS version. Then it’s foolproof, because that’s something Google has to obey. It’s not just a hint.
Exactly. Now we’re talking about something that I think is fairly easy to understand: you want to be consistent in what you tell Google about which pages you want to show up in search results. But what can be a little bit more difficult is how to get the list of pages that aren’t working correctly. And if we go back to crawling, that’s where a crawler comes in. We mentioned title tags, word count, and the number of links on or to a page. Another thing a crawler will do is record those link rel=canonical elements and group them. At least, OnCrawl will tell you, “In this group of canonicals that all reference one another, there are one or two or more that aren’t consistent.” We notice when pages form a group that you claim has the same content, or that we see has the same content, but a few of the canonical declarations are not correct or not consistent.
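A simplified way to picture this consistency check: for each group of URLs that serve the same content, every page should declare the same canonical URL. The sketch below flags groups where that is not the case, including the HTTP/HTTPS self-referencing mistake mentioned earlier. The data and the `inconsistent_clusters` helper are hypothetical and are not OnCrawl’s actual logic.

```python
def inconsistent_clusters(clusters, canonicals):
    """Return (cluster, declared canonicals) for groups that disagree."""
    problems = []
    for cluster in clusters:
        declared = {canonicals.get(url) for url in cluster}
        if len(declared) > 1:  # more than one "official" URL in the same group
            problems.append((cluster, declared))
    return problems

# Hypothetical groups of URLs that serve the same content.
clusters = [
    ["http://example.com/a", "https://example.com/a"],
    ["https://example.com/b", "https://example.com/b?ref=x"],
]
# Hypothetical rel=canonical values collected by a crawl.
canonicals = {
    "http://example.com/a": "http://example.com/a",    # self-referencing HTTP
    "https://example.com/a": "https://example.com/a",  # self-referencing HTTPS
    "https://example.com/b": "https://example.com/b",
    "https://example.com/b?ref=x": "https://example.com/b",
}

problems = inconsistent_clusters(clusters, canonicals)
for cluster, declared in problems:
    print(cluster, "declares", sorted(declared))
```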
Yep, that makes a lot of sense. And just to differentiate for our listeners, duplicate content and duplicate pages are two different things. Duplicate content can be partial, where maybe 20% of the page comes from one source on your site, and then another 20% from another, and so forth. So the whole page, in aggregate, is a duplicate of pieces of five different other pages. But it’s not a duplicate page because it’s not the same content from another page of your site in total.
Right. And so that’s also something that you would want to look at if you’re using a third-party tool to crawl your site and examine duplicate pages. How much of the page is duplicated? Is it 5%, or is it 93% of that page? A page that has, say, as you mentioned, 20% duplicate content from one source and 20% from another is not as critical as a page that is 95% the same as another page on your site.
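One common way to put a number on "how much of the page is duplicated" is to compare overlapping word sequences (shingles) between two texts. The sketch below uses Jaccard similarity over three-word shingles; this is a generic textbook technique chosen for illustration, and the source does not say that OnCrawl or Google measure duplication this way.

```python
def shingles(text, size=3):
    """Set of overlapping word sequences of length `size` in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(a, b, size=3):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Two made-up snippets that differ only in their last word.
score = similarity("the quick brown fox jumps", "the quick brown fox sleeps")
print(f"{score:.0%} similar")
```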
Right. Because then you’ll get cannibalization in the search results. Google will filter out one version versus the other because Google wants to display diversity in the search results: Query Deserves Diversity, or QDD, for example. And what happens then is that the version you prefer may be the one that gets filtered out by Google’s duplicate content filter. So that’s why you want to specify canonical link elements, to kind of steer Google in the right direction of which version it should favor in the search results.
That’s true. That can then create different issues on your website. If the wrong page shows up in the search results, you might not have a clearly defined user journey from that page, and you might not have optimized it for that type of search intent. So that can have an impact on how your site performs for your business in terms of revenue, conversions, and leads.
Now there’s this misinformation or myth out there that duplicate content is a penalty that Google issues. It’s not a penalty; it’s a filter. You can even see it in action if you do, let’s say, a site: query of your site, maybe even with a keyword added. So type site:, then whatever your site’s domain is, and then a keyword like, I don’t know, cheese or something, if you sell cheese, and see if at the end of the search results there’s a little message that says some results have been omitted. Essentially, you click there to add those omitted results back in. That’s one of the duplicate content filtering algorithms at work. They’re filtering out near-duplicates from the results.
Yeah, and as you said, that’s not an official penalty against your domain. It’s just a filter that removes some of your pages, possibly ones that you would have wanted to appear in place of others in those results.
Yeah. And that’s a user experience thing that if all those duplicates showed up in Google on a search query, it would not be a great user experience. It’d be the same content over and over and over again. Users would get very frustrated by that.
Let’s now talk about errors and status codes. Some are good, and some are an indicator that something’s gone wrong. So let’s talk a bit about that.
Okay, if we go back to what we were saying about how your server works: when somebody asks for a webpage, your server will tell the requester, whether that’s somebody’s browser or a program that is crawling the site, “This page is available,” “Sorry, this page has moved,” “Can’t find this page,” or “I’ve got a personal problem. I can’t answer any queries right now at all.” Those are your HTTP status codes. You have the 200, and usually, that’s when you have a page that works perfectly well. You have the 300 series, which are ways of telling the requester that this page has been moved either temporarily or permanently. You have the 400 series, which are called client errors, which means the requester requested something incorrectly. And the 500 series, which are server errors, means the server was unable to complete the request. You also have a few other weird ones like, I think it’s 418, which is “I’m a teapot.” So every once in a while, you get some strange errors.
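The status code families described here can be summarized in a few lines of Python. This is only an illustrative sketch: `status_class` is a made-up helper, while `HTTPStatus` comes from Python’s standard library and supplies the official reason phrases.

```python
from http import HTTPStatus

def status_class(code):
    """Map an HTTP status code to the broad family it belongs to."""
    families = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    return families.get(code // 100, "other")

# A few codes that come up in SEO audits.
for code in (200, 301, 302, 307, 404, 410, 500):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```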
I didn’t know about that one. There’s also the 307, which most people don’t know about. They do hear about 301s versus 302s: the 302 is the temporary redirect, and the 301 is the permanent one. But there’s also a 307.
Yeah, 410 is permanently gone, which Google still treats essentially like a 404. Like you can bring back a 410 page from the dead, and Google can put that back in the index.
Right. And even some of them like the 307. I know there was a recent webmaster’s talk on that. The 307 essentially says this has been moved, but it’s also a browser response. So you might not have a web server saying this was a 307 code. It might be a browser that says, “No, I know that this needs to be an HTTPS version. So go check out the HTTPS instead of HTTP that you requested.”
Now, let’s say that you have a lot of 404 status codes showing up on your site. You have links to pages that have been deleted, products that have been discontinued, blog posts that are no longer relevant, so you went in and turned them back into drafts, whatever. And so now you have all these broken links. The way I like to convey the importance of this problem is with a philosophy I learned about in the book The Tipping Point by Malcolm Gladwell, called Broken Windows. What happened in New York City was there was a whole lot of crime. It was terrible, really terrible, like in the 80s. And then I think it was in the 90s that this policy called Broken Windows was put in place. What the police and the government did was focus on just the little stuff, making sure that broken windows in neighborhoods were fixed.
Graffiti was cleaned up, and turnstile jumpers were stopped from entering the subway that way. Then, when people didn’t see these minor crimes happening, they also didn’t see the major crimes happening. Murders and so forth dropped because it looked like the place was cleaned up and being well-policed. So the violent crimes significantly dropped just by cleaning up graffiti. Now imagine applying this to your website. If you have graffiti, broken windows, boarded-up windows, things that don’t look very good to the search engines, it looks like you have a very unkempt neighborhood, and that sends a signal to Google that nobody’s home, nobody cares. Don’t do that. Clean up the broken links. You need to use a crawler to help you identify these broken links. Now, Google does provide some information about this inside Google Search Console. But it’s always a good insurance policy to use a third-party crawling tool, such as OnCrawl, to get a complete list of all the broken links that you have linked to internally from your site.
That can also be useful if you have a very large site or a site that changes often. On a lot of e-commerce sites, where new products come in and products go in and out of stock, or media sites, where you’re constantly creating new pages, there is often a delay in what you see in Google Search Console. Whereas if you crawl a site, it’s exactly what you see that day. So that can help you keep abreast of new broken links and the state of a website that is constantly in flux.
Great point. How often do you recommend that someone do an error check for 404s and 500 style server errors and all that? Is it weekly, monthly, quarterly, how often?
I’m gonna have to be a real SEO here and say it depends. Because it does depend: it depends on your website. If you have, for example, just a basic corporate website that you update every six months or every year, that’s not something that’s going to change very often. So if you run a crawl to check up on your basic website health every six months or every year, you know that not a lot is going to change between one crawl and another. If you have a website where you post to your blog every week, you might want to check up on that every month because things change a lot on that site. If you have a website that posts ads, or coupons, or something that happens frequently, with a large volume of new pages, particularly if those things are automated, that’s something you want to check up on even more frequently. You might want to look at that every single week, for example, because you know that things appear and disappear and change frequently. Every time you have a frequent amount of change, you have a potential for broken links and other difficulties.
And speaking of when things change a lot on your site, you should be updating your XML sitemaps file just as frequently as the pages change on your site. So let’s say that you add a new piece of content every single day. Well, every single day, you should update your XML sitemaps as soon as the new page surfaces or is published on your site.
And as with canonical declarations, most CMSs either will do that sort of thing for you or have plugins that will.
Now, what if it’s an old-style hand-coded website, and they don’t have XML sitemap generating capabilities? Does OnCrawl provide an XML sitemap generator?
We don’t have a generator. So we are a data platform where we collect data, analyze it, and provide you with the results. So if you need to generate an XML sitemap, there are sitemap generators online. You can also use crawl results, export the list then of everything we’ve crawled, and convert that from a list to an XML sitemap. So there are plenty of solutions. With an XML sitemap, what you want to make sure is that all your important pages are listed there. You don’t necessarily have to list every single page on your website. But definitely, pages that you want users to be able to find via search should be there.
Yeah, and no duplicates. So the stuff that’s canonical only should show up in your XML sitemaps. Because otherwise, you’re going to confuse Google by giving non-canonical URLs in your XML sitemap.
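Converting an exported list of canonical URLs into an XML sitemap, as suggested above, can be done with Python’s standard library. This is a minimal sketch following the sitemaps.org format; the `build_sitemap` name and the example URLs are assumptions, and a production sitemap may also want optional fields such as `lastmod`.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Build a sitemap XML string from URLs, dropping exact duplicates."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in dict.fromkeys(urls):  # de-duplicate while keeping order
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical export of canonical URLs, with one accidental duplicate.
sitemap_xml = build_sitemap([
    "https://example.com/",
    "https://example.com/shop",
    "https://example.com/",
])
print(sitemap_xml)
```

The function only removes exact duplicates; deciding which URLs are canonical still has to happen before the list is passed in.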
We think Google knows a lot today because we always Google things to find out answers. But Google is just a collection of algorithms. So it gets confused fairly easily. As sophisticated as the algorithms are, you don’t always want them to impose their solution. You’d rather be able to indicate something very clearly so that you know that what you want is what happens.
Speaking of which, inside of Google Search Console, there’s a tool that allows you to specify which parameters in the URL are essential and which ones are superfluous. It’s set by default to let Googlebot decide. But you can go in and say, “This parameter is superfluous. It’s not something that changes the content, and it’s just a tracking parameter. Please ignore it, Google.” What are your thoughts about that tool?
I love that tool. I don’t work a lot with, for example, e-commerce sites, where this can be extremely useful. But having the ability to indicate clearly which parameters that get tacked on to the end of your URL are important to take into account in search results, and which aren’t, can be such a help. I mentioned e-commerce just a moment ago, and you often see parameters used in e-commerce to filter internal search results. So if you’re looking for a t-shirt on an e-commerce site, you might be able to filter by white t-shirts or t-shirts under 20 euros or dollars. And these filters will often show up in the URL as parameters. Site owners often don’t want Google to index every single type of search result. They just want a category page to show up in the search results. Being able to tell Google directly that these types of parameters are not useful is a huge help.
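The same idea, treating some URL parameters as superfluous, is useful when you normalize URLs in your own crawl or log data. Here is a minimal sketch using Python’s `urllib.parse`; the list of ignored parameters and the `strip_params` helper are assumptions for illustration, and which parameters are safe to ignore depends entirely on the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of parameters that do not change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def strip_params(url, ignored=IGNORED_PARAMS):
    """Return the URL with ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

clean = strip_params("https://example.com/t-shirts?color=white&utm_source=newsletter")
print(clean)
```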
Now, that’s just one of many tools inside of Google Search Console. It’s mind-boggling to me that some website owners or marketing managers still don’t have access or even any knowledge about Google Search Console. So, everybody who’s listening who has a website they care about or works for a company where the website is something that is part of your job, you need to have Google Search Console access, and it’s free.
Yeah. I’m sometimes amazed at how many useful free tools Google makes available: not just gimmicky elements, but tools that are essential to the daily function of a website that appears in Google search. And Search Console is one of those. I can give an example of one of the sites that I followed recently that had a fairly minor update. But in Google Analytics, which most people are familiar with, we saw a huge drop in traffic. So we were wondering, what happened, what did we do wrong, and what is going on here? And one of the first things we checked was the organic traffic in Google Analytics compared to what Google Search Console told us we were getting in terms of number of clicks from search pages. And we realized that the problem wasn’t really our website’s traffic; it was mostly how our website allowed Google Analytics to track different elements. So there wasn’t a drop in traffic. There was a drop in reporting. This is one of the ways that Google Search Console can be really interesting. It tracks your rank, and it tracks your keywords: not the keywords that you think, or that a tool thinks, you rank for, but the keywords that Google records that it has ranked you for. It tracks the number of times that the pages on your site have appeared for those keywords and the number of times that people have clicked on your site when it appears for those keywords. That is essential information for a robust and functional SEO strategy on Google.
For sure. Speaking of what you were describing, the tracking being off in Google Analytics so that the reporting showed the traffic going away, I had that happen with one of my clients recently. They implemented a visual redesign of their blog but didn’t change the linking structure or much of anything about the back end. It was just a visual redesign of the look. And the traffic disappeared; organic traffic completely went away. I thought, “This doesn’t make sense. This is not right.” So I had a quick look. And Google Search Console still showed that the blog was ranking and getting clicks and so forth. So I looked, and I saw that it was the GA code. I think it was completely left out of the HTML code. From memory, that’s what I think happened. So they were like, “Oh, thank goodness, the traffic didn’t go away. I thought we’d have to wait several weeks or months until it came back because of the redesign.” And I said, “No, no, no. Traffic is still there. You’re still ranking, and you’re still getting organic clicks.” So yeah, make sure that the Google Analytics or Google Tag Manager code is always in place and never gets lost in the redesigns or revamps that you’re doing.
And that’s one of the other advantages of Search Console. You don’t have the same type of user tracking information, but the information in Google Search Console comes from Google. It doesn’t come from your site; it comes from what Google knows and has stored about your site. So there’s nothing to add to your site. It can’t go wrong. You might just have a delay in Google’s ability to provide it to you. But there’s nothing that you can do to suddenly not have that information.
Yeah. And what I recommend my clients do is set up Google Search Console with a domain property so that all subdomains, HTTP and HTTPS, all those different variants, are aggregated together and you get reporting on all of it, instead of having to claim each separate subdomain and each HTTP and HTTPS version. It’s all just in one bucket.
Yeah. Sometimes it can be useful to split those apart, for example, if you have subdomains for different regions or different languages, where you’ll see a very different performance and a different type of keyword showing up for one region than for another. But having a global view is extremely useful.
That’s a great point. You can also claim a separate GSC property for a subdirectory, not just a subdomain. So let’s say that you have a language or country version of the site that’s part of the main site; it’s just in a subdirectory. You can claim that as a separate site in Google Search Console, and still have the domain property as well, and as many of these separate URL properties as you want. Just realize that you can take a certain subdirectory and kind of carve that out, and make it its own Google Search Console setup.
And treat it as its own site.
Yeah. And get all these insights into what’s happening with just that little portion of the website. Now, why would you want to hook up a tool like OnCrawl to Google Search Console so that there’s data sharing coming from Google Search Console into the crawling platform like OnCrawl?
That’s a great question. We just talked about what’s in Google Search Console: that’s what Google knows about your site and how Google uses your site in its search results. And then there’s the crawl, which is information about your site based on what is on your site. So these are complementary types of information. And sometimes, it helps to be able to see how one interacts with another. You might have a subset of pages on your website that are missing titles, so the title tag is either not there or it’s empty. That might be something you’d want to correct in general. But you might not have enough bandwidth in your marketing team or your SEO team to take care of minor things like that. But if you cross that data with data from Google Search Console, and you realize that none of those pages are ranking according to Google Search Console, that might be an indicator that those pages need to be revamped, and that it’s a priority: not just something that we have to do one day because it’s not good, but something that might be affecting the ability of those pages to come up in search results and then to bring traffic to your site.
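As a rough illustration of crossing crawl data with Search Console data, here is a minimal Python sketch. The record layouts, field names, and URLs are assumptions made up for this example; they are not OnCrawl’s or Google’s export formats. It splits pages with a missing or empty title by whether they receive any search impressions.

```python
# Hypothetical crawl export: one record per URL with the <title> text found
# (None or "" when the title is missing or empty).
crawl_rows = [
    {"url": "/shoes/", "title": "Shoes"},
    {"url": "/hats/", "title": ""},
    {"url": "/socks/", "title": None},
    {"url": "/belts/", "title": None},
]

# Hypothetical Search Console export: impressions and clicks per URL.
gsc_rows = [
    {"url": "/shoes/", "impressions": 1200, "clicks": 85},
    {"url": "/hats/", "impressions": 0, "clicks": 0},
    {"url": "/belts/", "impressions": 300, "clicks": 12},
]


def missing_title_priorities(crawl_rows, gsc_rows):
    """Split pages with a missing or empty title by whether they appear in search."""
    gsc_by_url = {row["url"]: row for row in gsc_rows}
    not_ranking, ranking = [], []
    for page in crawl_rows:
        if (page["title"] or "").strip():
            continue  # title present, nothing to fix on this page
        stats = gsc_by_url.get(page["url"])
        # A URL absent from the Search Console data is treated as zero impressions.
        impressions = stats["impressions"] if stats else 0
        (ranking if impressions > 0 else not_ranking).append(page["url"])
    return {"fix_first": not_ranking, "fix_later": ranking}


priorities = missing_title_priorities(crawl_rows, gsc_rows)
print(priorities)
```

Pages in `fix_first` have no title and no impressions, which is the case described above where a best-practice gap may actually be holding pages back. The split only suggests a priority order; it does not prove that the missing title is the cause.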
Yeah, that makes sense. So it helps uncover what are the burning fires versus the “best practice” that you should ideally get to, but it may not be a priority.
Right, exactly. So your crawler can tell you how your site is performing based on best practices, based on what we should do. But then you compare that to how your site performs. With Google Search Console, it’s how your site performs on Google. With Google Analytics, it’s how users interact with your site. That comparison tells you where those elements have an impact. What types of best practices correlate with improved performance? And which of those best practices don’t seem to have an impact on your site?
It might be other search engines, or it might be the SEMrush bot or the OnCrawl bot that has come to look at your site: in the case of SEMrush, because they’re creating a database of information for their users, or in the case of OnCrawl, because someone, probably you, wants to crawl your site. All of that information is present in a log file, but you won’t see Bingbot information in Google Search Console. You won’t see bot information in Google Analytics because most of it, though not all of it, is filtered out. And by looking at how bots, particularly search engine bots, look at your site, you can have more information about what is indexable and what is not indexable. Are there sections of your site that bots never visit? Or are there pages on your site that really shouldn’t be visited by a bot, that Google is coming and visiting every single day? Those are things that you can see in your log files, but that won’t turn up in certain other tools.
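To make the idea of reading bot visits out of a log file concrete, here is a minimal Python sketch. It assumes the server writes the common "combined" access log format, and the sample lines, IP addresses, and paths are invented for illustration. It identifies bots only by a token in the user-agent string, which can be spoofed, so treat the counts as indicative.

```python
import re
from collections import Counter

# Assumed layout: the "combined" access log format used by many web servers.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Lowercase user-agent tokens mapped to a display name.
KNOWN_BOTS = {"googlebot": "Googlebot", "bingbot": "Bingbot", "semrushbot": "SemrushBot"}


def bot_name(user_agent):
    """Return the display name of a known bot, or None for everything else."""
    ua = user_agent.lower()
    for token, name in KNOWN_BOTS.items():
        if token in ua:
            return name
    return None


def bot_hits(lines):
    """Count hits per (bot, path), skipping lines that do not parse."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        name = bot_name(match["agent"])
        if name:
            hits[(name, match["path"])] += 1
    return hits


# Invented sample lines for illustration.
sample_log = [
    '66.249.66.1 - - [10/Oct/2020:13:55:36 +0000] "GET /cart/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [11/Oct/2020:13:55:36 +0000] "GET /cart/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '157.55.39.1 - - [11/Oct/2020:14:01:02 +0000] "GET /blog/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '203.0.113.7 - - [11/Oct/2020:14:05:00 +0000] "GET /blog/ HTTP/1.1" 200 2048 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/86.0 Safari/537.36"',
    '198.51.100.9 - - [11/Oct/2020:14:06:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"',
]

hits = bot_hits(sample_log)
for (bot, path), count in hits.most_common():
    print(bot, path, count)
```

Comparing the paths in `hits` with the list of pages you want indexed is one way to spot sections bots never visit, or pages such as a cart that a bot keeps requesting.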
That’s an important distinction. If an SEO is not doing log file analysis, they’re missing some important data and potentially actionable insights that they could take from that data.
Log file analysis can be a challenge because it’s pretty technical. You need something that knows how to read these files because they’re encoded in a specific way. You then need to be able to analyze that data to understand it, because that’s essentially what Google Analytics does. It takes the type of information that you might have in a log file for the subset of visitors that are probably humans, and it aggregates it and draws conclusions. So, based on their IP address and location, these are the people that we can identify as coming from the US. These are the number of sessions, and a session is defined in a certain way. In Google Analytics, it starts and runs for 30 minutes, and then, if the user leaves, it will create a new session afterward. So they’ve grouped interactions from a single user into a session. They might say these are the number of hits, or the number of visits, from people using a smartphone. That’s also information that you can find in your log.
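The grouping of raw hits into sessions can be sketched in a few lines of Python. This is a simplified illustration, not Google Analytics’ actual logic: it assumes a session ends after 30 minutes without a hit from the same visitor, and the visitor IDs and timestamps are invented.

```python
from datetime import datetime, timedelta

# Assumption for this sketch: a gap of more than 30 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def count_sessions(hits, timeout=SESSION_TIMEOUT):
    """hits: iterable of (visitor_id, datetime). Returns {visitor_id: session_count}."""
    last_seen = {}
    sessions = {}
    for visitor, timestamp in sorted(hits, key=lambda hit: hit[1]):
        previous = last_seen.get(visitor)
        if previous is None or timestamp - previous > timeout:
            sessions[visitor] = sessions.get(visitor, 0) + 1
        last_seen[visitor] = timestamp
    return sessions


# Invented hits: visitor "a" returns after a 50-minute gap.
sample_hits = [
    ("a", datetime(2020, 10, 10, 9, 0)),
    ("a", datetime(2020, 10, 10, 9, 10)),
    ("b", datetime(2020, 10, 10, 9, 5)),
    ("a", datetime(2020, 10, 10, 10, 0)),
]

session_counts = count_sessions(sample_hits)
print(session_counts)
```

In a real log, the visitor ID would have to be approximated, for example from IP address plus user agent, which is one reason processed analytics and raw logs rarely agree exactly.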
Google Analytics is a way of looking at all the types of information that you can have in your log files, but it’s already been processed. The log file is the raw information. So you might need a log analyzer like the one OnCrawl offers to be able to look at that information. When you’re looking at it for bots, that’s not something that’s offered in a tool like Google Analytics. Where bots are involved, you might see information like whether it’s a mobile bot or a desktop bot. And as I’m sure you know, Google is moving towards what they call the mobile-first index, which means they’re using their mobile bot to create their index. So they’re using the pages that your website provides to a smartphone or a mobile phone, rather than the desktop version, to create the library of pages that they’ll provide in search results. And site by site, they’re moving sites over to that version of the index. The thing is, you didn’t always know when your site was moved. We found that almost all of our clients discovered that it had happened before Google announced it, because there was an increase in the share of mobile Googlebot hits on their site until a certain point when it was almost exclusively mobile Googlebot. So that was one thing that they could see in their log files: when that move started and when it happened. And this occurred before Google announced to them that they had changed.
Right. If the user agent shows as Googlebot mobile, that’s Google’s mobile bot that’s processing your website and putting the data into its index, whereas if it’s just regular Googlebot, that’s the old version. With Google moving to mobile-first, and most websites, I think, having been migrated over, they’re going to be mobile-only. So it’s not that desktop versions will get less value or just be seen as an additional resource. No, the desktop version will be ignored. If there are content and links present on the desktop version of the site that are not present in the mobile version, you’re out of luck. That won’t get counted.
Yeah, and that’s part of why it’s so important to know whether your site has already been moved or not. Although, as you said, they’re moving all sites, which should be in place, I think, by March 2021. So just a little bit of time left before you’re stuck with the mobile version of your site.
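The shift from desktop to mobile Googlebot can be tracked with a small script over parsed log hits. The Python sketch below is an illustration under assumptions: the classification rule (a Googlebot user agent that also mentions Android and Mobile counts as the smartphone crawler) is a simplification, the dates are invented, and other Google crawlers that contain "Googlebot" in their name are not separated out.

```python
from collections import defaultdict

# User-agent strings in the style Google publishes for its crawlers.
DESKTOP_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def classify_googlebot(user_agent):
    """Return 'mobile', 'desktop', or None when the hit is not Googlebot."""
    ua = user_agent.lower()
    if "googlebot" not in ua:
        return None
    return "mobile" if "android" in ua and "mobile" in ua else "desktop"


def daily_mobile_share(hits):
    """hits: iterable of (day, user_agent). Returns {day: mobile share of Googlebot hits}."""
    counts = defaultdict(lambda: {"mobile": 0, "desktop": 0})
    for day, user_agent in hits:
        kind = classify_googlebot(user_agent)
        if kind:
            counts[day][kind] += 1
    return {
        day: c["mobile"] / (c["mobile"] + c["desktop"])
        for day, c in sorted(counts.items())
    }


# Invented hits: mostly desktop in June, mostly mobile in September.
sample_hits = (
    [("2020-06-01", DESKTOP_UA)] * 3 + [("2020-06-01", MOBILE_UA)]
    + [("2020-09-01", MOBILE_UA)] * 3 + [("2020-09-01", DESKTOP_UA)]
    + [("2020-09-01", "Mozilla/5.0 (Windows NT 10.0) Chrome/86.0")]
)

mobile_share = daily_mobile_share(sample_hits)
print(mobile_share)
```

A share that climbs toward 1.0 over a few weeks is the pattern described above, where the move to mobile-first indexing shows up in the logs before any announcement.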
Yeah, a little bit. Okay, so there are these bad bots that are out there that are trying to do malicious things such as scan websites for vulnerabilities, like old versions of WordPress and old versions of WordPress plugins that it can then use as an exploit to get in or scraping content, as you said earlier, to steal and put on their sites. So what do you do to block bad bots from accessing your site and stealing your stuff?
That’s a good question. That’s not as much my domain. But there are all sorts of tools that you can use on your website to tell your server and your website what types of people are allowed to ask for information. So if you tell your server not to provide information to certain user agents, those are what we just said, the identification of someone asking for information from your server, it will respond that that content is forbidden or not allowed.
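As one possible illustration of telling a server not to answer certain user agents, here is a small Python sketch written as WSGI middleware. The blocked tokens are hypothetical names, and the approach is only an assumption about how a site might do this; many sites configure it in the web server or a firewall instead. Because a user agent is self-declared, this only stops bots that identify themselves honestly.

```python
# Hypothetical user-agent tokens to refuse (lowercase, matched as substrings).
BLOCKED_AGENT_TOKENS = ("badbot", "contentscraper")


def block_user_agents(app, blocked_tokens=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app so that requests from blocked user agents get a 403."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked_tokens):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real website."""
    body = b"Hello"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


protected_site = block_user_agents(site)


def request_status(app, user_agent):
    """Call a WSGI app with a minimal fake request and return the status line."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    app({"REQUEST_METHOD": "GET", "PATH_INFO": "/", "HTTP_USER_AGENT": user_agent}, start_response)
    return captured["status"]


print(request_status(protected_site, "BadBot/1.0"))
print(request_status(protected_site, "Mozilla/5.0 (Windows NT 10.0) Chrome/86.0"))
```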
Yep. So this is a funny thing I often see is a big list of bad bots listed in a robots.txt file to disallow them from the site. And as if a bad bot is going to obey what’s in a robots.txt file, like, “Oh, okay, you don’t want me to steal your stuff? I’ll go away. Nevermind.” That’s kind of ridiculous. Badly behaved bots are not going to just voluntarily obey your requests that you make in a robots.txt file.
Right. But I mean, it’s a reasonable thing to think, “Oh, yeah, I use a robots.txt file to provide information to bots, so they know what they can and can’t crawl.” But the next step is to understand that you have to, as a bot, be told to go look for that file and then be told to follow that file.
So only the nice bots go into the robots.txt file and obey what’s in there, like OnCrawl’s bot.
Yep. That causes us issues sometimes because people tell us, “Yeah, but I want to do something different. I want to pretend I’m Googlebot.” You can pretend you’re Googlebot, but OnCrawl follows the rules that are provided for good bots on sites. So to pretend you’re Googlebot, we have different solutions. And you need to be able to verify that you are the owner of the site before we’ll put those in place.
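The point that robots.txt is only honored by bots that choose to check it can be seen in how a polite crawler is written. The Python sketch below uses the standard library’s `urllib.robotparser`; the robots.txt content, the bot names, and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_fetch_allowed(user_agent, url):
    """The check a well-behaved bot runs before requesting a URL."""
    return parser.can_fetch(user_agent, url)


print(polite_fetch_allowed("GoodBot", "https://example.com/blog/"))
print(polite_fetch_allowed("GoodBot", "https://example.com/private/report"))
print(polite_fetch_allowed("BadBot", "https://example.com/blog/"))
```

Nothing on the server enforces these answers. The check runs inside the bot, so a scraper that never calls it, or ignores the result, still gets whatever the server is willing to send.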
All right. I wanted to also talk to you a little bit about what sort of actionable insights can you get from either log file analysis or from the OnCrawl crawler that we haven’t already talked about?
It depends on what you’re trying to do. If you have something that you’re specifically trying to do, usually, you can find the set of pages that you need to work on using a log analyzer and a crawler. You can also find the elements in those pages that are the most critical to work on. If you’re looking at crawl data, and you’ve pulled in your Google Search Console information and your Google Analytics information, you can find all sorts of really interesting information about behavior, and therefore how you should create your site to accomplish your marketing goals. A good example that I’ve been looking at recently is using Google Search Console data and crawl data to find out whether structured data gets you more traffic or less traffic.
So structured data often will provide you with rich snippets or even featured snippets on the search results pages. Sometimes this can be a great thing because of its visibility. It encourages people to click on the first thing they see, which is you. In other cases, this might not be as useful in terms of getting traffic because of what we call zero-click searches. Google is trying to provide an answer directly on their search pages, and that might be your structured data providing them the information they need to respond to the query. So you’re visible, but you’re not being clicked on. Using a crawler and Google Search Console, you can find those cases: is it one way, or is it the other way? Then, based on your marketing strategy, you know what to do. Should you be using more or less structured data?
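One way to look at the structured data question is to compare click-through rate for pages with and without structured data. The Python sketch below assumes a merged data set with hypothetical field names and invented numbers: the structured data flag would come from the crawl, and clicks and impressions from Search Console.

```python
# Hypothetical merged rows: crawl flag plus Search Console clicks and impressions.
pages = [
    {"url": "/recipe/soup", "has_structured_data": True, "clicks": 30, "impressions": 1000},
    {"url": "/recipe/cake", "has_structured_data": True, "clicks": 20, "impressions": 1000},
    {"url": "/recipe/stew", "has_structured_data": False, "clicks": 40, "impressions": 800},
]


def ctr_by_structured_data(pages):
    """Return click-through rate (clicks / impressions) for each group of pages."""
    totals = {
        True: {"clicks": 0, "impressions": 0},
        False: {"clicks": 0, "impressions": 0},
    }
    for page in pages:
        group = totals[bool(page["has_structured_data"])]
        group["clicks"] += page["clicks"]
        group["impressions"] += page["impressions"]
    labels = {True: "with_structured_data", False: "without_structured_data"}
    return {
        labels[flag]: (t["clicks"] / t["impressions"] if t["impressions"] else None)
        for flag, t in totals.items()
    }


ctr = ctr_by_structured_data(pages)
print(ctr)
```

A lower click-through rate in the structured data group would be consistent with the zero-click pattern, but it is only a correlation: ranking position, query type, and page topic also affect clicks and are not controlled for here.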
Yeah. And there’s a surprisingly large number of zero-click searches where somebody searches. They don’t click on anything in the search results. They just got their answer. And they’re out of there.
If your strategy for those pages, where you appear on those results, is a visibility strategy, you need to be more present, you need to do whatever it is you’re doing, to continue to get those results pages and to appear there. Whereas if you’re using those pages to get traffic and conversions, that might not be the best strategy for those keywords.
Last question. What do you tell somebody who is using a free tool, like Xenu’s Link Sleuth, to look for broken links on their site and nothing else essentially? Because it’s a very limited tool. What do you tell them about what they should do instead if they don’t have any budget to afford a robust, paid tool like OnCrawl?
I think the most important thing is that you should be able to look for the same types of information. So if you’re using a free tool, particularly because it’s a budget issue, what’s going to happen is you’re probably going to have a lot of tools. Each of these individual types of information is something that you can get; this is not hidden information. A crawl collects publicly available information because you’ve published it on your website. Log files collect information that is available to anyone because it’s what a server tells anyone asking for that page. Google Analytics is free. Google Search Console is free. But these are all separate sources of information. So if you don’t have a crawler, if you can’t pay for a crawler, you’re going to need something that collects broken links and something that collects your titles. If you’re programmatically inclined, you might want to write something that can scrape your pages to get this information. So it’s possible to collect all of this information and then analyze it by hand. But sometimes, the amount of time you spend finding the correspondence between one tool and another and bringing all that data together is worth whatever you might pay monthly for a more robust tool.
Yeah. Well said. I think it’s penny-wise and pound-foolish to not invest in robust, paid tools such as OnCrawl. I love it. I think it’s an essential part of my toolset.
That’s great to hear. As you said, a crawler allows you to collect and analyze a large amount of information, rather than relying on individual tools that might not be able to bring that information together.
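For someone programmatically inclined, the starting point for a do-it-yourself check might look like the Python sketch below. It uses only the standard library and works on an HTML string, so the page markup here is invented and the fetching step is deliberately left out. It pulls the title and the link targets out of one page.

```python
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Collect the <title> text and every <a href> value from one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def audit_page(html):
    """Return the stripped title and the list of link targets found in the HTML."""
    parser = PageAudit()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


# Invented page markup for illustration.
sample_html = (
    "<html><head><title> Blue Hats </title></head><body>"
    '<a href="/hats/red">Red hats</a><a>no target</a>'
    '<a href="https://example.com/">Home</a>'
    "</body></html>"
)

report = audit_page(sample_html)
print(report)
```

An empty `title` flags a page with a missing title, and each entry in `links` is a URL whose HTTP status would still need to be requested and checked to find broken links. Doing that across a whole site, and then matching the results to analytics data, is the manual work a paid crawler takes off your hands.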
You provide a two-week trial to OnCrawl.
Yes, we do. So it’s a full-function trial for two weeks. And it’s free, and you don’t need to enter anything to be able to get it.
But you’re gonna do something special for our listeners.
Yeah, anyone listening to this, you are welcome to have a one-month trial. So we’ll extend that free trial for enough time for you to get a better sense of the tool. You can sign up directly for the free trial on our website, www.OnCrawl.com. And we will extend that for you if you contact us to tell us that you heard about it from this podcast.
Okay, so they would send an email or a support request and say, “Can you extend the trial based on the fact that I listened to this episode”?
Yeah, there’s chat support within the application itself. So feel free to reach out to us there.
Awesome. That’s a very generous offer. So thank you for providing that.
Thanks. We hope it helps people get used to looking at their website from a crawler’s point of view or Googlebot’s point of view, and helps them improve what they’re doing in marketing and SEO.
So thank you so much. This was fantastic information, Rebecca. You know your stuff inside and out. It’s really impressive. I appreciate you sharing all this great information and knowledge, and wisdom with my listeners.
Thanks, Stephan. It was a great time talking to you this morning.
Awesome. Listeners, please take advantage of all the great information that you heard in this episode, and apply it. Just pick a few things that you’ve learned and start there because you eat an elephant a bite at a time.
- Twitter – Rebecca Berbel
- Twitter – OnCrawl
- Facebook – OnCrawl
- LinkedIn – OnCrawl
- Instagram – OnCrawl
- Youtube – OnCrawl
- Slideshare – OnCrawl
- The Tipping Point
- Google Search Console
- Google Analytics
- Xenu’s Link Sleuth
- Zero Click Searches
Your Checklist of Actions to Take
Ensure my website gets crawled and indexed correctly. That gives it a better chance of getting ranked on Google. Account for crawl budget in my digital marketing strategy when implementing and maintaining tactics to make the site appear in search results.
Site structure is one of the top priorities in an SEO strategy. Structuring a website properly can help Google determine its most important pages.
Spread valuable internal links on the website. Doing so will help show search engine crawlers the value of a site based on its content.
Don’t overlook the log file analysis in your SEO strategy. It is an essential part of an SEO audit. A log file analysis tells how visitors or other programs interact with a site.
Avoid “keyword cannibalization.” It’s what happens when multiple pages on my site target one topic. Doing so eats away at traffic that could have gone to one long post containing all the crucial information.
Be consistent with my niche. Make sure Google doesn’t get confused about my site so it can crawl my site more effectively.
Update the XML sitemaps regularly. A good XML sitemap serves as a guide for Google to the critical pages of the website.
Take advantage of the tools made available by Google. Recommended: Google Search Console.
Be particular about my page speed. A slow loading site can increase bounce rates. Check whether there are large images that take forever to load.
Sign up for OnCrawl’s one-month free trial for Marketing Speak listeners. Contact their support team to qualify for the offer.
About Rebecca Berbel
Rebecca Berbel is the Content Manager at OnCrawl. Fascinated by NLP and machine models of language in particular, and by systems and how they work in general, Rebecca is never at a loss for technical SEO subjects to get excited about. She believes in evangelizing tech and using data to understand website performance on search engines.