Mike, it’s great to have you on the show.
Thanks for having me. It’s very good to catch up.
You’re speaking at conferences as you normally do. You’re making the rounds at conferences like MozCon, BrightonSEO and so forth. What has been a presentation topic that’s garnered a lot of interest from the audience?
At the end of the year, or normally at the top of the year, I get so bored with all the stuff I’ve been doing and talking about that I try to challenge myself to go even further. I find that doing that also aligns with the things the audience is pretty interested in. This year I’ve been digging very deeply into the computer science behind search: understanding things like information retrieval and natural language processing, and breaking them down so everyone can understand a lot of the components. I know you’ve been in this space for a long time. You were probably around for the first wave, when people were trying to reverse engineer how search works and were digging into the IR concepts. I’m sure that stuff isn’t new to you, but it seems like there’s been a generation of people who came into this space who are more content-driven and don’t know those things. The more I bring it back to those base components and then tie it to what we’re seeing right now, the more everybody seems enamored with the discussion, and they’re interested in, “Here’s how things very likely work.” It’s resonating a lot and we’re getting a lot of great client conversations as a result of it.
It seems like content marketers these days don’t have a clue about SEO, which is scary.
We’ve lost our way as SEOs and there’s definitely a distinction between content marketers and SEOs. A lot of them think of SEO as a channel that’s going to work because you create things like if you build it, they will come. As we all know that’s not true. Bringing back this technical layer, just from the standard layer, “Is this site optimized and then is this content optimized?” It is eye-opening to help people understand that search engines have very specific expectations of content based on content that’s already performing. Breaking that down into the technical components, people realize one of two things. They’re like, “I had no idea what I was doing,” or two, it’s like, “There’s a whole new world of how I can improve this content that I’m doing.”
Let’s talk about some of these opportunities that are not being seized currently by many marketers and even many SEOs.
There’s a whole world of text analysis out there and most SEO tools aren’t doing most of these things. From what I’ve come across, a lot of tools are going in the TF-IDF direction. That’s come up in ranking factor discussions. As long as I’ve been reading them, one of the top five highest correlating factors has always been whether or not your pages have the important words that appear on other pages ranking for these keywords. It’s interesting to see a lot more discussion around that over the last few years. You’ve got Ryte, which used to be called OnPage.org. They’ve got a tool called Content Success. SEMrush is getting in the game now. Also, Searchmetrics has their Content Experience tool. It seems like those tools are also doing a little more of these things in the background. I’m not completely sure what those things are.
I do know that when I bring up things like LDA or Latent Semantic Analysis and things of that nature, you’ll hear people that have created these tools say, “We’re automatically doing that in the background as well.” Then they abstract it for users where it’s like, “Use this word more, use this word less.” In a lot of ways, it mirrors the way people used to think about keyword density, which is scary. You don’t want to get people back in that mindset. Nevertheless, as far as SEO tools go, it seems like the main player right now is TF-IDF, but there are many other ways to analyze text from the perspective of how search engines work, and I don’t mean Google themselves, I mean the base-level understanding of information retrieval. That’s the computer science behind search, the core concepts of, “Let’s take this page, break it into paragraphs, break it into sentences, break it into words or tokens,” then use a variety of different statistical analyses to determine the relationship of these words to each other, score them, stem them, do lemmatization, a variety of different things like this. Then you also look for the distributions of those words across different documents that are ranking for these keywords.
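The breakdown described here (page to paragraphs, to sentences, to tokens, then stemming and counting) can be sketched in a few lines. This is a minimal illustration under assumptions, not how any search engine or SEO tool is known to do it: the function names are hypothetical, the sentence splitter is a naive regular expression, and `crude_stem` is a toy suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re
from collections import Counter


def split_paragraphs(text):
    # Assumption: paragraphs are separated by at least one blank line.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    # Naive split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def tokenize(sentence):
    # Lowercase and keep runs of letters and digits as tokens.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def crude_stem(token):
    # Toy stemmer (hypothetical): strips a few common suffixes,
    # leaving at least three characters and leaving words in "ss" alone.
    if token.endswith("ss"):
        return token
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_counts(text):
    # Walk page -> paragraphs -> sentences -> tokens and count stemmed terms.
    counts = Counter()
    for paragraph in split_paragraphs(text):
        for sentence in split_sentences(paragraph):
            counts.update(crude_stem(t) for t in tokenize(sentence))
    return counts
```

Calling `term_counts(page_text)` gives stemmed term frequencies for one page; running it over every page that ranks for a keyword gives the distributions mentioned above.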
There are a variety of concepts here that I’ve thrown out. Nevertheless, if you’re into things like computational linguistics or you’re doing natural language processing, a lot of these concepts are pretty much basics in those spaces. In SEO, no one’s talking about Hidden Markov Models or entity salience and things like that. We’ve talked in passing like, “You should have an entity strategy,” but I’ve never met a client that has had an entity strategy. As of late, you’re seeing more and more people talk about some of these things. For instance, Stephanie Briggs from Briggsby gave a talk at MozCon where she talked a lot about entity salience and her methodology for using this stuff in creating content and reviewing that entity salience using Google’s NLP API. There are a lot of opportunities here, but it seems like not too many people are mindful of it or know where to start with it.
On our end, we’re in the process of hiring a computational linguist so we can make this a more detailed part of our process. We’re already using tools like KNIME to do text analysis, compare what our clients have done versus what’s currently ranking, determine the word usage deficiencies, and then make adjustments to copy to account for that. It’s a brave new world, but at the same time, it’s not, because these are all concepts that early SEOs had some understanding of. It seems like as of late it’s fallen by the wayside. I’m happy to be a part of bringing these concepts back to the forefront to see where we can go with it as an industry.
There’s a lot to unpack here. For our audience who is not deep into SEO and they want to understand a bit more about these concepts like TF-IDF, Latent Semantic Analysis, Markov and so forth, let’s briefly give them some definitions of these different concepts. Let’s start with TF-IDF.
TF-IDF stands for term frequency – inverse document frequency. What that means is if you’re looking to rank for a given keyword, search engines are going to look at the pages that already rank for that keyword. Then they’re going to say, “Do you have these words on your page?” Let’s say you want to rank for the keyword basketball and pages that rank for basketball also feature LeBron James, slam dunk, Michael Jordan, foul shot, things like that. Google or search engines, in general, would also expect that you have these words on your page. It’s one of the less sophisticated ways to determine whether or not a page is relevant to a given keyword based on the other words on the page. When I say it’s like keyword density, it is in a way, in that it’s looking at the density of other words, not necessarily your target keyword.
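For reference, the textbook form of TF-IDF weights a term by how often it appears in one document, discounted by how many documents in the collection contain it. The sketch below is a minimal version under assumptions: the `tf_idf` name, the pre-tokenized input, and the plain `log(N / df)` variant are illustrative choices, not what any search engine or SEO tool is confirmed to use.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # Term frequency (share of the document) times inverse document frequency.
        weights.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in tf.items()}
        )
    return weights
```

With this variant, a word that appears in every document (such as "basketball" across pages that all rank for basketball) gets a weight of zero, while words that distinguish a page, like "slam dunk" or "foul shot", score higher.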
Latent Semantic Indexing is built on Latent Semantic Analysis. If you talk to someone like Bill Slawski, he’s adamant that Google is not using this at all. What you’re looking for is to determine the hidden relationships between words. Effectively what happens is words are broken down into their incidence rates and then also compared with respect to how often they occur together. Effectively you’re creating these statistical matrices of words, determining how related they are, and then asking, “Does this word matter as much as this other word?” to determine if a given page is as relevant to that word as another page. It’s a similar concept to TF-IDF, except it’s looking more at the relationships between the words, not just the weighting of the different words.
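The “statistical matrices of words” idea can be illustrated with a term-document matrix and a similarity score between term rows. This is only a sketch of the co-occurrence step under assumptions: full Latent Semantic Analysis additionally applies a singular value decomposition to this matrix, which is omitted here, and the function names are hypothetical.

```python
import math


def term_document_matrix(docs):
    """docs: list of token lists. Returns {term: [count in doc 0, count in doc 1, ...]}."""
    vocab = sorted({token for doc in docs for token in doc})
    return {term: [doc.count(term) for doc in docs] for term in vocab}


def cosine(u, v):
    # Cosine similarity between two count vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Terms that tend to show up in the same documents get a similarity near 1, and terms that never share a document get 0, which is the raw signal that LSA then compresses into a smaller number of latent dimensions.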
Let’s tie these two concepts, TF-IDF and LSA or LSI, together. Let’s say that you’re writing a document, an article or a blog post about basketball and you’re not referring to all these different related terms. You’re just being repetitive around basketball. The example I like to give is if you’re writing an article or a product description about lawnmowers and you’re not talking about grass or grass clippings, lawn clippings, weed whackers, lawn care or any of these other related terms, which, incidentally, I know Brian Dean from Backlinko likes to refer to as LSI keywords. They’re related topics. If you aren’t incorporating any of these other related terms, it’s a pretty thin article, pretty surface level. It’s not comprehensive. If somebody wants to take some action from this, they would ideally have a tool that analyzes the page to see how surface level it is. You mentioned a few tools as examples, like the Searchmetrics Content Experience suite, which includes the Topic Explorer. The Content Editor will review the article once you’ve selected the topics that you want it to be focused around, and it will show you where the gaps are.
That one I like a lot. They all do the same thing to some degree, but I like that one because the UI is very intuitive. It looks like the TinyMCE editor in WordPress where, as you write, it’s making suggestions. You’ve got this actual text editor where it’s not like you’re writing in Notepad, you’re writing in something with formatting and things of that nature. You can also deploy directly to your CMS. The whole point is that they’ll look at the weighting of those words throughout the pages that rank and then give you suggestions on which words to use on your end as well, so that you can get the most value out of the content that you’re creating. Any tool that you might use for this, even if you’re using a text analysis tool, is effectively going to do the same thing. It’s going to say, “These are the words that are important. These are the words that you’re using. Here’s the disparity. Why don’t you write your content in such a way that it accounts for that disparity?”
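The “here’s the disparity” step can be sketched as a comparison between the terms used on ranking pages and the terms used on your own page. This is a hedged illustration, not how Searchmetrics, Ryte or SEMrush actually work: the `usage_gap` function and its `min_share` threshold are hypothetical, and real tools presumably weight terms rather than just checking for presence.

```python
from collections import Counter


def usage_gap(ranking_docs, my_doc, min_share=0.5):
    """Return terms used by at least min_share of the ranking docs but missing from my_doc.

    ranking_docs: list of token lists for pages that already rank.
    my_doc: token list for the page being reviewed.
    """
    # Count how many ranking pages use each term at least once.
    df = Counter()
    for doc in ranking_docs:
        df.update(set(doc))

    mine = set(my_doc)
    threshold = min_share * len(ranking_docs)
    return sorted(term for term, n in df.items()
                  if n >= threshold and term not in mine)
```

For a lawnmower page that never mentions grass, while most ranking pages do, `usage_gap` would return "grass" as a term to work into the copy.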
The beauty of this stuff, and I’m not saying all this to sound like a “thought leader,” is that if you make these adjustments, you can see dramatic increases in your rankings. Most people are like, “We don’t have enough links.” As SEOs, we over-index on links and that’s probably why we’re seeing links being so impactful. It’s not just because the algorithm itself is overweighting links. It’s because the user action of us going after links so much is what makes links seem the more powerful thing. This is part of the reason why you’ll see a page that has no links pointing to it suddenly ranking better than pages that have thousands of links and you’re like, “I don’t understand, what’s happening here?” It’s largely because of on-page factors like this that can get you there.
As machine learning and AI in general advance, it’s going to be harder and harder to try and game the system by dropping some additional keywords in a page or by buying some links or having a PBN or something.
Bringing up ML is a huge topic here because with the latest updates, I’m seeing this pattern wherein you get a spike in Googlebot activity and then you get a spike in traffic because you’re also getting a spike in visibility after that. If you’re a site that gets negatively impacted by the algorithm update, what you’re going to see right after that is a drop-off in rankings, traffic and so on. What we’re seeing here is that Google is effectively leveraging those user signals as response variables in these ML models to determine whether or not these are good results. What I mean by that is typically when you’re doing machine learning, you have what we call a response variable where it’s like, “We’ve tried this as a result from this model that we’ve built and so we have a binary yes or no that says this is a good result.”
The response would be some combination of either click-through or dwell time, which would say as the evaluation measure that this is a good result because the user found value here, they stayed on this page. In that timeframe where effectively we’re testing the model, we’re seeing a very low dwell time on these pages that we’re giving short-term visibility and that’s an indication that this is not a good page. My whole point here is that the data that I’m seeing indicates that Google is weighing those user signals way more heavily than they had done in the past. Having a strong user experience and of course, content that has utility for users is way more important than it’s ever been before.
The user signal of dwell time is something that we can’t see. Only Google can see it because they’re tracking what happens at the click and when the user returns and clicks on another result, if it’s a very short time period that’s called pogo-sticking. That’s something that Google tracks. They’re not spying on our Google Analytics and using it against us like the bounce rates and time on site. They don’t know what’s happening once the user enters our site. What happens with the user if they bounced back and then click on another result or do another search, that tells them if it’s very quick that it was an unsuccessful result or didn’t answer the user’s query very effectively probably.
There’s always been a discussion as to whether or not Google is using this. There have been a lot of tests. I know Eric and Rand also ran a bunch of tests on this. Google always says, “We don’t use CTR. We don’t use dwell time,” or they’ll say, “We’re definitely not using your analytics.” They’re not using analytics. I completely believe that. I also believe that they have to be very clear about the fact that they’re not using your analytics because of the fact that there’s a whole bunch of issues with privacy. Dwell time and CTR are core evaluation measures in information retrieval. It would be ridiculous if they didn’t use that. In fact, they have said that they’ve used it for evaluation. It’s very realistic to expect that because they’ve gone this direction of machine learning, they have to have some response variable which indicates this was a good result and those evaluation measures, the CTR and dwell time are already there. They very much will tell you whether or not a user thought this was good. We don’t have any visibility into dwell time but at the same time, time on site is something we can use to infer.
It’s a good proxy.
If you’re seeing high bounce rate, very low time on site around the time that there’s an algorithm update and then you see a drop-off, it’s very likely that you probably have been negatively impacted.
With machine learning, it’s also important to have a very large data set. ML thrives most when there’s a large data set for it to base its analysis on. With user engagement, that’s a massive data set. Whereas if it’s using something that’s on the website specifically and it’s a small website, there might not be as much data for the ML algorithm to work off of.
I do think at the same time though that for smaller websites, they do have to make some adjustment in the computation, to the point that you made, because if there isn’t a statistically significant data set, then how would they do that? There are plenty of small websites that rank well. They have to account for it in some way. I would imagine that, the same way you account for things like that in all types of statistics, there is some correction to do that, but at the same time, there are probably a lot of false positives as a result of that too. We’re at the mercy of these ML algos at this point. We’ve read the posts and the blogs where Google engineers are like, “We don’t even know how it works.” We’re all at the mercy of the algorithm in ways that we weren’t before.
It’s important to recognize that Google is going to put out disinformation and misinformation in order to thwart the manipulators, us SEOs, especially the black hat ones who are trying to use that information against them and against their algorithms. You can’t trust everything that comes out of various Googlers’ Twitter feeds and so forth, but you can take that and use it as a potential basis for a hypothesis and then test.
That’s another key point. There hasn’t been enough testing in our space as of late. That was one of the things that was so magical to me when I first got into SEO: watching some of the old school guys who were like, “Here’s this thing I tested,” and everybody was sharing knowledge around tests. Google has discouraged a lot of that. They’ve actively spoken against tools like MozCast or anything that would show the general trends and what’s been happening with search, specifically organic search. I definitely would love for us as an industry to get back towards testing because that is so powerful. There’s so much conjecture and so many hypotheses in the space that people read and take for granted. It’s very easy for someone who isn’t as experienced to take what they’re hearing from Googlers as gospel. We all know from being in this space for so long that you can’t. You have to guess and check by doing your own experiments. I would love to see our space getting back more in that direction because it’s so valuable.
Another thing to recognize as either an SEO or a marketer is that Google is not your friend. Googlers are not your friends. They’re not trying to help you SEO your website. They’re trying to guard their algorithms, their editorial process of analyzing all these various signals, from getting gamed and manipulated. They would prefer that we stopped trying to manipulate altogether, created great content and let the links happen by themselves. Let the various optimization techniques fall by the wayside and focus on great content.
I have some thoughts on that as well. If you think about it, and this is a guesstimate, let’s say Google has 50,000 employees. Let’s assume 10,000 of them are search engineers, which they’re not, but let’s keep the math easy. That’s 10,000 search engineers, and there are millions of websites on the web. Last time we checked LinkedIn for people who had SEO in their title, there were 400,000 of us. Effectively, they need us in order to have search quality because they can’t do it without crowdsourcing it across all these websites that we work on. We’ve seen in the past that when there were fewer SEOs, search quality wasn’t as good as it is now. Algorithmically, there’s a lot that is done. It’s probably like a 60/40 split. 60% of it would be great without us, but for the other 40%, the right things wouldn’t be getting visibility if it wasn’t for the work that we do. Part of this, and this is my own tinfoil hat theory, is they have to think of us as a psychology project. They have to manipulate us to some degree so that they can achieve the scale for search quality that they have. If there was a point where we all went black hat, Google would be done. Instead, we’re all kept in line by these guidelines and by the FUD that they create.
You’ve got to explain what FUD is.
Fear, uncertainty and doubt.
I love that acronym and I love the stories behind that. Many of our audience won’t know what FUD is. Let’s talk about what happened, speaking of FUD, with John Mueller and his reaction to one of your presentations. You put out a SlideShare deck for a presentation.
It’s called You Don’t Know SEO. I gave the talk at MozCon and there were a couple of slides where I was walking through what I’m calling a pyramid of the relationship between Google engineers, SEOs, webmaster trends analysts and so on. I broke it down as a pyramid where you have, at the bottom, information retrieval as the computer science behind everything that we’re doing. Then you’ve got the Google level, which improved on information retrieval, but it’s a narrower scope as to what you’re doing with information retrieval. Then at the top layer, you’ve got SEOs. I was saying that the same layer is where Google’s webmaster trends analysts sit as well, based on some of the things that they tell us and some of the things they seem to know. It seemed they aren’t able to give us any more than we already know. The general thing I was trying to say there is we have to take more ownership of learning how things are working and what does work, and not take what they say as gospel.
It was largely a miscommunication because John wasn’t there. He saw my slides. He took it out of context. I used a picture that wasn’t a recent picture. I’ve never met the guy. I honestly didn’t know what he looked like. I do now after this whole thing. He assumed that I was body shaming him, and he also took offense to me saying we don’t need to just listen to Googlers, that we need to take more ownership and accountability for our understanding. It created a big discussion on Twitter. A lot of people used it as their coming out party to hate on me, which is to be expected when you’re a person that people know. Basically everyone that was there was like, “This is crazy. He wasn’t dissing you. He was telling it like it is.” The aftermath of that was interesting to watch because I’ve been through enough things on the internet where people will come at me with pitchforks, in both the music world and now this world. It was more interesting to sit back and watch it. I didn’t take much offense to it. I didn’t care as much. At the same time, I recognize that John’s feelings were hurt by what he perceived to be direct slights at him. I was like, “That’s fine.” I reached out to him. I apologized. I also wrote a Medium piece and that was pretty much that. It’s taught me to pick my slides a little more carefully because somebody may read them and not see the actual video or what have you and take offense or be upset by it. That’s not my goal. My goal is to challenge our industry and help us all get better at this.
What I find is that as we move towards more electronic communication, nobody seems to call each other on the phone anymore. It’s etiquette now to tell people via text that you’re going to call them. I find that bizarre and dysfunctional. When I’m hearing more from my kids via text than I am by phone or in person, I know that that’s not heading in a healthy direction. I was so spoiled in the early days of SEO, having Matt Cutts be so vocal, be at conferences sharing and be so descriptive with long-form content, blog posts on his blog, posts to WebmasterWorld and so forth. It was so helpful. Now we get these cryptic tweets from Googlers that we have to try and interpret. How do you interpret what’s now 140 characters or 280 characters? Oftentimes the tweets we get from Google are very short in comparison to something that was hundreds of words long in the past. It’s a real shame, and that lack of information, trying to keep us in the dark, does not serve the greater good in my opinion.
That was another point that I made and that may have also contributed to the offense that was taken at my slides. I was saying how I miss Matt Cutts and that there’s a clear distinction between the role that Matt Cutts was in and the role that the webmaster trends analysts are in. Matt Cutts is somebody that holds patents on some of the mechanisms for search at Google. He’s a guy that committed code on some of the earlier changes to the algorithm. He’s going to be able to give more color to things. I have no visibility into why it is the way it is. It may be a function of the webmaster trends analyst role being more about, “How do we give you precisely the tidbits that we want to give you?” or it may be a question of access. I don’t know, but that’s how it seems from the outside looking in when you hear people on the team say, “Something’s going to launch next week,” and then it doesn’t come out until three months later. It seems like the same experience that we all have when we’re trying to get time with the developers to see, “Where are you at with the deployment schedule?”
That’s my outside-looking-in perspective. I don’t know, but that’s what it looks like. Whereas Matt Cutts was the head of the webspam team. He knew everything that you would want to know, or at least he knew exactly who to talk to, because some of those people may have even reported to him. I miss Matt Cutts but at the same time, I don’t think it’s fair for us to hold these other guys to the same standard if they are not the same type of person in the organization. The reality of it is Google is a very large organization and it’s very easy for these teams to not communicate well. I miss having that level of information, but you never know what you’ve got until it’s gone. People were hating on Matt Cutts, myself included, back then, and now that we don’t have him, he’s missed. It creates more opportunity for us to learn more about how things work.
It’s more than lack of access or lack of control over certain things. You don’t get the same level of information that Matt provided. It’s a policy decision, like the fact that they don’t want to give us as SEOs weather reports on when major updates are happening, or they deny that a major update happened until we prod and poke at them. We went from a lot of transparency from Google to a lot of opacity and I don’t appreciate that. They’re not a government or something. They’re a company and their goal is to maximize shareholder value and that’s to be expected. If keeping us in the dark helps them achieve that goal, then so be it. There’s not a lot we can do except vote with our dollars. If we’re spending money on Google ads, maybe reconsider and move some of that money elsewhere because money talks.
Where are we going to go? What are we going to do, put it on Facebook?
They do have a monopoly going. Let’s circle back to what we were talking about with some of these IR, information retrieval, related terms. If you can define Markov and entity salience for us, that would be helpful.
This is the harder one to break down. Entity salience is the concept of leveraging the information behind the entity that’s being discussed in the content. For instance, let’s use another basketball metaphor. Let’s say you’re talking about the Dream Team. You mention four players and you don’t mention the fifth player in the starting lineup, but search engines understand that even if you haven’t mentioned the fifth player, by mentioning the other four you’re inherently mentioning that one, because there are all these attributes of these entities that are inherently understood from the mention of the other entities. Let’s say I talk about Michael Jordan and all these players. I’m inherently mentioning the Chicago Bulls. I’m inherently mentioning his MVP trophies and so on. All those features and facts about that person, place or thing are inherently leveraged in the understanding of your content.
With Hidden Markov Models, you don’t have to explicitly talk about aspects of entities. You’re implicitly talking about them from the mention of some component of those entities. To bring that home as to how you would use this, let’s say I’m talking about a subject, going back to what you were saying about lawnmowers. If I mention a variety of aspects of that lawnmower but I don’t explicitly say, “Honda lawnmower,” search engines can understand that I’m talking about that entity without my having to explicitly say it. One of the key things to understand there is that we’ve always thought of this concept of, “Are we using this target keyword 49 times?” You don’t necessarily have to do that. Maybe you mention it once or twice and then your description of that entity is enough to reinforce the concept of that entity. We then tie that back to TF-IDF. Maybe we mention our target keyword once or twice, but then you use those keywords that we call proof keywords to further describe that entity. Then you’re going to be more relevant than if you mentioned that keyword 49 times.
To use an SEO example, let’s say a document or a blogpost mentions you, it mentions me, it mentions Rand Fishkin and a few other SEOs, Bruce Clay or whatever. It never mentions SEO practitioner. It’s implied that that document is about that because of the examples that were used and the mentions of all the different people who are in that field.
A great example of that would be, let’s say we’re talking about your book, The Art of SEO, and we mention only two of the authors. Inherently, because you’re talking about the book, Google will know that there are these other authors involved in it as well. You don’t necessarily have to explicitly say all these people’s names, but you are implicitly referring to them, and then that page may also show up for one of the authors that aren’t mentioned.
There are other applications that involve Markov from a black hat standpoint back in the early days when people were using article spinners, which maybe are still in use. Creating 1,000 articles from one source article using a Markov algorithm, do you want to briefly say what that’s about?
I don’t know that I can easily break down how Markov chains work. In the past, content spinners generally would do that. Effectively, what they do is look for synonyms for some of the words in your existing copy and then flip them until they make sense. You’re rewriting the content without writing the content. I don’t know that I can break down the Markov chain in a simple way.
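For readers who want the basic idea, a first-order Markov chain over words records, for each word, which words followed it in the source text, then generates new text by repeatedly picking one of those recorded followers at random. The sketch below shows that technique only, under assumptions: it is not the synonym swapping described above, it is not taken from any particular spinner, and the function names and the `seed` parameter are illustrative.

```python
import random
from collections import defaultdict


def build_chain(tokens):
    # Map each word to the list of words that followed it in the source.
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, seed=0):
    # Walk the chain from `start`, choosing a recorded follower at each step.
    # Stops early if the current word never had a follower in the source.
    rng = random.Random(seed)
    out = [start]
    while len(out) < length and chain.get(out[-1]):
        out.append(rng.choice(chain[out[-1]]))
    return out
```

Every adjacent word pair in the output also appears somewhere in the source, which is why this kind of generated text looks locally plausible while drifting in meaning overall.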
If we are thinking like a search engine and trying to compare two documents to each other, whether it’s spun content from an original source, a complete rewrite by a human, or a shuffling around of the paragraphs to make it seem unique so you don’t get hit by Google’s duplicate content filter, whichever way you end up choosing, Google is using algorithms to figure out, “Is this duplicate content? Is this the same content that I’ve seen elsewhere on the web?” There are very simplistic ways of getting a sense for that, like looking at whether the title is duplicate, the meta description is duplicate, the text in the article is duplicate. There’s another more sophisticated way to think about it, which is way more effective, and that’s to use Shingles or to think in terms of Shingles. Are you familiar with that term?
I’m not. What’s that?
Imagine a five-word long window, for the sake of argument. It can be longer or it can be shorter, but that’s a Shingle. Suppose you compare two documents to each other by comparing their Shingles. Let’s say you shuffle some paragraphs around, but most of the content stays the same. Most of the Shingles would still be in common between the two documents. The only Shingles that change are the ones that span two paragraphs, because you’ve moved the paragraphs so those are no longer the same Shingles. For the most part, you’re keeping most of the Shingles the same. Even if you’re augmenting the content with some additional content but keeping most of it the same, say you’re still using the manufacturer-supplied product description but you’re adding a couple of additional paragraphs, you still have a lot of the Shingles in common. That’s pretty old school, but so is a lot of what we’ve been talking about, TF-IDF and so forth.
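The Shingle comparison described here can be sketched as sets of overlapping five-word windows compared by their overlap. This is a minimal illustration under assumptions: the function names are hypothetical, Jaccard similarity is one common way to score the overlap, and nothing here is confirmed to match Google’s duplicate content detection.

```python
def shingles(tokens, k=5):
    # Every run of k consecutive tokens is one Shingle.
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    # Shared Shingles divided by all distinct Shingles across both documents.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two eight-word paragraphs, swapped between the two documents.
p1 = [f"a{i}" for i in range(8)]
p2 = [f"b{i}" for i in range(8)]
original = shingles(p1 + p2)
shuffled = shingles(p2 + p1)
similarity = jaccard(original, shuffled)
```

In this tiny example only the Shingles that straddle the paragraph boundary differ, so the similarity comes out at 0.5. With realistic paragraph lengths the boundary Shingles are a much smaller share, and shuffled copies score close to 1.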
I’m also familiar with the SimHashing side of it, but I’ve never heard of the Shingles thing. This is cool.
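To make the Shingle comparison concrete, here is a minimal Python sketch. It assumes five-word windows split on whitespace and uses Jaccard similarity (shared Shingles divided by all distinct Shingles) as the comparison; the function names and the sample product copy are made up for illustration, and nothing here is claimed to be how Google actually implements duplicate detection.

```python
def shingles(text, k=5):
    """Return the set of k-word windows (shingles) found in the text."""
    words = text.lower().split()
    # A text shorter than k words yields an empty set.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    union = a | b
    if not union:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(union)


# Hypothetical product copy: two paragraphs, then the same two swapped.
PARA_A = "this blender has a powerful motor and six stainless steel blades for smoothies"
PARA_B = "the jar is dishwasher safe and comes with a two year limited warranty"

original = PARA_A + " " + PARA_B
shuffled = PARA_B + " " + PARA_A

similarity = jaccard(shingles(original), shingles(shuffled))
# Only the four windows that straddle the paragraph boundary differ in
# each version, so 18 of the 26 distinct shingles are shared (about 0.69).
print(round(similarity, 2))
```

Swapping the paragraphs changes only the windows that cross the boundary, which is why the score stays high. A larger window makes the comparison stricter, a smaller one makes it more forgiving.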
Let’s geek out a bit more on another aspect of algorithms, and that is this idea that there’s a second wave of indexing. It’s important for our audience to understand this. Can you go into a bit of detail on that?
If you’re old school, you’ll remember back in the day before there was progressive enhancement, there was graceful degradation.
You could always keep it completely old school and throw it all in the noscript tag.
Back in the day when I invented the first iteration of GravityStream, which is a proxy-based SEO technology platform, it did a very simple search and replace on various elements, various containers in the HTML, so it could optimize URL structure by replacing the internal links in the page with more keyword-rich, search engine friendly URLs, fix navigation elements and all that. It was a simple search and replace approach. One huge innovation in this product came once our development team started maintaining and building on GravityStream. I only wrote the first prototype, and then they moved to processing the DOM, the document object model. They were able to get a lot more sophisticated, going from simple search and replace to being able to rejig lots of other elements of the page and not just look for this bit of text and then replace it with this bit of text.
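The difference between plain search and replace and working on the parsed document can be sketched in a few lines of Python. This is an illustration only, not GravityStream’s code: the URL mapping, function names and sample page are hypothetical, and the parser-based version uses the standard library’s `html.parser` as a stand-in for a real DOM.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical mapping from an old URL to a keyword-rich one.
URL_MAP = {"/p?id=42": "/blenders/pro-blender"}


def naive_rewrite(html):
    """Plain search and replace: touches every match, even visible text."""
    for old, new in URL_MAP.items():
        html = html.replace(old, new)
    return html


class LinkRewriter(HTMLParser):
    """Re-emits the page, changing only href attributes on <a> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if tag == "a" and name == "href" and value in URL_MAP:
                value = URL_MAP[value]
            # Attributes without a value (e.g. "disabled") are kept bare.
            parts.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def dom_rewrite(html):
    """Structure-aware rewrite: only link targets are changed."""
    parser = LinkRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = '<p>Old URL /p?id=42 is mentioned here.</p><a href="/p?id=42">Blender</a>'
print(naive_rewrite(page))  # also rewrites the URL inside the paragraph text
print(dom_rewrite(page))    # rewrites only the href
```

The sketch drops comments and doctype declarations and does not preserve self-closing tags, so it is a demonstration of the idea rather than something to run on production markup.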
Let’s move on to another topic that is very interesting and something that you follow very closely. That’s natural language processing and how RankBrain is able to understand the intention of the searcher and not just the words they’re using, to get inside their heads. When you talk about something and you ask a question, you may not use any of the relevant keywords. Google is getting better at being able to tease out what the actual keywords are that we should be searching for or what documents it should present to the user. Let’s talk more about that.
Google is a very fascinating place with respect to having machine learning be so core to everything that we’re doing. There’s been a lot of discussion about how they break down the query and get the understanding. It goes back to what we’re talking about with the whole entity understanding as well. There’s a post where the query is like, “Who was the US President when such and such won the World Series?” They show you visually how they break it down into entities. It’s a list of US Presidents, World Series winners this year. Whatever the year was, they show how there’s a relationship between all the things that happened in that year and then the relationships between the list of US Presidents and so on.
You’ve also heard some of the Googlers say like, “The first thing we do is look for the entities in the query.” When we go into things like RankBrain, that’s a core place where this understanding is being used. Not only are they taking your query and then adjusting that query to align it with things that make sense to them, they’re also adjusting the result set. A lot of the research that I did that I put together in constructing that You Don’t Know SEO presentation, some of the patents that I read talked about things where Google makes last second or last millisecond adjustments to the results that they show. They do all these adjustments to the query you put in. There’s also a series of scoring functions or algorithms in this case that will run before they even think about giving you a result set. Let’s say even if we knew the “algorithm,” there are so many options for which algorithm they might use before they even get to the point of showing you what they want you to see.
To the point of RankBrain, it’s interesting that they see so many different queries that they have to effectively turn those queries into queries that they already understand to some degree, if a query is too far away from what they’ve seen before. Let’s say you put in a query that’s unseen and they’re like, “Where do we start? Let’s break this into entities.” Once they’ve broken it into entities, they’re saying, “This matches a query that we’ve already seen,” and then they’re able to give you results that are even more relevant even faster. That query understanding has gotten so much more powerful than we even thought it was before. That’s part of why we’re seeing them return similar search volumes for different keywords, because they’re effectively consolidating these ideas into the same thing. Despite the fact that the plural may get fewer searches than the singular, they’re representing it as the same thing in their understanding of the query.
Although you will get different results searching for the singular than you will for the plural. There are more similarities now in the search results returned than in years past because of the way that Google is getting more sophisticated in analyzing the query.
Another interesting thing there is you’re seeing them making changes in the same vein on the ad side as well where they’re saying like, “Exact match isn’t exact match anymore,” because they’re showing your ad for something that is like a permutation that to them still means the same thing. A user going to a different stage of the user journey, the query may slightly change. They’re still saying like, “No, this is a good match for you.” Google on all sides is leveraging that query understanding to improve search quality and also make it less taxing on their systems by consolidating some of these concepts using entities.
Let’s move to a lightning round. Let’s start with 304 status code.
304 status code means not modified. It’s a great thing to use to maximize your crawl allocation. If Google crawls the page and they’ve crawled it before, they’ve already indexed it and it hasn’t changed since, you can return this 304 and they’ll know not to download it again. That way they can end up crawling more pages rather than crawling the same ones over and over.
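As a rough illustration of how that works on the server side, the Python sketch below decides between 200 and 304 based on the standard `If-Modified-Since` request header and `Last-Modified` response header. The `respond` function and the sample dates are hypothetical; a real site would normally rely on its web server or framework for this, and many also use `ETag` and `If-None-Match`.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(request_headers, last_modified, body):
    """Return (status, headers, body) for a page last changed at last_modified (UTC)."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            since_dt = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            since_dt = None  # unparseable header: fall through to a full response
        if since_dt is not None:
            if since_dt.tzinfo is None:
                since_dt = since_dt.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since_dt:
                return 304, headers, ""  # not modified: send no body
    return 200, headers, body


changed_at = datetime(2019, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# First crawl: no conditional header, so the full page is returned.
status, headers, _ = respond({}, changed_at, "<html>...</html>")

# Later crawl: the crawler echoes back the Last-Modified value it saw.
revisit_status, _, revisit_body = respond(
    {"If-Modified-Since": headers["Last-Modified"]}, changed_at, "<html>...</html>"
)
print(status, revisit_status)
```

The second response carries no body, which is what saves the download and leaves the crawler free to fetch other pages.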
Shards, what are those?
That’s the database concept where they break down their database into a series of partitions that can then live across a variety of systems. This also allows things to be way faster because they’re able to ping all the shards at the same time, collect the answers into one place and then serve the final answer. The index is built that way so they can distribute it and make it as fast as possible.
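A toy version of that pattern, sometimes called scatter-gather, can be sketched in Python. Everything here is an assumption for illustration: three in-memory dictionaries stand in for shards on separate machines, documents are assigned by a hash of their ID, and the "query" is a simple word match. It is not a description of Google’s index.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3  # arbitrary number of partitions for the example


def shard_for(doc_id):
    """Pick a shard from a stable hash of the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def build_shards(docs):
    """Split one collection of documents into NUM_SHARDS partitions."""
    shards = [{} for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id)][doc_id] = text
    return shards


def search_shard(shard, term):
    """Each shard answers the query using only its own documents."""
    return [doc_id for doc_id, text in shard.items() if term in text.lower().split()]


def search(shards, term):
    """Ask every shard at the same time, then merge the partial answers."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term), shards))
    return sorted(doc_id for partial in partials for doc_id in partial)


docs = {
    "doc-1": "Shingles help detect duplicate content",
    "doc-2": "A 304 response saves crawl budget",
    "doc-3": "Comparing shingles across two documents",
    "doc-4": "Entities are things not strings",
}
shards = build_shards(docs)
print(search(shards, "shingles"))
```

Because each shard only scans its own slice, the slowest shard sets the response time rather than the size of the whole collection.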
Let’s talk about signals. How many ranking signals would you guess there are? There’s always the conjecture that there are several hundred or whatever. We don’t know for certain but I’m curious what your opinion is.
There are signals and then there are vectors. At this point, we’re thinking at least 1,000 because it’s like for every signal there are different directions that signal could be looked at from. Depending on the vertical, it could be this subset of signals. Depending on this other vertical, it could be this subset of signals. I think that we’re looking at some combination of over 1,000 signals.
What would you say is a good definition for entity that will help a non-SEO layperson to talk about something more than just “keywords” because that’s an antiquated way of thinking about SEO? Give them a working definition of entity.
The way that Google describes it is one of the easier ways, this concept of going from strings to things. A string is basically a word or a series of words and then a thing is like a person, place or thing. You’re talking about the nouns rather than the actual text on the page. Thinking about the person, place or thing, all the things that represent that thing or make them up or the features and behaviors of those things are what you need to be thinking about when you’re writing your content rather than like, “How do I shove these words in here as much as possible?” How can I describe this thing where it makes sense with the well-understood definition of this thing? If you look at Wikipedia or Freebase or these two different data sources, you’ll see that these people, places and things are described in great detail. How can I leverage those details when discussing these nouns in such a way that makes my content more robust and more valuable to the user than me repeating the name of the thing 59 times?
What’s your favorite SEO resource?
I keep going back to Moz because the blog is very well written. For anybody who’s just getting into SEO, if they don’t want to buy The Art of SEO, I send them to the beginner’s guide. That’s still my first resource I’ll send people to. Other than that, it’s staying apprised of the news by following the right people on Twitter. People like you, people like Rand, all these key people in the space. Then also check out sites like Search Engine Land, Search Engine Journal, things like that.
Search Engine Roundtable. Let’s say that the audience wants to work with you, work with your agency, how would they get in touch with you?
They can check out iPullRank.com. You can also reach out to me on Twitter, @iPullRank. You can find me on LinkedIn. I have one of the most common names in the world, Michael King. I’m pretty easy to find.
I never heard the origin story of iPullRank. Could you tell us where that name came from?
It’s my personal brand. I started working at Razorfish and I was into double entendre. I was like, “I need a cool name to associate myself within this space,” and it just came to me.
Thank you, Mike. It was an informative and powerful interview. Thank you to the audience. I know this was a geeky, information-packed episode. We’ll catch you on the next episode of Marketing Speak.
- Mike King
- You Don’t Know SEO SlideShare
- The Art of SEO
- Just How Smart Are Search Robots? blogpost
- Angular Universal
- Bartosz Góralewicz – Previous Episode
- Barry Adams – Previous Episode
- Search Engine Land
- Search Engine Journal
- Search Engine Roundtable
- @iPullRank – Twitter
- Michael King – LinkedIn
Your Checklist of Actions to Take
☑ Be familiar with the different SEO tools and identify which ones actually contribute to text analysis. Mike recommends Knime, the Content Success tool from Ryte (used to be called OnPage.org), SEMrush, and Content Experience tool from Searchmetrics.
☑ Understand and apply these different SEO concepts: Term Frequency-Inverse Document Frequency (TF-IDF), Latent Semantic Indexing (LSI), entity salience and Hidden Markov Models.
☑ Be open to various sources of information but don’t fall for them too easily. Always test and experiment with my strategies.
☑ Avoid focusing too much on Google. As an SEO or marketer, I should remember that Google doesn’t help my website at all and is only concerned with guarding its algorithms.
☑ View Mike’s SlideShare presentation called You Don’t Know SEO to get informed about the state of the SEO industry and gain insights on information retrieval to understand Google better.
☑ Read Mike’s blog post Just How Smart Are Search Robots? and expand my knowledge about headless browsing.
☑ Utilize rendering tools like Prerender.io and BromBone. React has a function called renderToString that renders components on the server, and Next.js and Angular Universal support server-side rendering as well.
☑ Follow a design pattern of progressive enhancement on building my website. Ensure that the lowest common denominator user is able to see all the content.
☑ Visit Mike’s digital marketing agency iPullRank for an opportunity to collaborate and further enhance my webpage’s SEO performance.
About Mike King
An artist and a technologist, all rolled into one, Michael King recently founded the boutique digital marketing agency iPullRank. Mike consults with companies all over the world, including brands ranging from SAP, American Express, HSBC, SanDisk, General Mills, and FTD, to a laundry list of promising startups and small businesses.
Mike has held previous roles as Marketing Director, Developer, and tactical SEO at multi-national agencies such as Publicis Modem and Razorfish. Effortlessly leaning on his background as an independent hip-hop musician, Mike King is a dynamic speaker who is called upon to contribute to conferences and blogs all over the world. Mike recently purchased UndergroundHipHop.com, a 20-year-old indie rap mainstay, and is working on combining his loves of music, marketing, and technology to revitalize the brand.