How Does Google’s Search Algorithm Work? – SEO Theory


How does Google’s search algorithm work? Google’s search engine consists of about 150 sub-systems performing specialized tasks that collect, analyze, organize, and present data from the Web in response to user queries. There is no single algorithm, and “search” must be viewed from several angles to understand how it works.

 

One of the great myths of search engine optimization is that there is “a Google search algorithm”. Google doesn’t have anything like that. According to Googler Gary Illyes, there are approximately 150 systems that comprise what I call the Google Search System. So while you may be searching for explanations of how Google’s search algorithm works, you’re not going to find anything reliable. That’s because every writer who speaks of “Google’s algorithm” doesn’t understand what algorithms are, or the difference between a system and an algorithm.

I’ve earned degrees in Computer Science and Data Processing Technology. I’ve worked in the field for … well, longer than most of you have been alive. I’ve studied hundreds of technical papers, patents, presentations, textbooks, etc. I’ve attended or watched many lectures on search engine design, machine learning, and all sorts of database management systems.

And the truth is that I have no better idea about how Google’s search system works than you. The people who try to explain the way Google works in blog posts are blowing smoke in your eyes. The best they can do is parrot back various public statements from Google employees. They don’t demonstrate any real knowledge or understanding of what Bing, Google, and other search engines do.

Still, you’re looking for information about Google’s search system. Here are some things you should know if you really want to understand how Google works.


1. Google Manages Around 23 Data Centers

Google’s search system is so large and so complex they need multiple data centers to make it work. Each data center maintains one or more copies of the things that comprise what we can figuratively call “Google’s database”. There really isn’t a single database according to various Google presentations, papers, and such. There are multiple data systems.

We also know there is more than one search system. Former Googler Matt Cutts once said there were probably three “algorithms” running in the wild at any given time. That was over a decade ago, and he was humoring people by speaking of algorithms when he meant systems. I don’t know if Google was using 150-ish sub-systems 10-15 years ago when Matt explained some of Google’s mysteries to the public, but I’m sure they were using dozens of them.

The “three algorithms” were three variations on the search system. Matt did not elaborate on how they differed from each other. But nowadays people know that Google runs “live” experiments, where new algorithms are tested. Matt described the experiments in a 2014 video.

Google uses multiple data centers to deliver its search results around the world. They’re able to localize results better, respond to user queries in less average time than if they served results only from one location, maintain a world-wide presence even when some of their systems go offline, and recover lost data more quickly when some of their systems fail.

If you don’t visualize the search system as a swarm of similar, related systems working together, then you’ll struggle to understand how Google works. There’s more than one reason why the search results you see differ so much from the search results other people see. It’s not all about “personalization”. In fact, Googler John Mueller recently said that personalization isn’t triggered as often as many people believe. The many different data centers, and the three or more concurrent but slightly different search systems, are two of the reasons why search results look different.

*=> The multiple data centers and search systems are major reasons why you cannot trust the “ranking” data from SEO tools.

2. Google Maintains Multiple Indexes

It’s well-known that Google divides its data across shards. Think of each shard as an independent database (or system of database applications) that stores its own data and uses its own algorithms (applications) to search that data. The more important Google believes a piece of information to be, the more shards that information will be stored in.
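To make the idea of "more important data lives in more shards" concrete, here is a minimal sketch. It is not how Google assigns shards; the `importance` score, the shard count, and the `shards_for` function are all hypothetical names invented for this illustration.

```python
import hashlib


def shards_for(doc_id: str, importance: float,
               num_shards: int = 8, max_copies: int = 4) -> list[int]:
    """Return the shard numbers that would hold copies of a document.

    Assumptions (not from any Google source): importance is a score
    between 0.0 and 1.0, and copies are placed on consecutive shards.
    """
    # More important documents get more copies, but always at least one.
    copies = max(1, min(max_copies, round(importance * max_copies)))
    # A stable hash keeps the same document on the same starting shard.
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    start = int(digest, 16) % num_shards
    return [(start + i) % num_shards for i in range(copies)]


if __name__ == "__main__":
    print(shards_for("https://example.com/", importance=1.0))
    print(shards_for("https://example.com/old-page", importance=0.1))
```

A query that only reaches some of the shards is more likely to find the document with four copies than the document with one, which is the practical point of the paragraph above.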

But Gary Illyes also recently revealed that Google stores data in different types of physical storage: RAM (fast memory), Solid State Drives, and Hard Disk Drives. They keep their most important or preferred data in the RAM storage. Less important data is stored on SSDs. And the least important information is stored on hard drives.
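The tiering Gary described can be pictured as a simple threshold rule. This is only a sketch: the `importance` score and both cutoff values are made-up assumptions, and Google has not published how it decides which data goes where.

```python
def storage_tier(importance: float,
                 ram_cutoff: float = 0.9,
                 ssd_cutoff: float = 0.5) -> str:
    """Map a hypothetical 0.0-1.0 importance score to a storage tier."""
    if importance >= ram_cutoff:
        return "RAM"   # fastest to serve, most expensive to keep
    if importance >= ssd_cutoff:
        return "SSD"
    return "HDD"       # slowest to serve, cheapest to keep
```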

Through the years, Googlers have hinted that the less important data is re-fetched (crawled) less often than the most important data. Gary says they use RAM storage to ensure that the most important information is retrieved in the least amount of time possible and served as quickly as possible. That might be Website listings, but think of all the direct answers Google provides to searchers. I deduce that real-time data is stored and managed in the RAM index. But frequently crawled and cached pages, like the home pages of major news portals, are probably also stored in RAM.

If you’re wondering how you can ensure your site is indexed in RAM, you’ll need to launch the next BBC News or CNN. But I suspect there are many pages on those and other important news sites that are only indexed in SSD- or HDD-based systems.

3. Google’s Crawl System Is Complicated

Gary once said that Google can rebuild its entire search index in a matter of hours. I believe him. But they rarely do that.

What they do instead is crawl the Web constantly. Their crawlers request content from a central pool of stored documents that were pulled from the Web. This pool, or “crawl cache” as Matt Cutts once described it, saves Google time and helps their crawler system use resources more efficiently.

You’ve probably heard about crawl budget. Googlers explain that their systems assign a crawl budget to every Website. The crawl prioritizing algorithms estimate how much of a load each Website can handle from Google’s crawling and they then decide which URLs to fetch within a crawling period. That period might be a day, a week, or a month.

*=> For most sites, there is nothing you can do to modify crawl budget. Only very large sites need to worry about crawl budget.
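A crawl budget can be thought of as a spending limit applied to a prioritized list of URLs. The sketch below assumes each URL has a priority score and a load cost; those fields, the `CrawlCandidate` class, and the `plan_crawl` function are hypothetical and only illustrate the general idea, not Google's scheduler.

```python
from dataclasses import dataclass


@dataclass
class CrawlCandidate:
    url: str
    priority: float  # hypothetical score: higher means "fetch sooner"
    cost: int = 1    # hypothetical load units one fetch puts on the site


def plan_crawl(candidates: list[CrawlCandidate], budget: int) -> list[str]:
    """Pick URLs for one crawling period, best priority first, within budget."""
    chosen = []
    remaining = budget
    for candidate in sorted(candidates, key=lambda c: c.priority, reverse=True):
        # Skip anything the site cannot absorb in this period.
        if candidate.cost <= remaining:
            chosen.append(candidate.url)
            remaining -= candidate.cost
    return chosen
```

URLs that do not fit in this period's budget are simply left for a later period, which is why low-priority pages on a large site can go a long time between fetches.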

Google’s crawl system literally manages a super crawl budget for the entire Web. That pool of cached documents makes it possible for Google’s algorithms to function 24/7 without needing to reach all parts of the Web all the time.

Thanks to the URL Inspection Tool in Google’s Search Console, you have an option for escalating the crawl priority for a small number of URLs each day. Use it wisely, because it doesn’t guarantee that anything will be indexed quickly, or rank for the queries you think are important.

Google’s search system crawlers operate from a small number of IP addresses. Google tells you how to verify when a crawler really is Googlebot. It’s easy for your competitors and other marketers to spoof Googlebot’s user-agent. And many people do spoof Googlebot.

You should be aware that some of Google’s tools, like the Mobile-friendly Test Tool, crawl from their own IP addresses. So you should be careful about which self-identified Google crawlers you allow to access your sites.
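The verification Google describes is a two-step DNS check: look up the host name for the requesting IP address, confirm it belongs to a Google domain, then look up that host name and confirm it points back to the same IP address. Here is a minimal sketch using Python's standard `socket` module. The domain list reflects my reading of Google's published guidance and may be incomplete, and the `reverse_lookup` and `forward_lookup` parameters are hypothetical hooks added so the logic can be tried without a network connection.

```python
import socket

# Assumption: domains Google's guidance lists for its search crawlers.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if the IP passes a reverse-then-forward DNS check."""
    # Default to real DNS lookups; gethostbyname_ex covers IPv4 only.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        # Step 1: the host name must sit under a Google crawler domain.
        if not host.lower().rstrip(".").endswith(GOOGLE_CRAWLER_DOMAINS):
            return False
        # Step 2: the host name must resolve back to the same IP address.
        return ip in forward_lookup(host)
    except OSError:
        # Failed lookups (socket.herror, socket.gaierror) count as unverified.
        return False
```

A user-agent string alone proves nothing, so a check like this would run against the IP address recorded in your server logs, not against anything the visitor claims about itself.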

4. There Are Multiple Google Search Engines

Google isn’t one search engine, it’s several search engines. There is the Web Search Engine, the News Search Engine, the Video Search Engine (separate from the video hosting platform YouTube), the Image Search Engine, and more. The Books, Patents, and Scholar search engines are very specialized compared to other Google search engines.

Some of their specialty search indexes may only be sub-systems of these independent search engines. I don’t know that for sure, but the way they operate, I get the impression they’re built out from the same indexes as other search systems.

When you think about how the Google algorithm works, are you thinking of the right algorithm or system? Web search results may include excerpts from other search engines – that is what Google calls Universal Search Results. They are all-inclusive.

If you’re keeping count, I’ve so far described three different perspectives of the Google search experience for you: one based on distributed data centers, one based on shards and variations on the core search algorithms, and one based on separate search verticals. Each search vertical is designed to work differently from the others, using different data storage and management techniques as well as different algorithms.

The (in)famous Hilltop algorithm was integrated into Google News Search. Developed by Krishna Bharat, it was never integrated into Web Search even though thousands of SEO specialists have been incorrectly telling each other for years it was the basis of a Google Web Search algorithm update in the early 2000s (Florida). The “Florida” update was something else entirely – or so Matt Cutts once said.

5. Google’s Rankings are (Usually) Computed at Run-time

Regardless of which Google search engine you’re using, the rankings you see are either computed or selected at run-time. If they are NOT computed in real-time then they’re pulled from a pool of cached or stored search results that were computed in a previous run-time.

For Web Search (at least – possibly other search verticals), there is a master algorithm that takes your query and decides whether it will be answered by the “live” system or by some alternative sub-system. RankBrain is one such alternative system.

Query processing is a major sub-system in the Google search system. Each query is evaluated by one or more algorithms. Some of those algorithms may rewrite the query. Some of those algorithms pull search results based on their own criteria. The master algorithm figures out which set of results to show you.
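Here is a rough sketch of that flow: check for stored results first, otherwise let several sub-systems each propose results and let a "master" step choose among them. Everything in it is an assumption made for illustration. In particular, the confidence score each sub-system returns and the rule of picking the highest score are my inventions, not a description of Google's code.

```python
def answer_query(query, subsystems, cache):
    """Return results for a query from the cache or from the best sub-system.

    Each sub-system is a function that takes the normalized query and
    returns (confidence, results), or None if it has nothing to offer.
    """
    key = query.strip().lower()  # a trivial stand-in for query rewriting
    if key in cache:
        # Results computed in a previous run-time are reused as-is.
        return cache[key]
    proposals = [system(key) for system in subsystems]
    proposals = [p for p in proposals if p is not None]
    if not proposals:
        return []
    # The "master" step: keep the proposal with the highest confidence.
    _, results = max(proposals, key=lambda p: p[0])
    cache[key] = results
    return results
```

Even in a toy like this, the same query can return different results depending on what is already cached and which sub-systems happen to be available.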

*=> You may not always see the same search results even if you run the same query repeatedly.

Google’s system can assume you’re not happy with the results it shows you and may try to show you something else. Don’t assume this is all done on the basis of your clicks and dwell time. Google has no idea of what you do on a Web page or if it satisfies your need. Google only knows what your queries look like.

*=> Searchers can influence how Google’s search results are chosen.

You can turn SafeSearch on or off, use special query operators that modify the criteria used to resolve your search, log in or out of Google, disable Google’s ability to use your search history, clear your cookies, use an ad blocker or other privacy extension, change browsers, change computers, change locations, change language, change country, etc.

People change their search context all the time, in many ways, and usually without realizing it. All these subtle changes force the algorithms to return different results. And this is yet another reason why you cannot trust SEO tool “search rankings”. They really don’t mean anything in the greater scheme of things.

Your pages don’t have “rankings”. You see rankings in fixed contexts. Those fixed contexts may be fairly consistent (that is, they exist for a plurality, perhaps even a majority of searches) but they will never be the same 100% of the time.

6. Google Learns from Searcher Query Logs

Google studies searcher behavior. They’re trying to infer searcher intent from the way people search, from their search contexts, and from the queries they use.

You as the Web marketer cannot infer searcher intent from the fact that someone landed on your site. Nor does “bounce rate” tell you anything about the searcher’s intent. You will indeed benefit from understanding searcher intent, but you must infer it from other sources. In my experience, a majority of Web marketers fail badly at understanding searcher intent. You need real insider knowledge of the query space to understand searcher intent. You know why you search for your favorite restaurant near you when you’re hungry.

However, if you’re not working in the oil drilling industry, you probably have no understanding of why someone would search for “oil sweetening”. Sure, you can read articles about petroleum processing and refining methods, but you need the insight of someone who has actually worked in the industry to understand why they are using queries about esoteric topics.


Query log analysis is a major part of search engine system management. To extract useful patterns from the data, the search engineers need to study a massive amount of data. They can’t do this in real-time. When you read patents and research papers that mention or describe query log analysis, you’re not reading about real-time ranking algorithms.

By the same token, when search engine documents mention “user behavior modeling”, they’re talking about experiments where they analyze probabilistic distributions of outcomes based on assumed user behavior and intentions. These “models” are not used at run-time to choose search results or rank them. They are used to evaluate the algorithms designed to choose/select search results and rank them.

7. Google’s Predictive Algorithms Create “Lookup Tables”

Information retrieval science depends on programs (algorithms) called “document classifiers”. These machine learning programs process documents pulled from the Web, evaluating them, assigning scores to them, extracting information, triggering flags, etc. But modern search engines supplement that classification process with more sophisticated data analysis.

Pre-training algorithms like BERT attract a lot of attention, but most of what the Web marketing world has written about these algorithms is pure nonsense. Machine learning systems are used to find patterns in large amounts of data. Sometimes those patterns can be leveraged by other algorithms. That’s what pre-training does: it collects the patterns most likely to identify relationships between pieces of information and organizes them so that other algorithms can use that information about relationships without having to learn everything themselves.

It’s a bit simplistic to say that BERT builds lookup tables, but some of the software developers who write these kinds of algorithms occasionally resort to that explanation for lack of anything more relevant to the non-ML specialist vocabulary.

The learning algorithms, especially Google’s revolutionary transformers, rely on weighted averages to extract these useful patterns from millions of rows of data (the “vectors” everyone talks about). The data is often representational, consisting of simple values like -1, 0, and 1. Think about it: when you’re crunching millions of numbers together, you need to keep those values as small as possible because their totals and averages will be large numbers. The vectors can consist of hundreds or thousands or even tens of thousands of units of information, each representing a specific, special piece of data about some “source”.
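A weighted average over vectors is simple to show with toy numbers. The sketch below is not a transformer; it only demonstrates the arithmetic. The `softmax` step is included because transformer attention uses it to turn raw scores into weights that sum to 1, and the vectors and weights are made up for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_average(vectors, weights):
    """Average equal-length vectors, position by position, using weights."""
    total = sum(weights)
    dims = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dims)
    ]


if __name__ == "__main__":
    rows = [[1, 0, -1], [-1, 0, 1]]         # toy vectors of -1, 0, and 1
    print(weighted_average(rows, [3, 1]))   # the first row counts three times as much
    print(weighted_average(rows, softmax([2.0, 0.0])))
```

The "learning" part of a real system is in how the weights are chosen; the averaging itself is no more complicated than this.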

Although Google has a few machine learning algorithms that run in real-time, the majority of them operate in “batch mode”, offline, outside of the search system. Your search results are the second- or third-generation product of algorithms that don’t use machine learning so much as they use what machine learning algorithms produce.

Conclusion

If you’re looking for insight into how Google ranks search results, the best I or anyone can give you is the generic ranking formula, which looks something like this:

Document ranking = A * B * C … X * Y * Z

Each of those letters (variables) represents a score or signal used to compute a final ranking value. Every document gets its own list of signals, and some documents may be missing signals.
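The generic formula can be written out in a few lines. This sketch assumes that a missing signal is treated as a neutral 1.0 so that it leaves the product unchanged; that choice is mine, made only to keep the example working, and the signal names are placeholders.

```python
import math


def document_ranking(signals, expected):
    """Multiply a document's signal scores into one ranking value.

    signals:  dict of signal name -> score for this document
    expected: names of every signal the formula looks for
    Assumption: a missing signal counts as 1.0 (no effect on the product).
    """
    return math.prod(signals.get(name, 1.0) for name in expected)


if __name__ == "__main__":
    formula = ["A", "B", "C"]
    print(document_ranking({"A": 2.0, "B": 0.5}, formula))            # C is missing
    print(document_ranking({"A": 2.0, "B": 0.5, "C": 0.0}, formula))  # one zero wipes it out
```

Because the signals multiply, many different combinations of scores produce the same final value, which is one reason the individual signals cannot be recovered from rankings alone.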

You have no way of knowing what signals are used to determine any document’s ranking. There is no “single variable test” that can extract these signals. There is literally no science for extrapolating what the ranking signals in the large, complex search system may be. The math hasn’t been invented to do this, and the computing power either doesn’t exist or is way too expensive to use.

Google leases its proprietary machine learning architecture (Tensor Processing Units, or TPUs) to researchers. You might need to spend half a million dollars to use state of the art equipment for a few hours, maybe a few days at most. So the next time someone tells you they’ve identified ranking signals, just move on.

Read more about Google’s search systems on SEO Theory

Why Google Does Not Use CTR, Bounce Rate, and E-A-T the Way You Think It Does

How the Google Panda Algorithm Works

What Every SEO Specialist Needs to Know about Vectors

The Late, Great Bounce Rate Debate

Google’s Linguistic Analysis is Not All about Web Search

Why You Cannot Reverse Engineer Google’s Algorithm

References

What Crawl Budget Means for Googlebot

Google Data Center Locations

Gary Illyes: 150 Google sub-systems

Search Off the Record Podcast: Language Complexities In Search Index Selection and More!

Video: How does Google evaluate algorithmic changes?
