#SEOchat: Crawling and Indexing

Crawling and indexing are contentious topics in technical SEO, and there are many misconceptions about how search bots crawl and index pages. So it’s worth clarifying the major assumptions about both.

This post recaps a #SEOchat on Twitter moderated by Dan Taylor. The tweets are jam-packed with information that debunks common crawling and indexing myths, yet they are easy to miss because of how quickly they appear and disappear in the feed.

Dan Taylor is the Head of Technical SEO at @salt agency, the #TechSEO Boost 2018 Research Winner, and known for #EdgeSEO. He is recognized as a TechSEO on the Elephant app.

Before we dive into the chat, let’s get down to the basics.

Is crawling different from indexing? Yes. Crawling is how search bots discover and fetch pages, while indexing is how the fetched content is processed and stored so it can appear in search results. A page can be crawled but never indexed, and a page blocked from crawling can sometimes still be indexed from links alone.
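To make the distinction concrete, here is a minimal sketch in Python. The URL is a placeholder on example.com: robots.txt governs whether a bot may crawl a page, while the robots meta tag (or the X-Robots-Tag response header) governs whether a crawled page may be indexed.

```python
# Minimal sketch: crawlability vs. indexability for a hypothetical URL.
# robots.txt controls crawling; the robots meta tag / X-Robots-Tag
# header controls indexing. The URL below is a placeholder.
import urllib.request
import urllib.robotparser

URL = "https://www.example.com/some-page"  # hypothetical

# 1) Crawling: is Googlebot allowed to fetch this URL at all?
rp = urllib.robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()
can_crawl = rp.can_fetch("Googlebot", URL)
print("Crawlable:", can_crawl)

# 2) Indexing: if crawlable, does the page ask not to be indexed?
if can_crawl:
    with urllib.request.urlopen(URL) as resp:
        header_noindex = "noindex" in (resp.headers.get("X-Robots-Tag") or "")
        body = resp.read().decode("utf-8", errors="replace").lower()
    # Crude substring check; a real crawler would parse the HTML.
    meta_noindex = "noindex" in body and 'name="robots"' in body
    print("Indexable:", not (header_noindex or meta_noindex))
```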

Additionally, Google, the largest search engine, publishes videos that explain how crawling and indexing work.

Now, let’s clear up some common doubts about crawling and indexing.

Q1: Is it normal for 20% of a website not to be indexed? Is it completely normal that not everything on a website gets indexed?

A1: Anecdotally, I’m seeing this on enterprise sites more than small to mid – with the trend increasing over recent months with freshness/time decay seemingly becoming more of a factor.

Dan Taylor

A1. For my personal sites, the % of pages not indexed has been much smaller (thankfully) but for work, the bigger the site, the bigger the %

Luke Davis (he/him)

#SEOChat A1. Not very. It varies, largely based on factors such as: 1) Quantity of pages 2) Quantity of inbound links to pages 3) Originality of content 4) Internal linking 5) Sitemap usage

Darth Autocrat (Lyndon NA)

Q2: Google has stated that they have a “quality threshold” for indexing, and that for SERPs catering to multiple common query interpretations, each “source type” has a different quality threshold. How do you factor this into your competitor analysis for keywords?

A2: This is a good one. Look at the origins of some keywords. Are they commonly used terms OR did a brand create the term?

Sweepsify

#SEOChat A2. It’s not a specific thing I go out of my way for, it’s standard. Identify target term(s), analyse SERP for competitors, check the Intent and Format, look at/count DataPoints in competing content. (Also compare a number of lower ranking pages)

Darth Autocrat (Lyndon NA)

Q3: How much correlation between confirmed and unconfirmed Google updates have you seen in your website’s indexing levels over the past 12 months? And if possible, how big are the sites?

A3: For example, that weird dip in the graph correlated with two unconfirmed updates: the indexing of ~20 pages dropped, then came back.

Dan Taylor

A3: A good deal of step-function correlation, especially around impression changes … fewer click changes, which makes me think it was SERP UI changes …

Eric Wu

A3: Depends on the type/nature of the site and the general quality level of the content. Sites that are strong seem to barely see a blip. It’s those with questionable content that seem most at risk.

Darth Autocrat (Lyndon NA)

Q4: How would you adapt if Google and other search engines stopped crawling a large percentage of the internet and relied on things like IndexNow for the discovery of URLs and domains “outside the higher tiers”?

A4: Would be super annoying, and would, I think, be a step backwards for democratizing the web #seochat

Mordy Oberstein 

A4: I wouldn’t worry too much. The figure of “60%” may include not only the obvious “copies” and “highly similar” spins (some of the bigger sites out there have multiple copies of themselves… which inflates the %), but also numerous canonical issues, infinite crawls, etc.

Darth Autocrat (Lyndon NA)

A4: Make sure IndexNow is part of the workflow for new or updated content. Prioritize content that needs to be available in search, and get it promoted.

Mark Alves
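Mark’s suggestion is easy to automate. Below is a minimal sketch of an IndexNow submission per the protocol documented at indexnow.org; the host, key, key-file location, and URL list are all placeholders you would replace with your own.

```python
# Minimal sketch: push new or updated URLs to IndexNow.
# host, key, keyLocation, and urlList are placeholders; the key must
# match a text file you host at keyLocation (see indexnow.org).
import json
import urllib.request

payload = {
    "host": "www.example.com",
    "key": "your-indexnow-key",
    "keyLocation": "https://www.example.com/your-indexnow-key.txt",
    "urlList": [
        "https://www.example.com/new-post",
        "https://www.example.com/updated-post",
    ],
}

req = urllib.request.Request(
    "https://api.indexnow.org/indexnow",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json; charset=utf-8"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    # 200 or 202 means the submission was received.
    print(resp.status)
```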

Q5: How do you currently monitor and measure how search engines are crawling your site? Do you have IndexNow as part of your current workflow at all?

#SEOChat A5. Depends on what’s available. I’m a SALs (server access logs) lover – but not all hosts provide them, and CDNs charge for them. Failing that: web tracking, looking for bot hits, and aligning with Sitemaps in GSC (or spot-checking priority pages with the inurl: operator, etc.). And no, IndexNow isn’t part of my workflow.

Darth Autocrat (Lyndon NA)
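If you do have server access logs, “looking for bot hits” can be as simple as the sketch below. It assumes the common combined log format; the log path and bot tokens are illustrative, and user-agent strings can be spoofed, so pair this with the IP/DNS verification covered in a later answer.

```python
# Minimal sketch: count search-bot hits per URL in a combined-format
# access log. Log path and bot names are assumptions.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path
BOTS = ("Googlebot", "bingbot")

# Combined format: IP - - [time] "METHOD /path HTTP/x" status size "referer" "UA"
LINE_RE = re.compile(r'"\w+ (?P<path>\S+) HTTP/[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

hits = Counter()
with open(LOG_PATH) as log:
    for line in log:
        m = LINE_RE.search(line)
        if m and any(bot in m.group("ua") for bot in BOTS):
            hits[m.group("path")] += 1

for path, count in hits.most_common(20):
    print(f"{count:6d}  {path}")
```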

A5: We’ve had the Bing submission plugin installed since the start. Bing indexed us before Google did.

Google only indexed the home page for about a month.

Sweepsify

A5: Mostly access logs piped into Kibana or into a DB table that is filtered and visualized by page type. GSC crawl data is also useful.

For Gbot, I use the published IPs to filter: https://developers.google.com/search/apis/ipranges/googlebot.json. For all others, I validate with forward and reverse DNS.

Eric Wu
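Eric’s two checks can be sketched as follows: one validates an IP against Google’s published Googlebot ranges, the other does the reverse-then-forward DNS round trip. The expected hostname suffixes here are assumptions for Googlebot; other bots document their own.

```python
# Minimal sketch of the two bot-verification approaches above.
import ipaddress
import json
import socket
import urllib.request

GOOGLEBOT_RANGES = "https://developers.google.com/search/apis/ipranges/googlebot.json"

def is_googlebot_ip(ip: str) -> bool:
    """Check an IP against Google's published Googlebot prefixes."""
    with urllib.request.urlopen(GOOGLEBOT_RANGES) as resp:
        prefixes = json.load(resp)["prefixes"]
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
        for p in prefixes
    )

def verify_bot_by_dns(ip: str, suffixes=(".googlebot.com", ".google.com")) -> bool:
    """Reverse DNS (IP -> host), then forward DNS (host -> IPs) round trip."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not host.endswith(suffixes):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```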

Q6: How much do concepts such as “beneficial purpose”, “perspectives”, and the page source type play into your content strategy?

A6: Very important, IMO, because the definitions of terms like MFA and evergreen have changed with AI content. Is the page generic, with too much focus on CTAs/ads? Don’t index it. Is it timely AND relevant? Index it.

Sweepsify

If you want to learn more about improving your technical SEO, you can read our posts on technical SEO, how search engines crawl and index web pages, proper technical SEO, and sitemap tricks that can improve your indexation.
