#Serpstat_chat How Search Engines Crawl And Index

This episode of the #serpstat chat covers indexing and crawling. Twitter offers a tremendous “firehose of information” and lively tweet discussions; this is our recap of a recent chat with Marianne Sweeny.

Marianne is enthusiastic about breaking down the barriers that exist between information architecture, content strategy, and web development in order to build a more holistic approach to fulfilling user demands for an ideal user experience.

What does it mean to “index a page”?

A1.1: Google’s robots systematize content by its keywords and freshness, creating a search index from the data collected about the page.

Olena Prokhoda

It means the page is included in the storage system (database, etc.) and has associations applied (such as keywords, intent type, etc.).

Darth Autocrat (Lyndon NA)

I have found some confusion around crawling, indexing, and caching site content. Crawling: We have some control here through the robots.txt instruction. Indexing: minimal control here through NoINDEX. Here, the search engine transforms and stores content. #serp_chat

Marianne Sweeny

Summary: Google’s bots use a process called crawling to systematically search the internet for new or updated content. Once content is found, it is added to the search index, a database that stores information about the page, including its keywords and how recent it is.

The search index is used to help quickly and accurately match users’ search queries with relevant content. There is some control over this process through the use of the robots.txt file and the NoINDEX tag, but overall, the search engine has the ability to transform and store content as it sees fit.

Clever Googlebot
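The summary above can be sketched in miniature. Below is a toy, network-free illustration of a crawl loop — a frontier of URLs, link extraction, and hand-off to an indexer. The names (`crawl`, `LinkExtractor`) and the injected `fetch` callable are illustrative inventions, not anything Google-specific.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: discover pages, hand each off for indexing.

    `fetch` is any callable returning the HTML for a URL (e.g. a thin
    wrapper around urllib.request.urlopen); it is injected so the loop
    itself stays network-free.
    """
    frontier = list(seed_urls)
    seen = set(frontier)
    discovered = []
    while frontier and len(discovered) < max_pages:
        url = frontier.pop(0)
        html = fetch(url)
        discovered.append(url)  # in a real engine: send to the indexer
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return discovered
```

A real crawler would also honor robots.txt, throttle requests, and deduplicate content — this only shows the discover-and-follow shape of the process.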

Are crawling and indexing a page the same thing?

No. Crawling finds the content and sends it back to be processed by Google. This is where the decision is made on whether to index the page. Indexing is breaking the page into elements, transforming the words into tokens, and storing them in data tables.

Marianne Sweeny

Crawling is the process of discovering new or updated web pages and following links to them from existing web pages. Indexing is the process of storing and organizing the data found during the crawling process in an easily searchable format.

Rahul Marthak

A2: No. Crawling is reading the code for each page on a site, utilizing the directives (server and client-based), reading the navigation and folder structure, taxonomy, and following the links across the site to facilitate understanding the IA and ultimately the relevance of a site.

Boyd Lake SEO

Summary: No, crawling and indexing are not the same. The process of crawling involves discovering new or updated web pages and following links to them from existing pages. This allows the search engine to understand the structure and organization of a website, as well as the relevance of its content.

Once the crawling process is complete, the search engine uses indexing to store and organize the data it has found in an easily searchable format. This allows users to quickly find relevant information when they perform a search on the search engine.
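The tokenizing-and-storing step described above is classically implemented as an inverted index. Here is a minimal sketch; the names (`build_index`, `search`) and the AND-semantics query are simplifying assumptions — real search indexes also store term positions, freshness, link data, and much more.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each token to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index

def search(index, query):
    """Return page IDs containing every query token (AND semantics)."""
    token_sets = [index.get(t, set()) for t in tokenize(query)]
    return set.intersection(*token_sets) if token_sets else set()
```

For example, indexing `{"p1": "How search engines crawl", "p2": "Crawling and indexing explained"}` and searching for "crawl" returns `{"p1"}` — which also shows why real engines stem words, so that "crawl" and "crawling" match each other.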

What is the difference between indexing a page and caching a page?

Caching is literally “storing a copy” of what was crawled (served to the bot). It is separate from indexing and has no impact/influence on indexing (you can be indexed with no cache, due to an error or the use of the “noarchive” meta robots directive).

Darth Autocrat (Lyndon NA)

Here is a point of additional confusion. A page does not have to be cached to appear in search results. A cached version is a copy retained by the search engine. Search engines make the decision to retain copies of pages for reasons that are not shared.

Marianne Sweeny
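Since indexing and caching are controlled by different signals, it can help to see how a `noarchive` or `noindex` directive is actually read out of a page. This is a small sketch using Python's standard `html.parser`; the class and function names are my own, and note that engines also honor the `X-Robots-Tag` HTTP header, which this ignores.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()
    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "robots" and attrs.get("content"):
            for directive in attrs["content"].lower().split(","):
                self.directives.add(directive.strip())

def robots_directives(html):
    """Return the set of robots meta directives found in the page HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

A page carrying `noarchive` can still be indexed and ranked; it simply asks the engine not to expose a cached copy.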

A3: Indexing and caching a page are ways of making websites run faster by storing certain parts of the website in a special place so that they can be accessed quickly.

damians.eth / Psychedelic Domains

A4: Yes, it does. However, this applies only to organic search.

Avast Zumac

Tough question… and the answer depends on the definition of “indexed.” Technically, you cannot appear in the SERP unless Google knows of and has stored data about that URL. But… that “stored data” may not be complete (G may have stored the URL only).

Darth Autocrat (Lyndon NA)

A4: Yes and no. Yes, a page must be indexed by Google in order to appear in Google search results. No (maybe? not always) in “rare” cases.

Rahul Marthak

A5.1: Check that you are not blocking Googlebot in your robots.txt file with directives such as:
User-agent: Googlebot
Disallow: /
or
User-agent: *
Disallow: /

Olena Prokhoda
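Olena's robots.txt check can be automated with Python's standard `urllib.robotparser`. A minimal sketch (the helper name `is_crawlable` is mine); note that `Disallow: /` under the matching user agent blocks the whole site:

```python
from urllib.robotparser import RobotFileParser

def is_crawlable(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In production you would fetch the live file (e.g. `RobotFileParser("https://example.com/robots.txt")` plus `.read()`) rather than pass the text in directly.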

There is a new protocol out, developed by Bing and Yandex, called IndexNow: https://indexnow.org. Site owners can submit a list of URLs for immediate crawling and consideration for the index. Not sure if Google has signed on … yet.

Marianne Sweeny
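For the curious, IndexNow is a simple JSON POST. The sketch below builds (but does not send) a submission request, assuming the shared `api.indexnow.org` endpoint and the `host`/`key`/`urlList` payload documented at indexnow.org; the helper name is my own, and your key must match a key file hosted on your site.

```python
import json
from urllib.request import Request

# Shared endpoint documented at indexnow.org; individual engines
# (Bing, Yandex, etc.) also expose their own endpoints.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"

def build_indexnow_request(host, key, urls):
    """Build (but do not send) an IndexNow URL-submission request."""
    payload = json.dumps({"host": host, "key": key, "urlList": list(urls)})
    return Request(
        INDEXNOW_ENDPOINT,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

To actually submit, pass the request to `urllib.request.urlopen`; a 200/202 response means the URLs were received for consideration, not that they will be indexed.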

* Permitting crawling (robots.txt)
* Aiding crawling (sitemap(s), internal links)
* Permitting indexing (robots meta/header)
* Reducing duplicates (better organization, 301s, CLE/CLR/CLS)
* Obtaining Inbound Links
* Higher quality content
* Greater originality

Darth Autocrat (Lyndon NA)

! YES ! when I can. Unfortunately, there’s a strong trend for shared hosts to disable SALs (Server Access Logs) to save resources, and CDNs stuff up localized SALs, and often don’t have CDNALs unless you pay extra for them! Next best thing, Web Tracking

Darth Autocrat (Lyndon NA)

A6: YES, no need to waste time wondering and guessing. This is how you can fix page infrastructure concerns faster.

Sweepsify

A6: I regularly review server logs and have observed that most of our clients receive visits from Google at intervals of 7 to 18 minutes.

Marianne Sweeny
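Checking crawl frequency the way Marianne describes mostly means pulling Googlebot lines out of your server access logs and measuring the gaps. A rough sketch, assuming the common/combined log format (the regex and function names are mine, and matching on the user-agent string alone can be spoofed — verify the IP if it matters):

```python
import re
from datetime import datetime

# In the combined log format, the user agent is the last quoted field.
LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\].*"(?P<ua>[^"]*)"$')

def googlebot_hit_times(log_lines):
    """Extract request timestamps for lines whose user agent mentions Googlebot."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.search(line.rstrip())
        if match and "googlebot" in match.group("ua").lower():
            hits.append(datetime.strptime(match.group("ts"),
                                          "%d/%b/%Y:%H:%M:%S %z"))
    return hits

def crawl_intervals_minutes(hits):
    """Minutes between consecutive Googlebot visits."""
    return [(b - a).total_seconds() / 60 for a, b in zip(hits, hits[1:])]
```

Run over a day of logs, this gives you the 7-to-18-minute style intervals mentioned above for your own site.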

Not sure. DR is a furtherance of BERT, and is used for “comprehension.” It could influence what is indexed with regards to (dis)similarity measures (near-duplicates, spun, paraphrased, etc.). It could also aid in what a page is indexed as (topicality/relevance)

Darth Autocrat (Lyndon NA)
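DeepRank's internals are not public, but as a concrete (and much simpler) stand-in for the (dis)similarity measures mentioned above, here is the classic shingle-plus-Jaccard test for near-duplicate pages; the function names and the 0.8 threshold are arbitrary illustrations, not anything Google uses:

```python
import re

def shingles(text, k=3):
    """Set of k-word shingles (overlapping word windows) in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets, between 0 and 1."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(doc_a, doc_b, threshold=0.8):
    """True if the two documents share most of their word shingles."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

Lightly spun or paraphrased copies keep most of their shingles intact, which is why measures in this family catch them even when the pages are not byte-identical.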

Q7: How are partially paywalled sites impacted by DeepRank?

Sweepsify

A7: Deep ranking can give indexing a number of advantages. It can help identify content more accurately and reduce the need for manual tagging. It can also enable more efficient indexing and improve search accuracy by better understanding the content, and it can fine-tune results.

damians.eth / Psychedelic Domains

I hope you enjoyed this segment of the chat; let me know what you thought in the comments.
