5.8Search Algorithms & Systems

Inverted Index

Watch how search engines build an inverted index from raw documents. Follow the full pipeline from tokenization through index construction to query evaluation with boolean operators.

Presets
Documents
Tokenize
Index Build
Query
Select a preset to begin the indexing pipeline.
Documents4 docs
Doc 1The Cat

The cat sat on the mat. The cat is fluffy and warm.

Doc 2The Dog

A dog ran in the park. The dog is loyal and fast.

Doc 3The Garden

The garden has flowers and trees. A cat sleeps in the warm garden.

Doc 4The Park

The park is big and green. Dogs and cats play in the park.

Tokenization
Tokens will appear here during tokenization
Token
Stop Word
Stemmed
Inverted Index0 terms
TermPosting List (Document IDs)
Index will build here during the indexing phase
Query
Query results will appear here. Build the index first.
1.0x
Docs Processed0
Unique Terms0
Index Size0
Query Matches0
Phasedocuments
How Inverted Indices Work

1. Documents

Raw text documents are the input. Each document has a unique ID and contains natural language text that needs to be searchable.

2. Tokenization

Text is split into tokens, lowercased, and stop words (common words like “the”, “is”) are removed. Remaining words are stemmed to their root form.

3. Index Build

The inverted index maps each unique term to a posting list: the set of document IDs containing that term. This reverses the document-to-word relationship.

4. Query

Queries look up terms in the index. AND queries intersect posting lists (documents with all terms), OR queries union them (documents with any term).

Boolean Query Operations

ANDIntersection
“cat” AND “warm”
cat: [D1, D3, D4]
warm: [D1, D3]
Result: [D1, D3]
ORUnion
“cat” OR “dog”
cat: [D1, D3, D4]
dog: [D2, D4]
Result: [D1, D2, D3, D4]

Real-World Applications

Search Engines: Google, Bing use massive inverted indices across billions of web pages.
Elasticsearch: Powers full-text search in applications using Lucene-based inverted indices.
Code Search: GitHub code search uses inverted indices to search across millions of repositories.