5.8Search Algorithms & Systems

Inverted Index

Watch how search engines build an inverted index from raw documents. Follow the full pipeline from tokenization through index construction to query evaluation with boolean operators.

Presets

Documents

Tokenize

Index Build

Query

Select a preset to begin the indexing pipeline.

Documents4 docs

Doc 1The Cat

The cat sat on the mat. The cat is fluffy and warm.

Doc 2The Dog

A dog ran in the park. The dog is loyal and fast.

Doc 3The Garden

The garden has flowers and trees. A cat sleeps in the warm garden.

Doc 4The Park

The park is big and green. Dogs and cats play in the park.

Tokenization

Tokens will appear here during tokenization

Token

Stop Word

Stemmed

Inverted Index0 terms

TermPosting List (Document IDs)

Index will build here during the indexing phase

Query

Query results will appear here. Build the index first.

1.0x

Docs Processed0

Unique Terms0

Index Size0

Query Matches0

Phasedocuments

How Inverted Indices Work

1. Documents

Raw text documents are the input. Each document has a unique ID and contains natural language text that needs to be searchable.

2. Tokenization

Text is split into tokens, lowercased, and stop words (common words like “the”, “is”) are removed. Remaining words are stemmed to their root form.

3. Index Build

The inverted index maps each unique term to a posting list: the set of document IDs containing that term. This reverses the document-to-word relationship.

4. Query

Queries look up terms in the index. AND queries intersect posting lists (documents with all terms), OR queries union them (documents with any term).

Boolean Query Operations

ANDIntersection

“cat” AND “warm”

cat: [D1, D3, D4]

warm: [D1, D3]

Result: [D1, D3]

ORUnion

“cat” OR “dog”

cat: [D1, D3, D4]

dog: [D2, D4]

Result: [D1, D2, D3, D4]

Real-World Applications

Search Engines: Google, Bing use massive inverted indices across billions of web pages.

Elasticsearch: Powers full-text search in applications using Lucene-based inverted indices.

Code Search: GitHub code search uses inverted indices to search across millions of repositories.