Understanding Elasticsearch bottom up
Understanding Elasticsearch
- https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
- Inverted index
- Dictionary maps each term to its frequency; postings list the IDs of the documents that contain the term
- The index term is the unit of search
- Prefix searches are efficient; substring ("contains") searches are not
- Modelling problems as prefix searches
- Suffix matching - index reversed terms
- Contains matching - split terms into n-grams
- Compound words - decompound into their parts
- Geo coordinates - index as geohashes
- Numerical and time ranges - store values trie-like
- etc
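
The prefix-search tricks above can be sketched in a few lines of Python. This is a toy model, not Lucene's actual data structures: the helper names (`build_index`, `prefix_search`, `whole`, `reversed_term`, `trigrams`) and the sample documents are made up for illustration, and a plain sorted list stands in for the term dictionary.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(docs, analyze):
    """Build (sorted term dictionary, postings) from {doc_id: text}."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for term in analyze(word):
                postings[term].add(doc_id)
    return sorted(postings), postings


def prefix_search(dictionary, postings, prefix):
    """Binary-search to the first term >= prefix, then scan matching terms."""
    hits = set()
    i = bisect_left(dictionary, prefix)
    while i < len(dictionary) and dictionary[i].startswith(prefix):
        hits |= postings[dictionary[i]]
        i += 1
    return hits


def whole(word):
    # Index the word unchanged.
    return [word]


def reversed_term(word):
    # Index the reversed word so a suffix query becomes a prefix query.
    return [word[::-1]]


def trigrams(word):
    # Index every 3-character window; words shorter than 3 yield nothing.
    return [word[i:i + 3] for i in range(len(word) - 2)]


# Hypothetical sample documents.
docs = {
    1: "Elasticsearch builds on Lucene",
    2: "Searching elastic indexes",
    3: "Lucene segments",
}

# Prefix search on the plain dictionary.
dictionary, postings = build_index(docs, whole)
prefix_hits = prefix_search(dictionary, postings, "elastic")

# Suffix "ne" becomes prefix "en" on the reversed-term dictionary.
rev_dictionary, rev_postings = build_index(docs, reversed_term)
suffix_hits = prefix_search(rev_dictionary, rev_postings, "ne"[::-1])

# A 3-character "contains" query is an exact term lookup in the trigram index.
_, tri_postings = build_index(docs, trigrams)
contains_hits = tri_postings.get("ear", set())

print(sorted(prefix_hits))    # [1, 2]
print(sorted(suffix_hits))    # [1, 3]
print(sorted(contains_hits))  # [1, 2]
```

The trade-off is index size: each trick adds extra terms at indexing time so that the query can stay a cheap dictionary lookup.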
- Building indexes
- Priorities to balance: search speed, index compactness, indexing speed, time until changes become visible
- A smaller index is faster to search
- Lucene index files are immutable (never updated in place)
- Deletions only mark documents as deleted
- Update = delete + re-insert
- Updating is therefore costlier than inserting
- Index changes are buffered in memory and eventually flushed to disk (a Lucene flush); the files written by a flush form an index segment
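
A minimal sketch of the behaviour described above: buffered changes, immutable flushed segments, deletions as marks, and updates as delete plus re-insert. `ToyIndex` and its `buffer_limit` parameter are hypothetical; this models the idea only and says nothing about Lucene's on-disk format.

```python
from types import MappingProxyType


class ToyIndex:
    """Toy model of append-only segments (illustrative, not Lucene's format)."""

    def __init__(self, buffer_limit=2):
        self.buffer_limit = buffer_limit  # hypothetical flush threshold
        self.buffer = {}                  # in-memory changes, not yet searchable
        self.segments = []                # flushed segments, read-only
        self.deleted = set()              # (segment number, doc_id) deletion marks

    def flush(self):
        # Freeze the buffer into a new read-only segment.
        if self.buffer:
            self.segments.append(MappingProxyType(dict(self.buffer)))
            self.buffer.clear()

    def add(self, doc_id, text):
        self.buffer[doc_id] = text
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def delete(self, doc_id):
        # Segments are never rewritten; the document is only marked as deleted.
        self.buffer.pop(doc_id, None)
        for seg_no, segment in enumerate(self.segments):
            if doc_id in segment:
                self.deleted.add((seg_no, doc_id))

    def update(self, doc_id, text):
        # Update = delete + re-insert.
        self.delete(doc_id)
        self.add(doc_id, text)

    def search(self, word):
        # Only flushed segments are searched; marked documents are skipped.
        hits = []
        for seg_no, segment in enumerate(self.segments):
            for doc_id, text in segment.items():
                if (seg_no, doc_id) not in self.deleted and word in text.split():
                    hits.append(doc_id)
        return sorted(hits)


index = ToyIndex(buffer_limit=2)
index.add(1, "lucene segments")
index.add(2, "immutable segments")      # buffer is full, so segment 0 is flushed
index.update(1, "lucene merges")        # marks doc 1 in segment 0, buffers the new version
before_flush = index.search("merges")   # new version is still only in the buffer
index.flush()                           # writes segment 1
after_flush = index.search("merges")

print(before_flush)                # []
print(after_flush)                 # [1]
print(index.search("segments"))    # [2]
```

The old version of document 1 is still physically present in segment 0 after the update, which is why an update costs more than a plain insert.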
- Index segments
- Improvements across Lucene versions
- Lucene <2.3 would make a segment for each doc; these were merged on flush
- Newer versions can build larger segments in memory before flushing
- Lucene 4 - one in-memory segment per indexing thread - increased indexing performance, concurrent flushing
- Flushing segments invalidates field and filter caches (which are per-segment)
- Elasticsearch indexes
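- An Elasticsearch index consists of one or more shards, each with zero or more replicas; every shard (and every replica) is a Lucene index, which in turn consists of one or more index segments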
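
As a rough illustration of the hierarchy above, the sketch below counts how many Lucene indexes back one Elasticsearch index. `IndexLayout` is a hypothetical helper; `number_of_shards` and `number_of_replicas` are the index settings Elasticsearch uses for these two values, with replicas counted per shard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexLayout:
    """Hypothetical helper describing one Elasticsearch index."""

    number_of_shards: int    # at least 1
    number_of_replicas: int  # replicas per shard, 0 or more

    def lucene_indexes(self):
        # Every shard and every replica of it is a separate Lucene index.
        return self.number_of_shards * (1 + self.number_of_replicas)

    def settings_body(self):
        # Settings in the shape accepted when creating an index.
        return {
            "settings": {
                "index": {
                    "number_of_shards": self.number_of_shards,
                    "number_of_replicas": self.number_of_replicas,
                }
            }
        }


layout = IndexLayout(number_of_shards=3, number_of_replicas=1)
print(layout.lucene_indexes())  # 6
print(layout.settings_body())
```

Each of those Lucene indexes then holds its own set of segments, so a search over the Elasticsearch index fans out to shards and, within each shard, to segments.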
Elasticsearch in production