Review of The Case for Learned Index Structures
Authors
“The Case for Learned Index Structures”
 Paper code
 arXiv:1712.01208v3
 Date
 URL
 https://www.arxiv-vanity.com/papers/1712.01208/
Researchers
Tim Kraska  MIT  Cambridge, MA  kraska@mit.edu 

Alex Beutel  Google, Inc.  Mountain View, CA  alexbeutel@google.com 
Ed H. Chi  Google, Inc.  Mountain View, CA  edchi@google.com 
Jeffrey Dean  Google, Inc.  Mountain View, CA  jeff@google.com 
Neoklis Polyzotis  Google, Inc.  Mountain View, CA  npolyzotis@google.com 
Jeff
 Lead of Google.ai. Pretty cool guy.
 2017, “Outrageously Large Neural Networks”.
Preliminaries
Background Knowledge
 Deep learning models
 are function approximators.
Search engine vs Database
 Relational Databases use a BTree index.
 Search engines mostly use an inverted index.
 Relational Databases give you what you asked for.
 Search engines give you what you wanted.
Terminology
 Indices
 = indexes. Indexes just sounds wrong to me.
 Model
 The set of functions that describe the relations between variables.
“Probabilistic and information theoretic methods are used to make results better anyway. Compromises are made anyway. Query reformulation, drift, etc. So it is just a natural progression to use NNs for some of these components. Am I right?” – A quote from myself.
More Background Knowledge
The research works under the premise that
 Indices are models (set of functions). For example,
 BTreeIndex: \(f: key \mapsto pos\)
 \(pos\) is the position of a record, within a sorted array
 HashIndex: \(f: key \mapsto pos\)
 \(pos\) is the position of a record, within an unsorted array
 BitMapIndex: indicates if a data record exists or not
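The "indices are models" premise can be sketched in code. A hedged illustration only: the toy keys and table size below are invented, and real index structures are of course far more sophisticated.

```python
from bisect import bisect_left

keys = [3, 7, 12, 25, 40]  # invented record keys

def btree_index(key):
    """BTreeIndex as a model f: key -> pos, within a sorted array."""
    return bisect_left(keys, key)

def hash_index(key, table_size=8):
    """HashIndex as a model f: key -> pos, within an unsorted array."""
    return hash(key) % table_size

def bitmap_index(key):
    """BitMapIndex as a model: does a record with this key exist?"""
    return key in set(keys)
```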
A new term is introduced!
 Learned Index
 A deep-learning model with the function of an index structure. Automagically synthesised.
Overview
The Argument of the Paper
The researchers hypothesise…
that all existing index structures can be replaced with learned indices.

The paper does not argue that you necessarily should.
It’s a novel approach to building indexes, complementing existing work.

Specifically, a model can
 Learn the sort order/structure of keys,
 and use this to predict the position/existence of records.
They explore…
 The extent to which learned models (including NNs) can replace traditional indices for efficient data access.
They speculate…
 This could fundamentally change the way database systems are developed in the future.
Investigations / Case studies
The studies performed in the paper are:
 Evaluating learned models for efficient data access, i.e. the traditional role of indices.
 Done on CPUs rather than G/TPUs for a fairer comparison with existing methods, despite new hardware being the biggest reason to use learned indices.
Theme 1: Can learned models speed up indices?
tested for read-only analytical workloads  (The majority of this paper) 

tested for write-heavy workloads  (Briefly covered) 
Theme 2: Can replacing individual components speed up indices?
Study 1 / 3  BTree  (Evaluated) 

Study 2 / 3  Hash-index  (Evaluated) 
Study 3 / 3  Bloom filter  (Evaluated) 
other components (sorting, joins)  (Briefly covered) 
Debunking the Myths
Myths or soon to become myths

Machine learning cannot provide the same semantic guarantees.
Traditional indices largely are already learned indices.
 BTrees predict record position.
 Bloom filter is a binary classifier (like our Delta Rule network). It’s a space-efficient probabilistic data structure. See: BitFunnel.
NNs are thought of as being very expensive to evaluate. In fact there are huge benefits, especially on the next generation of hardware.
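The Bloom-filter-as-binary-classifier point can be sketched minimally; the bit-array size m and hash count k below are arbitrary illustration values, not tuned parameters.

```python
class BloomFilter:
    """Space-efficient set membership: false positives possible,
    false negatives impossible (a one-sided binary classifier)."""

    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, item):
        # k cheap derived hash functions over an m-bit array.
        for i in range(self.k):
            yield hash((i, item)) % self.m

    def add(self, item):
        for h in self._hashes(item):
            self.bits |= 1 << h

    def __contains__(self, item):
        return all(self.bits >> h & 1 for h in self._hashes(item))
```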
Trends

GPUs and TPUs in phones
The main reason to adopt learned indices (page 4).

Scaling NNs is comparatively trivial; the marginal cost approaches zero.
Benefits for databases
 Remove the branch-heavy index structures and add neural networks.
Results and Conclusions sneak peek
Results
 Learned indices can be 70% faster than cache-optimized BTrees while saving an order of magnitude in memory.
 Tested over several realworld datasets.
Conclusions
 Replacing components of a data management system with learned models has farreaching implications.
 This work only provides a glimpse of what might be possible…
Introduction
“Traditional” Index Structures
Some examples
Covered in this paper by 3 separate studies:
 BTrees
 Great for range requests (retrieve all in a..b)
 HashMaps
 Key-based lookups
 Bloom filters
 Set membership
 May give false positives, but no false negatives
Solidly built
 Highly Optimised
 Memory
 Cache
 CPU
 Assume worst case
It works because…

Knowing the exact data distribution enables optimisation of the index.
…But then we must know the data distribution. And we don’t always.
Benefits of replacing BTrees with Learned Indices
Benefits of replacing BTrees with Learned Indices
 BTree lookup \(O(\log n) \Longrightarrow O(1)\) (if an SLM suffices)
 Simple Linear [Regression] Model: predict with 1 multiply, 1 add…
 ML accelerators (GPU/TPU): if the entire learned index can fit into the GPU’s memory, that’s 1M NN ops every 30 cycles with current technology.
 Mixture of Models (builds upon Jeff’s paper from last year): ReLU nets at the top, learning a wide range of complex data distributions; SLRMs at the bottom because they are inexpensive. Or use BTrees at the bottom stage if the data is hard to learn.
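The "1 mul, 1 add" point in miniature: a learned-style linear predictor against a binary search. A hedged sketch only; the coefficient 0.5 is hand-picked for this invented data, standing in for a trained model.

```python
from bisect import bisect_left

keys = list(range(0, 1000, 2))  # invented, perfectly linear key set

def tree_lookup(key):
    # BTree-style search: O(log n) comparisons.
    return bisect_left(keys, key)

def slm_lookup(key):
    # Simple linear model: one multiply, one add -- O(1).
    return round(0.5 * key + 0.0)
```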
Case Studies
Study 1 of 3: BTree \(\Rightarrow\) Learned Range Index [Model]
Replacing a BTree with a Learned [Range] Index [Model].
Theory
 \(\therefore\) BTree \(\approx\) Regression Tree \(\approx\) CDF \(\equiv\) Learned Range Index.
Plan
 Experiment with a Naïve Learned Index … to see how bad it is.
 Experiment with a much better Learned Index, the RMIndex.
Study 1 of 3: BTree \(\Rightarrow\) Learned Range Index [Model]
Why can we replace BTrees with DL again?
BTree is-a model
 BTreeIndex
 \(f: key \mapsto pos\)
 \(pos\) is the position of a record, within a sorted array
BTree \(\approx\) Regression Tree
 Regression Tree
 A decision tree with \(\mathbb{R}\) targets.
 Maps a key to a position with a min and max error.
Range Index Model is-a Cumulative Distribution Function (CDF)
A model which predicts the position given a key inside a sorted array effectively approximates a CDF (page 5).
 \(\therefore\) BTree \(\approx\) Regression Tree \(\approx\) CDF \(\equiv\) Learned Range Index.
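The chain above in miniature: a CDF model predicts \(pos \approx F(key) \cdot N\). Below, a deliberately crude uniform-assumption F stands in for a trained model, over invented keys; everything here is illustrative, not the paper’s method.

```python
keys = sorted([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])  # invented keys
N = len(keys)

def cdf(key):
    # Crude stand-in for a learned F: assume keys uniform on [min, max].
    lo, hi = keys[0], keys[-1]
    return min(max((key - lo) / (hi - lo), 0.0), 1.0)

def predict_pos(key):
    # pos ~= F(key) * N, scaled to the last valid array index.
    return round(cdf(key) * (N - 1))
```

Even this crude F lands within a couple of slots of the true position; the prediction error is what the min/max error guarantee then has to cover.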
Study 1 of 3: BTree \(\Rightarrow\) RT/RIM \(\Rightarrow\) CDF \(\Rightarrow\) Learned R.I.
Analogs
 Rebalanced vs Retrained
\(\therefore\) min/max error guarantee only needed for training.
Cumulative Distribution Function (CDF)
\(F_X(x) = P(X \leq x)\)
A range index needs to be able to provide:
 point queries \(\checkmark\)
 range queries: sort order(records) \(\equiv\) sort order(lookup keys) \(\checkmark\)
 guarantees on min/max error.
CDF is good to go. It can be used as our Learned Range Index.
\(\therefore\)
Can replace the index with other models, including DL, so long as the min and max errors are similar to a BTree’s.
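Why the min/max error guarantee is enough: a local search inside the error window recovers exactness. A minimal sketch, assuming `max_err` really does bound the model’s absolute prediction error (both names are mine, not the paper’s):

```python
from bisect import bisect_left

def lookup(keys, key, predict, max_err):
    """Model prediction, then binary search restricted to the error
    window [pred - max_err, pred + max_err] -- like a BTree page."""
    pred = predict(key)
    lo = max(pred - max_err, 0)
    hi = min(pred + max_err + 1, len(keys))
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1
```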
Study 1 of 3: BTree \(\Rightarrow\) Learned Range Index [Model]
Experiment 1.1: Naïve Learned Index with TensorFlow
 Objective
 Evaluate to study the technical requirements to replace BTrees.
 Architecture
 Two-layer fully connected neural network (32:32).
 32 neurons/units per layer.
 ReLU activation function.
 Input: the timestamps of messages from web server logs
 Target: the positions of the messages (actual line number?)
 The objective is not simply error minimisation; the min/max error matters
 Purpose
 Build secondary index over timestamps. Test performance.
Study 1 of 3: BTree \(\Rightarrow\) Learned Range Index [Model]
Critique
This is a very naïve learned index, and that’s how we want it. The researchers want to see how much faster a BTree is than a naïve neural network substitution. The answer is 300x faster.
 ReLU activation function
 \(f(x) = max(0, x)\)
The ReLU activation function is the new sigmoid in that it’s now the go-to activation function for deep learning.
It’s typically used for hidden layers as it avoids the vanishing gradient problem, yet we don’t have a hidden layer. It’s just a line. It’s so basic, it’s perfect.
Also, the researchers are after a sparse representation, matching one key to one position, so this property of the ReLU makes it an even better candidate.
I assume that 32 neurons are used because that is the max string length of the timestamp / record position.
Study 1 of 3: BTree \(\Rightarrow\) Learned Range Index [Model]
Experiment 1.1: Results
The researchers came to these findings:
 BTrees are 2 orders of magnitude faster. TensorFlow is designed for larger models; there is a lot of overhead from Python.
 A single neural network requires significantly more space and CPU time for the last mile of error minimisation.
 BTrees, or decision trees in general, are really good at overfitting the data (adding new data after balancing) with a few operations. They just divide up the space cheaply, using an if-statement.
 Other models can be significantly more efficient to approximate the general shape of a CDF.
 So models like NNs might be more CPU and space efficient to narrow down the position for an item from the entire data set to a region of thousands.
 But usually requires significantly more space and CPU time for the last mile.
These ideas are taken into account when designing the next model, the RMIndex.
Study 1 of 3: Learned Range Index [Model] \(\approx\) BTree
Challenges to replacing BTrees
 Main challenge: balance model complexity with accuracy.
 Bounded cost for inserts and lookups, taking advantage of the cache.
 Map keys to pages (memory or disk?)
 Last-mile accuracy. This is the main reason the Naïve Learned Model was so slow; overcome by using the Recursive Model (RM) Index.

New terms
 Last-mile accuracy
Study 1 of 3: Learned Range Index [Model] \(\approx\) BTree
Recursive Model (RM) Index
Also known as the Recursive Regression Model.
One of the key contributions of this research paper.
A hierarchy of models.
At each stage the model takes the key as an input and based on it picks another model, until the final stage predicts the position.
Each prediction as you go down the hierarchy is picking an expert that has better knowledge about certain keys.
Solves the ‘Last mile accuracy’ problem.
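The hierarchy described above can be sketched in a few lines. A minimal two-stage illustration only, not the paper’s implementation: the class names, the `fanout` parameter, and the use of plain least-squares linear models are all my choices.

```python
class LinearModel:
    """Least-squares fit of y = a*x + b; predicts with 1 mul, 1 add."""
    def __init__(self, a=0.0, b=0.0):
        self.a, self.b = a, b
    def fit(self, xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs) or 1.0
        self.a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
        self.b = my - self.a * mx
        return self
    def __call__(self, x):
        return self.a * x + self.b

class RMIndex:
    """Two-stage RMI sketch: a root model routes each key to one of
    `fanout` leaf 'experts'; the chosen leaf predicts the position."""
    def __init__(self, keys, fanout=4):
        self.keys, self.n, self.m = keys, len(keys), fanout
        positions = list(range(self.n))
        # Stage 1: one model over the whole (key -> position) mapping.
        self.root = LinearModel().fit(keys, positions)
        # Partition the training data by which leaf the root picks.
        buckets = [([], []) for _ in range(fanout)]
        for k, p in zip(keys, positions):
            xs, ys = buckets[self._route(k)]
            xs.append(k)
            ys.append(p)
        # Stage 2: one expert per partition.
        self.leaves = [LinearModel().fit(xs, ys) if xs else LinearModel()
                       for xs, ys in buckets]
    def _route(self, key):
        i = int(self.root(key) * self.m / self.n)
        return min(max(i, 0), self.m - 1)
    def predict(self, key):
        pos = round(self.leaves[self._route(key)](key))
        return min(max(pos, 0), self.n - 1)
```

Note there is no tree traversal at lookup: the root’s output is used directly as an array offset into the next stage, which is what distinguishes an RMI from a BTree.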
Study 1 of 3: BTree \(\Rightarrow\) Learned Range Index [Model]
Experiment 1.2: Hybrid Recursive Model Index
 Method
 Number of stages and number of models per stage are the hyperparameters
 Each net
 0 to 2 fully connected hidden layers
 Up to 32 neurons/units per layer
 ReLU activation functions
 Or BTrees (in the hybrid, a leaf may be a BTree instead of a net).
 Input: the timestamps of messages from web server logs
 Target: the positions of the messages (actual line number?)
 Datasets: Weblogs, Maps, web documents, log-normal (synthetic)
 The objective is not simply error minimisation.
 After training, the index is optimised by replacing NN models with BTrees if absolute min/max error is above a predefined threshold value.
 Conclusions
 Allows us to bound the worst-case performance of learned indexes to the performance of BTrees.
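The post-training replacement rule can be sketched as a simple filter over the bottom-stage models. A hedged illustration: `MAX_ERR` and the `"btree"` placeholder are hypothetical stand-ins, not values or types from the paper.

```python
MAX_ERR = 128  # hypothetical threshold on absolute min/max error

def hybridize(leaf_models, leaf_errors):
    """Swap out any bottom-stage model whose worst-case absolute
    error exceeds the threshold for a BTree, bounding worst-case
    lookup performance by the BTree's."""
    return [("btree" if err > MAX_ERR else model)
            for model, err in zip(leaf_models, leaf_errors)]
```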
Study 1 of 3: BTree \(\Rightarrow\) Learned Range Index [Model]
Results of Experiment 1.2
Was the data used obtained ethically? Who knows.
Testing
 They developed what they call the ‘Learning Index Framework’ (LIF), an index synthesis system. It accelerates the process of index synthesis and testing.
Aim of review
Questions

What is the specific problem or topic that this research addresses?
 Optimisation of an index requires knowledge of the data distribution. There is no guarantee of this. But it can be learned.
 Learned indices provide new ways to further optimise search engines.

If the paper presents a new network, algorithm, or technique, how does it work? Is it suited to the task?

A new model architecture, the Recursive Regression Model
Task: A substitute for a BTree.
Inspired by work done in the paper “Outrageously Large Neural Networks”.
Construction: build a hierarchy of models. At each stage the model takes the key as input and, based on it, picks another model, until the final stage predicts the position.
Each model makes a prediction with a certain error about the position for the key, and that prediction is used to select the next model.
Recursive Model Indices are not trees.
The architecture divides the space into smaller subranges like a BTree/decision tree, making it easier to achieve the required last-mile accuracy with fewer operations.

Is it suited to the task? The model divides the space into smaller subranges like a BTree to make it easier to achieve the required “last mile” accuracy with fewer operations. This solves one of the aforementioned complications of replacing a BTree.
The entire index can be represented as a sparse matrix multiplication for a TPU/GPU.

Has it been well tested, and does it really work as claimed? What are the limitations?
 This could change the way database systems are developed.

What are Innovations

Learned indices can be 70% faster than cache-optimized BTrees while saving an order of magnitude in memory.
 Tested over several realworld datasets.

Did they choose the right architecture? Why or why not?
Is it clearly described (all parameters, settings etc.)? What strengths and/or weaknesses of the NN approach does it illustrate?
Is the paper well structured and well written?
Q&A
Evaluation
Was the paper well organised?
It is well structured and well written.
Problem and solution
 problem
 Real world data does not perfectly follow known patterns. Specialised solutions expensive.
 solution
 ML: learn the model → synthesise a specialised index. Low cost.
Strengths and/or weaknesses of the NN approach
The paper illustrated that…
Did they choose the right architectures? Why or why not?
Is it clearly described (all parameters, settings etc.)?
Own Questions
Paper
Research question defined?
What is the research question?
Generalization
Does the study allow generalization?
Consistency
The discussion and conclusions should be consistent with the study’s results.
Were the results in accordance with the researchers’ expectations, or not?
Do the authors of the article you hold in hand do the same?