Deconstructing the Cocomel search engine
This is my attempt to understand Vaughan Kitchen’s search engine.
- Source code
- https://github.com/vkitchen/cocomel
- Related articles
- Entropy, Cross-Entropy and KL-Divergence // Bodacious Blog \ An example information retrieval problem
Glossary
|
|
Class diagrams in UML
|
|
|
|

|
|
+----------------+-----------------+--------------+
| | - key | |
| bst | - left | + insert() |
| | - right | + write() |
| | - store | |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
| | + capacity | + append() |
| dynamic_array | + length | + back() |
| | + store | |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
| | | + find() |
| | - capacity | - hash() |
| hash_table | - lengths | + insert() |
| | - store | + read() |
| | | + write() |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
| | + chunk_length | |
| | - current | + append() |
| | - index | + back() |
| linked_array | + last | + begin() |
| | + length | + end() |
| | + next | + iterator()|
| | + store | |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
| | - counts | |
| | - id | append() |
| posting | - id_capacity | decompress()|
| | - id_length | write() |
| | - id_store | |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
| | | + c_dup() |
| str | - store | + c_str() |
| | | + length() |
| | | + resize() |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
| | - document | + init() |
| tokenizer | - index | next() |
| | - length | |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
| | - document | |
| | - document_max | + cleanup() |
| tokenizer_zlib | - fd | + init() |
| | - header_buffer| next() |
| | - index | next_file() |
| | - length | |
+----------------+-----------------+--------------+
Extra classes
The class-diagram generator didn’t pick up on this. Actually, it incorrectly bundled all the methods and attributes into the parent class.
The iterator
class is a public member of class linked_array
. It is, however, only used by and closely coupled to linked_array
.
I graphed this manually.
|
|

The true linked_array
|
|
+--------------+----------------+--------------+
| | + chunk_length | + append() |
| | + last | + back() |
| linked_array | + length | + begin() |
| | + next | + end() |
| | + store | + iterator()|
+--------------+----------------+--------------+
linked_array::iterator
|
|
+----------+-----------+--------+
| | - current | + !=() |
| iterator | - index | + *() |
| | | + ++()|
+----------+-----------+--------+
Source file overview
header / source | purpose | capacity | used by | instances |
---|---|---|---|---|
bst.h | provide binary search tree to hash table | ∞ | hash_table.h | one per hash table |
hash_table.h | provide hash table | 65536 1 « 16 == 65536 = 0x10000 |
||
dynamic_array.h | provide array | ∞ 256 items initially, but doubles when reaching capacity |
search_cli.cpp posting.cpp search.cpp search_cgi.cpp index.cpp |
Once when searching from search_cli.cpp , docNos is created. This comes from searching via search.cpp . When a search is run via search.cpp a postings list is decompressed for each search term; that’s one dynamic_array per word. These postings go into the intersect_postings() function and a single dynamic_array is returned. This is what is passed to search_cli.cpp . |
|
|
header / source | purpose | capacity | used by | instances |
---|---|---|---|---|
str.h | provide strings + str helper funcs | ∞ | search.cpp index.cpp bst.h tokenizer.h |
|
file.{h,cpp} | slurping files | ∞ | search_cli.cpp search_cgi.cpp index.cpp |
When searching, once to load index. When indexing, once for each non gzipped file indexed. |
Never used |
||||
memory.{h,cpp} | ||||
posting.{h,cpp} | ||||
search.{h,cpp} | ||||
tar.h | ||||
tokenizer.{h,cpp} | ||||
tokenizer_zlib.{h,cpp} | ||||
vbyte.{h,cpp} | ||||
index.cpp | ||||
search_cgi.cpp | ||||
search_cli.cpp |
- The variable ‘
docNos
’ - An array of
int
pairs (int
,int
). Created upon searching.
Optimisations
BST instead of RBT
Instead of a RBT, use a BST for the hash table.
RBT is slower because it’s not ordered.
BST and hash table needs balancing.
Vaughan used gprof to profile the performance of the BST.
When you rebalance the tree you shuffle pointers around.
memory compression
Store in smaller amounts of memory means faster memory access.
Headers
bst.h
includes str.h
, and memory.h
provides bst
- Attributes
- key: This is a number to be compared.
- left: The left subtree of a node contains only nodes with keys lesser than the node’s key.
- right: ditto
- Methods
- bst (constructor)
- insert
- write
dynamic_array.h
includes stdint.h
, stdlib.h
and memory.h
provides dynamic_array
- Attributes
- capacity
- length
- Methods
- dynamic_array (constructor)
- append
- back
hash_table.h
includes bst.h
provides hash_table
-
Attributes
- store
- lengths
-
Methods
- hash_table (constructor)
- hash
- insert
- find
- write
- read
linked_array.h
(You can ignore this)
It is never actually used.
linked_array::iterator
class
Overloaded dereference operator *
The purpose of dereference operator is to
dereference a pointer and returns the object
reference. Hence it must return the reference.
That is why it is T&
.
|
|
The &
means that the return type is a
reference; you should read this as function
operator *
that returns a type of T &
.
- As an aside
- The purpose of referring operator (
->
) is to return a pointer and henceT*
is returned.1 2 3 4
T* operator-> () { return pData; }
- address operator
- The ampersand symbol & is used in C++ as a reference declarator in addition to being the address operator.
op | name | |
---|---|---|
(type) |
cast | 2nd highest precedence |
* |
dereference | 2nd highest precedence |
& |
address | 2nd highest precedence |
-> |
referring | highest precedence |
[] |
subscript | highest precedence |
Increment operator ++
When I saw this my first thought was concatenate. I’ve been reading too much haskell.
|
|
When you call iterator++
it returns a
pointer to iterator
, which on its own seems
pointless, but it also increments the internal
value index
which keeps track of which item the
iterator thinks is current.
The internal value (index
) is also used by
the dereference operator of iterator
to
specify which element of the array store
(contained in linked_array
, the parent
class) to return a reference to.
Application logic
search_cli.cpp
Read in a query string from the user.
Read in the index (index.dat
) naïvely into a char *
. Think of it as an array of bytes. You can then do index[2 * sizeof(uint32_t)]
and that means the pointer to the 3rd uint32_t
. We refer to the char *
as “index
” from now on since that’s its variable name.
|
|
docNos
is an array of int
pairs (int
, int
).
|
|
Reinterpret the char *
index as an array of uint32
.
|
|
The number of pairs contained in docNos
is
determined by reading the second uint32
from
index
.
|
|
This is because index
is probably an array
of arrays. The first uint32
is probably the
size of the parent array. I will find out later.
- I think I found it
- It might be the position of the dictionary
1
vim +/"size_t dict_offset" "$MYGIT/vkitchen/cocomel/search.cpp"
1 2 3 4 5 6 7
// Decode index dynamic_array<std::pair<uint32_t, uint32_t>> *docNos = new dynamic_array<std::pair<uint32_t, uint32_t>>(); docNos->length = ((uint32_t *)index)[1]; docNos->store = (std::pair<uint32_t, uint32_t> *)&index[2 * sizeof(uint32_t)]; size_t dict_offset = ((uint32_t *)index)[0]; hash_table<posting, uint32_t> *dictionary = hash_table<posting, uint32_t>::read(&index[dict_offset]);
Load the actual pairs of uint32
from the
index into docNos
. Get the address of the
3rd int of index. This is the pointer to the
first pair of ints in the array.
|
|
Instances
dynamic_array.h
These are the instances of dynamic_array
within cocomel
.
search_cli.cpp
|
|
posting.cpp
|
|
search.cpp
When you decompress a posting.
|
|
building the index
functions
|
|
The function fileno()
examines the
argument stream and returns its integer
descriptor.
Appendix
Gprof
– Notes taken from the man page
A profiler that calculates the amount of time spent in each routine.
Next, these times are propagated along the edges of the call graph.
Cycles are discovered, and calls into a cycle are made to share the time of the cycle.
Types of output
Several forms of output are available from the analysis.
- flat profile
- Shows how much time your
program spent in each function, and how many
times that function was called.
If you simply want to know which functions burn most of the cycles, it is stated concisely here.
- call graph
- Shows, for each function, which
functions called it, which other functions it
called, and how many times.
There is also an estimate of how much time was spent in the subroutines of each function.
This can suggest places where you might try to eliminate function calls that use a lot of time.
- annotated source listing
- A copy of the program’s source code, labeled with the number of times each line of the program was executed.
If this article appears incomplete, it may be intentional. Try prompting for a continuation.