This is my attempt to understand Vaughan Kitchen’s search engine.

Source code
https://github.com/vkitchen/cocomel
Related articles
Entropy, Cross-Entropy and KL-Divergence // Bodacious Blog \ An example information retrieval problem

## Glossary

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71  red-black tree RBT [self-balancing binary search tree] Each node of the binary tree has an extra bit, and that bit is often interpreted as the color of the node. These color bits are used to ensure the tree remains approximately balanced during insertions and deletions. binary search tree BST Node-based binary tree data structure which has the following properties: - The left subtree of a node contains only nodes with keys lesser than the node’s key. - The right subtree of a node contains only nodes with keys greater than the node’s key. - The left and right subtree each must also be a binary search tree. slurp Read into memory. universal code universal code for integers [#data compression] [prefix code] Maps the positive integers onto binary codewords, with the additional property that whatever the true probability distribution on integers, as long as the distribution is monotonic (i.e., p(i) ≥ p(i + 1) for all positive i) , the expected lengths of the codewords are within a constant factor of the expected lengths that the optimal code for that probability distribution would have assigned. Asymptotically optimal if the ratio between actual and optimal expected lengths is bounded by a function of the information entropy of the code that, in addition to being bounded, approaches 1 as entropy approaches infinity. variable-length quantity VLQ [universal code] Uses an arbitrary number of binary octets (eight-bit bytes) to represent an arbitrarily large integer. A VLQ is essentially a base-128 representation of an unsigned integer with the addition of the eighth bit to mark continuation of bytes. continuation bit continuing bit See "variable-length quantity".

## Class diagrams in UML

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18  find -L -O3 $CWD \ -not -path '*/build/*' \ -and \ $$\ -name '*.py' \ -o -name '*.php' \ -o -name '*.java' \ -o -iname '*.[CH]' \ -o -name '*.cpp' \ -o -name '*.hpp' \ -o -name '*.cc' \ -o -name '*.hh' \ -o -name '*.hs' \$$ \ > "$destdir/cscope.files" $HOME/scripts/ctags --fields=+latinK+i -R -L ./cscope.files$MYGIT/Guad/tags2uml/tags2uml --members=1 --methods=1 --infile tags | tee uml.dot | dot-png uml
  1 2 3 4 5 6 7 8 9 10  #+BEGIN_SRC graphviz-dot -n :filter dot-digraph cocomel-uml :async :results raw drawer n6 [label="{str|- store\l|+ c_dup()\l+ c_str()\l+ length()\l+ resize()\l}" style=filled fillcolor="#ffffff" shape="record"]; n7 [label="{tokenizer|- document\l- index\l- length\l|+ init()\l next()\l}" style=filled fillcolor="#ffffff" shape="record"]; n8 [label="{tokenizer_zlib|- document\l- document_max\l- fd\l- header_buffer\l- index\l- length\l|+ cleanup()\l+ init()\l next()\l next_file()\l}" style=filled fillcolor="#ffffff" shape="record"]; n1 [label="{bst|- key\l- left\l- right\l- store\l|+ insert()\l+ write()\l}" style=filled fillcolor="#ffffff" shape="record"]; n2 [label="{dynamic_array|+ capacity\l+ length\l+ store\l|+ append()\l+ back()\l}" style=filled fillcolor="#ffffff" shape="record"]; n3 [label="{hash_table|- capacity\l- lengths\l- store\l|+ find()\l- hash()\l+ insert()\l+ read()\l+ write()\l}" style=filled fillcolor="#ffffff" shape="record"]; n4 [label="{linked_array|+ chunk_length\l- current\l- index\l+ last\l+ length\l+ next\l+ store\l|+ append()\l+ back()\l+ begin()\l+ end()\l+ iterator()\l}" style=filled color="#ff9999" fontcolor="#999999" fillcolor="#ffffff" shape="record"]; n5 [label="{posting|- counts\l- id\l- id_capacity\l- id_length\l- id_store\l| append()\l decompress()\l write()\l}" style=filled fillcolor="#ffffff" shape="record"]; #+END_SRC
  1 2 3 4 5 6 7 8 9 10  #+BEGIN_SRC graphviz-dot -n :filter dot-digraph-ascii :async :results verbatim code n6 [label="{str|- store\l|+ c_dup()\l+ c_str()\l+ length()\l+ resize()\l}" style=filled fillcolor="#ffffff" shape="record"]; n7 [label="{tokenizer|- document\l- index\l- length\l|+ init()\l next()\l}" style=filled fillcolor="#ffffff" shape="record"]; n8 [label="{tokenizer_zlib|- document\l- document_max\l- fd\l- header_buffer\l- index\l- length\l|+ cleanup()\l+ init()\l next()\l next_file()\l}" style=filled fillcolor="#ffffff" shape="record"]; n1 [label="{bst|- key\l- left\l- right\l- store\l|+ insert()\l+ write()\l}" style=filled fillcolor="#ffffff" shape="record"]; n2 [label="{dynamic_array|+ capacity\l+ length\l+ store\l|+ append()\l+ back()\l}" style=filled fillcolor="#ffffff" shape="record"]; n3 [label="{hash_table|- capacity\l- lengths\l- store\l|+ find()\l- hash()\l+ insert()\l+ read()\l+ write()\l}" style=filled fillcolor="#ffffff" shape="record"]; n4 [label="{linked_array|+ chunk_length\l- current\l- index\l+ last\l+ length\l+ next\l+ store\l|+ append()\l+ back()\l+ begin()\l+ end()\l+ iterator()\l}" style=filled fillcolor="#ffffff" shape="record"]; n5 [label="{posting|- counts\l- id\l- id_capacity\l- id_length\l- id_store\l| append()\l decompress()\l write()\l}" style=filled fillcolor="#ffffff" shape="record"]; #+END_SRC
+----------------+-----------------+--------------+
|                |      - key      |              |
|      bst       |  - left         |  + insert()  |
|                |  - right        |  + write()   |
|                |  - store        |              |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |    + capacity   |  + append()  |
| dynamic_array  |  + length       |  + back()    |
|                |  + store        |              |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |                 |   + find()   |
|                |    - capacity   |  - hash()    |
|   hash_table   |  - lengths      |  + insert()  |
|                |  - store        |  + read()    |
|                |                 |  + write()   |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |  + chunk_length |              |
|                |  - current      |  + append()  |
|                |  - index        |  + back()    |
|  linked_array  |  + last         |  + begin()   |
|                |  + length       |  + end()     |
|                |  + next         |  + iterator()|
|                |  + store        |              |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |     - counts    |              |
|                |  - id           |   append()   |
|    posting     |  - id_capacity  |  decompress()|
|                |  - id_length    |  write()     |
|                |  - id_store     |              |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |                 |   + c_dup()  |
|      str       |     - store     |  + c_str()   |
|                |                 |  + length()  |
|                |                 |  + resize()  |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |    - document   |   + init()   |
|   tokenizer    |  - index        |  next()      |
|                |  - length       |              |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |    - document   |              |
|                |  - document_max |  + cleanup() |
| tokenizer_zlib |  - fd           |  + init()    |
|                |  - header_buffer|  next()      |
|                |  - index        |  next_file() |
|                |  - length       |              |
+----------------+-----------------+--------------+


### Extra classes

The class-diagram generator didn’t pick up on this. Actually, it incorrectly bundled all the methods and attributes into the parent class.

The iterator class is a public member of class linked_array. It is, however, only used by and closely coupled to linked_array.

I graphed this manually.

 1 2  n1 [label="{linked_array|+ chunk_length\l+ last\l+ length\l+ next\l+ store\l|+ append()\l+ back()\l+ begin()\l+ end()\l+ iterator()\l}" style=filled color="#ff9999" fontcolor="#999999" style=filled fillcolor="#ffffff" shape="record"]; n2 [label="{iterator|- current\l- index\l|+ !=()\l+ *() \l+ ++()\l}" style=filled color="#ff9999" fontcolor="#999999" style=filled fillcolor="#ffffff" shape="record"];

#### The true linked_array

 1  n1 [label="{linked_array|+ chunk_length\l+ last\l+ length\l+ next\l+ store\l|+ append()\l+ back()\l+ begin()\l+ end()\l+ iterator()\l}" style=filled fillcolor="#ffffff" shape="record"];
+--------------+----------------+--------------+
|              | + chunk_length |  + append()  |
|              |  + last        |  + back()    |
| linked_array |  + length      |  + begin()   |
|              |  + next        |  + end()     |
|              |  + store       |  + iterator()|
+--------------+----------------+--------------+


#### linked_array::iterator

 1  n2 [label="{iterator|- current\l- index\l|+ !=()\l+ *() \l+ ++()\l}" style=filled fillcolor="#ffffff" shape="record"];
+----------+-----------+--------+
|          | - current | + !=() |
| iterator |  - index  |  + *() |
|          |           |  + ++()|
+----------+-----------+--------+


## Source file overview

header / source purpose capacity used by instances
bst.h provide binary search tree to hash table hash_table.h one per hash table
hash_table.h provide hash table 65536

1 « 16 == 65536 = 0x10000
dynamic_array.h provide array

256 items initially, but doubles when reaching capacity
search_cli.cpp
posting.cpp
search.cpp
search_cgi.cpp
index.cpp
Once when searching from search_cli.cpp, docNos is created. This comes from searching via search.cpp.

When a search is run via search.cpp a postings list is decompressed for each search term; that’s one dynamic_array per word. These postings go into the intersect_postings() function and a single dynamic_array is returned. This is what is passed to search_cli.cpp.
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16  postings list For each term there is a list that records which documents the term occurs in. It may also contain the positions in the document. posting Each item in a postings list is called a posting. inverted index inverted file A dictionary of terms (sometimes also referred to as a vocabulary or lexicon. For each term there is a postings list.
header / source purpose capacity used by instances
str.h provide strings + str helper funcs search.cpp
index.cpp
bst.h
tokenizer.h
file.{h,cpp} slurping files search_cli.cpp
search_cgi.cpp
index.cpp
When searching, once to load index.

When indexing, once for each non gzipped file indexed.
linked_array.h provide linked_array and iterator Never used
memory.{h,cpp}
posting.{h,cpp}
search.{h,cpp}
tar.h
tokenizer.{h,cpp}
tokenizer_zlib.{h,cpp}
vbyte.{h,cpp}
index.cpp
search_cgi.cpp
search_cli.cpp
The variable ‘docNos
An array of int pairs (int, int). Created upon searching.

## Optimisations

Instead of a RBT, use a BST for the hash table.

RBT is slower because it’s not ordered.

BST and hash table needs balancing.

Vaughan used gprof to profile the performance of the BST.

When you rebalance the tree you shuffle pointers around.

### memory compression

Store in smaller amounts of memory means faster memory access.

### bst.h

#### provides bst

• Attributes
• key: This is a number to be compared.
• left: The left subtree of a node contains only nodes with keys lesser than the node’s key.
• right: ditto
• Methods
• bst (constructor)
• insert
• write

### dynamic_array.h

#### provides dynamic_array

• Attributes
• capacity
• length
• Methods
• dynamic_array (constructor)
• append
• back

### hash_table.h

#### provides hash_table

• Attributes

• store
• lengths
• Methods

• hash_table (constructor)
• hash
• insert
• find
• write

### linked_array.h (You can ignore this)

It is never actually used.

#### linked_array::iterator class

Overloaded dereference operator *

The purpose of dereference operator is to dereference a pointer and returns the object reference. Hence it must return the reference. That is why it is T&.

 1 2 3 4  T &operator *() { return current->store[index]; }

The & means that the return type is a reference; you should read this as function operator * that returns a type of T &.

As an aside
The purpose of referring operator (->) is to return a pointer and hence T* is returned.
 1 2 3 4  T* operator-> () { return pData; }
The ampersand symbol & is used in C++ as a reference declarator in addition to being the address operator.
op name
(type) cast 2nd highest precedence
* dereference 2nd highest precedence
& address 2nd highest precedence
-> referring highest precedence
[] subscript highest precedence

Increment operator ++

When I saw this my first thought was concatenate. I’ve been reading too much haskell.

  1 2 3 4 5 6 7 8 9 10  iterator &operator ++() { index++; if (index == current->chunk_length) { index = 0; current = current->next; } return *this; }

When you call iterator++ it returns a pointer to iterator, which on its own seems pointless, but it also increments the internal value index which keeps track of which item the iterator thinks is current.

The internal value (index) is also used by the dereference operator of iterator to specify which element of the array store (contained in linked_array, the parent class) to return a reference to.

## Application logic

#### search_cli.cpp

Read in a query string from the user.

Read in the index (index.dat) naïvely into a char *. Think of it as an array of bytes. You can then do index[2 * sizeof(uint32_t)] and that means the pointer to the 3rd uint32_t. We refer to the char * as “index” from now on since that’s its variable name.

 1 2  // Read in the index (=index.dat=) file_slurp("index.dat", &index);

docNos is an array of int pairs (int, int).

 1 2  // docNos dynamic_array> *docNos = new dynamic_array>();

Reinterpret the char * index as an array of uint32.

 1 2  // Reinterpret (uint32_t *)index

The number of pairs contained in docNos is determined by reading the second uint32 from index.

 1 2  // The number of pairs contained in =docNos= docNos->length = ((uint32_t *)index);

This is because index is probably an array of arrays. The first uint32 is probably the size of the parent array. I will find out later.

I think I found it
It might be the position of the dictionary
 1  vim +/"size_t dict_offset" "\$MYGIT/vkitchen/cocomel/search.cpp"
 1 2 3 4 5 6 7  // Decode index dynamic_array> *docNos = new dynamic_array>(); docNos->length = ((uint32_t *)index); docNos->store = (std::pair *)&index[2 * sizeof(uint32_t)]; size_t dict_offset = ((uint32_t *)index); hash_table *dictionary = hash_table::read(&index[dict_offset]);

Load the actual pairs of uint32 from the index into docNos. Get the address of the 3rd int of index. This is the pointer to the first pair of ints in the array.

 1 2  // Load the actual pairs of =uint32= docNos->store = (std::pair *)&index[2 * sizeof(uint32_t)];

## Instances

### dynamic_array.h

These are the instances of dynamic_array within cocomel.

#### search_cli.cpp

 1 2  dynamic_array> *docNos = new dynamic_array>(); dynamic_array> *result_list = search(index, line);

#### posting.cpp

 1  dynamic_array> *out = new dynamic_array>;

#### search.cpp

When you decompress a posting.

 1  dynamic_array> *post = ((posting *)terms->store[i])->decompress();

## functions

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  size_t file_slurp(char const *filename, char **into) { FILE *fh; struct stat details; size_t file_length = 0; if ((fh = fopen(filename, "rb")) != NULL) { if (fstat(fileno(fh), &details) == 0) if ((file_length = details.st_size) != 0) { *into = (char *)malloc(file_length + 1); (*into)[file_length] = '\0'; if (fread(*into, details.st_size, 1, fh) != 1) { free(*into); file_length = 0; } } fclose(fh); } return file_length; }

The function fileno() examines the argument stream and returns its integer descriptor.

## Appendix

### Gprof – Notes taken from the man page

A profiler that calculates the amount of time spent in each routine.

Next, these times are propagated along the edges of the call graph.

Cycles are discovered, and calls into a cycle are made to share the time of the cycle.

#### Types of output

Several forms of output are available from the analysis.

flat profile
Shows how much time your program spent in each function, and how many times that function was called.

If you simply want to know which functions burn most of the cycles, it is stated concisely here.

call graph
Shows, for each function, which functions called it, which other functions it called, and how many times.

There is also an estimate of how much time was spent in the subroutines of each function.

This can suggest places where you might try to eliminate function calls that use a lot of time.

annotated source listing
A copy of the program’s source code, labeled with the number of times each line of the program was executed.