This is my attempt to understand Vaughan Kitchen’s search engine.

Source code
https://github.com/vkitchen/cocomel
Related articles
Entropy, Cross-Entropy and KL-Divergence // Bodacious Blog \ An example information retrieval problem

Glossary

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
red-black tree
RBT
    [self-balancing binary search tree]

    Each node of the binary tree has an extra
    bit, and that bit is often interpreted as
    the color of the node.

    These color bits are used to ensure the
    tree remains approximately balanced during
    insertions and deletions.

binary search tree
BST
    Node-based binary tree data structure
    which has the following properties:

    - The left subtree of a node contains only
      nodes with keys lesser than the node’s
      key.

    - The right subtree of a node contains
      only nodes with keys greater than the
      node’s key.

    - The left and right subtree each must
      also be a binary search tree.

slurp
    Read into memory.

universal code
universal code for integers
    [#data compression]
    [prefix code]

    Maps the positive integers onto binary
    codewords, with the additional property
    that whatever the true probability
    distribution on integers, as long as the
    distribution is monotonic (i.e., p(i)
    ≥ p(i + 1) for all positive i) , the
    expected lengths of the codewords are
    within a constant factor of the expected
    lengths that the optimal code for that
    probability distribution would have
    assigned.

    Asymptotically optimal if the ratio
    between actual and optimal expected
    lengths is bounded by a function of the
    information entropy of the code that, in
    addition to being bounded, approaches 1 as
    entropy approaches infinity.

variable-length quantity
VLQ
    [universal code]

    Uses an arbitrary number of binary octets
    (eight-bit bytes) to represent an
    arbitrarily large integer.

    A VLQ is essentially a base-128
    representation of an unsigned integer with
    the addition of the eighth bit to mark
    continuation of bytes.

continuation bit
continuing bit
    See "variable-length quantity".

Class diagrams in UML

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
find -L -O3 $CWD \
    -not -path '*/build/*' \
    -and \
    \( \
    -name '*.py' \
    -o -name '*.php' \
    -o -name '*.java' \
    -o -iname '*.[CH]' \
    -o -name '*.cpp' \
    -o -name '*.hpp'  \
    -o -name '*.cc' \
    -o -name '*.hh' \
    -o -name '*.hs' \
    \) \
    > "$destdir/cscope.files"

$HOME/scripts/ctags --fields=+latinK+i -R -L ./cscope.files
$MYGIT/Guad/tags2uml/tags2uml --members=1 --methods=1 --infile tags | tee uml.dot | dot-png uml
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
#+BEGIN_SRC graphviz-dot -n :filter dot-digraph cocomel-uml :async :results raw drawer
  n6 [label="{str|- store\l|+ c_dup()\l+ c_str()\l+ length()\l+ resize()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n7 [label="{tokenizer|- document\l- index\l- length\l|+ init()\l  next()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n8 [label="{tokenizer_zlib|- document\l- document_max\l- fd\l- header_buffer\l- index\l- length\l|+ cleanup()\l+ init()\l  next()\l  next_file()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n1 [label="{bst|- key\l- left\l- right\l- store\l|+ insert()\l+ write()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n2 [label="{dynamic_array|+ capacity\l+ length\l+ store\l|+ append()\l+ back()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n3 [label="{hash_table|- capacity\l- lengths\l- store\l|+ find()\l- hash()\l+ insert()\l+ read()\l+ write()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n4 [label="{linked_array|+ chunk_length\l- current\l- index\l+ last\l+ length\l+ next\l+ store\l|+ append()\l+ back()\l+ begin()\l+ end()\l+ iterator()\l}" style=filled color="#ff9999" fontcolor="#999999" fillcolor="#ffffff" shape="record"];
  n5 [label="{posting|- counts\l- id\l- id_capacity\l- id_length\l- id_store\l|  append()\l  decompress()\l  write()\l}" style=filled fillcolor="#ffffff" shape="record"];
#+END_SRC
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
#+BEGIN_SRC graphviz-dot -n :filter dot-digraph-ascii :async :results verbatim code
  n6 [label="{str|- store\l|+ c_dup()\l+ c_str()\l+ length()\l+ resize()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n7 [label="{tokenizer|- document\l- index\l- length\l|+ init()\l  next()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n8 [label="{tokenizer_zlib|- document\l- document_max\l- fd\l- header_buffer\l- index\l- length\l|+ cleanup()\l+ init()\l  next()\l  next_file()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n1 [label="{bst|- key\l- left\l- right\l- store\l|+ insert()\l+ write()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n2 [label="{dynamic_array|+ capacity\l+ length\l+ store\l|+ append()\l+ back()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n3 [label="{hash_table|- capacity\l- lengths\l- store\l|+ find()\l- hash()\l+ insert()\l+ read()\l+ write()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n4 [label="{linked_array|+ chunk_length\l- current\l- index\l+ last\l+ length\l+ next\l+ store\l|+ append()\l+ back()\l+ begin()\l+ end()\l+ iterator()\l}" style=filled fillcolor="#ffffff" shape="record"];
  n5 [label="{posting|- counts\l- id\l- id_capacity\l- id_length\l- id_store\l|  append()\l  decompress()\l  write()\l}" style=filled fillcolor="#ffffff" shape="record"];
#+END_SRC
+----------------+-----------------+--------------+
|                |      - key      |              |
|      bst       |  - left         |  + insert()  |
|                |  - right        |  + write()   |
|                |  - store        |              |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |    + capacity   |  + append()  |
| dynamic_array  |  + length       |  + back()    |
|                |  + store        |              |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |                 |   + find()   |
|                |    - capacity   |  - hash()    |
|   hash_table   |  - lengths      |  + insert()  |
|                |  - store        |  + read()    |
|                |                 |  + write()   |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |  + chunk_length |              |
|                |  - current      |  + append()  |
|                |  - index        |  + back()    |
|  linked_array  |  + last         |  + begin()   |
|                |  + length       |  + end()     |
|                |  + next         |  + iterator()|
|                |  + store        |              |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |     - counts    |              |
|                |  - id           |   append()   |
|    posting     |  - id_capacity  |  decompress()|
|                |  - id_length    |  write()     |
|                |  - id_store     |              |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |                 |   + c_dup()  |
|      str       |     - store     |  + c_str()   |
|                |                 |  + length()  |
|                |                 |  + resize()  |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |    - document   |   + init()   |
|   tokenizer    |  - index        |  next()      |
|                |  - length       |              |
+----------------+-----------------+--------------+
+----------------+-----------------+--------------+
|                |    - document   |              |
|                |  - document_max |  + cleanup() |
| tokenizer_zlib |  - fd           |  + init()    |
|                |  - header_buffer|  next()      |
|                |  - index        |  next_file() |
|                |  - length       |              |
+----------------+-----------------+--------------+

Extra classes

The class-diagram generator didn’t pick up on this. Actually, it incorrectly bundled all the methods and attributes into the parent class.

The iterator class is a public member of class linked_array. It is, however, only used by and closely coupled to linked_array.

I graphed this manually.

1
2
n1 [label="{linked_array|+ chunk_length\l+ last\l+ length\l+ next\l+ store\l|+ append()\l+ back()\l+ begin()\l+ end()\l+ iterator()\l}" style=filled color="#ff9999" fontcolor="#999999" style=filled fillcolor="#ffffff" shape="record"];
n2 [label="{iterator|- current\l- index\l|+ !=()\l+ *() \l+ ++()\l}" style=filled color="#ff9999" fontcolor="#999999" style=filled fillcolor="#ffffff" shape="record"];

The true linked_array

1
n1 [label="{linked_array|+ chunk_length\l+ last\l+ length\l+ next\l+ store\l|+ append()\l+ back()\l+ begin()\l+ end()\l+ iterator()\l}" style=filled fillcolor="#ffffff" shape="record"];
+--------------+----------------+--------------+
|              | + chunk_length |  + append()  |
|              |  + last        |  + back()    |
| linked_array |  + length      |  + begin()   |
|              |  + next        |  + end()     |
|              |  + store       |  + iterator()|
+--------------+----------------+--------------+

linked_array::iterator

1
n2 [label="{iterator|- current\l- index\l|+ !=()\l+ *() \l+ ++()\l}" style=filled fillcolor="#ffffff" shape="record"];
+----------+-----------+--------+
|          | - current | + !=() |
| iterator |  - index  |  + *() |
|          |           |  + ++()|
+----------+-----------+--------+

Source file overview

header / source purpose capacity used by instances
bst.h provide binary search tree to hash table hash_table.h one per hash table
hash_table.h provide hash table 65536

1 « 16 == 65536 = 0x10000
dynamic_array.h provide array

256 items initially, but doubles when reaching capacity
search_cli.cpp
posting.cpp
search.cpp
search_cgi.cpp
index.cpp
Once when searching from search_cli.cpp, docNos is created. This comes from searching via search.cpp.

When a search is run via search.cpp a postings list is decompressed for each search term; that’s one dynamic_array per word. These postings go into the intersect_postings() function and a single dynamic_array is returned. This is what is passed to search_cli.cpp.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
postings list
    For each term there is a list that records
    which documents the term occurs in.

    It may also contain the positions in the
    document.

posting
    Each item in a postings list is called a posting.

inverted index
inverted file
    A dictionary of terms (sometimes also
    referred to as a vocabulary or lexicon.

    For each term there is a postings list.
header / source purpose capacity used by instances
str.h provide strings + str helper funcs search.cpp
index.cpp
bst.h
tokenizer.h
file.{h,cpp} slurping files search_cli.cpp
search_cgi.cpp
index.cpp
When searching, once to load index.

When indexing, once for each non gzipped file indexed.
linked_array.h provide linked_array and iterator Never used
memory.{h,cpp}
posting.{h,cpp}
search.{h,cpp}
tar.h
tokenizer.{h,cpp}
tokenizer_zlib.{h,cpp}
vbyte.{h,cpp}
index.cpp
search_cgi.cpp
search_cli.cpp
The variable ‘docNos
An array of int pairs (int, int). Created upon searching.

Optimisations

BST instead of RBT

Instead of a RBT, use a BST for the hash table.

RBT is slower because it’s not ordered.

BST and hash table needs balancing.

Vaughan used gprof to profile the performance of the BST.

When you rebalance the tree you shuffle pointers around.

memory compression

Store in smaller amounts of memory means faster memory access.

Headers

bst.h

includes str.h, and memory.h

provides bst

  • Attributes
    • key: This is a number to be compared.
    • left: The left subtree of a node contains only nodes with keys lesser than the node’s key.
    • right: ditto
  • Methods
    • bst (constructor)
    • insert
    • write

dynamic_array.h

includes stdint.h, stdlib.h and memory.h

provides dynamic_array

  • Attributes
    • capacity
    • length
  • Methods
    • dynamic_array (constructor)
    • append
    • back

hash_table.h

includes bst.h

provides hash_table

  • Attributes

    • store
    • lengths
  • Methods

    • hash_table (constructor)
    • hash
    • insert
    • find
    • write
    • read

linked_array.h (You can ignore this)

It is never actually used.

linked_array::iterator class

Overloaded dereference operator *

The purpose of dereference operator is to dereference a pointer and returns the object reference. Hence it must return the reference. That is why it is T&.

1
2
3
4
T &operator *()
{
  return current->store[index];
}

The & means that the return type is a reference; you should read this as function operator * that returns a type of T &.

As an aside
The purpose of referring operator (->) is to return a pointer and hence T* is returned.
1
2
3
4
T* operator-> ()
{
    return pData;
}
address operator
The ampersand symbol & is used in C++ as a reference declarator in addition to being the address operator.
op name
(type) cast 2nd highest precedence
* dereference 2nd highest precedence
& address 2nd highest precedence
-> referring highest precedence
[] subscript highest precedence

Increment operator ++

When I saw this my first thought was concatenate. I’ve been reading too much haskell.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
iterator &operator ++()
{
  index++;
  if (index == current->chunk_length)
    {
      index = 0;
      current = current->next;
    }
  return *this;
}

When you call iterator++ it returns a pointer to iterator, which on its own seems pointless, but it also increments the internal value index which keeps track of which item the iterator thinks is current.

The internal value (index) is also used by the dereference operator of iterator to specify which element of the array store (contained in linked_array, the parent class) to return a reference to.

Application logic

search_cli.cpp

Read in a query string from the user.

Read in the index (index.dat) naïvely into a char *. Think of it as an array of bytes. You can then do index[2 * sizeof(uint32_t)] and that means the pointer to the 3rd uint32_t. We refer to the char * as “index” from now on since that’s its variable name.

1
2
// Read in the index (=index.dat=)
file_slurp("index.dat", &index);

docNos is an array of int pairs (int, int).

1
2
// docNos
dynamic_array<std::pair<uint32_t, uint32_t>> *docNos = new dynamic_array<std::pair<uint32_t, uint32_t>>();

Reinterpret the char * index as an array of uint32.

1
2
// Reinterpret
(uint32_t *)index

The number of pairs contained in docNos is determined by reading the second uint32 from index.

1
2
// The number of pairs contained in =docNos=
docNos->length = ((uint32_t *)index)[1];

This is because index is probably an array of arrays. The first uint32 is probably the size of the parent array. I will find out later.

I think I found it
It might be the position of the dictionary
1
vim +/"size_t dict_offset" "$MYGIT/vkitchen/cocomel/search.cpp"
1
2
3
4
5
6
7
// Decode index
dynamic_array<std::pair<uint32_t, uint32_t>> *docNos = new dynamic_array<std::pair<uint32_t, uint32_t>>();
docNos->length = ((uint32_t *)index)[1];
docNos->store = (std::pair<uint32_t, uint32_t> *)&index[2 * sizeof(uint32_t)];

size_t dict_offset = ((uint32_t *)index)[0];
hash_table<posting, uint32_t> *dictionary = hash_table<posting, uint32_t>::read(&index[dict_offset]);

Load the actual pairs of uint32 from the index into docNos. Get the address of the 3rd int of index. This is the pointer to the first pair of ints in the array.

1
2
// Load the actual pairs of =uint32=
docNos->store = (std::pair<uint32_t, uint32_t> *)&index[2 * sizeof(uint32_t)];

Instances

dynamic_array.h

These are the instances of dynamic_array within cocomel.

search_cli.cpp

1
2
dynamic_array<std::pair<uint32_t, uint32_t>> *docNos = new dynamic_array<std::pair<uint32_t, uint32_t>>();
dynamic_array<std::pair<size_t, double>> *result_list = search(index, line);

posting.cpp

1
dynamic_array<std::pair<size_t, double>> *out = new dynamic_array<std::pair<size_t, double>>;

search.cpp

When you decompress a posting.

1
dynamic_array<std::pair<size_t, double>> *post = ((posting *)terms->store[i])->decompress();

building the index

functions

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
size_t file_slurp(char const *filename, char **into) {
  FILE *fh;
  struct stat details;
  size_t file_length = 0;

  if ((fh = fopen(filename, "rb")) != NULL) {
    if (fstat(fileno(fh), &details) == 0)
      if ((file_length = details.st_size) != 0) {
        *into = (char *)malloc(file_length + 1);
        (*into)[file_length] = '\0';
        if (fread(*into, details.st_size, 1, fh) != 1) {
          free(*into);
          file_length = 0;
        }
      }
    fclose(fh);
  }

  return file_length;
}

The function fileno() examines the argument stream and returns its integer descriptor.

Appendix

Gprof – Notes taken from the man page

A profiler that calculates the amount of time spent in each routine.

Next, these times are propagated along the edges of the call graph.

Cycles are discovered, and calls into a cycle are made to share the time of the cycle.

Types of output

Several forms of output are available from the analysis.

flat profile
Shows how much time your program spent in each function, and how many times that function was called.

If you simply want to know which functions burn most of the cycles, it is stated concisely here.

call graph
Shows, for each function, which functions called it, which other functions it called, and how many times.

There is also an estimate of how much time was spent in the subroutines of each function.

This can suggest places where you might try to eliminate function calls that use a lot of time.

annotated source listing
A copy of the program’s source code, labeled with the number of times each line of the program was executed.