Database – Indexes, data sorting and searching improvement

In the previous post, we discussed data concurrency/consistency and figured out how to properly set the isolation level of an explicit DBMS transaction.

In this new post, we will discuss the theory behind the indexes used in a DBMS to improve the data searching on a database table.

Types of indexes

Indexes in a database make searches faster because they allow the database to quickly locate the specific data that matches a search query.

Without an index, the database would have to scan every row in the table to find the desired data, which can be slow and resource-intensive, especially for large tables.

An index creates a separate data structure that contains a copy of the data from one or more columns in the table, organized in a way that allows for fast searching and sorting. This makes it possible for the database to quickly narrow down the rows that match a search query, reducing the amount of data that needs to be scanned and improving performance.

Several types of indexes can be used in a database. Some of the most common types include:

  1. B-Tree Index: This is the most common type of index that is used in databases. It is well-suited for indexing columns with discrete values, such as dates or names.
  2. Bitmap Index: This index is used for columns with relatively few distinct values. It is efficient for data warehousing and analytics use cases.
  3. Clustered Index: A clustered index determines the physical order of data in a table, which can help improve the performance of certain types of queries.
  4. Non-clustered Index: A non-clustered index is a separate structure that stores a copy of a portion of a table’s data in a different location, allowing for faster searching and sorting.
  5. Full-text Index: This index performs text-based searches on large volumes of unstructured data, such as documents or web pages.
  6. Spatial Index: A spatial index is used to optimize searches on data with a spatial component, such as geographic data.

The specific type of index that is most appropriate depends on the characteristics of the data and the type of queries that will be performed on it.

B-Tree Index

B-tree indexes are well-suited for columns with high cardinality, meaning many distinct values exist in the column. They work well for range queries, where you need to search for a range of values or use comparison operators like greater than or less than.

They are also helpful for equality queries, where you must match a single value. In general, you should use a B-tree index when the values in the column are not too repetitive, and you need to look up individual values or ranges of values frequently.

Here’s an example of a B-tree structure:

                      +----+
                      | 26 |
                      +----+
                    /   |   \
                   /    |    \
                  /     |     \
            +----+ +----+ +----+ +----+
            | 10 | | 20 | | 30 | | 40 |
            +----+ +----+ +----+ +----+
           /  |  \      / \      |    \
          /   |   \    /   \     |     \
         /    |    \  /     \    |      \
    +----+ +----+ +----+ +----+ +----+ +----+
    |  1 | |  4 | |  7 | | 13 | | 15 | | 18 |
    +----+ +----+ +----+ +----+ +----+ +----+

In this example, each node in the tree represents a range of values, and the keys in each node are used to navigate to other nodes in the tree. At the top of the tree, the root node contains the most significant key value and points to child nodes with smaller values.

The tree is split into multiple levels, containing nodes pointing to child nodes in the below level. In this example, the tree has three levels: the root node at level 0, the intermediate nodes at level 1, and the leaf nodes at level 2.

The leaf nodes in the tree contain the actual data values and are linked together in a doubly-linked list to allow for efficient range queries. When a query is performed on the tree, the search starts at the root node and follows the keys in each node to navigate down to the appropriate leaf node.

BTrees are commonly used in databases and file systems to store and retrieve large amounts of data efficiently.

Searching Time on B-Trees

The searching time on a B-Tree structure is typically O(log n), where n is the number of keys in the index. This is because B-Trees are designed to maintain a balanced tree structure with a fixed height, even as the number of keys grows.

When searching for a specific key in a B-Tree index, the algorithm starts at the tree’s root node and uses the key values to determine which child node to follow. The search recursively down the tree until the key is found in a leaf node or it is determined that the key does not exist in the index.

Because each level of the B-Tree can have many keys, the number of nodes that need to be traversed to find a specific key grows logarithmically with the size of the index. This means that the search time increases much more slowly as the number of keys in the index grows, making B-Trees an efficient data structure for indexing and searching large datasets.

Bitmap Index

Bitmap indexes are useful for columns with low cardinality and relatively few distinct values. They work well for queries that involve multiple conditions combined with AND or OR operators. They are also helpful for data warehousing applications, where you must analyze large amounts of data across various dimensions. In general, you should use a bitmap index when the values in the column are highly repetitive and you must perform complex queries with multiple conditions.

Here’s a simple example of a bitmap data structure:

Suppose we have a table of customers, and we want to create a bitmap index on the “age” column. The table contains the following data:

+----+------+-----+
| ID | Name | Age |
+----+------+-----+
|  1 | Bob  |  25 |
|  2 | Sue  |  30 |
|  3 | Tom  |  20 |
|  4 | Ann  |  35 |
|  5 | Jim  |  30 |
+----+------+-----+

We would create a separate bitmap for each possible age value to create a bitmap index on the “age” column. Each bitmap would have one bit for each row in the table, with a value of 1 indicating that the row has the corresponding age value and a value of 0 indicating that it does not. The bitmaps might look like this:

Age = 20:  1 0 0 0 0
Age = 25:  1 0 0 0 0
Age = 30:  0 1 0 1 1
Age = 35:  0 0 0 1 0

To find all customers with an age of 30, we can use a bitwise OR operation on the “Age=30” and “Age=30” bitmaps, which gives us the following result:

Age = 30:  0 1 0 1 1

The bit corresponding to rows 2 and 5 is set to 1, indicating that those rows have an age of 30. Similarly, to find all customers with an age of 20 or 35, we can use a bitwise OR operation on the “Age=20” and “Age=35” bitmaps, which gives us the following result:

Age = 20:  1 0 0 0 0
Age = 35:  0 0 0 1 0

OR Result: 1 0 0 1 0

The bits corresponding to rows 1, 3, and 4 are set to 1, indicating that those rows have an age of 20 or 35.

Bitwise Operation

The bitwise OR operation is a binary operation that takes twobit patterns of equal length and performs a logical OR operation on each pair of corresponding bits, producing a new bit pattern where each bit is set to 1 if either of the corresponding bits in the input patterns is 1.

Here’s an example that demonstrates how the bitwise OR operation works:

Suppose we have two 4-bit patterns: 1010 and 1100. To perform the bitwise OR operation on these patterns, we line them up vertically and apply the OR operation to each pair of corresponding bits:

   1 0 1 0    (1010)
OR 1 1 0 0    (1100)
-----------
   1 1 1 0    (1110)

In this example, the resulting bit pattern is 1110. The first bit is set to 1 because either the first or first bit of the second pattern is 1. The second bit is set to 1 because the second bit of the second pattern is 1. The third bit is set to 1 because the third bit of the first pattern is 1. The fourth bit is set to 0 because neither the fourth bit of the first pattern nor the second bit is 1.

We can also use the bitwise OR operation on larger bit patterns or multiple-bit patterns. For example, suppose we have three 8-bit patterns: 11001100, 10101010, and 11110000. We can perform the bitwise OR operation on these patterns as follows:

   11001100
OR 10101010
OR 11110000
-----------
   11111110

In this example, the resulting bit pattern is 11111110. The operation sets each bit to 1 if any of the corresponding bits in the input patterns is 1.

Searching Time on Bitmaps

The computational searching time of a bitmap index depends on several factors, such as the size of the bitmap, the number of bits set to 1 in the bitmap, and the efficiency of the search algorithm.

In general, bitmap indexes are very fast for exact-match queries because they can use bitwise operations to identify the matching rows quickly. For example, suppose we have a bitmap index on a table column with 1 million rows, and we want to find all the rows where the column value is 42. We can perform the following steps:

  1. Find the bitmap for the value 42.
  2. Perform a bitwise AND operation between the bitmap and the bitmaps for all the other conditions in the query (if any). This operation will produce a new bitmap that only contains rows that match all the conditions in the query.
  3. Iterate over the resulting bitmap to retrieve the matching rows.

The time complexity of this operation is O(1) for steps 1 and 2 and O(k) for step 3, where k is the number of rows that match the query. Since the bitmap size is typically much smaller than the size of the table, and the bitwise operations can be performed very quickly, the overall computational searching time of a bitmap index is often very fast.

However, bitmap indexes may not be as efficient for range queries or with complex conditions requiring combining multiple bitmaps. Other indexes, such as B-tree indexes, may be more suitable in these cases.

Clustered Index

Clustered indexes are useful for frequently searched, sorted, or grouped columns and where the data is frequently updated. They work well for range and equality queries and are particularly useful for queries that return a range of values. It would be best to use a clustered index when the table has a primary key or unique index and when you frequently query it based on that key.

A clustered index is implemented as a B-Tree data structure in most relational database management systems (RDBMS).

In a clustered index, the B-Tree leaf nodes contain the data pages, which are physically ordered based on the clustered index key. This means that the rows are stored in the same order as the index key, making retrieving data based on the key faster.

When a query involves a table with a clustered index, the query optimizer can use the index to find the data pages containing the requested rows more efficiently. The index structure allows faster data access based on the key value.

Here’s an example of the structure of a clustered index using a B-tree:

Let’s say we have a table with four columns: ID, FirstName, LastName, and Age. We create a clustered index on the ID column, a unique identifier for each row.

The clustered index is implemented as a B-tree structure, which looks something like this:

                                  50,  100,  200
                                 /  |   |    |  \
                                /   |   |    |   \
                               /    |   |    |    \
                              /     |   |    |     \
                             10    30   70   150   300
                            / | \        |     |  \
                           /  |  \       |     |   \
                          /   |   \      |     |    \
                         1   20   40   80    250   500

In this example, the numbers at the bottom represent the values in the ID column. The numbers in the internal nodes represent the upper and lower bounds of the ranges of ID values stored in the corresponding child nodes.

The leaf nodes at the bottom of the tree contain the actual data pages, which are physically ordered based on the clustered index key (the ID column in this case). For example, the leaf node containing the ID value 20 will also have that row’s corresponding FirstName, LastName, and Age values.

When a query is executed that involves searching for a specific ID value, the query optimizer can use the clustered index to locate the data page that contains the row with that ID value more efficiently, as it can traverse the B-tree structure in a logarithmic number of operations to find the appropriate leaf node.

Non-Clustered Index

Non-clustered indexes are helpful when you need to improve the performance of queries that access a table based on columns other than the clustered index key or when you need to create an index on a view. Here are some scenarios where a non-clustered index may be appropriate:

  1. Queries that search for a specific value or range of values in a column that is not the primary key or clustered index key.
  2. Queries that join two or more tables on a non-primary key or non-clustered index column.
  3. Queries that sort or group data based on a non-primary key or non-clustered index column.
  4. Queries that perform aggregation or calculations on a non-primary key or non-clustered index column.

Non-clustered indexes can improve query performance by reducing the amount of data that needs to be read from disk, as the index provides a smaller, more targeted subset of the data that matches the query criteria. However, it’s important to note that creating too many non-clustered indexes on a table can also harm performance, increasing the overhead of maintaining the indexes during insert, update, and delete operations.

Here’s an example of the structure of a non-clustered index using a B-tree:

Suppose we have a table with three columns: ID, FirstName, and LastName. We create a non-clustered index on the LastName column. The non-clustered index is implemented as a separate B-tree structure from the clustered index, and it looks something like this:

                                  J,   S,    Z
                                 / |   |    | \
                                /  |   |    |  \
                               /   |   |    |   \
                              /    |   |    |    \
                             A    G   K    M    P
                            / \   |        |    |
                           /   \  |        |    |
                          /     \ |        |    |
                         B      F O        Q    R

In this example, the letters at the bottom represent the values in the LastName column. The numbers in the internal nodes represent the upper and lower bounds of the ranges of LastName values stored in the corresponding child nodes.

The leaf nodes at the bottom of the tree contain a reference to the corresponding data pages that store the actual rows in the table, sorted by the values in the LastName column. Each leaf node also contains a pointer to the clustered index, allowing quick access to the data row.

When a query is executed that involves searching for a specific LastName value, the query optimizer can use the non-clustered index to locate the data pages that contain the rows with that LastName value more efficiently, as it can traverse the B-tree structure in a logarithmic number of operations to find the appropriate leaf node, and then use the pointers to the clustered index to locate the corresponding data pages.

Full-Text Index

You would use a fulltext index to perform textbased searches on large amounts of text data, such as in a document or content management system.

The data structure used for a fulltext index can vary depending on the database system and implementation. Still, it typically involves creating an inverted index that maps each word or term in the text to the documents or records that contain that word or term.

For example, a simple data structure for a full-text index might include a list of words or terms in the text and document IDs or pointers to the documents containing each word or term. Each word or term would be associated with a list of documents in which it appears, allowing for efficient searching and retrieval of documents based on the words or terms contained in the text.

In more advanced implementations, the data structure may include additional information, such as the frequency of each word or term in the document, the location of the word or term within the document, or the proximity of multiple words or terms. These additional features can improve the accuracy and relevance of full-text search results.

Spatial Index

You would use a spatial index when performing spatial queries, such as searching for geographic locations or analyzing spatial relationships between objects, such as in a GIS application.

Several spatial indexes support spatial data types and queries, but the R-Tree is one of the most common data structures used for spatial indexing.

An R-Tree is a balanced tree data structure used to index spatial data by representing spatial objects as bounding boxes or minimum bounding rectangles (MBRs). Each node in the R-Tree contains one or more MBRs, and the root node contains the MBR that encompasses all of the objects in the index.

The R-Tree is organized so that the MBRs of the nodes in the tree overlap as little as possible. The leaf nodes of the R-Tree contain pointers to the actual spatial objects, while the internal nodes contain pointers to child nodes or MBRs that encompass their children.

When a spatial query is executed, the R-Tree is used to quickly find the objects that overlap with the search area by recursively traversing the tree nodes and identifying those that intersect with the search area. This allows efficient spatial data retrieval based on proximity, distance, or other spatial relationships.

Here is an example of an R-Tree balanced tree:

                        [A,G,K,Q]
                            / \
                           /   \
                          /     \
                 [A,C,E,G]     [J,K,M,Q]
                    / \             / \
                   /   \           /   \
          [A,B] [C,D] [E,F,G] [J,I] [K,L,M] [N,O,P,Q]

In this example, the R-Tree index spatial data is represented as minimum bounding rectangles (MBRs). The tree’s root node contains the MBR that encompasses all of the objects in the index, which is defined as [A, G, K, Q]. The tree’s leaf nodes contain the actual spatial objects, representing the smallest MBRs encompassing each object.

The R-Tree is organized so that the MBRs of the nodes in the tree overlap as little as possible. Each node in the tree contains one or more MBRs, and the internal nodes have pointers to child nodes or MBRs that encompass their children. In this example, the internal nodes of the R-Tree are represented as [A, C, E, G] and [J, K, M, Q], and the leaf nodes are represented as [A, B], [C, D], [E, F, G], [J, I], [K, L, M], [N, O, P, Q].

When a spatial query is executed, the R-Tree is used to quickly find the objects that overlap with the search area by recursively traversing the tree nodes and identifying those that intersect with the search area. This allows efficient spatial data retrieval based on proximity, distance, or other spatial relationships.


Thanks for reading. Stay tuned!

Learn. Grow. Teach.

Subscribe
Notify of
1 Comment
Most Voted
Newest Oldest
Inline Feedbacks
View all comments
1
0
Would love your thoughts, please comment.x
()
x