Before we start with the implementation, let's talk about why would you actually need an inverted index in a real life.
Why would anyone need inverted index at all
Imagine you need to create a system that would quickly look up a document, given several words from it - something like a wiki search. Simplest option I can think of would be to scan through each document, marking ones that have all the necessary words. That might work at first, but such solution wouldn't scale, as each new document increases time to answer any query.
Can we do something better? Actually, we can! If user wants to search by words - then words should be keys in our "database" (index). What would the values be in that case? Document ids, or any other unique references / pointers, that contain this word.
How inverted indexes work
Imagine we have documents like:
id: 1; Text: “In a hole in the ground there lived a hobbit.”
id: 2; Text: “The sky above the port was the color of television, tuned to a dead channel.”
id: 3; Text: "Hobbits Are Amazing Creatures"
In this case, our index would look something like that:
{
"hobbit": [1, 3],
"television": [2],
"hole": [1],
...
}
Seems quite obvious how we got here, right?
Now if user queries system for "hobbit" - we look up this key (in O(1) time), and return document 1 and 3.
This structure also solves case when we need to do queries like NOT, AND and OR - we get sets of documents by each key, and then do usual set operations, like intersect in case of AND, or diff in case of NOT. Hence for query like "hobbit AND hole" we will look up sets [1, 3] and [1], their intersection would be [1], and we would return document with id=1.
Obviously all this is just a tip of the iceberg, the most naive implementation - real-word document indexing / querying systems could rank results based on relevance, do different sorts of fuzzy search, faceted search etc, but still, naive implementation is a great place to start.
So - let's start, and implement it.
Implementation
I would start somewhat "from the bottom" - so we'll first implement just class representing Inverted Index, with ability to add and lookup token there, then add some helpers to index full documents, and then provide the scaffolding that allows to actually run this as an application.
So, let's start with class representing InvertedIndex itself:
type Token = String
type Filename = String
case class InvertedIndex(
tokenIndex: Map[Token, Set[Filename]] = HashMap[Token, Set[Filename]]()
)
As I mentioned in the beginning of the article, our index itself would be powered by a usual Map[String, Set[String]] - mapping from indexed word to a list of document ids (or, in our case, just document names).
Not much going on just yet - let's do something useful, at least add a method to add word to the index, and get added documents afterwards:
def add(token: Token, filename: Filename): InvertedIndex = {
tokenIndex.get(token) match {
case Some(set) =>
// we already know this token - add filename to the set
InvertedIndex(tokenIndex.updated(token, set + filename))
case None =>
// token didn't previously exist - creating new set
InvertedIndex(tokenIndex.updated(token, Set(filename)))
}
}
def getDocumentsForToken(token: Token): Set[Filename] =
tokenIndex.getOrElse(token, Set.empty[String])
Not a lot is going on here, but our inverted index is already usable!
Though we still need something to add words into the index. We'll create IndexGenerator class for that purpose:
class IndexGenerator(filesToIndex: List[String]) {
def generateIndex(): InvertedIndex = {
logger.info("Indexer started")
filesToIndex
.map(readFile)
.foldLeft(InvertedIndex()) { case (idx, doc) =>
val updIdx = processDocument(idx, doc)
updIdx
}
}
private def processDocument(index: InvertedIndex, document: GenericDocument): InvertedIndex =
document.text
.split(" ")
.flatMap(processToken)
.foldLeft(index)(_.add(_, document.fileName))
private def processToken(token: String): Option[String] = {
val tokenLower = token.toLowerCase
if (tokenLower.isEmpty ||
StringUtils.getStopWords.contains(tokenLower)) {
None
} else {
Some(tokenLower.removeTags().removeNumbers())
}
}
}
So what do we have here?
IndexGenerator takes a list of files to index, reads them from disk, and "processes". processing is just splitting whole text into words (or "tokens"), a little bit of cleanup via processToken function, and adding them one by one to the InvertedIndex.
I won't get into details of StringUtils functions here - their implementation is quite obvious from names and in any real-life application you would actually spend some time coming up with good data cleanup rules. One thing to mention is stop words - those are basically all the words that you often encounter in any text, but are not very helpful in actual search - words like the, a, and etc. You'll also probably want to do some stemming / lemmatization, so that documents with word "democratical" could be found via "democracy" query. But all these text processing methods are a topic of a whole separate discussion.
At this point we already have everything so that our indexer could work. There's just one more thing I'd want to add to spice things up - multiple threads.
Running in parallel
Absolute majority of CPUs currently have more than 1 core, hence can execute several operations in parallel. But our code currently doesn't use this fact to it's advantage - we just read files one by one, and add tokens, also one by one. Why not utilize more cores, if we have them? Adding such a powerful capability is surprisingly easy to do, and on our "business logic" side (in InvertedIndex) it would just require one more method - merge.
Let's add it now:
def merge(other: InvertedIndex): InvertedIndex = {
val unionTokens = this.tokenIndex.keys.toSet
.union(other.tokenIndex.keys.toSet)
val unionMap = unionTokens.map { token =>
val unionSet = this.tokenIndex
.getOrElse(token, Set.empty[Filename])
.union(other.tokenIndex
.getOrElse(token, Set.empty[Filename])
)
token -> unionSet
}.toMap
InvertedIndex(unionMap)
}
We could've just add words one by one, but in that case we would lose all the advantage of actually having 2 indexes already created. Instead, we first merge together all the keys, and then union all the Filenames that correspond to those keys.
Let's also a bit of scaffolding to run several instances of our existing IndexGenerator in parallel:
import scala.collection.parallel.CollectionConverters._
def runIndexer(inputDir: String, numThreads: Int): InvertedIndex = {
val filesToIndex = FileOps.getFilesToIndex(inputDir)
val groupSize = getGroupSize(filesToIndex, numThreads)
filesToIndex
.grouped(groupSize)
.toList
.par
.map { files =>
val indexGenerator = new IndexGenerator(files)
indexGenerator.generateIndex()
}
.reduce { (i1, i2) =>
i1.merge(i2)
}
}
def getGroupSize(filesToIndex: List[String], numThreads: Int): Int = {
val groupSize = filesToIndex.size / numThreads
if (groupSize == 0) 1 else groupSize
}
We split files into numThreads equal groups, run generateIndex for each, and then merge results together.
Where does all the parallelization come from? Just from .par method, courtesy of "org.scala-lang.modules" %% "scala-parallel-collections" library.
Conclusion
Summing things up, in this post we:
- discovered an efficient way to implement document search using inverted index.
- wrote a basic implementation of said index, and briefly discussed text processing that should go along with it.
- added a multi-threading capability to our code, to utilize multi-core architecture of modern hardware.
I hope this was helpful, feel free to post your feedback or any questions in the commentaries section. As usual, all the code (and some more code + tests + benchmarking results) is available in Github repo: https://github.com/RayanRal/indexVerse
cover image: Image by liravega on Freepik
good