An architectural approach to implementing a large-scale document search engine based on Apache NiFi.

An architectural proposal for the storage and indexing of the content of a large number of files of various formats, which will offer the ability to perform advanced searches and improve content organization.

Sergio Sánchez Sánchez
24 min read · Dec 28, 2020
Photo by C M on Unsplash

In this article I would like to show you in detail a personal project I have been working on with the aim of deepening my knowledge of Apache NiFi, a technology widely used for implementing ETL flows. The main objective of the project is to implement a scalable processing flow that extracts and then indexes the content of files of any type and size; as a result, we will be able to perform full-text searches and quickly locate the files that contain a specific piece of content.

More specifically, the goal is to be able to search using a specific text term to get all the files that contain at least one occurrence of it in their content.

It is an interesting challenge, as it is necessary to properly combine several specialized technologies in a very specific task.

In general terms, I would like to comment on the main challenges to be solved in this architecture:

  • Efficient and fast content searches.
  • Extraction of the content of the files.
  • File content and metadata management.
  • Large-scale file storage with high availability and fault tolerance.

Specifically, files will be uploaded to the platform through a REST API and temporarily stored on an SFTP server that acts as the entry point to the architecture. Once processed, they become part of the HDFS distributed file system and, in parallel, their content and metadata are stored in a MongoDB collection that can be reviewed later.

This project demonstrates how powerful Apache NiFi is and how well it integrates with other technologies such as MongoDB, Apache Kafka, HDFS and more.

The key piece of this architecture is the Apache Tika framework, used in this project as an independently managed server, which facilitates the extraction of metadata and content from files of any format.

Main technologies of architecture

First of all, a brief review of the technologies applied in this architecture:

Apache Nifi

Apache NiFi is a software project from the Apache Software Foundation designed to automate the flow of data between software systems. Leveraging the concept of Extract, transform, load, it is based on the “NiagaraFiles” software previously developed by the US National Security Agency (NSA), which is also the source of a part of its present name — NiFi. It was open-sourced as a part of NSA’s technology transfer program in 2014.

The software design is based on the flow-based programming model and offers features which prominently include the ability to operate within clusters, security using TLS encryption, extensibility (users can write their own software to extend its abilities) and improved usability features like a portal which can be used to view and modify behaviour visually.

Apache Tika

Apache Tika is a content detection and analysis framework, written in Java, stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types, and as well as providing a Java library, has server and command-line editions suitable for use from other programming languages.

Apache Kafka

Apache Kafka is a distributed data streaming platform that allows you to publish, store, process, and subscribe to streams of records in real time. It is designed to handle data streams from various sources and distribute them to various consumers. In short, it transfers huge amounts of data, not only from point A to point B, but also from point A to Z and anywhere else you need, all at the same time.

Apache Kafka is the alternative to a traditional business messaging system. It started as an internal system that LinkedIn developed to handle 1.4 billion messages per day. Now, it is an open source data transmission solution with applications for various business needs.

Apache Hadoop HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.

Elasticsearch

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java. Following an open-core business model, parts of the software are licensed under various open-source licenses (mostly the Apache License), while other parts fall under the proprietary (source-available) Elastic License. Official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy, Ruby and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine followed by Apache Solr, also based on Lucene.

Keycloak

Keycloak is an open source software product that provides single sign-on with Identity and Access Management, aimed at modern applications and services. As of March 2018 this JBoss community project is under the stewardship of Red Hat, who use it as the upstream project for their RH-SSO product.

Hashicorp Consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.

Consul provides several key features:

  • Multi-Datacenter: Consul is built to be datacenter aware, and can support any number of regions without complex configuration.
  • Service Mesh/Service Segmentation: Consul Connect enables secure service-to-service communication with automatic TLS encryption and identity-based authorization. Applications can use sidecar proxies in a service mesh configuration to establish TLS connections for inbound and outbound connections without being aware of Connect at all.
  • Service Discovery: Consul makes it simple for services to register themselves and to discover other services via a DNS or HTTP interface. External services such as SaaS providers can be registered as well.
  • Health Checking: Health Checking enables Consul to quickly alert operators about any issues in a cluster. The integration with service discovery prevents routing traffic to unhealthy hosts and enables service level circuit breakers.
  • Key/Value Storage: A flexible key/value store enables storing dynamic configuration, feature flagging, coordination, leader election and more. The simple HTTP API makes it easy to use anywhere.

Logstash

Logstash is an open source data collection engine with real-time pipelining capabilities. Logstash can dynamically unify data from disparate sources and normalize the data into destinations of your choice. Cleanse and democratize all your data for diverse advanced downstream analytics and visualization use cases.

While Logstash originally drove innovation in log collection, its capabilities extend well beyond that use case. Any type of event can be enriched and transformed with a broad array of input, filter, and output plugins, with many native codecs further simplifying the ingestion process. Logstash accelerates your insights by harnessing a greater volume and variety of data.

MongoDB

MongoDB is an open source, document oriented, NoSQL database system. Instead of storing data in tables, as is done in relational databases, MongoDB saves BSON data structures (a JSON-like specification) with a dynamic schema, making data integration in certain applications easier and faster. MongoDB is a database suitable for use in production and with multiple functionalities.

Photo by Ilya Pavlov on Unsplash

Architecture Overview

In this section, I would like to go into more detail about how the architecture works, but first it is worth going over the main objectives I had in mind when I proposed this design:

  • It should offer fast and efficient search, providing the same search experience as other advanced search tools.
  • All text in the documents (including their content) must be extracted and indexed.
  • The architecture should be scalable and rely on proven reference technologies for data movement.
  • It should be able to handle a large number of files of various formats, some of them quite large.
  • It should be optimized to store large amounts of data and maintain multiple copies to ensure high availability and fault tolerance.
  • It should have the ability to integrate with external systems to collaborate on more complex tasks or simply define platform usage schemes.

Next, I would like to comment on how I have covered the four basic pillars of the platform.

Fast search

Today, almost all full-text search engines are based on Apache Lucene. The most popular and developer-friendly Lucene-based search server is Elasticsearch. It supports hundreds of plugins, is highly customizable, and provides an excellent REST API and a scalable architecture.

Therefore, advanced search services are based on this technology as will be shown later.

Text extraction

There is a zoo of different formats and encodings out there, from txt files in DOS encoding to PDF files with scanned images inside, and my goal was to handle all of them gracefully. In the open source world we can find countless libraries that can open and extract text from one specific format and encoding, but there is only one library that can handle practically everything: Apache Tika. Its main goal is to bring together all the open source content extraction libraries and create an easy-to-use, all-in-one system.
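
To give an idea of how the Tika server is consumed over HTTP (these are the standard tika-server endpoints; the host, port and file name below are just placeholders, and in this project the requests are actually issued by NiFi processors), a minimal Java sketch could look like this:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class TikaClientExample {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Path file = Path.of("sample.pdf"); // placeholder input file

        // PUT /tika returns the plain text extracted from the file body.
        HttpRequest contentRequest = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9998/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        String content = client.send(contentRequest, HttpResponse.BodyHandlers.ofString()).body();

        // PUT /meta returns the document metadata (author, creation tool, MIME type...).
        HttpRequest metadataRequest = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9998/meta"))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        String metadata = client.send(metadataRequest, HttpResponse.BodyHandlers.ofString()).body();

        System.out.println(metadata);
        System.out.println(content);
    }
}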

High availability and fault tolerance

HDFS (Hadoop Distributed File System) is the main component of the Hadoop ecosystem and the piece that makes it possible to store a large number of documents of various formats such as PDF, Word, PNG, etc. It is optimized for storing large amounts of data and maintaining multiple copies to ensure high availability and fault tolerance. With all this, HDFS is a fundamental technology for Big Data.

It provides a unified view of the storage resources by creating an abstraction layer that exposes them as a single file system.

  • To achieve high scalability, HDFS uses local storage that scales out. Increasing storage space only means adding hard drives to existing nodes or adding more nodes to the system. These servers have a low cost, as they are basic hardware with attached storage.
  • To maintain data integrity, HDFS stores 3 copies of each data block by default. This means that the space required in HDFS is tripled, so the cost also increases. Although data replication is not strictly necessary for HDFS to work, storing just one copy could result in data loss due to node failure or file corruption, compromising data durability.
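
As a side note, this is a minimal sketch of what writing a file to HDFS looks like with the Hadoop Java client (in this project that work is delegated to NiFi; the NameNode URI and the paths below are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor: each block will be stored on 3 datanodes.
        conf.set("dfs.replication", "3");

        // The client sees a single logical file system behind the NameNode.
        try (FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf)) {
            fs.copyFromLocalFile(new Path("/tmp/sample.pdf"), new Path("/files/sample.pdf"));
            short replication = fs.getFileStatus(new Path("/files/sample.pdf")).getReplication();
            System.out.println("Stored with replication factor: " + replication);
        }
    }
}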

Metadata and content storage

It most likely wouldn’t be a good idea to use Elasticsearch as a primary store without some kind of backing database: performance becomes a problem if all data queries need to be served out of Elasticsearch, especially if the volume of data is huge and all data is being indexed without specific attention paid to the query patterns being used.

This platform is a typical scenario in which Elasticsearch acts as a sink in a data pipeline, with another system/database mastering the data; in case of data loss, the data can be replayed from upstream.

In this case, MongoDB will be used as the primary storage system, since writing to it takes much less time than indexing into Elasticsearch; the documents will later be replicated to Elasticsearch using Logstash, part of the ELK stack.

This additional process will be secured using ELK X-Pack capabilities.

Taking into account all the above, the architecture design is as follows:

Document search engine architectural approach

As you can see in the picture above, in this architecture we can highlight two clearly differentiated parts:

  • An ETL process designed on Apache NiFi’s flow-based programming model to process each file and extract all of its metadata and content. It uses an SFTP server as the entry point and notifies progress through Apache Kafka topics, relying on several pre-built processors to store the file in HDFS, make HTTP requests to the Apache Tika server to extract the file content, and store a document with all the gathered information in a MongoDB collection.
  • A microservice architecture to interact with the platform. Concretely, we can get the metadata of a specific file, launch the processing of a new file, or run complex queries to search for files that contain a specific term in their content. All these services are managed and registered with HashiCorp Consul, which facilitates, among other things, locating them on the network, and all of them are protected through the Keycloak SSO server.

Several things to consider

  • I am using an HDFS cluster with 3 datanodes to store the original files that will be processed.
  • I am using two versions of the Apache Tika server; one of them has OCR capabilities to extract content from images and process scanned PDFs.
  • I am using an SFTP server as the entry point for the NiFi ETL process: a microservice uploads the file to a shared directory, and a NiFi processor continuously polls that directory to detect whether a new file has been added.
  • A quick way to explain how the NiFi ETL process works: it moves the file to the HDFS directory, determines its MIME type and, based on that, makes an HTTP request to the most suitable Apache Tika server to extract all the metadata and text content. Finally, it stores all this information in a MongoDB collection and publishes several records to Kafka to report the state of the process.
  • It is necessary to move this information to Elasticsearch in order to run complex searches, since MongoDB does not have powerful capabilities in this respect. For that, I am using a Logstash pipeline that synchronizes MongoDB documents to an Elasticsearch index.
  • I am using two powerful tools to explore the data that has been indexed and stored: MongoDB Express to explore the MongoDB collection, and Kibana to check the heartbeat of the ELK stack and display the data that has been indexed so far.
  • The microservice architecture is coordinated by a Consul agent, which continuously checks the availability of each service and makes it possible to query the network location of every registered service.
  • All the endpoints exposed by each service require authentication and authorization; therefore, it is necessary to obtain an identity from the Keycloak SSO server through the API Gateway service.
  • The API Gateway microservice unifies all the APIs into a single API (using Spring Cloud Gateway), so clients only need to know the location of the gateway to interact with the platform.

An ETL process design based on Apache Nifi’s flow-based programming

Apache NiFi has a web administration panel through which it is possible to design the data processing flow. The mechanism consists of adding several processors which can work together to do wonderful things. Each processor has a series of mandatory properties that must be specified to ensure its correct operation, and other optional properties that can be omitted to accept their default values.

The implemented ETL flow takes into account two use cases:

  • When a new file to be processed is uploaded to the SFTP server’s “uploads” directory, the GetSFTP processor will download the file and start the processing flow for it. During the execution of this flow, the file will be stored in HDFS, its content and metadata will be extracted and a document will be registered in MongoDB organizing all this information.
  • When the Kafka topic “files-processed-operations” requests the deletion of a previously processed file, an attempt will be made to locate and delete the file in the HDFS directory and subsequently report the result of the operation to another Kafka topic dedicated to notifications.
Apache Nifi Flow Web Panel

Once the file has been stored in HDFS, it will be necessary to retrieve a copy of it and identify its MIME type in order to decide whether the Apache Tika server with OCR capabilities is needed or not.

For this task, I am using the FetchHDFS and IdentifyMimeType processors, both of which operate on the content of the current flow file.

Apache Nifi Flow Web Panel

Afterwards, an HTTP request will be made to Apache Tika to obtain the file’s metadata, the result will be processed and formatted in an independent flow that will be later combined with the result of the content extraction request.

Apache Nifi Flow Web Panel

The InvokeHTTP processor is easy to configure, it will be necessary to indicate the HTTP method and the remote URL to which the request will be launched. For this step you can use the normal Apache Tika server without OCR capabilities since only the file metadata will be obtained.

Apache Nifi Flow Web Panel

Next, a routing flow based on an attribute of the flow-file has been implemented. That is, through the RouteOnAttribute processor it is possible to define a condition based on the value of an attribute of the input flow-file to activate a specific path available to it.

Apache Nifi Flow Web Panel

In the properties of the processor, routing strategies can be specified: based on a condition, a specific path is activated. In the image below we can see that when the file being processed is a PDF or an image, the flow continues through the connection called “require_ocr”, so that the content extraction request is made against the Apache Tika server with OCR capabilities.

Apache Nifi Flow Web Panel
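
For reference, this kind of routing property is written with the NiFi Expression Language. An illustrative condition for the “require_ocr” connection (not necessarily the exact expression used in the project) could be the following, where mime.type is the attribute written by the IdentifyMimeType processor:

${mime.type:equals('application/pdf'):or(${mime.type:startsWith('image/')})}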

Next, another InvokeHTTP processor will be activated that will use the Apache Tika server endpoint /tika to obtain all the text content of the current file, later all this content will be formatted and reviewed to be combined with the metadata information that has been extracted before.

Apache Nifi Flow Web Panel

I am using the MergeContent processor for that: it generates a new flow file resulting from the combination of the metadata and content flow files generated before.

Apache Nifi Flow Web Panel

It is important to define a correlation attribute to avoid mixing content from different files (Correlation Attribute Name) that may be being processed simultaneously.

It is also necessary to specify the number of pieces for which the processor must wait to generate the resulting flow-file.

In this case, the original attributes of the file, the metadata that has been retrieved and finally all the text content contained in the file will be combined.

To generate a more consistent result, the Attribute Strategy property has been set to “Keep Only Common Attributes”, in this way, only the common attributes present in the three flow-files will be included in the result.

Apache Nifi Flow Web Panel

Next, I am using a JoltTransformJSON processor to generate a well-formatted JSON document before saving it into MongoDB.

Apache Nifi Flow Web Panel

The PutMongo processor needs to know the name of the database and the destination collection where the new document is to be stored. It is also necessary to review the insertion mode and to properly configure the connection URI, taking the necessary security into account.

Apache Nifi Flow Web Panel

Once the document has been successfully stored in MongoDB, an event will be registered in Kafka’s “processed-files-state” topic to make it easier for external elements of the architecture to be aware of this event.

Apache Nifi Flow Web Panel

On the other hand, the ConsumeKafka processor will be used to launch the process of deleting processed files, this processor will act as a Kafka consumer that will wait for new events in the “files-processed-operations” topic. It will use the name of the file included in the event payload to instruct the DeleteHDFS processor to delete it. Finally, as in the previous flow, an event will be registered in Kafka to report the success of the operation.

Apache Nifi Flow Web Panel

This processor requires the definition of the Kafka broker URL, the name of the topic and the name of the consumer group, as can be seen in the following image.

Apache Nifi Flow Web Panel

You can see more details of how the flow works in the following video:

Document Search Engine Architecture

Storage of metadata and content

Once the file is processed, we will have a copy of it in the Hadoop distributed file system, which consists of three data servers and an additional server that coordinates them. This provides a file system capable of storing a number of files far greater than what could be stored on a single physical machine. Furthermore, thanks to the inherent characteristics of HDFS, the content of each file is replicated across the data servers to avoid any loss of information if one of them fails.

HDFS Datanodes detail

As you can see in the image above, each data server has a maximum capacity in GB, the file system capacity will be equal to the sum of the capacities of all the data servers included in the configuration.

The content to be stored is organized in data blocks, the basic unit of HDFS storage. As the files to be stored are not too large, it was not necessary to change the default block size.

Hadoop Browse Directory UI

Hadoop has a web tool through which it is possible to explore the file system, hiding all the complexity of the servers and data blocks.

You can see more details of the HDFS configuration in the following video:

Document Search Engine Architecture — HDFS

On the other hand, all the metadata and content extracted from the files processed by the NiFi flow are stored in document format within a MongoDB collection called “processed_files”.

MongoDB Express

Each stored document will have three clearly differentiated parts:

  • Attributes (attrs): basic attributes of the file that allow determining its creation date, MIME type, location in the HDFS system.
  • Metadata (metadata): Large amount of data about the file, we can determine the author of the document, the tool used for its creation, etc.
  • Content (document): In this section you will find the textual content that could be extracted from the file, in addition to its processing date.
MongoDB Express

The metadata part is by far the most variable, depending on the type of file, there may be more properties or there may be fewer. It is the responsibility of the APIs (which will be discussed later) to unify this information.

MongoDB Express
MongoDB Express

Document Synchronization with ElasticSearch

To support the required Full-Text search capabilities, it will be necessary to synchronize the information inserted into the MongoDB collection with an ElasticSearch index.

For this, Logstash has been used: through a pipeline it is possible to configure the movement of this data.

The typical way to implement this is with the logstash-input-mongodb plugin, but I ran into problems with it: the plugin is quite limited and it also seems that it is no longer maintained, so I have opted for the logstash-integration-jdbc plugin instead.

These are the steps I followed to sync a MongoDB collection with Elasticsearch:

  • First, I have downloaded the JDBC driver for MongoDB developed by DBSchema that you can find here.
  • I have prepared a custom Dockerfile to integrate the driver and plugins as you can see below:
FROM docker.elastic.co/logstash/logstash:7.9.2

# Copy the DBSchema MongoDB JDBC driver into the image
RUN mkdir /usr/share/logstash/drivers
COPY ./drivers/* /usr/share/logstash/drivers/

# Install the plugins used by the pipeline
RUN logstash-plugin install logstash-integration-jdbc
RUN logstash-plugin install logstash-output-elasticsearch

  • I have configured a query that will be executed every 30 seconds, looking for documents whose processing timestamp is later than the timestamp of the last run (provided via the :sql_last_value parameter):

input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/drivers/mongojdbc2.3.jar"
    jdbc_driver_class => "com.dbschema.MongoJdbcDriver"
    jdbc_connection_string => "jdbc:mongodb://devroot:devroot@mongo:27017/files?authSource=admin"
    jdbc_user => "devroot"
    jdbc_password => "devroot"
    schedule => "*/30 * * * * *"
    statement => "db.processed_files.find({ 'document.processed_at' : {'$gte': :sql_last_value}},{'_id': false});"
  }
}

output {
  stdout {
    codec => rubydebug
  }

  elasticsearch {
    action => "create"
    index => "processed_files"
    hosts => ["elasticsearch:9200"]
    user => "elastic"
    password => "password"
    ssl => true
    ssl_certificate_verification => false
    cacert => "/etc/logstash/keys/certificate.pem"
  }
}

As a result of this, we will have an index in ElasticSearch called “processed_files” with all the information of the files currently stored in MongoDB.

Kibana Discover Index Web UI

The most interesting property is the extracted content on which we will perform advanced searches.

Kibana Discover Index Web UI

Microservice architecture to interact with the platform

The entire platform is hidden behind a unified REST API with which it is possible to perform the following tasks:

  • Start processing a new file.
  • Delete a previously processed file.
  • Query the metadata of a file.
  • Search for files that contain a specific term.
  • Notification report about the status of a file’s processing.

Each microservice is specialized in a specific task and registered in the Consul service directory; this allows the rest of the microservices to locate it without needing to know its details at the TCP/IP level.

Consul Service Directory

As you can see in the image above, all the microservices are available; the Consul agent continually makes requests to check their status.

Consul Service Directory — Microservice detail

All the TCP/IP information of the services is centralized in the Consul directory.

Consul Service Directory — Microservice detail

The Consul agent makes use of the endpoints exposed by the Spring Boot Actuator to know the status of each service. Specifically, it uses actuator/health, which indicates whether the service is “UP”. This endpoint is excluded from the Spring Security configuration, so no authentication is required to query it.

File management on the platform

The microservice “files-management-service” will be responsible for managing the files on the platform. The endpoints exposed by this service require a client with more privileges, since they allow deleting processed files or adding new ones.

Files Management REST API — OpenAPI

Therefore, it performs two operations:

Add a new file to the platform

The microservice will perform the following steps when a new file is added (a simplified sketch is shown after the list):

  • It will manage the upload of the file, whatever its format.
  • It will check with the metadata microservice whether the file already exists.
  • If the file does not exist, it will transfer it to the SFTP server so that processing can begin.
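
This is a simplified, illustrative sketch of that flow; the endpoint paths, the Feign client and the SftpUploader helper are hypothetical names, not the exact ones used in the repository:

import java.io.IOException;
import java.io.InputStream;

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.http.HttpStatus;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

// Hypothetical Feign client to ask the metadata service whether the file already exists.
@FeignClient(name = "files-metadata-service")
interface FilesMetadataClient {
    @GetMapping("/api/v1/metadata/{fileName}/exists")
    boolean exists(@PathVariable("fileName") String fileName);
}

// Hypothetical abstraction over the SFTP transfer to the "uploads" directory.
interface SftpUploader {
    void upload(String fileName, InputStream content) throws IOException;
}

@RestController
@RequestMapping("/api/v1/files")
class FilesManagementController {

    private final FilesMetadataClient metadataClient;
    private final SftpUploader sftpUploader;

    FilesManagementController(FilesMetadataClient metadataClient, SftpUploader sftpUploader) {
        this.metadataClient = metadataClient;
        this.sftpUploader = sftpUploader;
    }

    @PostMapping(consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    ResponseEntity<Void> upload(@RequestParam("file") MultipartFile file) throws IOException {
        // Reject the upload if the file has already been processed.
        if (metadataClient.exists(file.getOriginalFilename())) {
            return ResponseEntity.status(HttpStatus.CONFLICT).build();
        }
        // Drop the file into the SFTP directory watched by the NiFi GetSFTP processor.
        sftpUploader.upload(file.getOriginalFilename(), file.getInputStream());
        return ResponseEntity.accepted().build();
    }
}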

Delete a processed file

It will be necessary to specify the full name of the file (including its extension). The existence of the file is then checked through the metadata microservice and, if it exists, an operation event is published to Kafka to request its deletion (for this I am using a message channel from Spring Cloud Stream).

File delete request

Below, you can see the class of service that implements the logic discussed above:

Files Management Service
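
Since the screenshot is not reproduced here, the following is a minimal sketch of how the deletion request could be published through a Spring Cloud Stream message channel (the binding and the payload format are illustrative assumptions; the actual Kafka destination is set in the service configuration):

import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Source;
import org.springframework.messaging.support.MessageBuilder;
import org.springframework.stereotype.Service;

// Binds the default "output" channel; the Kafka topic (e.g. "files-processed-operations")
// is configured in the application properties, not in the code.
@EnableBinding(Source.class)
@Service
public class FileDeletionPublisher {

    private final Source source;

    public FileDeletionPublisher(Source source) {
        this.source = source;
    }

    public void requestDeletion(String fileName) {
        // Hypothetical payload format: the NiFi ConsumeKafka processor reads it and triggers DeleteHDFS.
        source.output().send(MessageBuilder
                .withPayload("{\"operation\":\"DELETE\",\"file_name\":\"" + fileName + "\"}")
                .build());
    }
}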

Query of file metadata

The microservice “files-metadata-service” will have the responsibility of managing the metadata of the files on the platform. It is possible to obtain the complete list of processed files or a specific one.

It knows the details of the MongoDB connection as well as the collection to query.

Obtain metadata of processed files
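
A minimal sketch of how this lookup could be modeled with Spring Data MongoDB; the document mapping follows the three parts of the “processed_files” collection described earlier, but the exact field names (for example attrs.filename) are assumptions:

import java.util.Map;
import java.util.Optional;

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.core.mapping.Field;
import org.springframework.data.mongodb.repository.MongoRepository;
import org.springframework.data.mongodb.repository.Query;

@Document(collection = "processed_files")
class ProcessedFile {
    @Id
    private String id;
    @Field("attrs")
    private Map<String, Object> attrs;      // basic file attributes (name, MIME type, HDFS path...)
    @Field("metadata")
    private Map<String, Object> metadata;   // metadata extracted by Apache Tika
    @Field("document")
    private Map<String, Object> document;   // extracted text content and processing date
    // getters and setters omitted for brevity
}

interface ProcessedFileRepository extends MongoRepository<ProcessedFile, String> {

    // Looks a file up by its name inside the "attrs" sub-document (field name assumed).
    @Query("{ 'attrs.filename' : ?0 }")
    Optional<ProcessedFile> findByFileName(String fileName);
}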

Advanced queries by content

The microservice “files-search-service” is responsible for advanced searches based on a specific text term. It knows the details of connecting to Elasticsearch and has the necessary logic to work with it.

Search files
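
A minimal sketch of the kind of full-text query this service runs against the “processed_files” index, using the Elasticsearch high-level REST client (the host, the security settings and the exact field holding the extracted text, assumed here to be document.content, may differ from the real project):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class FilesSearchExample {

    // Returns the source documents of the files whose extracted content matches the given term.
    public static List<Map<String, Object>> searchByTerm(String term) throws IOException {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("elasticsearch", 9200, "http")))) {

            // Full-text match query on the content extracted from the processed files.
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.matchQuery("document.content", term));
            SearchRequest request = new SearchRequest("processed_files").source(source);

            SearchResponse response = client.search(request, RequestOptions.DEFAULT);

            List<Map<String, Object>> results = new ArrayList<>();
            for (SearchHit hit : response.getHits().getHits()) {
                results.add(hit.getSourceAsMap());
            }
            return results;
        }
    }
}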

Processing status notifications

The microservice “files-notification-service” will have the responsibility of persisting and reporting events about the status of file processing.
It will mainly perform two tasks (a simplified sketch is shown after the list):

  • Persist all generated events in a MongoDB collection so that they can be queried at any time.
  • Report these events via WebSocket/STOMP to interested clients.
Processed files state topic
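
A minimal sketch of those two tasks, assuming a Kafka listener on the “processed-files-state” topic, a Spring Data MongoDB repository for persistence and STOMP relaying through SimpMessagingTemplate (the class names, collection name and STOMP destination are illustrative):

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.repository.MongoRepository;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.stereotype.Service;

// Illustrative event document carrying the properties described below.
@Document(collection = "file_state_notifications")
class FileStateNotification {
    @Id
    private String id;
    private String fileName;
    private String fileMimeType;
    private String fileHdfsPath;
    private String fileState; // SAVED, DELETED or PROCESSED
    // getters and setters omitted for brevity
}

interface FileStateNotificationRepository extends MongoRepository<FileStateNotification, String> {}

@Service
class FileStateNotificationListener {

    private final FileStateNotificationRepository repository;
    private final SimpMessagingTemplate messagingTemplate;

    FileStateNotificationListener(FileStateNotificationRepository repository,
                                  SimpMessagingTemplate messagingTemplate) {
        this.repository = repository;
        this.messagingTemplate = messagingTemplate;
    }

    // Assumes a JSON deserializer is configured for the Kafka consumer.
    @KafkaListener(topics = "processed-files-state", groupId = "files-notification-service")
    void onFileStateEvent(FileStateNotification event) {
        // Persist the event so that it can be queried at any time.
        repository.save(event);
        // Relay it over WebSocket/STOMP to the subscribed clients.
        messagingTemplate.convertAndSend("/topic/file-notifications", event);
    }
}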

Each document in MongoDB will have several properties that allow you to identify the file and its processing status.

More specifically these properties are:

  • file_name: Name of the file whose status has changed.
  • file_last_modified_time: Date of the last modification of the file.
  • file_mime_extension: File extension.
  • file_mime_type: Type of the processed file.
  • file_hdfs_path: File path in HDFS file system.
  • file_state: Current state of the file.

Possible values can be:

  • SAVED: The file has been processed successfully and all its content and metadata is stored in MongoDB.
  • DELETED: The file has been removed from HDFS and MongoDB.
  • PROCESSED: The file is being processed; it may already have been stored in HDFS, but not all of its information is yet available to be consulted.
File Notification State Document

Gateway to the microservices ecosystem

To facilitate the integration of external clients with the APIs exposed by each micro service, the “Gateway” pattern has been applied with the aim of implementing a unified REST API to which the desired authentication and authorization rules will be applied. This micro service acts as the entry point to the platform, from where all the desired operations can be carried out.

For this, the Spring Cloud Gateway OAuth2 software has been applied combined with Spring Security and Keycloak.

Spring Cloud Gateway provides a library for building an API Gateway on top of Spring WebFlux. Spring Cloud Gateway aims to provide a simple, yet effective way to route to APIs and provide cross cutting concerns to them such as: security, monitoring/metrics, and resiliency.

Spring Cloud Gateway features:

  • Built on Spring Framework 5, Project Reactor and Spring Boot 2.0.
  • Able to match routes on any request attribute.
  • Predicates and filters are specific to routes.
  • Circuit Breaker integration.
  • Spring Cloud DiscoveryClient integration
  • Easy to write Predicates and Filters
  • Request Rate Limiting
  • Path Rewriting

Authentication and access control

This microservice holds the details of all the clients allowed to work with the REST API, and all the configuration needed to establish the OAuth2 authentication flow with the Keycloak server, as you can see in the following configuration:

Authentication and access control configuration
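
As an indicative example (the real values live in the gateway configuration and in the Keycloak realm, so the issuer URI, secret and scope below are placeholders), the Keycloak client registration for Spring Security OAuth2 usually looks like this:

spring:
  security:
    oauth2:
      client:
        provider:
          keycloak:
            issuer-uri: http://keycloak:8080/auth/realms/document-search-engine
        registration:
          files-management-client:
            provider: keycloak
            client-id: files-management-client
            client-secret: <secret>
            authorization-grant-type: authorization_code
            redirect-uri: "{baseUrl}/login/oauth2/code/{registrationId}"
            scope: openid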

Route system configuration

The routes can be configured through a Java class or a YML file; the gateway uses the Consul agent to resolve the location of the services on the network.

The most important part, which can be seen below, is the use of the TokenRelay filter, which propagates the JWT access token to the downstream calls.

Route system configuration
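
For reference, an equivalent route definition in YML, combining Consul-based service discovery (the lb:// scheme) with the TokenRelay filter, could look like the following; the route ids and paths are illustrative:

spring:
  cloud:
    gateway:
      default-filters:
        - TokenRelay
      routes:
        - id: files-metadata-service
          uri: lb://files-metadata-service
          predicates:
            - Path=/api/v1/metadata/**
        - id: files-search-service
          uri: lb://files-search-service
          predicates:
            - Path=/api/v1/search/**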

Spring Cloud Gateway OAuth2 use case scenario

First, let’s take a look at the picture that illustrates our use case. We call the POST /login endpoint on the gateway (1). After receiving the login request, Spring Cloud Gateway tries to obtain the access token from the authorization server (2). Keycloak then returns the JWT access token. As a result, Spring Cloud Gateway calls the userinfo endpoint (3). After receiving the response, it creates a web session and an Authentication bean. Finally, the gateway application returns a session id to the external client (4). The external client uses a cookie with the session id to authorize its requests. It calls GET /metadata on the files-metadata-service application (5). The gateway application forwards the request to the downstream service (6); however, it removes the cookie and replaces it with a JWT access token. The files-metadata-service application verifies the incoming token (7). Finally, it returns a 200 OK response if the client is allowed to call the endpoint (8); otherwise, it returns 403 Forbidden.

Spring Cloud Gateway OAuth2 — Use case

We may start testing in the web browser. First, let’s call the login endpoint. We have two available clients: files-viewing-client and files-management-client. We will use the client files-management-client.

Login with OAuth 2.0 Web Page.

The management client has an additional scope that allows it to consume the services of the files management microservice; that is, with a JWT token for that client it is possible to add new files and to delete files that have already been processed.

The files-viewing-client client is more limited and only supports advanced search and metadata queries.

Files management client detail.

Once the desired client has been selected, the gateway redirects us to the Keycloak login page, where we must provide the credentials of a user created in the Keycloak realm being used.

Login into Document Search Engine Realm

After a successful login, the gateway performs the OAuth2 authorization procedure and finally redirects us to the main page. The main page is just an index method inside the controller, which returns the current session id.

We can also use another endpoint implemented on the gateway, GET /token, which returns the current JWT access token.

Just to check, you can decode a JWT token on the https://jwt.io site.

Decoded JWT Token

You can see more details in the next video:

Microservice architecture to interact with the platform

Technologies used

  • Spring Boot 2.3.5 / Apache Maven 3.6.3.
  • Spring Boot Starter Actuator.
  • Spring Cloud Stream.
  • Spring Cloud Gateway.
  • Spring Cloud Starter Consul Discovery.
  • Spring Cloud Starter Open Feign.
  • Springdoc Open Api.
  • Spring Boot Starter Security.
  • Spring Security OAuth2.
  • ElasticSearch — Logstash — Kibana (ELK Stack).
  • MongoDB.
  • Mongo DB Express (Web-based MongoDB admin interface, written with Node.js and express).
  • Consul Server.
  • SSO Keycloak Server.
  • Hadoop HDFS.
  • Apache Nifi.
  • Apache Tika Server.
  • Rabbit MQ / STOMP protocol.
  • Apache Kafka.
  • Kafka Rest Proxy

That’s it. I have really enjoyed developing and documenting this little project; thanks for reading, and I hope this is the first of many. If you are interested in seeing the complete code, here is the link to the public repository.
