Covid Tweets ETL Architecture

Aug 15, 2020

A microservices ETL architecture for the ingestion and analysis of tweets about COVID-19.

The health emergency caused by the COVID-19 crisis is generating a large volume of opinions and information on platforms such as Twitter. Analyzing and processing all of it in real time, efficiently and at scale, is a challenge.

In this story I would like to show you a distributed architecture orchestrated by Apache Kafka, a unified, high-performance, low-latency platform for handling real-time data feeds. It can be seen as a massively scalable message queue that follows the publish-subscribe pattern and is designed as a distributed transaction log, which makes it attractive for enterprise application infrastructures.

The architecture consists of decoupled, independent microservices developed with Spring Boot. They collaborate through Spring Cloud Stream, a framework for building highly scalable event-driven applications connected by shared messaging systems.

Spring Cloud Stream abstracts the underlying messaging technology, so the project could run on a similar technology such as RabbitMQ without major changes to the code.

Main technologies of the architecture

First of all, a brief review of the technologies applied in this architecture:

Apache Kafka

Apache Kafka is a distributed data streaming platform that allows you to publish, subscribe to, store and process streams of records in real time. It is designed to handle data streams from multiple sources and deliver them to multiple consumers. In short, it moves huge amounts of data, not only from point A to point B, but also from point A to Z and anywhere else you need, all at the same time.

Apache Kafka is an alternative to a traditional enterprise messaging system. It started as an internal system that LinkedIn developed to handle 1.4 billion messages per day. Now it is an open source data streaming solution with applications for a variety of business needs.

Apache ZooKeeper

Apache ZooKeeper is a free software project of the Apache Software Foundation that offers a highly reliable coordination service for large distributed systems. ZooKeeper started as a subproject of Hadoop and is now a top-level Apache project.

Elasticsearch

Elasticsearch is a Lucene-based search server. It provides a distributed, multi-tenant, full-text search engine with a RESTful web interface and JSON documents. Elasticsearch is developed in Java, and the OSS distribution used in this project (7.6.2) is released as open source under the Apache license.

Kibana

Kibana is an open source data visualization dashboard for Elasticsearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. Users can create bar, line and scatter plots, or pie charts and maps on top of large volumes of data.

Kibana also provides a presentation tool, referred to as Canvas, that allows users to create slide decks that pull live data directly from Elasticsearch.

Architecture Overview

The main goal of the architecture is to connect to the real-time stream of tweets from the Twitter platform, convert that information into suitable data models by discarding what is not needed, and thereby prepare it for analysis.
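The "discard what is not needed" step can be pictured as a plain mapping from a raw status to a slim model. The following is only a minimal sketch in plain Java: the field names (id, text, lang, profile_background_color) and the Tweet class are hypothetical, and the real project delegates this kind of mapping to MapStruct rather than hand-written code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: not the project's actual model or mapper.
final class TweetMappingSketch {

    /** Slim model holding only the fields assumed to matter for the analysis. */
    static final class Tweet {
        final long id;
        final String text;
        final String lang;

        Tweet(long id, String text, String lang) {
            this.id = id;
            this.text = text;
            this.lang = lang;
        }
    }

    /** Copies the needed fields and silently drops every other key of the raw status. */
    static Tweet toTweet(Map<String, Object> raw) {
        long id = ((Number) raw.get("id")).longValue();
        String text = String.valueOf(raw.get("text")).trim();
        String lang = String.valueOf(raw.getOrDefault("lang", "und"));
        return new Tweet(id, text, lang);
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("id", 42L);
        raw.put("text", "  Stay safe #COVID19 ");
        raw.put("lang", "en");
        raw.put("profile_background_color", "C0DEED"); // not needed, so it is discarded

        Tweet tweet = toTweet(raw);
        System.out.println(tweet.id + " [" + tweet.lang + "] " + tweet.text);
    }
}
```

Keeping the model small also keeps the messages published to Kafka small, which matters when every tweet travels through two topics.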

The tweets are published to the topic of unprocessed tweets (called tweets-ingest). This topic is consumed by the “Tweets Processor” consumer group, made up of one or more microservices that use the Stanford CoreNLP framework to determine the sentiment of each tweet and to extract its entities, tokens and NER tags.

Each microservice in this consumer group receives messages from its own subset of the partitions of the same topic, so no two instances in the group process the same tweet.
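To make the partition sharing concrete, here is a small simulation in plain Java. It is a sketch under assumptions: the instance names are hypothetical and the simple round-robin rule below only stands in for Kafka's real assignment strategies, which are configurable and more elaborate. The property it illustrates does hold in Kafka: within one consumer group, each partition is owned by exactly one consumer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a partition assignor; Kafka's own strategies differ in detail.
final class PartitionAssignmentSketch {

    /** Distributes partitions 0..partitions-1 over the consumers of one group, round-robin. */
    static Map<String, List<Integer>> assign(List<String> consumers, int partitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < partitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition); // each partition gets exactly one owner
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical instance names for the "Tweets Processor" group.
        System.out.println(assign(Arrays.asList("processor-1", "processor-2"), 3));
    }
}
```

One consequence worth keeping in mind: if the group has more instances than the topic has partitions, the extra instances receive nothing, so the partition count sets the upper limit on useful parallelism.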

The processed tweets will be published in the processed tweets topic (called processed-tweets), which will be observed by two groups of consumers called “Tweets API” and “Tweets Collector”.

The first group consists of a microservice that exposes the processed tweets for retrieval and viewing through a REST API or STOMP over WebSocket. The microservices in the collector group only perform the mapping needed to index the data correctly in Elasticsearch for subsequent search and analysis.
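Both groups see every processed tweet because Kafka tracks the read position (offset) per consumer group, not per topic. The toy model below illustrates that idea in plain Java with an in-memory list instead of a real broker; the group ids tweets-api and tweets-collector are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy single-partition topic: one shared log, one independent offset per consumer group.
final class ConsumerGroupFanOutSketch {

    private final List<String> log = new ArrayList<>();
    private final Map<String, Integer> offsetByGroup = new HashMap<>();

    void publish(String message) {
        log.add(message);
    }

    /** Returns the messages this group has not read yet and advances only this group's offset. */
    List<String> poll(String groupId) {
        int from = offsetByGroup.getOrDefault(groupId, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        offsetByGroup.put(groupId, log.size());
        return batch;
    }

    public static void main(String[] args) {
        ConsumerGroupFanOutSketch topic = new ConsumerGroupFanOutSketch();
        topic.publish("processed-tweet-1");
        topic.publish("processed-tweet-2");

        // Each group reads the full stream independently of the other.
        System.out.println("api:       " + topic.poll("tweets-api"));
        System.out.println("collector: " + topic.poll("tweets-collector"));
    }
}
```

This is what lets the API and the collector evolve, scale, or fail independently: a slow Elasticsearch indexer does not delay the WebSocket feed.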

Applications

Next, I will detail the four types of microservices present in the architecture:

Covid Tweets Ingest

A Spring Boot web application in Java that implements a Twitter client. It receives the latest tweets about COVID-19, creates the data model associated with each tweet, and publishes it to the tweets-ingest topic in Kafka.

A custom message producer has been implemented to send tweets through the Spring Cloud Stream framework.

Twitter Message Producer

I used AKHQ (akhq.io), a GUI for Apache Kafka, to manage and browse the data published to the topics and to check the active consumer groups.

With this tool we can watch, in real time, the raw tweets that are published to the Kafka topic.

Covid Tweets Processor

A Spring Boot web application in Java that listens for new messages on the tweets-ingest topic in Kafka and analyzes their text through an analysis service built on Stanford CoreNLP.

The consumer offset is committed only once the analysis has completed. This is the safe way to process messages from Kafka: if an error during the analysis stops the processor, the same tweet is delivered again the next time, until it is processed successfully.

Tweets Ingestion Handler
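The commit-after-processing behaviour described above can be illustrated without any Kafka dependency. This is a simplified, in-memory sketch and not the project's handler: the class and method names are hypothetical, and a real consumer would acknowledge through the Kafka client or the Spring Cloud Stream binder rather than incrementing a counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// In-memory model of one partition read by one consumer with manual offset commits.
final class CommitAfterProcessingSketch {

    private final List<String> partition;
    private int committedOffset = 0;

    CommitAfterProcessingSketch(List<String> records) {
        this.partition = new ArrayList<>(records);
    }

    /**
     * Hands the record at the committed offset to the handler.
     * The offset advances only if the handler returns normally.
     * Returns true when a record was processed and committed.
     */
    boolean pollOnce(Consumer<String> handler) {
        if (committedOffset >= partition.size()) {
            return false; // nothing left to read
        }
        String record = partition.get(committedOffset);
        try {
            handler.accept(record);
        } catch (RuntimeException e) {
            return false; // no commit: the same record is delivered on the next poll
        }
        committedOffset++; // commit only after successful processing
        return true;
    }

    int committedOffset() {
        return committedOffset;
    }

    public static void main(String[] args) {
        CommitAfterProcessingSketch consumer =
                new CommitAfterProcessingSketch(Arrays.asList("tweet-1", "tweet-2"));
        boolean[] failedOnce = {false};
        Consumer<String> analysis = tweet -> {
            if (!failedOnce[0]) {
                failedOnce[0] = true;
                throw new IllegalStateException("analysis failed");
            }
            System.out.println("analyzed " + tweet);
        };
        consumer.pollOnce(analysis); // fails, offset stays at 0
        consumer.pollOnce(analysis); // tweet-1 is delivered again and succeeds
        System.out.println("committed offset: " + consumer.committedOffset());
    }
}
```

The trade-off of this at-least-once style is that a tweet may be handled more than once, for example when the process dies after the analysis but before the commit, so downstream steps should tolerate duplicates.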

Covid Tweets Collector

A Spring Boot web application in Java that listens for new messages on the processed-tweets topic in Kafka and saves them in Elasticsearch.

This handler only calls an Elasticsearch repository to save the information and, once that operation has completed, commits the consumer offsets.

Tweet Processed Handler

All processed tweets are indexed as documents in an index called tweets_processed. Thanks to the Spring Data Elasticsearch project, we can easily define the structure and fields of the index by annotating the fields of the entity class.

Tweet Entity

Covid Tweets API

A Spring Boot web application in Java that exposes the processed tweets for retrieval and viewing through a REST API or STOMP over WebSocket.

I used the Springdoc OpenAPI UI project, which implements version 3 of the OpenAPI Specification, to provide an interface for exploring and interacting with the service's REST API.

This API offers interoperability and the possibility for other applications to retrieve the processed tweets.

Visualization with Kibana

Kibana is a very powerful and easy-to-use tool. It lets you view the information indexed in Elasticsearch in many different ways, such as pie charts, tag clouds and horizontal bar charts.

All these visualizations allow us to exploit and take advantage of all the information processed previously.

Before creating more advanced visualizations, you can browse the Elasticsearch index and check the indexed information in JSON format.

Search criteria based on KQL or Lucene can be specified to filter the information.

We can examine an individual document in JSON format as in the following example.

To create more complex visualizations we can filter documents by time range (this week, last day, and so on) or by any other filter we need. We then create a bucket that defines the information to be represented. We can also configure a series of visual options for the chart.

The visualization that I show below represents the percentage of tweets with NEGATIVE, POSITIVE or NEUTRAL sentiment.

The same information can be represented in other forms, as you can see in the following graph.
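Behind both charts is the same computation: group the tweets by sentiment label and express each group as a share of the total. Kibana does this with an aggregation over the index; the plain Java sketch below only mirrors the arithmetic on an in-memory list and is not how the project computes it. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Mirrors the arithmetic behind a "share per sentiment" chart; illustrative only.
final class SentimentShareSketch {

    /** Returns the percentage of tweets per sentiment label, sorted by label. */
    static Map<String, Double> percentages(List<String> sentiments) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentiment : sentiments) {
            counts.merge(sentiment, 1, Integer::sum);
        }
        Map<String, Double> shares = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            shares.put(entry.getKey(), 100.0 * entry.getValue() / sentiments.size());
        }
        return shares; // empty when there are no tweets
    }

    public static void main(String[] args) {
        List<String> sentiments = Arrays.asList("NEGATIVE", "NEGATIVE", "POSITIVE", "NEUTRAL");
        System.out.println(percentages(sentiments));
    }
}
```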

This is another visualization example, where we filter the tweets with negative sentiment and create a cloud with the most frequent terms in those tweets.
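Conceptually, the tag cloud ranks terms by how often they occur in the filtered tweets. The sketch below reproduces that ranking in plain Java over a few made-up texts; the tokenization rule (split on anything that is not a letter, digit or #) is an assumption for the example and will not match the analyzer Elasticsearch applies to the real index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative term ranking; the tokenizer here is an assumption, not Elasticsearch's analyzer.
final class NegativeTermCloudSketch {

    /** Returns the n most frequent terms, most frequent first, ties broken alphabetically. */
    static List<String> topTerms(List<String> negativeTexts, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : negativeTexts) {
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}#]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries.subList(0, Math.min(n, entries.size()))) {
            top.add(entry.getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up texts standing in for tweets already filtered to NEGATIVE sentiment.
        List<String> texts = Arrays.asList(
                "lockdown again", "no more lockdown", "lockdown fatigue again");
        System.out.println(topTerms(texts, 2));
    }
}
```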

Kibana can also create dashboards, which combine several previously created visualizations to give a more global and complete view.

Kibana Dashboard with all visualizations created previously.

Technologies used in detail

In this section I list all the libraries and tools used to develop the project:

Spring Boot 2.3.2 / Apache Maven 3.6.3.
Spring Cloud Stream (to build highly scalable event-driven applications connected with shared messaging systems).
Spring Cloud Starter Stream Kafka.
Lombok.
Twitter4J Stream.
MapStruct.
Elasticsearch OSS 7.6.2.
Spring Boot Starter Data Elasticsearch.
Kibana OSS 7.6.2.
Spring Boot Starter Web.
Springdoc OpenAPI UI.
Spring Boot Starter WebSocket.
Stanford CoreNLP.

That's it. I have really enjoyed developing and documenting this little project. Thanks for reading; I hope this is the first of many. If you are interested in seeing the complete code, here is the link to the public repository.

github.com/sergio11/covid_tweets_etl_architecture