Tuesday, November 24, 2015

Graph Databases


Databases are an essential part of most IT systems, as they are the containers of the most valuable “piece of the puzzle”: the Data.

Relational Databases (RDBMS) have been playing a major role in Data Architecture, being a sort of standard in the software industry; products such as Oracle, Microsoft SQL Server, and MySQL have been used for almost any software or service worldwide for many years. They are by now considered mature technologies, tested and refined over the years to cope with constantly challenging data requirements and new software paradigms coming up all the time.

However, in recent years, Social Media Network systems such as Facebook, Twitter, and LinkedIn (to name a few) have introduced a new concept that represents an incredible challenge to traditional Databases: Big Data.

“Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.” [Wikipedia]


And:


“An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people, all from different sources (e.g. Web, sales, customer contact center, social media, mobile data and so on). The data is typically loosely structured data that is often incomplete and inaccessible.” [Webopedia]


This is why NoSQL Databases have emerged in the software industry over the years, with some famous names such as MongoDB (you can read my previous article on Geographically Distributed Replica Set with MongoDB and Azure), Cassandra, RocksDB, and RavenDB; these are just a few examples of Key-Value, Wide-Column, and Document stores, which can cope with Big Data better than traditional RDBMS.

Other technologies such as Redis, or Hadoop with MapReduce, are also used worldwide to store and process Big Data.

Document DBs are known to be very good at searching data among billions or trillions of records, with little or no performance degradation. At the same time, they have one big limitation: they do not play well with too many relationships between entities.


This is where Graph Databases come into the picture:


“In mathematics, and more specifically in graph theory, a graph is a representation of a set of objects where some pairs of objects are connected by links.” [Wikipedia]

Graphs are currently the best way to represent data in Social Networks and similar systems, where the relationships between people and their messages, posts, likes, comments, and all other types of social media interactions exist in very large numbers.

A few products on the market nowadays offer graph data storage, and among them Neo4j seems to be the most mature at the moment. It has its own query language, Cypher, which, similarly to SQL, allows searching the database for specific data, with an approach optimized for graph traversals such as finding the Shortest Path.
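To make the Shortest Path idea concrete, here is a minimal sketch in plain Python using a breadth-first search over an unweighted graph. It is not how Neo4j implements it internally (Cypher offers a built-in `shortestPath` function for this); the `friends` data and the names in it are made up for illustration.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return the shortest path between two nodes of an unweighted graph.

    `graph` is an adjacency dict: node -> list of neighbouring nodes.
    Returns a list of nodes, or None when no path exists.
    """
    if start == goal:
        return [start]
    previous = {start: None}          # node -> node we reached it from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in previous:  # already visited
                continue
            previous[neighbour] = node
            if neighbour == goal:
                # Walk back from the goal to the start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical friendship graph, for illustration only.
friends = {
    "Carson": ["Linda", "Omar"],
    "Linda": ["Carson", "Priya"],
    "Omar": ["Carson", "Priya"],
    "Priya": ["Linda", "Omar", "Wei"],
    "Wei": ["Priya"],
}

path = shortest_path(friends, "Carson", "Wei")
```

Breadth-first search visits nodes in order of distance from the start, so the first time it reaches the goal it has found a path with the fewest hops.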


Neo4j stores data in collections of Nodes (roughly the equivalent of Tables in RDBMS) and Links, which Neo4j calls relationships (roughly the equivalent of Foreign Key Relations in RDBMS). The first main difference is that Links are completely decoupled from Nodes, so they can be created or deleted at any time without any impact on the data stored in the Nodes (try removing a FK between two tables full of data on a RDBMS…). This is ideal for systems that are under constant evolution (such as, but not limited to, Social Networks).
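The decoupling of Nodes and Links can be pictured with a small in-memory sketch. This is only an illustration in plain Python, not Neo4j's storage model; the `Graph` class, the node ids, and the `FOLLOWS` link type are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy graph store: nodes and links live in separate containers."""
    nodes: dict = field(default_factory=dict)  # node id -> properties
    links: set = field(default_factory=set)    # (source id, link type, target id)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def link(self, source, link_type, target):
        self.links.add((source, link_type, target))

    def unlink(self, source, link_type, target):
        # Removing a link never touches the node data.
        self.links.discard((source, link_type, target))


g = Graph()
g.add_node("carson", name="Carson")
g.add_node("linda", name="Linda")

g.link("carson", "FOLLOWS", "linda")    # a new kind of relation, added on the fly
g.unlink("carson", "FOLLOWS", "linda")  # and removed again, nodes stay untouched
```

Because links are stored apart from the nodes, adding or dropping a relation type is just a change to the `links` container, with no schema migration on the node data.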


Another important concept of Neo4j is that Links are semantic: in the example of a private message conversation between two users, each user can either Send or Receive a message, hence having two different types of relations to messages.


These semantics cannot be easily represented in a traditional RDBMS, where they would require some coding, while they are easy to achieve out of the box in a Graph DB:



The image above shows the Neo4j User Interface (a local browser app using the Neo4j REST API), running the Cypher query “MATCH (n) RETURN n”, which returns all Nodes and Links in the system. The output is displayed as a Data Graph, rendered by default as an SVG file, where you can easily lay out and rearrange all Nodes and Links for the best display; it can also be displayed as tabular data or as the plain JSON response.


Here you can also notice the semantic Links, which also make it easy to represent the concept of time within the conversation (Discussion in the graph):

  1. Person Carson sends Message 1 to Person Linda
  2. Person Linda receives Message 1 from Person Carson
  3. Person Linda sends Message 2 to Person Carson
  4. Person Carson receives Message 2 from Person Linda
  5. Person Carson sends Message 3 to Person Linda
  6. Person Linda receives Message 3 from Person Carson


No extra code is needed to represent this logical sequence, unlike with traditional RDBMS.
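As an illustration of how typed Links carry this sequence, the conversation can be written as plain (person, link type, message) triples and read back in order. This is a sketch in Python rather than Cypher; the `SENDS` and `RECEIVES` type names and the assumption that messages are ordered by their number are mine, not taken from the graph shown above.

```python
# Each link connects a person to a message and says how they relate to it.
links = [
    ("Carson", "SENDS", 1), ("Linda", "RECEIVES", 1),
    ("Linda", "SENDS", 2), ("Carson", "RECEIVES", 2),
    ("Carson", "SENDS", 3), ("Linda", "RECEIVES", 3),
]


def describe(links):
    """Rebuild the conversation from the typed links alone."""
    sender = {msg: person for person, kind, msg in links if kind == "SENDS"}
    receiver = {msg: person for person, kind, msg in links if kind == "RECEIVES"}
    lines = []
    for msg in sorted(sender):  # assumption: message numbers give the order
        lines.append(
            f"Person {sender[msg]} sends Message {msg} to Person {receiver[msg]}"
        )
        lines.append(
            f"Person {receiver[msg]} receives Message {msg} from Person {sender[msg]}"
        )
    return lines


conversation = describe(links)
```

The link type alone tells who sent and who received each message, so no separate "direction" column or join table is involved.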


Just as a side note, systems like Facebook and Twitter use their own custom Graph solutions, built on top of the MySQL RDBMS. The most plausible reason is that they started with RDBMS technology years ago, when Graph DBs did not exist yet, invested lots of time, money, and effort in it, and slowly transitioned to a custom Graph solution over time (see TAO and FlockDB).


And if you are wondering about performance numbers, a lot of benchmarking tests [1] have been done already, and it appears that Graph Databases are indeed a strong technology, here to stay in the long run.


So, are Graph Databases going to replace RDBMS altogether?


Not really.


RDBMS will still be in place for a long time; however, they will no longer be the only way to store data in an Enterprise, as the IT industry is moving toward Polyglot Persistence, where different data sources and technologies will be used individually to address specific needs, and together to form the Enterprise Big Data.



[1] Benchmarking Graph Databases:

 

Wednesday, October 28, 2015

Geographically Distributed Replica Set with MongoDB and Azure


The company I work for is undergoing incredible growth, opening stores worldwide on a monthly basis and increasing its IT workforce and infrastructure accordingly to cope with it.

One important project recently implemented in the Enterprise Architecture of the company has been the Geographically Distributed Replica Set with MongoDB and Azure (ReadDB as the “simple” name).

Before this project, the main database used for this purpose was a SQL Server located in a data center in Amsterdam; for obvious reasons, stores in the US, and even more so in Asia, experienced latency when retrieving data, which resulted in a non-optimal user experience.

Since the main need for such stores is to present data to the customer (orders, customer details, etc.), a ReadOnly database seemed to be the best solution.

Given the popularity that NoSQL databases have gained recently, and after some research, we opted to use MongoDB as the ReadDB, with its Replica Set feature as the way to distribute data geographically.

We also wanted to leverage the scalability and configuration features of Microsoft Azure as the underlying Cloud infrastructure.

So we started with a simple setup of the MongoDB Replica Set within one Azure Region: here we were able to successfully test the replication between the Primary and Secondary nodes of MongoDB.

The installation and configuration of a MongoDB Replica Set is relatively easy, as described in the official documentation.

In order to facilitate the operations for this test, the initial Azure Virtual Machine containing the MongoDB installation was prepared using SysPrep, and then reused within the same Region (the SysPrep image hardcodes the Azure Region information, so it cannot be used for different Regions).

When it comes to geographically expanding the Replica Set to other Azure Regions, there is quite some infrastructure to be created first on Azure:


  1. Create an Azure Virtual Network in the West EU Azure Region.
  2. Create an Azure Local Network in the West EU Azure Region, with the same address space as the Virtual Network, and configure the VPN Gateway.
  3. Create a second Azure Virtual Network in the West US Azure Region.
  4. Create a second Azure Local Network in the West US Azure Region, with the same address space as the Virtual Network, and configure the VPN Gateway.
  5. Create a third Azure Virtual Network in the East Asia Azure Region.
  6. Create a third Azure Local Network in the East Asia Azure Region, with the same address space as the Virtual Network, and configure the VPN Gateway.
  7. Follow this tutorial to create a VNet-to-VNet connection between the West EU Virtual Network and the West US Virtual Network.
  8. Follow this tutorial to create another VNet-to-VNet connection between the East Asia Virtual Network and the West EU Virtual Network.
  9. Create an Azure Virtual Machine image template to be reused for all MongoDB Nodes in the West EU Replica Set.
  10. Create a Primary Node Virtual Machine from the image template, and create an Azure Availability Set in the West EU Azure Region.
  11. Create an Arbiter Node Virtual Machine and add it to the set in the West EU Azure Region.
  12. Create a Secondary Node Virtual Machine and add it to the set in the West EU Azure Region.
  13. Configure and test the Replica Set.
  14. Create an Azure Virtual Machine image template to be reused for all MongoDB Nodes in the West US Replica Set.
  15. Create a second Secondary Node Virtual Machine in the West US Azure Region.
  16. Create an Azure Virtual Machine image template to be reused for all MongoDB Nodes in the East Asia Replica Set.
  17. Create a third Secondary Node Virtual Machine in the East Asia Azure Region.
  18. Configure and test the added Nodes in the Replica Set.
  19. Relax and enjoy the Replication.
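For the "configure the Replica Set" steps, the configuration document MongoDB expects can be sketched as below. The replica set name `readdb`, the host names, and the role labels are hypothetical, and giving the remote secondaries `priority: 0` (so they never become Primary) is an assumption for a read-only setup, not something stated in the steps above.

```python
def build_replica_set_config(name, members):
    """Build a replica set configuration document from (host, role) pairs."""
    config = {"_id": name, "members": []}
    for index, (host, role) in enumerate(members):
        member = {"_id": index, "host": host}
        if role == "arbiter":
            member["arbiterOnly"] = True  # votes in elections, holds no data
        elif role == "remote-secondary":
            member["priority"] = 0        # assumption: never elected Primary
        config["members"].append(member)
    return config


# Hypothetical hosts: three nodes in West EU, one in West US, one in East Asia.
config = build_replica_set_config("readdb", [
    ("mongo-weu-1.example.internal:27017", "primary-candidate"),
    ("mongo-weu-arb.example.internal:27017", "arbiter"),
    ("mongo-weu-2.example.internal:27017", "secondary"),
    ("mongo-wus-1.example.internal:27017", "remote-secondary"),
    ("mongo-eas-1.example.internal:27017", "remote-secondary"),
])

# With a driver such as PyMongo, this document could be passed to the
# replSetInitiate admin command, for example:
#   client.admin.command("replSetInitiate", config)
```

Five voting members keep the count odd, so a majority can still be reached if one Region becomes unreachable.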


Once the entire Azure infrastructure was in place, we simply started inserting data into the Primary node and watched the replication synchronize all the Secondary nodes:



We then performed some tests (nearly 1 million rows were inserted into the Primary node, and for each row a record Count was executed on each Secondary node), and we were very impressed with the performance of the MongoDB Replica Set: on all Secondary nodes the data was replicated in near real time, with (sometimes) a gap of just milliseconds!

So besides the known advantages of using a NoSQL database for this ReadOnly solution, the MongoDB Replica Set proved to be a great choice for Geographically Distributed Data, allowing smooth, easy-to-set-up, near-real-time synchronization.
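The shape of such a test can be sketched as a small harness: insert a row, then poll a count on every Secondary until it catches up, and keep track of the worst wait. This is not the script we used; the function name and parameters are hypothetical, and the in-memory lists below merely stand in for real database connections.

```python
import time


def measure_replication_lag(insert_row, count_on_secondaries, rows, timeout=5.0):
    """Insert rows one by one and wait for every secondary to catch up.

    `insert_row` writes one row to the primary; each callable in
    `count_on_secondaries` returns the row count seen by one secondary.
    Returns the worst observed catch-up time in seconds.
    """
    worst = 0.0
    for expected, row in enumerate(rows, start=1):
        insert_row(row)
        started = time.perf_counter()
        for count in count_on_secondaries:
            while count() < expected:
                if time.perf_counter() - started > timeout:
                    raise TimeoutError(f"secondary still behind after row {expected}")
                time.sleep(0.001)  # poll again in a millisecond
        worst = max(worst, time.perf_counter() - started)
    return worst


# In-memory stand-in: both "secondaries" share the primary's list, so
# replication is instantaneous. Real code would query actual nodes instead.
primary = []
secondaries = [primary, primary]
worst = measure_replication_lag(
    primary.append,
    [lambda s=s: len(s) for s in secondaries],
    range(1000),
)
```

Against a real Replica Set, `insert_row` and the count callables would wrap driver calls to the Primary and to each Secondary respectively.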
