[Back-end Tech] Research on Amazon Neptune Database and Data Visualization

Amazon Neptune, a graph database, and data visualization are the topics we researched in a cross-team collaboration at our company.

Introduction

Every 3 months, our company holds a special scrum sprint called Innovative Idea Research. In this sprint, software engineers can choose the topics they are interested in, do research, and show the results on the final demo day.

The company introduced a new rule that allows us to collaborate with members of other teams on the same research topic, to improve cross-team communication and increase research diversity.

I saw many different cards on the board, and one card had an interesting name on it.

The name Neptune looked very cool, and "Visualization" suggested that some front-end skills would be involved. I have been learning front-end since April this year and I am interested in it.

So I put my name on the card, and we began the collaboration with 1 member from another team and 1 member from my own team whom I had never worked with before.

I liked this experience. It felt like leaving my comfort zone, and I gained new experience with cross-team collaboration and a new technology. I would like to share some content from our original research notes here.

Objective

Our research topic is Research on Amazon Neptune and Data Visualization.

The member from the other team had been using this AWS service for a long while, but my teammate and I had zero experience with it. We needed to learn both graph databases and the Amazon Neptune service in 2-3 weeks.

The table of contents is as follows:

Learn Basic Knowledge of Graph Database
- NoSQL
- Nodes, Edges, Label, Properties
- Query Language: Gremlin
- Relational Database(RDB) VS Graph Database(Graph DB)
Implementation Part
- Create Amazon Neptune Database + Jupyter Notebook
- Use Gremlin to do simple query
- Data Visualization
  - aws/graph-notebook
  - G.V()
  - Gremlify

Are you ready?

Basic Knowledge

The following concepts will give you a rough understanding of graph databases.

NoSQL
Nodes, Edges, Label, Properties
Query Language
Relational Database(RDB) VS Graph Database(Graph DB)

NoSQL

A graph database is a NoSQL database. It differs from relational databases (RDBMS) such as MySQL, PostgreSQL, and SQLite, and not every service is a good fit for a relational database.

| | Type | Example |
| ----- | -------------- | ----------------- |
| NoSQL | Key-Value | Redis |
| | Document-based | MongoDB |
| | Column-based | BigTable |
| | Graph-based | Neptune, Neo4j |
| RDBMS | Table-based | MySQL, PostgreSQL |

From the table above, we can see that a graph DB is one kind of NoSQL DB.

Nodes, Edges, Label and Property

Next, we introduce how a graph DB stores data. If the key terms in an RDB are Table, Column, Primary Key, and Foreign Key, then in a graph DB they are Nodes, Edges, Label, and Property.

| Object | Description |
| ---------- | ----------- |
| Nodes | Represent entities such as a company, a person, or an account. A node is roughly the equivalent of a record or row in a relational database. |
| Edges | The relationships or connections between nodes. This is the key concept in a graph database. |
| Label | An attribute that groups similar nodes together. For example, Google, Tesla, and Apple are different nodes that all belong to the label Company, so you can use g.V().hasLabel("Company") to get all company nodes. |
| Properties | The attributes that nodes or edges have. For example, a company node might have employee count, founded date, or location properties. You can also add properties to an edge. |
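
To make these four terms concrete, here is a minimal Python sketch of the same idea held in memory. It is only an illustration: the `Node` and `Edge` classes and the `has_label` helper are hypothetical names I made up, not part of Neptune or Gremlin, and the sample data reuses the ids from the cheat sheet later in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A vertex: an id, one label, and free-form properties."""
    id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, labelled connection between two node ids."""
    label: str
    out_id: str  # id of the node the edge starts from
    in_id: str   # id of the node the edge points to
    properties: dict = field(default_factory=dict)


nodes = [
    Node("COM-0001", "Company", {"name": "Innova"}),
    Node("COM-0002", "Company", {"name": "Google"}),
    Node("EMP-0001", "Employee", {"name": "Mina"}),
]
edges = [Edge("belong", "EMP-0001", "COM-0001", {"checkIn": "2022-02-14"})]


def has_label(all_nodes, label):
    """Rough analogue of g.V().hasLabel(label): keep nodes with that label."""
    return [n for n in all_nodes if n.label == label]


company_names = [n.properties["name"] for n in has_label(nodes, "Company")]
print(company_names)
```

The point of the sketch is that the edge is a piece of data in its own right, with its own label and properties, instead of a foreign key column on one of the nodes.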

A graph database is more suitable for data with highly complicated relationships.

Query Language: Gremlin

A query language relates to a database the way SQL relates to MySQL. For graph DBs, there are many query languages, such as Gremlin, Cypher, GQL, SPARQL, PGQL, and GraphQL.

Sample:

| Query Language | Statement |
| -------------- | -------------------------- |
| SQL | SELECT * FROM table_name |
| Gremlin | g.V().hasLabel(label_name) |

In this research, we learned Gremlin and used it with Amazon Neptune and Jupyter Notebook.

The following is a cheat sheet:

%%gremlin

// List all nodes
g.V()

// List nodes, limited to 25 records
g.V().limit(25)

// List all edges
g.E()

// List all nodes with their property values
g.V().valueMap()

// List all nodes with their id, label, and property values
g.V().valueMap(true)

// List all edges with their id, label, and property values
g.E().valueMap(true)

// Delete all nodes <-- This dangerous query has to be taught haha, use it carefully.
g.V().drop().iterate()

// Delete all edges
g.E().drop().iterate()

// Add one node (the step is addV on g, not g.V().add)
g.addV("Company")

// Add one node with its properties
g.addV("Company").property(id, "COM-0001").property("name", "Innova").as("COM-0001")
g.addV("Employee").property(id, "EMP-0001").property("name", "Mina").as("EMP-0001")

// Add an edge with a property (fill in the node ids); addE takes the name of the relationship: employee -> company
g.V("EMP-0001").addE("belong").to(__.V("COM-0001")).property("checkIn", "2022-02-14")

// Add several nodes in one cell: every query except the last one needs next()
g.addV("Company").property(id, "COM-0001").property("name", "Innova").as("COM-0001").next()
g.addV("Company").property(id, "COM-0002").property("name", "Google").as("COM-0002").next()
g.addV("Company").property(id, "COM-0003").property("name", "Tesla").as("COM-0003")

// Query nodes with the same label
g.V().hasLabel("Company")

// Query a specific node/edge by id
g.V("COM-0001")
g.E("<id>")

Relational Database(RDB) VS Graph Database(Graph DB)

| | Graph Database | Relational Database |
| ------------- | -------------- | ------------------- |
| Format | Nodes, Edges | Tables with Rows and Columns |
| Relationship | Considered data, represented by edges between nodes | Related across tables, established using foreign keys between tables. |
| Complex Query | Run quickly and do not require joins | Require complex joins between tables |
| Top Use Case | Relationship-heavy use cases, including fraud detection and recommendation engines. Highly-connected data that comes with an intrinsic need for relationship analysis. Data model is inconsistent and demands frequent changes. | Transaction-focused use cases, including online transactions and accounting. |
| Example | | |
| Pros | Additional attributes can be added at any point. Not all entities need to have all the attributes. The attribute types are not strictly defined. | It is faster when handling huge numbers of records because the structure of the data is known ahead of time. A relational database stores data in tables with rows and columns and uses JOIN for fast querying. |
| Cons | Not a good fit when you run frequent table scans and searches for data that fits defined categories. When you have a known key and need to retrieve the data associated with it, a graph database is not particularly useful. If the entities in your model have very large attributes like BLOBs, CLOBs, long texts... then graph databases aren't the best solution. | Relational databases are slower for large datasets. Relational databases have poor performance for deep analytics. |

Reference: https://www.techtarget.com/searchdatamanagement/feature/Graph-database-vs-relational-database-Key-differences
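
The "Relationship" row of the table can be sketched in a few lines of Python. This is only my own illustration under simple assumptions: the relational side uses an in-memory SQLite database from the standard library, and the graph side is a plain dictionary standing in for a real graph DB. The table names, the `out_edges` dictionary, and the sample data are made up for this example.

```python
import sqlite3

# Relational side: the relationship lives in a foreign key and is
# resolved at query time with a JOIN between two tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE company (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        id TEXT PRIMARY KEY,
        name TEXT,
        company_id TEXT REFERENCES company(id)
    );
    INSERT INTO company VALUES ('COM-0001', 'Innova');
    INSERT INTO employee VALUES ('EMP-0001', 'Mina', 'COM-0001');
    """
)
sql_result = conn.execute(
    "SELECT c.name FROM employee e "
    "JOIN company c ON e.company_id = c.id "
    "WHERE e.id = ?",
    ("EMP-0001",),
).fetchall()
conn.close()

# Graph side: the relationship is stored as data (an edge), so the
# lookup just follows the edge from the starting node.
node_names = {"COM-0001": "Innova", "EMP-0001": "Mina"}
out_edges = {"EMP-0001": [("belong", "COM-0001")]}
graph_result = [
    node_names[target]
    for label, target in out_edges.get("EMP-0001", [])
    if label == "belong"
]

print(sql_result)
print(graph_result)
```

Both sides answer the same question, "which company does Mina belong to?". With only one hop the difference is small, but every extra hop adds another JOIN on the relational side, while the graph side just follows one more edge.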

Implementation Part

After learning the concepts, we need some practice. There are 3 parts:

Create Amazon Neptune Database + Jupyter Notebook
Use Gremlin to operate data
Data Visualization

Let's go!

Create Amazon Neptune Database + Jupyter Notebook

I'm not sure everyone will want to use Amazon Neptune, because for me,

it is a little expensive :)

At the beginning, I thought that if I created a new AWS account and used the Free Tier, I would not be charged, but I was still charged 40 USD over 2 months (actually I only played with it for 2 weeks, but they spanned November and December).

Strangely, the second time I only created the Neptune database and had not even used it for a few minutes, and I got charged again haha.

(40 USD is expensive in my country.)

I used the AWS console to create Neptune, not the AWS CLI or Terraform. So let's do it.

Enter your cute AWS dashboard and type Neptune in the search bar to open Neptune.

Click the three dots on the left side.

Click Databases, then click Create database.

In the settings, be careful: the free plan for Amazon Neptune applies to the Engine option Provisioned with the db.t3.medium instance.

Please check that your region supports Neptune instances.

If the creation fails, check the details in CloudWatch.

If you get the same error as I did, it might mean your region doesn't support the db.t3.medium instance. In that case, you can create the DB in region us-east-1.

In the Create database settings, choose Provisioned for Engine options. I tried Serverless, and it is not in the scope of the free plan.

For Templates, choose Development and Testing. For DB instance size, choose Burstable classes: db.t3.medium.

Connectivity is the more complex part. It involves permissions and requires some research on VPC. I used the root user, so I had all permissions. I suggest asking MIS or DevOps for help to create the Neptune database directly.

Notebook configuration: we need a Jupyter Notebook to connect to Neptune and query data, so turn on Create notebook.

After you submit, it takes some time to create the DB and the notebook. You can find the Jupyter Notebook in the Notebooks section.

Why is the screenshot I provided serverless? Because I told my company that I was using my own account for the research and getting charged xD, they provided this Neptune for me to play with. The operator created the Neptune for me directly, because I don't have permission.

Remember to stop the DB and the notebook when you are not using them.

Use Gremlin to do simple operation

Next, please open the notebook. We will use Gremlin to put in some data.

First, let us check the database.

%%gremlin
// In the AWS Jupyter Notebook, %%gremlin must be the first line of the cell. It marks the cell to be run as a Gremlin query.

// List all nodes
g.V()

I had already put some data in mine. A database that was just created should be empty.

Let's add some data: we will add 2 nodes.

%%gremlin

g.addV("People").
     property(id,"PPL-0001").
     property("name","Amy").
     property("age",29).
     property("gender","F").
     property("job","Engineer").
     as("PPL-0001").next()

g.addV("People").
     property(id,"PPL-0002").
     property("name","Jason").
     property("age", 31).
     property("gender","M").
     property("job","Salesmen").
     as("PPL-0002")

Now we have 2 nodes (2 people), Amy and Jason, that belong to the label People, with the properties id, name, age, gender, and job.

Then use g.V() to list all nodes. You can see that the ids PPL-0001 and PPL-0002 are the nodes we just inserted.

If you want to see the properties, run:

g.V().valueMap(true)

This lists all nodes with their properties.
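
One detail that is easy to miss: valueMap() wraps each property value in a list, because a property can hold more than one value, while the id and label added by valueMap(true) are plain values. The small Python sketch below shows one way to flatten such a row. The `amy_row` dictionary is typed in by hand and uses the plain strings "id" and "label" as keys, which is an assumption for illustration (a real client may return special token objects for those two keys), and `flatten_value_map` is a hypothetical helper name.

```python
def flatten_value_map(value_map):
    """Unwrap single-item lists so each property maps to a plain value."""
    flat = {}
    for key, value in value_map.items():
        if isinstance(value, list) and len(value) == 1:
            flat[key] = value[0]
        else:
            # Keep ids, labels, and multi-valued properties unchanged.
            flat[key] = value
    return flat


# Assumed shape of one valueMap(true) row, written by hand for illustration.
amy_row = {
    "id": "PPL-0001",
    "label": "People",
    "name": ["Amy"],
    "age": [29],
    "gender": ["F"],
    "job": ["Engineer"],
}

amy = flatten_value_map(amy_row)
print(amy)
```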

Next, we are going to add a relationship by using an edge.

I want to express that Amy has liked Jason since 2022/8.

%%gremlin

// PPL-0001 Amy add one Edge like -> PPL-0002 Jason
g.V('PPL-0001').addE('like').to(__.V('PPL-0002'))

We can see the Edge now.

So now we have 2 Nodes and 1 Edge.

PPL-0001
PPL-0002
PPL-0001-like->PPL-0002

But right now all the data is presented as text, which is hard to read, so next we introduce data visualization.

Data Visualization

In our research, we looked at three methods for visualizing graph data. Two of them are tools that don't really need a DB to be installed, so if you just want to play with graph DBs and Gremlin, those 2 methods are introduced here as well.

aws/graph-notebook

aws/graph-notebook is a Python package used in Jupyter Notebook, and it is what we used in our implementation part. It is already included in the Jupyter Notebook that was created with Neptune.

We found one query that can show the data as a graph.

%%gremlin -p v,ine,outv,oute,inv
g.V().or(hasLabel("<Label1>"), hasLabel("<Label2>"))
     .inE().outV()
     .outE().inV().path().by(valueMap(true))

Use this query and fill in the labels of the data you want to show. You can then interact with the graph.
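
To get a feeling for what this query walks through, here is a rough Python imitation of the same steps on the two People nodes and the one "like" edge from above. It is a sketch under my own assumptions, not how Neptune or graph-notebook work internally: the data is copied by hand, and `walk_paths` is a hypothetical helper. As far as I understand, the `-p v,ine,outv,oute,inv` hint tells the notebook what each element of the path is (vertex, incoming edge, its source vertex, outgoing edge, its target vertex).

```python
# Hand-copied stand-in for the data inserted earlier in this article.
node_labels = {"PPL-0001": "People", "PPL-0002": "People"}
edges = [
    {"label": "like", "out": "PPL-0001", "in": "PPL-0002"},
]


def walk_paths(wanted_labels):
    """Imitate V -> inE -> outV -> outE -> inV and record every step."""
    paths = []
    for v, label in node_labels.items():
        if label not in wanted_labels:
            continue
        for in_e in (e for e in edges if e["in"] == v):            # inE()
            out_v = in_e["out"]                                     # outV()
            for out_e in (e for e in edges if e["out"] == out_v):   # outE()
                in_v = out_e["in"]                                  # inV()
                paths.append(
                    [v, in_e["label"], out_v, out_e["label"], in_v]
                )
    return paths


paths = walk_paths({"People"})
for path in paths:
    print(" -> ".join(path))
```

Amy has no incoming edge, so no path starts from her. Jason has the incoming "like" edge, so the single path goes from Jason back to Amy and then forward to Jason again. That is enough for the graph view to draw both nodes and the edge between them.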

So, this is the data we just inserted. You can see the properties by pressing the three-line icon.

If you want to find keywords in any nodes or edges, you can use the search bar, as the screenshot shows.

I didn't put much data in for the description above, so there is not much to show. Let me show some mock data from our research instead. It is all mock data and does not involve our company.

We use the same query, but with different label names for the data I inserted before.

Well, I think the naming of the edges is not good. It looks a little messy haha.

Searching for keywords in the search bar shows the related nodes and edges.

So that's all the content about using aws/graph-notebook.

G.V()

Let's introduce G.V(), a third-party application referenced by AWS that provides data visualization.

It is downloaded and installed on your local computer, and it has versions for Windows/Mac/Linux.

This software is similar to MySQL Workbench. It lets us create a playground without any connection to a DB, and then we can play with Gremlin and a graph DB inside this playground.

There is always a button that shows the graph when you run a query, so it is very easy and convenient.

However, we stopped researching it for the following reasons:

Connecting to a DB requires the Pro version, which means we would need to pay. We didn't have the budget to buy it, and we originally wanted to connect to Neptune, not only play in the playground.
I found that after addE, all the existing nodes disappeared. I'm not sure whether this is a bug, and looking into it would have taken more research time. Due to the time limitation, we stopped the research.

Gremlify

The final tool, Gremlify, is a website. There is nothing to install and it is free, so if you don't need to install a DB and just want to try things out, you can use it. I haven't used it myself, but another member of our group demoed it, and it seems very convenient.

I don't have a screenshot to show. This website lets you create nodes and edges directly with the mouse, without writing queries, which I think is more convenient. It also shows a visualization graph when you run a query.

If anyone uses it and likes it, please leave me a comment, thank you!

Alright, that's all of the research notes from our cross-team collaboration in Innovative Idea Research. This time I collaborated with 1 member from another team and 1 member from my own team whom I had never collaborated with before (she is a QE, and we did pair programming). The technology was totally new for us, and this was a very good experience for me, so I wanted to write it down.

Especially because, while collaborating, we learned that the member from the other team would leave the company after 1 month. This would be our first and also final collaboration, so we wanted to seize the opportunity to perform well, do a great job, create good memories, and present a good result at the end.

This is my first time writing an article in English in a long while. I believe my English has improved a lot since I started working at an Indian company located in Taiwan, collaborating with American business teams.

If you like this article, or if you have questions or find errors in it, please leave a comment or send me an email, thanks :).

Sign @MinaYu.
