NoSQL databases have been around for several years now and have become the preferred choice of data storage for managing semi-structured and unstructured data.
These databases offer a lot of advantages in terms of linear scalability and better performance for both data writes and reads.
With the emergence of time series data being generated from Internet of Things (IoT) devices and sensors, it’s important to take a look at the current state of NoSQL databases and learn about what’s happening now and what’s coming up in the future for these databases.
InfoQ spoke with four panelists from different NoSQL database organizations to get different perspectives on the current state of NoSQL databases.
- Seema Jethani – was until recently Director of Product Management at Basho Technologies
- Perry Krug – Principal Solutions Architect and customer advocate at Couchbase
- Dr. Jim Webber – Chief Scientist at Neo Technology
- Tim Berglund – Director of Training at DataStax [Note: The following opinions are Tim’s alone, and not those of his employer. Any speculation about the future is just Tim making things up – well, a kind of informed making things up – and does not reflect any future plans of any company he might work for at any point]
InfoQ: NoSQL databases have been around now for 10+ years. What is the current state of NoSQL databases in terms of industry adoption?
Seema Jethani: We are at the point where every industry has some NoSQL deployment. Web scale, social and mobile apps drove the first wave of adoption, IoT will drive the next big wave to mass adoption.
Perry Krug: We have typically looked at NoSQL adoption as taking place in 3 broad phases. Phase 1 refers to grassroots, developer adoption. Organizations are typically trying out and/or deploying NoSQL under non-mission critical apps if in production at all. Phase 2 refers to broader adoption where NoSQL is playing a much stronger role for mission critical/business critical applications but is not yet a standard part of the organization’s portfolio. Phase 3 signifies a strategic initiative in an organization and broad “re-platforming” to make NoSQL a standard within their organization. Depending on the organization, Phase 3 may see exclusive use of NoSQL or simply a well-understood balance between NoSQL and RDBMs.
Our view of the industry has been that organizations move through these phases at their own pace. Companies like Google, Facebook, PayPal, LinkedIn, etc have obviously been in Phase 3 for many years now, whereas other companies (without naming names) are still progressing through Phases 1 and 2.
Overall there is no denying that RDBMs still hold the vast majority of market share, but they are growing at a much slower rate than NoSQL. This rate is driven in part by the relative size between the two, but also by the fact that the need for NoSQL is growing at a much faster rate as well.
Jim Webber: Broadly I’d say that NoSQL databases have moved from a position of curious technology for early adopters and Web giants into a category that is quite accepted at least by the early majority. That’s compounded by the presence of many NoSQL databases (of all flavours) in the top 20 of the db-engines rankings of database popularity. Anecdotally it feels like NoSQL is well known in the developer and OSS community and its related applications like Big Data are somewhat understood by the business community.
Tim Berglund: Even five years ago, NoSQL was cool. It was something worth talking about at a conference all by itself; I had a very popular talk back then called “NoSQL Smackdown” that compared a few popular products. That talk could fill rooms just because of the buzzword.
Today it is still the case that NoSQL adoption is interesting to developers, but it has become more commonplace. Developers who work at companies in seemingly ordinary industries like finance, retail, and hospitality are building real systems using non-relational databases. Corporate IT decision-makers no longer need to be particularly progressive to commit to NoSQL. We have a long way to go before the category reaches maturity and competes on a completely level technology selection playing field with relational databases, but the trend line seems obvious to me.
InfoQ: What are some of the best practices for data modeling in NoSQL database projects?
Seema Jethani:
- Denormalize all the things: Using denormalization, one can group all the data needed to process a query in one place. This speeds up queries.
- Deterministic materialized keys: Combine keys into composites and fetch data deterministically so that you can avoid searching for data.
- Application-side joins: As joins are not universally supported in NoSQL solutions, joins need to be handled at design time.
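The three practices above can be sketched with an in-memory dictionary standing in for a key-value store. This is only an illustration under stated assumptions, not any product's API: the key layout (`user:<id>:orders`), the field names, and the helper functions are all hypothetical.

```python
# A plain dict stands in for a key-value store. Every key, field, and helper
# below is a hypothetical illustration, not a real database API.
store = {}


def order_key(user_id: str) -> str:
    """Deterministic composite key: built from known parts, so no search is needed."""
    return f"user:{user_id}:orders"


# Denormalization: the customer name is copied into each order record so that
# one read returns everything needed to display the order history.
store["user:42"] = {"name": "Ada", "email": "ada@example.com"}
store[order_key("42")] = [
    {"order_id": "o-1", "customer_name": "Ada", "total": 30},
    {"order_id": "o-2", "customer_name": "Ada", "total": 12},
]
store["product:p-9"] = {"title": "Widget"}
store["order:o-1:items"] = [{"product_id": "p-9", "qty": 3}]


def order_history(user_id: str) -> list:
    """Single key lookup: no join and no scan."""
    return store.get(order_key(user_id), [])


def order_items_with_titles(order_id: str) -> list:
    """Application-side join: the code, not the database, resolves product_id."""
    items = store.get(f"order:{order_id}:items", [])
    return [
        {**item, "title": store[f"product:{item['product_id']}"]["title"]}
        for item in items
    ]
```

The trade-off is that a denormalized value such as `customer_name` has to be updated in every copy when it changes, which is the price paid for the single-read query path.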
Perry Krug: The flexibility of data modelling and management is one of the more important driving factors for NoSQL adoption (the other being the need for operationability in terms of performance, scale, availability, etc). When talking about data modelling, the overarching “best practice” is to allow a much closer alignment of data model/structure between the application’s objects and the database. The idea of an ORM layer, which involves taking the application’s objects and breaking them out into rigid rows and tables and then joining those back together is quickly eroding.
Jim Webber: That’s rather a broad term given the range of data models supported under the NoSQL umbrella! In some data models the key is to understand how to denormalize your data into keys and values, columns, or documents, including any necessary user-level tricks to make it perform, and then to ponder how to support that model by indexing and so on. In graphs – which is my area of expertise – modelling is rather different but altogether more pleasant because of the data model (nodes, relationships and labels) and the processing model (graph traversal).
In a native graph database like Neo4j, the engine natively supports joins. These aren’t set joins as we’re used to in relational databases, but the ability to reconcile two related records based on a relationship between them performantly. Because of that join performance (many millions of joins per second, even on my laptop), we can traverse large graphs very quickly. In Neo4j such joins are implemented as the ability to traverse relationships between nodes in a graph by cheap pointer chasing. This is an aspect of native graph databases known as “index-free adjacency” that allows O(n) cost for accessing n graph elements as opposed to O(n log n) or worse for non-native graph tech.
Then, the power to traverse large connected data structures cheaply and quickly actually drives modelling. Given a typical domain model of circles and lines on a whiteboard, we find that it is often the same as the data model in the database – what you draw is what you store. Further, as we expand on the questions we want to ask of that data, it leads us to add more relationships, layers, expand subgraphs, refine names and properties. We think of this as being query-driven modelling (QDD, if you will). This means the data model is open to domain experts rather than just database specialists and supports high quality collaboration for modelling and evolution. Graph modelling gives us such freedom: draw what you store, with a small set of affordances for making sure you’re mechanically sympathetic to your underlying stack.
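The idea of traversal by pointer chasing can be illustrated with ordinary object references. The sketch below is not how Neo4j is implemented internally; the `Node` class, the `traverse` function, and the sample relationship types are assumptions made purely for illustration.

```python
from collections import deque


class Node:
    """A graph node that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.out = []  # list of (relationship_type, Node) pairs

    def relate(self, rel_type, other):
        self.out.append((rel_type, other))


def traverse(start, rel_type, max_depth):
    """Breadth-first traversal that follows object references directly.

    Each hop dereferences a stored reference instead of consulting a global
    index, so the work grows with the nodes and relationships visited.
    """
    seen = {start.name}
    queue = deque([(start, 0)])
    reached = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for kind, neighbour in node.out:
            if kind == rel_type and neighbour.name not in seen:
                seen.add(neighbour.name)
                reached.append(neighbour.name)
                queue.append((neighbour, depth + 1))
    return reached


alice, bob, carol, dave = (Node(n) for n in ("alice", "bob", "carol", "dave"))
alice.relate("KNOWS", bob)
bob.relate("KNOWS", carol)
carol.relate("KNOWS", dave)
alice.relate("WORKS_WITH", dave)
```

Calling `traverse(alice, "KNOWS", 2)` visits only the two `KNOWS` hops reachable from `alice`; nothing else in the graph is touched, which is the property the answer above attributes to native graph storage.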
Tim Berglund: If we were to confine our discussion to what I have traditionally seen as the “heavies” – Cassandra, MongoDB, and Neo4j – we can see that there is no one answer to this question. Each of these databases is as different from the others as it is from relational databases. So there really is no one approach to NoSQL data modeling; instead, we must approach each database on its own terms and learn data modeling techniques appropriate to it.
At present, my work is dedicated to producing educational resources for Cassandra. Some Cassandra applications, like storing time-series data, have specific data models that work best and can be learned as canned solutions. More general business domain modeling – such as is intuitive for many developers using a relational database – requires a specific methodology that differs from the received relational tradition. At one point we were all new to relational data modeling, and we had to learn how to do it well. It’s the same thing with the NoSQL databases. Someone has to sit us down and explain to us how they work and how best to represent the world using the data models they expose.
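As one example of a "canned" time-series model, a commonly described pattern is to bucket readings by source and day and keep them ordered by timestamp inside each bucket. The sketch below imitates that shape in plain Python rather than in Cassandra itself; the `partitions` structure, the function names, and the one-day bucket size are assumptions for illustration only.

```python
import bisect
from collections import defaultdict
from datetime import datetime

# (sensor_id, day) plays the role of a partition key; readings inside each
# partition are kept sorted by timestamp, much like a clustering column.
partitions = defaultdict(list)


def write_reading(sensor_id: str, ts: datetime, value: float) -> None:
    """Insert a reading into its (sensor, day) bucket, keeping time order."""
    bucket = partitions[(sensor_id, ts.date())]
    bisect.insort(bucket, (ts, value))


def read_range(sensor_id: str, start: datetime, end: datetime) -> list:
    """Readings with start <= ts < end, assuming both fall on the same day."""
    bucket = partitions.get((sensor_id, start.date()), [])
    lo = bisect.bisect_left(bucket, (start,))
    hi = bisect.bisect_left(bucket, (end,))
    return bucket[lo:hi]
```

Because a query names the sensor and the day, it goes straight to one bucket and reads a contiguous slice, which is the access pattern this kind of model is designed around.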
InfoQ: What should developers choose when there is a conflict between data modeling requirements that call for a specific NoSQL database (for example, Document) but performance requirements may require a different type of database (like Key Value store)?
Seema Jethani: Developers should always choose performance. Data models can be modified to meet the needs of the use case. Granted, additional work may be needed in the application, but you can’t get performance out of thin air. Always try to get better performance.
Perry Krug: IMO, Document and Key-Value are too similar to see as options for this sort of decision. A better example would be to compare Key-Value/Document vs Graph vs Columnar…
This is certainly one of the major challenges facing developers and architects today. In one sense, there is a growing convergence of capabilities and “fitness” between these different types, with some vendors providing multi-model and/or just expanding the features and functionality of one type so that it can handle more and more use cases. On the other hand, the difference between these different types of technologies is rooted in the idea that NoSQL is not just another “sledgehammer for every nail” in the way that RDBMs are. It’s a double-edged sword of added choice and complexity coupled with being able to choose the right tool for the job.
In the not too distant future, we expect to see more and more consolidation of these types of technologies so that customers don’t necessarily have to choose between such widely different choices but can still tune an individual system to meet the needs of their different applications (without getting too far down the line of not being good at anything).
Jim Webber: I’ve been pondering this a lot lately as various graph libraries have appeared on top of non-native graph databases. I think it’s important to understand what your database is native for and what it is non-native for. For example Neo4j is native for graphs, and is optimized for graph traversals (those cheap joins via pointer chasing) and ACID transactions for writes. Layering a (linked) document store on top of Neo4j could make sense because linked documents can benefit from the graph model – the two compose well, as systems like Structr demonstrate.
The reverse isn’t true though. If you have, say, a document or column database which doesn’t understand joins, then grafting relationships onto them (in the application or via a graph-wrapper library) is going against the grain. The joins that Neo4j natively eats for breakfast via low-level pointer chasing have to be done at the application or library level. Retrieve a document over the network, examine its content, resolve another address and retrieve the document over the network and so on. This places substantial practical limits on efficiency and traversal performance of non-native approaches.
For balance I’d point out that Neo4j isn’t, for example, a native time series database. If you want to do time series in Neo4j you probably end up encoding a time-tree (a kind of indexing pattern that looks somewhat like a B+ tree) into your model and explicitly querying against that tree. A native time series database would automate much of that work for you. So the only time when you’d choose Neo4j as a non-native time series database is when you want to mix in other (graph) data to accompany the time points (e.g. transaction history, geospatial, etc). At that point you tip the balance and choose graph even though it isn’t native for one of your dimensions.
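The time-tree pattern mentioned above can be pictured as a root that fans out to years, then months, then days, with events hanging off the day level. The following is a rough in-memory sketch of that shape using nested dictionaries, not graph nodes; the function names are hypothetical, and a real graph model would typically link days to each other rather than descend from the root for every day.

```python
from collections import defaultdict
from datetime import date, timedelta


def new_tree():
    """root -> year -> month -> day -> list of events attached to that day."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(list)))


def add_event(tree, when: date, event: str) -> None:
    tree[when.year][when.month][when.day].append(event)


def events_between(tree, start: date, end: date) -> list:
    """Collect events day by day; the range is inclusive on both ends."""
    found = []
    day = start
    while day <= end:
        # .get() avoids creating empty branches while querying.
        months = tree.get(day.year, {})
        days = months.get(day.month, {})
        found.extend(days.get(day.day, []))
        day += timedelta(days=1)
    return found
```

The point of the sketch is the manual effort: the application builds and walks the time index itself, which is the work a native time series database would take on.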
Tim Berglund: Performance requirements always have to win. You can’t turn a failing latency SLA into a success if you are asking the underlying database to outdo its best-case performance. You can, however, bend a Cassandra data model (for example) to support Document-like storage and access patterns if you need to. This is not to say that there aren’t systems which are best represented in a Document store, but ultimately even a tabular data model can represent any real-world state of affairs, however inelegantly in some corner cases. Performance is not as flexible.
InfoQ: Can you discuss some of the tools that will help improve the developer productivity when working on NoSQL based applications?
Seema Jethani: There are two key aspects of developer productivity that need to be addressed – ease of use and features that allow them to easily do powerful things with the database. Ease of use can be enhanced by providing out of the box clustering which would save developers valuable time during the on-boarding process. Rich features such as support for higher level languages and client libraries in various languages enable developers to run complex queries without having to do a lot of heavy lifting in the application. Finally tracing and debugging tools allow developers to quickly identify the root cause of a problem freeing them from time spent debugging.
Perry Krug: This is definitely an area that is both lacking today as well as rapidly growing. From a deployment and provisioning perspective, there is fairly good standardization across technologies and integration into the common toolchains. However, for the most part, each technology currently provides a siloed set of tools for its own developers. I think some degree of standardization of language/API across a few different technologies will be very interesting to watch out for over the next few years. It’s a pretty big unknown at this point how soon that will happen, if at all.
Combined with that, there are a host of new languages and runtimes (e.g. Node.js, Go) that are changing the way applications are designed, developed and deployed.
The most useful tools we see out there today are around reference architectures/implementations that can provide copy-paste examples to build upon. This also includes training and hands-on workshop style engagements from the experts in each technology.
Jim Webber: It’s clear to me that the relational databases are more mature in their integration with developer tooling than the NoSQL databases, that’s just a function of time. But that is rapidly changing as the NoSQL market shakes out and the database and tooling vendors begin to consolidate around a small number of front-runners, supported by an enthusiastic OSS community.
In Neo4j specifically we’ve been working hard over the last 5 years to produce a very productive query language called Cypher that provides humane and expedient access to the graph. That language is now in the early stages of standardization as “openCypher”, and will appear as the API to other graph technology over time (e.g. there is an initiative to port Cypher to Spark).
That same network stack can also be used to invoke server-side procedures written in any JVM language. While initially I thought procedures might be an interesting footnote condemned by a sorry history of stored procs to irrelevance, it turns out they’re actually amazing. Neo4j procedures are just code. That code can be rigorously tested in your IDE (TDD’d even) way before deployment. Brilliantly, while we intended those procedures to be used for iterative graph algorithms we now see they’re being used to bring in data from other systems (including non-graph databases, Web services, and even spreadsheets) and mix that data into the same Cypher queries that operate on the local graph. The productivity this enables is simply amazing.
Atop all of that, we and others in the graph world are busy working on visualizations for graphs so that non-experts can interact with the model. We saw how this played out recently where the Panama Papers were exposed by a combination of Neo4j for graph query and Linkurious for visualization. In working with sophisticated connected data sets, this kind of tooling is becoming increasingly important to developers too.
Tim Berglund: Depending on your tooling preference, either NoSQL looks like a tooling wasteland or looks just fine. If you want a simple visual data model exploration tool and command-line query capability, we’re pretty much there today. Any of the major databases will give you that. Most of them have connectors in all of the major (and many of the minor) data integration tools as well.
But if you want, say, round-trip visual modeling support, you will still be disappointed. I am hopeful the next five years will see this tooling gap close for those databases for which it is appropriate.
InfoQ: What do you think about multi-model databases? What are the pros and cons of multi-model database option v. polyglot persistence?
Seema Jethani: Polyglot Persistence advocates using multiple databases to store data based upon the way data is being used by individual applications or components of a single application, i.e., you pick the right database for the right use case. However, this approach presents operational and skills challenges. In contrast, multi-model databases are designed to support multiple data models against a single, integrated backend. These databases are designed to offer the data modeling advantages of polyglot persistence without the complexity of operating disparate databases and the need to be proficient in multiple databases as opposed to one.
Perry Krug: I think I covered a little bit of this above. Polyglot persistence is a sliding bar between “the right tool for the job” and “too many tools!”. Multi-model is also a sliding bar between “good for many things” and “not good at anything”. The legacy users of NoSQL have generally preferred polyglot because each individual technology excelled at a limited set of use cases and there was a need for many of them underneath a single broad application. However, newcomers to NoSQL are generally preferring a much smaller set of technologies and are looking to leverage each of them for a broader set of use cases.
Personally, I am fearful of multi-model databases ending up conflicting with their own feature sets and not being really good at anything. I have seen better results with a relatively small set of technologies (1-3) that can handle all or the majority of an organization’s needs. There will always be the need for super-specialized technologies, that’s not unique to databases.
Jim Webber: In a world dominated by systems composed from (micro)services, I think that developers choose the right database or databases for their service and then compose those services to deliver functionality. As such I think that polyglot has strong credibility.
I also think the jury is out on multi-model. Anecdotally it’s hard to swallow that any single database can be all things to all people. We saw that in the era of the RDBMS when we shoehorned all things into the relational data model. That learning was what spawned NoSQL!
But where my thinking is at, as I mentioned earlier, is the notion of a database being native for something, and non-native for other things, and whether the native model can be sympathetically composed into the non-native model. Graph happens to be a good native position there because it is the richest model – narrowing its affordances to other models is therefore plausible (should you choose to do that).
Tim Berglund: I have never been too persuaded of polyglot persistence as an architectural strategy. It may well emerge in some particular system that is composed of the integration of several legacy systems, but it’s probably an anti-pattern for greenfield work. My reason for this is partly operational and partly due to design considerations. Operationally speaking, it is more difficult to manage uptime and performance SLAs for multiple complex pieces of software infrastructure compared to just one. In terms of the code itself, it is also difficult to juggle many data models in a single project. The value of the different models has to exceed these two costs for it to be a rational choice, and I think this situation is rare.
That said, real use cases do exist for the different models. An architect might prefer to model part of her system as a graph, do ad-hoc SQL queries over another, and meet extremely aggressive performance SLAs in a part of the system that can be modeled more simply. Multi-model is a good solution to this problem, since it answers the operational challenge by putting all of the models in the same database, and it has the potential of simplifying the API problem by making the different models’ interfaces share as much API surface area as possible. I think in the future all of the major NoSQL databases will tend to share features of one another’s native models as much as they can. I’m excited to see what the next decade brings in this area.
InfoQ: Gartner says the leading database vendors will offer multiple data models, relational and NoSQL, in a single platform. What do you think about this assessment?
Seema Jethani: Some leading DB vendors offer NoSQL today with little market traction. They are a long way off from being able to offer multi-model, RDBMS and NoSQL from a single platform. And even if they do, there are many functional and performance tradeoffs that will limit its attractiveness. For now we see the world as moving towards multi-model NoSQL alongside RDBMS.
Perry Krug: I think this may be true in terms of what vendors like Oracle and IBM would like to provide, but I think it is false in terms of what the market really wants/needs. There will certainly be some degree of overlap, but there are fundamental design and architecture differences between relational and NoSQL. Simple laws of physics (not to mention CAP) dictate that certain capabilities around transactions, replication, distribution, etc cannot be mixed between the purest of needs of relational and NoSQL. In the end, a single vendor may provide multiple choices, but I believe they will have to be dealt with as very different products.
Jim Webber: At this point it seems that Gartner is the main proponent of this message, unsurprisingly. While I see some databases starting to offer multiple models, I’m not totally impressed with the notion. Like I said, we tried the all-things-to-all-people approach with relational databases. But pragmatically I think it again comes back to what your database is native for. If you’re native for graph, you can probably offer a reasonable document view of your data. Conversely, if you’re native for columns, it’s difficult to deliver native graph performance when your underlying engine can’t process joins.
On the multi-model versus polyglot persistence, I wonder whether the fault lines run along CIO and delivery responsibilities. As a CIO, I’d like to rationalize the number of databases that I have running my business. Whereas as someone who builds and operates software I’ve long since grown used to using the right tool for the right job (management permitting). It’s obvious which community Gartner addresses, and at some point it does make sense to play to your crowd.
Tim Berglund: I think that sentence calls to mind the image of a developer API that may never materialize, but apart from that I think the broader trend is already happening. Any non-relational database for which a Spark integration exists already offers a relational and non-relational data model through SparkSQL. After having initially discounted the importance of relational database features for any new database we create (as all NoSQL advocates have done at some point!), we tend to re-implement the relational algebra on top of that database over time. This lets us explore the space of different operational and performance characteristics (e.g., elastic scalability, low-latency writes, schema-less data model, etc.) while still retaining the utility of the relational model over time.
InfoQ: Can you talk about using NoSQL databases and big data technologies (like Hadoop and Spark) together to solve big data problems?
Seema Jethani: NoSQL databases and other Big Data technologies like Hadoop, Spark, Kafka are used to build various data analysis pipelines for large data sets. One such example using Riak involves leveraging Riak for short term operational data that is queried or updated often, then moved to Hadoop for long term storage as a data warehouse, while Spark is used for ingestion, real time analysis on Riak and batch analysis on Hadoop.
Perry Krug: This goes a little bit to the above comments around polyglot persistence. For almost all time, there has been a separation between technologies for “operational databases” and those for “analytics databases”. Even if the same technology can be used for both in some places, it is usually deployed differently to meet those different needs.
“Big Data” is the very broad buzzword that encompasses both NoSQL (operations) and the traditional “big data” technologies like Hadoop and Spark (analytics). For an entire application (imagine Facebook or LinkedIn), combining NoSQL and Hadoop technologies is absolutely critical to meeting their overall needs. The idea of a “lambda architecture” with data being handled both in real-time as well as in batch is becoming fairly well established.
This can also be looked at in the light of NoSQL vs RDBMs…the designs, architectures, and resource requirements of NoSQL systems are very different from Hadoop and batch processing systems. This is for good reason as the goals of each are very different, and there is usually relatively little overlap between the two. Spark starts to blur the lines between batch and real-time, but it’s still not an “operational database” technology.
I expect we will continue to see convergence between operational/online and batch/offline technologies, but I expect that there will always be a separation of the two requirements within an application.
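The “lambda architecture” mentioned above combines a batch view, recomputed over the full history, with a real-time view covering events the batch run has not yet seen, and merges the two at query time. The sketch below shows only that merge logic in plain Python; which systems fill each role is left open, and the function names and page-view example are hypothetical.

```python
from collections import Counter


def batch_view(historical_events):
    """Recomputed periodically over the full history (the batch layer)."""
    return Counter(event["page"] for event in historical_events)


def speed_view(recent_events):
    """Covers events that arrived after the last batch run (the speed layer)."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, speed):
    """Serving layer: merge both views at read time."""
    return batch[page] + speed[page]


historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
recent = [{"page": "/home"}]
```

With this data, `query("/home", batch_view(historical), speed_view(recent))` combines two historical views with one recent view, so the answer stays current between batch runs.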
Jim Webber: That bifurcates easily in Neo4j’s world view. Neo4j is by far the leading technology in graph storage and query, but that’s only half the story. The other half is graph compute and the leader there is clearly Spark. In numerous use-cases we see Neo4j as the repository of the authoritative graph, feeding downstream systems and running graph query workloads. But we also see graph processing infrastructure like Spark taking projected sub-graphs from Neo4j, parallel processing them and returning the results to the graph, enriching the model in a virtuous cycle.
Tim Berglund: First, it’s important to note that “NoSQL” doesn’t always mean “scale.” Some NoSQL databases choose to innovate in performance and data model while not fundamentally scaling any differently than relational databases. However, for those NoSQL databases that also belong to the Big Data category, integration with distributed computation tools is a key architectural differentiator. In particular, integrating Spark with databases like Cassandra or Riak adds the flexibility of ad-hoc analysis on top of data models that otherwise do not support ad-hoc queries very well. This architecture offers the promise of doing analytics on top of an operational data store with zero ETL between the two systems. This is a new approach that architects are just starting to build out, but it’s a successful approach that will win over traditional analytics systems at least some of the time.
InfoQ: Microservices are getting a lot of attention recently as a way to develop modular and scalable enterprise applications. Can you talk about how microservices can work with NoSQL databases?
Seema Jethani: There are many different ways microservices can work with NoSQL databases. At one end of the spectrum, each service may have its own database instance. This is operationally challenging and not a recommended approach. A use-case oriented approach, where services that address a particular problem share a database cluster, is a better fit with the microservice architecture. Riak is very popular and a good fit for solutions which use a microservices architecture. Specifically – as Riak is master-less, with each node running the same code, a Riak cluster can be scaled horizontally up and down without dependency on other services. Riak has well-defined HTTP and PBC APIs for data reads, writes, updates and searches, provides a REST endpoint for remote monitoring and has a ready-to-use Ansible playbook and Chef recipe for automated deployment and configuration.
Perry Krug: Microservices work great with NoSQL databases 🙂 The main advantage that NoSQL provides (usually, depends on the vendor) is the ease of setting up and running many small instances/clusters while still allowing each of those to scale quickly as the needs of the application/microservice grow. From a hardware, cost, setup perspective this is less true with RDBMs, but in theory it could be.
Jim Webber: In general, it makes sense for each microservice to use the best database for its own case (if it needs one). But managing systems composed from microservices is itself a demanding challenge. Not only are distributed systems hard from a computing science point of view (in particular handling failures), but managing the evolution of a network of many mutually dependent micro-services demands tooling.
Fortunately Neo4j is well known for microservices management where we consider the system as a whole to be a graph of interacting services. There’s an excellent video from our recent GraphConnect conference where the folks from Lending Club talk through their approach to managing their microservice estate with Neo4j here: http://neo4j.com/blog/managing-microservices-neo4j/
Having a graph view of your system enables you to have a predictive and reactive analysis of faults and contention. You can ascribe value to points of failure and reason about their costs and roll all of this up to the end user. You can locate single points of failure too and you can keep your model up to date with live monitoring data that then gives an empirically tempered view of the dependability of your whole system and the risks to which it is subjected.
And if you’re feeling particularly plucky (as was one large telco I talked to a couple of years back), you can think about making the graph the authoritative description of your microservices deployment. As a side-effect of traversing the graph you can create deployment scripts that actually build your system atop your PaaS: graph as the ultimate configuration management database.
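One way to picture deployment scripts falling out of a graph traversal is to treat service dependencies as a directed graph and emit the services in dependency order. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) rather than a graph database; the service names and the `deploy <service>` command are invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical service graph: each service maps to the services it depends on.
depends_on = {
    "web": {"orders", "auth"},
    "orders": {"db"},
    "auth": {"db"},
    "db": set(),
}


def deployment_plan(graph):
    """Return the services in an order where dependencies come first."""
    return list(TopologicalSorter(graph).static_order())


def deployment_script(graph):
    """Render one hypothetical `deploy <service>` line per service."""
    return [f"deploy {service}" for service in deployment_plan(graph)]
```

Here `db` is deployed first and `web` last; the relative order of `orders` and `auth` is not fixed, because neither depends on the other.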
Tim Berglund: Relational databases grew up in a world in which a single schema supported a number of small client-server applications. NoSQL databases grew up in a world in which a single large web site was served by one database. Features like programmatic transactional boundaries and granular security – which are often lacking or immature in NoSQL databases – are less important in the latter architectural form.
But as NoSQL adoption moves from large and rare web properties to more commonplace corporate IT applications, the single-application architectural assumption is less likely to hold: many different applications and services inside the company may want access to the data in the NoSQL database.
Microservices are an excellent solution to this problem. They allow the architect to stand up a single piece of code to talk to the database, which holds to the original assumption under which NoSQL databases were designed, yet also to make that service available to other consumers in the corporate IT application stack. If microservices are adopted enterprise-wide, this method of integration becomes the native approach, and internal tooling and expertise grow up around it. The “missing” features that the databases of the 1990s gave us seem less important under the new paradigm.
InfoQ: Container technologies provide the mechanism to deploy software applications in isolated deployment environments. What are the advantages and limitations of running NoSQL databases in a container like Docker?
Seema Jethani: Containers are easy to set up, and their lightweight nature allows more efficient use of hardware. At the same time, however, challenges around discovery, networking and ephemeral storage remain.
- Discovery: It’s possible to stand up multiple clusters inside of Docker containers. But how do we connect to the one we need? How do we keep track of which container holds the right cluster? There are tools to help with this, like Weave, but discovering the host:port to connect to can still be a problem.
- Ephemeral data storage: Unless you take operational care to start a database cluster on the same node and using the same data directory, you’ll get a fresh cluster. In some cases this is exactly what you want. But not all. This is especially problematic in cloud environments where nodes could flap often. You don’t want to pay the penalty of convergence and re-allocating data partitions.
- Networking: Internal Docker IPs that the database binds to are not necessarily accessible outside the Docker daemon. Thus when the database must respond to a client with a coverage plan that indicates which nodes the data resides on, it needs to supply addresses that the client will understand.
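The networking point in the list above can be sketched as follows. This is a toy model, not how Riak or any other database actually builds its coverage plan: the `Node` class, the addresses, the port, and the hash-based key placement are all assumptions made for illustration. The idea it shows is that each node keeps two addresses, one it binds to inside the container network and one it advertises to clients, and that the plan returned to a client uses only the advertised ones.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    bind_addr: str        # address inside the Docker network (hypothetical)
    advertised_addr: str  # address reachable by outside clients (hypothetical)


NODES = [
    Node("node1", "172.17.0.2:8087", "203.0.113.10:18087"),
    Node("node2", "172.17.0.3:8087", "203.0.113.10:28087"),
    Node("node3", "172.17.0.4:8087", "203.0.113.10:38087"),
]


def owner_of(key, nodes):
    """Pick the node that holds a key, using a simple deterministic hash."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


def coverage_plan(keys, nodes):
    """Map each key to the address a client should contact for it."""
    # Return advertised addresses only: bind addresses are meaningless
    # to a client sitting outside the Docker daemon.
    return {key: owner_of(key, nodes).advertised_addr for key in keys}


if __name__ == "__main__":
    for key, addr in coverage_plan(["user:1", "user:2", "order:9"], NODES).items():
        print(key, "->", addr)
```

In a real deployment the advertised address would typically come from configuration or from a discovery tool rather than from a hard-coded list.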
Perry Krug: While running databases in containers is still fairly nascent, I think that it holds a lot of promise and will quickly become well adopted. The advantages for NoSQL+Docker are essentially the same as for anything+Docker… removing the performance overhead of a hypervisor while allowing for even more flexible deployment than VMs provide today.
In my opinion, there are a few disadvantages, but they are more factors of the maturity of running these two technologies together rather than inherent limitations between them:
- Security and resource segregation is a big one, but will be resolved through technology improvements and best practices
- At the moment, containers are typically seen as being very “stateless” whereas databases tend to want persistent storage. This is also something that is being improved upon at the container level.
Jim Webber: Neo4j, like most NoSQL databases, happily lives in a container (or indeed in any other virtualization scheme). In fact there is an official Docker image supported by both Neo4j and Docker here.
The only real issue with virtualization of databases is the uncertainty about the behaviour of your neighbors. Databases (including Neo4j) love RAM, and failing that they appreciate a fast uncontended channel to disk. If there is contention for either of those because of unpredictable or greedy neighbors, then the performance of your instance will suffer. But if you have a good handle on that, everything should be OK.
Tim Berglund: Again, it helps to think of the distributed NoSQL databases here. These databases expect to be deployed to many servers, and their clusters may potentially be scaled up and down elastically.
The advantages of deployment automation and immutable infrastructure apply to any computer program you might want to deploy, and databases are no exception. However, Docker is probably slightly less valuable in deploying a NoSQL database than it is in deploying individual instances of, say, a given microservice. The case for containers in a microservice architecture relies on the fact that the deployed image changes as often as you change the code. Ideally, the deployed image of each database node does not change nearly as often.
This is not to say that Docker is the wrong idea when it comes to NoSQL. If you’re already using Docker elsewhere in your system, it might be smart to include your database in the fun, but NoSQL by itself will probably not convince you to switch to a container-based approach.
InfoQ: What is the current security and monitoring support in NoSQL databases?
Seema Jethani: Monitoring is often provided through integration with monitoring providers such as New Relic and Datadog, or set up in house using, for example, Nagios. Both approaches use metrics gathered and provided by the database.
Security features such as authentication, data governance and encryption are generally provided out of the box. Each database provides varying degrees of support for each.
Riak, for example, supports access control and authentication with support for multiple auth sources as well as group/user roles. They can be audited and we have access logging. For monitoring, we have stats, which include calculated and raw stats available from the command line and the HTTP interface. Enterprise customers also get access to JMX monitoring and SNMP stats and traps.
Perry Krug: These two should really be separated or maybe clarified further. Whereas monitoring has always been a critical part of running and managing NoSQL databases, security has not been until more recently. That’s not to say that some technologies haven’t provided better or worse capabilities in each area, but monitoring has been a topic of discussion and improvement for much longer and I think is in a much better state overall.
I think security is the more interesting topic to talk about 🙂 The early adopters of NoSQL technology didn’t place a high value or have a high need for very robust security capabilities. When faced with an endless list of possible features/improvements, the creators of NoSQL technologies followed what was most important to their consumers. Over the last few years, that level of value/importance on security has shifted directly in line with the kinds of applications and organizations adopting NoSQL … and the creators of those technologies have followed suit.
In my opinion, it is not valid to compare NoSQL to RDBMs in terms of security. RDBMs have had 30-40 years of history to build those features. Looking back into their own history will show that they made similar “use-case driven” decisions as NoSQL has made in its early years. I have no doubt that security will play a more and more important role for NoSQL and that the leading technologies will continue to build the features that their users require.
Jim Webber: I actually don’t know what the general case is, but I imagine it’s reasonable since NoSQL databases underpin lots of production systems. In Neo4j’s case we have long had security and monitoring baked into the product and have a team whose entire responsibility is these kinds of operability opportunities. In future, we’ll have LDAP, Kerberos, AD integration out of the box (some of that code is already visible on our GitHub repo of course), and refine our monitoring surface. I’d like to think we’ll also expose system monitoring things through to client apps through our binary protocol too since that would make monitoring apps just like “normal” apps.
Tim Berglund: I can speak most readily to Cassandra in this area, since it’s where I specialize. Open-source Cassandra has very basic security, and can be monitored through other open-source tools like Nagios. The commercial version, DataStax Enterprise, has more sophisticated features like integration with LDAP and Kerberos (rather than storing security credentials in the database itself), and has a custom-built management tool optimized for the management needs of a production Cassandra cluster.
InfoQ: What do you see as the new features and innovations coming up in NoSQL database space?
Seema Jethani: An interesting area in Research for example is how co-ordination may be avoided even in the case of concurrent transactions to maintain correctness and thus make transactions not only possible, but also performant in distributed databases. You can find the details of this research here.
Over the years NoSQL databases have been closing the gap between advantages of relational databases and flexibility and scalability offered by NoSQL databases. Research and innovations such as the above make it possible for us to enjoy the best of both worlds.
Perry Krug: It’s a very hard topic to discuss at the broadest level of NoSQL since each technology is fairly different and evolving in different paths. I think we will see more and more features that will attempt to address or mimic features found in RDBMs. I don’t think it is a good idea simply to copy, but we also can’t deny that some applications do need the same “type” of feature in order to meet their requirements (thinking specifically about “transactions” here but could easily expand that to others).
There will also be continuing expansion and overlap across the different technologies which will lead them to differentiate more based upon performance and reliability. I think this will also lead to a consolidation of technologies/vendors with the vast majority of NoSQL databases (>100) fading away.
Truly innovation-wise, I expect there will be more real-time analytical capabilities built into NoSQL and perhaps the emergence of a standard language/API.
Jim Webber: My first job out of grad school was in transaction processing (under the auspices of InfoQ editor Mark Little no less). That whole area became deeply uncool with the advent of NoSQL and the popularisation of eventual consistency. But around 2012 that changed: researchers like Peter Bailis (HA transactions), Diego Ongaro (Raft), Emin Gün Sirer (Linear Transactions) and many others started to reconsider transactions and consensus protocols in the light of high throughput, coordination-avoiding scalable systems. This resurgence of interest in strong consistency for a highly-available world is profoundly exciting and I expect to see this thinking impacting the NoSQL world generally. Already it has impacted the way Neo4j works, and it very much shapes some of the fault-tolerance and scale aspects of our future product roadmap. Some days I can’t believe my boss actually pays me to work on this stuff – sucker!
Tim Berglund: There are a few things I expect to see:
- Better support for real-time analytics over operational data in horizontally scalable databases
- Improved tooling
- More and more of the relational algebra available natively
The panelists also provided additional comments on NoSQL databases.
Perry Krug: A couple thoughts:
- At a very high level, NoSQL is really about providing two things: data/development flexibility, and better operationability (perf, scale, HA, etc). This panel seemed to focus more on the first one which is primarily focused towards developers, but didn’t spend as much time on the ops discussion which is critical for an organization/application to rely upon its underlying database. Historically, different technologies in the NoSQL space have also focused on one or the other…but customers are demanding both. The choices being made by organizations whether to use NoSQL at all and which NoSQL technologies to choose need to take both of these into consideration.
- There is a growing importance of mobile computing/applications. The processing power at the edge is rapidly increasing, as well as an increasing need for applications to work in an offline or semi-connected fashion. Whether or not NoSQL databases will play a role out on the mobile device, and what that role might be, is an interesting discussion.
Jim Webber: Databases remain an exciting field with which to be involved. But I wonder for how much longer we’ll keep the NoSQL umbrella. Column, KV, Document and Graph all have their own strong identities now and it’ll be interesting to see how those categories forge ahead. Interesting times indeed.
In this virtual panel article, we learned about the current state of NoSQL databases from four subject matter experts with different NoSQL DB expertise. We also learned about how to use NoSQL databases with Big Data technologies and other emerging trends like microservices and container technologies.
About the Panelists
Seema Jethani was until recently the Director of Product Management at Basho Technologies for Basho’s flagship products Riak KV and Riak TS, distributed NoSQL databases. Prior to joining Basho, she held Product Management and Strategy positions at Dell, Enstratius and IBM. She can be found on twitter as @seemaj.
Perry Krug is a Principal Solutions Architect and customer advocate for Couchbase. Perry has worked with hundreds of users and companies to deploy, and maintain Couchbase’s NoSQL database technology. He has over 10 years of experience in high performance caching and database systems.
Dr. Jim Webber is Chief Scientist with Neo Technology, the company behind the popular open source graph database Neo4j, where he works on graph database server technology and writes open source software. Jim is interested in using big graphs like the Web for building distributed systems, which led him to being a co-author on the book REST in Practice, having previously written Developing Enterprise Web Services – An Architect’s Guide.
Tim Berglund is a teacher, author, and technology leader with DataStax, where he serves as the Director of Training. He can frequently be found speaking at conferences in the United States and all over the world. He is the co-presenter of various O’Reilly training videos on topics ranging from Git to Distributed Systems, and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally, and lives in Littleton, CO, USA with the wife of his youth and their youngest child.