In order to create meaningful visual representations of our data, and the complimentary tools to analyze said data, it is important to have a well-conceived data management framework. This requires the right backend storage, a paradigm for data access, and an engaging front-end for presentation and analysis. There are a variety of tools that you can use to build a data access stack in your web applications which we will be exploring here.
If you are collecting data that data is relevant to your visitors they will want some way to consume it. Our responsibility is to provide transparency to our visitors, give them the best possible experience, and build intuitive and performant tools to allow them access to these insights. The visual representation of that data is only a part of that. It is the mechanisms that we use to store, transform, and transport that data which plays as much a part in providing these rich experiences.
Data Storage Options
Data storage has become a huge market in recent years. Deciding which technology you want to use for your application can be a daunting task. There are a few things to think about: performance, scalability, ease of implementation, as well as the particular skill set of you and your team. This last point being extremely important and often overlooked. If you have a team of SQL developers on your team the benefits of moving to a MongoDB implementation would have to be overwhelming in order to persuade you to go down that route.
Other than “stick with what you know” there is no quick and easy answer to which you should use. Flat datasets can be relatively easy to work with. They are structured as a single table (think CSV files) and can be relatively easy to understand. The limitations of these sources show themselves quickly, because they perform poorly as they grow and can be difficult to maintain. If you have a flat dataset you most likely want to break it apart into one of the other storage options.
Relational Databases (MySQL, SQL Server) are great for storing data in separate tables that can be joined up using unique keys. Advantages of these are that they reduce the size of the datasets, perform better, and can be accessed using a well established querying language (SQL). This solution also requires a good deal of planning, creating unique keys for establishing relationships, and tuning for performance.
Unstructured databases (e.g. Hadoop) are good for extremely large datasets and outside the scope of this discussion. If you are working with datasets of this size you are likely going to want to use an ETL process to normalize the data before bringing it into your application.
The option to store data client-side is also appealing but it doesn’t come without its disadvantages. File storage and caching data on a client machine has some advantages in certain use cases but it requires a certain level of trust between you and the user. If this is a trusted service, or if the user knows they are going to be working with large volumes of data, then it is reasonable to expect them to allow access to file storage. By default, however, I would not recommend making this an expectation in any but the most demanding of use cases.
Creating Access Layers
There are a few methods for creating access layers into your data. Views have long been the standard way of doing this in relational databases. Views allow you to write queries around your data and present it as a table. Using data aggression techniques such as group by, order by, sum, etc you can create smaller, more targeted datasets for your visualizations and analytics.
CREATE VIEW population_vw AS SELECT country, age, year, sum(total) AS TOTAL FROM census_data WHERE year IN ('2010') AND country IN ('United States') GROUP BY country, age, year;
Most relational databases also allow for the creation of materialized views that require ETL to create the view but perform better because they only require one table to be accessed.
A hybrid approach can be effective as well. Often times this can be accomplished by creating a more targeted MongoDB layer for your larger dataset that is being stored in SQL Server. Offloading the most crucial data to the document-oriented database for quick access and consumption while storing the full breadth of data in your backend SQL database. If you are using Node you can use Express to manage the creation of these datasets and store them on your MongoDB server.
OLAP also allows you to create datasets that can be aggregated but allow you to pre-establish the dimensions and measures you want to use to represent your data. OLAP uses Multidimensional Expressions (MDX) for accessing data types but is not very well supported in web applications.
Aggregating your data before sending it to the client has always been considered best practice. Most likely, you want to reduce the data as much as possible on the server before you present it to your users. This can be troublesome, however, because you will often be tempted to reduce it to its most aggregated form on the server. If the user wants to change the layout of the data, you end up with network thrashing because you constantly need to pull a dataset from the server with the appropriate level of aggregation.
It is critical that you find that medium where the data is aggregated to a size that is responsible from a bandwidth perspective but also provides adequate detail to allow for analysis. This can be accomplished through requirements gathering and establishing the key metrics and dimensions that the end user requires for analysis.