
MongoDB’s Aggregation Pipeline: make your life easier… or at least less difficult

The following is from Avinash Kaza’s article: Business Intelligence Platform: Tutorial Using MongoDB Aggregation Pipeline

Found here:


Using data to answer interesting questions is what researchers are busy doing in today’s data-driven world. Given huge volumes of data, processing and analyzing it is a big challenge, particularly for statisticians or data analysts who do not have the time to invest in learning business intelligence platforms or the technologies provided by the Hadoop ecosystem, Spark, or NoSQL databases that would help them analyze terabytes of data in minutes.

The norm today is for researchers or statisticians to build their models on subsets of data in analytics packages like R, MATLAB, or Octave, and then give the formulas and data processing steps to IT teams who then build production analytics solutions.

One problem with this approach is that if the researcher realizes something new after running his model on all of the data in production, the process has to be repeated all over again.

What if the researcher could work with a MongoDB developer and run his analysis on all of the production data and use it as his exploratory dataset, without having to learn any new technology or complex programming languages, or even SQL?

If we use MongoDB’s Aggregation Pipeline and MEAN effectively, we can achieve this in a reasonably short time. Through this article and the code that is available here in this GitHub repository, we would like to show how easy it is to achieve this.


For more, check out the entirety of Avinash’s tutorial.
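
To give a flavor of what an aggregation pipeline looks like, here is a minimal sketch in Python. It is not taken from Avinash’s tutorial: the database, collection, and field names ("shop", "orders", "year", "region", "amount") are all made up for illustration. A pipeline is just an ordered list of stages, and each stage transforms the documents coming out of the previous one.

```python
# Hypothetical example: the field names ("year", "region", "amount") are
# invented for illustration and do not come from the tutorial.

def build_sales_pipeline(year):
    """Return an aggregation pipeline: total order amount per region for one year."""
    return [
        # Stage 1: keep only the documents for the requested year.
        {"$match": {"year": year}},
        # Stage 2: group by region, summing amounts and counting documents.
        {"$group": {
            "_id": "$region",
            "total": {"$sum": "$amount"},
            "orders": {"$sum": 1},
        }},
        # Stage 3: largest total first.
        {"$sort": {"total": -1}},
    ]


pipeline = build_sales_pipeline(2015)

# With a running MongoDB server and the pymongo driver installed, the
# pipeline would be executed roughly like this:
#
#   from pymongo import MongoClient
#   db = MongoClient()["shop"]          # "shop" is a hypothetical database
#   for row in db["orders"].aggregate(pipeline):
#       print(row["_id"], row["total"], row["orders"])
```

The researcher describes the question (filter, group, sort) and the database does the heavy lifting over the full dataset.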






5 Minute Super-Simple Guide to Using IPython/Jupyter Notebooks

No matter what you call it, you can’t dispute that it’s convenient.

A Jupiter notebook.

A python notebook.


I can’t decide which one I want more. They’re both pretty cute.


So what is it, already?

It’s a Word document you can code in.

In all seriousness, it’s a web application, designed for use in the data sciences.

IPython is a sub-project under Project Jupyter (which supports a whole toooooon of languages, not just Python). Project Jupyter was started in 2014, so although they’re Jupyter notebooks now, a lot of people still call them IPython notebooks. And yes, it takes some time to get used to spelling Jupyter.



Let’s work our way down from the top.





And take a look at the menu…




I just want to go through things that I think are useful, because most of it is very self-explanatory. So let’s start with Jupyter’s time travel device.

“Save and Checkpoint” and “Revert to Checkpoint” work as a small built-in version control system: you can roll the notebook back to its last checkpoint, which is more convenient than saving a separate copy before every major change.


That brings us to the Edit menu.


The Edit menu allows you to easily edit the metadata.



The rest of the menu bar is pretty easy to understand, which brings us to a fun feature: Widgets.



Widgets can make your notebook come alive by adding interactivity. Here are some examples:




An interactive scatterplot you can explore.




A 3-D visualization you can explore from all angles.


Adding these widgets and more (customized maps!) makes it a lot easier to turn your ideas into engaging data visualizations with Jupyter.
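
If you want to try a widget yourself, here is a minimal sketch. It assumes the ipywidgets package is installed, and the function and its parameters (`sample_summary`, `n`, `seed`) are made up for illustration. The idea is that `interact` builds sliders from the argument ranges you give it and reruns your function every time you drag one.

```python
import random


def sample_summary(n=50, seed=0):
    """Draw n pseudo-random points and return (n, mean_x, mean_y)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [rng.random() for _ in range(n)]
    return n, sum(xs) / n, sum(ys) / n


def show_sliders():
    """Call this in a notebook cell to get one slider per argument."""
    # Imported here so the rest of the code still runs without ipywidgets.
    from ipywidgets import interact

    # (min, max, step) and (min, max) tuples of integers become integer sliders.
    return interact(sample_summary, n=(10, 200, 10), seed=(0, 5))
```

In a notebook cell, calling `show_sliders()` should display two sliders and the function’s result underneath them. Swap the body of `sample_summary` for your own plotting code and you have an interactive chart.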


Now for buttons…




The Save button is obviously there, but what is that little cross next to it?


It adds a cell to your notebook (shown below).

The thing that is unique about Jupyter is that you run your code directly in the application, with the output coming out all nice and neat in its little box.


Another useful feature is that you can run the code section by section so that if you (GOD FORBID!) have a mistake, you don’t have to rerun everything from the beginning.


Next are the handy yet dangerous Cut, Copy and Paste buttons.



They cut, copy and paste your cells, but be sure to save and add checkpoints regularly in case you hit the Cut button by accident.



The next buttons move cells up and down.  Pretty easy to use.



The next buttons are pretty important. The first one runs your currently selected cell, the stop button interrupts your kernel and the restart button restarts it.



The next dropdown sets the type of the selected cell. There are three options: Code, Markdown and Raw NBConvert.
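
Under the hood, a notebook file is JSON, and the cell type is just a field on each cell. Here is a minimal sketch that builds a tiny notebook with one cell of each type using only the standard library. The cell contents are made up, and the helper `make_cell` is a hypothetical name, not part of Jupyter.

```python
import json


def make_cell(cell_type, source):
    """Build one notebook cell as a plain dict (notebook format version 4)."""
    cell = {"cell_type": cell_type, "metadata": {}, "source": source}
    if cell_type == "code":
        # Code cells also record their outputs and how many times they ran.
        cell["execution_count"] = None
        cell["outputs"] = []
    return cell


notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        make_cell("markdown", "# A heading written in Markdown"),
        make_cell("code", "print('hello from a code cell')"),
        # "Raw NBConvert" in the dropdown is stored as cell type "raw".
        make_cell("raw", "Left untouched when the notebook is converted."),
    ],
}

notebook_json = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(notebook_json)["cells"]]
print(cell_types)  # ['markdown', 'code', 'raw']
```

Writing `notebook_json` to a file ending in `.ipynb` should give you something Jupyter can open, which is a nice way to convince yourself there is no magic in the file format.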



The next button opens up a handy little search menu of all the commands for Jupyter Notebook.




The next button is accessible two ways. Through this button:

And the View Menu:


They both open a menu with a couple of different options to view the Notebook. I’m not really sure why it’s there.

This post is just designed to give the very basics of how to use Jupyter Notebooks. I highly recommend working your way through Rackspace’s Python Jupyter Notebook tutorial (in a Jupyter Notebook!).

Have fun!



Streamgraphs are pretty, but can you understand them?

Take a look at this:




Is it a digital representation of marble art?
It kind of looks like it…
but actually it’s part of Nicolas Garcia Belmonte’s streamgraph showing the number of tweets during the 2012 European football tournament.





Click to read more about it here.


A what?

What is a streamgraph? Basically, it’s a stacked area graph (a fancy line graph, usually with a lot of colors, for displaying a whole lot of quantities). See below.



This fun visualization by Jure Leskovec, Lars Backstrom and Jon Kleinberg (check them out here) is a flashback to the 2008 presidential campaign. It shows the rise and fall in popularity of memes during that time.


And then you take it and center the whole stack around a middle axis instead of resting it on the bottom one, and then you have this:



This is the most famous example of a streamgraph, by a team at the NY Times showing the ebb and flow of box office revenue. Plus it’s pretty.
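
The jump from a stacked area graph to a streamgraph can be sketched with a few lines of arithmetic. This is a minimal illustration with made-up numbers, not how any of the graphs shown here were actually built: the only difference between the two layouts below is where the bottom of the stack starts (at zero, or at minus half the total so the stack is centered).

```python
def stack_layers(series, centered=False):
    """Return a (bottom, top) pair of boundary lists for each stacked series.

    series: list of equal-length lists of non-negative values.
    centered=False puts the baseline at 0 (ordinary stacked area graph).
    centered=True puts it at -total/2 (symmetric streamgraph layout).
    """
    n_points = len(series[0])
    totals = [sum(s[i] for s in series) for i in range(n_points)]
    bottom = [-t / 2 if centered else 0.0 for t in totals]
    layers = []
    for s in series:
        # Each layer sits directly on top of the previous one.
        top = [b + v for b, v in zip(bottom, s)]
        layers.append((bottom, top))
        bottom = top
    return layers


# Three made-up quantities measured at three points in time.
data = [[1, 2, 4], [2, 2, 2], [3, 0, 2]]

area = stack_layers(data)                   # rests on the x-axis
stream = stack_layers(data, centered=True)  # centered around the x-axis

print(area[-1][1])    # upper edge of the stacked area graph
print(stream[0][0])   # lower edge of the streamgraph
print(stream[-1][1])  # upper edge of the streamgraph
```

If you have matplotlib, its `stackplot` function takes a `baseline` argument, and `baseline="sym"` corresponds to this centered layout. Published streamgraphs often use more elaborate baselines that reduce the wiggle of the layers, so treat this as the simplest possible version.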


And there’s always a “but”

Streamgraphs are beautiful, but they have come under fire for their readability, especially when you come across ones like this:


***Eric Rodenbeck (creator) made these as a prototype and was planning on changing the colors. Also please check out his blog here.




Eric Rodenbeck



This streamgraph highlights not only the importance of color choice but also a potential problem with streamgraphs themselves. With that much visual information, how do people know what to look at? Only when you slice it down can you decipher what each color refers to.







So do streamgraphs offer a good way to convey information? It depends on two factors: the target audience (how much time are people willing to spend interacting with your visualization?) and the amount of information that the creator wants to show (if you’re doing a time series analysis and have a lot of quantities, a streamgraph may be the way to go).


Streamgraphs have an immense amount of potential in their digital form, as long as their large amount of information doesn’t get lost in the stream.
I’m sorry, I had to. Also check out more reading here.