MongoDB Pipeline

MongoDB’s Aggregation Pipeline: make your life easier… or at least less difficult

The following is from Avinash Kaza’s article: Business Intelligence Platform: Tutorial Using MongoDB Aggregation Pipeline

Found here:  https://www.toptal.com/mongodb/business-intelligence-platform-using-mongodb-aggregation-pipeline

_______________________________________

Using data to answer interesting questions is what researchers are busy doing in today’s data driven world. Given huge volumes of data, the challenge of processing and analyzing it is a big one; particularly for statisticians or data analysts who do not have the time to invest in learning business intelligence platforms or technologies provided by Hadoop eco-system, Spark, or NoSQL databases that would help them to analyze terabytes of data in minutes.

The norm today is for researchers or statisticians to build their models on subsets of data in analytics packages like R, MATLAB, or Octave, and then give the formulas and data processing steps to IT teams who then build production analytics solutions.

One problem with this approach is that if the researcher realizes something new after running his model on all of the data in production, the process has to be repeated all over again.

What if the researcher could work with a MongoDB developer and run his analysis on all of the production data and use it as his exploratory dataset, without having to learn any new technology or complex programming languages, or even SQL?

If we use MongoDB’s Aggregation Pipeline and MEAN effectively we can achieve this in a reasonably short time. Through this article and the code that is available here in this GitHub repository, we would like to show how easy it is to achieve this.

________________________________________

For more, check out the entirety of Avinash’s tutorial.

Kristen

 

 

 

5 Minute Super-Simple Guide to Using IPython/Jupyter Notebooks

No matter what you call it, you can’t dispute it’s convenient.

A Jupiter notebook.

A python notebook.

I can’t decide which one I want more. They’re both pretty cute.

 

So what is it, already?

It’s a Word document you can code in.

In all seriousness, it’s a web application, designed for use in the data sciences.

IPython is the Python piece of Project Jupyter (which supports a whole toooooon of languages, not just Python). Project Jupyter grew out of IPython in 2014, so although it’s Jupyter notebooks now, a lot of people still call them IPython notebooks. And yes, it takes some time to get used to spelling Jupyter.

JU-PY-TER

 

Let’s work our way down from the top.

 

jupyter-header

 

 

And take a look at the menu…

jupyter-menu

 

 

I just want to go through things that I think are useful, because most of it is very self-explanatory. So let’s start with Jupyter’s time travel device.

file-menu
“Save and Checkpoint” and “Revert to Checkpoint” work like a small built-in version control system: you can roll the notebook back to the last checkpoint, which is more convenient than saving a separate copy before every major change.

 

That brings us to the Edit menu.

edit-menu

The Edit menu allows you to easily edit the metadata.

metadata

 

The rest of the menu bar is pretty easy to understand, which brings us to a fun feature: Widgets.

widgets-menu

 

Widgets make your notebook come alive by adding interactivity. Here are some examples:

 

scatterplot

 

An interactive scatterplot you can explore.

 

3-d-visualization

 

A 3-D visualization you can explore from all angles.

 

Adding these widgets and more (customized maps!) makes it a lot easier to turn your ideas into engaging data visualizations with Jupyter.

 

Now for buttons…

 

jupyter-buttons

 

The Save button is obviously there, but what is that little cross next to it?

jupyter-add button

It adds a cell to your notebook (shown below).

The thing that is unique about Jupyter is that you run your code directly in the application, with the output coming out all nice and neat in its little box.

input-output

Another useful feature is that you can run the code section by section, so that if you (GOD FORBID!) have a mistake, you don’t have to rerun everything from the beginning.
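
Here is a rough sketch of what that looks like in practice, written as one Python script with comments marking where the cell boundaries would be. The wine list and variable names are made up for illustration.

```python
# --- Cell 1: the slow part; run it once ---
# (Hypothetical data; imagine this is a big file that takes minutes to load.)
wines = [
    {"name": "Rioja", "year": 2012},
    {"name": "Malbec", "year": 2014},
    {"name": "Chianti", "year": 2010},
]

# --- Cell 2: the part you keep tweaking ---
# `wines` is still in the kernel's memory from Cell 1, so if this cell
# has a mistake you fix it and rerun only this cell.
oldest = min(wines, key=lambda wine: wine["year"])
print(oldest["name"])
```

In a real notebook each section would sit in its own cell, and only the second one needs rerunning while you experiment.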

 

Next are the handy yet dangerous Cut, Copy and Paste buttons.

jupyter-edit-button

 

They cut, copy and paste your cells, but be sure to save and add checkpoints in case you hit the Cut button by accident.

 

jupyter-up-down-button

The next buttons move cells up and down.  Pretty easy to use.

 

jupyter-run-button

The next buttons are pretty important.  The first one runs your currently selected cell, the stop button interrupts your kernel (halting whatever is running) and the restart button restarts it (which clears all your variables).

 

jupyter-markdown-button

The next menu sets the type of your cells.  There are three options: Code, Markdown and Raw NBConvert.
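
If you are curious how those cell types end up on disk: a notebook file is just JSON, and each cell records its type. Below is a hand-written sketch of a tiny notebook in the nbformat 4 layout, built with nothing but Python's `json` module. The cell contents are invented, and notebooks saved by Jupyter itself carry more metadata than this.

```python
import json

# A minimal, hand-written notebook structure (nbformat 4).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        # "Markdown" in the dropdown
        {"cell_type": "markdown", "metadata": {}, "source": "# My wine notes"},
        # "Code" in the dropdown
        {"cell_type": "code", "metadata": {}, "source": "print('cheers')",
         "execution_count": None, "outputs": []},
        # "Raw NBConvert" in the dropdown
        {"cell_type": "raw", "metadata": {}, "source": "left untouched"},
    ],
}

# Round-trip through JSON text, the same way an .ipynb file is stored.
text = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```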

 

jupyter-command-button

The next button opens up a handy little search menu of all the commands for Jupyter Notebook.

search-box

 

 

The next button is accessible two ways. Through this button:
jupyter-cell-toolbar-button

And the View Menu:

cell-toolbar-menu

They both open a menu with a couple of different options to view the Notebook.  I’m not sure why it’s really there.

This post is just designed to give the very basics of how to use Jupyter Notebooks. I highly recommend working your way through Rackspace’s Python Jupyter Notebook tutorial (in a Jupyter Notebook!).

Have fun!

Kristen

SQL vs NoSQL. Part Three.

In previous posts I talked about SQL and NoSQL, and I want to go into a little more detail (while keeping it simple) about what makes them different.

Scalability>>> Think making big things small. SQL databases typically scale vertically (so everything sits on one bigger server: expensive!). NoSQL databases scale horizontally (many servers==ok).
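
To make "horizontally" a bit more concrete, here is a toy sketch of spreading bottles across several servers by hashing a key. The server names and bottle names are hypothetical, and real NoSQL databases use more sophisticated schemes than this; it only illustrates the idea that each record has one predictable home among many machines.

```python
import hashlib

SERVERS = ["server-a", "server-b", "server-c"]  # hypothetical server names


def pick_server(key: str) -> str:
    """Map a key to one server, the same one every time."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]


bottles = ["Rioja 2012", "Malbec 2014", "Chianti 2010", "Merlot 2016"]
placement = {bottle: pick_server(bottle) for bottle in bottles}

for bottle, server in placement.items():
    print(bottle, "->", server)
```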

Schema>>> Technically, a schema is a representation of some model. In programming land, it refers to the structure of a database. Because you can’t see a database (at least I hope you can’t), you have to think about how that structure is represented. In SQL, the schema is fixed: columns must be decided ahead of time, and every row has a slot for every column. Remember that wine shelf? You can’t easily add a new column to your shelf after you’ve built it…it will probably look like all the images when you google “shelf fail.”

Shelf Fail

I don’t know why, but this shelf is kind of cute.

Also, you have to put a bottle in every slot. Someone’s going to be a happy wine collector.

NoSQL deals with schema in a very different way. It just says “Nope.” and walks away. You can add (or leave out) anything you want, anytime you want. Now that’s flexibility.
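
Here is a small sketch of that difference using Python's built-in `sqlite3` module for the SQL side and plain dictionaries standing in for NoSQL documents. The table, column and winery names are made up. One assumption to flag: the columns are declared `NOT NULL` to mirror the "bottle in every slot" rule; without that, SQL would let you leave a slot as an explicit NULL.

```python
import sqlite3

# SQL side: the shelf is built up front, with exactly two columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shelf (year INTEGER NOT NULL, winery TEXT NOT NULL)")
conn.execute("INSERT INTO shelf (year, winery) VALUES (?, ?)",
             (2012, "Hypothetical Cellars"))

# A column that was never declared is rejected.
try:
    conn.execute("INSERT INTO shelf (year, winery, grape) VALUES (?, ?, ?)",
                 (2014, "Made-Up Estate", "Malbec"))
    undeclared_column_ok = True
except sqlite3.OperationalError:
    undeclared_column_ok = False

# Leaving a required slot empty is rejected too.
try:
    conn.execute("INSERT INTO shelf (year) VALUES (?)", (2010,))
    empty_slot_ok = True
except sqlite3.IntegrityError:
    empty_slot_ok = False

# NoSQL side (sketched with dicts): every document brings its own fields.
pile = [
    {"year": 2012, "winery": "Hypothetical Cellars"},
    {"year": 2014, "grape": "Malbec"},  # no winery, extra field: fine
]

print(undeclared_column_ok, empty_slot_ok, len(pile))
```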

Data>>> Finally, let’s get to the data. In SQL, each column in a row holds one specific entry. For example, in a row containing information about a bottle of wine you might have “Year”, “Location”, “Winery”, etc. You can’t have two years for a bottle of wine, or two locations. In NoSQL, that’s A-OK. You can have two wineries (maybe it was a collaboration?) or no wineries. If that’s what you want.
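
As a quick illustration, here are three wine "documents" sketched as Python dictionaries: one with two wineries, one with none, one with a single winery. All the names are invented, and the helper function is just for this example.

```python
bottles = [
    {"name": "Collab Red", "winery": ["Winery One", "Winery Two"]},
    {"name": "Mystery White"},  # no winery field at all
    {"name": "Solo Red", "winery": ["Winery Three"]},
]


def wineries_of(bottle):
    """Return the list of wineries, or an empty list if there are none."""
    return bottle.get("winery", [])


counts = {bottle["name"]: len(wineries_of(bottle)) for bottle in bottles}
print(counts)
```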

More reading.

Next post I’ll be going into more detail about NoSQL and specifically MongoDB.

Peace,

Kristen

What follows a really bad movie about databases?

NoSQL

This is SQL.
wine bottles

This is NoSQL.
wine bottles pile

As discussed in my previous post, a SQL database is a relational (tabular) database, one that looks like an Excel sheet or an empty shelf. NoSQL is its evil twin sister.

The lovable evil twin sister.

People like NoSQL for its flexibility. You can only fit one wine bottle per shelf using SQL, but with NoSQL you can throw those wine bottles in a pile and it’s A-OK.

Also some people believe it is faster… but there are a lot of mixed opinions on this…

Let’s look for our favorite red wine again…

db.shelf.find( { "taste": "delicious", "dryness": "dry", "color": "red" } )

A lot shorter, isn’t it?
You don’t find your wine by searching for it in its proper cubby in its designated row and column; you search by keys.
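
To show what "searching by keys" means, here is a pure-Python sketch that imitates the exact-match part of the query above on a small pile of dictionaries. The `find` helper and the sample bottles are hypothetical and only mimic simple equality matching; this is not how MongoDB is implemented.

```python
def find(collection, query):
    """Return the documents whose fields equal every key/value in query."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in query.items())
    ]


shelf = [
    {"name": "A", "taste": "delicious", "dryness": "dry", "color": "red"},
    {"name": "B", "taste": "delicious", "dryness": "sweet", "color": "red"},
    {"name": "C", "taste": "delicious", "dryness": "dry", "color": "white"},
    {"name": "D", "color": "red"},  # missing keys simply do not match
]

favorites = find(shelf, {"taste": "delicious", "dryness": "dry", "color": "red"})
print([doc["name"] for doc in favorites])
```

Notice that bottle "D" is allowed to be missing most of its keys; it just never matches a query that asks for them.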

As you can imagine, there are pros and cons to this. A lot of companies are still reluctant to embrace NoSQL. (See other article)

Here’s some more reading. If you have some time to kill.

Take care out there. There’s wild Pokemon.
Kristen

DIR: 07/04 Happy Monday

Day In Review: What I checked out today so you don’t have to

Unless you want to.

This section of the blog will be for posting some articles that I read every day, and my honest opinion of whether they are worth your time or not. To save articles across the web, I use a combination of Pocket (more for long term saves) and Degreed (for something I will read later that day).

So let’s get to it:

SparkR

I stumbled across this article on Twitter today, and it caught my attention because both Spark and R are things on my “To Learn” list. I like the “Seven Steps” approach (only seven steps can’t be that hard, right?) because it makes it seem more manageable. It is concise, contains a lot of different links (plus links to videos and reading) and is honest about what skills you will get in these seven steps. Good to get you started and head you in the right direction.

Rating: ★★★★☆

SQL vs NoSQL

An article that spawned the idea for the SQL and NoSQL posts. The bold headings make for an easy read, and it’s quite concise.

 

Rating: ★★★☆☆

Stock Photography

And here’s our random pick of the day. I was looking for pictures to accompany blog posts and I stumbled across this gem. Specifically Unsplash, which is one of the sites mentioned. Searchable, beautiful, free pictures. Because, let’s face it; everything free is beautiful.

Rating: ★★★★★

Thanks for the read.
Kristen