﻿<!DOCTYPE html>
<html>
<head>
  <!--
   - BlogNoSql.htm - NoSQL databases
   - ver 1.1 - 04 September 2015
   - Jim Fawcett, Syracuse University
  -->
  <meta http-equiv="content-type" content="text/html;charset=UTF-8" />
  <meta name="description" content="Software Engineering course notes. Code Samples. Software Links" />
  <meta name="keywords" content="Lecture, Notes, Code, Syracuse,University" />
  <meta name="Author" content="Jim Fawcett" />
  <meta name="Author" content="James Fawcett" />
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <title>Blog NoSql</title>
  <script type="text/javascript" src="js/jquery-1.6.2.min.js"></script>
  <script type="text/javascript" src="js/TopMenu.js"></script>
  <script type="text/javascript" src="js/Fallback.js"></script>
  <link rel="stylesheet" href="css/TopLevel.css?v=1.0" />
  <link rel="stylesheet" href="css/Fallback.css?v=1.0" />
  <style type="text/css">
    .em { font-size:110%; font-weight:bold; }
    ul { margin-left:0px; }
    ul > li { padding-bottom:5px; }
    sup { font-size:medium; font-weight:bold; }
    .footnote { font-size:small; }
  </style>
</head>
<body id="github" onload="initializeMenu()">

  <header>
    <div class="container">
      <div id="topleft">
        Course Notes
      </div>
      <div id="topright">
        Computer Engineering
      </div>
    </div>
    <hgroup id="pagetitle">
      <h1 id="title">Code Artistry - No SQL Databases</h1>
      <div id="pagedate" class="center">
        <script type="text/javascript">
          document.write("Revised: " + document.lastModified)
        </script>
      </div>
    </hgroup>
  </header>

  <!-- site navigation menu built with Javascript -->
  <nav>
    <div id="nav">
      <div id="remove">
        <hr />
        <a href="TopNav.htm">Site Navigation with no Javascript</a>
        <hr />
        <br />
      </div>
    </div>
  </nav>

  <!-- page content -->
  <div class="content">
    <h2>Initial Thoughts:</h2>
    <p>
      There is currently a lot of technical interest in <a href="https://en.wikipedia.org/wiki/Big_data">&quot;Big&nbsp;Data&quot;</a>.
      Extreme examples are: data collection and analyses from the <a href="https://en.wikipedia.org/wiki/Large_Hadron_Collider">Large Hadron Collider</a>,
      the <a href="http://www.sdss.org/">Sloan Digital Sky Survey</a>, analyses of biological <a href="http://www.genome.jp/kegg/kegg1a.html">genomes</a>,
      collecting data for <a href="https://en.wikipedia.org/wiki/General_Circulation_Model">global climate models</a>, and
      analyzing client interactions in <a href="https://en.wikipedia.org/wiki/Social_network_analysis">social networks</a>.
    </p>
    <p>
      Conventional SQL databases may not be well suited for these kinds of applications. While they have worked very well for many
      business applications and record keeping, they get overwhelmed by massive streams of data.
      Developers are turning to <a href="https://en.wikipedia.org/wiki/NoSQL">&quot;noSQL&quot;&nbsp;databases</a>
      like <a href="https://www.mongodb.org/">MongoDB</a>, <a href="http://couchdb.apache.org/">CouchDB</a>,
      and <a href="http://redis.io/">Redis</a> to handle massive data collection and analyses.
    </p>
    <h3>SQL Data Model:</h3>
    <p>
      Traditional SQL databases provide a very well understood data management model that supports the <a href="https://en.wikipedia.org/wiki/ACID">ACID properties</a>,
      i.e., each transaction is <strong>A</strong>tomic, leaves managed data in a <strong>C</strong>onsistent state, appears to operate in <strong>I</strong>solation from other
      transactions that may operate concurrently, and at the end of the transaction the database state is <strong>D</strong>urable, i.e., is persisted to a permanent
      store.
    </p>
    <p>
      SQL data is normalized into tables with relationships.  This matches very well with data models where many records may be associated with the same data.
      If we build a books database, for example, many books may be associated with the same publisher information.  We link the book information with a foreign key
      relationship to publisher information in another table to avoid duplicating the same publisher data in every book record. Many-to-many relationships are
      modeled by link tables, often containing two foreign keys.  For the books database a book may have several authors and an author may have published more
      than one book.  So the link table holds records, each of which captures the association of a book with an author. If a book has two authors there are two
      records with that book key, one for each author.
    </p>
    <p>
      Each SQL Table has a fixed schema that captures the type of the records in the table.  A record in the books table
      might contain the book's name and date of publication.
      SQL database designs emphasize data integrity and structuring models in a fixed normalized tabular form.
      Queries into the database usually join data from
      several tables to build a complete description of the results to be returned.
    </p>
    <h3>noSQL Data Models:</h3>
    <p>
      The data models used by noSQL databases are usually based on key/value pairs, document stores, or networks.  noSQL processing
      favors modeling flexibility, the ability to easily scale out across multiple machines, and performance with very large datasets.
      For that flexibility they give up real-time data consistency, accepting application-enforced eventual consistency.  They give up
      a formal query mechanism (hence the name).  And they may give up Durability guarantees by only occasionally writing to persistent
      storage in order to provide high throughput with large volumes of data.
    </p>
    <p>
      The choice to use <a href="http://www.paperplanes.de/2010/7/5/relational_data_document_databases_schema_design.html">SQL or noSQL</a>
      data management is driven by the nature of its uses. Below we discuss <a href="../CSE681/lectures/Project5-F2015.htm">Project #5</a>,
      an application that builds a data management service for a large collaboration system composed of federated servers. That seems
      ideally suited for noSQL data management.
    </p>
    <h3>Goals of a noSQL Implementation:</h3>
    <p>
      The noSQL model has goals that often prove to be difficult to implement with SQL databases.  A noSQL database is designed to support one or more
      of the following:
      <ul>
        <li>Very large collections of data</li>
        <li>High throughput with data from streams</li>
        <li>Support for tree or graph models of its data</li>
        <li>Support for heterogeneous collections of data</li>
      </ul>
      When repeated data isn't a concern, we may avoid the overhead associated with following query references through potentially
      many tables and persisting every transaction to a durable store by using a network or key/value reference mechanism in
      conjunction with mostly in-memory storage using only occasional writes to the file system.  However, when dealing with very large
      data models these writes will likely be <a href="https://msdn.microsoft.com/en-us/library/azure/dn764982.aspx">sharded</a>
      into many files for durable storage. Probably a few shards, the most recently used, will be held in memory.
    </p>
    <p>
      A noSQL model may use a hashtable to store key/value pairs, incurring essentially constant time lookup and retrieval of its data, i.e.,
      time independent of the size of the data.  However, when the size of the managed data requires sharding, the
      constant time lookup and retrieval may be compromised by the processing necessary to locate shards that contain the data we
      need to retrieve.  We need to think about things like managing multiple shards in memory using a Least Recently Used
      mapping strategy, much like a virtual memory system.  We will likely think about using in-memory indexes to keep track of which
      shards hold specific data items or categories of items.  For some applications it may be appropriate to shard data into time-related
      batches, e.g., data collected in a day or a week.
    </p>
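    <p>
      As a rough illustration of the shard management ideas above, here is a minimal C# sketch of a cache that keeps a few shards in memory,
      evicts the Least Recently Used one, and uses an in-memory index to find the shard holding a key.  It is an assumption-laden sketch,
      not part of any project statement: the names ShardCache, loadShard, IsLoaded, and the use of integer shard ids are all hypothetical,
      and loading a shard is represented by a caller-supplied function rather than real file access.
    </p>
    <pre>
    using System;
    using System.Collections.Generic;

    // Hypothetical sketch: keeps at most 'capacity' shards in memory and
    // evicts the least recently used shard when another must be loaded.
    public class ShardCache&lt;TKey, TValue&gt;
    {
      private readonly int capacity;
      private readonly Dictionary&lt;TKey, int&gt; index;                       // key -&gt; shard id
      private readonly Func&lt;int, Dictionary&lt;TKey, TValue&gt;&gt; loadShard;     // e.g., reads a shard file
      private readonly Dictionary&lt;int, Dictionary&lt;TKey, TValue&gt;&gt; loaded
        = new Dictionary&lt;int, Dictionary&lt;TKey, TValue&gt;&gt;();
      private readonly LinkedList&lt;int&gt; usage = new LinkedList&lt;int&gt;();    // front = most recently used

      public ShardCache(int capacity, Dictionary&lt;TKey, int&gt; index,
                        Func&lt;int, Dictionary&lt;TKey, TValue&gt;&gt; loadShard)
      {
        if (capacity &lt; 1) throw new ArgumentOutOfRangeException("capacity");
        this.capacity = capacity;
        this.index = index;
        this.loadShard = loadShard;
      }

      public bool IsLoaded(int shardId) { return loaded.ContainsKey(shardId); }

      public bool TryGetValue(TKey key, out TValue value)
      {
        value = default(TValue);
        int shardId;
        if (!index.TryGetValue(key, out shardId))
          return false;                       // key is not in any shard
        return GetShard(shardId).TryGetValue(key, out value);
      }

      private Dictionary&lt;TKey, TValue&gt; GetShard(int shardId)
      {
        Dictionary&lt;TKey, TValue&gt; shard;
        if (loaded.TryGetValue(shardId, out shard))
        {
          usage.Remove(shardId);              // linear in the number of loaded shards
          usage.AddFirst(shardId);
          return shard;
        }
        if (loaded.Count &gt;= capacity)
        {
          int victim = usage.Last.Value;      // least recently used shard
          usage.RemoveLast();
          loaded.Remove(victim);              // a real engine would persist a modified shard first
        }
        shard = loadShard(shardId);
        loaded[shardId] = shard;
        usage.AddFirst(shardId);
        return shard;
      }
    }
    </pre>
    <p>
      Lookup within a loaded shard is still a hashtable access, but a miss now costs a shard load, which is where the constant time
      behavior is compromised.
    </p>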
    <p>
      With SQL data management all data is managed the same way.  The only flexibility is how we partition the data into tables and
      possibly shard data across multiple machines.  Changing the schemas and sharding strategy can be quite difficult to implement.
      Using noSQL databases we have a lot more flexibility in configuring data and it is easier to change schemas.
    </p>
    <p>
      The good news is that configuring data, managing schemas, determining when and how to persist to durable storage, and maintaining
      consistency are, with noSQL, up to the application.  The bad news is that they are up to the application.
    </p>
    <h2>Implementing a noSQL Database:</h2>
    <p>
      In <a href="CSE681.htm">CSE681&nbsp;-&nbsp;Software&nbsp;Modeling&nbsp;&amp;&nbsp;Analysis</a>, Fall 2015, we are exploring the development of a noSQL
      database in a series of five projects:
      <ul>
        <li>
          <a href="../CSE681/lectures/Project1-F2015.htm">Project #1</a><br />
          Develop the concept for a basic noSQL application.  We capture the concept with an &quot;Operational&nbsp;Concept&nbsp;Document&quot;&nbsp;(OCD).
        </li>
        <li>
          <a href="../CSE681/lectures/Project2-F2015.htm">Project #2</a><br />
          Implement most of the concept and perform thorough functional tests.
        </li>
        <li>
          <a href="../CSE681/lectures/Project3-F2015.htm">Project #3</a><br />
          Develop the concept for a remote noSQL application, based on Project #2, using
          a message-passing communication service.
        </li>
        <li>
          <a href="../CSE681/lectures/Project4-F2015.htm">Project #4</a><br />
          Implement the remote noSQL database server and do performance testing.
        </li>
        <li>
          <a href="../CSE681/lectures/Project5-F2015.htm">Project #5</a><br />
          Create and document a data management service architecture using the ideas developed in the first four projects.
          This service will provide the communication and state management infrastructure for a large Software Development
          Collaboration System composed of a federation of cooperating servers and client controllers.
        </li>
      </ul>

      Our goals are to understand why noSQL databases are interesting and useful, how they could be built, and
      to think about the consequences of this approach.  The concepts, developed in Projects #1 and #3, are expressed
      in Operational Concept Documents that focus on users and uses, top-level application structure, and critical issues.
    </p>
    <p>
      Documenting critical issues helps us think critically about our ideas and planned implementation before committing
      to code.  We may find that biasing our design in one direction or another may support the spinning off of new applications
      and services from a solid base.  We might also find that there are significant impediments on the path we are embarking upon
      that force a rethinking of the application and its goals.
    </p>
    <h3>Concept -> Uses:</h3>
    <p>
      In the projects for this course, we will be concerned with storing very large data sets, accepting data from streams quickly,
      storing and accessing networks of data, and managing collections of heterogeneous data.
    </p>
    <p>
      In the final project this Fall we will investigate the feasibility of building a data management service for a large collaboration
      system.  That involves: managing a large repository's data, recording continuous integration and test activities,
      managing notifications to a large collection of clients, and building and maintaining templates for test configurations,
      collaboration sessions, work package descriptions, etc.
    </p>
    <p>
      For the first project, however, uses focus on understanding requirements needed to implement a noSQL database, exploring alternative
      structures, and demonstrating the implications of our design choices. The users are the developer, Teaching Assistants, and the Instructor.
      Essentially, each student developer is responsible for demonstrating that each of the requirements in the
      <a href="Project2-F2015.htm">Project 2</a> statement has been met.
    </p>
    <p>
      The design impact of this use is that the implementation must carefully demonstrate requirements in a step-by-step
      fashion.  When a requirement asks for the ability to change some aspect of the database state it is the design's responsibility
      to show the state before, display the nature of the change, and display the database state after the change.  The display
      should be as economical as practical, limiting what an observer must understand to verify the action.
    </p>
    <h3>Concept -> Structure:</h3>
    <p>
      Perhaps the easiest way to begin creating a structure for an application we're developing is to think about the tasks it must
      execute. The project statement for <a href="../cse681/lectures/Project2-F2015.htm">Project #2</a> requires the noSQL prototype to provide the
      capability to:
      <ul>
        <li>Create items described by metadata and holding an instance of some generic type.</li>
        <li>Create and Manage a Key/Value database with capability to store and delete Key/Value<sup><a href="#footnote">1</a></sup> pairs.</li>
        <li>Edit Values</li>
        <li>Persist database contents to an XML file<sup><a href="#footnote">2</a></sup>.</li>
        <li>Augment database contents from an XML file with the same format as persisted, above.</li>
        <li>Support a variety of queries, both simple and compound.</li>
        <li>Support demonstration of all functional requirements through a series of discrete tests with display to the console.</li>
      </ul>
    </p>
    <p>
      Each database Value has structured meta-data and an Instance of the generic type.  We will choose to create a C# class to represent
      Values that might look something like this:
    </p>
    <p>
      <pre>
    public class Value&lt;Key,Instance&gt;
    {
      // public methods providing
      // access to private data
      private string name;                           // Note: you may choose to capture
      private DateTime timeStamp;                    // these Value states as properties
      private string description;                    // rather than private data items.
      private List&lt;Key&gt; children;
      private Instance payload;
    }
      </pre>
      and a C# class representing the database engine:
      <pre>
    public class noSQLdb&lt;Key,Value&gt;
    {
      // public methods providing database API
      private Dictionary&lt;Key,Value&gt; dbStore = new Dictionary&lt;Key,Value&gt;();   // The <a href="https://msdn.microsoft.com/en-us/library/xfhwa508(v=vs.110).aspx">dictionary</a> should not be a public property.
    }
      </pre>
    </p>
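    <p>
      To make the "public methods providing database API" comment a little more concrete, here is one possible shape for that API,
      offered only as a sketch.  The reading methods keys() and getValue(key, out val) follow the names used later in this blog; the names
      insert and remove, and the choice to refuse overwriting an existing key, are assumptions made for illustration.
    </p>
    <pre>
    using System.Collections.Generic;

    public class DBEngine&lt;Key, Value&gt;
    {
      private readonly Dictionary&lt;Key, Value&gt; dbStore = new Dictionary&lt;Key, Value&gt;();

      // Adds a new pair; refuses to overwrite a key that is already present.
      public bool insert(Key key, Value val)
      {
        if (dbStore.ContainsKey(key))
          return false;
        dbStore[key] = val;
        return true;
      }

      // Deletes the pair bound to key; returns false if the key was absent.
      public bool remove(Key key) { return dbStore.Remove(key); }

      // Retrieves the value bound to key without exposing the dictionary.
      public bool getValue(Key key, out Value val) { return dbStore.TryGetValue(key, out val); }

      // Returns a copy of the keys, so callers cannot disturb the store while iterating.
      public List&lt;Key&gt; keys() { return new List&lt;Key&gt;(dbStore.Keys); }
    }
    </pre>
    <p>
      Returning bool from each operation gives the test executive something simple to display when it demonstrates a requirement.
    </p>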
    <p>
      Each task in the list at the top of this section is a candidate to become a package. Some we may decide to merge later.
      There may also be times to take an existing package and divide into smaller packages. Usually that happens when the
      original was becoming too complicated to test easily.  Finally there may be a very few packages that we didn't have
      the foresight to define in the concept, but discover a need for during implementation.
    </p>
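    <p>
      The persistence and augmentation tasks in the list above could live in a package of their own.  The sketch below shows the idea using
      System.Xml.Linq.  To stay self-contained it persists a plain Dictionary of strings rather than the generic Value class, and the
      element names (noSqlDb, item, key, value) and the class name Persist are hypothetical, not a format required by the project statement.
    </p>
    <pre>
    using System.Collections.Generic;
    using System.Xml.Linq;

    public static class Persist
    {
      // Serializes each pair as &lt;item&gt;&lt;key&gt;..&lt;/key&gt;&lt;value&gt;..&lt;/value&gt;&lt;/item&gt;.
      public static XDocument toXml(Dictionary&lt;string, string&gt; db)
      {
        XElement root = new XElement("noSqlDb");
        foreach (KeyValuePair&lt;string, string&gt; pair in db)
        {
          root.Add(new XElement("item",
            new XElement("key", pair.Key),
            new XElement("value", pair.Value)));
        }
        return new XDocument(root);
      }

      // Adds pairs from the document to db; keys already present are left unchanged.
      // Returns the number of pairs added.
      public static int augment(Dictionary&lt;string, string&gt; db, XDocument doc)
      {
        int added = 0;
        foreach (XElement item in doc.Root.Elements("item"))
        {
          string key = (string)item.Element("key");     // null if the element is missing
          string val = (string)item.Element("value");
          if (key != null &amp;&amp; !db.ContainsKey(key))
          {
            db[key] = val;
            ++added;
          }
        }
        return added;
      }
    }
    </pre>
    <p>
      Writing to and reading from a file then reduces to calling Save(path) on the document returned by toXml and passing
      XDocument.Load(path) to augment.
    </p>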
    <div style="float:left; margin:20px 30px 20px 20px; border:1px solid gray; padding:5px; text-align:center; box-shadow:5px 5px 2px #888;">
      <img src="pictures/PackageDiagramPr2F15.jpg" height="500" />
    </div>
    <div style="min-width:300px;">
      <p>
        We start with a TestExec package at the top that is responsible for the project's main use - demonstrating that requirements
        have all been met.
      </p>
      <p>
        TestExec creates instances of Key/Value pairs using a simple factory that may generate a unique key and
        construct a Value with supplied parameters.
      </p>
      <p>
        It uses those pairs to populate its noSQLdb instance through an API provided by
        the DBEngine package.
      </p>
      <p>
        The nature of query processing and sharding are the most interesting parts of this project and will be
        left to students to work out in their individual ways.
      </p>
      <p>
        The remaining parts are self-explanatory after reading the
        <a href="../CSE681/lectures/Project2-F2015.htm">Project Statement</a>.
      </p>
      <p>
        When an application is large or becomes complex we often provide a top-level package diagram, like this one,
        and later provide more package diagrams for individual parts with significant internal structure.
      </p>
      <p>
        We almost always provide activity diagrams to help OCD readers understand the intent of the concept.
        The OCD for this project would greatly benefit from activity diagrams for handling queries and for
        sharding.  These are left for students to provide.
      </p>
    </div>
    <div style="clear:both;"></div>
    <h3>Concept -> Critical Issues:</h3>
    <p>
      <ol>
        <li>
          <strong>Issue:</strong> - Demonstrating Requirements<br />
          Students only get credit for requirements they clearly demonstrate.  No inputs other than a supplied
          XML file to load the initial database are required<sup><a href="#footnote">3</a></sup>.  The only output required is a console display.<br />
          <strong>Solution:</strong><br />
          This requires careful orchestration of a series of tests invoked by the test executive and supported by processing
          in the Display package.
          <br />
          <strong>Impact on Design:</strong><br />
          It will be effective
          to provide a method for each test that announces the Requirement number and displays db state before and after
          each change.
        </li>
        <li>
          <strong>Issue:</strong> - Designing Queries<br />
          Statement and solution(s) are left to the students.
        </li>
        <li>
          <strong>Issue:</strong> - Sharding<br />
          Statement and solution(s) are left to the students.
        </li>
        <li>
          <strong>More Issues:</strong> - Left to Students.
        </li>
      </ol>
    </p>
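    <p>
      The "method for each test" suggested in the first issue might look something like the following sketch.  It is only an illustration:
      the method names showDb and testDeletePair are hypothetical, the requirement label is passed in because actual requirement numbers
      come from the project statement, and a plain Dictionary of strings stands in for the database.
    </p>
    <pre>
    using System;
    using System.Collections.Generic;

    public static class TestExec
    {
      // Writes a short title followed by one line per key/value pair.
      public static void showDb(string title, Dictionary&lt;string, string&gt; db)
      {
        Console.WriteLine("  " + title);
        foreach (KeyValuePair&lt;string, string&gt; pair in db)
          Console.WriteLine("    " + pair.Key + " : " + pair.Value);
      }

      // Announces the requirement, then shows db state before and after one change.
      // Returns true if the change had the expected effect.
      public static bool testDeletePair(string requirement, Dictionary&lt;string, string&gt; db, string key)
      {
        Console.WriteLine("Demonstrating " + requirement + ": delete a key/value pair");
        showDb("before removing key \"" + key + "\":", db);
        bool removed = db.Remove(key);
        showDb("after removing key \"" + key + "\":", db);
        return removed &amp;&amp; !db.ContainsKey(key);
      }
    }
    </pre>
    <p>
      Keeping the database small for each test keeps the before and after displays short enough for an observer to verify at a glance.
    </p>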
    <h3>Later Projects:</h3>
    <p>
      After completing Project #2 we work on a concept, in <a href="../CSE681/Lectures/Project3-F2015.htm">Project #3</a>, and implement, in
      <a href="../CSE681/Lectures/Project4-F2015.htm">Project #4</a>, remote access to the noSQL prototype via message-passing communication.
    </p>
    <p>
      Finally we develop an architecture, in <a href="../CSE681/Lectures/Project5-F2015.htm">Project #5</a>, for a data management service
      in a large Software Development Collaboration Environment using the noSQL model we created in the earlier projects.
    </p>
    <p>
      You will find that several noSQL databases are required for Project #5 and that the key types and value types will not all be the same.  I would
      expect that sharding strategies may vary from database to database.  For that reason, it would be interesting to support
      pluggable sharding strategies in our noSQL design.  You should probably address that as a critical issue in your OCD for
      Project #1<sup><a href="#footnote">4</a></sup>.
    </p>
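    <p>
      One way to think about pluggable sharding is to hide the "which shard holds this item" decision behind an interface that each
      database instance is given when it is constructed.  The sketch below is an assumption about how that could look, not a prescribed design:
      the interface IShardingStrategy, the method shardFor, and the two example strategies are hypothetical names introduced for illustration.
    </p>
    <pre>
    using System;
    using System.Collections.Generic;
    using System.Globalization;

    // Hypothetical interface: maps an item to the name of the shard that should hold it.
    public interface IShardingStrategy&lt;Key&gt;
    {
      string shardFor(Key key, DateTime timeStamp);
    }

    // Time-related batches: one shard per calendar day, based on the item's time stamp.
    public class DailySharding&lt;Key&gt; : IShardingStrategy&lt;Key&gt;
    {
      public string shardFor(Key key, DateTime timeStamp)
      {
        return "shard-" + timeStamp.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
      }
    }

    // Spreads items over a fixed number of shards by hashing the key.
    // Caution: GetHashCode is not guaranteed to be stable across processes, so a
    // strategy used with persisted shards would need its own stable hash function.
    public class HashSharding&lt;Key&gt; : IShardingStrategy&lt;Key&gt;
    {
      private readonly int shardCount;

      public HashSharding(int shardCount)
      {
        if (shardCount &lt; 1) throw new ArgumentOutOfRangeException("shardCount");
        this.shardCount = shardCount;
      }

      public string shardFor(Key key, DateTime timeStamp)
      {
        // Mask off the sign bit so the remainder is never negative.
        int hash = EqualityComparer&lt;Key&gt;.Default.GetHashCode(key) &amp; 0x7fffffff;
        return "shard-" + (hash % shardCount);
      }
    }
    </pre>
    <p>
      A database that collects daily test results could then be constructed with DailySharding, while one holding repository
      metadata might use a key-based strategy, without any change to the engine itself.
    </p>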
    <h3>Concept Revisited:</h3>
    <p>
      All the discussion that follows was added after students turned in their noSQL Operational Concept Documents.  This discussion is concerned
      with things I wanted students to think about without being given too much guidance, but now want to clarify before they begin their designs for
      the noSQL Database.  We will focus on Queries, Sharding, and say a couple of things about the ItemFactory.
    </p>
    <h4>Queries:</h4>
    <p>
      First, what is a query for this noSQL database?  Let's define that in parts:
      <ul>
        <li>
          A <strong>QueryPredicate</strong> is a function that accepts a db key and returns true or false depending upon the processing of the
          predicate function.  For this noSQL db, the processing will look for specific conditions in the element bound to the supplied key,
          e.g., name, description, time-date stamp, children, or payload.
        </li>
        <li>
          A <strong>simple query</strong> then, consists of applying the QueryPredicate to each of the keys in the database and collecting all of the
          keys for which the predicate is true.
        </li>
        <li>
          A compound query is a chain of queries, each query using the keyset returned by the previous query<sup>5</sup>.
        </li>
      </ul>
    </p>
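    <p>
      These three definitions translate almost directly into code.  In the sketch below a QueryPredicate is represented by the standard
      delegate type Func&lt;Key, bool&gt;; the class name Query and the method names simple and compound are hypothetical, chosen only
      to mirror the definitions above.
    </p>
    <pre>
    using System;
    using System.Collections.Generic;

    public static class Query
    {
      // Simple query: apply the predicate to each candidate key and collect the matches.
      public static List&lt;Key&gt; simple&lt;Key&gt;(IEnumerable&lt;Key&gt; keys, Func&lt;Key, bool&gt; predicate)
      {
        List&lt;Key&gt; result = new List&lt;Key&gt;();
        foreach (Key key in keys)
          if (predicate(key))
            result.Add(key);
        return result;
      }

      // Compound query: each predicate filters the keyset returned by the previous query.
      public static List&lt;Key&gt; compound&lt;Key&gt;(IEnumerable&lt;Key&gt; keys, params Func&lt;Key, bool&gt;[] predicates)
      {
        List&lt;Key&gt; current = new List&lt;Key&gt;(keys);
        foreach (Func&lt;Key, bool&gt; predicate in predicates)
          current = simple(current, predicate);
        return current;
      }
    }
    </pre>
    <p>
      A predicate typically captures the database in a lambda, e.g., key =&gt; db[key].Contains("tart") for a dictionary named db,
      so that it can examine the element bound to the key it is given.
    </p>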
    <p>
      Suppose that we wrap each query result in an object that holds the resulting keyset and has the same reading interface (keys(), getValue(key, out val))
      as the DBEngine but doesn't have any writing methods. We'll call that VirtualDBEngine.  Suppose that we define a C# interface, IQueryable, that declares
      those "reading" methods and have both VirtualDBEngine and DBEngine implement that interface.
      Then each step of the compound query acts on an IQueryable object: the first query in the compound chain acts on a DBEngine instance, and every
      subsequent query clause acts on a VirtualDBEngine instance.
    </p>
    <p>
      With that setup we can define the showDB methods to use the IQueryable interface so it can be applied at each step of the query.
    </p>
    <p>
      DBFactory is the facility that makes a simple query and returns an IQueryable instance<sup>6</sup>.
    </p>
    <p>
      QueryEngine is configured with a set of QueryPredicates, uses the first on DBEngine to get an IQueryable with the first keyset, and uses each
      successive query on the returned IQueryable to refine the keyset returned by the previous simple query.
    </p>
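    <p>
      Here is a compact sketch of how those pieces might fit together.  It is one possible reading of the description above, not a reference
      solution.  Two assumptions are worth calling out: the predicate is given both the key and its value (rather than the key alone) to keep
      the example short, and the method names insert and execute are hypothetical.  The IQueryable declared here takes two type parameters,
      so it is distinct from the IQueryable types in System.Linq.
    </p>
    <pre>
    using System;
    using System.Collections.Generic;

    // The "reading" interface shared by the real database and by query results.
    public interface IQueryable&lt;Key, Value&gt;
    {
      List&lt;Key&gt; keys();
      bool getValue(Key key, out Value val);
    }

    public class DBEngine&lt;Key, Value&gt; : IQueryable&lt;Key, Value&gt;
    {
      private readonly Dictionary&lt;Key, Value&gt; dbStore = new Dictionary&lt;Key, Value&gt;();

      public bool insert(Key key, Value val)
      {
        if (dbStore.ContainsKey(key)) return false;
        dbStore[key] = val;
        return true;
      }
      public List&lt;Key&gt; keys() { return new List&lt;Key&gt;(dbStore.Keys); }
      public bool getValue(Key key, out Value val) { return dbStore.TryGetValue(key, out val); }
    }

    // Read-only view: holds a query's keyset and forwards value lookups to its source.
    public class VirtualDBEngine&lt;Key, Value&gt; : IQueryable&lt;Key, Value&gt;
    {
      private readonly IQueryable&lt;Key, Value&gt; source;
      private readonly HashSet&lt;Key&gt; keySet;

      public VirtualDBEngine(IQueryable&lt;Key, Value&gt; source, IEnumerable&lt;Key&gt; keySet)
      {
        this.source = source;
        this.keySet = new HashSet&lt;Key&gt;(keySet);
      }
      public List&lt;Key&gt; keys() { return new List&lt;Key&gt;(keySet); }
      public bool getValue(Key key, out Value val)
      {
        if (keySet.Contains(key))
          return source.getValue(key, out val);
        val = default(Value);               // key is not part of this query result
        return false;
      }
    }

    public class QueryEngine&lt;Key, Value&gt;
    {
      // Applies each predicate in turn; every step sees only the previous step's keyset.
      public IQueryable&lt;Key, Value&gt; execute(
        IQueryable&lt;Key, Value&gt; db, IEnumerable&lt;Func&lt;Key, Value, bool&gt;&gt; predicates)
      {
        IQueryable&lt;Key, Value&gt; current = db;
        foreach (Func&lt;Key, Value, bool&gt; predicate in predicates)
        {
          List&lt;Key&gt; matches = new List&lt;Key&gt;();
          foreach (Key key in current.keys())
          {
            Value val;
            if (current.getValue(key, out val) &amp;&amp; predicate(key, val))
              matches.Add(key);
          }
          current = new VirtualDBEngine&lt;Key, Value&gt;(current, matches);
        }
        return current;
      }
    }
    </pre>
    <p>
      Because every intermediate result is itself an IQueryable, a showDB method written against that interface can display the keyset
      after each clause of a compound query.
    </p>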
    <p>
      You can think of the QueryPredicates as equivalent to stored procedures in a conventional database.
    </p>
    <hr />
    <a name="footnote"></a>
    <ol class="footnote">
      <li>
        The C# language has two kinds of types: value types and reference types.  Value types reside in static or stack memory, are
        copyable, and when assigned are unique from the original source. Reference types reside in managed memory and are, in general,
        not copyable nor assignable.  The program's code may copy or assign a reference to an instance on the managed heap, but
        both target and source of the reference copy or assignment are the same heap-based instance.  Our use of the term Value
        in this blog does not mean a C# value type.  It simply means the database data referenced by the key.  The kind of its type may be
        either a C# value or C# reference type.
      </li>
      <li>
        Project #1 encourages students to think about issues like sharding.  We do not require students to implement sharding in
        Project #2 but would be pleased to see and review any sharding processes they may attempt.
      </li>
      <li>
        Please do not provide console menus.  A GUI could be effective for Project #2 but I would much rather have you spending
        your time working on the functional requirements.
      </li>
      <li>
        This is a test to see if you've read the entire blog carefully before submitting your first project.
      </li>
    </ol>
    <hr />
    <h3 style="margin-bottom:2px; margin-top:25px;">Blogs:</h3>
    <div class="indent">
      <a href="blog.htm">First&nbsp;Things</a>,
      <a href="blogDesign.htm">SW&nbsp;Design</a>,
      <a href="blogOOD.htm">Object&nbsp;Oriented&nbsp;Design</a>,
      <a href="blogObjectModels.htm">Object&nbsp;Models</a>,
      <a href="blogOCD.htm">Operational&nbsp;Concept&nbsp;Document</a>,
      <a href="blogNoSql.htm">noSQL&nbsp;Database</a>,
      <a href="blogParser.htm">Parsing</a>,
      <a href="blogMTree.htm">M-Ary&nbsp;Trees</a>,
      <a href="blogGraph.htm">Directed&nbsp;Graphs</a>,
      <a href="blogFileSystem.htm">C++&nbsp;File&nbsp;System</a>,
      <a href="blogMsgPass.htm">Message&nbsp;Passing&nbsp;Systems</a>,
      <a href="blogGlobals.htm">Globals</a>,
      <a href="SummerReading.htm">Summer&nbsp;Reading</a>
    </div>
  </div>

  <footer>
    <hr />
    <img src="pictures/newhouse4.jpg" alt="Newhouse" width="98%" />
    <hr />
    Jim Fawcett &copy; copyright 2013
  </footer>

</body>
</html>