Wednesday, January 15, 2014

1,000 Features Per Second for Node-to-Node Trace!



1,000 Features Per Second for Node-to-Node Trace!
Description of an Electric Trace for GTViewer using Bentley Utility Designer Data

 Charlie Marlin
Huntsville Utilities
Begun April 3, 2013
Revised January 14, 2014



Summary

Network traces have historically been slow – 1 feature per second, maybe 10, maybe 100.  This paper shows how to achieve 1,000 features per second using standard data structures. The primary innovation is to pre-process certain portions of the data into text files which are loaded into memory (into hash tables) when the trace application is first started.  Most of the data access during the trace uses these hash tables (which are one of the fastest data structures available) instead of disk reads, so the trace runs faster than most.


Audience

The paper is intended for anyone writing a network trace, whether with GTViewer or not, whether with data from Bentley or not.  The trace is implemented using GTViewer on data that has been converted from Bentley to the format used by GTViewer, so I will necessarily get down into the details of GTViewer and Bentley in order to explain what’s going on.  But an interested reader should be able to understand how to implement the same concepts in his own environment.

Network tracing is one of many practical applications of Graph Theory.  Here is the definition from Wikipedia:  (http://en.wikipedia.org/wiki/Graph_theory)

“In mathematics and computer science, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A "graph" in this context is made up of "vertices" or "nodes" and lines called edges that connect them. A graph may be undirected, meaning that there is no distinction between the two vertices associated with each edge, or its edges may be directed from one vertex to another; see graph (mathematics) for more detailed definitions and for other variations in the types of graph that are commonly considered. Graphs are one of the prime objects of study in discrete mathematics.”

Why is Graph Theory important to GIS?  Because our electric, gas, telephone, water, sewer, and storm drain networks  are easily modeled as graphs, thereby allowing all of the many tools from graph theory to be used to analyze GIS data.



A Note on .NET’s NameValueCollection

Before we get very far, I should share some things about a data structure that .NET calls a “NameValueCollection”.  It is implemented as a hash table (actually .NET will determine how big the thing is going to be and implement either a hash table or a simpler structure, whichever will be faster, or so I’ve read in Rod Stevens, Visual Basic 2005.)  The programmer adds keys and values.  Then when he has a key, he can retrieve the value quickly. 

I should note here one very cool thing that .NET does with the NameValueCollection.  You don’t need to know at run time how many values you may wind up with for a given key.  In most cases, you have a single value for a key.  But the value returned by GetValues(key) in .NET is a List(Of String).  So in case there is only one value, it’s value(0).  But you could add several values over the course of processing and parse them when you retrieve them.  I use this capability during the trace to find nodes for electric primaries.   




Pre-Processing

Let me describe the pre-processing first.  During development, of course, this came last.  I got the trace working and then looked for ways to speed it up and wrote the pre-processing stuff then.  But the interesting aspect of the trace depends on the results of pre-processing.  The reader will need to be patient because the reasons for generating various text files during pre-processing may be poorly motivated if you don’t know what the trace logic is going to do with them.  There is a flow chart below on page 8, and I hope it’s helpful.

A word on “keys”.  GTViewer uses a one- or two-key system to link graphic elements to tabular data.  Older readers are welcome to think DMRS, MSLINK, FRAMME, etc.  In most of what follows, I will only be concerned with the second key because the first key for Bentley data as converted to GTViewer is always zero (or some other arbitrary value set at conversion time).


The VB.NET Project:  HU_BuildConnectivityTable

HU_BuildConnectivityTable builds six text files:

1.    Connectivity.txt
2.    ConductorTable.txt
3.    StopFeature.txt
4.    StopNode.txt
5.    UnitTransformerTable.txt
6.    TransformerThick.txt


1.    Building Connectivity.txt

We begin by opening a stream reader for GTViewer’s “data.txt” file, the record of tabular data for all features.  It reads line-by-line until it finds the table for “E_OHPRIMARY”.  Then it establishes offsets for the attributes "PRIMARY_STARTID_Remap" and
"PRIMARY_ENDID_Remap". 

(Why?  And what do I mean by this?  In data.txt, GTViewer has a line for the tabular data for each element.  Attribute values are delimited by pipe characters.  Each table section begins with a description of its structure, which includes a list of the attributes in the order they appear in the pipe-delimited list.  So it is a trivial programming activity in .NET to “Split” each line of attributes into an array where I know the index of the PRIMARY_STARTID_Remap and PRIMARY_ENDID_Remap attributes.  Why “_Remap”?  The conversion from Bentley to GTViewer retains the PRIMARY_STARTID and PRIMARY_ENDID attribute values as they exist in Bentley, but it also remaps these somewhat cumbersome GUID-style values to unique integer values that are easier to work with.)

Back to data.txt.  We read each line and write the GTViewer key2 and the start node (PRIMARY_STARTID_Remap) and end node (PRIMARY_ENDID_Remap), pipe delimited, to the text file Connectivity.txt.

Later on, when the user starts the trace application, this text file will be read and loaded into a NameValueCollection called “"nvcNodes” with key2 as the key.

So what?  “So what” is that at run time, I will be able to retrieve the start and end nodes of a primary if I know the graphic-tabular link of the primary.  Specific motivation for that will need to wait for the description of the recursive subroutine at the heart of the trace code.

Electricity travels over more features than OHPRIMARY, so the code also reads the “E_UGPRIMARY” and “E_SWCABBUSBAR” tables and appends the pipe-delimited key2|start_node|end_node string to the file Connectivity.txt.

(Looking ahead to a future version, the code also has provision for E_OHSECONDARY, E_UGSECONDARY, E_OHSERVICE, and E_UGSERVICE, but our data at Huntsville Utilities does not have these values sufficiently populated with accurate values so that a secondary trace can run.)


2.    Building ConductorTable.txt

Slightly different in implementation, but with similar results.  We run a GTViewer query for E_OHPRIMARY and add key2 and ”|” and  “E_OHPRIMARY” to a List(Of String).  We then run another query for E_UGPRIMARY, and another for E_SWCABBUSBAR.  Then we walk through that List and write ConductorTable.txt.

How can I motivate such a simple-minded activity?  At times during the trace, the code will have key2 for a primary, and it will need to now what table that primary is in.  In early versions of the trace, I found this table in a clunky way that took a lot of time.  Currently, I load up another NameValueCollection (hash table) called “nvcConductorTable” when the user starts the trace, so the look-up is brisk.


3.    Building StopFeature.txt

Similar to building ConductorTable.txt except that it takes two steps.  First, we build a StopFeatureList by running a query to find every switch.  That sentence hides some complexity.  I compile a GTViewer query that returns the PRIMARY_ATTACHEDEDGEID for each E_OHSWITCH.  Then I compile similar queries for each type of stop feature.  Because each of these queries only asks for PRIMARY_ATTACHEDEDGEID, I can append them into a single query.  GTViewer queries (when run in code) return the attribute(s) you ask for, and they also return the category and offset of each element they find.  The category for all primary features is 304, so we know that without the query.  But the offset is golden.  We write the PRIMARY_ATTACHEDEDGEID and “|” and the offset of the switch to a list. In .NET speak: a List(Of String) called “StopFeatureList”.

Then we walk through StopFeatureList and use the PRIMARY_ATTACHEDEDGEID value to run another query to find primaries by PRIMARY_NETID.  The PRIMARY_NETID for a primary is, of course, the same number as the PRIMARY_ATTACHEDEDGEID for a stop feature that is attached to this primary.  So as the query returns primaries which we know have stop features attached, we note the key2 of the primary and write it to a temporary list with a pipe symbol and the PRIMARY_NETID.  Then we write this list to the file StopFeature.txt.

Again, so what?  Well, after loading StopFeature.txt into a NameValueCollection called “nvcStopFeature” when the user starts the trace application, we can quickly test whether a primary has a stop feature attached.  We use the key2 of the primary as a key to nvcStopFeatures.  If we get any values, then the primary has a stop feature attached.  If not, it doesn’t.


4.    Building StopNode.txt

We use StopFeatureList built during the previous step.  We walk through it, pulling out the PRIMARY_ATTACHEDEDGEID (which, of course, is the PRIMARY_NETID of the primary that has a Switch or other stop feature) and and the offset of the stop feature.

First we build a short list called “shortList”, a 2-element array (Of String), that holds the nodes (PRIMARY_STARTID_Remap and PRIMARY_ENDID_Remap) for this primary.  The stop node will be one of these.  Perhaps both.

Then we use the offset to do a GTViewer proximity search (within 50 feet of the origin of the stop feature) to build a list of nodes called “nodeList” from nearby primaries.  (PRIMARY_STARTID_Remap and PRIMARY_ENDID_Remap)  We omit the primary that has the stop feature attached, because we already have those nodes in shortList.

Then, if nodeList contains shortList(0), then we add it to StopNodeList along with “|” and the offset of the stop feature.

Also, if nodeList contains shortList(1), then we add it to StopNodeList along with “|” and the offset of the stop feature.

Then we process the next item in StopFeatureList.

Then we write StopNodeList to the file StopNode.txt.  When the trace initializes, we load StopNode.txt into the NameValueCollection “nvcNodeOffsets”.

So now, while we search from node to node, we can quickly tell if we have reached a node where the trace should stop.  To appreciate why I take the two steps of building both StopNode.txt and StopFeature.txt, see “Special Bentley Considerations” below.


5.    Building UnitTransformerTable.txt and TransformerThick.txt

We can skip these.  They are not used for the trace proper, but are used to sum transformer load by phase for a circuit.  I build the file UnitTransformerTable.txt first and use it to help build the file TransformerThick.txt.  I do it that way because some transformers have a KVA for a phase, and some have child records with different KVA ratings for each of the children.  The point is that, as is becoming boring to you by now, I load this file into a NameValueCollection when the user starts the trace application, and then I can use the node number (on currently processing primary, which is equal to the PRIMARY_NETID of the attached transformer) to get the KVA load for each phase.




Summary of Pre-processing

File name
Contents
Meaning
Comment
ConnectivityTable.txt (used to build nvcNodes)
key2
linkage for E_ OHPRIMARY, E_UGPRIMARY, or E_SWCABBUSBAR
Written on graphic element; points to its tabular data.

node1
PRIMARY_STARTID_Remap
PRIMARY_STARTID mapped to an integer.

node2
PRIMARY_ENDID_Remap
PRIMARY_ENDID mapped to an integer.
ConductorTable.txt (used to build nvcConductorTable)
key2
linkage for E_ OHPRIMARY, E_UGPRIMARY, or E_SWCABBUSBAR
Written on graphic element; points to its tabular data.

table name: E_ OHPRIMARY, E_UGPRIMARY, or E_SWCABBUSBAR

StopFeature.txt
(used to build nvcStopFeature)
key2
linkage for E_ OHPRIMARY, E_UGPRIMARY, or E_SWCABBUSBAR that has a stop feature attached to it.
Written on graphic element; points to its tabular data.

PRIMARY_NETID
Link to PRIMARY_ATTACHEDEDGEID for the stop feature
All I really need is for key2 to have a value in the NameValueCollection.
StopNode.txt
(used to build nvcNodeOffsets)
node
PRIMARY_STARTID_Remap or PRIMARY_ENDID_Remap


offset
Position of the stop feature in the Primary category
Allows the trace to find the origin of the stop feature in order to circle it.  Other than the circle, all I need is a value in the NameValueCollection.

Looking over the summary table above, it may seem that I’m force fitting the stop feature and stop node into a NameValueCollection when all I need is a list, array, or stack.  I tried these data structures, and they were not as fast as the NameValueCollection.  My guess is that a list, array, or stack is searched in sequential order, so the seek time is N/2, whereas the seek time for a NameValueCollection (a hash table) is going to be more like a fixed value between 1 and 2. 
On my four year old Dell laptop, HU_BuildConnectivityTable runs in 121 seconds (call it 2 minutes) on a data set with 89,297 primaries, including overhead primaries, underground primaries, and bus bars.  This run time includes building the Transformer text files.  HU_TraceElectric takes about 2 or 3 seconds to initialize, which includes loading the text files into NameValueCollections.  I consider these to be acceptable times when they enable a node-to-node trace to run at about 1,000 features per second.
 

 The Trace Logic

The VB.NET project is called HU_ElectricTrace.  It is run as an external application in GTViewer.  Here is the essential section of the form shown to the user:


A simplified flowchart is presented on the next page.



Simplified Flow Chart



Sequence of Actions

The user clicks the “Trace Primary” button on the form and then clicks on a breaker at a substation (E_PRIMARYCIRCUIT).  A current limitation is that the trace can only start with a substation breaker.  Future versions will allow the user to start on any conducting feature.

Then we determine the name of the circuit from that feature and write it to the Listbox on the form.

Then we get keys for the breaker and determine its node number (PRIMARY_NETID_Remap).  This is the initialStartNode. We push that node on the node stack, push that node on the nodes visited stack, and pass it to the Node Iterator.


Node Iterator

The sub “Node Iterator” is the recursive subroutine at the heart of the trace.

It runs a special GTViewer query called a “connectivity query”, that returns 2 arrays when fed a node number.  One array is a list of key1 values.  The other is a list of key2 values.  It also returns a count of the number of values returned, so the programmer writes a For…Next loop that processes each set of key1-key2 pairs in turn.  The keys are the links to every primary that is connected to the node.  In our case, we need only use key2. 

(Now that I have learned a few things by working on this trace, I realize I could write an equivalent to the connectivity query.  It would be a NameValueCollection with the node as the key and a list of key2’s as the values.  When you define the connectivity query, you have to provide the sorts of features that can have nodes you want to consider, as well as the table and attribute names for the nodes.  For example, E_OHPRIMARY, PRIMARY_STARTID, PRIMARY_ENDID.  My code works fine using the special GTViewer query; but developers outside the GTViewer product can just as well use a NameValueCollection with the node number as key and the list of links to primaries as the values.)

Joey Rogers, designer and developer of GTViewer, has provided a note on the connectivity query:

The connectivity query in GTViewer is performing a Binary Search over records that are pre-sorted.  There is a lot of debate over using Binary Search vs. Hash Tables to look up values.  Binary Searches generally work better if you don’t have unique keys;  Hash Tables usually work better if you have unique keys.  Since there are many records with the same keys in typical GIS data, and since the connectivity query uses the same keys for each node of a feature, the Binary Search was used in GTViewer.   However, Hash tables should work better for this trace application because you can use unique keys for most/all of the different pre-processed data in the hash tables.

For each key2, we check whether it is already in a stack that contains key2 values for primaries that have already been processed.  (Since the connectivity query returns all the primaries that share that node, once you are downstream on the trace, it will return the primary you came from, which you want to disregard.  By historical accident, this stack is called “ufidStack”.  I will refer to it later.)

(At the risk of starting too many side trails, I will note here that if key2 is already in ufidStack, the code tests whether the count of connected primaries is 1.  If so, then we have a leaf node, and we add its keys to a List(Of Integer) called “leafNodes”.  I don’t use this in the current version of the trace, but in previous versions I did something with all these leaf nodes.  So I kept the ability to keep track of them, and here is where.  After this leaf node test the code skips to the …Next part of the For…Next loop.)

If key2 has not already been processed, we use it to get the table for that feature from the NameValueCollection “nvcConductorTable”.

Now we test whether the current node is a stop node and whether the primary for which we have key2 is a stop feature, using simple functions that use nvcStopNode and nvcStopFeature.  If both tests pass, we have found a stop node on a stop feature; so we add the stop feature to a highlight list for stop features and skip to the next set of keys in the For…Next loop.

If not, then we keep going.

We push the key2 of the primary onto the ufidStack.  We increment the primary counter.  We add the primary to the highlight list.  (In GTViewer, there is a concept of adding something to the highlight list and separately displaying the highlight.  This separation allows you to add lots of features to the highlight list without slowing down by updating the display.)  Based on a user setting on the trace form, we update the display after processing every nth primary or we wait until the whole circuit is complete.

Now we determine the next node to feed Node Iterator.

We look at the NameValueCollection “nvcNodes”, which was built during initialization from Connectivity.txt.  The key is the key2 for the primary; the value is the pipe-delimited list of two nodes.  The GetNextNode function merely parses out the nodes and returns the one that is not already in the nodesVisited stack. 

(Performance note:  It seems that .NET does a (slow) sequential search to find out if an item is in a List, but does a fast search to find if an item is in a Stack.  I could be wrong.  It may be that the stack of visited nodes is always relatively small (<2 and="" because="" composed="" each="" fast.="" feeder="" for="" integers="" is="" it="" of="" reinitialized="" s="" so="" span="" style="mso-spacerun: yes;" that="" why="">  Or it may be that even a sequential search of the stack from the top down will generally find the previous node soon in its path.)

Not quite done.  This new node may be a stop node.  We add it to the nodesVisited stack and test whether it is a stop node and whether the feature for which we have key2 is a stop feature.  We may have just turned up the other end of a feature that is a stop node and need to stop there.  If we do not find both stop node and feature, we push the new node onto the nodeStack and pass it to Node Iterator, which feeds the recursion. 

The very next line in the code is “nodeStack.Pop”.  It’s really magical to me.  The recursion can go down a long path, but as it unpeels, the nodeStack pops off the nodes, and can resume at a node it bypassed earlier.  When the last node on the stack is popped, it’s done.




When the Node Iterator is Done

We update the view.  In case the ShowAsYouGo box has not been checked, this is the first time the user sees the traced primaries on the screen.

We use the primary counter to show how many primaries have been processed.
  
Then we call the subroutine PostProcess.

Not too much of interest for the purposes of this paper, but I will note that if the box ShowTransformerKvaByPhase has been checked, we show a summary of load by phase.  This may have some interest for Bentley people.  For some types of transformers, it is necessary to drill down to the child records of a Transformer Bank to get the phases of the units.  The hard work is done in HU_BuildConnectivityTable where we create another text file that makes it easy (and brisk) at run time.


Special Bentley Considerations

Bentley models stop features (E_OHSWITCHBANK, E_SWCABFUSEDSWITCH, E_SWCABGROUPSWITCH) as edge devices.  Their attribute PRIMARY_ATTACHEDEDGEID maps to the PRIMARY_NETID of a conducting feature.  Therefore, stop features cannot be placed in a NODE1-NODE2 pattern.  It would simplify processing tremendously if Switches and other potential stop features had a node that matched where they connected to the primaries.  But it would require a potentially major rework of the Bentley feature model.

What the trace code has to do is test whether a node is a stop node and whether the primary it is currently processing is attached to a stop feature.  Only if both conditions are satisfied should the trace stop.  If stop features had a shared node with the primaries they connect to, this check would be simpler.

For a Bentley developer, the task of writing all these queries transforms into the task of writing a series of SQL statements.  I don’t mean to minimize the task, but it should be approachable for an SQL guy.