Old school graph storage with RDF

Graph storage systems can be pretty hard to grok, especially if you are, like me, used to relational database systems.

Recently I had the opportunity to examine a suite of technologies related to graph storage and I felt the need to archive what I discovered here.

My journey begins with the NetworkX framework. There are Python (original) and JavaScript (with D3!) implementations of this library, and it’s pretty easy to understand. All you have are vertices and edges. NetworkX provides a way to represent various graph features such as multiple trees, directionality, and weight. It also comes bundled with helpful methods for working with the graph.

A certain fork of NetworkX allows you to export your graph into the Resource Description Framework (RDF). RDF is old and, in its original serialization (RDF/XML), XML-based. The advantage of RDF is that it (1) allows you to define your own namespaces and (2) can be persisted in a relational database thanks to a triplestore.

I found namespaces to be the hardest thing to understand about RDF, mostly because they are so free-form and because they are fundamentally XML-based. The advantage of a flexible namespace is that it makes it very easy to model complex interactions between resources, even resources that don’t belong to your own namespace. One thing I discovered is that the government is very fond of namespaces between its departments.
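What finally made namespaces click for me is that a prefix is just shorthand for the front part of a URI. Here is a minimal Python sketch of that idea. The `ex` prefix and the `expand` helper are made up for illustration; `foaf` is the public FOAF vocabulary, used here as an example of a namespace you don’t own.

```python
# "ex" is a hypothetical namespace for this example; "foaf" is the public FOAF vocabulary.
PREFIXES = {
    "ex": "http://example.org/people/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(qname):
    """Turn a prefixed name such as 'foaf:knows' into a full URI."""
    prefix, _, local = qname.partition(":")
    return PREFIXES[prefix] + local


# Every RDF statement is a (subject, predicate, object) triple.
# Here our own resources are described with predicates from someone else's namespace.
triples = [
    (expand("ex:alice"), expand("foaf:knows"), expand("ex:bob")),
    (expand("ex:alice"), expand("foaf:name"), "Alice"),
]

for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Mixing vocabularies this way is what lets one graph talk about resources defined elsewhere.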

After exporting my NetworkX graph to RDF, I was able to persist it using the MySQL triplestore component. This handy feature creates the tables and indexes you need. Aside from providing it with the connection string to your MySQL instance, you don’t work with the database at all. That’s because the RDF world also describes how you should go about querying for data.
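To give a feel for what a triplestore does on top of a relational database, here is a rough sketch using Python’s built-in `sqlite3` as a stand-in for MySQL. The table layout, index names, and `neighbors` helper are my own assumptions for illustration, not what the MySQL component actually creates.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL instance; the schema below is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
# Indexes so lookups by subject or by predicate/object don't scan the whole table.
conn.execute("CREATE INDEX idx_spo ON triples (subject, predicate, object)")
conn.execute("CREATE INDEX idx_pos ON triples (predicate, object, subject)")

# A tiny graph: each edge becomes one row.
edges = [
    ("a", "linksTo", "b"),
    ("b", "linksTo", "c"),
    ("a", "label", "Start"),
]
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", edges)


def neighbors(node):
    """Return the nodes reachable from `node` over a 'linksTo' edge."""
    rows = conn.execute(
        "SELECT object FROM triples "
        "WHERE subject = ? AND predicate = 'linksTo' ORDER BY object",
        (node,),
    )
    return [row[0] for row in rows]


print(neighbors("a"))
```

The whole graph lives in one narrow table, which is why the store can set everything up without you designing a schema.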

SPARQL is the query language standardized by the W3C for exploring RDF graphs. Properly implemented, SPARQL can answer some very interesting questions about the stored graph.
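The core of a SPARQL query is a set of triple patterns with variables that must all match at once. The sketch below imitates that in plain Python so the mechanics are visible. It is a toy matcher under my own assumptions (the `match` and `query` helpers and the sample data are invented), not a SPARQL engine.

```python
def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, triples):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "knows", "dave"),
]

# Roughly: SELECT ?x ?y WHERE { alice knows ?x . ?x knows ?y }
friends_of_friends = query([("alice", "knows", "?x"), ("?x", "knows", "?y")], triples)
print(friends_of_friends)
```

Sharing `?x` between the two patterns is what turns two simple lookups into a "friend of a friend" question.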

For my POC, I wrote a simple application that you can play with here. It includes a simple REST layer for obtaining nodes and edges and a D3-based interface for exploring the graph.

Here are some screenshots of the app in all its glory.

[Three screenshots of the app, taken 2015-08-14]


Of Mikes and Davids

Mike is a professor at a reputable university. He teaches advanced machine learning and robotics, he’s finishing up his PhD in computer science, and he always has a new gadget he’s playing with.

David is a software entrepreneur. He has sold a software company or two for modest profit and he always has a business angle he’s working.

I had talked shop with Mike and David separately over the span of several months but one day the three of us managed to get together. Since Mike and David are both in the general information technology field I naturally assumed that we could find a common ground on that topic. I couldn’t have been more wrong. Mike and David clashed on almost every level and in the aftermath I learned a very important lesson about two distinctly different groups of technologists that I now classify as Mikes and Davids.

Mike is meticulous in his work. He needs to understand everything about the problem he is attempting to solve and all of the intricate mathematics behind any possible solution. As a result, it takes Mike a long time to ship a high quality product.

David, on the other hand, is focused on delivering something of value as quickly as possible. As a result, David can quickly churn out a working product which will likely need several iterations in order to work out all of the bugs.

These are fundamentally different approaches, and I believe they serve as a Rosetta Stone of sorts to help us decode the motivations and likely future actions of these two schools of thought.

Let’s say, for example, you need X. David will bang out a version of X for you after pulling a week of all-nighters wherein his kids briefly forget they had a father. What you get will work according to your specifications. But don’t expect it to be pretty. But you’ll put it into production anyway. Because why not? Several months down the road you’ll wonder why your app is so sloooow and you’ll have to go back to David to have him fix a growing list of bugs. That’s not a knock on David. That’s just the nature of his work. It’s fast and it’s to specifications.

By contrast, Mike will take a very long time to complete a task. But when he does, you will have a rock-solid solution that has been thoroughly vetted. You will also have pages of proofs and data to go along with that solution.

In short, call David when you have an idea you want to have built quickly. And after David builds it, call Mike to build out the next version that will scale.

Both approaches have their place.

Having worked with startups for a long time, I can appreciate each approach in its own way. I am a Mike or a David depending on who I’m working with. I take great pains to build up my Mike and David skillsets equally. I can use Yeoman to quickly generate a skeleton of an application. And I can use scikit-learn to discover patterns in data to make my processes more efficient.

I’ve decided the best engineers are honest with themselves about whether they are naturally more of a Mike or a David and are actively working to move towards the other end of the spectrum.

Manhandling Windows with AppleScript

I’ve been setting up a Cuckoo cluster, and the most tedious part of that involves configuring the guest VMs. To make a long story short, I needed to manually set the IP address of all my guest VMs. At first I did this the visual way by clicking through all of the Windows settings menus. This quickly got tedious, so I developed an AppleScript which opens the command prompt in administrative mode, bypassing the stupid user warning, enters the command to change the IP address of the box, and then closes the command prompt. If you have a task which involves monkeying with a Windows system using Screen Sharing on a Mac, I hope you find this little script useful.

tell application "Screen Sharing"
	activate
	
	delay 1
	
	tell application "System Events"
		-- Open an administrative command prompt
		key code 53 using control down
		delay 2
		keystroke "cmd"
		delay 1
		key code 125
		key code 126
		key code 76 using {shift down, control down}
		delay 2
		key code 123
		key code 76
		delay 2
		
		keystroke "netsh interface ip set"
		delay 1.2
		keystroke " address name"
		key code 24
		keystroke "\"Local"
		delay 1.2
		keystroke " Area Connection\""
		delay 1.2
		keystroke " static"
		delay 1.2
		keystroke space

		-- Numbers have to be entered by key code. Otherwise they are interpreted as control characters
		key code 18
		key code 25
		key code 19
		key code 47
		key code 18
		key code 22
		key code 28
		key code 47
		key code 23
		key code 22
		key code 47
		key code 18
		key code 29
		-- This is the part that needs to change
		-- key code 29 -- 0
		-- key code 18 -- 1
		-- key code 19 -- 2
		-- key code 20 -- 3
		-- key code 21 -- 4
		-- key code 23 -- 5
		-- key code 22 -- 6
		-- key code 26 -- 7
		key code 28 -- 8
		-- key code 25 -- 9
		
		keystroke space
		delay 1.2
		
		key code 19
		key code 23
		key code 23
		key code 47
		key code 19
		key code 23
		key code 23
		key code 47
		key code 19
		key code 23
		key code 23
		key code 47
		key code 29
		
		keystroke space
		delay 1.2
		
		key code 18
		key code 25
		key code 19
		key code 47
		key code 18
		key code 22
		key code 28
		key code 47
		key code 23
		key code 22
		key code 47
		key code 18
		
		delay 1.2
		
		-- Send the command
		keystroke return
		
		delay 2
		
		-- Exit the command prompt
		keystroke "exit"
		keystroke return
	end tell
end tell
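Typing out key codes by hand for every VM gets old fast, so here is a small Python sketch that generates the `key code` lines for any dotted address. The digit codes are the ones listed in the script’s comments, and 47 is the period key as used between octets above. The `key_code_lines` helper name is my own invention.

```python
# Key codes for the digit keys and the period, taken from the AppleScript above.
KEY_CODES = {
    "0": 29, "1": 18, "2": 19, "3": 20, "4": 21,
    "5": 23, "6": 22, "7": 26, "8": 28, "9": 25,
    ".": 47,
}


def key_code_lines(text):
    """Return one AppleScript 'key code N' line per character of `text`."""
    return ["key code %d" % KEY_CODES[ch] for ch in text]


# Example: the address the script above types for the guest VM.
for line in key_code_lines("192.168.56.108"):
    print(line)
```

Paste the printed lines into the script in place of the hand-written block for the address you want to set.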

A script to turn a YouTube playlist into mp3s

There are a lot of interesting lectures on YouTube. Recently I’ve taken to adding these lectures to a playlist and processing them on my Mac using the following script:

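youtube-dl -o '%(stitle)s.%(ext)s' "$YOUR_YOUTUBE_PLAYLIST"

for file in ./*.mp4; do
	echo "processing $file"
	# Quoting "$file" is enough to handle spaces; no extra escaping is needed.
	# atempo=2.0 doubles the speed; the pan filter mixes both channels into the right one.
	ffmpeg -i "$file" -filter:a "atempo=2.0, pan=stereo|c1<c0+c1" -c:a libmp3lame -q:a 4 "$file.mp3"
done

rm -f ./*.mp4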

You’ll need youtube-dl and ffmpeg installed for this script to work. Both of these are available from brew.

This script also encodes the audio at 2x speed and combines the left and right channels into the right channel. Remove the filter flag if you don’t want your files processed this way.

LiveStream chat-only mode

Recently I’ve taken to participating in a LiveStream of a Sunday School I physically attend. Because I’m there in person I don’t need the video, so here’s what I use to remove the video and expand the chat pane to fill up the resulting free space.

$('.player-wrapper').remove();
$('.chat_wrapper').css('width','100%');

An idiomatic way of converting an Option[String] into an Option[Int] in Scala

This always returns an Option[Int], yielding None when the string is missing or is not a valid integer:

import scala.util.control.Exception.catching

Option("bad") flatMap { s => catching(classOf[NumberFormatException]) opt s.toInt }

JavaScript replace on a capture group

I ran into a problem recently where I needed to perform a regex replace on a string and also manipulate the string captured in a capture group at the same time. What I discovered is that it’s valid to pass a function as the second argument to the replace function. That function receives the full match as argument 0 and the capture groups as arguments 1 and up.

So here’s my code to capture a person’s name and escape it.

story = story.replace(/person\.go\?ID=\d+"\s*[^>]*>([^(<)]+)<\/a>/g, function() {
     return 'person.go?ID='+escape(arguments[1].replace(/\./g,''))+'">'+arguments[1]+'</a>';
});

iOS 7 form input patch

iOS 7 appears to have broken input fields for a number of web applications. Input fields now take two taps before the user can enter data, even though the keyboard is brought up after only one tap. Here’s a hack to fix the input fields for any of your web apps that iOS 7 broke.

if(window.navigator.standalone) {
    var arr = document.getElementsByTagName("input");
    var len = arr.length;
    for(;len--;) {
        arr[len].addEventListener('touchstart',function(ev){
            var tel = ev.target;
            setTimeout(function() {
                tel.focus();
            }, 150);
        });
    }
}

Getting useful index information from MongoDB

Here is a MongoDB script for presenting index information in a more concise way than getIndexes() provides. This script also presents an index’s total size along with a breakdown of its size on all of the shards.

//mongo --eval="var collection='file';"

var ret = db[collection].getIndexes().map(function(i){
    return {"key":i.key, "name":i.name};
});

var o = {};
for(r in ret) {
    o[ret[r].name] = ret[r].key;
}

var cstats = db[collection].stats();
for(k in cstats.indexSizes) {
    o[k].totalsize = cstats.indexSizes[k];
}

var shardinfo = cstats.shards;
for(s in shardinfo) {
    for(k in shardinfo[s].indexSizes) {
        if(!o[k].shards) o[k].shards = {};
        o[k].shards[s] = shardinfo[s].indexSizes[k];
    }
}

printjson(o);

Produces the following output:

{
    "_id_" : {
        "_id" : 1,
        "totalsize" : 50501459568,
        "shards" : {
            "shard0000" : 18620766416,
            "shard0001" : 18117909712,
            "shard0002" : 13762783440
        }
    }
}
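To make the shape of that output easier to follow, here is the same merge logic sketched in Python against a hand-built stats dictionary. The `summarize` helper is hypothetical, and the sample dictionary only mirrors the fields the script reads (`indexSizes` and `shards`); the numbers come from the output above.

```python
# Sample data shaped like the fields the Mongo script reads; values are from the output above.
indexes = [{"name": "_id_", "key": {"_id": 1}}]
stats = {
    "indexSizes": {"_id_": 50501459568},
    "shards": {
        "shard0000": {"indexSizes": {"_id_": 18620766416}},
        "shard0001": {"indexSizes": {"_id_": 18117909712}},
        "shard0002": {"indexSizes": {"_id_": 13762783440}},
    },
}


def summarize(indexes, stats):
    """Merge index keys, total sizes, and per-shard sizes into one dict per index."""
    summary = {ix["name"]: dict(ix["key"]) for ix in indexes}
    for name, size in stats["indexSizes"].items():
        summary[name]["totalsize"] = size
    # Unsharded collections have no "shards" entry, so default to an empty dict.
    for shard, info in stats.get("shards", {}).items():
        for name, size in info["indexSizes"].items():
            summary[name].setdefault("shards", {})[shard] = size
    return summary


summary = summarize(indexes, stats)
print(summary)
```

A quick sanity check on the numbers: the three shard sizes add up to the reported total size.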


Simple Scala Map/Reduce Job

I was recently tasked with writing a Hadoop map/reduce job. This job had the requirement of taking a list of regular expressions and scouring hundreds of gigs’ worth of log files for matches. Since I’ve been leaning more and more towards Scala, I wanted to use it for my job, but I also wanted to use Maven for my job’s package management to make the job easy to set up and extend. Finally, I wanted to have unit tests for my mapper and reducer and an overall job unit test. The result is this project I posted to GitHub as a template for future projects. I hope it proves as helpful for others as I’m sure it’ll be for me.
