Tuesday, March 21, 2017

Remove Nuget from your solution

If you have installed a package in Microsoft Visual Studio using Manage NuGet Packages, back in the day you might have used "Enable NuGet Package Restore" to make Visual Studio automatically restore these packages.

However, "Enable NuGet Package Restore" is the old way of doing things (technically called MSBuild-integrated restore). If you skip that step, you are automatically on "NuGet automatic restore," which is the recommended way of doing things.

This is incredibly counterintuitive: "Enable NuGet Package Restore" sounds like exactly what we need, even though we already have the better option without knowing it. So almost everyone has at some point clicked on "Enable NuGet Package Restore." And then, if you're doing any serious development, you probably use SVN or Git. Did you commit the NuGet.exe and other files that the above option adds to your solution? I'm guessing you probably did, and I would not be surprised. Some of you may not even realize that there is another option.

So how do you undo "Enable NuGet Package Restore" and go back to "NuGet automatic restore"? (In)conveniently, Visual Studio has no menu option to do this!

What changes does "Enable NuGet Package Restore" make?

To answer this question, I made a new project in Visual Studio 2013, added a NuGet package - JSON.NET - to the project, and committed this to a local SVN repository. Then I did "Enable NuGet Package Restore" and did a diff for changes. Here's my change-log:
  1. This adds the ".nuget" folder, with files "NuGet.exe," "NuGet.config," and "NuGet.targets" to the solution. So the .sln is updated, and the above folders and files are made.
  2. It also adds the following changes to the .csproj file:

If you have multiple projects in your solution, such entries can exist in more than one .csproj file, depending on whether you added NuGet packages to them or not.

Hence, to remove MSBuild-integrated restore

And so, to remove "Enable NuGet package restore" from your solution, basically unwind the effects of the above changes.
  1. Delete the .nuget folder and all its contents from your solution.
  2. Remove the NuGet added XML for , and tabs. Note that in the above I also have the element attributed to NuGet. This seems to be the case - this element also gets inserted by NuGet. And while its removal may not be essential to migrating to new style package management, I have removed this at least a few times from various projects, without issues.
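The clean-up in step 2 can also be scripted. Below is a minimal Python sketch, not a definitive tool. It assumes the NuGet-added entries are the ones Microsoft's migration guidance names: a RestorePackages property, an Import of .nuget\NuGet.targets, and a Target named EnsureNuGetPackageBuildImports. The function name strip_nuget_restore and the sample project text are made up for illustration; compare against the diff of your own .csproj before relying on it.

```python
import xml.etree.ElementTree as ET

MSBUILD_NS = "http://schemas.microsoft.com/developer/msbuild/2003"
# Keep the MSBuild namespace as the default one when writing the XML back out.
ET.register_namespace("", MSBUILD_NS)


def strip_nuget_restore(csproj_xml):
    """Return the project XML without the (assumed) NuGet restore entries.

    Note: ElementTree's default parser drops XML comments.
    """
    root = ET.fromstring(csproj_xml)

    def qualified(tag):
        return "{%s}%s" % (MSBUILD_NS, tag)

    for parent in list(root.iter()):
        for child in list(parent):
            is_restore_flag = child.tag == qualified("RestorePackages")
            is_targets_import = (
                child.tag == qualified("Import")
                and "nuget.targets" in child.get("Project", "").lower()
            )
            is_ensure_target = (
                child.tag == qualified("Target")
                and child.get("Name") == "EnsureNuGetPackageBuildImports"
            )
            if is_restore_flag or is_targets_import or is_ensure_target:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, trimmed-down project file used only to exercise the function.
SAMPLE_CSPROJ = r"""<Project ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <AssemblyName>Demo</AssemblyName>
    <RestorePackages>true</RestorePackages>
  </PropertyGroup>
  <Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets" />
  <Import Project="$(SolutionDir)\.nuget\NuGet.targets" Condition="Exists('$(SolutionDir)\.nuget\NuGet.targets')" />
  <Target Name="EnsureNuGetPackageBuildImports" BeforeTargets="PrepareForBuild" />
</Project>"""

cleaned = strip_nuget_restore(SAMPLE_CSPROJ)
print(cleaned)
```

To apply it to a real project, read the .csproj text, pass it through strip_nuget_restore, and write the result back, ideally on a clean working copy so that the diff can be reviewed before committing.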


Thursday, March 16, 2017

Fiddler Reverse Proxy

I manage a web server. It is an internal web server used for testing traffic, but it is a web server all the same. And since it is not a production web server, I have a fair amount of carte blanche to try things on it and learn.

One of the things I wanted to do was to watch the incoming requests and their responses using Fiddler on the web server itself. It turns out this is not obvious to do (unlike outbound traffic from your web apps, which you can watch by running Fiddler and running your app pool as the logged-in user). Here’s what I found out.

HTTP reverse proxy

Setting up an HTTP reverse proxy is pretty simple. Let’s say you have a web server that’s serving up content on HTTP (port 80, or whatever other port you have configured). You can make Fiddler listen on another port, let’s say port 8889, and forward that traffic to port 80 while also recording it. Now if you open your web browser and browse to http://yourserver.yourdomain.ext:8889, Fiddler will reroute that traffic to port 80 and capture it for you. To do this:
  1. Open registry editor (run > regedit), go to “HKEY_CURRENT_USER\SOFTWARE\Microsoft\Fiddler2” and add a DWORD named “ReverseProxyForPort” with decimal value = 80 (or whatever port your http server is listening to). This will cause fiddler to redirect traffic it captures onto port 80.
  2. Within Fiddler, Tools menu > Fiddler Options > Connections tab, specify the port number for Fiddler to listen on (we’ll use 8889), and check “Allow remote computers to connect.” This requires restarting Fiddler, plus you may see the User Account Control warning for enabling firewall rules to allow the port you are listening on (8889) to be exposed to the network.
Now, with fiddler running on the web server, browse to your site from another computer, with port 8889 specified in the url, and you should be able to watch the traffic on fiddler.

HTTPS reverse proxy

Creating an HTTPS reverse proxy is slightly different, since the “Allow remote computers to connect” option in Fiddler above does not support the SSL handshake. So instead you have to use the Fiddler QuickExec box (The black box below the left pane – you can get to it with ALT+Q) and issue the listen command to make an ssl listener. Also, since the ReverseProxyForPort registry key above does not do SSL, you have to Customize Fiddler rules and modify the OnBeforeRequest to reroute the SSL request.
  1. Modifying the OnBeforeRequest method: To do this, within Fiddler, go to Rules menu > Customize rules. There, find the OnBeforeRequest method. At the very top of this method, add code to catch connections to yourserver:8888 (8888 is the port we’ll use for our proxy) and route that to port 443 (or whatever port your SSL end point listens on)
    if ((oSession.HostnameIs("yourserver")) && (oSession.oRequest.pipeClient.LocalPort == 8888)){
        oSession.Host = "yourserver:443";
    }
  2. Next, make fiddler listen with SSL on port 8888. To do this, go to the QuickExec box (either click in the black box below the left pane, or use ALT+Q), and enter this command:
    !listen 8888 yourserver
This causes Fiddler to start listening, and you get a confirmation dialog to the same effect. Now if you open your browser and navigate to https://yourserver:8888, you can see the traffic in fiddler, and it gets rerouted with SSL onto port 443.

Advanced stuff

For both HTTP and HTTPS, you could run your web server on a different port than 80 or 443 respectively. Then you can set up the reverse proxy on 80 or 443 respectively, and now you can watch traffic without having to explicitly specify the port.

Nasty stuff

The observant among you have noticed that in the OnBeforeRequest, you are setting a new host endpoint. This actually does not have to be on the same machine. For example, in the above, I could instead do oSession.Host = "www.google.com:443"; and it works. Now you can navigate to https://yourserver:8888 and you will get google.com but you will also be watching the traffic.

Let’s say you are the network manager of an enterprise network. You could override the local DNS server to point google.com at another machine (middleman), and within that machine explicitly set google.com back to the right IP address in the local hosts file. Next, run the proxy on port 443 on your middleman machine and route it back to google.com:443. Since the middleman itself has google at a different IP, this won’t cause infinite recursion. So now you’re watching your entire enterprise’s google traffic.

Please don’t use this to do harm. The above is a theoretical discussion of how bad you can get; but be warned that this could get you in jail. However, you may find need to put a middleman to debug why a website/webservice is misbehaving – make sure your users are aware that this is happening.



Monday, September 26, 2016

Switching logged in user for Microsoft Visual Studio

Reference: http://stackoverflow.com/questions/19517673/how-can-i-switch-my-signed-in-user-in-visual-studio-2013
  1. Close Visual Studio
  2. Start the Developer Command prompt installed with Visual Studio as an administrator.
  3. type 'devenv /resetuserdata' ('wdexpress /resetuserdata' for Express SKUs)
    • C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE
  4. Start Visual Studio Normally.

Friday, June 24, 2016

Setting up Spark on top of Hadoop

In a previous post, I describe setting up a hadoop 2.7.2 two node cluster:

In this post, I describe how to set up and run Spark 1.6.1 on top of that Hadoop cluster.

First, download the spark-1.6.1-bin-hadoop2.6.tgz package from the Spark download page at http://spark.apache.org/downloads.html. Extract this somewhere on your filesystem. Now, in the post where we did the hadoop setup, we extracted and configured hadoop identically on all nodes in the cluster. However, for spark, I only extracted and set it up on one of the nodes - the master. I'm not sure if it needs to be the master, or if it can be done on any single slave node - likely it can be.

Next, export SPARK_HOME to point at your extracted folder, and add the Spark bin folder to your path, in the .bashrc file.
export SPARK_HOME=<path-to-your-folder>
export PATH=$PATH:$SPARK_HOME/bin
Also make sure that HADOOP_CONF_DIR is exported, or export it if it isn't already. You can check your environment (printenv HADOOP_CONF_DIR is one way) to see if it is exported, and if it is not, then you can add it to the .bashrc file as
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
Remember that you need to restart bash / close & reopen the terminal for the new .bashrc to take effect.

Now, start hadoop hdfs and yarn if not already running
Next, start the interactive spark-shell on top of yarn
spark-shell --master yarn
Note: the spark documentation ( http://spark.apache.org/docs/latest/running-on-yarn.html ) says to use the command spark-shell --master yarn --deploy-mode client. However, the interactive shell can only run as client, so I think the latter parameter is moot.

At this point, after a little waiting, you'll get to the spark scala prompt.
You can type
sc
at the prompt, and this will show you that this variable is the Spark context available to you.

So, let's do a quick Spark data load. Assume that we have a large text file called bigdata.txt on our Hadoop cluster under the hadoop user. We'll load this file with the command:
val lines=sc.textFile("hdfs:/user/hadoop/bigdata.txt")
Note that we don't use a double slash after the hdfs: scheme. If you use a double slash, you have to specify the server and port as well, such as hadoop-master:9000. See: http://stackoverflow.com/questions/27478096/cannot-read-a-file-from-hdfs-using-spark
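To see why the number of slashes matters, it helps to look at how a generic URI parser splits these strings. The sketch below uses Python's urllib.parse purely as a stand-in (Hadoop does its own parsing in Java, so treat this as an analogy rather than Hadoop's actual behavior); hadoop-master:9000 is the example host and port from the note above, and describe is a made-up helper name.

```python
from urllib.parse import urlsplit


def describe(uri):
    """Split a URI into the pieces a generic URI parser sees."""
    parts = urlsplit(uri)
    return {"scheme": parts.scheme, "authority": parts.netloc, "path": parts.path}


# Single slash: the authority (host:port) is empty and the whole rest is the path.
single = describe("hdfs:/user/hadoop/bigdata.txt")

# Double slash with host and port: the authority is explicit.
full = describe("hdfs://hadoop-master:9000/user/hadoop/bigdata.txt")

# Double slash without a host: "user" is taken to be the host name,
# and the path loses its first component.
wrong = describe("hdfs://user/hadoop/bigdata.txt")

for result in (single, full, wrong):
    print(result)
```

The third case is the trap: after a double slash, the first path component is read as the server name, which is why the server and port have to be spelled out in that form.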

Next, let us count the number of lines in this file. The "lines" variable now contains an RDD object, because that is what the textFile() API returns. We can use the count() member of the RDD to count the number of lines.
lines.count()
This, assuming your data file was large enough to span a few HDFS blocks, will cause a Spark DAG to run on YARN to count the number of lines.

Finally, CTRL+D to quit spark shell.

Another good example I tried was this Spark Word Count example from Roberto Marchetto at http://www.robertomarchetto.com/spark_java_maven_example

I found a large text file that spans a few HDFS blocks, and ran that example on the hadoop cluster using the command:
spark-submit --class org.sparkexamples.WordCount --master yarn sparkexample-1.0-SNAPSHOT.jar /user/hadoop/bigdata.txt wc-output
Where the jar file is the result of mvn package, bigdata.txt is the input file on hdfs and wc-output is the folder on hdfs into which spark will write output.

This: http://spark.apache.org/examples.html is another interesting spark example/tutorial to try out.

Sunday, May 29, 2016

Using apt on ubuntu

For a while, I've been an ardent fan of the Synaptic package manager, and have delegated the gory details of dealing with the apt command line to it.

However, recently, I started playing with docker (https://www.docker.com) a little. Also, I'm building a 10 node raspberry pi cluster ( http://varghese85-cs.blogspot.com/2016/05/10-node-raspberry-pi-cluster-part-1.html ). Towards this I'll be running the Pis headless, using Ubuntu-Mate's convenient command:
graphical disable/enable
That means I need to learn the gory details of using apt on the command line.

NOTE: apt needs to run as root. In the examples below, I'm assuming you are running as root. If not, prefix sudo appropriately.

The first lesson is apt vs. apt-get.
Historically, the command has been apt-get. More recently, the newer apt command has become the recommended front end for interactive use; apt-get is not going away and remains the safer choice for scripts. There isn't much of a difference for everyday use; the common subcommands are the same. Apt also has a little more color and output formatting than apt-get. Otherwise, they are a horse apiece.

Finding the right packages:
On ubuntu, apt supports auto-completion on package names. So you can for instance do
apt install pyt
and hit the tab key, and you can see a list of all options (except there are some 4000+ with the pyt prefix, so try something else).

apt list will print a list of all available packages. Obviously, this is too long a list, but you can pipe it to grep. Also, you can specify a wildcard search expression to apt list. For example (quoted so that the shell does not expand the wildcard itself):
apt list '*java*'
Additionally, there are some parameters you can pass to apt list. For instance,
apt list --installed
will list all the installed packages
apt list --upgradable 
will list all the packages that have upgrades available.

To search for a package, use apt search. For example,
apt search condor 
lists all packages that have condor in the name or description.

To see full details of a package, use
apt show <packagename>

Keeping your system up to date:
The usual way of updating your system is:
apt update; apt upgrade
The first command fetches the new list of packages. The second command does the actual upgrade.

Installing a specific package:
It usually helps to have done an apt update (or that and an apt upgrade) before installing a new package. To install a package, do
apt install <packagename>

Removing a package:
To remove a package, one would usually do
apt remove <packagename>
However, this does keep some of the configuration etc around. So to fully remove a package, one should do:
apt --purge remove <packagename>
Now, when you install a package, that pulls in dependencies. So when you remove a package, some dependencies it pulled in may have become no longer needed. So to remove those, you can do
apt --purge autoremove
Remember to not include the --purge option if you want the configuration to stick around.

If apt was interrupted:
Another common problem is apt being interrupted, for example if you hit CTRL+C on the console while apt was running. This can leave the packages in a broken state. To fix that, run
[sudo] dpkg --configure -a

Saturday, May 28, 2016

Setting up htcondor on ubuntu 16.04

Note: To set up a cluster, you will need some form of DNS name service, or your /etc/hosts should map hostnames to actual IP addresses rather than local loopback 127-prefix addresses. (If not, the most likely consequence is that condor_status will give you a blank.)

Condor brings back fond memories: it was my old workplace, since I was a research assistant with them while I was a graduate student doing my Master's at the University of Wisconsin - Madison. So it was a pleasant surprise to find that Condor is available as a standard package on Ubuntu.

To install condor:
sudo apt install htcondor
As part of this installation process, you'll be run through the HTCondor configuration screens. On these screens:
  • Manage initial HTCondor configuration automatically? Yes
  • Perform a "Personal HTCondor installation"? No (since we're setting up a cluster)
    • Note: If you just want to setup a single node installation, you say yes here. This will bind condor to the local loopback 127 prefix address, and will not ask you any of the below questions.
  • Role of this machine in the HTCondor pool:
    • select appropriate roles
  • File system domain label: <leave blank>
  • user directory domain label: < leave blank > (unless you know what you're doing)
  • Address of the central manager:  <your central manager >
  • Machines with write access to this host: < comma separated list >
Note that if you use machine names in the above and you don't have a DNS nameserver, you'll need to set up your /etc/hosts with the address resolution.

Next, edit /etc/condor/condor_config.local and add the following lines (the file is blank to start)
CONDOR_ADMIN = <username>@<hostname>
CONDOR_HOST = <hostname>
ALLOW_WRITE = <comma separated list >
PS: to inspect a particular variable's value and where it is coming from, you can use
condor_config_val -v <VARIABLE>
At this point, you can run condor by starting the service using
sudo service condor start
However, one would rather set it up to autostart. You can do that with
sudo update-rc.d condor enable
All of the above steps are done on all hosts. The CONDOR_HOST variable is what decides which host is your master.

Now that you have a running condor pool, issue
condor_status
to see the status of the pool

To run a sample job, let us first create a sample job. We'll use reference 2 listed at the top, and steal the sample c code from there:
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <unistd.h>   /* sleep */

int main(int argc, char **argv)
{
    int sleep_time;
    int input;
    int failure;

    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer> \n");
        failure = 1;
    } else {
        sleep_time = atoi(argv[1]);
        input      = atoi(argv[2]);

        printf("Thinking really hard for %d seconds...\n", sleep_time);
        sleep(sleep_time);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    return failure;
}
This can be compiled by:
% gcc -o simple simple.c
Create a submit file with contents:
Universe   = vanilla
Executable = simple
Arguments  = 4 10
Log        = simple.log
Output     = simple.$(Process).out
Error      = simple.$(Process).error
Queue

Arguments = 4 11
Queue

Arguments = 4 12
Queue
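The $(Process) macro in the Output and Error lines is what keeps the jobs from overwriting each other's files: HTCondor numbers the jobs queued from one submit file starting at 0. As a rough illustration only (plain Python string substitution, not HTCondor itself; the helper name expand is made up), the three argument sets above would map to file names like this:

```python
def expand(template, process):
    """Mimic HTCondor's $(Process) substitution for one job."""
    return template.replace("$(Process)", str(process))


arguments = ["4 10", "4 11", "4 12"]  # one entry per queued job
for process, args in enumerate(arguments):
    out = expand("simple.$(Process).out", process)
    err = expand("simple.$(Process).error", process)
    print(f"job {process}: simple {args} -> {out}, {err}")
```

All three jobs share the single simple.log file, since the Log line does not use the macro.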
Submit job using the command:
condor_submit submit
And check the status of the job using the command:
condor_q
If the job gets held, release it by using the command:
condor_release -all
My condor install kept setting some of my jobs to held state. So I ran a
watch -n 10 condor_release -a
to keep releasing jobs every 10 seconds. Haven't figured out the right fix on this one yet.

Sunday, May 22, 2016

10 Node Raspberry Pi Cluster - Part 1

Some time back I posted about a BitScope 10 Raspberry Pi rack ( http://varghese85-cs.blogspot.com/2016/03/raspberry-pi-racks.html ). I recently purchased the same, as well as some accessories to go along with it, to make a 10 node Raspberry Pi cluster.

PS: I was impressed it only took 4 days for it to ship to the US from Australia !

The idea is to use an older-model Lenovo Thinkpad charger, 170 W or 135 W (I found one for under $35). You can buy Lenovo tip adapters that convert from the old tip to the new tip. I bought one of these and cut the wire to make my power supply for the cluster. Each Raspberry Pi takes 5 V at 2 A = 10 W. The BitScope rack can accept anything between 7 V and 48 V DC, so the 20 V DC out of the Thinkpad chargers works just fine. And since there are 10 Raspberry Pis, that's 10 x 10 W = 100 W. So a 170 W or 135 W charger allows for some leeway with power dissipation as well (not that the Pi will take the full 10 W running headless with no peripherals connected; you probably could get away with a 90 W charger, but why risk it).
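The power budget above is easy to double-check. Here is a small Python sketch of the same arithmetic, using only the figures quoted in this post (5 V at 2 A per Pi, 10 Pis, 20 V charger output). It assumes the worst case where every board draws its full rating, and it ignores conversion losses in the rack.

```python
PI_VOLTS = 5.0
PI_AMPS = 2.0
PI_COUNT = 10
CHARGER_VOLTS = 20.0

watts_per_pi = PI_VOLTS * PI_AMPS        # worst case per board
total_watts = watts_per_pi * PI_COUNT    # worst case for the whole rack
amps_at_charger = total_watts / CHARGER_VOLTS


def headroom(charger_watts):
    """Watts left over if every Pi draws its worst-case power."""
    return charger_watts - total_watts


for rating in (90, 135, 170):
    print(f"{rating} W charger: {headroom(rating):+.0f} W headroom")
print(f"Worst-case draw: {total_watts:.0f} W, i.e. {amps_at_charger:.0f} A at 20 V")
```

A negative headroom for the 90 W charger is the "why risk it" case: it only works if the Pis stay well below their rated draw.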

The above picture shows the Thinkpad charger, an un-mutilated tip adapter on the left, and the one I cut and plan to splice the wires on the right.

I ordered 10x Raspberry Pi3 boards from MCM Electronics ( http://www.mcmelectronics.com/product/83-17300 ) since they seem to be the only place that's not fleecing you by marking the pis up over the retail value. Unfortunately though, that is the one piece of the puzzle that hasn't shipped out to me yet.

The remaining pieces of the puzzle are all here in this picture:
APC UPS, Netgear 16 port 10/100 switch (the Pis only have 10/100 ethernet anyway), 10x 1.5ft Ethernet cables, 10x 32GB MicroSD cards (why waste money on larger cards since this is an experiment anyway), the power supply pieces described above, 14gauge connecting wires (14 gauge is probably overkill - you can run over 10ft of it at 100Watts) etc. Oh, and I included the one Raspberry Pi 3 that I had, since the 10 piece shipment hasn't reached me yet.

The idea:

I plan to do three things
(1) Heterogeneous Hadoop cluster: 
I picked up a used Lenovo Thinkpad T430 off craigslist, that I run Ubuntu Mate 16.04 on. I plan to make an 11 node heterogeneous hadoop cluster with the T430 as the master / namenode & resource manager, and the 10x Raspberry Pis as datanodes. I'll have IP assignments configured on my wireless router, into which I will plug the switch above. The T430 will be part of the network over wireless, but the datanodes will all be on the wired network.

(2) Homogeneous Hadoop cluster:
Given the explanation for the heterogeneous cluster above, the homogeneous case is a trivial simplification. Use one of the Raspberry Pis or maybe the one I already have as the master node.

(3) CoreOS / Mesos cluster:
This is the one I have very little clarity on at this point. The idea is to set up a CoreOS ( https://coreos.com/ ) or Apache Mesos ( http://mesos.apache.org/ ) cluster using the Raspberry Pis and run docker containers on it. It would be cool to create a fake application (like a library management / inventory management system or something) that uses the LAMP stack (MySQL container, HTTPD container etc) to deploy on the cluster as well. It would also be cool to have HAProxy on a container load-balancing a bunch of stateless HTTPD containers, all connecting to the MySQL container. We'll see how far I get on this one!


My Home Office ;)

Just felt like putting up a side note post up here:

My home desk :)
  • Each display is a separate physical computer
  • Two different architectures
    • Intel Thinkpads - Intel
    • Raspberry Pi 3 - ARM
  • Two different Operating Systems
    • Raspberry Pi 3 on Ubuntu Mate 16.04 armv7
    • Thinkpad T430 on Ubuntu Mate 16.04 amd_64
    • Thinkpad T450s on Windows

Friday, May 20, 2016

Docker Notes part 1

I've been learning Docker through John Willis's tutorials:

Here are my notes from the tutorials. As usual, these are for my own reference, but put on a public forum in the hope others may also find them useful.

#1 - Installing docker

apt install docker.io
add user to the docker group

docker version
docker -v
docker info

#2 - Docker run

docker ps
docker run busybox
docker ps -a
docker run -i busybox
docker run -it busybox
docker run -d busybox
docker run -it -v /volume busybox
docker restart <tag/volume>
docker rm <tag>

cid=$(docker run < >)
docker <command> $cid

Explicitly set container name:
--name <name>

Run a command inside the container:
docker exec <tag> <command>

docker inspect <tag>
docker history <image>

#3 - volumes

Mount a host folder onto the container as a volume
docker run -it -v <path on host>:<path on container> <tag>
eg: docker run -it -v /home/ubuntu/docker-shared:/shared busybox

Flag to mount read only:
-v <host path>:<container path>:ro

docker ps -q   - gives ids in a list to pass to other commands
docker kill $(docker ps -q)
docker rm $(docker ps -aq)

#4 - more on run

Search for particular images
docker search <image>

Pull a particular image to local storage:
docker pull <image>

List all locally available images:
docker images

Output from command that was run in a container
docker logs <tag>

$(docker ps -l)   - the last container

docker stats <tag>
docker top <tag> -ef   - similar to ps -ef

Docker run param to set metadata

Docker inspect formatted to show labels:
docker inspect --format '{{.Name}} {{.Config.Labels.<key>}}' <tag>

Flag to set limits: --ulimit <params>

#5 - Networking

ip a   (or ip address in full)   - shows the ip address
brctl show docker0

Bring up a shell without disrupting the container
docker exec -it <tag> /bin/sh

To lookup ip address of the container
docker exec <tag> ip a

apt install traceroute
traceroute <destination>

To watch the iptables rules that docker sets up as you expose/map container ports
sudo iptables -t nat -L -n

When using docker run, to map ports:
-P      - capital P maps all exposed ports on the container to high numbered ports on the host
-p <host port>:<container port>    - explicitly map ports.

HAProxy load balancer (Note to self: read up on this sometime)

Some images used in this tutorial
- wordpress
- httpd
- mysql

#6 - Dockerfiles

FROM ubuntu:14.04
RUN apt-get -y install apache2
CMD ["/usr/sbin/apache2ctl", "-D", "FOREGROUND"]
To build the docker file
docker build -f <filename> -t <imagename> .

So for the above:
docker build -f apache-ex1 -t apache-ex1 .

docker images    - lists images

Remove an image from local store:
docker rmi <imagename>

Flag for build to force rebuild:  --no-cache=true

Free form:
RUN apt -y install apache2
CMD /usr/sbin/apache2ctl -D FOREGROUND
Array form:
RUN ["apt", "-y", "install", "apache2"]
CMD ["/usr/sbin/apache2ctl", "-D", "FOREGROUND"]
The difference is that the free form is run through /bin/sh -c, whereas in the array form the first element is the executable and the rest are its arguments.
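Docker's documentation calls these the shell form and the exec form. As a rough model of the difference (a Python sketch, not Docker's actual code; docker_argv is a made-up helper, and /bin/sh -c is the default shell on Linux images), the exec form is read as a JSON array and used as the argument vector directly, while anything else is handed to the shell as one string:

```python
import json


def docker_argv(value):
    """Return the argv a RUN/CMD value would roughly turn into on Linux."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        parsed = None
    # Exec (array) form: a JSON array of strings is used as-is.
    if isinstance(parsed, list) and all(isinstance(arg, str) for arg in parsed):
        return parsed
    # Shell (free) form: the whole string becomes one argument to the shell.
    return ["/bin/sh", "-c", value]


print(docker_argv('["/usr/sbin/apache2ctl", "-D", "FOREGROUND"]'))
print(docker_argv("/usr/sbin/apache2ctl -D FOREGROUND"))
```

One practical consequence of this model: an array written with single quotes is not valid JSON, so it falls through to the shell form instead of being treated as an argument list.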

Two ways to get ip address
(1) docker exec $cid ip a
(2) nid=$(docker inspect --format '{{.NetworkSettings.IPAddress}}' $cid)

Where does the index.html live in apache2
docker exec -it $cid /bin/sh
find / -name index.html

FROM ubuntu:latest
RUN  \
  apt-get update && \
  apt-get -y install apache2
ADD index.html /var/www/html/index.html
CMD ["/usr/sbin/apache2ctl", "-D", "FOREGROUND"]
To build this:
docker build -f apache-ex3 -t apache-ex3 .

docker ps shows port mappings

Two ways to map port when invoking docker run: -P and -p above.
To run above:
cid=$(docker run -itd -P apache-ex3) 
or cid=$(docker run -itd -p 8080:80 apache-ex3)
ipaddr=$(docker inspect --format '{{.NetworkSettings.IPAddress}}' $cid)
curl $ipaddr

FROM ubuntu:latest

VOLUME ["/var/www/html"]
ADD index.html /var/www/html/index.html
RUN  \
  apt-get update && \
  apt-get -y install apache2
CMD ["/usr/sbin/apache2ctl", "-D", "FOREGROUND"]
To build and run this:
docker build -f apache-ex4 -t apache-ex4 .
cid=$(docker run -itd -v ~/docker/:/var/www/html/ -p 8080:80 apache-ex4)
curl localhost:8080

FROM ubuntu:latest

MAINTAINER Matt Varghese

# Change this if you want to prevent cached build

VOLUME ["/var/www/html"]
WORKDIR /var/www/html

ADD index.html /var/www/html/index.html

RUN  \
  apt update && \
  apt -y install apache2

# this fixes the command to this executable
ENTRYPOINT ["/usr/sbin/apache2ctl"]
# the parameters may be modified at run
CMD ["-D", "FOREGROUND"]
Note the WORKDIR: this means that if you do
docker exec -it <tag> /bin/sh
you'll land in the /var/www/html folder rather than the / folder.

Notice also the ENTRYPOINT + CMD split. ENTRYPOINT specifies the executable, and CMD specifies the arguments. This means that when you run the docker image, the specified entry point is the executable that runs; appending a command to docker run no longer replaces it, and you would have to pass --entrypoint to override it (the default is /bin/sh, which allows you to pass some command to it). So something like
docker run -it apache-ex5 /bin/sh
will fail now, with a terminal dump from /usr/sbin/apache2ctl saying '/bin/sh' is not a legitimate action.


docker apt-get issues and DNS problems

I had trouble running a docker build because
apt-get update
apt-get -y install apache2 
won't work due to DNS resolution failures

Found these posts that explained why:
Reference1: http://stackoverflow.com/questions/24832972/docker-apt-get-update-fails
Reference2: http://stackoverflow.com/questions/24151129/docker-network-calls-fail-during-image-build-on-corporate-network

Quoting the solution:
Those Google servers weren't accessible from behind our firewall, which is why we couldn't resolve any URLs.
The fix is to tell Docker which DNS servers to use. This fix depends on how you installed Docker:
Ubuntu Package
If you have the Ubuntu package installed, edit /etc/default/docker and add the following line:
DOCKER_OPTS="--dns <your_dns_server_1> --dns <your_dns_server_2>"
You can add as many DNS servers as you want to this config. Once you've edited this file you'll want to restart your Docker service:
sudo service docker restart
If you've installed Docker via the binaries method (i.e. no package), then you set the DNS servers when you start the Docker daemon:
sudo docker -d -D --dns <your_dns_server_1> --dns <your_dns_server_2> &
And on a windows machine, you can run
ipconfig /all
to find your DNS servers
On a linux host (ubuntu 16.10) you can do
nmcli device show <device>

Friday, May 13, 2016

Fighting with Apache Avro on Hadoop

Search title: Installing Apache Avro on Hadoop on Ubuntu
Search title: Installing Apache Avro on Hadoop 2.7.2 on Ubuntu 16.04

For the swear-jar: Everybody these days uses a Cloudera VM or Hortonworks VM with a single node hadoop cluster, and so it's hard to find useful information on the internet amidst the noise made by these local-hosted-single-noders !

Anyhow, I wanted to do it all from scratch. In a previous post, I describe how to set up a hadoop master-slave cluster. (Link: http://varghese85-cs.blogspot.com/2016/03/hadoop-cluster-setup-2-nodes-master.html ) So now, based on that config, I have a four node hadoop cluster running, with one master node (namenode and resource manager) and three slave nodes (datanodes).

As next step, I wanted to install "the rest of the animals in the zoo" on said cluster. I started with Oozie, but gave up on it, because they don't publish binaries, and their sources don't build (thanks to the codehaus repository closing its doors?).

The next animal I wanted to install (although technically not an animal) is Apache Avro (https://avro.apache.org/). I am following Tom White's book (Hadoop, the definitive guide, 4th edition; Link: http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1491901632 ). Chapter 12 gives an example avro map-reduce program in java on pages 360-361, which I wanted to try out.

Avro really isn't so much a project as a set of libraries. So I thought this was easy, but boy - was I mistaken!

At first, I added maven dependencies for org.apache.avro/avro and org.apache.avro/avro-tools. That compiled fine, but did not work at all.

Then I tried using the -libjars option to package avro, avro-tools, and avro-mapred jars along with my hadoop program. Even that kept failing saying AvroJob class could not be found.

And all the while I was trying to use Avro 1.7.7, which is the latest stable release. After banging my head against the wall for a while without making it work, I decided to figure out where the hadoop classpath is, that is, where hadoop keeps its other jars.

I found that the location was share/hadoop/common/lib (ps: I'm using Hadoop 2.7.2). And on top of that, to my wonderment, avro-1.7.4.jar was already there, although none of the other avro jars were there.

So first off, I switched my pom.xml to use avro 1.7.4 instead of 1.7.7. Then from the avro archives site (Link: http://archive.apache.org/dist/avro/ ) I downloaded avro-ipc-1.7.4.jar and avro-mapred-1.7.4.jar and included them too at the above location (on all the nodes of my cluster). (How did I know to get these files? I watched the jars that maven downloaded when compiling!)

At this point, the map phase started working. I was excited. But the job kept consistently failing in the reduce phase. The exact error I found was:
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

This did not make any sense to me, but after some googling, I found that this is because avro-mapred-1.7.4.jar was compiled against hadoop-0.23 whereas what we need is hadoop 2. So that meant, one had to download avro-mapred-1.7.4-hadoop2.jar and then, specify the classifier clause in the maven pom.xml.

And with that, I was finally able to make avro work on my hadoop 2.7.2 cluster :)
The struggle with the Apache ecosystem is real - hopefully, someday BigTop will deliver!

Here are some screenshots to show what I did, in case you're trying to replicate this:
This is the directory listing of share/hadoop/common/lib
(PS: hadoop-2.7.2 is the folder where I extracted the hadoop-2.7.2.tar.gz file)
This is my pom.xml, relevant section:
(Note the use of the classifier)
Also, unlike in the book example where the schema is hard-coded as a static final string into the program, I made an avsc file on HDFS. That meant I had to pass the name of the file as a parameter to the map-reduce program (there are no shared variables: main, map and reduce run in different JVMs!). And in the driver (main/toolrunner.run) I read the file off HDFS into a string and set a configuration value with this string, so I can get the schema from the configuration from the context within the mappers and reducers (this definitely can be optimized further!).

Hope that was useful. If you have questions, please feel free to comment below :)

Minimal pom.xml for hadoop

Here's a minimal maven pom.xml for use with hadoop. I've been coding map-reduce and using compression codecs, and this has worked so far.


Thursday, May 12, 2016

Apple Push Notification Certificate Renewal Process

At my organization, I'm responsible for the provider service that sends push notifications to mobile devices.

As you will already know if you are searching for this topic, Apple has the Apple Push Notification Service (APNS): https://developer.apple.com/library/ios/documentation/NetworkingInternet/Conceptual/RemoteNotificationsPG/Chapters/Introduction.html and Google has Google Cloud Messaging (GCM): https://developers.google.com/cloud-messaging/gcm for sending notifications to devices. In fact, GCM now offers sending notifications to APNS. But those of us whose applications predate that GCM feature have to keep APNS certificates up to date on our provider servers for sending push notifications through APNS (and I'm guessing you have to plug the APNS certificate into GCM if you want Google to send notifications to Apple).

So how do you create an APNS certificate in the first place?
On this topic, Apple does have some documentation. https://developer.apple.com/library/ios/documentation/IDEs/Conceptual/AppDistributionGuide/AddingCapabilities/AddingCapabilities.html#//apple_ref/doc/uid/TP40012582-CH26-SW11
So when you first create an App, it's pretty straightforward as to how to get the certificates.

In our case, we use password protected .p12 files for certificates. So you have to install the certificates in a Mac keychain, then select the private key under the certificate, and right click and export to create a password protected .p12 file that works for push notifications.
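Since renewal is driven by the expiry date, it helps to be able to read that date out of a .p12 file. The sketch below does this with openssl, which is my assumption here (the post itself only uses Keychain Access). It builds a throwaway self-signed certificate as a stand-in for the real APNS certificate; all file names and the password are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch: build a throwaway password-protected .p12 and read its expiry date.
# File names and the password are hypothetical placeholders.
workdir="$(mktemp -d)"
p12_password="changeit"

# A self-signed certificate stands in for the real APNS certificate here.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=Example Push Certificate" \
  -keyout "$workdir/push.key" -out "$workdir/push.pem" 2>/dev/null

# Bundle certificate and private key into a password-protected .p12,
# similar in shape to what a Keychain export produces.
openssl pkcs12 -export -inkey "$workdir/push.key" -in "$workdir/push.pem" \
  -out "$workdir/push.p12" -passout "pass:$p12_password"

# Print the expiry date (notAfter) of the certificate inside a .p12 file.
# $1 = path to the .p12 file, $2 = its password
p12_expiry() {
  openssl pkcs12 -in "$1" -passin "pass:$2" -nokeys 2>/dev/null |
    openssl x509 -noout -enddate
}

p12_expiry "$workdir/push.p12" "$p12_password"
```

Against a real export you would only need the p12_expiry function, called with the path and password of your own file.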

What happens when it's time to renew the certificate?
Renewing a certificate is the scary part because we have so many questions around it.
  • Will I incur downtime for notifications while I renew?
  • Will pre-existing users of my app lose notification functionality, and need to reinstall the app since the certificate is renewed?
  • Can I have an overlapping time window where multiple certificates are valid so I can transition? 
  • Do I have to republish the app on the App Store? 
  • I have lost the Certificate Signing Request I originally used; is this a problem?
These are a small sample of the questions that come to mind - or at least they went through my mind. So this post tries to answer these questions, based on my recent experience with renewing certificates for our organization.

Overlapping certificates:
Let me right away put this question to rest. Yes, we can have overlapping certificates with different validity windows. On the Apple developer portal (as described in the link above about creating the original certificate), under each application that you have, there will be two certificates listed (assuming you're already using push notifications). These are the Development and the Production certificates.

Next to each, you have the option to create new certificates. It looks like you cannot have more than 3 active certificates, or Apple limits how many times you can create new certificates within a time window - there is some such restriction, as we observed empirically. We did not play around enough to clearly understand what the restrictions are. However, within that limit, you can have overlapping certificates.

And obviously, since you can have two certificates with some overlap in validity, you do not have to worry about incurring a downtime when swapping certificates.

What really is the purpose of this certificate?
The only purpose that the certificate serves is to authenticate the SSL connection between your provider service and the APNS endpoint for this particular application. The certificate is not in any way connected to your provisioning profile.

So that means, there is no direct connection between application and the certificate. An app published while an old certificate is in effect will still receive notifications sent to APNS using a new certificate.

Creating New Certificate:
I already linked the Apple documentation on creating a new certificate. For renewal, the process you follow is the same, except you leave the old certificate in place (do not revoke it) and just create a new certificate alongside it. So now you will have two certificates listed under the Development/Production/both section for your application: the one that was already there, and the newly created one.

So that means, there are now two active certificates. And as long as you created the new one before the old one expired, you have a period of overlap.

During that period of overlap, push notifications can be sent with either of the certificates. This gives you a window of time to switch certificates on your provider server without having to incur downtime. The only downtime you will incur is if you need to bring down the service to change the certificates - in which case it's a design problem in your provider service and not Apple's fault.

A few words on CSRs: 

What is a Certificate Signing Request?
It is a file containing your public key and some identifying information, signed with the matching private key. The private key itself never goes into the CSR; it stays in the keychain on the machine that generated the request. When you plug a CSR into Apple to create your certificate, that public key goes into the certificate.

That also means, once a certificate is created, the CSR can safely be discarded. What you do need to hold on to is the private key, which is why the .p12 export is done from the keychain that created the CSR.

So that should also allay any concern around your no longer having the CSR you used to create your original certificate. For a renewal you simply create a new one.

Can I reuse a CSR?
If you are responsible for multiple iOS applications, your developer account will have them all listed. For the sake of sanity, it is likely that at some point during development you revoked all your certificates and recreated them on the same date, so that they all expire at the same time.

So now, when the certificate renewal time comes around, it is tempting to create one CSR and use the same to renew all your certificates. DO NOT DO THIS.

If you reuse the same CSR for multiple iOS applications, only the first application will work correctly.

We actually tested this out to some degree. We created a CSR and created two certificates with it, one for App A and one for App B. We were able to send push notifications to App A using App A's certificate. However, we were not able to send push notifications to App B using App B's certificate.

In addition, we were actually able to use App B's certificate and send push notifications to App A. This makes sense if the key pair behind the CSR had become associated with App A: both certificates were issued for that same key pair, so you are able to send to App A using either of these certificates.
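One way to avoid reusing a CSR is to generate a fresh key pair and CSR for every app. The sketch below does this with openssl rather than Keychain Access, which is an assumption on my part; the app names, file names and subject are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch: one fresh key pair and CSR per app. All names are hypothetical.
workdir="$(mktemp -d)"

# $1 = short app name, used for the file names and the CSR subject
make_csr() {
  openssl req -new -newkey rsa:2048 -nodes \
    -subj "/CN=$1 Push" \
    -keyout "$workdir/$1.key" -out "$workdir/$1.csr" 2>/dev/null
}

make_csr app_a
make_csr app_b

# The .csr file carries the subject and the public key; the private key
# is written separately to the .key file and is never uploaded.
head -n 1 "$workdir/app_a.csr"
```

Each call produces a different key, so the certificates created from app_a.csr and app_b.csr do not share a key pair.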

I hope this anecdotal recitation of my experience helped you understand the push notification certificate renewal process. If you have additional questions, please feel free to comment on this post, and I'll try my best to answer them.


Using curl command line

Up until now, I've generally used fiddler's composer to test web services. However, it's a bit of a heavyweight, and especially if I'm writing web services in linux, it's more work to get fiddler up and running than to type out a curl command on a terminal. So I learned just enough curl to test out web services.

In the most elemental form, you can do
curl [url]
For example, if you hit google.com (curl google.com), you get an HTTP 301 - Moved response (a browser would automatically follow the 301 and take you to the new destination).
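A few standard curl options make a redirect like this easier to inspect. The sketch below wraps them in small helper functions; the function names are mine, and the calls need network access, so they are only shown as usage comments.

```shell
#!/bin/sh
# Sketch: three ways to look at a redirect with curl.
# The function names are hypothetical helpers, not curl features.

# Print the response headers along with the body (-i).
show_headers() { curl -s -i "$1"; }

# Print only the numeric HTTP status code (-w with the http_code variable).
status_code() { curl -s -o /dev/null -w '%{http_code}\n' "$1"; }

# Follow redirects the way a browser would (-L).
follow_redirects() { curl -s -L "$1"; }

# Usage (needs network access):
#   show_headers http://google.com
#   status_code http://google.com
#   follow_redirects http://google.com
```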

Sending post data is trickier. First off, you need to override the default method of GET. You also need to specify the data to send, and you need to specify the content type of that data.
  • Override the GET method:
    • This is done by specifying the -X option followed by the method to use
  • The data to be sent:
    • This is specified by the -d option (or --data in full) followed by the data in quotes.
  • The content type:
    • This is specified by using the Content-Type HTTP header. 
    • Headers are added using the -H option followed by the header specification in quotes.
So for example:
curl -X POST -H "Content-Type: text/plain" -d "Hello world" http://myserver.com/apipath
Will post the string "Hello world" as plain text to the web service.

You can also use the PUT method, etc., and modify the command appropriately.

But what if the data you want to post isn't a small string, but rather the contents of a file? In that case, you specify the filename prefixed with @ and in quotes as the argument to the -d option
curl -X POST -H "Content-Type: text/plain" -d "@path/filename" http://myserver.com/apipath
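One thing to watch for: when reading from a file, -d strips carriage returns and newlines, which matters for multi-line payloads. The --data-binary option sends the file as it is. The sketch below writes a small two-line payload and defines a helper that posts it; the helper name is mine and the URL is the placeholder from the example above.

```shell
#!/bin/sh
# Sketch: post a multi-line file without losing its newlines.

# Write a small two-line payload to a temporary file.
payload_file="$(mktemp)"
printf 'line one\nline two\n' > "$payload_file"

# Placeholder endpoint from the example above; replace with your own.
url="http://myserver.com/apipath"

# $1 = file to send, $2 = URL to post to
post_file() {
  # --data-binary sends the file byte for byte; plain -d would strip newlines.
  curl -X POST -H "Content-Type: text/plain" --data-binary "@$1" "$2"
}

# Usage (needs a reachable server):
#   post_file "$payload_file" "$url"
```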

MySQL on Ubuntu

To install MySQL on Ubuntu:
sudo apt install mysql-server
As part of the install, it will ask you to set a password for the root user. If you later want to change that password, you can do (note that you should type password verbatim, not the new password you intend to use)
mysqladmin -u root -p password
This will ask you for current password, and then ask you to enter a new password, and confirm entry. If you did not pick a password during installation, skip the -p flag in the above command.

To start mysql,
mysql -u root -p
This will ask you for the root password, and then start mysql.

To quit out of mysql, use:
quit
To show databases, at the mysql prompt, use
show databases;
To switch to using a specified database:
use [database-name];
To show tables within a database, first switch to that database using the above, and then:
show tables;
So at this point, the next logical step would be to create a user other than the superuser. But then, we also want that user to be able to play with some database.
So first let's create a test database named ubuntu_test:
create database ubuntu_test;
Now let's create a new user named ubuntu, and grant them all privileges on the ubuntu_test database:
create user 'ubuntu' identified by 'password';
That created a user with username "ubuntu" and user password "password".
grant all privileges on ubuntu_test.* to 'ubuntu';
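If you set this up more than once, the three statements above can be collected into a file and fed to mysql in one go. This is a sketch; the temporary file and the run_setup helper are my own names, and mysql still prompts for the root password on the terminal.

```shell
#!/bin/sh
# Sketch: collect the setup statements into one SQL file.
setup_sql="$(mktemp)"
cat > "$setup_sql" <<'EOF'
create database ubuntu_test;
create user 'ubuntu' identified by 'password';
grant all privileges on ubuntu_test.* to 'ubuntu';
EOF

# Run the statements in a file as the root user.
# $1 = path to the SQL file
run_setup() { mysql -u root -p < "$1"; }

# Usage (needs a running MySQL server):
#   run_setup "$setup_sql"
```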
Now, quit out of mysql and restart mysql as your ubuntu user:
mysql -u ubuntu -p
If you do "show databases;", you can see that the ubuntu user has access to only a limited set of databases, including the ubuntu_test database we created. At this point, we can create tables in the database etc using regular SQL commands, and be on our merry way.
use ubuntu_test;
create table [tablename] [details...];
While I do not wish to bore you with elementary SQL commands, there is one command I find rather useful - how to inspect the schema of a pre-existing table:
show create table [tablename];

Wednesday, May 11, 2016

Apache BigTop

Apparently, Apache has the BigTop project going, the idea of which is to have a single repository for linux distributions to pick up hadoop from.


I first came across this tutorial on how to install hadoop using BigTop:


 There are a couple of gotchas:
(1) you want a newer version of BigTop, considering that 1.1.0 has been released as of now
(2) For Ubuntu, `lsb_release --codename --short` gives "wily" (on 15.10) whereas the specified BigTop repo only has "trusty" - so I had to replace that with trusty

This seemed to work. However, this ties you to BigTop version 1.1.0.

Thereafter, I found this blog post:
That post was written when BigTop was still incubating, so the URL has changed. Also, that post refers to an "ubuntu" folder within the repo for bigtop.list. That folder doesn't exist, and so based on my above experience, I used the "trusty" folder.

So the steps are:
wget -O- http://www.apache.org/dist/bigtop/stable/repos/GPG-KEY-bigtop | sudo apt-key add -
sudo wget -O /etc/apt/sources.list.d/bigtop.list http://www.apache.org/dist/bigtop/stable/repos/trusty/bigtop.list

At this point, edit the /etc/apt/sources.list.d/bigtop.list file and verify that there is only one deb and one (or zero) deb-src line each. If there are multiple, only one should be uncommented (comments prefixed with #)
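The check can also be done from the terminal instead of by eye. This is a small sketch; the count_active helper is my own name, and it simply counts the lines of a given type that are not commented out.

```shell
#!/bin/sh
# Count uncommented lines of a given type in an apt sources file.
# $1 = "deb" or "deb-src", $2 = path to the list file
count_active() {
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  grep -c -E "^[[:space:]]*$1[[:space:]]" "$2" || true
}

# Defaults to the file written by the wget step above.
list_file="${1:-/etc/apt/sources.list.d/bigtop.list}"
if [ -r "$list_file" ]; then
  echo "active deb lines:     $(count_active deb "$list_file")"
  echo "active deb-src lines: $(count_active deb-src "$list_file")"
fi
```

You want to see 1 for deb and 0 or 1 for deb-src; anything higher means some lines still need to be commented out.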

Now you can apt update and be on your merry way to use say synaptic to install whatever you want. Follow the apache wiki link above to get hadoop up and running.

PS: this doesn't work for a Raspberry Pi, or at least not currently, since the above bigtop.list specifies the amd64 architecture alone. So at some point, I'll write some blog posts on installing the hadoop ecosystem on the Raspberry Pi, especially since I now have a BitScope BP10A rack (see: http://my.bitscope.com/store/?p=list&a=list&i=cat+3 ) on order, which I posted about in my previous blog post.