Twitter JSON stream parser
So recently I’ve had occasion to parse the Twitter JSON stream, specifically the spritzer stream for data mining purposes. Turns out this is a pretty difficult problem to solve in most languages. So here’s my Alexandrian solution to this particular Gordian knot, in Bash, because that’s just how I roll.
curl -s --basic --user username:password http://stream.twitter.com/spritzer.json | while read line; do echo "${line}" > temp_tweet ; cat temp_tweet | sed -e 's/=\"/\=\\"/g' | sed -e 's/\">/\\">/g' | ./twitterparse.pl; done
This reads the JSON stream line by line and passes each tweet to a Perl script, which does the actual parsing.
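One wrinkle worth knowing: a plain read treats backslashes as escape characters and strips them, which may well be why the quotes need re-escaping with sed afterwards. Here’s a minimal sketch of an alternative loop, assuming the parser reads one tweet per line on stdin. The function name each_tweet is made up for illustration and isn’t part of the original one-liner.

```shell
# Sketch: hand each line of a stream to a handler without mangling backslashes.
# each_tweet is a hypothetical helper; the handler is whatever command you
# pass as arguments (for example ./twitterparse.pl).
each_tweet() {
    local line
    # IFS= preserves leading/trailing whitespace; -r keeps backslashes literal.
    while IFS= read -r line; do
        # Skip empty lines so the handler only ever sees real payloads.
        [ -n "$line" ] || continue
        # The handler reads from this short pipe, not from the outer stream.
        printf '%s\n' "$line" | "$@"
    done
}

# Usage (not run here):
#   curl -s --basic --user username:password http://stream.twitter.com/spritzer.json \
#     | each_tweet ./twitterparse.pl
```

With read -r the JSON escapes arrive at the parser untouched, so the temp file and the sed passes shouldn’t be needed, though I’d test that against your own parser before dropping them.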
Debian Diskless Cluster Howto
Initial Setup
This guide will walk you through the diskless cluster install and setup process. The cluster has a head node that serves boot images to the compute nodes and the database node. We’re attempting to present a unified system image across the cluster. For this reason, all nodes are looking at the same root filesystem, served via NFS.
DHCP
I started from a bare-bones netinstall of Debian Squeeze (testing) on the head node. This should work about equally well on any Debian-derived distribution.
First, we need to install some packages we’ll need in a minute.
sudo apt-get install dnsmasq syslinux nfs-kernel-server nfs-common debootstrap tftpd-hpa xinetd
Now, we need to configure dnsmasq, which will serve as our DHCP server for diskless booting.
Replace your existing /etc/dnsmasq.conf with something like this:
dhcp-range=192.168.1.50,192.168.1.150,255.255.255.0,12h
dhcp-boot=pxelinux.0,headnode,192.168.1.1
Replace 192.168.1.x with your preferred IP subnet and “headnode” with the hostname of your head node.
tftp
Our tftp server needs to be configured to launch on command from xinetd. The binary is already installed from our previous apt-get command.
Create a file, /etc/xinetd.d/tftp-hpa, that looks like this:
service tftp
{
disable = no
id = tftp-dgram
socket_type = dgram
protocol = udp
user = root
wait = yes
server = /usr/sbin/in.tftpd
server_args = -s /var/lib/tftpboot/
}
PXE
Now we need to tell the PXE server what to serve our clients.
Let’s set up our pxelinux configuration directory.
sudo cp /usr/lib/syslinux/pxelinux.0 /var/lib/tftpboot/
sudo mkdir /var/lib/tftpboot/pxelinux.cfg
We’ll need a kernel and an initial ramdisk to give to the diskless clients. Assuming you’re going to be running the same kernel on the head node as on your diskless clients (recommended), you can just copy the kernel from /boot.
sudo cp /boot/vmlinuz-`uname -r` /var/lib/tftpboot/
You’re going to need to create an NFS-root-enabled ramdisk. This is accomplished with the tool mkinitramfs. You should have a configuration directory, /etc/initramfs-tools/. Make a copy of it:
sudo cp -r /etc/initramfs-tools /etc/initramfs-pxe
Note: On Debian Squeeze, the installed /etc/initramfs-tools did not work for me. I never found out why, but it seems to be missing module configurations. I ended up copying /etc/initramfs-tools from an Ubuntu 8.04 install, and that worked fine.
Edit /etc/initramfs-pxe/initramfs.conf. Change BOOT=local to BOOT=nfs.
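If you’d rather script that edit than open an editor, something like the following should do it. It’s a minimal sketch that assumes GNU sed (which is what Debian ships) and that the file contains a line reading exactly BOOT=local. The helper name set_boot_nfs is made up for illustration.

```shell
# set_boot_nfs is a hypothetical helper: switch BOOT=local to BOOT=nfs
# in place in an initramfs.conf-style file (relies on GNU sed's -i).
set_boot_nfs() {
    sed -i 's/^BOOT=local$/BOOT=nfs/' "$1"
}

# Usage on the real file needs root, for example:
#   sudo sed -i 's/^BOOT=local$/BOOT=nfs/' /etc/initramfs-pxe/initramfs.conf
```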
Now we can create the ramdisk.
sudo mkinitramfs -d /etc/initramfs-pxe -o /var/lib/tftpboot/initrd.img-`uname -r` `uname -r`
We should be ready to create a default boot configuration now. Create /var/lib/tftpboot/pxelinux.cfg/default with these contents:
LABEL linux
KERNEL vmlinuz-2.6.29
APPEND root=/dev/nfs initrd=initrd.img-2.6.29 nfsroot=192.168.1.1:/home/nfsroot ip=dhcp rw
Change 2.6.29 to match your kernel, obviously.
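To avoid hand-editing the version string, you could generate the file instead. This is a rough sketch, not what I actually ran. The helper name pxe_default_config is hypothetical, and it just prints the same three directives with the kernel version, NFS server and export path filled in.

```shell
# pxe_default_config is a hypothetical helper that prints a pxelinux
# "default" file for a given kernel version, NFS server and export path.
pxe_default_config() {
    local kver=$1 server=$2 export_path=$3
    printf 'LABEL linux\n'
    printf 'KERNEL vmlinuz-%s\n' "$kver"
    printf 'APPEND root=/dev/nfs initrd=initrd.img-%s nfsroot=%s:%s ip=dhcp rw\n' \
        "$kver" "$server" "$export_path"
}

# Usage (not run here):
#   pxe_default_config "$(uname -r)" 192.168.1.1 /home/nfsroot \
#     | sudo tee /var/lib/tftpboot/pxelinux.cfg/default
```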
If you want to pass different parameters to different machines, you can create individual configuration files in /var/lib/tftpboot/pxelinux.cfg/ based on their MAC addresses. For example, if I create a file, /var/lib/tftpboot/pxelinux.cfg/01-00-21-97-7a-24-0f, then my node with a MAC of 00:21:97:7a:24:0f will load that instead of the default. I like to create softlinks in the configuration directory corresponding to the hostnames of my nodes because if you can remember MAC addresses of individual machines then you’re a better man than I am.
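The file name is just 01- (the hardware type for Ethernet) followed by the MAC in lowercase with dashes in place of colons, so it’s easy to script. Here’s a minimal sketch. Both helper names are made up, and I’m assuming the hostname link should point at the MAC-named file rather than the other way around.

```shell
# Hypothetical helpers for per-node pxelinux configuration files.

# Print the pxelinux.cfg file name for a MAC address:
# "01-" plus the MAC, lowercased, with colons turned into dashes.
pxe_name_for_mac() {
    printf '01-%s\n' "$(printf '%s' "$1" | tr 'A-F' 'a-f' | tr ':' '-')"
}

# Create <cfgdir>/<hostname> as a relative symlink to the MAC-named file,
# so each node's config can be found by name.
link_node() {
    local cfgdir=$1 host=$2 mac=$3
    ln -sfn "$(pxe_name_for_mac "$mac")" "$cfgdir/$host"
}

# Usage (not run here):
#   sudo touch /var/lib/tftpboot/pxelinux.cfg/01-00-21-97-7a-24-0f
#   link_node /var/lib/tftpboot/pxelinux.cfg tumbler 00:21:97:7a:24:0f
```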
NFS
NFS time! Create a directory to hold the NFS root you’ll be serving to clients.
sudo mkdir /home/nfsroot
Edit /etc/exports. It should look something like this:
/home/nfsroot 192.168.1.0/255.255.255.0(rw,no_subtree_check,async,no_root_squash)
Now we just have to bootstrap a basic Debian install into /home/nfsroot. Luckily for us, there’s a nifty little tool called debootstrap that does just that. For a 64-bit Debian Squeeze environment, I do this:
sudo debootstrap --arch amd64 squeeze /home/nfsroot/
A few minutes later, it’s installed. Now you need to make some modifications to that system you just installed.
Edit /home/nfsroot/etc/fstab to look something like this:
#proc /proc proc defaults 0 0
/dev/nfs / nfs defaults 0 0
none /tmp tmpfs defaults 0 0
none /var/run tmpfs defaults 0 0
none /var/lock tmpfs defaults 0 0
none /var/tmp tmpfs defaults 0 0
none /media tmpfs defaults 0 0
none /var/log tmpfs defaults 0 0
/home/nfsroot/etc/network/interfaces should be:
auto lo
iface lo inet loopback
iface eth0 inet dhcp
Note that auto eth0 isn’t there anymore. That’s because your primary ethernet interface is already up. If you try to initialize it again, it might drop your existing connection and it’ll dump you out of the boot process.
Testing
At this point you’re ready to test. Restart xinetd, dnsmasq and nfs-kernel-server so your new settings take effect. Then check your node’s BIOS to verify that network boot is enabled, and give it a shot.
Congratulations! You now have a diskless cluster. Next we’ll make some special modifications to the configuration of the nodes to make them play nicely together and make maintenance easier.
Networking
Each node will receive an IP address from dnsmasq on the head node. You can simply note which IP each node gets, since dnsmasq gives each node a unique IP by default and the assignment persists as long as the node’s MAC address remains the same. Alternatively, you can force each node to a specified IP with a configuration similar to this in /etc/dnsmasq.conf on the head node:
dhcp-host=id:00:21:97:7d:ad:bf,192.168.1.10
dhcp-host=id:00:21:97:7a:24:0f,192.168.1.11
dhcp-host=id:00:21:97:7d:b3:26,192.168.1.12
Either way, you’ll need /etc/hosts on your head node to reflect the IP addresses of your nodes. Mine looks like this:
127.0.0.1 localhost
10.13.99.1 scoop head
192.168.1.10 dizzy db
192.168.1.11 tumbler
192.168.1.12 scrambler
You’ll want to copy that hosts file over to /home/nfsroot/etc/hosts as well.
Init Tricks
Sometimes you want the nodes to behave just a little bit differently from each other. I wanted my nodes to have different hostnames, fancy that. So, I wrote this bash script to figure out what their hostname should be:
#!/bin/bash
#finds node's hostname based on matching ip in /etc/hosts
grep `ifconfig | grep 'inet addr:'| grep -v '127.0.0.1' | /usr/bin/cut -d: -f2 \
| /usr/bin/awk '{ print $1}'` /etc/hosts | /usr/bin/awk '{print $2}'
Save the script in /home/nfsroot/bin/whereami and make it executable. You’ll need awk for it to work, so boot up a node and apt-get it from the node itself before running the script. Package installation is best done from a booted diskless node. Just try not to install packages from multiple nodes simultaneously, or you might corrupt your apt database.
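One caveat with the grep approach: grep matches substrings, so an address like 192.168.1.1 would also match the 192.168.1.10 line. If that bites you, an exact match on the first column is a safer lookup. Here’s a minimal sketch. The helper name hostname_for_ip is made up, and it takes the IP as an argument rather than scraping it out of ifconfig.

```shell
# hostname_for_ip is a hypothetical helper: print the first name listed for
# an exact IP match in a hosts-format file (defaults to /etc/hosts).
hostname_for_ip() {
    local ip=$1 hosts=${2:-/etc/hosts}
    # Compare the whole first field, so 192.168.1.1 never matches 192.168.1.10.
    awk -v ip="$ip" '$1 == ip { print $2; exit }' "$hosts"
}

# Usage (not run here):
#   hostname_for_ip 192.168.1.11
```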
Now that we have that taken care of, we can modify /etc/init.d/hostname.sh to set our hostname on boot based on the IP we’ve received. This is as simple as changing this:
[ -f /etc/hostname ] && HOSTNAME="$(cat /etc/hostname)"
To this:
[ -f /etc/hostname ] && HOSTNAME="$(/bin/whereami)"
This also allows us to modify other init scripts so they’ll only run on particular nodes. For example, I wanted MySQL to start only on the database node, dizzy. So I added this to the top of /etc/init.d/mysql:
hostname=$(hostname)
if [ "$hostname" != "dizzy" ]; then
    exit 0
fi
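If you end up guarding several init scripts this way, or want a service on more than one node, the check can be pulled into a small function. This is just a sketch. The name should_run_on is made up, and it takes the current hostname as its first argument so it’s easy to try out by hand.

```shell
# should_run_on is a hypothetical helper: succeed only if the current
# hostname (first argument) is one of the node names that follow.
should_run_on() {
    local current=$1 node
    shift
    for node in "$@"; do
        [ "$current" = "$node" ] && return 0
    done
    return 1
}

# Usage at the top of an init script (not run here):
#   should_run_on "$(hostname)" dizzy || exit 0
```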
Logging
Since we’re not saving local log files on the diskless nodes, it makes sense to centralize our logging on the head node. We’ll need a better logging daemon to accomplish this.
Install syslog-ng on both the head node and a diskless node (only do this on one of your nodes; changes propagate to the others, remember?):
sudo apt-get install syslog-ng
Edit /etc/syslog-ng/syslog-ng.conf on the head node.
## add this to the options section
create_dirs(yes);
long_hostnames(off);
keep_hostname(yes);
## add this to the source section
source s_udp {
udp ( ip(192.168.1.1) ); # replace with your system's IP address
};
## add this to the destination section
destination df_udp {
file ("/var/log/$HOST/$FACILITY");
};
## add this to the log section
log {
source(s_udp);
destination (df_udp);
};
Now edit /etc/syslog-ng/syslog-ng.conf on one of the diskless nodes.
## add this to the destination section
destination remote_udp { udp("192.168.1.1"); }; # replace with your log server's IP address
## add this to the log section
log { source(src); destination(remote_udp); };
Restart syslog-ng on both head and diskless nodes.
That’s It
I hope this was helpful. Feel free to ask questions or leave comments.
Goddamnit.
Well, that’s the last Apple product I will ever purchase.
Gophercore
gopher://gopher.signalnine.net/
Should work in Firefox, or apt-get yourself a real commandline client if you’re on a Debian system. And yes, that is in fact a Twitter-to-gopher gateway I wrote. Also, this blog is being re-published as a phlog. Apparently Veronica is still around! I got my server added to the index seed list, and I’m listed as the 136th gopher server still on the internet.
Spam Poetry OTD
This one reminds me of E2 poetry.
VLC streaming webcam server
I recently had occasion to create a streaming server to serve a video and audio feed, for use as a baby monitor. So, I decided to re-purpose my aging Eee 701 that’s become next to useless because the LCD is failing. However, it makes a perfect compact low-power headless Linux server. I bought a cheap-ass webcam.
Getting the webcam to recognize under Intrepid was a little bit of a pain, but the gspca drivers will compile under the 2.6.27-11 kernel if you apply this patch.
So, after making sure you have both vlc and v4l-conf installed, here’s the VLC command you’ll need:
cvlc v4l:// :v4l-vdev="/dev/video0" :v4l2-adev="/dev/audio" :v4l-norm=1 :v4l-chroma=UYVY :v4l-height=640 :v4l-width=480 --sout "#transcode{vcodec=h264,vb=800,scale=1,acodec=mp3,ab=64,channels=1,audio-sync,vfilter=deinterlace}:duplicate{dst=std{access=mmsh,mux=mp4,dst=192.168.0.24:1755}}"
Replace 192.168.0.24:1755 with your server’s IP and preferred port. This serves MMS; you can replace mmsh with http or whatever protocol you prefer. You can also change vcodec and vb, the video codec and bitrate respectively. H.264 is very efficient but somewhat processor-intensive. Remarkably, my Eee 701 with its 600 MHz Celeron handles it just fine.
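Since the --sout string is the part you’ll keep fiddling with, it can help to assemble it from variables. Here’s a minimal sketch. The helper name build_sout is made up, and it does nothing but paste the access method, destination, codec and bitrate into the same chain used above. I haven’t checked which codec and mux combinations VLC accepts for other protocols.

```shell
# build_sout is a hypothetical helper: print the --sout chain from the
# command above with access, destination, vcodec and vb filled in.
build_sout() {
    local access=$1 dst=$2 vcodec=${3:-h264} vb=${4:-800}
    printf '#transcode{vcodec=%s,vb=%s,scale=1,acodec=mp3,ab=64,channels=1,audio-sync,vfilter=deinterlace}:duplicate{dst=std{access=%s,mux=mp4,dst=%s}}\n' \
        "$vcodec" "$vb" "$access" "$dst"
}

# Usage (not run here):
#   cvlc v4l:// :v4l-vdev="/dev/video0" --sout "$(build_sout mmsh 192.168.0.24:1755)"
```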
Toodledo Bash Script
So, I’ve been playing with Toodledo recently. It’s a pretty impressive web-based todo system with pretty much all the functionality you could ask for. However, I don’t really like the Ruby command line client, so I wrote my own in Bash. It requires a local MTA in order to function properly. I recommend ssmtp if you don’t need a full-featured MTA.
Downloadable here: http://signalnine.net/toodledo
Licence: GPL v2.
Install instructions:
- Download the script and put it in a directory that’s in your path (e.g., ~/bin/ or /usr/bin/).
- Edit the script: replace the from variable with your email address and the to variable with your Toodledo secret email address, which you can find here.
- Make the toodledo script executable.
- Follow the Toodledo email format, which the script reminds you of on execution.
That’s it. Have fun.