Monday, February 11, 2013

Is the "daily deals" space sustainable?

I have some amount of experience working in the "daily deals" space on the technical side, but with a lot of interaction with the "business end" (and boy do I ever hate that distinction).  Here's my MBA-style analysis of the sector, devoid of "secrets" but based heavily on the patterns I saw.

- There are roughly two kinds of businesses, for the purposes of the deals sector: those with marginal cost approaching zero, which tend to be focused on the service sector (hair salons, Lasik, lawn care, hotel rooms), and those with relatively high marginal costs (restaurants, actual physical products).

- It is incredibly easy to sell marginal customers to zero-marginal-cost businesses and they will benefit greatly.  It's no different than an airline or stadium selling cheap seats at the last moment to fill up to capacity.  For positive-marginal-cost businesses, when the direct "loss" on the deal takes out most of their margin (mid-tier restaurants are the main example here: if a merchant runs a $10-for-$20 voucher, of which they actually receive $5, they have to make a $15 profit per check ex ante to break even) you have to rely on fuzzy stories about "new repeat customers", redemption rates below 100%, and sales of ancillaries with high margins (dessert, booze), possibly due to price elasticity.  These stories might even be true, but they're much more difficult to quantify and spin into a sale (and more importantly, into a re-run a few months later without much additional sales effort).

- The press focuses on restaurants, but more or less ignores the fact that the most profitable deals are in the health-and-beauty space, which sells more or less exclusively to women.  There's a heavily gendered component.

- The deals sector relies totally on a large and effective sales force to create and maintain relationships with merchants.  The cost-per-runnable-deal is (or should be) their primary cost metric; overhead is (or should be) minimal compared to revenues, so it's all about whether each deal is generating enough in revenue to pay for the cost of signing it.  Anything you can do to make your sales force more productive is probably a Good Idea.  One advantage of this is that your capital requirements are pretty low (and you can self-finance just about everything, see below), and you can ramp up incrementally.

- One huge source of strength is that you get to keep the entire revenue from the deal for many weeks, and remit the merchant's cut later.  That gives you a large amount of cashflow to play with; essentially a "float" like an insurance company.

- The primary sales channel (to consumers) is email.  That spins you into a few interesting areas: you (theoretically) have the ability to tailor those emails very precisely and eke out some incremental gains.  You also have the ability to drive revenue via incentivized offers (eg, "an additional $10 off any offer").  This can let you do a combination of additional price discrimination for your most price-sensitive users (on top of that intrinsic to the deal itself), and having customers essentially float you an additional short-term loan.  Every company does this to some extent, but because email is already your primary communications channel and you're already selecting for the price-sensitive, you can ride that train a lot more effectively than, say, Best Buy e/mailing out coupons.  If you're an analyst (or the board setting management incentives), though, you really want to watch out for the company trying to game the system each quarter by messing with customer incentivizations and changes to their email composition, possibly at the expense of your margins.  "Revenue targets" are less meaningful than elsewhere.

- "Deals" are fundamentally a middle-class (broadly defined) product: people with enough money to go and do things, but not so much that they're completely price insensitive at the $20-savings margin.  It's also "middle class" on the sales side: businesses that have both excess capacity of one sort or another, and some ability/desire to try to price-discriminate downwards.  Foreign expansion for whatever reason seems to have ignored that.

- The competitive "moat" is 1) the need for a large sales force to get a sufficient variety of deals, and 2) the need for a large enough customer list willing to be mailed by you at a sufficient volume and purchasing rate to pay for the sales force and overhead.  Considering the limited willingness of customers to be mailed, and the limited pool of merchants willing to be sold, there's a limit to the number of deals providers that can sustainably exist.  My hunch is that it's somewhere between 1.5 and 2.5 (fractional company in this case is oligopoly profits or a region / sector-focused competitor).

- The email list (and their associated credit card numbers) plus the ability to email them whatever you want constitutes a very valuable capital asset.  It's hard to build such a list (you can buy it incrementally via advertising, but you need to keep them around by having stuff to send them for the list to grow on net, and have enough of an underlying business to generate organic growth) and actually technically challenging to mail them consistently.  It's difficult to value it on paper, but bounding it somewhere around acquisition costs, and at minimum what you could get by going pure black-hat-viagra-spam with it, should give you an estimate.  If all else fails, I'd think that one could "cobrand" with a similar business (PriceLine, Fab.com, whatever) and essentially rent it out.

In many ways the deals sector is the perfect target for an 80s-style leveraged buyout: a business with high underlying cash flows, low capital requirements, some capital assets, a relatively price-insensitive core market (the zero-marginal-cost businesses), and excessive overhead and some unprofitable markets (at the moment) due to overexpansion at the peak.  It's also probable that there are some investors who bought in at the peak and are looking to cut their losses.  I do think there is a very sustainable business underneath; "email marketing for local businesses" doesn't sound sexy (and doesn't really qualify as a "technology business") but it's an example of a profitable niche.  In the end it will come down to execution, but a good management team should be able to make it happen.

Sunday, February 3, 2013

Space efficiency of Clojure's collections

How much space do Clojure's collections take up compared to their mutable counterparts?  Let's see, using the SizeOf class.


(SizeOf/deepSize (long-array (range 1024)))
-> 8208


(SizeOf/deepSize (into (vector-of :long) (range 1024)))
-> 9552


(SizeOf/deepSize (vec (range 1024)))
-> 26192



(SizeOf/deepSize (ArrayList. (range 1024)))
-> 24624


(let [a (HashMap. 1024)]
     (dotimes [i 1024] (.put a i i))
     (SizeOf/deepSize a))
-> 82040


(SizeOf/deepSize (into {} (for [i (range 1024)] [i i])))
-> 166008

Unsurprisingly, primitives are great for space efficiency - and actually, Clojure's vector-of has minimal (~25%) overhead over a raw array, while having a much nicer interface.  Clojure hashmaps, though, are enormous - 10x overhead compared to two raw arrays.  

This makes me think that SOAC needs a semi-traditional hash table, built on top of primitive vector-ofs rather than trees, as it currently is, or raw primitive arrays, a la Trove or fastutil.  That would succeed in maintaining persistence while being much smaller.


Saturday, December 29, 2012

Sparsemas: A sparse matrix implementation that supports iteration & expansion


This Christmas I wrote up an implementation of what amounts to a sparse matrix, with a few additional features that are handy for the areas I work in.  It uses essentially the same approach as the Apache Commons sparse matrix implementation (and most others I've seen): it handles the (row, column) -> value mapping by transforming the (row, column) into a single index, and then using an efficient primitive-based hashmap to match that index to the value.

The first main difference is that my mapping uses a kind of spiral going out from the origin, that isn't dependent on the number of rows and columns.  Visually, the mapping goes like this:

17 19 21 23 24
10 12 14 15 22
5  7  8  13 20
2  3  6  11 18
0  1  4  9  16


And so on.  If you look on the diagonal, you see that it's the squares of integers minus one.  Going "down" from a square element, the index decreases by two, and going to the "left" it also decreases by 2, but starting from the square + 1.  The reverse mapping (from index to (row, column)) is also trivial.  The advantage to this is that if you add another row or column, you're not forced to recalculate indexes.  Theoretically, this even allows you to use a persistent tree for the map implementation, and so have a persistent implementation of a sparse matrix with structural sharing.

The second difference, which is totally separable from the first, is that each "node" in the matrix has a pointer to the next filled element by row and column.  This essentially defines a linked list over filled elements by row and column; it does come at the expense of additional space, but for my purposes it's necessary to be able to find which elements are filled in any particular row or column.  We store the first node by row and column in an additional int -> int mapping (and implicitly those also define a list of which rows & columns have any filled elements).  For maximal insert speed, an insert just adds an element to the head of the list, so initially, it's traversed in reverse insertion order.  I do however have functionality to sort the lists, so I can easily calculate overlap in filled rows between columns, or vice versa.

Limitations of this implementation are higher space overhead to store the linked lists over elements, and the ability to only (safely) address rows and columns by (>=0) integers.  Currently, it uses an object to store the (rating, next_item_pointer, next_person_pointer) tuple, which adds 8 bytes of object overhead per entry.  In a future implementation, it could use the underlying hash algorithm directly to handle indexing in 3 parallel arrays of ratings, item pointers, and person pointers.

So what is this good for, and why did I feel the need to create it in the first place?

If you're storing a correspondence between users and items, of the form (user, item) -> preference or interaction flag, a sparse matrix seems like a natural representation (besides which many recommendation algorithms explicitly model it as a sparse matrix with a goal of imputing the missing elements, so being able to do matrix operations is handy).  However, the number of items and users is unbounded, and if you want an online representation that doesn't have to be copied into a bigger container each time you add a new maximum item or person ID, you need consistent assignment of (person, item) -> index for insertions and lookups.

You also obviously need a way to figure out what elements are filled, without having to traverse the thing.

And, for simple similarity-based models that rely on consistency of ratings between two people or items, you need to be able to quickly find the items rated by a particular user or vice versa (and if they can be stored sorted for quick overlap calculation so much the better).

The traditional way to do it, which Apache Mahout does with its in-memory representation, is to have a mapping from person ID to two parallel arrays of item IDs and ratings, and the reverse mapping from item ID to two arrays of person IDs and ratings.  That has reasonable space efficiency, but a) has to store the data twice to handle mappings in both directions, and b) doesn't deal that well with insertions (or especially removals) which usually require you to traverse arrays scanning for elements, and copy the array en masse to modify it (or deal with having an "unfilled" magic value, buffering at the end of your arrays, etc.).  This implementation, despite the overhead, is still fast, reasonably efficient, and supports better online behavior.

Monday, November 5, 2012

Random comparison sorts for "merging" orderings

Let's say you have multiple orderings, and you'd like some sort of consensus on the overall ordering.  This is essentially the problem of voting, generalized to a total ordering instead of just the top choice (how topical!).  But, with some twists:

- You might not even care about the head or tail of the ordering, as long as most of the elements have achieved some kind of consensus.
- You might not actually have a defined ordering for each voter - instead you might have some sort of comparator or objective function.
- You might also want some "voters" to count more than others.  Doing it with non-low-integer powers is a pain - if you want to sort a list, with 1000 parts algorithm A and 7 parts algorithm B, do you really need to generate and tally 1007 votes?

The key here is a notion of a partially-applied ordering: a probability of two random items being in one order vs. another.  And you can create a partially-applied ordering by probabilistically selecting items and comparing/exchanging them.


I call it a "rational bozo sort" because of its similarity to the "irrational bozo sort", which randomly permutes two elements (ours permutes them according to some rule).  It's easy enough to extend it to randomly choose between multiple orderings in some proportion, rather than partially applying a single new ordering to an existing one.

Of course it's obviously not efficient to actually sort an entire list with this algorithm - it has >(O^2) complexity for comparisons.  It does seem to be a good way to "merge" orderings granularly though, without a lot of messing about with elaborate voting rules and tallying integer votes.

Tuesday, October 30, 2012

More compact bulk data structures in Clojure

I've written a library here that aims for more compact representations of arrays of "structs" (which I use in the sense of a mostly plain-old-data construct, consisting primarily of primitives).  Because Java has substantial overhead for an "Object", if you have a long sequence of data that is stored as one Object per datum, you end up with a lot of wasted space.  If they're truly homogenous, you can instead transpose them and represent them as a series of arrays, one for each field, with a length equal to the number of objects you'd like to store.

Making them "persistent" is unfortunately much harder - ideally I'd like to have been able to use Clojure's vector-of functionality, but the overhead in terms of storage and traversal was still too much for me compared to raw arrays.  Instead I fell back to a copy-on-write scheme that's quite slow if you actually need to make an "edit" to the sequence (as opposed to appending to the end), but otherwise is just as fast and compact as the mutable version.

In the future I'd like to get away from the runtime lookup of the proper accessing / copying / modifying functions for each "type" of SOA (speediness requires type-hinted versions of these functions, and finding the proper one via runtime table lookups was the fastest way to get it deployed) and instead have them be statically generated at compile time with a macro.  Unfortunately, as of right now I've been unable to get macro-generated type hints for deftype fields to work, but it's on my to-do.  An alternative (and something I might do anyway) is going True Scotsman and using sun.misc.Unsafe to define my own memory layout and avoid object overhead - just a stream of bytes, with proper indexing and offsetting necessary to find whatever you're looking for.  That does sound like fun, doesn't it?

DataGotham - "Cold Start My Heart"

My talk from DataGotham about recommender systems, the new item problem, and novelty, is available here.

Sunday, September 23, 2012

Clojure functions can implement protocols

It's common in Clojure to write functions that take other functions as parameters.  This ends up being a problem, though, when you want to keep track of exactly what functions are passable to others.  Clojure is by nature dynamic, so a Haskell-esque type system won't save you.

Fortunately, declaring functions declares a class as well, and that class can implement a protocol.  Here's an example of the raw functionality:


Instead of using the passed function directly, you pass the function to the protocol's function / method, which polymorphically "does the right thing" with it.  If you pass a function that doesn't implement the protocol it will fail, even if the function itself would do something apparently valid with the inputs.  Notably, that means you can have functions with different calling patterns all useable in the same manner, as long as you can translate their usage in the protocol implementation.

It isn't quite as nice as doing it in a more static type system, since the errors only happen at runtime and new functions have to be manually declared to satisfy the protocol, but in complex code it can be nice to document and enforce what goes where.

Saturday, September 1, 2012

Seqs over queues, and vice versa

Clojure Agents are great, but I find myself not using them all that much.  The fact that

1) They have an unbounded, non-blocking "mailbox", and
2) You can send them any arbitrary function and data

makes them very flexible, but usually more flexible than I actually need.  In particular the first one can be problematic because it puts the onus on the caller to make sure it's not saturating the agent.  I've run into this before, using an agent to buffer writes to a database - if your connection is slow enough, you'll blow up your heap with unprocessed messages because there's no mechanism to exert back-pressure.

The more common pattern I have is that I need to maintain a state and mutate it as new data comes in with a particular, fixed function.  Again, you can certainly express that with agents, but the fixed-function stream-processing approach seems more natural to me.  Blocking queues are the natural way to handle a stream of data that needs to stay inside some reasonable buffer.

But, you can still express operations over queues in idiomatic Clojure, by treating the queue as a seq.


In fact, you can express things at a deeper level than you can with agent operations, since you can arbitrarily traverse or manipulate the whole sequence instead of being restricted to binary operations over the agent state and the next piece of new data.  For instance, you can compute a mean efficiently by traversing the sequence and not holding onto the head, but you can also do a sort of the whole thing to find the median (assuming the whole thing fits in your underlying queue).

Distinctifying a sequence in Clojure

One of the things you run into writing a lot of performance-sensitive Clojure is the performance overhead of traversing sequences.

Here are three approaches to returning the distinct elements of a sequence, as a vector:

The first one is the most "idiomatic" in the sense of conciseness and using what Clojure gives you.  You can rely on the fact that you can cast a Clojure Set to a Seq (although a set is not itself a seq), and convert a Seq into a Vector.

Unfortunately if you benchmark it, you'll find that you spend a lot of time traversing the sequences. There's no point in generating persistent data structures for purely under-the-hood operations; CPUs are fundamentally machines for mutating data and you should use their capabilities when you're able to.  So, let's use the idiomatic Clojure way of mutating - transients.

Now we have a second decision to make - we can either use a set as both a distinctifier and an accumulator (unique-vec2), or we can use it purely to check for distinctness, and accumulate the vector elsewhere (unique-vec3).  This is a purely benchmark-driven decision, but it turns out that depending on what you're distinctifying, it could be faster either way.

The tradeoff you're making here is that unique-vec2 has a tighter inner loop (accessing only one data structure rather than 2) at the expense of a slower final step (you have to persist the set, convert it to a sequence, and traverse it).  So, you're better off using it for shorter sequences, and unique-vec3 for longer or more-distinct ones.

The same pattern applies in raw Java, of course, but the thing I like about Clojure is the ability to use quasi-imperative code with high-level abstractions.  You now have a pretty fast, generic algorithm that operates over arbitrary streams, and you could make it faster by using a standard mutable HashSet for checking uniqueness, or possibly using multiple threads and an atom to distinctify in parallel.  These modifications require very little additional code and they'll still be compatible with whatever data you'd like to process.

Addendum:
I probably should've mentioned Clojure's built-in "distinct" function.  It takes essentially the approach of unique-vec3, except for that it uses a persistent set rather than a transient, and builds a lazy sequence rather than a vector.  Additionally, (vec (distinct coll)) is actually slower than (vec (set coll)) or the other 2 approaches, presumably due to the additional overhead of laziness.

Wednesday, July 25, 2012

Poisson regression - it's more useful than you think

Everyone knows about logistic regression.  It's the AK47 of predictive models - cheap (to fit), available everywhere, manufacturable yourself with some basic tools, and pretty darn effective at what it does, as long as you don't ask for the latest and greatest.  You can hand it to a barely-literate junior analyst and send them to the front, and they'll be slaying problems in no time.

But as your positive examples become sparser, and your data gets bigger, logistic regression becomes less of a good fit.  Particularly because, depending on your application, you might not actually care about individuals being likely positive or negative (after all, at a certain sparsity, the vast majority are likely to be negative - and one person, in lot of applications, simply doesn't matter).  Instead, you care about the number of positive examples you can count on in a group of a certain composition.  Examples of this would be banner ad response rates, or the number of clicks on an email.

And that's where Poisson regression comes in.  There is a "law of rare events", where as your positive examples become sparser and your total number of examples increases, a Poisson regression becomes a better and better fit to the "real" binomial data.  Add in measurement errors and modeling time, and Poisson regression becomes downright compelling.

Particularly because the coefficients in a Poisson regression have a natural translation into multiplicative effects on a rate, which can float above 1 if necessary, as opposed to the parameters in a logistic regression, which translate into multiplicative effects on odds that can never be greater than 1.  That works out well if you're modeling an arbitrage scenario, where you're relating dollars-of-input to dollars-of-output, and have a relatively small number of very positive examples primarily effecting that ratio.  Ideally you'll end up with a ratio less than 1.0, but you really do have to deal with segments where it is greater, and that's something logistic regression can't handle.  It's fairly common in the insurance industry to do this with dollars-of-claim per dollar-of-premium - also known as their loss ratio.  That kind of arbitrage-identification translates naturally to things like ad dollars per dollar of revenue, or customer service minutes per sale.


I've written up a very straightforward implementation of Poisson regression for Java, because I'm focused on the JVM at the moment and there wasn't one available on that platform.  A number of implementations exist in R, and a few standouts in Python.  Mine incorporates the notion of "regularization" - that coefficients are likelier to be zero than the data suggests (or equivalently, that you should have a bit of a bias towards 0 coefficients).  That can keep you from seeing large coefficients on sparse segments of the population that randomly have a large number of positive examples (I have some fun horror stories from the insurance industry about regularization via repeatedly forwarded spreadsheets and staff meetings - doing it with math is so much better).  If anyone finds it handy, let me know - there's a lot of tools in the standard statistical toolbox that are woefully underused, and it's great to see them in wider circulation.

Monday, July 16, 2012

How I explained garbage collection to my girlfriend

Let's say you're working on a really complicated math program, on a blackboard.  There are a lot of different sub-problems that interlock, and you've got to solve them in a certain order.  Eventually, though, you run out of space on the blackboard.  Garbage collection is how you figure out what you can erase to free up space, without erasing anything you're going to need for the next steps.

Sunday, May 6, 2012

Building a hybrid storage server / compute box / gaming rig

Hardware is fun, especially for software guys who don't get to play around with component selection as much as we'd like.  Having just sold the rig I used to use for gaming, and my 20-drive disk tower, I'm in a consolidation mode - but that just means I have to build something generalized.

There are three main goals to this build.  The first is storage and general serve-ing - although my trusty Acer Easystore has been sufficient so far, I'm nearing its capacity.  The fact that I've just got the one box also makes it a bit more difficult to experiment than I'd like - I'd like to try out things like native ZFS and different encryption setups, and for that you really need a spare.

The second is to serve as a recreational compute box - I've been playing around with doing some large-scale analytics on scraped web data, monte carlo optimization methods, CUDA/OpenCL, and other stuff that requires a good amount of time to chew through.  A single-core Atom CPU barely cuts it for doing a ZFS scrub, let alone doing actual math.

The third is the extremely occasional game - every once in a while it's fun to fire up Oblivion (and I hear Skyrim is pretty cool).

The issue with the 20-drive tower and my previous gaming rig were size (both), noise (drive tower), and catproofness, so I'm looking to correct those flaws.  Initially I had considered splitting the roles into a dedicated storage box and a mini-ITX compute / gaming box (both probably built from a Lian Li PC-Q08), but I judged the extra overhead (both in money, space, and administration) as not being worth it.

The case will probably be a Cooler Master Silencio 550, which has the advantage of having few cat-accessible fans (cats apparently love to roll around on big 200mm top-exhaust fans), a good amount of space for HDDs, and the ability to hold a full-size GPU with all the HDD cages in.  It's also a reasonably sized enclosure, which is very nice when moving or rearranging.  Finally, it's reputed to be one of the quieter cases you can buy - I'm hoping to keep the noise to a "white noise machine" level while in file-server mode (I actually find it difficult to sleep without some background noise).

99% of its life, it'll happily chug away in Xubuntu, serving up files and doing occasional batch-compute work, with occasional forays into Windows for gaming, or some other linux distro for something experimental.  The Easystore will stay right where it is now, but be relegated to "grab in case of fire" data.

Setting up the disks in the ZFS equivalent of RAID10, with several pairs of mirrored disks, has many advantages.  Unless both disks of a pair go down at the same time, the array can sustain a lot of damage, and read/write speeds should scale more or less linearly with the number of pairs (handy for I/O bound data work).  Unlike enterprise grade servers, this one will be upgraded piecemeal, which is handy since I can upgrade one pair at a time (vs having to upgrade all disks simultaneously with RAIDZ1 or RAIDZ2).

Finally, a GTX560 or so should provide enough oomph for gaming, and gives access to both OpenCL and CUDA.

Looking forward to putting the beast together and taking it for a spin.

Monday, April 16, 2012

Tuning JVM garbage collection for a (big, mathy, STMful) Clojure app

As part of my "real job", I'm developing a Clojure-based app that does a significant amount of number crunching.  The inner loop continuously ref-sets random portions of a large array of refs (when I say "large", I mean that I can plausibly fire off a 50gb heap).  I had a tough time getting it performant, and it's an interesting enough story that I thought I'd relate it here.

After the standard futzing with algorithms and data structures, the thing that ended up holding me back was excessive time in GC.  I started with the "throughput" collector (hey, I'm doing number crunching, I don't have real-time requirements, throughput is awesome!).  Somewhat surprisingly, I saw worse and worse performance as my app ran, ending in a kind of sawtoothed purgatory of GC.  What little information I found about Clojure-specific GC tuning uniformly showed using the CMS / low-latency / concurrent collector as a good choice.  Curious, I switched - but that just resulted in slow-rising, slow-falling memory consumption patterns, with the pauses coinciding with pure serial collections (and still, not enough work geting done).

This caused me to go a bit down the rabbit hole of garbage collection lore - for simplicity I've condensed this somewhat.  The JVM stores its heap object in the permanent generation (ignorable, mostly class definitions and such), a few pools in the young generation, and a tenured generation of long-lived data.  The young generation is collected via a parallel algorithm that isn't really that tunable, and the split between young and tenured (and the various young sub-pools) will be auto-tuned if you let it.  The real mojo happens in the tenured pool, where you have the choice of a CMS / low-latency / concurrent collector, or a parallel throughput collector, with various options for each.  The trigger that sets off a CMS collection is theoretically auto-tuned as well, and there are "ergonomics" for the parallel collector, but performance was crappy enough as things adjusted themselves that I never got the chance to examine the long-run, automated performance.  I had to figure out most of this via experimentation and reading lots of GC logs.

What I discovered was that because Clojure data structures are immutable, I had enough of them, and my mutation rate was low enough, the majority were advancing to the JVM's tenured pool.  When a ref was set to a new object, the old one lingered in the tenured pool until it was collected, and the new one eventually migrated there too.  This is what caused the slow rise in memory usage using the stock CMS collector - objects weren't being collected until a threshold of the tenured generation was filled, and by the time the threshold was hit, it was close enough to filling the tenured pool entirely that it always blew through and caused a full, serial, stop-the-world collection.  Causing these objects to stay in the "young" pool was inefficient, since the majority at any given time were still live and being copied back and forth (and the ones that happened to survive were just thrown away later).  I needed to set the collector to kick in earlier and hoover up the garbage accreting in the tenured pool.

However, even with the CMS collector set to run more or less continuously, I was still allocating objects faster than they could be collected, because the young-generation collector was promoting almost all of my objects.  The solution ended up being to not generate objects fast enough to fill the tenured generation and cause a slow serial collection.  There were a few interlocking solutions to this.  First, I set the number of CMS-collector threads higher than usual (4 in my case) to mark the garbage faster during a collection.  I hard-coded the threshold at which the CMS collector kicks in, and actually set it relatively high, in order to have control over when GC triggered.  I figured the maximum amount of time the app could mutate for before causing a serial GC or triggering an uncontrolled CMS GC cycle, and paused at the end of that period to manually trigger a CMS run and block until it was complete, via calling System.gc() and telling the JVM to associate that with the CMS collector.  For long-term operation when I didn't care about maximum throughput, just stability, I simply ran the object-generating portions of my app on fewer threads (could have just as easily called Thread.sleep() intermittently), and set the CMS threshold low enough that it was running continuously.

Calling System.gc() is nearly always a smell, but in my case I found it was the only way to ensure I was operating at max throughput without causing excessive pausing.  My allocation pattern simply isn't something the JVM deals well with - had I been mutating objects directly, rather than generating new ones, running an alteration to the objects that leveraged some Clojure structural sharing, or operating at a higher CPU-heap ratio, I could have likely avoided calling it.

My JVM options (some of which may be superflous, I wasn't able to establish whether CMSIncrementalDutyCycle or CMSIncrementalSafetyFactor were necessary without the incremental-CMS option turned on) ended up being:

-d64 -da -dsa \
-XX:+PrintGCDetails \
-Xloggc:gc.log \
-Xms50g -Xmx50g \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:ParallelCMSThreads=4 \
-XX:+ExplicitGCInvokesConcurrent \
-XX:+CMSParallelRemarkEnabled \
-XX:-CMSIncrementalPacing \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSIncrementalDutyCycle=100 \
-XX:CMSInitiatingOccupancyFraction=90 \
-XX:CMSIncrementalSafetyFactor=10 \
-XX:+CMSClassUnloadingEnabled -XX:+DoEscapeAnalysis

Monday, April 9, 2012

Clojure, interning, and immutability

"Interning" a variable means to have one underlying representation, regardless of where an equal-valued object appears.  Strings, for instance, are interned in Java - storing a million "copies" of "hello world" causes a million references to a single underling string.

You can intern anything - for instance, Java also interns ints between -128 and 127 (which is why

Integer x=10;
Integer y=10;
x==y;

is true (x and y point at the same underlying "10"), but not with x and y set to 1000).

Google Guava has a good implementation of interning that can apply to any object, not just strings.  But, a whole new library is overkill for quickly deduping a few collections - we'll implement something that gets us most of the way there in raw Clojure.

One of the caveats to interning is that because the different "copies" of the root object share the same underlying data, that data must be immutable.  Otherwise, Bad Things happen when you create a fresh copy of the same interned data and mutate it - there's no indication that that will have an effect on the other copies of that var.

Clojure handles this for you, because all of its native data structures are immutable.  In fact, it's better than interning, because of "structural sharing" - under the hood, a vector, and that same vector with a bunch of data conj'd onto it, share the same data for their common part.

But that only works if you start with a single data structure and "mutate" it with Clojure functions.  If you have, for instance, a bunch of generated, duplicated values in a hashmap, they will each occupy additional space.  Fortunately, it's trivial to de-duplicate such a structure, and there's no downside - Clojure's immutability means that you can still sling around modified versions of the map, with no side effects.

I've thrown up a gist that demonstrates this in action, and is reproduced below.


There's really not much "magic" beyond what Clojure provides you - we just throw all the values in a set, which (since sets compare their elements by value) gets rid of all but one copy of each distinct value.  Then, we return a new map, with the keys the same, and the new vals looked up from the set.  Voila!  We now occupy 10% of the memory we used to.

Sunday, March 25, 2012

Never delete, preferably never update

Web developers tend not to be database people (or even necessarily data people).  They access data mostly through an ORM, the whole point of which is to abstract away the "relational" bits (this is why key-value and document-based NoSQL solutions are so popular - everything is essentially durable hashmaps).  They also tend to access the data in isolation from the raw DB layer (eg, looking a few rows at a time, looping instead of doing joins on sets, and using programming language facilities instead of DB ones).  If you are in fact a data/base person and work with enough systems designed by application devs, you will eventually look at a table and your first instinct will be head-grabbing, "who-designed-this" incredulity (I rued the day I saw my first ActiveRecord polymorphic association in action).

Resist this temptation!  They have a job to do, and they are doing it - at least initially, they're writing the data layer purely to support the functionality of the app, and it likely fits that role pretty well.  They're not going to ask you before they design it, and probably shouldn't, because you don't necessarily know yet what sort of analysis it's going to need to support, and they don't necessarily know what future frontend stuff it'll have to be compatible with (and as much as possible, you want them to have the flexibility to make Cool Stuff as quickly as they can).  The most you can do is lay down some rules of thumb that let you attach functionality later as painlessly as possible.  Fortunately, there's really only a couple.

- Never delete data.
- As much as possible, never update data.

The second is really just a clarification of the first, since when you update a record, you're deleting the previous state.

The point of these is to be able to reconstruct all the past states of the table if you have the whole thing.  Say you're Zynga or whoever, and you want to track how many "credits" each user has in whatever game.  The first instinct is to have a "balance" table, with a user ID and their current balance, and increment / decrement it as they buy / spend credits.  This works fine, until you start asking questions like "How can I get customers to buy more credits", and "Do customers buy credits in bulk, or in many small transactions?".  You could hook in your payments system and your in-game inventory system to see how people were accumulating and spending credits, but you have to sum over the entire set in order to figure out the state at any point in time, and hope that it matches with your balance table.  Even if you store the state as of a particular time on separate lines, in order to figure out what led to that state, you have to link each row with its predecessor, which in strict SQL/set-theory terms requires a Cartesian join.  It would be much easier to record both the delta and the resulting state on a different line.

This is called "bookkeeping" and the Italians invented it six hundred or so years ago.

The key thing here is that you're recording both the action and the result, so it makes it incredibly easy to examine things from either perspective - what actions are most common, and what were the most common states.  The downside is that you have to store all this additional data.

But where you store it is a configurable parameter.  If your application only cares about the current state, then only store the current state - in the application's database.  Dump the rest to Hive, a series of daily Archive tables in MySQL, or wherever, and let your analysts worry about crawling through them. If it's too big to store it all, keep a buffer and discard everything beyond it - just as long as you have the ability to increase the buffer if you need to.

Sunday, March 11, 2012

Some Mahout recommendation framework performance tips

The Apache Mahout project covers a lot of ground - everything from distributed matrix algebra to clustering to binary classifiers and recommendation engines.  I've been working with the recommendation component (what used to be Taste), and I've discovered some interesting things about its performance characteristics.  The code is still a bit nascent, and there are plenty of opportunities to realize some big gains.  All of the following refers to 0.5, unless otherwise noted.

- The only DataModel remotely fast enough for production use is the GenericDataModel, which underlies the FileDataModel and the RefreshFromJDBCDataModel, as well as the Mongo data model in 0.6.  MySQL and other raw JDBC data models are not fast enough, despite what Mahout in Action and the docs tell you (especially with large datasets), unless you throw enough caching on top of them that the gradually become totally in-memory anyway (this is not necessarily a bad idea if you're recommending for definable chunks of the population at a time - it's similar to the working set vs. the whole dataset in MongoDB).  Fortunately, the data structures underlying the GenericDataModel (giant primitive arrays) are actually quite efficient - you can comfortably store 100M preferences in ~4GB of heap.

- Constructing this DataModel can take a very long time, depending on your data (particularly if you have many users per item or items per user) - this is because the GenericUserPreferenceArray and GenericItemPreferenceArray underlying the data model are sorted using a homebrewed selection sort.  Replacing this with a better sort (a comb sort in my patch) dramatically reduces construction time.  Alternately, if you're swimming in RAM, it's really not that hard to implement a DataModel on top of standard hash tables and such, and just swallow the object overhead.

- Caching is your best friend.  However, for user-based recommenders, you have three options for caching, and they all have different memory / performance gain tradeoffs.  You have access to a CachingRecommender, CachingUserSimilarity, and CachingUserNeighborhood.  CachingRecommender is probably best ignored and dealt with in your application layer, since the recommendations themselves do not per se depend on other recommendations, and you usually already have caching on some layer for external API calls.  CachingUserSimilarity is more useful, but because your total user similarities scale as N^2/2, this might just suck up memory without a corresponding number of cache hits.  In profiling, I've found that the most useful one, particularly for getting a large number of recommendations per user, is CachingUserNeighborhood, and this is where you should put your spare memory capacity for maximum effect (unfortunately the size isn't configurable as it stands, to quote the comment "just a dumb heuristic for sizing", but this is an easy change to make).

- For item-based recommenders, the only real choice is the CachingItemSimilarity.  Just make sure your cache size is large enough.

- Related to caching, the caching methods internally are synchronized, which means it makes a lot of sense to parallelize calls to your recommender during the cache warm-up phase.  A good profiler (and maybe even some changes to the source to add logging to the the caching class) will help you here - with a large amount of data, the long-run cache hit rate dominates performance, so it's important to monitor how fast that ramps up at a given cache size.

- For user-based models, the best performing similarity/distance measures are the Tanimoto, Log Likelihood, and City Block - about an order of magnitude faster than Spearman or Pearson correlation similarity.  Only City Block incorporates the actual values of recommendations in the similarity measure.  The speed / accuracy tradeoff may or may not be worth it, depending on your data, but the extra performance headroom gives you the opportunity to do more exotic things like voting schemes, alternate parameterizations of what an "item" or "ranking" is, and other sneaky ways to get more nuanced recommendations out of the framework.

The Taste framework is solid for what it gets you, but it's definitely not plug-and-play for problems that involve large amounts of data, performance constraints, or odd product offerings that don't neatly fit with the item/user conceptualization.  Really that's all you can ask for - these are tough problems to get right (and right consistently), and you will need someone who knows what they're doing to sit down and hash out how it all fits together.

Tuesday, February 28, 2012

Data science is the new cybernetics (if you let it be)

"Cybernetics" is a horribly dated word that, as near as I can tell, was last used in earnest sometime in the 80's, and now elicits at best the same uninformed, vaguely positive association as the phrase "neural network", and usually just a shrug and a question about robots.  This was not always the case - in roughly the 50s - 70s it was a hot field that claimed insights into everything from psychology to economics to ecology to bioengineering (if you have time purely to waste, check out the number of "subfields" listed on the wiki).

It never quite lived up to its promise, but despite that managed to have a lasting influence on what might be called a culture of quantitativeness (if someone at work has ever said "I don't have bandwidth to deal with that right now", thank Norbert Wiener et al).  The central insight of a field that claims as much breadth as cybernetics did at its peak is always a bit tricky to define, but if I had to boil it down it would be that although the forms might differ, information-theoretic constructs like signals, feedback, and interference make it possible to describe many systems as complex systems per se, with common patterns of behavior that arise from those abstract constructs.  Crucially, this lets you describe them with the same set of tools and analogize between them using the same shared vocabulary.

The problem is that this only gets you so far before you actually have to understand the specific mechanisms and where they differ from the flowchart you cooked up.  You can't use "cybernetics" to build a pacemaker, you can only use it to compare and contrast the general design with NTP.  The problem with that is that even with a shared vocabulary, scientists are generally terrible about analogizing abstract systems to other fields (even pure technical knowledge transfer is sticky a lot of the time).  They're a) more interested in their own, more technical problems than abstract theorizing, and b) don't have the background (or time to develop the same) to understand where these analogies break down.  Rather than generating cross-pollination, much of the time spent on cybernetics per se ended up being, so to speak, pure self-fertilization (culminating in things like second-order cybernetics).

I see some of the same promise developing in the "new" field of data science, but it's important to avoid a hangover from missed expectations.  Just like cybernetics, there's a semi-shared terminology and toolbox, and the problems are fundamentally ones of description, optimization, and control.  "Build me a model of sales, so I can figure out what's important, so I can sell more" is a totally reasonable request to make of a data scientist, and one that does not fundamentally depend on what exactly you're selling (except as it informs your choice of variables).  Regardless of whether you're selling Chevys or soft drink syrup, you'll be better off with someone who can work the numbers (whether that's a statistician, a machine learning expert, an econometrician, or a bioinformaticist) than with a sales executive, and they'll all be using similar approaches.  There's a real temptation to keep going with this line of reasoning, and evangelize "data science" as a general solution to quantitative problems that come with large datasets.  In some sense it is.

The hangover will come when data science gets billed as something it's not.  "Data science" isn't really a science, it's the art of making valuable inferences from data, and maintaining the infrastructure (eg, data storage and analytics software) that makes that possible.  To that end, it's not about a different approach to your data or a new spot on the org chart, it's about having the right people and giving them a mandate (= responsibility + support) to use their skills to make your organization better.  This is what DJ Patil's 90-day, rock-my-socks criteria is about - hiring extraordinary people and trusting them (for 90 days at least) to use their general quantitative and technical skills on your specific problems.  There is no "data science approach" to a question; there are only data scientists, data, and problems.

It worries me a bit when I see link-bait articles in trade publications with headlines like "What can Data Science do for you?", because it means we're a short way from 5-course data science certificate programs at your local continuing ed college, "data science" teams that amount to reporting and pivot-table-parsing groups, and "data science" conferences that do nothing but promote consultancies and software that are a combination of bog-standard databases and overpriced regression tools.  They think the solution is the toolkit and the vocabulary, not the person who uses them.

I'm of the opinion that "data scientists" should be treated as the new MBAs, not the new cyberneticians.  An MBA signals that they are a certified Smart Guy, they have a shared vocabulary with the rest of your organization, a toolkit that's likely applicable, and evidently enough ambition to make a stab at hard problems.  That said, there is no consensus "management science approach" to most problems (if you know what you're doing) - management science is the thing that is done, not the way it is done.

In the same way that not everyone needs a Stanford MBA, and if you have such a person, you had better make use of them, treat your data scientists as expensive people whose job it is to take responsibility and solve problems (and happen to have a particular toolkit), not people whose job it is to "do data science".  Join the party, but avoid the hangover.

Saturday, February 25, 2012

Why is writing stats / machine learning code different?

I write a lot of quantitative code, and especially for algorithms that are fresh out of academic journals, it seems very different from the kind of development that a lot of my colleagues do.  I've tried to think a little about why the workflow is so different.

- Academic statisticians aren't (for the most part) "programmers", and their fundamental unit of output is papers, not software, which means you often don't have a good implementation to work from.  Some libraries are of incredibly high quality, especially the core linear algebra libraries (ATLAS, LAPACK, etc.).  Other times, if you dig into the code you find bizarre global variables, algorithms with terrible scaling factors, or a plainly incomplete implementation.  For the bleeding-edge stuff, you'll have to dig into the source to find out (if it happens to be available, which a lot of the time it's not), or risk it blowing up in entertaining ways later.

- You don't know if it's going to work.  When you write code for a website, the outcome is more or less binary - the page renders, the order completes, life is good.  At the immediate margin, it matters very little how quick the code is, if it's maintainable, etc. (the long run is of course a different story).  If you're trying out a new modeling approach, it has to do better than whatever you have now, along a bunch of different dimensions; and all of this fundamentally depends on the quality of your algorithm.  This leads to a lot of discarded approaches and time that feels wasted, and longer / less immediately rewarding feedback cycles.

- The cake is a lie.  Your source material will make fantastic claims of performance, that are true on some publicly available dataset (Netflix prize, MovieLens, etc).  But not too seldom, the gains evaporate when you apply them to your own particular problem area.  You're forced to tune your BS detector pretty finely to get a sense of where the "state of the art" really lies, what it's really good for, and what a good baseline implementation might look like.

- Testing is hard, because you expect the results to change and can't simply test the presence of certain conditions.  That means you need simulation code, multiple quantitative evaluation metrics, testing and validation datasets, all of which is overhead that for the most part doesn't exist on the "normaller" side of programming.  It's also possible to develop lovely complementary bugs in your simulation-generating, modeling, and verification code that make it appear as though things are working, until you hit something outside of your codebase.  (On a similar note, fuzz testing a la Haskell's QuickCheck is super cool, and if I had copious free time I'd make one that verifies statistical properties).

I've got great respect for systems-level programmers, whether that system is an OS, a website, a programming language, etc., and I certainly don't think what I do is any "harder" in absolute terms than any of those.  Many of those systems will rock my socks any day.  That said, I think there's one fundamental difference - if you want a system that makes predictions as opposed to doing actual "stuff", there's always going to be increased overhead and a lack of certainty about whether or not it's "really" working.

Saturday, January 21, 2012

Dealing with high-latency workflows

Programming, like a lot of things, gets better (more fun and arguably more productive) with short feedback loops.  That's part of why agile development and dynamic programming languages are so popular with developers - the mental jolt you get every 10 minutes from tweaking your code to pass tests, or seeing your page render, keeps you going, and lets you know if you're on the right track in design terms.  Statically typed & compiled languages inherently have a longer feedback loop, because your code simply Won't Work unless your class layout, syntax, etc. are correct, and on top of that, you have to wait for the compiler to work its magic.

A lot of database / big data / analytics code doesn't follow that low-latency pattern.  If you're doing systems-level things (optimization code, for instance), validating the performance usually means waiting for it to optimize some non-trivial problem, which usually takes a while.  Data-processing code even more so; if you're trying to bring runtime from 20 hours to 16 and your code is reasonably complex, even testing the performance-critical parts might take a solid couple of days, and testing a completely new approach that might or might not be faster will take even longer.

So how do you stay productive when you're doing these kinds of projects?  In programming terms, how do you avoid blocking on I/O or wasting your time in context switches?  I've found some things to be helpful.

- Fire and forget (AKA, pipelining).  Don't be afraid to write a single-threaded version, and a version with a producer feeding a queue to multiple consumers, and a version that splits the work into multiple components, etc.  Kick them off as they complete (or schedule them to run when the prior one finishes) and write the next one while it runs.  Most importantly, don't look at the code once you submit it, until you finish them all - you will be tempted to tweak it a bit, kill the running job, and resubmit.  Only one of them will be "fastest" for the application du jour, but you will probably use the other code somewhere eventually.  It will feel like a waste, but remember that everyone discards a lot of code; you'll just be doing it in chunks.

- Fix the things you can fix ahead of time.  In particular, validate that at least your syntax is correct; nothing quite sucks as much as discovering that your last piece of code didn't fire off because you had a rogue "andd" in it, or your vectors were nested one level too deep.  If you're using a language where these kinds of things are difficult to detect statically, have a micro-dataset custom-written to ensure that every line of your code gets called at least once.  If you're super fancy, make a dataset that is exactly double the size (but still small) and count the number of calls between the two to get an idea of whether you accidentally made something non-scalable.

- Write locally, test remotely.  Most analytics code is designed to max out resource utilization, and your laptop, as beastly as it is if your employer knows what they're doing, can not handle running these multiple simultaneous performance-intensive tests while you multitask in user-land as well.  Run your code remotely and you will avoid having your editor of choice crawl to a halt, or waiting 30 seconds to switch between your API docs and your email.  If you don't have the resources to test in production (or there are other loads running that will skew the results unpredictably between jobs), consider spinning up a few EC2 instances specifically for benchmarking and testing your code.

- Find your optimal level of work-unit granularity, and schedule things so they align to it.  Some people are most productive when they chug away on a problem for a couple days at a time; others need to switch to something else every couple of hours in order to stay "fresh".  Regardless, make sure you have other stuff to switch to at the appropriate time - if you're a short-granularity person, try to schedule, say, your meetings throughout the day rather than in a bloc.  This is a solid approach anywhere, but specifically when your work is high-latency, it's nice to be able to kick off a job, do an interview or somesuch, and see the results return just as you get back to your desk.

Your experience might be different, but I've found that I need a genuinely different workflow than my pure web-programming colleagues to stay productive.  Managing workflow is an interesting problem in and of itself, and the most important thing is to experiment with it until you find something that works for you.

Friday, January 6, 2012

Starting a raw nREPL server, and connecting it to Counterclockwise/Eclipse

I like big data (I cannot lie).  I also like reasonably friendly dev environments that make it easy for me to switch between, eg, Clojure and the Java libraries I'm calling.  Eclipse & the Counterclockwise library make this fairly easy.  However, the default REPL that CCW starts has a max heap size of 128MB, which is fine for testing whether your code parses, but a bit thin for dealing with large amounts of data.  Because it actually starts a separate Java process, the options passed to Eclipse in eclipse.ini have no effect.  If you want to send custom JVM options to your REPL at startup, as far as I can tell the only way to do it right now is to manually start it outside of Eclipse and connect to it.

After some digging, I was able to figure out how to start a raw nREPL server on the command line, that can be passed whatever JVM startup options are necessary.  If you add the following entry to your project.clj under :dev-dependencies :

[org.clojure/tools.nrepl "0.0.5"]


And run "lein deps", you will have the necessary jars to do the following:

#!/bin/sh
java -classpath /MYPROJECT/lib/*:/MYPROJECT/lib/dev/* clojure.tools.nrepl.main --port 9000

That will fire up a raw nREPL server on port 9000 that you can easily connect to with Eclipse / Counterclockwise.