Your email address & Posterous

Over the past few weeks I've seen an increase in the amount of spam sent to one of my email addresses, perhaps as much as three times the average from a month ago. It's usually not possible to tell the exact cause of an increase in spam: unless you judiciously track and sparingly use an address, there is always the possibility that its arrival on spam lists was simply delayed, or that it got there through an avenue you had not even considered. Nevertheless, I decided to search for the address in Google to see if I could find some clue, and I just may have found one.


Searching for the email address exactly as one would type it into the To: field of an email returns six results, four of which point to this blog. This alone is not conclusive, however: my email address is a combination of my first name and the nickname under which this blog is currently subtitled, so the search engine may simply have tokenized the address and matched the keywords separately (my name is in the link to my profile and the nickname is in the header of the blog).

So I searched again, this time replacing the '@' and '.' in the email address with spaces (i.e. tokenizing it myself). This search turned up 76 results, more than 12 times as many as for the untokenized email address. I repeated the experiment with another email address I've used with Posterous and got similar results: searching for the email address itself returned just a few results (one of which is on Posterous), while searching for the tokenized address returned orders of magnitude more.
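To make the "self-tokenizing" step concrete, here is a minimal sketch in Python of the transformation I applied by hand. The function names and the sample address are made up for illustration, and I am only assuming that a search engine splits the address in roughly this way; I don't know how Google actually tokenizes it.

```python
import re


def tokenize_address(address):
    """Split an email address on '@' and '.', dropping empty pieces."""
    return [token for token in re.split(r"[@.]", address) if token]


def tokenized_query(address):
    """Build the space-separated query used for the second search."""
    return " ".join(tokenize_address(address))


# Hypothetical address in a firstname.nickname form (not my real one).
example = "jane.nickname@example.com"
print(tokenized_query(example))
```

The untokenized search uses the address as typed; the tokenized search uses the string this function returns.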

My concern is that spammers may discover a way to exploit these results to extract valid email addresses for spam lists. At first this may seem like an unlikely or perhaps unproductive attack, considering the number of possible email addresses vs. the number of Posterous users. Consider, however, that not many people have email addresses like k9OJ40az.39gj@poq1z1a.com. Just as in password cracking, spammers can use heuristics such as a so-called dictionary-style attack (starting with the most likely words and working towards the less likely possibilities), and can restrict themselves to valid domains, to increase the yield by reducing the number of possibilities. As the number of Posterous users increases (and as the number of other services vulnerable to this increases), so too do the returns. Furthermore, the harvest can be had simply by conducting Google searches, and would be difficult to detect at best when conducted via a botnet.
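As a rough back-of-the-envelope illustration of how much a dictionary shrinks the number of possibilities, the sketch below compares the count of random 8-character local parts (the part before the '@') with the count of firstname-plus-nickname combinations. The alphabet, the length and both word-list sizes are assumptions I picked for the example, not measurements.

```python
import string

# Simplified model: local parts made only of lowercase letters and digits.
ALPHABET = string.ascii_lowercase + string.digits  # 36 characters


def random_local_parts(length):
    """Number of distinct random local parts of the given length."""
    return len(ALPHABET) ** length


def dictionary_local_parts(first_names, nicknames):
    """Number of firstname + nickname combinations from two word lists."""
    return first_names * nicknames


# Assumed sizes: 1,000 common first names and 10,000 plausible nicknames.
random_space = random_local_parts(8)
dictionary_space = dictionary_local_parts(1_000, 10_000)

print("random 8-character local parts:", random_space)
print("dictionary combinations:", dictionary_space)
print("reduction factor:", random_space // dictionary_space)
```

Under these assumed numbers the dictionary covers a tiny fraction of the random space, which is why guessing likely names first pays off far more than guessing blindly.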

Is this a valid concern? Can anyone shed some light on why the more specific, untokenized search for my email address returns fewer results than the tokenized version, and on how likely this is to be exploited?