Brightball

The Time I Accidentally Ended Up Combatting Fraud for a Year

DMARC | Phishing | Rails | Security | Fraud | Email - February 10, 2023 // Barry

Lately, I’ve been spending a lot of time enjoying the Darknet Diaries podcast and it’s compelled me to finally share the entire story of the most intense year of my 20-year professional career. I was the sole developer hired by a company going through a circus-like ownership transition while criminals actively worked to defraud the 300,000 users of this 14-year-old, high-end marketplace.

There were late nights, numerous technical challenges, work with abuse response teams, a lot of lessons learned about phishing and fraud, high emotions, death threats, and at least one person lost a business that depended on the site. Here’s the story from start to finish, including how to prevent many of these problems on your own site. Buckle up.

Full discussion on Hacker News.

Quick Disclaimer

Even though this story took place 10 years ago, I won’t be naming the company or anyone involved. I’ve used a name generator to replace the parties in the story. While emotions certainly ran high during this period in my life, a decade of hindsight has left me extremely grateful for the experience that the opportunity presented me. The only details provided on the actions of the people who took part in this story will be those necessary to help understand the situation.

How did I get myself into this situation in the first place?

Our story takes place in 2012, but the year prior to that was a difficult one for me professionally. We officially shut down Brightball for contract work in late 2010 and I’d gone back to work for Windstream on a contract-to-fire job (a perpetually renewed, short-term contract). It was a frustrating experience for me because we had done so much incredible work, but the disparate scope across numerous small clients didn’t amount to an appealing resume. While I was at Windstream I was working on the production support team, handling deployments, emergency issues and writing a lot of Perl.

One day Roland, a former client of mine, called to see if I could come down to his office during my lunch break to discuss an issue he was experiencing, so I paid him a visit. He was working on a new project, one that he’d spoken to me about just prior to my company shutting down. Because of the timing and some personal things I was dealing with, I couldn't bid on the project, but I did give him some advice about what to watch out for from other contractors. Just a few tips about practices that I thought would land him in hot water but that a contract company would probably consider fine. I can’t remember much of what that advice was, but ultimately the project did land in hot water.

This 14-year-old marketplace, built from scratch in Perl, was being rebuilt as a Ruby on Rails site. The site included a marketplace, forums, reviews and other features, but initially only the marketplace was being rebuilt. I was never given a clear answer as to who made the decision, but someone decided to go ahead with the launch by migrating all of the user and marketplace data to the new site. This process took 3 days and caused both sites to be down for the entire transition.

When the new site reappeared, it looked a lot nicer but it was very buggy. To make matters worse, there was no going back. Tooling had only been built to transfer the data one way, from the Perl site to the Rails site. Also, the developer of the Perl site had walked away after completing his obligation following the sale of his half of the business to his partner. I suspect that timing had something to do with the decision to rush the launch with no live testing.

Roland told me all of this after inviting me down to his office to see if I could recommend somebody to help clean up this mess. We talked through the situation and what needed to be done. At the time I only knew 2-3 people who were competent enough with Perl to take on a project like that, even though I’d dabbled some myself, but he ended up offering me the job on the spot.

It was a crazy situation to consider jumping into, but as an online marketplace (like eBay but for a very niche market) there was a natural network effect built in with all that history. The sellers couldn’t easily go elsewhere without the buyers. The buyers couldn’t easily go elsewhere without the sellers. There were over a dozen competitor sites all trying to capitalize on the chaos but no clear, single migration path. There was no Facebook to their MySpace. No Reddit to their Digg. If there was, this story would have unfolded differently.

I thought that would probably buy enough time to settle things down. In a weird way, this crazy situation actually felt like it was made for me with the sheer diversity of technology that needed to be tackled. Even the 9 month Perl refresher I was in the middle of at Windstream was incredibly fortunate timing. I’d learned so much over 3 years running Brightball with my 2 partners, but the lack of “5 years experience with Java or .NET” meant that I wasn’t qualified for almost any job in upstate South Carolina in 2011. As stressful as my empty job search had been for the prior year, I’d been praying hard to find the right place. Three days prior to getting this call from Roland, I actually broke down in a stairwell at Windstream and cried because I was so worried all of my time with Brightball had been a waste.

So I went home, discussed it with my wife and accepted the job that night.

Orientation

Every new job has a first day and this one was no different. I was brought into the makeshift shared office space, given a tour of the systems, and introduced around while I asked for all of the access I needed.

The long time Perl site was hosted with a data center provider that I’d never heard of, on 26 servers running a version of FreeBSD from the year 2000. The site had no database. It ran entirely from text files that used a directory structure and file names to create its own version of indexes. While this is an easy thing to gawk at, the site was incredibly fast. Perl excels at working with text files.

This particular code base was a nightmare to get around in. The code had never used a version control system of any sort. Sometimes files would be slightly different on different servers. All variables were global and functions would pass data between each other by setting the global variable so that it could be referenced elsewhere. In order to debug what was happening in any given function you had to trace the exact execution path that called it to figure out where the data was being set. All email was being sent directly from those servers as well.

The new Rails site was developed by a company in Atlanta; we’ll call them Rails-R-Us for the sake of this article. Like most contract development shops, their business model involved assigning a very senior developer to start and architect the project before eventually passing the work on to a less experienced developer while still charging the senior developer's rate. This project was no exception, and there was a noticeable difference in code quality based on the name on the commit.

At the time, I was familiar with the Rails structure and reasoning thanks to a lot of CakePHP work, but I didn’t know Ruby at all yet. Luckily, it was fairly easy to pick up, though it certainly had its challenges and pain points along the way. I was added to the company HipChat and introduced to these developers. Roland told me to just ask them if I had any questions.

Rails-R-Us had a solid development pipeline set up for 2012. The site was hosted on Heroku, using Heroku’s PostgreSQL database, Solr for search, code versioned with GitHub, and a full test suite. There wasn’t a CI system in place at the time, so developers were expected to run the tests on their machines as part of their commit flow. Email was sent via Sendgrid. We had the free New Relic plan coupled with Airbrake for tracking performance and errors in the production environment. This was solid footing to move forward.

As I was getting to know the code base, I was also introduced to the owner, Nathan, and the COO, Linda. Roland was also going to be interviewing two new support staff over the next week and mentioned that he would like me to participate if possible.

Roland and I drove to the Rails-R-Us office in Atlanta so I could meet them in person, get a tour of the code base and ask questions. The office was close to the Georgia Tech campus, with a very open layout and a TV that appeared to be playing Iron Man on repeat. I met with Clark and Jimmy, the senior and junior developers who had built the code base so far. They showed me around the stack and I asked some questions about the reasoning behind the technical decisions that were made, just so that I could understand the choices involved.

The one I will always remember, though, was the database. PostgreSQL is a fine database and there are numerous excellent reasons to select it, but the reason it was chosen for this project was simply because “It’s what Heroku says to use.” This floored me. The database is the single most important part of almost any web application. It’s the backbone of everything and the hardest part to scale. It’s the one piece that, if it goes down, takes everything with it. The developers of the site not knowing WHY the most important component of the site was chosen was a huge red flag for every other decision that was made.

While this was a shock in the moment, I would later learn that this attitude was fairly common among Rails developers at the time who were largely encouraged to treat their database as a dumb data store by convention and to let the application do the work. This nugget will foreshadow many of the issues that had to be resolved.

So this is where we were after the first week: three levels of management, a junior Rails contractor in Atlanta working two weeks at a time, and me. Support staff were yet to be hired, and the prior staff were implied to have been fired for some reason. The only person knowledgeable about this giant pile of Perl was unreachable. The COO and owner lived in another state but planned to move eventually. The site was buggy, users were in an uproar, steady income was tanking, competitors were trying to capitalize on the chaos…and it was my job to fix it.

But the fire hose of issues hadn’t even been turned on yet.

Turning on the Fire Hose

That started once we hired our two support staff, one for primary customer support and the other to get yelled at on the phone by angry premium members. Those were not the official job descriptions, but that's pretty much how it went. Premium members were the paid users of the site. The ones who ran businesses where this was their primary sales channel. The people paying for advertising packages. Disruptions to this site were disruptions to their livelihoods. And there were a lot of disruptions. Until I was in the same room with the support staff and seeing what was being reported, I had no idea how bad it was.

Our users were phished over email. They were defrauded with fake listings, fake purchases and Western Union scams. Reputable accounts were taken over and the original owners locked out. Our users were being actively spammed through our own systems for everything from attempts to bypass listing fees, to enticements to move to a competing site, to outright fraud. When we eventually got really good at responding to these attacks, the attackers started planning them around our business hours to slow the response time.

This is without even talking about the actual bugs in the site itself, like N+1 problems so bad they could crash a web server. Users were rightfully ticked off. One long time business owner who depended on this site went belly up because of the chaos. Another called our office threatening to come find us all and kill us, so we had to call the police. In the midst of dealing with all of this, every person in our office had their credit cards stolen at the same time...twice. Somebody even set up an entire blog dedicated to us called "COMPANYsucksnow.com" but did eventually take it down.

Every competing company tried to take our users at the same time and hilariously ended up dealing with the exact same fraud problems that we were experiencing, but worse, because they attempted to use free listings to entice away our sellers. The scammers weren't going after just us, they were going after this entire premium market. When we eventually dealt with all of this stuff, the site became the safest marketplace in the industry.

This entire year was an adventure unlike anything I've experienced in my career, but it has a happy ending. Each scenario encountered is broken down into its own blog post, linked here with a summary. They say necessity is the mother of invention, and did we ever do some inventing.

The 1st Phishing Email

All transactions on the site were conducted directly over personal email, so replying to any communication on the site exposed your email address. This led to tons of phishing against the user base. The chaos of having a new site design in place with so many things changing made it easy enough to dupe users, but people were also able to send emails that appeared to come directly from our email account and I had no idea how to deal with that. The answer is DMARC, in case you're curious. DMARC is basically magic. That's step one in your anti-phishing journey. If you don't have it set up yet, you need it.
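For anyone starting that journey, the core of DMARC is just a DNS TXT record published at _dmarc.yourdomain that tells receiving mail servers what to do with messages that fail SPF/DKIM alignment and where to send reports. A minimal sketch (the domain and reporting address are placeholders, not ours):

    _dmarc.example.com.  IN  TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"

Start at p=none to collect aggregate reports without affecting delivery, then move to quarantine and eventually reject once you're confident every legitimate sending source is authenticated.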

There's also the whack-a-mole situation of the phishing sites that accompany each campaign and the need to get these taken down as quickly as possible. If 1,000 messages go out to users and they all point to the same phishing site, you take down the phishing site quickly and the phishing message is rendered moot. The Golden Hour is critical to this process, because 50% of people who will fall for a phishing scam do so within 1 hour of the message being sent. Getting it taken down eventually is simply not good enough.

The detailed account of this journey is covered in Combatting Phishing with DMARC, and a followup, Deploying DMARC Without Breaking Everything, focuses specifically on DMARC deployment. After this experience, I got very deep into DMARC, including 3 years working for dmarcian and speaking at a number of conferences, including M3AAWG in Atlanta and the Anti-Phishing Working Group's (APWG) eCrime Summit in San Diego, thanks to dmarcian founder Tim Draegen. Part 3 of my DMARC blog series covers learnings from that experience in Enterprise Challenges with DMARC Deployment, but doesn't directly relate to the rest of this story.

ANNOUNCEMENT: At BSides on October 28th, 2023 I announced the upcoming 2024 private beta of dmarcSTAR, a service that approaches DMARC in what I believe is the ideal manner after over a decade of experience with the specification. Read more at dmarcSTAR.com.

[ read the full story of the 1st phishing email... ]

Reversing Account Takeovers

When an account was actually compromised because somebody fell for a phishing scam, we had to deal with the cleanup. The phishers wanted access to established user accounts with transaction history so they could lock out the original user and dupe others into fake transactions. This happened a lot, and in order to clean it up, support had to be contacted, identities had to be verified and steps were taken manually to reverse the process.

Building in a way to Automatically Reverse Account Takeovers is actually very straightforward. Security features like that just don't exist in most product roadmaps so people typically don't prioritize the work. After the Experian situation in July 2022 I decided to share this technique here on the blog. Brian Krebs ended up sharing it directly as well, which inspired me to write up this entire blog series so thanks Brian!

The short version: you have to permanently track every change to user credentials and contact information, making sure that each step along the way has a means of reversing everything that came afterwards. If you don't do that, somebody can just change the access information a few times to cover their tracks.
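As a rough illustration of the idea in Rails terms (the model and column names here are hypothetical, not what we actually built), every credential change gets journaled so that a notification sent to the previous contact point can roll back everything that came after it:

    # Hypothetical sketch: journal every change to email/password/phone so a
    # link sent to the *previous* contact point can undo everything after it.
    class CredentialChange < ApplicationRecord
      belongs_to :user
      # columns: user_id, field, old_value, new_value, created_at
    end

    class User < ApplicationRecord
      has_many :credential_changes

      def change_credential!(field, new_value)
        transaction do
          credential_changes.create!(field: field, old_value: self[field], new_value: new_value)
          update!(field => new_value)
        end
      end

      # Restore every value changed at or after the disputed change,
      # newest first, returning the account to its pre-takeover state.
      def revert_from!(disputed_change)
        transaction do
          credential_changes
            .where("created_at >= ?", disputed_change.created_at)
            .order(created_at: :desc)
            .each { |change| update!(change.field => change.old_value) }
        end
      end
    end

The important property is that the journal is append-only and the rollback walks it in reverse, so stacking several changes on top of each other doesn't help the attacker.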

Preventing account takeovers in the first place by mandating Multi-Factor Authentication would be ideal, but having this structure in place as a backup certainly wouldn't hurt. It should be built into every auth system.

[ read the full story of reversing account takeovers... ]

Waste Spammers Time to Kill Their Return on Investment

Spam was everywhere. From competing sites, to fraud, to people who were just trying to sell stuff by abusing communication channels. It was a nuisance to our end users too. We learned early on that we had to protect everybody without them knowing it, so we had to quietly figure out how to deal with this problem.

Of the blog series so far, this is the least popular post in terms of traffic, but in my estimation it's actually the most important thing that we learned. Every one of these scammers is running a business of their own. They are investing time to defraud people, and the longer it takes them to do it, the lower the reward is for their time. Stopping them entirely is impossible, but you can render them so ineffective by trapping them in a cage of uselessness that they'll spend hours, days and weeks trying to defraud people who never see their messages.

The solution was building our own spam detection algorithm using Levenshtein distance and applying it with a level of strictness based on our own internal user trust scoring system, which we also built. That allowed us to give established users much more freedom while new accounts had to take steps to earn more trust. We also set up a catch-and-release approach to this problem, where we reviewed user behavior incrementally to determine if we may have been too aggressive. If so, we released the messages with a slight delay.
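A rough sketch of that detection step (the thresholds, trust levels and gem choice below are illustrative, not our production values):

    # Flag a message as probable spam when it is nearly identical to recently
    # seen messages, with stricter thresholds for low-trust accounts.
    require "levenshtein"  # e.g. the levenshtein-ffi gem

    def similarity(a, b)
      return 1.0 if a == b
      1.0 - Levenshtein.distance(a, b).to_f / [a.length, b.length].max
    end

    # Lower trust means less tolerance for near-duplicate messages.
    SIMILARITY_THRESHOLDS = { low: 0.6, medium: 0.8, high: 0.95 }.freeze

    def likely_spam?(message, recent_messages, trust_level)
      threshold = SIMILARITY_THRESHOLDS.fetch(trust_level, 0.6)
      recent_messages.any? { |previous| similarity(message, previous) >= threshold }
    end

Messages that trip the check are quarantined quietly rather than bounced, which is what makes the catch-and-release review possible.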

This technique was so effective when combined with everything laid out in the two prior posts that reports of fraud virtually stopped. We even made a dashboard of the verified and caught spammers, as well as how much time they were spending manually typing in messages while completing Captchas...that were never delivered. Our worst offender was doing this for 15 hours a day straight! That put a smile on all of our faces.

[ read the full story of wasting spammers time to reduce their ROI... ]

N+1 Problems and Beautiful Server Crashes

It wasn't all fraud combat though. This was also my first introduction to working with Ruby, Rails and Rails culture. Don't get me wrong here, I love Rails and the community, but I started out with a sour impression. The code for the new Rails site had a few bugs and performance issues, but considering the entire site was built in just under a year it was still very impressive work.

What I couldn't abide was the "beautiful code" mantra that seemed to exist at this company at the time. There was no finer example of it than when one of our long time users would visit his account history page. He ran a business selling on the site and had been there almost from its inception 14 years prior...and the account history page had an N+1 problem. In order to load just 100 paginated items on his account history page, over 50,000 queries were triggered, using so much RAM that it crashed the server.

I looked at the problem. It was a single line of code that chained all of these queries together. It was succinct, beautiful, tested...and wrong. In order to fix it, I used some SQL directly to reduce it all to a single query. I verified it worked and deployed it immediately. The problem was fixed and the crashes stopped. This caused the tests surrounding it to fail. For this sin, the contract company was very upset. They weren't upset that the code which passed their tests was crashing the production server, only that the tests were no longer passing. It bothered me.
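For Rails readers who haven't hit this, the shape of the problem and the fix look roughly like this (the models are illustrative, not the actual marketplace schema):

    # N+1: each iteration fires another query for the associated listing,
    # so 100 history rows means at least 101 queries.
    user.orders.limit(100).each { |order| puts order.listing.title }

    # Eager loading fetches all of the listings in one extra query instead:
    user.orders.includes(:listing).limit(100).each { |order| puts order.listing.title }

    # For a heavy history page, a single joined query keeps both the query
    # count and the memory usage flat:
    Order.joins(:listing)
         .where(user: user)
         .select("orders.*, listings.title AS listing_title")
         .limit(100)

The fix I shipped was closer to that last form: one query, no chain of lazy lookups, and no more server crashes.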

There were other factors too, such as depending on Rails to validate uniqueness rather than the database. It resulted in a lot of messy data cleanup that I had to deal with and hard-to-diagnose issues, because race conditions will always happen.

There's no dedicated blog post about this for the series, but if you're a Rails developer I do have plenty of advice that still holds. First, use the bullet gem to find and eliminate N+1 issues during development. Second, the database is the only place that can make guarantees about data integrity, so use them. Today Rails and ActiveRecord do a fantastic job of leveraging the database properly, but 11 years ago there was a lot of dogma about keeping all of the logic in the application itself.
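Both pieces of advice are cheap to act on; a quick sketch with hypothetical table and column names:

    # Gemfile: bullet warns about N+1 queries and unused eager loading
    # while you click around the app in development.
    gem "bullet", group: :development

    # validates :email, uniqueness: true only checks at the application layer;
    # two concurrent requests can both pass it. A unique index makes the
    # database enforce the guarantee and turns the race into a catchable error.
    class AddUniqueIndexToUsersEmail < ActiveRecord::Migration[7.0]
      def change
        add_index :users, :email, unique: true
      end
    end

Keep the model validation for friendly error messages and let the index be the actual guarantee.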

PostgreSQL is Amazing

This was the first project where I got up close and personal with PostgreSQL and I've been a complete fanboy ever since. One big problem that we had to solve was with our search engine, Solr. Solr was a perfectly good engine, but we kept experiencing weird data syncing problems. Deleted records would keep showing up in searches and we couldn't seem to get them removed, even with manual intervention. The problem created a lot of annoyance and customer support tickets, making the site feel buggy in general, but we believed Solr's speed made it necessary.

Well...it turns out PostgreSQL has some pretty fantastic full text search features too, so I decided to try them out. A couple of days later, I had a working prototype and to my shock it was just as fast as Solr but with none of the data syncing issues, because there was no data to sync. We could easily filter the data down based on other criteria using simple where clauses and we could even geographically narrow the search thanks to the GIS functionality.

We made the switch and search worked perfectly from that point forward. One of the things that made this so effective was ActiveRecord scopes. They do such a good job of letting you build composable query parts that I could have a scope called search(term) and another called distance(coordinate, miles) that could just be attached to any query I happened to be running, like

Listing.search("speakers").distance(location, 50).where('price > ?', 1000)

and ActiveRecord put it all together. That included injecting the distance portion of the SELECT statement as well as its ORDER BY clause, right alongside the relevance score from the term search and its ordering criteria.
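A sketch of what those two scopes can look like (the column names and the use of the earthdistance extension are my assumptions here, not the original code):

    class Listing < ApplicationRecord
      # Full text search with PostgreSQL's built-in tsvector/tsquery support,
      # ordered by relevance.
      def self.search(term)
        quoted = connection.quote(term)
        rank   = "ts_rank(to_tsvector('english', title || ' ' || description), " \
                 "plainto_tsquery('english', #{quoted}))"
        where("to_tsvector('english', title || ' ' || description) @@ plainto_tsquery('english', ?)", term)
          .order(Arel.sql("#{rank} DESC"))
      end

      # Geographic filtering via the cube/earthdistance extensions;
      # earth_distance returns meters, so convert miles before comparing.
      def self.distance(location, miles)
        where("earth_distance(ll_to_earth(lat, lng), ll_to_earth(?, ?)) <= ?",
              location.lat, location.lng, miles * 1609.34)
      end
    end

Because both return relations, they chain with any other scope or where clause and everything composes into a single SQL statement.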

In fact, to this day the combination of PostgreSQL's wealth of functionality, extensions and advanced data types with the ActiveRecord scoping system remains one of the best pairings I've ever come across. So much so that I created and taught a class around it called Ruby on Rails and PostgreSQL - Intro to Advanced back in 2014 and published a list of Rails gems to Unlock Advanced PostgreSQL Features, which is probably outdated now because so much more has come along.

Additional Bits

  • At some point, we were having difficulty with image uploads taking far too long so I ended up building a multi-file upload and image resizer tool that was plugged into the listing system. This system resulted in a disk storage savings of 554% and a 3000% improvement in image processing time.
  • We sent our first ever newsletter to 350,000 email addresses acquired over 14 years, which resulted in a phone call from Sendgrid to grill us about where we obtained our email list. That was an experience. Sorry, Sendgrid folks.
  • Made a server side banner ad system that was extremely fast and managed to create an evenly distributed randomized result.
  • Rigged up a single sign-on system between the Rails site and the old Perl site.
  • At one point I had to explain to a mobile app development company that allowing them to send raw SQL over a REST API...was bad.

How the Story Ended

There's so much more to this story that it could probably be a book all its own, but I'm going to leave it here. One of the unfortunate things that happened is that I was put in a situation where I had to choose between doing what was needed to save the company and the businesses that relied on it, or following the daily change in priorities from inexperienced, non-technical leadership. In this situation, leadership wanted to sail the ship across the ocean without plugging the holes in the boat first. I like to get along with people, so this was a tough spot for me, but ultimately I realized that if I didn't do what needed to be done, the ship was going under, so it wasn't really a choice at all.

When things eventually stabilized, the leadership was happy with the result but displeased with my lack of obedience. During my annual review, I was told I wasn't working hard enough. In hindsight, it's funny. In the moment, it felt like a betrayal after the amount of time I'd given to the company over the prior year. Keep in mind, I was also on call 24/7 during this process. I had wanted stock, and in my head this was going to be where I finished my career.

I've always been a very even-keeled and upbeat person, but this review sent me into depression for the next month. I went from thinking this was the job to realizing it wasn't going to work, and trying to figure out how to tell my wife that I was going to have to start the job hunt all over again. Over this month, I really was not productive for the first time in my entire career. I couldn't focus at all. Then I received an angry message that I showed my wife. She told me to quit and that we would figure it out...so I called and resigned. When emotions had settled, the owner told me I had saved the company and thanked me for it. I assisted with the transition for 2 months and left the company stable. I take a great deal of pride in that.

With no idea what the future would hold, I updated my resume and started job hunting. After the update, I was inundated with opportunities, and my career has been very stable and fruitful ever since. This job was my resume maker and gave me experiences that I've been able to share in so many different forums that it's still hard to believe.

If you made it this far and you want any piece of advice, take this one: desperate situations present the greatest opportunity for improvement.

Here's a small snippet of some numbers that I was able to put on my resume after this, just as an example...

  • Increased revenues by 93% by balancing user and business priorities against available resources
  • Increased new listings over 30 days by 33%
  • Improved page load time by 500%
  • Improved NewRelic Apdex score from .71 to .96
  • Improved image storage efficiency by 554% and image processing efficiency by 3000% by building a scalable image uploading, processing, storage, and delivery server

Eventually, everybody who worked with me moved on to better things as well. One of those support staff even went on to get a Computer Science degree and became a full time programmer.

What a ride these last 10 years have been.