Faking Data for Real Security | Ian Coe from Tonic

Data is becoming as precious a commodity as oil. Sales, marketing, dev ops, IT—nearly every operation depends wholly on data of one sort or another. But data is often extraordinarily sensitive.

All it takes is a dozen digits in the wrong hands to ruin a person’s life. So when you direly need to analyze data, but need to keep it private and secure, what do you do?

Our guest on this edition of UpTech Report started his own company. Ian Coe is the founder and CEO of Tonic, a company that, in their own words, “mimics your production data to create de-identified, realistic, and safe data for your test environments.”

Ian stops by to tell us how he originally conceived of the idea while trying to resolve some IT issues at a large bank that would have necessitated giving the developers data they certainly couldn’t have—if they couldn’t find another way. Now that solution is being offered to companies around the world.

More information: https://www.tonic.ai/

For over a decade, Ian has worked to advance the use of data by removing barriers impeding teams from answering their most important questions. As an early member of the commercial division at Palantir Technologies, he led teams solving data problems in industries ranging from financial services to the media.

At Tableau, he continued to focus on analytics, driving the vision around statistics and the calculation language. As a founder of Tonic, Ian is directing his energies toward synthetic data generation to break down traditional data silos, protect customer privacy, and drive analytic efficiency.


TRANSCRIPTION

DISCLAIMER: Below is an AI generated transcript. There could be a few typos but it should be at least 90% accurate. Watch video or listen to the podcast for the full experience!

Ian Coe 0:00
More and more comes up. And it just makes folks like CISOs even more nervous. And then you’re at conferences, people are thinking about this, and eventually it just becomes kind of unthinkable to not solve this problem.

Alexander Ferguson 0:15
Welcome to UpTech Report. This is our Applied Tech series. UpTech Report is sponsored by TeraLeap. Learn how to leverage the power of video at Teraleap.io. Today, I am joined by my guest, Ian Coe, who’s based in San Francisco, California. He’s the co-founder and CEO of Tonic. Welcome, Ian. Good to have you on, man.

Ian Coe 0:33
Hey, great to meet you. Thank you.

Alexander Ferguson 0:35
Now Tonic, if I understand correctly, your whole focus is mimicking production data to create safe, useful, de-identified data for QA, testing, and development, so that you can automatically create mock data but preserve the same key characteristics of that secure data set. So developers, data scientists, and salespeople can all use that without breaching privacy, and privacy is a big issue these days. I actually love that on your website you say, “the fake data company.” Help me understand it. Where did this begin for you guys? What was the problem that you saw and set out to solve?

Ian Coe 1:08
Yeah, absolutely. I know that’s kind of a funny slogan that we’ve been toying with. The more formal thing that we talk about is synthetic data, obviously. And really, this problem came about through some of the challenges that we saw in previous jobs, which is common for a lot of startups, especially in the B2B space. For example, for me, I was working at Palantir on this project. I was at this big bank, trying to get a bunch of things done. We got a few errors, emailed those to some developers, and said, hey, how do I fix this? And their first response was, can you send me the data? And the answer was, of course not. So what we ended up doing is making fake versions of that data using just Python and other manual tools. Fast forward a few years, and my co-founders and I were thinking about all the different things that we could take on. And this jumped to the forefront for us as something that would be really helpful for the world, good for the world, and actually really important and valuable for businesses.

Alexander Ferguson 2:14
Data is essential, obviously, as you’re developing, especially creating and testing new products. But not being able to use existing data, for privacy reasons, is a concern. Help me understand the shift in mindset that someone has to go through in asking, will this actually work? What is some of the pushback you’ve seen from people before they test and use your product? Like, how easy is it? Does the data actually really mimic it? Does it synthesize it well?

Ian Coe 2:45
Yeah, so I mean, that’s definitely a concern. There are kind of three things you can do if you want to solve this problem, or not be affected by it. You can essentially give all your developers access to production. That has significant security implications; it runs counter to a lot of the principles set out by regulations like GDPR and CCPA, and it’s certainly a violation of things like SOC 2. So that’s one approach you can take, and it’s something a lot of companies do, because it’s just challenging to solve this problem. The other thing you can do is lock everything down and make it really, really hard to get access to data, and that really slows down your development team. And the other thing you can do is attempt an internal solution. We see a lot of customers do this as well. The challenge there is, as you said, getting the data quality right, actually being confident that what you’ve done protects the data, and then also handling all the idiosyncrasies, all that sort of blocking and tackling that we’ve spent a lot of time at Tonic trying to solve. So when data updates, if you have a schema change, do you propagate that right away? If you have really large data systems, do you replicate everything, or do you just grab a subset of it? How do you create multiple instances for different scenarios, for developers versus testers? So all of that sort of painful data infrastructure, we’ve pushed out of the forefront for folks solving this problem.

Alexander Ferguson 4:20
But this concept of creating data that’s not real, it’s not new. You’re not the first one to solve it. Help me understand, how are you guys different from other options out there, like maybe just creating synthetic data, and how do you approach it differently?

Ian Coe 4:34
Yeah, I think there are actually kind of two styles of folks that we see out in the market. There are some older, larger companies that have test data management solutions. And I think the big difference between us and them is that we’re a much more modern platform. We integrate directly into your existing infrastructure, and time to value is much, much lower. Our customers are solving their problem within 30 days, whereas I think a lot of those solutions are more like six months plus, and a big, deep engagement with a lot of professional services. And then the other style of folks that we see on the market are really focused on kind of the purist academic problem of synthetic data, which is, how can I take a table, replicate that table, and have it be valid according to all these statistical tests? And that’s something we can do. But we’ve really focused on the enterprise data problem of databases, making sure that not only can you do that kind of table comparison, but we’ll actually transform the entire graph of a database and have that be valid, so you can actually run your app on top of it, since databases are what most apps run on.

Alexander Ferguson 5:47
But help me understand the difference between synthesized or mimicked data versus mock data. What’s the difference there?

Ian Coe 5:57
So, you know, these are actually debates we’ve had internally: what is masked data, what is synthesized data? And it gets nuanced, and kind of weird. So imagine you had a table of names, and you replaced all those names using a dictionary of fake names. Is that synthesized data? It kind of is, right? It’s net new data that’s completely detached from the underlying data, but has most of its properties. And we even do things in our product like make sure that there’s a one-to-one mapping, if you want it. We call it consistency. And we can preserve that across data sources, which is sort of what I was getting at with those enterprise problems. Or you could call that masked data, right? You just took some values and replaced them with other values. It’s sort of like an advanced mask. But typically, when people talk about masked data, they’re thinking of really trivial operations, like taking some data and replacing it with a bunch of X’s, or nulling it all out. And that’s when you start completely destroying the utility of the data and making it hard to be an effective developer.
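
To make the distinction above concrete, here is a minimal Python sketch, not Tonic’s implementation. It contrasts a trivial mask (replace everything with X’s) with a consistent replacement, where the same real name always maps to the same fake name. The name list, the secret key, and both function names are hypothetical, and a keyed hash over a small dictionary gives consistency but not a strict one-to-one mapping, since two real names can collide on the same fake name.

```python
import hashlib
import hmac

# Hypothetical dictionary of fake names; a real one would be far larger.
FAKE_NAMES = ["Avery Stone", "Jordan Pike", "Riley Marsh", "Casey Lowell", "Morgan Reyes"]


def consistent_fake_name(real_name: str, secret: bytes) -> str:
    """Map a real name to a fake one; the same input always gives the same output."""
    # A keyed hash makes the choice deterministic without storing a lookup table.
    digest = hmac.new(secret, real_name.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


def trivial_mask(value: str) -> str:
    """The 'bunch of X's' style of masking: safe, but the value is no longer useful."""
    return "x" * len(value)


if __name__ == "__main__":
    key = b"example-secret"  # assumption: kept out of the test environment
    for name in ["Grace Hopper", "Alan Turing", "Grace Hopper"]:
        print(name, "->", consistent_fake_name(name, key), "|", trivial_mask(name))
```

Because the mapping depends only on the input and the key, the same person gets the same fake name in every table or data source processed with that key, which is the property described as consistency.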

Alexander Ferguson 6:57
What is the absolutely essential part of the data that has to be maintained when someone’s using it, so you can make sure that the product is delivering?

Ian Coe 7:05
Yeah, yeah. So I mean, the bar for most of our customers is they can open their application, it works, their tests pass. So that’s, that’s kind of the bar for, you know, developers, there’s a lot of things that go into that, the main thing I would say, is the graph of the database, the structure and relationships in the database have to be preserved. And that’s what we spent a lot of time making sure that we do
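
As a rough illustration of preserving relationships, the Python sketch below replaces customer IDs and names while keeping every order pointed at the right customer. It is a toy under stated assumptions, not how Tonic works: the two-table layout, the field names, and the `remap_keys` helper are all hypothetical, and customer IDs are assumed to be unique.

```python
import random


def remap_keys(customers, orders, seed=0):
    """Replace customer IDs and names, keeping orders linked to the same customers."""
    rng = random.Random(seed)
    real_ids = [c["id"] for c in customers]
    # Draw distinct replacement IDs so two customers never collapse into one.
    new_ids = rng.sample(range(100_000, 1_000_000), len(real_ids))
    id_map = dict(zip(real_ids, new_ids))

    fake_customers = [
        {"id": id_map[c["id"]], "name": f"Customer {i}"}
        for i, c in enumerate(customers, start=1)
    ]
    # Apply the same mapping to the foreign key so joins still work.
    fake_orders = [{**o, "customer_id": id_map[o["customer_id"]]} for o in orders]
    return fake_customers, fake_orders


if __name__ == "__main__":
    customers = [{"id": 1, "name": "Grace Hopper"}, {"id": 2, "name": "Alan Turing"}]
    orders = [
        {"order_id": 10, "customer_id": 1},
        {"order_id": 11, "customer_id": 1},
        {"order_id": 12, "customer_id": 2},
    ]
    fake_customers, fake_orders = remap_keys(customers, orders)
    known_ids = {c["id"] for c in fake_customers}
    # Every order still references a customer that exists in the new data.
    print(all(o["customer_id"] in known_ids for o in fake_orders))
```

The point of the design is that the key mapping is built once and reused for every table that references it; replacing IDs independently per table would break the joins an application relies on.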

Alexander Ferguson 7:28
Is it anyone’s data you can use? I mean, to be able to mimic and create new content, is it just, yeah, bring it, we can make it happen? Or are there any variables, things where you’d have to do something different each time?

Ian Coe 7:44
So we focused on relational data to start with. If you look at our website, you can see the list of sources that we support, and it’s pretty extensive. I think we cover most of the common relational databases. I mean, there are always things out there. But yeah, any relational data, no problem. We are about to release our document store product, so that’s going to be something that further opens up the types of data we can process. We already process things like JSON and XML, but we’re going to go a level further and make that really first class for folks.

Alexander Ferguson 8:27
If you had to give a word of advice to a developer who’s working with data right now, or even research analysts who are having to work with data in today’s environment, especially where we’re headed with privacy and the need to use data smartly, what would you share as a word of wisdom?

Ian Coe 8:41
I’d say focus on what your job is, and let us take care of the painful stuff for you. I think that’s sort of the big thing. Obviously, anything’s possible with code, right? There are a lot of great developers out there, and they can solve problems. But would you rather work on some data infrastructure stuff that, and I think we’re a pretty good engineering team, took us a couple of years to figure out, or would you rather just do your job and move the top line for your company? So I’d say, yeah, it can be kind of a fun challenge, but you pick your battles.

Alexander Ferguson 9:24
Very well said. Now, going back for a moment to the regulatory side of privacy, I’m curious, what’s your perspective on where we’re headed? Will it get more difficult to use actual data, and increase the need to create mimicked or synthesized data?

Ian Coe 9:42
We’ve seen a real shift in the perception of this problem since we went out to market, and I think a lot of that is regulatory. Some of that is also just the market recognizing this as a really important challenge. But yeah, if you look at the number of people that just accept that you have to do something about this, it’s way higher than when we first started talking to people about this problem. And if you look at what’s in a lot of these regulations, it’s a little ambiguous. Like with GDPR, you’re subject to fines if you have a breach, unless your data is substantially resistant to reverse engineering, but they don’t actually tell you how to do that. So we fall back on something called differential privacy, which is a whole other discussion. But really, what it comes down to is, as these things get out into the market, and then there’s Schrems II, and all these other things, more and more comes up. And it just makes folks like CISOs even more nervous. And then you’re at conferences, people are thinking about this, and eventually it just becomes kind of unthinkable to not solve this problem.

Alexander Ferguson 10:44
So you’d say it’s basically just getting harder to do it without having a solution in place to mimic and synthesize data, just because of the external crackdown.

Ian Coe 11:00
It’s that, and I think the other thing is, as we continue to advance the brand of Tonic and explain what we do, we help folks understand that it’s possible. We’ve taken something that might have been a multi-month or multi-year engineering effort, and we’ve condensed it down to 30 days. A lot of CISOs we talk to say, “Yeah, I’ve been worried about this for 10 years. I’m glad someone’s finally solving this.” So it’s kind of one of those things where people’s concern about it right now is really high. And in addition, it’s also been something that a lot of folks have been really concerned with, but there just hasn’t been a great answer to.

Alexander Ferguson 11:41
So it’s, it’s the tipping point, getting to the point where the need is only increasing. And the ease of use to be able to create it is is increasing as well. And we’re

Ian Coe 11:51
examining together. Yeah, exactly.

Alexander Ferguson 11:55
What are you most excited about in the space, maybe it’s an upcoming feature where you guys are headed, that will only make this better or easier for developers and CSOs and research analysts.

Ian Coe 12:09
I mean, I think the thing that we’re going to need to push on is just really driving down the time it takes to get that initial data set that works for your team. And to do that, we’re going to be pushing out more automation, taking things that you can do in the product today with some manual work and making them happen automatically for you. Additionally, like I said, supporting more sources, becoming more resilient to idiosyncratic data, all of those things make it so that eventually, and this is the dream, one day you point at your database, you click a button, you walk away, and half an hour later you have exactly what you need, and you ship it to your team. So I think that’s kind of what our North Star is.

Alexander Ferguson 12:56
Yeah, absolutely. Eventually getting to where, as you said, you click a button, and then boom, you’ve got your mimicked data, or synthesized data, that you can just start using.

Ian Coe 13:07
And, you know, who knows if we’ll ever actually get all the way there. But when I think about what I would like for our customers, that’s what I’d like.

Alexander Ferguson 13:16
Can you share just some use cases, or examples of companies you’re working with, and how this is in play, how it works?

Ian Coe 13:23
Yeah, totally. So I mean, we’re working with a ton of customers. You can go to our website and see some of the logos. We’ve got folks like eBay and Flexport, and a bunch of SMB-style customers as well. And I think the main thing that we see consistently across all these customers is that they’re supplying their dev teams with the data from Tonic. For some of the larger customers, that might mean taking petabytes of data, using our subsetting technology to make it a digestible size, and then protecting that data before they ship it off to developers. And so there’s a huge lift from that in terms of just the efficiency they get, being able to depend on a continuous pipeline of useful data

Alexander Ferguson 14:12
For the environment that we’re in, for developers, the need for data isn’t going to decrease, for being able to develop with it and create content around it. But you said the ease of use is going to be the biggest thing. Can you speak to the future of technology, like the near term, the next two or three years, or even a little bit beyond that? What kind of tech predictions can you make about where we’ll see technology going, and the use cases for mimicking data, the need for data, and developing with it?

Ian Coe 14:48
Yeah, I mean, I think there are a few bets that we have. And I think this is also potentially advice I would give to anyone doing a startup in the data space. One thing is, don’t be afraid of on-prem. It’s a lot easier than it used to be. We can install in under an hour, really reliably. That improved over time; the first install was not an hour, but we got it there. I think there’s just a lot more tooling to make that possible. The other thing I would say is, don’t count out relational databases. A lot of folks go to more complex setups, but if you look at Postgres adoption, and things like that, and some of the things that they’re starting to support, you can do a lot with some of those traditional, boring technologies, I guess you could say.

Alexander Ferguson 15:48
So the boring technologies are the ones that you shouldn’t turn away from. They actually can help.

Ian Coe 15:55
Yeah, especially if it’s something that’s really useful and seems like it’s solving the problem. I think for us, especially in our field, being on premise is extremely important. A lot of our customers do not want to send their data out of their VPC. And so that’s something that has been hugely important for making our customers feel secure and protected. And obviously, that style of problem changes as you advance your brand, as you become bigger. But certainly as a smaller startup, it’s pretty important to make sure your customers feel very confident in what you’re doing.

Alexander Ferguson 16:37
Share more about your story, your history. Four co-founders, right? So you and three others. How did you guys get together? How did it begin?

Ian Coe 16:49
So one of the co-founders and I actually met in middle school. That was Karl; Karl and I knew each other from middle school. Andrew was my first boss at Palantir. And Adam and I worked together on the same team at Tableau. So three of the founders come from Palantir, and two spent time at Tableau. I realize that’s a Venn diagram overlap; I was at both. So it is four co-founders, but I was at both.

Alexander Ferguson 17:17
And was it because of the experience at Tableau and at Palantir that you guys saw this issue, as you were saying earlier, that the need for this was arising, and then you thought, we’ve got to get together and make this happen?

Ian Coe 17:30
Yeah, I mean, it was a problem at Tableau as well. I was a product manager there, and I can’t tell you how many times we’d get a defect reported by a customer where they’d stripped the data out of the workbook and said, this is a defect. Then we’d ask the customer for their data to help debug it, and they’d say no. As a product manager, that’s a really hard challenge. Do you tell the customer, hey, we’re not going to solve your problem? Or do you dedicate some engineer and shift your whole roadmap, which also takes away value from customers, so that this engineer can recreate a dataset that can hopefully reproduce it? And it’s not even a guarantee. I mean, the engineer might spend a long time trying to reproduce it, and then after weeks say, I can’t do it. So I think it’s a big challenge. And if we can really make data portable across enterprises, I think that’s going to greatly accelerate a bunch of different aspects of development, and collaboration between companies.

Alexander Ferguson 18:29
Where do you see your company five years from now?

Ian Coe 18:34
So I think, if we look at how folks are using Tonic today, there are a lot of things that they’re doing; they’re really managing their data in a very first-class and thoughtful way. So I think Tonic in five years is the place where you go to make sure that your data is handled well. And not only that, it allows you to collaborate between teams and make sure your practices are first class. And we also save you a ton of time and infrastructure. So I think Tonic becomes, you know, it’s a synthetic data product today, but five years from now, it’s a secure data platform.

Alexander Ferguson 19:13
I appreciate you sharing the vision and the history of where you guys come from, and the challenge of handling data well for data privacy reasons, and making it easy for developers and teams to make that happen. I’m excited for what you guys are bringing forth. For those that want to learn more, you can head over to tonic.ai, and it looks like you can book a demo. Is that a good first step that folks can take?

Ian Coe 19:40
Yeah, we’d love to talk to you. And we keep the process really light.

Alexander Ferguson 19:46
I love it. All right. Well, thank you again. Good to have you on, and we’ll see you guys next time on the next episode of UpTech Report. That concludes the audio version of this episode. To see the original and more, visit our UpTech Report YouTube channel. If you know a tech company we should interview, you can nominate them at UpTechReport.com. Or if you just prefer to listen, make sure you’re subscribed to this series on Apple Podcasts, Spotify, or your favorite podcasting app.


Written by Alexander Ferguson
