Guest
Gabriel Straub
Gabriel is the Head of Data Science and Architecture at the BBC, where his role is to help make the organization more data informed and to make it easier for product teams to build data and machine learning powered products. He is an Honorary Senior Research Associate at UCL, where his research interests focus on the application of data science to the retail and media industries. He also advises start-ups and VCs on data and machine learning strategies.
He was previously the Data Director at notonthehighstreet.com and Head of Data Science at Tesco. His teams have worked on a diverse range of problems from search engines, recommendation engines, pricing optimization, to vehicle routing problems and store space optimization.
Gabriel has an MA (mathematics) from Cambridge and an MBA from London Business School.
Host
Hugo is a data scientist, educator, writer and podcaster at DataCamp. His main interests are promoting data & AI literacy, helping to spread data skills through organizations and society and doing amateur stand up comedy in NYC.
Transcript
Hugo: Hi there Gabriel, and welcome to DataFramed.
Gabriel: Hello, thanks a lot Hugo for having me.
Hugo: Such a pleasure to have you on the show. And we're here today to talk about your work as head of Data Science and Architecture at the BBC, how you're thinking about democratizing and spreading machine learning through the organization, how you think about data products, machine learning as a service, and content recommendation. All of these incredibly exciting and modern things, but before we get into all of that I'd like to find out a bit about you. As we all know, the BBC is a huge organization, and I'm sure there are a lot of opinions about what the head of data science's job actually is. So I'd like to know what you do, but before that I'd really like to know what your colleagues say or think that you do.
Gabriel: That's actually a really good question. I think as in any large organization, and the BBC has about 20,000 people that work there, there's quite a different understanding of machine learning and data science. So probably if you ask some people they would not know at all what it means, some people would probably assume that it's something related to understanding audiences, so the kind of stuff that we potentially might want to call analytics, and then some people who have a bit more of a detailed understanding would probably tell you that a lot of the work that we do is around building recommendation engines and other kinds of algorithms that help improve the audience experience. And actually, one of the big challenges that we have is how do we educate enough of the organization so that everyone has a certain amount of understanding of the hopes, and the hypes, of machine learning, so that from a journalistic perspective we can properly educate our audiences, and from a technology perspective we can actually take proper advantage of this technology.
Hugo: That's great. Now, I love this idea of the hopes and the hypes, because we're also talking about constraints and what ML can do, machine learning can do, and what it can't do, and what data is good for and what it isn't, because there's so much hype around this space that I think a lot of people think data and AI are capable of anything, right? But it's about using it wisely and mindfully, right?
Gabriel: Yeah, definitely. There's a lot of this concern, especially in the machine learning community, of being hired as the data science savior, as the guy who comes in and is expected to save the business, when the business isn't actually ready and hasn't set in place the right kind of engineering and machine learning basics, or data basics, sorry, that you might need in order to make that happen. So I totally agree with you, it's around being actually quite knowledgeable. So for me, a lot of my job I would consider more of a data product role. So being aware of what you can do, and what actually the problem is that you're trying to solve, and trying to figure out how do you bring that together? So the possible with the needed, if that makes sense.
What do you do at the BBC?
Hugo: Yeah, that makes perfect sense. Well now, the second to final point was around building recommendation engines and other algorithms to help improve audience experience, and it strikes me that that's probably one of the closest, besides trying to educate around ML, but maybe you can just give a few words about what you actually do at the BBC.
Gabriel: Yeah, so in a way I talk about wearing two hats, and on one side my job is to try and help the organization get a bit better, a bit more consistent in terms of how we tackle data and machine learning problems. So this is when I am wearing my architecture hat, because we've been around since 1922, so almost 100 years, and we've invented a lot of the broadcasting technology. We've been quite instrumental in inventing radio, TV, etc., and the way that we were really good at inventing these new things is we almost built separate organizations, and that means that we have a lot of data in lots of different places. Unfortunately today our audiences really want to have access in one place, and they don't really care whether something's called radio, or TV, or any of that stuff that we set up in our organization to run smoothly. But because we've invented all of these different things, we have quite a siloed data approach.
Gabriel: So part of my job is to try and address some of that by developing consistent approaches to storing data, surfacing data, and using machine learning on top of it. But also part of my job is to run a team that is called Data Lab that actually builds machine learning algorithms that provide better audience experiences, and generally that falls into two areas. So recommendations, as you mentioned, so how do we bring the right piece of content, or the right service in front of the audience, given their interest and their context? And the second thing is what we call enrichment, which is all about how do you actually find a piece of content? Coming back to this whole thing that we've been around for 100 years. 100 years ago people didn't think about tagging content in such a way that would later be surfaced through recommendation engines. So that means that we have a whole bunch of content that is badly described, and there is now a big question around what do you do with that?
Hugo: Interesting. So is enrichment related in that sense to discoverability?
Gabriel: In general we talk about metadata in that space, which technically isn't quite the right terminology, but it's how do we find the right descriptive data so that it can be surfaced through, be it search engines, recommendation engines, or any other process that you might want to find content in.
Hugo: So you mention that the BBC's been around since 1922, and this is something that's incredibly interesting to me about your role, and the team that you've built out and are building out, that a lot of the time data science is synonymous with tech, and a lot of people think of the tech stack used in tech companies and online companies, but in this case, and a lot of your history actually in data science, is bringing data analytical and data science tools to companies that have existed before tech, and predate tech by a long shot. So I thought we could use that idea as a springboard to discuss how you actually got into data science originally, and your career trajectory up until now.
Gabriel: One of the interesting things though is you could argue maybe the BBC has always been a tech company, right? So we started by bringing together radio, which back then was high tech, and then TV, we were at the forefront again of that. So in a way we've always been tech, it's just that tech has shifted significantly over the last hundred years, so we have a slightly different legacy compared to maybe a pure online based organization.
Hugo: I love it.
Gabriel: In a way, I almost fell into data science coincidentally. So I have a mathematics background, so I understand that side of things, I then went off and worked as a management consultant for a couple of years, and came back to the UK to do an MBA, and I joined Tesco as head of data science, or back then actually I had a fairly long title along the lines of Head of advanced algorithms and forecasting load optimization for general merchandise, something along those lines.
Hugo: What a mouthful.
Gabriel: Exactly. So the longer the title, the more difficult it is to know what you're doing. But the idea was that they had already, Tesco actually was one of those other companies that you don't really think about as a tech company, but actually in the nineties had developed two amazing pieces of technology. So in '96 they created this thing called Clubcard, which was before the time of Google, they were dealing with big data. So basically they kept track of all the purchasing that you did on this Clubcard and in return would give you one penny off for every pound that you spent. So they already were working with data innovation back in those days, and the second thing they were really, really good at was forecasting and optimization. So one of the reasons why Tesco had higher margins than a lot of other retailers was because they were really good at managing stock in their grocery business, and therefore had very high availability and really low waste.
Gabriel: So I came in after my MBA in order to help them build a similar capability, but related to the general merchandise business, and general merchandise, while you can inherit some of the stuff you've learned from forecasting how many cans of tinned beans you need to buy, it's quite a different supply chain. So in a lot of your grocery business your stuff comes from a warehouse that's maybe one or two days away, it's fresh, it doesn't have long lead times, there's quite a lot of turnover, therefore your stock actually comes into the business and disappears quite quickly. For the general merchandise, at least for a place like Tesco, your lead time is very long, because it gets produced in China, so you're certainly not talking about a week long lead time, but a six month lead time, your sales rates are a lot lower, etc. So there was a bunch of new challenges that we had to resolve.
Gabriel: So my job was really to be a translator, someone who could speak enough of the business to understand what this was all about and then try and translate it into maths, and someone who could understand enough maths to make sure that we could hire and build the right team, and then translate some of the concerns and constraints from maths back into the business.
Hugo: And when was that?
Gabriel: So that was in 2012.
Hugo: That's quite prescient in a lot of ways, because what we do see now, we actually see the emergence of a role called data translator coming out now, which serves that purpose in a lot of respects.
Gabriel: Yeah. So I now call it more of a product role, because in general the product person is the person who tries to understand what is feasible technically, and what the customer wants, and tries and brings those two things together. But yeah, 2012, there was no such thing as data science, at least not in the UK. That was slowly coming across the pond just in that year. So actually we called our team commercial science, because we felt it was all about the science that helps it be commercially more successful, which was also quite on purpose that we didn't really want to focus on the data, we wanted to focus on the impact that we could create.
Career up to the BBC?
Hugo: So what happened in your career then to take you to the BBC?
Gabriel: I was at Tesco for a couple of years, I'd built up a larger team there, by the end we were looking at anything from classical operations research type questions. So how do you do vehicle routing for your online deliveries, or how do you optimize a fulfillment center? We were doing things that were kind of in the trade space, so beyond forecasting of demand we were also worrying about how do you optimally price a product, what's the right range to have online, etc. So after Tesco I joined Not On The High Street as a data director, and there my job was really to try and understand how do we get a bit on top of the data we have, how do we build a data democracy? So one of the KPIs that I was quite keen on is the percentage of our colleagues who'd actually used data on a weekly basis as part of their job, and then also how do you slowly introduce some slightly more sophisticated machine learning in order to automate some of the decision making and just create a better audience experience.
Hugo: So I like this idea of data democracy, and trying to spread data use throughout organizations as widely as possible. What does that mean though for someone to use data? Like someone in a marketing role, or a sales role, do they need to be able to code, or is working with a GUI enough? What are we talking about?
Gabriel: So that was a lot of the things we were trying to figure out. I think there's a bit of this data literacy, so how do you teach people the right understanding to ask the right questions, because actually conversion rate is not actually the same if two people use it. So you could have a conversion rate that's based on top of views, or you could have a conversion rate that's based on audiences. So you have to have a certain understanding around why you would choose one over the other. So there was a bit of that, just giving the right skills, and then a lot of the other stuff that my team was working on was trying to provide the right tooling that would make it as simple as possible, and in my view there's still a certain amount of SQL that's quite useful in that space, but there are not great tools out there that actually allow you to almost create that, not quite dashboards, but pre-computed queries where then people can just put in parameters, and they can play around with it, and they can really learn some of the SQL.
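As a concrete aside on the conversion-rate point: the two definitions Gabriel mentions can be computed from the same event log and give quite different answers. The sketch below is purely illustrative, with invented data and column names.

```python
import pandas as pd

# Hypothetical event log: one row per page view, with a visitor id and
# a flag for whether that view led to a purchase.
events = pd.DataFrame({
    "visitor_id": ["a", "a", "b", "b", "b", "c"],
    "purchased":  [False, True, False, False, False, True],
})

# Conversion rate on top of views: purchases divided by total views.
view_based = events["purchased"].sum() / len(events)

# Conversion rate on top of audiences: converting visitors divided by
# all distinct visitors, regardless of how many pages each one viewed.
by_visitor = events.groupby("visitor_id")["purchased"].any()
audience_based = by_visitor.sum() / by_visitor.count()

print(f"view-based: {view_based:.2f}, audience-based: {audience_based:.2f}")
# 2/6 vs 2/3 -- the same data, two quite different numbers.
```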
Gabriel: So we tried to also teach people SQL, we tried to teach people a bit more how to use Excel as well, and then just making sure that they knew where the dashboards were. I think the most important thing for me though, was knowing how to ask the right questions, and knowing what questions could be answered with data, and actually answering the questions with the data, rather than trying to use data in order to confirm opinions that they already had. So it was mostly, to be honest, a cultural thing. More than anything else it was around getting people to not think that data is on the other side to creativity, but actually data and creativity work hand in hand.
Hugo: For sure, and that's something we think a lot about here at DataCamp, of course, as well. We're trying to spread the use of data tools and data techniques through organizations, not only for data scientists or analysts, or this type of stuff, but for managers, people at C-level, trying to figure out how much data they need to be able to speak and know about in order to do their jobs as well as possible, essentially.
Gabriel: I think this old world where there was maybe a team that was responsible for data probably doesn't work anymore. I think everyone has to have a certain amount of data literacy now. It's not acceptable for anyone in the organization to say that they can't write, right? Writing is just one of those basic skills, and my assumption is that maths and some sort of minimal data literacy, and potentially even programming, is going to be one of those things that will be just a basic skill that will be required.
Hugo: I think that's a bright vision for the future. I want to jump in and talk about your work at the BBC in particular, but I just want to preface that by saying I love this idea of the BBC being a very serious tech, and tech driven, and tech forward company from 1922. I also really like that you mentioned that, although Tesco for example and retail and grocery and these types of thing aren't historically tech per se, they are actually huge innovators in the data space. As you said, and as we know, loyalty cards are a great example of seeing what people buy, segmenting them, and making recommendations to them based on what they've bought previously, right?
Gabriel: Yeah, as you said, right? So that was '96, that was before Google was created, and if you can imagine Tesco has thousands of stores, those stores have quite a few tills, lots of people going through these tills. It wasn't back then real time analytics, but there was still a lot of data that was created through all of these purchases, and Dunnhumby, which was the organization that Tesco then bought, who was dealing with all of this, they were able back in '96 to analyze all of that data in order to provide you with coupons, and it was worth enough money for Tesco to provide you with one pence on every pound, and that might not seem like a lot, but it's one percentage point of margin that you're giving up in retail, which is a low margin business.
Hugo: No, that's a huge amount.
Gabriel: It's massive, right? So in the good times, now Tesco probably has a margin of somewhere between 3% and 4%, or maybe slightly higher, but definitely below 5% these days. So actually you can imagine, in a way, how brave it was as a decision to say, "Okay, it's worth us giving this to customers, because we believe that gathering that data is worth it." And similarly, the BBC went online and created BBC News in 1997. That was at the beginning where there were probably still quite a lot of people who were thinking that the internet is probably just a passing phase, and it's probably going to disappear. There's quite a lot of innovation that actually happens in the digital space in the companies that you might not consider as being your native tech companies.
Hugo: Yeah. I actually remember as a teenager in the late nineties being quite surprised at all the progressive work the BBC, and I think The New York Times were doing at the time as well, those are the two that caught my attention, at least at the end of high school.
Gabriel: Yeah. I think the BBC also created the BBC Micro, right? Which was one of those things that actually introduced lots of people to programming, and I think it is easy to forget that there were big companies that were at the top of their tech game before Google, Facebook, and all of these friends these days. And I think these big companies are still there, they're still surviving, and they're still innovating in new spaces, they're just maybe getting a bit less of the press.
Aspects of the BBC
Hugo: Yeah, I think so. So, let's dive in and talk about just what aspects of the BBC, business, content, otherwise, that you think data analytics and data science can have the biggest impact on?
Gabriel: When I think about this kind of stuff, and you look at it from a strategic perspective, what I'm really, really interested in is, the way that I find it's quite useful in this way, is to think about it from a value chain perspective, because for me data science and analytics and all of that stuff is all about decision making, it's all about decision support and making sure that you scale good decision making. That's where I think it's really, really, really powerful, and our value chain, simplistically, is around planning, commissioning, producing, scheduling, and then serving that kind of content, and then there's a whole bunch of operations that underlie all of that stuff. But probably the areas that are the most obvious ones, or the most exciting ones at the moment is, the obvious one is around how do we get the right content in front of audiences in the right way?
Gabriel: So I talked about recommendation engines earlier, and what makes this particularly interesting with the BBC is that we have audio content, we have video content, we have text content, we have things like weather, which is probably text, but actually really is data, we have interactive games for the younger people, we have recipes, we have pictures. So we have a whole bunch of stuff, so in a way we're basically Netflix, and Spotify, and CNN, and a weather channel combined. And that makes it significantly more challenging to bring the right content in front of the right users at the right time.
Hugo: And what type of approach do you use, or how do you think about this problem?
Gabriel: So at the moment a lot of our approach is still quite focused on breaking it down into areas. So we've had iPlayer since 2007, so it was one of the earlier video on demand services, and iPlayer has a recommendation engine that currently is very much focused on showing you more iPlayer content. Similarly, we have a product called Sounds, which is our audio product, and again there what we show you inside is more audio products. We're now trying to figure out how do we crack that, and actually it's not necessarily only an algorithmic problem, but it's also a product problem. So if you are in the space where you're watching videos, when does it actually make sense for us to provide you some audio? When does it make sense for us to provide you some text? So some of the stuff that we'll probably start doing first is more understanding what kind of content you've consumed over all of our product portfolio, and then using content you've consumed somewhere else in order to provide you with more relevant content on that thing.
Gabriel: So, for example, if you've read a lot about science and technology, then maybe that gives us a hint that you might be interested in Planet Earth, or that kind of documentary style stuff when you're on iPlayer, so maybe it gives us the opportunity to recommend that to you, even if you haven't consumed any of those types of content.
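To make the cross-product idea a bit more tangible, here is a deliberately tiny, hypothetical sketch of content-based scoring across products: build a topic profile from articles someone has read, then rank catalogue items from another product by topic overlap. It is not the BBC's actual recommender; every id and topic below is invented.

```python
from collections import Counter

# Toy illustration: what someone has read in one product...
articles_read = [
    {"id": "news-1", "topics": {"science", "space"}},
    {"id": "news-2", "topics": {"technology", "ai"}},
    {"id": "news-3", "topics": {"science", "climate"}},
]
# ...and candidate items from another product.
video_catalogue = [
    {"id": "planet-earth", "topics": {"science", "nature"}},
    {"id": "soap-opera", "topics": {"drama"}},
    {"id": "tech-doc", "topics": {"technology", "ai", "science"}},
]

# Topic profile: how often each topic appears in the user's reading history.
profile = Counter(t for article in articles_read for t in article["topics"])

def score(item):
    # Sum the user's interest weight for every topic the item carries.
    return sum(profile.get(t, 0) for t in item["topics"])

ranked = sorted(video_catalogue, key=score, reverse=True)
print([item["id"] for item in ranked])
# ['tech-doc', 'planet-earth', 'soap-opera'] -- science-heavy items rise to the top.
```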
Hugo: Something you said there I want to zoom in on, is this idea about making recommendations across different products, because if I recall correctly, historically a lot of the products, as with a lot of orgs, but at the BBC, have been siloed, right? So you even have all the data in a variety of different places.
Gabriel: Yeah. So, it depends a bit on what you mean by data. We're quite lucky that we went on a journey a couple of years ago to try and at least bring a lot of our audience data together, so that at least helps us to understand what audiences are interacting with. Now we have a bit more work to do in order to bring content data together, because actually there's not much point in me knowing that you've watched a clip with a certain ID if I don't know what the clip is about. So there's still a bit of work there, but yeah, you're exactly right, and that comes back to a bit of that history of having been here for 100 years and actually always building it up separately, and actually from the fact that we are still very heavily a linear broadcaster. So our TV channels, our radio channels are still what produce the most amount of our audience engagement, and they have a very different way of thinking about data than you would have in an online channel.
Hugo: And it also speaks to a point that you mentioned in passing earlier, that before we can even solve these types of challenges, we do need to get all the data, there's a big data engineering challenge that happens before we can even solve these problems, right?
Gabriel: Yeah. Sometimes we don't even have the data. So, articles is one of those examples. We use tags in the articles, we developed a system in 2012 to help us with sports. So actually we were once upon a time, we probably still are, one of the biggest users of linked data. So in 2012 we developed these ideas around how would you be able to track medals in sports across different people, so we knew which kind of personalities were related to which sports, and to which country and stuff like that, so we could actually give you interesting new ways of navigating our content set. We use something quite similar in the news world, but in the news world it's a bit less clear what is a sensible tag for people, because they don't really understand why that tag actually creates value.
Gabriel: So our most common tags are "UK" and "politics", and these tags are probably not descriptive enough to really drive personalized recommendations, because just because you care about the UK, you probably don't care about a lot of the UK stuff, and similarly with politics actually. You might care about UK politics, but a lot less about politics in South America, and finding the right processes to bring the right granularity of data into our systems is one of the things that we're working on at the moment as well, to try and look into that.
Hugo: Interesting. That's a big challenge. Actually, that reminded me, you gave a talk, which we'll link to in the show notes, in which I recall you mentioned a related challenge, which is different uses of terms and tags within different parts of the organization. I think the example you gave is Manchester City, right? So if a sports broadcaster speaks about Manchester City, they might be talking about a team, whereas somebody else might be talking about the actual city. So that tag might mean very different things in different contexts.
Gabriel: Yeah, and I think the other example that I tend to talk about is "pirates", because pirates has at least three meanings. So they can be nice pirates that you have in children's programs, and then you have software pirates, and then you have the Somali pirates that kill people.
Hugo: Absolutely.
Gabriel: And it actually becomes quite a problematic one if you confuse them, because you definitely don't want to show a kid the Somali pirates just because they've just been consuming a children's TV program.
Hugo: No, we've worked very hard as a society to convince children that pirates are great, essentially.
Gabriel: Yeah, and recent history has shown us that actually maybe, depending where you are, you might disagree with that statement.
Hugo: How do you think about tagging in general, all this historical data? I mean, how do you think about labeling it? That seems like a huge task and a huge challenge.
Gabriel: So I think there's going to be probably two approaches that we're going to use. One is there will be a certain amount of manual tagging, probably for the newer content where we just need to get better about it, where we need to create more consistent approaches, processes and tags, and obviously this is another area where machine learning can be quite exciting. So already our R&D team has been working on this for a while, so we have pretty decent tools that allow us to do topic extraction, or entity extraction out of text. These have been very heavily trained on news, so we'll have to see how they work for something like drama, or maybe articles or topics that are not so update-y but more entertain-y or informing.
Gabriel: We've also worked on things like facial detection, where we have a fairly good system, particularly for British politicians and stuff like that, that maybe the big commercial products care less about. And there's a whole suite of other things in there that our R&D department has been working on, and is now working on making available to the rest of the business. That is quite exciting, because it means that we can then try and find ways of getting a lot more data out of all of our archives. And coming back to the fact that we've been producing content for almost 100 years, we actually have quite a lot of archival content, and most of the commercial stuff as well is quite expensive. So if we were going to run any of the existing machine learning as a service tools across that, probably that would not be affordable for us. So it's great that we have something to start with that was built internally for our use cases and our needs.
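For readers who want to try the general idea of entity extraction, an off-the-shelf library such as spaCy gives a rough approximation of what Gabriel describes. The BBC's own tools are trained internally on news, so this is only a stand-in, with an invented example sentence.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Manchester City beat Liverpool at the Etihad Stadium, "
        "while talks continued in Westminster over the budget.")

doc = nlp(text)
for ent in doc.ents:
    # Each entity comes with a type label, which is exactly where the
    # "Manchester City: football club or place?" ambiguity shows up.
    print(ent.text, ent.label_)
```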
Recommendations
Hugo: So in terms of making recommendations as well, a big challenge in the recommendation space these days is considering filter bubbles and echo chambers. So maybe you can speak to how you think about that. I have a potentially related question, so I'm going to throw two questions at you at the same time, feel free to answer them in any way you want. Whether you have humans in the loop with respect to these types of recommendations as well, like some sort of human editorial role?
Gabriel: Yeah. So I think, first of all, echo chambers for me are less issues of machine learning than issues of business models. Now, that doesn't mean that the data scientists aren't responsible for it, but the data scientists are basically optimizing for something that they're being asked to optimize by their business, right? So you tend to have echo chambers or filter bubbles particularly in places that are related to the attention economy. Where the product itself has an incentive to keep you engaged or on the platform for as long as possible, so they can show you as many ads as possible. This is not quite our business model. Our business model is that we are funded through the TV license, we would like you to have a positive opinion of the BBC, and we believe that the more you interact with us, the better it is, but we don't have to drive you quite that hard because we don't serve you any advertising. So that's one of the things.
Gabriel: The other thing is we do have humans in the loop. So for us, again, we've been around for 100 years, this whole thing, also where we sit as a public broadcaster, we have to be impartial, we have to be objective. So actually, we've already had to deal with this thing around how do you make sure that people get multiple sides of a story? We've had to deal with this for 100 years, and we've had very good editorial guidelines and processes in place to help us with this. The way that we deal with this is that we actually have an editorial person that works very, very closely with our team.
Gabriel: So as we build these algorithms, we will show the outputs to her, and have discussions with her around what do you think happens if we do this, how do these results look to you? In practice what that means is, if we notice that your horizon is narrowing down too much, we stop recommending purely from the algorithm, so only 50% of the content going forward will be recommended out of the algorithm, and the other 50% will be, again, a cold start, where we've curated certain topics and brands that we think are relevant to the audience that we're trying to reach. So it's something that we're very conscious about.
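A minimal sketch of what such a blended slate could look like in code, assuming a fixed 50% cap on the algorithm's share; the function name, content ids and the interleaving logic are illustrative rather than the BBC's actual implementation.

```python
import random

def blend_recommendations(algorithmic, curated, n=10, algo_share=0.5):
    """Cap the algorithm's share of the slate and fill the rest
    with editorially curated items, as a guard against narrowing."""
    n_algo = int(n * algo_share)
    slate = algorithmic[:n_algo] + curated[: n - n_algo]
    random.shuffle(slate)  # avoid always showing algorithmic picks first
    return slate

# Hypothetical content ids
algo_picks = [f"algo-{i}" for i in range(10)]
editorial_picks = [f"curated-{i}" for i in range(10)]
print(blend_recommendations(algo_picks, editorial_picks))
```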
Hugo: Yeah, there's a lot in there. One thing that came to mind is something you've spoken about before, which is the fact that this is great for your viewers and audience, but it's also essential for you, perhaps even in a legal sense, because unlike Facebook, for example, and other players in the attention economy, as you say, you both produce content and distribute it, so I presume it's a legal question as well?
Gabriel: Yeah. So we are definitely liable for the content we produce, and in particular there's two areas where this becomes challenging. So in the product that my team has just been working on, and that was actually released two days ago, so we're very excited about that.
Hugo: Congratulations.
Gabriel: We basically, we've created a short form video product that takes clips from the BBC and shows them to you. Now, a lot of the clips generally in the BBC are embedded within text, and that allows the text to balance some of the views or clarify some of the views. Now, because we're just pulling out videos and there's no surrounding text with the ability to explain something, there's the potential that these can feel like they're being taken out of context, and that creates problems for us.
Gabriel: The second challenge that we have, and this is probably more relevant to the platform conversation is, as a media organization we can be held responsible for contempt of court, and the example we came across not too long ago was, if someone is, for example, denying that he sexually harassed someone, and we have a video of that person giving that denial, and if we then underneath that video show related content, and we have a good content to content recommendation, then there's a chance that we will show other content related to sexual harassment, and some of these people might be proven sexual harassers, where the court actually decided that they were guilty, and that could be considered a position on the guilt of that person from our side, and that is contempt of court. So we need to find the right ways of how to manage that, and that's quite challenging at the moment because a lot of our processes, as an organization, were based around us always really driving what the viewers will see.
Gabriel: Now, in a one to one relationship, as driven by recommendation engines, that is no longer possible. So we're trying to figure out what exactly is the best way of doing that, which again, is the reason that we have an editorial person working with us very closely to make sure that we are on the right side of the law, but also the right side of our editorial guidelines, and the right side of the public service remit that we have as an organization.
Hugo: And it also seems that having a human editor in the loop is a great way to position data science not as a new discipline and arm of the organization trying to take it over, but as embedding itself in the organization that will incorporate the traditions and history and culture of the organization within it, having data as one input to what the organization does.
Gabriel: Yeah. So I'm a strong believer that data science is there in order to augment an organization. So I think there's certain decisions that data science can automate, and generally these are the decisions that most people don't really want to make because they're quite repetitive, and data science is mostly powerful if it can automate the stuff that then frees up people to actually do the stuff that is actually more interesting. So the creative decisions, the more strategic stuff that an algorithm is going to take quite a while to be able to properly support.
What challenges do you face in incorporating data science at the BBC?
Hugo: So in that case, what challenges are involved in incorporating data science into the decision function at organizations such as the BBC?
Gabriel: Yeah, so we talked a bit about that just in our previous question. So, as we're a media organization, and a media organization that is quite often in the spotlight, we are sometimes a bit, we're quite nervous around how stuff can go wrong, and I think with data science it's a lot less predictable what results you will get going forward, because you cannot observe all of the possible ways of how content could be placed in front of an audience, just because everyone will see something slightly different. And we also don't really yet understand how machine learning will be interacting with our editorial heritage, and I think finding that balance where actually machine learning supports editorial and editorial moves away from… At the moment basically our editorial process is what I would call one of micro decisions.
Gabriel: So our editorial teams will decide on which videos go where, which text we're using to put in front of the audience, what exactly the title is. That doesn't really scale very well to one-to-one relationships. So we need to find the right ways of how we move this from these micro interventions to macro interventions, where instead we will work quite closely with our editorial teams to develop rules that guide what algorithms can and can't do, and make sure that it's still within the heritage. So a lot about that is around how do we provide people with enough reassurance that this new world that is a bit more algorithmically driven does not go too far away from the BBC's public remit and our editorial heritage, and our editorial, stuff that really is what makes us the organization that we are.
Hugo: Yeah, and that, once again, speaks to incorporating data tools into the organization, as opposed to the other way around.
Gabriel: Yes, exactly.
Hugo: So, I know that you're interested in what you refer to as applying machine learning in a sensible way, and I'm just wondering what this evokes for you, or what sensible ML means to you?
Gabriel: Yeah. So we talk about responsible machine learning a lot of the time, because of some of the stuff that you've mentioned earlier, like filter bubbles. I think as a public service organization we have a bit of a concern that machine learning might not always be used for the benefit of the individuals, and therefore we want to make sure that at least when we build stuff it really provides the users with true agency. So there's quite interesting research where people say that they want to own the data that they create, and they feel that it's up to them, but actually they don't really understand how that data is used by big tech organizations. And for us also this responsible ML also means that it's really about it being…
Gabriel: So if there's a machine learning algorithm, that more and more will decide on what kind of content you get access to, let's make it specific here and talk about news, we need to make sure that there is no commercial or political agenda behind that, because obviously news massively shapes opinions, opinion shapes elections, and we've seen quite a lot over the last couple of years how if you're not careful, and you don't really understand what's happening there, you can get yourself into quite a bad place as a country. So really making sure that people find the information they can trust is really important for us, and this is the independent stuff.
Gabriel: Impartiality is also really key for us. So the BBC has been built as a public service organization, because in 1922, or actually in the twenties, there was this feeling that radio was just too powerful a technology, and it shouldn't be owned by commercial interests at all. It was probably also too powerful a technology to be owned by the state, which is why it is separate from the state. And there's a strong feeling that maybe machine learning can become a similar technology to that, and we have to be very careful that we do not use existing biases that might be in the data to reinforce some sort of negative loop. And finally, it's about universality, so how do you make sure that the benefits of machine learning are for everyone? And we don't end up in a world where you can't really afford any of the stuff that I might advertise to you, therefore I'm not going to provide you with any content.
Hugo: There's a lot of stuff in there that I'd like to touch upon. When you were talking about independence, you mentioned this idea of the vitality of it, how vital it is that people can actually trust the recommendations, and the algorithms that people are providing. It seems like in general we are going in the opposite direction in some ways, I mean we had things this year like GDPR, which helped to a certain extent around consent of use of data and right to delete, among other things, but a lot of the time I think people, including myself, don't even know what data is collected, how it's collected, why it's collected, what it's being used for, who it's being shared with, and these seem like huge challenges, right?
Gabriel: Yeah, I would agree with that, and I think it comes back to me… ML is not necessarily the thing that creates these problems, ML is the thing that exaggerates those problems, or exacerbates those problems. The problems are probably more around the business models that are basically reinforcing some of this ML behavior, but I think us as organizations, we will need to put a lot of effort over the coming years to clean up our act and be a lot more transparent around what we do, really be clear about what our algorithms do, what data we use, and how they work, but also give people the ability to opt out of it if they are feeling uncomfortable with it. I think it's all going to be about providing people with real agency again, because we run the risk otherwise that we destroy this technology that actually has a lot of opportunity, and to be fair is the only thing that probably will solve a lot of the problems that we have going forward.
GDPR Compliance
Hugo: Right. Speaking of GDPR, was GDPR compliance a pretty serious issue for you at the BBC?
Gabriel: We did spend a fair amount of effort and time to make sure that we would become GDPR compliant, because we do collect information about people, we have sign in, so there's quite a few areas, and obviously as a large organization as well, you have a lot of information about colleagues, etc. So GDPR in my view actually hits a lot more organizations than organizations realize, and any big organization needs to be doubly careful in this space.
Hugo: Absolutely. I want to start thinking about data products, machine learning as a service, your thoughts about how we can spread machine learning knowledge and practice in general. So I suppose as a bouncing board for this, maybe I can say you're involved heavily in developing broader data science and machine learning architectures, in particular to make sure that best practices are adopted, for example. I'm just wondering what this involves and how you think about this?
Gabriel: So one of the challenges that we have across the organization, as I mentioned before, we have for example a great R&D team that will build some stuff, and then we have lots of product teams that could probably use the stuff that's being built by the R&D team. But it's not necessarily an easy transfer of the technology out of R&D into the actual product teams. So one of the things that I'm quite interested in at the moment is what I would call ML as a platform, or ML as a service, which is how do we make developing and deploying machine learning models at BBC scale as simple as watching TV?
Gabriel: So what is the process that we need to put in place, what are the platforms that we need to put in place that make it quite simple to create a model, bake a model, and then hand it over into a system where it will run and scale? And ideally you do all of this while being aware of what other teams are doing, so that you can build on the shoulders of giants, so that you're aware of the results that have not worked somewhere, so you don't try those things, and what that enables as well is that you can embed certain metrics, for example, at the end of a test. So we can embed some of the things that we're really keen on, in terms of our responsible machine learning, and make sure that those tests are passed before anyone can put anything into production.
Gabriel: So it gives us the ability to also encourage the teams to be aligned with some of the thinking that we have in the space around responsible machine learning.
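One way to read "embedding certain metrics at the end of a test" is as a shared deployment gate: every model has to pass the same checks before it can be promoted. The sketch below is a hypothetical illustration of that idea; the check names, thresholds and payload structure are all invented rather than anything the BBC has described.

```python
# A toy deployment gate: models are only promoted if every shared check passes.

def diversity_check(recommendations, min_distinct_genres=3):
    # Example responsible-ML style check: the sample slate should span genres.
    genres = {item["genre"] for item in recommendations}
    return len(genres) >= min_distinct_genres

def accuracy_check(metrics, min_precision=0.10):
    # Example quality check on offline evaluation metrics.
    return metrics.get("precision_at_10", 0.0) >= min_precision

REQUIRED_CHECKS = [
    ("diversity", lambda payload: diversity_check(payload["sample_slate"])),
    ("accuracy", lambda payload: accuracy_check(payload["offline_metrics"])),
]

def can_promote_to_production(payload):
    failures = [name for name, check in REQUIRED_CHECKS if not check(payload)]
    if failures:
        print("Blocked by checks:", ", ".join(failures))
        return False
    return True

payload = {
    "sample_slate": [{"genre": g} for g in ["news", "drama", "science", "sport"]],
    "offline_metrics": {"precision_at_10": 0.14},
}
print(can_promote_to_production(payload))  # True in this toy example
```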
GUIs
Hugo: So in this sense, with machine learning as a platform, and machine learning as a service, do you envision a future in which people across many teams and parts of the organization can use GUIs to build and deploy models, as opposed to writing code?
Gabriel: Well, that's a good question. I'm not sure it will go quite that far, but I definitely envision a world where actually I believe that there will be more smart people not sitting in my team than sitting in my team, and we need to find ways of allowing those people to contribute to the work that we're trying to do. So I definitely envision a world where they can start with something that isn't from scratch, and where there is enough there that helps them make sure that they follow a proper data science process, and that makes it much easier for them because they don't have to worry about infrastructure and all of the other stuff that otherwise takes a lot of your time. They don't need to worry about data integration and all of that stuff. Whether it goes all the way to a GUI, I'm less sure. That all depends on how far we can go on this journey.
Hugo: And I'm just wondering, I clearly think about this a great deal, but is there a danger involved in this that essentially if people are building machine learning models, most of the time they'll be building mathematical models, and if they don't actually understand the math behind the models they're building, is that dangerous in some way?
Gabriel: I think it is, and this is why it's important that you embed a certain amount of process around it, and certain scores that you run at the end. And I see this a lot when actually interviewing people for some of our roles, is you can notice certain people who just apply technologies, or algorithms, or methods, and they don't really understand the assumptions behind it, and at some point those assumptions break down and you get unexpected results. I think for a lot of people this is just a question of training, right?
Gabriel: So there's a question around making sure that they, and I don't think they need to understand all the algebra or whatever arithmetic that sits behind some of the models. They need to understand where stuff can become dangerous, and what the assumptions are that go into the model, and what the assumptions are that come out of the model. And I think that's something that you can teach people, and I think that's much more valuable to teach that kind of stuff than to teach them how to set up the right infrastructure, etc., which can be automated a bit more, or put into this platform.
Hugo: Right, and it's interesting that you mentioned hiring and that process. I suppose a question that our listeners would be very interested in is, when you hire for your team, what do you look for, and what type of people would best do the work that you need to do?
Gabriel: I think we're looking for people who are very curious, who are obsessed to a certain extent with finding better solutions, but people who understand that actually it's all about solving the problem, it's not about applying the coolest, newest, or whatever algorithm, and that's why I was talking a bit about these assumptions. So do they understand what the limitations are of their technique, and do they only move on to the more complicated technique if the limitations are deal breaking? Or do they start with the newest thing just because that's what everyone's talking about? So that, in a way, very pragmatic approach to data science, which is always around, actually I'm not here to create a cool algorithm, I'm here to solve a business problem, and I therefore need to understand what parts of that business problem really matter, and therefore decide whether the assumptions that feed into the algorithm more or less align with the assumptions I have underneath the business problem.
Hugo: In that sense, it's a really practical game.
Gabriel: I think so. I think actually we're just starting an apprenticeship in data science as well next year, and I actually really like this idea of data science being an apprenticeship. I think you do need to have a certain amount of minimal understanding of programming and mathematics, etc., but there is nothing that will compensate for your ability to just work on models and learn the hard way, right? Like I'm sure for your career, you will have done a bunch of stuff where you think, "This is feeling a little bit too good. This is converging too quickly." Or something like that, and then you realize that you've leaked data from your target set into your training set, or something like that, and unless you've experienced that a couple of times, you will not be aware that that kind of stuff can happen.
Hugo: I love that. Particularly as data science is a discipline where career paths aren't necessarily clear, and the role of junior data scientist is something that we're seeing a bit more of now, but in all honesty, the industry as a whole is in a woeful state in terms of people being able to enter from a junior level.
Gabriel: Yeah, it's an interesting question as well. I think the other challenge you have a bit is that a junior data scientist in one place is not the same as a junior data scientist in a different place. There's not even a consistent definition of data scientist, right? In certain places data scientists are product analysts, in certain places data scientists are research scientists. So it's very confusing for anyone out there who's trying to break into this to try and understand where they should start, because it's not very clear what would be expected if you purely look at the job title.
Aspiring Data Scientists
Hugo: For sure. And we'll actually link to your careers page and the apprenticeship when that goes live in the show notes, so our listeners can check that out. A question that I hear a lot from aspiring data scientists is whether they should go to grad school, or start as a data analyst and learn a bunch of stuff on the job, and I was wondering if you had any advice around that question?
Gabriel: That's a really good question. I think I've had great data scientists coming from both directions. I think it really depends on what kind of data scientist you want to be, right? So there's this concept of the research data scientist, and the applied data scientist. And the research data scientists are there to build new algorithms that then the applied data scientists can use to really solve the problem. So I think it really depends on what you're more passionate about. Are you more passionate around writing lots of papers? Are you more passionate around creating new knowledge? Or are you more passionate around really trying to get the last percentage of efficiency or the next 10X growth out of the products that you're building? And depending on what you're more passionate about and what maybe feels more natural to you, I would decide one or the other. I don't think there's a one-size-fits-all, because again, there's no one-size-fits-all data scientist across organizations anyway.
Hugo: The other thing I would say is that if you're going to go to grad school, you have to really, really, really want to go to grad school.
Gabriel: Yeah, you need to really love your maths, right? Otherwise there's no point in doing this for a couple of years.
Favorite Data Science Technique
Hugo: Yeah, exactly. So I'd love to just get slightly in the weeds and a bit technical. I'd just love to know what one of your favorite data science-y techniques or methodologies is?
Gabriel: I don't know whether you count this as data science, but definitely probably my favorite one is Kalman filters, and I really like that because in a way it's so cool to be able to say, "Yeah, I've used rocket technology in order to optimize the organization." So Kalman filters were developed in order to help during the Apollo mission to bring the rockets back to Earth, because actually the precision you need is insane, in terms of the angle etc., and Kalman filters allow you to appreciate that there's measuring error, and then movement error, and it deals with that, and it brings both of them together to actually, it's this really weird thing where you have uncertainty in measurement and uncertainty in movement and somehow through this magic of Kalman filters it has less uncertainty.
Gabriel: And I've used this, so far the BBC is the first organization where I haven't yet been able to implement Kalman filters, but in the past I've used this, because I also believe that if you measure something like price elasticity, for example, you probably have an uncertainty in your measurement, because it's the price elasticity of the users who have bought this product at this point in time. You haven't talked to everyone in your country, so you have the measurement uncertainty, and price elasticity isn't fixed, it probably moves over time. So you actually have quite a neat parallel to this whole rocket idea. So actually beyond it being actually cool to say that you're powered by rocket science, I think it actually also provides some interesting and useful tooling, and it's not very commonly used beyond something like self driving cars, but even there they've moved on to more sophisticated techniques. But I don't think Kalman filters are something that a lot of your data scientists would come across in their training.
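For the curious, a one-dimensional Kalman filter is only a few lines: you track an estimate and its uncertainty, let the uncertainty grow a little each step because the quantity drifts, and then pull the estimate toward each noisy measurement in proportion to how trustworthy it is. A minimal sketch, with made-up numbers standing in for weekly price-elasticity measurements.

```python
def kalman_update(estimate, estimate_var, measurement,
                  measurement_var=0.20, process_var=0.01):
    # Predict: the true value may have drifted, so our uncertainty grows.
    prior_var = estimate_var + process_var
    # Update: the Kalman gain weights the new measurement against our prior.
    gain = prior_var / (prior_var + measurement_var)
    new_estimate = estimate + gain * (measurement - estimate)
    new_var = (1 - gain) * prior_var
    return new_estimate, new_var

estimate, var = -1.0, 1.0                   # initial guess, high uncertainty
for measured in [-1.4, -1.1, -1.3, -1.2]:   # noisy weekly measurements (invented)
    estimate, var = kalman_update(estimate, var, measured)
    print(f"estimate={estimate:.2f}, variance={var:.3f}")
```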
Hugo: I've never used Kalman filters, and I look forward to checking them out, and I'm also sure you're looking for an opening and it's only a matter of time before you get to use them at the BBC.
Gabriel: Yeah definitely, definitely. I'm trying to not be too distracting to my team though. They need to just get stuff out, right? As we mentioned beforehand, it's all about making sure you have impact on users, not about using the cool techniques. So I need to listen to that advice myself every now and then, as well.
Hugo: That actually brings something else to mind. Just quickly, I'm wondering, mentioning your team, what do you consider the most important part of your role in dealing with your team, or managing your team, or your sense of responsibility there, if that makes sense?
Gabriel: So I really manage a mixed product team. So in my team I have data scientists, but I also have software engineers, data engineers, architects, and actually a product manager, and all of that stuff around this. So for me the biggest job is, again, being that translator I think that I talked about earlier. So trying to find the right opportunities where we can really contribute and explaining that to the business, but also making sure that the business, sorry, that the team who has to build all of that stuff understands how it fits into the bigger context, and to then create the space and the opportunity for the team to really show how they can impact the BBC.
Hugo: I love that idea of creating space in that kind of role to allow your team to flourish and facilitate them doing the best jobs possible.
Gabriel: Yeah. I mean, it comes back to this, right? No one wants to be the savior data scientist who comes in and then gets an impossible task, and then suddenly the organization gets disenchanted, and gets rid again of data science. I think it's about managing the expectations of what's actually really possible, while giving people the space to grow and giving them the protection, so that they can suddenly come out and show something for the organization that they never thought was possible. I think that's the stuff that's really the exciting and challenging part of my role.
Call to Action
Hugo: Yeah, incredibly exciting. So my final question, Gabriel, is, do you have a final call to action for our listeners out there?
Gabriel: Yeah. I think for me, the thing that I'm more and more passionate about is that as data scientists it's quite easy for us to say that we're just scientists who observe the world, and I fundamentally disagree with that. I think we are world shapers. So if you're building a recommendation engine, it's not like you're observing what people are looking at and you guess. You're shaping their decisions, you're shaping their behavior, and as we write more and more algorithms that decide what kind of news people see, what kind of universities people can apply to, whether people get an interview or not, who you get matched to on dating sites, whether you get a mortgage, at what rate, whether you're going to go to prison, etc., I think data scientists need to start taking a lot more responsibility for the outcomes.
Gabriel: It's not good enough for us to say, "Well, I've been asked to optimize for this business objective and I just did it, and all of the bias that was in there was already in the data." I think we really need to take a lot more responsibility, because we are really the only ones that properly understand what's happening. Because I think that data science has such a huge role to play for our future, because there's this whole bunch of problems that we cannot solve without it. Be that efficient energy distribution, a whole bunch of health care stuff, we really need to make sure that we can make the most out of the potential, and I think the only way we can do that is by creating proper customer agency. And for customer agency to be there, customers need to trust that whatever we build is in their interest, and not just in the interest of the organizations we work for.
Hugo: Yeah. I really like this idea of thinking about the responsibility of data science and ML, in terms of the impact it has, and I actually had Cathy O'Neil on the last season, author of Weapons of Math Destruction, and she's reconfigured her definition of data science, and she now says, "Data science doesn't just predict the future, it causes the future." So that's a line that she thinks about a lot.
Gabriel: So I talk about data scientists now as market makers, because I think we actually create, we connect stuff, and through those connections we change realities. And recommendations is the simplest one. So I personally don't believe that you have a perfectly formed view of what you would like to see, for example when you go onto Netflix or something like that, or the BBC. I think instead what happens is that we give you a bunch of possibilities and then as you interact with those possibilities your opinion about what you would like to see really forms. And with a recommendation for some entertainment, that might not matter, but for a recommendation for news, or all the other places where machine learning is now being used, around HR, all of that stuff that actually has a fundamental impact on what choices people have in front of them, I think it's really, really important to take to heart that actually you are a market maker.
Hugo: Yeah, and I actually really like the twist you make on ethical data science and data science ethics, which is of course a current huge conversation, but the twist you make in terms of turning it from ethical machine learning to thinking about responsible machine learning, and the responsibility of data analysts and data scientists and machine learning engineers in this context.
Gabriel: Ethics is just this word that creates too much discussion about it. I think, to be honest, responsibility is also not well enough defined. So no one would say we use irresponsible machine learning, right? It's very easy to say, "Of course it's responsible." I think the purpose therefore is we have to push organizations even further to say, "Okay, so what does that mean? What are the trade offs that you're going to take? How are you going to optimize between the benefits for your organizations versus the benefits for the individual?" What's that cost function of individual freedom, if you want, compared to organizational benefits? And I think that's the way that it's going to become more responsible.
Hugo: Absolutely. Gabriel, thank you so much for coming on DataFramed.
Gabriel: Thank you so much for having me.