@IndeedEng: Interactive Analytics with Imhotep


October 8, 2019


Well, once again, welcome to the final Indeed Engineering talk of 2014. As Jack mentioned, we're going to have a talk followed by a brief demo, and then we'll break out into a workshop. We'll have lots of people around to help you get your own clusters up, and we also have a publicly available Imhotep instance. So even if you don't have your own AWS account or your own data, we can still show you how to use Imhotep, and you can get it running later.

As Jack said, my name is Tom Bergman, and I'm a Product Manager at Indeed San Francisco. I moved out to San Francisco about three months ago with a new team to start building a new product for Indeed. Mini pitch: for those of you who've looked for jobs in the past year or so, you may remember that everybody keeps a big spreadsheet: all the companies you've talked to, the jobs, where you are in the process, and what you have to do next. What we're trying to do is build a way to magically track all of that for you and automatically give you great recommendations. Over the past couple of months we worked really hard, and a few weeks ago we rolled out the very earliest v1 to a very, very small group of test users. I was really excited: we have this new product out. What's happening? Is it working? Are people seeing it?

For questions like that, I turn to Imhotep. Is it working? Is anybody there? Are we getting traffic? You can see here I have a query at the top hitting Imhotep, and at first we have no traffic, and then, magic, we have traffic. It's working. Okay, cool, so it's on. But are people using it correctly? Are the features working? Again, I go and ask Imhotep: are these features on? Are people clicking these buttons? Are these pages loading? Are these events firing? And I can see that yes, they are. I can put in a query about the different things we're trying to track and see whether they happened. I can also ask: how is this doing against other products we have at Indeed? How is it doing against previous versions of these pages? Are we winning? Here I'm actually querying three separate data sets. They're completely different data sets that hold different bits of information, but I can ask how this new product is doing against the previous desktop version, against the previous mobile version, and so on, graph all four of these lines at once, and pick through it. In addition to simple questions like these, I can ask more complicated ones. For example, we show a promo to users after they return to the site, based on their test group and how long they've been away. What's the response rate? I can put a query into Imhotep and get an answer bucketed by milliseconds that tells me how those response rates change. So that's great: I can go ask Imhotep and get answers to my questions. But that doesn't help you very much, so I want to get into how you can start using Imhotep to do the same kind of thing.

First off, I should cover what Imhotep is. You've probably heard this a thousand times, but I'll reiterate: Imhotep is Indeed's highly scalable, open-source analytics platform. We open sourced it a few weeks ago and we're really excited about it. So what does the open source release contain?
There are really four main parts to Imhotep. First, the Imhotep daemons: this is the backend, the thing that actually runs your queries and produces results. Second, the Imhotep Query Language, or IQL: the query language we use to ask questions of the daemons and get responses back. Third, what we call the IQL web client: a web-based app you can go to, type in a query, and get responses right back in your browser. And finally, the TSV/CSV uploader: another web-based tool where you can drag and drop or upload a TSV of data, and it will pull it into Imhotep and make it queryable.

So what does Imhotep do? There are two main things we use Imhotep for that make it really valuable to us. First, it lets us easily upload files and compress them to very small sizes, so we can get data into Imhotep quickly. Second, once the data is there, we can run interactive queries on it and get responses back in real time. That's what's awesome about Imhotep for us.

But Imhotep is not the first analytics tool we built at Indeed. We built some of its predecessors way back in 2010, and Imhotep is the latest in that line. Over the course of building these different tools, we learned a lot about what makes an analytics platform work, what makes it useful, and what it takes for it to be successful inside a company. So I want to go over the philosophy we took with us in building Imhotep, and why we built it the way we did, so you can understand those choices.

First, we think it's important that it's interactive. You need to be able to ask a question and very quickly get an answer back. We wanted to lower the barrier to asking questions of Imhotep: I ask a question, I get an answer right back. If you have to wait a really long time for a response, you're going to ask fewer questions and understand your data less, and if it's hard to formulate queries and slow to get answers, fewer people will use the tool at all. So we wanted to lower the barrier both to asking questions and to asking follow-up questions. If any of you have tried to do big data work in Hadoop, you probably know the pain of running an incredibly long Hadoop job that returns absolutely no results: you wait 12 hours and get nothing for it. That's horrible. With Imhotep we didn't want that to happen. You ask a question; if it's wrong, you get an error back right away, and if it's not quite the question you wanted to ask, you ask it, tweak it a little, tweak it a little more, and finally get the answer. In reality it looks more like this most of the time: ask a few questions, see something interesting, pivot, try to find out what's really interesting about it, keep going until you find the really cool stuff, and then run a bunch more queries afterwards that maybe aren't quite as cool, but it's all so fast that it doesn't matter. And by making it that easy to ask questions, everyone in the company can get involved.
The other thing we think is really important is having the real, true data there. We believe you should never downsample the data, because when you downsample it becomes really hard to drill down into the details and get good results. One way people often make analytics systems fast is aggressive downsampling. Anybody here who uses Google Analytics has probably tried to drill down into something very small and gotten nonsensical results, because with that much downsampling you can't really understand what you're seeing. Even if you do understand what's happening with the downsampling, there are assumptions baked in, and if you don't understand those assumptions you'll get the wrong answers or interpret the results the wrong way. So we keep the real data accessible, and that lets us do things like this. Here's a query I pulled up for this talk: looking at just one day in September, in the country of England, where the user came from Belgium and searched for "administrator", what are the top 10 job titles and what is the click-through rate on those titles? I can just run that query and see the precise, real numbers for this very tiny slice of data from a while ago. We think that's really powerful: it helps us look into outliers, it helps us look into trends, it helps us do all kinds of things.

What else do we think is really important? A web-based tool, to facilitate sharing. If I have a program running on my computer, I have to get it set up, execute it, and download the answers, and if I want somebody else's opinion on the results, they have to make sure they have the tool installed, then figure out how to run the same query I did and get the same answers. With a web-based tool, I can run a query, look at the answers, copy the link, and send it over to somebody on my team: hey, have a look at this. They can immediately see what I'm doing, tweak the query a little, and send it back to me: well, what about this? We can have that conversation right there. By having a web-based app that makes sharing easy, we build what you might call a data-driven culture: show me the data, don't just make an argument, show me what's really happening. When PMs at Indeed talk to each other, the conversations generally look a lot like this: hey, here's a query; here's another query; oh really, here's another query; and so on. Sometimes, when we're getting into really hairy tests, we'll have JIRA issue comments that look like this. Shout out to the ridiculous amount of work Jan did to prove a point here.
What else is really important? I talked about making it easy to share results with other people. If I execute a particularly long query, I don't want somebody else to have to wait at all to see it; I want them to join the conversation right there. So in the Imhotep web app we do really aggressive caching of results. I can run a query on my computer, send it over to someone on my team, and they instantly have the exact same answer. We don't have to wait for it to execute again, and that gets us out of talking about how to get the data and into talking about the data itself.

Queryability also matters. It doesn't matter how fast you can get results if you can't easily formulate the queries to get at them. We wanted a query language expressive enough that we can ask most of the questions we want, but not so incredibly verbose that it takes forever to write.

Before I get into what the query language looks like, it's probably helpful to talk about what the data inside Imhotep looks like. The primary data structure in Imhotep is the data set. You can think of a data set as being like a table in a relational database: we might have one data set that's all of our searches, and another that's all of our applies. Any query we ask is asked against one specific data set. The key item inside a data set, the atomic unit of Imhotep, is the document; that's what the data set is based around. You can think of a document as a denormalized row in a relational database: it's one item, and it carries all the different information about that item. Next is the field, which you can think of as a column, and a term, which you can think of as a value. For example, in a data set about searches we might have one field that's the number of results, so its value might be ten or five or seven, and another field that's the job IDs that appeared in the results, maybe 1, 7, 5. We might also want to know which positions were clicked, say 5, 3, 2. So the value for a field can be either a single term or a series of terms, all associated with that field, that attribute of the document.

Since the data structures map fairly naturally onto a relational database, we thought it made sense to use a query language reminiscent of one. So we made a query language called Imhotep Query Language, or IQL. It's an expressive, SQL-like language for aggregate analytics. If you know SQL, you should be able to pick it up really quickly; if you don't know SQL, you should still be able to pick it up really quickly. The base query is incredibly simple: there are only two requirements. You choose a data set, which table you're querying, and you give a date range. Imhotep assumes that all documents are time-series documents, so when you choose the date range you're choosing how much data you're querying. If you put in just those two things, a data set and a date range, it returns the number of documents in that date range for that data set.
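For example, the simplest possible query looks something like this (a sketch; jobsearch and the dates here are just placeholders):

    from jobsearch 2014-12-01 2014-12-08

With no filters, group by, or metrics, that returns a single number: the count of documents in the jobsearch data set for that week.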
But there's a lot you can add onto that base query. We have filters: things like where field equals term, field greater than term, or field in (term one, term two). We have group by: if you're familiar with SQL, you know you can group by a field, but we can also do a limited group by, for example group by just the users in test group 1 and test group 2, and it will do a group by limited to those two groups. And we have metrics. Like I said before, there are lots of attributes of a document you might want to store in a data set; for a click, for example, you might want to know how much revenue that click generated or how long it took someone to click. Those are metrics, and we can do all sorts of math on them: multiplication, addition, modulo, all kinds of things to manipulate the metrics into whatever we want.

Let me walk you through an example query. First, the data set: we say from searchresults, where searchresults is the name of the data set. We put the data set first in the query because it enables really smart autocomplete; if we don't know which data set you're querying, we can't autocomplete anything else. Next, the date range: in this case we're asking for all documents between December 5th of last year and December 10th of last year, those five days. Then some filters: country equals Ireland, and an inequality filter where the job age in days is less than one. That gives us all the search results in Ireland for jobs less than one day old. Then a group by. You can group by any of the fields, but there's a special group by called time, which accepts inputs like days, minutes, hours, seconds, and so on; you put in whatever unit you want to slice by and it interprets it intelligently. And finally the metrics. We can choose multiple metrics separated by commas; we could even add two together and select clicked + count, although that wouldn't be very meaningful. In this case clicked is the number of clicks on the result, and count is a special metric meaning the number of documents, so we get the number of clicks and the number of jobs. We put in the query and get a result like this, in table form: the group by gives us five days, so five groups, with the clicks for each group and the count for each group. Then we can click over to the graph tab and immediately see a graph of the same thing, as fast as the JavaScript can draw it.
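Put together, that example looks roughly like this as a single IQL query (a sketch; the dates and field names are just the ones from the walkthrough above):

    from searchresults 2013-12-05 2013-12-10
    where country=ireland jobagedays < 1
    group by time(1d)
    select clicked, count()

Those four pieces are the from, where, group by, and select clauses, and together they produce the five-row table and graph described above.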
So you can query the data and ask lots of questions about it, but that's only as good as the data you have in there, and if it isn't easy to get data into a system, you won't put it in, and then you won't have it to query. Extract, transform, load, often referred to as ETL, is the data warehousing term for how we get data out of the system it lives in and into the new system that will do the analytics. We tried to make ETL for Imhotep as easy as possible, because we use it ourselves and we want it to be easy. First, extract: you have to get your data out. If your data schema is really complicated, this part is going to be hard; if it's not, it won't be.

Next, transform. We have a TSV uploader that makes this really easy; there are just a couple of things you have to do first. Denormalize your data: if the information about a document is spread across multiple databases, you might do a big join across those databases and merge everything into one big table, so you end up with all the information you care about in a form that can be uploaded. Then there's a little bit of formatting: making sure the header is correct, the date needs to be in Unix timestamp format, and there are a few rules around how the file is named, but we have a linter script that takes care of most of that for you, so it should be pretty easy, and it's all outlined in the documentation. One more thing: as part of the formatting configuration you can specify whether you want fields to be tokenized or turned into bigrams, and it will do that automatically when it creates the data set.

Some example data sets we have at Indeed: job search, ad clicks, resume contacts. One of the hardest things about setting up a data set is deciding what you want the denormalized document to be. For job search, we have one document for every load of a search results page; the page load is the primary document. Since it's a page load, it doesn't make sense to include very detailed information about each individual job: there are going to be 10, 20, 40 jobs on each page, which is a lot of information to join in, and it isn't really applicable to that document. Likewise, we have a data set that's ad clicks, all of the sponsored clicks; there it makes sense to store the revenue associated with each click, whereas that might not make sense elsewhere. Resume contacts, you can imagine. The main thing is to think about the types of questions you want to ask and make sure the data set you're putting into Imhotep lets you answer them.

Once you have the data ready, you go to the load step: the TSV/CSV uploader. You just drag the file in, or go to the website and upload it, and it slurps it up into Imhotep, and in a very short period of time you can query it. For more complicated things we also have a Java API, so you can build your own data sets and upload them through the API. In practice at Indeed we do about 50/50 of each. One of the things we really like about the TSV uploader is that when we get data from outside sources, or from Excel, or from other programs we're using, it's pretty easy to get that data into TSV format, so we found it was the minimal path for getting things into Imhotep.
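As a rough illustration of what a denormalized, uploadable file might look like, here is a tiny made-up example (not one of our real schemas; the column names, including unixtime, are placeholders, and the file is tab-separated even though it's shown here with spacing for readability):

    unixtime     country   query          jobagedays   clicked
    1417804800   ireland   administrator  0            1
    1417804801   ireland   nurse          3            0
    1417804802   gb        warehouse      1            1

Each row is one denormalized document, and each column becomes a queryable field once the uploader pulls the file into Imhotep.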
Once the data is in Imhotep, how does it get stored? We're a search company, and we like to reach for search when we think about how to solve problems, so we use a data structure called an inverted index, which is a really common data structure in search. It has a couple of really good attributes, which is why we chose it. One, it offers massive compression. We have a sample data set you'll see in a little bit, the Wikipedia data: uncompressed it's about 2.5 terabytes, and once we compress it into an inverted index it comes out to less than 250 gigabytes on disk. That's really nice compression; it gets the data very small and lets us query it very quickly. Two, being an inverted index lets us do boolean searches much faster than querying a traditional database.

As we've said a number of times, it's open source. The open source package can run on basically any modern Intel chipset, really any modern processor. It's not going to run on a cluster of smartphones yet, but who knows. We've also wrapped it up so you can easily deploy it into AWS via a CloudFormation script. If you want to go that route, it's super fast; you can get up and running in basically three steps: create the S3 buckets, create the key pair, run the script, and done, you have an instance up and can start working with it.

So that's the talk portion; I think that's everything you need to know to start getting into Imhotep. Before we hand it over to Q&A, I want to demo some of the sample data sets, explain what they contain, and point you toward some interesting queries you might want to run. Our first sample data set on the demo cluster, and on the sample data set page in the documentation, is called nasa. It's a selection of Apache logs from NASA's public website, basically July and August of 1995. It's a public data set we uploaded, and since most of you are probably familiar with Apache logs, we think it's a good place to start. Right now I have a query ready: from nasa, with a time range of July through the end of August. I'll run it, and we can see there are 3.4 million documents in this data set, so 3.4 million Apache log lines during this two-month range.
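That starting query is about as simple as IQL gets; something like this (a sketch, using the data set name from the demo cluster):

    from nasa 1995-07-01 1995-09-01
    select count()

With no where clause or group by, it just counts every document in the range, which is where the 3.4 million number comes from.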
Now I can walk you through some of the interesting fields. All the fields here are indexed, and when we click on one it does a group by and shows us the most common terms. For example, there's a field called host; the name is a little misleading, but in this case host is the source of the traffic. As you can see, in 1995 Prodigy was still very, very popular, as was aol.com. We can click on method, which gives us the HTTP method: primarily GETs, with a few HEADs and POSTs. We can click on response, the response code: a lot of 200s and 404s. Probably the most interesting field is url, the URL that was requested. Looking at the most commonly requested URLs, we can see that 1995 was a much "gif"ier web than the one we live in today. It's hard to see what's going on through all these images, so I'm going to add a regex filter on url to exclude all URLs that contain .gif. Now we're looking at the top 1000 URLs that aren't gifs.

If we scroll down, at number 7 we have shuttle countdown liftoff.html. I happen to know this was the URL for the much-publicized, at the time, shuttle countdown page, where you could watch a countdown timer before a shuttle took off. If I click on it here, it automatically gets added into the where clause as a filter, and if I execute the query again I see just one URL, because I'm now filtering to exactly that URL. If we want to pivot, we can look at time: I'll set the group by to time, one day, and run it, and what we see is the number of requests to the countdown page per day. Traffic was fairly low and then spiked massively on July 13th. I happen to know that on July 13th, 1995, there was a Discovery shuttle launch that got a lot of attention, so we're seeing a huge surge in traffic to the countdown page as the shuttle took off. We can go from days to hours to get more granularity very quickly, and we can see that everyone was there early in the morning when the shuttle was taking off, and didn't come back after.

Now I'll move on to the next data set, Wikipedia. This is a public dump of all the pages on Wikipedia that received traffic during the month of September. Again, it's a huge data set: 2.5 terabytes raw, and even on disk it's 250 gigs, so it's very, very big. I'll call out the unit here: the document in this index is actually an hourly rollup of the logs for a page. So if we run a query and look at count, it gives us 24 counts per page per day, because each day has one document per hour. Instead, we want to use the metric num requests, which is the number of requests. I'll execute this now, and hopefully it goes fast, and we'll see how many requests there were in the entire month: about 6.8 billion requests to Wikipedia during September. That was fast, but I still don't want to anger the demo gods, so I'm going to shorten the time range down to the seemingly arbitrary date of September 19th, and I'll group by title. There are a couple of other fields here: the categories, the links out of each page, requests, title words. But I'm going to focus on title, since I think it's the easiest to follow. Right now this gives me the top 1000 titles by count. Count isn't very meaningful in this index, so instead we'll ask for the top 1000 by num requests, and it re-sorts. Once this query executes, we'll see, for September 18th, the top pages on Wikipedia by the number of requests they received. Let me make this a little bigger so people can see.
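At this point the query looks something like this (a sketch; the exact data set, field, and metric names, like numrequests, are whatever the demo index actually uses):

    from wikipedia 2014-09-18 2014-09-19
    group by title[1000 by numrequests]
    select numrequests

The [1000 by numrequests] part is the "top 1000 titles, ranked by number of requests" group by I just described.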
At the top is the main page with 13 million requests, followed by two undefineds that appear every day, and then Less (Unix). I don't know why that one is there, but it's there every single day in September, so maybe you can dig into it and find out; it's a mystery to me. If we scroll down a little, the Scottish Independence Referendum is the next highest entry, which makes sense, because it turns out September 18th this year was the day of the Scottish independence referendum. Scroll down a little further and Scotland is number 12. So that's cool.

It makes sense that these were both very popular on the day it was happening, but we can also ask how popular these pages were throughout the entire month. What I'm going to do is create a second query here, which lets us run two queries at once. I'll set the range to September 1st through about the 25th and press here to copy it down to the second query. Then I'll click to add the title Scottish Independence Referendum 2014 to one and the title Scotland to the other, deleting the extras, so now we have two otherwise identical queries over September 1st through the 25th, one filtered to the title Scottish Independence Referendum 2014 and one filtered to the title Scotland. I'll change the group by to time, one day, copy it down again, and then we can see how the popularity of these two related pages changed over the course of the month. I'll run this and switch over to the graph; this is that aggressive caching I was telling you about. As you can see, at the beginning of the month there's very little interest in either Scotland or the Scottish independence referendum; they both go up at the same time around the 8th, peak on the 19th, one day after, and then drop back down to normal. The two lines track each other very closely; it seems there weren't many people interested in Scotland outside the context of the referendum. So that's the Wikipedia data set. It's very big and there's a lot of cool stuff in it; I highly encourage you to explore it, but I'll leave most of its mysteries to you and move on to our final demo data set, which is called World Cup 2014.

Earlier I said that Imhotep expects the data to be a time series: all the tools are built around time series, and we use time series for all of our logs, so that's easy. But occasionally we need to use it for non-time-series data, and it can do that too; all you have to do is set the timestamp of every document to one specific day. We indexed this entire data set on one day, July 1st, so if I query July 1st I get all the data in the data set. The document here is a player in the World Cup, so when I ask for count, I get the number of players matching my query. I'll run that now, and we see there were 736 players in World Cup 2014. We have a few different fields: age, the age of the player; captain, a boolean for whether they were a captain or not; the club they play for. I think most of these are fairly easy to understand. Rank is the rank of the team, so all players on the same team have the same rank, and selections is the number of times they've played for their national team.

First, let's do a group by captain, so we're grouping by captain yes or no, and let's see how the age of a captain compares to everyone else, so I'll add age divided by count. Whenever we want an average in Imhotep, we divide by count: in this case it's the total age of the group divided by the number of documents in the group, or the combined age of all the captains divided by the total number of captains. We'll also look at selections divided by count, the average number of selections.
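That query would look roughly like this (a sketch; worldcup stands in for whatever the demo data set is actually named):

    from worldcup 2014-07-01 2014-07-02
    group by captain
    select age/count(), selections/count()

Dividing a summed metric by count() is the standard IQL way to get a per-document average within each group.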
We see that captains are, on average, about five years older than non-captains and have played, on average, about three times as many games for their national team. Since we know there's only one document per player, we can also do some fun things and pull up information about the captains themselves. We'll say where captain is 1, group by player, which is their name, and also group by country. Here I'm adding closed brackets to say: don't show me groups that are zero. That way it doesn't give me the cross product of every player with every country, just the actual country for each player. I'll do the same thing with club, the club they normally play for, and their age. When I run it, we see that I forgot to bracket age; there we go. Now we can see that the oldest captain in the World Cup was Mario Yepes, sorry if I'm mispronouncing that, from Colombia, followed by the captains of Greece, Honduras, and so on. I'll do one more just to show you: we can look at which clubs contributed the most players. We just group by club and look at count, and, hmm, that's still giving me only captains; there we go. Barcelona sent the most players, followed by Bayern, followed by Manchester United. So those are the three sample data sets we have. I'm going to go over to a brief Q&A to answer any questions, and then we'll open up the workshop, and people can come around and help you get started using Imhotep yourself.
