Finding a Needle in a Haystack: Enterprise-wide FOIA Searches at CDC - Transcript
Virtual Event
Thursday, May 6, 2021
9:30 a.m. to 11:30 a.m. (ET)
EVENT PRODUCER (MICHELLE RIDLEY): Ladies and gentlemen, welcome. And thank you for joining today's Finding a Needle in a Haystack: Enterprise-wide FOIA Searches at CDC webinar. Before we begin, please ensure that you have opened the WebEx participant and chat panel by using the associated icon located at the bottom right-hand side of your screen. Please note all audio connections are currently muted and this conference is being recorded. You are welcome to submit written questions throughout the webinar, which will be addressed during the Q&A sessions of the webinar. To submit a written question, select All Panelists from the drop-down menu in the chat panel, then enter your question in the message box provided and send. If you require technical assistance, please send a chat to the event producer. With that, I will turn the webinar over to Alina Semo, Director, Office of Government Information Services. Alina, please go ahead.
ALINA M. SEMO: Thank you, Michelle. Good morning everyone, my name is Alina Semo. And as the Director of the Office of Government Information Services at the National Archives and Records Administration, it is my pleasure to welcome all of you to our event today titled Finding a Needle in a Haystack: Enterprise-wide FOIA Searches at the CDC. I hope everyone who is joining us today has been staying safe, healthy, and well. Shortly, I will go through some basic housekeeping rules and set some expectations for today's meeting.
First, I would like to give you some background on today's event and how OGIS became involved. As many of you know, OGIS is the Federal FOIA Ombudsman. And in that role, we work to improve the FOIA process in a number of different ways, by reviewing agency compliance, by offering dispute resolution services to assist requesters and agencies, by chairing and managing bodies like the FOIA Advisory Committee and co-chairing the Chief FOIA Officers Council, and more. In that role, OGIS has a unique perspective on FOIA programs across the federal government landscape.
For the last 14 months, we have been watching with interest the impact of the pandemic on agencies' FOIA programs. Just over a year ago, OGIS was pleased to host our first CDC-led webinar, FOIA Requests for CDC COVID-19 Records. Once again, this year, the CDC FOIA program managers sought our assistance to speak directly to all of you about how the CDC conducts enterprise-wide searches in response to FOIA requests. You will be hearing today from Srinath Tutukuri, who is the IT Project Manager for the CDC FOIA program. Srinath is joined by the CDC FOIA Director, Roger Andoh, and the CDC Deputy FOIA Director, Bruno Viana.
The PowerPoint for today's presentation is accessible on the OGIS website at archives.gov/OGIS. We will also add it to the chat. Throughout this morning, we will be monitoring the chat function on WebEx. We are also simultaneously live streaming on the NARA YouTube channel, and we are monitoring the chat submitted on that platform as well. We will be taking questions throughout the presentation. So as you think of questions, please type them using the chat function on either platform. Our plan is to pause periodically to check in and see if there are any questions that have come in via chat. And we will also open up our telephone lines on WebEx during those pauses to give attendees the opportunity to ask any questions orally.
An important reminder with regard to your questions: please be aware that this is not the right time to ask questions about a specific FOIA request. We're happy to have all points of view shared, but please respect your fellow attendees and keep the conversation civil and on topic. We will do our best to answer all of your chat and telephone questions. If we do not get to your question, please don't worry, we will post any unanswered questions and answers on the OGIS website in the upcoming days. We are recording today's session and we will post a video and transcript of this event on the OGIS website as soon as it becomes available.
I also want to take this opportunity to speak to those of you joining us from other federal agency FOIA programs. The CDC FOIA program has been proactive in communicating with their stakeholders using this venue. OGIS is happy to help any other agency FOIA program to host similar events. If you are interested, please send us a chat during today's event, call us at (202) 741-5470 or email us at ogis@nara.gov. We look forward to hearing from you.
At this time, I would like to welcome our main presenter today, Srinath Tutukuri, who is also joined, as I mentioned earlier, by CDC FOIA Director, Roger Andoh, and CDC FOIA Deputy Director, Bruno Viana. Srinath is the IT Project Manager in the CDC FOIA Office. He primarily takes care of managing the enterprise searches, in addition to being responsible for FOIA IT infrastructure at the CDC. He has been in this role for more than six months, during which he has explored various tools and options to improve enterprise searches. During this presentation, he will present firsthand information on the enterprise search process, the tools being used, potential issues, and finally tips to scope search requests for optimal results. Srinath, over to you now.
SRINATH TUTUKURI: Thank you Alina. Good morning everyone, I'm Srinath Tutukuri, IT Project Manager here at the CDC FOIA Office. Today, I'll be giving a presentation on how we perform enterprise searches at the CDC FOIA Office, in addition to the issues that we run into at the FOIA Office when we try to run these searches. And finally, some tips and recommendations that we feel can help us get better search results, and we can even probably take some advice and input from the user community and come up with better search results, which will help everyone in the long run. Having said that, I would like to go to the next slide, which is the agenda of this meeting.
Before I get started with the agenda of this meeting, let me go over two important points that I need to tell you. The first is the capabilities that we have at the CDC's FOIA Office. The capabilities that we have are, number one, we have access to search on all the email addresses within CDC's domain. So that comes to around five to 10,000 email boxes. In addition to this capability, we also have another capability where we can search for documents on all the shared drives within the CDC's network. Having said that, we do have some limitations on this.
The first limitation is that we cannot run a wildcard search on any of the mailboxes, and we also definitely need to take some mandatory approvals from the custodians of these mailboxes in order for us to be able to perform any searches. And the second is the same with any of these shared drives too. We need to take the approvals and be granted access on the shared drive before we can search for any documents, and locate any documents if there are any. Hopefully, this gives you an understanding that we do have limitations when we run the search processes and we cannot just simply run a search on all the mailboxes at CDC. And we have limitations where we have to only run searches on restricted mailboxes and a group of mailboxes. We also cannot run searches on a whole division or a CIO if there are hundreds of people. Hopefully, this gives a clear understanding before we can delve deeper into the agenda of this meeting actually.
So the agenda of this meeting has been divided into four categories. The first category is an Overview of ES, which is known as enterprise search. The second category is how we categorize these requests based on the technical complexities. And the third is the issues that we run into when we perform these enterprise searches. And the last is what improvements we would suggest based on the observations that we have seen when we perform these enterprise searches. And finally, we'll also have a Q&A session over this particular aspect. Before I move to the next slide, does anyone have any questions?
EVENT PRODUCER (MICHELLE RIDLEY): Ladies and gentlemen, if you'd like to ask the question via phone, please press #2 on your telephone keypad to enter the question queue. Once again, pressing #2 will enter you into the question queue. Or you may enter your question into the chat box.
ALINA M. SEMO: So right now we have no questions on chat. So go ahead.
SRINATH TUTUKURI: Thank you Alina. Let me go to the next slide. So the first category, which talks about the enterprise search overview, is split into three categories. The first category talks about the process flow. The second category talks about the tools that we use. And the third category is one of the most important features that we use to really refine searches, which is known as dedupe and containment. Any questions on these three slides so far?
EVENT PRODUCER (MICHELLE RIDLEY): Once again, pressing #2 will enter you into the question queue. I don't see any questions on the line, and no questions in the chat.
SRINATH TUTUKURI: Sure. Sure. Let me go to the next slide. The ES process flow. This slide pretty much gives a bird's-eye view or a complete understanding of how the enterprise search process is performed at the FOIA Office here. So even before we get an enterprise search into the technical team, the enterprise search is pretty much analyzed and vetted by the FOIA analysts to make sure that relevant information is present. The key information that we need to perform an enterprise search is, one, the custodian email mailboxes, and the second is the time span. Without these two pieces of information, we cannot proceed with any enterprise searches. Just in case the requester has not provided us the custodian email mailboxes, our FOIA analysts contact the relevant subject matter experts and get us the relevant custodian details for us to perform this search.
The same goes to the time span too. If there is no time span, the subject matter experts provide us the time span information too. So once this information is provided to us, we analyze the search request to see if it needs any keywords. Sometimes the keywords are also provided by the requester, sometimes if there are no keywords, they come in from our subject matter experts. Or if there are no keywords given, I or one of my team members goes through the request and understands the search request and comes up with the keywords to perform a search. Any questions so far on the analyze search aspect of the enterprise search?
EVENT PRODUCER (MICHELLE RIDLEY): I do not see any questions on the line, and no questions in chat.
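As an illustration of the intake check described above, here is a minimal Python sketch. It is not CDC's tooling: the `SearchRequest` class and the `missing_inputs` and `ready_to_search` helpers are hypothetical names introduced only to show that custodian mailboxes and a time span are mandatory, while keywords are optional.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class SearchRequest:
    """Hypothetical container for the inputs an enterprise search needs."""
    custodians: List[str] = field(default_factory=list)  # mailbox addresses
    start: Optional[date] = None                         # start of time span
    end: Optional[date] = None                           # end of time span
    keywords: List[str] = field(default_factory=list)    # optional


def missing_inputs(req: SearchRequest) -> List[str]:
    """Return the mandatory inputs that are still missing."""
    missing = []
    if not req.custodians:
        missing.append("custodian mailboxes")
    if req.start is None or req.end is None:
        missing.append("time span")
    return missing


def ready_to_search(req: SearchRequest) -> bool:
    """A search can proceed only when nothing mandatory is missing."""
    return not missing_inputs(req)
```

In this sketch a request with keywords but no custodians or dates reports both mandatory items as missing, which corresponds to the point where the analyst would go back to the subject matter experts.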
SRINATH TUTUKURI: Sure. Thank you. Let me go to the next step in the enterprise search process. So once we have the keywords, the custodian mailboxes, and the time span defined, we take this information and plug it into our primary search tool, which is the Microsoft Office 365 Compliance tool. I will be going over the capabilities of this tool in the future slides. But for now, we simply take these details, enter this information as filters, and put it into the Office 365 Compliance tool, which is a graphical user interface that pretty much hooks into the Microsoft Exchange Server and brings us out all the emails from the Exchange Server. Once this information is available to us, we have some next steps that we follow. But before I go to the next step, does anybody have any questions related to this aspect on Office 365 search so far?
EVENT PRODUCER (MICHELLE RIDLEY): No questions in chat, and there're no questions on the phone.
SRINATH TUTUKURI: Sure. Thank you. Generally, the next step that we follow is we try to eliminate clutter. However, we do not eliminate any clutter by default: if the requester has not specifically stated that they're not interested in subscriptions or newsletters, we hold back and we simply perform the search. But if the requester has explicitly given us instructions that they're not interested in subscriptions or newsletters, we try to eliminate those too. I do see one question regarding what is NUIX here? So NUIX is a forensics software, which is used for analyzing the data. I'll be going over it in the next section, specifically where I'll be talking about the different tools actually.
So once we perform the search and we get the necessary search results, and if we have to eliminate any clutter, we go ahead and eliminate any subscription-based emails based on the records that we see. And then we do rerun the search. After rerunning the search, I or one of our analysts goes there and samples the data to make sure that the results meet the expectations or are within the scope of the request. So we do capture some metrics regarding what kind of records have been captured for each keyword, or how many records are coming from a single mailbox or different custodians, and so forth. So we have all this information captured. And if the records are very few and we are certain that the search request is a simple request, and there's not much ambiguity in the results that we see, we go ahead with the next steps of exporting the data, preparing the data and findings, and presenting the data.
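The per-keyword and per-custodian metrics mentioned above could be tallied along these lines. This is only a sketch under assumptions: the record fields (`custodian`, `subject`, `body`) and the `search_metrics` function are hypothetical, and keyword matching is a plain case-insensitive substring test rather than whatever the actual search tool does.

```python
from collections import Counter


def search_metrics(records, keywords):
    """Count hits per keyword and records per custodian mailbox.

    `records` is a list of dicts with hypothetical keys
    'custodian', 'subject', and 'body'.
    """
    per_custodian = Counter(r["custodian"] for r in records)
    per_keyword = Counter()
    for r in records:
        text = (r["subject"] + " " + r["body"]).lower()
        for kw in keywords:
            # Case-insensitive substring match (an assumption of this sketch).
            if kw.lower() in text:
                per_keyword[kw] += 1
    return per_keyword, per_custodian
```

A keyword that matches a very large share of the records, or a single mailbox that contributes most of them, is the kind of signal that would prompt a closer look at scope.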
However, if we see that the results look ambiguous and we have a lot of data, we present it to the analysts with all the necessary insights so that they can make an informed decision, or probably contact the requester, to narrow down the scope if required and come down to a smaller number of records. And then we probably rerun the search to get some better results. As and when we feel that the records are good enough, we go ahead and export this data into either a PDF or DOC format, or a message or an email, and sometimes even into PST records. Occasionally, even [inaudible 00:16:22] graphical user interface has some issues where we can't really delve much deeper into each record to understand and see if the records are that good. We sometimes take this data and put it in Outlook to get a better insight and awareness of how the data looks. Once we feel that the data is good enough, we go ahead and prepare the data.
So the preparation of the data is where we talk about dedupe and containment. I have a specific section to talk about dedupe and containment in detail. But in a nutshell, what the dedupe and containment does is that it eliminates lots of duplicate data and it cuts down the volume of the results by around 20 to 40% based on what we have seen so far. It helps us to make the records more concise, and that way it saves us time as well as the requester's time when we are not presenting them duplicate information. And the last step is that we do present this data: we take this data and put it into the file share, and also into our case management tool, which is known as FOIAXpress. And from here on, I pass the data back to the analyst, and the analyst takes it over from here until he brings this case to a logical conclusion, finally closing out the case.
This is an overview or nutshell of how the whole enterprise search process is performed from a technical perspective at the CDC's FOIA Office. So does anyone have any questions regarding the process flow here?
EVENT PRODUCER (MICHELLE RIDLEY): No new questions in the chat so far.
SRINATH TUTUKURI: I do see one question.
EVENT PRODUCER (MICHELLE RIDLEY): I do see one. Yeah. [Inaudible 00:18:06], go ahead. I was going to say, you see the same question I do, right?
SRINATH TUTUKURI: Yes. Yes, I do see one question. Have you ever had an email fail to properly import into FOIAXpress? Yes, sometimes. Very rarely do we run into issues where some of these emails fail to load into FOIAXpress. In that instance, what we do is that we take the email message out and we try to reformat it into PDF manually. But, yes, we do run into such occasions once in a while, but it is not very prevalent. Any additional questions?
ALINA M. SEMO: That individual clarified. More specifically they come in as their native format, rather than the proper format. Maybe you could talk a little bit about the format that they're going to come into?
SRINATH TUTUKURI: Yeah. Most of the emails come in the proper format itself. We don't get any formats which are non-English or non-ASCII. So we never really ran into some of these issues. However, one issue we do run into occasionally is that when emails are encrypted, it prevents those emails from being converted into PDF documents. So what we do is that we have to sometimes take those encrypted emails and probably even figure out a way of either going back to the requester to get that email [inaudible 00:19:38] and decrypt that email. And probably put it back into FOIAXpress in a different format and resolve the issue. So hopefully that answers the question.
EVENT PRODUCER (MICHELLE RIDLEY): [inaudible 00:19:52]. And there are no questions on the phone.
SRINATH TUTUKURI: Yeah. I do see another question. Can you repeat the prepared data process? Yeah. Sure. Definitely. So what we do in the prepared data process is that once we have all the data exported from Office 365, which is usually either in individual messages or in a .pst file (think of it as a zip file with all the different messages), we take this data and put it into the case management software that we use, which is called FOIAXpress. And we run this data through a process in that software, which is known as dedupe and containment. When we do the dedupe and containment, what really happens is that all duplicate messages are eliminated. So to give you an example, let's say I'm running a search on five custodian email boxes, and I had sent an email to five people, all five of them CC'd. During the dedupe process, what happens is that it only picks one unique email, rather than the five emails. That way it saves us four records, which are omitted from the total set.
And when it talks about containment, what it really means is that, let's say, there is a conversation between myself and another user, and we have 15 different emails going back and forth. What the containment process really does is that it eliminates all the individual emails and gets the last email of the email chain. So what really happens is that it saves us from having to go through each individual email and it eliminates the emails which are contained within the final email. So typically what we have observed, as I said in the past, is that the dedupe and containment process reduces the record volumes by 20 to 40% while making sure that the scope of the search result is still intact. Hopefully, this answers your question, Edwin.
Yes. For dedupe and containment, we can use different software. Office 365 sometimes does some of the dedupe process. NUIX can dedupe. And Outlook can also sometimes eliminate some records. But we primarily use FOIAXpress for containment actually. So to answer your question, yes. We definitely run all the records through the FOIAXpress software for the containment, and it's extensively used. Every primary search goes through the containment. The exceptions would be when we have five or 10 records and there's no necessity to really do the containment process.
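The two ideas described above, dedupe and containment, can be sketched in a few lines of Python. This is not how FOIAXpress, NUIX, or Office 365 implement them; it is a simplified illustration that assumes each message is a dict with hypothetical `message_id`, `thread`, and `body` fields, and that a reply quotes the earlier message verbatim.

```python
def dedupe(messages):
    """Keep one copy of each message, identified by its message ID."""
    seen = set()
    unique = []
    for m in messages:
        if m["message_id"] not in seen:
            seen.add(m["message_id"])
            unique.append(m)
    return unique


def containment(messages):
    """Drop any message whose body is fully contained in a longer
    message from the same thread, so only the enclosing email remains."""
    kept = []
    for m in messages:
        contained = any(
            other is not m
            and other["thread"] == m["thread"]
            and len(other["body"]) > len(m["body"])
            and m["body"] in other["body"]
            for other in messages
        )
        if not contained:
            kept.append(m)
    return kept
```

With these assumptions, the same email found in five custodian mailboxes collapses to one record after `dedupe`, and a back-and-forth chain in which each reply quotes everything before it collapses to the final email after `containment`. Real email chains rarely quote earlier messages byte for byte, so production tools need more robust matching than a substring test.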
Let me go to the next slide. And I did go over some of the tools that we spoke about during the previous slide, but I can go over each of the different tools that we use here at the FOIA Office to make sure that we are able to get the best results out. So any search that is performed in the FOIA Office primarily first goes through the Microsoft 365 Compliance tool. So the way it runs is that, as I said, this tool is a graphical user interface. Which is a web based interface which has filters to perform searches. The different filter options that we have, it gives an ability to search on keywords, then the subject of the email, the recipients of this email, the participants of an email, as well as who the sender is, and the most important aspect is running the search on the custodians of this email, and finally a date range. Without the custodians and date range, we do not do any search because it's going to be a wild goose chase [inaudible 00:23:55] the search can take forever and they're not going to get any productive results actually.
As I said, the tool is very simple and it gives us insight. It's the first step for us to really get all the data. And based on our observation, if the scope is well defined and the record count is low, the records we get are much more precise. If we're confident at this level, we just go ahead and skip running these records in any other tools. However, if we do get lots of data and we feel that the search does not look that great, or the results can be ambiguous at times, and if the record count is less than 150 records, we simply take the records, all this data, in a PST file and quickly analyze it in Outlook. And that is a very quick way of looking at the records, for us to analyze if the data looks good or probably eliminate any records which are not needed.
We seldom go to the Outlook process. But if required, we do it, if the record volumes are low. And the next step is that we do not really go into FOIAXpress to do any searches. What we do is use this forensic software called NUIX. This software has higher capabilities than the Office 365 software as well as Outlook, and it can really provide much more insight into the records. It is capable of doing some containment and additional dedupe, where the Office 365 process fails when the record volumes are much higher. And it gives us an insight into the data and helps us understand if the record [inaudible 00:25:43]. So what we see is that once we have a big set of records, of around 10,000 records, we run it through NUIX. It cuts down the data volumes and it's much more precise.
And it gives us a lot of options to give insights, like it groups the data based on subsets, groupings based on topics. And it also gives us different domains and how many emails are coming in from each domain. It also gives us a better ability to cut down. Let's say the user feels that I'm only interested in all the emails sent from CDC, it helps us to narrow down those records. So it has different additional search filter options, which are not available in the Office 365. And we do use this tool on an as needed basis, but it is definitely a powerful forensics software where we can pretty much run and analyze a lot of data. Not just Outlook emails, but also even a lot of hard drive based data, and a lot of documents, and so forth. Any questions on this so far? On the tools specifically?
EVENT PRODUCER (MICHELLE RIDLEY): I do not see any questions in the phone queue. As a reminder, ladies and gentlemen, if you would like to get into the phone queue, pressing #2 on your telephone keypad will enter you into that queue.
ALINA M. SEMO: I see no new questions from the chat.
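The choice among tools described above can be summarized as a small decision rule. The sketch below is an interpretation, not a stated policy: the 150-record figure comes from the presentation, but the `next_step` function is hypothetical, and sending every larger ambiguous result set to NUIX is an assumption, since the presentation only mentions sets of around 10,000 records.

```python
def next_step(record_count, results_look_ambiguous, outlook_limit=150):
    """Suggest where a result set goes after the Office 365 search.

    outlook_limit reflects the 'less than 150 records' figure mentioned
    in the talk; routing everything larger to NUIX is an assumption.
    """
    if not results_look_ambiguous:
        # Well-scoped, confident results skip the extra tools.
        return "export"
    if record_count < outlook_limit:
        return "review in Outlook"
    return "analyze in NUIX"
```

For example, `next_step(120, True)` suggests a quick Outlook review, while `next_step(10000, True)` suggests NUIX.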
SRINATH TUTUKURI: Yeah. Thank you. I can move to the next slide. This pretty much conclude, okay, sorry. Let me talk about this dedupe and containment. I did speak about it a few minutes back, but I can present a slide, which I've captured from the FOIAXpress software, which gives us an insight into how the dedupe went actually. So in this instance, we had around 900 records, which were captured after running the search through Office 365 as well as the NUIX software. We knew that these 903 records could be condensed further, and as I said, the dedupe process typically brings down the records by around 30%. So when we did run these 903 records through the dedupe process in FOIAXpress, our interest is primarily on the green bar here. We want to see how many records it comes down to. So from 903 records, the records were condensed to 630 records here.
And the next thing is that it also gives us the number of records which were eliminated as part of the containment process. So in this instance, around 270 records were eliminated by the containment process. It was not able to eliminate any duplicates, the reason being that all those duplicates were already eliminated by Office 365 as well as NUIX. So we were able to condense the record volume from 903 to 630, which translates to a reduction of around 30 to 35% of records. Just to give you an overview, each record on average translates to around four pages. So to put it in perspective, 600 records roughly comes to around 2,500 pages of data, which needs to be analyzed by the analysts again and presented to the final requester. So the containment process helps us greatly actually in reducing this volume of records while keeping the scope intact, and saves time for the analyst as well as for the end requester. Any questions on this, on the dedupe and containment?
EVENT PRODUCER (MICHELLE RIDLEY): There are no questions on the phone line.
MARTHA WAGNER MURPHY: No chat questions either. Thank you.
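The figures quoted for this example can be checked with a short calculation. The record counts (903 and 630) and the roughly four pages per record come from the presentation; the function and variable names are illustrative only.

```python
def reduction_summary(before, after, pages_per_record=4):
    """Summarize how much a dedupe/containment pass reduced a record set."""
    eliminated = before - after
    percent = round(100 * eliminated / before, 1)
    return {
        "eliminated": eliminated,
        "percent_reduction": percent,
        "pages_remaining": after * pages_per_record,
        "pages_saved": eliminated * pages_per_record,
    }


summary = reduction_summary(903, 630)
# eliminated: 273 records, a 30.2% reduction
# pages_remaining: 2520, pages_saved: 1092 (at about four pages per record)
```

At about four pages per record, the pass leaves roughly 2,500 pages for the analyst to review and removes roughly 1,100 pages of repeated material.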
SRINATH TUTUKURI: Thank you. I can move to the next slide. Yeah. So, that pretty much concludes the first section of the overview or the bird's-eye view of the enterprise search process set. [inaudible 00:30:01] to the different aspects that I did cover were the process flow, then the different tools that we use, and also the dedupe process. Let me move to the next section, which is how the technical team categorizes the enterprise search process. I would like to make it clear that we also have another categorization on the administrative aspect of enterprise searches. I will not be going into that aspect. Here the categorization is primarily limited to the complexity that is involved from a technical aspect when we try to get search results. So I've categorized the searches into three different categories. One is the low-intensity, the next is the moderate-intensity, and the last is a high-intensity search.
I would like to go to the next slide, where I'll be talking about low-intensity search. So when I say low-intensity search, what it really means is that the search is very simple to perform, and we are absolutely certain that we are getting the right results, and we can very quickly get this search done and close out the request in a timely fashion without any issues or without having to go back and forth with the requester. And as the picture on the slide here depicts, keep it simple. Generally, when the requesters strive to keep it simple, we know that the search is a very low-intensity search. So what is really a low-intensity search? I have placed a few attributes which primarily define what a low-intensity search is for us. We have the custodian mailboxes, which are defined. So when I say custodian mailboxes are defined, we know from the request that we have to run a search on a few custodian mailboxes; it could be a director or an assistant director or probably the head of a division and things like that. So it's very clear on whom we are running the search.
The next is that we also have a very short time span. It's a very important thing with the time span actually, because the shorter the time span, the more accurate our results are and the more in line and in sync with the scope of the request. So if it is a two-week search or a three-week search or a few days around an event, the search results are very precise. The next is the number of participants. So if we know who the participants are, let's say we have very limited participants, like a discussion between a few individuals, four individuals, five individuals, then it really helps us to narrow down the search results and that makes the search results more accurate actually. And the last is not having any ambiguous keywords. When I say ambiguous keywords, we don't expect a keyword like run a search on COVID or run a search on autism or run a search on AIDS or such. Such searches could be very vague and we could get tons of records actually. That's what I mean by ambiguous keywords. Any questions on this so far?
MARTHA WAGNER MURPHY: We have one question on the chat. Do you use Nuix as your primary method to dedupe your records and/ or do you think this is a more efficient way to dedupe compared to the ADR/ EDR tools within FOIAXpress?
SRINATH TUTUKURI: We do not use dedupe to... We do not use Nuix to primarily dedupe the records. We do use Nuix for deduping, but the first step of dedupe always happens at Office 365, and if anything is missed out during the dedupe process at Office 365, it is captured in Nuix, and by and large Nuix does a good job with dedupe. So we do not see any dedupe happening when we do the containment and the dedupe process in FOIAXpress actually. But FOIAXpress does an excellent job with the containment process. And probably Nuix also has an ability to do the containment process, but we haven't figured that out. As I said, we have only started using it in the last three to four months. Does that -
MARTHA WAGNER MURPHY: Thank you. That's all.
SRINATH TUTUKURI: answer your question? Okay, sure. If there are no questions related to the low-intensity search, I can show an example of what a low-intensity search is, so that it gives them understanding of what I really mean by low-intensity search. The next slide, please. Here is an example of a low-intensity search and I'll give like 15 seconds for you to go ahead and read the content of the low-intensity search.
Yeah. All right. So the requester here has requested only email communication between the CDC's director and the Office of the Vice President, Mike Pence, between September 10th and October 1st. So the time span here is very short, 20 days. We know who the custodian is here. It is the director of the CDC, and we also know who the participants are here. So the participants are two parties. One is the director of CDC, and the other could be anybody from the Office of the Vice President. It could be a secretary or anybody sending those emails to us. So we have our mailbox defined, we have our dates defined, and there is no necessity to do the keyword search here. And all we do is make sure that the participants are anyone from the email domain of the Vice President. So in this instance, all the participants would have a domain address of ovp.eop.gov. So that would be a partial string or a suffix within the email address of any email coming in from the Office of the Vice President.
So that pretty much will give us very concise, accurate results. From here, if the record count is very small, we just take those records and we run them through the dedupe and containment process within FOIAXpress, and we are able to get the results in a very short time span and the analyst is able to close out the request. So what I mean to say is that it's a very simple search for us, because the scope is very clear and there is no ambiguity, and it makes it very easy for us to get searches done. So if our requester communities can provide us searches which are very specific and very low intensity, it helps us in the long run. Any questions on this example?
EVENT PRODUCER (MICHELLE RIDLEY): I do not see any questions in the phone queue.
MARTHA WAGNER MURPHY: Nothing on the chat. Thank you.
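The low-intensity example above boils down to two filters: a date range and a participant email domain. The sketch below illustrates that idea only. The message fields (`sent`, `participants`) and function names are hypothetical, and the domain used in the usage lines is a placeholder rather than a real government domain.

```python
from datetime import date


def address_in_domain(address, domain):
    """True if the address belongs to the domain or one of its subdomains."""
    host = address.rsplit("@", 1)[-1].lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def in_scope(message, start, end, domain):
    """A message is in scope if it was sent within the time span and
    at least one participant comes from the given domain."""
    return start <= message["sent"] <= end and any(
        address_in_domain(a, domain) for a in message["participants"]
    )


# Placeholder domain and dates for illustration only.
msg = {"sent": date(2020, 9, 15), "participants": ["aide@vp-office.example"]}
hit = in_scope(msg, date(2020, 9, 10), date(2020, 10, 1), "vp-office.example")
```

Matching on the full domain (or a dot-separated subdomain) rather than a bare text suffix avoids accidentally picking up look-alike domains that merely end with the same characters.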
SRINATH TUTUKURI: Thank you. I can move to the next slide. The next slide is about moderate-intensity search. When I say moderate-intensity search, what I really mean is that sometimes the custodian mailboxes can be defined or may not be defined, but we have a way of figuring out who the custodians here are. The participants may be known sometimes or they may not be known, but typically in a moderate-intensity search, we could have a larger number of participants too, and we may also have to run searches on group mailboxes, like event-based mailboxes or responsibility mailboxes, and so forth.
The search is not specific to a particular keyword. It could be a phrase search, or it could be a combination of keywords that need to be searched on, and generally the date range or time span is much longer. So when I say a moderate-intensity search, what it really means is that the record count is much higher. Typically, it is between 100 and 1,000 records. We do know that we can definitely get the records here, but it involves some work on our end before we can really pinpoint and nail the accurate results, which are relevant for the scope of the request. Any questions on this moderate-intensity search?
EVENT PRODUCER (MICHELLE RIDLEY): There are no questions on the line.
MARTHA WAGNER MURPHY: Nope.
SRINATH TUTUKURI: Thank you. I will go over an example of a moderate-intensity search. Probably I'll take a few more minutes to really explain that example in much more detail. Next slide please. Thank you. I'll give 15 seconds for you to go ahead and read the content of the search. All right, let me get started with this request. There's this request related to a news reporter who was interested in finding out the investigation that CDC had performed related to an incident where a few people were infected with COVID when traveling on a bus from Milwaukee all the way till Texas, and apparently, unfortunately, an individual passed away too. So the event details, if you look at the email, it says it's around October 13th, 2020, when the event happened. So in this instance, we do not have any custodian mailboxes to search on, nor do we have a timeframe on where to perform the search.
However, the only thing that we have from the request is, we are able to pick up the different keywords. One is COVID-19, then it's a commercial bus. It has reference to a particular company, the bus company El Tornado, and a place, which is Seneca Foods, and some of the different stops like Laredo, Chicago, Wisconsin and so on. So we do have sufficient keywords to probably even start off with the search here. So what happens here is that our analyst contacts the relevant subject matter experts, and was able to identify the people who really performed this investigation. So we do get the custodians from them and they also provided us a timeframe on when to perform this search. So we had the custodians now, and we also had the timeframe. Any questions so far on this?
EVENT PRODUCER (MICHELLE RIDLEY): No. No questions on the line.
MARTHA WAGNER MURPHY: No questions. Yeah. Thank you.
SRINATH TUTUKURI: Sure. Thank you. Please, to the next slide. Yeah. We did take the keywords, the custodian mailboxes and the date range and we did perform a search on the Office 365 tool on the Exchange Server. The way we did the search was that we had to come up with a concatenated phrase or come up with a set of keywords where we had to use a Boolean search of AND or OR, probably to draw some results and do some analysis on it. So what we really did was that we said, let's do a search based on bus and any of these keywords, Milwaukee or different places here, San Antonio, Dallas, and so forth, or it's a motor coach and any of the different places here within this particular date range. So based on the search that we ran, we were able to get around I think it is, I don't remember the exact figure, but it is a few hundred records. I believe we had probably around three or four custodians on whose mailboxes we had to perform this search.
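The Boolean combination described here, a vehicle term AND any one of several place names, can be assembled programmatically. The sketch below only builds a query string in a generic AND/OR style with quoted phrases; the helper names are hypothetical, the list of places is drawn from the stops mentioned in the discussion, and the exact syntax the Office 365 eDiscovery tool expects is an assumption that should be checked against its documentation.

```python
def or_group(terms):
    """Join terms with OR inside parentheses, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(vehicle_terms, place_terms):
    """Require at least one vehicle term AND at least one place term."""
    return f"{or_group(vehicle_terms)} AND {or_group(place_terms)}"


query = build_query(
    ["bus", "motor coach"],
    ["Milwaukee", "San Antonio", "Dallas", "Laredo"],
)
print(query)
# (bus OR "motor coach") AND (Milwaukee OR "San Antonio" OR Dallas OR Laredo)
```

Grouping each OR list in parentheses before joining with AND avoids depending on operator precedence, which is an easy source of unexpectedly large result sets.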
We came up with around four or 500 records, and we were quickly able to ascertain that, yes, these records look in sync with what the requester is looking for. I usually do or we do some sampling of a few records here and there to see if the records look relevant and the keywords look relevant. In this instance, we were able to identify that within the subject of each email which says COVID-19 bus contact or land conveyance bus investigation, those things which are highlighted in yellow show those particular keywords they captured. We also did see some emails, which were like newsletters, then news articles, reports and things like that which were... And news articles, which were not really relevant to the investigation. So we had to eliminate those.
We call them clutter because these are noise emails, which are not really relevant to the investigation, so we had to eliminate newsletters as well as subscription emails and things like that. We were able to cut down some of those [inaudible 00:44:03], but still we had a large number of records and we did know that Office 365 can sometimes be a little haywire here, where it's not really accurate. Because Office 365 generally is not going to be very accurate when we have multiple keywords and we have a combination of phrases and multiple custodians. So, that is when we really need to go to the next level. In this instance, after eliminating most of these noise-based emails, we took those records and we put them into Nuix, and Nuix did a much better job of eliminating some records which really didn't make sense, because when we were doing phrase-based searches and combinations of keywords, we did come up with records which were not relevant. So it cut down some of those records actually. So this is what a moderate-intensity search is.
Nuix also gave us a lot of insights as well as topics, it gave us groupings and topics. But in general, we were confident that the records that were coming out of Nuix and what we were seeing was in line with what the requester was looking for, but we definitely needed much more effort. It was not a simple search and we needed to make sure that these records are relevant. We just went back to the requester again, based on the insight that the analysts have provided, and once we got the necessary approvals, we moved forward with the dedupe and containment process, which again reduced around 20 to 30% of records. So this is what really a moderate-intensity search looks like. When I say moderate-intensity search, the characteristics are that it has lots of mailboxes, the record volumes run into hundreds, and we definitely need to run this process through multiple software tools, and we definitely do need some analysis, and it is more than likely that we have to go back and forth with the requester before we can finalize the records.
And any questions? I do see a few questions on the chat. So let me try to go over the first question.
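To make the dedupe and containment step concrete, here is a minimal sketch. It assumes that "dedupe" means dropping exact copies of a message found in more than one mailbox, and that "containment" means dropping an earlier email whose full text is already quoted inside a later email in the same thread. Those definitions, the field names, and the sample messages are assumptions for illustration; how FOIAXpress actually implements either step is not described in this discussion.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def dedupe(messages):
    """Keep the first copy of each message with the same sender, subject, and body."""
    seen, unique = set(), []
    for m in messages:
        raw = "\x1f".join(normalize(m[k]) for k in ("sender", "subject", "body"))
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique


def drop_contained(messages):
    """Drop a message when its whole body appears inside a longer message's body."""
    bodies = [normalize(m["body"]) for m in messages]
    kept = []
    for i, m in enumerate(messages):
        contained = any(
            i != j and len(bodies[j]) > len(bodies[i]) and bodies[i] in bodies[j]
            for j in range(len(messages))
        )
        if not contained:
            kept.append(m)
    return kept


# Hypothetical sample: the same email pulled from two mailboxes, a reply, a newsletter.
original = {"sender": "a@agency.example", "subject": "Bus investigation",
            "body": "Bus left Milwaukee at 9."}
messages = [
    original,
    dict(original),
    {"sender": "b@agency.example", "subject": "RE: Bus investigation",
     "body": "Thanks.\n> Bus left Milwaukee at 9."},
    {"sender": "news@list.example", "subject": "Weekly update", "body": "Weekly newsletter"},
]
unique = dedupe(messages)
final = drop_contained(unique)
print(len(messages), len(unique), len(final))
```

The pairwise containment check is quadratic in the number of messages, which is acceptable for a few hundred records but would need indexing or thread grouping at the volumes of a high-intensity search.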
MARTHA WAGNER MURPHY: So the first question was, are you referring to eDiscovery when you're using search tool in Office 365, is it an eDiscovery tool?
SRINATH TUTUKURI: Are you referring to eDiscovery when you're using this search tool in Office 365? Yes, it is the same thing. We do use the eDiscovery tool. That's right.
MARTHA WAGNER MURPHY: Okay, great. And the next one, I'm not sure if you can answer or if this is going to be one for Roger or Bruno. Could you speak briefly on how subject matter experts and potential custodians are determined before creating a search query? Thank you.
ROGER ANDOH: This is Roger. I can do that question. So depending upon what the scope of the request is, CDC has set up an emergency operation center to handle the coronavirus pandemic, and they have teams that are set up to address specific aspects of the pandemic. So you have, for example, folks who deal with the vaccine, other folks who deal with the [inaudible 00:47:14] order, various groups. So depending upon what the request is about, and if it's COVID related, we would send the request where we have... In a situation where we have no custodians provided, because probably the requester doesn't even know who the custodians are, we send it to emergency operations and then say, "Please give us the names of the folks who were involved with this topic that this requester is interested in." And so, they would then identify either custodians or a particular mailbox that is being used by a team that would reasonably contain the records requested. Does that answer the question?
MARTHA WAGNER MURPHY: I don't see anything else in the chat. So I think we'll say yes, unless we hear something more. Thank you. Oh, I'm sorry. There was a follow-up.
ROGER ANDOH: Sure.
SRINATH TUTUKURI: This says thank you. And for non-COVID topics, is it the same process?
ROGER ANDOH: For non-COVID topics, it is the same process, but the process would be, we would identify the program office within CDC that is likely to have the responsive records and send it to them and say, "We've received this request. We need you to provide the documents." And if they want us to conduct the search, they would provide the names of the custodians whose email boxes we would search against.
MARTHA WAGNER MURPHY: Thank you.
ROGER ANDOH: And if for example, if they... And sometimes they might come back and say... I'll give an example. I'll give an example of where the requester doesn't identify mailboxes, but he identifies the whole group. So a requester comes and says, "I want you to search against, search for all employees within NCRD's emails for this keyword." Well, like Srinath has said earlier, we count performance tests against email boxes for justice, but the [inaudible 00:49:20] program office, right? So literally you're asking us to search against custodians for an entire program office or a division. We're going to come back to you and say, "You're going to have to limit it. So if you cannot limit it by name, you're going to have to limit it by a topic enough that they'll be able to identify who are the folks who worked on this particular subject matter and then they would provide a list of custodians."
MARTHA WAGNER MURPHY: Thanks, Roger. No other questions.
ROGER ANDOH: Sure.
SRINATH TUTUKURI: Yeah. Thank you. If there are no questions, I can move to the next slide. Okay. I am going to talk about the high-intensity search here and if you look at the pictures there, the person there is me who is bald and lost his hair because of the type of request I got. I am just kidding. So typically what happens in a high-intensity search is that, we do not... The biggest characteristic of a high-intensity search is that the scope of the request is very vague and we run into probably thousands of records, sometimes 10,000, sometimes 20, sometimes 30 and I've seen searches going to 80,000 records.
Why do they get such kind of records, actually? If you look at the characteristics here, we don't have a clear custodian mailbox defined sometimes. Sometimes we have too many custodian mailboxes defined. Okay? And the next is, sometimes we do not know the participants in this email conversation. So when we do not know the participants, it's possible that there could be a discussion with so many people on this particular topic, and it could go to any extent where it becomes very difficult to identify which emails are really relevant to the scope of this request, actually. And sometimes, we also do not get any keywords. So we have to frame our own keywords and we have to come up with keywords based on the request and most of the times when this high-intensity search starts, it's an unknown unknown for us.
But looking at the requests, we can say that this is probably a high-intensity search, because once we run the search through Office 365 and we see all those different volumes of records, then we figure out that, yes, this is going to be a wild goose chase where we are not going to get too many records. And what really makes it more complex is, sometimes we have requests where we have to use Boolean searches like OR and AND to concatenate the search results and the records can be very haywire. Finally, even having too many attachments in the emails can also complicate. Sometimes we see slides, presentations which have different words, which absolutely have no relation to the scope of the request. So that is what a high-intensity search really is, that I'm talking about here right now. Any questions related to the high-intensity search?
EVENT PRODUCER (MICHELLE RIDLEY): There are no questions on the phone.
ROGER ANDOH: This is Roger. I just wanted to-
MARTHA WAGNER MURPHY: There are no [crosstalk 00:52:49]
ROGER ANDOH: I just wanted to add one. At least from my experience with these high-intensity searches, I think Srinath was pretty generous when he said that a record's average size is four pages. A record size could be, it pulls out... It could be one page or it could be as much as five or six pages. So we're talking about an email string and that is just the email string itself without including the attachments. So if it has three or four attachments and each attachment is on average four pages, you see how that extrapolates into being a lot of records, just on its face. It's not that when we say we located 5,000 records, that doesn't translate into pages. It could be 5,000 pages, 5,000 records could be 25,000 pages.
It all depends upon what's the average size of... What the size of the record is and how many attachments it contains. What we've had with some requesters is that would say, at least we've agreed with them is, remove the attachments, right? So they just want the raw emails and then they would come back. We're going to negotiate and say, you can always come back and ask for specific number of attachments [inaudible 00:54:13] and that could also help with us being able to process your request much more timely. If we don't have to process the entire records that's included in the attachments and everything else. Srinath, it's over.
SRINATH TUTUKURI: Yeah. Thank you very much, Roger, for reminding me of that issue. I forgot about it. Thank you. If there are no questions, we can go to the next slide. I'll pause for 15 seconds so everyone can read this request. It is an example of a high-intensity search here.
All right. So we had a requester who was interested in all responsive records related to procedures, guidelines and discussions that happened around coming up with a guidance on wearing face mask to slow down the speed of COVID-19. In this instance, there were no specific keywords given to us and we had to come up with a set of keywords. The next is that we had to identify who are the custodian mailboxes. As Roger has already mentioned, we get that information from the SMEs who give us the guidance on what those custodian mailboxes are. And the next is the date range, which is also coming either from requester or it's going to be given to us by the SME.
So in this instance, the keywords that were identified were face mask, face coverings, respirators and N95. See, these are the four different keywords that we were given. So when we do a search here, what it probably means is that I have to locate records which have either face mask or masks, face covering or face coverings, respirator or respirators or even respiratory, anything. So we do prefix searches as well as suffix searches and finally the N95 mask. So we had to concatenate a string to come up with the keyword searches. So I can [crosstalk 00:56:43]. Yes. Any question so far?
MARTHA WAGNER MURPHY: No.
EVENT PRODUCER (MICHELLE RIDLEY): There are no questions on the line.
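The keyword expansion just described (mask or masks, face covering or face coverings, respirator, respirators or respiratory, and N95) amounts to prefix-style matching. A rough way to see which variants a given message would hit is a regular expression like the one below. This is an illustrative sketch only: the pattern and function name are assumptions, and search tools usually express the same idea with their own wildcard syntax rather than regular expressions.

```python
import re

# Word-boundary anchored so that "unmasked" does not count as a hit for "mask".
PATTERN = re.compile(
    r"\b(?:face\s*masks?|masks?|face\s*coverings?|respirat\w*|n-?95)\b",
    re.IGNORECASE,
)


def matched_terms(text):
    """Return the distinct keyword variants found in the text, lowercased and sorted."""
    return sorted({m.group(0).lower() for m in PATTERN.finditer(text)})


print(matched_terms("Guidance on face coverings and N95 respirators"))
print(matched_terms("Respiratory illness update"))
```

The `respirat\w*` branch shows why prefix searches inflate counts: it picks up "respiratory" in messages that have nothing to do with respirators.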
SRINATH TUTUKURI: I can move to the next slide, which shows the search results, actually. Based on the search that was performed here, we had come up with 22,877 records. So I'm only talking about unique emails, actually. It's not the number of pages. And these were the insights that we got, based on the preliminary search that we did in Office 365, and for mask, it came up with 12,000 records, for respirator it was 12,000, face and cover was 6,000. So the total records were around 22,000 records here. Just keep in mind that these emails were only emails sent by these core users, which were like high-level officials within CDC. You only had four high-level officials and we are not even talking about emails which were sent to them. So the volume of records here was very high. It was 22,000 records. And looking at these results, I do know that probably since it is four mailboxes and we do containment, I could probably eliminate 40% of the records here, and the search analytics I'm providing here is only primarily at the Office 365 level, which probably is around 80% accurate at this point of time because I see so many records. So probably if I were to run these records through the Nuix software to do containment, I would probably come down to less than 10,000, not more than that. I cannot determine the exact count, but it's going to be less than that. But still, that's a huge number of records. So 10,000 and even five pages for each record could translate to 50,000 pages. So I don't think it is humanly possible for any of our analysts or even the requester to go through the 50,000 pages of data and digest that information and comprehend that information and come up with some reasonable analysis.
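The arithmetic behind these volume estimates can be written out explicitly. The sketch below uses only the figures quoted in the discussion (22,877 records, roughly 40% eliminated, 10,000 records at five pages each, and 5,000 records at five pages each); the function name and the attachment example at the end are illustrative assumptions.

```python
def estimate_pages(records, pages_per_email, attachments_per_email=0, pages_per_attachment=0):
    """Pages = records x (email pages + attachments x pages per attachment)."""
    return records * (pages_per_email + attachments_per_email * pages_per_attachment)


initial_records = 22_877
after_40_percent_cut = initial_records * 60 // 100   # about 40% eliminated

print(after_40_percent_cut)                 # 13726
print(estimate_pages(10_000, 5))            # 50000
print(estimate_pages(5_000, 5))             # 25000
# Illustrative only: three attachments of four pages each per email.
print(estimate_pages(5_000, 5, 3, 4))       # 85000
```

Attachments dominate the total quickly, which is why agreeing to exclude them up front can make a large request workable.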
So at this point of time, I make a determination letting the analysts know that this is going to be ... The scope is too broad. We certainly need to narrow down the request, and these are the insights that I see and these are the keywords that I see. So if the requester wants to make a determination on how the insights look, I go ahead and share all the information with him. So the requester goes back to the analyst and tries to narrow down the scope in this instance. However, in some instances, let's say we come up with a few thousand, like 7,000, 8,000 records and there are lots of mailboxes. There's a [inaudible 00:59:47] mailbox at 15 or 16. Then I do know that there's a potential for a lot of duplicates. So in that instance, we do go through the dedupe process and probably even run it through Nuix, and if it is less than 10,000 records, then probably we give it a shot and we try to go through the ultimate steps to prepare the data, actually.
Any questions on this so far?
ALINA M. SEMO: So I have one technical question. Someone asked, if applicable, have you used Nuix, N-U-I-X, in high-intensity searches?
SRINATH TUTUKURI: Sure. So the way we use Nuix in high-intensity searches is that after the records have been filtered or we get the first set of data from the Office 365 search, which is the eDiscovery search, we take the whole data as a PST file, or even individually if it's less than 10,000. [inaudible 01:00:48] at 10,000, we don't take individually [inaudible 01:00:50]. We take the whole PST file, and we take that data and put it into Nuix, and we run the same search terms that we ran in Office 365. Nuix does a better job with element, and it has a mechanism to eliminate some records which are probably missed in Office 365, and it brings down the number of records that come up. That is the first thing, and sometimes it can also eliminate some duplicates. So we definitely see a reduction; it just depends on the number of custodian mailboxes and the number of keywords and so forth. So it's [inaudible 01:01:24] to say how much [inaudible 01:01:26] can eliminate which was not eliminated by Office 365.
The next step is Nuix has much higher analytical capabilities, where it uses analysis based on which mailbox has a lot of emails being sent or which domain or which organization is sending all these emails and which email address is sending those emails, which are in the CC, which are in the BCC, which are in the To. It also provides us insights around which date or which timeframe we see a lot of emails going out. I would call them heat maps, heat map kind of things, so that analysis is also there, and it also groups the records. It gives us subsets of data saying that, okay, if you're trying to [inaudible 01:02:13] mask, face, and face mask, for these three keywords, we see it on 500 records. But it has its own way of analyzing groups. They give us different subsets of groups, actually. So it gives us all the analytics, provides us sufficient analytical information for the requester and analysts to make an informed decision to narrow down the scope, is all I can tell you.
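The kinds of summaries described here (which domains send the most email, which time periods are busiest, how many records each keyword hits) can be approximated with simple counting. The sketch below is not Nuix's interface; it is a generic Python illustration with hypothetical field names and made-up sample messages, meant only to show what such breakdowns look like.

```python
from collections import Counter
from datetime import date

# Hypothetical sample messages.
messages = [
    {"sender": "a@agency.example", "sent": date(2020, 4, 3), "subject": "Face mask guidance draft"},
    {"sender": "b@agency.example", "sent": date(2020, 4, 10), "subject": "Respirator supply"},
    {"sender": "c@partner.example", "sent": date(2020, 5, 1), "subject": "RE: face mask guidance draft"},
]


def sender_domains(msgs):
    """Count messages per sending domain."""
    return Counter(m["sender"].rsplit("@", 1)[-1].lower() for m in msgs)


def by_month(msgs):
    """Count messages per calendar month, the raw data behind a heat map."""
    return Counter(m["sent"].strftime("%Y-%m") for m in msgs)


def keyword_hits(msgs, keywords):
    """Count how many subjects contain each keyword (case-insensitive)."""
    return {k: sum(k.lower() in m["subject"].lower() for m in msgs) for k in keywords}


print(sender_domains(messages).most_common())
print(sorted(by_month(messages).items()))
print(keyword_hits(messages, ["mask", "respirator"]))
```

Sharing counts like these with a requester gives them something concrete to narrow on, such as a single month or a single sending organization.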
So that is how we primarily use Nuix in high-intensity search.
ALINA M. SEMO: Thank you. I have another question that might be better for Roger or Bruno. Someone on YouTube chat asked, "Do they need to request all related attachments if they want attachments?" So I think that means would you assume attachments unless you hear otherwise, or do people actually have to specifically ask for attachments if that's what they want?
ROGER ANDOH: Great question. Unless you say you don't want attachments, then your records ... the search would include attachments.
ALINA M. SEMO: So the default is yes.
ROGER ANDOH: Default is yes, yes. Default is yes unless you say no.
ALINA M. SEMO: Great, and then the second question came in. It has to deal with records retention. "How far back do archives go for file search? Do they follow records retention schedules and get destroyed on a schedule like paper files ordinarily would be? And do searches recover files that have been deleted by individuals: no longer needed, not required to be retained?"
ROGER ANDOH: Long question. I'm not a records' retention expert, but this is what I would say, is when we receive a request for documents, primarily emails, which we have, where someone says, "I'm looking for all email correspondence from X starting from 2005 or from 2000," if we can search against the custodian's mailbox, we start from that timeframe, from 2000 or 2002. If those records are still contained within their mailbox, it's going to be pulled. If it's not there, they won't be able to pull it.
So can we pull data that has been deleted from a person's mailbox? I don't, and [inaudible 01:04:31] can correct me, I don't believe that the 365 eDiscovery tool can do that. If the emails are for somebody who is in a Capstone program, which is a few folks whose emails are basically archived forever, and I don't mean that literally, but pretty much forever, then we can search against any date range. So for example, [Redfield's 01:04:55] emails are archived, even though he's gone. So 10 years from now, if someone makes a FOIA request for Redfield's COVID-related documents, we're going to find it because everything in his mailbox was captured.
ALINA M. SEMO: Okay.
MARTHA WAGNER MURPHY: Thank you, Roger.
ALINA M. SEMO: I'll just speak for NARA and records management. Electronic records are scheduled like paper files, generally. So yes, that is a true statement. Thank you.
SRINATH TUTUKURI: Yeah, thank you, and thank you, Roger. Yes, I just want to add one statement to this, too, which is that there's a record retention policy within CDC, and each mailbox is treated differently. So our records liaison at the records retention agency within the division within CDC can give us clear directions on how many months or how many years the particular mailbox can be retained, actually. So based on that, we can quickly at least let the requester know that the mail in this mailbox is not going to be found, or, as Roger established, for a Capstone official the records retention period is much longer, actually.
ROGER ANDOH: This is Roger again. I do want to do something just on the Nuix tool because someone had a question about that. Nuix, what it does that the 365 eDiscovery tool doesn't do is that it's able to better analyze the data. And so by being able to properly analyze the data, it helps us actually find a needle in a haystack. That is what Nuix is supposed to do. So we don't use it for, let's say, for a de-duplication because the EGR feature could do that in [inaudible 01:06:45]. It's more to analyze the data. So for example, what [inaudible 01:06:48] was talking about, heat maps, where is most of email traffic coming from? It categorizes the records.
We've had Nuix for quite a while, but CDC has basically ... We have definitely increased our usage of eDiscovery tools since COVID, and so we're still a work in progress, and we continue to utilize the functionality of the system, but it certainly is a much, much more robust system for analyzing data than the eDiscovery tool is or ADR is in helping us locate records.
ALINA M. SEMO: Thank you, Roger. Martha. We have [crosstalk 01:07:32]- Martha you have a question on Twitter, right?
MARTHA WAGNER MURPHY: Correct. We've been monitoring Twitter, and so we do have one question. "What does CDC have available or will make available to help requesters better understand who the custodians would be for a particular email? Are there org charts or directories?" They said, "It seems like CDC is placing the burden on the requester to know this."
ROGER ANDOH: To the extent that we placed ... My position is that in some situations the FOIA requests ... in a lot of situations the FOIA requester may not know who the custodians are, or sometimes they do. So to the extent that a FOIA requester doesn't know, would not know who the custodians are, I tell my team we should not go back to them and ask them for names of custodians because they wouldn't know. For example, if somebody makes a FOIA request and says, "I want any correspondence sent by the chief of staff for Governor Cuomo, this is the person's name, to anybody in CDC," well, they don't have to know who the recipient in CDC is. They've given you the name of the person who sent some email, right? And so then we can go to EOC and say, "Hey, did anybody have any contact with the chief of staff for Cuomo?"
So, yes, in some circumstances, the EOC ... I would say the problem at EOC is that the EOC is made up of employees who are detailed for a period of time and they leave. So it's a revolving door. So there's not ... The people who I see today may not be there 60 days from now. So it continuously changes. So there's not a list of folks who are there for the entire duration of the pandemic. They're not. They go on detail for 30 to 60 days, and they go back to their program office. Very few of them stay on for much longer periods. So that is part of the give and take. And so to the extent, and I'm sure this happened, and I would own that and apologize for that, but to the extent that we are placing a burden on you to look for custodians, we might be doing so, say, in a situation where (a) we've identified that you would know who the custodians are because of what you said, and in a situation where you don't know who the custodians are, then if you properly describe the topic matter, then it makes it easier for us to identify the custodians.
I mean, for example, if you say, "I want all correspondence about communications between CDC and CVP with regard to some particular topic," right, if the topic is scoped enough, we will be able to identify the folks within CDC who had any discussion, but not everyone, but at least the heavy-hitters who were involved in the discussion.
With regard to whether there's going to be an org chart, again, I'm not sure an org chart necessarily will be helpful unless you're talking about the heads of the units who don't change, but even then they change. I mean, I think the manager of the EOC has not ... We've gone through at least three, I think, to-date. So they change. So I think what is important is be very clear about what it is you're asking for. You don't have to give us clarity on the custodians, but be clear on what it is that you're looking for, and then we can take it from there. And to the extent that even they're not clear who the custodians are, we'll come back to you and ask you to refine your ask so that we can identify who was having a discussion about what you're asking for.
SRINATH TUTUKURI: Yeah. Thank you, Roger, and it was a good reminder to everyone of the title, identifying the [inaudible 01:11:30] stack, which we forgot actually. I do see one question here, which is how do we eliminate duplicates? It's already been covered in the discussion. We can use any of the tools, like Office 365 or Nuix or even FOIAXpress, to eliminate duplicates. However, for containment, we only can use ... Right now, our capabilities are limited to using FOIAXpress for containment.
Thank you. Any questions on this slide so far? [inaudible 01:12:04].
ALINA M. SEMO: There are no questions on the line.
MARTHA WAGNER MURPHY: That covers the chat for now. Thank you.
SRINATH TUTUKURI: So before I move to the next slide, what I would say is that for any high-intensity searches [inaudible 01:12:17] with lots of complexities, a decision has to be made whether we move forward with the request or we hold back and send it back to the requester, so that decision is made based on the number of custodians and the type of emails that we'll see, if the keywords are very generic and so forth. So sometimes there is some discretion when we have to go back to the requester to let them know that we cannot perform this search.
I can move to the next slide. Yeah, so I have covered the different categorizations of the searches based on the technical complexities that we have seen so far. The next topic, or next section, is the issues that we see when we perform searches here. So we have made an attempt to identify the problem and help the end user to know the problems that this creates to see if we can find some solutions and come up with better search results. So the first issue that was really identified was broad scope, the second was high record count, and the third is average data quality. The three of these are pretty much related, and I can quickly go to the next slide where I'll be talking about the broad scope of research.
I think by now most of you are pretty much aware of what the broad scope really means. The characteristics of a broad scope are too many keywords, then having very generic keywords, like just say searching on autism or searching on SARS or searching on COVID, or having too many mailboxes, and the date range is very large. Sometimes we get requests where the date range is for a few years or a few months and the results are very, very ... too many results where it becomes really hard for us to identify the search, actually. So just to put it in perspective, if you look at the picture there, the scope, if there's any day, and we have so many umbrellas there, but in reality, we only need one umbrella to identify the request here. In this instance, the yellow umbrella is good enough for us to identify the records and narrow down the scope, actually.
Any questions related to this topic of broad scope?
ALINA M. SEMO: There are currently no questions on the line.
MARTHA WAGNER MURPHY: Nothing new. Thank you.
SRINATH TUTUKURI: Yeah, next slide. As we have already discussed in the high [inaudible 01:14:51] high-intensity search, we see very high data volumes, and it really becomes very difficult for us to identify which is the right data or which is the wrong data unless the requester is really specific about what he's looking for, and sometimes some requesters are very good at telling us what they're really looking for, but sometimes some requesters come up with some ... I cannot get into the requester's mind to ... I probably have to read his mind to understand what he really is looking for or what she's really looking for. That is what makes it complex, so that when such a situation arises, it so happens that we get so much data, and I cannot, or we cannot, know which is the real data in this.
So just to give you a perspective, look at the picture. We have so many records there, and we do not know which is the right data in the picture, right data there.
Any questions?
ALINA M. SEMO: There are no questions on the line.
MARTHA WAGNER MURPHY: No.
SRINATH TUTUKURI: Yeah, I can move to the next slide. I think this is a really interesting topic here. So I'm using this term called average data quality. So when we do a search based on few keywords, sometimes we do see records being read-write. And when we do analyze the records, it turns out that we know that these records are not really what the end user is looking for, but since the requester has not clearly specified that he needs this record or those records, we still have to deliver the record. I can give you an example of a request where we were asked to search for records on all mailboxes at the CDC Guatemala office. And when we did the end search on Guatemala and ICE, ICE stand for immigration and customs enforcement, what we found was that we were getting all emails where the word "Guatemala" was showing up in the email signature, and the word "ICE" was showing up in some attached documents, in a PDF or in a word document. We absolutely knew that the records were not what the end user was looking for. So this is what it means. So we have the quantity of the data here, but the quality is very poor because we are very certain that we are not getting the right records.
Sometimes somebody asks for a response for COVID. So when we run a search for response for COVID, we do have a division not in EOC, on a specific branch, which is looking at COVID response. So people have their ... excuse me. People have their addresses as COVID-19 response. So what happens is all the emails with signatures of COVID-19 response show up, and I do know that these are not the records that they're looking for, but I still have to deliver them because these records are what the requester requested.
So if it makes sense, what I'm trying to say is that the quality of the search is poor because of the keywords that have been provided or because the scope of the request was not really clear, if it makes sense.
Any questions on this?
ALINA M. SEMO: There are currently no questions in the phone queue.
MARTHA WAGNER MURPHY: There's one chat question. It's a bit broader. We can save it, or we can take it now. Which do you prefer? I think it's for you, Roger.
ROGER ANDOH: Okay. We can take it now.
MARTHA WAGNER MURPHY: Okay. "I've seen the CDC FOIA annual report. You received approximately 2,400 requests last year. How many FTEs do you have dedicated to doing FOIA searches for this number of requests?"
ROGER ANDOH: Dedicated to FOIA searches is one. [inaudible 01:18:44]. That's it. We're working on getting a contractor to assist us, but right now, it's just her doing the searches.
MARTHA WAGNER MURPHY: Okay, thank you.
ROGER ANDOH: Sure. I just wanted to add as far as this average data quality, in a situation where the keyword that has been provided by a requester is so generic that it's going to be found in, for example, even signature bar, for example, one way to limit that would be to say the keyword should appear in the email content or in the subject. I mean, that would narrow it down. I mean, so that we go, okay, if the word should appear in the body of the email or it should be in a subject, or it should be ... I think we can do searches within a certain number of words, so COVID within five or 10 words of, let's say, No Sail Order or some other word, just so that we make sure that whatever it is that you're looking for, right?
Because at the end of the day, the requester, you are seeking information that is useful to you. And to the extent that we are looking and reviewing documents that are of no use to you, that is a waste of our time. That's a waste of your time. That results in a delay of response to you because at the end of the day, you want information that's useful to you. And a lot of times when it comes to eDiscovery searches, you as a requester can do a lot to help us improve, making sure that we have good data to provide to you by the way you scope your request and to the extent that you make it easier for us to be much more precise in identifying the documents that are responsive to your FOIA request. [crosstalk 01:20:40]-
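[Editor's illustration, not part of the spoken remarks.] The Guatemala/ICE example and the suggestion to limit a keyword to the subject or body can be sketched in Python. This is a minimal sketch under assumptions: the `Email` layout, the function names, and the sample message are all hypothetical, and it does not represent how CDC's eDiscovery tool works internally.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Email:
    # Hypothetical record layout, for illustration only.
    subject: str
    body: str
    signature: str = ""
    attachments: list[str] = field(default_factory=list)


def words(text: str) -> set[str]:
    """Lower-cased set of whole words found in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def and_search_anywhere(email: Email, keywords: list[str]) -> bool:
    """AND search over everything: subject, body, signature, and attachments."""
    everything = " ".join([email.subject, email.body, email.signature, *email.attachments])
    found = words(everything)
    return all(k.lower() in found for k in keywords)


def and_search_subject_or_body(email: Email, keywords: list[str]) -> bool:
    """AND search limited to the subject line and the body of the email."""
    found = words(email.subject + " " + email.body)
    return all(k.lower() in found for k in keywords)


# An email that is about neither keyword, but carries one in the signature
# and the other in an attachment.
noise = Email(
    subject="Weekly staffing update",
    body="Please find the schedule attached.",
    signature="Program Analyst, CDC Guatemala Office",
    attachments=["Meeting notes mentioning ICE in passing."],
)

print(and_search_anywhere(noise, ["Guatemala", "ICE"]))         # True: signature + attachment
print(and_search_subject_or_body(noise, ["Guatemala", "ICE"]))  # False: absent from subject and body
```

Under these assumptions, the unrestricted AND search returns the noise email while the field-restricted search does not, which is the narrowing effect described above.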
SRINATH TUTUKURI: Thank you, Roger. Yeah, thank you, Roger. If you don't have any additional questions, we can move to the next section.
So, so far, [inaudible 01:21:01] a lot of complaining regarding issue, and we have done some analysis and building up the observation. We are glad that we have found some recommendations that we are willing to share with the end users and also probably take any input advises that you have for us so that we can come up with better searches. So hopefully this last section is going to be more intuitive and useful to all of you.
So let me start with the first aspect of improved e-search when I say very defined scope. So what does a very defined scope really mean? So I'll categorize this into three different sections. So when I say very defined scope, what I mean is that we do not want any ambiguity in the scope. The requester needs to be very precise and concise in what he's looking for. So as long as the requester is very precise and concise in what he's looking for, I'm very confident that we can get very good results.
The second is if a requester is looking to perform multiple searches within one single search, the recommendation is that he split each search into its individual line item within the search request. It will really be better if each sub-search is really its own individual request. That way each search is pretty focused on an objective of what we are looking to achieve. That really helps us out, actually.
The last thing is that one recommendation is that most of the searches that I have observed is that there's a lot of newsletters and subscriptions that come in, and we do see a lot of requesters explicitly stating that we do not need newsletters and subscriptions and we are looking only at conversations and things like that. So that is really appreciated. When we have these three or four items taken care of, when the scope is really very defined, it makes the search much more predictable, it saves us a lot of time, and as Roger has stated, it provides much more productive results and it helps the end requester get the right data.
Any questions on this?
ALINA M. SEMO: There are currently no questions on the phone.
SRINATH TUTUKURI: Thank you.
MARTHA WAGNER MURPHY: And no new chat questions. Thank you.
SRINATH TUTUKURI: Okay. Thank you. Let me go to the next item on this. Limiting keywords: So when I say limiting keywords, what I mean is that it's a list. Sometimes we do get requests; the requesters give us keywords and say we want to search on these keywords. We do [inaudible 01:23:43] keywords. Analysts consider keywords [inaudible 01:23:46] or within this subset, in that subset, or an "and" doing this and that. So what happens is that when we have multiple keywords coming in, I absolutely know that the search results are very diluted, and we are getting a much more generic and abstract subset of data. So it's going to be a needle in a haystack here. That's for sure. So if the requester can be very concise or precise saying that I'm only looking for this keyword or this keyword, that really helps us in narrowing down the searches.
The biggest recommendation I would say is that rather than using an "and" or an "or" search, the second recommendation is to go with a phrase search. I can go to an example of a phrase search. So we did get a requester asking for testing for COVID-19 in long-term care facilities. So that is a very good phrase, but it doesn't necessarily mean that when I search for this phrase, I'm going to get any records or all the records because people can use different words that probably they could rephrase the content of what they're looking to search in different ways. So what we figured out is that testing for COVID-19 in skilled nursing facility, so the way the search was performed was that doesn't say testing; we said test, test star [test*]. So the word is a suffix. So it could be test, testing, or tested. That is one way.
The next is looking for testing within five or 10 words to the reference of COVID, or it could be star in CO-19 or corona, things like that. So COVID within test or testing within COVID or star in CO-19 or corona, and also additionally, the term skilled nursing facility could be referred to as your centers, a long-term care facility, LTC, or long-term care facilities, skilled nursing home, and things like that. So we just need to get creative with those words and try to come up with a phrase search I can do, add all the synonyms and capture those words, and what my observation has been is that rather than doing an AND search on just COVID-19 and skilled nursing facility and testing, when we did this phrase search trying to find words within a number of words, testing, we were able to get much better results which are much more accurate.
So that is one thing that is definitely recommended instead of doing an "and" or "or" search actually because an "and" or "or" search [inaudible 01:26:35] very weak. If there's an email with a thousand pages, the first word could start at [inaudible 01:26:41], starting of the body of the email, and the last word could be somewhere and contained in an Excel document or it could be a Word document. So that record may not be relevant, actually. So that is one thing. We can eliminate such things when we try to do a phrase search.
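[Editor's illustration, not part of the spoken remarks.] The phrase search described above combines a wildcard stem (test*), a proximity limit (within five or 10 words), and a synonym list. A minimal Python sketch of that idea follows. The term lists, function names, and sample texts are assumptions made for illustration, the facility synonyms are checked anywhere in the text rather than by proximity, and this is not the query syntax of any particular eDiscovery product.

```python
import re

# Assumed term lists, built from the synonyms mentioned in the talk.
TEST_TERMS = [r"test\w*"]                                # test, testing, tested
COVID_TERMS = [r"covid(?:-19)?", r"corona(?:virus)?"]
FACILITY = re.compile(
    r"\b(?:skilled nursing (?:facilit(?:y|ies)|homes?)"
    r"|long-term care(?: facilit(?:y|ies))?|ltc)\b"
)


def tokenize(text: str) -> list[str]:
    """Split into lower-cased words, keeping hyphenated forms such as covid-19."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def positions(tokens: list[str], patterns: list[str]) -> list[int]:
    """Indexes of tokens that fully match any of the patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [i for i, tok in enumerate(tokens) if any(c.fullmatch(tok) for c in compiled)]


def within(text: str, left: list[str], right: list[str], max_words: int) -> bool:
    """True if some left term is at most max_words tokens away from some right term."""
    tokens = tokenize(text)
    return any(
        abs(i - j) <= max_words
        for i in positions(tokens, left)
        for j in positions(tokens, right)
    )


def has_facility(text: str) -> bool:
    return bool(FACILITY.search(text.lower()))


def and_search(text: str) -> bool:
    """Plain AND: all three concepts appear somewhere, at any distance."""
    tokens = tokenize(text)
    return bool(positions(tokens, TEST_TERMS)) and bool(positions(tokens, COVID_TERMS)) and has_facility(text)


def proximity_search(text: str, max_words: int = 10) -> bool:
    """Test terms near COVID terms, plus a facility synonym anywhere in the text."""
    return within(text, TEST_TERMS, COVID_TERMS, max_words) and has_facility(text)


close = "Staff were tested for COVID-19 at the skilled nursing home last week."
far = ("Testing of the badge readers is complete. " + "filler " * 20
       + "COVID-19 guidance for LTC sites is attached.")

print(and_search(far), proximity_search(far))      # True False
print(and_search(close), proximity_search(close))  # True True
```

In this sketch the plain AND search accepts the second text, where the matching words are far apart and unrelated, while the proximity version rejects it.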
The third thing is that let's say the end user is coming up with keywords. It will always be better if they can prioritize which keywords take precedence. So if they're giving us three keywords that it commands then doing a priority, this is the first keyword that takes precedence, the second less precedence [inaudible 01:27:12]. Because when we run this record and you're getting too many records, it gives us a subset of record stores that even in Office 365, that helps us present the information to the analyst stating that, okay, for this keyword you're seeing this record, and this is taking more precedence. If you want this to take precedence, we will use this subset of records. So we are trying to help, I'm trying to help the user come up with what he is really looking for. Other than having keywords with an and, and so forth. So that is one thing that really helps us, validating the keywords, doing a phrase search, and keeping the keywords to as minimal as possible. Any questions on limiting the keywords?
EVENT PRODUCER (MICHELLE RIDLEY): There are currently no questions in the phone queue.
ROGER ANDOH: This is Roger. I wanted to say something here because I want to make clear to everyone who's listening, that there is no requirement that when you submit a FOIA request to us that you have to provide us with the keywords. So this example would be, if you do provide us with keywords, limit the number of keywords. Because we've had two page, sometimes folks give us a whole page of keywords or two pages of keywords. So one of the most important things that you could do is to have a well-defined scope, right? If you have a well-defined scope, we will be able to find, like [Srinath 01:28:37] was saying, the keyword that you might use might not be the term that internally, that folks who are having the conversations, would use. So you might say long-term care and maybe they just use a term, they might use the name of the facility, or they might just say LTC or whatever it is.
So if the scope is well-defined, that's a very good start. If you want to provide keywords, you can, a limited number of keywords, but you're not required to give us keywords. You're also not required to give us custodians. But if you do want to give us a list of custodians, limit the list of custodians, because the more custodians you provide to us, the more records we are going to pull, the more duplicated records are going to be provided. Because if there are 10 or 15 custodians and all of them are CC'd or participants in a particular discussion, that means that one email string is going to be contained within 15 or 20 custodian email boxes. And so just a point of clarification, you don't need to give us keywords. You don't need to give us a list of custodians, but if you do, just limit it.
SRINATH TUTUKURI: Thank you, Roger. It was very useful information and a very good reminder. And I can move to the next item, which is avoiding generic keywords. So when I say generic keywords, right, I do see a lot of, I can give an example here. I see a lot of requests coming with autism. And I had one request where we were asked to search on a request on a custodian's mailbox, who is a researcher on autism. So when we did a search on his mailbox, all his emails were about autism. So we came up with 30,000 records of autism-based emails within a span of three months. It was, it's like trying to search a stockbroker's email with the word stock. So that was the type of request, which is very generic. In this instance, the recommendation would be to, if you are giving some generic keywords, please also provide some supplemental keywords that will help us narrow the search.
So if somebody is sending us autism and we're searching the mailbox of an autism researcher, probably there is a medicine or there's a condition which is causing that. So something which can narrow the search results or something which is more specifically that [inaudible 01:31:08] subset of records within autism that you are looking for. So that really helps us out in the long run.
Next is, as I said, one more example is about the meat processing plant and guidelines and things like that. [inaudible 01:31:22] we had lots of keywords, very generic and things like that. So giving an example of, if you give us generic keywords, also make sure that you provide at least one supplemental keyword to narrow down the results.
I can move to the next slide, which is to limit the number of custodians, as Roger has already gone over. The more the number of custodians that you're going to have, you're going to have more number of emails and more number of duplicates that we need to go through. So I'm hoping that I don't need to do that again and again. So the lesser number of custodians, they are going to get a lesser number of records, and it becomes a lot easier for us to really narrow down the search results.
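[Editor's illustration, not part of the spoken remarks.] The effect of adding custodians can be shown with a small Python sketch: when everyone is copied on the same thread, each extra mailbox adds another copy of it to the results. The mailbox contents, the message identifiers, and the function name are hypothetical, and identifying copies by a shared identifier is a simplifying assumption.

```python
def search_hits(mailboxes: dict[str, list[str]], custodians: list[str]) -> list[str]:
    """One entry per copy found, so a shared thread is counted once per mailbox."""
    return [message_id for name in custodians for message_id in mailboxes.get(name, [])]


# 15 hypothetical custodians, all copied on the same thread,
# each with one unrelated email of their own.
mailboxes = {
    f"custodian-{n}": ["shared-thread", f"unrelated-{n}"] for n in range(1, 16)
}

three = search_hits(mailboxes, ["custodian-1", "custodian-2", "custodian-3"])
fifteen = search_hits(mailboxes, list(mailboxes))

for label, hits in (("3 custodians", three), ("15 custodians", fifteen)):
    duplicates = len(hits) - len(set(hits))
    print(label, "hits:", len(hits), "distinct:", len(set(hits)), "duplicates:", duplicates)
```

With these made-up mailboxes, three custodians produce 6 hits with 2 duplicate copies, and fifteen custodians produce 30 hits with 14 duplicate copies of the same thread.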
And the last item is reducing the time span of the searches actually. So if, sometimes I do see requests coming with four year time span or five year time span, and we find records, sometimes we don't find records because of the record strategic retention policy, which is very different for each mailbox. However, we do notice that sometimes, again we run searches for a year or couple of months, we get 20,000, 30,000 records. So it's always better to limit the time span, be very specific on which time span you're looking for. If there was an event that happened, probably a month around that event, 15 days before the event, 15 days after the event, probably makes sense, where there's a lot of noise that [inaudible 01:32:52] particular activity. Or let's say an example is [inaudible 01:32:56] when people [inaudible 01:32:58] month or two. So if that one month or two months of time span [inaudible 01:33:04], it really helps us to identify the right tables of records, actually.
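[Editor's illustration, not part of the spoken remarks.] The suggestion of 15 days before and 15 days after an event can be written as a simple date window. This Python sketch uses a made-up event date and hypothetical function names.

```python
from datetime import date, timedelta


def window_around(event_day: date, days_before: int = 15, days_after: int = 15) -> tuple[date, date]:
    """Inclusive search window surrounding an event."""
    return event_day - timedelta(days=days_before), event_day + timedelta(days=days_after)


def in_window(sent_on: date, window: tuple[date, date]) -> bool:
    start, end = window
    return start <= sent_on <= end


event_day = date(2020, 6, 16)  # hypothetical event date
window = window_around(event_day)

print(window[0].isoformat(), window[1].isoformat())  # 2020-06-01 2020-07-01
print(in_window(date(2020, 6, 20), window))          # True
print(in_window(date(2019, 6, 20), window))          # False
```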
So these are some of the improvements actually, which will really help us get better search for you, the requesters, actually. And if anybody has any questions for me, I'm willing to answer them related to this topic on this section.
ALINA M. SEMO: So Srinath, this is Alina. We have a question actually from our side, from [inaudible 01:33:35] which, and this is possibly a question also for Roger and Bruno, would you be able to talk a little bit about the role of the FOIA public liaison? And whether, when a very broad search is submitted, is the requester able to reach out to the public liaison, who would be willing to help requesters draft up a well scoped request?
SRINATH TUTUKURI: Yeah, sure. Alina, I'll leave the question to Roger if he wants to answer the question.
ROGER ANDOH: Sure. [inaudible 01:34:08]. So with CDC, yes, I've had requesters contact the FOIA public liaison, which I think now Bruno's listed as FOIA Public Liaison, or they reach out to me. I'm more than happy to work with requesters on the scope of the request. But at least from where I sit, it is much more advantageous for them to work, what we do when we get a FOIA request is that it's assigned to an analyst and that analyst handles that case from cradle to grave, right? So at some point in that process, if it's COVID, I'm going to see that request, review the request, and then it gets released. Often that person who knows the day to day, the in and out, [inaudible 01:34:55], who has more details about the request would be the analyst. So my preference would be [inaudible 01:35:05] working with an analyst. If there's an impasse and then you have to escalate it, I'll be more than happy to jump in.
But I think if you start with an analyst, and most times, I think in most situations, they are able to work with the requesters as to reformulate their requests in a way that is satisfactory to both sides. Sometimes we have an impasse and sometimes even they might have an impasse with me. It's just dependent upon what you're asking for. So for example, if somebody says I want you to do a search against all the email boxes by a particular program or division, we're going to have an impasse because I'm going to say we can't set that against 300 or 400 custodians, because Srinath cannot push a button to do that. He's going to have to manually put in every single email box for every single employee in that program division.
That right there would be an unreasonable request, and it's going to take an unreasonable amount of time. So certainly, yes, you can contact the FOIA public liaison, you can contact me directly. You can contact Bruno to help you reform the request, but the person you really should start with would be the person who's assigned your request. And that person's name is always in your acknowledgement letter that you receive. So you have the contact information of that person in your acknowledgement letter, and it's best to start with that person.
ALINA M. SEMO: Okay, great. Thanks. Martha, I think we have another question in the chat.
MARTHA WAGNER MURPHY: Yes. So it was explained earlier that containment tools pull the last email string. However, what happens if multiple strings are created with recipients and CCs added or dropped and conversations going in multiple directions? Will the program keep those break-off strings or will they be eliminated by the program?
SRINATH TUTUKURI: That's a very good question, and I can take this question. I can answer it for you. Yes. So if there is a breakage or somebody changes the content of the email or adds a new recipient or deletes a recipient, that chain is broken. And it so happens that another record is created, but when the analysts look at the record here, they make sure that sometimes if it's the same thing, it's not part of the containment. They can go in and delete the recorded request. But it does break, if a chain is broken, it does create a new record, actually. So the containment will not work for that particular instance here.
ROGER ANDOH: Yeah. Just to amplify what Srinath said. So if all the email correspondence was not all contained within one email string, then any separate emails or completely separate records are going to be pulled. They're not going to be eliminated.
ALINA M. SEMO: Okay. And then we had a follow up, I think, from the same basic topic. Could you please discuss the topic of the most comprehensive email thread?
ROGER ANDOH: Let me attempt to answer that question. I'm going to assume when you say the most comprehensive email thread, you're saying the email thread that contains every single email correspondence about a particular topic. So if that exists, because sometimes it may not exist, right, so if an extended email thread contains every single discussion about the particular subject matter. Well, I assume that is the most comprehensive. And then to the extent that the containment system identifies that, then it pulls that record. So the requester is receiving every single [inaudible 01:38:56] of discussion about that particular subject matter. But to the extent that one email string is not comprehensive, then there are maybe multiple ones, there are subsets of it that would go in different directions. Then those are going to have to be pulled. And then they're not going to be considered as dupes or near dupes because they're not.
MARTHA WAGNER MURPHY: So when you say they're going to be pulled, do you mean they will be part of the responsive, comprehensive responses?
ROGER ANDOH: Absolutely. Yes, they will be part of the response. Exactly.
BRUNO VIANA: And I want to add on to that as well. This is Bruno Viana at the CDC. From my experience using the tool, and Srinath, Roger, you can back me up. As far as the duplicates and the containment is concerned, that tool is very sensitive. So I've had analysts come to me and say, "These are duplicates. Why is it not catching it?" But any sort of change here or there, if there's an attachment missing, if it's a forward, if there's any slight change, the tool is very sensitive and it'll include it in the responsive document set.
ROGER ANDOH: So let's give an example. For example, let's say Roger, Bruno and Alina had an email conversation about having this webinar, right? And so we have emails back and forth and they said, there's a final email thread of this discussion. And then I forward to [inaudible 01:40:18] and I just do, "FYI." I don't even say anything, I just forward her the whole email string, not having introduced it. That's a separate chain because she was not part of our conversation. I just forwarded the whole email string between myself, Bruno and Alina. That is, we no longer have a quote unquote one comprehensive email. We've created two separate ones now.
SRINATH TUTUKURI: Thank you, Roger. And thank you Bruno for reminding this. And I just add one thing to what Bruno said, is that if somebody even tries to add a single line break within the email chain and forward it to somebody else, it can create another chain altogether. So we'll end up having the same content or the same quote but, as Roger has said, multiple subsets of data and also the same [inaudible 01:41:15].
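[Editor's illustration, not part of the spoken remarks.] The containment behavior discussed in this exchange can be sketched in Python: an earlier email is dropped only when a later email has the same participants and carries the earlier text unchanged. The `Message` layout, the exact-match rule, and the names below are assumptions for illustration, not the actual algorithm of the tool CDC uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    participants: frozenset[str]
    text: str  # full text, including earlier messages quoted below the new reply


def is_contained(earlier: Message, later: Message) -> bool:
    """Assumed rule: same participants, and the later text ends with the earlier text unchanged."""
    return (
        earlier.participants == later.participants
        and len(later.text) > len(earlier.text)
        and later.text.endswith(earlier.text)
    )


def most_comprehensive(messages: list[Message]) -> list[Message]:
    """Keep only messages that are not contained in some other message."""
    return [m for m in messages if not any(is_contained(m, other) for other in messages)]


trio = frozenset({"roger", "bruno", "alina"})
first = Message(trio, "Shall we hold the webinar?")
reply = Message(trio, "Yes, May works.\n" + first.text)

# Forwarding the whole string to someone new changes the participants.
forward = Message(frozenset({"roger", "colleague"}), "FYI\n" + reply.text)

# A single added line break inside the quoted text changes the content.
altered = Message(trio, "Thanks.\n" + reply.text.replace("Shall we", "Shall\nwe"))

kept = most_comprehensive([first, reply, forward, altered])
print(len(kept))  # 3: only the first email is absorbed into the reply
```

Under this assumed rule, the forward and the line-break variant each survive as separate records even though they repeat nearly all of the same text.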
ALINA M. SEMO: Thanks everyone. I don't see anything else in chat right now.
SRINATH TUTUKURI: Yeah. Thank you. And I would like to open this up to anybody within the user community who's willing to provide us any recommendations that can help us give them better results. So we can probably have a few minutes of chat or discussion to see if they have any [inaudible 01:41:44] for them. And we can take [inaudible 01:41:45] and have them discussed internally within the CDC for your office.
EVENT PRODUCER (MICHELLE RIDLEY): Ladies and gentlemen, if you would like to make a comment over the phone, or you have a question, you may press pound two on your telephone keypad to enter the queue.
ALINA M. SEMO: I think you've done such a great job answering questions as we've gone along, that everyone has been on silence at this point, but we'll give everyone a couple of minutes to absorb. And Srinath, I don't know if you want to ask Michelle to go to the next slide where the contact information is there?
SRINATH TUTUKURI: Yeah, sure, sure. Yeah.
ALINA M. SEMO: Perfect.
MARTHA WAGNER MURPHY: You did get one commendation on chat that the information session was very helpful for understanding your end in order to work together, so thank you.
SRINATH TUTUKURI: Thank you.
ROGER ANDOH: Thank you very much.
SRINATH TUTUKURI: Yes. So if anybody has any additional questions, please feel free to reach out to me related to the technical aspects, but if it is related to any business that I've missed, [inaudible 01:43:05] I recommend that you reach out to Roger or Bruno and they should be able to answer the question.
ALINA M. SEMO: Martha, do we have any other questions on the YouTube chat platform?
MARTHA WAGNER MURPHY: Nope. Nothing from our colleagues who are watching the YouTube chat right now. Thank you.
ALINA M. SEMO: I just saw another chat question come in. Does the CDC have analysts to do manual responsiveness checks to further reduce duplicate emails slash attachments within threads?
SRINATH TUTUKURI: Yeah, I will leave this question to Bruno.
BRUNO VIANA: Yep, I'll take that one. Sure. So this is just the first part of pulling the records and de-duping and doing all that. Every set of records that Srinath pulls is going to go to an analyst who is going to analyze it, go through it, process the records before it's released to the requester. During that process, if they are seeing duplicates, because it's not perfect, I mean, at the end of the day, it's a computer. Whatever you put in is what you're going to get out. So you still need that human eye to look at it to make sure that everything is still responsive or it's not, we didn't pull a bunch of out-of-scope stuff for one reason or the other. So yes, for every package that goes out, a person will still look at it and do that analysis, and they look for duplicates.
And again, as much as a computer is imperfect, we are too. So there may be duplicates that we miss, but we take all the effort in the world to make sure that we catch those. And not just for the requester, but it also, it's easier on us if we can catch the duplicates. It's fewer pages that we've got to go line by line and review, so it helps us out as well. So we definitely do that review after certain [inaudible 01:44:54] process. It definitely goes through another review before the release.
ROGER ANDOH: And I wanted to add that in addition to the analyst who's assigned to review it, when I'm reviewing COVID records or Bruno's reviewing COVID records, I'm also looking for everything. So to the extent I see duplicate emails of the same that are contained within a comprehensive thread, I could either say flag as duplicate, or I might just leave it in. But I have to make sure that it's processed consistently. That's the biggest thing that I have to watch for, is that a separate email that is contained within a comprehensive email is not processed differently from the comprehensive one, right? So I have to make sure that it's done accurately.
And so, and as far as attachment goes, that's a little bit more tricky when we talk about attachment is a duplicate, right? If I send, if there's email correspondence between CDC officials and they attach a document, right? It says, this is CDC's, please review and edit CDC school guidance, for example. Let's just use that for example, right? And then that same school guidance edit is sent by, let's say, Dr. Walensky, and she sends it to, let's say, the White House and says, "This is our current draft of the school guidance." I can't say that just because we have released it in this email string internally, it's the same thing therefore it's a duplicate. No, it's not, right? Because that email chain to the White House is a separate email. That attachment is to that email string. Therefore, that document itself is not a duplicate.
So it's included, even though it's the exact same document [inaudible 01:46:52] was given to by her staff. It's the same document, but we're not going to mark that as a duplicate just because it's the same document attached to a different email. It's not. So when we talk about removing attachments that are duplicates, it means that the email and the attachment are the same, so everything should be the same. Otherwise it's in. So if the email string is the same, but the attachment is different, that's a new record. If the email string is different and the attachment is the same we've seen earlier, it doesn't matter. It's still a different record.
BRUNO VIANA: And this Roger, this is Bruno again. This goes back to the question that Roger answered at the beginning of the presentation. So, in the FOIA world, it's considered, the email and the associated attachments are considered a record. So that's why the default is if you make a request for emails, those attachments are going to come. Unless you say that you don't want them, then we can exclude them. But a record, in this instance, is that email and any associated attachments. So that's why, even though the body of the email is just a forward, or it's just, it looks the same or the body of the, I'm sorry, the attachment is the same. There's no changes made to an attachment, but Roger sends me a draft of a document to five different people. It's going to go to five different people, the attachment's the same, but the text may be different if it's forwarded or replied, but there's no changes to that attachment.
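[Editor's illustration, not part of the spoken remarks.] The rule described here, that a record is the email together with its attachments and only counts as a duplicate when both match, can be sketched in Python with content hashes. The use of SHA-256, the function names, and the sample text are assumptions for illustration only.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def record_key(email_text: str, attachments: list[bytes]) -> tuple[str, tuple[str, ...]]:
    """A record is the email plus its attachments, so the key covers both."""
    return digest(email_text.encode("utf-8")), tuple(sorted(digest(a) for a in attachments))


draft = b"school guidance draft v3"  # hypothetical attachment content

internal = record_key("Please review and edit.", [draft])
internal_again = record_key("Please review and edit.", [draft])
external = record_key("This is our current draft.", [draft])
revised = record_key("Please review and edit.", [b"school guidance draft v4"])

print(internal == internal_again)  # True: same email, same attachment
print(internal == external)        # False: same attachment, different email
print(internal == revised)         # False: same email, different attachment
```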
ALINA M. SEMO: Martha, I think we have another couple chat questions.
MARTHA WAGNER MURPHY: Yes. So this is getting to communication between the analysts and the requester. Will the analysts reach out and say your request is probably high intensity, can we talk about scope to get it to moderate or low? Or will you do the search first before you determine that it's high intensity? I guess the question is when a request comes in, is there always a search conducted or can it be determined to be high intensity before the search is conducted? I think that's the question.
ROGER ANDOH: Great. Yeah, this is Roger. I think, from my experience, some requests on their face will be a high intensity search without you having to do a search. But in some situations I've asked my staff to go before you go back and say, this is overly broad or vague, or [inaudible 01:49:28] we need to have data to support that, right? So we should do a preliminary search and see what we pull, because it may turn out that there's not much discussion here. And sometimes when I do this, I then realize, okay, there wasn't a lot of conversations around the subject matter. It seemed broad on its face, but there wasn't much conversation here.
So, but to the extent that, so if we do this search and then we determine it's a high intensity search, then Srinath would make that known to the analyst, and the analyst would go back to the requester with enough information to help them to reform the request. But sometimes on its face, and I go back to this one about, I want all correspondence that the CDC had with, for example, the White House. Okay, any conversation that the CDC had with the White House from January 1, 2020 through December 31, 2020 on its face is going to be a high intensity search because there are going to be multiple people. There are going to be multiple email domain names. That's going to be a high intensity search right on its face, and we don't need Srinath to do a search to tell us that.
MARTHA WAGNER MURPHY: So it depends is the answer, which is fair. Right?
ROGER ANDOH: Yes, exactly. It depends.
MARTHA WAGNER MURPHY: One question that someone had regarding duplicates: if the recipient changes, but the email thread is identical, the thread containing a different recipient would be contained as a non-duplicate. Is that correct?
ROGER ANDOH: That's correct-
MARTHA WAGNER MURPHY: The content is exactly the same, but you got a different email.
ROGER ANDOH: It's a different email, yes.
MARTHA WAGNER MURPHY: Okay. I don't see anything else in the chat right now, unless I've missed something, Alina?
ALINA M. SEMO: No, I don't see anything else either. I think we've asked all the questions, Michelle. Anyone wants to chime in orally on the phone?
EVENT PRODUCER (MICHELLE RIDLEY): No, I do not see any chat questions or comments on the phone.
ALINA M. SEMO: Okay. All right. Srinath, any other wrap-up words before we say goodbye to everyone and let them get on with their day?
SRINATH TUTUKURI: Yeah, sure, Alina. I'd like to wrap this up by saying that I feel like there was the situation of the [inaudible 01:51:34] tech, as long as the scope is finalized and the scope is very concise. I think the biggest takeaway from this session would be that if the requester can provide us the right scope, it makes their life and our life a lot easier. And I thank you all for giving me this opportunity to present at today's session and I thank all the partners at [inaudible 01:52:02] for giving this opportunity for me to present this information. And hopefully this has been helpful session and it helps us to even cut down on our searches. Thank you.
ALINA M. SEMO: Thanks. Roger and Bruno, any other parting thoughts before we say goodbye to our folks?
ROGER ANDOH: Bruno, you want to go first?
BRUNO VIANA: Sure, I will. I just want to say thank you again to [inaudible 01:52:28] and I would recommend any other FOIA offices reach out and use their services as well. They're great about advertising events and organizing, running them, moderating, doing all the work. So they make us look good. We do the easy part. So we really appreciate that.
ROGER ANDOH: Yeah. I also would echo and I'll encourage any federal agency that's listening in to take advantage of the opportunity that [inaudible 01:53:02] has given to us to communicate with their requesters about the FOIA requests. I think the more we can communicate and the more we can let requesters know the challenges that we have to go through, what we have to do. I think the better it is for all of us.
And I want to say at least on behalf of CDC's FOIA office and the agency, is that we take our job in responding to FOIA requests very seriously. And we work tirelessly, I have to say that, we work tirelessly every day to make sure that we respond timely to FOIA requests. Are we perfect? No. Are we close to being perfect? No, but we try our hardest every day to get there. And this is part of what we're trying to do, is to hopefully get FOIA requesters to understand that they can help us make that goal of getting responses to them as timely as we can. Thank you.
ALINA M. SEMO: Great message, Roger, during Public Service Recognition Week. So I think, yes, we're all tireless government employees. Well, thank you all very much Srinath, Roger and Bruno. You've all done a great job of covering a lot of important material. I think everyone will find it very helpful. They have your contact information, if they have any follow up questions. I want to thank everyone for joining us today. I hope everyone and their families remain safe, healthy, and resilient. Take care everyone, and have a great day. Bye.
BRUNO VIANA: Thank you all. Bye-bye.
EVENT PRODUCER (MICHELLE RIDLEY): That concludes our conference. Thank you for using events services. You may now disconnect.