Google Dataset Search: Facilitating data discovery in an open ecosystem


I AM ADAM THOMAS. I LEAD THE DATA SCIENCE AND SHARING TEAM IN THE NIH INTRAMURAL PROGRAM. IT IS MY PLEASURE TO WELCOME CHRIS GORGOLEWSKI. HE GOT HIS PH.D. FROM THE UNIVERSITY OF EDINBURGH A FEW YEARS AGO, BUT HE IS A LEADER IN OPEN AND REPRODUCIBLE METHODS IN THE NEUROSCIENCES. HIS WORKS INCLUDE SUCH THINGS AS NEUROVAULT. HE SPEARHEADED THE CREATION OF THE OHBM REPLICATION AWARD. HE WAS CO-DIRECTOR OF THE STANFORD CENTER FOR REPRODUCIBLE NEUROSCIENCE. HE IS RESPONSIBLE FOR THE WILDLY POPULAR BRAIN IMAGING DATA STRUCTURE STANDARD, BETTER KNOWN AS BIDS. HAVING SOLVED ALL OF THE REPRODUCIBILITY PROBLEMS IN NEUROSCIENCE, HE HAS EXPANDED HIS AMBITIONS TO CREATE AN OPEN ECOSYSTEM FOR ALL OF SCIENCE. HELP ME WELCOME CHRIS GORGOLEWSKI. [APPLAUSE] >>THANK YOU SO MUCH FOR THIS KIND INTRODUCTION. AS YOU CAN SEE, THE PROBLEMS ARE NOT SOLVED. I CONTRIBUTED A LITTLE BIT WITH THE HELP OF MANY, MANY PEOPLE. TODAY I AM GOING TO TALK ABOUT A NEW GOOGLE PRODUCT CALLED DATASET SEARCH AND TELL YOU HOW WE ARE TRYING TO DO THIS TOGETHER WITH A BROADER COMMUNITY AND ECOSYSTEM OF DATA PROVIDERS OUT THERE. BEFORE I BEGIN, I WANT THIS TO BE INTERACTIVE. PLEASE DON’T HESITATE TO INTERRUPT ME AND ASK QUESTIONS. I AM HERE TO CLARIFY THINGS AND TO PROVIDE INFORMATION. HOPEFULLY THERE WILL ALSO BE TIME AT THE END FOR A DEDICATED QUESTION SESSION. LET’S START WITH OPEN DATA. WE KNOW FROM VARIOUS STUDIES THAT OPEN DATA DOES IMPROVE RESEARCH QUALITY: PAPERS THAT INCLUDE DATASETS HAVE FEWER MISTAKES. WE KNOW IT SAVES MONEY: RESEARCH DONE USING EXISTING PUBLISHED DATASETS SAVES A LARGE NUMBER OF TAXPAYER DOLLARS. FOR EXAMPLE, [INDISCERNIBLE] IS SHOWING THAT FCP, ONE OF THE DATA SHARING INITIATIVES IN IMAGING, SAVED SOMETHING LIKE $1 BILLION. WE KNOW IT ALSO PROMOTES INNOVATION. 
REUSING DATA ACROSS FIELDS LEADS TO NEW IDEAS, AND IT ALSO ALLOWS DATASETS PRODUCED IN AN ACADEMIC ENVIRONMENT TO BE USED IN A MORE COMMERCIAL ENVIRONMENT, WHICH LEADS TO COMMERCIALIZATION OF SCIENTIFIC DISCOVERIES. BUT TO THINK MORE ABOUT OPEN DATA AND DATA SHARING, IT IS GOOD TO HAVE A FRAMEWORK. WILKINSON AND OTHERS HAVE THE FRAMEWORK [INDISCERNIBLE] FOR INTEROPERABLE AND REUSABLE. LET ME HAVE A SHOW OF HANDS: WHO HAS HEARD ABOUT FAIR, THE ACRONYM? IT IS A USEFUL WAY OF THINKING ABOUT WHAT IT MEANS FOR DATA TO BE USEFUL. I AM GOING TO FOCUS HERE ON THE VERY FIRST LETTER IN THIS ACRONYM, WHICH IS FINDABLE. FAIR DEFINES FINDABLE AS HAVING PERSISTENT IDENTIFIERS, AS BEING DESCRIBED WITH RICH METADATA, AS THAT METADATA BEING INDEXED IN A WAY THAT MAKES IT SEARCHABLE, AND AS THE IDENTIFIER ALSO BEING INCLUDED IN THE METADATA. THAT ASPECT IS WHAT WE ARE TRYING TO HELP WITH. WHY DOES IT ACTUALLY NEED HELPING? IT IS BECAUSE THERE ISN’T ONE SINGLE DATA REPOSITORY THAT EVERYONE DEPOSITS DATA IN, ONE PLACE YOU GO TO WHERE YOU KNOW THAT IF THE DATA IS NOT THERE, IT DOESN’T EXIST. THERE ARE OVER 2,000 REPOSITORIES FOR DATA. NATURE SCIENTIFIC DATA RECOMMENDS 758 DIFFERENT REPOSITORIES, BOTH FIELD-SPECIFIC AND GENERIC. DATACITE LISTS 1,600 DIFFERENT DATA CENTERS THAT ARE LISTED AS PUBLISHERS OF DOIS RELATED TO DATA, AND THERE ARE MANY MORE DATA REPOSITORIES OUT THERE. FINDING DATA IS SOMETHING THAT NEEDS TO BE SOLVED. IF WE WANT TO INCREASE REUSE OF DATA, WE NEED TO MAKE IT EASY FOR PEOPLE TO FIND RELEVANT DATASETS. THIS IS INCIDENTALLY WHAT GOOGLE’S MISSION IS. OUR MISSION STATEMENT IS TO ORGANIZE THE WORLD’S INFORMATION AND MAKE IT UNIVERSALLY ACCESSIBLE AND USEFUL, WHICH VERY MUCH FITS WITH THE GOAL OF INCREASING FINDABILITY OF RESEARCH DATA. THIS IS WHY LAST YEAR IN SEPTEMBER WE RELEASED A NEW TOOL CALLED DATASET SEARCH. IT IS BASICALLY WHAT IT SAYS: A SEARCH ENGINE DEDICATED TO FINDING DATASETS. 
I AM GOING TO TALK A LITTLE BIT ABOUT HOW IT WORKS. I WILL ALSO SPEND A BIT MORE TIME ON WHAT TO DO, FROM A DATA PROVIDER POINT OF VIEW, TO MAKE SURE YOUR DATA IS REPRESENTED WELL IN THE SEARCH ENGINE, AND ON THE FUTURE DIRECTIONS WE ARE PLANNING, ESPECIALLY IN THE CONTEXT OF THE NEEDS OF FUNDING BODIES AND OF CREDITING SCIENCE. IT IS A SEARCH ENGINE. BUT WHAT MAKES IT DIFFERENT IS THAT IT IS A SEARCH ENGINE OVER METADATA. THAT DISTINCTION BETWEEN METADATA AND DATA MEANS THAT WHAT WE ARE INDEXING IS ACTUALLY DESCRIPTIONS OF DATASETS RATHER THAN THE CONTENT OF THE DATASET ITSELF. THAT MIGHT SEEM TO BE A NOT VERY IMPORTANT DISTINCTION, BUT IT ALLOWS US TO SUPPORT VARIOUS USE CASES COMMON IN BIOMEDICAL RESEARCH. A LOT OF DATASETS REQUIRE AGREEMENTS OR APPROVALS TO GET ACCESS TO THE DATA. WE CAN STILL INDEX THE INFORMATION ABOUT THE DATASET WITHOUT HAVING ACCESS TO THE DATA ITSELF. WE CAN PROVIDE THIS INFORMATION AS SEARCH RESULTS TO USERS, SO USERS CAN FIND A DATASET AND THEN APPLY TO ACTUALLY ACCESS IT. THE SAME GOES FOR DATASETS LIKE THE WELL-KNOWN [INDISCERNIBLE] PROJECT, WHICH REQUIRES A CERTAIN PAYMENT TO SUPPORT THE MAINTENANCE OF THE PROJECT, SO THEIR DATA IS ALSO NOT FULLY ACCESSIBLE WITHOUT ADDITIONAL HOOPS TO JUMP THROUGH. WE CAN STILL INDEX SUCH DATASETS BECAUSE WE ARE BASICALLY INDEXING DESCRIPTIONS OF THE DATASET ITSELF. THAT METADATA IS WHAT IS USED. HERE’S WHAT THE INTERFACE LOOKS LIKE. IT IS DIFFERENT FROM NORMAL GOOGLE SEARCH; IT IS A BIT RICHER. YOU HAVE THE SEARCH RESULTS, AND ON THE RIGHT SIDE THE DETAILS OF ONE PARTICULAR SEARCH RESULT. IN THAT DETAILED INFORMATION YOU HAVE THE METADATA OF THE DATASET. THIS IS WHAT WE INDEX AND WHAT DATA PROVIDERS GIVE US SO THAT WE ARE ABLE TO BUILD THE SERVICE. THAT INCLUDES, IN THIS PARTICULAR CASE, THE DATASET NAME, THE DESCRIPTION, THE ENTITY THAT PROVIDED THE DATASET AND, IMPORTANTLY, WHO PAID FOR THE STUDY, AS WELL AS THINGS LIKE TEMPORAL COVERAGE, SPATIAL COVERAGE, VARIABLES MEASURED AND THESE SORTS OF THINGS. SO, SEARCH ENGINE. 
IT IS A SEARCH ENGINE THAT WORKS OVER METADATA. SO THE SOMETIMES CONFUSING THING IS THAT WE DON’T HOST THESE DATASETS. WE ONLY INDEX DESCRIPTIONS OF DATASETS AND HELP PEOPLE FIND THOSE DESCRIPTIONS. THEN FINALLY, IT IS A SEARCH ENGINE OF METADATA FROM DATA PROVIDERS. THAT’S WHY THE COMMUNITY ASPECT OF THIS PROJECT IS SO IMPORTANT: WE ARE ONLY INDEXING THINGS THAT WANT TO BE INDEXED. IN OTHER WORDS, WE LET THE DATA PROVIDER TELL US WHAT DATASETS THEY HAVE, WHAT THEIR NAMES AND DESCRIPTIONS ARE, AND OTHER PIECES OF METADATA. IT IS NOT US TRYING TO GUESS ALL OF THIS INFORMATION AND MAKING MISTAKES. THE POWER OF ACCURATELY DESCRIBING THE DATA IS IN THE HANDS OF DATA PROVIDERS, BECAUSE WE DON’T CLAIM WE ARE EXPERTS IN ALL OF THE DIFFERENT FIELDS OF SCIENCE. IT IS UP TO DATA PROVIDERS TO GIVE THIS INFORMATION. HOW DO WE DO IT? WE USE A STANDARD CALLED SCHEMA [INDISCERNIBLE], WHICH IS A LITTLE PIECE OF CODE THAT IS EMBEDDED IN THE WEBSITE ITSELF. IT IS AN OPEN STANDARD THAT HAS HAD A LOT OF DIFFERENT MEMBERS OF THE INDUSTRY BEHIND IT, INCLUDING MICROSOFT AND YAHOO AS WELL AS GOOGLE. NOW IT IS BEING DRIVEN BY A PASSIONATE COMMUNITY ON GITHUB. IT IS A STANDARD FOR DESCRIBING DIFFERENT ENTITIES THAT CAN BE FOUND ON THE INTERNET. ITS ADOPTION AND TOOLING ARE DRIVEN BY OTHER SEARCH USE CASES. SO FOR EXAMPLE, WHO HAS EVER LOOKED FOR A RECIPE ON GOOGLE? RECIPES? GREAT. SO WE FIND RECIPES THE SAME WAY. THERE IS A RECIPE SCHEMA IN SCHEMA.ORG. THIS IS DATA THAT WEBSITES THAT PROVIDE RECIPES INCLUDE IN THEIR HTML PAGES,
AND THIS IS HOW WE FIND RECIPES. SAME THING FOR JOBS AND EVENTS AND MANY OTHERS. WHY IS THIS IMPORTANT? BECAUSE THERE ARE OTHER USE CASES, WE ARE BUYING INTO AN ECOSYSTEM, AND WE DON’T HAVE TO BUILD ALL OF THESE TOOLS OURSELVES. THE SAME PEOPLE BUILDING WEBSITES, THE DATA PROVIDERS, CAN USE TOOLS THAT WERE DESIGNED FOR OTHER USE CASES. THE OTHER ASPECT OF IT IS THAT WE DON’T BUILD INDIVIDUAL PARTNERSHIPS WITH EACH OF THE DATA REPOSITORIES THAT WOULD REQUIRE CUSTOM CODE. IN OTHER WORDS, WE DON’T WORK WITH INDIVIDUAL APIS AND BUILD SYSTEMS TO INGEST THOSE APIS AND BUILD AN INDEX. SO ALL OF THE OUTREACH WORK THAT WE DO IN TERMS OF ADOPTION OF THE STANDARD CAN BE REUSED BY ANYONE. THOSE SCHEMA.ORG PIECES OF METADATA OUT THERE ON THE WEB ARE ACCESSIBLE TO ANYONE. ANYONE CAN BUILD THEIR OWN ALTERNATIVE VERSION OF A DATASET INDEXING SERVICE, WHICH MAKES IT MORE OPEN AND SUSTAINABLE. THE OTHER ASPECT IS THAT IT IS EASY TO ADD. FROM A DATA PROVIDER POINT OF VIEW, YOU DON’T HAVE TO BUILD A NEW API. YOU DON’T HAVE TO PROVIDE A NEW ENDPOINT. ALL YOU HAVE TO DO IS ADD THIS PIECE OF CODE INTO THE PAGE THAT YOU WOULD BE SERVING TO USERS. IT IS INVISIBLE TO USERS, BUT IT IS PICKED UP BY OUR CRAWLERS. HOW EASY IS IT? IT IS SO EASY IT ACTUALLY FITS INTO A TWEET, WHICH WE EXPERIMENTALLY FOUND RECENTLY. THIS IS WHAT THE MINIMAL DESCRIPTION OF A DATASET LOOKS LIKE. YOU HAVE A DEFINITION OF WHAT STANDARD IT IS AND WHAT ENTITY IT IS: IT IS NOT A RECIPE OR EVENT OR JOB POST, IT IS A DATASET. THEN WE HAVE THE NAME OF THE DATASET AND THE DESCRIPTION. IT IS LOWER CASE. APOLOGIES FOR THE TYPO. THIS IS THE MINIMAL AMOUNT THAT WOULD BE REQUIRED TO BE ELIGIBLE TO BE LISTED IN SEARCH RESULTS. YOU CAN HAVE MUCH MORE, INCLUDING FILE FORMATS, DOWNLOAD URLS, SPATIAL AND TEMPORAL COVERAGE, ET CETERA, ET CETERA. THEN WE ALSO TAKE ADVANTAGE OF THE EXISTING INFRASTRUCTURE FOR CRAWLING THE WEB AND THE DECADES OF EXPERIENCE THAT GOOGLE HAS IN SEARCH AND IN INDEXING CONTENT. 
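The minimal description mentioned in the talk (the standard, the entity type, a name and a description) can be sketched as follows. This is only an illustration: the function names and the example values are hypothetical, and the slide itself is not reproduced in the transcript. It assumes the markup is embedded as JSON-LD in a `<script>` tag, which is one common way of putting schema.org metadata on a page.

```python
import json


def minimal_dataset_jsonld(name: str, description: str) -> str:
    """Serialize the smallest schema.org Dataset description as JSON-LD."""
    record = {
        "@context": "https://schema.org/",  # which standard this is
        "@type": "Dataset",                 # which entity: not a Recipe, Event or JobPosting
        "name": name,
        "description": description,
    }
    return json.dumps(record, indent=2)


def as_script_tag(jsonld: str) -> str:
    """Wrap JSON-LD for embedding in an HTML page (invisible to readers)."""
    # "<\/" is a valid JSON escape and keeps a stray "</script>" in the text
    # from closing the tag early.
    safe = jsonld.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{safe}\n</script>'


if __name__ == "__main__":
    # Hypothetical dataset, for illustration only.
    print(as_script_tag(minimal_dataset_jsonld(
        "Example survey data",
        "Responses to a hypothetical 2019 survey on data sharing practices.",
    )))
```

Richer fields such as file formats, download URLs and coverage would be added as further keys in the same object.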
THAT CRAWLING DOES DEPEND ON A COUPLE OF ADDITIONAL TOOLS THAT ARE GOOD FOR ANYONE PUBLISHING ANY CONTENT ON THE INTERNET. IT IS NOT JUST HAVING THE SCHEMA.ORG MARKUP ON THE WEB PAGE; IT IS ALSO MAKING SURE THE WEB PAGE IS WELL REPRESENTED ON THE WEB. THAT INCLUDES SITEMAPS, WHICH ARE TOOLS FOR WEBMASTERS TO TELL WEB CRAWLERS WHICH PAGES THEY HAVE, IN THIS CASE WHICH DATASETS THEY HAVE. WE ALSO PROVIDE A TOOL CALLED GOOGLE SEARCH CONSOLE, WHICH ALLOWS A DATA REPOSITORY TO MONITOR THE COVERAGE OF WEB CRAWLING AND ANY POTENTIAL ERRORS. VALIDATION OF DATASET METADATA IS NOW INTEGRATED INTO THAT PRODUCT, SO YOU WILL GET INFORMATION IF SOMETHING IS NOT CORRECT. ALL OF THOSE TOOLS ARE SUPPORTED BY MULTIPLE TEAMS THAT BENEFIT FROM THIS BROADER ECOSYSTEM OF SCHEMA.ORG. OKAY. SO NOW I AM GOING TO GO A LITTLE BIT DEEPER INTO HOW DATASET SEARCH WORKS UNDER THE HOOD. WHAT DO WE DO WITH THE INDEXED DATA? WE INDEX THOUSANDS OF DOMAINS AND MILLIONS OF DATASETS. IF YOU ARE OPERATING AT THIS SCALE AND TRYING TO HARMONIZE, EVEN AT THIS VERY HIGH LEVEL, METADATA FROM MANY, MANY DIFFERENT PROVIDERS, IT TURNS OUT THAT ALMOST EVERYTHING THAT COULD GO WRONG OR COULD BE MISSPELLED WAS MISSPELLED AT SOME POINT. SO WE DEAL WITH A LOT OF HARMONIZATION, FOR WHICH WE USE VARIOUS METHODS TO CLEAN UP THIS DATA. IN SEARCH RESULTS YOU ARE GOING TO SEE THE CLEANED-UP VERSION OF IT. WE ALSO USE THE INTERNAL KNOWLEDGE GRAPH TO DO SOME RECONCILIATION OF DATES, LOCATIONS AND ORGANIZATIONS TO PROVIDE RICHER RESULTS, IN TERMS OF THE SPATIAL COVERAGE OF A PARTICULAR DATASET OR, FOR EXAMPLE, THE BEST VISUAL PRESENTATION OF THE UNIVERSITY THAT PAID FOR ACQUISITION OF THIS DATASET. WE HAVE AN INTEGRATION WITH GOOGLE SCHOLAR, WHERE WE ARE LOOKING FOR MENTIONS OF DATASETS IN THE LITERATURE, AND WE SURFACE THIS INFORMATION TO YOU AS WELL. SO IF A GIVEN DATASET WAS MENTIONED IN A PAPER INDEXED BY GOOGLE SCHOLAR, YOU CAN SEE ALL OF THOSE MENTIONS AND EXPLORE HOW OTHER PEOPLE REUSE THIS DATASET IN SCIENCE. 
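For the sitemap step mentioned above, a repository could generate the file listing its dataset landing pages roughly like this. It is a minimal sketch using only the Python standard library; `build_sitemap` and the example URLs are hypothetical, and a real sitemap would usually be written to a file and referenced from `robots.txt` or submitted through Search Console.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(landing_page_urls):
    """Return sitemap XML (as a string) with one <url><loc> entry per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in landing_page_urls:
        url_element = ET.SubElement(urlset, "url")
        ET.SubElement(url_element, "loc").text = page_url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical landing pages, one per dataset.
    print(build_sitemap([
        "https://repository.example.org/datasets/1",
        "https://repository.example.org/datasets/2",
    ]))
```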
SO TECHNICALLY SPEAKING, WE BASICALLY TAP INTO THE GOOGLE CRAWLING INFRASTRUCTURE, AND WE DO SOME OFFLINE PROCESSING WHERE WE DO DATA CLEANING AND RECONCILIATION WITH THE KNOWLEDGE GRAPH. WE ALSO DO REPLICA IDENTIFICATION, WHICH BASICALLY MEANS THAT WE ARE STARTING TO SEE SOME DATASETS BEING HOSTED IN MULTIPLE LOCATIONS. IT IS STILL A SMALL PERCENTAGE OF ALL DATASETS OUT THERE, BUT TO PROVIDE HIGH-QUALITY RESULTS WE DON’T WANT TO SHOW DUPLICATES, AND WE HAVE TO FIGURE OUT WHAT IS THE BEST LOCATION TO POINT THE USER TO FOR A GIVEN DATASET. IT IS A SIMILAR CASE AS WITH PAPERS, WHERE YOU COULD HAVE A PAPER THAT IS HOSTED ON A PREPRINT SERVER AND ALSO ON A PUBLISHER WEBSITE, AND YOU HAVE TO FIGURE OUT WHICH ONE YOU SHOULD POINT THE USER TO. WE ALSO DO LINK DISCOVERY: WE BASICALLY COMB THROUGH, LOOKING FOR THE UNIQUE IDENTIFIERS. WE HAVE TO FIGURE OUT A UNIQUE IDENTIFIER FOR A GIVEN DATASET; NOT ALL USE THE [INDISCERNIBLE] IDENTIFIERS, AND THINGS GET A LITTLE BIT TRICKY. THEN WE PUSH THAT TO OUR SERVING APIS, AND WE USE ANOTHER INTERNAL SERVICE FOR RANKING, SO WE DO TAKE ADVANTAGE OF A LOT OF RANKING INNOVATIONS THAT ARE USED IN THE MAIN SEARCH PRODUCT. THAT’S HOW YOU GET THE RESULTS IN THE PRODUCT. IN TERMS OF NORMALIZING THE METADATA, THIS IS A FAIRLY BORING TOPIC, BUT IT IS IN MANY CASES A LOT OF WHAT WE DO. IF YOU HAVEN’T NOTICED THAT WORK, THAT MEANS THAT WE HAVE BEEN DOING IT WELL. IN OTHER WORDS, WE HAVE INCONSISTENCIES IN THE REPRESENTATIONS OF DIFFERENT METADATA, AND WE HAVE TO MAKE THEM MORE CONSISTENT SO THE USER SEES THINGS PROPERLY. THIS IS AN EXAMPLE OF HOW SOMEONE MIGHT SAY HOW ONE CAN DOWNLOAD THE DATASET. AS YOU CAN SEE, THE FIRST, SECOND AND THIRD OPTIONS USE DIFFERENT FIELD NAMES, SUCH AS FILE FORMAT, WITH DISTRIBUTION BEING AN OBJECT VERSUS BEING JUST A URL. THE ACTUAL DOWNLOAD FORMAT IS EITHER A MIME TYPE, AS IN THE FIRST CASE, OR A CAPITALIZED FILE EXTENSION. 
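The kind of clean-up described above, where three providers express the same download in three different shapes, might look like the sketch below. This is not Google's pipeline, which the talk does not detail; it is a hypothetical normalizer. `distribution`, `contentUrl`, `encodingFormat` and `fileFormat` are schema.org property names, while the function names and the small extension table are assumptions made for illustration.

```python
# Tiny illustrative lookup; a real system would need a far larger table.
EXTENSION_TO_MIME = {
    "csv": "text/csv",
    "json": "application/json",
    "zip": "application/zip",
}


def normalize_format(value):
    """Map 'CSV', '.csv' or 'text/csv' to one canonical MIME type string."""
    if not value:
        return None
    cleaned = value.strip().lower().lstrip(".")
    if "/" in cleaned:  # already looks like a MIME type
        return cleaned
    return EXTENSION_TO_MIME.get(cleaned, cleaned)


def normalize_distribution(record):
    """Return a list of {'contentUrl', 'encodingFormat'} dicts for a record."""
    raw = record.get("distribution")
    if raw is None:
        return []
    items = raw if isinstance(raw, list) else [raw]
    normalized = []
    for item in items:
        if isinstance(item, str):
            # Distribution given as a bare URL; format may sit on the record.
            url, fmt = item, record.get("fileFormat")
        else:
            # Distribution given as an object with varying field names.
            url = item.get("contentUrl") or item.get("url")
            fmt = (item.get("encodingFormat") or item.get("fileFormat")
                   or record.get("fileFormat"))
        normalized.append({"contentUrl": url,
                           "encodingFormat": normalize_format(fmt)})
    return normalized
```

With this sketch, an object carrying a MIME type, an object carrying a capitalized extension, and a bare URL with a record-level format all reduce to the same common form.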
WE DO A LOT OF THINGS LIKE THAT, WHERE WE TAKE ALL OF THIS AND PUT IT INTO A COMMON FORM. PART OF THE ECOSYSTEM THAT I HAVE TALKED ABOUT ALSO INVOLVES BEING AN ADVOCATE OF GOOD PRACTICES IN DATA PUBLICATION. WE PLAY A SMALL ROLE IN THE DATA ECOSYSTEM; WE ARE TRYING TO HELP MOSTLY WITH FINDABILITY. BUT THE WORK WE PUT INTO EVANGELIZING GOOD METADATA PRACTICES WILL HOPEFULLY HELP WITH THE OTHER ASPECTS. FOR EXAMPLE, WITH CITATIONS, WE ARE A BIG PROPONENT OF IDENTIFIERS, INCLUDING COMPACT IDENTIFIERS, WHICH MAKE COUNTING MENTIONS AND CITATIONS MUCH EASIER. AS WE ALL KNOW, CITATION IS THE CURRENCY IN SCIENCE, SO HAVING GOOD, HEALTHY, CONSISTENT DATA CITATIONS WILL HELP MAKE DATASETS A VALUABLE SCHOLARLY OUTPUT THAT HELPS PEOPLE MAINTAIN CAREERS. WE ALSO CONSTANTLY TALK TO NEW DATA PROVIDERS AND DEVELOPERS TO MAKE SURE THAT WE INCREASE OUR COVERAGE, BUT IN THAT PROCESS WE ALSO EXPOSE PEOPLE TO STANDARDS IN DATA PUBLISHING. SO WE ARE SORT OF THIS GLUE: WE TALK TO A LOT OF DATA PUBLISHERS DOING SIMILAR JOBS, AND WE HELP TO SHARE KNOWLEDGE ACROSS THIS COMMUNITY OF DATA PUBLISHERS. IN THE PROCESS OF BUILDING THIS PRODUCT WE HAVE LEARNED A FEW LESSONS. FIRST OF ALL, WE REALLY WANTED TO FOCUS ON THE ECOSYSTEM, ON TRYING TO MAKE THIS A SCALABLE THING WHERE WE PROPOSE CERTAIN STANDARDS AND HELP PEOPLE ADOPT THEM, RATHER THAN BUILDING INDIVIDUAL SOLUTIONS FOR EVERY PARTNER. THAT ECOSYSTEM ASPECT, CONVINCING THE COMMUNITY THAT THIS IS SOMETHING WORTH DOING, IS IMPORTANT AND MUCH HARDER THAN BUILDING ANY TECHNICAL SOLUTION. WE ALSO WANTED TO DO THIS USING A STANDARD THAT IS ALREADY ESTABLISHED. IT IS SOMETHING THAT IS NOT ONLY USED BY GOOGLE, AND SOMETHING THAT EVERYONE CAN CONTRIBUTE TO. I ENCOURAGE YOU TO CHECK OUT SCHEMA.ORG TO SEE THE DISCUSSIONS THAT ARE HAPPENING THERE. 
ALSO, ESPECIALLY IN THE VERY EARLY DAYS, WE REALIZED WE HAD TO WORK CLOSELY WITH PEOPLE WHO ARE BELIEVERS IN THIS MISSION AND HAVE SOME LEADERS IN THE DATA PUBLISHING ECOSYSTEM SHOW EVERYONE ELSE THAT THIS IS WORTH DOING. IT HAS PAID OFF SO FAR. WE HAVE AROUND 20 MILLION DATASETS AND 4,000 REPOSITORIES RIGHT NOW. WE ARE FOCUSING ON THE LONG TAIL. BECAUSE BECOMING ELIGIBLE IS RELATIVELY EASY AND DOESN’T REQUIRE ANY INTERNAL CONNECTIONS AND WHATNOT, WE ARE PICKING UP VERY SMALL DATASETS. WE RANGE FROM DATA.gov, WHICH HAS SOMETHING LIKE TWO AND A HALF MILLION DATASETS, TO A RESEARCHER FROM THE CZECH REPUBLIC WHO PUBLISHED AN ML DATASET FOR FACE RECOGNITION. THAT’S THE ONLY DATASET THEY PUBLISHED, AND THEY PUT THE METADATA ON THEIR HOME PAGE, WHERE YOU CAN DOWNLOAD IT. THIS IS THE BREADTH OF WHAT WE HAVE. WE DON’T LIMIT OURSELVES TO SCIENTIFIC DATA; WE ALSO INDEX A LOT OF OPEN GOVERNMENT DATA. THAT GOES BEYOND THE U.S. AND BEYOND THE ENGLISH-SPEAKING WORLD. WE HAVE DATASETS FROM GOVERNMENTS ALL OVER THE WORLD. WE ARE RELATIVELY BIG IN BRAZIL AND INDIA, AND WE ARE HELPING CITIZEN SCIENTISTS TO MOVE FORWARD. WE ARE ALSO SEEING USERS IN EDUCATION AND JOURNALISM, ESPECIALLY WITH THE RISE OF DATA JOURNALISM. IT HAS BEEN A VERY EXCITING JOURNEY SO FAR. SO NOW I AM GOING TO SWITCH GEARS A LITTLE BIT. WHO HERE IS INVOLVED IN ONE OF THE MANY NIH DATA REPOSITORIES? YAY, THANK YOU FOR YOUR SERVICE. SO THE NEXT BIT IS DEFINITELY ADDRESSED TO YOU, BUT IT IS ALSO USEFUL FOR ANYONE BUILDING ANY CONTENT THAT THEY WOULD LIKE TO BE FOUND ON THE INTERNET. WE ARE GOING TO TALK ABOUT THE LIFE OF A DATASET SEARCH QUERY. THIS IS AN IMPORTANT POINT OF VIEW IF YOU ARE TRYING TO MAKE YOUR DATA FINDABLE. IT CAN BE DIVIDED INTO THREE PARTS: QUERYING, SO ACTUALLY TYPING THE THING THAT YOU ARE LOOKING FOR, THEN SCANNING THE RESULTS, AND FINALLY EVALUATING THE LANDING PAGE. 
ALTHOUGH, AND THIS IS A SPOILER, ONLY THE FIRST TWO REALLY INVOLVE US. THE THIRD ONE, THE LANDING PAGE, IS REALLY UP TO THE DATA PROVIDERS. LET’S TALK ABOUT QUERYING. THOSE ARE SOME OF THE QUERIES THAT WE SEE, OR ARE SIMILAR TO ONES THAT WE SEE, IN DATASET SEARCH. MANY OF THEM ARE QUITE GENERIC. SOMETIMES PEOPLE WILL USE SPECIALIZED LANGUAGE. QUITE OFTEN PEOPLE START WITH A VERY GENERIC TERM, AND THEN THERE IS A PROCESS OF QUERY REFINEMENT: PEOPLE GET SOME RESULTS, THEY SEE RESULTS THEY DON’T WANT TO SEE, AND THEY ADD A MORE SPECIFIC TERM TO NARROW IT DOWN. WHAT IS USEFUL, GIVEN THE GENERIC NATURE OF QUERIES, IS TO PROVIDE AT LEAST TO SOME DEGREE A GENERIC DESCRIPTION OF A DATASET. THIS IS ESPECIALLY IMPORTANT BECAUSE WHOEVER MIGHT BENEFIT FROM YOUR DATASET MIGHT NOT BE SOMEONE YOU KNOW VERY WELL. IN OTHER WORDS, IT MIGHT BE AN OUTSIDER TO THE FIELD THAT YOU ARE STUDYING. IT IS THIS CONUNDRUM: IF SOMEONE IS AN EXPERT IN YOUR PARTICULAR FIELD, THEY PROBABLY ALREADY KNOW ABOUT YOUR RESOURCE. BUT YOUR RESOURCE MIGHT BE BENEFICIAL TO PEOPLE FROM OTHER FIELDS, ESPECIALLY NOW WITH THE APPLICATION OF ML AND DATA SCIENCE. SO PROVIDING A MORE GENERIC DESCRIPTION THAT COULD CROSS THAT DIVIDE CAN YIELD LARGE BENEFITS. THEN, CERTAIN DATASETS MIGHT BE KNOWN UNDER DIFFERENT SYNONYMS, AND THOSE SYNONYMS HELP US FIND DATASETS IN THE LITERATURE. FINALLY, LINKING THE DATASET TO EXTERNAL ENTITIES IS ALSO VERY USEFUL FOR BUILDING A RICHER PICTURE, NOT ONLY OF THE DATASET ITSELF BUT ALSO OF HOW THE DATASET RELATES TO OTHER DATASETS AND WHICH OTHER ORGANIZATIONS ARE INVOLVED IN IT. WE ARE PROMOTING THE USE OF IDENTIFIERS, WHICH GIVE US ENOUGH TO PROVIDE RICH CONTEXT, AND WE CAN SEARCH FOR THOSE. THOSE DO CORRESPOND TO SPECIFIC ENTITIES IN THE METADATA STANDARD: THE DESCRIPTION GOES INTO DESCRIPTION, SYNONYMS GO INTO ALTERNATE NAMES, AND EXTERNAL LINKS GO INTO IDENTIFIERS AND CREATORS, AMONG OTHERS. THERE IS ALSO BASICALLY A WAY TO SAY THAT THIS GIVEN DATASET IS A COPY OR MIRROR OF ANOTHER DATASET. 
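The mapping from these recommendations to metadata fields can be sketched as below. The property names (`alternateName`, `identifier`, `sameAs`, `creator`) are schema.org properties that fit the talk's description, but the exact choices on the speaker's slide are not in the transcript, so treat the selection as an assumption. Every value, including the DOI, the mirror URL and the ORCID iD, is a placeholder.

```python
import json


def richer_dataset_record():
    """Build an illustrative Dataset record with synonyms and external links."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": "Example Resting-State Collection",
        # Start generic, then add the field-specific terms experts search for.
        "description": (
            "Brain scans of volunteers lying still in a scanner. "
            "Resting-state fMRI, hypothetical example record."
        ),
        # Synonyms under which the dataset may be mentioned in the literature.
        "alternateName": ["ERSC", "Example rs-fMRI"],
        # Persistent identifier for the dataset (placeholder DOI).
        "identifier": "https://doi.org/10.0000/placeholder",
        # Another location holding a copy or mirror of the same dataset.
        "sameAs": "https://mirror.example.org/datasets/ersc",
        # People behind the dataset, linked to an external identifier.
        "creator": {
            "@type": "Person",
            "name": "Jane Doe",
            "sameAs": "https://orcid.org/0000-0000-0000-0000",
        },
    }


if __name__ == "__main__":
    print(json.dumps(richer_dataset_record(), indent=2))
```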
YOU CAN ALSO SAY WHICH OTHER DATASETS IT IS A PART OF, AS WELL AS THE OPPOSITE OF THAT RELATION. FOR IDENTIFIERS, WE SUPPORT AND ENCOURAGE PEOPLE TO USE IDENTIFIERS COMMONLY ADOPTED IN SCHOLARLY PUBLISHING, SUCH AS DIGITAL OBJECT IDENTIFIERS, OR COMPACT IDENTIFIERS, KNOWN FROM IDENTIFIERS.ORG, WHICH IS A SLIGHTLY EASIER SCHEME TO ADOPT, BOTH FOR DATASETS AND FOR RELATED ARTICLES. WE HAVE THE CONCEPT OF ARTICLES THAT ARE RELATED TO THE DATASET. THOSE ARE USUALLY THE RELATED PAPERS OR THE SCIENTIFIC PUBLICATION THAT THIS DATASET SUPPORTS, WHICH IS SLIGHTLY DIFFERENT FROM SECONDARY REUSE OF THE DATASET. IN TERMS OF RESEARCHERS, CREATORS AND MAINTAINERS, AS WELL AS ORGANIZATIONS AND [INDISCERNIBLE] PROGRAMS, WE ENCOURAGE REGISTERED IDENTIFIERS SUCH AS ORCID FOR THE PEOPLE BEHIND THE DATASET. SO AFTER THE QUERY COMES SCANNING THE RESULTS. YOU DO THE QUERY, REFINE IT AND LOOK AT THE RESULTS YOU HAVE. HERE THE EXPERIENCE IS SLIGHTLY DIFFERENT FROM THE MAIN SEARCH PRODUCT. YOU HAVE A RICHER DESCRIPTION OF EACH DATASET. THIS IS BASICALLY WHAT WE SHOW, BUT YOU ARE VERY MUCH IN CONTROL OF WHAT IS BEING SHOWN HERE, BECAUSE THAT INFORMATION IS TAKEN FROM THE METADATA PROVIDED BY DATA PROVIDERS. TO OPTIMIZE THIS, WHAT I USUALLY ENCOURAGE IS AN ANALOGY TO WIKIPEDIA. ALL OF US READ WIKIPEDIA PAGES, AND THEY ALMOST ALWAYS, ESPECIALLY THE GOOD ONES, START WITH A GENERIC, SIMPLE-ENGLISH DESCRIPTION OF THE TOPIC. SUCH AN APPROACH TO DATASET DESCRIPTIONS HELPS TO BRIDGE THAT DIVIDE BETWEEN FIELD EXPERTS AND EXPERTS IN ANOTHER FIELD THAT I TALKED ABOUT BEFORE. YOU HAVE TO SORT OF GET OUTSIDE OF THE BUBBLE OF THE DECADES OF EXPERIENCE YOU HAVE IN THIS PARTICULAR FIELD AND TRY TO INCLUDE A MORE GENERIC DESCRIPTION OF THE PARTICULAR DATASET TO BEGIN WITH, BUT NOT STRICTLY AT THE COST OF THE SPECIFIC EXPERT LANGUAGE, BECAUSE CERTAIN FIELD-SPECIFIC TERMS COULD BE USED BY USERS TO QUERY FOR THE DATASET. IT HAS TO BE A MIX OF THE TWO. AND FINALLY, IMAGES ARE POSSIBLE IN THE DESCRIPTIONS. WE HAVE DOCUMENTATION ON HOW TO DO THIS. 
THAT MAKES THE RESULTS MORE ATTRACTIVE, BUT IT ALSO MAKES IT EASIER IN SOME CASES FOR USERS TO DECIDE WHETHER THIS IS THE DATASET THEY WANT, ESPECIALLY WITH IMAGING DATA OF VARIOUS KINDS, OF WHICH SO MUCH IS PRODUCED HERE AT NIH. SO THAT’S ABOUT SCANNING. THERE ARE A FEW MORE PIECES THAT ARE USEFUL. LICENSES: WE DO KNOW THAT THERE IS A NONTRIVIAL SUBSET OF PEOPLE SEARCHING FOR DATA WHO HAVE SPECIFIC LICENSING NEEDS. PRIVATE RESEARCH LABS AS WELL AS PRIVATE COMPANIES, FOR EXAMPLE, ONLY WANT DATASETS THAT ARE ALLOWED FOR COMMERCIAL USE. LICENSING INFORMATION IS USEFUL FOR THESE PEOPLE. THE SAME GOES FOR SPATIAL COVERAGE AND TEMPORAL COVERAGE, IF THAT APPLIES TO YOU. THIS IS HISTORICAL DATA AND ECONOMIC DATA AS WELL AS GEOGRAPHICAL DATA. THAT COVERS SCANNING. FINALLY WE GO TO THE LANDING PAGE. THIS IS SOMETHING I DIDN’T APPRECIATE MUCH BEFORE I STARTED THIS JOB, HAVING RUN DATA REPOSITORIES MYSELF. THE WAY I INTERACTED WITH THE DATA PORTALS I BUILT WAS THAT YOU GO TO THE MAIN PAGE OF YOUR DATA PORTAL. THAT’S THE ENTRY POINT. YOU START LOOKING FOR DATASETS AND EXPLORING DATASETS. WHATEVER YOU REMEMBER FROM THE MAIN PAGE YOU INCORPORATE INTO YOUR JOURNEY, YOUR COGNITIVE IMAGE OF THE DATASET THAT YOU FIND. THIS IS NOT HOW A SEARCH ENGINE WORKS. PEOPLE MIGHT DISCOVER YOUR DATA PORTAL THROUGH SEARCH ENGINES, SKIPPING ANY IN-BETWEEN PAGES. SO IT IS WORTHWHILE INVESTING IN THAT LANDING PAGE SO IT WILL GIVE PEOPLE MORE CONTEXT AND INTRODUCE THE REPOSITORY. ALSO VERY IMPORTANT, AND NOT ALWAYS OBVIOUS: MAKE IT CLEAR HOW TO ACCESS THE DATA AND, IF THERE IS SOME PROCEDURE TO GAIN ACCESS, WHAT THAT PROCEDURE IS AND HOW PEOPLE NEED TO MOVE FORWARD. FINALLY, SOMETIMES PEOPLE WILL FIND A DATASET THAT IS NOT EXACTLY WHAT THEY WANTED, BUT MAYBE THE REPOSITORY HAS EXACTLY WHAT THEY DO WANT. IF YOU HAVE AN INTERNAL SEARCH THAT SEARCHES WITHIN THE REPOSITORY, EXPOSING THAT ON EVERY LANDING PAGE WOULD HELP PEOPLE TO EXPLORE YOUR RESOURCES BETTER. OKAY. SO THIS FINISHES THE JOURNEY OF A DATASET USER, AND HOPEFULLY THIS WILL BE USEFUL WHEN PEOPLE ARE TRYING TO OPTIMIZE THEIR RESOURCES IN TERMS OF FINDABILITY. I HAVE GOT A FEW MORE MINUTES, SO I AM GOING TO TALK A LITTLE BIT ABOUT OUR FUTURE DIRECTIONS THAT ARE SPECIFICALLY OF INTEREST TO FUNDING BODIES LIKE NIH. WE STARTED WORKING WITH SCHEMA.ORG ON DESCRIBING EXACTLY THE INFORMATION ABOUT WHO BASICALLY PAID FOR ACQUISITION OF A GIVEN DATASET. THIS IS INFORMATION ABOUT BOTH THE FUNDER, SO HERE IT IS THE GOVERNMENT ORGANIZATION, AS WELL AS THE PARTICULAR GRANT NUMBER. 
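A rough sketch of how licensing, coverage and funding information could be attached to a dataset record follows. `license`, `temporalCoverage`, `spatialCoverage`, `funder` and `funding` are schema.org property names, but the talk stresses that the funding representation was still being worked out, so the shape used here (a `MonetaryGrant` carrying the grant number as `identifier`) is an assumption and not the slide's content. The function name and all values are hypothetical.

```python
def with_reuse_and_funding(record, license_url, temporal_coverage,
                           spatial_coverage, funder_name, grant_id):
    """Return a copy of `record` with licensing, coverage and funding fields."""
    enriched = dict(record)  # shallow copy, so the caller's record is untouched
    enriched.update({
        "license": license_url,
        # ISO 8601 interval, e.g. "2015-01-01/2018-12-31".
        "temporalCoverage": temporal_coverage,
        "spatialCoverage": spatial_coverage,
        # Who paid for acquisition of the dataset.
        "funder": {"@type": "Organization", "name": funder_name},
        # Assumed representation of the particular grant number.
        "funding": {"@type": "MonetaryGrant", "identifier": grant_id},
    })
    return enriched


if __name__ == "__main__":
    base = {"@context": "https://schema.org/", "@type": "Dataset",
            "name": "Example dataset", "description": "Placeholder."}
    print(with_reuse_and_funding(
        base,
        "https://creativecommons.org/publicdomain/zero/1.0/",
        "2015-01-01/2018-12-31",
        "Example Region",
        "Example Funding Agency",
        "GRANT-0000",  # placeholder, not a real grant number
    ))
```

A consistent structure like this is what would let a search engine answer a query such as "all datasets published under a given grant number".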
BEFORE YOU TAKE A PICTURE, I AM GOING TO STRESS THAT THIS IS A WORK IN PROGRESS, WHICH MEANS PLEASE DO TAKE A PICTURE, BUT THE DETAILS OF THIS MIGHT CHANGE. THE BOTTOM LINE HERE IS THAT WE ARE INTERESTED IN HELPING TO PROVIDE A CONSISTENT WAY FOR DATA REPOSITORIES TO DESCRIBE THIS, WHICH HOPEFULLY WILL MAKE IT EASIER TO, FOR EXAMPLE, SOMETIME IN THE FUTURE LOOK FOR ALL OF THE DATASETS PUBLISHED FOR A GIVEN GRANT NUMBER OR FOR A GIVEN ORGANIZATION. SO HOPEFULLY THIS IS SOMETHING THAT WILL HELP QUITE A FEW PEOPLE AROUND HERE. SO THIS IS ALL I HAVE GOT FOR TODAY. I THANK YOU VERY MUCH FOR YOUR ATTENTION, AND I AM OPEN TO QUESTIONS. [APPLAUSE] >>THERE ARE 104 PEOPLE LISTENING ONLINE. IF THERE ARE PEOPLE ONLINE WHO HAVE QUESTIONS, YOU CAN E-MAIL THEM TO ME, ADAM [email protected], OR ON TWITTER.>>MANY PEOPLE THINK OF GOOGLE AS ONE OF THE BIGGEST CREATORS OF DATASETS. HOW MANY DATASETS IS GOOGLE CONTRIBUTING [INAUDIBLE].>>I AM GOING TO REPEAT THE QUESTION FOR THE PEOPLE LISTENING. GOOGLE IS CONSIDERED THE BIGGEST CREATOR OF DATASETS, AS YOU HAVE SAID. I DON’T KNOW IF THAT’S TRUE, REALLY. BUT THE QUESTION REALLY IS HOW MANY DATASETS GOOGLE CONTRIBUTED TO DATASET SEARCH. IT IS A GREAT QUESTION. GOOGLE AI HAS A LIST OF PUBLISHED DATASETS, AND ALL ARE INDEXED IN DATASET SEARCH. THIS ISN’T A LARGE NUMBER. IT IS SOMETHING LIKE 20 OR 30 DATASETS, AND MOST ARE DRIVEN BY A PARTICULAR MACHINE LEARNING USE. GOOGLE SISTER COMPANIES, OR ALPHABET COMPANIES, SUCH AS WAYMO HAVE DIFFERENT DATASETS. TO ANSWER YOUR QUESTION, THOSE THAT ARE PUBLISHED BY GOOGLE ARE INDEXED, BUT IT IS NOT A HUGE NUMBER IN TERMS OF HOW MANY DATASETS GOOGLE ACTUALLY PUBLISHES. I AM SURE NIH PUBLISHES MORE AND GOOGLE LESS.>>[INAUDIBLE QUESTION] >>SO THE QUESTION IS ABOUT WEB INDEX DATASETS. THERE IS A DATASET, I BELIEVE PUBLISHED BY GOOGLE, THAT WAS FOR A COMPETITION TO HELP WITH RANKING. IT BASICALLY HAS QUERIES AND MATCHING SEARCH RESULTS, AND THIS WAS BEFORE DATASET SEARCH. THAT DATA IS PUBLISHED. 
I BELIEVE IT IS INDEXED BY DATASET SEARCH. THAT’S THE TYPE OF DATASET WE ARE INTERESTED IN.>>[INAUDIBLE QUESTION] >>THAT’S CORRECT. WE HAVE SOME PROVIDERS THAT LABEL DATA REPOSITORIES AS DATASETS, AND I UNDERSTAND THE ONTOLOGICAL DISTINCTION. WE HAVEN’T FOUND THOSE SHOWING UP, ESPECIALLY FOR GENERIC QUERIES, TO BE DETRIMENTAL TO THE USER EXPERIENCE SO FAR. THAT MIGHT CHANGE IN THE FUTURE, BUT THAT’S WHY WE SORT OF DON’T CRACK DOWN ON THIS.>>MOST OF THE WORK THAT YOU ARE DOING AND GOOGLE IS DOING IS REALLY TO AUTOMATE THE PROCESS OF FINDING REPOSITORIES AND ORGANIZING THEM. SO IN OTHER WORDS, I WAS JUST PUTTING IN THE TERM [INAUDIBLE]. WHAT I FOUND WAS ONE REPOSITORY ON AWS; I DIDN’T FIND THE ONE ON [INAUDIBLE]. IS THAT BECAUSE WHOEVER PUT THE DATA ON AWS HAS ENOUGH ON THE PAGE FOR YOU TO PICK UP... >>THAT IS THE REASON WHY. THAT’S WHY THIS ALL WORKS THROUGH PARTNERSHIP, AND WHY ADDING METADATA TO THE PAGES IS SO IMPORTANT. WE ARE NOT INDEXING PAGES WHICH DO NOT SPECIFICALLY TELL US: YES, I AM A DATASET, YES, I WANT TO BE INCLUDED IN THE SEARCH RESULTS.>>WITHIN NIH, AS THE FUNDING AGENCY, WE FOUND A LOT OF DATASETS IN REPOSITORIES. IS THERE A WAY FOR FUNDING AGENCIES, OR, THE TERM YOU USED, THE OWNER OF THE DATASET, TO CLAIM THE DATASET AND SAY: THIS IS SOMETHING I CREATED, WE FOUND IT, THIS METADATA IS NOT CORRECT, AND HOW DO WE CORRECT IT?>>IT IS A GREAT QUESTION. I THINK THE BEST WAY TO GO ABOUT IT IS FOR YOU, AS THE PRIMARY SOURCE OF THE DATA, TO ANNOTATE YOUR RESOURCE WITH THE METADATA, AND WE WILL BASICALLY MAKE SURE THAT EITHER WE DEDUPLICATE THESE THINGS OR THEY WILL AT LEAST SHOW HIGHER UP IN THE RANKINGS. BUT THE BOTTOM LINE IS, IF YOU PUBLISH A DATASET AND YOU WANT IT TO BE REPRESENTED ACCURATELY IN SEARCH RESULTS, YOU SHOULD MAKE SURE THAT YOU ADD THE METADATA TO YOUR LANDING PAGE. 
THERE ARE MANY WONDERFUL DATA RESOURCES AT THE NIH THAT WOULD BENEFIT GREATLY FROM ADDING THESE PIECES OF METADATA.>>[INAUDIBLE QUESTION] >>SO THE QUESTION WAS ABOUT DATA PAPERS AND DATA JOURNALS: HOW DO THEY INTERACT, WHAT ROLE DO THEY PLAY AND HOW DO THEY FIT IN? I THINK THE BIGGEST BENEFIT OF DATA JOURNALS AND DATA PAPERS IS THAT THEY INCENTIVIZE SCIENTISTS TO PUBLISH DATA. THEY GIVE SCIENTISTS BACK SOMETHING THAT IS RECOGNIZABLE IN THE SCIENTIFIC SPHERE. IN TERMS OF US INTERACTING WITH THE JOURNALS, WE TALK, FOR EXAMPLE, WITH SCIENTIFIC DATA, BECAUSE WE HAVE A LOT OF TOPICS IN COMMON: TYPES OF METADATA, WHAT MAKES A GOOD REPOSITORY AND THINGS LIKE THAT. WE DO NOT INDEX THOSE PAPERS DIRECTLY. WE INDEX THE REPOSITORIES THEY REQUIRE AUTHORS TO PUT THE DATA IN. THOSE REPOSITORIES WILL PUT IN THEIR METADATA THE INFORMATION ABOUT WHAT THE RELATED DATA PAPER IS, AND WE WILL SURFACE THAT INFORMATION IN THE SEARCH RESULTS. IN TERMS OF THE DESCRIPTION, THIS IS A VERY GOOD POINT, THAT DATA PAPERS OFTEN DON’T HAVE A CROSS-FIELD DESCRIPTION. THE DATA PAPERS ARE VERY SOAKED IN THE FIELD-SPECIFIC NOMENCLATURE. IT IS INTERESTING TO BRING THAT UP AND TALK TO THE EDITORS OF SOME OF THE MAJOR JOURNALS. THIS IS SOMETHING WE CAN DO. GREAT IDEA.>>[INAUDIBLE QUESTION] >>SURE. OF COURSE. THIS IS ACTUALLY A PAPER THAT WAS PRESENTED A COUPLE OF WEEKS AGO. I THINK IT ISN’T ON THE PREPRINT SERVERS YET, BUT IT WILL BE OUT THERE. IN TERMS OF THE GENERAL GUIDELINES, I PRESENTED THEM HERE, BUT THE ACTUAL SPECIFICATION AND ALL OF THOSE RECOMMENDATIONS ARE ONLINE, AT THE FIRST LINK THERE. WE ARE DEFINITELY INVESTING A LOT INTO DISSEMINATING THOSE GUIDELINES. BUT THAT SPECIFIC STUFF I THINK WOULD BE GREAT TO PUT OUT AS WELL. I SAW SOME HANDS. ADAM?>>I GOT A QUESTION FROM ONLINE. THIS IS FROM [INAUDIBLE]. HER QUESTION IS, WHAT IS THE LIMIT OF THE METADATA THAT CAN BE ADDED TO THE SCHEMA.ORG [INAUDIBLE].>>WHAT IS THE LIMIT OF THE METADATA AT SCHEMA.ORG? THERE ARE LIMITS ON THE TYPES OF FIELDS WE INDEX. 
UNDER THIS YOU WILL SEE ALL OF THE DIFFERENT FIELDS THAT WE DO INDEX. SCHEMA.ORG HAS MANY MORE FIELDS. WE SORT OF HAVE TO START SOMEWHERE. SO THAT IS ONE LIMITATION, AND THE OTHER LIMITATION IS THE TOTAL SIZE OF THE METADATA. WE ALSO PROVIDE INFORMATION THERE ON HOW LARGE THAT DESCRIPTION CAN BE. I BELIEVE RIGHT NOW IT IS AT 5,000 CHARACTERS. DON’T QUOTE ME ON THAT. THERE WERE MORE QUESTIONS BUT I FORGOT. OKAY.>>IT STRIKES ME THERE ARE A FEW TYPES OF DATA THAT DON’T FIT INTO THIS TYPE OF FRAMEWORK. THE FIRST ONE THAT CAME TO MIND, I KNOW THERE ARE WEB KTSZ WITH THIS. [INAUDIBLE].>>THE QUESTION WAS, WHAT ABOUT CASES SUCH AS DATASETS LIKE WIKIPEDIA? WHAT ABOUT DATASETS THAT ARE LIVE, OR STREAMS THAT ARE LIVE? RIGHT NOW WE ARE FOCUSING ON THE [INDISCERNIBLE]. SOME OF THIS CAN BE DONE. IF YOU HAVE AN API, YOU CAN ANNOTATE IT AS A DATASET AND PEOPLE WILL FIND IT; THEY ARE JUST GOING TO INTERACT WITH IT DIFFERENTLY. IT IS DEFINITELY SOMETHING THAT WE ARE THINKING ABOUT.>>[INAUDIBLE QUESTION] >>A GREAT QUESTION. HOW DO WE QUANTIFY OR EVALUATE WHETHER INCREASING FINDABILITY ACTUALLY DOES INCREASE REUSE? THE SHORT ANSWER IS WE DON’T. IT IS A VERY HARD QUESTION, BECAUSE IT IS A VERY CONVOLUTED ROUTE. IF YOU TALK ABOUT SCIENTIFIC REUSE, IT GOES FROM SOMEONE HAVING AN IDEA, GOING TO DATASET SEARCH, MAKING A QUERY, CLICKING ON THE DATASET, AND THAT’S WHERE WE LOSE THEM. SOME OF THEM WILL DECIDE THIS IS NOT FOR THEM. SOME WILL DOWNLOAD THE DATASET AND DECIDE IT IS NOT FOR THEM. SOME WILL DOWNLOAD THE DATASET, ANALYZE IT AND FIND THERE IS NOTHING INTERESTING THERE. SOME WILL DOWNLOAD THE DATASET, FIND SOMETHING AND WON’T BE ABLE TO PUBLISH IT. EVENTUALLY SOMEONE WILL ACTUALLY PUBLISH. THAT IS SORT OF THE GOLD STANDARD OF REUSE: SOMEONE ACTUALLY PUBLISHED A NOVEL FINDING BASED ON THIS DATA. FINDING DATA REUSES IS NOT TRIVIAL, BUT WE HAVE MADE HUGE PROGRESS ON THAT IN VARIOUS DIFFERENT ORGANIZATIONS. IT IS HARD. 
BUT THEN DISTINGUISHING WHICH OF THEM DATASET SEARCH HELPED WITH IS HARD. I WOULD LOVE TO KNOW THAT. FOR ME, ULTIMATELY THAT’S THE BIGGEST MOTIVATION, BUT WE CAN ONLY USE PROXIES, LOOKING AT WHETHER PEOPLE COME BACK AND WHETHER PEOPLE ACTUALLY USE THE TOOL ON A REGULAR BASIS. YOU CAN ASSUME THAT THIS IS NOT A COMMON HOBBY, SO PEOPLE ACTUALLY GAIN SOME PRODUCTIVITY ADVANTAGE FROM IT. THAT IS A PROXY. IT IS A VERY IMPORTANT QUESTION WHICH I THINK WE SHOULD ASK OURSELVES A LOT.>>WE HAD A REQUEST FROM TWITTER TO MIC THE AUDIENCE.>>SO YOU WERE TALKING ABOUT EXTENDING FIELDS, FOR EXAMPLE WITH FUNDING BODY INFORMATION. ANOTHER POTENTIAL EXTENSION WOULD BE ETHICS BOARDS, SO HUMAN SUBJECT BOARDS OR ANIMAL RESEARCH BOARDS. IF IT IS A STUDY IN THOSE CATEGORIES, IT WAS APPROVED BY THIS BOARD UNDER THIS PROTOCOL NUMBER, BECAUSE THERE MIGHT DEFINITELY BE NEEDS, IN THE SHORT TERM OR LONG TERM, TO PULL THAT INFORMATION UP AS WELL.>>LIKE ETHICAL BOARD APPROVALS, THIS SORT OF STUFF.>>YES.>>VERY INTERESTING. I HAVEN’T THOUGHT ABOUT THIS. IT IS A GREAT IDEA. WE SHOULD DEFINITELY TALK AFTERWARDS. THAT WOULD BE SOMETHING THAT WOULD GO TO SCHEMA.ORG WITH A PUBLIC CALL FOR COMMENTS AND STUFF LIKE THAT.>>HI. YOU SHOWED LINKING TO THE PAPER RELATED TO A DATASET. CAN ONE ALSO START FROM THE PAPER AND LINK IT THE OTHER WAY, LIKE IN BOTH DIRECTIONS? DOES THAT MAKE SENSE? CAN THEY USE THE DATASET SCHEMA IN A PUBLICATION TO REFERENCE... >>LET ME ASK A CLARIFYING QUESTION. ARE YOU ASKING WHETHER YOU, AS A PUBLISHER OF PAPERS, CAN USE SCHEMA.ORG?>>YES. USING SCHEMA.ORG FOR YOUR CITATION, THE PUBLICATION, CAN YOU THEN ADD THE DATASET UNDER THAT?>>IF YOU ARE A PUBLISHER OF SCHOLARLY WORK AND YOU KNOW THIS PAPER IS USING THE DATASET, YOU CAN HAVE THE DESCRIPTION OF THAT DATASET ON THAT PAGE, AND THE INFORMATION THAT THIS IS THE PAPER THAT IT LINKS TO. WE WOULD BE ABLE TO TAKE THAT IN, AND IF WE HAVE INFORMATION ABOUT THE DATASET SOMEWHERE ELSE, WE WOULD COMBINE THAT TOGETHER. IF WE DON’T, THEN WE WOULD SEND USERS BACK TO THE PAPER PAGE, BUT IT IS STILL BETTER THAN NOTHING AND STILL A GOOD ENTRY POINT. I THINK THAT’S A GOOD IDEA.>>IN MANY CASES WHEN YOU ARE SHARING DATA, YOU UPLOAD TO A REPOSITORY WHERE YOU MIGHT NOT HAVE CONTROL OF THE HTML CODE. IS THERE A WORKAROUND TO STILL PROVIDE THE METADATA THAT IS NECESSARY TO GET INDEXED?>>IS THERE A WORKAROUND? SO, YES. THERE ARE SOME WORKAROUNDS THAT PUT YOU IN CONTROL. WE HAVE DOCUMENTATION DESCRIBING HOW YOU CAN WRITE AN INDEXABLE DESCRIPTION OF A DATASET ON GITHUB. YOU CAN DO THAT FOR ALL OF YOUR DATASETS. IN PRINCIPLE I RECOMMEND CHOOSING REPOSITORIES THAT SUPPORT SCHEMA.ORG. IN TERMS OF METRIC, WE ARE IN CONVERSATIONS WITH THEM, TRYING TO CONVINCE THEM TO ADOPT IT AS WELL. SOMETIMES, I THINK IN TERMS OF METRIC, THEY SAY WAIT UNTIL X NOT DOES IT, AND X NOT HAS OTHER THINGS TO DO. SOME OF IT IS TRICKY. THERE ARE WORKAROUNDS. IF YOU CAN, IT IS BETTER TO UPLOAD STUFF TO THINGS THAT SUPPORT SCHEMA.ORG. THE NIH FIGSHARE.COM DOES SUPPORT SCHEMA.ORG. I SAW ANOTHER ONE. YES.>>THIS MIGHT BE A VERY SIMPLE QUESTION. THE SEARCH IS SUPER SIMPLE RIGHT NOW. WHAT ARE YOUR PRIORITIES FOR ADDING OPTIONS FOR HOW USERS CAN FILTER? AND THEN I HAD A SECOND QUESTION, ABOUT WHETHER THERE ARE ANY LIMITS TO THE REPOSITORIES THAT DATASET SEARCH COULD LINK TO, OR HOW GOOGLE IS PRIORITIZING WHICH REPOSITORIES TO INCLUDE IN THE SEARCH.>>YES. SO TO ANSWER THE FIRST QUESTION, UNFORTUNATELY I HAVE A VERY DISAPPOINTING ANSWER. THE COMPANY HAS A POLICY NOT TO TALK ABOUT UNRELEASED FEATURES. 
BUT I CAN SAY THAT WE ARE DEFINITELY VERY INTERESTED IN ADDING THIS. WHAT I WOULD LOVE TO KNOW FROM YOU IS WHAT TYPE OF FILTERS OR FACETS YOU WOULD FIND MOST USEFUL IN THIS PRODUCT. IS THERE A MICROPHONE?
>>YOU SHOULD ASK EVERYONE IN THE ROOM, BECAUSE WE EACH HAVE OUR OWN PRIORITY FILTERS.
>>I DON’T WANT TO PUT YOU ON THE SPOT, BUT THERE’S A FEEDBACK BUTTON ON THE WEBSITE WHERE YOU CAN CLICK AND TELL US EVERYTHING YOU WANT TO TELL US, INCLUDING WHAT FILTERS WE SHOULD ADD. SO WE ARE A USER-GENERATED PRODUCT. WE WOULD LOVE TO KNOW WHAT FILTERS YOU WOULD LIKE TO HAVE THERE. THE SECOND QUESTION: HOW DO WE PRIORITIZE WHICH DATA REPOSITORIES ARE INCLUDED OR NOT? THE ANSWER IS WE DO NOT DEPRIORITIZE ANY DATASET, BECAUSE IT IS UP TO THE REPOSITORY TO ADD THE METADATA. IF YOU ADD THE METADATA, WE WILL INDEX YOU. IF YOU DON’T ADD THE METADATA, WE WILL NOT INDEX YOU. IT IS UP TO THE DATA PROVIDERS TO MAKE THAT STEP. IT PUTS THEM IN CONTROL OF WHAT METADATA IS OUT THERE. WE HAVE PUSHES, LIKE: WE ARE GOING TO TALK TO ALL NXS-SPONSORED REPOSITORIES, NOW WE ARE GOING TO LOOK AT ALL OF THE ECONOMICS, AND STUFF LIKE THAT. THOSE ARE DEVELOPER RELATIONSHIPS. WE DON’T INDEX REPOSITORIES THAT DON’T WANT TO BE INDEXED.
>>YOUR FUNDING SLIDE WAS INTERESTING. YOU KNOW YOU ARE GOING TO HAVE A HUGE MESS WITH IDENTIFIERS, FUNDER NAMES, ALL OF THAT. IS THIS NIH OR NATIONAL INSTITUTE OF MENTAL HEALTH? DOES THE GRANT NAME HAVE NIH IN FRONT OF IT? ALL OF THAT STUFF.
>>I AM LOOKING FORWARD TO THIS ONE.
>>ARE WE HHS, ARE WE NIH? WHERE ARE WE IN LINE? THIS CAN BE THE U.S. VERSION.
>>I ALSO LEARNED THERE’S A WHOLE WORLD OUTSIDE OF THIS COUNTRY.
>>[LAUGHTER]
>>JUST ADDING TO THE FILTER COMMENT. I WAS PLAYING AROUND WITH THAT IMMEDIATELY. YOU HAVE THE DATA THAT SAYS IT IS FROM TCGA, FROM NCI. IF THERE WERE A FILTER FOR WHETHER IT IS ANALYZED DATA OR SOURCE DATA, IT WOULD BE VERY USEFUL.
THERE ARE A LOT OF SEARCH RESULTS THAT GO THROUGH, LIKE, LARGE DATASETS.
>>THAT WOULD REQUIRE A LOT OF SCHEMA.ORG WORK IN TERMS OF THE VOCABULARY AND THE FIELDS THAT DESCRIBE THIS PARTICULAR ONE.
>>HOW MANY PEOPLE AT GOOGLE WORK ON THIS PROJECT AND WHAT IS THE SORT OF REVENUE STRUCTURE?
>>SO, NOT SURE I CAN DISCLOSE THAT INFORMATION. I CAN TELL YOU THAT THIS IS PART OF A PROJECT CALLED DATA SCIENCE FOR SOCIAL GOOD, PART OF THE RESEARCH ORGANIZATION. THAT SHOULD GIVE HINTS ABOUT THE REVENUE STREAM OR LACK OF IT. IT IS NOT A LARGE TEAM. IT IS A SMALL TEAM. HOPEFULLY THE PRODUCT IS GOOD, AND IF IT ISN’T, PLEASE LET ME KNOW HOW WE CAN IMPROVE IT.
>>THERE’S ONE MORE QUESTION FROM THE TWITTERS. IS THERE ANY LINKING BETWEEN DATASETS?
>>LINKING BETWEEN DATASETS. DERIVED DATASETS.
>>WE DO SUPPORT PROVIDING INFORMATION ABOUT REPLICAS AND DATASETS THAT ARE PART OF ANOTHER DATASET, AND RIGHT NOW WE ARE HEAVILY RELYING ON REPLICA INFORMATION. THERE ARE WAYS TO DO IT. WE ALREADY INCORPORATE SOME OF THEM AND WE HOPE WE CAN IMPROVE THAT IN THE FUTURE.
>>ONE MORE QUESTION ABOUT HOW YOU ORDER THE SEARCH RESULTS. WHEN SOMEONE SEARCHES FOR A SPECIFIC DATASET, IS IT SIMILAR TO HOW REGULAR GOOGLE WORKS OR IS THERE SPECIFIC RANKING?
>>IT IS SIMILAR TO HOW REGULAR GOOGLE WORKS. WE USE SIMILAR APIS BUILT INTERNALLY FOR RANKING. THERE ARE SMALL TWEAKS WE HAVE TO DO TO MAKE SURE WE GET DIVERSE ENOUGH RESULTS. IT IS A VERY COMPLEX MECHANISM THAT INCORPORATES DECADES OF EXPERIENCE THAT GOOGLE HAS. I DON’T KNOW ALL OF THE DETAILS OF IT MYSELF. IF THERE ARE SOME IMPROVEMENTS OR SHORTCOMINGS OF RANKING IN DATASET SEARCH, LET ME KNOW AND I WILL MAKE SURE IT IS NOT HAPPENING.
>>SINCE YOU ARE ASKING FOR A LOT OF COMMITMENT FROM OTHER PEOPLE TO SUPPORT THIS, I GUESS, LONG-TERM, WITH THE REVENUE STREAM, HOW CAN WE BE CONVINCED OF GOOGLE’S SUPPORT IN THE LONG TERM? IT SEEMS LIKE GOOGLE HAS A LOT OF PRODUCTS THEY TRY FOR A LITTLE BIT. THEY VANISH.
ARE YOU CREATING THIS INFRASTRUCTURE THAT YOU THINK WOULD SURVIVE WITHOUT GOOGLE, OR ARE YOU ASKING PEOPLE TO INVEST IN A TRIAL BALLOON FOR THE COMPANY?
>>IT IS A GREAT QUESTION. GOOGLE IS NOT THE ONLY SEARCH ENGINE THAT IS PICKING UP SCHEMA.ORG. BING PICKS IT UP AS WELL. NOT PARTICULARLY SCHEMA.ORG DATASETS, BUT THEY COULD IF THEY WANTED TO IN THE FUTURE. ANYONE CAN BUILD A TOOL THAT WOULD CRAWL THIS METADATA IN THE FUTURE. YOU ASK WHAT WOULD HAPPEN IF THE PRODUCT IS DISCONTINUED. THIS EFFORT IN STANDARDIZING THE WAY DATA REPOSITORIES [INDISCERNIBLE] THE DATA WOULD [INDISCERNIBLE]. THEY WOULD PICK IT UP AND HAVE THE CURATED ECOSYSTEM OF EVERYONE USING THE SAME STANDARD. HOPEFULLY THIS ANSWERS YOUR QUESTION. THERE WERE CERTAIN PARTS DISCONTINUED. I HOPE IT FULFILLS AN IMPORTANT ROLE. IF IT DOESN’T, MAYBE IT SHOULDN’T BE THERE. IF WE CAN WORK TOGETHER TO MAKE IT BETTER, I HOPE THAT IT WILL BE THERE FOR MANY YEARS TO COME, AS IT HAS BEEN FOR GOOGLE SCHOLAR, FOR EXAMPLE.
>>ALL RIGHT, I THINK WE SHOULD WRAP UP THERE.
LET’S THANK CHRIS AGAIN.
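
Editorial sketch: several answers in the Q&A above refer to schema.org annotations (a dataset description with a size limit, a link from a dataset to its paper, funder information, and "part of" or replica relations). The Python snippet below is a minimal illustration of what such a JSON-LD record could look like; it is not taken from the talk. The property names (`name`, `description`, `url`, `citation`, `funder`, `isPartOf`, `sameAs`) are standard schema.org vocabulary, but the helper names, the example values, and the use of 5,000 characters as a hard cutoff (the figure the speaker gave with a "don't quote me") are assumptions made for illustration.

```python
import json

# Assumption: the talk mentions roughly 5,000 characters as the description limit.
MAX_DESCRIPTION_CHARS = 5000


def build_dataset_jsonld(name, description, url, paper_url=None,
                         funder_name=None, part_of_url=None, same_as=None):
    """Build a schema.org Dataset record as a plain dict (hypothetical helper)."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        # Truncate so the description stays within the assumed limit.
        "description": description[:MAX_DESCRIPTION_CHARS],
        "url": url,
    }
    if paper_url:
        # Link from the dataset to the paper that describes or uses it.
        record["citation"] = paper_url
    if funder_name:
        record["funder"] = {"@type": "Organization", "name": funder_name}
    if part_of_url:
        # This dataset is a part of a larger dataset.
        record["isPartOf"] = part_of_url
    if same_as:
        # Another URL where the same dataset is published (a replica).
        record["sameAs"] = same_as
    return record


def to_script_tag(record):
    """Wrap the record in the script element used to embed JSON-LD in a page."""
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    example = build_dataset_jsonld(
        name="Example imaging dataset",          # hypothetical values throughout
        description="Resting-state scans from an example study.",
        url="https://example.org/datasets/1",
        paper_url="https://example.org/papers/1",
        funder_name="Example Funding Agency",
    )
    print(to_script_tag(example))
```

A repository (or a publisher's paper page, as discussed in the Q&A) would place the resulting script element in the HTML of the page that describes the dataset.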
