CHAPTER 17
Data Traceability
Reid Draper

Your software consistently provides impressive music recommendations by combining cultural and audio data. Customers are happy. However, things aren't always perfect. Sometimes that Beyoncé track is attributed to Beyonce. The artist for the Béla Fleck solo album shows up as Béla Fleck and the Flecktones. Worse, the ボリス biography has the artist name listed as ???. Where did things go wrong? Did one of your customers provide you with data in an incorrect character encoding? Did one of the web crawlers have a bug? Perhaps the name resolution code was incorrectly combining a solo artist with his band?

How do we solve this problem? We'd like to be able to trace data back to its origin, following each transformation. This is reified as data provenance. In this chapter, we'll explore ways of keeping track of the source of our data, techniques for backing out bad data, and the business value of adopting this ability.

Why?

The ability to trace a datum back to its origin is important for several reasons. It helps us back out or reprocess bad data, and conversely, it allows us to reward and boost good data sources and processing techniques. Furthermore, local privacy laws can mandate things like auditability, data transfer restrictions, and more. For example, California's Shine the Light Law requires businesses to disclose the personal information that has been shared with third parties, should a resident request it. Europe's Data Protection Directive imposes even more stringent regulation on businesses collecting data about residents.

We'll also see later how data traceability can provide further business value by allowing us to measure the worth of a particular source more rigorously, realize where to focus our development effort, and even manage blame.

Personal Experience

I previously worked on the data ingestion team at a music data company. We provided artist and song recommendations, artist biographies, news, and detailed audio analysis of digital music. We exposed those data feeds via web services and raw dumps. Behind the scenes, these feeds were composed of many sources of data, which were in turn cleaned, transformed, and put through machine-learning algorithms.

One of the first issues we ran into was learning how to trace a particular result back to its constituent parts. If a given