Data Mining in a Complex World

January 29, 2013

Gold mining requires a certain amount of patience: For example, you would have to sift through about 300 tons of earth and rock to come up with enough of the precious metal to make a single wedding ring. Data mining is similar. Every day, terabytes of data accumulate in the technology that society has come to rely on. But turning that chaotic mess of zeros and ones into meaningful knowledge can be a complex mathematical challenge.

Typically, researchers try to simplify this challenge by limiting the scope of their questions. But Yizhou Sun, a newly appointed assistant professor in the College of Computer and Information Science, believes that making useful predictions and inferences with new data requires us to account for its complexity.

“My philosophy is that in the real world, objects are connected together but those objects belong to different types,” she said, pointing to humans, buildings, and digital devices as examples. “Even with humans we can still identify different groups.”

Instead of looking at two-dimensional relationships in an isolated system, her approach brings together a series of complex algorithms that simultaneously address objects from multiple domains and their interactions in a much bigger, real-world environment. She has used the method to probe social networks like Flickr and Twitter for similarities and patterns.

As a graduate student at the University of Illinois at Urbana-Champaign, Sun took on the task of mining the Digital Bibliography & Library Project’s (DBLP) dataset of computer science publications. Her hope was to unearth some interesting and unexpected patterns, which she did.

She found that a researcher’s social connectedness was the most important factor for determining whom he would collaborate with in the future. She also found, thankfully, that social connections did not figure very highly in a researcher’s citations.

But perhaps most important, Sun found that her questions were always more complicated than she had expected. For instance, automatically identifying the most highly ranked authors in the DBLP dataset might require examining the ranking of the conferences they attended. But that requires automatically identifying conference ranking, which depends on the ranking of the authors in attendance.

The problem was that the data in question make up a complex, heterogeneous network wherein each piece affects every other. If Sun wanted to trust the products of her algorithm, she was going to have to understand the network it acted upon.
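
One common way to handle this kind of circular dependency, where author rank depends on conference rank and vice versa, is to start with equal scores and update both sets in turn until they settle. The sketch below is a generic mutual-reinforcement iteration, similar in spirit to HITS-style scoring; it is not taken from Sun's work. The function name, author names, conference names, and attendance data are all hypothetical.

```python
def rank_authors_and_conferences(attendance, iterations=50):
    """Jointly score authors and conferences by mutual reinforcement.

    attendance: dict mapping each author to the list of conferences
    that author attended (toy, hypothetical data).
    """
    authors = sorted(attendance)
    conferences = sorted({c for confs in attendance.values() for c in confs})

    # Start with no opinion: every author gets the same score.
    author_score = {a: 1.0 / len(authors) for a in authors}
    conf_score = {c: 1.0 / len(conferences) for c in conferences}

    for _ in range(iterations):
        # A conference's score is the sum of its attendees' scores.
        raw_conf = {c: 0.0 for c in conferences}
        for a, confs in attendance.items():
            for c in confs:
                raw_conf[c] += author_score[a]
        total = sum(raw_conf.values())
        conf_score = {c: s / total for c, s in raw_conf.items()}

        # An author's score is the sum of the scores of conferences attended.
        raw_author = {a: sum(conf_score[c] for c in attendance[a]) for a in authors}
        total = sum(raw_author.values())
        author_score = {a: s / total for a, s in raw_author.items()}

    return author_score, conf_score


# Hypothetical toy network: three authors, three conferences.
attendance = {
    "alice": ["ConfA", "ConfB", "ConfC"],
    "bob": ["ConfA", "ConfB"],
    "carol": ["ConfC"],
}
author_score, conf_score = rank_authors_and_conferences(attendance)
for name in sorted(author_score, key=author_score.get, reverse=True):
    print(name, round(author_score[name], 3))
```

Neither ranking is computed first. Each pass refines one set of scores using the other, which is why the structure of the whole network shapes the result.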

Sun made it her life’s work to understand and then design strategies for examining heterogeneous networks. Last year, she published the seminal book on the matter, Mining Heterogeneous Information Networks: Principles and Methodologies.

The implications of Sun’s work are vast. In order to take advantage of the terabytes of data now describing our world, we must understand the complex networks of which they are a part. “In the real world, there are so many different types of objects that interact with each other,” said Sun. “The real world system can be viewed as gigantic heterogeneous information network.”