Song-Chun Zhu

Song-Chun Zhu
朱松纯
Born June 1968 (age 56)
Ezhou, Hubei, China
Alma mater University of Science and Technology of China (BS)
Harvard University (MS, PhD)
Occupation(s) Computer scientist, applied mathematician
Children Zhu Yi
Awards Helmholtz Test-of-Time Award
IEEE Fellow
David Marr Prize
Scientific career
Fields Computer science
Applied mathematics
Institutions Peking University
University of California, Los Angeles
Thesis Statistical and Computational Theories for Image Segmentation, Texture Modeling and Object Recognition  (1996)
Doctoral advisor David Mumford
Website www.stat.ucla.edu/~sczhu

Song-Chun Zhu (Chinese: 朱松纯; born June 1968) is a Chinese computer scientist and applied mathematician known for his work in computer vision, cognitive artificial intelligence, and robotics. Zhu currently works at Peking University and was previously a professor in the Departments of Statistics and Computer Science at the University of California, Los Angeles. [1] Zhu also previously served as Director of the UCLA Center for Vision, Cognition, Learning and Autonomy (VCLA). [2]


In 2005, Zhu founded the Lotus Hill Institute, an independent non-profit organization to promote international collaboration within the fields of computer vision and pattern recognition. [3] Zhu has published extensively and lectured globally on artificial intelligence, and in 2011, he became an IEEE Fellow (Institute of Electrical and Electronics Engineers) for "contributions to statistical modeling, learning and inference in computer vision." [4]

Zhu has two daughters, Stephanie and Yi. [5] Zhu Yi (Chinese: 朱易) is a competitive figure skater. [6]

Early life and education

Born and raised in Ezhou, China, Zhu was inspired as a child by the development of chess-playing computers, which sparked his interest in artificial intelligence. In 1991, Zhu earned his B.S. in Computer Science from the University of Science and Technology of China at Hefei. During his undergraduate years, Zhu found the computational theory of vision of the late MIT neuroscientist David Marr deeply influential and aspired to pursue a general unified theory of vision and AI. [7] In 1992, Zhu continued his study of computer vision at the Harvard Graduate School of Arts and Sciences. At Harvard, Zhu studied under the supervision of American mathematician David Mumford and was introduced to "probably approximately correct" (PAC) learning by Leslie Valiant. Zhu completed his Ph.D. in Computer Science at Harvard in 1996 and followed Mumford to the Division of Applied Mathematics at Brown University as a postdoctoral fellow. [3]

Career

Following his postdoctoral fellowship, Zhu lectured briefly in Stanford University's Computer Science Department. In 1998, he joined Ohio State University as an assistant professor in the Departments of Computer Science and Cognitive Science. In 2002, Zhu joined the University of California, Los Angeles in the Departments of Computer Science and Statistics as associate professor, rising to the rank of full professor in 2006. At UCLA, Zhu established the Center for Vision, Cognition, Learning and Autonomy. His chief research interest has resided in pursuing a unified statistical and computational framework for vision and intelligence, which includes the Spatial, Temporal, and Causal And-Or graph (STC-AOG) as a unified representation and numerous Monte Carlo methods for inference and learning. [8] [9]

In 2005, Zhu established an independent non-profit organization in his hometown of Ezhou, the Lotus Hill Institute (LHI). LHI collects large-scale image datasets and annotates the objects, scenes, and activities they contain, and has received contributions from many renowned scholars, including Harry Shum. The institute also maintains a full-time annotation team for parsing image structures and has amassed over 500,000 images to date.[citation needed]

Since establishing LHI, Zhu has organized numerous workshops and conferences, along with serving as the general chair for both the 2012 Conference on Computer Vision and Pattern Recognition (CVPR) in Providence, Rhode Island, where he presented Ulf Grenander with a Pioneer Medal, and the 2019 CVPR held in Long Beach, California. [10]

In July 2017, Zhu founded DMAI in Los Angeles as an AI startup engaged in developing a unified cognitive AI platform. [11]

In September 2020, Zhu returned to China to join Peking University and lead its Institute for Artificial Intelligence. There he joined Harry Shum, a long-time acquaintance and Microsoft's former head of artificial intelligence and research, who had been appointed by Peking University in August to chair the institute's academic committee. [12]

Zhu is also setting up a separate AI research institute, the Beijing Institute for General Artificial Intelligence (BIGAI). According to its introduction, BIGAI is based on the "small data for big task" paradigm and focuses on advanced AI technology, multi-disciplinary integration, and international academic exchange, aiming to nurture a new generation of young AI talent. [12] The institute is expected to gather researchers, scholars, and experts to put Zhu's theoretical framework of artificial intelligence into practice, jointly promoting original Chinese AI technologies and building a new generation of general AI platforms.[citation needed]

Research and work

Zhu has published over three hundred articles in peer-reviewed journals and proceedings. His research has proceeded in the following four phases:

Pioneering statistical models to formulate concepts in Marr’s framework

In the early 1990s, Zhu and collaborators in the pattern theory group developed advanced statistical models for computer vision. Focusing on a unifying statistical framework for the early vision representations presented in David Marr's posthumously published book Vision, they first formulated textures with a new Markov random field model, called FRAME, using a minimax entropy principle to connect discoveries in neuroscience and psychophysics with Gibbs distributions in statistical physics. [13] They then proved the equivalence between the FRAME model and the micro-canonical ensemble, [14] which they named the Julesz ensemble. This work received the Marr Prize honorary nomination at the International Conference on Computer Vision (ICCV) in 1999. [15]
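
The core mechanics of FRAME can be illustrated with a toy sketch: a maximum-entropy model over a 1-D signal whose potentials (Lagrange multipliers) are adjusted so that Gibbs samples reproduce the observed marginal histogram of filter responses. Everything here is an illustrative assumption, not the original formulation: a single nearest-neighbour difference filter stands in for a bank of multi-scale 2-D filters, and the "image" is a short 1-D signal.

```python
import numpy as np

rng = np.random.default_rng(0)
G, N = 8, 64                        # gray levels, 1-D "image" length
EDGES = np.arange(-G + 0.5, G)      # bin edges covering responses -7..7

def hist(x):
    """Marginal histogram of nearest-neighbour difference responses."""
    h, _ = np.histogram(np.roll(x, -1) - x, bins=EDGES)
    return h / h.sum()

def gibbs_sweep(x, lam):
    """One Gibbs sweep of the model p(x) proportional to exp(-lam . H(x))."""
    for i in range(N):
        e = np.empty(G)
        for g in range(G):          # energy of each candidate gray level
            x[i] = g
            e[g] = lam @ hist(x)
        p = np.exp(-(e - e.min()))
        x[i] = rng.choice(G, p=p / p.sum())
    return x

# observed "texture": a slowly varying signal, so small filter responses
obs = ((np.sin(np.linspace(0, 4 * np.pi, N)) + 1) * (G - 1) / 2).astype(int)
h_obs = hist(obs)

lam = np.zeros(len(EDGES) - 1)      # potentials (Lagrange multipliers)
x = rng.integers(0, G, N)
d_init = np.abs(hist(x) - h_obs).sum()
for _ in range(40):
    x = gibbs_sweep(x, lam)
    # maximum-likelihood ascent: raise energy on over-represented bins
    lam += 2.0 * (hist(x) - h_obs)
d_final = np.abs(hist(x) - h_obs).sum()
```

With these updates the model's sample histogram is pulled toward the observed one, which is the minimax-entropy learning loop in miniature.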

During the 1990s, Zhu developed two new classes of nonlinear partial differential equations (PDEs). One class, called region competition, addresses image segmentation; [16] this work connecting PDEs to statistical image models received the Helmholtz Test-of-Time Award at ICCV 2013. The other class, called GRADE (Gibbs Reaction And Diffusion Equations), was published in 1997 and employs Langevin dynamics for inference and learning via stochastic gradient descent (SGD). [17]
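
The Langevin approach underlying GRADE can be sketched in miniature: the unadjusted Langevin update x ← x − (ε/2)∇U(x) + √ε·ξ draws approximate samples from p(x) ∝ exp(−U(x)). The 1-D Gaussian target below is a hypothetical stand-in for the image-level Gibbs energies used in the actual work.

```python
import numpy as np

def langevin_sample(grad_U, x0, eps=0.1, n_steps=20000, seed=0):
    """Unadjusted Langevin dynamics for p(x) proportional to exp(-U(x)):
    x' = x - (eps/2) * grad_U(x) + sqrt(eps) * standard normal noise."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x = x - 0.5 * eps * grad_U(x) + np.sqrt(eps) * rng.standard_normal()
        samples[t] = x
    return samples

# target: standard normal, U(x) = x^2 / 2, so grad_U(x) = x
samples = langevin_sample(lambda x: x, x0=3.0)
burn = samples[2000:]   # discard burn-in before estimating moments
```

After burn-in the chain's mean and variance approximate those of the target, up to a small discretization bias controlled by the step size ε.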

In the early 2000s, Zhu formulated textons [18] using generative models with sparse coding theory and integrated the texture and texton models to represent the primal sketch. [19] With Ying Nian Wu, Zhu studied perceptual transitions between regimes of models under information scaling and proposed a perceptual scale space theory to extend the image scale space. [20]
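
The sparse coding idea behind the texton work can be illustrated with a generic matching-pursuit sketch (an illustrative stand-in, not the paper's actual algorithm): a signal is greedily decomposed over a dictionary of atoms, and the few atoms with large coefficients play the role of "textons". The dictionary, sizes, and atom indices below are all toy assumptions.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy sparse coding: repeatedly pick the dictionary atom most
    correlated with the residual and subtract its projection."""
    residual = signal.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        corr = dictionary.T @ residual
        k = int(np.argmax(np.abs(corr)))
        coeffs[k] += corr[k]
        residual -= corr[k] * dictionary[:, k]
    return coeffs, residual

# toy dictionary of unit-norm random atoms; the signal is a sparse
# combination of atoms 3 and 7, which the pursuit should recover
rng = np.random.default_rng(1)
D = rng.standard_normal((32, 10))
D /= np.linalg.norm(D, axis=0)
x = 2.0 * D[:, 3] + 1.0 * D[:, 7]
coeffs, residual = matching_pursuit(x, D, n_atoms=5)
```

A few greedy steps shrink the residual sharply and concentrate the coefficients on the generating atoms, which is the sense in which a sparse code summarizes a signal.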

Expanding Fu's grammar paradigm by stochastic and-or graph

From 1999 to 2002, with his Ph.D. student Zhuowen Tu, Zhu developed a data-driven Markov chain Monte Carlo (DDMCMC) paradigm [21] to traverse the entire state space, extending the jump-diffusion work of Grenander and Miller. With another Ph.D. student, Adrian Barbu, he generalized the Swendsen-Wang cluster sampling algorithm in physics from Ising/Potts models to arbitrary probabilities. This made split-merge operators reversible for the first time in the literature and achieved 100-fold speedups over the Gibbs sampler and jump-diffusion. The work led to the image parsing paper [22] that won the Marr Prize at ICCV 2003. [15]
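
The flavor of cluster sampling can be shown on the classic case the generalization starts from: a Swendsen-Wang update for a 2-D Ising model, which bonds equal-spin neighbours probabilistically and then flips whole clusters at once instead of single sites. This is a minimal sketch of the physics algorithm, not Zhu and Barbu's generalized version; lattice size and temperature are illustrative.

```python
import numpy as np

def swendsen_wang_sweep(spins, beta, rng):
    """One Swendsen-Wang update: bond equal-spin neighbours with
    probability 1 - exp(-2*beta), then flip each cluster with prob 1/2."""
    n = spins.shape[0]
    parent = np.arange(n * n)       # union-find forest over sites

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    p_bond = 1.0 - np.exp(-2.0 * beta)
    for r in range(n):
        for c in range(n):
            i = r * n + c
            for r2, c2 in ((r, (c + 1) % n), ((r + 1) % n, c)):
                if spins[r, c] == spins[r2, c2] and rng.random() < p_bond:
                    ri, rj = find(i), find(r2 * n + c2)
                    if ri != rj:
                        parent[ri] = rj

    flip = {}                       # one coin flip per cluster
    for i in range(n * n):
        root = find(i)
        if root not in flip:
            flip[root] = rng.random() < 0.5
        if flip[root]:
            spins[i // n, i % n] *= -1
    return spins

rng = np.random.default_rng(0)
n = 8
spins = rng.choice([-1, 1], size=(n, n))
beta = 0.6                          # ordered (low-temperature) phase
mags = []
for sweep in range(200):
    spins = swendsen_wang_sweep(spins, beta, rng)
    if sweep >= 100:
        mags.append(abs(spins.mean()))
m_abs = float(np.mean(mags))
```

Because entire clusters move together, the chain decorrelates far faster than single-site Gibbs updates near and below the critical temperature, which is the source of the speedups described above.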

In 2004, Zhu moved to high-level vision by studying stochastic grammar. The grammar method dates back to the syntactic pattern recognition approach advocated by King-Sun Fu in the 1970s. Zhu developed grammatical models for several key vision problems, such as face modeling, face aging, clothes, object detection, rectangular structure parsing, and the like. He wrote a monograph with Mumford in 2006 titled A Stochastic Grammar of Images. [23] In 2007, Zhu and co-authors received a Marr Prize nomination. The following year, Zhu received the J.K. Aggarwal Prize from the International Association for Pattern Recognition for "contributions to a unified foundation for visual pattern conceptualization, modeling, learning, and inference." [24]

Zhu has extended the and-or graph models to the spatial, temporal, and causal and-or graph (STC-AOG) to express the compositional structures as a unified representation for objects, scenes, actions, events, and causal effects in physical and social scene understanding problems.
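
The compositional idea of an and-or graph can be sketched as a small data structure: And-nodes compose all of their children, Or-nodes choose exactly one, and terminals are primitives. The node names below ("scene", "table", "seat", etc.) are a hypothetical toy grammar, not an example from Zhu's papers, and the sketch covers only the compositional skeleton, not the spatial, temporal, or causal relations of the full STC-AOG.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str                  # "and", "or", or "terminal"
    label: str
    children: List["Node"] = field(default_factory=list)

def configurations(node):
    """Enumerate every parse configuration (list of terminal labels) a node
    can generate: And-nodes compose all children, Or-nodes pick one child."""
    if node.kind == "terminal":
        return [[node.label]]
    if node.kind == "or":
        return [cfg for child in node.children for cfg in configurations(child)]
    # and-node: cartesian composition of the children's configurations
    result = [[]]
    for child in node.children:
        result = [cfg + c for cfg in result for c in configurations(child)]
    return result

# toy grammar: a scene is a table AND a seat; a seat is a chair OR a stool
chair = Node("terminal", "chair")
stool = Node("terminal", "stool")
table = Node("terminal", "table")
seat = Node("or", "seat", [chair, stool])
scene = Node("and", "scene", [table, seat])

cfgs = configurations(scene)   # two configurations: table+chair, table+stool
```

Each configuration corresponds to one "parse graph" of the grammar; in the full framework these choices are scored probabilistically rather than merely enumerated.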

Exploring the "dark matter of AI" cognition and visual commonsense

Since 2010, Zhu has collaborated with scholars in cognitive science, AI, robotics, and language to explore what he calls the "Dark Matter of AI": the 95% of intelligent processing that is not directly detectable in sensory input.

Together they have augmented the image parsing and scene understanding problem with cognitive modeling and reasoning about the following aspects: functionality (the functions of objects and scenes, and the use of tools), intuitive physics (supporting relations, materials, stability, and risk), intention and attention (what people know, think, and intend to do in a social scene), causality (how actions change object fluents), and utility (the common values driving human activities in video). [25] [26] [27] The results have been disseminated through a series of workshops. [28]

Zhu has explored numerous other topics during this period, including: formulating AI concepts such as tools, containers, and liquids; integrating three-dimensional scene parsing and reconstruction from single images by reasoning about functionality and physical stability; situated dialogue through joint video and text parsing; developing communicative learning; and mapping the energy landscape of non-convex learning problems. [29]

Pursuing a "small-data for big task" paradigm for general AI

In a widely circulated public article written in Chinese in 2017, Zhu referred to popular data-driven deep learning research as a "big data for small task" paradigm that trains a neural network for each specific task with massive annotated data, resulting in uninterpretable models and narrow AI. Zhu, instead, advocated for a "small data for big task" paradigm to achieve general AI. [30]

At the 2023 meeting of the Chinese People's Political Consultative Conference's National Committee, Zhu said that, in the wake of ChatGPT's release, China should make artificial general intelligence a strategic goal, analogous to the pursuit of nuclear, missile, and satellite technology by the Two Bombs, One Satellite project of the 1960s. [31]

In February 2024, the Beijing Institute for General Artificial Intelligence (BIGAI), operating under Zhu's leadership, unveiled what it described as the world's first AI child, "Tong Tong", who possesses her own emotions and intellect and can independently assign tasks to herself, demonstrating a level of autonomy previously unseen in virtual entities. [32]

Publications

Books

Papers


References

  1. "Song-Chun Zhu".
  2. "Center for Vision, Cognition, Learning and Autonomy".
  3. "Professor Song-Chun Zhu, UCLA".
  4. "Song-Chun Zhu".
  5. "Research: are we on the right way?".
  6. "US-born ice skater joins China training program - Global Times". 2018-09-28. Retrieved 2022-02-06.
  7. "ACM图灵大会上的"华山论剑":朱松纯对话沈向洋 Dialogue by Drs. Song-Chun Zhu and Harry Shum at ACM TURC 2019".
  8. "A Unified Framework for Human-Robot Knowledge Transfer".
  9. "Monte Carlo Methods (Hardback)".
  10. "A letter from the PAMI TC and CVPR 2019 organizers".
  11. "DMAI".
  12. "DMAI".
  13. Zhu, S. C., Wu, Y., & Mumford, D. (1998). FRAME: filters, random fields, and minimax entropy towards a unified theory for texture modeling. International Journal of Computer Vision, 27(2) pp.1-20.
  14. Y. N. Wu, S. C. Zhu and X. W. Liu, (2000). Equivalence of Julesz Ensemble and FRAME models International Journal of Computer Vision, 38(3), 247-265.
  15. "Computer Vision Awards".
  16. Zhu, S. C., & Yuille, A. (1996). Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9), 884–900.
  17. Zhu, S. C., & Mumford, D. (1997). Prior learning and Gibbs reaction-diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11), 1236–1250.
  18. Zhu, S.-C., Guo, C., Wang, Y., & Xu, Z. (2005). What are Textons? International Journal of Computer Vision, 62(1/2), 121–143.
  19. Guo, C. Zhu, S.-C. and Wu, Y.(2007), Primal sketch: Integrating Texture and Structure. Computer Vision and Image Understanding, vol. 106, issue 1, 5-19.
  20. Y.N. Wu, C.E. Guo, and S.C. Zhu (2008), From Information Scaling of Natural Images to Regimes of Statistical Models, Quarterly of Applied Mathematics, vol. 66, no. 1, 81-122.
  21. Tu, Z. and Zhu, S.-C. Image Segmentation by Data Driven Markov Chain Monte Carlo, IEEE Trans. on PAMI, 24(5), 657-673, 2002.
  22. Tu, Z., Chen, X., Yuille, & Zhu, S.-C. (2003). Image parsing: unifying segmentation, detection, and recognition. Proceedings Ninth IEEE International Conference on Computer Vision.
  23. Zhu, S.-C., & Mumford, D. (2006). A Stochastic Grammar of Images. Foundations and Trends in Computer Graphics and Vision, 2(4), 259–362.
  24. "J.K. Aggarwal Prize 2008 Awarded to Prof. Song-Chun Zhu".
  25. B. Zheng, Y. Zhao, J. Yu, K. Ikeuchi, and S.C. Zhu (2015), Scene Understanding by Reasoning Stability and Safety, Int'l Journal of Computer Vision, vol. 112, no. 2, pp221-238, 2015.
  26. Y. Zhu, Y.B. Zhao and S.C. Zhu (2015), Understanding Tools: Task-Oriented Object Modeling, Learning and Recognition, Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
  27. Y.X. Zhu, C. Jiang, Y. Zhao, D. Terzopoulos and S.C. Zhu (2016), Inferring Forces and Learning Human Utilities from Video, Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
  28. "Vision Meets Cognition".
  29. "Song-chun Zhu".
  30. "Some Invited Talks".
  31. "AI Proposals at 'Two Sessions': AGI as 'Two Bombs, One Satellite'?".
  32. "China creates world's first AI child which shows human emotion". Interesting Engineering.com. Retrieved 16 April 2024.