Catalogue search • Linguistik portal • Fachinformationsdienst (FID)

1	Recovery of the 3D Virtual Human: Monocular Estimation of 3D Shape and Pose with Data Driven Priors
	Dibra, Endri. - : ETH Zurich, 2018
	Abstract: The virtual world is increasingly merging with the real one. Consequently, a proper human representation in the virtual world is becoming more important as well. Despite recent technological advances in making the virtual human presence more realistic, we are still far from having a fully immersive experience in the virtual world, in part due to the lack of proper capturing and modeling of a virtual double. Thus, new methods and techniques are needed to obtain and recover a realistic virtual doppelgänger. This thesis aims to make virtual human representation accessible to every person by showcasing how it can be obtained under inexpensive, minimalistic sensor requirements. Potential fields of application of the findings include the estimation of body shape from selfies, health monitoring and garment fitting. In this thesis we investigate the problem of reconstructing the 3D virtual human from monocular imagery, mainly coming from an RGB sensor. Instead of focusing on the full avatar at once, we separately consider three constituent parts of it: the naked body, clothing and the human hand. The main focus is on the estimation of 3D shape and pose from 2D images, e.g. taken with a smartphone, making use of data-driven priors in order to alleviate this ill-posed problem. We utilize discriminative methods, with a focus on CNNs, and leverage existing and new realistically rendered synthetic datasets to learn important statistics. In this way, our presented data-driven methods can generalize well and provide accurate reconstructions on unseen real input data. Our research is not only based on single views and annotated ground-truth data for supervised learning, but also shows how to utilize multiple views simultaneously, or leverage them during training, in order to boost the performance achieved from a single view at inference time.
In addition, we demonstrate how to train and refine networks in an unsupervised manner with unlabeled real data, by integrating lightweight differentiable renderers into CNNs. In the first part of the thesis, we aim to estimate the intrinsic body shape, regardless of the adopted pose. Under assumptions of uniform background colours and poses with minimal self-occlusion, we show three different approaches for estimating the body shape: first, by basing our estimation on handcrafted features in combination with CCA and random forest regressors; second, by basing it on simple standard CNNs; and third, by basing it on more involved CNNs with generative and cross-modal components. We show robustness to pose changes and silhouette noise, as well as state-of-the-art performance on existing datasets, also outperforming optimization-based approaches. The second part of the thesis tackles the estimation of garment shape from one or two images. Two possible estimations of the garment shape are provided: one that is deformed from a template garment (i.e. from a t-shirt or a dress) and a second one that is deformed from the underlying body. Our analysis includes empirical evidence that shows the advantages and disadvantages of each estimation method. We adopt lightweight CNNs in combination with a new realistically rendered garment dataset, synthesized under physically correct dynamic assumptions, in order to tackle the very difficult problem of estimating 3D shape from an image. Training purely on synthetic data, we are the first to show that garment shape estimation from real images is possible through CNNs. The last and concluding part of the thesis focuses on the problem of inferring a 3D hand pose from an RGB or depth image. To this end, our proposal is an end-to-end CNN system that leverages data from our newly proposed realistically rendered hand dataset, consisting of 3 million samples of hands in various poses, orientations, textures and illuminations.
Utilizing this dataset in a supervised training setting helped us not only with pose inference tasks, but also with hand segmentation. We additionally introduce network components based on differentiable renderers that enabled us to train and refine our networks with unlabeled real images in an unsupervised fashion, showing clear improvements. We demonstrate on-par and improved performance over state-of-the-art methods for two input modalities, on tasks ranging from 3D pose estimation to gesture recognition.
	Keyword: computer science; Data processing; info:eu-repo/classification/ddc/4
	URL: https://doi.org/10.3929/ethz-b-000266852 https://hdl.handle.net/20.500.11850/266852
	BASE

2	A multimedia framework for effective language training
	Voegeli, Christian; Gross, Markus H.
	In: Technical Report / ETH Zurich, Department of Computer Science, 570 (2011)
	BASE

3	A multimedia framework for effective language training ...
	Gross, Markus H.; Voegeli, Christian. - : ETH Zurich, 2011
	BASE
