We solve a pedestrian visual navigation problem with a first-person view in an urban setting via deep reinforcement learning in an end-to-end manner. The major challenges lie in severe partial observability and sparse positive experiences of reaching the goal. To address partial observability, we propose a novel 3D-temporal convolutional network to encode sequential historical visual observations, its effectiveness is verified by comparing to a commonly-used Frame-Stacking approach. For sparse positive samples, we propose an improved automatic curriculum learning algorithm NavACL + , which proposes meaningful curricula starting from easy tasks and gradually generalizing to challenging ones. NavACL + is shown to facilitate the learning process with 21% earlier convergence, to improve the task success rate on difficult tasks by 40% compared to the original NavACL algorithm [1] and to offer enhanced generalization to different initial poses compared to training from a fixed initial pose.