Well, there goes another day filled with confusion. I spent most of it getting this code running on my laptop. It did run eventually. I am not sure how it works, though. At least I now know that Mel-frequency cepstral coefficients (MFCCs), Connectionist Temporal Classification (CTC), and many other complicated things are involved in getting an audio sample recognized.
I will probably start from scratch, in terms of coding. Start again from a simple example, perhaps.