TensorFlow's Dataset API is used to build complex data pipelines for machine learning and deep learning models.
A pipeline built with the Dataset API is by default part of the TensorFlow graph.
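As a minimal sketch of what "part of the graph" means (assuming TensorFlow 1.x and a toy in-memory list), the transformations below are added as ordinary ops to the default graph, and get_next() simply returns a tensor:

import tensorflow as tf

# a toy in-memory dataset of five numbers
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
dataset = dataset.map(lambda x: x * 2)       # the transformation becomes graph ops
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()           # an ordinary tensor in the default graph

with tf.Session() as sess:
    print(sess.run(next_element))            # 2
    print(sess.run(next_element))            # 4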
What problems can the TensorFlow Dataset API solve?
1) CPU - GPU problem
The GPU does not prepare the data; all of the data preparation happens on the CPU. While the GPU trains on a batch the CPU sits idle, and when the GPU finishes that batch and requests the next one it has to wait for the CPU. To solve this we can run the data preparation as a background process on the CPU before the GPU finishes training on the current batch.
dataset = dataset.prefetch(4)  # always keeps 4 batches ready in the queue
2) Parallelize the data transformations
When preparing a batch of data we may need to preprocess each example. Since this preprocessing is independent for every example, we can run it on multiple cores of the CPU.
dataset = dataset.map(map_func=parse_fn, num_parallel_calls=16)
3) Caching data
All the above operations are lazy; they run only when the data is requested. If your data fits into memory, use the cache transformation to cache it in memory during the first epoch, so that subsequent epochs avoid the overhead of reading, parsing and transforming it again.
dataset = dataset.cache()
A toy pipeline combining all three transformations is sketched after this list.
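To see how the three transformations fit together, here is a minimal sketch of a toy text pipeline, assuming TensorFlow 1.x; parse_fn and file_names are hypothetical placeholders for your own preprocessing function and input files:

import tensorflow as tf

def parse_fn(line):
    # hypothetical preprocessing: split a line of text into tokens
    return tf.string_split([line]).values

# hypothetical list of text files to read
file_names = ['data_part1.txt', 'data_part2.txt']

dataset = tf.data.TextLineDataset(file_names)
dataset = dataset.map(parse_fn, num_parallel_calls=16)    # 2) preprocess examples on several CPU cores
dataset = dataset.cache()                                 # 3) keep parsed examples in memory after the first epoch
dataset = dataset.padded_batch(32, padded_shapes=[None])  # pad token sequences to the longest in the batch
dataset = dataset.prefetch(4)                             # 1) always keep 4 batches ready for the GPU

The order matters: map and cache run once per example, while prefetch is applied last so that whole prepared batches are kept ready for the GPU.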
The CNN/Daily Mail dataset contains raw files (.story). Let's see how to parse them with the Dataset API and build a pipeline.
import sys
import string
import multiprocessing
from glob import glob

import numpy as np
import tensorflow as tf

PAD_TOKEN = '[PAD]'
UNKNOWN_TOKEN = '[UNK]'
START_DECODING = '[START]'
STOP_DECODING = '[STOP]'

max_enc_steps = 400   # truncation limit for the story (example value)
max_dec_steps = 100   # truncation limit for the summary (example value)

# helper functions
def load_doc(filename):
    # open the file as read only
    file = open(filename, encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    # prepare a translation table to remove punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # strip source cnn office if it exists
        index = line.find('(CNN) -- ')
        if index > -1:
            line = line[index + len('(CNN)'):]
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [w.translate(table) for w in line]
        # remove tokens with numbers in them
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    # remove empty strings
    cleaned = [c for c in cleaned if len(c) > 0]
    return cleaned

def split_story(file):
    # tf.py_func passes the filename in as bytes, so decode it first
    filename = file.decode(sys.getdefaultencoding())
    doc = load_doc(filename)
    # find the first highlight
    index = doc.find('@highlight')
    # split into story and highlights
    story, highlights = doc[:index], doc[index:].split('@highlight')
    # strip extra white space around each highlight
    highlights = [h.strip() for h in highlights if len(h) > 0]
    story = clean_lines(story.split('\n'))
    highlights = clean_lines(highlights)
    return " ".join(story[:max_enc_steps]), " ".join(highlights[:max_dec_steps])

# The vocab file is a newline-separated text file with a single word per line.
# It should also contain the pad, unknown, start and stop tokens at the beginning.
vocab_table = tf.contrib.lookup.index_table_from_file(vocabulary_file='vocabfile.txt',
                                                      default_value=0,
                                                      delimiter='\n',
                                                      num_oov_buckets=0)

print("Number of cores on your system:", multiprocessing.cpu_count())  # don't use all the cores

# path to the cnn stories
files = glob('../cnn/stories/*')
dataset = tf.data.Dataset.from_tensor_slices(files)

# parse every .story file into (story, summary) strings
dataset = dataset.map(lambda file: tuple(tf.py_func(split_story, [file], [tf.string, tf.string])),
                      num_parallel_calls=16)

# split the strings into tokens
dataset = dataset.map(lambda story, summary: (tf.string_split([story]).values,
                                              tf.string_split([summary]).values),
                      num_parallel_calls=16)

# look up every token in the vocabulary and keep the sequence lengths
dataset = dataset.map(lambda story, summary: {"story_tokens": vocab_table.lookup(story),
                                              "summary_tokens": vocab_table.lookup(summary),
                                              "story_len": tf.size(story),
                                              "summary_len": tf.size(summary)},
                      num_parallel_calls=16)

dataset = dataset.padded_batch(4,
                               padded_shapes={"story_tokens": tf.TensorShape([None]),   # None means pad to the longest length in the batch
                                              "summary_tokens": tf.TensorShape([None]),
                                              "story_len": [],                          # no padding for scalars
                                              "summary_len": []},
                               padding_values={"story_tokens": tf.cast(1, tf.int64),
                                               "summary_tokens": tf.cast(1, tf.int64),
                                               # although the length elements are not padded, a dummy
                                               # padding value still has to be supplied for them
                                               "story_len": tf.cast(1, tf.int32),
                                               "summary_len": tf.cast(1, tf.int32)})

dataset = dataset.prefetch(4)  # prefetch 4 batches

iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

sess = tf.Session()
sess.run(tf.tables_initializer())
sess.run(iterator.initializer)
print(sess.run(next_batch))  # returns numpy arrays of a single batch
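To consume more than one batch, the usual TensorFlow 1.x pattern (sketched here, reusing the session, iterator and next_batch from above, with num_epochs as a hypothetical setting) is to re-initialize the iterator at the start of every epoch and run until it is exhausted:

num_epochs = 2  # hypothetical number of passes over the data
for epoch in range(num_epochs):
    sess.run(iterator.initializer)   # rewind the dataset at the start of each epoch
    while True:
        try:
            batch = sess.run(next_batch)
            # batch is a dict of numpy arrays:
            # batch["story_tokens"], batch["summary_tokens"], batch["story_len"], batch["summary_len"]
        except tf.errors.OutOfRangeError:
            break  # this epoch's data is exhausted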