Utilities for Machine Learning
PTMLib is a set of utilities that I have built and used while working with Machine Learning frameworks such as Scikit-Learn and TensorFlow.
Starting with Jupyter Notebook development, I began including similar Python classes and functions at the top of most of my notebooks. Once I started doing more work in IDEs it became clear that it was time to leverage Python packaging. The result of this iterative process of dogfooding is PTMLib, which I have released on GitHub. I have found these tools simple and effective and hope others will find them useful.
"Hand Tools in Black and White" Photo by Hunter Haley on Unsplash
In summary, here is what is included in the first release:
- ptmlib.time.Stopwatch – measure the time it takes to complete a long-running task, with an audio alert for task completion
- ptmlib.cpu.CpuCount – get info on CPUs available, with options to adjust/exclude based on a specific number/percentage. Useful for setting n_jobs in Scikit-Learn tools that support multiple CPUs, such as RandomForestClassifier
- ptmlib.charts – render separate line charts for TensorFlow accuracy and loss, with corresponding validation data if available
Let’s go through each of these in detail.
ptmlib.time.Stopwatch
The Stopwatch class lets you measure the amount of time it takes to complete a long-running task. This is useful for evaluating different machine learning models.
When stop() is called, an audio prompt will alert you that the task has completed. This helps when you are time constrained and multi-tasking while your code is executing, for example if you are taking the TensorFlow Developer Certificate exam.
To put this into context, I passed this exam in December. It tests your ability to build deep learning models for tasks such as Image Classification and Natural Language Processing using TensorFlow and Keras. You have a maximum of five hours to complete the exam, which matters because much of that time goes to model training.
Google’s own exam documentation states this clearly:
“We allow 5 hours for the exam because we know that it will take some time to train the models.”
Once a specific model’s training completes, you must evaluate its performance (e.g., accuracy/loss); a model that overfits won’t work here. You may need to adjust your model layers and/or hyperparameters and try again. That means more time off the clock. Tick tock…
The ability to multi-task and work on the next model challenge while model training takes place is therefore critical. Having a tool that alerts you as soon as processing completes is very handy in this scenario.
It’s also great for your own ML projects as you experiment with different model architectures. Trial and error and model training take time, so multi-tasking is essential.
I have also found the Stopwatch useful for Scikit-Learn development, especially with complex tasks such as training Ensemble methods and Hyperparameter optimization using Random/Grid Search. Beyond getting more work done, Stopwatch will help you determine if your model selection, configuration, and optimizations are worth the actual time to execute.
Example:
Output:
Start Time: Thu Jan 28 16:57:32 2021
Epoch 1/50
1500/1500 [==============================] - 2s 1ms/step - loss: 0.5316 - accuracy: 0.8086 - val_loss: 0.4141 - val_accuracy: 0.8503
...
1500/1500 [==============================] - 2s 1ms/step - loss: 0.2337 - accuracy: 0.9101 - val_loss: 0.3212 - val_accuracy: 0.8879
End Time: Thu Jan 28 16:58:03 2021
Elapsed seconds: 30.8191 (0.51 minutes)
Start Time is printed when start() is called; End Time and the elapsed seconds/minutes are printed when stop() is called. All other output in the example above is generated by your ML framework.
Stopwatch has been tested using Scikit-Learn and TensorFlow and can be used for any long-running Python code for which you want to measure execution time performance or be notified of task completion.
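To make the start()/stop() pattern concrete, here is a minimal sketch of the idea using only the standard library. It is not PTMLib's implementation: the class name SimpleStopwatch is hypothetical, and the terminal bell is an assumed stand-in for the audio alert.

```python
import time


class SimpleStopwatch:
    """Hypothetical stand-in for illustration; not PTMLib's Stopwatch."""

    def __init__(self):
        self._started_at = None  # perf_counter() reading taken by start()

    def start(self):
        # Wall-clock time is only for display; perf_counter() does the measuring.
        print(f"Start Time: {time.ctime()}")
        self._started_at = time.perf_counter()

    def stop(self):
        if self._started_at is None:
            raise RuntimeError("stop() called before start()")
        elapsed = time.perf_counter() - self._started_at
        self._started_at = None
        print(f"End Time: {time.ctime()}")
        print(f"Elapsed seconds: {elapsed:.4f} ({elapsed / 60:.2f} minutes)")
        # Assumption: the terminal bell stands in for an audio alert.
        # Many terminals and notebooks ignore it.
        print("\a", end="")
        return elapsed


if __name__ == "__main__":
    stopwatch = SimpleStopwatch()
    stopwatch.start()
    time.sleep(0.1)  # placeholder for model.fit(...) or a grid search
    stopwatch.stop()
```

The sketch measures with time.perf_counter() rather than wall-clock time because it is monotonic, so a system clock adjustment during a long training run does not distort the result.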
Stopwatch has been tested with VS Code, PyCharm, Jupyter Notebook and Google Colab.
A default sound is provided for Google Colab, or you may specify your own:
Have I mentioned that Google Colab provides GPU acceleration?
ptmlib.cpu.CpuCount
The CpuCount class provides information on the number of CPUs available on the host machine. The exact number of logical CPUs is returned by the total_count() method.
Knowing your CPU count, you can programmatically set the number of processors used in Scikit-Learn tools that support the n_jobs parameter, such as RandomForestClassifier and model_selection.cross_validate.
In many cases (e.g., a developer desktop), you will not want to use all your available processors for a task. The adjusted_count() and adjusted_count_by_percent() methods allow you to specify the number and percentage of processors to exclude, with default exclusion values of 1 and 0.25, respectively. The defaults are reflected in the print_stats() output in the example below.
Example:
Output:
Total CPU Count: 16
Adjusted Count: 15
By Percent: 12
By 50 Percent: 8
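The arithmetic behind the numbers in this output can be sketched with the standard library alone. This is an illustration, not PTMLib's code: the function names are hypothetical, and rounding down and never returning fewer than one CPU are assumptions.

```python
import math
import os


def count_excluding(total, excluded=1):
    """Leave `excluded` CPUs free, but always return at least 1."""
    return max(1, total - excluded)


def count_excluding_percent(total, excluded_percent=0.25):
    """Leave a fraction of the CPUs free, but always return at least 1."""
    # Assumption: fractional results are rounded down.
    return max(1, math.floor(total * (1 - excluded_percent)))


if __name__ == "__main__":
    total = os.cpu_count() or 1  # os.cpu_count() can return None
    print("Total CPU Count:", total)
    print("Adjusted Count:", count_excluding(total))
    print("By Percent:", count_excluding_percent(total))
    print("By 50 Percent:", count_excluding_percent(total, 0.5))
```

With a total of 16, these helpers give 15, 12, and 8, matching the figures above. Either result can then be passed as n_jobs to a Scikit-Learn tool that supports it.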
While certain Scikit-Learn classifiers/tools benefit greatly from concurrent multi-CPU processing, TensorFlow deep learning acceleration requires a supported GPU or TPU. As far as CPUs are concerned, TensorFlow handles this automatically; there is no benefit to using CpuCount here.
ptmlib.charts.show_history_chart()
The show_history_chart() function renders separate line charts for TensorFlow training accuracy and loss, with corresponding validation data if available. This ties back to the topic of model performance evaluation I mentioned earlier.
I have refined the formatting of these charts over multiple projects, and have found the formatting and detail provided, including options such as major and minor ticks, to be just right for analysis during model development and troubleshooting. It certainly helped me when one of my models for the TensorFlow exam was clearly not going to cut it.
The save_fig_enabled parameter lets you save a PNG image of the chart with a timestamped filename. Analyze and compare these charts to evaluate the impact of different optimizations.
For more robust experiment tracking there are tools such as TensorBoard.
Example:
Output:
TensorFlow History Accuracy Chart: accuracy-20210201-111540.png
TensorFlow History Loss Chart: loss-20210201-111545.png
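As a rough sketch of the approach described in this section, the code below pairs a training metric with its validation series when one exists, builds a timestamped filename in the pattern shown above, and plots with matplotlib. It is not PTMLib's implementation: all three function names are hypothetical, and history is assumed to be a dictionary shaped like the history attribute of the object returned by Keras model.fit().

```python
from datetime import datetime


def metric_series(history, metric):
    """Return the training series and, if present, its validation series."""
    series = {metric: history[metric]}
    val_key = f"val_{metric}"
    if val_key in history:
        series[val_key] = history[val_key]
    return series


def timestamped_filename(metric, now=None):
    """Build a name such as accuracy-20210201-111540.png."""
    now = now or datetime.now()
    return f"{metric}-{now:%Y%m%d-%H%M%S}.png"


def plot_metric(history, metric, save=False):
    """Draw one line chart for `metric`, optionally saving it as a PNG."""
    # Imported lazily so the helpers above work without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, values in metric_series(history, metric).items():
        # Epochs are numbered from 1 on the x-axis.
        ax.plot(range(1, len(values) + 1), values, label=name)
    ax.set_xlabel("epoch")
    ax.set_ylabel(metric)
    ax.minorticks_on()  # minor ticks make small per-epoch changes easier to read
    ax.grid(which="both", alpha=0.3)
    ax.legend()
    if save:
        fig.savefig(timestamped_filename(metric))
    return fig
```

Calling plot_metric once for "accuracy" and once for "loss" yields the two separate charts; a widening gap between a training line and its validation line is the usual visual sign of overfitting.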
Installation
To install ptmlib in a virtualenv or conda environment:
To install the ptmlib source code on your local machine:
PTMLib is available under an MIT License, a “short and simple permissive license”, for anyone who chooses to use it in their projects or wants to learn more about AI/ML. I’m a big believer in MIT/BSD licenses, since they make things simple for me, the developer, and you, the consumer.
GitHub Link: https://github.com/dreoporto/ptmlib
Any feedback is greatly appreciated and welcome! Please see the Contact page for details.