Thanks for your interest in Intel Threading Building Blocks. Regarding your questions:
1. Since a prime philosophy driving TBB is to maximize cache reuse, early allocation and stable memory use are important design goals. Run-time memory allocation is minimized but not forbidden. For example, on its first allocation the scalable allocator carves off a big chunk of memory, which it divides into separate pools for per-thread allocation, so you’d see some heap activity right at the beginning. Likewise, data structures like concurrent_hash_map do an initial allocation sufficient to handle a nominal group of entries but use binary growth should more space be needed. The threads used to handle TBB tasks are also allocated to a pool initially and reused, minimizing allocation thrash. If your program stabilizes in its resource use, heap allocations should also be fairly stable.
2. The latest release of Intel TBB provides a means to set the thread stack size, so you can keep allocating larger buffers on the stack as in your current practice. It sounds, though, like you’d really like to have some Thread Local Storage allocated for each of the pool threads. One of the new Intel TBB 2.1 features, the task_scheduler_observer, provides the hooks you need to set up a per-thread storage area for the TBB worker threads. I explain how to do it in my under the hood blog series.
3. The parallel_for doesn’t actually create any threads. Those are created and pooled when you create the task_scheduler_init object and stay around as long as that object exists. These are native threads, spawned on whichever Intel TBB supported OS your program happens to be running. The parallel_for submits a task to the TBB scheduler, which, for a parallel_for over a blocked_range, can split that task into a bunch of smaller ones and assign pool threads to execute various subranges of the original range. But just because you can doesn’t necessarily mean you should. Intel TBB-enabled programs run most efficiently when you let the unfair scheduler maximize parallelism and minimize cache thrashing by avoiding synchronization as much as is feasible for the algorithm.
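To make point 3 a little more concrete, here is a rough sketch of the idea using only the C++ standard library, not TBB itself. ToyPool, ToyRange and toy_parallel_for are hypothetical names invented for this illustration: ToyPool stands in for the threads owned by task_scheduler_init, ToyRange for blocked_range, and toy_parallel_for for parallel_for. The real scheduler is more sophisticated than the single shared queue assumed here, so treat this only as a picture of "threads are created once, ranges are split into tasks".

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for blocked_range: half-open [begin, end) plus a grain size.
struct ToyRange {
    std::size_t begin, end, grain;
    bool divisible() const { return end - begin > grain && end - begin > 1; }
};

// Hypothetical stand-in for the pool behind task_scheduler_init:
// threads are created once in the constructor and reused for every task.
class ToyPool {
public:
    explicit ToyPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ToyPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); ++pending_; tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
    // Block until every submitted task, including tasks spawned by tasks, has finished.
    void wait() {
        std::unique_lock<std::mutex> lk(m_);
        idle_.wait(lk, [this] { return pending_ == 0; });
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // shutting down and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
            std::lock_guard<std::mutex> lk(m_);
            if (--pending_ == 0) idle_.notify_all();
        }
    }
    std::mutex m_;
    std::condition_variable cv_, idle_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    std::size_t pending_ = 0;
    bool done_ = false;
};

using ToyBody = std::function<void(const ToyRange&)>;

// Halve the range until it is no longer divisible, then run the body on the leaf.
void split_and_run(ToyPool& pool, ToyRange r, const ToyBody& body) {
    if (!r.divisible()) { body(r); return; }
    std::size_t mid = r.begin + (r.end - r.begin) / 2;
    pool.submit([&pool, r, mid, &body] { split_and_run(pool, ToyRange{r.begin, mid, r.grain}, body); });
    pool.submit([&pool, r, mid, &body] { split_and_run(pool, ToyRange{mid, r.end, r.grain}, body); });
}

// Creates no threads: it only hands tasks to the existing pool and waits for them.
void toy_parallel_for(ToyPool& pool, ToyRange r, const ToyBody& body) {
    split_and_run(pool, r, body);
    pool.wait();
}
```

With this sketch you would construct one ToyPool up front and call toy_parallel_for as often as you like; each call reuses the same threads, and the body only ever sees a subrange no larger than the grain size (or a single element).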