Currently, only enterprise MP platforms (usually with 4 CPU sockets) have an L3 cache.
If you have many L1 and L2 misses, TLB misses may matter more (or less) for performance. We can't guess which CPU model you are using. If your tool lets you select retired cache-miss events, those are the ones that count. Some events may be available separately for instructions and data, or broken down by cache line.
High data cache miss rates are possible if you didn't organize your program to access memory sequentially. You would want to do that anyway, to enable vectorization. Intel Fortran will make a few of those changes automatically at -O3.
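As a rough illustration of what "sequential memory access" means, here is a minimal C sketch (the function names and the flat matrix layout are hypothetical, chosen only for this example). Both functions compute the same sum, but the first walks memory with stride 1 while the second jumps a whole row per load. Note that C arrays are row-major; Fortran arrays are column-major, so in Fortran the rule is reversed: the leftmost subscript should vary fastest in the innermost loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: an n x m matrix stored in one contiguous buffer.
 * C is row-major, so element (i, j) lives at a[i * m + j]. */

/* Sequential access: the inner loop touches consecutive addresses
 * (stride 1), so every byte of each fetched cache line gets used, and
 * the loop is a good candidate for vectorization. */
double sum_sequential(const double *a, size_t n, size_t m)
{
    assert(a != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            s += a[i * m + j];
    return s;
}

/* Strided access: the inner loop jumps m elements per iteration, so for
 * large m nearly every load lands on a different cache line. */
double sum_strided(const double *a, size_t n, size_t m)
{
    assert(a != NULL);
    double s = 0.0;
    for (size_t j = 0; j < m; ++j)
        for (size_t i = 0; i < n; ++i)
            s += a[i * m + j];
    return s;
}
```

On a matrix much larger than the caches, a profiler would typically report far more data cache misses for sum_strided than for sum_sequential, even though the arithmetic is identical. Swapping the loop order (loop interchange) is the kind of change a compiler may make for you at high optimization levels.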
Tuning Assistant performance assessments are rough; they could be off by a factor of 2.